What is Spark?
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
You can download Spark from the Apache Spark downloads page, or follow the instructions below to download and install Spark 2.2.1.
Download and install Spark 2.2.1
$ sudo apt-get update
$ sudo apt-get -y install default-jdk scala
$ su
$ wget http://www-us.apache.org/dist/spark/spark-2.2.1/spark-2.2.1-bin-hadoop2.7.tgz
$ tar -zxvf spark-*.tgz
$ mv spark-2.2.1-bin-hadoop2.7/ /usr/local/spark
$ rm spark-2.2.1-bin-hadoop2.7.tgz
$ exit
Setting up the environment for Spark
Append the line below to your ~/.bashrc so that the Spark binaries are on your PATH:

export PATH=$PATH:/usr/local/spark/bin

Then source ~/.bashrc so the change takes effect in the current shell:
$ source ~/.bashrc
Run the Spark shell
$ cd /usr/local/spark
$ spark-shell
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/02/15 10:04:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/02/15 10:04:57 WARN Utils: Your hostname, ubuntu resolves to a loopback address: 127.0.1.1; using 10.211.55.4 instead (on interface enp0s5)
18/02/15 10:04:57 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Spark context Web UI available at http://10.211.55.4:4040
Spark context available as 'sc' (master = local[*], app id = local-1518642299027).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.2.1
      /_/

Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_151)
Type in expressions to have them evaluated.
Type :help for more information.

scala>
Spark interactive shell
scala> val sakanaFile = sc.textFile("README.md")
sakanaFile: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD at textFile at <console>:24

scala> sakanaFile.count()
res0: Long = 103

scala> val linesWithSpark = sakanaFile.filter(line => line.contains("Spark"))
linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD at filter at <console>:26

scala> linesWithSpark.count()
res1: Long = 20

We can chain transformations and actions together:

scala> sakanaFile.filter(line => line.contains("Spark")).count()
res3: Long = 20

scala> linesWithSpark.collect()
res4: Array[String] = Array(# Apache Spark, Spark is a fast and general cluster computing system for Big Data. It provides, rich set of higher-level tools including Spark SQL for SQL and DataFrames,, and Spark Streaming for stream processing., You can find the latest Spark documentation, including a programming, ## Building Spark, Spark is built using [Apache Maven](http://maven.apache.org/)., To build Spark and its example programs, run:, You can build Spark using more than one thread by using the -T option with Maven, see ["Parallel builds in Maven 3"](https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3)., ["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html)., For general development tips, including info on developing Spark using an IDE,...

scala> linesWithSpark.collect.foreach(println)
# Apache Spark
Spark is a fast and general cluster computing system for Big Data. It provides
rich set of higher-level tools including Spark SQL for SQL and DataFrames,
and Spark Streaming for stream processing.
You can find the latest Spark documentation, including a programming
## Building Spark
...
...

scala>
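The filter-then-count chaining shown above is ordinary functional Scala; the RDD API deliberately mirrors the standard collection API. A minimal sketch of the same pattern on a plain local Seq, with no Spark required (the sample lines are made up for illustration, standing in for README.md contents):

```scala
// Hypothetical sample lines standing in for the contents of README.md
val lines = Seq(
  "# Apache Spark",
  "Spark is a fast and general cluster computing system for Big Data.",
  "## Building Spark",
  "For general development tips, see the docs."
)

// Same shape as the RDD version: filter, then count the matches
val linesWithSpark = lines.filter(line => line.contains("Spark"))
println(linesWithSpark.size)                      // 3

// Transformations and the terminal count can be chained the same way
println(lines.filter(_.contains("Spark")).size)   // 3
```

The key difference in real Spark is that `filter` on an RDD is a lazy transformation: nothing runs on the cluster until an action such as `count()` or `collect()` is called.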