How to get started with Hadoop Spark
|Summary||Hadoop Spark provides a highly-available service on top of a cluster of machines. Learn how to set it up and operate it.|
|Author||Canonical Web Team firstname.lastname@example.org|
About Hadoop Spark
Hadoop Spark provides a highly-available (HA) service on top of a cluster of machines, resistant to individual machine failure. It offers a flexible solution consisting of HDFS, MapReduce, and Spark that can process a wide variety of workloads.
Hadoop is designed to scale to thousands of servers, each offering local computation and storage capacity, and is able to detect and handle failures at the application layer.
In this tutorial you’ll learn how to…
- Get your Hadoop Spark cluster up and running using JAAS.
- Operate your new cluster.
- Create your first big data workload.
- Change the execution mode of Spark in your cluster.
You will need…
- An Ubuntu One account (you can set it up in the deployment process)
- A public SSH key.
- Credentials for AWS, GCE or Azure
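If you don't yet have a public SSH key, you can generate one with OpenSSH's `ssh-keygen`. This is a generic sketch rather than a Juju-specific step; the file path shown is the OpenSSH default and you can choose another.

```shell
# Generate an ed25519 key pair (skip this if you already have one).
# -N "" sets an empty passphrase; drop it to be prompted for one instead.
mkdir -p ~/.ssh
ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519 -N ""
# The public half (this file) is what you register with your cloud or Juju.
cat ~/.ssh/id_ed25519.pub
```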
Deploying with JAAS
To kick off, open Hadoop Spark and click the blue Deploy changes button in the bottom right. Then, follow the steps to deploy your bundle.
Deployment can take 30-45 minutes, as Juju creates new instances in the cloud and sets up the Hadoop Spark cluster components. Pending units are outlined in orange; units that are up and running are outlined in black.
Congratulations! You now have a Hadoop Spark cluster up and running.
Install the Juju client
You will need to have Juju installed locally to operate your cluster. Skip this step if you already have it.
Juju is available as a client on many platforms and distributions. Visit the install docs to get the latest version of Juju on macOS, Windows or CentOS.
If you are running Ubuntu, you can install Juju through the following steps:
First, install snapd if you don’t have it already.
$ sudo apt install snapd
Install Juju to get the command line client.
$ sudo snap install juju --classic
Verify you can run Juju. You will see a summary and a list of common commands.
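As a quick check, you can confirm the client is on your PATH and report its version (running `juju` with no arguments prints the summary and command list mentioned above). The fallback message here is just an illustrative sketch:

```shell
# Confirm the juju client is installed and report its version;
# fall back to a message if the snap isn't on the PATH yet.
if command -v juju >/dev/null 2>&1; then
    juju version
else
    echo "juju not found - check that the snap installed correctly"
fi
```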
You’re all set!
Connecting to JAAS
To connect to JAAS from the command line you’ll need to register with the JAAS controller. You only need to do this the first time.
$ juju register jimm.jujucharms.com
This command will open a new window in your default web browser. Use Ubuntu SSO to login and authorise your account.
You will then be asked to enter a descriptive name for the JAAS controller. We suggest using jaas.
JAAS users with existing models might first need to switch to the relevant model:
$ juju switch <model-name>
Your Hadoop Spark cluster is managed as a model by Juju. View the model’s status with:
$ juju status
To watch continuously, in colour, run:
$ watch -c juju status --color
Running your first Spark workload
Once your bundle is deployed, in the terminal, run the sparkpi demo workload included on the Spark node:
$ juju run --unit spark/0 /home/ubuntu/sparkpi.sh
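The sparkpi demo estimates π using the classic Monte Carlo method: throw random points at the unit square and count how many land inside the quarter circle. As a rough local illustration of the same idea (not the bundled script itself), you can reproduce the calculation with awk:

```shell
# Monte Carlo estimate of pi: the fraction of random points (x, y) in
# the unit square with x^2 + y^2 <= 1 approaches pi/4 as n grows.
awk 'BEGIN {
    srand()
    n = 100000
    for (i = 0; i < n; i++) {
        x = rand(); y = rand()
        if (x*x + y*y <= 1) hits++
    }
    printf "pi is roughly %.2f\n", 4 * hits / n
}'
```

With 100,000 samples the estimate typically lands within about 0.01 of 3.14. SparkPi performs the same sampling, but distributes it across the cluster's executors.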
Visit the docs to learn more about how to monitor, benchmark, and scale.
Spark execution modes
By default, this bundle configures Spark in ‘yarn’ mode. This allows Spark to use the Hadoop cluster for all compute resources.
The Spark execution mode can be changed to use non-Hadoop resources for Spark jobs.
For example, switch Spark into standalone mode. In standalone mode, Spark launches a Master and Worker daemon on the Spark unit. This mode is useful for simulating a distributed cluster environment without actually setting up a cluster.
$ juju config spark spark_execution_mode=standalone
Add 2 additional units to form a 3-node Spark cluster:
$ juju add-unit -n 2 spark
In the inspector, click on the application “spark” and then “pending” to see details of the machines being provisioned and the charm software being installed and initialised.
For more details on the Spark execution modes, visit the configuration section of the Spark charm.
Access the dashboards
Back in the Juju GUI in your web browser, click on individual charms and expose their endpoints to operate your cluster.
In the GUI, select the Namenode charm.
Select Expose in the inspector on the left hand side and set the toggle ON, so you can connect to this unit. If the deployment isn’t complete yet, no public address will be available.
Click Commit changes.
Once deployment is complete, you can visit an overview of the Hadoop cluster with your web browser. Click the link showing the public IP address and port (e.g. xxx.xxx.xxx.xxx:50070). It will open in a new browser tab.
In the GUI, select Spark.
Select Expose in the inspector on the left hand side and set the toggle ON, so you can connect to this unit.
Click Commit changes.
Once deployment is complete, you can open the link xxx.xxx.xxx.xxx:18080 for spark/0 to view the Spark Job History interface, showing details of your completed jobs, including the Pi calculation you ran earlier.
Learn more about monitoring Spark.
In the GUI, select the Resource Manager charm.
Expose the charm.
Click Commit changes.
Once it’s ready, click the first link (e.g. xxx.xxx.xxx.xxx:8088) to open the YARN cluster dashboard, which includes information about the Hadoop compute nodes in your cluster. The second link (e.g. xxx.xxx.xxx.xxx:19888) opens the YARN History Server, which includes information about Hadoop or Spark jobs submitted to your cluster.
Select Ganglia (not Ganglia-node).
Expose the charm.
Click Commit changes.
Visit xxx.xxx.xxx.xxx:80/ganglia to open the Ganglia web interface, a visual dashboard of load and performance charts for your cluster.
Note: You will need to append “/ganglia” to the URL to reach the Ganglia interface.
That’s all folks!
- Learn more about the Hadoop Spark bundle.
- Discover other Big Data solutions.
- Get involved and connect with the Juju Big Data community.