How to stream Data Analytics with Apache Hadoop

Summary: Get started with Hadoop on JAAS with this streaming data example. We will use JAAS to deploy a fully supported Hadoop stack and then stream data into the platform for interactive SQL-based analysis.
Categories: deploy-applications
Difficulty: 3
Author: Tom Barber <tom@spicule.co.uk>

Overview

Duration: 2:00

Introduction

Welcome to the first tutorial in this Getting Started with Data Processing series.

This tutorial will get you started with Anssr, a data processing and analytics platform built on top of Juju from Canonical.

Streaming data processing is becoming more and more popular, aided by faster computers, cheaper storage, and open source projects like Apache Kafka, which can deal with data streams at very high scale. Being able to process data in real time can both reduce the need for batch processing over much larger data sets and give stakeholders quicker access to data that may otherwise take time to reach their computers.

Juju allows users to spin up complex applications onto different platforms and run the same code in each location. While this tutorial is focused on AWS, Azure or GCP, you could also run it on:

  • Other cloud services
  • OpenStack
  • LXD containers
  • MAAS

In this tutorial you will learn how to…

  • Launch a Hadoop cluster
  • Deploy a Kafka server
  • Connect Kafka to Hadoop
  • Perform SQL analysis over your streaming data and the data stored in HDFS
  • Create ultra-fast Parquet files

You will need…

  • An Ubuntu One account
  • A public SSH key
  • Credentials for AWS, GCP or Azure

Installing the Juju client

Duration: 5:00

Juju has both a web-based GUI and a command-line client. You can also access the command line from the GUI, but some people prefer to work entirely from the command line.

Juju is available as a client on many platforms and distributions. Visit the install docs to get the latest version of Juju on macOS, Windows or CentOS.

If you are running Ubuntu, you can install Juju through the following steps:

First, it’s helpful to install snapd if you don’t have it already.

$ sudo apt install snapd

Then you can easily install Juju via snap:

$ sudo snap install juju --classic

Verify you can run Juju. You should see a summary and a list of common commands.

$ juju
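
You can also confirm which version of the client you have installed:

$ juju version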

You’re all set!

Connecting to JAAS

Duration: 2:00

To connect to JAAS from the command line you’ll need to register with the JAAS controller. You only need to do this the first time.

$ juju register jimm.jujucharms.com

You can check connectivity with:

juju list-models
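
If you want to double-check which controller and user you are logged in as, Juju can tell you:

juju whoami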

Deploying Hadoop with JAAS

Duration: 30:00

In this tutorial we are going to be streaming Kafka data, but we need somewhere to store it. This could be local files, another database, or S3, to name just a few options. In this case we’re going to persist the data into an HDFS cluster.

Deploy via the Juju GUI

The easiest way to get going is via the GUI.

Go to jaas.ai and in the search bar enter the phrase “Anssr Hadoop”. In the results you’ll see a bundle. Select this bundle and click on “Add to new model” at the top of the bundle page.

Once added to a new model, you’ll see the charms laid out on the canvas. The bundle consists of 5 applications and, by default, 5 machines are used to deploy it.

The applications are Namenode, Resource Manager, Workers, Client, and Plugin. This is a fully working Hadoop stack and, once deployed, it will spin up 3 workers to distribute the processing load. Since we’re just testing here, we can keep costs down by reducing this to 1. To do this, click on the Worker charm (the leftmost icon). This should bring up some details in the top left.

Click on Units, check 2 of the checkboxes, and click the remove button.

Once you’ve done this, click “Commit changes” on the bottom right-hand corner of the screen.

If you’ve not logged into the charm store, you will be asked at this point to log in or sign up to Juju. This uses Ubuntu One, so if you’ve already got an account there, you can use it here.

Next you will be asked where you want to deploy your Hadoop cluster. Depending on your cloud choices, you can then select from AWS, Azure or GCP. You will need to enter your cloud credentials. You may upload your SSH key using the manual SSH key entry, or else use the GitHub or Launchpad key importers. Make sure to click the “Add key” button before moving on.

From there you then need to click the “Commit” button.

As machines get started and applications deployed, the charms on the canvas should get different coloured outlines to indicate their status. You can also find out more about their current state by clicking on the Status tab. When finished, all the charms should be in the Active state with a ready status message.

Deploy via CLI

If on the other hand you prefer using the CLI, this is how you do it.

First you need to add a model:

juju add-model streaming <cloud name>

More details on adding models can be found in the Juju documentation.
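
For example, if your JAAS credentials are set up for AWS, the command might look like the following (the region here is just an illustration; use whichever cloud and region you registered with JAAS):

juju add-model streaming aws/us-east-1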

Then you can deploy anssr-hadoop to this model:

juju deploy cs:~spiculecharms/anssr-hadoop

And scale down the workers for this tutorial:

juju remove-unit worker/1 worker/2

To keep an eye on what’s going on, run:

juju status
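
The deployment takes a while, so rather than re-running the command by hand you can poll it with the standard watch utility, for example:

watch -n 5 juju status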

Adding Kafka and Flume

Duration: 10:00

Once we have a Hadoop cluster up and running, it’s time to spin up our streaming pipeline. For this we will use a combination of Kafka and Flume.

Deploy via GUI

Search the charm store for Kafka and add it to the canvas.

Then search for Apache Flume and add both the Flume HDFS and Flume Kafka charms to the canvas.

Finally, connect the charms. To do that, simply join Kafka to the Apache Flume Kafka charm and that charm to the Apache Flume HDFS charm. Once that’s done, connect the Apache Flume HDFS charm to the Hadoop Plugin charm.

To make Kafka work, we also need a Zookeeper charm so add that to the canvas and connect it to Kafka.

Next click the “Commit changes” button; this will spin up the relevant machines and install the required software.

Deploy via CLI

If you are a command line user, you can instead use the following commands:

juju deploy kafka
juju deploy apache-flume-hdfs
juju deploy apache-flume-kafka
juju deploy zookeeper

juju add-relation apache-flume-hdfs plugin
juju add-relation kafka apache-flume-kafka
juju add-relation apache-flume-hdfs apache-flume-kafka
juju add-relation kafka zookeeper
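
juju status accepts application names as filters, so once the relations are in place you can check that just these new applications have settled with something like:

juju status kafka zookeeper apache-flume-hdfs apache-flume-kafka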

Configuring Flume and Kafka

Once all the units are up and running, we need to configure them so they know what data to process.

In the GUI, select the Flume Kafka charm and select the “Configure” option from the menu. Towards the bottom of the menu you’ll see an entry for the kafka_topic. Set this to cpu-metrics-topic, then save and commit the changes.
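
If you prefer to do this from the CLI, the same option can be set with juju config, assuming the application was deployed under the name apache-flume-kafka as above:

juju config apache-flume-kafka kafka_topic=cpu-metrics-topic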

Next we have to use the command line. If you’ve not installed the Juju CLI, you can click the shell icon at the top of the Juju GUI and get a shell in your browser. If you’re already using the Juju CLI, you’re all set.

To create a Kafka topic, run the following command:

juju run-action kafka/0 create-topic topic=cpu-metrics-topic partitions=1 replication=1

This runs a script on the Kafka server telling it to create a topic to stream data through.

To check that the action completed, execute the command below, where <id> is the action ID returned by the previous command.

juju show-action-output <id> 

You should see something like the output below:

results:
  outcome: success
  raw: |
    Created topic "cpu-metrics-topic".
status: completed
timing:
  completed: 2018-10-26 09:39:10 +0000 UTC
  enqueued: 2018-10-26 09:39:06 +0000 UTC
  started: 2018-10-26 09:39:07 +0000 UTC
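
The create-topic action is just one of the actions the Kafka charm provides; you can list the others at any time with:

juju actions kafka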

Streaming some data

Duration: 5:00

Next we are going to stream some data into Kafka. For this we need a little script to generate some data.

The easiest way to do this is to run it directly on the Kafka server. To do that, you need to SSH into the server:

juju ssh kafka/0

Then you can either download the jar file like so:

wget -O kafka-cpu-metrics-producer.jar https://www.dropbox.com/s/oczzh8iebo0u7sn/kafka-cpu-metrics-producer.jar?dl=1 

or compile it from source like so:

git clone https://github.com/buggtb/kafka-streams-example
sudo apt install maven
cd kafka-streams-example 
mvn clean package
cd target

Next we can start the generator:

java -jar kafka-cpu-metrics-producer.jar

You should see it starting to write a stream of data to the screen. This data is also being ingested into Kafka. In the next section we’ll find out how to interrogate it.
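
If you want to watch the raw messages arriving in Kafka itself, the stock console consumer that ships with Kafka can tail the topic. The exact install path depends on how the charm lays out the Kafka distribution on the unit, so treat the path below as a placeholder:

# still on the kafka/0 unit; locate the script first
find / -name kafka-console-consumer.sh 2>/dev/null
# then tail the topic using whatever path the previous command returned
/path/to/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic cpu-metrics-topic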

Watching the data

Duration: 2:00

The data should be flowing from the generator into Kafka, from Kafka into Flume and from Flume to HDFS. Let’s check that this is indeed the case.

juju ssh resourcemanager/0
hdfs dfs -ls /user/flume/flume-kafka/

You should see a date-stamped folder. To see the individual files in the Kafka output, execute:

hdfs dfs -ls /user/flume/flume-kafka/<yyyy-mm-dd>
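
To peek at a few of the raw events Flume has written (assuming the charm writes them as plain text, which is what the JSON queries later in this tutorial rely on), you can cat one of the files, for example:

hdfs dfs -cat /user/flume/flume-kafka/<yyyy-mm-dd>/* | head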

This data could then be processed in a range of ways—MapReduce, Spark, Pig, etc. We will use Apache Drill. To find out how to use SQL to process this data, see the next section.

Extending with Apache Drill

Duration: 5:00

Apache Drill allows users to run SQL queries over NoSQL data sources. Among others, it has out-of-the-box support for HDFS, standard filesystems, MongoDB, Kudu, and Kafka.

To deploy it, in the charm store, search for Apache Drill and add it to the canvas. Then relate it to the Zookeeper charm and Kafka charm. Finally, connect it to the Hadoop Plugin charm and press the “Commit” button.

As this is a GUI-based tool, we’re also going to expose it. Select the charm, click on the “Expose” button, then expose the app. Then make sure you press the “Commit” button.

Or if you are a CLI user run:

juju deploy cs:~spiculecharms/apache-drill
juju add-relation apache-drill zookeeper
juju add-relation apache-drill plugin
juju add-relation apache-drill kafka
juju expose apache-drill

Once it’s deployed, you’re ready to run SQL!

Running SQL over a live stream

Duration: 5:00

To open the Drill web UI, find the IP of the Drill unit in the Status tab and navigate to http://<drill-ip>:8047 in a browser. If it doesn’t load, make sure you’ve “exposed” the service.
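
If you’re working from the CLI, juju status can give you the same address; filtering on the application keeps the output short:

juju status apache-drill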

Because we related Apache Drill to Kafka, Juju will have automatically configured our DFS and Kafka datasources. If you navigate to Apache Drill in a browser, you can see both are configured in the Storage tab. For example, a Kafka data source should look similar to:

{
  "type": "kafka",
  "kafkaConsumerProps": {
    "bootstrap.servers": "10.142.0.10:9092",
    "group.id": "drill-consumer"
  },
  "enabled": true
}

This means we should then be able to query the stream. In the query tab, paste the following query:

select * from juju_kafka.`cpu-metrics-topic`  limit 10

If all is well, you should see a table of 10 rows of data returned from the Kafka stream. That’s live SQL over streaming data!

SQL over Hadoop HDFS

Duration: 5:00

You can also query the files being written by Flume into the HDFS cluster.

So that Drill understands the files being written by Flume, and where to find them, you need to make a couple of minor tweaks to the juju_hdfs_plugin data source in Apache Drill.

First, in the workspaces block, just below the root workspace, add:

"flume": {
  "location": "/user/flume/flume-kafka",
  "writable": true,
  "defaultInputFormat": null,
  "allowAccessOutsideWorkspace": true
}

Then, in the formats section, amend the json block so it looks like this:

"json": {
  "type": "json",
  "extensions": [
    "json",
    "txt"
  ]
},

Once you’ve updated the data source, click on the Query tab.

Then you can run the query below, where “yyyy-MM-dd” is the current date:

select * from `juju_hdfs_plugin`.`flume`.`yyyy-MM-dd`

This data is processed from the log files written by Flume into HDFS and eventually could be many terabytes worth of data. The great thing about Apache Drill is that it’ll piece together multiple files from a directory into a single table. The result is that, although Flume writes individual files over time, Drill will treat them all as one.

Writing your data to a Parquet file

Duration: 3:00

As the data scales, executing complex SQL queries over JSON files can be pretty slow. So how can we speed this up?

Parquet is a binary file format that stores data in a columnar layout. Column store databases have been used for decades to speed up query times in reporting databases. They are generally quicker because the engine doesn’t have to read an entire row when it only needs a few columns.

Apache Drill can query Parquet files, which is cool. But it can also create Parquet files, which is even cooler. Let’s give it a go!

Execute the query below, where “yyyy-MM-dd” is the current date.

CREATE TABLE dfs.tmp.sampleparquet AS 
(select machine, cast(cpu as double) cpu, cast(memory as double) memory, status from `juju_hdfs_plugin`.`flume`.`yyyy-MM-dd`)

This runs a CREATE TABLE statement driven by a SQL SELECT and writes the output as Parquet files. The CREATE TABLE statement is pretty flexible and allows you to create table structures in a range of data formats. You could also extend this to enrich your data by combining multiple datasources in the SELECT, giving you greater visibility of your data.

Next, find out how to query it!

Extended data analysis

Duration: 5:00

Now that we’re writing out Parquet files, we need to be able to query them, right?

Drill has written the Parquet output to the dfs.tmp workspace, a temporary location in Hadoop where it has write access. To read it, you can run queries like the following:

select * from dfs.tmp.sampleparquet

This allows you to run an SQL query directly over a Parquet table structure. Because Parquet is a binary format, you can’t just go and look in the file; you need an application to process that data. Drill does this for you. But Drill also knows how to structure the query in such a way that the data is returned in the fastest way possible.
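
If you’d rather query from the shell than the web UI, Drill also exposes a REST endpoint on the same port. A minimal sketch, with <drill-ip> standing in for your Drill unit’s address:

curl -s -X POST -H 'Content-Type: application/json' \
  -d '{"queryType": "SQL", "query": "select * from dfs.tmp.sampleparquet limit 10"}' \
  http://<drill-ip>:8047/query.json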

Where next?!

Duration: 2:00

Hopefully you’ve enjoyed this tutorial about the Anssr platform and how to get more out of your data.

In this tutorial you’ve:

  • Deployed a Hadoop cluster
  • Wired up Apache Drill
  • Spun up Kafka and connected it to Flume

Then you:

  • Ran SQL queries over flat files
  • Ran SQL queries over Kafka streams
  • Generated and queried Parquet files

Not bad considering how little command line interaction has been required. This demonstrates just some of the flexibility of the Anssr platform.

What next? How about Machine Learning? Complex Event Processing? Streaming Analytics? All can be achieved using Anssr and the Juju engine.

If you have any more comments or questions, visit the Juju forums or the Spicule Juju Expert Partners page.

