Zeppelin has become one of my favourite tools in my toolbox. I design a lot of things for Cassandra in Scala, and even though I love Cassandra, there are times when things just get too complicated on the CQL command line, and creating a small project in IntelliJ seems like too much hassle. For trying things out in those cases, Zeppelin is just perfect. So this page is a how-to with some useful cookbook recipes.

Setting Up Zeppelin

I use Docker, where things are so much easier, and I picked v0.8.0 because I never got 0.8.2 to work for some reason.

Download and Start Cassandra

docker pull cassandra:3.11

docker run --name Cassandra3 -p 9042:9042 cassandra:3.11

Download and Start Zeppelin

Download Zeppelin image

docker pull apache/zeppelin:0.8.0

Start Zeppelin on port 8080

docker run -p 8080:8080 --name zeppelin apache/zeppelin:0.8.0

-p hp:cp
hp = host port, the port on your local machine
cp = container port, the port inside the Docker container that Zeppelin exposes

Go to localhost:8080 in your web browser and you should see the Zeppelin start page.

Setup Zeppelin

Find the IP address of Cassandra in your Docker network. As the inspect output below shows, the IP address is 172.17.0.3.

QSWEM078:~ teriksson$ docker network inspect bridge
[
    {
        "Name": "bridge",
        "Id": "355be8072aafa87bafa8de19d00af597746039000d27e9245e2464fa54bf81a8",
        "Created": "2020-04-03T14:23:57.446760383Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": false,
        "IPAM": {
            "Driver": "default",
            "Options": null,
            "Config": [
                {
                    "Subnet": "172.17.0.0/16",
                    "Gateway": "172.17.0.1"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "ceda1cebea87ee7244f00d5e88292ff76fc46142627ed4064e0b98cd92f728a3": {
                "Name": "zeppelin",
                "EndpointID": "2cc39278d16db811bc593945adcc4a7ae2d0e5409a98c1ddf0d548bcf0b7052a",
                "MacAddress": "02:42:ac:11:00:02",
                "IPv4Address": "172.17.0.2/16",
                "IPv6Address": ""
            },
            "f772b8c66fe3729bd00e2bd9d2e50472ec40b1e8047796f8f69db6ecee6a77ae": {
                "Name": "cassandra3",
                "EndpointID": "23fde4a184ca9456ddec164616c4603f6ee8f3c310e21cb7c4409d350d7c3fd6",
                "MacAddress": "02:42:ac:11:00:03",
                "IPv4Address": "172.17.0.3/16",
                "IPv6Address": ""
            }
        },
        "Options": {
            "com.docker.network.bridge.default_bridge": "true",
            "com.docker.network.bridge.enable_icc": "true",
            "com.docker.network.bridge.enable_ip_masquerade": "true",
            "com.docker.network.bridge.host_binding_ipv4": "0.0.0.0",
            "com.docker.network.bridge.name": "docker0",
            "com.docker.network.driver.mtu": "1500"
        },
        "Labels": {}
    }
]

Set up IP address for Cassandra in the Spark Interpreter

Open the Interpreter page in Zeppelin and go to the section on “Spark”

Now add a row that says

spark.cassandra.connection.host : 172.17.0.3

Now also edit the Dependencies

You can do this in two ways: either specify the Maven artifact with its version, or download the JAR file(s) to disk and copy them into the Docker container. I had to do the latter due to some issue with my network.

You need these two libraries:

spark-cassandra-connector_2.11-2.0.12.jar
jsr166e-1.1.0.jar

Download each JAR file, then copy it into the Docker container with

docker cp spark-cassandra-connector_2.11-2.0.12.jar zeppelin:/zeppelin/interpreter/spark/dep/spark-cassandra-connector_2.11-2.0.12.jar

docker cp jsr166e-1.1.0.jar zeppelin:/zeppelin/interpreter/spark/dep/jsr166e-1.1.0.jar

Set up IP address for Cassandra in the Cassandra Interpreter

cassandra.hosts : 172.17.0.3

Create your first Notebook

Cookbook Recipes

Load Table into RDD and count rows

This just shows how to load a table into an RDD. Once the data is in the RDD, you can play around with it and do lots of stuff.

%spark
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._
val rdd = sc.cassandraTable("system_schema","keyspaces")
println("Row count:" + rdd.count)
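To give an idea of the "play around" part, here is a small sketch that pulls the keyspace names out of the rows. It assumes the connection settings above and a column called keyspace_name, which is what system_schema.keyspaces has in Cassandra 3.x; adjust the column name for other tables.

```scala
%spark
import com.datastax.spark.connector._

// Same table as above; each element of the RDD is a CassandraRow
val keyspaces = sc.cassandraTable("system_schema", "keyspaces")

// Read one column from every row and collect the (small) result to the driver
val names = keyspaces.map(row => row.getString("keyspace_name")).collect()

names.sorted.foreach(println)
```

Collecting to the driver is fine here because the table is tiny; for a big table you would keep working on the RDD instead.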

Show keyspaces using CQL in the built-in Cassandra interpreter

%cassandra
USE "system_schema";
SELECT * FROM keyspaces;

Create Keyspace and Table using CQL

%cassandra
CREATE KEYSPACE cim WITH replication = {'class':'SimpleStrategy', 'replication_factor' : 1};

%cassandra
CREATE TABLE cim.customer(
   id int PRIMARY KEY,
   name text,
   city text
   );

Insert data by hand using CQL

%cassandra
INSERT INTO cim.customer (id, name, city) VALUES (1, 'US Robotics', 'New York');

Fill the table with bogus data using Spark and Scala

%spark
import scala.util.Random
val random = new Random
val cities = List[String]( "Stockholm", "Malmoe", "Kalmar", "Jonkoping", "Linkoping", "Karlskrona", "Ronneby" )
val companyNames = List[String]( "Ikea", "SJ", "Ericsson", "Thai Silk", "Italia", "Apple", "ASEA", "Pressbyran")
val data = (1 to 3000 ).map( id => (id,companyNames(random.nextInt(companyNames.length))+"-"+id,cities(random.nextInt(cities.length))) )
val rdd = sc.parallelize( data )
rdd.saveToCassandra( "cim", "customer", SomeColumns("id","name","city") )

Select data using CQL

%cassandra
SELECT * FROM cim.customer WHERE id = 100;
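
The same lookup can also be sketched from the Spark side. This assumes the cim.customer table created earlier and uses the connector's select and where methods, which hand the column list and the filter over to Cassandra rather than filtering in Spark.

```scala
%spark
import com.datastax.spark.connector._

// Only the listed columns are fetched, and the id filter is applied by Cassandra
val hit = sc.cassandraTable("cim", "customer")
  .select("id", "name", "city")
  .where("id = ?", 100)

hit.collect().foreach(println)
```

This works nicely here because id is the primary key; a filter on a column that Cassandra cannot query by would be rejected in the same way as in plain CQL.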

Create VIEW so that we can run SQL

%spark
import org.apache.spark.sql.cassandra._
import org.apache.spark.sql
val createTempView = """CREATE TEMPORARY VIEW customers
 USING org.apache.spark.sql.cassandra
 OPTIONS (
 table "customer",
 keyspace "cim",
 pushdown "true")"""
spark.sql(createTempView)

Run SQL, ohh sweet SQL 🙂

%spark
spark.sql("SELECT * FROM customers WHERE city like 'K%' limit 10").show

By creating temporary views like this, we can also do joins if we would like to.
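
As a sketch of such a join, the paragraph below builds a small lookup table in memory, registers it as a second temporary view, and joins it with customers. The regions view and its contents are made up for illustration and are not part of the Cassandra data; the customers view is the one created above.

```scala
%spark
// Hypothetical lookup data: city -> region (lives only in Spark, not in Cassandra)
val regions = spark.createDataFrame(Seq(
  ("Stockholm", "Svealand"),
  ("Kalmar", "Gotaland"),
  ("Karlskrona", "Gotaland")
)).toDF("city", "region")

regions.createOrReplaceTempView("regions")

// Join the Cassandra-backed view with the in-memory view
spark.sql("""
  SELECT c.id, c.name, c.city, r.region
  FROM customers c
  JOIN regions r ON c.city = r.city
  LIMIT 10
""").show
```

A second Cassandra table would work the same way: create another temporary view for it with the same USING org.apache.spark.sql.cassandra pattern and join on the shared column.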

Obviously this is not how Cassandra was intended to be used, but the point here is to make it easy to troubleshoot and tour around in the data, instead of setting up a project and doing the joins in code. Here we can really use trial and error until we get what we want.

That was all for now

-Tobias
