Wednesday, November 30, 2016

Using the Kotlin Language with Apache Spark


About a month ago I posted an article proposing Kotlin as another programming language for data science. It is a pragmatic, readable language created by JetBrains, the company behind Intellij IDEA and PyCharm. It has gained growing popularity on Android and focuses on industrial use rather than experimental features. Just like Java and Scala, Kotlin compiles to bytecode and runs on the Java Virtual Machine. It also works with Java libraries out-of-the-box with no hiccups, and in this article I’m going to show how to use it with Apache Spark.

Officially, you can use Apache Spark with Scala, Java, Python, and R. If you are happy using any of these languages with Spark, you likely will not need Kotlin. But if you tried to learn Scala or Java and found it was not for you, you might want to give Kotlin a look. It is a legitimate fifth option that works out-of-the-box with Spark.

I recommend using Intellij IDEA, as it natively includes Kotlin support. It is an excellent IDE that you can also use with Java and Scala. I also recommend using Gradle for your build automation.
Kotlin is also on track to effectively replace Groovy as the go-to scripting language for Gradle builds. You can read more about it in the article Kotlin Meets Gradle.

Setting Up

To get started, make sure to install the following:
  • Java JDK - the Java Development Kit, required for any JVM-based project
  • Intellij IDEA - IDE for Java, Kotlin, Scala, and other JVM projects
  • Gradle - Build automation system; download the Binary-Only distribution and unzip it to a location of your choice
You will need to configure Intellij IDEA to use your Gradle location. Launch Intellij IDEA and set this up in Settings -> Build, Execution, and Deployment -> Gradle. If you have trouble, there should be plenty of walkthroughs online.

Let’s create our Kotlin project. Using your operating system, create a folder with the following structure:
kotlin_spark_project
      |
      └────src
            |
            └────main
                  |
                  └────kotlin

Your project folder must contain the directory structure /src/main/kotlin/ inside of it. This is important so Gradle will recognize it as a Kotlin project.
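If you are on macOS or Linux, one way to create this structure from a terminal (assuming you name the project folder kotlin_spark_project as shown above):

```shell
# Create the project folder along with the nested source directory Gradle expects
mkdir -p kotlin_spark_project/src/main/kotlin
```

On Windows, creating the folders in Explorer works just as well.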
Next, create a text file named build.gradle and use a text editor to put in the following contents. This is the script that will configure your project as a Kotlin project. You can read more about Kotlin Gradle configurations here.
buildscript {
    ext.kotlin_version = '1.0.5'
    repositories {
        mavenCentral()
    }
    dependencies {
        classpath "org.jetbrains.kotlin:kotlin-gradle-plugin:$kotlin_version"
    }
}

apply plugin: "kotlin"

repositories {
    mavenCentral()
}

dependencies {
    compile "org.jetbrains.kotlin:kotlin-stdlib:$kotlin_version"

    //Apache Spark
    compile 'org.apache.spark:spark-core_2.10:1.6.1'
}
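As an aside, since Gradle is gaining Kotlin scripting support, roughly the same build could be sketched in the Gradle Kotlin DSL (a build.gradle.kts file). This is only an illustrative, unverified sketch using the newer plugins syntax; the Groovy script above is the one this walkthrough actually uses:

```kotlin
// build.gradle.kts - illustrative sketch only, not verified against this project
plugins {
    kotlin("jvm") version "1.0.5"
}

repositories {
    mavenCentral()
}

dependencies {
    implementation(kotlin("stdlib"))

    // Apache Spark
    implementation("org.apache.spark:spark-core_2.10:1.6.1")
}
```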

Finally, launch Intellij IDEA and click Import Project and navigate to the location of your Kotlin project folder you just created. In the wizard, check Import project from external model with the Gradle option. Click Next, then select Use Local Gradle Distribution with the Gradle copy you downloaded. Then click Finish.

Your workspace should now be set up with a Kotlin project as shown below. If you do not see the project explorer on the left press ALT + 1. Then double-click on the project folder and navigate down to the kotlin folder.




Right click the kotlin folder and select New -> Kotlin File/Class.



Name the file “SparkApp” and press OK. You will now see a SparkApp.kt file added to your kotlin folder. An editor will open on the right.

Using Spark with Kotlin

Let’s put our Spark usage in the SparkApp.kt file. Spark is written in Scala. While Kotlin does not interoperate directly with Scala, it does have 100% interoperability with Java. Thankfully, Spark provides a Java API through the JavaSparkContext, and we can leverage it to use Spark out-of-the-box with Kotlin.
Create a main() function below which will be the entry point for our Kotlin application. Be sure to import the needed Spark dependencies as well. In your main() function, configure your SparkConf and create a new JavaSparkContext off of it.
import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext

fun main(args: Array<String>) {

    val conf = SparkConf()
            .setMaster("local")
            .setAppName("Kotlin Spark Test")

    val sc = JavaSparkContext(conf)
}

The JavaSparkContext exposes a Java API for creating and transforming Spark RDDs. Thankfully, we can use Kotlin’s concise lambda syntax, which the Kotlin compiler translates into the Java functional types that Spark expects.
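To see this SAM conversion in isolation, here is a minimal sketch using a plain Java functional interface; nothing below is Spark-specific, and JFunction is just an import alias I chose for the example:

```kotlin
import java.util.function.Function as JFunction

// A Kotlin lambda converted into an instance of a Java functional interface -
// the same mechanism that lets us pass lambdas to JavaSparkContext methods.
val isNumeric = JFunction<String, Boolean> { s -> s.matches(Regex("[0-9]+")) }

fun main() {
    println(isNumeric.apply("123"))   // true
    println(isNumeric.apply("ALPHA")) // false
}
```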
Let’s take a List of Strings containing alphanumeric text values separated by / characters. We will break these alphanumeric values up, filter only for numbers, and then find their sum.
import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext

fun main(args: Array<String>) {

    val conf = SparkConf()
            .setMaster("local")
            .setAppName("Kotlin Spark Test")

    val sc = JavaSparkContext(conf)

    val items = listOf("123/643/7563/2134/ALPHA", "2343/6356/BETA/2342/12", "23423/656/343")

    val input = sc.parallelize(items)

    val sumOfNumbers = input.flatMap { it.split("/") }
            .filter { it.matches(Regex("[0-9]+")) }
            .map { it.toInt() }
            .reduce { total, next -> total + next }

    println(sumOfNumbers)
}
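As a sanity check, the exact same pipeline can be expressed with plain Kotlin collections, no Spark involved, and it should yield the same sum:

```kotlin
// Same transformation as the Spark job above, but on local Kotlin collections
val items = listOf("123/643/7563/2134/ALPHA", "2343/6356/BETA/2342/12", "23423/656/343")

val sumOfNumbers = items.flatMap { it.split("/") }
        .filter { it.matches(Regex("[0-9]+")) }
        .map { it.toInt() }
        .sum()

fun main() {
    println(sumOfNumbers) // prints 45938
}
```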

If you click the Kotlin logo right next to your main() function in the gutter, you can run this Spark application.



A console should pop up below and start logging Spark’s events. I did not turn off logging, so it will be a bit noisy, but ultimately you should see the value of sumOfNumbers printed.
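If you want to tame the noise, one common approach is to drop a log4j.properties file into a src/main/resources folder (which you would need to add to the project yourself); the snippet below is a typical minimal configuration raising the log level to ERROR:

```properties
# src/main/resources/log4j.properties - quiet Spark's console logging
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
```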


Conclusion

I will show a few more examples in the coming weeks on how to use Kotlin with Spark (you can also check out my GitHub project). Kotlin is a pragmatic, readable language that I believe has potential for adoption in Spark. It just needs more documentation for this purpose. But if you want to learn more about Kotlin, you can read the Kotlin Reference as well as check out a few books that are out there. I have heard great things about the O’Reilly video series on Kotlin, which I understand is helpful for folks who do not have a background in Java, Scala, or other JVM languages.

If you learn Kotlin you can likely translate existing books and documentation on Spark into Kotlin usage. I’ll do my best to share my discoveries and any nuances I may encounter. For now, I do recommend giving it a look if you are not satisfied with your current languages.

10 comments:

  1. I couldn't find the word "replace" in "Kotlin Meets Gradle". Adding support for Kotlin is different than replacing Groovy.

    1. True, I should say "effectively replace". They are not simply going to drop Groovy support. But it's likely innovations will happen on Kotlin side at some point while Groovy gets maintained.

  2. Thanks for the easy to follow post. Do you have recommendations for a Kotlin/Spark workflow that uses a REPL?

  4. Hello,
    I was able to build your solution using command line gradle build but I am not able to use spark-submit to run it.
    Can you (based on your git files) provide a command to run this please?

    Here are the errors: (using Kotlin 1.3 and Gradle 5.4 and Spark 2.4)

    C:\Yuri\kotlin-spark-test-master>spark-submit --class Yuri .\build\libs\kotlin-spark-test-master.jar

    Warning: Failed to load Yuri: kotlin/TypeCastException

    log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
    log4j:WARN Please initialize the log4j system properly.
    log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

    1. OK, I sorted it out. I had to build a Jar with includes of kotlin-runtime (std-lib) and submit it with path to all other jars (spark, etc).
      Your example does work from IDEA but the crucial part of building a runnable/executable jar is missing. Without it this post is only half-baked. A good start though.

      Here is example:
      spark-submit --class SparkAppMain --jars .\build\libs\kotlin_spark_project-2.0-SNAPSHOT.jar .\build\libs

      Gradle build file is the key here:

      (NOTE: I am totally new to Gradle so I am sure this can be improved)

      build.gradle file:

      buildscript {
          ext.kotlin_version = '1.3.31'
          repositories {
              mavenCentral()
              maven {
                  url "https://plugins.gradle.org/m2/"
              }
          }
          dependencies {
              classpath "org.jetbrains.kotlin:kotlin-gradle-plugin:$kotlin_version"
              classpath "com.github.jengelman.gradle.plugins:shadow:5.0.0"
          }
      }

      apply plugin: "kotlin"
      apply plugin: "java-library"
      apply plugin: "java"
      version '2.0-SNAPSHOT'

      // can use this one gradle shadowJar or a regular gradle jar
      apply plugin: 'com.github.johnrengelman.shadow'

      sourceCompatibility = 1.8

      repositories {
          mavenCentral()
      }

      // one option - a fat jar (140 MB fat!) gradle shadowJar
      shadowJar {
          zip64 true
          manifest {
              attributes 'Main-Class': 'SparkAppMain'
          }
      }

      // run it like so:
      // spark-submit --class SparkAppMain --jars .\build\libs\kotlin_spark_project-2.0-SNAPSHOT.jar .\build\libs

      // thinner jar with just Kotlin runtime here: gradle build or gradle jar
      // note that [ zip64 true ] below is required to build it with Spark jar dependants

      jar {
          zip64 true
          manifest {
              attributes 'Main-Class': 'SparkAppMain'
          }

          from {
              String[] include = [
                      "kotlin-runtime-${kotlin_version}.jar",
                      "kotlin-stdlib-${kotlin_version}.jar"
              ]

              configurations.compile
                      .findAll { include.contains(it.name) }
                      .collect { it.isDirectory() ? it : zipTree(it) }
          }
      }

      dependencies {
          compile "org.jetbrains.kotlin:kotlin-stdlib:$kotlin_version"
          compile group: 'org.jetbrains.kotlin', name: 'kotlin-stdlib', version: '1.3.31'
          compile group: 'org.apache.spark', name: 'spark-core_2.12', version: '2.4.2'
          compile group: 'org.apache.spark', name: 'spark-sql_2.12', version: '2.4.2'
          compile group: 'org.apache.spark', name: 'spark-hive_2.12', version: '2.4.2'
      }
