Adds Kotlin Jupyter notebook support #137

Merged: 41 commits, May 6, 2022
Commits

f94c3d8  testing jupyter (Jolanrensen, Mar 2, 2022)
4f68d06  fixed instant tests (Jolanrensen, Mar 3, 2022)
8d038a5  attempting to make it work (Jolanrensen, Mar 3, 2022)
d5c0b32  basics work! (Jolanrensen, Mar 3, 2022)
e9f4b35  updating readme (Jolanrensen, Mar 3, 2022)
34a0cbd  added dataset html renderer, trying to fix library issues (Jolanrensen, Mar 7, 2022)
1104a64  fixed import (Jolanrensen, Mar 7, 2022)
b907213  added dataset renderer with nice css, jupyter test still broken (Jolanrensen, Mar 10, 2022)
f026f57  attempting to make jupyter test work (Jolanrensen, Mar 11, 2022)
8071758  added simple RDD rendering (Jolanrensen, Mar 11, 2022)
ac9b6d1  jupyter tests work when targeting jdk 11 (Jolanrensen, Mar 14, 2022)
a6336f4  more tests and trying to please qodana (Jolanrensen, Mar 14, 2022)
a91d35b  more tests and trying to please qodana (Jolanrensen, Mar 14, 2022)
f346c3b  added more tests and catches for RDDs that cannot be rendered (Jolanrensen, Mar 14, 2022)
fd19408  updated readme (Jolanrensen, Mar 14, 2022)
ee38db2  Merge branch 'spark-3.2' into jupyter-test (Jolanrensen, Apr 21, 2022)
717501a  adding qodana scan to github actions. Improved tuple and rdd render s… (Jolanrensen, Apr 21, 2022)
61da8de  Merge branch 'spark-3.2' into jupyter-test (Jolanrensen, Apr 21, 2022)
33ad506  updated from main branch (Jolanrensen, Apr 21, 2022)
ea59d66  maybe fixed kotest? (Jolanrensen, Apr 22, 2022)
41e959e  adding test for filter function of dataset of local data class and ot… (Jolanrensen, Apr 26, 2022)
90df422  now able to debug right test (Jolanrensen, Apr 26, 2022)
dfb48df  now able to debug right test (Jolanrensen, Apr 26, 2022)
390defc  added functions to integration since json will support different inte… (Jolanrensen, Apr 28, 2022)
32a54f8  added functions to integration since json will support different inte… (Jolanrensen, Apr 28, 2022)
a0cf372  temp test to see if we can publish to gh packages from a branch (Jolanrensen, Apr 28, 2022)
4e9e05a  temp test to see if we can publish to gh packages from a branch (Jolanrensen, Apr 29, 2022)
bf119be  temp test to see if we can publish to gh packages from a branch (Jolanrensen, Apr 29, 2022)
8c84471  temp test to see if we can publish to gh packages from a branch (Jolanrensen, Apr 29, 2022)
2fb01c3  enabling snappy setting for lz4 compression codec (Jolanrensen, Apr 29, 2022)
b7c9297  jupyter api allows multiple integration files now (Jolanrensen, May 2, 2022)
10c3a90  This build will fail. Requires mavenLocal build of kotlin jupyter wit… (Jolanrensen, May 2, 2022)
5009355  Updated kotlin jupyter version, added streaming test for jupyter as well (Jolanrensen, May 5, 2022)
0be6b70  removed snappy again (Jolanrensen, May 5, 2022)
84bf0d5  fixed gh actions (Jolanrensen, May 6, 2022)
2b6aca4  some cleaning per request (Jolanrensen, May 6, 2022)
52398d6  split off jupyter into a separate module (Jolanrensen, May 6, 2022)
81448ee  gh actions deploying is working, so setting correct branch (Jolanrensen, May 6, 2022)
7c2d652  fixed dependencies as requested (Jolanrensen, May 6, 2022)
234430d  moved nexus plugin to release sign, which was renamed to central-deploy (Jolanrensen, May 6, 2022)
e8f4ee4  disabled qodana... (Jolanrensen, May 6, 2022)
31 changes: 31 additions & 0 deletions .github/workflows/publish_dev_version.yml
@@ -0,0 +1,31 @@
name: Generate and publish docs

on:
  push:
    branches:
      - "spark-3.2"

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v2
      - name: Set up JDK 11
        uses: actions/setup-java@v1
        with:
          distributions: adopt
          java-version: 11
          check-latest: true
      - name: Cache Maven packages
        uses: actions/cache@v2
        with:
          path: ~/.m2
          key: ${{ runner.os }}-m2-${{ hashFiles('**/pom.xml') }}
          restore-keys: ${{ runner.os }}-m2
      - name: Deploy to GH Packages
        run: ./mvnw --batch-mode deploy -Dkotest.tags="!Kafka"
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}


33 changes: 31 additions & 2 deletions README.md
@@ -73,6 +73,32 @@ Once you have configured the dependency, you only need to add the following import
import org.jetbrains.kotlinx.spark.api.*
```

### Jupyter

The Kotlin Spark API also supports Kotlin Jupyter notebooks.
To use it, simply add

```jupyterpython
%use kotlin-spark-api
```
to the top of your notebook. This will get the latest version of the API, together with the latest version of Spark.
To specify a particular version of Spark or of the API itself, add it like this:
```jupyterpython
%use kotlin-spark-api(spark=3.2, version=1.0.4)
```

Inside the notebook, a Spark session is started automatically and can be accessed via the `spark` value.
The underlying `sc: JavaSparkContext` can be accessed directly as well. Apart from that, the API behaves much the same as outside the notebook.

There is also support for HTML rendering of Datasets and simple (Java)RDDs.
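
As a rough sketch of what a cell can look like after the `%use` line above (using only the `spark`, `sc`, and `dsOf` names mentioned in this README):

```kotlin
// `spark` (SparkSession) and `sc` (JavaSparkContext) are created by the integration.
val ds = spark.dsOf("a" to 1, "b" to 2)    // Dataset<Pair<String, Int>>
ds.show()                                  // plain-text output

val rdd = sc.parallelize(listOf(1, 2, 3))  // JavaRDD<Int>
rdd                                        // the last expression of a cell is rendered, here as an HTML table
```
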

To use Spark Streaming capabilities, instead use
```jupyterpython
%use kotlin-spark-api-streaming
```
This does not start a Spark session right away, meaning you can call `withSparkStreaming(batchDuration) {}`
in whichever cell you want.
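
A rough sketch of such a cell (assuming the streaming context is exposed as `ssc` inside the block; the socket source below is just an example):

```kotlin
import org.apache.spark.streaming.Durations

// Nothing is started up front; the streaming context only lives inside this block.
withSparkStreaming(batchDuration = Durations.seconds(1)) {
    // `ssc` is the JavaStreamingContext belonging to this block (example source: a local socket).
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.map { it.length }.print()
}
```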

## Kotlin for Apache Spark features

### Creating a SparkSession in Kotlin
@@ -81,12 +107,13 @@ val spark = SparkSession
.builder()
.master("local[2]")
.appName("Simple Application").orCreate

```

This is not needed when running the Kotlin Spark API from a Jupyter notebook.

### Creating a Dataset in Kotlin
```kotlin
spark.toDS("a" to 1, "b" to 2)
spark.dsOf("a" to 1, "b" to 2)
```
The example above produces `Dataset<Pair<String, Int>>`. While Kotlin Pairs and Triples are supported, Scala Tuples are recommended for better support.
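
For example, a tuple-based variant of the snippet above might look like this (a sketch using the plain Scala `Tuple2` constructor):

```kotlin
import scala.Tuple2

// Same data as above, but as Scala tuples: Dataset<Tuple2<String, Int>>
val ds = spark.dsOf(Tuple2("a", 1), Tuple2("b", 2))
ds.show()
```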

@@ -102,6 +129,8 @@ We provide you with useful function `withSpark`, which accepts everything that m

After work block ends, `spark.stop()` is called automatically.

Do not use this when running the Kotlin Spark API from a Jupyter notebook.

```kotlin
withSpark {
dsOf(1, 2)
7 changes: 6 additions & 1 deletion core/3.2/pom_2.12.xml
@@ -27,8 +27,13 @@
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.compat.version}</artifactId>
<version>${spark3.version}</version>
<scope>provided</scope>
</dependency>
<!-- <dependency>-->
<!-- <groupId>org.apache.spark</groupId>-->
<!-- <artifactId>spark-core_${scala.compat.version}</artifactId>-->
<!-- <version>${spark3.version}</version>-->
<!--&lt;!&ndash; <scope>provided</scope>&ndash;&gt;-->
<!-- </dependency>-->
</dependencies>

<build>
210 changes: 108 additions & 102 deletions examples/pom-3.2_2.12.xml
@@ -1,108 +1,114 @@
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

<modelVersion>4.0.0</modelVersion>
<modelVersion>4.0.0</modelVersion>

<name>Kotlin Spark API: Examples for Spark 3.2+ (Scala 2.12)</name>
<description>Example of usage</description>
<artifactId>examples-3.2_2.12</artifactId>
<parent>
<groupId>org.jetbrains.kotlinx.spark</groupId>
<artifactId>kotlin-spark-api-parent_2.12</artifactId>
<version>1.0.4-SNAPSHOT</version>
<relativePath>../pom_2.12.xml</relativePath>
</parent>
<name>Kotlin Spark API: Examples for Spark 3.2+ (Scala 2.12)</name>
<description>Example of usage</description>
<artifactId>examples-3.2_2.12</artifactId>
<parent>
<groupId>org.jetbrains.kotlinx.spark</groupId>
<artifactId>kotlin-spark-api-parent_2.12</artifactId>
<version>1.0.4-SNAPSHOT</version>
<relativePath>../pom_2.12.xml</relativePath>
</parent>

<dependencies>
<dependency>
<groupId>org.jetbrains.kotlinx.spark</groupId>
<artifactId>kotlin-spark-api-3.2</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.compat.version}</artifactId>
<version>${spark3.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_${scala.compat.version}</artifactId>
<version>${spark3.version}</version>
</dependency>
<dependency><!-- Only needed for Qodana -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_${scala.compat.version}</artifactId>
<version>${spark3.version}</version>
</dependency>
</dependencies>
<dependencies>
<dependency>
<groupId>org.jetbrains.kotlin</groupId>
<artifactId>kotlin-reflect</artifactId>
</dependency><!-- TODO this shouldn't be needed -->
<dependency>
<groupId>org.jetbrains.kotlinx.spark</groupId>
<artifactId>kotlin-spark-api-3.2</artifactId>
<version>${project.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.compat.version}</artifactId>
<version>${spark3.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_${scala.compat.version}</artifactId>
<version>${spark3.version}</version>
</dependency>
<dependency><!-- Only needed for Qodana -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming-kafka-0-10_${scala.compat.version}</artifactId>
<version>${spark3.version}</version>
</dependency>
</dependencies>

<build>
<sourceDirectory>src/main/kotlin</sourceDirectory>
<testSourceDirectory>src/test/kotlin</testSourceDirectory>
<directory>target/3.2/${scala.compat.version}</directory>
<plugins>
<plugin>
<groupId>org.jetbrains.kotlin</groupId>
<artifactId>kotlin-maven-plugin</artifactId>
<executions>
<execution>
<id>compile</id>
<goals>
<goal>compile</goal>
</goals>
</execution>
<execution>
<id>test-compile</id>
<goals>
<goal>test-compile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>${maven-assembly-plugin.version}</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<manifest>
<mainClass>org.jetbrains.spark.api.examples.WordCountKt</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-site-plugin</artifactId>
<configuration>
<skip>true</skip>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-deploy-plugin</artifactId>
<configuration>
<skip>true</skip>
</configuration>
</plugin>
<plugin>
<groupId>org.sonatype.plugins</groupId>
<artifactId>nexus-staging-maven-plugin</artifactId>
<configuration>
<skipNexusStagingDeployMojo>true</skipNexusStagingDeployMojo>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>8</source>
<target>8</target>
</configuration>
</plugin>
</plugins>
</build>
<build>
<sourceDirectory>src/main/kotlin</sourceDirectory>
<testSourceDirectory>src/test/kotlin</testSourceDirectory>
<directory>target/3.2/${scala.compat.version}</directory>
<plugins>
<plugin>
<groupId>org.jetbrains.kotlin</groupId>
<artifactId>kotlin-maven-plugin</artifactId>
<executions>
<execution>
<id>compile</id>
<goals>
<goal>compile</goal>
</goals>
</execution>
<execution>
<id>test-compile</id>
<goals>
<goal>test-compile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-assembly-plugin</artifactId>
<version>${maven-assembly-plugin.version}</version>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
<archive>
<manifest>
<mainClass>org.jetbrains.spark.api.examples.WordCountKt</mainClass>
</manifest>
</archive>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-site-plugin</artifactId>
<configuration>
<skip>true</skip>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-deploy-plugin</artifactId>
<configuration>
<skip>true</skip>
</configuration>
</plugin>
<plugin>
<groupId>org.sonatype.plugins</groupId>
<artifactId>nexus-staging-maven-plugin</artifactId>
<configuration>
<skipNexusStagingDeployMojo>true</skipNexusStagingDeployMojo>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>${maven-compiler-plugin.version}</version>
<configuration>
<source>9</source>
<target>9</target>
</configuration>
</plugin>
</plugins>
</build>
</project>