
Commit 94097fa

Merge branch 'spark-3.1' into spark-3.0

2 parents a664548 + 8bd7b6c
File tree: 14 files changed (+592 −124 lines)


README.md

Lines changed: 35 additions & 109 deletions
@@ -1,4 +1,4 @@
-# Kotlin for Apache® Spark™ [![Maven Central](https://img.shields.io/maven-central/v/org.jetbrains.kotlinx.spark/kotlin-spark-api-parent.svg?label=Maven%20Central)](https://search.maven.org/search?q=g:org.jetbrains.kotlinx.spark%20AND%20v:1.0.2) [![official JetBrains project](http://jb.gg/badges/official.svg)](https://confluence.jetbrains.com/display/ALL/JetBrains+on+GitHub) [![Join the chat at https://gitter.im/JetBrains/kotlin-spark-api](https://badges.gitter.im/JetBrains/kotlin-spark-api.svg)](https://gitter.im/JetBrains/kotlin-spark-api?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
+# Kotlin for Apache® Spark™ [![Maven Central](https://img.shields.io/maven-central/v/org.jetbrains.kotlinx.spark/kotlin-spark-api-parent-3.2.svg?label=Maven%20Central)](https://search.maven.org/search?q=g:org.jetbrains.kotlinx.spark%20AND%20v:1.1.0) [![official JetBrains project](http://jb.gg/badges/official.svg)](https://confluence.jetbrains.com/display/ALL/JetBrains+on+GitHub) [![Join the chat at https://gitter.im/JetBrains/kotlin-spark-api](https://badges.gitter.im/JetBrains/kotlin-spark-api.svg)](https://gitter.im/JetBrains/kotlin-spark-api?utm_source=badge&utm_medium=badge&utm_campaign=pr-badge&utm_content=badge)
 
 
 Your next API to work with [Apache Spark](https://spark.apache.org/).
@@ -31,20 +31,21 @@ We have opened a Spark Project Improvement Proposal: [Kotlin support for Apache
 
 ## Supported versions of Apache Spark
 
-| Apache Spark | Scala | Kotlin for Apache Spark         |
+| Apache Spark | Scala |     Kotlin for Apache Spark     |
 |:------------:|:-----:|:-------------------------------:|
-|    3.0.0+    | 2.12  | kotlin-spark-api-3.0:1.0.2      |
-|    2.4.1+    | 2.12  | kotlin-spark-api-2.4_2.12:1.0.2 |
-|    2.4.1+    | 2.11  | kotlin-spark-api-2.4_2.11:1.0.2 |
-|    3.2.0+    | 2.12  | kotlin-spark-api-3.2:1.0.3      |
+|    3.2.1+    | 2.12  | kotlin-spark-api-3.2:1.1.0      |
+|    3.1.3+    | 2.12  | kotlin-spark-api-3.1:1.1.0      |
+|    3.0.3+    | 2.12  | kotlin-spark-api-3.0:1.1.0      |
+|    2.4.1+    | 2.12  | kotlin-spark-api-2.4_2.12:1.0.2 |
+|    2.4.1+    | 2.11  | kotlin-spark-api-2.4_2.11:1.0.2 |
 
 ## Releases
 
 The list of Kotlin for Apache Spark releases is available [here](https://github.com/JetBrains/kotlin-spark-api/releases/).
 The Kotlin for Spark artifacts adhere to the following convention:
 `[Apache Spark version]_[Scala core version]:[Kotlin for Apache Spark API version]`
 
-[![Maven Central](https://img.shields.io/maven-central/v/org.jetbrains.kotlinx.spark/kotlin-spark-api-parent.svg?label=Maven%20Central)](https://search.maven.org/search?q=g:"org.jetbrains.kotlinx.spark"%20AND%20a:"kotlin-spark-api-3.0")
+[![Maven Central](https://img.shields.io/maven-central/v/org.jetbrains.kotlinx.spark/kotlin-spark-api-parent-3.2.svg?label=Maven%20Central)](https://search.maven.org/search?q=g:"org.jetbrains.kotlinx.spark"%20AND%20a:"kotlin-spark-api-3.2")
 
 ## How to configure Kotlin for Apache Spark in your project
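Worked example of the release convention in the hunk above: `kotlin-spark-api-2.4_2.12:1.0.2` decodes as Spark 2.4, Scala core 2.12, Kotlin API version 1.0.2. The Spark 3.x artifacts in the new table drop the Scala suffix, since the table lists them for Scala 2.12 only.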

@@ -55,7 +56,7 @@ Here's an example `pom.xml`:
 ```xml
 <dependency>
   <groupId>org.jetbrains.kotlinx.spark</groupId>
-  <artifactId>kotlin-spark-api-3.0</artifactId>
+  <artifactId>kotlin-spark-api-3.1</artifactId>
   <version>${kotlin-spark-api.version}</version>
 </dependency>
 <dependency>
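For Gradle users, the Maven coordinates above translate mechanically. A minimal Gradle Kotlin DSL sketch (illustrative only; not part of this commit):

```kotlin
dependencies {
    // same group/artifact/version as the pom.xml snippet above
    implementation("org.jetbrains.kotlinx.spark:kotlin-spark-api-3.1:1.1.0")
}
```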
@@ -79,25 +80,28 @@ The Kotlin Spark API also supports Kotlin Jupyter notebooks.
 To use it, simply add
 
 ```jupyterpython
-%use kotlin-spark-api
+%use spark
 ```
 to the top of your notebook. This will get the latest version of the API, together with the latest version of Spark.
 To define a certain version of Spark or the API itself, simply add it like this:
 ```jupyterpython
-%use kotlin-spark-api(spark=3.2, version=1.0.4)
+%use spark(spark=3.2, v=1.1.0)
 ```
 
 Inside the notebook a Spark session will be initiated automatically. This can be accessed via the `spark` value.
 `sc: JavaSparkContext` can also be accessed directly. The API operates pretty similarly.
 
 There is also support for HTML rendering of Datasets and simple (Java)RDDs.
+Check out the [example](examples/src/main/kotlin/org/jetbrains/kotlinx/spark/examples/JupyterExample.ipynb) as well.
+
 
 To use Spark Streaming abilities, instead use
 ```jupyterpython
-%use kotlin-spark-api-streaming
+%use spark-streaming
 ```
 This does not start a Spark session right away, meaning you can call `withSparkStreaming(batchDuration) {}`
 in whichever cell you want.
+Check out the [example](examples/src/main/kotlin/org/jetbrains/kotlinx/spark/examples/streaming/JupyterStreamingExample.ipynb).
 
 ## Kotlin for Apache Spark features
 
@@ -115,14 +119,19 @@ This is not needed when running the Kotlin Spark API from a Jupyter notebook.
 ```kotlin
 spark.dsOf("a" to 1, "b" to 2)
 ```
-The example above produces `Dataset<Pair<String, Int>>`. While Kotlin Pairs and Triples are supported, Scala Tuples are reccomended for better support.
+The example above produces `Dataset<Pair<String, Int>>`. While Kotlin Pairs and Triples are supported, Scala Tuples are
+recommended for better support.
 
 ### Null safety
 There are several aliases in the API, like `leftJoin`, `rightJoin` etc. These are null-safe by design.
 For example, `leftJoin` is aware of nullability and returns `Dataset<Pair<LEFT, RIGHT?>>`.
 Note that we are forcing `RIGHT` to be nullable for you as a developer to be able to handle this situation.
-`NullPointerException`s are hard to debug in Spark, and we doing our best to make them as rare as possible.
+`NullPointerException`s are hard to debug in Spark, and we're doing our best to make them as rare as possible.
+
+In Spark, you might also come across Scala-native `Option<*>` or Java-compatible `Optional<*>` classes.
+We provide `getOrNull()` and `getOrElse()` functions for these to use Kotlin's null safety for good.
 
+Similarly, you can also create `Option<*>`s and `Optional<*>`s like `T?.toOptional()` if a Spark function requires it.
 ### withSpark function
 
 We provide you with useful function `withSpark`, which accepts everything that may be needed to run Spark — properties, name, master location and so on. It also accepts a block of code to execute inside Spark context.
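A minimal sketch of the Option/Optional helpers added in the hunk above, assuming they behave as the README text describes (the sample values are illustrative):

```kotlin
import org.apache.spark.api.java.Optional
import scala.Option

// getOrNull() bridges Spark's Option/Optional into Kotlin null safety;
// toOptional() goes the other way when a Spark function expects an Optional.
val fromScala: Option<Int> = Option.apply(1)
val fromJava: Optional<Int> = Optional.of(2)

val a: Int? = fromScala.getOrNull()    // 1, or null if the Option is empty
val b: Int? = fromJava.getOrNull()     // 2, or null if the Optional is empty
val back: Optional<Int> = a.toOptional()
```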
@@ -134,8 +143,8 @@ Do not use this when running the Kotlin Spark API from a Jupyter notebook.
 ```kotlin
 withSpark {
     dsOf(1, 2)
-    .map { it X it } // creates Tuple2<Int, Int>
-    .show()
+        .map { it X it } // creates Tuple2<Int, Int>
+        .show()
 }
 ```
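The hunk above only re-indents the example. For the fuller `withSpark` form the README text mentions (properties, name, master location), here is a hedged sketch — the parameter names are an assumption for illustration, not taken from this diff:

```kotlin
withSpark(
    appName = "kotlin-spark-demo",               // assumed parameter name
    master = "local[2]",                         // assumed parameter name
    props = mapOf("spark.ui.enabled" to false),  // assumed parameter name
) {
    dsOf(1, 2)
        .map { it X it }
        .show()
}
```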

@@ -152,14 +161,14 @@ To solve these problems we've added `withCached` function
 ```kotlin
 withSpark {
     dsOf(1, 2, 3, 4, 5)
-    .map { tupleOf(it, it + 2) }
-    .withCached {
-        showDS()
-
-        filter { it._1 % 2 == 0 }.showDS()
-    }
-    .map { tupleOf(it._1, it._2, (it._1 + it._2) * 2) }
-    .show()
+        .map { tupleOf(it, it + 2) }
+        .withCached {
+            showDS()
+
+            filter { it._1 % 2 == 0 }.showDS()
+        }
+        .map { tupleOf(it._1, it._2, (it._1 + it._2) * 2) }
+        .show()
 }
 ```
 
@@ -185,49 +194,7 @@ dataset.where( col("colA") `===` 6 )
 dataset.where( col("colA") eq 6)
 ```
 
-In short, all supported operators are:
-
-- `==`,
-- `!=`,
-- `eq` / `` `===` ``,
-- `neq` / `` `=!=` ``,
-- `-col(...)`,
-- `!col(...)`,
-- `gt`,
-- `lt`,
-- `geq`,
-- `leq`,
-- `or`,
-- `and` / `` `&&` ``,
-- `+`,
-- `-`,
-- `*`,
-- `/`,
-- `%`
-
-Secondly, there are some quality of life additions as well:
-
-In Kotlin, Ranges are often
-used to solve inclusive/exclusive situations for a range. So, you can now do:
-```kotlin
-dataset.where( col("colA") inRangeOf 0..2 )
-```
-
-Also, for columns containing map- or array like types:
-
-```kotlin
-dataset.where( col("colB")[0] geq 5 )
-```
-
-Finally, thanks to Kotlin reflection, we can provide a type- and refactor safe way
-to create `TypedColumn`s and with those a new Dataset from pieces of another using the `selectTyped()` function, added to the API:
-```kotlin
-val dataset: Dataset<YourClass> = ...
-val newDataset: Dataset<Pair<TypeA, TypeB>> = dataset.selectTyped(col(YourClass::colA), col(YourClass::colB))
-
-// Alternatively, for instance when working with a Dataset<Row>
-val typedDataset: Dataset<Pair<String, Int>> = otherDataset.selectTyped(col("a").`as`<String>(), col("b").`as`<Int>())
-```
+To read more, check the [wiki](https://github.com/JetBrains/kotlin-spark-api/wiki/Column-functions).
 
 ### Overload resolution ambiguity
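Editor's note on the column functions this commit moves to the wiki: a one-line sketch combining a few of the operators documented in the removed text above (`inRangeOf`, indexed column access, `geq`, `and`); `dataset` and the column names are illustrative:

```kotlin
dataset.where( (col("colA") inRangeOf 0..2) and (col("colB")[0] geq 5) )
```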

@@ -253,49 +220,7 @@ val a: Tuple2<Int, Long> = tupleOf(1, 2L)
 val b: Tuple3<String, Double, Int> = t("test", 1.0, 2)
 val c: Tuple3<Float, String, Int> = 5f X "aaa" X 1
 ```
-Tuples can be expanded and merged like this:
-```kotlin
-// expand
-tupleOf(1, 2).appendedBy(3) == tupleOf(1, 2, 3)
-tupleOf(1, 2) + 3 == tupleOf(1, 2, 3)
-tupleOf(2, 3).prependedBy(1) == tupleOf(1, 2, 3)
-1 + tupleOf(2, 3) == tupleOf(1, 2, 3)
-
-// merge
-tupleOf(1, 2) concat tupleOf(3, 4) == tupleOf(1, 2, 3, 4)
-tupleOf(1, 2) + tupleOf(3, 4) == tupleOf(1, 2, 3, 4)
-
-// extend tuple instead of merging with it
-tupleOf(1, 2).appendedBy(tupleOf(3, 4)) == tupleOf(1, 2, tupleOf(3, 4))
-tupleOf(1, 2) + tupleOf(tupleOf(3, 4)) == tupleOf(1, 2, tupleOf(3, 4))
-```
-
-The concept of `EmptyTuple` from Scala 3 is also already present:
-```kotlin
-tupleOf(1).dropLast() == tupleOf() == emptyTuple()
-```
-
-Finally, all these tuple helper functions are also baked in:
-
-- `componentX()` for destructuring: `val (a, b) = tuple`
-- `dropLast() / dropFirst()`
-- `contains(x)` for `if (x in tuple) { ... }`
-- `iterator()` for `for (x in tuple) { ... }`
-- `asIterable()`
-- `size`
-- `get(n) / get(i..j)` for `tuple[1] / tuple[i..j]`
-- `getOrNull(n) / getOrNull(i..j)`
-- `getAs<T>(n) / getAs<T>(i..j)`
-- `getAsOrNull<T>(n) / getAsOrNull<T>(i..j)`
-- `copy(_1 = ..., _5 = ...)`
-- `first() / last()`
-- `_1`, `_6` etc. (instead of `_1()`, `_6()`)
-- `zip`
-- `dropN() / dropLastN()`
-- `takeN() / takeLastN()`
-- `splitAtN()`
-- `map`
-- `cast`
+To read more about tuples and all the added functions, refer to the [wiki](https://github.com/JetBrains/kotlin-spark-api/wiki/Tuples).
 
 ### Streaming
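Likewise, the tuple helpers now live in the wiki; a short sketch built only from calls shown in the removed text above:

```kotlin
val abc = tupleOf(1, 2).appendedBy(3)            // Tuple3<Int, Int, Int>(1, 2, 3)
val (a, b, c) = abc                              // componentX() destructuring
val merged = tupleOf(1, 2) concat tupleOf(3, 4)  // Tuple4(1, 2, 3, 4)
val empty = tupleOf(1).dropLast()                // == emptyTuple()
```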

@@ -338,6 +263,7 @@ withSparkStreaming(batchDuration = Durations.seconds(1), timeout = 10_000) { //
 }
 ```
 
+For more information, check the [wiki](https://github.com/JetBrains/kotlin-spark-api/wiki/Streaming).
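A hedged sketch around the streaming entry point whose signature appears in the hunk header above; the socket source and the `ssc` name are assumptions for illustration, not taken from this commit:

```kotlin
import org.apache.spark.streaming.Durations

withSparkStreaming(batchDuration = Durations.seconds(1), timeout = 10_000) {
    // assumption: the block exposes a JavaStreamingContext named `ssc`
    ssc.socketTextStream("localhost", 9999).print()
}
```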
 
 ## Examples

core/3.0/pom_2.12.xml

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@
 <artifactId>core-3.0_2.12</artifactId>
 <parent>
     <groupId>org.jetbrains.kotlinx.spark</groupId>
-    <artifactId>kotlin-spark-api-parent_2.12</artifactId>
+    <artifactId>kotlin-spark-api-parent-3.1_2.12</artifactId>
     <version>1.0.4-SNAPSHOT</version>
     <relativePath>../../pom_2.12.xml</relativePath>
 </parent>

dummy/pom.xml

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 <?xml version="1.0" encoding="UTF-8"?>
 <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
     <parent>
-        <artifactId>kotlin-spark-api-parent</artifactId>
+        <artifactId>kotlin-spark-api-parent-3.1</artifactId>
         <groupId>org.jetbrains.kotlinx.spark</groupId>
         <version>1.0.4-SNAPSHOT</version>
     </parent>

examples/pom-3.0_2.12.xml

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@
 <artifactId>examples-3.0_2.12</artifactId>
 <parent>
     <groupId>org.jetbrains.kotlinx.spark</groupId>
-    <artifactId>kotlin-spark-api-parent_2.12</artifactId>
+    <artifactId>kotlin-spark-api-parent-3.1_2.12</artifactId>
     <version>1.0.4-SNAPSHOT</version>
     <relativePath>../pom_2.12.xml</relativePath>
 </parent>
