# Kotlin for Apache® Spark™

Your next API to work with [Apache Spark](https://spark.apache.org/).

This project adds a missing layer of compatibility between [Kotlin](https://kotlinlang.org/) and [Apache Spark](https://spark.apache.org/). It allows Kotlin developers to use familiar language features such as data classes and lambda expressions, written as simple expressions in curly braces or as method references.

We have opened a Spark Project Improvement Proposal: [Kotlin support for Apache Spark](http://issues.apache.org/jira/browse/SPARK-32530#) to work with the community towards getting Kotlin support as a first-class citizen in Apache Spark. We encourage you to voice your opinions and participate in the discussion.

## Table of Contents

- [How to configure Kotlin for Apache Spark in your project](#how-to-configure-kotlin-for-apache-spark-in-your-project)
- [Kotlin for Apache Spark features](#kotlin-for-apache-spark-features)
- [Examples](#examples)
- [Reporting issues/Support](#reporting-issuessupport)
- [Code of Conduct](#code-of-conduct)
- [License](#license)

## How to configure Kotlin for Apache Spark in your project

You can add Kotlin for Apache Spark as a dependency to your project: `Maven`, `Gradle`, `SBT`, and `Leiningen` are supported.

Here's an example `pom.xml`:

```xml
<dependency>
    <groupId>org.jetbrains.kotlinx.spark</groupId>
    <artifactId>kotlin-spark-api-3.0.0</artifactId>
    <version>${kotlin-spark-api.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.12</artifactId>
    <version>${spark.version}</version>
</dependency>
```

Note that `core` is being compiled against Scala version `2.12`.
You can find a complete example with `pom.xml` and `build.gradle` in the [Quick Start Guide](https://github.com/JetBrains/kotlin-spark-api/wiki/Quick-Start-Guide).
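
For Gradle, the same coordinates can be declared directly. This is a minimal sketch using the Kotlin DSL; the versions shown are placeholders (taken from the badge above and the Spark artifact name), so check the Quick Start Guide for the exact setup:

```kotlin
// build.gradle.kts — a minimal sketch; versions are placeholders.
dependencies {
    implementation("org.jetbrains.kotlinx.spark:kotlin-spark-api-3.0.0:1.0.1")
    implementation("org.apache.spark:spark-sql_2.12:3.0.0")
}
```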

Once you have configured the dependency, you only need to add the following import to your Kotlin file:

```kotlin
import org.jetbrains.kotlinx.spark.api.*
```

## Kotlin for Apache Spark features

### Creating a SparkSession in Kotlin

```kotlin
val spark = SparkSession
    .builder()
    .master("local[2]")
    .appName("Simple Application").orCreate
```

### Creating a Dataset in Kotlin

```kotlin
spark.toDS("a" to 1, "b" to 2)
```

The example above produces `Dataset<Pair<String, Int>>`.
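
Since data classes are among the supported language features, `toDS` can be used the same way with your own types. A minimal sketch, where the `Person` data class is a hypothetical example:

```kotlin
// A minimal sketch; `Person` is a hypothetical data class.
data class Person(val name: String, val age: Int)

val people = spark.toDS(Person("Ann", 23), Person("Bob", 42))  // Dataset<Person>
```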

### Null safety

There are several aliases in the API, like `leftJoin`, `rightJoin`, etc. These are null-safe by design. For example, `leftJoin` is aware of nullability and returns `Dataset<Pair<LEFT, RIGHT?>>`. Note that we are forcing `RIGHT` to be nullable so that you, as a developer, can handle this situation.
`NullPointerException`s are hard to debug in Spark, and we're doing our best to make them as rare as possible.
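
For illustration, here is a minimal sketch of a null-safe `leftJoin`, assuming it takes the right-hand `Dataset` and a join `Column`; the `Employee` and `Department` data classes are hypothetical:

```kotlin
// A sketch; Employee and Department are hypothetical, and we assume
// leftJoin(right, col) returns Dataset<Pair<LEFT, RIGHT?>>.
data class Employee(val name: String, val deptId: Int)
data class Department(val deptId: Int, val title: String)

withSpark {
    val employees = dsOf(Employee("Ann", 1), Employee("Bob", 2))
    val departments = dsOf(Department(1, "Engineering"))

    employees.leftJoin(departments, employees.col("deptId") eq departments.col("deptId"))
        // department is typed as Department?, so the compiler forces a null check
        .map { (employee, department) -> employee.name to (department?.title ?: "unassigned") }
        .show()
}
```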
95
107
96
108
### withSpark function
97
109
98
-
We provide you with useful function `withSpark`, which accepts everything that may be needed to run Spark — properties, name, master location and so on. It also accepts a block of code to execute inside Spark context.
110
+
We provide you with useful function `withSpark`, which accepts everything that may be needed to run Spark — properties,
111
+
name, master location and so on. It also accepts a block of code to execute inside Spark context.
99
112
100
113
After work block ends, `spark.stop()` is called automatically.

```kotlin
withSpark {
    dsOf(1, 2)
        .map { it to it }
        .show()
}
```

`dsOf` is just one more way to create a `Dataset` (`Dataset<Int>`) from varargs.
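
The configuration mentioned above can be passed as arguments. A minimal sketch, assuming `withSpark` exposes named parameters such as `props` and `appName` (the exact parameter names are an assumption; see the API for the real signature):

```kotlin
// A sketch; the parameter names `props` and `appName` are assumptions.
withSpark(props = mapOf("spark.sql.shuffle.partitions" to 4), appName = "Simple Application") {
    dsOf(1, 2, 3)
        .map { it * it }
        .show()
}
```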

### `withCached` function

It can easily happen that we need to fork our computation into several paths. To compute things only once, we should call the `cache` method. However, it becomes difficult to control when we're using the cached `Dataset` and when we're not. It is also easy to forget to unpersist cached data, which can break things unexpectedly or take up more memory than intended.

To solve these problems we've added the `withCached` function:
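
A minimal sketch of the resulting pattern, assuming `withCached` caches the receiver `Dataset`, runs the given block on it, returns the block's result, and unpersists the cache afterwards:

```kotlin
withSpark {
    dsOf(1, 2, 3, 4, 5)
        .map { it to (it + 2) }
        .withCached {
            show()                         // inspect the cached Dataset while debugging
            filter { it.first % 2 == 0 }   // the block's result: a filtered Dataset
        }
        // by this point the cached Dataset is assumed to be unpersisted again
        .map { it.first + it.second }
        .toList()
}
```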

Here we're showing the cached `Dataset` for debugging purposes, then filtering it. The `filter` method returns the filtered `Dataset`, and then the cached `Dataset` is unpersisted, so we have more memory to call the `map` method and collect the resulting `Dataset`.

### `toList` and `toArray` methods

For more idiomatic Kotlin code, we've added `toList` and `toArray` methods to this API. You can still use the `collect` method as in the Scala API; however, the result should be cast to `Array`. This is because `collect` returns a Scala array, which is not the same as a Java/Kotlin one.
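
For example, a minimal sketch:

```kotlin
withSpark {
    // toList returns a Kotlin List directly; no cast from a Scala array needed
    val squares: List<Int> = dsOf(1, 2, 3).map { it * it }.toList()
    println(squares)
}
```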

### Column infix/operator functions

Similar to the Scala API for `Columns`, many of the operator functions could be ported over. For example:

```kotlin
dataset.select(col("colA") + 5)
dataset.select(col("colA") / col("colB"))

dataset.where(col("colA") `===` 6)
// or alternatively
dataset.where(col("colA") eq 6)
```

In short, all supported operators are:

- `==`,
- `!=`,
- `eq` / `` `===` ``,
- `neq` / `` `=!=` ``,
- `-col(...)`,
- `!col(...)`,
- `gt`,
- `lt`,
- `geq`,
- `leq`,
- `or`,
- `and` / `` `&&` ``,
- `+`,
- `-`,
- `*`,
- `/`,
- `%`

Secondly, there are some quality-of-life additions as well:

In Kotlin, Ranges are often used to solve inclusive/exclusive situations for a range. So, you can now do:

```kotlin
dataset.where(col("colA") inRangeOf 0..2)
```

Also, for columns containing map- or array-like types:

```kotlin
dataset.where(col("colB")[0] geq 5)
```

Finally, thanks to Kotlin reflection, we can provide a type- and refactor-safe way to create `TypedColumn`s and, with those, a new `Dataset` from pieces of another using the `selectTyped()` function added to the API:

```kotlin
val dataset: Dataset<YourClass> = ...
val newDataset: Dataset<Pair<TypeA, TypeB>> = dataset.selectTyped(col(YourClass::colA), col(YourClass::colB))
```
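
Concretely, a minimal sketch with a hypothetical data class:

```kotlin
// Person is a hypothetical data class; the property references are used
// to derive both the column names and the resulting Pair's types.
data class Person(val name: String, val age: Int)

withSpark {
    val people = dsOf(Person("Ann", 23), Person("Bob", 42))
    val namesAndAges: Dataset<Pair<String, Int>> =
        people.selectTyped(col(Person::name), col(Person::age))
    namesAndAges.show()
}
```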

### `reduceGroups`

We had to implement the `reduceGroups` operator for Kotlin separately as the `reduceGroupsK` function, because otherwise it caused resolution ambiguity between the Kotlin, Scala, and Java APIs, which was quite hard to solve.
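
A minimal sketch, assuming `reduceGroupsK` behaves like Spark's `reduceGroups` on a grouped `Dataset` and yields key/value `Pair`s:

```kotlin
withSpark {
    dsOf("apple", "avocado", "banana")
        .groupByKey { it.take(1) }  // group words by their first letter
        .reduceGroupsK { a, b -> if (a.length >= b.length) a else b }  // keep the longest word per group
        .show()
}
```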

You can find a dedicated example of working with this function in the [Groups example](https://github.com/JetBrains/kotlin-spark-api/blob/main/examples/src/main/kotlin/org/jetbrains/kotlinx/spark/examples/Group.kt).

## Examples

For more, check out the [examples](https://github.com/JetBrains/kotlin-spark-api/tree/master/examples/src/main/kotlin/org/jetbrains/kotlinx/spark/examples) module. To get up and running quickly, check out this [tutorial](https://github.com/JetBrains/kotlin-spark-api/wiki/Quick-Start-Guide).

## Reporting issues/Support

Please use [GitHub issues](https://github.com/JetBrains/kotlin-spark-api/issues) for filing feature requests and bug reports. You are also welcome to join the [kotlin-spark channel](https://kotlinlang.slack.com/archives/C015B9ZRGJF) in the Kotlin Slack.

## Code of Conduct

This project and the corresponding community are governed by the [JetBrains Open Source and Community Code of Conduct](https://confluence.jetbrains.com/display/ALL/JetBrains+Open+Source+and+Community+Code+of+Conduct). Please make sure you read it.

## License

Kotlin for Apache Spark is licensed under the [Apache 2.0 License](LICENSE).