# Kotlin for Apache® Spark™ [](https://search.maven.org/search?q=g:org.jetbrains.kotlinx.spark%20AND%20v:1.0.1)[](https://confluence.jetbrains.com/display/ALL/JetBrains+on+GitHub)
Your next API to work with [Apache Spark](https://spark.apache.org/).

This project adds a missing layer of compatibility between [Kotlin](https://kotlinlang.org/) and [Apache Spark](https://spark.apache.org/).
It allows Kotlin developers to use familiar language features such as data classes and lambda expressions as simple expressions in curly braces or method references.

We have opened a Spark Project Improvement Proposal: [Kotlin support for Apache Spark](http://issues.apache.org/jira/browse/SPARK-32530#) to work with the community towards getting Kotlin support as a first-class citizen in Apache Spark. We encourage you to voice your opinions and participate in the discussion.
## Table of Contents
- [withCached function](#withcached-function)
- [toList and toArray](#tolist-and-toarray-methods)
## How to configure Kotlin for Apache Spark in your project
You can add Kotlin for Apache Spark as a dependency to your project: `Maven`, `Gradle`, `SBT`, and `Leiningen` are supported.

Here's an example `pom.xml`:
```xml
<dependency>
  <groupId>org.jetbrains.kotlinx.spark</groupId>
  <artifactId>kotlin-spark-api-3.0.0</artifactId>
  <version>${kotlin-spark-api.version}</version>
</dependency>
<dependency>
  <groupId>org.apache.spark</groupId>
  <artifactId>spark-sql_2.12</artifactId>
  <version>${spark.version}</version>
</dependency>
```
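For Gradle, the equivalent might look like the following Kotlin DSL sketch; the version numbers here are illustrative placeholders, so substitute the ones you actually use:

```kotlin
// build.gradle.kts — hypothetical versions, adjust to your setup
dependencies {
    implementation("org.jetbrains.kotlinx.spark:kotlin-spark-api-3.0.0:1.0.1")
    implementation("org.apache.spark:spark-sql_2.12:3.0.0")
}
```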
Note that `core` is compiled against Scala version `2.12`.
You can find a complete example with `pom.xml` and `build.gradle` in the [Quick Start Guide](https://github.com/JetBrains/kotlin-spark-api/wiki/Quick-Start-Guide).

Once you have configured the dependency, you only need to add the following import to your Kotlin file:
```kotlin
import org.jetbrains.kotlinx.spark.api.*
```
## Kotlin for Apache Spark features
### Creating a SparkSession in Kotlin
```kotlin
val spark = SparkSession
    .builder()
    .master("local[2]")
    .appName("Simple Application").orCreate
```
### Creating a Dataset in Kotlin
```kotlin
spark.toDS("a" to 1, "b" to 2)
```
The example above produces `Dataset<Pair<String, Int>>`.
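
Data classes work the same way; here is a minimal sketch (the `Person` class is illustrative, not part of the API):

```kotlin
// encoders for data classes are derived via reflection
data class Person(val name: String, val age: Int)

val persons = spark.toDS(Person("Alice", 30), Person("Bob", 25))
```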
### Null safety
There are several aliases in the API, like `leftJoin`, `rightJoin`, etc. These are null-safe by design.
For example, `leftJoin` is aware of nullability and returns `Dataset<Pair<LEFT, RIGHT?>>`.
Note that `RIGHT` is forced to be nullable so that you, as a developer, can handle this situation.
`NullPointerException`s are hard to debug in Spark, and we are doing our best to make them as rare as possible.
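
As a sketch of what this looks like in practice (the datasets are illustrative, and the join condition uses the `eq` alias described below; treat the exact `leftJoin` signature as an assumption):

```kotlin
withSpark {
    val left = dsOf(1 to "a", 2 to "b")
    val right = dsOf(1 to "x")

    // joined is a Dataset<Pair<Pair<Int, String>, Pair<Int, String>?>>;
    // rows without a match carry null on the right-hand side
    val joined = left.leftJoin(right, left.col("first") eq right.col("first"))
    joined.show()
}
```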
### withSpark function
We provide you with the useful function `withSpark`, which accepts everything that may be needed to run Spark: properties, name, master location, and so on. It also accepts a block of code to execute inside the Spark context.

After the work block ends, `spark.stop()` is called automatically.
```kotlin
withSpark {
    dsOf(1, 2)
        .map { it to it }
        .show()
}
```
`dsOf` is just one more way to create `Dataset` (`Dataset<Int>`) from varargs.
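
`withSpark` can also be called with explicit arguments. The following is a sketch; the parameter names (`master`, `appName`, `props`) are assumptions based on the description above, not a definitive signature:

```kotlin
// hypothetical parameter names — check the API for the exact signature
withSpark(
    master = "local[2]",
    appName = "Simple Application",
    props = mapOf("spark.sql.shuffle.partitions" to "4")
) {
    dsOf(1, 2, 3).show()
}
```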
### withCached function

It can easily happen that we need to fork our computation into several paths. To compute things only once, we should call the `cache` method. However, it becomes difficult to control when we're using the cached `Dataset` and when not.
It is also easy to forget to unpersist cached data, which can break things unexpectedly or take up more memory than intended.

To solve these problems, we've added the `withCached` function.
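
For example, here is a minimal sketch of the pattern described below; the dataset, the `showDS()` debugging call, and the chained operations are illustrative rather than the canonical example:

```kotlin
withSpark {
    dsOf(1, 2, 3, 4, 5)
        .map { it to (it + 2) }
        .withCached {
            showDS()                        // inspect the cached Dataset for debugging
            filter { it.first % 2 == 0 }    // filtering runs against the cached data
        }
        // by this point the cached Dataset has been unpersisted again
        .map { it.second * 2 }
        .toList()
}
```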
Here we're showing the cached `Dataset` for debugging purposes, then filtering it.
The `filter` method returns the filtered `Dataset`, and then the cached `Dataset` is unpersisted, so we have more memory to call the `map` method and collect the resulting `Dataset`.
### toList and toArray methods

For more idiomatic Kotlin code we've added `toList` and `toArray` methods to this API. You can still use the `collect` method as in the Scala API, but the result then has to be cast to `Array`.
This is because `collect` returns a Scala array, which is not the same as a Java/Kotlin one.
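
A quick sketch of both helpers (the element type and values are illustrative, and the exact signatures are assumptions):

```kotlin
withSpark {
    // plain Kotlin collections come back, no casting required
    val list: List<Int> = dsOf(1, 2, 3).toList()
    val array: Array<Int> = dsOf(1, 2, 3).toArray()
}
```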
### Column infix/operator functions
Similar to the Scala API for `Columns`, many of the operator functions could be ported over.
For example:
```kotlin
dataset.select(col("colA") + 5)
dataset.select(col("colA") / col("colB"))

dataset.where(col("colA") `===` 6)
// or alternatively
dataset.where(col("colA") eq 6)
```
In short, all supported operators are:

- `==`,
- `!=`,
- `eq` / `` `===` ``,
- `neq` / `` `=!=` ``,
- `-col(...)`,
- `!col(...)`,
- `gt`,
- `lt`,
- `geq`,
- `leq`,
- `or`,
- `and` / `` `&&` ``,

Secondly, there are some quality-of-life additions as well:

In Kotlin, ranges are often used to express inclusive/exclusive bounds. So, you can now do:
```kotlin
dataset.where(col("colA") inRangeOf 0..2)
```
Also, for columns containing map- or array-like types:
```kotlin
dataset.where(col("colB")[0] geq 5)
```
Finally, thanks to Kotlin reflection, we can provide a type- and refactor-safe way to create `TypedColumn`s, and with those, a new `Dataset` from pieces of another using the `selectTyped()` function added to the API:
```kotlin
val dataset: Dataset<YourClass> = ...
val newDataset: Dataset<Pair<TypeA, TypeB>> = dataset.selectTyped(col(YourClass::colA), col(YourClass::colB))
```
### Overload resolution ambiguity

We had to implement the functions `reduceGroups` and `reduce` for Kotlin separately as `reduceGroupsK` and `reduceK` respectively, because otherwise they caused resolution ambiguity between the Kotlin, Scala, and Java APIs, which was quite hard to solve.
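
A minimal sketch of the Kotlin-specific name in use (the grouping key and the reduction are illustrative):

```kotlin
withSpark {
    dsOf(1, 2, 3, 4, 5)
        .groupByKey { it % 2 }              // group the numbers by parity
        .reduceGroupsK { a, b -> a + b }    // the K suffix avoids the overload clash
        .show()
}
```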
We have a special example of working with this function in the [Groups example](https://github.com/JetBrains/kotlin-spark-api/edit/main/examples/src/main/kotlin/org/jetbrains/kotlinx/spark/examples/Group.kt).
## Examples
For more, check out the [examples](https://github.com/JetBrains/kotlin-spark-api/tree/master/examples/src/main/kotlin/org/jetbrains/kotlinx/spark/examples) module.
To get up and running quickly, check out this [tutorial](https://github.com/JetBrains/kotlin-spark-api/wiki/Quick-Start-Guide).
## Reporting issues/Support
Please use [GitHub issues](https://github.com/JetBrains/kotlin-spark-api/issues) for filing feature requests and bug reports.
You are also welcome to join the [kotlin-spark channel](https://kotlinlang.slack.com/archives/C015B9ZRGJF) in the Kotlin Slack.
## Code of Conduct
This project and the corresponding community are governed by the [JetBrains Open Source and Community Code of Conduct](https://confluence.jetbrains.com/display/ALL/JetBrains+Open+Source+and+Community+Code+of+Conduct). Please make sure you read it.