
Commit d1b34fa

(DOCSP-20565) v10 partitioners (#85)

1 parent 2b0e9eb commit d1b34fa

4 files changed, +96 -142 lines

source/configuration/read.txt

Lines changed: 76 additions & 124 deletions
@@ -12,15 +12,15 @@ Read Configuration Options
 
 .. _spark-input-conf:
 
-Input Configuration
---------------------
+Read Configuration
+------------------
 
 You can configure the following properties to read from MongoDB:
 
 .. note::
 
-   If you use ``SparkConf`` to set the connector's input configurations,
-   prefix ``spark.mongodb.input.`` to each property.
+   If you use ``SparkConf`` to set the connector's read configurations,
+   prefix ``spark.mongodb.read.`` to each property.
 
 .. list-table::
    :header-rows: 1
@@ -36,7 +36,7 @@ You can configure the following properties to read from MongoDB:
       address, or UNIX domain socket. If the connection string doesn't
       specify a ``port``, it uses the default MongoDB port, ``27017``.
 
-      You can append the other remaining input options to the ``uri``
+      You can append the other remaining read options to the ``uri``
       setting. See :ref:`configure-input-uri`.
 
    * - ``database``
@@ -80,58 +80,45 @@ You can configure the following properties to read from MongoDB:
 
       The connector provides the following partitioners:
 
-      - ``MongoDefaultPartitioner``
-        **Default**. Wraps the MongoSamplePartitioner and provides
-        help for users of older versions of MongoDB.
-
-      - ``MongoSamplePartitioner``
-        **Requires MongoDB 3.2**. A general purpose partitioner for
+      - ``SamplePartitioner``
+        A general purpose partitioner for
         all deployments. Uses the average document size and random
         sampling of the collection to determine suitable
-        partitions for the collection. For configuration
-        settings for the MongoSamplePartitioner, see
-        :ref:`conf-mongosamplepartitioner`.
+        partitions for the collection. To learn more about the
+        ``SamplePartitioner``, see the
+        :ref:`configuration settings description <conf-samplepartitioner>`.
 
-      - ``MongoShardedPartitioner``
+      - ``ShardedPartitioner``
         A partitioner for sharded clusters. Partitions the
         collection based on the data chunks. Requires read access
-        to the ``config`` database. For configuration settings for
-        the MongoShardedPartitioner, see
-        :ref:`conf-mongoshardedpartitioner`.
-
-      - ``MongoSplitVectorPartitioner``
-        A partitioner for standalone or replica sets. Uses the
-        :dbcommand:`splitVector` command on the standalone or the
-        primary to determine the partitions of the database.
-        Requires privileges to run :dbcommand:`splitVector`
-        command. For configuration settings for the
-        MongoSplitVectorPartitioner, see
-        :ref:`conf-mongosplitvectorpartitioner`.
-
-      - ``MongoPaginateByCountPartitioner``
-        A slow, general purpose partitioner for all deployments.
-        Creates a specific number of partitions. Requires a query
-        for every partition. For configuration settings for the
-        MongoPaginateByCountPartitioner, see
-        :ref:`conf-mongopaginatebycountpartitioner`.
+        to the ``config`` database. To learn more about the
+        ``ShardedPartitioner``, see the
+        :ref:`configuration settings description <conf-shardedpartitioner>`.
 
-      - ``MongoPaginateBySizePartitioner``
+      - ``PaginateBySizePartitioner``
         A slow, general purpose partitioner for all deployments.
         Creates partitions based on data size. Requires a query
-        for every partition. For configuration settings for the
-        MongoPaginateBySizePartitioner, see
-        :ref:`conf-mongopaginatebysizepartitioner`.
+        for every partition. To learn more about the
+        ``PaginateBySizePartitioner``, see the
+        :ref:`configuration settings description <conf-paginatebysizepartitioner>`.
+
+      - ``PaginateIntoPartitionsPartitioner``
+        A partitioner that creates a specified number of
+        partitions for a collection. Uses the total number of
+        documents and the average document size in a collection to
+        calculate the number of documents to include in each
+        partition. To learn more about the
+        ``PaginateIntoPartitionsPartitioner``, see the
+        :ref:`configuration settings description <conf-paginateintopartitionspartitioner>`.
 
       You can also specify a custom partitioner implementation. For
-      custom implementations of the ``MongoPartitioner`` trait, provide
-      the full class name. If you don't provide package names, then
-      this property uses the default package,
-      ``com.mongodb.spark.rdd.partitioner``.
+      custom implementations of the ``Partitioner`` interface, provide
+      the full class name.
 
       To configure options for the various partitioners, see
       :ref:`partitioner-conf`.
 
-      **Default:** ``MongoDefaultPartitioner``
+      **Default:** ``SamplePartitioner``
 
    * - ``partitionerOptions``
      - The custom options to configure the partitioner.
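
For illustration, a minimal PySpark sketch of selecting a partitioner through the read settings above. The URI, database, and collection names are placeholders, and the ``spark.mongodb.read.`` prefix follows the ``SparkConf`` note earlier in this file:

.. code:: python

   from pyspark.sql import SparkSession

   # Placeholder connection values; each read property takes the
   # "spark.mongodb.read." prefix when set through SparkConf.
   spark = (
       SparkSession.builder
       .appName("partitioner-example")
       .config("spark.mongodb.read.uri", "mongodb://127.0.0.1/")
       .config("spark.mongodb.read.database", "test")
       .config("spark.mongodb.read.collection", "myCollection")
       .config("spark.mongodb.read.partitioner", "SamplePartitioner")
       .getOrCreate()
   )

   # "mongodb" is the v10 data source short name.
   df = spark.read.format("mongodb").load()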
@@ -182,9 +169,10 @@ Partitioner Configuration
 ~~~~~~~~~~~~~~~~~~~~~~~~~
 
 .. _conf-mongosamplepartitioner:
+.. _conf-samplepartitioner:
 
-``MongoSamplePartitioner`` Configuration
-````````````````````````````````````````
+``SamplePartitioner`` Configuration
+```````````````````````````````````
 
 .. include:: /includes/sparkconf-partitioner-options-note.rst

@@ -195,19 +183,20 @@ Partitioner Configuration
    * - Property name
      - Description
 
-   * - ``partitionKey``
-     - The field by which to split the collection data. The field
-       should be indexed and contain unique values.
+   * - ``partitioner.options.partitionKey``
+     - A comma-delimited list of fields by which to split the
+       collection data. The fields should be indexed and contain unique
+       values.
 
       **Default:** ``_id``
 
-   * - ``partitionSizeMB``
+   * - ``partitioner.options.partitionSizeMB``
      - The size (in MB) for each partition. Smaller partition sizes
       create more partitions containing fewer documents.
 
      **Default:** ``64``
 
-   * - ``samplesPerPartition``
+   * - ``partitioner.options.samplesPerPartition``
      - The number of sample documents to take for each partition in
       order to establish a ``partitionKey`` range for each partition.
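
For example, a short PySpark sketch that tunes these ``SamplePartitioner`` properties when reading. The connection values are placeholders, and passing the keys to the reader without the ``spark.mongodb.read.`` prefix is an assumption based on the note above:

.. code:: python

   # Smaller partitions and more samples per partition than the
   # defaults; the values here are illustrative only.
   df = (
       spark.read.format("mongodb")
       .option("database", "test")
       .option("collection", "myCollection")
       .option("partitioner", "SamplePartitioner")
       .option("partitioner.options.partitionSizeMB", "32")
       .option("partitioner.options.samplesPerPartition", "20")
       .load()
   )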

@@ -230,41 +219,29 @@ Partitioner Configuration
    .. example::
 
       For a collection with 640 documents with an average document
-      size of 0.5 MB, the default ``MongoSamplePartitioner`` configuration
+      size of 0.5 MB, the default ``SamplePartitioner`` configuration
       values creates 5 partitions with 128 documents per partition.
 
       The MongoDB Spark Connector samples 50 documents (the default 10
      per intended partition) and defines 5 partitions by selecting
      ``partitionKey`` ranges from the sampled documents.
 
 .. _conf-mongoshardedpartitioner:
+.. _conf-shardedpartitioner:
 
-``MongoShardedPartitioner`` Configuration
-`````````````````````````````````````````
-
-.. include:: /includes/sparkconf-partitioner-options-note.rst
-
-.. list-table::
-   :header-rows: 1
-   :widths: 35 65
-
-   * - Property name
-     - Description
-
-   * - ``shardkey``
-     - The field by which to split the collection data. The field
-       should be indexed.
-
-       **Default:** ``_id``
+``ShardedPartitioner`` Configuration
+````````````````````````````````````
 
-.. important::
+The ``ShardedPartitioner`` automatically determines
+partitions to use based on your shard configuration.
 
-   This property is not compatible with hashed shard keys.
+This partitioner is not compatible with hashed shard keys.
 
-.. _conf-mongosplitvectorpartitioner:
+.. _conf-mongopaginatebysizepartitioner:
+.. _conf-paginatebysizepartitioner:
 
-``MongoSplitVectorPartitioner`` Configuration
-`````````````````````````````````````````````
+``PaginateBySizePartitioner`` Configuration
+```````````````````````````````````````````
 
 .. include:: /includes/sparkconf-partitioner-options-note.rst

@@ -275,22 +252,23 @@ Partitioner Configuration
    * - Property name
      - Description
 
-   * - ``partitionKey``
-     - The field by which to split the collection data. The field
-       should be indexed and contain unique values.
+   * - ``partitioner.options.partitionKey``
+     - A comma-delimited list of fields by which to split the
+       collection data. The fields should be indexed and contain unique
+       values.
 
      **Default:** ``_id``
 
-   * - ``partitionSizeMB``
+   * - ``partitioner.options.partitionSizeMB``
      - The size (in MB) for each partition. Smaller partition sizes
       create more partitions containing fewer documents.
 
      **Default:** ``64``
 
-.. _conf-mongopaginatebycountpartitioner:
+.. _conf-paginateintopartitionspartitioner:
 
-``MongoPaginateByCountPartitioner`` Configuration
-`````````````````````````````````````````````````
+``PaginateIntoPartitionsPartitioner`` Configuration
+```````````````````````````````````````````````````
 
 .. include:: /includes/sparkconf-partitioner-options-note.rst

@@ -301,41 +279,15 @@ Partitioner Configuration
    * - Property name
      - Description
 
-   * - ``partitionKey``
-     - The field by which to split the collection data. The field
-       should be indexed and contain unique values.
+   * - ``partitioner.options.partitionKey``
+     - A comma-delimited list of fields by which to split the
+       collection data. The fields should be indexed and contain unique
+       values.
 
      **Default:** ``_id``
 
-   * - ``numberOfPartitions``
-     - The number of partitions to create. A greater number of
-       partitions means fewer documents per partition.
-
-       **Default:** ``64``
-
-.. _conf-mongopaginatebysizepartitioner:
-
-``MongoPaginateBySizePartitioner`` Configuration
-````````````````````````````````````````````````
-
-.. include:: /includes/sparkconf-partitioner-options-note.rst
-
-.. list-table::
-   :header-rows: 1
-   :widths: 35 65
-
-   * - Property name
-     - Description
-
-   * - ``partitionKey``
-     - The field by which to split the collection data. The field
-       should be indexed and contain unique values.
-
-       **Default:** ``_id``
-
-   * - ``partitionSizeMB``
-     - The size (in MB) for each partition. Smaller partition sizes
-       create more partitions containing fewer documents.
+   * - ``partitioner.options.maxNumberOfPartitions``
+     - The number of partitions to create.
 
      **Default:** ``64``
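
For example, a sketch (with the same placeholder and prefix assumptions as the earlier Python examples) that splits a collection into at most 16 partitions:

.. code:: python

   # PaginateIntoPartitionsPartitioner caps the partition count at
   # maxNumberOfPartitions; 16 is an arbitrary example value.
   df = (
       spark.read.format("mongodb")
       .option("database", "test")
       .option("collection", "myCollection")
       .option("partitioner", "PaginateIntoPartitionsPartitioner")
       .option("partitioner.options.maxNumberOfPartitions", "16")
       .load()
   )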

@@ -344,36 +296,36 @@ Partitioner Configuration
 ``uri`` Configuration Setting
 -----------------------------
 
-You can set all :ref:`spark-input-conf` via the input ``uri`` setting.
+You can set all :ref:`spark-input-conf` via the read ``uri`` setting.
 
-For example, consider the following example which sets the input
+For example, consider the following example which sets the read
 ``uri`` setting via ``SparkConf``:
 
 .. note::
 
-   If you use ``SparkConf`` to set the connector's input configurations,
-   prefix ``spark.mongodb.input.`` to the setting.
+   If you use ``SparkConf`` to set the connector's read configurations,
+   prefix ``spark.mongodb.read.`` to the setting.
 
 .. code:: cfg
 
-   spark.mongodb.input.uri=mongodb://127.0.0.1/databaseName.collectionName?readPreference=primaryPreferred
+   spark.mongodb.read.uri=mongodb://127.0.0.1/databaseName.collectionName?readPreference=primaryPreferred
 
 The configuration corresponds to the following separate configuration
 settings:
 
 .. code:: cfg
 
-   spark.mongodb.input.uri=mongodb://127.0.0.1/
-   spark.mongodb.input.database=databaseName
-   spark.mongodb.input.collection=collectionName
-   spark.mongodb.input.readPreference.name=primaryPreferred
+   spark.mongodb.read.uri=mongodb://127.0.0.1/
+   spark.mongodb.read.database=databaseName
+   spark.mongodb.read.collection=collectionName
+   spark.mongodb.read.readPreference.name=primaryPreferred
 
 If you specify a setting both in the ``uri`` and in a separate
 configuration, the ``uri`` setting overrides the separate
-setting. For example, given the following configuration, the input
+setting. For example, given the following configuration, the
 database for the connection is ``foobar``:
 
 .. code:: cfg
 
-   spark.mongodb.input.uri=mongodb://127.0.0.1/foobar
-   spark.mongodb.input.database=bar
+   spark.mongodb.read.uri=mongodb://127.0.0.1/foobar
+   spark.mongodb.read.database=bar
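
A PySpark sketch of this precedence rule, using the same placeholder values: the ``foobar`` database embedded in the ``uri`` wins over the separate ``database`` setting:

.. code:: python

   from pyspark.sql import SparkSession

   # The uri's database ("foobar") overrides the separate
   # "spark.mongodb.read.database" setting ("bar").
   spark = (
       SparkSession.builder
       .config("spark.mongodb.read.uri", "mongodb://127.0.0.1/foobar")
       .config("spark.mongodb.read.database", "bar")
       .getOrCreate()
   )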

source/configuration/write.txt

Lines changed: 15 additions & 15 deletions
@@ -12,15 +12,15 @@ Write Configuration Options
 
 .. _spark-output-conf:
 
-Output Configuration
---------------------
+Write Configuration
+-------------------
 
 The following options for writing to MongoDB are available:
 
 .. note::
 
-   If you use ``SparkConf`` to set the connector's output configurations,
-   prefix ``spark.mongodb.output.`` to each property.
+   If you use ``SparkConf`` to set the connector's write configurations,
+   prefix ``spark.mongodb.write.`` to each property.
 
 .. list-table::
    :header-rows: 1
@@ -103,35 +103,35 @@ The following options for writing to MongoDB are available:
 ``uri`` Configuration Setting
 -----------------------------
 
-You can set all :ref:`spark-output-conf` via the output ``uri``.
+You can set all :ref:`spark-output-conf` via the write ``uri``.
 
-For example, consider the following example which sets the input
+For example, consider the following example which sets the write
 ``uri`` setting via ``SparkConf``:
 
 .. note::
 
-   If you use ``SparkConf`` to set the connector's output configurations,
-   prefix ``spark.mongodb.output.`` to the setting.
+   If you use ``SparkConf`` to set the connector's write configurations,
+   prefix ``spark.mongodb.write.`` to the setting.
 
 .. code:: cfg
 
-   spark.mongodb.output.uri=mongodb://127.0.0.1/test.myCollection
+   spark.mongodb.write.uri=mongodb://127.0.0.1/test.myCollection
 
 The configuration corresponds to the following separate configuration
 settings:
 
 .. code:: cfg
 
-   spark.mongodb.output.uri=mongodb://127.0.0.1/
-   spark.mongodb.output.database=test
-   spark.mongodb.output.collection=myCollection
+   spark.mongodb.write.uri=mongodb://127.0.0.1/
+   spark.mongodb.write.database=test
+   spark.mongodb.write.collection=myCollection
 
 If you specify a setting both in the ``uri`` and in a separate
 configuration, the ``uri`` setting overrides the separate
-setting. For example, given the following configuration, the output
+setting. For example, given the following configuration, the
 database for the connection is ``foobar``:
 
 .. code:: cfg
 
-   spark.mongodb.output.uri=mongodb://127.0.0.1/foobar
-   spark.mongodb.output.database=bar
+   spark.mongodb.write.uri=mongodb://127.0.0.1/foobar
+   spark.mongodb.write.database=bar
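
For completeness, an illustrative PySpark write sketch that mirrors these settings; ``df`` stands for any existing DataFrame, and passing prefix-free keys to the writer is an assumption based on the note above:

.. code:: python

   # Append df to test.myCollection; "mongodb" is the v10 data
   # source short name, and the uri matches the cfg example above.
   (
       df.write.format("mongodb")
       .mode("append")
       .option("uri", "mongodb://127.0.0.1/test.myCollection")
       .save()
   )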
