@@ -8,7 +8,7 @@ How can I achieve data locality?
 --------------------------------

 For any MongoDB deployment, the Mongo Spark Connector sets the
-preferred location for an RDD to be where the data is:
+preferred location for a DataFrame or Dataset to be where the data is:

 - For a non sharded system, it sets the preferred location to be the
   hostname(s) of the standalone or the replica set.
@@ -30,89 +30,10 @@ To promote data locality,
 To partition the data by shard use the
 :ref:`conf-shardedpartitioner`.

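+To promote locality on a sharded cluster, the partitioner can be named
+explicitly in the read configuration. The following is a minimal sketch
+only; it assumes the connector's ``ReadConfig`` helper and that
+``spark.mongodb.input.uri`` already points at the sharded collection:
+
+.. code-block:: scala
+
+   import com.mongodb.spark.MongoSpark
+   import com.mongodb.spark.config.ReadConfig
+
+   // Sketch only: the URI, database, and collection are assumed to be
+   // set on the SparkSession ("spark") via spark.mongodb.input.* properties.
+   val readConfig = ReadConfig(
+     Map("partitioner" -> "MongoShardedPartitioner"),
+     Some(ReadConfig(spark))
+   )
+
+   // Partitions then line up with the chunks, so each partition can be
+   // read from the shard that holds its data.
+   val df = MongoSpark.load(spark, readConfig)
+   df.show()
+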
-How do I interact with Spark Streams?
--------------------------------------
-
-Spark streams can be considered as a potentially infinite source of
-RDDs. Therefore, anything you can do with an RDD, you can do with the
-results of a Spark Stream.
-
-For an example, see :mongo-spark:`SparkStreams.scala
-</blob/master/examples/src/test/scala/tour/SparkStreams.scala>`
-
 How do I resolve ``Unrecognized pipeline stage name`` Error?
 ------------------------------------------------------------

 In MongoDB deployments with mixed versions of :binary:`~bin.mongod`, it is
 possible to get an ``Unrecognized pipeline stage name: '$sample'``
 error. To mitigate this situation, explicitly configure the partitioner
 to use and define the Schema when using DataFrames.
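+
+For example, the following sketch (the field names and the choice of
+``MongoPaginateBySizePartitioner`` are illustrative assumptions) names a
+partitioner that does not issue ``$sample`` and supplies the schema up
+front instead of relying on ``$sample``-based inference:
+
+.. code-block:: scala
+
+   import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
+
+   // Hypothetical schema for the collection being read; declaring it
+   // avoids the $sample-based schema inference step.
+   val schema = StructType(Seq(
+     StructField("name", StringType, nullable = true),
+     StructField("qty", IntegerType, nullable = true)
+   ))
+
+   // Assumes spark.mongodb.input.uri is already set on the SparkSession.
+   val df = spark.read
+     .format("com.mongodb.spark.sql.DefaultSource")
+     .option("partitioner", "MongoPaginateBySizePartitioner")
+     .schema(schema)
+     .load()
+
+   df.printSchema()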
-
-How do I use MongoDB BSON types that are unsupported in Spark?
---------------------------------------------------------------
-
-Some custom MongoDB BSON types, such as ``ObjectId``, are unsupported
-in Spark.
-
-The MongoDB Spark Connector converts custom MongoDB data types to and
-from extended JSON-like representations of those data types that are
-compatible with Spark. See :ref:`<bson-spark-datatypes>` for a list of
-custom MongoDB types and their Spark counterparts.
-
-Spark Datasets
-~~~~~~~~~~~~~~
-
-To create a standard Dataset with custom MongoDB data types, use
-``fieldTypes`` helpers:
-
-.. code-block:: scala
-
-   import com.mongodb.spark.sql.fieldTypes
-
-   case class MyData(id: fieldTypes.ObjectId, a: Int)
-   val ds = spark.createDataset(Seq(MyData(fieldTypes.ObjectId(new ObjectId()), 99)))
-   ds.show()
-
-The preceding example creates a Dataset containing the following fields
-and data types:
-
-- The ``id`` field is a custom MongoDB BSON type, ``ObjectId``, defined
-  by ``fieldTypes.ObjectId``.
-
-- The ``a`` field is an ``Int``, a data type available in Spark.
-
-Spark DataFrames
-~~~~~~~~~~~~~~~~
-
-To create a DataFrame with custom MongoDB data types, you must supply
-those types when you create the RDD and schema:
-
-- Create RDDs using custom MongoDB BSON types
-  (e.g. ``ObjectId``). The Spark Connector handles converting
-  those custom types into Spark-compatible data types.
-
-- Declare schemas using the ``StructFields`` helpers for data types
-  that are not natively supported by Spark
-  (e.g. ``StructFields.objectId``). Refer to
-  :ref:`<bson-spark-datatypes>` for the mapping between BSON and custom
-  MongoDB Spark types.
-
-.. code-block:: scala
-
-   import org.apache.spark.sql.Row
-   import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
-   import com.mongodb.spark.sql.helpers.StructFields
-
-   val data = Seq(Row(Row(new ObjectId().toHexString()), 99))
-   val rdd = spark.sparkContext.parallelize(data)
-   val schema = StructType(List(StructFields.objectId("id", true), StructField("a", IntegerType, true)))
-   val df = spark.createDataFrame(rdd, schema)
-   df.show()
-
-The preceding example creates a DataFrame containing the following
-fields and data types:
-
-- The ``id`` field is a custom MongoDB BSON type, ``ObjectId``, defined
-  by ``StructFields.objectId``.
-
-- The ``a`` field is an ``Int``, a data type available in Spark.