(DOCSP-21085) Structured streaming (#90)

zach-carr · zach-carr · commit 902ef08cad24 · 2022-04-07T10:52:37.000-04:00
diff --git a/snooty.toml b/snooty.toml
@@ -6,7 +6,7 @@ intersphinx = ["https://www.mongodb.com/docs/manual/objects.inv"]
 toc_landing_pages = ["configuration"]
 
 [constants]
-current-version = "10"
+current-version = "10.0"
 spark-core-version = "3.0.1"
 spark-sql-version = "3.0.1"
 scala-version = "2.12"
diff --git a/source/getting-started.txt b/source/getting-started.txt
@@ -50,4 +50,4 @@ Tutorials
 
 - :doc:`write-to-mongodb`
 - :doc:`read-from-mongodb`
-- :doc:`streaming`
+- :doc:`structured-streaming`
diff --git a/source/includes/streaming-distinction.rst b/source/includes/streaming-distinction.rst
@@ -0,0 +1,3 @@
+.. important::
+
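+   `Spark Structured Streaming <https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html>`__ and `Spark Streaming with DStreams <https://spark.apache.org/docs/latest/streaming-programming-guide.html>`__ are different stream processing engines. Structured Streaming uses the Dataset and DataFrame APIs, while Spark Streaming with DStreams uses the older RDD-based DStream API.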
diff --git a/source/index.txt b/source/index.txt
@@ -107,8 +107,7 @@ versions of Apache Spark and MongoDB:
    getting-started
    write-to-mongodb
    read-from-mongodb
-   streaming
-   tutorials
+   structured-streaming
    faq
    release-notes
    API Docs <https://www.javadoc.io/doc/org.mongodb.spark/mongo-spark-connector_{+scala-version+}/{+current-version+}>
diff --git a/source/scala/streaming.txt b/source/scala/streaming.txt
@@ -1,9 +1,11 @@
+.. include:: includes/streaming-distinction.rst
+
 Spark Streaming allows on-the-fly analysis of live data streams with
 MongoDB. See the `Apache documentation
 <http://spark.apache.org/docs/latest/streaming-programming-guide.html>`_
 for a detailed description of Spark Streaming functionality.
 
-This tutorial uses the Spark Shell.For more information about starting
+This tutorial uses the Spark Shell. For more information about starting
 the Spark Shell and configuring it for use with MongoDB, see
 :ref:`Getting Started <scala-getting-started>`.
 
diff --git a/source/streaming.txt b/source/streaming.txt
diff --git a/source/structured-streaming.txt b/source/structured-streaming.txt
@@ -0,0 +1,327 @@
+.. _spark-structured-streaming:
+
+=================================
+Structured Streaming with MongoDB
+=================================
+
+.. default-domain:: mongodb
+
+.. contents:: On this page
+   :local:
+   :backlinks: none
+   :depth: 2
+   :class: singlecol
+
+Overview
+--------
+
+Spark Structured Streaming is a data stream processing engine you can 
+use through the Dataset or DataFrame API. The MongoDB Spark Connector 
+enables you to stream to and from MongoDB using Spark Structured 
+Streaming.
+
+.. include:: includes/streaming-distinction.rst
+
+To learn more about Structured Streaming, see the 
+`Spark Programming Guide
+<https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html>`__.
+
+.. _write-structured-stream:
+
+Configuring a Write Stream to MongoDB
+-------------------------------------
+
+.. tabs-drivers::
+
+   tabs:
+     - id: java-sync
+       content: |
+
+     - id: python
+       content: |
+
+         Specify write stream configuration settings on your streaming 
+         Dataset or DataFrame using the ``writeStream`` property. You 
+         must specify the following configuration settings to write 
+         to MongoDB:
+         
+         .. list-table::
+            :header-rows: 1
+            :stub-columns: 1
+            :widths: 10 40
+         
+            * - Setting
+              - Description
+         
+            * - ``writeStream.format()``
+              - The format to use for write stream data. Use 
+                ``mongodb``.
+         
+            * - ``writeStream.option()``
+              - Use the ``option`` method to specify your MongoDB 
+                deployment connection string with the 
+                ``spark.mongodb.connection.uri`` option key.
+         
+                You must specify a database and collection, either as 
+                part of your connection string or with additional 
+                ``option`` methods using the following keys:
+         
+                - ``spark.mongodb.database``
+                - ``spark.mongodb.collection``
+         
+            * - ``writeStream.outputMode()``
+              - The output mode to use. To view a list of all supported 
+                output modes, see `the pyspark outputMode documentation <https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.streaming.DataStreamWriter.outputMode.html#pyspark.sql.streaming.DataStreamWriter.outputMode>`__.
+
+         
+         The following code snippet shows how to use the preceding 
+         configuration settings to stream data to MongoDB:
+
+         .. code-block:: python
+            :copyable: true
+            :emphasize-lines: 3-4, 7
+         
+            <streaming Dataset/DataFrame> \
+              .writeStream \
+              .format("mongodb") \
+              .option("spark.mongodb.connection.uri", <mongodb-connection-string>) \
+              .option("spark.mongodb.database", <database-name>) \
+              .option("spark.mongodb.collection", <collection-name>) \
+              .outputMode("append")
+
+         For a complete list of methods, see the 
+         `pyspark Structured Streaming reference <https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss.html>`__.
+
+     - id: scala
+       content: |
+
+.. _read-structured-stream:
+.. _continuous-processing:
+
+Configuring a Read Stream from MongoDB
+--------------------------------------
+
+Reading a stream from a MongoDB database requires 
+*continuous processing*, 
+an experimental feature introduced in Spark version 2.3. To learn 
+more about continuous processing, see the `Spark documentation <https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#continuous-processing>`__.
+
+.. tabs-drivers::
+
+   tabs:
+     - id: java-sync
+       content: |
+
+     - id: python
+       content: |
+
+         To use continuous processing with the MongoDB Spark Connector, 
+         add the ``trigger()`` method to the ``writeStream`` property 
+         of the streaming Dataset or DataFrame that you create from 
+         your MongoDB read stream. In your ``trigger()``, specify the 
+         ``continuous`` parameter.
+         
+         .. note:: 
+         
+            The connector populates its read stream from your MongoDB 
+            deployment's change stream. To populate your change stream, 
+            perform update operations on your database.
+         
+            To learn more about change streams, see 
+            :manual:`Change Streams </changeStreams>` in the MongoDB 
+            manual.
+         
+         Specify read stream configuration settings with the 
+         ``readStream`` property of your local SparkSession. You must specify 
+         the following configuration settings to read from MongoDB:
+         
+         .. list-table::
+            :header-rows: 1
+            :stub-columns: 1
+            :widths: 10 40
+         
+            * - Setting
+              - Description
+         
+            * - ``readStream.format()``
+              - The format to use for read stream data. Use ``mongodb``.
+         
+            * - ``writeStream.trigger()``
+              - Enables continuous processing for your read stream. Use 
+                the ``continuous`` parameter.
+
+         The following code snippet shows how to use the preceding 
+         configuration settings to stream data from MongoDB:
+
+         .. code-block:: python
+            :copyable: true
+            :emphasize-lines: 3, 9
+         
+            streamingDataFrame = (<local SparkSession>
+              .readStream
+              .format("mongodb")
+              .load()
+            )
+         
+            query = (streamingDataFrame
+              .writeStream
+              .trigger(continuous="1 second")
+              .format("memory")
+              .queryName(<query-name>)  # required by the memory sink
+              .outputMode("append")
+            )
+
+            query.start()
+
+         .. note::
+            
+            Spark does not begin streaming until you call the 
+            ``start()`` method on a streaming query.
+
+         For a complete list of methods, see the 
+         `pyspark Structured Streaming reference <https://spark.apache.org/docs/latest/api/python/reference/pyspark.ss.html>`__.
+
+     - id: scala
+       content: |
+
+Examples
+--------
+
+The following examples show Spark Structured Streaming configurations 
+for streaming between MongoDB and a ``.csv`` file.
+
+Stream to MongoDB from a CSV File
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. tabs-drivers::
+
+   tabs:
+     - id: java-sync
+       content: |
+
+         .. code-block:: java
+            :copyable: true
+
+     - id: python
+       content: |
+
+         To create a :ref:`write stream <write-structured-stream>` to 
+         MongoDB from a ``.csv`` file, first create a `DataStreamReader <https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.streaming.DataStreamReader.html>`__ 
+         from the ``.csv`` file, then use that ``DataStreamReader`` to 
+         create a `DataStreamWriter <https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.streaming.DataStreamWriter.html>`__ 
+         to MongoDB. Finally, use the ``start()`` method to begin the 
+         stream.
+         
+         As streaming data is read from the ``.csv`` file, it is added 
+         to MongoDB in the `outputMode <https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.streaming.DataStreamWriter.outputMode.html#pyspark.sql.streaming.DataStreamWriter.outputMode>`__ 
+         you specify.
+
+         .. code-block:: python
+            :copyable: true
+            :emphasize-lines: 13, 19
+
+            from pyspark.sql import SparkSession
+
+            # create a local SparkSession
+            spark = SparkSession \
+              .builder \
+              .appName("writeExample") \
+              .master("spark://spark-master:<port>") \
+              .config("spark.jars", "<mongodb-spark-connector-{+current-version+}>.jar") \
+              .getOrCreate()
+
+            # define a streaming query
+            query = (spark
+              .readStream
+              .format("csv")
+              .option("header", "true")
+              .schema(<csv-schema>)
+              .load(<csv-file-name>)
+              # manipulate your streaming data
+              .writeStream
+              .format("mongodb")
+              .option("checkpointLocation", "/tmp/pyspark/")
+              .option("forceDeleteTempCheckpointLocation", "true")
+              .option("spark.mongodb.connection.uri", <mongodb-connection-string>)
+              .option("spark.mongodb.database", <database-name>)
+              .option("spark.mongodb.collection", <collection-name>)
+              .outputMode("append")
+            )
+
+            # run the query
+            query.start()
+            
+     - id: scala
+       content: |
+
+         .. code-block:: scala
+            :copyable: true
+
+Stream to a CSV File from MongoDB
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. tabs-drivers::
+
+   tabs:
+     - id: java-sync
+       content: |
+
+         .. code-block:: java
+            :copyable: true
+
+     - id: python
+       content: |
+
+         To create a :ref:`read stream <read-structured-stream>` from 
+         MongoDB to a ``.csv`` file, first create a `DataStreamReader <https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.streaming.DataStreamReader.html>`__ 
+         from MongoDB, then use that ``DataStreamReader`` to 
+         create a `DataStreamWriter <https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.streaming.DataStreamWriter.html>`__ 
+         to a new ``.csv`` file. Finally, use the ``start()`` method 
+         to begin the stream.
+         
+         As new data is inserted into MongoDB, Spark streams that 
+         data out to a ``.csv`` file in the `outputMode <https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.streaming.DataStreamWriter.outputMode.html#pyspark.sql.streaming.DataStreamWriter.outputMode>`__ 
+         you specify.
+
+         .. code-block:: python
+            :copyable: true
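+            :emphasize-lines: 22, 30, 33
+
+            from pyspark.sql import SparkSession
+            from pyspark.sql.types import StructType, StringType, DoubleType, TimestampType
+
+            # create a local SparkSession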
+            spark = SparkSession \
+              .builder \
+              .appName("readExample") \
+              .master("spark://spark-master:<port>") \
+              .config("spark.jars", "<mongodb-spark-connector-{+current-version+}>.jar") \
+              .getOrCreate()
+
+            # define the schema of the source collection
+            readSchema = (StructType()
+              .add("company_symbol", StringType())
+              .add("company_name", StringType())
+              .add("price", DoubleType())
+              .add("tx_time", TimestampType())
+            )
+
+            # define a streaming query
+            query = (spark
+              .readStream
+              .format("mongodb")
+              .option("spark.mongodb.connection.uri", <mongodb-connection-string>)
+              .option("spark.mongodb.database", <database-name>)
+              .option("spark.mongodb.collection", <collection-name>)
+              .schema(readSchema)
+              .load()
+              # manipulate your streaming data
+              .writeStream
+              .format("csv")
+              .option("path", "/output/")
+              .trigger(continuous="1 second")
+              .outputMode("append")
+            )
+
+            # run the query
+            query.start()   
+
+     - id: scala
+       content: |
+
+         .. code-block:: scala
+            :copyable: true
diff --git a/source/tutorials.txt b/source/tutorials.txt
