[SPARK-47793][SS][PYTHON] Implement SimpleDataSourceStreamReader for python streaming data source #45977
Conversation
@allisonwang-db @HyukjinKwon @HeartSaVioR PTAL, thanks!
python/pyspark/sql/datasource.py (outdated)
            message_parameters={"feature": "read"},
        )

    def read2(self, start: dict, end: dict) -> Iterator[Tuple]:
can we have one method? you can make end argument optional
They are fundamentally different: the former, read(), reads data and plans the end offset, while the latter reads data between an already planned start and end offset.
can we have a different name in that case instead of read2
Changed to readBetweenOffsets()
cc @HeartSaVioR
The name itself might be OK. We could make both method names self-descriptive (not just read), but if we prefer a shorter name, it is probably fine for one of them to be "read".
I see a bigger issue in the implementation. Let's address that first.
There can't be two methods named read() in the same class; Python doesn't have method overloading, IIRC.
It can't have overloaded ones, but it can dispatch by embedding if-else and leveraging an optional argument, e.g.,
def read(self, start: dict, end: dict = None) -> Union[Tuple[Iterator[Tuple], dict], Iterator[Tuple]]:
    if end is None:
        return ...  # logic for read(start)
    else:
        return ...  # logic for read(start, end)
Then we won't be able to enforce that read with end offset is implemented.
Yeah that would have to be documented. BTW, in Python you can't enforce anything in any event.
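For illustration, a minimal sketch of the two-method shape the thread converges on, assuming dict offsets; the class name and plain NotImplementedError are simplifications, not the PR's exact code:

```python
from typing import Iterator, Tuple


class SimpleStreamReaderSketch:
    """Simplified sketch: two distinct methods instead of one dispatching read()."""

    def read(self, start: dict) -> Tuple[Iterator[Tuple], dict]:
        # Read from the start offset and also plan (return) the end offset.
        raise NotImplementedError

    def readBetweenOffsets(self, start: dict, end: dict) -> Iterator[Tuple]:
        # Re-read a range that was already planned, for exactly-once replay.
        # Python cannot force subclasses to override this; it only fails at call time.
        raise NotImplementedError
```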
Co-authored-by: Hyukjin Kwon <[email protected]>
sahnib left a comment
Thanks for making these changes. Still reviewing the testcases. Left some questions/comments.
...main/scala/org/apache/spark/sql/execution/datasources/v2/python/PythonMicroBatchStream.scala (outdated; resolved)
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonStreamingSourceRunner.scala (outdated; resolved)
    val vectors = root.getFieldVectors().asScala.map { vector =>
      new ArrowColumnVector(vector)
    }.toArray[ColumnVector]
    val rows = ArrayBuffer[InternalRow]()
We are going to buffer all the rows in memory here? Can we create this iterator lazily to avoid buffering all data from Python source?
We can't do lazy initialization here because we need to send the data from the Python process to the JVM, and the communication is synchronous.
I imagine there may still be a way to avoid materializing all rows at once (e.g. per Arrow batch), but I'm not too concerned about it since we know the simple data source isn't intended to handle a huge amount of data.
If we call putIterator here we should be able to avoid materializing all rows at once on the Scala side, but it doesn't matter that much since we already materialize all rows on the Python side.
LGTM, pending CI.
The documentation build failure will go away if you sync/rebase your branch.
I'll take a look tomorrow. Sorry for the delay.
HeartSaVioR left a comment
I'm still reviewing, but wanted to call out a possible major issue up front in case I can't finish reviewing today.
            message_parameters={"feature": "initialOffset"},
        )

    def read(self, start: dict) -> Tuple[Iterator[Tuple], dict]:
I've missed this so far. Since we are close to completion, it'd be awesome if we could try to remove points of confusion from the doc, e.g. whether offsets are inclusive or exclusive. No need to deal with the doc update in this PR; it's probably worth a JIRA ticket.
        self.iterator = iterator


class _SimpleStreamReaderWrapper(DataSourceStreamReader):
If we have to separate private from public classes, what about the classes above? Do they need to be public? The notion of "private" seems very unclear here. I'm OK if this is a trick to address a gap in Python's scoping; I just wanted to know whether this is standard practice or not.
        return self.initial_offset

    def latestOffset(self) -> dict:
        # When the query starts for the first time, use the initial offset as the start offset.
Actually, this is the hard part of implementing a prefetcher for an SS data source. When the query restarts, we assume the prefetcher would be able to start from the known committed offset. Unfortunately that is not true. You've mentioned that this relies on the getBatch trick, but that's only applicable to DSv1, and it's clearly a hack to address a specific data source. It is not a contract the streaming engine guarantees.
We have an interface, AcceptsLatestSeenOffset, for this case (you would need to adopt it when determining the start offset for prefetching), but it gives you the latest seen offset rather than the last committed offset, so Spark could still request an offset range before that offset. It would still work if the simple data source reader can serve every planned-but-not-yet-committed offset range without relying on the prefetcher: the prefetcher can start prefetching from the latest seen offset, and the earlier offset ranges should be covered by the planned batch(es).
That said, readBetweenOffsets() must be able to work without the prefetcher; PREFETCHED_RECORDS_NOT_FOUND does not only happen in error cases.
If you don't have a test case where there is a planned batch in the offset log and the query has to restart from there, you need to add one. Run several batches, stop the query, make the last batch not-yet-committed, and restart the query. The prefetcher should not get a request to read from the initial offset, and the read request for the planned batch should work without relying on the prefetcher.
OK, never mind. You are handling everything individually (not just leveraging the DSv1 trick). Your comment seems a bit confusing; mentioning getBatch is where I got confused.
Still, it's better to have fault-tolerance test(s) if we don't have them.
Yes, we have tests where the query gets restarted multiple times and we verify that replaying the microbatch succeeds.
I realized the trick only works for a V1 source and added the individual handling; let me also update the comment here.
Let's avoid randomness. You can cover both restart scenarios: 1) restarting a query which does not have a leftover batch, and 2) restarting a query which does have a leftover batch (planned but not yet committed). We have several tests which adjust the offset log and commit log to test the behavior.
Thanks for the suggestion; I added logic to delete the last committed entry in the test.
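As a rough illustration of that kind of test setup (a hypothetical helper, assuming the standard checkpoint layout with one commits/<batchId> file per committed batch), one could drop the newest commit-log entry before restarting the query:

```python
import os


def delete_last_committed_entry(checkpoint_dir: str) -> None:
    # Remove the newest commit-log entry so the last planned batch looks
    # planned-but-not-yet-committed when the query restarts.
    commits_dir = os.path.join(checkpoint_dir, "commits")
    batch_ids = sorted(int(name) for name in os.listdir(commits_dir) if name.isdigit())
    if batch_ids:
        os.remove(os.path.join(commits_dir, str(batch_ids[-1])))
```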
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonStreamingSourceRunner.scala (resolved)
HeartSaVioR left a comment
I haven't yet looked into the test suites, but I assume people have already taken a look at the tests in depth.
    def read(
        self, input_partition: SimpleInputPartition  # type: ignore[override]
    ) -> Iterator[Tuple]:
        return self.simple_reader.readBetweenOffsets(input_partition.start, input_partition.end)
This sounds like we also have a case where the read method of the wrapper class has to be serialized and executed in the task, i.e. the simple reader also needs to be serialized and executed in the task. Do I understand correctly?
If so, I'd say you need to document this in SimpleDataSourceStreamReader: it isn't driver-only, which means implementations still need to consider serialization.
That said, if we change either getCache or its caller to call readBetweenOffsets and execute the same path (sending the data via Arrow batches), this method would never be called and we wouldn't need to serialize the SimpleDataSourceStreamReader instance.
Never mind, I see this is still needed to handle the "block not found" case. That said, this still applies:
I'd say you need to document in SimpleDataSourceStreamReader that it isn't driver-only, which means implementations still need to consider serialization.
I need to think about the serialization requirements of these methods; they should be documented in the user guide.
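One simple way to exercise that requirement in a test, sketched here as an assumption rather than anything the PR actually does:

```python
import pickle


def assert_reader_picklable(reader) -> None:
    # The simple reader can be shipped to executors in the cache-miss
    # fallback path, so the instance must survive a pickle round trip.
    pickle.loads(pickle.dumps(reader))
```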
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala (outdated; resolved)
            if json.dumps(entry.end) == json.dumps(end):
                end_idx = idx
                break
        if end_idx > 0:
Correct me if I'm missing something. According to the interface contract, the offset "end" won't be requested again. Doesn't that mean this should be end_idx > -1 and self.cache = self.cache[end_idx+1:]? Is there any reason we keep the cached entry whose end offset matches end?
I am trying to be conservative here when evicting cache by keeping one extra entry.
OK, it would be nice to add a code comment (probably a one-liner) to explicitly state the intention.
        for idx, entry in enumerate(self.cache):
            # There is no convenient way to compare 2 offsets.
            # Serialize into json string before comparison.
            if json.dumps(entry.start) == json.dumps(start):
Does this mean we have a case where the offset range spans multiple cache entries? Or is it just defensive programming?
It is just being defensive; currently we always call planInputPartitions() after a prefetch in latestOffset().
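To make the caching idea discussed above concrete, here is an illustrative sketch; the class and field names are hypothetical, not the PR's exact code, and it assumes dict offsets compared via their JSON serialization:

```python
import json
from dataclasses import dataclass
from typing import Iterator, List, Optional, Tuple


@dataclass
class CacheEntrySketch:
    start: dict
    end: dict
    iterator: Iterator[Tuple]


class PrefetchCacheSketch:
    def __init__(self) -> None:
        self.cache: List[CacheEntrySketch] = []

    def getCache(self, start: dict, end: dict) -> Optional[Iterator[Tuple]]:
        # Offsets are plain dicts; serialize to JSON strings to compare them.
        for entry in self.cache:
            same_start = json.dumps(entry.start) == json.dumps(start)
            same_end = json.dumps(entry.end) == json.dumps(end)
            if same_start and same_end:
                return entry.iterator
        return None

    def commit(self, end: dict) -> None:
        end_idx = -1
        for idx, entry in enumerate(self.cache):
            if json.dumps(entry.end) == json.dumps(end):
                end_idx = idx
                break
        if end_idx > 0:
            # Be conservative when evicting: keep the entry that ends at the
            # committed offset as one extra entry.
            self.cache = self.cache[end_idx:]
```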
...main/scala/org/apache/spark/sql/execution/datasources/v2/python/PythonMicroBatchStream.scala (outdated; resolved)
  private val outputIter = if (cachedBlock.isEmpty) {
    // Evaluate the python read UDF if the partition is not cached as block.
    val evaluatorFactory = source.createMapInBatchEvaluatorFactory(
      pickledReadFunc,
Got it - we need to serialize SimpleDataSourceStreamReader to cover a bad case.
Do we have a way to trigger this artificially? Never mind if it's not feasible; it looks non-trivial, but it would be awesome if we could test this path as well.
This will be triggered during replay of the last batch when the query restarts; I added a test for it.
...apache/spark/sql/execution/datasources/v2/python/PythonStreamingPartitionReaderFactory.scala (outdated; resolved)
sql/core/src/main/scala/org/apache/spark/sql/execution/python/PythonStreamingSourceRunner.scala (resolved)
HeartSaVioR left a comment
+1
The GA only failed with the Docker integration test, which isn't related.
Thanks! Merging to master.
Hi, @chaoqin-li1123 and all.
The newly added test case seems to be flaky. Could you take a look please?
[info] - SimpleDataSourceStreamReader read exactly once *** FAILED *** (8 seconds, 88 milliseconds)
Yes, I noticed that; I will send out a fix PR today.
Thank you so much, @chaoqin-li1123.
This is the fix: #46481 @dongjoon-hyun
What changes were proposed in this pull request?
SimpleDataSourceStreamReader is a simplified version of the DataSourceStreamReader interface.
There are 3 methods that need to be defined:
Read data and return the end offset.
def read(self, start: dict) -> Tuple[Iterator[Tuple], dict]
Read data between the start and end offset; this is required for exactly-once reads.
def readBetweenOffsets(self, start: dict, end: dict) -> Iterator[Tuple]
Return the initial start offset of the streaming query.
def initialOffset(self) -> dict
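A minimal sketch of a user implementation against this interface (a hypothetical counter source; offsets are plain dicts, as elsewhere in this PR):

```python
from typing import Iterator, Tuple

from pyspark.sql.datasource import SimpleDataSourceStreamReader


class CounterSimpleStreamReader(SimpleDataSourceStreamReader):
    """Hypothetical example source that emits consecutive integers."""

    def initialOffset(self) -> dict:
        return {"value": 0}

    def read(self, start: dict) -> Tuple[Iterator[Tuple], dict]:
        # Read one microbatch worth of rows and plan the end offset.
        begin = start["value"]
        end = {"value": begin + 10}
        return iter([(i,) for i in range(begin, end["value"])]), end

    def readBetweenOffsets(self, start: dict, end: dict) -> Iterator[Tuple]:
        # Deterministically re-read a planned range for exactly-once replay.
        return iter([(i,) for i in range(start["value"], end["value"])])
```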
The implementation wraps the SimpleDataSourceStreamReader instance in a DataSourceStreamReader that prefetches and caches data in latestOffset(). The records prefetched in the Python process are sent to the JVM as Arrow record batches in planInputPartitions(), cached by the block manager, and read by the partition reader on the executor later.
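Conceptually, the prefetch-and-cache flow looks roughly like the sketch below (illustrative only, not the wrapper's exact code; class and field names are hypothetical):

```python
from typing import Iterator, List, Tuple


class SimpleStreamReaderWrapperSketch:
    """Illustrative only: how latestOffset() drives prefetching."""

    def __init__(self, simple_reader) -> None:
        self.simple_reader = simple_reader
        self.current_offset = simple_reader.initialOffset()
        # Prefetched (start, end, rows) triples awaiting planInputPartitions().
        self.prefetched: List[Tuple[dict, dict, Iterator[Tuple]]] = []

    def initialOffset(self) -> dict:
        return self.simple_reader.initialOffset()

    def latestOffset(self) -> dict:
        # Prefetch eagerly: read from the current offset, remember the rows,
        # and report the planned end offset back to the engine.
        start = self.current_offset
        rows, end = self.simple_reader.read(start)
        self.prefetched.append((start, end, rows))
        self.current_offset = end
        return end
```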
Why are the changes needed?
Compared to the DataSourceStreamReader interface, the simplified interface has some advantages:
It doesn’t require developers to reason about data partitioning.
It doesn’t require getting the latest offset before reading data.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Add unit test and integration test.
Was this patch authored or co-authored using generative AI tooling?
No.