feat: streaming response #1317
Conversation
Walkthrough

The changes introduce support for streaming query execution by refactoring the query handler into specialized async functions for count, non-streaming, and streaming queries.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant HTTPHandler
    participant QueryModule
    participant ResponseModule
    Client->>HTTPHandler: Send query (with/without streaming flag)
    HTTPHandler->>HTTPHandler: Parse and extract streaming flag
    alt Count Query
        HTTPHandler->>QueryModule: Execute count query
        QueryModule-->>HTTPHandler: Return count result
        HTTPHandler->>ResponseModule: Format count response
        ResponseModule-->>HTTPHandler: JSON value
        HTTPHandler->>Client: Return JSON response
    else Non-Streaming Query
        HTTPHandler->>QueryModule: Execute batch query
        QueryModule-->>HTTPHandler: Return record batches
        HTTPHandler->>ResponseModule: Format batch response
        ResponseModule-->>HTTPHandler: JSON value
        HTTPHandler->>Client: Return JSON response
    else Streaming Query
        HTTPHandler->>QueryModule: Execute streaming query
        QueryModule-->>HTTPHandler: Return record batch stream
        loop For each batch
            HTTPHandler->>ResponseModule: Format batch to JSON
            ResponseModule-->>HTTPHandler: JSON value
            HTTPHandler->>Client: Stream JSON batch
        end
    end
```
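As a rough orientation, the dispatch shown in the diagram boils down to a small decision: the count fast-path is checked first, then the streaming flag. A minimal sketch of that ordering follows; names and types here are illustrative, not the actual handler signatures in src/handlers/http/query.rs.

```rust
// Illustrative only: the three-way dispatch described by the sequence diagram.
enum QueryKind {
    Count,
    Batch,     // non-streaming
    Streaming, // ?streaming=true
}

fn classify(is_count: bool, streaming: bool) -> QueryKind {
    // The count fast-path is checked first, so a COUNT(*) query is never streamed.
    if is_count {
        QueryKind::Count
    } else if streaming {
        QueryKind::Streaming
    } else {
        QueryKind::Batch
    }
}

fn main() {
    // A COUNT(*) query with ?streaming=true still takes the count path.
    assert!(matches!(classify(true, true), QueryKind::Count));
    assert!(matches!(classify(false, true), QueryKind::Streaming));
    assert!(matches!(classify(false, false), QueryKind::Batch));
}
```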
Actionable comments posted: 1
🧹 Nitpick comments (4)
src/response.rs (1)

34-60: Method name now misleading & header constant unused inside file

`to_http` no longer creates an `HttpResponse`; it only returns a `serde_json::Value`. Consider renaming it (e.g. `to_json`) or updating the doc-comments to avoid confusion.

Additionally, `TIME_ELAPSED_HEADER` is still declared here but not referenced in this module after the refactor. If the constant is only used by HTTP-layer code (`handlers/http/query.rs`), move it there to keep responsibilities isolated.
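As a sketch of the suggested rename, assuming a QueryResponse-like struct with the fields quoted elsewhere in this review (the real type and its error handling in src/response.rs will differ):

```rust
use serde_json::{json, Value};

// Illustrative only: a QueryResponse-like struct whose formatter returns JSON,
// leaving HTTP concerns (such as TIME_ELAPSED_HEADER) to the handler layer.
struct QueryResponse {
    records: Vec<Value>, // stand-in for already-serialized record batches
    fields: Vec<String>,
    with_fields: bool,
}

impl QueryResponse {
    /// Renamed from `to_http`: builds only the JSON value.
    fn to_json(&self) -> Value {
        if self.with_fields {
            json!({ "fields": self.fields, "records": self.records })
        } else {
            Value::Array(self.records.clone())
        }
    }
}

fn main() {
    let resp = QueryResponse {
        records: vec![json!({"status": "ok"})],
        fields: vec!["status".to_string()],
        with_fields: true,
    };
    println!("{}", resp.to_json());
}
```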
src/query/mod.rs (1)

196-214: Minor memory churn – clone of `fields` per batch

`execute_stream` already copies the `Vec<String>` once; in the HTTP layer each batch clones it again. If many small batches are emitted this becomes noticeable. Returning `Arc<Vec<String>>` or just a shared `Arc<[String]>` avoids the per-batch allocation:

```diff
-pub async fn execute_stream(... ) -> Result<(SendableRecordBatchStream, Vec<String>), _> {
+pub async fn execute_stream(... ) -> Result<(SendableRecordBatchStream, Arc<Vec<String>>), _> {
```
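To illustrate the cost difference, a standalone sketch in which a plain iterator stands in for SendableRecordBatchStream: each batch clones an Arc (a refcount bump) rather than the whole Vec<String>.

```rust
use std::sync::Arc;

// Sketch of the suggestion: hand the HTTP layer an Arc'd field list so each
// emitted batch clones a pointer, not the Vec<String>.
fn execute_stream_sketch() -> (std::ops::Range<u32>, Arc<Vec<String>>) {
    let fields = Arc::new(vec!["status".to_string(), "count".to_string()]);
    (0..3u32, fields)
}

fn main() {
    let (stream, fields) = execute_stream_sketch();
    for batch in stream {
        // Cheap per-batch clone: bumps the refcount instead of copying strings.
        let fields_for_batch = Arc::clone(&fields);
        println!("batch {batch}: fields = {:?}", fields_for_batch);
    }
}
```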
src/handlers/http/query.rs (2)

116-149: `handle_count_query` repeats field list on every call site – could reuse QueryResponse

The manual JSON building duplicates logic already present in `QueryResponse::to_http`. Not blocking, but using the same helper would reduce divergence:

```diff
-let response = if query_request.fields {
-    json!({
-        "fields": [column_name],
-        "records": [json!({column_name: count})]
-    })
-} else {
-    serde_json::Value::Array(vec![json!({column_name: count})])
-};
+let response = QueryResponse {
+    records: vec![], // fake batch converted below
+    fields: vec![column_name.to_owned()],
+    fill_null: false,
+    with_fields: query_request.fields,
+}.to_http()?; // after pushing one synthetic RecordBatch
```
288-291: URL-param parsing only works for `streaming=true/false` – accept upper-case?

The `web::Query<HashMap<String, bool>>` deserializer is case-sensitive; `"TRUE"` will fail. Consider normalising manually or documenting the exact expected values.
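One possible manual normalisation, sketched with a plain HashMap<String, String> in place of the extractor (not the code in this PR):

```rust
use std::collections::HashMap;

// Sketch: accept "true"/"TRUE"/"True", and treat anything else (or a missing
// key) as false.
fn streaming_flag(params: &HashMap<String, String>) -> bool {
    params
        .get("streaming")
        .map(|v| v.eq_ignore_ascii_case("true"))
        .unwrap_or(false)
}

fn main() {
    let mut params = HashMap::new();
    params.insert("streaming".to_string(), "TRUE".to_string());
    assert!(streaming_flag(&params));
    assert!(!streaming_flag(&HashMap::new()));
}
```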
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
⛔ Files ignored due to path filters (1)
- `Cargo.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (4)
- `Cargo.toml` (1 hunks)
- `src/handlers/http/query.rs` (6 hunks)
- `src/query/mod.rs` (3 hunks)
- `src/response.rs` (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/handlers/http/query.rs (1)
src/query/mod.rs (4)
- `execute` (70-79)
- `execute` (171-194)
- `execute_stream` (81-90)
- `execute_stream` (197-214)
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: coverage
🔇 Additional comments (6)
Cargo.toml (1)
124-124: ✅ Verification successful

Potential duplicate crate – align `futures-core` patch-version with `futures` to avoid multiple copies
`futures` already re-exports the same public traits from its own `futures-core` dependency. Pulling `futures-core = "0.3.31"` explicitly while `futures = "0.3"` (currently resolved to 0.3.30) may lead to two distinct versions of the `futures_core` crate in the final binary, increasing size and compile time.

Two safer options:

```diff
- futures-core = "0.3.31"
+# Option A – rely on transitive dep (simplest)
+# – remove the explicit line above
+
+# Option B – keep explicit but stay semver-loose so Cargo can unify:
+futures-core = "0.3"
```

If you really need the explicit import (e.g. for feature flags), pin it to `"0.3"` so Cargo can pick a single patch level.

You can confirm the duplication with:
🏁 Script executed:
```bash
#!/bin/bash
cargo tree -e features -i futures-core
```
Length of output: 136
To verify the resolved versions and detect any version skew between `futures` and `futures-core`, let's inspect your `Cargo.toml` and `Cargo.lock`:

🏁 Script executed:

```bash
#!/bin/bash
# Check explicit futures and futures-core entries in Cargo.toml
grep -R '^futures\s*=' -n Cargo.toml
grep -R '^futures-core\s*=' -n Cargo.toml

# If a lockfile exists, list all resolved futures-core versions
if [ -f Cargo.lock ]; then
  grep -R '^name = "futures-core"' -A2 -n Cargo.lock
else
  echo "No Cargo.lock found"
fi
```

Length of output: 327
To confirm that `futures` and `futures-core` are both resolving to the same patch, let's check the resolved version of `futures` in your lockfile:

🏁 Script executed:

```bash
#!/bin/bash
grep -R '^name = "futures"$' -A2 -n Cargo.lock
```

Length of output: 164
No duplicate futures-core crate – patch versions already aligned

Verified in Cargo.lock that both `futures = "0.3"` and your explicit `futures-core = "0.3.31"` resolve to 0.3.31, so Cargo will dedupe them. You can safely:

- Remove the explicit `futures-core = "0.3.31"` line if you don't need its feature flags and rely on the transitive dep.
- Or keep it as-is (it matches the transitive version, so no duplicate crate will be pulled).
src/query/mod.rs (1)
81-90: Returning a stream captured from another runtime may dead-lock in edge cases

`execute_stream` spawns `query.execute_stream()` on `QUERY_RUNTIME` and then returns the `SendableRecordBatchStream` to the caller. The stream is subsequently polled on Actix's Tokio runtime (not on `QUERY_RUNTIME`). If DataFusion's stream internally assumes the same runtime that created it (e.g. for cooperative scheduling or object-store IO), this separation could stall or panic under load.

At minimum, document the assumption or add an integration test that exercises large multi-batch queries over S3/GCS. An alternative is to not spawn here – the heavy lifting happens when the stream is polled, so the extra spawn gives no benefit, and removing it avoids the cross-runtime hop:

```diff
- QUERY_RUNTIME
-     .spawn(async move { query.execute_stream(time_partition.as_ref()).await })
-     .await
-     .expect("The Join should have been successful")
+query.execute_stream(time_partition.as_ref()).await
```
src/handlers/http/query.rs (4)

66-68: `streaming` flag never accepted from JSON – confirm intended

`#[serde(skip)]` means JSON bodies can't enable streaming; only the URL param `?streaming=true` is honoured. If that's intentional (and it probably is), add a doc-comment so future contributors don't try to set it in the body.
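A possible doc-comment, sketched on a trimmed-down Query struct (field set and wording are illustrative):

```rust
use serde::Deserialize;

// Sketch of the suggested doc-comment on the streaming field.
#[derive(Deserialize)]
struct Query {
    query: String,
    /// When true, results are streamed back as NDJSON. This flag cannot be set
    /// in the JSON body (`#[serde(skip)]`); it is only honoured via the
    /// `?streaming=true` URL parameter.
    #[serde(skip)]
    streaming: bool,
}

fn main() {
    let q: Query =
        serde_json::from_str(r#"{"query": "select 1", "streaming": true}"#).unwrap();
    // The body value is ignored: streaming stays at its default of false.
    assert!(!q.streaming);
    assert_eq!(q.query, "select 1");
}
```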
105-114: Branching order reads fine – but `is_logical_plan_count_without_filters` runs before the `streaming` check

So a `SELECT COUNT(*) … &streaming=true` request will still take the optimized count path and not stream. That's likely desirable; just flagging for awareness.
151-174: Content-type for non-streaming path OK

Nothing major; metrics & headers preserved.
355-356: `transform_query_for_ingestor` forwards the `streaming` flag – is that required?

Ingestor paths currently ignore streaming; passing the flag risks surprising behaviour if ingestors start honouring it later. If you only need non-streamed results for ingestion, force `streaming = false`.
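A sketch of that pinning, with a hypothetical stand-in struct rather than the actual Query type:

```rust
// Sketch only: force the flag off for ingestor-bound queries so a future
// ingestor that starts honouring `streaming` doesn't change behaviour.
struct IngestorQuery {
    sql: String,
    streaming: bool,
}

fn transform_for_ingestor(original_sql: &str) -> IngestorQuery {
    IngestorQuery {
        sql: original_sql.to_string(),
        streaming: false, // ingestion always takes the batch path
    }
}

fn main() {
    let q = transform_for_ingestor("select * from logs");
    assert!(!q.streaming);
    println!("{} (streaming={})", q.sql, q.streaming);
}
```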
Actionable comments posted: 2
♻️ Duplicate comments (1)
src/handlers/http/query.rs (1)
304-307: Duplicate error handling should be consistent

This is the same error handling pattern as in the other branch of the conditional – both should be updated consistently.
🧹 Nitpick comments (1)
src/handlers/http/query.rs (1)
261-314: Reduce code duplication in streaming handler

The two branches of the conditional have significant duplication in how they process record batches. Consider extracting the common record batch handling logic into a separate function:

```diff
+ fn create_batch_processor(
+     send_null: bool,
+     with_fields: bool,
+     fields: Vec<String>
+ ) -> impl FnMut(Result<RecordBatch, ExecuteError>) -> Result<Bytes, actix_web::Error> {
+     move |batch_result| match batch_result {
+         Ok(batch) => {
+             let response = QueryResponse {
+                 records: vec![batch],
+                 fields: if with_fields { Vec::new() } else { fields.clone() },
+                 fill_null: send_null,
+                 with_fields,
+             }
+             .to_json()
+             .unwrap_or_else(|e| {
+                 error!("Failed to parse record batch into JSON: {}", e);
+                 json!({"error": format!("Failed to serialize batch: {}", e)})
+             });
+             Ok(Bytes::from(format!("{}\n", response)))
+         }
+         Err(e) => Err(actix_web::error::ErrorInternalServerError(e)),
+     }
+ }
```

Then use this helper in both branches.
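To show the intended shape of "one processor, two call sites", a self-contained sketch with plain types standing in for RecordBatch, Bytes, and the actix error type (not the PR's code):

```rust
// Sketch only: a closure-producing helper shared by both branches of the
// streaming handler; u32 stands in for RecordBatch, String for Bytes/errors.
fn create_batch_processor(
    prefix: String,
) -> impl FnMut(Result<u32, String>) -> Result<String, String> {
    move |batch| match batch {
        // Each successful batch becomes one newline-terminated chunk.
        Ok(n) => Ok(format!("{prefix}{n}\n")),
        // Errors are surfaced to the caller instead of being swallowed.
        Err(e) => Err(format!("stream error: {e}")),
    }
}

fn main() {
    // Both the "fields first" branch and the plain branch would reuse the
    // same processor instead of duplicating the match.
    let mut process = create_batch_processor("batch ".to_string());
    for item in [Ok(1_u32), Ok(2), Err("boom".to_string())] {
        match process(item) {
            Ok(chunk) => print!("{chunk}"),
            Err(e) => eprintln!("{e}"),
        }
    }
}
```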
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
📒 Files selected for processing (3)
- `src/handlers/airplane.rs` (3 hunks)
- `src/handlers/http/query.rs` (6 hunks)
- `src/query/mod.rs` (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- src/handlers/airplane.rs
- src/query/mod.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka aarch64-apple-darwin
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: coverage
🔇 Additional comments (6)
src/handlers/http/query.rs (6)
68-68: Good addition of streaming support to the Query struct

The `streaming` field properly uses `#[serde(skip)]` since it's extracted from query parameters rather than the request body.
105-123: Clean refactoring of query execution paths

The main `query` function now logically branches to specialized handlers for count, non-streaming, and streaming queries, improving code organization and readability.
125-172: Well-documented count query optimization

The specialized count query handler is well-documented and appropriately reuses the `CountsRequest::get_bin_density` method instead of executing the full query plan.
174-219: Well-structured non-streaming query handler

The non-streaming query handler properly validates the response type using `Either` pattern matching and correctly handles timing metrics.
317-317: Good use of appropriate MIME type for NDJSON

Using the correct content type `application/x-ndjson` for newline-delimited JSON responses follows standards and helps clients properly interpret the streaming data.
391-393: Clean implementation of streaming parameter extraction

The implementation properly extracts the `streaming` parameter from the query string and defaults to `false` when not present, ensuring backward compatibility.
src/handlers/http/query.rs (Outdated)

```diff
     let (records_stream, fields) = execute(query, table_name, query_request.streaming).await?;
     let records_stream = match records_stream {
         Either::Left(_) => {
             return Err(QueryError::MalformedQuery(
                 "Expected stream results, got batch",
             ))
         }
         Either::Right(stream) => stream,
     };
     let fields = fields.clone();
     let total_time = format!("{:?}", time.elapsed());
     let time = time.elapsed().as_secs_f64();

     QUERY_EXECUTE_TIME
-        .with_label_values(&[&table_name])
+        .with_label_values(&[table_name])
         .observe(time);

-    Ok(response)
+    let send_null = query_request.send_null;
+    let with_fields = query_request.fields;
```
💡 Verification agent
🧩 Analysis chain
Verify potential streaming performance characteristics
The streaming implementation looks good, but consider testing with large datasets to verify performance.
🏁 Script executed:
```bash
#!/bin/bash
# Check for any related DataFusion stream configuration settings
rg -A 3 -B 3 "stream|batch_size|chunk_size" --type rust

# Look for any existing performance benchmarks
fd -i bench -e rs -e toml
```
Length of output: 65621
Add streaming performance benchmarks
The DataFusion streaming path in `src/handlers/http/query.rs` delegates to `execute_stream()` with batch size controlled by `PARSEABLE.options.execution_batch_size` in `src/query/mod.rs`, but I didn't find any existing benchmarks under `benches/` or anywhere else. To ensure streaming queries hold up on large datasets, please:
- Create a benchmark suite (for example `benches/streaming_performance.rs`) using Criterion.rs or Rust's built-in benchmarking; a minimal skeleton is sketched after this list.
- Measure end-to-end streaming latency and throughput with large synthetic datasets.
- Tune `execution_batch_size` (default 20000) and any other relevant parameters based on your findings.
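For instance, a minimal Criterion skeleton might look like the following. The file name, benchmark name, and workload are placeholders; it assumes criterion as a dev-dependency and a `[[bench]]` entry with `harness = false` in Cargo.toml, and the closure body would be replaced by a call into the real streaming query path.

```rust
// benches/streaming_performance.rs — skeleton only; the closure body would run
// an end-to-end streaming query over a synthetic dataset and drain the stream.
use criterion::{criterion_group, criterion_main, Criterion};

fn streaming_throughput(c: &mut Criterion) {
    c.bench_function("stream_synthetic_1m_rows", |b| {
        b.iter(|| {
            // Placeholder workload standing in for the real streaming path.
            let total: u64 = (0..1_000_000u64).sum();
            criterion::black_box(total)
        })
    });
}

criterion_group!(benches, streaming_throughput);
criterion_main!(benches);
```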
🤖 Prompt for AI Agents
In src/handlers/http/query.rs around lines 242 to 260, add a new benchmark suite
file, for example benches/streaming_performance.rs, using Criterion.rs or Rust's
built-in benchmarking framework. Implement tests that measure end-to-end
streaming latency and throughput on large synthetic datasets to evaluate the
performance of the DataFusion streaming path. Adjust the execution_batch_size
parameter and other relevant settings in src/query/mod.rs based on benchmark
results to optimize streaming performance.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/handlers/http/query.rs (1)
295-315: Error handling could be more informative

While the error handling is improved from the previous implementation, consider adding more context to error messages to help with debugging streaming issues.

```diff
- actix_web::error::ErrorInternalServerError(e)
+ actix_web::error::ErrorInternalServerError(format!("Failed to serialize batch: {}", e))
```

```diff
- Err(e) => Err(actix_web::error::ErrorInternalServerError(e)),
+ Err(e) => {
+     error!("Error processing record batch in stream: {}", e);
+     Err(actix_web::error::ErrorInternalServerError(e))
+ },
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge Base: Disabled due to data retention organization setting
📒 Files selected for processing (1)
- `src/handlers/http/query.rs` (6 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (7)
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: coverage
🔇 Additional comments (9)
src/handlers/http/query.rs (9)
56-56: Good use of constant for header name!

Using a constant for the header name improves maintainability and ensures consistency across the codebase.
69-69: Appropriate use of `#[serde(skip)]` for the streaming flag

The streaming flag is correctly marked with `#[serde(skip)]` as it's derived from query parameters rather than being part of the serialized JSON body.
107-124: Clean separation of concerns with specialized handlers

Good refactoring by dividing the query handling into specialized functions based on query type and streaming mode. This improves readability and maintainability.
126-173: Well-documented count query handler with optimization

The count query handler is well-documented and properly optimized by using `get_bin_density` instead of running the full query.
175-220: Non-streaming handler appropriately handles batch results

The non-streaming handler correctly extracts batch results from the `Either` type and ensures that streaming results aren't accidentally processed in the wrong path.
222-293: Streaming implementation follows best practices

The streaming implementation correctly uses NDJSON format with the proper MIME type and only sends field metadata once as the first chunk when requested.
290-291: Good use of NDJSON content type

Using "application/x-ndjson" for the content type correctly indicates the newline-delimited JSON format to clients.
385-387: Default streaming to false for backward compatibility

Correctly defaults streaming to false when not specified, which maintains backward compatibility with existing clients.
452-452: Properly propagates streaming flag in transform function

The streaming flag is correctly propagated when transforming queries for the ingestor.
- use datafusion's `execute_stream` function that sends a streaming response
- use query param `streaming=true` to get the streaming response; defaults to false, so Prism or other clients work as usual
- query response (with `streaming=true`) sends multiple batches one after the other, split by a newline character (see the sketch after this list)
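A client consuming such a response just splits on newlines and parses each line as standalone JSON. A sketch with a string literal standing in for the HTTP body (field names and values are made up); per the review above, when fields are requested the first line carries the field list:

```rust
use serde_json::Value;

fn main() {
    // Stand-in for a ?streaming=true response body: one JSON document per line.
    let body = r#"{"fields":["status","count"]}
{"status":"ok","count":42}
{"status":"error","count":7}"#;

    for line in body.lines() {
        let chunk: Value =
            serde_json::from_str(line).expect("each line is standalone JSON");
        println!("got chunk: {chunk}");
    }
}
```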
Summary by CodeRabbit

- New Features
- Refactor
- Bug Fixes
- Chores