
fix: create stream before merge schema #1381


Merged

merged 2 commits into parseablehq:main on Jul 19, 2025

Conversation

Contributor

@nikhilsinhaparseable nikhilsinhaparseable commented Jul 18, 2025

Issue: the schema merge and in-memory commit happened before the stream was created from storage.
Fix: first create the stream from storage if it is not present, then merge the schema and commit it to memory.
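The ordering fix can be sketched with a minimal in-memory model. Everything here (`Catalog`, `create_stream_from_storage`, `merge_and_commit_schema`) is an illustrative stand-in, not the actual Parseable API; it only demonstrates why committing a schema before the stream exists fails, and why creating the stream first succeeds.

```rust
use std::collections::{HashMap, HashSet};

// Hypothetical model: streams must exist before a schema can be
// merged and committed against them.
#[derive(Default)]
struct Catalog {
    streams: HashSet<String>,
    schemas: HashMap<String, Vec<String>>, // stream -> committed field names
}

impl Catalog {
    // Step 1: ensure the stream exists in memory, pulling from storage if absent.
    fn create_stream_from_storage(&mut self, name: &str) {
        self.streams.insert(name.to_string());
    }

    // Step 2: merging the schema is only valid once the stream exists.
    fn merge_and_commit_schema(&mut self, name: &str, fields: &[&str]) -> Result<(), String> {
        if !self.streams.contains(name) {
            // This is the bug the PR fixes: committing before creation fails.
            return Err(format!("stream `{name}` not created yet"));
        }
        let entry = self.schemas.entry(name.to_string()).or_default();
        for f in fields {
            if !entry.iter().any(|e| e == f) {
                entry.push(f.to_string());
            }
        }
        Ok(())
    }
}

fn main() {
    let mut catalog = Catalog::default();
    // Old (buggy) order: merge before create -> error.
    assert!(catalog.merge_and_commit_schema("app_logs", &["ts"]).is_err());
    // Fixed order: create first, then merge and commit.
    catalog.create_stream_from_storage("app_logs");
    assert!(catalog.merge_and_commit_schema("app_logs", &["ts", "msg"]).is_ok());
    assert_eq!(catalog.schemas["app_logs"], vec!["ts", "msg"]);
}
```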

Summary by CodeRabbit

  • Refactor
    • Streamlined query processing by simplifying stream creation and removing fallback retries.
    • Improved alert query preparation with a more efficient stream creation approach.
    • Enhanced schema commitment during stream creation for better data consistency.
    • Updated error handling to preserve original error types in demo data retrieval.

issue: merge schema and commit to memory happening before stream creation from storage
fix: first create stream from storage if not present
then merge schema and commit to memory
Contributor

coderabbitai bot commented Jul 18, 2025

Walkthrough

The query flow is simplified by removing the schema update step and fallback logic for missing streams. The new create_streams_for_distributed function directly creates streams for all provided names without comparing existing streams or retrying logical plan creation. Calls to into_query and prepare_query are updated accordingly, and schema commitment is integrated into stream creation from storage.
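The simplified flow described above can be sketched as follows. The signature and the `HashSet` catalog are illustrative assumptions, not Parseable's actual API; the point is that `create_streams_for_distributed` creates every requested stream directly, with no diff against existing streams and no logical-plan retry.

```rust
use std::collections::HashSet;

// Hedged sketch: create each named stream directly. Insertion into a set is
// idempotent, so already-present streams are left untouched, which is what
// makes the old existing-stream comparison and fallback retry unnecessary.
fn create_streams_for_distributed(catalog: &mut HashSet<String>, stream_names: &[&str]) {
    for name in stream_names {
        catalog.insert((*name).to_string());
    }
}

fn main() {
    let mut catalog: HashSet<String> = ["existing".to_string()].into_iter().collect();
    create_streams_for_distributed(&mut catalog, &["existing", "new_stream"]);
    assert!(catalog.contains("existing"));
    assert!(catalog.contains("new_stream"));
    assert_eq!(catalog.len(), 2);
}
```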

Changes

  • src/handlers/http/query.rs: Removed create_streams_for_querier and replaced it with the simplified create_streams_for_distributed. Removed the schema update and fallback logic from the query flow. Updated the into_query signature and its call sites.
  • src/alerts/alerts_utils.rs: Refactored prepare_query to replace the fallback stream/schema creation with a single call to create_streams_for_distributed. Removed concurrent task spawning and simplified error handling.
  • src/handlers/airplane.rs: Removed the obsolete streams argument from the call to into_query in the do_get method.
  • src/handlers/http/demo_data.rs: Changed error handling to return the original error directly from get_demo_data_from_ingestor without wrapping it.
  • src/parseable/mod.rs: Added a call to commit_schema after creating a stream from storage, integrating schema commitment into the stream creation flow.

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant QueryHandler
    participant StreamCreator
    participant LogicalPlanner
    participant Executor

    Client->>QueryHandler: Send Query Request
    QueryHandler->>StreamCreator: create_streams_for_distributed(stream_names)
    StreamCreator-->>QueryHandler: Streams Created
    QueryHandler->>LogicalPlanner: into_query(query, session_state, time_range)
    LogicalPlanner-->>QueryHandler: Logical Plan
    QueryHandler->>Executor: Execute Logical Plan
    Executor-->>QueryHandler: Query Results
    QueryHandler-->>Client: Return Results

Possibly related PRs

  • parseablehq/parseable#1367: Replaced and simplified the previous create_streams_for_querier function with create_streams_for_distributed, changing stream creation logic.
  • parseablehq/parseable#1380: Introduced fallback mechanisms for stream and schema creation on logical plan failure, which this PR removes and simplifies.

Suggested labels

for next release

Suggested reviewers

  • nitisht
  • parmesant

Poem

🐇 A rabbit hops through code anew,
Streams created, schemas too.
No more retries, no fallback chase,
A simpler, cleaner query race.
With every hop, the logic’s bright,
The code now runs with pure delight! 🌿✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0399ba1 and 781fdd2.

📒 Files selected for processing (5)
  • src/alerts/alerts_utils.rs (2 hunks)
  • src/handlers/airplane.rs (1 hunks)
  • src/handlers/http/demo_data.rs (1 hunks)
  • src/handlers/http/query.rs (5 hunks)
  • src/parseable/mod.rs (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/handlers/http/query.rs
🧰 Additional context used
🧠 Learnings (5)
📓 Common learnings
Learnt from: de-sh
PR: parseablehq/parseable#1185
File: src/handlers/http/logstream.rs:255-261
Timestamp: 2025-02-14T09:49:25.818Z
Learning: In Parseable's logstream handlers, stream existence checks must be performed for both query and standalone modes. The pattern `!PARSEABLE.streams.contains(&stream_name) && (PARSEABLE.options.mode != Mode::Query || !PARSEABLE.create_stream_and_schema_from_storage(&stream_name).await?)` ensures proper error handling in both modes.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/query/mod.rs:64-66
Timestamp: 2025-06-18T06:39:04.775Z
Learning: In src/query/mod.rs, QUERY_SESSION_STATE and QUERY_SESSION serve different architectural purposes: QUERY_SESSION_STATE is used for stats calculation and allows dynamic registration of individual parquet files from the staging path (files created every minute), while QUERY_SESSION is used for object store queries with the global schema provider. Session contexts with schema providers don't support registering individual tables/parquets, so both session objects are necessary for their respective use cases.
Learnt from: de-sh
PR: parseablehq/parseable#1236
File: src/prism/logstream/mod.rs:332-332
Timestamp: 2025-03-13T11:39:52.587Z
Learning: SQL injection concerns can be ignored in this codebase as all SQL queries are run against immutable data streams, limiting the potential impact of any injection.
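The existence-check pattern quoted in the learnings above relies on short-circuit evaluation. A synchronous, self-contained rendering (with `Mode`, the catalog flag, and the storage fallback as stand-ins for the async `PARSEABLE` calls) looks like this:

```rust
// A stream is "missing" only if it is not in memory AND (we are not in query
// mode, OR the storage fallback also failed). Short-circuiting means the
// storage fallback runs only in query mode, and only when the stream is absent.
#[derive(PartialEq)]
enum Mode {
    Query,
    Standalone,
}

fn stream_missing(
    in_memory: bool,
    mode: &Mode,
    create_from_storage: impl Fn() -> bool, // true if storage recovery succeeded
) -> bool {
    !in_memory && (*mode != Mode::Query || !create_from_storage())
}

fn main() {
    // Already in memory: never missing, fallback never runs.
    assert!(!stream_missing(true, &Mode::Query, || false));
    // Standalone mode has no storage fallback.
    assert!(stream_missing(false, &Mode::Standalone, || true));
    // Query mode recovers the stream from storage.
    assert!(!stream_missing(false, &Mode::Query, || true));
    // Query mode, but the fallback fails: genuinely missing.
    assert!(stream_missing(false, &Mode::Query, || false));
}
```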
src/parseable/mod.rs (2)
Learnt from: de-sh
PR: parseablehq/parseable#1185
File: src/handlers/http/logstream.rs:255-261
Timestamp: 2025-02-14T09:49:25.818Z
Learning: In Parseable's logstream handlers, stream existence checks must be performed for both query and standalone modes. The pattern `!PARSEABLE.streams.contains(&stream_name) && (PARSEABLE.options.mode != Mode::Query || !PARSEABLE.create_stream_and_schema_from_storage(&stream_name).await?)` ensures proper error handling in both modes.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1346
File: src/parseable/streams.rs:319-331
Timestamp: 2025-06-16T09:50:38.636Z
Learning: In Parseable's Ingest or Query mode, the node_id is always available because it's generated during server initialization itself, before the get_node_id_string() function in streams.rs would be called. This makes the .expect() calls on QUERIER_META.get() and INGESTOR_META.get() safe in this context.
src/handlers/http/demo_data.rs (1)
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1346
File: src/handlers/http/health_check.rs:81-90
Timestamp: 2025-06-16T02:04:58.990Z
Learning: In the shutdown function in src/handlers/http/health_check.rs, the design approach is to log errors from sync operations rather than propagate them. This is intentional because the shutdown function is called on SIGTERM/SIGINT signals, and the goal is to perform best-effort cleanup (syncing pending files to object storage) while allowing the shutdown to proceed regardless of sync failures. Logging provides debugging information without blocking the shutdown process.
src/alerts/alerts_utils.rs (2)
Learnt from: de-sh
PR: parseablehq/parseable#1185
File: src/handlers/http/logstream.rs:255-261
Timestamp: 2025-02-14T09:49:25.818Z
Learning: In Parseable's logstream handlers, stream existence checks must be performed for both query and standalone modes. The pattern `!PARSEABLE.streams.contains(&stream_name) && (PARSEABLE.options.mode != Mode::Query || !PARSEABLE.create_stream_and_schema_from_storage(&stream_name).await?)` ensures proper error handling in both modes.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/query/mod.rs:64-66
Timestamp: 2025-06-18T06:39:04.775Z
Learning: In src/query/mod.rs, QUERY_SESSION_STATE and QUERY_SESSION serve different architectural purposes: QUERY_SESSION_STATE is used for stats calculation and allows dynamic registration of individual parquet files from the staging path (files created every minute), while QUERY_SESSION is used for object store queries with the global schema provider. Session contexts with schema providers don't support registering individual tables/parquets, so both session objects are necessary for their respective use cases.
src/handlers/airplane.rs (1)
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/query/mod.rs:64-66
Timestamp: 2025-06-18T06:39:04.775Z
Learning: In src/query/mod.rs, QUERY_SESSION_STATE and QUERY_SESSION serve different architectural purposes: QUERY_SESSION_STATE is used for stats calculation and allows dynamic registration of individual parquet files from the staging path (files created every minute), while QUERY_SESSION is used for object store queries with the global schema provider. Session contexts with schema providers don't support registering individual tables/parquets, so both session objects are necessary for their respective use cases.
🧬 Code Graph Analysis (1)
src/handlers/airplane.rs (1)
src/handlers/http/query.rs (2)
  • query (112-146)
  • into_query (469-492)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: coverage
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (7)
src/handlers/http/demo_data.rs (1)

66-66: LGTM! Simplified error propagation.

The change correctly preserves the original error type instead of unnecessarily wrapping it, which aligns with the broader error handling simplification theme across this PR.
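A minimal contrast for this change, with hypothetical stand-in functions for the handler in demo_data.rs: propagating the original error with `?` keeps its type intact for upstream matching and logging, where wrapping it would not.

```rust
use std::io;

// Stand-in for the ingestor call; always fails here for demonstration.
fn get_demo_data_from_ingestor() -> Result<String, io::Error> {
    Err(io::Error::new(
        io::ErrorKind::ConnectionRefused,
        "ingestor unreachable",
    ))
}

// After the change: the io::Error reaches the caller unmodified.
fn get_demo_data() -> Result<String, io::Error> {
    let data = get_demo_data_from_ingestor()?; // original error, no wrapping
    Ok(data)
}

fn main() {
    let err = get_demo_data().unwrap_err();
    // The original error kind survives because nothing re-wrapped it.
    assert_eq!(err.kind(), io::ErrorKind::ConnectionRefused);
}
```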

src/handlers/airplane.rs (1)

151-153: LGTM! Updated to match simplified into_query signature.

The removal of the streams parameter correctly aligns with the refactored into_query function signature, as seen in the relevant code snippets from src/handlers/http/query.rs.

src/parseable/mod.rs (2)

45-48: LGTM! Added necessary import for schema commitment.

The import of commit_schema is correctly added to support the schema commitment integration in the stream creation flow.


340-341: Excellent fix! Schema commitment properly integrated into stream creation.

This change directly addresses the PR objective by ensuring that the schema is committed to memory immediately after stream creation from storage. The error handling is appropriate, converting the error to StreamError::Anyhow to maintain the function's error contract.
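A hedged sketch of the integration described here: after the stream is created from storage, its schema is committed to memory, and any commit failure is mapped into the function's error type. `StreamError::Anyhow` mirrors the variant named in the review; the function bodies are illustrative stand-ins.

```rust
#[derive(Debug)]
enum StreamError {
    Anyhow(String),
}

// Stand-in for the schema commit step.
fn commit_schema(stream_name: &str) -> Result<(), String> {
    if stream_name.is_empty() {
        return Err("cannot commit schema for unnamed stream".to_string());
    }
    Ok(())
}

fn create_stream_and_schema_from_storage(stream_name: &str) -> Result<bool, StreamError> {
    // ... stream and schema loaded from object storage here ...
    // New step: commit the merged schema to memory before returning,
    // converting the error to preserve the function's error contract.
    commit_schema(stream_name).map_err(StreamError::Anyhow)?;
    Ok(true)
}

fn main() {
    assert!(create_stream_and_schema_from_storage("app_logs").is_ok());
    assert!(matches!(
        create_stream_and_schema_from_storage(""),
        Err(StreamError::Anyhow(_))
    ));
}
```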

src/alerts/alerts_utils.rs (3)

32-32: LGTM! Cleaned up unused import.

The removal of unused warn import improves code cleanliness.


36-36: LGTM! Added necessary import for streamlined stream creation.

The import of create_streams_for_distributed supports the simplified stream creation approach used in the prepare_query function.


81-86: Excellent refactoring! Simplified stream creation logic.

The replacement of manual fallback logic with a direct call to create_streams_for_distributed significantly simplifies the code while maintaining the same functionality. The error handling properly wraps the error in a custom alert error message, maintaining the function's error contract.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
src/handlers/http/query.rs (2)

100-100: Consider removing the empty line.

This empty line appears to be leftover from the code reorganization and can be removed for cleaner code formatting.


118-118: Consider removing the empty line.

This empty line appears to be leftover from the code reorganization and can be removed for cleaner code formatting.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a883589 and 0399ba1.

📒 Files selected for processing (1)
  • src/handlers/http/query.rs (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: de-sh
PR: parseablehq/parseable#1185
File: src/handlers/http/logstream.rs:255-261
Timestamp: 2025-02-14T09:49:25.818Z
Learning: In Parseable's logstream handlers, stream existence checks must be performed for both query and standalone modes. The pattern `!PARSEABLE.streams.contains(&stream_name) && (PARSEABLE.options.mode != Mode::Query || !PARSEABLE.create_stream_and_schema_from_storage(&stream_name).await?)` ensures proper error handling in both modes.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/query/mod.rs:64-66
Timestamp: 2025-06-18T06:39:04.775Z
Learning: In src/query/mod.rs, QUERY_SESSION_STATE and QUERY_SESSION serve different architectural purposes: QUERY_SESSION_STATE is used for stats calculation and allows dynamic registration of individual parquet files from the staging path (files created every minute), while QUERY_SESSION is used for object store queries with the global schema provider. Session contexts with schema providers don't support registering individual tables/parquets, so both session objects are necessary for their respective use cases.
src/handlers/http/query.rs (2)
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/query/mod.rs:64-66
Timestamp: 2025-06-18T06:39:04.775Z
Learning: In src/query/mod.rs, QUERY_SESSION_STATE and QUERY_SESSION serve different architectural purposes: QUERY_SESSION_STATE is used for stats calculation and allows dynamic registration of individual parquet files from the staging path (files created every minute), while QUERY_SESSION is used for object store queries with the global schema provider. Session contexts with schema providers don't support registering individual tables/parquets, so both session objects are necessary for their respective use cases.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1305
File: src/handlers/http/users/dashboards.rs:0-0
Timestamp: 2025-05-01T10:27:56.858Z
Learning: The `add_tile()` function in `src/handlers/http/users/dashboards.rs` should use `get_dashboard_by_user(dashboard_id, &user_id)` instead of `get_dashboard(dashboard_id)` to ensure proper authorization checks when modifying a dashboard.
🧬 Code Graph Analysis (1)
src/handlers/http/query.rs (2)
src/utils/actix.rs (1)
  • extract_session_key_from_req (51-71)
src/utils/mod.rs (1)
  • user_auth_for_datasets (91-150)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
🔇 Additional comments (2)
src/handlers/http/query.rs (2)

92-92: Good fix: Schema update now happens after logical query construction.

This repositioning ensures that streams are created from storage (if needed) during into_query() before the schema is merged and committed to memory. This addresses the core issue described in the PR objectives.


121-121: Good fix: Consistent schema update positioning.

This repositioning mirrors the fix in get_records_and_fields and ensures the same correct ordering: logical query construction → schema update → authentication/authorization. This maintains consistency across both query functions.

coderabbitai bot previously approved these changes Jul 18, 2025
@nitisht nitisht merged commit 1d4e858 into parseablehq:main Jul 19, 2025
13 checks passed