Updates #1408
Conversation
Walkthrough
Adds runtime validation in alerts to require exactly one resolved dataset; adds an UPDATE_STREAM_KEY header so PUT /stream can distinguish updates from creates; and makes counts SQL generation and time filtering honor each stream's time partition.
Changes
Sequence Diagram(s)
```mermaid
sequenceDiagram
autonumber
participant Client
participant Server as Alerts API
participant Resolver as Dataset Resolver
participant Builder as Alert Builder
Client->>Server: Submit AlertRequest (query)
Server->>Resolver: resolve_stream_names(query)
Resolver-->>Server: datasets
Server->>Server: Check datasets.len() == 1
alt invalid count
Server-->>Client: AlertError::ValidationFailure (includes found datasets)
else valid
Server->>Builder: Construct AlertConfig
Server-->>Client: Alert created
end
```
```mermaid
sequenceDiagram
autonumber
participant Client
participant API as HTTP PUT /stream
participant Store as Log Stream Store
Client->>API: PUT stream (headers include UPDATE_STREAM_KEY)
API->>API: Read UPDATE_STREAM_KEY == "true" ?
alt Update
API->>Store: Update stream
Store-->>API: OK
API-->>Client: "Log stream updated" (200)
else Create
API->>Store: Create stream
Store-->>API: OK
API-->>Client: "Log stream created" (200)
end
```
```mermaid
sequenceDiagram
autonumber
participant Query as CountsRequest
participant Meta as Stream Metadata
participant Planner as SQL Planner
participant Querier as Querier (get_records_and_fields)
Query->>Meta: Get time_partition for stream
Meta-->>Query: time_partition or DEFAULT_TIMESTAMP_KEY
Query->>Planner: Build DATE_BIN using time_partition
Planner->>Planner: Apply time filters on chosen partition column
Planner->>Querier: Generate SQL via get_df_sql(time_partition)
Querier-->>Query: Record batches / fields
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/handlers/http/modal/query/querier_logstream.rs (1)
121-123
: Lock is released immediately; guard must be held for the whole critical section. let _ = CREATE_STREAM_LOCK.lock().await; drops the guard immediately, making the lock ineffective. This can cause concurrent create/update races.
Use a named binding to hold the guard until the end of scope:
```diff
- let _ = CREATE_STREAM_LOCK.lock().await;
+ let _guard = CREATE_STREAM_LOCK.lock().await;
```
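This is the classic `let _ = ...` gotcha; a standalone illustration (not project code) of why the underscore pattern releases the lock immediately:
```rust
struct Guard(&'static str);

impl Drop for Guard {
    fn drop(&mut self) {
        println!("released: {}", self.0);
    }
}

fn main() {
    let _ = Guard("lock A"); // `_` binds nothing: "released: lock A" prints immediately
    let _guard = Guard("lock B"); // named binding lives until the end of scope
    println!("critical section runs here");
} // "released: lock B" prints only now
```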
🧹 Nitpick comments (9)
src/alerts/alert_structs.rs (1)
273-279
: Tighten validation message; optionally dedup datasets before length check
- Minor: “Found- …” reads odd; use “Found: …”.
- Optional: If resolve_stream_names can return duplicates (e.g., repeated references), consider dedup before length check to avoid false negatives.
Apply this small wording fix:
- "Query should include only one dataset. Found- {datasets:?}" + "Query should include only one dataset. Found: {datasets:?}"If you want to dedup (outside this hunk), the pattern would be:
// just for illustration; this change is outside the selected lines let mut datasets = resolve_stream_names(&self.query)?; datasets.sort(); datasets.dedup();src/handlers/http/modal/query/querier_logstream.rs (3)
134-138
: Consider 201 Created for creations; keep 200 OK for updates. Not required, but more idiomatic HTTP semantics.
```diff
- if is_update {
-     Ok(("Log stream updated", StatusCode::OK))
- } else {
-     Ok(("Log stream created", StatusCode::OK))
- }
+ if is_update {
+     Ok(("Log stream updated", StatusCode::OK))
+ } else {
+     Ok(("Log stream created", StatusCode::CREATED))
+ }
```
41-44
: Typo in symbol name merge_quried_stats. The util is named merge_quried_stats (missing e). Not blocking, but it’s a small readability footgun and spreads if re-used.
Follow-up (separate PR): rename to merge_queried_stats and adjust call sites.
36-46
: Add missing documentation for UPDATE_STREAM_KEY. UPDATE_STREAM_KEY is correctly exported as a public constant in src/handlers/mod.rs (line 35), but lacks any doc comment for client users. Please add a /// comment above it describing its purpose and expected values.
• File: src/handlers/mod.rs, line 35:
```rust
/// HTTP header key used to signal stream updates.
/// Clients should set this to `"true"` to enable update mode.
pub const UPDATE_STREAM_KEY: &str = "x-p-update-stream";
```
src/query/mod.rs (5)
348-348
: Prefer lazy default to avoid needless allocation. unwrap_or(default.clone()) evaluates the default even when not needed. Minor perf nit.
```diff
- .unwrap_or(event::DEFAULT_TIMESTAMP_KEY.to_owned());
+ .unwrap_or_else(|| event::DEFAULT_TIMESTAMP_KEY.to_owned());
```
461-463
: Qualify both DATE_BIN column references consistently. The second DATE_BIN uses an unqualified column; qualify it with the table to avoid ambiguity and keep style consistent.
- "CAST(DATE_BIN('1 minute', \"{}\".\"{time_partition}\", TIMESTAMP '1970-01-01 00:00:00+00') AS TEXT) as start_time, DATE_BIN('1 minute', \"{time_partition}\", TIMESTAMP '1970-01-01 00:00:00+00') + INTERVAL '1 minute' as end_time", - self.stream + "CAST(DATE_BIN('1 minute', \"{table}\".\"{time_partition}\", TIMESTAMP '1970-01-01 00:00:00+00') AS TEXT) as start_time, DATE_BIN('1 minute', \"{table}\".\"{time_partition}\", TIMESTAMP '1970-01-01 00:00:00+00') + INTERVAL '1 minute' as end_time", + table = self.stream
467-469
: Same consistency improvement for the 1 hour bin
```diff
- "CAST(DATE_BIN('1 hour', \"{}\".\"{time_partition}\", TIMESTAMP '1970-01-01 00:00:00+00') AS TEXT) as start_time, DATE_BIN('1 hour', \"{time_partition}\", TIMESTAMP '1970-01-01 00:00:00+00') + INTERVAL '1 hour' as end_time",
- self.stream
+ "CAST(DATE_BIN('1 hour', \"{table}\".\"{time_partition}\", TIMESTAMP '1970-01-01 00:00:00+00') AS TEXT) as start_time, DATE_BIN('1 hour', \"{table}\".\"{time_partition}\", TIMESTAMP '1970-01-01 00:00:00+00') + INTERVAL '1 hour' as end_time",
+ table = self.stream
```
473-475
: Same consistency improvement for the 1 day bin
```diff
- "CAST(DATE_BIN('1 day', \"{}\".\"{time_partition}\", TIMESTAMP '1970-01-01 00:00:00+00') AS TEXT) as start_time, DATE_BIN('1 day', \"{time_partition}\", TIMESTAMP '1970-01-01 00:00:00+00') + INTERVAL '1 day' as end_time",
- self.stream
+ "CAST(DATE_BIN('1 day', \"{table}\".\"{time_partition}\", TIMESTAMP '1970-01-01 00:00:00+00') AS TEXT) as start_time, DATE_BIN('1 day', \"{table}\".\"{time_partition}\", TIMESTAMP '1970-01-01 00:00:00+00') + INTERVAL '1 day' as end_time",
+ table = self.stream
```
448-489
: Add tests to lock in SQL generation for custom time partitions. Consider unit tests for get_df_sql verifying:
- time_partition != DEFAULT_TIMESTAMP_KEY
- bin widths across the three ranges produce correct DATE_BIN with the partition column qualified
I can scaffold tests asserting the generated SQL contains the expected "table"."partition_col" references for 1 minute/hour/day bins. Want me to push a test module?
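A minimal sketch of such a test, assuming CountsRequest can be constructed directly and that get_df_sql takes the time column as an argument; the field names here are inferred from this PR's context and may differ from the real struct:
```rust
#[cfg(test)]
mod get_df_sql_tests {
    use super::*;

    #[tokio::test]
    async fn qualifies_custom_time_partition_for_minute_bins() {
        // Hypothetical construction; adjust fields to the real CountsRequest.
        let req = CountsRequest {
            stream: "demo_stream".into(),
            start_time: "2025-01-01T00:00:00Z".into(),
            end_time: "2025-01-01T01:00:00Z".into(), // 60 minutes -> 1 minute bins
            num_bins: 10,
            conditions: None,
        };

        let sql = req.get_df_sql("event_ts".into()).await.unwrap();

        // Both DATE_BIN occurrences should reference "demo_stream"."event_ts".
        assert_eq!(sql.matches(r#""demo_stream"."event_ts""#).count(), 2);
    }
}
```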
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
💡 Knowledge Base configuration:
- MCP integration is disabled by default for public repositories
- Jira integration is disabled by default for public repositories
- Linear integration is disabled by default for public repositories
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
- src/alerts/alert_structs.rs (1 hunks)
- src/handlers/http/modal/query/querier_logstream.rs (2 hunks)
- src/query/mod.rs (3 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-08-18T12:37:47.703Z
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1405
File: src/parseable/mod.rs:528-533
Timestamp: 2025-08-18T12:37:47.703Z
Learning: In Parseable, the validate_time_partition function in src/utils/json/flatten.rs already provides a default time partition limit of 30 days using `map_or(30, |days| days.get() as i64)` when time_partition_limit is None, so no additional defaulting is needed in the stream creation logic in src/parseable/mod.rs.
Applied to files:
src/query/mod.rs
📚 Learning: 2025-02-14T09:49:25.818Z
Learnt from: de-sh
PR: parseablehq/parseable#1185
File: src/handlers/http/logstream.rs:255-261
Timestamp: 2025-02-14T09:49:25.818Z
Learning: In Parseable's logstream handlers, stream existence checks must be performed for both query and standalone modes. The pattern `!PARSEABLE.streams.contains(&stream_name) && (PARSEABLE.options.mode != Mode::Query || !PARSEABLE.create_stream_and_schema_from_storage(&stream_name).await?)` ensures proper error handling in both modes.
Applied to files:
src/handlers/http/modal/query/querier_logstream.rs
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (10)
- GitHub Check: Quest Smoke and Load Tests for Distributed deployments
- GitHub Check: Build Default x86_64-pc-windows-msvc
- GitHub Check: Quest Smoke and Load Tests for Standalone deployments
- GitHub Check: Build Default x86_64-apple-darwin
- GitHub Check: coverage
- GitHub Check: Build Default x86_64-unknown-linux-gnu
- GitHub Check: Build Default aarch64-unknown-linux-gnu
- GitHub Check: Build Kafka x86_64-unknown-linux-gnu
- GitHub Check: Build Default aarch64-apple-darwin
- GitHub Check: Build Kafka aarch64-apple-darwin
🔇 Additional comments (3)
src/alerts/alert_structs.rs (1)
273-279
: Good guard: enforce exactly one dataset in alert queries. Early validation prevents misconfigured alerts spanning 0 or multiple datasets. Placement before AlertConfig construction is correct.
src/query/mod.rs (2)
662-662
: Time-filter detection now honors custom time partition as well as default timestamp. This closes the gap where queries using a non-default time partition would miss the automatic time filters.
448-453
: Unnecessary manual mapping — From<StreamNotFound> is already derived. The QueryError enum in src/query/mod.rs already includes StreamNotFound(#[from] StreamNotFound), which the thiserror macro uses to generate impl From<StreamNotFound> for QueryError. Consequently,
```rust
let time_partition = PARSEABLE
    .get_stream(&self.stream)?
    .get_time_partition()
    .unwrap_or(DEFAULT_TIMESTAMP_KEY.into());
```
will compile as-is, and the use of ? is correct and consistent with other variants. No change is needed here. Likely an incorrect or invalid review comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/handlers/http/cluster/utils.rs (1)
138-176
: Fix panic-prone index and order dependency in merge_queried_stats. Indexing stats[1] will panic for len < 2 and assumes a specific caller push order. Make it robust: derive the stream name from the first non-empty entry, handle empty vectors gracefully, and avoid relying on element order. Also, avoid overwriting format on every fold; keep the first non-empty format.
Apply:
```diff
 pub fn merge_queried_stats(stats: Vec<QueriedStats>) -> QueriedStats {
-    // get the stream name
-    let stream_name = stats[1].stream.clone();
+    // Pick a stable stream name: first non-empty if available, else empty
+    let stream_name = stats
+        .iter()
+        .find_map(|s| if !s.stream.is_empty() { Some(s.stream.clone()) } else { None })
+        .unwrap_or_default();

     let min_time = stats.iter().map(|x| x.time).min().unwrap_or_else(Utc::now);

-    let cumulative_ingestion =
-        stats
-            .iter()
-            .map(|x| &x.ingestion)
-            .fold(IngestionStats::default(), |acc, x| IngestionStats {
-                count: acc.count + x.count,
-                size: acc.size + x.size,
-                format: x.format.clone(),
-                lifetime_count: acc.lifetime_count + x.lifetime_count,
-                lifetime_size: acc.lifetime_size + x.lifetime_size,
-                deleted_count: acc.deleted_count + x.deleted_count,
-                deleted_size: acc.deleted_size + x.deleted_size,
-            });
+    let cumulative_ingestion = stats
+        .iter()
+        .map(|x| &x.ingestion)
+        .fold(IngestionStats::default(), |mut acc, x| {
+            acc.count += x.count;
+            acc.size += x.size;
+            if acc.format.is_empty() {
+                acc.format = x.format.clone();
+            }
+            acc.lifetime_count += x.lifetime_count;
+            acc.lifetime_size += x.lifetime_size;
+            acc.deleted_count += x.deleted_count;
+            acc.deleted_size += x.deleted_size;
+            acc
+        });

-    let cumulative_storage =
-        stats
-            .iter()
-            .map(|x| &x.storage)
-            .fold(StorageStats::default(), |acc, x| StorageStats {
-                size: acc.size + x.size,
-                format: x.format.clone(),
-                lifetime_size: acc.lifetime_size + x.lifetime_size,
-                deleted_size: acc.deleted_size + x.deleted_size,
-            });
+    let cumulative_storage = stats
+        .iter()
+        .map(|x| &x.storage)
+        .fold(StorageStats::default(), |mut acc, x| {
+            acc.size += x.size;
+            if acc.format.is_empty() {
+                acc.format = x.format.clone();
+            }
+            acc.lifetime_size += x.lifetime_size;
+            acc.deleted_size += x.deleted_size;
+            acc
+        });

     QueriedStats::new(
         &stream_name,
         min_time,
         cumulative_ingestion,
         cumulative_storage,
     )
 }
```
♻️ Duplicate comments (1)
src/handlers/http/modal/query/querier_logstream.rs (1)
126-130
: Avoid panic on invalid header and support case-insensitive boolean parsing. to_str().unwrap() will panic on non-UTF8 header values, and only "true" (lowercase) is recognized.
Apply:
```diff
- let is_update = if let Some(val) = headers.get(UPDATE_STREAM_KEY) {
-     val.to_str().unwrap() == "true"
- } else {
-     false
- };
+ let is_update = headers
+     .get(UPDATE_STREAM_KEY)
+     .and_then(|v| v.to_str().ok())
+     .map(|s| s.trim().eq_ignore_ascii_case("true"))
+     .unwrap_or(false);
```
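For reference, a self-contained check (not project code) of which values the proposed parsing accepts and rejects:
```rust
fn parse_update_flag(raw: Option<&str>) -> bool {
    raw.map(|s| s.trim().eq_ignore_ascii_case("true"))
        .unwrap_or(false)
}

fn main() {
    assert!(parse_update_flag(Some("true")));
    assert!(parse_update_flag(Some(" TRUE ")));
    assert!(!parse_update_flag(Some("yes")));
    assert!(!parse_update_flag(None));
    println!("all header-flag cases behave as expected");
}
```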
🧹 Nitpick comments (4)
src/handlers/http/modal/query/querier_logstream.rs (2)
121-121
: Lock guard binding nit acknowledged. Renaming to _guard makes the intent explicit and avoids accidental drop; semantics unchanged.
234-235
: Call site updated to merge_queried_stats; consider upstream robustness. The usage is fine. Be aware the current implementation in utils.rs previously depended on vector ordering; with the proposed fix there, call sites become order-agnostic.
src/query/mod.rs (2)
647-663
: Also check the column’s relation when detecting time filters. You now consider both the per-stream partition column and the default. Good. For extra safety, also match the relation (table name) to avoid false positives where similarly named columns exist in more complex plans.
Apply this diff:
```diff
- table
-     .filters
-     .iter()
-     .filter_map(|x| {
-         if let Expr::BinaryExpr(binexpr) = x {
-             Some(binexpr)
-         } else {
-             None
-         }
-     })
-     .any(|expr| {
-         matches!(&*expr.left, Expr::Column(Column { name, .. })
-         if name == &default_timestamp || name == time_column)
-     })
+ table
+     .filters
+     .iter()
+     .filter_map(|x| {
+         if let Expr::BinaryExpr(binexpr) = x {
+             Some(binexpr)
+         } else {
+             None
+         }
+     })
+     .any(|expr| match &*expr.left {
+         Expr::Column(Column { relation, name }) => {
+             // Either explicitly qualified with this table, or unqualified
+             let rel_matches = relation
+                 .as_ref()
+                 .map(|r| r.as_ref() == table.table_name.as_ref())
+                 .unwrap_or(true);
+             rel_matches && (name == &default_timestamp || name == time_column)
+         }
+         _ => false,
+     })
```
344-385
: Use exclusive end boundary and safe timestamp conversion
- Avoid unwrapping from_timestamp_millis to prevent panics on out-of-range values.
- Align with other filters’ [start, end) semantics by making the end boundary exclusive.
File: src/query/mod.rs (within the manifest-file filtering loop)
```diff
-TypedStatistics::Int(Int64Type { min, .. }) => {
-    let min = DateTime::from_timestamp_millis(*min).unwrap();
-    bin.start <= min && bin.end >= min
-}
+TypedStatistics::Int(Int64Type { min, .. }) => {
+    if let Some(min_ts) = DateTime::from_timestamp_millis(*min) {
+        // [start, end) to match transform()’s excluded end bound
+        bin.start <= min_ts && bin.end > min_ts
+    } else {
+        false
+    }
+}
```
📜 Review details
📒 Files selected for processing (5)
- src/alerts/alert_structs.rs (1 hunks)
- src/handlers/http/cluster/utils.rs (1 hunks)
- src/handlers/http/modal/query/querier_logstream.rs (3 hunks)
- src/prism/logstream/mod.rs (2 hunks)
- src/query/mod.rs (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/alerts/alert_structs.rs
🧬 Code Graph Analysis (3)
src/prism/logstream/mod.rs (1)
- src/handlers/http/cluster/utils.rs (1): merge_queried_stats (138-176)
src/query/mod.rs (1)
- src/utils/time.rs (1): parse_human_time (80-102)
src/handlers/http/modal/query/querier_logstream.rs (3)
- src/handlers/http/mod.rs (1): base_path_without_preceding_slash (79-81)
- src/handlers/http/cluster/mod.rs (3): fetch_daily_stats (469-494), fetch_stats_from_ingestors (497-557), sync_streams_with_ingestors (113-165)
- src/handlers/http/cluster/utils.rs (1): merge_queried_stats (138-176)
🔇 Additional comments (5)
src/prism/logstream/mod.rs (1)
33-34
: All references updated — no stale occurrences of merge_query_stats remain. Rename adoption looks correct (imports and calls updated to merge_queried_stats). No functional change; aligns with the exported symbol rename. Verified in:
- src/prism/logstream/mod.rs (lines 33–34, 137–140)
src/handlers/http/modal/query/querier_logstream.rs (2)
36-46
: Import updates look good and consistent with the symbol rename and new header key. No issues spotted; the module paths and re-exports line up.
134-138
: Response codes for create vs update are sensible. OK for clients to differentiate between created (201) and updated (200).
src/query/mod.rs (2)
59-59
: Import simplification looks good. Directly importing DEFAULT_TIMESTAMP_KEY is clearer and avoids the unused event module reference.
448-453
: No compilation issue: QueryError already handles StreamNotFound. The QueryError enum derives #[from] StreamNotFound, so the bare ? on get_stream(&self.stream)? compiles and maps into QueryError::StreamNotFound. The suggested map_err(...)? is unnecessary and purely cosmetic. No changes needed. Likely an incorrect or invalid review comment.
Force-pushed: 7f78686 → cb6e5ec → 6d7e0f9 (compare).
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/handlers/http/cluster/utils.rs (1)
138-176
: Indexing bug: stats[1] can panic; prefer robust stream selection and handle empty input. Accessing stats[1] will panic when fewer than 2 entries are provided. This is fragile and coupled to call-site ordering. Also, choosing format from the last folded element is nondeterministic.
Apply this safer merge:
```diff
 pub fn merge_queried_stats(stats: Vec<QueriedStats>) -> QueriedStats {
-    // get the stream name
-    let stream_name = stats[1].stream.clone();
+    // Defensive: handle empty input and prefer a non-empty stream name
+    // (typically the local stat appended last)
+    if stats.is_empty() {
+        return QueriedStats::new(
+            "",
+            Utc::now(),
+            IngestionStats::default(),
+            StorageStats::default(),
+        );
+    }
+    // Prefer the last non-empty stream name; fall back to the first element
+    let stream_name = stats
+        .iter()
+        .rev()
+        .find(|s| !s.stream.is_empty())
+        .map(|s| s.stream.clone())
+        .unwrap_or_else(|| stats[0].stream.clone());
@@
-        .fold(IngestionStats::default(), |acc, x| IngestionStats {
-            count: acc.count + x.count,
-
-            size: acc.size + x.size,
-            format: x.format.clone(),
-            lifetime_count: acc.lifetime_count + x.lifetime_count,
-            lifetime_size: acc.lifetime_size + x.lifetime_size,
-            deleted_count: acc.deleted_count + x.deleted_count,
-            deleted_size: acc.deleted_size + x.deleted_size,
-        });
+        .fold(IngestionStats::default(), |acc, x| IngestionStats {
+            count: acc.count + x.count,
+            size: acc.size + x.size,
+            // Keep the first non-empty format to avoid non-determinism
+            format: if acc.format.is_empty() { x.format.clone() } else { acc.format.clone() },
+            lifetime_count: acc.lifetime_count + x.lifetime_count,
+            lifetime_size: acc.lifetime_size + x.lifetime_size,
+            deleted_count: acc.deleted_count + x.deleted_count,
+            deleted_size: acc.deleted_size + x.deleted_size,
+        });
@@
-        .fold(StorageStats::default(), |acc, x| StorageStats {
-            size: acc.size + x.size,
-            format: x.format.clone(),
-            lifetime_size: acc.lifetime_size + x.lifetime_size,
-            deleted_size: acc.deleted_size + x.deleted_size,
-        });
+        .fold(StorageStats::default(), |acc, x| StorageStats {
+            size: acc.size + x.size,
+            // Keep the first non-empty format to avoid non-determinism
+            format: if acc.format.is_empty() { x.format.clone() } else { acc.format.clone() },
+            lifetime_size: acc.lifetime_size + x.lifetime_size,
+            deleted_size: acc.deleted_size + x.deleted_size,
+        });
```
♻️ Duplicate comments (1)
src/handlers/http/modal/query/querier_logstream.rs (1)
126-131
: Avoid panic on invalid header and support case-insensitive boolean. to_str().unwrap() can panic; also consider "True"/"TRUE" and whitespace.
```diff
- let is_update = if let Some(val) = headers.get(UPDATE_STREAM_KEY) {
-     val.to_str().unwrap() == "true"
- } else {
-     false
- };
+ let is_update = headers
+     .get(UPDATE_STREAM_KEY)
+     .and_then(|v| v.to_str().ok())
+     .map(|s| s.trim().eq_ignore_ascii_case("true"))
+     .unwrap_or(false);
```
🧹 Nitpick comments (2)
src/handlers/http/modal/query/querier_logstream.rs (1)
134-138
: Consider 201 Created for new stream creation. Semantics: respond with StatusCode::CREATED for creations to align with HTTP conventions; keep 200 OK for updates.
```diff
- if is_update {
-     Ok(("Log stream updated", StatusCode::OK))
- } else {
-     Ok(("Log stream created", StatusCode::OK))
- }
+ if is_update {
+     Ok(("Log stream updated", StatusCode::OK))
+ } else {
+     Ok(("Log stream created", StatusCode::CREATED))
+ }
```
src/query/mod.rs (1)
461-463
: Fully-qualifying both DATE_BIN occurrences fixes ambiguity; consider building via DataFusion expressions. The qualification with "table"."column" for both start_time and end_time avoids ambiguous/unresolved columns. Optional: construct the projection using DataFusion Exprs (date_bin function + alias) rather than string formatting to avoid identifier quoting edge cases; a sketch follows below.
Also applies to: 467-469, 473-475
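A rough sketch of that expression-based construction; the date_bin helper path and ScalarValue constructors below are assumptions against a recent DataFusion and may need adjusting to the project's version:
```rust
use datafusion::functions::datetime::expr_fn::date_bin;
use datafusion::prelude::{col, lit, Expr};
use datafusion::scalar::ScalarValue;

// Build start_time/end_time projections as logical expressions instead of
// formatted SQL strings, sidestepping identifier-quoting edge cases.
fn binned_times(table: &str, time_column: &str) -> (Expr, Expr) {
    // 1-minute stride as a MonthDayNano interval (0 months, 0 days, 60s in ns)
    let minute = || lit(ScalarValue::new_interval_mdn(0, 0, 60 * 1_000_000_000));
    let origin = lit(ScalarValue::TimestampNanosecond(Some(0), None)); // 1970-01-01T00:00:00
    let source = col(format!(r#""{table}"."{time_column}""#));

    let start = date_bin(minute(), source.clone(), origin.clone()).alias("start_time");
    let end = (date_bin(minute(), source, origin) + minute()).alias("end_time");
    (start, end)
}
```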
📜 Review details
📒 Files selected for processing (5)
- src/alerts/alert_structs.rs (1 hunks)
- src/handlers/http/cluster/utils.rs (1 hunks)
- src/handlers/http/modal/query/querier_logstream.rs (3 hunks)
- src/prism/logstream/mod.rs (2 hunks)
- src/query/mod.rs (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- src/alerts/alert_structs.rs
- src/prism/logstream/mod.rs
🧬 Code Graph Analysis (2)
src/query/mod.rs (1)
- src/utils/time.rs (1): parse_human_time (80-102)
src/handlers/http/modal/query/querier_logstream.rs (3)
- src/handlers/http/mod.rs (1): base_path_without_preceding_slash (79-81)
- src/handlers/http/cluster/mod.rs (2): fetch_stats_from_ingestors (497-557), sync_streams_with_ingestors (113-165)
- src/handlers/http/cluster/utils.rs (1): merge_queried_stats (138-176)
🔇 Additional comments (7)
src/handlers/http/modal/query/querier_logstream.rs (3)
36-46
: Import updates and rename alignment look good. Using UPDATE_STREAM_KEY and merge_queried_stats from the new paths is consistent with the PR-wide renames.
121-121
: Good: keep lock guard bound to a name. Binding the mutex guard to _guard ensures the lock is held for the intended scope, avoiding accidental early drop.
232-235
: LGTM on adopting merge_queried_stats. Call site correctly appends local stats to ingestor stats before merging. Once merge_queried_stats is fixed to avoid indexing at [1], this path is solid.
src/query/mod.rs (4)
60-60
: Import simplification is fine. Directly importing DEFAULT_TIMESTAMP_KEY improves clarity without changing behavior.
344-349
: Sane fallback to default timestamp key. Using DEFAULT_TIMESTAMP_KEY when time_partition is absent ensures robust manifest scanning.
448-453
: Good: consistently derive time_partition for SQL binning. Fetching per-stream time_partition (with default fallback) before constructing the SQL keeps binning aligned with stream configuration.
660-663
: Time filter detection now accounts for dynamic partition column. Checking for either the default timestamp or the resolved time_partition reduces missed pushdowns. Looks good.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/handlers/http/cluster/mod.rs (1)
1125-1158
: Compile-time type mismatch (NodeMetadata vs QuerierMetadata) and lock held across await
- get_node_info(NodeType::Querier) is typed here as Vec, but you later store into QuerierStatus.metadata (QuerierMetadata). Assigning NodeMetadata to a QuerierMetadata field won’t compile.
- A QUERIER_MAP write lock is held across an .await (the liveness checks), which can lead to stalls and unnecessary contention.
Proposed fix:
- Request Vec from get_node_info to keep types consistent.
- Compute liveness results outside the write lock, then acquire the write lock only to update the map.
Apply this diff:
```diff
 pub async fn get_available_querier() -> Result<QuerierMetadata, QueryError> {
     // Get all querier metadata
-    let querier_metadata: Vec<NodeMetadata> = get_node_info(NodeType::Querier).await?;
+    let querier_metadata: Vec<QuerierMetadata> = get_node_info(NodeType::Querier).await?;

     // No queriers found
     if querier_metadata.is_empty() {
         return Err(QueryError::NoAvailableQuerier);
     }

     // Limit concurrency for liveness checks to avoid resource exhaustion
     const MAX_CONCURRENT_LIVENESS_CHECKS: usize = 10;
     let semaphore = Arc::new(Semaphore::new(MAX_CONCURRENT_LIVENESS_CHECKS));

-    // Update the querier map with new metadata and get an available querier
-    let mut map = QUERIER_MAP.write().await;
-
-    let existing_domains: Vec<String> = map.keys().cloned().collect();
-    let mut live_domains = std::collections::HashSet::new();
+    // Snapshot existing domains without holding a write lock across await
+    let existing_domains: Vec<String> = {
+        let map = QUERIER_MAP.read().await;
+        map.keys().cloned().collect()
+    };
+    let mut live_domains = std::collections::HashSet::new();

     // Use stream with concurrency limit instead of join_all
-    let liveness_results: Vec<(String, bool, NodeMetadata)> = stream::iter(querier_metadata)
+    let liveness_results: Vec<(String, bool, QuerierMetadata)> = stream::iter(querier_metadata)
         .map(|metadata| {
             let domain = metadata.domain_name.clone();
             let metadata_clone = metadata.clone();
             let semaphore = Arc::clone(&semaphore);
             async move {
-                let _permit = semaphore.acquire().await.unwrap();
+                // Avoid panic if the semaphore is closed
+                let _permit = match semaphore.acquire().await {
+                    Ok(p) => p,
+                    Err(_) => return (domain, false, metadata_clone),
+                };
                 let is_live = check_liveness(&domain).await;
                 (domain, is_live, metadata_clone)
             }
         })
         .buffer_unordered(MAX_CONCURRENT_LIVENESS_CHECKS)
         .collect()
         .await;
+
+    // Update the querier map with new metadata within a short-lived write lock
+    let mut map = QUERIER_MAP.write().await;
```
src/handlers/http/query.rs (2)
379-396
: Field name inconsistency: "endTime" vs "end_time". This endpoint returns fields ["start_time","endTime","count"] while the records produced by SQL use "end_time". Downstream consumers will see a mismatch. Use "end_time" to stay consistent with CountsRecord and Prism.
```diff
-    let res = json!({
-        "fields": vec!["start_time", "endTime", "count"],
-        "records": records,
-    });
+    let res = json!({
+        "fields": vec!["start_time", "end_time", "count"],
+        "records": records,
+    });
```
And similarly in the non-conditions path:
```diff
-    let res = json!({
-        "fields": vec!["start_time", "endTime", "count"],
-        "records": records,
-    });
+    let res = json!({
+        "fields": vec!["start_time", "end_time", "count"],
+        "records": records,
+    });
```
370-396
: Unify JSON field naming: replace “endTime” with “end_time” across handlers. To prevent client-side schema drift, all JSON responses should use snake_case for field names. Update the remaining occurrences of “endTime” in your handlers:
• src/handlers/http/query.rs
  - Line 380: change vec!["start_time", "endTime", "count"] → vec!["start_time", "end_time", "count"]
  - Line 394: same change
• src/handlers/airplane.rs
  - Line 165: change the JSON key "endTime": end_time → "end_time": end_time
After these updates, rerun a search for \bendTime\b to confirm no instances remain.
♻️ Duplicate comments (1)
src/query/mod.rs (1)
452-470
: Bug: end_time DATE_BIN is unqualified; will cause ambiguous/unresolved column errors. The second DATE_BIN in each format! is not table-qualified. This risks parse/resolve errors. Pass self.stream twice and qualify both occurrences.
```diff
-            format!(
-                "CAST(DATE_BIN('1 minute', \"{}\".\"{time_column}\", TIMESTAMP '1970-01-01 00:00:00+00') AS TEXT) as start_time, DATE_BIN('1 minute', \"{time_column}\", TIMESTAMP '1970-01-01 00:00:00+00') + INTERVAL '1 minute' as end_time",
-                self.stream
-            )
+            format!(
+                "CAST(DATE_BIN('1 minute', \"{}\".\"{time_column}\", TIMESTAMP '1970-01-01 00:00:00+00') AS TEXT) as start_time, DATE_BIN('1 minute', \"{}\".\"{time_column}\", TIMESTAMP '1970-01-01 00:00:00+00') + INTERVAL '1 minute' as end_time",
+                self.stream, self.stream
+            )
         } else if dur.num_minutes() > 60 * 10 && dur.num_minutes() < 60 * 240 {
             // date_bin 1 hour
-            format!(
-                "CAST(DATE_BIN('1 hour', \"{}\".\"{time_column}\", TIMESTAMP '1970-01-01 00:00:00+00') AS TEXT) as start_time, DATE_BIN('1 hour', \"{time_column}\", TIMESTAMP '1970-01-01 00:00:00+00') + INTERVAL '1 hour' as end_time",
-                self.stream
-            )
+            format!(
+                "CAST(DATE_BIN('1 hour', \"{}\".\"{time_column}\", TIMESTAMP '1970-01-01 00:00:00+00') AS TEXT) as start_time, DATE_BIN('1 hour', \"{}\".\"{time_column}\", TIMESTAMP '1970-01-01 00:00:00+00') + INTERVAL '1 hour' as end_time",
+                self.stream, self.stream
+            )
         } else {
             // date_bin 1 day
-            format!(
-                "CAST(DATE_BIN('1 day', \"{}\".\"{time_column}\", TIMESTAMP '1970-01-01 00:00:00+00') AS TEXT) as start_time, DATE_BIN('1 day', \"{time_column}\", TIMESTAMP '1970-01-01 00:00:00+00') + INTERVAL '1 day' as end_time",
-                self.stream
-            )
+            format!(
+                "CAST(DATE_BIN('1 day', \"{}\".\"{time_column}\", TIMESTAMP '1970-01-01 00:00:00+00') AS TEXT) as start_time, DATE_BIN('1 day', \"{}\".\"{time_column}\", TIMESTAMP '1970-01-01 00:00:00+00') + INTERVAL '1 day' as end_time",
+                self.stream, self.stream
+            )
         };
```
🧹 Nitpick comments (6)
src/handlers/http/cluster/mod.rs (2)
1149-1153
: Avoid unwrap() on semaphore.acquire(). acquire().await.unwrap() will panic if the semaphore is closed. Prefer a graceful fallback that treats the node as not live.
```diff
- let _permit = semaphore.acquire().await.unwrap();
+ let _permit = match semaphore.acquire().await {
+     Ok(p) => p,
+     Err(_) => return (domain, false, metadata_clone),
+ };
```
1188-1208
: Holding QUERIER_MAP write lock across await (select_next_querier) can be avoided. select_next_querier awaits on LAST_USED_QUERIER locks while holding the QUERIER_MAP write lock. While not immediately deadlocking, this increases contention. Consider the following (see the sketch after this list):
- Capturing the necessary state (available domains) without a write lock.
- Performing LAST_USED_QUERIER reads/writes.
- Reacquiring QUERIER_MAP write lock only to mutate the chosen entry.
This will reduce lock hold time and improve concurrency.
If you want, I can propose a non-async variant of select_next_querier that accepts the last_used value as an input to avoid awaits under the write lock.
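A minimal sketch of that two-phase shape, with stand-in types (the map contents, liveness flags, and ordering are hypothetical, not Parseable's actual structures), assuming tokio and once_cell:
```rust
use std::collections::HashMap;

use once_cell::sync::Lazy;
use tokio::sync::RwLock;

static QUERIER_MAP: Lazy<RwLock<HashMap<String, bool>>> =
    Lazy::new(|| RwLock::new(HashMap::new()));
static LAST_USED_QUERIER: Lazy<RwLock<Option<String>>> =
    Lazy::new(|| RwLock::new(None));

async fn select_next_querier() -> Option<String> {
    // Phase 1: snapshot live domains under a short-lived read lock.
    let mut candidates: Vec<String> = {
        let map = QUERIER_MAP.read().await;
        map.iter()
            .filter(|(_, live)| **live)
            .map(|(domain, _)| domain.clone())
            .collect()
    }; // read guard dropped here, before any further await

    candidates.sort(); // stable round-robin order

    // Phase 2: advance the cursor; no other lock is held at this point.
    let mut last = LAST_USED_QUERIER.write().await;
    let next = if candidates.is_empty() {
        None
    } else {
        let idx = last
            .as_deref()
            .and_then(|prev| candidates.iter().position(|d| d.as_str() == prev))
            .map(|i| (i + 1) % candidates.len())
            .unwrap_or(0);
        Some(candidates[idx].clone())
    };
    *last = next.clone();
    next
}
```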
src/query/mod.rs (1)
373-376
: Avoid unwrap() on DateTime::from_timestamp_millis. from_timestamp_millis can return None for out-of-range values. Using unwrap() can panic. Return 0 or skip the file on None to keep counts robust.
```diff
- TypedStatistics::Int(Int64Type { min, .. }) => {
-     let min = DateTime::from_timestamp_millis(*min).unwrap();
-     bin.start <= min && bin.end >= min // Determines if a column matches the bin's time range.
- }
+ TypedStatistics::Int(Int64Type { min, .. }) => {
+     if let Some(min) = DateTime::from_timestamp_millis(*min) {
+         bin.start <= min && bin.end >= min
+     } else {
+         false
+     }
+ }
```
371-374
: Redundant credential extraction; reuse existing creds. You already extracted creds at Line 343. Re-extracting is unnecessary and can fail again.
```diff
- let creds = extract_session_key_from_req(&req)?;
-
- let (records, _) = get_records_and_fields(&query_request, &creds).await?;
+ let (records, _) = get_records_and_fields(&query_request, &creds).await?;
```
339-399
: Optional: remove duplicate permission check downstream. get_records_and_fields performs a permission check; you already did user_auth_for_datasets here. Keeping both is defensive, but if redundancy is unintended, you can rely on a single check to avoid double work.
src/prism/logstream/mod.rs (1)
410-421
: Avoid JSON round-trip for CountsResponse. You serialize a JSON value and immediately deserialize to CountsResponse. Build the struct directly to avoid overhead and potential key casing mismatches.
```rust
// Instead of building `res` and then serde_json::from_value(res)?
Ok(CountsResponse {
    fields: vec!["start_time".into(), "end_time".into(), "count".into()],
    records: serde_json::from_value::<Vec<CountsRecord>>(json!(records))?, // or map directly
})
```
📜 Review details
📒 Files selected for processing (4)
- src/handlers/http/cluster/mod.rs (1 hunks)
- src/handlers/http/query.rs (6 hunks)
- src/prism/logstream/mod.rs (8 hunks)
- src/query/mod.rs (6 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-06-18T12:44:31.983Z
Learnt from: parmesant
PR: parseablehq/parseable#1347
File: src/handlers/http/query.rs:0-0
Timestamp: 2025-06-18T12:44:31.983Z
Learning: The counts API in src/handlers/http/query.rs does not currently support group_by functionality in COUNT queries, so the hard-coded fields array ["start_time", "end_time", "count"] is appropriate for the current scope.
Applied to files:
src/query/mod.rs
📚 Learning: 2025-06-18T06:39:04.775Z
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/query/mod.rs:64-66
Timestamp: 2025-06-18T06:39:04.775Z
Learning: In src/query/mod.rs, QUERY_SESSION_STATE and QUERY_SESSION serve different architectural purposes: QUERY_SESSION_STATE is used for stats calculation and allows dynamic registration of individual parquet files from the staging path (files created every minute), while QUERY_SESSION is used for object store queries with the global schema provider. Session contexts with schema providers don't support registering individual tables/parquets, so both session objects are necessary for their respective use cases.
Applied to files:
src/handlers/http/query.rs
🧬 Code Graph Analysis (2)
src/handlers/http/query.rs (2)
- src/event/mod.rs (2): map (131-131), commit_schema (121-136)
- src/utils/actix.rs (2): extract_session_key_from_req (51-71), req (31-31)
src/prism/logstream/mod.rs (3)
- src/handlers/http/query.rs (4): query (110-144), get_records_and_fields (81-108), get_counts (339-398), from (599-601)
- src/utils/arrow/mod.rs (1): record_batches_to_json (50-64)
- src/utils/time.rs (1): truncate_to_minute (295-306)
🔇 Additional comments (12)
src/handlers/http/cluster/mod.rs (1)
1123-1131
: Public exposure of get_available_querier is fine. Making get_available_querier public aligns with the broader PR surface changes. No issues with the signature.
src/query/mod.rs (4)
59-59
: Import simplification is fine. Using crate::event::DEFAULT_TIMESTAMP_KEY directly is clearer. No issues.
344-349
: Correct fallback to DEFAULT_TIMESTAMP_KEY. Falling back to DEFAULT_TIMESTAMP_KEY when the stream has no time partition is correct and aligns with transform/filtering logic.
656-657
: Time filter detection correctly considers dynamic partition and default timestamp. Checking either the dynamic time_partition or the DEFAULT_TIMESTAMP_KEY is sufficient to avoid adding duplicate filters.
452-470
: No unqualified DATE_BIN references remain. Verified via rg -n -C2 --type=rust -P 'DATE_BIN\(.+?\)'; only the three intended, fully-qualified uses in src/query/mod.rs (lines 455, 461, 467) were found.
src/handlers/http/query.rs (4)
22-23
: SessionKey import addition is correct. Switching get_records_and_fields to accept SessionKey is consistent with the new auth flow.
47-48
: DEFAULT_TIMESTAMP_KEY import is fine. Used for counts path and to align with dynamic time-partition fallbacks.
80-108
: Signature change to creds: &SessionKey and permission check here looks good. Centralizing permission checks in get_records_and_fields simplifies call sites and reduces duplication.
354-360
: Dynamic time_partition propagation into counts SQL is correct. Using stream.get_time_partition().unwrap_or(DEFAULT_TIMESTAMP_KEY) and passing it to get_df_sql aligns with the transform/binning changes in src/query/mod.rs.
src/prism/logstream/mod.rs (3)
37-41
: merge_queried_stats rename usage is correct. The import and usage of merge_queried_stats aligns with the helper rename.
226-232
: Serialize derivation on PrismDatasetRequest. Deriving Serialize is a harmless addition and useful for logging/testing.
460-473
: New error variants mapping is sensible. Mapping SerdeError and ReqwestError to 500 is reasonable for Prism.
Force-pushed: 3be89c2 → 2b32a90 (compare).
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
src/query/mod.rs (2)
370-385
: Fix boundary inclusion and avoid potential panic when parsing min timestamp. The predicate uses bin.end >= min, which double-counts rows on bin boundaries. Also, from_timestamp_millis(*min).unwrap() can panic. Use half-open intervals [start, end) and handle invalid millis safely.
396-441
: Off-by-one and division-by-zero in bin bounds; generate exactly num_bins binsWhen remainder > 0, the loop creates
num_bins
bins and then adds a final bin, yieldingnum_bins + 1
. Also,total_minutes / self.num_bins
panics ifnum_bins == 0
.Compute spans using quotient/remainder distribution and guard
num_bins == 0
./// Calculate the end time for each bin based on the number of bins fn get_bounds(&self, time_range: &TimeRange) -> Vec<TimeBounds> { - let total_minutes = time_range + // Guard against invalid input + if self.num_bins == 0 { + return vec![]; + } + + let total_minutes = time_range .end .signed_duration_since(time_range.start) .num_minutes() as u64; - // divide minutes by num bins to get minutes per bin - let quotient = total_minutes / self.num_bins; - let remainder = total_minutes % self.num_bins; - let have_remainder = remainder > 0; - - // now create multiple bounds [startTime, endTime) - // Should we exclude the last one??? - let mut bounds = vec![]; - - let mut start = time_range.start; - - let loop_end = if have_remainder { - self.num_bins - } else { - self.num_bins - 1 - }; - - // Create bins for all but the last date - for _ in 0..loop_end { - let end = start + Duration::minutes(quotient as i64); - bounds.push(TimeBounds { start, end }); - start = end; - } - - // Add the last bin, accounting for any remainder, should we include it? - if have_remainder { - bounds.push(TimeBounds { - start, - end: start + Duration::minutes(remainder as i64), - }); - } else { - bounds.push(TimeBounds { - start, - end: start + Duration::minutes(quotient as i64), - }); - } - - bounds + // Distribute total minutes across bins: first `remainder` bins get one extra minute + let quotient = total_minutes / self.num_bins; + let remainder = total_minutes % self.num_bins; + + let mut bounds = Vec::with_capacity(self.num_bins as usize); + let mut start = time_range.start; + + for i in 0..self.num_bins { + let span = quotient + u64::from(i < remainder); + let end = start + Duration::minutes(span as i64); + bounds.push(TimeBounds { start, end }); + start = end; + } + + bounds }src/handlers/http/cluster/mod.rs (1)
1117-1202
: Type mismatch: use QuerierMetadata consistently (currently declared as NodeMetadata)
QuerierStatus
storesQuerierMetadata
, butget_available_querier
populates it withNodeMetadata
. This won’t compile unless they are the same type alias. Fetch and trackQuerierMetadata
here.-pub async fn get_available_querier() -> Result<QuerierMetadata, QueryError> { +pub async fn get_available_querier() -> Result<QuerierMetadata, QueryError> { // Get all querier metadata - let querier_metadata: Vec<NodeMetadata> = get_node_info(NodeType::Querier).await?; + let querier_metadata: Vec<QuerierMetadata> = get_node_info::<QuerierMetadata>(NodeType::Querier).await?; @@ - let liveness_results: Vec<(String, bool, NodeMetadata)> = stream::iter(querier_metadata) + let liveness_results: Vec<(String, bool, QuerierMetadata)> = stream::iter(querier_metadata) .map(|metadata| { let domain = metadata.domain_name.clone(); let metadata_clone = metadata.clone(); let semaphore = Arc::clone(&semaphore);
♻️ Duplicate comments (4)
src/query/mod.rs (1)
444-485
: Qualify end_time DATE_BIN with table name to avoid ambiguous column errors. The second DATE_BIN (for end_time) is unqualified in all three branches. Qualify it with the table and pass self.stream twice to format!. This was raised earlier and remains unresolved in this diff.
121-131
: Avoid panic on invalid header and accept case-insensitive boolean for UPDATE_STREAM_KEY
to_str().unwrap()
can panic on invalid header bytes. Also accept "True"/"TRUE".- let is_update = if let Some(val) = headers.get(UPDATE_STREAM_KEY) { - val.to_str().unwrap() == "true" - } else { - false - }; + let is_update = headers + .get(UPDATE_STREAM_KEY) + .and_then(|v| v.to_str().ok()) + .map(|s| s.trim().eq_ignore_ascii_case("true")) + .unwrap_or(false);src/prism/logstream/mod.rs (2)
357-386
: Use the stream’s actual time partition column in Conditions (not DEFAULT_TIMESTAMP_KEY)For streams with a custom time_partition, base the WHERE conditions on that column. Using DEFAULT_TIMESTAMP_KEY can cause column-not-found errors or miss pruning.
- let conditions = if PARSEABLE.get_stream(stream)?.get_time_partition().is_some() { - Some(CountConditions { + let conditions = if let Some(time_partition) = PARSEABLE.get_stream(stream)?.get_time_partition() { + Some(CountConditions { conditions: Some(Conditions { operator: Some(crate::alerts::LogicalOperator::And), condition_config: vec![ ConditionConfig { - column: DEFAULT_TIMESTAMP_KEY.into(), + column: time_partition.clone(), operator: crate::alerts::WhereConfigOperator::GreaterThanOrEqual, value: Some(start.to_rfc3339()), }, ConditionConfig { - column: DEFAULT_TIMESTAMP_KEY.into(), + column: time_partition, operator: crate::alerts::WhereConfigOperator::LessThan, value: Some(end.to_rfc3339()), }, ], }), group_by: None, }) } else { None };
396-401
: Pass the same time column to get_df_sql that you used in ConditionsKeep binning and filtering consistent by using the stream’s partition column.
- let query = count_request - .get_df_sql(DEFAULT_TIMESTAMP_KEY.into()) - .await?; + // Use the same time column as in the conditions above + let time_column = PARSEABLE + .get_stream(stream)? + .get_time_partition() + .unwrap_or_else(|| DEFAULT_TIMESTAMP_KEY.into()); + let query = count_request.get_df_sql(time_column).await?;
🧹 Nitpick comments (3)
src/query/mod.rs (1)
444-485
: Prefer building expressions via DataFusion APIs or escape identifiers. Raw string formatting with unescaped identifiers risks SQL injection or invalid SQL if stream/column names contain quotes. Consider constructing the plan using DataFusion logical expressions, or at minimum escape " in identifiers.
fn escape_ident(id: &str) -> String { id.replace('"', "\"\"") }Then wrap
escape_ident(&self.stream)
andescape_ident(&time_column)
in the format strings.src/handlers/http/cluster/mod.rs (1)
1204-1292
: Optional: reduce lock hold time and avoid awaiting while holding QUERIER_MAP write lock
select_next_querier
awaits onLAST_USED_QUERIER
while the QUERIER_MAP write lock is held. Consider computing the candidate domain(s) first without changing state, then reacquire the write lock to mutate entries. This reduces contention and lock ordering risks.src/prism/logstream/mod.rs (1)
412-423
: Avoid JSON round-trip: build CountsResponse directly from Arrow. Converting RecordBatches to JSON and then deserializing back into CountsResponse adds overhead. Optionally map Arrow arrays to CountsRecord directly to reduce allocations and parsing.
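A rough sketch of that direct mapping, assuming three columns (start_time, end_time, count) and a hypothetical CountsRecord shape; Parseable's actual types may differ:
```rust
use arrow_array::{Int64Array, RecordBatch, StringArray};

// Hypothetical record shape for illustration only.
struct CountsRecord {
    start_time: String,
    end_time: String,
    count: i64,
}

// Map the three expected columns of each batch straight into records,
// skipping the serialize-then-deserialize round-trip.
fn batches_to_counts(batches: &[RecordBatch]) -> Option<Vec<CountsRecord>> {
    let mut records = Vec::new();
    for batch in batches {
        let start = batch.column(0).as_any().downcast_ref::<StringArray>()?;
        let end = batch.column(1).as_any().downcast_ref::<StringArray>()?;
        let count = batch.column(2).as_any().downcast_ref::<Int64Array>()?;
        for i in 0..batch.num_rows() {
            records.push(CountsRecord {
                start_time: start.value(i).to_string(),
                end_time: end.value(i).to_string(),
                count: count.value(i),
            });
        }
    }
    Some(records)
}
```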
📜 Review details
📒 Files selected for processing (7)
- src/alerts/alert_structs.rs (1 hunks)
- src/handlers/http/cluster/mod.rs (1 hunks)
- src/handlers/http/cluster/utils.rs (1 hunks)
- src/handlers/http/modal/query/querier_logstream.rs (3 hunks)
- src/handlers/http/query.rs (6 hunks)
- src/prism/logstream/mod.rs (8 hunks)
- src/query/mod.rs (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- src/handlers/http/cluster/utils.rs
- src/alerts/alert_structs.rs
- src/handlers/http/query.rs
🧰 Additional context used
🧠 Learnings (3)
📚 Learning: 2025-06-18T12:44:31.983Z
Learnt from: parmesant
PR: parseablehq/parseable#1347
File: src/handlers/http/query.rs:0-0
Timestamp: 2025-06-18T12:44:31.983Z
Learning: The counts API in src/handlers/http/query.rs does not currently support group_by functionality in COUNT queries, so the hard-coded fields array ["start_time", "end_time", "count"] is appropriate for the current scope.
Applied to files:
src/query/mod.rs
📚 Learning: 2025-08-18T12:37:47.732Z
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1405
File: src/parseable/mod.rs:528-533
Timestamp: 2025-08-18T12:37:47.732Z
Learning: In Parseable, the validate_time_partition function in src/utils/json/flatten.rs already provides a default time partition limit of 30 days using `map_or(30, |days| days.get() as i64)` when time_partition_limit is None, so no additional defaulting is needed in the stream creation logic in src/parseable/mod.rs.
Applied to files:
src/prism/logstream/mod.rs
📚 Learning: 2025-02-14T09:49:25.818Z
Learnt from: de-sh
PR: parseablehq/parseable#1185
File: src/handlers/http/logstream.rs:255-261
Timestamp: 2025-02-14T09:49:25.818Z
Learning: In Parseable's logstream handlers, stream existence checks must be performed for both query and standalone modes. The pattern `!PARSEABLE.streams.contains(&stream_name) && (PARSEABLE.options.mode != Mode::Query || !PARSEABLE.create_stream_and_schema_from_storage(&stream_name).await?)` ensures proper error handling in both modes.
Applied to files:
src/handlers/http/modal/query/querier_logstream.rs
🧬 Code Graph Analysis (2)
src/prism/logstream/mod.rs (3)
- src/handlers/http/query.rs (4): query (110-144), get_records_and_fields (81-108), get_counts (339-398), from (599-601)
- src/utils/arrow/mod.rs (1): record_batches_to_json (50-64)
- src/utils/time.rs (1): truncate_to_minute (295-306)
src/handlers/http/modal/query/querier_logstream.rs (2)
- src/handlers/http/cluster/mod.rs (1): fetch_stats_from_ingestors (497-557)
- src/handlers/http/cluster/utils.rs (1): merge_queried_stats (138-176)
🔇 Additional comments (4)
src/query/mod.rs (1)
489-495
: CountsResponse now derives Deserialize — good for JSON roundtrips. This aligns with Prism’s path that deserializes a JSON-shaped response into CountsResponse.
src/handlers/http/modal/query/querier_logstream.rs (2)
36-46
: Import updates and rename to merge_queried_stats look correct. Bringing UPDATE_STREAM_KEY into scope and using merge_queried_stats aligns with the upstream changes.
134-138
: LGTM: Clear responses for create vs update. Status and messages are consistent and straightforward.
src/prism/logstream/mod.rs (1)
141-147
: Rename to merge_queried_stats applied correctly. The new import and usage of merge_queried_stats looks consistent with utils.
Fixes #XXXX.
Description
This PR has:
Summary by CodeRabbit
New Features
Bug Fixes