fix: data type for static schema #1235

Merged: nitisht merged 3 commits into parseablehq:main on Mar 14, 2025

Conversation

nikhilsinhaparseable (Contributor) commented Mar 12, 2025

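For static schema streams, a string value that parses as an integer is accepted as valid for an integer field.
Likewise, a string value that parses as a float is accepted as valid for a float field.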
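
The rule above can be sketched with the standard library alone. The helper names below are hypothetical and are not taken from the PR's diff; the sketch only shows which strings `str::parse` accepts.

```rust
/// Hypothetical helper: true when the string parses as a 64-bit signed integer.
fn parses_as_int(s: &str) -> bool {
    s.parse::<i64>().is_ok()
}

/// Hypothetical helper: true when the string parses as a 64-bit float.
/// Integer-looking strings such as "42" also parse as f64.
fn parses_as_float(s: &str) -> bool {
    s.parse::<f64>().is_ok()
}
```

Note that Rust's `f64` parser also accepts `inf`, `NaN`, and exponent forms such as `1e3`, so a float check built this way admits those strings too.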

Summary by CodeRabbit

  • New Features

    • Introduced handling for Date32 values in the casting mechanism, allowing date representations to be processed as integers.
    • Added support for mapping "date" data types to the Arrow schema format, enhancing schema conversion capabilities.
  • Refactor

    • Updated the event processing interface to include a new parameter for improved type validation, ensuring consistent performance and reliable schema compatibility.

coderabbitai bot (Contributor) commented Mar 12, 2025

Walkthrough

This pull request introduces a new boolean parameter, static_schema_flag, to several functions involved in event formatting and validation. It updates the signatures of to_data, fields_mismatch, and valid_type in the JSON event module so that, when the flag is set, numeric strings are accepted as integers or floats. The EventFormat trait in src/event/format/mod.rs now takes the flag as well, so data conversion calls pass it through. The cast_or_none function is also updated to handle ScalarValue::Date32, and a mapping for "date" types is added to the static schema conversion.

Changes

| File(s) | Change Summary |
|---|---|
| src/event/format/json.rs | Updated method signatures for to_data, fields_mismatch, and valid_type to include static_schema_flag. Modified type validation logic to allow string-to-number parsing when the flag is true. Added helper functions validate_int and validate_float. |
| src/event/format/mod.rs | Updated the EventFormat trait's to_data method signature to require static_schema_flag and adjusted invocation in the data conversion flow. |
| src/query/stream_schema_provider.rs | Updated cast_or_none to handle ScalarValue::Date32, returning an integer representation. |
| src/static_schema.rs | Added mapping for "date" to DataType::Date32 in convert_static_schema_to_arrow_schema. |
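
As a rough illustration of the src/static_schema.rs change, the sketch below maps type names to data types. The `DataType` enum is a local stand-in for Arrow's type, the function name is hypothetical, and only the "date" to Date32 pairing comes from the summary; the other type names are illustrative assumptions.

```rust
/// Local stand-in for the few Arrow `DataType` variants this sketch needs.
#[derive(Debug, PartialEq)]
enum DataType {
    Int64,
    Float64,
    Utf8,
    Date32,
}

/// Hypothetical mapping from a static-schema type name to a data type.
/// Only "date" -> Date32 is taken from the PR summary; the rest are assumed.
fn data_type_for(name: &str) -> Option<DataType> {
    match name {
        "int" => Some(DataType::Int64),
        "double" => Some(DataType::Float64),
        "string" => Some(DataType::Utf8),
        "date" => Some(DataType::Date32),
        _ => None, // unknown names are left for the caller to handle
    }
}
```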

Sequence Diagram(s)

sequenceDiagram
    participant Caller as Caller
    participant Event as Event
    participant Validator as Validator (Validation Logic)

    Caller->>Event: to_data(..., static_schema_flag)
    Event->>Validator: fields_mismatch(..., static_schema_flag)
    Validator->>Validator: valid_type(..., static_schema_flag)
    Validator-->>Event: Validation result
    Event-->>Caller: Data conversion result
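
The call chain in the diagram can be sketched as follows. This is a simplified model under stated assumptions: `Value` and `FieldType` are local stand-ins for the JSON and Arrow types, the function bodies are illustrative rather than the PR's code, and only the function names and the flag come from the walkthrough.

```rust
/// Local stand-in for the JSON value variants this sketch needs.
enum Value {
    Int(i64),
    Float(f64),
    String(String),
}

/// Local stand-in for the declared field type.
enum FieldType {
    Int64,
    Float64,
    Utf8,
}

/// Numeric strings are accepted for numeric fields only when the flag is set.
fn valid_type(field: &FieldType, value: &Value, static_schema_flag: bool) -> bool {
    match (field, value) {
        (FieldType::Int64, Value::Int(_)) => true,
        (FieldType::Float64, Value::Int(_) | Value::Float(_)) => true,
        (FieldType::Utf8, Value::String(_)) => true,
        (FieldType::Int64, Value::String(s)) => static_schema_flag && s.parse::<i64>().is_ok(),
        (FieldType::Float64, Value::String(s)) => static_schema_flag && s.parse::<f64>().is_ok(),
        _ => false,
    }
}

/// Returns true if any field's value fails validation; forwards the flag unchanged.
fn fields_mismatch(fields: &[(FieldType, Value)], static_schema_flag: bool) -> bool {
    fields
        .iter()
        .any(|(field, value)| !valid_type(field, value, static_schema_flag))
}
```

With the flag off, a row holding the string "42" in an integer field is reported as a mismatch; with the flag on, the same row passes.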

Possibly related PRs

  • refactor: Event per log, streamline data handling #1209: The changes in the main PR are related to the modifications in the to_data method and the introduction of the static_schema_flag, which are also reflected in the into_event method of the retrieved PR, indicating a direct connection in their implementation.
  • refactor: capture ingestion time at receive #1210: The changes in the main PR are related to the modifications in the into_recordbatch method, as both PRs introduce new parameters to this method and alter its signature, indicating a direct connection in their implementation.

Suggested reviewers

  • de-sh

Poem

I'm a bunny with hops so grand,
Adding flags to help data stand.
Numbers dance when strings convert,
In my code garden, errors avert.
With every change, I grin ear to ear—
Celebrate with a twitch, cheer, and a hop of cheer! 🐇

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ef72675 and 82fe3ec.

📒 Files selected for processing (4)
  • src/event/format/json.rs (4 hunks)
  • src/event/format/mod.rs (2 hunks)
  • src/query/stream_schema_provider.rs (1 hunks)
  • src/static_schema.rs (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/event/format/mod.rs
  • src/event/format/json.rs
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: coverage
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
🔇 Additional comments (2)
src/static_schema.rs (1)

114-114: Support for Date32 data type added correctly

The added mapping for "date" to DataType::Date32 properly extends the schema conversion capabilities to support date fields. This is a good addition that aligns well with the PR objective of enhancing data type validation.

src/query/stream_schema_provider.rs (1)

970-970: Date32 casting implementation looks good

The implementation to handle ScalarValue::Date32 values by casting them to i64 is consistent with the handling of other scalar types. This ensures that date values can be properly compared in queries and validation checks.
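
A minimal sketch of that widening, assuming a local stand-in for DataFusion's `ScalarValue` and a hypothetical function name (the real `cast_or_none` signature is not shown in this conversation). Arrow's Date32 stores days since the Unix epoch as an `i32`, so converting to `i64` is lossless.

```rust
/// Local stand-in for the `ScalarValue` variants this sketch needs.
enum ScalarValue {
    Int32(Option<i32>),
    Int64(Option<i64>),
    Date32(Option<i32>), // days since the Unix epoch
    Utf8(Option<String>),
}

/// Hypothetical helper: widen integer-like scalars to i64, None otherwise.
fn as_i64(value: &ScalarValue) -> Option<i64> {
    match value {
        ScalarValue::Int32(v) => v.map(i64::from),
        ScalarValue::Int64(v) => *v,
        ScalarValue::Date32(v) => v.map(i64::from),
        ScalarValue::Utf8(_) => None,
    }
}
```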

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
src/event/format/mod.rs (1)

105-105: Add a doc comment for the new parameter.

This parameter significantly changes how data type validation is performed. Adding a brief explanation (e.g., what “static schema” implies) helps future maintainers and users understand its purpose.

 /// Converts data into the appropriate format...
+///
+/// If `static_schema_flag` is true, certain string values are interpreted as integers/floats.
 fn to_data(
     self,
     schema: &HashMap<String, Arc<Field>>,
     time_partition: Option<&String>,
     schema_version: SchemaVersion,
     static_schema_flag: bool,
 ) -> Result<(Self::Data, EventSchema, bool), AnyError>;
src/event/format/json.rs (3)

65-65: Provide brief documentation for the new static_schema_flag.

A small inline comment or doc attribute helps clarify the reasoning for including this parameter and how it affects parsing logic.

 fn to_data(
     self,
     schema: &HashMap<String, Arc<Field>>,
     time_partition: Option<&String>,
     schema_version: SchemaVersion,
+    // If true, allows parsing string values as integer/float for static schema validation.
     static_schema_flag: bool,
 ) -> Result<(Self::Data, Vec<Arc<Field>>, bool), anyhow::Error> {

277-282: Add explanation of static_schema_flag in valid_type.

This function’s responsibility expands with the new parameter. Consider adding a short doc comment to help clarify the branching logic.


283-306: Consider trimming input strings before parsing.

Leading or trailing whitespace in numeric strings would make parsing fail on real-world data.

  if let Value::String(s) = value {
-     return s.parse::<i64>().is_ok();
+     return s.trim().parse::<i64>().is_ok();
  }

Similarly for the float parsing block:

- return s.parse::<f64>().is_ok() || s.parse::<i64>().is_ok();
+ let trimmed = s.trim();
+ return trimmed.parse::<f64>().is_ok() || trimmed.parse::<i64>().is_ok();
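
The effect of the suggested `trim()` can be seen in a small standalone sketch. The two function names are hypothetical and exist only to contrast the behaviors; `str::parse::<i64>` rejects any surrounding whitespace.

```rust
/// Hypothetical: parse without trimming, as in the lines being replaced.
fn parses_as_int_strict(s: &str) -> bool {
    s.parse::<i64>().is_ok()
}

/// Hypothetical: trim surrounding whitespace first, as the suggestion proposes.
fn parses_as_int_trimmed(s: &str) -> bool {
    s.trim().parse::<i64>().is_ok()
}
```

Trimming only removes leading and trailing whitespace, so a string with interior spaces such as "4 2" is still rejected.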
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 506527d and 02610d0.

📒 Files selected for processing (2)
  • src/event/format/json.rs (5 hunks)
  • src/event/format/mod.rs (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (10)
  • GitHub Check: coverage
  • GitHub Check: Build Default x86_64-pc-windows-msvc
  • GitHub Check: Build Default x86_64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-unknown-linux-gnu
  • GitHub Check: Build Default x86_64-apple-darwin
  • GitHub Check: Build Kafka x86_64-unknown-linux-gnu
  • GitHub Check: Build Default aarch64-apple-darwin
  • GitHub Check: Build Kafka aarch64-apple-darwin
  • GitHub Check: Quest Smoke and Load Tests for Standalone deployments
  • GitHub Check: Quest Smoke and Load Tests for Distributed deployments
🔇 Additional comments (4)
src/event/format/mod.rs (1)

121-126: Invocation and parameter order look correct.

The call to to_data follows the updated signature precisely, ensuring the static_schema_flag is correctly passed. No issues spotted here.

src/event/format/json.rs (3)

115-115: Correct propagation of static_schema_flag.

Passing the flag to fields_mismatch accurately extends the static schema checks. Implementation looks good.


257-262: Function signature looks consistent with new parameter.

The updated fields_mismatch function aligns with the revised type-validation flow. No major concerns.


270-270: Properly forwarding the flag to valid_type.

This ensures the integer/float parsing logic can be triggered. Code is correct.

coderabbitai bot previously approved these changes Mar 13, 2025
@nitisht nitisht merged commit 7f32288 into parseablehq:main Mar 14, 2025
14 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Mar 21, 2025
3 tasks