
feat: gcs support for storage #1373


Merged: 13 commits merged into parseablehq:main on Jul 15, 2025

Conversation

@solomonope solomonope (Contributor) commented Jul 13, 2025

Fixes #XXXX.

Description


This PR has:

  • been tested to ensure log ingestion and log query work.
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

Summary by CodeRabbit

  • New Features

    • Added support for Google Cloud Storage (GCS) as a storage backend, including configuration, CLI integration, metrics instrumentation, multipart upload, and DataFusion runtime integration.
    • Introduced a new Docker Compose file for distributed GCS-backed testing environments.
    • Enhanced Helm charts and values to support GCS secrets, high availability mode, and refined configuration for GCS deployments.
  • Refactor

    • Simplified and standardized Helm template logic, environment variable handling, and volume management for GCS and other storage options.
    • Improved Dockerfile base image reference clarity.
  • Chores

    • Updated dependency features and formatting for improved maintainability.

coderabbitai bot (Contributor) commented Jul 13, 2025

Walkthrough

This update introduces comprehensive support for Google Cloud Storage (GCS) as a new object storage backend. It includes the implementation of a GCS storage module, CLI and server integration, metrics instrumentation, Helm chart and Docker Compose enhancements for GCS configuration, and updates to dependency and configuration files.

Changes

File(s) and change summary:
  • src/storage/gcs.rs: New module implementing the GCS backend, with a config struct, async storage trait implementation, multipart upload, and metrics integration.
  • src/storage/mod.rs: Added the gcs module import and a public re-export of GcsConfig.
  • src/cli.rs, src/parseable/mod.rs: Added GCS as a storage option in the CLI and server initialization; new struct and enum variant for GCS; updated storage mode string logic.
  • src/metrics/storage.rs: Added a GCS metrics module with Prometheus histograms and a trait implementation for GcsConfig.
  • Cargo.toml: Expanded object_store features to include gcp; reformatted rdkafka and kafka features; whitespace cleanup.
  • Dockerfile.debug: Changed the Rust base image reference to include the docker.io/ prefix.
  • docker-compose-gcs-distributed-test.yaml: New Docker Compose file for a distributed GCS-backed test environment with query, ingest, and client services.
  • helm/templates/ingestor-statefulset.yaml, helm/templates/querier-statefulset.yaml, helm/templates/standalone-deployment.yaml: Added conditional GCS secret mounting, environment variable, and volume management; improved formatting and logic for GCS support.
  • helm/values.yaml: Enabled the GCS mode secret; enabled high availability; formatting updates and configuration for GCS authentication.
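
For orientation, a minimal sketch (assumptions only, not the PR's implementation) of how a GCS-backed object store is typically constructed with the object_store crate's gcp feature; parameter names are placeholders:

    use std::sync::Arc;

    use object_store::gcp::GoogleCloudStorageBuilder;
    use object_store::ObjectStore;

    // Hypothetical helper, not the PR's code: the bucket and credentials-path
    // parameters stand in for whatever the GCS config struct actually carries.
    fn build_gcs_store(
        bucket: &str,
        service_account_path: &str,
    ) -> object_store::Result<Arc<dyn ObjectStore>> {
        let store = GoogleCloudStorageBuilder::new()
            .with_bucket_name(bucket)
            .with_service_account_path(service_account_path)
            .build()?;
        Ok(Arc::new(store))
    }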

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant ParseableServer
    participant GCSBackend
    participant Metrics

    User->>CLI: Selects GCS as storage option
    CLI->>ParseableServer: Passes GcsStoreArgs
    ParseableServer->>GCSBackend: Initializes with GcsConfig
    ParseableServer->>GCSBackend: Performs storage operations (GET/PUT/Multipart)
    GCSBackend->>Metrics: Records request/response times
    GCSBackend->>GCSBackend: Handles object operations (upload, list, delete)
    ParseableServer->>User: Responds to storage requests

Possibly related PRs

  • fix: bugs introduced in #1143 (#1185): Fixes a deadlock by changing how options and storage are passed to IngestorMetadata::load; related to this PR's changes in src/parseable/mod.rs affecting ingestor metadata initialization.

Suggested reviewers

  • parmesant

Poem

🐰
In the clouds a new path does appear,
GCS joins the storage frontier!
With secrets and charts, and Compose in tow,
Metrics and CLI now help data flow.
Parseable hops with a cloudier cheer—
More options for logs, far and near!
☁️✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4f2494c and 989464a.

📒 Files selected for processing (1)
  • helm/templates/querier-statefulset.yaml (6 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • helm/templates/querier-statefulset.yaml


github-actions bot (Contributor) commented Jul 13, 2025

CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅

@solomonope solomonope marked this pull request as ready for review July 13, 2025 15:15
@coderabbitai coderabbitai bot (Contributor) left a comment


Actionable comments posted: 8

🧹 Nitpick comments (1)
src/metrics/storage.rs (1)

129-166: LGTM with minor capitalization inconsistency.

The GCS metrics module implementation follows the established pattern from other storage backends. However, there's a small inconsistency in the metric description.

-            HistogramOpts::new("gcs_response_time", "gcs Request Latency")
+            HistogramOpts::new("gcs_response_time", "GCS Request Latency")
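
For reference, a minimal sketch of the per-backend latency histogram pattern, assuming the prometheus crate's HistogramVec API; the label names used here are assumptions:

    use prometheus::{HistogramOpts, HistogramVec, Registry};

    // Hypothetical sketch of a per-backend latency histogram; the label
    // names are assumptions, not the module's actual labels.
    fn register_gcs_metrics(registry: &Registry) -> prometheus::Result<HistogramVec> {
        let histogram = HistogramVec::new(
            HistogramOpts::new("gcs_response_time", "GCS Request Latency"),
            &["method", "status"],
        )?;
        registry.register(Box::new(histogram.clone()))?;
        Ok(histogram)
    }

Callers would then record timings with histogram.with_label_values(&["GET", "200"]).observe(elapsed_secs).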
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c4533be and c46cac6.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (12)
  • Cargo.toml (3 hunks)
  • Dockerfile.debug (1 hunks)
  • docker-compose-gcs-distributed-test.yaml (1 hunks)
  • helm/templates/ingestor-statefulset.yaml (5 hunks)
  • helm/templates/querier-statefulset.yaml (6 hunks)
  • helm/templates/standalone-deployment.yaml (4 hunks)
  • helm/values.yaml (16 hunks)
  • src/cli.rs (4 hunks)
  • src/metrics/storage.rs (1 hunks)
  • src/parseable/mod.rs (2 hunks)
  • src/storage/gcs.rs (1 hunks)
  • src/storage/mod.rs (2 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: de-sh
PR: parseablehq/parseable#0
File: :0-0
Timestamp: 2025-03-20T15:50:45.435Z
Learning: Pay close attention to code comments for typos and semantic clarity during reviews for the Parseable project.
src/storage/mod.rs (1)
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/query/mod.rs:64-66
Timestamp: 2025-06-18T06:39:04.775Z
Learning: In src/query/mod.rs, QUERY_SESSION_STATE and QUERY_SESSION serve different architectural purposes: QUERY_SESSION_STATE is used for stats calculation and allows dynamic registration of individual parquet files from the staging path (files created every minute), while QUERY_SESSION is used for object store queries with the global schema provider. Session contexts with schema providers don't support registering individual tables/parquets, so both session objects are necessary for their respective use cases.
Cargo.toml (2)
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1302
File: .github/workflows/build.yaml:170-175
Timestamp: 2025-04-26T03:58:02.341Z
Learning: In the parseable project, the Linux-specific environment variables (PKG_CONFIG_PATH, SASL2_DIR, OPENSSL_DIR, OPENSSL_ROOT_DIR, SASL2_STATIC) in the Kafka build step of GitHub Actions workflows don't cause issues for macOS builds and can safely be applied to both platforms.
src/parseable/mod.rs (1)
Learnt from: de-sh
PR: parseablehq/parseable#1185
File: src/handlers/http/logstream.rs:255-261
Timestamp: 2025-02-14T09:49:25.818Z
Learning: In Parseable's logstream handlers, stream existence checks must be performed for both query and standalone modes. The pattern `!PARSEABLE.streams.contains(&stream_name) && (PARSEABLE.options.mode != Mode::Query || !PARSEABLE.create_stream_and_schema_from_storage(&stream_name).await?)` ensures proper error handling in both modes.
🔇 Additional comments (21)
Dockerfile.debug (1)

17-17: Good practice: Explicit registry specification.

Making the Docker registry explicit improves build reproducibility and clarity. This is a recommended practice for production Dockerfiles.

src/storage/mod.rs (2)

40-40: LGTM: Module declaration follows established pattern.

The GCS module declaration is consistent with other storage backend modules in the codebase.


50-50: LGTM: Config re-export follows established pattern.

The GCSConfig re-export is consistent with other storage backend configurations (AzureBlobConfig, S3Config, etc.).

Cargo.toml (4)

13-13: LGTM: Minor formatting cleanup.

Whitespace cleanup improves code consistency.


20-25: LGTM: GCS feature addition and improved formatting.

Adding the "gcp" feature enables Google Cloud Storage support through the object_store crate. The multi-line formatting improves maintainability.


42-46: LGTM: Improved formatting for rdkafka features.

Multi-line formatting improves readability and maintainability without changing functionality.


156-163: LGTM: Improved formatting for kafka features.

Multi-line formatting improves readability and maintainability without changing functionality.

src/parseable/mod.rs (2)

120-125: LGTM: GCS integration follows established pattern.

The StorageOptions::GCS match arm follows the same pattern as other storage backends (S3, Blob, Local) and properly initializes the Parseable instance with GCS storage configuration.


252-253: LGTM: Storage mode string mapping for GCS.

The storage mode string "Google Object Store" is consistent with the naming pattern used for other storage backends and provides clear identification in logs/UI.

src/cli.rs (4)

30-30: LGTM: GCS config import addition.

Adding GCSConfig to the storage imports enables GCS configuration through the CLI, consistent with other storage backend imports.


84-85: LGTM: GCS storage option addition.

The GCS variant addition to StorageOptions follows the established pattern and enables GCS as a CLI storage option with the command name "gcs-store".


121-130: LGTM: GCS store arguments structure.

The GCSStoreArgs struct follows the exact same pattern as other storage argument structs (S3StoreArgs, BlobStoreArgs), with flattened options, storage configuration, and conditional Kafka configuration.
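
As a rough illustration of that pattern (a sketch only; struct, field, flag, and env-var names below are placeholders, not the PR's exact definitions):

    use clap::{Args, Parser, Subcommand};

    // Hypothetical sketch of the subcommand-plus-flattened-args pattern.
    #[derive(Parser)]
    struct Cli {
        #[command(subcommand)]
        storage: StorageOptions,
    }

    #[derive(Subcommand)]
    enum StorageOptions {
        /// Use Google Cloud Storage as the backend.
        #[command(name = "gcs-store")]
        Gcs(GcsStoreArgs),
    }

    #[derive(Args)]
    struct GcsStoreArgs {
        #[command(flatten)]
        options: CommonOptions,
        #[command(flatten)]
        storage: GcsConfig,
    }

    #[derive(Args)]
    struct CommonOptions {
        /// Placeholder shared option.
        #[arg(long, default_value = "0.0.0.0:8000")]
        address: String,
    }

    #[derive(Args)]
    struct GcsConfig {
        /// Placeholder bucket flag (env binding requires clap's "env" feature).
        #[arg(long, env = "P_GCS_BUCKET")]
        bucket: String,
    }

    fn main() {
        let _cli = Cli::parse();
    }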


355-355: LGTM: Minor formatting cleanup.

Whitespace cleanup improves code consistency.

helm/values.yaml (2)

7-7: Configuration updated for GCS storage backend.

The store configuration has been properly set to use GCS as the storage backend.


140-156: GCS mode secret configuration looks correct.

The GCS authentication configuration includes all necessary components: secret mounting, environment variables, and required keys for GCS bucket access.

helm/templates/standalone-deployment.yaml (3)

39-39: Simplified container arguments for better consistency.

The removal of the special case handling for "gcs-store" simplifies the template logic and ensures consistent behavior across all storage backends.


94-108: GCS environment variable configuration is correct.

The addition of GOOGLE_APPLICATION_CREDENTIALS pointing to the mounted secret path and the removal of the "GCS" to "S3" replacement ensures proper GCS authentication.


133-147: Conditional GCS secret mounting implemented correctly.

The volume and volume mount configuration properly handles GCS secrets when enabled, ensuring credentials are available at the expected path.

helm/templates/querier-statefulset.yaml (2)

114-127: GCS environment variables configured correctly.

The GOOGLE_APPLICATION_CREDENTIALS path includes both mount path and secret key, and the environment variable naming preserves "GCS" instead of replacing with "S3".


158-162: Conditional GCS secret mounting matches standalone template.

The volume mount configuration is consistent with the standalone deployment template, ensuring uniform GCS credential handling across deployment types.

helm/templates/ingestor-statefulset.yaml (1)

69-74: GCS authentication setup looks good!

The implementation correctly handles GCS authentication by:

  • Creating a volume from the GCS secret when enabled
  • Setting the GOOGLE_APPLICATION_CREDENTIALS environment variable
  • Mounting the secret read-only at the configured path

This follows GCS authentication best practices.

Also applies to: 120-121, 173-178
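
To make the credential flow concrete, a hedged sketch of how a mounted key file exposed via GOOGLE_APPLICATION_CREDENTIALS could be handed to the object_store builder; this is an illustration, not the PR's code, and it assumes the mounted JSON is a service-account key:

    use object_store::gcp::GoogleCloudStorageBuilder;

    // Hypothetical sketch: read the mounted credential path from the
    // environment and pass it to the builder.
    fn gcs_builder_from_env(bucket: &str) -> GoogleCloudStorageBuilder {
        let mut builder = GoogleCloudStorageBuilder::new().with_bucket_name(bucket);
        if let Ok(path) = std::env::var("GOOGLE_APPLICATION_CREDENTIALS") {
            builder = builder.with_service_account_path(path);
        }
        builder
    }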

@solomonope solomonope changed the title from "Feat/gcs support for storage" to "feat: gcs support for storage" on Jul 13, 2025
@coderabbitai coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

♻️ Duplicate comments (2)
src/storage/gcs.rs (2)

327-330: Enable multipart upload for improved performance.

Multipart upload is currently disabled with a TODO comment. This will impact performance for large file uploads.


642-654: Add explanatory comments for s3:// scheme usage in DataFusion methods.

As mentioned in the past review, please add comments explaining why the s3:// scheme is used for GCS in the query_prefixes and store_url methods, similar to the comment at line 136.

 fn query_prefixes(&self, prefixes: Vec<String>) -> Vec<ListingTableUrl> {
+    // Use "s3://" scheme for DataFusion compatibility - see comment in get_datafusion_runtime
     prefixes
         .into_iter()
         .map(|prefix| {
             let path = format!("s3://{}/{}", &self.bucket, prefix);
             ListingTableUrl::parse(path).unwrap()
         })
         .collect()
 }

 fn store_url(&self) -> url::Url {
+    // Use "s3://" scheme for DataFusion compatibility - see comment in get_datafusion_runtime
     url::Url::parse(&format!("s3://{}", self.bucket)).unwrap()
 }
🧹 Nitpick comments (1)
src/storage/gcs.rs (1)

137-137: Fix typo in comment.

Missing 'n' at the end of "implementation".

-        // object store calls to our GoogleCloudStorage implementatio
+        // object store calls to our GoogleCloudStorage implementation
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c46cac6 and 861dde3.

📒 Files selected for processing (6)
  • docker-compose-gcs-distributed-test.yaml (1 hunks)
  • src/cli.rs (4 hunks)
  • src/metrics/storage.rs (1 hunks)
  • src/parseable/mod.rs (2 hunks)
  • src/storage/gcs.rs (1 hunks)
  • src/storage/mod.rs (2 hunks)
✅ Files skipped from review due to trivial changes (1)
  • src/parseable/mod.rs
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/storage/mod.rs
  • src/metrics/storage.rs
  • src/cli.rs
  • docker-compose-gcs-distributed-test.yaml

@coderabbitai coderabbitai bot (Contributor) left a comment


Actionable comments posted: 5

♻️ Duplicate comments (1)
src/storage/gcs.rs (1)

327-350: Enable multipart upload functionality for improved performance.

Multipart upload is currently disabled with a hardcoded false value and TODO comment. This significantly impacts performance for large file uploads.

The multipart upload implementation exists but is disabled. Large file uploads will suffer from poor performance and potential timeouts.

- // // TODO: Uncomment this when multipart is fixed
- // let should_multipart = std::fs::metadata(path)?.len() > MULTIPART_UPLOAD_SIZE as u64;
-
- let should_multipart = false;
+ let should_multipart = std::fs::metadata(path)?.len() > MIN_MULTIPART_UPLOAD_SIZE as u64;
 let res = if should_multipart {
-     // self._upload_multipart(key, path).await
-     // this branch will never get executed
-     Ok(())
+     self._upload_multipart(&RelativePath::from(key), path).await
 } else {
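
For illustration, a minimal sketch of the size-based decision this suggestion restores; the 25 MiB threshold value is an assumption, not the PR's constant:

    use std::fs;
    use std::io;
    use std::path::Path;

    // Hypothetical threshold; the real MIN_MULTIPART_UPLOAD_SIZE may differ.
    const MIN_MULTIPART_UPLOAD_SIZE: u64 = 25 * 1024 * 1024;

    // Decide whether a local file should go through the multipart path or a
    // single PUT, based on its on-disk size.
    fn should_multipart(path: &Path) -> io::Result<bool> {
        Ok(fs::metadata(path)?.len() > MIN_MULTIPART_UPLOAD_SIZE)
    }

Files above the threshold would then take the backend's multipart branch; smaller files continue through the single-object upload path.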
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b6d4d54 and 3193302.

📒 Files selected for processing (1)
  • src/storage/gcs.rs (1 hunks)
🔇 Additional comments (3)
src/storage/gcs.rs (3)

60-97: LGTM! Configuration struct is well-designed.

The GCS configuration struct properly uses clap for CLI argument parsing with appropriate environment variable bindings and defaults. The TLS skip verification option uses the correct GCS-specific environment variable name.


136-138: LGTM! Proper URL scheme usage with explanatory comment.

The code correctly uses the gs:// scheme for GCS and includes a helpful comment explaining why this is necessary for DataFusion compatibility.


646-648: LGTM! Consistent URL scheme usage.

The query_prefixes and store_url methods correctly use the gs:// scheme, maintaining consistency throughout the implementation.

Also applies to: 653-653

@solomonope solomonope (Contributor, Author) commented

I have read the CLA Document and I hereby sign the CLA

nitisht added a commit to parseablehq/.github that referenced this pull request Jul 14, 2025
coderabbitai bot previously approved these changes Jul 14, 2025
coderabbitai bot previously approved these changes Jul 14, 2025
@nikhilsinhaparseable nikhilsinhaparseable (Contributor) left a comment


good to merge

@nitisht nitisht merged commit 932daa0 into parseablehq:main Jul 15, 2025
13 of 14 checks passed