feat: gcs support for storage #1373
Conversation
Walkthrough

This update introduces comprehensive support for Google Cloud Storage (GCS) as a new object storage backend. It includes the implementation of a GCS storage module, CLI and server integration, metrics instrumentation, Helm chart and Docker Compose enhancements for GCS configuration, and updates to dependency and configuration files.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant ParseableServer
    participant GCSBackend
    participant Metrics
    User->>CLI: Selects GCS as storage option
    CLI->>ParseableServer: Passes GcsStoreArgs
    ParseableServer->>GCSBackend: Initializes with GcsConfig
    ParseableServer->>GCSBackend: Performs storage operations (GET/PUT/Multipart)
    GCSBackend->>Metrics: Records request/response times
    GCSBackend->>GCSBackend: Handles object operations (upload, list, delete)
    ParseableServer->>User: Responds to storage requests
```
CLA Assistant Lite bot: All contributors have signed the CLA ✍️ ✅
Actionable comments posted: 8
🧹 Nitpick comments (1)
src/metrics/storage.rs (1)

Lines 129-166: LGTM with minor capitalization inconsistency. The GCS metrics module implementation follows the established pattern from other storage backends. However, there's a small inconsistency in the metric description.

```diff
- HistogramOpts::new("gcs_response_time", "gcs Request Latency")
+ HistogramOpts::new("gcs_response_time", "GCS Request Latency")
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)

- Cargo.lock is excluded by `!**/*.lock`

📒 Files selected for processing (12)

- Cargo.toml (3 hunks)
- Dockerfile.debug (1 hunks)
- docker-compose-gcs-distributed-test.yaml (1 hunks)
- helm/templates/ingestor-statefulset.yaml (5 hunks)
- helm/templates/querier-statefulset.yaml (6 hunks)
- helm/templates/standalone-deployment.yaml (4 hunks)
- helm/values.yaml (16 hunks)
- src/cli.rs (4 hunks)
- src/metrics/storage.rs (1 hunks)
- src/parseable/mod.rs (2 hunks)
- src/storage/gcs.rs (1 hunks)
- src/storage/mod.rs (2 hunks)
🧰 Additional context used
🧠 Learnings (4)
📓 Common learnings
Learnt from: de-sh
PR: parseablehq/parseable#0
File: :0-0
Timestamp: 2025-03-20T15:50:45.435Z
Learning: Pay close attention to code comments for typos and semantic clarity during reviews for the Parseable project.
src/storage/mod.rs (1)
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/query/mod.rs:64-66
Timestamp: 2025-06-18T06:39:04.775Z
Learning: In src/query/mod.rs, QUERY_SESSION_STATE and QUERY_SESSION serve different architectural purposes: QUERY_SESSION_STATE is used for stats calculation and allows dynamic registration of individual parquet files from the staging path (files created every minute), while QUERY_SESSION is used for object store queries with the global schema provider. Session contexts with schema providers don't support registering individual tables/parquets, so both session objects are necessary for their respective use cases.
Cargo.toml (2)
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1340
File: src/storage/object_storage.rs:31-41
Timestamp: 2025-06-18T11:15:10.836Z
Learning: DataFusion's parquet reader defaults to using view types (Utf8View, BinaryView) when reading parquet files via the schema_force_view_types configuration (default: true). This means StringViewArray and BinaryViewArray downcasting is required when processing Arrow arrays from DataFusion parquet operations, even though these types are behind nightly feature flags.
Learnt from: nikhilsinhaparseable
PR: parseablehq/parseable#1302
File: .github/workflows/build.yaml:170-175
Timestamp: 2025-04-26T03:58:02.341Z
Learning: In the parseable project, the Linux-specific environment variables (PKG_CONFIG_PATH, SASL2_DIR, OPENSSL_DIR, OPENSSL_ROOT_DIR, SASL2_STATIC) in the Kafka build step of GitHub Actions workflows don't cause issues for macOS builds and can safely be applied to both platforms.
src/parseable/mod.rs (1)
Learnt from: de-sh
PR: parseablehq/parseable#1185
File: src/handlers/http/logstream.rs:255-261
Timestamp: 2025-02-14T09:49:25.818Z
Learning: In Parseable's logstream handlers, stream existence checks must be performed for both query and standalone modes. The pattern `!PARSEABLE.streams.contains(&stream_name) && (PARSEABLE.options.mode != Mode::Query || !PARSEABLE.create_stream_and_schema_from_storage(&stream_name).await?)` ensures proper error handling in both modes.
🔇 Additional comments (21)
Dockerfile.debug (1)

Line 17: Good practice: explicit registry specification. Making the Docker registry explicit improves build reproducibility and clarity. This is a recommended practice for production Dockerfiles.

src/storage/mod.rs (2)

Line 40: LGTM: Module declaration follows established pattern. The GCS module declaration is consistent with other storage backend modules in the codebase.

Line 50: LGTM: Config re-export follows established pattern. The GCSConfig re-export is consistent with other storage backend configurations (AzureBlobConfig, S3Config, etc.).

Cargo.toml (4)
Line 13: LGTM: Minor formatting cleanup. Whitespace cleanup improves code consistency.

Lines 20-25: LGTM: GCS feature addition and improved formatting. Adding the "gcp" feature enables Google Cloud Storage support through the object_store crate. The multi-line formatting improves maintainability.

Lines 42-46: LGTM: Improved formatting for rdkafka features. Multi-line formatting improves readability and maintainability without changing functionality.

Lines 156-163: LGTM: Improved formatting for kafka features. Multi-line formatting improves readability and maintainability without changing functionality.
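For context on the "gcp" feature flag discussed above, a dependency line enabling GCS support in the object_store crate might look like the following. This is a hedged sketch: the version number and the sibling feature names shown are assumptions for illustration, not taken from this PR's Cargo.toml.

```toml
# Hypothetical dependency line; version and sibling features are assumed
object_store = { version = "0.11", features = ["aws", "azure", "gcp"] }
```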
src/parseable/mod.rs (2)

Lines 120-125: LGTM: GCS integration follows established pattern. The StorageOptions::GCS match arm follows the same pattern as other storage backends (S3, Blob, Local) and properly initializes the Parseable instance with GCS storage configuration.

Lines 252-253: LGTM: Storage mode string mapping for GCS. The storage mode string "Google Object Store" is consistent with the naming pattern used for other storage backends and provides clear identification in logs/UI.
src/cli.rs (4)

Line 30: LGTM: GCS config import addition. Adding GCSConfig to the storage imports enables GCS configuration through the CLI, consistent with other storage backend imports.

Lines 84-85: LGTM: GCS storage option addition. The GCS variant addition to StorageOptions follows the established pattern and enables GCS as a CLI storage option with the command name "gcs-store".

Lines 121-130: LGTM: GCS store arguments structure. The GCSStoreArgs struct follows the exact same pattern as other storage argument structs (S3StoreArgs, BlobStoreArgs), with flattened options, storage configuration, and conditional Kafka configuration.

Line 355: LGTM: Minor formatting cleanup. Whitespace cleanup improves code consistency.
helm/values.yaml (2)

Line 7: Configuration updated for GCS storage backend. The store configuration has been properly set to use GCS as the storage backend.

Lines 140-156: GCS mode secret configuration looks correct. The GCS authentication configuration includes all necessary components: secret mounting, environment variables, and required keys for GCS bucket access.
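A hedged sketch of what such a values.yaml fragment could look like. Every key name below (other than the "gcs-store" command name, which appears in the CLI review above) is an illustrative assumption, not copied from the actual chart:

```yaml
# Hypothetical values.yaml fragment; key names are illustrative assumptions
parseable:
  store: gcs-store
  gcsModeSecret:
    enabled: true
    # Kubernetes secret holding the GCS service-account JSON key (name assumed)
    secretName: parseable-gcs-credentials
    # Path where the credentials file is mounted inside the container (assumed)
    mountPath: /etc/gcs/credentials.json
```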
helm/templates/standalone-deployment.yaml (3)

Line 39: Simplified container arguments for better consistency. The removal of the special-case handling for "gcs-store" simplifies the template logic and ensures consistent behavior across all storage backends.

Lines 94-108: GCS environment variable configuration is correct. The addition of GOOGLE_APPLICATION_CREDENTIALS pointing to the mounted secret path and the removal of the "GCS" to "S3" replacement ensures proper GCS authentication.

Lines 133-147: Conditional GCS secret mounting implemented correctly. The volume and volume mount configuration properly handles GCS secrets when enabled, ensuring credentials are available at the expected path.
helm/templates/querier-statefulset.yaml (2)

Lines 114-127: GCS environment variables configured correctly. The GOOGLE_APPLICATION_CREDENTIALS path includes both mount path and secret key, and the environment variable naming preserves "GCS" instead of replacing it with "S3".

Lines 158-162: Conditional GCS secret mounting matches standalone template. The volume mount configuration is consistent with the standalone deployment template, ensuring uniform GCS credential handling across deployment types.
helm/templates/ingestor-statefulset.yaml (1)

Lines 69-74: GCS authentication setup looks good! The implementation correctly handles GCS authentication by:

- Creating a volume from the GCS secret when enabled
- Setting the GOOGLE_APPLICATION_CREDENTIALS environment variable
- Mounting the secret read-only at the configured path

This follows GCS authentication best practices.

Also applies to: 120-121, 173-178
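The three steps above can be sketched as a generic Kubernetes pod-spec fragment. This is the standard secret-volume pattern with assumed names (secret name, mount path), not the chart's actual template; only GOOGLE_APPLICATION_CREDENTIALS is the real variable name used by Google client libraries.

```yaml
# Hypothetical pod-spec fragment; secret name and paths are assumed
containers:
  - name: parseable
    env:
      # Google client libraries read the credentials file from this path
      - name: GOOGLE_APPLICATION_CREDENTIALS
        value: /etc/gcs/credentials.json
    volumeMounts:
      - name: gcs-secret
        mountPath: /etc/gcs
        readOnly: true
volumes:
  - name: gcs-secret
    secret:
      secretName: parseable-gcs-credentials
```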
Actionable comments posted: 0
♻️ Duplicate comments (2)

src/storage/gcs.rs (2)

Lines 327-330: Enable multipart upload for improved performance. Multipart upload is currently disabled with a TODO comment. This will impact performance for large file uploads.

Lines 642-654: Add explanatory comments for s3:// scheme usage in DataFusion methods. As mentioned in the past review, please add comments explaining why the s3:// scheme is used for GCS in the query_prefixes and store_url methods, similar to the comment at line 136.

```diff
 fn query_prefixes(&self, prefixes: Vec<String>) -> Vec<ListingTableUrl> {
+    // Use "s3://" scheme for DataFusion compatibility - see comment in get_datafusion_runtime
     prefixes
         .into_iter()
         .map(|prefix| {
             let path = format!("s3://{}/{}", &self.bucket, prefix);
             ListingTableUrl::parse(path).unwrap()
         })
         .collect()
 }

 fn store_url(&self) -> url::Url {
+    // Use "s3://" scheme for DataFusion compatibility - see comment in get_datafusion_runtime
     url::Url::parse(&format!("s3://{}", self.bucket)).unwrap()
 }
```
🧹 Nitpick comments (1)

src/storage/gcs.rs (1)

Line 137: Fix typo in comment. Missing 'n' at the end of "implementation".

```diff
- // object store calls to our GoogleCloudStorage implementatio
+ // object store calls to our GoogleCloudStorage implementation
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)

- docker-compose-gcs-distributed-test.yaml (1 hunks)
- src/cli.rs (4 hunks)
- src/metrics/storage.rs (1 hunks)
- src/parseable/mod.rs (2 hunks)
- src/storage/gcs.rs (1 hunks)
- src/storage/mod.rs (2 hunks)

✅ Files skipped from review due to trivial changes (1)

- src/parseable/mod.rs

🚧 Files skipped from review as they are similar to previous changes (4)

- src/storage/mod.rs
- src/metrics/storage.rs
- src/cli.rs
- docker-compose-gcs-distributed-test.yaml
Actionable comments posted: 5
♻️ Duplicate comments (1)

src/storage/gcs.rs (1)

Lines 327-350: Enable multipart upload functionality for improved performance. Multipart upload is currently disabled with a hardcoded false value and a TODO comment. The multipart upload implementation exists but is never reached, so large file uploads will suffer from poor performance and potential timeouts.

```diff
- // // TODO: Uncomment this when multipart is fixed
- // let should_multipart = std::fs::metadata(path)?.len() > MULTIPART_UPLOAD_SIZE as u64;
-
- let should_multipart = false;
+ let should_multipart = std::fs::metadata(path)?.len() > MIN_MULTIPART_UPLOAD_SIZE as u64;

 let res = if should_multipart {
-    // self._upload_multipart(key, path).await
-    // this branch will never get executed
-    Ok(())
+    self._upload_multipart(&RelativePath::from(key), path).await
 } else {
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/storage/gcs.rs
(1 hunks)
🔇 Additional comments (3)

src/storage/gcs.rs (3)

Lines 60-97: LGTM! Configuration struct is well-designed. The GCS configuration struct properly uses clap for CLI argument parsing with appropriate environment variable bindings and defaults. The TLS skip-verification option uses the correct GCS-specific environment variable name.
Lines 136-138: LGTM! Proper URL scheme usage with explanatory comment. The code correctly uses the gs:// scheme for GCS and includes a helpful comment explaining why this is necessary for DataFusion compatibility.

Lines 646-648: LGTM! Consistent URL scheme usage. The query_prefixes and store_url methods correctly use the gs:// scheme, maintaining consistency throughout the implementation.

Also applies to: 653-653
I have read the CLA Document and I hereby sign the CLA
good to merge
Fixes #XXXX.
Description
This PR has: