Add design for the Backup Storage Location Server #1830
Introduces the BSLS design to enable backup and restore operations through a proxy service managed by the OADP Operator. Signed-off-by: Michal Pryc <[email protected]>
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mpryc The full list of commands accepted by this bot can be found here. The pull request process is described here
This looks great to me @mpryc
@weshayutin certainly. I will actually combine the BSLR and BSLS designs into one that is more use-case centric and less implementation driven; this was a great offline comment from @kaovilai.
@weshayutin how about "Virtual Machine Data Protection" (VMDP)? Disaster Recovery, imo, implies the ability to recover an entire virtual machine to a functional state, which would first need a traditional block-level backup and then a restore (from a CSI snapshot). This new feature won't be able to restore a user's actual VM on its own.
The BSLS is a persistent server component deployed in the OpenShift cluster that proxies secure access to a shared Kopia repository.

The BSLS acts as a secure proxy, enabling users to connect to it via Kopia-compatible clients with per user individual credentials. These credentials are provisioned and managed as OpenShift `Secrets` and are synced to the Kopia repository by the BSLS controller to enforce user-level access control.
Could we use the OAuth tokens for this?
   * Verify that the spec.LocationRepository field references a valid and Ready BackupStorageLocationRepository (BSLR) in the same namespace.
   * If invalid, mark the BSLS as NotReady and Requeue.
2. **TLS Setup**
   * Generate new or use a TLS certificate(s) from mounted from the OpenShift Secret for the BSLS service.
We will need to make sure that anything we do here is FIPS-compliant if we generate the certs. I don't see why that would be an immediate problem, but it's something to verify.
shawn-hurley
left a comment
I would love for the spec of this CRD to be added to the enhancement to get a better feel for it.
@mpryc: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
Issues go stale after 90d of inactivity. Mark the issue as fresh by commenting If this issue is safe to close now please do so with /lifecycle stale |
Walkthrough
Adds a comprehensive design document for BackupStorageLocationServer (BSLS), a Kubernetes Custom Resource that proxies access to a Kopia-based backup repository. Details include architecture, reconciliation flow, security considerations, controller responsibilities, and example scenarios for cluster-based backup management.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 2
🧹 Nitpick comments (3)
docs/design/backupstoragelocationserver-design.md (3)
200-214: Open Issues section appropriately identifies audit logging and monitoring gaps.
The open issues are well-articulated and highlight critical areas (audit logging, client TLS fingerprint management, repository password management, IAM integration). However, line 206 on "Audit Logging" deserves emphasis: in regulated environments (healthcare, financial, etc.), audit logging is often a compliance requirement, not optional. This should be flagged as a blocker for production use in such environments.
Recommend prioritizing:
- Audit Logging (line 206): This is critical for compliance and forensics.
- Client TLS Fingerprint Management (line 208): Important for usability in dynamic VM environments.
- Master Repository Password Management (line 210): Security concern if left unaddressed.
Consider opening separate issues or design docs for audit logging, IAM integration, and secrets management to unblock implementation in regulated environments. I can help draft these if needed.
52-58: Fix Markdown indentation for consistency and readability.
The document has inconsistent indentation in bullet lists (lines 52–58, 110–114, 122–135, and others). While this doesn't impede understanding, it reduces visual clarity and consistency. All bullet points at the same nesting level should align.
Example fix for lines 52–58:
- BSLS acts as a repository server, proxying all access to the underlying Kopia repository managed by the BSLR.
- BSLS handles user authentication, authorization, and access control, providing username/password-based access without exposing repository storage credentials to clients.
- Repository-level access and configuration, including credentials, storage backend, and repository parameters, are managed by the BSLR.
- BSLS enforces per-user isolation of snapshots and policy manifests, ensuring users see only their own backups and configurations.
- The BSLS communicates with Kopia clients over TLS-encrypted connections, ensuring secure data transmission.
- Access control lists (ACLs) and permissions are managed by BSLS, limiting user capabilities based on predefined rules and preventing unauthorized data modification or access.
- This design assumes that the BSLS is deployed within the OpenShift cluster or in a network environment that provides VMs running Kopia clients with low-latency, high-bandwidth internal access, ensuring efficient and secure backup and restore operations.

Apply similar fixes to lines 110–114 and 122–135 for consistency.
Also applies to: 110-114, 122-135
82-82: Fix hyphenation in compound adjectives for grammar and consistency.
Three instances of missing hyphens in compound adjectives:
- Line 82: "per user individual credentials" → "per-user individual credentials"
- Line 101: "Kopia specific configuration parameters" → "Kopia-specific configuration parameters"
- Line 140: "Kopia compatible client" → "Kopia-compatible client"
Apply this diff:
- These credentials are provisioned and managed as OpenShift `Secrets` and are synced to the Kopia repository by the BSLS controller to enforce user-level access control.
+ These credentials are provisioned and managed as OpenShift `Secrets` and are synced to the Kopia repository by the BSLS controller to enforce per-user access control.
- BSLS does not contain Kopia specific configuration parameters such as encryption algorithms, compression settings, and other repository-specific options, those are managed by the BSLR or DPA.
+ BSLS does not contain Kopia-specific configuration parameters such as encryption algorithms, compression settings, and other repository-specific options, those are managed by the BSLR or DPA.
- In the context of KubeVirt, the BSLS can be used to manage backups for virtual machines running within the cluster. Users within a KubeVirt VM can use a standard Kopia compatible client to back up and restore their own files on their own schedule, leveraging the BSLS to manage the repository lifecycle and ensure secure, efficient backups.
+ In the context of KubeVirt, the BSLS can be used to manage backups for virtual machines running within the cluster. Users within a KubeVirt VM can use a standard Kopia-compatible client to back up and restore their own files on their own schedule, leveraging the BSLS to manage the repository lifecycle and ensure secure, efficient backups.

Also applies to: 101-101, 140-140
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting
📒 Files selected for processing (1)
docs/design/backupstoragelocationserver-design.md(1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**
⚙️ CodeRabbit configuration file
-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.
Files:
docs/design/backupstoragelocationserver-design.md
🪛 LanguageTool
docs/design/backupstoragelocationserver-design.md
[style] ~21-~21: ‘with respect to’ might be wordy. Consider a shorter alternative.
Context: ...t rest. BSLS itself remains stateless with respect to user data. It does not persist sensitiv...
(EN_WORDINESS_PREMIUM_WITH_RESPECT_TO)
[grammar] ~82-~82: Use a hyphen to join words.
Context: ...it via Kopia-compatible clients with per user individual credentials. These crede...
(QB_NEW_EN_HYPHEN)
[grammar] ~101-~101: Use a hyphen to join words.
Context: ...tion**: - BSLS does not contain Kopia specific configuration parameters such a...
(QB_NEW_EN_HYPHEN)
[grammar] ~123-~123: Ensure spelling is correct
Context: ...pace. * If invalid, mark the BSLS as NotReady and Requeue. 2. TLS Setup * Gen...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~140-~140: Use a hyphen to join words.
Context: ...n a KubeVirt VM can use a standard Kopia compatible client to back up and restore...
(QB_NEW_EN_HYPHEN)
🪛 markdownlint-cli2 (0.18.1)
docs/design/backupstoragelocationserver-design.md
52-52: Unordered list indentation
Expected: 0; Actual: 1
(MD007, ul-indent)
53-53: Unordered list indentation
Expected: 0; Actual: 1
(MD007, ul-indent)
54-54: Unordered list indentation
Expected: 0; Actual: 1
(MD007, ul-indent)
55-55: Unordered list indentation
Expected: 0; Actual: 1
(MD007, ul-indent)
56-56: Unordered list indentation
Expected: 0; Actual: 1
(MD007, ul-indent)
57-57: Unordered list indentation
Expected: 0; Actual: 1
(MD007, ul-indent)
58-58: Unordered list indentation
Expected: 0; Actual: 1
(MD007, ul-indent)
110-110: Unordered list indentation
Expected: 0; Actual: 1
(MD007, ul-indent)
112-112: Inconsistent indentation for list items at the same level
Expected: 1; Actual: 2
(MD005, list-indent)
112-112: Unordered list indentation
Expected: 0; Actual: 2
(MD007, ul-indent)
114-114: Unordered list indentation
Expected: 0; Actual: 1
(MD007, ul-indent)
122-122: Unordered list indentation
Expected: 0; Actual: 3
(MD007, ul-indent)
123-123: Unordered list indentation
Expected: 0; Actual: 3
(MD007, ul-indent)
125-125: Unordered list indentation
Expected: 0; Actual: 3
(MD007, ul-indent)
126-126: Unordered list indentation
Expected: 0; Actual: 3
(MD007, ul-indent)
127-127: Unordered list indentation
Expected: 0; Actual: 3
(MD007, ul-indent)
129-129: Unordered list indentation
Expected: 0; Actual: 3
(MD007, ul-indent)
130-130: Unordered list indentation
Expected: 0; Actual: 3
(MD007, ul-indent)
132-132: Unordered list indentation
Expected: 0; Actual: 3
(MD007, ul-indent)
133-133: Unordered list indentation
Expected: 0; Actual: 3
(MD007, ul-indent)
135-135: Unordered list indentation
Expected: 0; Actual: 3
(MD007, ul-indent)
180-180: Unordered list indentation
Expected: 0; Actual: 2
(MD007, ul-indent)
182-182: Unordered list indentation
Expected: 0; Actual: 2
(MD007, ul-indent)
184-184: Unordered list indentation
Expected: 0; Actual: 2
(MD007, ul-indent)
186-186: Unordered list indentation
Expected: 0; Actual: 2
(MD007, ul-indent)
193-193: Unordered list indentation
Expected: 0; Actual: 1
(MD007, ul-indent)
195-195: Unordered list indentation
Expected: 0; Actual: 1
(MD007, ul-indent)
197-197: Unordered list indentation
Expected: 0; Actual: 1
(MD007, ul-indent)
206-206: Unordered list indentation
Expected: 0; Actual: 1
(MD007, ul-indent)
208-208: Unordered list indentation
Expected: 0; Actual: 1
(MD007, ul-indent)
210-210: Unordered list indentation
Expected: 0; Actual: 1
(MD007, ul-indent)
212-212: Unordered list indentation
Expected: 0; Actual: 1
(MD007, ul-indent)
214-214: Unordered list indentation
Expected: 0; Actual: 1
(MD007, ul-indent)
🔇 Additional comments (2)
docs/design/backupstoragelocationserver-design.md (2)
124-127: Rewrite: Several concerns in the review are already addressed in the design; focus on genuine gaps only.
The design document (lines 124-127 and surrounding context) already specifies:
- Certificate source: "from mounted from the OpenShift Secret for the BSLS service" (line 125)
- Fingerprint bootstrap complexity: The design acknowledges this at line 208: "The requirement for clients to validate the TLS certificate fingerprint introduces complexity in dynamic or short-lived VM environments. Automating this bootstrap process would improve usability."
Valid remaining gaps to address:
- Certificate rotation strategy: No lifecycle or rotation frequency specified
- FIPS compliance approach: Not mentioned; clarify if FIPS-compliant algorithms and key generation are required for regulated deployments
Recommend adding a brief note on certificate rotation cadence and whether FIPS compliance is in scope for this design.
Likely an incorrect or invalid review comment.
82-84: Authentication design defers critical mechanics to "open issues"; clarify before implementation.
The document specifies username/password-based access (line 53) and notes credentials are managed via Secrets (line 95), but leaves essential details to future work:
- Password provisioning and lifecycle: No workflow documented. Line 210 flags "Master Repository Password Management" as an open issue.
- OAuth/IAM integration: Line 212 explicitly marks this as open, though it's foundational for enterprise adoption.
- Credential sync error handling: Not addressed. No scenarios for invalid credentials, missing Secrets, or failed sync to repository.
- FIPS compliance: Not mentioned in certificate generation (lines 125–127).
- User isolation implementation: Line 96 relies on "Kopia's access control mechanisms" without specifying how BSLS enforces per-user boundaries.
Before implementing BSLS, these gaps should either be resolved in the design or explicitly scoped out with justification. Security-critical items (provisioning workflow, error recovery) warrant design clarity; others (IAM, FIPS) can be deferred if documented as such.
* **Credential Management**:
  * **User Access**:
    - User credentials (username/password) for accessing the BSLS are stored as OpenShift `Secrets`. These are managed by the BSLS controller and are used by Kopia clients running inside VMs.
    - One BSLS can be shared between multiple users. These users cannot see each other’s snapshots, policies, or data. Access is isolated per user through Kopia’s access control mechanisms.

  * **Repository Access**:
    Credentials required by the BSLS to open and manage the Kopia repository are also stored as OpenShift `Secrets` and referenced in the OADP `DataProtectionApplication` (DPA) or `BSLR` CRDs. The BSLS uses these to authenticate to the backend storage securely.
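
For illustration only, here is a minimal sketch of how a controller might read one user's credentials out of a Secret before syncing them to the repository. Kubernetes stores Secret `data` values base64-encoded; everything else here is an assumption, not part of the design: the `username` and `password` key names, the Secret name, and the `decode_user_secret` helper are all hypothetical.

```python
import base64
from typing import Tuple


def decode_user_secret(secret: dict) -> Tuple[str, str]:
    """Return (username, password) from a Secret manifest given as a dict.

    Assumption: the Secret carries `username` and `password` keys. The
    design does not specify key names, so these are placeholders.
    """
    data = secret.get("data", {})
    missing = [key for key in ("username", "password") if key not in data]
    if missing:
        # A missing key is reported rather than silently defaulted.
        raise KeyError(f"Secret is missing keys: {missing}")
    # Secret `data` values are base64-encoded strings.
    username = base64.b64decode(data["username"]).decode("utf-8")
    password = base64.b64decode(data["password"]).decode("utf-8")
    return username, password


# Hypothetical Secret, shaped like `oc get secret -o json` output.
example_secret = {
    "apiVersion": "v1",
    "kind": "Secret",
    "metadata": {"name": "bsls-user-alice"},  # hypothetical name
    "data": {
        "username": base64.b64encode(b"alice").decode("ascii"),
        "password": base64.b64encode(b"s3cret").decode("ascii"),
    },
}
```

Raising on a missing key keeps the "missing Secret field" case visible, so a controller could surface it as a status condition instead of syncing an incomplete credential.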
User isolation implementation mechanism unclear.
Line 96 states users "cannot see each other's snapshots, policies, or data. Access is isolated per user through Kopia's access control mechanisms," but the design doesn't explain how this isolation is enforced:
- Does Kopia use RBAC or a user namespace mechanism?
- Is isolation enforced by the BSLS controller or delegated entirely to Kopia?
- What happens if a user tries to access another user's snapshot? Is it rejected at the Kopia API level or BSLS level?
- How are policies enforced?
Without clarity here, implementers may make inconsistent assumptions about responsibility boundaries.
Please clarify the user isolation mechanism, either in the design or referenced documentation:
- Where is isolation enforced (BSLS controller vs. Kopia server)?
- What is the specific Kopia feature/configuration used?
- How are access violations detected and logged?
🤖 Prompt for AI Agents
In docs/design/backupstoragelocationserver-design.md around lines 93 to 99,
clarify the user isolation mechanism described: explicitly state whether
isolation is enforced by the BSLS controller or delegated to the Kopia server
(or a combination), name the exact Kopia features/configuration used (e.g.,
Kopia user principals/roles, repository-per-user vs. shared repo with ACLs,
access tokens, policy bindings or RBAC integration), and explain the enforcement
point for access checks (e.g., Kopia API rejects cross-user access; BSLS
performs request validation/authorization before proxying). Also add how access
violations are detected and logged (what components emit logs/audit events, log
locations and message formats, and any OpenShift audit integration), and provide
a brief example flow for a rejected access attempt to make responsibilities and
failure behavior unambiguous.
### Reconciliation Flow

When a BackupStorageLocationServer (BSLS) resource is created or modified, the controller takes the following steps:

1. **Validation**
   * Verify that the spec.LocationRepository field references a valid and Ready BackupStorageLocationRepository (BSLR) in the same namespace.
   * If invalid, mark the BSLS as NotReady and Requeue.
2. **TLS Setup**
   * Generate new or use a TLS certificate(s) from mounted from the OpenShift Secret for the BSLS service.
   * Record the certificate's SHA256 fingerprint in the BSLS .status.tlsFingerprint field for client verification.
   * The BackupStorageLocationServer Certificate's SHA256 fingerprint is required for the client to connect.
3. **Deployment/Service Management**
   * Deploy or modify already existing Kopia Server Pod with appropriate Configuration.
   * Expose the server internally via OpenShift Service.
4. **User Secret Synchronization**
   * Reconcile each credential into the Kopia repository using the Kopia server’s user management API.
   * Watch for referenced OpenShift Secrets (user credentials) and re-run Kopia repository credential update logic.
5. **Change Handling**
   * If the BSLR is modified (e.g., storage config updated), the BSLS is Reconciled.
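
As a rough illustration of the TLS Setup step, the fingerprint recorded in `.status.tlsFingerprint` could be derived as below. This sketch assumes the fingerprint is the lowercase hex SHA-256 digest of the DER-encoded certificate; the exact encoding a Kopia client expects should be checked against the Kopia documentation. The `tls_fingerprint` helper name is hypothetical.

```python
import hashlib
import ssl


def tls_fingerprint(pem_cert: str) -> str:
    """Return the SHA-256 digest (lowercase hex) of a PEM certificate.

    The PEM armor is stripped first, so the digest covers the DER bytes
    rather than the base64 text.
    """
    der_bytes = ssl.PEM_cert_to_DER_cert(pem_cert)
    return hashlib.sha256(der_bytes).hexdigest()
```

A controller would call this with the `tls.crt` contents of the mounted Secret and write the result to the status field, so that clients can compare it with what the server presents.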
Reconciliation flow incomplete; missing error handling scenarios.
The reconciliation flow (lines 121–135) describes the happy path but lacks error recovery for critical scenarios:
- What happens if the BSLR transitions to NotReady after BSLS is running?
- How are failed user secret sync operations retried or reported?
- What is the recovery mechanism if the Kopia server pod crashes or restarts?
- How are transient network failures (e.g., API server unavailable) handled?
- What is the status condition when reconciliation fails at step 4 or 5?
This is important for reliability and debuggability in production environments.
Extend the reconciliation flow section to include:
- Error conditions at each step (validation, TLS, deployment, user sync, BSLR change handling)
- Retry/backoff strategy for transient failures
- Terminal failure conditions and recovery steps
- Status conditions and messages for each failure mode
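
To make the request concrete, here is one possible shape for the retry and status reporting, as a sketch only. The `Ready` condition type, the `<Step>Failed` reason format, the 1-second base delay, and the 300-second cap are all assumptions for illustration; none of them come from the design document.

```python
from dataclasses import dataclass


@dataclass
class Condition:
    """Simplified stand-in for a Kubernetes status condition."""
    type: str
    status: str
    reason: str
    message: str


def requeue_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential backoff: base * 2^attempt seconds, capped at `cap`."""
    return min(cap, base * (2 ** attempt))


def failure_condition(step: str, error: str) -> Condition:
    """Build a Ready=False condition naming the reconciliation step that failed."""
    return Condition(
        type="Ready",
        status="False",
        reason=f"{step}Failed",  # hypothetical reason format
        message=error,
    )
```

For example, a third consecutive failure of user secret synchronization would requeue after `requeue_delay(3)` seconds and set `failure_condition("UserSync", "...")` on the BSLS, which keeps the failing step visible in the resource status.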
🤖 Prompt for AI Agents
In docs/design/backupstoragelocationserver-design.md around lines 117 to 136,
extend the Reconciliation Flow to explicitly document error handling for each
step: enumerate possible failure modes for Validation, TLS Setup,
Deployment/Service Management, User Secret Synchronization, and BSLR change
handling; specify retry/backoff behavior for transient errors (exponential
backoff with max retries and requeue intervals), describe terminal failure
detection and recovery actions (e.g., mark NotReady with remediation steps,
fallback to last-known-good config, manual operator intervention), define the
exact status conditions, reason codes and human-readable messages to set on the
BSLS resource for each failure mode, and explain how controller will surface
failures (events, logs) and reconcile after recovery (e.g., watch for
BSLR->Ready transition, requeue user-syncs on secret updates, restart deployment
reconciliations on pod crashes).
/hold being replaced by #1845
@mpryc close? |
Introduces the BSLS design to enable backup and restore operations through a proxy service managed by the OADP Operator.
Why the changes were made
This is a complementary design to #1827.
To enable backup and restore operations via a proxy service managed by the OADP Operator, improving flexibility and management of backup workflows.
How to test the changes made
Read the design.