@mpryc
Contributor

@mpryc mpryc commented Jul 15, 2025

Introduces the BSLS design to enable backup and restore operations through a proxy service managed by the OADP Operator.

Why the changes were made

This is a complementary design to #1827.

To enable backup and restore operations via a proxy service managed by the OADP Operator, improving flexibility and management of backup workflows.

How to test the changes made

Read the design.

Introduces the BSLS design to enable backup and restore
operations through a proxy service managed by the OADP Operator.

Signed-off-by: Michal Pryc <[email protected]>
@openshift-ci openshift-ci bot requested review from kaovilai and sseago July 15, 2025 08:46
@openshift-ci

openshift-ci bot commented Jul 15, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mpryc

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 15, 2025
@weshayutin
Contributor

This looks great to me @mpryc
Do you think we can market this feature as OADP VMDR (Virtual Machine Disaster Recovery)?

@mpryc
Contributor Author

mpryc commented Jul 15, 2025

@weshayutin certainly. I will actually combine the BSLR and BSLS designs into one that is more use-case-centric and less implementation-driven; this was a great offline comment from @kaovilai.

@mpryc
Contributor Author

mpryc commented Jul 15, 2025

> This looks great to me @mpryc Do you think we can market this feature as OADP VMDR (Virtual Machine Disaster Recovery)?

@weshayutin how about "Virtual Machine Data Protection" (VMDP)? Disaster Recovery imo implies the ability to recover an entire virtual machine to a functional state, which would first need a traditional block-level backup and then a restore (from a CSI snapshot). This new feature won't be able to restore a user's actual VM on its own.


The BSLS is a persistent server component deployed in the OpenShift cluster that proxies secure access to a shared Kopia repository.

The BSLS acts as a secure proxy, enabling users to connect to it via Kopia-compatible clients with per user individual credentials. These credentials are provisioned and managed as OpenShift `Secrets` and are synced to the Kopia repository by the BSLS controller to enforce user-level access control.
Contributor

Could we use the OAuth tokens for this?

* Verify that the spec.LocationRepository field references a valid and Ready BackupStorageLocationRepository (BSLR) in the same namespace.
* If invalid, mark the BSLS as NotReady and Requeue.
2. **TLS Setup**
* Generate new or use a TLS certificate(s) from mounted from the OpenShift Secret for the BSLS service.
Contributor

We will need to make sure that anything we do here is FIPS-compliant if we generate the certs. I don't see why that would be an immediate problem, but it's something to verify.

Contributor

@shawn-hurley shawn-hurley left a comment

I would love for the spec of this CRD to be added to the enhancement to get a better feel for it.

@openshift-ci

openshift-ci bot commented Aug 13, 2025

@mpryc: all tests passed!

@openshift-bot

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci openshift-ci bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 12, 2025
@coderabbitai
Contributor

coderabbitai bot commented Nov 12, 2025

Walkthrough

Adds a comprehensive design document for BackupStorageLocationServer (BSLS), a Kubernetes Custom Resource that proxies access to a Kopia-based backup repository. Details include architecture, reconciliation flow, security considerations, controller responsibilities, and example scenarios for cluster-based backup management.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Design Documentation: `docs/design/backupstoragelocationserver-design.md` | Introduces a new design document covering BSLS architecture as a secure multi-tenant proxy, reconciliation logic with TLS setup and credential management, security model with client-side encryption and isolation, deployment flow, operational scenarios, and open issues. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Verify architectural decisions align with security and isolation requirements
  • Review reconciliation flow logic for completeness and correctness
  • Confirm all referenced components (BSLR, TLS, credential handling) are properly explained
  • Validate example scenarios cover realistic use cases

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (3)
docs/design/backupstoragelocationserver-design.md (3)

200-214: Open Issues section appropriately identifies audit logging and monitoring gaps.

The open issues are well-articulated and highlight critical areas (audit logging, client TLS fingerprint management, repository password management, IAM integration). However, line 206 on "Audit Logging" deserves emphasis: in regulated environments (healthcare, financial, etc.), audit logging is often a compliance requirement, not optional. This should be flagged as a blocker for production use in such environments.

Recommend prioritizing:

  1. Audit Logging (line 206): This is critical for compliance and forensics.
  2. Client TLS Fingerprint Management (line 208): Important for usability in dynamic VM environments.
  3. Master Repository Password Management (line 210): Security concern if left unaddressed.

Consider opening separate issues or design docs for audit logging, IAM integration, and secrets management to unblock implementation in regulated environments. I can help draft these if needed.


52-58: Fix Markdown indentation for consistency and readability.

The document has inconsistent indentation in bullet lists (lines 52–58, 110–114, 122–135, and others). While this doesn't impede understanding, it reduces visual clarity and consistency. All bullet points at the same nesting level should align.

Example fix for lines 52–58:

- BSLS acts as a repository server, proxying all access to the underlying Kopia repository managed by the BSLR.
- BSLS handles user authentication, authorization, and access control, providing username/password-based access without exposing repository storage credentials to clients.
- Repository-level access and configuration, including credentials, storage backend, and repository parameters, are managed by the BSLR.
- BSLS enforces per-user isolation of snapshots and policy manifests, ensuring users see only their own backups and configurations.
- The BSLS communicates with Kopia clients over TLS-encrypted connections, ensuring secure data transmission.
- Access control lists (ACLs) and permissions are managed by BSLS, limiting user capabilities based on predefined rules and preventing unauthorized data modification or access.
- This design assumes that the BSLS is deployed within the OpenShift cluster or in a network environment that provides VMs running Kopia clients with low-latency, high-bandwidth internal access, ensuring efficient and secure backup and restore operations.

Apply similar fixes to lines 110–114 and 122–135 for consistency.

Also applies to: 110-114, 122-135


82-82: Fix hyphenation in compound adjectives for grammar and consistency.

Three instances of missing hyphens in compound adjectives:

  • Line 82: "per user individual credentials" → "per-user individual credentials"
  • Line 101: "Kopia specific configuration parameters" → "Kopia-specific configuration parameters"
  • Line 140: "Kopia compatible client" → "Kopia-compatible client"

Apply this diff:

- These credentials are provisioned and managed as OpenShift `Secrets` and are synced to the Kopia repository by the BSLS controller to enforce user-level access control.
+ These credentials are provisioned and managed as OpenShift `Secrets` and are synced to the Kopia repository by the BSLS controller to enforce per-user access control.

- BSLS does not contain Kopia specific configuration parameters such as encryption algorithms, compression settings, and other repository-specific options, those are managed by the BSLR or DPA.
+ BSLS does not contain Kopia-specific configuration parameters such as encryption algorithms, compression settings, and other repository-specific options, those are managed by the BSLR or DPA.

- In the context of KubeVirt, the BSLS can be used to manage backups for virtual machines running within the cluster. Users within a KubeVirt VM can use a standard Kopia compatible client to back up and restore their own files on their own schedule, leveraging the BSLS to manage the repository lifecycle and ensure secure, efficient backups.
+ In the context of KubeVirt, the BSLS can be used to manage backups for virtual machines running within the cluster. Users within a KubeVirt VM can use a standard Kopia-compatible client to back up and restore their own files on their own schedule, leveraging the BSLS to manage the repository lifecycle and ensure secure, efficient backups.

Also applies to: 101-101, 140-140

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 79d3f28 and a9d5864.

📒 Files selected for processing (1)
  • docs/design/backupstoragelocationserver-design.md (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • docs/design/backupstoragelocationserver-design.md
🪛 LanguageTool
docs/design/backupstoragelocationserver-design.md

[style] ~21-~21: ‘with respect to’ might be wordy. Consider a shorter alternative.
Context: ...t rest. BSLS itself remains stateless with respect to user data. It does not persist sensitiv...

(EN_WORDINESS_PREMIUM_WITH_RESPECT_TO)


[grammar] ~82-~82: Use a hyphen to join words.
Context: ...it via Kopia-compatible clients with per user individual credentials. These crede...

(QB_NEW_EN_HYPHEN)


[grammar] ~101-~101: Use a hyphen to join words.
Context: ...tion**: - BSLS does not contain Kopia specific configuration parameters such a...

(QB_NEW_EN_HYPHEN)


[grammar] ~123-~123: Ensure spelling is correct
Context: ...pace. * If invalid, mark the BSLS as NotReady and Requeue. 2. TLS Setup * Gen...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~140-~140: Use a hyphen to join words.
Context: ...n a KubeVirt VM can use a standard Kopia compatible client to back up and restore...

(QB_NEW_EN_HYPHEN)

🪛 markdownlint-cli2 (0.18.1)
docs/design/backupstoragelocationserver-design.md

52-52: Unordered list indentation
Expected: 0; Actual: 1

(MD007, ul-indent)


53-53: Unordered list indentation
Expected: 0; Actual: 1

(MD007, ul-indent)


54-54: Unordered list indentation
Expected: 0; Actual: 1

(MD007, ul-indent)


55-55: Unordered list indentation
Expected: 0; Actual: 1

(MD007, ul-indent)


56-56: Unordered list indentation
Expected: 0; Actual: 1

(MD007, ul-indent)


57-57: Unordered list indentation
Expected: 0; Actual: 1

(MD007, ul-indent)


58-58: Unordered list indentation
Expected: 0; Actual: 1

(MD007, ul-indent)


110-110: Unordered list indentation
Expected: 0; Actual: 1

(MD007, ul-indent)


112-112: Inconsistent indentation for list items at the same level
Expected: 1; Actual: 2

(MD005, list-indent)


112-112: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


114-114: Unordered list indentation
Expected: 0; Actual: 1

(MD007, ul-indent)


122-122: Unordered list indentation
Expected: 0; Actual: 3

(MD007, ul-indent)


123-123: Unordered list indentation
Expected: 0; Actual: 3

(MD007, ul-indent)


125-125: Unordered list indentation
Expected: 0; Actual: 3

(MD007, ul-indent)


126-126: Unordered list indentation
Expected: 0; Actual: 3

(MD007, ul-indent)


127-127: Unordered list indentation
Expected: 0; Actual: 3

(MD007, ul-indent)


129-129: Unordered list indentation
Expected: 0; Actual: 3

(MD007, ul-indent)


130-130: Unordered list indentation
Expected: 0; Actual: 3

(MD007, ul-indent)


132-132: Unordered list indentation
Expected: 0; Actual: 3

(MD007, ul-indent)


133-133: Unordered list indentation
Expected: 0; Actual: 3

(MD007, ul-indent)


135-135: Unordered list indentation
Expected: 0; Actual: 3

(MD007, ul-indent)


180-180: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


182-182: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


184-184: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


186-186: Unordered list indentation
Expected: 0; Actual: 2

(MD007, ul-indent)


193-193: Unordered list indentation
Expected: 0; Actual: 1

(MD007, ul-indent)


195-195: Unordered list indentation
Expected: 0; Actual: 1

(MD007, ul-indent)


197-197: Unordered list indentation
Expected: 0; Actual: 1

(MD007, ul-indent)


206-206: Unordered list indentation
Expected: 0; Actual: 1

(MD007, ul-indent)


208-208: Unordered list indentation
Expected: 0; Actual: 1

(MD007, ul-indent)


210-210: Unordered list indentation
Expected: 0; Actual: 1

(MD007, ul-indent)


212-212: Unordered list indentation
Expected: 0; Actual: 1

(MD007, ul-indent)


214-214: Unordered list indentation
Expected: 0; Actual: 1

(MD007, ul-indent)

🔇 Additional comments (2)
docs/design/backupstoragelocationserver-design.md (2)

124-127: Rewrite: Several concerns in the review are already addressed in the design; focus on genuine gaps only.

The design document (lines 124-127 and surrounding context) already specifies:

  • Certificate source: "from mounted from the OpenShift Secret for the BSLS service" (line 125)
  • Fingerprint bootstrap complexity: The design acknowledges this at line 208: "The requirement for clients to validate the TLS certificate fingerprint introduces complexity in dynamic or short-lived VM environments. Automating this bootstrap process would improve usability."

Valid remaining gaps to address:

  • Certificate rotation strategy: No lifecycle or rotation frequency specified
  • FIPS compliance approach: Not mentioned; clarify if FIPS-compliant algorithms and key generation are required for regulated deployments

Recommend adding a brief note on certificate rotation cadence and whether FIPS compliance is in scope for this design.

Likely an incorrect or invalid review comment.


82-84: Authentication design defers critical mechanics to "open issues"—clarify before implementation.

The document specifies username/password-based access (line 53) and notes credentials are managed via Secrets (line 95), but leaves essential details to future work:

  • Password provisioning and lifecycle: No workflow documented. Line 210 flags "Master Repository Password Management" as an open issue.
  • OAuth/IAM integration: Line 212 explicitly marks this as open, though it's foundational for enterprise adoption.
  • Credential sync error handling: Not addressed. No scenarios for invalid credentials, missing Secrets, or failed sync to repository.
  • FIPS compliance: Not mentioned in certificate generation (lines 125–127).
  • User isolation implementation: Line 96 relies on "Kopia's access control mechanisms" without specifying how BSLS enforces per-user boundaries.

Before implementing BSLS, these gaps should either be resolved in the design or explicitly scoped out with justification. Security-critical items (provisioning workflow, error recovery) warrant design clarity; others (IAM, FIPS) can be deferred if documented as such.

Comment on lines +93 to +99
* **Credential Management**:
* **User Access**:
- User credentials (username/password) for accessing the BSLS are stored as OpenShift `Secrets`. These are managed by the BSLS controller and are used by Kopia clients running inside VMs.
- One BSLS can be shared between multiple users. These users cannot see each other’s snapshots, policies, or data. Access is isolated per user through Kopia’s access control mechanisms.

* **Repository Access**:
Credentials required by the BSLS to open and manage the Kopia repository are also stored as OpenShift `Secrets` and referenced in the OADP `DataProtectionApplication` (DPA) or `BSLR` CRDs. The BSLS uses these to authenticate to the backend storage securely.
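
The User Access bullets describe credentials held in `Secrets` and synced into the Kopia repository. One way to picture that sync is as a diff between the users the Secrets describe and the users the repository already knows. The Python sketch below is only an illustration under assumptions: `SyncPlan`, `plan_user_sync`, and the idea of comparing opaque credential digests are hypothetical and are not taken from the design or from the Kopia API.

```python
from dataclasses import dataclass, field


@dataclass
class SyncPlan:
    """Hypothetical result of comparing Secrets with repository users."""
    add: list = field(default_factory=list)
    update: list = field(default_factory=list)
    remove: list = field(default_factory=list)


def plan_user_sync(desired: dict, actual: dict) -> SyncPlan:
    """Compare desired users (from Secrets) with users in the repository.

    Both arguments map a username to an opaque credential digest.
    """
    plan = SyncPlan()
    for user, digest in sorted(desired.items()):
        if user not in actual:
            # Present in a Secret but unknown to the repository.
            plan.add.append(user)
        elif actual[user] != digest:
            # Known user whose credential changed in the Secret.
            plan.update.append(user)
    # Users whose Secret was deleted should lose access.
    plan.remove = sorted(u for u in actual if u not in desired)
    return plan
```

A controller following this shape would apply `add` and `update` before `remove`, so that a failed run never leaves a valid user without credentials; that ordering is a design choice of the sketch, not something the document specifies.
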
Contributor

⚠️ Potential issue | 🟠 Major

User isolation implementation mechanism unclear.

Line 96 states users "cannot see each other's snapshots, policies, or data. Access is isolated per user through Kopia's access control mechanisms," but the design doesn't explain how this isolation is enforced:

  • Does Kopia use RBAC or a user namespace mechanism?
  • Is isolation enforced by the BSLS controller or delegated entirely to Kopia?
  • What happens if a user tries to access another user's snapshot? Is it rejected at the Kopia API level or BSLS level?
  • How are policies enforced?

Without clarity here, implementers may make inconsistent assumptions about responsibility boundaries.

Please clarify the user isolation mechanism, either in the design or referenced documentation:

  1. Where is isolation enforced (BSLS controller vs. Kopia server)?
  2. What is the specific Kopia feature/configuration used?
  3. How are access violations detected and logged?
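
To make the three questions concrete, here is a minimal Python sketch of an ownership check that returns a decision together with a reason suitable for logging. It assumes identities of the form `user@host` and a strict owner-only rule; `Decision` and `authorize` are hypothetical names, and the sketch does not claim to show where BSLS or Kopia actually enforce isolation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    """Outcome of an access check plus a message that could be logged."""
    allowed: bool
    reason: str


def authorize(requester: str, owner: str, action: str) -> Decision:
    """Allow an action only when the requester owns the snapshot.

    `requester` and `owner` are assumed to be "user@host" identities.
    """
    if requester == owner:
        return Decision(True, f"{requester} owns the snapshot")
    # A rejected request carries enough detail for an audit record.
    return Decision(
        False, f"{requester} may not {action} snapshots owned by {owner}"
    )
```

Whichever component ends up owning this check, recording the `reason` of every denied `Decision` would give the audit trail the third question asks about.
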
🤖 Prompt for AI Agents
In docs/design/backupstoragelocationserver-design.md around lines 93 to 99,
clarify the user isolation mechanism described: explicitly state whether
isolation is enforced by the BSLS controller or delegated to the Kopia server
(or a combination), name the exact Kopia features/configuration used (e.g.,
Kopia user principals/roles, repository-per-user vs. shared repo with ACLs,
access tokens, policy bindings or RBAC integration), and explain the enforcement
point for access checks (e.g., Kopia API rejects cross-user access; BSLS
performs request validation/authorization before proxying). Also add how access
violations are detected and logged (what components emit logs/audit events, log
locations and message formats, and any OpenShift audit integration), and provide
a brief example flow for a rejected access attempt to make responsibilities and
failure behavior unambiguous.

Comment on lines +117 to +136
### Reconciliation Flow

When a BackupStorageLocationServer (BSLS) resource is created or modified, the controller takes the following steps:

1. **Validation**
* Verify that the spec.LocationRepository field references a valid and Ready BackupStorageLocationRepository (BSLR) in the same namespace.
* If invalid, mark the BSLS as NotReady and Requeue.
2. **TLS Setup**
* Generate new or use a TLS certificate(s) from mounted from the OpenShift Secret for the BSLS service.
* Record the certificate's SHA256 fingerprint in the BSLS .status.tlsFingerprint field for client verification.
* The BackupStorageLocationServer Certificate's SHA256 fingerprint is required for the client to connect.
3. **Deployment/Service Management**
* Deploy or modify already existing Kopia Server Pod with appropriate Configuration.
* Expose the server internally via OpenShift Service.
4. **User Secret Synchronization**
* Reconcile each credential into the Kopia repository using the Kopia server’s user management API.
* Watch for referenced OpenShift Secrets (user credentials) and re-run Kopia repository credential update logic.
5. **Change Handling**
* If the BSLR is modified (e.g., storage config updated), the BSLS is Reconciled.
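
Step 2 of the flow records a SHA256 fingerprint in `.status.tlsFingerprint`. The Python sketch below shows one way such a value could be derived from a PEM certificate. It assumes the fingerprint is the lowercase hex SHA-256 digest of the DER-encoded certificate, which the quoted design does not state; `pem_to_der` and `tls_fingerprint` are hypothetical helper names.

```python
import base64
import hashlib


def pem_to_der(pem: str) -> bytes:
    """Strip the PEM armor lines and decode the base64 body to DER bytes."""
    lines = [line.strip() for line in pem.strip().splitlines()]
    body = "".join(
        line for line in lines if line and not line.startswith("-----")
    )
    return base64.b64decode(body)


def tls_fingerprint(pem: str) -> str:
    """Return the hex SHA-256 digest of the DER-encoded certificate."""
    return hashlib.sha256(pem_to_der(pem)).hexdigest()
```

The helper expects a single certificate; a Secret holding a full chain would need the leaf certificate extracted first.
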

Contributor

⚠️ Potential issue | 🟠 Major

Reconciliation flow incomplete; missing error handling scenarios.

The reconciliation flow (lines 121–135) describes the happy path but lacks error recovery for critical scenarios:

  • What happens if the BSLR transitions to NotReady after BSLS is running?
  • How are failed user secret sync operations retried or reported?
  • What is the recovery mechanism if the Kopia server pod crashes or restarts?
  • How are transient network failures (e.g., API server unavailable) handled?
  • What is the status condition when reconciliation fails at step 4 or 5?

This is important for reliability and debuggability in production environments.

Extend the reconciliation flow section to include:

  1. Error conditions at each step (validation, TLS, deployment, user sync, BSLR change handling)
  2. Retry/backoff strategy for transient failures
  3. Terminal failure conditions and recovery steps
  4. Status conditions and messages for each failure mode
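
As a rough illustration of items 2 and 4, the Python sketch below pairs a capped exponential backoff with a per-step status condition. The step names, the `Ready`/`<Step>Failed` condition shape, and the 1 s base and 300 s cap are all assumptions made for the example; none of them come from the design document.

```python
from typing import Optional, Tuple

# Hypothetical identifiers for the five reconciliation steps.
STEPS = (
    "Validation",
    "TLSSetup",
    "Deployment",
    "UserSecretSync",
    "ChangeHandling",
)


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential backoff in seconds, doubling per attempt up to `cap`."""
    return min(cap, base * (2 ** attempt))


def on_failure(
    step: str, message: str, attempt: int, transient: bool
) -> Tuple[dict, Optional[float]]:
    """Build a status condition and decide whether to requeue."""
    if step not in STEPS:
        raise ValueError(f"unknown reconciliation step: {step}")
    condition = {
        "type": "Ready",
        "status": "False",
        "reason": f"{step}Failed",
        "message": message,
    }
    # Terminal failures are not requeued; they wait for a spec change.
    requeue_after = backoff_delay(attempt) if transient else None
    return condition, requeue_after
```

For example, a third consecutive transient failure during user sync would yield the reason `UserSecretSyncFailed` and a requeue after 4 seconds, while a terminal validation failure would set the condition and return no requeue delay.
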
🤖 Prompt for AI Agents
In docs/design/backupstoragelocationserver-design.md around lines 117 to 136,
extend the Reconciliation Flow to explicitly document error handling for each
step: enumerate possible failure modes for Validation, TLS Setup,
Deployment/Service Management, User Secret Synchronization, and BSLR change
handling; specify retry/backoff behavior for transient errors (exponential
backoff with max retries and requeue intervals), describe terminal failure
detection and recovery actions (e.g., mark NotReady with remediation steps,
fallback to last-known-good config, manual operator intervention), define the
exact status conditions, reason codes and human-readable messages to set on the
BSLS resource for each failure mode, and explain how controller will surface
failures (events, logs) and reconcile after recovery (e.g., watch for
BSLR->Ready transition, requeue user-syncs on secret updates, restart deployment
reconciliations on pod crashes).

@kaovilai
Member

/hold being replaced by #1845

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 12, 2025
@weshayutin
Copy link
Contributor

@mpryc close?

@mpryc mpryc closed this Nov 21, 2025
Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale.
