
[Task]: Backend dedup architecture (MVP) #2867

@ikreymer

Description


The general architecture for dedup is as follows:

  • Dedup is handled by an index associated with a Collection
  • The index is keyed by hash, covering all URLs from all crawls in the collection
  • The mapping is hash -> date|url (a crawl id could also be added); see the sketch after this list
  • The index is stored in Redis (or Redis compatible server, like KVRocks)
  • The index can be regenerated on the fly from existing CDX data by loading one or more WACZ files that contain CDX indexes.
  • Adding new crawls to a collection starts an import job that imports the CDX from those crawls.
  • Crawl workflows store an optional collection id that is used as the dedup index for new crawls. That collection must also be one that the crawl is auto-added to.
  • Collections have a bool indicating whether they also have a dedup index.
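
To make the index layout concrete, here is a minimal sketch of the hash -> date|url mapping and the CDX-based import, assuming a plain redis-py client. The `dedup:{coll_id}` key scheme, the use of a Redis hash, and the function names are illustrative assumptions, not the actual implementation.

```python
import json

import redis

# Hypothetical client setup; in practice this would point at the
# per-collection Redis (or KVRocks) instance managed by the operator.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def index_key(coll_id: str) -> str:
    # One Redis hash per collection index (illustrative key scheme)
    return f"dedup:{coll_id}"


def add_record(coll_id: str, digest: str, date: str, url: str) -> None:
    # Store hash -> "date|url"; a crawl id could be appended as a third field
    r.hsetnx(index_key(coll_id), digest, f"{date}|{url}")


def lookup(coll_id: str, digest: str):
    # Returns "date|url" for an already-seen payload, or None
    return r.hget(index_key(coll_id), digest)


def import_cdxj(coll_id: str, cdxj_lines) -> None:
    # Rebuild or extend the index from CDXJ lines extracted from WACZ files
    for line in cdxj_lines:
        # CDXJ line format: "<url key> <timestamp> <json fields>"
        _, timestamp, data = line.split(" ", 2)
        fields = json.loads(data)
        digest, url = fields.get("digest"), fields.get("url")
        if digest and url:
            add_record(coll_id, digest, timestamp, url)
```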

The backend implementation involves:

  • A new CollIndex CRD type
  • Operator that manages the new CRD type, creating a new Redis instance when the index should exist
  • Operator starts the crawler in 'indexer' mode (will be available via Deduplication (Initial Support), browsertrix-crawler#884)
  • Collection has a new 'hasDedupIndex' field
  • Workflows have a new 'dedupCollIndex' field for dedup while crawling. The dedupCollIndex must also be a collection that the crawl is auto-added to.
  • There is a new waiting state, 'waiting_for_dedup_index', entered if a crawl is starting but the index is not yet ready (see the sketch after this list).
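
Below is a minimal sketch of the new fields and the waiting state, assuming Pydantic-style backend models; apart from the 'hasDedupIndex', 'dedupCollIndex', and 'waiting_for_dedup_index' names taken from this issue, the class names, fields, and logic are illustrative.

```python
from typing import List, Optional
from uuid import UUID

from pydantic import BaseModel, Field


class Collection(BaseModel):
    # Illustrative subset of collection fields
    id: UUID
    name: str
    # True if this collection also maintains a dedup index
    hasDedupIndex: bool = False


class CrawlConfig(BaseModel):
    # Illustrative subset of workflow fields
    id: UUID
    # Collection whose index is used for dedup while crawling; it must
    # also appear in the workflow's auto-add list
    dedupCollIndex: Optional[UUID] = None
    autoAddCollections: List[UUID] = Field(default_factory=list)


# New waiting state introduced by this issue
WAITING_FOR_DEDUP_INDEX = "waiting_for_dedup_index"


def initial_crawl_state(index_ready: bool) -> str:
    # Operator-side decision (illustrative): hold the crawl in the new
    # waiting state until the collection's dedup index is ready
    return "starting" if index_ready else WAITING_FOR_DEDUP_INDEX
```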

Context

No response
