
[Task]: Backend dedup architecture (MVP) #2867

@ikreymer

Description


The general architecture for dedup is as follows:

  • Dedup is handled by an index associated with a Collection
  • The index is keyed by hash, covering all URLs from all crawls in the collection
  • The mapping is hash -> date|url (a crawl id could also be added); see the sketch after this list
  • The index is stored in Redis (or Redis compatible server, like KVRocks)
  • The index can be regenerated on the fly from existing CDX data by loading one or more WACZ files that contain CDX indexes.
  • Adding new crawls to a collection starts an import job that imports the CDX from those crawls.
  • Crawl workflows store an optional collection id that is used as the dedup index for new crawls. That collection must also be one that the crawl is auto-added to.
  • Collections have a bool indicating whether they also have a dedup index.
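
To make the index layout concrete, here is a minimal sketch of the hash -> date|url mapping and the CDX-based import, assuming a plain redis-py client. The `dedup:{coll_id}` key scheme, the use of a Redis hash, and the function names are illustrative assumptions, not the actual implementation.

```python
import json

import redis

# Hypothetical client setup; in practice this would point at the
# per-collection Redis (or KVRocks) instance managed by the operator.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def index_key(coll_id: str) -> str:
    # One Redis hash per collection index (illustrative key scheme)
    return f"dedup:{coll_id}"


def add_record(coll_id: str, digest: str, date: str, url: str) -> None:
    # Store hash -> "date|url"; a crawl id could be appended as a third field
    r.hsetnx(index_key(coll_id), digest, f"{date}|{url}")


def lookup(coll_id: str, digest: str):
    # Returns "date|url" for an already-seen payload, or None
    return r.hget(index_key(coll_id), digest)


def import_cdxj(coll_id: str, cdxj_lines) -> None:
    # Rebuild or extend the index from CDXJ lines extracted from WACZ files
    for line in cdxj_lines:
        # CDXJ line format: "<url key> <timestamp> <json fields>"
        _, timestamp, data = line.split(" ", 2)
        fields = json.loads(data)
        digest, url = fields.get("digest"), fields.get("url")
        if digest and url:
            add_record(coll_id, digest, timestamp, url)
```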

The backend implementation involves:

  • A new CollIndex CRD type
  • Operator that manages the new CRD type, creating a new Redis instance when the index should exist
  • Operator starts the crawler in 'indexer' mode (will be available via Deduplication (Initial Support), browsertrix-crawler#884)
  • Collection has a new 'hasDedupIndex' field
  • Workflows have a new 'dedupCollIndex' field for dedup while crawling. The dedupCollIndex must also be a collection that the crawl is auto-added to.
  • There is a new waiting state, 'waiting_for_dedup_index', entered if a crawl is starting but the index is not yet ready (see the sketch after this list).
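
Below is a minimal sketch of the new fields and the waiting state, assuming Pydantic-style backend models; apart from the 'hasDedupIndex', 'dedupCollIndex', and 'waiting_for_dedup_index' names taken from this issue, the class names, fields, and logic are illustrative.

```python
from typing import List, Optional
from uuid import UUID

from pydantic import BaseModel, Field


class Collection(BaseModel):
    # Illustrative subset of collection fields
    id: UUID
    name: str
    # True if this collection also maintains a dedup index
    hasDedupIndex: bool = False


class CrawlConfig(BaseModel):
    # Illustrative subset of workflow fields
    id: UUID
    # Collection whose index is used for dedup while crawling; it must
    # also appear in the workflow's auto-add list
    dedupCollIndex: Optional[UUID] = None
    autoAddCollections: List[UUID] = Field(default_factory=list)


# New waiting state introduced by this issue
WAITING_FOR_DEDUP_INDEX = "waiting_for_dedup_index"


def initial_crawl_state(index_ready: bool) -> str:
    # Operator-side decision (illustrative): hold the crawl in the new
    # waiting state until the collection's dedup index is ready
    return "starting" if index_ready else WAITING_FOR_DEDUP_INDEX
```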

Context

No response
