Description
The general architecture for dedup is as follows:
- Dedup is handled by an index associated with a Collection
- The index is keyed by content hash for all URLs across all crawls in the collection
- The mapping is hash -> date|url (a crawl id could also be added); see the sketch after this list
- The index is stored in Redis (or a Redis-compatible server, such as KVRocks)
- The index can be regenerated on the fly from existing CDX data by loading one or more WACZ files that contain CDX indexes.
- Adding new crawls to a collection starts an import job that loads the CDX from the new crawls into the index.
- Crawl workflows store an optional collection id whose index is used for dedup on new crawls; that collection must also be one the crawl is auto-added to.
- Collections have a bool indicating if they also have a dedup index.
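
As a rough illustration of the keying scheme above, here is a minimal sketch of a per-collection index in Redis. The `dedup:<coll_id>` key layout and the plain `date|url` value encoding are assumptions for illustration, not the actual implementation:

```python
# Minimal sketch of the per-collection dedup index described above.
# Key layout ("dedup:<coll_id>") and the "date|url" value encoding are
# assumptions for illustration, not the actual Browsertrix implementation.
from typing import Optional, Tuple

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)


def index_key(coll_id: str) -> str:
    # One Redis hash per collection holds that collection's entire index
    return f"dedup:{coll_id}"


def add_entry(coll_id: str, payload_hash: str, date: str, url: str) -> None:
    # hash -> "date|url" (a crawl id could be appended as a third field)
    r.hset(index_key(coll_id), payload_hash, f"{date}|{url}")


def lookup(coll_id: str, payload_hash: str) -> Optional[Tuple[str, str]]:
    # Returns (date, url) for a previously seen payload, or None if unseen
    value = r.hget(index_key(coll_id), payload_hash)
    return tuple(value.split("|", 1)) if value else None


# Example: record a response seen while crawling, then check for a duplicate
add_entry("my-coll", "sha256:abc123", "20240101120000", "https://example.com/")
print(lookup("my-coll", "sha256:abc123"))
```

In this sketch, keeping a single Redis hash per collection makes lookups O(1) and lets the whole index be dropped or rebuilt (e.g. from WACZ/CDX imports) by deleting one key.
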
The backend implementation involves:
- A new CollIndex CRD type
- Operator that manages the new CRD type, creating a new Redis instance when the index should exist
- Operator starts the crawler in 'indexer' mode (will be available via browsertrix-crawler#884, Deduplication (Initial Support))
- Collection has a new 'hasDedupIndex' field
- Workflows have a new 'dedupCollIndex' field for dedup while crawling; the dedupCollIndex must reference a collection that the crawl is auto-added to
- There is a new waiting state, 'waiting_for_dedup_index', entered if a crawl is starting but the index is not yet ready (see the model sketch after this list)
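
To make the new fields and state concrete, here is a rough sketch of how they could appear on the backend models. Apart from 'hasDedupIndex', 'dedupCollIndex', and 'waiting_for_dedup_index', the class and field names below are assumptions for illustration, not the actual Browsertrix models:

```python
# Sketch only: model shapes are assumed; only hasDedupIndex, dedupCollIndex,
# and waiting_for_dedup_index come from the description above.
from enum import Enum
from typing import List, Optional
from uuid import UUID

from pydantic import BaseModel


class Collection(BaseModel):
    id: UUID
    name: str
    # True if this collection also maintains a dedup index
    hasDedupIndex: bool = False


class CrawlConfig(BaseModel):
    id: UUID
    # Collections the crawl is auto-added to
    autoAddCollections: List[UUID] = []
    # Collection whose dedup index is used while crawling;
    # must also be one of the auto-add collections
    dedupCollIndex: Optional[UUID] = None


class CrawlState(str, Enum):
    # Entered when a crawl is starting but its dedup index is not yet ready
    WAITING_FOR_DEDUP_INDEX = "waiting_for_dedup_index"
    RUNNING = "running"
    COMPLETE = "complete"
```

In this sketch, a crawl would sit in WAITING_FOR_DEDUP_INDEX until the operator reports the collection's index ready, then transition to running.
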