
Density-based clustering for vector embeddings using HDBSCAN and cosine similarity. Features automatic parameter search, PCA, and quality metrics without defining cluster counts.


yigitkonur/hdbscan-cluster-tool


# 🔬 Embedding Clustering Toolkit 🔬

> Stop guessing cluster counts. Start discovering natural groupings.

The ultimate clustering toolkit for high-dimensional text embeddings. It finds the natural structure in your data using DBSCAN and HDBSCAN, no magic numbers required.


Embedding Clustering Toolkit is the analysis partner your embeddings deserve. Stop arbitrarily picking "k=5" and praying your clusters make sense. This toolkit uses density-based algorithms that discover the natural groupings in your data, automatically identifying how many clusters exist and which points are just noise.

| 🎯 Auto Cluster Detection | 🔍 Parameter Search | 📊 Quality Metrics | 🗑️ Noise Handling |
|---|---|---|---|
| No predefined k needed | Find optimal thresholds | Silhouette scores built-in | Outliers isolated cleanly |

**How it slaps:**

- **You:** Load your OpenAI/Cohere/any embeddings CSV
- **Toolkit:** Searches parameters, finds natural clusters, isolates noise
- **You:** Export to Excel, visualize, analyze
- **Result:** Meaningful groupings without arbitrary decisions. Go grab a coffee. ☕

## 💥 Why This Slaps K-Means

Clustering embeddings with K-Means is like forcing your data into boxes it doesn't fit. Density-based clustering finds the boxes that actually exist.

โŒ The K-Means Way (Pain) โœ… The DBSCAN Way (Glory)
  1. Guess k=10. Run K-Means.
  2. Results look weird. Try k=15.
  3. Still bad. Maybe k=8?
  4. Elbow method says k=12. Sure, why not.
  5. Get clusters that mix unrelated items.
  1. Run the notebook.
  2. Algorithm finds 47 natural clusters.
  3. Outliers are flagged as noise.
  4. Silhouette score confirms quality.
  5. Export and ship. Done. ๐Ÿš€

We're not forcing structure. We're discovering structure with cosine similarity, density estimation, and automatic parameter optimization that processes your high-dimensional embeddings the right way.


## 🚀 Get Started in 60 Seconds

### Prerequisites

- Python 3.9+
- Your embeddings in CSV format

### Installation

```bash
# Clone the repository
git clone https://github.com/yigitkonur/embedding-clustering-toolkit.git
cd embedding-clustering-toolkit

# Install dependencies
pip install -r requirements.txt
```

### Quick Start with Jupyter Notebook

The recommended way to use this toolkit is through the interactive Jupyter notebook:

```bash
jupyter notebook embedding_clustering_toolkit.ipynb
```

The notebook provides:

- 📋 Configurable parameters at the top
- 📊 Interactive visualizations
- 🔍 Parameter search with visual results
- 💾 One-click export to Excel

### Quick Start with Scripts

If you prefer command-line scripts:

```bash
# 1. Edit the input path in classify.py
# 2. Run DBSCAN clustering
python classify.py

# Or run HDBSCAN with PCA
python classify_hdbscan.py

# Or find optimal parameters first
python sweet_spot_finder.py
```

## 🎮 Usage: Fire and Forget

### Using the Jupyter Notebook (Recommended)

**1. Configure Your Analysis**

```python
config = ClusteringConfig(
    input_csv_path="your_embeddings.csv",
    vector_dimension=3072,      # match your embedding model
    similarity_threshold=0.78,  # higher = tighter clusters
    min_samples=2,              # minimum points for a cluster
)
```

**2. Run All Cells**

The notebook walks you through:

  1. Loading and validating your embeddings
  2. Finding optimal parameters (optional but recommended)
  3. Running DBSCAN and/or HDBSCAN clustering
  4. Visualizing results
  5. Exporting to Excel

**3. Analyze Results**

```
📊 DBSCAN Results:
   ├─ Clusters: 47
   ├─ Noise points: 23 (4.2%)
   ├─ Clustered points: 527 (95.8%)
   ├─ Avg cluster size: 11.2
   └─ Silhouette score: 0.634
```

### CSV Format

Your CSV should have embeddings split across columns or in a single column:

**Split format (default):**

```
Name,1,2,3,4,5,6
"Document A","0.001,0.023,...","0.045,0.012,...","...",...
```

**Single column format:**

```
Name,embedding
"Document A","[0.001, 0.023, 0.045, ...]"
```
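Both layouts reduce to the same `(n_rows, n_dims)` matrix. The toolkit's own loader handles this for you; the helper functions below are an illustrative sketch, not its API:

```python
import io
import numpy as np
import pandas as pd

# Illustrative helpers (not the toolkit's API): parse both CSV layouts
# into one numeric matrix and confirm they agree.
split_csv = 'Name,1,2\n"Doc A","0.001,0.023","0.045,0.012"\n'
single_csv = 'Name,embedding\n"Doc A","[0.001, 0.023, 0.045, 0.012]"\n'

def parse_numbers(text):
    return [float(x) for x in text.strip("[] ").split(",")]

def load_split(csv_text):
    df = pd.read_csv(io.StringIO(csv_text))
    # Each numbered column holds a comma-separated slice of the full vector.
    rows = [",".join(map(str, row)) for row in df.drop(columns=["Name"]).values]
    return np.array([parse_numbers(r) for r in rows])

def load_single(csv_text):
    df = pd.read_csv(io.StringIO(csv_text))
    return np.array([parse_numbers(s) for s in df["embedding"]])
```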

## ✨ Feature Breakdown: The Secret Sauce

| Feature | What It Does | Why You Care |
|---|---|---|
| 🎯 **DBSCAN Clustering** (cosine similarity) | Groups embeddings by semantic similarity without predefined k | Natural clusters that actually make sense |
| ⚡ **HDBSCAN + PCA** (dimensionality reduction) | Reduces 3072D → 30D, then clusters | 10x faster on large datasets |
| 🔍 **Parameter Search** (grid search optimization) | Tests hundreds of threshold/min_samples combos | Find the "sweet spot" automatically |
| 📊 **Quality Metrics** (silhouette scoring) | Measures how well-separated your clusters are | Know if your clustering is actually good |
| 🗑️ **Noise Detection** (outlier isolation) | Flags points that don't belong anywhere | Clean clusters without forced assignments |
| 📈 **Visualizations** (PCA projections) | 2D scatter plots of your clusters | See the structure in your data |
| 💾 **Excel Export** (one-click output) | Sorted results with cluster IDs | Ready for downstream analysis |

โš™๏ธ Configuration & Customization

Key Parameters

Parameter Default Description
similarity_threshold 0.78 Cosine similarity cutoff (0-1). Higher = tighter clusters.
min_samples 2 Minimum points to form a cluster.
vector_dimension 3072 Expected embedding dimensions.
n_pca_components 30 PCA dimensions for HDBSCAN.

### Choosing Parameters

| If you get... | Try... |
|---|---|
| Too many tiny clusters | Lower `similarity_threshold` (e.g., 0.70) |
| Everything in one cluster | Raise `similarity_threshold` (e.g., 0.85) |
| Too much noise | Lower `min_samples` to 1 |
| Noisy clusters | Raise `min_samples` to 3-5 |

### Embedding Model Dimensions

| Model | Dimensions | Set `vector_dimension` to |
|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 3072 |
| OpenAI text-embedding-3-small | 1536 | 1536 |
| OpenAI text-embedding-ada-002 | 1536 | 1536 |
| Cohere embed-english-v3.0 | 1024 | 1024 |
| Voyage voyage-2 | 1024 | 1024 |
| Custom | varies | your dimension |

๐Ÿ“ Repository Structure

embedding-clustering-toolkit/
โ”œโ”€โ”€ ๐Ÿ““ embedding_clustering_toolkit.ipynb  # Interactive notebook (START HERE)
โ”œโ”€โ”€ ๐Ÿ“œ classify.py                         # DBSCAN clustering script
โ”œโ”€โ”€ ๐Ÿ“œ classify_hdbscan.py                 # HDBSCAN + PCA script
โ”œโ”€โ”€ ๐Ÿ“œ sweet_spot_finder.py                # Parameter optimization script
โ”œโ”€โ”€ ๐Ÿ“‹ sample.csv                          # Example embeddings data
โ”œโ”€โ”€ ๐Ÿ“‹ requirements.txt                    # Python dependencies
โ””โ”€โ”€ ๐Ÿ“– README.md                           # You are here

๐Ÿ” Understanding the Output

Cluster Labels

  • cluster >= 0: Assigned to a specific cluster
  • cluster = -1: Noise/outlier (doesn't fit any cluster)
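In the exported table, that convention makes noise easy to split off with a one-line filter; a quick pandas sketch (column name as described above, toy data):

```python
import pandas as pd

# Toy result frame with the "cluster" column described above.
df = pd.DataFrame({"Name": ["Doc A", "Doc B", "Doc C"], "cluster": [0, 0, -1]})

clustered = df[df["cluster"] >= 0]   # rows assigned to a cluster
noise = df[df["cluster"] == -1]      # outliers
noise_ratio = len(noise) / len(df)   # fraction of points left unassigned
```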

### Quality Metrics

| Metric | Good | Acceptable | Poor |
|---|---|---|---|
| Silhouette Score | > 0.5 | 0.25 - 0.5 | < 0.25 |
| Noise Ratio | < 10% | 10-30% | > 30% |
| Avg Cluster Size | > 5 | 3-5 | < 3 |
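These metrics are easy to recompute yourself from a label array. A sketch with scikit-learn (the function name is illustrative, not the toolkit's API; noise points are excluded before scoring since they are not a real cluster):

```python
import numpy as np
from sklearn.metrics import silhouette_score

def clustering_report(X, labels):
    """Illustrative re-computation of the metrics above (not the toolkit's API)."""
    labels = np.asarray(labels)
    mask = labels >= 0                      # drop noise (-1) before scoring
    n_clusters = len(set(labels[mask]))
    noise_ratio = 1.0 - mask.mean()
    # Silhouette needs at least two clusters to be defined.
    sil = (silhouette_score(X[mask], labels[mask], metric="cosine")
           if n_clusters >= 2 else float("nan"))
    return {"clusters": n_clusters, "noise_ratio": noise_ratio, "silhouette": sil}

# Toy example: two clean clusters plus one noise point.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]])
report = clustering_report(X, [0, 0, 1, 1, -1])
```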

## 🔥 Common Issues & Quick Fixes

<details>
<summary>Expand for troubleshooting tips</summary>

| Problem | Solution |
|---|---|
| All points are noise | Lower `similarity_threshold` significantly (try 0.5-0.6) |
| One giant cluster | Raise `similarity_threshold` (try 0.85-0.95) |
| Out of memory | Use HDBSCAN with PCA instead of DBSCAN |
| Invalid vector dimensions | Check that your CSV format matches the expected column structure |
| `hdbscan` import error | Run `pip install hdbscan` (may need a C compiler on some systems) |
| Slow clustering | Use HDBSCAN + PCA or reduce `n_pca_components` |

</details>

## 🆚 DBSCAN vs HDBSCAN: When to Use Which

| DBSCAN | HDBSCAN + PCA |
|---|---|
| Smaller datasets (< 10K points) | Large datasets (10K+ points) |
| You know a good similarity threshold | Varying cluster densities |
| Uniform cluster densities | Speed is important |
| Full dimensional analysis needed | Very high dimensions (3000+) |

๐Ÿ› ๏ธ Advanced: Using as a Library

from embedding_clustering_toolkit import (
    ClusteringConfig,
    EmbeddingDataLoader,
    DBSCANClusterer,
    ParameterSearcher
)

# Configure
config = ClusteringConfig(
    input_csv_path="my_embeddings.csv",
    vector_dimension=1536
)

# Load data
loader = EmbeddingDataLoader(config)
df, valid_df = loader.load()

# Find best parameters
searcher = ParameterSearcher(loader.get_vector_matrix())
results = searcher.search()
best = searcher.get_best_params(results)

# Cluster
clusterer = DBSCANClusterer(loader.get_vector_matrix())
labels = clusterer.fit(
    similarity_threshold=best['similarity_threshold'],
    min_samples=int(best['min_samples'])
)

## 🌟 Star This Repo

If this toolkit saved you from K-Means hell, give it a ⭐

Built with 🔥 because guessing cluster counts is a soul-crushing waste of time.

MIT © Yiğit Konur
