
Density-based clustering for vector embeddings using HDBSCAN and cosine similarity. Features automatic parameter search, PCA, and quality metrics without defining cluster counts.


yigitkonur/hdbscan-cluster-tool


# 🔬 Embedding Clustering Toolkit 🔬

> Stop guessing cluster counts. Start discovering natural groupings.

The ultimate clustering toolkit for high-dimensional text embeddings. It finds the natural structure in your data using DBSCAN and HDBSCAN, no magic numbers required.


Embedding Clustering Toolkit is the analysis partner your embeddings deserve. Stop arbitrarily picking "k=5" and praying your clusters make sense. This toolkit uses density-based algorithms that discover the natural groupings in your data, automatically identifying how many clusters exist and which points are just noise.

| 🎯 Auto Cluster Detection | 🔍 Parameter Search | 📊 Quality Metrics | 🗑️ Noise Handling |
|---|---|---|---|
| No predefined k needed | Find optimal thresholds | Silhouette scores built-in | Outliers isolated cleanly |

**How it slaps:**

- **You:** Load your OpenAI/Cohere/any embeddings CSV
- **Toolkit:** Searches parameters, finds natural clusters, isolates noise
- **You:** Export to Excel, visualize, analyze
- **Result:** Meaningful groupings without arbitrary decisions. Go grab a coffee. ☕

## 💥 Why This Slaps K-Means

Clustering embeddings with K-Means is like forcing your data into boxes it doesn't fit. Density-based clustering finds the boxes that actually exist.

โŒ The K-Means Way (Pain) โœ… The DBSCAN Way (Glory)
  1. Guess k=10. Run K-Means.
  2. Results look weird. Try k=15.
  3. Still bad. Maybe k=8?
  4. Elbow method says k=12. Sure, why not.
  5. Get clusters that mix unrelated items.
  1. Run the notebook.
  2. Algorithm finds 47 natural clusters.
  3. Outliers are flagged as noise.
  4. Silhouette score confirms quality.
  5. Export and ship. Done. ๐Ÿš€

We're not forcing structure. We're discovering structure with cosine similarity, density estimation, and automatic parameter optimization that processes your high-dimensional embeddings the right way.


## 🚀 Get Started in 60 Seconds

### Prerequisites

- Python 3.9+
- Your embeddings in CSV format

### Installation

```bash
# Clone the repository
git clone https://github.com/yigitkonur/embedding-clustering-toolkit.git
cd embedding-clustering-toolkit

# Install dependencies
pip install -r requirements.txt
```

### Quick Start with Jupyter Notebook

The recommended way to use this toolkit is through the interactive Jupyter notebook:

```bash
jupyter notebook embedding_clustering_toolkit.ipynb
```

The notebook provides:

- 📋 Configurable parameters at the top
- 📊 Interactive visualizations
- 🔍 Parameter search with visual results
- 💾 One-click export to Excel

### Quick Start with Scripts

If you prefer command-line scripts:

```bash
# 1. Edit the input path in classify.py
# 2. Run DBSCAN clustering
python classify.py

# Or run HDBSCAN with PCA
python classify_hdbscan.py

# Or find optimal parameters first
python sweet_spot_finder.py
```

## 🎮 Usage: Fire and Forget

### Using the Jupyter Notebook (Recommended)

**1. Configure Your Analysis**

```python
config = ClusteringConfig(
    input_csv_path="your_embeddings.csv",
    vector_dimension=3072,      # match your embedding model
    similarity_threshold=0.78,  # higher = tighter clusters
    min_samples=2,              # minimum points for a cluster
)
```

**2. Run All Cells**

The notebook walks you through:

  1. Loading and validating your embeddings
  2. Finding optimal parameters (optional but recommended)
  3. Running DBSCAN and/or HDBSCAN clustering
  4. Visualizing results
  5. Exporting to Excel

**3. Analyze Results**

```
📊 DBSCAN Results:
   ├─ Clusters: 47
   ├─ Noise points: 23 (4.2%)
   ├─ Clustered points: 527 (95.8%)
   ├─ Avg cluster size: 11.2
   └─ Silhouette score: 0.634
```

### CSV Format

Your CSV should have embeddings split across columns or in a single column:

**Split format (default):**

```
Name,1,2,3,4,5,6
"Document A","0.001,0.023,...","0.045,0.012,...","...",...
```

**Single column format:**

```
Name,embedding
"Document A","[0.001, 0.023, 0.045, ...]"
```
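Both layouts reduce to the same `(n_rows, n_dims)` matrix. The toolkit's own loader handles this for you; the helper functions below are an illustrative sketch, not its API:

```python
import io
import numpy as np
import pandas as pd

# Illustrative helpers (not the toolkit's API): parse both CSV layouts
# into one numeric matrix and confirm they agree.
split_csv = 'Name,1,2\n"Doc A","0.001,0.023","0.045,0.012"\n'
single_csv = 'Name,embedding\n"Doc A","[0.001, 0.023, 0.045, 0.012]"\n'

def parse_numbers(text):
    return [float(x) for x in text.strip("[] ").split(",")]

def load_split(csv_text):
    df = pd.read_csv(io.StringIO(csv_text))
    # Each numbered column holds a comma-separated slice of the full vector.
    rows = [",".join(map(str, row)) for row in df.drop(columns=["Name"]).values]
    return np.array([parse_numbers(r) for r in rows])

def load_single(csv_text):
    df = pd.read_csv(io.StringIO(csv_text))
    return np.array([parse_numbers(s) for s in df["embedding"]])
```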

## ✨ Feature Breakdown: The Secret Sauce

| Feature | What It Does | Why You Care |
|---|---|---|
| 🎯 **DBSCAN Clustering** (cosine similarity) | Groups embeddings by semantic similarity without predefined k | Natural clusters that actually make sense |
| ⚡ **HDBSCAN + PCA** (dimensionality reduction) | Reduces 3072D → 30D, then clusters | 10x faster on large datasets |
| 🔍 **Parameter Search** (grid search optimization) | Tests hundreds of threshold/min_samples combos | Find the "sweet spot" automatically |
| 📊 **Quality Metrics** (silhouette scoring) | Measures how well-separated your clusters are | Know if your clustering is actually good |
| 🗑️ **Noise Detection** (outlier isolation) | Flags points that don't belong anywhere | Clean clusters without forced assignments |
| 📈 **Visualizations** (PCA projections) | 2D scatter plots of your clusters | See the structure in your data |
| 💾 **Excel Export** (one-click output) | Sorted results with cluster IDs | Ready for downstream analysis |

โš™๏ธ Configuration & Customization

Key Parameters

Parameter Default Description
similarity_threshold 0.78 Cosine similarity cutoff (0-1). Higher = tighter clusters.
min_samples 2 Minimum points to form a cluster.
vector_dimension 3072 Expected embedding dimensions.
n_pca_components 30 PCA dimensions for HDBSCAN.

### Choosing Parameters

| If you get... | Try... |
|---|---|
| Too many tiny clusters | Lower `similarity_threshold` (e.g., 0.70) |
| Everything in one cluster | Raise `similarity_threshold` (e.g., 0.85) |
| Too much noise | Lower `min_samples` to 1 |
| Noisy clusters | Raise `min_samples` to 3-5 |

### Embedding Model Dimensions

| Model | Dimensions | Set `vector_dimension` to |
|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 3072 |
| OpenAI text-embedding-3-small | 1536 | 1536 |
| OpenAI text-embedding-ada-002 | 1536 | 1536 |
| Cohere embed-english-v3.0 | 1024 | 1024 |
| Voyage voyage-2 | 1024 | 1024 |
| Custom | varies | your dimension |

๐Ÿ“ Repository Structure

embedding-clustering-toolkit/
โ”œโ”€โ”€ ๐Ÿ““ embedding_clustering_toolkit.ipynb  # Interactive notebook (START HERE)
โ”œโ”€โ”€ ๐Ÿ“œ classify.py                         # DBSCAN clustering script
โ”œโ”€โ”€ ๐Ÿ“œ classify_hdbscan.py                 # HDBSCAN + PCA script
โ”œโ”€โ”€ ๐Ÿ“œ sweet_spot_finder.py                # Parameter optimization script
โ”œโ”€โ”€ ๐Ÿ“‹ sample.csv                          # Example embeddings data
โ”œโ”€โ”€ ๐Ÿ“‹ requirements.txt                    # Python dependencies
โ””โ”€โ”€ ๐Ÿ“– README.md                           # You are here

๐Ÿ” Understanding the Output

Cluster Labels

  • cluster >= 0: Assigned to a specific cluster
  • cluster = -1: Noise/outlier (doesn't fit any cluster)
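In the exported table, that convention makes noise easy to split off with a one-line filter; a quick pandas sketch (column name as described above, toy data):

```python
import pandas as pd

# Toy result frame with the "cluster" column described above.
df = pd.DataFrame({"Name": ["Doc A", "Doc B", "Doc C"], "cluster": [0, 0, -1]})

clustered = df[df["cluster"] >= 0]   # rows assigned to a cluster
noise = df[df["cluster"] == -1]      # outliers
noise_ratio = len(noise) / len(df)   # fraction of points left unassigned
```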

### Quality Metrics

| Metric | Good | Acceptable | Poor |
|---|---|---|---|
| Silhouette Score | > 0.5 | 0.25 - 0.5 | < 0.25 |
| Noise Ratio | < 10% | 10-30% | > 30% |
| Avg Cluster Size | > 5 | 3-5 | < 3 |
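These metrics are easy to recompute yourself from a label array. A sketch with scikit-learn (the function name is illustrative, not the toolkit's API; noise points are excluded before scoring since they are not a real cluster):

```python
import numpy as np
from sklearn.metrics import silhouette_score

def clustering_report(X, labels):
    """Illustrative re-computation of the metrics above (not the toolkit's API)."""
    labels = np.asarray(labels)
    mask = labels >= 0                      # drop noise (-1) before scoring
    n_clusters = len(set(labels[mask]))
    noise_ratio = 1.0 - mask.mean()
    # Silhouette needs at least two clusters to be defined.
    sil = (silhouette_score(X[mask], labels[mask], metric="cosine")
           if n_clusters >= 2 else float("nan"))
    return {"clusters": n_clusters, "noise_ratio": noise_ratio, "silhouette": sil}

# Toy example: two clean clusters plus one noise point.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [0.5, 0.5]])
report = clustering_report(X, [0, 0, 1, 1, -1])
```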

## 🔥 Common Issues & Quick Fixes

<details>
<summary>Expand for troubleshooting tips</summary>

| Problem | Solution |
|---|---|
| All points are noise | Lower `similarity_threshold` significantly (try 0.5-0.6) |
| One giant cluster | Raise `similarity_threshold` (try 0.85-0.95) |
| Out of memory | Use HDBSCAN with PCA instead of DBSCAN |
| Invalid vector dimensions | Check that your CSV format matches the expected column structure |
| `hdbscan` import error | Run `pip install hdbscan` (may need a C compiler on some systems) |
| Slow clustering | Use HDBSCAN + PCA or reduce `n_pca_components` |

</details>

## 🆚 DBSCAN vs HDBSCAN: When to Use Which

| DBSCAN | HDBSCAN + PCA |
|---|---|
| Smaller datasets (< 10K points) | Large datasets (10K+ points) |
| You know a good similarity threshold | Varying cluster densities |
| Uniform cluster densities | Speed is important |
| Full dimensional analysis needed | Very high dimensions (3000+) |

๐Ÿ› ๏ธ Advanced: Using as a Library

from embedding_clustering_toolkit import (
    ClusteringConfig,
    EmbeddingDataLoader,
    DBSCANClusterer,
    ParameterSearcher
)

# Configure
config = ClusteringConfig(
    input_csv_path="my_embeddings.csv",
    vector_dimension=1536
)

# Load data
loader = EmbeddingDataLoader(config)
df, valid_df = loader.load()

# Find best parameters
searcher = ParameterSearcher(loader.get_vector_matrix())
results = searcher.search()
best = searcher.get_best_params(results)

# Cluster
clusterer = DBSCANClusterer(loader.get_vector_matrix())
labels = clusterer.fit(
    similarity_threshold=best['similarity_threshold'],
    min_samples=int(best['min_samples'])
)

## 🌟 Star This Repo

If this toolkit saved you from K-Means hell, give it a ⭐

Built with 🔥 because guessing cluster counts is a soul-crushing waste of time.

MIT © Yiğit Konur
