The ultimate clustering toolkit for high-dimensional text embeddings. It finds the natural structure in your data using DBSCAN and HDBSCAN, no magic numbers required.
⚡ Get Started • ✨ Key Features • 🎮 Usage & Examples • ⚙️ Configuration • 🔬 Why This Works
Embedding Clustering Toolkit is the analysis partner your embeddings deserve. Stop arbitrarily picking "k=5" and praying your clusters make sense. This toolkit uses density-based algorithms that discover the natural groupings in your data, automatically identifying how many clusters exist and which points are just noise.
| Auto Cluster Detection | Parameter Search | Quality Metrics | Noise Handling |
|---|---|---|---|
| No predefined k needed | Find optimal thresholds | Silhouette scores built-in | Outliers isolated cleanly |
How it slaps:
- You: Load your OpenAI/Cohere/any embeddings CSV
- Toolkit: Searches parameters, finds natural clusters, isolates noise
- You: Export to Excel, visualize, analyze
- Result: Meaningful groupings without arbitrary decisions. Go grab a coffee. ☕
Clustering embeddings with K-Means is like forcing your data into boxes it doesn't fit. Density-based clustering finds the boxes that actually exist.
| ❌ The K-Means Way (Pain) | ✅ The DBSCAN Way (Glory) |
|---|---|
| Guess k up front and pray | Cluster count discovered automatically |
| Every point forced into a box | Outliers cleanly isolated as noise |
We're not forcing structure. We're discovering structure with cosine similarity, density estimation, and automatic parameter optimization that processes your high-dimensional embeddings the right way.
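To see the core idea outside the toolkit, here is a minimal sketch in plain scikit-learn; the stand-in data and variable names are illustrative, not this toolkit's API. The trick is running DBSCAN over cosine distance with `eps = 1 - similarity_threshold`, which turns a similarity cutoff into a density radius:

```python
# Minimal sketch of the core idea (illustrative, not the toolkit's API):
# DBSCAN over cosine distance, with eps = 1 - similarity_threshold.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 3072))   # stand-in for real embeddings

similarity_threshold = 0.78                 # cosine similarity cutoff
labels = DBSCAN(
    eps=1.0 - similarity_threshold,         # cosine distance radius
    min_samples=2,
    metric="cosine",
).fit_predict(embeddings)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters={n_clusters}, noise={int((labels == -1).sum())}")
```

On random vectors like these, everything lands in noise; on real embeddings, dense semantic neighborhoods survive the cutoff and become clusters.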
- Python 3.9+
- Your embeddings in CSV format
```bash
# Clone the repository
git clone https://github.com/yigitkonur/embedding-clustering-toolkit.git
cd embedding-clustering-toolkit

# Install dependencies
pip install -r requirements.txt
```

The recommended way to use this toolkit is through the interactive Jupyter notebook:
```bash
# Launch Jupyter
jupyter notebook embedding_clustering_toolkit.ipynb
```

The notebook provides:
- 📝 Configurable parameters at the top
- 📊 Interactive visualizations
- 🔍 Parameter search with visual results
- 💾 One-click export to Excel
If you prefer command-line scripts:
```bash
# 1. Edit the input path in classify.py
# 2. Run DBSCAN clustering
python classify.py

# Or run HDBSCAN with PCA
python classify_hdbscan.py

# Or find optimal parameters first
python sweet_spot_finder.py
```

**1. Configure Your Analysis**
```python
config = ClusteringConfig(
    input_csv_path="your_embeddings.csv",
    vector_dimension=3072,      # Match your embedding model
    similarity_threshold=0.78,  # Higher = tighter clusters
    min_samples=2,              # Minimum points for a cluster
)
```

**2. Run All Cells**
The notebook walks you through:
- Loading and validating your embeddings
- Finding optimal parameters (optional but recommended)
- Running DBSCAN and/or HDBSCAN clustering
- Visualizing results
- Exporting to Excel
**3. Analyze Results**
```
📊 DBSCAN Results:
├─ Clusters: 47
├─ Noise points: 23 (4.2%)
├─ Clustered points: 527 (95.8%)
├─ Avg cluster size: 11.2
└─ Silhouette score: 0.634
```
Your CSV should have embeddings split across columns or in a single column:

Split format (default):

```csv
Name,1,2,3,4,5,6
"Document A","0.001,0.023,...","0.045,0.012,...","...",...
```

Single column format:

```csv
Name,embedding
"Document A","[0.001, 0.023, 0.045, ...]"
```
| Feature | What It Does | Why You Care |
|---|---|---|
| 🎯 **DBSCAN Clustering**<br>Cosine similarity | Groups embeddings by semantic similarity without a predefined k | Natural clusters that actually make sense |
| ⚡ **HDBSCAN + PCA**<br>Dimensionality reduction | Reduces 3072D → 30D, then clusters | 10x faster on large datasets |
| 🔍 **Parameter Search**<br>Grid search optimization | Tests hundreds of threshold/min_samples combos | Finds the "sweet spot" automatically |
| 📊 **Quality Metrics**<br>Silhouette scoring | Measures how well-separated your clusters are | Know if your clustering is actually good |
| 🗑️ **Noise Detection**<br>Outlier isolation | Flags points that don't belong anywhere | Clean clusters without forced assignments |
| 📈 **Visualizations**<br>PCA projections | 2D scatter plots of your clusters | See the structure in your data |
| 💾 **Excel Export**<br>One-click output | Sorted results with cluster IDs | Ready for downstream analysis |
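The HDBSCAN + PCA row above boils down to two library calls. A hedged sketch, assuming the `hdbscan` package from requirements.txt, with stand-in data (only the 30-component and min-size values come from this README's documented defaults; everything else is illustrative):

```python
# Sketch of the HDBSCAN + PCA path (stand-in data; not the toolkit's code).
import hdbscan
import numpy as np
from sklearn.decomposition import PCA

embeddings = np.random.default_rng(0).normal(size=(1000, 3072))  # stand-in

reduced = PCA(n_components=30).fit_transform(embeddings)  # 3072D -> 30D
labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(reduced)
print(f"clusters found: {labels.max() + 1}, noise: {(labels == -1).sum()}")
```

Clustering 30 PCA components instead of raw 3072-dimensional vectors is where the speedup in the table comes from.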
| Parameter | Default | Description |
|---|---|---|
| `similarity_threshold` | `0.78` | Cosine similarity cutoff (0-1). Higher = tighter clusters. |
| `min_samples` | `2` | Minimum points to form a cluster. |
| `vector_dimension` | `3072` | Expected embedding dimensions. |
| `n_pca_components` | `30` | PCA dimensions for HDBSCAN. |
| If you get... | Try... |
|---|---|
| Too many tiny clusters | Lower `similarity_threshold` (e.g., 0.70) |
| Everything in one cluster | Raise `similarity_threshold` (e.g., 0.85) |
| Too much noise | Lower `min_samples` to 1 |
| Noisy clusters | Raise `min_samples` to 3-5 |
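If you'd rather see the dial-turning as code, here is an illustrative grid sweep in plain scikit-learn. The toolkit's sweet_spot_finder.py and `ParameterSearcher` automate this; the function, ranges, and variable names below are assumptions for the sketch:

```python
# Illustrative sweep over (similarity_threshold, min_samples); the same idea
# as the toolkit's parameter search, written against scikit-learn directly.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

def sweep(embeddings):
    best = None
    for thr in np.arange(0.70, 0.91, 0.02):
        for min_samples in (2, 3, 5):
            labels = DBSCAN(eps=1.0 - thr, min_samples=min_samples,
                            metric="cosine").fit_predict(embeddings)
            mask = labels != -1                  # score clustered points only
            if len(set(labels[mask])) < 2:
                continue                         # silhouette needs >= 2 clusters
            score = silhouette_score(embeddings[mask], labels[mask],
                                     metric="cosine")
            if best is None or score > best[0]:
                best = (score, float(thr), min_samples)
    return best  # (silhouette, similarity_threshold, min_samples) or None
```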
| Model | Dimensions | Set `vector_dimension` to |
|---|---|---|
| OpenAI text-embedding-3-large | 3072 | 3072 |
| OpenAI text-embedding-3-small | 1536 | 1536 |
| OpenAI text-embedding-ada-002 | 1536 | 1536 |
| Cohere embed-english-v3.0 | 1024 | 1024 |
| Voyage voyage-2 | 1024 | 1024 |
| Custom | varies | your dimension |
```
embedding-clustering-toolkit/
├── 📓 embedding_clustering_toolkit.ipynb  # Interactive notebook (START HERE)
├── 🐍 classify.py                         # DBSCAN clustering script
├── 🐍 classify_hdbscan.py                 # HDBSCAN + PCA script
├── 🐍 sweet_spot_finder.py                # Parameter optimization script
├── 📊 sample.csv                          # Example embeddings data
├── 📋 requirements.txt                    # Python dependencies
└── 📖 README.md                           # You are here
```
- `cluster >= 0`: Assigned to a specific cluster
- `cluster = -1`: Noise/outlier (doesn't fit any cluster)
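Assuming the exported sheet keeps these labels in a column named `cluster` (the column and file names here are assumptions, match your export), splitting signal from noise in pandas is one line each:

```python
# Hypothetical: load the exported sheet and split on the cluster label.
import pandas as pd

df = pd.read_excel("clusters.xlsx")  # file name assumed; match your export
clustered = df[df["cluster"] >= 0]   # points assigned to a cluster
noise = df[df["cluster"] == -1]      # outliers kept out of every cluster
```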
| Metric | Good | Acceptable | Poor |
|---|---|---|---|
| Silhouette Score | > 0.5 | 0.25 - 0.5 | < 0.25 |
| Noise Ratio | < 10% | 10-30% | > 30% |
| Avg Cluster Size | > 5 | 3-5 | < 3 |
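A hedged sketch of computing these three numbers from a label array (the function name and structure are mine; silhouette comes from scikit-learn):

```python
# Illustrative health check for a clustering run (not part of the toolkit).
import numpy as np
from sklearn.metrics import silhouette_score

def cluster_health(embeddings, labels):
    mask = labels != -1                                   # drop noise points
    sizes = np.bincount(labels[mask]) if mask.any() else np.array([0])
    n_clusters = int((sizes > 0).sum())
    return {
        "noise_ratio": float(1.0 - mask.mean()),
        "avg_cluster_size": float(sizes[sizes > 0].mean()) if n_clusters else 0.0,
        "silhouette": float(silhouette_score(
            embeddings[mask], labels[mask], metric="cosine"))
            if n_clusters >= 2 else float("nan"),
    }
```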
Troubleshooting tips:
| Problem | Solution |
|---|---|
| All points are noise | Lower `similarity_threshold` significantly (try 0.5-0.6) |
| One giant cluster | Raise `similarity_threshold` (try 0.85-0.95) |
| Out of memory | Use HDBSCAN with PCA instead of DBSCAN |
| Invalid vector dimensions | Check that your CSV format matches the expected column structure |
| `hdbscan` import error | Run `pip install hdbscan` (may need a C compiler on some systems) |
| Slow clustering | Use HDBSCAN + PCA or reduce `n_pca_components` |
| DBSCAN | HDBSCAN + PCA |
|---|---|
| Cosine similarity on full-dimension vectors | Reduces dimensions (e.g., 3072D → 30D) before clustering |
| Best for small-to-medium datasets | 10x faster on large datasets |
```python
from embedding_clustering_toolkit import (
    ClusteringConfig,
    EmbeddingDataLoader,
    DBSCANClusterer,
    ParameterSearcher,
)

# Configure
config = ClusteringConfig(
    input_csv_path="my_embeddings.csv",
    vector_dimension=1536,
)

# Load data
loader = EmbeddingDataLoader(config)
df, valid_df = loader.load()

# Find best parameters
searcher = ParameterSearcher(loader.get_vector_matrix())
results = searcher.search()
best = searcher.get_best_params(results)

# Cluster
clusterer = DBSCANClusterer(loader.get_vector_matrix())
labels = clusterer.fit(
    similarity_threshold=best['similarity_threshold'],
    min_samples=int(best['min_samples']),
)
```

If this toolkit saved you from K-Means hell, give it a ⭐
Built with 🔥 because guessing cluster counts is a soul-crushing waste of time.
MIT © Yiğit Konur