This is the source code that accompanies the following series of blog articles:
- Word Embeddings and Document Vectors: Part 1. Similarity
- Word Embeddings and Document Vectors: Part 2. Classification

The code uses:
- Elasticsearch (localhost:9200) as the repository
  - to save tokens to, and retrieve them as needed
  - to save word vectors (pre-trained or custom) to, and retrieve them as needed
- Python; see the Pipfile for dependencies
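
The save-and-retrieve role that Elasticsearch plays here can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the index name `tokens-example`, the `tokens` field, and the helper names are hypothetical, and it assumes an unsecured local node running Elasticsearch 7 or later (typeless `_bulk` and `_doc` endpoints).

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"   # host and port as stated above
INDEX = "tokens-example"           # hypothetical index name


def build_bulk_body(docs, index=INDEX):
    """Serialize {doc_id: [token, ...]} into the NDJSON body used by _bulk."""
    lines = []
    for doc_id, tokens in docs.items():
        # Each document is an action line followed by a source line.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps({"tokens": tokens}))  # hypothetical field name
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


def save_tokens(docs, index=INDEX, es_url=ES_URL):
    """POST the tokens to Elasticsearch (requires a running node)."""
    request = urllib.request.Request(
        f"{es_url}/_bulk",
        data=build_bulk_body(docs, index).encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def get_tokens(doc_id, index=INDEX, es_url=ES_URL):
    """Fetch the tokens of one document by id (requires a running node)."""
    with urllib.request.urlopen(f"{es_url}/{index}/_doc/{doc_id}") as response:
        return json.load(response)["_source"]["tokens"]
```

Word vectors could be stored the same way, with a list of floats in place of the list of tokens.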

1. Generate tokens for the 20-news corpus and the movie review dataset, and save them to Elasticsearch.
   - The 20-news dataset is downloaded as part of the script, but you need to download the movie review dataset separately.
   - The shell scripts and Python code are in the folders text-data/twenty-news and text-data/acl-imdb.
2. Generate custom word vectors for the two text corpora from step 1 and save them to Elasticsearch. The scripts are in the text-data/twenty-news/vectors and text-data/acl-imdb/vectors directories.
3. Process pre-trained vectors and save them to Elasticsearch. The code is in pre-trained-vectors/. You need to download the published vectors from their sources. These articles use Word2Vec, GloVe and FastText.
4. The script run.sh can be configured to run any combination of the pipeline steps.
5. The logs contain the F-scores and timing results. Create a "logs" directory before running the run.sh script (mkdir logs).
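
As an illustration of the pre-trained vector step, here is a minimal sketch of reading vectors from a plain-text file before saving them. It assumes the common text layout of one word per line followed by space-separated floats, with an optional `<count> <dimension>` header line as found in word2vec and fastText text files. The function name is hypothetical, and the code in pre-trained-vectors/ may work differently (for example, some published vectors are distributed in a binary format).

```python
import io


def read_text_vectors(lines):
    """Yield (word, vector) pairs from an iterable of text lines."""
    for line_no, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        # Skip an optional "<count> <dimension>" header on the first line.
        if line_no == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        # Skip blank or malformed lines that carry no vector components.
        if len(parts) < 2:
            continue
        yield parts[0], [float(x) for x in parts[1:]]


# Usage with a tiny in-memory sample instead of a downloaded file.
sample = io.StringIO("2 3\nthe 0.1 0.2 0.3\ncat -0.4 0.5 0.6\n")
vectors = dict(read_text_vectors(sample))
```

Reading lazily with a generator keeps memory use low, which matters because the published vector files are large.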