Science Concierge is a Python repository for content-based recommendation based on Latent Semantic Analysis (LSA) topic distance and the Rocchio algorithm. It is the backend algorithm for Scholarfy (www.scholarfy.net), an automatic scheduler for conferences.
See the full article on PLOS ONE or arXiv, or the full TeX manuscript and presentation here. You can also try the scaled-up version of Scholarfy, covering 14.3M articles from PubMed, at pubmed.scholarfy.net.
First, clone the repository.
$ git clone https://github.com/titipata/science_concierge
Install dependencies using pip,
$ pip install -r requirements.txt
Install the library using setup.py,
$ python setup.py develop install
We provide example CSV files from the PubMed Open Access Subset that you can download and
play with (we parsed them using pubmed_parser).
Each file contains the columns pmc, pmid, title, abstract, and publication_year.
Use the download function to download the example data,
import science_concierge
science_concierge.download(['pubmed_oa_2015.csv', 'pubmed_oa_2016.csv'])
We provide pubmed_oa_{year}.csv for {year} = 2007, ..., 2016 (note that the 2007 file
contains all publications before 2008). Alternatively, use awscli to download the data,
$ aws s3 cp s3://science-of-science-bucket/science_concierge/data/ . --recursive
You can build a quick recommendation model by importing the ScienceConcierge class
and using the fit method to fit a list of documents. Then use recommend to recommend
documents based on liked or disliked documents.
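To give a feel for what "recommend based on liked or disliked documents" means, here is a minimal, standard-library sketch of a Rocchio-style update over document topic vectors. It is an illustration under stated assumptions, not the library's implementation: the function names, the weights w_like and w_dislike, the use of cosine similarity for ranking, and the toy 2-D vectors are all made up for this example. In practice the vectors would come from LSA.

```python
import math

def rocchio_query(vectors, likes, dislikes, w_like=1.0, w_dislike=0.2):
    """Build a preference vector: weighted mean of liked vectors
    minus weighted mean of disliked vectors (weights are assumptions)."""
    dim = len(vectors[0])
    query = [0.0] * dim
    for indices, weight in ((likes, w_like), (dislikes, -w_dislike)):
        if not indices:
            continue  # nothing to add for an empty like/dislike list
        for k in range(dim):
            mean_k = sum(vectors[i][k] for i in indices) / len(indices)
            query[k] += weight * mean_k
    return query

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recommend(vectors, likes, dislikes=(), n_recommend=3):
    """Rank documents not yet voted on by similarity to the preference vector."""
    query = rocchio_query(vectors, likes, dislikes)
    seen = set(likes) | set(dislikes)
    candidates = [i for i in range(len(vectors)) if i not in seen]
    candidates.sort(key=lambda i: cosine(query, vectors[i]), reverse=True)
    return candidates[:n_recommend]

# Toy 2-D "topic" vectors, invented purely for illustration
toy_vectors = [
    [1.0, 0.0],  # document 0
    [0.9, 0.1],  # document 1
    [0.0, 1.0],  # document 2
    [0.1, 0.9],  # document 3
    [0.7, 0.3],  # document 4
]
ranking = recommend(toy_vectors, likes=[0], dislikes=[2])
print(ranking)
```

Liking document 0 and disliking document 2 pulls the preference vector toward the first topic axis, so documents close to document 0 rank first. With the library itself, the fit and recommend steps look like this: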
import pandas as pd
from science_concierge import ScienceConcierge
df = pd.read_csv('data/pubmed_oa_2016.csv', encoding='utf-8')
docs = list(df.abstract) # provide list of abstracts
titles = list(df.title) # titles
# select weighting from 'count', 'tfidf', or 'entropy'
recommend_model = ScienceConcierge(stemming=True, ngram_range=(1,1),
                                   weighting='entropy', norm=None,
                                   n_components=200, n_recommend=200,
                                   verbose=True)
recommend_model.fit(docs) # input list of documents or abstracts
# input lists of liked/disliked document indices (here we like titles[10000])
index = recommend_model.recommend(likes=[10000], dislikes=[])
docs_recommend = [titles[i] for i in index[0:10]] # recommended documents
We also provide add-on vectorizer classes, including LogEntropyVectorizer and
BM25Vectorizer, for calculating document-term weighting from an input
list of documents. Here is an example usage.
from science_concierge import LogEntropyVectorizer
l_model = LogEntropyVectorizer(norm=None, ngram_range=(1,2),
                               stop_words='english', min_df=1, max_df=0.8)
X = l_model.fit_transform(docs) # where docs is list of documents
When we already have a sparse document matrix like this,
we can use the fit_document_matrix method directly.
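For intuition about what a log-entropy weighted matrix contains, the sketch below computes the textbook log-entropy weighting in plain Python: a local weight log(1 + tf) multiplied by a global weight 1 + sum_j p_j log(p_j) / log(n), where p_j is the share of a term's total count that falls in document j and n is the number of documents. This is an assumption about the general technique only; LogEntropyVectorizer may differ in details such as normalization or smoothing, and the function name and toy counts here are hypothetical.

```python
import math

def log_entropy_weights(counts):
    """counts: list of documents, each a dict mapping term -> raw count.
    Returns a list of dicts mapping term -> log-entropy weight."""
    n_docs = len(counts)
    if n_docs < 2:
        raise ValueError("log-entropy weighting needs at least two documents")

    # Total count of each term across the whole collection
    global_freq = {}
    for doc in counts:
        for term, tf in doc.items():
            global_freq[term] = global_freq.get(term, 0) + tf

    # Accumulate sum_j p_j * log(p_j) for each term
    entropy_sum = {term: 0.0 for term in global_freq}
    for doc in counts:
        for term, tf in doc.items():
            p = tf / global_freq[term]
            entropy_sum[term] += p * math.log(p)

    # Global weight: 1 for a term found in one document only,
    # 0 for a term spread evenly over all documents
    global_weight = {
        term: 1.0 + s / math.log(n_docs) for term, s in entropy_sum.items()
    }

    return [
        {term: math.log(1 + tf) * global_weight[term] for term, tf in doc.items()}
        for doc in counts
    ]

# Toy term counts, invented for illustration
toy_counts = [
    {"gene": 2, "cell": 1},
    {"cell": 1},
    {"cell": 1, "protein": 3},
]
weights = log_entropy_weights(toy_counts)
print(weights[0])
```

In the toy data, "cell" occurs equally in every document, so its weight collapses to roughly zero, while "gene" and "protein" each occur in a single document and keep their full local weight. The vectorizer's sparse output X is passed to the library as follows: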
recommend_model = ScienceConcierge(n_components=200, n_recommend=200)
recommend_model.fit_document_matrix(X)
index = recommend_model.recommend(likes=[10000], dislikes=[])
The library depends on the following packages:
- numpy
- pandas
- unidecode
- nltk with white space tokenizer and Porter stemmer;
use science_concierge.download_nltk() to download the required corpora (there is a stemmer bug in nltk==3.2.2)
- scikit-learn
- cachetools
- joblib
Copyright (c) 2015 Titipat Achakulvisut, Daniel E. Acuna, Tulakan Ruangrong, Konrad Kording
