Document classification

Details

The following steps are taken to train a classifier on the collected data:

Create the vocabulary V, the set of all distinct words occurring in any of the articles, and save it to the database.
Divide the articles over five target classes according to their number of comments (from 'very low' to 'very high').
Each class should contain an approximately equal number of articles to help keep the datasets balanced.
Create a feature vector for each article and save it to the database.
The i-th element in article a's feature vector contains the number of occurrences in a of the i-th word in V.
Train a multinomial Naive Bayes classifier on the dataset and evaluate it using cross-validation.
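
The data preparation steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the project's actual code: the toy articles, the whitespace tokenizer, the rank-based binning, and all function names (tokenize, build_vocabulary, feature_vector, assign_classes) are hypothetical, and the database storage is left out.

```python
from collections import Counter

# Hypothetical toy data: (article text, number of comments).
ARTICLES = [
    ("the cat sat on the mat", 0),
    ("the dog chased the cat", 3),
    ("markets fell sharply today", 12),
    ("dog bites man", 40),
    ("man bites dog again", 150),
]

CLASS_LABELS = ["very low", "low", "medium", "high", "very high"]


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real articles
    # would likely need punctuation handling as well.
    return text.lower().split()


def build_vocabulary(texts):
    # V as a sorted list of distinct words, so every word has a fixed index i.
    return sorted({word for text in texts for word in tokenize(text)})


def feature_vector(text, vocabulary):
    # Element i is the number of occurrences of vocabulary[i] in the text.
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def assign_classes(comment_counts, n_classes=5):
    # Equal-frequency binning: rank the articles by comment count and cut
    # the ranking into n_classes groups of roughly equal size.
    # Note: articles with identical counts may land in different classes.
    n = len(comment_counts)
    order = sorted(range(n), key=lambda i: comment_counts[i])
    classes = [0] * n
    for rank, idx in enumerate(order):
        classes[idx] = rank * n_classes // n
    return classes


texts = [text for text, _ in ARTICLES]
vocabulary = build_vocabulary(texts)
X = [feature_vector(text, vocabulary) for text in texts]
y = assign_classes([count for _, count in ARTICLES])

for text, label in zip(texts, y):
    print(CLASS_LABELS[label], "<-", text)
```

The resulting X (word counts) and y (class indices) are in the shape a multinomial Naive Bayes implementation expects. One possible choice, which the steps above do not prescribe, is scikit-learn's MultinomialNB evaluated with cross_val_score.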