
Document classification

Selman Ercan edited this page Jun 8, 2016 · 18 revisions

Details

The following steps are taken to train a classifier on the collected data:

  • Create the vocabulary V, the set of all words occurring in any article, and save it to the database.
  • Divide the articles over five target classes according to their number of comments (from 'very low' to 'very high').
    Each class should contain approximately the same number of articles, which helps keep the dataset balanced.
  • Create a feature vector for every article and save it to the database.
    The i-th element of article a's feature vector is the number of times the i-th word in V occurs in a.
  • Train a multinomial Naive Bayes classifier on the dataset and evaluate it using cross-validation.
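
The steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: every function name is hypothetical, tokenization is assumed to be lowercase whitespace splitting, the database steps are left out, the classes are formed by rank (equal-frequency binning), and the classifier uses Laplace smoothing with alpha = 1.

```python
import math
from collections import Counter


def build_vocabulary(articles):
    """Return a sorted list of all distinct words across the articles."""
    return sorted({word for text in articles for word in text.lower().split()})


def to_feature_vector(text, vocabulary):
    """Element i is the number of occurrences of vocabulary[i] in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def equal_frequency_labels(comment_counts, n_classes=5):
    """Assign classes 0..n_classes-1 by rank so classes are about equally sized."""
    order = sorted(range(len(comment_counts)), key=lambda i: comment_counts[i])
    labels = [0] * len(comment_counts)
    for rank, i in enumerate(order):
        labels[i] = rank * n_classes // len(comment_counts)
    return labels


def train_multinomial_nb(vectors, labels, alpha=1.0):
    """Estimate a log prior and smoothed log word likelihoods for each class."""
    n_words = len(vectors[0])
    model = {}
    for c in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        word_totals = [sum(column) for column in zip(*rows)]
        denominator = sum(word_totals) + alpha * n_words
        model[c] = (
            math.log(len(rows) / len(vectors)),
            [math.log((t + alpha) / denominator) for t in word_totals],
        )
    return model


def predict(model, vector):
    """Return the class with the highest log posterior for the vector."""
    def score(c):
        log_prior, log_likelihood = model[c]
        return log_prior + sum(x * w for x, w in zip(vector, log_likelihood))
    return max(model, key=score)


def cross_validate(vectors, labels, k=5):
    """Return mean accuracy over k folds; fold f holds out items with i % k == f."""
    accuracies = []
    for f in range(k):
        train = [i for i in range(len(vectors)) if i % k != f]
        test = [i for i in range(len(vectors)) if i % k == f]
        model = train_multinomial_nb(
            [vectors[i] for i in train], [labels[i] for i in train]
        )
        hits = sum(predict(model, vectors[i]) == labels[i] for i in test)
        accuracies.append(hits / len(test))
    return sum(accuracies) / k
```

Binning by rank keeps the five classes the same size, but articles with identical comment counts can end up in neighbouring classes. Binning on quantile thresholds avoids that at the cost of slightly uneven class sizes.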

Techniques

Resources
