Document classification
Selman Ercan edited this page Jun 8, 2016
·
18 revisions
The following steps are taken to train a classifier on the collected data:
- Create the vocabulary V, the set of all words occurring in any article, and save it to the database.
- Discretize the number of comments per article into five target classes (from 'very low' to 'very high').
  Each class should contain an approximately equal number of articles to help keep the dataset balanced.
- Create feature vectors for all articles and save them to the database.
  The i-th element in article a's feature vector contains the number of occurrences in a of the i-th word in V.
- Train a multinomial Naive Bayes classifier on the dataset and evaluate it using cross-validation.
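
The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the project's actual code: all function names are hypothetical, articles are assumed to be already tokenized into lists of words, database persistence is left out, the class boundaries are assumed to come from ranking articles by comment count, and the classifier uses Laplace smoothing (`alpha=1.0`) and a simple interleaved k-fold split as assumptions.

```python
import math
from collections import Counter


def build_vocabulary(articles):
    """Return the sorted list of all distinct words across the tokenized articles."""
    return sorted({word for words in articles for word in words})


def feature_vector(words, vocabulary):
    """Element i is the number of occurrences of vocabulary[i] in the article."""
    counts = Counter(words)
    return [counts[word] for word in vocabulary]


def equal_frequency_classes(comment_counts, n_classes=5):
    """Label each article 0..n_classes-1 by the rank of its comment count, so the
    classes hold roughly equal numbers of articles. Note: articles with tied
    comment counts may end up in adjacent classes."""
    n = len(comment_counts)
    order = sorted(range(n), key=lambda i: comment_counts[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * n_classes // n
    return labels


def train_multinomial_nb(vectors, labels, alpha=1.0):
    """Estimate log priors and smoothed per-class log word probabilities."""
    n_words = len(vectors[0])
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(labels)):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        log_prior[c] = math.log(len(rows) / len(vectors))
        word_totals = [sum(column) for column in zip(*rows)]
        denominator = sum(word_totals) + alpha * n_words
        log_likelihood[c] = [math.log((t + alpha) / denominator) for t in word_totals]
    return log_prior, log_likelihood


def predict(model, vector):
    """Return the class with the highest posterior log score for one vector."""
    log_prior, log_likelihood = model

    def score(c):
        return log_prior[c] + sum(x * w for x, w in zip(vector, log_likelihood[c]))

    return max(log_prior, key=score)


def cross_validate(vectors, labels, k=5):
    """Mean accuracy over k folds; fold f holds out samples whose index % k == f.
    Assumes at least k samples."""
    accuracies = []
    for f in range(k):
        train = [i for i in range(len(vectors)) if i % k != f]
        test = [i for i in range(len(vectors)) if i % k == f]
        model = train_multinomial_nb([vectors[i] for i in train],
                                     [labels[i] for i in train])
        correct = sum(predict(model, vectors[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k
```

In practice, a library such as scikit-learn provides equivalent building blocks (`CountVectorizer`, `MultinomialNB`, `cross_val_score`); the hand-written version is shown only to make each step explicit.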