Getting Started
Open a terminal. Move to the directory you want to contain the Fieldspring directory, then clone the repository:
git clone https://github.com/utcompling/fieldspring.git
Set the environment variable FIELDSPRING_DIR to point to Fieldspring's directory, and add FIELDSPRING_DIR/bin to your PATH.
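As a minimal sketch, assuming a Bourne-style shell such as bash and assuming the repository was cloned into $HOME/fieldspring (a placeholder location; substitute wherever you actually ran the clone), the two settings could look like this:

```shell
# Assumed clone location; replace with the directory created by `git clone`.
export FIELDSPRING_DIR="$HOME/fieldspring"
# Prepend Fieldspring's bin directory so the `fieldspring` launcher is found on PATH.
export PATH="$FIELDSPRING_DIR/bin:$PATH"
```

Adding the same two lines to a shell startup file such as ~/.bashrc should make them persist across sessions.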
Compile Fieldspring like this:
./build update compile
Download the version of the GeoNames gazetteer we used from the following location:
ADD URL HERE
Once you've obtained the correct allCountries.zip, import the gazetteer for use with Fieldspring by running this from FIELDSPRING_DIR:
fieldspring --memory 8g import-gazetteer -i data/gazetteers/allCountries.zip -o geonames-1dpc.ser.gz -dkm
Download the TR-CoNLL corpus from the following location:
ADD URL HERE
ADD INSTRUCTIONS FOR SPLITTING INTO DEV AND TEST
Now you should have a directory (we'll call it /path/to/trconll/xml/) containing the TR-CoNLL corpus in XML format, with the subdirectories dev/ and test/ for each split. To import the test portion to be used with Fieldspring, run this from FIELDSPRING_DIR:
fieldspring --memory 8g import-corpus -i /path/to/trconll/xml/test/ -cf tr -gt -sg geonames-1dpc.ser.gz -sco trftest-gt-g1dpc.ser.gz
You should see output that includes this:
Number of word tokens: 67572
Number of word types: 11241
Number of toponym tokens: 1903
Number of toponym types: 440
Average ambiguity (locations per toponym): 13.68891224382554
Maximum ambiguity (locations per toponym): 857
Serializing corpus to trftest-gt-g1dpc.ser.gz ...done.
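One way to confirm that the import matched the figures above is to save the command's output to a file and search it for the expected lines. The sketch below assumes the output was captured to a log file (for example by appending `2>&1 | tee import.log` to the import command). The function name check_import_log and the log file name are illustrative and not part of Fieldspring.

```shell
# Hypothetical helper: succeed only if every expected statistic appears in the log.
check_import_log() {
    log_file="$1"
    for expected in \
        "Number of word tokens: 67572" \
        "Number of word types: 11241" \
        "Number of toponym tokens: 1903" \
        "Number of toponym types: 440"
    do
        # -F treats the pattern as a fixed string (substring match), -q suppresses output.
        if ! grep -Fq "$expected" "$log_file"; then
            echo "missing: $expected" >&2
            return 1
        fi
    done
    echo "import statistics match"
}
```

It would be called as `check_import_log import.log`. A mismatch most likely means a different gazetteer version or corpus split was used.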
This gives you the version of the corpus with gold toponym identifications. Run this to get the version with NER-identified toponyms:
fieldspring --memory 8g import-corpus -i /path/to/trconll/xml/test/ -cf tr -sg geonames-1dpc.ser.gz -sco trftest-ner-g1dpc.ser.gz
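Since the gold and NER imports of TR-CoNLL differ only in the -gt flag and the output file name, both can be generated from one loop. The following is a sketch: the variable names and the import_cmd helper are illustrative, and it only prints the commands so they can be inspected before anything is run.

```shell
# Paths copied from the TR-CoNLL import commands; adjust to your setup.
CORPUS_DIR="/path/to/trconll/xml/test/"
GAZETTEER="geonames-1dpc.ser.gz"

# Hypothetical helper: print the import command for one variant ("gt" or "ner").
import_cmd() {
    variant="$1"
    gold_flag=""
    # Only the gold-toponym variant passes -gt.
    [ "$variant" = "gt" ] && gold_flag=" -gt"
    echo "fieldspring --memory 8g import-corpus -i $CORPUS_DIR -cf tr$gold_flag -sg $GAZETTEER -sco trftest-$variant-g1dpc.ser.gz"
}

# Print both commands for review.
for variant in gt ner; do
    import_cmd "$variant"
done
```

After review, a printed command can be executed from FIELDSPRING_DIR, for instance by piping it to `sh`.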
ADD HOW TO GET THE ORIGINAL CWAR CORPUS HERE
ADD HOW TO CONVERT IT TO THE RIGHT XML FORMAT, GIVEN THE KML FILE, HERE
Once you have the CWar corpus in the correct format in a directory (we'll call it /path/to/cwar/xml/) with subdirectories dev/ and test/ for each split, import the test portion by running this from FIELDSPRING_DIR:
fieldspring --memory 30g import-corpus -i /path/to/cwar/xml/test/ -cf tr -gt -sg geonames-1dpc.ser.gz -sco cwartest-gt-g1dpc-20spd.ser.gz -spd 20
This gives you the version of the corpus with gold toponym identifications. Run this to get the version with NER-identified toponyms:
fieldspring --memory 30g import-corpus -i /path/to/cwar/xml/test/ -cf tr -sg geonames-1dpc.ser.gz -sco cwartest-ner-g1dpc-20spd.ser.gz -spd 20
SAY WHERE TO DOWNLOAD enwiki-20130102-pages-articles.xml.bz2
SAY HOW TO RUN BEN'S PREPROC SCRIPT ON IT
SAY HOW TO RUN FilterGeotaggedWiki
To extract the WISTR training instances relevant to the test split of TR-CoNLL, run the following from FIELDSPRING_DIR:
fieldspring --memory 30g run opennlp.fieldspring.tr.app.SupervisedTRFeatureExtractor -w /path/to/filtered-geo-text-training.txt -c /path/to/enwiki-20130102-permuted-training-unigram-counts.txt.bz2 -i /path/to/trconll/xml/test/ -g geonames-1dpc.ser.gz -s src/main/resources/data/eng/stopwords.txt -d /path/to/suptr-models-trtest/
Here, /path/to/suptr-models-trtest/ is the directory where the training instances will be written.
To train the models given the training instances, run this from FIELDSPRING_DIR:
fieldspring --memory 30g run opennlp.fieldspring.tr.app.SupervisedTRMaxentModelTrainer /path/to/suptr-models-trtest/
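Because the trainer reads whatever the feature extractor wrote, a quick check that the directory is populated can save a wasted run. This is only a sketch: the helper name has_training_instances is hypothetical, and since the names of the files the extractor produces are not stated here, it checks only that the directory exists and is non-empty.

```shell
# Hypothetical helper: succeed only if the directory exists and has at least one entry.
has_training_instances() {
    dir="$1"
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Placeholder path from the commands above; replace with your own directory.
MODEL_DIR="/path/to/suptr-models-trtest/"
if has_training_instances "$MODEL_DIR"; then
    echo "ready to train on $MODEL_DIR"
else
    echo "no training instances found in $MODEL_DIR" >&2
fi
```

If the check reports nothing found, rerunning the feature extraction step with the same -d path is the first thing to verify.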
SAY HOW TO RUN geolocate-document
SAY HOW TO RUN THE SHELL FILE, AND HOW TO READ IT