Extracting six domain-specific QA datasets from MS MARCO. More details can be found in our paper "Forget Me Not: Reducing Catastrophic Forgetting for Domain Adaptation in Reading Comprehension"
Please use Python 2.7 environment.
Download MS MARCO Reading Comprehension v2.1 training set
pip install json argparse
git clone https://github.com/ibm-aur-nlp/domain-specific-QA.git
cd <YOUR_CLONE_PATH>/domain-specific-QA
python extract_domain_specific_samples.py --marco <YOUR_MARCO_DOWNLOAD_PATH>/train_v2.1.json --out_dir <YOUR_OUTPUT_PATH> --lookup_table lookup_table.json
In <YOUR_OUTPUT_PATH>, you should see json files of format squad.<DOMAIN>.<SPLIT>.json. The statistics of the 6 domain specific QA datasets is:
| Domain | Total | Train | Dev | Test | 
|---|---|---|---|---|
| music | 3,596 | 2,517 | 539 | 540 | 
| biomedical | 31,620 | 22,134 | 4,743 | 4,743 | 
| film | 5,032 | 3,522 | 755 | 755 | 
| finance | 9,700 | 6,790 | 1,455 | 1,455 | 
| law | 4,436 | 3,105 | 665 | 666 | 
| computing | 4,316 | 3,021 | 647 | 648 |