This repository contains the code required to download audio data from openspeechcorpus.com

The Open Speech Corpus currently consists of three subcorpora:
- Tales: A crowdsourced corpus based on readings of Latin American short tales
- Aphasia: A crowdsourced corpus based on words categorized into 4 levels of difficulty
- Isolated words: A crowdsourced corpus based on isolated words

To download files from the Tales Project use
ops \
--output_folder tales/ \
--output_file tales.txt \
--corpus tales

To download files from the Isolated Words Project use
ops \
--output_folder isolated_words/ \
--output_file isolated_words.txt \
--corpus words

To download files from the Aphasia Project use
ops \
--output_folder aphasia/ \
--output_file aphasia.txt \
--corpus aphasia

You can download the whole corpus using the flag --download_all
ops \
--output_folder aphasia/ \
--output_file aphasia.txt \
--corpus aphasia \
--download_all

By default the page size is 500. To modify it, use the arguments --from and --to, e.g.:
ops \
--from 500 \
--to 1000 \
--output_folder aphasia/ \
--output_file aphasia.txt \
--corpus aphasia

If you use the flag --download_all together with the flag --from, the process starts at the specified --from value
and uses a page size of 500.
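
The paging behaviour described above can be pictured with a small sketch. This is not the code used by ops; it only illustrates how consecutive --from/--to windows of 500 items would be laid out. The helper name page_windows, the total item count, and the treatment of --to as an exclusive upper bound are all assumptions made for illustration.

```python
from typing import Iterator, Tuple

PAGE_SIZE = 500  # default page size stated in this README


def page_windows(start: int, total: int, page_size: int = PAGE_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield (from, to) pairs covering [start, total) in steps of page_size.

    Hypothetical helper: `total` is assumed to be known in advance, and
    `to` is assumed to be an exclusive upper bound.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    lower = start
    while lower < total:
        upper = min(lower + page_size, total)
        yield lower, upper
        lower = upper


def ops_command(lower: int, upper: int) -> str:
    """Build an ops invocation for one window, using only flags shown in this README."""
    return (
        f"ops --from {lower} --to {upper} "
        "--output_folder aphasia/ --output_file aphasia.txt --corpus aphasia"
    )
```

For example, calling page_windows(500, 1700) produces the windows (500, 1000), (1000, 1500) and (1500, 1700), each of which could be passed to ops_command to print one invocation per page.
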
We also support an argument --extra_query_params, which overrides the --from, --to and --download_all arguments
and downloads all files in the response body. You must still define the --corpus argument.
ops \
--output_folder aphasia/ \
--output_file aphasia_letters.txt \
--corpus aphasia \
--extra_query_params "level_sentence__id__gte=846&level_sentence__id__lte=870"

The Open Speech Corpus stores its files in mp4 format, which is undesirable for most audio processing tasks. To convert
the files into wav format, first install ffmpeg, then execute the
recursive_convert utility, which receives as first argument the source folder with the mp4 files and as second argument
the output folder, e.g.:
recursive_convert aphasia aphasia_wav

The Open Speech Corpus also defines some scripts to generate configurations for CMU Sphinx.
First initialize a project with the sphinxtrain command
sphinxtrain -t simple_words setup

To generate a configuration use the command configure_sphinx, which creates the transcription, fileids, fillers and
dic files.
configure_sphinx simple_words \
--transcription_file words.txt \
--etc_folder simple_words/etc \
--test_size 0.5

You also need to define a language model. The generate_language_model command receives the DB_NAME and the base
project folder
generate_language_model simple_words simple_words

To delete the configuration files use the command clean_previous_configuration
clean_previous_configuration simple_words --etc_folder simple_words/etc/

The Open Speech Corpus also defines some scripts to train models using HTK.
To generate a word grammar use
configure_htk \
--transcription_file words.txt \
--project_folder htk_words \
--wav_folder words_wav \
htk_words
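
The --wav_folder argument above expects wav files, so the mp4 downloads need to be converted first (the recursive_convert step described earlier). As a rough picture of what such a conversion involves, here is a minimal sketch that mirrors a folder tree and builds one ffmpeg command per file. It is not the implementation of recursive_convert; the function names plan_conversions, ffmpeg_command and convert_all are hypothetical, and the sketch assumes ffmpeg is available on the PATH.

```python
import subprocess
from pathlib import Path
from typing import List, Tuple


def plan_conversions(source_folder: str, output_folder: str) -> List[Tuple[Path, Path]]:
    """Pair every .mp4 under source_folder with a .wav path under output_folder.

    Hypothetical helper: the relative folder layout is preserved.
    """
    src_root = Path(source_folder)
    out_root = Path(output_folder)
    pairs = []
    for src in sorted(src_root.rglob("*.mp4")):
        dst = out_root / src.relative_to(src_root).with_suffix(".wav")
        pairs.append((src, dst))
    return pairs


def ffmpeg_command(src: Path, dst: Path) -> List[str]:
    """Build the ffmpeg argument list: -y overwrites the output, -i names the input."""
    return ["ffmpeg", "-y", "-i", str(src), str(dst)]


def convert_all(source_folder: str, output_folder: str) -> None:
    """Run ffmpeg once per planned pair (assumes ffmpeg is installed)."""
    for src, dst in plan_conversions(source_folder, output_folder):
        dst.parent.mkdir(parents=True, exist_ok=True)
        subprocess.run(ffmpeg_command(src, dst), check=True)
```

Under these assumptions, convert_all("aphasia", "aphasia_wav") would play the same role as the recursive_convert call shown earlier, and the resulting folder could then be passed to --wav_folder.
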