This repository contains frequency dictionaries in the form of text files, with one word per line.
The repository is organized into two folders:
- freq_dicts_dirty: Contains dictionaries with words that may not appear in a "standard" dictionary.
- freq_dicts_clean: Contains dictionaries that have been cleaned and supplemented to include only words found in a "standard" dictionary.
The files in the freq_dicts_dirty folder were derived from the LuminosoInsight/wordfreq project. These dictionaries were converted into .txt files with one word per line, ordered by frequency (most frequent words first). Only words longer than two characters were retained.
The conversion process involved:
- Using the jakm/msgpack-cli tool to convert .msgpack files to .json format.
- Transforming the .json files into .txt files with one word per line using sed and grep.
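
The second step can be sketched in Python instead of sed and grep. This is only an illustration, not the commands actually used: it assumes the converted .json file is a nested list of word groups ordered from most to least frequent, possibly preceded by a header object, and the function name `words_from_json` and the `min_length` parameter are hypothetical.

```python
import json


def words_from_json(text, min_length=3):
    """Flatten a nested JSON list into words, preserving their order.

    Assumption: strings nested anywhere inside lists are words; objects
    (such as a header) and other values are skipped.
    """
    words = []

    def walk(node):
        if isinstance(node, str):
            # Keep only words longer than two characters.
            if len(node) >= min_length:
                words.append(node)
        elif isinstance(node, list):
            for child in node:
                walk(child)
        # Anything else (dict, number, null) is ignored.

    walk(json.loads(text))
    return words


# Hypothetical sample input; real files may be structured differently.
sample = '[{"format": "cB"}, ["the"], ["of", "and"], ["to", "that"]]'
print("\n".join(words_from_json(sample)))
```

Writing the returned list to a file with one word per line gives the same layout as the .txt dictionaries.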
The files in the freq_dicts_clean folder were created by cleaning the dictionaries in the freq_dicts_dirty folder. This process involved removing words not found in the corresponding dictionaries from titoBouzout/Dictionaries.
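
A minimal sketch of this cleaning step, assuming both dictionaries are already loaded as lists of words and that matching is exact and case-sensitive (the repository does not state how case is handled). The function name `clean_words` is hypothetical.

```python
def clean_words(freq_words, standard_words):
    """Keep only frequency-list words that also occur in the standard list.

    The frequency order of the input is preserved.
    """
    standard = set(standard_words)  # set membership checks are O(1)
    return [word for word in freq_words if word in standard]


# Hypothetical sample data for illustration only.
dirty = ["the", "teh", "and", "lol", "cat"]
standard = ["and", "cat", "dog", "the"]
print(clean_words(dirty, standard))
```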
- Files named short_xx.txt retain their original names.
- Files originally named long_xx.txt have been renamed to medium_xx.txt.
- New long_xx.txt files are created from medium_xx.txt (or short_xx.txt when applicable). These are supplemented by appending, in alphabetical order, all words present in the "standard" dictionary but absent from the "frequency" dictionary.
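
The supplementing rule for the new long_xx.txt files can be sketched as follows. This is an illustration under stated assumptions: the function name `build_long` is hypothetical, and "alphabetical order" is approximated with Python's default string sort (code point order), which may differ from the ordering actually used for accented or mixed-case words.

```python
def build_long(medium_words, standard_words):
    """Append standard-dictionary words missing from the frequency list.

    Frequency-ordered words come first; the missing words follow,
    sorted with Python's default string ordering.
    """
    present = set(medium_words)
    missing = sorted(set(standard_words) - present)
    return list(medium_words) + missing


# Hypothetical sample data for illustration only.
medium = ["the", "and", "cat"]
standard = ["zebra", "and", "apple", "cat", "the"]
print(build_long(medium, standard))
```

With this layout, the head of each long file is still ranked by frequency, while the appended tail carries no frequency information.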
This repository is licensed under the Apache License, Version 2.0. See the LICENSE file for details.
This repository is based on two primary sources:
- The rspeer/wordfreq project by Robyn Speer.
- Dictionaries from the titoBouzout/Dictionaries repository, originally derived from the OpenOffice dictionary list.
- Robyn Speer must be credited as specified in NOTICE.md.
- For a detailed list of data sources and their licenses, see the original wordfreq NOTICE.md.
- Data from wordfreq/wordfreq is redistributed under terms compatible with their original licenses, including the Creative Commons Attribution-ShareAlike 4.0 license.
- The dictionaries included in this repository are derived from the OpenOffice dictionary list, as referenced in titoBouzout/Dictionaries.
- While no formal license is provided in the source, credits to the original contributors are acknowledged in the respective LANG.txt files in the titoBouzout/Dictionaries repository.
- For more details about the dictionaries' origins and attribution requirements, see NOTICE.md.
The combined content of this repository complies with the terms of the Apache License 2.0 and respects the attribution requirements of the original sources. See NOTICE.md for further details.