gec-transcriber

gec-transcriber is a convenience tool for improving the quality of datasets, especially those created by automated means such as scraping.

Capabilities:

Convert txt data (in the form Name: Message) to CSV
-c: Column number to process, or last for the last column (useful for groups of datasets with varying numbers of columns). If omitted, all columns are processed.
-s: Strip HTML tags
-u: Convert Unicode punctuation to the nearest ASCII equivalent
-U: Convert Unicode using unidecode (not implemented yet)
-S: Strip Unicode characters (runs after -u and -U, removing characters that could not be converted to ASCII equivalents)
-f: Use fastpunct to help restore missing punctuation and correct spelling. Works reasonably well with low-quality data, poor English, etc.
-p: Specify a custom prompt for the GEC model
Run a GEC (grammatical error correction) model over the dataset (prompt dependent)
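
To make the text-cleaning steps more concrete, here is a minimal standard-library sketch of what the txt-to-CSV conversion and the -s, -u and -S options might do, applied in the order described above. It is not taken from main.py: the function names, the punctuation table and the choice to skip lines without a colon are all assumptions made for illustration.

```python
import csv
import io
import re

# Hypothetical punctuation table; the tool's real mapping for -u may differ.
PUNCT_MAP = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # horizontal ellipsis
    "\u00a0": " ",                  # non-breaking space
}

TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text):
    """Rough analogue of -s: drop anything that looks like an HTML tag."""
    return TAG_RE.sub("", text)


def punct_to_ascii(text):
    """Rough analogue of -u: map known Unicode punctuation to ASCII."""
    return "".join(PUNCT_MAP.get(ch, ch) for ch in text)


def strip_non_ascii(text):
    """Rough analogue of -S: drop whatever is still not ASCII."""
    return text.encode("ascii", "ignore").decode("ascii")


def clean(text):
    """Apply the three steps in order: -s, then -u, then -S."""
    return strip_non_ascii(punct_to_ascii(strip_html(text)))


def txt_to_csv(txt):
    """Turn 'Name: Message' lines into two-column CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in txt.splitlines():
        name, sep, message = line.partition(":")
        if not sep:
            continue  # assumption: lines without a colon are skipped
        writer.writerow([name.strip(), message.strip()])
    return out.getvalue()
```

Running the conversion before the strip step matters in this sketch: a curly quote is turned into a plain quote first, so only characters with no entry in the table are removed.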

Parallelism:

-ct: Number of CPU threads (default 16)
-gt: Number of GPU threads (default 16)
-b: Batch size (default 1)
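
As a rough illustration of what a batch size means here, the sketch below groups rows into fixed-size chunks, the way a model is typically fed several inputs per call. The batches helper is hypothetical and is not how main.py is known to work; it only shows the general idea behind -b.

```python
from itertools import islice


def batches(items, batch_size=1):
    """Yield consecutive lists of at most batch_size items (hypothetical helper)."""
    it = iter(items)
    while True:
        chunk = list(islice(it, batch_size))
        if not chunk:
            return
        yield chunk


rows = ["row one", "row two", "row three", "row four", "row five"]
for batch in batches(rows, batch_size=2):
    # In a real pipeline each batch would be passed to the model in one call.
    print(len(batch), batch)
```

With the default batch size of 1, each row is processed on its own. Larger values such as -b 32 generally improve GPU throughput at the cost of more memory.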

Convert model to CTranslate2:

python main.py ../full-models/coedit-large convert models/coedit-large -q int8_float16

Process my_dataset on the GPU, with the data in column 5 and all available processing features enabled:

python main.py coedit-large transcribe ../datasets/my_dataset -c 5 -d cuda -suUSf -b 32

Processed datasets are written to /path/to/your/dataset/output

Installation

Clone repo and enter directory:

git clone https://github.com/sigmareaver/gec-transcriber.git
cd gec-transcriber

Create and activate an environment (e.g. with conda):

conda create -n gec python
conda activate gec

Install requirements:

pip install -r requirements.txt
