💫 Industrial-strength Natural Language Processing (NLP) in Python
Updated Apr 11, 2025 - Python
Easy token price estimates for 400+ LLMs. TokenOps.
👑 spaCy building blocks and visualizers for Streamlit apps
All the slides, accompanying code, and exercises are stored in this repo. 🎈
Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing
Ekphrasis is a text processing tool geared towards text from social networks, such as Twitter or Facebook. Ekphrasis performs tokenization, word normalization, word segmentation (for splitting hashtags), and spell correction, using word statistics from two big corpora (English Wikipedia and Twitter, 330 million English tweets).
The official code 👩‍💻 for TOTEM: TOkenized Time Series EMbeddings for General Time Series Analysis
[NeurIPS 2024] OmniTokenizer: one model and one weight for image-video joint tokenization.
Rule-based token and sentence segmentation for the Russian language
[Paper] [AAAI 2025] (MyGO) Tokenization, Fusion, and Augmentation: Towards Fine-grained Multi-modal Entity Representation
Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
Fast bare-bones BPE for modern tokenizer training
Code for the paper "Fishing for Magikarp"
A unified tokenization tool for Images, Chinese and English.
Implementation of the paper "LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens"
Code for Zero-Shot Tokenizer Transfer
Train a Chinese vocabulary with BPE in sentencepiece and use it in transformers.
Implementation of the GBST block from the Charformer paper, in PyTorch
[LREC 2022] An off-the-shelf pre-trained Tweet NLP Toolkit (NER, tokenization, lemmatization, POS tagging, dependency parsing) + Tweebank-NER dataset
FPE - Format Preserving Encryption with FF3 in Python