Releases · huggingface/transformers

08 Jan 10:33

v5.0.0rc2

57278c9

Release candidate 5.0.0rc2 Pre-release

Pre-release

What's Changed

This release candidate is focused on fixing AutoTokenizer, expanding the dynamic weight loading support, and improving performances with MoEs!

MoEs and performances:

batched and grouped experts implementations by @IlyasMoutawwakil in #42697
Optimize MoEs for decoding using batched_mm by @IlyasMoutawwakil in #43126

Tokenization:

The main issue with the tokenization refactor is that tokenizer_class are now "enforced" when in most cases they are wrong. This took a while to properly isolate and now we try to use TokenizersBackend whenever we can. #42894 has a much more detailed description of the big changes!

use TokenizersBackend by @ArthurZucker in #42894
Fix convert_tekken_tokenizer by @juliendenize in #42592
refactor more tokenizers - v5 guide update by @itazap in #42768
[Tokenizers] Change treatment of special tokens by @vasqu in #42903

Core

Here we focused on boosting the performances of loading weights on device!

[saving] Simplify general logic by @Cyrilvallez in #42766
Do not rely on config for inferring model dtype by @Cyrilvallez in #42838
Improve BatchFeature: stack list and lists of torch tensors by @yonigozlan in #42750
Remove tied weights from internal attribute if they are not tied by @Cyrilvallez in #42871
Enforce call to post_init and fix all of them by @Cyrilvallez in #42873
Simplify tie weights logic by @Cyrilvallez in #42895
Add buffers to _init_weights for ALL models by @Cyrilvallez in #42309
[loading] Really initialize on meta device for huge perf gains by @Cyrilvallez in #42941
Do not use accelerate hooks if the device_map has only 1 device by @Cyrilvallez in #43019
Move missing weights and non-persistent buffers to correct device earlier by @Cyrilvallez in #43021

New models

Sam: Perception Encoder Audiovisual by @eustlb in #42905
adds jais2 model support by @sarathc-cerebras in #42684
Add Pixio pre-trained models by @LiheYoung in #42795
[Ernie 4.5] Ernie VL models by @vasqu in #39585
[loading][TP] Fix device placement at loading-time, and simplify sharding primitives by @Cyrilvallez in #43003
GLM-ASR Support by @zRzRzRzRzRzRzR in #42875

Quantization

[Devstral] Make sure FP8 conversion works correctly by @patrickvonplaten in #42715
Fp8 dq by @SunMarc in #42926
[Quantization] Removing misleading int8 quantization in Finegrained FP8 by @MekkCyber in #42945
Fix deepspeed + quantization by @SunMarc in #43006

Breaking changes

Mostly around processors!

🚨 Fix ConvNeXt image processor default interpolation to BICUBIC by @lukepayyapilli in #42934
🚨 Fix EfficientNet image processor default interpolation to BICUBIC by @lukepayyapilli in #42956
Add fast version of convert_segmentation_map_to_binary_masks to EoMT by @simonreise in #43073
🚨Fix MobileViT image processor default interpolation to BICUBIC by @lukepayyapilli in #43024

Thanks again to everyone !

New Contributors

@ZX-ModelCloud made their first contribution in #42833
@AYou0207 made their first contribution in #42863
@wasertech made their first contribution in #42864
@preetam1407 made their first contribution in #42685
@Taise228 made their first contribution in #41416
@CandiedCode made their first contribution in #42885
@sarathc-cerebras made their first contribution in #42684
@nandan2003 made their first contribution in #42318
@LiheYoung made their first contribution in #42795
@majiayu000 made their first contribution in #42928
@lukepayyapilli made their first contribution in #42934
@leaderofARS made their first contribution in #42966
@qianyue76 made their first contribution in #43095
@stefgina made their first contribution in #43033
@HuiyingLi made their first contribution in #43084
@raimbekovm made their first contribution in #43038
@PredictiveManish made their first contribution in #43053
@pushkar-hue made their first contribution in #42736
@vykhovanets made their first contribution in #43042
@tanmay2004 made their first contribution in #42737
@atultw made their first contribution in #43061

Full Changelog: v5.0.0rc1...v5.0.0rc2

Contributors

HuiyingLi, CandiedCode, and 32 other contributors

Assets 2

08 Jan 10:15

ArthurZucker

v5.0.0rc1

bdc85cb

Release candidate 5.0.0rc1 Pre-release

Pre-release

What's Changed

This release candidate was focused mostly on quantization support with the new dynamic weight loader, and a few notable 🚨 breaking changes🚨:

Default dtype for any model when using from_pretrained is now auto!

Default auto 🚨 🚨 by @ArthurZucker in #42805

Default shard size when saving a model is now 50GB:

🚨🚨 [saving] Default to 50GB shards, and remove non-safe serialization by @Cyrilvallez in #42734
This is now as fast as before thanks to xet, and is just more convenient on the hub.

Kwargs. They are fundamental to enable integration with vllm and other toosl:

Every model forward() should have **kwargs by @Rocketknight1 in #42603

Dynamic weight loader updates:

Mostly QOL and fixed + support back CPU offloading.

mark params as _is_hf_initialized with DS Zero3 from weight conversion by @winglian in #42626
[loading] Allow loading to happen without threading by @Cyrilvallez in #42619
[loading] Correctly load params during offloading & careful memory considerations by @Cyrilvallez in #42632
allow registration of custom checkpoint conversion mappings by @winglian in #42634

New models:

Add FastVLM by @camilla-deckard in #41112
Lasr model by @eustlb in #42648
[Model] Add PaddleOCR-VL Model Support by @zhang-prog in #42178

Some notable quantization fixes:

Mostly added support for fbgemme , quanto,

Fix fp8 + some enhancement by @SunMarc in #42455
Fix eetq quanto quant methods by @SunMarc in #42557
[Quantization] per tensor quantization kernel by @MekkCyber in #42560
[Quantization] fix fbgemm by @MekkCyber in #42561
[Quantization] Fix FP8 experts replacing by @MekkCyber in #42654
[Quantization] Fix Static FP8 Quantization by @MekkCyber in #42775
[core] fix fp-quant by @MekkCyber in #42613

Peft:

The dynamic weight loader broke small things, this adds glue for all models but MoEs.

FIX Error when trying to load non-LoRA PEFT by @BenjaminBossan in #42663
Fix PEFT integration with new weight loader by @Cyrilvallez in #42701

Misc

Tokenization needed more refactoring, this time its a lot cleaner!

Refactor-tokenization-more by @ArthurZucker in #42563
Only default rope_parameters to empty dict if there is something to put in it by @hmellor in #42651

We omitted a lot of other commits for clarity, but thanks to everyone and the new contributors!

New Contributors

@camilla-deckard made their first contribution in #41112
@Aaraviitkgp made their first contribution in #42466
@ngazagna-qc made their first contribution in #40691
@arrdel made their first contribution in #42577
@marconaguib made their first contribution in #42587
@Xiao-Chenguang made their first contribution in #42436
@Furkan-rgb made their first contribution in #42465
@mertunsall made their first contribution in #42615
@anranlee99 made their first contribution in #42438
@UserChen666 made their first contribution in #42335
@efazal made their first contribution in #41723
@Harrisonyong made their first contribution in #36416
@hawon223 made their first contribution in #42384
@Bissmella made their first contribution in #42647
@AgainstEntropy made their first contribution in #42689
@dongluw made their first contribution in #42642
@hqkqn32 made their first contribution in #42620
@zhang-prog made their first contribution in #42178

Full Changelog: v5.0.0rc0...v5.0.0rc1

Contributors

winglian, BenjaminBossan, and 25 other contributors

Assets 2

01 Dec 18:14

LysandreJik

v5.0.0rc0

cbc6716

Transformers v5.0.0rc0 Pre-release

Pre-release

Transformers v5 release notes

Highlights
Significant API changes: dynamic weight loading, tokenization
Backwards Incompatible Changes
Bugfixes and improvements

Highlights

We are excited to announce the initial release of Transformers v5. This is the first major release in five years, and the release is significant: 800 commits have been pushed to main since the latest minor release. This release removes a lot of long-due deprecations, introduces several refactors that significantly simplify our APIs and internals, and comes with a large number of bug fixes.

We give an overview of our focus for this release in the following blogpost. In these release notes, we'll focus directly on the refactors and new APIs coming with v5.

This release is a release candidate (RC). It is not the final v5 release, and we will push on pypi as a pre-release. This means that the current release is purely opt-in, as installing transformers without specifying this exact release will install the latest version instead (v4.57.3 as of writing).

In order to install this release, please do so with the following:

pip install transformers --pre

For us to deliver the best package possible, it is imperative that we have feedback on how the toolkit is currently working for you. Please try it out, and open an issue in case you're facing something inconsistent/a bug.

Transformers version 5 is a community endeavor, and this is the last mile. Let's ship this together!

Significant API changes

Note

👀 Nothing is final and things are still actively in movement. We have a section dedicated to what is planned for future release candidates, yet is known not to work in the RC0. Look for "Disclaimers for the RC0".

We'll be eagerly awaiting your feedback in our GitHub issues!

Dynamic weight loading

We introduce a new weight loading API in transformers, which significantly improves on the previous API. This
weight loading API is designed to apply operations to the checkpoints loaded by transformers.

Instead of loading the checkpoint exactly as it is serialized within the model, these operations can reshape, merge,
and split the layers according to how they're defined in this new API. These operations are often a necessity when
working with quantization or parallelism algorithms.

This new API is centered around the new WeightConverter class:

class WeightConverter(WeightTransform):
    operations: list[ConversionOps]
    source_keys: Union[str, list[str]]
    target_keys: Union[str, list[str]]

The weight converter is designed to apply a list of operations on the source keys, resulting in target keys. A common
operation done on the attention layers is to fuse the query, key, values layers. Doing so with this API would amount
to defining the following conversion:

conversion = WeightConverter(
    ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],  # The input layers
    "self_attn.qkv_proj",  # The single layer as output
    operations=[Concatenate(dim=0)],
)

In this situation, we apply the Concatenate operation, which accepts a list of layers as input and returns a single
layer.

This allows us to define a mapping from architecture to a list of weight conversions. Applying those weight conversions
can apply arbitrary transformations to the layers themselves. This significantly simplified the from_pretrained method
and helped us remove a lot of technical debt that we accumulated over the past few years.

This results in several improvements:

Much cleaner definition of transformations applied to the checkpoint
Reversible transformations, so loading and saving a checkpoint should result in the same checkpoint
Faster model loading thanks to scheduling of tensor materialization
Enables complex mix of transformations that wouldn't otherwise be possible (such as quantization + MoEs, or TP + MoEs)

While this is being implemented, expect varying levels of support across different release candidates.

Linked PR: #41580

Tokenization

Just as we moved towards a single backend library for model definition, we want our tokenizers, and the Tokenizer object to be a lot more intuitive. With v5, tokenizer definition is much simpler; one can now initialize an empty LlamaTokenizer and train it directly on your corpus.

Defining a new tokenizer object should be as simple as this:

from transformers import TokenizersBackend, generate_merges
from tokenizers import pre_tokenizers, Tokenizer
from tokenizers.model import BPE

class Llama5Tokenizer(TokenizersBackend):
    def __init__(self, unk_token="<unk>",bos_token="<s>", eos_token="</s>", vocab=None, merges=None ):
        if vocab is None:
            self._vocab = {
                str(unk_token): 0,
                str(bos_token): 1,
                str(eos_token): 2,
            }

        else:
            self._vocab = vocab

        if merges is not None:
            self._merges = merges
        else:
            self._merges = generate_merges(filtered_vocab)

        self._tokenizer = Tokenizer(
            BPE(vocab=self._vocab, merges=self._merges, fuse_unk=True)
        )
        self._tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
            replacement="▁", prepend_scheme=_get_prepend_scheme(self.add_prefix_space, self), split=False
        )
        super().__init__(
            tokenizer_object=self._tokenizer,
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
        )

Once the tokenizer is defined as above, you can load it with the following: Llama5Tokenizer(). Doing this returns you an empty, trainable tokenizer that follows the definition of the authors of Llama5 (it does not exist yet 😉).

The above is the main motivation towards refactoring tokenization: we want tokenizers to behave similarly to models: trained or empty, and with exactly what is defined in their class definition.

Backend Architecture Changes: moving away from the slow/fast tokenizer separation

Up to now, transformers maintained two parallel implementations for many tokenizers:

"Slow" tokenizers (tokenization_<model>.py) - Python-based implementations, often using SentencePiece as the backend.
"Fast" tokenizers (tokenization_<model>_fast.py) - Rust-based implementations using the 🤗 tokenizers library.

In v5, we consolidate to a single tokenizer file per model: tokenization_<model>.py. This file will use the most appropriate backend available:

TokenizersBackend (preferred): Rust-based tokenizers from the 🤗 tokenizers library. In general it provides optimal performance, but it also offers a lot more features that are commonly adopted across the ecosystem:

handling additional tokens
a full python API for setting and updating
automatic parallelization,
automatic offsets
customization
training

SentencePieceBackend: for tokenizers requiring the sentencepiece library. It inherits from PythonBackend.
PythonBackend: a Python implementations of the features provided by tokenizers. Basically allows adding tokens.
MistralCommonBackend: relies on MistralCommon's tokenization library. (Previously known as the MistralCommonTokenizer)

The AutoTokenizer automatically selects the appropriate backend based on available files and dependencies. This is transparent, you continue to use AutoTokenizer.from_pretrained() as before. This allows transformers to be future-proof and modular to easily support future backends.

Defining a tokenizers outside of the existing backends

We enable users and tokenizer builders to define their own tokenizers from top to bottom. Tokenizers are usually defined using a backend such as tokenizers, sentencepiece or mistral-common, but we offer the possibility to design the tokenizer at a higher-level, without relying on those backends.

To do so, you can import the PythonBackend (which was previously known as PreTrainedTokenizer). This class encapsulates all the logic related to added tokens, encoding, and decoding.

If you want something even higher up the stack, then PreTrainedTokenizerBase is what PythonBackend inherits from. It contains the very basic tokenizer API features:

encode
decode
vocab_size
get_vocab
convert_tokens_to_ids
convert_ids_to_tokens
from_pretrained
save_pretrained
among a few others

API Changes

1. Direct tokenizer initialization with vocab and merges

Starting with v5, we now enable initializing blank, untrained tokenizers-backed tokenizers:

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer()

This tokenizer will therefore follow the definition of the LlamaTokenizer as defined in its class definition. It can then be trained on a corpus as can be seen in the tokenizers documentation.

These tokenizers can also be initialized from vocab and merges (if necessary), like the previous "slow" tokenizers:

from transformers import LlamaTokenizer

vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 3, "world": 4}
merges = [("h", "e"), ("l", "l"), ("o", " ")]

tokenizer = LlamaTokenizer(vocab=vocab, merges=merges)

This tokenizer will behave as a Llama-like toke...

Contributors

kashif, nihui, and 145 other contributors

Assets 2

25 Nov 15:51

ArthurZucker

v4.57.3

47b0e47

Patch release v4.57.3 Latest

Latest

There was a hidden bug when loading models with local_files_only=True and a typo related to the recent patch.

The main fix is: b605555.

We are really sorry that this slipped through, our CIs just did not catch it.

As it affects a lot of users we are gonna yank the previous release

Assets 2

24 Nov 17:54

Cyrilvallez

v4.57.2

2915fb3

Patch Release v4.57.2

This patch most notably fixes an issue on some Mistral tokenizers. It contains the following commits:

Add AutoTokenizer mapping for mistral3 and ministral (#42198)
Auto convert tekken.json (#42299)
fix tekken pattern matching (#42363)
Check model inputs - hidden states (#40994)
Remove invalid @staticmethod from module-level get_device_and_memory_breakdown (#41747)

Assets 2

14 Oct 15:39

Cyrilvallez

v4.57.1

8cb5963

Patch release v4.57.1

This patch most notably fixes an issue with an optional dependency (optax), which resulted in parsing errors with poetry. It contains the following fixes:

fix optax dep issue
remove offload_state_dict from kwargs
Fix bnb fsdp loading for pre-quantized checkpoint (#41415)
Fix tests fsdp (#41422)
Fix trainer for py3.9 (#41359)

Assets 2

03 Oct 17:04

LysandreJik

v4.57.0

8ac2b91

v4.57.0: Qwen3-Next, Vault Gemma, Qwen3 VL, LongCat Flash, Flex OLMO, LFM2 VL, BLT, Qwen3 OMNI MoE, Parakeet, EdgeTAM, OLMO3

New model additions

Qwen3 Next

The Qwen3-Next series represents the Qwen team's next-generation foundation models, optimized for extreme context length and large-scale parameter efficiency.
The series introduces a suite of architectural innovations designed to maximize performance while minimizing computational cost:

Hybrid Attention: Replaces standard attention with the combination of Gated DeltaNet and Gated Attention, enabling efficient context modeling.
High-Sparsity MoE: Achieves an extreme low activation ratio as 1:50 in MoE layers — drastically reducing FLOPs per token while preserving model capacity.
Multi-Token Prediction(MTP): Boosts pretraining model performance, and accelerates inference.
Other Optimizations: Includes techniques such as zero-centered and weight-decayed layernorm, Gated Attention, and other stabilizing enhancements for robust training.

Built on this architecture, they trained and open-sourced Qwen3-Next-80B-A3B — 80B total parameters, only 3B active — achieving extreme sparsity and efficiency.

Despite its ultra-efficiency, it outperforms Qwen3-32B on downstream tasks — while requiring less than 1/10 of the training cost.
Moreover, it delivers over 10x higher inference throughput than Qwen3-32B when handling contexts longer than 32K tokens.

For more details, please visit their blog Qwen3-Next (blog post).

Adding Support for Qwen3-Next by @bozheng-hit in #40771

Vault Gemma

VaultGemma is a text-only decoder model derived from Gemma 2, notably it drops the norms after the Attention and MLP blocks, and uses full attention for all layers instead of alternating between full attention and local sliding attention. VaultGemma is available as a pretrained model with 1B parameters that uses a 1024 token sequence length.

VaultGemma was trained from scratch with sequence-level differential privacy (DP). Its training data includes the same mixture as the Gemma 2 models, consisting of a number of documents of varying lengths. Additionally, it is trained using DP stochastic gradient descent (DP-SGD) and provides a (ε ≤ 2.0, δ ≤ 1.1e-10)-sequence-level DP guarantee, where a sequence consists of 1024 consecutive tokens extracted from heterogeneous data sources. Specifically, the privacy unit of the guarantee is for the sequences after sampling and packing of the mixture.

add: differential privacy research model by @RyanMullins in #40851

Qwen3 VL

Qwen3-VL is a multimodal vision-language model series, encompassing both dense and MoE variants, as well as Instruct and Thinking versions.

Building upon its predecessors, Qwen3-VL delivers significant improvements in visual understanding while maintaining strong pure text capabilities. Key architectural advancements include: enhanced MRope with interleaved layout for better spatial-temporal modeling, DeepStack integration to effectively leverage multi-level features from the Vision Transformer (ViT), and improved video understanding through text-based time alignment—evolving from T-RoPE to text timestamp alignment for more precise temporal grounding.

These innovations collectively enable Qwen3-VL to achieve superior performance in complex multimodal tasks.

Adding Support for Qwen3-VL Series by @JJJYmmm in #40795

Longcat Flash

The LongCatFlash model was proposed in LongCat-Flash Technical Report by the Meituan LongCat Team. LongCat-Flash is a 560B parameter Mixture-of-Experts (MoE) model that activates 18.6B-31.3B parameters dynamically (average ~27B). The model features a shortcut-connected architecture enabling high inference speed (>100 tokens/second) and advanced reasoning capabilities.

The abstract from the paper is the following:

We present LongCat-Flash, a 560 billion parameter Mixture-of-Experts (MoE) language model featuring a dynamic computation mechanism that activates 18.6B-31.3B parameters based on context (average ~27B). The model incorporates a shortcut-connected architecture enabling high inference speed (>100 tokens/second) and demonstrates strong performance across multiple benchmarks including 89.71% accuracy on MMLU and exceptional agentic tool use capabilities.

Tips:

LongCat-Flash uses a unique shortcut-connected MoE architecture that enables faster inference compared to traditional MoE models
The model supports up to 128k context length for long-form tasks
Dynamic parameter activation makes it computationally efficient while maintaining high performance
Best suited for applications requiring strong reasoning, coding, and tool-calling capabilities
The MoE architecture includes zero experts (nn.Identity modules) which act as skip connections, allowing tokens to bypass expert computation when appropriate

Add LongCat-Flash by @molbap in #40730

Flex Olmo

FlexOlmo is a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on closed datasets, and (2) data-flexible inference, where these parameters along with their associated data can be flexibly included or excluded from model inferences with no further training. FlexOlmo employs a mixture-of-experts (MoE) architecture where each expert is trained independently on closed datasets and later integrated through a new domain-informed routing without any joint training. FlexOlmo is trained on FlexMix, a corpus we curate comprising publicly available datasets alongside seven domain-specific sets, representing realistic approximations of closed sets.

You can find all the original FlexOlmo checkpoints under the FlexOlmo collection.

Add FlexOlmo model by @2015aroras in #40921

LFM2 VL

LFM2-VL first series of vision-language foundation models developed by Liquid AI. These multimodal models are designed for low-latency and device-aware deployment. LFM2-VL extends the LFM2 family of open-weight Liquid Foundation Models (LFMs) into the vision-language space, supporting both text and image inputs with variable resolutions.

Architecture

LFM2-VL consists of three main components: a language model backbone, a vision encoder, and a multimodal projector. LFM2-VL builds upon the LFM2 backbone, inheriting from either LFM2-1.2B (for LFM2-VL-1.6B) or LFM2-350M (for LFM2-VL-450M). For the vision tower, LFM2-VL uses SigLIP2 NaFlex encoders to convert input images into token sequences. Two variants are implemented:

Shape-optimized (400M) for more fine-grained vision capabilities for LFM2-VL-1.6B
Base (86M) for fast image processing for LFM2-VL-450M

The encoder processes images at their native resolution up to 512×512 pixels, efficiently handling smaller images without upscaling and supporting non-standard aspect ratios without distortion. Larger images are split into non-overlapping square patches of 512×512 each, preserving detail. In LFM2-VL-1.6B, the model also receives a thumbnail (a small, downscaled version of the original image capturing the overall scene) to enhance global context understanding and alignment. Special tokens mark each patch’s position and indicate the thumbnail’s start. The multimodal connector is a 2-layer MLP connector with pixel unshuffle to reduce image token count.

Add new model LFM2-VL by @zucchini-nlp in #40624

BLT

The BLT model was proposed in Byte Latent Transformer: Patches Scale Better Than Tokens by Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li1, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman†, Srinivasan Iyer.
BLT is a byte-level LLM that achieves tokenization-level performance through entropy-based dynamic patching.

The abstract from the paper is the following:

*We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference
efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating
more compute and model capacity where increased data complexity demands it. We present the first flop controlled sca...

Contributors

kashif, clinty, and 76 other contributors

Assets 2

17 Sep 09:13

LysandreJik

v4.56.2

cd74917

Patch release v4.56.2

Processor load with multi-processing (#40786)
[Jetmoe] Fix RoPE (#40819)
Fix getter regression (#40824)
Fix config dtype parsing for Emu3 edge case (#40766)

Assets 2

12 Sep 15:43

LysandreJik

v4.56.1-Vault-Gemma-preview

291772b

Vault-Gemma (based on v4.56.1)

A new model is added to transformers: Vault-Gemma
It is added on top of the v4.56.1 release, and can be installed from the following tag: v4.56.1-Vault-Gemma-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/[email protected]

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the Vault-Gemma model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.57.0.

Vault-Gemma

The example below demonstrates how to chat with the model with pipeline:

from transformers import pipeline

pipe = pipeline(
    task="text-generation",
    model="google/vaultgemma-1b",
    dtype="auto",
    device_map="auto",
)

text = "Tell me an unknown interesting biology fact about the brain."
outputs = pipe(text, max_new_tokens=32)
response = outputs[0]["generated_text"]
print(response)

with the AutoModelForCausalLM class:

# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/vaultgemma-1b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", dtype="auto")

text = "Tell me an unknown interesting biology fact about the brain."
input_ids = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))

or with transformers chat:

transformers chat google/vaultgemma-1b

Assets 2

04 Sep 15:53

LysandreJik

v4.56.0-Embedding-Gemma-preview

60b68e3

Embedding Gemma (based on v4.56.0)

A new model is added to transformers: Embedding Gemma
It is added on top of the v4.56.0 release, and can be installed from the following tag: v4.56.0-Embedding-Gemma-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/[email protected]

If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.

As the tag implies, this tag is a preview of the EmbeddingGemma model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.57.0.

Embedding-Gemma

Today, Google releases EmbeddingGemma, a state-of-the-art multilingual embedding model perfect for on-device use cases. Designed for speed and efficiency, the model features a compact size of 308M parameters and a 2K context window, unlocking new possibilities for mobile RAG pipelines, agents, and more. EmbeddingGemma is trained to support over 100 languages and is the highest-ranking text-only multilingual embedding model under 500M on the Massive Text Embedding Benchmark (MTEB) at the time of writing.

Usage example

EmbeddingGemma can be found on the Huggingface Hub. It is integrated in sentence-transformers which depends on transformers.

See below for sentence-transformers examples using the model:

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("google/embeddinggemma-300m")

# Run inference with queries and documents
query = "Which planet is known as the Red Planet?"
documents = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]
query_embeddings = model.encode_query(query)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (768,) (4, 768)

# Compute similarities to determine a ranking
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.3011, 0.6359, 0.4930, 0.4889]])

# Convert similarities to a ranking
ranking = similarities.argsort(descending=True)[0]
print(ranking)
# tensor([1, 2, 3, 0])

Assets 2

Releases: huggingface/transformers

Release candidate 5.0.0rc2

What's Changed

MoEs and performances:

Tokenization:

Core

New models

Quantization

Breaking changes

New Contributors

Contributors

Uh oh!

Release candidate 5.0.0rc1

What's Changed

Dynamic weight loader updates:

New models:

Some notable quantization fixes:

Peft:

Misc

New Contributors

Contributors

Uh oh!

Transformers v5.0.0rc0

Transformers v5 release notes

Highlights

Significant API changes

Dynamic weight loading

Tokenization

Backend Architecture Changes: moving away from the slow/fast tokenizer separation

Defining a tokenizers outside of the existing backends

API Changes

1. Direct tokenizer initialization with vocab and merges

Contributors

Uh oh!

Patch release v4.57.3

Uh oh!

Patch Release v4.57.2

Uh oh!

Patch release v4.57.1

Uh oh!

v4.57.0: Qwen3-Next, Vault Gemma, Qwen3 VL, LongCat Flash, Flex OLMO, LFM2 VL, BLT, Qwen3 OMNI MoE, Parakeet, EdgeTAM, OLMO3

New model additions

Qwen3 Next

Vault Gemma

Qwen3 VL

Longcat Flash

Flex Olmo

LFM2 VL

Architecture

BLT

Contributors

Uh oh!

Patch release v4.56.2

Uh oh!

Vault-Gemma (based on v4.56.1)

Vault-Gemma

Uh oh!

Embedding Gemma (based on v4.56.0)

Embedding-Gemma

Usage example

Uh oh!