@agourdel

Add Binary Serialization for Index and Index Player Tool

Overview

This PR introduces binary serialization capabilities for the Index structure, along with a web-based debugging tool for exploring FSM indices.

New Features

1. Index Serialization (save() and load())

Added two new methods to the Index structure:

save(path: &Path) -> Result<()>

Serializes the Index to a compressed binary file. This method:

  • Converts all Index data (vocabulary size, EOS token ID, initial state, final states, and transitions) into a compact binary format
  • Compresses the data using gzip compression (via flate2)
  • Writes the compressed data to the specified file path

Usage:

use std::path::Path;

let index = Index::new(regex, &vocabulary)?;
index.save(Path::new("index.outlines"))?;

load(path: &Path) -> Result<Index>

Deserializes an Index from a compressed binary file. This static method:

  • Reads and decompresses the gzip file
  • Parses the binary format according to the specification
  • Reconstructs the complete Index structure with all states and transitions

Usage:

// Requires `use std::path::Path;` because load() takes a &Path.
let index = Index::load(Path::new("index.outlines"))?;
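
The parsing step can be illustrated with a small cursor over the already decompressed bytes, following the field order listed under Binary Format Specification. This is a minimal sketch, not the code in src/index.rs: `ParsedIndex`, `Reader`, and `parse_index` are hypothetical names, errors are plain strings, and the gzip step is left out so the example needs only the standard library.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical in-memory result; not the crate's real `Index` type.
#[derive(Debug)]
struct ParsedIndex {
    vocab_size: u32,
    eos_token_id: u32,
    initial_state: u32,
    final_states: HashSet<u32>,
    index_type: u8,
    /// state_id -> (token_id -> next_state_id)
    transitions: HashMap<u32, HashMap<u32, u32>>,
}

/// Cursor over the decompressed bytes; every read is bounds-checked.
struct Reader<'a> {
    buf: &'a [u8],
    pos: usize,
}

impl<'a> Reader<'a> {
    fn take(&mut self, n: usize) -> Result<&'a [u8], String> {
        let buf: &'a [u8] = self.buf;
        let end = self.pos.checked_add(n).ok_or("offset overflow")?;
        let slice = buf.get(self.pos..end).ok_or("unexpected end of data")?;
        self.pos = end;
        Ok(slice)
    }

    fn read_u8(&mut self) -> Result<u8, String> {
        Ok(self.take(1)?[0])
    }

    fn read_u32(&mut self) -> Result<u32, String> {
        let b = self.take(4)?;
        // All integers are little-endian according to the format description.
        Ok(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
    }
}

fn parse_index(bytes: &[u8]) -> Result<ParsedIndex, String> {
    let mut r = Reader { buf: bytes, pos: 0 };
    let vocab_size = r.read_u32()?;
    let eos_token_id = r.read_u32()?;
    let initial_state = r.read_u32()?;

    let num_final_states = r.read_u32()?;
    let mut final_states = HashSet::new();
    for _ in 0..num_final_states {
        final_states.insert(r.read_u32()?);
    }

    let index_type = r.read_u8()?;
    if index_type != 1 {
        return Err(format!("unsupported index_type {}", index_type));
    }

    // Counts come from the file, so nothing is pre-allocated from them;
    // a corrupted count simply runs into "unexpected end of data".
    let num_states = r.read_u32()?;
    let mut transitions = HashMap::new();
    for _ in 0..num_states {
        let state_id = r.read_u32()?;
        let num_transitions = r.read_u32()?;
        let mut row = HashMap::new();
        for _ in 0..num_transitions {
            let token_id = r.read_u32()?;
            let next_state_id = r.read_u32()?;
            row.insert(token_id, next_state_id);
        }
        transitions.insert(state_id, row);
    }

    Ok(ParsedIndex {
        vocab_size,
        eos_token_id,
        initial_state,
        final_states,
        index_type,
        transitions,
    })
}
```

Because every read goes through `take`, truncated input surfaces as an error rather than a panic, which is the behavior a test like test_load_corrupted_file would presumably check.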

Benefits:

  • Performance: Loading a pre-built Index is significantly faster than rebuilding it from regex and vocabulary
  • Storage: Gzip compression reduces file size by 50-90% depending on the data
  • Portability: Binary files can be shared and loaded across different environments
  • Caching: Enables efficient caching of complex FSM indices
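
The caching benefit usually takes the form of a "load if present, otherwise build and save" helper. The sketch below is one possible shape for it, written generically so that it does not assume anything about the real `Index` API: `load_or_build` and its three closure parameters are hypothetical names introduced only for illustration.

```rust
use std::path::Path;

/// Hypothetical helper: return the value cached at `path` if it loads,
/// otherwise build it, save it for next time, and return it.
fn load_or_build<T, E>(
    path: &Path,
    load: impl Fn(&Path) -> Result<T, E>,
    build: impl FnOnce() -> Result<T, E>,
    save: impl Fn(&T, &Path) -> Result<(), E>,
) -> Result<T, E> {
    // A missing or unreadable cache file is treated as a cache miss,
    // so a corrupted file triggers a rebuild instead of an error.
    if let Ok(cached) = load(path) {
        return Ok(cached);
    }
    let value = build()?;
    save(&value, path)?;
    Ok(value)
}
```

With the methods from this PR, the `load` and `save` arguments would wrap `Index::load` and `Index::save`, and `build` would wrap `Index::new`, assuming all three share one error type.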

2. Binary Format Specification

The serialization uses a custom binary format optimized for FSM representation:

Format Structure (uncompressed)

Component                  Size          Description
vocab_size                 32 bits       Size of the vocabulary
eos_token_id               32 bits       End-of-sequence token ID
initial_state_id           32 bits       ID of the initial state
num_final_states           32 bits       Number of final states
final_states               32 bits × N   Array of final state IDs
index_type                 8 bits        Format version identifier (currently type 1)
num_states                 32 bits       Number of states with transitions
For each state:
  └─ state_id              32 bits       Current state ID
  └─ num_transitions       32 bits       Number of transitions from this state
  └─ For each transition:
      └─ token_id          32 bits       Token that triggers the transition
      └─ next_state_id     32 bits       Destination state ID

Key Features:

  • All integers stored in little-endian format
  • The entire structure is compressed with gzip before writing to disk
  • The index_type field allows for future format extensions
  • Fixed-size fields enable efficient parsing

Full specification available in INDEX_BINARY_FORMAT.md.
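
To make the layout concrete, here is a minimal sketch of an encoder for the uncompressed structure, using only the standard library. `put_u32` and `encode_index` are hypothetical names, not the functions added in src/index.rs, and the gzip pass that the PR applies via flate2 is deliberately omitted. A `BTreeMap` is used only to make the output deterministic. The format description does not state an ordering requirement for states or transitions.

```rust
use std::collections::BTreeMap;

/// Append one little-endian u32 to the buffer.
fn put_u32(out: &mut Vec<u8>, value: u32) {
    out.extend_from_slice(&value.to_le_bytes());
}

/// Encode the uncompressed layout in the field order given above.
/// `transitions` maps state_id -> (token_id -> next_state_id).
fn encode_index(
    vocab_size: u32,
    eos_token_id: u32,
    initial_state: u32,
    final_states: &[u32],
    transitions: &BTreeMap<u32, BTreeMap<u32, u32>>,
) -> Vec<u8> {
    let mut out = Vec::new();
    put_u32(&mut out, vocab_size);
    put_u32(&mut out, eos_token_id);
    put_u32(&mut out, initial_state);

    put_u32(&mut out, final_states.len() as u32);
    for &state in final_states {
        put_u32(&mut out, state);
    }

    out.push(1); // index_type: the only type described so far

    put_u32(&mut out, transitions.len() as u32);
    for (&state_id, row) in transitions {
        put_u32(&mut out, state_id);
        put_u32(&mut out, row.len() as u32);
        for (&token_id, &next_state_id) in row {
            put_u32(&mut out, token_id);
            put_u32(&mut out, next_state_id);
        }
    }
    out
}
```

From the table, the uncompressed size is 21 bytes of fixed fields, plus 4 bytes per final state, plus 8 bytes per state with transitions, plus 8 bytes per transition. An index with one final state and one state carrying two transitions therefore occupies 21 + 4 + 8 + 16 = 49 bytes before compression.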

3. Index Player Tool (tools/index_player.html)

A standalone HTML/CSS/JavaScript tool for debugging and exploring FSM indices.

Purpose

The Index Player is a debugging and explanation tool that lets developers:

  • Visualize FSM state transitions
  • Understand why a model might generate specific tokens
  • Explore valid token sequences for any given state
  • Debug regex-vocabulary compatibility issues
  • Track paths through the automaton

How It Works

The tool is a fully static, single-file application that runs entirely in the browser:

  1. Load Index File: Upload a binary .outlines file created with Index::save()

    • Automatically decompresses gzip using the browser's native DecompressionStream API
    • Parses the binary format and reconstructs the FSM in memory
  2. Load Vocabulary (Optional): Upload a vocab.json file from HuggingFace

    • Maps token IDs to their string representations
    • Enables human-readable token display
  3. Interactive Exploration:

    • Current State Display: Shows the active state (highlighted if final)
    • Path History: Visual timeline of selected tokens
    • Generated Text: Real-time concatenation of token values (when vocab is loaded)
    • Available Transitions: Grid of all valid next tokens from current state
    • Navigation Controls:
      • Click any transition card to advance
      • Or type a token ID manually
      • "Go Back" to undo last transition
      • "Reset" to return to initial state
  4. Visual Feedback:

    • Color-coded final states (green badges)
    • Token values highlighted in purple/gradient colors
    • Error messages for invalid transitions
    • Compact info panel showing FSM metadata
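
The navigation behavior described above (advance, "Go Back", "Reset", final-state highlighting) amounts to a current state plus a history stack. The tool itself is written in JavaScript. The sketch below models the same logic in Rust purely for illustration, and `Player` with all its methods is a hypothetical name rather than code from tools/index_player.html.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical model of the player's navigation state.
struct Player {
    /// state_id -> (token_id -> next_state_id)
    transitions: HashMap<u32, HashMap<u32, u32>>,
    final_states: HashSet<u32>,
    initial_state: u32,
    current: u32,
    /// Each entry is (token taken, state we were in before taking it).
    history: Vec<(u32, u32)>,
}

impl Player {
    fn new(
        transitions: HashMap<u32, HashMap<u32, u32>>,
        final_states: HashSet<u32>,
        initial_state: u32,
    ) -> Self {
        Player {
            transitions,
            final_states,
            initial_state,
            current: initial_state,
            history: Vec::new(),
        }
    }

    /// Valid (token_id, next_state_id) pairs from the current state, sorted by token.
    fn available(&self) -> Vec<(u32, u32)> {
        let mut out: Vec<(u32, u32)> = self
            .transitions
            .get(&self.current)
            .map(|row| row.iter().map(|(&t, &s)| (t, s)).collect())
            .unwrap_or_default();
        out.sort();
        out
    }

    /// Follow a token, or report an error if it is not valid here.
    fn advance(&mut self, token_id: u32) -> Result<u32, String> {
        let next = self
            .transitions
            .get(&self.current)
            .and_then(|row| row.get(&token_id))
            .copied()
            .ok_or_else(|| format!("token {} is not valid in state {}", token_id, self.current))?;
        self.history.push((token_id, self.current));
        self.current = next;
        Ok(next)
    }

    /// Undo the last transition; returns false when there is nothing to undo.
    fn go_back(&mut self) -> bool {
        match self.history.pop() {
            Some((_, previous)) => {
                self.current = previous;
                true
            }
            None => false,
        }
    }

    /// Return to the initial state and clear the path.
    fn reset(&mut self) {
        self.history.clear();
        self.current = self.initial_state;
    }

    fn is_final(&self) -> bool {
        self.final_states.contains(&self.current)
    }

    /// Token IDs selected so far, in order.
    fn path(&self) -> Vec<u32> {
        self.history.iter().map(|&(token, _)| token).collect()
    }
}
```

Storing the previous state alongside each token keeps "Go Back" a constant-time pop, with no need to replay the path from the initial state.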

Screenshot

[Image: screenshot of the Index Player tool]

Use Cases

  • Model Debugging: Understand why a model generated unexpected output by tracing the valid path through the Index
  • Regex Validation: Verify that a regex pattern correctly matches expected token sequences
  • Education: Learn how FSM-based constrained generation works
  • Token Analysis: Discover which tokens are valid at any point in the generation process

Testing

Added comprehensive Rust tests for serialization:

  • test_save_and_load: Verifies round-trip serialization preserves Index integrity
  • test_save_and_load_multibyte: Tests with multi-byte Unicode characters (emojis)
  • test_load_nonexistent_file: Error handling for missing files
  • test_load_corrupted_file: Error handling for invalid data
  • test_save_preserves_file_size: Validates compression is working

All tests pass successfully.

Dependencies

  • Added flate2 crate for gzip compression/decompression

Files Changed

  • src/index.rs: Added save() and load() methods
  • src/error.rs: Added IOError variant for I/O operations
  • Cargo.toml: Added flate2 dependency
  • INDEX_BINARY_FORMAT.md: Complete binary format specification
  • tools/index_player.html: New interactive debugging tool
  • tests/create_index_binary.py: Example script for creating binary indices

Breaking Changes

None. This is a purely additive change.
