Speech embeddings (i.e., speech encoders) map raw audio waveforms into sequences of continuous vectors, each covering a fixed time interval (e.g., one vector every 80 ms): time-synchronous representations. LLMs, however, operate in a different temporal space: they process semantic concepts as discrete text tokens (e.g., one vector per subword), i.e., text-synchronous representations. This mismatch makes it harder to build multimodal speech-text LLMs. For example, Moshi, SyncLLM, and SALMONN-Omni operate in time (one step = a fixed number of milliseconds), while Llama-Omni2, Qwen3-Omni, and LLM-based TTS systems (e.g., CSM, OrpheusTTS) mix time and text tokens.
We ask: "Can we turn[1] existing time-synchronous speech representations into text-synchronous ones?"
[1] Training a new speech embedding tokenizer from scratch requires massive data (e.g., Mimi was trained on 7M hours of speech), and this work focuses on adapting existing ones (e.g., with around 1K hours of speech).
To transform these representations, we introduce an Additional Encoder and an Additional Decoder built on top of an existing speech tokenizer; in this work we use Mimi as the example.
Mimi's speech encoder operates at 12.5 Hz (i.e., one representation every 80 ms), and its output is discretized by an RVQ network with 32 codebooks. Mimi's speech decoder takes the reconstructed representation (one vector every 80 ms) and decodes it back into a waveform.
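For reference, here is a minimal sketch of running the frozen Mimi tokenizer end to end, assuming the Hugging Face transformers port of Mimi (kyutai/mimi); the exact API may differ slightly from the official Moshi release:

```python
import torch
from transformers import MimiModel, AutoFeatureExtractor

# Sketch only: assumes the Hugging Face transformers port of Mimi ("kyutai/mimi").
model = MimiModel.from_pretrained("kyutai/mimi")
feature_extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

waveform = torch.zeros(24000).numpy()  # 1 second of (dummy) audio at Mimi's 24 kHz input rate
inputs = feature_extractor(raw_audio=waveform, sampling_rate=24000, return_tensors="pt")

with torch.no_grad():
    # Encode: one frame every 80 ms (12.5 Hz), each frame quantized by the RVQ codebooks.
    codes = model.encode(inputs["input_values"]).audio_codes  # (batch, num_codebooks, num_frames)
    # Decode: reconstruct the 24 kHz waveform from the codes.
    audio = model.decode(codes).audio_values                  # (batch, 1, num_samples)
```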
To align Mimi's continuous speech representations with text embeddings (i.e., to turn the time-synchronous representation into a text-synchronous one), we add a cross-attention network to the encoder that explicitly targets the text token length. In parallel, a causal transformer decoder uses force-aligned signals to map the cross-attention output back to the continuous latent space (for reconstruction or speech generation). By doing this (we still need to train the cross-attention network and the causal decoder), we turn the time-synchronous Mimi tokenizer into a text-synchronous one.
The additional encoder (4 layers, 16.8M parameters) aligns Mimi's speech representations with text embeddings (we use the Llama3 embeddings, specifically from meta-llama/Llama-3.1-8B-Instruct). It maps the audio sequence into the semantic token space via a cross-attention bottleneck:
- Input:
  - Key & Value: Mimi's time-synchronous speech representation (length T)
  - Query: Llama's text embedding (length N)
- Output: text-synchronous speech representation (length N)
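A minimal PyTorch sketch of this bottleneck is below. Module, dimension, and parameter names are illustrative assumptions (a single cross-attention layer rather than the actual 4-layer encoder):

```python
import torch
import torch.nn as nn

class CrossAttentionBottleneck(nn.Module):
    """Maps a length-T time-synchronous speech sequence onto a length-N text-token grid.
    Sketch only: one cross-attention layer standing in for the 4-layer additional encoder."""

    def __init__(self, speech_dim=512, text_dim=4096, model_dim=512, num_heads=8):
        super().__init__()
        self.speech_proj = nn.Linear(speech_dim, model_dim)  # project Mimi latents (dim assumed)
        self.text_proj = nn.Linear(text_dim, model_dim)      # project Llama text embeddings
        self.cross_attn = nn.MultiheadAttention(model_dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(model_dim, 4 * model_dim), nn.GELU(),
                                 nn.Linear(4 * model_dim, model_dim))

    def forward(self, speech_latents, text_embeddings):
        # speech_latents: (B, T, speech_dim) from the frozen Mimi encoder (time-synchronous)
        # text_embeddings: (B, N, text_dim) from the frozen Llama embedding table
        kv = self.speech_proj(speech_latents)
        q = self.text_proj(text_embeddings)
        # Query = text tokens, Key/Value = speech frames -> output has length N (text-synchronous)
        attn_out, _ = self.cross_attn(query=q, key=kv, value=kv)
        return attn_out + self.ffn(attn_out)  # (B, N, model_dim)
```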
The causal transformer decoder (4 layers, 12.6M parameters) reconstructs the time-synchronous latents from the text-synchronous representations. Given the encoder's output [t_i, s_i] for each text token i (the Llama text embedding paired with the text-synchronous speech representation), the decoder must produce a variable-length sequence of Mimi latent vectors [z_(i,1), z_(i,2), ..., z_(i,T_i)], where T_i depends on how many speech frames correspond to the i-th text token.
Force Alignment (Training Only)
To train this decoder, we need to know the mapping between text tokens and speech frames, i.e., which consecutive time-synchronous frames [z_(i,1), ..., z_(i,T_i)] correspond to each text token i. We obtain this via force alignment: given an audio file and its text transcript, force alignment produces character-level timestamps, which we then map to subword token boundaries.
In practice, we use WhisperX (backed by wav2vec2 alignment models) to obtain character-level alignments, and then aggregate them to match the Llama tokenizer’s subword segmentation. This gives us the start/end timestamps for each subword token, which we use to slice the Mimi encoder’s output into per-token groups. The alignment was performed on LibriSpeech (960 hours) and LibriTTS (585 hours) for the English data used in this work.
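A sketch of the aggregation step, assuming WhisperX has already produced per-character timestamps and that the tokenizer exposes character offsets via return_offsets_mapping (function and variable names here are illustrative, not the actual training code):

```python
def char_to_token_spans(transcript, char_times, tokenizer):
    """Aggregate character-level timestamps into per-subword (start, end) spans.

    transcript: the text string that was force-aligned
    char_times: one (start_sec, end_sec) per character of `transcript`, e.g. derived from
                WhisperX char alignments (None for characters that were not aligned, such as spaces)
    tokenizer:  a fast Hugging Face tokenizer (e.g., the Llama tokenizer)
    """
    enc = tokenizer(transcript, return_offsets_mapping=True, add_special_tokens=False)
    spans = []
    for char_start, char_end in enc["offset_mapping"]:
        times = [t for t in char_times[char_start:char_end] if t is not None]
        if not times:
            spans.append(None)  # token covers only unaligned characters; handle downstream
            continue
        spans.append((min(s for s, _ in times), max(e for _, e in times)))
    return spans  # one (start_sec, end_sec) per subword token

# Usage sketch:
#   tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
#   spans = char_to_token_spans(transcript, char_times, tokenizer)
# Frame i of Mimi's 12.5 Hz output covers roughly [0.08 * i, 0.08 * (i + 1)) seconds,
# so each (start, end) span maps to a contiguous group of Mimi frames for that token.
```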
Important: Force alignment is needed only during training to construct the target sequences. At inference time, the decoder autoregressively generates z-frames and uses a learned stop token to determine when to stop generating z-frames for the current token and proceed to the next—no alignment information is required at inference.
Decoder Sequence
With the per-token frame counts from force alignment, we construct a flattened sequence that interleaves the text-synchronous representations with the time-synchronous latents. The causal transformer is trained to predict this joint continuous sequence autoregressively (left-to-right):
More concretely, the repeating unit for each token i in the flattened sequence is:
<|text_speech_start|> [t_i] [s_i] <|text_speech_end|> <|time_speech_start|> [z_(i,1)] [z_(i,2)] ... [z_(i,T_i)] <|time_speech_end|>
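As a sketch, one utterance's flattened target could be assembled as below. Names are hypothetical, and the special markers are modeled as learned embedding vectors since the decoder operates on continuous vectors:

```python
import torch
import torch.nn as nn

class DecoderSequenceBuilder(nn.Module):
    """Builds the flattened, interleaved target sequence for one utterance (sketch only)."""

    def __init__(self, model_dim=512):
        super().__init__()
        # Learned embeddings for the four special markers.
        self.specials = nn.ParameterDict({
            name: nn.Parameter(torch.randn(model_dim))
            for name in ["text_speech_start", "text_speech_end",
                         "time_speech_start", "time_speech_end"]
        })

    def forward(self, t, s, z_groups):
        # t: (N, D) projected text embeddings; s: (N, D) text-synchronous speech representations;
        # z_groups: list of N tensors, z_groups[i] of shape (T_i, D), i.e. the Mimi latents
        #           grouped per token using the force-alignment spans.
        seq = []
        for i in range(t.size(0)):
            seq += [self.specials["text_speech_start"], t[i], s[i],
                    self.specials["text_speech_end"],
                    self.specials["time_speech_start"], *z_groups[i],
                    self.specials["time_speech_end"]]
        return torch.stack(seq)  # (sum_i (T_i + 6), D): input/target for the causal decoder
```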
At each step, the decoder predicts the continuous latent z directly (trained with an L2 loss against the ground-truth latent), so the predicted vectors can be fed into the frozen Mimi decoder for waveform generation. The <|time_speech_end|> token is trained with a binary cross-entropy loss to signal when to stop generating z-frames for the current token and proceed to the next.
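The corresponding training objective, sketched under the same assumptions (z_pred, stop_logits, and the exact loss weighting are illustrative, not the actual training code):

```python
import torch.nn.functional as F

def decoder_loss(z_pred, z_target, stop_logits, stop_target):
    # z_pred / z_target: (num_frames, D) predicted vs. ground-truth Mimi latents
    #                    (only the positions that should emit a z-frame).
    # stop_logits / stop_target: (num_steps,) per-step "<|time_speech_end|> fires here" flag,
    #                            1.0 at the last frame of each token's group, else 0.0.
    l2 = F.mse_loss(z_pred, z_target)  # squared-L2 regression on the continuous latents
    stop = F.binary_cross_entropy_with_logits(stop_logits, stop_target)
    return l2 + stop
```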
Streaming decoding: Because the causal transformer operates left-to-right, the decoding process is naturally streamable. As soon as each z-frame is predicted, it can be immediately passed to the frozen Mimi decoder to produce audio—there is no need to wait for the entire sequence to be generated before starting waveform synthesis.
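A sketch of the streaming inference loop under the same assumptions; decoder.init_state, decoder.feed_token, decoder.step, and mimi.decode_frame are hypothetical placeholders for the actual incremental interfaces:

```python
import torch

@torch.no_grad()
def stream_decode(decoder, mimi, t, s, max_frames_per_token=40, stop_threshold=0.5):
    # t, s: (N, D) text embeddings and text-synchronous speech representations for one utterance.
    audio_chunks = []
    state = decoder.init_state()                       # hypothetical incremental-decoding state
    for i in range(t.size(0)):
        state = decoder.feed_token(state, t[i], s[i])  # consume the [t_i, s_i] block
        for _ in range(max_frames_per_token):
            z, stop_prob, state = decoder.step(state)  # predict the next continuous latent
            audio_chunks.append(mimi.decode_frame(z))  # stream: synthesize this frame right away
            if stop_prob > stop_threshold:             # <|time_speech_end|> fired -> next token
                break
    return torch.cat(audio_chunks, dim=-1)
```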