C-T2M: Controllable Autoregressive Text-to-Motion Generation

Project Report · CS-503, EPFL, 2026

Maria Pilligua1, 2 (444123), Pau Amargant1 (408221), Miquel Lopez1 (415700), Nahush Rajesh Kolhe1 (407562)
1EPFL 2CVLab

Abstract

We present C-T2M, a decoupled approach to controllable text-to-motion generation that combines an autoregressive language model over discrete motion tokens with an explicit trajectory controller. Our pipeline consists of three components: (i) a closed-loop GRU-based trajectory controller that generates smooth XZ root paths from user-specified waypoints and captions, (ii) a caption-driven body GPT that synthesizes motion tokens over a learned VQ-VAE codebook conditioned on text, and (iii) a deterministic recompose step that blends the body's predicted gait onto the controller's trajectory. By decoupling path control from body motion generation, our system runs in 413 ms per 200-frame clip, an order of magnitude faster than the state-of-the-art diffusion baseline Kimodo, while following user constraints with 3.94 cm root error (vs. Kimodo's 4.88 cm). We train on the BONES dataset and evaluate against Kimodo on both the Repetition Text-to-Motion split and a dedicated constraint-following protocol. We additionally study a residual VQ-VAE codec (RVQ-VAE) and ablate the constraint-conditioning architecture, the temporal compression rate, the heading representation, and the token sampling strategy.

1. Introduction

Text-driven generation of realistic human motion has progressed significantly in recent years. Diffusion-based systems such as Kimodo now produce high-fidelity human motion conditioned on free-form text, and additionally support a rich set of kinematic constraints (pose, joint trajectory, root path). Kimodo is trained on the BONES dataset, a large-scale motion capture corpus roughly ten times larger than the commonly used HumanML3D benchmark, which enables this quality improvement.

Despite these advances, current state-of-the-art systems share a common limitation: they generate the entire motion clip in one expensive offline pass, typically requiring tens to hundreds of denoising steps. This makes them unsuitable for online settings, where the user wants to update waypoints, captions, or scene geometry mid-execution, and for interactive settings, where latency directly bounds usability. A robotics policy, a digital twin, or an interactive game character all need motion generation that reacts to changing input without recomputing everything from scratch.

1.1. Problem Statement

We design a controllable text-to-motion model that takes as input (i) a free-form caption and (ii) a root XZ path constraint (either a small set of waypoints or a dense trajectory), and produces frame-by-frame 3D joint positions of a humanoid skeleton that obey both inputs. Our goals are:

  • Follow root XZ path constraints accurately, comparable to a diffusion baseline of much higher capacity.
  • Run substantially faster than diffusion baselines, in particular fast enough to recompute the response to a new constraint without restarting the rollout.
  • Maintain motion quality (text alignment, foot-contact behaviour) at a level acceptable for downstream use.

1.2. Contributions

  • A decoupled architecture that separates trajectory control (a small GRU) from body motion generation (a Transformer over VQ-VAE tokens), connected by a deterministic recompose step.
  • A closed-loop trajectory controller trained with a travel-relative heading loss plus a still-frame target that fixes the spin-on-spot pathology common to absolute-heading parameterisations.
  • A residual VQ-VAE (RVQ-VAE) codec that lifts reconstruction-side R-Precision by 9-12 points over a single VQ-VAE baseline, halves the reconstruction FID, and reduces foot-skating.
  • A systematic ablation of constraint-conditioning architectures showing that our decoupled design dominates four end-to-end variants on every constraint-following metric we report.
  • An evaluation against Kimodo on a 256-testcase path-following protocol, where our system reaches 3.94 cm 2D root error versus Kimodo's 4.88 cm at roughly 10× lower latency.

2. Related Work

Recent text-to-motion literature has been dominated by diffusion models (MDM, MotionDiffuse, FLAME, ReMoDiffuse), which synthesise high-fidelity human movement by iteratively denoising continuous latent representations. A key advantage of the diffusion formulation is its amenability to complex spatial and temporal conditioning. Frameworks such as Kimodo, OmniControl and PriorMDM leverage this capability to introduce fine-grained kinematic constraints, including end-effector trajectories, root paths, and keyframe poses, typically via guided denoising or latent inpainting. However, their reliance on multiple reverse diffusion steps incurs high inference latency, severely limiting their deployment in real-time, dynamic environments.

To overcome this latency bottleneck, a parallel line of research reformulates motion generation as a discrete sequence prediction task. T2M-GPT pioneered this approach by compressing continuous 3D motion into a discrete vocabulary via a VQ-VAE and using a causal Transformer to autoregressively predict motion tokens from text. This paradigm has inspired numerous variants, including MotionGPT for unified multimodal tasks and T2M-HiFiGPT for enhanced artifact reduction. Notably, while successors like MoMask achieve superior generation fidelity by employing masked bidirectional modelling, this non-causal iterative refinement process breaks the strict left-to-right generation required for online, streaming applications. Consequently, we select the strictly causal T2M-GPT architecture as our baseline and extend it with an explicit, decoupled trajectory controller.

3. Method

At its core, text-to-motion predicts the 3D position of each joint of a humanoid skeleton, frame by frame, from a natural-language description. Our approach extends the T2M-GPT baseline with one key addition: an explicit path controller that lets the model follow spatial constraints, instead of leaving the trajectory entirely up to the language model. The pipeline consists of four blocks, summarised in Figure 1, that we describe in detail below.

Figure 1. C-T2M decoupled architecture. A VQ-VAE is pretrained to encode motion into a discrete codebook. The caption is encoded by CLIP and the path waypoints by a small linear projection. A GRU path controller predicts the next (x, y, heading) at each frame. A Transformer decoder autoregressively predicts motion tokens, conditioned on the caption and on the controller's injected path. The predicted tokens are decoded back to motion through the VQ-VAE decoder, and the controller's trajectory is then blended onto the result by a deterministic recompose step.

3.1. VQ-VAE Motion Codec

The body decoder is a temporal vector-quantised autoencoder (VQ-VAE). The encoder maps a window of raw motion features into a sequence of discrete tokens drawn from a learned codebook; the decoder reconstructs continuous motion from those tokens. The autoregressive language model is then trained to predict these tokens, so "predicting motion" is reduced to "predicting codebook indices". We use a codebook of 2048 entries, a 2× temporal downsampling rate, and a window size of 64 frames. The choice of compression rate is studied in Section 4.4 and on the Ablations page; 4× converges to the lowest reconstruction loss and the highest per-clip code diversity, while 2× retains a denser token stream that the autoregressive model benefits from.
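The quantisation step at the heart of the codec can be sketched as a nearest-neighbour lookup. This is a minimal illustration, not the trained model: the latent width of 16, the random codebook, and the function name quantise are assumptions, while the 2048 entries and the 64-frame window at 2× downsampling (32 latents) follow the text.

```python
import numpy as np

def quantise(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    # Squared Euclidean distance between every latent and every code: (T, K).
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = d2.argmin(axis=1)
    return indices, codebook[indices]

rng = np.random.default_rng(0)
codebook = rng.normal(size=(2048, 16))  # 2048 entries; width 16 is hypothetical
latents = rng.normal(size=(32, 16))     # 64 frames at 2x downsampling -> 32 latents
tokens, quantised = quantise(latents, codebook)
print(tokens.shape, quantised.shape)
```

The integer array tokens is what the autoregressive model would be trained to predict; the decoder only ever sees the looked-up vectors.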

3.2. Residual VQ-VAE (RVQ-VAE)

We additionally explore a residual VQ-VAE in which quantisation is performed at K hierarchical levels: at each level, the residual error left by the previous level is itself quantised by a dedicated codebook. The body language model is then adapted to predict K token streams per frame. We use K = 4 levels with 1024 codebook entries per level, trained with a cosine-weighted reconstruction loss that emphasises higher-frequency details at deeper levels. On the codec reconstruction benchmark this lifts R-Precision by roughly 9 points on the Overview split and 12 points on the Timeline-multi split, halves the FID, and shaves about 6 cm/s off the foot-skating; full numbers are reported in the codec reconstruction ablation.
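Residual quantisation can be sketched as a loop in which each level quantises what the previous levels left unexplained. The snippet below is a sketch under assumptions: the codebooks are random rather than learned, the latent width of 8 and the per-level scale are hypothetical, and only K = 4 and 1024 entries per level come from the text.

```python
import numpy as np

def residual_quantise(x, codebooks):
    """Quantise x level by level; each level encodes the remaining residual."""
    residual = x.copy()
    recon = np.zeros_like(x)
    indices = []
    for cb in codebooks:
        d2 = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(axis=-1)
        idx = d2.argmin(axis=1)
        recon += cb[idx]       # accumulate the reconstruction
        residual -= cb[idx]    # pass what is left to the next level
        indices.append(idx)
    # One token per level and time step: shape (T, K).
    return np.stack(indices, axis=1), recon, residual

rng = np.random.default_rng(0)
# K = 4 levels of 1024 entries; shrinking the scale per level is an assumption.
codebooks = [rng.normal(scale=0.5 ** k, size=(1024, 8)) for k in range(4)]
x = rng.normal(size=(32, 8))
tokens, recon, residual = residual_quantise(x, codebooks)
print(tokens.shape)
```

The K columns of tokens correspond to the K token streams that the body language model has to predict per time step.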

3.3. Text and Path Conditioning

The caption is embedded by a frozen CLIP text encoder (ViT-B/32) into a single 512-dimensional vector. Path constraints are encoded by a small linear projection that takes, for each frame, the next four waypoints (x, y, t) expressed relative to the skeleton's current position, plus its absolute position and current frame index. This compact representation is shared between the trajectory controller and the body GPT.
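The per-frame path feature described above can be sketched as follows, before the linear projection is applied. Several details are assumptions made for illustration: the time component t is taken relative to the current frame, the last waypoint is repeated when fewer than four remain, and the function name path_features is hypothetical.

```python
import numpy as np

def path_features(waypoints, position, frame, n_next=4):
    """Build the raw path feature for one frame.

    waypoints: (N, 3) array of (x, y, t) rows sorted by t.
    position:  current root position (x, y).
    frame:     current frame index.
    """
    upcoming = waypoints[waypoints[:, 2] > frame]
    if len(upcoming) == 0:
        upcoming = waypoints[-1:]          # past the end: keep the final goal
    pad = n_next - len(upcoming)
    if pad > 0:                            # assumption: pad by repeating the last waypoint
        upcoming = np.vstack([upcoming, np.repeat(upcoming[-1:], pad, axis=0)])
    upcoming = upcoming[:n_next].astype(float)
    # Express the waypoints relative to the current position and frame.
    rel = upcoming - np.array([position[0], position[1], frame], dtype=float)
    return np.concatenate([rel.ravel(), np.asarray(position, dtype=float), [float(frame)]])

waypoints = np.array([[1.0, 0.0, 10.0], [2.0, 0.0, 20.0], [2.0, 1.0, 30.0]])
feats = path_features(waypoints, position=(0.5, 0.0), frame=5)
print(feats.shape)  # 4 waypoints * 3 + 2 + 1 = 15 values
```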

3.4. Trajectory Controller

The trajectory controller is a closed-loop GRU (about 2 M parameters) that takes the caption embedding and path conditioning features and predicts, for each frame of the rollout, the next root (x, y) position and a heading angle. It is trained to minimise the regression loss to the ground-truth waypoint trajectory, plus a heading-smoothness regulariser that removes per-frame jitter and an arrive-radius gating that prevents orbiting around the final waypoint. The heading parameterisation is travel-relative (turn rate w.r.t. the direction of motion) rather than absolute world-frame yaw; on still frames, where the travel direction is ill-defined, we add an explicit target-zero loss on the predicted turn rate, which fixes the spin-on-spot pathology that arises with the naive travel-relative formulation. See the heading representation ablation.
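One possible reading of the travel-relative heading target with a still-frame override is sketched below. It is an illustration only: the turn rate is taken as the wrapped change in travel direction between consecutive steps, the stillness threshold still_eps is hypothetical, and the squared-error form of the loss is an assumption rather than the loss used in training.

```python
import numpy as np

def wrap_angle(a):
    """Wrap angles to the interval [-pi, pi)."""
    return (a + np.pi) % (2 * np.pi) - np.pi

def turn_rate_targets(xy, still_eps=1e-3):
    """Per-step turn-rate targets from a root path xy of shape (T, 2)."""
    vel = np.diff(xy, axis=0)                     # (T-1, 2) displacements
    speed = np.linalg.norm(vel, axis=1)
    travel = np.arctan2(vel[:, 1], vel[:, 0])     # direction of motion
    turn = wrap_angle(np.diff(travel))            # (T-2,) change of direction
    # A step counts as moving only if both neighbouring displacements are non-trivial.
    moving = (speed[1:] > still_eps) & (speed[:-1] > still_eps)
    # Still frames get an explicit zero target instead of an ill-defined angle.
    return np.where(moving, turn, 0.0), moving

def heading_loss(pred_turn, xy):
    target, _ = turn_rate_targets(xy)
    return float(np.mean(wrap_angle(pred_turn - target) ** 2))

# Walk right, pause for one frame, then walk up.
path = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 1.0]])
targets, moving = turn_rate_targets(path)
print(targets, moving)
```

In the example the pause would otherwise produce a spurious quarter-turn target; with the mask both targets are zero, which is the behaviour the still-frame term is meant to encourage.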

3.5. Body GPT

The body GPT is a causal Transformer decoder over the VQ-VAE token vocabulary. At each step it predicts the next motion token conditioned on (a) the CLIP caption embedding, (b) the path conditioning features at the current frame, and (c) the previously predicted tokens. We use a 9-layer decoder with embedding dimension 1024, 16 attention heads, a block size of 51 tokens, and a feed-forward expansion factor of 4. Because the trajectory is provided to the GPT through the path conditioning, the GPT is no longer in charge of global position: it is conditioned on it, and instead focuses on producing motion that is consistent with the chosen trajectory. Trajectory comes from the controller, naturalness from the GPT.

At inference time we sample tokens with a sliding window of size SW = 25. We use plain multinomial sampling rather than greedy, top-p, or top-k; the decoder ablation shows that multinomial is the only sampler whose body motion does not collapse to a held pose on a non-trivial fraction of captions.
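The sampling loop can be sketched as below, assuming a callable that returns next-token logits for a given context. The function toy_logits is a hypothetical stand-in for the body GPT, and caption and path conditioning are omitted; only the window of 25 tokens, the vocabulary of 2048, and the use of plain multinomial sampling follow the text.

```python
import numpy as np

def sample_multinomial(logits, rng):
    """Draw one token from softmax(logits) with no truncation or temperature."""
    z = logits - logits.max()          # subtract the max for numerical stability
    p = np.exp(z)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

def generate(next_logits, n_tokens, window=25, seed=0):
    """Autoregressive rollout that only feeds the last `window` tokens back in."""
    rng = np.random.default_rng(seed)
    tokens = []
    for _ in range(n_tokens):
        context = tokens[-window:]     # sliding window over the generated tokens
        tokens.append(sample_multinomial(next_logits(context), rng))
    return tokens

def toy_logits(context, vocab=2048):
    """Hypothetical model: mildly prefers the successor of the last token."""
    logits = np.zeros(vocab)
    if context:
        logits[(context[-1] + 1) % vocab] = 2.0
    return logits

motion_tokens = generate(toy_logits, n_tokens=100)
print(len(motion_tokens))
```

Replacing sample_multinomial with an argmax over the logits gives the greedy variant that the ablation compares against.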

3.6. Deterministic Recompose

Finally, a deterministic recompose step blends the body GPT's predicted motion onto the controller's trajectory. The GPT outputs a locally-coherent gait; the controller produces a globally-correct root trajectory. The recompose step warps the body's local frames so that the root XZ position matches the controller's rollout exactly, while leaving the limbs, heading, and vertical motion untouched. This keeps the body GPT decoupled from the global path: it does not need to memorise long-range path-following, the controller does. Because the recompose step is closed form, it adds negligible latency.
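A minimal sketch of such a recompose step, assuming it reduces to a per-frame horizontal translation of the whole skeleton. The joint layout (Y up, root at index 0) and the function name recompose are assumptions for illustration, not the exact procedure used in the pipeline.

```python
import numpy as np

def recompose(joints, target_xz, root=0):
    """Shift every frame horizontally so the root follows target_xz.

    joints:    (T, J, 3) joint positions, Y up (assumed layout).
    target_xz: (T, 2) root path from the controller.
    """
    out = joints.copy()
    # Horizontal offset between the desired and the generated root, per frame.
    offset = target_xz - joints[:, root][:, [0, 2]]
    out[:, :, 0] += offset[:, None, 0]
    out[:, :, 2] += offset[:, None, 1]
    return out                          # Y and root-relative pose are unchanged

rng = np.random.default_rng(0)
body = rng.normal(size=(10, 5, 3))      # toy clip: 10 frames, 5 joints
path = rng.normal(size=(10, 2))
moved = recompose(body, path)
print(np.allclose(moved[:, 0][:, [0, 2]], path))
```

Because the operation is a handful of array additions, its cost is independent of the generative models, which is consistent with the negligible latency noted above.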

4. Experiments

We evaluate C-T2M against Kimodo on the BONES benchmark in two settings: standard text-to-motion (Section 4.2) and constraint following (Section 4.3). We then summarise the key ablations (Section 4.4), with full results on the Ablations page.

4.1. Setup

We train on the BONES dataset (full 127k clips). All quantitative results in this section use the BONES Repetition Text-to-Motion split (6,539 testcases), with held-out captions. We run stages 2-5 of the Kimodo benchmark pipeline (token decoding, motion synthesis, TMR text-encoder embedding, R-Precision and FID computation, and foot-skate scoring), so that all systems are evaluated by exactly the same downstream code. Latency is measured end-to-end on a single A100 80 GB GPU at fp32, batch size 1, 200-frame clip, averaged over 50 runs with 10 warm-ups.
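The latency protocol (10 warm-up calls followed by 50 timed calls) can be sketched with a small timing harness. This is an illustration rather than the benchmark script: generate_clip is a hypothetical placeholder, and on a GPU the callable would have to block until the device has finished its work for the timings to be meaningful.

```python
import statistics
import time

def measure_latency_ms(fn, warmup=10, runs=50):
    """Return mean and standard deviation of fn's wall-clock time in milliseconds."""
    for _ in range(warmup):
        fn()                            # warm-up calls are not timed
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.mean(samples), statistics.stdev(samples)

def generate_clip():
    """Hypothetical stand-in for one 200-frame generation call."""
    return sum(i * i for i in range(10_000))

mean_ms, std_ms = measure_latency_ms(generate_clip)
print(f"{mean_ms:.3f} ms +/- {std_ms:.3f} ms")
```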

4.2. Main Results (Text-to-Motion vs. Kimodo)

Table 1 reports the headline comparison on the Repetition Text-to-Motion split. Our 9L single-VQ system reaches R@3 = 58.66 on the Overview split while running roughly 10× faster than Kimodo (413 ms vs. 4280 ms per clip) with roughly 20× fewer effective parameters (395 M vs. 8.3 B). We accept a quality deficit on this generic-text benchmark (R@3 of 58.66 vs. 88.07 for Kimodo), driven by a much narrower training vocabulary (walk-and-jog subset for our baseline ablations; full BONES for the headline system) and by a single autoregressive pass per clip instead of 100 denoising steps. The foot-skate gap reflects the codec ceiling on frame-level discrete codecs; a Kimodo-style foot-contact re-projection step would close most of this gap and is left to future work.

Table 1. Main results on the BONES Repetition Text-to-Motion split (Overview, 2380 testcases). Higher is better for R@3 and Contact; lower for FID, Skate, and latency. See Home / Experiments for the full three-split table including Timeline-single and Timeline-multi.
Model | Params (M) | Latency (ms/clip) ↓ | R@3 ↑ | FID ↓ | Skate ↓ | Cont ↑
Ground Truth | - | - | 94.03 | 0.000 | 2.11 | 1.000
Kimodo-SOMA-SEED-v1.1 | 8300 | 4280 | 88.07 | 0.007 | 3.74 | 0.978
Ours (decoupled, 9L sVQ) | 395 | 413 | 58.66 | 0.191 | 33.07 | 0.617
Ours (RVQ-GPT) | 565 | 486 | 18.74 | 0.644 | 39.24 | 0.481
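For reference, R@3 in Table 1 is a top-3 retrieval recall between captions and motions. The sketch below shows the general idea under assumptions: text and motion embeddings are matched by row index, similarity is cosine, and retrieval runs over a single pool. The benchmark's actual encoder (TMR) and pool construction are not reproduced here.

```python
import numpy as np

def r_precision(text_emb, motion_emb, k=3):
    """Fraction of captions whose paired motion is among the k most similar motions."""
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    m = motion_emb / np.linalg.norm(motion_emb, axis=1, keepdims=True)
    sim = t @ m.T                                   # (N, N) cosine similarities
    topk = np.argsort(-sim, axis=1)[:, :k]          # k best motion indices per caption
    hits = (topk == np.arange(len(sim))[:, None]).any(axis=1)
    return float(hits.mean())

rng = np.random.default_rng(0)
text = rng.normal(size=(64, 32))
motion = text + 0.1 * rng.normal(size=(64, 32))     # toy embeddings close to their captions
print(r_precision(text, motion))
```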

4.3. Constraint Following vs. Kimodo

On Kimodo's dedicated path-following split (content/constraints_withtext/root/path_2dpos, 256 testcases), C-T2M reaches a 2D root error of 3.94 cm, slightly below Kimodo's 4.88 cm on the same testcases under the same evaluator. Root-accuracy (fraction of frames within 10 cm of the target path) is 95.7 % for ours versus 93.4 % for Kimodo. Foot-skate and contact consistency favour Kimodo, consistent with the codec ceiling discussed in Section 4.2.

Table 2. Constraint-following head-to-head vs. Kimodo on the 256-testcase root XZ subset. All values from a single shared evaluator.
Model | 2D root error (cm) ↓ | Acc ≤ 10 cm ↑ | Foot skate ↓ | Contact ↑
Ground truth (lower bound) | 3.76 | 1.000 | 0.103 | 1.000
Kimodo-SOMA-SEED-v1.1 (DDIM 100) | 4.88 | 0.934 | 0.117 | 0.966
Ours (decoupled) | 3.94 | 0.957 | 0.583 | 0.582
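The two root-path columns of Table 2 can be illustrated with a short sketch. It assumes positions in metres and a frame-by-frame comparison against the target path; the shared evaluator may define the correspondence between frames and path points differently, and the function name root_path_metrics is hypothetical.

```python
import numpy as np

def root_path_metrics(pred_xz, target_xz, threshold_m=0.10):
    """Mean 2D root error in cm and fraction of frames within the threshold."""
    dist = np.linalg.norm(pred_xz - target_xz, axis=-1)   # per-frame distance in metres
    return {
        "root_error_cm": float(dist.mean() * 100.0),
        "acc": float((dist <= threshold_m).mean()),
    }

target = np.zeros((4, 2))
pred = np.array([[0.0, 0.0], [0.03, 0.04], [0.3, 0.4], [0.3, 0.4]])
print(root_path_metrics(pred, target))
```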

4.4. Ablations

The full set of ablation tables and qualitative comparisons lives on the Ablations page. The key findings are:

  • Constraint-conditioning architecture. Our decoupled controller-and-recompose design dominates four end-to-end variants (static cross-attention, per-position relative cross-attention, prepend constraint tokens, body-GPT-without-constraint) on every constraint-following and quality metric, achieving roughly 2.5-4× lower root error and 2-3× higher motion-to-motion retrieval recall in a neutral encoder space.
  • Heading representation. A travel-relative heading loss with a still-frame target halves the all-frame heading error compared to the naive travel-relative formulation, at a modest cost in waypoint accuracy. This fixes the spin-on-spot pathology on non-locomotion captions.
  • VQ-VAE temporal compression. 4× downsampling converges fastest and to the lowest reconstruction loss with the highest unique-codes-per-clip; 16× collapses on reconstruction.
  • Codec reconstruction. RVQ-VAE retains roughly 88 % of ground-truth R@3 on the Overview split and 87 % on Timeline-multi, halves FID, and reduces foot-skate compared to a single VQ-VAE baseline.
  • Body decoder sampler. Multinomial sampling dominates greedy, top-p, top-k, and top-p-with-repetition-penalty on every motion-quality metric and is the only sampler that never collapses to a held pose.

4.5. Qualitative Results

Beyond the quantitative tables, we report a representative set of qualitative samples. All clips are rendered with pyrender on top of the SOMA skeleton; foot contacts are post-processed with the Kimodo motion-correction solver. Captions are held out from training.

4.5.1 Free text-to-motion samples

A grid of free-generation samples from our body GPT (single-VQ codebook 2048, sliding window 25, multinomial sampling). Each clip is produced from the caption alone; no path constraint is given.

“a person executes a front flip with a smooth landing”

“high jump over a moving pole in one place”

“lively charleston dance with a forward leg kick”

“a neutral throw and release of a ball”

Figure 2. Free text-to-motion samples on held-out captions.

4.5.2 Model comparison: Kimodo vs. RVQ-GPT vs. VQ-GPT

Side-by-side comparison of three text-to-motion models on the same held-out captions. Kimodo is a diffusion-based baseline (state-of-the-art quality, slow); VQ-GPT is our single-codebook autoregressive model; RVQ-GPT is our residual-quantisation variant.

Figure 3. Example A. RVQ-GPT (left) | VQ-GPT (center) | Kimodo (right). All three systems generate motion from the same caption; Kimodo produces the smoothest result while our two systems trade quality for an order-of-magnitude latency reduction.
Figure 4. Example B. Same layout as Figure 3.

4.5.3 Constraint following

Walking along several user-specified XZ paths. The GRU controller rolls out the trajectory; the body GPT supplies the gait; the recompose step stitches them together. The same body GPT handles sparse waypoints, dense trajectories, and curves without retraining.

Dense spiral path.

Wavy dense trajectory.

Sparse waypoints, right arc.

Single diagonal waypoint.

Figure 5. Path-following samples with varying constraint densities.

4.5.4 Online adaptation

Because trajectory and body are decoupled, the path can be edited mid-rollout without restarting generation. The controller absorbs the new waypoints immediately and the body GPT continues synthesising motion along the updated trajectory.

Manual path update mid-rollout

User redraws the path mid-rollout.

A* planned path around obstacles.

Figure 6. Online adaptation: the controller responds to waypoint edits in real time without re-running the body generator.

5. Conclusion & Limitations

We presented C-T2M, a decoupled autoregressive text-to-motion pipeline that follows user-specified root XZ paths. By delegating trajectory control to a small closed-loop GRU and letting a caption-driven body GPT focus on plausible gait, our method matches or slightly improves on Kimodo's root-path accuracy (3.94 cm vs. 4.88 cm 2D root error) with an order of magnitude lower latency and roughly twenty times fewer effective parameters, although foot-skate and contact metrics still favour Kimodo. Path and gait turn out to be separable problems, and the decoupled solution is cheap, modular, and faster than diffusion-based alternatives, making it suitable for online and interactive settings.

Limitations and future work

Our body decoder inherits the foot-skating ceiling of frame-level discrete codecs, which is the main remaining gap to Kimodo on the foot-contact suite. A Kimodo-style foot re-projection post-processing step would close most of that gap and is left to future work. The recompose step also assumes a planar floor and a single root XZ trajectory, so it does not yet generalise to end-effector or full-body keyframe constraints. Finally, the trajectory controller currently accounts for almost half of the end-to-end latency; a learned feed-forward controller could replace the closed-loop GRU and remove the largest single latency block.

6. Individual Contributions

  • Maria Pilligua — TODO: list contributions (e.g. path controller, training, benchmarks).
  • Pau Amargant — TODO: list contributions (e.g. VQ-VAE / RVQ-VAE codec).
  • Miquel Lopez — TODO: list contributions (e.g. body GPT, decoder ablations).
  • Nahush Rajesh Kolhe — TODO: list contributions (e.g. Kimodo baseline, evaluation pipeline).

7. References

  1. J. Zhang et al. T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations. CVPR, 2023. arXiv:2301.06052
  2. C. Guo et al. MoMask: Generative Masked Modeling of 3D Human Motions. CVPR, 2024. arXiv:2312.00063
  3. G. Tevet et al. Human Motion Diffusion Model. ICLR, 2023.
  4. M. Zhang et al. MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model. 2022. arXiv:2208.15001
  5. B. Jiang et al. MotionGPT: Human Motion as a Foreign Language. NeurIPS, 2023. arXiv:2306.14795
  6. Y. Xie et al. OmniControl: Control Any Joint at Any Time for Human Motion Generation.
  7. Y. Shafir et al. Human Motion Diffusion as a Generative Prior.
  8. C. Guo et al. Generating Diverse and Natural 3D Human Motions from Text. CVPR, 2022. (HumanML3D dataset.)
  9. NVIDIA. Kimodo: Foundation Model for Human Motion. Tech report, 2026.
  10. A. Radford et al. Learning Transferable Visual Models from Natural Language Supervision. ICML, 2021. (CLIP.) arXiv:2103.00020

BibTeX

@misc{ct2m2026,
  title  = {Controllable Autoregressive Text-to-Motion Generation},
  author = {Amargant, Pau and Lopez, Miquel and Pilligua, Maria and Kolhe, Nahush Rajesh},
  year   = {2026},
  note   = {CS-503 Project, EPFL},
}