C-T2M: Controllable Autoregressive Text-to-Motion Generation

Maria Pilligua1, 2 (444123), Pau Amargant1 (408221), Miquel Lopez1 (415700), Nahush Rajesh Kolhe1 (407562)
1EPFL 2CVLab

Generating realistic human motion from natural language has improved remarkably, but most state-of-the-art systems share a limitation: they produce an entire clip in one shot. If the user wants the character to follow a precise spatial path, or update that path mid-motion, the system has to recompute everything from scratch. That is too slow for interactive use, and too rigid for applications like robotics, digital twins, or game characters that need to react to a changing environment.

We present C-T2M, a decoupled approach that splits the problem into two independent modules: a small trajectory controller that rolls out an XZ root path from user-specified waypoints, and a caption-driven body GPT that synthesizes natural body motion along that path. A deterministic recompose step then blends them. Because the two modules are independent, the controller can react to new constraints instantly without re-running the body generator, the body GPT stays simple and fast, and the whole pipeline runs in 413 ms per 200-frame clip: an order of magnitude faster than Kimodo, the state-of-the-art diffusion baseline, while following constraints with comparable accuracy (3.94 cm vs. 4.88 cm root error).

Key Capabilities of C-T2M

Three things our system does that motivate the design.

Text-to-Motion Generation

Introduction

Recent advances in text-to-motion (T2M) generation have enabled models to synthesize realistic human motion from natural language descriptions. Diffusion-based approaches such as Kimodo have further improved motion quality and introduced flexible control through kinematic constraints, including pose, joint, and trajectory specifications. Notably, Kimodo is trained on the BONES dataset, a large-scale motion capture dataset around ten times larger than commonly used benchmarks such as HumanML3D, enabling higher-quality and more diverse motion generation.

However, Kimodo generates the full motion sequence in one shot, which makes it less suitable for settings where the motion should be updated online as new constraints or environment changes appear during execution. Motivated by this limitation, we explore an alternative approach based on latent autoregressive (AR) modeling, where motion is generated sequentially and can therefore be controlled more dynamically through kinematic constraints during inference.

Problem statement

In this project we design a controllable autoregressive text-to-motion model trained on the BONES dataset. Our goals are:

  • Incorporate explicit root XZ path constraints (waypoints or dense trajectories) into the model to guide the generated motion along a user-specified ground path.
  • Enable sequential motion generation that can be updated during inference, allowing dynamic and controllable behaviour while reducing inference time.

Significance

Real-time and controllable human motion generation is essential for applications such as robotics, digital twin simulation, and interactive systems. These require models that can adapt to new goals and environmental changes.

Related Work

Recent text-to-motion literature has been dominated by diffusion models (MDM, MotionDiffuse, FLAME, ReMoDiffuse), which synthesize high-fidelity human movements by iteratively denoising continuous latent representations. A key advantage of the diffusion formulation is its amenability to complex spatial and temporal conditioning. Frameworks such as Kimodo, OmniControl and PriorMDM leverage this capability to introduce fine-grained kinematic constraints, including end-effector trajectories, root paths, and keyframe poses, typically via guided denoising or latent inpainting. However, their reliance on multiple reverse diffusion steps incurs high inference latency, severely limiting their deployment in real-time, dynamic environments.

To overcome this latency bottleneck, a parallel line of research reformulates motion generation as a discrete sequence prediction task. T2M-GPT pioneered this approach by compressing continuous 3D motion into a discrete vocabulary via a VQ-VAE and utilizing a causal Transformer to autoregressively predict motion tokens from text. This paradigm inspired numerous variants, including MotionGPT for unified multimodal tasks and T2M-HiFiGPT for enhanced artifact reduction. Notably, while successors like MoMask achieve superior generation fidelity by employing masked bidirectional modeling, this non-causal iterative refinement process breaks the strict left-to-right generation required for online, streaming applications. Consequently, we select the strictly causal T2M-GPT architecture as our baseline.

Method

At its core, text-to-motion predicts the 3D position of each joint of a humanoid skeleton, frame by frame, from a natural-language description. Our approach extends the T2M-GPT baseline with one key addition: an explicit path controller that lets the model follow spatial constraints, instead of leaving the trajectory entirely up to the language model.

The pipeline is composed of four blocks:

  • (i) A VQ-VAE, pretrained as an autoencoder to map raw motion into a discrete codebook and decode it back, so that "predicting motion" reduces to predicting codebook tokens.
  • (ii) A conditioning module that encodes the textual prompt with CLIP and the path constraints with a small linear projection. The projection takes the next four waypoints (x, y, t) relative to the skeleton's current position, plus its absolute position and current frame index.
  • (iii) A path controller: a closed-loop GRU that takes the caption and path constraints and predicts the next (x, y, heading) for each frame, making it solely responsible for where the skeleton goes.
  • (iv) A Transformer decoder (body GPT) that autoregressively predicts the next motion token conditioned on the CLIP embedding, with the controller's trajectory injected at generation time.

The GPT is no longer in charge of global position (it is conditioned on it), so it focuses on producing motion coherent with the chosen trajectory. Trajectory comes from the controller, naturalness from the GPT.
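To make the path-conditioning input concrete, the sketch below builds the raw feature vector that a linear projection could consume: the next four waypoints relative to the current root state, followed by the absolute position and the frame index. The function name, the padding rule when fewer than four waypoints remain, and the choice to express t relative to the current frame are our assumptions for illustration, not details confirmed by the pipeline.

```python
from typing import List, Sequence, Tuple

Waypoint = Tuple[float, float, int]  # (x, y, target frame index)


def constraint_features(
    waypoints: Sequence[Waypoint],
    position: Tuple[float, float],
    frame: int,
    horizon: int = 4,
) -> List[float]:
    """Flatten the next `horizon` waypoints relative to the current state,
    then append the absolute position and the current frame index."""
    px, py = position
    upcoming = [w for w in waypoints if w[2] >= frame][:horizon]
    if not upcoming:
        # Assumption: with no waypoint left, target the current state itself.
        upcoming = [(px, py, frame)]
    while len(upcoming) < horizon:
        # Assumption: pad by repeating the last remaining waypoint.
        upcoming.append(upcoming[-1])

    feats: List[float] = []
    for x, y, t in upcoming:
        feats.extend([x - px, y - py, float(t - frame)])
    feats.extend([px, py, float(frame)])
    return feats


feats = constraint_features(
    waypoints=[(1.0, 0.0, 10), (2.0, 1.0, 20)],
    position=(0.5, 0.0),
    frame=5,
)
print(len(feats))  # 4 waypoints * 3 values + 3 absolute values
```

Under these assumptions the vector has 15 entries; a learned projection would map it to the model's embedding width.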

C-T2M architecture: a VQ-VAE is pretrained to encode motion into a discrete codebook; a CLIP text encoder and a linear projection on path waypoints feed both a GRU path controller (predicting xy-heading per frame) and a Transformer decoder that autoregressively predicts motion tokens, with the controller's path injected into the GPT's stream; the predicted tokens are decoded back to motion through the VQ-VAE decoder.
Figure 5. C-T2M decoupled architecture — VQ-VAE codec, text/path conditioning, GRU path controller and a Transformer token decoder.

VQ-VAE pretraining learns a discrete motion codebook. Conditioning encodes the caption with CLIP and the path waypoints with a linear projection. The path controller (GRU) predicts the next (x, y, heading) for each frame from the caption and constraints. The GPT generation block (Transformer decoder) autoregressively predicts discrete motion tokens conditioned on text and on the controller's injected path. Finally, the VQ-VAE decoder maps tokens back to motion.

Residual VQ-VAE (RVQ-VAE)

TODO: explain the RVQ-VAE architecture. Describe how it differs from the single VQ-VAE: K levels of residual quantisation, cosine-weighted loss, 1024 codebook entries per level. Explain how the residual structure progressively refines the reconstruction (each level encodes the error left by the previous), why this improves R-Precision and FID on the codec reconstruction ablation, and how the GPT decoder is adapted to predict the K-level token streams. Add a small RVQ-VAE diagram alongside this paragraph.
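As a placeholder illustration of the residual structure described in the note above, the sketch below quantises one vector with a stack of codebooks, where each level encodes the error left by the previous levels. The two tiny codebooks are toy values (the note above mentions 1024 entries per level), and the function names are ours, so this is a minimal sketch of generic residual quantisation rather than the project's codec.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], vec: Vector) -> int:
    """Index of the codebook entry closest to `vec` in squared distance."""
    def sqdist(entry: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda i: sqdist(codebook[i]))


def residual_quantize(
    vec: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[List[int], List[float]]:
    """Return one token per level and the summed reconstruction."""
    residual = list(vec)
    recon = [0.0] * len(vec)
    tokens = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        tokens.append(idx)
        recon = [r + c for r, c in zip(recon, codebook[idx])]
        # The next level only sees what is still unexplained.
        residual = [v - r for v, r in zip(vec, recon)]
    return tokens, recon


toy_codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],        # coarse level
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],  # refinement level
]
tokens, recon = residual_quantize([1.2, 0.8], toy_codebooks)
print(tokens, recon)
```

In this toy case the squared reconstruction error drops from 0.08 after the first level to 0.005 after the second, which is the progressive refinement the note refers to.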

Experiments

Setup

All experiments use the BONES-SEED motion capture dataset, resampled to 30 fps and converted to the 30-joint SOMA skeleton used by Kimodo's evaluation pipeline. Motion is represented in a local velocity space (frame-to-frame translation and heading changes, joint rotations and velocities expressed in the character local frame); 59 near-constant feature dimensions are dropped, yielding the 310-dimensional vectors used for all training. The final model is trained on the full curated split; ablations are run on a walk-and-jog subset of 22 726 clips for tractable iteration without changing the conclusions. Evaluation follows the Kimodo BONES protocol: text alignment is measured with FID and R-Precision (top-1, top-3) on the TMR-SOMA-RP-v1 embeddings, motion quality with foot-skating and contact consistency, and path following with the mean Euclidean error between the generated root XZ trajectory and the user-specified waypoints. Inference latency is measured end-to-end on a single A100 80GB in fp32 with batch size 1 over 50 measured runs after 10 warmups.
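Since the representation stores frame-to-frame translation and heading changes in the character's local frame, a world-frame root path has to be recovered by integration. The sketch below shows one way to do that. The axis convention (heading measured from +X toward +Z), the (forward, side) ordering and the choice to apply the translation before the heading update are assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def integrate_root(
    deltas: Sequence[Tuple[float, float, float]],
    start: Tuple[float, float] = (0.0, 0.0),
    heading: float = 0.0,
) -> List[Tuple[float, float, float]]:
    """Accumulate local (forward, side, heading change) deltas into a
    world-frame list of (x, z, heading) states."""
    x, z = start
    path = [(x, z, heading)]
    for forward, side, dheading in deltas:
        # Rotate the local translation into the world frame.
        x += forward * math.cos(heading) - side * math.sin(heading)
        z += forward * math.sin(heading) + side * math.cos(heading)
        heading += dheading
        path.append((x, z, heading))
    return path


# One step forward, a quarter turn, then one more step forward.
world = integrate_root([(1.0, 0.0, math.pi / 2), (1.0, 0.0, 0.0)])
print(world[-1])
```

A consequence of this representation is that small per-frame errors accumulate along the path, which is one reason to hand world-frame position to a dedicated controller.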

30-joint SOMA skeleton and constraint representation. Our pipeline controls motion through 2D root XZ waypoints (rightmost panel).
Figure 2. SOMA skeleton and constraint representation — we control motion through 2D root (XZ) waypoints.

We control motion through 2D root (XZ) waypoints (rightmost panel). End-effector and full-body keyframe constraints (middle panels) are shown for context only and are not targeted in our final pipeline.

Baselines

We compare our decoupled pipeline against two reference points chosen to bracket the design space. The first is caption-only T2M-GPT, the autoregressive backbone we build on, with no path conditioning at all. It defines a ceiling on motion quality and text alignment for our architecture while having zero path controllability, so any gap on text alignment isolates the cost of adding constraints without confounding it with a change of backbone. The second is Kimodo, the state-of-the-art diffusion-based controllable T2M model trained on the same BONES dataset. It provides an upper reference on motion quality and constraint following at the price of high inference latency, which lets us quantify the speed and controllability trade-off that motivates the autoregressive formulation. Where appropriate, we additionally consider two architectural ablations that inject constraints inside the T2M-GPT body itself (cross-attention with ControlNet-style zero-init, and constraint tokens prepended to the motion sequence); these are described in the next section and ablated in the constraint-conditioning architecture comparison in the Ablations section.

1. Text to motion

We first evaluate the unconstrained text-to-motion behaviour of our pipeline: given a caption alone (no path constraint), can the model generate plausible motion that matches the text? We report quantitative results on the Kimodo Repetition benchmark and show qualitative samples from held-out captions.

Quantitative results.

We compare our system against Kimodo on the Repetition Text-to-Motion split of the Kimodo BONES benchmark (6 539 testcases). Both systems are evaluated under the same evaluation pipeline (stages 2 to 5 of jobs/08_run_gpt_benchmark_pipeline.sh): we generate motions with each system, embed them with the TMR-SOMA-RP text encoder, and compute R-Precision, FID, foot-skating and contact consistency in the same space. Inference latency is measured end-to-end on a single A100 80GB, fp32, batch size 1, 200-frame clip, mean over 50 runs (10 warmups).
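The latency protocol above (10 warmups, 50 measured runs, mean and standard deviation) can be sketched with the standard library. The `benchmark` helper and the dummy workload are hypothetical stand-ins for the real generation call.

```python
import statistics
import time
from typing import Callable, List, Tuple


def benchmark(
    fn: Callable[[], object], warmups: int = 10, runs: int = 50
) -> Tuple[float, float, List[float]]:
    """Return (mean ms, standard deviation ms, per-run samples in ms)."""
    for _ in range(warmups):
        fn()  # discarded: caches, allocators and kernels settle here
    samples_ms = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - t0) * 1000.0)
    return statistics.mean(samples_ms), statistics.stdev(samples_ms), samples_ms


def dummy_workload() -> int:
    # Placeholder for "generate one 200-frame clip".
    return sum(range(10_000))


mean_ms, std_ms, samples = benchmark(dummy_workload)
print(f"{mean_ms:.3f} ± {std_ms:.3f} ms over {len(samples)} runs")
```

When the workload runs on a GPU, the callable also needs to wait for the device to finish before the timer is read (in PyTorch, for example, by calling `torch.cuda.synchronize()`); otherwise only the kernel launch time is measured.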

Our system is designed to operate under root XZ path constraints with streaming, autoregressive inference, a setting Kimodo does not target: Kimodo runs one diffusion process per clip with one hundred denoising steps and is not autoregressive, so a constraint introduced mid-sequence requires re-running the whole diffusion process. The headline numbers below should be read in that light. Our model is also roughly twenty times smaller than Kimodo in effective parameters and runs an order of magnitude faster on the same hardware. Kimodo's quality advantage on the generic text-to-motion benchmark comes from a much larger architecture, a richer training corpus (the full SOMA-SEED, roughly four times larger than the walk-jog-run subset we train on) and a hundred denoising steps per clip; we trade those for size, speed, and streaming controllability.

  • R@3 — text–motion retrieval: how often the right motion lands in the top 3 for its caption (higher = better text match).
  • FID — distance between generated and real motion distributions (lower = more realistic).
  • Skate — foot-skating: how much the feet slide while in contact with the ground (lower = cleaner footwork).
  • Cont — foot-contact consistency: how stable ground contacts are (higher = steadier).
  • 2D root error — average distance between the generated path and the target path, in cm (lower = follows the path better).
  • MedR — median retrieval rank of the correct motion (lower = better).
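For readers unfamiliar with the retrieval metrics in this list, the sketch below computes R@k and MedR from a caption-by-motion similarity matrix. The matrix values are made up, and details of the benchmark protocol such as the retrieval pool size are not modelled here.

```python
import statistics
from typing import List, Sequence


def retrieval_ranks(sim: Sequence[Sequence[float]]) -> List[int]:
    """sim[i][j] is the similarity of caption i to motion j; the correct
    motion for caption i is motion i. Rank 1 means it was retrieved first."""
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of motions scored strictly higher than the match.
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    return ranks


def r_at_k(ranks: Sequence[int], k: int) -> float:
    """Percentage of captions whose correct motion is within the top k."""
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)


def median_rank(ranks: Sequence[int]) -> float:
    return statistics.median(ranks)


toy_sim = [
    [0.9, 0.1, 0.2],  # correct motion ranked 1st
    [0.8, 0.5, 0.3],  # correct motion ranked 2nd
    [0.7, 0.6, 0.1],  # correct motion ranked 3rd
]
ranks = retrieval_ranks(toy_sim)
print(ranks, r_at_k(ranks, 1), r_at_k(ranks, 3), median_rank(ranks))
```

Both metrics depend on the embedding space that produces the similarities, which is why the choice of encoder matters for the comparisons reported later.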

Charts: R@3 ↑ (text match, %); FID ↓ (realism, lower is better); efficiency, Kimodo vs. ours (log scale, lower is better).

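Efficiency
model                                  params (M)   GFLOPs / clip ↓   latency (ms / clip) ↓
Ground Truth                           ---          ---               ---
Kimodo-SOMA-SEED-v1.1                  8 300        ~1500             4280
Ours (decoupled, 9L GPT, full BONES)   395          ~470              413
Ours (RVQ-GPT)                         565          ~640              486

Overview
model                                  R@3 ↑   FID ↓   Skate ↓   Cont ↑
Ground Truth                           94.03   0.000   2.11      1.000
Kimodo-SOMA-SEED-v1.1                  88.07   0.007   3.74      0.978
Ours (decoupled, 9L GPT, full BONES)   58.66   0.191   33.07     0.617
Ours (RVQ-GPT)                         18.74   0.644   39.24     0.481

Timeline single
model                                  R@3 ↑   FID ↓   Skate ↓   Cont ↑
Ground Truth                           90.04   0.000   2.04      1.000
Kimodo-SOMA-SEED-v1.1                  77.18   0.011   3.60      0.981
Ours (decoupled, 9L GPT, full BONES)   47.44   0.220   27.95     0.654
Ours (RVQ-GPT)                         22.18   0.411   34.93     0.507

Timeline multi
model                                  R@3 ↑   FID ↓   Skate ↓   Cont ↑
Ground Truth                           94.49   0.000   1.93      1.000
Kimodo-SOMA-SEED-v1.1                  88.65   0.010   3.35      0.980
Ours (decoupled, 9L GPT, full BONES)   37.04   0.307   27.20     0.664
Ours (RVQ-GPT)                         15.12   0.670   39.08     0.468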
Results. Text-to-motion quality and efficiency on the Kimodo Repetition split — we trade some text-match quality for a model roughly 20× smaller and 10× faster than Kimodo.

The Kimodo parameter count includes the frozen Llama-3-8B-Instruct text encoder that Kimodo loads at inference (about 8 000 M parameters); the trainable adapters, diffusion UNet and heads on top sum to 283 M parameters. We report the effective parameters loaded at inference because they determine the memory footprint and the floating point operations per clip.

R@3 is text-to-motion R-Precision at top-3 with the TMR-SOMA-RP encoder. FID and foot-skate are in the same space; contact (Cont) is foot-contact consistency against contacts inferred from foot height and velocity. Our model is roughly twenty times smaller in effective parameters, uses about three times fewer GFLOPs per clip, and runs roughly ten times faster than Kimodo on the same hardware. On this generic text-to-motion benchmark we accept a quality deficit driven by a narrower training vocabulary (walk, jog and run only) and a single autoregressive pass per clip instead of one hundred denoising steps. The foot-skate gap matches the codec ceiling reported in the codec reconstruction ablation: any frame-level discrete codec smooths high-frequency foot detail, and we do not currently apply Kimodo-style foot re-projection on top of the body decoder.
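The size and speed ratios quoted above follow directly from the efficiency figures in the results table (the GFLOPs values are themselves approximate). A quick check:

```python
# Figures copied from the efficiency columns of the results table.
kimodo = {"params_m": 8300, "gflops": 1500, "latency_ms": 4280}
ours = {"params_m": 395, "gflops": 470, "latency_ms": 413}

# Ratio > 1 means Kimodo is larger or slower by that factor.
ratios = {key: kimodo[key] / ours[key] for key in kimodo}

for key, value in ratios.items():
    print(f"{key}: {value:.1f}x")
```

This gives roughly 21x for parameters, 3.2x for GFLOPs per clip and 10.4x for latency, consistent with the "twenty times", "three times" and "ten times" wording.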

Neutral encoder evaluation (TMR-petrovich).

To isolate encoder bias, we also evaluate on a neutral text-to-motion encoder (TMR-petrovich, HumanML3D-trained, DistilBERT text) rather than the Kimodo-specific TMR-SOMA-RP encoder used above. This eliminates the bias toward Kimodo's training distribution and provides a fairer comparison. Both systems are evaluated on HumanML3D using the same retrieval protocol.

model R@1 ↑ R@3 ↑ R@5 ↑ R@10 ↑ MedR ↓
HumanML3D test ceiling 15.3 34.7 43.4 57.1 7
Ground Truth (BONES) 4.6 11.1 16.5 25.4 48
Kimodo-SOMA-SEED-v1.1 5.4 11.6 16.8 27.5 39
Ours (9L VQ-VAE) 4.5 9.5 13.8 23.3 51
Figure. Text-to-motion retrieval on a neutral encoder (TMR-petrovich). Our model tracks the BONES ground truth and stays within ~2 points of Kimodo (R@3 9.5 vs. 11.6) — the large gap seen on Kimodo's own encoder is mostly encoder bias, not a quality difference.

R@k is text-to-motion R-Precision at top-k: the fraction of queries whose correct motion is retrieved within the top k (higher is better). MedR is the median rank of the correct motion (lower is better). For reference, the in-distribution HumanML3D test ceiling reaches R@1 15.3 / R@10 57.1; it is omitted from the chart axis so the three close-to-GT systems stay legible.

On the biased Kimodo-TMR encoder our system trails by ~30 points on R@3 (58 vs. 88). On this neutral encoder both systems reach similar GT-adjacent retrievability with only a 2-point gap (9.5 vs. 11.6), showing the large apparent quality gap is largely an artifact of encoder bias toward Kimodo's training distribution rather than a ground-truth quality difference.

Qualitative results.

Free-generation samples from our body GPT (single-VQ codebook 2048, sliding window 25, multinomial sampling) on captions held out from training. Each clip is produced from the caption alone; no path constraint is given. The character mesh and ground are rendered with pyrender on top of the SOMA skeleton; foot contacts are post-processed with the Kimodo motion-correction solver.

Model comparison: Kimodo, RVQ-GPT, and VQ-GPT

Qualitative comparison of three text-to-motion models on the same held-out captions. Kimodo is a diffusion-based baseline (state-of-the-art quality, slow); VQ-GPT is our single-codebook autoregressive baseline; RVQ-GPT is our final model with residual quantization. The RVQ codec provides measurable quality improvements, particularly in joint smoothness and foot contact stability.

Example 1

RVQ-GPT (left) | VQ-GPT (center) | Kimodo (right)

Example 2

RVQ-GPT (left) | VQ-GPT (center) | Kimodo (right)

2. Constraint following

We then evaluate the model under explicit path constraints: given a caption and a target XZ trajectory, how closely does the generated motion follow the trajectory while remaining faithful to the caption? We report qualitative samples on hand-drawn paths and quantitative results on Kimodo's path-following split.

Quantitative results.

We evaluate constraint following on the same testsuite split Kimodo publishes numbers for: content/constraints_withtext/root/path_2dpos, the path-following root-XZ constraint subset (256 testcases). Each testcase pairs a caption with a target root-XZ trajectory; the model must generate motion that matches the caption while keeping the root within a small tolerance of the specified path. We re-ran Kimodo-SOMA-SEED-v1.1 under our own pipeline on the same 256 testcases so the comparison is apples-to-apples (same evaluator, same testcases, same skeleton handling).
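The two path-following numbers reported for this split can be sketched as follows: a mean Euclidean error in centimetres and the fraction of frames within a tolerance. The sketch assumes coordinates in metres and a one-to-one, per-frame correspondence between generated and target root positions; how the benchmark aligns sparse waypoints with frames is not modelled here.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]  # root (x, z) in metres, by assumption


def root_path_metrics(
    generated: Sequence[Point],
    target: Sequence[Point],
    tolerance_cm: float = 10.0,
) -> Tuple[float, float]:
    """Return (mean 2D root error in cm, fraction of frames within tolerance)."""
    errors_cm = [
        100.0 * math.hypot(gx - tx, gz - tz)
        for (gx, gz), (tx, tz) in zip(generated, target)
    ]
    mean_error = sum(errors_cm) / len(errors_cm)
    accuracy = sum(e <= tolerance_cm for e in errors_cm) / len(errors_cm)
    return mean_error, accuracy


toy_generated = [(0.0, 0.0), (1.0, 0.03), (2.0, 0.20)]
toy_target = [(0.0, 0.0), (1.0, 0.00), (2.0, 0.00)]
mean_err, acc = root_path_metrics(toy_generated, toy_target)
print(f"{mean_err:.2f} cm, {acc:.2f} within 10 cm")
```

In the toy case the per-frame errors are 0, 3 and 20 cm, so the mean is about 7.67 cm and two of three frames fall inside the tolerance.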

Charts: 2D root error ↓ (cm); foot and path quality (0–1 scale).

model                                      2D root error (cm) ↓   root accuracy (≤ 10 cm) ↑   foot skate ratio ↓   foot contact consistency ↑
Ground truth (lower bound)                 3.76                   1.000                       0.103                1.000
Kimodo-SOMA-SEED-v1.1                      4.88                   0.934                       0.117                0.966
Ours (decoupled, ctrl_full_v2_stillzero)   3.94                   0.957                       0.583                0.582
Results. We beat Kimodo on path adherence (~19% lower 2D root error) but trade away foot-skate quality, the expected cost of decoupling the controller from the body GPT.

Our pipeline matches the ground-truth lower bound on root-2D adherence to within two millimetres and undercuts Kimodo by roughly one centimetre (a 19 percent relative error reduction), with a higher fraction of frames inside the ten-centimetre tolerance (95.7 percent versus 93.4 percent). Kimodo wins decisively on the foot-skate suite: a five-fold lower skate ratio and a 1.66 times higher contact consistency. This is the expected trade-off of decoupling: our explicit controller produces a clean world-frame path that the body GPT was not jointly trained to follow, so when the controller-imposed path disagrees with the body's predicted gait the feet slide. Kimodo's diffusion model jointly generates body and root, avoiding the disagreement; the same property is consistent with the codec-ceiling foot-skate reported in the codec reconstruction ablation. A foot-projection post-processing step would close most of this gap (Kimodo enables one by default) and is left to future work.
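The relative figures in this paragraph can be recomputed from the table. The percentage depends on the reference point: measured against Kimodo's error the reduction is about 19 percent, while Kimodo's error is about 24 percent higher than ours.

```python
# Values copied from the path-following table.
ours_cm, kimodo_cm, gt_cm = 3.94, 4.88, 3.76
ours_skate, kimodo_skate = 0.583, 0.117
ours_contact, kimodo_contact = 0.582, 0.966

gap_to_gt_mm = (ours_cm - gt_cm) * 10.0                     # cm -> mm
reduction_vs_kimodo = (kimodo_cm - ours_cm) / kimodo_cm     # relative to Kimodo
kimodo_excess_vs_ours = (kimodo_cm - ours_cm) / ours_cm     # relative to ours
skate_factor = ours_skate / kimodo_skate
contact_factor = kimodo_contact / ours_contact

print(f"gap to ground truth: {gap_to_gt_mm:.1f} mm")
print(f"error reduction vs. Kimodo: {100 * reduction_vs_kimodo:.1f} %")
print(f"Kimodo error above ours: {100 * kimodo_excess_vs_ours:.1f} %")
print(f"skate ratio factor: {skate_factor:.2f}x")
print(f"contact consistency factor: {contact_factor:.2f}x")
```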

Qualitative results.

Free-generation samples on caption “a person walks forward at a neutral pace” conditioned on the hand-drawn XZ paths shown on the ground. Same pipeline as Table 12 (decoupled, ctrl_full_v2_stillzero controller, single-VQ body GPT, sliding window 25, multinomial sampling). The character follows the path while the body GPT freely synthesises the gait.

3. Online adaptation

Beyond static text-to-motion and constraint following, our pipeline can adapt motion during generation. Because the model is autoregressive, we can interrupt the sequence at any point, keep the last few tokens as context, and reinfer from that point forward with a new condition: either a different text prompt, a new path constraint, or both. The model uses the last few generated tokens as a warm start, so the transition is smooth rather than an abrupt reset. This section first reports the per-block latency that makes online updates practical, then shows qualitative results of mid-sequence text and path injection.

Diagram of the cap-and-reinfer process: the sequence is cut at an arbitrary frame, the last tokens are kept as context, and a new condition (text or path) drives the remainder.
Figure 8. Online adaptation via cap-and-reinfer — cut the sequence, keep recent tokens as context, and reinfer the rest under a new condition.

At an arbitrary frame the sequence is cut and the last few generated tokens are kept as context. A new condition — a different text prompt, a new path constraint, or both — is then fed to the model, which reinfers the remainder of the sequence from that point. Because the warm-start context comes from the already-generated tokens, the transition is smooth rather than an abrupt reset.
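The cap-and-reinfer procedure can be sketched with a toy autoregressive model. `toy_next_token` is a deterministic stand-in for the body GPT, and the number of kept context tokens (`keep=4`) is an assumption; the window of 25 tokens, the 2048-entry codebook and the 51 tokens per clip are taken from the settings reported elsewhere on this page.

```python
from typing import Callable, List, Sequence

NextToken = Callable[[Sequence[int], str], int]


def generate(
    next_token: NextToken, condition: str, context: Sequence[int],
    n_new: int, window: int = 25,
) -> List[int]:
    """Autoregressively append n_new tokens, seeing at most `window` past tokens."""
    tokens = list(context)
    for _ in range(n_new):
        tokens.append(next_token(tokens[-window:], condition))
    return tokens[len(context):]


def cap_and_reinfer(
    next_token: NextToken, tokens: Sequence[int], cut: int,
    new_condition: str, total: int, keep: int = 4, window: int = 25,
) -> List[int]:
    """Cut the sequence at `cut`, keep the last `keep` tokens as warm-start
    context, and regenerate the remainder under `new_condition`."""
    kept = list(tokens[:cut])
    context = kept[-keep:]
    rest = generate(next_token, new_condition, context, total - cut, window)
    return kept + rest


def toy_next_token(context: Sequence[int], condition: str) -> int:
    # Stand-in model: the step size depends on the condition text.
    prev = context[-1] if context else 0
    return (prev + len(condition)) % 2048


first = generate(toy_next_token, "walk", [], 51)
updated = cap_and_reinfer(toy_next_token, first, cut=20,
                          new_condition="sit down", total=51)
print(first[18:22], updated[18:22])
```

Tokens before the cut are untouched, so only the remainder has to be recomputed; the first regenerated token continues from the last kept one, which is what makes the transition continuous in this toy setting.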



Per-block inference cost of the decoupled pipeline

Single A100 80GB, fp32, batch size 1, clip length 200 frames, mean ± standard deviation over 50 measured runs after 10 warmups. End-to-end latency is 413 ms per clip (2.4 clips per second). The autoregressive body sampler and the trajectory controller together account for 96 % of the cost; the VQ decoder and the CPU recompose step are effectively free.

block                                                          ms / clip        % of total
CLIP caption encoder                                           7.40 ± 0.24      1.79
Trajectory controller (GRU, 2M parameters)                     179.73 ± 4.45    43.47
Body GPT autoregressive sampling (51 tokens, 30M parameters)   217.00 ± 1.70    52.49
VQ decoder                                                     1.51 ± 0.08      0.37
Recompose, time-warp and reprojection (CPU)                    7.76 ± 2.52      1.88
Total                                                          413.40           100.00
Latency. End-to-end inference is 413 ms per clip (2.4 clips/s) — the body GPT sampler and trajectory controller together account for 96 % of the cost, while the VQ decoder and CPU recompose are effectively free.

Measured on a single A100 80GB, fp32, batch size 1, 200-frame clip, mean ± standard deviation over 50 runs after 10 warmups. The two autoregressive components dominate: body GPT sampling over 51 tokens (217.00 ms, 52.5 %) and the GRU trajectory controller (179.73 ms, 43.5 %). The CLIP caption encoder (7.40 ms), VQ decoder (1.51 ms) and CPU recompose/reprojection (7.76 ms) together add about 4 % of the total.
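The totals and shares quoted here can be recomputed from the per-block means. The short block names below are abbreviations of the table rows.

```python
# Mean latency per block in ms, copied from the per-block table.
blocks_ms = {
    "CLIP caption encoder": 7.40,
    "Trajectory controller": 179.73,
    "Body GPT sampling": 217.00,
    "VQ decoder": 1.51,
    "Recompose (CPU)": 7.76,
}

total_ms = sum(blocks_ms.values())
share = {name: 100.0 * ms / total_ms for name, ms in blocks_ms.items()}
autoregressive_share = share["Trajectory controller"] + share["Body GPT sampling"]
clips_per_second = 1000.0 / total_ms

print(f"total: {total_ms:.2f} ms")
print(f"autoregressive blocks: {autoregressive_share:.1f} %")
print(f"remaining blocks: {100.0 - autoregressive_share:.1f} %")
print(f"throughput: {clips_per_second:.1f} clips/s")
```

The two autoregressive blocks come to about 96.0 % and the remaining three to about 4.0 %, with a throughput of about 2.4 clips per second.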



Qualitative results: Text prompt injection

We can change what the character does mid-sequence by injecting a new text prompt. For example, starting from "a person walking forward at a steady pace" we can inject "a person sits criss-cross" or "a person starts walking right" and the motion adapts accordingly.

Walking forward, then sits criss-cross

"a person walking forward" → "a person sits criss-cross"

Walking forward, then starts walking right

"a person walking forward" → "a person starts walking right"

Qualitative results: Path constraint injection

We can also inject a new path constraint mid-sequence. This is useful when the environment changes, for instance when a new obstacle appears and the character must reroute. We show two cases: waypoints specified manually, and waypoints computed automatically with the A* algorithm to avoid obstacles.

Path injection with manually placed waypoints

Manual waypoints

Path injection with A* computed detour around obstacle

A* path around a new obstacle
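For the A* detour, a minimal grid-based version of the algorithm looks as follows. The occupancy grid, the 4-connected neighbourhood and the unit step cost are assumptions for illustration; the resulting cells would still have to be converted to root XZ waypoints before being injected as a new path constraint.

```python
import heapq
from typing import Dict, List, Optional, Sequence, Tuple

Cell = Tuple[int, int]


def astar(grid: Sequence[Sequence[int]], start: Cell, goal: Cell) -> Optional[List[Cell]]:
    """Shortest 4-connected path on a grid where 1 marks an obstacle.
    Returns the list of cells from start to goal, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell: Cell) -> int:
        # Manhattan distance: admissible for unit-cost 4-connected moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]
    came_from: Dict[Cell, Optional[Cell]] = {start: None}
    best_cost: Dict[Cell, int] = {start: 0}

    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            node: Optional[Cell] = cell
            while node is not None:
                path.append(node)
                node = came_from[node]
            return path[::-1]
        if cost > best_cost[cell]:
            continue  # stale heap entry
        r, c = cell
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            nr, nc = nxt
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get(nxt, float("inf")):
                    best_cost[nxt] = new_cost
                    came_from[nxt] = cell
                    heapq.heappush(open_heap, (new_cost + heuristic(nxt), new_cost, nxt))
    return None


obstacle_grid = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],  # a wall appears between start and goal
    [0, 0, 0, 0, 0],
]
detour = astar(obstacle_grid, start=(1, 0), goal=(1, 4))
print(detour)
```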

Other Experiments

Fine-grained Prompt Experiment

To study how much fine-grained information the model can capture from text, we run a simple prompt ablation. We start from a basic walking prompt and progressively add more specific motion details:

level_1: "a person walks"
level_2: "a person walks forward slowly"
level_3: "a person walks forward slowly while waving the right hand"
level_4: "a person walks forward slowly while waving the right hand and nodding their head"
level_5: "a person walks forward slowly while waving the right hand, nodding their head, and raising their left arm"

The generated motions show that the model captures the main semantic content well up to Level 3. At Level 4, the motion still contains walking and right-hand waving, but the head nodding is weak and not as clear as expected. At Level 5, the motion becomes less consistent: the model keeps the dominant walking behavior, but it does not clearly preserve the added head nodding and left-arm raising details.

This suggests that the model handles simple prompts and one additional action reasonably well, but struggles when several fine-grained constraints are combined in the same sentence. In practice, the text condition acts as a bottleneck: the model tends to preserve the most salient parts of the prompt while dropping or weakening secondary details. This supports the idea that text is a lossy representation of motion, especially when the caption contains multiple simultaneous body-part-specific actions.

Level 1
"a person walks"

Level 2
"a person walks forward slowly"

Level 3
"…while waving the right hand"

Level 4
"…and nodding their head"

Level 5
"…and raising their left arm"

See more ablations...

Conclusion and Limitations

We presented C-T2M, a decoupled autoregressive text-to-motion pipeline that follows user-specified root XZ paths. By delegating trajectory control to a small closed-loop GRU and letting a caption-driven body GPT focus on plausible gait, our method matches Kimodo's constraint-following quality (3.94 cm vs. 4.88 cm 2D root error) with an order of magnitude lower latency and roughly twenty times fewer effective parameters. Path and gait turn out to be separable problems, and the decoupled solution is cheap, modular, and faster than diffusion-based alternatives, making it suitable for online and interactive settings.

Limitations and future work

Our body decoder inherits the foot-skating ceiling of frame-level discrete codecs, which is the main remaining gap to Kimodo on the foot-contact suite. A Kimodo-style foot re-projection post-processing step would close most of that gap and is left to future work. The recompose step also assumes a planar floor and a single root XZ trajectory, so it does not yet generalise to end-effector or full-body keyframe constraints. Finally, the trajectory controller currently accounts for almost half of the end-to-end latency; a learned feed-forward controller could replace the closed-loop GRU and remove the largest single latency block.

Individual Contributions

  • Maria Pilligua - TODO: list contributions (e.g. path controller, training, benchmarks).
  • Pau Amargant - TODO: list contributions (e.g. VQ-VAE / RVQ-VAE codec).
  • Miquel Lopez - TODO: list contributions (e.g. body GPT, decoder ablations).
  • Nahush Rajesh Kolhe - TODO: list contributions (e.g. Kimodo baseline, evaluation pipeline).

BibTeX

@misc{ct2m2026,
  title  = {Controllable Autoregressive Text-to-Motion Generation},
  author = {Amargant, Pau and Lopez, Miquel and Pilligua, Maria and Kolhe, Nahush Rajesh},
  year   = {2026},
  note   = {CS-503 Project, EPFL},
}