Project Report · CS-503, EPFL, 2026
We present C-T2M, a decoupled approach to controllable text-to-motion generation that combines an autoregressive language model over discrete motion tokens with an explicit trajectory controller. Our pipeline consists of three components: (i) a closed-loop GRU-based trajectory controller that generates smooth XZ root paths from user-specified waypoints and captions, (ii) a caption-driven body GPT that synthesizes motion tokens over a learned VQ-VAE codebook conditioned on text, and (iii) a deterministic recompose step that blends the body's predicted gait onto the controller's trajectory. By decoupling path control from body motion generation, our system runs in 413 ms per 200-frame clip, an order of magnitude faster than the state-of-the-art diffusion baseline Kimodo, while following user constraints with 3.94 cm root error (vs. Kimodo's 4.88 cm). We train on the BONES dataset and evaluate against Kimodo on both the Repetition Text-to-Motion split and a dedicated constraint-following protocol. We additionally study a residual VQ-VAE codec (RVQ-VAE) and ablate the constraint-conditioning architecture, the temporal compression rate, the heading representation, and the token sampling strategy.
Generating realistic human motion from natural-language descriptions has made significant progress in recent years. Diffusion-based systems such as Kimodo are now capable of producing high-fidelity human motion conditioned on free-form text, and additionally support a rich set of kinematic constraints (pose, joint trajectory, root path). Kimodo is trained on the BONES dataset, a large-scale motion capture corpus roughly ten times larger than the commonly used HumanML3D benchmark, which enables this quality improvement.
Despite these advances, current state-of-the-art systems share a common limitation: they generate the entire motion clip in one expensive offline pass, typically requiring tens to hundreds of denoising steps. This makes them unsuitable for online settings, where the user wants to update waypoints, captions, or scene geometry mid-execution, and for interactive settings, where latency directly bounds usability. A robotics policy, a digital twin, or an interactive game character all need motion generation that reacts to changing input without recomputing everything from scratch.
We design a controllable text-to-motion model that takes as input (i) a free-form caption and (ii) a root XZ path constraint (either a small set of waypoints or a dense trajectory), and produces frame-by-frame 3D joint positions of a humanoid skeleton that obey both inputs. Our goals are:
Recent text-to-motion literature has been dominated by diffusion models (MDM, MotionDiffuse, FLAME, ReMoDiffuse), which synthesise high-fidelity human movement by iteratively denoising continuous latent representations. A key advantage of the diffusion formulation is its amenability to complex spatial and temporal conditioning. Frameworks such as Kimodo, OmniControl and PriorMDM leverage this capability to introduce fine-grained kinematic constraints, including end-effector trajectories, root paths, and keyframe poses, typically via guided denoising or latent inpainting. However, their reliance on multiple reverse diffusion steps incurs high inference latency, severely limiting their deployment in real-time, dynamic environments.
To overcome this latency bottleneck, a parallel line of research reformulates motion generation as a discrete sequence prediction task. T2M-GPT pioneered this approach by compressing continuous 3D motion into a discrete vocabulary via a VQ-VAE and using a causal Transformer to autoregressively predict motion tokens from text. This paradigm has inspired numerous variants, including MotionGPT for unified multimodal tasks and T2M-HiFiGPT for enhanced artifact reduction. Notably, while successors like MoMask achieve superior generation fidelity by employing masked bidirectional modelling, this non-causal iterative refinement process breaks the strict left-to-right generation required for online, streaming applications. Consequently, we select the strictly causal T2M-GPT architecture as our baseline and extend it with an explicit, decoupled trajectory controller.
At its core, text-to-motion predicts the 3D position of each joint of a humanoid skeleton, frame by frame, from a natural-language description. Our approach extends the T2M-GPT baseline with one key addition: an explicit path controller that lets the model follow spatial constraints, instead of leaving the trajectory entirely up to the language model. The pipeline consists of four blocks, summarised in Figure 1, that we describe in detail below.
The body decoder is a temporal vector-quantised autoencoder (VQ-VAE). The encoder maps a window of raw motion features into a sequence of discrete tokens drawn from a learned codebook; the decoder reconstructs continuous motion from those tokens. The autoregressive language model is then trained to predict these tokens, so "predicting motion" is reduced to "predicting codebook indices". We use a codebook of 2048 entries, a 2× temporal downsampling rate, and a window size of 64 frames. The choice of compression rate is studied in Section 3.4 and on the Ablations page: 4× converges to the lowest reconstruction loss and the highest per-clip code diversity, but we keep 2× because it retains a denser token stream that the autoregressive model benefits from.
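As a rough illustration of the quantisation step, the sketch below maps latent vectors to the index of their nearest codebook entry and counts the tokens produced per window. It is a toy under stated assumptions, not the trained codec: the 4-entry 2-D codebook, the function names `quantise` and `tokens_per_window`, and the assumption that downsampling divides the frame count exactly are all hypothetical.

```python
import math


def quantise(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    tokens = []
    for z in latents:
        distances = [math.dist(z, entry) for entry in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def tokens_per_window(window_frames, downsample):
    """Token count per window, assuming plain strided temporal downsampling."""
    return window_frames // downsample


# Toy 2-D codebook with 4 entries (the report uses 2048 learned entries).
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
latents = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8)]
tokens = quantise(latents, codebook)

# A 64-frame window yields 32 tokens at 2x and 16 tokens at 4x.
dense = tokens_per_window(64, 2)
coarse = tokens_per_window(64, 4)
```

Under this assumption, the 2× setting gives the language model twice as many tokens per window as 4×, which is the "denser token stream" trade-off mentioned above.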
We additionally explore a residual VQ-VAE in which quantisation is performed at K hierarchical levels: at each level, the residual error left by the previous level is itself quantised by a dedicated codebook. The body language model is then adapted to predict K token streams per frame. We use K = 4 levels with 1024 codebook entries per level, trained with a cosine-weighted reconstruction loss that emphasises higher-frequency details at deeper levels. On the codec reconstruction benchmark this lifts R-Precision by roughly 9 points on the Overview split and 12 points on the Timeline-multi split, halves the FID, and shaves about 6 cm/s off the foot-skating; full numbers are reported in the codec reconstruction ablation.
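The residual scheme can be sketched in a few lines: each level quantises whatever error the previous levels left behind, and the reconstruction is the sum of the selected entries. This is a minimal illustration with two hand-written toy codebooks rather than the K = 4 learned ones; `residual_quantise` and `nearest` are hypothetical names, and the cosine-weighted loss is not modelled.

```python
import math


def nearest(vec, codebook):
    """Index of the codebook entry closest to vec."""
    dists = [math.dist(vec, entry) for entry in codebook]
    return dists.index(min(dists))


def residual_quantise(vec, codebooks):
    """Quantise vec level by level; each level encodes the remaining residual."""
    residual = list(vec)
    recon = [0.0] * len(vec)
    indices = []
    for codebook in codebooks:
        idx = nearest(residual, codebook)
        indices.append(idx)
        recon = [r + c for r, c in zip(recon, codebook[idx])]
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, recon


# Toy codebooks: a coarse level followed by a finer correction level.
coarse = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
fine = [(0.0, 0.0), (0.25, 0.0), (-0.25, 0.0), (0.0, 0.25), (0.0, -0.25)]

vec = (1.2, 0.9)
indices, recon = residual_quantise(vec, [coarse, fine])
error_one_level = math.dist(vec, coarse[indices[0]])
error_two_levels = math.dist(vec, recon)
```

In this toy case the second level roughly halves the reconstruction error, which is the intuition behind predicting K token streams per frame.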
The caption is embedded by a frozen CLIP text encoder (ViT-B/32) into a single 512-dimensional vector. Path constraints are encoded by a small linear projection that takes, for each frame, the next four waypoints (x, y, t) expressed relative to the skeleton's current position, plus its absolute position and current frame index. This compact representation is shared between the trajectory controller and the body GPT.
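To make the path conditioning concrete, here is one plausible way to assemble the per-frame feature vector before the linear projection. Several details are assumptions not stated in the report: the time component is also taken relative to the current frame, missing waypoints are padded by repeating the last one, and the name `path_features` is hypothetical.

```python
def path_features(position, frame_idx, waypoints, lookahead=4):
    """Build the raw conditioning vector for one frame.

    waypoints is a list of (x, y, t) tuples sorted by t. The output holds the
    next `lookahead` waypoints relative to the current state, followed by the
    absolute position and the frame index: 4 * 3 + 2 + 1 = 15 values.
    """
    px, py = position
    upcoming = [w for w in waypoints if w[2] >= frame_idx][:lookahead]
    # Assumed padding rule: repeat the last waypoint (or the current state).
    while len(upcoming) < lookahead:
        upcoming.append(upcoming[-1] if upcoming else (px, py, frame_idx))

    feats = []
    for wx, wy, wt in upcoming:
        feats.extend([wx - px, wy - py, wt - frame_idx])
    feats.extend([px, py, float(frame_idx)])
    return feats


waypoints = [(1.0, 0.0, 10), (2.0, 0.0, 20)]
feats = path_features(position=(0.5, 0.0), frame_idx=5, waypoints=waypoints)
```

Because the waypoints are expressed relative to the skeleton, the same vector describes the same local situation anywhere in the scene; the absolute position and frame index are appended separately.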
The trajectory controller is a closed-loop GRU (about 2 M parameters) that takes the caption embedding and path conditioning features and predicts, for each frame of the rollout, the next root (x, y) position and a heading angle. It is trained to minimise the regression loss to the ground-truth waypoint trajectory, plus a heading-smoothness regulariser that removes per-frame jitter and an arrive-radius gating that prevents orbiting around the final waypoint. The heading parameterisation is travel-relative (turn rate w.r.t. the direction of motion) rather than absolute world-frame yaw; on still frames, where the travel direction is ill-defined, we add an explicit target-zero loss on the predicted turn rate, which fixes the spin-on-spot pathology that arises with the naive travel-relative formulation. See the heading representation ablation.
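The travel-relative heading target can be illustrated by deriving a per-frame turn rate from a root path. The sketch below is an assumption about how such targets might be built, not the training code: the stillness threshold `still_eps`, the function names, and the choice of radians per frame are hypothetical. Still frames receive a zero target, mirroring the target-zero loss described above.

```python
import math


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2 * math.pi) - math.pi


def turn_rate_targets(path, still_eps=1e-3):
    """Per-step turn rate relative to the direction of travel.

    On still frames the travel direction is ill-defined, so the target is
    set to zero instead of an arbitrary angle.
    """
    targets = []
    prev_dir = None
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        dx, dy = x1 - x0, y1 - y0
        if math.hypot(dx, dy) < still_eps:
            targets.append(0.0)  # still frame: target-zero turn rate
            continue
        direction = math.atan2(dy, dx)
        if prev_dir is None:
            targets.append(0.0)  # first moving step has no previous direction
        else:
            targets.append(wrap_angle(direction - prev_dir))
        prev_dir = direction
    return targets


# Move along +x, pause for one frame, then turn left by 90 degrees.
path = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
targets = turn_rate_targets(path)
```

Without the still-frame branch, `atan2(0, 0)` would silently return 0 and inject a spurious turn on the next moving frame, which is one way a spin-on-spot artefact could arise.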
The body GPT is a causal Transformer decoder over the VQ-VAE token vocabulary. At each step it predicts the next motion token conditioned on (a) the CLIP caption embedding, (b) the path conditioning features at the current frame, and (c) the previously predicted tokens. We use a 9-layer decoder with embedding dimension 1024, 16 attention heads, a block size of 51 tokens, and a feed-forward expansion factor of 4. Because the trajectory is provided to the GPT through the path conditioning, the GPT is no longer in charge of global position: it is conditioned on it, and instead focuses on producing motion that is consistent with the chosen trajectory. Trajectory comes from the controller, naturalness from the GPT.
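For reference, the hyperparameters quoted above can be collected in a small configuration object together with the dimensions they imply. The class name `BodyGPTConfig` and its field names are hypothetical; only the numeric values come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BodyGPTConfig:
    """Hyperparameters of the body GPT as stated in the report."""
    num_layers: int = 9
    embed_dim: int = 1024
    num_heads: int = 16
    block_size: int = 51      # maximum context length in tokens
    ff_expansion: int = 4     # feed-forward width as a multiple of embed_dim
    codebook_size: int = 2048

    @property
    def head_dim(self) -> int:
        """Per-head width implied by embed_dim and num_heads."""
        return self.embed_dim // self.num_heads

    @property
    def ff_dim(self) -> int:
        """Hidden width of the feed-forward sublayer."""
        return self.embed_dim * self.ff_expansion


config = BodyGPTConfig()
```

With these values each attention head is 64-dimensional and the feed-forward sublayer is 4096 units wide.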
At inference time we sample tokens with a sliding window of size SW = 25. We use plain multinomial sampling rather than greedy, top-p, or top-k; the decoder ablation shows that multinomial is the only sampler whose body motion does not collapse to a held pose on a non-trivial fraction of captions.
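A minimal sketch of multinomial sampling with a sliding context window follows. It assumes the window simply restricts the model's context to the last SW tokens, which the report does not spell out; `sample_tokens`, `toy_logits`, and the 8-token vocabulary are hypothetical stand-ins for the real body GPT.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_tokens(next_logits, num_tokens, window=25, rng=None):
    """Autoregressive multinomial sampling over a sliding context window."""
    rng = rng or random.Random(0)
    tokens = []
    for _ in range(num_tokens):
        context = tokens[-window:]  # keep only the most recent tokens
        probs = softmax(next_logits(context))
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(token)
    return tokens


# Toy stand-in for the body GPT: uniform logits over 8 tokens.
seen_context_lengths = []


def toy_logits(context):
    seen_context_lengths.append(len(context))
    return [0.0] * 8


tokens = sample_tokens(toy_logits, num_tokens=100)
```

Greedy decoding would replace the `rng.choices` call with an argmax over `probs`; drawing from the full distribution instead is what keeps some randomness in the token stream.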
Finally, a deterministic recompose step blends the body GPT's predicted motion onto the controller's trajectory. The GPT outputs a locally-coherent gait; the controller produces a globally-correct root trajectory. The recompose step warps the body's local frames so that the root XZ position matches the controller's rollout exactly, while leaving the limbs, heading, and vertical motion untouched. This keeps the body GPT decoupled from the global path: it does not need to memorise long-range path-following, the controller does. Because the recompose step is closed form, it adds negligible latency.
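One simple closed-form reading of the recompose step is a per-frame horizontal translation that moves the root onto the controller's path. The sketch below illustrates that reading only; the actual blending may do more, and `recompose`, the joint layout, and the axis convention (x and z horizontal, y vertical) are assumptions.

```python
def recompose(body_joints, controller_path, root_index=0):
    """Translate each frame so the root XZ matches the controller's path.

    body_joints: per-frame lists of (x, y, z) joint positions.
    controller_path: per-frame (x, z) root targets.
    Heights and joint offsets relative to the root are left unchanged.
    """
    out = []
    for joints, (cx, cz) in zip(body_joints, controller_path):
        rx, _, rz = joints[root_index]
        dx, dz = cx - rx, cz - rz
        out.append([(x + dx, y, z + dz) for (x, y, z) in joints])
    return out


# Two frames, two joints each (root and one limb joint).
body = [
    [(0.0, 1.0, 0.0), (0.5, 1.5, 0.0)],
    [(0.25, 1.0, 0.0), (0.75, 1.5, 0.0)],
]
path = [(2.0, 3.0), (2.0, 3.5)]
result = recompose(body, path)
```

Since the operation is a single subtraction and addition per joint, its cost is negligible next to the controller and the GPT.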
We evaluate C-T2M against Kimodo on the BONES benchmark in two settings: standard text-to-motion (Section 4.2) and constraint following (Section 4.3). We then summarise the key ablations (Section 4.4), with full results on the Ablations page.
We train on the BONES dataset (full 127k clips). All quantitative results in this section use the BONES Repetition Text-to-Motion split (6,539 testcases), with held-out captions. We run stages 2-5 of the Kimodo benchmark pipeline (token decoding, motion synthesis, TMR text-encoder embedding, R-Precision and FID computation, and foot-skate scoring), so that all systems are evaluated by exactly the same downstream code. Latency is measured end-to-end on a single A100 80 GB GPU at fp32, batch size 1, 200-frame clip, averaged over 50 runs with 10 warm-ups.
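The latency protocol (10 warm-ups, then an average over 50 timed runs) can be sketched as follows. This is a generic CPU-side harness with a hypothetical name, `measure_latency_ms`, not the benchmark script itself; on a GPU the device would additionally need to be synchronised before each clock read.

```python
import time


def measure_latency_ms(fn, warmups=10, runs=50):
    """Average wall-clock latency of fn in milliseconds after warm-up calls."""
    for _ in range(warmups):
        fn()  # warm-up calls are executed but not timed
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return sum(timings) / len(timings)


# Toy workload standing in for one 200-frame generation call.
calls = []
avg_ms = measure_latency_ms(lambda: calls.append(1))
```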
Table 1 reports the headline comparison on the Repetition Text-to-Motion split. Our 9L single-VQ system reaches R@3 = 58.66 on the Overview split (vs. 88.07 for Kimodo), at roughly 10× lower latency (413 ms vs. 4280 ms per clip) and with roughly 20× fewer effective parameters (395 M vs. 8.3 B). We accept a quality deficit on this generic-text benchmark, driven by a much narrower training vocabulary (walk-and-jog subset for our baseline ablations; full BONES for the headline system) and by a single autoregressive pass per clip instead of 100 denoising steps. The foot-skate gap reflects the ceiling of frame-level discrete codecs; a Kimodo-style foot-contact re-projection step would close most of this gap and is left to future work.
| Model | Params (M) | Latency (ms/clip) ↓ | R@3 ↑ | FID ↓ | Skate ↓ | Cont ↑ |
|---|---|---|---|---|---|---|
| Ground Truth | – | – | 94.03 | 0.000 | 2.11 | 1.000 |
| Kimodo-SOMA-SEED-v1.1 | 8 300 | 4280 | 88.07 | 0.007 | 3.74 | 0.978 |
| Ours (decoupled, 9L sVQ) | 395 | 413 | 58.66 | 0.191 | 33.07 | 0.617 |
| Ours (RVQ-GPT) | 565 | 486 | 18.74 | 0.644 | 39.24 | 0.481 |
On Kimodo's dedicated path-following split (content/constraints_withtext/root/path_2dpos, 256 testcases), C-T2M reaches a 2D root error of 3.94 cm, slightly below Kimodo's 4.88 cm on the same testcases under the same evaluator. Root accuracy (fraction of frames within 10 cm of the target path) is 95.7% for C-T2M versus 93.4% for Kimodo. Foot-skate and contact consistency favour Kimodo, consistent with the codec ceiling discussed in Section 4.2.
| Model | 2D root error (cm) ↓ | Acc ≤ 10 cm ↑ | Foot skate ↓ | Contact ↑ |
|---|---|---|---|---|
| Ground truth (lower bound) | 3.76 | 1.000 | 0.103 | 1.000 |
| Kimodo-SOMA-SEED-v1.1 (DDIM 100) | 4.88 | 0.934 | 0.117 | 0.966 |
| Ours (decoupled) | 3.94 | 0.957 | 0.583 | 0.582 |
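The two path metrics in the table can be approximated as below. This is a sketch under assumptions and may differ from the Kimodo evaluator: positions are taken to be in metres, the error is the mean per-frame Euclidean distance in the ground plane, and `root_path_metrics` is a hypothetical name.

```python
import math


def root_path_metrics(predicted, target, threshold_cm=10.0):
    """Mean 2D root error in cm and fraction of frames within the threshold."""
    errors_cm = [100.0 * math.dist(p, t) for p, t in zip(predicted, target)]
    mean_error = sum(errors_cm) / len(errors_cm)
    accuracy = sum(e <= threshold_cm for e in errors_cm) / len(errors_cm)
    return mean_error, accuracy


# Four frames with errors of 0, 5, 20, and 0 cm against a stationary target.
predicted = [(0.0, 0.0), (0.05, 0.0), (0.2, 0.0), (0.0, 0.0)]
target = [(0.0, 0.0)] * 4
mean_error, accuracy = root_path_metrics(predicted, target)
```

In this toy case the mean error is 6.25 cm and three of four frames fall within 10 cm, so a single large deviation affects the two metrics differently.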
The full set of ablation tables and qualitative comparisons lives on the Ablations page. The key findings are:
Beyond the quantitative tables, we report a representative set of qualitative samples. All clips are rendered with pyrender on top of the SOMA skeleton; foot contacts are post-processed with the Kimodo motion-correction solver. Captions are held out from training.
A grid of free-generation samples from our body GPT (single-VQ codebook 2048, sliding window 25, multinomial sampling). Each clip is produced from the caption alone; no path constraint is given.
“a person executes a front flip with a smooth landing”
“high jump over a moving pole in one place”
“lively charleston dance with a forward leg kick”
“a neutral throw and release of a ball”
Figure 2. Free text-to-motion samples on held-out captions.
Side-by-side comparison of three text-to-motion models on the same held-out captions. Kimodo is a diffusion-based baseline (state-of-the-art quality, slow); VQ-GPT is our single-codebook autoregressive model; RVQ-GPT is our residual-quantisation variant.
Walking along several user-specified XZ paths. The GRU controller rolls out the trajectory; the body GPT supplies the gait; the recompose step stitches them together. The same body GPT handles sparse waypoints, dense trajectories, and curves without retraining.
Dense spiral path.
Wavy dense trajectory.
Sparse waypoints, right arc.
Single diagonal waypoint.
Figure 5. Path-following samples with varying constraint densities.
Because trajectory and body are decoupled, the path can be edited mid-rollout without restarting generation. The controller absorbs the new waypoints immediately and the body GPT continues synthesising motion along the updated trajectory.
User redraws the path mid-rollout.
A* planned path around obstacles.
Figure 6. Online adaptation: the controller responds to waypoint edits in real time without re-running the body generator.
We presented C-T2M, a decoupled autoregressive text-to-motion pipeline that follows user-specified root XZ paths. By delegating trajectory control to a small closed-loop GRU and letting a caption-driven body GPT focus on plausible gait, our method matches Kimodo's root-path accuracy (3.94 cm vs. 4.88 cm 2D root error) at an order of magnitude lower latency and with roughly twenty times fewer effective parameters. Path and gait turn out to be separable problems, and the decoupled solution is cheap, modular, and faster than diffusion-based alternatives, making it suitable for online and interactive settings.
Our body decoder inherits the foot-skating ceiling of frame-level discrete codecs, which is the main remaining gap to Kimodo on the foot-contact suite. A Kimodo-style foot re-projection post-processing step would close most of that gap and is left to future work. The recompose step also assumes a planar floor and a single root XZ trajectory, so it does not yet generalise to end-effector or full-body keyframe constraints. Finally, the trajectory controller currently accounts for almost half of the end-to-end latency; a learned feed-forward controller could replace the closed-loop GRU and remove the largest single latency block.
@misc{ct2m2026,
title = {Controllable Autoregressive Text-to-Motion Generation},
author = {Amargant, Pau and Lopez, Miquel and Pilligua, Maria and Kolhe, Nahush Rajesh},
year = {2026},
note = {CS-503 Project, EPFL},
}