Projects - Maria Pilligua

(Full) Attention is More Than You Need: Sparse Cross-View Attention in VGGT

Maria Pilligua, supervised by Aoxiang Fan, Pascal Fua

Semester Project · EPFL CVLab (2026)

Replaced VGGT's dense global cross-view attention with a learned sparse index over patch correspondences. The sparse variant runs 32× faster than dense FlashAttention at 96 frames and gains 5 points of pose AUC@30 on ScanNet. A contrastive InfoNCE loss teaches the backbone to build its own sparse index at inference, with no geometry.

Page Report Slides Code (soon)
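As a rough illustration of the contrastive objective named above, here is a minimal InfoNCE sketch in plain Python. It is not the project's implementation: the pairing scheme (row i of `anchors` matches row i of `positives`, all other rows act as negatives), the cosine similarity, and the `temperature` value are assumptions made for the example.

```python
import math

def info_nce(anchors, positives, temperature=0.07):
    """Mean InfoNCE loss over a batch of embedding pairs.

    Assumption for this sketch: anchors[i] and positives[i] are a matching
    pair, and every other positives[j] serves as a negative for anchors[i].
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]

    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i to every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching index as the target class.
        total += log_denominator - logits[i]
    return total / len(a)
```

The loss is near zero when each anchor is most similar to its own positive and grows when the pairs are mismatched.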

Controllable Autoregressive Text-to-Motion Generation

Maria Pilligua, Pau Amargant, Miquel Lopez, Nahush Rajesh Kolhe

EPFL CS-503 · Final Grade 5.75/6

A real-time text-to-motion model where the user can steer generation mid-sequence: at any frame, specify a full-body pose, a hand or foot position, or a path the character must follow. Built on T2M-GPT with a VQ-VAE motion codebook and an 18-layer causal Transformer, trained on the BONES dataset. Explored two constraint-injection architectures: a ControlNet-style cross-attention adapter and prefix tokens with custom causal masking.

Page Code
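To make the prefix-token idea concrete, the sketch below builds one plausible attention mask. The actual masking used in the project is not described here, so the scheme is an assumption: constraint tokens sit in a prefix that every position can see, while motion tokens attend causally to each other. The function name `prefix_causal_mask` is hypothetical.

```python
def prefix_causal_mask(num_prefix, num_motion):
    """Return a boolean mask where mask[q][k] is True if query q may attend to key k.

    Assumed layout: the first num_prefix positions hold constraint tokens,
    followed by num_motion autoregressive motion tokens.
    """
    n = num_prefix + num_motion
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if k < num_prefix:
                # Every position can read the whole constraint prefix.
                mask[q][k] = True
            elif q >= num_prefix and k <= q:
                # Motion tokens see themselves and earlier motion tokens only.
                mask[q][k] = True
    return mask
```

With this layout the prefix never reads motion tokens, so changing a constraint mid-sequence only requires recomputing the prefix and the motion tokens generated after it.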

VQ-CoT: latent chain-of-thought quantised via VQ-VAE

VQ-CoT: Discretising Latent Chain-of-Thought

Maria Pilligua, Chengheng Li Chen, Nil Biescas, Pau Amargant

EPFL CS-552 Modern NLP · Final Grade 6/6

Probed whether latent chain-of-thought models (Coconut, CODI) genuinely reason. Discretised their thought vectors with VQ-VAE and FSQ, then dissected the latents with logit lenses, ablations, and content swaps. CODI's separator latents are content-free placeholders (a 3-latent student matches the 6-latent teacher), and Qwen3 backbones up to 4B write structured latents they never actually read at inference.

Page Report
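The discretisation step can be illustrated with the nearest-neighbour lookup at the core of vector quantisation. This is a generic sketch, not the project's code: the function name `quantize`, the tiny codebook, and the squared Euclidean distance are illustrative assumptions, and FSQ (which rounds each dimension to a fixed grid instead of using a learned codebook) is not shown.

```python
def quantize(vector, codebook):
    """Map a continuous vector to the index and value of its nearest codebook entry."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Pick the code with the smallest squared Euclidean distance to the input.
    index = min(range(len(codebook)),
                key=lambda i: squared_distance(vector, codebook[i]))
    return index, codebook[index]
```

Once a thought vector is replaced by a code index, it can be inspected, counted, or swapped like a discrete token.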

No Gradients, No Problem? A Comparative Study of Zeroth-Order Optimizers

Maria Pilligua, Nil Biescas, Chengheng Li Chen

EPFL CS-439 Optimization for ML · Final Grade 6/6

Controlled comparison of 10 zeroth-order optimizers and an AdamW baseline on the same harness, fine-tuning Qwen2.5 and Qwen3.5 on SuperGLUE. Sparse-MeZO was the strongest ZO method: 5 points below AdamW at one third of the memory, and the only one that improved with model scale. Training on cross-entropy still beat training on non-differentiable accuracy for almost every method.

Page Report Code
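For readers new to zeroth-order optimisation, the sketch below shows the basic two-point (SPSA-style) update that MeZO-like methods build on: the gradient is never computed, only two loss evaluations along a random direction. It is a toy version under stated assumptions; `mezo_step`, the learning rate, and the perturbation scale are illustrative choices, and sparse variants that restrict the perturbation to a subset of parameters are not shown.

```python
import random

def mezo_step(params, loss_fn, lr=0.01, eps=1e-3, seed=0):
    """One zeroth-order update using a two-point finite-difference estimate."""
    # The random direction is a function of the seed alone, so it can be
    # regenerated on demand rather than stored.
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in params]

    plus = [p + eps * zi for p, zi in zip(params, z)]
    minus = [p - eps * zi for p, zi in zip(params, z)]

    # Scalar estimate of the directional derivative along z.
    projected_grad = (loss_fn(plus) - loss_fn(minus)) / (2 * eps)

    # Move against the estimated gradient, which points along z.
    return [p - lr * projected_grad * zi for p, zi in zip(params, z)]
```

Because `loss_fn` is only evaluated, never differentiated, the same step works with a non-differentiable objective such as accuracy.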

RC car with LiDAR and camera on a taped circuit

Autonomous Navigation on a Physical Car

Joan Lafuente, Maria Pilligua, Xavi Soto

UAB Autonomous Navigation course · 2025 · Final Grade 10/10

End-to-end imitation-learning pipeline on a small physical RC car with an RGB camera and a 2D LiDAR. We drove the car around taped-up circuits with a joystick to collect the dataset, then trained a CIL++ model to predict steering and throttle from the sensors. LiDAR turned out to be the key ingredient for obstacle avoidance: the front camera alone cannot tell that the car is still in the middle of a passing manoeuvre once the obstacle leaves the frame.

Page Report

MAPA: A Quadruped Robot Guide Dog

Maria Pilligua, with the RoboHack team

RoboHack EPFL 2026 · 3rd Place

Guide-dog agent on a Unitree quadruped, built in one weekend. A tool-using agent hears app-based voice requests, uses Claude as a VLM for scene understanding, YOLO for detection and tracking, and SLAM for live mapping and localisation, then physically walks the user to their destination while avoiding obstacles and narrating what it is doing.

Page Devpost Code

Point2Weight: HyperNetwork-Driven Implicit Function Estimation from Point Clouds

Maria Pilligua, Júlia Garcia, Joan Lafuente

UAB 3D Vision course · 2025 · Final Grade 10/10

3D surface reconstruction from noisy point clouds via implicit functions. A PointNet embedding of the point cloud conditions a HyperNetwork that predicts, per object, both the weights of an MLP-based signed-distance decoder and the parameters of a multi-resolution hash grid used as positional encoding. Trained on the ABC dataset, it generalises to unseen and incomplete shapes better than classical Poisson Surface Reconstruction and DL baselines like Points2Surf.

Page Report

Towards Controllable Image Relighting

Maria Pilligua, supervised by Javier Vazquez-Corral

Bachelor's Thesis · UAB · 2025 · Final Grade 10/10 with honors

Comparative study of methods for controllable image relighting, across two very different architecture families. On the diffusion side, tested Concept Sliders (LoRA), IP-Adapter, and a custom ControlNet trained on RGBX-extracted albedos on top of Stable Diffusion. On the transformer side, adapted DAT and Restormer to consume lighting parameters via PromptIR-style prompt injection into cross-attention. Restormer + light-prompt won, with 28.66 dB PSNR / 0.785 SSIM averaged over all lighting-pair transitions, and it generalises to out-of-domain iPhone photos. Built the paired RSR dataset (9-light rig) used for training.

Page Report Slides
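For context on the PSNR figure quoted above, the standard definition is 10·log10(MAX² / MSE), where MAX is the largest possible pixel value. The sketch below computes it for flat lists of pixel values; the function name `psnr` and the default range of 1.0 are assumptions for the example, not the evaluation code used in the thesis.

```python
import math

def psnr(reference, estimate, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        # Identical images: PSNR is unbounded.
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)
```

Higher is better: halving the root-mean-square error raises PSNR by about 6 dB.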

HyperNVD: Accelerating Neural Video Decomposition via Hypernetworks.

Maria Pilligua, Danna Xue, Javier Vazquez-Corral

CVPR 2025

Page Paper Code

PAH: Prototype Augmented Hypernetworks

Neil De La Fuente, Maria Pilligua, Daniel Vidal, Alvin Soutif, Andrey Barskey

Workshop CVPR 2025

Page Paper Code

PromptNorm: Image Geometry Guides Ambient Light Normalization

David Serrano-Lozano, Francisco A. Molina Bakhos, Danna Xue, Yixiong Yang, Maria Pilligua, Ramon Baldrich, Maria Vanrell, Javier Vazquez-Corral

Workshop CVPR 2025

Page Paper Code

LayeredDoc: Domain Adaptive Document Restoration with a Layer Separation Approach.

Maria Pilligua, Nil Biescas, Javier Vazquez-Corral, Josep Lladós, Ernest Valveny, Sanket Biswas

ECCV WiCV 2024, ICDAR 2024

Page Paper Code Poster BibTex

@misc{pilligua2024layereddoc,
  title={LayeredDoc: ...},
  author={Maria Pilligua and ...},
  year={2024},
  ...
}

Text-to-SVG comparison: previous LLM-SVG output (broken) vs my diffusion pipeline (GoodNotes doodle style)

Text-to-SVG for GoodNotes

Maria Pilligua, with the GoodNotes ML team

GoodNotes ML Internship · London · 2025 · Shipped to millions

GoodNotes' previous pipeline asked an LLM to write SVG code directly and broke on anything geometric, because LLMs are not spatial reasoners. I proposed reframing the task as computer vision: draw the picture with an image-diffusion model, then vectorize it. Designed and built the full stack (prompt rewriting, diffusion generation, style alignment to GoodNotes' doodle aesthetic, filtering and ranking, vectorizer, conversion to native editable objects). Prototyped 10+ variants. Latency dropped from ~40s to ~7s at much higher quality; the pipeline was adopted and shipped.

Page GoodNotes AI

Evaluating Low-Light Image Enhancement Across Multiple Intensity Levels

Maria Pilligua, David Serrano-Lozano, Pai Peng, Ramon Baldrich, Michael S. Brown, Javier Vazquez-Corral

CVPR Findings 2026

Page Paper Code