Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos

Yufan Zhou

Lingshuai Lin

Junqi Jing(荆浚淇)

et al.

Image credit: MTID project.

Abstract

We propose MTID, a novel diffusion-based model for procedure planning in instructional videos: predicting a coherent action sequence given the start and end observations. Unlike previous works that rely heavily on text-level supervision, MTID introduces a latent-space temporal interpolation module that synthesizes richer intermediate-state visual features, improving temporal reasoning. We further design an action-aware mask projection and a task-adaptive masked proximity loss, which let the model focus on task-relevant, temporally coherent actions. Our method achieves state-of-the-art performance on multiple benchmarks.
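To make two of the ideas in the abstract concrete, here is a minimal sketch, not the paper's implementation. It assumes plain linear interpolation as a stand-in for the learned latent-space interpolation module, and a hard binary mask as a stand-in for the action-aware mask projection. The function names `interpolate_latents` and `mask_action_scores` and the toy feature values are hypothetical and do not come from the MTID code.

```python
from typing import List, Sequence, Set


def interpolate_latents(
    start: Sequence[float], end: Sequence[float], num_mid: int
) -> List[List[float]]:
    """Build num_mid intermediate feature vectors between start and end.

    Assumption: fixed linear weights. The module described in the abstract
    is learned, so this only illustrates the general idea.
    """
    if len(start) != len(end):
        raise ValueError("start and end must have the same dimension")
    if num_mid < 0:
        raise ValueError("num_mid must be non-negative")
    mids = []
    for k in range(1, num_mid + 1):
        w = k / (num_mid + 1)  # weight moves from near 0 (start) to near 1 (end)
        mids.append([(1 - w) * s + w * e for s, e in zip(start, end)])
    return mids


def mask_action_scores(scores: Sequence[float], allowed: Set[int]) -> List[float]:
    """Keep scores for task-relevant action indices; suppress all others.

    Assumption: a hard mask that sets irrelevant actions to -inf.
    """
    return [s if i in allowed else float("-inf") for i, s in enumerate(scores)]


if __name__ == "__main__":
    # Toy 2-D "visual features" for the start and end observations.
    mids = interpolate_latents([0.0, 0.0], [4.0, 8.0], num_mid=3)
    print(mids)

    # Toy scores over three actions; only actions 0 and 2 belong to the task.
    masked = mask_action_scores([0.2, 1.5, 0.7], allowed={0, 2})
    print(max(range(len(masked)), key=masked.__getitem__))
```

In this toy setting the mask removes the highest-scoring but task-irrelevant action, so the prediction falls back to the best action that belongs to the task.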

Type

Conference paper

Publication

In International Conference on Learning Representations (ICLR) 2025


Diffusion Models, Temporal Reasoning, Embodied AI, Instructional Videos

Authors

Junqi Jing(荆浚淇)

Student
