Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos
MTID (Masked Temporal Interpolation Diffusion) is a diffusion-based model for procedure planning in instructional videos. It incorporates latent temporal interpolation and task-aware masking to improve temporal reasoning.