Supervising the causal student with a bidirectional teacher, which sees privileged future information during training, proves surprisingly effective at reducing the student's error accumulation (see video below). This form of asymmetric distillation, in which the student and teacher use different architectures, is only feasible with DMD-style distillation. Other methods, such as progressive distillation or consistency models, require the student and teacher to share the same architecture.
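
To make "privileged future information" concrete, here is a minimal toy sketch of the architectural difference between the two models. It is not the actual teacher or student network: `self_attention`, the identity projections, the frame count, and the feature size are all assumptions chosen for illustration. The teacher uses a full (bidirectional) attention mask over frames, while the student uses a lower-triangular (causal) mask, so only the teacher's output for an early frame can react to a change in a later frame.

```python
import numpy as np

def self_attention(x, mask):
    """Toy single-head self-attention over frames (identity projections, an assumption)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    # Disallowed positions get -inf so they receive zero weight after the softmax.
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

num_frames, dim = 4, 8  # hypothetical sizes
rng = np.random.default_rng(0)
frames = rng.normal(size=(num_frames, dim))

# Teacher: every frame attends to every frame, including future ones.
bidirectional_mask = np.ones((num_frames, num_frames), dtype=bool)
# Student: frame t attends only to frames 0..t.
causal_mask = np.tril(bidirectional_mask)

# Perturb only the last (future) frame and measure the effect on frame 0's output.
perturbed = frames.copy()
perturbed[-1] += 1.0

teacher_delta = np.abs(
    self_attention(perturbed, bidirectional_mask)[0]
    - self_attention(frames, bidirectional_mask)[0]
).max()
student_delta = np.abs(
    self_attention(perturbed, causal_mask)[0]
    - self_attention(frames, causal_mask)[0]
).max()

print(f"teacher frame-0 change: {teacher_delta:.4f}")
print(f"student frame-0 change: {student_delta:.4f}")
```

In this sketch the student's first-frame output is unaffected by the future frame, while the teacher's is not. Because the two models differ only in what they attend to, a distillation objective that compares output distributions rather than matching the networks step for step can pair them.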