Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Mingda Wan, Yufa Zhou (alphabetical order)
ICCV 2025
We provide a theoretical analysis showing that for diffusion models with Gaussian mixture data, the diffusion process preserves the mixture structure; we derive tight, component-independent bounds on Lipschitz constants and second moments, and establish error guarantees for diffusion solvers—offering deeper insights into the diffusion dynamics under common data distributions.
Yufa Zhou*, Shaobo Wang*, Xingyu Dong*, Xiangqi Jin, Yifang Chen, Yue Min, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang (* equal contribution)
arXiv 2025
We investigate whether post-training techniques such as SFT and RLVR can generalize to multi-agent systems, and introduce Recon—a 7B model trained on a curated dataset of economic reasoning problems—which achieves strong benchmark performance and exhibits emergent strategic generalization in multi-agent games.
Xuan Shen, Weize Ma, Yufa Zhou, Enhao Tang, Yanyue Xie, Zhengang Li, Yifan Gong, Quanyi Wang, Henghui Ding, Yiwei Wang, Yanzhi Wang, Pu Zhao, Jun Lin, Jiuxiang Gu
arXiv 2025
FastCar is a unified framework that accelerates auto-regressive video generation by exploiting temporal redundancy in MLP outputs through a Temporal Attention Score (TAS), enabling selective reuse of computations. It integrates with sparse attention to mitigate drifting and supports real-time, high-resolution synthesis on edge devices via a TAS-guided Dynamic Resource Scheduling (DRS) FPGA accelerator, achieving over 2.1× speedup and improved efficiency with minimal quality loss.
Xuan Shen*, Chenxia Han*, Yufa Zhou*, Yanyue Xie, Yifan Gong, Quanyi Wang, Yiwei Wang, Yanzhi Wang, Pu Zhao, Jiuxiang Gu (* equal contribution)
arXiv 2025
DraftAttention accelerates video diffusion transformers by using low-resolution pooled attention maps for dynamic sparse attention and hardware-efficient execution, achieving up to 1.75× speedup with minimal quality loss.
Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Yufa Zhou (alphabetical order)
AISTATS 2025
We demonstrate that a looped 23-layer ReLU-MLP can function as a universal programmable computer—revealing that simple neural network modules possess greater expressive power than previously thought and can perform complex tasks without relying on advanced architectures like Transformers.
Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song, Yufa Zhou (alphabetical order)
ICLR 2025
We introduce a novel LLM weight pruning method that directly optimizes for approximating the non-linear attention matrix—with theoretical convergence guarantees—effectively reducing computational costs while maintaining model performance.
Xuan Shen, Zhao Song, Yufa Zhou, Bo Chen, Jing Liu, Ruiyi Zhang, Ryan A. Rossi, Hao Tan, Tong Yu, Xiang Chen, Yufan Zhou, Tong Sun, Pu Zhao, Yanzhi Wang, Jiuxiang Gu
AAAI 2025
We present a training-free structural pruning method using Newton’s method and compensation algorithms to efficiently compress decoder-only transformer models, achieving state-of-the-art performance with reduced memory usage and faster generation on GPUs.
Xuan Shen, Zhao Song, Yufa Zhou, Bo Chen, Yanyu Li, Yifan Gong, Kai Zhang, Hao Tan, Jason Kuen, Henghui Ding, Zhihao Shu, Wei Niu, Pu Zhao, Yanzhi Wang, Jiuxiang Gu
AAAI 2025
We present LazyDiT, a framework that accelerates Diffusion Transformers by reusing computations from previous steps and dynamically skipping redundancies, achieving superior performance over existing methods like DDIM across multiple models and devices.
Yeqi Gao, Zhao Song, Xin Yang, Yufa Zhou (alphabetical order)
NeurIPS 2024 Workshop: Safe Generative AI
We propose an efficient algorithm to approximate the attention matrix in Transformer-based large language models with differential privacy guarantees, addressing security and privacy concerns by preventing leakage of sensitive information during inference—building on advancements in fast attention computation and differentially private matrix publishing.
Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou (alphabetical order)
arXiv 2024
We establish tight I/O complexity bounds for attention mechanisms in large language models across small and large cache sizes—confirming FlashAttention's optimality in large caches, improving algorithms for small caches, extending analysis to sparse attention, and offering insights for efficient LLM training and inference.
Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Yufa Zhou (alphabetical order)
NeurIPS 2024 Workshop: Optimization for Machine Learning
We prove that gradients in multi-layer transformer models can be computed in almost linear time $n^{1+o(1)}$ using a novel fast approximation method with polynomially small error, overcoming the quadratic complexity bottleneck of self-attention. This enables more efficient training and deployment of long-context language models, with support for general loss functions and common sub-modules such as residual connections, causal masks, and multi-head attention.
Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou (alphabetical order)
NeurIPS 2024 Workshop: Safe Generative AI
We present the first differential privacy (DP) data structure for cross-attention modules—securing sensitive information in key and value matrices across AI applications like retrieval-augmented generation and guided stable diffusion—with theoretical guarantees on privacy and efficiency, robustness to adaptive attacks, and potential to inspire future privacy designs in large generative models.
Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou (alphabetical order)
NeurIPS 2024 Workshop: Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning
We prove that, under bounded entries, the backward gradient of tensor attention can be computed in almost linear time—overcoming the $O(n^3)$ complexity barrier—and propose efficient methods to enable practical higher-order transformer training with tensor attention architectures.
Hui Liu, Lianxiong Chen, Yi Jiang, Dezhou Zhu, Yufa Zhou, Xinzhong Wang
Composite Structures 2023
We develop a multiscale optimization framework for graded lattice structures—both non-stochastic and stochastic—by modeling microstructures, optimizing macroscopic relative density, and reconstructing full-scale lattices, demonstrating mechanical advantages over traditional single-scale structures through analysis and experiments.