Yufa Zhou 周宇发
Master's Student @ UPenn

I am a second-year master's student at the University of Pennsylvania.

I have a profound interest in AI, encompassing theoretical, empirical, and even philosophical aspects. My current research focuses on LLM understanding (mechanisms, theory), optimization (acceleration, efficiency), and trustworthiness (safety, privacy, interpretability). I’m also open to exploring RAG, RLHF, agents, reasoning, and alignment. Feel free to connect with me!

Curriculum Vitae

Education
  • University of Pennsylvania
    M.S.E. in Scientific Computing
    Sep. 2023 - May 2025
  • Wuhan University
    B.E. in Engineering Mechanics
    Sep. 2019 - Jul. 2023
News
2025
  • Jan 22: One paper accepted at AISTATS 2025 and one paper accepted at ICLR 2025.
2024
  • Dec 09: Two papers accepted at AAAI 2025.
  • Oct 10: Four papers accepted at NeurIPS 2024 workshops.
Selected Publications
Beyond Linear Approximations: A Novel Pruning Approach for Attention Matrix

Yingyu Liang, Jiangxuan Long, Zhenmei Shi, Zhao Song, Yufa Zhou (alphabetical order)

ICLR 2025

We introduce a novel LLM weight pruning method that directly optimizes for approximating the non-linear attention matrix—with theoretical convergence guarantees—effectively reducing computational costs while maintaining model performance.

Numerical Pruning for Efficient Autoregressive Models

Xuan Shen, Zhao Song, Yufa Zhou, Bo Chen, Jing Liu, Ruiyi Zhang, Ryan A. Rossi, Hao Tan, Tong Yu, Xiang Chen, Yufan Zhou, Tong Sun, Pu Zhao, Yanzhi Wang, Jiuxiang Gu

AAAI 2025

We present a training-free structural pruning method using Newton’s method and compensation algorithms to efficiently compress decoder-only transformer models, achieving state-of-the-art performance with reduced memory usage and faster generation on GPUs.

Fine-grained Attention I/O Complexity: Comprehensive Analysis for Backward Passes

Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou (alphabetical order)

arXiv 2024

We establish tight I/O complexity bounds for attention mechanisms in large language models across small and large cache sizes—confirming FlashAttention's optimality in large caches, improving algorithms for small caches, extending analysis to sparse attention, and offering insights for efficient LLM training and inference.

Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time

Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Yufa Zhou (alphabetical order)

NeurIPS 2024 Workshop: Optimization for Machine Learning

We prove that gradients in multi-layer transformer models can be computed in almost linear time $n^{1+o(1)}$ using a novel fast approximation method with polynomially small error. This overcomes the quadratic complexity bottleneck of self-attention and enables more efficient training and deployment of long-context language models, and the result holds for general loss functions and common sub-modules such as residual connections, causal masks, and multi-head attention.

Differential Privacy of Cross-Attention with Provable Guarantee

Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou (alphabetical order)

NeurIPS 2024 Workshop: Safe Generative AI

We present the first differential privacy (DP) data structure for cross-attention modules—securing sensitive information in key and value matrices across AI applications like retrieval-augmented generation and guided stable diffusion—with theoretical guarantees on privacy and efficiency, robustness to adaptive attacks, and potential to inspire future privacy designs in large generative models.

All publications
Academic Services
  • Conference Reviewer: ICLR 2025, NAACL 2025, IJCAI 2025, ACL 2025.
  • Journal Reviewer: TKDE.