Yufa Zhou
Master's Student @ UPenn

I am a second-year master's student at the University of Pennsylvania, currently working as a research intern under the guidance of Yingyu Liang and Zhao Song.

I have a deep interest in AI, spanning its theoretical, empirical, and even philosophical aspects. My current research interests include LLM understanding, LLM acceleration, and safe generative AI. I am actively seeking research opportunities and a Ph.D. position for Fall 2025. Feel free to send me an email!

Curriculum Vitae

Education
  • University of Pennsylvania
    M.S.E. in Scientific Computing
    Sep. 2023 - May 2025
  • Wuhan University
    B.E. in Engineering Mechanics
    Sep. 2019 - Jul. 2023
Selected Publications (view all)
Looped ReLU MLPs May Be All You Need as Practical Programmable Computers

Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Yufa Zhou (alphabetical order)

arXiv 2024

We demonstrate that a looped 23-layer ReLU-MLP can function as a universal programmable computer—revealing that simple neural network modules possess greater expressive power than previously thought and can perform complex tasks without relying on advanced architectures like Transformers.
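
To make the "looped" architecture concrete, here is a minimal sketch of what it means to reuse a single ReLU MLP block as a repeated program step. The weights, dimensions, and loop count below are arbitrary placeholders for illustration, not the paper's 23-layer construction.

```python
# Illustrative sketch only: one ReLU MLP block applied in a loop, so the same
# weights act as a repeated "program step". Random placeholder weights; this is
# not the paper's 23-layer universal construction.
import numpy as np

rng = np.random.default_rng(0)
d = 16                                   # state dimension (arbitrary)
W1, b1 = 0.1 * rng.normal(size=(d, d)), np.zeros(d)
W2, b2 = 0.1 * rng.normal(size=(d, d)), np.zeros(d)

def mlp_block(x):
    """Two-layer ReLU MLP; looping it reuses the same weights every step."""
    return W2 @ np.maximum(0.0, W1 @ x + b1) + b2

x = rng.normal(size=d)                   # "program state" lives in the activations
for _ in range(10):                      # each loop iteration is one program step
    x = mlp_block(x)
print(x[:4])
```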

Fine-grained Attention I/O Complexity: Comprehensive Analysis for Backward Passes

Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou (alphabetical order)

arXiv 2024

We establish tight I/O complexity bounds for attention mechanisms in large language models across small and large cache sizes—confirming FlashAttention's optimality in large caches, improving algorithms for small caches, extending analysis to sparse attention, and offering insights for efficient LLM training and inference.
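
As background for the I/O-aware setting studied here, the sketch below shows a tiled attention forward pass in the style of FlashAttention, which streams K and V through a small cache instead of materializing the n x n score matrix. The shapes and block size are arbitrary, and this forward-pass toy does not reproduce the paper's backward-pass analysis.

```python
# Illustrative sketch: tiled, I/O-aware attention in the style of FlashAttention.
# K and V are streamed block by block, so the full n x n score matrix is never
# materialized. Shapes and block size are arbitrary; this toy forward pass is not
# the paper's backward-pass I/O analysis.
import numpy as np

def attention_naive(Q, K, V):
    S = Q @ K.T                                   # n x n scores in memory
    P = np.exp(S - S.max(axis=1, keepdims=True))
    return (P / P.sum(axis=1, keepdims=True)) @ V

def attention_tiled(Q, K, V, block=4):
    n, _ = Q.shape
    out, row_max, row_sum = np.zeros_like(Q), np.full(n, -np.inf), np.zeros(n)
    for j in range(0, n, block):                  # stream one K/V block at a time
        S = Q @ K[j:j + block].T                  # n x block partial scores
        new_max = np.maximum(row_max, S.max(axis=1))
        scale = np.exp(row_max - new_max)         # rescale previous partial results
        P = np.exp(S - new_max[:, None])
        out = out * scale[:, None] + P @ V[j:j + block]
        row_sum = row_sum * scale + P.sum(axis=1)
        row_max = new_max
    return out / row_sum[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 5)) for _ in range(3))
assert np.allclose(attention_naive(Q, K, V), attention_tiled(Q, K, V))
```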

Multi-Layer Transformers Gradient Can Be Approximated in Almost Linear Time

Yingyu Liang, Zhizhou Sha, Zhenmei Shi, Zhao Song, Yufa Zhou (alphabetical order)

NeurIPS 2024 Workshop: Optimization for Machine Learning

We prove that gradients in multi-layer transformer models can be computed in almost linear time $n^{1+o(1)}$ using a novel fast approximation method with polynomially small error, overcoming the quadratic complexity bottleneck of self-attention and enabling more efficient training and deployment of long-context language models with general loss functions and common sub-modules like residual connections, causal masks, and multi-head attention.
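
A minimal sketch of the generic low-rank idea behind such almost-linear-time results: if the n x n attention matrix is (approximately) a product of two thin factors, reassociating the matrix products avoids ever forming the n x n matrix. The random factors below are placeholders, not the paper's polynomial approximation.

```python
# Illustrative sketch of the generic low-rank trick: if the n x n attention matrix
# is (approximately) U1 @ U2.T with small rank r, then multiplying by V never
# requires the n x n matrix. Random placeholder factors, not the paper's method.
import time
import numpy as np

n, d, r = 4000, 64, 16
rng = np.random.default_rng(0)
U1, U2 = rng.normal(size=(n, r)), rng.normal(size=(n, r))
V = rng.normal(size=(n, d))

t0 = time.perf_counter()
direct = (U1 @ U2.T) @ V     # materializes n x n: O(n^2 d) time, O(n^2) memory
t1 = time.perf_counter()
fast = U1 @ (U2.T @ V)       # reassociated: O(n r d) time, no n x n matrix
t2 = time.perf_counter()

assert np.allclose(direct, fast)
print(f"direct: {t1 - t0:.3f}s  factored: {t2 - t1:.3f}s")
```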

Differential Privacy of Cross-Attention with Provable Guarantee

Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou (alphabetical order)

NeurIPS 2024 Workshop: Safe Generative AI

We present the first differential privacy (DP) data structure for cross-attention modules—securing sensitive information in key and value matrices across AI applications like retrieval-augmented generation and guided stable diffusion—with theoretical guarantees on privacy and efficiency, robustness to adaptive attacks, and potential to inspire future privacy designs in large generative models.
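
For readers new to differential privacy, the toy below shows the textbook Laplace mechanism, the kind of noise-calibration building block that DP guarantees rest on. It is a generic illustration only, not the paper's cross-attention data structure.

```python
# Generic illustration only: the Laplace mechanism, the textbook building block
# behind differential privacy. Not the paper's cross-attention data structure;
# it just shows noise calibrated to a query's sensitivity.
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng):
    """Release true_answer with epsilon-DP by adding Laplace(sensitivity/epsilon) noise."""
    return true_answer + rng.laplace(scale=sensitivity / epsilon)

rng = np.random.default_rng(0)
values = np.clip(rng.normal(size=100), -1.0, 1.0)   # pretend sensitive entries, clipped to [-1, 1]
true_mean = values.mean()                           # changing one entry shifts the mean by at most 2/100
private_mean = laplace_mechanism(true_mean, sensitivity=2.0 / len(values), epsilon=0.5, rng=rng)
print(true_mean, private_mean)
```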

Tensor Attention Training: Provably Efficient Learning of Higher-Order Transformers

Yingyu Liang, Zhenmei Shi, Zhao Song, Yufa Zhou (alphabetical order)

NeurIPS 2024 Workshop: Adaptive Foundation Models: Evolving AI for Personalized and Efficient Learning

We prove that, under bounded entries, the backward gradient of tensor attention can be computed in almost linear time—overcoming the $O(n^3)$ complexity barrier—and propose efficient methods to enable practical higher-order transformer training with tensor attention architectures.
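
To see where the $O(n^3)$ barrier comes from, the sketch below materializes a naive third-order score tensor in which each query attends to pairs of positions. The shapes and the scoring and pairing forms are simplified placeholders, not the paper's tensor attention formulation or its almost-linear-time algorithm.

```python
# Illustrative sketch of why naive tensor (third-order) attention costs O(n^3):
# each query attends to *pairs* of positions, so the score array has n^3 entries.
# Shapes and the scoring/pairing forms are simplified placeholders, not the
# paper's formulation or its almost-linear-time algorithm.
import numpy as np

n, d = 32, 8
rng = np.random.default_rng(0)
Q, K1, K2, V = (rng.normal(size=(n, d)) for _ in range(4))

# scores[i, j, k] couples query i with the position pair (j, k): n x n x n.
scores = np.einsum('id,jd,kd->ijk', Q, K1, K2)
weights = np.exp(scores - scores.max(axis=(1, 2), keepdims=True))
weights /= weights.sum(axis=(1, 2), keepdims=True)

# Aggregate a (placeholder) value for each pair: still O(n^3) work done naively.
pair_vals = V[:, None, :] + V[None, :, :]            # shape (n, n, d)
out = np.einsum('ijk,jkd->id', weights, pair_vals)
print(out.shape)                                     # (32, 8)
```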

All publications