Publications

(2024). SuffixDecoding: A Model-Free Approach to Speeding Up Large Language Model Inference. arXiv 2024.

(2024). Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models. ACL 2024, Oral (Outstanding Paper Award 🏆).

(2024). SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification. ASPLOS 2024.

(2024). FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning. arXiv 2024.

(2024). Optimal Kernel Orchestration for Tensor Programs with Korch. ASPLOS 2024.

(2023). Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems. arXiv 2023.

(2023). Direct Telemetry Access. SIGCOMM 2023.

(2021). Zero-CPU Collection with Direct Telemetry Access. HotNets 2021.
