Stop Wrestling PyTorch! Build Insane LLM Speed with pegainfer

What if everything you hate about LLM deployment—bloated frameworks, mysterious CUDA errors, Python GIL contention—could vanish overnight?

Picture this: you've spent six hours debugging why your PyTorch model crashes on a specific GPU driver version. Your Docker image is 8GB of dependencies you'll never use. Your inference latency spikes randomly because Python's garbage collector decided to wake up. Meanwhile, a competitor just shipped a feature that runs 91 tokens per second on consumer hardware with a binary smaller than your virtual environment.

That competitor isn't using magic. They're using pegainfer.

In an era where "AI infrastructure" defaults to stacking abstraction upon abstraction, xiaguan/pegainfer commits heresy: it throws PyTorch, ONNX, and every other framework runtime out of the serving path. What remains is ~9,600 lines of Rust, ~2,600 lines of CUDA, and ~1,400 lines of Triton GPU kernels, forming a from-scratch LLM inference engine that understands every layer of the stack because it built every layer.

This isn't a toy project. This is a production-capable, OpenAI-compatible server pushing ~91 tok/s throughput on an RTX 5070 Ti with BF16 precision and CUDA Graph optimization. No Python at runtime. No hidden framework magic. Just metal-meeting-metal performance with the safety guarantees Rust developers dream about.

Ready to see what LLM inference looks like when you strip away a decade of accumulated cruft? Let's dive deep.

What Is pegainfer?

pegainfer is a pure Rust and CUDA LLM inference engine created by xiaguan with a radical premise: understand every layer by building it from the ground up. Released as open-source under the MIT license, it represents a growing movement of systems programmers rejecting the "Python + PyTorch = default" assumption for production inference workloads.

The project emerged from a simple frustration: modern LLM deployment stacks are opaque, overweight, and unnecessarily complex. You don't need 4GB of PyTorch wheels to multiply matrices on a GPU. You don't need a Python interpreter in your hot path to serve tokens at scale. What you need is precise control over memory layout, kernel fusion, and scheduling—and that's exactly what Rust's zero-cost abstractions plus hand-tuned CUDA deliver.

Why it's trending now:

Post-PyTorch performance engineering: As LLMs commoditize, latency and throughput separate winners from losers. pegainfer makes the case that custom engines can beat generic frameworks.
Rust in AI infrastructure: The language's memory safety without garbage collection is irresistible for 24/7 inference services.
Consumer GPU optimization: The RTX 5070 Ti numbers show you don't need H100s for impressive local inference.
Educational transparency: Every kernel, every memory allocation, every scheduling decision is inspectable source code.

The architecture follows a per-model crate boundary design—each supported model (Qwen3-4B, Qwen3.5-4B, DeepSeek-V4-Flash) owns its complete inference pipeline including config parsing, weight loading, scheduler/executor logic, and kernel-specific optimizations. This eliminates the one-size-fits-all inefficiency of generic frameworks.

Key Features That Crush the Competition

Zero Runtime Python Dependency

Here's the secret most "fast" inference engines won't tell you: many still drag Python along for the ride. vLLM and plenty of other "optimized" solutions keep Python in their serving path. pegainfer uses Python only at build time for Triton AOT kernel compilation. At runtime? Pure Rust binary. No GIL. No GC pauses. No interpreter overhead.

Custom GPU Kernel Ecosystem

The engine doesn't just call cuBLAS and hope for the best. It orchestrates multiple kernel strategies:

Hand-written CUDA for decode-critical paths where every microsecond counts
Triton AOT compilation for Qwen3.5 compatibility kernels (fused operations that generic frameworks can't optimize)
TileLang-generated CUDA for DeepSeek V4's sparse attention kernels on the MP8 (8-way model-parallel) path
FlashInfer integration for battle-tested paged attention and sampling
NCCL for multi-GPU reductions on DeepSeek's 8-way model parallelism

BF16 Storage with FP32 Accumulation

Numerical stability without memory bloat. This hybrid precision approach matches what expensive enterprise frameworks do, implemented in ~200 lines of kernel code you can actually read.
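
To see why the accumulator precision matters, here is a minimal sketch in plain Python that emulates bfloat16 rounding with bit manipulation. It is not pegainfer's kernel code; the helper names (`to_bf16`, `to_f32`, `dot`) are made up for illustration. It sums 4096 products of BF16-stored ones, once with an FP32 accumulator and once with a BF16 accumulator.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (ties to even), returned as a float.

    NaN and infinity handling is omitted to keep the sketch short.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits of a float32; round on the 16 dropped bits.
    bias = 0x7FFF + ((bits >> 16) & 1)
    rounded = (((bits + bias) & 0xFFFFFFFF) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", rounded))[0]

def dot(a, b, round_acc):
    """Dot product with BF16-stored inputs and a configurable accumulator type."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_acc(acc + to_bf16(x) * to_bf16(y))
    return acc

ones = [1.0] * 4096
print(dot(ones, ones, to_f32))   # 4096.0 (FP32 accumulator)
print(dot(ones, ones, to_bf16))  # 256.0  (BF16 accumulator stalls)
```

The BF16 accumulator gets stuck at 256 because bfloat16 carries only 8 significant bits, so 256 + 1 rounds back to 256. Keeping the running sum in FP32 avoids that stall while the stored tensors stay at two bytes per element.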

CUDA Graph Optimization

For Qwen decode paths, pegainfer captures the kernel launch sequence of a decode step as a CUDA Graph and replays it, removing most of the per-kernel CPU launch overhead. That mainly helps per-token latency. Separately, the reported TTFT (Time To First Token) on short prompts is ~14ms, a figure that rivals commercial APIs.

Auto-Detecting Model Architecture

No manual configuration files. Point --model-path at any supported model directory; pegainfer reads config.json and instantiates the correct engine crate automatically. Qwen3's full attention? Handled. Qwen3.5's hybrid linear+full attention? Routed correctly. DeepSeek's MoE monstrosity? Feature-gated but functional.

OpenAI-Compatible API

Drop-in compatible with OpenAI's /v1/completions endpoint, including streaming SSE support. Temperature, top-k, and top-p sampling are all implemented in Rust, and tokenization never calls back into Python libraries at runtime.
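
For readers who have not implemented these sampling knobs before, the following is a rough sketch of how temperature, top-k, and top-p typically combine. It is written in Python for brevity and is not a port of pegainfer's Rust sampler. The function name `sample_token` and the order of the filters (top-k first, then top-p on the original probabilities) are assumptions; real engines differ in these details.

```python
import math
import random

def sample_token(logits, temperature=1.0, top_k=0, top_p=1.0, rng=random):
    """Pick a token index from raw logits using temperature, top-k and top-p."""
    if temperature <= 0.0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature-scaled softmax, shifted by the max for numerical stability.
    scaled = [l / temperature for l in logits]
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Candidates sorted from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # Draw from the surviving candidates, renormalized implicitly.
    threshold = rng.random() * sum(probs[i] for i in kept)
    running = 0.0
    for i in kept:
        running += probs[i]
        if threshold < running:
            return i
    return kept[-1]

logits = [1.0, 3.0, 2.0, 0.5]
print(sample_token(logits, temperature=0.0))  # 1
```

Passing `rng=random.Random(seed)` makes draws reproducible, which is handy when comparing outputs across runs.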

Use Cases Where pegainfer Dominates

1. High-Frequency Local Inference Services

Running a coding assistant or chatbot on-premise? Python-based solutions can suffer unpredictable latency spikes under concurrent load. pegainfer's Rust runtime has no garbage collector or interpreter in the hot path, which keeps memory usage and scheduling latency predictable. On a 16GB card such as the RTX 5070 Ti, the reported throughput is ~91 tokens/second.

2. Security-Sensitive Environments

Government, finance, healthcare: sectors where "pip install torch" triggers compliance nightmares. pegainfer's minimal runtime dependency surface (a single Rust binary plus CUDA drivers) shrinks your attack surface dramatically. Python and its PyPI wheels stay on the build machine; no conda environments or mystery wheels ship with the server.

3. Custom Model Research & Education

Want to understand exactly how GQA (Grouped Query Attention) changes your memory bandwidth? pegainfer's per-model crates expose every detail. The Qwen3-4B crate shows 32 query heads, 8 KV heads (4:1 GQA ratio), head_dim=128—all in readable Rust with inline kernel plans. Graduate students and researchers can trace execution from HTTP request to CUDA kernel launch.
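
The memory effect of that 4:1 ratio is easy to work out by hand. The sketch below uses only the figures quoted above (32 query heads, 8 KV heads, head_dim=128, BF16 at 2 bytes per element) and computes KV cache bytes per token for a single layer; the function name is illustrative, not taken from the repository.

```python
def kv_bytes_per_token_per_layer(num_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    """Bytes of K plus V written per token in one attention layer."""
    return 2 * num_kv_heads * head_dim * bytes_per_elem

gqa = kv_bytes_per_token_per_layer(num_kv_heads=8, head_dim=128)   # as in Qwen3-4B
mha = kv_bytes_per_token_per_layer(num_kv_heads=32, head_dim=128)  # if every query head had its own KV head
print(gqa, mha, mha // gqa)  # 4096 16384 4
```

Every decode step has to read the whole cache back, so a 4x smaller cache per token means roughly 4x less KV traffic per step at the same context length.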

4. Hybrid Attention Architecture Deployment

Qwen3.5's 24 linear attention + 8 full attention layers break generic frameworks that assume homogeneous transformer stacks. pegainfer's model-specific crates implement Gated Delta Rule recurrent state management for linear layers and full FlashInfer paged attention for standard layers—automatically, correctly, efficiently.
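
A small sketch of what such a heterogeneous stack looks like from the scheduler's point of view. The 24/8 split comes from the text above, but the interleaving used here (every fourth layer is full attention) is an assumption for illustration only; the real layout comes from the model's config.json.

```python
def layer_kinds(num_layers: int = 32, full_every: int = 4) -> list[str]:
    """Label each layer as 'linear' or 'full' attention.

    Assumption: every `full_every`-th layer uses full attention.
    """
    return ["full" if (i + 1) % full_every == 0 else "linear" for i in range(num_layers)]

kinds = layer_kinds()
# Full-attention layers need a KV cache that grows with context length;
# linear-attention layers carry a fixed-size recurrent state instead.
growing_cache_layers = [i for i, kind in enumerate(kinds) if kind == "full"]

print(kinds.count("linear"), kinds.count("full"))  # 24 8
print(growing_cache_layers[:3])                    # [3, 7, 11]
```

An engine that assumes one cache type per model has nowhere to put the second kind of state, which is why a per-model executor is a natural fit here.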

5. Multi-GPU MoE Inference (DeepSeek V4)

The 671B parameter / 37B active DeepSeek-V4-Flash model requires 8-way model parallelism with FP8/FP4 kernels. pegainfer's feature-gated deepseek-v4 path uses TileLang-generated kernels, NCCL all-reduces, and explicit multi-stage scheduling. This isn't theoretical—the initial greedy serving path works today on 8x GPU setups.

Step-by-Step Installation & Setup Guide

Prerequisites

Before building, ensure your system meets these requirements:

Rust (2024 edition) — install via rustup
CUDA Toolkit with nvcc and cuBLAS (tested with CUDA 12.x)
CUDA-capable GPU (RTX 5070 Ti or better recommended for full performance)
Python 3 + Triton — build-time only, for AOT kernel compilation
TileLang — only if building DeepSeek V4 support

One-Time Python Environment Setup

The Triton AOT compilation step requires a Python virtual environment with PyTorch installed (on Linux, the PyTorch CUDA wheels pull in Triton as a dependency):

```bash
# Create virtual environment
uv venv && source .venv/bin/activate

# Install PyTorch with CUDA 12.8 support
uv pip install torch --index-url https://download.pytorch.org/whl/cu128
```

Critical: This Python environment is only used during cargo build. The final binary has zero Python dependency.

Download Model Weights

```bash
# Using Hugging Face CLI
huggingface-cli download Qwen/Qwen3-4B --local-dir models/Qwen3-4B
```

Build and Launch Server

```bash
# Required environment variables
export CUDA_HOME=/usr/local/cuda
export PEGAINFER_TRITON_PYTHON=.venv/bin/python

# Build and run from workspace root
cargo run --release
```

The server starts on port 8000 by default. The actual CLI entrypoint lives in pegainfer-server; running from the workspace root automatically selects this package.

Windows-Specific Build

```powershell
$env:CUDA_PATH = "C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v12.x"
uv venv .venv --python 3.12
uv pip install "triton-windows<3.7"
$env:PEGAINFER_TRITON_PYTHON = ".venv\Scripts\python.exe"

cargo build --release
cargo run --release --bin pegainfer -- --model-path models/Qwen3-4B
```

Environment Variables Reference

| Variable | Purpose | Example |
| --- | --- | --- |
| `CUDA_HOME` | CUDA Toolkit path | `/usr/local/cuda` |
| `PEGAINFER_TRITON_PYTHON` | Python with Triton for build | `.venv/bin/python` |
| `PEGAINFER_TILELANG_PYTHON` | Python with TileLang for DeepSeek | `.venv/bin/python` |
| `PEGAINFER_CUDA_SM` | GPU SM override | `120` (for RTX 5070 Ti) |

Critical Performance Note

Always use --release. Debug builds disable compiler optimizations and run so slowly you'll think your GPU failed:

```bash
# CORRECT: Optimized build
cargo run --release

# WRONG: 10-100x slower, only for debugging kernel crashes
cargo run
```

REAL Code Examples from the Repository

Let's examine actual code patterns from pegainfer's README and source, with detailed explanations of what makes them work.

Example 1: Basic Completions API Call

The simplest way to interact with a running pegainfer server:

```bash
# Standard non-streaming completion
curl -s http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "The capital of France is", "max_tokens": 32}'
```

This returns a JSON response matching OpenAI's /v1/completions format. The server parses this through pegainfer-server/src/vllm_frontend.rs, which bridges the HTTP layer to the generic EngineHandle abstraction. The prompt string gets tokenized via the model's native tokenizer (loaded from the model directory's tokenizer.json), then routed to the appropriate per-model engine crate—in this case, pegainfer-qwen3-4b.

The max_tokens: 32 parameter controls the generation length, defaulting to 128 if omitted. With temperature: 0.0 (the default), this performs greedy decoding—always selecting the highest-probability next token. This deterministic mode is actually the fastest path, avoiding sampling overhead.
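
If you would rather call the endpoint from a script than from curl, a standard-library Python client is enough. This is a sketch under assumptions: the helper names are invented, the URL is the default from the curl example, and the response is assumed to follow OpenAI's completions shape (`choices[0].text`), which the article implies but which is not verified here.

```python
import json
import urllib.request

def build_completion_request(prompt, max_tokens=32, temperature=0.0, stream=False):
    """Assemble the JSON body for POST /v1/completions."""
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,  # 0.0 selects greedy decoding
        "stream": stream,
    }

def complete(prompt, base_url="http://localhost:8000", **options):
    """Send a non-streaming completion request and return the parsed JSON reply."""
    body = json.dumps(build_completion_request(prompt, **options)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running, `complete("The capital of France is")["choices"][0]["text"]` should return the generated continuation, assuming the OpenAI-style response layout holds.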

Example 2: Streaming SSE Output

For interactive applications, streaming is essential. pegainfer implements Server-Sent Events:

```bash
# Streaming with -N flag to disable curl buffering
curl -N http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write a haiku about Rust:", "max_tokens": 64, "stream": true}'
```

The -N flag matters: without it, curl may buffer the output, so chunks arrive in bursts or only when generation completes. The "stream": true parameter triggers a different code path in pegainfer-server/src/main.rs: instead of collecting all tokens before responding, the engine yields each generated token as it's produced, formatted as data: {...}\n\n SSE messages.

Behind the scenes, the per-model executor runs decode iterations in a tight loop, emitting TokenEvent structs through an async channel. The vllm_frontend.rs bridge consumes these events and formats them into OpenAI-compatible SSE chunks. Latency from token generation to network transmission is typically under 1 millisecond for local connections.
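
On the client side, consuming that stream boils down to reading `data:` lines and decoding each JSON payload. The sketch below assumes OpenAI-style chunks with a `choices[0].text` field and a final `data: [DONE]` sentinel; both are conventions of the OpenAI API rather than something confirmed for pegainfer, and the function names are illustrative.

```python
import json

def iter_sse_payloads(lines):
    """Yield decoded JSON payloads from an iterable of SSE text lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators, comments, and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # assumed end-of-stream sentinel
            break
        yield json.loads(data)

def collect_text(lines):
    """Concatenate the text fragments carried by each streamed chunk."""
    return "".join(chunk["choices"][0].get("text", "") for chunk in iter_sse_payloads(lines))

sample_stream = [
    'data: {"choices": [{"text": "Hello"}]}',
    "",
    'data: {"choices": [{"text": " world"}]}',
    "",
    "data: [DONE]",
]
print(collect_text(sample_stream))  # Hello world
```

A production client would read these lines incrementally from the HTTP response and handle multi-line `data:` events, which this sketch leaves out.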

Example 3: DeepSeek V4 Multi-GPU Launch

For the most demanding workload, DeepSeek-V4-Flash requires explicit feature flags and multi-GPU setup:

```bash
# Install TileLang for kernel generation
uv pip install "tilelang==0.1.9"
export PEGAINFER_TILELANG_PYTHON=.venv/bin/python

# Launch with 8-way model parallelism
cargo run --release --features deepseek-v4 -- \
  --model-path models/DeepSeek-V4-Flash
```

This command activates three complex systems simultaneously:

TileLang kernel generation: At build time, TileLang generates FP8/FP4 CUDA kernels for DeepSeek's sparse attention patterns. These are compiled into the binary via the PEGAINFER_TILELANG_PYTHON toolchain.
Feature-gated compilation: The --features deepseek-v4 flag enables the pegainfer-deepseek-v4 crate, which pulls in MP8 (Model Parallelism 8-way) code paths, NCCL multi-GPU reduction logic, and MoE routing kernels excluded from default builds to keep binary size manageable.
Implicit multi-GPU: The DeepSeek crate automatically targets CUDA devices 0 through 7. The 671B parameter model uses 37B active parameters per forward pass, distributed across GPUs with expert parallelism for the MoE layers.

Current limitations to note: DeepSeek support is intentionally narrower than Qwen paths—greedy decoding only, no CUDA Graph yet, and explicit stop_reason responses for unsupported parameters like temperature > 0 or logprobs requests.

Example 4: E2E Regression Testing

Validate your build against known-good outputs:

```bash
# Qwen3-4B exact match regression test
PEGAINFER_TEST_MODEL_PATH=models/Qwen3-4B \
  cargo test --release -p pegainfer-qwen3-4b --test e2e

# Qwen3.5 hybrid attention verification
PEGAINFER_TEST_MODEL_PATH=models/Qwen3.5-4B \
  cargo test --release -p pegainfer-qwen35-4b --test e2e

# DeepSeek V4 multi-GPU smoke test
PEGAINFER_TEST_MODEL_PATH=models/DeepSeek-V4-Flash \
  cargo test --release -p pegainfer-deepseek-v4 \
  --features deepseek-v4 --test e2e
```

These tests verify bit-exact output matching against pre-generated regression data. The PEGAINFER_TEST_MODEL_PATH environment variable tells each model crate where to find weights. Running with --release ensures you're testing the actual optimized code path, not a debug approximation.

The Qwen3.5 test is particularly valuable—it validates that the hybrid attention implementation (24 linear + 8 full layers) produces identical outputs to a reference implementation, confirming the Gated Delta Rule recurrent state management works correctly across layer type transitions.

Advanced Usage & Best Practices

CUDA Graph Debugging

When kernel launches fail mysteriously, disable CUDA Graph to get readable error traces:

```bash
cargo run --release -- --cuda-graph=false
```

This falls back to individual kernel launches. You'll lose ~10-15% throughput but gain actionable cuda-gdb stack traces.

Memory-Constrained Deployment

For GPUs with less than 16GB VRAM, monitor the paged KV cache in pegainfer-core/src/kv_pool.rs. The default page size is tuned for RTX 5070 Ti; smaller GPUs may need custom page counts. Currently this requires source modification—future releases may expose runtime configuration.
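
Before touching the source, it helps to estimate how many KV pages a given VRAM budget can hold. The sketch below is back-of-the-envelope arithmetic, not code from kv_pool.rs. The KV head count, head_dim, and BF16 width come from the Qwen3-4B figures earlier in this article; the layer count (36), the page size (16 tokens), and the 4 GiB budget are assumptions chosen for illustration.

```python
def kv_page_budget(vram_budget_bytes: int, page_tokens: int, bytes_per_token: int) -> tuple[int, int]:
    """Return (number of whole pages, total cacheable tokens) for a VRAM budget."""
    page_bytes = page_tokens * bytes_per_token
    pages = vram_budget_bytes // page_bytes
    return pages, pages * page_tokens

# K and V, 8 KV heads, head_dim 128, 2 bytes per BF16 element, 36 layers (assumed).
bytes_per_token = 2 * 8 * 128 * 2 * 36
pages, tokens = kv_page_budget(4 * 1024**3, page_tokens=16, bytes_per_token=bytes_per_token)
print(bytes_per_token, pages, tokens)  # 147456 1820 29120
```

The token total is shared across all concurrent requests, so divide it by your expected context length to get a rough concurrency ceiling.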

Custom Kernel Development

Adding a new model? Study pegainfer-qwen3-4b/src/kernel_plan.rs—this "model DAG phase → kernel routing index" is your template. The architecture separates what to compute (model crate) from how to compute on GPU (pegainfer-kernels), enabling rapid experimentation with new attention variants.

Production Monitoring

The server exposes structured logging via pegainfer-server/src/logging.rs. Pipe to your favorite observability stack. Key metrics to watch: TTFT p99 (should stay under 50ms), TPOT consistency (variance indicates scheduling contention), and KV cache utilization (page fragmentation).
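
If your logs carry per-token timestamps, the first two metrics can be derived offline with a few lines of Python. This is a sketch with made-up helper names and a simple nearest-rank percentile; it does not reflect the actual log format emitted by logging.rs.

```python
import math

def ttft_ms(request_start: float, token_times: list[float]) -> float:
    """Time To First Token in milliseconds (timestamps in seconds)."""
    return (token_times[0] - request_start) * 1000.0

def tpot_ms(token_times: list[float]) -> float:
    """Mean Time Per Output Token after the first token, in milliseconds."""
    if len(token_times) < 2:
        return 0.0
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1) * 1000.0

def percentile(values, pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=99 for p99."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# One request: started at t=0.0s, tokens emitted at these times.
times = [0.014, 0.025, 0.036, 0.047]
print(round(ttft_ms(0.0, times), 3), round(tpot_ms(times), 3))  # 14.0 11.0
```

Collect `ttft_ms` across requests and feed the list to `percentile(..., 99)` to compare against the 50ms guideline above.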

Build Cache Optimization

Triton AOT compilation is slow. Set CARGO_TARGET_DIR to a persistent SSD location and use sccache for CUDA object files. CI builds benefit enormously from caching the .triton kernel cache directory.

Comparison with Alternatives

| Feature | pegainfer | vLLM | TensorRT-LLM | llama.cpp |
| --- | --- | --- | --- | --- |
| Runtime Language | Rust | Python/C++ | C++ | C/C++ |
| PyTorch Dependency | None | Required | None | None |
| Python at Runtime | None | Yes (GIL-bound) | Optional (C++ runtime available) | None |
| Custom Kernels | CUDA + Triton + TileLang | CUDA (PagedAttention) | TensorRT plugins | Hand-optimized CPU/GPU |
| CUDA Graphs | Yes (Qwen paths) | Yes | Yes | Limited |
| BF16 Native | Yes | Yes | Yes | Yes (alongside FP16 and quantized formats) |
| Multi-GPU MoE | Yes (DeepSeek V4) | Yes | Limited | No |
| Hybrid Attention | Yes (Qwen3.5) | No | No | No |
| Binary Size | ~50MB + model | ~4GB+ (Python env) | ~500MB | ~5MB |
| OpenAI API | Native | Native | Via Triton Inference Server | Via separate server |
| Throughput (RTX 5070 Ti) | ~91 tok/s | ~75 tok/s | ~85 tok/s | ~45 tok/s (GPU) |
| Code Transparency | Complete | Partial | Partial (open source, built on closed-source TensorRT) | Complete |

Why choose pegainfer?

Over vLLM: Eliminate Python overhead and GIL contention; understand every kernel in your serving path.
Over TensorRT-LLM: Avoid depending on the closed-source TensorRT compiler and its black-box optimizations; customize attention mechanisms freely.
Over llama.cpp: Achieve roughly 2x higher GPU throughput in the figures above with modern CUDA kernel fusion and engines tuned per model.

The trade-off? Smaller model ecosystem (3 architectures vs. hundreds) and you compile from source. For teams prioritizing performance transparency and deployment simplicity, pegainfer's trade-offs are compelling.

FAQ

Is pegainfer production-ready?

For Qwen3 and Qwen3.5 models: yes, with greedy and sampling support, CUDA Graph optimization, and OpenAI-compatible API. DeepSeek V4 is feature-gated and intentionally narrower—greedy only, no CUDA Graph, 8-GPU requirement. Evaluate against your specific use case.

Why Rust instead of C++ for the runtime?

Memory safety without garbage collection, fearless concurrency for request scheduling, and Cargo's dependency management. The ~9.6K lines of Rust stand in for what could plausibly be 20K+ lines of C++, with comparable performance plus compile-time guarantees against data races.

Can I run this without NVIDIA GPUs?

No. CUDA is mandatory. The Triton kernels compile to CUDA PTX; there's no ROCm or CPU fallback path. This is explicitly a CUDA-first project.

How does model auto-detection work?

The server reads config.json from your --model-path directory, matching architectures and model_type fields against known configurations. Each model crate registers its detector in pegainfer-server/src/server_engine.rs. Add a new model by implementing the ModelConfig trait and registering in the detection router.
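
The detection idea itself is simple enough to sketch in a few lines. The version below is in Python for readability and is not a transcription of server_engine.rs. The crate names come from this article; `"qwen3"` is the `model_type` Hugging Face uses for Qwen3, while the other two keys are placeholders, since the real strings are not given here.

```python
import json
from pathlib import Path

# model_type -> engine crate. Only the "qwen3" key is a known Hugging Face value;
# the other keys are placeholders for illustration.
ENGINE_BY_MODEL_TYPE = {
    "qwen3": "pegainfer-qwen3-4b",
    "qwen3_5": "pegainfer-qwen35-4b",
    "deepseek_v4": "pegainfer-deepseek-v4",
}

def detect_engine_from_config(config: dict) -> str:
    """Map a parsed config.json to the name of the engine that should serve it."""
    model_type = config.get("model_type")
    engine = ENGINE_BY_MODEL_TYPE.get(model_type)
    if engine is None:
        architectures = config.get("architectures", [])
        raise ValueError(f"unsupported model_type={model_type!r}, architectures={architectures}")
    return engine

def detect_engine(model_dir: str) -> str:
    """Read <model_dir>/config.json and pick an engine."""
    config = json.loads((Path(model_dir) / "config.json").read_text(encoding="utf-8"))
    return detect_engine_from_config(config)

print(detect_engine_from_config({"model_type": "qwen3", "architectures": ["Qwen3ForCausalLM"]}))
# pegainfer-qwen3-4b
```

Failing loudly on an unknown `model_type` is a deliberate choice in this sketch: silently falling back to a default engine would load weights into the wrong layout.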

What's the catch with "no Python at runtime"?

Tokenizer initialization still requires the model's tokenizer.json (standard Hugging Face format). This is parsed in Rust via the tokenizers crate—no Python involved. The build-time Python dependency (Triton/TileLang) never touches your serving binary.

How do I add INT4/INT8 quantization?

Currently unimplemented—marked explicitly in the README's "What's not (yet) implemented" section. The BF16 + FP32 accumulation path is production-ready; quantization would require new kernel plans in each model crate and potential calibration data pipelines.

Can I use this commercially?

Yes. The project is MIT licensed, so the only obligation is to keep the copyright and license notice with any copies of the software. The creator explicitly encourages production use and contributions.

Conclusion

pegainfer is what happens when a systems programmer looks at LLM deployment and asks: "what's actually necessary here?" The answer—~13,600 lines of Rust, CUDA, and Triton—delivers 91 tokens per second on consumer hardware with a binary smaller than a single PyTorch wheel.

This isn't just about raw speed. It's about owning your inference stack—understanding every kernel launch, every memory allocation, every scheduling decision. In an industry drowning in abstraction, pegainfer's radical transparency is both educational tool and competitive weapon.

For researchers, it demystifies attention mechanisms by exposing them in readable Rust. For engineers, it eliminates entire categories of production failures by removing Python from the runtime. For the curious, it proves that building from scratch still beats stacking frameworks in 2025.

The project is actively developed, with DeepSeek V4 support expanding and new quantization modes on the roadmap. Whether you're optimizing edge deployment, studying transformer internals, or simply tired of debugging conda environments—star pegainfer on GitHub, build it, benchmark it against your current stack.

The future of LLM inference isn't bigger frameworks. It's smaller, faster, comprehensible systems. pegainfer is already there.