Research

Routing, evaluation, serving, policy, memory, and verification research behind Brain and the open-source vLLM Semantic Router.

15 documents / synced index

Papers and research records.

Synced from the Agentic Intelligence Lab research index.

2026 / Position paper

vLLM Semantic Router: Signal Driven Decision Routing for Mixture-of-Modality Models

Signal-driven routing across model pools, safety plugins, privacy policies, and cost-aware selection.

Authors: vLLM Semantic Router Team

2026 / Vision paper

The Workload-Router-Pool Architecture for LLM Inference Optimization: A Vision Paper from the vLLM Semantic Router Project

A synthesis of routing, fleet planning, multimodal, and governance results into one deployment architecture.

Authors include Huamin Chen, Xunzhuo Liu, Bowei He, Fuyuan Lyu, Yankai Chen, Xue Liu, Yuhan Liu, Junchen Jiang

2026 / Security

Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents

A defense-oriented treatment of perception failures in computer-use agents and click/action guardrails.

Authors include Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

2026 / Tool routing

Outcome-Aware Tool Selection for Semantic Routers: Latency-Constrained Learning Without LLM Inference

Latency-constrained learning for tool ranking under single-digit millisecond CPU budgets.

Authors include Huamin Chen, Xunzhuo Liu, Junchen Jiang, Bowei He, Xue Liu

2026 / VLM routing

Adaptive Vision-Language Model Routing for Computer Use Agents

Estimates action difficulty and routes each computer-use step to the cheapest model that meets reliability targets.

Authors include Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

2026 / Latency

98x Faster LLM Routing Without a Dedicated GPU: Flash Attention, Prompt Compression, and Near-Streaming for the vLLM Semantic Router

Flash attention, prompt compression, and near-streaming reduce routing latency from seconds to tens of milliseconds.

Authors include Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

2026 / Fleet planning

inference-fleet-sim: A Queueing-Theory-Grounded Fleet Capacity Planner for LLM Inference

A queueing-theory-grounded fleet planner for sizing multi-pool GPU fleets against P99 TTFT targets.

Authors include Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

2026 / Fleet planning

FleetOpt: Analytical Fleet Provisioning for LLM Inference with Compress-and-Route as Implementation Mechanism

An analytical method for deriving minimum-cost two-pool fleets from workload CDFs and P99 TTFT targets.

Authors include Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

2026 / Energy

The 1/W Law: An Analytical Study of Context-Length Routing Topology and GPU Generation Gains for LLM Inference Energy Efficiency

Context-length routing topology can matter more than pure GPU generation upgrades for tokens per watt.

Authors include Huamin Chen, Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

2026 / Policy

Conflict-Free Policy Languages for Probabilistic ML Predicates: A Framework and Case Study with the Semantic Router DSL

A framework for conflict detection when probabilistic ML predicates can silently co-fire in routing policy languages.

Authors include Xunzhuo Liu, Hao Wu, Huamin Chen, Bowei He, Xue Liu

2026 / Agent orchestration

From Inference Routing to Agent Orchestration: Declarative Policy Compilation with Cross-Layer Verification

A cross-layer extension of the Semantic Router DSL from stateless request routing into multi-step agent workflows.

Authors include Huamin Chen, Xunzhuo Liu, Bowei He, Xue Liu

2026 / Memory

Knowledge Access Beats Model Size: Memory Augmented Routing for Persistent AI Agents

Conversational memory and retrieval-grounded routing recover most of a 235B model's performance while cutting effective inference cost.

Authors include Xunzhuo Liu, Bowei He, Xue Liu, Andy Luo, Haichen Zhang, Huamin Chen

2026 / Verification

Fast and Faithful: Real-Time Verification for Long-Document Retrieval-Augmented Generation Systems

A real-time verification component for long-document RAG that performs grounding checks without truncating the content being validated.

Authors include Xunzhuo Liu, Bowei He, Xue Liu, Haichen Zhang, Huamin Chen

2025 / Reasoning routing

When to Reason: Semantic Router for vLLM

A semantic router that classifies queries by reasoning need and applies reasoning only when it is beneficial.

Authors include Chen Wang, Xunzhuo Liu, Yuhan Liu, Yue Zhu, Xiangxi Mo, Junchen Jiang, Huamin Chen

2025 / Semantic caching

Category-Aware Semantic Caching for Heterogeneous LLM Workloads

A category-aware semantic caching architecture where similarity thresholds, TTLs, and quotas vary by workload class.

Authors include Chen Wang, Xunzhuo Liu, Yue Zhu, Alaa Youssef, Priya Nagpurkar, Huamin Chen
