Meeting

November 15, 2025
Category: Meeting

AI/ML track

dynamia.ai

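DRA (Dynamic Resource Allocation)

Allocation granularity is a whole GPU; when the card hosts a small model, GPU memory utilization is low (~5%)
Small models are more stable in vertical domains and hallucinate less
AWS: offers scheduling at half-GPU granularity. Pain point: that is still too coarse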

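dynamia.ai: heterogeneous GPU resource scheduling (mainly handles resource requests for north-south traffic)

Scheduler layer: scheduler extensions
Container layer: HAMi-core symbol hijacking

Volcano

Also uses HAMi-core
Stronger at task-level scheduling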

multi-agent

Why agents: LLM applications are clearly the main trend in AI

Function calling was a turning point: it lets the model use existing tools
Modern agents:
- Task decomposition
- Tool selection
- Memory
- Multi-agent

strands-agents: an open-source SDK from AWS

Built-in tools: Python tools, MCP tools, pre-built tools
A large set of tools with better support for and integration with cloud services
Supports multiple types of models

Orchestrator: agents as tools, with every

Agent graph: chain several tools into a larger unit
Graph vs. workflow

Swarm mode: decentralized; agents share context

Structured output & function calling in vLLM

choice: choice = ["red", "green"]
regex
json: JSON schema
grammar: SQL; EBNF / Lark grammars

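Emitting JSON in the middle of the output

structural tag

The CPU bubble problem with guided output

How it works

In the sampler, tokens that do not satisfy the constraint are removed
A state machine tracks which tokens are valid at any given point in time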
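
The two points above can be illustrated with a toy example. This is only a minimal sketch, not vLLM's implementation: the four-token vocabulary, the `FSM` table, and the helper names `mask_logits` and `greedy_decode` are all made up for illustration. It assumes the constraint "output exactly one of red/green, then stop".

```python
import math

# Toy vocabulary (hypothetical); real tokenizers have far more entries.
VOCAB = ["red", "green", "blue", "<eos>"]

# Hypothetical state machine for the constraint.
# Maps state -> {allowed token: next state}.
FSM = {
    "start": {"red": "done", "green": "done"},
    "done": {"<eos>": "end"},
}

def mask_logits(logits, state):
    """Set the logit of every token the FSM forbids in `state` to -inf."""
    allowed = FSM[state]
    return [l if tok in allowed else -math.inf for tok, l in zip(VOCAB, logits)]

def greedy_decode(step_logits):
    """Greedy sampling over pre-computed per-step logits, with masking."""
    state, out = "start", []
    for logits in step_logits:
        if state == "end":
            break
        masked = mask_logits(logits, state)
        tok = VOCAB[max(range(len(VOCAB)), key=masked.__getitem__)]
        out.append(tok)
        state = FSM[state][tok]  # advance the state machine
    return out

# The unconstrained model would prefer "blue" at step 1; the mask removes it.
print(greedy_decode([[0.1, 0.5, 2.0, 0.0], [3.0, 0.2, 0.1, 0.4]]))
# ['green', '<eos>']
```

Computing the mask is CPU work that sits between GPU steps, which is one plausible reading of the "CPU bubble" noted above.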

tool parser

Template rendering

RoleBasedGroup

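Iteration: is PD (prefill/decode) disaggregation necessarily the architecture of the future? Needs to stay extensible
Statefulness:
Reliable operation

RBG is a workload designed for scenarios where multiple roles cooperate

Creation, scheduling, upgrade, self-healing, and service discovery across multiple roles

Ant Group: Gateway API Inference Extension (GIE)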

What is cloud native?

What is `

genai-bench

llm-optimizer SLO

The idea of goodput: count only the throughput that satisfies the SLO constraints
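
A small worked example of the goodput idea, as a sketch only: the `Request` fields, the SLO thresholds, and the sample numbers are assumptions chosen for illustration, not values from the talk or from genai-bench or llm-optimizer.

```python
from dataclasses import dataclass

@dataclass
class Request:
    ttft_s: float  # time to first token, in seconds
    tpot_s: float  # mean time per output token, in seconds

def goodput(requests, duration_s, ttft_slo_s, tpot_slo_s):
    """Requests per second that met every SLO during the measurement window."""
    ok = sum(
        1 for r in requests
        if r.ttft_s <= ttft_slo_s and r.tpot_s <= tpot_slo_s
    )
    return ok / duration_s

# Hypothetical measurements collected over a 2-second window.
reqs = [
    Request(0.2, 0.03),  # meets both SLOs
    Request(0.9, 0.03),  # TTFT too slow
    Request(0.3, 0.08),  # TPOT too slow
    Request(0.4, 0.04),  # meets both SLOs
]
print(len(reqs) / 2.0)                                                 # 2.0
print(goodput(reqs, duration_s=2.0, ttft_slo_s=0.5, tpot_slo_s=0.05))  # 1.0
```

Raw throughput is 2.0 requests/s here, but only half of the requests met both SLOs, so goodput is 1.0 requests/s.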

GuideLLM, from the vLLM project

StepFun's paper

Counters

SGLang router

AIGW

The business architecture has not converged yet; the infrastructure


November 8, 2025
Category: Meeting

Note

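A talk on SGLang, another inference framework
AIBrix: Volcano Engine's open-source end-to-end inference framework, mainly operating at the cluster level
Elastic EP: a fault-tolerant design with more flexible scaling, built on Mooncake and the mooncake-pytorch backend
DGX Spark: a promotional pitch
aiconfigurator: an NVIDIA tool that computes a PD-disaggregation configuration from your requirements (about a hundred stars on GitHub)
Triton-distributed: a compiler/framework for large-scale distributed settings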

Main Takeaway

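1. Small, hobby-scale systems really do not need much attention to observability; observability matters when you serve users and want to go to production smoothly
2.

SGLang community

Some new features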

Hierarchical KV Caching
- backend support: mooncake, 3FS, NIXL
Piecewise CUDA Graph and Torch Compile
What is an inference backend, and which ones are common?
What communication primitives are there? What is all-reduce?
overlap scheduling with spec decoding

milestone & roadmap
- want to support more orthogonal features
- turn these methods into a callable library so that everyone can use it

Dynamo PD planner; SGLang Model Gateway; vLLM Semantic Router

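AIBrix

AIBrix (Volcano Engine): performance and cost

Advantages

Production-oriented and validated in large-scale online environments. Deploying LLMs in production is complex: rate limiting, fault tolerance, elasticity
Open source and extensible
Full inference stack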

Distributed / disaggregated:

Why is disaggregated deployment needed?

Parallelism: TP, DP, EP

The model's [...] -> heterogeneous resource needs ->

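Dense/sparse disaggregation

KV cache architecture
Supports multiple storage backends

LLM deployment optimization: K8s control plane + data plane

Search, advertising, and recommendation systems are split by pipeline stage rather than by model

Disaggregation approaches: centralized; P2P

Request routing for PD disaggregation
Orchestration of PD requests

KV cache offloading is delegated to a third-party manager; remote RDMA lowers TTFT

Different strategies are supported as plugins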

Cost

Autoscaling: CPU utilization, latency

Traditional metrics have limitations

QPS does not rise, yet latency and utilization both do

Metrics designed specifically for scaling: 2508.19559
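
To make the limitation concrete, here is a sketch that uses the scaling rule from the Kubernetes Horizontal Pod Autoscaler documentation, `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`. The metric values are hypothetical, and the sketch ignores the HPA's tolerance band and stabilization window; it is not AIBrix's autoscaler.

```python
import math

def desired_replicas(current_replicas, current_value, target_value):
    """Kubernetes HPA scaling rule, without tolerance or stabilization."""
    return math.ceil(current_replicas * current_value / target_value)

# Hypothetical situation: prompts became longer, so QPS per replica is
# unchanged while latency doubled relative to its target.
by_qps = desired_replicas(4, current_value=10, target_value=10)
by_latency = desired_replicas(4, current_value=2.0, target_value=1.0)
print(by_qps)      # 4
print(by_latency)  # 8
```

Scaling on QPS alone leaves the deployment at 4 replicas even though every request is slower, which is the failure mode described above.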

Cost reduction through LoRA fine-tuning and support for long-tail models: 70 models carry 30% of the traffic; vLLM 的模型恭喜爱过你

How to use it in a serverless fashion

aibrix

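Elastic EP

Multi-model serving

More flexible scaling

An EP scheme that tolerates occasional failures of some ranks; execution alternates compute, communication, compute, communication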

card lost

qps=16

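Mooncake EP: a failure-aware communication library built on point-to-point GPU RDMA. Mooncake PyTorch backend: fault-tolerant communication primitives

An EPLB algorithm that handles partial rank failure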

Desktop AI for individual developers, aimed at LLM development

Large memory
AI software
CUDA support

Two machines linked over RDMA can run a 405B model in FP4

AI Configurator

Welcome to AIBrix — AIBrix

Pain point 1: is PD disaggregation actually better?
Pain point 2: how should PD be configured?

Parallelism strategies

Triton-distributed

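VLM optimization

VLM data caching: scenarios with many repeated images, such as computer-use agents; embodied AI: at time T, one sensor image plus n low-resolution images

Tie the optimization to the concrete scenario

Multimodal data serialization

ZMQ

PyTorch internals from a safety perspective

There are many operator libraries to choose from

Multimodal models run very few decode steps, so configurations tuned for LLMs may need to be revisited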

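SGLang on Hopper (96 GB)

Intra-node communication is fast and inter-node communication is slow; compute is the bottleneck

TP8 deployment

SLO requirements

Tokens arrive out of order. Down GEMM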

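EPLB: if experts that should be activated together are placed on two different machines, the overhead is much higher; two experts that are activated with high probability end up on the same GPU, activation 5%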

Expert load balancing migrates weights, which causes latency spikes that affect serving

Async rebalance mitigates this problem
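
The load-balancing half of the problem (not putting two hot experts on the same GPU) can be sketched with a simple greedy heuristic. This is an illustration under stated assumptions, not the actual EPLB algorithm: `place_experts`, the per-expert token counts, and the slot limit are hypothetical, and the sketch ignores co-activation and inter-node communication cost.

```python
import heapq

def place_experts(expert_load, num_gpus, slots_per_gpu):
    """Greedy placement: hottest expert first, always onto the
    least-loaded GPU that still has a free slot."""
    if len(expert_load) > num_gpus * slots_per_gpu:
        raise ValueError("not enough slots for all experts")
    heap = [(0, gpu) for gpu in range(num_gpus)]  # (accumulated load, gpu id)
    heapq.heapify(heap)
    placement = {gpu: [] for gpu in range(num_gpus)}
    for expert in sorted(expert_load, key=expert_load.get, reverse=True):
        load, gpu = heapq.heappop(heap)
        placement[gpu].append(expert)
        # A full GPU is not pushed back, so it receives no more experts.
        if len(placement[gpu]) < slots_per_gpu:
            heapq.heappush(heap, (load + expert_load[expert], gpu))
    return placement

# Hypothetical routed-token counts per expert.
loads = {"e0": 400, "e1": 300, "e2": 100, "e3": 100, "e4": 50, "e5": 50}
placement = place_experts(loads, num_gpus=2, slots_per_gpu=3)
per_gpu = {g: sum(loads[e] for e in es) for g, es in placement.items()}
print(placement)
print(per_gpu)
```

The two hottest experts land on different GPUs. Applying a new placement means moving expert weights, which is where the latency spikes mentioned above come from.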

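EP16: 50% of the traffic goes over intra-node communication

FlashMLA backend: the masking mechanism forces topk to be 1

On GPUs with limited compute, the batch size is relatively small

How to optimize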

SBO & TPO
