
Lifelong Agents with Small Language Models

2026-04-26 - - carry~强

Lifelong agents: small language models

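Lifelong agents are becoming universal. One example is OpenClaw, a personal AI assistant built by Peter Steinberger and the open-source community. It runs on the user's machine, talks through WhatsApp, Slack, and Telegram, keeps persistent memory across conversations (preferences, facts, decisions), takes actions (controls the browser, runs scripts, sets reminders), and is model-agnostic (Claude, GPT, or a local model).

Most deployments today still call a frontier API, however, and the talk argues that a frontier API is the wrong substrate for a lifelong agent, for three reasons. Cost: billions of [tasks × users × interactions] make per-call pricing untenable. Latency and privacy: the agent has to be near the user. Personalization: a single hosted model cannot be many users at once. A lifelong agent must therefore run on the user's device, which makes a small language model the only realistic deployment target.

A lifelong agent adapts at three granularities: domain, user, and interaction. The talk covers one line of work for each: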

1. Specialization: A3

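Small open-weight agents trail frontier models by 20+ percentage points on web tasks, and standard SFT distillation from demonstrations overfits to the training tasks. A3 (Agent-as-Annotators) replaces three human annotation roles with LLM modules built on Gemini-3-Pro: the task designer becomes a persona and task generator, the annotator becomes an agent that produces a trajectory with its reasoning trace, and the supervisor becomes a judge that issues pass/fail. The 9B student is fine-tuned on the judge-filtered trajectories and reaches performance comparable to a 27B model.

A3 raises the 9B model to 41.5% on WebArena, on par with a 27B model and 5.1 percentage points above frontier models such as Claude 3.5 Sonnet.
A3 generalizes across web environments rather than overfitting to the training tasks.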
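
The judge-filtering step described above can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the `Rollout` fields, `keyword_judge` (a toy stand-in for an LLM judge), and the serialization format in `to_sft_example` are all assumptions made for the example. Only the counts 2,322 and 3,000 come from the talk's slides.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    """One annotated attempt; all field names are illustrative."""
    persona: str            # produced by the task-designer module
    task_intent: str        # what the persona wants done
    evaluation_hints: str   # hints the judge uses to decide pass/fail
    trajectory: List[str]   # the agent's actions, in order
    reasoning: List[str]    # the agent's reasoning trace, kept intact


def filter_rollouts(rollouts: List[Rollout],
                    judge: Callable[[Rollout], bool]) -> List[Rollout]:
    """Keep only the rollouts the judge marks as successful."""
    return [r for r in rollouts if judge(r)]


def keyword_judge(rollout: Rollout) -> bool:
    """Toy stand-in for the LLM judge: pass if the hint text appears
    in the final action. A real judge would be a model call."""
    if not rollout.trajectory:
        return False
    return rollout.evaluation_hints in rollout.trajectory[-1]


def to_sft_example(rollout: Rollout) -> dict:
    """Serialize a kept rollout, pairing each reasoning step with its action."""
    steps = [f"THOUGHT: {thought}\nACTION: {action}"
             for thought, action in zip(rollout.reasoning, rollout.trajectory)]
    return {"prompt": f"{rollout.persona}\n{rollout.task_intent}",
            "completion": "\n".join(steps)}


rollouts = [
    Rollout("store admin", "cancel order 17", "order 17 cancelled",
            ["click(orders)", "stop(order 17 cancelled)"],
            ["open the order list", "the order now shows as cancelled"]),
    Rollout("store admin", "cancel order 18", "order 18 cancelled",
            ["click(orders)", "stop(gave up)"],
            ["open the order list", "cannot find the order"]),
]
kept = filter_rollouts(rollouts, keyword_judge)
sft_data = [to_sft_example(r) for r in kept]

# Judge-pass rate implied by the counts reported on the slides.
yield_rate = 2322 / 3000
```

With these toy inputs only the first rollout survives the filter. The reported counts correspond to a pass rate of 77.4%, so roughly a quarter of the attempts are discarded before fine-tuning.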

2. Personalization: AdaptArena

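Even a capable agent does not know its user. AdaptArena is a benchmark for test-time personalization of web agents. It contains 110 collection tasks, from which user preferences are inferred, and 110 deployment tasks. AdaptiveAgent receives a new task together with k past trajectories, represented either as screenshots or as text.

With Gemini-3-Pro as the backbone, AdaptiveAgent reaches a success rate of 44.5% (screenshots) and 40.0% (text), compared with 70.0% for User-Centric (full gold profile) and 85.5% for Oracle (gold preferences for the task).
When user histories are swapped between users, performance drops to the no-profile baseline, which indicates that the gain comes from context aligned with the correct user rather than from generic in-context examples. Future directions include online learning, multiple preferences, and reasoning over multiple trajectories.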
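
As a rough sketch of the setup above, the snippet below builds an agent context from a new task plus the k most recent past trajectories (text representation only), and constructs the swapped-history control. The function names, the prompt layout, and the rotation scheme used for swapping are hypothetical; the benchmark's actual input format is not given in the source.

```python
from typing import Dict, List


def build_context(new_task: str, history: List[str], k: int) -> str:
    """Prepend the k most recent past trajectories to the new task."""
    recent = history[-k:] if k > 0 else []
    blocks = [f"[past trajectory {i + 1}]\n{t}" for i, t in enumerate(recent)]
    blocks.append(f"[new task]\n{new_task}")
    return "\n\n".join(blocks)


def swap_histories(histories: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Control condition: give each user the next user's history,
    so every agent sees plausible but mismatched context."""
    users = sorted(histories)
    return {user: histories[users[(i + 1) % len(users)]]
            for i, user in enumerate(users)}


histories = {
    "alice": ["booked an aisle seat", "booked an aisle seat again"],
    "bob": ["chose the cheapest fare"],
}
context = build_context("book a flight to Montreal", histories["alice"], k=1)
swapped = swap_histories(histories)
```

Running the same agent on `histories` and on `swapped` separates the effect of user-aligned context from the effect of simply having more in-context examples, which is the comparison the ablation relies on.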

3. Communication: LLM2Vec-Gen

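An agent has two needs at every turn: retrieving from memory and communicating with peers. LLM2Vec-Gen has the LLM generate a response and learns representations from it (output-space embeddings).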
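
To make the retrieval half concrete, here is a minimal sketch of memory lookup by cosine similarity over embedding vectors. It says nothing about how LLM2Vec-Gen produces its embeddings: the three-dimensional vectors below are hand-written placeholders, and `cosine` and `retrieve` are illustrative helpers, not part of any library.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: Sequence[float],
             memory: List[Tuple[str, Sequence[float]]],
             top_k: int = 1) -> List[str]:
    """Return the texts of the top_k memories most similar to the query."""
    ranked = sorted(memory,
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:top_k]]


# Placeholder embeddings; a real system would get these from an encoder.
memory = [
    ("prefers aisle seats", [0.9, 0.1, 0.0]),
    ("allergic to peanuts", [0.0, 0.2, 0.9]),
]
hits = retrieve([1.0, 0.0, 0.1], memory, top_k=1)
```

In an output-space scheme, the query vector would represent what the model is about to say rather than the literal user input, but the lookup step itself stays the same.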

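LLM2Vec-Gen scores 5.1 percentage points higher than LLM2Vec on MTEB.
The output space wins: retrieving on what the LLM already knows works better than retrieving on the raw input, and output-space embeddings inherit the LLM's reasoning.
Reasoning-intensive retrieval scales with LLM size. The gain over LLM2Vec on BRIGHT is +7.7% (0.6B), +11.7% (1.7B), +19.7% (4B), and +35.6% (8B).
The embeddings are interpretable: with Logit Lens, an embedding can be projected back into text. Each embedding is a weighted bag of things the LLM would say.
Agent-to-agent (A2A) communication today uses tokens, which are slow, lossy, and sequential. Embeddings are dense, parallel, and native to the model. 100 response tokens can be compressed into 10 latent tokens in a single forward pass. An A2A protocol at scale needs a shared representation, and the LLM already provides one. Future directions include A2A protocols and embedding-space coordination.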
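
The "weighted bag" reading of an embedding can be illustrated with a toy logit-lens projection: multiply the embedding by an unembedding matrix, apply a softmax, and read off the highest-weight tokens. This is a simplified sketch under stated assumptions. The two-dimensional vectors and three-token vocabulary are invented, and a real logit lens usually applies the model's final normalization layer before the unembedding, which is omitted here.

```python
import math
from typing import Dict, List, Sequence, Tuple


def logit_lens(embedding: Sequence[float],
               unembedding: Dict[str, Sequence[float]],
               top_k: int = 3) -> List[Tuple[str, float]]:
    """Project an embedding onto the vocabulary and return the top_k
    (token, probability) pairs, highest probability first."""
    logits = {token: sum(e * w for e, w in zip(embedding, row))
              for token, row in unembedding.items()}
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {token: math.exp(value - peak) for token, value in logits.items()}
    total = sum(exps.values())
    probs = {token: value / total for token, value in exps.items()}
    return sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical unembedding rows for a three-token vocabulary.
unembedding = {
    "flight": [2.0, 0.0],
    "seat": [1.0, 1.0],
    "peanut": [0.0, -1.0],
}
bag = logit_lens([1.0, 0.5], unembedding)
```

Here `bag` ranks "flight" first, then "seat", then "peanut", and the weights sum to 1, so the embedding reads as a soft set of tokens rather than a single string.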

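Summary

For most specialized tasks, a small language model is enough. Specialized, personalized, retrieval-capable SLMs are the missing primitives of multi-agent systems. When these primitives are composed, does the multi-agent system itself become lifelong?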

ICLR 2026, Lifelong Agents Workshop
Siva Reddy
McGill · Mila · ServiceNow Research

Lifelong agents are becoming universal
OpenClaw: a personal AI assistant by Peter Steinberger and the open-source community
Runs on your machine; talks through WhatsApp, Slack, Telegram
Persistent memory across conversations: preferences, facts, decisions
Takes actions: controls your browser, runs scripts, sets reminders
Model-agnostic: Claude, GPT, or a local model
Most deployments today still call a frontier API.

Where would a lifelong agent live?
A frontier API is the wrong substrate:
Cost: billions of [tasks × users × interactions] makes per-call pricing untenable
Latency / privacy: the agent has to be near the user
Personalization: a single hosted model cannot be many users at once
A lifelong agent must run on the user's device; a small language model is the only realistic deployment target.

Three problems
Per domain: Can the small model do the job at all? A3, agentic distillation
Per user: Does it know me? AdaptArena, test-time personalization
Per interaction: How does it remember and retrieve memories? LLM2Vec-Gen, output-space embeddings
A lifelong agent adapts at three granularities: domain → user → interaction.

1. Specialization
Per domain: A3
Structured Distillation of Web Agent Capabilities Enables Generalization

Specialization: the gap
Small open-weight agents trail frontier by 20+ pp on web tasks.
Qwen 3.5 9B on WebArena: ~31%; Gemini-3-Pro: ~51%
Standard SFT distillation overfits to training tasks
Can we transfer frontier capability into a 9B model while enabling generalization across web environments?
Tension: more demonstrations help WebArena but hurt out-of-distribution transfer (e.g., WorkArena, VisualWebArena, MiniWoB).

A3: Agent-as-Annotators
A3 replaces three human annotation roles with LLM modules:
Human role | LLM module (Gemini-3-Pro) | Outputs
Task Designer | Persona + Task Generator | Persona, task intent, evaluation hints
Annotator | Agent | Trajectory + reasoning trace
Supervisor | Judge | Pass/fail using the hints
The student (Qwen3.5-9B) is fine-tuned on judge-filtered trajectories with reasoning intact: 2,322 successful out of 3,000 attempts, 6 web environments.

Example: one annotated rollout
Persona (Task Designer): Maya, e-commerce admin who clears pending orders first thing every
