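Paper Summary for 2026-03-31

Topic: Research on system-level issues in agentic AI systems in which multiple agents collaborate, such as system latency and system architecture design.

Three papers were selected under this topic.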

CirrusBench: Evaluating LLM-based Agents Beyond Correctness in Real-World Cloud Service Environments

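Relevance to the topic

Overlap in technical terminology
- Description: How closely the specific technical terms, model names, datasets, and algorithm identifiers used in the paper match the topic keywords. Does the paper directly discuss the core technologies relevant to the research topic?
- Weight: 0.500
- Score: 9/10
- Rationale: The paper directly discusses "LLM-based agents", "multi-turn tasks", "resolution efficiency", and "Multi-Turn Latency", terms that closely match the topic's "agentic AI systems with multiple collaborating agents" and "system latency".
Fit of the experimental setup
- Description: Are the paper's experimental environment, datasets, and evaluation metrics accepted standards for this topic? Do the experiments validate effectiveness in this topic's scenarios?
- Weight: 0.200
- Score: 9/10
- Rationale: The experimental environment (built on real cloud service ticket data) and the evaluation metrics (such as Multi-Turn Latency and the Normalized Efficiency Index) directly target the system-level issues named in the topic, namely system latency and efficiency, and validate effectiveness in realistic, complex collaborative scenarios.
Direct relevance of the methodology
- Description: Can the proposed method be applied directly to the core problems of this topic? Does it target known pain points of the topic?
- Weight: 0.200
- Score: 8/10
- Rationale: The proposed CirrusBench evaluation framework quantifies resolution efficiency through customer-centric metrics such as Multi-Turn Latency. This directly targets system latency, a core problem in agentic AI systems, and provides a methodological tool for evaluating and optimizing system architecture design.
Availability of code and data
- Description: Does the paper provide open-source code, pretrained models, or public datasets? This is essential for fast follow-up and adoption.
- Weight: 0.100
- Score: 10/10
- Rationale: The end of the abstract explicitly states that the CirrusBench evaluation framework has been publicly released (a URL is provided), which indicates that the code and/or dataset is available.
Summary

Overall score: 89.00.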
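
The page does not state how the overall score is derived from the four criteria. The sketch below assumes it is the weighted sum of the per-criterion scores (each out of 10), rescaled to a 100-point scale; the function name `overall_score` is hypothetical. Under that assumption, the weights and scores listed above reproduce the reported 89.00.

```python
# Assumption: overall = 10 * sum(weight * score), with each score on a 0-10
# scale and weights summing to 1. The page itself does not spell this out.
def overall_score(criteria):
    """criteria: list of (weight, score_out_of_10) pairs."""
    total_weight = sum(weight for weight, _ in criteria)
    if abs(total_weight - 1.0) > 1e-9:
        raise ValueError("criterion weights are expected to sum to 1")
    return round(10 * sum(weight * score for weight, score in criteria), 2)


# Weights and scores copied from the CirrusBench rubric above.
cirrusbench = [(0.5, 9), (0.2, 9), (0.2, 8), (0.1, 10)]
print(f"{overall_score(cirrusbench):.2f}")  # 89.00
```

Because terminology overlap carries half the weight, one point on that criterion moves the overall score by 5 points, while one point on code and data availability moves it by only 1.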

The content above was generated by deepseek-chat.

Abstract

The increasing agentic capabilities of Large Language Models (LLMs) have enabled their deployment in real-world applications, such as cloud services, where customer-assistant interactions exhibit high technical complexity and long-horizon dependencies, making robustness and resolution efficiency critical for customer satisfaction. However, existing benchmarks for LLM-based agents largely rely on synthetic environments that fail to capture the diversity and unpredictability of authentic customer inputs, often ignoring the resolution efficiency essential for real-world deployment. To bridge this gap, we introduce CirrusBench, a novel evaluation framework distinguished by its foundation in real-world data from authentic cloud service tickets. CirrusBench preserves the intricate multi-turn logical chains and realistic tool dependencies inherent to technical service environments. Moving beyond execution correctness, we introduce novel Customer-Centric metrics to define agent success, quantifying service quality through metrics such as the Normalized Efficiency Index and Multi-Turn Latency to explicitly measure resolution efficiency. Experiments utilizing our framework reveal that while state-of-the-art models demonstrate strong reasoning capabilities, they frequently struggle in complex, realistic multi-turn tasks and fail to meet the high-efficiency standards required for customer service, highlighting critical directions for the future development of LLM-based agents in practical technical service applications. CirrusBench evaluation framework is released at: this https URL

Heddle: A Distributed Orchestration System for Agentic RL Rollout

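Relevance to the topic

Overlap in technical terminology
- Description: How closely the specific technical terms, model names, datasets, and algorithm identifiers used in the paper match the topic keywords. Does the paper directly discuss the core technologies relevant to the research topic?
- Weight: 0.500
- Score: 10/10
- Rationale: The paper directly discusses agentic RL, multi-step interactions, system bottlenecks (such as queueing delays and interference overhead), and a distributed orchestration system, all of which overlap heavily with the topic.
Fit of the experimental setup
- Description: Are the paper's experimental environment, datasets, and evaluation metrics accepted standards for this topic? Do the experiments validate effectiveness in this topic's scenarios?
- Weight: 0.200
- Score: 9/10
- Rationale: The paper evaluates diverse agentic RL workloads and compares system-level metrics such as end-to-end throughput, directly targeting latency and performance bottlenecks in multi-agent collaborative systems. The experimental setup fits the topic.
Direct relevance of the methodology
- Description: Can the proposed method be applied directly to the core problems of this topic? Does it target known pain points of the topic?
- Weight: 0.200
- Score: 10/10
- Rationale: Through trajectory-level scheduling, placement, and resource management, Heddle directly addresses system-level problems in multi-agent collaboration caused by long-tail trajectory generation: queueing delays, interference overhead, and inflated per-token time. The method is highly relevant.
Availability of code and data
- Description: Does the paper provide open-source code, pretrained models, or public datasets? This is essential for fast follow-up and adoption.
- Weight: 0.100
- Score: 0/10
- Rationale: The abstract does not mention whether code, models, or datasets are publicly available, so availability cannot be determined.
Summary

Overall score: 88.00.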

The content above was generated by deepseek-chat.

Abstract

Agentic Reinforcement Learning (RL) enables LLMs to solve complex tasks by alternating between a data-collection rollout phase and a policy training phase. During rollout, the agent generates trajectories, i.e., multi-step interactions between LLMs and external tools. Yet, frequent tool calls induce long-tailed trajectory generation that bottlenecks rollouts. This stems from step-centric designs that ignore trajectory context, triggering three system problems for long-tail trajectory generation: queueing delays, interference overhead, and inflated per-token time. We propose Heddle, a trajectory-centric system to optimize the when, where, and how of agentic rollout execution. Heddle integrates three core mechanisms: trajectory-level scheduling using runtime prediction and progressive priority to minimize cumulative queueing; trajectory-aware placement via presorted dynamic programming and opportunistic migration during idle tool call intervals to minimize interference; and trajectory-adaptive resource manager that dynamically tunes model parallelism to accelerate the per-token time of long-tail trajectories while maintaining high throughput for short trajectories. Evaluations across diverse agentic RL workloads demonstrate that Heddle effectively neutralizes the long-tail bottleneck, achieving up to 2.5× higher end-to-end rollout throughput compared to state-of-the-art baselines.

CARLA-Air: Fly Drones Inside a CARLA World -- A Unified Infrastructure for Air-Ground Embodied Intelligence

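Relevance to the topic

Overlap in technical terminology
- Description: How closely the specific technical terms, model names, datasets, and algorithm identifiers used in the paper match the topic keywords. Does the paper directly discuss the core technologies relevant to the research topic?
- Weight: 0.500
- Score: 9/10
- Rationale: The paper directly discusses core technologies highly relevant to the topic, including multiple agents (aerial and ground agents), cooperative systems (air-ground cooperative systems), system architecture (unified infrastructure, shared physics tick and rendering pipeline), and synchronization (synchronously capturing sensor modalities). It explicitly mentions agent cooperation and system-level design (extensible asset pipeline, modern infrastructure).
Fit of the experimental setup
- Description: Are the paper's experimental environment, datasets, and evaluation metrics accepted standards for this topic? Do the experiments validate effectiveness in this topic's scenarios?
- Weight: 0.200
- Score: 7/10
- Rationale: The experimental environment (the CARLA-Air platform) is a unified simulation infrastructure designed specifically for evaluating cooperative workloads of aerial and ground agents, including cooperation, navigation, and policy training. It supports multi-modal perception and dataset construction, which are typical scenarios for evaluating latency and architecture design in multi-agent systems. However, the paper does not explicitly use the topic's "system latency" as a standard evaluation metric.
Direct relevance of the methodology
- Description: Can the proposed method be applied directly to the core problems of this topic? Does it target known pain points of the topic?
- Weight: 0.200
- Score: 8/10
- Rationale: The proposed method (the CARLA-Air infrastructure) can be applied directly to core problems of the topic, such as system architecture design (avoiding synchronization overhead through a unified simulation environment) and system-level consistency (guaranteeing strict spatial-temporal consistency). It targets known pain points of multi-agent systems, such as the domain segregation and synchronization problems of existing platforms. However, the paper focuses on simulation infrastructure rather than on specific algorithms or optimizations that directly address system latency.
Availability of code and data
- Description: Does the paper provide open-source code, pretrained models, or public datasets? This is essential for fast follow-up and adoption.
- Weight: 0.100
- Score: 10/10
- Rationale: The paper explicitly provides open-source code, including prebuilt binaries and full source, released at a public URL. This enables fast follow-up and adoption and meets the rubric's high bar for code availability.
Summary

Overall score: 85.00.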

The content above was generated by deepseek-chat.

Abstract

The convergence of low-altitude economies, embodied intelligence, and air-ground cooperative systems creates growing demand for simulation infrastructure capable of jointly modeling aerial and ground agents within a single physically coherent environment. Existing open-source platforms remain domain-segregated: driving simulators lack aerial dynamics, while multirotor simulators lack realistic ground scenes. Bridge-based co-simulation introduces synchronization overhead and cannot guarantee strict spatial-temporal consistency. We present CARLA-Air, an open-source infrastructure that unifies high-fidelity urban driving and physics-accurate multirotor flight within a single Unreal Engine process. The platform preserves both CARLA and AirSim native Python APIs and ROS 2 interfaces, enabling zero-modification code reuse. Within a shared physics tick and rendering pipeline, CARLA-Air delivers photorealistic environments with rule-compliant traffic, socially-aware pedestrians, and aerodynamically consistent UAV dynamics, synchronously capturing up to 18 sensor modalities across all platforms at each tick. The platform supports representative air-ground embodied intelligence workloads spanning cooperation, embodied navigation and vision-language action, multi-modal perception and dataset construction, and reinforcement-learning-based policy training. An extensible asset pipeline allows integration of custom robot platforms into the shared world. By inheriting AirSim's aerial capabilities -- whose upstream development has been archived -- CARLA-Air ensures this widely adopted flight stack continues to evolve within a modern infrastructure. Released with prebuilt binaries and full source: this https URL

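Page generation statistics

This page was generated with the deepseek-chat model. Token usage is as follows:

Type	Usage
Prompt tokens (cache miss)	725418
Prompt tokens (cache hit)	68480
Completion tokens	313242
Chain-of-thought tokens	0
Total	1107140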
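
As a quick consistency check on the table above, the following sketch sums the four token categories and derives the share of prompt tokens served from cache. The dictionary keys are illustrative names chosen here, not fields of any particular API response.

```python
# Token counts copied from the usage table above; the keys are hypothetical
# labels for this sketch only.
usage = {
    "prompt_cache_miss": 725418,
    "prompt_cache_hit": 68480,
    "completion": 313242,
    "chain_of_thought": 0,
}

total_tokens = sum(usage.values())
prompt_tokens = usage["prompt_cache_miss"] + usage["prompt_cache_hit"]
cache_hit_rate = usage["prompt_cache_hit"] / prompt_tokens

print(total_tokens)             # 1107140, matching the reported total
print(f"{cache_hit_rate:.1%}")  # share of prompt tokens served from cache
```

The four categories add up exactly to the reported total, and fewer than one in ten prompt tokens were cache hits.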

Total page generation time: 22m 34s.
