Loading...
Loading...
Loading...
https://www.promptingguide.ai/ https://github.com/rxlqn/awesome-llm-self-reflection UPRISE: Universal Prompt Retrieval for Improving Zero-Shot Evaluation # New Papers ## Tool Use ### Tool for other tasks TORL: Scaling Tool-Integrated RL ReTool: Reinforcement Learning for Strategic Tool Use in LLMs (see related work of this paper) ### Survey **Tool learning with large language models: A survey (2405.17935)** -> see Fig. 1-2 to papers before 2024Q2. **Tool Learning with Foundation Models (2304.08354)** ### Datasets **ToolACE: Winning the Points of LLM Function Calling (2409.00920)** Beyond collecting real API data, we developed a Tool Self-Evolution Synthesis (TSS) module that synthesizes API definitions with various data types and constraints. Specifically, we utilize pretraining data to extract an API context tree **Toolalpaca: Generalized tool learning for language models with 3000 simulated cases.(2306.05301)** **Toolllm: Facilitating large language models to master 16000+ real-world apis (2307.16789)** (toolbench) **Gorilla: Large language model connected with massive apis (2305.15334)** **API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs (2304.08244)** **APIGen: Automated PIpeline for Generating Verifiable and Diverse Function-Calling Datasets (2406.18518)** **TravelPlanner: A Benchmark for Real-World Planning with Language Agents (sandbox) (2402.01622)** **Toolformer: Language Models Can Teach Themselves to Use Tools (2302.04761)** **τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains (2406.12045)** **StableToolBench: Towards Stable Large-Scale Benchmarking on Tool Learning of Large Language Models (2403.07114)** **On the Tool Manipulation Capability of Open-source Large Language Models (2305.16504)** **RestGPT: Connecting Large Language Models with Real-World RESTful APIs (2306.06624)** **Nestools: A dataset for evaluating nested tool learning abilities of large language models (2410.11805)** **Api-blend: A comprehensive corpora for training and benchmarking api llms. (2402.15491)** **Seal-tools: Self-instruct tool learning dataset for agent tuning and detailed benchmark (2405.08355)** **Taskbench: Benchmarking large language models for task automation. (2311.18760)** ### evaluation **ToolEyes: Fine-Grained Evaluation for Tool Learning Capabilities of Large Language Models in Real-world Scenarios (2401.00741)** BFCL: https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html **ToolQA: A Dataset for LLM Question Answering with External Tools (2306.13304)** **MINT: EVALUATING LLMS IN MULTI-TURN INTERACTION WITH TOOLS AND LANGUAGE FEEDBACK (2309.10691)** **NESTFUL: A Benchmark for Evaluating LLMs on Nested Sequences of API Calls (2409.03797)** **Nexusraven: a commercially-permissive language model for function calling.** **Granite-function calling model: Introducing function calling abilities via multi-task learning of granular tasks. (2407.00121)** **Wtu-eval: A whether-or-not tool usage evaluation benchmark for large language models. (2407.12823)** **A comprehensive evaluation of tool-assisted generation strategies.(2310.10062)** **Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios (2401.17167) (ultratools)** **T-Eval: Evaluating the Tool Utilization Capability of Large Language Models Step by Step (2312.14033)** **ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages (2402.10753)** **AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents (2407.18901)** **Metatool benchmark for large language models: Deciding whether to use tools and which to use (2310.03128)** Rapid. Rapid api, 2023. https://rapidapi.com/ **Identifying the risks of lm agents with an lm-emulated sandbox (2309.15817)** **m&m’s: A benchmark to evaluate tool-use for multi-step multimodal tasks. (2403.11085)** **Gta: A benchmark for general tool agents (2407.08713)** **Sciagent: Tool-augmented language models for scientific reasoning(2402.11451)** **Ctooleval: A chinese benchmark for llm-powered agent evaluation in realworld api interactions (2024.8)** **Toolsandbox: A stateful, conversational, interactive evaluation benchmark for llm tool use capabilities (2408.04682).** **Agentboard: An analytical evaluation board of multi-turn llm agents (2401.13178)** **Injecagent: Benchmarking indirect prompt injections in toolintegrated large language model agents (2403.02691)** **Tooltalk: Evaluating tool-usage in a conversational setting (2311.10775)** **Gaia: a benchmark for general ai assistants (2311.12983)** **Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments (2404.07972)** **Ml-bench: Large language models leverage open-source libraries for machine learning tasks. (2311.09835)** **AGENTBENCH: EVALUATING LLMS AS AGENTS (2308.03688)** **An LLM compiler for parallel function calling. (2312.04511)** 重要!fc complier可能和我们的CI有关。 ### Training **WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models (2401.13919)** **Genegpt: Augmenting large language modelswith domain tools for improved access to biomedical information (2304.09667)** **ADC: Enhancing Function Calling Via Adversarial Datasets and Code Line-Level Feedback (2412.17754)** CodeAlpaca [3] uses 21 seed tasks and generates a 20k dataset via self-instruct [4], while Wizardcoder [5], MagicCoder [6], and WaveCoder [7] apply advanced heuristics and novel data generation processes based on open-source code snippets and code instruction data to enhance the complexity of initial code instructions CodeNet [23], which comprises approximately 14 million code snippets, and POJ104 [24], a smaller dataset consisting of 52,000 code snippets focused on 104 algorithmic problems **Efficient and scalable estimation of tool representations in vector space (2409.02141)** **Look before you leap: Towards decision-aware and generalizable tool-usage for large language models (2402.16696)** (tooldeer) **Toolverifier: Generalization to new tools via self-verification(2402.14158)** **Llms in the imaginarium: Tool learning through simulated trial and error. (2403.04746)** **Mllm-tool: A multimodal large language model for tool agent learning (2401.10727)** **iTool: Boosting Tool Use of Large Language Models via Iterative Reinforced Fine-Tuning (2501.09766)** **Let Me Do It For You: Towards LLM Empowered Recommendation via Tool Learning (2405.15114)** **AnyTool: Self-Reflective, Hierarchical Agents for Large-Scale API Calls (2402.04253)** **Hammer: Robust function-calling for on-device language models via function masking (2410.04587)** 重要!这个里面的内容可以参考一下。 **Facilitating multiturn function calling for llms via compositional instruction tuning (2410.12952)** 重要!这个里面的内容可以参考一下。 Enhancing tool retrieval with iterative feedback from large language models Confucius: Iterative Tool Learning from Introspection Feedback by Easy-to-Difficult Curriculum. Magicoder: Source code is all you need. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. . Executable code actions elicit better llm agents watt-tool ### Data synthesis Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation ### Prompt Engineering **Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models (2303.04671)** ToolChain\*: Efficient Action Space Navigation in Large Language Models with A\* Search Critic: Large language models can self-correct with tool-interactive critiquing Tool documentation enables zero-shot tool-usage with large language models Measuring and narrowing the compositionality gap in language models **ToolRerank: Adaptive and Hierarchy-Aware Reranking for Tool Retrieval (2403.06551)** Tool documentation enables zero-shot tool-usage with large language models ### Tool Retrieval **Towards completeness-oriented tool retrieval for large language models (2405.16089)** Conversely, semantic-based methods, such as ANCE [45], TAS-B [12], coCondensor [7], and Contriever [15] ... ### Tool Creation CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models Large Language Models as Tool Makers (2305.17126) Craft: Customizing llms by creating and retrieving from specialized toolsets. EASYTOOL: Enhancing LLM-based agents with concise tool instruction ### Early **Internetaugmented language models through few-shot prompting for open-domain question answering (2203.05115)** **ViperGPT: Visual Inference via Python Execution for Reasoning (2303.08128)** **TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs (2303.16434)** TALM: Tool Augmented Language Models ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings Making Language Models Better Tool Learners with Execution Feedback Chameleon: Plug-and-play compositional reasoning with large language models. ### Uncategorized Art: Automatic multistep reasoning and tool-use for large language models. Tool-lmm: A large multimodal model for tool agent learning. GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction Learning to use tools via cooperative and interactive agents. Openagi: When llm meets domain experts. A solution-based llm apiusing methodology for academic information seeking. Advancing tool-augmented large language models: Integrating insights from errors in inference trees TPTU: Task planning and tool usage of large language modelbased AI agents. --- Reliable LLM-based user simulator for task-oriented dialogue systems. **CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training (2504.13161)** **Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning (2503.20752)** **A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce (2504.11343)** **TRUST REGION PREFERENCE APPROXIMATION: A SIMPLE AND STABLE REINFORCEMENT LEARNING ALGORITHM FOR LLM REASONING (2504.04524)** **What’s Behind PPO’s Collapse in Long-CoT? Value Optimization Holds the Secret (2503.01491)** **The Evolution of LLM Adoption in Industry Data Curation Practices (2503.01491)** **Demystifying Long Chain-of-Thought Reasoning in LLMs (2405.09798)** **Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation (2503.07826)** **The N+ Implementation Details of RLHF with PPO: A Case Study on TL;DR Summarization (2403.17031)** **UNA: UNIFYING ALIGNMENTS OF RLHF/PPO, DPO AND KTO BY A GENERALIZED IMPLICIT REWARD FUNCTION (2408.15339)** ## Many-shot ## Self-train ## Math ## Reflection **Many-shot In-Context Learning (2404.11018)** 放了最多2048个example到context里面。不同的顺序对不同的子任务performance不一样,但平均起来看没有什么太大区别。 **Buffer-of-Thoughts** 超高的game of 24成功率 基本的想法是生成一些“template”然后在解决问题时取用。 "self-training": model generated What is the fundamental limits of LLM self-training? 我们要做的这个东西好像带一点many-shot ICL (但是不是example而是template)又带一点reflection(因为要把题目加入训练)和self-train(因为要self-train) 其实或许可以理解为一种active learning,就是说绝大部分情况下可以自己learn,但是少部分情况下需求人工标注 首先从方法上,偏向于finetuning(学校里面太缺计算资源了,希望能做一些学校里做不到的事情) 就现在来说,我比较关心三个问题:第一个是multi-turn LLM,第二个是understanding real numbers,第三个是LLM直接解决传统的mujoco问题。multi-turn LLM这里是以archer这篇文章为代表。 understanding real numbers这篇是以https://arxiv.org/abs/2401.03735这篇文章为代表的。 今天聊天的时候提出了一个LLM建立了“model-based”的概念,那具体是什么意思呢……? # Papers Summary **LDB: Large Model Debugger (2402.16906)** Script generate control flow graph; analyze in breakpoints between flow graph nodes **Don't Trust: Verify (2403.18120)** Generate formal languages and verify **Can LLM Infer Causation from Correlation? (2306.05836)** new problems of inference, new dataset **CREATOR: ... Disentangling Abstract and Concrete Reasoning of LLM (2305.14318)** let an agent to generate high-level plans as decision; execute and rectify using compiler feedback; create tools to address this **DeepSeekMath (2402.03300)** group relative policy optimization (grpo) instead of PPO **MUSR: Multi-Step Soft Reasoning (2310.16049)** new dataset of building logic trees **Let's Verify Step by Step (2305.20050)** train a good reward model; use "feedback in the middle" / "supervise in the proecess" **Solving Math Word Problems (2211.14275)** Early attempt of solving math problem. Basically standard RLHF. Tried by-step and overall feedback. It seems that RM-weighted return is much greater than majority or greedy answers. **Scaling Relationship with LLM (2308.01825)** rejection sampling fine-tuning; data augmentation **LM Understand Numbers, at Least Partially (2401.03735)** empirical study on the internal embedding with numbers as input in LM; uses a single linear layer to "decode" **AlphaMath Almost Zero (2405.03553)** MCTS evaluation to get a value function as reward model; do process supervision with reward assigned by the reward model **Best Practices and Lessons Learned in Synthetic Data (2404.07503)** an introduction to variant synthetic data for LLM papers **Fine-Tuning LVLM ... Using RL (2405.10292)** RL+VLM **Controlling LLM Agents with Entropic Activation Steering (2406.00244)** to increase diversity of LLM output; train a "steering vector" **InterCode: Standardizing and Benchmarking Interactive Code with Execution Feedback (2306.14898)** Another paper which uses agent code command as action and interpreter as environment to interact. **ToolAlpaca (2306.05301)** 经典的framework,它在自己的语料库上训练,其中包含了大量的tool use。 **ToolFormer (2302.04761)** 和toolalpaca类似,但是是更早的工作,没有那么成熟。 **In-Context AE for Context Compression in LLM (2307.06945)** 实际上和activation beacon或者recurrent transformer都很像, 总的来说就是一些特殊的、在上一个chunk结尾继承了大部分信息的token。 **AutoAct (2401.05268)** autoact同样是分为几个agent:meta-agent, plan-agent,tool-agent和reflect-agent。meta-agent会负责选择合适的工具,然后还会用react生成一些成功的trajectory作为training source。在这个基础上用lora finetune其他三个agent。比起之前诸如fireact和lumos这样的工作,autoact不需要使用GPT-4。autoact在Tab.1里测试了大量LLM agent,可以参考。 **AutoGen (2308.08155)** 同样是分为几个agent:assistant, proxy和groupchat。user和assistant两个agent会互相对话,proxy会register一些函数,而assistant则会运行这些函数来返回结果。proxy同时也有python interpreter作为assistant的feedback。 group manager会动态地添加agent/维持一组agent的交流。 **LLM can Strategically Deceive their Users (2311.07590)** The first work that shows LLM cheats even when not instructed to do so but under high pressure from human instructors **Executable Code Actions Elicit Better LLM Agents (2402.01030)** use code as action, makes trajectory more compact and minimizes the number of steps **Data-Copilot (2306.07209)** data-copilot是一个用来实时处理大规模信息的LLM agent。它会自动生成代码来处理这些消息,然后调用一些之前设计好的处理接口。这些接口本身也是在搜索数据时LLM agent自己建立起来的。 **WavCraft: Audio Editing & Generation with LLMs (2403.09527)** 这篇paper是整合了声音输入和编辑的multimodal model,他的编辑是利用了tool API(利用提前整合好的instruction template),输入利用了专门的audio analysis module。实际上外接了大量的已有处理声音的模型。 **SAGE: Semantic and Actionable Parts for Generalizable Manipulation of Articulated Objects (2312.01307)** 实际上是一个visual input的agent,它自己可以描述场景。它调用3D segmentation模型以获知在一个物体上哪些部位是可以动的,这样就可以把现在可以做的action补充到描述里。然后LLM作为一个不断接受feedback的agent来输出policy,使用别的经典算法来估测需要移动的幅度作为反馈。输出是一个tuple,包含目标、动作和参数,这些东西会被送进motion planner。 **Simulating Opinion Dynamics with Networks of LLM-based Agents (2311.09618)** 实际上是一个偏向社会学的工作,试图通过LLM的自然语言来描述人类的心态进而考虑其行为,而不是简单地用先验的公式来描述人类行为。LLM population里每个agent的想法写出来实际上就是模拟每个人内心的想法。当然,所有的信息交换也是通过模拟的twitter实现的。 **Towards Unified Alignment Between Agents, Humans and Environment (2402.07744)** 大部分篇幅描述了一种设计LLM agent的原则,即要认识到人类的目的、要对环境dynamics有认知和要满足经济性等约束。他也提出了一种方法,从过去成功的traj里抽取关键动作,然后根据关键动作的走向匹配最接近的成功案例这样。 **Self-Training Language Models in Arithmetic Reasoning (ICLR 2024 workshop)** proposes calcX, a new dataset; use calculator API and self-training with preference optimization. **Reinforced Self-Training (REST) for LLM (2308.08998)** 从本质上是一种curriculum learning(逐步提高接受data的bar)+rejection sampling(只有return足够高的data才会被接受进入训练) **LLM can self-improve (2023 emnlp)** 跟我们的想法有一点像,他通过voting选出来的最好的reasoning path会被加入到之后的training samples里面 **Agents: An Open-Source Framework for Autonomous Language Agents (2309.07870)** multi-agent,是一种“基础模型”,它整合了tool use、multi-agent、HCI、symbolic等内容,可以被用于后续的训练中。 **LEAGUE++ (ICLR 2024)** LLM robotic agent,使用形式化的语言描述plan,生成reward,同时调用一个semantic skills library。这个library会被rollout中余弦相似度最像的traj不断更新。 **Agent LUMOS: Unified and Modular Training for Open-Source Language Agents** LUMOS features a learnable, unified and modular architecture with a planning module that learns highlevel subgoal generation, and a grounding module trained to translate these into the actions using various tools in the execution module. 仍然是同一个套路:一个planning-agent自然语言生成high-level plan,再用grounding agent细化成标准化动作,最后用各种API实现。 **R2E: Turning any Github Repository into a Programming Agent Environment** 对LLM的自动化testbed生成器。对于每个repo,首先找出一些“有意思的函数”(注意并不是整个repo),然后收集它们的context;利用prompted 程序分析生成testing harness,即自动化测试框架。注意到这里并不是简单地生成输入输出,而是包含了一整套外在的config。 **Beyond A\*: Better Planning with Transformers via Search Dynamics Bootstrapping (2402.14083)** 训练一个transformer来预测search trace。测试了两种情况,一种是先输出trace后输出plan,一种是直接输出plan。显然前者的正确率更高。 **FINMEM: LLM trading agent (2311.13743)** 一个llm trading agent,能够处理不同类型、特别是时效性不同的金融数据。它分为profiling,memory和decision-making三个模块。其实profiling大概就是一个prompt,描述了一些过去的信息。然后有不同agent扮演不同风险爱好的决策者。memory模块会结合每天的新闻、股价、公司报告等选出最相关的信息(综合自不同时效),并且根据时效其重要性会指数衰减。同时,agent还会做reflection。decision-making就是简单地从buy/sell/hold里选一个动作。 **Travel Planner: A Benchmark for Real-World Planning with Language Agents** 基本上就是一个新的testbed,没有提出新方法。 **Towards General Computer Control: RDR II (2402.01030)** 整个pipeline大概分为self-reflection,task inference,然后coding得到对应的操作,再根据操作做planning得到最终要做的动作。总的来说就是一种比较高级的以code作为action的react。实际上和elicity一文有点类似。 **Mobile Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception** 一个基于GPT-4V和一些辅助工具生成的手机助手task。不过没有测试baseline。 **LLF-Bench: Interactive Learning from Language Feedback(2312.06853)** 一个新的benchmark,在一般agent环境的基础上添加了语言feedback。 **AgentOhana: Design Unified Data and Training Pipeline for Effective Agent Learning** 它把来自不同环境的trajectory综合成一种恒定的格式,从而生成一种新的dataloader。与此同时利用这个dataloader训练了一个新模型xLAM-v0.1。 **Can Large Language Models be Good path planners? (2310.03249)** 测试了一系列LLM在gridworld navigation上的performance。总的来说,LLM的表现并不好——需要situated spatial info和持续不断的反馈。如果是finetune的LLM,则generalizability不太好。相对来说,ReAct效果不错。不过需要注意到这篇文章比较早了,现在可能有更好的方法。 **Human-inspired reading agent with Gist Memory** 基本上就是让llm缩写context之后把缩写结果扔到context里面然后再RAG。需要把一个很长的文章用LLM分段,然后对每一段做缩写。在提问的时候,prompt llm问他需不需要再仔细看某一页的内容。这篇文章里提到了许多context compression的方法。 **The ART of LLM Refinement: Ask, Refine, and Trust** 首先生成一个初始的答案,然后让一个提问题的LLM基于这个问题分解出一系列小问题,如果都能回答上来则直接输出,否则基于这些小问题修正答案。如果修正了答案,那么让一个truster LLM来评价一下是初始答案更好还是后面的答案更好。 **Do LLM Agents have Regret? A case study in online learning and games** 研究了一些简单的bandit环境。有些情况下是无regret的,但是也有很简单的情况会有regret **If LLM is the Wizard, then code is the wand (2401.00812)** 是一篇survey,提到了code是如何让LLM变得更好的。具体地说,LLM可以直接写code、评价code,可以做program-of-thought,可以辅助CoT做更好的task decomposition,可以建立reasoning graph,辅助vision input,使用工具,从interpreter那里得到反馈,做env的感知和planning,作为action,组织memory或者是自我改进。 **AgentBoard: An analytical evaluation board of Multi-turn LLM Agents** A testbed parallel to agentbench **Is it possible to edit LLM robustly?** 研究是否在finetune改变llm的某个认知之后会被用户几句话给拐回来。事实表明,越是接近基础的认知越不容易被robustly edited,也就是越容易被用户拐回来。 ---------------------------------------------------------------------------------------------------------------------------------------- # Chain of Thought Vanilla: Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. In NeurIPS, 2022. thought-observation-action: ReAct: Synergizing Reasoning and Acting in Language Models. In ICLR, 2023. in plan and out-of-plan: AdaPlanner: Adaptive Planning from Feedback with Language Models. In ArXiv, 2023. long-term memory by self-reflection: Reflexion: Language Agents with Verbal Reinforcement Learning. In ArXiv, 2023. majority vote: Complexity-Based Prompting for Multi-Step Reasoning. In ICLR, 2023. Self-consistency: Self-Consistency Improves Chain of Thought Reasoning in Language Models. In ICLR, 2023. Tree of thought: Tree of Thoughts: Deliberate Problem Solving with Large Language Models. In ArXiv, 2023. Graph of thought: Graph of Thoughts: Solving Elaborate Problems with Large Language Model. In ArXiv, 2023. Algorithm-of-Thought: Algorithm of Thoughts: Enhancing Exploration of Ideas in Large Language Models. In ArXiv, 2023. Skeleton-of-thought: Cumulative Reasoning With Large Language Models. In ArXiv, 2023. # RL for LLM ## Policy Gradient and Actor-Critic PPO + nucleus sampling to narrow down possible actions: Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization. In ICLR, 2023. explore in the vicinity: Semi-Offline Reinforcement Learning for Optimized Text Generation. In ICML, 2023. ## MCTS Reasoning with Language Model is Planning with World Model. In ArXiv, 2023. ## In-context RL In-context Reinforcement Learning with Algorithm Distillation. In NeurIPS, 2022. Transformers as Decision Makers: Provable In-Context Reinforcement Learning via Supervised Pretraining Emergent agentic transformer from chain of hindsight experience Emergence of In-Context Reinforcement Learning from Noise Distillation AMAGO: Scalable In-Context Reinforcement Learning for Adaptive Agents In-Context Reinforcement Learning for Variable Action Spaces Supervised Pretraining Can Learn In-Context Reinforcement Learning, in NeurIPS, 2023. In-context Exploration-Exploitation for Reinforcement Learning
- Without a harness, you **can't compare** prompts, models, retrieval configs, or costs.
Evaluate, benchmark, and regression-test AI/LLM systems. Covers evaluation framework design, benchmark creation, human evaluation protocols, automated evaluation (LLM-as-judge), regression testing, statistical significance, and continuous evaluation pipelines.
<img width="1388" height="298" alt="full_diagram" src="https://github.com/user-attachments/assets/12a2371b-8be2-4219-9b48-90503eb43c69" />
A list of all public EEG-datasets. This list of EEG-resources is not exhaustive. If you find something new, or have explored any unfiltered link in depth, please update the repository.