Reader Comments

Exploring DeepSeek-R1's Agentic Capabilities Through Code Actions

by Mariel Schulze (2025-02-10)

 |  Post Reply

FB-AI-istockphoto-1206796363-612x612-1.j

I ran a fast experiment investigating how DeepSeek-R1 carries out on agentic tasks, despite not supporting tool use natively, and I was rather pleased by initial results. This experiment runs DeepSeek-R1 in a single-agent setup, where the model not only prepares the actions however likewise formulates the actions as executable Python code. On a subset1 of the GAIA recognition split, DeepSeek-R1 outperforms Claude 3.5 Sonnet by 12.5% absolute, from 53.1% to 65.6% correct, users.atw.hu and other designs by an even larger margin:


The experiment followed model usage standards from the DeepSeek-R1 paper and the design card: Don't use few-shot examples, avoid including a system prompt, and set the temperature to 0.5 - 0.7 (0.6 was utilized). You can find additional assessment details here.


Approach


DeepSeek-R1's strong coding abilities enable it to act as a representative without being clearly trained for tool usage. By enabling the design to create actions as Python code, it can flexibly interact with environments through code execution.


Tools are implemented as Python code that is consisted of straight in the prompt. This can be a basic function meaning or a module of a bigger plan - any legitimate Python code. The design then generates code actions that call these tools.


Arise from performing these actions feed back to the model as follow-up messages, driving the next actions till a final response is reached. The representative structure is an easy iterative coding loop that moderates the discussion between the design and lespoetesbizarres.free.fr its environment.


Conversations


DeepSeek-R1 is utilized as chat design in my experiment, where the model autonomously pulls additional context from its environment by using tools e.g. by using a search engine or bring data from websites. This drives the discussion with the environment that continues up until a final answer is reached.


In contrast, o1 designs are understood to carry out poorly when used as chat models i.e. they do not try to pull context during a conversation. According to the linked short article, o1 models perform best when they have the full context available, with clear directions on what to do with it.


Initially, I likewise attempted a complete context in a single prompt method at each step (with arise from previous steps consisted of), however this resulted in considerably lower scores on the GAIA subset. Switching to the conversational method explained above, users.atw.hu I was able to reach the reported 65.6% performance.


This raises an interesting question about the claim that o1 isn't a chat model - maybe this observation was more appropriate to older o1 models that lacked tool usage capabilities? After all, isn't tool use support an essential system for allowing designs to pull extra context from their environment? This conversational technique certainly seems reliable for surgiteams.com DeepSeek-R1, though I still need to carry out comparable experiments with o1 designs.


Generalization

110819_ts_ai_feat.jpg?fit\u003d1028%2C57

Although DeepSeek-R1 was mainly trained with RL on mathematics and coding tasks, it is remarkable that generalization to agentic tasks with tool use through code actions works so well. This ability to generalize to agentic jobs advises of current research study by DeepMind that shows that RL generalizes whereas SFT remembers, although generalization to tool usage wasn't examined because work.


Despite its ability to generalize to tool usage, DeepSeek-R1 typically produces really long thinking traces at each action, compared to other models in my experiments, limiting the usefulness of this design in a single-agent setup. Even easier jobs in some cases take a long time to complete. Further RL on agentic tool use, be it through code actions or not, might be one alternative to enhance efficiency.


Underthinking


I likewise observed the underthinking phenomon with DeepSeek-R1. This is when a reasoning model frequently switches between various reasoning thoughts without adequately exploring promising paths to reach a correct solution. This was a significant factor for extremely long thinking traces produced by DeepSeek-R1. This can be seen in the tape-recorded traces that are available for download.


Future experiments


Another common application of thinking models is to use them for shiapedia.1god.org planning just, while using other models for creating code actions. This might be a possible new function of freeact, if this separation of functions shows beneficial for more complex tasks.


I'm likewise curious about how reasoning designs that currently support tool use (like o1, o3, ...) perform in a single-agent setup, with and without generating code actions. Recent advancements like OpenAI's Deep Research or Hugging Face's open-source Deep Research, which also utilizes code actions, look intriguing.

what-ai-can-do-for-you.jpg

Add comment