Fine-tuning a Reasoning LLM with Supervised or Reinforcement Learning?

Hello,

I have a task to fine-tune small LLMs on annotated conversational data. The dataset contains not only the final answers, but also reasoning traces and tool-calling decisions (i.e., when the model should think and when it should call a tool).

I am wondering what the best training approach would be and why.

My current dataset is stored in a chat format similar to this:

system
user
assistant_think
assistant_tool
assistant_answer

user
assistant_think
assistant_tool
assistant_answer
...

My current idea is to split each conversation into multiple training samples. For example, if a conversation contains two user turns, I would create two samples:

Sample 1

system
user
assistant_think
assistant_tool
assistant_answer

Sample 2

system
user
assistant_think
assistant_tool
assistant_answer

user
assistant_think
assistant_tool
assistant_answer

In other words, each sample contains all previous conversation history up to the assistant response being trained.

For training, the loss would be computed only on the assistant-generated tokens:

assistant_think
assistant_tool
assistant_answer

while the system and user messages would be masked out from the loss.

Is this approach correct, or is there a better way to structure the training data for reasoning and tool-calling behavior?

My second question is about reinforcement learning.

After completing supervised fine-tuning (SFT) on the dataset described above, should I also use RL to further train the model, for example on when a tool should or should not be called?

If so:

  • What advantages would RL provide over SFT alone for tool-use and reasoning?

  • How would you design the reward function?

  • Under what circumstances is RL actually necessary, and when is SFT sufficient?

I would appreciate any practical advice, papers, blog posts, or open-source examples related to training reasoning and tool-calling models.

Hmm, maybe something like this:


I would separate this into three questions that often get mixed together:

  1. How should I represent the training data?
  2. Which tokens should actually receive loss during SFT?
  3. When, if ever, should I move from SFT to preference optimization or RL?

My short answer would be:

Start with SFT if you already have correct trajectories.
Treat assistant_think, assistant_tool, and assistant_answer as an internal annotation format, not necessarily as model roles.
Convert them into the target model’s actual chat template / tool-calling format.
Add no-tool, clarification, and unavailable-tool examples.
Consider DPO if you can create good/bad trajectory pairs.
Consider GRPO/RL only if you have executable tools, a rollout environment, and reliable rewards.

Below is the longer version.

1. Your SFT intuition is mostly right, but I would not train arbitrary custom roles directly

Your idea of making each training example condition on the conversation history and supervise the next assistant-side behavior is basically reasonable.

For example, conceptually:

sample 1:
  system
  user_1
  assistant_1

sample 2:
  system
  user_1
  assistant_1
  user_2
  assistant_2

That is a normal way to turn multi-turn dialogue into next-assistant-response training examples.
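The splitting described above can be sketched as follows. This is a minimal illustration, assuming the conversation is already a list of standard role/content messages; the function name split_by_turn and the sample messages are hypothetical, not part of any library.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


def split_by_turn(messages: List[Message]) -> List[List[Message]]:
    """Return one sample per completed turn: the conversation truncated
    right after the assistant message that ends that turn."""
    samples = []
    for i, message in enumerate(messages):
        is_last = i == len(messages) - 1
        next_is_user = not is_last and messages[i + 1]["role"] == "user"
        # A turn ends at an assistant message followed by a user message (or by nothing).
        if message["role"] == "assistant" and (is_last or next_is_user):
            samples.append(messages[: i + 1])
    return samples


conversation = [
    {"role": "system", "content": "You are a helpful assistant with tool access."},
    {"role": "user", "content": "What's the weather in Paris tomorrow?"},
    {"role": "assistant", "tool_calls": [{"name": "get_weather"}]},
    {"role": "tool", "content": "{\"forecast\": \"Light rain, 13C\"}"},
    {"role": "assistant", "content": "Expect light rain and about 13°C."},
    {"role": "user", "content": "And in Berlin?"},
    {"role": "assistant", "content": "Which day do you mean?"},
]

samples = split_by_turn(conversation)
print([len(sample) for sample in samples])
```

One thing to watch: if loss is applied to every assistant message in a sample, the longest sample already supervises all earlier assistant turns. In that case either train on the full conversation once, or mask everything except the last turn in each split sample, so early turns are not counted several times.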

However, I would be careful with custom roles like:

assistant_think
assistant_tool
assistant_answer

Those can be useful as raw annotations, but I would not assume the model understands them as roles unless your target model’s chat template explicitly supports them.

In Hugging Face / Transformers / TRL terms, the more standard representation is usually closer to:

assistant message containing reasoning-compatible content, if you really want to supervise reasoning text
assistant message containing tool_calls
tool role message containing the tool result
assistant message containing the final answer
tools column containing the available tool schemas

TRL’s SFTTrainer now explicitly supports tool-calling SFT. Its docs say that each tool-calling dataset example should include conversation messages with tool_calls and tool role messages, plus the list of available tools in a tools column, typically as JSON schemas.

The Transformers tool-use docs also describe the same general shape: assistant messages can contain tool_calls, tool responses should be represented as tool role messages, and tools are supplied as schemas/functions to the chat template / tokenizer layer. See Transformers: Tool use.

So I would think of your format like this:

| Your raw annotation | Training-format target |
|---|---|
| assistant_think | Model-specific reasoning span, if you want to train visible reasoning |
| assistant_tool | assistant message with tool_calls |
| tool result / observation | tool role message |
| assistant_answer | final assistant message |
| available tools | tools column / JSON schemas |

This distinction between “raw trajectory format” and “training format” is important. A related research direction is the Agent Data Protocol, which treats heterogeneous agent trajectories as something to normalize into a common schema before training. You do not need to adopt ADP specifically, but the principle is useful: keep your internal annotation format separate from the model-specific training format.

2. The chat template is not a cosmetic wrapper

For chat/instruct/tool models, the chat template is part of the interface the model was trained on.

The Transformers chat templating docs explain that role/content dictionaries are converted into a token sequence through the model’s chat template. Different model families use different control tokens and different tool-call formats.

That means this is risky:

role = assistant_think
role = assistant_tool
role = assistant_answer

unless you intentionally write a chat template that renders those roles into the exact format your model should learn and later use.

This becomes especially important for reasoning/tool models:

| Model family / runtime | Why format matters |
|---|---|
| Qwen / Qwen3 | Qwen has model-specific function-calling templates and parsers; Qwen-Agent encapsulates Qwen’s tool-calling templates/parsers. |
| GPT-OSS | GPT-OSS models were trained on the Harmony response format, which defines conversation structure, reasoning output, and function calls. |
| vLLM serving | vLLM’s tool calling docs require a chat template that handles tool role messages and assistant messages containing previous tool calls. |
| Generic Transformers | The tool-use docs expect tool schemas and model-specific rendering through apply_chat_template. |

So my practical recommendation would be:

Keep assistant_think, assistant_tool, and assistant_answer in your preprocessing code if they help you reason about the data, but convert them before training into the exact message/tool format expected by your target model and inference stack.

3. Which tokens should receive loss?

For SFT, you usually do not want to train on every token in the serialized conversation.

A reasonable default is:

| Span | Should receive loss? | Notes |
|---|---|---|
| system prompt | No | Conditioning context |
| user messages | No | Conditioning context |
| assistant reasoning / thinking | Maybe | Only if you intentionally want the model to emit that reasoning format |
| assistant tool call | Yes | The model must learn when/how to call tools |
| tool result / observation | No | External environment output, not model-generated text |
| final assistant answer | Yes | The model should learn the final response |
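To make the table above concrete, here is a small sketch of how labels could be built by hand. It assumes you already have token IDs per span; the span names, the build_labels helper, and the toy token IDs are all hypothetical. The only fixed convention is -100, which is the default ignore index of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # tokens labelled -100 are skipped by PyTorch's cross-entropy loss


def build_labels(segments, train_on_reasoning=False):
    """segments: list of (span_type, token_ids) in conversation order.
    Returns (input_ids, labels) where unsupervised spans are masked out."""
    supervised = {"tool_call", "answer"}
    if train_on_reasoning:
        supervised.add("reasoning")

    input_ids, labels = [], []
    for span_type, token_ids in segments:
        input_ids.extend(token_ids)
        if span_type in supervised:
            labels.extend(token_ids)  # model is trained to produce these tokens
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))  # context only
    return input_ids, labels


# Toy token IDs, purely for illustration.
segments = [
    ("system", [1, 2]),
    ("user", [3]),
    ("reasoning", [4, 5]),
    ("tool_call", [6, 7]),
    ("tool_result", [8]),
    ("answer", [9]),
]

input_ids, labels = build_labels(segments)
print(labels)
```

In practice the trainer and chat template usually do this for you, but writing it out once makes it easier to spot when the automatic masking disagrees with what you intended.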

TRL has assistant_only_loss=True for assistant-message-only loss, and also supports completion-only loss for prompt/completion style datasets. See SFTTrainer: Train on assistant messages only.

However, there is an important caveat: assistant_only_loss=True depends on the chat template being able to mark generation spans. The TRL docs mention that this uses {% generation %} / {% endgeneration %} blocks in the chat template. There also appears to be an open issue about adding such generation markers to common chat templates: TRL issue #5471.

So I would not just trust the flag blindly. I would inspect the first batch.

A simple sanity check is:

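# Sketch: assumes `trainer` (an SFTTrainer) and `tokenizer` already exist.
batch = next(iter(trainer.get_train_dataloader()))

# First example in the batch; both are 1-D tensors of the same length.
input_ids = batch["input_ids"][0]
labels = batch["labels"][0]

# Labels equal to -100 are ignored by the loss, so keep only supervised tokens.
supervised_ids = input_ids[labels != -100].tolist()

# All supervised spans are decoded back to back, so read this as a rough check.
print(tokenizer.decode(supervised_ids))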

You want this decoded text to contain only the assistant-side spans you intend to supervise, such as tool calls and final answers. If it includes user messages, tool observations, or system text, your masking is wrong.

4. Tool-call examples alone are not enough

A common failure mode is: after fine-tuning on tool-call examples, the model starts calling tools too often.

So the dataset should not only contain “here is how to call a tool” examples. It should also contain:

| Case type | Why it matters |
|---|---|
| Tool-required examples | Teach the model to call tools when needed |
| No-tool examples | Teach the model to answer directly when no tool is needed |
| Clarification examples | Teach the model to ask for missing required arguments |
| Unavailable-tool examples | Teach the model to admit that the provided tools cannot solve the request |
| Irrelevant-tool examples | Teach the model not to force an unrelated tool call |
| Bad-result / failed-tool examples | Teach recovery or fallback behavior |
| Multi-turn tool-result examples | Teach the model to incorporate observations into later turns |

This point is not just theoretical. The paper When2Call: When (not) to Call Tools focuses exactly on tool-calling decision-making: when to call a tool, when to ask follow-up questions, and when to admit that the question cannot be answered with the provided tools.

That is the part people often miss. Calling the right tool with the right arguments is one skill. Deciding whether a tool call should happen at all is another skill.
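A quick way to check whether a dataset covers these cases is to tag each example and count. The sketch below assumes every example carries a case_type field, which is a hypothetical annotation you would add yourself; the label names simply mirror the table above.

```python
from collections import Counter

# Hypothetical labels mirroring the case types listed above.
REQUIRED_CASE_TYPES = {
    "tool_required",
    "no_tool",
    "clarification",
    "unavailable_tool",
    "irrelevant_tool",
    "failed_tool",
    "multi_turn_tool_result",
}


def case_type_report(examples):
    """Return (share of each case type present, sorted list of missing case types)."""
    counts = Counter(example["case_type"] for example in examples)
    total = sum(counts.values())
    shares = {case: count / total for case, count in counts.items()}
    missing = sorted(REQUIRED_CASE_TYPES - set(counts))
    return shares, missing


examples = [{"case_type": "tool_required"}] * 3 + [{"case_type": "no_tool"}]
shares, missing = case_type_report(examples)
print(shares)
print(missing)
```

If the report shows that almost everything is tool_required, over-calling after SFT should not be a surprise.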

5. Validate trajectories at the step level, not only the final answer level

If you have multi-turn trajectories, I would also inspect them at the turn/step level before training.

A trajectory can have a correct final answer but still contain a bad intermediate action, such as:

wrong tool call
lucky tool result
correct final answer

If you train on that trajectory, the model may learn the bad intermediate policy.

This is one reason recent tool-use dataset work emphasizes filtering or validating intermediate steps. For example, ToolMind argues that trajectory-level validation can miss turn-level errors, and uses fine-grained turn-level filtering to remove erroneous or suboptimal steps.

For your case, I would check each step:

| Step | Check |
|---|---|
| Reasoning / planning | Did the assistant correctly identify whether a tool is needed? |
| Tool selection | Was the selected tool relevant? |
| Arguments | Were the arguments available from context and schema-valid? |
| Tool result | Was the observation inserted into the dialogue correctly? |
| Final answer | Did the final answer use the tool result rather than hallucinating? |
| Cost | Did the trajectory avoid unnecessary tool calls? |
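The tool-selection and argument checks are the easiest ones to automate. Here is a rough sketch, assuming tools are described with OpenAI-style function schemas; validate_tool_call is a hypothetical helper and only covers required fields and top-level types, not full JSON Schema validation.

```python
# Mapping from JSON Schema type names to Python types (top level only).
# Note: Python's bool is a subclass of int, so True would pass as "integer" here.
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def validate_tool_call(call, tools):
    """Return a list of problems found in one tool call; an empty list means it passed."""
    schemas = {tool["function"]["name"]: tool["function"]["parameters"] for tool in tools}
    name = call.get("name")
    if name not in schemas:
        return [f"unknown tool: {name}"]

    schema = schemas[name]
    arguments = call.get("arguments", {})
    problems = []
    for required in schema.get("required", []):
        if required not in arguments:
            problems.append(f"missing required argument: {required}")
    for key, value in arguments.items():
        spec = schema.get("properties", {}).get(key)
        if spec is None:
            problems.append(f"unexpected argument: {key}")
            continue
        expected = JSON_TYPES.get(spec.get("type"))
        if expected is not None and not isinstance(value, expected):
            problems.append(f"wrong type for argument: {key}")
    return problems


tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}, "date": {"type": "string"}},
            "required": ["city", "date"],
        },
    },
}]

print(validate_tool_call({"name": "get_weather", "arguments": {"city": "Paris"}}, tools))
```

Running this over every assistant tool call in the dataset before training catches a surprising share of annotation errors. The semantic checks (was a tool needed at all, was the answer grounded) still need human or model-based review.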

6. When is SFT enough?

SFT is the right first move when you have high-quality demonstrations.

SFT is especially good for:

| Goal | SFT suitability |
|---|---|
| Learning the serialized tool-call format | High |
| Learning JSON/schema shape | High |
| Learning basic tool choice from examples | Medium to high |
| Learning to use tool results in final answers | High |
| Learning no-tool behavior | Good if no-tool examples are included |
| Learning robust exploration over new tools | Limited |
| Optimizing tool-use cost | Limited |
| Recovering from tool failure | Depends heavily on data |

So I would start with SFT, but I would not assume that SFT alone solves the full policy problem.

A practical first checkpoint after SFT:

| Metric | What to measure |
|---|---|
| Format validity | Can you parse the model’s tool call? |
| Schema validity | Do required fields and types match the schema? |
| Tool selection accuracy | Is the selected tool correct? |
| No-tool accuracy | Does it avoid tools when unnecessary? |
| Clarification accuracy | Does it ask for missing required info? |
| Grounding | Does the final answer use the tool result? |
| Final answer correctness | Is the final answer correct? |
| Tool-call count | Is the model overusing tools? |

For evaluation inspiration, see the Berkeley Function Calling Leaderboard, which focuses on function/tool-call accuracy, and ToolSandbox, which evaluates stateful, conversational, interactive tool use.
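Several of the metrics above reduce to simple counting once you log, per test case, which tool was expected and which tool the model actually called. A minimal sketch, assuming hypothetical record fields expected_tool and predicted_tool where None means "no tool":

```python
def evaluate(records):
    """records: list of dicts with 'expected_tool' and 'predicted_tool' (None = no tool call)."""
    tool_cases = [r for r in records if r["expected_tool"] is not None]
    no_tool_cases = [r for r in records if r["expected_tool"] is None]

    def rate(cases, is_hit):
        # Returns None when there are no cases, instead of dividing by zero.
        return sum(is_hit(r) for r in cases) / len(cases) if cases else None

    return {
        "tool_selection_accuracy": rate(
            tool_cases, lambda r: r["predicted_tool"] == r["expected_tool"]
        ),
        "no_tool_accuracy": rate(no_tool_cases, lambda r: r["predicted_tool"] is None),
        "over_call_rate": rate(no_tool_cases, lambda r: r["predicted_tool"] is not None),
    }


records = [
    {"expected_tool": "get_weather", "predicted_tool": "get_weather"},
    {"expected_tool": "get_weather", "predicted_tool": None},
    {"expected_tool": None, "predicted_tool": None},
    {"expected_tool": None, "predicted_tool": "get_weather"},
]

metrics = evaluate(records)
print(metrics)
```

Reporting tool-required and no-tool cases separately matters: a single overall accuracy number can hide a model that calls a tool on every request.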

7. DPO can be a natural next step before RL

If you can build preferred/rejected trajectory pairs, DPO is often simpler than full RL.

TRL’s DPOTrainer supports tool-calling data too: examples can include prompt, chosen, and rejected conversations with tool_calls, tool role messages, and a tools column.

Examples of useful DPO pairs:

| Situation | Chosen | Rejected |
|---|---|---|
| Tool needed | Correct tool call + grounded answer | Hallucinated direct answer |
| Tool not needed | Direct answer | Unnecessary tool call |
| Missing required argument | Clarifying question | Invalid tool call with guessed argument |
| Irrelevant tools only | Explain that available tools are not enough | Force an unrelated tool call |
| Tool result given | Answer grounded in result | Answer ignores result |
| Cost-sensitive task | Minimal sufficient calls | Excessive repeated calls |
| Invalid JSON risk | Parseable/schema-valid call | Malformed call |

This is often a very practical middle ground:

SFT teaches the model the basic behavior.
DPO nudges the model away from bad variants of that behavior.
RL is only worth adding if you have an executable environment and reliable rewards.
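As an illustration of the "tool not needed" case, one preference record could be assembled like this. The prompt / chosen / rejected / tools field names follow the layout described above; make_preference_pair and the example content are hypothetical, and the exact schema your trainer version expects should be checked against its docs.

```python
def make_preference_pair(prompt, chosen, rejected, tools):
    """Bundle one preference example after two basic sanity checks."""
    for completion in (chosen, rejected):
        if not completion or completion[0]["role"] != "assistant":
            raise ValueError("each completion must start with an assistant message")
    if chosen == rejected:
        raise ValueError("chosen and rejected must differ")
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected, "tools": tools}


tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get a weather forecast for a city and date.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}, "date": {"type": "string"}},
            "required": ["city", "date"],
        },
    },
}]

pair = make_preference_pair(
    prompt=[{"role": "user", "content": "What is 2 + 2?"}],
    # Chosen: answer directly, because no tool is relevant.
    chosen=[{"role": "assistant", "content": "4"}],
    # Rejected: an unnecessary call to an unrelated tool.
    rejected=[{
        "role": "assistant",
        "tool_calls": [{
            "type": "function",
            "function": {"name": "get_weather", "arguments": {"city": "Paris", "date": "today"}},
        }],
    }],
    tools=tools,
)
print(sorted(pair))
```

Rejected samples are often cheapest to obtain from the SFT model's own mistakes on a held-out set, since those are exactly the behaviors you want to push away from.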

8. When should you use RL / GRPO?

I would only move to RL if you have more than just example trajectories.

You need at least some of the following:

| Requirement | Why it matters |
|---|---|
| Executable tools | The model’s tool calls must actually run during rollout |
| Parser | The training loop must parse tool calls from model output |
| Environment state | Multi-turn tool use often changes state |
| Verifier | You need to score success or failure |
| Reward components | Tool selection, arguments, execution, grounding, cost |
| Stable chat template | Tool calls and observations must serialize consistently |
| Initial tool-capable policy | Otherwise RL may not explore useful tool calls |

TRL’s GRPOTrainer supports tools and also an environment_factory mode, where the trainer creates an environment instance per rollout and exposes public methods as tools. TRL’s OpenEnv integration is also relevant if you want environment-backed training.

The important point is that RL is not just “SFT plus a reward function”. You need the full loop:

model generates
→ parser extracts tool call
→ tool/environment executes
→ observation is returned to the model
→ model continues
→ verifier computes rewards
→ policy update happens

If you cannot execute tools during rollout or cannot compute meaningful rewards, I would not start with RL.
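To show what that loop involves, here is a toy rollout in plain Python. Everything in it is a stand-in: toy_policy replaces the model, get_weather is a stub tool, and the "tool call is a bare JSON object" convention is a made-up output format, not any model's real one. The verifier and policy update are left out.

```python
import json


def get_weather(city):
    """Stub tool; a real environment would call an API or simulator."""
    return {"forecast": "Light rain, 13C"}


TOOLS = {"get_weather": get_weather}


def parse_tool_call(text):
    """Assumed format: a JSON object with 'name' and 'arguments'. Anything else is a final answer."""
    try:
        payload = json.loads(text)
    except json.JSONDecodeError:
        return None
    if isinstance(payload, dict) and payload.get("name") in TOOLS:
        return payload
    return None


def rollout(policy, user_message, max_steps=4):
    """Generate, parse, execute, feed the observation back, repeat until a final answer."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        output = policy(messages)
        call = parse_tool_call(output)
        if call is None:
            messages.append({"role": "assistant", "content": output})
            break
        messages.append({"role": "assistant", "tool_calls": [call]})
        result = TOOLS[call["name"]](**call.get("arguments", {}))
        messages.append({"role": "tool", "content": json.dumps(result)})
    return messages


def toy_policy(messages):
    """Stand-in for the model: call the tool once, then answer from the observation."""
    if messages[-1]["role"] == "tool":
        forecast = json.loads(messages[-1]["content"])["forecast"]
        return f"Expect: {forecast}"
    return json.dumps({"name": "get_weather", "arguments": {"city": "Paris"}})


trajectory = rollout(toy_policy, "What's the weather in Paris?")
print([message["role"] for message in trajectory])
```

Even in this toy version, the parser, the tool registry, and the step limit are all things that have to exist and behave reliably before any reward can be computed. That infrastructure is usually the real cost of moving to RL.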

9. Reward design for tool use should be decomposed

A final-answer-only reward is often too coarse.

The paper ToolRL: Reward is All Tool Learning Needs makes this point directly: tool-use RL is hard because multiple tools and diverse parameters require more fine-grained feedback than simple answer matching.

A useful reward decomposition might be:

| Reward component | Example |
|---|---|
| Format reward | Output is parseable as a tool call or final answer |
| Schema reward | Required arguments exist and have correct types |
| Tool selection reward | Correct tool selected |
| Argument semantic reward | Arguments are correct given the conversation |
| Execution reward | Tool executes successfully |
| Grounding reward | Final answer uses the tool observation |
| Final correctness reward | The final answer is correct |
| No-tool reward | Avoids tools when no tool is needed |
| Clarification reward | Asks for missing required information |
| Cost penalty | Penalizes unnecessary tool calls or excessive calls |

Also, beware of overusing tools. Work such as OTC: Optimal Tool Calls via Reinforcement Learning focuses on encouraging accurate answers with fewer tool calls. This matters because a reward that only values final correctness can accidentally teach the model to call tools too often.
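A decomposed reward with a cost penalty can be as simple as a weighted sum. The sketch below covers only a subset of the components listed above, and the weights and per-call penalty are arbitrary placeholders that you would need to tune; they are not taken from ToolRL or OTC.

```python
# Placeholder weights for illustration only.
WEIGHTS = {
    "format": 0.1,
    "schema": 0.1,
    "tool_selection": 0.2,
    "grounding": 0.2,
    "final_correct": 0.4,
}
COST_PER_EXTRA_CALL = 0.05


def total_reward(checks, num_tool_calls, min_calls_needed):
    """checks: dict mapping component name -> bool, produced by your verifier.
    Calls beyond the minimum needed for the task are penalized."""
    score = sum(weight for name, weight in WEIGHTS.items() if checks.get(name))
    extra_calls = max(0, num_tool_calls - min_calls_needed)
    return score - COST_PER_EXTRA_CALL * extra_calls


all_passed = {name: True for name in WEIGHTS}
print(total_reward(all_passed, num_tool_calls=1, min_calls_needed=1))
print(total_reward(all_passed, num_tool_calls=3, min_calls_needed=1))
```

Keeping the components separate also helps with debugging: logging each one per rollout shows whether the policy is improving on correctness or just learning to satisfy the format check.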

10. Suggested practical training path

I would use this staged approach:

| Stage | Do this | Move on when |
|---|---|---|
| 0. Normalize data | Convert raw assistant_think/tool/answer annotations into target chat/tool format | The rendered examples match the target model’s template |
| 1. Mask inspection | Verify which tokens receive loss | Only intended assistant spans are supervised |
| 2. SFT | Train on high-quality trajectories | Format, schema, and basic tool use work |
| 3. Evaluation | Test tool/no-tool, schema, grounding, final correctness | You know the failure modes |
| 4. DPO | Use chosen/rejected pairs for common mistakes | Over-calling, invalid calls, and hallucinations improve |
| 5. RL/GRPO | Only if tools are executable and rewards are reliable | You can run environment-backed rollouts |

In short:

If you have demonstrations:
  start with SFT.

If you have good vs bad trajectory pairs:
  consider DPO.

If you have executable tools + verifier + reward:
  consider GRPO/RL.

If you have none of those:
  build evaluation and clean the dataset first.

11. A possible data representation

As an internal raw format, something like this is fine:

{
  "system": "You are a helpful assistant with tool access.",
  "turns": [
    {
      "user": "What's the weather in Paris tomorrow?",
      "assistant_think": "The user asks for current/future weather, so I need a weather tool.",
      "assistant_tool": {
        "name": "get_weather",
        "arguments": {
          "city": "Paris",
          "date": "tomorrow"
        }
      },
      "tool_result": {
        "forecast": "Light rain, 13C"
      },
      "assistant_answer": "Tomorrow in Paris, expect light rain and about 13°C."
    }
  ]
}

But before training, I would convert it to a model/tool format closer to:

{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant with tool access."
    },
    {
      "role": "user",
      "content": "What's the weather in Paris tomorrow?"
    },
    {
      "role": "assistant",
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "get_weather",
            "arguments": {
              "city": "Paris",
              "date": "tomorrow"
            }
          }
        }
      ]
    },
    {
      "role": "tool",
      "content": "{\"forecast\":\"Light rain, 13C\"}"
    },
    {
      "role": "assistant",
      "content": "Tomorrow in Paris, expect light rain and about 13°C."
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get a weather forecast for a city and date.",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {
              "type": "string"
            },
            "date": {
              "type": "string"
            }
          },
          "required": ["city", "date"]
        }
      }
    }
  ]
}

The exact schema may differ depending on your model, trainer, and serving stack. The key point is not this exact JSON shape. The key point is that the training format should match the model’s tool-calling chat template.
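The conversion between the two shapes shown above could look roughly like this. It is a sketch under the assumption that every raw turn has the fields from the raw example; convert_turns is a hypothetical helper. It deliberately drops assistant_think, because where reasoning text belongs depends on the target model's template.

```python
import json


def convert_turns(raw):
    """Convert the raw annotation format into a list of standard chat messages."""
    messages = [{"role": "system", "content": raw["system"]}]
    for turn in raw["turns"]:
        messages.append({"role": "user", "content": turn["user"]})
        if "assistant_tool" in turn:
            # Tool call and its observation become two separate messages.
            call = {"type": "function", "function": turn["assistant_tool"]}
            messages.append({"role": "assistant", "tool_calls": [call]})
            messages.append({"role": "tool", "content": json.dumps(turn["tool_result"])})
        messages.append({"role": "assistant", "content": turn["assistant_answer"]})
    return messages


raw = {
    "system": "You are a helpful assistant with tool access.",
    "turns": [{
        "user": "What's the weather in Paris tomorrow?",
        "assistant_think": "The user asks for future weather, so I need a weather tool.",
        "assistant_tool": {"name": "get_weather", "arguments": {"city": "Paris", "date": "tomorrow"}},
        "tool_result": {"forecast": "Light rain, 13C"},
        "assistant_answer": "Tomorrow in Paris, expect light rain and about 13°C.",
    }],
}

messages = convert_turns(raw)
print([message["role"] for message in messages])
```

Turns without an assistant_tool field simply become user/assistant pairs, which is how the no-tool examples would flow through the same converter.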

12. Final recommendation

So my answer would be:

  1. Yes, start with SFT if you have correct trajectories.
  2. Do not train arbitrary custom roles directly unless your target model’s template supports them.
  3. Convert your annotations into the target tool-call format, usually tool_calls, tool role messages, and tools schemas.
  4. Mask loss carefully: user/system/tool observations should generally not be supervised; assistant tool calls and final answers should be.
  5. Inspect the labels, because assistant-only loss depends on the chat template.
  6. Add no-tool, clarification, and unavailable-tool cases, not only positive tool-call examples.
  7. Use DPO if you can create chosen/rejected trajectory pairs.
  8. Use GRPO/RL only when you have executable tools and meaningful rewards.
  9. Evaluate more than final accuracy: measure format validity, schema validity, tool selection, no-tool behavior, clarification behavior, grounding, final correctness, and tool-call cost.

The practical path is:

SFT first.
DPO if you can create preference pairs.
GRPO/RL only if you can run tools during rollout and compute reliable rewards.

Useful references: