Finetuning a Reasoning LLM with Supervised or Reinforcement Learning?

zlapik · June 1, 2026, 2:05pm

Hello,

I have a task to fine-tune small LLMs on annotated conversational data. The dataset contains not only the final answers, but also reasoning traces and tool-calling decisions (i.e., when the model should think and when it should call a tool).

I am wondering what the best training approach would be and why.

My current dataset is stored in a chat format similar to this:

system
user
assistant_think
assistant_tool
assistant_answer

user
assistant_think
assistant_tool
assistant_answer
...

My current idea is to split each conversation into multiple training samples. For example, if a conversation contains two user turns, I would create two samples:

Sample 1

system
user
assistant_think
assistant_tool
assistant_answer

Sample 2

system
user
assistant_think
assistant_tool
assistant_answer

user
assistant_think
assistant_tool
assistant_answer

In other words, each sample contains all previous conversation history up to the assistant response being trained.

For training, the loss would be computed only on the assistant-generated tokens:

assistant_think
assistant_tool
assistant_answer

while the system and user messages would be masked out from the loss.

Is this approach correct, or is there a better way to structure the training data for reasoning and tool-calling behavior?

My second question is about reinforcement learning.

After completing supervised fine-tuning (SFT) on the dataset described above, should I also use RL to further train the model, e.g. on when a tool should or should not be called?

If so:

What advantages would RL provide over SFT alone for tool-use and reasoning?
How would you design the reward function?
Under what circumstances is RL actually necessary, and when is SFT sufficient?

I would appreciate any practical advice, papers, blog posts, or open-source examples related to training reasoning and tool-calling models.

John6666 · June 2, 2026, 4:45am

Hmm, maybe something like this:

I would separate this into three questions that often get mixed together:

How should I represent the training data?
Which tokens should actually receive loss during SFT?
When, if ever, should I move from SFT to preference optimization or RL?

My short answer would be:

Start with SFT if you already have correct trajectories.
Treat assistant_think, assistant_tool, and assistant_answer as an internal annotation format, not necessarily as model roles.
Convert them into the target model’s actual chat template / tool-calling format.
Add no-tool, clarification, and unavailable-tool examples.
Consider DPO if you can create good/bad trajectory pairs.
Consider GRPO/RL only if you have executable tools, a rollout environment, and reliable rewards.

Below is the longer version.

1. Your SFT intuition is mostly right, but I would not train arbitrary custom roles directly

Your idea of making each training example condition on the conversation history and supervise the next assistant-side behavior is basically reasonable.

For example, conceptually:

sample 1:
  system
  user_1
  assistant_1

sample 2:
  system
  user_1
  assistant_1
  user_2
  assistant_2

That is a normal way to turn multi-turn dialogue into next-assistant-response training examples.
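As a rough illustration of that splitting, here is a minimal sketch in plain Python. The helper name `split_into_samples` and the placeholder contents are hypothetical, and it assumes the conversation has already been converted to standard `role`/`content` messages:

```python
from typing import Dict, List

Message = Dict[str, str]


def split_into_samples(messages: List[Message]) -> List[List[Message]]:
    """Return one sample per assistant message, each ending at that message.

    Every sample keeps the full history that precedes the supervised turn.
    """
    samples = []
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant":
            samples.append(messages[: i + 1])
    return samples


conversation = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "user_1"},
    {"role": "assistant", "content": "assistant_1"},
    {"role": "user", "content": "user_2"},
    {"role": "assistant", "content": "assistant_2"},
]

samples = split_into_samples(conversation)
print([len(s) for s in samples])  # [3, 5]
```

One thing to keep in mind: if the loss covers every assistant message in a sample, `assistant_1` is supervised again in the second sample. Whether you want that, or only want loss on the last assistant message of each sample, is a masking decision.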

However, I would be careful with custom roles like:

assistant_think
assistant_tool
assistant_answer

Those can be useful as raw annotations, but I would not assume the model understands them as roles unless your target model’s chat template explicitly supports them.

In Hugging Face / Transformers / TRL terms, the more standard representation is usually closer to:

assistant message containing reasoning-compatible content, if you really want to supervise reasoning text
assistant message containing tool_calls
tool role message containing the tool result
assistant message containing the final answer
tools column containing the available tool schemas

TRL’s SFTTrainer now explicitly supports tool-calling SFT. Its docs say that each tool-calling dataset example should include conversation messages with tool_calls and tool role messages, plus the list of available tools in a tools column, typically as JSON schemas.

The Transformers tool-use docs also describe the same general shape: assistant messages can contain tool_calls, tool responses should be represented as tool role messages, and tools are supplied as schemas/functions to the chat template / tokenizer layer. See Transformers: Tool use.

So I would think of your format like this:

Your raw annotation	Training-format target
`assistant_think`	Model-specific reasoning span, if you want to train visible reasoning
`assistant_tool`	`assistant` message with `tool_calls`
tool result / observation	`tool` role message
`assistant_answer`	final `assistant` message
available tools	`tools` column / JSON schemas

This distinction between “raw trajectory format” and “training format” is important. A related research direction is the Agent Data Protocol, which treats heterogeneous agent trajectories as something to normalize into a common schema before training. You do not need to adopt ADP specifically, but the principle is useful: keep your internal annotation format separate from the model-specific training format.

2. The chat template is not a cosmetic wrapper

For chat/instruct/tool models, the chat template is part of the interface the model was trained on.

The Transformers chat templating docs explain that role/content dictionaries are converted into a token sequence through the model’s chat template. Different model families use different control tokens and different tool-call formats.

That means this is risky:

role = assistant_think
role = assistant_tool
role = assistant_answer

unless you intentionally write a chat template that renders those roles into the exact format your model should learn and later use.

This becomes especially important for reasoning/tool models:

Model family / runtime	Why format matters
Qwen / Qwen3	Qwen has model-specific function-calling templates and parsers; Qwen-Agent encapsulates Qwen’s tool-calling templates/parsers.
GPT-OSS	GPT-OSS models were trained on the Harmony response format, which defines conversation structure, reasoning output, and function calls.
vLLM serving	vLLM’s tool calling docs require a chat template that handles `tool` role messages and assistant messages containing previous tool calls.
Generic Transformers	The tool-use docs expect tool schemas and model-specific rendering through `apply_chat_template`.

So my practical recommendation would be:

Keep assistant_think, assistant_tool, and assistant_answer in your preprocessing code if they help you reason about the data, but convert them before training into the exact message/tool format expected by your target model and inference stack.
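A cheap guard before training is to scan the converted data for roles the template may not know. This is only a sketch: the `STANDARD_ROLES` set is my assumption about a typical template, and some model families define additional roles, so adjust it to whatever your target template actually renders.

```python
# Assumed set of roles; check this against your model's chat template.
STANDARD_ROLES = {"system", "user", "assistant", "tool"}


def find_nonstandard_roles(messages):
    """Return the sorted list of roles that are not in STANDARD_ROLES."""
    return sorted({m["role"] for m in messages if m["role"] not in STANDARD_ROLES})


raw_style = [
    {"role": "system", "content": "..."},
    {"role": "user", "content": "..."},
    {"role": "assistant_think", "content": "..."},
    {"role": "assistant_tool", "content": "..."},
    {"role": "assistant_answer", "content": "..."},
]

print(find_nonstandard_roles(raw_style))
# ['assistant_answer', 'assistant_think', 'assistant_tool']
```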

3. Which tokens should receive loss?

For SFT, you usually do not want to train on every token in the serialized conversation.

A reasonable default is:

Span	Should receive loss?	Notes
system prompt	No	Conditioning context
user messages	No	Conditioning context
assistant reasoning / thinking	Maybe	Only if you intentionally want the model to emit that reasoning format
assistant tool call	Yes	The model must learn when/how to call tools
tool result / observation	No	External environment output, not model-generated text
final assistant answer	Yes	The model should learn the final response

TRL has assistant_only_loss=True for assistant-message-only loss, and also supports completion-only loss for prompt/completion style datasets. See SFTTrainer: Train on assistant messages only.

However, there is an important caveat: assistant_only_loss=True depends on the chat template being able to mark generation spans. The TRL docs mention that this uses {% generation %} / {% endgeneration %} blocks in the chat template, and not every model's default template includes them. There also appears to be an open implementation/documentation issue about adding such generation markers to common chat templates: TRL issue #5471.

So I would not just trust the flag blindly. I would inspect the first batch.

A simple sanity check is:

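# Sketch: assumes `trainer` is an already-constructed SFTTrainer and
# `tokenizer` is the tokenizer it was built with.
batch = next(iter(trainer.get_train_dataloader()))

# Convert tensors to plain Python ints so decode() receives a list of ids.
input_ids = batch["input_ids"][0].tolist()
labels = batch["labels"][0].tolist()

# Positions labeled -100 are ignored by the loss; keep only supervised tokens.
supervised_ids = [
    token_id for token_id, label_id in zip(input_ids, labels)
    if label_id != -100
]

print(tokenizer.decode(supervised_ids))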

You want this decoded text to contain only the assistant-side spans you intend to supervise, such as tool calls and final answers. If it includes user messages, tool observations, or system text, your masking is wrong.
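If it helps to see the masking policy from the table in isolation, here is a toy sketch that builds labels by hand. The segment tags and token ids are made up for illustration, and real pipelines get the spans from the tokenizer/chat template rather than from pre-split segments:

```python
IGNORE_INDEX = -100  # label value that the loss ignores

# Hypothetical segment tags; reasoning is left out of the supervised set here.
DEFAULT_SUPERVISED = frozenset({"assistant_tool_call", "assistant_answer"})


def build_labels(segments, supervised=DEFAULT_SUPERVISED):
    """segments: list of (tag, token_ids). Returns (input_ids, labels)."""
    input_ids, labels = [], []
    for tag, ids in segments:
        input_ids.extend(ids)
        if tag in supervised:
            labels.extend(ids)  # supervised: label equals the token id
        else:
            labels.extend([IGNORE_INDEX] * len(ids))  # masked out
    return input_ids, labels


segments = [
    ("system", [1, 2]),
    ("user", [3, 4, 5]),
    ("assistant_tool_call", [6, 7]),
    ("tool_result", [8, 9]),
    ("assistant_answer", [10, 11, 12]),
]

input_ids, labels = build_labels(segments)
print(labels)
# [-100, -100, -100, -100, -100, 6, 7, -100, -100, 10, 11, 12]
```

To also supervise reasoning, you would add a reasoning tag to `supervised`; that is the "Maybe" row in the table.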

4. Tool-call examples alone are not enough

A common failure mode is: after fine-tuning on tool-call examples, the model starts calling tools too often.

So the dataset should not only contain “here is how to call a tool” examples. It should also contain:

Case type	Why it matters
Tool-required examples	Teach the model to call tools when needed
No-tool examples	Teach the model to answer directly when no tool is needed
Clarification examples	Teach the model to ask for missing required arguments
Unavailable-tool examples	Teach the model to admit that the provided tools cannot solve the request
Irrelevant-tool examples	Teach the model not to force an unrelated tool call
Bad-result / failed-tool examples	Teach recovery or fallback behavior
Multi-turn tool-result examples	Teach the model to incorporate observations into later turns

This point is not just theoretical. The paper When2Call: When (not) to Call Tools focuses exactly on tool-calling decision-making: when to call a tool, when to ask follow-up questions, and when to admit that the question cannot be answered with the provided tools.

That is the part people often miss. Calling the right tool with the right arguments is one skill. Deciding whether a tool call should happen at all is another skill.
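A quick way to spot an unbalanced dataset is to count how many examples contain a tool call at all. This sketch assumes examples are already in the `messages` format with `tool_calls` on assistant messages; the helper names are hypothetical, and it only separates tool vs. no-tool, not the finer case types in the table:

```python
from collections import Counter


def has_tool_call(example):
    """True if any assistant message in the example carries tool_calls."""
    return any(
        m["role"] == "assistant" and m.get("tool_calls")
        for m in example["messages"]
    )


def case_mix(examples):
    """Return the fraction of tool and no-tool examples."""
    counts = Counter("tool" if has_tool_call(e) else "no_tool" for e in examples)
    total = sum(counts.values())
    if total == 0:
        return {"tool": 0.0, "no_tool": 0.0}
    return {key: counts[key] / total for key in ("tool", "no_tool")}


call = {"type": "function", "function": {"name": "get_weather", "arguments": {}}}
tool_example = {"messages": [
    {"role": "user", "content": "Weather in Paris?"},
    {"role": "assistant", "tool_calls": [call]},
]}
no_tool_example = {"messages": [
    {"role": "user", "content": "Thanks!"},
    {"role": "assistant", "content": "You're welcome."},
]}

print(case_mix([tool_example] * 3 + [no_tool_example]))
# {'tool': 0.75, 'no_tool': 0.25}
```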

5. Validate trajectories at the step level, not only the final answer level

If you have multi-turn trajectories, I would also inspect them at the turn/step level before training.

A trajectory can have a correct final answer but still contain a bad intermediate action, such as:

wrong tool call
lucky tool result
correct final answer

If you train on that trajectory, the model may learn the bad intermediate policy.

This is one reason recent tool-use dataset work emphasizes filtering or validating intermediate steps. For example, ToolMind argues that trajectory-level validation can miss turn-level errors, and uses fine-grained turn-level filtering to remove erroneous or suboptimal steps.

For your case, I would check each step:

Step	Check
Reasoning / planning	Did the assistant correctly identify whether a tool is needed?
Tool selection	Was the selected tool relevant?
Arguments	Were the arguments available from context and schema-valid?
Tool result	Was the observation inserted into the dialogue correctly?
Final answer	Did the final answer use the tool result rather than hallucinating?
Cost	Did the trajectory avoid unnecessary tool calls?
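The "Tool selection" and "Arguments" rows are the easiest to automate. Below is a minimal sketch that checks a call against the tool schemas; `validate_call` is a hypothetical helper and it covers only a small subset of JSON Schema (required keys and basic types), so a full validator such as the `jsonschema` package would be more thorough:

```python
JSON_TYPES = {
    "string": str, "integer": int, "number": (int, float),
    "boolean": bool, "object": dict, "array": list,
}


def validate_call(call, tools):
    """Return a list of problems; an empty list means these checks passed."""
    schemas = {t["function"]["name"]: t["function"]["parameters"] for t in tools}
    if call["name"] not in schemas:
        return [f"unknown tool: {call['name']}"]
    params = schemas[call["name"]]
    args = call["arguments"]
    problems = [
        f"missing required argument: {key}"
        for key in params.get("required", []) if key not in args
    ]
    for key, value in args.items():
        spec = params.get("properties", {}).get(key)
        if spec is None:
            problems.append(f"unexpected argument: {key}")
            continue
        expected = JSON_TYPES.get(spec.get("type"))
        if expected is None:
            continue  # type not covered by this sketch
        # bool is a subclass of int in Python, so reject it for non-boolean fields
        is_stray_bool = isinstance(value, bool) and spec["type"] != "boolean"
        if not isinstance(value, expected) or is_stray_bool:
            problems.append(f"wrong type for argument: {key}")
    return problems


tools = [{"type": "function", "function": {
    "name": "get_weather",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}, "date": {"type": "string"}},
        "required": ["city", "date"],
    },
}}]

print(validate_call({"name": "get_weather", "arguments": {"city": "Paris"}}, tools))
# ['missing required argument: date']
```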

6. When is SFT enough?

SFT is the right first move when you have high-quality demonstrations.

SFT is especially good for:

Goal	SFT suitability
Learning the serialized tool-call format	High
Learning JSON/schema shape	High
Learning basic tool choice from examples	Medium to high
Learning to use tool results in final answers	High
Learning no-tool behavior	Good if no-tool examples are included
Learning robust exploration over new tools	Limited
Optimizing tool-use cost	Limited
Recovering from tool failure	Depends heavily on data

So I would start with SFT, but I would not assume that SFT alone solves the full policy problem.

A practical first checkpoint after SFT:

Metric	What to measure
Format validity	Can you parse the model’s tool call?
Schema validity	Do required fields and types match the schema?
Tool selection accuracy	Is the selected tool correct?
No-tool accuracy	Does it avoid tools when unnecessary?
Clarification accuracy	Does it ask for missing required info?
Grounding	Does the final answer use the tool result?
Final answer correctness	Is the final answer correct?
Tool-call count	Is the model overusing tools?
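A few of these metrics can be computed with very little code. The sketch below assumes, purely for illustration, that the model emits either a bare JSON object with `name` and `arguments` or plain text; real models usually wrap tool calls in model-specific markers, so the parser would need to match your template. The record fields `output` and `expected_tool` are hypothetical.

```python
import json


def parse_tool_call(text):
    """Return the call dict if text is a JSON object with name/arguments, else None."""
    try:
        obj = json.loads(text)
    except (json.JSONDecodeError, TypeError):
        return None
    if (isinstance(obj, dict) and isinstance(obj.get("name"), str)
            and isinstance(obj.get("arguments"), dict)):
        return obj
    return None


def evaluate(records):
    """records: list of {"output": str, "expected_tool": str or None}."""
    attempted = parsed_ok = 0
    tool_total = tool_correct = 0
    no_tool_total = no_tool_correct = 0
    for r in records:
        out = r["output"].strip()
        is_attempt = out.startswith("{")  # crude "tried to call a tool" signal
        call = parse_tool_call(out) if is_attempt else None
        if is_attempt:
            attempted += 1
            parsed_ok += call is not None
        if r["expected_tool"] is None:
            no_tool_total += 1
            no_tool_correct += not is_attempt
        else:
            tool_total += 1
            tool_correct += call is not None and call["name"] == r["expected_tool"]

    def ratio(a, b):
        return a / b if b else None

    return {
        "format_validity": ratio(parsed_ok, attempted),
        "tool_selection_accuracy": ratio(tool_correct, tool_total),
        "no_tool_accuracy": ratio(no_tool_correct, no_tool_total),
    }


records = [
    {"output": '{"name": "get_weather", "arguments": {"city": "Paris"}}', "expected_tool": "get_weather"},
    {"output": '{"name": "get_weather", "arguments": ', "expected_tool": "get_weather"},
    {"output": "Paris is the capital of France.", "expected_tool": None},
    {"output": '{"name": "search", "arguments": {}}', "expected_tool": None},
]
print(evaluate(records))
```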

For evaluation inspiration, see the Berkeley Function Calling Leaderboard, which focuses on function/tool-call accuracy, and ToolSandbox, which evaluates stateful, conversational, interactive tool use.

7. DPO can be a natural next step before RL

If you can build preferred/rejected trajectory pairs, DPO is often simpler than full RL.

TRL’s DPOTrainer supports tool-calling data too: examples can include prompt, chosen, and rejected conversations with tool_calls, tool role messages, and a tools column.

Examples of useful DPO pairs:

Situation	Chosen	Rejected
Tool needed	Correct tool call + grounded answer	Hallucinated direct answer
Tool not needed	Direct answer	Unnecessary tool call
Missing required argument	Clarifying question	Invalid tool call with guessed argument
Irrelevant tools only	Explain that available tools are not enough	Force an unrelated tool call
Tool result given	Answer grounded in result	Answer ignores result
Cost-sensitive task	Minimal sufficient calls	Excessive repeated calls
Invalid JSON risk	Parseable/schema-valid call	Malformed call

This is often a very practical middle ground:

SFT teaches the model the basic behavior.
DPO nudges the model away from bad variants of that behavior.
RL is only needed if you have an executable environment and reliable rewards.
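For concreteness, here is a sketch of one preference record for the "Tool not needed" row. The field names follow the prompt/chosen/rejected/tools shape described above, but `make_preference_example` is a hypothetical helper and the exact column layout should be checked against the TRL docs for your version:

```python
def make_preference_example(prompt, chosen, rejected, tools):
    """Bundle one preference pair; chosen and rejected must actually differ."""
    if chosen == rejected:
        raise ValueError("chosen and rejected must differ")
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected, "tools": tools}


tools = [{"type": "function", "function": {
    "name": "get_weather",
    "description": "Get a weather forecast for a city and date.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}, "date": {"type": "string"}},
        "required": ["city", "date"],
    },
}}]

example = make_preference_example(
    prompt=[{"role": "user", "content": "Thanks, that's all for today!"}],
    # Chosen: a direct answer, since no tool is needed.
    chosen=[{"role": "assistant", "content": "You're welcome. Have a good day!"}],
    # Rejected: an unnecessary tool call with guessed arguments.
    rejected=[{"role": "assistant", "tool_calls": [{
        "type": "function",
        "function": {"name": "get_weather",
                     "arguments": {"city": "Paris", "date": "today"}},
    }]}],
    tools=tools,
)
print(sorted(example))  # ['chosen', 'prompt', 'rejected', 'tools']
```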

8. When should you use RL / GRPO?

I would only move to RL if you have more than just example trajectories.

You need at least some of the following:

Requirement	Why it matters
Executable tools	The model’s tool calls must actually run during rollout
Parser	The training loop must parse tool calls from model output
Environment state	Multi-turn tool use often changes state
Verifier	You need to score success or failure
Reward components	Tool selection, arguments, execution, grounding, cost
Stable chat template	Tool calls and observations must serialize consistently
Initial tool-capable policy	Otherwise RL may not explore useful tool calls

TRL’s GRPOTrainer supports tools and also an environment_factory mode, where the trainer creates an environment instance per rollout and exposes public methods as tools. TRL’s OpenEnv integration is also relevant if you want environment-backed training.

The important point is that RL is not just “SFT plus a reward function”. You need the full loop:

model generates
→ parser extracts tool call
→ tool/environment executes
→ observation is returned to the model
→ model continues
→ verifier computes rewards
→ policy update happens

If you cannot execute tools during rollout or cannot compute meaningful rewards, I would not start with RL.
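To make the loop concrete, here is a toy rollout in plain Python. Everything in it is a stand-in: `scripted_policy` replaces the model, `get_weather` is a stub tool, and the bare-JSON tool-call format is an assumption. The verifier and the policy update are deliberately left out; in practice a trainer such as GRPOTrainer would own those parts.

```python
import json


def get_weather(city, date):
    """Stub tool; a real environment would call an actual service."""
    return {"forecast": "Light rain, 13C"}


TOOLS = {"get_weather": get_weather}


def parse_tool_call(text):
    """Return {"name", "arguments"} if text is such a JSON object, else None."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return None
    if isinstance(obj, dict) and "name" in obj and isinstance(obj.get("arguments"), dict):
        return obj
    return None


def rollout(policy, user_message, max_steps=4):
    """Generate, parse, execute, feed back the observation, and repeat."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        output = policy(messages)
        call = parse_tool_call(output)
        if call is None:  # plain text is treated as the final answer
            messages.append({"role": "assistant", "content": output})
            break
        messages.append({"role": "assistant",
                         "tool_calls": [{"type": "function", "function": call}]})
        try:
            observation = TOOLS[call["name"]](**call["arguments"])
        except (KeyError, TypeError) as exc:  # unknown tool or bad arguments
            observation = {"error": str(exc)}
        messages.append({"role": "tool", "content": json.dumps(observation)})
    return messages


def scripted_policy(messages):
    """Stand-in for the model: call the tool once, then answer."""
    if messages[-1]["role"] == "tool":
        return "Tomorrow in Paris, expect light rain and about 13°C."
    return json.dumps({"name": "get_weather",
                       "arguments": {"city": "Paris", "date": "tomorrow"}})


trajectory = rollout(scripted_policy, "What's the weather in Paris tomorrow?")
print([m["role"] for m in trajectory])  # ['user', 'assistant', 'tool', 'assistant']
```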

9. Reward design for tool use should be decomposed

A final-answer-only reward is often too coarse.

The paper ToolRL: Reward is All Tool Learning Needs makes this point directly: tool-use RL is hard because multiple tools and diverse parameters require more fine-grained feedback than simple answer matching.

A useful reward decomposition might be:

Reward component	Example
Format reward	Output is parseable as a tool call or final answer
Schema reward	Required arguments exist and have correct types
Tool selection reward	Correct tool selected
Argument semantic reward	Arguments are correct given the conversation
Execution reward	Tool executes successfully
Grounding reward	Final answer uses the tool observation
Final correctness reward	The final answer is correct
No-tool reward	Avoids tools when no tool is needed
Clarification reward	Asks for missing required information
Cost penalty	Penalizes unnecessary tool calls or excessive calls

Also, beware of overusing tools. Work such as OTC: Optimal Tool Calls via Reinforcement Learning focuses on encouraging accurate answers with fewer tool calls. This matters because a reward that only values final correctness can accidentally teach the model to call tools too often.
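A decomposed reward with a cost term could look roughly like the sketch below. The component names, the weights, and the per-call penalty are all hypothetical values chosen for illustration, not numbers taken from ToolRL or OTC, and the boolean checks are assumed to come from your own parser/verifier:

```python
# Hypothetical weights; they would need tuning for a real task.
DEFAULT_WEIGHTS = {
    "format": 0.1,
    "schema": 0.1,
    "tool_selection": 0.2,
    "execution": 0.1,
    "grounding": 0.2,
    "final_correct": 0.3,
}
COST_PER_EXTRA_CALL = 0.05


def tool_use_reward(checks, num_calls, needed_calls,
                    weights=DEFAULT_WEIGHTS, cost=COST_PER_EXTRA_CALL):
    """Weighted sum of passed checks minus a penalty for unnecessary calls.

    checks: dict mapping component name -> bool (missing keys count as False).
    """
    base = sum(w for name, w in weights.items() if checks.get(name, False))
    extra_calls = max(0, num_calls - needed_calls)
    return base - cost * extra_calls


all_ok = {name: True for name in DEFAULT_WEIGHTS}
print(round(tool_use_reward(all_ok, num_calls=1, needed_calls=1), 2))  # 1.0
print(round(tool_use_reward(all_ok, num_calls=3, needed_calls=1), 2))  # 0.9
```

With `needed_calls=0`, the same penalty term also covers the no-tool case: a correct direct answer keeps its full score, while an unnecessary call loses some.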

10. Suggested practical training path

I would use this staged approach:

Stage	Do this	Move on when
0. Normalize data	Convert raw `assistant_think/tool/answer` annotations into target chat/tool format	The rendered examples match the target model’s template
1. Mask inspection	Verify which tokens receive loss	Only intended assistant spans are supervised
2. SFT	Train on high-quality trajectories	Format, schema, and basic tool use work
3. Evaluation	Test tool/no-tool, schema, grounding, final correctness	You know the failure modes
4. DPO	Use chosen/rejected pairs for common mistakes	Over-calling, invalid calls, and hallucinations improve
5. RL/GRPO	Only if tools are executable and rewards are reliable	You can run environment-backed rollouts

In short:

If you have demonstrations:
  start with SFT.

If you have good vs bad trajectory pairs:
  consider DPO.

If you have executable tools + verifier + reward:
  consider GRPO/RL.

If you have none of those:
  build evaluation and clean the dataset first.

11. A possible data representation

As an internal raw format, something like this is fine:

{
  "system": "You are a helpful assistant with tool access.",
  "turns": [
    {
      "user": "What's the weather in Paris tomorrow?",
      "assistant_think": "The user asks for current/future weather, so I need a weather tool.",
      "assistant_tool": {
        "name": "get_weather",
        "arguments": {
          "city": "Paris",
          "date": "tomorrow"
        }
      },
      "tool_result": {
        "forecast": "Light rain, 13C"
      },
      "assistant_answer": "Tomorrow in Paris, expect light rain and about 13°C."
    }
  ]
}

But before training, I would convert it to a model/tool format closer to:

{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant with tool access."
    },
    {
      "role": "user",
      "content": "What's the weather in Paris tomorrow?"
    },
    {
      "role": "assistant",
      "tool_calls": [
        {
          "type": "function",
          "function": {
            "name": "get_weather",
            "arguments": {
              "city": "Paris",
              "date": "tomorrow"
            }
          }
        }
      ]
    },
    {
      "role": "tool",
      "content": "{\"forecast\":\"Light rain, 13C\"}"
    },
    {
      "role": "assistant",
      "content": "Tomorrow in Paris, expect light rain and about 13°C."
    }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get a weather forecast for a city and date.",
        "parameters": {
          "type": "object",
          "properties": {
            "city": {
              "type": "string"
            },
            "date": {
              "type": "string"
            }
          },
          "required": ["city", "date"]
        }
      }
    }
  ]
}

The exact schema may differ depending on your model, trainer, and serving stack. The key point is not this exact JSON shape. The key point is that the training format should match the model’s tool-calling chat template.
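The conversion between the two shapes above is mechanical. Here is a minimal sketch; `raw_to_messages` is a hypothetical helper, it assumes at most one tool call per turn, and it drops `assistant_think` because where reasoning text belongs is model-specific:

```python
import json


def raw_to_messages(raw):
    """Convert the raw annotation format into role-based messages."""
    messages = [{"role": "system", "content": raw["system"]}]
    for turn in raw["turns"]:
        messages.append({"role": "user", "content": turn["user"]})
        if "assistant_tool" in turn:
            messages.append({
                "role": "assistant",
                "tool_calls": [{"type": "function", "function": turn["assistant_tool"]}],
            })
            # Tool results are serialized to a string for the tool message.
            messages.append({
                "role": "tool",
                "content": json.dumps(turn["tool_result"], separators=(",", ":")),
            })
        messages.append({"role": "assistant", "content": turn["assistant_answer"]})
    return messages


raw = {
    "system": "You are a helpful assistant with tool access.",
    "turns": [{
        "user": "What's the weather in Paris tomorrow?",
        "assistant_think": "The user asks for current/future weather, so I need a weather tool.",
        "assistant_tool": {"name": "get_weather",
                           "arguments": {"city": "Paris", "date": "tomorrow"}},
        "tool_result": {"forecast": "Light rain, 13C"},
        "assistant_answer": "Tomorrow in Paris, expect light rain and about 13°C.",
    }],
}

messages = raw_to_messages(raw)
print([m["role"] for m in messages])
# ['system', 'user', 'assistant', 'tool', 'assistant']
```

The `tools` column is not derived from the trajectory; it has to come from wherever your tool schemas are defined.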

12. Final recommendation

So my answer would be:

Yes, start with SFT if you have correct trajectories.
Do not train arbitrary custom roles directly unless your target model’s template supports them.
Convert your annotations into the target tool-call format, usually tool_calls, tool role messages, and tools schemas.
Mask loss carefully: user/system/tool observations should generally not be supervised; assistant tool calls and final answers should be.
Inspect the labels, because assistant-only loss depends on the chat template.
Add no-tool, clarification, and unavailable-tool cases, not only positive tool-call examples.
Use DPO if you can create chosen/rejected trajectory pairs.
Use GRPO/RL only when you have executable tools and meaningful rewards.
Evaluate more than final accuracy: measure format validity, schema validity, tool selection, no-tool behavior, clarification behavior, grounding, final correctness, and tool-call cost.

The practical path is:

SFT first.
DPO if you can create preference pairs.
GRPO/RL only if you can run tools during rollout and compute reliable rewards.

Useful references:
