Hmm, maybe something like this:
I would separate this into three questions that often get mixed together:
- How should I represent the training data?
- Which tokens should actually receive loss during SFT?
- When, if ever, should I move from SFT to preference optimization or RL?
My short answer would be:
- Start with SFT if you already have correct trajectories.
- Treat assistant_think, assistant_tool, and assistant_answer as an internal annotation format, not necessarily as model roles.
- Convert them into the target model’s actual chat template / tool-calling format.
- Add no-tool, clarification, and unavailable-tool examples.
- Consider DPO if you can create good/bad trajectory pairs.
- Consider GRPO/RL only if you have executable tools, a rollout environment, and reliable rewards.
Below is the longer version.
1. Your SFT intuition is mostly right, but I would not train arbitrary custom roles directly
Your idea of making each training example condition on the conversation history and supervise the next assistant-side behavior is basically reasonable.
For example, conceptually:
    sample 1:
        system
        user_1
        assistant_1

    sample 2:
        system
        user_1
        assistant_1
        user_2
        assistant_2
That is a normal way to turn multi-turn dialogue into next-assistant-response training examples.
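As a rough illustration of that expansion, here is a minimal sketch in plain Python. The function name `expand_to_samples` and the toy conversation are my own placeholders, not part of any library; the idea is only that each assistant message becomes the target of one sample, conditioned on everything before it.

```python
def expand_to_samples(messages):
    """Return one sample per assistant message: the history up to and including it."""
    samples = []
    for i, msg in enumerate(messages):
        if msg["role"] == "assistant":
            # The last message of each sample is the supervised target;
            # everything before it is conditioning context.
            samples.append(messages[: i + 1])
    return samples


conversation = [
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]
samples = expand_to_samples(conversation)
print([len(s) for s in samples])
```

Whether you actually need this expansion depends on your trainer: if the loss mask already covers every assistant span in a single full conversation, one sample per conversation may be enough.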
However, I would be careful with custom roles like:
assistant_think
assistant_tool
assistant_answer
Those can be useful as raw annotations, but I would not assume the model understands them as roles unless your target model’s chat template explicitly supports them.
In Hugging Face / Transformers / TRL terms, the more standard representation is usually closer to:
- assistant message containing reasoning-compatible content, if you really want to supervise reasoning text
- assistant message containing tool_calls
- tool role message containing the tool result
- assistant message containing the final answer
- tools column containing the available tool schemas
TRL’s SFTTrainer now explicitly supports tool-calling SFT. Its docs say that each tool-calling dataset example should include conversation messages with tool_calls and tool role messages, plus the list of available tools in a tools column, typically as JSON schemas.
The Transformers tool-use docs also describe the same general shape: assistant messages can contain tool_calls, tool responses should be represented as tool role messages, and tools are supplied as schemas/functions to the chat template / tokenizer layer. See Transformers: Tool use.
So I would think of your format like this:
| Your raw annotation | Training-format target |
| --- | --- |
| assistant_think | Model-specific reasoning span, if you want to train visible reasoning |
| assistant_tool | assistant message with tool_calls |
| tool result / observation | tool role message |
| assistant_answer | final assistant message |
| available tools | tools column / JSON schemas |
This distinction between “raw trajectory format” and “training format” is important. A related research direction is the Agent Data Protocol, which treats heterogeneous agent trajectories as something to normalize into a common schema before training. You do not need to adopt ADP specifically, but the principle is useful: keep your internal annotation format separate from the model-specific training format.
2. The chat template is not a cosmetic wrapper
For chat/instruct/tool models, the chat template is part of the interface the model was trained on.
The Transformers chat templating docs explain that role/content dictionaries are converted into a token sequence through the model’s chat template. Different model families use different control tokens and different tool-call formats.
That means this is risky:
role = assistant_think
role = assistant_tool
role = assistant_answer
unless you intentionally write a chat template that renders those roles into the exact format your model should learn and later use.
This becomes especially important for reasoning/tool models:
| Model family / runtime | Why format matters |
| --- | --- |
| Qwen / Qwen3 | Qwen has model-specific function-calling templates and parsers; Qwen-Agent encapsulates Qwen’s tool-calling templates/parsers. |
| GPT-OSS | GPT-OSS models were trained on the Harmony response format, which defines conversation structure, reasoning output, and function calls. |
| vLLM serving | vLLM’s tool calling docs require a chat template that handles tool role messages and assistant messages containing previous tool calls. |
| Generic Transformers | The tool-use docs expect tool schemas and model-specific rendering through apply_chat_template. |
So my practical recommendation would be:
Keep assistant_think, assistant_tool, and assistant_answer in your preprocessing code if they help you reason about the data, but convert them before training into the exact message/tool format expected by your target model and inference stack.
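A cheap guard I would add to the preprocessing step is a check that no raw annotation role leaks into the converted data. This is only a sketch: `STANDARD_ROLES` is my assumption about what a typical template accepts, so adjust it to whatever your target model’s template actually supports.

```python
# Assumption: the target chat template understands only these roles.
STANDARD_ROLES = {"system", "user", "assistant", "tool"}


def find_unsupported_roles(messages, allowed_roles=STANDARD_ROLES):
    """Return (index, role) pairs for messages whose role is not in allowed_roles."""
    return [
        (i, m["role"])
        for i, m in enumerate(messages)
        if m["role"] not in allowed_roles
    ]


raw = [
    {"role": "user", "content": "Weather in Paris?"},
    {"role": "assistant_think", "content": "Need a weather tool."},
    {"role": "assistant_tool", "content": "get_weather(city='Paris')"},
]
problems = find_unsupported_roles(raw)
print(problems)
```

An empty result does not prove the rendering is right, it only rules out the most obvious mismatch; I would still look at the rendered text of a few examples.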
3. Which tokens should receive loss?
For SFT, you usually do not want to train on every token in the serialized conversation.
A reasonable default is:
| Span | Should receive loss? | Notes |
| --- | --- | --- |
| system prompt | No | Conditioning context |
| user messages | No | Conditioning context |
| assistant reasoning / thinking | Maybe | Only if you intentionally want the model to emit that reasoning format |
| assistant tool call | Yes | The model must learn when/how to call tools |
| tool result / observation | No | External environment output, not model-generated text |
| final assistant answer | Yes | The model should learn the final response |
TRL has assistant_only_loss=True for assistant-message-only loss, and also supports completion-only loss for prompt/completion style datasets. See SFTTrainer: Train on assistant messages only.
However, there is an important caveat: assistant_only_loss=True depends on the chat template being able to mark generation spans. The TRL docs mention that this uses {% generation %} / {% endgeneration %} blocks in the chat template. There also appears to be ongoing implementation/documentation work on adding such generation markers to common chat templates: TRL issue #5471.
So I would not just trust the flag blindly. I would inspect the first batch.
A simple sanity check is:
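    # Sketch: assumes `trainer` is your configured SFTTrainer and
    # `tokenizer` is the tokenizer it was built with.
    batch = next(iter(trainer.get_train_dataloader()))
    input_ids = batch["input_ids"][0].tolist()
    labels = batch["labels"][0].tolist()

    # Labels equal to -100 are ignored by the loss, so keep everything else.
    supervised_ids = [
        token_id
        for token_id, label_id in zip(input_ids, labels)
        if label_id != -100
    ]
    print(tokenizer.decode(supervised_ids))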
You want this decoded text to contain only the assistant-side spans you intend to supervise, such as tool calls and final answers. If it includes user messages, tool observations, or system text, your masking is wrong.
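To make the masking rule itself concrete, here is a toy sketch that builds labels by hand from pre-tokenized segments. It is not how TRL does it internally; the token ids and the `build_labels` helper are made up for illustration. The only real convention used is the -100 ignore value.

```python
IGNORE_INDEX = -100  # label value that the loss skips


def build_labels(segments):
    """segments: list of (token_ids, supervise) pairs in conversation order."""
    input_ids, labels = [], []
    for token_ids, supervise in segments:
        input_ids.extend(token_ids)
        # Supervised spans copy their token ids; everything else is masked out.
        labels.extend(token_ids if supervise else [IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


segments = [
    ([1, 2, 3], False),  # system prompt
    ([4, 5], False),     # user message
    ([6, 7, 8], True),   # assistant tool call
    ([9, 10], False),    # tool result / observation
    ([11, 12], True),    # final assistant answer
]
input_ids, labels = build_labels(segments)
supervised = [t for t, l in zip(input_ids, labels) if l != IGNORE_INDEX]
print(supervised)
```

If the decoded first batch from your real trainer does not look like the supervised list you would build by hand this way, the mask is the first thing I would debug.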
4. Tool-call examples alone are not enough
A common failure mode is: after fine-tuning on tool-call examples, the model starts calling tools too often.
So the dataset should not only contain “here is how to call a tool” examples. It should also contain:
| Case type | Why it matters |
| --- | --- |
| Tool-required examples | Teach the model to call tools when needed |
| No-tool examples | Teach the model to answer directly when no tool is needed |
| Clarification examples | Teach the model to ask for missing required arguments |
| Unavailable-tool examples | Teach the model to admit that the provided tools cannot solve the request |
| Irrelevant-tool examples | Teach the model not to force an unrelated tool call |
| Bad-result / failed-tool examples | Teach recovery or fallback behavior |
| Multi-turn tool-result examples | Teach the model to incorporate observations into later turns |
This point is not just theoretical. The paper When2Call: When (not) to Call Tools focuses exactly on tool-calling decision-making: when to call a tool, when to ask follow-up questions, and when to admit that the question cannot be answered with the provided tools.
That is the part people often miss. Calling the right tool with the right arguments is one skill. Deciding whether a tool call should happen at all is another skill.
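One quick way to see whether a dataset is lopsided is to count how many examples contain a tool call at all. The sketch below assumes examples shaped like a `messages` list with optional `tool_calls` on assistant messages; the helper names are mine, and it only separates tool from no-tool examples, so clarification or unavailable-tool cases would need their own labels.

```python
from collections import Counter


def has_tool_call(example):
    """True if any assistant message in the example carries tool_calls."""
    return any(
        m["role"] == "assistant" and m.get("tool_calls")
        for m in example["messages"]
    )


def tool_call_mix(dataset):
    """Return the fraction of examples with and without a tool call."""
    counts = Counter("tool" if has_tool_call(ex) else "no_tool" for ex in dataset)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


tool_example = {"messages": [
    {"role": "user", "content": "Weather in Paris?"},
    {"role": "assistant", "tool_calls": [
        {"type": "function",
         "function": {"name": "get_weather", "arguments": {"city": "Paris"}}}
    ]},
]}
direct_example = {"messages": [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4"},
]}
dataset = [tool_example] * 3 + [direct_example]
mix = tool_call_mix(dataset)
print(mix)
```

There is no universally right ratio; the point is to know the number before training rather than discover the over-calling afterwards.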
5. Validate trajectories at the step level, not only the final answer level
If you have multi-turn trajectories, I would also inspect them at the turn/step level before training.
A trajectory can have a correct final answer but still contain a bad intermediate action, such as:
wrong tool call
lucky tool result
correct final answer
If you train on that trajectory, the model may learn the bad intermediate policy.
This is one reason recent tool-use dataset work emphasizes filtering or validating intermediate steps. For example, ToolMind argues that trajectory-level validation can miss turn-level errors, and uses fine-grained turn-level filtering to remove erroneous or suboptimal steps.
For your case, I would check each step:
| Step | Check |
| --- | --- |
| Reasoning / planning | Did the assistant correctly identify whether a tool is needed? |
| Tool selection | Was the selected tool relevant? |
| Arguments | Were the arguments available from context and schema-valid? |
| Tool result | Was the observation inserted into the dialogue correctly? |
| Final answer | Did the final answer use the tool result rather than hallucinating? |
| Cost | Did the trajectory avoid unnecessary tool calls? |
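Some of these checks can be automated. As a sketch of the tool-selection and argument checks, the function below compares one tool call against the available tool schemas. It is a deliberately minimal check that I wrote for illustration (required fields, unknown arguments, basic types), not a full JSON Schema validator, and it cannot judge whether the arguments are semantically right.

```python
# Minimal mapping from JSON Schema type names to Python types.
# Limitation: Python treats bool as an int, so True would pass as "integer".
JSON_TYPES = {"string": str, "number": (int, float), "integer": int,
              "boolean": bool, "object": dict, "array": list}


def check_tool_call(call, tools):
    """Return a list of problems for one tool call; an empty list means it passed."""
    schemas = {t["function"]["name"]: t["function"] for t in tools}
    name = call["function"]["name"]
    if name not in schemas:
        return [f"unknown tool: {name}"]
    params = schemas[name].get("parameters", {})
    props = params.get("properties", {})
    args = call["function"]["arguments"]
    problems = []
    for required in params.get("required", []):
        if required not in args:
            problems.append(f"missing required argument: {required}")
    for key, value in args.items():
        if key not in props:
            problems.append(f"unexpected argument: {key}")
            continue
        expected = JSON_TYPES.get(props[key].get("type"))
        if expected and not isinstance(value, expected):
            problems.append(f"wrong type for {key}")
    return problems


tools = [{"type": "function", "function": {
    "name": "get_weather",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"},
                                  "date": {"type": "string"}},
                   "required": ["city", "date"]}}}]
good = {"function": {"name": "get_weather",
                     "arguments": {"city": "Paris", "date": "tomorrow"}}}
bad = {"function": {"name": "get_weather", "arguments": {"city": 75000}}}
print(check_tool_call(good, tools))
print(check_tool_call(bad, tools))
```

Running this over every assistant tool call in the dataset gives you a list of steps to drop or fix before training, independent of whether the final answer happened to be correct.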
6. When is SFT enough?
SFT is the right first move when you have high-quality demonstrations.
SFT is especially good for:
| Goal | SFT suitability |
| --- | --- |
| Learning the serialized tool-call format | High |
| Learning JSON/schema shape | High |
| Learning basic tool choice from examples | Medium to high |
| Learning to use tool results in final answers | High |
| Learning no-tool behavior | Good if no-tool examples are included |
| Learning robust exploration over new tools | Limited |
| Optimizing tool-use cost | Limited |
| Recovering from tool failure | Depends heavily on data |
So I would start with SFT, but I would not assume that SFT alone solves the full policy problem.
A practical first checkpoint after SFT:
| Metric | What to measure |
| --- | --- |
| Format validity | Can you parse the model’s tool call? |
| Schema validity | Do required fields and types match the schema? |
| Tool selection accuracy | Is the selected tool correct? |
| No-tool accuracy | Does it avoid tools when unnecessary? |
| Clarification accuracy | Does it ask for missing required info? |
| Grounding | Does the final answer use the tool result? |
| Final answer correctness | Is the final answer correct? |
| Tool-call count | Is the model overusing tools? |
For evaluation inspiration, see the Berkeley Function Calling Leaderboard, which focuses on function/tool-call accuracy, and ToolSandbox, which evaluates stateful, conversational, interactive tool use.
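For a first in-house checkpoint, a few of these metrics can be computed from a simple list of evaluation records. The record fields below (`expected_tool`, `predicted_tool`, `parsed`) are a hypothetical layout of my own, not a benchmark format; the sketch covers only format validity, tool selection, and no-tool accuracy.

```python
def tool_decision_metrics(records):
    """records: dicts with 'expected_tool' (name or None),
    'predicted_tool' (name or None), and 'parsed' (bool)."""
    needs_tool = [r for r in records if r["expected_tool"] is not None]
    no_tool = [r for r in records if r["expected_tool"] is None]

    def rate(hits, group):
        # None signals "no examples of this kind", which is itself worth noticing.
        return sum(hits) / len(group) if group else None

    return {
        "format_validity": rate([r["parsed"] for r in records], records),
        "tool_selection_accuracy": rate(
            [r["predicted_tool"] == r["expected_tool"] for r in needs_tool],
            needs_tool,
        ),
        "no_tool_accuracy": rate(
            [r["predicted_tool"] is None for r in no_tool], no_tool
        ),
    }


records = [
    {"expected_tool": "get_weather", "predicted_tool": "get_weather", "parsed": True},
    {"expected_tool": "get_weather", "predicted_tool": "search", "parsed": True},
    {"expected_tool": None, "predicted_tool": None, "parsed": True},
    {"expected_tool": None, "predicted_tool": "get_weather", "parsed": True},
    {"expected_tool": "get_weather", "predicted_tool": None, "parsed": False},
]
metrics = tool_decision_metrics(records)
print(metrics)
```

Keeping tool-selection and no-tool accuracy as separate numbers matters: a model that calls a tool on every request can score well on the first and badly on the second.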
7. DPO can be a natural next step before RL
If you can build preferred/rejected trajectory pairs, DPO is often simpler than full RL.
TRL’s DPOTrainer supports tool-calling data too: examples can include prompt, chosen, and rejected conversations with tool_calls, tool role messages, and a tools column.
Examples of useful DPO pairs:
| Situation | Chosen | Rejected |
| --- | --- | --- |
| Tool needed | Correct tool call + grounded answer | Hallucinated direct answer |
| Tool not needed | Direct answer | Unnecessary tool call |
| Missing required argument | Clarifying question | Invalid tool call with guessed argument |
| Irrelevant tools only | Explain that available tools are not enough | Force an unrelated tool call |
| Tool result given | Answer grounded in result | Answer ignores result |
| Cost-sensitive task | Minimal sufficient calls | Excessive repeated calls |
| Invalid JSON risk | Parseable/schema-valid call | Malformed call |
This is often a very practical middle ground:
SFT teaches the model the basic behavior.
DPO nudges the model away from bad variants of that behavior.
RL is only needed if you have an executable environment and reliable rewards.
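As an illustration of what one such pair could look like, here is a sketch for the "tool not needed" case, using the prompt / chosen / rejected / tools fields described above. The helper `make_preference_pair` is my own, and I would confirm the exact column layout against the current DPOTrainer docs before building a real dataset.

```python
def make_preference_pair(prompt, chosen, rejected, tools):
    """Bundle one preference example from message lists and tool schemas."""
    if chosen == rejected:
        raise ValueError("chosen and rejected must differ")
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected, "tools": tools}


tools = [{"type": "function", "function": {
    "name": "get_weather",
    "description": "Get a weather forecast for a city and date.",
    "parameters": {"type": "object",
                   "properties": {"city": {"type": "string"},
                                  "date": {"type": "string"}},
                   "required": ["city", "date"]}}}]

pair = make_preference_pair(
    prompt=[{"role": "user", "content": "What is 2 + 2?"}],
    # Chosen: answer directly, because no tool is needed.
    chosen=[{"role": "assistant", "content": "2 + 2 = 4."}],
    # Rejected: an unnecessary and unrelated tool call.
    rejected=[{"role": "assistant", "tool_calls": [
        {"type": "function",
         "function": {"name": "get_weather",
                      "arguments": {"city": "Paris", "date": "today"}}}
    ]}],
    tools=tools,
)
print(sorted(pair))
```

The rejected side is most useful when it is a mistake your SFT model actually makes, so I would mine rejected responses from its real outputs rather than invent them.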
8. When should you use RL / GRPO?
I would only move to RL if you have more than just example trajectories.
You need at least some of the following:
| Requirement | Why it matters |
| --- | --- |
| Executable tools | The model’s tool calls must actually run during rollout |
| Parser | The training loop must parse tool calls from model output |
| Environment state | Multi-turn tool use often changes state |
| Verifier | You need to score success or failure |
| Reward components | Tool selection, arguments, execution, grounding, cost |
| Stable chat template | Tool calls and observations must serialize consistently |
| Initial tool-capable policy | Otherwise RL may not explore useful tool calls |
TRL’s GRPOTrainer supports tools and also an environment_factory mode, where the trainer creates an environment instance per rollout and exposes public methods as tools. TRL’s OpenEnv integration is also relevant if you want environment-backed training.
The important point is that RL is not just “SFT plus a reward function”. You need the full loop:
model generates
→ parser extracts tool call
→ tool/environment executes
→ observation is returned to the model
→ model continues
→ verifier computes rewards
→ policy update happens
If you cannot execute tools during rollout or cannot compute meaningful rewards, I would not start with RL.
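To show how much machinery that loop implies even in the smallest case, here is a toy rollout in plain Python. Everything in it is a stand-in: the policy is a hard-coded function rather than a model, the tool-call wire format (a JSON object with `name` and `arguments`) is a convention I made up for the sketch, and the policy update is left to whatever trainer you use.

```python
import json


def parse_tool_call(text):
    """Return the tool call dict if text is a JSON tool call, else None."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if isinstance(data, dict) and "name" in data and "arguments" in data:
        return data
    return None


def run_rollout(policy, tools, user_message, verifier, max_steps=4):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        output = policy(messages)            # model generates
        call = parse_tool_call(output)       # parser extracts a tool call, if any
        if call is None:                     # plain text means a final answer
            messages.append({"role": "assistant", "content": output})
            break
        messages.append({"role": "assistant", "tool_calls": [call]})
        if call["name"] in tools:            # tool/environment executes
            result = tools[call["name"]](**call["arguments"])
        else:
            result = {"error": "unknown tool"}
        # The observation is returned to the model as a tool message.
        messages.append({"role": "tool", "content": json.dumps(result)})
    return messages, verifier(messages)      # verifier computes the reward


def toy_policy(messages):
    if messages[-1]["role"] == "tool":
        return "It is 13C in Paris."
    return json.dumps({"name": "get_weather", "arguments": {"city": "Paris"}})


def toy_verifier(messages):
    return 1.0 if "13C" in messages[-1].get("content", "") else 0.0


tools = {"get_weather": lambda city: {"city": city, "temp": "13C"}}
messages, reward = run_rollout(toy_policy, tools, "Weather in Paris?", toy_verifier)
print([m["role"] for m in messages], reward)
```

If any piece of this is missing in your setup (no runnable tools, no parser that matches your template, no verifier you trust), that is usually the sign to stay with SFT or DPO for now.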
9. Reward design for tool use should be decomposed
A final-answer-only reward is often too coarse.
The paper ToolRL: Reward is All Tool Learning Needs makes this point directly: tool-use RL is hard because multiple tools and diverse parameters require more fine-grained feedback than simple answer matching.
A useful reward decomposition might be:
| Reward component | Example |
| --- | --- |
| Format reward | Output is parseable as a tool call or final answer |
| Schema reward | Required arguments exist and have correct types |
| Tool selection reward | Correct tool selected |
| Argument semantic reward | Arguments are correct given the conversation |
| Execution reward | Tool executes successfully |
| Grounding reward | Final answer uses the tool observation |
| Final correctness reward | The final answer is correct |
| No-tool reward | Avoids tools when no tool is needed |
| Clarification reward | Asks for missing required information |
| Cost penalty | Penalizes unnecessary tool calls or excessive calls |
Also, beware of overusing tools. Work such as OTC: Optimal Tool Calls via Reinforcement Learning focuses on encouraging accurate answers with fewer tool calls. This matters because a reward that only values final correctness can accidentally teach the model to call tools too often.
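A decomposed reward with a cost term could be sketched like this. The weights and the per-call penalty are arbitrary numbers I picked for illustration, not values from ToolRL or OTC, and the sketch covers only a subset of the components listed above.

```python
from dataclasses import dataclass


@dataclass
class RolloutOutcome:
    parseable: bool        # format reward
    schema_valid: bool     # schema reward
    correct_tool: bool     # tool selection reward
    answer_correct: bool   # final correctness reward
    num_tool_calls: int
    min_calls_needed: int


# Hypothetical weights; tune them for your own task.
WEIGHTS = {"format": 0.1, "schema": 0.1, "tool": 0.2, "answer": 0.6}
COST_PER_EXTRA_CALL = 0.05


def total_reward(o):
    """Weighted sum of the components minus a penalty for extra tool calls."""
    reward = (
        WEIGHTS["format"] * o.parseable
        + WEIGHTS["schema"] * o.schema_valid
        + WEIGHTS["tool"] * o.correct_tool
        + WEIGHTS["answer"] * o.answer_correct
    )
    extra_calls = max(0, o.num_tool_calls - o.min_calls_needed)
    return reward - COST_PER_EXTRA_CALL * extra_calls


efficient = RolloutOutcome(True, True, True, True, num_tool_calls=1, min_calls_needed=1)
wasteful = RolloutOutcome(True, True, True, True, num_tool_calls=4, min_calls_needed=1)
print(total_reward(efficient), total_reward(wasteful))
```

Both rollouts reach the correct answer, but the wasteful one scores lower, which is exactly the signal a final-answer-only reward would not provide.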
10. Suggested practical training path
I would use this staged approach:
| Stage | Do this | Move on when |
| --- | --- | --- |
| 0. Normalize data | Convert raw assistant_think/tool/answer annotations into target chat/tool format | The rendered examples match the target model’s template |
| 1. Mask inspection | Verify which tokens receive loss | Only intended assistant spans are supervised |
| 2. SFT | Train on high-quality trajectories | Format, schema, and basic tool use work |
| 3. Evaluation | Test tool/no-tool, schema, grounding, final correctness | You know the failure modes |
| 4. DPO | Use chosen/rejected pairs for common mistakes | Over-calling, invalid calls, and hallucinations improve |
| 5. RL/GRPO | Only if tools are executable and rewards are reliable | You can run environment-backed rollouts |
In short:
If you have demonstrations:
start with SFT.
If you have good vs bad trajectory pairs:
consider DPO.
If you have executable tools + verifier + reward:
consider GRPO/RL.
If you have none of those:
build evaluation and clean the dataset first.
11. A possible data representation
As an internal raw format, something like this is fine:
    {
      "system": "You are a helpful assistant with tool access.",
      "turns": [
        {
          "user": "What's the weather in Paris tomorrow?",
          "assistant_think": "The user asks for current/future weather, so I need a weather tool.",
          "assistant_tool": {
            "name": "get_weather",
            "arguments": {
              "city": "Paris",
              "date": "tomorrow"
            }
          },
          "tool_result": {
            "forecast": "Light rain, 13C"
          },
          "assistant_answer": "Tomorrow in Paris, expect light rain and about 13°C."
        }
      ]
    }
But before training, I would convert it to a model/tool format closer to:
    {
      "messages": [
        {
          "role": "system",
          "content": "You are a helpful assistant with tool access."
        },
        {
          "role": "user",
          "content": "What's the weather in Paris tomorrow?"
        },
        {
          "role": "assistant",
          "tool_calls": [
            {
              "type": "function",
              "function": {
                "name": "get_weather",
                "arguments": {
                  "city": "Paris",
                  "date": "tomorrow"
                }
              }
            }
          ]
        },
        {
          "role": "tool",
          "content": "{\"forecast\":\"Light rain, 13C\"}"
        },
        {
          "role": "assistant",
          "content": "Tomorrow in Paris, expect light rain and about 13°C."
        }
      ],
      "tools": [
        {
          "type": "function",
          "function": {
            "name": "get_weather",
            "description": "Get a weather forecast for a city and date.",
            "parameters": {
              "type": "object",
              "properties": {
                "city": {
                  "type": "string"
                },
                "date": {
                  "type": "string"
                }
              },
              "required": ["city", "date"]
            }
          }
        }
      ]
    }
The exact schema may differ depending on your model, trainer, and serving stack. The key point is not this exact JSON shape. The key point is that the training format should match the model’s tool-calling chat template.
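A converter between the two shapes can be quite small. The sketch below assumes the raw field names from my example (system, turns, user, assistant_tool, tool_result, assistant_answer) and handles at most one tool call per turn; `convert_example` is a hypothetical helper. It deliberately drops assistant_think, since rendering reasoning is model-specific and should be added only if you decide to supervise it.

```python
import json


def convert_example(raw, tools):
    """Convert one raw annotated trajectory into messages + tools."""
    messages = [{"role": "system", "content": raw["system"]}]
    for turn in raw["turns"]:
        messages.append({"role": "user", "content": turn["user"]})
        if "assistant_tool" in turn:
            call = turn["assistant_tool"]
            messages.append({
                "role": "assistant",
                "tool_calls": [{
                    "type": "function",
                    "function": {"name": call["name"],
                                 "arguments": call["arguments"]},
                }],
            })
            # The observation becomes a tool role message, serialized as JSON text.
            messages.append({"role": "tool",
                             "content": json.dumps(turn["tool_result"])})
        # assistant_think is intentionally not emitted here.
        messages.append({"role": "assistant", "content": turn["assistant_answer"]})
    return {"messages": messages, "tools": tools}


raw = {
    "system": "You are a helpful assistant with tool access.",
    "turns": [{
        "user": "What's the weather in Paris tomorrow?",
        "assistant_think": "I need a weather tool.",
        "assistant_tool": {"name": "get_weather",
                           "arguments": {"city": "Paris", "date": "tomorrow"}},
        "tool_result": {"forecast": "Light rain, 13C"},
        "assistant_answer": "Tomorrow in Paris, expect light rain and about 13°C.",
    }],
}
example = convert_example(raw, tools=[])
print([m["role"] for m in example["messages"]])
```

Turns without an assistant_tool field fall through to a plain assistant answer, so the same converter also covers no-tool and clarification examples.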
12. Final recommendation
So my answer would be:
- Yes, start with SFT if you have correct trajectories.
- Do not train arbitrary custom roles directly unless your target model’s template supports them.
- Convert your annotations into the target tool-call format, usually tool_calls, tool role messages, and tools schemas.
- Mask loss carefully: user/system/tool observations should generally not be supervised; assistant tool calls and final answers should be.
- Inspect the labels, because assistant-only loss depends on the chat template.
- Add no-tool, clarification, and unavailable-tool cases, not only positive tool-call examples.
- Use DPO if you can create chosen/rejected trajectory pairs.
- Use GRPO/RL only when you have executable tools and meaningful rewards.
- Evaluate more than final accuracy: measure format validity, schema validity, tool selection, no-tool behavior, clarification behavior, grounding, final correctness, and tool-call cost.
The practical path is:
SFT first.
DPO if you can create preference pairs.
GRPO/RL only if you can run tools during rollout and compute reliable rewards.
Useful references: