Training a LoRA for LTX-2.3 voice / sound only

Hello guys,

I am kind of stuck at the moment. I am trying to train a voice-only LoRA with Ostris AI Toolkit on a Vast VPS with an RTX 5090. The plan is to train a voice LoRA for my character on its own first, and once that works, to train a video LoRA with character + voice. But I keep getting multiple errors from AI Toolkit. I have 27 clips, each 6-10 seconds long, all well captioned.
The error that appears most often is: RuntimeError: Internal error: Internal Writer Error: Background writer channel closed.
I am not even sure my LoRA training settings are correct.

Thanks in advance for any answers, if some appear lol

job: "extension"
config:
  process:
    - type: "diffusion_trainer"
      training_folder: "/workspace/ai-toolkit/output"
      sqlite_db_path: "./aitk_db.db"
      device: "cuda"
      trigger_word: ""
      performance_log_every: 10
      network:
        type: "lora"
        linear: 32
        linear_alpha: 32
        conv: 16
        conv_alpha: 16
        lokr_full_rank: true
        lokr_factor: -1
        network_kwargs:
          ignore_if_contains:
      save:
        dtype: "bf16"
        save_every: 500
        max_step_saves_to_keep: 4
        save_format: "diffusers"
        push_to_hub: false
      datasets:
        - folder_path: "/workspace/ai-toolkit/datasets/ema_voice"
          mask_path: null
          mask_min_value: 0.1
          default_caption: ""
          caption_ext: "txt"
          caption_dropout_rate: 0.05
          cache_latents_to_disk: true
          is_reg: false
          network_weight: 1
          resolution:
            - 512
          controls:
          shrink_video_to_frames: true
          num_frames: 1
          flip_x: false
          flip_y: false
          num_repeats: 1
          do_i2v: false
          do_audio: true
          fps: 24
          auto_frame_count: true
      train:
        batch_size: 1
        bypass_guidance_embedding: false
        steps: 5000
        gradient_accumulation: 1
        train_unet: true
        train_text_encoder: false
        gradient_checkpointing: true
        noise_scheduler: "flowmatch"
        optimizer: "adamw8bit"
        timestep_type: "weighted"
        content_or_style: "balanced"
        optimizer_params:
          weight_decay: 0.0001
        unload_text_encoder: false
        cache_text_embeddings: false
        lr: 0.0001
        ema_config:
          use_ema: false
          ema_decay: 0.99
        skip_first_sample: false
        force_first_sample: false
        disable_sampling: false
        dtype: "bf16"
        diff_output_preservation: false
        diff_output_preservation_multiplier: 1
        diff_output_preservation_class: "person"
        switch_boundary_every: 1
        loss_type: "mse"
        audio_loss_multiplier: 1
      logging:
        log_every: 1
        use_ui_logger: true
      model:
        name_or_path: "Lightricks/LTX-2.3/ltx-2.3-22b-dev.safetensors"
        quantize: true
        qtype: "qfloat8"
        quantize_te: true
        qtype_te: "qfloat8"
        arch: "ltx2.3"
        low_vram: true
        model_kwargs: {}
        layer_offloading: false
        layer_offloading_text_encoder_percent: 1
        layer_offloading_transformer_percent: 1
      sample:
        sampler: "flowmatch"
        sample_every: 500
        width: 768
        height: 768
        samples:
        neg: ""
        seed: 42
        walk_seed: true
        guidance_scale: 4
        sample_steps: 30
        num_frames: 121
        fps: 24

meta:
  name: "[name]"
  version: "1.0"

Maybe something like this would work:


I think I would first reframe this as an Audio-Video LoRA problem, not a pure “voice-only LoRA” problem.

That does not mean your goal is impossible. It just means I would avoid starting from num_frames: 1 and expecting LTX-2.3 to behave like a TTS / speaker-LoRA system. LTX-2.3 is an audio-video model, and the official training docs describe Audio-Video LoRA as a LoRA that can affect both video and audio output.

Short answer

I would try this order:

  1. First make a normal short Audio-Video LoRA work.
  2. Use real temporal video frames, not num_frames: 1.
  3. Preprocess with audio enabled and verify the decoded audio latents before training long runs.
  4. Use a non-empty trigger word.
  5. Put the exact transcript, voice style, and sound description in the captions.
  6. Check that the inference workflow actually loads the audio-related LoRA keys.
  7. Only after that works, experiment with making the training more voice-focused.

If your practical goal is simply “I want this character to speak with a consistent voice,” also look at ID-LoRA Reference Audio as a related alternative. That is not the same as training your own AV-LoRA, but it may solve the consistent-voice use case faster.

Why I would not start with num_frames: 1

I understand why you set it that way: you want to isolate the voice or sound and avoid learning the visual character yet.

But for LTX-2.3, I think num_frames: 1 is suspicious as a first baseline.

The LTX-2.3 model card describes LTX-2.3 as a DiT-based audio-video foundation model designed to generate synchronized video and audio within a single model. The LTX-2 repository also describes LTX-2 as an audio-video model for synchronized audio and video generation.

The LTX-2 paper is also useful context: it describes LTX-2 as a dual-stream audio-video model, with video and audio streams connected by bidirectional audio-video cross-attention. In other words, the model is not just a voice model with a video model attached afterward.

So I would not remove almost all temporal video information for the first test. You may be removing the audio-video relationship that the model expects to learn.

In the dataset preparation docs, F=1 is mainly discussed in the image-dataset path, while video buckets are described as width × height × frames. For video, the frame count has to follow the LTX VAE constraints. The docs list the frame rule as:

frames % 8 == 1

So for short AV-LoRA tests I would start with something like:

512x512x49
512x512x73
512x512x89
576x576x89

not 1 frame.
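
To make the frame rule concrete, here is a minimal Python sketch that checks frames % 8 == 1 and finds the largest valid frame count that fits inside a clip. The function names and the 24 fps default are my own choices for illustration, not part of the LTX tooling, and whether your trainer trims clips this way is an assumption you should verify.

```python
def is_valid_frame_count(frames: int) -> bool:
    """True when the frame count satisfies the documented rule frames % 8 == 1."""
    return frames >= 1 and frames % 8 == 1


def largest_valid_frame_count(frames: int) -> int:
    """Largest count <= frames that satisfies frames % 8 == 1."""
    if frames < 1:
        raise ValueError("need at least one frame")
    return ((frames - 1) // 8) * 8 + 1


def frames_for_clip(seconds: float, fps: int = 24) -> int:
    """Largest valid frame count that fits in a clip of the given length.

    The 24 fps default mirrors the fps value in the posted config.
    """
    return largest_valid_frame_count(int(seconds * fps))


for seconds in (2, 3, 4, 5):
    print(f"{seconds}s clip -> {frames_for_clip(seconds)} frames")
```

A 4-second clip at 24 fps has 96 frames, so the largest valid count is 89, which matches the 512x512x89 bucket. Note that 1 also passes the rule, which is why the config did not fail on it; that is the image-dataset path mentioned above.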

I am not saying audio-focused experiments are impossible. I am saying I would first make a standard short Audio-Video LoRA work, then try to bias it toward voice/audio.

Treat it as Audio-Video LoRA first

The official Training Modes / Audio-Video LoRA docs say that LTX-2 supports joint audio-video generation and that you can train LoRA adapters that affect both video and audio output.

The same docs show the important pieces:

model:
  training_mode: "lora"

training_strategy:
  name: "text_to_video"
  with_audio: true

data:
  audio_latents_dir: "audio_latents"

The key idea is: enabling audio is not just “turn on voice.” The dataset must actually include preprocessed audio latents, and the LoRA target modules need to include audio and cross-modal branches.

The docs also warn that for audio-video LoRAs, target_modules should capture:

  • video attention modules
  • audio attention modules
  • audio-to-video attention modules
  • video-to-audio attention modules

That is why they recommend broader patterns like:

target_modules:
  - "to_k"
  - "to_q"
  - "to_v"
  - "to_out.0"

instead of overly narrow patterns such as attn1.to_k.

The configuration reference is also worth reading for this, because it explains that LTX-2 has video-only modules, audio-only modules, and audio-video cross-attention modules. For AV-LoRA, I would verify that the training config is actually touching the audio and cross-modal parts.

I would compare your config against the official ltx2_av_lora.yaml.

Separate the runtime error from the training design

The Background writer channel closed error may be a separate issue from the LoRA recipe.

There is a Hugging Face Xet issue about OS-level I/O errors, such as disk-full conditions, surfacing as a generic error like:

RuntimeError: Data processing error: File reconstruction error: Internal Writer Error: Background writer channel closed

See huggingface/xet-core #763.

So I would debug two things separately:

  1. Runtime / cache / disk / download / I/O issue
  2. Audio-Video LoRA training recipe issue

For the runtime side, I would check:

df -h
du -sh ~/.cache/huggingface || true
du -sh /workspace || true
du -sh ./output || true

Also check Hugging Face cache location. The Hugging Face cache docs explain the hub cache layout and environment variables such as HF_HOME / HF_HUB_CACHE.

If you suspect Xet/caching issues, it may be worth testing with:

export HF_HUB_DISABLE_XET=1

But I would treat that as runtime debugging, not as proof that the LoRA method itself is wrong.
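
If you would rather run the disk check from Python (for example at the top of a launch script), a minimal sketch could look like this. It only uses the standard library. The 50 GiB threshold and the list of paths are assumptions on my part; pick values that fit your checkpoint and cache sizes.

```python
import shutil
from pathlib import Path


def free_gib(path) -> float:
    """Free space in GiB on the filesystem that holds `path`."""
    return shutil.disk_usage(Path(path).expanduser()).free / 2**30


def low_space_paths(paths, min_free_gib: float = 50.0):
    """Return (path, free GiB) for every existing path below the threshold.

    The 50 GiB default is an arbitrary assumption, not a documented requirement.
    """
    low = []
    for p in paths:
        if not Path(p).expanduser().exists():
            continue  # skip locations that are not present on this machine
        free = free_gib(p)
        if free < min_free_gib:
            low.append((str(p), round(free, 1)))
    return low


# Paths taken from the shell commands above; adjust to your setup.
for path, free in low_space_paths(["~/.cache/huggingface", "/workspace", "./output"]):
    print(f"low disk space: {path} has {free} GiB free")
```

If this prints anything right before the crash, I would fix the disk situation first and only then look at the training recipe.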

Preprocess checks I would do before any long run

Before training for thousands of steps, I would first verify the preprocessed dataset.

The LTX dataset preparation docs mention audio preprocessing with --with-audio. For AV-LoRA, make sure the dataset really has:

latents/
conditions/
audio_latents/
captions/

I would also use the decode/debug path from the same docs. The docs describe --decode, which saves decoded video and, when audio preprocessing is enabled, decoded audio under something like:

.precomputed/decoded_audio

That is a very useful check.

If the decoded precomputed audio already sounds bad, then the problem is probably preprocessing, source files, cache, or audio latents — not LoRA learning.

Also, if you change model checkpoint, resolution bucket, text encoder, trigger word, or preprocessing parameters, rerun preprocessing with overwrite. The docs mention that changing preprocessing settings without --overwrite can leave stale cached outputs.

Something like this is the kind of check I would want before a long training run:

# Pseudocode / adapt paths to your trainer setup
python process_dataset.py \
  --input_dir <dataset_dir> \
  --output_dir <precomputed_dir> \
  --resolution_buckets 512x512x49 512x512x89 \
  --with-audio \
  --decode \
  --overwrite

Then listen to the decoded audio before training.
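
Listening is the real test, but a quick structural check can catch a missing or half-written audio_latents/ folder first. This is a rough sketch using the directory names listed above. The assumption that latents/ and audio_latents/ should hold the same number of files is mine, so treat a mismatch as a hint rather than a definite error.

```python
from pathlib import Path

# Directory names taken from the list above.
EXPECTED_DIRS = ("latents", "conditions", "audio_latents", "captions")


def summarize_precomputed(root):
    """Map each expected directory to its file count, or None if it is missing."""
    root = Path(root)
    summary = {}
    for name in EXPECTED_DIRS:
        folder = root / name
        if folder.is_dir():
            summary[name] = sum(1 for f in folder.rglob("*") if f.is_file())
        else:
            summary[name] = None
    return summary


def find_problems(summary):
    """Return human-readable warnings for missing, empty, or mismatched folders."""
    problems = []
    for name, count in summary.items():
        if count is None:
            problems.append(f"missing directory: {name}/")
        elif count == 0:
            problems.append(f"empty directory: {name}/")
    video, audio = summary.get("latents"), summary.get("audio_latents")
    # Assumption: one audio latent per video latent.
    if video is not None and audio is not None and video != audio:
        problems.append(f"latents/ has {video} files but audio_latents/ has {audio}")
    return problems
```

Usage would be something like find_problems(summarize_precomputed("<precomputed_dir>")), with the same placeholder path as in the command above.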

Dataset suggestions

For a first successful AV-LoRA test, I would make the dataset boring and clean.

I would not start with 6–10 second clips if the goal is debugging. I would cut some clips down to around 3–5 seconds, ideally one clear spoken line per clip.

Recommended first-pass dataset:

| Item | Recommendation |
|---|---|
| Clip length | 3–5 seconds first |
| Audio | single speaker, clean, low noise, low reverb |
| Music | avoid music at first |
| Background sound | avoid or describe it explicitly |
| Video | visible face / mouth / speaker motion if it is speech |
| Frames | 49 or 89 for first tests |
| Trigger | non-empty unique trigger |
| Captions | transcript + voice style + sound description + visual description |

Example caption:

<trigger>, a young woman speaks in a soft, calm voice in a quiet indoor room. She looks toward the camera with a neutral expression. Speech: "I think we should start again from the beginning." Sounds: clear female speech, quiet room tone, no music.

I would avoid an empty trigger word. The dataset preparation docs describe a LoRA trigger token as being prepended to captions and then used in prompts to activate the LoRA. So I would use something unique, for example:

ema_voice

or:

ltx_ema_voice

Then keep that same trigger in validation prompts.
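
With 27 captions it is easy to miss one that lacks the trigger or the transcript. Here is a small sketch that checks a caption against the pattern in the example above. The "Speech:" and "Sounds:" labels are just the convention from my example caption, not something the trainer requires, and the function name is made up for illustration.

```python
import re


def caption_problems(caption: str, trigger: str):
    """Return a list of issues for one caption; an empty list means it looks fine."""
    problems = []
    text = caption.strip()
    if not trigger:
        problems.append("trigger word is empty")
    elif not text.startswith(trigger):
        problems.append("caption does not start with the trigger")
    # Expect a quoted transcript such as: Speech: "Hello there."
    if not re.search(r'Speech:\s*"[^"]+"', text):
        problems.append('no quoted transcript after "Speech:"')
    if "Sounds:" not in text:
        problems.append('no "Sounds:" description')
    return problems


example = (
    'ema_voice, a young woman speaks in a soft, calm voice in a quiet indoor room. '
    'Speech: "I think we should start again from the beginning." '
    'Sounds: clear female speech, quiet room tone, no music.'
)
print(caption_problems(example, "ema_voice"))
```

To use it on a dataset folder, loop over the .txt caption files (caption_ext is "txt" in the posted config) and print the file name for every non-empty result.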

Suggested first experiment

I would not start with the full 5000-step run.

I would first do a small sanity test to prove the whole AV path works:

# Not a full config, just the direction I would test first.
model:
  training_mode: "lora"

training_strategy:
  name: "text_to_video"
  with_audio: true

data:
  audio_latents_dir: "audio_latents"

network:
  type: "lora"
  rank: 32
  alpha: 32
  target_modules:
    - "to_k"
    - "to_q"
    - "to_v"
    - "to_out.0"

train:
  batch_size: 1
  gradient_checkpointing: true

resolution_buckets:
  - "512x512x49"
  - "512x512x89"

For debugging, I would try something like:

small dataset subset
300–800 steps
same validation prompt
same seed
save several checkpoints
compare audio and video separately

Then scale up only after you can confirm:

  • the trainer runs
  • audio latents decode correctly
  • the LoRA changes the audio output
  • the inference workflow loads the audio-related keys
  • the result is not immediately overcooked

Known failure modes worth checking

There are already some reports that look related, especially around AI Toolkit and LTX audio training.

1. Good video, poor voice/audio

See ostris/ai-toolkit #684: the report says LTX-2 LoRA training produced good image/video quality, but the voice/audio became distorted and noisy, even with clean audio and do_audio enabled.

So if the video works but audio is bad, that is not necessarily your dataset alone.

2. LTX-2.3 LoRA corrupting audio

See ostris/ai-toolkit #780: the report says the video output is correct after LTX-2.3 LoRA training, but the audio is corrupted with buzzing/noise/distortion, while the base model without LoRA has correct audio.

That suggests you should test base model audio, LoRA-disabled audio, and LoRA-enabled audio separately.
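
Your ears are the main tool for that comparison, but a couple of numbers can help spot obvious buzzing or clipping across checkpoints. The sketch below computes RMS level and peak for a WAV file with the Python standard library. It assumes you have already extracted each output's audio track to 16-bit PCM WAV yourself, and it only gives a crude loudness indication, not a quality metric.

```python
import array
import math
import sys
import wave


def wav_stats(path):
    """Return (rms, peak), each scaled to 0..1, for a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM WAV files")
        raw = wav.readframes(wav.getnframes())
    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    if not samples:
        return 0.0, 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / 32768.0
    peak = max(abs(s) for s in samples) / 32768.0
    return rms, peak
```

I would run it on three files generated with the same prompt and seed (base model, LoRA disabled, LoRA enabled). A peak pinned at 1.0 or a large jump in RMS only with the LoRA enabled would point at the LoRA rather than the base model.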

3. Trainer / workflow differences

See ostris/ai-toolkit #701: this report says the same dataset behaved differently between Musubi and AI Toolkit, with Musubi picking up voice but AI Toolkit not doing so.

So if the config looks right but the audio is ignored, I would not only blame the dataset. I would also check the trainer and inference path.

4. LoRA keys not loaded at inference time

See bghira/SimpleTuner #2349: there are logs where LTX-2 LoRA keys such as audio_connector / video_connector keys were not loaded in ComfyUI.

This is important. You can train an AV-LoRA correctly and still get misleading results if your inference workflow does not actually load the audio-related LoRA keys.

After loading the LoRA, check logs for things like:

audio_connector
video_connector
audio_attn
video_to_audio_attn
audio_to_video_attn
lora key not loaded
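
Before blaming the loader, it is also worth confirming that the saved LoRA file contains audio-related tensors at all. A .safetensors file starts with an 8-byte little-endian header length followed by a JSON header, so the tensor names can be listed without loading any weights. This is a sketch: the substrings are the same ones as in the list above, and the real key names depend on the trainer, so treat them as guesses.

```python
import json
import struct


def read_safetensors_keys(path):
    """List tensor names from a .safetensors header without loading the weights."""
    with open(path, "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    return [name for name in header if name != "__metadata__"]


def count_matches(keys, substrings):
    """Count how many tensor names contain each substring.

    Counts can overlap: "video_to_audio_attn" also contains "audio_attn".
    """
    return {s: sum(1 for k in keys if s in k) for s in substrings}


# Substrings from the log checklist above; actual names may differ per trainer.
AUDIO_HINTS = ["audio_connector", "audio_attn", "video_to_audio_attn", "audio_to_video_attn"]
```

Usage would be count_matches(read_safetensors_keys("<your_lora>.safetensors"), AUDIO_HINTS). If every count is zero, the training run probably never touched the audio branch, and no inference workflow will fix that.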

About “voice-only LoRA”

I would be careful with the term “voice-only LoRA” here.

If by “voice-only LoRA” you mean:

I want a reusable speaker identity LoRA, like a TTS speaker LoRA, independent of video.

then I am not sure that is the easiest or most supported route for LTX-2.3.

If by “voice-only LoRA” you mean:

I want the generated character to consistently speak with this kind of voice / tone / sound.

then I would first try:

  1. normal short Audio-Video LoRA, or
  2. ID-LoRA Reference Audio, depending on whether you want training or inference-time control.

For the actual AV-LoRA training path, I would not try to eliminate the video side at first. I would instead use short, clean audio-video clips and captions that make the audio content explicit.

Related alternative: ID-LoRA Reference Audio

This is not the same thing as training your own Audio-Video LoRA, but it may be very relevant to your practical goal.

If the goal is:

“I want this character to speak with a consistent voice.”

then look at ID-LoRA / Reference Audio workflows.

The ID-LoRA GitHub repo describes using a reference image / first frame, a short reference audio clip, and a text prompt for identity-preserving talking video generation. It specifically mentions voice identity transfer from short reference audio and zero-shot inference without per-speaker fine-tuning.

There is also ID-LoRA-LTX2.3-ComfyUI, which mentions LTXVReferenceAudio and reference-audio speaker identity transfer.

This Kijai / RuneXX Hugging Face discussion is also useful because it describes a ComfyUI workflow using a short reference audio clip, around 5 seconds, for consistent voice.

That route is different:

  • AV-LoRA training: learn from your dataset into a LoRA.
  • ID-LoRA Reference Audio: provide a short reference voice at inference time.

So I would not replace your whole AV-LoRA experiment with ID-LoRA if your goal is training. But if your real goal is just consistent character voice, ID-LoRA may solve it with less training pain.

What I would try next

I would probably do this:

Step 1: Fix / isolate the runtime error

Check disk, cache, and Xet/HF download behavior.

df -h
du -sh ~/.cache/huggingface || true
du -sh /workspace || true
du -sh ./output || true

If needed, test:

export HF_HUB_DISABLE_XET=1

Step 2: Make a tiny AV dataset

Use maybe 5–10 clips first.

3–5 sec each
clean single-speaker audio
visible face/mouth if speech
no music
no heavy background noise

Step 3: Use normal temporal buckets

Do not use num_frames: 1 for the first baseline.

Try:

512x512x49
512x512x89

Step 4: Preprocess with audio and decode

Make sure audio_latents/ exists.

Then decode and listen to the decoded audio latents.

Step 5: Use transcript-rich captions

Example:

<trigger>, a young woman speaks in a soft calm voice in a quiet indoor room. Speech: "I think we should start again from the beginning." Sounds: clear female speech, quiet room tone, no music.

Step 6: Train short first

Do not spend 5000 steps before proving the setup.

Try a shorter run first:

300–800 steps
same validation prompt
same seed
save several checkpoints

Step 7: Verify inference loading

Check whether audio-related LoRA keys are loaded.

If the loader ignores the audio branch, the generated audio may not tell you what the training actually learned.

TL;DR

I would not start by trying to train a “voice-only” LoRA with num_frames: 1.

I would first make a normal short Audio-Video LoRA work:

real video frames
with_audio: true
audio_latents/
non-empty trigger
transcript-rich captions
decoded audio-latent verification
audio/video/cross-modal target modules
inference log checks

Then, after that baseline works, experiment with making it more voice-focused.

And if the practical goal is simply consistent character voice, I would also test ID-LoRA Reference Audio, because it may solve that use case without needing to train a separate voice-only LoRA.

Thank you for taking the time to answer my problem. I really appreciate it. Tonight I will look into it.

​[OPEN SOURCE RELEASE & ARCHITECTURE COLLABORATION CALL]

​After four years of isolated, independent development, I am transitioning my proprietary architecture into the open-source ecosystem under the banner of Ferrell Synthetic Intelligence (FSI).

Today, I have publicly deployed the foundational codebases for my first two developments: Vitalis_Core and FSI-Vitalis-CyberCore.

​The Architecture

​This is not a generic API wrapper or a third-party LLM orchestration layer. This is an original, blank-slate synthetic intelligence framework engineered to operate entirely locally on edge hardware with absolute data sovereignty.

​Asynchronous Processing: Powered by a persistent, threaded system heartbeat loop that monitors state independent of user prompt interaction.

​Kernel Integration: Bridges directly to the Linux kernel space using custom C-modules, ioctl handles, and procfs/netlink communication pipelines to ensure low-level system awareness and integrity shielding.

​Blank-Slate Design: The framework provides the structural plumbing, memory manager, and system hooks. It contains no pre-baked corporate biases—it is designed to be fully trained, personalized, and directed by the individual deployment engineer.

​The Objective & Call to Collaboration

​I have scaled these frameworks as far as possible as a solo developer. To execute the next phase of development, I require technical assets to help test, run, refine, and optimize the codebase.

​I am targeting two objectives with this post:

​Codebase Auditing: I need experienced systems developers, Linux engineers, and local AI enthusiasts to clone the repositories, compile the C-infrastructure, run the loops, and provide objective, performance-driven feedback.

​Core Collaborators (Exactly 5): I am selecting a core group of five engineering partners to collaborate on the ongoing optimization of this open-source stack, as well as to assist in the development of my two remaining private, stealth projects (Project Lorein and Project Jedi Order).

​Repositories

​Vitalis_Core: FerrellSyntheticIntelligence/Vitalis_Core · Hugging Face

​FSI-Vitalis-CyberCore: FerrellSyntheticIntelligence/FSI-Vitalis-CyberCore · Hugging Face

​Review the repository architecture, inspect the files, and run the entry points on your local environments.

​For technical feedback, drop your optimization metrics below. If you have the specific system-level engineering experience required to scale this ecosystem and want to fill one of the 5 collaborator slots, DM me directly with your technical background and documentation of your relevant stack experience.

​— Neuro_Nomad

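LTX-2.3 is a DiT-based audio-video foundation model in which audio and video are tightly coupled through bidirectional cross-attention, so you cannot simply "isolate" voice training with num_frames: 1.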


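Why I recommend the LoRA AI platform

LoRA AI offers several trainers that match this scenario well:

| Trainer | Relevance to this scenario |
|---|---|
| WAN 2.2 Video LoRA Trainer | Supports motion patterns and video-compatible image generation; suited to AV-LoRA |
| Flux Dev LoRA Trainer | Suited to character and person consistency training |
| Z-Image LoRA Trainer | Very fast training; suited to the debugging stage |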

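Step-by-step suggestions

Step 1: Troubleshoot the runtime error (separately from the LoRA recipe)

The Background writer channel closed error is usually a disk / cache / I/O problem and is unrelated to the LoRA training recipe: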

df -h
du -sh ~/.cache/huggingface
du -sh /workspace
du -sh ./output

# If you suspect an Xet cache problem
export HF_HUB_DISABLE_XET=1

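Step 2: Prepare a clean AV dataset

| Parameter | Recommended value |
|---|---|
| Clip length | 3–5 seconds (debugging stage) |
| Audio quality | single speaker, low noise, low reverb |
| Video content | visible face / mouth movement |
| Background music | avoid (at first) |
| Frame count | 49 or 89 (follows frames % 8 == 1) |
| Trigger word | non-empty unique word, e.g. ema_voice |

Example caption: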

ema_voice, a young woman speaks in a soft, calm voice in a quiet 
indoor room. Speech: "I think we should start again from the beginning." 
Sounds: clear female speech, quiet room tone, no music.

Step 3: Use proper frame-count buckets (do not use num_frames: 1)

resolution_buckets:
  - "512x512x49"
  - "512x512x89"

Step 4: Enable audio and verify preprocessing

Make sure the dataset directory structure contains:

latents/
conditions/
audio_latents/   ← must exist!
captions/

Preprocessing command (with decode verification):

python process_dataset.py \
  --input_dir <dataset_dir> \
  --output_dir <precomputed_dir> \
  --resolution_buckets 512x512x49 512x512x89 \
  --with-audio \
  --decode \
  --overwrite

Always listen to the decoded audio before training to confirm that the audio latents are correct.


Step 5: Use a correct training configuration

model:
  training_mode: "lora"

training_strategy:
  name: "text_to_video"
  with_audio: true

data:
  audio_latents_dir: "audio_latents"

network:
  type: "lora"
  rank: 32
  alpha: 32
  target_modules:       # must cover the audio-video cross-attention modules
    - "to_k"
    - "to_q"
    - "to_v"
    - "to_out.0"

train:
  batch_size: 1
  gradient_checkpointing: true

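Step 6: Run a small experiment first, then scale up

Dataset: 5–10 clips
Training steps: 300–800 (validate first, then run 5000 steps)
Save several checkpoints
Use the same validation prompt and the same seed
Compare separately: base model audio vs LoRA-disabled audio vs LoRA-enabled audio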

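Step 7: Verify at inference time that the LoRA keys are loaded

After loading the LoRA, check the logs and confirm that the following keys were loaded correctly: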

audio_connector
video_connector
audio_attn
video_to_audio_attn
audio_to_video_attn

Warning: if the audio-related keys are not loaded at inference time, the training result will not show up in the audio output.


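Known failure modes at a glance

| Symptom | Possible cause |
|---|---|
| Video is fine, audio is distorted / noisy | target_modules does not cover the audio branch |
| LoRA has no effect on audio at all | audio LoRA keys not loaded at inference time |
| Results differ a lot between trainers | trainer implementation differences (e.g. Musubi vs AI Toolkit) |
| Training crashes | disk full / cache problem, unrelated to the LoRA recipe |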

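If the goal is only a consistent character voice

Consider ID-LoRA Reference Audio (you supply reference audio at inference time), which transfers voice identity without any training:

Provide a reference audio clip of about 5 seconds, and you get a consistent character voice at inference time with no extra training.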