I looked into this a bit. Depending on your VRAM, I think there are a few workable routes. (Long version here.)
Short version
I would not treat this as one single “image-to-image” problem.
It is really a bundle of smaller tasks:
- product cutout / masking
- clean packshot generation
- background replacement
- floor and contact-shadow generation
- relighting / color matching
- product placement into a lifestyle scene
- hand-object interaction if a person is holding the product
- logo / label / text / hardware / stitching preservation
- low-VRAM execution
- product-identity QA
The safest rule is:
If SKU identity matters, do not regenerate the whole product unless you absolutely have to.
For e-commerce/product photography, a beautiful image can still be a failure if the product is no longer the same product. I would usually start from workflows that preserve the original product pixels and only generate the background, floor, shadow, lighting, or small boundary/contact areas around it.
A useful mental model is:
protect product pixels
generate or edit background pixels
add floor/contact shadow
relight or color-match
composite original product back if needed
verify product identity
This is also why I would separate “phone shot to studio packshot” from “person holding the product.” The latter is not ordinary product placement; it is hand-object interaction with occlusion.
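The "composite original product back" step in the mental model above can be sketched as a per-pixel blend. This is only a minimal illustration, assuming images are flat lists of RGB tuples and the mask is a list of floats in [0, 1] where 1 means "product". The name `composite_back` is hypothetical; in practice you would do the same thing vectorized in NumPy, Pillow, or a ComfyUI composite node.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def composite_back(original: List[Pixel],
                   generated: List[Pixel],
                   mask: List[float]) -> List[Pixel]:
    """Blend original product pixels over generated pixels.

    mask[i] == 1.0 keeps the original pixel, 0.0 keeps the generated
    pixel, and values in between feather the edge.
    """
    if not (len(original) == len(generated) == len(mask)):
        raise ValueError("original, generated, and mask must have the same length")
    out = []
    for orig_px, gen_px, m in zip(original, generated, mask):
        m = min(1.0, max(0.0, m))  # clamp so a bad mask cannot overshoot
        out.append(tuple(round(m * o + (1.0 - m) * g)
                         for o, g in zip(orig_px, gen_px)))
    return out


if __name__ == "__main__":
    product = [(200, 100, 50), (10, 10, 10)]
    studio = [(0, 0, 0), (255, 255, 255)]
    # First pixel belongs to the product, second to the background.
    print(composite_back(product, studio, [1.0, 0.0]))
```

Wherever the mask is 1, the output is bit-for-bit the original product pixel, which is the whole point of this route.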
1. First principle: prompt the scene, not the SKU
For background replacement, do not over-describe the product itself. If you prompt “a brown leather handbag with gold zipper and braided handle,” the model may try to recreate a plausible brown handbag instead of preserving the exact one.
A better prompt usually describes:
- where the product is grounded
- the studio/background scene
- the floor/surface
- lighting style
- camera/product-photography style
Shopify’s old SDXL background-replacement Space has a very useful prompting rule in this direction: do not describe the product; describe grounding, scene, and style. See Shopify/background-replacement.
So my shorthand would be:
Prompt the scene, not the SKU.
Example:
clean white ecommerce studio background, product standing on a matte white surface, soft diffused studio lighting, subtle realistic contact shadow, catalog product photography
Not:
brown leather handbag with gold zipper, braided handle, front logo, side stitching
The second prompt may encourage the model to redraw the product.
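One way to keep yourself honest about "prompt the scene, not the SKU" is to build the prompt from scene fields only and lint it for product descriptors. This is a rough sketch; `build_scene_prompt`, `find_sku_terms`, and the term list are all hypothetical and you would supply your own vocabulary per catalog.

```python
from typing import List


def build_scene_prompt(scene: str, grounding: str, lighting: str, style: str) -> str:
    """Join scene-level fields into one prompt; there is no product field on purpose."""
    parts = [scene, grounding, lighting, style]
    return ", ".join(p.strip() for p in parts if p.strip())


def find_sku_terms(prompt: str, sku_terms: List[str]) -> List[str]:
    """Return the product-descriptor terms that leaked into the prompt."""
    lowered = prompt.lower()
    return [term for term in sku_terms if term.lower() in lowered]


if __name__ == "__main__":
    # Hypothetical descriptor list for a handbag catalog.
    terms = ["handbag", "zipper", "logo", "stitching"]
    good = build_scene_prompt(
        scene="clean white ecommerce studio background",
        grounding="product standing on a matte white surface",
        lighting="soft diffused studio lighting",
        style="catalog product photography",
    )
    bad = "brown leather handbag with gold zipper, braided handle, front logo, side stitching"
    print(find_sku_terms(good, terms))  # no leaked descriptors
    print(find_sku_terms(bad, terms))   # every descriptor leaked
```

The substring check is deliberately crude. It is a reminder for the person writing prompts, not a filter you should trust blindly.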
2. Workflow options
W0. Cutout + composite + shadow
This is the safest baseline for a clean packshot.
input product photo
-> background removal / segmentation
-> original product cutout with alpha
-> white or light gray canvas
-> scale and center
-> synthetic or inpainted contact shadow
-> optional relighting / color match
-> final QA
This is not flashy, but it preserves SKU identity better than almost any full-image generation workflow. Use this as the control group before testing Flux/Qwen/Kontext workflows.
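The "scale and center" step is plain arithmetic. Here is a small sketch, assuming you already know the cutout's bounding-box size; `fit_and_center` and the `fill` fraction (how much of the canvas the product may occupy) are names and defaults I made up for illustration.

```python
from typing import Tuple


def fit_and_center(product_w: int, product_h: int,
                   canvas_w: int, canvas_h: int,
                   fill: float = 0.8) -> Tuple[float, Tuple[int, int], Tuple[int, int]]:
    """Return (scale, (new_w, new_h), (x, y)) for a centered, aspect-preserving fit."""
    if min(product_w, product_h, canvas_w, canvas_h) <= 0:
        raise ValueError("all dimensions must be positive")
    if not 0.0 < fill <= 1.0:
        raise ValueError("fill must be in (0, 1]")
    # Use the tighter of the two axes so the product never overflows the canvas.
    scale = min(canvas_w * fill / product_w, canvas_h * fill / product_h)
    new_w = max(1, round(product_w * scale))
    new_h = max(1, round(product_h * scale))
    x = (canvas_w - new_w) // 2
    y = (canvas_h - new_h) // 2
    return scale, (new_w, new_h), (x, y)


if __name__ == "__main__":
    scale, size, offset = fit_and_center(400, 200, 1000, 1000, fill=0.5)
    print(scale, size, offset)
```

Using the same `fill` value across a batch keeps products visually consistent from listing to listing.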
Useful parts:
- BRIA RMBG-2.0 for background removal
- SAM2 for promptable segmentation
- Segment Anything if you need interactive masks
- Pillow/OpenCV/Photoshop/ComfyUI nodes for compositing
Best for:
- white-background product listing
- exact product shape/color/material
- low VRAM
- batch packshots
Main failure modes:
- halo around edges
- no contact shadow
- product looks pasted
- white product on white background loses shape
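For the "no contact shadow" and "product looks pasted" failures, a purely synthetic shadow is often enough on a white canvas. The sketch below builds a soft elliptical opacity map with a Gaussian falloff. Everything here is an assumption for illustration: the helper name `contact_shadow_alpha`, the falloff constant, and the default opacity are not taken from any particular tool.

```python
import math
from typing import List


def contact_shadow_alpha(width: int, height: int,
                         cx: float, cy: float,
                         rx: float, ry: float,
                         max_opacity: float = 0.35) -> List[List[float]]:
    """Return a height x width grid of shadow opacities in [0, max_opacity].

    (cx, cy) is the shadow center, usually just under the product base;
    rx and ry control how wide and how tall the soft ellipse is.
    """
    if rx <= 0 or ry <= 0:
        raise ValueError("rx and ry must be positive")
    rows = []
    for y in range(height):
        row = []
        for x in range(width):
            # Normalized squared distance from the ellipse center.
            d2 = ((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2
            row.append(max_opacity * math.exp(-2.0 * d2))
        rows.append(row)
    return rows


if __name__ == "__main__":
    alpha = contact_shadow_alpha(9, 5, cx=4, cy=2, rx=3, ry=1)
    for row in alpha:
        print(" ".join(f"{a:.2f}" for a in row))
```

To apply it, darken each canvas pixel by `(1 - alpha)` before compositing the product on top, so the shadow sits under the product rather than over it.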
W1. SDXL background-only inpaint
SDXL is not the newest or strongest editor, but it is still a useful low-VRAM baseline.
Use it for:
- background-only inpainting
- floor/contact shadow experiments
- quick packshot tests
- comparison baseline before heavier Flux/Qwen workflows
The important point is to protect the product.
input photo
-> product mask
-> invert mask or protect product region
-> inpaint only background/floor/shadow
-> composite original product back
-> QA product crop
Do not ask SDXL to redraw the product if exact identity matters. It may produce a similar-looking product with changed hardware, stitching, logo, label, color, or proportions.
W2. Flux Fill for background/floor/shadow
FLUX.1 Fill dev fits this task well, but I would frame it as a masked-completion component, not a one-click product-photography solution.
Good use:
protect original product
mask background/floor/shadow area
Flux Fill generates only the missing background/floor/shadow
composite original product back if needed
relight / blend / QA
It is promising for:
- replacing messy phone-shot backgrounds
- adding studio floors
- extending canvas/outpainting
- creating more natural shadows around a protected product
But product-background swap quality depends heavily on:
- mask precision
- mask expansion/blur
- contact shadow
- relighting
- final blending
- whether you composite the original product back
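Mask inversion and mask expansion are easy to get backwards, so here is a tiny pure-Python sketch of both. It assumes binary masks as nested lists and the common convention that 1 marks the selected pixels; check which convention your node or pipeline actually uses. `dilate` and `invert` are illustrative helpers, not functions from a specific library.

```python
from typing import List

Mask = List[List[int]]  # 1 = selected pixel, 0 = unselected


def dilate(mask: Mask, radius: int = 1) -> Mask:
    """Grow the selected region by `radius` pixels (square neighborhood)."""
    if not mask or not mask[0]:
        return []
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            out[y][x] = int(any(mask[j][i]
                                for j in range(y0, y1)
                                for i in range(x0, x1)))
    return out


def invert(mask: Mask) -> Mask:
    """Swap selected and unselected pixels."""
    return [[1 - v for v in row] for row in mask]


if __name__ == "__main__":
    product = [[0] * 5 for _ in range(5)]
    product[2][2] = 1
    # Protect a 1-pixel safety margin around the product, repaint the rest.
    repaint = invert(dilate(product, radius=1))
    for row in repaint:
        print(row)
```

Which direction to expand is a tuning choice: growing the protected region keeps the model away from product edges, while growing the repaint region lets the model clean up halos at the cost of touching edge pixels.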
Low-VRAM users may need GGUF or NF4 quantization, CPU offload, or custom nodes.
W3. Flux Kontext direct edit
FLUX.1 Kontext dev is probably one of the strongest local candidates for direct “phone shot → studio shot” editing. The model card describes image editing from text instructions, object/style/character reference, and successive edits with minimal visual drift.
Test it like this:
input product photo
-> Flux Kontext
-> prompt: turn this into a professional ecommerce studio product photo
-> output
-> product identity QA
However, for strict e-commerce use, I would not trust the direct output blindly. A direct edit may look excellent while quietly changing:
- silhouette
- color
- leather/fabric texture
- zipper or buckle shape
- logo
- label text
- handle length
- stitching
- product proportions
For serious use, compare two versions:
A. Flux Kontext direct edit
B. Flux Kontext for studio look + original product composited back
A may look better. B is usually safer for SKU identity.
W3b. Flux Kontext composite-back variant
This is the safer Kontext route.
input product photo
-> Flux Kontext creates target studio look/background/lighting
-> use generated output as visual target
-> cut out original product
-> composite original product back
-> contact shadow / relight / color match
-> QA
This is useful when Kontext gives good lighting/background style but changes the product too much.
W4. Finegrain Product Placement
Finegrain Product Placement LoRA is useful for thinking about product placement. It is a Flux Kontext LoRA aimed at product photography with bounding-box control.
The mental model is not “just prompt harder.” It is:
scene image
+ transparent product cutout
+ placement box
-> product blended into scene
The Finegrain Product Placement Space exposes this clearly: upload a scene photo, draw a box where the item should go, and provide a product image with transparent background.
Important caveat: the model card explicitly says products in hands are not supported. So Finegrain is relevant for:
- product on table
- product on shelf
- product on floor
- product in a room scene
- product on display
It is not the answer to:
- person holding the bag
- hand gripping the handle
- shoulder-worn bag
- complex hand/object occlusion
Also check the official blog: Finegrain product placement Flux LoRA experiment.
W5. Qwen-Image-Edit for labels, packaging, logos, printed text
Qwen-Image-Edit is especially relevant when product text matters. I would not necessarily start with it for a plain leather bag, but I would test it for:
- product boxes
- bottles
- packaging
- labels
- signs
- logos
- printed instructions
- UI/product mockups
- localized marketing creatives
Qwen’s strength is text-aware image editing, but:
text-capable is not SKU-safe.
For product work, the question is not merely “is the generated text readable?” The question is “is this still the same label/logo/brand/product?”
Use:
- OCR before/after
- manual logo review
- crop comparison
- original label-region composite-back if needed
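The "OCR before/after" check can be reduced to a string comparison once you have the two OCR outputs. This sketch assumes you already ran some OCR engine on the original and edited label crops and have two strings; the OCR call itself is out of scope. `label_text_similarity`, `label_changed`, and the 0.98 threshold are hypothetical choices.

```python
import difflib


def normalize(text: str) -> str:
    """Collapse whitespace and case so OCR formatting noise is ignored."""
    return " ".join(text.split()).casefold()


def label_text_similarity(before: str, after: str) -> float:
    """Similarity ratio in [0, 1] between two OCR transcripts."""
    return difflib.SequenceMatcher(None, normalize(before), normalize(after)).ratio()


def label_changed(before: str, after: str, threshold: float = 0.98) -> bool:
    """Flag the edit for manual review when the text drifted."""
    return label_text_similarity(before, after) < threshold


if __name__ == "__main__":
    original = "ACME Cola  330 ml"
    edited = "ACNE Cola 330 ml"  # one character silently changed
    print(label_text_similarity(original, edited))
    print(label_changed(original, edited))
```

A single wrong character in a brand name only lowers the ratio slightly, which is why the threshold is strict and why a flagged label still needs a human look.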
W6. Relighting / IC-Light
Relighting deserves its own step.
Many product/background swaps fail because the old phone-shot lighting remains on the product. The background changes, but the product still has the old shadows and highlights, so the image looks pasted together.
Use relighting after:
- cutout + composite
- generated background
- Flux Fill background work
- manual placement
- product-background blending
A generic route:
product cutout
+ selected/generated background
-> composite product
-> relight foreground to match background
-> add/refine contact shadow
-> restore original product details if softened
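As a cheap stand-in before trying a learned relighting model, you can at least nudge the product's overall exposure toward the background. To be clear, this is not IC-Light or any real relighting method. It is a crude global gain on a flat list of RGB tuples, using Rec. 709 luma weights on gamma-encoded values as an approximation. `match_exposure` and its `strength` / `max_gain` defaults are assumptions, and the gain is clamped so product color cannot drift far.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]


def mean_luma(pixels: List[Pixel]) -> float:
    """Average approximate luma (Rec. 709 weights) of a pixel list."""
    if not pixels:
        raise ValueError("pixels must not be empty")
    return sum(0.2126 * r + 0.7152 * g + 0.0722 * b for r, g, b in pixels) / len(pixels)


def match_exposure(foreground: List[Pixel], reference: List[Pixel],
                   strength: float = 0.5, max_gain: float = 1.25) -> List[Pixel]:
    """Scale foreground brightness part of the way toward the reference."""
    fg, ref = mean_luma(foreground), mean_luma(reference)
    if fg == 0:
        return list(foreground)  # nothing to scale
    gain = 1.0 + strength * (ref / fg - 1.0)
    # Clamp so the correction stays subtle and product identity survives.
    gain = min(max_gain, max(1.0 / max_gain, gain))
    return [tuple(min(255, round(c * gain)) for c in px) for px in foreground]


if __name__ == "__main__":
    product = [(100, 100, 100), (120, 90, 60)]
    background = [(200, 200, 200)]
    print(match_exposure(product, background))
```

If the clamp is being hit on most images, that is a sign the product needs proper relighting or a better-matched background rather than a bigger gain.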
W7. Manual placement + boundary/shadow fill
If product identity matters, a controlled manual workflow can be safer than a powerful all-in-one model.
product cutout
+ target scene/background
-> manually place product
-> mask only boundary/contact/shadow area
-> SDXL or Flux Fill repairs local boundary/shadow
-> relight/color-match
-> QA
This is good for:
- bag on table
- shoes on floor
- bottle on bathroom counter
- product on shelf
- small accessory on desk
It is weak for:
- hand holding product
- product worn on body
- heavy occlusion
- wrong product perspective
W8. Product in hand / person holding product
This is the hardest case.
A person holding a product is not ordinary product placement. It adds:
- hand/product occlusion
- fingers wrapping around handles
- product scale relative to body
- gravity and strap deformation
- contact shadows
- foreground/background ordering
- hand reconstruction
- product identity preservation
I would not expect normal product placement to solve this.
A safer local workaround is:
person image with suitable pose
+ original product cutout
-> manually place product near hand
-> mask only fingers / handle / contact / occlusion
-> local inpaint / hand repair / Flux Fill / SDXL inpaint
-> composite original product body back
-> relight / shadow / QA
The key is to regenerate only the tiny contact/occlusion region, not the whole product.
There are cloud/partner-model templates for product-in-hand UGC-style workflows, but I would treat those as reference/fallback, not the main local/open route.
3. VRAM guide
This is approximate. A report that a workflow “runs on 8GB” or “runs on 12GB” is not enough information on its own. It depends on:
- model
- quantization
- text encoder
- VAE
- resolution
- steps
- LoRA/distillation
- CPU/RAM offload
- ComfyUI version
- node implementation
- system RAM
- generation time
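To see why quantization dominates that list, a back-of-envelope estimate of weight memory helps. The sketch below covers the diffusion model weights only; it ignores the text encoder, VAE, activations, and framework overhead, and real GGUF/NF4 files carry extra scale data so they land somewhat above the ideal 4-bit figure. The helper name and the precision table are illustrative, and the 12-billion-parameter figure in the usage is an assumed example size.

```python
# Ideal bytes per parameter for a few storage precisions (assumed values).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weight_memory_gib(params_billions: float, precision: str) -> float:
    """Approximate memory for model weights alone, in GiB."""
    if precision not in BYTES_PER_PARAM:
        raise ValueError(f"unknown precision: {precision!r}")
    return params_billions * 1e9 * BYTES_PER_PARAM[precision] / 2**30


if __name__ == "__main__":
    for precision in ("fp16", "fp8", "int4"):
        print(f"12B params at {precision}: {weight_memory_gib(12, precision):.1f} GiB")
```

For a model of that assumed size, the weights alone come to roughly 22 GiB at fp16 and roughly 5.6 GiB at an ideal 4 bits, which is why the same model can be out of reach on one card and merely slow on another.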
8GB VRAM
Start with:
- cutout + composite + shadow
- SDXL background-only inpaint
- small resolution tests
- VAE tiling/slicing
- aggressive offload if needed
Treat Flux/Qwen as experimental. Some community workflows may run, but speed and stability can be poor.
12GB VRAM
More realistic:
- SDXL composite workflows
- Flux GGUF experiments
- Flux Kontext GGUF tests
- Flux Fill with quant/offload
- careful text encoder choice
Still log runtime. A 12GB report can mean under a minute or many minutes depending on quantization and workflow.
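Logging runtime next to the settings that produced it makes those reports comparable. A minimal sketch using only the standard library; `timed_run` and the setting names passed to it are hypothetical, and the timed body here is a placeholder for your actual generation call.

```python
import time
from contextlib import contextmanager
from typing import Any, Dict, Iterator, List


@contextmanager
def timed_run(log: List[Dict[str, Any]], **settings: Any) -> Iterator[None]:
    """Time the enclosed block and append settings plus elapsed seconds to log."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record even if the run failed, so slow failures are visible too.
        log.append({**settings, "seconds": time.perf_counter() - start})


if __name__ == "__main__":
    runs: List[Dict[str, Any]] = []
    with timed_run(runs, workflow="W1", resolution=1024, steps=30):
        sum(range(100_000))  # placeholder for the real generation call
    print(runs)
```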
16GB VRAM
A good experimentation tier:
- Flux GGUF/FP8 becomes more serious
- Qwen 4-bit/GGUF/NF4 becomes testable
- Finegrain placement may be possible
- relighting/composite workflows are practical
24GB VRAM
A practical local comparison tier:
- Flux Kontext
- Flux Fill
- Qwen-Image-Edit quantized or optimized
- SDXL + ControlNet/IP-Adapter workflows
- more comfortable high-res tests
32GB+
At this point, focus less on “can it run?” and more on:
- product identity
- failure rate
- batch reliability
- legal/license terms
- QA automation
- repeatability
- throughput
4. Suggested order of testing
I would test in this order:
1. W0 cutout + composite + shadow
2. W1 SDXL background-only inpaint
3. W2 Flux Fill background/floor/shadow
4. W6 relighting / IC-Light
5. W3 Flux Kontext direct edit
6. W3b Flux Kontext composite-back
7. W4 Finegrain product placement
8. W5 Qwen-Image-Edit for labels/text
9. W8 product-in-hand local workaround
Reason:
- start with the least destructive workflow
- establish a product-identity baseline
- add generation only where it helps
- reserve direct full-image editing for cases where the safer route is not enough
5. QA checklist
Do not evaluate only the full image. The background is supposed to change. The product is not.
Compare:
- original product crop
- generated product crop
- original mask
- generated product mask
- label crop
- logo crop
- hardware crop
- full image
Check product identity:
- silhouette / proportions
- color
- material texture
- leather grain / fabric weave
- hardware
- zipper / buckle / strap / handle
- stitching
- label/logo
- small text
- barcode if relevant
- product scale
Check scene realism:
- contact shadow
- light direction
- floor contact
- perspective
- reflection
- background consistency
- old lighting still on product
- pasted/cutout look
For text-heavy products:
- run OCR before/after
- inspect manually
- preserve original label region if needed
Automated metrics are useful as red flags, not final approval. Human review is still necessary for SKU identity.
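As one example of such a red-flag metric, you can measure how much the pixels inside the product mask changed. This sketch assumes the original and candidate crops are already aligned (same size, same product position), stored as flat lists of RGB tuples with a binary mask. `masked_mean_abs_diff`, `flag_identity_drift`, and the threshold of 2.0 are illustrative choices, not an established standard.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]


def masked_mean_abs_diff(original: List[Pixel], candidate: List[Pixel],
                         mask: Sequence[int]) -> float:
    """Mean absolute per-channel difference over pixels where mask is truthy."""
    if not (len(original) == len(candidate) == len(mask)):
        raise ValueError("inputs must have the same length")
    total, count = 0.0, 0
    for o, c, m in zip(original, candidate, mask):
        if m:
            total += sum(abs(a - b) for a, b in zip(o, c)) / len(o)
            count += 1
    if count == 0:
        raise ValueError("mask selects no pixels")
    return total / count


def flag_identity_drift(original: List[Pixel], candidate: List[Pixel],
                        mask: Sequence[int], threshold: float = 2.0) -> bool:
    """True when the product region changed enough to need manual review."""
    return masked_mean_abs_diff(original, candidate, mask) > threshold


if __name__ == "__main__":
    before = [(10, 10, 10), (50, 50, 50)]
    after = [(10, 10, 10), (80, 50, 50)]
    print(masked_mean_abs_diff(before, after, [1, 1]))
    print(flag_identity_drift(before, after, [1, 1]))
```

With a composite-back workflow this number should be near zero inside the mask, so any large value points to a pipeline bug. With a direct edit it only tells you where to look; it cannot tell you whether a changed buckle still counts as the same product.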
6. Research framing
This problem is close to e-commerce item insertion / virtual try-all research.
Diffuse to Choose is especially relevant because it frames the task as inserting an e-commerce item into a target scene while preserving fine-grained reference-item details and producing plausible blending, lighting, and shadows.
The practical local/ComfyUI route is basically an approximation of this harder research problem:
reference product
+ product mask/cutout
+ target scene/background
+ local edit/fill/blend
+ relighting
+ product-consistency QA
7. Commercial APIs
I would treat commercial product-shot APIs as reference/fallback, not the main answer.
They can be useful for:
- benchmarking quality
- fast production
- product-shot-specific pipelines
- cases where local VRAM is too limited
- product-in-hand or UGC-style templates
But check:
- cost
- privacy
- uploaded product/customer images
- licensing
- output usage rights
- data retention
- brand safety
- repeatability
8. Compact decision tree
Need exact white-background packshot?
-> Use cutout + composite + shadow first.
-> Avoid regenerating the product.
Need background replacement?
-> Segment product.
-> Inpaint/fill only background/floor/shadow.
-> Composite original product back if identity matters.
-> Relight.
Need one-shot phone-shot-to-studio conversion?
-> Try Flux Kontext.
-> Also make a composite-back version.
-> Compare product crop.
Need product in a lifestyle scene?
-> If placed on a surface: try manual placement, Finegrain, or fill boundary/shadow.
-> If held by a person: treat as hand-object interaction.
Product has important text/logo/label?
-> Test Qwen-Image-Edit.
-> OCR + manual review.
-> Composite original label/logo region if needed.
Low VRAM?
-> 8GB: cutout/composite + SDXL baseline.
-> 12GB: Flux GGUF experiments.
-> 16GB: Flux/Qwen quantized experiments.
-> 24GB+: serious comparison.
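If you want to script the triage, the tree above maps onto a small lookup. The workflow IDs are the W0 to W8 labels from this post; the goal names, function names, and the exact VRAM cut-offs between tiers (for example, treating anything under 12GB as the 8GB tier) are my own assumptions.

```python
from typing import List


def recommend_workflows(goal: str, held_by_person: bool = False,
                        has_important_text: bool = False) -> List[str]:
    """Return workflow IDs to try, in order, for a given goal."""
    routes = {
        "packshot": ["W0"],
        "background": ["W1", "W2", "W6"],
        "one_shot": ["W3", "W3b"],
        # A held product is hand-object interaction, not surface placement.
        "lifestyle": ["W8"] if held_by_person else ["W7", "W4"],
    }
    if goal not in routes:
        raise ValueError(f"unknown goal: {goal!r}")
    plan = list(routes[goal])
    if has_important_text:
        plan.append("W5")  # add the label/text route plus OCR review
    return plan


def vram_starting_point(vram_gb: float) -> str:
    """Map available VRAM to the starting tier from the decision tree."""
    if vram_gb < 12:
        return "cutout/composite + SDXL baseline"
    if vram_gb < 16:
        return "Flux GGUF experiments"
    if vram_gb < 24:
        return "Flux/Qwen quantized experiments"
    return "serious comparison"


if __name__ == "__main__":
    print(recommend_workflows("lifestyle", held_by_person=True))
    print(recommend_workflows("packshot", has_important_text=True))
    print(vram_starting_point(12))
```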
9. My practical recommendation
I would start with this baseline:
1. Segment/remove background.
2. Save original product cutout and mask.
3. Create or generate a clean studio background.
4. Composite original product onto it.
5. Add/inpaint contact shadow only.
6. Relight/color-match if needed.
7. Compare product crop, label crop, and full image.
Then compare against:
- Flux Kontext direct edit
- Flux Fill masked background/floor/shadow
- Finegrain placement for surface placement
- Qwen-Image-Edit for label/text-heavy products
- commercial APIs only as reference/fallback
For a simple packshot, the safest result may come from boring compositing rather than the strongest model. For lifestyle placement, the best route is usually product cutout + target scene + local fill/blend + relighting. For a person holding the product, expect the task to be much harder and use local hand/contact inpainting rather than ordinary product placement.
Also check model cards, repo licenses, API terms, brand policy, and privacy constraints before using outputs commercially.