Phone shot image to studio shot image version for products

jeffs99 · May 31, 2026, 8:54pm

Hello, everyone.

I have been trying to build a working prototype for a project that turns phone photos of a product into studio-style product images that can be used in online stores.

One example I have been trying:
Input: a phone photo of a leather bag shot from the side
Desired output: the same leather bag in a studio with a white background and a small shadow, and also a person holding the bag

I have tried all of the following with no success so far:
SDXL, IP-Adapter, and inpainting. None of them worked: they generated completely different images, or broken or baked versions, and most of them just edited the texture, which was not supposed to happen.

I am really stuck on building the correct pipeline.

I would really appreciate it if anyone could help me out and show me what to do.

The main issue is that the difference between the reference image and the generated image is too big.

John6666 · June 1, 2026, 2:58am

I looked into this a bit. Depending on your VRAM, I think there are a few workable routes. (Long version here):

Short version

I would not treat this as one single “image-to-image” problem.

It is really a bundle of smaller tasks:

product cutout / masking
clean packshot generation
background replacement
floor and contact-shadow generation
relighting / color matching
product placement into a lifestyle scene
hand-object interaction if a person is holding the product
logo / label / text / hardware / stitching preservation
low-VRAM execution
product-identity QA

The safest rule is:

If SKU identity matters, do not regenerate the whole product unless you absolutely have to.

For e-commerce/product photography, a beautiful image can still be a failure if the product is no longer the same product. I would usually start from workflows that preserve the original product pixels and only generate the background, floor, shadow, lighting, or small boundary/contact areas around it.

A useful mental model is:

protect product pixels
generate or edit background pixels
add floor/contact shadow
relight or color-match
composite original product back if needed
verify product identity

This is also why I would separate “phone shot to studio packshot” from “person holding the product.” The latter is not ordinary product placement; it is hand-object interaction with occlusion.

1. First principle: prompt the scene, not the SKU

For background replacement, do not over-describe the product itself. If you prompt “a brown leather handbag with gold zipper and braided handle,” the model may try to recreate a plausible brown handbag instead of preserving the exact one.

A better prompt usually describes:

where the product is grounded
the studio/background scene
the floor/surface
lighting style
camera/product-photography style

Shopify’s old SDXL background-replacement Space has a very useful prompting rule in this direction: do not describe the product; describe grounding, scene, and style. See Shopify/background-replacement.

So my shorthand would be:

Prompt the scene, not the SKU.

Example:

clean white ecommerce studio background, product standing on a matte white surface, soft diffused studio lighting, subtle realistic contact shadow, catalog product photography

Not:

brown leather handbag with gold zipper, braided handle, front logo, side stitching

The second prompt may encourage the model to redraw the product.

2. Workflow options

W0. Cutout + composite + shadow

This is the safest baseline for a clean packshot.

input product photo
-> background removal / segmentation
-> original product cutout with alpha
-> white or light gray canvas
-> scale and center
-> synthetic or inpainted contact shadow
-> optional relighting / color match
-> final QA

This is not flashy, but it preserves SKU identity better than almost any full-image generation workflow. Use this as the control group before testing Flux/Qwen/Kontext workflows.

Useful parts:

BRIA RMBG-2.0 for background removal
SAM2 for promptable segmentation
Segment Anything if you need interactive masks
Pillow/OpenCV/Photoshop/ComfyUI nodes for compositing

Best for:

white-background product listing
exact product shape/color/material
low VRAM
batch packshots

Main failure modes:

halo around edges
no contact shadow
product looks pasted
white product on white background loses shape
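As a rough illustration of the W0 baseline, here is a minimal Pillow sketch. It assumes you already have an RGBA cutout from a background-removal tool. The function name `make_packshot`, its parameters, and the blurred-ellipse shadow are my own placeholders, not part of any specific tool, and a real contact shadow usually needs more care than this.

```python
from PIL import Image, ImageDraw, ImageFilter


def make_packshot(cutout, canvas_size=(1024, 1024), fill_ratio=0.8,
                  shadow_opacity=90, shadow_blur=12):
    """Center an RGBA cutout on a white canvas with a soft synthetic contact shadow."""
    cutout = cutout.convert("RGBA")
    bbox = cutout.getchannel("A").getbbox()  # bounds of non-transparent pixels
    if bbox is None:
        raise ValueError("cutout is fully transparent")
    cutout = cutout.crop(bbox)

    # Scale to fill `fill_ratio` of the canvas while keeping the aspect ratio.
    cw, ch = canvas_size
    scale = min(cw * fill_ratio / cutout.width, ch * fill_ratio / cutout.height)
    w = max(1, round(cutout.width * scale))
    h = max(1, round(cutout.height * scale))
    cutout = cutout.resize((w, h), Image.LANCZOS)
    x, y = (cw - w) // 2, (ch - h) // 2

    # Shadow: a flat dark ellipse under the product base, blurred to soften it.
    shadow = Image.new("L", canvas_size, 0)
    base = y + h
    ImageDraw.Draw(shadow).ellipse(
        (x + 0.05 * w, base - 0.04 * h, x + 0.95 * w, base + 0.04 * h),
        fill=shadow_opacity,
    )
    shadow = shadow.filter(ImageFilter.GaussianBlur(shadow_blur))

    white = Image.new("RGB", canvas_size, (255, 255, 255))
    black = Image.new("RGB", canvas_size, (0, 0, 0))
    canvas = Image.composite(black, white, shadow)  # darken only where shadow > 0
    canvas.paste(cutout, (x, y), cutout)            # original pixels, alpha as mask
    return canvas
```

The product pixels are only resized, never regenerated, which is the point of using this as the control group. The halo and "pasted" failure modes above still apply and depend mostly on the quality of the alpha edge.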

W1. SDXL background-only inpaint

SDXL is not the newest or strongest editor, but it is still a useful low-VRAM baseline.

Use it for:

background-only inpainting
floor/contact shadow experiments
quick packshot tests
comparison baseline before heavier Flux/Qwen workflows

The important point is to protect the product.

input photo
-> product mask
-> invert mask or protect product region
-> inpaint only background/floor/shadow
-> composite original product back
-> QA product crop

Do not ask SDXL to redraw the product if exact identity matters. It may produce a similar-looking product with changed hardware, stitching, logo, label, color, or proportions.
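The "protect the product" part can be sketched independently of the inpainting model. The two helpers below are hypothetical names of mine. They assume a product mask where white means product, and an inpaint convention where white means "repaint" (which is how the diffusers inpaint pipelines treat masks, but check your own tool). The inpainting call itself is left out.

```python
from PIL import Image, ImageOps


def background_inpaint_mask(product_mask):
    """Invert a product mask (white = product) into an inpaint mask (white = repaint)."""
    return ImageOps.invert(product_mask.convert("L"))


def composite_product_back(original, generated, product_mask):
    """Paste the original product pixels over the generated image."""
    original = original.convert("RGB")
    # Assumes the generated image has the same framing as the original.
    generated = generated.convert("RGB").resize(original.size)
    mask = product_mask.convert("L").resize(original.size)
    # Image.composite takes `original` where the mask is white, `generated` elsewhere.
    return Image.composite(original, generated, mask)
```

Even if the inpaint step leaks into the product region, the composite-back step restores the original pixels there, so the QA crop compares like with like.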

W2. Flux Fill for background/floor/shadow

FLUX.1 Fill dev is very relevant, but I would frame it as a masked completion component, not a one-click product-photography solution.

Good use:

protect original product
mask background/floor/shadow area
Flux Fill generates only the missing background/floor/shadow
composite original product back if needed
relight / blend / QA

It is promising for:

replacing messy phone-shot backgrounds
adding studio floors
extending canvas/outpainting
creating more natural shadows around a protected product

But product-background swap quality depends heavily on:

mask precision
mask expansion/blur
contact shadow
relighting
final blending
whether you composite the original product back

Low-VRAM users may need GGUF/NF4/offload/custom nodes.
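Mask expansion and blur are easy to experiment with outside ComfyUI. This is a small Pillow sketch with a hypothetical helper name. The `grow_px` and `blur_px` defaults are arbitrary starting values, not recommendations from any model card.

```python
from PIL import Image, ImageFilter


def grow_and_feather(mask, grow_px=4, blur_px=3):
    """Expand the white (repaint) region by grow_px pixels, then soften its edge."""
    mask = mask.convert("L")
    if grow_px > 0:
        # MaxFilter needs an odd kernel size; 2 * grow_px + 1 dilates by grow_px.
        mask = mask.filter(ImageFilter.MaxFilter(2 * grow_px + 1))
    if blur_px > 0:
        mask = mask.filter(ImageFilter.GaussianBlur(blur_px))
    return mask
```

Growing the repaint mask slightly into the product edge tends to remove leftover background halo, at the cost of regenerating a thin rim of the product. If that rim matters, composite the original product back afterwards.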

W3. Flux Kontext direct edit

FLUX.1 Kontext dev is probably one of the strongest local candidates for direct “phone shot → studio shot” editing. The model card describes image editing from text instructions, object/style/character reference, and successive edits with minimal visual drift.

Test it like this:

input product photo
-> Flux Kontext
-> prompt: turn this into a professional ecommerce studio product photo
-> output
-> product identity QA

However, for strict e-commerce use, I would not trust the direct output blindly. A direct edit may look excellent while quietly changing:

silhouette
color
leather/fabric texture
zipper or buckle shape
logo
label text
handle length
stitching
product proportions

For serious use, compare two versions:

A. Flux Kontext direct edit
B. Flux Kontext for studio look + original product composited back

A may look better. B is usually safer for SKU identity.

W3b. Flux Kontext composite-back variant

This is the safer Kontext route.

input product photo
-> Flux Kontext creates target studio look/background/lighting
-> use generated output as visual target
-> cut out original product
-> composite original product back
-> contact shadow / relight / color match
-> QA

This is useful when Kontext gives good lighting/background style but changes the product too much.

W4. Finegrain Product Placement

Finegrain Product Placement LoRA is useful for thinking about product placement. It is a Flux Kontext LoRA aimed at product photography with bounding-box control.

The mental model is not “just prompt harder.” It is:

scene image
+ transparent product cutout
+ placement box
-> product blended into scene

The Finegrain Product Placement Space exposes this clearly: upload a scene photo, draw a box where the item should go, and provide a product image with transparent background.

Important caveat: the model card explicitly says products in hands are not supported. So Finegrain is relevant for:

product on table
product on shelf
product on floor
product in a room scene
product on display

It is not the answer to:

person holding the bag
hand gripping the handle
shoulder-worn bag
complex hand/object occlusion

Also check the official blog: Finegrain product placement Flux LoRA experiment.

W5. Qwen-Image-Edit for labels, packaging, logos, printed text

Qwen-Image-Edit is especially relevant when product text matters. I would not necessarily start with it for a plain leather bag, but I would test it for:

product boxes
bottles
packaging
labels
signs
logos
printed instructions
UI/product mockups
localized marketing creatives

Qwen’s strength is text-aware image editing, but:

text-capable is not SKU-safe.

For product work, the question is not merely “is the generated text readable?” The question is “is this still the same label/logo/brand/product?”

Use:

OCR before/after
manual logo review
crop comparison
original label-region composite-back if needed

W6. Relighting / IC-Light

Relighting deserves its own step.

Many product/background swaps fail because the old phone-shot lighting remains on the product. The background changes, but the product still has the old shadows and highlights, so the image looks pasted together.

Use relighting after:

cutout + composite
generated background
Flux Fill background work
manual placement
product-background blending

A generic route:

product cutout
+ selected/generated background
-> composite product
-> relight foreground to match background
-> add/refine contact shadow
-> restore original product details if softened
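For the color-match half of this step, a crude statistical match is sometimes enough as a first test. The sketch below is not IC-Light and does not relight anything. It only shifts the per-channel mean and spread of the visible product pixels toward a reference image (for example the new background). The function name and the `strength` parameter are my own assumptions.

```python
import numpy as np
from PIL import Image


def match_color(product, reference, strength=0.5):
    """Shift the product's per-channel mean/std toward the reference image."""
    rgba = np.asarray(product.convert("RGBA"), dtype=np.float32)
    ref = np.asarray(reference.convert("RGB"), dtype=np.float32).reshape(-1, 3)
    visible = rgba[..., 3] > 0          # ignore fully transparent pixels
    if not visible.any():
        return product.copy()
    pix = rgba[..., :3][visible]        # (N, 3) visible product pixels
    src_mean, src_std = pix.mean(0), pix.std(0) + 1e-6
    ref_mean, ref_std = ref.mean(0), ref.std(0) + 1e-6
    matched = (pix - src_mean) / src_std * ref_std + ref_mean
    # strength=0 keeps the original colors, strength=1 applies the full match.
    blended = (1 - strength) * pix + strength * matched
    rgba[..., :3][visible] = np.clip(blended, 0, 255)
    return Image.fromarray(np.rint(rgba).astype(np.uint8))
```

Keep `strength` low for e-commerce work. Color is part of SKU identity, so a strong match can make the composite look natural while making the product color wrong.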

W7. Manual placement + boundary/shadow fill

If product identity matters, a controlled manual workflow can be safer than a powerful all-in-one model.

product cutout
+ target scene/background
-> manually place product
-> mask only boundary/contact/shadow area
-> SDXL or Flux Fill repairs local boundary/shadow
-> relight/color-match
-> QA

This is good for:

bag on table
shoes on floor
bottle on bathroom counter
product on shelf
small accessory on desk

It is weak for:

hand holding product
product worn on body
heavy occlusion
wrong product perspective
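The manual placement step can be scripted so that it is repeatable across a batch. Below is a minimal Pillow sketch, assuming an RGBA product cutout and a hand-picked placement box in scene pixel coordinates. `place_in_box` is a hypothetical helper, and bottom-aligning the product inside the box is my own choice so that it sits on the surface.

```python
from PIL import Image


def place_in_box(scene, cutout, box):
    """Fit an RGBA cutout inside box=(left, top, right, bottom), bottom-aligned."""
    left, top, right, bottom = box
    cutout = cutout.convert("RGBA")
    bbox = cutout.getchannel("A").getbbox()  # bounds of non-transparent pixels
    if bbox is None:
        raise ValueError("cutout is fully transparent")
    cutout = cutout.crop(bbox)

    # Largest scale that fits the box without changing the aspect ratio.
    scale = min((right - left) / cutout.width, (bottom - top) / cutout.height)
    w = max(1, round(cutout.width * scale))
    h = max(1, round(cutout.height * scale))
    cutout = cutout.resize((w, h), Image.LANCZOS)

    x = left + ((right - left) - w) // 2  # centered horizontally
    y = bottom - h                        # resting on the bottom edge of the box
    out = scene.convert("RGB")            # convert() returns a new image
    out.paste(cutout, (x, y), cutout)
    return out, (x, y, x + w, y + h)
```

The returned rectangle tells you where the product landed, which is what you need to build the boundary/contact/shadow mask for the local fill step.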

W8. Product in hand / person holding product

This is the hardest case.

A person holding a product is not ordinary product placement. It adds:

hand/product occlusion
fingers wrapping around handles
product scale relative to body
gravity and strap deformation
contact shadows
foreground/background ordering
hand reconstruction
product identity preservation

I would not expect normal product placement to solve this.

A safer local workaround is:

person image with suitable pose
+ original product cutout
-> manually place product near hand
-> mask only fingers / handle / contact / occlusion
-> local inpaint / hand repair / Flux Fill / SDXL inpaint
-> composite original product body back
-> relight / shadow / QA

The key is to regenerate only the tiny contact/occlusion region, not the whole product.
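One way to keep the regenerated area small is to derive the inpaint mask from the overlap of the hand and the product instead of painting it by hand. This is only a sketch under the assumption that you already have a hand mask and a product mask (white = object) for the composited image. `contact_region_mask` and `reach_px` are hypothetical names.

```python
from PIL import Image, ImageChops, ImageFilter


def contact_region_mask(hand_mask, product_mask, reach_px=6):
    """White only where the dilated hand and the dilated product overlap."""
    kernel = 2 * reach_px + 1  # MaxFilter needs an odd kernel size

    def dilate(mask):
        binary = mask.convert("L").point(lambda v: 255 if v > 127 else 0)
        return binary.filter(ImageFilter.MaxFilter(kernel))

    # Per-pixel minimum of two binary masks is their intersection.
    return ImageChops.darker(dilate(hand_mask), dilate(product_mask))
```

The result covers only the band where fingers and handle meet, so the product body and most of the hand stay untouched. If the two masks are far apart, the result is empty, which is a useful signal that the placement is off.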

There are cloud/partner-model templates for product-in-hand UGC-style workflows, but I would treat those as reference/fallback, not the main local/open route.

3. VRAM guide

This is approximate. “Runs on 8GB” or “runs on 12GB” is not enough information. It depends on:

model
quantization
text encoder
VAE
resolution
steps
LoRA/distillation
CPU/RAM offload
ComfyUI version
node implementation
system RAM
generation time

8GB VRAM

Start with:

cutout + composite + shadow
SDXL background-only inpaint
small resolution tests
VAE tiling/slicing
aggressive offload if needed

Treat Flux/Qwen as experimental. Some community workflows may run, but speed and stability can be poor.
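If you test through diffusers rather than ComfyUI, the offload and VAE options above map to a few pipeline methods. The sketch below assumes a diffusers-style pipeline object that exposes `enable_model_cpu_offload`, `enable_sequential_cpu_offload`, `enable_vae_slicing`, and `enable_vae_tiling`. Method availability varies by pipeline class and library version, so the helper (a hypothetical name of mine) only calls what it finds.

```python
def apply_low_vram_settings(pipe, aggressive=False):
    """Enable memory-saving options that the given pipeline object supports."""
    # Sequential offload saves more VRAM than model offload but is much slower.
    offload = "enable_sequential_cpu_offload" if aggressive else "enable_model_cpu_offload"
    applied = []
    for name in (offload, "enable_vae_slicing", "enable_vae_tiling"):
        method = getattr(pipe, name, None)
        if callable(method):
            method()
            applied.append(name)
    return applied  # log this next to runtime so reports are comparable
```

Logging which options were actually applied makes "it runs on 8GB" reports more meaningful, because the same model can behave very differently with and without offload.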

12GB VRAM

More realistic:

SDXL composite workflows
Flux GGUF experiments
Flux Kontext GGUF tests
Flux Fill with quant/offload
careful text encoder choice

Still log runtime. A 12GB report can mean under a minute or many minutes depending on quantization and workflow.

16GB VRAM

A good experimentation tier:

Flux GGUF/FP8 becomes more serious
Qwen 4-bit/GGUF/NF4 becomes testable
Finegrain placement may be possible
relighting/composite workflows are practical

24GB VRAM

A practical local comparison tier:

Flux Kontext
Flux Fill
Qwen-Image-Edit quantized or optimized
SDXL + ControlNet/IP-Adapter workflows
more comfortable high-res tests

32GB+

At this point, focus less on “can it run?” and more on:

product identity
failure rate
batch reliability
legal/license terms
QA automation
repeatability
throughput

4. Suggested order of testing

I would test in this order:

1. W0 cutout + composite + shadow
2. W1 SDXL background-only inpaint
3. W2 Flux Fill background/floor/shadow
4. W6 relighting / IC-Light
5. W3 Flux Kontext direct edit
6. W3b Flux Kontext composite-back
7. W4 Finegrain product placement
8. W5 Qwen-Image-Edit for labels/text
9. W8 product-in-hand local workaround

Reason:

start with the least destructive workflow
establish a product-identity baseline
add generation only where it helps
reserve direct full-image editing for cases where the safer route is not enough

5. QA checklist

Do not evaluate only the full image. The background is supposed to change. The product is not.

Compare:

original product crop
generated product crop
original mask
generated/product mask
label crop
logo crop
hardware crop
full image

Check product identity:

silhouette / proportions
color
material texture
leather grain / fabric weave
hardware
zipper / buckle / strap / handle
stitching
label/logo
small text
barcode if relevant
product scale

Check scene realism:

contact shadow
light direction
floor contact
perspective
reflection
background consistency
old lighting still on product
pasted/cutout look

For text-heavy products:

run OCR before/after
inspect manually
preserve original label region if needed
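The OCR before/after check can be reduced to a text comparison once you have the two strings. The sketch below uses only the Python standard library and assumes the OCR itself is done elsewhere with whatever tool you prefer. `label_text_drift` is a hypothetical helper name.

```python
import difflib


def label_text_drift(before, after):
    """Compare OCR text from the original and generated label crops."""
    def normalize(text):
        # Collapse whitespace and ignore case so OCR layout noise is not flagged.
        return " ".join(text.split()).casefold()

    a, b = normalize(before), normalize(after)
    ratio = difflib.SequenceMatcher(None, a, b).ratio()  # 1.0 means identical
    missing = sorted(set(a.split()) - set(b.split()))    # words lost in the edit
    return ratio, missing
```

For example, `label_text_drift("NET WT 250 g", "NET WT 260 g")` reports `"250"` as missing, which is exactly the kind of quiet change that looks fine in the full image. OCR errors will also show up here, so treat a mismatch as a prompt for manual inspection.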

Automated metrics are useful as red flags, not final approval. Human review is still necessary for SKU identity.
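Two cheap red-flag metrics are mask overlap (did the silhouette change?) and the pixel difference inside the product region (did color or texture change?). This is a NumPy/Pillow sketch that assumes the original and generated images share the same framing, so it fits the composite-back workflows better than free-form edits. The function names are mine, and I am deliberately not suggesting thresholds.

```python
import numpy as np
from PIL import Image


def mask_iou(mask_a, mask_b):
    """Intersection over union of two same-size masks (white = product)."""
    a = np.asarray(mask_a.convert("L")) > 127
    b = np.asarray(mask_b.convert("L")) > 127
    union = np.logical_or(a, b).sum()
    return float(np.logical_and(a, b).sum() / union) if union else 1.0


def masked_mean_abs_diff(original, generated, product_mask):
    """Mean absolute RGB difference (0-255 scale) inside the product mask."""
    m = np.asarray(product_mask.convert("L")) > 127
    if not m.any():
        raise ValueError("product mask is empty")
    o = np.asarray(original.convert("RGB"), dtype=np.float32)
    g = np.asarray(generated.convert("RGB").resize(original.size), dtype=np.float32)
    return float(np.abs(o - g)[m].mean())
```

A low IoU or a high masked difference is a reason to look closer. A good score does not prove the logo, stitching, or hardware survived, so the crop-by-crop human review above still applies.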

6. Research framing

This problem is close to e-commerce item insertion / virtual try-all research.

Diffuse to Choose is especially relevant because it frames the task as inserting an e-commerce item into a target scene while preserving fine-grained reference-item details and producing plausible blending, lighting, and shadows.

The practical local/ComfyUI route is basically an approximation of this harder research problem:

reference product
+ product mask/cutout
+ target scene/background
+ local edit/fill/blend
+ relighting
+ product-consistency QA

7. Commercial APIs

I would treat commercial product-shot APIs as reference/fallback, not the main answer.

They can be useful for:

benchmarking quality
fast production
product-shot-specific pipelines
cases where local VRAM is too limited
product-in-hand or UGC-style templates

But check:

cost
privacy
uploaded product/customer images
licensing
output usage rights
data retention
brand safety
repeatability

8. Compact decision tree

Need exact white-background packshot?
-> Use cutout + composite + shadow first.
-> Avoid regenerating the product.

Need background replacement?
-> Segment product.
-> Inpaint/fill only background/floor/shadow.
-> Composite original product back if identity matters.
-> Relight.

Need one-shot phone-shot-to-studio conversion?
-> Try Flux Kontext.
-> Also make a composite-back version.
-> Compare product crop.

Need product in a lifestyle scene?
-> If placed on a surface: try manual placement, Finegrain, or fill boundary/shadow.
-> If held by a person: treat as hand-object interaction.

Product has important text/logo/label?
-> Test Qwen-Image-Edit.
-> OCR + manual review.
-> Composite original label/logo region if needed.

Low VRAM?
-> 8GB: cutout/composite + SDXL baseline.
-> 12GB: Flux GGUF experiments.
-> 16GB: Flux/Qwen quantized experiments.
-> 24GB+: serious comparison.

9. My practical recommendation

I would start with this baseline:

1. Segment/remove background.
2. Save original product cutout and mask.
3. Create or generate a clean studio background.
4. Composite original product onto it.
5. Add/inpaint contact shadow only.
6. Relight/color-match if needed.
7. Compare product crop, label crop, and full image.

Then compare against:

Flux Kontext direct edit
Flux Fill masked background/floor/shadow
Finegrain placement for surface placement
Qwen-Image-Edit for label/text-heavy products
commercial APIs only as reference/fallback

For a simple packshot, the safest result may come from boring compositing rather than the strongest model. For lifestyle placement, the best route is usually product cutout + target scene + local fill/blend + relighting. For a person holding the product, expect the task to be much harder and use local hand/contact inpainting rather than ordinary product placement.

Also check model cards, repo licenses, API terms, brand policy, and privacy constraints before using outputs commercially.
