Converting phone-shot product images into studio-shot versions

Hello, everyone.

I have been trying to build a working prototype for a project that turns phone-taken images of a product into studio-shot product images that can be used in online stores.

One example I have been trying:
Input: a phone-taken image of a leather bag, shot from the side
Desired output: the same leather bag, but in a studio with a white background and a small shadow, and also with a person holding the bag

I have tried all of the following with no success so far:
SDXL, IP-Adapter, and inpainting. None of them worked. They all generated completely different images, or broken or over-baked versions, and most of them just edited the texture, which was not supposed to happen.

I am really stuck on building the correct pipeline.

I would really appreciate it if anyone could help me out and show me what to do.

The main issue is that the difference between the reference image and the generated image is too big.

I looked into this a bit. Depending on your VRAM, I think there are a few workable routes. (Long version here):


Short version

I would not treat this as one single “image-to-image” problem.

It is really a bundle of smaller tasks:

  • product cutout / masking
  • clean packshot generation
  • background replacement
  • floor and contact-shadow generation
  • relighting / color matching
  • product placement into a lifestyle scene
  • hand-object interaction if a person is holding the product
  • logo / label / text / hardware / stitching preservation
  • low-VRAM execution
  • product-identity QA

The safest rule is:

If SKU identity matters, do not regenerate the whole product unless you absolutely have to.

For e-commerce/product photography, a beautiful image can still be a failure if the product is no longer the same product. I would usually start from workflows that preserve the original product pixels and only generate the background, floor, shadow, lighting, or small boundary/contact areas around it.

A useful mental model is:

protect product pixels
generate or edit background pixels
add floor/contact shadow
relight or color-match
composite original product back if needed
verify product identity

This is also why I would separate “phone shot to studio packshot” from “person holding the product.” The latter is not ordinary product placement; it is hand-object interaction with occlusion.


1. First principle: prompt the scene, not the SKU

For background replacement, do not over-describe the product itself. If you prompt “a brown leather handbag with gold zipper and braided handle,” the model may try to recreate a plausible brown handbag instead of preserving the exact one.

A better prompt usually describes:

  • where the product is grounded
  • the studio/background scene
  • the floor/surface
  • lighting style
  • camera/product-photography style

Shopify’s old SDXL background-replacement Space has a very useful prompting rule in this direction: do not describe the product; describe grounding, scene, and style. See Shopify/background-replacement.

So my shorthand would be:

Prompt the scene, not the SKU.

Example:

clean white ecommerce studio background, product standing on a matte white surface, soft diffused studio lighting, subtle realistic contact shadow, catalog product photography

Not:

brown leather handbag with gold zipper, braided handle, front logo, side stitching

The second prompt may encourage the model to redraw the product.
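
A minimal sketch of how I would enforce that rule in a script, assuming you build prompts from a few scene fields. The class name, the field names, and the word list are all hypothetical and only there for illustration; tune the word list to your own catalog.

```python
from dataclasses import dataclass

# Hypothetical word list: product descriptors that should not leak into a scene prompt.
SKU_TERMS = {"handbag", "bag", "zipper", "logo", "stitching", "handle", "leather"}


@dataclass
class ScenePrompt:
    scene: str      # studio/background scene
    grounding: str  # where the product is grounded
    lighting: str   # lighting style
    style: str      # camera/product-photography style

    def render(self) -> str:
        """Join the scene fields into one comma-separated prompt."""
        return ", ".join([self.scene, self.grounding, self.lighting, self.style])


def sku_terms_in(prompt: str) -> list:
    """Return the product-descriptor words found in the prompt, sorted."""
    words = {w.strip(".,").lower() for w in prompt.split()}
    return sorted(words & SKU_TERMS)


prompt = ScenePrompt(
    scene="clean white ecommerce studio background",
    grounding="product standing on a matte white surface",
    lighting="soft diffused studio lighting",
    style="catalog product photography",
).render()
leaked = sku_terms_in(prompt)  # empty list means the prompt only describes the scene
```

A simple word check like this will not catch every product description, but it is a cheap guard against accidentally pasting SKU details into a background prompt.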


2. Workflow options

W0. Cutout + composite + shadow

This is the safest baseline for a clean packshot.

input product photo
-> background removal / segmentation
-> original product cutout with alpha
-> white or light gray canvas
-> scale and center
-> synthetic or inpainted contact shadow
-> optional relighting / color match
-> final QA

This is not flashy, but it preserves SKU identity better than almost any full-image generation workflow. Use this as the control group before testing Flux/Qwen/Kontext workflows.
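
The composite step itself is just the standard alpha "over" operation. Here is a minimal standard-library sketch, assuming the cutout is already a grid of (r, g, b, a) pixels and the shadow is a per-pixel opacity map you produced elsewhere. The function names are mine, and in a real pipeline you would do this with Pillow, OpenCV, or NumPy arrays rather than nested lists.

```python
def over(fg_rgb, alpha, bg_rgb):
    """Alpha 'over' for one pixel: alpha in [0, 1], channels in 0..255."""
    return tuple(round(alpha * f + (1.0 - alpha) * b) for f, b in zip(fg_rgb, bg_rgb))


def composite_with_shadow(cutout, shadow, canvas_rgb=(255, 255, 255)):
    """Place an RGBA cutout on a plain canvas with a shadow layer underneath.

    cutout: rows of (r, g, b, a) pixels, a in 0..255 (original product pixels).
    shadow: rows of shadow opacities in [0, 1], same size as cutout.
    Returns rows of (r, g, b) pixels.
    """
    out = []
    for px_row, sh_row in zip(cutout, shadow):
        out_row = []
        for (r, g, b, a), s in zip(px_row, sh_row):
            ground = over((0, 0, 0), s, canvas_rgb)       # darken canvas by shadow opacity
            out_row.append(over((r, g, b), a / 255.0, ground))  # product on top, untouched
        out.append(out_row)
    return out


# Tiny 1x2 example: one opaque product pixel, one transparent pixel over a 25% shadow.
cutout = [[(120, 60, 30, 255), (120, 60, 30, 0)]]
shadow = [[0.0, 0.25]]
result = composite_with_shadow(cutout, shadow)
```

The point of this ordering is that fully opaque product pixels come through unchanged, so only the canvas and shadow are synthetic.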

Useful parts:

  • BRIA RMBG-2.0 for background removal
  • SAM2 for promptable segmentation
  • Segment Anything if you need interactive masks
  • Pillow/OpenCV/Photoshop/ComfyUI nodes for compositing

Best for:

  • white-background product listing
  • exact product shape/color/material
  • low VRAM
  • batch packshots

Main failure modes:

  • halo around edges
  • no contact shadow
  • product looks pasted
  • white product on white background loses shape

W1. SDXL background-only inpaint

SDXL is not the newest or strongest editor, but it is still a useful low-VRAM baseline.

Use it for:

  • background-only inpainting
  • floor/contact shadow experiments
  • quick packshot tests
  • comparison baseline before heavier Flux/Qwen workflows

The important point is to protect the product.

input photo
-> product mask
-> invert mask or protect product region
-> inpaint only background/floor/shadow
-> composite original product back
-> QA product crop

Do not ask SDXL to redraw the product if exact identity matters. It may produce a similar-looking product with changed hardware, stitching, logo, label, color, or proportions.
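
The two mask operations in that pipeline can be sketched like this, assuming a product mask where 1 means "product" and images stored as grids of (r, g, b) tuples. The helper names are hypothetical; ComfyUI and diffusers have their own mask conventions, so check which value means "repaint" in the tool you use.

```python
def invert_mask(product_mask):
    """Product mask (1 = product) -> inpaint mask (1 = area the model may repaint)."""
    return [[1.0 - m for m in row] for row in product_mask]


def paste_original_back(original, generated, product_mask):
    """Keep original pixels where the mask is 1 and generated pixels where it is 0.

    Values between 0 and 1 blend the two, which is useful on a feathered edge.
    """
    out = []
    for o_row, g_row, m_row in zip(original, generated, product_mask):
        out.append([
            tuple(round(m * o + (1.0 - m) * g) for o, g in zip(o_px, g_px))
            for o_px, g_px, m in zip(o_row, g_row, m_row)
        ])
    return out


original = [[(10, 20, 30), (40, 50, 60)]]
generated = [[(200, 200, 200), (250, 250, 250)]]
product_mask = [[1.0, 0.0]]

inpaint_mask = invert_mask(product_mask)
final = paste_original_back(original, generated, product_mask)
```

Even if the inpainting model leaks into the protected region, the paste-back step restores the original product pixels there.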


W2. Flux Fill for background/floor/shadow

FLUX.1 Fill dev is very relevant, but I would frame it as a masked completion component, not a one-click product-photography solution.

Good use:

protect original product
mask background/floor/shadow area
Flux Fill generates only the missing background/floor/shadow
composite original product back if needed
relight / blend / QA

It is promising for:

  • replacing messy phone-shot backgrounds
  • adding studio floors
  • extending canvas/outpainting
  • creating more natural shadows around a protected product

But product-background swap quality depends heavily on:

  • mask precision
  • mask expansion/blur
  • contact shadow
  • relighting
  • final blending
  • whether you composite the original product back
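
To make "mask expansion/blur" concrete, here is a small standard-library sketch of growing a binary mask and then feathering it with a box blur. The function names and the square window are my own simplifications; in practice you would use OpenCV, Pillow filters, or a ComfyUI grow/blur mask node, and how many pixels to grow is something to tune per image.

```python
def _window(mask, y, x, radius):
    """Values of the mask inside a square window around (y, x), clipped at borders."""
    h, w = len(mask), len(mask[0])
    return [
        mask[j][i]
        for j in range(max(0, y - radius), min(h, y + radius + 1))
        for i in range(max(0, x - radius), min(w, x + radius + 1))
    ]


def dilate(mask, radius=1):
    """Grow the masked area by `radius` pixels (square structuring element)."""
    return [
        [max(_window(mask, y, x, radius)) for x in range(len(mask[0]))]
        for y in range(len(mask))
    ]


def feather(mask, radius=1):
    """Soften mask edges with a box blur (window mean)."""
    out = []
    for y in range(len(mask)):
        row = []
        for x in range(len(mask[0])):
            win = _window(mask, y, x, radius)
            row.append(sum(win) / len(win))
        out.append(row)
    return out


mask = [[0] * 5 for _ in range(5)]
mask[2][2] = 1               # single masked pixel in the centre
grown = dilate(mask)         # becomes a 3x3 block
soft = feather(mask)         # centre pixel spreads into its neighbours
```

Whether you grow the inpaint mask or the protected product mask is a judgment call: one hides halos at the cost of a few edge pixels, the other keeps every product pixel but may leave a fringe.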

Low-VRAM users may need GGUF/NF4/offload/custom nodes.


W3. Flux Kontext direct edit

FLUX.1 Kontext dev is probably one of the strongest local candidates for direct “phone shot → studio shot” editing. The model card describes image editing from text instructions, object/style/character reference, and successive edits with minimal visual drift.

Test it like this:

input product photo
-> Flux Kontext
-> prompt: turn this into a professional ecommerce studio product photo
-> output
-> product identity QA

However, for strict e-commerce use, I would not trust the direct output blindly. A direct edit may look excellent while quietly changing:

  • silhouette
  • color
  • leather/fabric texture
  • zipper or buckle shape
  • logo
  • label text
  • handle length
  • stitching
  • product proportions

For serious use, compare two versions:

A. Flux Kontext direct edit
B. Flux Kontext for studio look + original product composited back

A may look better. B is usually safer for SKU identity.


W3b. Flux Kontext composite-back variant

This is the safer Kontext route.

input product photo
-> Flux Kontext creates target studio look/background/lighting
-> use generated output as visual target
-> cut out original product
-> composite original product back
-> contact shadow / relight / color match
-> QA

This is useful when Kontext gives good lighting/background style but changes the product too much.


W4. Finegrain Product Placement

Finegrain Product Placement LoRA is useful for thinking about product placement. It is a Flux Kontext LoRA aimed at product photography with bounding-box control.

The mental model is not “just prompt harder.” It is:

scene image
+ transparent product cutout
+ placement box
-> product blended into scene

The Finegrain Product Placement Space exposes this clearly: upload a scene photo, draw a box where the item should go, and provide a product image with transparent background.

Important caveat: the model card explicitly says products in hands are not supported. So Finegrain is relevant for:

  • product on table
  • product on shelf
  • product on floor
  • product in a room scene
  • product on display

It is not the answer to:

  • person holding the bag
  • hand gripping the handle
  • shoulder-worn bag
  • complex hand/object occlusion

Also check the official blog: Finegrain product placement Flux LoRA experiment.


W5. Qwen-Image-Edit for labels, packaging, logos, printed text

Qwen-Image-Edit is especially relevant when product text matters. I would not necessarily start with it for a plain leather bag, but I would test it for:

  • product boxes
  • bottles
  • packaging
  • labels
  • signs
  • logos
  • printed instructions
  • UI/product mockups
  • localized marketing creatives

Qwen’s strength is text-aware image editing, but:

text-capable is not SKU-safe.

For product work, the question is not merely “is the generated text readable?” The question is “is this still the same label/logo/brand/product?”

Use:

  • OCR before/after
  • manual logo review
  • crop comparison
  • original label-region composite-back if needed
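
For the OCR before/after check, a minimal sketch could look like this. It assumes you already have the two OCR strings from whatever OCR tool you use; the function names and the 0.98 threshold are hypothetical and would need tuning, and a passing score is only a red-flag filter, not approval.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Upper-case and collapse whitespace so layout differences do not count."""
    return " ".join(text.upper().split())


def label_similarity(before: str, after: str) -> float:
    """Similarity ratio in [0, 1] between OCR text before and after the edit."""
    return SequenceMatcher(None, normalize(before), normalize(after)).ratio()


def label_changed(before: str, after: str, threshold: float = 0.98) -> bool:
    """Flag the image for manual review if the label text drifted."""
    return label_similarity(before, after) < threshold


same = label_changed("ACME  Leather Co.", "acme leather co.")     # layout/case only
drifted = label_changed("ACME LEATHER CO.", "ACNE LEATHER CO.")   # one letter changed
```

A single changed letter in a brand name is exactly the kind of failure that looks fine at thumbnail size, so I would send every flagged image to manual logo review.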


W6. Relighting / IC-Light

Relighting deserves its own step.

Many product/background swaps fail because the old phone-shot lighting remains on the product. The background changes, but the product still has the old shadows and highlights, so the image looks pasted together.

Use relighting after:

  • cutout + composite
  • generated background
  • Flux Fill background work
  • manual placement
  • product-background blending

A generic route:

product cutout
+ selected/generated background
-> composite product
-> relight foreground to match background
-> add/refine contact shadow
-> restore original product details if softened


W7. Manual placement + boundary/shadow fill

If product identity matters, a controlled manual workflow can be safer than a powerful all-in-one model.

product cutout
+ target scene/background
-> manually place product
-> mask only boundary/contact/shadow area
-> SDXL or Flux Fill repairs local boundary/shadow
-> relight/color-match
-> QA

This is good for:

  • bag on table
  • shoes on floor
  • bottle on bathroom counter
  • product on shelf
  • small accessory on desk

It is weak for:

  • hand holding product
  • product worn on body
  • heavy occlusion
  • wrong product perspective

W8. Product in hand / person holding product

This is the hardest case.

A person holding a product is not ordinary product placement. It adds:

  • hand/product occlusion
  • fingers wrapping around handles
  • product scale relative to body
  • gravity and strap deformation
  • contact shadows
  • foreground/background ordering
  • hand reconstruction
  • product identity preservation

I would not expect normal product placement to solve this.

A safer local workaround is:

person image with suitable pose
+ original product cutout
-> manually place product near hand
-> mask only fingers / handle / contact / occlusion
-> local inpaint / hand repair / Flux Fill / SDXL inpaint
-> composite original product body back
-> relight / shadow / QA

The key is to regenerate only the tiny contact/occlusion region, not the whole product.
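
One way to keep yourself honest about "tiny" is to measure how much of the product the inpaint mask actually covers before you run the model. This is only a sketch: the function names and the 10% budget are my own assumptions, and both masks are plain grids of 0/1 values.

```python
def regenerated_fraction(product_mask, inpaint_mask):
    """Share of product pixels that the inpaint mask would let the model repaint."""
    product = overlap = 0
    for p_row, i_row in zip(product_mask, inpaint_mask):
        for p, i in zip(p_row, i_row):
            if p:
                product += 1
                if i:
                    overlap += 1
    if product == 0:
        raise ValueError("product mask is empty")
    return overlap / product


def within_budget(product_mask, inpaint_mask, budget=0.10):
    """True if the repaint region stays within the allowed share of the product."""
    return regenerated_fraction(product_mask, inpaint_mask) <= budget


product = [[1] * 5 for _ in range(4)]          # 20 product pixels
contact = [[0] * 5 for _ in range(4)]
contact[0][0] = contact[0][1] = 1              # only the handle area is repainted
too_big = [[1] * 5] + [[0] * 5 for _ in range(3)]  # a whole row of the product

ok = within_budget(product, contact)           # 2 / 20 of the product
bad = within_budget(product, too_big)          # 5 / 20 of the product
```

If the check fails, I would shrink the mask around the fingers and handle rather than raise the budget.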

There are cloud/partner-model templates for product-in-hand UGC-style workflows, but I would treat those as reference/fallback, not the main local/open route.


3. VRAM guide

This is approximate. “Runs on 8GB” or “runs on 12GB” is not enough information. It depends on:

  • model
  • quantization
  • text encoder
  • VAE
  • resolution
  • steps
  • LoRA/distillation
  • CPU/RAM offload
  • ComfyUI version
  • node implementation
  • system RAM
  • generation time
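
To see why quantization dominates the list above, here is the back-of-envelope arithmetic for weights only: parameter count times bits per weight, divided by 8. The 12-billion-parameter figure is just an illustrative number, so check the model card for the model you actually use. The estimate ignores the text encoder, VAE, activations, and quantization overhead, so real usage will be higher.

```python
def weight_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Illustrative 12B-parameter transformer at three precisions.
estimates = {bits: round(weight_gib(12, bits), 2) for bits in (16, 8, 4)}
# 16-bit: about 22.35 GiB, 8-bit: about 11.18 GiB, 4-bit: about 5.59 GiB
```

This is why a model that does not fit at 16-bit on a 12GB card can become testable at 4-bit, and also why offload is often still needed once the other components are loaded.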

8GB VRAM

Start with:

  • cutout + composite + shadow
  • SDXL background-only inpaint
  • small resolution tests
  • VAE tiling/slicing
  • aggressive offload if needed

Treat Flux/Qwen as experimental. Some community workflows may run, but speed and stability can be poor.

12GB VRAM

More realistic:

  • SDXL composite workflows
  • Flux GGUF experiments
  • Flux Kontext GGUF tests
  • Flux Fill with quant/offload
  • careful text encoder choice

Still log runtime. A 12GB report can mean under a minute or many minutes depending on quantization and workflow.

16GB VRAM

A good experimentation tier:

  • Flux GGUF/FP8 becomes more serious
  • Qwen 4-bit/GGUF/NF4 becomes testable
  • Finegrain placement may be possible
  • relighting/composite workflows are practical

24GB VRAM

A practical local comparison tier:

  • Flux Kontext
  • Flux Fill
  • Qwen-Image-Edit quantized or optimized
  • SDXL + ControlNet/IP-Adapter workflows
  • more comfortable high-res tests

32GB+

At this point, focus less on “can it run?” and more on:

  • product identity
  • failure rate
  • batch reliability
  • legal/license terms
  • QA automation
  • repeatability
  • throughput


4. Suggested order of testing

I would test in this order:

1. W0 cutout + composite + shadow
2. W1 SDXL background-only inpaint
3. W2 Flux Fill background/floor/shadow
4. W6 relighting / IC-Light
5. W3 Flux Kontext direct edit
6. W3b Flux Kontext composite-back
7. W4 Finegrain product placement
8. W5 Qwen-Image-Edit for labels/text
9. W8 product-in-hand local workaround

Reason:

  • start with the least destructive workflow
  • establish a product-identity baseline
  • add generation only where it helps
  • reserve direct full-image editing for cases where the safer route is not enough

5. QA checklist

Do not evaluate only the full image. The background is supposed to change. The product is not.

Compare:

  • original product crop
  • generated product crop
  • original mask
  • generated/product mask
  • label crop
  • logo crop
  • hardware crop
  • full image

Check product identity:

  • silhouette / proportions
  • color
  • material texture
  • leather grain / fabric weave
  • hardware
  • zipper / buckle / strap / handle
  • stitching
  • label/logo
  • small text
  • barcode if relevant
  • product scale

Check scene realism:

  • contact shadow
  • light direction
  • floor contact
  • perspective
  • reflection
  • background consistency
  • old lighting still on product
  • pasted/cutout look

For text-heavy products:

  • run OCR before/after
  • inspect manually
  • preserve original label region if needed

Automated metrics are useful as red flags, not final approval. Human review is still necessary for SKU identity.
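
Two cheap red-flag metrics I would start with are mask IoU (for silhouette/proportions) and mean absolute pixel difference inside the product mask (for color and texture drift). This is a standard-library sketch with my own function names; it assumes the two images are already aligned and the same size, which you would have to ensure first.

```python
def mask_iou(mask_a, mask_b):
    """Intersection over union of two binary masks of the same size."""
    inter = union = 0
    for row_a, row_b in zip(mask_a, mask_b):
        for a, b in zip(row_a, row_b):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 1.0


def masked_mean_abs_diff(img_a, img_b, mask):
    """Mean absolute channel difference over pixels where the mask is set."""
    total = count = 0
    for row_a, row_b, row_m in zip(img_a, img_b, mask):
        for px_a, px_b, m in zip(row_a, row_b, row_m):
            if m:
                total += sum(abs(x - y) for x, y in zip(px_a, px_b))
                count += len(px_a)
    return total / count if count else 0.0


orig_mask = [[1, 1, 0], [0, 0, 0]]
gen_mask = [[1, 0, 0], [0, 0, 0]]
silhouette = mask_iou(orig_mask, gen_mask)

orig_img = [[(10, 10, 10), (0, 0, 0)]]
gen_img = [[(13, 10, 7), (255, 255, 255)]]
drift = masked_mean_abs_diff(orig_img, gen_img, [[1, 0]])  # background pixel is ignored
```

Note that the background pixel changes completely and still does not affect the drift score, which matches the idea that the background is supposed to change and the product is not.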


6. Research framing

This problem is close to e-commerce item insertion / virtual try-all research.

Diffuse to Choose is especially relevant because it frames the task as inserting an e-commerce item into a target scene while preserving fine-grained reference-item details and producing plausible blending, lighting, and shadows.

The practical local/ComfyUI route is basically an approximation of this harder research problem:

reference product
+ product mask/cutout
+ target scene/background
+ local edit/fill/blend
+ relighting
+ product-consistency QA

7. Commercial APIs

I would treat commercial product-shot APIs as reference/fallback, not the main answer.

They can be useful for:

  • benchmarking quality
  • fast production
  • product-shot-specific pipelines
  • cases where local VRAM is too limited
  • product-in-hand or UGC-style templates

But check:

  • cost
  • privacy
  • uploaded product/customer images
  • licensing
  • output usage rights
  • data retention
  • brand safety
  • repeatability


8. Compact decision tree

Need exact white-background packshot?
-> Use cutout + composite + shadow first.
-> Avoid regenerating the product.

Need background replacement?
-> Segment product.
-> Inpaint/fill only background/floor/shadow.
-> Composite original product back if identity matters.
-> Relight.

Need one-shot phone-shot-to-studio conversion?
-> Try Flux Kontext.
-> Also make a composite-back version.
-> Compare product crop.

Need product in a lifestyle scene?
-> If placed on a surface: try manual placement, Finegrain, or fill boundary/shadow.
-> If held by a person: treat as hand-object interaction.

Product has important text/logo/label?
-> Test Qwen-Image-Edit.
-> OCR + manual review.
-> Composite original label/logo region if needed.

Low VRAM?
-> 8GB: cutout/composite + SDXL baseline.
-> 12GB: Flux GGUF experiments.
-> 16GB: Flux/Qwen quantized experiments.
-> 24GB+: serious comparison.
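
If you want this tree in a script, for example to route a batch of SKUs, a rough sketch could look like the following. The goal names and the function are hypothetical; the returned labels are just the W-numbers used earlier in this post.

```python
def suggest_workflows(goal, held_by_person=False, text_matters=False):
    """Return workflow labels (W0..W8 from this post) to try, in order."""
    routes = {
        "packshot": ["W0"],                 # cutout + composite + shadow
        "background": ["W1", "W2", "W6"],   # background-only inpaint/fill, then relight
        "one_shot": ["W3", "W3b"],          # Kontext direct edit vs composite-back
        "lifestyle": ["W8"] if held_by_person else ["W7", "W4"],
    }
    if goal not in routes:
        raise ValueError(f"unknown goal: {goal!r}")
    picks = list(routes[goal])
    if text_matters:
        picks.append("W5")                  # Qwen-Image-Edit plus OCR/manual review
    return picks


bag_on_table = suggest_workflows("lifestyle")
bag_in_hand = suggest_workflows("lifestyle", held_by_person=True)
labelled_bottle = suggest_workflows("packshot", text_matters=True)
```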

9. My practical recommendation

I would start with this baseline:

1. Segment/remove background.
2. Save original product cutout and mask.
3. Create or generate a clean studio background.
4. Composite original product onto it.
5. Add/inpaint contact shadow only.
6. Relight/color-match if needed.
7. Compare product crop, label crop, and full image.

Then compare against:

  • Flux Kontext direct edit
  • Flux Fill masked background/floor/shadow
  • Finegrain placement for surface placement
  • Qwen-Image-Edit for label/text-heavy products
  • commercial APIs only as reference/fallback

For a simple packshot, the safest result may come from boring compositing rather than the strongest model. For lifestyle placement, the best route is usually product cutout + target scene + local fill/blend + relighting. For a person holding the product, expect the task to be much harder and use local hand/contact inpainting rather than ordinary product placement.

Also check model cards, repo licenses, API terms, brand policy, and privacy constraints before using outputs commercially.