Scaling Computer Vision to 1M Images: Architecture Deep-Dive of Remove-BG.io

From 15-second processing times to sub-2-second latency at scale: the complete technical journey of building a computer vision service that processed 1M images for $40/month using edge-first architecture, aggressive caching, and PyTorch optimization.

When I launched Remove-BG.io, the first user upload took 15 seconds to process. Unacceptable. Six months later, after moving to a multi-region edge deployment with layered caching, the service returns results in about 2 seconds on average and costs roughly $40/month to run, with more than 1M images processed.

This is the technical deep-dive: the architecture decisions, optimization strategies, and hard-won lessons from scaling a computer vision service from prototype to production.

The Problem Space: Background Removal at Scale

Background removal is computationally expensive:

Per-image requirements:

  • Input: Variable resolution images (100KB - 10MB)
  • Processing: Semantic segmentation via deep learning (U2-Net model)
  • Model size: 176MB PyTorch checkpoint
  • Inference time: 800ms - 3s depending on resolution
  • Memory: 2GB+ for large images
  • Output: Transparent PNG with alpha channel

Scale requirements:

  • Target latency: <2s end-to-end (including network)
  • Target cost: <$0.05 per image
  • Uptime: 99.5%+
  • Geographic distribution: Global users
  • Cache efficiency: Maximize to reduce compute

The challenge: How do you make this fast and cheap at scale?

Architecture Evolution: Three Iterations

V1: Naive Approach (Failed)

┌──────┐     ┌────────────┐     ┌──────────┐
│ User │────→│ Next.js    │────→│ FastAPI  │
│      │     │ (Vercel)   │     │ (EC2)    │
└──────┘     └────────────┘     └──────────┘
                                      │
                                      ▼
                                ┌──────────┐
                                │ PyTorch  │
                                │ U2-Net   │
                                └──────────┘
                                      │
                                      ▼
                                ┌──────────┐
                                │ AWS S3   │
                                └──────────┘

Problems:

  • 15s cold start (model loading)
  • $0.40/image in compute costs
  • Single region (high latency for global users)
  • No caching (redundant processing)
  • OOM crashes on 4K images

Verdict: Completely unsustainable

V2: Edge-First with Caching

                    ┌─────────────────┐
                    │ Cloudflare CDN  │
                    │ (Global Edge)   │
                    └────────┬────────┘
                             │
                    Cache Hit? (73%)
                             │
                    ┌────────▼────────┐
                    │ Cache Miss      │
                    └────────┬────────┘
                             │
         ┌───────────────────┼───────────────────┐
         │                   │                   │
    ┌────▼────┐         ┌────▼────┐        ┌────▼────┐
    │ Fly.io  │         │ Fly.io  │        │ Fly.io  │
    │ US-WEST │         │ EU-WEST │        │ AP-EAST │
    └────┬────┘         └────┬────┘        └────┬────┘
         │                   │                   │
         └───────────────────┼───────────────────┘
                             │
                      ┌──────▼──────┐
                      │ Redis Cache │
                      │ (5min TTL)  │
                      └──────┬──────┘
                             │
                      ┌──────▼──────┐
                      │ U2-Net      │
                      │ Inference   │
                      └──────┬──────┘
                             │
                      ┌──────▼──────┐
                      │ Cloudflare  │
                      │ R2 Storage  │
                      └─────────────┘

Improvements:

  • Edge deployment: Multi-region cut latency by 60%
  • CDN caching: 73% cache hit rate, and cache hits need zero compute
  • Redis layer: Deduplication prevents redundant processing
  • R2 storage: 10x cheaper egress vs S3

Results:

  • Latency: 15s → 2.1s average
  • Cost: $0.40 → $0.04 per image
  • Uptime: 94% → 99.7%

This architecture shipped to production.

V3: Current (Optimized Production)

Added layers of sophistication:

  1. Content-addressable caching: Hash-based deduplication
  2. Progressive image resizing: Handle 8K images without OOM
  3. Request coalescing: Batch identical concurrent requests
  4. Model quantization: INT8 reduces model size 4x
  5. Warm container pools: Eliminate cold starts

Deep-Dive: Critical Optimizations

1. The Profiling Revelation

My intuition about bottlenecks was completely wrong.

What I thought was slow:

  • U2-Net inference (the deep learning model)

What was actually slow:

  • Image decoding/encoding (60% of total time)
  • Tensor transformations (25% of total time)
  • Model inference (15% of total time)

Profiling code:

import cProfile
import pstats
from io import StringIO

def profile_inference(image_bytes):
    profiler = cProfile.Profile()
    profiler.enable()

    result = remove_background_full_pipeline(image_bytes)

    profiler.disable()

    s = StringIO()
    ps = pstats.Stats(profiler, stream=s)
    ps.strip_dirs()
    ps.sort_stats('cumulative')
    ps.print_stats(20)

    print(s.getvalue())
    return result

# Results (cumulative time for 2000x2000 image):
# 1. PIL.Image.open()          -> 1.2s  (40%)
# 2. PIL.Image.save()          -> 0.6s  (20%)
# 3. transforms.ToTensor()     -> 0.5s  (17%)
# 4. transforms.Normalize()    -> 0.25s (8%)
# 5. model(input)              -> 0.45s (15%)
#                      TOTAL:    3.0s

The fix: Replace PIL with OpenCV

# BEFORE (PIL): 1.8s for decode + encode
from PIL import Image
import io

image = Image.open(io.BytesIO(image_bytes))
# ... processing ...
output_bytes = io.BytesIO()
image.save(output_bytes, format='PNG')

# AFTER (OpenCV): 0.5s for decode + encode (3.6x faster)
import cv2
import numpy as np

# Decode
nparr = np.frombuffer(image_bytes, np.uint8)
image = cv2.imdecode(nparr, cv2.IMREAD_COLOR)

# ... processing ...

# Encode
_, buffer = cv2.imencode('.png', result_image)
output_bytes = buffer.tobytes()

Impact: Total processing time dropped from 3.0s → 1.4s (53% reduction)

Lesson: Always profile. Intuition lies.
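
cProfile is great for a one-off investigation, but it is too heavy to leave on in production. A lighter option is to time each pipeline stage explicitly. The sketch below is an illustration, not the code running in the service; `timed`, `stage_totals`, and `report` are made-up names, and the `time.sleep` calls stand in for the real decode and inference steps.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Accumulated wall-clock seconds per pipeline stage.
stage_totals: dict[str, float] = defaultdict(float)

@contextmanager
def timed(stage: str):
    """Add the elapsed time of the wrapped block to `stage_totals[stage]`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_totals[stage] += time.perf_counter() - start

def report() -> list[tuple[str, float]]:
    """Return (stage, share of total time) pairs, slowest stage first."""
    total = sum(stage_totals.values()) or 1.0
    return sorted(
        ((name, secs / total) for name, secs in stage_totals.items()),
        key=lambda item: item[1],
        reverse=True,
    )

# Stand-in stages; a real pipeline would wrap decode, transforms, model, encode.
with timed("decode"):
    time.sleep(0.05)
with timed("inference"):
    time.sleep(0.01)

for name, share in report():
    print(f"{name:<10} {share:.0%}")
```

Shipping these shares to the metrics backend makes a regression in any single stage visible without rerunning a profiler.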

2. Edge-First Architecture: The Game Changer

The single most important architectural decision.

Request flow with caching:

import base64
from hashlib import sha256

import boto3
import redis

redis_client = redis.Redis(host='localhost', decode_responses=True)
r2_client = boto3.client('s3', endpoint_url='https://r2.cloudflarestorage.com')

async def process_image_request(image_bytes: bytes) -> bytes:
    """
    Multi-layer caching strategy:
    0. CDN (Cloudflare) - serves 73% of requests before they reach this code
    1. Redis - in-memory cache for recently processed images
    2. R2 - persistent storage for all processed images
    3. Compute - only if none of the above have the result
    """

    # Content-addressable storage: hash the input
    image_hash = sha256(image_bytes).hexdigest()
    cache_key = f"processed:{image_hash}"

    # Layer 1: Check Redis (hot cache)
    if cached := redis_client.get(cache_key):
        log_metric("cache_hit", source="redis")
        return base64.b64decode(cached)

    # Layer 2: Check R2 (warm cache)
    try:
        response = r2_client.get_object(
            Bucket='remove-bg-results',
            Key=f"{image_hash}.png"
        )
        result_bytes = response['Body'].read()

        # Backfill Redis for future requests
        redis_client.setex(
            cache_key,
            300,  # 5min TTL
            base64.b64encode(result_bytes)
        )

        log_metric("cache_hit", source="r2")
        return result_bytes

    except r2_client.exceptions.NoSuchKey:
        pass  # Cache miss, need to process

    # Layer 3: Compute (cache miss)
    log_metric("cache_miss")
    result_bytes = await run_inference(image_bytes)

    # Store in all cache layers
    redis_client.setex(
        cache_key,
        300,
        base64.b64encode(result_bytes)
    )

    r2_client.put_object(
        Bucket='remove-bg-results',
        Key=f"{image_hash}.png",
        Body=result_bytes,
        ContentType='image/png',
        CacheControl='public, max-age=31536000'  # 1 year
    )

    return result_bytes

Cache efficiency metrics (after 6 months):

Layer              Hit Rate   Latency   Cost/Request
CDN (Cloudflare)   73%        50ms      $0.000
Redis              18%        150ms     $0.001
R2                 6%         400ms     $0.005
Compute            3%         1.8s      $0.040

Effective cost per request: (0.73 × $0) + (0.18 × $0.001) + (0.06 × $0.005) + (0.03 × $0.040) = $0.00018 + $0.0003 + $0.0012 ≈ $0.0017
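
The weighted sum is easy to check in a few lines of Python. The shares and per-request costs are copied from the table above; nothing else is assumed.

```python
# (layer, share of requests, cost per request in USD), from the table above
layers = [
    ("CDN (Cloudflare)", 0.73, 0.000),
    ("Redis", 0.18, 0.001),
    ("R2", 0.06, 0.005),
    ("Compute", 0.03, 0.040),
]

# The four shares should account for every request.
assert abs(sum(share for _, share, _ in layers) - 1.0) < 1e-9

effective_cost = sum(share * cost for _, share, cost in layers)
print(f"Effective cost per request: ${effective_cost:.5f}")  # $0.00168

for name, share, cost in layers:
    print(f"{name:<18} contributes ${share * cost:.5f}")
```

The breakdown shows where the money goes: the 3% of requests that reach compute account for about 71% of the blended cost, so every extra point of cache hit rate pays off directly.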

3. Model Optimization: Quantization & ONNX

U2-Net baseline: 176MB FP32 PyTorch checkpoint

Optimization journey:

# Stage 1: PyTorch FP32 (baseline)
# Model size: 176MB
# Inference time: 450ms
# Accuracy: 94.2% IoU

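import torch
from model import U2NET  # assumption: model class from the U2-Net reference repo

# Assumption: u2net.pth holds a state_dict, so the network is built first.
# If the file is a fully pickled model, torch.load() alone returns it.
model = U2NET(3, 1)
model.load_state_dict(torch.load("u2net.pth", map_location="cpu"))
model.eval()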

with torch.no_grad():
    output = model(input_tensor)

# Stage 2: PyTorch FP16 (half precision)
# Model size: 88MB
# Inference time: 380ms (-15%)
# Accuracy: 94.1% IoU (negligible loss)

model = model.half()
input_tensor = input_tensor.half()

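# Stage 3: Dynamic Quantization to INT8
# Model size: 44MB
# Inference time: 320ms (-29% vs baseline)
# Accuracy: 93.7% IoU (acceptable tradeoff)
# Caveat: PyTorch dynamic quantization converts Linear and recurrent
# layers only. Listing Conv2d here does not quantize the convolutions,
# so a convolutional model like U2-Net needs static quantization (or
# ONNX Runtime's quantization tools) to get the size reduction.

model_int8 = torch.quantization.quantize_dynamic(
    model.float(),  # start from FP32, not the FP16 copy from Stage 2
    {torch.nn.Linear, torch.nn.Conv2d},
    dtype=torch.qint8
)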

# Stage 4: ONNX Runtime (best option)
# Model size: 44MB (INT8 exported)
# Inference time: 280ms (-38% vs baseline)
# Accuracy: 93.7% IoU

import onnxruntime as ort

# Export PyTorch to ONNX
dummy_input = torch.randn(1, 3, 320, 320)  # example input used to trace the graph
torch.onnx.export(
    model_int8,
    dummy_input,
    "u2net_int8.onnx",
    opset_version=14,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={
        'input': {0: 'batch', 2: 'height', 3: 'width'},
        'output': {0: 'batch', 2: 'height', 3: 'width'}
    }
)

# Load with ONNX Runtime (CPU optimization)
session = ort.InferenceSession(
    "u2net_int8.onnx",
    providers=['CPUExecutionProvider']
)

# Inference
output = session.run(
    ['output'],
    {'input': input_array}
)[0]

Final choice: ONNX Runtime INT8

Why ONNX over PyTorch?

  • 38% faster inference
  • 75% smaller model size
  • Better CPU optimization (important for cost - no GPU needed)
  • Native quantization support
  • Simpler deployment (no PyTorch runtime dependency)
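
Dropping the PyTorch runtime also means the ToTensor and Normalize steps from the profile have to happen without torchvision. Here is a minimal NumPy sketch of that conversion. The mean and standard deviation are the common ImageNet values, which is an assumption on my part for this example, and `to_model_input` is a made-up helper name.

```python
import numpy as np

# Assumed normalization constants (common ImageNet values, RGB order).
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def to_model_input(image_bgr: np.ndarray) -> np.ndarray:
    """Convert an HWC uint8 BGR image into an NCHW float32 batch of one."""
    rgb = image_bgr[:, :, ::-1]                 # OpenCV decodes as BGR
    scaled = rgb.astype(np.float32) / 255.0     # ToTensor equivalent: [0, 1]
    normalized = (scaled - MEAN) / STD          # Normalize equivalent
    chw = np.transpose(normalized, (2, 0, 1))   # HWC -> CHW
    return np.ascontiguousarray(chw[np.newaxis])  # add the batch dimension

image = np.zeros((320, 320, 3), dtype=np.uint8)  # stand-in for a resized frame
batch = to_model_input(image)
print(batch.shape, batch.dtype)  # (1, 3, 320, 320) float32
```

The result can be passed straight to `session.run(['output'], {'input': batch})`, since ONNX Runtime takes NumPy arrays directly.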

4. Handling Large Images: Progressive Resizing

4K and 8K images crashed the server with OOM errors.

The problem:

  • 4K image (3840x2160): 24MB in memory (RGB)
  • U2-Net input: Must resize to 320x320
  • Intermediate tensors: 3x memory during transformation
  • Peak memory: 100MB+ per request
  • Container limit: 512MB → OOM kill
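
The numbers above follow from simple buffer arithmetic. The sketch below assumes 3 channels and decimal megabytes. Whether a given pipeline ever materializes a full-resolution float32 copy depends on where it resizes, so treat the float32 column as the worst case rather than a measurement.

```python
def frame_bytes(width: int, height: int, channels: int = 3, bytes_per_value: int = 1) -> int:
    """Size of one uncompressed image buffer in bytes."""
    return width * height * channels * bytes_per_value

MB = 1_000_000

for label, (w, h) in {"4K": (3840, 2160), "8K": (7680, 4320)}.items():
    as_uint8 = frame_bytes(w, h) / MB                       # decoded pixels
    as_float32 = frame_bytes(w, h, bytes_per_value=4) / MB  # after float conversion
    print(f"{label}: uint8 {as_uint8:.1f} MB, float32 {as_float32:.1f} MB")
# 4K: uint8 24.9 MB, float32 99.5 MB
# 8K: uint8 99.5 MB, float32 398.1 MB
```

A single float32 copy of an 8K frame already takes most of a 512MB container, before counting the model or any concurrent request.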

Solution: Intelligent preprocessing

def preprocess_image_adaptive(image: np.ndarray) -> tuple[np.ndarray, dict]:
    """
    Adaptively resize large images to prevent OOM
    while preserving output quality
    """

    original_height, original_width = image.shape[:2]

    # Define size thresholds
    MAX_DIMENSION = 2048  # Process at most 2K resolution
    TARGET_DIMENSION = 320  # Model input size

    metadata = {
        "original_shape": (original_height, original_width),
        "was_downscaled": False
    }

    # Check if image exceeds limits
    if max(original_height, original_width) > MAX_DIMENSION:
        # Downscale to MAX_DIMENSION
        scale = MAX_DIMENSION / max(original_height, original_width)
        new_width = int(original_width * scale)
        new_height = int(original_height * scale)

        image = cv2.resize(
            image,
            (new_width, new_height),
            interpolation=cv2.INTER_AREA  # Best for downscaling
        )

        metadata["was_downscaled"] = True
        metadata["processing_shape"] = (new_height, new_width)

    return image, metadata

def postprocess_mask_adaptive(
    mask: np.ndarray,
    metadata: dict
) -> np.ndarray:
    """
    Upscale mask back to original resolution if needed
    """

    if metadata["was_downscaled"]:
        original_height, original_width = metadata["original_shape"]

        # Upscale mask to original size
        mask = cv2.resize(
            mask,
            (original_width, original_height),
            interpolation=cv2.INTER_CUBIC  # Smooth mask edges
        )

    return mask

# Usage in inference pipeline
def remove_background_safe(image_bytes: bytes) -> bytes:
    """Production-ready inference with OOM protection"""

    # Decode
    image = cv2.imdecode(
        np.frombuffer(image_bytes, np.uint8),
        cv2.IMREAD_COLOR
    )

    # Adaptive preprocessing
    processed_image, metadata = preprocess_image_adaptive(image)

    # Model inference (on potentially downscaled image)
    mask = run_model_inference(processed_image)

    # Adaptive postprocessing (upscale if needed)
    final_mask = postprocess_mask_adaptive(mask, metadata)

    # Apply mask to original image
    result = cv2.bitwise_and(image, image, mask=final_mask)

    # Encode to PNG with alpha channel
    b, g, r = cv2.split(result)
    rgba = cv2.merge([b, g, r, final_mask])
    _, buffer = cv2.imencode('.png', rgba)

    return buffer.tobytes()

Impact:

  • Before: 8K images → OOM crash (100% failure rate)
  • After: 8K images → 2.4s processing (0% failure rate)
  • Quality loss: Minimal (mask is upscaled with interpolation)
  • Memory usage: 512MB → 180MB peak (64% reduction)

5. Cold Start Elimination: Model Warm Pools

The problem: Fly.io scales to zero during low traffic. First request after downtime loads the 44MB model from disk → 15-second cold start.

Solution 1: Health check cron (basic)

# Ping every 5 minutes to keep container warm
*/5 * * * * curl https://api.remove-bg.io/health

Downside: Wasteful during zero-traffic hours (midnight-6am)

Solution 2: Smart warm pools (production)

from functools import lru_cache
import asyncio
import time

import numpy as np
import onnxruntime as ort

# Global model instance (lazy-loaded, cached)
@lru_cache(maxsize=1)
def get_model() -> ort.InferenceSession:
    """Load model once per worker, cache indefinitely"""

    print("Loading model from disk...")
    start = time.time()

    session = ort.InferenceSession(
        "u2net_int8.onnx",
        providers=['CPUExecutionProvider']
    )

    # Warm up with a dummy input so first-run allocations happen at startup
    dummy_input = np.random.randn(1, 3, 320, 320).astype(np.float32)
    _ = session.run(['output'], {'input': dummy_input})

    elapsed = time.time() - start
    print(f"Model loaded and warmed in {elapsed:.2f}s")

    return session

# FastAPI startup event
@app.on_event("startup")
async def startup_event():
    """Pre-load model before handling requests"""

    # Load in a worker thread so the event loop stays responsive;
    # startup still waits here until the model is ready
    loop = asyncio.get_running_loop()
    await loop.run_in_executor(None, get_model)

    print("Server ready - model pre-loaded")

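# Fly.io configuration (fly.toml)
# Key names differ between Fly.io platform versions; on Machines-based
# apps the minimum is set with min_machines_running in the service
# section rather than under [deploy].
# [deploy]
#   min_instances = 1  # Always keep 1 instance warm
#   max_instances = 10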
#
# [[services]]
#   internal_port = 8000
#   protocol = "tcp"
#
#   [[services.http_checks]]
#     interval = 60000  # Health check every 60s
#     timeout = 5000
#     path = "/health"

Results:

  • Before: 15s cold start, 30% of requests
  • After: <200ms first request (model pre-loaded), 0% cold starts
  • Cost: $8/month for min 1 instance (acceptable)

Production API: FastAPI Implementation

Here's the production-ready endpoint with all optimizations:

from fastapi import FastAPI, File, UploadFile, HTTPException
from fastapi.responses import Response
import asyncio
from typing import Optional
import structlog

app = FastAPI(title="Remove-BG API")
logger = structlog.get_logger()

# Semaphore for concurrency control (prevent OOM from too many concurrent requests)
MAX_CONCURRENT = 4  # Based on 512MB container
inference_semaphore = asyncio.Semaphore(MAX_CONCURRENT)

@app.post("/api/remove-background")
async def remove_background_endpoint(
    file: UploadFile = File(...),
    quality: Optional[str] = "balanced"  # low, balanced, high
):
    """
    Remove background from uploaded image

    Returns: PNG image with transparent background
    """

    # Validate file type (content_type can be missing on some uploads)
    if not (file.content_type or "").startswith("image/"):
        raise HTTPException(400, "File must be an image")

    # Read image bytes
    image_bytes = await file.read()

    # Validate size (10MB limit)
    if len(image_bytes) > 10 * 1024 * 1024:
        raise HTTPException(413, "Image too large (max 10MB)")

    # Log request
    logger.info(
        "processing_request",
        filename=file.filename,
        size_bytes=len(image_bytes),
        quality=quality
    )

    try:
        # Acquire semaphore (wait if too many concurrent requests)
        async with inference_semaphore:
            # Process with caching. process_image_request() as defined
            # earlier takes only the image bytes, so quality is not passed.
            result_bytes = await process_image_request(image_bytes)

        # Return PNG with proper headers
        return Response(
            content=result_bytes,
            media_type="image/png",
            headers={
                "Cache-Control": "public, max-age=31536000",  # 1 year
                "Content-Disposition": f"inline; filename=removed_bg_{file.filename}"
            }
        )

    except Exception as e:
        logger.error(
            "processing_failed",
            error=str(e),
            filename=file.filename
        )
        raise HTTPException(500, "Processing failed")

@app.get("/health")
async def health_check():
    """Health check endpoint for Fly.io"""

    # Verify model is loaded
    try:
        get_model()
    except Exception as e:
        # FastAPI does not treat a (body, status) tuple as a status code,
        # so signal failure with an HTTPException instead
        raise HTTPException(status_code=503, detail=f"unhealthy: {e}")

    return {"status": "healthy", "model_loaded": True}

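# Metrics endpoint (for monitoring)
@app.get("/metrics")
async def metrics():
    """JSON metrics (convert to the Prometheus text format before scraping)"""

    # INFO stats exposes Redis' own counters such as keyspace_hits and
    # keyspace_misses; it has no application-level fields.
    redis_stats = redis_client.info("stats")
    hits = redis_stats.get("keyspace_hits", 0)
    misses = redis_stats.get("keyspace_misses", 0)
    lookups = hits + misses

    return {
        # Assumption: the app maintains these two keys itself (hypothetical names)
        "images_processed_total": int(redis_client.get("metrics:processed_count") or 0),
        "avg_processing_time_ms": float(redis_client.get("metrics:avg_time_ms") or 0),
        "cache_hit_rate": hits / lookups if lookups else 0.0,
        # _value is a private attribute; good enough for a rough gauge
        "active_requests": MAX_CONCURRENT - inference_semaphore._value
    }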

The Business Side: Distribution & Growth

Technical excellence means nothing without users.

What Worked: SEO & Organic Growth

Month 1: 0 users
Month 6: 47,000 users

Traffic breakdown:

  1. Organic search (60%): "free background remover", "remove bg online"
    • Rank #8 on Google for primary keyword
    • 700+ monthly search volume
  2. ProductHunt (25%): Launch day spike
    • 400+ upvotes
    • Featured in "AI Tools" category
    • 10k visitors in 24 hours
  3. Reddit (15%): Genuine community sharing
    • r/Entrepreneur (viral post: 800 upvotes)
    • r/SideProject
    • r/Photoshop

SEO strategy that worked:

<!-- Meta tags optimized for conversions -->
<title>Free Background Remover - Remove Image Background Online | Remove-BG</title>
<meta name="description" content="Remove image backgrounds in 2 seconds for free. AI-powered background removal tool with no watermarks. Process unlimited images online.">

<!-- Schema.org structured data -->
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "SoftwareApplication",
  "name": "Remove-BG",
  "applicationCategory": "MultimediaApplication",
  "offers": {
    "@type": "Offer",
    "price": "0",
    "priceCurrency": "USD"
  },
  "aggregateRating": {
    "@type": "AggregateRating",
    "ratingValue": "4.7",
    "ratingCount": "1240"
  }
}
</script>

Why "free" was critical:

  • Competitors: $9.99/month
  • My differentiation: Free is the feature
  • User psychology: No signup, no paywall, no watermark
  • Viral coefficient: Users share because it's genuinely free

Monetization: $300/mo Without Premium Tiers

Decided against freemium model.

Revenue sources:

  1. Affiliate links: Design tool recommendations ($180/mo)
  2. Sponsored listings: "Related Tools" section ($80/mo)
  3. Donations: Ko-fi + PayPal ($40/mo)

Total: ~$300/month
Costs: ~$40/month
Profit margin: 87%

Key insight: Don't commoditize your differentiation. If free is your moat, keep it free.

Metrics That Matter: 6 Month Report

Scale:

  • 1,043,284 images processed
  • 47,293 unique users
  • 14 countries (top: US, India, Brazil)

Performance:

  • Average latency: 2.1s (p50)
  • p95 latency: 4.8s
  • p99 latency: 12.3s
  • Error rate: 0.3%

Efficiency:

  • Cache hit rate: 73% (CDN)
  • Origin traffic: 27% of requests miss the CDN; only 3% need fresh compute
  • Cost per image: $0.0017 (effective)

Infrastructure costs:

  • Fly.io (compute): $25/mo
  • Cloudflare R2 (storage): $8/mo
  • Redis (Upstash): $5/mo
  • Domain + monitoring: $2/mo
  • Total: $40/month

Cost per 1M images: $38

The Hard Lessons

1. Profile Before Optimizing

Spent 2 weeks optimizing the model. Profiling revealed the bottleneck was image I/O (not the model).

Time wasted: 2 weeks
Actual fix: 30 minutes (switch PIL → OpenCV)

2. Caching Is Infrastructure

The edge-first architecture with CDN caching reduced costs by 89% while improving latency.

Insight: Cache layers are not optional—they're the architecture.

3. Launch Fast, Iterate Faster

Spent 3 months building "perfect" MVP. Should've launched in 2 weeks.

Lost opportunity: 3 months of user feedback

4. Distribution > Product

Spent 80% effort on product, 20% on distribution.

Should've been: 50/50

Reality: SEO, ProductHunt, and Reddit drove 100% of growth. Features drove 0%.

What's Next: Batch Processing & API Access

Based on user requests:

Q1 2025:

  1. Batch API: Process 100+ images via API
  2. Webhooks: Async processing with callbacks
  3. S3 integration: Direct S3 bucket processing

Stack additions:

  • Celery for job queues
  • PostgreSQL for job tracking
  • Stripe for API tier pricing

Try It Yourself: Open Source Components

Remove-BG.io: https://remove-bg.io

Want to build a similar service?

  1. Pick a $10/month SaaS tool
  2. Build the free version with smart caching
  3. Deploy edge-first (Fly.io + Cloudflare)
  4. Optimize for distribution (SEO, ProductHunt)
  5. Ship in 2 weeks, not 3 months

Questions on scaling computer vision services? Follow my technical journey on X where I share architecture deep-dives and performance optimization techniques.

Author

Yashwant Bezawada

Software Engineer Advisor at FedEx specializing in Generative AI, Machine Learning, and Full-Stack Development. Building AI-powered solutions that transform industries and sharing insights from the journey.