LLM_log #023: Visual Complexity Scoring — CLIP vs Gemini on SAVOIAS Advertisements

LLM_log #023: Visual Complexity Scoring — CLIP vs Gemini on SAVOIAS Advertisements

Highlights:

In this post we try to answer one practical question: can we score the visual complexity of an image without humans in the loop? We grab the SAVOIAS dataset (1,420 images with human-rated complexity), focus on the 200 advertisements, and try five different recipes — zero-shot CLIP prompts, Ridge regression on CLIP embeddings, pairwise ranking on embedding differences, kNN retrieval on cosine similarity, and finally Gemini 2.5 Pro as a chain-of-thought judge. The headline: every CLIP-based method collapses on truly held-out data once we strip the leakage, and only Gemini delivers a usable score. Every number below is reproducible from a single Colab cell and ten cheap API calls. Let’s dig in.

Tutorial Overview:

  1. The Question — Why Score Complexity at All?
  2. SAVOIAS — How the Ground Truth Was Built
  3. The Frozen CLIP Backbone
  4. Method 1 — Zero-shot CLIP Prompts
  5. Method 2 — Ridge Regression on CLIP Embeddings
  6. Method 3 — Pairwise on Embedding Differences
  7. The Pairwise Pipeline in 10 Pictures
  8. Method 4 — kNN on CLIP Cosine Similarity
  9. Method 5 — Gemini-as-Judge with Chain of Thought
  10. Few-Shot Gemini — Does Calibration Help?
  11. The Honest Held-Out Scoreboard
  12. Why CLIP-Based Methods Fail (Semantic ≠ Complexity)
  13. Implications for the LLM-Pairwise Pipeline
  14. Summary

1. The Question — Why Score Complexity at All?

Visual complexity is the perceived busyness of an image — how many objects fight for the eye, how much text crowds the layout, how dense the patterns are. It matters in advertising (cluttered ads are forgotten), packaging design (busy labels lose at the shelf), UI (over-dense screens fatigue users), and image retrieval (a “minimalist scene” search needs a notion of simplicity).

The naive recipe is to ask humans. Crowdsource thousands of pairwise judgments — “which of A and B is more complex?” — fit Bradley-Terry, get one score per image on 0–100. That is exactly how the SAVOIAS dataset was built. But for production, paying humans for every new image is impractical. So the question becomes: can a frozen vision model do this for us?

Five recipes go on the bench in this post, all sharing the same 200 SAVOIAS advertisements. Spoiler: the simplest model wins.


2. SAVOIAS — How the Ground Truth Was Built

SAVOIAS holds 1,420 images across seven categories. We focus on the Advertisement category (200 images, 170 .jpg + 30 .png). The complexity scores were derived from 37,000+ pairwise comparisons by 1,600+ crowdworkers on Figure-Eight. Importantly, raters only compared images inside a category — an interior was never compared to an ad. Bradley-Terry was fit per-category and the resulting scores were rescaled to 0–100 separately for each category.

Important caveat: SAVOIAS scores are not comparable across categories. An advertisement with GT = 80 sits in the busiest 20% of ads; an interior with GT = 80 sits in the busiest 20% of interiors. Every result in this post is within-category.

The advertisement subset spans the full range: min 0, max 100, mean 52.5. Filenames are integer stems (0.jpg, 1.png …) and the GT file lists them all as .png — a small mismatch you have to join by stem, not by filename string. Here is the score distribution:

GT complexity score distribution

Fig 1. Histogram of the 200 Advertisement GT complexity scores. Mean 52.5, slight skew toward the complex end.

To get a feel for the scale, here is one median ad per decile:

One ad per decile, sorted by GT

Fig 2. Decile sweep — one ad picked from each 10% slice of the GT distribution. Decile 1 = simplest 10%, decile 10 = busiest 10%.


3. The Frozen CLIP Backbone

Every classical method in this post sits on top of one shared frozen model: OpenAI CLIP ViT-L/14, loaded through open_clip_torch. We use only the image encoder. Total parameters \(\approx 304\) M, all frozen — no gradient ever flows in. Image preprocessing is the standard CLIP transform: resize the shorter side to 224, center-crop to \(224 \times 224\), normalize with CLIP’s own mean and std.

The visual transformer has 24 blocks, 16 heads, internal dimension 1024. After the final LayerNorm, the CLS token is projected from 1024 to 768 (the text-alignment projection trained on the contrastive objective). We L2-normalise the resulting 768-vector. Cached embeddings live in a single .npz on disk so all five methods read from the same cache:

import torch, open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-L-14", pretrained="openai", device="cuda")
model.eval()
for p in model.parameters(): p.requires_grad_(False)

with torch.no_grad():
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).cuda()
    emb = model.encode_image(img)                 # [1, 768]
    emb = emb / emb.norm(dim=-1, keepdim=True)    # L2-normalise

One forward pass on a CPU is ~80 ms; on an M2 MPS device ~15 ms. The full 200-ad cache builds in under 30 seconds.


4. Method 1 — Zero-shot CLIP Prompts

The simplest recipe ignores GT scores entirely. Encode two text prompts, measure cosine similarity to each image, take the softmax probability of the “complex” class as the predicted complexity score.

PROMPTS = [
    "a minimalist, simple, uncluttered advertisement",
    "a cluttered, busy, complex advertisement with lots of text and graphics",
]

import torch
tokens = open_clip.tokenize(PROMPTS).cuda()
with torch.no_grad():
    txt = model.encode_text(tokens)
    txt = txt / txt.norm(dim=-1, keepdim=True)    # [2, 768]

# X is the cached [200, 768] image-embedding matrix, L2-normed
sims = X @ txt.T.cpu().numpy()                    # [200, 2]
probs = (sims * 100.0).softmax(axis=1)            # CLIP logit scale
zero_shot_score = probs[:, 1] * 100               # high = complex

On all 200 ads the zero-shot Spearman is \(|\rho| = 0.55\). On a held-out 10-ad subset (we’ll fix this subset for every method) it drops to \(|\rho| = 0.30\). The two prompts simply do not span the actual complexity axis, and CLIP gets fooled by ads with a single bold headline.


5. Method 2 — Ridge Regression on CLIP Embeddings

The next obvious thing: fit a linear regression that maps the 768-d embedding to the 0–100 GT score, with \(L_2\) regularisation.

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

scaler = StandardScaler().fit(X_train)
ridge  = Ridge(alpha=1.0).fit(scaler.transform(X_train), y_train)
y_pred = ridge.predict(scaler.transform(X_test))

Watch the leakage. Fit on all 200 ads and evaluate on the same 200 ads (or 5-fold CV with thin folds), and you will see SROCC \(\approx 0.38\) on this category. The number looks fine until you do a clean 190-train / 10-held-out split: SROCC collapses to −0.19, slightly anti-correlated. The original 0.38 was the small-fold interpolation flattering the model. Do not trust regression numbers without a truly held-out test set.

Ridge on CLIP embeddings simply does not transfer. The complexity axis is not expressible as a single linear weight over the 768 CLIP dimensions.


6. Method 3 — Pairwise on Embedding Differences

The trick learning-to-rank uses: instead of predicting a scalar, train a binary classifier on pairs. For pair \((i, j)\) the feature is \(\mathbf{x}_i – \mathbf{x}_j\) and the label is 1 if image \(i\) is more complex than \(j\). The trained weight vector \(\mathbf{w}\) is itself a scoring function: \(\text{score}(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x}\). This is essentially RankNet without the depth.

from sklearn.linear_model import LogisticRegression
import numpy as np

rng = np.random.default_rng(42)
pairs = []
while len(pairs) < 6000:
    i, j = rng.integers(0, len(y_train), size=2)
    if i == j or abs(y_train[i] - y_train[j]) < 4.0:
        continue
    pairs.append((int(i), int(j), int(y_train[i] > y_train[j])))
pairs = np.array(pairs)
pos = pairs[pairs[:, 2] == 1][:1500]
neg = pairs[pairs[:, 2] == 0][:1500]
pairs = np.concatenate([pos, neg])

F = X_train[pairs[:, 0]] - X_train[pairs[:, 1]]
L = pairs[:, 2]
clf = LogisticRegression(fit_intercept=False, C=0.1, max_iter=2000).fit(F, L)

w        = clf.coef_[0]
y_score  = X_test @ w                     # global complexity score

When trained on pairs sampled from all 200 ads and then evaluated on the same 10 ads, this method scores SROCC = 0.78. The number looks great. It is wrong, for the same reason Ridge looked great in the deep dive: the 10 test ads appeared inside the 3,000 training pairs. Retrain the classifier on pairs drawn only from the 190 non-test ads and evaluate on the 10:

The honest pairwise number on truly held-out ads: SROCC \(\approx 0.18\), held-out pair accuracy \(\approx 56.5\%\) — barely above chance. The same leakage that flattered Ridge was flattering pairwise too.


7. The Pairwise Pipeline in 10 Pictures

Before we add the last two baselines, here is a visual walkthrough of what the pairwise tournament actually does end-to-end. The point is to make the data flow concrete — what is GT, what is the classifier output, what is the final prediction. Same colour key on every figure: blue = data from the labelled GT reference pool, orange = output of the trained pairwise classifier, red = the final predicted score, grey = audit-only.

The problem — predict the complexity of a new ad we have no GT for

Fig 3. Setup. We have a new ad. We do NOT have its GT score. We want to predict it on the 0–100 scale from a handful of pair comparisons.

The reference pool of labelled ads

Fig 4. The reference pool. 199 ads we have already hand-labelled. Their GT is shown in blue because we trust it — these labels are what calibrate the rest of the pipeline.

CLIP turns each image into a 768-d vector

Fig 5. CLIP step. Every image — test or reference — goes through the same frozen CLIP ViT-L/14 to become a 768-dim vector. CLIP reads pixels only; no GT is touched in this step.

The pairwise classifier compares two embeddings

Fig 6. The pairwise classifier. Takes two image embeddings A and B, computes A − B, outputs the probability that A is more complex than B. Trained once on GT-labelled pairs; frozen at inference. The orange parts are the classifier’s output.

One tournament round — 10 random anchors compared against the test ad

Fig 7. One tournament round. Sample K = 10 random anchors from the reference pool. Run the pairwise classifier 10 times — once per anchor. Five wins, five losses in this example.

Sort anchors by GT and find the win-cutoff

Fig 8. Sort and interpolate. Order the 10 anchors by their known GT. The test ad “fits between” the 5th and 6th anchor (5 wins, 5 losses). Predicted GT = midpoint = (42 + 48) / 2 = 45.

Repeat 20 times and average

Fig 9. Twenty rounds. Each round picks 10 DIFFERENT random anchors. Average the 20 round predictions to wash out the classifier’s per-pair noise. Final prediction = the mean.

Where ground truth enters the pipeline

Fig 10. Where ground truth enters. The test image’s GT is NEVER used at inference. Anchor GTs ARE used — but only at the interpolation step, after the classifier has cast its votes.

Audit: predicted vs GT

Fig 11. Audit on one image. After the prediction is locked in, we compare against the held-out GT to score the system. The GT was hidden the whole time; this step is just for evaluation.

Honest population view across 200 ads

Fig 12. Honest population view. Across the 200 ads, |error| has a long right tail — about 5% of ads have |err| ≥ 20. The ranking is right (SROCC ≈ 0.84 with leakage; falls to ~0.18 once leakage is removed, see §12), the absolute scores are right on average, and a small tail of hard cases stays stubbornly wrong.

Why this still matters even after the post concludes Gemini wins. The visual pipeline above is the architecture the original LLM-pairwise plan was built around. In privacy- or latency-constrained settings (an image cannot leave your infrastructure, or each call must finish in ≤ 100 ms), you cannot use Gemini directly and need a local tournament estimator like this one — calibrated with a small labelled anchor pool. The conclusion at the end of the post (just use Gemini) only applies when you CAN send the image to a hosted LLM.


8. Method 4 — kNN on CLIP Cosine Similarity

If neither regression nor a linear ranker work, what about the laziest method of all? For each test ad, find the \(K\) closest training ads in CLIP cosine, take the mean of their GT scores. No model, no training step beyond CLIP itself.

import numpy as np

# X_train: [190, 768] L2-normed   y_train: [190]   x_test: [768]
sims = X_train @ x_test                # cosine similarity for unit-norm
order = np.argsort(-sims)[:7]          # top-7 neighbours
y_pred = float(np.mean(y_train[order]))

Result on the same 10 held-out ads: kNN-7 hits MAE \(\approx 18.7\), SROCC \(\approx 0.32\). Better than Ridge or honest pairwise, but still far below useful. A K-sweep from 1 to 190 confirms the verdict — no K rescues this method:

kNN K-sweep on CLIP cosine

Fig 13. K-sweep from 1 to 190 on the projected 768-d CLIP feature. MAE is flat at \(\approx 19\) for any reasonable K. Dashed line = Gemini zero-shot MAE for reference.

We also tried the pre-projection 1024-d CLS hidden state (the feature CLIP produces before the 1024→768 text-alignment projection). It is marginally worse (MAE 19.2, SROCC 0.22) — the complexity signal is not hiding behind the projection, it is simply not there.

The failure mode is visually obvious. For one test ad with GT = 15 (minimalist sky), CLIP’s 7 nearest neighbours have GT scores of 13, 20, 39, 44, 49, 63, 67 — spread of 54 points. CLIP grouped them by subject matter (people on light backgrounds, clean compositions), not by complexity:

kNN neighbours for the GT=15 sky ad

Fig 14. Test ad (left, GT = 15) and its 7 nearest CLIP neighbours. Neighbour GTs span 13–67. Mean = 42. Prediction error = 27. CLIP groups by content; complexity is orthogonal.


9. Method 5 — Gemini-as-Judge with Chain of Thought

Now the bigger gun. Send the ad straight to Gemini 2.5 Pro with a chain-of-thought prompt and ask for an integer score on 0–100. No CLIP, no embedding, no training. The model has never seen SAVOIAS.

from google import genai
from PIL import Image

COT_PROMPT = """You are an expert at rating the VISUAL COMPLEXITY of advertisements.

Visual complexity is the PERCEIVED busyness or density of visual information in
the ad — count of distinct objects, amount of text, color variety, edge/pattern
density, overlapping elements.

Rate on a 0–100 scale:
  0   = extremely minimalist (single object, clean background, no/little text)
  50  = average advertisement (a few elements, moderate text)
  100 = extremely cluttered (many overlapping objects, dense text, busy patterns)

Walk through your reasoning step by step:
  1. List the main visual elements you see.
  2. Note text density and color variety.
  3. Note patterns, textures, overlapping elements.
  4. Compare mentally to mid-range ads (50/100).
  5. Decide a final integer score.

Be concise (≤ 8 short sentences).

On the LAST LINE output EXACTLY:
SCORE: <integer 0–100>
"""

client = genai.Client(api_key=API_KEY)
img = Image.open("ad.jpg").convert("RGB")
resp = client.models.generate_content(
    model="gemini-2.5-pro",
    contents=[img, COT_PROMPT],
)
text = resp.candidates[0].content.parts[0].text
# parse: regex 'SCORE\s*[:=]\s*(\d{1,3})'

One call costs about $0.003 and takes ~10–20 s including the reasoning. The CoT lets us audit why Gemini scored what it scored. Below is one card from the experiment — note all five model scores side by side, plus Gemini’s full reasoning text. This particular ad (a CHANEL split-panel) shows the most interesting disagreement: humans rated it 52 (mid-complexity); Gemini rated it 30, correctly noting the minimalist negative space; CLIP-based methods are all over the map.

Per-ad card: CHANEL mascara, all model scores plus Gemini CoT

Fig 15. Per-ad card for the CHANEL mascara split-panel ad. GT = 52, Gemini = 30, |err| = 22. Gemini’s reasoning correctly identifies the minimalist design language; the SAVOIAS raters weighted the woman’s face and the bold contrast more heavily. Where Gemini disagrees with humans, the disagreement is usually interpretable.


10. Few-Shot Gemini — Does Calibration Help?

Reasonable next question: if we hand Gemini six reference ads with their known GTs before showing it the test ad, does it score the test ad better? In other words, does in-context calibration help an already-strong model?

We pick 6 calibration anchors from the 190 non-test ads at target GTs \([10, 30, 45, 60, 75, 90]\), then send all 7 images (6 anchors + 1 test) in a single multimodal prompt and ask Gemini to interpolate.

contents = [FEWSHOT_PROMPT]
for k, anchor_idx in enumerate(calib_indices, 1):
    anchor_img = Image.open(df.iloc[anchor_idx]["path"]).convert("RGB")
    contents.append(f"REFERENCE #{k} — known complexity score = "
                     f"{y[anchor_idx]:.0f}/100")
    contents.append(anchor_img)

contents.append("NEW advertisement to rate:")
contents.append(Image.open(test_path).convert("RGB"))
contents.append(NEW_QUERY_INSTRUCTION)

resp = client.models.generate_content(model="gemini-2.5-pro", contents=contents)

The honest answer is: no improvement on this set. On the 10 test ads, zero-shot Gemini gets MAE 9.2 and SROCC 0.87; few-shot Gemini with 6 anchors gets MAE 9.4 and SROCC 0.85. Five ads improved with the anchors, five got worse:

Per-ad delta between zero-shot and few-shot Gemini

Fig 16. Per-ad change in absolute error after adding 6 calibration anchors to the prompt. Green = few-shot got closer to GT; red = further. Five and five. Mean Δ|err| = +0.2 (slight regression).

The interpretation: Gemini’s internal complexity scale already matches SAVOIAS humans closely enough that 6 anchors add noise as often as signal. Calibration helps when the model is miscalibrated; here it is not.


11. The Honest Held-Out Scoreboard

Same 10 ads, same GTs, every method evaluated with no leakage. Ridge and pairwise are retrained on the 190 non-test ads. Gemini runs zero-shot on each test ad with no prior context. kNN-7 retrieves only from the 190-ad pool.

All six methods bar chart

Fig 17. Held-out comparison across six methods on 10 test ads. Left = MAE (lower better), right = |SROCC| (higher better). Gemini wins both, by a wide margin.

Method MAE RMSE |SROCC| PLCC
Zero-shot CLIP (two prompts) 23.9 28.9 0.30 +0.40
Ridge regression (clean 190/10) 25.0 27.0 0.19 −0.12
Pairwise on Δ-embeddings (clean 190/10) 19.6 21.3 0.18 +0.42
kNN-7 on CLIP cosine 18.7 20.5 0.32 +0.47
Gemini 2.5 Pro (zero-shot CoT) 9.2 11.5 0.87 +0.87
Gemini 2.5 Pro (6 anchors, few-shot) 9.4 11.3 0.85 +0.87

Read this carefully. The four classical methods cluster in the MAE 18–25 / |SROCC| 0.18–0.32 range — all far worse than the headline numbers we’d have reported with leakage. Gemini lands in MAE 9 / |SROCC| 0.87. A small caveat: 10 test ads is a small sample and SROCC estimates have wide variance at this scale. The qualitative ordering (Gemini ≫ all CLIP methods) is robust; the exact numbers are not.


12. Why CLIP-Based Methods Fail (Semantic ≠ Complexity)

The kNN visualisation (Fig 14) is the cleanest evidence. CLIP organises its 768-d image space by subject matter — “car ads”, “perfume ads”, “ads with a face on a white background”. Visually similar ads from CLIP’s perspective span the full complexity range. The complexity axis is simply not a direction in CLIP space.

That single observation explains why every classical recipe fails:

  • Zero-shot prompts rely on CLIP’s text-aligned axis matching the human notion of complexity. It doesn’t — bold-headline ads register as “complex” even when their layout is empty.
  • Ridge regression tries to find a single linear weight \(\mathbf{w}\) such that \(\mathbf{w} \cdot \mathbf{x}\) predicts complexity. Such a weight does not exist when the signal is not linearly extractable.
  • Pairwise on differences is more expressive than Ridge but still searches for a single linear ranking direction. On a held-out set it generalises about as well as Ridge.
  • kNN retrieval assumes “visually similar in CLIP space ⇒ similar in complexity”. The premise fails for the reason above.

Gemini, in contrast, looks at the actual pixels with its own visual encoder, can read the text on the ad, and produces a token-by-token explanation of what it sees. It does not need to project complexity onto CLIP’s pre-existing axes because it never visits them.


13. Implications for the LLM-Pairwise Pipeline

The original plan was: have an LLM rate pairwise judgments on packaging images, train a tiny pairwise classifier on CLIP-Δ features, deploy. This post falsifies the “train a pairwise classifier on CLIP-Δ” half of that plan. Pairwise on CLIP differences works only when the classifier has already seen the test items at train time.

The replacement plan is simpler: just ask Gemini directly. At $0.003 per image and ~15 s of latency, scoring a 10 000-image packaging catalogue costs $30 and runs in a few hours. No embedding cache, no calibration set, no train/test ceremony.

Reserve the CLIP-pairwise machinery for environments where images cannot leave your infrastructure (privacy, latency, regulatory). In those settings, the “use 10 random anchors with known GT, average the LLM-pairwise wins” tournament estimator from the earlier deep-dive remains the right design — just trained on LLM-labelled pairs in the data the model will actually see.


14. Summary

Conclusion infographic — five methods scored on 10 held-out ads, Gemini wins

Fig 18. The held-out scoreboard at a glance. Four CLIP-based methods cluster around MAE 19–25 / |SROCC| 0.18–0.32 once the leakage is removed. Gemini 2.5 Pro with a one-shot chain-of-thought prompt lands at MAE 9.2 / |SROCC| 0.87 — the only method good enough to ship.

Five methods, one dataset, ten honest test ads. Headline rankings:

  • Gemini 2.5 Pro CoT — MAE 9.2, |SROCC| 0.87. Off the shelf, no training, no embeddings.
  • Everything CLIP-based — MAE 18.7–25.0, |SROCC| 0.18–0.32 on truly held-out ads. The 0.78 pairwise number from the deep-dive was leakage; with a clean 190/10 split it drops to 0.18.
  • Few-shot anchors don’t improve Gemini — it’s already calibrated.
  • kNN-7 fails because CLIP groups by content, not complexity — neighbours of a simple sky ad include both other minimalist scenes and dense studio shots, mean regresses to the population.

For practical visual-complexity scoring on advertisements: skip the CLIP pipeline. Send the image to an LLM judge. Audit the chain-of-thought when the score surprises you. Calibrate only if the population mean is clearly off — for SAVOIAS ads it is not.

Code, cached embeddings, all five method scripts, and the full 28-page deep dive PDF are in the repository associated with this post. Total compute: about 4 minutes on an M2 MPS device plus 20 Gemini API calls (≈ $0.06).

dataHacker.rs — LLM_log #023 — Visual Complexity Scoring — Vladimir Matic. CLIP via open_clip; LLM judge via Gemini 2.5 Pro.