By Joao da Silva · April 26, 2026
TL;DR. We tested 40 SaaS brands' Layer 1 AI visibility (entity foundation, training-data recognition, web-search recognition) on both gpt-4o and gpt-5.2. Only 12 brands cleanly passed all three sub-tests, and the same 12 passed under both models. The model upgrade did not move the strict pass count. The binding constraint is the Google Knowledge Graph entity, which does not change when you upgrade your LLM. Three findings below, plus the full dataset and methodology.

The most-discussed assumption in AI visibility right now is that the next model will fix it. ChatGPT does not know your brand today, the argument goes, but gpt-5.2 has a fresh training cutoff and better recall, so wait six months and the problem shrinks. We wanted to test that. Specifically: does the leap from gpt-4o (training cutoff Oct 2023) to gpt-5.2 (Dec 2025) move the needle on how many SaaS brands AI can correctly recognize?
Short answer: no. Not at the strict pass count. The same 12 of 40 brands cleanly pass all three Layer 1 sub-tests on both gpt-4o and gpt-5.2. The newer model recovers some borderline cases (Linear, Bezel, Greptile flipped from "I'm not familiar" to correct identification) and surfaces a regression worth noting (Celest's web search ranking dropped). But the strict pass count of 12 is identical, because the binding constraint is upstream of the LLM.
What did we set out to test?
Layer 1 of the 15-prompt AI visibility audit framework splits into three sub-levels. Entity foundation: does AI have your brand in its knowledge graph at all? Training-data recognition: does the LLM know you from training, with no live retrieval? Web-search recognition: does the LLM correctly identify you when allowed to search the live web? Each sub-level fails for different reasons and requires a different fix.
We wanted to answer four questions:
- What share of SaaS brands cleanly pass Layer 1 today?
- Does the gpt-4o → gpt-5.2 upgrade meaningfully change that share?
- When brands fail, what failure mode dominates?
- How does the failure pattern split between established (G2 category leaders) and recent (YC W24/W25) cohorts?
How did we run the audit?
40 SaaS brands, stratified to test the training-data depth hypothesis:
- 20 G2 category leaders. Five each from CRM, project management, product analytics, and AI content/MarTech. Established brands with deep training-data presence. The "known good" baseline.
- 20 Y Combinator W24/W25 SaaS startups. First 20 alphabetical from the YC batch site, B2B Software industry tag. Recent enough that gpt-4o (Oct 2023 training cutoff) has not seen them, but gpt-5.2 (Dec 2025) might have.
Three tests per brand:
| Test | Method | What it measures |
|---|---|---|
| 1. Entity foundation | Google Knowledge Graph Search API, query brand name, capture top result + resultScore | Does the brand exist as a structured entity? |
| 2. Training-data recognition | OpenAI gpt-4o and gpt-5.2 via API, prompt "Who is [brand]?", no tools | Does the LLM know the brand from training, with no live retrieval? |
| 3. Web-search recognition | Same models with web_search tool use forced | Does the LLM correctly identify the brand when allowed to search? |
Scoring rubric for tests 2 and 3: PASS if the response correctly identifies the brand and its category. WEAK_PASS if identified but with wrong details. FAIL if "I'm not familiar" (NOT_RECOGNIZED) or "describes a different company" (CONFUSED_IDENTITY). Single-rater scoring; we read every response. Total combined runtime: about 1 hour 37 minutes. Total combined API cost: about $0.87.
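For readers who want to see the mechanics, here is a minimal sketch of tests 1 and 2 for a single brand. It assumes the public Knowledge Graph Search API endpoint and the OpenAI Python SDK; the environment variable names are placeholders, and the PASS/WEAK_PASS/FAIL judgment stays a human step, exactly as in the audit.

```python
# Minimal per-brand Layer 1 probe: entity foundation (Test 1) and
# training-data recognition (Test 2). Scoring stays a human judgment call.
import os
import requests
from openai import OpenAI

KG_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"
client = OpenAI()  # reads OPENAI_API_KEY from the environment


def kg_entity(brand: str) -> dict:
    """Test 1: query the Google Knowledge Graph Search API for the brand."""
    resp = requests.get(KG_ENDPOINT, params={
        "query": brand,
        "limit": 1,
        "key": os.environ["GOOGLE_KG_API_KEY"],  # placeholder variable name
    })
    resp.raise_for_status()
    items = resp.json().get("itemListElement", [])
    if not items:
        return {"found": False, "resultScore": 0.0}
    top = items[0]
    return {
        "found": True,
        "name": top["result"].get("name"),
        "description": top["result"].get("description"),
        "resultScore": top.get("resultScore", 0.0),
    }


def training_data_recognition(brand: str, model: str = "gpt-4o") -> str:
    """Test 2: ask 'Who is [brand]?' with no tools, so the answer can only
    come from training data."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": f"Who is {brand}?"}],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    brand = "Asana"
    print(kg_entity(brand))                    # expect resultScore well over 100
    print(training_data_recognition(brand)[:200])
```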
Limitations up front: 40 brands is directional, not statistically powered. OpenAI only; Anthropic Claude and Google Gemini may have different recall profiles. Layer 1 only; Layer 2 (visibility) and Layer 3 (recommendation) were not in scope. Single-rater scoring with no inter-rater reliability check. The full dataset, methodology, and raw response logs are linked at the bottom.
Finding 1: Why is the Knowledge Graph the binding constraint?
Answer: every brand that cleanly passed Layer 1 had a high-confidence Knowledge Graph entry. None of the brands without one passed.
12 of 40 brands (30%) cleanly passed all three Layer 1 sub-tests under both gpt-4o and gpt-5.2. The same 12, on both models:
- HubSpot, Pipedrive (CRM)
- Asana, Monday.com, Notion, ClickUp (project management)
- Mixpanel, Amplitude, Heap, Hotjar (product analytics)
- Copy.ai, Writesonic (AI content)
All 12 share one structural property: a high-confidence Knowledge Graph entry with resultScore over 100. None of the brands without that structural anchor passed all three sub-tests. The model upgrade improved recall for some borderline cases (gpt-5.2 correctly identified Linear and 2 YC startups that gpt-4o did not), but the strict pass count was identical. For a brand pursuing AI visibility, fixing the Knowledge Graph entity is the highest-leverage Layer 1 action; it gates everything downstream.
There is also a structural-vs-recall gap worth flagging. 6 of 40 brands have a Knowledge Graph entity (PASS or WEAK_PASS on Test 1) but fail training-data recognition on gpt-5.2. The same 6 fail on gpt-4o. All 6 have resultScore below 100, meaning their KG presence is barely-there. The implication: low-confidence Knowledge Graph entries do not appear to feed LLM training corpora reliably. A KG entry with score over 100 is the threshold that matters. Search Engine Land has documented similar patterns where structural authority gates LLM citation behaviour more than on-page optimization does.
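That structural-vs-recall cross-tab falls straight out of the master CSV linked below. A sketch, where the filename and column names (kg_result_score, gpt52_training_result) are illustrative stand-ins for the published headers:

```python
# Sketch of the structural-vs-recall gap from the master CSV.
# Filename and column names are illustrative; match them to the real headers.
import pandas as pd

df = pd.read_csv("layer1_audit_master.csv")  # hypothetical filename

has_kg = df["kg_result_score"] > 0
fails_training = df["gpt52_training_result"] == "FAIL"

# Brands with a KG entity that still fail training-data recognition on gpt-5.2
gap = df[has_kg & fails_training]
print(len(gap), "brands have a KG entity but fail training-data recognition")
print("all below the resultScore 100 threshold:",
      (gap["kg_result_score"] < 100).all())

# Cross-tab: KG confidence band vs training-data outcome
df["kg_band"] = pd.cut(df["kg_result_score"], [-1, 0, 100, float("inf")],
                       labels=["none", "low (<=100)", "high (>100)"])
print(pd.crosstab(df["kg_band"], df["gpt52_training_result"]))
```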
Finding 2: How does the failure mode shift on newer models?
Answer: NOT_RECOGNIZED drops, CONFUSED_IDENTITY rises. As models get better at recall, confidently-wrong answers become the dominant failure mode.
The total failure-event count dropped from 29 on gpt-4o to 23 on gpt-5.2 (a 21% reduction). But the shape of those failures changed meaningfully:
| Failure mode | gpt-4o | gpt-5.2 | Δ |
|---|---|---|---|
| NOT_RECOGNIZED ("I'm not familiar with...") | 19 (65%) | 12 (52%) | -7 |
| CONFUSED_IDENTITY (describes a different company) | 10 (35%) | 11 (48%) | +1 |
NOT_RECOGNIZED dropped meaningfully (19 → 12). CONFUSED_IDENTITY held roughly flat in absolute count (10 → 11) but rose to nearly half of all failures as a share. As model recall improves, confidently-wrong answers become the dominant failure mode. This matters because CONFUSED_IDENTITY is more dangerous than NOT_RECOGNIZED: a buyer who hears "I'm not familiar with Acme" knows to keep looking. A buyer who hears "Acme is a Belgian video-tech company" (when Acme is actually a YC SaaS startup) walks away with the wrong mental model and never knows.
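The table above can be rebuilt from the raw response logs linked at the bottom. A sketch, with illustrative file and field names:

```python
# Rebuild the failure-mode distribution from the raw response logs.
# The filename and field names (model, failure_mode) are illustrative.
import json
from collections import Counter

with open("layer1_response_logs.json") as f:
    events = json.load(f)  # one record per scored LLM response

failures = [e for e in events
            if e["failure_mode"] in ("NOT_RECOGNIZED", "CONFUSED_IDENTITY")]
by_model = Counter((e["model"], e["failure_mode"]) for e in failures)

for model in ("gpt-4o", "gpt-5.2"):
    total = max(sum(v for (m, _), v in by_model.items() if m == model), 1)
    for mode in ("NOT_RECOGNIZED", "CONFUSED_IDENTITY"):
        n = by_model[(model, mode)]
        print(f"{model:10s} {mode:18s} {n:3d} ({n / total:.0%})")
```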
Two oddball examples are worth calling out. gpt-5.2 misidentifies "Nitrode" as a fictional Roman character from Petronius's Satyricon (the "Widow of Ephesus" tale), confidently presenting that reading as the most likely meaning before mentioning the modern fintech; generic-but-Latin-sounding names introduce a failure mode we are calling scholarly hallucination. And gpt-5.2 identifies "Spott" as a Belgian video-tech company rather than the YC W25 recruitment ATS spott.io: no disclaimer, no hedge, just a confident description of the wrong company.
Finding 3: Why do generic-named brands stay broken across model upgrades?
Answer: even web search cannot disambiguate them from their famous-name collisions. Four brands in our cohort fail on both tests across both model generations.
Four brands in the cohort fail CONFUSED_IDENTITY across both training-data and web-search tests, on both gpt-4o and gpt-5.2:
- Bud → describes Budweiser, Bud Financial, Bud Grant, "buddy" nickname
- Forge → describes Forge Global, ForgeRock, Atlassian Forge
- Roark → describes Howard Roark (Ayn Rand) and Roark Capital
- Trim → describes the Trim fintech app, the town of Trim in Ireland, the TRIM SSD command
Generic naming fails across the board. For all four, the model picks Budweiser, Atlassian Forge, Howard Roark, or the Irish town of Trim even with web search enabled: live retrieval cannot disambiguate them from their famous-name collisions. Generic naming is a Layer 1 visibility issue that no model upgrade is likely to fix. A Series A startup with strong product traction stays invisible in AI when its name collides with a famous fictional character (Roark, Forge), an established corporate trademark (Forge, Trim), or a common English word with strong semantic associations (Bud).
The implication is brutal but actionable. If your brand name is generic, every other Layer 1 fix you invest in (Wikipedia, schema, founder presence) has to work twice as hard to overcome the disambiguation problem AI is solving against you.
If you are still pre-launch and choosing a name, "uniquely searchable" should rank above "memorable" and "easy to spell." And if you have already named your company a common English word, the realistic Layer 1 strategy is to over-invest in disambiguation content. That means an unambiguous About page with structured Organization schema, a Wikipedia entry with explicit disambiguation, and third-party content that anchors your brand to a specific market or category.
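As a concrete illustration of that disambiguation content, here is a sketch of the kind of Organization markup we mean, written as a Python dict that serializes to JSON-LD. The brand name, legal name, and URLs are placeholders, not real profiles.

```python
# Sketch of Organization JSON-LD for an unambiguous About page.
# "Bud" stands in for any generic-named brand; all URLs are placeholders.
import json

organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Bud",
    "legalName": "Bud Technologies, Inc.",          # placeholder legal name
    "url": "https://www.example-bud.com",           # placeholder domain
    "description": (
        "Bud is a B2B SaaS platform for X. "        # anchor the category explicitly
        "Not affiliated with Budweiser or Bud Financial."
    ),
    "disambiguatingDescription": "Y Combinator-backed SaaS startup, founded 2024.",
    "sameAs": [                                     # tie the entity to other anchors
        "https://www.linkedin.com/company/example-bud",
        "https://www.crunchbase.com/organization/example-bud",
    ],
}

# Paste the output into a <script type="application/ld+json"> tag on the About page.
print(json.dumps(organization, indent=2))
```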
What does this mean for your AI visibility strategy?
Three takeaways from the data:
- Knowledge Graph first. If you do not have a high-confidence Knowledge Graph entry, every other Layer 1 investment is downstream of fixing that. Submit your brand via structured Organization schema, build a Wikipedia entry, get cited in third-party publications that Google trusts as KG sources.
- Audit on multiple model generations, not one. Picking a single LLM for your audit (typically the latest) hides regressions like the Celest case. Run the audit on at least two model generations and look at the overlap; the strict pass set across generations is your real Layer 1 score (see the sketch after this list).
- Renaming is on the table for generic-named brands. If your name is Bud, Forge, Roark, Trim, or any close analog, the Layer 1 disambiguation problem is structurally hard. Either commit to disproportionate disambiguation content investment for the next 18 to 24 months, or evaluate whether a partial rename would be cheaper.
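A sketch of the second takeaway in code: the strict Layer 1 score is the intersection of per-model pass sets, so regressions and recoveries fall out as set differences. The result dicts here are illustrative, not the audit data.

```python
# Strict Layer 1 score across model generations = intersection of pass sets.
# The result dicts are illustrative; populate them from your own audit runs.
results_gpt4o = {"Asana": "PASS", "Linear": "FAIL", "Celest": "PASS"}
results_gpt52 = {"Asana": "PASS", "Linear": "PASS", "Celest": "FAIL"}


def pass_set(results: dict[str, str]) -> set[str]:
    return {brand for brand, verdict in results.items() if verdict == "PASS"}


strict_passes = pass_set(results_gpt4o) & pass_set(results_gpt52)
regressions = pass_set(results_gpt4o) - pass_set(results_gpt52)   # e.g. a Celest-style case
recoveries = pass_set(results_gpt52) - pass_set(results_gpt4o)    # e.g. a Linear-style case

print(f"Strict Layer 1 pass set: {sorted(strict_passes)}")
print(f"Regressed on the newer model: {sorted(regressions)}")
print(f"Recovered on the newer model: {sorted(recoveries)}")
```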
For the broader audit framework, see the 15-prompt AI visibility audit. For the specific patterns this audit surfaced, see the 11 AI visibility failure modes guide, which maps the CONFUSED_IDENTITY data to a fix-priority order.
What were the limitations?
Honest accounting of where this audit falls short, before someone else points it out:
- OpenAI only. Anthropic Claude and Google Gemini may have different recall profiles, different retrieval mechanics, different failure-mode distributions. A multi-LLM v2 of this audit is the natural next study.
- Layer 1 only, prompt 1.1 only. Layer 2 (visibility / leaderboard) and Layer 3 (recommendation) were not tested in this run. Within Layer 1, this audit ran only the most basic prompt (1.1 — "Who is [brand]?") against three retrieval mechanisms. The full Layer 1 prompt set (1.1 through 1.5) was not run; that scope is on the v3 backlog. The "30% pass" headline is a Layer 1 / prompt-1.1 number; brands that pass cleanly here may still fail prompts 1.2 through 1.5 or Layers 2 and 3.
- 40 brands is directional. Statistical power for sub-segment claims (CRM vs analytics, G2 vs YC) requires a larger N. Treat the per-stratum numbers as suggestive, not authoritative.
- Single-rater scoring. No inter-rater reliability check. The PASS/WEAK_PASS/FAIL boundary on edge cases (e.g., "identifies the right company but with one wrong fact") was a judgment call.
- Forced web_search on Test 3. v2 forces tool use for fair comparison with gpt-4o-search-preview. Without forcing, gpt-5.2 sometimes skips the tool and answers from training data. Production deployments where tool use is optional would see lower web-search pass rates.
- Brand selection method. YC W24/W25 alphabetical-first-20 is reproducible but introduces selection bias toward names starting with letters early in the alphabet. A random sample would be more defensible.
We will re-run a v3 quarterly with these limitations addressed (multi-LLM, larger N, dual-rater scoring, random YC sample). The dataset link below is the v2 baseline.
Frequently Asked Questions
Did the gpt-4o → gpt-5.2 upgrade help any brands?
Yes, at the margin. Linear, Bezel, and Greptile flipped from FAIL to PASS on training-data recognition. crmCopilot, Ellipsis, and Firebender flipped from FAIL to PASS on web-search. Six brands recovered. One regression: Celest's web search ranking dropped (gpt-4o-search-preview ranked celest.dev as #4 in disambiguation; gpt-5.2 + web_search dropped it entirely). Net positive but not a wholesale shift.
Why does the Knowledge Graph matter so much for LLM training?
Google Knowledge Graph entries (especially high-confidence ones) are widely used as canonical entity sources by LLM training pipelines. When training data ingests the open web, named entities with Knowledge Graph anchors get linked, deduplicated, and reinforced. Brands without that structural anchor end up as scattered text mentions that may or may not survive deduplication. The 12 brands that passed all three sub-tests in our cohort all had resultScore over 100 in the Google KG; the 6 brands with KG presence below that threshold all failed training-data recognition.
How is this different from existing AI visibility studies?
Most existing studies measure Layer 2 (visibility / leaderboard) on prompts like "best CRM for startups." Omniscient Digital's 200-prompt analysis of 25,755 AI citations is the canonical example. Our audit measures Layer 1 (entity recognition) instead, which is the upstream constraint: brands that fail Layer 1 cannot show up in Layer 2 leaderboards regardless of how much off-site authority they have. The two studies are complementary; we are filling in a layer the existing literature has under-measured.
Can I run this audit on my own brand?
Yes. The methodology is reproducible with about 30 minutes of setup and an OpenAI API key. We published the audit brief and the full prompt list. For the Knowledge Graph test, use the Google Knowledge Graph Search API. For the LLM tests, use OpenAI's gpt-5.2 with reasoning.effort=low (training-data test, no tools) and the same model with web_search tool use forced (web-search test). The total cost per brand is about $0.02.
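A sketch of the two LLM calls, using the OpenAI Responses API as one plausible way to wire them up. The model name comes from this audit; the web_search tool type string and the tool_choice forcing syntax may vary by SDK and model version, so treat this as a template rather than a verified implementation.

```python
# Sketch of the training-data and web-search calls via the Responses API.
# Tool type and tool_choice syntax may differ by SDK version; verify before use.
from openai import OpenAI

client = OpenAI()
brand = "YourBrand"  # replace with the brand you are auditing

# Training-data test: no tools, low reasoning effort, answer comes from training only.
training = client.responses.create(
    model="gpt-5.2",
    reasoning={"effort": "low"},
    input=f"Who is {brand}?",
)

# Web-search test: same prompt with the web_search tool forced.
web = client.responses.create(
    model="gpt-5.2",
    input=f"Who is {brand}?",
    tools=[{"type": "web_search"}],
    tool_choice={"type": "web_search"},
)

print(training.output_text)
print(web.output_text)
```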
Will you re-run this study?
Quarterly. The next iteration will expand to Anthropic Claude and Google Gemini, increase the sample size, use dual-rater scoring, and replace the alphabetical YC sample with a random one. The v2 baseline below is what you can cite today.
Does Knowledge Graph presence guarantee LLM recall?
No. Six brands in our cohort had Knowledge Graph entries (PASS or WEAK_PASS on Test 1) but still failed training-data recognition on both models. All 6 had resultScore below 100. The threshold matters: barely-there KG presence does not appear to feed LLM training data reliably. A high-confidence KG entry (resultScore over 100) is the structural anchor that gates downstream visibility.
Dataset access
The full audit dataset is published with this post:
- Master CSV. One row per brand, all 40 rows, all test fields
- Aggregate stats JSON. Machine-readable summary numbers
- Raw response logs. Every LLM response for tests 2 and 3, both v1 and v2
If you cite the dataset in your own writing, please link to this post. For replication, the methodology section above plus the published prompts are sufficient to reproduce the run.
Methodology footnote. Audit conducted by friction AI, April 2026. 40 SaaS brands stratified into 20 G2 category leaders + 20 Y Combinator W24/W25 startups. Three Layer 1 sub-tests per brand: entity foundation via Google Knowledge Graph Search API; training-data recognition via OpenAI gpt-4o and gpt-5.2 (reasoning.effort=low, no tools); web-search recognition via the same models with web_search tool use forced. Single-rater scoring. Layer 2 and Layer 3 not in scope for this run. Total runtime: 1h 37min combined; total API cost: $0.87. Limitations enumerated in section above.
About the author. Joao da Silva is co-founder of friction AI alongside Camilla Wirth. friction AI tracks brand visibility across ChatGPT, Claude, Perplexity, and Gemini for SaaS and DTC brands. Joao writes about AI search, entity recognition, and the operational side of getting recommended by LLMs. Connect with him on LinkedIn.