I had a very specific research task to do. Not the "summarise this industry for me" kind. The kind where you need to find specific companies, pull together a consistent set of structured information on each one, and get it back in a particular format you can actually use downstream.
So I wrote a detailed prompt. Specific requirements. Defined objectives. A clear output format. The kind of prompt that, if a human researcher received it, they'd know exactly what was expected.
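To make "a particular format" concrete, here's the shape of the per-company record I was after. This is purely illustrative; the field names below are placeholders, not the actual schema from my prompt.

```python
# Illustrative only: one per-company record in the shape the prompt requested.
# Every field name here is a placeholder, not my actual schema.
company_record = {
    "name": "Example Corp",
    "website": "https://example.com",  # the field every tool struggled with; more on that below
    "founded": 2019,
    "headquarters": "Berlin, Germany",
    "one_liner": "What the company actually does, in one sentence.",
    "funding_stage": "Series A",
    "sources": ["https://example.com/about"],
}
```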
Then I sent the same prompt to five tools and watched what happened.
The tools
Claude (Sonnet 4.6, Extended Thinking), ChatGPT, Manus, Gemini (Thinking), Grok (Expert mode).
Same prompt. Same task. No adjustments between tools.
1. Claude
Claude followed the prompt exactly. Produced the output in the format I asked for without any steering. The only catch: tool-call limits meant I had to split the work across two sessions.
Session one ran for 1 hour and 8 minutes and gathered 1,085 sources. Session two took 19 minutes and gathered 481. That's 1,566 sources across roughly 90 minutes of total research time. The depth was immediately obvious the moment I set the output next to everything else's.
If you're doing this kind of structured research and you need thoroughness, nothing else I tested came close.
2. ChatGPT
Good. About 400 sources in around 30 minutes. The results were solid but lacked the depth Claude produced. The more significant issue was format compliance: I had to redirect it several times to follow the output structure I'd asked for. It kept drifting. Not a dealbreaker, but if your prompt has very specific structural requirements, expect to manage that conversation.
3. Manus
This one surprised me. About 20 minutes, followed the format without much steering, clean output. Less depth than Claude or ChatGPT, and it dropped some information along the way, but it did what was asked. Relative to its overall capability, it was the most format-obedient of the five. Straightforward and honest about what it could do.
4. Gemini (Thinking)
Genuinely disappointing. Gemini has been my default research tool for a while, so I came in with high expectations. What I got instead was a report. A summary. Not the structured output I asked for.
I tried to redirect it. Multiple times. It kept going its own way. I don't know if something about the prompt style didn't suit it, or if this type of structured output task is just not where it performs well, but in practice the result wasn't usable in the format I needed. That surprised me more than anything else in the test.
5. Grok (Expert)
Basic results. Shallow. Missed a lot of information. Did not follow the format.
The most frustrating part had nothing to do with the research quality: Grok couldn't create a canvas or produce a proper markdown artifact. Every other tool handled this without any issue. If you're doing structured research and need to work with the output downstream, that limitation matters more than it might seem.
The thing none of them got right
There's one failure mode that showed up across all five tools, and it matters more than the rankings.
Specific company website domains.
Every tool was reasonably good at researching and scraping website content. But when it came to identifying the actual domain a specific company uses, all five got it wrong. Not occasionally. Consistently.
The pattern is predictable once you see it: they don't guess randomly. They construct what a company's domain probably should be. The company name with .ai or .com appended, clean and professional, the kind of domain a sensible company would register. Except it's the wrong domain, or it belongs to a different company entirely, or it doesn't exist.
To deal with this, I started a separate Claude Cowork session specifically to open and verify every domain in a browser. It added time but it worked. The problem is worst for smaller companies and recent startups. Well known companies with obvious, long-established domains are mostly fine. Anything less prominent? Verify manually.
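If you want to script the mechanical part of that first pass, a few lines of Python will do it. This is a minimal sketch of the idea, not what I ran in Cowork, and the candidate domains are placeholders:

```python
# A rough first-pass domain check: does each candidate resolve, and where
# does it actually land? The candidate domains here are hypothetical.
import socket
import urllib.request

candidates = ["examplecorp.ai", "examplecorp.com"]  # domains an AI run claimed

for domain in candidates:
    try:
        socket.gethostbyname(domain)  # step 1: does the name even resolve?
    except socket.gaierror:
        print(f"{domain}: does not resolve")
        continue
    try:
        # Step 2: request the homepage, following redirects, so parked or
        # forwarded domains reveal where they actually land.
        req = urllib.request.Request(
            f"https://{domain}",
            headers={"User-Agent": "domain-check/0.1"},
            method="HEAD",
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            print(f"{domain}: HTTP {resp.status}, lands on {resp.url}")
    except Exception as exc:
        print(f"{domain}: resolves but request failed ({exc})")
```

Even a clean 200 only proves the domain is live, not that it belongs to the right company, which is why the browser check stayed as the final word.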
Here's the observation I keep coming back to: all five models assume a company's domain looks like companyname.ai or companyname.com. That's the mental model they've built. If your actual domain doesn't fit that pattern, you are less discoverable in AI-generated research.
Domain choice is now part of your AI discoverability strategy. That's not something anyone was talking about a couple of years ago. But it's real, it's structural, and it's already affecting how companies appear when someone uses one of these tools to research a space.
The ranking was roughly what I expected. The insight wasn't.
I went into this wanting to know which tool handles structured deep research best. I got that answer: Claude, clearly, followed by ChatGPT, then Manus, with Gemini and Grok at the bottom.
But the more interesting finding came from the edges of the test: the shared domain identification problem, and what it reveals about how these models construct a picture of companies on the internet.
If you're building something right now, your domain is being read (and sometimes misread) by every AI research tool your potential customers might use. Worth knowing.