
Claude Fable, Codex and the Duty to Question AI Rankings

June 12, 2026 · By Fabrizio Galiano

When someone like Salvatore Sanfilippo, creator of Redis and a developer with a technical track record that is hard to dismiss, comments on a new frontier model, it is worth listening carefully. Not because authority replaces verification, but because some technical authorities have a rare quality: they do not test models in the abstract, they bring them into real problems.

His reflections on Claude Fable 5, shared in his latest YouTube videos, "Claude Fable" and "Altre considerazioni su Claude Fable" ("Further thoughts on Claude Fable"), are interesting precisely because they do not confirm a simple narrative. On one hand, Sanfilippo describes Fable as an extremely powerful model, able to reason about trade-offs before jumping into patches and attempts. On the other hand, in the second video, he sharpens the judgment: the jump from Opus is huge, while the jump from GPT 5.5 is an "incremental jump", especially if we consider the Pro version.

That nuance matters more than the ranking. In an ecosystem where many people look for an absolute verdict - "Claude is the best", "GPT is the best", "this model won" - a serious evaluation has to start from a different question: best for what, in which workflow, under which constraint and with what kind of human control?

The point is not to pick a team. The point is to avoid accepting the consensus of the moment as truth when direct experience, real use cases and the observations of expert engineers show a more complex picture.

Why Sanfilippo's opinion matters

Sanfilippo is not a generic creator trying models on demo prompts. He created Redis, spent years working on real systems, performance, data structures, C, debugging and software used in production by millions of developers. When he evaluates a model for coding, he is not only checking whether it produces an elegant function. He is watching whether it understands constraints, identifies physical limits, avoids useless work and knows when to say "this probably is not worth it".

In the first video he describes a case involving Dwarf Star, his local inference engine project for open-weight models; for readers who want the technical context, Sanfilippo presents it in detail on his blog in the article "Dwarf Star". The story concerns speculative decoding: instead of immediately proposing approaches, Fable analyzes timing, mixture-of-experts limits, attention cost and the fact that certain gains are not guaranteed. This is a rare quality in coding agents: not generating code by inertia, but reasoning about the space of possibilities.
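To see why gains from speculative decoding are not guaranteed, a back-of-envelope sketch helps. This is not Sanfilippo's or Fable's analysis: it is the usual simplified model in which each drafted token is accepted independently with probability `alpha`. The parameter names `draft_cost` and `verify_cost` are assumptions introduced here for illustration, both expressed relative to one normal decoding step of the target model. A `verify_cost` above 1 stands for cases where checking several tokens in one pass is more expensive than a single step, for example when a mixture-of-experts model routes the drafted tokens to different experts.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens produced by one draft-then-verify step.

    Simplifying assumption: each of the `gamma` drafted tokens is accepted
    independently with probability `alpha`, and the target model always
    contributes one token of its own (a correction or a bonus token).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        return float(gamma + 1)
    # Geometric sum 1 + alpha + ... + alpha**gamma.
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, gamma: int,
                      draft_cost: float, verify_cost: float = 1.0) -> float:
    """Estimated speedup over plain decoding (1 token per unit of cost).

    draft_cost:  cost of one draft-model step (hypothetical parameter).
    verify_cost: cost of the single verification pass (hypothetical parameter).
    """
    step_cost = gamma * draft_cost + verify_cost
    return expected_tokens_per_step(alpha, gamma) / step_cost


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from any real engine.
    cheap_verify = estimated_speedup(alpha=0.7, gamma=4, draft_cost=0.1, verify_cost=1.0)
    costly_verify = estimated_speedup(alpha=0.7, gamma=4, draft_cost=0.1, verify_cost=2.5)
    print(f"cheap verification:  {cheap_verify:.2f}x")
    print(f"costly verification: {costly_verify:.2f}x")
```

With these illustrative numbers the first case comes out close to 2x, while the second falls slightly below 1x: the same acceptance rate that looks like a clear win turns into a small loss once verification stops being cheap.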

The second video is even more useful because it turns the initial enthusiasm into a more operational judgment. Sanfilippo essentially says that Fable is a huge jump from Opus, but not a huge jump from GPT 5.5. That changes the conversation: perhaps we are not looking at a model that makes every other model irrelevant, but at two frontier model families that are fairly close, with different distributions of strength.

The myth of the best model

The AI world has an understandable but dangerous tendency: turning every release into a single leaderboard. A benchmark goes up, a thread goes viral, a lab publishes impressive examples, and within hours a sentence appears: "nothing is better than this".

The problem is that software development is not one benchmark. A coding agent has to do many things at once: read the project, respect existing patterns, avoid breaking what works, understand when to stop, communicate while working, accept corrections, preserve a global view, avoid over-engineering, produce small patches when small patches are needed, and know when the cost of a solution exceeds its benefit.

A model can be superior at mathematics, local reasoning or a narrow technical problem, while being less effective in a long workflow where the programmer needs to steer it continuously. Another model may be less brilliant on an isolated challenge, but more useful as a daily collaborator because it keeps context, communicates better and accepts guidance.

The question is not which model wins in absolute terms. The question is which model helps you reach a correct, maintainable and verifiable solution with less friction.

Claude can be extremely strong and still not ideal for everything

One of the most interesting observations that emerges from the videos concerns steerability. Sanfilippo notes that Fable gives little feedback while working and seems harder to correct during the turn. This is not a UI detail: it is a fundamental property of the relationship between developer and agent.

When a model is highly autonomous, it can feel more powerful. But if that autonomy reduces the human's ability to intervene at the right moment, the advantage becomes ambiguous. In software engineering, the problem is not only reaching a patch. It is reaching it without losing control over reasoning, assumptions and trade-offs.

This is where the perception of many intensive Codex or ChatGPT users becomes relevant: in some contexts these tools feel more collaborative, more steerable, more suited to working inside a live project. Not necessarily more intelligent on every task. But more effective when the work requires dialogue, continuous review and whole-project awareness.

Over-engineering and project-wide vision

Another point worth making explicit is over-engineering. Some very strong models tend to treat every problem as an opportunity to design a larger system. That can help in architecture, but it can hurt when the real request is to understand the existing code, apply a minimal change, respect local conventions and avoid unnecessary complexity.

In daily engineering, quality is not just producing sophisticated code. It is also recognizing that a boring, readable solution consistent with the repository is often better than an elegant solution that feels alien to the project. From this point of view, many advanced users perceive Codex as particularly strong in operational continuity: reading context, keeping the thread, connecting files, tests, constraints, branches, deployment and regressions.

This does not deny Claude's strength. It puts that strength in context. Claude may excel when the problem is vertical, deep, technical and contained. Codex may be more competitive when the problem is systemic: many files, implicit constraints, repository history, tests, workflow ergonomics and the need for continuous collaboration.

Cross review as a method

Sanfilippo proposes a very concrete practice: using two models in "cross code review". Not simplistically splitting "this one designs" and "that one codes", but having one model review the other's work, sending that review back to the first, applying changes and then checking again.

It is a strong intuition because it treats models not as oracles, but as systems with different error distributions. If one model sees a problem that the other ignored, the second may recognize it once the argument is made explicit. They do not both need to be perfect. They need to fail in non-identical ways.

This is probably one of the more mature practices for using AI in software today: not asking the model to replace technical judgment, but placing it inside a loop of critique, verification and risk reduction.
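The cross review loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the setup Sanfilippo uses: the `Model` type (a function from prompt to text), the `LGTM` approval marker and the prompt wording are all assumptions made for the example, and real adapters for each model would have to be written separately.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interface: any function that takes a prompt and returns text.
Model = Callable[[str], str]

APPROVED = "LGTM"  # assumed convention for "no issues found"


@dataclass
class ReviewRound:
    patch: str
    review: str


def cross_review(task: str, author: Model, reviewer: Model,
                 max_rounds: int = 3) -> Tuple[str, List[ReviewRound], bool]:
    """Let `reviewer` critique `author`'s patch and feed the critique back.

    Returns the latest patch, the review history and whether the reviewer
    approved it. An unapproved result is a signal for closer human review.
    """
    history: List[ReviewRound] = []
    patch = author(f"Task:\n{task}\n\nPropose a patch.")
    for _ in range(max_rounds):
        review = reviewer(
            f"Task:\n{task}\n\nPatch:\n{patch}\n\n"
            f"Review this patch. Reply {APPROVED} if you find no issues."
        )
        history.append(ReviewRound(patch, review))
        if review.strip() == APPROVED:
            return patch, history, True
        # Send the critique back to the author so it can revise or push back.
        patch = author(
            f"Task:\n{task}\n\nYour patch:\n{patch}\n\n"
            f"Reviewer comments:\n{review}\n\nRevise the patch."
        )
    # The last revision was never re-reviewed, so it is reported as unapproved.
    return patch, history, False
```

The returned history matters as much as the patch: it is the written record of which objections were raised and how they were answered, which is what a human reviewer needs in order to keep the final judgment.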

More convincing models, more dangerous mistakes

There is a passage in the videos that should make every practitioner pause. Sanfilippo observes that a more powerful model can also be more powerful at convincing you when it is wrong. Not because it is malicious, but because it builds better arguments, uses more context, sounds more authoritative and makes the missing logical step harder to notice.

This is central for anyone working with advanced AI. The error no longer arrives as confused output that is easy to discard. It arrives as something elegant, coherent and plausible. At that point human expertise does not become less important: it becomes more important. We need to know where to look, which invariants to verify, which tests to write and which assumptions to isolate.

When a model produces a beautiful but false explanation, the inexperienced user may accept it. The expert user can treat it as a hypothesis to stress-test. The difference is method.
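One way to treat a plausible answer as a hypothesis is to check an invariant against a slow but obviously correct oracle. The sketch below is a made-up example, not something taken from the videos: `merge_candidate` stands for a hypothetical model-written function that merges overlapping intervals and reads convincingly, yet contains a subtle bug. The invariant is that merging must never change which integer points are covered.

```python
import random
from typing import List, Optional, Set, Tuple

Interval = Tuple[int, int]  # closed integer interval [start, end]


def merge_candidate(intervals: List[Interval]) -> List[Interval]:
    """Hypothetical model-written merge: plausible, but subtly wrong."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Bug: this should keep max(merged[-1][1], end). As written,
            # an interval contained in the previous one shrinks the result.
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged


def covered_points(intervals: List[Interval]) -> Set[int]:
    """Slow oracle: the set of integers covered by the intervals."""
    return {p for start, end in intervals for p in range(start, end + 1)}


def find_counterexample(trials: int = 1000, seed: int = 0) -> Optional[List[Interval]]:
    """Search random inputs for one that violates the coverage invariant."""
    rng = random.Random(seed)
    for _ in range(trials):
        intervals = []
        for _ in range(rng.randint(1, 5)):
            start = rng.randint(0, 20)
            intervals.append((start, start + rng.randint(0, 10)))
        if covered_points(merge_candidate(intervals)) != covered_points(intervals):
            return intervals
    return None


if __name__ == "__main__":
    print(find_counterexample())
```

A returned list is a concrete input on which the claim fails. `None` only means that no counterexample turned up in the sampled inputs, not that the code is correct.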

Safety, access and power

The videos also touch on a delicate topic: limits imposed on frontier models, especially around LLM research, cybersecurity, distillation or sensitive capabilities. This is not a simple issue. The risks are real. But there is also a power question: who decides which capabilities may be used, by whom, for which purposes and with what transparency?

If a model becomes powerful enough to accelerate research and development, restricting access can be a safety measure. But it can also become a concentration of knowledge. The boundary between legitimate safety and protected competitive advantage is not always clear, and precisely for that reason it deserves public discussion.

For companies like Xseven, the practical takeaway is different: we cannot build critical processes on the assumption that the strongest model will always be accessible, predictable and available under the same conditions. We need fallback architecture, model plurality, human audit and control over data.
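As a rough illustration of what fallback and model plurality can mean in code, here is a minimal sketch. It does not describe Xseven's actual architecture: the `Provider` adapter, the provider names and the policy of flagging every fallback answer for human review are assumptions made for the example.

```python
import logging
from dataclasses import dataclass
from typing import Callable, List, Sequence

logger = logging.getLogger("model_router")


@dataclass
class Provider:
    name: str                   # hypothetical label, e.g. "hosted-frontier"
    call: Callable[[str], str]  # hypothetical adapter: prompt in, text out


@dataclass
class Completion:
    provider: str
    text: str
    needs_human_review: bool


class AllProvidersFailed(RuntimeError):
    """Raised when no configured provider could answer."""


def complete_with_fallback(prompt: str, providers: Sequence[Provider]) -> Completion:
    """Try providers in priority order and record which one answered."""
    errors: List[str] = []
    for index, provider in enumerate(providers):
        try:
            text = provider.call(prompt)
        except Exception as exc:  # a real adapter would catch narrower errors
            logger.warning("provider %s failed: %s", provider.name, exc)
            errors.append(f"{provider.name}: {exc}")
            continue
        # Audit trail: log the provider name only, so prompt data stays
        # out of the logs.
        logger.info("answered by %s", provider.name)
        # Assumed policy: answers from a fallback provider get a human check.
        return Completion(provider.name, text, needs_human_review=index > 0)
    raise AllProvidersFailed("; ".join(errors))
```

The ordering of `providers` is where the real decisions live: placing a local open-weight model last, for instance, keeps a process running when a hosted model changes its terms, at the price of answers that deserve a closer look.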

A rule for evaluating AI models

The conclusion should not be "Claude is better" or "Codex is better". That would be too simple and probably false by the next release. A more useful rule is this: evaluate models on your real problems, under your constraints, inside your workflow, measuring not only the final answer but the cognitive cost of getting there.

In practice, when we test a model, we should ask:

does it understand the repository context or force a generic solution?
does it know when a path is not worth taking?
does it accept corrections during the work?
does it produce patches proportional to the problem?
does it drift into over-engineering?
does it make its assumptions verifiable?
does it really improve total time to a reliable solution?

These questions are worth more than many rankings. A ranking measures a model in the abstract. A workflow measures a model inside reality.
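For teams that want to record these questions rather than just keep them in mind, a small data structure is enough. The sketch below is one possible shape, with field names invented for the example: each run of a model on a real task stores a yes/no answer for six of the questions, plus the measured time to a reliable solution and the number of human corrections.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

# Hypothetical names for the yes/no questions listed above.
CHECKS = (
    "understands_repo_context",
    "knows_when_to_stop",
    "accepts_corrections",
    "proportional_patch",
    "avoids_over_engineering",
    "assumptions_verifiable",
)


@dataclass
class TaskRun:
    model: str
    task: str
    checks: Dict[str, bool]
    minutes_to_reliable_solution: float  # includes review and rework time
    human_corrections: int


def summarize(runs: List[TaskRun]) -> Dict[str, Dict[str, float]]:
    """Aggregate runs per model without collapsing them into one score."""
    by_model: Dict[str, List[TaskRun]] = {}
    for run in runs:
        missing = set(CHECKS) - set(run.checks)
        if missing:
            raise ValueError(f"{run.task}: missing checks {sorted(missing)}")
        by_model.setdefault(run.model, []).append(run)

    summary: Dict[str, Dict[str, float]] = {}
    for model, model_runs in by_model.items():
        summary[model] = {
            "check_pass_rate": mean(
                sum(r.checks[c] for c in CHECKS) / len(CHECKS) for r in model_runs
            ),
            "avg_minutes": mean(r.minutes_to_reliable_solution for r in model_runs),
            "avg_corrections": mean(r.human_corrections for r in model_runs),
        }
    return summary
```

The summary deliberately keeps three separate numbers per model instead of a single rank: a model that passes every check but needs twice the time is a different trade-off from one that is fast and needs constant correction.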

Conclusion: stay critical even in front of the best

The most interesting part of Sanfilippo's reflections is not deciding whether Fable is first, second or half a step above GPT 5.5. It is the implicit invitation to use models as powerful but fallible tools, to be questioned, compared and verified.

Maybe Claude Fable 5 is currently the most impressive model on many tasks. Maybe Codex remains superior in some complex real-world development workflows. Maybe a new release will change the map again in a few weeks. But the mature point is different: never delegate judgment to the dominant narrative.

In software, as in AI, consensus is a signal. It is not proof. The proof remains the work: code that passes tests, architectures that hold up, systems that remain maintainable, decisions we can explain and responsibility that does not disappear behind the name of the strongest model of the moment.

Fabrizio Galiano

Founder & SRE — Xseven SRLS
