I Made 5 AI Models Argue About Stocks. The Disagreements Were More Valuable Than the Answers.
Consensus is comfortable.
Raviteja Nekkalapu · InfoSec Write-ups · ~5 min read · March 13, 2026
It feels safe.
When everyone agrees, you stop questioning.
That is exactly why consensus is dangerous in financial analysis.
I learned this the hard way. I was building Nipun AI, an open-source financial analysis platform, and my first version used a single AI model. Google Gemini.
I sent it financial data, it returned an analysis, and I displayed it.
Clean. Simple. Wrong.
The problem was not accuracy. Gemini is a capable model. The problem was that a single model gives you a single perspective. And a single perspective has blind spots you cannot see because you have nothing to compare it against.
So I added a second model.
Then a third.
Then a fourth.
Then I realized something unexpected: the disagreements between models taught me more than any individual analysis ever could.
The Architecture of Disagreement
Nipun AI runs a 4-phase analysis pipeline.
Phase 3 sends financial data to Google Gemini for synthesis.
Phase 4 sends the same data to Cerebras (running Llama 3.3 70B) with an explicit instruction: "Provide your independent verdict. Do not agree with any previous analysis."
That second instruction is critical.
Without it, the second model tends to defer to the first. Language models are trained on human text, and human text is full of consensus-seeking patterns. You have to explicitly tell the model to think independently.
The system then compares the two verdicts and produces an agreement score. When Gemini says bullish and Cerebras says bullish, you get a high agreement score and higher confidence. When they disagree, you get something far more interesting.
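The comparison step can be sketched in a few lines. This is a minimal illustration, not Nipun AI's actual implementation: the prompt text, the `compare_verdicts` function, the dict-based verdict format, and the 0.7/0.3 weighting are all assumptions made for the example.

```python
# Illustrative sketch of the two-verdict comparison. All names and
# weights here are hypothetical, not Nipun AI's real code.

INDEPENDENCE_PROMPT = (
    "Analyze the financial data below and give your independent verdict "
    "(bullish, bearish, or neutral). Do not agree with any previous analysis."
)

def compare_verdicts(primary: dict, secondary: dict) -> dict:
    """Score how closely two model verdicts agree (0.0 to 1.0)."""
    score = 0.0
    if primary["verdict"] == secondary["verdict"]:
        score += 0.7  # the directional call carries most of the weight
    # a small confidence gap contributes the remainder
    score += 0.3 * (1 - abs(primary["confidence"] - secondary["confidence"]))
    return {
        "agreement": round(score, 2),
        "diverged": primary["verdict"] != secondary["verdict"],
    }
```

With this shape, two bullish verdicts at similar confidence score near 1.0, while a bullish/bearish split stays low and sets the `diverged` flag that signals something worth investigating.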
What Disagreement Looks Like
I ran an analysis on a semiconductor company last quarter. Gemini looked at the revenue growth trajectory, the expanding market for AI chips, and the 40% gross margin. It returned a bullish verdict with high confidence.
Cerebras looked at the same numbers.
Same revenue growth. Same market data. Same margins. It returned bearish.
Why?
Cerebras flagged something Gemini did not emphasize - the debt-to-equity ratio had been climbing for three consecutive quarters. The company was financing its growth with debt. At current interest rates, the interest coverage ratio was thinning. If revenue growth slowed even slightly, the debt load could become a problem.
Gemini mentioned the debt in passing. One sentence in a 900-word report. Cerebras made it the centerpiece of its analysis.
Neither model was wrong. They weighted the same data differently. And that divergence, that gap between two interpretations of identical numbers, was the most valuable output of the entire analysis.
Why Multi-Model Matters for Security
This is not just a finance problem. It is a security problem.
Single-model AI systems have a failure mode that nobody talks about enough - they fail silently.
When a model hallucinates a number or misinterprets a data point, there is no alarm. The output looks reasonable. The formatting is clean. The user trusts it.
With multiple models, you get a built-in cross-check. If Gemini says a stock's P/E ratio is 15 and Cerebras says it is 45, something is wrong. The system can flag that divergence before it reaches the user.
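A numeric cross-check like the P/E example can be as simple as measuring the relative spread between the models' reported values. This helper is a sketch under assumed names and an assumed 25% tolerance; it is not from the Nipun AI codebase.

```python
def check_metric_divergence(name: str, values: list, tolerance: float = 0.25) -> dict:
    """Flag a metric when independent models report materially
    different values for it. `tolerance` is an assumed threshold."""
    lo, hi = min(values), max(values)
    spread = (hi - lo) / hi if hi else 0.0  # relative spread
    return {"metric": name, "spread": round(spread, 2), "flagged": spread > tolerance}
```

For the P/E case above (15 vs. 45), the relative spread is 0.67, well past any reasonable tolerance, so the divergence gets flagged before the report reaches the user.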
Nipun AI takes this further with a dedicated fact audit layer. After Gemini generates the report in Phase 3, Cohere Command R+ audits every factual claim against the verified source data from Phase 1. Each claim is classified as grounded, speculative, or unverifiable.
This is RAG-based verification applied to the system's own output. The AI checks its own work against known facts.
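The triage into grounded, speculative, and unverifiable can be pictured with a toy auditor. The real system uses an LLM (Cohere Command R+) for this; the stand-in below just checks a claim against a dict of verified Phase 1 metrics, and its function name and claim format are invented for the example.

```python
# Toy fact auditor. The real audit is LLM-based; this rule-based
# stand-in only illustrates the three-way classification.

def audit_claim(claim: dict, source_data: dict) -> str:
    """Classify a factual claim against verified source data."""
    metric = claim["metric"]
    if metric in source_data:
        if abs(source_data[metric] - claim["value"]) < 1e-6:
            return "grounded"      # claim matches the verified source
        return "unverifiable"      # claim contradicts the source
    return "speculative"           # metric absent: plausible inference at best
```

A claim about YoY revenue growth, when Phase 1 only fetched revenue-per-share figures, falls into the "speculative" bucket exactly as described below.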
The Fact Audit in Practice
Here is what a fact audit actually catches.
Gemini generates a report that says "Apple's revenue grew 8% year-over-year."
The fact audit checks this against the Finnhub data that was fetched in Phase 1. If Phase 1 data shows revenue per share metrics but not explicit YoY growth, Cohere classifies that claim as "speculative." It is probably true, but the system cannot verify it from the data it actually has.
This matters.
Financial professionals need to know which claims in an AI-generated report are backed by data and which are the model's interpretation. The fact audit makes that distinction visible.
In testing across 200 analyses, the average report contained 12 to 15 auditable claims.
Roughly 60% were grounded in the source data.
About 30% were speculative but reasonable.
Around 10% were unverifiable.
That 10% is the danger zone.
Those are claims the model presented as facts but could not be traced to any source. Without the audit layer, users would read those claims and assume they were verified.
The Cascading Fallback Problem
Running multiple AI models introduces a reliability challenge. APIs fail. Rate limits hit. Models return empty responses. If your system depends on five models and one goes down, you cannot crash. You need fallbacks.
Nipun AI handles this with cascading model fallback. The Gemini integration alone has five models. If the primary model fails, the system automatically tries the next one. Different models are assigned to different tasks based on complexity.
This is not just about uptime. It is about cost and rate-limit optimization. By spreading requests across five model variants, Nipun AI stays within each variant's free-tier quota. Users get institutional-grade analysis without paying for a single API call.
The Phase 4 models (Cerebras and Cohere) are non-fatal. If they fail, the analysis still completes. The user gets a report and a score. They just do not get the second opinion or the fact audit for that particular analysis. The system logs the fallback and moves on.
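The cascading-fallback pattern, including the non-fatal Phase 4 behavior, fits in one small function. This is a hedged sketch: the model names, the `call_model` callable, and the `fatal` flag are placeholders, not Nipun AI's real configuration.

```python
# Sketch of cascading model fallback with non-fatal phases.
# Chain contents and function names are illustrative assumptions.

GEMINI_CHAIN = ["gemini-model-a", "gemini-model-b", "gemini-model-c"]

def call_with_fallback(chain, prompt, call_model, fatal=True):
    """Try each model in order; return the first non-empty response.

    Fatal chains (the core pipeline) raise if every model fails.
    Non-fatal chains (Phase 4: second opinion, fact audit) return
    None so the analysis still completes, just without that layer.
    """
    for model in chain:
        try:
            response = call_model(model, prompt)
            if response:  # guard against empty responses, not just errors
                return response
        except Exception:
            continue  # rate limit or API error: fall through to the next model
    if fatal:
        raise RuntimeError("all models in the fallback chain failed")
    return None  # degrade gracefully; caller logs the fallback and moves on
```

The same helper serves both cases: Phase 3 calls it with `fatal=True`, Phase 4 with `fatal=False`, which is how a Cerebras or Cohere outage skips the second opinion instead of sinking the whole analysis.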
What This Means for AI Security
If you are building any system where an AI model's output influences decisions, you need adversarial validation.
Not just input validation.
Output validation.
A single model can be manipulated through prompt injection, data poisoning, or simple hallucination. Multiple models with independent prompts and explicit instructions to disagree create a much harder target. An attacker would need to compromise multiple models simultaneously to produce a consistent false output.
Nipun AI's architecture is not bulletproof. No system is. But it makes silent failures visible. When the models agree, you can have reasonable confidence. When they disagree, you know exactly where to dig deeper.
And in security, knowing where to look is half the battle.
Nipun AI is an open-source financial analysis platform with multi-AI consensus, fact auditing, and zero infrastructure.
MIT licensed.
Live demo: nipun-ai.pages.dev
All free-tier APIs.
Zero infrastructure.
GitHub source code: https://github.com/myProjectsRavi/Nipun-AI
#artificial-intelligence #ai #ai-agent #open-source #finance