Why Task Proficiency Doesn't Equal AI Autonomy
March 8th, 2026 • Max Trivedi

TL;DR: This essay begins by comparing the base cognitive abilities of human and AI agents to evaluate the claim of imminent software engineering job replacement. Two primary hypotheses emerge from this analysis. First, a Benchmark-Trainability Coupling: the constraints that make a task easy to benchmark are by and large the same constraints that make it trainable via statistical methods. Second, the tasks required for true autonomy are difficult to isolate from their environment, relying heavily on abilities like causal modeling, confidence calibration, and evidential sufficiency assessment: areas where statistical systems struggle. Because of this distinction, benchmarkability and autonomy measure two separate axes with limited overlap. Evaluating AI agents on isolated individual tasks provides a systematically misleading proxy for real-world autonomous capability.

Depending on who you ask, AI agents are either going to replace every white-collar worker in the next 12 to 60 months, or they are just glorified autocomplete tools that won't affect the job market at all. In this essay, we attempt to evaluate the widely repeated claim that 'AI agents will replace a significant percentage of human software engineers'. The main focus is not on proving or disproving anything so much as on critically examining the space (the title may seem biased, but it was added after the essay was completed).

Method

First, we identify a dozen cognitive abilities that are useful in a professional SWE job. There is no canonical source that deems these and only these abilities relevant; they are picked by the author from his SWE background and, as such, are not claimed to be definitive, but they are useful for a discussion nonetheless.
We try to estimate how well humans do in each of these and how well AI does. Next, we estimate how necessary each of the aforementioned abilities is for various SWE tasks on a scale of {none, some, critical}. Finally, we take the dot product of Tasks x Abilities and score human and AI suitability for each task. Lastly, we derive some hypotheses and findings from the results.

The Human Agent

A human software engineer can be modeled as an Autonomous Agent with guardrails. Come to think of it, the agent-oriented framework is as old as work itself. Historically, all agents have been humans. So, the question can be reframed as: which type of agent (AI or Human) is better suited for which type of work? A Human Agent, for the context of this essay, is an employable human.

Human Agent's goals (ignoring exceptions): maximize the likelihood of success*, minimize the likelihood of getting fired.
Human Agent's guardrails: their own interests + other Human Agents, who have similar goals but different roles.

*Success here is a multivariable optimisation over financial gain, career progression, personal values, etc.

The AI Agent vs Human Agent

The abilities picked below are only a subset of the vast spectrum of abilities a human or AI can perform. Many arguably important abilities, such as spatial awareness, are not included due to their perceived low relevance to SWE tasks.

Scale definition: 100% represents the best mechanism for the job that is presently known to exist, the theoretical or practical maximum currently observable in either biology or silicon. 0% represents a complete inability to perform the function natively. Note: the reader can assign their own rating to each ability in the next section.

1. Output speed

The raw rate at which an agent generates usable actions.

Human Agent: Bottlenecked by biological speed. Even the fastest Human Agents write code slower than the slowest AI Agents.
It is improbable that any breakthrough will change this.
Rating: 10% (relative to AI Agents)

AI Agent: Already approaching entire-codebase-equivalent output in seconds, with a very high ceiling via ASICs.
Rating: 100% (currently the best known mechanism for arbitrary generation)

2. Varied effort per unit of output

The ability to spend varied cognitive or computational effort on certain units of work, depending on their importance.

Human Agent: Highly variable. Can potentially think for days or months on a problem whose final output is a simple yes/no token. Budgets thinking based on goals and risks.
Rating: 80%

AI Agent: Largely deterministic; scales with context size and separate test-time compute (reasoning tokens). Lacks any fundamental notion of the 'gravity of the situation'. While reasoning models generate variable-length token streams in preparation of a response, these have an upper bound, and very large contexts have been shown to degrade reasoning abilities (DeepMind Gemini v2.5 Report).
Rating: 20%

3. Ability to recall

How easily the agent retrieves previously known chunks of information verbatim.

Human Agent: Low without tools and quite lossy by default. Can use tools to improve, at the cost of inefficiencies from slow tool use. Recalling a specific block of text verbatim from years ago is nearly impossible without tools (unless the Human Agent is John Von Neumann).
Rating: 20%

AI Agent: Moderate/high without tools, as models have been known to recite entire books verbatim. Lossless with tools, and tool use is fast.
Rating: 95%

4. Working Memory

Temporary, task-relevant storage capacity.

Human Agent: Very limited: a few chunks at a time (see Miller's Law), decaying typically in 30 seconds or less, and affected by emotional and mental state.
Rating: 5%

AI Agent: Very high: multiple books' worth of content at the same time, unaffected by external factors, with even more room to grow.
Rating: 95%

Caveat: We are substituting context size for working memory here, which is not strictly accurate. AI Agents cannot coherently reason through a large number of variables in the context window.

5. Update/Retrieval of Long-term Memory

The ability to update and retrieve from a persistent repository of past experiences, knowledge, and learned patterns.

Human Agent: Practically unlimited; updates continuously and is implicitly filtered by perceived importance, emotional state, etc.
Rating: 85%

AI Agent: Frozen in weights, with no currently known mechanism for continuous updates. Can use external sources to some degree, but they come with their own issues, such as context blow-up.
Rating: 5%

The most commonly suggested solution to long-term memory is to put the relevant things into the context window of the AI Agent. While this works for simpler cases, this approach has nasty failure modes:

Context curation itself soon becomes a bottleneck. It is intractable to continuously generate a hierarchy of instructions tailored to each task and supply it with filtered data that is both sufficient and concise. Any outdated data in the context window is a liability, as is any even slightly vague instruction.

Even if context curation were solved, long-term memory is by nature temporal and contextual. E.g., did Steve Rogers and Tony Stark have a falling out? No if you asked before the Civil War arc, yes if you asked after, and also no outside the MCU context, since they are fictional characters that exist only in the MCU. More concretely, a design can be good in one context and bad in another; it is impossible to evaluate things without knowing what is relevant in their specific context.

Another workaround (particularly for skills) is to create files storing the things the AI Agent learned at the end of each run.
While this works for well-defined tasks, it is also exposed, to a lesser extent, to the failure modes mentioned above as the number of skills and the size of each skill grow.

6. Confidence Calibration

The ability to accurately determine what the agent knows and what it doesn't, and to assign confidence levels to retrieved information.

"It ain't what you don't know that gets you into trouble. It's what you know for sure that just ain't so." — Mark Twain

Human Agent: Very good. Can usually tell what they know and what they don't, or whether a thing happened or did not happen. Can assign confidence levels to memories. Affected by cognitive biases and emotional states.
Rating: 80%

AI Agent: No native Metamemory. While the LLMs that power AI Agents output statistical probabilities (logprobs), this is distinct from epistemic certainty. There are ways to approximate it via constant grounding to an external source, such as RAG or knowledge graphs, but those bring their own limitations in terms of accuracy and context-window budget, and, more importantly, the AI Agent is unable to decide which version is true when presented with contradictory information in the context.
Rating: 5%

This last limitation in particular produces the most catastrophic failure modes and is arguably the biggest hurdle to mass deployment of AI agentic systems (author's view). Simply contradicting an LLM causes it to abandon its stance in a significant percentage of cases (paper 1) (paper 2).
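To make "calibration" concrete: an agent is well calibrated when, among all the answers it gives with roughly 70% confidence, about 70% are actually correct. A minimal sketch of Expected Calibration Error (ECE), a standard way to measure this, over made-up (confidence, correctness) pairs; the data and function name here are illustrative, not from the essay:

```python
# Expected Calibration Error (ECE): bin predictions by stated confidence,
# then compare each bin's average confidence to its actual accuracy.
# All (confidence, correct) pairs below are fabricated for illustration.

def expected_calibration_error(preds, n_bins=10):
    bins = [[] for _ in range(n_bins)]
    for conf, correct in preds:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, correct))
    ece = 0.0
    total = len(preds)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        ece += (len(b) / total) * abs(avg_conf - accuracy)
    return ece

# A reasonably calibrated agent: stated confidence tracks correctness.
calibrated = [(0.9, 1), (0.9, 1), (0.9, 1), (0.9, 1), (0.9, 0),
              (0.5, 1), (0.5, 0)]
# An overconfident agent: always ~95% sure, right about half the time.
overconfident = [(0.95, 1), (0.95, 0), (0.95, 1), (0.95, 0),
                 (0.95, 0), (0.95, 1), (0.95, 0)]

print(expected_calibration_error(calibrated))     # low ECE
print(expected_calibration_error(overconfident))  # high ECE
```

The point of the sketch: logprobs give you a confidence number for free, but nothing guarantees that number tracks reality; calibration is a property you have to measure and train for separately.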
Some real-world failures:

Air Canada's chatbot hallucinated a policy; the passenger sued, and Air Canada lost (2024).
Cursor AI's support bot hallucinated policies (2025).
Deloitte submitted a 237-page, AU$440,000 report to the Australian Department of Employment and Workplace Relations with a fabricated court case extract, including hallucinated quotes attributed to a real Australian judge (2025).
Over 1000 documented legal decisions in cases where generative AI produced hallucinated content.

All of these happen for the simple reason that LLMs (and by extension AI Agents) do not have a solid layer of factuality to fall back onto, especially when confronted with contradicting goals. In general, any system that learns from statistics can only predict the likelihood of what is correct based on the learned statistics. A rather amusing manifestation of this is a SOTA coding model stubbornly denying its own existence by overwriting your config to a previous-gen model, because its version of 'truth' comes from statistics in the training data, and when it was trained, its own name was not present in the training data.

7. Performance Consistency

How consistent the agent's performance is.

Human Agent: Can be highly inconsistent. Affected by physical and emotional states, by unpredictable external triggers (such as the inability to focus in a noisy environment, or a song reminding them of a breakup), or even the time of day.
Rating: 20%

AI Agent: Unaffected by external stimuli; performs at a constant level at all times. Deterministic output under the same prompt at temperature 0.
Rating: 100%

8. Ability to generalize

The capacity to apply learned concepts and patterns to novel domains or unseen edge cases.

Human Agent: Extremely high; the gold standard. Can learn from very few examples. This capacity for generalized learning also includes absorbing AI Agents into their toolkit.
Rating: 95%

AI Agent: Unpredictable and training-dependent.
Strong at interpolating within the training distribution; falls off sharply beyond it. Limited generalization via in-context learning.
Rating: 30%

9. Causal Understanding

Possession of an internal cause-effect world model (see Judea Pearl's Ladder of Causation).

Human Agent: Possesses a grounded causal model of the world that can be used to simulate situations and build hypotheses. This world model is refined iteratively and independently over very long time horizons. Operates on all 3 layers of the causal hierarchy.
Rating: 90%

AI Agent: Possesses a linguistic world model based on statistical association. Cannot natively simulate counterfactual physics or logic. Operates only on the first (Association) layer of the causal hierarchy.
Rating: 5%

Try this example: "Imagine a world where time runs backwards. You have just landed in this world and see a lighter about to set fire to a paper. What happens next?" A Human Agent can simulate this situation mentally and respond with 'the lighter goes out, and the paper never catches fire'. An AI Agent will likely respond with a dramatic description of smoke and ashes and how the paper emerges from the ashes.

10. Evidential Sufficiency Assessment

The ability to tell whether the agent has enough information to act confidently.

Human Agent: Good at detecting the more obvious information gaps; struggles with highly nuanced situations. Exceptional calibration is quite rare.
Rating: 65% (because humans are still prone to jumping to solutions)

AI Agent: Autoregressive models are fundamentally biased toward generating an answer. They do not natively experience the "absence of evidence" and will often invent parameters or assume constraints rather than halting to request more data. Arguably one of the highest-stakes failure modes, leading to hallucinations or confident generation of false information.
Rating: 10%

Try this example: "There are 20 people in a room.
Each person gave a rose to the person on their right, if there is someone to their right. How many people didn't get a rose?" A Human Agent would identify that the question can't be answered without knowing the seating topology. An AI Agent will almost certainly assume the people are in a single line and answer 'one'.

11. Hierarchical Thinking

The ability to organize things (such as concepts, goals, causes) across multiple levels simultaneously, and to move deliberately between those levels.

Human Agent: Highly elastic. Can seamlessly traverse abstraction layers. Widely uses this, along with abstraction, to make up for limited working memory.
Rating: 90%

AI Agent: At its core, strictly an autoregressive mechanism (every new output on the right-hand side is strictly a function of what is currently present on the left-hand side). Does not organize things hierarchically. Can use external harnesses but is vulnerable to goal drift and error cascades. Unable to autonomously 'zoom out' and reassess plans when trapped in a lower-level failure. Unable to come up with a 'measured response' to novel situations.
Rating: 15% (AI Agents often make choices that are locally reasonable but do not fit the overall high-level goal)

This is another factor that makes fully autonomous AI Agents challenging. AI Agents unable to prioritize among never breaking the role, always validating the user, and doing the responsible thing have resulted in human deaths (warning: a fairly disturbing read for some; discusses suicide extensively).

12. Social Reads

The ability to infer useful information from human non-verbal cues.

Human Agent: Highly variable; on average a moderate ability to read other humans. More adapted to responding subconsciously, but limited unless trained.
Rating: 50%

AI Agent: Zero at baseline, but this appears highly trainable with a super-human ceiling.
For instance, a model specifically trained on all CEO voice and video transcripts can almost certainly learn to outperform humans at detecting potential risks. There are potential ethical concerns in such training.
Rating: 5% (ceiling nearly 100%)

Nuance: the Human Agent's tool use includes the AI Agent, not vice versa.

What do SWEs do

Obviously this varies widely depending on the domain, the team's immediate goals, expertise, and seniority.

Roadmapping: Deciding which tasks to undertake out of a large addressable space. Requires optimizing within a uniquely constrained space of multiple competing priorities. Naturally requires Hierarchical Thinking (e.g. how to rank two competing priorities), Causal Understanding (e.g. if we pull the plug on product X, what would be the consequences? Can I sell it to the leadership?), and Evidential Sufficiency (e.g. do we have enough data to conclude X?), among other things.

Alignment: Making sure different teams share a common understanding and plan to drive a larger outcome. Requires Confidence Calibration (e.g. am I accurately relaying the current state of affairs? Am I flagging the things I am uncertain about?), Updating Long-term Memory, and Social Reads.

Architecture/Design: Fleshing out systems that adequately satisfy the specific contextual requirements. Requires Causal Understanding (e.g. simulating failure modes), the Ability to Generalize lessons learned from previous experience, and Evidential Sufficiency Assessment (e.g. do we have enough information to decide on the architecture?). Benefits from Working Memory. That said, it is arguably possible to perform well on system design tests, because a large majority of design choices and lessons fall into a limited set of variations and can be memorized.

Coding: Generating code that complies with the programming language and follows good practices. Hugely benefits from Working Memory, Output Speed, and Performance Consistency.
Since it is relatively easy to generate a very large corpus of coding training data, LLMs can leverage this effectively.

Code reviews: Can be sub-divided into reviewing for code quality (e.g. derived from well-agreed-upon best practices) and reviewing for specific constraints/business requirements (e.g. mocking this method here passes the tests, but it'll break in production due to XYZ). Benefits from Ability to Recall, Working Memory, and Evidential Sufficiency (e.g. the code is correct and the tests pass, but is it actually accomplishing the real intent?).

Meetings: Real-time state synchronization between agents/teams. Requires Updating Long-term Memory, Confidence Calibration (am I providing an accurate state to the other party, whose actions will depend on it?), and Hierarchical Thinking (e.g. how to fit in a conflicting ask that doesn't align with the original roadmap). Benefits from Social Reads.

Debugging and maintenance: Spans a huge range of complexity, from the low end (the code threw an exception because it could not find a Python module) to the high end, which requires deep systems expertise (finding the root cause of a hard-to-reproduce SEV among 5000 PRs or 350 config changes). Working Memory and Ability to Recall serve quite well on the lower end, but the higher end requires Causal Understanding and Evidential Sufficiency.

Key Insight 1: All of these tasks are interdependent within the context of a job and are impossible to decouple to any significant degree, which places a hard constraint on the agent: it has to carry a persistent and continuously updating state-of-the-world across tasks. This is the reason we have not assigned weights to them.

Key Insight 2: This interdependency introduces a spectrum into the replacement discussion: improvements in a minority of tasks lead to augmentation, improvements across a majority lead to role repurposing, and only near-parity across all tasks constitutes genuine replacement.
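The Tasks x Abilities scoring from the Method section reduces to a dot product per task. A minimal sketch using a trimmed-down matrix (the ability ratings are the essay's illustrative numbers, the weights follow its 0/1/3 scale, and the `suitability` helper is hypothetical, not from the original interactive page):

```python
# Suitability score per task = dot(ability ratings, task weights).
# Ratings are 0-100; weights: 0 = irrelevant, 1 = helpful, 3 = critical.
# Only four of the twelve abilities are shown, for brevity.

human = {"output_speed": 10, "working_memory": 5,
         "confidence_calibration": 80, "causal_understanding": 90}
ai    = {"output_speed": 100, "working_memory": 95,
         "confidence_calibration": 5, "causal_understanding": 5}

tasks = {
    "coding":      {"output_speed": 3, "working_memory": 3,
                    "confidence_calibration": 1, "causal_understanding": 1},
    "roadmapping": {"output_speed": 0, "working_memory": 1,
                    "confidence_calibration": 3, "causal_understanding": 3},
}

def suitability(abilities, weights):
    # Dot product over the abilities the task cares about.
    return sum(abilities[a] * w for a, w in weights.items())

for task, weights in tasks.items():
    print(task, "human:", suitability(human, weights),
          "ai:", suitability(ai, weights))
```

Even with this reduced matrix, the AI Agent scores far higher on coding (weights dominated by output speed and working memory), while the Human Agent dominates roadmapping (weights dominated by calibration and causal understanding), which mirrors the essay's overall result.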
Putting It Together

How they stack up: the table below gives the task weights used for scoring (0: irrelevant, 1: helpful, 3: critical). In the interactive version of this post, both the base ability ratings and these weights are adjustable, and the final suitability scores recalculate in real time.

Cognitive Ability       | Roadmapping | Alignment | Architecture | Coding | Reviews | Meetings | Debugging
Output Speed            | 0 | 0 | 0 | 3 | 1 | 0 | 1
Effort per Output       | 1 | 1 | 1 | 1 | 1 | 0 | 1
Ability to Recall       | 1 | 1 | 3 | 3 | 3 | 1 | 3
Working Memory          | 1 | 1 | 3 | 3 | 3 | 1 | 3
Update LTM              | 3 | 3 | 1 | 1 | 1 | 3 | 1
Confidence Calibration  | 3 | 3 | 3 | 1 | 3 | 3 | 3
Performance Consistency | 1 | 1 | 1 | 3 | 3 | 1 | 1
Ability to Generalize   | 3 | 1 | 3 | 1 | 1 | 1 | 1
Causal Understanding    | 3 | 1 | 3 | 1 | 1 | 1 | 3
Evidential Sufficiency  | 3 | 3 | 3 | 1 | 3 | 1 | 3
Hierarchical Thinking   | 3 | 1 | 3 | 1 | 1 | 1 | 1
Social Reads            | 1 | 3 | 0 | 0 | 0 | 3 | 0

Scoring method: dot product of Task x Ability.

The Result: Who is better suited? Calculated based on the assumptions above.

Findings

The most interesting observation: the top 3 tasks that AI Agents do well in (coding, boilerplate code reviews, basic debugging) are also easy to create benchmarks for. We clearly did not start out by using benchmarkability as a factor. Is that just a coincidence? Let's see.

A benchmark requires: a) a well-defined scope, b) the ability to make all relevant information explicit, c) a correct answer, or a limited set of correctness criteria, and d) a limited, or at least tractable, number of input and output variables. Interestingly, all the criteria that make a domain easy to benchmark ALSO make the domain easy to train with the statistical methods LLMs are trained with. In other words, benchmarkability and trainability are correlated. We present the following hypotheses:

i. Benchmark-Trainability Coupling: tasks that are benchmark-friendly are almost always training-friendly (with minority exceptions, such as systems that lack statistical regularities or chaotic systems).
This explains the benchmark-reality gap: continuous benchmark saturation without corresponding real-world payoffs.

ii. The set of non-benchmark-friendly tasks is the set of tasks that are hard to isolate/decouple from their environment. Because of this hard-to-isolate nature, statistical systems cannot effectively train on them. Human Agents leverage a combination of abilities, such as Generalization, Confidence Calibration, Evidential Sufficiency Assessment, Long-term Memory, a Causal Model of the world, and Hierarchical Thinking, to make autonomous progress in such domains.

Benchmarking AI Agents on individual tasks is systematically misleading from an autonomous-agent perspective, since benchmarkability and autonomy measure two separate axes with limited overlap.

The obvious thing that is hard to miss: AI Agents are exceptionally good at coding due to the unique requirements of coding tasks and the relatively limited number of patterns to learn. It can also be inferred that the coding ship has sailed; it is unlikely that Human Agents will ever beat AI Agents at pure coding tasks (at least the greenfield ones with no complex dependencies).

A not-insignificant percentage of all review comments concern standardized best coding practices. This seems like another low-hanging fruit for AI adoption: a Human Agent reviews code only after a first round of review/revision by an AI Agent.

Architecture/Design is a special case where AI Agents can add a lot of value without being fully autonomous, by virtue of having learned a large number of patterns.

A Human Agent + AI Agent setup will almost always far outperform either a Human Agent or an AI Agent alone for the foreseeable future, because they appear to have complementary abilities. So both 'AI is here to stay' and 'AI is not replacing humans anytime soon' can be true.

There is also a horizontal, across-the-board improvement in information retrieval across job functions.
This undoubtedly makes every Human Agent more efficient. At the same time, the legacy Human Agent moats that were created purely by 'learning 10 tricks about an arcane thing nobody else bothered to learn' are disappearing rapidly.

What would add the biggest gains to AI Agent abilities

If the model above is accurate, Confidence Calibration and Evidential Sufficiency Assessment would add the biggest gains to the AI Agent's real-world employability. We noted above that Confidence Calibration is critical for 6 out of 7 tasks, because no meaningful high-stakes work can be done with a colleague that has no solid grounding in reality, nor the ability to assess the accuracy of their own claims. Perhaps this is why we see millions of demos that people claim to write in 30 minutes, but very few autonomous, production-grade systems entirely driven by AI Agents.

Likewise, the lack of Evidential Sufficiency Assessment makes AI Agents assume and hallucinate. If you gave a Human Agent a vague task (like most system design interviews), they would ask a lot of clarifying questions before moving to implementation (in fact, the quality of these questions is how you evaluate a candidate). The same task given to an AI Agent results in instant action. Autoregressive models do not natively experience the "absence of evidence" and will often invent parameters or assume constraints rather than halting to request more data.

Interesting Second-Order Effects

For software engineering specifically, the speed improvements in coding have some interesting second-order effects.

Cognitive Overload and Overbuild

As a general systems engineering principle, if you improve something that wasn't a bottleneck, you either improve nothing (if the improvement was downstream of the bottleneck) or make the bottleneck worse (if the improvement was upstream and the bottleneck is now more choked). Sure, humans can generate code 100x faster with AI, but how long can this output be sustained before becoming incoherent?
Can the reviewers keep up with this pace? Does cognitive capacity scale equally easily? This creates two new and interesting problems:

Cognitive overload on the 10x/100x engineer: Precisely because of the shortfall in the AI Agent's abilities, the humans using the AI are forced to absorb increasingly more context and increasingly more responsibility (see HBR: AI doesn't reduce work, it intensifies it). Human cognition has its limits, and this arrangement keeps pushing against them.

Overbuild: This is a rather devious manifestation of the tools upgrade combined with incentives. If you are an engineer at a large tech company and are incentivized to produce more code, the path of least resistance, given the new tools at your disposal, is to produce a very large number of low-stakes, low-importance PRs. Low-stakes diffs are easy to get reviewed and landed (because nobody cares), and hundreds of such diffs can be landed without spending significant cognitive budget. The result is an overbuild of internal tools, dashboards, bots, and the like that nobody ever wanted and nobody will ever use, just good enough to claim impact in a performance review. Lots of motion, little progress.

Maintenance Overhead

The number of issues agents have to debug is roughly proportional to the codebase size (more code = more liability). So when code is being committed at 10x speed but issues are being resolved only marginally faster, you accumulate a backlog of debugging and maintenance. It is plausible that the SWE role will drift towards LLM-steward/fixer-of-issues, and that a lot of people will want to do a 'complete rewrite' of whatever they are working on more frequently.

Junior Developers

There is a popular mainstream hypothesis that says 'junior developers will be more affected because they mostly code'. This appears to be quite misleading. Junior developers have historically coded the most because someone had to code.
There is no inherent connection between junior developers and coding. With coding out of the way, junior developers will take on other roles. You could argue that historically, coding tasks gave them a ladder to gain experience, but there is no reason to think that typing code by hand is an unavoidable step towards future growth.

New Ventures and the Lump of Labour Fallacy

As we noted earlier, even the smartest Human Agents are constrained by their output speed. Using AI Agents strategically alleviates this bottleneck, which gives the Human Agent a lot more leverage. Everyone with an idea, knowledge, and the willingness to carry an enormous amount of context can build businesses much more cheaply to go after incumbents. This should result in many more businesses forming (which is already happening: 2026 Global Intelligence Crisis). The successful ones will need to rebuild the entire stack of whatever they are working on, an enormous amount of new code and context, which will require hiring. Perhaps this is a contributor to why work is neither fixed nor finite (the Lump of Labour fallacy): most human work is not about achieving a shared goal, but about getting a one-up on each other.

Conclusion

The original claim, that AI will replace any meaningful percentage of SWEs in the foreseeable future, does not appear feasible unless there are groundbreaking improvements in each of the following: Confidence Calibration, Evidential Sufficiency Assessment, Causal Modelling, and Update/Retrieval of Long-term Memory. The currently popular approaches of scaling LLMs with more data and more compute seem unlikely to address these shortfalls, as they target fundamentally different mechanisms.

This is by no means a pessimistic view of AI developments. AI Agents look set to continuously improve at task handling. We are simply pointing out that the real alpha is not in task-handling proficiency.