Be Unbenchmarkable in the Agent Era

As I think more about the Agent Era, I increasingly believe that the old path of becoming a “standard competent person” is losing expected value. This is not because intelligence becomes worthless, nor because human work disappears. The reason is structural:

\[\text{benchmarkable} \Rightarrow \text{optimizable} \Rightarrow \text{automatable} \Rightarrow \text{commoditized}\]

I used to think the boundary of automation was the boundary of verbalization: if a task can be clearly described, then perhaps it can eventually be solved by either humans or AI. But in the agent paradigm, the sharper condition is whether a task can be converted into an agent-friendly environment. The bottleneck shifts from execution to communication, specification, and evaluation.

Language gives agents instructions. Benchmarks give agents gradients.


In the earlier stage, AI was often treated as a helper: a search interface, knowledge retriever, or assistant for local tasks. In the agent stage, the relation becomes more ambiguous. Humans increasingly provide goals, contexts, constraints, and feedback, while AI systems execute, search, and iterate.

A task becomes vulnerable once it can be wrapped into a loop:

\[\text{task} \rightarrow \text{environment} \rightarrow \text{action space} \rightarrow \text{feedback} \rightarrow \text{optimization}\]

The central point is not whether the task uses human language, such as text, code, or mathematics, nor how difficult the task is. The central question is whether the task is benchmarkable. Human-readable text helps with auditing and interpretability, but AI only requires a medium that preserves stable, evaluable structure. Once a task can be evaluated, searched, optimized, and iterated, agents can enter the loop.
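A minimal sketch of this loop, under illustrative assumptions: the hidden bit pattern `target`, the scoring function `benchmark`, and the hill-climbing routine `optimize` are hypothetical stand-ins, not a description of any real agent system. The optimizer never inspects the target directly. It only receives a score, and that is enough for it to converge.

```python
import random

def benchmark(candidate: list[int], target: list[int]) -> int:
    """Feedback (hypothetical): count of positions where candidate matches target."""
    return sum(c == t for c, t in zip(candidate, target))

def optimize(target: list[int], steps: int = 2000, seed: int = 0) -> list[int]:
    """Hill climbing: flip one random bit, keep the change if the score does not drop."""
    rng = random.Random(seed)
    current = [0] * len(target)           # environment: a fixed-length bit string
    score = benchmark(current, target)
    for _ in range(steps):
        proposal = current.copy()
        i = rng.randrange(len(proposal))  # action space: flip any single bit
        proposal[i] = 1 - proposal[i]
        new_score = benchmark(proposal, target)
        if new_score >= score:            # optimization: accept non-worsening moves
            current, score = proposal, new_score
    return current

target = [1, 0, 1, 1, 0, 0, 1, 0]         # illustrative task, chosen arbitrarily
solution = optimize(target)
print(benchmark(solution, target), "of", len(target))
```

Nothing in the loop requires the task to be easy or to be expressed in language. It only requires a score stable enough to climb.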

This changes the location of human scarcity. The scarce part is no longer pure execution, or even technical ability in the ordinary sense. It is the capacity to define what matters before the task has fully stabilized.


1) Benchmarkable Means Solvable

A benchmark is not merely an evaluation metric. It is a mirror of a task’s optimization loop.

Once a benchmark exists, the target has become partially explicit. Once the target is explicit, optimization can begin. Once optimization is well-defined, agents can scale the search process far beyond human throughput.

This explains why many domains previously considered resistant to automation are becoming partially solvable: mathematics, software engineering, law, research, visual art, audio production, and video generation.

This does not imply that every component of these jobs disappears. Real work is a mixture of benchmarkable and unbenchmarkable subproblems: execution, judgment, communication, responsibility, taste, context, and coordination. But the benchmarkable portion is often the first part that human organizations reward, precisely because it is legible, measurable, and scalable.

A coding task is benchmarkable. A legal memo is partially benchmarkable. A math proof becomes benchmarkable when the verifier is formalized. A research direction becomes more benchmarkable when the field stabilizes around datasets, metrics, and leaderboards. An image becomes more benchmarkable when preference models, aesthetic predictors, and human feedback loops approximate preferred taste.

The moment a field develops a stable evaluation protocol, it creates the conditions for AI systems to optimize it. Benchmarkability is therefore both a mechanism of progress and a mechanism of commoditization. It makes improvement measurable, but by doing so, it also makes improvement automatable.

The fragility of standard excellence is that it is excellence under a known metric. Known metrics are exactly what optimization systems are built to consume.


2) Standardized Excellence Decays

The Agent Era does not punish greatness. It compresses the value of standardized competence that was previously perceived as excellence.

In the previous era, being above average on standardized dimensions was often enough to generate a stable trajectory:

\[\text{education} \rightarrow \text{credential} \rightarrow \text{organization} \rightarrow \text{salary} \rightarrow \text{stability}\]

This path was historically rational. When formal education, institutional access, and technical literacy were scarce, credentials carried strong signal. Standardized filters selected for intelligence, discipline, memory, compliance, and pressure tolerance. For a long period, passing these filters had real structural value.

But every signal decays when it becomes common.

When more people acquire the same credential, the credential becomes less informative. When more people are trained to solve predefined problems, “being good at solving predefined problems” becomes less scarce. The marginal standardized achiever carries less signal because the profile itself has become common.

The deeper issue is structural. Standardized excellence is benchmarkable excellence. It is excellence inside a predefined reward function. The problem is given externally, the rubric is specified externally, and the score is assigned externally.

This works well in closed environments. It works less well in open-ended reality. It works even worse in the Agent Era, because agents are also benchmark optimizers, but with extreme patience, memory, parallelism, and iteration speed.

The old implicit contract was simple:

\[\text{perform well} \Rightarrow \text{enter the system} \Rightarrow \text{receive stability}\]

That contract was historically contingent, not universal. When the environment changes, people who continue to optimize for the old contract silently carry the societal tail risk. The pain is not only economic; it is also epistemic: the map that once organized life no longer matches the territory.

A person can still win inside established institutions, but the expected value of that victory decreases. The path becomes crowded, the signal weakens, and the reward compresses. What remains is often a sequence of increasingly competitive status ladders:

\[\text{credential rank} \rightarrow \text{institution rank} \rightarrow \text{organization rank} \rightarrow \text{compensation rank} \rightarrow \text{lifestyle rank}\]

The loop does not converge. It only changes metrics.


3) The Certainty Trap

Seeking certainty is natural. Humans are risk-averse social animals. Stable roles, stable titles, stable income, and stable identities feel safe because uncertainty often carries real cost. The desire for certainty is not irrational at the biological level.

But certainty and value are not the same thing.

Certainty is where the reward function is already known. It is where competition has already arrived, benchmarks have matured, and optimization systems can be deployed. A stable metric offers comfort, but it also creates exposure. If one’s value is defined by a metric, then one’s value is vulnerable to any system that can optimize that metric.

Repeated success under explicit metrics can hide a larger tail risk: local validation is mistaken for global robustness. A person can become highly adapted to a particular evaluation system while remaining fragile outside it.

When the mismatch appears, the common response is not to exit the logic, but to migrate into another metric:

\[\text{credential} \rightarrow \text{title} \rightarrow \text{license} \rightarrow \text{promotion} \rightarrow \text{status track}\]

This is merely metric substitution driven by inertia, not a solution to the real problem.

The pursuit of certainty is dangerous because a given metric is never neutral. Whoever defines the metric partially defines the person who optimizes for it. A benchmark is not only an evaluation device; it is also a behavioral shaping mechanism.

Stability itself is not the problem. In many cases, stability is rational. Random movement is not better; unstructured chaos destroys people faster than stagnation. The real distinction is not stability versus movement, but robust positioning and pivoting versus hiding inside an unverified, familiar metric.

A good position remains adaptive under distribution shift. A fragile position only looks safe because the environment has not changed yet.


4) Signals Are Priors

Signals still matter, but their function must be understood correctly.

A university name, company logo, medal, famous advisor, prestigious lab, or elite network is not destiny. It is a Bayesian update.

\[P(\text{high ability} \mid \text{signal}) > P(\text{high ability})\]

A strong university logo increases the posterior probability that a person passed a difficult filter. A strong company logo suggests survival inside a competitive environment. A prestigious research lab implies exposure to better priors, stronger taste, and higher ambition. But none of these signals defines who a person is.

They only adjust priors.

The real value of elite environments is not primarily the curriculum. Most knowledge is increasingly available for free. The deeper value is the change in comparison class. A high-signal environment exposes people to stronger priors, higher agency, stranger ambitions, and lower tolerance for mediocrity. The aura matters because humans and institutions are Bayesian machines. They rarely evaluate from scratch; they update from priors.

Signals open doors, expand opportunity space, and increase the probability of being taken seriously. But they do not automatically create direction, taste, courage, judgment, or the promise of a better life. Once a signal becomes a commodity, it degenerates into just another benchmark.
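As a worked illustration with made-up numbers, Bayes' rule shows both why a signal helps and why it decays once many people hold it. The 10% base rate and the likelihoods below are assumptions chosen for arithmetic convenience, not estimates of any real population, and `posterior` is a hypothetical helper name.

```python
def posterior(prior: float, p_signal_if_high: float, p_signal_if_low: float) -> float:
    """Bayes' rule for P(high ability | signal)."""
    evidence = p_signal_if_high * prior + p_signal_if_low * (1.0 - prior)
    return p_signal_if_high * prior / evidence

# Assumed numbers: 10% of people are high ability; 60% of them hold the signal.
# Scarce signal: only 10% of everyone else holds it.
scarce = posterior(prior=0.10, p_signal_if_high=0.60, p_signal_if_low=0.10)
# Commoditized signal: 50% of everyone else now holds it too.
common = posterior(prior=0.10, p_signal_if_high=0.60, p_signal_if_low=0.50)

print(f"scarce signal: {scarce:.2f}")  # 0.40
print(f"common signal: {common:.2f}")  # 0.12
```

Under these assumptions the scarce signal lifts the estimate from 0.10 to 0.40, while the commoditized one barely moves it. In both cases the signal shifts a probability and settles nothing about the individual.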

The correct use of signals is instrumental. They are useful when they increase access, density, and optionality. They become dangerous when they become identity tags.


5) Uncertainty Is Where Value Lives

The central mistake of standardized thinking is the overvaluation of certainty.

Value and irreplaceability rarely live where the reward function is already explicit. Certainty is where competition has already arrived. Excess return in markets, research, careers, and culture comes from reducing uncertainty that others cannot or will not touch.

Great research is one example. Its essence is not method, but entropy reduction. After reading a good paper, the reader’s uncertainty about an important object decreases. The method can be theoretical, empirical, experimental, computational, or aesthetic. What matters is the delta:

\[\Delta H = H(\text{world before}) - H(\text{world after})\]

A trivial paper reduces no meaningful uncertainty, or addresses a problem no one cares about. A technical paper may reduce local uncertainty. A great paper reduces important uncertainty. A field-opening paper changes what uncertainty people consider worth reducing.
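The delta above can be made concrete with Shannon entropy over a small set of competing hypotheses. This is only a sketch: the four hypotheses and their probabilities are invented for illustration, and `entropy_bits` is a hypothetical helper name.

```python
import math

def entropy_bits(probs: list[float]) -> float:
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Assumed reader beliefs over four competing hypotheses.
before = [0.25, 0.25, 0.25, 0.25]  # before reading: all equally plausible
after = [0.5, 0.5, 0.0, 0.0]       # after reading: two have been ruled out

delta_h = entropy_bits(before) - entropy_bits(after)
print(delta_h)  # 1.0
```

One bit of uncertainty is removed here. The number says nothing about whether the hypotheses were worth distinguishing, which is the part the entropy formula leaves to judgment.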

This is why problem definition dominates execution in research. A perfect solution to a mediocre problem remains mediocre, while a partial solution to a great problem can be historically meaningful. Most people focus on solution quality because it is easier to evaluate, but in high-variance domains, the problem itself is the main source of expected value.

Life also resembles stochastic gradient descent (SGD). The full objective is unknown. Gradients are sampled from local experience. Updates happen under noise. People get trapped in basins, overfit to early rewards, and mistake local smoothness for global truth.

This is the sense in which choice can matter more than effort. Under power-law structure, direction is often more important than step size.

Effort is the step size. Taste is the direction, a meta-prior compressed from what one has read, thought, experienced, and repeatedly mispredicted. Courage is the willingness to move under uncertainty. Reflection is the mechanism that prevents blind descent into bad minima.

High effort with bad taste converges quickly to a local optimum. High taste without effort remains imaginary. Their combination compounds.
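A toy sketch of the basin picture, under loud assumptions: the one-dimensional `loss` landscape is invented, the starting point stands in for direction, the number of steps stands in for effort, and plain deterministic gradient descent is used instead of SGD so the result is reproducible.

```python
def loss(x: float) -> float:
    """Hypothetical landscape: a shallow basin near x = +1, a deeper one near x = -1."""
    return (x * x - 1.0) ** 2 + 0.3 * x

def grad(x: float) -> float:
    """Derivative of loss with respect to x."""
    return 4.0 * x * (x * x - 1.0) + 0.3

def descend(x: float, steps: int, lr: float = 0.05) -> float:
    """Plain gradient descent; `steps` plays the role of effort."""
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Much effort from a poor starting point versus little effort from a better one.
high_effort_bad_start = descend(0.5, steps=5000)
low_effort_good_start = descend(-0.5, steps=50)

print(loss(high_effort_bad_start) > loss(low_effort_good_start))
```

In this landscape, a hundred times more steps from the wrong side only settles more precisely into the worse basin. Extra effort refines the position inside a basin. It does not change which basin was entered.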


6) Evaluation, Permission, and Agency

External evaluation is useful when the evaluator is better calibrated than the evaluated.

A student should learn from a good teacher. A junior researcher should absorb feedback from a strong advisor. A young engineer should listen to excellent builders. Beginners often need borrowed judgment from people who have already paid the cost of being wrong.

The problem begins when external evaluation becomes the permanent source of self-worth.

Schools, companies, markets, and audiences all provide evaluation signals: ability, level, price, and status. Over time, the evaluator becomes internalized. Even when nobody is watching, the person continues to ask for permission.

This is the hidden structure of many anxieties. The pain is not only failure; it is being trapped inside an evaluation system one no longer believes in but still obeys.

The solution is not to reject evaluation. That would be naive. The right question is how much weight each evaluator deserves.

A random person’s disapproval should have a near-zero gradient. A precise criticism from a well-calibrated person should have a high gradient. The key is to be tactical about whose judgment deserves weight and whose experience to distill. An institutional rejection should be interpreted through the institution’s objective function. A market signal deserves attention, but not worship.

In the Agent Era, permission-seeking becomes especially costly. It waits for an existing authority to define the game, while the highest-value games are often illegible at birth. If execution becomes cheap across many domains, then the bottleneck shifts toward deciding what is worth executing.

The agent waits for a command. The human problem is to have a command that is actually one’s own.

AI should become an executor of one’s goals, not the environment that assigns them.


7) Taste Is Nonlinear Uncertainty Reduction

Taste is often treated as mysterious, but it can be framed more precisely.

Taste is one’s judgment over what deserves attention: a compressed meta-prior over what is likely to matter, what is aesthetically coherent, what is technically fertile, and what is socially mispriced. It appears in research, startups, art, music, movies, friendships, lifestyles, and almost every repeated allocation of attention.

Taste reduces search cost. Without it, exploration begins almost randomly from every starting point. With it, attention is allocated toward regions with higher expected gain or better initialization. In this sense, taste is a heuristic for nonlinear uncertainty reduction.

\[\text{Taste} \approx P(\text{future value} \mid \text{weak early signal})\]

This is why good taste is hard to benchmark before it becomes legible.

A mature benchmark evaluates what the world has already learned to see. Taste operates earlier. It detects structure while the signal is still weak, noisy, and socially underpriced. It is latent and difficult to formalize, more like a representation than a rule.

To be unbenchmarkable is not to be impossible to evaluate forever. Many things that feel unbenchmarkable today will later be proxied, modeled, optimized, and commoditized. The excess value lies in the interval before the benchmark stabilizes, when the object, metric, and value function are still being formed.

In this sense, taste is broader than any benchmark. It is judgment before the frame has settled: the ability to notice what may matter before it becomes obvious what should be measured.

Good research taste notices a problem before it becomes a field. Good startup taste notices a behavior before it becomes a market. Good artistic taste notices a form before it becomes a style. Good life taste notices a path before it becomes a commodity.

Taste also contains desire. If one genuinely likes a direction, time spent there is not only a cost but also an energy-preserving pleasure. This matters because taste is not only a judgment function, but also an allocation function. It determines where attention can be sustained long enough for hidden structure to become visible.


8) Asymmetric Time and the Cold Start Problem

Talent helps, but talent is not magic. Nobody is born knowing calculus. Nobody is born with mature taste. The important difference is that time does not have equal return for everyone.

For some people, one year compounds into ten years of insight because they occupy the right environment, feedback loop, abstraction level, and energy distribution. For others, ten years produce only one year of growth because the work is low-information and repetitive.

Genius is not only high ability, which is merely the surface observable. Genius is asymmetric time spent on the right things.

Once the first breakthrough happens, the process becomes nonlinear. The person gains better collaborators, better priors, stronger confidence, higher-quality projects, and access to more selective networks. The next unit of time becomes more valuable than the previous one.

This is the real cold start problem of life. Before the first breakthrough, people are mostly evaluated by old signals. After the first breakthrough, they begin to generate their own signal. Before the first breakthrough, others ask where they came from. After the first breakthrough, others ask what they see next.

This is why taste and interest matter so much in the early phase. They allow a person to stay in a high-uncertainty direction long enough for compounding to begin.


9) The New Scarcity

The Agent Era does not eliminate human value. It relocates it.

Known tasks will be executed increasingly well by agents: faster, cheaper, more patiently, and at larger scale. Human scarcity therefore moves upstream: from solving assigned tasks to deciding what should be solved; from producing outputs to judging which outputs matter; from optimizing known metrics to constructing new value functions.

These capacities are hard to measure because they operate upstream of measurement itself.

Benchmarks remain useful as training devices. Signals remain useful as access mechanisms. Institutions remain useful as density machines. Technical skill remains useful as leverage and understanding. But none of them should be confused with the source of long-term differentiation.

To be unbenchmarkable is not to be vague, lazy, or anti-technical. It is to be technical enough to use benchmarks without becoming reducible to them. It means passing filters without worshiping them, learning from evaluation without outsourcing judgment, and developing unique taste before consensus has stabilized.

The Agent Era will make many people more productive. It will also make many people more replaceable, because productivity inside a known frame is exactly what scales.

The relevant question is:

\[\text{Which part of me cannot be compressed into a benchmark?}\]

That part is where the future premium lives.


Yufa Zhou — May 18, 2026