
LLMs won't save labor when you use them like this

The developer tools community remains active in Twitter’s dying light. On Tuesday, August 27th, in a now-deleted tweet and blog post, Matteo Collina boasted benchmarks across several JS frameworks that showed his fastify outperformed alternatives in server-side rendering performance:

A graph of SSR performance showing fastify on top

There was just one problem: the implementations were written by a large language model. In a paragraph that will live as a cautionary tale, Collina discloses (my emphasis):

For this, we needed to generate a non-trivial sample document that includes a large number of elements, to have a very large page for the test, and consequently have more running time to capture each libraries’ performance. So, we asked an LLM to write some code to draw a spiral into a container using divs[…]

For the authors of the other frameworks, the benchmarks did not pass the smell test. Svelte’s Rich Harris was the first I saw on the scene, and he made a single-line fix that immediately changed the outcome of the benchmarks:

Revised benchmarks that place Svelte even with fastify in its current version, and much faster in Svelte 5

tiles = [...tiles, { x, y, id: idCounter++ }]

became, in Rich’s pull request,

tiles.push({ x, y, id: idCounter++ })

In the Svelte implementation only, spread was used to add tile instances to the tiles array. This has performance implications: on each iteration of the loop that creates tiles, a brand-new array is allocated and every existing element is copied into it, making the loop quadratic and hobbling Svelte's numbers. Using push(), by contrast, mutates the existing array in place, incurring no per-iteration copy.

Both cases are, strictly speaking, working code. But one works far differently than the other.
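To see the difference concretely, here's a minimal sketch (not the actual benchmark code; the tile shape and loop count are illustrative) contrasting the two ways of growing the array:

```javascript
// Growing an array of n tiles two ways, to show why the spread
// version is quadratic overall while push() is not.

function withSpread(n) {
  let tiles = [];
  for (let i = 0; i < n; i++) {
    // Allocates a new array and copies all prior elements every time.
    tiles = [...tiles, { x: i, y: i, id: i }];
  }
  return tiles;
}

function withPush(n) {
  const tiles = [];
  for (let i = 0; i < n; i++) {
    // Mutates the existing array in place; amortized constant time.
    tiles.push({ x: i, y: i, id: i });
  }
  return tiles;
}

const n = 20000;

console.time('spread');
withSpread(n);
console.timeEnd('spread');

console.time('push');
withPush(n);
console.timeEnd('push');
```

Run under Node, the spread version's timing grows much faster than push()'s as n increases, even though both return identical arrays.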

Errors in the other implementations surfaced as well, including running React in development mode.

These are the risks of using LLMs for business. Collina and his colleagues have, at this point, lost any labor savings they might have enjoyed from generating this code. They’re reviewing pull requests from aggrieved maintainers of other projects, they’re responding to messages, they’re doing damage control, making apologies… To say nothing of the sector-wide labor costs of other maintainers correcting the record, and the benchmark code.

All because the LLM isn't actually able to apply any sort of judgment or discernment. The process for generating this code, for example, did not catch that the Svelte loop was implemented anomalously compared to the approaches in the other frameworks; each framework's implementation was likely the output of a distinct LLM run, so nothing ever compared them.

But more than that, the LLM is not considering the performance implications of spread vs. push because the LLM cannot consider anything. It’s a loom that generates patterns of information, endlessly.

Sometimes that ends up as working code. Last month, I used Claude to get a demo application working for a class I was teaching on the social consequences of algorithms. I wanted a toy Twitter whose feed the class could manipulate in real time by applying boosts and penalties to certain kinds of content.

Claude had me up and running in half an hour, and everything worked great. To my shock! I'd never before gotten an LLM to generate multiple, interdependent components that worked together successfully. This was very helpful to me, and given my own time constraints, it made the difference between a nice idea and a working demo.

But there’s a big difference between code written for toys and demos and code written for accomplishing a business purpose. In the fastify case, we see that the LLM is no match for the judgment of real humans.

There are obvious uses and benefits to a technology that can extrude workable patterns of information, especially code. But uncritical use of that code is going to create a real mess.

There’s a whole job category implied by this LLM future:

People who, like these framework maintainers, apply their experience and judgment to remediating LLM lapses.

So much for that labor savings.
