My house was built in 1980. To my knowledge, nobody has died in it. It’s not built on top of a cemetery. There are no hexes or curses that I’m aware of. And I’ve done renovations that have made my garage a trendy influencer-ready space. And yet, it’s haunted.
The first ghost is a lonely man that drifts incorporeally through the front wall. The second is a tractor trailer.
My 2019 Model 3 hallucinates these two entities[1] almost every time I back out of my garage. They’ve been around for about a year and a half now—some update summoned them to the edge of our reality so they can be detected by my car’s cameras.
What’s actually happening isn’t rocket science: some pattern of pixels is pushing my Tesla’s confidence that a truck and a person occupy a particular spot above its detection threshold. It should go without saying that this shouldn’t happen. When we talk about self-driving cars, the obvious baseline standard is “they need to be better drivers than the average human,” and it’s intuitive to me that most human drivers would not identify a massive truck parked next to me inside my garage. Tesla FSD is clearly so good that it can see two lost souls traveling the infinite highway of another plane of reality.
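In spirit, the failure mode boils down to something like the sketch below. To be clear, this is a toy illustration, not Tesla’s actual pipeline; the function, labels, scores, and threshold are all made up.

```python
# Toy illustration of a confidence gate; the labels, scores, and threshold
# are invented for this example and have nothing to do with Tesla's code.
DETECTION_THRESHOLD = 0.7  # assumed cutoff

def report_obstacles(detections):
    """detections: list of (label, confidence) pairs from a vision model."""
    obstacles = []
    for label, confidence in detections:
        # Anything the model is "confident enough" about gets reported,
        # whether or not it makes sense in context.
        if confidence >= DETECTION_THRESHOLD:
            obstacles.append(label)
    return obstacles

# Some unlucky arrangement of pixels scores just over the line as "truck"
# and "person", and the ghosts appear on the screen.
print(report_obstacles([("truck", 0.72), ("person", 0.75), ("ladder", 0.40)]))
# -> ['truck', 'person']
```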
Machine learning[2] is my day job. I work for a company that does (primarily) generative video. What we do is thankfully not safety-critical (in the “drive your car into the back of a parked truck” sense), and the models taking creative liberties can lead to fun outcomes. I don’t work on the models, though; I work on the glue that connects the models to a user interface. This puts me in an interesting position of taking structured interactions (e.g., through an API) and massaging that structured data into fuzzy data that the model can munch on. When you look at the models this way day-in and day-out, you get a lot of perspective on what they’re really doing.
Ultimately, there are two things that ML models do that make them useful and valuable. The first is what most AI evangelists talk about: identifying patterns that are challenging for humans to identify and articulate, and modeling them as data. In practical terms, this can mean a model predicting financial market behavior, or modeling physical systems with remarkable accuracy in ways humans simply cannot. Language models can “consider” vast amounts of information they were trained on to produce a concise output: unlike traditional systems, which might need to perform large numbers of queries across structured data, ML models throw numbers at their big pile of weights and produce an output with relatively little work.
We see this as “intelligence”. ML systems can produce remarkable results in this manner, and while we don’t really understand why it works as well as it does, you can conceptually wrap your head around the fundamentals of what it’s doing.
MENACE is a cool idea from the 60s which basically creates a tic-tac-toe model out of physical matchboxes. You “train” the system by playing tic-tac-toe and adjusting plastic beads in the boxes. On each move you follow a set of rules and consider the beads in the relevant box, and over time the model gets better at “playing” tic-tac-toe. For all intents and purposes, it’s a decision tree.
ML models today aren’t really that different at their core from the abstract concept behind MENACE; we’ve just scaled them up to tens or hundreds of billions of “matchboxes”. Those digital matchboxes represent things like “the vibe of the Golden Gate Bridge” or “intensity of sycophantic praise”.
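If it helps make that concrete, here’s a rough Python sketch of the MENACE idea: a lookup from board state to bead counts, a weighted random choice, and a reinforcement step after each game. The bead counts and reward amounts here are illustrative, not Michie’s exact scheme.

```python
import random
from collections import defaultdict

# One "matchbox" per board state: a mapping from legal move to bead count.
# Every move starts with a few beads; the exact numbers here are made up.
matchboxes = defaultdict(lambda: defaultdict(lambda: 3))

def choose_move(state, legal_moves):
    """Pick a move with probability proportional to its bead count."""
    beads = [matchboxes[state][move] for move in legal_moves]
    return random.choices(legal_moves, weights=beads, k=1)[0]

def reinforce(history, won):
    """After a game, reward the moves that were played if we won, punish if we lost."""
    for state, move in history:
        if won:
            matchboxes[state][move] += 3
        else:
            matchboxes[state][move] = max(1, matchboxes[state][move] - 1)
```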
The second thing that makes models useful and valuable is also the thing that hampers their usefulness: they deal in fuzzy inputs and fuzzy outputs. This is a blessing and a curse.
Consider an LLM: its most useful feature is that you can give it entirely unstructured text and get back answers based on its vast training data. Human-produced text is unstructured, and the answers are impressive so long as a human is the one dealing with the fuzzy, unstructured output. That’s why chat UIs are so ubiquitous: it’s incredibly hard to do a whole lot else with those answers.
Consider a vision model: you give it an image, it gives you back information on what it’s seeing. Well, maybe. It gives you back lots of things that it’s seeing, along with how confident it is. Maybe what it’s seeing is a blonde wig, or a bowl of carbonara, or an aerial shot of sand dunes. Simply picking the item with the highest confidence is fine, but that doesn’t make the result less fuzzy: a bowl of carbonara is unlikely if the other things in the image are a camel and a group of Bedouins.
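In code, “just pick the top answer” looks as trivial as it sounds; the labels and scores below are made up to mirror the example above.

```python
# Made-up labels and confidences from a hypothetical image classifier.
predictions = {
    "bowl of carbonara": 0.41,
    "blonde wig": 0.33,
    "aerial shot of sand dunes": 0.26,
}

# Taking the highest-confidence label is easy...
best_label = max(predictions, key=predictions.get)
print(best_label)  # -> 'bowl of carbonara'

# ...but the answer is still fuzzy: a 0.41 "carbonara" sitting next to a camel
# and a group of Bedouins probably shouldn't be trusted.
```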
How do we solve both of these problems? We throw hardware and training data at the models. A vision model that sees Bedouins and sand together more often has the context to know that the two are more likely to go together[3]. More training data, more neurons in the model, more weights, more computations…more right answers.
Similarly, you can fine-tune the hell out of an LLM to produce JSON output. You can teach that LLM to produce non-fuzzy results with a great deal of effort. You can make its output useful to non-chat applications.
But you can’t make it perfect. The dirty secret of every ML application is that it’s all just tricks. The vision model will inevitably identify bitcoins instead of blonde wigs and the LLM will inevitably generate JSON that’s syntactically invalid or incorrectly escaped or structured correctly but with semantically invalid contents. There’s no fixing it[4]: you’re stuck with the curse of fuzzy output.
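This is why every system that consumes LLM output ends up growing a guardrail layer around it. A minimal sketch, assuming a hypothetical call_llm function that returns raw text:

```python
import json

def get_structured_response(prompt, call_llm, max_attempts=3):
    """Ask a model for JSON and re-prompt when it hands back something broken.

    call_llm is a stand-in for whatever client you actually use. This loop
    papers over syntactically invalid output, but it can't guarantee the
    contents are semantically valid.
    """
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            prompt = prompt + "\n\nRespond with valid JSON only."
    raise ValueError("Model never produced parseable JSON")
```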
Of course companies are stumbling over themselves to use AI. It’s the hot new thing: if you ignore it, someone else will innovate their way past you! But it’s also a solution in search of a problem: folks are searching high and low for things to throw AI at. The answer is easy!
- Problems that are hard for people to do
- Problems with fuzzy inputs or fuzzy outputs
Read this list at your own risk, though: both of these types of problems are easily misidentified.
Problems that are hard for people to do are hard because they’re tedious. Ironically, identifying motorcycles and traffic lights is a task that machine learning models find really quite easy. You give the computer a big ol’ pile of images of streets and it’ll tell you, “Ehhh, probably this part” of each one. After all, humans were pretty much guessing a lot of the time anyway. You can have a model summarize every email in an inbox. You can have it pull the names out of every PDF in a folder. You can have it provide a rough translation of every menu item on every menu in every restaurant in a country.
Stuff that used to be hard to automate because it needed to produce convincing text or image output is now easy! Actually getting you to the right customer service department without a long phone menu has become a tractable problem. We’re at a point where we can convincingly dub a movie into another language, using the original actors’ voices and simply adjusting their lips with software.
Being right for some problems doesn’t make ML right for all problems. Lots of problems simply don’t need ML. The Kinect was doing realtime body tracking and pose estimation fifteen years ago—long before it was possible with machine learning, and might I say, often with higher accuracy than most modern ML models.
Maybe you want to scan text coming through an API for which Pokémon are mentioned. You could feed it into an LLM and get back a list (maybe), or you could just grep the text for every Pokémon name in every language (all of which are readily available online). Hell, you could get Copilot to write the code for you. The speed and cost of the “dumb” approach are unbeatable.
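A minimal sketch of that “dumb” approach, with a tiny stand-in list where the full multilingual name set would go:

```python
import re

# Stand-in for the full set of Pokémon names in every language,
# which is readily available online.
POKEMON_NAMES = {"Pikachu", "Bulbasaur", "Glumanda", "Évoli"}

# One big case-insensitive alternation instead of a round trip to an LLM.
pattern = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in POKEMON_NAMES) + r")\b",
    re.IGNORECASE,
)

def find_pokemon(text):
    return sorted({match.group(0) for match in pattern.finditer(text)})

print(find_pokemon("My Pikachu fainted, so I sent out Bulbasaur."))
# -> ['Bulbasaur', 'Pikachu']
```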
Recently online, someone remarked that generative AI does a poor job of creating videos of people writing on a chalkboard. This is unsurprising: there’s extremely little annotated training data that shows, letter by letter and word by word, how glyphs are formed from the strokes they’re composed of in 3D space. If you needed videos of people writing, it’s almost certainly cheaper to just hire some folks to draw on a chalkboard (or even generate 3D models rigged to write on a 3D chalkboard) than it would be to build, test, and deploy a video model with enough meaningful training data to get it right.
In other cases, throwing AI at a problem takes the engineering out of creating the solution and moves it to runtime. You might find, for instance, that your customer support chatbot fuzzily came up with a new refund policy for your business. Or that you spent a lot of money to take orders the new old-fashioned way when you could have done it with a touch-screen kiosk or app like you’ve already been doing for a decade.
Google’s latest Pixel phones come with a dedicated weather app which prominently features an ML-generated “AI Weather Report”. The section takes seconds to load just to reveal such incredible insights as “Clear and cool evening with good air quality.” We simply didn’t need AI for this; we’ve had it in weather apps for…well, for as long as I can remember.
So what about my car? ML models powering cars is probably the right thing. It’s fuzzy input: a bunch of measurements and images of the world. And fuzzy output: turn the steering wheel a little bit or a lot, accelerate or brake by so much, stay so far to one side of the lane or the other. I’ll feel comfortable riding in a car powered by AI, just as I feel comfortable and not at all anxious sitting in the passenger seat while my partner drives (love you babe 😘). There’s not a “right answer” to driving correctly.
But the ghosts in my garage prove something: you need context. One sensor isn’t going to tell the whole story, especially when the training data doesn’t know what a ladder hanging on a wall looks like. Radar or lidar might suggest, for instance, “Hey, trucks don’t meet the ground at a neat right angle.” The model got a fuzzy input and produced a fuzzy (wrong) output.
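You can picture the kind of cross-check a second sensor makes possible. A hypothetical sketch (the function names and the 30-meter cutoff are invented for illustration):

```python
# Hypothetical cross-check: only report a camera detection when a second
# sensor agrees something solid is actually there. Names and numbers invented.
def confirmed_obstacles(camera_detections, depth_at):
    """camera_detections: list of (label, confidence, position) tuples;
    depth_at: function returning a measured distance in meters, or None."""
    confirmed = []
    for label, confidence, position in camera_detections:
        distance = depth_at(position)
        # A "truck" with nothing solid behind it is probably a ghost.
        if distance is not None and distance < 30.0:
            confirmed.append((label, confidence))
    return confirmed
```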
What can you learn from this? I’d say you should take pause when choosing ML. Do you have enough data to make a useful model? Can you get data if you don’t? Would it be better than a non-ML solution? Hell, in a lot of cases it’s probably better to write a good-enough heuristic and lie and call it “AI”. Could you write a MENACE clone with pytorch and teach it to play tic-tac-toe? Sure, but you could even more easily iterate every possible game of tic-tac-toe and store which moves lead to winning outcomes.
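For what it’s worth, the “even more easily” part is real: the whole tic-tac-toe game tree is small enough to evaluate exhaustively in a few lines, no model required. A quick sketch:

```python
from functools import lru_cache

# Boards are tuples of 9 cells: 'X', 'O', or ' '. Enumerate every game.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def best_outcome(board, player):
    """Result of perfect play from `board`: +1 if X wins, -1 if O wins, 0 for a draw."""
    w = winner(board)
    if w:
        return 1 if w == 'X' else -1
    if ' ' not in board:
        return 0
    moves = [i for i, cell in enumerate(board) if cell == ' ']
    results = [best_outcome(board[:i] + (player,) + board[i + 1:],
                            'O' if player == 'X' else 'X') for i in moves]
    return max(results) if player == 'X' else min(results)

print(best_outcome((' ',) * 9, 'X'))  # -> 0: perfect play is always a draw
```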
[1] The motorcycles shown on the screen are indeed present on our plane of reality, but it is concerning that they disappear from the visualization when they are obviously not out of view.
[2] Which I’ll probably refer to as “AI” at least a few times here, not because it meets some definition of “intelligence” but because it’s shorter and the linguistic ship has sailed on this one, just like “crypto” has become synonymous with “cryptocurrency” to the layman.
[3] Let me be clear: I’m not trying to stereotype any Bedouins by implying they do not enjoy a nice bowl of carbonara or do not wear blonde wigs.
[4] Yes, you can identify syntactically invalid JSON and have the model regenerate the response, but that’s not really solving the problem, is it?