The humbling and reassuring progression of AI
We're not losing our jobs any time soon, my dudes
These days, I work for a company that does machine learning as its bread and butter. (And we’re hiring!) Being neck-deep in ML and ML-adjacent work every day for the past few weeks has given me a lot of perspective on where the tech is and where it’s going. ChatGPT was released this week, along with Stable Diffusion 2. And lots of other small things have happened which are also important but not important enough to mention here. I have some thoughts that I wanted to share.
Before anything else, I want to say that all of this is my opinion, and may not reflect the views or opinions of my employer.
Your job is safe.
I’ve heard a lot of folks lamenting that machine learning is going to be taking our jobs (and by “our jobs” I’m largely referring to engineering jobs) in the near future. Frankly, I have my doubts.
I have been doing some work around cost-cutting for Pinecast. One of the biggest expenses and lowest-hanging pieces of fruit has been Intercom: frankly, the shine has worn off, it’s expensive, and it doesn’t do an especially good job of any of the myriad of features that it sells you on1. Getting off of Intercom and into Help Scout and Userlist has been a priority for me.
I bit the bullet two weeks ago and signed up for Github Copilot. I needed to crawl the Intercom API to download customer data and the events that I’ve logged against them. This information is used for triggering automations, like sending onboarding emails and nudges to upgrade. Subsequently, I needed to upload those customers and events to Userlist for processing. This isn’t hard, but it’s tedious work. Copilot has been something I’ve wanted to try for a while (I wasn’t in the beta), and I figured this is the perfect shape of task for it to help out on.
And help out it did! I was able—with a relatively small number of comments—to create a series of scripts that crawled the API, handled pagination, and dumped the data to a CSV with Python. Short of writing the import statements, Copilot was able to write almost all of this code on its own:
import csv
import json
import sys

import requests

output_file = sys.argv[1]
api_key = sys.argv[2]


def get_contacts():
    """
    Make an API request with `requests` to the Intercom Contacts API to list contacts. Use
    `api_key` as the access token. Return an iterator of contacts.
    """
    starting_after = None
    while True:
        response = requests.get(
            'https://api.intercom.io/contacts?per_page=150' +
            (f'&starting_after={starting_after}' if starting_after else ''),
            headers={
                'Authorization': 'Bearer ' + api_key,
                'Accept': 'application/json',
            },
        )
        if response.status_code != 200:
            raise Exception(f'Error fetching contact list: {response.status_code} {response.text}')
        data = response.json()
        for contact in data['data']:
            yield contact
        if 'next' in data['pages']:
            starting_after = data['pages']['next']['starting_after']
        else:
            break


def get_events_for_contact(contact):
    """
    Accept a dict containing an Intercom Contact object. Make an API request for the Intercom events for that contact.
    Return an iterator of events.
    """
    url = f'https://api.intercom.io/events?type=user&intercom_user_id={contact["id"]}&per_page=150'
    while True:
        response = requests.get(
            url,
            headers={
                'Authorization': 'Bearer ' + api_key,
                'Accept': 'application/json',
            }
        )
        if response.status_code != 200:
            raise Exception(f'Error fetching event list: {response.status_code} {response.text}')
        data = response.json()
        for event in data['events']:
            yield event
        if 'next' in data['pages']:
            url = data['pages']['next']
            if not url:
                break
        else:
            break


contacts = get_contacts()

# Create a new CSV file at the path `output_file` using a CSV writer. Write a header row
# containing an ID, email, event name, and metadata.
with open(output_file, 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['ID', 'Email', 'Event Name', 'Timestamp', 'Metadata'])
    # Loop through each contact in the iterator returned by `get_contacts`.
    for contact in contacts:
        print(f'Processing contact {contact["email"]}')
        # Loop through each event in the iterator returned by `get_events_for_contact`.
        for event in get_events_for_contact(contact):
            print(f' Processing event {event["event_name"]}')
            # Write a row to the CSV file with the contact ID, email, event name, and metadata.
            writer.writerow(
                [
                    contact['id'],
                    contact['email'],
                    event['event_name'],
                    event['created_at'],
                    json.dumps(event['metadata']),
                ]
            )
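For reference, the script takes the CSV output path and an Intercom access token as its two command-line arguments, so running it looks something like this (the script name is just whatever you save the file as):

python intercom_export.py intercom_events.csv <your-intercom-access-token>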
What surprised me most was Copilot’s ability to write code against Intercom’s API without me giving it information like…the URLs. It just knew that stuff. Which is awesome! I don’t know whether it pulled that information from existing Intercom code on Github or from data crawled from the web. In either case, this is very impressive stuff.
I subsequently wrote two additional scripts: one to process the CSV to add user IDs from my system from the email addresses, and one to upload the resulting CSV to Userlist.
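Neither of those scripts is worth reproducing in full; the first one is mostly csv.DictReader/csv.DictWriter plumbing around a database lookup. A rough sketch of its shape (the lookup_user_id_by_email helper and the command-line arguments here are illustrative stand-ins, not the real code):

import csv
import sys

input_file = sys.argv[1]
output_file = sys.argv[2]


def lookup_user_id_by_email(email):
    # Illustrative stand-in: the real script looks the user up in my
    # application's Postgres database via the Django ORM.
    return None


with open(input_file) as infile, open(output_file, 'w') as outfile:
    reader = csv.DictReader(infile)
    # Carry the exported columns through and append a User ID column.
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames + ['User ID'])
    writer.writeheader()
    for row in reader:
        row['User ID'] = lookup_user_id_by_email(row['Email'])
        writer.writerow(row)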
How much time did this save? Probably a few solid hours. But I did need to keep the Intercom docs open, and I did run into trouble. For one, my account was set to use a (very) old version of the Intercom API, so the code didn’t work at first. There’s no intuitive way to tell Copilot to fix your code in response to a problem, which makes it hard to revise existing code.
Another problem was getting Copilot to use a specific feature I had in mind, like Python’s csv.DictReader, especially when I couldn’t remember the exact name of the class. If I have to open the docs to look up a feature just to tell Copilot to use it, it doesn’t feel like Copilot is keeping me productive.
What feels especially bad here is that Copilot understands instructions, but it doesn’t understand use cases. I can’t suggest to Copilot how I want to use the output. I have to tell it exactly what I want the output to be. Which is to say, Copilot doesn’t understand the context of my task: it doesn’t know that I’m trying to import this data into Userlist, and it doesn’t know that the user IDs live in my postgres database, and it doesn’t know that I use Django. And this, friends, is why your job is secure.
My understanding is that this is ultimately a limitation of language models as they exist. They are trained to take a prompt and produce an output. They’re not trained to resolve ambiguity (e.g., by asking questions in response) or to say what they don’t know. You can see this in the way GPT responds to prompts: it always responds in complete thoughts, and where it doesn’t know something, it picks whatever it believes fits best.
If you tell a person that you want to move data from the Intercom API to the Userlist API, they’ll ask questions about the things that they are unsure about. GPT-3 simply will never do that. Which means that to make Copilot a truly useful tool to automate humans out of the job, someone needs to sit there and write incredibly detailed prompts to describe the problem—start to finish—in a way that the model can understand and produce solid output. And then you need to just hope and pray that the model knows all of the details necessary to solve the task at hand, or it’ll just make shit up.
Damn lies
Making shit up is a big problem with these models right now. Some folks have dubbed this “hallucination”: where the machine produces valid-sounding output that is entirely made-up. Here’s an example that I called out:
In the linked tweet, the author asks ChatGPT what the limitations of generics in TypeScript are. The first bullet point is dubious but confusingly worded. The second bullet is outright wrong:
Generics are not supported by all JavaScript runtime environments.
This is flat-out wrong: generics are always types, and types are stripped during TypeScript compilation. They can’t be unsupported by a runtime environment, because they are never present in the JavaScript output.
The subsequent point about type inference is mostly wrong (the point of a generic is that you are explicitly passing it a type). The last point about complexity is probably valid, but arguable.
After many of my friends posted their ChatGPT experiences on Twitter, I decided to give it a shot myself this weekend by using it to help craft Terraform code for the Cloudflare provider. It would be nice to avoid having to dashboard-jockey my settings, and what better way to do that than writing some Terraform code?
ChatGPT consistently provided invalid (but valid-looking!) code snippets for Cloudflare Terraform resources. It suggested the wrong permissions to assign to an API token to enable certain operations. And when I asked it how to import a cloudflare_zone_settings_override resource, it told me to visit a non-existent page in my dashboard to get the zone settings override ID: not only does that page not exist, but as far as the internet can tell, there is no such thing as an ID for a zone settings override.
In the world of computing, you simply can’t make shit up. In any field (computing, math, engineering, etc.) where there are objective criteria for success, a source of knowledge that confidently lies is not a useful source of knowledge.
If you were to try to make engineers obsolete with today’s AI tools, you’d need to do a few painful things:
Have someone write exhaustive prompts, which takes a substantial portion of the time required to write the code in the first place.2
Have someone read+understand the generated code and then fact-check it for errors. If you intend to run the model against the prompt in production, may the higher power of your choosing have mercy on your soul.
Deploy that code (you don’t want some automated external third party trained on the corpus of the internet having ~admin access to your production infrastructure, do you?).
And after all of that, you have achieved nothing of value.
OpenAI has a lot of plans for how to address the lies. My understanding, from reading about what they’re planning to do, is that they’ll use a combination of web-based training and human feedback to push the models to produce more factual answers.
To make our models safer, more helpful, and more aligned, we use an existing technique called reinforcement learning from human feedback (RLHF). On prompts submitted by our customers to the API, our labelers provide demonstrations of the desired model behavior, and rank several outputs from our models. We then use this data to fine-tune GPT-3.
I’m not an ML engineer and I won’t pretend I know the intricacies of this stuff, but I’ll take an educated guess and say that more training with human feedback will improve the process, but won’t make it foolproof. You’ll forever asymptotically approach the limit of the technology. New tech built on different approaches will need to exist in order to close those gaps, and that tech will have its own limitations. And involving humans inherently makes a process less scalable, so it’s unclear how that will work in the long term.
Additionally, the model’s inability to say “I don’t know” or “I’m not sure, but…” is a surprisingly hard challenge when an exact, factual answer is expected. When you Google something and Google doesn’t have a good result to offer, you get back an empty results page: if Google instead returned unrelated garbage, that would make it significantly less useful.
This is a very hard problem, both from a UX perspective and from a technology perspective. I don’t see anyone fully answering it in the next year, but the tech will improve.
Art
On the other side of the machine learning world is generative art, and folks are also talking about the perils of it. There’s certainly a lot to be said about the images that models like Stable Diffusion and DALL-E are trained on, and plenty of unanswered questions. The two which have dominated my feeds:
Is it ethical for a model to be able to produce pictures with a similar style to a particular artist, even though you can already pay another artist to do the same thing?
Is it ethical to train a model on publicly available artwork, even though those works’ artists likely did the same thing to develop their skills?
I’m no philosopher and I’m certainly no ethicist, and so I won’t even pretend that I have any sort of insights here. There are a lot of very smart, impartial people who can think about these things and decide what is “right,” and that is not me.
What’s clear, though, is that the technology will move forward regardless of whether there’s a right answer, if for no reason other than people are finding value in it. We find ourselves at a point where lots of folks—especially artists—look at the coming tech and think, “They’re coming for me next.” And in some ways that’s true, and in some ways it’s not really true.
Some creative tasks where the outputs are constrained and repeatable are already being automated, and have been for years. Waifu Labs has sold AI-generated waifus for quite some time (using GANs). It would be unsurprising to see many other niches filled by AI end-to-end: fursona ref sheets3 come to mind as low-hanging fruit.
While some folks will see a decrease in commissions, we’ll also see the accessibility of this art grow tremendously. Participating in communities that are heavily role play-based (furries, D&D, etc.) is in some ways contingent on your ability to pay to turn your ideas into something you can share visually with others. I think there’s an important argument for generative art’s ability to enable more folks to participate in art-dependent activities.
Many folks have played with Stable Diffusion or DALL-E 2 by now, and the results are certainly very impressive. But the limitations are also obvious:
Faces, hands, and joints are challenging for the models to reproduce.
The current models can’t handle written language: they can’t produce readable text in the output.
Prompts often produce sub-par output and require “prompt hacking” to get a reasonable result.
Linguistic challenges like homonyms and idioms are difficult for the models to understand and interpret.
The resulting images are almost always limited in size/resolution due to underlying technical constraints.4
Here’s a picture that came out of an AI-powered portrait tool trained on my face. I’d say it’s quite good (apparently I need some tattoos!), but you can clearly see the difficulty that it has with hands, the articulation around the left shoulder, the geometry of the chair, and some facial proportions.
The models will get better, but they need to become significantly better to truly replace an artist. Even with the bleeding edge of prompt-based models, the level of understanding around language simply isn’t sufficient to describe to the model an outcome that an artist would consider a standard commission.
Artists might not be commissioned to make as much “fast art” when generative tools become more mainstream: we already live in a world where these tools exist and are available to the general public. I have been thinking about what this enables for the art community, though.
Something that has emerged in the past few months is “prompt hacking”: a whole community based around crafting the right prompts for generative tools to produce the perfect artwork. AI tools are tools, in the same way that Procreate or Photoshop or After Effects are tools.
Consider how an artist might use generative tools to fill in backgrounds in a commission, which are usually tedious and cost extra. Or to add details or do basic outlining. I could see generative art as a tool to explore potential options for inspiration. “How do I fill in this gap?” Plenty of artists (not just visual) already use generative tools to assist them, making their work faster and higher quality, in the same way composers use MIDI tools instead of hiring musicians to record their tracks.
Pinning blame
I said I’m not going to philosophize about the ethics of AI, but there is one thing that’s clear to me. Many conversations about AI talk about the technology as an entity. There’s a temptation to anthropomorphize machine learning models. After all, you talk to them, they can talk back to you sometimes, and they can simulate the behaviors of other humans.
This is a natural temptation for humans. I saw this tweet recently, which made me realize how often we do this as a species:
How many relatives do you know that have conversations with their Alexa or Siri? I know a few. The same behavior seems to emerge when folks decry the evils of AI: it’s a thing that’s doing something.
But…it’s not. You can’t blame the AI. The AI can’t think on its own. It can’t have malicious intent. It didn’t decide to be created, or to be misused. The AI is a tool that’s been designed, crafted, and used exclusively by humans.
I read an article recently about how AI is being used to mimic art of a deceased artist. A couple notable quotes from the article:
While there’s a long-established culture of creating fan art from copyrighted manga and anime, many are drawing a line in the sand where AI creates a similar artwork.
…
“I think they fear that they’re training for something they won’t ever be able to live off because they’re going to be replaced by AI,”
It’s tough to tell whether the sentiment that AI is doing this (rather than a person) is coming from the author or the sources for this story, or any other similar story.
The fact that it’s Python and a few gigabytes of floating point numbers producing some harmful result, and not someone in a sweatshop overseas or a bunch of complicated if/else statements5, doesn’t change the outcome. AI can be whatever you want it to be for the purposes of these conversations; it could be literal witchcraft. There are people that are doing the Bad Things, but “AI” is billed as the problematic part.
Making AI a boogeyman dismisses how it—as a tool—can be used for good as well as bad purposes. AI, and even generative art, have no implicit character or values. The things that make it seem problematic:
It’s easy to use. It’s an extremely accessible technology that can be packaged and used over the internet or with dedicated silicon on almost any modern high-end phone.
The breadth of use cases is huge. The more ways something can be used, the more ways it can be used for purposes that folks find distasteful.
Machine learning is largely a black box. It’s extremely difficult to peek inside a model and determine how or why it works the way that it does.
Problems are easy to discover from the outputs, and it’s not always clear how these can be resolved by changing the input.
Care is not exerted in choosing unbiased model training data, which leads to problematic results.
AI is an easy target for criticism. But on its own, it does nothing. Every output produced by a machine learning model was at the instruction of a human, and it’s garbage-in-garbage-out. When talking about the perils of AI, it’s important we remember that: blame the people, not the tech.
1. The support site isn’t good, and the tools to update it aren’t good. The chat widget often starts failing with weird errors until someone presumably rolls back a deployment at Intercom. The tooling is often mysterious, the API is extremely limited…I could go on.
2. The difference between writing code and writing what you want the code to do is ultimately a matter of encoding your instructions. We might like to imagine that describing computer code in natural language is faster or simpler, but this is a lie we tell ourselves: the overhead of the programming language itself is small, and the language is designed to allow maximum specificity. Natural language incurs a significant overhead when you need to be extremely specific about computer behavior.
3. I’m not a furry myself, but I have a lot of friends in infosec, and the two communities have a startling amount of overlap.
4. Upscaling with more AI is certainly possible, but upscaling produces an effect that feels a lot like a visual artifact. The “proper” solution is having a model that can produce images at the resolution of your choice.
5. It’s worth remembering that lots of powerful, magical-seeming products haven’t been powered by machine learning. One notable example is the Microsoft Kinect.