Two weeks ago (the week of the 7th) was my first week working at Runway. It was also the week to which the window installers rescheduled my home’s new window installation after the whole install crew caught COVID last month. Since I was in New York for my onboarding, it was truly an emotional rollercoaster, with my feelings oscillating between excitement for my new role and anxiety that some randos might damage my house.
Being in New York, I’d also been able to meet up with some former coworkers, one of whom I hadn’t actually ever met in person. When I’m traveling I usually assume that most of my plans to meet up with friends will fall through, but I was pleasantly surprised that we managed to make it work!
My first week at Runway
One of the striking things about starting a new job is settling into all of the routines and peculiarities of how the company works. Just as I felt intimidated by Stripe’s API codebase, I’m apprehensive about many of the components of the Runway codebases. Confidence takes time, though, and everyone has been nothing but supportive.
During a 1:1, I told a coworker that one of the toughest adjustments when moving from a large company like Stripe to a small one like Runway is recalibrating my expectations around the right balance of good and fast. I have no problem coding fast: most of what I do for Pinecast is “done fast.” And most of what I did at Stripe was “done good,” because that’s what’s valued there.
My coworker said that I should bias towards iteration and prefer something that does what the customer and other engineers expect, rather than focusing on “the right” solution. I agree! In the majority of cases, I think this feels fairly unambiguous. Certainly if you’re reading this, you know I’ve spent lots of time harping on the merits of working iteratively.
The trouble is that this is not a smooth gradient. Security is a good example of this. It’s easy to build something quickly that works, but is dreadfully insecure. Building something that’s comfortably secure takes time and often requires lots of eyeballs.
Let’s imagine for a minute that we have a slider, where all the way to the left gets work done as fast as possible, and sliding it all the way to the right gets things done with the highest standards possible. If we had a task to build a UI, this metaphor works intuitively: put the slider in the middle and the work gets done in a reasonable amount of time and with a reasonable (but not amazing) standard. Slide it left and the UI is janky but it’s done tomorrow; slide it to the right and it’s a work of art but it takes all month to finish.
Security, though, doesn’t work this way. Setting the slider right in the middle doesn’t mean “the security incidents are only 50% spicy.” The security incidents are still pretty bad (and even potentially company-ending!). The spiciness curve is non-linear, requiring more “good” than “fast” to approach what’s probably a tolerable risk level.
The biggest challenge I’ll face in adjusting to Runway’s culture and cadence will be figuring out how to set my sliders. On one hand, a startup needs to move quickly and deliver value to users. On the other hand, you also don’t want to work so fast that the product is bad/insecure/unstable/etc. to the point that you need to dramatically reverse course later on to clean up your mess.
Things that can’t be done right now
One person I met with this week told me that when he joined Runway, there were a lot of things he was bullish on doing. Joining early means defining the way of working, right? In practice, he said, there’s simply no time to invest in projects like this. This is, of course, the realization that the slider can’t be set as far towards “good” as you want it to be set.
This is something that’s come up at past jobs, even at larger companies. If you can’t find the time to define a medium-sized project and do it, you’ll inevitably just not do the project. But I don’t think that means the outcome is infeasible; it just means it can’t be done right now.
When I think about changes like this, the thing I keep in mind is that it’s almost never the case that you have time to do this stuff, even at larger companies. The larger the company is, the more stuff there is, and the more stuff there is, the harder it is to define the way that other people work.
Maybe the change is adopting ES2017 async/await (versus chaining promises or using callbacks). Maybe the change is migrating to TypeScript. Maybe something as simple as naming conventions. In almost every case, the war is the migration, but the battle that needs winning is mindshare.
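To make the async/await case concrete, here is a minimal sketch of the same function written both ways. The Show and Episode types and the fetchShow/fetchEpisode helpers are hypothetical stand-ins for whatever your codebase actually calls; they are stubbed so the example runs on its own.

```typescript
// Hypothetical data types and stubbed loaders; a real codebase would hit
// a network or database here instead of resolving immediately.
interface Show {
  id: number;
  latestEpisodeId: number;
}

interface Episode {
  id: number;
  title: string;
}

function fetchShow(id: number): Promise<Show> {
  return Promise.resolve({ id, latestEpisodeId: id * 10 });
}

function fetchEpisode(id: number): Promise<Episode> {
  return Promise.resolve({ id, title: `Episode ${id}` });
}

// The old way: chained promises.
function latestTitleChained(showId: number): Promise<string> {
  return fetchShow(showId)
    .then((show) => fetchEpisode(show.latestEpisodeId))
    .then((episode) => episode.title);
}

// The new way: async/await. Same behavior, but it reads top to bottom.
async function latestTitleAwaited(showId: number): Promise<string> {
  const show = await fetchShow(showId);
  const episode = await fetchEpisode(show.latestEpisodeId);
  return episode.title;
}
```

Both versions return the same promise to their callers, which is what lets old and new code coexist while the migration is in flight.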
The secret to making this stuff happen in my eyes—at a company of any size—is getting the people around you to do the work by making a compelling case for why your new thing is the right thing to do.
Let’s talk about a new naming convention. You want to stop having a big tests/ directory and instead put your tests for foo.ts in the same folder, named as foo.test.ts. You could maybe spend the time moving all the files and fixing imports and changing Jest config, or whatever. That’s a lot of work. Or, you could make sure the people around you are on board with the change, do a few popular directories, and let the new way of doing things slowly take over the old way.
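If you do convert a few popular directories by hand, the rename itself is mechanical. Here is a rough sketch of the path mapping, assuming (and this is purely an assumption for illustration) that the old tests/ tree mirrors a src/ tree. The function name is made up.

```typescript
// Map an old-style test path to its co-located equivalent.
// Assumption: tests/a/b/foo.ts tests the source file src/a/b/foo.ts.
function colocatedTestPath(oldPath: string): string {
  const match = /^tests\/(.+)\.(ts|tsx)$/.exec(oldPath);
  if (match === null) {
    // Not under tests/, so there is nothing to migrate.
    return oldPath;
  }
  const [, stem, ext] = match;
  // Avoid producing foo.test.test.ts if the file was already named foo.test.ts.
  const base = stem.replace(/\.test$/, "");
  return `src/${base}.test.${ext}`;
}

// colocatedTestPath("tests/billing/foo.ts") gives "src/billing/foo.test.ts"
```

This only computes the destination; moving the file and fixing its imports is still the part that takes real effort.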
That’s not to say this is zero-effort. You’ll still need to put in the work:
Having a stockpile of examples in obvious places will make sure folks that are copypasta-ing end up copypasta-ing the right things.
If you’re not migrating as you work going forward, nobody else is going to either. You need to make the active effort to put your proposal into action by keeping up the momentum.
You need to be on top of making sure folks don’t regress the policy. Watch for commits of a bunch of tests with the old naming convention, and ask nicely for them to do the new thing instead of the old thing.
But don’t nag so hard that you make the process a pain in the ass. You’re trying to win mindshare, after all.
There will be a long tail that will eventually need to be cleaned up. Be prepared to sit down and push the project over the line when it gets close.
Since this isn’t a thing that happens in a single bounded time window, it requires the extra diligence of good documentation. If people don’t know what, how, or why, they won’t do it.
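Watching for regressions doesn't have to be manual. As a sketch, a small helper could look at the files changed in a commit and draft a polite reminder. The function names and the wording of the message are hypothetical, and how you get the list of changed files depends on your review tooling.

```typescript
// Return the changed files that still use the old convention:
// TypeScript files living under a tests/ directory.
function findOldStyleTests(changedFiles: string[]): string[] {
  return changedFiles.filter((file) => /(^|\/)tests\/.+\.tsx?$/.test(file));
}

// Build a friendly reminder, or null when there is nothing to say.
function reminderFor(changedFiles: string[]): string | null {
  const offenders = findOldStyleTests(changedFiles);
  if (offenders.length === 0) {
    return null;
  }
  const list = offenders.map((file) => `  - ${file}`).join("\n");
  return (
    "These tests use the old tests/ layout. " +
    "Could they live next to their source as *.test.ts instead?\n" +
    list
  );
}
```

Keeping this as a comment rather than a failing check is deliberate: it nudges without blocking anyone.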
The mindset here is the same: you have a good idea and you push it through. But the fundamental differences are twofold:
You’re leveraging everyone else to help do the work. As long as the amount of work left to do shrinks at roughly the rate that the change is completed, you’ll eventually end up getting to the end[1]. Even if some folks are still doing the old way, as long as there’s a roughly equal effort to do the new way, the new way will eventually prevail.
The change doesn’t happen in one unit. There’s no defined, organizationally-sanctioned project to institute the new thing. It just sort of happens, often organically. It’s not on a roadmap, it’s not tracked, it’s just a thing that people are doing[2].
The second item often makes folks feel like this approach is infeasible, but you can see it working all the time: folks bring good ideas from their past experience to new jobs, and those ideas get picked up and put into practice gradually. Just because a project isn’t explicitly time-bounded doesn’t mean it’s destined to fail; it just means you compensate for the lower up-front investment with diligence in making sure the momentum continues over time.
Let’s talk about why teams move fast
Something that came up many times during my last week at Stripe was a question in response to something that I’ve preached for a while. I suggested (hopefully uncontroversially) that the success of any software team depends on its ability to move quickly. “Moving quickly,” in practice, doesn’t mean doing more in a short period of time (like, you’re not going to type faster). It means working iteratively: break projects into chunks that can each be developed on their own, do one, then move on to the next chunk.
In essence, this means being agile. Lots of folks mistake “agile” for “doing kanban” or “morning standup” or whatever ritual you can perform to tick enough boxes in your favorite agile development book to mentally justify saying “we do agile”. But that’s never been what agile is or was.
And so, lots of folks asked me, “Can Stripe become more agile?” or “Can large companies move fast?” Yes, of course! Or at least, it seemed that way in my mind.
Does big mean slow?
“Bigness” is often equated with “slowness.” And to avoid burying the lede, I’ve spent a good amount of time wondering whether bigness actually causes slowness or whether the slowness is the consequence of some other factor.
Startups need to move fast out of necessity. If you move slowly and don’t ship for long periods of time, you’ll run out of money (or flounder, if you’re bootstrapping), and you can’t prove out your product. Given enough time, someone will eventually solve the problem you’re trying to solve, making your product moot. There are enough folks preaching the urgency of iteration that I won’t waste your time explaining why it’s essentially table stakes for a startup to operate with agility.
But isn’t that true for all companies? If you rest on your laurels for too long, eventually someone is going to eat your lunch. Maybe you can prolong that by turning revenue into marketing to keep the flywheel of profits turning, but that only lasts as long as your product is capable of sustaining it.
One person at Runway suggested during my first week that a company as big as Stripe moves slower than a startup because it needs to move more diligently. Or, to express it in terms of the metaphor from above, the slider needs to be further towards “good” than “fast.” I don’t believe that. Each project has varying degrees of urgency and quality standards. But in my experience, projects of any urgency (as determined by either the leaders or the ICs) tend to be mired in slowness to the same degree.
Another person I spoke with suggested that big companies simply have more managerial overhead. But I don’t think that fully explains the problem, either. The larger an organization is, the more procedural overhead there is by necessity:
The more people at a company, the more people there are that own things.
The more people that own things, the more folks need to be involved to make decisions or changes.
The more people, the more layers of management, which means more people who need to oversee and coordinate work.
The more things that are owned, the more vertically siloed the domain expertise[3]. This creates bottlenecks on the people with the expertise.
Surely there are many more aspects to overhead. But I’d postulate that the effect of “overhead” ends up being 1-3x (that is, increasing project time by 1-3 times) rather than the 5x or more that is commonly believed.
The mythical “startup within a startup”
I’ve been pretty adamant about something for a long time. My manager at Stripe strongly disputed this, so take that for what you will. I contend that when a company creates a “startup within a startup” initiative (SWAS) to help drive something forward quickly, the company is in an unhealthy state. There are some fascinating things to look at here.
The thing I don’t like about SWAS orgs/teams/whatever is that they’re a band-aid for the real problem(s). Rather than fixing the company, culture, and conditions that make the SWAS necessary or desirable, the SWAS is put in place to circumvent those problems. It’s a short-term fix to a long-term problem. Regardless of the outcome, the improvements a SWAS institutes that aim to make it nimble just don’t get brought back to the company as a whole.
I think that SWAS orgs often end up being at least marginally successful. They’re often billed as being able to move fast without the overhead of the larger organization. And I think they usually deliver on that. But here’s the thing: if the SWAS can move fast (for some definition of fast), that implies that teams within the company can move fast. So why don’t other teams move fast?
And to hammer that home a bit more, you need to consider that these orgs usually don’t color too far outside the lines of the broader company. They probably use the same CI and compute infrastructure, are subject to the same security standards, and have to report progress to leadership. The real difference is in the culture of how work gets done and how ownership is handled.
There are a few aspects to these groups that make them work faster than the broader company:
They aren’t subject to the same scrutiny as the rest of the company
Most folks find themselves trying to nudge their fast/good sliders towards fast, but leadership slides it back towards good. A SWAS is sold on the premise that they are supposed to move fast, and so they face less oversight.
A SWAS usually reports up through a single individual in management who is bullish on the group’s ability to work independently and mostly shields them from unnecessary bureaucracy. Bottlenecking the org through a single person who deflects top-down demands tends to work.
Ownership is clear and contained
SWAS orgs are often terminal in their ownership. They have nobody (or almost nobody) else in engineering that depends on them.
They don’t own existing things. They have no systems that require ongoing maintenance to resource.
Their very existence nudges the slider towards “fast”
If you have a SWAS that doesn’t move quickly, it is a failure by all accounts.
Being chartered with moving quickly makes everyone within that org consider agility as part of every decision. Your ability to move fast justifies your org’s very existence.
Now, I won’t say that there are any sure-fire solutions to take away here, but I think that if you look at what makes a SWAS successful and apply that upwards to a company as a whole, you can make meaningful improvements. If I were running a large organization and wanted to make a meaningful change to the engineering org’s ability to move quickly overall, there are a few core things that I’d consider absolutely unavoidable.
The first is to recognize that engineers don’t work like line cooks in practice. You can have them make something, then move on to the next thing. That works, but the thing they just finished building starts accruing maintenance burden, feature requests, and KTLO work. And sure enough, the next new thing finishes, and it joins the first thing. Soon, the once-productive team is faced with either pausing new work to focus on maintenance or ignoring maintenance to focus on new work.
In a small company, everyone owns everything. And there’s little difference in how new work and maintenance work is treated: it’s all just work. When a company grows, there’s too much stuff for everyone to own everything, and so you start drawing lines around areas of ownership. Back-end. Front-end. Infra. And the people you put in charge of those areas can specialize.
Where this goes poorly is when you start to run into the edge cases of the human aspects of an org. What’s often hidden is that there’s a balance: when you have an “infra” group, everything infra-related is owned by the infra group, and everyone is staffed on everything. If you build more infra, you hire more people. When you specialize the “infra” group into teams like “storage” and “continuous integration,” those teams now have far less flexible resources.
“Continuous integration” isn’t just running Jenkins; it’s managing the Bazel setup. Unblocking flaky tests. Keeping your kernel up to date. Adding caching for your node_modules. Making sure that errors are formatted correctly. Monitoring build times. The scope grows, and the growing scope means new code, and the new code is often owned by the same ~fixed size pool of engineers.
The company needs to switch from a line cook mentality to a rancher mentality. The new thing gets built, it gets fostered into a mature piece of software, and (without trying to evoke a gruesome metaphor of ranching) gets sunset. The effort beyond the creation of the software needs to be accounted for when budgeting headcount. If you stretch your engineers too thin by continuously increasing the scope of their ownership without giving them more resources, the process will break down.
The second change I’d put in place is to ban long-term roadmaps. You can’t be agile if you’ve planned what to do more than a few months out. That’s the literal opposite of agile, regardless of which flavor of agile development you subscribe to.
There’s obvious value in knowing what you want to be doing in a year. Having a vision is important. Laying out all of the individual steps to get there isn’t valuable or important. Besides the planning process for an extended roadmap being tedious and slow, it forces teams to sign up for a plan that is disrupted by any participant in the plan facing disruptions.
Consider waterfall development. If one component of the plan ends up being delayed, it cascades down the waterfall. That’s the immediate, direct effect of a delay. But there are indirect effects, as well. Team B is delayed due to an unforeseen problem. Team A’s work, which they went out of their way to deliver on time for Team B, is now sitting idle. What could Team A have done instead of sequencing their work for Team B?
What you’re left with is inefficiency (read: waste) that comes at the expense of maintenance and KTLO. If we assume maintenance and KTLO are actually necessary (that is, we wouldn’t say we should do the work if it didn’t actually need to be done), there’s some cost to delaying that work: bad user experiences, customer churn, wasted resources (compute, storage, etc.), sales delayed by missing features, security risk, or worse.
There’s always work to be done at a healthy company. Teams should feel empowered to decide which things need to be done at which times. Breaking work into chunks (sprints? cycles? whatever you want to call them) means that you can plan to do the most impactful thing at any given time. And when you’re done with that chunk, you can change direction quickly.
If, in my example, Team B’s dependency on Team A was well-communicated instead of just another line-item on a spreadsheet, Team A could have self-determined during sprint planning that the work for Team B was not the most impactful thing they could do at that time, and done more impactful work instead. The broader project is mostly unaffected, but the outcome is better for the company as a whole.
I think the last big thing I’d do is invest in fixing papercuts. Stripe had generally been good about this (especially in recent years thanks to some really awesome folks). Developer productivity needs to be a first-class concern for any company.
The problem with guiding team success with metrics is that they don’t tell the whole story. When each team is evaluating their output at a macroscopic level, the metrics all look good. CI works pretty well. The deploy tooling works pretty well. The code editors are well-configured. But when you look at the average engineer’s day-to-day experience, build times are slow, the deploy tooling works poorly for the QA environment, the code editor intermittently stops formatting JavaScript and requires reboots…and developers just deal with it.
Lots of little frustrations and problems end up popping up—usually without anyone noticing. And they’re individually small enough (per instance of the frustration) that they aren’t seen as problems worth prioritizing against much more impactful work that a team could be doing. But if every engineer restarts VS Code twice per day because Prettier stops working, for a large org that’s hundreds of hours a year wasted on what’s probably a quick fix.
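To put rough numbers on that, here is a back-of-the-envelope sketch. Every input below is an assumption I picked for illustration (200 engineers, 20 seconds lost per restart, 230 working days a year); only the "twice per day" comes from the example above.

```typescript
// Inputs for estimating the yearly cost of one recurring papercut.
interface PapercutEstimate {
  engineers: number;
  occurrencesPerDay: number;
  secondsLostPerOccurrence: number;
  workingDaysPerYear: number;
}

// Total engineer-hours lost per year across the whole org.
function hoursLostPerYear(estimate: PapercutEstimate): number {
  const secondsLost =
    estimate.engineers *
    estimate.occurrencesPerDay *
    estimate.secondsLostPerOccurrence *
    estimate.workingDaysPerYear;
  return secondsLost / 3600;
}

// Assumed figures, not measurements.
const editorRestarts = hoursLostPerYear({
  engineers: 200,
  occurrencesPerDay: 2,
  secondsLostPerOccurrence: 20,
  workingDaysPerYear: 230,
});
// 200 * 2 * 20 * 230 = 1,840,000 seconds, or about 511 hours.
```

The estimate ignores the cost of lost focus, so if anything it understates the problem.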
At one point at Stripe, there was a survey that had been sent around in my org about developer productivity. Many of the people I’d spoken with felt like they were moving slowly, but had a really hard time articulating why. It’s not hard to imagine why they struggled: a few wasted seconds here or there quickly add up, but the source of slowness or process or bugginess is not easy to pin on any one source.
In my experience, these problems go unseen for far too long. Like the proverbial frog in boiling water, you get used to papercut issues because they don’t hurt enough to get upset about, but they hurt enough for you to feel the pain from them.
Smaller companies usually don’t have this problem, because there’s simply less stuff to have papercuts with. Folks bring whatever editor they feel comfortable with. The volume of code is low enough that even some of the most inefficient tooling isn’t a huge burden[4].
Last, I think there’s something to be said about managing product complexity. This is a problem that’s almost exclusive to more mature companies by virtue of how long they’ve existed. Small companies tend to be younger, and younger companies have less complex products. Complexity is managed and culled quickly, because you don’t need to do as much work to combine things and refactor and draw the complexity downward.
At an older company, complexity is ingrained in the product. If you build and build without reducing complexity, the complexity compounds. It takes a superlinear amount of effort to rework your product to be less complex as time goes on: interactions between different parts of your product make it hard to keep feature parity while making your product simpler.
I’ve felt this with Pinecast, even. The interactions between different features increase complexity:
Users can access each other’s shows if they are in the same network or are marked as collaborators. But the features of the show are based on the functionality of the owning account’s subscription.
Some features are only available to the owner of a podcast (like deleting the podcast or an episode), even if someone else has access.
Some features are not available, even to the owner of the podcast, if the owner is on the “ice box” plan, which makes the account read-only for a very low monthly rate.
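Here is a rough sketch of how just those three rules tangle together in a single permission check. This is not Pinecast's actual code: the types, field names, the "paid" plan name, and the choice to treat every write as blocked on the ice box plan are all assumptions for illustration.

```typescript
type Action = "view" | "edit" | "delete";
type Plan = "paid" | "ice box"; // "paid" is a placeholder plan name

interface Account {
  id: string;
  plan: Plan;
  networkIds: string[];
}

interface Podcast {
  ownerId: string;
  collaboratorIds: string[];
}

function canPerform(
  user: Account,
  owner: Account,
  podcast: Podcast,
  action: Action
): boolean {
  const isOwner = user.id === podcast.ownerId;
  const sharesNetwork = user.networkIds.some(
    (networkId) => owner.networkIds.indexOf(networkId) !== -1
  );
  const isCollaborator = podcast.collaboratorIds.indexOf(user.id) !== -1;

  // Rule 1: access comes from ownership, a shared network, or collaboration.
  if (!isOwner && !sharesNetwork && !isCollaborator) {
    return false;
  }
  if (action === "view") {
    return true;
  }
  // Rule 3: features follow the owner's plan, and ice box is read-only,
  // so writes are blocked for everyone, including the owner (assumed scope).
  if (owner.plan === "ice box") {
    return false;
  }
  // Rule 2: some actions, like deleting, are reserved for the owner.
  if (action === "delete") {
    return isOwner;
  }
  return true;
}
```

Three rules already need four branches whose order matters, and each new feature has to be checked against all of them.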
Business leaders (myself included!) have a hard time justifying simplification as an effort, because you’re taking something that works and turning it into something that works exactly the same way. It’s hard to measure return on investment for culling infrequently-used features because the cost of keeping them is often not easily measured.
Especially for complexity between multiple features that interact, the toll taken on developer velocity is exceptionally high. You see this in a number of ways, such as:
It feels unsafe to make seemingly benign changes.
The process of testing a change takes a huge amount of time.
The procedure for defining how a feature should work in all cases is a protracted exercise that takes far longer than it probably should.
In my role at Stripe, one of the questions I found myself asking folks about their proposals is, “Does it make sense to build on this codebase? Or is it time to reevaluate whether this is the right abstraction?” In fact, one of the last few things I wrote an impassioned message about was not overloading an API with more functionality in the name of keeping it simple.
While hard to do and somewhat time intensive, this is probably also the most impactful way to speed up a company. The fewer things you need to think about to do your work, the faster you’ll move. And in general, reducing complexity is mostly uncontroversial! This process means talking to your users to find out what they actually want to do, then refining the interfaces you give them (programmatic, UI, etc.) to make sure that the things you are delivering address customer needs with as little surface area as possible.
There’s a lot more that can be said about reducing complexity, and that’s maybe a topic for another post. Simple does not mean small, for instance. DRY doesn’t apply to interfaces! If this is something you’d want to read more about, let me know.
The next few weeks
When I was writing these at Stripe, much of what I wrote looked a lot like this post. But I also did some other stuff, and I want to gauge interest.
Thanks for reading, as always, and I’ll write you again in two weeks!
[1] This boils down to the Ant on a Rubber Rope paradox. The gist of it is that if you have an ant that’s standing at one end of a 1km long rubber rope and starts to walk the length of the rope at 1cm per second, the ant will still eventually reach the end, even if the rope begins to stretch at the alarming rate of 1km/s. The reason is that the rope behind the ant is still growing at the same rate (per unit of rope) as the rope in front of the ant. The ant’s forward progress always increases its percentage of completion, even if that rate is slow and continuously (and rapidly) decreasing.
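For the curious, the claim can be checked with a short derivation. The symbols are my own notation rather than anything standard to this post: L_0 is the rope's starting length, v is how fast the rope lengthens, a is the ant's walking speed, and f(t) is the fraction of the rope the ant has covered. Stretching leaves f unchanged, so only walking moves it.

```latex
\begin{align*}
L(t) &= L_0 + v t \\
\frac{df}{dt} &= \frac{a}{L(t)} = \frac{a}{L_0 + v t} \\
f(t) &= \frac{a}{v} \ln\!\left(1 + \frac{v t}{L_0}\right) \\
f(T) = 1 \;&\Longrightarrow\; T = \frac{L_0}{v}\left(e^{v/a} - 1\right)
\end{align*}
```

With L_0 = 1 km, v = 1 km/s, and a = 1 cm/s, the ratio v/a is 100,000, so T = e^{100000} - 1 seconds. That is finite, which is the whole point, but absurdly long.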
[2] Often not in addition to what they were doing before, but in lieu of what they were doing before. New code doesn’t need to be written in an old way, you just write it in the new way. This is net savings, as
[3] E.g., at a certain point, a software company will have a “storage team” that’s in charge of databases and blob storage. Experience that may have previously been distributed across many people is now concentrated on a relatively small number of specialists.
[4] I’d remarked to a colleague that Stripe today is at a point where it’s essentially impossible for any person to read “all the code.” There’s just too much of it. You can’t read tens of millions of lines of code. Runway is at the point where someone that’s sufficiently motivated could sit down with a case of energy drinks and read all of the code. They might not understand all of it, but they could read all of it.
Stuff you’ve noticed/learned about onboarding and recruiting now that you were recently doing it again. (In other words - what did Stripe do well with recruiting, what were weaknesses elsewhere that you noticed, what are you trying to do intentionally when joining your new place, etc!)