That time I broke login at Box

and what it says about business logic and code re-use

May 19, 2023

This was almost a decade ago now, so I don’t think anyone from Box is going to get salty at me for telling this story.

The story

I worked at Box for exactly two years from the start of 2014 to the end of 2015. When I joined, I took a front-end role working on the Performance team. Our team charter was to make the Files page (the page you’d get with all your files when you logged in) have a P50 load time of under one second for users in the United States. At the start of 2014, we were far off from this goal—by a lot. Both the server-side and client-side load times were each over one second for our target cohort, so lots of stuff needed to change.

Most of my time at Box was spent working on the front-end: at one point, I realized the only way that we’d achieve our goal was to undertake a rewrite of the JavaScript1. At the same time, one of the senior designers was pushing for a redesign. We joined forces and kicked off what later became known as All New Box. It didn’t fully ship until ten months after I left the company, sadly.

Box’s stack at the time (I assume it has changed quite a bit since then) was home-grown MVC. httpd would rewrite your request’s URL from something like /files to something like /index.php?rm=box_files. Each “controller” is what Box called a “run mode”: a blob of PHP that was invoked by what was essentially a giant switch statement. The endpoints that returned HTML returned HTML, the endpoints that returned JSON for APIs would return JSON.

Rewriting the front-end meant that we needed to change the HTML. Or, at least, return new HTML when the redesign feature flags were enabled. That, in turn, meant that we needed to change code on the server-side as well as the JavaScript and CSS.

If you’ve ever rewritten some old, crufty code (particularly an MVC controller), you know the sorts of horrors you might find. At the time, Box had attempted a number of server-side frameworks, but none had gotten a ton of adoption, and so there was a mishmash of various systems to fetch data and prepare the Files page. Moreover, the controllers weren’t easily tested because they relied on things like PHP superglobals, which are (were?) tricky to mock appropriately in tests.

To make a long story short, I came up with some helpers that wrapped common operations: $this→getIntegerFromGET(‘query_param_name’) would yield an integer passed as “query_param_name” in the query string. If the query parameter wasn’t passed, it would handle the error sensibly. A whole collection of these methods came together to (almost) form a little mini framework. It became a lot easier to write the “glue” code, and it was a lot faster to write tests.

Nothing here was revolutionary. If you look at Laravel today, this is table stakes. For Box at the time, though, this was a huge improvement and meant that we could clean up some really gnarly code, especially stuff that had been built on failed or unsupported server-side frameworks.

One of those frameworks was lovingly called Layers. Layers is probably now lost to the sands of time2. In principle, it was actually a good idea: you’d write UI components as classes in PHP, then compose those classes together to create more complex UIs. In a lot of ways, Layers was conceptually very similar to how you’d build a UI with modern component-based UI framework like React, it was just all server-side. Layers had a number of problems, though:

The folks that built it weren’t actively supporting it (and if I remember correctly, may have left).
It wasn’t obvious how to build idiomatic Layers components. There were no good examples or docs.
There was no “one good way” to do something correctly. You could take multiple separate approaches to accomplish the same task, making its use inconsistent.
The framework had design flaws that made it hard to do things that should have been trivial.

The problems with Layers made it unsuitable for pretty much anything. Only one team had gone all-in on Layers with enthusiastic optimism, and their team lead later begrudgingly admitted that it wasn’t very good. What they’d built with Layers was the login pages.

My rewrite was coming along fairly well, but as with any long-running project, we started to get pressure to show signs of progress. Speaking with my manager and the designers, we agreed that an easy win would be to build and ship the redesigned login pages. The login pages were almost entirely isolated from the rest of the web app. They used very little front-end JavaScript (by comparison, at least), the security team was bullish on simplifying the auth code, and we could put some of the new back-end helpers into production use. Moreover, it benefitted the goals of the Performance team.

The team that built the Layers-powered login pages had handed off responsibility to another team that was more focused on the back-end. We originally approached this team to help with the back-end changes for the rewrite, but they didn’t have time and weren’t willing to contribute anything. When pressed, they didn’t have any documentation for how the login pages should work.

I ended up taking the project on, with help here and there from other folks on the team and our wonderful interns3. The designs were simple enough, and it only took a few weeks to get the login, forgot password, email confirmation, SSO, and the handful of other pages ready to ship. We rolled them out, folks were excited, and we moved along.

Weeks later, one of our largest customers was on the phone. They were furious. Everyone at their company couldn’t log in. Our QA team had confirmed the issue. The issue was suspected to be a regression in my code. At that point, I’d already deleted the old code, so we couldn’t easily revert it or flip a feature flag off. I walked across the office to find out what was wrong.

All of the customer’s employees were being prompted to reset their password. But the company used SSO, so their users didn’t have a password. They got redirected back to the login page, which—when they typed their email—prompted them to reset their password.

The root of the problem would have been shocking if it was any other company other than Box: the customer had the “users must reset their passwords every 90 days” option selected in the admin settings. Even though these users had SSO enabled, they still had a password. But why? The whole point of SSO is that you don’t have a password!

It turns out that even with SSO, you sometimes do still have a password. Because Box offered FTP access, which has no support for SSO. So if you used FTP, you needed a password to do so. And the system diligently required users to set and reset a password (that most of them never used) every ninety days. And this particular customer’s users had all crossed the 90 day mark all at once, weeks after I’d shipped my change.

The fix was simple, probably only a dozen or two lines of changes. I stopped redirecting SSO users who needed a password reset away from the password reset page, unblocking those folks from logging in. I remember it taking only a couple hours to prepare the change, and the ops folks started an off-schedule deployment.

The PM for the back-end team that owned the login pages—the same one who said they had no resources to help—was furious with me. They escalated their frustrations (to a VP, if I’m not mistaken!), with bad excuses about code ownership and lack of diligence. I don’t know if leadership understood that their complaints were bullshit, if someone had advocated for me, or if nobody cared, but nothing ended up coming of it.

Looking at the login page for Box today, it looks like a surprising amount of the code from that era endures today. The final result has really endured the test of time!

Business logic and code re-use

I’m writing this section after looking at Box’s documentation for FTP. It seems they now have a notion of an “external password.” It’s not clear to me whether this still uses the underlying account password or whether this is a new, separate feature. I’ll give Box the benefit of the doubt here and assume they’ve rigorously considered a more robust solution.

Perhaps the most often-abused principle of software development is “Don’t Repeat Yourself.” Not because it’s bad—it’s not. In fact, most of the time, not repeating yourself in your code is a great thing. Right now at Runway, I’ve been working on reducing the number of different places that we perform certain kinds of lookups about users, and the result is remarkably better.

The problem with DRY is that the very name of it is written as an absolute: don’t! Don’t repeat yourself. It would be a lot less catchy if it was “only sometimes repeat yourself” or “repeat yourself only when there’s a compelling reason to do so.”

The particular piece of logic that I’m going to hate on here is the use of account passwords for FTP credentials. I understand why the original architect of this system at Box did what they did. There are plenty of reasons that appear good, at least on the surface:

Accounts have passwords. FTP uses accounts. FTP should use account passwords.
Accounts need their passwords rotated. FTP uses passwords. FTP passwords need to be rotated.
All of the UI for handling passwords for accounts already exists. Why build it again for FTP?

An uncritical look at the product requirements for FTP auth would lead plenty of folks to say, “Yeah, of course, just use account credentials for FTP.” But there’s some really crucial downsides:

Not all users use a password. When a user has SSO enabled, the account password is not being used for account auth, it’s being used for FTP auth.
Account passwords and FTP passwords are not used the same way. Though we all wish they’d use a password manager, many, many users type their account password from memory. It’s likely pretty rare that folks will type their FTP password; it’s probably remembered by their FTP tool.
At the same time, folks who use their browser to manage or remember their account password will not have it easily available for FTP use.

In the case of Box, they treated credentials and all of the infrastructure for them as one logical unit: FTP credentials are account credentials. The place you reset your account password was the place you reset your FTP password. There’s one way of doing things.

And in fairness, they’re DRY! They haven’t repeated any of that code. There’s a single way of doing things. In fact, the implementation (which, on the FTP side, I haven’t seen) was probably beautifully succinct. But of course, there was a major customer issue.

I won’t say that I was 100% exhaustive in my search to understand the product requirements of the login pages when I rewrote them. I looked at the code, I looked at all the pages, and I replicated the existing stuff with the new framework. I did make the principled choice to redirect users from the password reset page to the login page when they were SSO users. But that’s not a bad assumption to make! Those users have no business on the password reset page.

From a functionality perspective, it makes no sense to require users who auth with SSO to reset their password. During development, the ability for SSO users to be on the password reset page looked like an oversight, not a feature. If you’re going through a rewrite because the code quality is especially low, every missed edge case in the old code simply shouldn’t be replicated in the new code.

We talk a lot about self-documenting code. And while it’s possible (with effort) to write code that’s somewhat self-documenting, what doesn’t exist is self-documenting business logic. Business logic (or application logic, or whatever you want to call it) isn’t explicitly written out, it’s spread across all of your code in many files. Understanding business logic requires you to understand all of the constraints of the code, and the nature of code doesn’t lend itself to centralizing those constraints.

Tests can be written that document business logic to some degree, but that relies on the tests being exhaustive, well-maintained, readable, and correct. Writing documentation in a wiki or Google Doc helps, but relies on the author(s) being exhaustive (and for the documentation to stay up-to-date).

What’s important to understand here is that business logic is a superset of what code represents. Code can, for instance, describe a constraint of your business logic. The business logic, therefore, envelops all of the pieces of code in the system that the business logic supports. Making code reusable is a great thing to do: many things that share the same constraints can share the same code.

Making business logic reusable, on the other hand, is a recipe for disaster: you’re reusing the same sets of constraints across different problem spaces. If two problems can reuse the same business logic, they’re actually the same problem. As soon as you start having one system solve for two different sets of constraints, you’re skating on thin ice.

Box’s approach to account credentials is representable as business logic: there’s code that supports the authentication and authorization of those credentials, UIs to support that auth (and the business processes around it, like password reset), and the rules for how passwords are handled.

The approach for FTP credentials didn’t just reuse the code. They could have, for instance, re-used the set of functions that take a hashed and salted password and compare it against a user-provided password. They also could have re-used (in theory) the UI components for things like password entry. All of those pieces have well-defined inputs and outputs. Instead, they reused the whole system.

This created cross-product dependencies that were invisible. Absent documentation, there’s no way for someone like me to have traced the system to understand that the FTP credential system relied on the specific behaviors of the account credential system unless I—the front-end engineer rewriting the UI of the login page—traced the system allllll the way deep down into the bowels of authentication for FTP and reverse engineered that entire system’s product requirements. The one weird “missed” edge case for password reset becomes a landmine.

This happens unthinkingly. Nobody sets out to make systems more complicated to run or maintain. At Stripe, we had a system for session management4 that spanned the API and dashboard. A new product was being built (just before I’d joined in 2017) that bolted on a subclass that added a new type of session. The engineers who did this unknowingly reused not just the code for sessions, but the business logic that those sessions were built on.

The story ended similarly: folks made changes to sessions for the dashboard, causing a bad interaction with the other kind of session. And they really had no way of knowing. Unless they traced the session code all the way through and found the subclass, and then reverse engineered all of the constraints that the products built with it required, they were destined to cause an issue.

Almost without fail, you can look at an engineering org of sufficient size and find this exact story played out to the letter.

What is there to do?

Don’t let the moral of this story be “don’t reuse things.” If you examine at what these sorts of examples of reuse look like, it’s really reuse with extension. The Box password system had to be extended to support FTP5. The Stripe session code had to be extended to support the new product.

It’s probably not easy to reuse without extending, otherwise we’d just do it. Box might have had to undertake a migration to move passwords to be their own database entity, and make the UI code support multiple kinds of passwords per user. Stripe might have had to refactor the underlying Session base class to be more flexible, or design a more comprehensive security model for how users are logged in.

In both examples, all of the chunks of code are black boxes. All consumers of the password code can look at only their dependencies and see a full set of constraints (without having to look at other things that depend on the password code).

And that’s a great rule of thumb. Systems start to break down when you need to add new modes to your existing code, or when the rules for how code behaves change depending on what uses it. If you can’t think of a dependency as a black box (e.g., you can’t treat passwords as a black box because SSO is a problem), it’s time to start backing up and figuring out a different plan.

The normal advice of “don’t rewrite it” didn’t really apply in this case: the JavaScript was built off the original code written eleven years prior. The code was concatenated together, there was no modularity, no sense of organization, nothing. One big file of deeply nested code controlled unrelated functionality used on dozens of pages. And to absolve myself of the full responsibility for making this call, I’ll note that the principal architect at the time was pushing on the effort to rewrite the code to use his own module system.

I’d love to know whether the last of the Layers code was finally sunset. If anyone at Box knows, send me a message!

For all the stories I have about Box, the interns that joined our team were some of the best folks I’ve worked with in my career. One intern in particular, David Lee, later joined my team at Stripe. He kicked ass there, too, and is presumably kicking ass over at Retool.

When David came to interview at Stripe, my manager, of course, asked him about his experience at Box. Knowing that I’d also worked at Box, my manager asked David if he knew me. David said that we were on the same team. It was the end of the interview loop, so my manager offered to bring David up to where we sat to say hello, and David said, “No thanks.”

I hadn’t known that David applied to join Stripe, let alone that he was interviewing. I learned that he was interviewing because my manager stormed up to my desk and demanded to know, “What did you do to your interns when you were at Box??” 😂

Sessions as in “remembering who you are across requests using cookies”

It’s worth noting that Box may have assumed (before they introduced SSO) that account and FTP credentials should just be the same thing. Their constraints at the time may have overlapped completely. However, when SSO was introduced as a feature, someone needed to think about this. At that point, the constraints changed and it should have been an indication that broader changes were necessary.

Basta’s Notes

Discussion about this post