The domain check
Maybe you use a feature flag service, or maybe you have some custom code for dealing with feature access. It’s great to be able to launch a feature internally, and the natural way to do that is to check the domain name in your users’ emails. Here’s some code:
```typescript
export function userHasAccessToTheFeature(user: User) {
  if (user.email.endsWith('mystartup.biz')) {
    return true;
  }
  return false;
}
```
I’m sorry to report that this code has a bug. All I need to do to access your beta feature (and perhaps every feature at your startup!) is register definitelymystartup.biz and sign up. Always, always, always include the @ symbol if you’re checking the domain on an email! This even applies to feature flag services like LaunchDarkly.
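For reference, the fixed check is a one-character change:

```typescript
// The fix: including the @ pins the match to the exact domain, so a
// lookalike registration like definitelymystartup.biz no longer passes.
export function userHasAccessToTheFeature(user: { email: string }): boolean {
  return user.email.endsWith('@mystartup.biz');
}
```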
The better way to handle this is to have explicit roles in your database. A field on your user object for staff or admins will eliminate the possibility of goofing this up.
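That might look like this sketch, assuming a hypothetical isStaff column on the user record (the field name is made up):

```typescript
// Hypothetical: assumes your users table has an isStaff boolean column
// that only admins can set. Access no longer derives from user input
// like an email address.
interface User {
  email: string;
  isStaff: boolean;
}

export function userHasAccessToTheFeature(user: User): boolean {
  return user.isStaff;
}
```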
Blindly updating
If you run an API, you probably have some code that looks something like this:
```javascript
router.post(
  '/v1/resource/:id',
  requiresLogin,
  async (req, res) => {
    const {foo, bar} = req.body;
    const resource = await Resource.load(req.params.id);
    if (foo != null) resource.validateAndSetFoo(foo);
    if (bar != null) resource.validateAndSetBar(bar);
    await resource.save();
    res.send(resource.toJSON());
  }
);
```
Load a resource, update the resource, save the resource. Easy, no? Well, there are a few problems. Bad problems!
The first problem is that we’re not checking whether the logged-in user even has access to the Resource. Do they own it? Do they have permission to touch it? We need to update our code to authorize the user:
```javascript
router.post(
  '/v1/resource/:id',
  requiresLogin,
  async (req, res) => {
    const {foo, bar} = req.body;
    const resource = await Resource.loadBy({
      id: req.params.id,
      owner: req.user.id,
    });
    if (foo != null) resource.validateAndSetFoo(foo);
    if (bar != null) resource.validateAndSetBar(bar);
    await resource.save();
    res.send(resource.toJSON());
  }
);
```
Better! Now users can no longer read or write resources that they don’t own.
But we have another problem, and it’s subtler: we might have a race condition. If two requests try to modify the same resource at the same time, they might overlap.
Consider these two requests:
```
POST /v1/resource/1234
{"foo": "hello"}

POST /v1/resource/1234
{"bar": "world"}
```
We fire off these two requests simultaneously. Naively, when both requests complete, we might expect the resource to have the value “hello” for foo and “world” for bar. And it might! But it also might not. It’s entirely possible one of those two fields is something else: the value it contained before the resource was mutated.
If both requests run at the same time, it’s possible that this happens:
1. Request 1 loads the resource
2. Request 2 loads the resource
3. Request 1 changes “foo” and saves the resource
4. Request 2 changes “bar” and saves the resource
Some ORMs don’t keep track of which fields were modified! Request 1 and Request 2 are not just saving the individual fields that were modified, they’re saving all of the fields that they are aware of. Which means that Request 2 is overwriting the value of foo that Request 1 set.
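Here’s a toy sketch of that lost update (an in-memory stand-in, not a real ORM, so the interleaving can be shown deterministically):

```typescript
// A simulated row and a naive ORM: "load" materializes every column,
// and "save" writes every column back.
type Row = { foo: string; bar: string };

const db: Row = { foo: 'old-foo', bar: 'old-bar' };

function load(): Row {
  return { ...db }; // full snapshot of the row
}

function save(snapshot: Row): void {
  db.foo = snapshot.foo; // a naive ORM writes back ALL fields,
  db.bar = snapshot.bar; // not just the ones that changed
}

// Both requests load before either saves:
const request1 = load();
const request2 = load();

request1.foo = 'hello';
save(request1);

request2.bar = 'world';
save(request2); // clobbers request 1's foo with 'old-foo'

console.log(db); // { foo: 'old-foo', bar: 'world' } — 'hello' is lost
```

Swap the interleaving and it’s bar that gets lost instead: whichever save lands last wins the whole row.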
There are two potential fixes here! Which one works best is going to depend on your application’s needs and how you want it to behave.
The first solution is to use transactions. A transaction lets you lock a row so that you can modify it in peace. Other requests that want to load that row for modification will sit and wait until your transaction wraps up (either the changes you make get committed to the database, or there’s a problem and the changes are discarded).
```javascript
router.post(
  '/v1/resource/:id',
  requiresLogin,
  async (req, res) => {
    const {foo, bar} = req.body;
    const resource = await db.transaction(async (manager) => {
      const resource = await Resource.loadBy(
        {
          id: req.params.id,
          owner: req.user.id,
        },
        {selectForUpdate: true}
      );
      if (foo != null) resource.validateAndSetFoo(foo);
      if (bar != null) resource.validateAndSetBar(bar);
      await manager.save(resource);
      return resource;
    });
    res.send(resource.toJSON());
  }
);
```
We wrap our load and our mutation in a transaction. When we load, we also tell the database to select the row with the intention of updating it. If you’re using SQL, this is a SELECT ... FOR UPDATE query, though the syntax and semantics may vary slightly by database. When this happens, the database will block anyone else from also selecting that row “for update”.
The upside to this approach is that all of the code inside the transaction that looks at resource sees the value of it as it exists in the database at that moment, regardless of how much time has passed (so long as we’re still inside the transaction).
The downside to this approach is that if we’re updating rows very frequently, we might end up with contention: lots of requests queueing up to modify the same row. This could slow everything down and cause problems.
Many applications don’t actually care about reading the object, only writing to it. If you don’t need to know anything about the object and you just want to set some values on it, you can do that and avoid the overhead and complexity of transactions.
```javascript
router.post(
  '/v1/resource/:id',
  requiresLogin,
  async (req, res) => {
    const {foo, bar} = req.body;
    const valuesToSet = {};
    if (foo != null) {
      Resource.validateFooOrError(foo);
      valuesToSet.foo = foo;
    }
    if (bar != null) {
      Resource.validateBarOrError(bar);
      valuesToSet.bar = bar;
    }
    await Resource.update({
      where: {id: req.params.id, owner: req.user.id},
      setValues: valuesToSet,
    });
    const resource = await Resource.loadBy({
      id: req.params.id,
      owner: req.user.id,
    });
    res.send(resource.toJSON());
  }
);
```
Here, we use our ORM’s update feature to pass a set of values we explicitly want to change on the objects in our database that match the provided criteria. Instead of loading the row, changing it, then saving it, we tell the database to update a row that we believe to exist. After the update completes, we fetch the row and return it.
This approach alleviates the contention that we faced in the previous approach: there are no transactions, so our database can avoid some overhead.
The big downside here is that this doesn’t feel very ORM-ish. It also precludes us from safely making conditional changes to an object because we need to see the object to decide what changes to make, and that requires a transaction so the object doesn’t change while we’re looking at it.
Always consider that your code is competing with other applications or processes in production. If you read some data and change it based on what you saw, the data could have changed while you were looking at it, unless you took steps to prevent that.
Check your strings
Here are two pieces of code. One is in JS, one is in PHP[1]. Both share the same bug. See if you can tell what it is!
```javascript
router.post(
  '/v1/foo',
  async (req, res) => {
    const {id} = req.query;
    if (!id) return res.status(400).json({error: 'No ID'});
    await processResource(id, req.user.id);
    res.status(200).json({status: 'Thanks'});
  }
);
```
```php
Route::post('/v1/foo', function (Request $request) {
    $id = $request->query('id');
    if (empty($id)) return Response::json(['error' => 'No ID'], 400);
    processResource($id, session('userId'));
    return response()->json(['status' => 'Thanks']);
});
```
(forgive me if this isn’t idiomatic Laravel, it’s been a hot minute since I’ve written PHP)
The bug is extremely easy to miss. In fact, the docs for both Express and Laravel fail to note this gotcha: query string parameters can be arrays.
Node, as an example, defers to the querystring module in the Node standard library. Which is cool and all, except:

```
> require('querystring').parse('foo=bar&foo=zap')
[Object: null prototype] { foo: [ 'bar', 'zap' ] }
```
Yeah, params set more than once come out as an array. In PHP, foo[]=bar&foo[]=zap will have the same effect. The code above checks that the ID is passed, but it doesn’t check that it’s a string.
How can that go poorly? For one, you’re fixin for a 500. But it can be abused as well: in JavaScript, casting an array to a string is roughly the same as calling .join(',') on it, which is kind of wild. You might also be using methods on the string to check for weird stuff, methods which also exist on Array.prototype:
```javascript
async function processResource(id, owner) {
  // This will succeed with an array
  if (!id.includes('@')) throw new Error('Not an email!');
  // ...
}

// or

async function processResource(id, owner) {
  // All of this will succeed with an array
  // `id` gets cast to a string here and it passes the vibe check
  if (!/.+\[.+\]/.exec(id)) throw new Error('Invalid id');
  // `slice` and `indexOf` are both methods on Array.prototype
  const value = id.slice(0, id.indexOf('['));
  return await doSomethingWithSQL(value);
}
```
An example for PHP is left as an exercise to the reader.
Who knows what mayhem this issue will cause. At best, someone can easily trigger 500s that will perhaps page your on-call in the middle of the night. At worst, it means you’ve got a security incident.
The lesson here is to explicitly validate the types of the values your API accepts, in both the query string and in the body. Always validate your inputs. Check that they are exactly what you expect them to be.
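For the id above, that check might look like this sketch (requireStringParam is a made-up helper, not part of Express or Laravel):

```typescript
// Hypothetical helper: narrow an untrusted query value to a single,
// non-empty string before using it. Arrays, objects, and numbers
// are all rejected instead of silently coerced.
export function requireStringParam(value: unknown): string {
  if (typeof value !== 'string' || value.length === 0) {
    throw new Error('Expected a single non-empty string');
  }
  return value;
}
```

In the handler, `const id = requireStringParam(req.query.id);` means `['bar', 'zap']` throws before it can ever reach processResource.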
Subdomains
Perhaps you run a service where your users can choose a username and get a subdomain. I sign up as cooldude420 and I get https://cooldude420.yourservice.biz/. Nice. Maybe yourservice.biz is even separate from the main dashboard/app, preventing cookies on the root domain from seeping into a user-controlled subdomain. You’re doing all the right things!
Did you remember to block trailing hyphens in usernames?
It turns out that trailing hyphens are not allowed in subdomains. Which is kind of wild when you think about it, really. Chrome doesn’t care, but Firefox will refuse to load a URL that has a subdomain with a trailing hyphen.
Almost nobody chooses to have a trailing hyphen. Or very few people, anyway. But oftentimes what happens is you have some sort of Name field, and you slug-ify the name with a cheeky function like this:
```typescript
export function slugify(input: string): string {
  return input
    .toLowerCase()
    .replace(/['"()&!.]+/gi, '')
    .replace(/[^a-z0-9]+/gi, '-')
    .replace(/^-+/, '');
}
```
Trailing spaces in the input, though, or trailing characters that don’t translate to ASCII well, get replaced with a hyphen that never gets stripped. Add validation to both the front-end and back-end to ensure the username is valid, and if you plan to use it for subdomains, check that you’re following the rules for domain labels[2].
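One possible fix for the slugify above (a sketch that just mirrors the existing leading-hyphen strip):

```typescript
// Same function as before, plus one replace: strip trailing hyphens
// the same way the original strips leading ones.
export function slugify(input: string): string {
  return input
    .toLowerCase()
    .replace(/['"()&!.]+/gi, '')
    .replace(/[^a-z0-9]+/gi, '-')
    .replace(/^-+/, '')
    .replace(/-+$/, ''); // trailing hyphens break the subdomain in Firefox
}
```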
I don’t know that there’s a lesson here. It’s just an annoying piece of trivia that causes very problematic headaches.
Not-so-regular expressions
Here’s a bit of Python code that has a bug:
```python
import re

# Match anything that looks vaguely like an IP
ip_matcher = re.compile(r'([0-9]+\.?)+$')

def is_maybe_an_ip(val: str) -> bool:
    return bool(ip_matcher.match(val))
```
And while that’s not at all what an IP looks like, maybe we’re trying to detect weird IP-lookin stuff. Who knows. The code does what it says on the tin:
```python
>>> is_maybe_an_ip('127.0.0.1')
True
>>> is_maybe_an_ip('255.255.255.0')
True
>>> is_maybe_an_ip('not an ip')
False
```
So despite matching stuff that’s objectively not an IP address, what’s the big deal?
Well, it’ll crash your server.
```python
>>> from timeit import timeit
>>> timeit(lambda: is_maybe_an_ip('12345678912345678911111111x'), number=1)
3.5675897919999215
>>> timeit(lambda: is_maybe_an_ip('12345678912345678911111111111x'), number=1)
28.434655499999963
```
Sorry, what?
This is an issue known as ReDoS. Regular Expression Denial of Service. A badly written (or malicious) regular expression can take exponentially long to execute given some carefully crafted input. The result is fast for short strings, but you can see in my example how about twenty characters caused it to take almost thirty seconds. Adding more characters would easily make it take minutes, hours, or days.
I won’t go into depth on why this happens, but suffice it to say that ([0-9]+\.?)+ can match the same string multiple ways. E.g., given the string “11111”, the inner plus can match a single 1 five times. Or 11 twice and 1 once. And so on, for all the combinations. Meanwhile, the outer plus can match combinations on top of that. This makes it run in exponential time. Wikipedia can do a much better job of explaining this than I can.
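One way to defuse this particular pattern (a sketch, shown here in JavaScript; the same rewrite should work in Python’s re): make the separator dot mandatory, so there’s exactly one way to split the digits between repetitions. Note this variant no longer accepts a trailing dot, which the original did.

```typescript
// Each repetition must end in a literal '.', so the inner and outer
// quantifiers can no longer trade digits back and forth: one parse
// per input, and backtracking stays roughly linear.
const ipLike = /^(\d+\.)*\d+$/;

export function isMaybeAnIp(val: string): boolean {
  return ipLike.test(val);
}
```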
Now of course, this issue will cause timeouts in certain environments. Timeout and move along. But imagine if I knew one of your endpoints was vulnerable to this, and I sent a barrage of these requests at your server. Suddenly, all of your server’s workers are pegged computing this really bad regex.
Now, if you’re on something that will kill the process, your server process will die after the request timeout. What if it then takes four seconds for your server process to reboot? What if all of your server processes reboot simultaneously? What if it takes a lot longer than four seconds for each process to reboot, or if the startup of your server does something inefficient like download a big file or load a machine learning model into GPU memory? Then you’re talking about an outage, especially if I’m still barraging your vulnerable endpoint and causing the rebooted workers to immediately fall back over.
How do you fix this? There are tools out there that can check your regular expressions. ESLint has a ReDoS plugin, as do the linters for many other languages. If there’s not a linter or linter plugin, chances are there’s a tool you can use to scan for vulnerable regular expressions.
ReDoS is a kind of DoS attack aimed at compute (clogging up the CPU). But there are similar vulnerabilities with memory, like the Billion Laughs Attack or zip bombs. In the case of billion laughs, a custom XML entity[3] is defined that translates into multiple other XML entities that translate into even more other XML entities until the combinatorial explosion fills your RAM. Similarly, a zip bomb uses file compression to create a small ZIP file that, when decompressed, turns into petabytes of data[4]. Even YAML is vulnerable[5] to this.
The lesson here is to be careful with your regular expressions and other algorithms for processing untrusted or user-submitted data! Even a seemingly trivial regular expression (the one in my example was short) can cause a lot of grief.
Any time you’re dealing with untrusted input, research whether there’s a “safe” tool to use. Python, for instance, has defusedxml for parsing untrusted XML. Many YAML libraries have a safe_load method which, as its name implies, is safe. With very few exceptions, no encoding or format[6] is safe to blindly trust.
1. Extremely rusty PHP. Apologies if it’s wrong.
2. A PDF: https://portal.icann.org/servlet/servlet.FileDownload?file=00P6100000FPBoDEAX Notably, the third AND fourth characters can’t both be hyphens.
3. Imagine if you could write &amp; but have it output whatever you wanted (instead of an ampersand). That’s an XML entity: you define what you want it to be, and the XML parser will replace your entity in the parsed data.
4. A good mental model is the ZIP equivalent of the instruction “the decompressed file is just the string ‘X’ repeated a quadrillion times”.
5. YAML has other issues, and you should always be careful to read the docs.
6. JSON, perhaps, is an exception. JSON has so few features that it would be challenging to abuse a parser in a way that causes an issue.