System stability is a steady state, not a goal
I think the goal isn't actually better software; it's durable business success, which typically translates to happy users, which in turn translates to better software. Of course "better" is doing a lot of work here -- better to whom? Turtles all the way down 🙃.
Anyway, I think as it relates to operational excellence, we'd do well to think of Meadows' advice to "dance with systems". Vis-a-vis happier users, we need to *understand* the health of our systems *as they appear to our users*, and we need to gain this understanding with a reasonable time investment.
I think this largely tracks with your suggestions and is in tension with Pulumi's blog post -- I mean, if it works for them, great! But as an argument, I'm not at all convinced that the only way to have happy users is to read every error. Plenty of folks don't do that and have happy users.
So I guess my biggest quibble here is just, like, let's stop calling this stuff KTLO. Gardening our systems so as to avoid haunted forests and increase the rate at which we can safely change (read: ship new features) is no less deep a domain or focus area than "distributed systems" or "user experience". I think calling it KTLO does it a disservice and primes us to think about it as a tax. Sure, *toil* is annoying, and some amount of toil is part of operational work, but toil and operational excellence and KTLO are not synonyms!
there is so much alpha in just looking at errors that people are ignoring
usually points to parts of the software that are poorly observed
which then points to things that are unknown to others
which then, again, points to places where large pareto chunks of opportunity (muh impact) have not been noticed (low hanging fruit)
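
fwiw, a rough sketch of what "just looking at the ignored errors" could mean in practice: count errors by signature and see how few signatures cover most of the volume. Everything here is an assumption for illustration: the `pareto_errors` helper, the 80% cutoff, and the sample signatures are all made up, and real error grouping is usually messier than exact string matches.

```python
from collections import Counter


def pareto_errors(signatures, coverage=0.8):
    """Return the most frequent error signatures, in descending order,
    stopping once they account for at least `coverage` of all errors."""
    counts = Counter(signatures)
    total = sum(counts.values())
    picked, running = [], 0
    for sig, n in counts.most_common():
        # Stop as soon as the signatures picked so far reach the cutoff.
        if total and running / total >= coverage:
            break
        picked.append((sig, n))
        running += n
    return picked


# Hypothetical error signatures, invented for this example.
sample = ["timeout"] * 6 + ["null_ref"] * 2 + ["bad_input", "disk_full"]
top = pareto_errors(sample)
for sig, n in top:
    print(f"{sig}: {n} of {len(sample)}")
```

if two signatures out of four cover 80% of the noise, those two are probably where the poorly observed parts of the system (and the low hanging fruit) are hiding.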