This post is about a case where we didn’t follow our own advice or industry best practices and it bit us. But then other interesting things ensued and we learned some things.
Last week, I wrote a blog post about AppSec Programs that included a live Trello board and exposed a fair amount of the inner workings of how I think about AppSec. I was excited about the post and the concrete template to work from. It won’t get you all the way there, but it is at least a decent reference point.
Well, after I pushed the post, I was naturally logged in to our WordPress and saw that some updates needed to be applied. Like any good security person, I don’t like to let updates sit around unapplied, so I went ahead and applied them.
But unlike a good security (or devops) person, I didn’t follow any kind of change control process, didn’t test in a dev or staging environment, didn’t even snapshot the instance.
So … what I’m saying is that a week ago, when this happened, we didn’t have any non-production version of our website, jemurai.com, and even worse, we didn’t really have a way to get it back if we needed to.
Well, one of the updates I installed was a theme update that introduced a bunch of new content areas with default content. Think “Lorem Ipsum” all over the site with stock photography of people doing who knows what. I was mortified.
I mean, one cool thing about running your own business is that you can actually write this post … but at that time, I was pretty focused on how unprofessional it was and how close to home that hit.
Well, when I first saw what had happened, I looked for backups and other ways to revert the change. Alas, we were not using a WordPress.com versioned theme, so it wasn’t going to be as easy as simply rolling it back. As I mentioned, we didn’t have another backup mechanism in place other than raw AWS snapshots, and none of those were recent enough to be a great option.
The reality is that we had wanted to redo the website completely using GitHub Pages for some time. I’d used that technology for years, just not on the company site. It’s not like we were really leveraging WordPress anyway. We’d even had the cert expire a few times, embarrassingly harkening back to a much earlier stage in my career when we built tools to monitor for exactly that.
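For what it’s worth, that kind of certificate monitoring doesn’t have to be fancy. Here is a minimal sketch of the idea in Python; the hostname and the two-week threshold are illustrative, not what we actually run:

```python
# Minimal sketch of a TLS certificate expiry check.
# Assumes the host serves HTTPS on port 443; hostname and threshold are illustrative.
import socket
import ssl
from datetime import datetime, timezone

def days_until_cert_expiry(hostname: str, port: int = 443) -> int:
    """Return the number of days before the host's TLS certificate expires."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    # notAfter looks like 'Jun  1 12:00:00 2025 GMT'
    expires = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z")
    expires = expires.replace(tzinfo=timezone.utc)
    return (expires - datetime.now(timezone.utc)).days

if __name__ == "__main__":
    remaining = days_until_cert_expiry("jemurai.com")
    if remaining < 14:  # warn two weeks out; pick whatever threshold suits you
        print(f"WARNING: certificate expires in {remaining} days")
    else:
        print(f"OK: certificate valid for another {remaining} days")
```

Drop something like that into a scheduled job and you never get surprised by an expired cert again.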
So we flipped a switch, exported the blog posts and started a new website with GitHub Pages, based on Jekyll and a theme we had used for a few sites we run. Luckily, the blog post itself didn’t look that bad - but anyone who went to the main home page would have been a bit confused by the generic text and stock photos.
It took about 3-4 hours to have something that was good enough to push, so at that point we flipped DNS and continued with minor updates. The next morning we had some links to fix, and we’re still migrating older blog posts - though most of that was also automated.
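The migration automation was roughly this shape: read the WordPress export and write each post out as a Jekyll-style markdown file with front matter. This is only a sketch of that idea, not our actual script; it assumes a standard WXR export file (export.xml), and the file names and front matter fields are illustrative:

```python
# Rough sketch: convert a WordPress WXR export into Jekyll posts.
# Assumes a standard export.xml; output paths and front matter are illustrative.
import re
import xml.etree.ElementTree as ET
from pathlib import Path

NS = {
    "content": "http://purl.org/rss/1.0/modules/content/",
    "wp": "http://wordpress.org/export/1.2/",
}

def slugify(title: str) -> str:
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")

def convert(export_file: str, out_dir: str = "_posts") -> None:
    Path(out_dir).mkdir(exist_ok=True)
    root = ET.parse(export_file).getroot()
    for item in root.iter("item"):
        # Only convert actual posts, not pages or attachments.
        if item.findtext("wp:post_type", namespaces=NS) != "post":
            continue
        title = item.findtext("title") or "untitled"
        date = (item.findtext("wp:post_date", namespaces=NS) or "")[:10]
        body = item.findtext("content:encoded", namespaces=NS) or ""
        front_matter = f"---\nlayout: post\ntitle: \"{title}\"\ndate: {date}\n---\n\n"
        out = Path(out_dir) / f"{date}-{slugify(title)}.md"
        out.write_text(front_matter + body, encoding="utf-8")

if __name__ == "__main__":
    convert("export.xml")
```

The post bodies stay HTML inside the markdown files, which Jekyll renders fine; cleaning them up into real markdown is the part that still takes some manual attention.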
Meanwhile, behind the scenes we were still scrambling a little bit. There was nothing quite like a question in the #appsec-program channel of the OWASP Slack that went something like this:
“Hey Matt, I can’t find this page that is referenced in your post. What’s up?”
“Oh, hold on, let me just …” copy the old post it referenced into the new site, which is on an entirely different platform than the one the reader had been looking at.
Interestingly, Pingdom reported zero downtime. That might make for a good discussion about how much you need to be able to see to know things are actually OK, and it is part of why we think securitysignal.io is so interesting.
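That is the gap with a plain up/down probe: the site was returning 200s the whole time, it just wasn’t serving what we wanted. A content-aware check would have flagged it. A hedged sketch in Python, with the URL and marker strings being purely illustrative:

```python
# Sketch of a content-aware "is the site really OK?" check.
# A plain up/down probe saw HTTP 200 the whole time; this also asserts
# that expected copy is present and obvious placeholder text is not.
# URL and marker strings are illustrative.
import urllib.request

URL = "https://jemurai.com/"
MUST_CONTAIN = "Jemurai"          # copy we expect on the real homepage
MUST_NOT_CONTAIN = "lorem ipsum"  # default-theme placeholder text

def check(url: str) -> bool:
    with urllib.request.urlopen(url, timeout=10) as resp:
        ok = resp.status == 200
        body = resp.read().decode("utf-8", errors="replace")
    return ok and MUST_CONTAIN in body and MUST_NOT_CONTAIN not in body.lower()

if __name__ == "__main__":
    print("OK" if check(URL) else "ALERT: site is up but not serving expected content")
```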
The fact is, I think the website is a little better now. I also think we have better automation around the certificates (since we never have to touch them), better collaboration with pull requests across the team, and better backups. It is also nice, especially in security, to have a static website with no PHP or database code. I’m writing this post in markdown in my favorite editor instead of some WYSIWYG WordPress editor. I can do all that perfectly well offline.
But there is a whole other angle I want to bring up with this scenario.
We had been telling our customers to keep backups, use change control, define RPO (recovery point objective) and RTO (recovery time objective), and test against them. But we didn’t do it ourselves for our website.
Seems embarrassing. But let’s step back for a minute and talk about the actual risks here.
When I write a blog post, we get a little traffic, but interruption of that traffic isn’t a real event. It’s mostly AppSec people who are curious to improve their craft. They’re not buying from us, and if they are, they’re not worried about the website.
I’m not saying the website doesn’t matter at all, but there are pros and cons to having more controls in place to manage the uptime. Specifically:
Pros to having more controls in place:
Cons to having more controls in place:
Now, I have enough trouble blogging regularly to begin with - I don’t need to get someone’s approval to make it even harder!
Most businesses assume that continuity is critical. For some applications it really is. For our customer-facing applications, we have backups, redundancy and change control processes in place. “Real talk”: it is still probably not critical. Downtime would be an inconvenience, and we have come to expect very little inconvenience.
When we, as a security community, treat a website with little to no material risk the same as a business system that genuinely needs a very high SLA, we’re taking a one-size-fits-all approach to security rules and advice, and we lose credibility.
Consider:
According to Gartner, the average cost of IT downtime is $5,600 per minute. Because there are so many differences in how businesses operate, downtime, at the low end, can be as much as $140,000 per hour, $300,000 per hour on average, and as much as $540,000 per hour at the higher end. (https://www.the20.com/blog/the-cost-of-it-downtime/)
I would argue a lot of business systems aren’t really business critical. Of course, it is pretty easy to build resilient systems in this day and age … so there’s not a lot of excuse not to when it matters but …
Anyway, no excuses - it was a failure on my part even if I can Monday morning quarterback it to be just a learning experience.
Speaking of which, I think it might also be appropriate to take a moment to talk about failures.
I have been hesitant to post this blog post. What if people decide that I am incompetent because I didn’t back up WordPress before updating? Or worse, careless?
The truth is, we have all made mistakes. It’s more dangerous not to talk about them and learn from them.
Like any experienced engineering leader, I’ve made mistakes before. Like the time I forgot a WHERE clause when deleting from a table in a production Oracle database and ended up sitting up all night with the DBAs while they restored from tape.
delete from whatever_table_it_was
-- where id=13
That was a big mistake.
I’ve made innumerable smaller mistakes in intricate code. When we give training, I like to ask people how many security bugs they think they’ve introduced into the systems they’re building. My answer is thousands or more, I’m sure.
A big part of how we grow up in security is how we handle failures. Can we step back, learn and do better? A lot of that is cultural. It’s something you can build organizationally, but it’s not something you can just get or buy or manufacture. It takes work and trust. It takes confidence and resilience.
So part of the reason I’m writing this post is to let my team and anyone else who is interested know that I make mistakes. What I want them to notice is how I respond and what happens next.
As much as I can explain away this event, it was eye-opening for me.
It is always better to make conscious risk decisions than to be lulled into suboptimal ones.
I’m a little embarrassed about it. Everything else we build is engineered - it needs to be robust, with varying degrees of failover, redundancy, and so on. There’s no reason the website should be an exception.
So I’m refreshing our threat model and keeping continuity as a focus through the process. We’re getting better. I don’t know a whole lot but I’m pretty sure we’ll always be getting better - which means we will also always be making mistakes.