You have impactful outages frequently
Outages are stressful to those involved and can affect both customer perception and the more immediate bottom line. They distract from delivering incremental value and are hugely demoralizing.
An outage in this context means anything where customer experience is adversely impacted, whether the service is hard-down or just performing painfully slowly.
Although it’s not necessary to institute extreme levels of sophistication like chaos engineering or automated canaries, it is vital to:
- Perform adequate validation to maximize confidence in a given change
- Use observability principles to develop key operational measures that give insight into how systems are performing, and make sure that the knowledgeable people who can resolve problems fastest are notified immediately
- Plan for contingency and maintain the ability to roll back to a previous version if necessary
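As a rough illustration of what a “key operational measure” might look like, here is a minimal sketch that turns a window of request samples into alerts. The `Request` type, the function name, and the thresholds (1% errors, 500 ms at the 95th percentile) are assumptions chosen for the example, not recommendations; real thresholds should come from what your customers actually tolerate.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class Request:
    """One observed request (hypothetical shape for this sketch)."""
    latency_ms: float
    ok: bool


def evaluate_health(requests, max_error_rate=0.01, max_p95_ms=500.0):
    """Return a list of alert messages; an empty list means healthy."""
    if not requests:
        # Silence can mean hard-down, so treat it as a signal too.
        return ["no traffic observed"]

    alerts = []
    error_rate = sum(1 for r in requests if not r.ok) / len(requests)
    if error_rate > max_error_rate:
        alerts.append(f"error rate {error_rate:.1%} exceeds {max_error_rate:.1%}")

    latencies = [r.latency_ms for r in requests]
    # quantiles(n=20) yields 19 cut points; the last is the 95th percentile.
    # It needs at least two samples, so fall back to the single value.
    p95 = quantiles(latencies, n=20)[-1] if len(latencies) > 1 else latencies[0]
    if p95 > max_p95_ms:
        alerts.append(f"p95 latency {p95:.0f} ms exceeds {max_p95_ms:.0f} ms")
    return alerts
```

Both conditions from the definition of an outage above are covered: outright failures show up in the error rate, and “painfully slow” shows up in the latency percentile. Routing the returned alerts to the right people is deliberately left out of the sketch.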
Your costs are running out of control
Your costs are increasing every month, and for some reason they are growing more than you think they should, but you don’t know why. Either that, or you have no idea what your costs are, and you’re going to get a nasty surprise at some point in the future. Some common culprits include:
- Abandoned resources, where something gets created and then forgotten about, or is used for a while and then left behind without being cleaned up.
- Over-provisioned capacity, where somebody created one or more resources that were oversized “just in case”, rather than thoughtfully designing and deploying with auto-scaling.
- Capacity leakage, for example where the absence of adequate retention and lifecycle policies means that demand for storage grows beyond what is truly required.
- Proliferation of environments, driven by bloated process and cases where Conway’s Law takes hold. I’ve seen cases where development teams wanted per-branch environments, and QA teams insisted on multiple-phase test environments. The ideal is one environment (production). The more you drift from that, the more expensive and fragile things will become.
Often the root cause is a lack of accountability, where nobody feels a sense of responsibility, or a lack of ownership, where people who want to make a difference are fearful of causing an outage by removing something critical.
We found that it was vital to get the right level of insight to key stakeholders and influencers to help drive conversations around ownership and responsibility. Tagging, tracking, and getting the right level of detail to the right people can help with this, as well as automation tooling such as Capital One’s Cloud Custodian, but it starts with a mind-set change.
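To make the tagging and tracking idea concrete, here is a minimal sketch that scans an inventory for cleanup candidates. It does not use Cloud Custodian or any cloud provider API; the `Resource` record, the `owner` tag name, and the 30-day idle window are all assumptions for illustration, and in practice the inventory would come from your provider’s billing or resource APIs.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Resource:
    """Hypothetical inventory record for this sketch."""
    resource_id: str
    monthly_cost: float
    last_used: date
    tags: dict = field(default_factory=dict)


def find_cleanup_candidates(resources, today, idle_days=30, required_tag="owner"):
    """Return (resource_id, monthly_cost, reasons) tuples, most expensive first."""
    cutoff = today - timedelta(days=idle_days)
    findings = []
    for r in resources:
        reasons = []
        if required_tag not in r.tags:
            # Nobody is accountable for this resource.
            reasons.append(f"missing '{required_tag}' tag")
        if r.last_used < cutoff:
            # Possibly abandoned.
            reasons.append(f"idle since {r.last_used.isoformat()}")
        if reasons:
            findings.append((r.resource_id, r.monthly_cost, reasons))
    # Sorting by cost puts the most valuable conversations at the top.
    return sorted(findings, key=lambda f: f[1], reverse=True)
```

The output is a report rather than an automatic deletion, which fits the point above: the goal is to get the right detail to the right people so that an owner can decide, not to remove things blindly.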
That’s not all, though: if you tried doing a “lift and shift” of legacy services from the datacenter to the cloud, then chances are that your costs will give you an unpleasant surprise, as they will if you try to manage fresh cloud infrastructure the same way that you did in the datacenter. Don’t do it.
Your cloud accounts are filled with “spaghetti” resources
It kinda sounds delicious, but it’s actually a bit more like the spaghetti monster. Provisioning infrastructure by hand without great consistency and hygiene means that accounts will rot over time. With inconsistency comes uncertainty and increasing use of special cases as staff become increasingly creative.
At some point it will become hard to understand what is there and whether or not any particular resource is important.
Eventually, the fear of touching anything will become overwhelming, and in addition to paying for stuff you don’t need, operations will develop a demoralizing sense of paralysis.
Tie as much infrastructure to specific applications as you can (and keep infrastructure as code as close to application source as possible). Abstract where you can to avoid bulk copy-and-paste, which is also a source of error and effort. Everything else should be built through standard, consistent, and well-considered templates. Think twice (perhaps three times) if you ever have to build something that isn’t experimental through the console.
You can’t roll back a bad version within minutes
Imagine the scenario where a shiny new feature has been rolled out to production, and then all hell breaks loose two hours later. Because either the previous version is no longer available, or there are low levels of confidence in successfully undoing the change, the application will suffer extended downtime as all of the puzzle pieces are put back together.
Immutable infrastructure will help trivialize this; traditional server configuration management tools will not.
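Here is a minimal sketch of why immutability makes rollback trivial, assuming each release is an immutable, versioned artifact (for example a machine or container image) that is never modified after it is built. The `ReleaseHistory` class and the artifact names are hypothetical; a real system would also repoint a load balancer or service definition at the previous artifact.

```python
class ReleaseHistory:
    """Tracks which immutable artifact is live (illustrative only)."""

    def __init__(self):
        self._deployed = []  # artifact identifiers, newest last

    def deploy(self, artifact_id):
        # Deploying never changes an old artifact; it only adds a new one.
        self._deployed.append(artifact_id)

    @property
    def live(self):
        return self._deployed[-1] if self._deployed else None

    def roll_back(self):
        """Make the previous artifact live again; return (bad, restored)."""
        if len(self._deployed) < 2:
            raise RuntimeError("no previous release to roll back to")
        bad = self._deployed.pop()
        return bad, self._deployed[-1]


history = ReleaseHistory()
history.deploy("app:1.4.0")
history.deploy("app:1.5.0")  # the shiny new feature
history.roll_back()          # app:1.4.0 is live again
```

Because the previous artifact still exists exactly as it was when it last ran in production, rolling back is a pointer change rather than an attempt to reverse a series of in-place modifications.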
Supporting services haven’t been updated in years
Imagine a hypothetical Jenkins server that was commissioned a couple of years ago. Imagine that the owners of that server didn’t have time to invest in its upkeep and curation, and perhaps allowed smart individuals to install a variety of favorite plugins to meet their varying needs.
Skip forward to the present. The server is overloaded for most of the working day, and there are dozens (if not more) of security vulnerabilities.
The latest version of Jenkins has a lot of cool features available that would make delivery of projects far easier. Wouldn’t it be great to get those new features, install on faster hardware, and close all of those security holes?
Unfortunately, it turns out it may not be so easy:
- Upgrading to the latest version will likely need several intermediate steps since there is no easy way to navigate all of the file format and database configuration changes that have occurred since original installation.
- Due to the lack of curation, there may be multiple plugins installed which perform almost exactly the same role, multiplying the effort required.
- Worse still, perhaps many of those plugins were written by someone in their spare time and have become abandonware, so they aren’t compatible with the latest version.
So a tendency develops to avoid upgrading, because of the risk of breaking jobs in a way that would adversely impact organizational productivity and the value stream.
Then things compound, until the only way out is to throw everything away and start again. Just be careful to really throw things away and not leave those old servers with running jobs, because that would mean now there’s even more stuff to maintain.
I took this hypothetical example from several real-world experiences, but this is not a problem specific to Jenkins. Generally, the longer a service has been allowed to rot without updates, the harder the task becomes. This begins a feedback cycle of doom.
Although it sounds a bit boring, having an easy and disciplined approach to curation and upgrades will save so much frustration and time later.
Production systems are patched on-the-fly in order to restore service
It might be that there was a server configuration setting that didn’t make its way through in the change, or perhaps a database migration didn’t behave the way it was intended. Alternatively, perhaps there isn’t sufficient administration tooling in place for business operations staff to update records through a user interface, and so the user interface becomes an SQL command line prompt. In any event, the operations team is asked to make “on the fly” changes to production to get everything up and running.
There are so many things wrong with this approach, but most fundamentally:
- They remove the safety net of testing: it’s easy to fat-finger a change and end up in potentially deeper water, let alone not having an easy way to verify whether the intended change will actually work.
- They bypass auditing and change control: because nothing goes through version control or change tracking, someone working at a console could literally do anything to production systems, including breaching confidentiality controls, fabricating back doors, or making unauthorized changes for personal gain.
Because manual changes to production risk putting those systems out of sync with earlier testing environments, it’s also possible (and probable) that a future change will cause another outage through regression.
Immutability takes away the temptation to make these changes from an infrastructure perspective, and thoughtful design of administrative functions from a product perspective is essential to preserve data integrity.
You have to figure out continuous delivery from first principles for new projects
Imagine being killed by a snowflake. This goes back to the principle of Cattle not Cats, and — rather than new projects being cookie cutter in nature — you’re actually working in a kitten mill.
- Because of an absence of standardization, there is so much uniqueness to new projects that fast implementation is inhibited by long-running special design reviews.
- Each project has a highly customized pipeline configuration or, worse, continuous integration and delivery jobs are hand-crafted every time.
- A significant percentage of application infrastructure is built by hand, whether through a console or hand-written infrastructure-as-code (such as Terraform).
How much easier would it be if you could clone a tried-and-trusted boilerplate example and be running within minutes?
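A minimal sketch of the cookie-cutter idea, assuming pipelines are stamped out from one shared template with a deliberately small set of allowed overrides. The template contents, the stage name `standard-immutable-deploy`, and the default commands are invented for illustration and do not correspond to any particular CI system’s format.

```python
from string import Template

# One standard pipeline shape for every project (hypothetical format).
PIPELINE_TEMPLATE = Template("""\
project: $name
stages:
  - build: $build_command
  - test: $test_command
  - deploy: standard-immutable-deploy
""")

# The only knobs a project may turn; everything else is standardized.
DEFAULTS = {"build_command": "make build", "test_command": "make test"}


def render_pipeline(name, **overrides):
    """Render a project's pipeline, rejecting non-standard settings."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unsupported settings: {sorted(unknown)}")
    settings = {**DEFAULTS, **overrides}
    return PIPELINE_TEMPLATE.substitute(name=name, **settings)


print(render_pipeline("billing", test_command="pytest"))
```

Rejecting unknown settings is the design choice that matters here: it keeps new projects from quietly becoming snowflakes, and any genuinely new need gets added to the shared template where every project benefits.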
Projects get derailed because of “Audit Time”
If your infrastructure is built in an ad-hoc fashion, then there is likely a lot of variability. With high variability comes high risk that something is different in a bad way, and volume of infrastructure drives up complexity and permutations.
I’ve seen cases where hundreds if not thousands of tickets are created for infrastructure teams to generate reports, perform diagnostics, and perform emergency fixes. All of this additional work is incredibly debilitating to the team and it can drag on for literally months.
Wouldn’t it be better to cookie-cutter everything, and then simply provide proof of how that is done to the auditors? I’m simplifying slightly, but not that much.
Changes worked fine in pre-prod, but fail in production
That’s not to say that multiple environments are a great thing, but suppose that’s what you have. It may be that things seemed just fine as changes were promoted towards production… but then things go horribly wrong as soon as the change is released to customers.
In order for validation and testing to mean anything, it’s vital that environments and artifacts mimic what will be seen in production as closely as possible.
I’ve seen cases where they are not, and the consequences are sometimes unpleasant. For example:
- The platform is different between pre-prod and prod. Perhaps pre-prod is in the cloud, yet production is in the data center, or maybe one is deployed in containers and the other on dedicated hosts.
- Artifacts are different between pre-prod and prod, maybe because code gets built from source prior to being deployed in pre-prod, and then built again prior to being deployed in production. This absence of immutability can cause nasty surprises, since the underlying build tooling or libraries could change between builds.
- Infrastructure is different between pre-prod and prod, such as the way that servers are built or configured. Some divergence is okay (such as fine-grained tuning of scaling parameters or instance sizes), but it’s not okay for there to be any divergence in things like OS version, system configuration, or patch and package levels between environments.
This is another area where immutability is essential.
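One way to keep an eye on this is to compare environment descriptions and flag anything that differs outside an agreed set of tunable settings. The sketch below assumes each environment can be summarized as a flat dictionary; the key names and version strings are made up for the example.

```python
# Settings that are allowed to differ between environments (assumed list).
TUNABLE_KEYS = {"instance_size", "min_instances", "max_instances"}


def find_drift(pre_prod, prod, tunable=TUNABLE_KEYS):
    """Return {key: (pre_prod_value, prod_value)} for unexpected differences."""
    drift = {}
    for key in sorted(set(pre_prod) | set(prod)):
        if key in tunable:
            continue  # scaling and sizing may legitimately differ
        a, b = pre_prod.get(key), prod.get(key)
        if a != b:
            # A key missing on one side shows up as None.
            drift[key] = (a, b)
    return drift


pre_prod = {"os_version": "22.04", "openssl": "3.0.2", "instance_size": "small"}
prod = {"os_version": "20.04", "openssl": "3.0.2", "instance_size": "large"}
print(find_drift(pre_prod, prod))  # {'os_version': ('22.04', '20.04')}
```

With immutable images built once and promoted unchanged, this check should always come back empty, which is exactly the kind of evidence that keeps confidence in pre-production high.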
Perhaps the worst thing about this is that confidence in pre-production is lost over time, and it’s easy for control-based thinkers to add manual touch-points and process in order to regain trust. Of course, in reality, this only treats the symptoms and will usually slow down delivery further.
Much of this article recommends a drive towards standardization and simplification. I’ll cover process and controls next, which also have roots in standardization and simplification.