Delivering ops code quality

Ops code as an open source package

Let’s elevate operations code to industrial scale and quality. Take infrastructure as code, and add open source principles to create a shared body of operations code that works every time, every place, on every architecture, under every load.

Operations code is the last place where we reinvent the software wheel, every time. When a team sits down to automate MySQL, they are recreating a wheel that is already turning in a thousand other places.

Let’s make the best wheel and share it. Let’s make perfectly reusable ops code packages.

“apt-get it and move on”

The operator pattern is about packaging operations code so it can be reused, distributed and run everywhere, to reduce the duplication and errors associated with needless reimplementation. On Kubernetes, the operator pattern is gaining credibility as a way to address the fact that configuration management doesn’t work for containerised applications.

But the operator pattern is a universal idea, and we can use it everywhere — from legacy applications on bare metal, to virtual machines running Windows apps, to cloud instances, and the shiny new Kubernetes estate that is rapidly starting to sprawl to every corner of the compute landscape. This project, the Juju Operator Lifecycle Manager, brings operators to every class of enterprise software, not just Kubernetes.

The result is much better operations on the whole estate. Here’s why.

Reusable code improves quickly

The more people use a package, the more bugs they shake out. Reuse exercises code in different ways, which surfaces issues in design and implementation. Homegrown operations code is often full of race conditions because “those particular systems always start up that way”, or “these particular systems always come up first”.

Reusable operations packages get engineered to deal with every combination of fast, slow, first or last, because there are lots of people exercising them in lots of different circumstances. Even better, they get the benefit of proper distributed systems patterns, practices and services, which make it much more likely that they are always correct, and fail gracefully when it’s not possible to do what you expected. Occasionally, stuff happens, especially on the public cloud, but it should be handled well.
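To make the ordering point concrete, here is a minimal sketch of ops code that does not care which system comes up first. It is not taken from any real operator: the `AppState` fields and the `reconcile` function are hypothetical names chosen for illustration. The idea is simply that the code looks at where things are now and moves one step closer to where they should be, so it can be called in any order and any number of times.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AppState:
    """Hypothetical snapshot of what the ops code knows right now."""
    database_url: Optional[str] = None  # unset until the database is reachable
    config_written: bool = False
    log: List[str] = field(default_factory=list)


def reconcile(state: AppState) -> str:
    """Move towards the desired state from wherever we are.

    Safe to call repeatedly and in any order: it never assumes the
    database came up before (or after) the application.
    """
    if state.database_url is None:
        # Dependency not there yet: report it and wait, rather than crash.
        state.log.append("waiting for database")
        return "waiting"
    if not state.config_written:
        state.log.append(f"writing config for {state.database_url}")
        state.config_written = True
    return "active"
```

Calling `reconcile` before the database exists returns "waiting"; calling it again afterwards writes the config exactly once, however many more times it runs.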

In open source, we say that “many eyes make all bugs shallow”. That’s why we want to build open source packages of ops code for everything that’s interesting to operate. We want fetching and using operations code to be as safe, and as easy, as fetching a new package or security update with apt-get on your server.

Continuous automated testing

Good software has good tests and lots of test runners. In our best open source projects, we build and test the code for every change, on a wide array of platforms, automatically. In truly great projects, code has to pass all of those tests before it can land at all. In an ideal world, you can release at any time, because your development branch passes all tests all the time.

If only operations code were that rigorous!

And it can be.

When we build operators, we are building real code. We use real software engineering languages, real frameworks, real test harnesses, and real bug trackers. That makes it possible to be serious about testing. Unit tests, integration tests, and full-bore deployment tests.
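As a small illustration of what a unit test for ops code can look like, here is a sketch using Python’s standard unittest module. The `render_config` function and its “key = value” format are made up for this example; a real operator would test its own logic in the same way.

```python
import unittest
from typing import Dict


def render_config(settings: Dict[str, object]) -> str:
    """Render settings as sorted 'key = value' lines (a hypothetical format)."""
    if settings.get("max_connections", 1) < 1:
        raise ValueError("max_connections must be at least 1")
    return "".join(f"{key} = {value}\n" for key, value in sorted(settings.items()))


class RenderConfigTest(unittest.TestCase):
    def test_output_is_deterministic(self):
        # The same settings must render identically whatever their order,
        # so a restart is only triggered by a real change.
        a = render_config({"port": 3306, "bind_address": "0.0.0.0"})
        b = render_config({"bind_address": "0.0.0.0", "port": 3306})
        self.assertEqual(a, b)
        self.assertEqual(a, "bind_address = 0.0.0.0\nport = 3306\n")

    def test_rejects_invalid_values(self):
        # Bad input should fail in the test run, not in production.
        with self.assertRaises(ValueError):
            render_config({"max_connections": 0})
```

Saved in a file, these tests run with `python -m unittest`, on every change, on every platform you care about.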

Efficient updates

Reusable code means that more people can find and fix the problems. But those fixes need to be deployed to have an impact. In the open source world, we use packaging systems to spread good code around, in many cases automatically. The result is that people just have to run “apt-get update” and “apt-get upgrade” to know their system has the latest fixes.

By turning operations code into a first class package, we also create the means to distribute fixes and patches automatically to every place that problem might be in production. Every time a security team finds a gap, every time a performance issue is addressed, every user of that operations package stands to benefit, just by updating their ops code packages.

The real economic benefits of shared operations code don’t all arrive on day zero. It’s the fact that operations keep improving, every day, that really charms the finance folks.

Channels let users decide on risk

Change is always stressful, so it helps if you can decide where those changes will land first. The Juju OLM serves up packages in channels, with an explicit risk assessment — stable, candidate, beta or edge. You probably don’t want edge, but it’s great to run a few deployments on beta or candidate, to see the new stuff coming.
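One way to picture channels is as an ordered list of risk levels, each pointing at a revision of the package. The sketch below is a toy model, not the Juju OLM implementation: the `resolve` function, the revision numbers, and in particular the fallback rule (a risk level with nothing published serves the next safer level) are assumptions made for illustration.

```python
from typing import Dict, Optional

# Least to most risky, in the order the channels are listed above.
RISKS = ["stable", "candidate", "beta", "edge"]


def resolve(published: Dict[str, int], risk: str) -> Optional[int]:
    """Return the revision a deployment following `risk` would receive.

    Assumption for this sketch: if nothing is published at the requested
    risk level, fall back to the next safer level that has a revision.
    """
    if risk not in RISKS:
        raise ValueError(f"unknown risk level: {risk}")
    # Walk from the requested level towards 'stable'.
    for level in reversed(RISKS[: RISKS.index(risk) + 1]):
        if level in published:
            return published[level]
    return None


# Hypothetical revisions: nothing is currently published to beta.
published = {"stable": 41, "candidate": 44, "edge": 47}
```

With these made-up revisions, most deployments would follow stable and stay on 41, while a handful following beta or candidate would see 44 first.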

Progressive releases raise the bar for everyone

When it does come time to cut a release in the stable channel, it’s a little scary to think about changing all of those systems, all at the same time.

But a really great package distribution system doesn’t change them all at once. We use progressive releases to make the fix available in stages, to a wider and wider audience. So if there is a problem, it only affects a tiny fraction of the user base before we freeze the rollout. It’s tough to be the canary, but clever cryptography shares the load, so it’s very unlikely you’ll be the canary twice in one quarter. And without shared packages, without channels, and without progressive releases, you’re the canary all day long.
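The text doesn’t spell out how the cryptography shares the load, so treat the following as one plausible approach rather than a description of what the Juju OLM actually does. The sketch hashes a deployment identifier together with a release identifier to put each deployment in a bucket from 0 to 99. Because the release identifier is part of the hash, the early buckets land on different deployments for each release. The function names and identifiers are hypothetical.

```python
import hashlib


def rollout_bucket(deployment_id: str, release_id: str) -> int:
    """Map a deployment to a fixed bucket in [0, 100) for one release."""
    digest = hashlib.sha256(f"{release_id}:{deployment_id}".encode()).digest()
    # The first 8 bytes are plenty to spread deployments evenly.
    return int.from_bytes(digest[:8], "big") % 100


def should_update(deployment_id: str, release_id: str, percent_released: int) -> bool:
    """True once the rollout has widened enough to include this deployment."""
    return rollout_bucket(deployment_id, release_id) < percent_released
```

A rollout then just raises `percent_released` in steps, say 1, 10, 50, 100, and freezes at the current value if early adopters report a problem. A deployment that was included at one step stays included at every later step.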

Structured languages make better software

Ops code has always been about the quick and the convenient. Shell scripts are potent ways to get small things done quickly. But software has become really complex, and orchestrating it is no small thing. Many organisations are still writing ops code in primitive languages that make sharing and reuse hard, because that code was never going to be shared or reused anyway.

We can.

Tackling the operations of something really big like OpenStack or Hadoop or Kubernetes in Bash or Ansible is horrible. There are too many things to keep track of, too many abstractions to implement, too many components and plugins and options for spaghetti code to be quick or convenient. Python offers just the right level of structure and just the right level of convenience.

And since there are many, many users of the same code, it pays to invest in the right structure and testing. It pays to build the right abstractions. It pays to build communities and share the insights, ideas, costs and capabilities.

Shared libraries increase reuse and reliability

Most large organisations with well established ops teams have a pile of homegrown sauce they can invoke and repurpose. We turn that into a global effort, in a high-level language, with documentation.

There are many patterns in automation and operations that repeat in different software settings. Checking if the system is up to date. Checking if there is a new container in the stream. Driving updates. Saving backups to S3. Retrieving them to restore. While every piece of software has its unique best practices and quirks, there are many, many common behaviours and requirements.

Shared libraries and classes make that easy.

Integration is a very good example. It’s hard to integrate applications, because you have to understand the details of both applications to get it right. So we encourage the author of every operator to publish the libraries you will use to integrate with them. The details are encapsulated so you don’t have to understand them, just consume the objects and events they emit.
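Here is a rough sketch of what such a library might look like from the consumer’s side. It is plain Python, not the real operator framework API: `DatabaseRequirer`, `DatabaseReady`, and the raw key names “endpoint” and “user” are all invented for this example. The point is that the wire format stays inside the library, and the consumer only sees a typed event.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class DatabaseReady:
    """Event emitted once the provider has shared complete connection details."""
    host: str
    port: int
    username: str


class DatabaseRequirer:
    """Hypothetical library published by the author of a database operator."""

    def __init__(self) -> None:
        self._handlers: List[Callable[[DatabaseReady], None]] = []

    def on_ready(self, handler: Callable[[DatabaseReady], None]) -> None:
        """Register a callback for when the database becomes usable."""
        self._handlers.append(handler)

    def relation_changed(self, raw: Dict[str, str]) -> bool:
        """Parse the provider's raw key/value data; emit an event when complete."""
        # Key names and encoding are the provider's private detail (assumed here).
        if not all(key in raw for key in ("endpoint", "user")):
            return False  # provider has not finished publishing yet
        host, _, port = raw["endpoint"].rpartition(":")
        event = DatabaseReady(host=host, port=int(port), username=raw["user"])
        for handler in self._handlers:
            handler(event)
        return True
```

A consuming operator would register a handler with `on_ready` and never touch the raw data. If the database author later changes the encoding, they update the library and every consumer keeps working.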

Open source

Our DNA is open source. There are very good reasons why all operations code should be shipped as source code; it’s important to know what it’s doing on your system. An application can be confined with MAC-based security tools like SELinux or AppArmor, but the operations code that installs it cannot.

In our community, the framework that everyone shares is pure open source. And all the operators we write are open source too. We let vendors make proprietary operators, but they always ship the source, and their integration libraries are open source too. Many eyes and all that.

Community

This work is specialised. There aren’t too many people who have to make big application graphs work, all the time, in all sorts of places. But there are lots of us who want to make interesting application graphs work in the places we need them. And that makes a nice community.

We all benefit from the amazing work that others do, and it’s nice to contribute to something that is changing the face of operations, everywhere. Running lots of back-end software is just too messy. We’re fixing that, and it’s fun to see it getting better, fast. If you like Python, and you like operations that are best in class, and you have a particular application you care about, or a combination of applications that you know how to integrate really well, you’ll like it here. Welcome.