Kubernetes operators – the top 5 things to watch for

by robgibbon on 31 August 2022


Software operators are steadily revolutionising how we deploy and run complex distributed systems. They offer the promise of low-intervention, self-driving software – ideally leading to service reliability gains and better uptime. For an introduction to Kubernetes operators, check out our introductory webinar or download our guide to Kubernetes operators.

Many hundreds of ready-made operators have been published on various platforms like GitHub and charmhub.io – so many that it can sometimes be hard to choose the best operators for your deployment, right? Well fear not – in this blog post I’ll walk you through my top five points of attention, in order to give you some basic guidelines to help you decide. Let’s go!

Provenance

Provenance means “where something comes from”. In software terms, it means the developer, development community or vendor behind a piece of software. My number one piece of advice is to verify the provenance of any software that you are installing, not least software that operates other software and may run with elevated privileges.

It’s not only about the risk of having your service environment compromised by a hacker or APT (advanced persistent threat). There are other material concerns:

What level of support can you expect for the operator? Is it best effort or community support only? If so, will that be sufficient if the operator doesn’t live up to its promises? Bear in mind that software operators are supposed to encapsulate the knowledge and wisdom of a human operator, so their effective and reliable operation will be vital to some of your most critical assets, like database servers and middleware.
Does the developer comprehensively test the operator to ensure that it reliably delivers the functionality it claims to deliver? How do they test it?

Licence

Most modern software is distributed under a software distribution licence of some kind or other. The licence determines what you may and may not do with the software, and what your rights and entitlements are with regard to the software. I recommend first considering software that is licensed with a liberal free and open source licence, for example the Apache License, Version 2.0.

There are a few advantages to open source software operators:

You can inspect the code of the operator you are using – for defects, vulnerabilities, hostile code, telemetry and for non-compliance with the declared licence, any of which could leave your service and your business exposed.
If the developer of the software decides to stop supporting the operator for some reason, you can still potentially support it yourself if you have the necessary know-how, which can buy some time and help assure business continuity.
Usually (although importantly – not always), there are few or no restrictions on where you deploy the operator. For example, if you want to deploy the operator in a test environment and be assured that you will get exactly the same software and behaviour in a production environment, you are more likely to be able to do that with a liberally licensed open source operator than with an operator covered by a more restrictive or wholly proprietary licence.

Competence

I’ve ranked competence as my number three attribute for operators. Competence doesn’t mean “can the operator cover all my requirements”. Instead, by “competence” I mean “does the operator do the things it claims to do competently”?

For example, when asked to choose between an operator for MySQL Server that only manages MySQL high availability versus an operator that can manage high availability and has twenty other features, I will choose depending on the operator’s relative competence at managing MySQL high availability, given this is a critical feature. Get this wrong and you will spend a long time regretting your choice.

One way to understand the competence of an operator is to examine how and under what circumstances the operator is tested by the developer. But perhaps the best way is to thoroughly test the operator yourself, for example using a fault injection framework – like playing a game of Kube DOOM!
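To make the idea of fault injection a bit more concrete, here is a minimal sketch in Python. Everything in it is hypothetical: ToyCluster and ToyOperator are in-memory stand-ins I made up for illustration, not a real Kubernetes API or a real fault injection framework. It only shows the shape of the experiment: break something on purpose, let the operator react, then check whether the one thing it claims to do still holds.

```python
import random


class ToyCluster:
    """In-memory stand-in for a database cluster (not a real Kubernetes API)."""

    def __init__(self, desired_replicas):
        self.desired_replicas = desired_replicas
        self.healthy = set(range(desired_replicas))
        self.next_id = desired_replicas

    def kill(self, replica_id):
        """Injected fault: remove one replica, as a crash or node loss would."""
        self.healthy.discard(replica_id)


class ToyOperator:
    """Hypothetical operator whose only claim is: keep N replicas healthy."""

    def reconcile(self, cluster):
        # Start replacement replicas until the desired count is restored.
        while len(cluster.healthy) < cluster.desired_replicas:
            cluster.healthy.add(cluster.next_id)
            cluster.next_id += 1


def run_fault_injection(rounds, seed=0):
    """Kill random replicas, reconcile, and record whether the claim held."""
    rng = random.Random(seed)
    cluster = ToyCluster(desired_replicas=3)
    operator = ToyOperator()
    results = []
    for _ in range(rounds):
        # Pick between one and three victims from the currently healthy set.
        victims = rng.sample(sorted(cluster.healthy), k=rng.randint(1, 3))
        for replica_id in victims:
            cluster.kill(replica_id)
        operator.reconcile(cluster)
        results.append(len(cluster.healthy) == cluster.desired_replicas)
    return results


if __name__ == "__main__":
    outcomes = run_fault_injection(rounds=20)
    print(f"{sum(outcomes)}/{len(outcomes)} rounds recovered")
```

In a real evaluation the faults would be injected into a live test cluster and the check would be against the actual service (for example, can clients still write to the database), but the loop of inject, wait, verify stays the same.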

Maturity

Maturity is not always a badge of competence, but it can sometimes be an indicator. Operator maturity can generally be determined at three levels:

Ready for publication
Ready for evaluation
Ready for production

Ready for publication

In my opinion, an operator is ready for publication – that is, to be viewed by potential users – when it has the basic attributes of a product. For example: a web page, developer documentation, deployment and operations guide, contribution guide, licence, an issue tracker and developer/support contact details. I’d also like to see an SCM system like Git being used, unit and integration tests present and some kind of automated build pipeline. Being ready for publication is a prerequisite for production readiness, but the operator might not yet implement all of the features needed to make it production ready – or it might not implement them to the level of capability needed to make the operator trusted and reliable when the chips are down.

Ready for evaluation

For me, an operator is only ready for production deployment when it reliably and capably covers scenarios like orchestrating high availability, wire and at-rest encryption and so forth. But even without these more advanced features, there may be features such as the ability to perform application upgrades, or to back up and restore state, which make the operator useful. This is what I’d call “ready for evaluation” – not quite production ready, but still worth a look.

Ready for production

An operator that’s ready for production needs to tick a lot of boxes – it needs to meet all of the criteria of “ready for publication” and “ready for evaluation”, and it needs to offer sufficient features and capability that you will trust it to reliably operate your most critical assets like databases, web farms, authentication directories and middleware. It better be good!
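If it helps, the three levels can be turned into a rough checklist. The sketch below is only my illustration of the criteria described above, not an official scoring scheme: the attribute names, the three sets and the maturity_level function are all made up for this example, and the lists are certainly incomplete (the “and so forth” is doing real work in the text). I have assumed that one useful feature is enough to be worth evaluating, while production readiness needs everything listed.

```python
# Hypothetical attribute names, taken loosely from the criteria in this post.
PUBLICATION = {
    "web page", "developer documentation", "deployment and operations guide",
    "contribution guide", "licence", "issue tracker", "contact details",
    "source control", "unit and integration tests", "automated build pipeline",
}
EVALUATION = {"application upgrades", "backup and restore"}
PRODUCTION = {"high availability", "wire encryption", "at-rest encryption"}


def maturity_level(attributes):
    """Classify an operator by the attributes it demonstrably has."""
    attrs = set(attributes)
    # Publication readiness is the prerequisite for everything else.
    if not PUBLICATION <= attrs:
        return "not ready for publication"
    # Assumption: at least one useful feature makes it worth evaluating.
    if not attrs & EVALUATION:
        return "ready for publication"
    # Production needs all evaluation and all production features.
    if EVALUATION <= attrs and PRODUCTION <= attrs:
        return "ready for production"
    return "ready for evaluation"


if __name__ == "__main__":
    candidate = PUBLICATION | {"backup and restore"}
    print(maturity_level(candidate))
```

A checklist like this only records whether a feature exists. It says nothing about how competently the feature is implemented, which still has to be tested.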

Performance under stress

You’ve got your MySQL Server, OpenLDAP directory and Redis cache all deployed on Kubernetes and managed by operators. It’s great that you’ve got that done – grab a cold one and kick back, right? Well… sure. But bear in mind that you’re relying on these operators to respond competently when the unexpected happens. We’re not talking about the happy path of software operations here; we’re talking about when seven different kinds of chaos happen in the cloud at the same time. If you’re heavily reliant on the expertise and wisdom of the engineering team that developed your Kubernetes operator, the chances are you have less access to that expertise and wisdom onsite in human form. That’s okay: expert DBAs (database administrators) who also know how to efficiently plan and run an OpenLDAP capacity extension whilst simultaneously troubleshooting Redis clustering are as rare as vegan steak tartare. There’s nothing to feel bad about.

But then you need to be confident that your software operators are going to be able to cope when things start to go wrong: that they’ve been comprehensively tested against as many “corner cases” as possible. This is what I mean by performance under stress, and as mentioned earlier in this post, you can also use fault injection frameworks (also known as chaos engineering tools) to verify that yourself before dropping that operator into your production service environment.

Summary

These are the five things to watch out for when considering a Kubernetes operator:

Provenance
Licence
Competence
Maturity
Performance under stress

Check out the charmed operator collection at charmhub.io and – live long and prosper!