High availability operator lifecycle management

High availability OLM for mission-critical infrastructure and apps

To bring the operator pattern to software-defined infrastructure and mission-critical applications, we must ensure availability at every level of the control plane, and in the operators themselves.

The operator pattern is an approach to software operations that was pioneered by Canonical and popularised in the context of Kubernetes. The cloud-native focus on immutable containers signals the end of configuration management as a primary methodology for software ops, and brings the operator pattern to the fore. As a result, there are many implementations of the operator concept on Kubernetes.

Cloud-native workloads have many primitives for availability inherent in Kubernetes. But the operator pattern is just as effective, perhaps even more so, when applied to traditional applications on Linux or Windows machines, whether bare metal, virtualised, or cloud instances. Such traditional or legacy applications require careful operations to achieve enterprise-grade or carrier-grade reliability.

We outline here the approach taken in the Juju Operator Lifecycle Manager (OLM), which supports highly available OLM services across multi-cloud, Kubernetes, VMware and bare-metal operators.

Operators are critical software

In the operator pattern, all application management is handled by operator software. Application installation, updates, upgrades, and configuration are all driven by pieces of operator software that are in turn driven by the operator lifecycle manager.
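
As an illustration, these lifecycle operations correspond to everyday commands against the Juju CLI. A minimal sketch, with the application name and configuration key chosen purely for illustration:

    juju deploy postgresql                        # install: the operator brings up the application
    juju config postgresql profile=production     # configure: the operator applies the change
    juju refresh postgresql                       # upgrade: the operator orchestrates the update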

It is evident, then, that a failure of the operator may have immediate and catastrophic consequences for application performance and reliability, just as a poor configuration management change can. In the most benign case, the lack of availability of an operator will prevent the evolution of the application stack, because configuration changes, scaling and updates will not occur without the operator to drive them.

For these reasons, we consider the operator lifecycle manager and the operators themselves to be critical software, and we take steps to ensure that both are robust by design and in implementation.

Highly available OLM controller

The OLM controller provides operator services to all the operators in a model. It gives administrators the ability to query the model, presents a GUI dashboard, and handles administrative actions through the CLI. It serves agents, and through them operators, with a continuous stream of events and with responses to their updates and queries about model status.
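
Administrators typically reach these controller services through the CLI; a brief sketch (the model name is illustrative):

    juju controllers          # list known controllers and their HA status
    juju status -m mymodel    # query the current state of a model and its units
    juju dashboard            # open the dashboard served for the controller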

If agents are unable to contact the controller, they are unable to update other operators or the administrator about the status of their application units. If administrators are unable to contact the controller, they are unable to check or modify the model to allocate resources, change configuration, undertake operational activities such as resets, backups or restores, or upgrade the operator infrastructure itself.

Clearly the OLM controller is the single most critical component in the architecture: it is systemic to all models hosted on that controller, and a failure may impact every application unit in all of those models, across all of the clouds or Kubernetes clusters driven by that controller. It is a central control plane of control planes and, as such, the primary determinant of long-term system dependability.

Clustered Juju controllers use consensus protocols

The Juju OLM provides a highly available Operator Lifecycle Management function. The resilience of the Juju controller is a design feature, and is not dependent on particular configuration or supporting software. In other words, availability is not bolted on after the fact using failover mechanisms, it is inherent in the design of the OLM controller and agents, and the communications protocols between them.

The OLM controller can be clustered for resilience. Conventionally, a cluster of three controller instances is considered highly available. An internal Raft leader-election protocol ensures that a single controller instance in the cluster is designated the leader.

Automatic leader election

In the event of a failure, consensus algorithms determine the quorate controller set. If necessary, a new leader is elected, with no administrative response required.

A controller that has been dropped from the consensus set may return to service, in which case it rejoins the cluster automatically. Transient failures thus require no admin intervention and do not result in a long-term loss of resilience.

Automatic quorum maintenance

The Juju OLM is capable of provisioning new machines and containers. This capability is essential to the OLM as a resource manager. In the normal mode of operation, the Juju OLM is deployed onto such an elastic substrate — container or machine — with a credential to further provision capacity for operators and their applications.

The Juju OLM can therefore also provision capacity for additional instances of the controller itself.

The enable-ha command configures the OLM controller to create a high availability cluster of itself, and to maintain a level of resilience even in the event of instance failures. Necessary dampeners prevent over-enthusiastic auto-scaling of the controller. The mechanism is designed to ensure that a failure during a period of low administrator attention (long weekends away from connectivity!) does not fundamentally compromise resilience for the remainder of that period.
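
Enabling high availability is a single command; a minimal sketch using the conventional cluster size of three:

    juju enable-ha -n 3     # grow the controller into a three-instance cluster
    juju show-controller    # inspect the controller machines and cluster state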

Agent awareness

Agents can connect to any of the active controllers in the cluster. After connecting to any one controller, an agent becomes aware of all the other members of the cluster. It suffices, then, for a restarting agent to reach any single member of the cluster in order to become fully enmeshed with all the current quorate members. This is important where machines fail for an extended period and need to re-establish contact with the controller after a disturbance that has made some of the previously known controller addresses unavailable.

In the event of a change in the controller cluster, agents are informed of the change, and remember it for future restarts. In this way, a rolling controller restart that changes IP addresses preserves continuous availability even if agents are restarted at the same time.
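
As an illustration, each agent persists the controller API addresses it has learned in its local agent configuration on disk. A hypothetical check on a deployed machine (the agent directory name varies per machine; 17070 is the controller API port):

    # Show the controller API endpoints this machine agent currently knows about
    grep -A 3 apiaddresses /var/lib/juju/agents/machine-0/agent.conf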

Highly available operators

Operators themselves may have time-critical functions to perform. In the telecommunications industry, the operator may encapsulate network element management capabilities alongside the traditional lifecycle management function. Such element managers often incorporate clustering and availability techniques.

Operators themselves can be scaled out with the Juju OLM. In part, this comes from the close association of operators with individual application units: the operator is installed alongside every application unit in a machine context, and as a sidecar in every application pod in a container context. This scale-out ability for operators is unique to the Juju OLM, and part of its focus on availability and reliability in mission-critical environments.
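
A sketch of that scale-out from the CLI (application name illustrative); every unit added brings its own operator with it:

    juju deploy mysql -n 3    # three application units, each paired with its own operator
    juju add-unit mysql       # scaling out adds a further unit and operator together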

An operator can run real-time high-availability clustered software to monitor its application, and therefore react almost instantaneously to application availability or performance concerns. This of course depends on the ability to coordinate between operators in the cluster, which Juju facilitates through its peer relation data exchange mechanism.
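
The data exchanged over such relations can be inspected from the CLI; a brief sketch with an illustrative unit name:

    juju show-unit mysql/0    # output includes the data this unit shares over its relations, peers included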

In summary, the Juju OLM is designed for high availability at the operator level as well as the controller level, enabling mission-critical workloads to be autonomous with highly available control mechanisms.

Model migration for controlled OLM upgrades

Juju provides model migration as an ‘aerospace grade’ approach to OLM upgrades in the operator estate.

The OLM controller itself is highly available. Nevertheless, because it is a control plane of control planes, it represents a critical capability in the overall system that warrants additional mechanisms for reliability. Model migration is a very carefully designed process that ensures continuous model operability even while institutions regularly upgrade their OLM services to benefit from new features.

Upgrades of the controller can be done in place. During the upgrade procedure, there is a period when the controller is upgrading database structures, and this represents a high-risk moment for all of the models hosted on that controller. A failure at that moment would put every model on that controller at risk.
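
For reference, an in-place upgrade of the current controller is a single command (a sketch; exact behaviour depends on the Juju version in use):

    juju upgrade-controller    # upgrade the controller, and its database structures, in place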

Model migration avoids exposing a large number of models to a single transition at the same time.

Instead of in-place upgrades of a controller hosting many models, a new controller of the new version is deployed alongside the existing controller. Models are then migrated, one by one, to the new controller. At no stage are all the models simultaneously in a state of transition, reducing the impact of any problem associated with the new controller version.
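
A sketch of this migration pattern with the CLI, using illustrative cloud, controller and model names:

    juju bootstrap aws new-controller      # stand up a fresh controller at the new version
    juju migrate mymodel new-controller    # move one model across; repeat model by model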

Migration itself maintains continuous availability for the model.

At the outset, agents are informed of the proposed migration. Agents are able to verify that they can connect to the new controller cluster as well as the old one. The data structures of the model are then copied to the new controller. During this process, the data structures for that model alone are upgraded to reflect the new capabilities of the new controller version.

At this stage, agents can see both the model on the old controller and its copy on the new controller. They are able to verify that their behaviour will not suddenly change if they cut over to the new controller. When all agents in the model have confirmed this, the controllers agree on the handover and the model is considered migrated.