Redefining Application Management

Enterprise software ops are costly and messy.
Let’s fix that with a universal operator pattern.

Mission statement

Our mission is to transform the quality, cost and speed of software operations by creating a community of operations practitioners, open source leaders and vendors that delivers:

Perfectly reusable ops and integration code.

Operations code is traditionally hand-crafted for each organisation that runs a particular application. Our goal is a global repository of reusable operations code for every significant application, which works on every cloud and on every kind of private infrastructure — physical, virtual or containerized.

It is important that integration code is part of our mission. Integration code is traditionally the most tightly tied to a particular scenario in a particular organisation, and it creates the most friction when upgrading or introducing new components.

It’s also important to emphasize that our goal is perfect reuse. We want to transform operations code from a home-grown artefact into open source packages that can be published, consumed, distributed, shared and updated just like apps on your phone or packages on Linux. Not templates, examples or tutorials, but production-grade ops code that works everywhere.

When we deliver the ops code for a piece of software, our mission is to deliver it with the right design, architecture and implementation so that it works for every user, in every environment they operate.

Why this matters

Maintenance is very expensive in a large enterprise estate. It is a drain on resources better spent on innovation, differentiation and evolution.

That estate is a sprawling complex of platforms and software layers, across private and public infrastructure. The software spans every layer in the stack, from the lowest level of infrastructure (Linux, Windows, VMware, OpenStack, Ceph) to the most abstract (serverless and containerized applications), and even SAAS, from a huge number of vendors. There is no single unifying principle or organising system for that software.

Many estates share the same applications, but they do not share operations and integration code. Every time a business selects a new piece of software, it must create custom operations code and custom integration code to fit that application into its existing operations.

Every new piece of software requires new, deep understanding inside the operations team in order to deploy and integrate the new component effectively. By definition, the component is new to the team. They have not acquired deep familiarity with the software precisely because it is new. Such operations code tends to be shallow, missing subtle race conditions or performance aspects, until it has spent years being tweaked and fixed.

The result is a tangled mess of operations code that is only ever used inside one business, perhaps even just within one part of that business, by a small and overworked group of operations engineers.

Legacy integration and operations code makes it difficult to drive change in an enterprise technology stack. Every change requires detailed consideration of the rippling consequences in custom operations code, most of which was not written by the current team.

Our community exists to change this.

Reuse is the key to ops code quality

The root cause of the problem is the inability to share ops code across many organisations. When teams and organisations can share operations code perfectly, they also share the insights underneath that code, and the fixes to problems in that code. If operations code were shared perfectly, then updates could also be shared perfectly, and we would gain the same pace of innovation and acceleration in operations that we have witnessed in applications over the past decades.

Even better, if this operations code is open source, then improvements and innovations can reflect the experience of users as well as publishers, and elements of operations code that are not specific to a particular application can also be shared across many different pieces of the estate.

This represents a profound new vision of operations code, transforming it from a home-grown artefact into a global class of software that evolves and improves in exactly the same way that applications do. True reuse of operations code is a leap forward in the operations profession. In this vision, operations code becomes a package which is distributed globally, with a stream of updates, upgrades and security fixes, just like a deb.

Beyond config management

For the past two decades, people have tried to solve this problem with configuration management. We have seen waves of ‘language wars’ as practitioners embraced Puppet, Chef, Ansible, Salt and more. There is not much difference between them, conceptually. They all ensure that the contents of specific files are the way the organisation decided they should be.

Unfortunately, configuration management attacks the wrong part of the problem. By focusing on the desired state of configuration files, these tools trap organisations in the idea that the organisation itself should know every detail of every configuration file. The great totem pole of institutional config management exists to ‘specify the precise contents of every configuration file in the building’. This, of course, makes that totem pole unique to the organisation, and requires low-level configuration knowledge in the ops team for every single application the organisation operates.

SAAS experience, on prem

No wonder CIOs are moving to SAAS as fast as possible. Adopting SAAS means that the organisation does not need low-level configuration management knowledge internally, in order to benefit from an application. Getting config files right is the SAAS provider’s problem — the business just enjoys the service.

Our goal is to deliver the SAAS experience for on-premise applications too.

Configuration files should not be the way we express our desired state. Configuration files should be the output of something much smarter. A business wants to make high-level, strategic decisions, just as they do with SAAS, and they want to have all of the underlying details handled by someone else, or something else. The Juju operator lifecycle manager, and the Open Operator Collection, deliver on that vision.

Application management

Think of a particular application in your business. It is probably many pieces of software, integrated together, in a scenario. There are probably development, staging, test, and production scenarios that are similar but not exactly the same.

The business decisions that are important are:

  • Where should the application run?
  • Which software components should be included in the scenario?
  • What resources should be allocated to the scenario?
  • How should that scenario be integrated into the wider estate?

These questions are about application management, not configuration management. The answers to these questions are the business intent of the scenarios, and far more important than the details of a particular configuration file.

It is of course necessary for something to translate business decisions to configuration files. Traditionally, operations engineers translate business intent into configuration and deployment scripts based on their understanding.

But configuration management is a very poor place to express business intent. Configuration management code doesn’t express that intent clearly at all, only the tactical configuration decisions of the operations engineers. No matter how hard you work to understand the totem pole of a Puppet master server in a large organisation, it may be difficult to discern the actual intention of the business, because all you see are a million configuration file details.

Instead, in the Juju OLM we make business intention the focus of the system and we use technology to take care of all the details that flow from that intention. When using operators with the Juju operator lifecycle manager, your entire experience is the scenario that you are composing out of applications. You decide where those scenarios will run, which applications are integrated in the scenario, the nature and amount of compute and storage allocated to the scenario, and any external systems integration such as centralised logging, monitoring and alerting.
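
As a hedged sketch of how those decisions might be expressed with the Juju command line (the model, application and cloud names here are illustrative, and the integration command varies with Juju version: juju integrate on recent releases, juju relate or juju add-relation on older ones):

```bash
# Illustrative only: business intent expressed as a model of applications
juju add-model checkout aws/eu-west-1                # where the scenario runs
juju deploy mysql --constraints "cores=4 mem=16G"    # which components, with what resources
juju deploy my-webapp
juju integrate my-webapp mysql                       # how the components are related
juju integrate my-webapp prometheus                  # integration into the wider estate
```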

Model-driven operations capture business intent

Juju is different to other operator lifecycle managers because it captures business intention in a model of applications and resources, which drives the behaviour of those operators. This high-level application modelling makes operators easier to use at scale in an enterprise.

The model is associated with a substrate — such as a VMware cluster or a public cloud or a Kubernetes cluster — which determines where those applications will run. The model reflects further business decisions too — which applications, what integration between them, how much CPU or disk, and which external systems to connect.

A business scenario may be made up of several models, each on a different substrate. This enables the business to place applications across multiple different compute environments. A single scenario may be spread across models on two public clouds, a private cloud, a Kubernetes cluster, bare metal servers and a mainframe. That would be an unusually diverse scenario, but it is no more difficult to express than a simpler one, such as legacy applications on machines integrated with containerized applications on a Kubernetes cluster.
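
As a hedged sketch of how such placement might be expressed (the model, cloud and application names are hypothetical), cross-model relations allow an application in one model to offer an endpoint that an application in another model, on a different substrate, consumes:

```bash
# Hypothetical example: a database on a VMware model, consumed by a workload
# running in a model on a Kubernetes cluster, within the same scenario.
juju add-model data my-vmware-cloud
juju deploy mysql
juju offer mysql:db                      # publish the database endpoint

juju add-model web my-k8s-cloud
juju deploy my-webapp
juju consume admin/data.mysql            # pull the offer into this model
juju integrate my-webapp mysql
```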

The scenario captures all four aspects of business intention — which applications, on which substrates, with what capacity allocation, and what extended systems integration to the rest of the business. As such, Juju brings business thinking to the world of application management, making it much easier to have conversations between teams about the estate, and much easier to evolve capacity allocation over time.

Crucially, Juju operators do not make business decisions such as resource allocation; they follow the business decisions expressed in their model. This separation of technical and commercial concerns means the operator is reusable in very different business settings. Reuse enhances the community value of the operator and increases software quality.

Beyond lifecycle, to everyday maintenance

Maintenance activities such as backups, restores, health checks, compliance checks, resets and application-specific event handling are expensive aspects of estate management in the enterprise.

In addition to lifecycle and integration, the Juju OLM supports daily operations and maintenance in a structured and safe manner. Each operator declares a set of maintenance activities, called actions. Administrators with appropriate permissions invoke those actions in order to carry out maintenance. In most cases, remote access to machines and containers is not granted to administrators.
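
As a hedged illustration, an operator for a hypothetical database charm might declare its maintenance activities along these lines (the action names and parameters are examples, not the interface of any published charm):

```yaml
# actions.yaml (illustrative): maintenance activities exposed by the operator
backup:
  description: Create a consistent backup of the database.
  params:
    target:
      type: string
      description: Location where the backup archive should be stored.
restore:
  description: Restore the database from a previously created backup.
  params:
    backup-id:
      type: string
      description: Identifier of the backup to restore.
```

An administrator with the appropriate permissions would then trigger the activity remotely, with a command along the lines of juju run mydb/0 backup on recent Juju releases (juju run-action on older ones), rather than logging in to the machine or container.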

Many maintenance activities require deep knowledge of the application. In some cases, this can be subtle. Is it better to back up a database from a read-only replica, or from the read-write server? Does application failover terminate a backup in progress? The simple act of backing up a database can take years to master in high-pressure, high-load environments.

Distilling this knowledge into a reusable, repeatable, shared artefact is priceless. It enables diverse industries to collaborate on the most efficient and most reliable way to run that application, in very different scenarios. Today, every organisation has to develop this knowledge from scratch and encode it in its own operations codebase.

Just as millions of people benefit from the knowledge that has been encapsulated in the Linux kernel, without ever having to understand how it works, so Juju allows institutions to benefit from deep and varied operational experience without ever having to study that subject themselves.

Beyond Kubernetes

Kubernetes is only the latest in a long list of transformations which promised to simplify application management and operations.

Kubernetes is a rich and capable substrate, but it is not simple. Kubernetes enables a wide variety of patterns and process management capabilities which need to be managed. YAML takes the place of /etc/ configuration management in the Kubernetes world, but the underlying story remains the same — all the YAML that you are expected to write does not actually express your business intent.

Clearly, Kubernetes is a critical new class of infrastructure, and we can expect many applications to work best on it. Any solution to the challenge of enterprise application management must address the Kubernetes space.

But we can also expect a large portion of the application estate never to run on Kubernetes, so in order to fix enterprise application management, we cannot limit our thinking or our tools to those which depend on Kubernetes. In fact, just as prior waves of infrastructure innovation created sprawling new estates which needed to be integrated into the existing estate, we expect Kubernetes to do the same.

Universal approach across all classes of software

The operator pattern has been popularized in Kubernetes, but the idea itself is much bigger than Kubernetes. Software which encodes knowledge and controls elements of a complex integration is a profound and universal idea, equally relevant in the legacy estate and in the new Kubernetes world.

Juju brings the operator pattern to traditional, machine-based workloads. Any traditional application which is installed on a machine, whether Linux or Windows, can have an operator that handles not only its lifecycle but also integration with other workloads. Operators for traditional software are essentially long-lived ‘installers’ that deploy their application, upgrade it, manage configuration and provide a lightweight monitoring function for status.

Juju supports Ubuntu, CentOS, RHEL and Windows application software lifecycle management. Python is generally used for the operator itself, although Juju is language-neutral in its underlying message and event transports.

On Kubernetes, operators run in containers. They drive their application workload on Kubernetes from their own pod, or they can be placed inside the application pod, in a sidecar container. Again, Python is the preferred language for operators on Kubernetes.

Both x86 and ARM architectures are supported too. In an increasingly heterogeneous world, it is important to be able to deploy large topologies of software in mixed fashion, with some components on ARM and some on x86.

Running the same workload on different substrates — machine or container, x86 or ARM — suggests duplicate operators and complex maintenance. In fact, the logic of the application lifecycle is very similar on different substrates, and many of the common abstractions developed for the first target substrate apply directly to the others. Clean Python libraries provide reusable abstractions of the application. Multiple operator builds are usually produced from the same source tree, just as the same source code tree for an application generates binaries for Windows, macOS and Linux.
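
As a hedged sketch of what this looks like in practice (the exact schema depends on the version of the charm build tooling), a single charm recipe can declare the bases and architectures it builds on and runs on, so that one source tree produces a build for each target:

```yaml
# charmcraft.yaml (illustrative): one source tree, builds for both x86 and ARM
type: charm
bases:
  - build-on:
      - name: ubuntu
        channel: "22.04"
        architectures: [amd64]
    run-on:
      - name: ubuntu
        channel: "22.04"
        architectures: [amd64]
  - build-on:
      - name: ubuntu
        channel: "22.04"
        architectures: [arm64]
    run-on:
      - name: ubuntu
        channel: "22.04"
        architectures: [arm64]
```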

So the first part of our mission is to develop operators for applications on all the different kinds of substrate — machine or container — where it makes sense to run them. Operators for an application can handle both kinds of substrate if that is how the user community wants to run the software, or they can specialize in container or machine substrates if the workload is native to those exclusively.

A community forms around an operator, sharing experience and requirements, ensuring that the breadth of scenarios supported by the operator matches the range of deployment types and configurations used in practice. High availability, scale out, upgrades, and status are all distilled into the operator, reflecting best practices in security, performance and resilience from experts in that application.

Everyday operations — backup, restore and maintenance activities — are also distilled into the operator, taking care to handle sophisticated situations.

The result is simply the best way to run the application. No direct access to machines or containers is needed because maintenance can be performed remotely, through the operator.

Integration as a first class design element

The second part of our mission is to facilitate rich integration between operators.

Enterprise software is integrated. This is the primary difference between consumer software (‘the app on your phone’) and enterprise software. It is the integration code between applications that glues particular versions of software in place in the legacy estate, and prevents fluid evolution to new versions or substitute components over time. The legacy estate ossifies because it is too expensive to maintain that custom integration code.

Our community solves this problem and makes integration code a first class element of the Juju operator lifecycle management system. We don’t manage applications in isolation; we model rich application graphs which reflect the real integration of all the different applications in the estate.

Integration code is packaged in the operator, shared and reused in multiple different businesses and multiple different scenarios. That integration code is structured to allow substitution of components that perform standard functions. Several different operators could offer MySQL to an application: for example, Amazon RDS, MySQL itself and MariaDB. They all present the same interface, so applications can be integrated with any of them transparently.
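
A hedged sketch of how this substitution is expressed: the application charm declares the interface it needs rather than a specific provider (the names here are illustrative):

```yaml
# metadata.yaml (illustrative): the application depends on an interface,
# not on a particular database implementation.
name: my-webapp
summary: Example web application that needs a MySQL-compatible database.
description: |
  Any operator that provides the same 'mysql' interface can satisfy this
  relation, so the database behind the application can be substituted
  without changing the application charm.
requires:
  db:
    interface: mysql
provides:
  website:
    interface: http
```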

The design of the Juju OLM is careful to reflect the need for evolution over time. No initial implementation will be perfect, and it is necessary to allow new operator versions to co-exist with older versions in running scenarios, as they improve over time. The protocols used by operators in their integration communications are deliberately designed to allow loose coupling, substitution and evolution on either side of an integration relationship.

Operator design, quality assurance and testing

It should be clear that operators are serious software.

Operators carry responsibility for executing business decisions, often in mission-critical environments. They must react to events — allocations of resources, scaling up or down, reconfiguration, upgrades, integration, maintenance — and they must do so in distributed systems where there are multiple events happening at the same time across the scenario. They must coordinate changes between applications, sometimes across diverse substrates, sometimes even across diverse cloud regions and cloud providers.

The Juju OLM is carefully engineered to avoid race conditions and deadlocks between operators, even in extreme environments where software is scaled out to thousands of machines and containers. Just as Go changed the conventions of concurrent programming to improve the developer experience, the Juju OLM steers operator events to avoid clashes between multiple operators from diverse vendors, all working in the same model. Nevertheless, it is important that operator code be written to a high standard.

So the third aspect to our mission is to enable software quality assurance for operators. Testing, continuous integration, inspection, debugging, observability and traceability are all critical capabilities in distributed systems design and development.

The Python Operator Framework provides architectural scaffolding for consistent, high quality operator development across the software ecosystem. It is specifically designed to reflect the distributed and event-driven nature of application management in a large, living estate. It also provides many common mechanisms for typical operator needs, across both container and machine estate.
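
As a minimal sketch of that scaffolding (the application name, configuration option and event handling are illustrative rather than a real published charm), an operator built with the framework observes events from the model and translates them into changes to the workload:

```python
# src/charm.py (illustrative): a minimal operator built with the
# Python Operator Framework.
from ops.charm import CharmBase
from ops.main import main
from ops.model import ActiveStatus, MaintenanceStatus


class MyWebappCharm(CharmBase):
    """Operator for a hypothetical web application."""

    def __init__(self, *args):
        super().__init__(*args)
        # Subscribe to lifecycle events delivered by the Juju OLM.
        self.framework.observe(self.on.install, self._on_install)
        self.framework.observe(self.on.config_changed, self._on_config_changed)

    def _on_install(self, event):
        self.unit.status = MaintenanceStatus("installing workload")
        # ... install packages, or prepare the workload container ...
        self.unit.status = ActiveStatus()

    def _on_config_changed(self, event):
        # Business intent arrives as model configuration; the operator turns
        # it into concrete configuration for the application.
        port = self.config.get("port", 8080)
        # ... render configuration files, restart services, and so on ...
        self.unit.status = ActiveStatus(f"serving on port {port}")


if __name__ == "__main__":
    main(MyWebappCharm)
```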

Reducing the cost and complexity of operator design and maintenance is important to broaden the availability of enterprise-grade operators. Consistency in operator architecture makes it easier for developers to collaborate on operators for diverse workloads. As always, reuse of code improves quality and reliability.

Mechanisms for testing are essential for high quality code. The Python Operator Framework enables unit testing at the code level. We also respect conventions for real-world functional and acceptance tests.
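
As a sketch of what code-level unit testing can look like with the framework's test harness, assuming the illustrative MyWebappCharm above:

```python
# test_charm.py (illustrative): unit-testing an operator with ops.testing.Harness
from ops.testing import Harness

from charm import MyWebappCharm

# Hypothetical config schema matching the charm sketch above.
CONFIG_YAML = """
options:
  port:
    type: int
    default: 8080
"""


def test_config_changed_sets_active_status():
    harness = Harness(MyWebappCharm, config=CONFIG_YAML)
    harness.begin()
    harness.update_config({"port": 9090})
    assert harness.charm.unit.status.name == "active"
```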

As a community, continuous integration and testing of operators across a range of substrates and scenarios ensures a high standard of quality in the portfolio. We drive automated tests of operators across the entire ecosystem as a shared project, to raise the quality of operators for all users.

Community driven operators

Operators reflect real-world experience from people responsible for applications in a wide range of situations. An open community process and a preference for open source operators result in faster improvement, richer functionality, wiser operator behaviour and better integration than individual vendor efforts. We are of course open to vendor participation and even vendor leadership of operators, but we are committed to broad-based participation because it delivers the highest quality total experience.

Our community values are encoded in the Open Operator Manifesto, which captures the key ingredients that underpin our ecosystem. Our focus on security, reuse, universality, maintenance, composition and integration, and on the community structure that drives quality and completeness, is reflected there.

The Juju OLM is language-neutral, but we encourage the use of common libraries to improve code reuse and to simplify integration between operators from diverse vendors and communities. Choosing a common language means that code can be shared directly between operators, reducing the cost of implementation and the risk of errors that comes with reimplementing the integration protocols designed by operator publishers. The Python Operator Framework has been designed to optimise for code sharing across the community.

The fourth practical element of our mission is to build a shared community culture and practice, focused on the needs of software operations practitioners, with appropriate governance and mechanisms to bring the world’s expertise to bear on the shared problem of application management.

This community includes common code, conventions, design discussions, code release best practices, security standards and a code of conduct. There will be many operators for a given workload; we believe that those which come from a well-run, open and collaborative community will deliver the best total value, because they draw on a diverse base of perspectives.

Distribution and updates from CharmHub

Operators as a concept can be implemented in many different ways. In the Kubernetes operator pattern, an operator is a container. The Juju OLM supports operators on Kubernetes and on machine substrates like IAAS, VMware and bare metal, with an appropriate delivery mechanism for the operator in each case. On Kubernetes, a container hosts the operator. On machines, the operator is installed as an application.

Across all substrates, the package of the operator is called a charm, and charms can be shared, distributed and updated just like debs or rpms. Juju translates the charm to the appropriate format for a particular substrate.

The Juju OLM can accept locally-developed operators, and it can pull them from a global distribution system for the Open Operator Collection, called the CharmHub. When an operator has been pulled from the CharmHub, it becomes easy to pull updates to that operator just as you would update the packages of the operating system or application itself.

In the CharmHub, operators are published in semantic channels, making stable, candidate, beta and edge versions accessible on demand. Progressive releases ensure that unexpected quality defects impact the smallest possible number of scenarios. A channel combines a track, which usually corresponds to the major version of the application, a risk level that signals stability, and optionally a branch, which can be used to distribute very specific fixes or features for testing.
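
For illustration (the charm name and track are hypothetical), a channel is written as a track and a risk level, optionally followed by a branch, and both initial deployment and later updates name the channel explicitly:

```bash
# Illustrative only: deploying and updating an operator from CharmHub channels
juju deploy mysql --channel=8.0/stable              # track 8.0, stable risk level
juju deploy mysql --channel=8.0/edge                # latest development builds on the same track
juju deploy mysql --channel=8.0/candidate/fix-123   # a branch carrying a specific fix for testing
juju refresh mysql --channel=8.0/stable             # move an existing deployment to another channel
```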

So the fifth element of our mission is the maintenance and management of the Open Operator Collection, as the world’s largest repository of distilled software and application operations experience. Spanning the entire gamut of application and infrastructure software, the CharmHub brings together everything the community knows about deploying, integrating and operating applications at every layer of the stack, across every cloud, on all major architectures, and at a wide variety of scales.

Juju is…

Transforming operations code into shared and reusable packages requires a new way of thinking about the operations problem. It requires us to separate the business decisions — choice of applications and versions, where they run, how much CPU and disk they may consume, and how they are integrated into the estate — from the technical decisions of how best to achieve those goals.

Juju is an operator lifecycle manager which gives operators a business context in which to run, enabling declarative integration between applications and separating the domain expertise of the software from the business goals of a particular deployment.

Juju is a universal operator lifecycle manager, meaning that it supports traditional machine substrates like bare metal, VMware, OpenStack, or public cloud instances, as well as the newer container substrates like Kubernetes. The ability to span legacy and modern estate, with seamless and transparent integration between them, makes Juju more than a tool for Kubernetes.

Juju underpins complex scenarios that span many models across diverse substrates and heterogeneous architectures, reflecting the reality of large-scale enterprise software estates. Juju models applications and SAAS equally and seamlessly. Multi-cloud operations are natural in Juju, providing business flexibility without forcing a lowest-common-denominator approach.

Most importantly, Juju is not a ‘language for configuration file management’. It is not a successor to Puppet, Chef, Ansible or Salt. Juju removes the requirement to invest in deep institutional knowledge of configuration details altogether, by encapsulating community knowledge in operators that handle those details automatically given a context and an application graph. Moving beyond configuration management puts the focus on the business problem of application management and resource allocation, a much higher level proposition.

Finally, Juju provides a shared operational framework to a community of practitioners, who collaborate to publish the Open Operator Collection as the deepest and widest repository of operations code that spans every layer of the enterprise stack.