Bits and some bytes: December 2014

You have awesome engineers, and they want to advance in their careers. Their team is growing because of advancements they've made, and you want to recognize the work they've done with something. The obvious answer is to put them in charge of the team they've built, especially as they're the de-facto leader of the team already. But is this what they want? Or just what they believe they're supposed to want?

People Management Is A Different Skill

It's well known in the engineering world that engineers often will reach a technical peak, only to be asked to learn an entirely new set of skills, revolving around social and soft skills that they've spent the past chunk of their career probably ignoring. Building these skills involves a lot of trial and error, and above all, most of their time. Their time is now spent not coding, which is what they were originally rewarded for. Suddenly, they go from being good at their job to bad at it, which destroys confidence and job satisfaction.

The problem here is that people management is a different skill than technical leadership. What you want is to reward someone with recognition that they are a thought leader and an exceptional performer, and to make them an example of what others should strive to be. Not everyone can strive to be a manager; who would do the work? Additionally, you don't want to send the message that management, as a career, is somehow "better" than coding or anything else in the company, for that matter.

What Is Technical Leadership?

Technical leadership means having purview over several parts of the job. Think about the day of a typical engineer: many decisions have to be made, problems have to be prioritized, and solutions need to be found. Those are the fun parts of any engineer's job, and indeed all engineers must exercise some degree of technical leadership.

The rest of the job involves answering questions about how a thing you made works or doesn't work. Bugs need to be found and fixed. Documentation has to be written. Code has to be reviewed. Estimations must be made. Most of all, the longer an engineer remains at a company, the harder it is to find a few hours of isolation, free from interruption, to get work done. These are the worst parts of the job.

All of the fun work happens in these uninterrupted times, and these uninterrupted times happen only when your disappearance doesn't materially slow down someone else's work. Innovation happens when you have the bandwidth to hold the entire problem in your head, and most of the time that takes a great deal of "load time": research and contemplation about the problem. For introverts this means quiet; for extroverts this means dedicated time in a room of other people you work well with, thinking about the problem.

So what gets in the way? Why can't exceptional engineers have more fun work, and have someone who is yet to be recognized do more of the less fun things? The key is in empowerment, and an engineer's ability to say "No". This isn't a function of courage; it's a function of empowerment. Most exceptional engineers can't just let problems go unsolved, and if they feel responsible for a product or service, they end up chained to the project forever. Engineers often feel responsible for a project for the rest of its life, and this is exacerbated by the tendency for information to "silo" in one person, rather than being distributed across a team.

How To Empower Your Technical Leaders

As a project grows, a technical leader might naturally accrue team members, and accidentally transform into a people manager. This is the biggest risk. The way to empower your technical leaders is to set expectations early and often, encouraging them to be clear about their ultimate goals, without sending the message that they will be less respected if they don't transition into people management. If their goal is to be the best programmer imaginable, or to create a system that scales to 10 million users, or to understand deep processes in the operating system, then finding ways to help them reach their goals is rewarding for everyone.

Technical goals are easy to meet: there will be plenty of opportunities as a company grows to reward engineers with "fun" projects, and the ability to learn and grow as an engineer. Identify roles for which someone will have to learn a lot; these roles are rewarding and coveted. Also find opportunities to allow for professional development outside of your company, such as encouraging engineers to go to conferences and give talks, and become thought leaders in their field. Most engineers don't have an actionable plan for their own professional development, and so can benefit from advice on how to celebrate their accomplishments by putting them in a favorable position. This also will help you make the case for raises and bonuses, and will increase the esteem of your engineering department in general, which in turn attracts more talent. Seeing someone speak on something that interests them is the number one reason why mid-to-senior level engineers seek employment in an organization, rather than being recruited. Developing a strategy for positioning your technical thought leaders as the technical thought leaders is empowering, and it helps them empower themselves.

The ability to say "No" is the second component of technical leadership, so develop a strategy around handoffs. The objective here is to reward innovators by not chaining them to their innovations forever, which encourages those who create to continue creating. Junior developers are perfect to hand off smaller creations to; it helps them develop a sense of ownership without having to understand deeply the best practices that created the idea. Handoffs create opportunities in both directions: for the person with more time, as well as for the person accepting the responsibility. The incentives align well, but make sure there are expectations for the person receiving the responsibility: be honest about what it means for their career, and set up a strategy around them getting help with the project.

Developers today are building a new class of applications. These applications no longer fit on a single server, but instead run across a fleet of servers in a data center. Examples include analytics frameworks like Apache Hadoop and Apache Spark, message brokers like Apache Kafka, key-value stores like Apache Cassandra, as well as customer-facing applications such as those run by Twitter and Netflix.

These new applications are more than applications; they are distributed systems. Just as it became commonplace for developers to build multithreaded applications for single machines, it’s now becoming commonplace for developers to build distributed systems for data centers.

But it’s difficult for developers to build distributed systems, and it’s difficult for operators to run distributed systems. Why? Because we expose the wrong level of abstraction to both developers and operators: machines.

Machines are the wrong abstraction

Machines are the wrong level of abstraction for building and running distributed applications. Exposing machines as the abstraction to developers unnecessarily complicates the engineering, causing developers to build software constrained by machine-specific characteristics, like IP addresses and local storage. This makes moving and resizing applications difficult if not impossible, forcing maintenance in data centers to be a highly involved and painful procedure.

With machines as the abstraction, operators deploy applications in anticipation of machine loss, usually by taking the easiest and most conservative approach of deploying one application per machine. This almost always means machines go underutilized since we rarely buy our machines (virtual or physical) to exactly fit our applications, or size our applications to exactly fit our machines.

By running only one application per machine, we end up dividing our data center into highly static, highly inflexible partitions of machines, one for each distributed application. We end up with a partition that runs analytics, another that runs the databases, another that runs the web servers, another that runs the message queues, and so on. And the number of partitions is only bound to increase as companies replace monolithic architectures with service-oriented architectures and build more software based on microservices.

What happens when a machine dies in one of these static partitions? Let’s hope we over-provisioned sufficiently (wasting money), or can re-provision another machine quickly (wasting effort). What about when the web traffic dips to its daily low? With static partitions we allocate for peak capacity, which means when traffic is at its lowest, all of that excess capacity is wasted. This is why a typical data center runs at only 8-15% efficiency. And don’t be fooled just because you’re running in the cloud: you’re still being charged for the resources your application is not using on each virtual machine (someone is benefiting — it’s just your cloud provider, not you).
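
The cost of sizing every static partition for its own peak can be made concrete with a small worked example. The workload names and hourly demand figures below are invented purely for illustration; they are not measurements and are not meant to reproduce the 8-15% figure above. The sketch also ignores the headroom needed for machine failures, so treat it as a rough comparison rather than a capacity-planning method.

```python
# Hypothetical demand, in machines, for three workloads over four time slots.
# All numbers are made up for illustration.
demand = {
    "web":       [10, 4, 2, 8],
    "analytics": [2, 8, 10, 3],
    "database":  [5, 5, 6, 5],
}


def static_partition_capacity(demand):
    """Machines needed when each workload gets its own partition sized for its own peak."""
    return sum(max(series) for series in demand.values())


def shared_pool_capacity(demand):
    """Machines needed when all workloads share one pool sized for the combined peak."""
    per_slot_totals = (sum(slot) for slot in zip(*demand.values()))
    return max(per_slot_totals)


def utilization(demand, capacity):
    """Fraction of provisioned machine-slots that are actually used."""
    used = sum(sum(series) for series in demand.values())
    slots = len(next(iter(demand.values())))
    return used / (capacity * slots)


static = static_partition_capacity(demand)
shared = shared_pool_capacity(demand)
print(f"static partitions: {static} machines, {utilization(demand, static):.0%} utilized")
print(f"shared pool:       {shared} machines, {utilization(demand, shared):.0%} utilized")
```

With these made-up numbers the static partitions need 26 machines while a shared pool needs 18, because the workloads do not all peak at the same time.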

And finally, with machines as the abstraction, organizations must employ armies of people to manually configure and maintain each individual application on each individual machine. People become the bottleneck for trying to run new applications, even when there are ample resources already provisioned that are not being utilized.

If my laptop were a data center

Imagine if we ran applications on our laptops the same way we run applications in our data centers. Each time we launched a web browser or text editor, we’d have to specify which CPU to use, which memory modules are addressable, which caches are available, and so on. Thankfully, our laptops have an operating system that abstracts us away from the complexities of manual resource management.

In fact, we have operating systems for our workstations, servers, mainframes, supercomputers, and mobile devices, each optimized for their unique capabilities and form factors.

We’ve already started treating the data center itself as one massive warehouse-scale computer. Yet, we still don’t have an operating system that abstracts and manages the hardware resources in the data center just like an operating system does on our laptops.

It’s time for the data center OS

What would an operating system for the data center look like?

From an operator’s perspective it would span all of the machines in a data center (or cloud) and aggregate them into one giant pool of resources on which applications would be run. You would no longer configure specific machines for specific applications; all applications would be capable of running on any available resources from any machine, even if there are other applications already running on those machines.

From a developer’s perspective, the data center operating system would act as an intermediary between applications and machines, providing common primitives to facilitate and simplify building distributed applications.

The data center operating system would not need to replace Linux or any other host operating systems we use in our data centers today. The data center operating system would provide a software stack on top of the host operating system. Continuing to use the host operating system to provide standard execution environments is critical to immediately supporting existing applications.

The data center operating system would provide functionality for the data center that is analogous to what a host operating system provides on a single machine today: namely, resource management and process isolation. Just like with a host operating system, a data center operating system would enable multiple users to execute multiple applications (made up of multiple processes) concurrently, across a shared collection of resources, with explicit isolation between those applications.

An API for the data center

Perhaps the defining characteristic of a data center operating system is that it provides a software interface for building distributed applications. Analogous to the system call interface for a host operating system, the data center operating system API would enable distributed applications to allocate and deallocate resources, launch, monitor, and destroy processes, and more. The API would provide primitives that implement common functionality that all distributed systems need. Thus, developers would no longer need to independently re-implement fundamental distributed systems primitives (and inevitably, independently suffer from the same bugs and performance issues).
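
To give a feel for what such an interface might look like, here is a minimal in-memory sketch. Everything in it is hypothetical: the class `DataCenterOS`, its method names, and the `Allocation` and `Process` types are invented for this illustration and do not correspond to the API of Mesos or any other real system. Nothing is actually launched; the toy only tracks bookkeeping for the operations named above (allocate, deallocate, launch, monitor, destroy).

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Allocation:
    id: int
    cpus: int
    mem_mb: int


@dataclass
class Process:
    id: int
    command: str
    allocation: Allocation
    state: str = "RUNNING"


class DataCenterOS:
    """Toy stand-in for a data center API (hypothetical; it never touches real machines)."""

    def __init__(self, cpus, mem_mb):
        # The pool is one aggregate; callers never name a specific machine.
        self.free_cpus = cpus
        self.free_mem_mb = mem_mb
        self._ids = count(1)
        self._processes = {}

    def allocate(self, cpus, mem_mb):
        if cpus > self.free_cpus or mem_mb > self.free_mem_mb:
            raise RuntimeError("insufficient resources in the pool")
        self.free_cpus -= cpus
        self.free_mem_mb -= mem_mb
        return Allocation(next(self._ids), cpus, mem_mb)

    def deallocate(self, allocation):
        self.free_cpus += allocation.cpus
        self.free_mem_mb += allocation.mem_mb

    def launch(self, command, allocation):
        # A real system would start the command somewhere; here we only record it.
        process = Process(next(self._ids), command, allocation)
        self._processes[process.id] = process
        return process

    def monitor(self, process_id):
        return self._processes[process_id].state

    def destroy(self, process_id):
        process = self._processes.pop(process_id)
        process.state = "KILLED"
        self.deallocate(process.allocation)


dcos = DataCenterOS(cpus=8, mem_mb=16384)
allocation = dcos.allocate(cpus=2, mem_mb=4096)
process = dcos.launch("python server.py", allocation)
print(dcos.monitor(process.id), dcos.free_cpus)
dcos.destroy(process.id)
```

The point of the sketch is the shape of the calls: the application asks for amounts of resources and gets back handles, much as a program asks a host operating system for memory or a child process.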

Centralizing common functionality within the API primitives would enable developers to build new distributed applications more easily, more safely, and more quickly. This is reminiscent of when virtual memory was added to host operating systems. In fact, one of the virtual memory pioneers wrote that “it was pretty obvious to the designers of operating systems in the early 1960s that automatic storage allocation could significantly simplify programming.”

Example primitives

Two primitives specific to a data center operating system that would immediately simplify building distributed applications are service discovery and coordination. Unlike on a single host where very few applications need to discover other applications running on the same host, discovery is the norm for distributed applications. Likewise, most distributed applications achieve high availability and fault tolerance through some means of coordination and/or consensus, which is notoriously hard to implement correctly and efficiently.

Developers today are forced to pick between existing tools for service discovery and coordination, such as Apache ZooKeeper and CoreOS’ etcd. This forces organizations to deploy multiple tools for different applications, significantly increasing operational complexity and hurting maintainability.

Having the data center operating system provide primitives for discovery and coordination not only simplifies development, it also enables application portability. Organizations can change the underlying implementations without rewriting the applications, much like you can choose between different filesystem implementations on a host operating system today.
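
The portability argument can be sketched as an application that depends only on a small discovery interface. The names `ServiceDiscovery`, `InMemoryDiscovery`, and `connect_to_database` are hypothetical and exist only for this illustration. A production backend would presumably wrap a tool such as ZooKeeper or etcd, but no such client code is shown or assumed here.

```python
from typing import Protocol


class ServiceDiscovery(Protocol):
    """Hypothetical discovery primitive that applications code against."""

    def register(self, name: str, address: str) -> None: ...

    def lookup(self, name: str) -> list[str]: ...


class InMemoryDiscovery:
    """One possible backend; any other class with the same two methods could replace it."""

    def __init__(self) -> None:
        self._services: dict[str, list[str]] = {}

    def register(self, name: str, address: str) -> None:
        self._services.setdefault(name, []).append(address)

    def lookup(self, name: str) -> list[str]:
        return list(self._services.get(name, []))


def connect_to_database(discovery: ServiceDiscovery) -> str:
    """Application code: it knows the interface, not the implementation behind it."""
    addresses = discovery.lookup("database")
    if not addresses:
        raise LookupError("no instance of 'database' is registered")
    return addresses[0]


discovery = InMemoryDiscovery()
discovery.register("database", "10.0.0.5:5432")  # made-up address
print(connect_to_database(discovery))
```

Swapping the backend means passing a different object to `connect_to_database`; the application itself does not change, which is the same property that lets a program run unchanged on different filesystems.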

A new way to deploy applications

With a data center operating system, a software interface replaces the human interface that developers typically interact with when trying to deploy their applications today; rather than a developer asking a person to provision and configure machines to run their applications, developers launch their applications using the data center operating system (e.g., via a CLI or GUI), and the application executes using the data center operating system’s API.

This supports a clean separation of concerns between operators and users: operators specify the amount of resources allocatable to each user, and users launch whatever applications they want, using whatever resources are available to them. Because an operator now specifies how much of any type of resource is available, but not which specific resource, a data center operating system, and the distributed applications running on top, can be more intelligent about which resources to use in order to execute more efficiently and better handle failures. Because most distributed applications have complex scheduling requirements (think Apache Hadoop) and specific needs for failure recovery (think of a database), empowering software to make decisions instead of humans is critical for operating efficiently at data-center scale.
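
The split between "how much" and "which" can be sketched as follows. This is a deliberately naive illustration: the `Scheduler` class, the first-fit placement rule, and the machine and user names are all assumptions made for the example, not a description of how Mesos or any other scheduler actually works.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    free_cpus: int


class Scheduler:
    """Hypothetical scheduler: operators set quotas, software picks the machines."""

    def __init__(self, machines, quotas):
        self.machines = machines        # the shared pool of machines
        self.remaining = dict(quotas)   # user -> CPUs that user may still consume

    def launch(self, user, app, cpus):
        # The operator's policy says how much a user may take, never which machine.
        if cpus > self.remaining.get(user, 0):
            raise PermissionError(f"{user} would exceed their CPU quota launching {app}")
        # First fit (an assumption for this sketch): take the first machine with room.
        for machine in self.machines:
            if machine.free_cpus >= cpus:
                machine.free_cpus -= cpus
                self.remaining[user] -= cpus
                return machine.name
        raise RuntimeError("no machine has enough free CPUs")


scheduler = Scheduler(
    machines=[Machine("m1", 4), Machine("m2", 8)],
    quotas={"alice": 6, "bob": 4},
)
print(scheduler.launch("alice", "web", 3))
print(scheduler.launch("bob", "analytics", 4))
print(scheduler.launch("alice", "cache", 3))
```

Because the users never name a machine, the placement rule could later be replaced with something smarter (for example, one that accounts for failures or data locality) without changing how operators express quotas or how users launch applications.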

The “cloud” is not an operating system

Why do we need a new operating system? Didn’t Infrastructure as a Service (IaaS) and Platform as a Service (PaaS) already solve these problems?

IaaS doesn’t solve our problems because it’s still focused on machines. It isn’t designed with a software interface intended for applications to use in order to execute. IaaS is designed for humans to consume, in order to provision virtual machines that other humans can use to deploy applications; IaaS turns machines into more (virtual) machines, but does not provide any primitives that make it easier for a developer to build distributed applications on top of those machines.

PaaS, on the other hand, abstracts away the machines, but is still designed first and foremost to be consumed by a human. Many PaaS solutions do include numerous tangential services and integrations that make building a distributed application easier, but not in a way that’s portable across other PaaS solutions.

Apache Mesos: The distributed systems kernel

Distributed computing is now the norm, not the exception, and we need a data center operating system that delivers a layer of abstraction and a portable API for distributed applications. Not having one is hindering our industry. Developers should be able to build distributed applications without having to reimplement common functionality. Distributed applications built in one organization should be capable of being run in another organization easily.

Existing cloud computing solutions and APIs are not sufficient. Moreover, the data center operating system API must be built, like Linux, in an open and collaborative manner. Proprietary APIs force lock-in, deterring a healthy and innovative ecosystem from growing. It’s time we created the POSIX for distributed computing: a portable API for distributed systems running in a data center or on a cloud.

The open source Apache Mesos project, of which I am one of the co-creators and the project chair, is a step in that direction. Apache Mesos aims to be a distributed systems kernel that provides a portable API upon which distributed applications can be built and run.

Many popular distributed systems have already been built directly on top of Mesos, including Apache Spark, Apache Aurora, Airbnb’s Chronos, and Mesosphere’s Marathon. Other popular distributed systems have been ported to run on top of Mesos, including Apache Hadoop, Apache Storm, and Google’s Kubernetes, to list a few.

Chronos is a compelling example of the value of building on top of Mesos. Chronos, a distributed system that provides highly available and fault-tolerant cron, was built on top of Mesos in only a few thousand lines of code and without having to do any explicit socket programming for network communication.

Companies like Twitter and Airbnb are already using Mesos to help run their data centers, while companies like Google have been using in-house solutions they built almost a decade ago. In fact, just like Google’s MapReduce spurred an industry around Apache Hadoop, Google’s in-house data center solutions have had close ties with the evolution of Mesos.

While not a complete data center operating system, Mesos, along with some of the distributed applications running on top, provides some of the essential building blocks from which a full data center operating system can be built: the kernel (Mesos), a distributed init.d (Marathon/Aurora), cron (Chronos), and more.

Saturday, 27 December 2014

Leadership