What would you do with a spare terawatt?


Elastic Compute - Harvard University, Data Science

Rick Farmer, Sean Keery, Robert Joseph Johnson


Video
The 2-minute breakdown

We can do better with capacity planning and alerting for microservice-based, container-managed clusters of compute resources.


Presentation
The complete deck

Take your time and walk through the presentation deck for yourself.

Slideshow
PDF file with speaker notes

Overview

We are entering the age of the third platform in IT. The way we run applications is changing rapidly and significantly.

  • Platform Zero

    "Apps" were run on mainframes, but they were too big and too expensive.

  • Platform One

    Apps ran on commoditized computer hardware such as a PC, but this came with a high degree of waste. CPUs, memory, and disk remained underutilized as single apps sat relatively idle, holding excess capacity for workloads that never came.

  • Platform Two

    Computers were virtualized so that several virtual machines could run at once on a single physical computer. This made better use of the CPU, memory, and disk of the interconnected computers, creating a cloud of infrastructure and resulting in much less waste.

The third platform moves the focus to driving business value and improving applications, instead of to infrastructure and its rising costs. We now virtualize applications into slices of the CPU, memory, and disk of virtualized machines, called containers. This produces far less waste because application containers can be spun up or down nearly instantly in reaction to demand. We can scale applications to multiple nodes wherever capacity is available right now, whether in Boston or Hong Kong. To scale well, applications are being redesigned as microservices: small units of work providing well-defined services to the whole of the modern distributed application.

The key element enabling this is capacity planning and load alerting for container-managed clusters. Do we need to scale now? If so, do we scale the app out because we need more resources to serve an avalanche of demand? Or do we scale in, because Black Friday has passed and our retail app no longer needs so many resources?
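As a rough illustration of what this alerting logic could look like, here is a minimal sketch in Python. The thresholds and the ScaleAction/ClusterSample names are hypothetical choices for illustration, not values or interfaces drawn from our analysis.

    from dataclasses import dataclass
    from enum import Enum

    class ScaleAction(Enum):
        SCALE_OUT = "scale out"  # add instances
        SCALE_IN = "scale in"    # remove instances
        HOLD = "hold"            # do nothing

    @dataclass
    class ClusterSample:
        cpu_utilization: float     # fraction of CPU in use, 0.0-1.0
        memory_utilization: float  # fraction of memory in use, 0.0-1.0

    def recommend(sample: ClusterSample,
                  high: float = 0.80,
                  low: float = 0.20) -> ScaleAction:
        """Hypothetical threshold rule: scale out when either resource
        is hot, scale in only when both are cold."""
        if sample.cpu_utilization > high or sample.memory_utilization > high:
            return ScaleAction.SCALE_OUT
        if sample.cpu_utilization < low and sample.memory_utilization < low:
            return ScaleAction.SCALE_IN
        return ScaleAction.HOLD

    print(recommend(ClusterSample(cpu_utilization=0.02,
                                  memory_utilization=0.15)))
    # -> ScaleAction.SCALE_IN

Real alerting would smooth over a window of samples rather than react to a single reading, but the shape of the decision is the same.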

Datacenter (source: topwalls.net)

Our project is motivated by the opportunity to reduce waste by "right-sizing" energy consumption, linking it more directly to demand. Few people have seen the inside of a large data center, but these are colossal, energy-hungry ventures[1] on a scale that is difficult to appreciate without seeing one in person.

Everything in a data center is planned around watt usage. Over half of the wattage goes to server load, and most of the rest to cooling equipment. If we can predict load better and scale compute resources more efficiently, we have the opportunity to effect change in the United States alone at about the 9.1 TWh (terawatt-hour) level and, perhaps, slow the projected rise in consumption by 2020 to 13.9 TWh [2].


Analysis

Looking for solutions

Our inferential goals are to understand how better to do capacity planning and alerting for microservice-based, container-managed clusters of compute resources. We tried to learn what makes a good alert threshold, such that we can predictably recommend actions that will keep an application highly available under demand scenarios ranging from low to high. Effectively, we want to investigate the elasticity of supply and demand on compute resources so that we can make ongoing dynamic recommendations about the proper scale for a given set of inputs (a sizing rule along these lines is sketched after the list below). Benefits include:

  • Learning

    Learning what a low usage state is for applications and provide information on the scale (how many compute nodes) that should be set on a dynamic basis

  • Energy

    Reduction of energy usage by fitting the supply and demand more appropriately over time

  • Elastic compute

    The ability to prescriptively maintain application availability with the minimum amount of compute resources
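One simple way to turn an observed utilization into a node-count recommendation is a proportional sizing rule that holds average utilization near a target. A minimal sketch, where the 50% target is a hypothetical placeholder rather than a value derived from our data:

    import math

    def recommended_instances(current_instances: int,
                              observed_utilization: float,
                              target_utilization: float = 0.50) -> int:
        """Grow or shrink the instance count so that average
        utilization lands near the target."""
        raw = current_instances * observed_utilization / target_utilization
        return max(1, math.ceil(raw))  # always keep at least one instance

    # An app averaging 2% CPU across 10 instances would shrink to 1:
    print(recommended_instances(10, 0.02))  # -> 1
    # The same app at 90% CPU would grow to 18:
    print(recommended_instances(10, 0.90))  # -> 18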

Datacenters around the world

Fig 1. - Datacenters around the world.
Click the image above to view our interactive map



Datacenters by country

Fig 2. - Datacenter count per country (includes datacenters, co-location centers, shared and virtual web-hosting providers)
(data source: IP catalog project)



We started by analyzing a small subset of data provided to us by Pivotal from their Pivotal Web Services platform, the largest public Cloud Foundry platform currently in operation. Essentially, Cloud Foundry is an open-source, well-orchestrated container platform that allows many applications to scale in and out easily.

This data gave us insight into the nature of a completely container-based approach. We started by analyzing a general container metric, which covered around 3,000 instances (containers) running on this particular Cloud Foundry foundation. We found that the mean instance CPU utilization was very low, at around 2%. Each instance consumed about 500 MB of memory.



The next level of consolidation in the third platform will occur through application virtualization. Our goal was to identify how we could concentrate workloads to reduce energy consumption, so we began looking for application data we could use to identify patterns.
We got a number of applications running on Pivotal Web Services and added application performance monitoring by binding all our microservices to a free trial of New Relic. The New Relic suite is superb at tracking web and service transaction metrics such as response time and throughput. However, after initial exploration, we found it very difficult to export the data in a way that let us identify individual application utilization.

Fig 3. - New Relic


Next we tried a free subscription to DataDog, but we found its metrics were too high-level for what we needed; they tended to be at the host/server level.

Fig 4. - DataDog


Since each application runs in its own container, we thought we could get equivalent data from a container-level solution. Coincidentally, we were given a demo of an internal Pivotal APM tool.

Fig 5. - Firehose nozzle container-level metrics


Although the product was not available, we found we could obtain container-level metrics through a command-line interface using the same application "nozzle" that this beautiful dashboard was using.
We installed a plugin that exports the metrics, following the instructions at the firehose-plugin site, and then ran the following commands, piping the output to files:

    cf nozzle -f ContainerMetric > ContainerMetricPaaS4.txt
    cf nozzle -f ValueMetric > valueMetricSaaS2.txt
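To work with the exported text in Python, we needed to pull the numeric fields out of each ContainerMetric line. Below is one way to do that with pandas, assuming each exported line carries fields shaped like applicationId:"..." instanceIndex:0 cpuPercentage:1.23 memoryBytes:... diskBytes:... (the plugin's exact output layout may differ):

    import re
    import pandas as pd

    # Assumed shape of one exported line; the real output may differ.
    PATTERN = re.compile(
        r'applicationId:"(?P<app>[0-9a-f-]+)".*?'
        r'instanceIndex:(?P<index>\d+).*?'
        r'cpuPercentage:(?P<cpu>[\d.]+).*?'
        r'memoryBytes:(?P<mem>\d+).*?'
        r'diskBytes:(?P<disk>\d+)'
    )

    rows = []
    with open("ContainerMetricPaaS4.txt") as f:
        for line in f:
            m = PATTERN.search(line)
            if m:  # skip lines that are not ContainerMetric events
                rows.append({
                    "app": m.group("app"),
                    "index": int(m.group("index")),
                    "cpu_pct": float(m.group("cpu")),
                    "mem_mb": int(m.group("mem")) / 1024 ** 2,
                    "disk_mb": int(m.group("disk")) / 1024 ** 2,
                })

    df = pd.DataFrame(rows)
    print(df["cpu_pct"].mean(), df["mem_mb"].mean())  # summary statistics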



Just to get a sense of general performance across the different services, we plotted three different readings: memory usage, disk usage, and CPU utilization. For easy readability, we have removed the exact timestamps from our x-axis. Below is an example plot of disk usage for PaaS:

Fig 6. - Disk Usage Over Time, PaaS
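A plot like this takes only a few lines of matplotlib. A sketch, reusing the df DataFrame parsed above and plotting by sample order rather than by timestamp:

    import matplotlib.pyplot as plt

    # One line per app instance, timestamps dropped from the x-axis.
    for (app, index), group in df.groupby(["app", "index"]):
        plt.plot(range(len(group)), group["disk_mb"].values, alpha=0.5)

    plt.xlabel("Sample (timestamps removed)")
    plt.ylabel("Disk usage (MB)")
    plt.title("Disk Usage Over Time, PaaS")
    plt.show()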


We took a look at our container data again, this time grouped by application, IP address, and Instance Index. Cloud Foundry assigns each application a UUID (Universally Unique Identifier) that we can use to identify a single application. Below is a sample plot of memory consumption, grouped by application.

Fig 7. - Memory Usage per Application, PaaS
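Grouping on the UUID is a one-liner against the same DataFrame; for example, to rank applications by mean memory consumption:

    # Mean memory per application, keyed by its Cloud Foundry UUID.
    per_app = df.groupby("app")["mem_mb"].mean().sort_values(ascending=False)
    print(per_app.head(10))  # the ten hungriest applications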


We then analyzed some of the relationships between memory/disk usage and CPU utilization in more detail, on an app-by-app basis. For further analysis, we should be concerned with instances and applications that appear in the top-right or bottom-left corners of these plots: the areas where both CPU utilization and memory/disk usage are high, or both are low. When a process sits in the top-right area, we can begin to consider scaling compute resources out. Likewise, when it resides in the low-usage area for both, we can start to ratchet compute resources down and scale them in.

Fig 8. - CPU vs. Disk Usage per Application, PaaS
Fig 9. - CPU vs. Disk Usage per Application, SaaS
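To flag the corner cases these plots highlight, we can classify each application by whether both readings are high or both are low. The quartile cutoffs below are hypothetical; any thresholds could be substituted:

    # Per-app means of CPU utilization and disk usage.
    summary = df.groupby("app")[["cpu_pct", "disk_mb"]].mean()

    # Hypothetical cutoffs: upper and lower quartiles of each metric.
    hi_cpu, lo_cpu = summary["cpu_pct"].quantile([0.75, 0.25])
    hi_disk, lo_disk = summary["disk_mb"].quantile([0.75, 0.25])

    scale_out = summary[(summary["cpu_pct"] >= hi_cpu) &
                        (summary["disk_mb"] >= hi_disk)]  # top-right corner
    scale_in = summary[(summary["cpu_pct"] <= lo_cpu) &
                       (summary["disk_mb"] <= lo_disk)]   # bottom-left corner

    print(len(scale_out), "scale-out candidates;",
          len(scale_in), "scale-in candidates")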

Prediction

Our limited model illustrates the possibility of reducing energy usage by a terawatt-hour in the near future.

Findings

Our data allowed us to group high- and low-utilization applications. We will be able to segregate the high-usage services that share common patterns through anti-affinity rules, while letting the others fill in the gaps. Moreover, these pattern-identification methods allow us to predict scaling needs before thresholds are met, letting us scale out ahead of demand and scale in as it fades. The capability to reduce capacity will cut down on future electricity consumption.