Compute Capacity Planning for CI/CD
CI/CD workflows are also known as pipelines. They compose activities such as integrate, build, test, package, release, and deploy. In most DevOps tools, these activities are orchestrated by an agent and executed on some sort of compute. In GitHub Actions, that compute is called a runner: actions are the instructions, and runners are the compute that executes them.
GitHub offers three main types of runners: GitHub-hosted runners, self-hosted runners, and larger hosted runners (LHR).
So which runners do we need? As any consultant would say, “it depends”. There are a few considerations:
- Does my Finance Department look at IT cost with a service view? Do they adopt a Total Cost of Ownership (TCO) viewpoint for digital services?
- Do I have existing infrastructure assets that I need to sweat (asset depreciation)?
- Do I have a good bargain with my existing IaaS cloud provider?
- Do I have regulatory obligations?
- Do I have social, ethical and environmental responsibilities?
- What is the build performance of my current DevOps platform? Queue time? Build time? Concurrency rate?
Total Cost of Ownership is a mystery if my CFO does not adopt it
The Total Cost of Ownership (TCO) concept calculates the total cost to build and run a service. This includes capital costs such as infrastructure assets and fixtures, and operating costs such as employee salaries, rent, and utilities. Summing all the costs associated with the service enables calculations down to per-service-unit cost granularity.
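To make the calculation concrete, here is a minimal sketch of the arithmetic; every cost category and figure below is a hypothetical placeholder, not a benchmark.

```python
# Minimal TCO sketch with hypothetical figures (not real data).
# Capital costs are annualized over an assumed depreciation period.
capital_costs = {"build_servers": 120_000 / 4, "network_fixtures": 20_000 / 5}  # per year
operating_costs = {"platform_team_salaries": 300_000, "rent_and_utilities": 30_000}  # per year

annual_tco = sum(capital_costs.values()) + sum(operating_costs.values())

# Divide by the service units consumed to get a per-unit cost,
# e.g. total build minutes executed in the year.
build_minutes_per_year = 2_000_000
cost_per_build_minute = annual_tco / build_minutes_per_year
print(f"TCO: ${annual_tco:,.0f}/year, ${cost_per_build_minute:.4f} per build minute")
```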
This approach only works if the Finance Department does budgeting and forecasting with this kind of service view. It does not work if budgeting is function-based.
Sweating existing IT infrastructure assets
Owned IT infrastructure assets may take a number of years to fully depreciate (sweat), and during those years they are, in effect, an expense already paid for.
Unused or underutilized compute is a perfect candidate to add to the runner compute pool. The sweet spot for on-prem is 60-70% utilization, and 80-90% for cloud infrastructure.
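As a rough illustration, a candidate check might look like the sketch below; the host names, utilization figures, and exact thresholds are assumptions to adapt to your own environment.

```python
# Hypothetical utilization figures; thresholds follow the sweet spots above.
SWEET_SPOT = {"on-prem": 0.60, "cloud": 0.80}  # lower bound of the target range

hosts = [
    {"name": "onprem-build-01", "kind": "on-prem", "avg_utilization": 0.35},
    {"name": "cloud-vm-pool-a", "kind": "cloud", "avg_utilization": 0.85},
]

# Hosts sitting below their sweet spot have headroom to host self-hosted runners.
candidates = [h for h in hosts if h["avg_utilization"] < SWEET_SPOT[h["kind"]]]
for h in candidates:
    print(f"{h['name']}: {h['avg_utilization']:.0%} utilized, candidate for the runner pool")
```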
Good bargain from IaaS cloud provider
Normally, enterprises with large cloud spend receive a sizeable discount off the RRP for cloud compute. Additionally, we may have entered into some sort of monetary commitment or savings plan in order to receive the discount, and are therefore obliged to use up our IaaS consumption commitment.
Restrictions from regulatory obligations
Certain industry sectors or jurisdictions have regulatory requirements that enterprises need to meet to be authorized to operate; for example, government, banking, and gambling have some of the strictest rules on data at rest and data in transit.
Requirements from social, ethical and environmental responsibility
There is an emerging expectation that enterprises act as good citizens, particularly by incorporating environmental responsibility into corporate policies. As such, we may have adopted Green Software Engineering principles that require our compute to be carbon and energy efficient.
Existing CI/CD performance
Lastly, how do my dev teams perceive the current CI/CD performance? What are their frustrations or bottlenecks?
Objective metrics we want to know are (a small sketch after this list shows how they might be derived from job records):
- Execution (process) time
- Wait time
- Lead time (process + wait)
- Queue size
- Processing load (% utilization, concurrency)
- Error rate
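A minimal sketch of deriving these from exported job records; the record fields and sample values are assumptions rather than any particular platform's export format.

```python
from datetime import datetime

# Hypothetical job records; real exports will have platform-specific fields.
jobs = [
    {"queued": datetime(2024, 5, 1, 9, 0), "started": datetime(2024, 5, 1, 9, 2),
     "finished": datetime(2024, 5, 1, 9, 17), "succeeded": True},
    {"queued": datetime(2024, 5, 1, 9, 5), "started": datetime(2024, 5, 1, 9, 5),
     "finished": datetime(2024, 5, 1, 9, 27), "succeeded": False},
]

for job in jobs:
    execution = job["finished"] - job["started"]   # execution (process) time
    wait = job["started"] - job["queued"]          # wait time
    lead = execution + wait                        # lead time = process + wait
    print(f"execution={execution}, wait={wait}, lead={lead}")

# Error rate across the sampled jobs.
error_rate = sum(not j["succeeded"] for j in jobs) / len(jobs)
print(f"error rate: {error_rate:.0%}")
```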
Subjective observations come from interviewing the dev teams; look for intangible obstacles like:
- Disjointed data flow from screen switching, copy and paste, recreating documents or files, etc.
- Frequent execution errors
- Incorrect, incomplete or delayed feedback
Estimating the workflow compute capacity
To do runner capacity planning properly, leverage the forecast command from GitHub Actions Importer (GAI) against the existing DevOps platform. It produces useful statistics - execution time, queue time, and concurrency - as median, p90, minimum, and maximum data points.
GAI provides this analysis per agent queue of the CI instance.
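If running GAI against the source platform is not an option, a similar summary can be approximated from exported execution times; a small sketch with made-up sample values:

```python
import statistics

# Approximate a forecast-style summary from raw execution times (minutes).
# The sample values below are made up for illustration.
execution_minutes = [12, 15, 15, 18, 22, 9, 42, 0, 14, 16]

summary = {
    "total": sum(execution_minutes),
    "median": statistics.median(execution_minutes),
    "p90": statistics.quantiles(execution_minutes, n=10)[8],  # 90th percentile cut point
    "min": min(execution_minutes),
    "max": max(execution_minutes),
}
print(summary)
```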
Analysing an example
Starting with execution time
- Total: **1062 minutes**
- Median: **15 minutes**
- P90: **22 minutes**
- Min: **0 minutes**
- Max: **42 minutes**
- This agent has a median execution time of 15 minutes
- 90% of the execution jobs are 22 minutes or less
- Maximum execution time is 42 minutes
Adding queue time
- Median: **0 minutes**
- P90: **0 minutes**
- Min: **0 minutes**
- Max: **5 minutes**
- This agent is deemed to have no jobs queued; there is an outlier of 5 minutes
- This implies this agent is underutilized
Adding concurrent jobs
- Median: **2**
- P90: **5**
- Min: **0**
- Max: **20**
- The execution load has a median of 2 jobs running concurrently
- 90% of the time, it has 5 or fewer concurrent jobs
- The agent appears to have a pool of 20 execution machines, and it is heavily underutilized
Observations
From this analysis, the agent appears to be running on fixed infrastructure capacity; if so, it is significantly underutilized. Unless we are sweating the assets or a monetary commitment, I would recommend reassigning the workflows to GitHub-hosted runners, for the benefits of repurposing the underutilized capacity and a better TCO.
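A quick back-of-the-envelope check of that conclusion, using the concurrency figures above (the pool size of 20 is inferred from the maximum concurrency, so treat it as an assumption):

```python
# Rough pool utilization check using the concurrency figures above.
pool_size = 20          # inferred from max concurrent jobs; an assumption
median_concurrent = 2
p90_concurrent = 5

print(f"median utilization: {median_concurrent / pool_size:.0%}")  # ~10%
print(f"p90 utilization:    {p90_concurrent / pool_size:.0%}")     # ~25%
```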
Categorizing runner usage
Use this approach to continue analyzing other agent runs and workflows, and then categorize them into usage types with similar compute needs. With the design considerations discussed earlier, we can pragmatically map runner types to usage types and, for larger hosted and self-hosted runners, decide which labels and groups to use.
Some usage categories are (a sketch of this mapping follows the list):
- General (default) -> GitHub-hosted runners
- Highly regulated industry -> self-hosted runners
- High compute -> LHR, or autoscaling self-hosted runners
- Long running -> self-hosted runners, autoscaling self-hosted runners
- Low-priority or scheduled jobs -> self-hosted runners, autoscaling self-hosted runners
- High-velocity development -> GitHub-hosted runners
- Burst -> GitHub-hosted runners
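For illustration, this mapping could be captured as a simple lookup that a provisioning script or workflow template consults; the keys and assignments below simply restate the categories above with placeholder names.

```python
# Usage type -> default runner assignment, restating the categories above.
# Group and label names are illustrative placeholders, not a prescription.
RUNNER_MAP = {
    "general": {"runner": "github-hosted"},
    "highly-regulated": {"runner": "self-hosted", "group": "on-prem"},
    "high-compute": {"runner": "larger-hosted"},
    "long-running": {"runner": "self-hosted", "label": "long-running"},
    "low-priority": {"runner": "self-hosted", "label": "low-priority"},
    "high-velocity": {"runner": "github-hosted"},
    "burst": {"runner": "github-hosted"},
}

def runner_for(usage_type: str) -> dict:
    """Return the default runner assignment for a usage type."""
    return RUNNER_MAP.get(usage_type, RUNNER_MAP["general"])

print(runner_for("long-running"))
```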
Tally for capacity and assign usage
With the historical job counts, we next estimate what the compute minutes will be for each runner type and usage. It may look like this:
Note: This is an example only. Categorizing usage types is subject to the requirement context.
| Usage type | Runner type | Group | Label | Estimated compute minutes |
|---|---|---|---|---|
| General, high velocity (default) | GitHub-hosted | - | - | |
| Technical spikes/experiments | GitHub-hosted, with spending limit | - | non-production | |
| Hackathon week | GitHub-hosted, with spending limit | - | non-production | |
| High business priority/urgent | GitHub-hosted, LHR | - | - | |
| Highly regulated | Self-hosted | on-prem | au-data | |
| Long running | Self-hosted | aws | long-running | |
| Low priority | Self-hosted | azure | low-priority | |
| High performance compute | GitHub-hosted, LHR | - | high-performance | |
| Mission critical compute | TBD | TBD | mission-critical | |
Finishing off the example above, the additional statistics are:
- Job count: 628
- Median execution time: 15 minutes
This workflow should be reassigned to GitHub-hosted runners, with a rough estimate of 9,420 compute minutes / 157 compute hours (628 x 15 minutes).
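The same arithmetic can be tallied across all analyzed workflows to fill in the last column of the table above; in this sketch only the first entry uses the figures from the worked example, the rest are hypothetical.

```python
# Estimated compute minutes = job count x median execution minutes, per usage type.
workflows = [
    {"name": "example-agent-workflow", "usage": "General, high velocity (default)",
     "job_count": 628, "median_exec_minutes": 15},
    {"name": "nightly-regression", "usage": "Long running",
     "job_count": 30, "median_exec_minutes": 180},
]

tally: dict[str, int] = {}
for wf in workflows:
    minutes = wf["job_count"] * wf["median_exec_minutes"]
    tally[wf["usage"]] = tally.get(wf["usage"], 0) + minutes

for usage, minutes in tally.items():
    print(f"{usage}: {minutes:,} minutes (~{minutes / 60:,.0f} hours)")
```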
Which runners do we need?
CI/CD workflows are now becoming the critical path for all IT delivery and operations. They are the backbone of the digital services lifecycle - delivering new features to keep up with the market and, more importantly, restoring service from a priority one incident.
Therefore the compute that executes these workflows deserves mission-critical attention beyond cost optimization. Enterprises will employ more than one runner type. A good strategy will consider all three runner types, and potentially leverage all three.