Efficient SAST with CodeQL
For teams practising DevSecOps, SAST with CodeQL can be very valuable in preventing potential losses for the organisation. For organisations with a strong security culture, the benefits exceed the initial and running costs.
However, cost optimisation is a continuous FinOps improvement exercise, and it is important to understand that inefficient SAST can be expensive:
- Expensive in terms of velocity - scanning can take hours to complete. Imagine pushing a commit or opening a PR, grabbing a coffee, having a watercooler conversation, then coming back to find it is still scanning.
- Expensive in terms of compute - scans consume the included compute minutes at a rapid rate; left unmanaged, this means bill shock at the end of the month.
- Expensive in terms of feedback - the cost of delay is proportional to the cost of remediation. The longer it takes to get feedback, the more expensive it is to fix the problem.
Below are some levers I collated to make CodeQL code scanning more efficient.
Levers
1. Runners - GitHub-hosted or self-hosted
CodeQL code scanning executes on runners, which are just compute infrastructure. Collecting data points such as lines-of-code, or runtime metrics with workflow-telemetry-action, helps to decide whether we need to scale out or scale up.
Scale-up means using a larger compute instance with more CPU cores and memory.
Scale-out means spreading the work across more runners in parallel; it can be combined with the other levers listed below.
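As a rough sketch of how the data collection could look, the job below adds the telemetry action as its first step and runs on a larger runner. The runner label ubuntu-latest-8-cores is hypothetical (larger runners are named by whoever sets them up), and the action reference catchpoint/workflow-telemetry-action@v2 is my assumption of where the action is currently published, so check both before copying.

```yaml
jobs:
  analyze:
    # Hypothetical larger-runner label; use ubuntu-latest to capture a baseline first
    runs-on: ubuntu-latest-8-cores
    steps:
      # Placed first so it can sample CPU, memory, and I/O for the whole job
      # (assumed action reference, verify before use)
      - uses: catchpoint/workflow-telemetry-action@v2
      - uses: actions/checkout@v3
      - uses: github/codeql-action/init@v2
        with:
          languages: java
      - uses: github/codeql-action/autobuild@v2
      - uses: github/codeql-action/analyze@v2
```

Comparing the metrics from a baseline run and a scaled-up run is one way to judge whether the extra cores and memory actually shorten the scan.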
2. Filter proprietary and text file types with paths and paths-ignore
Do we need to scan everything in the repository? Maybe not, if it stores other artefacts besides code.
Configuring the paths and paths-ignore options can reduce the effort and speed things up a little.
For example, we don’t need to scan thumbnail images and documentation files, so my CodeQL configuration file will have:
    paths:
      - src/com.myfabulouspackage.java
    paths-ignore:
      - src/media/thumbnails/*.jpg
      - doc/*.md
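To take effect, a configuration file like this has to be referenced from the workflow. A minimal sketch, assuming the file is saved as .github/codeql/codeql-config.yml (a location picked for illustration) and that the language is Java:

```yaml
steps:
  - uses: actions/checkout@v3
  - uses: github/codeql-action/init@v2
    with:
      languages: java
      # Assumed location of the CodeQL configuration file
      config-file: ./.github/codeql/codeql-config.yml
  - uses: github/codeql-action/autobuild@v2
  - uses: github/codeql-action/analyze@v2
```

As far as I understand, for compiled languages what gets analysed is largely determined by what the build compiles, so these filters tend to have the most effect on interpreted languages.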
3. Narrow the scanning to a specific folder with the on trigger
Some repositories migrated from legacy systems may have a hierarchical folder structure, where the modules are identifiable in their own self-contained folders.
To save compute minutes and time, the on:paths trigger, together with paths / paths-ignore in the CodeQL configuration file, can narrow the monitoring and scanning to a specific folder.
The workflow will look like:
    on:
      pull_request:
        branches: [ 'main' ]
        # paths takes a list of glob patterns; /** matches every file under the folder
        paths: [ 'src/modules/myawesomemodule/**' ]
And the CodeQL configuration file will have:
    paths:
      - src/modules/myawesomemodule
4. Use a self-hosted runner or container with dependencies preloaded
Code scanning for compiled languages may need additional steps to download dependencies, build, and compile before any analysis can take place.
We can shorten this repetitive effort with a self-hosted runner that has the dependencies pre-installed:
    jobs:
      analyze:
        name: Analyze
        runs-on: [ self-hosted, codeql, java ]
Or if we have a container registry, another way is to run the code scanning job in a container also with pre-installed dependencies:
    jobs:
      analyze-on-container:
        container:
          image: ghcr.io/owner/myPrecannedImage
          credentials:
            username: USERNAME
            password: PASSWORD
        ...
        steps:
          - uses: actions/checkout@v3
          - uses: github/codeql-action/init@v2
          - uses: github/codeql-action/autobuild@v2
          - uses: github/codeql-action/analyze@v2
The disadvantage of this strategy is that we must maintain the updates and versioning of the runner image, i.e. the chore of keeping the ‘golden images’ up to date.
5. Reduce scanning frequency
It is a business decision to balance between building the right shift-left culture and paying the price for it. Options are available to scan at every push, every PR, nightly build, etc.
Ideally, we should scan code at every push to the remote branch to ensure no bad code is shared with others. This may become pricey, and lead to a long queue to merge, if it is a busy remote branch.
A common practice is to scan code on a schedule with the schedule:cron trigger. This is good for saving cost, but developers do not get immediate feedback on the quality of their commits.
I would combine code scanning on push/pull request to the remote branch with scanning on the local branch.
To scan on developers' local machines (and take advantage of i7/i9/Xeon, 7000s/9000s, or M1/M2), do the initial CodeQL CLI setup; developers can then scan code locally before committing:
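    # Build the CodeQL database from the source in the current directory
    codeql database create --language=python <output-folder>/python-database
    # Analyse that database; the database path and --format are required
    codeql database analyze <output-folder>/python-database --format=sarif-latest --output=<output-folder>/python-result.sarif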
And when the code is ready to merge into the trunk, scan the PRs, the selected branches, or on a schedule:
    on:
      pull_request:
        branches: [ 'main' ]
      schedule:
        - cron: '34 20 * * 6'
Strategies
1. For monolith - right-sizing the runners
A monolith is an application kept in a single repository. It usually holds all the code for a self-contained application service, independent from other components. Over time, these repositories tend to grow large in lines-of-code.
There are general CI/CD capacity planning factors to consider. For code scanning a monolith repository, the key consideration is right-sizing the runners to the volume of lines-of-code. GitHub's documentation on recommended hardware resources for running CodeQL has a table of suggested CPU and memory against LOC.
2. For monorepo - scale-out with strategy:matrix
A monorepo contains all the code for multiple projects in a single repository. It is often recommended as a VCS practice for its benefits in visibility, collaboration, and speed. Git is a DVCS designed for polyrepo, where one repository maps to one project. GitHub repositories are built on Git and likewise designed for polyrepo, and hence may run into challenges when code scanning a monorepo.
A typical monorepo likely houses multiple languages and a few dozen project folders, similar to this:
    .
    ├── util
    │   ├── tools
    │   │   ├── file1.txt
    │   │   └── file2.txt
    │   └── file3.txt
    ├── javascript
    │   ├── project1
    │   └── project2
    ├── python
    │   ├── app1
    │   │   ├── project3
    │   │   └── project4
    │   ├── app2
    │   │   ├── project5
    │   │   └── project6
    │   └── file19.txt
    └── java
        └── modules
            ├── module1
            │   ├── submodule1
            │   └── submodule2
            └── module2
The workflow strategy:matrix option allows breaking down a single sequential job into parallel jobs, by language, subfolder, or any other self-contained unit.
For example, the workflow below has a single job scanning each language sequentially:
    steps:
      ...
      - name: Initialize CodeQL
        uses: github/codeql-action/init@v2
        with:
          languages: javascript, java, python
      ...
      - name: Perform CodeQL Analysis
        uses: github/codeql-action/analyze@v2
Set the strategy:matrix:language option to fire up a new job for each language to run in parallel:
    strategy:
      matrix:
        language: [ 'javascript', 'java', 'python' ]
    steps:
      ...
      - name: Initialize CodeQL
        uses: github/codeql-action/init@v2
        with:
          # Each parallel job initialises CodeQL for its own language
          languages: ${{ matrix.language }}
      ...
      - name: Perform CodeQL Analysis
        uses: github/codeql-action/analyze@v2
        with:
          # A distinct category keeps each job's results separate
          category: '/language:${{ matrix.language }}'
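The same mechanism can split the work by subfolder rather than by language. The sketch below is only an illustration under a few assumptions: the project folders come from the example tree above, the codeql-action version in use supports the inline config input on the init step, and the category naming scheme is one I made up to keep each job's results distinct.

```yaml
jobs:
  analyze:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write
    strategy:
      fail-fast: false
      matrix:
        # One parallel job per language/project pair (folders taken from the example tree)
        include:
          - language: javascript
            project: javascript/project1
          - language: javascript
            project: javascript/project2
          - language: python
            project: python/app1
          - language: python
            project: python/app2
    steps:
      - uses: actions/checkout@v3
      - uses: github/codeql-action/init@v2
        with:
          languages: ${{ matrix.language }}
          # Assumed: inline configuration restricting the scan to this job's folder
          config: |
            paths:
              - ${{ matrix.project }}
      - uses: github/codeql-action/analyze@v2
        with:
          # Hypothetical naming scheme; it only needs to be unique per job
          category: '/language:${{ matrix.language }}/project:${{ matrix.project }}'
```

I used interpreted languages here on purpose, since path filtering is most effective for them.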
3. Gain velocity with dynamic parallel scanning
Some organisations may have DevSecOps guiding principles that require a quality check at every push or PR. That means the quality checks need to be kept to a minimum; otherwise they will impact developers' velocity.
A parallel scanning implementation leverages git diff, paths in the CodeQL configuration, and the strategy:matrix option to laser-focus the scanning on the project folder(s) containing the deltas. It also analyses with only the query pack needed for the deltas. This makes code scanning very efficient in terms of velocity, particularly for a large monorepo.
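One possible shape for this, as a sketch rather than a reference implementation: a first job turns the git diff of the PR into a JSON matrix, and a second job fans out over it. The assumptions are mine: projects live two levels deep (language/project, as in the example tree), only interpreted languages are covered, jq is available on the runner, and the codeql-action version supports the inline config input. The workflow name and the job and output names are hypothetical.

```yaml
name: codeql-changed-projects   # hypothetical workflow name
on:
  pull_request:
    branches: [ 'main' ]

jobs:
  detect:
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.diff.outputs.matrix }}
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0   # full history, so the base branch is available to diff against
      - id: diff
        env:
          BASE_REF: ${{ github.base_ref }}
        run: |
          # List files changed by the PR, keep those under <language>/<project>/,
          # and emit one {language, project} object per distinct project folder.
          matrix=$(git diff --name-only "origin/${BASE_REF}...HEAD" \
            | { grep -E '^(javascript|python)/[^/]+/' || true; } \
            | jq -R 'split("/") | {language: .[0], project: (.[0:2] | join("/"))}' \
            | jq -sc 'unique')
          echo "matrix=${matrix}" >> "$GITHUB_OUTPUT"

  analyze:
    needs: detect
    if: needs.detect.outputs.matrix != '[]'   # skip when no project folder changed
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write
    strategy:
      fail-fast: false
      matrix:
        include: ${{ fromJSON(needs.detect.outputs.matrix) }}
    steps:
      - uses: actions/checkout@v3
      - uses: github/codeql-action/init@v2
        with:
          languages: ${{ matrix.language }}
          # Assumed: inline configuration limiting the scan to the changed folder
          config: |
            paths:
              - ${{ matrix.project }}
      - uses: github/codeql-action/analyze@v2
        with:
          category: '/language:${{ matrix.language }}/project:${{ matrix.project }}'
```

Before adopting something like this, it is worth checking how per-project categories interact with alert tracking on the default branch, since a PR analysis is compared against a baseline with the same category.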
4. For large repos hitting the “more threadflow steps per result than allowed” issue
Some repositories may have been re-homed over the years and/or through M&As, inheriting multiple layers of legacy (and growing very large) along the way. So, when running CodeQL code scanning with codeql-action, the “more threadflow steps per result than allowed” error may occur.
According to the documentation, this is related to the SARIF threadflow limit of 10,000. There are a few workarounds floating around; all start by re-running the workflow with Enable debug logging on, then either:
- Configuring paths-ignore to skip problematic files and/or folders; or
- Adding a step with advancedsecurity/filter-sarif to skip problematic files and/or folders, QL query patterns, and languages
I do not recommend this exclusion approach, as it requires manual intervention and maintenance and hence cannot scale. Excluding files and queries also reduces the comprehensiveness and completeness of the code scanning.
Instead, I recommend tuning the --max-paths value to reduce the threadflow steps per result. This option caps the number of paths reported for each alert that has paths, so it controls how much supporting detail comes back with each finding, a bit like limiting how much of a stack trace is generated. The default value is 4; setting it to 3, 2 or even 0 provides less debugging detail, but that is better than dismissing a potential vulnerability.
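With codeql-action there is no dedicated input for this flag that I know of. The sketch below assumes that the analyze step passes extra CLI options through the CODEQL_ACTION_EXTRA_OPTIONS environment variable, which is how I have seen this workaround described; treat the variable name and its JSON layout as assumptions to verify against the codeql-action documentation for the version in use. The value 1 is only illustrative.

```yaml
- name: Perform CodeQL Analysis
  uses: github/codeql-action/analyze@v2
  env:
    # Assumption: extra options for result interpretation are passed as JSON here.
    # 1 is an illustrative value, lower than the default of 4.
    CODEQL_ACTION_EXTRA_OPTIONS: '{"database":{"interpret-results":["--max-paths", 1]}}'
```

When running the CodeQL CLI directly, the equivalent is to add --max-paths=<n> to the codeql database analyze command.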
Here is a side story on why I prefer less detail over the exclusion approach. I used to head up solution delivery and operations for business-critical applications, and one of the improvement initiatives I introduced was unit testing and test coverage as developer KPIs to assure quality. The shift-left thinking seemed well accepted, but I later found out that one of the developers had sprinkled this kind of line across his work:
    Assert.IsTrue(true);
It was a first-hand experience of how “metrics drive behaviours”. In this case, it was undesirable behaviour.
On paper this developer may have met his KPI, but in reality he put the whole business at risk.
Applying this to CodeQL: on paper we may have code scanning working for every PR check, but in reality we have left the backdoor open.