Efficient SAST with CodeQL

For teams practising DevSecOps, SAST with CodeQL can be very valuable in preventing potential losses for the organisation. For organisations with a strong security culture, the benefits exceed the initial and running costs.

However, cost optimisation is a continuous FinOps improvement exercise, and it is important to understand that inefficient SAST can be expensive:

  • Expensive in terms of velocity - scanning can take hours to complete. Imagine pushing a commit or opening a PR, grabbing a coffee, having a watercooler conversation, and coming back to find it is still scanning.
  • Expensive in terms of compute - the entitled compute minutes get consumed quickly; left unmanaged, this means bill shock at the end of the month.
  • Expensive in terms of feedback - the cost of delay is proportional to the cost of remediation. The longer it takes to get feedback, the more expensive it is to fix the problem.

Below are some levers I collated to make CodeQL code scanning more efficient.

Levers

1. Runners - GitHub-hosted or self-hosted

CodeQL code scanning executes on runners, which are just infrastructure compute. Collecting data points such as lines-of-code, or other metrics with workflow-telemetry-action, helps to decide whether we need to scale out or scale up.

Scaling up means using a larger compute instance with more CPU cores and memory.

Scaling out means spreading the work across more runners in parallel, and it can be combined with the other levers listed below.
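
One way to collect those data points is to add the telemetry action as the first step of the scanning job, so that it samples CPU and memory for the whole run. This is only a sketch: the action owner and version tag (catchpoint/workflow-telemetry-action@v2) and the java language are assumptions, so check the action's README for the current reference.

```yaml
jobs:
  analyze:
    runs-on: ubuntu-latest
    permissions:
      security-events: write   # needed to upload code scanning results
      contents: read
    steps:
      # Assumption: owner and version tag of the telemetry action.
      # It goes first so that it can observe every later step.
      - name: Collect workflow telemetry
        uses: catchpoint/workflow-telemetry-action@v2
      - uses: actions/checkout@v3
      - uses: github/codeql-action/init@v2
        with:
          languages: java      # illustrative language
      - uses: github/codeql-action/autobuild@v2
      - uses: github/codeql-action/analyze@v2
```

If the charts show CPU or memory saturated for most of the run, that points to scaling up; if they show long sequential phases per language or project, that points to scaling out.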

2. Filter proprietary and text file types with paths and paths-ignore

Do we need to scan everything? Maybe not, if the repository stores other artefacts besides code.

Configuring the paths and paths-ignore options can relieve the effort and speed things up a little.

For example, we don’t need to scan thumbnail images and documentation files, so my CodeQL configuration file will have:

paths:
  - src/com.myfabulouspackage.java
paths-ignore:
  - src/media/thumbnails/*.jpg
  - doc/*.md
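
For the workflow to pick up a configuration file like the one above, the init step needs to point at it through the config-file input. A minimal sketch, assuming the file is committed at .github/codeql/codeql-config.yml (a hypothetical location) and the language is java:

```yaml
jobs:
  analyze:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Initialize CodeQL
        uses: github/codeql-action/init@v2
        with:
          languages: java   # illustrative language
          # Hypothetical path; use wherever the configuration file is committed
          config-file: ./.github/codeql/codeql-config.yml
      - uses: github/codeql-action/autobuild@v2
      - uses: github/codeql-action/analyze@v2
```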

3. Narrow the scanning to a specific folder with the on trigger

Some repositories migrated from legacy systems may have a hierarchical folder structure, where the modules are identifiable in their own self-contained folders.

To save compute minutes and time, the on:paths trigger combined with paths / paths-ignore in the CodeQL configuration can narrow both the monitoring and the scanning to a specific folder.

The workflow will look like:

on:
  pull_request:
    branches: [ 'main' ]
    # paths takes a list of glob patterns; /** matches every file under the folder
    paths:
      - 'src/modules/myawesomemodule/**'

And the CodeQL configuration file will have:

paths:
  - src/modules/myawesomemodule

4. Use a self-hosted runner or container with dependencies preloaded

Code scanning for compiled languages may need additional steps for downloading dependencies, building, and compiling before any analysis can take place.

We can shorten this repetitive effort in a self-hosted runner with pre-installed dependencies:

jobs:
  analyze:
    name: Analyze
    runs-on: [ self-hosted, codeql, java ]

Or, if we have a container registry, we can run the code scanning job in a container that also has the dependencies pre-installed:

jobs:
  analyze-on-container:
    container:
      # Registry image names must be lowercase
      image: ghcr.io/owner/myprecannedimage
      credentials:
        # Placeholders; in practice pass these in from secrets
        username: USERNAME
        password: PASSWORD
    ...
    steps:
    - uses: actions/checkout@v3
    - uses: github/codeql-action/init@v2
    - uses: github/codeql-action/autobuild@v2
    - uses: github/codeql-action/analyze@v2

The disadvantage of this strategy is that we must maintain the updates and versioning of the runner image, i.e. the chore of keeping the ‘golden images’ up to date.

5. Reduce scanning frequency

It is a business decision to balance between building the right shift-left culture and paying the price for it. Options are available to scan at every push, every PR, nightly build, etc.

Ideally, we should scan code at every push to a remote branch to ensure no bad code is shared with others. On a busy remote branch, this may become pricey and create a long queue to merge.

There is a common practice to scan code on a schedule with the schedule:cron trigger. This is good for saving cost, but the developers cannot get immediate feedback on the quality of their code commits.

I use a combination of both: scanning locally before commit, and code scanning on push or pull request to the remote branch.

To scan on developers' local machines (and take advantage of i7/i9/Xeon, 7000s/9000s, or M1/M2), do the initial CodeQL CLI setup, after which developers can scan code locally before they commit:

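# Create a CodeQL database from the source in the current directory
codeql database create --language=python <output-folder>/python-database

# Analyze that database and write the results as SARIF
codeql database analyze <output-folder>/python-database --format=sarif-latest --output=<output-folder>/python-result.sarif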

And when the code is ready to merge into the trunk, scan the PRs, the selected branches, or on a schedule:

on:
  pull_request:
    branches: [ 'main' ]
  schedule:
    - cron: '34 20 * * 6'

Strategies

1. For monolith - right-sizing the runners

A monolith is an application kept in a single repository. It usually holds all the code for a self-contained application service, independent from other components. Over time, these repositories tend to grow large in lines-of-code.

There are general CI/CD capacity planning factors to consider. For code scanning a monolith repository, the key consideration is right-sizing the runners against the volume of lines-of-code. This table has recommended CPU and memory against LOCs.
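
Once a runner size has been chosen, the analyze step can be told how much of it to use through its ram and threads inputs. The sketch below is illustrative only: the runner label is a hypothetical larger-runner name, and the ram and threads values are placeholders rather than recommendations.

```yaml
jobs:
  analyze:
    # Hypothetical label; use the name of a larger runner configured
    # for your organisation, or a self-hosted runner label
    runs-on: ubuntu-latest-8-cores
    steps:
      - uses: actions/checkout@v3
      - uses: github/codeql-action/init@v2
        with:
          languages: java   # illustrative language
      - uses: github/codeql-action/autobuild@v2
      - uses: github/codeql-action/analyze@v2
        with:
          ram: 28000        # memory for CodeQL in MB (placeholder value)
          threads: 8        # threads for running queries (placeholder value)
```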

2. For monorepo - scale-out with strategy:matrix

A monorepo contains the code for multiple projects in a single repository. It is often recommended as a version control practice for the benefits of visibility, collaboration, and speed. Git, however, is a DVCS designed for polyrepo, where one repository maps to one project. GitHub repositories are built on Git and are likewise designed for polyrepo, hence we may run into challenges when code scanning a monorepo.

A typical monorepo likely houses multiple languages and a few dozen project folders, similar to this:

.
├── util
│   ├── tools
│   │   ├── file1.txt
│   │   └── file2.txt
│   └── file3.txt
├── javascript
│   ├── project1
│   └── project2
├── python
│   ├── app1
│   │   ├── project3
│   │   └── project4
│   ├── app2
│   │   ├── project5
│   │   └── project6
│   └── file19.txt
└── java
    └── modules
        ├── module1
        │   ├── submodule1
        │   └── submodule2
        └── module2

The workflow strategy:matrix option allows breaking a single sequential job into parallel jobs, by language, subfolder, or any other self-contained unit.

For example, the workflow below has a single job scanning each language sequentially:

    steps:
      ...
    - name: Initialize CodeQL
      uses: github/codeql-action/init@v2
      with:
        languages: javascript, java, python
      ...
    - name: Perform CodeQL Analysis
      uses: github/codeql-action/analyze@v2

Set the strategy:matrix:language option to fire up a new job for each language to run in parallel:

    strategy:
      matrix:
        language: [ 'javascript', 'java', 'python' ]

    steps:
      ...
    - name: Initialize CodeQL
      uses: github/codeql-action/init@v2
      with:
        languages: ${{ matrix.language }}
      ...
    - name: Perform CodeQL Analysis
      uses: github/codeql-action/analyze@v2
      with:
        category: '/language:${{ matrix.language }}'

3. Gain velocity with dynamic parallel scanning

Some organisations may have DevSecOps guiding principles that require a quality check at every push or PR. That means the quality checks need to be kept to a minimum, otherwise they will impact developer velocity.

A parallel scanning implementation leverages git diff, the paths option in the CodeQL configuration, and the strategy:matrix option to laser-focus the scanning on the project folder(s) that contain the deltas. It also analyzes only with the query packs needed for those deltas. This makes code scanning very efficient in terms of velocity, particularly for a large monorepo.
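
The idea can be sketched as a two-job workflow: the first job turns the git diff into a JSON matrix of changed project folders, and the second job fans out one scan per folder. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes the first path segment is the language and the first two segments identify a project (as in the sample tree above), that jq is available on the runner, and that the init action version in use supports the inline config input. The job and output names are hypothetical, and query pack selection is left out.

```yaml
on:
  pull_request:
    branches: [ 'main' ]

jobs:
  changes:
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.diff.outputs.matrix }}
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 0   # full history, so the base branch can be diffed
      - id: diff
        env:
          BASE_REF: ${{ github.base_ref }}
        run: |
          # Changed files -> "<language>/<project>" folders -> JSON array of
          # {language, path} objects. Folders outside the three language
          # roots (e.g. util) are skipped.
          matrix=$(git diff --name-only "origin/$BASE_REF...HEAD" \
            | awk -F/ 'NF > 2 && $1 ~ /^(javascript|python|java)$/ { print $1 "/" $2 }' \
            | sort -u \
            | jq -R '{language: split("/")[0], path: .}' \
            | jq -sc .)
          echo "matrix=$matrix" >> "$GITHUB_OUTPUT"

  analyze:
    needs: changes
    # Skip the scan entirely when no project folder changed
    if: needs.changes.outputs.matrix != '[]'
    runs-on: ubuntu-latest
    permissions:
      security-events: write
      contents: read
    strategy:
      fail-fast: false
      matrix:
        include: ${{ fromJSON(needs.changes.outputs.matrix) }}
    steps:
      - uses: actions/checkout@v3
      - uses: github/codeql-action/init@v2
        with:
          languages: ${{ matrix.language }}
          # Inline configuration restricting the scan to the changed folder
          config: |
            paths:
              - ${{ matrix.path }}
      - uses: github/codeql-action/autobuild@v2
      - uses: github/codeql-action/analyze@v2
        with:
          # A distinct category per job keeps the uploaded results separate
          category: '/language:${{ matrix.language }}/path:${{ matrix.path }}'
```

Passing the matrix as {language, path} objects through include avoids string splitting in workflow expressions, which the expression syntax does not offer.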

4. For large repos hitting the "more threadflow steps per result than allowed" issue

Some repositories may have been re-homed over the years and/or through M&As, inheriting multiple layers of legacy code along the way (and growing very big). So, when running CodeQL code scanning with codeql-action, a "more threadflow steps per result than allowed" error may occur.

According to the documentation, this is related to the SARIF threadflow limit of 10,000. There are a few workarounds floating around. All of them start by re-running the workflow with Enable debug logging on, and then either exclude the files or exclude the queries that trigger the error.

I do not recommend this exclusion approach, as it requires manual intervention and maintenance and hence cannot scale. Excluding files and queries would also impact the comprehensiveness and completeness of the code scanning.

Instead, I recommend trying to optimise the --max-paths value to reduce the threadflow steps per result. This option caps the number of paths reported for each alert, a bit like limiting how many levels of stack trace to generate. The default value is 4, and setting it to 3, 2 or even 0 provides fewer debugging details, but that is better than dismissing a potential vulnerability.
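
With codeql-action there is no dedicated input for this option, so one way to pass it through is the CODEQL_ACTION_EXTRA_OPTIONS environment variable on the analyze step. Treat this as a sketch: the JSON shape below (extra arguments for the interpret-results stage) and the value 1 are assumptions to verify against the current codeql-action documentation.

```yaml
    steps:
      - uses: actions/checkout@v3
      - uses: github/codeql-action/init@v2
        with:
          languages: java   # illustrative language
      - uses: github/codeql-action/autobuild@v2
      - name: Perform CodeQL Analysis
        uses: github/codeql-action/analyze@v2
        env:
          # Assumption: extra CLI arguments are accepted in this JSON shape.
          # Lowers --max-paths from the default of 4 to 1.
          CODEQL_ACTION_EXTRA_OPTIONS: '{"database": {"interpret-results": ["--max-paths", 1]}}'
```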

Here is a side story on why I prefer fewer details over the exclusion approach. I used to head up solution delivery and operations for business-critical applications. One of the improvement initiatives I introduced was unit testing and test coverage as developer KPIs to assure quality. The shift-left thinking seemed well accepted, but later I found out that one of the developers had sprinkled this kind of line across his work:

Assert.IsTrue(true);

It was a first-hand experience of how “metrics drive behaviours”. In this case, it was undesirable behaviour.

On paper this developer may have met his KPI, but in reality he put the whole business at risk.

Applying this to CodeQL: on paper we may have code scanning working for every PR check, but in reality we have left the backdoor open.

This post is licensed under CC BY 4.0 by the author.