
How much code do I have? A DevSecOps story


by Darwin Sanoy
Blog | Twitter | LinkedIn | GitLab

The age of DevSecOps is upon us

As a GitLab Solutions Architect, I work with many customers to consider their forward-looking plans for DevSecOps. Everyone is rightfully concerned about code security as new threats and vulnerabilities are being constantly discovered. Unfortunately, the bad actors in computing have a very agile release cycle.

I see that organizations of all types are trying to improve their DevSecOps game. Since DevSecOps is all about the parts of an App Sec program that can be automated, a big part of the shift is assessing what tooling choices are most relevant and cost-effective. 

In many cases, this means driving the usage of scanning tools from early adopters to the point where all code is scanned. 

I encourage customers to undertake a DevSecOps Maturity and TCO plan, which forecasts their total costs for their vision of Mature DevSecOps. For a variety of reasons, planning both maturity and TCO at the same time is critical. Surprisingly, many organizations haven't planned what "Mature DevSecOps" means to them; they have grabbed a few tools and started to play around with some scanning. An unplanned, organic approach to DevSecOps can lead to very uneven and escalating total cost of ownership, as well as a lack of vision for knowing when the entire operation is defensible as "mature" for the company and the code bases at hand.

Forecasting security scanning costs for mature DevSecOps

One complication in this assessment is that licensed security scanning tools come with a vast variety of licensing models. Many popular scanning tools charge by the volume of code scanned.

Here are some popular volume-based models:

  • Lines of code scanned
  • Megabytes of code scanned
  • Number of applications scanned

Some may charge for an accumulation of every scan, even over identical code, while others charge by "unique lines/MBs of code under management," so repeat scanning of the same code doesn't accrue a cost.
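The gap between these models compounds quickly. Here's a minimal sketch of the arithmetic, with entirely made-up volumes and a hypothetical price of $0.50 per 1,000 lines:

```shell
# Hypothetical numbers for illustration only -- real pricing varies widely.
UNIQUE_LOC=200000      # unique lines of code under management
SCANS_PER_MONTH=30     # how often the same code is re-scanned
PRICE_PER_KLOC=0.50    # assumed $ per 1,000 lines

# Model 1: every scan over identical code accrues a cost
per_scan=$(awk -v l="$UNIQUE_LOC" -v s="$SCANS_PER_MONTH" -v p="$PRICE_PER_KLOC" \
  'BEGIN { printf "%.2f", l / 1000 * p * s }')

# Model 2: only unique lines under management are billed
unique=$(awk -v l="$UNIQUE_LOC" -v p="$PRICE_PER_KLOC" \
  'BEGIN { printf "%.2f", l / 1000 * p }')

echo "per-scan billing:   \$$per_scan/month"
echo "unique-LOC billing: \$$unique/month"
```

With these assumed numbers, per-scan billing is 30 times the unique-lines price, which is why knowing your real scan frequency and code volume matters before signing anything.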

There are several key inflection points at which an organization may need to forecast this cost:

  • to assess the true cost of ramping up their DevSecOps with volume-based tools
  • trying to budget for DevSecOps given the growth in their code base
  • when comparing the total cost of ownership of various scanning tooling choices
  • when assessing the true costs or savings for consolidating security tooling
  • when considering how often to scan code that is in a “security fix only” status
  • when large code bases are inherited — such as during corporate acquisitions

Notice that the list above implies that counting your code is likely to be an ongoing need.

In all of these cases, a core challenge is obtaining the code base metrics the scanning tools use to determine what is owed, without having to configure each tool across all of your repositories.

If you're involved in a project to compare scanning tools, it's even more important to have these stats, and to obtain them without incurring the cost of setting up each tool under consideration.

How hard can counting be?

The vast majority of us were taught to count before we can remember. So how hard could it possibly be to add up a few lines or megabytes of source code with a quick script?

If all you have in your repositories is source code (no binaries and no non-code files) and if you’re only ever going to scan the default branch of your code, it can be fairly simple.

However, if you need to eliminate binaries or any other file type (e.g., markdown) it can become challenging. In addition, if you don’t intend to scan all content in a monorepo, it can be difficult to manage.

As a first step, we'll use Git commands to eliminate binaries and other non-code files. You'll use an extension exclusion list to indicate which files are not code files.

After that, we’ll take a look at using a utility specialized in counting only the code lines, including eliminating whitespace and comments.

No matter what you're using to count the code, the Git command optimizations not only scope the checked-out files to code files but also make scanning large repositories, and large numbers of repositories, much faster and cheaper.

Optimizing counting “things with blue” through exclusion

Imagine you’re moving from your city apartment to a new home. Your household goods are packed in a shipping container and ready to go. Just then you get a notice from the government of your new home that informs you that new residents must pay a tax on some of their items. This place is a little odd and they simply want you to pay 5 pennies per pound on any item that has the color blue anywhere on it. (No, the U.S. state of New Jersey doesn’t have this requirement, but that’s a reasonable guess.)

For the process of weighing, you’ll need to transfer boxes back to your apartment, open them, find the blue items, weigh them, and track the total weight.

Fortunately, you’ve been able to borrow a set of high-speed robots from work, and you can give them specific instructions about selecting and handling the items. Unfortunately, they aren’t equipped with scales, so you’ll have to do the weighing and counting.

There are two ways you could have the robots optimize your task:

  1. Ask them to bring the boxes to the apartment and only unpack things containing blue so they can be weighed.
  2. Ask them to find the blue things in the boxes while they are in the truck and only bring those items to the apartment.

It turns out we have both of these options with some newer Git capabilities.

Two times we take a size and time hit

The optimizations we'll discuss save both disk space and time. If you plan to run this on many repositories, both can significantly impact the process. Disk space may run out, or require fussy management of how many repositories you handle at a time. If you're running this on a paid CI system, run time length equates to actual dollars.

When working with Git repositories, there are two potential times you may take a size and time hit based on binary content or other unneeded files:

  1. When receiving Git history data from the remote (git clone or fetch), by default, all files for all branches for all of the repository history are transferred.
  2. When unpacking the Git files (git checkout) into the working directory, all files for the target branch are extracted.

Some of the following Git optimizing commands focus on not copying the data from the server at all and others focus on not extracting data assumed to be in the transferred data.
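You can see the two cost points separately on any clone by measuring the .git folder (the transfer cost) against the rest of the tree (the checkout cost). A runnable sketch against a throwaway local repository (the /tmp path and file contents are just for the demo):

```shell
# Build a throwaway repo so the measurement commands can run anywhere.
rm -rf /tmp/sizedemo && mkdir -p /tmp/sizedemo && cd /tmp/sizedemo
git init -q .
echo "puts 'hello'" > app.rb
git add . && git -c user.email=a@b.c -c user.name=demo commit -qm init

# Cost point 1: history received from the remote lives under .git
du -sh .git
# Cost point 2: the total includes files extracted by checkout;
# working-tree size is roughly this total minus the .git figure
du -sh .
```

On a real repository, comparing these two numbers before and after the optimizations below shows you exactly where the savings come from.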

In the examples below, we’ll assume that this clone is explicitly for the purpose of counting and will be discarded afterward. In order for it to also be usable for build activities, additional Git commands would be necessary to reconfigure the clone into a more usable state.

Having Git do the heavy lifting

Git will be doing the selecting of history to transfer during cloning and the selection of what to checkout. The result should be just our current code on our default branch. This makes a total size estimate or size estimates by file extension much easier because we only have the files that we want to count.

Optimization test

For our example code, we'll use the repository www-gitlab-com, since it contains many binaries in the form of web graphics files, has significant Git history, and is publicly available. Picking something large makes it easier to tell by file size alone whether the commands are being done correctly.

Here’s how you can tell if your commands are executing correctly:

                                             Regular size    Optimized size
www-gitlab-com .git folder                   4.2 GB          111 MB
git clone time (no checkout)                 21 minutes      46 seconds
www-gitlab-com files (other than .git)       2.7 GB          260 MB
git checkout time                            27 seconds      19 seconds
Checked-out objects                          155464          467
Lines of code counted                        207925          206658*

*Note: sparse-checkout reduces the Ruby file count and therefore the lines count. I was not able to discover why before the publishing due date.

Partial Clone to the partial rescue

Each time a Git clone is requested from a remote Git service, the Git service prepares a pack file to send to the client. Partial Cloning is a new feature worked on by many contributing companies and individuals that takes advantage of this preparation step by passing along instructions to exclude specific objects.

Objects initially excluded by a Partial Clone will be automatically pulled from the remote if a subsequent local Git command requires them. So if you do a partial clone that prevents binaries from coming down but then subsequently checkout a branch that contains a binary, the binary will be pulled down during the checkout. The exclusion filter specification isn’t enforced on all subsequent local Git commands. This is an automatic behavior intended to prevent partial clones from breaking current workflows.

The following partial clone filter excludes all file contents (blobs) from the transferred .git history:

git clone --filter=blob:none --no-checkout https://gitlab.com/gitlab-com/www-gitlab-com.git

For partial clone to work, you must have Git 2.22 or later.
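You can confirm a clone really is partial by checking the promisor settings Git records in the clone's config. Here's a self-contained sketch; the file:// remote, /tmp paths, and the uploadpack.allowFilter setting exist only to make the demo runnable without a network:

```shell
# Build a tiny "server" repo and allow it to honor clone filters.
rm -rf /tmp/pc-src /tmp/pc-dst
git init -q /tmp/pc-src && cd /tmp/pc-src
echo hi > f.txt
git add . && git -c user.email=a@b.c -c user.name=demo commit -qm c1
git config uploadpack.allowFilter true

# Partial clone over the file:// transport, no checkout.
git clone -q --filter=blob:none --no-checkout file:///tmp/pc-src /tmp/pc-dst

# A partial clone records its promisor remote and filter spec:
git -C /tmp/pc-dst config remote.origin.promisor
git -C /tmp/pc-dst config remote.origin.partialclonefilter
```

If the second command prints blob:none, your clone is filtered; if the settings are missing, the server ignored the filter and you received a full clone.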

Shallow Clone to optimize more

For code counting, we don't need the full Git history, so we can use a depth of 1 on our clone or fetch to reduce the size of the .git folder even further for the text files we'll be bringing down. For this, we'll add `--depth 1` to our Git command:

git clone --depth 1 --filter=blob:none --no-checkout https://gitlab.com/gitlab-com/www-gitlab-com.git

Sparse Checkout to the rest of the rescue

Git’s Sparse Checkout feature is how we prevent the unwanted files we removed from history from being dynamically pulled from the origin during checkout.

Sparse Checkout allows the use of a file configured the same as .gitignore and it filters what will and will not be checked out when the checkout command is executed.

We’ll need to enable Sparse Checkout and then configure the sparse-checkout file with file specifications of non-code files. You may want to hone a master list of non-code / binary files that you reuse for the sparse-checkout file across all of your repositories. There’s also a clever way to store this file in .git and use it for both partial clone and sparse checkout. For that method, read through the following documentation and scan for the file name “.gitfilterspec”: GitLab Partial Clone.

This code sets us up and then checks out master.

git sparse-checkout init
echo -e '*.* \n!**/*.idx \n!**/*.mp4 \n!**/*.gif \n!**/*.pdf \n!**/*.png \n!**/*.jpg \n!**/*.jpeg \n!**/*.eps \n!**/*.md \n' > .git/info/sparse-checkout
git checkout master

For sparse checkout to work reasonably well, you must have at least Git 2.25.
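Here's a self-contained sketch of the exclusion behavior in a throwaway repo. It uses the older core.sparseCheckout config plus git read-tree (which works on pre-2.25 Git as well), with a fake .png standing in for a binary:

```shell
# Throwaway repo with one "code" file and one "binary" file.
rm -rf /tmp/sc-demo && git init -q /tmp/sc-demo && cd /tmp/sc-demo
printf "puts 1\n" > app.rb
printf "not-really-a-png" > logo.png
git add . && git -c user.email=a@b.c -c user.name=demo commit -qm c1

# Enable sparse checkout, exclude *.png, and re-apply to the work tree.
git config core.sparseCheckout true
printf '*.*\n!**/*.png\n' > .git/info/sparse-checkout
git read-tree -mu HEAD

ls   # app.rb remains; logo.png is no longer checked out
```

The pattern file uses .gitignore syntax: the first line includes everything, and each ! line subtracts an extension, just like the longer exclusion list used against www-gitlab-com.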

Counting lines of code

Counting lines of code is challenging for many reasons, most notably that comments and blank lines should not count as lines of code. In addition, it's necessary to know which languages you have, since your security tooling will need coverage for all of your languages.
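As a toy illustration of why comments and blanks matter, here is a naive filter over a hypothetical four-line Ruby file. (A sketch only: real counters also handle block comments, strings, and per-language syntax.)

```shell
# A four-line Ruby file: one comment line, one blank line, two code lines.
printf '# comment\n\nputs 1\nputs 2\n' > /tmp/demo.rb

# Count lines that are neither blank nor a '#' line comment.
grep -cv -e '^[[:space:]]*$' -e '^[[:space:]]*#' /tmp/demo.rb   # prints: 2
```

A plain wc -l on the same file reports 4, double the "real" count, and the error only grows on heavily commented code bases.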

For documentation purposes, here’s some generic shell code for counting lines of all the files. However, if you compare it to the output of the below counting utility, you’ll see there’s a large margin of error for this method and it would most likely require specifically targeting file types.

find . -type f -exec wc -l {} \; | awk '{ SUM += $1 } END { print SUM }'

Even for our most basic edition of code counting, it makes sense to use a specialized lines-of-code counting utility.

I did a survey of the open-source tooling available for this task and landed on Ben Boyter’s “sloc cloc code” — or scc for short. It’s actively maintained in a modern coding language (Go). It also allows you to extend the language support if you have custom needs. Here’s an example of the output after processing the test repository:

[scc per-language summary table for Ruby, HTML, and Plain Text omitted; the console capture was garbled]
Estimated Cost to Develop $11,615,312
Estimated Schedule Effort 38.958526 months
Estimated People Required 35.316937
Processed 41026462 bytes, 41.026 megabytes (SI)

Here’s the code to get the above output:

go get -u github.com/boyter/scc/
scc .

scc has many options that you can examine in the open-source project or by using the --help command line option.

Counting MBs of code

Since we have eliminated all non-code files from our file hierarchy, we can get a rough idea of the MBs with this code:

cd www-gitlab-com
du -sh

The above method requires spot-on exclusion criteria to ensure there's absolutely nothing but code files — not an easy task, and likely to take plenty of fussing on a per-repository basis.

Due to this challenge, I put in a request to Ben Boyter’s OSS project to see if scc could be updated to count MBs, and he was able to work it into version 2.13.0! Thanks, Ben! As far as I have researched, you now have the only Lines of Code Counter that does this!

Upon request, Ben also took time to update scc so that it can output multiple formats after a single counting run — so if it wasn’t already fast enough, now we don’t have to call it two or more times to get the output formats we are looking for. This will keep everything super fast if you’re analyzing a large set of repositories all at once.

scc counts bytes by simply adding up the file bytes of each code file it processes, which is a little different from the line count that eliminates blanks and comments. If you look at the above console output capture, you can see it gives a summary of all files. In the data output formats (JSON, HTML, and CSV) it gives a per-language byte count. Our code below uses w3m to dump the HTML file to the console so that we can get the bytes in the console output.

Counting bytes of compiled binaries

Some vulnerability tools scan compiled binaries. The below shell one-liner tallies file size by extension for a list of extensions that you give it. In the example, *.jpg and *.png are used since they work with the www-gitlab-com repo example. Removing the extension references and the enclosing parentheses shows results for all extensions. In the completed script below, awk is used to format the CSV and HTML data outputs. find has a "not" operator (!) that can be placed in front of the -iname parameters to create an exclusion list.

find . -type f \( -iname \*.jpg -o -iname \*.png \) |  egrep -o "\.[a-zA-Z0-9]+$" | sort -u | xargs -I '%' find . -type f -name "*%" -exec du -ch {} + -exec echo % \; | egrep "^\.[a-zA-Z0-9]+$|total$" | uniq | paste - -

Bringing it all together

The below code uses Docker so that you can easily test locally and also port it to a container-based CI system.

Note: scc is so fast (3600% faster than CLOC on this task) that running it multiple times for various output formats is not as expensive as it would seem.

docker run -it golang:1.15rc1-alpine3.12 sh #golang on a distro with a proper package manager
apk update; apk add git w3m #need at least 2.25 of git, w3m to render html in console
go get -u github.com/boyter/scc/
git clone --depth 1 --filter=blob:none --no-checkout https://gitlab.com/gitlab-com/www-gitlab-com.git
cd www-gitlab-com
git config --local core.sparsecheckout true
echo -e '*.* \n!**/*.idx \n!**/*.mp4 \n!**/*.gif \n!**/*.pdf \n!**/*.png \n!**/*.jpg \n!**/*.jpeg \n!**/*.eps \n!**/*.md \n' > .git/info/sparse-checkout
git checkout master

#html for easy viewing as a CI artifact and json for data ingestion
time scc . --not-match .*md --format-multi "html:loc.html,json:loc.json"
w3m -dump loc.html

echo "Total Bytes of jpgs and pngs (emulating counting your binary code files) in csv:"
time find . -type f \( -iname \*.jpg -o -iname \*.png \) |  egrep -o "\.[a-zA-Z0-9]+$" | sort -u | xargs -I '%' find . -type f -name "*%" -exec du -c {} + -exec echo % \; | egrep "^\.[a-zA-Z0-9]+$|total$" | uniq | paste - - | awk 'BEGIN {print "Type,Bytes"}{print $1 "," $2*1024}' > binarysize.csv

echo "Total Bytes of jpgs and pngs (emulating counting your binary code files) in HTML:"
time find . -type f \( -iname \*.jpg -o -iname \*.png \) |  egrep -o "\.[a-zA-Z0-9]+$" | sort -u | xargs -I '%' find . -type f -name "*%" -exec du -c {} + -exec echo % \; | egrep "^\.[a-zA-Z0-9]+$|total$" | uniq | paste - - | awk 'BEGIN {print "<html lang=\"en\"><head><meta charset=\"utf-8\" /><title>binary bytes html output</title><style>table { border-collapse: collapse; }td, th { border: 1px solid #999; padding: 0.5rem; text-align: left;}</style></head><body><table id=\"binarybytes-table\" border=1><thead><tr><th>Language</th><th>Bytes</th></tr></thead><tbody>"}{print "<tr><th>" $1 "</th><th>" $2*1024 "</th></tr>"}END { print "</table></body></html>" }' > binarysize.html
w3m -dump binarysize.html

Final output

Here is an example of the output of the above, including bytes of code.

[scc summary table with per-language byte counts for Ruby, HTML, and Plain Text omitted; the console capture was garbled]

Cross checking results

Since scc has similar language support to the older but venerable CLOC, you can cross check with the older utility like this (assuming you are on the above Golang container):

apk add cloc
time cloc .

Note: While CLOC does a bit of refinement in how it counts code, in my testing on www-gitlab-com it took 24 to 36 times as long and had very similar results.

Live coding to improve this code

In a previous episode of Mission Impossible Live Coding, my colleague Jefferson Jones and I were able to get scc running as a reusable GitLab CI/CD plugin. In a future episode, we'll look at making this even more flexible, by allowing a list of repositories (rather than doing one at a time) and maybe some code to iterate through all projects in a GitLab group hierarchy. You can learn more about the Mission Impossible Live Coding events here.

One final “things with blue” optimization

Let’s go back to your apartment — the one with the blue-sorting robots. While you’re considering how to pull off the counting task, one of the robots has a eureka moment (it might have some vacuum cleaner DNA) and suggests that the apartment trash bin is closer to the packed container than your apartment. So simply tossing all “Things With Blue” would save a lot of time, effort, and tax money. Empathy and nostalgia algorithms, anyone?

Doing the numbers with GitLab

Understanding the total cost of tooling and processes is key for the current state and future growth. GitLab has some tools that can help in this department. GitLab Secure integrates with third-party scanners and can report vulnerabilities directly to the developers who introduced them immediately upon their next feature branch build (shifting hard left). If the free scanners that come with GitLab are sufficient for your scanning needs, they don’t have any usage-based pricing. So when some tool consolidation is possible, you can potentially save some money. Find more details on GitLab Secure here.

