
atom and chen join AboutCode

· 3 min read
Philippe Ombredanne
AboutCode Lead Maintainer


atom and chen, two open source tools for high-quality code analysis built by the AppThreat team, are now part of AboutCode, the non-profit organization committed to making open source easier and safer to use by building critical open source tools for Software Composition Analysis (SCA) and beyond.

“AppThreat started with the simple mission to make high-quality code analysis and security tools for everyone,” says Prabhu Subramanian, lead maintainer of atom and chen, founder of AppThreat, and creator of other open source supply chain security tools like OWASP CycloneDX Generator (cdxgen), OWASP blint, and OWASP depscan.

While working on a different problem, Prabhu uncovered a lack of high-quality code hierarchy analysis libraries and CLI tools. atom and chen were built as open source tools to identify likely adversary entry points to improve threat modeling, vulnerability management, and risk mitigation. Precisely knowing when, where, and how a given library is used in an application or service empowers developers to better understand risks and secure their work.

chen, short for Code Hierarchy Exploration Net, is an advanced toolkit for application source code analysis that parses code and extracts code property graphs.

Built on the chen library, atom is both a standalone tool and a novel intermediate representation for applications. The intermediate representation (a network of nodes and links) is optimized for operations typically used in application analytics and machine learning, including slicing and vectorization.

“As our projects grew in usage and significance, we felt the need to donate these projects to an open source organization committed to the original AppThreat mission,” says Prabhu. “AboutCode is that organization.”

AboutCode is a registered non-profit organization that supports the development and maintenance of the AboutCode stack of open source tools and open data for SCA, including the industry-leading ScanCode, VulnerableCode, and DejaCode projects. AboutCode believes that good open source tools and open data help you use open source securely and efficiently.

With planned tighter integrations with the AboutCode stack, atom and chen will provide an even more comprehensive open source solution for the practical management of open source and security compliance. This includes advanced code reachability analysis, more efficient triage of vulnerabilities based on true reachability, and deep analysis of call graphs to find where vulnerable code is used.

For supply chain analysis, atom can generate evidence of external library usage, including the flow of data. OWASP cdxgen uses atom to improve the precision and comprehensiveness of the generated CycloneDX SBOM document.
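
As a minimal usage sketch (flags may vary between cdxgen versions, so treat this invocation as illustrative and check cdxgen --help), generating a CycloneDX SBOM for a Java project looks like:

    # Generate a CycloneDX SBOM for a Java project in the current directory
    cdxgen -t java -o bom.json .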

For vulnerability analysis, atom describes vulnerabilities with evidence of affected symbols, call paths, and data flows to enable variant and reachability analysis at scale.

“The next frontier in vulnerability management is deep vulnerable code reachability analysis and taint analysis to discover new vulnerabilities,” says AboutCode lead maintainer Philippe Ombredanne. “atom and chen are the fundamental blocks to enable the construction of a FOSS solution to better triage vulnerabilities and avoid vulnerability fatigue.”

With atom and chen joining, AboutCode will also adopt an open governance model, drawing from best practices established by other organizations committed to open source software and prioritizing transparency, inclusivity, and community-driven development. A technical advisory group (TAG) will be formed to ensure project development addresses the needs of the wider community.

Want to get involved? Join the AboutCode Slack or Gitter to chat with the community.

PURLs of Wisdom

· 12 min read
Philippe Ombredanne
AboutCode Lead Maintainer

Accurately identify third-party software packages with PURL.


If you need to generate (or consume) Software Bill of Materials (SBOMs), then you need a standardized way to communicate information about what components are in your software.

If you’re using or building applications, you need tools to determine if there are any known security issues with open source and third-party components.

If you’re building tools for Software Composition Analysis (SCA), such as analyzing the origin, license, security, and quality of code across different ecosystems, you need a simple way to identify the packages used.

Package URL (PURL) is an open source standard that accurately identifies the third-party software packages you use.

(Comic: “Standards”, https://xkcd.com/927/)

A universal identifier like PURL provides both internal developers and external users with direct access to key information about those packages like vulnerabilities and licenses. PURL reduces friction by providing continuity across tools, processes, development environments, and ecosystems.

It’s not complex. The idea behind PURL is to identify a software package in a simple way, based on its name. Just by looking at the code, you can determine the Package URLs of the open source packages you are using. PURLs are defined by the intrinsic nature of the software you observe, and that makes the difference.
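
A PURL is a short URL-like string with a fixed structure defined by the specification, and real-world examples are easy to read at a glance:

    pkg:type/namespace/name@version?qualifiers#subpath

    pkg:npm/lodash@4.17.21
    pkg:pypi/django@4.2
    pkg:maven/org.apache.commons/commons-lang3@3.12.0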

A car is a good analogy for demonstrating the superpower of easily identifying something just by looking at it. You can determine the make and model of a car by observing it. You can then uniquely identify it by its license plate.

In contrast, identifiers previously used for software packages are complicated. In the world of security, the National Vulnerability Database (NVD) uses an identifier for software packages called Common Platform Enumeration (CPE).

Extending the car analogy, identifying the CPE is like trying to find the internal model number from the manufacturer: you need access to extra information that is not obvious and not readily available just by looking at the car. Using CPEs to query the NVD requires prior knowledge of this extra information and arbitrarily assigned identifiers, adding complexity with an additional step.

Package URLs, in comparison, are simple: developers can tell what a package actually is just by looking at it. As with the car, we can easily observe the make, model, color, condition, and license plate number, and use them to universally identify the car efficiently. Hunting down an internal manufacturer part number through a central authority is too cumbersome. PURL brings us simplicity.

PURL can be extremely helpful for organizations mitigating high-profile security issues related to open source package vulnerabilities. More information is usually required to fix these vulnerabilities than what you can discover by looking at the code. The conjunction of several high-profile security issues heightened the need to figure out what third-party software packages are included in many software products.

For example, there was that major security bug with Log4j, which was really in Log4j version 1.2.17. How can you know whether you use Log4j v1.2.17 when there’s a security issue? It’s powerful to know just by looking at the code. But if you first need to know that the NVD calls it apache:log4j, it’s much harder, and that knowledge is required just to query the NVD.
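
To make the contrast concrete, here is the same package in both schemes (the CPE follows the NVD naming convention; treat it as illustrative):

    CPE:  cpe:2.3:a:apache:log4j:1.2.17:*:*:*:*:*:*:*
    PURL: pkg:maven/log4j/log4j@1.2.17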


PURL removes the friction of searching for complex auto-generated values in databases. This makes PURL even more useful for the larger software development community.

How PURL was made

The origins of PURL can be traced very specifically to 2017 when we needed a new way to identify packages in ScanCode Toolkit.

As part of ScanCode v3, we designed parsers for package manifests, like a Maven POM, a Python setup.py, and an npm package.json. But it was incredibly difficult to quickly identify these packages across ecosystems. We considered different schemas for different environments, but quickly understood that would be way too complicated. After looking at existing options in different open source communities, we discovered a Google project called Grafeas.

Grafeas defined something promising and wonderfully simple called resource URLs, derived from and contributed by JFrog and their product JFrog Xray. In a resource URL, an npm package would be identified as npm://name@version. This is quite similar to URL nomenclature, but with the prefix modified to identify the development environment, followed by essentially a name:version. Grafeas was already in use in various places, so we created an issue describing a few details to work out and pinged R2wenD2, who was actively working on the project.

I also reached out to the OSS Review Toolkit (ORT) project to discuss best practices for identifying packages across ecosystems related to AboutCode data specs. At that time, we used a system with a name, namespace, and version for some packages, based on the resource URLs in Grafeas. But, we still needed a common syntax to easily identify these packages in and across different ecosystems.

Open source projects like Grafeas, Libraries.io, Fabric8 from RedHat, and several others were all doing similar things, but none solved this issue. I cherry-picked the best components of each, and built the first version of Package URL in late October 2017.

As part of nexB’s core principle of FOSS for FOSS, we moved PURL to a new, separate organization on GitHub to better collect feedback and share ownership. We invited every key contributor as a co-maintainer and co-owner, including R2wenD2 from Google, Sebastian Schuberth from ORT, and later on Steve Springett from CycloneDX, an early adopter for whom the spec was key.

Relinquishing exclusive control early on but maintaining a strong design direction with the key contributors was critical for the ongoing development and success of the PURL project.

With the AboutCode projects, PURL is the critical connector. Whether it’s detecting a package in ScanCode Toolkit, looking for vulnerabilities in VulnerableCode, or reporting complex scans of container images in ScanCode.io, the output is PURLs. The input is PURLs whenever we consume a CycloneDX or SPDX SBOM. We can easily consume and exchange information extracted by PURLs from other tools. That has proved critical to the success of these open source projects.
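
For instance, with the packageurl-python library (the Python implementation of the spec; a minimal sketch based on its documented API), building and parsing PURLs takes a few lines:

    from packageurl import PackageURL

    # Build a PURL from its parts.
    purl = PackageURL(type="maven", namespace="log4j", name="log4j", version="1.2.17")
    print(purl.to_string())  # pkg:maven/log4j/log4j@1.2.17

    # Parse an existing PURL string back into its components.
    parsed = PackageURL.from_string("pkg:npm/lodash@4.17.21")
    print(parsed.type, parsed.name, parsed.version)  # npm lodash 4.17.21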

Casting PURL in the real world

The basic identification of packages is simple and easy. But there are many levels of depth beyond the package name and version, including metadata details, like the software license, that APIs can look up. This gets tricky when the metadata is poorly documented or messy. All this adds more complexity!

Part of the overall identification process is knowing what files were provisioned or provided by the software package – this is very important in verifying the integrity of a package. Once the package is identified and installed on the machine, then you need to ensure that it really is what it’s supposed to be.

All these layers matter when you don’t have a package manifest or structured information about the package. It’s more complex because you need to accumulate different clues to find the potential origin: copyrights and license notices, URLs, semi-structured information processed from READMEs, and eventually matching.

Matching is performing a lookup against a larger database of known packages. In modern software development, the vast majority of packages come with a PURL and don’t need any kind of matching. Matching is the fallback technique that helps when the information is murkier or more difficult to pin down.

In some cases, matching is used for Microsoft .NET and NuGet packages. Each NuGet package is a set of files, typically DLLs and executables, which may be installed and bundled together. Once that has happened, you no longer know which DLLs came from which NuGet package. It’s not easy to determine just by observing the deployed binaries. It requires matching back to each of the packages to figure out, like “Oh! This DLL came from this Microsoft NuGet, and this other DLL came from this Acrobat NuGet.” High-level matching of whole files can be very useful to complete the identification, typically in NuGet, Apache Maven, or other Java environments.

Dependency resolution is unwinding the twine ball. First, you state the direct dependencies: you need this and that and this package, at this version or within a version range. Next, you go to the second level of dependencies, and then the third level, until you go all the way down.

There is a complex category of software called “solvers” or “dependency resolvers” that ensures a set of versions is compatible together. The input for these solvers is a set of package versions (and, in some cases, package version ranges), from which they find all the dependencies in complex dependency trees. Each package manager uses its own conventions and approaches to list the various constraints, so there is no easy path to a unified approach to dependency resolution.

The AboutCode team is currently working on vers, a universal standard for version range identifiers, to enable universal dependency resolution across all package managers and to express constraints across system packages and application packages.
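
A vers range packs a versioning scheme and a pipe-separated list of constraints into one normalized string, as in these examples following the draft spec’s conventions:

    vers:npm/>=1.2.3|<2.0.0
    vers:pypi/>=2.0|!=2.1|<3.0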

Dependency resolution is designed to determine which version of each package to use for a given application, and then to fetch and install that version. Typically, package managers, like npm for JavaScript, pip for Python, or Bundler for Ruby, perform the dependency resolution; once they find the release of a package to install, they download, extract, and install it.

When looking at software that has already been built, the dependency resolution has already happened: there is simply the set of packages that the package management tool downloaded and installed at build time. Dependency resolution therefore matters most during development, when building the software. Reconstructing it afterwards, from the built artifacts, can be an extremely complex and difficult exercise in reverse engineering.

A newer approach called non-vulnerable dependency resolution merges and combines the functional dependency constraints with vulnerability data, to avoid resolving to package versions with known vulnerabilities.

The amount of code that can be identified by PURL depends on the environment. For more general purpose software like web, backend, desktop, and mobile development, usually 95 to 99% of the code used comes from package repositories – unless you use proprietary code, of course. But for the code that has third-party open source origins, everything is easily identified by PURL.

The main difference in PURL’s efficacy is in embedded vs non-embedded software development.

Embedded systems and environments using native code like C or C++ are more complex for PURL because they don’t typically use a package manager. There are some newer package managers for embedded systems, like Yocto and Buildroot for Linux, and Conan for C and C++.

In general, where there is no package manager, PURL can still be used, but proper identification takes more work; this is the big practical difference between embedded and non-embedded software development. There are caveats in both directions: build tools for embedded development such as Yocto or Buildroot may not use PURL, yet they still keep a strongly traceable record of where the code comes from. In other cases, the “generic” PURL type with a download URL as a qualifier can be used. So even where PURL is not used natively, there is a good enough approach to handle these exceptions.
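
For example, a tarball with no ecosystem can still get a usable PURL through the generic type, as in this example adapted from the PURL spec:

    pkg:generic/openssl@1.1.0g?download_url=https://openssl.org/source/openssl-1.1.0g.tar.gz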

PURL of a FOSS price

PURL makes identification easy, across tools and ecosystems. The ability to compose with other tools – and those tools being free and all available with PURL as a mostly universal identifier – means users don’t have to struggle with integrations and can choose the best-in-class tool they want to use.

PURL was originally built for Python because that’s the language of preference for ScanCode Toolkit. But the specification and its libraries are simple enough that there are now implementations in C#, .NET, JavaScript, Go, Java, Ruby, Swift, Python, PHP, Rust, and more; we even have more than one PURL implementation in Java. A list of the available implementations is available at https://github.com/package-url/.

PURL was specifically designed to not be unique to any organization’s projects. As a new open source standard, PURL helps the ecosystem of SCA tools and provides more flexibility and more options for end users to replace, exchange, and combine tools together using PURL.

As Jackie Kennedy meant to say, “PURLs are always appropriate.”

There and back again -- A software versioning story

· 14 min read
Philippe Ombredanne
AboutCode Lead Maintainer

One software version control to rule them (modern software development) all?


Software projects make many decisions, but one of the most critical is deciding how to version the software, a concern closely tied to version control (also known as revision control, source control, or source code management). In modern software development, a versioning convention is a key tool to manage software releases and revisions. The two main approaches are calendar versioning (CalVer) and semantic versioning (SemVer), often with some alterations depending on an organization’s or project’s requirements.

For AboutCode projects, we started with SemVer, transitioned to CalVer and then migrated back to a format that mostly resembles SemVer. This blog post details the pros and cons of each version convention, along with explaining why we embarked on this version convention journey.

The Fellowship of the Version Conventions

CalVer

Short for calendar versioning, CalVer uses something that resembles a date as a version number. It’s popular, especially in software, to convey time in a version number. Ubuntu is a good example of using CalVer: version 12.04 was released in April 2012.

The idea is that the first segment of the version number is either the four-digit year or just its last two digits. Next come a month and then a day (the placement of the segments remains constant across future releases), with each segment separated by a dot to form the full version number. Ubuntu does stable releases in April, so version 22.04 is the stable release from April 2022.
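
A couple of common CalVer layouts, with hypothetical dates for illustration:

    YYYY.MM.DD  e.g. 2022.04.21   (full release date)
    YY.MM       e.g. 22.04        (Ubuntu’s layout: April 2022)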

SemVer

SemVer, or semantic versioning, doesn’t convey time like CalVer. It is designed to better define the relative importance of changes in the software and its interface(s).

SemVer.org clearly states the structure:

Given a version number MAJOR.MINOR.PATCH, increment the:

  1. MAJOR version when you make incompatible API changes

  2. MINOR version when you add functionality in a backwards compatible manner

  3. PATCH version when you make backwards compatible bug fixes

Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.

Each time there’s an incompatible change in the underlying application programming interface (API), you should change the MAJOR version. If the change is compatible, you should change the MINOR version (the middle segment). If it’s just a bug fix, you should change the PATCH version.
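
As a worked example, starting from version 1.4.2:

    1.4.2 -> 1.4.3   backwards compatible bug fix            (PATCH)
    1.4.2 -> 1.5.0   backwards compatible new functionality  (MINOR)
    1.4.2 -> 2.0.0   incompatible API change                 (MAJOR)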

In theory, SemVer looks beautiful and simple. It conveys the changes in the software with just three numbers. In practice, it is extremely difficult for humans to judge whether a change breaks the API. I doubt that any tool or whole ecosystem claiming to use SemVer, like npm or Go, is actually true to its principles, because it’s too hard to understand whether a change affects the API and whether that change is major or minor. More often than not, version changes misrepresent what SemVer is designed to convey.

OtherVer

Other approaches to version control exist beyond SemVer and CalVer.

Before version 3, OpenSSL used a peculiar scheme that resembled SemVer, but with letters as a patch suffix (1.1.1g, for example). With version 3, the project changed the scheme to be more like SemVer and dropped the letter suffix. It is difficult for users to make sense of versions when different schemes are used over time, and OpenSSL still uses its legacy letter-suffix scheme for pre-3.0 versions, making things even harder to understand, especially when dealing with bugs and vulnerabilities. This is problematic because there is no upward compatibility. Still, OpenSSL’s conventions remain reasonably straightforward; other projects such as nginx have more byzantine conventions, where a version segment has a different meaning depending on whether it is even or odd.

When a project switches to a new version scheme, it’s important that the project explains to its users the correct sequence of versions as clearly as possible. Note that we attempt to resolve these weirdnesses of version schemes and version ranges in the univers library and in the upcoming Package-URL “Version Range Specification” (vers).

Another interesting example is in the Android ecosystem. The Google Play store carries two versions for each app. One is a “version name”, which is whatever version string the author likes; it can follow SemVer, CalVer, or any other convention, and is used only for display. The other is a version code, which must be a single number that increases with each new release. Google recognized, from working with a huge number of Android developers, that no versioning scheme is actually correct and workable given the scale and diversity of developers. With the version code approach, users know sequentially which release is the latest, without any ambiguity, while the “version name” remains available for cosmetic display.
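
Concretely, for a hypothetical app release these two values might look like:

    versionCode: 421       (single integer; must increase with every release)
    versionName: "4.2.1"   (free-form display string: SemVer, CalVer, anything)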

Google Play’s version code for Android apps is very similar to an older approach used by a version control system called Subversion, which was popular at the beginning of the century. Subversion used a single revision number that applied across the whole code tree for everyone using the same repository; this number was incremented centrally with every code change. The Apache Software Foundation, home of the Subversion project, was one of its largest users: each commit increased the revision number for the entire foundation, which made it difficult to relate revision numbers to individual project releases. This was effectively one of the earliest and likely the largest public “monorepo”.

Having this kind of version numbering across a very large foundation with hundreds of projects and thousands of users created millions of revisions, but it did not contribute to better coordination across projects. It was impractical and hyper-centralized, and most Apache projects have since switched to distributed and decentralized version control systems such as Git. Still, there was value in having a simple version number that is just bumped as needed.

Subversion was the complete opposite of Git, where everything can be distributed on any computer. There is a clear benefit to being able to strictly order versions without ambiguity, which we lost with distributed version control systems like Git. It’s difficult to understand the sequence of versions from commit hashes across multiple branches, and their relation to time, because there may not be any relationship as obvious as a single increasing number.

One interesting pocket of SemVer is the Go programming language, which claims to enforce SemVer compliance for all third-party Go modules. In practice, the Go tools generate pseudo-SemVer versions based on the sequence of commits in the Git repository of each Go library you use. The Go algorithm for this pseudo-SemVer versioning is not straightforward. And if you change the API of your Go library, there is no major version bump; instead, you are supposed to create a new library by changing its import path to add a v2. It’s not v2 as in version two; it’s more like a brand new library that shares its name with the v1.

This approach achieves some SemVer compliance by avoiding API changes altogether, and thereby recognizes that there is no reliable way to capture API changes faithfully. The only way to get compatibility between two versions is to create a new library whenever the signature of your Go library changes. Eventually, you could hope to ignore versioning entirely and use only the commit history instead of a separate versioning mechanism. The problem is that commit hashes are long (40 hexadecimal characters), obscure strings that are not human-friendly.
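
Concretely, a Go pseudo-version encodes a base version, a commit timestamp, and a commit hash prefix, and a breaking change moves a module to a new /v2 import path (hypothetical module shown):

    v0.0.0-20210101120000-abcdef123456

    require example.com/somelib/v2 v2.0.3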

The Two Version Conventions: Choosing between CalVer and SemVer

Does your project feature a large or constantly-changing scope, including large systems and frameworks, like Ubuntu and Twisted, or amorphous sets of utilities, like Boltons? Is your project time-sensitive in any way or do other external changes drive new project releases, including business requirements like Ubuntu’s focus on support schedules, security updates like certifi’s need to update certificates, or political shifts, such as pytz’s handling of time zone changes?

CalVer.org suggests adopting CalVer under those conditions listed above.

There is a simple value in CalVer: you don’t have to think or strategize to pick a version number; just use the date of the release and that’s it. And it’s easier to convey a temporal change with CalVer’s date-based version numbers. If you want to convey that one version is obsolete, then CalVer makes sense.

But do not overestimate the strength of the signal sent by your version number. Windows is a good example of this. Windows 95 used CalVer: 95 stood for 1995. Windows 95 was used for many years afterwards, so much so that folks after 2000 probably didn’t realize that 95 really meant the last century. The same goes for Windows 98; these systems were still in use more than ten years after their release.

Determining what counts as an API change under SemVer is extremely difficult and almost impossible to get right. However, SemVer can be useful in systems with many dependencies. Issues like version lock (the inability to upgrade a package without releasing new versions of every dependent package) or version promiscuity (assuming compatibility with more future versions than is reasonable, especially when dependencies are poorly specified) can result in dependency hell (SemVer.org).

Jacob Bolda from Frontside Software wrote that it’s “not a boolean choice but a spectrum whose adherence specifications you must outline depending on your project’s needs and circumstances.”

The Return of the (mostly) SemVer

When we started AboutCode, we used SemVer, with a Python flavor that adds a few extensions for managing pre- and post-releases differently. We then realized that we had a problem conveying the obsolescence of the data we ship in ScanCode, like the license database.

Licenses change on a regular basis, both in terms of newer licenses and newer ways to talk about licenses. When we helped fix and streamline the licensing of the Linux kernel, we found close to 800 different ways to say “this file is under GPL”. Developers are creative when it comes to reporting a FOSS license!

By switching from SemVer to CalVer, we wanted to convey to users that they were running a ScanCode version that was old and probably carrying obsolete license data. It was pretty naive of us to believe that a version number alone would be enough to signal users that they should upgrade. We also tried to proactively signal with a warning that they should update. This was great in theory, except the warning code had issues: it displayed a warning message even when no new version was available, and when there was a new release it sometimes failed to display one.

CalVer was not working and was too weak a signal to convey obsolescence with versions, so we switched back to a SemVer-style version. SemVer and CalVer are compatible, so when AboutCode and ScanCode went from SemVer to CalVer and then back to mostly-SemVer, it was easier (and important) to ensure that the version sequence would stay consistent and obvious to users. During our first switch of ScanCode to CalVer, we went from version 2 to version 21. When we went back to SemVer in 2022, we decided to jump to version 30 to avoid any confusion with dates and previous versions.

The work we did around the Package-URL (PURL) and vers specifications and the univers library for VulnerableCode also informed our approach. Combined with PURL, the univers library can parse and make sense of package version range syntaxes and conventions across all ecosystems, especially for vulnerabilities, and expose them as a simple, easy-to-read normalized string. This is unfortunately extremely complex to achieve because of the lack of standards and the diversity of version range syntaxes. Yet it is also very useful and important, because we rely on versioning to understand which versions come before and after; this is key to resolving package dependencies and to understanding whether a version of a package falls in a vulnerable version range.
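
For instance, with the univers library (a minimal sketch based on its documented usage), checking whether a version falls in a vulnerable range reads naturally:

    from univers.version_range import VersionRange
    from univers.versions import PypiVersion

    # A normalized vers range, e.g. the versions known to be vulnerable.
    vulnerable = VersionRange.from_string("vers:pypi/>=2.0|<2.4.1")

    # Membership tells us whether a given version is affected.
    assert PypiVersion("2.3") in vulnerable
    assert PypiVersion("2.4.1") not in vulnerable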

We’re not the only ones to understand the importance of this. There’s a project at the Linux Foundation under the Open Source Security Foundation (OSSF) and maintained by Google that is called the Open Source Vulnerability Database (OSV), which has very similar goals to VulnerableCode. We’re collaborating and OSV reuses some of the code from univers to better understand how all of these versions compare and fit together.

One Ver to rule them all?

Having designed libraries and specs and built tools to handle arbitrarily complex versions and version ranges, we came to the conclusion that the only benefit of SemVer is that it’s familiar and well understood by everyone. But it is a fallacy to believe that we can faithfully implement SemVer.

This is because of Hyrum’s Law, which states:

With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.

Hyrum’s Law means that if you have enough users, every little bug and every internal behavior of your tool will eventually be treated as an API or a feature by some users. What we think of as the API is not necessarily what our users treat as the API. That applies to any interface we may design.

(Comic: “Workflow”, https://xkcd.com/1172/)

It may be technically possible in some cases to let the system, tool, or code decide when there’s an API change, but it would be extremely complex in general as determining this requires intelligence and detailed understanding of how a library or tool behaves and may be used. This is very hard for humans and harder for machines!

Versions are necessary and useful. They carry at least one meaning: whether a given version of a piece of code came before or after another. But loading them with more meaning distracts from what they are meant to convey.

Jacob Tomlinson, a software engineer at NVIDIA, wrote a blog post on why he sometimes regrets using CalVer:

By releasing software that uses SemVer you are signaling to your community that you have some constraints. In my opinion, CalVer signals to your community that anything could happen at any time and that you have no interest in the effect that has on your users.

We recognize that unfortunately all the versions that pretend to carry special meaning are mostly misleading, or not true to what they’re trying to convey, whether it is date changes or API-based changes. But even if a SemVer version may be misleading, it’s very important to have such a number that humans can refer to and easily remember. There are some good reasons to implement CalVer, but for the AboutCode team, we found that a modified approach to SemVer was the best approach to display version obsolescence for our users.

With this approach, we acknowledge that any attempt to correctly decide whether a new release should increase the major, minor, or patch part of a SemVer version is condemned to be incorrect or misleading more often than not. Instead, we try to make decent guesses about whether a new version is major, minor, or patch, which mostly amounts to stating that this is a big release with big changes or a smaller release. We have also introduced an “output format version” to ScanCode Toolkit that is loosely based on SemVer and lives on its own schedule. It is based strictly on the output data structure of the JSON format, and changes to this format are simpler to evaluate as breaking the API or not, so we may be truer to the SemVer spirit in this case.
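
For illustration, this output format version travels in the headers of a ScanCode JSON scan; the trimmed sketch below shows the idea, with example values:

    {
      "headers": [
        {
          "tool_name": "scancode-toolkit",
          "tool_version": "32.0.0",
          "output_format_version": "3.0.0"
        }
      ],
      "files": []
    }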