
ScanCode LicenseDB -- 2,000+ licenses curated in a public database

· 4 min read
AboutCode team
Open source for open source

The ScanCode LicenseDB is all about identifying a wide variety of licenses that are actually found in software.


New software licenses appear constantly (like mushrooms popping out of the ground after a heavy rain) and old, nearly-forgotten ones are rediscovered when someone scans a codebase that incorporates legacy code (like finding rare medieval manuscripts in the back shelves of a library). The ScanCode LicenseDB precisely identifies and organizes licenses and their metadata so that everyone in the software community can understand exactly which licenses are being referenced in project documentation.

If you have seen a license notice, passed it on to your legal team for scrutiny, and completed that review, then you probably do not want to repeat that process over and over again.

With over 2,000 licenses, ScanCode LicenseDB is arguably the largest free list of curated software licenses available on the internet, and an essential reference for license compliance and SBOMs. ScanCode LicenseDB is available as a website, a JSON or YAML API, and a git repository, making it easy to reuse and integrate into tools that need a database of reference software licenses.

Here are some key points about the ScanCode LicenseDB. It:

  • Lists 2,470 licenses recognized by scancode-toolkit as of 2026-01-29.
  • Identifies each license by the license key defined in scancode-toolkit.
  • Provides an SPDX identifier (with link) for every license and exception on the SPDX License List, and a “LicenseRef” identifier for every license and exception not on that list.
  • Provides license texts in plain text format.
  • Provides license texts and metadata in YAML and JSON.
  • Is freely accessible via API (see the sketch after this list).
  • Licenses its data under CC-BY-4.0.
  • Is community supported on GitHub.
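As an illustration of the JSON API, here is a minimal Python sketch that fetches a single license record. The per-license URL pattern and the field names shown are assumptions based on the published site layout rather than a documented contract:

```python
# A minimal sketch of reading a license record from the LicenseDB JSON API.
# The URL pattern and field names are assumptions based on the site layout.
import requests

BASE_URL = "https://scancode-licensedb.aboutcode.org"

def get_license(license_key: str) -> dict:
    """Fetch the metadata record for one license by its ScanCode license key."""
    response = requests.get(f"{BASE_URL}/{license_key}.json", timeout=30)
    response.raise_for_status()
    return response.json()

record = get_license("mit")
print(record.get("spdx_license_key"))  # e.g. "MIT"
print(record.get("category"))          # e.g. "Permissive"
```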

And below are some frequently asked questions about the ScanCode LicenseDB.

Q: What are the inclusion criteria for a license to be in the ScanCode LicenseDB?

A: The only requirements are a license text and usage in existing code. The ScanCode LicenseDB includes multiple categories of licenses, not just open source: permissive, copyleft, commercial, proprietary free, source-available, etc. More information on license categories is available here: https://scancode-licensedb.aboutcode.org/help.html#license-categories

Q: Does the ScanCode LicenseDB compete with other license lists, such as the SPDX license list?

A: No. The ScanCode LicenseDB is intended to supplement other license lists. When new licenses are discovered by scancode-toolkit or the software community, they are added to the list with references to other lists whenever possible.

Q: What is the process for adding or correcting licenses in the ScanCode LicenseDB?

A: License curation is primarily a task of the active participants in AboutCode.org, but any member of the software community is welcome to log and respond to issues at https://github.com/nexB/scancode-licensedb/issues. See https://scancode-licensedb.aboutcode.org/help.html#support for more details.

Q: Is a license in the ScanCode LicenseDB “approved” or “recommended for use”?

A: The ScanCode LicenseDB is all about identifying the wide variety of licenses that are actually found in software. There is no attempt to approve or disapprove of license terms, and there is no attempt to correct poorly written licenses. The only license interpretation provided is a license category, which represents the best judgment of the license curators.

Q: How are licenses discovered (detected) by scancode-toolkit?

A: For license detection, ScanCode uses a (large) number of license texts and license detection ‘rules’ that are compiled into a search index. When scanning, the text of the target file is extracted and used to query the license search index to find license matches.

For copyright detection, ScanCode uses a grammar that defines the most common and less common forms of copyright statements. When scanning, the target file text is extracted and ‘parsed’ with this grammar to extract copyright statements.

More detailed information is available at https://scancode-toolkit.readthedocs.io/en/stable/explanation/scancode-license-detection.html#scancode-license-detection.
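As a concrete starting point, here is a rough sketch of driving a scan from Python with the scancode command-line tool and reading the detected license expressions. The output field names follow recent scancode-toolkit releases and may differ in older versions:

```python
# A rough sketch: run a license and copyright scan, then read the results.
# Output field names follow recent scancode-toolkit releases (assumption).
import json
import subprocess

subprocess.run(
    ["scancode", "--license", "--copyright", "--json-pp", "scan.json", "src/"],
    check=True,
)

with open("scan.json") as scan_file:
    scan = json.load(scan_file)

for file_result in scan["files"]:
    for detection in file_result.get("license_detections", []):
        print(file_result["path"], detection["license_expression"])
```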

Q: How can I get help or contribute to ScanCode LicenseDB?

A: You can chat with the AboutCode community on Gitter, or report issues or ask questions at https://github.com/nexB/scancode-licensedb/issues.

VulnerableCode API Deprecation and V3 Introduction

· 2 min read
Tushar Goel
Software Engineer

The AboutCode team is planning to deprecate the V1 and V2 APIs of VulnerableCode (public.vulnerablecode.io) by the end of Q2 2026 (June 20, 2026). We are introducing the V3 API and UI by the end of January 2026.


Why this new API

The existing V1 and V2 APIs are both based on the “vulnerabilities” model, designed to aggregate information from multiple advisory sources based on identifiers and aliases. With the “vulnerabilities” model it is difficult to determine which source is correct, because the sources are combined and data from one source may overwrite data from another source.


What to expect from the new V3 API and UI

Moving forward, VulnerableCode will report “advisories” for packages and not “vulnerabilities”.

Currently, if a package has 4 advisories and those 4 advisories are correlated with each other by their aliases and identifiers, we report a single vulnerability affecting that package. The new approach in V3 will instead report 4 individual advisories.

The new “advisories” model introduces an Advisory ID (AVID) for each advisory in VulnerableCode. An AVID combines the source and the natural unique identifier used at that source. For example, if we import an advisory from “nodejs_security_wg” that is identified there by the ID “123”, the AVID will be “nodejs_security_wg/123”.
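As an illustration only (this is not VulnerableCode's actual implementation), an AVID can be modeled as a source name plus the advisory's local identifier at that source:

```python
# Illustrative only: model an Advisory ID (AVID) as "<source>/<local id>".
from typing import NamedTuple

class AdvisoryId(NamedTuple):
    source: str    # e.g. "nodejs_security_wg"
    local_id: str  # the natural unique identifier at that source

    def __str__(self) -> str:
        return f"{self.source}/{self.local_id}"

    @classmethod
    def parse(cls, avid: str) -> "AdvisoryId":
        source, _, local_id = avid.partition("/")
        return cls(source, local_id)

print(AdvisoryId("nodejs_security_wg", "123"))  # nodejs_security_wg/123
```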


Plan and Timeline

We are planning to complete the following tasks by the end of January 2026:

  • Redesigning the API and UI
  • Migrating our existing data sources
  • Documenting the V3 API and the new UI

Current Status

  • https://public2.vulnerablecode.io/v2 uses the new advisory-based UI.
  • https://public2.vulnerablecode.io/api/v3/ uses the new API, but it is still under development and not ready for production use.


Migration Progress

You can track the progress of migration here:
https://github.com/orgs/aboutcode-org/projects/52/views/48

atom and chen join AboutCode

· 3 min read
Philippe Ombredanne
AboutCode Lead Maintainer


atom and chen, two open source tools for high-quality code analysis built by the AppThreat team, are now part of the non-profit AboutCode organization committed to making open source easier and safer to use by building critical open source tools for Software Composition Analysis (SCA) and beyond.

“AppThreat started with the simple mission to make high-quality code analysis and security tools for everyone,” says Prabhu Subramanian, lead maintainer of atom and chen, founder of AppThreat, and creator of other open source supply chain security tools like OWASP CycloneDX Generator (cdxgen), OWASP blint, and OWASP depscan.

While working on a different problem, Prabhu uncovered a lack of high-quality code hierarchy analysis libraries and CLI tools. atom and chen were built as open source tools to identify likely adversary entry points to improve threat modeling, vulnerability management, and risk mitigation. Precisely knowing when, where, and how a given library is used in an application or service empowers developers to better understand risks and secure their work.

chen, short for Code Hierarchy Exploration Net, is an advanced toolkit for exploring and analyzing application source code by parsing and extracting code property graphs.

Powered by the chen library, atom is a novel intermediate representation for applications and a standalone tool. The intermediate representation (a network with nodes and links) is optimized for operations typically used for application analytics and machine learning, including slicing and vectoring.

“As our projects grew in usage and significance, we felt the need to donate these projects to an open source organization committed to the original AppThreat mission,” says Prabhu. “AboutCode is that organization.”

AboutCode is a registered non-profit organization that supports the development and maintenance of the AboutCode stack of open source tools and open data for SCA, including the industry-leading ScanCode, VulnerableCode, and DejaCode projects. AboutCode believes that good open source tools and open data help you use open source securely and efficiently.

With planned tighter integrations with the AboutCode stack, atom and chen will provide an even more comprehensive open source solution for the practical management of open source and security compliance. This includes advanced code reachability analysis, more efficient triage of vulnerabilities based on true reachability, and deep analysis of call graphs to find where vulnerable code is used.

For supply chain analysis, atom can generate evidence of external library usage, including the flow of data. OWASP cdxgen uses atom to improve the precision and comprehensiveness of the generated CycloneDX SBOM document.

For vulnerability analysis, atom describes vulnerabilities with evidence of affected symbols, call paths, and data flows to enable variant and reachability analysis at scale.

“The next frontier in vulnerability management is deep vulnerable code reachability analysis and taint analysis to discover new vulnerabilities,” says AboutCode lead maintainer Philippe Ombredanne. “atom and chen are the fundamental blocks to enable the construction of a FOSS solution to better triage vulnerabilities and avoid vulnerability fatigue.”

With atom and chen joining, AboutCode will adopt an open governance model, drawing from best practices established by other organizations committed to open source software and prioritizing transparency, inclusivity, and community-driven development. A technical advisory group (TAG) will be formed to ensure project development addresses the needs of the wider community.

Want to get involved? Join the AboutCode Slack or Gitter to chat with the community.

PURLs of Wisdom

· 12 min read
Philippe Ombredanne
AboutCode Lead Maintainer

Accurately identify third-party software packages with PURL.


If you need to generate (or consume) Software Bill of Materials (SBOMs), then you need a standardized way to communicate information about what components are in your software.

If you’re using or building applications, you need tools to determine if there are any known security issues with open source and third-party components.

If you’re building tools for Software Composition Analysis (SCA), like analyzing the origin, license, security, and quality of code across different ecosystems, you need a simple way to identify the packages used.

Package URL (PURL) is a new open source standard to accurately identify the third-party software packages that you use.

xkcd #927, “Standards”: https://xkcd.com/927/

A universal identifier like PURL provides both internal developers and external users with direct access to key information about those packages like vulnerabilities and licenses. PURL reduces friction by providing continuity across tools, processes, development environments, and ecosystems.

It’s not complex. The idea behind PURL was to find a simple way to identify a software package based on its name. Just by looking at the code, you can determine the Package URLs of the open source packages that you are using. PURLs are defined by the intrinsic nature of the software you observe, and that makes the difference.
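For example, with the packageurl-python reference library (https://github.com/package-url/packageurl-python), building and parsing a PURL is a one-liner each:

```python
# Build and parse Package URLs with the packageurl-python reference library.
from packageurl import PackageURL

# Build a PURL from its observable components.
purl = PackageURL(type="npm", name="lodash", version="4.17.21")
print(purl.to_string())  # pkg:npm/lodash@4.17.21

# Parse a PURL string back into its components.
parsed = PackageURL.from_string("pkg:maven/org.apache.commons/commons-lang3@3.12.0")
print(parsed.namespace, parsed.name, parsed.version)
```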

A car is a good analogy for demonstrating the superpower of easily identifying something just by looking at it. You can determine the make and model of a car by observing it. You can then uniquely identify it by looking at the license plate.

In contrast, identifiers previously used for software packages are complicated. In the world of security, the National Vulnerability Database (NVD) uses an identifier for software packages called Common Platform Enumeration (CPE).

Extending the car analogy, identifying the CPE is like trying to find the internal model number from the manufacturer: you need access to extra information that is not obvious and not readily available just by looking at the car. Using CPEs to query the NVD requires prior knowledge of these arbitrarily assigned identifiers, adding complexity with an additional step.

Package URLs, in comparison, are simple, and they make it easy for developers to know what the package actually is! By looking at the car, we can easily observe the make, model, color, condition, and license plate number, which we can use to universally identify the car more efficiently. Finding the internal manufacturer part number through a central authority is too cumbersome. PURL brings us simplicity.

PURL can be extremely helpful for organizations mitigating high-profile security issues related to open source package vulnerabilities. Fixing these vulnerabilities usually requires more information than what you can discover by looking at the code. The conjunction of several high-profile security issues has heightened the need to figure out which third-party software packages are included in many software products.

For example, there was that major security bug with Log4j, which was really in Log4j version 1.2.17. How can you know whether you use Log4j v1.2.17 when there’s a security issue? It’s powerful to know just by looking at the code. But if you first need to know that the NVD calls it apache:log4j, it’s much harder, and that knowledge is required to query the NVD.
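To make the contrast concrete, here are the two identifiers for that same package side by side (shown purely as an illustration):

```python
# The same Log4j package, identified two ways (illustration).
# The PURL is observable directly from the Maven coordinates in the POM.
purl = "pkg:maven/log4j/log4j@1.2.17"
# The CPE requires prior knowledge of the NVD's vendor:product naming.
cpe = "cpe:2.3:a:apache:log4j:1.2.17:*:*:*:*:*:*:*"
print(purl, cpe)
```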


PURL removes the friction to search complex auto-generated values in databases. This makes PURL even more useful for the larger software development community.

How PURL was made

The origins of PURL can be traced very specifically to 2017 when we needed a new way to identify packages in ScanCode Toolkit.

As part of ScanCode v3, we designed parsers for package manifests, like a Maven POM, a Python setup.py, and a npm package.json. But, it was incredibly difficult to quickly identify these packages across ecosystems. We considered different schemas for different environments, but quickly understood that would be way too complicated. After looking at existing options in different open source communities, we discovered a Google project called Grafeas.

Grafeas defined something promising and wonderfully simple called resource URLs, derived from and contributed by JFrog and their product JFrog Xray. In a resource URL, a node package would be identified as npm://name@version. This is quite similar to URL nomenclature, but modifies the prefix to identify the development environment followed by, essentially, a name@version. Grafeas was already in use in various places, so we created an issue describing a few details to work out and pinged R2wenD2, who was actively working on the project.

I also reached out to the OSS Review Toolkit (ORT) project to discuss best practices for identifying packages across ecosystems related to AboutCode data specs. At that time, we used a system with a name, namespace, and version for some packages, based on the resource URLs in Grafeas. But, we still needed a common syntax to easily identify these packages in and across different ecosystems.

Open source projects like Grafeas, Libraries.io, Fabric8 from RedHat, and several others were all doing similar things, but none solved this issue. I cherry-picked the best components of each, and built the first version of Package URL in late October 2017.

As part of nexB’s core principle of FOSS for FOSS, we moved PURL to a new, separate organization on GitHub to better collect effective feedback and share ownership. We invited every key contributor as a co-maintainer/co-owner, including R2wenD2 from Google, Sebastian Schubert from ORT, and later on Steve Springett from CycloneDX, one of PURL’s early adopters and for whom the spec was key.

Relinquishing exclusive control early on but maintaining a strong design direction with the key contributors was critical for the ongoing development and success of the PURL project.

With the AboutCode projects, PURL is the critical connector. Whether it’s detecting a package in ScanCode Toolkit, looking for vulnerabilities in VulnerableCode, or reporting complex scans of container images in ScanCode.io, the output is PURLs. The input is PURLs whenever we consume a CycloneDX or SPDX SBOM. We can easily consume and exchange information extracted by PURLs from other tools. That has proved critical to the success of these open source projects.

Casting PURL in the real world

The basic identification of packages is simple and easy. But there are many levels of depth beyond the package name and version, including metadata details, like the software license, that APIs can look up. This gets tricky when the metadata is not documented effectively or the documentation is poor or messy. All this adds more complexity!

Part of the overall identification process is knowing what files were provisioned or provided by the software package – this is very important in verifying the integrity of a package. Once the package is identified and installed on the machine, then you need to ensure that it really is what it’s supposed to be.

There are all these layers, which matter when you don’t have any package manifest or structured information about the package. It’s more complex because you need to accumulate different clues to find the potential origin, like copyright and permissions, URLs, processing READMEs of semi-structured information, and eventually matching.

Matching is performing a lookup against a larger database of known packages. In modern software development, the vast majority of packages come with a PURL and don’t need any kind of matching. Matching is the fallback technique that helps when the information is murkier or more difficult to obtain.

In some cases, matching is used for Microsoft .NET and NuGet packages. Each NuGet package is a set of files, typically DLLs and executables, and they may be installed and bundled together. Once this has happened, you no longer know which DLLs came from which NuGet, with the NuGet being the package. It’s not easy to find out just by observing the deployed binaries. It requires some matching back to each of the packages to figure out, like “Oh! This DLL came from this Microsoft NuGet, and this other DLL came from this Acrobat NuGet.” Some high-level matching of whole files can be very useful to complete the identification, typically in NuGet, Apache Maven, or other Java environments.

Dependency resolution is being able to literally unwind the twine ball. First, you state the direct dependencies – you need this and that and this package, and this version or a version range. Next, you go for the second level of dependencies, and then, the third level, until you go all the way down.

There is a complex set of software code called “solvers” or “dependency resolvers”, that ensure the set of versions are compatible together. The input for these solvers is a set of package versions (and in some cases, package version ranges) to find all the dependencies in the complex dependency trees. Each package manager uses their own conventions and approaches to list the various constraints so there is no easy way for a unified approach to dependency resolution.

The AboutCode team is currently working on VERS, a universal standard for version range identifiers, to enable universal dependency resolution across all package managers and to express constraints across system packages and application packages.
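A minimal sketch with the univers library (https://github.com/aboutcode-org/univers) shows the intent; the vers syntax is still a draft, so the API and syntax details may evolve:

```python
# A minimal sketch of a universal "vers" version range with univers.
# The vers spec is a draft, so syntax and API details may evolve.
from univers.version_range import VersionRange
from univers.versions import PypiVersion

# All PyPI versions >= 2.0.0 and < 3.0.0, in one normalized string.
vulnerable_range = VersionRange.from_string("vers:pypi/>=2.0.0|<3.0.0")

print(PypiVersion("2.4.1") in vulnerable_range)  # True
print(PypiVersion("3.0.0") in vulnerable_range)  # False
```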

Dependency resolution determines which versions to use for a given application and then fetches those versions. Typically, package managers – like npm for JavaScript, pip for Python, or Bundler for Ruby – perform the dependency resolution, and once they find the release of a package to install, they download, extract, and install it.

When looking at software that’s already been built, the dependency resolution has already happened: there’s a set of packages that were downloaded and installed by the package management tool. So dependency resolution matters most during development, when building the software. Reconstructing it afterwards can be an extremely complex and difficult process of reverse engineering.

A new approach called non-vulnerable dependency resolution merges and combines the functional dependency constraints with vulnerability constraints to avoid using vulnerable package versions.

The amount of code that can be identified by PURL depends on the environment. For more general purpose software like web, backend, desktop, and mobile development, usually 95 to 99% of the code used comes from package repositories – unless you use proprietary code, of course. But for the code that has third-party open source origins, everything is easily identified by PURL.

The main difference in PURL’s efficacy is in embedded vs non-embedded software development.

Embedded systems and environments using native code like C or C++ are more complex for PURL because they don’t typically use a package manager – this is the big difference between embedded and non-embedded software development. There are some newer package managers for embedded systems, like Yocto and buildroot for Linux and Conan for C++.

Build tools for embedded systems like Yocto or buildroot may not use PURL, but they still provide a strongly traceable origin for the code. In other cases, the “generic” PURL type with a download URL as a qualifier can be used. So even where PURL is not used directly, there is a good enough approach to handle these exceptions, although proper identification takes a bit more work.

PURL of a FOSS price

PURL makes identification easy, across tools and ecosystems. The ability to compose with other tools – and those tools being free and all available with PURL as a mostly universal identifier – means users don’t have to struggle with integrations and can choose the best-in-class tool they want to use.

PURL was originally built for Python because that’s the language of preference for ScanCode Toolkit. But the PURL specification and libraries are simple enough that there are now implementations in C#, .NET, JavaScript, Go, Java, Ruby, Swift, Python, PHP, Rust, and more. We even have more than one PURL implementation in Java, for instance. A list of the available implementations is available at https://github.com/package-url/.

PURL was specifically designed to not be unique to any organization’s projects. As a new open source standard, PURL helps the ecosystem of SCA tools and provides more flexibility and more options for end users to replace, exchange, and combine tools together using PURL.

As Jackie Kennedy meant to say, “PURLs are always appropriate.”

Non-Vulnerable Dependency Resolution

· 4 min read
Tushar Goel
Software Engineer

Dependencies may come with vulnerabilities that can be exploited by attackers.


Dependency resolution is the process of identifying and installing the required software packages to ensure that the software being developed runs smoothly. However, these dependencies may come with vulnerabilities that can be exploited by attackers.

Until now, these contexts have been treated as separate domains:

  • Package management tools resolve the version expressions of a package’s dependencies to concrete versions in order to install the selected versions.

  • Security tools check if resolved package versions are affected by known vulnerabilities (even when integrated in a package management tool)

As a result, the typical approach to get a non-vulnerable dependency tree is:

  1. Resolve a dependency tree and install the resolved package versions.

  2. For each resolved dependent package version, translate the identifiers and look in a vulnerability or bug database to determine if a version is affected by a vulnerability and which package version fixes this vulnerability, if any.

  3. Update the vulnerable versions with fixing versions.

  4. Repeat step 1 until you have exhausted all possibilities. Stop on conflicts if a resolution is not possible when considering functional requirements and vulnerability fixing versions.

That approach is complex, tedious, and time-consuming. It also suggests non-vulnerable versions without considering the functional dependency requirements when reconsidering each dependency separately. This is a waste of time and effort, as the non-vulnerable suggestion may not satisfy the functional constraints. Stated otherwise, the result may be a non-vulnerable package tree where packages do not work together and do not satisfy functional requirements, i.e., potentially non-functional software.


Here at nexB, we propose a new method and process to resolve software package vulnerable version ranges and dependency version constraints at the same time. This enables developers to obtain a resolved software package version tree matching the blended constraints of functional and vulnerability requirements in order to provide non-vulnerable and up-to-date software code.

The process would go through these typical steps (a simplified sketch follows the list):

  1. Given an input software package, collect its direct functional dependency requirements from its manifests and/or lockfiles. Optionally, parse these requirements to normalize them as Package-URLs and Version Range Specs.

  2. Fetch the known package versions set from the ecosystem package registry.

  3. Collect known affected package versions ranges and fixed ranges from a vulnerability database or service using the identifiers from step 1.

  4. Combine the version ranges of each dependency from steps 1 and 3 into a single new version range for each dependent package.

  5. Feed the combined ranges from step 4 as input to the dependency resolver and obtain resolved dependencies that satisfy both sets of constraints. The resolver may further request additional versions and ranges using the processes from steps 1 through 4 as new dependent packages are discovered during resolution.

  6. Obtain and output the results of the combined resolution of step 5. Report conflicts and optionally suggest conflict resolutions.
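Here is a highly simplified sketch of steps 1 through 5; every helper function in it is a hypothetical placeholder, not a real API:

```python
# A highly simplified sketch of the combined resolution pass (steps 1-5).
# Every helper function here is a hypothetical placeholder, not a real API.

def non_vulnerable_resolve(package):
    # Step 1: direct functional requirements as (purl, version_range) pairs.
    requirements = collect_direct_requirements(package)

    candidates = {}
    for purl, functional_range in requirements:
        # Step 2: all versions known to the ecosystem package registry.
        known_versions = fetch_known_versions(purl)
        # Step 3: version ranges known to be affected by vulnerabilities.
        vulnerable_range = fetch_vulnerable_ranges(purl)
        # Step 4: combine both constraints into one candidate set.
        candidates[purl] = [
            version for version in known_versions
            if version in functional_range and version not in vulnerable_range
        ]

    # Step 5: feed the combined candidates to a standard dependency resolver,
    # which can call back into steps 1-4 for newly discovered dependencies.
    return resolve(candidates, on_new_package=non_vulnerable_resolve)
```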

With this new process, we get a resolved package dependency tree with versions that satisfy both functional and vulnerability or bug constraints in a single resolution pass.

It’s worth noting that non-vulnerable dependency resolution is an ongoing process. Developers should regularly monitor their software packages for any newly discovered vulnerabilities and update their packages accordingly. This is particularly important when new vulnerabilities are discovered in commonly used packages, as these can have a significant impact on a wide range of software applications.

In conclusion, non-vulnerable dependency resolution is an essential practice that should be adopted by all developers. By selecting software packages that are free from known vulnerabilities, developers can significantly reduce the risk of security breaches in their software applications. Additionally, regularly monitoring and updating packages, as well as ensuring that packages are obtained from trusted sources, can further enhance the security of software development.

To understand this topic in more detail, read this defensive publication on Non-Vulnerable Dependency Resolution from the Technical Disclosure Commons.

What is a Dual License Anyway?

· 4 min read
AboutCode team
Open source for open source

Make it easier for users and remove the word “Dual” from your software project notice vocabulary.


“This project is licensed under a Dual License of BSD and GPL.”

What does “Dual” mean in this context? In a practical sense, it means you have to dig more deeply into the licensing for that project to figure out what this license statement means:

  • Both the BSD AND GPL apply? (conjunctive)
  • Or choose between BSD OR GPL? (disjunctive)
  • Which version of BSD?
  • And which version of GPL?

Typically, but not always, this example statement means that you have a choice of BSD-3-Clause OR GPL-2.0-or-later, because these are the most common versions of those licenses. As the consumer of the software project, you must reach that interpretation and make that choice, usually after exploring the other license notices in the project files. You must then declare that choice in the attribution of your project(s) or product(s) that use that software.

But doesn’t “Dual” mean “consisting of two parts”? Well, yes, that is true in standard English usage, but in the historical practice of many open source projects, this term is ambiguously applied. This wreaks havoc on license detection programs, and creates more busy-work for anyone wanting to use the “Dual-licensed” software.

If you are publishing an open source project, you may of course declare that the project code is under one license, and the project documentation is under another license, and the sample files are under another license. That makes perfect sense, especially if you do not use the word “Dual”. In fact, it would be best to remove the word “Dual” from your project notice vocabulary altogether. If you are publishing a project under a choice of licenses, you should probably indicate what the default license is in case the user of your software does not understand that a stated license conclusion is necessary, and you should avoid referring to that choice as a “Dual” license.

The best solution is to use a standard license expression which explicitly states whether the relationship between two licenses is “AND” or “OR”. The most common syntax for license expressions is from the SPDX v2.3 specification. There are many examples from the SPDX license list or the ScanCode LicenseDB. License identification precision provides the clarity that potential users of your software need to be compliant with the licensing terms.

Dual FOSS/Proprietary Licenses

An increasingly common occurrence in software project licensing is the statement that a project is dual-licensed under a FOSS license and a commercial alternative. This usually means there is a choice between the two licenses, and again the word “dual” is misleading because it makes no sense for both the FOSS license (especially a copyleft license) and the commercial alternative to apply simultaneously and equally. Also note that in such cases, the commercial alternative is often a requirement if the usage of the software goes beyond certain restrictions (e.g. number of users, deployment on a public network, embedding in a commercial product, etc.). Any license notices of this kind should be carefully reviewed to avoid legal risks.

The best practice for a multiple license use case is to state a valid license expression using the correct operator and standard license identifiers. Some examples:

  • /* SPDX-License-Identifier: BSD-3-Clause OR GPL-2.0-or-later */

  • /* SPDX-License-Identifier: BSD-3-Clause AND MIT */

  • /* SPDX-License-Identifier: AGPL-3.0-only OR LicenseRef-scancode-commercial-license */

You can, of course, provide additional explanatory text, remembering always to avoid the inappropriate use of the word “dual”.

More details about license expression syntax are provided in SPDX’s docs.
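To work with such expressions programmatically, the license-expression library maintained by AboutCode (https://github.com/aboutcode-org/license-expression) can parse and validate them. This sketch follows the project README, so verify the API against your installed version:

```python
# A minimal sketch of parsing and validating SPDX license expressions.
# API per the license-expression README; verify against your version.
from license_expression import get_spdx_licensing

licensing = get_spdx_licensing()

# Parse a disjunctive expression into a normalized form.
expression = licensing.parse("BSD-3-Clause OR GPL-2.0-or-later")
print(expression)

# Validation reports unknown license keys instead of silently accepting them.
info = licensing.validate("BSD-3-Clause AND not-a-real-license")
print(info.errors)
```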


There and back again -- A software versioning story

· 14 min read
Philippe Ombredanne
AboutCode Lead Maintainer

One software version control to rule them (modern software development) all?


Software projects make many decisions, but one of the most critical is choosing a versioning convention. In modern software development, a versioning convention is a key tool to manage software releases and revisions. The two main approaches are calendar versioning (CalVer) and semantic versioning (SemVer), often with some alterations depending on an organization’s or project’s requirements.

For AboutCode projects, we started with SemVer, transitioned to CalVer and then migrated back to a format that mostly resembles SemVer. This blog post details the pros and cons of each version convention, along with explaining why we embarked on this version convention journey.

The Fellowship of the Version Conventions

CalVer

Short for calendar versioning, CalVer uses something that resembles a date as a version number. It’s popular, especially in software, to convey time in a version number. Ubuntu is a good example of using CalVer: version 12.04 was released in April 2012.

The idea is that the first segment in the version number is either the four-digit year or just its last two digits. Next is a month, and then possibly a day (the placement of the digits remains constant for future releases), to get the full version number where each segment is separated by a dot. With Ubuntu, stable releases happen in April, so version 22.04 is the stable release from April 2022.
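A CalVer string can be derived mechanically from the release date; for example, an Ubuntu-style YY.MM version:

```python
# Derive an Ubuntu-style CalVer (YY.MM) version from a release date.
from datetime import date

def calver(release_date: date) -> str:
    return f"{release_date:%y.%m}"

print(calver(date(2022, 4, 21)))  # 22.04
```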

SemVer

SemVer, or semantic versioning, doesn’t convey time like CalVer. It is designed to better define the relative importance of changes in the software and its interface(s).

SemVer.org clearly states the structure:

Given a version number MAJOR.MINOR.PATCH, increment the:

  1. MAJOR version when you make incompatible API changes

  2. MINOR version when you add functionality in a backwards compatible manner

  3. PATCH version when you make backwards compatible bug fixes

Additional labels for pre-release and build metadata are available as extensions to the MAJOR.MINOR.PATCH format.

Each time there’s an incompatible change in the underlying application programming interface (API), you should change the MAJOR version. If the change is backwards compatible, you should change the MINOR version (the middle segment). If it’s just a bug fix, you should change the PATCH version.
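Those three rules are mechanical once you have decided how big a change is; the hard part, as discussed next, is making that decision. A minimal sketch:

```python
# A minimal sketch of the SemVer bump rules described above.
def bump(version: str, change: str) -> str:
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "major":   # incompatible API change
        return f"{major + 1}.0.0"
    if change == "minor":   # backwards-compatible new functionality
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"  # backwards-compatible bug fix

print(bump("1.4.2", "major"))  # 2.0.0
print(bump("1.4.2", "patch"))  # 1.4.3
```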

In theory, SemVer looks beautiful and simple. It very simply conveys the changes in the software with three numbers. In practice, it is extremely difficult for humans to determine whether a change breaks the API. I doubt there’s any tool or whole ecosystem, like npm or Go, claiming to use SemVer that is actually true to the principles of SemVer, because it’s too hard to understand whether a change affects the API and whether that change is major or minor. More often than not, those version changes misrepresent what SemVer is designed to convey.

OtherVer

Other approaches to version control exist beyond SemVer and CalVer.

Before version 3, OpenSSL used a peculiar scheme that resembled SemVer, but with letters as a suffix. Now in version 3, they changed the scheme to be more like SemVer and dropped the letter suffix. It is difficult for users to make sense of versions when different version schemes are used over time, and OpenSSL still uses its legacy scheme, with patch-letter suffixes, for pre-v3 versions, making it even harder for users to understand, especially when dealing with bugs and vulnerabilities. This is problematic because there’s no upward compatibility. OpenSSL’s conventions are still reasonably straightforward to understand, though. Other projects, such as nginx, have more byzantine conventions where a version segment number has a different meaning depending on whether it is even or odd.

When a project switches to a new version scheme, it is important that the project explains to its users the correct sequence of versions as clearly as possible. Note that we attempt to resolve the versioning weirdnesses of version schemes and version ranges in the univers library and the upcoming “vers” Version Range Specification for Package-URL.

Another interesting example is in the Android ecosystem. The Google Play store carries two versions for each app. One is a “version name”, which is whatever version string the author likes – this can be SemVer or CalVer or any other convention or version scheme. This version name is used only for display. The other is a version code, which has to be a single number and must increase each time there’s a new release. Google recognized, through working with the large number of Android developers, that no versioning scheme is actually correct and could work given the scale and diversity of developers. With this version code approach, users know sequentially what the latest release is, without any ambiguity, and can use the “version name” for cosmetic display.

Google Play using version code for Android apps is very similar to an older approach to version control with a version control system called Subversion that was popular at the beginning of the century. Subversion used a single version number that would apply across the code tree for everyone using the same system; this single version number was incremented centrally with every code change. The Apache Software Foundation – home of the Subversion project – was one of its largest users. Each commit incrementally increased the version number, which made staying current with the latest release difficult. This was effectively one of the earliest and likely the largest public “monorepo“.

Having this kind of version numbering across a very large foundation with hundreds of projects and thousands of users created millions of revisions, but it did not contribute to better coordination across projects. It was impractical and hyper-centralized, and most Apache projects have since switched to distributed and decentralized version control systems such as Git. Still, there was value in having a simple version number that is just bumped as needed.

Subversion was the complete opposite of Git, where everything can be distributed on any computer. There is a clear benefit of being able to strictly order versions without ambiguity, which we’ve lost with distributed version control systems like Git. It’s difficult to understand the sequence of versions using commit hashes with multiple branches and their relation with time, because there may not be any relationship that’s as obvious as a single number.

One interesting pocket of SemVer is the Go programming language, which claims to enforce SemVer compliance for all third-party Go modules. In practice, the Go tools generate pseudo-SemVer versions based on the sequence of commits in the Git repository of each Go library that you are using. The Go algorithm for pseudo-SemVer versioning of modules is not straightforward. If you’re changing the API or the interface of how you call your Go function, there is no version change. Instead, you are supposed to create a new library by changing its import path with a v2, but it’s not v2 as in version two – it’s more like a brand new library that shares the name with the v1.

This approach achieves some SemVer compliance by avoiding any API change, and it thereby recognizes that there’s no such thing as being able to capture API changes faithfully. The only way to get compatibility between two versions is to create a new library whenever there’s a change to the signature of your Go library. Eventually, you could hope to entirely ignore versioning and use only the commit history instead of a separate versioning mechanism. The problem is that commit hashes are long (40 hexadecimal characters), obscure strings that are not human-friendly.

The Two Version Conventions: Choosing between CalVer and SemVer

Does your project feature a large or constantly-changing scope, including large systems and frameworks, like Ubuntu and Twisted, or amorphous sets of utilities, like Boltons? Is your project time-sensitive in any way or do other external changes drive new project releases, including business requirements like Ubuntu’s focus on support schedules, security updates like certifi’s need to update certificates, or political shifts, such as pytz’s handling of time zone changes?

CalVer.org suggests adopting CalVer under those conditions listed above.

There is a simple value in CalVer: you don’t have to think or strategize to pick a version number – just use the date of the release and that’s it. And it’s easier to convey a temporal change with sequential version numbers in CalVer. If you want to convey that one version is obsolete, then CalVer makes sense.

But do not overestimate the strength of the signal sent by your version number. Windows is a good example of this. Windows 95 used CalVer – 95 was for 1995. Windows 95 was used for many years afterwards, so much so that folks after 2000 probably didn’t realize that 95 really meant the last century. The same goes for Windows 98: these systems were still in use over ten years after their release.

Adhering to SemVer by correctly determining what counts as an API change is extremely difficult and almost impossible to accomplish truthfully. However, SemVer can be useful in systems with many dependencies. Issues like version lock (the inability to upgrade packages without releasing new versions of every dependent package) or version promiscuity (assuming compatibility with more future versions than is reasonable, especially when dependencies are poorly specified) can result in dependency hell (SemVer.org).

Jacob Bolda from Frontside Software wrote that it’s “not a boolean choice but a spectrum whose adherence specifications you must outline depending on your project’s needs and circumstances.”

The Return of the (mostly) SemVer

When we started AboutCode, we used SemVer, with a Python flavor that adds a few extra extensions for managing pre- and post-releases differently. We then realized that we had a problem when conveying the obsolescence of data that we use in ScanCode, like the license DB.

Licenses change on a regular basis, both in terms of newer licenses and newer ways to talk about licenses. When we helped fix and streamline the licensing of the Linux kernel, we found close to 800 different ways to say “this file is under GPL”. Developers are creative when it comes to reporting a FOSS license!

By switching from SemVer to CalVer, we wanted to convey to users that they were running a ScanCode version that’s old and probably outdated, with obsolete license data. It was pretty naive of us to believe that a version number alone would be enough to signal to users that they should upgrade. We tried to provide a way to proactively signal with a warning that they should update. This was great theoretically, except that the warning code had issues: it displayed a warning message even when no new version was available to upgrade, and when there was a new release it sometimes failed to display a warning.

CalVer was not working and was too weak a signal to convey obsolescence with versions, so we switched back to a SemVer version. SemVer and CalVer are compatible, so when AboutCode and ScanCode went from SemVer to CalVer and then back to mostly SemVer, it was easier (and important) to ensure that the version sequence would stay consistent and obvious to users. During our first switch of ScanCode to CalVer, we went from version 2 to version 21. When we went back to SemVer in 2022, we decided to jump to version 30 to avoid any confusion with dates and previous versions.

The work we did around the Package-URL (PURL) and vers specifications and the univers library project for VulnerableCode also informed our approach. Combined with PURL, the univers library is able to parse and make sense of package version range syntaxes and conventions across all ecosystems, especially for vulnerabilities, and expose them in a simple, easy-to-read normalized string. This is unfortunately extremely complex to achieve because of the lack of standards and the diversity of version range syntaxes. Yet it is also very useful and important, because we rely on versioning to understand which versions come before and after. This is key to resolving package dependencies and understanding whether a version of a package falls in a vulnerable version range.

We’re not the only ones to understand the importance of this. There’s a project at the Linux Foundation under the Open Source Security Foundation (OSSF) and maintained by Google that is called the Open Source Vulnerability Database (OSV), which has very similar goals to VulnerableCode. We’re collaborating and OSV reuses some of the code from univers to better understand how all of these versions compare and fit together.

One Ver to rule them all?

Having designed libraries and specs and built tools to handle arbitrarily complex versions and version ranges, we came to the conclusion that the only benefit of SemVer is that it’s familiar and well understood by everyone. But it is a fallacy to believe that we can faithfully implement SemVer.

This is because of Hyrum’s Law, which states:

With a sufficient number of users of an API, it does not matter what you promise in the contract: all observable behaviors of your system will be depended on by somebody.

Hyrum’s Law means that if you have enough users, every little bug and every internal behavior of your tool will eventually be considered an API or a feature by some users. What we think of as the API is not necessarily what users treat as the API. That applies to any interface that we may design.

xkcd #1172, “Workflow”: https://xkcd.com/1172/

It may be technically possible in some cases to let the system, tool, or code decide when there’s an API change, but it would be extremely complex in general as determining this requires intelligence and detailed understanding of how a library or tool behaves and may be used. This is very hard for humans and harder for machines!

Versions are necessary and useful. They carry at least one meaning, which is to state if this version came before or after a piece of code. But applying more weight to their meaning distracts from what they’re meant to convey.

Jacob Tomlinson, a software engineer at NVIDIA, wrote a blog post on why he sometimes regrets using CalVer:

By releasing software that uses SemVer you are signaling to your community that you have some constraints. In my opinion, CalVer signals to your community that anything could happen at any time and that you have no interest in the effect that has on your users.

We recognize that, unfortunately, all version schemes that pretend to carry special meaning are mostly misleading, or not true to what they’re trying to convey, whether it is date changes or API-based changes. But even if a SemVer version may be misleading, it’s very important to have a number that humans can refer to and easily remember. There are some good reasons to implement CalVer, but for the AboutCode team, we found that a modified approach to SemVer was the best way to convey version obsolescence to our users.

With this approach, we acknowledge that any attempt to decide correctly whether a new release should increase the major, minor, or patch part of the SemVer version is condemned to be incorrect or misleading more often than not. Instead, we are making decent guesses about whether a new version is major, minor, or patch, but this is mostly stating that this is a big release with big changes or a smaller release. We have also introduced an “output format version” to ScanCode Toolkit that is loosely based on SemVer and lives on its own schedule. It is based strictly on the output data structure in the JSON format; changes to this format are simpler to evaluate as breaking the API or not, and we may be truer to the SemVer spirit in this case.

License Clarity Scoring in ScanCode

· 5 min read
AboutCode team
Open source for open source

When automating SCA, License Clarity Scoring helps determine if scan results require more review.


When automating Software Composition Analysis (SCA) with a scanning tool, you need to quickly evaluate the results – especially to determine whether or not the results require a deeper investigation.

ScanCode now includes License Clarity Scoring to provide users with a confidence level regarding the automated scan results.

License Clarity is a set of criteria that indicate how clearly, comprehensively and accurately a software project has defined and communicated the licensing that applies to the project software. Note that this is not an indication of the license clarity of any software dependencies.

License Clarity Scoring in ScanCode uses that series of criteria to then rank how well a software project provides licensing information.


Declared License Expression

declared_license_expression is the primary license expression as determined from the declaration(s) of the authors of the package.

The new summary fields are:

  • declared_license_expression
  • declared_holder
  • primary_language
  • other_license_expressions
  • other_holders
  • other_languages

Note that the term declared_license_expression is used equivalently for the concept of a primary license expression in order to align with community usage, such as SPDX.

Here is how ScanCode determines the value of the declared_license_expression, declared_holder, and primary_language of a package when it scans a codebase (a simplified sketch follows these steps):

  1. Look at the root of a codebase to see if there are any package manifest files that have origin information.
  2. If there is package data available, collect the license expression, holder, and package language and use that information as the declared_license_expression, declared_holder, and primary_language.
  3. If there are multiple package manifests at the codebase root, then concatenate all of the license expressions and holders together and use those concatenated values to construct the declared_license_expression and declared_holder.
  4. If there is no package data, then collect license and holder information from key files (such as LICENSE, NOTICE, README, COPYING, and ADDITIONAL_LICENSE_INFO). Try to find the primary license from the licenses referenced by the key files. If unable to determine a single license that is the primary, then concatenate all of the detected license expressions from key files together and use that as a conjunctive declared_license_expression. Concatenate all of the detected holders from key files together as the declared_holder.
  5. Note that a count of how many times a license identifier occurs in a codebase does NOT necessarily identify a license that appears in the (primary) declared_license_expression due to the typical inclusion of multiple third-party libraries that may have varying standards for license declaration. It is possible that the declared_license_expression constructed by this process may not appear literally in the codebase.
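Here is a simplified, illustrative sketch of that process; the helper functions are hypothetical placeholders, not the actual ScanCode implementation:

```python
# A simplified, illustrative sketch of the process above. The helpers are
# hypothetical placeholders, not the actual ScanCode implementation.

def compute_declared_license_expression(codebase):
    # Steps 1-3: prefer package manifests found at the codebase root.
    manifests = find_root_package_manifests(codebase)
    if manifests:
        return " AND ".join(m.license_expression for m in manifests)

    # Step 4: fall back to key files (LICENSE, NOTICE, README, COPYING...).
    key_files = find_key_files(codebase)
    primary = find_primary_license(key_files)
    if primary:
        return primary
    # No single primary license: combine all detections conjunctively.
    return " AND ".join(f.license_expression for f in key_files)
```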

As of DejaCode 4.2, you can also access the new license clarity scoring fields and summary fields in the Scan tab of the Package details user view.

When you scan a Package from DejaCode, you can view the Scan Results in a Scan tab on the Package details user view. DejaCode presents a selection of scan details with an emphasis on license detection. You can also download the complete Scan Results in .json format.

You can set the values from declared_license_expression, declared_holder, and primary_language to the package definition in DejaCode.


License Clarity Scoring

The license clarity score is a value from 0-100 calculated by combining the weighted values determined for each of the scoring elements: Declared license, Identification precision, License texts, Declared copyright, Ambiguous compound licensing, and Conflicting license categories.

Declared license (Scoring weight = 40)

When true, indicates that the software package licensing is documented at top-level or well-known locations (key files) in the software project, typically in a package manifest, NOTICE, LICENSE, COPYING or README file.

Identification precision (Scoring weight = 40)

Identification precision indicates how well the license statement(s) of the software identify known licenses that can be designated by precise keys (identifiers) as provided in a publicly available license list, such as the ScanCode LicenseDB, the SPDX license list, the OSI license list, or a URL pointing to a specific license text in a project or organization website.

License texts (Scoring weight = 10)

License texts are provided to support the declared license expression in files such as a package manifest, NOTICE, LICENSE, COPYING or README.

Declared copyright (Scoring weight = 10)

When true, indicates that the software package copyright is documented at top-level or well-known locations (key files) in the software project, typically in a package manifest, NOTICE, LICENSE, COPYING or README file.

Ambiguous compound licensing (Scoring negative weight = -10)

When true, indicates that the software has a license declaration that makes it difficult to construct a reliable license expression, such as in the case of multiple licenses where the conjunctive versus disjunctive relationship is not well defined.

Conflicting license categories (Scoring negative weight = -20)

When true, indicates the declared_license_expression of the software is in the permissive category, but that other potentially conflicting categories, such as copyleft and proprietary, have been detected in lower level code.
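Putting the weights together, the score computation itself is straightforward; this sketch assumes the result is clamped to the 0-100 range:

```python
# A simplified sketch of combining the weighted scoring elements.
# Clamping the result to 0-100 is an assumption.
WEIGHTS = {
    "declared_license": 40,
    "identification_precision": 40,
    "license_texts": 10,
    "declared_copyright": 10,
    "ambiguous_compound_licensing": -10,
    "conflicting_license_categories": -20,
}

def clarity_score(elements: dict) -> int:
    """elements maps each scoring element name to True or False."""
    raw = sum(WEIGHTS[name] for name, present in elements.items() if present)
    return max(0, min(100, raw))

print(clarity_score({
    "declared_license": True,
    "identification_precision": True,
    "license_texts": True,
    "declared_copyright": True,
    "ambiguous_compound_licensing": False,
    "conflicting_license_categories": False,
}))  # 100
```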


Want to see License Clarity Scoring in action? Download ScanCode.io or sign up for a free DejaCode account.

ScanCode provides the license clarity score when you specify the --summary option for a scan. ScanCode.io specifies that option for you automatically.

DejaCode makes it even easier and specifies all the scan options that you need automatically when you request a package scan.

Using Copyleft-licensed software components in a Java application

· 7 min read
AboutCode team
Open source for open source

Key considerations while using Copyleft-licensed software components in a Java application.


This document explains some key considerations for the use of Copyleft-licensed software components in a Java application in two contexts:

  • Execution of the Java code in a shared JVM.
  • Combining class files in a shared executable JAR – and by extension in a Combined JAR (e.g. uber-jar or fat jar).

For this document, “JAR” refers specifically to an executable Java library that is a collection of .class files packaged into a file with the .jar extension; it does not refer to the use of a .jar file as an archive file only (such as for packaging source files for a Java library).

The purpose of this document is to present a “conservative” interpretation of what linking or interaction may mean in the Java context. It is not based on any particular product or application, and we are not aware of any specific license compliance enforcement actions in this area.

“Strong” Copyleft-licensed Components

The execution of any software component licensed under GPL (or another “strong” Copyleft license such as AGPL, SleepyCat, etc.) in a JVM effectively links that component with all other software components in that JVM process and therefore those other components become subject to GPL license obligations including redistribution of source code.

The net impact of this interaction inside a JVM is that you should not Deploy any GPL-licensed code in a commercial Java-based product, unless that GPL-licensed code is executed in a separate JVM. This use case is possible, but quite rare in practice.

In such rare cases, the GPL-licensed component should be used as-is and un-modified.

If a modification is absolutely required, the purpose of the modification must not be to expose some privileged way to communicate with this library from proprietary code such as exposing a socket interface or other API for the sole benefit of avoiding a direct call to the Copyleft-licensed library.

Such modifications would be considered as essentially similar to running the Copyleft-licensed library in the same JVM process and making direct calls so that the Copyleft obligation would still apply.

“Limited” Copyleft-licensed Components

Any code included within a JAR can be considered to be statically linked with any other code in that JAR, even though strictly-speaking there is no such concept of “static linking” in Java technology.

The primary logic here is that a JAR is an executable program and all of the files inside it interact within that context.

Clearly there are many programming-level differences between:

  1. the process of compiling and linking C/C++ source files into an executable program and
  2. the process of converting .java or other source files (such as Scala) into .class files and packaging them into a JAR.

But there are more similarities than differences. The net impact of this interaction inside a JAR is that you should not deploy any Copyleft-licensed code in a JAR in combination with any proprietary code.

The impact of software interaction of .class files within a JAR varies according to the specific subtype of limited Copyleft license. There are three primary subtypes to consider:

  1. LGPL
  2. GPL with Classpath Exception
  3. “Public” or file-based licenses (CDDL, EPL, MPL)

1) LGPL

The LGPL version 2 and version 3 licenses are quite different, but in both cases there are specific terms and conditions related to software interaction and these provide the strongest case that combining .class files in an executable .jar is a form of static linking.

2) GPL with Classpath Exception

This license permits static linking of “independent modules”, but it may be hard to argue that .class files combined into a single JAR are independent.

3) “Public” or file-based licenses (CDDL, EPL, MPL)

The Copyleft impact of these licenses is primarily limited to the file level, so this is the best case for arguing that you can combine class files into one JAR without Copyleft impact.

For a component licensed under any of the Limited Copyleft licenses, you do have the option to dynamically link separate libraries (JARs) within a JVM. This is different from GPL-licensed code, as described above, because you can dynamically link libraries under a Limited Copyleft license inside a JVM without a Copyleft impact on other libraries.

The recommended best practice is to Deploy any Java library under a Limited Copyleft license as a separate “dynamic” library as provisioned from the original OSS project. This is the best way to avoid Copyleft impact.

Combined JARs: uber-jars, mega-jars and fat-jars

Java code is typically packaged and redistributed as pre-compiled .class files assembled in one or more JAR libraries. Open source Java libraries are commonly downloaded at build time from a repository such as Maven (either a private or the Maven Central public repository).

The process of creating a Combined JAR is to combine the .class files from all of the third-party dependency JARs together with proprietary-licensed .class files in a single JAR. This larger Combined JAR mixes open source (and possibly Copyleft-licensed code) and proprietary code in a single JAR.

Creating larger Combined JARs is typically automated as part of a product build. Maven-based build plugins and tools include Maven Shade, one-jar, fat jar and others.

In most cases, this is an addition to the build that is easily reversed to revert to a multi-jar deployment approach. The technical purpose of building a Combined JAR may be to:

  • Simplify the deployment or configuration of some larger Java applications by reducing the number of .jar libraries to be deployed.
  • Simplify runtime configuration. In particular the Java class paths do not need to be configured to reference the dependencies since they are all contained in a single executable library.
  • Accelerate initial loading of the application in the JVM where startup time is critical for the application. This acceleration is likely to be minimal.

In addition to the Copyleft interaction issues outlined above, some other disadvantages of using Combined JARs are:

  • In the process of creating a Combined JAR, some common files with the same name and path (such as NOTICE, LICENSE) may be overwritten in a Combined JAR. Only one copy of each such file will exist in the Combined JAR. The terms of most open source licenses do not permit you to remove license or notice files.
  • The repackaging of un-modified JARs in a Combined JAR could be considered to be a modification. Most Copyleft licenses require you to track and document changes so this repackaging may require additional documentation work for the product team.
  • Tracing the package-version of an individual third-party component included in a Combined JAR may be difficult, which in turn may make it difficult to comply with Copyleft license conditions that require an offer to redistribute package-version-specific source code.
  • When updating software, the entire Combined JAR will need to be rebuilt even if most individual third-party packages are unchanged. In particular, if a single third-party component JAR needs to be updated for a vulnerability, bug, or new feature fix, then the whole Combined JAR needs to be redistributed to customers.
  • If several larger Combined JARs are created in a product, the resulting size of the executables may be larger, as the contents of every shared third-party JAR will be duplicated in each Combined JAR instead of being shared across modules. Thus, a Combined JAR can impede the possibility and flexibility of Java library reuse.

In general, Combined JARs are best suited for Deployment of Java applications in an internal system/IT- or SaaS-only use case where some of their benefits are measurable and there are fewer issues related to license compliance and Copyleft-licensed component interaction.

When used in a commercial product that is distributed in any way, the issues attached to larger combined JARs usually outweigh any technical benefits that they may offer.