Uncovering thousands of unique secrets in PyPI packages

Security researcher Tom Forbes worked with the GitGuardian team to analyze all the code committed to PyPI packages and surfaced thousands of secrets.

Let's start with the big reveal of what we found:

  • 3,938 total unique secrets across all projects

  • 768 of those unique secrets were found to be valid

  • 2,922 projects contained at least one unique secret

To put those numbers in perspective, there are over 450,000 projects released through the PyPI website, containing over 9.4 million files. There have been over 5 million released versions of these packages. If we add up all the secrets shared across all the releases, we found 56,866 occurrences of secrets, meaning once a secret enters a project, it is often included in multiple releases.
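As a rough illustration, the ratios implied by these figures can be worked out in a few lines of Python. The numbers come from this article; the variable names are only for illustration, and because the project count is quoted as "over 450,000", the last ratio is an approximate upper bound.

```python
# Figures quoted in this article.
unique_secrets = 3_938
valid_secrets = 768
projects_with_secrets = 2_922
total_occurrences = 56_866
total_projects = 450_000  # quoted as "over 450,000", so derived values are approximate

# Share of unique secrets that were still valid when checked.
valid_share = valid_secrets / unique_secrets

# How many times, on average, each unique secret appears across releases.
occurrences_per_secret = total_occurrences / unique_secrets

# Share of all projects that contained at least one unique secret.
affected_project_share = projects_with_secrets / total_projects

print(f"Valid share of unique secrets: {valid_share:.1%}")
print(f"Average occurrences per unique secret: {occurrences_per_secret:.1f}")
print(f"Share of projects with a secret: {affected_project_share:.2%}")
```

The average of roughly 14 occurrences per unique secret is what the "included in multiple releases" observation looks like in numbers.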

What was leaked?

Across all scanned projects, 151 individual types of secrets were detected. Everything from AWS keys to Redis credentials was found among the releases.

Distinct secrets by detector over time

There were a few notable trends we discovered while examining the data.

  • The number of leaked, valid Telegram bot tokens more than doubled in early 2021 and multiplied again in Spring 2023.

  • Google API key leaks have grown steadily over time, apart from a very large spike that occurred in mid-2020.

  • The overall number of leaked database credentials rose sharply in 2022.

PyPI

The Python Package Index, better known as PyPI, is the official third-party package management system for the Python community. It allows package developers to share their code and allows developers worldwide to add functionality to their projects without reinventing the wheel every time. More and more, these open source packages end up in production codebases. As part of the software supply chain, they make up to an estimated 90% of the code run in production.

But just like all code, there is the potential for security issues. Given how widely distributed these packages are and how commonly they are embedded in production code, any small security vulnerability can have far-reaching effects. This is especially true of hardcoded credentials, as many eyes can immediately access your private systems. In 2023, leaked credentials became the leading root cause of initial breaches, as reported by Sophos earlier this year.

Projects vs Releases

If you are not familiar with PyPI, you might be wondering what constitutes a project and how that differs from a release or a file. Here is how the PyPI project defines these terms.

A "project" on PyPI is the name of a collection of releases and files, and information about them. Projects on PyPI are made and shared by other members of the Python community so that you can use them.

A "release" on PyPI is a specific version of a project. For example, the requests project has many releases, like "requests 2.10" and "requests 1.2.1". A release consists of one or more "files."

A "file," also known as a "package" on PyPI, is something that you can download and install. Because of different hardware, operating systems, and file formats, a release may have several files (packages). In PyPI, there are different types of release files: "source" and "wheel" files. Wheels are the pre-packaged distributions that just need to be moved to the correct file location to use. Source files are built from the same source code, but they can, and often do, contain different contents.
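To make the project, release, and file hierarchy concrete, here is a minimal sketch that counts the files in each release by type. It assumes the shape of PyPI's public JSON API response (a "releases" mapping from version to a list of file entries, each with a "packagetype" such as "sdist" or "bdist_wheel"). The helper names and the sample data are hypothetical.

```python
import json
from collections import Counter
from urllib.request import urlopen


def fetch_project_metadata(project: str) -> dict:
    """Download a project's metadata from PyPI's JSON API (needs network access)."""
    with urlopen(f"https://pypi.org/pypi/{project}/json") as response:
        return json.load(response)


def count_files_by_type(metadata: dict) -> dict:
    """Map each release version to a Counter of its files by package type."""
    summary = {}
    for version, files in metadata.get("releases", {}).items():
        summary[version] = Counter(f.get("packagetype", "unknown") for f in files)
    return summary


# Hand-written sample shaped like the "releases" part of the API response.
sample_metadata = {
    "releases": {
        "1.0": [
            {"filename": "demo-1.0.tar.gz", "packagetype": "sdist"},
            {"filename": "demo-1.0-py3-none-any.whl", "packagetype": "bdist_wheel"},
        ],
        "1.1": [
            {"filename": "demo-1.1.tar.gz", "packagetype": "sdist"},
        ],
    }
}

file_counts = count_files_by_type(sample_metadata)
for version, counts in file_counts.items():
    print(version, dict(counts))
```

For a real project, `count_files_by_type(fetch_project_metadata("requests"))` would give the same kind of summary, one entry per release.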

Valid vs All

Any leaked secret brings some risk, but the most immediate and critical threats come from valid credentials that malicious actors can still exploit. Validating a credential is a crucial step when investigating any incident involving credentials. The primary tool for this research was ggshield, the GitGuardian CLI. GitGuardian looks for over 400 types of secrets, both specific detectors and generic patterns. Built into the secrets detection engine are validators, which can quickly check if over 190 specific types of credentials still work or not. Not all secrets can be checked for validity, but when they can be verified, GitGuardian makes the least intrusive call possible.

For PyPI packages specifically, there is a command, ggshield secret scan pypi, which will scan any Python package for secrets. If the named package is not present on the local file system, it will download the named project from PyPI.org. This can save a good amount of time and headaches if you are working with or making Python packages regularly.
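If you want to call that command from a script, a thin wrapper might look like the sketch below. It assumes ggshield is installed, on your PATH, and already authenticated with a GitGuardian API key. The function names are hypothetical, and the meaning of individual exit codes is not spelled out here, so a non-zero code is simply treated as "needs attention".

```python
import shutil
import subprocess


def build_scan_command(package: str) -> list:
    """Build the argument list for scanning one PyPI package with ggshield."""
    return ["ggshield", "secret", "scan", "pypi", package]


def scan_pypi_package(package: str) -> int:
    """Run the scan and return ggshield's exit code."""
    if shutil.which("ggshield") is None:
        raise RuntimeError("ggshield is not installed or not on PATH")
    completed = subprocess.run(build_scan_command(package), check=False)
    return completed.returncode


if __name__ == "__main__":
    # "requests" is only an example project name.
    exit_code = scan_pypi_package("requests")
    print("scan finished with exit code", exit_code)
```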

Surprising types of valid secrets detected

While any leaked credential will be problematic, several types of keys can immediately lead to broader access for an attacker or direct access to your data. Some of the more shocking secrets we discovered as valid included:

  • Azure Active Directory API Keys

  • GitHub OAuth App Keys

  • Database credentials for providers such as MongoDB, MySQL, and PostgreSQL.

  • Dropbox Key

  • Auth0 Keys

  • SSH Credentials

  • Coinbase Credentials

  • Twilio Master Credentials

Unable to validate does not mean invalid

It is important to note that just because a credential cannot be validated does not mean it should be considered invalid. Only once a secret has been properly rotated can you know that it is invalid. Some types of secrets GitGuardian is still working toward automatically validating include HashiCorp Vault Tokens, Splunk Authentication Tokens, Kubernetes Cluster Credentials, and Okta Tokens.

The State of PyPI Secrets Sprawl

A growing problem

The number of secrets being added to PyPI is growing steadily over time. The addition of fresh, new, and valid credentials to PyPI is also steadily increasing. Unfortunately, this finding aligns with what we have seen in the GitGuardian State of Secrets Sprawl report, year over year. In the last year alone, the research shows over 1,000 unique secrets have been added via new projects and commits on PyPI.

Unique secrets added over time

Types of files containing leaked credentials

The number one file type that contained a hardcoded credential was, of course, .py files. Aside from that most common Python extension, most valid secrets seem to be stored in configuration and documentation files such as .json and .yml files. We also found 209 unique secrets in README files, and 675 unique secrets in files in some form of 'test' folder. The distribution of valid secrets is heavily biased towards source distributions as well.

Most common types of files other than .py containing a hardcoded secret in PyPI packages

Why are we seeing leaked secrets in PyPI packages?

Like most hardcoded credential events, leaked secrets in PyPI are usually the result of an accident. Just as it is all too easy to make a private repo public, it takes only a few wrong keystrokes to push a package intended for internal use into public availability. In the course of outreach for this project, we discovered at least 15 incidents where the publisher was unaware they had made their project public. Without naming any names, we did want to mention that some of these were from very large companies that have robust security teams. Accidents can happen to anyone.

A more common issue than a whole package being made public is individual files being published by accident. This is evidenced by the fact that, in many cases, when someone publishes a secret, they quickly publish a new release to remove the file.

For example, the chatllm project leaked 209 OpenAI keys, now invalidated, in an internal markdown file in one release but removed the file from all further releases. Similarly, an early release of safire leaked 320 Google Cloud keys, all since revoked, in a directory that is empty in all other releases.

Average days a package with credentials is present

While this is an encouraging trend, pointing to the fact that more and more developers are aware of the problem with hardcoded secrets, each of these cases still leaked a secret that will require rotation and possibly further remediation. Just as adding a new commit to remove a credential does not remove the original secret contained in the previous commit, just releasing a new version does not remedy the situation.

In many cases, though, the data suggests developers are unaware of these issues. PyPI has no simple way to view the contents of a package nor any tool to detect meaningful differences in the package contents. Any bugs, configuration mistakes, or other issues might mean a single release file contains secrets, but others do not. It is simply hard to tell as a developer.

What are the risks of exposing secrets in open source packages?

Exposing secrets in open-source packages carries significant risks for developers and users alike. Attackers can exploit this information to gain unauthorized access, impersonate package maintainers, or manipulate users through social engineering tactics.

One notable risk is typosquatting, where malicious packages with similar names to legitimate ones are uploaded to trusted registries. This can lead to the theft of sensitive information or the hijacking of computing resources. Compromised build and deployment pipelines can also expose additional credentials, resulting in unauthorized access and the potential abuse of cloud resources.

To learn more about how the maintainers of Bokeh, one of the largest open-source Python packages hosted on PyPI, mitigate these risks, you can read about their strategy here.

Like other developers, core maintainers may not be application security experts. They are not particularly immune to introducing bugs in their code or inadvertently committing secrets in their project's git repository.

Bryan Van de Ven, co-creator and core maintainer of the Bokeh project

Yanking is not an effective strategy

In PyPI, there is a mechanism called yanking, where a release is marked to be ignored by an installer unless it is the only release that matches a version specifier. While this makes it unlikely that anyone will install the yanked release unintentionally, it does not make the code unavailable, as the name might lead someone to believe. Only 300 releases have been yanked, compared to 56,566 releases containing secrets.

Contrary to what the name implies, "yanking" a package does not actually remove the package from PyPI. The file is still downloadable if you have the URL. Files are only 100% removed from PyPI if they have known malicious code. Anything else is there forever.
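The installer behavior described above can be modeled in a few lines. This is a simplified sketch of the rule as stated in this article, not the actual resolver logic of pip or any other installer. The `Release` class, the `candidates` function, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Release:
    version: str
    yanked: bool
    url: str  # the file stays downloadable at its URL even when yanked


def candidates(releases: List[Release], matches: Callable[[str], bool]) -> List[Release]:
    """Return the releases an installer would consider for a version specifier.

    Yanked releases are ignored unless nothing else matches the specifier.
    """
    matching = [r for r in releases if matches(r.version)]
    not_yanked = [r for r in matching if not r.yanked]
    return not_yanked or matching


releases = [
    Release("1.0", yanked=False, url="https://files.example.invalid/demo-1.0.tar.gz"),
    Release("1.1", yanked=True, url="https://files.example.invalid/demo-1.1.tar.gz"),
    Release("1.2", yanked=False, url="https://files.example.invalid/demo-1.2.tar.gz"),
]

# No version constraint: the yanked 1.1 release is skipped.
unpinned = candidates(releases, lambda version: True)

# Exact pin on 1.1: the yanked release is the only match, so it is still used.
pinned = candidates(releases, lambda version: version == "1.1")

print([r.version for r in unpinned])
print([r.version for r in pinned])
```

Either way, the yanked file keeps its URL, which is why yanking does not help once a secret has shipped.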

Publishing tools lack sensible defaults for ignoring files

Another common reason why so many secrets are being leaked in PyPI is the lack of safeguards for files you should not include in a distribution. It is easy to assume that if you use .gitignore to keep these files out of your git history, then you will be safe. But unless you take extra setup steps and safeguards like using setuptools-git, Python does not honor your .gitignore when a package is built.

This includes files such as your local configuration files, like .cookiecutterrc. It also includes .env and .pypirc files. The research uncovered 43 .pypirc files containing unique PyPI publishing credentials! These are certainly things you don't want out in public.
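One way to catch these files is to inspect the built distribution before uploading it. The sketch below lists the members of a wheel (a zip file) or a source distribution (a tarball) and flags the file names mentioned above. The `RISKY_NAMES` set and the function names are hypothetical, and a real check would likely need a longer list.

```python
import tarfile
import zipfile
from pathlib import PurePosixPath

# File names called out in this article; extend the set to suit your project.
RISKY_NAMES = {".env", ".pypirc", ".cookiecutterrc"}


def list_archive_members(path: str) -> list:
    """Return the member paths of a wheel (zip) or sdist (tar) archive."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as archive:
            return archive.namelist()
    with tarfile.open(path) as archive:
        return archive.getnames()


def find_risky_files(members) -> list:
    """Return the members whose file name is in RISKY_NAMES."""
    return [m for m in members if PurePosixPath(m).name in RISKY_NAMES]


# Example with a hand-written member list, as an sdist might contain.
example_members = [
    "demo-1.0/setup.py",
    "demo-1.0/demo/__init__.py",
    "demo-1.0/.env",
    "demo-1.0/.pypirc",
]
flagged = find_risky_files(example_members)
print(flagged)
```

Running `find_risky_files(list_archive_members(path))` on each file in your `dist/` directory before publishing would apply the same check to real build output.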

How to not leak secrets in PyPI

Almost all modern, functional software needs to communicate with another service or retrieve data in order to deliver value. The ongoing need to connect to more and more services, unfortunately, has meant a steady increase in the number of leaked secrets, but it does not need to continue to be an upward trend for you or your PyPI packages.

Avoid any and all unencrypted credentials

The easiest way to prevent a secret from ever making it into your PyPI package is to never add it in plaintext to the code base to begin with. We realize this is much easier said than done, but fortunately, there are tools like python-dotenv which make it easy to programmatically call read-only values from .env files stored outside of version control.
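A minimal sketch of that approach is shown below, assuming python-dotenv is installed and a .env file sits next to your code but outside version control. The helper names and the `EXAMPLE_API_KEY` variable are hypothetical. The import is wrapped so the sketch still runs, reading only from the process environment, when python-dotenv is not available.

```python
import os

try:
    # Provided by the third-party python-dotenv package.
    from dotenv import load_dotenv
except ImportError:
    load_dotenv = None


def load_local_env() -> None:
    """Load variables from a local .env file into os.environ, if possible."""
    if load_dotenv is not None:
        load_dotenv()


def get_secret(name: str) -> str:
    """Read a secret from the environment instead of hardcoding it."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required secret: {name}")
    return value


if __name__ == "__main__":
    load_local_env()
    # EXAMPLE_API_KEY is a placeholder name; it would be defined in .env.
    api_key = get_secret("EXAMPLE_API_KEY")
    print("Loaded a key of length", len(api_key))
```

Failing loudly when a value is missing makes it less tempting to paste a fallback key into the source.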

Even better than a local .env file would be to use a cloud secrets manager, such as AWS Secrets Manager, Google Cloud's Secret Manager, or Azure Key Vault. These secrets managers are built into the three most popular cloud providers and can be used to create and use secrets across cloud infrastructure.

Scan for secrets before any release

If you accidentally add a plaintext credential to your code, you want to find it before sharing it. This is where automatic scanning can come in. While manual code reviews can find some issues, human error means some amount will get through. What you want is automation to scan every new line of code that makes it into your project.

Automated secrets scanning can make sure any and all commits that will be included in your next release are free of any secrets. Adding an automated scanning step to the merge request process is straightforward with tools like GitLab CI/CD or GitHub Actions and ggshield.

Don't let your secrets leave your machine

The absolute best place to find a leaked secret is locally before any commit is made or before the code is pushed to your shared repo. Similar to how ggshield can be leveraged in the PR process, you can also automate its use locally through the pre-commit and pre-push git hooks. Aside from just finding the secret, it will also provide information such as type, number of occurrences, and if the secret is valid.

Share your code, not your secrets

PyPI is a fantastic community filled with helpful code that can save you time and make you more productive. We encourage anyone building in Python to consider donating useful code back to the community and to support open source whenever possible. We hope you found these results illuminating and take them as a call to action to focus on secrets management in your code and project releases.

No matter what you are coding or how you plan to distribute it, GitGuardian is here to help make sure you do it safely. Our Secrets Detection platform is free for individuals, for open source contributions, and for any organization with fewer than 25 developers. Make sure you are scanning for those secrets in your shared repos and locally to keep your secrets from sprawling.

----------------

Tom Forbes - Python developer living and working in London
Author | tomforb.es
