Curling CPython around PVS-Studio

Python, a programming language that needs no special introduction. It has rightly earned the title of "The best Excel" for user-friendly handling of big data. Game developers love it for its seamless integration with C and C++ code. It also has a low entry barrier! Introduction In this article, we'll look at the main and most widely used Python interpreter, CPython. As the name suggests, it's written in C. We checked the project almost nine years ago, and even that check required some updates. Since then, the language has evolved a lot with the Python 3 release series, and the components under the hood of the interpreter have changed along with it. The Python/C API has also expanded, each of the interpreter built-in modules has been rewritten several times, and some have faded into history after the DeprecationWarning stage. Then, one day... PVS-Studio installation Let's start by installing PVS-Studio. You can download the installer from this page. A license is required to run the analysis. The trial version is easy to get—a single click and a quick tap on the keyboard is all it takes. Installing the analyzer is also a simple task: the only extra step is integrating it with Visual Studio 2022. At the time of writing, the version of the PVS-Studio analyzer is 7.35. Build and analysis configuration Then one day, the upcoming CPython 3.13.3 release, which we've taken on the e6dfa9d commit, came to our attention. I usually check projects in the Debug configuration. However, this time, I was curious and wanted to run the analysis in the Release configuration to see what happens in the interpreter shipped to the end user. Modern CPython 3 versions require a bit of tinkering before running the build on Windows: go to the PCbuild folder and run the get_externals.bat file there. The third-party dependencies needed to build the interpreter will be downloaded. Analysis of detected errors The pcbuild solution file, configured for a complete CPython build, is in the same PCbuild folder. When the get_externals.bat batch file is run, the console lists the folder names of third-party dependencies: D:\Git\cpython\PCbuild>get_externals.bat Using py -3.12 (found 3.12 with py.exe) Fetching external libraries... Fetching bzip2-1.0.8... Fetching mpdecimal-4.0.0... Fetching sqlite-3.45.3.0... Fetching xz-5.2.5... Fetching zlib-1.3.1... Fetching external binaries... Fetching libffi-3.4.4... Fetching openssl-bin-3.0.15... Fetching tcltk-8.6.15.0... Finished. You don't need to list every folder you want to exclude from the analysis: just add an externals folder that is one level above. To exclude it from the analysis when working in Visual Studio, go to the following section of the settings: Extensions > PVS-Studio > Options > Don't Check Files. This isn't the only way to change folders excluded from the analysis—you can find more options in our documentation. For convenience, I excluded third-party dependencies via the absolute path: D:\Git\cpython\externals\ Well, let's dig in... MD5 inspires confidence... The PVS-Studio warnings: V522 There might be dereferencing of a potential null pointer 'p'. Check lines: 1174, 1173. Hacl_Hash_MD5.c 1174 #ifndef KRML_HOST_MALLOC # define KRML_HOST_MALLOC malloc #endif Hacl_Streaming_MD_state_32 *Hacl_Hash_MD5_malloc(void) { .... Hacl_Streaming_MD_state_32 *p = (Hacl_Streaming_MD_state_32 *) KRML_HOST_MALLOC(sizeof (Hacl_Streaming_MD_state_32)); p[0U] = s; // fut_state != STATE_PENDING) { // asyncio_InvalidStateError, "invalid state"); return NULL; } if (PyExceptionClass_Check(exc)) { exc_val = PyObject_CallNoArgs(exc); if (exc_val == NULL) { return NULL; } if (fut->fut_state != STATE_PENDING) { // asyncio_InvalidStateError, "invalid state"); return NULL; } } .... } This is a pretty common case. There's the fut->fut_state check: if (fut->fut_state != STATE_PENDING) { // asyncio_InvalidStateError, "invalid state"); xreturn NULL; } We have a plain text notification of the invalid state, and now we want to check whether an exception should be thrown... and we check the fut->fut_state value again? Yet the first check shows that there's STATE_PENDING! As a result, an extra reference to the exception object stays there. Let's shorten and simplify the code: if (PyExceptionClass_Check(exc)) { exc_val = PyObject_CallNoArgs(exc); if (exc_val == NULL) { return NULL; } Py_DECREF(exc_val); PyErr_SetString(state->asyncio_InvalidStateError, "invalid state"); return NULL; } Now we're good. Here's a collection of similar warnings: V547 Expression 'start_state != deque->state' is always false. _collectionsmodule.c 1156 V547 Expression 'start_state != deque->state' is always false. _collectionsmodule.c 1190 V54

Apr 18, 2025 - 13:20
 0
Curling CPython around PVS-Studio

Python, a programming language that needs no special introduction. It has rightly earned the title of "The best Excel" for user-friendly handling of big data. Game developers love it for its seamless integration with C and C++ code. It also has a low entry barrier!

1246_CurlingCPythonAround/image1.png

Introduction

In this article, we'll look at the main and most widely used Python interpreter, CPython. As the name suggests, it's written in C. We checked the project almost nine years ago, and even that check required some updates. Since then, the language has evolved a lot with the Python 3 release series, and the components under the hood of the interpreter have changed along with it. The Python/C API has also expanded, each of the interpreter built-in modules has been rewritten several times, and some have faded into history after the DeprecationWarning stage. Then, one day...

PVS-Studio installation

Let's start by installing PVS-Studio. You can download the installer from this page. A license is required to run the analysis. The trial version is easy to get—a single click and a quick tap on the keyboard is all it takes. Installing the analyzer is also a simple task: the only extra step is integrating it with Visual Studio 2022. At the time of writing, the version of the PVS-Studio analyzer is 7.35.

Build and analysis configuration

Then one day, the upcoming CPython 3.13.3 release, which we've taken on the e6dfa9d commit, came to our attention. I usually check projects in the Debug configuration. However, this time, I was curious and wanted to run the analysis in the Release configuration to see what happens in the interpreter shipped to the end user.

Modern CPython 3 versions require a bit of tinkering before running the build on Windows: go to the PCbuild folder and run the get_externals.bat file there. The third-party dependencies needed to build the interpreter will be downloaded.

Analysis of detected errors

The pcbuild solution file, configured for a complete CPython build, is in the same PCbuild folder. When the get_externals.bat batch file is run, the console lists the folder names of third-party dependencies:


D:\Git\cpython\PCbuild>get_externals.bat
Using py -3.12 (found 3.12 with py.exe)
Fetching external libraries...
Fetching bzip2-1.0.8...
Fetching mpdecimal-4.0.0...
Fetching sqlite-3.45.3.0...
Fetching xz-5.2.5...
Fetching zlib-1.3.1...
Fetching external binaries...
Fetching libffi-3.4.4...
Fetching openssl-bin-3.0.15...
Fetching tcltk-8.6.15.0...
Finished.

You don't need to list every folder you want to exclude from the analysis: just add an externals folder that is one level above. To exclude it from the analysis when working in Visual Studio, go to the following section of the settings: Extensions > PVS-Studio > Options > Don't Check Files. This isn't the only way to change folders excluded from the analysis—you can find more options in our documentation. For convenience, I excluded third-party dependencies via the absolute path:


D:\Git\cpython\externals\

Well, let's dig in...

MD5 inspires confidence...

The PVS-Studio warnings:

V522 There might be dereferencing of a potential null pointer 'p'. Check lines: 1174, 1173. Hacl_Hash_MD5.c 1174


#ifndef KRML_HOST_MALLOC
#  define KRML_HOST_MALLOC malloc
#endif

Hacl_Streaming_MD_state_32 *Hacl_Hash_MD5_malloc(void)
{
  ....
  Hacl_Streaming_MD_state_32
  *p = (Hacl_Streaming_MD_state_32 *)
        KRML_HOST_MALLOC(sizeof (Hacl_Streaming_MD_state_32));
  p[0U] = s;                                                   // <=
  ....
}

V575 The potential null pointer is passed into 'memcpy' function. Inspect the first argument. Check lines: 1415, 1414. Hacl_Hash_MD5.c 1415

V575 The potential null pointer is passed into 'memcpy' function. Inspect the first argument. Check lines: 1417, 1416. Hacl_Hash_MD5.c 1417


#ifndef KRML_HOST_CALLOC
#  define KRML_HOST_CALLOC calloc
#endif

Hacl_Streaming_MD_state_32 *
Hacl_Hash_MD5_copy(Hacl_Streaming_MD_state_32 *state)
{
  Hacl_Streaming_MD_state_32 scrut = *state;
  uint32_t *block_state0 = scrut.block_state;
  uint8_t *buf0 = scrut.buf;
  ....
  uint8_t *buf = (uint8_t *)KRML_HOST_CALLOC(64U, sizeof (uint8_t));
  memcpy(buf, buf0, 64U * sizeof (uint8_t));                         // <=
  uint32_t *block_state = (uint32_t *)KRML_HOST_CALLOC(4U, sizeof (uint32_t));
  memcpy(block_state, block_state0, 4U * sizeof (uint32_t));         // <=
  ....
}

It's gotten pretty clumpy. No, these three warnings aren't the main reason why MD5 was recognized as a non-crypto-resistant hash algorithm. The issue is different: allocating memory via calloc/malloc doesn't guarantee that a pointer to the allocated memory will be returned.

The memcpy function looks like a harmless little bug. It's so harmless that MSVC will unleash SDL checks on a programmer for using the function, and it'll be the right thing to do. In theory, memcpy may have enough time to corrupt the data in memory before tearing the interpreter apart because there's NULL instead of a buffer. The function implementation isn't required to read memory areas from the beginning, and if the passed size is large enough, the function can access the forbidden memory area before the program crashes.

...but control structures don't?

The PVS-Studio warning:

V547 Expression 'fut->fut_state != STATE_PENDING' is always false. _asynciomodule.c 506


static PyObject *
future_set_exception(asyncio_state *state, FutureObj *fut, PyObject *exc)
{
  PyObject *exc_val = NULL;

  if (fut->fut_state != STATE_PENDING) {           // <=
      PyErr_SetString(state->asyncio_InvalidStateError, "invalid state");
      return NULL;
  }

  if (PyExceptionClass_Check(exc)) {
      exc_val = PyObject_CallNoArgs(exc);
      if (exc_val == NULL) {
        return NULL;
      }
      if (fut->fut_state != STATE_PENDING) {      // <=
          Py_DECREF(exc_val);
          PyErr_SetString(state->asyncio_InvalidStateError, "invalid state");
          return NULL;
      }
  }
  ....
}

This is a pretty common case. There's the fut->fut_state check:


if (fut->fut_state != STATE_PENDING) {           // <=
    PyErr_SetString(state->asyncio_InvalidStateError, "invalid state");
    xreturn NULL;
}

We have a plain text notification of the invalid state, and now we want to check whether an exception should be thrown... and we check the fut->fut_state value again? Yet the first check shows that there's STATE_PENDING! As a result, an extra reference to the exception object stays there. Let's shorten and simplify the code:

if (PyExceptionClass_Check(exc)) {
    exc_val = PyObject_CallNoArgs(exc);
    if (exc_val == NULL) {
      return NULL;
    }
    Py_DECREF(exc_val);
    PyErr_SetString(state->asyncio_InvalidStateError, "invalid state");
    return NULL;
}

Now we're good. Here's a collection of similar warnings:

  • V547 Expression 'start_state != deque->state' is always false. _collectionsmodule.c 1156
  • V547 Expression 'start_state != deque->state' is always false. _collectionsmodule.c 1190
  • V547 Expression 'start_state != deque->state' is always false. _collectionsmodule.c 1450

The impossible is... still impossible

The PVS-Studio warning:

V547 Expression 'error_handler == _Py_ERROR_UNKNOWN' is always false. unicodeobject.c 15765

Careful, a great deal of context coming up:


static _Py_error_handler
get_error_handler_wide(const wchar_t *errors)
{
    if (errors == NULL || wcscmp(errors, L"strict") == 0) {
        return _Py_ERROR_STRICT;
    }
    if (wcscmp(errors, L"surrogateescape") == 0) {
        return _Py_ERROR_SURROGATEESCAPE;
    }
    if (wcscmp(errors, L"replace") == 0) {
        return _Py_ERROR_REPLACE;
    }
    if (wcscmp(errors, L"ignore") == 0) {
        return _Py_ERROR_IGNORE;
    }
    if (wcscmp(errors, L"backslashreplace") == 0) {
        return _Py_ERROR_BACKSLASHREPLACE;
    }
    if (wcscmp(errors, L"surrogatepass") == 0) {
        return _Py_ERROR_SURROGATEPASS;
    }
    if (wcscmp(errors, L"xmlcharrefreplace") == 0) {
        return _Py_ERROR_XMLCHARREFREPLACE;
    }
    return _Py_ERROR_OTHER;
}

static int
init_fs_codec(PyInterpreterState *interp)
{
    const PyConfig *config = _PyInterpreterState_GetConfig(interp);

    _Py_error_handler error_handler;
    error_handler = get_error_handler_wide(config->filesystem_errors);
    if (error_handler == _Py_ERROR_UNKNOWN) {                              // <=
        PyErr_SetString(PyExc_RuntimeError, "unknown filesystem error handler");
        return -1;
    }
    ....
}

If you smell burnt scales, you're on the right track :) If you can't smell them, fear not—we'll show you.

The get_error_handler_wide function returns anything but _Py_ERROR_UNKNOWN, which is what is being checked for. Exiting it is impossible with this result: all branch paths end in return with some error definition. Since we have a wide variety of story endings, why not try them all? Hmm...

typedef enum {
    _Py_ERROR_UNKNOWN=0,          // <=
    _Py_ERROR_STRICT,
    _Py_ERROR_SURROGATEESCAPE,
    _Py_ERROR_REPLACE,
    _Py_ERROR_IGNORE,
    _Py_ERROR_BACKSLASHREPLACE,
    _Py_ERROR_SURROGATEPASS,
    _Py_ERROR_XMLCHARREFREPLACE,
    _Py_ERROR_OTHER
} _Py_error_handler;

Oops, there's a secret an unused ending. One can speculate that instead of "other error" it should be "unknown error", but when to use "unknown error" if there's "other error"? Perhaps some code fragments, where this state is explicitly assigned, can answer this question, but they don't fit the case at hand.

1246_CurlingCPythonAround/image2.png

Objectively non-null

The PVS-Studio warning:

V547 Expression 'capi == NULL' is always false. _datetimemodule.c 7236

/* The C-API is process-global.  This violates interpreter isolation
 * due to the objects stored here.  Thus each of those objects must
 * be managed carefully. */
// XXX Can we make this const?
static PyDateTime_CAPI capi = {
    ....
}

/* Get a new C API by calling this function.
 * Clients get at C API via PyDateTime_IMPORT, defined in datetime.h.
 */
static inline PyDateTime_CAPI *
get_datetime_capi(void)
{
    return &capi;
}

static int
_datetime_exec(PyObject *module)
{
    ....
    /* At last, set up and add the encapsulated C API */
    PyDateTime_CAPI *capi = get_datetime_capi();
    if (capi == NULL) {                            // <=
        goto error;
    }
    ....
}

Take a look at the get_datetime_capi call. Judging by the code, the function returns a pointer to PyDateTime, and devs assumed that it could be null. Let's look at how it's implemented: it simply returns a pointer to a static variable in the global scope. There's no reason to think that the function won't have a value assigned to it when it's called. It may not have a correct value, but it will have an address. Also, declaring structures and functions for Python/C extensions as static is a common approach to writing them, there's no risk of losing anything. If you still have doubts, here comes the strongest argument—WinDbg, a debugger designed primarily to work at the Windows kernel level!

1246_CurlingCPythonAround/image3.png

To play fair, I didn't attach the debugger to a fully running process—instead, I launched Python under it. As you can see, the interpreter hasn't even displayed the prompt yet, but the capi static structure is already sitting in the front row. Conclusion: the null-pointer check can be safely removed from the code—it doesn't affect anything, unless the get_datetime_capi implementation suddenly decides to play some clever trick.

SDK kicks the can

The PVS-Studio message:

V547 Expression 't != u' is always false. _datetimemodule.c 5245

static long long
local(long long u)
{
    struct tm local_time;
    time_t t;
    u -= epoch;
    t = u;
    if (t != u) {
        PyErr_SetString(PyExc_OverflowError,
        "timestamp out of range for platform time_t");
        return -1;
    }
    ....
}

The discussion of this warning turned into a heated 38-minute argument. I'm not even joking to drum up your interest. What's the issue?

The reader has probably already guessed what I'm about to say about this:

t = u;
if (t != u) { .... }

And they'd be thinking along the right lines—like, what kind of nonsense is this? We're comparing a freshly assigned value to itself, and time_t is already long long! It's not that simple, though.

The time_t type has been forcibly defined as 64-bit starting with Visual Studio 2005. According to the default corecrt.h header file from Universal C Runtime, the use of the 32-bit time_t is forbidden on a 64-bit platform. Visual Studio has been using UCRT since Visual Studio 2015.

#if defined _USE_32BIT_TIME_T && defined _WIN64
    #error You cannot use 32-bit time_t (_USE_32BIT_TIME_T) with _WIN64
#endif

#ifndef _CRT_NO_TIME_T
    #ifdef _USE_32BIT_TIME_T
        typedef __time32_t time_t;
    #else
        typedef __time64_t time_t;
    #endif
#endif

Visual Studio 2005 forbids this too, as indicated by a similar chain of preprocessor directives in the standard crtdefs.h header file from Microsoft Visual C Runtime (MSVCRT).

#ifdef  _USE_32BIT_TIME_T
#ifdef  _WIN64
#error You cannot use 32-bit time_t (_USE_32BIT_TIME_T) with _WIN64
#undef  _USE_32BIT_TIME_T
#endif
#else
#if     _INTEGRAL_MAX_BITS < 64
#define _USE_32BIT_TIME_T
#endif
#endif

MSVC is one thing, but what about MinGW? The same thing you'd see in MSVC would be there, only the formatting is different. Issues may occur if time_t isn't defined as a 64-bit integer, but that is another story.

Spatial disorientation

I think we've had enough of the hard stuff. Let's take a little break and look at something simpler. For example, here are three PVS-Studio messages:

V523 The 'then' statement is equivalent to the 'else' statement. crossinterp.c 1416

void
_PyXI_FreeNamespace(_PyXI_namespace *ns)
{
    ....
    if (interpid == PyInterpreterState_GetID(PyInterpreterState_Get())) {
        _sharedns_free(ns);
    }
    else {
        // If we weren't always dynamically allocating the cross-interpreter
        // data in each item then we would need to using a pending call
        // to call _sharedns_free(), to avoid the race between freeing
        // the shared namespace and releasing the XI data.
        _sharedns_free(ns);
    }
}

V524 It is odd that the body of 'FStar_UInt128_u32_combine_' function is fully equivalent to the body of 'FStar_UInt128_u32_combine' function. FStar_UInt128_Verified.h 309

static inline
uint64_t FStar_UInt128_u32_combine(uint64_t hi, uint64_t lo)  // <=
{
  return lo + (hi << FStar_UInt128_u32_32);
}

static inline
uint64_t FStar_UInt128_u32_combine_(uint64_t hi, uint64_t lo) // <=
{
  return lo + (hi << FStar_UInt128_u32_32);
}

V1048 The 'val' variable was assigned the same value. blob.c 491

static int
ass_subscript_index(pysqlite_Blob *self, PyObject *item, PyObject *value)
{
    ....
    long val = PyLong_AsLong(value);
    if (val == -1 && PyErr_Occurred()) {
        PyErr_Clear();
        val = -1;                          // <=
    }
    ....
}

All three examples share a common issue—a slip of the hand or a sudden lapse in concentration.

The situation in the first example is common—even LLVM, where code quality is treated with great care, hadn't been immune to it. However, in our case, the code looks like developers wanted to finish it later and forgot about it.

The second version of the issue begs the question: who got confused—the writer or the reader? This is especially relevant when looking at the second code snippet. Yes, it's inline, yes, there's exactly the same text in the function body. The compiler can substitute the function, and nobody will spot the catch. Except it doesn't have to do that. Let's say the function call substitution failed. Both functions are static, and if inline doesn't work, then there will be normal calls to two static functions.

The third issue probably stems from devs' doubts about their own documentation. The PyLong_AsLong function returns -1 if outputting a number of the long type from PyObject* has failed. This is a function from the stable ABI, so changes to this behavior are unlikely.

Following the trail of past checks

"Is it déjà vu?" I asked myself while reading the following PVS-Studio messages:

V781 The value of the 'i' index is checked after it was used. Perhaps there is a mistake in program logic. lexer.c 143

static int
set_fstring_expr(struct tok_state* tok, struct token *token, char c) {
  tokenizer_mode *tok_mode = TOK_GET_MODE(tok);
  ....
  if (hash_detected) {
    Py_ssize_t input_length =
      tok_mode->last_expr_size - tok_mode->last_expr_end;

    ....

    for (i = 0, j = 0; i < input_length; i++) {
      if (tok_mode->last_expr_buffer[i] == '#') {
        // Skip characters until newline or end of string
        while (   tok_mode->last_expr_buffer[i] != '\0'
               && i < input_length) {                           // <=
          ....
        }
....
}

I didn't just imagine it—this has already happened in x64dbg. It's unknown whether input_length will always be less than i, and whether reading outside the array will occur if the condition isn't met.

V1109 The 'PathRemoveFileSpecW' function is deprecated. Consider switching to an equivalent newer function. pyshellext.cpp 464

class DECLSPEC_UUID(CLASS_GUID) PyShellExt : public RuntimeClass<
  RuntimeClassFlags<ClassicCom>,
  IDropTarget,
  IPersistFile
>
{
public:
  STDMETHODIMP Load(LPCOLESTR pszFileName, DWORD dwMode) {
    ....
    if (!PathRemoveFileSpecW(target_dir)) {                     // <=
        OutputDebugStringW(
          L"PyShellExt::Load - failed to remove filespec from target"
        );
        return E_FAIL;
    }
    ....
  }
}

The analyzer has also issued the V1109 warning, but for a different function. For the Windows Explorer shell extension, this warning is relevant for another reason: the official website explicitly states that Python 3.13 can't run on Windows 7 and earlier versions of the system.

1246_CurlingCPythonAround/image4.png

Instead of the deprecated function, Microsoft recommends using PathCchRemoveFileSpec, introduced in Windows 8. The new function has two main advantages: functionally, it supports UNC paths, particularly network folders; technically, it eliminates the risk of buffer overflows. Programmers are already using this function in CPython, and it's quite possible that refactoring Windows API functions usage has been stretched over time.

1246_CurlingCPythonAround/image5.png

Conclusion

It's hard to comment on all this as someone who mostly writes in Python 2.6/2.7 in his spare time :)

Over the past few years, Python 3 has undergone a lot of changes, sparking plenty of debate among critics. Some are frustrated by the growing amount of syntactic sugar, while others are excited about the capability to disable the global interpreter lock in official CPython. Ensuring the security of the interpreter and standard libraries is undoubtedly a crucial point that devs should consider during the development stage. Otherwise, the famous "batteries" may "leak".

A simple-syntax language interpreter is often complex under the hood, and that complexity inevitably grows as the project ages. Amusingly enough, in one of the pull-request discussions, someone brought up the idea of using static analysis—at least Clang-Tidy—and the response came immediately: "It's hard to configure a linter to ignore existing code and just check the changes." PVS-Studio isn't a linter, but it can analyze modified files, commits, and pull requests. And if you have an open-source project, you can get a free one-year license—with the option to renew it later.

They say Python is all about clean code, so let's wish the Python Software Foundation and the community contributors to not stray from the path!