DESIGN AND IMPLEMENTATION OF CPYTHON'S MOST RELEVANT DATA STRUCTURES
This article presents and describes the design and implementation of CPython's most relevant data structures. We will look at the underlying C implementation, how these types work and what tweaks are in place to make them faster, all of which is good knowledge that makes us better at Python.
Lists, dictionaries, integers, strings: all the structures that hold our data are there for us to use, but we rarely think about the trouble they go through to make our interaction with them as smooth as possible. It is time to see how they work, how they make our code faster and how we can use them in accordance with their implementation. We are going to take a look at the basic data types of CPython by going straight to the roots, starting with PyObject, the root of them all.
From now on, when saying Python, I will be referring to CPython.
PyObject
Given the C implementation of Python, every data type that one can use has an underlying C struct definition which holds the actual data and on which operations are performed. Each of these structs "inherits" (we will get to that in a minute) from the base struct, PyObject.
PyObject has the following definition:
typedef struct _object {
    PyObject_HEAD
} PyObject;
PyObject_HEAD is a macro that, expanded, looks like this:
struct _object *_ob_next;
struct _object *_ob_prev;
Py_ssize_t ob_refcnt;
struct _typeobject *ob_type;
All objects that live in Python are linked together in a linked list (_ob_next, _ob_prev), have a type (ob_type) and a reference count (ob_refcnt, which is used to determine if the object should be garbage collected). Whenever a new type is defined, its structure must comply with the following pattern:
typedef struct {
    PyObject_HEAD
    ...
} PyMyObject;
so that PyMyObject will contain all the attributes that PyObject has. But why does PyObject_HEAD have to be the first attribute? This is related to the "inheritance" that I was talking about earlier. As stated in object.h, `Objects are always accessed through pointers of the type 'PyObject *'.` This means I need the ability to cast a PyMyObject pointer to a PyObject pointer while preserving the basic details of the object. Because PyObject_HEAD is placed at the beginning of the struct declaration, I can simply cast a PyMyObject to PyObject and treat it as its "parent" type.
PyObject *obj = (PyObject*)my_py_type_variable;
This is how inheritance and polymorphism are partly mimicked in C, which is pretty awesome.
To recap, PyObject is the structure that holds all the runtime memory together. It is the base for all Python objects and holds information about the type of the object and whether or not that object should be garbage collected. It is also the structure that is passed around in the implementation of the language, which makes things a lot easier and a lot more comprehensible.
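Two of these fields are visible from the Python side. As a rough illustration (shown as a Python 2 session; Python 3 would print <class 'list'> and the exact counts can vary between versions), sys.getrefcount() reads ob_refcnt, adding one temporary reference for its own argument, and type() reports ob_type:
>>> import sys
>>> x = []
>>> sys.getrefcount(x)    # ob_refcnt, plus the temporary reference held by the call itself
2
>>> y = x                 # another name for the same object bumps the count
>>> sys.getrefcount(x)
3
>>> type(x)               # reported from ob_type
<type 'list'>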
Integer
The integer type is represented by the following structure, which is quite ordinary:
typedef struct {
    PyObject_HEAD
    long ob_ival;
} PyIntObject;
As one can see, the Python integer is represented by a C long data type. There is not much to say about the structure itself, but there are some interesting things that Python does to speed up the processing of objects of this type.
Since integers are heavily used in most programs, Python allocates, at start-up, a block of PyIntObjects so that the overhead of malloc-ing each one is eliminated. When this block fills up, new space is reserved on the heap for additional integers.
Another pattern is that small integers are used quite a lot in ordinary programs, so Python has an optimization for small integers ranging from -5 up to and including 256. Python interns the values in this range, meaning that it keeps a list of integer objects within the said interval and, whenever one belonging to it is requested, it returns the object that has already been created.
if (-NSMALLNEGINTS <= ival && ival < NSMALLPOSINTS)
    v = small_ints[ival + NSMALLNEGINTS];
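A quick way to observe this cache from the interpreter prompt (a rough demonstration; whether two literals share an object is an implementation detail and can vary with the interpreter version and how the code is compiled):
>>> a = 256
>>> b = 256
>>> a is b          # 256 is inside the cached range, same object
True
>>> a = 257
>>> b = 257
>>> a is b          # outside the range; typically a fresh object each time
False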
An optimisation often met in the source code of Python is using CPU registers when performing comparisons.
static int
int_compare(PyIntObject *v, PyIntObject *w)
{
    register long i = v->ob_ival;
    register long j = w->ob_ival;
    return (i < j) ? -1 : (i > j) ? 1 : 0;
}
Python 2.x vs 3.x
In Python 2.x there are ints and there are longs. The Python int, as seen above, is represented by a C long, which means it is architecture dependent. On the other hand, the Python long is represented as an array of uint32_t digits, which means it can be as big as the available memory allows.
In Python 2.x, whenever an int overflows, it is transparently converted into a Python long.
In Python 3.x there is no longer an integer type represented by a C long; the former long was renamed to int.
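A quick check in a Python 2 interpreter (sys.maxint only exists in Python 2; in Python 3 there is a single arbitrary-precision int):
>>> import sys
>>> type(sys.maxint)
<type 'int'>
>>> type(sys.maxint + 1)    # overflows the C long and is promoted to a Python long
<type 'long'>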
The Python float type is similar to the integer, only that it is represented by a C double.
String
String type is represented by the following structure:
typedef struct {
    PyObject_VAR_HEAD
    long ob_shash;
    int ob_sstate;
    char ob_sval[1];
} PyStringObject;
where
ob_shash is the cached hash of the string, or -1 if it has not been computed yet
ob_sval is the actual data stored in the string
ob_sstate indicates whether or not the string is interned
The cool thing about what Python does with strings is that it interns the ones that are only one character long and those that "look" like Python variable names. This ensures that there is only one string object for a given value, so equality tests can be a single pointer comparison. By implementing this strategy, the interpreter is up to 20% faster. For this to happen, there are two structures:
static PyObject *interned;

/* UCHAR_MAX is the maximum value for an object of type unsigned char */
static PyStringObject *characters[UCHAR_MAX + 1];
The variable interned is a dictionary that holds all the strings that are interned. It is a dict where both the key and the value are the string object itself (the same pointer). Whenever a string with a specific value is requested, Python checks whether a string with that value is already interned; if so, it returns the already-interned string object.
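A rough illustration from the interpreter prompt (implementation-dependent behaviour; which literals get interned automatically differs between versions, so treat the exact results as a sketch):
>>> a = "python_rocks"
>>> b = "python_rocks"
>>> a is b                      # looks like an identifier, so it gets interned
True
>>> c = "python rocks!"
>>> d = "python rocks!"
>>> c is d                      # spaces and punctuation, usually not auto-interned
False
>>> from sys import intern      # in Python 2, intern() is a builtin instead
>>> intern("python rocks!") is intern("python rocks!")
True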
For a more in-depth look at Python interning, check out this excellent article. Another topic worth looking at regarding the implementation of the string type is the algorithm chosen for .find(), which is described here.
List
This is the underlying structure of a Python list:
typedef struct {
    PyObject_VAR_HEAD
    PyObject **ob_item;
    Py_ssize_t allocated;
} PyListObject;
where
ob_item is an array of pointers to the elements of the list
allocated is the number of slots allocated in memory
The most interesting aspect of Python lists is the resize policy. It is designed in such a way that resizing is not needed too often. The over-allocation policy implemented in listobject.c is sketched below.
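Here is a minimal sketch of that growth policy, translated to Python from the over-allocation formula in CPython 2.x's list_resize() (the real code also handles shrinking, overflow checks and realloc details, so take this as an approximation):
>>> def over_allocate(newsize):
...     """Approximate number of slots reserved when a list must grow to newsize."""
...     if newsize == 0:
...         return 0
...     return newsize + (newsize >> 3) + (3 if newsize < 9 else 6)
...
>>> sizes, allocated = [], 0
>>> for n in range(1, 60):          # keep appending and record each reallocation
...     if n > allocated:
...         allocated = over_allocate(n)
...         sizes.append(allocated)
...
>>> sizes                           # mild over-allocation keeps appends amortized O(1)
[4, 8, 16, 25, 35, 46, 58, 72]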
The list is also shrunk if the number of elements is lower than half the allocated elements.
Tuple
As before, the structure that holds the type:
typedef struct {
    PyObject_VAR_HEAD
    PyObject *ob_item[1];
} PyTupleObject;
Because over the course of most programs' runtime lots of small tuples will be created, Python keeps an internal cache of small tuples that it reuses. This helps cut down on a lot of memory allocation and deallocation. The way it does that is by having a list of arrays of free tuples, free_list. Each array free_list[i] (0 <= i < PyTuple_MAXSAVESIZE, which is around 20) contains a maximum of PyTuple_MAXFREELIST (around 2000) tuples of length i. Whenever a request is made for a new tuple of length smaller than PyTuple_MAXSAVESIZE, the implementation checks whether such a tuple is already allocated and free for use. If not, it allocates one and, when the time comes to deallocate it, puts it in the free_list array at the position corresponding to its length. This saves a lot of memory management.
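One hedged way to glimpse this from the prompt (purely illustrative; whether the freed tuple is actually recycled depends on interpreter internals and whatever else is running, so the result is not guaranteed):
>>> a = (1, 2, 3)
>>> old_address = id(a)
>>> del a                     # the 3-element tuple can go back to free_list[3]
>>> b = ("x", "y", "z")       # a new 3-element tuple may get the recycled slot
>>> id(b) == old_address      # often True in a fresh CPython session, but not guaranteed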
Dictionaries
The dictionary type has the following structures as its underlying implementation:
typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;

typedef struct _dictobject PyDictObject;
struct _dictobject {
    PyObject_HEAD
    Py_ssize_t ma_fill;  /* # Active + # Dummy */
    Py_ssize_t ma_used;  /* # Active */
    Py_ssize_t ma_mask;  /* equal to size of ma_table - 1; used to calculate the index */
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};
Dictionaries use hash tables, which are basically just some arrays, to store their data. This structure uses two hash tables, ma_smalltable, which is the smallest possible table, and ma_table. ma_table points to ma_smalltable if the dict is very small, else to additional malloc'ed memory. ma_lookup is a pointer to the function used to search for a specific key and it varies depending on the type of the key.
A dictionary slot may be active or dummy. Active slots hold actual data, while slots whose data has since been deleted are called dummy slots. ma_fill holds the count of active plus dummy slots, while ma_used counts only the active ones. These attributes are used to decide whether or not to resize the hash table.
The main performance factors for dictionaries are the hashing function used for the keys and the collision resolution strategy. Python does not use a complex hashing function; instead, it uses a more advanced collision resolution strategy (open addressing) that produces pseudo-random positions at which to probe for free slots. For a detailed explanation of this strategy, check out the beginning of the source file dictobject.c and this article.
The best-case complexity for looking up a key is O(1), but this degrades as the hash table fills up and fewer slots are free to hold data. A hash table performs best when it is less than two-thirds full, meaning that the number of active plus dummy slots needs to stay below 2/3 of its size. When this threshold is reached, a resize takes place in order to keep performance as high as possible. The resize policy for dictionaries is as follows:
if ma_used < 50000, quadruple the size of the dictionary
else, double the size of the dictionary
Let's consider that we want to build a dictionary that holds six keys. The first five fit in the initial eight-slot table, but when we add the sixth, a check is made to see if the table has become at least two-thirds full, which it has, and it is resized to 32 slots. Considering that we just wanted to store six keys in this hash table, the newly allocated memory is more than five times the minimum needed. This is a very useful aspect to consider when working with dictionaries.
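A hedged way to watch this from Python is to track sys.getsizeof() while inserting keys; the absolute numbers depend heavily on the Python version and platform, but the jumps mark the resizes:
>>> import sys
>>> d = {}
>>> previous = sys.getsizeof(d)
>>> for i in range(20):
...     d[i] = None
...     size = sys.getsizeof(d)
...     if size != previous:                # a jump means the hash table was resized
...         print(len(d), size)
...         previous = size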
Given Python's dynamic nature, all objects use a dictionary to hold their attributes, and these dictionaries can be modified at any time. Consider an application where thousands of objects are instantiated and they all have only six attributes, as in the example above; this will result in allocating several times as much memory as necessary. One might expect the ability to somehow get around this and spare that extra memory. For this, you can add the attribute __slots__, which can take a string, an iterable or a sequence with the names of the class's attributes. These attributes will be added to a static structure, and the set of attributes may no longer be modified.
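A small illustration of __slots__ (the Point class and its fields are made up for the example; the exact AttributeError message can differ between Python versions):
>>> class Point(object):
...     __slots__ = ("x", "y")          # instances get fixed storage instead of a __dict__
...     def __init__(self, x, y):
...         self.x = x
...         self.y = y
...
>>> p = Point(1, 2)
>>> p.x, p.y
(1, 2)
>>> p.z = 3                             # no per-instance __dict__ to put this in
Traceback (most recent call last):
  ...
AttributeError: 'Point' object has no attribute 'z'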
Boolean
The boolean type was only added in Python 2.3 and it simply extends the int type. True and False were present in version 2.2.1, but when passed to str() or repr() they would return the strings '1' and '0'. Now we have this:
static PyObject *
bool_repr(PyBoolObject *self)
{
    ...
    s = true_str ? true_str :
        (true_str = PyString_InternFromString("True"));
    ...
    s = false_str ? false_str :
        (false_str = PyString_InternFromString("False"));
    ...
    return s;
}
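Since bool is a subclass of int, booleans still behave like the numbers 1 and 0 in arithmetic, while repr() and str() now return the interned 'True'/'False' strings:
>>> isinstance(True, int)     # bool extends int
True
>>> True + True               # still usable as the number 1
2
>>> repr(True), str(False)
('True', 'False')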
Mutable vs Immutable
Python has two kinds of data types: mutable ones (i.e. one can change the contents of an object of that type) and immutable ones (i.e. one cannot change the contents of an object of that type). Let's consider the following example to illustrate these concepts:
>>> x = 5
>>> y = x
>>> x = x + y
Right before the addition, x and y point to the same in-memory object, but when we execute the last line, Python creates another object and makes x point to it, leaving the original object that held the value 5 unchanged. The whole meaning of immutability is that when operations are performed on immutable objects, the result is always a new, separate object.
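This is easy to verify with id(), which in CPython returns the address of the underlying PyObject (using a value outside the small-integer cache so the caching described earlier does not get in the way):
>>> x = 1000
>>> y = x
>>> id(x) == id(y)        # both names refer to the same object
True
>>> x = x + y             # the addition produces a brand new int object
>>> id(x) == id(y)
False
>>> y                     # the original object is untouched
1000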
On the other hand, we have the mutable objects such as lists, or dicts. As in the example above, if we do this:
>>> l1 = [1, 2, 3]
>>> l2 = l1
>>> l1[0] = 10
>>> l2
[10, 2, 3]
the change made through l1 is visible through l2 as well, because both names refer to the same list object, which was modified in place.
The obvious question is: why was such a decision taken? One reason is performance, since the underlying implementation of an immutable object is much simpler than that of a mutable one. Simple data types that represent numbers, or types such as tuples for which the overhead of memory allocation can easily be worked around, are therefore defined as immutable. Another reason for having immutable objects reveals itself when considering the keys of a dictionary. Consider the example:
>>> key = "key";
>>> d = {key: 1}
>>> key[1] = "E"
Why are all of these important?
I believe every Python programmer, and not only Python programmers, should check out the CPython implementation of the language. It is an example of a complex implementation behind a simple API, of advanced C programming techniques and of versatile data structures. Knowing these features of the underlying implementation of Python's data types enables one to make use of them in the best possible way.
You can find other interesting articles on similar topics here.