Python: how to use struct to pack and unpack references to objects?

I have a list of objects, for example:
L = [<CustomObject object at 0x101992eb8>, <CustomObject object at 0x101763908>, ...]
The items in the list are "references" so I guess it's like a list of unsigned integers, am I wrong?
In order to see if I can save some memory, I would like to pack this list using the struct module.
Is this possible? And if yes how to do it? (except if you know for sure I won't save memory like this)

The list is already an array of “integers” (pointers) internally; struct can’t compress that in any simple or significant fashion, and doing so would interfere with Python’s garbage collection.
The CustomObject instances themselves (if they are distinct objects) take more than twice as much memory as the pointers that refer to them, and closer to a hundred times as much unless you use __slots__ for the class.
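For a rough sense of scale, here is a minimal sketch; the exact numbers depend on the CPython version and platform, and Plain/Slotted are illustrative stand-ins for CustomObject:
import sys

class Plain:                      # ordinary class: every instance carries a __dict__
    def __init__(self):
        self.a = 1
        self.b = 2

class Slotted:                    # same attributes, stored in fixed slots instead
    __slots__ = ("a", "b")
    def __init__(self):
        self.a = 1
        self.b = 2

objs = [Plain() for _ in range(1000)]
print(sys.getsizeof(objs) / len(objs))               # ~8 bytes per reference in the list
p = Plain()
print(sys.getsizeof(p) + sys.getsizeof(p.__dict__))  # instance plus its attribute dict
print(sys.getsizeof(Slotted()))                      # much smaller: no per-instance dict
The list of references is already about as compact as it can get; the savings, if any, come from the objects themselves.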

Related

Why is the location of a Python list not changed when its size increases?

As far as I know, a Python list is a dynamic array, so when it reaches a certain size its capacity is increased automatically. But unlike a dynamic array in C or C++, even after the list's capacity is increased, its location does not change. Why is that?
I've tested this using the following code block
l = []
print(l.__sizeof__())
print(id(l))
for i in range(5_000_000):
    l.append(i)
print(l.__sizeof__())
print(id(l))
In CPython (the implementation written in C distributed by python.org), a Python object never moves in memory. In the case of a list object, two pieces of memory are actually allocated: a basic header struct common to all variable-size Python container objects (containing things like the reference count, a pointer to the type object, and the number of contained objects), and a distinct block of memory for a C-level vector holding pointers to the contained Python objects. The header struct points to that vector.
That vector can change size in arbitrary ways, and the header struct will change to point to its current location, but the header struct never moves. id() returns the address of that header struct. Python does not expose the address of the vector of objects.
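You can even watch the header stay put while the element vector moves. The sketch below is illustrative only: it hard-codes the PyListObject field layout of a typical 64-bit CPython release build and will break on other implementations or debug builds.
import ctypes

class PyListObject(ctypes.Structure):
    _fields_ = [
        ("ob_refcnt", ctypes.c_ssize_t),   # PyObject_VAR_HEAD fields
        ("ob_type", ctypes.c_void_p),
        ("ob_size", ctypes.c_ssize_t),
        ("ob_item", ctypes.c_void_p),      # pointer to the separate element vector
        ("allocated", ctypes.c_ssize_t),
    ]

l = []
header = PyListObject.from_address(id(l))
print(id(l), header.ob_item)               # vector pointer is NULL for an empty list
for i in range(1_000_000):
    l.append(i)
print(id(l), header.ob_item)               # same id(); the vector has been reallocated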

Why memory space allocation is different for the same objects?

I was experimenting with how Python allocates memory and ran into the same issue as in
Size of list in memory, which Eli describes much better. His answer leads me to a new doubt: I checked the sizes of 1 and [] and compared their sum with the size of [1], but they differ, as you can see in the code snippet. If I'm not wrong, the memory allocation should be the same, but it isn't. Can anyone help me understand this?
>>> import sys
>>> sys.getsizeof(1)
28
>>> sys.getsizeof([])
64
>>> 28 + 64
92
>>> sys.getsizeof([1])
72
What's the minimum information a list needs to function?
some kind of top-level list object, containing a reference to the class information (methods, type info, etc), and the list's own instance data
the actual objects stored in the list
... that gets you the size you expected. But is it enough?
A fixed-size list object can only track a fixed number of list entries: traditionally just one (head) or two (head and tail).
Adding more entries to the list doesn't change the size of the list object itself, so there must be some extra information: the relationship between list elements.
It's possible to store this information in every Object (this is called an intrusive list), but it's very restrictive: each Object can only be stored in one list at a time.
Since Python lists clearly don't behave like that, we know this extra information isn't already in the list element, and it can't be inside the list object, so it must be stored elsewhere, which increases the total size of the list.
NB. I've kept this argument fairly abstract deliberately. You could implement list a few different ways, but none of them avoid some extra storage for element relationships, even if the representation differs.
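Concretely, in CPython that extra storage shows up as one pointer per element in sys.getsizeof of the list itself, while the elements are counted separately. The numbers below are from a 64-bit build and will vary slightly by version:
import sys

print(sys.getsizeof([]))        # header only, no element vector yet
print(sys.getsizeof([1]))       # header + room for one 8-byte pointer
print(sys.getsizeof([1, 2]))    # header + room for two pointers
print(sys.getsizeof(1))         # the int object itself is counted separately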

PyBytes_FromString different endianness

I have a python-wrapped C++ object whose underlying data is a container std::vector<T> that represents bits. I have a function that writes these bits to a PyBytes object. If the endianness is the same, then there is no issue. However if I wish to write the bytes in a different endianness, then I need to bitswap (or byteswap) each word.
Ideally, I could pass an output iterator to the PyBytes_FromString constructor, where the output iterator just transforms the endianness of each word. This would be O(1) extra memory, which is the target.
Less ideally, I could somehow construct an empty PyBytes object, create the different-endianness char array manually and somehow assign that to the PyBytes object (basically reimplementing the PyBytes constructors). This would also be O(1) extra memory. Unfortunately, the way to do this would be to use _PyBytes_FromSize, but that's not available in the API.
The current way of doing this is to create an entire copy of the reversed words, just to then copy that representation over to the PyBytes object's representation.
I think the second option is the most practical way of doing this, but the only way I can see that working is by basically copying the _PyBytes_FromSize function into my source code which seems hacky. I'm new to the python-C api and am wondering if there's a cleaner way to do this.
PyBytes_FromStringAndSize lets you pass NULL as the first argument, in which case it returns an uninitialized bytes object (which you can edit). It's really just equivalent to _PyBytes_FromSize and would let you do your second option.
If you wanted to try your "output iterator" option instead, then the solution would be to call PyBytes_Type:
PyObject *result = PyObject_CallFunctionObjArgs((PyObject*)&PyBytes_Type, your_iterable, NULL);
Any iterable that returns values between 0 and 255 should work. You can pick the PyObject_Call* that you find easiest to use.
I suspect writing the iterable in C/C++ will be more trouble than writing the loop though.
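For what it's worth, calling PyBytes_Type that way is the C-level equivalent of passing an iterable to bytes() in Python, so the "output iterator" idea looks roughly like the sketch below; the 4-byte word size and sample values are made up for illustration:
def byteswapped(words, word_size=4):
    # Yield each word's bytes in reversed order, one int (0-255) at a time,
    # so no intermediate buffer of the whole payload is ever built.
    for w in words:
        yield from w.to_bytes(word_size, "little")[::-1]

words = [0x01020304, 0x05060708]
print(bytes(byteswapped(words)).hex())   # 0102030405060708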

Does the len() built-in function iterate through the collection to calculate its length, or does it access a collection's attribute? [duplicate]

This question already has answers here:
Cost of len() function
Python has many built-in functions, and len() is one of them.
Return the length (the number of items) of an object. The argument may be a sequence (such as a string, bytes, tuple, list, or range) or a collection (such as a dictionary, set, or frozen set).
If collections and sequences are objects, they could hold a length attribute that can be updated every time something changes. Accessing this attribute would be a fast way to retrieve the collection's length.
Another approach is to iterate through the collection and count the number of items on the fly.
How does len() calculate said length? Through iteration or attribute access? One, none, both, or some other approach?
Python built-in collections cache the length in an attribute. Using len() will not iterate over all elements to count them, no.
Keeping the length as an attribute is cheap and easy to maintain. Given that the Python built-in collection types are used so widely, it'd be unwise for them not to do this.
Python built-in types with a length are typically built on top of the PyObject_VAR_HEAD struct, which includes an ob_size entry. The Py_SIZE macro can then be used to implement the object.__len__ method (e.g. the PySequenceMethods.sq_length slot in the C-API). See the listobject.c implementation for list_length for example.
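Because len() only reads that stored size field, its cost does not depend on how many items the container holds, which a quick, purely illustrative timing confirms:
import timeit

small = list(range(10))
big = list(range(10_000_000))
# Both calls just read ob_size, so the timings are roughly the same.
print(timeit.timeit(lambda: len(small), number=1_000_000))
print(timeit.timeit(lambda: len(big), number=1_000_000))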

Python: Incrementally marshal / pickle an object?

I have a large object I'd like to serialize to disk. I'm finding marshal works quite well and is nice and fast.
Right now I'm creating my large object and then calling marshal.dump. I'd like to avoid holding the large object in memory if possible; I'd like to dump it incrementally as I build it. Is that possible?
The object is fairly simple, a dictionary of arrays.
The bsddb module's 'hashopen' and 'btopen' functions provide a persistent dictionary-like interface. Perhaps you could use one of these, instead of a regular dictionary, to incrementally serialize the arrays to disk?
import bsddb
import marshal
db = bsddb.hashopen('file.db')
db['array1'] = marshal.dumps(array1)
db['array2'] = marshal.dumps(array2)
...
db.close()
To retrieve the arrays:
db = bsddb.hashopen('file.db')
array1 = marshal.loads(db['array1'])
...
If all your object has to be is a dictionary of lists, then you may be able to use the shelve module. It presents a dictionary-like interface where the keys and values are stored in a database file instead of in memory. One limitation which may or may not affect you is that keys in Shelf objects must be strings. Value storage will be more efficient if you specify protocol=-1 when creating the Shelf object to have it use a more efficient binary representation.
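A minimal sketch of the shelve approach; the file name and array contents here are made up for illustration:
import shelve

with shelve.open('incremental_data', protocol=-1) as db:
    db['array1'] = [1, 2, 3]        # each assignment is pickled straight to disk
    db['array2'] = [4, 5, 6]

with shelve.open('incremental_data') as db:
    print(db['array1'])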
This very much depends on how you are building the object. Is it an array of sub-objects? You could marshal/pickle each array element as you build it. Is it a dictionary? The same idea applies (marshal/pickle the keys).
If it is just a big, complex, hairy object, you might want to marshal-dump each piece of the object, and then apply whatever your 'building' process is when you read it back in.
You should be able to dump the item piece by piece to the file. The two design questions that need settling are:
How are you building the object when you're putting it in memory?
How do you need your data when it comes out of memory?
If your build process populates the entire array associated with a given key at a time, you might just dump the key:array pair in a file as a separate dictionary:
big_hairy_dictionary['sample_key'] = pre_existing_array
with open('central_file', 'ab') as f:
    marshal.dump({'sample_key': big_hairy_dictionary['sample_key']}, f)
Then, when reading back, each call to marshal.load on the open file will return one of those dictionaries, which you can use to update a central dictionary. But this is really only going to be helpful if, when you need the data back, you want to read 'central_file' once per key.
Alternately, if you are populating arrays element by element in no particular order, maybe try:
big_hairy_dictionary['sample_key'].append(single_element)
with open('marshaled_files/' + 'sample_key', 'ab') as f:
    marshal.dump(single_element, f)
Then, when you load it back, you don't necessarily need to build the entire dictionary to get back what you need; you just call marshal.load on 'marshaled_files/sample_key' until it raises EOFError, and you have everything associated with that key.
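Reading one of those per-key files back then looks something like this sketch; the path follows the naming used above, and marshal.load raises EOFError once the file is exhausted:
import marshal

elements = []
with open('marshaled_files/sample_key', 'rb') as f:
    while True:
        try:
            elements.append(marshal.load(f))    # one element per dump() call
        except EOFError:
            break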
