When I call sys.getsizeof(4), it returns 14. Assuming this is the same as sizeof() in C, this is unacceptably high.
I would like to treat memory as one big, raw array of bytes. Keeping memory overhead low is the top priority, because of the size of the arrays in the project in question. Portability is a huge issue too, so dropping into C or using a more exotic library is less than optimal.
Is there a way to force Python to use less memory for a single positive signed byte list or tuple member, using only standard Python 3?
14 strikes me as rather low considering that a Python object must at least have a pointer to its type struct and a refcount.
PyObject
All object types are extensions of this type. This is a type which contains the information Python needs to treat a pointer to an object as an object. In a normal “release” build, it contains only the object’s reference count and a pointer to the corresponding type object. Nothing is actually declared to be a PyObject, but every pointer to a Python object can be cast to a PyObject*. Access to the members must be done by using the macros Py_REFCNT and Py_TYPE.
You will have this overhead for every Python object. The only way to reduce the overhead-to-payload ratio is to have more payload per object, as in arrays (both the plain Python array module and numpy).
The trick here is that array elements typically are not Python objects, so they can dispense with the refcount and type pointer and occupy just as much memory as the underlying C type.
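To make that ratio concrete, here is a rough sketch (a sketch only; exact byte counts vary with platform and CPython version):
>>> import sys, array
>>> data = list(range(256))
>>> # A list stores 256 pointers, and each int is a full object besides:
>>> sys.getsizeof(data) + sum(sys.getsizeof(n) for n in data)  # roughly 9 KB on 64-bit CPython 3
>>> # The same payload in an array is one object holding 256 raw bytes:
>>> sys.getsizeof(array.array('B', data))  # a few hundred bytes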
(Hat tip to martineau for his comment...)
If you're only concerned with unsigned bytes (values [0, 255]), then the simplest answer might be the built-in bytearray and its immutable sibling, bytes.
One potential problem is that these are intended to represent encoded strings (reading from or writing to the outside world), so their default __repr__ is "string-like", not a list of integers:
>>> lst = [0x10, 0x20, 0x30, 0x41, 0x61, 0x7f, 0x80, 0xff]
>>> bytearray(lst)
bytearray(b'\x10 0Aa\x7f\x80\xff')
>>> bytes(lst)
b'\x10 0Aa\x7f\x80\xff'
Note that space, '0', 'A', and 'a' appear literally, while "unprintable" values appear as '\x##' string escape sequences.
If you're trying to think of those bytes as a bunch of integers, this is not what you want.
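If you do want to see the integers, you can always ask for them explicitly; in Python 3, iterating over (or indexing into) a bytes or bytearray object yields plain ints:
>>> list(bytearray(lst))
[16, 32, 48, 65, 97, 127, 128, 255]
>>> bytes(lst)[0]
16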
For homogeneous arrays of fixed-width integers or floats (much like in C), use the standard library's array module.
>>> import array
>>> # One megabyte of unsigned 8-bit integers.
>>> a = array.array('B', (n % 2**8 for n in range(2**20)))
>>> len(a)
1048576
>>> a.typecode
'B'
>>> a.itemsize
1
>>> a.buffer_info() # (Memory address, length in items.)
(24936384, 1048576)
>>> a_slice = a[1024:1040] # Can be sliced like a list.
>>> a_slice
array('B', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
>>> type(a_slice) # Slice is also an array, not a list.
<class 'array.array'>
For more complex data, the struct module is for packing heterogeneous records, much like C's struct keyword.
Unlike in C, though, there is no obvious way to create an array of structs.
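That said, Struct.iter_unpack (Python 3.4+) at least lets you walk a buffer of back-to-back records; a minimal sketch with a made-up record layout:
>>> import struct
>>> record = struct.Struct('<Hf')  # little-endian: a uint16 followed by a float32
>>> buf = record.pack(1, 0.5) + record.pack(2, 1.5)
>>> list(record.iter_unpack(buf))
[(1, 0.5), (2, 1.5)]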
These data structures all make use of Python's Buffer Protocol, which (in CPython, at least) allows a Python class to expose its inner C-like array directly to other Python code.
If you need to do something complicated, you might have to learn this...
or give up and use NumPy.
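As a small taste of the buffer protocol from pure Python, the built-in memoryview gives you zero-copy slices of any of these types (a sketch):
>>> big = bytearray(2**20)      # one megabyte of zeroed bytes
>>> view = memoryview(big)
>>> chunk = view[1024:2048]     # a zero-copy slice; no bytes are duplicated
>>> chunk[0] = 0xff             # writes through to the underlying bytearray
>>> big[1024]
255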
Related
I have a numpy array: ch=[1, 2, 3, 4, 20, 25]
How can I write it as b'\x01\x02\x03\x04\x14\x19'?
Note: I do not want to convert each integer to binary one by one. Is there a function available to do it directly, in one step?
You can use the bytes built-in and pass it the sequence:
>>> ch=[1, 2, 3, 4, 20, 25]
>>> bytes(ch)
b'\x01\x02\x03\x04\x14\x19'
On a side note, what you are showing is a Python list, not a numpy array.
But if you do want to operate on a numpy array, you can first convert it to a Python list:
>>> bytes(np.array(ch).tolist())
b'\x01\x02\x03\x04\x14\x19'
When you try tobytes() directly on the numpy array for the above data:
>>> np.array(ch).tobytes()
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x14\x00\x00\x00\x19\x00\x00\x00'
The above output is also correct; the difference is due to the data type. If you print it, you'll see it is numpy.int32, i.e. 32 bits, and 32/8 = 4 is the number of bytes required to represent each value.
>>> np.array(ch).dtype
dtype('int32')
If you convert it to an 8-bit (1-byte) type, the output is the same as using the bytes built-in on a list:
>>> np.array(ch).astype(np.int8).tobytes()
b'\x01\x02\x03\x04\x14\x19'
The following will work:
b"".join([int(item).to_bytes(1, "big") for item in ch])
First you iterate over the NumPy array, convert each np.int32 object to int, and call int.to_bytes() on it, returning 1 byte in big-endian order (you could also use "little" here, since a single byte has no byte order); then you join them all together.
Alternatively, you could call list() on the array, then pass it to the built-in bytes() constructor:
bytes(list(ch))
I have a variable x in my code that takes only three values, x ∈ {1, 2, 3}. When I use sys.getsizeof() I get 24, which is the size of the object in bytes.
Question
I was wondering if it is possible in Python to convert x to a char with a 1-byte size. I tried str(x), but sys.getsizeof(str(x)) printed 38 bytes.
It is not possible for a single byte, since Python objects always include the overhead of the Python implementation.
Your use case only matters in practice if you have large numbers of such values (thousands or millions, e.g. an image). In that case you would use, for example, array or bytearray objects as containers. Another approach would be numpy arrays.
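To illustrate the difference at that scale, a rough sketch (exact sizes depend on platform and Python version):
import sys, array

values = [n % 3 + 1 for n in range(10**6)]      # a million values from {1, 2, 3}
print(sys.getsizeof(values))                    # ~8 MB of pointers alone on a 64-bit build
print(sys.getsizeof(array.array('B', values)))  # ~1 MB: one byte per value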
I am trying to port a portion of code written in a different language (an obscure one called Igor Pro, by WaveMetrics, for those of you who have heard of it) to Python.
In this code, there is a conversion of a data type from a 16-bit integer (read in from a 16-bit, big endian binary file) to single-precision (32-bit) floating-point. In this program, the conversion is as follows:
Signed 16-bit integer:
print tmp
tmp[0]={-24160,18597,-24160,18597,-24160}
converted to 32-bit floating-point:
Redimension/S/E=1 tmp
print tmp
tmp[0]={339213,339213,5.79801e-41,0,0}
The /S flag/option indicates that the data type of tmp should be float32 instead of int16. However, I believe the important flag/option is /E=1, which is said to "Force reshape without converting or moving data."
In Python, the conversion is as follows:
>>> tmp[:5]
array([-24160, 18597, -24160, 18597, -24160], dtype=int16)
>>> tmp.astype('float32')
array([-24160., 18597., -24160., ..., 18597., -24160., 18597.], dtype=float32)
Which is what I expect, but I need to find a function/operation that emulates the /E=1 option in the original code above. Is there an obvious way in which -24160 and 18597 would both be converted to 339213? Does this have anything to do with byteswap or newbyteorder or something else?
import numpy
tmp=numpy.array([-24160,18597,-24160,18597,-24160, 0], numpy.int16)
tmp.dtype = numpy.float32
print(tmp)
Result:
[ 3.39213000e+05 3.39213000e+05 5.79801253e-41]
I had to add a zero to the list of values because there was an odd number of them: five 16-bit values cannot be reinterpreted as 32-bit floats, since two are needed per float.
Use view instead of astype:
In [9]: tmp=np.array([-24160, 18597, -24160, 18597, -24160, 18597], dtype=np.int16)
In [10]: tmp.view('float32')
Out[10]: array([ 339213., 339213., 339213.], dtype=float32)
.astype creates a copy of the array, cast to the new dtype.
.view returns a view of the array (with the same underlying data), with the data interpreted according to the new dtype.
Is there an obvious way in which -24160 and 18597 would both be converted to 339213?
No, but neither is there any obvious way in which -24160 would convert to 339213 and 5.79801e-41 and 0.
It looks more like the conversion takes two input numbers to create one output (probably by concatenating the raw 2×16 bits to 32 bits and calling the result a float). In that case the pair -24160,18597 consistently becomes 339213, and 5.79801e-41 probably results from -24160,0 where the 0 is invented because we run out of inputs. Since 5.79801e-41 looks like it might be a single-precision denormal, this implies that the two 16-bit blocks are probably concatenated in little-endian order.
It remains to see whether you need to byte-swap each of the 16-bit inputs, but you can check that for yourself.
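If you do need to check, here is a sketch of the candidate interpretations in numpy (byteswap and view are both standard numpy methods; which combination is right depends on the file):
import numpy as np

tmp = np.array([-24160, 18597, -24160, 18597, -24160, 18597], dtype=np.int16)
print(tmp.view('<f4'))             # pairs of int16s read as little-endian float32;
                                   # on a little-endian machine this gives the 339213.0 values
print(tmp.byteswap().view('<f4'))  # the same after swapping the bytes within each int16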
When I compress data with a compact code, I don't know how to deal with the integers: I need to store an integer in 1 byte, 2 bytes, 3 bytes, etc. of memory. How can I do this in Python?
Or: how do I turn a tuple of 24 bits such as (1, 0, 1, ..., 1) into exactly 3 bytes of storage?
The struct module in the standard library packs data into bytes.
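A quick sketch of the fixed widths struct knows about (note there is no 3-byte format code, though int.to_bytes covers odd sizes):
>>> import struct
>>> struct.pack('>B', 0x3e)          # 1 byte, unsigned
b'>'
>>> struct.pack('>H', 0x3eff)        # 2 bytes, big-endian
b'>\xff'
>>> struct.pack('>I', 0x3eff00aa)    # 4 bytes; struct has no 3-byte code
b'>\xff\x00\xaa'
>>> (0x3eff00).to_bytes(3, 'big')    # int.to_bytes handles odd sizes like 3 bytes
b'>\xff\x00'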
If you need to pack in arbitrary numbers of bytes then it might be better to use a bytearray than rely on the struct module, for example:
>>> a = bytearray(3) # create a 3-byte array, initialized to zeros
>>> a[0] = 0x3e
>>> a[1] = 0xff
>>> a[2] = 0x00
Note that the memory overhead of any Python object is going to be considerably more than a few bytes, so if you are really worried about memory use then you should store all your data together in as few objects as possible.
Depending on your exact needs a third party module such as bitstring could be helpful (full disclosure: I wrote it).
>>> b = bitstring.BitArray((1,0,1,1,1,0,0,1,1,1,0,1,0,1,1,1))
>>> b.bytes
b'\xb9\xd7'
>>> b.uint
47575
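If you would rather stay within the standard library, a sketch of the same bit-packing done by hand, folding the bits into an int and using int.to_bytes:
>>> bits = (1,0,1,1,1,0,0,1,1,1,0,1,0,1,1,1)
>>> n = 0
>>> for bit in bits:
...     n = (n << 1) | bit
...
>>> n.to_bytes(len(bits) // 8, 'big')
b'\xb9\xd7'
>>> n
47575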
I need a very large list, and am trying to figure out how big I can make it so that it still fits in 1-2GB of RAM. I am using the CPython implementation, on 64 bit (x86_64).
Edit: thanks to bua's answer, I have filled in some of the more concrete answers.
What is the space (memory) usage, in bytes, of:
the list itself
sys.getsizeof([]) == 72
each list entry (not including the data)
sys.getsizeof([0, 1, 2, 3]) == 104, so 8 bytes overhead per entry.
the data if it is an integer
sys.getsizeof(2**62) == 24 (but varies according to integer size)
sys.getsizeof(2**63) == 40
sys.getsizeof(2**128) == 48
sys.getsizeof(2**256) == 66
the data if it is an object (sizeof(PyObject), I guess)
sys.getsizeof(C()) == 72 (C is an empty user-space object)
If you can share more general data about the observed sizes, that would be great. For example:
Are there special cases (I think immutable values might be shared, so maybe a list of bools doesn't take any extra space for the data)?
Perhaps small lists take X bytes overhead but large lists take Y bytes overhead?
A point to start from:
>>> import sys
>>> a=list()
>>> type(a)
<type 'list'>
>>> sys.getsizeof(a)
36
>>> b=1
>>> type(b)
<type 'int'>
>>> sys.getsizeof(b)
12
and from the Python help:
>>> help(sys.getsizeof)
Help on built-in function getsizeof in module sys:
getsizeof(...)
getsizeof(object, default) -> int
Return the size of object in bytes.
If you want lists of numerical values, the standard array module provides optimized arrays (that have an append method).
The non-standard, but commonly used NumPy module gives you fixed-size efficient arrays.
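A short sketch of both options (numpy is the third-party import here):
>>> import array
>>> a = array.array('l', [1, 2, 3])   # signed C long; grows via append like a list
>>> a.append(4)
>>> a
array('l', [1, 2, 3, 4])
>>> import numpy as np
>>> np.arange(4, dtype=np.uint8)      # fixed-size, one byte per element here
array([0, 1, 2, 3], dtype=uint8)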