Fast conversion from string to numpy.int16 array

I read 32-bit audio data (delivered as a string by the previous commands) into a numpy.int32 array with:
myarray = numpy.fromstring(data, dtype=numpy.int32)
But then I want to store it in memory as int16 (I know this will decrease the bit depth / resolution / sound quality):
myarray = myarray >> 16
my_16bit_array = myarray.astype('int16')
This works very well, but is there a faster solution? Here I use a string buffer, one int32 array, and one int16 array; I wanted to know whether it's possible to save one step.

How about this?
np.fromstring(data, dtype=np.uint16)[0::2]
Note, however, that overhead of the kind you describe is common when working with numpy and cannot always be avoided. If this kind of overhead isn't acceptable for your application, plan ahead to write extension modules for the performance-critical parts.
Note: it should be 0::2 or 1::2 depending on the endianness of your platform.
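As a hedged sketch of that idea (np.frombuffer is the non-deprecated successor of np.fromstring), the endianness choice can be made once with sys.byteorder; reading the buffer directly as int16 keeps the sign of the high half, so no shift or astype() is needed:
import sys
import numpy as np

# A minimal sketch of the no-copy approach; the sample data here stands in
# for the audio string from the question.
data = (np.arange(4, dtype=np.int32) << 16).tobytes()

# On a little-endian platform the high 16 bits of each int32 are the second
# int16 of the pair (1::2); on big-endian, the first (0::2).
offset = 1 if sys.byteorder == "little" else 0
my_16bit_array = np.frombuffer(data, dtype=np.int16)[offset::2]
print(my_16bit_array)  # [0 1 2 3]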

Related

How to simulate a fixed size char array in python

I've been wondering about the best way (or any good way, actually) to simulate a small RAM memory in Python.
In most languages, I would simply create a fixed-size array of char, but this seems to be surprisingly complex in Python.
The closest thing I found was this:
self.two_KB_internal_ram = []  # goes from $0000-$07FF
for x in range(2048):
    self.two_KB_internal_ram.append(0)
print("two_KB_internal_ram: ", type(self.two_KB_internal_ram[0]))
However, the printed type shows that each element is an int, not a char.
Is there a way of doing this with chars? If not (or even if there is), what would be a good way to emulate a RAM memory?
To get a fixed-size array you can use np.array, which is capable of holding strings as well.
Something like this:
import numpy as np
np.array(["a"] * 2048)
But why would you do that in Python?
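Still, as a hedged sketch of the suggestion above: numpy's fixed-width string dtype (<U1 here) gives every cell exactly one character, so the array behaves like a fixed-size char buffer:
import numpy as np

# A minimal sketch: 2048 one-character cells, indexable like RAM at $0000-$07FF.
ram = np.array(["\x00"] * 2048)
print(ram.dtype, ram.shape)   # <U1 (2048,)

ram[0x07FF] = "a"             # write a single cell
print(ram[0x07FF])            # 'a'
ram[0x0000] = "xyz"           # longer strings are truncated to fit: stores 'x'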

How can I make my Python program use 4 bytes for an int instead of 24 bytes?

To save memory, I want to use fewer bytes (4) for each int I have, instead of 24.
I looked at structs, but I don't really understand how to use them.
https://docs.python.org/3/library/struct.html
When I do the following:
myInt = struct.pack('I', anInt)
sys.getsizeof(myInt) doesn't return 4 like I expected.
Is there something that I am doing wrong? Is there another way for Python to save memory for each variable?
ADDED: I have 750,000,000 integers in an array that I wish to be able to access by index.
If you want to hold many integers in an array, use a numpy ndarray. Numpy is a very popular third-party package that stores arrays much more compactly than Python alone does. It was considered for inclusion in the standard library, but was kept separate so it could be updated more frequently than Python itself. Numpy is one of the reasons Python has become so popular for data science and other scientific uses.
Numpy's np.int32 dtype uses four bytes per integer. Declare your array full of zeros with:
import numpy as np
myarray = np.zeros((750000000,), dtype=np.int32)
Or, if you just want the array and don't want to spend time initializing the values:
myarray = np.empty((750000000,), dtype=np.int32)
You then fill and use the array as you like. There is some Python overhead for the complete array, so the array's size will be slightly larger than 4 * 750000000, but the size will be close.
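As a quick, hedged check of the four-bytes-per-element claim (using a smaller array so the example runs anywhere; the arithmetic scales linearly):
import numpy as np

a = np.zeros((1_000_000,), dtype=np.int32)
print(a.nbytes)    # 4000000: exactly 4 bytes per element
a[42] = 7          # indexed read/write works like a Python list
print(a[42])       # 7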

Getting a Killed message in Python -- is memory the issue?

I have a list that I .append() to in a for-loop; the final length of the list is around 180,000. Each item in the list is a numpy array of 7,680 float32 values.
Then I convert the list to a numpy array, i.e. I expect an array of shape (180000, 7680):
d = numpy.asarray(dlist, dtype='float32')
This causes the script to crash with the message Killed.
Is memory the problem? Assuming float32 takes 4 bytes, 180000 x 7680 x 4 bytes = 5.5 GB.
I am using 64 bit Ubuntu, 12 GB RAM.
Yes, memory is the problem.
Your estimate also needs to take into account the memory already allocated for the list representation of the 180000 x 7680 float32 values, so without other details on dynamic memory releases / garbage collection, the numpy.asarray() call needs a bit more than just another block of 180000 x 7680 x 4 bytes.
If you test with less than a third of the list length, you can inspect the effective overhead of the numpy array data representation and get exact numbers for a memory-feasible design.
Memory profiling can help to point out the bottleneck and understand the code's requirements; that can sometimes save half of the allocation space needed for the data, compared with the original flow of data and operations:
(Figure: courtesy of scikit-learn, showing the impact of numpy-based versus direct-BLAS calls on memory-allocation envelopes.)
You should also take into account that you need twice the size of the data in memory during the conversion.
And other software may take some of your RAM; with no additional paging space defined, using 11 GB of your 12 GB of memory will probably get your system into trouble.
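One way around the doubled footprint, sketched here under the assumption that rows can be produced one at a time (produce_row is a hypothetical stand-in for whatever generates each 7680-value chunk), is to preallocate the final array and fill it in place, so the 5.5 GB list is never built:
import numpy as np

d = np.empty((180_000, 7680), dtype=np.float32)  # single 5.5 GB allocation
for i in range(d.shape[0]):
    d[i] = produce_row(i)  # hypothetical: returns 7680 float32 values per call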

Python: convert integers into bit strings of a specific length, fast

I am trying to delta-compress a list of pixels and store them in a binary file. I have managed to do this, but the method I found takes ~4 minutes per frame.
from bitstring import Bits, BitArray, pack

def getByte_List(self):
    # Convert each delta value into a fixed-width signed bit field
    values = BitArray()
    for i in range(len(self.delta_values)):
        temp = Bits(int=self.delta_values[i], length=self.num_bits_pixel)
        values.append(temp)
    # Pack the header (initial value, bits per pixel) followed by the delta bits
    bit_stream = pack("uint:16, uint:5, bits", self.intial_value, self.num_bits_pixel, values)
    # Make sure that the stream contains a multiple of 8 bits
    if len(bit_stream) % 8:
        bit_stream.append(Bits(uint=0, length=8 - (len(bit_stream) % 8)))
    # Create a list of unsigned integer values to represent each byte in the stream
    fmt = (len(bit_stream) // 8) * ["uint:8"]
    return bit_stream.unpack(fmt)
This is my code. I take the initial value, the number of bits per pixel, and the delta values, and convert them into bits. I then byte-align the stream and take the integer representation of each byte to use elsewhere. The problem areas are converting each delta value to bits (3 min) and packing (1 min). Is it possible to do this faster, or is there another way to pack the values straight into integers representing bytes?
From a quick Google of the classes you're instantiating, it looks like you're using the bitstring module. This is written in pure Python, so it's no big surprise that it's pretty slow. You might look at one or more of the following:
struct - a module that comes with Python and lets you pack and unpack C structures into their constituent values
bytearray - a built-in type that lets you accumulate, well, an array of bytes, and has both list-like and string-like operations
bin(x), int(x, 2) - conversion of numbers to a binary string representation and back; string manipulation can sometimes be a reasonably efficient way to do this
bitarray - a native (C) module for bit manipulation with functionality similar to bitstring, but much faster; available as source for compiling on Linux or pre-compiled for Windows
numpy - fast manipulation of arrays of various types, including single bytes; kind of the go-to module for this sort of thing, frankly. http://www.numpy.org/
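As a hedged sketch of the struct + numpy route suggested above, assuming num_bits_pixel is 8 so each delta fits in one signed byte, and widening the 5-bit header field to a full byte for simplicity (delta_values and initial_value stand in for the attributes used in the question):
import struct
import numpy as np

delta_values = [3, -1, 0, 7]      # stand-in for self.delta_values
initial_value = 1000              # stand-in for self.intial_value
num_bits_pixel = 8                # assumed: whole bytes, not arbitrary bit widths

deltas = np.asarray(delta_values, dtype=np.int8)            # vectorised conversion
header = struct.pack(">HB", initial_value, num_bits_pixel)  # uint16 + one byte
payload = header + deltas.tobytes()                         # already byte-aligned
byte_list = list(payload)                                   # ints 0-255, as before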

Is there a way to get a view into a python array.array()?

I'm generating many largish 'random' files (~500 MB each) whose contents are the output of repeated calls to random.randint(...). I'd like to preallocate a large buffer, write longs into that buffer, and periodically flush it to disk. I am currently using array.array(), but I can't see a way to create a view into this buffer. I need one so that I can feed the part of the buffer holding valid data into hashlib.update(...) and write the valid part of the buffer to the file. I could use the slice operator, but AFAICT that creates a copy of the buffer, which isn't what I want.
Is there a way to do this that I'm not seeing?
Update:
I went with numpy, as user42005 and hgomersall suggested. Unfortunately this didn't give me the speedups I was looking for. My dirt-simple C program generates ~700 MB of data in 11 s, while my Python equivalent using numpy takes around 700 s! It's hard to believe that that's the real difference in performance between the two (I'm more likely to believe that I made a naive mistake somewhere...)
I guess you could use numpy: http://www.numpy.org - the fundamental array type in numpy supports no-copy views.
Numpy is incredibly flexible and powerful when it comes to views into arrays whilst minimising copies. For example:
import numpy
a = numpy.random.randint(0, 10, size=10)
b = a[3:10]
b is now a view into the original array.
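Applied to the question's use case, a hedged sketch: 1-D slices are views, and both hashlib's update() and binary file write() accept buffer-protocol objects, which contiguous numpy arrays are.
import hashlib
import numpy as np

buf = np.zeros(1_000_000, dtype=np.int64)  # preallocated buffer
valid = 1234                               # entries currently holding data

h = hashlib.sha256()
h.update(buf[:valid])                      # hash the valid prefix, no copy
with open("out.bin", "wb") as f:
    f.write(buf[:valid])                   # write the same view to disk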
Numpy arrays allow all manner of access directly to the data buffers, and can be trivially typecast. For example:
a = numpy.random.randint(0, 10, size=10)
b = numpy.frombuffer(a.data, dtype='int8')
b is now a view into the memory with the data all as 8-bit integers (the data itself remains unchanged, so each 64-bit int now becomes eight 8-bit ints). These buffer objects (from a.data) are standard Python buffer objects and so can be used in all the places that are defined to work with buffers.
The same is true for multi-dimensional arrays. However, you have to bear in mind how the data lies in memory. For example:
a = numpy.random.randint(0, 10, size=(10, 10))
b = numpy.frombuffer(a[3,:].data, dtype='int8')
will work, but
b = numpy.frombuffer(a[:,3].data, dtype='int8')
returns an error about being unable to get a single-segment buffer for a discontiguous array. This problem is not obvious, because simply assigning that same view to a variable using
b = a[:,3]
returns a perfectly adequate numpy array. However, it is not contiguous in memory, as it's a view into the other array, which need not be (and in this case isn't) a view of contiguous memory. You can get info about the array using its flags attribute:
a[:,3].flags
which returns (among other things) both C_CONTIGUOUS (C order, row major) and F_CONTIGUOUS (Fortran order, column major) as False, but
a[3,:].flags
returns them both as True (a 1-D contiguous array is both; for a non-trivial 2-D array, at most one of them can be True).
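A short hedged illustration of those flags:
import numpy as np

a = np.random.randint(0, 10, size=(10, 10))
print(a[:, 3].flags['C_CONTIGUOUS'], a[:, 3].flags['F_CONTIGUOUS'])  # False False
print(a[3, :].flags['C_CONTIGUOUS'], a[3, :].flags['F_CONTIGUOUS'])  # True True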
