NumPy array of integers from decimal to binary in Python

I have a numpy array: ch=[1, 2, 3, 4, 20, 25]
How can I write it as: b'\x01\x02\x03\x04\x14\x19'
Note: I do not want to convert each integer to binary myself. Is there a function available that does it directly in one step?

You can use the bytes built-in and pass it the sequence:
>>> ch=[1, 2, 3, 4, 20, 25]
>>> bytes(ch)
b'\x01\x02\x03\x04\x14\x19'
On a side note, what you are showing is a Python list, not a NumPy array.
But if you want to operate on a NumPy array, you can first convert it to a Python list:
>>> bytes(np.array(ch).tolist())
b'\x01\x02\x03\x04\x14\x19'
When you call tobytes() directly on the NumPy array for the above data:
>>> np.array(ch).tobytes()
b'\x01\x00\x00\x00\x02\x00\x00\x00\x03\x00\x00\x00\x04\x00\x00\x00\x14\x00\x00\x00\x19\x00\x00\x00'
The above output is also correct; the only difference is due to the data type. If you print the dtype, you'll see that it is numpy.int32 here (the default integer dtype is platform-dependent and is int64 on many systems). That is 32 bits, i.e. 32/8 = 4 bytes, which is the number of bytes used to represent each value.
>>> np.array(ch).dtype
dtype('int32')
If you convert it to an 8-bit (1-byte) integer type, the output will be the same as using the bytes built-in on a list:
>>> np.array(ch).astype(np.int8).tobytes()
b'\x01\x02\x03\x04\x14\x19'
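As a minimal sketch of the point above, assuming every value fits in the range 0..255, the number of bytes per element simply follows the dtype's item size. The explicit dtype string "<i4" is used here only so the result does not depend on the platform's default integer type:

```python
import numpy as np

ch = np.array([1, 2, 3, 4, 20, 25])

# Explicit little-endian 32-bit integers: 4 bytes per value.
wide = ch.astype("<i4").tobytes()

# Unsigned 8-bit integers: 1 byte per value, same as bytes() on a list.
narrow = ch.astype(np.uint8).tobytes()

print(len(wide), len(narrow))
print(narrow == bytes(ch.tolist()))
```

np.uint8 is used rather than np.int8 because bytes values are unsigned; for the values in the question both give the same result.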

The following will work:
b"".join([int(item).to_bytes(1, "big") for item in ch])
First you iterate over the NumPy array, convert each np.int32 object to int, and call int.to_bytes() on it, which returns 1 byte in big-endian order (for a single byte, "little" gives the same result). Then you join them all together.
Alternatively, you could call list() on the array, then pass it to the built-in bytes() constructor:
bytes(list(ch))
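One difference between the approaches is worth checking if the data might contain values outside 0..255. The sketch below uses a made-up input containing 300 to illustrate it: astype(np.uint8) wraps silently, while bytes() raises.

```python
import numpy as np

values = np.array([1, 2, 300])  # 300 does not fit in one byte

# astype() wraps modulo 256 without raising (300 % 256 == 44).
wrapped = values.astype(np.uint8).tobytes()

# bytes() refuses values outside range(256).
try:
    bytes(values.tolist())
    rejected = False
except ValueError:
    rejected = True

print(wrapped, rejected)
```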

Related

How to convert an unformatted string with hex values to a numpy int array?

I am writing a program which gets an unformatted string as an input and should output a numpy int array.
The string contains an id, a timestamp etc. and a hexadecimal data array. Say the input string is data_string = '01190810000235a5000235b4000234c5000211a5'; then 01 is the id, 190810 is the timestamp and 000235a5000235b4000234c5000211a5 is the data array with the values 000235a5, 000235b4, 000234c5, 000211a5. (The real input string is several MB in size.)
I am having problems converting the data array to a numpy integer array. I have come up with:
import numpy as np
data_dict['data array'] = np.core.defchararray.asarray(data_string[8:], 8)
but this way I only get a string array. I tried fiddling with np.fromstring(data_string[8:], np.int32), but this changed the given values of the input string. Is there any way to get an int array from a string? Using a for loop (or similar implementations) is not an option because this code is performance critical.
EDIT:
To clarify my problem...
Input string is
>>> import numpy as np
>>> s = "000235a5000235b4000234c5000211a5"
Converting it with np.core.defchararray.asarray() results in a chararray. But I want an integer type array.
>>> s1 = np.core.defchararray.asarray(s, 8)
>>> s1
chararray(['000235a5', '000235a5', '000235a5', '000235a5'], dtype='<U8')
Converting s with np.fromstring() results in an integer array, but it seems that it does not like hexadecimal numbers.
>>> s2 = np.fromstring(s, dtype=np.int32)
>>> s2
array([842018864, 895563059, 842018864, 878851379, 842018864, 895693875,
842018864, 895562033])
array([000235a5, 000235a5, 000235a5, 000235a5]) is the result I actually want to get.
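One possible loop-free approach, offered as a sketch rather than a definitive answer: decode the hex digits to raw bytes with bytes.fromhex(), then let np.frombuffer() read them as fixed-width integers. It assumes each value is exactly 8 hex digits and should be read as a big-endian unsigned 32-bit integer; neither assumption is confirmed by the question.

```python
import numpy as np

s = "000235a5000235b4000234c5000211a5"

# Decode the hex digits to raw bytes, then read them as big-endian
# unsigned 32-bit integers (8 hex digits = 4 bytes per value).
values = np.frombuffer(bytes.fromhex(s), dtype=">u4")

# Show the values in hex again to compare with the input string.
print([format(int(v), "08x") for v in values])
```

The array returned by np.frombuffer() is read-only because it shares memory with the bytes object; call .copy() or .astype(np.uint32) if a writable, native-endian array is needed.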

Assigning dtype value using array.dtype = <data type> in NumPy arrays gives ambiguous results

I am new to programming and numpy... While reading tutorials and experimenting on jupyter-notebook... I thought of converting dtype of a numpy array as follows:
import numpy as np
c = np.random.rand(4)*10
print(c)
#Output1: [ 0.12757225 5.48992242 7.63139022 2.92746857]
c.dtype = int
print(c)
#Output2: [4593764294844833304 4617867121563982285 4620278199966380988 4613774491979221856]
I know the proper way of changing is:
c = c.astype(int)
But I want to know the reason behind those ambiguous numbers in Output2. What are they and what do they signify?
Floats and integers (numpy.float64s and numpy.int64s) are represented differently in memory. The value 42 stored in these different types corresponds to a different bit pattern in memory.
When you're reassigning the dtype attribute of an array, you keep the underlying data unchanged, and you're telling numpy to interpret that pattern of bits in a new way. Since the interpretation now doesn't match the original definition of the data, you end up with gibberish (meaningless numbers).
On the other hand, converting your array via .astype() will actually convert the data in memory:
>>> import numpy as np
>>> arr = np.random.rand(3)
>>> arr.dtype
dtype('float64')
>>> arr
array([ 0.7258989 , 0.56473195, 0.20885672])
>>> arr.data
<memory at 0x7f10d7061288>
>>> arr.dtype = np.int64
>>> arr.data
<memory at 0x7f10d7061348>
>>> arr
array([4604713535589390862, 4603261872765946451, 4596692876638008676])
Proper conversion:
>>> arr = np.random.rand(3)*10
>>> arr
array([ 3.59591191, 1.21786042, 6.42272461])
>>> arr.astype(np.int64)
array([3, 1, 6])
As you can see, astype meaningfully converts the original values of the array: in this case it truncates each value to its integer part and returns a new array with the corresponding values and dtype.
Note that assigning a new dtype doesn't trigger any checks, so you can do very weird stuff with your array. In the above example, 64 bits of floats were reinterpreted as 64 bits of integers. But you can also change the bit size:
>>> arr = np.random.rand(3)
>>> arr.shape
(3,)
>>> arr.dtype
dtype('float64')
>>> arr.dtype = np.float32
>>> arr.shape
(6,)
>>> arr
array([ 4.00690371e+35, 1.87285304e+00, 8.62005305e+13,
1.33751166e+00, 7.17894062e+30, 1.81315207e+00], dtype=float32)
By telling numpy that each element occupies half the space it originally did, you make numpy deduce that your array has twice as many elements! Clearly not something you should ever want to do.
Another example: consider the 8-bit unsigned integer 255==2**8-1: it corresponds to 11111111 in binary. Now, try to reinterpret two of these numbers as a single 16-bit unsigned integer:
>>> arr = np.array([255,255],dtype=np.uint8)
>>> arr.dtype = np.uint16
>>> arr
array([65535], dtype=uint16)
As you can see, the result is the single number 65535. If that doesn't ring a bell, it's exactly 2**16-1, with 16 ones in its binary pattern. The two all-ones patterns were reinterpreted as a single 16-bit number, and the result changed accordingly. The reason you often see weirder numbers is that reinterpreting floats as ints, or vice versa, leads to a much stronger mangling of the data, due to how floating-point numbers are represented in memory.
As hpaulj noted, you can directly perform this reinterpretation of the data by constructing a new view of the array with a modified dtype. This is probably more useful than having to reassign the dtype of a given array, but then again changing the dtype is only useful in fairly rare, very specific use cases.
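A minimal sketch of the view-versus-astype distinction, using a small hand-picked array instead of random data. The struct cross-check is only an illustration added here, to confirm that the view exposes the raw IEEE 754 bit pattern:

```python
import struct
import numpy as np

arr = np.array([1.0, 2.5])

# view(): same memory, bits reinterpreted as 64-bit integers.
bits = arr.view(np.int64)

# astype(): new array, values converted (truncated toward zero).
converted = arr.astype(np.int64)

# Cross-check the first bit pattern with the struct module (native byte order).
expected = struct.unpack("=q", struct.pack("=d", 1.0))[0]
print(int(bits[0]) == expected)
print(np.shares_memory(arr, bits), np.shares_memory(arr, converted))
print(converted.tolist())
```

Unlike assigning to arr.dtype, view() leaves the original array object untouched, so both interpretations stay available side by side.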

How to limit amount of memory used to store integers?

When I call sys.getsizeof(4), it returns 14. Assuming this is the same as sizeof() in C, this is unacceptably high.
I would like to use the memory array like a big, raw array of bytes. Memory overhead is of the utmost priority, due to the size of the arrays in the project in question. Portability is a huge issue, too, so dropping into C or using a more exotic library is less than optimal.
Is there a way to force Python to use less memory for a single positive signed byte list or tuple member, using only standard Python 3?
14 strikes me as rather low considering that a Python object must at least have a pointer to its type struct and a refcount.
PyObject
All object types are extensions of this type. This is a type which contains the information Python needs to treat a pointer to an object as an object. In a normal “release” build, it contains only the object’s reference count and a pointer to the corresponding type object. Nothing is actually declared to be a PyObject, but every pointer to a Python object can be cast to a PyObject*. Access to the members must be done by using the macros Py_REFCNT and Py_TYPE.
This overhead you will have for every Python object. The only way to reduce the overhead / payload ratio is to have more payload as for example in arrays (both plain Python and numpy).
The trick here is that array elements typically are not Python objects, so they can dispense with the refcount and type pointer and occupy just as much memory as the underlying C type.
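To see the overhead difference in practice, the following sketch compares container sizes for the same small integers. The exact numbers printed depend on the CPython version and platform, so none are claimed here; note also that sys.getsizeof() on a list counts only the list's pointer table, not the int objects it refers to.

```python
import array
import sys

values = [n % 128 for n in range(1000)]  # small non-negative integers

as_list = values                        # list of pointers to int objects
as_array = array.array("b", values)     # signed 8-bit items, stored inline
as_bytes = bytes(values)                # immutable, unsigned 8-bit items

print(sys.getsizeof(as_list), sys.getsizeof(as_array), sys.getsizeof(as_bytes))
print(as_array.itemsize)
```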
(Hat tip to martineau for his comment...)
If you're only concerned with unsigned bytes (values [0, 255]), then the simplest answer might be the built-in bytearray and its immutable sibling, bytes.
One potential problem is that these are intended to represent encoded strings (reading from or writing to the outside world), so their default __repr__ is "string-like", not a list of integers:
>>> lst = [0x10, 0x20, 0x30, 0x41, 0x61, 0x7f, 0x80, 0xff]
>>> bytearray(lst)
bytearray(b'\x10 0Aa\x7f\x80\xff')
>>> bytes(lst)
b'\x10 0Aa\x7f\x80\xff'
Note that space, '0', 'A', and 'a' appear literally, while "unprintable" values appear as '\x##' string escape sequences.
If you're trying to think of those bytes as a bunch of integers, this is not what you want.
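That said, in Python 3 only the repr is string-like. A short sketch showing that indexing and iteration still yield plain integers:

```python
lst = [0x10, 0x20, 0x30, 0x41, 0x61, 0x7f, 0x80, 0xff]
ba = bytearray(lst)

first = ba[0]        # indexing yields an int, not a 1-character string
as_ints = list(ba)   # iteration yields ints as well
ba[0] = 0x11         # bytearray is mutable; bytes is not

print(first, as_ints, ba[0])
```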
For homogeneous arrays of fixed-width integers or floats (much like in C), use the standard library's array module.
>>> import array
# One megabyte of unsigned 8-bit integers.
>>> a = array.array('B', (n % 2**8 for n in range(2**20)))
>>> len(a)
1048576
>>> a.typecode
'B'
>>> a.itemsize
1
>>> a.buffer_info() # Memory address, length in items.
(24936384, 1048576)
>>> a_slice = a[slice(1024, 1040)] # Can be sliced like a list.
>>> a_slice
array('B', [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15])
>>> type(a_slice) # Slice is also an array, not a list.
<class 'array.array'>
For more complex data, the struct module is for packing heterogeneous records, much like C's struct keyword.
Unlike C, I don't see any obvious way to create an array of structs.
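One way to approximate an array of structs with only the standard library is to pack fixed-size records back to back into a bytearray. This is a sketch, not an established idiom from the answer above, and the record layout (a uint16 id, a uint8 flag and a float32 value) is purely hypothetical:

```python
import struct

# Hypothetical record layout: uint16 id, uint8 flag, float32 value,
# little-endian with no padding ("<" disables alignment).
record = struct.Struct("<HBf")
count = 3

# One contiguous buffer holding `count` records back to back.
buf = bytearray(record.size * count)
for i in range(count):
    record.pack_into(buf, i * record.size, i, i % 2, i * 0.5)

# Random access to record 1, and iteration over all records.
second = record.unpack_from(buf, 1 * record.size)
all_records = list(record.iter_unpack(buf))

print(record.size, second, all_records)
```

Each access unpacks into a fresh tuple, so this trades speed for compact storage.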
These data structures all make use of Python's Buffer Protocol, which (in CPython, at least) allows a Python class to expose its inner C-like array directly to other Python code.
If you need to do something complicated, you might have to learn this...
or give up and use NumPy.

Integer array to string in Python

I have an array of integers, and I need to transform it into a string.
[1,2,3,4] => '\x01\x02\x03\x04'
What function can I use for it? I tried with str(), but it returns '1234'.
string = ""
for val in [1,2,3,4]:
string += str(val) # '1234'
''.join([chr(x) for x in [1, 2, 3, 4]])
You can convert a list of small numbers directly to a bytearray:
If it is an iterable, it must be an iterable of integers in the range 0 <= x < 256, which are used as the initial contents of the array.
And you can convert a bytearray directly to a str (2.x) or bytes (3.x, or 2.6+).
In fact, in 3.x, you can even convert the list straight to bytes without going through bytearray:
constructor arguments are interpreted as for bytearray().
So:
str(bytearray([1,2,3,4])) # 2.6-2.7 only
bytes(bytearray([1,2,3,4])) # 2.6-2.7, 3.0+
bytes([1,2,3,4]) # 3.0+ only
If you really want a string in 3.x, as opposed to a byte string, you need to decode it:
bytes(bytearray([1,2,3,4])).decode('ascii')
See Binary Sequence Types in the docs for more details.
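A short Python 3 sketch tying the options above together. The choice of "latin-1" instead of "ascii" is an assumption made here so that values up to 255 also decode; "ascii" fails for anything above 127.

```python
values = [1, 2, 3, 4]

as_bytes = bytes(values)               # byte string
as_text = as_bytes.decode("latin-1")   # text string with the same code points
via_chr = "".join(map(chr, values))    # same text, built character by character

print(repr(as_bytes), repr(as_text), as_text == via_chr)
```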
Simple solution
digits = [1,2,3,4]
print(''.join(map(str,digits)))

Converting 16-bit integer to 32-bit floating-point

I am trying to port a portion of code written in a different language (an obscure one called Igor Pro by Wavemetrics, for those of you who have heard of it) to Python.
In this code, there is a conversion of a data type from a 16-bit integer (read in from a 16-bit, big endian binary file) to single-precision (32-bit) floating-point. In this program, the conversion is as follows:
Signed 16-bit integer:
print tmp
tmp[0]={-24160,18597,-24160,18597,-24160}
converted to 32-bit floating-point:
Redimension/S/E=1 tmp
print tmp
tmp[0]={339213,339213,5.79801e-41,0,0}
The /S flag/option indicates that the data type of tmp should be float32 instead of int16. However, I believe the important flag/option is /E=1, which is said to "Force reshape without converting or moving data."
In Python, the conversion is as follows:
>>> tmp[:5]
array([-24160, 18597, -24160, 18597, -24160], dtype=int16)
>>> tmp.astype('float32')
array([-24160., 18597., -24160., ..., 18597., -24160., 18597.], dtype=float32)
Which is what I expect, but I need to find a function/operation that emulates the /E=1 option in the original code above. Is there an obvious way in which -24160 and 18597 would both be converted to 339213? Does this have anything to do with byteswap or newbyteorder or something else?
import numpy
tmp=numpy.array([-24160,18597,-24160,18597,-24160, 0], numpy.int16)
tmp.dtype = numpy.float32
print(tmp)
Result:
[ 3.39213000e+05 3.39213000e+05 5.79801253e-41]
I had to add a zero to the list of values because there is an odd number of them: five 16-bit values cannot be reinterpreted as 32-bit floats.
Use view instead of astype:
In [9]: tmp=np.array([-24160, 18597, -24160, 18597, -24160, 18597], dtype=np.int16)
In [10]: tmp.view('float32')
Out[10]: array([ 339213., 339213., 339213.], dtype=float32)
.astype creates a copy of the array cast to the new dtype
.view returns a view of the array (with the same underlying data),
with the data interpreted according to the new dtype.
Is there an obvious way in which -24160 and 18597 would both be converted to 339213?
No, but neither is there any obvious way in which -24160 would convert to 339213 and 5.79801e-41 and 0.
It looks more like the conversion takes two input numbers to create one output (probably by concatenating the raw 2×16 bits to 32 bits and calling the result a float). In that case the pair -24160,18597 consistently becomes 339213, and 5.79801e-41 probably results from -24160,0 where the 0 is invented because we run out of inputs. Since 5.79801e-41 looks like it might be a single-precision denormal, this implies that the two 16-bit blocks are probably concatenated in little-endian order.
It remains to be seen whether you need to byte-swap each of the 16-bit inputs, but you can check that for yourself.
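The pairing described above can be checked with a small sketch. It assumes the 16-bit values are laid out little-endian in memory and pads the odd-length input with a zero, as the original program appears to do; the helper name padded and the struct cross-check are additions for illustration.

```python
import struct
import numpy as np

# Explicit little-endian int16 so the result does not depend on the host.
tmp = np.array([-24160, 18597, -24160, 18597, -24160], dtype="<i2")

# Pad with a zero so the total byte count is a multiple of 4.
padded = np.zeros(tmp.size + tmp.size % 2, dtype="<i2")
padded[:tmp.size] = tmp

# Reinterpret each pair of int16 values as one float32, without copying.
floats = padded.view("<f4")

# Cross-check one pair with the struct module.
check = struct.unpack("<f", struct.pack("<hh", -24160, 18597))[0]
print(floats, check)
```

The last element comes from the pair (-24160, 0) and is a tiny denormal value, which matches the 5.79801e-41 seen in the Igor Pro output.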
