Reading binary data at the bit level - Python

I have a binary file in which the data is organised in 16 bit integer blocks like so:
bit 15: digital bit 1
bit 14: digital bit 2
bits 13 to 0: 14 bit signed integer
The only way I have found to extract the data from the file into 3 arrays is:
data = np.fromfile("test1.bin", dtype=np.uint16)
digbit1 = data >= 2**15
data = np.array([x - 2**15 if x >= 2**15 else x for x in data], dtype=np.uint16)
digbit2 = data >= 2**14
data = np.array([x-2**14 if x >= 2**14 else x for x in data])
data = np.array([x-2**14 if x >= 2**13 else x for x in data], dtype=np.int16)
Now I know that I could do the same with a for loop over the original data and fill three separate arrays, but this would still be ugly. What I would like to know is how to do this more efficiently, in the style of dtype=[('db', [('1', bit), ('2', bit)]), ('temp', 14bit-signed-int)]), so that it would be easy to access, with data['db']['1'] giving an array of ones and zeros.

Here's a way that is more efficient than your code: NumPy does the looping at compiled speed, which is much faster than using Python loops, and we can use bitwise arithmetic instead of those if tests.
You didn't supply any sample data, so I wrote some plain Python 3 code to create some fake data. I save that data to file in big-endian format, but that's easy enough to change if your data is actually stored in little-endian. I don't use numpy.fromfile to read that data because it's faster to read the file in plain Python and then convert the read bytes using numpy.frombuffer.
The only tricky part is handling those 14 bit signed integers. I assume you're using two's complement representation.
import numpy as np

# Make some fake data
bdata = []
bitlen = 14
mask = (1 << bitlen) - 1
for i in range(12):
    # Two initial bits
    a = i % 4
    # A signed number
    b = i - 6
    # Combine initial bits with the signed number,
    # using 14 bit two's complement.
    n = (a << bitlen) | (b & mask)
    # Convert to bytes, using 16 bit big-endian
    nbytes = n.to_bytes(2, 'big')
    bdata.append(nbytes)
    print('{} {:2} {:016b} {} {:>5}'.format(a, b, n, nbytes.hex(), n))
print()

# Save the data to a file
fname = 'test1.bin'
with open(fname, 'wb') as f:
    f.write(b''.join(bdata))

# And read it back in
with open(fname, 'rb') as f:
    data = np.frombuffer(f.read(), dtype='>u2')
print(data)

# Get the leading bits
digbit1 = data >> 15
print(digbit1)
# Get the second bits
digbit2 = (data >> 14) & 1
print(digbit2)
# Get the 14 bit signed integers
data = ((data & mask) << 2).astype(np.int16) >> 2
print(data)
Output:
0 -6 0011111111111010 3ffa 16378
1 -5 0111111111111011 7ffb 32763
2 -4 1011111111111100 bffc 49148
3 -3 1111111111111101 fffd 65533
0 -2 0011111111111110 3ffe 16382
1 -1 0111111111111111 7fff 32767
2 0 1000000000000000 8000 32768
3 1 1100000000000001 c001 49153
0 2 0000000000000010 0002 2
1 3 0100000000000011 4003 16387
2 4 1000000000000100 8004 32772
3 5 1100000000000101 c005 49157
[16378 32763 49148 65533 16382 32767 32768 49153 2 16387 32772 49157]
[0 0 1 1 0 0 1 1 0 0 1 1]
[0 1 0 1 0 1 0 1 0 1 0 1]
[-6 -5 -4 -3 -2 -1 0 1 2 3 4 5]
If you do need to use little-endian byte ordering, just change the dtype to '<u2' in the np.frombuffer call. And to test it, change 'big' to 'little' in the n.to_bytes call in the fake data making section.
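The exact nested dtype from the question isn't possible, since NumPy structured dtypes can't address sub-byte fields, but a plain dict of the decoded arrays gives similar field-style access. A minimal sketch using the same bit tricks as above, with a few hand-made sample words:

```python
import numpy as np

# Sample 16-bit words (values taken from the table above)
raw = np.array([0x3ffa, 0x8000, 0xc001], dtype=np.uint16)
mask = (1 << 14) - 1

decoded = {
    'db1': (raw >> 15) & 1,   # digital bit 1
    'db2': (raw >> 14) & 1,   # digital bit 2
    # sign-extend the low 14 bits: shift left into the sign bit,
    # then arithmetic-shift back down as int16
    'temp': ((raw & mask) << 2).astype(np.int16) >> 2,
}

print(decoded['db1'])   # [0 1 1]
print(decoded['temp'])  # [-6  0  1]
```

This keeps everything vectorized while giving decoded['db1'], decoded['db2'], and decoded['temp'] the array-per-field access the question asked for.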

Related

How to convert a dataframe containing 1's and 0's and add a new column to the same dataframe that represents the hex value of the entire row in Python

I have a dataframe of 51 rows and 464 columns; the columns contain 1's and 0's. I want to have an encoded hex value for each row, as you see in the attached picture.
I was trying to use numpy to do the hex conversion, but it fails:
df = pd.DataFrame(np.random.randint(0,2,size=(51, 464)))
#converting into numpy for easier shifting
a = df.values
b = a.dot(2**np.arange(a.size)[::-1])
I want every 4 columns grouped to produce one hexadecimal digit, and if the column count doesn't divide evenly by 4 (for example 463 instead of 464), then the trailing hexadecimal digit should be padded with as many zeros as needed to make a full hexadecimal value.
This code only works for 64 bits length and then fails.
I was following this example
binary0|1 to hex string
any suggestions on how to do this?
Doesn't this do what you want?
df.apply(lambda row: hex(int(''.join(map(str, row)), base=2)), axis=1)
Convert every number in the row to a string
Join them to create one big binary number as a string
Convert it to an integer with base 2 (since the row is in binary format)
Convert it to hex
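For example, tracing those steps on one short, made-up row:

```python
row = [1, 0, 1, 1]
s = ''.join(map(str, row))  # '1011'
n = int(s, base=2)          # 11
print(hex(n))               # 0xb
```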
Edit: To convert every 4 piece with the same manner:
def hexize(row):
    hexes = '0x'
    row = ''.join(map(str, row))
    for i in range(0, len(row), 4):
        value = row[i:i+4]
        value = value.ljust(4, '0')  # right-fill with 0
        value = hex(int(value, base=2))
        hexes += value[2:]
    return hexes

df.apply(hexize, axis=1)
hexize('011101100')  # returns '0x760'
Given input data:
ECID,T1,T2,T3,T4,T5,T6,T7,T8,T9,T10,T11,T12,T13,T14,T15,T16,T17,T18,T19,T20,T21,T22,T23,T24,T25,T26,T27,T28,T29,T30,T31,T32,T33,T34,T35,T36,T37,T38,T39,T40,T41,T42,T43,T44,T45,T46,T47,T48,T49,T50,T51
ABC123,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
XYZ345,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
DEF789,1,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
434thECID,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
This adds an "Encoded" column similar to what was asked. The first row example in the original question seems to have the wrong number of Fs:
import pandas as pd

def encode(row):
    s = ''.join(str(x) for x in row[1:])  # create the binary string
    s += '0' * (-len(row[1:]) % 4)        # pad length to a multiple of 4 with zeros
    i = int(s, 2)                         # convert to integer, base 2
    h = hex(i).rstrip('0')                # strip trailing zeros
    return h if h != '0x' else '0x0'      # handle '0x0' stripping down to '0x'

df = pd.read_csv('input.csv')
df['Encoded'] = df.apply(encode, axis=1)
print(df)
Output:
ECID T1 T2 T3 T4 T5 ... T47 T48 T49 T50 T51 Encoded
0 ABC123 1 1 1 1 1 ... 1 1 1 1 1 0xffffffffffffe
1 XYZ345 1 0 0 0 0 ... 0 0 0 0 0 0x8
2 DEF789 1 0 1 0 1 ... 0 0 0 0 0 0xaa
3 434thECID 0 0 0 0 0 ... 0 0 0 0 0 0x0
[4 rows x 53 columns]
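As a vectorized alternative (my own sketch, not from the answers above): reducing each 4-column group with a small matrix product sidesteps the overflow that broke the question's single full-width dot product, since each group only ever produces a value from 0 to 15:

```python
import numpy as np
import pandas as pd

# 9 columns, matching the hexize('011101100') example above
df = pd.DataFrame([[0, 1, 1, 1, 0, 1, 1, 0, 0]])
a = df.values
pad = -a.shape[1] % 4                                # zeros needed to reach a multiple of 4
a = np.pad(a, ((0, 0), (0, pad)), mode='constant')   # right-pad with zeros
nibbles = a.reshape(a.shape[0], -1, 4) @ np.array([8, 4, 2, 1])  # one hex digit per group
hexdigits = np.array(list('0123456789abcdef'))
df['Encoded'] = ['0x' + ''.join(row) for row in hexdigits[nibbles]]
print(df['Encoded'][0])   # 0x760
```

The per-row string join is still Python, but the expensive binary-to-digit reduction happens in NumPy, and it works for any number of columns.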

How does the &= assignment operator work?

Can anyone explain to me how the &= assignment operator works in Python?
&= is the bitwise AND assignment operator. It works on the binary representation of numbers. I'll explain with an example.
Example:
x = 5
In binary, 5 is 101.
Now
x &= 3 means x = x & 3
You also need to convert 3 to a binary number, which is 011.
Now apply the AND operator to both binary numbers, bit by bit:
  101
& 011
-----
  001
Converting the resulting binary number back to decimal gives 1.
You can use online converter from decimal to binary and binary to decimal.
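The same walkthrough as runnable code:

```python
x = 5          # 0b101
x &= 3         # shorthand for x = x & 3; 0b101 & 0b011 = 0b001
print(x)       # 1
print(bin(x))  # 0b1
```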
&= is a bitwise operator; it works on the individual bits, as in the following:
a = 60       # 60 = 0011 1100
b = 13       # 13 = 0000 1101
c = a & b    # 12 = 0000 1100
print("c: ", c)
output:
c: 12
It works on the basis of the logic given below:
0 & 0 = 0
0 & 1 = 0
1 & 0 = 0
1 & 1 = 1
Look at the comments I have given in the code.

How to unpack a uint32 array with np.unpackbits

I used a piece of code to create a 2D binary valued array to cover all possible scenarios of an event. For the first round, I tested it with 2 members.
Here is my code:
number_of_members = 2
n = number_of_members
values = np.arange(2**n, dtype=np.uint8).reshape(-1, 1)
print('$$$ ===> ', values)
bin_array = np.unpackbits(values, axis=1)[:, -n:]
print('*** ===> ', bin_array)
And the result is this:
$$$ ===> [[0]
[1]
[2]
[3]]
*** ===> [[0 0]
[0 1]
[1 0]
[1 1]]
As you can see, it correctly provided my 2D binary array.
The problem begins when I intend to use number_of_members = 20. If I assign 20 to number_of_members, Python shows this result:
$$$ ===> [[ 0]
[ 1]
[ 2]
...
[253]
[254]
[255]]
*** ===> [[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 1]
[0 0 0 ... 0 1 0]
...
[1 1 1 ... 1 0 1]
[1 1 1 ... 1 1 0]
[1 1 1 ... 1 1 1]]
The result has 8 columns, but I expected an array of 32 columns. How can I unpack a uint32 array?
As you noted correctly, np.unpackbits operates only on uint8 arrays. The nice thing is that you can view an array of any type as uint8. So create values with dtype=np.uint32 instead of np.uint8, and then make a uint8 view into that data like this:
view = values.view(np.uint8)
On my machine the native byte order is little-endian, which makes trimming easier. You can force little-endian order conditionally across all systems:
import sys

if values.dtype.byteorder == '>' or (values.dtype.byteorder == '=' and sys.byteorder == 'big'):
    view = view[:, ::-1]
Now you can unpack the bits. In fact, unpackbits has a nice feature that I personally added, the count parameter. It allows you to make your output be exactly 20 bits long instead of the full 32, without subsetting. Since the output will be mixed big-endian bits and little-endian bytes, I recommend displaying the bits in little-endian order too, and flipping the entire result:
bin_array = np.unpackbits(view, axis=1, count=20, bitorder='little')[:, ::-1]
The result is a (1<<20, 20) array with the exact values you want.
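Putting the pieces together as a runnable sketch (the count= and bitorder= parameters need NumPy 1.17+; a small demo range is used here instead of all 2**20 rows to keep the output readable):

```python
import sys
import numpy as np

n = 20  # number_of_members
values = np.arange(8, dtype=np.uint32).reshape(-1, 1)  # demo: first 8 rows of 2**n

view = values.view(np.uint8)                # four bytes per uint32
if values.dtype.byteorder == '>' or (values.dtype.byteorder == '=' and
                                     sys.byteorder == 'big'):
    view = view[:, ::-1]                    # normalize to little-endian byte order

# unpack the low n bits LSB-first, then flip each row to MSB-first
bin_array = np.unpackbits(view, axis=1, count=n, bitorder='little')[:, ::-1]

print(bin_array.shape)     # (8, 20)
print(bin_array[5][-4:])   # last four bits of 5: [0 1 0 1]
```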

Filling in missing data in Python

I was hoping you would be able to help me solve a small problem.
I am using a small device that prints out two properties that I save to a file. The device rasters in X and Y direction to form a grid. I am interested in plotting the relative intensity of these two properties as a function of the X and Y dimensions. I record the data in 4 columns that are comma separated (X, Y, property 1, property 2).
The grid is examined in lines, so for each Y value, it will move from X1 to X2 which are separated several millimeters apart. Then it will move to the next line and over again.
I am able to process the data in python with pandas/numpy but it doesn't work too well when there are any missing rows (which unfortunately does happen).
I have attached a sample of the output (and annotated the problems):
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
44,12,50,5
45,12,100,6
46,12,1500,7
47,12,2500,8
Sometimes, however a line or a few will be missing making it not possible to process and plot. Currently I have not been able to automatically fix it and have to do it manually. The bad output looks like this:
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
45,12,100,5 << missing 44,12...
46,12,1500,6
47,12,2500,7
I know the number of lines I expect since I know my range of X and Y.
What would be the best way to deal with this? Currently I manually enter the missing X and Y values and populate property 1 and 2 with values of 0. This can be time consuming and I would like to automate it. I have two questions.
Question 1: How can I automatically fill in my missing data with the corresponding values of X and Y and two zeros? This could be obtained from a pre-generated array of X and Y values that correspond to the experimental range.
Question 2: Is there a better way to split the file into separate arrays for plotting (rather than using the 'New' line)? For instance, by having an 'if' condition that outputs each line between X(start) and X(end) to a separate array? I've tried doing that but with no success.
I've attached my current (crude) code:
df = pd.read_csv('FileName.csv', delimiter=',', skiprows=0)
rows = [-1] + np.where(df['X'] == 'New')[0].tolist() + [len(df.index)]
dff = {}
for i, r in enumerate(rows[:-1]):
    dff[i] = df[r+1: rows[i+1]]
maxY = len(dff)
data = []
data2 = []
for yaxes in range(0, maxY):
    data2.append(dff[yaxes].iloc[:, 2])  # .ix is deprecated; .iloc does positional indexing
<data2 is then used for plotting using matplotlib>
To answer my Question 1, I was thinking about using the 'reindex' and 'reset_index' functions, however haven't managed to make them work.
I would appreciate any suggestions.
Does this do what you want?
Q1: fill X using reindex, and others using fillna
Q2: Passing each separated chunk to read_csv via StringIO is easier (the code below is Python 2; on Python 3, use from io import StringIO and drop the unicode() call)
# read the file and split the input
f = open('temp.csv', 'r')
chunks = f.read().split('New')
# read each chunk as a separate dataframe, using the first column as index
dfs = [pd.read_csv(StringIO(unicode(chunk)), header=None, index_col=0) for chunk in chunks]
def pad(df):
    # reindex; you should know the range of x
    df = df.reindex(np.arange(44, 48))
    # pad y backward / forward, assuming y has a single value per chunk
    df[1] = df[1].fillna(method='bfill')
    df[1] = df[1].fillna(method='ffill')
    # pad the other columns with 0
    df = df.fillna(0)
    # revert the index to values
    return df.reset_index(drop=False)

dfs = [pad(df) for df in dfs]
dfs[0]
# 0 1 2 3
# 0 44 11 500 1
# 1 45 11 120 2
# 2 46 11 320 3
# 3 47 11 700 4
# dfs[1]
# 0 1 2 3
# 0 44 12 0 0
# 1 45 12 100 5
# 2 46 12 1500 6
# 3 47 12 2500 7
First Question
I've included print statements inside the function to explain how it works.
def replace_missing(df, Ids):
    # check which values are missing
    missing = np.setdiff1d(Ids, df[0])
    if len(missing) > 0:
        missing_df = pd.DataFrame(data=np.zeros((len(missing), 4)))
        #print('---missing df---')
        #print(missing_df)
        missing_df[0] = missing
        #print('---missing df---')
        #print(missing_df)
        missing_df[1].replace(0, df[1].iloc[0], inplace=True)
        #print('---missing df---')
        #print(missing_df)
        df = pd.concat([df, missing_df])
        #print('---final df---')
        #print(df)
    return df

Ids = np.arange(44, 48)
final_df = df1.groupby(df1[1], as_index=False).apply(replace_missing, Ids).reset_index(drop=True)
final_df
Output:
0 1 2 3
44 11 500 1
45 11 120 2
46 11 320 3
47 11 700 4
45 12 100 5
46 12 1500 6
47 12 2500 7
44 12 0 0
Second Question
group = final_df.groupby(final_df[1])
separate = [group.get_group(key) for key in group.groups.keys()]
separate[0]
Output:
0 1 2 3
44 11 500 1
45 11 120 2
46 11 320 3
47 11 700 4

Python: Why does right shift >> round down and where should it be used?

I've never used the >> and << operators, not because I've never needed them, but because I don't know if I could have used them, or where I should have.
100 >> 3 outputs 12 instead of 12.5. Why is this? Perhaps learning where right shift is best used will answer that implicitly, but I'm curious.
Right shift is not division
Let's look at what right-shift actually does, and it will become clear.
First, recall that a number is stored in memory as a collection of binary digits. If we have 8 bits of memory, we can store 2 as 00000010 and 5 as 00000101.
Right-shift takes those bits and shifts them to the right. For example, right-shifting the two numbers above by one gives 00000001 and 00000010 respectively.
Notice that the lowest (right-most) bit is shifted off the end entirely and has no effect on the final result.
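A quick check in Python:

```python
print(2 >> 1)  # 00000010 -> 00000001, i.e. 1
print(5 >> 1)  # 00000101 -> 00000010, i.e. 2 (the low 1 bit is discarded)
```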
>> and << are the right and left bit shift operators, respectively. You should look at the binary representation of the numbers.
>>> bin(100)
'0b1100100'
>>> bin(12)
'0b1100'
The other answers explain the idea of bitshifting, but here's specifically what happens for 100>>3
100
128 64 32 16 8 4 2 1
0 1 1 0 0 1 0 0 = 100
100 >> 1
128 64 32 16 8 4 2 1
0 0 1 1 0 0 1 0 = 50
100 >> 2
128 64 32 16 8 4 2 1
0 0 0 1 1 0 0 1 = 25
100 >> 3
128 64 32 16 8 4 2 1
0 0 0 0 1 1 0 0 = 12
You won't often need to use it, unless you need some really quick division by 2, but even then: don't. It makes the code more complicated than it needs to be, and the speed difference is unnoticeable.
The main time you'd ever need to use it would be if you're working with binary data, and you specifically need to shift the bits around. The only real use I've had for it was reading & writing ID3 tags, which stores size information in 7-bit bytes, like so:
0xxxxxxx 0xxxxxxx 0xxxxxxx 0xxxxxxx.
which would need to be put together like this:
0000xxxx xxxxxxxx xxxxxxxx xxxxxxxx
to give a normal integer in memory.
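That reassembly can be sketched like so (the helper name synchsafe_to_int is mine; ID3v2 calls these 7-bit quantities "synchsafe" integers):

```python
def synchsafe_to_int(data: bytes) -> int:
    """Combine 7-bit bytes (high bit always 0) into a normal integer."""
    n = 0
    for b in data:
        n = (n << 7) | (b & 0x7f)  # shift in 7 bits at a time
    return n

print(synchsafe_to_int(b'\x00\x00\x02\x01'))  # (2 << 7) | 1 = 257
```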
Bit shifting an integer gives another integer. For instance, the number 12 is written in binary as 0b1100. If we bit shift by 1 to the right, we get 0b110 = 6. If we bit shift by 2, we get 0b11 = 3. And lastly, if we bitshift by 3, we get 0b1 = 1 rather than 1.5. This is because the bits that are shifted beyond the register are lost.
One easy way to think of it is bitshifting to the right by N is the same as dividing by 2^N and then truncating the result.
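That equivalence is easy to verify (note that for negative numbers, Python's >> floors rather than truncating toward zero, matching //):

```python
for n in range(4):
    print(100 >> n, 100 // 2**n)  # identical pairs: 100, 50, 25, 12

# Negative numbers floor, just like //:
print(-100 >> 3, -100 // 2**3)    # -13 -13
```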
I have read the answers above and just want to add a slightly more practical example that I had seen before.
Let us assume, that you want to create a list of powers of two. So, you can do this using left shift:
n = 10
list_ = [1<<i for i in range(1, n+1)] # Where n is a maximum power.
print(list_)
# Output: [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
You can timeit it if you want, but I am pretty sure that the code above is one of the fastest solutions for this problem. But what I cannot understand is when you would use right shift.
