I have a dataframe of 51 rows and 464 columns , the columns contain 1's and 0's. I want to have a encoded value of the hex as you see in the attached picture.
I was trying to use numpy to make the hex conversion but it would fail
df = pd.DataFrame(np.random.randint(0,2,size=(51, 464)))
#converting into numpy for easier shifting
a = df.values
b = a.dot(2**np.arange(a.size)[::-1])
I want to have every 4 columns grouped to produce the hexadecimal value and then if there are odd columns for ex: 463 instead of 464 then the trailing hexadecimal will be padded with zero or zeroes based on how many ever needed to make the full hexadecimal value
This code only works for 64 bits length and then fails.
I was following this example
binary0|1 to hex string
any suggestions on how to do this?
Doesn't this do what you want?
df.apply(lambda row: hex(int(''.join(map(str, row)), base=2)), axis=1)
Convert to string every number in a row
Join them to create one big number in string
Convert it to integer with base 2 (since a row is in binary format)
Convert it to hex
Edit: To convert every 4 piece with the same manner:
def hexize(row):
hexes = '0x'
row = ''.join(map(str, row))
for i in range(0, len(row), 4):
value = row[i:i+4]
value = value.ljust(4, '0') # right fill with 0
value = hex(int(value, base=2))
hexes += value[2:]
return hexes
df.apply(hexize, axis=1)
hexize('011101100') # returns '0x760'
Given input data:
ECID,T1,T2,T3,T4,T5,T6,T7,T8,T9,T10,T11,T12,T13,T14,T15,T16,T17,T18,T19,T20,T21,T22,T23,T24,T25,T26,T27,T28,T29,T30,T31,T32,T33,T34,T35,T36,T37,T38,T39,T40,T41,T42,T43,T44,T45,T46,T47,T48,T49,T50,T51
ABC123,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1
XYZ345,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
DEF789,1,0,1,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
434thECID,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
This adds an "Encoded" column similar to what was asked. The first row example in the original question seems to have the wrong number of Fs:
import pandas as pd
def encode(row):
s = ''.join(str(x) for x in row[1:]) # Create binary string
s += '0' * (4 - len(row[1:]) % 4) # Make length a multiple of 4 by adding zeros
i = int(s,2) # convert to integer base 2
h = hex(i).rstrip('0') # strip trailing zeros
return h if h != '0x' else '0x0' # Handle special case of '0x0' stripping to '0x'
df = pd.read_csv('input.csv')
df['Encoded'] = df.apply(encode,axis=1)
print(df)
Output:
ECID T1 T2 T3 T4 T5 ... T47 T48 T49 T50 T51 Encoded
0 ABC123 1 1 1 1 1 ... 1 1 1 1 1 0xffffffffffffe
1 XYZ345 1 0 0 0 0 ... 0 0 0 0 0 0x8
2 DEF789 1 0 1 0 1 ... 0 0 0 0 0 0xaa
3 434thECID 0 0 0 0 0 ... 0 0 0 0 0 0x0
[4 rows x 53 columns]
I used a piece of code to create a 2D binary valued array to cover all possible scenarios of an event. For the first round, I tested it with 2 members.
Here is my code:
number_of_members = 2
n = number_of_members
values = np.arange(2**n, dtype=np.uint8).reshape(-1, 1)
print('$$$ ===> ', values)
bin_array = np.unpackbits(values, axis=1)[:, -n:]
print('*** ===> ', bin_array)
And the result is this:
$$$ ===> [[0]
[1]
[2]
[3]]
*** ===> [[0 0]
[0 1]
[1 0]
[1 1]]
As you can see, it correctly provided my 2D binary array.
The problem begins when I intended to use number_of_members = 20. If I assign 20 to number_of_members python shows this as result:
$$$ ===> [[ 0]
[ 1]
[ 2]
...
[253]
[254]
[255]]
*** ===> [[0 0 0 ... 0 0 0]
[0 0 0 ... 0 0 1]
[0 0 0 ... 0 1 0]
...
[1 1 1 ... 1 0 1]
[1 1 1 ... 1 1 0]
[1 1 1 ... 1 1 1]]
The result has 8 columns, but I expected an array of 32 columns. How can I unpack a uint32 array?
As you noted correctly, np.unpackbits operates only on uint8 arrays. The nice thing is that you can view any type as uint8. You can create a uint32 view into your data like this:
view = values.view(np.uint8)
On my machine, this is little-endian, which makes trimming easier. You can force little-endian order conditionally across all systems:
if values.dtype.byteorder == '>' or (values.dtype.byteorder == '=' and sys.byteorder == 'big'):
view = view[:, ::-1]
Now you can unpack the bits. In fact, unpackbits has a nice feature that I personally added, the count parameter. It allows you to make your output be exactly 20 bits long instead of the full 32, without subsetting. Since the output will be mixed big-endian bits and little-endian bytes, I recommend displaying the bits in little-endian order too, and flipping the entire result:
bin_array = np.unpackbits(view, axis=1, count=20, bitorder='little')[:, ::-1]
The result is a (1<<20, 20) array with the exact values you want.
I was hoping you would be able to help me solve a small problem.
I am using a small device that prints out two properties that I save to a file. The device rasters in X and Y direction to form a grid. I am interested in plotting the relative intensity of these two properties as a function of the X and Y dimensions. I record the data in 4 columns that are comma separated (X, Y, property 1, property 2).
The grid is examined in lines, so for each Y value, it will move from X1 to X2 which are separated several millimeters apart. Then it will move to the next line and over again.
I am able to process the data in python with pandas/numpy but it doesn't work too well when there are any missing rows (which unfortunately does happen).
I have attached a sample of the output (and annotated the problems):
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
44,12,50,5
45,12,100,6
46,12,1500,7
47,12,2500,8
Sometimes, however a line or a few will be missing making it not possible to process and plot. Currently I have not been able to automatically fix it and have to do it manually. The bad output looks like this:
44,11,500,1
45,11,120,2
46,11,320,3
47,11,700,4
New << used as my Y axis separator
45,12,100,5 << missing 44,12...
46,12,1500,6
47,12,2500,7
I know the number of lines I expect since I know my range of X and Y.
What would be the best way to deal with this? Currently I manually enter the missing X and Y values and populate property 1 and 2 with values of 0. This can be time consuming and I would like to automate it. I have two questions.
Question 1: How can I automatically fill in my missing data with the corresponding values of X and Y and two zeros? This could be obtained from a pre-generated array of X and Y values that correspond to the experimental range.
Question 2: Is there a better way to split the file into separate arrays for plotting (rather than using the 'New' line?) For instance, by having a 'if' function that will output each line between X(start) and X(end) to a separate array? I've tried doing that but with no success.
I've attached my current (crude) code:
df = pd.read_csv('FileName.csv', delimiter = ',', skiprows=0)
rows = [-1] + np.where(df['X']=='New')[0].tolist() + [len(df.index)]
dff = {}
for i, r in enumerate(rows[:-1]):
dff[i] = df[r+1: rows[i+1]]
maxY = len(dff)
data = []
data2 = []
for yaxes in range(0, maxY):
data2.append(dff[yaxes].ix[:,2])
<data2 is then used for plotting using matplotlib>
To answer my Question 1, I was thinking about using the 'reindex' and 'reset_index' functions, however haven't managed to make them work.
I would appreciate any suggestions.
Is this meet what you want?
Q1: fill X using reindex, and others using fillna
Q2: Passing separated StringIO to read_csv is easier (change if you use Python 3)
# read file and split the input
f = open('temp.csv', 'r')
chunks = f.read().split('New')
# read csv as separated dataframes, using first column as index
dfs = [pd.read_csv(StringIO(unicode(chunk)), header=None, index_col=0) for chunk in chunks]
def pad(df):
# reindex, you should know the range of x
df = df.reindex(np.arange(44, 48))
# pad y from forward / backward, assuming y should have the single value
df[1] = df[1].fillna(method='bfill')
df[1] = df[1].fillna(method='ffill')
# padding others
df = df.fillna(0)
# revert index to values
return df.reset_index(drop=False)
dfs = [pad(df) for df in dfs]
dfs[0]
# 0 1 2 3
# 0 44 11 500 1
# 1 45 11 120 2
# 2 46 11 320 3
# 3 47 11 700 4
# dfs[1]
# 0 1 2 3
# 0 44 12 0 0
# 1 45 12 100 5
# 2 46 12 1500 6
# 3 47 12 2500 7
First Question
I've included print statements inside function to explain how this function works
In [89]:
def replace_missing(df , Ids ):
# check what are the mssing values
missing = np.setdiff1d(Ids , df[0])
if len(missing) > 0 :
missing_df = pd.DataFrame(data = np.zeros( (len(missing) , 4 )))
#print('---missing df---')
#print(missing_df)
missing_df[0] = missing
#print('---missing df---')
#print(missing_df)
missing_df[1].replace(0 , df[1].iloc[0] , inplace = True)
#print('---missing df---')
#print(missing_df)
df = pd.concat([df , missing_df])
#print('---final df---')
#print(df)
return df
In [91]:
Ids = np.arange(44,48)
final_df = df1.groupby(df1[1] , as_index = False).apply(replace_missing , Ids).reset_index(drop = True)
final_df
Out[91]:
0 1 2 3
44 11 500 1
45 11 120 2
46 11 320 3
47 11 700 4
45 12 100 5
46 12 1500 6
47 12 2500 7
44 12 0 0
Second question
In [92]:
group = final_df.groupby(final_df[1])
In [99]:
separate = [group.get_group(key) for key in group.groups.keys()]
separate[0]
Out[104]:
0 1 2 3
44 11 500 1
45 11 120 2
46 11 320 3
47 11 700 4
I've never used the >> and << operators, not because I've never needed them, but because I don't know if I could have used them, or where I should have.
100 >> 3 outputs 12 instead of 12.5. Why is this. Perhaps learning where to best use right shift will answer that implicitly, but I'm curious.
Right shift is not division
Let's look at what right-shift actually does, and it will become clear.
First, recall that a number is stored in memory as a collection of binary digits. If we have 8 bits of memory, we can store 2 as 00000010 and 5 as 00000101.
Right-shift takes those digits and shifts them to the right. For example, right-shifting our above two digits by one will give 00000001 and 00000010 respectively.
Notice that the lowest digit (right-most) is shifted off the end entirely and has no effect on the final result.
>> and << are the right and left bit shift operators, respectively. You should look at the binary representation of the numbers.
>>> bin(100)
'0b1100100'
>>> bin(12)
'0b1100'
The other answers explain the idea of bitshifting, but here's specifically what happens for 100>>3
100
128 64 32 16 8 4 2 1
0 1 1 0 0 1 0 0 = 100
100 >> 1
128 64 32 16 8 4 2 1
0 0 1 1 0 0 1 0 = 50
100 >> 2
128 64 32 16 8 4 2 1
0 0 0 1 1 0 0 1 = 25
100 >> 3
128 64 32 16 8 4 2 1
0 0 0 0 1 1 0 0 = 12
You won't often need to use it, unless you need some really quick division by 2, but even then, DON'T USE IT. it makes the code much more complicated then it needs to be, and the speed difference is unnoticeable.
The main time you'd ever need to use it would be if you're working with binary data, and you specifically need to shift the bits around. The only real use I've had for it was reading & writing ID3 tags, which stores size information in 7-bit bytes, like so:
0xxxxxxx 0xxxxxxx 0xxxxxxx 0xxxxxxx.
which would need to be put together like this:
0000xxxx xxxxxxxx xxxxxxxx xxxxxxxx
to give a normal integer in memory.
Bit shifting an integer gives another integer. For instance, the number 12 is written in binary as 0b1100. If we bit shift by 1 to the right, we get 0b110 = 6. If we bit shift by 2, we get 0b11 = 3. And lastly, if we bitshift by 3, we get 0b1 = 1 rather than 1.5. This is because the bits that are shifted beyond the register are lost.
One easy way to think of it is bitshifting to the right by N is the same as dividing by 2^N and then truncating the result.
I have read the answers above and just wanted to add a little bit more practical example, that I had seen before.
Let us assume, that you want to create a list of powers of two. So, you can do this using left shift:
n = 10
list_ = [1<<i for i in range(1, n+1)] # Where n is a maximum power.
print(list_)
# Output: [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
You can timeit it if you want, but I am pretty sure, that the code above is one the fastest solutions for this problem. But what I cannot understand is when you can use right shift.