I would like to encode integers stored in a pandas dataframe column as 16-bit binary numbers, with one output column per bit position. Numbers whose binary representation is shorter than 16 bits should be padded with leading zeros. For example, given one column containing integers ranging from 0 to 33000, the integer value 20 (10100 in binary) should produce 16 columns with values 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, and so on across the entire column.
Setup
Consider the data frame df with column 'A'
import numpy as np
import pandas as pd

df = pd.DataFrame(dict(A=range(16)))
NumPy broadcasting and bit shifting
a = df.A.values
n = int(np.log2(a.max() + 1))                 # number of bits needed (exact here because max + 1 is a power of two)
b = (a[:, None] >> np.arange(n)[::-1]) & 1    # shift each value right by each bit position, MSB first, and mask
pd.DataFrame(b)
    0  1  2  3
0   0  0  0  0
1   0  0  0  1
2   0  0  1  0
3   0  0  1  1
4   0  1  0  0
5   0  1  0  1
6   0  1  1  0
7   0  1  1  1
8   1  0  0  0
9   1  0  0  1
10  1  0  1  0
11  1  0  1  1
12  1  1  0  0
13  1  1  0  1
14  1  1  1  0
15  1  1  1  1
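If a fixed 16-bit width is wanted regardless of the column's maximum (as in the original question, where values range up to 33000), a minimal sketch along the same lines is to hard-code the width instead of deriving it from the data:

n = 16                                                 # fixed width instead of data-dependent
b = (df.A.values[:, None] >> np.arange(n)[::-1]) & 1
pd.DataFrame(b)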
String formatting with f-strings
n = int(np.log2(df.A.max() + 1))
pd.DataFrame([list(map(int, f'{i:0{n}b}')) for i in df.A])
    0  1  2  3
0   0  0  0  0
1   0  0  0  1
2   0  0  1  0
3   0  0  1  1
4   0  1  0  0
5   0  1  0  1
6   0  1  1  0
7   0  1  1  1
8   1  0  0  0
9   1  0  0  1
10  1  0  1  0
11  1  0  1  1
12  1  1  0  0
13  1  1  0  1
14  1  1  1  0
15  1  1  1  1
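Likewise, hard-coding the width in the format spec gives a fixed 16-bit output with this approach (a small sketch, not part of the original answer):

pd.DataFrame([list(map(int, f'{i:016b}')) for i in df.A])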
Could you do something like this?
x = 20
bin_string = format(x, '016b')           # '0000000000010100'
df = pd.DataFrame(list(bin_string)).T    # a single row of 16 one-character strings
I don't know enough about what you're trying to do to know if that's sufficient.
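If that is the right direction, a sketch of the same idea applied to a whole column (assuming the integers live in df['A']) would be:

binary_cols = df['A'].apply(lambda x: pd.Series(list(format(x, '016b')))).astype(int)   # 16 integer columns, MSB first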
Related
I'm trying to create two new columns that alternate starts and endings in a dataframe:
for one start there is at most one ending
the last start may have no corresponding ending
there are no endings before the first start
a succession of two or more starts, or of two or more endings, is not possible
How could I do that without using any loop, i.e. using only numpy or pandas functions?
The code to create the dataframe:
df = pd.DataFrame({'start': [0,0,1,0,1,0,1,0,0,0,0,1,0,1,0,0,0,1,0],
                   'end':   [1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0]})
The dataframe as rendered, and the result I want:
    start  end  start wanted  end wanted
0       0    1             0           0
1       0    0             0           0
2       1    0             1           0
3       0    0             0           0
4       1    0             0           0
5       0    0             0           0
6       1    0             0           0
7       0    1             0           1
8       0    0             0           0
9       0    1             0           0
10      0    0             0           0
11      1    0             1           0
12      0    0             0           0
13      1    0             0           0
14      0    0             0           0
15      0    1             0           1
16      0    0             0           0
17      1    0             1           0
18      0    0             0           0
I don't know how to do this with pure pandas/numpy but here's a simple for loop that gives your expected output. I tested it with a pandas dataframe 50,000 times the size of your example data (so around 1 million rows in total) and it runs in roughly 1 second:
import pandas as pd

df = pd.DataFrame({'start': [0,0,1,0,1,0,1,0,0,0,0,1,0,1,0,0,0,1,0],
                   'end':   [1,0,0,0,0,0,0,1,0,1,0,0,0,0,0,1,0,0,0]})

start = False            # True while we are inside an open start/end segment
start_wanted = []
end_wanted = []
for s, e in zip(df['start'], df['end']):
    if start:
        # already inside a segment: ignore further starts, close on the next end
        if e == 1:
            start = False
        start_wanted.append(0)
        end_wanted.append(e)
    else:
        # not inside a segment: ignore ends, open on the next start
        if s == 1:
            start = True
        start_wanted.append(s)
        end_wanted.append(0)

df['start_wanted'] = start_wanted
df['end_wanted'] = end_wanted
print(df)
Output:
    end  start  start_wanted  end_wanted
0     1      0             0           0
1     0      0             0           0
2     0      1             1           0
3     0      0             0           0
4     0      1             0           0
5     0      0             0           0
6     0      1             0           0
7     1      0             0           1
8     0      0             0           0
9     1      0             0           0
10    0      0             0           0
11    0      1             1           0
12    0      0             0           0
13    0      1             0           0
14    0      0             0           0
15    1      0             0           1
16    0      0             0           0
17    0      1             1           0
18    0      0             0           0
I have a column that looks something like this:
1
0
0
1
0
0
0
1
I want the output to look something like this:
1 <--
0
0
2 <--
0
0
0
3 <--
And so forth. There are about 10,000 rows, and I feel like writing if statements row by row might take a while. How do I achieve this output?
Efficient and concise:
s.cumsum()*s
0 1
1 0
2 0
3 2
4 0
5 0
6 0
7 3
dtype: int64
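Assuming the values sit in a dataframe column (df['col'] is just an illustrative name here), the same idea writes the result straight back:

df['col'] = df['col'].cumsum() * df['col']   # cumsum numbers the 1s; multiplying by the 0/1 column zeroes everything else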
Use Series.cumsum + Series.where
Here is an example:
print(df)
0
0 1
1 0
2 0
3 1
4 0
5 0
6 0
7 1
df['0'] = df['0'].cumsum().where(df['0'].ne(0), df['0'])
print(df)
0
0 1
1 0
2 0
3 2
4 0
5 0
6 0
7 3
Try this:
s = pd.Series([1,0,0,1,0,0,0,1])
s.cumsum().mask(s==0, 0)
Output:
0 1
1 0
2 0
3 2
4 0
5 0
6 0
7 3
dtype: int64
np.where and cumsum:
df['cum_sum'] = np.where(df.val>0, df.val.cumsum(), 0)
output:
val cum_sum
0 1 1
1 0 0
2 0 0
3 1 2
4 0 0
5 0 0
6 0 0
7 1 3
You could do something like this:

df = {'col1': [1, 0, 0, 0, 1, 0, 0, 1]}
count = 0
col = []
for val in df['col1']:
    if val == 1:
        count += 1
        col.append(count)
    else:
        col.append(val)

and you get [1, 0, 0, 0, 2, 0, 0, 3]
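To attach that result to an actual dataframe rather than a plain dict, a small sketch (the loop itself is unchanged, and the 'numbered' column name is purely illustrative):

import pandas as pd

df = pd.DataFrame({'col1': [1, 0, 0, 0, 1, 0, 0, 1]})
# ... run the loop above over df['col1'] to build col ...
df['numbered'] = col   # [1, 0, 0, 0, 2, 0, 0, 3]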
Select only the rows that are non-zero and replace those values with the cumsum:

import pandas as pd

df = pd.DataFrame({'col': [0,1,0,0,1,0,0,0,1,0]})
index = df["col"] != 0                                 # mask of non-zero rows
df.loc[index, "col"] = df.loc[index, "col"].cumsum()
print(df)
col
0 0
1 1
2 0
3 0
4 2
5 0
6 0
7 0
8 3
9 0
I'm currently looking at the correlation between features in my dataset and need to group features that have similar targets into larger supergroups that can be used for a more general correlation analysis.
The features are one-hot encoded and are in a pandas dataframe that looks similar to this:
1 2 3 4 5 6 7 8 9
A 0 0 1 0 0 1 0 1 0
B 0 0 0 1 0 0 0 0 0
C 1 0 0 0 1 0 0 0 0
D 1 0 0 1 0 0 0 0 0
E 0 1 0 1 0 0 0 0 1
I would like the resulting dataframe to look like this:
                 1  2  3  4  5  6  7  8  9
group1(A)        0  0  1  0  0  1  0  1  0
group2(B,D,E,C)  1  1  0  1  1  0  0  0  1
I've already tried all forms of groupby and some of the methods in networkx.
This is a hidden network problem, so we can use networkx after a self-merge:
import networkx as nx
from collections import ChainMap

s = df.reset_index().melt('index')       # long format: one row per (row label, column) pair
s = s.loc[s.value == 1]                  # keep only the 1s
s = s.merge(s, on='variable')            # row labels sharing a 1 in the same column become edge pairs
G = nx.from_pandas_edgelist(s, 'index_x', 'index_y')
l = list(nx.connected_components(G))     # each connected component is one supergroup
L = dict(ChainMap(*[dict.fromkeys(y, x) for x, y in enumerate(l)]))   # row label -> component number
df.groupby(L).sum().ge(1).astype(int)
Out[133]:
1 2 3 4 5 6 7 8 9
0 1 1 0 1 1 0 0 0 1
1 0 0 1 0 0 1 0 1 0
L
Out[134]: {'A': 1, 'B': 0, 'C': 0, 'D': 0, 'E': 0}
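If labels in the question's group1(A)/group2(B,C,D,E) style are preferred over 0/1, one option is to build label strings from the same connected components l (a sketch; the exact numbering depends on the order in which networkx returns the components):

labels = {node: f'group{i + 1}({",".join(sorted(comp))})'
          for i, comp in enumerate(l)
          for node in comp}
df.groupby(labels).sum().ge(1).astype(int)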
Seems like an easy question but I'm running into an odd error. I have a large dataframe with 24+ columns that all contain 1s or 0s. I wish to concatenate each field to create a binary key that'll act as a signature.
However, when the number of columns exceeds 12, the whole process falls apart.
a = np.zeros(shape=(3,12))
df = pd.DataFrame(a)
df = df.astype(int) # This converts each 0.0 into just 0
df[2]=1 # Changes one column to all 1s
#result
0 1 2 3 4 5 6 7 8 9 10 11
0 0 0 1 0 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0 0 0
2 0 0 1 0 0 0 0 0 0 0 0 0
Concatenating function...
df['new'] = df.astype(str).sum(1).astype(int).astype(str) # Concatenate
df['new'].apply('{0:0>12}'.format) # Pad leading zeros
# result
0 1 2 3 4 5 6 7 8 9 10 11 new
0 0 0 1 0 0 0 0 0 0 0 0 0 001000000000
1 0 0 1 0 0 0 0 0 0 0 0 0 001000000000
2 0 0 1 0 0 0 0 0 0 0 0 0 001000000000
This is good. However, if I increase the number of columns to 13, I get...
a = np.zeros(shape=(3,13))
# ...same intermediate steps as above...
0 1 2 3 4 5 6 7 8 9 10 11 12 new
0 0 0 1 0 0 0 0 0 0 0 0 0 0 00-2147483648
1 0 0 1 0 0 0 0 0 0 0 0 0 0 00-2147483648
2 0 0 1 0 0 0 0 0 0 0 0 0 0 00-2147483648
Why am I getting -2147483648? I was expecting 0010000000000
Any help is appreciated!
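For what it's worth, -2147483648 is the smallest 32-bit signed integer, which suggests the intermediate .astype(int) overflows once the concatenated value reaches 13 digits (the default integer is 32-bit on some platforms, notably Windows). Since summing string columns already concatenates them character by character, a sketch that skips the round trip through integers, and with it the need for padding, is:

df['new'] = df.astype(str).sum(1)   # row-wise string concatenation; each column contributes exactly one character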
I am working with a dataframe consisting of a continuity column df['continuity'] and a group column df['group'].
Both are binary columns.
I want to add an extra column 'group_id' that gives consecutive runs of 1s in 'group' the same integer value: the first run gets 1, the next 2, and so on. Each time the continuity value of a row is 0, the counting should restart at 1.
Since this question is rather specific, I'm not sure how to tackle this vectorized. Below is an example, where the first two columns are the input and the last column is the output I'd like to have.
continuity group group_id
1 0 0
1 1 1
1 1 1
1 1 1
1 0 0
1 1 2
1 1 2
1 1 2
1 0 0
1 0 0
1 1 3
1 1 3
0 1 1
0 0 0
1 1 1
1 1 1
1 0 0
1 0 0
1 1 2
1 1 2
I believe you can use:
# get per-column run ids: each change in value starts a new run
b = df[['continuity','group']].ne(df[['continuity','group']].shift()).cumsum()
# mark the first row of each run where group is 1
c = ~b.duplicated() & (df['group'] == 1)
# cumulative count of those first rows within each continuity run, 0 where group is 0
df['new'] = np.where(df['group'] == 1,
                     c.groupby(b['continuity']).cumsum(),
                     0).astype(int)
print (df)
    continuity  group  group_id  new
0            1      0         0    0
1            1      1         1    1
2            1      1         1    1
3            1      1         1    1
4            1      0         0    0
5            1      1         2    2
6            1      1         2    2
7            1      1         2    2
8            1      0         0    0
9            1      0         0    0
10           1      1         3    3
11           1      1         3    3
12           0      1         1    1
13           0      0         0    0
14           1      1         1    1
15           1      1         1    1
16           1      0         0    0
17           1      0         0    0
18           1      1         2    2
19           1      1         2    2
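As a quick check that this reproduces the group_id column from the example:

print((df['new'] == df['group_id']).all())   # True if the vectorised result matches the expected ids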