How to label ascending consecutive numbers onto values in a column? - python

I have a column that looks something like this:
1
0
0
1
0
0
0
1
I want the output to look something like this:
1 <--
0
0
2 <--
0
0
0
3 <--
And so forth. I'm not sure where to begin. There are about 10,000 rows, and I feel like writing an if statement for each row might take a while. How do I achieve this output?

Efficient and concise:
s.cumsum()*s
0 1
1 0
2 0
3 2
4 0
5 0
6 0
7 3
dtype: int64
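To spell out why the one-liner works, here is a minimal, self-contained sketch. It assumes the column holds only 0s and 1s, as in the question; the variable names `running` and `labelled` are just for illustration.

```python
import pandas as pd

s = pd.Series([1, 0, 0, 1, 0, 0, 0, 1])

# Running count of the 1s seen so far: 1, 1, 1, 2, 2, 2, 2, 3
running = s.cumsum()

# Multiplying by the original 0/1 flags zeroes out every row that was 0
labelled = running * s

print(labelled.tolist())  # [1, 0, 0, 2, 0, 0, 0, 3]
```

Both steps are vectorized, so 10,000 rows should not be a problem.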

Use Series.cumsum + Series.where
Here is an example:
print(df)
0
0 1
1 0
2 0
3 1
4 0
5 0
6 0
7 1
df['0']=df['0'].cumsum().where(df['0'].ne(0),df['0'])
print(df)
0
0 1
1 0
2 0
3 2
4 0
5 0
6 0
7 3

Try this:
s = pd.Series([1,0,0,1,0,0,0,1])
s.cumsum().mask(s==0, 0)
Output:
0 1
1 0
2 0
3 2
4 0
5 0
6 0
7 3
dtype: int64

np.where and cumsum:
df['cum_sum'] = np.where(df.val>0, df.val.cumsum(), 0)
output:
val cum_sum
0 1 1
1 0 0
2 0 0
3 1 2
4 0 0
5 0 0
6 0 0
7 1 3

You could do something like this:
df = {'col1': [1, 0, 0, 0, 1, 0, 0, 1]}
count = 0
col = []
for val in df['col1']:
    if val == 1:
        # each 1 gets the next consecutive label
        count += 1
        col.append(count)
    else:
        # zeros are copied through unchanged
        col.append(val)
and col ends up as [1, 0, 0, 0, 2, 0, 0, 3]

Only select the rows that are non-zero and replace those values with cumsum
import pandas as pd
df=pd.DataFrame({'col': [0,1,0,0,1,0,0,0,1,0] })
index=df["col"]!=0
df.loc[index,"col"]=df.loc[index,"col"].cumsum()
print(df)
col
0 0
1 1
2 0
3 0
4 2
5 0
6 0
7 0
8 3
9 0
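One caveat on the cumsum-based answers: they rely on the markers being exactly 1. If the marker rows could hold other nonzero values (an assumption, the question only shows 0s and 1s), a cumulative sum of the values would add them up rather than count them. A small sketch that counts markers instead, with illustrative names `is_marker` and `labels`:

```python
import pandas as pd

# Hypothetical input where the markers are arbitrary nonzero values
s = pd.Series([5, 0, 0, 2, 0, 0, 0, 9])

# True wherever the row is a marker
is_marker = s.ne(0)

# Count the markers cumulatively, then put 0 back on the non-marker rows
labels = is_marker.cumsum().where(is_marker, 0)

print(labels.tolist())  # [1, 0, 0, 2, 0, 0, 0, 3]
```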

Related

Convert pandas df cells to say column name

I have a df like this:
0 1 2 3 4 5
abc 0 1 0 0 1
bcd 0 0 1 0 0
def 0 0 0 1 0
How can I convert the dataframe cells to be the column name if there's a 1 in the cell?
Looks like this:
0 1 2 3 4 5
abc 0 2 0 0 5
bcd 0 0 3 0 0
def 0 0 0 4 0
Let us try
df.loc[:,'1':] = df.loc[:,'1':] * df.columns[1:].astype(int)
df
Out[468]:
0 1 2 3 4 5
0 abc 0 2 0 0 5
1 bcd 0 0 3 0 0
2 def 0 0 0 4 0
We can use np.where over the whole dataframe:
values = np.where(df.eq(1), df.columns, df)
df = pd.DataFrame(values, columns=df.columns)
0 1 2 3 4 5
0 abc 0 2 0 0 5
1 bcd 0 0 3 0 0
2 def 0 0 0 4 0
I'd suggest simply applying the logic column by column: wherever the value in a column is 1, set it to the column name.
for col in df.columns:
    df.loc[df[col] == 1, col] = col
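For anyone who wants to try this end to end, here is a self-contained sketch of the multiplication idea. It assumes the column labels are the strings '0' to '5' and that column '0' holds the row names, which is how I read the question; `cols`, `weights` and `out` are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "0": ["abc", "bcd", "def"],
    "1": [0, 0, 0],
    "2": [1, 0, 0],
    "3": [0, 1, 0],
    "4": [0, 0, 1],
    "5": [1, 0, 0],
})

# Every column except the name column, plus its label as an integer
cols = list(df.columns[1:])
weights = np.array([int(c) for c in cols])

# A 1 in column k becomes k, a 0 stays 0 (broadcast across the columns)
out = df.copy()
out[cols] = out[cols] * weights

print(out)
```

Multiplying keeps the result numeric, whereas assigning the column label directly would store whatever type the labels have.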

Collapsing multiple indices into groups based on overlapping targets

I'm currently looking at the correlation between features in my dataset and need to group features that have similar targets into larger supergroups that can be used for a more general correlation analysis.
The features are one hot encoded and are in a pandas data-frame that looks similar to this:
1 2 3 4 5 6 7 8 9
A 0 0 1 0 0 1 0 1 0
B 0 0 0 1 0 0 0 0 0
C 1 0 0 0 1 0 0 0 0
D 1 0 0 1 0 0 0 0 0
E 0 1 0 1 0 0 0 0 1
I would like the resulting dataframe to look like this:
1 2 3 4 5 6 7 8 9
group1(A) 0 0 1 0 0 1 0 1 0
group2(B,D,E,C) 1 1 0 1 1 0 0 0 1
I've already tried all forms of groupby and some of the methods in networkx.
This is a hidden network problem, so we can use networkx after a self-merge:
s=df.reset_index().melt('index')
s=s.loc[s.value==1]
s=s.merge(s,on = 'variable')
import networkx as nx
G=nx.from_pandas_edgelist(s, 'index_x', 'index_y')
l=list(nx.connected_components(G))
from collections import ChainMap
L=dict(ChainMap(*[dict.fromkeys(y,x) for x, y in enumerate(l)]))
df.groupby(L).sum().ge(1).astype(int)
Out[133]:
1 2 3 4 5 6 7 8 9
0 1 1 0 1 1 0 0 0 1
1 0 0 1 0 0 1 0 1 0
L
Out[134]: {'A': 1, 'B': 0, 'C': 0, 'D': 0, 'E': 0}
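If installing networkx is not an option, the same grouping can be sketched with a small union-find over the row labels. This is a swapped-in technique, not the method from the answer above, and the helper `find` and the names `parent`, `groups` and `merged` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 0, 1, 0, 0, 1, 0, 1, 0],
     [0, 0, 0, 1, 0, 0, 0, 0, 0],
     [1, 0, 0, 0, 1, 0, 0, 0, 0],
     [1, 0, 0, 1, 0, 0, 0, 0, 0],
     [0, 1, 0, 1, 0, 0, 0, 0, 1]],
    index=list("ABCDE"),
    columns=range(1, 10),
)

# Each row label starts out as its own group
parent = {name: name for name in df.index}

def find(x):
    """Return the representative of x's group, compressing the path."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

# Rows that share a 1 in the same column belong to the same group
for col in df.columns:
    members = df.index[(df[col] == 1).to_numpy()]
    for other in members[1:]:
        parent[find(other)] = find(members[0])

# Map every row to its group representative and OR the rows together
groups = pd.Series({name: find(name) for name in df.index})
merged = df.groupby(groups).max()

print(groups.to_dict())
print(merged)
```

Taking `max` over 0/1 columns acts as a logical OR, which matches the `sum().ge(1)` step in the networkx version.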

Is there a way to break a pandas column with categories into separate true or false columns with the category name as the column name

I have a dataframe with the following column:
df = pd.DataFrame({"A": [1,2,1,2,2,2,0,1,0]})
and I want:
df2 = pd.DataFrame({"0": [0,0,0,0,0,0,1,0,1],"1": [1,0,1,0,0,0,0,1,0],"2": [0,1,0,1,1,1,0,0,0]})
Is there an elegant way of doing this with a one-liner?
NOTE
I can do this using df['0'] = df['A'].apply(find_zeros)
I don't mind if 'A' is included in the final result.
Use get_dummies:
df2 = pd.get_dummies(df.A)
print (df2)
0 1 2
0 0 1 0
1 0 0 1
2 0 1 0
3 0 0 1
4 0 0 1
5 0 0 1
6 1 0 0
7 0 1 0
8 1 0 0
In [50]: df.A.astype(str).str.get_dummies()
Out[50]:
0 1 2
0 0 1 0
1 0 0 1
2 0 1 0
3 0 0 1
4 0 0 1
5 0 0 1
6 1 0 0
7 0 1 0
8 1 0 0
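A note for newer pandas: as far as I know, since pandas 2.0 `pd.get_dummies` returns boolean columns by default, so the 0/1 output shown above needs an explicit `dtype`. A minimal sketch, which also converts the column labels to strings to match the `df2` in the question:

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 1, 2, 2, 2, 0, 1, 0]})

# dtype=int asks for 0/1 integers instead of True/False
df2 = pd.get_dummies(df["A"], dtype=int)

# The question uses string column names ("0", "1", "2")
df2.columns = df2.columns.astype(str)

print(df2)
```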

Encode integer pandas dataframe column to padded 16 bit binary

I would like to encode integers stored in a pandas dataframe column into respective 16-bit binary numbers which correspond to bit positions in those integers. I would also need to pad leading zeros for numbers with corresponding binary less than 16 bits. For example, given one column containing integers ranging from 0 to 33000, for an integer value of 20 (10100 in binary) I would like to produce 16 columns with values 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0 and so on across the entire column.
Setup
Consider the data frame df with column 'A'
df = pd.DataFrame(dict(A=range(16)))
Numpy broadcasting and bit shifting
a = df.A.values
n = int(np.log2(a.max() + 1))
b = (a[:, None] >> np.arange(n)[::-1]) & 1
pd.DataFrame(b)
0 1 2 3
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
5 0 1 0 1
6 0 1 1 0
7 0 1 1 1
8 1 0 0 0
9 1 0 0 1
10 1 0 1 0
11 1 0 1 1
12 1 1 0 0
13 1 1 0 1
14 1 1 1 0
15 1 1 1 1
String formatting with f-strings
n = int(np.log2(df.A.max() + 1))
pd.DataFrame([list(map(int, f'{i:0{n}b}')) for i in df.A])
0 1 2 3
0 0 0 0 0
1 0 0 0 1
2 0 0 1 0
3 0 0 1 1
4 0 1 0 0
5 0 1 0 1
6 0 1 1 0
7 0 1 1 1
8 1 0 0 0
9 1 0 0 1
10 1 0 1 0
11 1 0 1 1
12 1 1 0 0
13 1 1 0 1
14 1 1 1 0
15 1 1 1 1
Could you do something like this?
x = 20
bin_string = format(x, '016b')
df = pd.DataFrame(list(bin_string)).T
I don't know enough about what you're trying to do to know if that's sufficient.
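Since the question asks for a fixed width of 16 bits, here is a sketch of the bit-shifting approach with the width hard-coded rather than derived from the maximum value. The sample values and the column names `bit15` to `bit0` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 20, 255, 33000]})

width = 16
# Shift amounts from the most significant bit (15) down to the least (0)
shifts = np.arange(width - 1, -1, -1)

# Broadcast each value against every shift, then keep only the lowest bit
bits = (df["A"].to_numpy()[:, None] >> shifts) & 1

out = pd.DataFrame(bits, index=df.index, columns=[f"bit{k}" for k in shifts])

print(out.loc[1].tolist())
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0]
```

Fixing the width also avoids relying on `int(np.log2(max + 1))`, which only gives the right bit count when the maximum is one less than a power of two.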

Convert pandas dataframe to series

Is there a way to convert pandas dataframe to series with multiindex? The dataframe's columns could be multi-indexed too.
The approach below works, but only when the column MultiIndex levels are named.
In [163]: d
Out[163]:
a 0 1
b 0 1 0 1
a 0 0 0 0
b 1 2 3 4
c 2 4 6 8
In [164]: d.stack(d.columns.names)
Out[164]:
a b
a 0 0 0
1 0
1 0 0
1 0
b 0 0 1
1 2
1 0 3
1 4
c 0 0 2
1 4
1 0 6
1 8
dtype: int64
I think you can use nlevels to find the number of levels in the MultiIndex, then build a range from it and pass that to stack:
print (d.columns.nlevels)
2
#for python 3 add `list`
print (list(range(d.columns.nlevels)))
[0, 1]
print (d.stack(list(range(d.columns.nlevels))))
a b
a 0 0 0
1 0
1 0 0
1 0
b 0 0 1
1 2
1 0 3
1 4
c 0 0 2
1 4
1 0 6
1 8
dtype: int64
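Here is a self-contained version that rebuilds the frame from the question and stacks every column level by position, so it does not depend on the levels having names. The frame construction is my reconstruction of `d` from the printed output, and `levels` and `s` are illustrative names.

```python
import pandas as pd

# Two column levels named "a" and "b", as in the question's printout
cols = pd.MultiIndex.from_product([[0, 1], [0, 1]], names=["a", "b"])
d = pd.DataFrame(
    [[0, 0, 0, 0],
     [1, 2, 3, 4],
     [2, 4, 6, 8]],
    index=list("abc"),
    columns=cols,
)

# Positions of all column levels: [0, 1] here
levels = list(range(d.columns.nlevels))

# Stacking every column level moves all of them into the row index
s = d.stack(levels)

print(type(s).__name__)
print(s.loc[("b", 1, 0)])
```

On some pandas 2.x releases this call may emit a FutureWarning about the stack implementation; the result should still be a Series indexed by (row label, a, b).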
