From Pandas Dataframe, how to get the index of all ones at the row level?
My data frame has around a hundred columns. here is an example:
a b c d
0 1 0 1 0
1 0 0 0 1
2 1 1 0 1
3 1 1 0 0
4 1 1 1 1
The expected result is
0 a,c
1 d
2 a,b,d
3 a,b
4 a,b,c,d
I found this question on stackoverflow
index of non "NaN" values in Pandas
but it works at the column level
Thanks in advance.
If there are only 1 and 0 values use DataFrame.dot for matrix multiplication with columns names and separator, last remove separator with Series.str.rstrip:
df['e'] = df.dot(df.columns + ', ').str.rstrip(', ')
#if exist another values like 0,1 and compare 1
#df['e'] = df.eq(1).dot(df.columns + ', ').str.rstrip(', ')
print (df)
a b c d e
0 1 0 1 0 a, c
1 0 0 0 1 d
2 1 1 0 1 a, b, d
3 1 1 0 0 a, b
4 1 1 1 1 a, b, c, d
Also for Series use:
s = df.dot(df.columns + ', ').str.rstrip(', ')
print (s)
0 a, c
1 d
2 a, b, d
3 a, b
4 a, b, c, d
dtype: object
Try:
df=df.stack()
df=df.loc[df.eq(1)].reset_index(level=1).groupby(level=0).agg(', '.join)
Outputs:
level_1
0 a, c
1 d
2 a, b, d
3 a, b
4 a, b, c, d
Related
I have the following example data set
A
B
C
D
foo
0
1
1
bar
0
0
1
baz
1
1
0
How could extract the column names of each 1 occurrence in a row and put that into another column E so that I get the following table:
A
B
C
D
E
foo
0
1
1
C, D
bar
0
0
1
D
baz
1
1
0
B, C
Note that there can be more than two 1s per row.
You can use DataFrame.dot.
df['E'] = df[['B', 'C', 'D']].dot(df.columns[1:] + ', ').str.rstrip(', ')
df
A B C D E
0 foo 0 1 1 C, D
1 bar 0 0 1 D
2 baz 1 1 0 B, C
Inspired by jezrael's answer in this post.
Another way is that you can convert each row to boolean and use it as a selection mask to filter the column names.
cols = pd.Index(['B', 'C', 'D'])
df['E'] = df[cols].astype('bool').apply(lambda row: ", ".join(cols[row]), axis=1)
df
A B C D E
0 foo 0 1 1 C, D
1 bar 0 0 1 D
2 baz 1 1 0 B, C
I am trying to do a multiple column select then replace in pandas
df:
a b c d e
0 1 1 0 none
0 0 0 1 none
1 0 0 0 none
0 0 0 0 none
select where any or all of a, b, c, d are non zero
i, j = np.where(df)
s=pd.Series(dict(zip(zip(i, j),
df.columns[j]))).reset_index(-1, drop=True)
s:
0 b
0 c
1 d
2 a
Now I want to replace the values in column e by the series:
df['e'] = s.values
so that e looks like:
e:
b, c
d
a
none
But the problem is that the lengths of the series are different to the number of rows in the dataframe.
Any idea on how I can do this?
Use DataFrame.dot for product with columns names, add rstrip, last add numpy.where for replace empty strings to None:
e = df.dot(df.columns + ', ').str.rstrip(', ')
df['e'] = np.where(e.astype(bool), e, None)
print (df)
a b c d e
0 0 1 1 0 b, c
1 0 0 0 1 d
2 1 0 0 0 a
3 0 0 0 0 None
You can locate the 1's and use their locations as boolean indexes into the dataframe columns:
df['e'] = (df==1).apply(lambda x: df.columns[x], axis=1)\
.str.join(",").replace('','none')
# a b c d e
#0 0 1 1 0 b,c
#1 0 0 0 1 d
#2 1 0 0 0 a
#3 0 0 0 0 none
I have following data frame:
1 A a
1 A b
2 B c
1 A d
How do I append all the values of a row with same values to data frame:
1 A a,c,d
2 B c
You can use groupby and apply function join :
df.columns = ['a','b','c']
print (df)
a b c
0 1 A a
1 1 A b
2 2 B c
3 1 A d
print (df.groupby(['a', 'b'])['c'].apply(', '.join).reset_index())
a b c
0 1 A a, b, d
1 2 B c
Or if first column is index:
df.columns = ['a','b']
print (df)
a b
1 A a
1 A b
2 B c
1 A d
df1 = df.b.groupby([df.index, df.a]).apply(', '.join).reset_index(name='c')
df1.columns = ['a','b','c']
print (df1)
a b c
0 1 A a, b, d
1 2 B c
Suppose I have a dataframe df such as
A B C
0 a 1
1 b 1
2 c 2
I would like to return B, and C when C==1, like so
B C
a 1
b 1
I've got as far as df.B[df.C==1], which returns
B
a
b
which is right (countwise) but wrong in the slice. How do I get C?
You can use loc or query:
print df.loc[df.C==1, ['B','C']]
B C
0 a 1
1 b 1
print df[['B','C']].query('C == 1')
B C
0 a 1
1 b 1
Or if you need only column C:
print df.loc[df.C==1, 'C']
0 1
1 1
Name: C, dtype: int64
Hi all I have a csv file which contains data as the format below
A a
A b
B f
B g
B e
B h
C d
C e
C f
The first column contains items second column contains available feature from feature vector=[a,b,c,d,e,f,g,h]
I want to convert this to occurence matrix look like below
a,b,c,d,e,f,g,h
A 1,1,0,0,0,0,0,0
B 0,0,0,0,1,1,1,1
C 0,0,0,1,1,1,0,0
Can anyone tell me how to do this using pandas?
Here is another way to do it using pd.get_dummies().
import pandas as pd
# your data
# =======================
df
col1 col2
0 A a
1 A b
2 B f
3 B g
4 B e
5 B h
6 C d
7 C e
8 C f
# processing
# ===================================
pd.get_dummies(df.col2).groupby(df.col1).apply(max)
a b d e f g h
col1
A 1 1 0 0 0 0 0
B 0 0 0 1 1 1 1
C 0 0 1 1 1 0 0
Unclear if your data has a typo or not but you can crosstab for this:
In [95]:
pd.crosstab(index=df['A'], columns = df['a'])
Out[95]:
a b d e f g h
A
A 1 0 0 0 0 0
B 0 0 1 1 1 1
C 0 1 1 1 0 0
In your sample data your second column has value a as the name of that column but in your expected output it's in the column as a value
EDIT
OK I fixed your input data so it generates the correct result:
In [98]:
import pandas as pd
import io
t="""A a
A b
B f
B g
B e
B h
C d
C e
C f"""
df = pd.read_csv(io.StringIO(t), sep='\s+', header=None, names=['A','a'])
df
Out[98]:
A a
0 A a
1 A b
2 B f
3 B g
4 B e
5 B h
6 C d
7 C e
8 C f
In [99]:
ct = pd.crosstab(index=df['A'], columns = df['a'])
ct
Out[99]:
a a b d e f g h
A
A 1 1 0 0 0 0 0
B 0 0 0 1 1 1 1
C 0 0 1 1 1 0 0
This approach yields the same result in a scipy sparse coo matrix much faster
from scipy import sparse
df['col1'] = df['col1'].astype("category")
df['col2'] = df['col2'].astype("category")
df['ones'] = 1
user_items = sparse.coo_matrix((df.ones.astype(float),
(df.col1.cat.codes,
df.col2.cat.codes)))