I have a dataframe where one column is a list of groups each of my users belongs to. Something like:
index groups
0 ['a','b','c']
1 ['c']
2 ['b','c','e']
3 ['a','c']
4 ['b','e']
And what I would like to do is create a series of dummy columns to identify which groups each user belongs to, in order to run some analyses:
index a b c d e
0 1 1 1 0 0
1 0 0 1 0 0
2 0 1 1 0 1
3 1 0 1 0 0
4 0 1 0 0 1
pd.get_dummies(df['groups'])
won't work because that just returns a column for each different list in my column.
The solution needs to be efficient as the dataframe will contain 500,000+ rows.
Using s for your df['groups']:
In [21]: s = pd.Series({0: ['a', 'b', 'c'], 1:['c'], 2: ['b', 'c', 'e'], 3: ['a', 'c'], 4: ['b', 'e'] })
In [22]: s
Out[22]:
0 [a, b, c]
1 [c]
2 [b, c, e]
3 [a, c]
4 [b, e]
dtype: object
This is a possible solution:
In [23]: pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)
Out[23]:
a b c e
0 1 1 1 0
1 0 0 1 0
2 0 1 1 1
3 1 0 1 0
4 0 1 0 1
The logic of this is:
.apply(pd.Series) converts the series of lists to a dataframe
.stack() puts everything in one column again (creating a multi-level index)
pd.get_dummies() creates the dummies
.sum(level=0) re-merges the rows that belong together, by summing over the second index level and keeping only the original level (level=0)
A slight variation that is equivalent is pd.get_dummies(s.apply(pd.Series), prefix='', prefix_sep='').sum(level=0, axis=1)
Whether this will be efficient enough, I don't know, but in any case, if performance matters, storing lists in a dataframe is not a very good idea.
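Note: on recent pandas versions (2.0 and later, if I recall correctly) sum no longer accepts the level= keyword, so the same idea would be written with an explicit groupby on the index level:
pd.get_dummies(s.apply(pd.Series).stack()).groupby(level=0).sum()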
Very fast solution in case you have a large dataframe
Using sklearn.preprocessing.MultiLabelBinarizer
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
df = pd.DataFrame(
    {'groups':
        [['a', 'b', 'c'],
         ['c'],
         ['b', 'c', 'e'],
         ['a', 'c'],
         ['b', 'e']]
    }, columns=['groups'])
s = df['groups']
mlb = MultiLabelBinarizer()
pd.DataFrame(mlb.fit_transform(s),columns=mlb.classes_, index=df.index)
Result:
a b c e
0 1 1 1 0
1 0 0 1 0
2 0 1 1 1
3 1 0 1 0
4 0 1 0 1
This worked for me, and it has also been suggested in other answers.
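If memory is a concern with 500,000+ rows and many distinct groups, MultiLabelBinarizer can also return a sparse matrix, which pandas can wrap without densifying (a sketch; whether it helps depends on your data):
mlb = MultiLabelBinarizer(sparse_output=True)
dummies = pd.DataFrame.sparse.from_spmatrix(mlb.fit_transform(s), columns=mlb.classes_, index=df.index)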
This is even faster:
pd.get_dummies(df['groups'].explode()).sum(level=0)
Using .explode() instead of .apply(pd.Series).stack()
Comparing with the other solutions:
import timeit
import pandas as pd
setup = '''
import time
import pandas as pd
s = pd.Series({0:['a','b','c'],1:['c'],2:['b','c','e'],3:['a','c'],4:['b','e']})
df = s.rename('groups').to_frame()
'''
m1 = "pd.get_dummies(s.apply(pd.Series).stack()).sum(level=0)"
m2 = "df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')"
m3 = "pd.get_dummies(df['groups'].explode()).sum(level=0)"
times = {f"m{i+1}":min(timeit.Timer(m, setup=setup).repeat(7, 1000)) for i, m in enumerate([m1, m2, m3])}
pd.DataFrame([times],index=['ms'])
# m1 m2 m3
# ms 5.586517 3.821662 2.547167
Even though this question was already answered, I have a faster solution:
df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')
And, in case you have empty groups or NaN, you could just:
df.loc[df.groups.str.len() > 0, 'groups'].apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0, downcast='infer')
How it works
Inside the lambda, x is your list, for example ['a', 'b', 'c']. So pd.Series will be as follows:
In [2]: pd.Series([1, 1, 1], index=['a', 'b', 'c'])
Out[2]:
a 1
b 1
c 1
dtype: int64
When all the pd.Series come together, they form a pd.DataFrame whose columns are the union of their indexes; any missing index entry becomes NaN, as you can see next:
In [4]: a = pd.Series([1, 1, 1], index=['a', 'b', 'c'])
In [5]: b = pd.Series([1, 1, 1], index=['a', 'b', 'd'])
In [6]: pd.DataFrame([a, b])
Out[6]:
a b c d
0 1.0 1.0 1.0 NaN
1 1.0 1.0 NaN 1.0
Now fillna fills those NaN with 0:
In [7]: pd.DataFrame([a, b]).fillna(0)
Out[7]:
a b c d
0 1.0 1.0 1.0 0.0
1 1.0 1.0 0.0 1.0
And downcast='infer' is to downcast from float to int:
In [11]: pd.DataFrame([a, b]).fillna(0, downcast='infer')
Out[11]:
a b c d
0 1 1 1 0
1 1 1 0 1
P.S.: Using .fillna(0, downcast='infer') is not strictly required.
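Also note that the downcast keyword of fillna has been deprecated in recent pandas versions (if I'm not mistaken), so on newer versions an explicit cast gives the same integer result:
df.groups.apply(lambda x: pd.Series([1] * len(x), index=x)).fillna(0).astype(int)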
You can use explode and crosstab:
s = pd.Series([['a', 'b', 'c'], ['c'], ['b', 'c', 'e'], ['a', 'c'], ['b', 'e']])
s = s.explode()
pd.crosstab(s.index, s)
Output:
col_0 a b c e
row_0
0 1 1 1 0
1 0 0 1 0
2 0 1 1 1
3 1 0 1 0
4 0 1 0 1
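If you prefer the same plain look as the other answers, you can drop the row_0/col_0 axis names afterwards:
out = pd.crosstab(s.index, s)
out.index.name = None
out.columns.name = None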
You can use str.join to join all elements of each list in the series into a single string, and then use str.get_dummies:
out = df.join(df['groups'].str.join('|').str.get_dummies())
print(out)
groups a b c e
0 [a, b, c] 1 1 1 0
1 [c] 0 0 1 0
2 [b, c, e] 0 1 1 1
3 [a, c] 1 0 1 0
4 [b, e] 0 1 0 1
I have a list of DataFrames. Each of these DataFrames looks like this :
df_list[0] =
place place2 value1 value2
0 x a 10 0
1 y a 15 10
2 z b 5 10
To give you a concrete example I will show two more :
df_list[1] =
place place2 value1 value2
0 x a 20 20
1 y a 0 0
df_list[2]=
place place2 value1 value2
0 x a 50 10
1 y a 30 20
2 z b 0 40
As you can see, not every one of these dataframes contains every possible 'place'. However, 'place2' is always associated with the same 'place'.
I would like to have a final DataFrame where I could see the top 3 'value1' and 'value2' and their associated "i" as in df_list[i], for each 'place'. The format really doesn't matter, but for example it could look like this :
place place2 v1_1st v1_1st_i v2_1st v2_1st_i v1_2nd v1_2nd_i v2_2nd v2_2nd_i v1_3rd v1_3rd_i ...
x a 50 2 20 1 20 1 10 0 10 2
y a 30 2 20 2 15 0 10 0 0 1
z b 5 0 40 2 0 2 10 0 NaN NaN
Thank you for bearing with me ! xoxo
Need a few steps here.
First we concatenate all dfs from df_list while adding a column to each that keeps track of the index of that df in the list; we put it into a column di:
df_ag = pd.concat([d.assign(di = n) for n,d in enumerate(df_list)], axis=0, ignore_index=True)
df_ag
produces
place place2 value1 value2 di
-- ------- -------- -------- -------- ----
0 x a 10 0 0
1 y a 15 10 0
2 z b 5 10 0
3 x a 20 20 1
4 y a 0 0 1
5 x a 50 10 2
6 y a 30 20 2
7 z b 0 40 2
We will treat value1 and value2 separately. For value1, we group by ['place', 'place2'], find the 3 largest values per group, and rank them (via reset_index() within each group):
df_agv1 = df_ag.groupby(['place','place2']).apply(lambda d: d.nlargest(3, 'value1').reset_index(drop=True))
df_agv1
this produces
place place2 value1 value2 di
place place2
x a 0 x a 50 10 2
1 x a 20 20 1
2 x a 10 0 0
y a 0 y a 30 20 2
1 y a 15 10 0
2 y a 0 0 1
z b 0 z b 5 10 0
1 z b 0 40 2
This already has the info we need (columns value1 and di). Assuming you want a format closer to the one you specified, we need to extract value1 and di for each group. We can do it like so:
df_agv1 = df_agv1.drop(columns = ['place','place2','value2']).unstack(level=2)
df_agv1.columns = df_agv1.columns.to_flat_index()
df_agv1
which produces
('value1', 0) ('value1', 1) ('value1', 2) ('di', 0) ('di', 1) ('di', 2)
---------- --------------- --------------- --------------- ----------- ----------- -----------
('x', 'a') 50 20 10 2 1 0
('y', 'a') 30 15 0 2 0 1
('z', 'b') 5 0 nan 0 2 nan
and this is what you asked for, for value1. You may want to rename the column labels if you do not like these.
Then we can do the same for value2 by swapping value1 <--> value2 in the commands above to produce df_agv2; I do not repeat the explanation, but see the sketch below.
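For reference, a minimal sketch of that value2 leg (the same commands as above with the column names swapped; it assumes df_ag and df_agv1 from the previous steps):
df_agv2 = df_ag.groupby(['place', 'place2']).apply(lambda d: d.nlargest(3, 'value2').reset_index(drop=True))
df_agv2 = df_agv2.drop(columns=['place', 'place2', 'value1']).unstack(level=2)
df_agv2.columns = df_agv2.columns.to_flat_index()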
If you want to put the two together, you can do it with something like
pd.concat([df_agv1,df_agv2], axis=1)
Another option
df1 = pd.DataFrame([['x', 'a', 10, 0], ['y', 'a', 15, 10], ['z', 'b', 5, 10]], columns=['place', 'place2', 'value', 'value2'])
df2 = pd.DataFrame([['x', 'a', 20, 20], ['y', 'a', 0, 0]], columns=['place', 'place2', 'value', 'value2'])
df3 = pd.DataFrame([['x', 'a', 50, 10], ['y', 'a', 30, 20], ['z', 'b', 0, 40]], columns=['place', 'place2', 'value', 'value2'])
df_list =[df1, df2, df3]
Identify the list position for each dataframe in the list:
for i, df in enumerate(df_list):
df['listposition'] = i
Concatenate the dataframes:
df_temp = pd.concat(df_list, axis=0)
Analyze value and value2 separately, but in the same way, so they can be merged later (sorting descending so the largest values come first):
df_pv1 = df_temp[['place', 'place2', 'value', 'listposition']].sort_values('value', ascending=False)
df_pv2 = df_temp[['place', 'place2', 'value2', 'listposition']].sort_values('value2', ascending=False)
df_pv2.rename(columns={'listposition': 'listposition2'}, inplace=True)
Group by place, place2 (taking head(3), since we sorted the value columns descending):
df_ranked_pv1 = df_pv1.groupby(['place','place2']).head(3).sort_values(['place', 'place2', 'value'], ascending=[True, True, False])
df_ranked_pv2 = df_pv2.groupby(['place','place2']).head(3).sort_values(['place', 'place2', 'value2'], ascending=[True, True, False])
Put it all together. You mentioned format wasn't settled, so this is a different layout.
df_final = pd.concat([df_ranked_pv1, df_ranked_pv2[['value2', 'listposition2']]], axis=1)
In [125]: df_final
Out[125]:
place place2 value listposition value2 listposition2
0 x a 50 2 20 1
0 x a 20 1 10 2
0 x a 10 0 0 0
1 y a 30 2 20 2
1 y a 15 0 10 0
1 y a 0 1 0 1
2 z b 5 0 40 2
2 z b 0 2 10 0
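If you do want one row per (place, place2) in the end, one possibility (a sketch building on the df_final above, not a required step) is to rank the rows within each group and pivot:
df_final = df_final.reset_index(drop=True)
df_final['rank'] = df_final.groupby(['place', 'place2']).cumcount() + 1
wide = df_final.pivot(index=['place', 'place2'], columns='rank', values=['value', 'listposition', 'value2', 'listposition2'])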
I want to compare the values in column 0 to the values in all the other columns and change the values of those columns accordingly.
I have 4329 rows x 197 columns.
From this:
0 1 2 3
0 G G G T
1 A A G A
2 C C C C
3 T A T G
To this:
0 1 2 3
0 G 1 1 0
1 A 1 0 1
2 C 1 1 1
3 T 0 1 0
I've tried a nested for loop, which does not work and is slow.
for index, row in df.iterrows():
    for name, value in row.iteritems():
        if name == 0:
            c = value
            continue
        if value == c:
            value = 1
        else:
            value = 0
I haven't been able to piece together a way to use apply or applymap for the problem.
Here's an approach with iloc and eq:
df.iloc[:,1:] = df.iloc[:,1:].eq(df.iloc[:,0], axis=0).astype(int)
Output:
0 1 2 3
0 G 1 1 0
1 A 1 0 1
2 C 1 1 1
3 T 0 1 0
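Since the question mentions 4329 rows x 197 columns: for what it's worth, a raw NumPy comparison of the same thing avoids per-column overhead (a sketch, not benchmarked):
vals = df.to_numpy()
out = df.copy()
out.iloc[:, 1:] = (vals[:, 1:] == vals[:, [0]]).astype(int)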
df = pd.DataFrame([['G', 'G', 'G', 'T'],
                   ['A', 'A', 'G', 'A'],
                   ['C', 'C', 'C', 'C'],
                   ['T', 'A', 'T', 'G']])
df2 = df[[0]].join(df.apply(lambda c: df[0] == c)[[1, 2, 3]].astype(int))
print(df2)
I guess ... there's probably a better way though.
You could also do something like
df.apply(lambda c:(df[0]==c).astype(int) if c.name > 0 else c)
I have a dataframe, with the following columns, in this order;
'2','4','9','A','1','B','C'
I want the first 3 columns to be A, B, C; the order of the rest doesn't matter.
Output:
'A','B','C','2','4','9'... and so on
Is this possible?
(There are hundreds of columns, so I can't put them all in a list.)
You can try to reorder like this:
first_cols = ['A','B','C']
last_cols = [col for col in df.columns if col not in first_cols]
df = df[first_cols+last_cols]
Setup
cols = ['2','4','9','A','1','B','C']
df = pd.DataFrame(1, range(3), cols)
df
2 4 9 A 1 B C
0 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1
sorted with key
key = lambda x: (x != 'A', x != 'B', x != 'C')
df[sorted(df, key=key)]
A B C 2 4 9 1
0 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
2 1 1 1 1 1 1 1
Better suited when the list of leading columns is longer:
first_cols = ['A', 'B', 'C']
key = lambda x: tuple(y != x for y in first_cols)
df[sorted(df, key=key)]