Group by column and spread values of another column into other columns - python

I have the following dataframe and I'm trying to group by Name, spread the values of Weight into columns, and count each time they occur. Thanks!
df = pd.DataFrame({'Name': ['John','Paul','Darren','John','Darren'],
                   'Weight': ['Average','Below Average','Above Average','Average','Above Average']})
Desired output:
        Above Average  Average  Below Average
Name
Darren              2        0              0
John                0        2              0
Paul                0        0              1

Try pandas crosstab:
pd.crosstab(df.Name, df.Weight)
Weight  Above Average  Average  Below Average
Name
Darren              2        0              0
John                0        2              0
Paul                0        0              1
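As a side note, crosstab can also append row and column totals via its margins flag; the margins_name label below is just an illustrative choice:
pd.crosstab(df.Name, df.Weight, margins=True, margins_name='Total')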

Use groupby and unstack:
df = pd.DataFrame({'Name': ['John','Paul','Darren','John','Darren'],
                   'Weight': ['Average','Below Average','Above Average','Average','Above Average']})
df = df.groupby(['Name', 'Weight'])['Weight'].count().unstack(1).fillna(0).astype(int).reset_index()
df = df.rename_axis('', axis=1).set_index('Name')
df
Out[1]:
        Above Average  Average  Below Average
Name
Darren              2        0              0
John                0        2              0
Paul                0        0              1

Use get_dummies to achieve what you need here:
pd.get_dummies(df.set_index('Name'), dummy_na=False, prefix=[None]).groupby('Name').sum()
        Above Average  Average  Below Average
Name
Darren              2        0              0
John                0        2              0
Paul                0        0              1
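A shorter variant along the same lines, assuming the Weight column is all you need to encode, is to one-hot just that column and group it by the Name values:
pd.get_dummies(df['Weight']).groupby(df['Name']).sum()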

How to keep certain rows based on a condition in python pandas

I have the following df. Below are the two fields that pertain to my question:
name tardy
max 0
max 1
ben 0
amy 0
amy 1
sue 1
tyler 0
tyler 1
I would like to keep only the names of those who have both tardy==0 and tardy==1. Thus, my desired output is the following:
name tardy
max 0
max 1
amy 0
amy 1
tyler 0
tyler 1
Getting rid of name==sue and name==ben means the only names that show up are those that have both a 0 and a 1 value for tardy.
I tried doing a .loc
df[(df.tardy==0) & (df.tardy==1)]
but this doesn't take into account filtering it by name.
Any help is appreciated. Thanks!
For the most general solution, which works for any data, compare each group's values converted to a set against the target set; to avoid matching data like 0,1,0, also compare the lengths when the sets match:
vals = set([0,1])
m = df.groupby('name')['tardy'].transform(lambda x: set(x)==vals and len(x)==len(vals))
df = df[m]
print (df)
name tardy
0 max 0
1 max 1
3 amy 0
4 amy 1
6 tyler 0
7 tyler 1
Or a solution with pandas functions - compare the number of unique values per group (like the set above), compare the lengths, and also check that the values are among 0,1:
vals = [0, 1]
g = df.groupby('name')['tardy']
df = df[g.transform('nunique').eq(2) & g.transform('size').eq(2) & df['tardy'].isin(vals)]
print (df)
name tardy
0 max 0
1 max 1
3 amy 0
4 amy 1
6 tyler 0
7 tyler 1
You can use groupby with transform('nunique'):
df[df.groupby('name')['tardy'].transform('nunique')==2]
Output:
name tardy
0 max 0
1 max 1
3 amy 0
4 amy 1
6 tyler 0
7 tyler 1
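One caveat: nunique == 2 alone would also accept a group whose two distinct values are, say, 0 and 2. If tardy may contain other values, a minimal guard (the sub name is just illustrative) is to restrict the rows first:
sub = df[df['tardy'].isin([0, 1])]
sub[sub.groupby('name')['tardy'].transform('nunique').eq(2)]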
The easiest way is to use df.groupby().filter, which filters the dataframe's groups based on a condition.
tardy_vals = {0, 1}
df.groupby('name').filter(lambda g: tardy_vals.issubset(g['tardy']))
name tardy
0 max 0
1 max 1
3 amy 0
4 amy 1
6 tyler 0
7 tyler 1

Update row in a dataframe based on a second one

I have the following dataframe, df1:
                         AS  AT  CH  TR
James Robert/01/08/2019   0   0   0   1
James Robert/18/08/2019   0   0   0   1
John Smith/01/08/2019     1   0   0   0
John Smith/02/08/2019     0   1   0   0
And df2:
                           TIME
Andrew Johnson/08/08/2019     1
James Robert/01/08/2019     0.5
John Smith/02/08/2019         1
If an index value is present in both dataframes (for example, James Robert/01/08/2019 and John Smith/02/08/2019), I would like to delete the row in df1 if df1["Column with a value"] - df2['TIME'] == 0; otherwise I would like to update the value.
The desired output would be :
                         AS  AT  CH   TR
James Robert/01/08/2019   0   0   0  0.5
James Robert/18/08/2019   0   0   0    1
John Smith/01/08/2019     1   0   0    0
If a row is in both dataframes, I'm able to delete it from df1, but I can't find a way to add this particular condition: df1["Column with a value"].
Thanks
Instead of using the indexes, use them as columns. Put the df2['index'] column in a list and use that list as the parameter of the isin method on df1.
df2['index'] = df2.index
df1['index'] = df1.index
filtered_df1 = df1[df1['index'].isin(df2['index'].values.tolist())]
Create a dictionary from the 'index' column and the 'TIME' column of df2, then map it onto filtered_df1.
your_dict = dict(zip(df2['index'], df2['TIME']))
filtered_df1['Subtract Value'] = filtered_df1['index'].map(your_dict).fillna(value=0)
Then do the subtraction there, dropping the helper columns first so only the value columns are touched:
final_df = filtered_df1.drop(columns=['index', 'Subtract Value']).sub(filtered_df1['Subtract Value'], axis=0)
Hope this helps.
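For reference, a minimal sketch of an alternative that produces the desired output directly, starting from the original df1 and df2 and assuming every df1 row holds exactly one nonzero value (common and sub are illustrative names, not part of the answer above):
df1 = df1.astype(float)
common = df1.index.intersection(df2.index)
# subtract TIME only from the nonzero entries of the shared rows
sub = df1.loc[common].sub(df2.loc[common, 'TIME'], axis=0)
df1.loc[common] = df1.loc[common].mask(df1.loc[common].ne(0), sub)
# drop rows whose single value has become zero
df1 = df1[df1.sum(axis=1).ne(0)]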

How to count the element in a column and take the result as a new column?

The DataFrame named df is shown as follows.
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 3]})
Input:
id
0 1
1 1
2 3
I want to count the number of each id, and take the result as a new column count.
Expected:
id count
0 1 2
1 1 2
2 3 1
pd.factorize and np.bincount
My favorite. factorize does not sort and has time complexity of O(n). For big data sets, factorize should be preferred over np.unique.
import numpy as np

i, u = df.id.factorize()
df.assign(Count=np.bincount(i)[i])
id Count
0 1 2
1 1 2
2 3 1
np.unique and np.bincount
u, i = np.unique(df.id, return_inverse=True)
df.assign(Count=np.bincount(i)[i])
id Count
0 1 2
1 1 2
2 3 1
Assign the new count column to the dataframe by grouping on id and then transforming that column with size (value_counts gives the same numbers, but recent pandas versions only accept transformation kernels such as size or count here).
>>> df.assign(count=df.groupby('id')['id'].transform('size'))
id count
0 1 2
1 1 2
2 3 1
Use Series.map with Series.value_counts:
df['count'] = df['id'].map(df['id'].value_counts())
#alternative
#from collections import Counter
#df['count'] = df['id'].map(Counter(df['id']))
Detail:
print (df['id'].value_counts())
1 2
3 1
Name: id, dtype: int64
Or use GroupBy.transform with GroupBy.size to return a Series with the same size as the original DataFrame:
df['count'] = df.groupby('id')['id'].transform('size')
print (df)
id count
0 1 2
1 1 2
2 3 1

reset a recurring multiindex in pandas

I have a pandas DataFrame coming from a pd.concat with a recurring MultiIndex:
     customer_id
0 0     46841769
  1      4683936
1 0      8880872
  1      8880812
0 0      8880873
  1      1000521
1 0      1135488
  1      5388773
Now, I want to reset only the first level of the MultiIndex, so that I get a running number in the index. Something like this:
     customer_id
0 0     46841769
  1      4683936
1 0      8880872
  1      8880812
2 0      8880873
  1      1000521
3 0      1135488
  1      5388773
I have around 5 million records and not the biggest machine, so I'm looking for a memory-efficient solution.
ignore_index=True in pd.concat does not work, because then I lose the MultiIndex.
Many thanks
You can convert the first level to a Series with get_level_values and to_series, compare it with shifted values, add cumsum to build the counter, and last use MultiIndex.from_arrays:
a = df.index.get_level_values(0).to_series()
a = a.ne(a.shift()).cumsum() - 1
mux = pd.MultiIndex.from_arrays([a, df.index.get_level_values(1)], names=df.index.names)
df.index = mux
Or:
df = df.set_index(mux)
print (df)
     customer_id
0 0     46841769
  1      4683936
1 0      8880872
  1      8880812
2 0      8880873
  1      1000521
3 0      1135488
  1      5388773
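If you control the pd.concat call itself, a simpler route may be to pass keys, which builds the running outer level directly; a minimal sketch, assuming frames is the list of DataFrames being concatenated:
# keys labels each input frame 0, 1, 2, ... in the outer level,
# so the counter never repeats and no reset is needed afterwards
df = pd.concat(frames, keys=range(len(frames)))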

Column headers like pivot table

I am trying to find out the mix of member grades that visit my stores.
import pandas as pd
df = pd.DataFrame({'MbrID': ['M1','M2','M3','M4','M5','M6','M7'],
                   'Store': ['PAR','TPM','AMK','TPM','PAR','PAR','AMK'],
                   'Grade': ['A','A','B','A','C','A','C']})
df = df[['MbrID','Store','Grade']]
print(df)
df.groupby('Store').agg({'Grade':pd.Series.nunique})
Below is the dataframe and also the result of the groupby function.
How do I produce a result like an Excel pivot table, such that the categories of Grade (A, B, C) are the column headers? Assume that I have quite a wide range of member grades.
I think you can use groupby with size and reshaping by unstack:
df1 = df.groupby(['Store','Grade'])['Grade'].size().unstack(fill_value=0)
print (df1)
Grade  A  B  C
Store
AMK    0  1  1
PAR    2  0  1
TPM    2  0  0
Solution with crosstab:
df2 = pd.crosstab(df.Store, df.Grade)
print (df2)
Grade  A  B  C
Store
AMK    0  1  1
PAR    2  0  1
TPM    2  0  0
and with pivot_table:
df3 = df.pivot_table(index='Store',
                     columns='Grade',
                     values='MbrID',
                     aggfunc=len,
                     fill_value=0)
print (df3)
Grade  A  B  C
Store
AMK    0  1  1
PAR    2  0  1
TPM    2  0  0
