how to aggregate in pivot table in pandas - python

I have the following dataframe in pandas:
code date tank nozzle qty amount
123 2018-01-01 1 1 100 0
123 2018-01-01 1 2 0 50
123 2018-01-01 1 2 0 50
123 2018-01-01 1 2 100 0
123 2018-01-02 1 1 0 70
123 2018-01-02 1 1 0 50
123 2018-01-02 1 2 100 0
My desired dataframe is
code date tank nozzle_1_qty nozzle_2_qty nozzle_1_amount nozzle_2_amount
123 2018-01-01 1 100 100 0 100
123 2018-01-02 1 0 100 120 0
I am doing the following in pandas:
df = (df.pivot_table(index=['date', 'tank'], columns='nozzle',
                     values=['qty', 'amount'])
        .add_prefix('nozzle_')
        .reset_index()
      )
But this does not give me my desired output.

The default aggregation function in pivot_table is np.mean, so it is necessary to change it to 'sum' and then flatten the MultiIndex columns with a list comprehension:
df = df.pivot_table(index=['code', 'date', 'tank'],
                    columns='nozzle',
                    values=['qty', 'amount'], aggfunc='sum')
# Python 3.6+
df.columns = [f'nozzle_{b}_{a}' for a, b in df.columns]
# Python below 3.6
# df.columns = ['nozzle_{}_{}'.format(b, a) for a, b in df.columns]
df = df.reset_index()
print(df)
code date tank nozzle_1_amount nozzle_2_amount nozzle_1_qty \
0 123 2018-01-01 1 0 100 100
1 123 2018-01-02 1 120 0 0
nozzle_2_qty
0 100
1 100
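If the column order should also match the desired layout exactly, an optional reindex could follow (a small sketch, assuming the column names produced above):
df = df[['code', 'date', 'tank',
         'nozzle_1_qty', 'nozzle_2_qty',
         'nozzle_1_amount', 'nozzle_2_amount']]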

I don't use pivot_table much in pandas, but you can get your result using groupby and some reshaping.
df = df.groupby(['code', 'date', 'tank', 'nozzle']).sum().unstack()
The columns will be a MultiIndex that you may want to rename or flatten.
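A minimal sketch for flattening them, assuming the groupby/unstack result above where each column label is a (value, nozzle) pair:
df.columns = [f'nozzle_{nozzle}_{value}' for value, nozzle in df.columns]
df = df.reset_index()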

Related

pandas dataframe aggregate by ID and date

I'm trying to aggregate a dataframe by both ID and date. Suppose I had a dataframe:
Publish date ID Price
0 2000-01-02 0 10
1 2000-01-03 0 20
2 2000-02-17 0 30
3 2000-01-04 1 40
I would like to aggregate the value by ID and date (frequency = 1W) and get a dataframe like:
Publish date ID Price
0 2000-01-02 0 30
1 2000-02-17 0 30
2 2000-01-04 1 40
I understand it can be achieved by iterating over the IDs and using Grouper to aggregate the price. Is there a more efficient way that avoids iterating over the IDs? Many thanks.
Use Grouper with an aggregate sum, but the frequency of the Grouper is not certain, because each variant below gives output that looks a bit different from the question's expected result:
df['Publish date'] = pd.to_datetime(df['Publish date'])
df = (df.groupby([pd.Grouper(freq='W', key='Publish date'), 'ID'], sort=False)['Price']
        .sum()
        .reset_index())
print (df)
Publish date ID Price
0 2000-01-02 0 10
1 2000-01-09 0 20
2 2000-02-20 0 30
3 2000-01-09 1 40
df['Publish date'] = pd.to_datetime(df['Publish date'])
df = (df.groupby([pd.Grouper(freq='W-Mon', key='Publish date'), 'ID'], sort=False)['Price']
        .sum()
        .reset_index())
print (df)
Publish date ID Price
0 2000-01-03 0 30
1 2000-02-21 0 30
2 2000-01-10 1 40
Or:
df['Publish date'] = pd.to_datetime(df['Publish date'])
df = (df.groupby([pd.Grouper(freq='7D', key='Publish date'), 'ID'], sort=False)['Price']
        .sum()
        .reset_index())
print (df)
Publish date ID Price
0 2000-01-02 0 30
1 2000-02-13 0 30
2 2000-01-02 1 40
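If it is unclear which frequency matches the expected output, one way to inspect the weekly bins (a small sketch, reconstructing the sample data from the question) is:
import pandas as pd

df = pd.DataFrame({'Publish date': ['2000-01-02', '2000-01-03', '2000-02-17', '2000-01-04'],
                   'ID': [0, 0, 0, 1],
                   'Price': [10, 20, 30, 40]})
df['Publish date'] = pd.to_datetime(df['Publish date'])

# 'W' anchors weeks on Sunday, 'W-MON' on Monday, '7D' counts 7-day bins from the first date
print(df['Publish date'].dt.to_period('W'))      # Sunday-anchored weekly periods
print(df['Publish date'].dt.to_period('W-MON'))  # Monday-anchored weekly periods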

left join in pandas with multiple records with the same key

I have the following dataframes in pandas:
df1 (LHS)
code date tank product key
123 2019-01-01 1 HS 123_2019-01-01_1
123 2019-01-01 1 HS 123_2019-01-01_1
123 2019-01-02 2 MS 123_2019-01-02_2
123 2019-01-02 1 HS 123_2019-01-02_1
df2_master (RHS)
code date tank product key
123 2019-01-01 1 MS 123_2019-01-01_1
123 2019-01-01 1 HS 123_2019-01-01_1
123 2019-01-02 2 MS 123_2019-01-02_2
123 2019-01-02 1 HS 123_2019-01-02_1
I want to merge df1 and df2_master with a left join on key. Now df2_master has 2 products associated with the same key for date 2019-01-01, so I want to flag this while merging the two dataframes.
My desired dataframe should look like this.
df1 (LHS)
code date tank product key product_df2
123 2019-01-01 1 HS 123_2019-01-01_1 More than 1 product
123 2019-01-01 1 HS 123_2019-01-01_1 More than 1 product
123 2019-01-02 2 MS 123_2019-01-02_2 MS
123 2019-01-02 1 HS 123_2019-01-02_1 HS
How do I do it in pandas?
Create a column product_df2 that checks for duplicates with DataFrame.duplicated, merge with the deduplicated rows via DataFrame.drop_duplicates, and finally set the values with numpy.where:
import numpy as np

df2_master['product_df2'] = df2_master.duplicated(subset=['key'], keep=False)
df = df1.merge(df2_master.drop_duplicates('key'), how='left', on='key', suffixes=('', '_'))
df['product_df2'] = np.where(df['product_df2'], 'More than 1 product', df['product_'])
# remove the suffixed helper columns brought in by the merge
df = df.loc[:, ~df.columns.str.endswith('_')]
print (df)
code date tank product key product_df2
0 123 2019-01-01 1 HS 123_2019-01-01_1 More than 1 product
1 123 2019-01-01 1 HS 123_2019-01-01_1 More than 1 product
2 123 2019-01-02 2 MS 123_2019-01-02_2 MS
3 123 2019-01-02 1 HS 123_2019-01-02_1 HS
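A variation on the same idea (a sketch, not part of the answer above) is to count distinct products per key with transform('nunique') and set the flag from that count:
import numpy as np

# flag keys that map to more than one distinct product in df2_master
n_products = df2_master.groupby('key')['product'].transform('nunique')
df2_flagged = (df2_master.assign(product_df2=np.where(n_products > 1,
                                                      'More than 1 product',
                                                      df2_master['product']))
                         .drop_duplicates('key'))

df = df1.merge(df2_flagged[['key', 'product_df2']], how='left', on='key')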

Convert the data frame from long to wide format and dynamically name columns

I am converting the data frame from long to wide format; however, the problem I am facing is generating the right number of translated columns and dynamically renaming the new data frame columns.
So let's say I have a sample data frame as follows:
data = {'name':['Tom', 'nick', 'Tom', 'nick','Tom'], 'id':[20, 21, 20, 21,22], 'plan' : [100,101,102,101,100], 'drug' : ['a','b','b','c','a']}
df = pd.DataFrame(data)
drug id name plan
a 20 Tom 100
b 21 nick 101
b 20 Tom 102
c 21 nick 101
a 22 Tom 100
So for every given name and id I want to create multiple columns for plan and drug. For example, there are 3 distinct plans and 3 distinct drugs, so ideally I should get 6 new columns which indicate whether a particular plan/drug has been taken or not.
I tried converting from long to wide but I am not getting the desired result.
Convert long to wide:
df1 = (df.groupby(['name', 'id'])['plan', 'drug']
         .apply(lambda x: pd.DataFrame(x.values))
         .unstack()
         .reset_index())
Actual output:
name id 0 1 0 1
Tom 20 100 102 a b
nick 21 101 101 b c
Tom 22 100 None a None
Expected output:
name id 100 101 102 a b c
Tom 20 1 0 1 1 1 0
Tom 22 1 0 0 1 0 0
nick 21 0 1 0 0 1 1
Use get_dummies with max:
df1 = (pd.get_dummies(df.set_index(['name', 'id']).astype(str))
         .max(level=[0, 1])
         .reset_index())
print(df1)
name id plan_100 plan_101 plan_102 drug_a drug_b drug_c
0 Tom 20 1 0 1 1 1 0
1 nick 21 0 1 0 0 1 1
2 Tom 22 1 0 0 1 0 0
df2 = (pd.get_dummies(df.set_index(['name', 'id'])
                        .astype(str), prefix='', prefix_sep='')
         .max(level=[0, 1])
         .reset_index())
print(df2)
name id 100 101 102 a b c
0 Tom 20 1 0 1 1 1 0
1 nick 21 0 1 0 0 1 1
2 Tom 22 1 0 0 1 0 0
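In newer pandas versions, where max(level=...) has been removed, an equivalent sketch groups by the index levels instead:
df1 = (pd.get_dummies(df.set_index(['name', 'id']).astype(str))
         .groupby(level=[0, 1]).max()
         .reset_index())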
EDIT: Solution with DataFrame.pivot_table, concat and DataFrame.clip:
df1 = df.pivot_table(index=['name', 'id'],
                     columns=['plan'],
                     aggfunc='size',
                     fill_value=0)
df2 = df.pivot_table(index=['name', 'id'],
                     columns=['drug'],
                     aggfunc='size',
                     fill_value=0)
df = pd.concat([df1, df2], axis=1).clip(upper=1).reset_index()
print(df)
name id 100 101 102 a b c
0 Tom 20 1 0 1 1 1 0
1 Tom 22 1 0 0 1 0 0
2 nick 21 0 1 0 0 1 1
import pandas as pd

data = {
    'name': ['Tom', 'nick', 'Tom', 'nick', 'Tom'],
    'id': [20, 21, 20, 21, 22],
    'plan': [100, 101, 102, 101, 100],
    'drug': ['a', 'b', 'b', 'c', 'a'],
}
df = pd.DataFrame(data)

plans = df.groupby(['name', 'id', 'plan']).size().unstack()
drugs = df.groupby(['name', 'id', 'drug']).size().unstack()
merged_df = pd.merge(plans, drugs, left_index=True, right_index=True)
merged_df = merged_df.fillna(0)
Get the plan and drug counts for each name and id (that is what size() followed by unstack() is for), then merge them on their index (which is set to name and id), and use fillna to replace NaN with 0.
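Note that this gives counts rather than 0/1 indicators; a small follow-up sketch, reusing clip as in the earlier answer, caps the values at 1:
merged_df = merged_df.clip(upper=1).astype(int).reset_index()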

using loc to sum some columns and not all of them in pandas

When I am using this:
results_all.ix['Total', 'n_Close'] = results_all['n_Close'].sum()
the result is this (the values are not summed numerically but concatenated as strings):
date n_Close g_Close potential
0 2017-05-02.csv 234 10.5 20.5
1 2017-05-03.csv -8 0 1
Total 234.0-8.0
When I use this:
results_all.loc['Total']= results_all.sum()
the result is:
date n_Close g_Close potential
0 2017-05-02.csv 234 10.5 20.5
1 2017-05-03.csv -8 0 1
Total 2017-05-02.csv2017-05-03.csv 234.0-8.0 10.50.0 20.51.0
The required result is the sum for specific columns only:
date n_Close g_Close potential XXX
0 2017-05-02.csv 234 10.5 20.5 10
1 2017-05-03.csv -8 0 1 7
Total nothing here 226 10.50 21.5 nothing here
I think the problem is that the numeric columns are stored as strings.
The solution is to convert all values to numeric with to_numeric, replace the original columns with the numeric ones, and then get the sum:
print (df)
date n_Close g_Close potential
0 2017-05-02.csv 234 10.5 20.5
1 2017-05-03.csv -8 0 1
1 2017-05-03.csv aaa 0 1
df1 = df.apply(pd.to_numeric, errors='coerce')
df.loc[:, ~df1.isnull().all()] = df1
df.loc['Total'] = df1.sum(min_count=1)
print (df)
date n_Close g_Close potential
0 2017-05-02.csv 234 10.5 20.5
1 2017-05-03.csv -8 0 1
1 2017-05-03.csv NaN 0 1 <-aaa was replaced by NaN
Total NaN 226 10.5 22.5
Another solution, if you don't want to modify the original values:
df1 = df.apply(pd.to_numeric, errors='coerce')
df.loc['Total'] = df1.sum(min_count=1)
print (df)
date n_Close g_Close potential
0 2017-05-02.csv 234 10.5 20.5
1 2017-05-03.csv -8 0 1
1 2017-05-03.csv aaa 0 1 <- aaa for original values
Total NaN 226 10.5 22.5
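If only specific columns should be totalled, a minimal sketch (assuming the column names from the question) builds the Total row from just those columns; unlisted columns such as date stay empty:
cols = ['n_Close', 'g_Close', 'potential']
totals = df[cols].apply(pd.to_numeric, errors='coerce').sum(min_count=1)
df.loc['Total'] = totals  # columns not in cols, e.g. date, remain NaN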

transform column with categorical data into one column for each category

I have a DataFrame looking like that:
df index id timestamp cat value
0 8066 101 2012-03-01 09:00:29 A 1
1 8067 101 2012-03-01 09:01:15 B 0
2 8068 101 2012-03-01 09:40:18 C 1
3 8069 102 2012-03-01 09:40:18 C 0
What I want is something like this:
df timestamp A B C id value
0 2012-03-01 09:00:29 1 0 0 101 1
1 2012-03-01 09:01:15 0 1 0 101 0
2 2012-03-01 09:40:18 0 0 1 101 1
3 2012-03-01 09:40:18 0 0 1 102 0
As you can see in rows 2 and 3, timestamps can be duplicated. At first I tried using pivot (with timestamp as the index), but that didn't work because of those duplicates. I don't want to drop them, since the other data is different and should not be lost.
Since index contains no duplicates, I thought maybe I can pivot over it and after that merge the result into the original DataFrame, but I was wondering if there is an easier more intuitive solution.
Thanks!
Here is a one-liner that will achieve what you want, assuming that your dataframe is named df:
df_new = df.join(pd.get_dummies(df.cat)).drop(['index', 'cat'], axis=1)
As get_dummies returns a DataFrame, it will already be aligned with your existing df, so just concat column-wise:
In [66]:
pd.concat([df,pd.get_dummies(df['cat'])], axis=1)
Out[66]:
index id timestamp cat value A B C
0 8066 101 2012-03-01 09:00:29 A 1 1 0 0
1 8067 101 2012-03-01 09:01:15 B 0 0 1 0
2 8068 101 2012-03-01 09:40:18 C 1 0 0 1
3 8069 102 2012-03-01 09:40:18 C 0 0 0 1
You can drop the 'cat' column by doing df.drop('cat', axis=1)
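Putting both steps together (a short sketch based on the snippet above):
df_new = pd.concat([df, pd.get_dummies(df['cat'])], axis=1).drop('cat', axis=1)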
Use get_dummies.
See here:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.core.reshape.get_dummies.html
StackOverflow Example here:
Create dummies from column with multiple values in pandas
