Flatten / Remove hierarchical column headers - python

I have the following dataframe which is the result of doing a groupby + aggregate sum:
df.groupby(['id', 'category']).agg([pd.Series.sum])
             supply   stock
                sum     sum
id category
4  abc          161  -0.094
6  sde          -76   0.150
23 hgv           64  -0.054
1  wcd          -14   0.073
76 jhf           -8  -0.057
Because of the groupby and agg, the column headings are now tuples. How do I change the column headings back into single values, i.e. supply and stock? I just need to get rid of sum from the headings.

If you use sum directly, the aggregation function name won't be created as part of the columns (as a MultiIndex):
df.groupby(['id', 'category']).sum()
To remove them, you can drop the level:
df.columns = df.columns.droplevel(1)
For example:
In [11]: df
Out[11]:
     supply     stock
        sum       sum
0  0.501176  0.482497
1  0.442689  0.008664
2  0.885112  0.512066
3  0.724619  0.418720
In [12]: df.columns.droplevel(1)
Out[12]: Index(['supply', 'stock'], dtype='object')
In [13]: df.columns = df.columns.droplevel(1)
In [14]: df
Out[14]:
     supply     stock
0  0.501176  0.482497
1  0.442689  0.008664
2  0.885112  0.512066
3  0.724619  0.418720
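For completeness, a minimal end-to-end sketch (with made-up data shaped like the question's frame) showing the MultiIndex appearing after agg and disappearing after droplevel:

```python
import pandas as pd

# Made-up data shaped like the question's frame (values are illustrative).
df = pd.DataFrame({
    'id': [4, 6, 23],
    'category': ['abc', 'sde', 'hgv'],
    'supply': [161, -76, 64],
    'stock': [-0.094, 0.150, -0.054],
})

# Aggregating with a list of functions adds a second column level ...
agg = df.groupby(['id', 'category']).agg(['sum'])
assert isinstance(agg.columns, pd.MultiIndex)

# ... and droplevel(1) flattens it back to plain labels.
agg.columns = agg.columns.droplevel(1)
print(list(agg.columns))   # ['supply', 'stock']
```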

You can explicitly set the columns attribute to whatever you'd like it to be. For example:
>>> df = pd.DataFrame(np.random.random((4, 2)),
...                   columns=pd.MultiIndex.from_arrays([['supply', 'stock'],
...                                                      ['sum', 'sum']]))
>>> df
     supply     stock
        sum       sum
0  0.170950  0.314759
1  0.632121  0.147884
2  0.955682  0.127857
3  0.776764  0.318614
>>> df.columns = df.columns.get_level_values(0)
>>> df
     supply     stock
0  0.170950  0.314759
1  0.632121  0.147884
2  0.955682  0.127857
3  0.776764  0.318614
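If you aggregate with several functions per column, dropping or selecting a single level loses information. A common alternative (my own sketch, not from the answers above) is to join the two levels into one label:

```python
import pandas as pd

df = pd.DataFrame({'g': ['a', 'a', 'b'],
                   'supply': [1, 2, 3],
                   'stock': [4, 5, 6]})
agg = df.groupby('g').agg(['sum', 'mean'])

# Join both levels with '_' so neither the column name nor the
# function name is lost.
agg.columns = ['_'.join(col) for col in agg.columns]
print(agg.columns.tolist())
# ['supply_sum', 'supply_mean', 'stock_sum', 'stock_mean']
```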

Related

For each distinct value in a given column, count the null and non-null values in another column

Suppose I have the following dataframe:
df = pd.DataFrame({'col1': ['x','y','z','x','x','x','y','z','y','y'],
                   'col2': [np.nan,'n1',np.nan,np.nan,'n3','n2','n5',np.nan,np.nan,np.nan]})
For each distinct element in col1 I want to count how many null and non-null values there are in col2 and summarise the result in a new dataframe. So far I used df1 = df[df['col1']=='x'] and then
print(df1[df1['col2'].isna()].shape[0],
      df1[df1['col2'].notna()].shape[0])
I was then manually changing the value in df1 so that df1 = df[df['col1']=='y'] and df1 = df[df['col1']=='z']. Yet my method is not efficient at all. The table I desire should look like the following:
  col1  value  no value
0    x      2         2
1    y      2         2
2    z      0         2
I have also tried df.groupby('col1').col2.nunique(), yet that only counts the distinct non-null values.
Let us try crosstab to create a frequency table where the index is the unique values in column col1 and columns represent the corresponding counts of non-nan and nan values in col2:
out = pd.crosstab(df['col1'], df['col2'].isna())
out.columns = ['value', 'no value']
>>> out
      value  no value
col1
x         2         2
y         2         2
z         0         2
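Putting the question's data and the crosstab answer together into one runnable sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': list('xyzxxxyzyy'),
                   'col2': [np.nan, 'n1', np.nan, np.nan, 'n3',
                            'n2', 'n5', np.nan, np.nan, np.nan]})

# Tabulate col1 against the null mask of col2:
# the False column counts non-null values, True counts nulls.
out = pd.crosstab(df['col1'], df['col2'].isna())
out.columns = ['value', 'no value']
print(out)
```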
Group the null mask by col1 and use SeriesGroupBy.value_counts for the counts, with a reshape by Series.unstack and some data cleaning:
df = (df['col2'].isna()
        .groupby(df['col1'])
        .value_counts()
        .unstack(fill_value=0)
        .reset_index()
        .rename_axis(None, axis=1)
        .rename(columns={False: 'value', True: 'no value'}))
print(df)
  col1  value  no value
0    x      2         2
1    y      2         2
2    z      0         2
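Another option (my own sketch, not from the answers): 'count' skips NaN while 'size' counts every row, so the null count falls out by subtraction:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': list('xyzxxxyzyy'),
                   'col2': [np.nan, 'n1', np.nan, np.nan, 'n3',
                            'n2', 'n5', np.nan, np.nan, np.nan]})

# 'count' skips NaN while 'size' counts every row,
# so the null count is simply the difference.
res = df.groupby('col1')['col2'].agg(['count', 'size']).reset_index()
res['no value'] = res['size'] - res['count']
res = res.rename(columns={'count': 'value'})[['col1', 'value', 'no value']]
print(res)
```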

Pandas: sum of every N columns

I have dataframe
ID   2016-01  2016-02  ...  2017-01  2017-02  ...  2017-10  2017-11  2017-12
111       12       34            0       12             3        0        0
222        0       32            5        5             0        0        0
I need to sum every 12 columns to get
ID   2016  2017
111    46    15
222    32    10
I tried to use
(df.groupby((np.arange(len(df.columns)) // 31) + 1, axis=1).sum().add_prefix('s'))
But it applies the sum across all the columns, ID included.
But when I try to use
df.groupby['ID']((np.arange(len(df.columns)) // 31) + 1, axis=1).sum().add_prefix('s'))
It returns
TypeError: 'method' object is not subscriptable
How can I fix that?
First, set_index on all the columns that aren't dates:
df = df.set_index('ID')
1. groupby by the columns split by '-', selecting the first part:
df = df.groupby(df.columns.str.split('-').str[0], axis=1).sum()
2. or a lambda function doing the split:
df = df.groupby(lambda x: x.split('-')[0], axis=1).sum()
3. or convert the columns to datetimes and groupby the years:
df.columns = pd.to_datetime(df.columns)
df = df.groupby(df.columns.year, axis=1).sum()
4. resample by years:
df.columns = pd.to_datetime(df.columns)
df = df.resample('A', axis=1).sum()
df.columns = df.columns.year
print (df)
     2016  2017
ID
111    46    15
222    32    10
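Note that groupby(..., axis=1) is deprecated in recent pandas versions. An equivalent sketch of approach 3 that goes through the transpose instead (the column subset is shortened here for brevity):

```python
import pandas as pd

# A shortened version of the question's frame (only four date columns).
df = pd.DataFrame({'ID': [111, 222],
                   '2016-01': [12, 0], '2016-02': [34, 32],
                   '2017-01': [0, 5], '2017-02': [12, 5]}).set_index('ID')

# Parse the labels as dates, then group the *transposed* frame by year;
# this sidesteps groupby(..., axis=1), which newer pandas deprecates.
years = pd.to_datetime(df.columns).year
out = df.T.groupby(years.values).sum().T
print(out)
```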
The above code has a slight syntax error and throws the following error:
ValueError: No axis named 1 for object type
Basically, the groupby condition needs to be wrapped in an extra []. So I'm rewriting the code correctly for convenience:
new_df = df.groupby([[i // n for i in range(m)]], axis=1).sum()
where n is the number of columns you want to group together and m is the total number of columns being grouped. You have to rename the columns after that.
If you don't mind losing the labels, you can try this:
new_df = df.groupby([i//n for i in range(0,m)], axis = 1).sum()
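A runnable sketch of the positional i // n grouping on toy data; it goes through the transpose rather than the deprecated axis=1 keyword:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5, 6]], columns=list('abcdef'))

n = 3                              # columns per group
m = df.shape[1]                    # total number of columns
keys = [i // n for i in range(m)]  # [0, 0, 0, 1, 1, 1]

# Grouping the transpose by position is equivalent to groupby(..., axis=1).
new_df = df.T.groupby(keys).sum().T
print(new_df)   # one row: group 0 sums to 6, group 1 sums to 15
```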

pandas adding grouped data frame to another data frame as row

I have the following dataframe:
category_name             amount
Blades & Razors & Foam       158
Diaper                       486
Empty                        193
Fem Care                    2755
HairCare                    3490
Irrelevant                  1458
Laundry                      889
Oral Care                   2921
Others                        69
Personal Cleaning Care      1543
Skin Care                    645
I want to add it as a row to the following dataframe, which has an additional retailer column that is absent from the first dataframe.
categories_columns = ['retailer'] + self.product_list.category_name.unique().tolist()
categories_df = pd.DataFrame(columns=categories_columns)
And if some category is missing, I just want a zero value.
Any ideas ?
Use set_index to move the category_name column into the index. Then taking the transpose (.T) will move the category_names into the column index:
In [35]: df1
Out[35]:
   amount cat
0       0   A
1       1   B
2       2   C
In [36]: df1.set_index('cat').T
Out[36]:
cat     A  B  C
amount  0  1  2
Once the category names (cat, above) are in the column index, you can concatenate the reshaped DataFrame with the second DataFrame using append or pd.concat. pd.concat fills missing values with NaN; use fillna(0) to replace the NaNs with 0.
import numpy as np
import pandas as pd
df1 = pd.DataFrame({'amount': range(3), 'cat': list('ABC')})
df2 = pd.DataFrame(np.arange(2*4).reshape(2, 4), columns=list('ABCD'))
result = df2.append(df1.set_index('cat').T).fillna(0)
print(result)
yields
        A  B  C    D
0       0  1  2  3.0
1       4  5  6  7.0
amount  0  1  2  0.0
Just append and replace the NaNs:
pd.DataFrame(columns=products).append(df.T).fillna(0)
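One caveat worth knowing: DataFrame.append was removed in pandas 2.0, so on current versions the same result comes from pd.concat. A sketch using the example frames from the accepted answer:

```python
import pandas as pd

df1 = pd.DataFrame({'amount': range(3), 'cat': list('ABC')})
df2 = pd.DataFrame([[0, 1, 2, 3], [4, 5, 6, 7]], columns=list('ABCD'))

# pd.concat stacks the reshaped row under df2; the missing D entry
# becomes NaN and is then zero-filled.
result = pd.concat([df2, df1.set_index('cat').T]).fillna(0)
print(result)
```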

Calculate weights for grouped data in pandas

I would like to calculate portfolio weights with a pandas dataframe. Here is some dummy data for an example:
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': ['ann', 'bob'] * 3}).sort_values('name').reset_index(drop=True)
df2 = pd.DataFrame({'stock': list('ABC') * 2})
df3 = pd.DataFrame({'val': np.random.randint(10, 100, 6)})
df = pd.concat([df1, df2, df3], axis=1)
Each person owns 3 stocks with a value val. We can calculate portfolio weights like this:
df.groupby('name').apply(lambda x: x.val/(x.val).sum())
which gives the weights as a Series indexed by name and the original row index.
If I want to add a column wgt to df I need to merge this result back to df on name and index. This seems rather clunky.
Is there a way to do this in one step? Or what is the way to do this that best utilizes pandas features?
Use transform; it returns a series whose index is aligned to your original df:
In [114]:
df['wgt'] = df.groupby('name')['val'].transform(lambda x: x/x.sum())
df
Out[114]:
  name stock  val       wgt
0  ann     A   18  0.131387
1  ann     B   43  0.313869
2  ann     C   76  0.554745
3  bob     A   16  0.142857
4  bob     B   44  0.392857
5  bob     C   52  0.464286
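A small variation (my sketch): transform also accepts the aggregation name as a string, which avoids the lambda entirely:

```python
import pandas as pd

df = pd.DataFrame({'name': ['ann'] * 3 + ['bob'] * 3,
                   'stock': list('ABC') * 2,
                   'val': [18, 43, 76, 16, 44, 52]})

# transform('sum') broadcasts each group's total back to its rows.
df['wgt'] = df['val'] / df.groupby('name')['val'].transform('sum')
print(df)   # each name's wgt column sums to 1
```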

Pandas and sum and cum sum in same dataframe

I use the code below to create a sum and a cumsum, but they end up in two separate dataframes. I want them all in one.
import numpy as np
import pandas as pd

asp = np.array([0, 0, 1])
asq = np.array([10, 10, 20])
df = pd.DataFrame(asp, columns=['asp'])
df['asq'] = asq
df = df.groupby(by=['asp']).sum()
dfcum = df.cumsum()
How do I get both the sum and the cumsum in the same dataframe? Below is what I want:
     asqsum  cumsum
asp
0        20      20
1        20      40
Maybe you want this?
In [20]: df['asq_cum'] = df['asq'].cumsum()
In [21]: df
Out[21]:
     asq  asq_cum
asp
0     20       20
1     20       40
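If you would rather build both columns in one chained expression, here is a sketch using to_frame and assign (the column names asqsum and cumsum are taken from the desired output):

```python
import pandas as pd

df = pd.DataFrame({'asp': [0, 0, 1], 'asq': [10, 10, 20]})

# Aggregate once, then derive the running total from the group sums.
out = (df.groupby('asp')['asq'].sum()
         .to_frame('asqsum')
         .assign(cumsum=lambda d: d['asqsum'].cumsum()))
print(out)
```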
