applying pivot table on pandas dataframe instead of grouping - python

I have a dataframe like this and can group it by library and sample columns and create new columns:
df = pd.DataFrame({'barcode': ['b1', 'b2','b1','b2','b1',
'b2','b1','b2'],
'library': ['l1', 'l1','l1','l1','l2', 'l2','l2','l2'],
'sample': ['s1','s1','s2','s2','s1','s1','s2','s2'],
'category': ['c1', 'c2','c1','c2','c1', 'c2','c1','c2'],
'count': [10,21,13,54,51,16,67,88]})
df
barcode library sample category count
0 b1 l1 s1 c1 10
1 b2 l1 s1 c2 21
2 b1 l1 s2 c1 13
3 b2 l1 s2 c2 54
4 b1 l2 s1 c1 51
5 b2 l2 s1 c2 16
6 b1 l2 s2 c1 67
7 b2 l2 s2 c2 88
I used grouping to reduce the dimensions of the df:
grp=df.groupby(['library','sample'])
df=grp.get_group(('l1','s1')).rename(columns={"count":
"l1_s1_count"}).reset_index(drop=True)
df['l1_s2_count']=grp.get_group(('l1','s2'))[['count']].values
df['l2_s1_count']=grp.get_group(('l2','s1'))[['count']].values
df['l2_s2_count']=grp.get_group(('l2','s2'))[['count']].values
df=df.drop(['sample','library'],axis=1)
Result:
barcode category l1_s1_count l1_s2_count l2_s1_count l2_s2_count
0 b1 c1 10 13 51 67
1 b2 c2 21 54 16 88
I think there should be a neater way to do this transformation, such as a pivot table, but my attempts with it failed. Could you please suggest how this could be done with a pivot table?
thanks.

Try the pivot_table function as below.
It produces a result with MultiIndex columns, which then need to be flattened.
df2 = pd.pivot_table(df,index=['barcode', 'category'], columns= ['sample', 'library'], values='count').reset_index()
df2.columns = ["_".join(a) for a in df2.columns.to_flat_index()]
out:
barcode_ category_ s1_l1 s1_l2 s2_l1 s2_l2
0 b1 c1 10 51 13 67
1 b2 c2 21 16 54 88
Or even without values='count':
df2 = pd.pivot_table(df,index=['barcode', 'category'], columns= ['sample', 'library']).reset_index()
df2.columns = ["_".join(a) for a in df2.columns.to_flat_index()]
out:
barcode__ category__ count_s1_l1 count_s1_l2 count_s2_l1 count_s2_l2
0 b1 c1 10 51 13 67
1 b2 c2 21 16 54 88
Use whichever you prefer.
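If the goal is to reproduce the column names from the question (l1_s1_count and so on), one option is to flatten the column tuples before calling reset_index, so that barcode and category keep their plain names. This is only a sketch: the name wide and the choice of aggfunc='sum' (to keep integer counts) are my assumptions, not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'barcode': ['b1', 'b2', 'b1', 'b2', 'b1', 'b2', 'b1', 'b2'],
    'library': ['l1', 'l1', 'l1', 'l1', 'l2', 'l2', 'l2', 'l2'],
    'sample': ['s1', 's1', 's2', 's2', 's1', 's1', 's2', 's2'],
    'category': ['c1', 'c2', 'c1', 'c2', 'c1', 'c2', 'c1', 'c2'],
    'count': [10, 21, 13, 54, 51, 16, 67, 88],
})

# One row per (barcode, category), one column per (library, sample) pair.
# aggfunc='sum' is an assumption; each cell holds a single value here anyway.
wide = df.pivot_table(index=['barcode', 'category'],
                      columns=['library', 'sample'],
                      values='count',
                      aggfunc='sum')

# Flatten the (library, sample) tuples while the index is still separate,
# so no trailing underscores end up on the index column names.
wide.columns = [f"{lib}_{smp}_count" for lib, smp in wide.columns]
wide = wide.reset_index()
print(wide)
```

Flattening before reset_index avoids the barcode_ and category_ labels that appear when the join runs over the index columns as well.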

Related

Pandas sort a subset of column based on conditions

Let's say I have the dataframe:
c1 c2
a1 9
a1 11
a1 12
a1 8
a2 10
a2 14
a2 6
I would like to sort only subset a2 of column c1:
c1|c2
a2 6 <=
a1 9
a2 10 <=
a1 11
a1 12
a2 14 <=
a1 8
Here the traditional sorting with sort_values doesn't seem to work.
Also, c2 is composed of only unique values, so there is no possibility to have repeated values.
Let's say your dataframe is in df. The following keeps only the rows where c1 is 'a2':
df = df[df['c1'] == 'a2']

Summing columns from different dataframe according to some column names

Suppose I have a main dataframe
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 4
1 B1 B2 B3 5
2 C1 C2 C3 6
I also have 3 dataframes
df_1
Cri1 Cri2 Cri3 value
0 A1 A2 A3 1
1 B1 B2 B3 2
df_2
Cri1 Cri2 Cri3 value
0 A1 A2 A3 9
1 C1 C2 C3 10
df_3
Cri1 Cri2 Cri3 value
0 B1 B2 B3 15
1 C1 C2 C3 17
What I want is to add the value from each of these dataframes to total in main_df, matching rows on the Cri columns,
i.e. main_df will become
main_df
Cri1 Cri2 Cri3 total
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
Of course I can do it with a for loop, but eventually I want to apply the method to a large amount of data, say 50000 rows in each dataframe.
Are there other ways to solve it?
Thank you!
First you should align your numeric column names. In this case:
df_main = main_df.rename(columns={'total': 'value'})
Then you have a couple of options.
concat + groupby
You can concatenate and then perform a groupby with sum:
res = pd.concat([df_main, df_1, df_2, df_3])\
.groupby(['Cri1', 'Cri2', 'Cri3']).sum()\
.reset_index()
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14
1 B1 B2 B3 22
2 C1 C2 C3 33
set_index + reduce / add
Alternatively, you can create a list of dataframes indexed by your criteria columns. Then use functools.reduce with pd.DataFrame.add to sum these dataframes.
from functools import reduce
dfs = [df.set_index(['Cri1', 'Cri2', 'Cri3']) for df in [df_main, df_1, df_2, df_3]]
res = reduce(lambda x, y: x.add(y, fill_value=0), dfs).reset_index()
print(res)
Cri1 Cri2 Cri3 value
0 A1 A2 A3 14.0
1 B1 B2 B3 22.0
2 C1 C2 C3 33.0

Pandas str alphabetically then numerically

This is probably a simple question and I just couldn't find the answer. In a pandas DataFrame like the one below, how can the objects be sorted first alphabetically and then numerically?
START:
import pandas as pd
d ={'col1': ['A1','B2','A10','A7','C4','C2','C22','B4']}
df = pd.DataFrame(data=d)
df
col1
0 A1
1 A7
2 A10
3 B2
4 B4
5 C2
6 C4
7 C22
WHAT I WANT TO GET:
col1
0 A1
1 A7
2 A10
3 B2
4 B4
5 C2
6 C4
7 C22
WHAT I GET:
>>>df.sort_values(by='col1')
col1
0 A1
2 A10
1 A7
3 B2
4 B4
5 C2
7 C22
6 C4
Using pandas just to sort a list is overkill:
lot_file = pd.DataFrame()
lot_file['SPOOL'] = ['A39','B34','A3','B37','A6','B18','A48','B15','A47']
group_lots = lot_file.sort_values(by=['SPOOL'])
group_lots['SPOOL'].tolist()
Output:
['A3', 'A39', 'A47', 'A48', 'A6', 'B15', 'B18', 'B34', 'B37']
Or use sorted:
spool_list = ['A39','B34','A3','B37','A6','B18','A48','B15','A47']
sorted(spool_list)
Output:
['A3', 'A39', 'A47', 'A48', 'A6', 'B15', 'B18', 'B34', 'B37']
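Note that both outputs above are in plain string order (A39 before A6), which is not the order the question asks for. A minimal sketch of a letters-then-number sort follows, assuming every label is one or more letters followed by digits. The helper natural_key and the column names letters and number are made up for illustration.

```python
import re
import pandas as pd


def natural_key(label):
    """Split a label such as 'A10' into ('A', 10) so the digits compare as an integer."""
    match = re.fullmatch(r'([A-Za-z]+)(\d+)', label)
    return (match.group(1), int(match.group(2)))


df = pd.DataFrame({'col1': ['A1', 'B2', 'A10', 'A7', 'C4', 'C2', 'C22', 'B4']})

# Plain Python: sorted() with a key function.
as_list = sorted(df['col1'], key=natural_key)
print(as_list)

# pandas: extract the two parts into helper columns, sort on them,
# then reorder the original frame by the resulting index.
parts = df['col1'].str.extract(r'(?P<letters>[A-Za-z]+)(?P<number>\d+)')
parts['number'] = parts['number'].astype(int)
order = parts.sort_values(['letters', 'number']).index
result = df.loc[order].reset_index(drop=True)
print(result)
```

Labels that do not match the assumed pattern would need extra handling; natural_key would fail on them as written.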

add values to data frame for only some specific elements - python

With these two data frames
df1 = pd.DataFrame({'c1':['a','b','c','d'],'c2':[10,20,10,22]})
df2 = pd.DataFrame({'c3':['e','f','a','g','b','c','r','j','d'],'c4':[1,2,3,4,5,6,7,8,9]})
I'm trying to add the values of c4 to df1 for only the elements in c3 that are also present in c1:
>>> df1
c1 c2 c4
a 10 3
b 20 5
c 10 6
d 22 9
Is there a simple way of doing this in pandas?
UPDATE:
If
df2 = pd.DataFrame({'c3':['e','f','a','g','b','c','r','j','d'],'c4':[1,2,3,4,5,6,7,8,9],'c5':[10,20,30,40,50,60,70,80,90]})
how can I achieve this result?
>>> df1
c1 c2 c4 c5
a 10 3 30
b 20 5 50
c 10 6 60
d 22 9 90
Doing:
>>> df1['c1'].map(df2.set_index('c3')['c4','c5'])
gives me a KeyError
You can call map on df1['c1'], passing df2['c4'] indexed by df2['c3']; this performs a lookup:
In [239]:
df1 = pd.DataFrame({'c1':['a','b','c','d'],'c2':[10,20,10,22]})
df2 = pd.DataFrame({'c3':['e','f','a','g','b','c','r','j','d'],'c4':[1,2,3,4,5,6,7,8,9]})
df1['c4'] = df1['c1'].map(df2.set_index('c3')['c4'])
df1
Out[239]:
c1 c2 c4
0 a 10 3
1 b 20 5
2 c 10 6
3 d 22 9
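For the UPDATE case with both c4 and c5, map is less convenient because it looks up one Series at a time, and ['c4','c5'] is most likely being read as a single tuple key, hence the KeyError. One possible approach, sketched here rather than taken from the answer above, is a left merge on c1 and c3. The name merged is mine.

```python
import pandas as pd

df1 = pd.DataFrame({'c1': ['a', 'b', 'c', 'd'], 'c2': [10, 20, 10, 22]})
df2 = pd.DataFrame({
    'c3': ['e', 'f', 'a', 'g', 'b', 'c', 'r', 'j', 'd'],
    'c4': [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'c5': [10, 20, 30, 40, 50, 60, 70, 80, 90],
})

# A left merge keeps every row of df1 and pulls in c4 and c5
# from the df2 row whose c3 equals c1; the key column c3 is then dropped.
merged = (df1.merge(df2, how='left', left_on='c1', right_on='c3')
             .drop(columns='c3'))
print(merged)
```

If c3 contained duplicates, the merge would repeat the matching df1 rows, so this assumes c3 is unique as in the example.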

Flatten DataFrame with multi-index columns

I'd like to convert a Pandas DataFrame that is derived from a pivot table into a row representation as shown below.
This is where I'm at:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'goods': ['a', 'a', 'b', 'b', 'b'],
'stock': [5, 10, 30, 40, 10],
'category': ['c1', 'c2', 'c1', 'c2', 'c1'],
'date': pd.to_datetime(['2014-01-01', '2014-02-01', '2014-01-06', '2014-02-09', '2014-03-09'])
})
# we don't care about year in this example
df['month'] = df['date'].map(lambda x: x.month)
piv = df.pivot_table(["stock"], "month", ["goods", "category"], aggfunc="sum")
piv = piv.reindex(np.arange(piv.index[0], piv.index[-1] + 1))
piv = piv.ffill(axis=0)
piv = piv.fillna(0)
print(piv)
which results in
stock
goods a b
category c1 c2 c1 c2
month
1 5 0 30 0
2 5 10 30 40
3 5 10 10 40
And this is where I want to get to.
goods category month stock
a c1 1 5
a c1 2 0
a c1 3 0
a c2 1 0
a c2 2 10
a c2 3 0
b c1 1 30
b c1 2 0
b c1 3 10
b c2 1 0
b c2 2 40
b c2 3 0
Previously, I used
piv = piv.stack()
piv = piv.reset_index()
print(piv)
to get rid of the multi-index, but because I now pivot on two columns (["goods", "category"]) this results in:
month category stock
goods a b
0 1 c1 5 30
1 1 c2 0 0
2 2 c1 5 30
3 2 c2 10 40
4 3 c1 5 10
5 3 c2 10 40
Does anyone know how I can get rid of the multi-index in the column and get the result into a DataFrame of the exemplified format?
>>> piv.unstack().reset_index().drop('level_0', axis=1)
goods category month 0
0 a c1 1 5
1 a c1 2 5
2 a c1 3 5
3 a c2 1 0
4 a c2 2 10
5 a c2 3 10
6 b c1 1 30
7 b c1 2 30
8 b c1 3 10
9 b c2 1 0
10 b c2 2 40
11 b c2 3 40
Then all you need to do is rename the last column from 0 to stock.
It seems to me that melt (aka unpivot) is very close to what you want to do:
In [11]: pd.melt(piv)
Out[11]:
NaN goods category value
0 stock a c1 5
1 stock a c1 5
2 stock a c1 5
3 stock a c2 0
4 stock a c2 10
5 stock a c2 10
6 stock b c1 30
7 stock b c1 30
8 stock b c1 10
9 stock b c2 0
10 stock b c2 40
11 stock b c2 40
There's a rogue column (stock) that appears here because that column header is constant in piv. If we drop it first, the melt works out of the box:
In [12]: piv.columns = piv.columns.droplevel(0)
In [13]: pd.melt(piv)
Out[13]:
goods category value
0 a c1 5
1 a c1 5
2 a c1 5
3 a c2 0
4 a c2 10
5 a c2 10
6 b c1 30
7 b c1 30
8 b c1 10
9 b c2 0
10 b c2 40
11 b c2 40
Edit: The above actually drops the index; you need to make it a column with reset_index:
In [21]: pd.melt(piv.reset_index(), id_vars=['month'], value_name='stock')
Out[21]:
month goods category stock
0 1 a c1 5
1 2 a c1 5
2 3 a c1 5
3 1 a c2 0
4 2 a c2 10
5 3 a c2 10
6 1 b c1 30
7 2 b c1 30
8 3 b c1 10
9 1 b c2 0
10 2 b c2 40
11 3 b c2 40
I know that the question has already been answered, but for my dataset's multiindex column problem, the provided solution was inefficient. So here I am posting another solution for unpivoting multiindex columns using pandas.
Here is the problem I had: the dataframe has a three-level row MultiIndex and two levels of MultiIndex columns.
The desired format was a long (unpivoted) dataframe, with the column levels turned into regular columns.
When I tried the options given above, the pd.melt function didn't allow more than one column in the var_name attribute. Therefore, every time I tried a melt, I ended up losing some attribute from my table.
The solution I found was to apply a double stacking function over my dataframe.
Before the code, it is worth noting that the desired var_name for my unpivoted table column was "Populacao residente em domicilios particulares ocupados" (see the code below). Therefore, all my value entries should be stacked into this newly created column.
Here is a code snippet:
import pandas as pd
# reading my table
# read_excel takes no sep or encoding argument; squeeze was removed in pandas 2.0
df = pd.read_excel(r'my_table.xls', header=[2,3],
index_col=[0,1,2], na_values=['-', ' ', '*']).fillna(0)
df.index.names = ['COD_MUNIC_7', 'NOME_MUN', 'TIPO']
df.columns.names = ['sexo', 'faixa_etaria']
df.head()
# making the stacking:
df = pd.DataFrame(pd.Series(df.stack(level=0).stack(), name='Populacao residente em domicilios particulares ocupados')).reset_index()
df.head()
Another solution I found was to first stack the dataframe and then apply melt.
Here is the alternative code:
df = df.stack('faixa_etaria').reset_index().melt(id_vars=['COD_MUNIC_7', 'NOME_MUN','TIPO', 'faixa_etaria'],
value_vars=['Homens', 'Mulheres'],
value_name='Populacao residente em domicilios particulares ocupados',
var_name='sexo')
df.head()
Sincerely yours,
Philipe Riskalla Leal
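Since the Excel file from the answer above is not available, here is a small self-contained sketch of the same double-stacking idea on made-up data. The index and column level names follow the answer; the town names, age bands and numbers are hypothetical stand-ins, and the names wide and long are mine.

```python
import pandas as pd

# Hypothetical stand-in for the Excel table: three row index levels and
# two column levels (sexo, faixa_etaria). All values are made up.
index = pd.MultiIndex.from_tuples(
    [(1, 'Town A', 'urban'), (2, 'Town B', 'rural')],
    names=['COD_MUNIC_7', 'NOME_MUN', 'TIPO'])
columns = pd.MultiIndex.from_product(
    [['Homens', 'Mulheres'], ['0-14', '15-64']],
    names=['sexo', 'faixa_etaria'])
wide = pd.DataFrame([[1, 2, 3, 4], [5, 6, 7, 8]], index=index, columns=columns)

value_name = 'Populacao residente em domicilios particulares ocupados'

# Stacking both column levels moves them into the row index, leaving one
# value per (COD_MUNIC_7, NOME_MUN, TIPO, sexo, faixa_etaria) combination.
long = (wide.stack(level=0)
            .stack()
            .rename(value_name)
            .reset_index())
print(long.head())
```

Recent pandas versions may emit a FutureWarning about the stack implementation here; the resulting long table should be the same for data without missing values.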
