Pandas - How to group sub columns of a dataframe?

I create the following dataframe:
Date ProductID SubProductId Value
0 2015-01-02 1 1 11
1 2015-01-02 1 2 12
2 2015-01-02 1 3 NaN
3 2015-01-02 1 4 NaN
4 2015-01-02 2 1 14
5 2015-01-02 2 2 15
6 2015-01-02 2 3 16
7 2015-01-03 1 1 17
8 2015-01-03 1 2 18
9 2015-01-03 1 3 NaN
10 2015-01-03 1 4 21
11 2015-01-03 2 1 20
12 2015-01-03 2 2 21
And then I group the subproducts by products:
df.set_index(['Date','ProductID','SubProductId']).unstack(['ProductID','SubProductId'])
and I would like to get the following:
Value
ProductID 1 2
SubProductId 1 2 3 4 1 2 3
Date
2015-01-02 11.0 12.0 NaN NaN 14.0 15.0 16.0
2015-01-03 17.0 18.0 NaN 21.0 20.0 21.0 NaN
But when I print it, every column that starts with a NaN is pushed to the end:
Value
ProductID 1 2 1
SubProductId 1 2 1 2 3 4 3
Date
2015-01-02 11.0 12.0 14.0 15.0 16.0 NaN NaN
2015-01-03 17.0 18.0 20.0 21.0 NaN 21.0 NaN
How can I keep every sub-column grouped under its corresponding column, even the sub-columns that contain NaN?
NB: Versions used:
Python version: 3.6.0
Pandas version: 0.19.2

If you want ordered column names, you can use sortlevel with axis=1 to sort the column index (sortlevel was deprecated in pandas 0.20 and later removed; on newer versions use sort_index(axis=1) instead):
df1 = df.set_index(['Date','ProductID','SubProductId']).unstack(['ProductID','SubProductId'])
# sort in descending order
df1.sortlevel(axis=1, ascending=False)
# Value
#ProductID 2 1
#SubProductId 3 2 1 4 3 2 1
#Date
#2015-01-02 16.0 15.0 14.0 NaN NaN 12.0 11.0
#2015-01-03 NaN 21.0 20.0 21.0 NaN 18.0 17.0
# sort in ascending order
df1.sortlevel(axis=1, ascending=True)
# Value
#ProductID 1 2
#SubProductId 1 2 3 4 1 2 3
#Date
#2015-01-02 11.0 12.0 NaN NaN 14.0 15.0 16.0
#2015-01-03 17.0 18.0 NaN 21.0 20.0 21.0 NaN
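For newer pandas versions, where sortlevel is no longer available, the same idea can be sketched with sort_index(axis=1). The frame below is rebuilt from the question's data; the variable names wide and wide_sorted are only illustrative.

```python
import numpy as np
import pandas as pd

# Rebuild the question's data (dates kept as plain strings for simplicity).
df = pd.DataFrame({
    "Date": ["2015-01-02"] * 7 + ["2015-01-03"] * 6,
    "ProductID": [1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2],
    "SubProductId": [1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4, 1, 2],
    "Value": [11, 12, np.nan, np.nan, 14, 15, 16, 17, 18, np.nan, 21, 20, 21],
})

wide = (
    df.set_index(["Date", "ProductID", "SubProductId"])
      .unstack(["ProductID", "SubProductId"])
)

# Sort the column MultiIndex so sub-products sit under their product.
wide_sorted = wide.sort_index(axis=1)
print(wide_sorted)
```

Passing ascending=False to sort_index reverses the order, mirroring the descending example above.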

Related

Get new column with groupby and return the maximum to entire group

I want to add a new column with the maximum next_crossing_down for the entire x street.
I have this:
cars = pd.DataFrame({'x': [1,1,1,1,1,1,1,2,2,2,2],
'y': [7,None,13,14,22,None,9,13,14,15,16],
'next_crossing_down': [5,None,10,10,20,None,5,10,10,10,15]})
x y next_crossing_down
0 1 7.0 5.0
1 1 NaN NaN
2 1 13.0 10.0
3 1 14.0 10.0
4 1 22.0 20.0
5 1 NaN NaN
6 1 9.0 5.0
7 2 13.0 10.0
8 2 14.0 10.0
9 2 15.0 10.0
10 2 16.0 15.0
And I would like this:
x y next_crossing_down next_crossing_down_max
0 1 7.0 5.0 20.0
1 1 NaN NaN NaN
2 1 13.0 10.0 20.0
3 1 14.0 10.0 20.0
4 1 22.0 20.0 20.0
5 1 NaN NaN NaN
6 1 9.0 5.0 20.0
7 2 13.0 10.0 15.0
8 2 14.0 10.0 15.0
9 2 15.0 10.0 15.0
10 2 16.0 15.0 15.0
This is the closest that I have come. I get the right numbers, only not in the entire x_street.
cars['next_crossing_down_max']= cars.groupby(['x'])['next_crossing_down'].max()
x y next_crossing_down next_crossing_down_max
0 1 7.0 5.0 NaN
1 1 NaN NaN 20.0
2 1 13.0 10.0 15.0
3 1 14.0 10.0 NaN
4 1 22.0 20.0 NaN
5 1 NaN NaN NaN
6 1 9.0 5.0 NaN
7 2 13.0 10.0 NaN
8 2 14.0 10.0 NaN
9 2 15.0 10.0 NaN
10 2 16.0 15.0 NaN

Are you looking for transform on the groupby (GroupBy.transform)? It returns a result with the same index as the original rows:
import numpy as np
cars['next_crossing_down_max']= cars.groupby(['x'])['next_crossing_down'].transform('max')
cars['next_crossing_down_max'] = np.where(cars['next_crossing_down'].isnull(),
                                          np.nan,
                                          cars['next_crossing_down_max'])
Output
cars
Out[18]:
x y next_crossing_down next_crossing_down_max
0 1 7.0 5.0 20.0
1 1 NaN NaN NaN
2 1 13.0 10.0 20.0
3 1 14.0 10.0 20.0
4 1 22.0 20.0 20.0
5 1 NaN NaN NaN
6 1 9.0 5.0 20.0
7 2 13.0 10.0 15.0
8 2 14.0 10.0 15.0
9 2 15.0 10.0 15.0
10 2 16.0 15.0 15.0
Alternatively, you could use mask instead of np.where, which gives the same result but is a bit slower (thanks to @Anky):
>>> cars.groupby("x")['next_crossing_down'].transform('max').mask(cars['next_crossing_down'].isna())
Out[19]:
0 20.0
1 NaN
2 20.0
3 20.0
4 20.0
5 NaN
6 20.0
7 15.0
8 15.0
9 15.0
10 15.0
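To see why the attempt in the question put the maxima on rows 1 and 2, it may help to compare the two results side by side. This is a small sketch on the question's data; per_group and per_row are illustrative names.

```python
import pandas as pd

cars = pd.DataFrame({
    "x": [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
    "y": [7, None, 13, 14, 22, None, 9, 13, 14, 15, 16],
    "next_crossing_down": [5, None, 10, 10, 20, None, 5, 10, 10, 10, 15],
})

# Aggregation: one value per group, indexed by the group keys (1 and 2).
# Assigning this to a column aligns on index labels, so only rows 1 and 2 are filled.
per_group = cars.groupby("x")["next_crossing_down"].max()

# Transform: one value per original row, indexed like cars itself.
per_row = cars.groupby("x")["next_crossing_down"].transform("max")

# Keep NaN wherever the source value is missing, as in the desired output.
cars["next_crossing_down_max"] = per_row.where(cars["next_crossing_down"].notna())
print(cars)
```

where(cond) keeps values where the condition is true, so it is the mirror image of the mask call shown above.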

Convert column vector into multi-column matrix

I have a column vector with, say, 30 values (1-30). I would like to reshape this vector into a matrix with 5 values in the first column, 10 values in the second and 15 values in the third. How would I implement this using Pandas or NumPy?
import numpy as np
import pandas as pd
#Create data
df = pd.DataFrame(np.linspace(1,30,30))
print(df)
1
2
:
28
29
30
In order to get something like this:
# Manipulate the column vector to make columns where the first column has 5
# the second column has 10 and the last column has 15 values
'T1' 'T2' 'T3'
1 6 16
2 7 17
3 8 18
4 9 19
5 10 20
NA 11 21
NA 12 22
NA 13 23
NA 14 24
NA 15 25
NA NA 26
NA NA 27
NA NA 28
NA NA 29
NA NA 30

It took a little time to work out which series this is: it is the triangular series, just a modified one.
tri = lambda x:int((0.25+2*x)**0.5-0.5)
This would give results like:
0 1 1 2 2 2 3 3 3 3 4 4 4 4 4 5 5 5 5 5 5 ...
And after the modification:
modtri = lambda x:int((0.25+2*(x//5))**0.5-0.5)
0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ...
So each occurrence in the normal triangular series repeats 5 times.
The modtri function above maps each index, starting from 0, directly to the appropriate group id.
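As a quick sanity check of that mapping, the group sizes can be counted directly for the 30 indices in the question. This is only a verification sketch; sizes is an illustrative name.

```python
from collections import Counter

N = 5  # the increment value, as in the answer

# Same mapping as in the answer: index -> group id.
modtri = lambda x: int((0.25 + 2 * (x // N)) ** 0.5 - 0.5)

# Count how many of the indices 0..29 fall into each group.
sizes = Counter(modtri(i) for i in range(30))
print(dict(sizes))  # {0: 5, 1: 10, 2: 15}
```

The counts 5, 10 and 15 match the column lengths requested in the question.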
With that mapping, this does the job:
df[0].groupby(modtri).apply(lambda x: pd.Series(x.values)).unstack().T
Full execution:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.linspace(1,30,30))
N = 5 #the increment value
modtri = lambda x:int((0.25+2*(x//N))**0.5-0.5)
df2 = df[0].groupby(modtri).apply(lambda x: pd.Series(x.values)).unstack().T
df2.rename(columns={0: "T1", 1: "T2",2:"T3"},inplace=True)
print(df2)
Output:
T1 T2 T3
0 1.0 6.0 16.0
1 2.0 7.0 17.0
2 3.0 8.0 18.0
3 4.0 9.0 19.0
4 5.0 10.0 20.0
5 NaN 11.0 21.0
6 NaN 12.0 22.0
7 NaN 13.0 23.0
8 NaN 14.0 24.0
9 NaN 15.0 25.0
10 NaN NaN 26.0
11 NaN NaN 27.0
12 NaN NaN 28.0
13 NaN NaN 29.0
14 NaN NaN 30.0
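If the column sizes are known up front, a different route (not the method used in this answer) is to split the array at the cumulative sizes and let pandas pad the shorter columns. A minimal sketch, assuming the sizes 5, 10 and 15 from the question; sizes, chunks and wide are illustrative names.

```python
import numpy as np
import pandas as pd

values = np.arange(1, 31)   # the 30 values from the question
sizes = [5, 10, 15]         # assumed column lengths

# Split positions are the running totals, excluding the final one: [5, 15].
split_points = np.cumsum(sizes)[:-1]
chunks = np.split(values, split_points)

# Series of different lengths are aligned on their index, so the
# shorter columns are padded with NaN at the bottom.
wide = pd.DataFrame({f"T{i + 1}": pd.Series(chunk) for i, chunk in enumerate(chunks)})
print(wide)
```

This avoids the closed-form group function, at the cost of having to list the sizes explicitly.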

Try slicing the column and resetting the index of each slice:
df['T1'] = df[0][0:5]
df['T2'] = df[0][5:15].reset_index(drop=True)
df['T3'] = df[0][15:].reset_index(drop=True)
Original data before operation:
df = pd.DataFrame(np.linspace(1,30,30))
print(df)
0
0 1.0
1 2.0
2 3.0
3 4.0
4 5.0
5 6.0
6 7.0
7 8.0
8 9.0
9 10.0
10 11.0
11 12.0
12 13.0
13 14.0
14 15.0
15 16.0
16 17.0
17 18.0
18 19.0
19 20.0
20 21.0
21 22.0
22 23.0
23 24.0
24 25.0
25 26.0
26 27.0
27 28.0
28 29.0
29 30.0
Running the new code:
df['T1'] = df[0][0:5]
df['T2'] = df[0][5:15].reset_index(drop=True)
df['T3'] = df[0][15:].reset_index(drop=True)
print(df)
0 T1 T2 T3
0 1.0 1.0 6.0 16.0
1 2.0 2.0 7.0 17.0
2 3.0 3.0 8.0 18.0
3 4.0 4.0 9.0 19.0
4 5.0 5.0 10.0 20.0
5 6.0 NaN 11.0 21.0
6 7.0 NaN 12.0 22.0
7 8.0 NaN 13.0 23.0
8 9.0 NaN 14.0 24.0
9 10.0 NaN 15.0 25.0
10 11.0 NaN NaN 26.0
11 12.0 NaN NaN 27.0
12 13.0 NaN NaN 28.0
13 14.0 NaN NaN 29.0
14 15.0 NaN NaN 30.0
15 16.0 NaN NaN NaN
16 17.0 NaN NaN NaN
17 18.0 NaN NaN NaN
18 19.0 NaN NaN NaN
19 20.0 NaN NaN NaN
20 21.0 NaN NaN NaN
21 22.0 NaN NaN NaN
22 23.0 NaN NaN NaN
23 24.0 NaN NaN NaN
24 25.0 NaN NaN NaN
25 26.0 NaN NaN NaN
26 27.0 NaN NaN NaN
27 28.0 NaN NaN NaN
28 29.0 NaN NaN NaN
29 30.0 NaN NaN NaN

combine multiple dataframes in a csv file separating each with an empty row

How can I separate each dataframe with an empty row?
I've combined them using this snippet:
frames1 = [df4, df5, df6]
Summary = pd.concat(frames1)
So how can I split them with an empty row?

You can use the example below.
Create test dfs:
df1 = pd.DataFrame(np.random.randint(0,20,20).reshape(5,4),columns=list('ABCD'))
df2 = pd.DataFrame(np.random.randint(0,20,20).reshape(5,4),columns=list('ABCD'))
df3 = pd.DataFrame(np.random.randint(0,20,20).reshape(5,4),columns=list('ABCD'))
dfs=[df1,df2,df3]
Solution (note that DataFrame.append was removed in pandas 2.0, so this form needs an earlier version):
pd.concat([df.append(pd.Series(), ignore_index=True) for df in dfs])
A B C D
0 17.0 16.0 15.0 7.0
1 13.0 6.0 12.0 18.0
2 0.0 2.0 10.0 17.0
3 8.0 13.0 10.0 17.0
4 4.0 18.0 8.0 19.0
5 NaN NaN NaN NaN
0 14.0 0.0 13.0 12.0
1 10.0 3.0 6.0 3.0
2 15.0 10.0 15.0 3.0
3 9.0 16.0 11.0 4.0
4 5.0 7.0 6.0 2.0
5 NaN NaN NaN NaN
0 10.0 18.0 13.0 12.0
1 1.0 6.0 10.0 0.0
2 2.0 19.0 4.0 18.0
3 4.0 3.0 9.0 16.0
4 16.0 6.0 5.0 6.0
5 NaN NaN NaN NaN
For horizontal stack:
pd.concat([df.assign(test=np.nan) for df in dfs],axis=1)
A B C D test A B C D test A B C D test
0 17 16 15 7 NaN 14 0 13 12 NaN 10 18 13 12 NaN
1 13 6 12 18 NaN 10 3 6 3 NaN 1 6 10 0 NaN
2 0 2 10 17 NaN 15 10 15 3 NaN 2 19 4 18 NaN
3 8 13 10 17 NaN 9 16 11 4 NaN 4 3 9 16 NaN
4 4 18 8 19 NaN 5 7 6 2 NaN 16 6 5 6 NaN
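On pandas 2.0 and later, where DataFrame.append is gone, the vertical stack above could be sketched with reindex and pd.concat instead. The helper name with_blank_row is hypothetical, and the test frames use a seeded generator so the run is repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dfs = [
    pd.DataFrame(rng.integers(0, 20, size=(5, 4)), columns=list("ABCD"))
    for _ in range(3)
]

def with_blank_row(frame):
    """Return a copy of frame with one all-NaN row added at the bottom."""
    frame = frame.reset_index(drop=True)
    # Reindexing to one extra position creates the empty row.
    return frame.reindex(range(len(frame) + 1))

summary = pd.concat([with_blank_row(frame) for frame in dfs])
print(summary)
```

Writing summary with to_csv then produces a row of empty fields after each frame, since NaN is written as an empty string by default.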

Is this what you want?
fname = 'test2.csv'
frames1 = [df4, df5, df6]
with open(fname, mode='a+', newline='') as f:
    for df in frames1:
        # write through the open handle so each blank line lands right after its frame
        df.to_csv(f, header=f.tell() == 0)
        f.write('\n')
test2.csv:
,a,b,c
0,0,1,2
1,3,4,5
2,6,7,8
0,0,1,2
1,3,4,5
2,6,7,8
0,0,1,2
1,3,4,5
2,6,7,8
f.tell() == 0 checks whether the file handle is at the beginning of the file, i.e. at position 0. If it is, the header is written; otherwise it is not.
NOTE: I have used same values for all the dfs, that's why all the results are similar.
For columns:
fname = 'test3.csv'
frames1 = [df1, df2, df3]
Summary = pd.concat([df.assign(**{' ':' '}) for df in frames1], axis=1)
Summary.to_csv(fname)
test3.csv:
,a,b,c, ,a,b,c, ,a,b,c,
0,0,1,2, ,0,1,2, ,0,1,2,
1,3,4,5, ,3,4,5, ,3,4,5,
2,6,7,8, ,6,7,8, ,6,7,8,
But the columns will not be equally spaced. If you save with header=False:
test3.csv:
0,0,1,2, ,0,1,2, ,0,1,2,
1,3,4,5, ,3,4,5, ,3,4,5,
2,6,7,8, ,6,7,8, ,6,7,8,

Fill in missing dates of groupby

Imagine I have a dataframe that looks like:
ID DATE VALUE
1 31-01-2006 5
1 28-02-2006 5
1 31-05-2006 10
1 30-06-2006 11
2 31-01-2006 5
2 31-02-2006 5
2 31-03-2006 5
2 31-04-2006 5
As you can see this is panel data with multiple entries on the same date for different IDs. What I want to do is fill in missing dates for each ID. You can see that for ID "1" there is a jump in months between the second and third entry.
I would like a dataframe that looks like:
ID DATE VALUE
1 31-01-2006 5
1 28-02-2006 5
1 31-03-2006 NA
1 30-04-2006 NA
1 31-05-2006 10
1 30-06-2006 11
2 31-01-2006 5
2 31-02-2006 5
2 31-03-2006 5
2 31-04-2006 5
I have no idea how to do this, since I cannot index by date: there are duplicate dates.

One way is to use pivot_table and then unstack:
In [11]: df.pivot_table("VALUE", "DATE", "ID")
Out[11]:
ID 1 2
DATE
28-02-2006 5.0 NaN
30-06-2006 11.0 NaN
31-01-2006 5.0 5.0
31-02-2006 NaN 5.0
31-03-2006 NaN 5.0
31-04-2006 NaN 5.0
31-05-2006 10.0 NaN
In [12]: df.pivot_table("VALUE", "DATE", "ID").unstack().reset_index()
Out[12]:
ID DATE 0
0 1 28-02-2006 5.0
1 1 30-06-2006 11.0
2 1 31-01-2006 5.0
3 1 31-02-2006 NaN
4 1 31-03-2006 NaN
5 1 31-04-2006 NaN
6 1 31-05-2006 10.0
7 2 28-02-2006 NaN
8 2 30-06-2006 NaN
9 2 31-01-2006 5.0
10 2 31-02-2006 5.0
11 2 31-03-2006 5.0
12 2 31-04-2006 5.0
13 2 31-05-2006 NaN
An alternative, perhaps slightly more efficient, way is to reindex with MultiIndex.from_product:
In [21]: df1 = df.set_index(['ID', 'DATE'])
In [22]: df1.reindex(pd.MultiIndex.from_product(df1.index.levels))
Out[22]:
VALUE
1 28-02-2006 5.0
30-06-2006 11.0
31-01-2006 5.0
31-02-2006 NaN
31-03-2006 NaN
31-04-2006 NaN
31-05-2006 10.0
2 28-02-2006 NaN
30-06-2006 NaN
31-01-2006 5.0
31-02-2006 5.0
31-03-2006 5.0
31-04-2006 5.0
31-05-2006 NaN

Another solution is to convert the incomplete data to a "wide" form (a table; this will create cells for the missing values) and then back to a "tall" form.
df.set_index(['ID','DATE']).unstack().stack(dropna=False).reset_index()
# ID DATE VALUE
#0 1 28-02-2006 5.0
#1 1 30-06-2006 11.0
#2 1 31-01-2006 5.0
#3 1 31-02-2006 NaN
#4 1 31-03-2006 NaN
#5 1 31-04-2006 NaN
#6 1 31-05-2006 10.0
#7 2 28-02-2006 NaN
#....
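The approaches above give every ID the full set of dates seen anywhere in the data, and the string dates sort alphabetically rather than chronologically. If the goal is only to fill the gaps between each ID's own first and last month, one possible sketch is to work with monthly periods per group. It assumes valid calendar dates (the sample's 31-02-2006 and 31-04-2006 are replaced here), and fill_months is a hypothetical helper.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 1, 2, 2, 2, 2],
    "DATE": ["31-01-2006", "28-02-2006", "31-05-2006", "30-06-2006",
             "31-01-2006", "28-02-2006", "31-03-2006", "30-04-2006"],
    "VALUE": [5, 5, 10, 11, 5, 5, 5, 5],
})

# Represent each month-end date as a monthly period.
df["MONTH"] = pd.to_datetime(df["DATE"], format="%d-%m-%Y").dt.to_period("M")

def fill_months(group):
    """Reindex one ID's values onto every month between its first and last."""
    full = pd.period_range(group["MONTH"].min(), group["MONTH"].max(), freq="M")
    return group.set_index("MONTH")["VALUE"].reindex(full)

pieces = {key: fill_months(group) for key, group in df.groupby("ID")}
filled = (
    pd.concat(pieces)
      .rename_axis(["ID", "MONTH"])
      .reset_index(name="VALUE")
)
print(filled)
```

Here ID 1 gains rows for 2006-03 and 2006-04 with NaN values, while ID 2 is left unchanged.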

convert specific rows of pandas dataframe into multiindex

Here is my DataFrame:
0 1 2
0 0 0.0 20.0 NaN
1 1.0 21.0 NaN
2 2.0 22.0 NaN
ID NaN NaN 11111.0
Year NaN NaN 2011.0
1 0 3.0 23.0 NaN
1 4.0 24.0 NaN
2 5.0 25.0 NaN
3 6.0 26.0 NaN
ID NaN NaN 11111.0
Year NaN NaN 2012.0
I want to convert the 'ID' and 'Year' rows into the DataFrame index, with 'ID' as level=0 and 'Year' as level=1. I tried using stack() but still cannot figure it out.
Edit: my desired output should look like this:
0 1
11111 2011 0 0.0 20.0
1 1.0 21.0
2 2.0 22.0
2012 0 3.0 23.0
1 4.0 24.0
2 5.0 25.0
3 6.0 26.0

This should work:
df1 = df.loc[pd.IndexSlice[:, ['ID', 'Year']], '2']
dfs = df1.unstack()
dfi = df1.index
dfn = df.drop(dfi).drop('2', axis=1).unstack()
dfn.set_index([dfs.ID, dfs.Year]).stack()
