joining two dataframes with multilevel indices in pandas

joining two dataframes with multilevel indices in pandas - python

I have two dataframes like following with multilevel indices:
df1:
Total_Consumption
2010 2011 2012
1 8544.357 5133.553 5279.884
2 8581.545 6091.454 4323.611
3 4479.319 2784.283 1948.262
4 5493.114 3633.187 3516.346
5 5582.544 3138.680 3995.311
6 9877.752 7798.371 8505.287
7 5137.488 4109.556 3301.129
8 13038.200 8853.721 8525.272
df2:
Charging Capacity
2010 2011 2012
1 7.989 4.752 5.801
2 11.349 22.092 10.967
3 6.968 6.803 9.760
4 5.191 7.294 9.199
5 0.201 -1.204 10.488
6 14.598 13.077 17.004
7 5.134 12.945 8.970
8 44.680 23.607 24.395
I tried to concatenate these two dataframes via:
l1=[df1,df2]
pd.concat(l1)
But I get the following output. Why do I get NaN for df2 dataframe? Is there a way to join the two dataframe with multilevel indices properly in pandas?
Charging Capacity Total_Consumption
2010 2011 2012 2010 2011 2012
1 NaN NaN NaN 8544.357 5133.553 5279.884
2 NaN NaN NaN 8581.545 6091.454 4323.611
3 NaN NaN NaN 4479.319 2784.283 1948.262
4 NaN NaN NaN 5493.114 3633.187 3516.346
5 NaN NaN NaN 5582.544 3138.680 3995.311
6 NaN NaN NaN 9877.752 7798.371 8505.287
7 NaN NaN NaN 5137.488 4109.556 3301.129
8 NaN NaN NaN 13038.200 8853.721 8525.272

Use axis=1:
pd.concat([df1, df], axis=1)
Output:
Total_Consumption Charging Capacity
2010 2011 2012 2010 2011 2012
1 8544.357 5133.553 5279.884 7.989 4.752 5.801
2 8581.545 6091.454 4323.611 11.349 22.092 10.967
3 4479.319 2784.283 1948.262 6.968 6.803 9.760
4 5493.114 3633.187 3516.346 5.191 7.294 9.199
5 5582.544 3138.680 3995.311 0.201 -1.204 10.488
6 9877.752 7798.371 8505.287 14.598 13.077 17.004
7 5137.488 4109.556 3301.129 5.134 12.945 8.970
8 13038.200 8853.721 8525.272 44.680 23.607 24.395

Related

Balancing a panel data for regression

I have a dataframe:
df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 3], "city": ['abc', 'abc', 'abc', 'def10', 'def10', 'ghk'] ,"year": [2008, 2009, 2010, 2008, 2010,2009], "value": [10,20,30,10,20,30]})
id city year value
0 1 abc 2008 10
1 1 abc 2009 20
2 1 abc 2010 30
3 2 def10 2008 10
4 2 def10 2010 20
5 3 ghk 2009 30
I wanna create a balanced data such that:
id city year value
0 1 abc 2008 10
1 1 abc 2009 20
2 1 abc 2010 30
3 2 def10 2008 10
4 2 def10 2009 NaN
5 2 def10 2010 20
6 3 ghk 2008 NaN
7 3 ghk 2009 30
8 3 ghk 2009 NaN
if I use the following code:
df = df.set_index('id')
balanced = (id.set_index('year',append=True).reindex(pd.MultiIndex.from_product([df.index,range(df.year.min(),df.year.max()+1)],names=['frs_id','year'])).reset_index(level=1))
This gives me following error:
cannot handle a non-unique multi-index!

You are close to the solution. You can amend your code slightly as follows:
idx = pd.MultiIndex.from_product([df['id'].unique(),range(df.year.min(),df.year.max()+1)],names=['id','year'])
df2 = df.set_index(['id', 'year']).reindex(idx).reset_index()
df2['city'] = df2.groupby('id')['city'].ffill().bfill()
Changes to your codes:
Create the MultiIndex by using unique values of id instead of from index
Set index on both id and year before reindex()
Fill-in the NaN values of column city by non-NaN entries of the same id
Result:
print(df2)
id year city value
0 1 2008 abc 10.0
1 1 2009 abc 20.0
2 1 2010 abc 30.0
3 2 2008 def10 10.0
4 2 2009 def10 NaN
5 2 2010 def10 20.0
6 3 2008 ghk NaN
7 3 2009 ghk 30.0
8 3 2010 ghk NaN
Optionally, you can re-arrange the column sequence, if you like:
df2.insert(2, 'year', df2.pop('year'))
print(df2)
id city year value
0 1 abc 2008 10.0
1 1 abc 2009 20.0
2 1 abc 2010 30.0
3 2 def10 2008 10.0
4 2 def10 2009 NaN
5 2 def10 2010 20.0
6 3 ghk 2008 NaN
7 3 ghk 2009 30.0
8 3 ghk 2010 NaN
Edit
You can also do it using stack() and unstack() without using reindex(), as follows:
(df.set_index(['id', 'city', 'year'], append=True)
.unstack()
.groupby(level=[1, 2]).max()
.stack(dropna=False)
).reset_index()
Output:
id city year value
0 1 abc 2008 10.0
1 1 abc 2009 20.0
2 1 abc 2010 30.0
3 2 def10 2008 10.0
4 2 def10 2009 NaN
5 2 def10 2010 20.0
6 3 ghk 2008 NaN
7 3 ghk 2009 30.0
8 3 ghk 2010 NaN

Pivot the table and stack year without drop NaN values:
>>> df.pivot(["id", "city"], "year", "value") \
.stack(dropna=False) \
.rename("value") \
.reset_index()
id city year value
0 1 abc 2008 10.0
1 1 abc 2009 20.0
2 1 abc 2010 30.0
3 2 def10 2008 10.0
4 2 def10 2009 NaN
5 2 def10 2010 20.0
6 3 ghk 2008 NaN
7 3 ghk 2009 30.0
8 3 ghk 2010 NaN
Edit: case of duplicate entries
I slightly modified your original dataframe:
df = pd.DataFrame({"id": [1,1,1,2,2,3,3], "city": ['abc','abc','abc','def10','def10','ghk','ghk'], "year": [2008,2009,2010,2008,2010,2009,2009], "value": [10,20,30,10,20,30,40]})
>>> df
id city year value
0 1 abc 2008 10
1 1 abc 2009 20
2 1 abc 2010 30
3 2 def10 2008 10
4 2 def10 2010 20
5 3 ghk 2009 30 # The problem is here
6 3 ghk 2009 40 # same (id, city, year)
You need to take a decision. Do you want to keep the row 5 or 6 or apply a math function (mean, sum, ...). Imagine you want the mean for (3, ghk, 2009):
>>> df.pivot_table(index=["id", "city"], columns="year", values="value", aggfunc="mean") \
.stack(dropna=False) \
.rename("value") \
.reset_index()
id city year value
0 1 abc 2008 10.0
1 1 abc 2009 20.0
2 1 abc 2010 30.0
3 2 def10 2008 10.0
4 2 def10 2009 NaN
5 2 def10 2010 20.0
6 3 ghk 2008 NaN
7 3 ghk 2009 35.0 # <- mean of (30, 40)
8 3 ghk 2010 NaN

Merge if possible otherwise concat pandas

I have 3 df that I would like to combine where the first 3 columns are same data if exists, and columns afterwards are new from each of the df. For example df[3:] are different, than df2[3:]
I would like to merge them if they have the same unique identifier, otherwise I would like to concat.
df1
ID A B 2009 2010
1 A B 2 3
2 A C 2 2
3 A B 3 3
df2
ID A B 2011 2012
2 A C 2 2
3 A C 3 4
5 A B 8 9
df3
ID A B 2013 2014
2 A C 2 3
4 A E 3 4
5 A B 8 9
result df
ID A B 2009 2010 2011 2012 2013 2014
1 A B 2 3. 2. 3.
2 A C 2 2. 2. 2. 2. 3
3 A C 3 3. 3. 4.
4 A E 3. 4
5 A B 8 9 8. 9
Edit: fixed df data. Secondly One issue I notice is when I merge, my data A, and B, are duplicated, A_X, A_Y, A_Z, B_X, B_Y, B_Z
thank you in advance

Try pd.concat([df.set_index('ID') for df in [df1, df2, df3]], axis=1).reset_index()
The list comprehension sets ID as the index of each dataframe. Then we concatenate horizontally. Horizontal concatenation tries to match up the indexes where possible, otherwise it adds rows. Finally, we reset the index.

There is something wrong with the result.
But the code for merging will be like this:
from functools import reduce
import pandas as pd
dfs = [df1,df2,df3]
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['ID'],
how='outer'), dfs)
df_merged:
ID 2009 2010 2011 2012 2013 2014
0 1 2.0 3.0 2.0 3.0 NaN NaN
1 2 3.0 4.0 3.0 4.0 2.0 3.0
2 3 4.0 5.0 4.0 5.0 NaN NaN
3 4 NaN NaN NaN NaN 3.0 4.0
4 5 NaN NaN NaN NaN 8.0 9.0
Edit:
Just use on=['ID', 'A', 'B']
output:
ID A B 2009 2010 2011 2012 2013 2014
0 1 A B 2.0 3.0 NaN NaN NaN NaN
1 2 A C 2.0 2.0 2.0 2.0 2.0 3.0
2 3 A B 3.0 3.0 NaN NaN NaN NaN
3 3 A C NaN NaN 3.0 4.0 NaN NaN
4 5 A B NaN NaN 8.0 9.0 8.0 9.0
5 4 A E NaN NaN NaN NaN 3.0 4.0

Faster way to construct a multiindex of dates in Pandas

I have a Pandas dataframe, df. Here are the first five rows:
Id StartDate EndDate
0 0 2015-08-11 2018-07-13
1 1 2014-02-15 2016-01-25
2 2 2014-12-20 NaT
3 3 2015-01-09 2015-01-14
4 4 2014-07-20 NaT
I want to construct a new dataframe, df2. df2 should have a row for each month between StartDate and EndDate, inclusive, for each Id in df1. For example, since the first row of df1 has StartDate in August 2015 and EndDate in July 2018, df2 should have rows corresponding to August 2015, September 2015, ..., July 2018. If an Id in df1 has no EndDate, we will take it to be June 2019.
I would like df2 to use a multiindex with the first level being the corresponding Id in df1, the second level being the year, and the third level being the month. For example, if the above five rows were all of df1, then df2 should look like:
Id Year Month
0 2015 8
9
10
11
12
2016 1
2
3
4
5
6
7
8
9
10
11
12
2017 1
2
3
4
5
6
7
8
9
10
11
12
2018 1
... ... ...
4 2017 1
2
3
4
5
6
7
8
9
10
11
12
2018 1
2
3
4
5
6
7
8
9
10
11
12
2019 1
2
3
4
5
6
The following code does the trick, but takes about 20 seconds on my decent laptop for 10k Ids. Can I be more efficient somehow?
import numpy as np
def build_multiindex_for_id_(id_, enroll_month, enroll_year, cancel_month, cancel_year):
# Given id_ and start/end dates,
# returns 2d array to be converted to multiindex.
# Each row of returned array represents a month/year
# between enroll date and cancel date inclusive.
year = enroll_year
month = enroll_month
multiindex_array = [[],[],[]]
while (month != cancel_month) or (year != cancel_year):
multiindex_array[0].append(id_)
multiindex_array[1].append(year)
multiindex_array[2].append(month)
month += 1
if month == 13:
month = 1
year += 1
multiindex_array[0].append(id_)
multiindex_array[1].append(year)
multiindex_array[2].append(month)
return np.array(multiindex_array)
# Begin by constructing array for first id.
array_for_multiindex = build_multiindex_for_id_(0,8,2015,7,2018)
# Append the rest of the multiindices for the remaining ids.
for _, row in df.loc[1:].fillna(pd.to_datetime('2019-06-30')).iterrows():
current_id_array = build_multiindex_for_id_(
row['Id'],
row['StartDate'].month,
row['StartDate'].year,
row['EndDate'].month,
row['EndDate'].year)
array_for_multiindex = np.append(array_for_multiindex, current_id_array, axis=1)
df2_index = pd.MultiIndex.from_arrays(array_for_multiindex).rename(['Id','Year','Month'])
pd.DataFrame(index=df2_index)

Here's my approach after several trial and error:
(df.melt(id_vars='Id')
.fillna(pd.to_datetime('June 2019'))
.set_index('value')
.groupby('Id').apply(lambda x: x.asfreq('M').ffill())
.reset_index('value')
.assign(year=lambda x: x['value'].dt.year,
month=lambda x: x['value'].dt.month)
.set_index(['year','month'], append=True)
)
Output:
value Id variable
Id year month
0 2015 8 2015-08-31 NaN NaN
9 2015-09-30 NaN NaN
10 2015-10-31 NaN NaN
11 2015-11-30 NaN NaN
12 2015-12-31 NaN NaN
2016 1 2016-01-31 NaN NaN
2 2016-02-29 NaN NaN
3 2016-03-31 NaN NaN
4 2016-04-30 NaN NaN
5 2016-05-31 NaN NaN
6 2016-06-30 NaN NaN
7 2016-07-31 NaN NaN
8 2016-08-31 NaN NaN
9 2016-09-30 NaN NaN
10 2016-10-31 NaN NaN

Reformatting a date-frame into a new output format

I have a the output from a pivot table in dataframe (df) which is that looks like:
Year Month sum
2005 10 -1.596817e+05
11 -2.521054e+05
12 5.981900e+05
2006 1 8.686413e+05
2 1.673673e+06
3 1.218341e+06
4 4.131970e+05
5 1.090499e+05
6 1.495985e+06
7 1.736795e+06
8 1.155071e+05
...
9 7.847369e+05
10 -5.564139e+04
11 -7.435682e+05
12 1.073361e+05
2017 1 3.427652e+05
2 3.574432e+05
3 5.026018e+04
Is there a way to reformat the dataframe so the output to console would look like:
Month 1 2 3 4 5 6 7 8 9 10 11 12
Year
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
All the values would be populated in the new table as well.

Use unstack:
In [18]: df['sum'].unstack('Month')
Out[18]:
Month 1 2 3 4 5 6 7 8 9 10 11 12
Year
2005.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN -159681.70 -252105.4 598190.0
2006.0 868641.3 1673673.0 1218341.00 413197.0 109049.9 1495985.0 1736795.0 115507.1 784736.9 -55641.39 -743568.2 107336.1
2017.0 342765.2 357443.2 50260.18 NaN NaN NaN NaN NaN NaN NaN NaN NaN

Try df.pivot(index='year', columns='month', values='sum').
To fill you empty (if empty) year column use df.fillna(method='ffill') before the above.
Reading the answer above it should be mentioned that my suggestion works in cases where year and month aren't the index.

Expand panel data python

I am trying to expand the following data. I am a Stata user, and my problem can be fix by the command "fillin" in stata, now i am trying to rewrite this command in python and couldn't found any command that works.
For example: , transform this data frame:
(my dataframe is bigger than the example given, the example is just to illustrate what i want to do)
id year X Y
1 2008 10 20
1 2010 15 25
2 2011 2 4
2 2012 3 6
to this one
id year X Y
1 2008 10 20
1 2009 . .
1 2010 15 20
1 2011 . .
1 2012 . .
2 2008 . .
2 2009 . .
2 2010 . .
2 2011 2 4
2 2012 3 6
thank you, and sorry for my english

This can be done by using .loc[]
from itertools import product
import pandas as pd
df = pd.DataFrame([[1,2008,10,20],[1,2010,15,25],[2,2011,2,4],[2,2012,3,6]],columns=['id','year','X','Y'])
df = df.set_index(['id','year'])
# All combinations of index
#idx = list(product(df.index.levels[0], df.index.levels[1]))
idx = list(product(range(1,3), range(2008,2013)))
df.loc[idx]

Create a new multi-index from the dataframe and then reindex
years = np.tile(np.arange(df.year.min(), df.year.max()+1,1) ,2)
ids = np.repeat(df.id.unique(), df.year.max()-df.year.min()+1)
arrays = [ids.tolist(), years.tolist()]
new_idx = pd.MultiIndex.from_tuples(list(zip(*arrays)), names=['id', 'year'])
df = df.set_index(['id', 'year'])
df.reindex(new_idx).reset_index()
id year X Y
0 1 2008 10.0 20.0
1 1 2009 NaN NaN
2 1 2010 15.0 25.0
3 1 2011 NaN NaN
4 1 2012 NaN NaN
5 2 2008 NaN NaN
6 2 2009 NaN NaN
7 2 2010 NaN NaN
8 2 2011 2.0 4.0
9 2 2012 3.0 6.0

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

joining two dataframes with multilevel indices in pandas - python

Related

Balancing a panel data for regression

Merge if possible otherwise concat pandas

Faster way to construct a multiindex of dates in Pandas

Reformatting a date-frame into a new output format

Expand panel data python

Categories

Resources