Balancing panel data for regression - python

I have a dataframe:
import pandas as pd

df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 3],
                   "city": ['abc', 'abc', 'abc', 'def10', 'def10', 'ghk'],
                   "year": [2008, 2009, 2010, 2008, 2010, 2009],
                   "value": [10, 20, 30, 10, 20, 30]})
id city year value
0 1 abc 2008 10
1 1 abc 2009 20
2 1 abc 2010 30
3 2 def10 2008 10
4 2 def10 2010 20
5 3 ghk 2009 30
I want to create a balanced panel such that:
id city year value
0 1 abc 2008 10
1 1 abc 2009 20
2 1 abc 2010 30
3 2 def10 2008 10
4 2 def10 2009 NaN
5 2 def10 2010 20
6 3 ghk 2008 NaN
7 3 ghk 2009 30
8 3 ghk 2010 NaN
If I use the following code:
df = df.set_index('id')
balanced = (df.set_index('year', append=True)
              .reindex(pd.MultiIndex.from_product([df.index, range(df.year.min(), df.year.max() + 1)],
                                                  names=['frs_id', 'year']))
              .reset_index(level=1))
This gives me the following error:
cannot handle a non-unique multi-index!

You are close to the solution. You can amend your code slightly as follows:
idx = pd.MultiIndex.from_product([df['id'].unique(), range(df.year.min(), df.year.max() + 1)],
                                 names=['id', 'year'])
df2 = df.set_index(['id', 'year']).reindex(idx).reset_index()
# fill city forward then backward within each id group, so no city leaks across ids
df2['city'] = df2.groupby('id')['city'].transform(lambda s: s.ffill().bfill())
Changes to your code:
- Create the MultiIndex from the unique values of id instead of from the index
- Set the index on both id and year before reindex()
- Fill in the NaN values of the city column with the non-NaN entries of the same id
Result:
print(df2)
id year city value
0 1 2008 abc 10.0
1 1 2009 abc 20.0
2 1 2010 abc 30.0
3 2 2008 def10 10.0
4 2 2009 def10 NaN
5 2 2010 def10 20.0
6 3 2008 ghk NaN
7 3 2009 ghk 30.0
8 3 2010 ghk NaN
Optionally, you can rearrange the column order if you like:
df2.insert(2, 'year', df2.pop('year'))
print(df2)
id city year value
0 1 abc 2008 10.0
1 1 abc 2009 20.0
2 1 abc 2010 30.0
3 2 def10 2008 10.0
4 2 def10 2009 NaN
5 2 def10 2010 20.0
6 3 ghk 2008 NaN
7 3 ghk 2009 30.0
8 3 ghk 2010 NaN
Edit
You can also do it using stack() and unstack(), without reindex(), as follows:
(df.set_index(['id', 'city', 'year'], append=True)
   .unstack()                      # move year into the columns, creating NaN for missing years
   .groupby(level=[1, 2]).max()   # collapse the original row numbers, one row per (id, city)
   .stack(dropna=False)           # move year back into the rows, keeping the NaNs
).reset_index()
Output:
id city year value
0 1 abc 2008 10.0
1 1 abc 2009 20.0
2 1 abc 2010 30.0
3 2 def10 2008 10.0
4 2 def10 2009 NaN
5 2 def10 2010 20.0
6 3 ghk 2008 NaN
7 3 ghk 2009 30.0
8 3 ghk 2010 NaN

Pivot the table and stack year without dropping the NaN values (pivot takes keyword arguments since pandas 2.0):
>>> df.pivot(index=["id", "city"], columns="year", values="value") \
    .stack(dropna=False) \
    .rename("value") \
    .reset_index()
id city year value
0 1 abc 2008 10.0
1 1 abc 2009 20.0
2 1 abc 2010 30.0
3 2 def10 2008 10.0
4 2 def10 2009 NaN
5 2 def10 2010 20.0
6 3 ghk 2008 NaN
7 3 ghk 2009 30.0
8 3 ghk 2010 NaN
Edit: case of duplicate entries
I slightly modified your original dataframe:
df = pd.DataFrame({"id": [1,1,1,2,2,3,3], "city": ['abc','abc','abc','def10','def10','ghk','ghk'], "year": [2008,2009,2010,2008,2010,2009,2009], "value": [10,20,30,10,20,30,40]})
>>> df
id city year value
0 1 abc 2008 10
1 1 abc 2009 20
2 1 abc 2010 30
3 2 def10 2008 10
4 2 def10 2010 20
5 3 ghk 2009 30 # The problem is here
6 3 ghk 2009 40 # same (id, city, year)
You need to make a decision: do you want to keep row 5 or row 6, or apply an aggregation function (mean, sum, ...)? Imagine you want the mean for (3, ghk, 2009):
>>> df.pivot_table(index=["id", "city"], columns="year", values="value", aggfunc="mean") \
.stack(dropna=False) \
.rename("value") \
.reset_index()
id city year value
0 1 abc 2008 10.0
1 1 abc 2009 20.0
2 1 abc 2010 30.0
3 2 def10 2008 10.0
4 2 def10 2009 NaN
5 2 def10 2010 20.0
6 3 ghk 2008 NaN
7 3 ghk 2009 35.0 # <- mean of (30, 40)
8 3 ghk 2010 NaN
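If you would rather keep one of the duplicated rows than aggregate, a minimal sketch (deduped is an illustrative name) is to drop the duplicates before pivoting; keep="first" retains row 5 and keep="last" retains row 6:
deduped = df.drop_duplicates(subset=["id", "city", "year"], keep="first")
deduped.pivot(index=["id", "city"], columns="year", values="value") \
    .stack(dropna=False) \
    .rename("value") \
    .reset_index()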

Related

Panel Data - dealing with missing year when creating lead and lag variables

I work with panel data. Typically my panel data is not balanced, i.e., there are some missing years. My panel data generally looks as follows:
df = pd.DataFrame({'name': ['a']*4+['b']*3+['c']*4,
'year':[2001,2002,2004,2005]+[2000,2002,2003]+[2001,2002,2003,2005],
'val1':[1,2,3,4,5,6,7,8,9,10,11],
'val2':[2,5,7,11,13,17,19,23,29,31,37]})
name year val1 val2
0 a 2001 1 2
1 a 2002 2 5
2 a 2004 3 7
3 a 2005 4 11
4 b 2000 5 13
5 b 2002 6 17
6 b 2003 7 19
7 c 2001 8 23
8 c 2002 9 29
9 c 2003 10 31
10 c 2005 11 37
Now I want to create lead and lag variables grouped by name. Using:
df['val1_lag'] = df.groupby('name')['val1'].shift(1)
df['val1_lead'] = df.groupby('name')['val1'].shift(-1)
This simply shifts up/down by 1 row, which is not what I want. I want to shift relative to year. My expected output:
name year val1 val2 val1_lag val1_lead
0 a 2001 1 2 NaN 2.0
1 a 2002 2 5 1.0 NaN
2 a 2004 3 7 NaN 4.0
3 a 2005 4 11 3.0 NaN
4 b 2000 5 13 NaN NaN
5 b 2002 6 17 NaN 7.0
6 b 2003 7 19 6.0 NaN
7 c 2001 8 23 NaN 9.0
8 c 2002 9 29 8.0 10.0
9 c 2003 10 31 9.0 NaN
10 c 2005 11 37 NaN NaN
My current workaround is to fill in the missing years:
df.set_index(['name', 'year'], inplace=True)
mux = pd.MultiIndex.from_product([df.index.levels[0], df.index.levels[1]], names=['name', 'year'])
df = df.reindex(mux).reset_index()
Then use a normal shift. However, because my data is quite large, this often triples the data size, which is not very efficient.
I am looking for a better approach for this scenario.
One solution is to create check columns that flag whether the adjacent year is contiguous, for both the lag and the lead. Set the check column to 1.0 or NaN, then multiply it with your normal groupby shift:
# 1.0 where the previous row within name is exactly the prior year, else 0.0
df['yearlag'] = (df['year'] == 1 + df.groupby('name')['year'].shift(1))*1.0
df.loc[df['yearlag']==0.0, 'yearlag'] = None   # turn 0.0 into NaN
# 1.0 where the next row within name is exactly the following year, else 0.0
df['yearlead'] = (df['year'] == -1 + df.groupby('name')['year'].shift(-1))*1.0
df.loc[df['yearlead']==0.0, 'yearlead'] = None
To create the lag and lead variables:
%timeit df['val1_lag'] = df.groupby('name')['val1'].shift(1)*df['yearlag']
df['val1_lead'] = df.groupby('name')['val1'].shift(-1)*df['yearlead']
You can compare this with the merge method below, which is much more efficient:
%timeit df['val1_lag'] = df[['name', 'year']].merge(df.eval('year=year+1'), how='left')['val1']
Don't use shift; use a merge with the year ± 1 instead. Merging df against itself with the year shifted by one aligns each row with the row from the previous/next calendar year, so gaps naturally produce NaN:
df['val1_lag'] = df[['name', 'year']].merge(df.eval('year = year+1'), how='left')['val1']
df['val1_lead'] = df[['name', 'year']].merge(df.eval('year = year-1'), how='left')['val1']
Output:
name year val1 val2 val1_lag val1_lead
0 a 2001 1 2 NaN 2.0
1 a 2002 2 5 1.0 NaN
2 a 2004 3 7 NaN 4.0
3 a 2005 4 11 3.0 NaN
4 b 2000 5 13 NaN NaN
5 b 2002 6 17 NaN 7.0
6 b 2003 7 19 6.0 NaN
7 c 2001 8 23 NaN 9.0
8 c 2002 9 29 8.0 10.0
9 c 2003 10 31 9.0 NaN
10 c 2005 11 37 NaN NaN

How to add missing years in panel dataset? [duplicate]

This question already has answers here:
pandas or python equivalent of tidyr complete
My toy DataFrame is similar to
import pandas as pd
data = {'year': [1999, 2000, 2001, 2002, 2003, 2004, 2005,
1999, 2000, 2003, 2004, 2005],
'id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
'price': [1200, 150, 300, 450, 200, 300, 400, 120,
140, 150, 155, 156]
}
df = pd.DataFrame(data)
What's the most elegant way to add missing years?
In the example, the years 2001 and 2002 are missing for id = 2 because of missing data. In such cases, I still want to have the years in the DataFrame, id should be 2 and price = NaN.
My real DataFrame has thousands of IDs.
Use a cross merge to create all possible combinations of "year" and "id", then merge back to the original DataFrame:
>>> df["year"].drop_duplicates().to_frame() \
    .merge(df["id"].drop_duplicates(), how="cross") \
    .merge(df, how="left")
year id price
0 1999 1 1200.0
1 1999 2 120.0
2 2000 1 150.0
3 2000 2 140.0
4 2001 1 300.0
5 2001 2 NaN
6 2002 1 450.0
7 2002 2 NaN
8 2003 1 200.0
9 2003 2 150.0
10 2004 1 300.0
11 2004 2 155.0
12 2005 1 400.0
13 2005 2 156.0
You could make "year" a Categorical variable and include it in the groupby; grouping on a Categorical with observed=False keeps every category, which fills in the missing years:
df['year'] = pd.Categorical(df['year'], categories=df['year'].unique())
out = df.groupby(['id', 'year'], as_index=False, observed=False).first()
Output:
id year price
0 1 1999 1200.0
1 1 2000 150.0
2 1 2001 300.0
3 1 2002 450.0
4 1 2003 200.0
5 1 2004 300.0
6 1 2005 400.0
7 2 1999 120.0
8 2 2000 140.0
9 2 2001 NaN
10 2 2002 NaN
11 2 2003 150.0
12 2 2004 155.0
13 2 2005 156.0
Update
You can also use product from itertools:
from itertools import product
>>> df.set_index(['year', 'id']) \
    .reindex(list(product(sorted(set(df['year'])), sorted(set(df['id']))))) \
    .sort_index(level=1).reset_index()
year id price
0 1999 1 1200.0
1 2000 1 150.0
2 2001 1 300.0
3 2002 1 450.0
4 2003 1 200.0
5 2004 1 300.0
6 2005 1 400.0
7 1999 2 120.0
8 2000 2 140.0
9 2001 2 NaN
10 2002 2 NaN
11 2003 2 150.0
12 2004 2 155.0
13 2005 2 156.0
Create a MultiIndex of all combinations of the year and id columns. Set these columns as the index and reindex with the MultiIndex:
mi = pd.MultiIndex.from_product([df['year'].unique(), df['id'].unique()], names=['year', 'id'])
out = df.set_index(['year', 'id']).reindex(mi).reset_index().sort_values('id', ignore_index=True)
Output:
>>> out
year id price
0 1999 1 1200.0
1 2000 1 150.0
2 2001 1 300.0
3 2002 1 450.0
4 2003 1 200.0
5 2004 1 300.0
6 2005 1 400.0
7 1999 2 120.0
8 2000 2 140.0
9 2001 2 NaN
10 2002 2 NaN
11 2003 2 150.0
12 2004 2 155.0
13 2005 2 156.0

Reformatting a dataframe into a new output format

I have the output from a pivot table in a dataframe (df) that looks like:
Year Month sum
2005 10 -1.596817e+05
11 -2.521054e+05
12 5.981900e+05
2006 1 8.686413e+05
2 1.673673e+06
3 1.218341e+06
4 4.131970e+05
5 1.090499e+05
6 1.495985e+06
7 1.736795e+06
8 1.155071e+05
...
9 7.847369e+05
10 -5.564139e+04
11 -7.435682e+05
12 1.073361e+05
2017 1 3.427652e+05
2 3.574432e+05
3 5.026018e+04
Is there a way to reformat the dataframe so the output to console would look like:
Month 1 2 3 4 5 6 7 8 9 10 11 12
Year
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
All the values would be populated in the new table as well.
Use unstack:
In [18]: df['sum'].unstack('Month')
Out[18]:
Month 1 2 3 4 5 6 7 8 9 10 11 12
Year
2005.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN -159681.70 -252105.4 598190.0
2006.0 868641.3 1673673.0 1218341.00 413197.0 109049.9 1495985.0 1736795.0 115507.1 784736.9 -55641.39 -743568.2 107336.1
2017.0 342765.2 357443.2 50260.18 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Try df.pivot(index='Year', columns='Month', values='sum').
To fill any empty Year cells first (if there are any), use df.ffill() before the above.
Reading the answer above, it should be mentioned that my suggestion works in cases where Year and Month aren't the index.
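Put together, a minimal sketch of that route (assuming Year and Month start out as index levels, as in the question's pivot-table output; df_flat and wide are illustrative names):
df_flat = df.reset_index()   # turn Year and Month into ordinary columns
wide = df_flat.pivot(index='Year', columns='Month', values='sum')
print(wide)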

Expand panel data python

I am trying to expand the following data. I am a Stata user, and my problem can be fixed by the "fillin" command in Stata; I am trying to rewrite this command in python and couldn't find anything that works.
For example, transform this data frame
(my dataframe is bigger than the example given; the example is just to illustrate what I want to do)
id year X Y
1 2008 10 20
1 2010 15 25
2 2011 2 4
2 2012 3 6
to this one
id year X Y
1 2008 10 20
1 2009 . .
1 2010 15 25
1 2011 . .
1 2012 . .
2 2008 . .
2 2009 . .
2 2010 . .
2 2011 2 4
2 2012 3 6
Thank you, and sorry for my English.
This can be done by building all index combinations and selecting them. Note that since pandas 1.0, .loc[] raises a KeyError when any of the requested labels are missing, so use reindex() instead:
from itertools import product
import pandas as pd

df = pd.DataFrame([[1, 2008, 10, 20], [1, 2010, 15, 25], [2, 2011, 2, 4], [2, 2012, 3, 6]],
                  columns=['id', 'year', 'X', 'Y'])
df = df.set_index(['id', 'year'])
# All combinations of the index
# idx = list(product(df.index.levels[0], df.index.levels[1]))
idx = list(product(range(1, 3), range(2008, 2013)))
df.reindex(idx)
Create a new multi-index from the dataframe and then reindex:
import numpy as np

years = np.tile(np.arange(df.year.min(), df.year.max() + 1), df.id.nunique())
ids = np.repeat(df.id.unique(), df.year.max() - df.year.min() + 1)
new_idx = pd.MultiIndex.from_tuples(list(zip(ids, years)), names=['id', 'year'])
df = df.set_index(['id', 'year'])
df.reindex(new_idx).reset_index()
id year X Y
0 1 2008 10.0 20.0
1 1 2009 NaN NaN
2 1 2010 15.0 25.0
3 1 2011 NaN NaN
4 1 2012 NaN NaN
5 2 2008 NaN NaN
6 2 2009 NaN NaN
7 2 2010 NaN NaN
8 2 2011 2.0 4.0
9 2 2012 3.0 6.0

joining two dataframes with multilevel indices in pandas

I have two dataframes like the following, with multilevel indices:
df1:
Total_Consumption
2010 2011 2012
1 8544.357 5133.553 5279.884
2 8581.545 6091.454 4323.611
3 4479.319 2784.283 1948.262
4 5493.114 3633.187 3516.346
5 5582.544 3138.680 3995.311
6 9877.752 7798.371 8505.287
7 5137.488 4109.556 3301.129
8 13038.200 8853.721 8525.272
df2:
Charging Capacity
2010 2011 2012
1 7.989 4.752 5.801
2 11.349 22.092 10.967
3 6.968 6.803 9.760
4 5.191 7.294 9.199
5 0.201 -1.204 10.488
6 14.598 13.077 17.004
7 5.134 12.945 8.970
8 44.680 23.607 24.395
I tried to concatenate these two dataframes via:
l1=[df1,df2]
pd.concat(l1)
But I get the following output. Why do I get NaN for the df2 dataframe? Is there a way to join the two dataframes with multilevel indices properly in pandas?
Charging Capacity Total_Consumption
2010 2011 2012 2010 2011 2012
1 NaN NaN NaN 8544.357 5133.553 5279.884
2 NaN NaN NaN 8581.545 6091.454 4323.611
3 NaN NaN NaN 4479.319 2784.283 1948.262
4 NaN NaN NaN 5493.114 3633.187 3516.346
5 NaN NaN NaN 5582.544 3138.680 3995.311
6 NaN NaN NaN 9877.752 7798.371 8505.287
7 NaN NaN NaN 5137.488 4109.556 3301.129
8 NaN NaN NaN 13038.200 8853.721 8525.272
Use axis=1. By default, pd.concat stacks along axis=0 (rows); since the two frames have different columns, each contributes NaN for the other's columns. With axis=1 they are aligned side by side on the shared row index:
pd.concat([df1, df2], axis=1)
Output:
Total_Consumption Charging Capacity
2010 2011 2012 2010 2011 2012
1 8544.357 5133.553 5279.884 7.989 4.752 5.801
2 8581.545 6091.454 4323.611 11.349 22.092 10.967
3 4479.319 2784.283 1948.262 6.968 6.803 9.760
4 5493.114 3633.187 3516.346 5.191 7.294 9.199
5 5582.544 3138.680 3995.311 0.201 -1.204 10.488
6 9877.752 7798.371 8505.287 14.598 13.077 17.004
7 5137.488 4109.556 3301.129 5.134 12.945 8.970
8 13038.200 8853.721 8525.272 44.680 23.607 24.395
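Since the question asks about joining: as a minimal alternative sketch (assuming both frames share the same row index, as they do here), DataFrame.join gives the same side-by-side result:
df1.join(df2)   # aligns on the row index and appends df2's columns to df1's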
