Panel data - dealing with missing years when creating lead and lag variables - Python

I work with panel data. Typically my panel data is not balanced, i.e., some years are missing. The panel data generally looks as follows:
import pandas as pd

df = pd.DataFrame({'name': ['a']*4 + ['b']*3 + ['c']*4,
                   'year': [2001,2002,2004,2005] + [2000,2002,2003] + [2001,2002,2003,2005],
                   'val1': [1,2,3,4,5,6,7,8,9,10,11],
                   'val2': [2,5,7,11,13,17,19,23,29,31,37]})
name year val1 val2
0 a 2001 1 2
1 a 2002 2 5
2 a 2004 3 7
3 a 2005 4 11
4 b 2000 5 13
5 b 2002 6 17
6 b 2003 7 19
7 c 2001 8 23
8 c 2002 9 29
9 c 2003 10 31
10 c 2005 11 37
Now I want to create lead and lag variables within each name group. Using:
df['val1_lag'] = df.groupby('name')['val1'].shift(1)
df['val1_lead'] = df.groupby('name')['val1'].shift(-1)
This simply shifts one row up/down, which is not what I want. I want to shift relative to year. My expected output:
name year val1 val2 val1_lag val1_lead
0 a 2001 1 2 NaN 2.0
1 a 2002 2 5 1.0 NaN
2 a 2004 3 7 NaN 4.0
3 a 2005 4 11 3.0 NaN
4 b 2000 5 13 NaN NaN
5 b 2002 6 17 NaN 7.0
6 b 2003 7 19 6.0 NaN
7 c 2001 8 23 NaN 9.0
8 c 2002 9 29 8.0 10.0
9 c 2003 10 31 9.0 NaN
10 c 2005 11 37 NaN NaN
My current workaround is to fill in the missing years with:
df.set_index(['name', 'year'], inplace=True)
mux = pd.MultiIndex.from_product([df.index.levels[0], df.index.levels[1]], names=['name', 'year'])
df = df.reindex(mux).reset_index()
Then I use a normal shift. However, my data is quite large, and this reindexing often triples the data size, which is not efficient here.
I am looking for a better approach for this scenario.

The solution is to create a check column that flags whether the adjacent year is consecutive, for both lag and lead. Set the check column to 1.0 or NaN, then multiply it with your normal groupby shift:
import numpy as np

df['yearlag'] = (df['year'] == 1 + df.groupby('name')['year'].shift(1)) * 1.0
df.loc[df['yearlag'] == 0.0, 'yearlag'] = np.nan
df['yearlead'] = (df['year'] == -1 + df.groupby('name')['year'].shift(-1)) * 1.0
df.loc[df['yearlead'] == 0.0, 'yearlead'] = np.nan
To create the lag variable:
%timeit df['val1_lag'] = df.groupby('name')['val1'].shift(1)*df['yearlag']
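and analogously for the lead, using the yearlead check column computed above:
%timeit df['val1_lead'] = df.groupby('name')['val1'].shift(-1)*df['yearlead']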
You can benchmark this against the merge method from the other answer to see which is more efficient:
%timeit df['val1_lag'] = df[['name', 'year']].merge(df.eval('year=year+1'), how='left')['val1']

Don't use shift; instead, merge on year ± 1:
df['val1_lag'] = df[['name', 'year']].merge(df.eval('year = year+1'), how='left')['val1']
df['val1_lead'] = df[['name', 'year']].merge(df.eval('year = year-1'), how='left')['val1']
Output:
name year val1 val2 val1_lag val1_lead
0 a 2001 1 2 NaN 2.0
1 a 2002 2 5 1.0 NaN
2 a 2004 3 7 NaN 4.0
3 a 2005 4 11 3.0 NaN
4 b 2000 5 13 NaN NaN
5 b 2002 6 17 NaN 7.0
6 b 2003 7 19 6.0 NaN
7 c 2001 8 23 NaN 9.0
8 c 2002 9 29 8.0 10.0
9 c 2003 10 31 9.0 NaN
10 c 2005 11 37 NaN NaN
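For larger horizons the same merge trick generalizes. A minimal sketch (the helper name year_shift is my own, and it assumes a default RangeIndex and unique (name, year) pairs):

def year_shift(df, col, periods=1):
    # Shift `col` by `periods` years within each name by aligning on
    # (name, year); years absent from the data come back as NaN.
    donor = df[['name', 'year', col]].copy()
    donor['year'] = donor['year'] + periods
    return df[['name', 'year']].merge(donor, on=['name', 'year'], how='left')[col]

df['val1_lag2'] = year_shift(df, 'val1', periods=2)    # value from two years earlier
df['val1_lead'] = year_shift(df, 'val1', periods=-1)   # value from the following year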

Related

Get the sum of values from the last n rows by group id

I want to know how to get, for every row, the sum of the last 5 values belonging to the same id.
df:
id values
-----------------
a 5
a 10
a 10
b 2
c 2
d 2
a 5
a 10
a 20
a 10
a 15
a 20
expected df:
id values sum(x.tail(5))
-------------------------------------
a 5 NaN
a 10 NaN
a 10 NaN
b 2 NaN
c 2 NaN
d 2 NaN
a 5 NaN
a 10 NaN
a 20 40
a 10 55
a 15 55
a 20 60
For simplicity, I'm only trying to find the sum of the last 5 values for rows with id a.
I tried df.apply(lambda x: x.tail(5)), but that only gave me the last 5 rows of the entire df. I want the sum of the last n rows for each and every row. Basically it's like a rolling sum for time series data.
You can calculate the sum of the last 5 like this:
df["rolling As"] = df[df['id'] == 'a'].rolling(window=5).sum()["values"]
(This includes the current row as one of the 5; not sure if that is what you want.)
id values rolling As
0 a 5 NaN
1 a 10 NaN
2 a 10 NaN
3 b 2 NaN
4 c 2 NaN
5 d 5 NaN
6 a 10 NaN
7 a 20 55.0
8 a 10 60.0
9 a 10 60.0
10 a 15 65.0
11 a 20 75.0
If you don't want it included, you can shift:
df["rolling"] = df[df['id'] == 'a'].rolling(window=5).sum()["values"].shift()
to give:
id values rolling
0 a 5 NaN
1 a 10 NaN
2 a 10 NaN
3 b 2 NaN
4 c 2 NaN
5 d 5 NaN
6 a 10 NaN
7 a 20 NaN
8 a 10 55.0
9 a 10 60.0
10 a 15 60.0
11 a 20 65.0
Try using groupby, transform, and rolling:
df['sum(x.tail(5))'] = df.groupby('id')['values']\
.transform(lambda x: x.rolling(5, min_periods=5).sum().shift())
Output:
id values sum(x.tail(5))
1 a 5 NaN
2 a 10 NaN
3 a 10 NaN
4 b 2 NaN
5 c 2 NaN
6 d 2 NaN
7 a 5 NaN
8 a 10 NaN
9 a 20 40.0
10 a 10 55.0
11 a 15 55.0
12 a 20 60.0
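For reference, a minimal self-contained sketch reproducing this (values taken from the question):

import pandas as pd

df = pd.DataFrame({'id': ['a','a','a','b','c','d','a','a','a','a','a','a'],
                   'values': [5,10,10,2,2,2,5,10,20,10,15,20]})
# rolling sum of the previous 5 values within each id, excluding the current row
df['sum(x.tail(5))'] = df.groupby('id')['values'] \
    .transform(lambda x: x.rolling(5, min_periods=5).sum().shift())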

Reformatting a DataFrame into a new output format

I have the output from a pivot table in a DataFrame (df) which looks like:
Year Month sum
2005 10 -1.596817e+05
11 -2.521054e+05
12 5.981900e+05
2006 1 8.686413e+05
2 1.673673e+06
3 1.218341e+06
4 4.131970e+05
5 1.090499e+05
6 1.495985e+06
7 1.736795e+06
8 1.155071e+05
...
9 7.847369e+05
10 -5.564139e+04
11 -7.435682e+05
12 1.073361e+05
2017 1 3.427652e+05
2 3.574432e+05
3 5.026018e+04
Is there a way to reformat the dataframe so the output to console would look like:
Month 1 2 3 4 5 6 7 8 9 10 11 12
Year
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
All the values would be populated in the new table as well.
Use unstack:
In [18]: df['sum'].unstack('Month')
Out[18]:
Month 1 2 3 4 5 6 7 8 9 10 11 12
Year
2005.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN -159681.70 -252105.4 598190.0
2006.0 868641.3 1673673.0 1218341.00 413197.0 109049.9 1495985.0 1736795.0 115507.1 784736.9 -55641.39 -743568.2 107336.1
2017.0 342765.2 357443.2 50260.18 NaN NaN NaN NaN NaN NaN NaN NaN NaN
Try df.pivot(index='Year', columns='Month', values='sum').
To fill the empty Year column (if it has gaps), use df.fillna(method='ffill') before the above.
Reading the answer above, it should be mentioned that my suggestion works in cases where Year and Month aren't the index but ordinary columns.
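A minimal sketch of that case (the sample values here are mine, just for shape):

import pandas as pd

df = pd.DataFrame({'Year': [2005, 2005, 2006, 2006],
                   'Month': [11, 12, 1, 2],
                   'sum': [-252105.4, 598190.0, 868641.3, 1673673.0]})
# one row per (Year, Month) pair -> wide table with Months as columns
wide = df.pivot(index='Year', columns='Month', values='sum')
print(wide)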

Cumulative Sum using 2 columns

I am trying to create a column that does a cumulative sum using 2 columns. Please see an example of what I am trying to do:
index lodgement_year words sum cum_sum
0 2000 the 14 14
1 2000 australia 10 10
2 2000 word 12 12
3 2000 brand 8 8
4 2000 fresh 5 5
5 2001 the 8 22
6 2001 australia 3 13
7 2001 banana 1 1
8 2001 brand 7 15
9 2001 fresh 1 6
I have used the code below; however, my computer keeps crashing, and I am unsure if it is the code or the computer. Any help will be greatly appreciated:
df_2['cumsum']= df_2.groupby('lodgement_year')['words'].transform(pd.Series.cumsum)
Update: I have also used the code below; it worked and exited with code 0, though with some warnings.
df_2['cum_sum'] =df_2.groupby(['words'])['count'].cumsum()
You are almost there, Ian!
The cumsum() method calculates the cumulative sum of a Pandas column. You are looking for that applied to the words groups. Therefore:
In [303]: df_2['cumsum'] = df_2.groupby(['words'])['sum'].cumsum()
In [304]: df_2
Out[304]:
index lodgement_year words sum cum_sum cumsum
0 0 2000 the 14 14 14
1 1 2000 australia 10 10 10
2 2 2000 word 12 12 12
3 3 2000 brand 8 8 8
4 4 2000 fresh 5 5 5
5 5 2001 the 8 22 22
6 6 2001 australia 3 13 13
7 7 2001 banana 1 1 1
8 8 2001 brand 7 15 15
9 9 2001 fresh 1 6 6
Please comment if this fails on your bigger data set, and we'll work on a possibly more accurate version of this.
If we only need to consider the column 'words', we can loop through its unique values:
for unique_words in df_2.words.unique():
    if 'cum_sum' not in df_2:
        df_2['cum_sum'] = df_2.loc[df_2['words'] == unique_words]['sum'].cumsum()
    else:
        df_2.update(pd.DataFrame({'cum_sum': df_2.loc[df_2['words'] == unique_words]['sum'].cumsum()}))
The above will result in:
>>> print(df_2)
lodgement_year sum words cum_sum
0 2000 14 the 14.0
1 2000 10 australia 10.0
2 2000 12 word 12.0
3 2000 8 brand 8.0
4 2000 5 fresh 5.0
5 2001 8 the 22.0
6 2001 3 australia 13.0
7 2001 1 banana 1.0
8 2001 7 brand 15.0
9 2001 1 fresh 6.0
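For reference, a minimal sketch reproducing the groupby approach on the data above (column names taken from the question):

import pandas as pd

df_2 = pd.DataFrame({'lodgement_year': [2000]*5 + [2001]*5,
                     'words': ['the', 'australia', 'word', 'brand', 'fresh',
                               'the', 'australia', 'banana', 'brand', 'fresh'],
                     'sum': [14, 10, 12, 8, 5, 8, 3, 1, 7, 1]})
# running total of 'sum' within each word, across years
df_2['cum_sum'] = df_2.groupby('words')['sum'].cumsum()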

Call a NaN value and change it to a number in Python

I have a DataFrame, say df, which looks like this:
id property_type1 property_type pro
1 Condominium 2 2
2 Farm 14 14
3 House 7 7
4 Lots/Land 15 15
5 Mobile/Manufactured Home 13 13
6 Multi-Family 8 8
7 Townhouse 11 11
8 Single Family 10 10
9 Apt/Condo 1 1
10 Home 7 7
11 NaN 29 NaN
Now, I need the pro column to have the same value as the property_type column, whenever the property_type1 column has a NaN value. This is how it should be:
id property_type1 property_type pro
1 Condominium 2 2
2 Farm 14 14
3 House 7 7
4 Lots/Land 15 15
5 Mobile/Manufactured Home 13 13
6 Multi-Family 8 8
7 Townhouse 11 11
8 Single Family 10 10
9 Apt/Condo 1 1
10 Home 7 7
11 NaN 29 29
That is, in line 11, where property_type1 is NaN, the value of the pro column becomes 29, which is the value of property_type. How can I do this?
ix is deprecated, don't use it.
Option 1
I'd do this with np.where -
import numpy as np

df = df.assign(pro=np.where(df.pro.isnull(), df.property_type, df.pro))
df
id property_type1 property_type pro
0 1 Condominium 2 2.0
1 2 Farm 14 14.0
2 3 House 7 7.0
3 4 Lots/Land 15 15.0
4 5 Mobile/Manufactured Home 13 13.0
5 6 Multi-Family 8 8.0
6 7 Townhouse 11 11.0
7 8 Single Family 10 10.0
8 9 Apt/Condo 1 1.0
9 10 Home 7 7.0
10 11 NaN 29 29.0
Option 2
If you want to perform in-place assignment, use loc -
m = df.pro.isnull()
df.loc[m, 'pro'] = df.loc[m, 'property_type']
df
id property_type1 property_type pro
0 1 Condominium 2 2.0
1 2 Farm 14 14.0
2 3 House 7 7.0
3 4 Lots/Land 15 15.0
4 5 Mobile/Manufactured Home 13 13.0
5 6 Multi-Family 8 8.0
6 7 Townhouse 11 11.0
7 8 Single Family 10 10.0
8 9 Apt/Condo 1 1.0
9 10 Home 7 7.0
10 11 NaN 29 29.0
Compute the mask just once, and use it to index multiple times, which should be more efficient than computing it twice.
Find the rows where the property_type1 column is NaN, and for those rows assign the property_type values to the pro column (.ix has been removed from recent pandas, so .loc is used here):
df.loc[df.property_type1.isnull(), 'pro'] = df.loc[df.property_type1.isnull(), 'property_type']
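As a side note, a fillna-based one-liner would also work; a sketch, assuming (as in the data shown) that pro is NaN exactly on the rows where property_type1 is NaN:

# fill missing pro values from property_type, aligned by index
df['pro'] = df['pro'].fillna(df['property_type'])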

Get dataframe columns from a list using isin

I have a dataframe df1, and I have a list which contains names of several columns of df1.
df1:
User_id month day Age year CVI ZIP sex wgt
0 1 7 16 1977 2 NA M NaN
1 2 7 16 1977 3 NA M NaN
2 3 7 16 1977 2 DM F NaN
3 4 7 16 1977 7 DM M NaN
4 5 7 16 1977 3 DM M NaN
... ... ... ... ... ... ... ... ...
35544 35545 12 31 2002 15 AH NaN NaN
35545 35546 12 31 2002 15 AH NaN NaN
35546 35547 12 31 2002 10 RM F 14
35547 35548 12 31 2002 7 DO M 51
35548 35549 12 31 2002 5 NaN NaN NaN
list= [u"User_id", u"day", u"ZIP", u"sex"]
I want to make a new dataframe df2 which will contain only those columns which are in the list, and a dataframe df3 which will contain the columns which are not in the list.
Here I found that I need to do:
df2=df1[df1[df1.columns[1]].isin(list)]
But as a result I get:
Empty DataFrame
Columns: []
Index: []
[0 rows x 9 columns]
What am I doing wrong and how can I get the needed result? And why "9 columns" if it is supposed to be 4?
Solution with Index.difference:
L = [u"User_id", u"day", u"ZIP", u"sex"]
df2 = df1[L]
df3 = df1[df1.columns.difference(df2.columns)]
print (df2)
User_id day ZIP sex
0 0 7 NaN M
1 1 7 NaN M
2 2 7 DM F
3 3 7 DM M
4 4 7 DM M
print (df3)
Age CVI month wgt year
0 16 2 1 NaN 1977
1 16 3 2 NaN 1977
2 16 2 3 NaN 1977
3 16 7 4 NaN 1977
4 16 3 5 NaN 1977
Or:
df2 = df1[L]
df3 = df1[df1.columns.difference(pd.Index(L))]
(The printed output is identical to the above.)
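Note that Index.difference returns the remaining columns in sorted order (hence the alphabetical column order in df3 above). A sketch that keeps the original column order instead, using drop:

# drop the listed columns; the remaining ones keep their original order
df3 = df1.drop(columns=L)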
Never name a list "list"; it shadows the built-in:
my_list= [u"User_id", u"day", u"ZIP", u"sex"]
df2 = df1[df1.keys()[df1.keys().isin(my_list)]]
or
df2 = df1[df1.columns[df1.columns.isin(my_list)]]
You can try:
df2 = df1[list] # it does a projection on the columns contained in the list
df3 = df1[[col for col in df1.columns if col not in list]]
