I work with panel data. Typically my panel data is not balanced, i.e., some years are missing. A typical panel looks like this:
import pandas as pd

df = pd.DataFrame({'name': ['a']*4 + ['b']*3 + ['c']*4,
                   'year': [2001,2002,2004,2005] + [2000,2002,2003] + [2001,2002,2003,2005],
                   'val1': [1,2,3,4,5,6,7,8,9,10,11],
                   'val2': [2,5,7,11,13,17,19,23,29,31,37]})
name year val1 val2
0 a 2001 1 2
1 a 2002 2 5
2 a 2004 3 7
3 a 2005 4 11
4 b 2000 5 13
5 b 2002 6 17
6 b 2003 7 19
7 c 2001 8 23
8 c 2002 9 29
9 c 2003 10 31
10 c 2005 11 37
Now I want to create lead and lag variables within each name group. Using:
df['val1_lag'] = df.groupby('name')['val1'].shift(1)
df['val1_lead'] = df.groupby('name')['val1'].shift(-1)
This simply shifts one row up/down regardless of gaps in the years, which is not what I want. I want to shift relative to year. My expected output:
name year val1 val2 val1_lag val1_lead
0 a 2001 1 2 NaN 2.0
1 a 2002 2 5 1.0 NaN
2 a 2004 3 7 NaN 4.0
3 a 2005 4 11 3.0 NaN
4 b 2000 5 13 NaN NaN
5 b 2002 6 17 NaN 7.0
6 b 2003 7 19 6.0 NaN
7 c 2001 8 23 NaN 9.0
8 c 2002 9 29 8.0 10.0
9 c 2003 10 31 9.0 NaN
10 c 2005 11 37 NaN NaN
My current workaround is to fill in the missing years:
df.set_index(['name', 'year'], inplace=True)
mux = pd.MultiIndex.from_product([df.index.levels[0], df.index.levels[1]], names=['name', 'year'])
df = df.reindex(mux).reset_index()
Then I apply the normal shift, as sketched below. However, my data is quite large, and this reindexing often triples the data size, which is not very efficient.
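A sketch of what that shift and clean-up look like on the padded frame:

df['val1_lag'] = df.groupby('name')['val1'].shift(1)
df['val1_lead'] = df.groupby('name')['val1'].shift(-1)
df = df.dropna(subset=['val1']).reset_index(drop=True)  # drop the padding rows again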
I am looking for a better approach for this scenario.
One solution is to create check columns that flag whether the lag/lead year is consecutive. Set the check column to 1.0 or NaN, then multiply it by the normal groupby shift:
df['yearlag'] = (df['year'] == df.groupby('name')['year'].shift(1) + 1) * 1.0
df.loc[df['yearlag'] == 0.0, 'yearlag'] = None
df['yearlead'] = (df['year'] == df.groupby('name')['year'].shift(-1) - 1) * 1.0
df.loc[df['yearlead'] == 0.0, 'yearlead'] = None
To create the lag/lead variables:
%timeit df['val1_lag'] = df.groupby('name')['val1'].shift(1)*df['yearlag']
You can compare this against the merge method from the other answer; the check-column approach is much more efficient:
%timeit df['val1_lag'] = df[['name', 'year']].merge(df.eval('year=year+1'), how='left')['val1']
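Putting the pieces together, a minimal sketch of the whole check-column approach (assuming numpy is imported as np):

import numpy as np

grp = df.groupby('name')['year']
df['yearlag'] = (df['year'] == grp.shift(1) + 1) * 1.0
df['yearlead'] = (df['year'] == grp.shift(-1) - 1) * 1.0
# turn the 0.0 flags into NaN so that multiplying blanks out non-consecutive years
df[['yearlag', 'yearlead']] = df[['yearlag', 'yearlead']].replace(0.0, np.nan)

df['val1_lag'] = df.groupby('name')['val1'].shift(1) * df['yearlag']
df['val1_lead'] = df.groupby('name')['val1'].shift(-1) * df['yearlead']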
Don't use shift; instead, merge on the year ± 1:
df['val1_lag'] = df[['name', 'year']].merge(df.eval('year = year+1'), how='left')['val1']
df['val1_lead'] = df[['name', 'year']].merge(df.eval('year = year-1'), how='left')['val1']
Output:
name year val1 val2 val1_lag val1_lead
0 a 2001 1 2 NaN 2.0
1 a 2002 2 5 1.0 NaN
2 a 2004 3 7 NaN 4.0
3 a 2005 4 11 3.0 NaN
4 b 2000 5 13 NaN NaN
5 b 2002 6 17 NaN 7.0
6 b 2003 7 19 6.0 NaN
7 c 2001 8 23 NaN 9.0
8 c 2002 9 29 8.0 10.0
9 c 2003 10 31 9.0 NaN
10 c 2005 11 37 NaN NaN
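Why this works: adding 1 to year before merging lines each row up with the record from exactly one year earlier within the same name, so gaps find no match and come out as NaN. An equivalent, more explicit spelling of the lag (a sketch; it assumes the (name, year) pairs are unique, otherwise the merge would duplicate rows):

prev = df[['name', 'year', 'val1']].copy()
prev['year'] += 1  # the value observed in year Y is now addressed at Y+1
df['val1_lag'] = df.merge(prev, on=['name', 'year'],
                          how='left', suffixes=('', '_prev'))['val1_prev']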
In this example, we attempt to apply the value computed within a group and column to all other NaNs in the same group and column.
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 4, 5],
                   'Year': [2000, 2000, 2001, 2001, 2000, 2000, 2000],
                   'Values': [1, 3, 2, 3, 4, 5, 6]})
df['pct'] = df.groupby(['id', 'Year'])['Values'].apply(lambda x: x/x.shift() - 1)
print(df)
id Year Values pct
0 1 2000 1 NaN
1 1 2000 3 2.0
2 2 2001 2 NaN
3 2 2001 3 0.5
4 3 2000 4 NaN
5 4 2000 5 NaN
6 5 2000 6 NaN
I have tried using .ffill() to fill the NaNs within each group that contains a value. For example, I want the NaN at index 0 to become 2.0, and the NaN at index 2 to become 0.5.
df['pct'] = df.groupby(['id', 'Year'])['pct'].ffill()
print(df)
id Year Values pct
0 1 2000 1 NaN
1 1 2000 3 2.0
2 2 2001 2 NaN
3 2 2001 3 0.5
4 3 2000 4 NaN
5 4 2000 5 NaN
6 5 2000 6 NaN
It should be bfill, because the value you want to propagate sits after the NaN within each group:
df['pct'] = df.groupby(['id', 'Year'])['pct'].bfill()
df
Out[109]:
id Year Values pct
0 1 2000 1 2.0
1 1 2000 3 2.0
2 2 2001 2 0.5
3 2 2001 3 0.5
4 3 2000 4 NaN
5 4 2000 5 NaN
6 5 2000 6 NaN
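As an aside, the pct column itself can be computed without the lambda using the groupby pct_change method; a sketch that reproduces the filled result on the same data:

df['pct'] = df.groupby(['id', 'Year'])['Values'].pct_change()
df['pct'] = df.groupby(['id', 'Year'])['pct'].bfill()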
I am working on a dataframe of shape 146 rows x 48 columns. The columns are
['Region','Rank 2015','Score 2015','Economy 2015','Family 2015','Health 2015','Freedom 2015','Generosity 2015','Trust 2015','Rank 2016','Score 2016','Economy 2016','Family 2016','Health 2016','Freedom 2016','Generosity 2016','Trust 2016','Rank 2017','Score 2017','Economy 2017','Family 2017','Health 2017','Freedom 2017','Generosity 2017','Trust 2017','Rank 2018','Score 2018','Economy 2018','Family 2018','Health 2018','Freedom 2018','Generosity 2018','Trust 2018','Rank 2019','Score 2019','Economy 2019','Family 2019','Health 2019','Freedom 2019','Generosity 2019','Trust 2019','Score Mean','Economy Mean','Family Mean','Health Mean','Freedom Mean','Generosity Mean','Trust Mean']
I want to access a particular row and convert it to the following dataframe:
Year Rank Score Family Health Freedom Generosity Trust
0 2015 NaN NaN NaN NaN NaN NaN NaN
1 2016 NaN NaN NaN NaN NaN NaN NaN
2 2017 NaN NaN NaN NaN NaN NaN NaN
3 2018 NaN NaN NaN NaN NaN NaN NaN
4 2019 NaN NaN NaN NaN NaN NaN NaN
Any help is welcome, and thank you in advance.
An alternate way:
cols=['Region','Rank 2015','Score 2015','Economy 2015','Family 2015','Health 2015','Freedom 2015','Generosity 2015', 'Trust 2015','Rank 2016','Score 2016','Economy 2016','Family 2016','Health 2016','Freedom 2016','Generosity 2016','Trust 2016', 'Rank 2017','Score 2017','Economy 2017','Family 2017','Health 2017','Freedom 2017','Generosity 2017','Trust 2017','Rank 2018','Score 2018','Economy 2018','Family 2018','Health 2018','Freedom 2018','Generosity 2018','Trust 2018','Rank 2019','Score 2019','Economy 2019','Family 2019','Health 2019','Freedom 2019','Generosity 2019','Trust 2019','Score Mean','Economy Mean','Family Mean','Health Mean','Freedom Mean','Generosity Mean','Trust Mean']
# source dataframe
df1 = pd.DataFrame(columns=cols)
df1.loc[0] = [1] * 48  # one row of ones, matching the 48 columns
#target dataframe
df2 = pd.DataFrame(columns=['Year','Rank','Score','Family','Health','Freedom','Generosity','Trust','Economy'])
df2['Year']=['2015','2016','2017','2018','2019','Mean']
df2.set_index('Year', inplace=True)
idx = 0  # source row to copy
for col in df1.columns[1:]:
    c, r = col.split(" ")
    df2.at[r, c] = df1.at[idx, col]
print(df2)
Rank Score Family Health Freedom Generosity Trust Economy
Year
2015 1 1 1 1 1 1 1 1
2016 1 1 1 1 1 1 1 1
2017 1 1 1 1 1 1 1 1
2018 1 1 1 1 1 1 1 1
2019 1 1 1 1 1 1 1 1
Mean NaN 1 1 1 1 1 1 1
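The explicit loop can also be replaced by splitting the labels into a MultiIndex and unstacking; a sketch with the same df1 and idx as above (df2_alt is just an illustrative name):

row = df1.drop(columns='Region').iloc[idx]
row.index = pd.MultiIndex.from_tuples(
    [tuple(c.split(' ')) for c in row.index], names=['measure', 'Year'])
df2_alt = row.unstack('measure')  # one row per year (plus 'Mean'), one column per measure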
Here's a solution utilizing list comprehension:
The input:
cols = ['Region','Rank 2015','Score 2015','Economy 2015','Family 2015','Health 2015','Freedom 2015','Generosity 2015','Trust 2015','Rank 2016','Score 2016','Economy 2016','Family 2016','Health 2016','Freedom 2016','Generosity 2016','Trust 2016','Rank 2017','Score 2017','Economy 2017','Family 2017','Health 2017','Freedom 2017','Generosity 2017','Trust 2017','Rank 2018','Score 2018','Economy 2018','Family 2018','Health 2018','Freedom 2018','Generosity 2018','Trust 2018','Rank 2019','Score 2019','Economy 2019','Family 2019','Health 2019','Freedom 2019','Generosity 2019','Trust 2019','Score Mean','Economy Mean','Family Mean','Health Mean','Freedom Mean','Generosity Mean','Trust Mean']
import numpy as np

df = pd.DataFrame(np.random.randint(1, 10, (3, 48)))
df.columns = cols
print(df.iloc[:, :4])
Region Rank 2015 Score 2015 Economy 2015
0 7 9 9 9
1 8 7 2 3
2 3 3 4 5
And the new dataframe would be:
target_cols = ['Rank', 'Score', 'Family', 'Health', 'Freedom', 'Generosity', 'Trust']
years = ['2015', '2016', '2017', '2018', '2019']
newdf = pd.DataFrame([df.loc[1, [x + ' ' + year for x in target_cols]].values for year in years])
newdf.columns = target_cols
newdf['year'] = years
print(newdf)
Rank Score Family Health Freedom Generosity Trust year
0 7 2 6 9 3 4 9 2015
1 2 8 1 1 7 6 1 2016
2 7 4 2 5 1 7 4 2017
3 9 7 1 4 7 5 2 2018
4 5 4 4 9 1 6 2 2019
Assuming that the target years are only those spanning 2015 to 2019, and that the target columns are known, I would proceed as follows:
(1) define the target columns and years
target_columns = ['Rank', 'Score', 'Family', 'Health', 'Freedom', 'Generosity', 'Trust']
target_years = ['2015', '2016', '2017', '2018', '2019']
(2) retrieve the particular row; I assume your starting dataframe is initial_dataframe
particular_row = initial_dataframe.iloc[0]
(3) retrieve and reshape the information from the particular_row
reshaped_row = {'Year': target_years}
reshaped_row.update({
    column_name: [particular_row[column_name + ' ' + year_name]
                  for year_name in target_years]
    for column_name in target_columns
})
(4) assign the reshaped row to the output_dataframe
output_dataframe = pd.DataFrame(reshaped_row)
Have you tried using a 2D array? I would find that to be the easiest. Otherwise, you could also use a dictionary. https://www.w3schools.com/python/python_dictionaries.asp
I didn't fully understand your question, but I can give you a hint on how to transform the data.
df = pd.DataFrame(li)  # li is the list of column names from the question
df = df[0].str.split(r"(\d{4})", expand=True)
df = df[df[2] == ""]
col_name = df[0].unique()
df_new = df.pivot(index=1, columns=0, values=2)
df_new.drop(df_new.index[0], inplace=True)
df_new:
Economy Family Freedom Generosity Health Rank Score Trust
1
2016
2017
2018
2019
You can write your own logic.
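Alternatively, pandas ships a helper for exactly this "stub + year" column layout: pd.wide_to_long. A sketch against the original wide frame from the question (called df_wide here; the i column must identify rows uniquely, hence the reset_index, and the '... Mean' columns don't match the year suffix so they are left untouched):

stubs = ['Rank', 'Score', 'Economy', 'Family',
         'Health', 'Freedom', 'Generosity', 'Trust']
long_df = pd.wide_to_long(df_wide.reset_index(), stubnames=stubs,
                          i='index', j='Year', sep=' ', suffix=r'\d{4}')
long_df = long_df.reset_index()
one_row = long_df[long_df['index'] == 0]  # the reshaped version of source row 0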
It needs a fair amount of manipulation; a simple idea is to build the required dict first and then make the df from it (here text is the list of column names from the question, and numpy is imported as np):
In [61]: dicts = {}
In [62]: for t in text[1:]:
    ...:     n, y = t.split(" ")
    ...:     if n not in dicts:
    ...:         dicts[n] = []
    ...:     if y != "Mean":
    ...:         if n == 'Rank':
    ...:             dicts[n].append(y)
    ...:         else:
    ...:             dicts[n].append(np.nan)
    ...:
In [63]: df = pd.DataFrame(dicts)
In [64]: df['Year'] = df['Rank']
In [65]: df['Rank'] = df['Family']
In [66]: df
Out[66]:
Rank Score Economy Family Health Freedom Generosity Trust Year
0 NaN NaN NaN NaN NaN NaN NaN NaN 2015
1 NaN NaN NaN NaN NaN NaN NaN NaN 2016
2 NaN NaN NaN NaN NaN NaN NaN NaN 2017
3 NaN NaN NaN NaN NaN NaN NaN NaN 2018
4 NaN NaN NaN NaN NaN NaN NaN NaN 2019