This question already has answers here: pandas or python equivalent of tidyr complete (4 answers). Closed last year.
My toy DataFrame is similar to:
import pandas as pd

data = {'year': [1999, 2000, 2001, 2002, 2003, 2004, 2005,
                 1999, 2000, 2003, 2004, 2005],
        'id': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
        'price': [1200, 150, 300, 450, 200, 300, 400, 120,
                  140, 150, 155, 156]}
df = pd.DataFrame(data)
What's the most elegant way to add missing years?
In the example, the years 2001 and 2002 are missing for id = 2. In such cases, I still want those years in the DataFrame, with id = 2 and price = NaN.
My real DataFrame has thousands of IDs.
Use a cross merge to create all possible combinations of "year" and "id", then merge the result back to the original DataFrame:
>>> df["year"].drop_duplicates().to_frame().merge(df["id"].drop_duplicates(), how="cross").merge(df, how="left")
year id price
0 1999 1 1200.0
1 1999 2 120.0
2 2000 1 150.0
3 2000 2 140.0
4 2001 1 300.0
5 2001 2 NaN
6 2002 1 450.0
7 2002 2 NaN
8 2003 1 200.0
9 2003 2 150.0
10 2004 1 300.0
11 2004 2 155.0
12 2005 1 400.0
13 2005 2 156.0
You could make "year" a Categorical variable and include it in the groupby:
df['year'] = pd.Categorical(df['year'], categories=df['year'].unique())
out = df.groupby(['id', 'year'], as_index=False).first()
Output:
id year price
0 1 1999 1200.0
1 1 2000 150.0
2 1 2001 300.0
3 1 2002 450.0
4 1 2003 200.0
5 1 2004 300.0
6 1 2005 400.0
7 2 1999 120.0
8 2 2000 140.0
9 2 2001 NaN
10 2 2002 NaN
11 2 2003 150.0
12 2 2004 155.0
13 2 2005 156.0
Update
You can also use product from itertools (wrapped in list so that reindex receives a concrete sequence):
>>> from itertools import product
>>> df.set_index(['year', 'id']) \
      .reindex(list(product(set(df['year']), set(df['id'])))) \
      .sort_index(level=1).reset_index()
year id price
0 1999 1 1200.0
1 2000 1 150.0
2 2001 1 300.0
3 2002 1 450.0
4 2003 1 200.0
5 2004 1 300.0
6 2005 1 400.0
7 1999 2 120.0
8 2000 2 140.0
9 2001 2 NaN
10 2002 2 NaN
11 2003 2 150.0
12 2004 2 155.0
13 2005 2 156.0
Create a MultiIndex of all combinations of the year and id columns, set those columns as the index, and reindex by the MultiIndex:
mi = pd.MultiIndex.from_product([df['year'].unique(), df['id'].unique()], names=['year', 'id'])
out = df.set_index(['year', 'id']).reindex(mi).reset_index().sort_values('id', ignore_index=True)
Output:
>>> out
year id price
0 1999 1 1200.0
1 2000 1 150.0
2 2001 1 300.0
3 2002 1 450.0
4 2003 1 200.0
5 2004 1 300.0
6 2005 1 400.0
7 1999 2 120.0
8 2000 2 140.0
9 2001 2 NaN
10 2002 2 NaN
11 2003 2 150.0
12 2004 2 155.0
13 2005 2 156.0
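As an aside, since the linked duplicate mentions tidyr's complete: if a third-party dependency is acceptable, pyjanitor provides a complete() method modeled on it. A minimal sketch, assuming pyjanitor's API still matches:
# pip install pyjanitor
import janitor  # noqa: F401 -- registers DataFrame accessors such as .complete()

out = df.complete('year', 'id')  # all year x id combinations, NaN where price is missing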
Related
I work with panel data. Typically my panel data is not balanced, i.e., there are some missing years. The panel data generally looks like this:
df = pd.DataFrame({'name': ['a']*4 + ['b']*3 + ['c']*4,
                   'year': [2001, 2002, 2004, 2005] + [2000, 2002, 2003] + [2001, 2002, 2003, 2005],
                   'val1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
                   'val2': [2, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37]})
name year val1 val2
0 a 2001 1 2
1 a 2002 2 5
2 a 2004 3 7
3 a 2005 4 11
4 b 2000 5 13
5 b 2002 6 17
6 b 2003 7 19
7 c 2001 8 23
8 c 2002 9 29
9 c 2003 10 31
10 c 2005 11 37
Now I want to create lead and lag variables grouped by name. Using:
df['val1_lag'] = df.groupby('name')['val1'].shift(1)
df['val1_lead'] = df.groupby('name')['val1'].shift(-1)
This simply shifts up/down by one row, which is not what I want. I want to shift relative to year. My expected output:
name year val1 val2 val1_lag val1_lead
0 a 2001 1 2 NaN 2.0
1 a 2002 2 5 1.0 NaN
2 a 2004 3 7 NaN 4.0
3 a 2005 4 11 3.0 NaN
4 b 2000 5 13 NaN NaN
5 b 2002 6 17 NaN 7.0
6 b 2003 7 19 6.0 NaN
7 c 2001 8 23 NaN 9.0
8 c 2002 9 29 8.0 10.0
9 c 2003 10 31 9.0 NaN
10 c 2005 11 37 NaN NaN
My current workaround is to fill in the missing years with:
df.set_index(['name', 'year'], inplace=True)
mux = pd.MultiIndex.from_product([df.index.levels[0], df.index.levels[1]], names=['name', 'year'])
df = df.reindex(mux).reset_index()
Then I apply the normal shift. However, my data is quite large, and this approach often triples its size, which is not very efficient here. I am looking for a better approach for this scenario.
One solution is to create check columns that flag whether the year is contiguous with its lag and lead. Set each check column to 1.0 or NaN, then multiply it into your normal groupby shift:
# 1.0 where the previous row within `name` is exactly the prior year, else NaN
df['yearlag'] = (df['year'] == 1 + df.groupby('name')['year'].shift(1)) * 1.0
df.loc[df['yearlag'] == 0.0, 'yearlag'] = None

# 1.0 where the next row within `name` is exactly the following year, else NaN
df['yearlead'] = (df['year'] == -1 + df.groupby('name')['year'].shift(-1)) * 1.0
df.loc[df['yearlead'] == 0.0, 'yearlead'] = None
To create the lag/lead variables:
%timeit df['val1_lag'] = df.groupby('name')['val1'].shift(1)*df['yearlag']
You can time this against the merge-based method below to see which is more efficient:
%timeit df['val1_lag'] = df[['name', 'year']].merge(df.eval('year=year+1'), how='left')['val1']
Don't use shift; instead, merge with the year shifted by ±1:
df['val1_lag'] = df[['name', 'year']].merge(df.eval('year = year+1'), how='left')['val1']
df['val1_lead'] = df[['name', 'year']].merge(df.eval('year = year-1'), how='left')['val1']
Output:
name year val1 val2 val1_lag val1_lead
0 a 2001 1 2 NaN 2.0
1 a 2002 2 5 1.0 NaN
2 a 2004 3 7 NaN 4.0
3 a 2005 4 11 3.0 NaN
4 b 2000 5 13 NaN NaN
5 b 2002 6 17 NaN 7.0
6 b 2003 7 19 6.0 NaN
7 c 2001 8 23 NaN 9.0
8 c 2002 9 29 8.0 10.0
9 c 2003 10 31 9.0 NaN
10 c 2005 11 37 NaN NaN
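The same merge trick generalizes to a lag of k years. A minimal sketch, where lag_by_year is a hypothetical helper name (not part of the answer above):
def lag_by_year(df, col, k):
    # value of `col` from k years earlier within each name
    shifted = df.eval(f'year = year + {k}')      # relabel each row k years later
    return df[['name', 'year']].merge(shifted, how='left')[col]

df['val1_lag2'] = lag_by_year(df, 'val1', 2)     # e.g. a two-year lag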
In this example, we attempt to propagate the computed value in each group and column to all the other NaNs in the same group and column.
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 3, 4, 5],
                   'Year': [2000, 2000, 2001, 2001, 2000, 2000, 2000],
                   'Values': [1, 3, 2, 3, 4, 5, 6]})
df['pct'] = df.groupby(['id', 'Year'])['Values'].apply(lambda x: x/x.shift() - 1)
print(df)
id Year Values pct
0 1 2000 1 NaN
1 1 2000 3 2.0
2 2 2001 2 NaN
3 2 2001 3 0.5
4 3 2000 4 NaN
5 4 2000 5 NaN
6 5 2000 6 NaN
I have tried to use .ffill() to fill the NaNs within each group that contains a value. For example, the code is trying to make the NaN at index 0 become 2.0, and the NaN at index 2 become 0.5.
df['pct'] = df.groupby(['id', 'Year'])['pct'].ffill()
print(df)
id Year Values pct
0 1 2000 1 NaN
1 1 2000 3 2.0
2 2 2001 2 NaN
3 2 2001 3 0.5
4 3 2000 4 NaN
5 4 2000 5 NaN
6 5 2000 6 NaN
It should be bfill: within each group the NaN comes before the computed value, so you need a backward fill:
df['pct'] = df.groupby(['id', 'Year'])['pct'].bfill()
df
Out[109]:
id Year Values pct
0 1 2000 1 2.0
1 1 2000 3 2.0
2 2 2001 2 0.5
3 2 2001 3 0.5
4 3 2000 4 NaN
5 4 2000 5 NaN
6 5 2000 6 NaN
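As an aside, the pct column itself can be computed with the built-in pct_change, which is equivalent to x/x.shift() - 1, so the whole pipeline becomes:
# pct_change() computes x/x.shift() - 1 within each (id, Year) group
df['pct'] = df.groupby(['id', 'Year'])['Values'].pct_change()
df['pct'] = df.groupby(['id', 'Year'])['pct'].bfill()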
I have a dataframe:
df = pd.DataFrame({"id": [1, 1, 1, 2, 2, 3], "city": ['abc', 'abc', 'abc', 'def10', 'def10', 'ghk'] ,"year": [2008, 2009, 2010, 2008, 2010,2009], "value": [10,20,30,10,20,30]})
id city year value
0 1 abc 2008 10
1 1 abc 2009 20
2 1 abc 2010 30
3 2 def10 2008 10
4 2 def10 2010 20
5 3 ghk 2009 30
I want to create a balanced panel such that:
id city year value
0 1 abc 2008 10
1 1 abc 2009 20
2 1 abc 2010 30
3 2 def10 2008 10
4 2 def10 2009 NaN
5 2 def10 2010 20
6 3 ghk 2008 NaN
7 3 ghk 2009 30
8 3 ghk 2010 NaN
If I use the following code:
df = df.set_index('id')
balanced = (df.set_index('year', append=True)
              .reindex(pd.MultiIndex.from_product(
                  [df.index, range(df.year.min(), df.year.max() + 1)],
                  names=['frs_id', 'year']))
              .reset_index(level=1))
it gives me the following error:
cannot handle a non-unique multi-index!
You are close to the solution. You can amend your code slightly as follows:
idx = pd.MultiIndex.from_product([df['id'].unique(), range(df.year.min(), df.year.max() + 1)],
                                 names=['id', 'year'])
df2 = df.set_index(['id', 'year']).reindex(idx).reset_index()
df2['city'] = df2.groupby('id')['city'].ffill().bfill()
Changes to your code:
Create the MultiIndex from the unique values of id instead of from the (non-unique) index.
Set the index on both id and year before calling reindex().
Fill in the NaN values of the city column with the non-NaN entries of the same id.
Result:
print(df2)
id year city value
0 1 2008 abc 10.0
1 1 2009 abc 20.0
2 1 2010 abc 30.0
3 2 2008 def10 10.0
4 2 2009 def10 NaN
5 2 2010 def10 20.0
6 3 2008 ghk NaN
7 3 2009 ghk 30.0
8 3 2010 ghk NaN
Optionally, you can re-arrange the column sequence, if you like:
df2.insert(2, 'year', df2.pop('year'))
print(df2)
id city year value
0 1 abc 2008 10.0
1 1 abc 2009 20.0
2 1 abc 2010 30.0
3 2 def10 2008 10.0
4 2 def10 2009 NaN
5 2 def10 2010 20.0
6 3 ghk 2008 NaN
7 3 ghk 2009 30.0
8 3 ghk 2010 NaN
Edit
You can also do it using stack() and unstack() without using reindex(), as follows:
(df.set_index(['id', 'city', 'year'], append=True)
.unstack()
.groupby(level=[1, 2]).max()
.stack(dropna=False)
).reset_index()
Output:
id city year value
0 1 abc 2008 10.0
1 1 abc 2009 20.0
2 1 abc 2010 30.0
3 2 def10 2008 10.0
4 2 def10 2009 NaN
5 2 def10 2010 20.0
6 3 ghk 2008 NaN
7 3 ghk 2009 30.0
8 3 ghk 2010 NaN
Pivot the table and stack year without dropping NaN values (note: since pandas 2.0, pivot only accepts keyword arguments, i.e. df.pivot(index=["id", "city"], columns="year", values="value")):
>>> df.pivot(["id", "city"], "year", "value") \
.stack(dropna=False) \
.rename("value") \
.reset_index()
id city year value
0 1 abc 2008 10.0
1 1 abc 2009 20.0
2 1 abc 2010 30.0
3 2 def10 2008 10.0
4 2 def10 2009 NaN
5 2 def10 2010 20.0
6 3 ghk 2008 NaN
7 3 ghk 2009 30.0
8 3 ghk 2010 NaN
Edit: case of duplicate entries
I slightly modified your original dataframe:
df = pd.DataFrame({"id": [1,1,1,2,2,3,3], "city": ['abc','abc','abc','def10','def10','ghk','ghk'], "year": [2008,2009,2010,2008,2010,2009,2009], "value": [10,20,30,10,20,30,40]})
>>> df
id city year value
0 1 abc 2008 10
1 1 abc 2009 20
2 1 abc 2010 30
3 2 def10 2008 10
4 2 def10 2010 20
5 3 ghk 2009 30 # The problem is here
6 3 ghk 2009 40 # same (id, city, year)
You need to make a decision: keep row 5 or row 6, or apply an aggregation function (mean, sum, ...). Note that "mean" is pivot_table's default aggfunc, so it is spelled out here only for clarity. Imagine you want the mean for (3, ghk, 2009):
>>> df.pivot_table(index=["id", "city"], columns="year", values="value", aggfunc="mean") \
.stack(dropna=False) \
.rename("value") \
.reset_index()
id city year value
0 1 abc 2008 10.0
1 1 abc 2009 20.0
2 1 abc 2010 30.0
3 2 def10 2008 10.0
4 2 def10 2009 NaN
5 2 def10 2010 20.0
6 3 ghk 2008 NaN
7 3 ghk 2009 35.0 # <- mean of (30, 40)
8 3 ghk 2010 NaN
I created a column 'Year'. I want to insert the year into the dataframe every 13 rows, from 2000 to 2018. pd.concat() did not work for me.
    Coastal_Fisheries Small_Pelagic Clam_Harvesting  Total_Catches Month  Year
1               63299       No Data           20301          83600     1   0.0
2               41999         29854           21404          93257     2   0.0
3               41028       No Data            4179          45207     3   0.0
4               35812       No Data            2132          37944     4   0.0
5               70262         13156           81882         165300     5   0.0
6               46519          5940         No Data          52459     6   0.0
7               43317          7981         No Data          51298     7   0.0
8               55803         12219         No Data          68022     8   0.0
9               44737         15772         No Data          60509     9   0.0
10              35031          6233         No Data          41264    10   0.0
11              86585         33925          116176         236686    11   0.0
12              62267         13340          204554         280161    12   0.0
13             626660        138420          450628        1215708  None   0.0
1               60918        143509           60575         265002     1   0.0
The simple way is to create an array of year values starting at 2000, each year repeated 13 times, with the same length as the dataframe's row index, and then insert this array into the dataframe.
For example:
Below, I manually create a dataframe with a step of 13 and 15 rows.
import pandas as pd

step = 13  # the step number (rows per year)

# The dataframe
df = pd.DataFrame({"Total": [i for i in range(15)],
                   "months": [i for i in range(1, step)] + ['None', 1, 2]},
                  index=[i for i in range(1, step + 1)] + [1, 2])

# Create the data for the 'Year' column
lst = len(df.index)   # 15 rows
i_lst = lst // step   # number of complete 13-row blocks
df2 = [2000 + i for i in range(i_lst) for _ in range(step)] + lst % step * [2000 + i_lst]
# result: df2 = [2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2001, 2001]

# Insert the 'Year' column
df.insert(loc=2, column='Year', value=df2)
The results will be:
Total months Year
1 0 1 2000
2 1 2 2000
3 2 3 2000
4 3 4 2000
5 4 5 2000
6 5 6 2000
7 6 7 2000
8 7 8 2000
9 8 9 2000
10 9 10 2000
11 10 11 2000
12 11 12 2000
13 12 None 2000
1 13 1 2001
2 14 2 2001
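A shorter equivalent (a sketch reusing the step and lst variables from above) builds the year array with numpy.repeat and trims it to the frame's length:
import numpy as np

n_years = -(-lst // step)  # ceiling division: year blocks needed to cover every row
df['Year'] = np.repeat(np.arange(2000, 2000 + n_years), step)[:lst]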
If you want to use the code you posted in the comment above:
df = pd.DataFrame({'Year': [year for year in range(2000, 2019)]})  # 2018 inclusive
df = df.loc[df.index.repeat(13)]
then you are probably experiencing an indexing error, and you need to reset the indexes of both your original dataframe (which I'm calling data) and your "Year" dataframe:
data = data.reset_index(drop=True)
df = df.reset_index(drop=True)
and at this point pd.concat() should work:
df2 = pd.concat([data,df], axis=1)
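Putting the pieces together, a sketch of the whole fix (assuming, as above, that your main frame is named data):
years = pd.DataFrame({'Year': range(2000, 2019)})
years = years.loc[years.index.repeat(13)].reset_index(drop=True)

data = data.reset_index(drop=True)
df2 = pd.concat([data, years.iloc[:len(data)]], axis=1)  # trim in case the lengths differ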
If your index is from 0 to n:
You can group by blocks of 13 consecutive index values and assign each block a year:
df['Year'] = df.groupby((df.index)//13).ngroup().add(2000)
If your index restarts as shown in the post, group by runs of consecutive index values (see the sketch after the output table below):
df['Year'] = df.groupby(df.index.to_series().diff().ne(1).cumsum()).ngroup().add(2000)
    Coastal_Fisheries Small_Pelagic Clam_Harvesting  Total_Catches Month  Year
1               41999       29854.0         21404.0          93257     2  2000
2               41028           NaN          4179.0          45207     3  2000
3               35812           NaN          2132.0          37944     4  2000
4               70262       13156.0         81882.0         165300     5  2000
5               46519        5940.0             NaN          52459     6  2000
6               43317        7981.0             NaN          51298     7  2000
7               55803       12219.0             NaN          68022     8  2000
8               44737       15772.0             NaN          60509     9  2000
9               35031        6233.0             NaN          41264    10  2000
10              86585       33925.0        116176.0         236686    11  2000
11              62267       13340.0        204554.0         280161    12  2000
12             626660      138420.0        450628.0        1215708  None  2000
13              60918      143509.0         60575.0         265002     1  2000
1               41999       29854.0         21404.0          93257     2  2001
2               41028           NaN          4179.0          45207     3  2001
3               35812           NaN          2132.0          37944     4  2001
4               70262       13156.0         81882.0         165300     5  2001
5               46519        5940.0             NaN          52459     6  2001
6               43317        7981.0             NaN          51298     7  2001
7               55803       12219.0             NaN          68022     8  2001
8               44737       15772.0             NaN          60509     9  2001
9               35031        6233.0             NaN          41264    10  2001
10              86585       33925.0        116176.0         236686    11  2001
11              62267       13340.0        204554.0         280161    12  2001
12             626660      138420.0        450628.0        1215708  None  2001
13              60918      143509.0         60575.0         265002     1  2001
1               63299           NaN         20301.0          83600     1  2002
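To see why df.index.to_series().diff().ne(1).cumsum() identifies the blocks, here is a minimal sketch on a toy index that restarts:
import pandas as pd

idx = pd.Series([1, 2, 3, 1, 2])    # an index that restarts part-way through
runs = idx.diff().ne(1).cumsum()    # a diff != 1 marks each restart; cumsum labels the runs
print(runs.tolist())                # [1, 1, 1, 2, 2] -> one group id per run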
I have the following dataframe:
PersonID AmountPaid PaymentReceivedDate StartDate withinNYears
1 100 2017 2016
2 20 2014 2014
1 30 2017 2016
1 40 2016 2016
4 300 2015 2000
5 150 2005 2002
What I'm looking for: AmountPaid should appear in the withinNYears column if the payment was made within n years of the start date; otherwise you get NaN.
n can be any number, but let's say 2 for this example (as I will be playing with this to see findings).
So basically the above dataframe would come out like this if the amount was paid within 2 years:
PersonID AmountPaid PaymentReceivedDate StartDate withinNYears
1 100 2017 2016 100
2 20 2014 2014 20
1 30 2017 2016 30
1 40 2016 2016 40
4 300 2015 2000 NaN
5 150 2005 2002 NaN
Does anyone know how to achieve this? Cheers.
Subtract the columns and compare with a scalar to get a boolean mask, then set the values with numpy.where, Series.where, or DataFrame.loc:
import numpy as np

m = (df['PaymentReceivedDate'] - df['StartDate']) < 2
df['withinNYears'] = np.where(m, df['AmountPaid'], np.nan)

# alternatives
# df['withinNYears'] = df['AmountPaid'].where(m)
# df.loc[m, 'withinNYears'] = df['AmountPaid']
print (df)
PersonID AmountPaid PaymentReceivedDate StartDate \
0 1 100 2017 2016
1 2 20 2014 2014
2 1 30 2017 2016
3 1 40 2016 2016
4 4 300 2015 2000
5 5 150 2005 2002
withinNYears
0 100.0
1 20.0
2 30.0
3 40.0
4 NaN
5 NaN
EDIT:
If the StartDate column holds datetimes:
m = (df['PaymentReceivedDate'] - df['StartDate'].dt.year) < 2
Just assign with DataFrame.loc:
df.loc[df['PaymentReceivedDate'] - df['StartDate'] < 2, 'withinNYears'] = df.AmountPaid
df
Out[37]:
PersonID AmountPaid ... StartDate withinNYears
0 1 100 ... 2016 100.0
1 2 20 ... 2014 20.0
2 1 30 ... 2016 30.0
3 1 40 ... 2016 40.0
4 4 300 ... 2000 NaN
5 5 150 ... 2002 NaN
[6 rows x 5 columns]
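Note that the NaNs force withinNYears to a float dtype. If you would rather keep whole numbers, a sketch using pandas' nullable integer dtype:
# Nullable Int64 keeps integer display alongside missing values (<NA>)
df['withinNYears'] = df['AmountPaid'].where(
    df['PaymentReceivedDate'] - df['StartDate'] < 2
).astype('Int64')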