I am using Python 3.9 on Pycharm. I have the following dataframe:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
6 C 2020 5
7 C 2021 4
I want to keep individuals that have available data for the whole period. In other words, I would like to filter the rows such that I only keep id that have data for the three years (2019, 2020, 2021). This means excluding all observations of id C and keep all observations of id A and B:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Is it feasible in Python?
As you want to include only the ids for which all three year exist, you can group the dataframe by id then filter based on set equalities for the years you want versus the years available for particular id:
>>> years = {2019, 2020, 2021}
>>> df.groupby('id').filter(lambda x: set(x['year'].values)==years)
# df is your dataframe
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
First, make a set of all the years existing in the column year then use a boolean mask to filter your dataframe. For that, you need pandas.DataFrame.groupby and pandas.DataFrame.transform to count the occurences of each id in each group of year.
from io import StringIO
import pandas as pd
s = """id year gdp
A 2019 3
A 2020 0
A 2021 5
B 2019 4
B 2020 2
B 2021 1
C 2020 5
C 2021 4
"""
df = pd.read_csv(StringIO(s), sep='\t')
mask = df.groupby('id')['year'].transform('count').eq(len(set(df['id'])))
out = df[mask]
>>> print(out)
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Here is a way using pivot and dropna to automatically find ids with missing values:
keep = df.pivot('id', 'year', 'gdp').dropna().index
# ['A', 'B']
out = df[df['id'].isin(keep)]
output:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
I have a DF such as the one below:
ID
Year
Value
1
2007
1
1
2008
1
1
2009
1
1
2011
1
1
2013
1
1
2014
1
1
2015
1
2
2008
1
2
2010
1
2
2011
1
2
2012
1
2
2013
1
2
2014
1
3
2009
1
3
2010
1
3
2011
1
3
2012
1
3
2013
1
3
2014
1
3
2015
1
As you can see, in ID '1' I am missing values for 2010 and 2012; and for ID '2' I am missing values for 2008, 2009, 2015, and ID '3' I am missing 2007, 2008. So, I would like to fill these gaps with the value '1'. What I would like to achieve is below:
ID
Year
Value
1
2007
1
1
2008
1
1
2009
1
1
2010
1
1
2011
1
1
2012
1
1
2013
1
1
2014
1
1
2015
1
2
2007
1
2
2008
1
2
2009
1
2
2010
1
2
2011
1
2
2012
1
2
2013
1
2
2014
1
2
2015
1
3
2007
1
3
2008
1
3
2009
1
3
2010
1
3
2011
1
3
2012
1
3
2013
1
3
2014
1
3
2015
1
I have created the below so far; however, that only fills for one ID, and i was struggling to find a way to loop through each ID adding a 'value' for each year that is missing:
idx = pd.date_range('2007', '2020', freq ='Y')
DF.index = pd.DatetimeIndex(DF.index)
DF_s = DF.reindex(idx, fill_value=0)
Any ideas would be helpful, please.
I'm not sure I got what you want to achieve, but if you want to fill NaNs in the "Value" column between 2007 and 2015 (suggesting that there are more years where you don't want to fill the column), you could do something like this:
import math
df1 = pd.DataFrame({'ID': [1,1,1,2,2,2],
'Year': [2007,2010,2020,2007,2010,2015],
'Value': [1,None,None,None,1,None]})
# Write a function with your logic
def func(x, y):
return 0 if math.isnan(y) and 2007<=x<=2015 else y
# Apply it to the df and update the column
df1['Value'] = df1.apply(lambda x: func(x.Year, x.Value), axis=1)
# ID Year Value
# 0 1 2007 1.0
# 1 1 2010 0.0
# 2 1 2020 NaN
# 3 2 2007 0.0
# 4 2 2010 1.0
# 5 2 2015 0.0
Answering my own question :). Needed to apply a lambda function after doing the groupby['org'] that adds a nan to each year that is missing. The reset_index effectivity ungroups it back into the original list.
f = lambda x: x.reindex(pd.date_range(pd.to_datetime('2007'), pd.to_datetime('2020'), name='date', freq='Y'))
DF_fixed = DF.set_index('Year').groupby(['Org']).apply(f).drop(['Org'], axis=1)
DF.reset_index()
I have a time series data of the following form:
Item 2020 Jan 2020 Feb 2020 Mar 2020 Apr 2020 May 2020 Jun
0 A 0 1 2 3 4 5
1 B 5 4 3 2 1 0
This is monthly data but I want to get quarterly data of this data. A normal quarterly data would be calculated by summing up Jan-Mar and Apr-Jun and would look like this:
Item 2020 Q1 2020 Q2
0 A 3 12
1 B 12 3
I want to get smoother quarterly data so it would shift by only 1 month for each new data item, not 3 months. So it would have Jan-Mar, then Feb-Apr, then Mar-May, and Apr-Jun. So the resulting data would look like this:
Item 2020 Q1 2020 Q1 2020 Q1 2020 Q2
0 A 3 6 9 12
1 B 12 9 6 3
I believe this is similar to cumsum which can be used as follows:
df_dates = df.iloc[:,1:]
df_dates.cumsum(axis=1)
which leads to the following result:
2020 Jan 2020 Feb 2020 Mar 2020 Apr 2020 May 2020 Jun
0 0 1 3 6 10 15
1 5 9 12 14 15 15
but instead of getting the sum over the whole time, it gets the sum of the nearest 3 months (a quarter).
I do not know how this version of cumsum is called but I saw it in many places so I believe there might be a library function for that.
Let us solve in steps
Set the index to Item column
Parse the date like columns to quarterly period
Calculate the rolling sum with window of size 3
Shift the calculated rolling sum 2 units along the columns axis and get rid of the last two columns
s = df.set_index('Item')
s.columns = pd.PeriodIndex(s.columns, freq='M').strftime('%Y Q%q')
s = s.rolling(3, axis=1).sum().shift(-2, axis=1).iloc[:, :-2]
print(s)
2020 Q1 2020 Q1 2020 Q1 2020 Q2
Item
A 3.0 6.0 9.0 12.0
B 12.0 9.0 6.0 3.0
Try with column wise groupby with axis=1:
>>> df.iloc[:, [0]].join(df.iloc[:, 1:].groupby(pd.to_datetime(df.columns[1:], format='%Y %b').quarter, axis=1).sum().add_prefix('Q'))
Item Q1 Q2
0 A 3 12
1 B 12 3
>>>
Edit:
I misread your question, to do what you want try rolling sum:
>>> x = df.rolling(3, axis=1).sum().dropna(axis='columns')
>>> df.iloc[:, [0]].join(x.set_axis('Q' + pd.to_datetime(df.columns[1:], format='%Y %b').quarter.astype(str)[:len(x.T)], axis=1))
Item Q1 Q1 Q1 Q2
0 A 3.0 6.0 9.0 12.0
1 B 12.0 9.0 6.0 3.0
>>>
I have a dateframe where values in the "price" column are different depending on both the values in the "quantity" and "year" columns. For example for a quantity equal to 2 I have a price equal to 2 in the 2017 and equal to 4 in the 2018. I would like to fill the rows for 2019, that have a 0 and NaN value, with values from 2018.
df = pd.DataFrame({
'quantity': pd.Series([1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9]),
'year': pd.Series([2017,2017,2017,2017,2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2019,2019,2019,2019,]),
'price': pd.Series([1,2,3,4,5,6,7,8,9,2,4,6,8,10,12,14,16,18,np.NaN,np.NaN,0,0,np.NaN,0,np.NaN,0,np.NaN])
})
And what if, instead of taking values from 2018, I should calculate a mean between 2017 and 2018?
I tried to readapt this question applying it to the first case (to apply data from 2018), but it doesn't work:
df['price'][df['year']==2019].fillna(df['price'][df['year'] == 2018], inplace = True)
Could you please help me?
The expected output should be a dataframe like the followings:
Df with values from 2018
df = pd.DataFrame({
'quantity': pd.Series([1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9]),
'year': pd.Series([2017,2017,2017,2017,2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2019,2019,2019,2019,]),
'price': pd.Series([1,2,3,4,5,6,7,8,9,2,4,6,8,10,12,14,16,18,2,4,6,8,10,12,14,16,18])
})
Df with values that are a mean between 2017 and 2018
df = pd.DataFrame({
'quantity': pd.Series([1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9]),
'year': pd.Series([2017,2017,2017,2017,2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2019,2019,2019,2019,]),
'price': pd.Series([1,2,3,4,5,6,7,8,9,2,4,6,8,10,12,14,16,18,1.5,3,4.5,6,7.5,9,10.5,12,13.5])
})
Here's one way filling with the mean of 2017 and 2018.
Start by grouping the previous year's data by the quantity and aggregating with the mean:
m = df[df.year.isin([2017, 2018])].groupby('quantity').price.mean()
Use set_index to set the quantity column as index, replace 0s by NaNs and use fillna which also accepts dictionaries to map the values according to the index:
ix = df[df.year.eq(2019)].index
df.loc[ix, 'price'] = (df.loc[ix].set_index('quantity').price
.replace(0, np.nan).fillna(m).values)
quantity year price
0 1 2017 1.0
1 2 2017 2.0
2 3 2017 3.0
3 4 2017 4.0
4 5 2017 5.0
5 6 2017 6.0
6 7 2017 7.0
7 8 2017 8.0
8 9 2017 9.0
9 1 2018 2.0
10 2 2018 4.0
11 3 2018 6.0
12 4 2018 8.0
13 5 2018 10.0
14 6 2018 12.0
15 7 2018 14.0
16 8 2018 16.0
17 9 2018 18.0
18 1 2019 1.5
19 2 2019 3.0
20 3 2019 4.5
21 4 2019 6.0
22 5 2019 7.5
23 6 2019 9.0
24 7 2019 10.5
25 8 2019 12.0
26 9 2019 13.5