I am using Python 3.9 on Pycharm. I have the following dataframe:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
6 C 2020 5
7 C 2021 4
I want to keep only the individuals that have data for the whole period. In other words, I would like to filter the rows so that I only keep the ids that have data for all three years (2019, 2020, 2021). This means excluding all observations of id C and keeping all observations of ids A and B:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Is it feasible in Python?
As you want to include only the ids for which all three years exist, you can group the dataframe by id, then filter each group by comparing the set of years you want with the set of years available for that id:
>>> years = {2019, 2020, 2021}
>>> df.groupby('id').filter(lambda x: set(x['year'].values)==years)
# df is your dataframe
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
First, make a set of all the years in the year column, then use a boolean mask to filter your dataframe. For that, you need pandas.DataFrame.groupby and transform to count the rows of each id and compare that count with the number of distinct years.
from io import StringIO
import pandas as pd
s = """id year gdp
A 2019 3
A 2020 0
A 2021 5
B 2019 4
B 2020 2
B 2021 1
C 2020 5
C 2021 4
"""
df = pd.read_csv(StringIO(s), sep=r'\s+')
# compare each id's number of distinct years with the number of distinct years overall
mask = df.groupby('id')['year'].transform('nunique').eq(df['year'].nunique())
out = df[mask]
>>> print(out)
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Here is a way using pivot and dropna to automatically find ids with missing values:
keep = df.pivot(index='id', columns='year', values='gdp').dropna().index
# ['A', 'B']
out = df[df['id'].isin(keep)]
output:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
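For reference, here is a minimal self-contained sketch that derives the required years from the data instead of hard-coding them. It assumes "complete" means an id covers every year that appears anywhere in the frame; the names all_years, years_per_id and complete_ids are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'id': list('AAABBBCC'),
    'year': [2019, 2020, 2021, 2019, 2020, 2021, 2020, 2021],
    'gdp': [3, 0, 5, 4, 2, 1, 5, 4],
})

# Every year that occurs anywhere in the data
all_years = set(df['year'])

# One set of observed years per id
years_per_id = df.groupby('id')['year'].apply(set)

# Keep the ids whose set of years matches the full set
complete_ids = years_per_id[years_per_id.map(lambda s: s == all_years)].index
out = df[df['id'].isin(complete_ids)]
print(out)
```

If an id only needs to cover a fixed list of years, all_years can be replaced by that list turned into a set.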
I have a time series data of the following form:
Item 2020 Jan 2020 Feb 2020 Mar 2020 Apr 2020 May 2020 Jun
0 A 0 1 2 3 4 5
1 B 5 4 3 2 1 0
This is monthly data, but I want to derive quarterly data from it. Normal quarterly data would be calculated by summing Jan-Mar and Apr-Jun and would look like this:
Item 2020 Q1 2020 Q2
0 A 3 12
1 B 12 3
I want to get smoother quarterly data so it would shift by only 1 month for each new data item, not 3 months. So it would have Jan-Mar, then Feb-Apr, then Mar-May, and Apr-Jun. So the resulting data would look like this:
Item 2020 Q1 2020 Q1 2020 Q1 2020 Q2
0 A 3 6 9 12
1 B 12 9 6 3
I believe this is similar to cumsum which can be used as follows:
df_dates = df.iloc[:,1:]
df_dates.cumsum(axis=1)
which leads to the following result:
2020 Jan 2020 Feb 2020 Mar 2020 Apr 2020 May 2020 Jun
0 0 1 3 6 10 15
1 5 9 12 14 15 15
but instead of getting the sum over the whole time, it gets the sum of the nearest 3 months (a quarter).
I do not know what this version of cumsum is called, but I have seen it in many places, so I believe there might be a library function for it.
Let us solve in steps:
1. Set the index to Item column
2. Parse the date like columns to quarterly period
3. Calculate the rolling sum with window of size 3
4. Shift the calculated rolling sum 2 units along the columns axis and get rid of the last two columns
s = df.set_index('Item')
s.columns = pd.PeriodIndex(s.columns, freq='M').strftime('%Y Q%q')
# rolling(..., axis=1) is deprecated in recent pandas, so roll over the transposed frame instead
s = s.T.rolling(3).sum().shift(-2).iloc[:-2].T
print(s)
2020 Q1 2020 Q1 2020 Q1 2020 Q2
Item
A 3.0 6.0 9.0 12.0
B 12.0 9.0 6.0 3.0
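This kind of sum is usually called a rolling (or moving) sum. Below is a small self-contained sketch of the same idea that labels each window by the months it covers rather than by quarter, which avoids the duplicate "2020 Q1" column names. The label format is my own choice, not something from the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Item': ['A', 'B'],
    '2020 Jan': [0, 5], '2020 Feb': [1, 4], '2020 Mar': [2, 3],
    '2020 Apr': [3, 2], '2020 May': [4, 1], '2020 Jun': [5, 0],
})

monthly = df.set_index('Item')

# Transpose so the months run down the rows, sum every window of 3 rows,
# drop the first two incomplete windows, then transpose back.
rolled = monthly.T.rolling(3).sum().dropna().T

# Label each window "<first month>-<last month>", e.g. "2020 Jan-Mar"
cols = list(monthly.columns)
rolled.columns = [f"{cols[i]}-{cols[i + 2].split()[1]}" for i in range(len(cols) - 2)]
print(rolled)
```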
Try with column wise groupby with axis=1:
>>> df.iloc[:, [0]].join(df.iloc[:, 1:].groupby(pd.to_datetime(df.columns[1:], format='%Y %b').quarter, axis=1).sum().add_prefix('Q'))
Item Q1 Q2
0 A 3 12
1 B 12 3
>>>
Edit:
I misread your question. To do what you want, try a rolling sum over the month columns only (the Item column is not numeric, so leave it out):
>>> x = df.iloc[:, 1:].T.rolling(3).sum().dropna().T
>>> df.iloc[:, [0]].join(x.set_axis('Q' + pd.to_datetime(df.columns[1:], format='%Y %b').quarter.astype(str)[:len(x.T)], axis=1))
Item Q1 Q1 Q1 Q2
0 A 3.0 6.0 9.0 12.0
1 B 12.0 9.0 6.0 3.0
>>>
I have a data frame like this:
Year Month Date X Y
2015 5 1 0.21120733 0.17662421
2015 5 2 0.36878636 0.14629167
2015 5 3 0.27969632 0.37910569
2016 5 1 -1.2968733 8.29E-02
2016 5 2 -1.1575716 -0.20657887
2016 5 3 -1.0049003 -0.39670503
2017 5 1 -1.5630698 1.1710221
2017 5 2 -1.70889 0.93349206
2017 5 3 -1.8548334 0.86701781
2018 5 1 -7.94E-02 0.3962194
2018 5 2 -2.91E-02 0.39321879
I want to make it like
2015 2016 2017 2018
0.21120733 -1.2968733 -1.5630698 -7.94E-02
0.36878636 -1.1575716 -1.70889 -2.91E-02
0.27969632 -1.0049003 -1.8548334 NA
I tried using df.pivot(columns='Year',values='X'), but the result is not what I expected.
Try passing index in pivot():
out=df.pivot(columns='Year',values='X',index='Date')
#If needed use:
out=out.rename_axis(index=None,columns=None)
OR
Try via agg() and dropna():
out=df.pivot(columns='Year',values='X').agg(sorted,key=pd.isnull).dropna(how='all')
#If needed use:
out.columns.names=[None]
output of out:
2015 2016 2017 2018
0 0.211207 -1.296873 -1.563070 -0.0794
1 0.368786 -1.157572 -1.708890 -0.0291
2 0.279696 -1.004900 -1.854833 NaN
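If the Date column cannot be relied on to line the rows up, another option is to number the rows within each year and pivot on that number. This is only a sketch under that assumption; the helper column name pos is made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    'Year': [2015, 2015, 2015, 2016, 2016, 2016, 2017, 2017, 2017, 2018, 2018],
    'Month': [5] * 11,
    'Date': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2],
    'X': [0.21120733, 0.36878636, 0.27969632,
          -1.2968733, -1.1575716, -1.0049003,
          -1.5630698, -1.70889, -1.8548334,
          -7.94e-02, -2.91e-02],
})

# cumcount() numbers the rows 0, 1, 2, ... inside each year,
# which gives pivot a unique (row, column) position for every value.
out = (df.assign(pos=df.groupby('Year').cumcount())
         .pivot(index='pos', columns='Year', values='X')
         .rename_axis(index=None, columns=None))
print(out)
```

Years with fewer rows (2018 here) are padded with NaN at the bottom.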
I am working in Jupyter. My dataframe has the number of transactions per customer per year and a "trend" field: "up" for more transactions than the previous year, "down" for fewer transactions than the previous year, and null for the first year.
I want to create a counter that, per customer, increases by 1 for every "up" and decreases by 1 for every "down".
I understand that I first need to sort the df and then build a loop over the customers with an inner loop over each year, but I need help.
DF SAMPLE:
df = pd.DataFrame({
'group number': [1,1,1,1,3,3,3],
'year': ['2012','2013','2014','2015','2011','2012','2013'],
'trend': [np.nan,'down','up','up',np.nan,'down','up']
})
this is what I did so far:
df =pd.read_excel('totals_new.xlsx',sheet_name='Sheet1').sort_values(['group number', 'year'])
noofgroups = len(df['group number'].unique())
yearspergroup = df.groupby('group number')['year'].nunique()
vtrend =0
for i in noofgroups:
    for j in yearspergroup:
        if df["trend"] == "up":
            vtrend = vtrend+1
        if df["trend"] == "down":
            vtrend = vtrend-1
IIUC, you can use nested np.where() to convert your trend column and then perform a groupby() and agg(). Take this sample dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({
'group number': [1,1,1,1,1,1,1,2,2,2,2,2,2,1,1,1,2,2,1,2,1,2],
'year': ['2017','2016','2018','2017','2016','2018','2017','2016','2018','2017','2016','2018',
'2017','2016','2018','2017','2016','2018','2017','2016','2018','2017'],
'trend': ['up','down','up',np.nan,'up','down',np.nan,'up','up','up','down',
'up',np.nan,'up','up','up','down','up','up','up',np.nan,'down']
})
Yields:
group number year trend
0 1 2017 up
1 1 2016 down
2 1 2018 up
3 1 2017 NaN
4 1 2016 up
5 1 2018 down
6 1 2017 NaN
7 2 2016 up
8 2 2018 up
9 2 2017 up
10 2 2016 down
11 2 2018 up
12 2 2017 NaN
13 1 2016 up
14 1 2018 up
15 1 2017 up
16 2 2016 down
17 2 2018 up
18 1 2017 up
19 2 2016 up
20 1 2018 NaN
21 2 2017 down
Then:
df['trend'] = np.where(df['trend']=='up', 1, np.where(df['trend']=='down', -1, 0))
df.groupby(['group number','year']).agg({'trend': 'sum'})
Returns:
trend
group number year
1 2016 1
2017 3
2018 1
2 2016 0
2017 0
2018 3
This question is probably settled by now, but here is a possible solution, since the earlier answers did not reach a conclusion.
import numpy as np
import pandas as pd
"""
In this case, the original dataframe is already properly sorted by group number and year.
If it isn't, the 2 columns should be sorted first
"""
df = pd.DataFrame({
'group number': [1,1,1,1,3,3,3],
'year': ['2012','2013','2014','2015','2011','2012','2013'],
'trend': [np.nan,'down','up','up', np.nan,'down','up']
})
# map 'down' to -1 and 'up' to 1; NaN trends stay NaN
df['trend_val'] = df['trend'].map({'down': -1, 'up': 1})
# assign the result back so the cumulative column is kept
df = df.join(df.groupby('group number')['trend_val'].cumsum(), rsuffix='_cumulative')
>>>df
group number year trend trend_val trend_val_cumulative
0 1 2012 NaN NaN NaN
1 1 2013 down -1.0 -1.0
2 1 2014 up 1.0 0.0
3 1 2015 up 1.0 1.0
4 3 2011 NaN NaN NaN
5 3 2012 down -1.0 -1.0
6 3 2013 up 1.0 0.0
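If the first year of each customer should count as 0 instead of NaN, a variation along these lines might be closer to what was asked. It is a sketch that assumes a missing trend means "no change"; the column name counter is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group number': [1, 1, 1, 1, 3, 3, 3],
    'year': ['2012', '2013', '2014', '2015', '2011', '2012', '2013'],
    'trend': [np.nan, 'down', 'up', 'up', np.nan, 'down', 'up'],
})

# The running total only makes sense in chronological order per customer
df = df.sort_values(['group number', 'year'])

# up -> +1, down -> -1, missing (first year) -> 0
step = df['trend'].map({'up': 1, 'down': -1}).fillna(0).astype(int)

# Running total that restarts for every customer
df['counter'] = step.groupby(df['group number']).cumsum()
print(df)
```

No explicit loops are needed: groupby handles the "per customer" part and cumsum handles the "per year" part.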
I have a multiindex pandas dataframe:
SHOPPING_COUNT
CLIENT YEAR MONTH
1000063 2013 12 9
2014 1 9
2 7
3 9
2015 4 6
5 5
6 9
1001327 2014 5 1
6 1
2015 2 7
3 1
4 3
1001399 2013 8 1
I would like to know the first index entry of each client, ordered by level 0.
I mean, I would want to get:
1000063 2013 12
1001327 2014 5
1001399 2013 8
Let df be your dataframe, you can do something like:
df = df.groupby(level=0).apply(lambda x: x.iloc[0:1])
df.index = df.index.droplevel(0)
There may be a simpler way to do this, but I think this method works.
It's not very programmatic, but if you look at the result of:
client = 1000063
df.loc[client].index
Then the following would work:
# MultiIndex.labels was renamed to MultiIndex.codes in newer pandas versions
year = df.loc[client].index.levels[0][df.loc[client].index.codes[0][0]]
month = df.loc[client].index.levels[1][df.loc[client].index.codes[1][0]]
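A shorter route that may be worth trying is groupby(level=...).head(1), which keeps the first row of every client together with its full index. The sketch below rebuilds the frame from the question to show it; it assumes the rows are already in the order you care about.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(1000063, 2013, 12), (1000063, 2014, 1), (1000063, 2014, 2),
     (1000063, 2014, 3), (1000063, 2015, 4), (1000063, 2015, 5),
     (1000063, 2015, 6), (1001327, 2014, 5), (1001327, 2014, 6),
     (1001327, 2015, 2), (1001327, 2015, 3), (1001327, 2015, 4),
     (1001399, 2013, 8)],
    names=['CLIENT', 'YEAR', 'MONTH'],
)
df = pd.DataFrame({'SHOPPING_COUNT': [9, 9, 7, 9, 6, 5, 9, 1, 1, 7, 1, 3, 1]},
                  index=index)

# First row per client, with the original (CLIENT, YEAR, MONTH) index kept
first_rows = df.groupby(level='CLIENT').head(1)

# The index entries themselves, as plain tuples
first_keys = first_rows.index.tolist()
print(first_keys)
```

If the index is not sorted, calling df.sort_index() first should make "first" mean the earliest year and month.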