Ranking transaction trends for each customer per year - Python

I'm working in Jupyter. My dataframe has the number of transactions per customer per year, and a field that indicates the trend: "up" for more transactions than the previous year, "down" for fewer transactions than the previous year, and null for the first year.
I want to create a counter that, for every "up" per customer, is raised by 1, and for every "down" is reduced by 1.
I understand that I first need to sort the df and then build a loop that runs over the customers, with an inner loop that runs over the years, but I need help.
DF SAMPLE:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'group number': [1,1,1,1,3,3,3],
    'year': ['2012','2013','2014','2015','2011','2012','2013'],
    'trend': [np.nan,'down','up','up',np.nan,'down','up']
})
This is what I did so far:
df = pd.read_excel('totals_new.xlsx', sheet_name='Sheet1').sort_values(['group number', 'year'])
noofgroups = len(df['group number'].unique())
yearspergroup = df.groupby('group number')['year'].nunique()
vtrend = 0
for i in noofgroups:
    for j in yearspergroup:
        if df["trend"] == "up":
            vtrend = vtrend + 1
        if df["trend"] == "down":
            vtrend = vtrend - 1

IIUC, you can use nested np.where() to convert your trend column and then perform a groupby() and agg(). Take this sample dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'group number': [1,1,1,1,1,1,1,2,2,2,2,2,2,1,1,1,2,2,1,2,1,2],
    'year': ['2017','2016','2018','2017','2016','2018','2017','2016','2018','2017','2016','2018',
             '2017','2016','2018','2017','2016','2018','2017','2016','2018','2017'],
    'trend': ['up','down','up',np.nan,'up','down',np.nan,'up','up','up','down',
              'up',np.nan,'up','up','up','down','up','up','up',np.nan,'down']
})
Yields:
group number year trend
0 1 2017 up
1 1 2016 down
2 1 2018 up
3 1 2017 NaN
4 1 2016 up
5 1 2018 down
6 1 2017 NaN
7 2 2016 up
8 2 2018 up
9 2 2017 up
10 2 2016 down
11 2 2018 up
12 2 2017 NaN
13 1 2016 up
14 1 2018 up
15 1 2017 up
16 2 2016 down
17 2 2018 up
18 1 2017 up
19 2 2016 up
20 1 2018 NaN
21 2 2017 down
Then:
df['trend'] = np.where(df['trend']=='up', 1, np.where(df['trend']=='down', -1, 0))
df.groupby(['group number','year']).agg({'trend': 'sum'})
Returns:
trend
group number year
1 2016 1
2017 3
2018 1
2 2016 0
2017 0
2018 3
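If what you need is a running counter per customer rather than a per-year total, the same mapped column can feed a cumulative sum (a sketch, assuming the frame is sorted by group and year first):
# Running counter: +1 for every 'up', -1 for every 'down', per group
df = df.sort_values(['group number', 'year'])
df['counter'] = df.groupby('group number')['trend'].cumsum()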

This case is probably closed by now, but here's a possible solution since it never came to a conclusion.
import numpy as np
import pandas as pd
"""
In this case, the original dataframe is already properly sorted by group number and year.
If it isn't, the two columns should be sorted first.
"""
df = pd.DataFrame({
    'group number': [1,1,1,1,3,3,3],
    'year': ['2012','2013','2014','2015','2011','2012','2013'],
    'trend': [np.nan,'down','up','up',np.nan,'down','up']
})
df['trend_val'] = df.loc[df['trend'].notna(), 'trend'].map(lambda x: -1 if x == 'down' else 1)
df = df.join(df.groupby('group number')['trend_val'].cumsum(), rsuffix='_cumulative')
>>> df
group number year trend trend_val trend_val_cumulative
0 1 2012 NaN NaN NaN
1 1 2013 down -1.0 -1.0
2 1 2014 up 1.0 0.0
3 1 2015 up 1.0 1.0
4 3 2011 NaN NaN NaN
5 3 2012 down -1.0 -1.0
6 3 2013 up 1.0 0.0
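As a side note, the notna filter plus lambda could also be written as a plain dict mapping, which leaves NaN (and anything unmapped) as NaN automatically (a sketch):
df['trend_val'] = df['trend'].map({'up': 1, 'down': -1})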

Related

Filter individuals that don't have data for the whole period

I am using Python 3.9 in PyCharm. I have the following dataframe:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
6 C 2020 5
7 C 2021 4
I want to keep individuals that have available data for the whole period. In other words, I would like to filter the rows such that I only keep ids that have data for all three years (2019, 2020, 2021). This means excluding all observations of id C and keeping all observations of ids A and B:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Is it feasible in Python?
As you want to include only the ids for which all three years exist, you can group the dataframe by id and then filter based on set equality between the years you want and the years available for each particular id:
>>> years = {2019, 2020, 2021}
>>> df.groupby('id').filter(lambda x: set(x['year'].values)==years)
# df is your dataframe
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
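If ids with extra years should also be kept (i.e. they must cover at least those three years, rather than exactly), issubset would relax the equality check (a sketch):
>>> df.groupby('id').filter(lambda x: years.issubset(x['year'].values))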
Another option is to build a boolean mask and filter your dataframe with it. For that, you can use pandas.DataFrame.groupby with transform to count the distinct years available for each id and compare that against the number of distinct years overall.
from io import StringIO
import pandas as pd

s = """id year gdp
A 2019 3
A 2020 0
A 2021 5
B 2019 4
B 2020 2
B 2021 1
C 2020 5
C 2021 4
"""
df = pd.read_csv(StringIO(s), sep=r'\s+')
mask = df.groupby('id')['year'].transform('nunique').eq(df['year'].nunique())
out = df[mask]
>>> print(out)
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1
Here is a way using pivot and dropna to automatically find ids with missing values:
keep = df.pivot(index='id', columns='year', values='gdp').dropna().index
# ['A', 'B']
out = df[df['id'].isin(keep)]
output:
id year gdp
0 A 2019 3
1 A 2020 0
2 A 2021 5
3 B 2019 4
4 B 2020 2
5 B 2021 1

Loop through timeseries and fill missing data - Python

I have a DF such as the one below:
ID  Year  Value
1   2007  1
1   2008  1
1   2009  1
1   2011  1
1   2013  1
1   2014  1
1   2015  1
2   2008  1
2   2010  1
2   2011  1
2   2012  1
2   2013  1
2   2014  1
3   2009  1
3   2010  1
3   2011  1
3   2012  1
3   2013  1
3   2014  1
3   2015  1
As you can see, for ID '1' I am missing values for 2010 and 2012; for ID '2' I am missing values for 2007, 2009 and 2015; and for ID '3' I am missing 2007 and 2008. I would like to fill these gaps with the value '1'. What I would like to achieve is below:
ID  Year  Value
1   2007  1
1   2008  1
1   2009  1
1   2010  1
1   2011  1
1   2012  1
1   2013  1
1   2014  1
1   2015  1
2   2007  1
2   2008  1
2   2009  1
2   2010  1
2   2011  1
2   2012  1
2   2013  1
2   2014  1
2   2015  1
3   2007  1
3   2008  1
3   2009  1
3   2010  1
3   2011  1
3   2012  1
3   2013  1
3   2014  1
3   2015  1
I have created the below so far; however, it only fills for one ID, and I was struggling to find a way to loop through each ID adding a 'value' for each year that is missing:
idx = pd.date_range('2007', '2020', freq='Y')
DF.index = pd.DatetimeIndex(DF.index)
DF_s = DF.reindex(idx, fill_value=0)
Any ideas would be helpful, please.
I'm not sure I got what you want to achieve, but if you want to fill NaNs in the "Value" column between 2007 and 2015 (suggesting that there are more years where you don't want to fill the column), you could do something like this:
import math
import pandas as pd

df1 = pd.DataFrame({'ID': [1,1,1,2,2,2],
                    'Year': [2007,2010,2020,2007,2010,2015],
                    'Value': [1,None,None,None,1,None]})

# Write a function with your logic
def func(x, y):
    return 0 if math.isnan(y) and 2007 <= x <= 2015 else y

# Apply it to the df and update the column
df1['Value'] = df1.apply(lambda x: func(x.Year, x.Value), axis=1)
# ID Year Value
# 0 1 2007 1.0
# 1 1 2010 0.0
# 2 1 2020 NaN
# 3 2 2007 0.0
# 4 2 2010 1.0
# 5 2 2015 0.0
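The same logic can also be vectorized with a boolean mask instead of the row-wise apply (a sketch equivalent to func above):
mask = df1['Value'].isna() & df1['Year'].between(2007, 2015)
df1.loc[mask, 'Value'] = 0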
Answering my own question :). I needed to apply a lambda function after doing the groupby('Org') that adds a NaN for each year that is missing. The reset_index effectively ungroups it back into the original list.
f = lambda x: x.reindex(pd.date_range(pd.to_datetime('2007'), pd.to_datetime('2020'), name='date', freq='Y'))
DF_fixed = DF.set_index('Year').groupby(['Org']).apply(f).drop(['Org'], axis=1)
DF_fixed = DF_fixed.reset_index()
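Here is a self-contained version of that reindex-per-group pattern, with illustrative column names and an integer year index, filling the gaps with 1 as the question asks:

import pandas as pd

DF = pd.DataFrame({'Org': [1, 1, 2],
                   'Year': [2007, 2009, 2008],
                   'Value': [1, 1, 1]})

# Reindex each Org's rows onto the full span of years, filling gaps with 1
f = lambda x: x.reindex(pd.Index(range(2007, 2011), name='Year'), fill_value=1)
out = (DF.set_index('Year')
         .groupby('Org')
         .apply(f)
         .drop(columns='Org')
         .reset_index())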

Getting sum data for smoothly shifting groups of 3 months of monthly data in Pandas

I have time series data of the following form:
Item 2020 Jan 2020 Feb 2020 Mar 2020 Apr 2020 May 2020 Jun
0 A 0 1 2 3 4 5
1 B 5 4 3 2 1 0
This is monthly data, but I want quarterly data. Normal quarterly data would be calculated by summing Jan-Mar and Apr-Jun and would look like this:
Item 2020 Q1 2020 Q2
0 A 3 12
1 B 12 3
I want to get smoother quarterly data so it would shift by only 1 month for each new data item, not 3 months. So it would have Jan-Mar, then Feb-Apr, then Mar-May, and Apr-Jun. So the resulting data would look like this:
Item 2020 Q1 2020 Q1 2020 Q1 2020 Q2
0 A 3 6 9 12
1 B 12 9 6 3
I believe this is similar to cumsum which can be used as follows:
df_dates = df.iloc[:,1:]
df_dates.cumsum(axis=1)
which leads to the following result:
2020 Jan 2020 Feb 2020 Mar 2020 Apr 2020 May 2020 Jun
0 0 1 3 6 10 15
1 5 9 12 14 15 15
but instead of the sum over the whole period so far, I want the sum of the nearest 3 months (a quarter).
I do not know what this version of cumsum is called, but I have seen it in many places, so I believe there might be a library function for it.
Let us solve it in steps:
1. Set the index to the Item column
2. Parse the date-like columns to quarterly periods
3. Calculate the rolling sum with a window of size 3
4. Shift the calculated rolling sum 2 units along the columns axis and get rid of the last two columns
s = df.set_index('Item')
s.columns = pd.PeriodIndex(s.columns, freq='M').strftime('%Y Q%q')
s = s.rolling(3, axis=1).sum().shift(-2, axis=1).iloc[:, :-2]
print(s)
2020 Q1 2020 Q1 2020 Q1 2020 Q2
Item
A 3.0 6.0 9.0 12.0
B 12.0 9.0 6.0 3.0
Try a column-wise groupby with axis=1:
>>> df.iloc[:, [0]].join(df.iloc[:, 1:].groupby(pd.to_datetime(df.columns[1:], format='%Y %b').quarter, axis=1).sum().add_prefix('Q'))
Item Q1 Q2
0 A 3 12
1 B 12 3
>>>
Edit:
I misread your question; to do what you want, try a rolling sum:
>>> x = df.rolling(3, axis=1).sum().dropna(axis='columns')
>>> df.iloc[:, [0]].join(x.set_axis('Q' + pd.to_datetime(df.columns[1:], format='%Y %b').quarter.astype(str)[:len(x.T)], axis=1))
Item Q1 Q1 Q1 Q2
0 A 3.0 6.0 9.0 12.0
1 B 12.0 9.0 6.0 3.0
>>>
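Both outputs above repeat the "Q1" label for three different windows. To keep column labels unique, one could label each 3-month window by its ending month instead; transposing first also avoids axis=1 rolling, which is deprecated in recent pandas (a sketch reusing df from above):

s = df.set_index('Item')
s.columns = pd.PeriodIndex(s.columns, freq='M')
rolled = s.T.rolling(3).sum().dropna().T           # 3-month sums ending at each month
rolled.columns = rolled.columns.strftime('%Y %b')  # e.g. '2020 Mar' for the Jan-Mar window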

Grabbing data from previous year in a Pandas DataFrame

I've got this df:
d={'year':[2019,2018,2017],'B':[10,5,17]}
df=pd.DataFrame(data=d)
print(df):
year B
0 2019 10
1 2018 5
2 2017 17
I want to create a column "B_previous_year" that grabs B data from the previous year, in a way it looks like this:
year B B_previous_year
0 2019 10 5
1 2018 5 17
2 2017 17 NaN
I'm trying this:
df['B_previous_year']=df.B.loc[df.year == (df.year - 1)]
However, my B_previous_year column ends up full of NaN:
year B B_previous_year
0 2019 10 NaN
1 2018 5 NaN
2 2017 17 NaN
How could I do that?
In case you want to keep the integer format:
df = df.convert_dtypes()
df['New'] = df.B.shift(-1)
df
Output:
year B New
0 2019 10 5
1 2018 5 17
2 2017 17 <NA>
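Note that shift(-1) relies on the rows being ordered newest-first, as in the sample; sorting explicitly makes that assumption visible (a sketch):
df = df.sort_values('year', ascending=False).reset_index(drop=True)
df['New'] = df.B.shift(-1)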
You might want to sort the dataframe by year first, then verify that the difference from one row to the next is, indeed, one year, and take the previous row's B where that holds:
df = df.sort_values(by='year')
df['B_previous_year'] = df['B'].shift().where(df['year'].diff() == 1)
   year   B  B_previous_year
2  2017  17              NaN
1  2018   5             17.0
0  2019  10              5.0
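If years can be missing or unordered, a self-join on year - 1 avoids any dependence on row order (a sketch, starting again from the original df):
prev = df[['year', 'B']].assign(year=df['year'] + 1).rename(columns={'B': 'B_previous_year'})
df = df.merge(prev, on='year', how='left')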

Fill Pandas dataframe rows whose value is 0 or NaN with a formula that has to be calculated on specific rows of another column

I have a dataframe where values in the "price" column differ depending on both the "quantity" and "year" columns. For example, for a quantity equal to 2 I have a price equal to 2 in 2017 and equal to 4 in 2018. I would like to fill the rows for 2019, which have 0 or NaN values, with values from 2018.
df = pd.DataFrame({
    'quantity': pd.Series([1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9]),
    'year': pd.Series([2017,2017,2017,2017,2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2019,2019,2019,2019]),
    'price': pd.Series([1,2,3,4,5,6,7,8,9,2,4,6,8,10,12,14,16,18,np.nan,np.nan,0,0,np.nan,0,np.nan,0,np.nan])
})
And what if, instead of taking the values from 2018, I need to calculate a mean of 2017 and 2018?
I tried to adapt this question to the first case (filling with data from 2018), but it doesn't work:
df['price'][df['year']==2019].fillna(df['price'][df['year'] == 2018], inplace = True)
Could you please help me?
The expected output should be a dataframe like the following.
Df with values from 2018:
df = pd.DataFrame({
    'quantity': pd.Series([1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9]),
    'year': pd.Series([2017,2017,2017,2017,2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2019,2019,2019,2019]),
    'price': pd.Series([1,2,3,4,5,6,7,8,9,2,4,6,8,10,12,14,16,18,2,4,6,8,10,12,14,16,18])
})
Df with values that are a mean of 2017 and 2018:
df = pd.DataFrame({
    'quantity': pd.Series([1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9]),
    'year': pd.Series([2017,2017,2017,2017,2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2019,2019,2019,2019]),
    'price': pd.Series([1,2,3,4,5,6,7,8,9,2,4,6,8,10,12,14,16,18,1.5,3,4.5,6,7.5,9,10.5,12,13.5])
})
Here's one way filling with the mean of 2017 and 2018.
Start by grouping the previous years' data by quantity and aggregating with the mean:
m = df[df.year.isin([2017, 2018])].groupby('quantity').price.mean()
Use set_index to set the quantity column as the index, replace 0s with NaNs, and use fillna, which also accepts a Series to map the values according to the index:
ix = df[df.year.eq(2019)].index
df.loc[ix, 'price'] = (df.loc[ix].set_index('quantity').price
                       .replace(0, np.nan).fillna(m).values)
quantity year price
0 1 2017 1.0
1 2 2017 2.0
2 3 2017 3.0
3 4 2017 4.0
4 5 2017 5.0
5 6 2017 6.0
6 7 2017 7.0
7 8 2017 8.0
8 9 2017 9.0
9 1 2018 2.0
10 2 2018 4.0
11 3 2018 6.0
12 4 2018 8.0
13 5 2018 10.0
14 6 2018 12.0
15 7 2018 14.0
16 8 2018 16.0
17 9 2018 18.0
18 1 2019 1.5
19 2 2019 3.0
20 3 2019 4.5
21 4 2019 6.0
22 5 2019 7.5
23 6 2019 9.0
24 7 2019 10.5
25 8 2019 12.0
26 9 2019 13.5
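The first expected output (copying 2018 prices instead of averaging) works the same way; starting again from the original df, build the lookup from 2018 alone (a sketch following the same pattern):

m2018 = df[df.year.eq(2018)].set_index('quantity').price
ix = df[df.year.eq(2019)].index
df.loc[ix, 'price'] = (df.loc[ix].set_index('quantity').price
                       .replace(0, np.nan).fillna(m2018).values)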
