I have the following dataframe:
PersonID  AmountPaid  PaymentReceivedDate  StartDate  withinNYears
1         100         2017                 2016
2         20          2014                 2014
1         30          2017                 2016
1         40          2016                 2016
4         300         2015                 2000
5         150         2005                 2002
What I'm looking for: the AmountPaid value should appear in the withinNYears column if the payment was made within n years of the start date; otherwise you get NaN.
n can be any number, but let's say 2 for this example (as I will be playing with this to see findings).
So basically the above dataframe would come out like this if the amount was paid within 2 years:
PersonID  AmountPaid  PaymentReceivedDate  StartDate  withinNYears
1         100         2017                 2016       100
2         20          2014                 2014       20
1         30          2017                 2016       30
1         40          2016                 2016       40
4         300         2015                 2000       NaN
5         150         2005                 2002       NaN
Does anyone know how to achieve this? Cheers.
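For reference, a minimal sketch that builds the example frame, assuming the year columns are plain integers:

import pandas as pd

df = pd.DataFrame({
    'PersonID': [1, 2, 1, 1, 4, 5],
    'AmountPaid': [100, 20, 30, 40, 300, 150],
    'PaymentReceivedDate': [2017, 2014, 2017, 2016, 2015, 2005],
    'StartDate': [2016, 2014, 2016, 2016, 2000, 2002],
})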
Subtract the columns and compare with a scalar to get a boolean mask, then set the values with numpy.where, Series.where or DataFrame.loc:
import numpy as np

m = (df['PaymentReceivedDate'] - df['StartDate']) < 2
df['withinNYears'] = np.where(m, df['AmountPaid'], np.nan)

# alternatives
# df['withinNYears'] = df['AmountPaid'].where(m)
# df.loc[m, 'withinNYears'] = df['AmountPaid']

print(df)
PersonID AmountPaid PaymentReceivedDate StartDate \
0 1 100 2017 2016
1 2 20 2014 2014
2 1 30 2017 2016
3 1 40 2016 2016
4 4 300 2015 2000
5 5 150 2005 2002
withinNYears
0 100.0
1 20.0
2 30.0
3 40.0
4 NaN
5 NaN
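Note that the NaN values force withinNYears to float, as seen above. If you want to keep integers, a hedged option on pandas 0.24+ is the nullable Int64 dtype:

df['withinNYears'] = df['AmountPaid'].where(m).astype('Int64')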
EDIT:
If the StartDate column contains datetimes:
m = (df['PaymentReceivedDate'] - df['StartDate'].dt.year) < 2
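Since n should be tunable, it can simply be a variable (a sketch reusing the integer-year columns from above):

n = 2  # change this to explore other windows
m = (df['PaymentReceivedDate'] - df['StartDate']) < n
df['withinNYears'] = np.where(m, df['AmountPaid'], np.nan)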
Just assign with loc:
df.loc[(df['PaymentReceivedDate'] - df['StartDate']) < 2, 'withinNYears'] = df.AmountPaid
df
Out[37]:
PersonID AmountPaid ... StartDate withinNYears
0 1 100 ... 2016 100.0
1 2 20 ... 2014 20.0
2 1 30 ... 2016 30.0
3 1 40 ... 2016 40.0
4 4 300 ... 2000 NaN
5 5 150 ... 2002 NaN
[6 rows x 5 columns]
Related
I have a DF such as the one below:
ID  Year  Value
1   2007  1
1   2008  1
1   2009  1
1   2011  1
1   2013  1
1   2014  1
1   2015  1
2   2008  1
2   2010  1
2   2011  1
2   2012  1
2   2013  1
2   2014  1
3   2009  1
3   2010  1
3   2011  1
3   2012  1
3   2013  1
3   2014  1
3   2015  1
As you can see, for ID '1' I am missing values for 2010 and 2012; for ID '2' I am missing 2007, 2009 and 2015; and for ID '3' I am missing 2007 and 2008. I would like to fill these gaps with the value '1'. What I would like to achieve is below:
ID  Year  Value
1   2007  1
1   2008  1
1   2009  1
1   2010  1
1   2011  1
1   2012  1
1   2013  1
1   2014  1
1   2015  1
2   2007  1
2   2008  1
2   2009  1
2   2010  1
2   2011  1
2   2012  1
2   2013  1
2   2014  1
2   2015  1
3   2007  1
3   2008  1
3   2009  1
3   2010  1
3   2011  1
3   2012  1
3   2013  1
3   2014  1
3   2015  1
I have created the below so far; however, it only fills for one ID, and I was struggling to find a way to loop through each ID, adding a 'Value' for each year that is missing:
idx = pd.date_range('2007', '2020', freq='Y')
DF.index = pd.DatetimeIndex(DF.index)
DF_s = DF.reindex(idx, fill_value=0)
Any ideas would be helpful, please.
I'm not sure I got what you want to achieve, but if you want to fill NaNs in the "Value" column only between 2007 and 2015 (assuming there are more years where you don't want to fill the column), you could do something like this:
import math
import pandas as pd

df1 = pd.DataFrame({'ID': [1, 1, 1, 2, 2, 2],
                    'Year': [2007, 2010, 2020, 2007, 2010, 2015],
                    'Value': [1, None, None, None, 1, None]})

# Write a function with your logic
def func(x, y):
    return 0 if math.isnan(y) and 2007 <= x <= 2015 else y

# Apply it to the df and update the column
df1['Value'] = df1.apply(lambda x: func(x.Year, x.Value), axis=1)
# ID Year Value
# 0 1 2007 1.0
# 1 1 2010 0.0
# 2 1 2020 NaN
# 3 2 2007 0.0
# 4 2 2010 1.0
# 5 2 2015 0.0
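The same logic can also be vectorized without apply (a sketch reusing df1 from above):

mask = df1['Value'].isna() & df1['Year'].between(2007, 2015)
df1['Value'] = df1['Value'].mask(mask, 0)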
Answering my own question :). I needed to apply a lambda function after doing the groupby('Org') that adds a NaN for each year that is missing. The reset_index effectively ungroups it back into the original list.
f = lambda x: x.reindex(pd.date_range(pd.to_datetime('2007'), pd.to_datetime('2020'), name='date', freq='Y'))
DF_fixed = DF.set_index('Year').groupby(['Org']).apply(f).drop(['Org'], axis=1)
DF_fixed = DF_fixed.reset_index()
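If Year is a plain integer column, a hedged alternative builds the full (ID, Year) grid with MultiIndex.from_product and reindexes onto it, filling the gaps with 1 (the ID/Year column names follow the example tables and are assumptions):

import pandas as pd

full = pd.MultiIndex.from_product([DF['ID'].unique(), range(2007, 2016)],
                                  names=['ID', 'Year'])
DF_full = (DF.set_index(['ID', 'Year'])
             .reindex(full, fill_value=1)   # missing (ID, Year) pairs get Value 1
             .reset_index())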
I've got this df:
d={'year':[2019,2018,2017],'B':[10,5,17]}
df=pd.DataFrame(data=d)
print(df):
year B
0 2019 10
1 2018 5
2 2017 17
I want to create a column "B_previous_year" that grabs B data from the previous year, in a way it looks like this:
year B B_previous_year
0 2019 10 5
1 2018 5 17
2 2017 17 NaN
I'm trying this:
df['B_previous_year']=df.B.loc[df.year == (df.year - 1)]
However, my B_previous_year column is getting filled with NaN:
year B B_previous_year
0 2019 10 NaN
1 2018 5 NaN
2 2017 17 NaN
How could I do that?
In case you want to keep the integer format:
df = df.convert_dtypes()
df['New'] = df.B.shift(-1)
df
Output:
year B New
0 2019 10 5
1 2018 5 17
2 2017 17 <NA>
You might want to sort the dataframe by year first, then verify that the difference from one row to the next is, indeed, one year:
df = df.sort_values(by='year')
df['B_previous_year'] = df[df.year.diff() == 1]['B']
year B B_previous_year
2 2017 17 NaN
1 2018 5 5.0
0 2019 10 10.0
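A hedged sketch that combines both answers: sort by year, shift B, and only trust the shifted value when the previous row really is the prior year:

df = df.sort_values('year')
prev = df['B'].shift()                  # B from the row one year earlier
gap_ok = df['year'].diff() == 1         # guard against missing years
df['B_previous_year'] = prev.where(gap_ok)
df = df.sort_values('year', ascending=False)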
I have a dataset like this:
date | a | diff_a | b  | diff_b | c  | diff_c
2020   0   NaN      10   NaN      5    NaN
2021   1   1        20   10       7    2
2022   3   2        30   10       13   6
2023   4   1        40   10       20   7
And I want to transpose this dataset and merge different columns below, like this:
date | Cat | value | diff
2020 a 0 NaN
2021 a 1 1
2022 a 3 2
2023 a 4 1
2020 b 10 ...
2021 b 20
2022 b 30
2023 b 40
2020 c 5
2021 c 7
2022 c 13
2023 c 20
The diff is not important, since if I can stack the other columns below each other I can just filter and then concat the dataframes; but how do I turn these columns into rows?
Kind Regards
My approach with DataFrame.melt
m = df.columns.str.contains('diff')
new_df = (df.melt(df.columns[~m], df.columns[m], var_name='Cat', value_name='diff')
            .assign(Cat=lambda x: x['Cat'].str.split('_').str[-1],
                    a=lambda x: x.lookup(x.index, x.Cat)))
new_df = new_df.drop(columns=list(filter(lambda x: x != 'a', new_df.Cat.unique())))
print(new_df)
date a Cat diff
0 2020 0 a NaN
1 2021 1 a 1
2 2022 3 a 2
3 2023 4 a 1
4 2020 10 b NaN
5 2021 20 b 10
6 2022 30 b 10
7 2023 40 b 10
8 2020 5 c NaN
9 2021 7 c 2
10 2022 13 c 6
11 2023 20 c 7
EDIT:
If we do not want the diff column we can drop it; also, if the new column does not have to be called a, we can do:
m = df.columns.str.contains('diff')
new_df = (df.melt(df.columns[~m], df.columns[m], var_name='Cat', value_name='diff')
            #.drop(columns='diff')  # if you want to drop diff
            .assign(Cat=lambda x: x['Cat'].str.split('_').str[-1],
                    other=lambda x: x.lookup(x.index, x.Cat)))
new_df = new_df.drop(columns=new_df['Cat'].unique())
print(new_df)
date Cat diff other
0 2020 a NaN 0
1 2021 a 1 1
2 2022 a 2 3
3 2023 a 1 4
4 2020 b NaN 10
5 2021 b 10 20
6 2022 b 10 30
7 2023 b 10 40
8 2020 c NaN 5
9 2021 c 2 7
10 2022 c 6 13
11 2023 c 7 20
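Note that DataFrame.lookup was removed in pandas 2.0. On recent versions, a hedged alternative renames the value columns to a common stub and uses pd.wide_to_long (the a/b/c column names follow the example and are assumptions):

import pandas as pd

# Give a/b/c the common stub 'value' so both column families can be unpivoted together
df2 = df.rename(columns={c: f'value_{c}' for c in ['a', 'b', 'c']})
new_df = (pd.wide_to_long(df2, stubnames=['value', 'diff'],
                          i='date', j='Cat', sep='_', suffix=r'\w+')
            .reset_index())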
I like @ansev's answer. Definitely elegant and oozing experience.
My attempt is below. Please note I drop the diff columns first, since you don't need them, and then:
df2 = (df.set_index('date')
         .stack()
         .reset_index(level=0, drop=False)
         .rename_axis('Cat', axis=0)
         .reset_index()
         .sort_values(by='date'))
df2.rename(columns={0: 'value'}, inplace=True)
With a df as below:
name year total
0 A 2015 100
1 A 2016 200
2 A 2017 500
3 C 2016 400
4 B 2016 100
5 B 2015 200
6 B 2017 800
How do I create a new dataframe with years as columns and amount as values:
Name 2015 2016 2017
A 100 200 500
B 200 100 800
C 0 400 0
Pivot with fillna:
df.pivot(*df.columns).fillna(0).reset_index()
Out[815]:
year name 2015 2016 2017
0 A 100.0 200.0 500.0
1 B 200.0 100.0 800.0
2 C 0.0 400.0 0.0
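If you prefer explicit arguments (the positional unpacking above relies on the column order) and integer output, a hedged equivalent:

out = (df.pivot(index='name', columns='year', values='total')
         .fillna(0)
         .astype(int)
         .reset_index())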
I have a dataframe with a price column (P) that contains some undesired values (0, 1.50, 92.80, 0.80). Before I calculate the mean price by product code, I would like to remove these outliers:
Code Year Month Day Q P
0 100 2017 1 4 2.0 42.90
1 100 2017 1 9 2.0 42.90
2 100 2017 1 18 1.0 45.05
3 100 2017 1 19 2.0 45.05
4 100 2017 1 20 1.0 45.05
5 100 2017 1 24 10.0 46.40
6 100 2017 1 26 1.0 46.40
7 100 2017 1 28 2.0 92.80
8 100 2017 2 1 0.0 0.00
9 100 2017 2 7 2.0 1.50
10 100 2017 2 8 5.0 0.80
11 100 2017 2 9 1.0 45.05
12 100 2017 2 11 1.0 1.50
13 100 2017 3 8 1.0 49.90
14 100 2017 3 17 6.0 45.05
15 100 2017 3 24 1.0 45.05
16 100 2017 3 30 2.0 1.50
What would be a good way to filter the outliers for each product (grouping by Code)?
I tried this:
stds = 1.0  # Number of standard deviations that defines an 'outlier'.
z = df[['Code', 'P']].groupby('Code').transform(
    lambda group: (group - group.mean()).div(group.std()))
outliers = z.abs() > stds
df[outliers.any(axis=1)]
And then :
print(df[['Code', 'Year', 'Month','P']].groupby(['Code', 'Year', 'Month']).mean())
But the outlier filter doesn't work properly.
IIUC, you can use a groupby on Code, do your z-score calculation on P, and filter out the rows where the z-score is greater than your threshold:
stds = 1.0
filtered_df = df[~df.groupby('Code')['P'].transform(lambda x: abs((x - x.mean()) / x.std()) > stds)]
Code Year Month Day Q P
0 100 2017 1 4 2.0 42.90
1 100 2017 1 9 2.0 42.90
2 100 2017 1 18 1.0 45.05
3 100 2017 1 19 2.0 45.05
4 100 2017 1 20 1.0 45.05
5 100 2017 1 24 10.0 46.40
6 100 2017 1 26 1.0 46.40
11 100 2017 2 9 1.0 45.05
13 100 2017 3 8 1.0 49.90
14 100 2017 3 17 6.0 45.05
15 100 2017 3 24 1.0 45.05
filtered_df[['Code', 'Year', 'Month','P']].groupby(['Code', 'Year', 'Month']).mean()
P
Code Year Month
100 2017 1 44.821429
2 45.050000
3 46.666667
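Equivalently, you can compute the z-scores once and reuse them (a sketch assuming the same df and stds):

z = df.groupby('Code')['P'].transform(lambda x: (x - x.mean()) / x.std())
filtered_df = df[z.abs() <= stds]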
You have the right idea. Just take the Boolean opposite of your outliers['P'] series via ~ and filter your dataframe via loc:
res = df.loc[~outliers['P']]\
.groupby(['Code', 'Year', 'Month'], as_index=False)['P'].mean()
print(res)
Code Year Month P
0 100 2017 1 44.821429
1 100 2017 2 45.050000
2 100 2017 3 46.666667