Calculate months elapsed since start value in pandas dataframe - python

I have a dataframe that looks like this:
df = pd.DataFrame({'CAL_YEAR': [2021,2022,2022,2022,2022,2022,2022,2022,2022,2022,2022,2022,2022,2023,2023],
                   'CAL_MONTH': [12,1,2,3,4,5,6,7,8,9,10,11,12,1,2]})
I want to calculate a months-elapsed column, which should look like this:
df['CUM_MONTH'] = [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14]
How can I do this?
My starting month is 12/2021 (or 12/31/2021; I don't care about dates here, only about the months elapsed). This is economic scenario data, but the source data is not in the format we need.

IIUC:
multiplier = {'CAL_YEAR': 12, 'CAL_MONTH': 1}
df.assign(
    CUM_MONTH=df[multiplier].diff().mul(multiplier).sum(axis=1).cumsum()
)
CAL_YEAR CAL_MONTH CUM_MONTH
0 2021 12 0.0
1 2022 1 1.0
2 2022 2 2.0
3 2022 3 3.0
4 2022 4 4.0
5 2022 5 5.0
6 2022 6 6.0
7 2022 7 7.0
8 2022 8 8.0
9 2022 9 9.0
10 2022 10 10.0
11 2022 11 11.0
12 2022 12 12.0
13 2023 1 13.0
14 2023 2 14.0

I basically did the same thing as above, but in several separate steps, without the diff(), sum(), and cumsum() functions.
start_year = int(data["VALUATION_DATE"][0][-4:])
data = data.astype({"CAL_YEAR": "int", "CAL_MONTH": "int"})
data["CAL_YEAR_ELAPSED"] = data["CAL_YEAR"] - (start_year + 1)
data["CumMonths"] = data["CAL_MONTH"] + 12 * data["CAL_YEAR_ELAPSED"] + 1
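The stepwise version above can be collapsed into one vectorized expression. A minimal sketch (assuming the start month is simply the first row of the frame):

```python
import pandas as pd

df = pd.DataFrame({
    'CAL_YEAR': [2021, 2022, 2022, 2022, 2023, 2023],
    'CAL_MONTH': [12, 1, 2, 3, 1, 2],
})

# Months elapsed relative to the first row (12/2021 here):
# 12 months per full year difference, plus the month difference.
start_year = df['CAL_YEAR'].iloc[0]
start_month = df['CAL_MONTH'].iloc[0]
df['CUM_MONTH'] = (df['CAL_YEAR'] - start_year) * 12 + (df['CAL_MONTH'] - start_month)
```

This avoids diff()/cumsum() entirely and gives 0 for the starting month by construction.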

Related

Grabbing data from previous year in a Pandas DataFrame

I've got this df:
import pandas as pd

d = {'year': [2019, 2018, 2017], 'B': [10, 5, 17]}
df = pd.DataFrame(data=d)
print(df):
year B
0 2019 10
1 2018 5
2 2017 17
I want to create a column "B_previous_year" that grabs B data from the previous year, in a way it looks like this:
year B B_previous_year
0 2019 10 5
1 2018 5 17
2 2017 17 NaN
I'm trying this:
df['B_previous_year']=df.B.loc[df.year == (df.year - 1)]
However, B_previous_year ends up full of NaN:
year B B_previous_year
0 2019 10 NaN
1 2018 5 NaN
2 2017 17 NaN
How could I do that?
If you want to keep the nullable integer format (shift(-1) works here because the rows are sorted by descending year):
df = df.convert_dtypes()
df['New'] = df.B.shift(-1)
df
Output:
year B New
0 2019 10 5
1 2018 5 17
2 2017 17 <NA>
You might want to sort the dataframe by year first, then verify that the difference from one row to the next is, indeed, one year. shift() pulls the previous row's B, and where() blanks it out wherever the year gap is not exactly 1:
df = df.sort_values(by='year')
df['B_previous_year'] = df['B'].shift().where(df.year.diff() == 1)
   year   B  B_previous_year
2  2017  17              NaN
1  2018   5             17.0
0  2019  10              5.0
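A merge-based alternative (a sketch, not from the answers above) that handles unsorted years and gaps explicitly: shift each row's year forward by one and self-merge, so each year matches the row one year before it.

```python
import pandas as pd

df = pd.DataFrame({'year': [2019, 2018, 2017], 'B': [10, 5, 17]})

# Build a lookup frame whose 'year' is advanced by 1, then left-merge:
# row 2019 picks up 2018's B, row 2018 picks up 2017's B, etc.
prev = df.rename(columns={'B': 'B_previous_year'}).assign(year=lambda d: d['year'] + 1)
out = df.merge(prev, on='year', how='left')
```

Unlike shift(), this stays correct if the years are unsorted or a year is missing (missing predecessors become NaN).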

Fill Pandas dataframe rows, whose value is 0 or NaN, with a formula that has to be calculated on specific rows of another column

I have a dataframe where values in the "price" column differ depending on both the "quantity" and "year" columns. For example, for a quantity equal to 2, I have a price equal to 2 in 2017 and equal to 4 in 2018. I would like to fill the rows for 2019, which have 0 or NaN values, with values from 2018.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'quantity': pd.Series([1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9]),
    'year': pd.Series([2017,2017,2017,2017,2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2019,2019,2019,2019]),
    'price': pd.Series([1,2,3,4,5,6,7,8,9,2,4,6,8,10,12,14,16,18,np.nan,np.nan,0,0,np.nan,0,np.nan,0,np.nan])
})
And what if, instead of taking values from 2018, I should calculate a mean between 2017 and 2018?
I tried to adapt this question to the first case (filling with values from 2018), but it doesn't work:
df['price'][df['year']==2019].fillna(df['price'][df['year'] == 2018], inplace = True)
Could you please help me?
The expected output should be a dataframe like the followings:
Df with values from 2018
df = pd.DataFrame({
    'quantity': pd.Series([1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9]),
    'year': pd.Series([2017,2017,2017,2017,2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2019,2019,2019,2019]),
    'price': pd.Series([1,2,3,4,5,6,7,8,9,2,4,6,8,10,12,14,16,18,2,4,6,8,10,12,14,16,18])
})
Df with values that are a mean between 2017 and 2018
df = pd.DataFrame({
    'quantity': pd.Series([1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9]),
    'year': pd.Series([2017,2017,2017,2017,2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2019,2019,2019,2019]),
    'price': pd.Series([1,2,3,4,5,6,7,8,9,2,4,6,8,10,12,14,16,18,1.5,3,4.5,6,7.5,9,10.5,12,13.5])
})
Here's one way filling with the mean of 2017 and 2018.
Start by grouping the previous year's data by the quantity and aggregating with the mean:
m = df[df.year.isin([2017, 2018])].groupby('quantity').price.mean()
Use set_index to set the quantity column as index, replace 0s by NaNs and use fillna which also accepts dictionaries to map the values according to the index:
ix = df[df.year.eq(2019)].index
df.loc[ix, 'price'] = (df.loc[ix].set_index('quantity').price
                         .replace(0, np.nan).fillna(m).values)
quantity year price
0 1 2017 1.0
1 2 2017 2.0
2 3 2017 3.0
3 4 2017 4.0
4 5 2017 5.0
5 6 2017 6.0
6 7 2017 7.0
7 8 2017 8.0
8 9 2017 9.0
9 1 2018 2.0
10 2 2018 4.0
11 3 2018 6.0
12 4 2018 8.0
13 5 2018 10.0
14 6 2018 12.0
15 7 2018 14.0
16 8 2018 16.0
17 9 2018 18.0
18 1 2019 1.5
19 2 2019 3.0
20 3 2019 4.5
21 4 2019 6.0
22 5 2019 7.5
23 6 2019 9.0
24 7 2019 10.5
25 8 2019 12.0
26 9 2019 13.5
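The question also asked for the first variant (copying 2018's prices instead of the mean). The same set_index/fillna pattern works with a quantity-to-price map built from 2018 only; a sketch on a reduced frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'quantity': [1, 2, 3] * 3,
    'year': [2017] * 3 + [2018] * 3 + [2019] * 3,
    'price': [1, 2, 3, 2, 4, 6, np.nan, 0, np.nan],
})

# Map quantity -> 2018 price, then overwrite the 0/NaN rows of 2019.
m2018 = df[df.year.eq(2018)].set_index('quantity')['price']
ix = df[df.year.eq(2019)].index
df.loc[ix, 'price'] = (df.loc[ix].set_index('quantity')['price']
                         .replace(0, np.nan).fillna(m2018).values)
```

fillna with a Series aligns on the index, which is why setting quantity as the index first makes the per-quantity mapping work.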

Using bfill with a chosen number

I have a data frame column like so:
Year  Rank
2017   NaN
2017   NaN
2017     3
2017     4
2017     5
.
.
2016   NaN
2016   NaN
2016     3
2016     4
2016     5
.
.
Can I use bfill to replace the first two values, so my column looks like this:
Year Rank
2017 1
2017 2
2017 3
2017 4
2017 5
.
.
2016 1
2016 2
2016 3
2016 4
2016 5
.
.
Or is there an easier way than using bfill? Thanks in advance
Use the limit parameter of fillna:
df['Rank'] = df['Rank'].fillna(1, limit=1)
df['Rank'] = df['Rank'].fillna(2, limit=2)
...and if necessary, apply the fills per group:
def f(x):
    x = x.fillna(1, limit=1)
    x = x.fillna(2, limit=2)
    return x

df['New'] = df.groupby('Year')['Rank'].apply(f)
print(df)
Year Rank New
0 2017 NaN 1.0
1 2017 NaN 2.0
2 2017 3.0 3.0
3 2017 4.0 4.0
4 2017 5.0 5.0
5 2016 NaN 1.0
6 2016 NaN 2.0
7 2016 5.0 5.0
8 2016 6.0 6.0
9 2016 10.0 10.0
See the documentation for DataFrame.fillna.
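If the missing ranks always equal their row position within the year (an assumption, not stated in the question), groupby().cumcount() avoids hard-coding one fillna per position:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Year': [2017] * 5 + [2016] * 5,
    'Rank': [np.nan, np.nan, 3, 4, 5, np.nan, np.nan, 3, 4, 5],
})

# cumcount() numbers rows 0, 1, 2, ... within each year, so after
# adding 1 the leading NaNs are filled with 1, 2, ... automatically.
df['Rank'] = df['Rank'].fillna(df.groupby('Year').cumcount() + 1)
```

This scales to any number of leading NaNs per year without extra fillna calls.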

Filtering outliers before using group by

I have a dataframe with a price column (P) containing some undesired values (0, 1.50, 92.80, 0.80). Before I calculate the mean price by product code, I would like to remove these outliers.
Code Year Month Day Q P
0 100 2017 1 4 2.0 42.90
1 100 2017 1 9 2.0 42.90
2 100 2017 1 18 1.0 45.05
3 100 2017 1 19 2.0 45.05
4 100 2017 1 20 1.0 45.05
5 100 2017 1 24 10.0 46.40
6 100 2017 1 26 1.0 46.40
7 100 2017 1 28 2.0 92.80
8 100 2017 2 1 0.0 0.00
9 100 2017 2 7 2.0 1.50
10 100 2017 2 8 5.0 0.80
11 100 2017 2 9 1.0 45.05
12 100 2017 2 11 1.0 1.50
13 100 2017 3 8 1.0 49.90
14 100 2017 3 17 6.0 45.05
15 100 2017 3 24 1.0 45.05
16 100 2017 3 30 2.0 1.50
How would be a good way to filter the outliers for each product (group by code) ?
I tried this:
stds = 1.0  # number of standard deviations that defines an 'outlier'
z = df[['Code', 'P']].groupby('Code').transform(
    lambda group: (group - group.mean()).div(group.std()))
outliers = z.abs() > stds
df[outliers.any(axis=1)]
And then :
print(df[['Code', 'Year', 'Month','P']].groupby(['Code', 'Year', 'Month']).mean())
But the outlier filter doesn't work properly.
IIUC, you can use a groupby on Code, compute the z-score of P, and filter out rows where the z-score exceeds your threshold:
stds = 1.0
filtered_df = df[~df.groupby('Code')['P'].transform(lambda x: abs((x - x.mean()) / x.std()) > stds)]
Code Year Month Day Q P
0 100 2017 1 4 2.0 42.90
1 100 2017 1 9 2.0 42.90
2 100 2017 1 18 1.0 45.05
3 100 2017 1 19 2.0 45.05
4 100 2017 1 20 1.0 45.05
5 100 2017 1 24 10.0 46.40
6 100 2017 1 26 1.0 46.40
11 100 2017 2 9 1.0 45.05
13 100 2017 3 8 1.0 49.90
14 100 2017 3 17 6.0 45.05
15 100 2017 3 24 1.0 45.05
filtered_df[['Code', 'Year', 'Month','P']].groupby(['Code', 'Year', 'Month']).mean()
P
Code Year Month
100 2017 1 44.821429
2 45.050000
3 46.666667
You have the right idea. Just take the Boolean opposite of your outliers['P'] series via ~ and filter your dataframe via loc:
res = df.loc[~outliers['P']] \
        .groupby(['Code', 'Year', 'Month'], as_index=False)['P'].mean()
print(res)
Code Year Month P
0 100 2017 1 44.821429
1 100 2017 2 45.050000
2 100 2017 3 46.666667
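An IQR-based variant (a sketch on a reduced frame) avoids assuming the prices are roughly normal, which the z-score approach implicitly does; the quartiles themselves are barely affected by the extreme values:

```python
import pandas as pd

df = pd.DataFrame({
    'Code': [100] * 8,
    'P': [42.90, 45.05, 45.05, 46.40, 92.80, 0.00, 1.50, 45.05],
})

# Keep rows within 1.5 * IQR of each group's quartiles.
def iqr_mask(p):
    q1, q3 = p.quantile(0.25), p.quantile(0.75)
    iqr = q3 - q1
    return p.between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

filtered = df[df.groupby('Code')['P'].transform(iqr_mask)]
```

Here the 92.80, 0.00, and 1.50 rows are dropped while the regular prices survive, even though the outliers badly distort the group mean and standard deviation.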

Read values from multiple rows and combine them in another row in pandas dataframe

I have the following dataframe:
item_id bytes value_id value
1 0 2.0 year 2017
2 0 1.0 month 04
3 0 1.0 day 12
4 0 1.0 time 07
5 0 1.0 minute 13
6 1 2.0 year 2017
7 1 1.0 month 12
8 1 1.0 day 19
9 1 1.0 time 09
10 1 1.0 minute 32
11 2 2.0 year 2017
12 2 1.0 month 04
13 2 1.0 day 17
14 2 1.0 time 14
15 2 1.0 minute 24
I want to be able to calculate the time for each item_id. How do I use group by here or anything else to achieve the following?
item_id time
0 2017/04/12 07:13
1 2017/12/19 09:32
2 2017/04/17 14:24
Use pivot + to_datetime:
pd.to_datetime(
    df.drop(columns='bytes')
      .pivot(index='item_id', columns='value_id', values='value')
      .rename(columns={'time': 'hour'})
).reset_index(name='time')
item_id time
0 0 2017-04-12 07:13:00
1 1 2017-12-19 09:32:00
2 2 2017-04-17 14:24:00
You can drop the bytes column before pivoting, it doesn't seem like you need it.
set_index + unstack also works. pd.to_datetime can be passed a dataframe; you only need to name the columns correctly:
pd.to_datetime(df.set_index(['item_id', 'value_id']).value.unstack().rename(columns={'time': 'hour'}))
Out[537]:
item_id
0 2017-04-12 07:13:00
1 2017-12-19 09:32:00
2 2017-04-17 14:24:00
dtype: datetime64[ns]
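If the values are stored as strings (common in this long format), assembling the timestamp explicitly with a fixed format avoids relying on pandas' special year/month/day/hour/minute column names; a sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'item_id': [0] * 5 + [1] * 5,
    'value_id': ['year', 'month', 'day', 'time', 'minute'] * 2,
    'value': ['2017', '04', '12', '07', '13', '2017', '12', '19', '09', '32'],
})

# Pivot to one row per item, then build a timestamp string per row
# and parse it with an explicit format.
wide = df.pivot(index='item_id', columns='value_id', values='value')
ts = pd.to_datetime(wide['year'] + '-' + wide['month'] + '-' + wide['day']
                    + ' ' + wide['time'] + ':' + wide['minute'],
                    format='%Y-%m-%d %H:%M')
```

ts is a Series indexed by item_id, one timestamp per item.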
