Calculate months elapsed since start value in pandas dataframe - python
I have a dataframe that looks like this:

df = pd.DataFrame({'CAL_YEAR': [2021,2022,2022,2022,2022,2022,2022,2022,2022,2022,2022,2022,2022,2023,2023],
                   'CAL_MONTH': [12,1,2,3,4,5,6,7,8,9,10,11,12,1,2]})
I want to calculate a months-elapsed column, which should look like this:
df = {'CUM_MONTH':[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14]}
How can I do this? My starting month would be 12/2021 or 12/31/2021 (I do not care about exact dates here, only about the months elapsed). This is economic scenario data, but the source data is not in the format we need.
IIUC (if I understand correctly):
multiplier = {'CAL_YEAR': 12, 'CAL_MONTH': 1}

df.assign(
    # diff() gives the row-to-row change; year steps are weighted by 12,
    # summed into months, then accumulated (the first row's NaNs sum to 0).
    CUM_MONTH=df[list(multiplier)].diff().mul(multiplier).sum(axis=1).cumsum()
)
    CAL_YEAR  CAL_MONTH  CUM_MONTH
0       2021         12        0.0
1       2022          1        1.0
2       2022          2        2.0
3       2022          3        3.0
4       2022          4        4.0
5       2022          5        5.0
6       2022          6        6.0
7       2022          7        7.0
8       2022          8        8.0
9       2022          9        9.0
10      2022         10       10.0
11      2022         11       11.0
12      2022         12       12.0
13      2023          1       13.0
14      2023          2       14.0
I basically did the same thing as the method above, but in several steps, without the diff(), sum() and cumsum() functions.
# Pull the start year out of a date string like "12/31/2021".
start_year = int(data["VALUATION_DATE"][0][-4:])
data = data.astype({"CAL_YEAR": "int", "CAL_MONTH": "int"})
data["CAL_YEAR_ELAPSED"] = data["CAL_YEAR"] - (start_year + 1)
# Note: the trailing +1 makes this count 1-based (12/2021 -> 1);
# drop it to get the 0-based CUM_MONTH shown in the question.
data["CumMonths"] = data["CAL_MONTH"] + 12 * data["CAL_YEAR_ELAPSED"] + 1
Related
Grabbing data from previous year in a Pandas DataFrame
I've got this df:

d = {'year': [2019, 2018, 2017], 'B': [10, 5, 17]}
df = pd.DataFrame(data=d)
print(df)

   year   B
0  2019  10
1  2018   5
2  2017  17

I want to create a column "B_previous_year" that grabs B data from the previous year, so that it looks like this:

   year   B  B_previous_year
0  2019  10                5
1  2018   5               17
2  2017  17              NaN

I'm trying this:

df['B_previous_year'] = df.B.loc[df.year == (df.year - 1)]

However my B_previous_year is getting full of NaN:

   year   B  B_previous_year
0  2019  10              NaN
1  2018   5              NaN
2  2017  17              NaN

How could I do that?
If you want to keep the integer format:

df = df.convert_dtypes()
df['New'] = df.B.shift(-1)
df

Output:

   year   B   New
0  2019  10     5
1  2018   5    17
2  2017  17  <NA>
You might want to sort the dataframe by year first, then only take the shifted B value where the difference from one row to the next is, indeed, one year:

df = df.sort_values(by='year')
df['B_previous_year'] = df['B'].shift().where(df.year.diff() == 1)

   year   B  B_previous_year
2  2017  17              NaN
1  2018   5             17.0
0  2019  10              5.0
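A join-based variant of the same idea, sketched here as an alternative (not from the original answers): relabel each row's year as year + 1 and merge, which also works when the years are unsorted or have gaps:

import pandas as pd

df = pd.DataFrame({'year': [2019, 2018, 2017], 'B': [10, 5, 17]})

# Each row's B becomes the "previous year" value for year + 1.
prev = df.rename(columns={'B': 'B_previous_year'}).assign(year=df['year'] + 1)
out = df.merge(prev, on='year', how='left')
#    year   B  B_previous_year
# 0  2019  10              5.0
# 1  2018   5             17.0
# 2  2017  17              NaN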
Fill Pandas dataframe rows, whose value is 0 or NaN, with a formula that has to be calculated on specific rows of another column
I have a dataframe where the values in the "price" column differ depending on both the "quantity" and "year" columns. For example, for a quantity equal to 2, the price is 2 in 2017 and 4 in 2018. I would like to fill the 2019 rows, which have 0 or NaN values, with the values from 2018.

df = pd.DataFrame({
    'quantity': pd.Series([1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9]),
    'year': pd.Series([2017,2017,2017,2017,2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2019,2019,2019,2019]),
    'price': pd.Series([1,2,3,4,5,6,7,8,9,2,4,6,8,10,12,14,16,18,np.nan,np.nan,0,0,np.nan,0,np.nan,0,np.nan])
})

And what if, instead of taking the values from 2018, I should calculate a mean of 2017 and 2018? I tried to adapt this question to the first case (applying the data from 2018), but it doesn't work:

df['price'][df['year']==2019].fillna(df['price'][df['year'] == 2018], inplace=True)

Could you please help me? The expected output should be a dataframe like the following:

Df with values from 2018:

df = pd.DataFrame({
    'quantity': pd.Series([1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9]),
    'year': pd.Series([2017,2017,2017,2017,2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2019,2019,2019,2019]),
    'price': pd.Series([1,2,3,4,5,6,7,8,9,2,4,6,8,10,12,14,16,18,2,4,6,8,10,12,14,16,18])
})

Df with values that are a mean of 2017 and 2018:

df = pd.DataFrame({
    'quantity': pd.Series([1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9,1,2,3,4,5,6,7,8,9]),
    'year': pd.Series([2017,2017,2017,2017,2017,2017,2017,2017,2017,2018,2018,2018,2018,2018,2018,2018,2018,2018,2019,2019,2019,2019,2019,2019,2019,2019,2019]),
    'price': pd.Series([1,2,3,4,5,6,7,8,9,2,4,6,8,10,12,14,16,18,1.5,3,4.5,6,7.5,9,10.5,12,13.5])
})
Here's one way, filling with the mean of 2017 and 2018. Start by grouping the previous years' data by quantity and aggregating with the mean:

m = df[df.year.isin([2017, 2018])].groupby('quantity').price.mean()

Use set_index to set the quantity column as the index, replace 0s with NaNs, and use fillna, which also accepts dictionary-like objects to map the values according to the index:

ix = df[df.year.eq(2019)].index
df.loc[ix, 'price'] = (df.loc[ix].set_index('quantity').price
                         .replace(0, np.nan).fillna(m).values)

    quantity  year  price
0          1  2017    1.0
1          2  2017    2.0
2          3  2017    3.0
3          4  2017    4.0
4          5  2017    5.0
5          6  2017    6.0
6          7  2017    7.0
7          8  2017    8.0
8          9  2017    9.0
9          1  2018    2.0
10         2  2018    4.0
11         3  2018    6.0
12         4  2018    8.0
13         5  2018   10.0
14         6  2018   12.0
15         7  2018   14.0
16         8  2018   16.0
17         9  2018   18.0
18         1  2019    1.5
19         2  2019    3.0
20         3  2019    4.5
21         4  2019    6.0
22         5  2019    7.5
23         6  2019    9.0
24         7  2019   10.5
25         8  2019   12.0
26         9  2019   13.5
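A variation on the same approach, sketched under the assumption that df and m are as above (this is not part of the original answer): treat the 0s as missing up front and map the per-quantity means onto only the rows that need filling:

import numpy as np

# m: mean price per quantity over 2017-2018, as computed above
mask = df.year.eq(2019) & df.price.replace(0, np.nan).isna()
df.loc[mask, 'price'] = df.loc[mask, 'quantity'].map(m)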
Using bfill with a chosen number
I have a data frame column like so:

Year  Rank
2017   NaN
2017   NaN
2017     3
2017     4
2017     5
...
2016   NaN
2016   NaN
2016     3
2016     4
2016     5
...

Can I use bfill to replace the first two values, so my column looks like this?

Year  Rank
2017     1
2017     2
2017     3
2017     4
2017     5
...
2016     1
2016     2
2016     3
2016     4
2016     5
...

Or is there an easier way than using bfill? Thanks in advance
Use the limit parameter of fillna:

df['Rank'] = df['Rank'].fillna(1, limit=1)
df['Rank'] = df['Rank'].fillna(2, limit=2)

...and if necessary, call the function per group:

def f(x):
    x = x.fillna(1, limit=1)
    x = x.fillna(2, limit=2)
    return x

df['New'] = df.groupby('Year')['Rank'].apply(f)
print(df)

   Year  Rank   New
0  2017   NaN   1.0
1  2017   NaN   2.0
2  2017   3.0   3.0
3  2017   4.0   4.0
4  2017   5.0   5.0
5  2016   NaN   1.0
6  2016   NaN   2.0
7  2016   5.0   5.0
8  2016   6.0   6.0
9  2016  10.0  10.0
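If the missing ranks are always the leading rows of each year and a rank is simply the row's 1-based position within its group, a cumcount-based fill avoids hardcoding one fillna call per value. A small sketch under that assumption (not from the original answers):

import numpy as np
import pandas as pd

df = pd.DataFrame({'Year': [2017]*5 + [2016]*5,
                   'Rank': [np.nan, np.nan, 3, 4, 5, np.nan, np.nan, 3, 4, 5]})

# cumcount() numbers the rows 0..n-1 within each Year; +1 makes it 1-based.
df['Rank'] = df['Rank'].fillna(df.groupby('Year').cumcount() + 1)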
Have a look at the pandas documentation for DataFrame.fillna.
Filtering outliers before using group by
I have a dataframe with a price column (P), and I have some undesired values like (0, 1.50, 92.80, 0.80). Before I calculate the mean of the price by product code, I would like to remove these outliers.

    Code  Year  Month  Day     Q      P
0    100  2017      1    4   2.0  42.90
1    100  2017      1    9   2.0  42.90
2    100  2017      1   18   1.0  45.05
3    100  2017      1   19   2.0  45.05
4    100  2017      1   20   1.0  45.05
5    100  2017      1   24  10.0  46.40
6    100  2017      1   26   1.0  46.40
7    100  2017      1   28   2.0  92.80
8    100  2017      2    1   0.0   0.00
9    100  2017      2    7   2.0   1.50
10   100  2017      2    8   5.0   0.80
11   100  2017      2    9   1.0  45.05
12   100  2017      2   11   1.0   1.50
13   100  2017      3    8   1.0  49.90
14   100  2017      3   17   6.0  45.05
15   100  2017      3   24   1.0  45.05
16   100  2017      3   30   2.0   1.50

What would be a good way to filter the outliers for each product (grouped by Code)? I tried this:

stds = 1.0  # Number of standard deviations that defines 'outlier'.
z = df[['Code', 'P']].groupby('Code').transform(
    lambda group: (group - group.mean()).div(group.std()))
outliers = z.abs() > stds
df[outliers.any(axis=1)]

And then:

print(df[['Code', 'Year', 'Month', 'P']].groupby(['Code', 'Year', 'Month']).mean())

But the outlier filter doesn't work properly.
IIUC you can group by Code, do your z-score calculation on P, and filter out rows where the absolute z-score is greater than your threshold:

stds = 1.0
filtered_df = df[~df.groupby('Code')['P'].transform(
    lambda x: abs((x - x.mean()) / x.std()) > stds)]

    Code  Year  Month  Day     Q      P
0    100  2017      1    4   2.0  42.90
1    100  2017      1    9   2.0  42.90
2    100  2017      1   18   1.0  45.05
3    100  2017      1   19   2.0  45.05
4    100  2017      1   20   1.0  45.05
5    100  2017      1   24  10.0  46.40
6    100  2017      1   26   1.0  46.40
11   100  2017      2    9   1.0  45.05
13   100  2017      3    8   1.0  49.90
14   100  2017      3   17   6.0  45.05
15   100  2017      3   24   1.0  45.05

filtered_df[['Code', 'Year', 'Month', 'P']].groupby(['Code', 'Year', 'Month']).mean()

                         P
Code Year Month
100  2017 1      44.821429
          2      45.050000
          3      46.666667
You have the right idea. Just take the Boolean opposite of your outliers['P'] series via ~ and filter your dataframe via loc:

res = df.loc[~outliers['P']]\
        .groupby(['Code', 'Year', 'Month'], as_index=False)['P'].mean()

print(res)

   Code  Year  Month          P
0   100  2017      1  44.821429
1   100  2017      2  45.050000
2   100  2017      3  46.666667
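One caveat with a std-based threshold: the outliers themselves inflate the standard deviation, which can let borderline values slip through. An IQR-based rule is more robust to that. Here is a minimal sketch of the same per-Code filtering under that alternative rule (not from the original answers):

import pandas as pd

def iqr_filter(g):
    # Keep rows whose P lies within 1.5 * IQR of the quartiles.
    q1, q3 = g['P'].quantile([0.25, 0.75])
    iqr = q3 - q1
    return g[g['P'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)]

filtered = df.groupby('Code', group_keys=False).apply(iqr_filter)
filtered.groupby(['Code', 'Year', 'Month'], as_index=False)['P'].mean()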
Read values from multiple rows and combine them in another row in pandas dataframe
I have the following dataframe:

    item_id  bytes  value_id  value
1         0    2.0      year   2017
2         0    1.0     month     04
3         0    1.0       day     12
4         0    1.0      time     07
5         0    1.0    minute     13
6         1    2.0      year   2017
7         1    1.0     month     12
8         1    1.0       day     19
9         1    1.0      time     09
10        1    1.0    minute     32
11        2    2.0      year   2017
12        2    1.0     month     04
13        2    1.0       day     17
14        2    1.0      time     14
15        2    1.0    minute     24

I want to be able to calculate the time for each item_id. How do I use group by here, or anything else, to achieve the following?

   item_id              time
0        0  2017/04/12 07:13
1        1  2017/12/19 09:32
2        2  2017/04/17 14:24
Use pivot + to_datetime:

pd.to_datetime(
    df.drop(columns='bytes')
      .pivot(index='item_id', columns='value_id', values='value')
      .rename(columns={'time': 'hour'})
).reset_index(name='time')

   item_id                time
0        0 2017-04-12 07:13:00
1        1 2017-12-19 09:32:00
2        2 2017-04-17 14:24:00

You can drop the bytes column before pivoting; it doesn't seem like you need it.
set_index + unstack also works. pd.to_datetime can be passed a dataframe; you only need to name your columns correctly:

pd.to_datetime(df.set_index(['item_id', 'value_id']).value.unstack()
                 .rename(columns={'time': 'hour'}))

Out[537]:
item_id
0   2017-04-12 07:13:00
1   2017-12-19 09:32:00
2   2017-04-17 14:24:00
dtype: datetime64[ns]
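Both answers return datetime64 values; if the output should literally match the YYYY/MM/DD HH:MM strings shown in the question, the result can be formatted with strftime. A small sketch, where s stands for the datetime Series produced by the second answer (the name is ours, not from the answers):

# s: Series of datetime64 values indexed by item_id
out = s.dt.strftime('%Y/%m/%d %H:%M').reset_index(name='time')
#    item_id              time
# 0        0  2017/04/12 07:13
# 1        1  2017/12/19 09:32
# 2        2  2017/04/17 14:24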