Pandas Dataframe - Fillna with Mean by Month - python

I have a dataframe with some "NaN" and outlier values which I want to fill with the mean value of the specific month.
df[["arrival_day", "ib_units", "month","year"]]
arrival_day ib_units month year
37 2020-01-01 262 1 2020
235 2020-01-02 2301 1 2020
290 2020-01-02 145 1 2020
476 2020-01-02 6584 1 2020
551 2020-01-02 30458 1 2020
... ... ... ... ...
1479464 2022-07-19 56424 7 2022
1479490 2022-07-19 130090 7 2022
1479510 2022-07-19 3552 7 2022
1479556 2022-07-19 23779 7 2022
1479756 2022-07-20 2882 7 2022
I know there is the pandas.DataFrame.fillna function, df.fillna(df.mean()), but in this case it would use the overall mean of the whole dataset. I want to fill the NaNs with the mean value of the specific month in that specific year.
This is what I have tried but this solution is not straightforward and only calculates the mean by year and not the mean by month:
mask_2020 = (df['arrival_day'] >= '2020-01-01') & (df['arrival_day'] <= '2020-12-31')
df_2020 = df.loc[mask_2020]
mask_2021 = (df['arrival_day'] >= '2021-01-01') & (df['arrival_day'] <= '2021-12-31')
df_2021 = df.loc[mask_2021]
mask_2022 = (df['arrival_day'] >= '2022-01-01') & (df['arrival_day'] <= '2022-12-31')
df_2022 = df.loc[mask_2022]
mean_2020 = df_2020.ib_units.mean()
mean_2021 = df_2021.ib_units.mean()
mean_2022 = df_2022.ib_units.mean()
# this finds quartile outliers and replaces them with the mean value of the specific year
for x in ['ib_units']:
    q75, q25 = np.percentile(df_2020.loc[:, x], [75, 25])
    intr_qr = q75 - q25
    max = q75 + (1.5 * intr_qr)
    min = q25 - (1.5 * intr_qr)
    df_2020.loc[df_2020[x] < min, x] = mean_2020
    df_2020.loc[df_2020[x] > max, x] = mean_2020

for x in ['ib_units']:
    q75, q25 = np.percentile(df_2021.loc[:, x], [75, 25])
    intr_qr = q75 - q25
    max = q75 + (1.5 * intr_qr)
    min = q25 - (1.5 * intr_qr)
    df_2021.loc[df_2021[x] < min, x] = mean_2021
    df_2021.loc[df_2021[x] > max, x] = mean_2021

for x in ['ib_units']:
    q75, q25 = np.percentile(df_2022.loc[:, x], [75, 25])
    intr_qr = q75 - q25
    max = q75 + (1.5 * intr_qr)
    min = q25 - (1.5 * intr_qr)
    df_2022.loc[df_2022[x] < min, x] = mean_2022
    df_2022.loc[df_2022[x] > max, x] = mean_2022
So how can I do this in a more concise way, and by month rather than by year?
Thanks!

I think you are overthinking it. Please see the code below and check whether it works for you. For outliers, the approach is the same as for filling the NaNs.
import pandas as pd
import datetime as dt
# Sample Data:
df = pd.DataFrame({'date': ['2000-01-02', '2000-01-02', '2000-01-15', '2000-01-27',
                            '2000-06-03', '2000-06-29', '2000-06-15', '2000-06-29',
                            '2001-01-02', '2001-01-02', '2001-01-15', '2001-01-27'],
                   'val': [5, 7, None, 4,
                           8, 1, None, 9,
                           2, 3, None, 7]})
# Convert to datetime and extract year and month:
df['date'] = pd.to_datetime(df['date'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
# Create mean value:
tem = df.groupby(['year', 'month'])[['val']].mean().reset_index()
tem.rename(columns={'val': 'val_mean'}, inplace=True)
tem
# Merge and fill NA:
df = pd.merge(df, tem, how='left', on=['year', 'month'])
df.loc[df['val'].isna(),'val'] = df['val_mean']
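A more compact alternative (just a sketch, reusing the sample frame above) relies on groupby(...).transform, which returns a Series aligned with the original rows, so no merge is needed; the same pattern also covers the per-month outlier replacement from the question:
# per-(year, month) mean, broadcast back to the original rows
month_mean = df.groupby(['year', 'month'])['val'].transform('mean')
df['val'] = df['val'].fillna(month_mean)
# same idea for the IQR rule: compute the bounds per (year, month)
q25 = df.groupby(['year', 'month'])['val'].transform(lambda s: s.quantile(0.25))
q75 = df.groupby(['year', 'month'])['val'].transform(lambda s: s.quantile(0.75))
iqr = q75 - q25
outlier = (df['val'] < q25 - 1.5 * iqr) | (df['val'] > q75 + 1.5 * iqr)
df.loc[outlier, 'val'] = month_mean[outlier]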

I want to slightly modify #PTQouc's code, relying on his dataframe.
Grouping
tem = df.groupby(['year','month'])['val'].mean().reset_index()
Merging
merged = df.merge(tem, how='left', on=['year','month'])
Using Where
merged['col_z'] = merged['val_x'].where(merged['val_x'].notnull(), merged['val_y'])
Dropping
merged = merged.drop(['val_x','val_y'],axis=1)

Pandas - compare day and month only against a datetime?

I want to compare a timestamp of dtype datetime64[ns] with a datetime.date. I only want a comparison based on day and month.
df
timestamp last_price
0 2023-01-22 14:15:06.033314 100.0
1 2023-01-25 14:15:06.213591 101.0
2 2023-01-30 14:15:06.313554 102.0
3 2023-03-31 14:15:07.018540 103.0
cu_date = datetime.datetime.now().date()
cu_year = cu_date.year
check_end_date = datetime.datetime.strptime(f'{cu_year}-11-05', '%Y-%m-%d').date()
check_start_date = datetime.datetime.strptime(f'{cu_year}-03-12', '%Y-%m-%d').date()
# this is incorrect as the day can be greater than check_start_date while the month might be less.
daylight_off_df = df.loc[((df.timestamp.dt.month >= check_end_date.month) & (df.timestamp.dt.day >= check_end_date.day)) |
((df.timestamp.dt.month <= check_start_date.month) & (df.timestamp.dt.day <= check_start_date.day))]
daylight_on_df = df.loc[((df.timestamp.dt.month <= check_end_date.month) & (df.timestamp.dt.day <= check_end_date.day)) &
((df.timestamp.dt.month >= check_start_date.month) & (df.timestamp.dt.day >= check_start_date.day))]
I am trying to think up the logic to do this, but failing.
Expected output:
daylight_off_df
timestamp last_price
0 2023-01-22 14:15:06.033314 100.0
1 2023-01-25 14:15:06.213591 101.0
2 2023-01-30 14:15:06.313554 102.0
daylight_on_df
timestamp last_price
3 2023-03-31 14:15:07.018540 103.0
In summary: separate the dataframe based on a day-and-month comparison while ignoring the year.
I would break out these values and then just query:
df['day'] = df['timestamp'].dt.day_name()
df['month'] = df['timestamp'].dt.month_name()
Then query whatever you're looking for:
df.groupby('month').mean()
The following attributes could be helpful if you don't want an additional column in your table:
check_end_date.timetuple().tm_yday # returns day of the year
#output 309
check_start_date.timetuple().tm_yday
#output 71
df['timestamp'].dt.is_leap_year.astype(int) #returns 1 if year is a leapyear
#output 0 | 1
df['timestamp'].dt.dayofyear #returns day of the year
#output
#0 22
#1 25
#2 30
#3 90
df['timestamp'].dt.dayofyear.between(a,b) #returns true if day is between a,b
There are a few possible solutions now; I think using between is the cleanest-looking one.
daylight_on_df4 = df.loc[df['timestamp'].dt.dayofyear.between(
check_start_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int),
check_end_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int))]
daylight_off_df4 = df.loc[~df['timestamp'].dt.dayofyear.between(
check_start_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int),
check_end_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int))]
Or the code could look like this:
daylight_on_df3 = df.loc[((check_end_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int)) - df['timestamp'].dt.dayofyear > 0)
& (df['timestamp'].dt.dayofyear - (df['timestamp'].dt.is_leap_year.astype(int) + check_start_date.timetuple().tm_yday) > 0)]
daylight_off_df3 = df.loc[((check_end_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int)) - df['timestamp'].dt.dayofyear < 0)
| (df['timestamp'].dt.dayofyear - (check_start_date.timetuple().tm_yday + df['timestamp'].dt.is_leap_year.astype(int)) < 0)]
All daylight_on/off is doing now is checking whether the day of the year falls between your ranges or not (taking leap years into account).
This formula would probably have to be rewritten if your start date/end date crossed a year boundary (e.g. 2022-11-19 to 2023-02-22), but I think it conveys the general idea.
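Another sketch that sidesteps the leap-year adjustment entirely is to encode month and day as a single comparable integer (March 12 becomes 312) and compare those; it reuses check_start_date and check_end_date from the question and, like the dayofyear approach, assumes the range does not cross a year boundary:
# month*100 + day gives an integer that sorts correctly within a year
monthday = df['timestamp'].dt.month * 100 + df['timestamp'].dt.day
start_md = check_start_date.month * 100 + check_start_date.day   # 312 (March 12)
end_md = check_end_date.month * 100 + check_end_date.day         # 1105 (November 5)
daylight_on_df = df[monthday.between(start_md, end_md)]
daylight_off_df = df[~monthday.between(start_md, end_md)]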

Why do I get all dates the same while trying to fill them into a dataset?

I have a dataset with 800 rows, and I want to create a new date column where the date in each row increases by one day.
import datetime
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
for x in range(800):
    df['Date'] = date + datetime.timedelta(days=x)
In each row the date is equal to '2014-01-12'; as I understand it, it fills as if x were always equal to 799.
Each time through the loop you are updating the ENTIRE Date column. You see the results of the 800th update at the end.
You could use a date range:
dr = pd.date_range('5/11/2011', periods=800, freq='D')
df = pd.DataFrame({'Date': dr})
print(df)
Date
0 2011-05-11
1 2011-05-12
2 2011-05-13
3 2011-05-14
4 2011-05-15
.. ...
795 2013-07-14
796 2013-07-15
797 2013-07-16
798 2013-07-17
799 2013-07-18
Or:
df['Date'] = dr
Pandas is a nice tool which can repeat calculations without using a for-loop.
When you use df['Date'] = ..., you assign the same value to all cells in the column.
You have to use df.loc[x, 'Date'] = ... to assign to a single cell.
Minimal working example (with only 10 rows).
import pandas as pd
import datetime
df = pd.DataFrame({'Date':[1,2,3,4,5,6,7,8,9,0]})
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
for x in range(10):
    df.loc[x, 'Date'] = date + datetime.timedelta(days=x)
print(df)
But you could also use pd.date_range() for this.
Minimal working example (with only 10 rows).
import pandas as pd
import datetime
df = pd.DataFrame({'Date':[1,2,3,4,5,6,7,8,9,0]})
date = datetime.datetime.strptime('5/11/2011', '%d/%m/%Y')
df['Date'] = pd.date_range(date, periods=10)
print(df)

How to organize a pandas DataFrame so the first column is dates, with 4 country columns holding percentage data?

The data here is web-scraped from a website. The initial data in the variable 'r' has three columns: 'Country', 'Date', '% vs 2019 (Daily)'. From these I was able to extract only the rows I wanted, from dates "2021-01-01" to current/today. What I am trying to do (and have spent hours on) is organize the data so that there is one column with just the dates that correspond to the percentage data, and then 4 other columns named for the countries: Denmark, Finland, Norway, Sweden. The cells underneath those four country columns should be populated with the percentage data. I have tried using [], loc, iloc, and various other combinations to filter the pandas dataframes to make this happen, but to no avail.
Here is the code I have so far:
import requests
import pandas as pd
import json
import math
import datetime
from jinja2 import Template, Environment
from datetime import date
r = requests.get('https://docs.google.com/spreadsheets/d/1GJ6CvZ_mgtjdrUyo3h2dU3YvWOahbYvPHpGLgovyhtI/gviz/tq?usp=sharing&tqx=reqId%3A0output=jspn')
data = r.content
data = json.loads(data.decode('utf-8').split("(", 1)[1].rsplit(")", 1)[0])
d = [[i['c'][0]['v'], i['c'][2]['f'], (i['c'][5]['v'])*100 ] for i in data['table']['rows']]
df = pd.DataFrame(d, columns=['Country', 'Date', '% vs 2019 (Daily)'])
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
# EXTRACTING BETWEEN TWO DATES
df['Date'] = pd.to_datetime(df['Date'])
startdate = datetime.datetime.strptime('2021-01-01', "%Y-%m-%d").date()
enddate = datetime.datetime.strptime('2021-02-02', "%Y-%m-%d").date()
pd.Timestamp('today').floor('D')
df = df[(df['Date'] > pd.Timestamp(startdate).floor('D')) & (df['Date'] <= pd.Timestamp(enddate).floor('D'))]
Den = df.loc[df['Country'] == 'Denmark']
Fin = df.loc[df['Country'] == 'Finland']
Swe = df.loc[df['Country'] == 'Sweden']
Nor = df.loc[df['Country'] == 'Norway']
Den_data = Den.loc[: , "% vs 2019 (Daily)"]
Den_date = Den.loc[: , "Date"]
Nor_data = Nor.loc[: , "% vs 2019 (Daily)"]
Swe_data = Swe.loc[: , "% vs 2019 (Daily)"]
Fin_data = Fin.loc[: , "% vs 2019 (Daily)"]
Fin_date = Fin.loc[: , "Date"]
Den_data = Den.loc[: , "% vs 2019 (Daily)"]
df2 = pd.DataFrame()
df2['DEN_DATE'] = Den_date
df2['DENMARK'] = Den_data
df3 = pd.DataFrame()
df3['FIN_DATE'] = Fin_date
df3['FINLAND'] = Fin_data
I want it to be organized like this so I can eventually export it to Excel:
Date | Denmark | Finland| Norway | Sweden
2020-01-01 | 1234 | 4321 | 5432 | 6574
...
Any help is greatly appreciated.
Thank you
Use isin to filter only the countries you are interested in. Then use pivot, which returns a reshaped dataframe organized by the given index and column values; in this case the index is the Date column, and the column values are the countries from the previous selection.
...
...
pd.Timestamp('today').floor('D')
df = df[(df['Date'] > pd.Timestamp(startdate).floor('D')) & (df['Date'] <= pd.Timestamp(enddate).floor('D'))]
countries_list=['Denmark', 'Finland', 'Norway', 'Sweden']
countries_selected = df[df.Country.isin(countries_list)]
result = countries_selected.pivot(index="Date", columns="Country")
print(result)
Output from result
% vs 2019 (Daily)
Country Denmark Finland Norway Sweden
Date
2021-01-02 -65.261383 -75.416667 -39.164087 -65.853659
2021-01-03 -60.405405 -77.408056 -31.763620 -66.385669
2021-01-04 -69.371429 -75.598086 -34.002770 -70.704467
2021-01-05 -73.690932 -79.251701 -33.815689 -73.450509
2021-01-06 -76.257310 -80.445151 -43.454791 -80.805484
...
...
2021-01-30 -83.931624 -75.545852 -63.751763 -76.260163
2021-01-31 -80.654339 -74.468085 -55.565777 -65.451895
2021-02-01 -81.494253 -72.419106 -49.610390 -75.473322
2021-02-02 -81.741233 -73.898305 -46.164021 -78.215223
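If the final goal is the Excel export mentioned in the question, a possible follow-up (a sketch; the file name is only an example) is to drop the outer column level so the header row contains just the country names:
# flatten the MultiIndex columns ('% vs 2019 (Daily)', country) to plain country names
result.columns = result.columns.droplevel(0)
result.to_excel('traffic_vs_2019.xlsx')  # example output file name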

How to get minimum of each group for each day based on hour criteria

I have given two dataframes below for you to test
df = pd.DataFrame({
    'subject_id': [1,1,1,1,1,1,1,1,1,1,1],
    'time_1': ['2173-04-03 12:35:00','2173-04-03 17:00:00','2173-04-03 20:00:00',
               '2173-04-04 11:00:00','2173-04-04 11:30:00','2173-04-04 12:00:00',
               '2173-04-05 16:00:00','2173-04-05 22:00:00','2173-04-06 04:00:00',
               '2173-04-06 04:30:00','2173-04-06 06:30:00'],
    'val': [5,5,5,10,5,10,5,8,3,8,10]
})
df1 = pd.DataFrame({
    'subject_id': [1,1,1,1,1,1,1,1,1,1,1],
    'time_1': ['2173-04-03 12:35:00','2173-04-03 12:50:00','2173-04-03 12:59:00',
               '2173-04-03 13:14:00','2173-04-03 13:37:00','2173-04-04 11:30:00',
               '2173-04-05 16:00:00','2173-04-05 22:00:00','2173-04-06 04:00:00',
               '2173-04-06 04:30:00','2173-04-06 08:00:00'],
    'val': [5,5,5,5,10,5,5,8,3,4,6]
})
What I would like to do is:
1) Find all values (from the val column) which have stayed the same for more than 1 hour within each day for each subject_id, and get the minimum of them. Please note that values can also be captured at 15-minute intervals, so you might have to consider 5 records to satisfy the > 1 hr condition.
2) If there are no values which stayed the same for more than 1 hour in a day, then just get the minimum of that day for that subject_id.
This is what I tried
df['time_1'] = pd.to_datetime(df['time_1'])
df['time_2'] = df['time_1'].shift(-1)
df['tdiff'] = (df['time_2'] - df['time_1']).dt.total_seconds() / 3600
df['reading_day'] = pd.DatetimeIndex(df['time_1']).day
# don't know how to apply if else condition here to check for 1 hr criteria
t1 = df.groupby(['subject_id','reading_day','tdiff'])['val'].min()
As I have to apply this to million records, any elegant and efficient solution would be helpful
import numpy as np
import pandas as pd

df = pd.DataFrame({
'subject_id':[1,1,1,1,1,1,1,1,1,1],
'time_1' :['2173-04-03 12:35:00','2173-04-03 17:00:00','2173-04-03 20:00:00','2173-04-04 11:00:00','2173-04-04 11:30:00','2173-04-04 12:00:00','2173-04-04 16:00:00','2173-04-04 22:00:00','2173-04-05 04:00:00','2173-04-05 06:30:00'],
'val' :[5,5,5,10,5,10,5,8,8,10]
})
# Separate Date and time
df['time_1']=pd.to_datetime(df['time_1'])
df['new_date'] = [d.date() for d in df['time_1']]
df['new_time'] = [d.time() for d in df['time_1']]
# find time diff in group with the first element to check > 1 hr
df['shift_val'] = df['val'].shift()
df1=df.assign(time_diff=df.groupby(['subject_id','new_date']).time_1.apply(lambda x: x - x.iloc[0]))
# Verify if time diff > 1 and value is not changed
df2=df1.loc[(df1['time_diff']/ np.timedelta64(1, 'h') >= 1) & (df1.val == df1.groupby('new_date').first().val[0])]
df3=df1.loc[(df1['time_diff']/ np.timedelta64(1, 'h') <= 1) & (df1.val == df1.shift_val)]
# Get the minimum within the group
df4 = pd.concat([df2, df3]).groupby(['new_date'], sort=False).min()
# drop unwanted columns
df4.drop(['new_time','shift_val','time_diff'],axis=1, inplace=True)
df4
Output
subject_id time_1 val
new_date
2173-04-03 1 2173-04-03 17:00:00 5
2173-04-04 1 2173-04-04 16:00:00 5
2173-04-05 1 2173-04-05 04:00:00 8
Try this.
from datetime import timedelta
def f(x):
    dif = (x.iloc[0] - x.iloc[-1]) // timedelta(minutes=1)
    return dif
df1['time_1']= pd.to_datetime(df1['time_1'])
df1['flag']= df1.val.diff().ne(0).cumsum()
df1['t_d']=df1.groupby('flag')['time_1'].transform(f)
df1['date'] = df1['time_1'].dt.date
mask= df1['t_d'].ne(0)
dfa=df1[mask].groupby(['flag','date']).first().reset_index()
dfb=df1[~mask].groupby('date').first().reset_index().dropna(how='any')
df_f = dfa.merge(dfb, how='outer')
df_f.drop_duplicates(subset='date', keep='first', inplace=True)
df_f.drop(['flag','date','t_d'], axis=1, inplace=True)
df_f
Output.
subject_id time_1 val
0 1 2173-04-03 12:35:00 5
1 1 2173-04-04 11:30:00 5
2 1 2173-04-05 16:00:00 5
5 1 2173-04-06 04:00:00 3
Try this
from datetime import timedelta
df1['time_1']= pd.to_datetime(df1['time_1'])
df1['date'] = df1['time_1'].dt.date
df1['t_d'] = df1.groupby(['date'])['time_1'].diff().shift(-1)
mask= df1['t_d']>pd.Timedelta(1,'h')
dfa=df1[mask]
dfb=df1[~mask].groupby('date').first().reset_index()
df_f = dfa.merge(dfb, how='outer')
df_f.drop_duplicates(subset='date', keep='first', inplace=True)
df_f.drop(['date','t_d'], axis=1, inplace=True)
df_f.sort_values('time_1')
I came up with the approach below and it is working. Any suggestions are welcome.
s=pd.to_timedelta(24,unit='h')-(df.time_1-df.time_1.dt.normalize())
df['tdiff'] = df.groupby(df.time_1.dt.date).time_1.diff().shift(-1).fillna(s)
df['t_d'] = df['tdiff'].dt.total_seconds()/3600
df['hr'] = df['time_1'].dt.hour
df['date'] = df['time_1'].dt.date
df['day'] = pd.DatetimeIndex(df['time_1']).day
# here I get the freq and cumsum of each val for each day and each hour. Since sort=False, time order is retained as is
temp_1 = df.groupby(['subject_id','date','hr','val'], sort=False)['t_d'].agg(cumduration='sum', freq='count').reset_index()
# here I remove the `hour` component and sum the value durations within the same day but different hours (for example `5` occurred in the 12th hour and the 13th hour; we sum them)
temp_2 = temp_1.groupby(['subject_id','date','val'], sort=False)['cumduration'].agg(sum_of_cumduration='sum', freq='count').reset_index()
# Later, I create a mask for `> 1` hr criteria
mask = temp_2.groupby(['subject_id','date'])['sum_of_cumduration'].apply(lambda x: x > 1)
output_1 = pd.DataFrame(temp_2[mask].groupby(['subject_id','date'])['val'].min()).reset_index()
# I check for `< 1 ` hr records here
output_2 = pd.DataFrame(temp_2[~mask].groupby(['subject_id','date'])['val'].min()).reset_index()
# I finally check for `subject_id` and `date` and then append
output = pd.concat([output_1, output_2[~output_2['subject_id'].isin(output_1['subject_id'])]])
output
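For comparison, here is a run-length based sketch of the same idea (assuming df from the question with time_1 already parsed): label consecutive runs of equal values within each subject/day, keep runs lasting at least an hour, and fall back to the daily minimum when no such run exists.
df = df.sort_values(['subject_id', 'time_1'])
df['date'] = df['time_1'].dt.date
# label consecutive runs of the same value within each subject/day
run = (df.groupby(['subject_id', 'date'])['val']
         .transform(lambda s: s.ne(s.shift()).cumsum())
         .rename('run'))
# duration of each run = last timestamp minus first timestamp in the run
dur = (df.groupby(['subject_id', 'date', run])['time_1']
         .transform(lambda s: s.iloc[-1] - s.iloc[0]))
# minimum of values that stayed constant for at least 1 hour, else the daily minimum
long_min = df[dur >= pd.Timedelta(hours=1)].groupby(['subject_id', 'date'])['val'].min()
day_min = df.groupby(['subject_id', 'date'])['val'].min()
result = long_min.combine_first(day_min).reset_index()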

Pivot DataFrame of start and ending dates into a truth table

I have a Pandas DataFrame that has the dates that SP500 constituents were added to/deleted from the index. It looks something like this:
PERMNO start ending
0 10006.0 1957-03-01 1984-07-18
1 10030.0 1957-03-01 1969-01-08
2 10049.0 1925-12-31 1932-10-01
3 10057.0 1957-03-01 1992-07-02
4 10078.0 1992-08-20 2010-01-28
I also have a list of dates that I am concerned with; it consists of trading days between 1/1/2003 and 6/30/2009. I want to create a dataframe with these dates on the index and PERMNOs as the columns. It will be populated as a truth table of whether the stock was included in the SP500 on that day.
Is there a fast way of doing this?
Note: some stocks are added to the SP500, then removed, then later added again.
If I understand you correctly, you are trying to find the list of S&P 500 constituents as of a series of dates. Assuming your dataframe has start and ending as datetime64 already:
# the list of dates that you are interested in
dates = pd.Series(['1960-01-01', '1980-01-01'], dtype='datetime64[ns]')
start = df['start'].values
end = df['ending'].values
d = dates.values[:, None] # to prepare for array broadcasting
# if the date is between `start` and `ending` of the stock's membership in the S&P 500
match = (start <= d) & (d <= end)
# list of PERMNO for each as-of date
p = dates.index.to_series() \
.apply(lambda i: df.loc[match[i], 'PERMNO']) \
.stack().droplevel(-1)
# tying everything together
result = dates.to_frame('AsOfDate').join(p)
Result:
AsOfDate PERMNO
0 1960-01-01 10006.0
0 1960-01-01 10030.0
0 1960-01-01 10057.0
1 1980-01-01 10006.0
1 1980-01-01 10057.0
You can use the DataFrame constructor with np.tile and np.repeat, filtering by a mask created with ravel:
dates = pd.to_datetime(['1960-01-01', '1980-01-01'])
start = df['start'].values
end = df['ending'].values
d = dates.values[:, None]
#filter by boolean broadcasting
match = (start <= d) & (d <= end)
a = np.tile(df['PERMNO'], len(dates))
b = np.repeat(dates, len(df))
mask = match.ravel()
df1 = pd.DataFrame({'Date1':b[mask], 'PERMNO':a[mask]})
print (df1)
Date1 PERMNO
0 1960-01-01 10006.0
1 1960-01-01 10030.0
2 1960-01-01 10057.0
3 1980-01-01 10006.0
4 1980-01-01 10057.0
A different output, as a True/False table:
df2 = pd.DataFrame(match, index=dates, columns=df['PERMNO'])
print (df2)
PERMNO 10006.0 10030.0 10049.0 10057.0 10078.0
1960-01-01 True True False True False
1980-01-01 True False False True False
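To build the truth table over the trading-day range from the question, the same broadcast comparison works with a longer index (a sketch; pd.bdate_range gives business days, which only approximates actual exchange trading days):
# business days between 2003-01-01 and 2009-06-30 as an approximation of trading days
dates = pd.bdate_range('2003-01-01', '2009-06-30')
d = dates.values[:, None]
match = (df['start'].values <= d) & (d <= df['ending'].values)
truth = pd.DataFrame(match, index=dates, columns=df['PERMNO'])
# a stock that was added, removed and re-added appears in several rows;
# collapse duplicate PERMNO columns with any()
truth = truth.T.groupby(level=0).any().T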
