Filling in missing hours in a Pandas dataframe


Date_NZST Time_NZST Radiation_Amount_MJ/m2
5/08/2011 0:00 0
5/08/2011 1:00 0
5/08/2011 2:00 0
5/08/2011 3:00 0
5/08/2011 4:00 0
5/08/2011 5:00 0
5/08/2011 6:00 0
5/08/2011 7:00 0
5/08/2011 8:00 0
5/08/2011 9:00 0.37
5/08/2011 10:00 0.41
5/08/2011 11:00 1.34
5/08/2011 12:00 0.87
5/08/2011 14:00 1.69
5/08/2011 15:00 1.53
5/08/2011 16:00 1.14
5/08/2011 17:00 0.63
5/08/2011 18:00 0.08
5/08/2011 19:00 0
5/08/2011 20:00 0
5/08/2011 21:00 0
5/08/2011 22:00 0
5/08/2011 23:00 0
I have an Excel spreadsheet that contains hourly measurements of solar irradiance for every day of a year. It has 3 columns: Date_NZST, Time_NZST and Radiation_Amount_MJ/m2.
I'm trying to find a way to automatically find all missing hours, generate a row for each missing hour and fill it with a - symbol in the Radiation_Amount_MJ/m2 column. For example, hour 13:00 is missing, so I'd like to make a row between the 12:00 and 14:00 rows with the correct date and fill the Radiation_Amount_MJ/m2 column with a -. All dates are present; only some hours are missing.
5/08/2011 11:00 1.34
5/08/2011 12:00 0.87
5/08/2011 13:00 -
5/08/2011 14:00 1.69
5/08/2011 15:00 1.53
I've been doing this in Excel but this is a very tedious process and there could be hundreds of missing points. I've resorted to trying to do it using the Pandas library in Python and I saw this thread (Fill in missing hours in a pandas dataframe) and I tried to alter the answer code to fit my data but I got really confused by the line
df.groupby('area').\
    apply(lambda x : x.set_index('Datetime').resample('H').mean().fillna(0)).\
    reset_index()
and how to repurpose it to my data. Anyone have any ideas?

First we create a datetime index that combines the date and time with pd.to_datetime.
Then we use resample to get hourly data, and use fillna to fill the missing values with a -:
df.set_index(pd.to_datetime(df['Date_NZST'] + ' ' + df['Time_NZST']), inplace=True)
df = df.drop(columns=['Date_NZST', 'Time_NZST'])
df = df.resample('H').first().fillna('-')
Output
Radiation_Amount_MJ/m2
2011-05-08 00:00:00 0
2011-05-08 01:00:00 0
2011-05-08 02:00:00 0
2011-05-08 03:00:00 0
2011-05-08 04:00:00 0
2011-05-08 05:00:00 0
2011-05-08 06:00:00 0
2011-05-08 07:00:00 0
2011-05-08 08:00:00 0
2011-05-08 09:00:00 0.37
2011-05-08 10:00:00 0.41
2011-05-08 11:00:00 1.34
2011-05-08 12:00:00 0.87
2011-05-08 13:00:00 -
2011-05-08 14:00:00 1.69
2011-05-08 15:00:00 1.53
2011-05-08 16:00:00 1.14
2011-05-08 17:00:00 0.63
2011-05-08 18:00:00 0.08
2011-05-08 19:00:00 0
2011-05-08 20:00:00 0
2011-05-08 21:00:00 0
2011-05-08 22:00:00 0
2011-05-08 23:00:00 0
If you want the datetime back as a regular column instead of the index, use df.reset_index().
Note that filling a numeric column with - converts it to object dtype.
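One caveat with the sample data: dates such as 5/08/2011 look like day/month/year, while the output above reads them as 8 May. The snippet below is a minimal sketch that assumes day-first dates and therefore passes dayfirst=True. It also leaves the gaps as NaN so the column stays numeric and the missing hours can be listed. The four-row frame is a made-up sample in the question's layout, and the lowercase "h" alias is used because newer pandas versions prefer it over "H".

```python
import pandas as pd

# Made-up sample in the question's layout; 13:00 is missing on purpose.
df = pd.DataFrame({
    "Date_NZST": ["5/08/2011"] * 4,
    "Time_NZST": ["11:00", "12:00", "14:00", "15:00"],
    "Radiation_Amount_MJ/m2": [1.34, 0.87, 1.69, 1.53],
})

# Assumption: dates are day/month/year, so 5/08/2011 is 5 August 2011.
stamps = pd.to_datetime(df["Date_NZST"] + " " + df["Time_NZST"], dayfirst=True)

hourly = (
    df.set_index(stamps)
      .drop(columns=["Date_NZST", "Time_NZST"])
      .resample("h")   # one row per hour between the first and last timestamp
      .first()         # hours without a measurement become NaN
)

# Timestamps of the hours that had no measurement in the source data
missing = hourly.index[hourly["Radiation_Amount_MJ/m2"].isna()]
print(missing)
```

Keeping NaN until the data is exported means sums and means still work; the - placeholder can be applied as the very last step.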

Related

Calculation of the daily average

How can I transform this data so that the Pm 2.5 and Pm 10 columns hold the average for the whole day? The data I collected (example below) is sampled every 15 minutes.
Pm 2.5 Pm 10 Created At
0 6.00 19.20 2021-06-21 19:00
1 4.70 17.00 2021-06-21 19:15
2 4.80 16.70 2021-06-21 19:30
3 5.10 12.10 2021-06-21 19:45
4 7.90 19.10 2021-06-21 20:00

Let's resample the dataframe:
df['Created At'] = pd.to_datetime(df['Created At'])
df.resample('D', on='Created At').mean()
Pm 2.5 Pm 10
Created At
2021-06-21 5.7 16.82

You can use pd.Grouper and then transform if you want to preserve the dataframe shape:
df['Created At'] = pd.to_datetime(df['Created At'])
df[['Pm 2.5', 'Pm 10']] = df.groupby(pd.Grouper(key='Created At', freq='D'))\
[['Pm 2.5', 'Pm 10']].transform('mean')
Output:
Pm 2.5 Pm 10 Created At
0 5.7 16.82 2021-06-21 19:00:00
1 5.7 16.82 2021-06-21 19:15:00
2 5.7 16.82 2021-06-21 19:30:00
3 5.7 16.82 2021-06-21 19:45:00
4 5.7 16.82 2021-06-21 20:00:00

Here is one way to do it:
convert the date using to_datetime, group by the date part, and take the mean of the two measurement columns
df.groupby(pd.to_datetime(df['Created At']).dt.date)[['Pm 2.5', 'Pm 10']].mean().reset_index()
Created At Pm 2.5 Pm 10
0 2021-06-21 5.7 16.82
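To check that the daily grouping separates days correctly, here is a small self-contained sketch based on the resample approach. The first five rows are the question's sample; the sixth row (2021-06-22) is a hypothetical extra reading added only so that there is a second day to average.

```python
import pandas as pd

df = pd.DataFrame({
    "Pm 2.5": [6.0, 4.7, 4.8, 5.1, 7.9, 3.0],
    "Pm 10": [19.2, 17.0, 16.7, 12.1, 19.1, 10.0],
    "Created At": [
        "2021-06-21 19:00", "2021-06-21 19:15", "2021-06-21 19:30",
        "2021-06-21 19:45", "2021-06-21 20:00",
        "2021-06-22 08:00",  # hypothetical row, not in the original sample
    ],
})
df["Created At"] = pd.to_datetime(df["Created At"])

# One row per calendar day, averaging only the two measurement columns
daily = df.resample("D", on="Created At")[["Pm 2.5", "Pm 10"]].mean()
print(daily)
```

Selecting the measurement columns explicitly keeps the call working even if the frame later gains non-numeric columns.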

Create date/hour variable for each hour between two datetime variables

I have a data table that looks like this:
ID ARRIVAL_DATE_TIME DISPOSITION_DATE
1 2021-11-07 08:35:00 2021-11-07 17:58:00
2 2021-11-07 13:16:00 2021-11-08 02:52:00
3 2021-11-07 15:12:00 2021-11-07 21:08:00
I want to be able to count the number of patients in our location by date/hour and hour. I imagine I would eventually have to transform this data into a format seen below and then create a pivot table, but I'm not sure how to first transform this data. So for example, ID 1 would have a row for each date/hour and hour between '2021-11-07 08:35:00' and '2021-11-07 17:58:00'.
ID DATE_HOUR_IN_ED HOUR_IN_ED
1 2021-11-07 08:00:00 8:00
1 2021-11-07 09:00:00 9:00
1 2021-11-07 10:00:00 10:00
1 2021-11-07 11:00:00 11:00
...
2 2021-11-07 13:00:00 13:00
2 2021-11-07 14:00:00 14:00
2 2021-11-07 15:00:00 15:00
....

Use to_datetime with Series.dt.floor to round the arrival times down to the hour, build one Series per row from date_range, join them with concat, and finally create the DataFrame with the constructor:
df['ARRIVAL_DATE_TIME'] = pd.to_datetime(df['ARRIVAL_DATE_TIME']).dt.floor('H')
s = pd.concat([pd.Series(r.ID, pd.date_range(r.ARRIVAL_DATE_TIME,
                                              r.DISPOSITION_DATE, freq='H'))
               for r in df.itertuples()])
df1 = pd.DataFrame({'ID': s.to_numpy(),
                    'DATE_HOUR_IN_ED': s.index,
                    'HOUR_IN_ED': s.index.strftime('%H:%M')})
print(df1)
ID DATE_HOUR_IN_ED HOUR_IN_ED
0 1 2021-11-07 08:00:00 08:00
1 1 2021-11-07 09:00:00 09:00
2 1 2021-11-07 10:00:00 10:00
3 1 2021-11-07 11:00:00 11:00
4 1 2021-11-07 12:00:00 12:00
5 1 2021-11-07 13:00:00 13:00
6 1 2021-11-07 14:00:00 14:00
7 1 2021-11-07 15:00:00 15:00
8 1 2021-11-07 16:00:00 16:00
9 1 2021-11-07 17:00:00 17:00
10 2 2021-11-07 13:00:00 13:00
11 2 2021-11-07 14:00:00 14:00
12 2 2021-11-07 15:00:00 15:00
13 2 2021-11-07 16:00:00 16:00
14 2 2021-11-07 17:00:00 17:00
15 2 2021-11-07 18:00:00 18:00
16 2 2021-11-07 19:00:00 19:00
17 2 2021-11-07 20:00:00 20:00
18 2 2021-11-07 21:00:00 21:00
19 2 2021-11-07 22:00:00 22:00
20 2 2021-11-07 23:00:00 23:00
21 2 2021-11-08 00:00:00 00:00
22 2 2021-11-08 01:00:00 01:00
23 2 2021-11-08 02:00:00 02:00
24 3 2021-11-07 15:00:00 15:00
25 3 2021-11-07 16:00:00 16:00
26 3 2021-11-07 17:00:00 17:00
27 3 2021-11-07 18:00:00 18:00
28 3 2021-11-07 19:00:00 19:00
29 3 2021-11-07 20:00:00 20:00
30 3 2021-11-07 21:00:00 21:00
Alternative solution:
df['ARRIVAL_DATE_TIME'] = pd.to_datetime(df['ARRIVAL_DATE_TIME']).dt.floor('H')
L = [pd.date_range(s, e, freq='H')
     for s, e in df[['ARRIVAL_DATE_TIME','DISPOSITION_DATE']].to_numpy()]
df['DATE_HOUR_IN_ED'] = L
df = (df.drop(['ARRIVAL_DATE_TIME','DISPOSITION_DATE'], axis=1)
        .explode('DATE_HOUR_IN_ED')
        .reset_index(drop=True)
        .assign(HOUR_IN_ED = lambda x: x['DATE_HOUR_IN_ED'].dt.strftime('%H:%M')))
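Once the data is in this long format, the per-hour patient count that the question is ultimately after is a single groupby. The following is a rough sketch under the same assumptions as above (arrival floored to the hour, one row per started hour); it rebuilds the long table with a plain list comprehension so that it runs on its own, and the names rows, long_df and counts are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 2, 3],
    "ARRIVAL_DATE_TIME": ["2021-11-07 08:35:00", "2021-11-07 13:16:00",
                          "2021-11-07 15:12:00"],
    "DISPOSITION_DATE": ["2021-11-07 17:58:00", "2021-11-08 02:52:00",
                         "2021-11-07 21:08:00"],
})

# Floor the arrival to the hour so every started hour is included
arrival = pd.to_datetime(df["ARRIVAL_DATE_TIME"]).dt.floor("h")
departure = pd.to_datetime(df["DISPOSITION_DATE"])

# One (ID, hour) pair for every hour the patient was present
rows = [
    (patient_id, hour)
    for patient_id, start, end in zip(df["ID"], arrival, departure)
    for hour in pd.date_range(start, end, freq="h")
]
long_df = pd.DataFrame(rows, columns=["ID", "DATE_HOUR_IN_ED"])

# Number of distinct patients present in each date/hour slot
counts = long_df.groupby("DATE_HOUR_IN_ED")["ID"].nunique()
print(counts)
```

Grouping on long_df["DATE_HOUR_IN_ED"].dt.hour instead would give the count by hour of day across all dates.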

Try this:
import pandas as pd
import numpy as np
df = pd.read_excel('test.xls')
df1 = (df.set_index(['ID'])
.assign(DATE_HOUR_IN_ED=lambda x: [pd.date_range(s,d, freq='H')
for s,d in zip(x.ARRIVAL_DATE_TIME, x.DISPOSITION_DATE)])
['DATE_HOUR_IN_ED'].explode()
.reset_index()
)
df1['DATE_HOUR_IN_ED'] = df1['DATE_HOUR_IN_ED'].dt.floor('H')
df1['HOUR_IN_ED'] = df1['DATE_HOUR_IN_ED'].dt.strftime('%H:%M')
print(df1)
Output:
ID DATE_HOUR_IN_ED HOUR_IN_ED
0 1 2021-11-07 08:00:00 08:00
1 1 2021-11-07 09:00:00 09:00
2 1 2021-11-07 10:00:00 10:00
3 1 2021-11-07 11:00:00 11:00
4 1 2021-11-07 12:00:00 12:00
5 1 2021-11-07 13:00:00 13:00
6 1 2021-11-07 14:00:00 14:00
7 1 2021-11-07 15:00:00 15:00
8 1 2021-11-07 16:00:00 16:00
9 1 2021-11-07 17:00:00 17:00
10 2 2021-11-07 13:00:00 13:00
11 2 2021-11-07 14:00:00 14:00
12 2 2021-11-07 15:00:00 15:00
13 2 2021-11-07 16:00:00 16:00
14 2 2021-11-07 17:00:00 17:00
15 2 2021-11-07 18:00:00 18:00
16 2 2021-11-07 19:00:00 19:00
17 2 2021-11-07 20:00:00 20:00
18 2 2021-11-07 21:00:00 21:00
19 2 2021-11-07 22:00:00 22:00
20 2 2021-11-07 23:00:00 23:00
21 2 2021-11-08 00:00:00 00:00
22 2 2021-11-08 01:00:00 01:00
23 2 2021-11-08 02:00:00 02:00
24 3 2021-11-07 15:00:00 15:00
25 3 2021-11-07 16:00:00 16:00
26 3 2021-11-07 17:00:00 17:00
27 3 2021-11-07 18:00:00 18:00
28 3 2021-11-07 19:00:00 19:00
29 3 2021-11-07 20:00:00 20:00

Transpose hourly data in rows to new columns Pandas

I have a table recorded hourly weather data like this:
Year Month Day Time Temp
Date/Time
2005-01-01 00:00:00 2005 1 1 0:00 6.0
2005-01-01 01:00:00 2005 1 1 1:00 6.1
2005-01-01 02:00:00 2005 1 1 2:00 6.7
2005-01-01 03:00:00 2005 1 1 3:00 6.8
2005-01-01 04:00:00 2005 1 1 4:00 6.3
2005-01-01 05:00:00 2005 1 1 5:00 6.6
2005-01-01 06:00:00 2005 1 1 6:00 6.9
2005-01-01 07:00:00 2005 1 1 7:00 7.1
2005-01-01 08:00:00 2005 1 1 8:00 6.9
2005-01-01 09:00:00 2005 1 1 9:00 6.7
2005-01-01 10:00:00 2005 1 1 10:00 7.1
2005-01-01 11:00:00 2005 1 1 11:00 7.1
2005-01-01 12:00:00 2005 1 1 12:00 7.2
2005-01-01 13:00:00 2005 1 1 13:00 7.7
2005-01-01 14:00:00 2005 1 1 14:00 8.8
2005-01-01 15:00:00 2005 1 1 15:00 8.6
2005-01-01 16:00:00 2005 1 1 16:00 7.4
2005-01-01 17:00:00 2005 1 1 17:00 6.8
2005-01-01 18:00:00 2005 1 1 18:00 6.3
2005-01-01 19:00:00 2005 1 1 19:00 5.9
2005-01-01 20:00:00 2005 1 1 20:00 5.6
2005-01-01 21:00:00 2005 1 1 21:00 3.6
2005-01-01 22:00:00 2005 1 1 22:00 2.6
2005-01-01 23:00:00 2005 1 1 23:00 1.7
I wanted to save the dataframe in this format:
How can I transpose the dataframe and create new columns for each record?

1) Copy the data below
Date/Time Year Month Day Time Temp
0 1/1/2005 0:00 2005 1 1 0:00 6.0
1 1/1/2005 1:00 2005 1 1 1:00 6.1
2 1/1/2005 2:00 2005 1 1 2:00 6.7
3 1/1/2005 3:00 2005 1 1 3:00 6.8
4 1/1/2005 4:00 2005 1 1 4:00 6.3
5 1/1/2005 5:00 2005 1 1 5:00 6.6
6 1/1/2005 6:00 2005 1 1 6:00 6.9
7 1/1/2005 7:00 2005 1 1 7:00 7.1
8 1/1/2005 8:00 2005 1 1 8:00 6.9
9 1/1/2005 9:00 2005 1 1 9:00 6.7
10 1/1/2005 10:00 2005 1 1 10:00 7.1
11 1/1/2005 11:00 2005 1 1 11:00 7.1
12 1/1/2005 12:00 2005 1 1 12:00 7.2
2) Use pd.read_clipboard with a separator of two or more whitespace characters, because the Date/Time column itself contains a single space
import pandas as pd
df = pd.read_clipboard(sep=r'\s\s+')
df
3) Format the date/time columns, create a pivot table, and reset/rename the axes.
df['Date/Time'] = pd.to_datetime(df['Date/Time'], format='%m/%d/%Y %H:%M').dt.strftime('%m/%d/%Y')
df['Time'] = pd.to_datetime(df['Time']).dt.time
df = pd.pivot_table(df, index='Date/Time', columns='Time', values='Temp').reset_index().rename_axis(index=None, columns=None)
df['Date/Time'] = df['Date/Time'].apply(lambda x: x + ' 0:00')
df
Output:
Date/Time 00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00 06:00:00 07:00:00 08:00:00 09:00:00 10:00:00 11:00:00 12:00:00
01/01/2005 6.0 6.1 6.7 6.8 6.3 6.6 6.9 7.1 6.9 6.7 7.1 7.1 7.2

This does the trick similarly to the previous answer, but pivots on a new date column so that only one row per day is created:
import pandas as pd
from numpy.random import rand
datetimes = pd.date_range("2005-01-01 00:00:00","2005-01-02 23:00:00", freq="1h")
df = pd.DataFrame({"Date/Time": datetimes, "temp": rand(len(datetimes))})
df["Date"] = df["Date/Time"].dt.date
df["Hour"] = df["Date/Time"].dt.hour
reshaped = df.pivot(index='Date', columns='Hour', values='temp')
reshaped.columns = ['HR'+str(hour) for hour in reshaped.columns]
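The same pivot can be applied to a frame shaped like the one in the question, where Date/Time is already the index. This is only a sketch: the Temp values are placeholder numbers (each equal to its hour of day) rather than the real readings, chosen so that the result is easy to check by eye.

```python
import pandas as pd

# Two days of hourly timestamps, indexed like the question's frame
idx = pd.date_range("2005-01-01 00:00", "2005-01-02 23:00", freq="h",
                    name="Date/Time")
# Placeholder temperatures: the value equals the hour of day
df = pd.DataFrame({"Temp": [float(ts.hour) for ts in idx]}, index=idx)

wide = (
    df.assign(Date=df.index.date, Hour=df.index.hour)
      .pivot(index="Date", columns="Hour", values="Temp")
)
# Rename the 0..23 hour columns to HR0..HR23
wide.columns = [f"HR{hour}" for hour in wide.columns]
print(wide)
```

pivot raises an error if a date/hour pair occurs twice; in that case pivot_table with an aggregation such as mean is the usual fallback.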

How to divide data into night time data and day time data in pandas [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 3 years ago.
Hi guys and gals,
I need help with dividing this dataframe into night time and day time data in pandas. Let's assume the night is after 17:00 and before 08:30, and that the day is between 08:30 and 17:00.
Date Time Open High Low Close Vol
7 2019-09-02 05:00 11919.9 11929.7 11917.7 11918.9 240
8 2019-09-02 06:00 11920.7 11940.4 11917.7 11927.9 240
9 2019-09-02 07:00 11927.4 11966.2 11927.2 11936.4 240
10 2019-09-02 08:00 11936.9 11955.9 11928.1 11951.4 240
11 2019-09-02 09:00 11951.4 11960.2 11939.4 11954.4 240
12 2019-09-02 10:00 11953.9 11995.9 11951.4 11976.9 240
13 2019-09-02 11:00 11976.7 11979.4 11956.2 11965.9 240
14 2019-09-02 12:00 11966.2 11971.4 11956.4 11965.4 240
15 2019-09-02 13:00 11965.7 11969.7 11943.4 11947.7 240
16 2019-09-02 14:00 11947.4 11962.4 11943.9 11960.7 240
17 2019-09-02 15:00 11960.9 11964.2 11901.2 11934.9 240
18 2019-09-02 16:00 11934.9 11939.7 11921.4 11929.7 240
19 2019-09-02 17:00 11929.9 11940.4 11928.4 11938.2 236
20 2019-09-02 18:00 11937.9 11938.2 11934.7 11938.2 176
21 2019-09-02 19:00 11937.9 11948.7 11937.7 11943.2 196
between_time only shows times for the current date, so that alone doesn't do it.

One idea is to convert the Time column to timedeltas and filter with a boolean mask built by Series.between:
mask = (pd.to_timedelta(df['Time'].astype(str).add(':00'))
.between(pd.Timedelta('08:30:00'), pd.Timedelta('17:00:00')))
df1 = df[mask]
print (df1)
Date Time Open High Low Close Vol
11 2019-09-02 09:00 11951.4 11960.2 11939.4 11954.4 240
12 2019-09-02 10:00 11953.9 11995.9 11951.4 11976.9 240
13 2019-09-02 11:00 11976.7 11979.4 11956.2 11965.9 240
14 2019-09-02 12:00 11966.2 11971.4 11956.4 11965.4 240
15 2019-09-02 13:00 11965.7 11969.7 11943.4 11947.7 240
16 2019-09-02 14:00 11947.4 11962.4 11943.9 11960.7 240
17 2019-09-02 15:00 11960.9 11964.2 11901.2 11934.9 240
18 2019-09-02 16:00 11934.9 11939.7 11921.4 11929.7 240
19 2019-09-02 17:00 11929.9 11940.4 11928.4 11938.2 236
df2 = df[~mask]
print (df2)
Date Time Open High Low Close Vol
7 2019-09-02 05:00 11919.9 11929.7 11917.7 11918.9 240
8 2019-09-02 06:00 11920.7 11940.4 11917.7 11927.9 240
9 2019-09-02 07:00 11927.4 11966.2 11927.2 11936.4 240
10 2019-09-02 08:00 11936.9 11955.9 11928.1 11951.4 240
20 2019-09-02 18:00 11937.9 11938.2 11934.7 11938.2 176
21 2019-09-02 19:00 11937.9 11948.7 11937.7 11943.2 196
EDIT:
Another idea is DataFrame.between_time, but it requires a DatetimeIndex:
df['Datetime'] = pd.to_datetime(df['Date'].astype(str) + ' ' + df['Time'].astype(str))
df = df.set_index('Datetime')
day = df.between_time('08:30', '17:00')
night = df[~df.index.isin(day.index)]
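Because between_time looks only at the time of day, it works across any number of dates once the index combines Date and Time. The sketch below illustrates this with DatetimeIndex.indexer_between_time and labels every row instead of splitting the frame. The rows for 2019-09-03 and the session column name are hypothetical additions for the demonstration; both boundaries (08:30 and 17:00) count as day, matching the mask-based answer above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Date": ["2019-09-02", "2019-09-02", "2019-09-02",
             "2019-09-03", "2019-09-03"],   # second date is made up
    "Time": ["08:00", "09:00", "18:00", "08:30", "17:00"],
    "Close": [11951.4, 11954.4, 11938.2, 11960.0, 11970.0],
})
df.index = pd.to_datetime(df["Date"] + " " + df["Time"])

# Integer positions of rows whose time of day lies in [08:30, 17:00]
day_positions = df.index.indexer_between_time("08:30", "17:00")

is_day = np.zeros(len(df), dtype=bool)
is_day[day_positions] = True
df["session"] = np.where(is_day, "day", "night")

day = df[is_day]
night = df[~is_day]
print(df)
```

Working with positions rather than index labels also avoids surprises when the same timestamp appears more than once.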

I would try something like this. Change the times to whatever you need; this just shows the general idea.
In [58]: df = pd.DataFrame({"Time":[
...: "05:00",
...: "06:00",
...: "07:00",
...: "08:00",
...: "09:00",
...: "10:00",
...: "11:00",
...: "12:00",
...: "13:00",
...: "14:00",
...: "15:00",
...: "16:00",
...: "17:00",
...: "18:00",
...: "19:00"]})
In [59]: df = df.set_index(pd.to_datetime(df["Time"]))
In [60]: df
Out[60]:
Time
Time
2019-09-15 05:00:00 05:00
2019-09-15 06:00:00 06:00
2019-09-15 07:00:00 07:00
2019-09-15 08:00:00 08:00
2019-09-15 09:00:00 09:00
2019-09-15 10:00:00 10:00
2019-09-15 11:00:00 11:00
2019-09-15 12:00:00 12:00
2019-09-15 13:00:00 13:00
2019-09-15 14:00:00 14:00
2019-09-15 15:00:00 15:00
2019-09-15 16:00:00 16:00
2019-09-15 17:00:00 17:00
2019-09-15 18:00:00 18:00
2019-09-15 19:00:00 19:00
In [61]: df["time_desc"] = "night"
In [62]: df
Out[62]:
Time time_desc
Time
2019-09-15 05:00:00 05:00 night
2019-09-15 06:00:00 06:00 night
2019-09-15 07:00:00 07:00 night
2019-09-15 08:00:00 08:00 night
2019-09-15 09:00:00 09:00 night
2019-09-15 10:00:00 10:00 night
2019-09-15 11:00:00 11:00 night
2019-09-15 12:00:00 12:00 night
2019-09-15 13:00:00 13:00 night
2019-09-15 14:00:00 14:00 night
2019-09-15 15:00:00 15:00 night
2019-09-15 16:00:00 16:00 night
2019-09-15 17:00:00 17:00 night
2019-09-15 18:00:00 18:00 night
2019-09-15 19:00:00 19:00 night
In [63]: df.loc[df.between_time("06:30", "18:00").index, "time_desc"] = "day"
In [64]: df
Out[64]:
Time time_desc
Time
2019-09-15 05:00:00 05:00 night
2019-09-15 06:00:00 06:00 night
2019-09-15 07:00:00 07:00 day
2019-09-15 08:00:00 08:00 day
2019-09-15 09:00:00 09:00 day
2019-09-15 10:00:00 10:00 day
2019-09-15 11:00:00 11:00 day
2019-09-15 12:00:00 12:00 day
2019-09-15 13:00:00 13:00 day
2019-09-15 14:00:00 14:00 day
2019-09-15 15:00:00 15:00 day
2019-09-15 16:00:00 16:00 day
2019-09-15 17:00:00 17:00 day
2019-09-15 18:00:00 18:00 day
2019-09-15 19:00:00 19:00 night

Reorder day of week in pandas groupby plot bar

I have sorted df data like below:
day_name Day_id
time
2019-05-20 19:00:00 Monday 0
2018-12-31 15:00:00 Monday 0
2019-02-25 17:00:00 Monday 0
2019-05-06 20:00:00 Monday 0
2019-03-12 12:00:00 Tuesday 1
2019-04-16 15:00:00 Tuesday 1
2019-04-02 18:00:00 Tuesday 1
2019-02-05 09:00:00 Tuesday 1
2019-05-28 21:00:00 Tuesday 1
2019-01-15 12:00:00 Tuesday 1
2019-06-04 20:00:00 Tuesday 1
2018-12-04 07:00:00 Tuesday 1
2019-01-22 11:00:00 Tuesday 1
2019-01-09 07:00:00 Wednesday 2
2019-03-06 16:00:00 Wednesday 2
2019-06-19 17:00:00 Wednesday 2
2019-04-10 20:00:00 Wednesday 2
2019-04-24 15:00:00 Wednesday 2
2019-01-31 08:00:00 Thursday 3
2019-01-03 08:00:00 Thursday 3
2019-02-28 19:00:00 Thursday 3
2019-05-23 20:00:00 Thursday 3
2018-12-20 07:00:00 Thursday 3
2019-05-09 19:00:00 Thursday 3
2019-06-28 15:00:00 Friday 4
2019-03-22 12:00:00 Friday 4
2019-03-29 14:00:00 Friday 4
2018-12-15 08:00:00 Saturday 5
2019-02-17 11:00:00 Sunday 6
2019-06-16 19:00:00 Sunday 6
2018-12-02 08:00:00 Sunday 6
Currently, with the help of this post:
df = df.groupby(df.day_name).count().plot(kind="bar")
plt.show()
my output is:
How can I plot the histogram with the days of the week in the proper order (Monday, Tuesday, ...)?
I have found several approaches (1, 2, 3) to solve this, but I can't find a way to use them in my case.
Thank you all for your hard work.

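Since your data is already sorted by Day_id, pass sort=False to groupby so the groups keep their order of appearance instead of being sorted alphabetically: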
m = df.groupby(df.day_name,sort=False).count().plot(kind="bar")
plt.show()
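If the rows are not already in weekday order, sort=False alone is not enough. A possible alternative, sketched here with a small made-up day_name column, is to count the days and then reindex the counts against an explicit weekday list; the week and counts names are illustrative.

```python
import pandas as pd

# Made-up sample, deliberately not sorted by weekday
df = pd.DataFrame({
    "day_name": ["Sunday", "Tuesday", "Monday", "Tuesday",
                 "Friday", "Monday", "Monday"],
})

week = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

# Count rows per day, then force Monday..Sunday order;
# days that never occur get a count of 0.
counts = df["day_name"].value_counts().reindex(week, fill_value=0)
print(counts)

# With matplotlib installed, counts.plot(kind="bar") draws the bars in this order.
```

The explicit list also keeps empty days on the axis, which a plain groupby would drop.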
