I have a table of hourly weather data recorded like this:
Year Month Day Time Temp
Date/Time
2005-01-01 00:00:00 2005 1 1 0:00 6.0
2005-01-01 01:00:00 2005 1 1 1:00 6.1
2005-01-01 02:00:00 2005 1 1 2:00 6.7
2005-01-01 03:00:00 2005 1 1 3:00 6.8
2005-01-01 04:00:00 2005 1 1 4:00 6.3
2005-01-01 05:00:00 2005 1 1 5:00 6.6
2005-01-01 06:00:00 2005 1 1 6:00 6.9
2005-01-01 07:00:00 2005 1 1 7:00 7.1
2005-01-01 08:00:00 2005 1 1 8:00 6.9
2005-01-01 09:00:00 2005 1 1 9:00 6.7
2005-01-01 10:00:00 2005 1 1 10:00 7.1
2005-01-01 11:00:00 2005 1 1 11:00 7.1
2005-01-01 12:00:00 2005 1 1 12:00 7.2
2005-01-01 13:00:00 2005 1 1 13:00 7.7
2005-01-01 14:00:00 2005 1 1 14:00 8.8
2005-01-01 15:00:00 2005 1 1 15:00 8.6
2005-01-01 16:00:00 2005 1 1 16:00 7.4
2005-01-01 17:00:00 2005 1 1 17:00 6.8
2005-01-01 18:00:00 2005 1 1 18:00 6.3
2005-01-01 19:00:00 2005 1 1 19:00 5.9
2005-01-01 20:00:00 2005 1 1 20:00 5.6
2005-01-01 21:00:00 2005 1 1 21:00 3.6
2005-01-01 22:00:00 2005 1 1 22:00 2.6
2005-01-01 23:00:00 2005 1 1 23:00 1.7
I want to save the dataframe in a wide format, with one row per day and one column per hour. How can I transpose the dataframe and create new columns for each record?
1) Copy the data below:
Date/Time Year Month Day Time Temp
0 1/1/2005 0:00 2005 1 1 0:00 6.0
1 1/1/2005 1:00 2005 1 1 1:00 6.1
2 1/1/2005 2:00 2005 1 1 2:00 6.7
3 1/1/2005 3:00 2005 1 1 3:00 6.8
4 1/1/2005 4:00 2005 1 1 4:00 6.3
5 1/1/2005 5:00 2005 1 1 5:00 6.6
6 1/1/2005 6:00 2005 1 1 6:00 6.9
7 1/1/2005 7:00 2005 1 1 7:00 7.1
8 1/1/2005 8:00 2005 1 1 8:00 6.9
9 1/1/2005 9:00 2005 1 1 9:00 6.7
10 1/1/2005 10:00 2005 1 1 10:00 7.1
11 1/1/2005 11:00 2005 1 1 11:00 7.1
12 1/1/2005 12:00 2005 1 1 12:00 7.2
2) Use pd.read_clipboard with a separator of two or more whitespace characters, since the Date/Time column itself contains a space.
import pandas as pd

df = pd.read_clipboard(r'\s\s+')
df
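(If you prefer not to depend on the clipboard, the same parsing can be reproduced from a plain string with io.StringIO; a small sketch using a few of the sample rows above, purely for illustration:)

import io
import pandas as pd

data = """Date/Time  Year  Month  Day  Time  Temp
1/1/2005 0:00  2005  1  1  0:00  6.0
1/1/2005 1:00  2005  1  1  1:00  6.1
1/1/2005 2:00  2005  1  1  2:00  6.7"""

# split on two-or-more whitespace, same idea as read_clipboard above
df = pd.read_csv(io.StringIO(data), sep=r'\s\s+', engine='python')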
3) Format the date/time columns, build a pivot table, and reset/rename the axes.
df['Date/Time'] = pd.to_datetime(df['Date/Time'], format='%m/%d/%Y %H:%M').dt.strftime('%m/%d/%Y')
df['Time'] = pd.to_datetime(df['Time']).dt.time
df = (pd.pivot_table(df, index='Date/Time', columns='Time', values='Temp')
        .reset_index()
        .rename_axis(index=None, columns=None))
df['Date/Time'] = df['Date/Time'].apply(lambda x: x + ' 0:00')
df
Output:
Date/Time 00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00 06:00:00 07:00:00 08:00:00 09:00:00 10:00:00 11:00:00 12:00:00
01/01/2005 6.0 6.1 6.7 6.8 6.3 6.6 6.9 7.1 6.9 6.7 7.1 7.1 7.2
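Since the stated goal was to save the dataframe in this format, the wide table can then be written straight to disk (the filename here is only an assumption):

df.to_csv('daily_temps_wide.csv', index=False)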
This does the trick, much like the previous answer, but pivots on a new date column so that only one row per day is created:
import pandas as pd
from numpy.random import rand  # random values stand in for real temperatures

datetimes = pd.date_range("2005-01-01 00:00:00", "2005-01-02 23:00:00", freq="1h")
df = pd.DataFrame({"Date/Time": datetimes, "temp": rand(len(datetimes))})
df["Date"] = df["Date/Time"].dt.date
df["Hour"] = df["Date/Time"].dt.hour
reshaped = df.pivot(index='Date', columns='Hour', values='temp')
reshaped.columns = ['HR' + str(hour) for hour in reshaped.columns]
How can I transform this data so that the Pm 2.5 and Pm 10 columns hold the average for the whole day? The data I collected (example below) is recorded every 15 minutes.
Pm 2.5 Pm 10 Created At
0 6.00 19.20 2021-06-21 19:00
1 4.70 17.00 2021-06-21 19:15
2 4.80 16.70 2021-06-21 19:30
3 5.10 12.10 2021-06-21 19:45
4 7.90 19.10 2021-06-21 20:00
Let's resample the dataframe:
df['Created At'] = pd.to_datetime(df['Created At'])
df.resample('D', on='Created At').mean()
Pm 2.5 Pm 10
Created At
2021-06-21 5.7 16.82
You can use pd.Grouper and then transform if you want to preserve the dataframe shape:
df['Created At'] = pd.to_datetime(df['Created At'])
df[['Pm 2.5', 'Pm 10']] = df.groupby(pd.Grouper(key='Created At', freq='D'))\
[['Pm 2.5', 'Pm 10']].transform('mean')
Output:
Pm 2.5 Pm 10 Created At
0 5.7 16.82 2021-06-21 19:00:00
1 5.7 16.82 2021-06-21 19:15:00
2 5.7 16.82 2021-06-21 19:30:00
3 5.7 16.82 2021-06-21 19:45:00
4 5.7 16.82 2021-06-21 20:00:00
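If you would rather keep the original 15-minute readings and add the daily averages as extra columns instead of overwriting, a variant of the same transform idea (the new column names are assumptions):

df['Created At'] = pd.to_datetime(df['Created At'])
daily_avg = df.groupby(pd.Grouper(key='Created At', freq='D'))[['Pm 2.5', 'Pm 10']].transform('mean')
df[['Pm 2.5 daily avg', 'Pm 10 daily avg']] = daily_avg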
Here is one way to do it: convert the date using to_datetime, grab the date part, and compute the mean.
df.groupby(pd.to_datetime(df['Created At']).dt.date).mean()
Created At Pm 2.5 Pm 10
0 2021-06-21 5.7 16.82
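In recent pandas versions it is safest to select the numeric columns explicitly before taking the mean, so the original Created At column does not get in the way (a sketch of the same approach):

df.groupby(pd.to_datetime(df['Created At']).dt.date)[['Pm 2.5', 'Pm 10']].mean()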
I have a data table that looks like this:
ID ARRIVAL_DATE_TIME DISPOSITION_DATE
1 2021-11-07 08:35:00 2021-11-07 17:58:00
2 2021-11-07 13:16:00 2021-11-08 02:52:00
3 2021-11-07 15:12:00 2021-11-07 21:08:00
I want to be able to count the number of patients in our location by date/hour and hour. I imagine I would eventually have to transform this data into a format seen below and then create a pivot table, but I'm not sure how to first transform this data. So for example, ID 1 would have a row for each date/hour and hour between '2021-11-07 08:35:00' and '2021-11-07 17:58:00'.
ID DATE_HOUR_IN_ED HOUR_IN_ED
1 2021-11-07 08:00:00 8:00
1 2021-11-07 09:00:00 9:00
1 2021-11-07 10:00:00 10:00
1 2021-11-07 11:00:00 11:00
...
2 2021-11-07 13:00:00 13:00
2 2021-11-07 14:00:00 14:00
2 2021-11-07 15:00:00 15:00
....
Use to_datetime with Series.dt.floor to drop the minutes, then concat a repeated date_range per row, and finally build the new DataFrame with the constructor:
df['ARRIVAL_DATE_TIME'] = pd.to_datetime(df['ARRIVAL_DATE_TIME']).dt.floor('H')
s = pd.concat([pd.Series(r.ID,pd.date_range(r.ARRIVAL_DATE_TIME,
r.DISPOSITION_DATE, freq='H'))
for r in df.itertuples()])
df1 = pd.DataFrame({'ID':s.to_numpy(),
'DATE_HOUR_IN_ED':s.index,
'HOUR_IN_ED': s.index.strftime('%H:%M')})
print (df1)
ID DATE_HOUR_IN_ED HOUR_IN_ED
0 1 2021-11-07 08:00:00 08:00
1 1 2021-11-07 09:00:00 09:00
2 1 2021-11-07 10:00:00 10:00
3 1 2021-11-07 11:00:00 11:00
4 1 2021-11-07 12:00:00 12:00
5 1 2021-11-07 13:00:00 13:00
6 1 2021-11-07 14:00:00 14:00
7 1 2021-11-07 15:00:00 15:00
8 1 2021-11-07 16:00:00 16:00
9 1 2021-11-07 17:00:00 17:00
10 2 2021-11-07 13:00:00 13:00
11 2 2021-11-07 14:00:00 14:00
12 2 2021-11-07 15:00:00 15:00
13 2 2021-11-07 16:00:00 16:00
14 2 2021-11-07 17:00:00 17:00
15 2 2021-11-07 18:00:00 18:00
16 2 2021-11-07 19:00:00 19:00
17 2 2021-11-07 20:00:00 20:00
18 2 2021-11-07 21:00:00 21:00
19 2 2021-11-07 22:00:00 22:00
20 2 2021-11-07 23:00:00 23:00
21 2 2021-11-08 00:00:00 00:00
22 2 2021-11-08 01:00:00 01:00
23 2 2021-11-08 02:00:00 02:00
24 3 2021-11-07 15:00:00 15:00
25 3 2021-11-07 16:00:00 16:00
26 3 2021-11-07 17:00:00 17:00
27 3 2021-11-07 18:00:00 18:00
28 3 2021-11-07 19:00:00 19:00
29 3 2021-11-07 20:00:00 20:00
30 3 2021-11-07 21:00:00 21:00
Alternative solution:
df['ARRIVAL_DATE_TIME'] = pd.to_datetime(df['ARRIVAL_DATE_TIME']).dt.floor('H')
L = [pd.date_range(s,e, freq='H')
for s, e in df[['ARRIVAL_DATE_TIME','DISPOSITION_DATE']].to_numpy()]
df['DATE_HOUR_IN_ED'] = L
df = (df.drop(['ARRIVAL_DATE_TIME','DISPOSITION_DATE'], axis=1)
.explode('DATE_HOUR_IN_ED')
.reset_index(drop=True)
.assign(HOUR_IN_ED = lambda x: x['DATE_HOUR_IN_ED'].dt.strftime('%H:%M')))
Try this:
import pandas as pd
import numpy as np
df = pd.read_excel('test.xls')
df1 = (df.set_index(['ID'])
.assign(DATE_HOUR_IN_ED=lambda x: [pd.date_range(s,d, freq='H')
for s,d in zip(x.ARRIVAL_DATE_TIME, x.DISPOSITION_DATE)])
['DATE_HOUR_IN_ED'].explode()
.reset_index()
)
df1['DATE_HOUR_IN_ED'] = df1['DATE_HOUR_IN_ED'].dt.floor('H')
df1['HOUR_IN_ED'] = df1['DATE_HOUR_IN_ED'].dt.strftime('%H:%M')
print(df1)
Output:
ID DATE_HOUR_IN_ED HOUR_IN_ED
0 1 2021-11-07 08:00:00 08:00
1 1 2021-11-07 09:00:00 09:00
2 1 2021-11-07 10:00:00 10:00
3 1 2021-11-07 11:00:00 11:00
4 1 2021-11-07 12:00:00 12:00
5 1 2021-11-07 13:00:00 13:00
6 1 2021-11-07 14:00:00 14:00
7 1 2021-11-07 15:00:00 15:00
8 1 2021-11-07 16:00:00 16:00
9 1 2021-11-07 17:00:00 17:00
10 2 2021-11-07 13:00:00 13:00
11 2 2021-11-07 14:00:00 14:00
12 2 2021-11-07 15:00:00 15:00
13 2 2021-11-07 16:00:00 16:00
14 2 2021-11-07 17:00:00 17:00
15 2 2021-11-07 18:00:00 18:00
16 2 2021-11-07 19:00:00 19:00
17 2 2021-11-07 20:00:00 20:00
18 2 2021-11-07 21:00:00 21:00
19 2 2021-11-07 22:00:00 22:00
20 2 2021-11-07 23:00:00 23:00
21 2 2021-11-08 00:00:00 00:00
22 2 2021-11-08 01:00:00 01:00
23 2 2021-11-08 02:00:00 02:00
24 3 2021-11-07 15:00:00 15:00
25 3 2021-11-07 16:00:00 16:00
26 3 2021-11-07 17:00:00 17:00
27 3 2021-11-07 18:00:00 18:00
28 3 2021-11-07 19:00:00 19:00
29 3 2021-11-07 20:00:00 20:00
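With df1 in this long format, the per-hour patient count the question ultimately asks for is just a matter of counting unique IDs per hour (a short follow-up sketch; the column name PATIENT_COUNT is an assumption):

census = (df1.groupby('DATE_HOUR_IN_ED')['ID']
             .nunique()
             .rename('PATIENT_COUNT')
             .reset_index())
print(census)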
I have a DataFrame where 'B' is a category and 'Boy' is an event. Boys {1, 2, 3, 4} are all allotted B = 1; Boy 1 uses B for 10 minutes, starting at 12:00 and ending at 12:10, and the next boy should start from the previous boy's end time. In the same way there are four samples for B = 1 and a different four samples for B = 2.
Input Sample :
B Boy Start End Out
1 1 12:00 12:10 0:10
1 2 12:01 12:11 0:10
1 3 12:02 12:12 0:10
1 4 12:03 12:13 0:10
2 5 12:00 12:10 0:05
2 6 12:01 12:11 0:05
2 7 12:02 12:12 0:05
2 8 12:03 12:13 0:05
3 9 12:00 12:10 0:03
3 10 12:01 12:11 0:03
3 11 12:02 12:12 0:03
3 12 12:03 12:13 0:03
Code Tried :
data_1['End'] = pd.to_datetime(data_1['Start']) + pd.to_timedelta(data_1['Out'])
for i in range(1, len(data_1)):
    data_1.loc[i, 'Start'] = data_1.loc[i-1, 'End']
Output :
B Boy Start End Out
1 1 12:00 12:10 0:10
1 2 12:10 12:20 0:10
1 3 12:20 12:30 0:10
1 4 12:30 12:40 0:10
2 5 12:40 12:45 0:05
2 6 12:45 12:50 0:05
2 7 12:50 12:55 0:05
2 8 12:55 13:00 0:05
3 9 13:00 13:03 0:03
3 10 13:03 13:06 0:03
3 11 13:06 13:09 0:03
3 12 13:09 13:12 0:03
Code Failed :
new_Start_time = []
for i, item in data_1.groupby('B'):
    temp_list = [item.iloc[0, 2]]
    list_all = [item.iloc[0, 3]]
    for j in range(len(list_all)):
        temp_list[j+1] = [list_all[j] for i in range(len(list_all) - 1)]
        temp_list.append(temp_list[j])
    new_Start_time.extend(temp_list)
data_1['new_Start_time'] = new_Start_time
Error : IndexError: list assignment index out of range
Expected Result :
B Boy Start End Out
1 1 12:00 12:10 0:10
1 2 12:10 12:20 0:10
1 3 12:20 12:30 0:10
1 4 12:30 12:40 0:10
2 5 12:00 12:05 0:05
2 6 12:05 12:10 0:05
2 7 12:10 12:15 0:05
2 8 12:15 12:20 0:05
3 9 12:00 12:03 0:03
3 10 12:03 12:06 0:03
3 11 12:06 12:09 0:03
3 12 12:09 12:12 0:03
Thanks In Advance
I found a solution. It is not the best if your table is really big but it works.
First I converted the columns to datetime and timedelta:
df["Start"] = pd.to_datetime(df["Start"], format='%H:%M')
df["End"] = pd.to_datetime(df["End"], format='%H:%M')
df["Out"] = pd.to_timedelta("0"+df["Out"]+":00")
Then the code to create the new start and end columns:
new_start = []
new_end = []
for i, group in df.groupby("B"):
    temp_start = []
    temp_end = []
    out = group.iloc[0, 4]
    for j in range(0, group.shape[0]):
        if j == 0:
            temp_start.append(group.iloc[0, 2])
            temp_end.append(group.iloc[0, 2] + out)
        else:
            temp_start.append(temp_end[j-1])
            temp_end.append(temp_start[j] + out)
    new_start.extend(temp_start)
    new_end.extend(temp_end)
Now update the old start and end columns with the new values:
df["Start"]= new_start
df["End"] = new_end
df
Output:
B Boy Start End Out
0 1 1 1900-01-01 12:00:00 1900-01-01 12:10:00 00:10:00
1 1 2 1900-01-01 12:10:00 1900-01-01 12:20:00 00:10:00
2 1 3 1900-01-01 12:20:00 1900-01-01 12:30:00 00:10:00
3 1 4 1900-01-01 12:30:00 1900-01-01 12:40:00 00:10:00
4 2 5 1900-01-01 12:00:00 1900-01-01 12:05:00 00:05:00
5 2 6 1900-01-01 12:05:00 1900-01-01 12:10:00 00:05:00
6 2 7 1900-01-01 12:10:00 1900-01-01 12:15:00 00:05:00
7 2 8 1900-01-01 12:15:00 1900-01-01 12:20:00 00:05:00
8 3 9 1900-01-01 12:00:00 1900-01-01 12:03:00 00:03:00
9 3 10 1900-01-01 12:03:00 1900-01-01 12:06:00 00:03:00
10 3 11 1900-01-01 12:06:00 1900-01-01 12:09:00 00:03:00
11 3 12 1900-01-01 12:09:00 1900-01-01 12:12:00 00:03:00
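If the 1900-01-01 part of the timestamps is unwanted in the final table, the columns can be formatted back to plain HH:MM strings afterwards (a small follow-up sketch):

df['Start'] = df['Start'].dt.strftime('%H:%M')
df['End'] = df['End'].dt.strftime('%H:%M')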
You can use:
def toTimeDelta(s):
    h = pd.to_timedelta(s.str.split(':').str[0].astype(int), unit='h')
    m = pd.to_timedelta(s.str.split(':').str[1].astype(int), unit='m')
    return h + m

def fx(s):
    s = s.transform(toTimeDelta)
    out = s['Out'].copy()
    out.iloc[0] += s['Start'].iloc[0]
    s['End'] = out.cumsum()
    s['Start'].iloc[1:] = s['End'].shift().iloc[1:]
    return s

df[['Start', 'End', 'Out']] = df.groupby('B')[['Start', 'End', 'Out']].apply(fx)
Result:
# print(df)
B Boy Start End Out
0 1 1 12:00:00 12:10:00 00:10:00
1 1 2 12:10:00 12:20:00 00:10:00
2 1 3 12:20:00 12:30:00 00:10:00
3 1 4 12:30:00 12:40:00 00:10:00
4 2 5 12:00:00 12:05:00 00:05:00
5 2 6 12:05:00 12:10:00 00:05:00
6 2 7 12:10:00 12:15:00 00:05:00
7 2 8 12:15:00 12:20:00 00:05:00
8 3 9 12:00:00 12:03:00 00:03:00
9 3 10 12:03:00 12:06:00 00:03:00
10 3 11 12:06:00 12:09:00 00:03:00
11 3 12 12:09:00 12:12:00 00:03:00
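For larger tables, a more vectorized take on the same idea (not from either answer above, just a sketch that assumes Start and Out still hold the original strings) computes End as the running sum of Out within each B group and derives Start from it:

out = pd.to_timedelta('0' + df['Out'] + ':00')                                       # '0:10' -> 00:10:00
base = pd.to_datetime(df.groupby('B')['Start'].transform('first'), format='%H:%M')   # first start per group
end = base + out.groupby(df['B']).cumsum()                                           # running end time per group
df['End'] = end.dt.strftime('%H:%M')
df['Start'] = (end - out).dt.strftime('%H:%M')                                       # each start is the previous end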
I have the following data. This represents the number of occurrences in January:
date value WeekDay WeekNo Year Month
2018-01-01 214.0 Monday 1 2018 1
2018-01-02 232.0 Tuesday 1 2018 1
2018-01-03 147.0 Wed 1 2018 1
2018-01-04 257.0 Thursd 1 2018 1
2018-01-05 164.0 Friday 1 2018 1
2018-01-06 187.0 Saturd 1 2018 1
2018-01-07 201.0 Sunday 1 2018 1
2018-01-08 141.0 Monday 2 2018 1
2018-01-09 152.0 Tuesday 2 2018 1
2018-01-10 167.0 Wednesd 2 2018 1
2018-01-15 113.0 Monday 3 2018 1
2018-01-16 139.0 Tuesday 3 2018 1
2018-01-17 159.0 Wednesd 3 2018 1
2018-01-18 202.0 Thursd 3 2018 1
2018-01-19 207.0 Friday 3 2018 1
... ... ... ... ...
WeekNo is the number of the week in a year.
My goal is to have a line plot showing the evolution of occurrences, for this particular month, per week number. Therefore, I'd like to have the weekday in the x-axis, the occurrences on the y-axis and different lines, each with a different color, for each week (and a legend with the color that corresponds to each week).
Does anyone have any idea how this could be done? Thanks a lot!
You can first reshape your dataframe into a format where the columns are the week numbers and there is one row per weekday, then use the pandas plot method:
reshaped = (df
            .assign(date=lambda f: pd.to_datetime(f.date))
            .assign(dayofweek=lambda f: f.date.dt.dayofweek,
                    dayname=lambda f: f.date.dt.day_name())  # day_name() replaces the removed weekday_name attribute
            .set_index(['dayofweek', 'dayname', 'WeekNo'])
            .value
            .unstack()
            .reset_index(0, drop=True))
print(reshaped)
reshaped.plot(marker='x')
WeekNo 1 2 3
dayname
Monday 214.0 141.0 113.0
Tuesday 232.0 152.0 139.0
Wednesday 147.0 167.0 159.0
Thursday 257.0 NaN 202.0
Friday 164.0 NaN 207.0
Saturday 187.0 NaN NaN
Sunday 201.0 NaN NaN
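To get the axis labels and the week legend the question asks for, the plot call can be extended with a little matplotlib on top (a sketch; the label texts are assumptions):

import matplotlib.pyplot as plt

ax = reshaped.plot(marker='x')
ax.set_xlabel('Day of week')
ax.set_ylabel('Occurrences')
ax.legend(title='Week number')
plt.show()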
Date_NZST Time_NZST Radiation_Amount_MJ/m2
5/08/2011 0:00 0
5/08/2011 1:00 0
5/08/2011 2:00 0
5/08/2011 3:00 0
5/08/2011 4:00 0
5/08/2011 5:00 0
5/08/2011 6:00 0
5/08/2011 7:00 0
5/08/2011 8:00 0
5/08/2011 9:00 0.37
5/08/2011 10:00 0.41
5/08/2011 11:00 1.34
5/08/2011 12:00 0.87
5/08/2011 14:00 1.69
5/08/2011 15:00 1.53
5/08/2011 16:00 1.14
5/08/2011 17:00 0.63
5/08/2011 18:00 0.08
5/08/2011 19:00 0
5/08/2011 20:00 0
5/08/2011 21:00 0
5/08/2011 22:00 0
5/08/2011 23:00 0
I have an Excel spreadsheet that contains hourly measurements of solar irradiance everyday for a year. It has 3 columns, Date_NZST, Time_NZST and Radiation_Amount_MJ/m2.
I'm trying to find a way to automatically find all missing hours, generate a row for each missing hour, and fill it with a - symbol in the Radiation_Amount_MJ/m2 column. For example, hour 13:00 is missing, so I'd like to make a row between the 12:00 and 14:00 rows with the correct date and fill the Radiation_Amount_MJ/m2 column with a -. All dates are present, just some hours are missing.
5/08/2011 11:00 1.34
5/08/2011 12:00 0.87
5/08/2011 13:00 -
5/08/2011 14:00 1.69
5/08/2011 15:00 1.53
I've been doing this in Excel, but it is a very tedious process and there could be hundreds of missing points. I've resorted to the pandas library in Python: I saw this thread (Fill in missing hours in a pandas dataframe) and tried to alter the answer code to fit my data, but I got really confused by the line
df.groupby('area').\
    apply(lambda x: x.set_index('Datetime').resample('H').mean().fillna(0)).\
    reset_index()
and how to repurpose it to my data. Anyone have any ideas?
First we create a datetime index that combines the date and time with pd.to_datetime.
Then we use resample to get hourly data, and fillna to fill the missing values with a -:
df.set_index(pd.to_datetime(df['Date_NZST'] + ' ' + df['Time_NZST']), inplace=True)
df = df.drop(columns=['Date_NZST', 'Time_NZST'])
df = df.resample('H').first().fillna('-')
Output
Radiation_Amount_MJ/m2
2011-05-08 00:00:00 0
2011-05-08 01:00:00 0
2011-05-08 02:00:00 0
2011-05-08 03:00:00 0
2011-05-08 04:00:00 0
2011-05-08 05:00:00 0
2011-05-08 06:00:00 0
2011-05-08 07:00:00 0
2011-05-08 08:00:00 0
2011-05-08 09:00:00 0.37
2011-05-08 10:00:00 0.41
2011-05-08 11:00:00 1.34
2011-05-08 12:00:00 0.87
2011-05-08 13:00:00 -
2011-05-08 14:00:00 1.69
2011-05-08 15:00:00 1.53
2011-05-08 16:00:00 1.14
2011-05-08 17:00:00 0.63
2011-05-08 18:00:00 0.08
2011-05-08 19:00:00 0
2011-05-08 20:00:00 0
2011-05-08 21:00:00 0
2011-05-08 22:00:00 0
2011-05-08 23:00:00 0
If you want the datetime out of your index, use df.reset_index().
Note: by filling a numeric column with -, it gets converted to object dtype.
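If you need Radiation_Amount_MJ/m2 to stay numeric for later calculations, one option is to keep the gaps as NaN and only substitute the - when writing the file. A sketch, assuming df_indexed is the frame right after the set_index/drop step above (before the fillna), and with a made-up output filename:

hourly = df_indexed.resample('H').first()              # missing hours become NaN, dtype stays numeric
hourly.to_csv('radiation_hourly.csv', na_rep='-')      # '-' appears only in the written CSV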