Transpose hourly data in rows to new columns Pandas - python

I have a table recorded hourly weather data like this:
Year Month Day Time Temp
Date/Time
2005-01-01 00:00:00 2005 1 1 0:00 6.0
2005-01-01 01:00:00 2005 1 1 1:00 6.1
2005-01-01 02:00:00 2005 1 1 2:00 6.7
2005-01-01 03:00:00 2005 1 1 3:00 6.8
2005-01-01 04:00:00 2005 1 1 4:00 6.3
2005-01-01 05:00:00 2005 1 1 5:00 6.6
2005-01-01 06:00:00 2005 1 1 6:00 6.9
2005-01-01 07:00:00 2005 1 1 7:00 7.1
2005-01-01 08:00:00 2005 1 1 8:00 6.9
2005-01-01 09:00:00 2005 1 1 9:00 6.7
2005-01-01 10:00:00 2005 1 1 10:00 7.1
2005-01-01 11:00:00 2005 1 1 11:00 7.1
2005-01-01 12:00:00 2005 1 1 12:00 7.2
2005-01-01 13:00:00 2005 1 1 13:00 7.7
2005-01-01 14:00:00 2005 1 1 14:00 8.8
2005-01-01 15:00:00 2005 1 1 15:00 8.6
2005-01-01 16:00:00 2005 1 1 16:00 7.4
2005-01-01 17:00:00 2005 1 1 17:00 6.8
2005-01-01 18:00:00 2005 1 1 18:00 6.3
2005-01-01 19:00:00 2005 1 1 19:00 5.9
2005-01-01 20:00:00 2005 1 1 20:00 5.6
2005-01-01 21:00:00 2005 1 1 21:00 3.6
2005-01-01 22:00:00 2005 1 1 22:00 2.6
2005-01-01 23:00:00 2005 1 1 23:00 1.7
I want to save the dataframe in a wide format, with one row per day and one column per hour. How can I transpose the dataframe and create new columns for each record?

1) Copy the data below:
Date/Time Year Month Day Time Temp
0 1/1/2005 0:00 2005 1 1 0:00 6.0
1 1/1/2005 1:00 2005 1 1 1:00 6.1
2 1/1/2005 2:00 2005 1 1 2:00 6.7
3 1/1/2005 3:00 2005 1 1 3:00 6.8
4 1/1/2005 4:00 2005 1 1 4:00 6.3
5 1/1/2005 5:00 2005 1 1 5:00 6.6
6 1/1/2005 6:00 2005 1 1 6:00 6.9
7 1/1/2005 7:00 2005 1 1 7:00 7.1
8 1/1/2005 8:00 2005 1 1 8:00 6.9
9 1/1/2005 9:00 2005 1 1 9:00 6.7
10 1/1/2005 10:00 2005 1 1 10:00 7.1
11 1/1/2005 11:00 2005 1 1 11:00 7.1
12 1/1/2005 12:00 2005 1 1 12:00 7.2
2) Use pd.read_clipboard with a separator of two or more whitespace characters, because the Date/Time column itself contains a space:
import pandas as pd
df = pd.read_clipboard(r'\s\s+')
df
3) Format the date/time columns, create a pivot table, and reset/rename the axis:
df['Date/Time'] = pd.to_datetime(df['Date/Time'], format='%m/%d/%Y %H:%M').dt.strftime('%m/%d/%Y')
df['Time']=pd.to_datetime(df['Time']).dt.time
df=pd.pivot_table(df, index='Date/Time', columns='Time', values='Temp').reset_index().rename_axis(index=None, columns=None)
df['Date/Time']=df['Date/Time'].apply(lambda x:(x + ' 0:00'))
df
Output:
Date/Time 00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00 06:00:00 07:00:00 08:00:00 09:00:00 10:00:00 11:00:00 12:00:00
01/01/2005 6.0 6.1 6.7 6.8 6.3 6.6 6.9 7.1 6.9 6.7 7.1 7.1 7.2
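Since the stated goal was to save the dataframe in this shape, writing it out takes one more line (a usage note; the file name is hypothetical):
df.to_csv('daily_temps.csv', index=False)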

This does the trick similarly to the previous answer, but pivots on a new date column so that only one row per day is created:
import pandas as pd
from numpy.random import rand  # only used here to generate dummy temperatures

datetimes = pd.date_range("2005-01-01 00:00:00", "2005-01-02 23:00:00", freq="1h")
df = pd.DataFrame({"Date/Time": datetimes, "temp": rand(len(datetimes))})
df["Date"] = df["Date/Time"].dt.date
df["Hour"] = df["Date/Time"].dt.hour
reshaped = df.pivot(index='Date', columns='Hour', values='temp')
reshaped.columns = ['HR' + str(hour) for hour in reshaped.columns]
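The same idea applies directly to the question's data. A minimal sketch, assuming the original frame is called weather, keeps the DatetimeIndex shown in the question, and has a Temp column (the names are assumptions):
weather = weather.copy()
weather["Date"] = weather.index.date   # calendar date of each reading
weather["Hour"] = weather.index.hour   # hour of day, 0-23
wide = weather.pivot(index="Date", columns="Hour", values="Temp")
wide.columns = ["HR" + str(hour) for hour in wide.columns]
print(wide.head())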

Related

Calculation of the daily average

How can I transform this data so that the Pm 2.5 and Pm 10 columns become the average of the whole day? The data I collected (example below) is recorded every 15 minutes.
Pm 2.5 Pm 10 Created At
0 6.00 19.20 2021-06-21 19:00
1 4.70 17.00 2021-06-21 19:15
2 4.80 16.70 2021-06-21 19:30
3 5.10 12.10 2021-06-21 19:45
4 7.90 19.10 2021-06-21 20:00
Let's resample the dataframe:
df['Created At'] = pd.to_datetime(df['Created At'])
df.resample('D', on='Created At').mean()
Pm 2.5 Pm 10
Created At
2021-06-21 5.7 16.82
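If you would rather keep Created At as a regular column instead of the index, reset it afterwards (a small usage note on the snippet above; the daily name is only for illustration):
daily = df.resample('D', on='Created At').mean().reset_index()
print(daily)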
You can use pd.Grouper and then transform if you want to preserve the dataframe shape:
df['Created At'] = pd.to_datetime(df['Created At'])
df[['Pm 2.5', 'Pm 10']] = df.groupby(pd.Grouper(key='Created At', freq='D'))\
[['Pm 2.5', 'Pm 10']].transform('mean')
Output:
Pm 2.5 Pm 10 Created At
0 5.7 16.82 2021-06-21 19:00:00
1 5.7 16.82 2021-06-21 19:15:00
2 5.7 16.82 2021-06-21 19:30:00
3 5.7 16.82 2021-06-21 19:45:00
4 5.7 16.82 2021-06-21 20:00:00
Here is one way to do it: convert the date using to_datetime, grab the date, and compute the mean.
df.groupby(pd.to_datetime(df['Created At']).dt.date).mean()
Created At Pm 2.5 Pm 10
0 2021-06-21 5.7 16.82

Create date/hour variable for each hour between two datetime variables

I have a data table that looks like this:
ID ARRIVAL_DATE_TIME DISPOSITION_DATE
1 2021-11-07 08:35:00 2021-11-07 17:58:00
2 2021-11-07 13:16:00 2021-11-08 02:52:00
3 2021-11-07 15:12:00 2021-11-07 21:08:00
I want to be able to count the number of patients in our location by date/hour and hour. I imagine I would eventually have to transform this data into a format seen below and then create a pivot table, but I'm not sure how to first transform this data. So for example, ID 1 would have a row for each date/hour and hour between '2021-11-07 08:35:00' and '2021-11-07 17:58:00'.
ID DATE_HOUR_IN_ED HOUR_IN_ED
1 2021-11-07 08:00:00 8:00
1 2021-11-07 09:00:00 9:00
1 2021-11-07 10:00:00 10:00
1 2021-11-07 11:00:00 11:00
...
2 2021-11-07 13:00:00 13:00
2 2021-11-07 14:00:00 14:00
2 2021-11-07 15:00:00 15:00
....
Use to_datetime with Series.dt.floor to remove the minutes, then concat Series built from a date_range per row, and finally create the DataFrame with the constructor:
df['ARRIVAL_DATE_TIME'] = pd.to_datetime(df['ARRIVAL_DATE_TIME']).dt.floor('H')
s = pd.concat([pd.Series(r.ID, pd.date_range(r.ARRIVAL_DATE_TIME,
                                             r.DISPOSITION_DATE, freq='H'))
               for r in df.itertuples()])
df1 = pd.DataFrame({'ID': s.to_numpy(),
                    'DATE_HOUR_IN_ED': s.index,
                    'HOUR_IN_ED': s.index.strftime('%H:%M')})
print(df1)
ID DATE_HOUR_IN_ED HOUR_IN_ED
0 1 2021-11-07 08:00:00 08:00
1 1 2021-11-07 09:00:00 09:00
2 1 2021-11-07 10:00:00 10:00
3 1 2021-11-07 11:00:00 11:00
4 1 2021-11-07 12:00:00 12:00
5 1 2021-11-07 13:00:00 13:00
6 1 2021-11-07 14:00:00 14:00
7 1 2021-11-07 15:00:00 15:00
8 1 2021-11-07 16:00:00 16:00
9 1 2021-11-07 17:00:00 17:00
10 2 2021-11-07 13:00:00 13:00
11 2 2021-11-07 14:00:00 14:00
12 2 2021-11-07 15:00:00 15:00
13 2 2021-11-07 16:00:00 16:00
14 2 2021-11-07 17:00:00 17:00
15 2 2021-11-07 18:00:00 18:00
16 2 2021-11-07 19:00:00 19:00
17 2 2021-11-07 20:00:00 20:00
18 2 2021-11-07 21:00:00 21:00
19 2 2021-11-07 22:00:00 22:00
20 2 2021-11-07 23:00:00 23:00
21 2 2021-11-08 00:00:00 00:00
22 2 2021-11-08 01:00:00 01:00
23 2 2021-11-08 02:00:00 02:00
24 3 2021-11-07 15:00:00 15:00
25 3 2021-11-07 16:00:00 16:00
26 3 2021-11-07 17:00:00 17:00
27 3 2021-11-07 18:00:00 18:00
28 3 2021-11-07 19:00:00 19:00
29 3 2021-11-07 20:00:00 20:00
30 3 2021-11-07 21:00:00 21:00
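The question ultimately wants the number of patients present per date/hour; once df1 exists that is a single group-by away (a sketch, assuming each ID occupies at most one row per hour, as in the output above):
patients_per_hour = df1.groupby('DATE_HOUR_IN_ED')['ID'].nunique()
print(patients_per_hour)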
Alternative solution:
df['ARRIVAL_DATE_TIME'] = pd.to_datetime(df['ARRIVAL_DATE_TIME']).dt.floor('H')
L = [pd.date_range(s, e, freq='H')
     for s, e in df[['ARRIVAL_DATE_TIME', 'DISPOSITION_DATE']].to_numpy()]
df['DATE_HOUR_IN_ED'] = L
df = (df.drop(['ARRIVAL_DATE_TIME', 'DISPOSITION_DATE'], axis=1)
        .explode('DATE_HOUR_IN_ED')
        .reset_index(drop=True)
        .assign(HOUR_IN_ED=lambda x: x['DATE_HOUR_IN_ED'].dt.strftime('%H:%M')))
Try this:
import pandas as pd
import numpy as np
df = pd.read_excel('test.xls')
df1 = (df.set_index(['ID'])
         .assign(DATE_HOUR_IN_ED=lambda x: [pd.date_range(s, d, freq='H')
                                            for s, d in zip(x.ARRIVAL_DATE_TIME, x.DISPOSITION_DATE)])
         ['DATE_HOUR_IN_ED'].explode()
         .reset_index()
       )
df1['DATE_HOUR_IN_ED'] = df1['DATE_HOUR_IN_ED'].dt.floor('H')
df1['HOUR_IN_ED'] = df1['DATE_HOUR_IN_ED'].dt.strftime('%H:%M')
print(df1)
Output:
ID DATE_HOUR_IN_ED HOUR_IN_ED
0 1 2021-11-07 08:00:00 08:00
1 1 2021-11-07 09:00:00 09:00
2 1 2021-11-07 10:00:00 10:00
3 1 2021-11-07 11:00:00 11:00
4 1 2021-11-07 12:00:00 12:00
5 1 2021-11-07 13:00:00 13:00
6 1 2021-11-07 14:00:00 14:00
7 1 2021-11-07 15:00:00 15:00
8 1 2021-11-07 16:00:00 16:00
9 1 2021-11-07 17:00:00 17:00
10 2 2021-11-07 13:00:00 13:00
11 2 2021-11-07 14:00:00 14:00
12 2 2021-11-07 15:00:00 15:00
13 2 2021-11-07 16:00:00 16:00
14 2 2021-11-07 17:00:00 17:00
15 2 2021-11-07 18:00:00 18:00
16 2 2021-11-07 19:00:00 19:00
17 2 2021-11-07 20:00:00 20:00
18 2 2021-11-07 21:00:00 21:00
19 2 2021-11-07 22:00:00 22:00
20 2 2021-11-07 23:00:00 23:00
21 2 2021-11-08 00:00:00 00:00
22 2 2021-11-08 01:00:00 01:00
23 2 2021-11-08 02:00:00 02:00
24 3 2021-11-07 15:00:00 15:00
25 3 2021-11-07 16:00:00 16:00
26 3 2021-11-07 17:00:00 17:00
27 3 2021-11-07 18:00:00 18:00
28 3 2021-11-07 19:00:00 19:00
29 3 2021-11-07 20:00:00 20:00

How to Update Values Conditionally in For Loop for Multiple Categories

I have a DataFrame where 'B' is a category and 'Boy' is an event. For Boy {1, 2, 3, 4}, B = 1 is allotted; Boy = 1 uses B for 10 minutes, starting at 12:00 and ending at 12:10, and the next boy should start from the previous End time. Likewise, B = 1 has four samples and B = 2 has a different four samples.
Input Sample :
B Boy Start End Out
1 1 12:00 12:10 0:10
1 2 12:01 12:11 0:10
1 3 12:02 12:12 0:10
1 4 12:03 12:13 0:10
2 5 12:00 12:10 0:05
2 6 12:01 12:11 0:05
2 7 12:02 12:12 0:05
2 8 12:03 12:13 0:05
3 9 12:00 12:10 0:03
3 10 12:01 12:11 0:03
3 11 12:02 12:12 0:03
3 12 12:03 12:13 0:03
Code Tried :
data_1['End'] = pd.to_datetime(data_1['Start']) + pd.to_timedelta(data_1['Out'])
for i in range(1, len(data_1)):
    data_1.loc[i, 'Start'] = data_1.loc[i-1, 'End']
Output :
B Boy Start End Out
1 1 12:00 12:10 0:10
1 2 12:10 12:20 0:10
1 3 12:20 12:30 0:10
1 4 12:30 12:40 0:10
2 5 12:40 12:45 0:05
2 6 12:45 12:50 0:05
2 7 12:50 12:55 0:05
2 8 12:55 13:00 0:05
3 9 13:00 13:03 0:03
3 10 13:03 13:06 0:03
3 11 13:06 13:09 0:03
3 12 13:09 13:12 0:03
Code Failed :
new_Start_time = []
for i, item in data_1.groupby('B'):
    temp_list = [item.iloc[0, 2]]
    list_all = [item.iloc[0, 3]]
    for j in range(len(list_all)):
        temp_list[j+1] = [list_all[j] for i in range(len(list_all) - 1)]
        temp_list.append(temp_list[j])
    new_Start_time.extend(temp_list)
data_1['new_Start_time'] = new_Start_time
Error : IndexError: list assignment index out of range
Expected Result :
B Boy Start End Out
1 1 12:00 12:10 0:10
1 2 12:10 12:20 0:10
1 3 12:20 12:30 0:10
1 4 12:30 12:40 0:10
2 5 12:00 12:05 0:05
2 6 12:05 12:10 0:05
2 7 12:10 12:15 0:05
2 8 12:15 12:20 0:05
3 9 12:00 12:03 0:03
3 10 12:03 12:06 0:03
3 11 12:06 12:09 0:03
3 12 12:09 12:12 0:03
Thanks In Advance
I found a solution. It is not the best if your table is really big but it works.
First I converted the columns to datetime and timedelta:
df["Start"] = pd.to_datetime(df["Start"], format='%H:%M')
df["End"] = pd.to_datetime(df["End"], format='%H:%M')
df["Out"] = pd.to_timedelta("0"+df["Out"]+":00")
Then the code to create the new start and end columns:
new_start = []
new_end = []
for i, group in df.groupby("B"):
    temp_start = []
    temp_end = []
    out = group.iloc[0, 4]
    for j in range(0, group.shape[0]):
        if j == 0:
            temp_start.append(group.iloc[0, 2])
            temp_end.append(group.iloc[0, 2] + out)
        else:
            temp_start.append(temp_end[j-1])
            temp_end.append(temp_start[j] + out)
    new_start.extend(temp_start)
    new_end.extend(temp_end)
Now update the old start and end columns with the new values:
df["Start"]= new_start
df["End"] = new_end
df
Output:
B Boy Start End Out
0 1 1 1900-01-01 12:00:00 1900-01-01 12:10:00 00:10:00
1 1 2 1900-01-01 12:10:00 1900-01-01 12:20:00 00:10:00
2 1 3 1900-01-01 12:20:00 1900-01-01 12:30:00 00:10:00
3 1 4 1900-01-01 12:30:00 1900-01-01 12:40:00 00:10:00
4 2 5 1900-01-01 12:00:00 1900-01-01 12:05:00 00:05:00
5 2 6 1900-01-01 12:05:00 1900-01-01 12:10:00 00:05:00
6 2 7 1900-01-01 12:10:00 1900-01-01 12:15:00 00:05:00
7 2 8 1900-01-01 12:15:00 1900-01-01 12:20:00 00:05:00
8 3 9 1900-01-01 12:00:00 1900-01-01 12:03:00 00:03:00
9 3 10 1900-01-01 12:03:00 1900-01-01 12:06:00 00:03:00
10 3 11 1900-01-01 12:06:00 1900-01-01 12:09:00 00:03:00
11 3 12 1900-01-01 12:09:00 1900-01-01 12:12:00 00:03:00
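If the 1900-01-01 date stamps are unwanted, the Start and End columns can be formatted back to HH:MM text afterwards (an optional sketch on top of the result above):
df["Start"] = df["Start"].dt.strftime("%H:%M")
df["End"] = df["End"].dt.strftime("%H:%M")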
You can use:
def toTimeDelta(s):
    h = pd.to_timedelta(s.str.split(':').str[0].astype(int), unit='h')
    m = pd.to_timedelta(s.str.split(':').str[1].astype(int), unit='m')
    return h + m

def fx(s):
    s = s.transform(toTimeDelta)
    out = s['Out'].copy()
    out.iloc[0] += s['Start'].iloc[0]
    s['End'] = out.cumsum()
    s['Start'].iloc[1:] = s['End'].shift().iloc[1:]
    return s

df[['Start', 'End', 'Out']] = df.groupby('B')[['Start', 'End', 'Out']].apply(fx)
Result:
# print(df)
B Boy Start End Out
0 1 1 12:00:00 12:10:00 00:10:00
1 1 2 12:10:00 12:20:00 00:10:00
2 1 3 12:20:00 12:30:00 00:10:00
3 1 4 12:30:00 12:40:00 00:10:00
4 2 5 12:00:00 12:05:00 00:05:00
5 2 6 12:05:00 12:10:00 00:05:00
6 2 7 12:10:00 12:15:00 00:05:00
7 2 8 12:15:00 12:20:00 00:05:00
8 3 9 12:00:00 12:03:00 00:03:00
9 3 10 12:03:00 12:06:00 00:03:00
10 3 11 12:06:00 12:09:00 00:03:00
11 3 12 12:09:00 12:12:00 00:03:00

Plot line plot per weekday and week number

I have the following data. This represents the number of occurrences in January:
date value WeekDay WeekNo Year Month
2018-01-01 214.0 Monday 1 2018 1
2018-01-02 232.0 Tuesday 1 2018 1
2018-01-03 147.0 Wed 1 2018 1
2018-01-04 257.0 Thursd 1 2018 1
2018-01-05 164.0 Friday 1 2018 1
2018-01-06 187.0 Saturd 1 2018 1
2018-01-07 201.0 Sunday 1 2018 1
2018-01-08 141.0 Monday 2 2018 1
2018-01-09 152.0 Tuesday 2 2018 1
2018-01-10 167.0 Wednesd 2 2018 1
2018-01-15 113.0 Monday 3 2018 1
2018-01-16 139.0 Tuesday 3 2018 1
2018-01-17 159.0 Wednesd 3 2018 1
2018-01-18 202.0 Thursd 3 2018 1
2018-01-19 207.0 Friday 3 2018 1
... ... ... ... ...
WeekNo is the number of the week in a year.
My goal is to have a line plot showing the evolution of occurrences, for this particular month, per week number. Therefore, I'd like the weekday on the x-axis, the occurrences on the y-axis, and a separate line in a different color for each week (plus a legend mapping each color to its week).
Does anyone have any idea how this could be done? Thanks a lot!
You can first reshape your dataframe into a format where the columns are the week numbers and there is one row per weekday. Then use the pandas plot method:
reshaped = (df
            .assign(date=lambda f: pd.to_datetime(f.date))
            .assign(dayofweek=lambda f: f.date.dt.dayofweek,
                    dayname=lambda f: f.date.dt.day_name())  # .weekday_name in older pandas
            .set_index(['dayofweek', 'dayname', 'WeekNo'])
            .value
            .unstack()
            .reset_index(0, drop=True))
print(reshaped)
reshaped.plot(marker='x')
WeekNo 1 2 3
dayname
Monday 214.0 141.0 113.0
Tuesday 232.0 152.0 139.0
Wednesday 147.0 167.0 159.0
Thursday 257.0 NaN 202.0
Friday 164.0 NaN 207.0
Saturday 187.0 NaN NaN
Sunday 201.0 NaN NaN
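To finish the figure with axis labels and a legend title that match the question, a minimal matplotlib sketch building on the reshaped frame above:
import matplotlib.pyplot as plt

ax = reshaped.plot(marker='x')
ax.set_xlabel('Weekday')
ax.set_ylabel('Occurrences')
ax.legend(title='Week number')
plt.show()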

Filling in missing hours in a Pandas dataframe

Date_NZST Time_NZST Radiation_Amount_MJ/m2
5/08/2011 0:00 0
5/08/2011 1:00 0
5/08/2011 2:00 0
5/08/2011 3:00 0
5/08/2011 4:00 0
5/08/2011 5:00 0
5/08/2011 6:00 0
5/08/2011 7:00 0
5/08/2011 8:00 0
5/08/2011 9:00 0.37
5/08/2011 10:00 0.41
5/08/2011 11:00 1.34
5/08/2011 12:00 0.87
5/08/2011 14:00 1.69
5/08/2011 15:00 1.53
5/08/2011 16:00 1.14
5/08/2011 17:00 0.63
5/08/2011 18:00 0.08
5/08/2011 19:00 0
5/08/2011 20:00 0
5/08/2011 21:00 0
5/08/2011 22:00 0
5/08/2011 23:00 0
I have an Excel spreadsheet that contains hourly measurements of solar irradiance everyday for a year. It has 3 columns, Date_NZST, Time_NZST and Radiation_Amount_MJ/m2.
I'm trying to find a way to automatically find all missing hours, generate a row for each missing hour, and fill it with a - symbol in the Radiation_Amount_MJ/m2 column. For example, hour 13:00 is missing, so I'd like to make a row between the 12:00 and 14:00 rows with the correct date and fill the Radiation_Amount_MJ/m2 column with a -. All dates are present, just some hours are missing.
5/08/2011 11:00 1.34
5/08/2011 12:00 0.87
5/08/2011 13:00 -
5/08/2011 14:00 1.69
5/08/2011 15:00 1.53
I've been doing this in Excel but this is a very tedious process and there could be hundreds of missing points. I've resorted to trying to do it using the Pandas library in Python and I saw this thread (Fill in missing hours in a pandas dataframe) and I tried to alter the answer code to fit my data but I got really confused by the line
df.groupby('area').\
    apply(lambda x: x.set_index('Datetime').resample('H').mean().fillna(0)).\
    reset_index()
and how to repurpose it to my data. Anyone have any ideas?
First we create a datetime index which combines the date and time with pd.to_datetime.
Then we use resample to get hourly data, and fillna to fill the missing values with a -:
df.set_index(pd.to_datetime(df['Date_NZST'] + ' ' + df['Time_NZST']), inplace=True)
df = df.drop(columns=['Date_NZST', 'Time_NZST'])
df = df.resample('H').first().fillna('-')
Output
Radiation_Amount_MJ/m2
2011-05-08 00:00:00 0
2011-05-08 01:00:00 0
2011-05-08 02:00:00 0
2011-05-08 03:00:00 0
2011-05-08 04:00:00 0
2011-05-08 05:00:00 0
2011-05-08 06:00:00 0
2011-05-08 07:00:00 0
2011-05-08 08:00:00 0
2011-05-08 09:00:00 0.37
2011-05-08 10:00:00 0.41
2011-05-08 11:00:00 1.34
2011-05-08 12:00:00 0.87
2011-05-08 13:00:00 -
2011-05-08 14:00:00 1.69
2011-05-08 15:00:00 1.53
2011-05-08 16:00:00 1.14
2011-05-08 17:00:00 0.63
2011-05-08 18:00:00 0.08
2011-05-08 19:00:00 0
2011-05-08 20:00:00 0
2011-05-08 21:00:00 0
2011-05-08 22:00:00 0
2011-05-08 23:00:00 0
If you want the datetime out of your index use df.reset_index()
Note, by filling in - in a numeric column, it gets converted to object type.
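An alternative sketch with the same effect is to build the complete hourly index explicitly and reindex against it (this assumes df already has the combined DatetimeIndex created above):
full_idx = pd.date_range(df.index.min(), df.index.max(), freq='H')
df = df.reindex(full_idx, fill_value='-')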
