I have a pandas dataframe that holds n time series in a single Datetime column, each one identified by a different Id and carrying a corresponding Value. I would like to pivot the table so that each Id becomes a column, and reindex to the nearest timestamp. Note that a timestamp can be missing for some series, as for Id-3 at 8:00; in that case the value should become NaN.
Datetime Id Value
5-26-17 8:00 1 2.3
5-26-17 8:30 1 4.5
5-26-17 9:00 1 7
5-26-17 9:30 1 8.1
5-26-17 10:00 1 7.9
5-26-17 10:30 1 3.4
5-26-17 11:00 1 2.1
5-26-17 11:30 1 1.8
5-26-17 12:00 1 0.4
5-26-17 8:02 2 2.6
5-26-17 8:32 2 4.8
5-26-17 9:02 2 7.3
5-26-17 9:32 2 8.4
5-26-17 10:02 2 8.2
5-26-17 10:32 2 3.7
5-26-17 11:02 2 2.4
5-26-17 11:32 2 2.1
5-26-17 12:02 2 0.7
5-26-17 8:30 3 4.5
5-26-17 9:00 3 7
5-26-17 9:30 3 8.1
5-26-17 10:00 3 7.9
5-26-17 10:30 3 3.4
5-26-17 11:00 3 2.1
5-26-17 11:30 3 1.8
5-26-17 12:00 3 0.4
Expected results:
Datetime Id-1 Id-2 Id-3
5-26-17 8:00 2.3 2.6 NaN
5-26-17 8:30 4.5 4.8 4.5
5-26-17 9:00 7 7.3 7
5-26-17 9:30 8.1 8.4 8.1
5-26-17 10:00 7.9 8.2 7.9
5-26-17 10:30 3.4 3.7 3.4
5-26-17 11:00 2.1 2.4 2.1
5-26-17 11:30 1.8 2.1 1.8
5-26-17 12:00 0.4 0.7 0.4
How would you do this?
I believe you need to convert the column to datetimes, floor them to 30 minutes with floor, and finally pivot and add_prefix:
df['Datetime'] = pd.to_datetime(df['Datetime']).dt.floor('30min')
df = df.pivot(index='Datetime', columns='Id', values='Value').add_prefix('Id-')
print (df)
Id Id-1 Id-2 Id-3
Datetime
2017-05-26 08:00:00 2.3 2.6 NaN
2017-05-26 08:30:00 4.5 4.8 4.5
2017-05-26 09:00:00 7.0 7.3 7.0
2017-05-26 09:30:00 8.1 8.4 8.1
2017-05-26 10:00:00 7.9 8.2 7.9
2017-05-26 10:30:00 3.4 3.7 3.4
2017-05-26 11:00:00 2.1 2.4 2.1
2017-05-26 11:30:00 1.8 2.1 1.8
2017-05-26 12:00:00 0.4 0.7 0.4
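One caveat on "nearest": floor always snaps back to the start of the 30-minute bin, which happens to match the sample data because every offset is only 2 minutes. If readings can land later in the bin, dt.round may be closer to what the question asks for. A minimal sketch of the difference, using made-up timestamps (not the asker's data):

```python
import pandas as pd

# Hypothetical timestamps chosen to show where floor and round disagree.
ts = pd.Series(pd.to_datetime([
    "2017-05-26 08:02",
    "2017-05-26 08:20",
    "2017-05-26 08:32",
]))

floored = ts.dt.floor("30min")  # always snaps back to the start of the bin
rounded = ts.dt.round("30min")  # snaps to the nearest 30-minute mark

print(pd.DataFrame({"original": ts, "floor": floored, "round": rounded}))
```

Here 08:20 floors to 08:00 but rounds to 08:30; the other two timestamps give the same result either way.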
Another solution is to use resample with mean:
df['Datetime'] = pd.to_datetime(df['Datetime'])
df = (df.set_index('Datetime')
.groupby('Id')
.resample('30min')['Value']
.mean().unstack(0)
.add_prefix('Id-'))
print (df)
Id Id-1 Id-2 Id-3
Datetime
2017-05-26 08:00:00 2.3 2.6 NaN
2017-05-26 08:30:00 4.5 4.8 4.5
2017-05-26 09:00:00 7.0 7.3 7.0
2017-05-26 09:30:00 8.1 8.4 8.1
2017-05-26 10:00:00 7.9 8.2 7.9
2017-05-26 10:30:00 3.4 3.7 3.4
2017-05-26 11:00:00 2.1 2.4 2.1
2017-05-26 11:30:00 1.8 2.1 1.8
2017-05-26 12:00:00 0.4 0.7 0.4
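A practical difference between the two approaches: pivot raises a ValueError if two readings of the same Id fall into the same 30-minute bin, while the resample version simply averages them. If you prefer the floor-based approach, pivot_table with an aggregation function is one way to tolerate such duplicates. A small sketch with invented values (the duplicate reading is my assumption, it is not in the question's data):

```python
import pandas as pd

# Hypothetical data: Id 1 has two readings that both fall in the 08:00 bin.
df = pd.DataFrame({
    "Datetime": pd.to_datetime([
        "2017-05-26 08:02", "2017-05-26 08:10",
        "2017-05-26 08:31", "2017-05-26 08:05",
    ]),
    "Id": [1, 1, 1, 2],
    "Value": [2.0, 4.0, 5.0, 3.0],
})
df["Datetime"] = df["Datetime"].dt.floor("30min")

try:
    df.pivot(index="Datetime", columns="Id", values="Value")
    pivot_failed = False
except ValueError:
    # pivot cannot reshape when an (index, column) pair occurs more than once
    pivot_failed = True

# pivot_table aggregates the duplicates instead (here: their mean)
wide = (df.pivot_table(index="Datetime", columns="Id",
                       values="Value", aggfunc="mean")
          .add_prefix("Id-"))
print(pivot_failed)
print(wide)
```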
Related
How can I transform this data so that the Pm 2.5 and Pm 10 columns hold the average for the whole day? The data (sample below) is recorded every 15 minutes.
Pm 2.5 Pm 10 Created At
0 6.00 19.20 2021-06-21 19:00
1 4.70 17.00 2021-06-21 19:15
2 4.80 16.70 2021-06-21 19:30
3 5.10 12.10 2021-06-21 19:45
4 7.90 19.10 2021-06-21 20:00
Let's resample the dataframe:
df['Created At'] = pd.to_datetime(df['Created At'])
df.resample('D', on='Created At').mean()
Pm 2.5 Pm 10
Created At
2021-06-21 5.7 16.82
You can use pd.Grouper and then transform if you want to preserve the dataframe shape:
df['Created At'] = pd.to_datetime(df['Created At'])
df[['Pm 2.5', 'Pm 10']] = df.groupby(pd.Grouper(key='Created At', freq='D'))\
[['Pm 2.5', 'Pm 10']].transform('mean')
Output:
Pm 2.5 Pm 10 Created At
0 5.7 16.82 2021-06-21 19:00:00
1 5.7 16.82 2021-06-21 19:15:00
2 5.7 16.82 2021-06-21 19:30:00
3 5.7 16.82 2021-06-21 19:45:00
4 5.7 16.82 2021-06-21 20:00:00
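To make the difference between the two answers concrete, here is a self-contained sketch that rebuilds the five sample rows from the question and runs both variants side by side. The variable names daily and per_row are mine:

```python
import pandas as pd

# The five sample rows from the question.
df = pd.DataFrame({
    "Pm 2.5": [6.0, 4.7, 4.8, 5.1, 7.9],
    "Pm 10": [19.2, 17.0, 16.7, 12.1, 19.1],
    "Created At": pd.to_datetime([
        "2021-06-21 19:00", "2021-06-21 19:15", "2021-06-21 19:30",
        "2021-06-21 19:45", "2021-06-21 20:00",
    ]),
})

# resample collapses the data to one row per day
daily = df.resample("D", on="Created At").mean()

# transform broadcasts the daily mean back onto every original row
per_row = (df.groupby(pd.Grouper(key="Created At", freq="D"))
             [["Pm 2.5", "Pm 10"]]
             .transform("mean"))

print(daily.shape)    # one row per day
print(per_row.shape)  # same number of rows as df
```

Use the first form when you want a daily summary table, and the second when each 15-minute row should carry its day's average.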
Here is one way to do it:
convert the date using to_datetime, take the date part, and compute the mean of the numeric columns
df.groupby(pd.to_datetime(df['Created At']).dt.date).mean(numeric_only=True)
Created At Pm 2.5 Pm 10
0 2021-06-21 5.7 16.82
I generated a timestamp index with a 3-hour frequency and assigned it to a dataframe containing forecast weather data for the next 10 days at 3-hour intervals. Because of the 3-hour frequency, each date repeats in the index for every value, but I want to group each date together with its respective data.
I have tried groupby, but it entirely changed my dataframe values. Is there a solution for this problem?
df['Hours'] = pd.date_range(start= start_time , periods=81, freq='3h')
df['Days'] = df['Hours'].dt.day_name()
df.index = df.Hours
df = df.drop(columns='Hours')
df = df.drop(columns='Days')
##df.groupby(['Days', 'Hours'])['Days'].nunique()
df
MSL TEMPERATURE DPT RH% PRECIPITATION CLOUD COVER
Hours
2021-12-20 00:00:00 1019.4 7.9 -9.7 27.4 0.00 6.8
2021-12-20 03:00:00 1019.4 7.5 -9.9 27.9 0.00 8.9
2021-12-20 06:00:00 1018.3 6.7 -10.3 28.6 0.00 6.4
2021-12-20 09:00:00 1019.2 7.9 -9.0 29.1 0.00 0.8
2021-12-20 12:00:00 1018.6 14.5 -7.7 20.8 0.00 0.9
... ... ... ... ... ... ...
2021-12-29 12:00:00 1024.2 12.1 -1.4 38.9 0.00 82.8
2021-12-29 15:00:00 1021.5 14.2 -1.8 32.9 0.01 99.8
2021-12-29 18:00:00 1021.7 10.6 -1.3 43.7 0.00 99.9
2021-12-29 21:00:00 1023.7 8.7 -2.3 45.8 0.00 91.5
2021-12-30 00:00:00 1024.4 7.9 -2.4 48.1 0.00 59.7
A pandas dataframe is a table structure, so the only place where you can "merge" cells is the index. I put "merge" in quotes because pandas only displays repeated index values as if they were merged.
I simplified your sample data to show how it will be visualized.
>>> df
time MSL TEMPERATURE
2021-12-20T00:00:00 1019.4 7.9
2021-12-20T03:00:00 1019.4 7.5
2021-12-20T06:00:00 1018.3 6.7
2021-12-20T09:00:00 1019.2 7.9
2021-12-21T00:00:00 1019.2 7.9
2021-12-21T03:00:00 1019.2 7.9
2021-12-21T06:00:00 1019.2 7.9
2021-12-21T09:00:00 1019.2 7.9
2021-12-22T00:00:00 1019.2 7.9
Create a column to summarize.
>>> df['date'] = df.time.dt.date
>>> df['hour'] = df.time.dt.time
>>> df
time MSL TEMPERATURE date hour
2021-12-20T00:00:00 1019.4 7.9 2021-12-20 00:00:00
2021-12-20T03:00:00 1019.4 7.5 2021-12-20 03:00:00
2021-12-20T06:00:00 1018.3 6.7 2021-12-20 06:00:00
2021-12-20T09:00:00 1019.2 7.9 2021-12-20 09:00:00
2021-12-21T00:00:00 1019.2 7.9 2021-12-21 00:00:00
2021-12-21T03:00:00 1019.2 7.9 2021-12-21 03:00:00
2021-12-21T06:00:00 1019.2 7.9 2021-12-21 06:00:00
2021-12-21T09:00:00 1019.2 7.9 2021-12-21 09:00:00
2021-12-22T00:00:00 1019.2 7.9 2021-12-22 00:00:00
Then, use set_index on those columns. This creates a MultiIndex, which displays the data grouped by the first index level.
Please check https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html for more details.
>>> df.set_index(['date', 'hour'])
MSL TEMPERATURE
date hour
2021-12-20 00:00:00 1019.4 7.9
03:00:00 1019.4 7.5
06:00:00 1018.3 6.7
09:00:00 1019.2 7.9
2021-12-21 00:00:00 1019.2 7.9
03:00:00 1019.2 7.9
06:00:00 1019.2 7.9
09:00:00 1019.2 7.9
2021-12-22 00:00:00 1019.2 7.9
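Once the MultiIndex is in place, the original values are untouched and you can still pull out a single day or aggregate per day. A minimal sketch on a cut-down version of the sample above (four rows only; the names indexed, one_day and daily_mean are mine):

```python
import datetime

import pandas as pd

# Cut-down version of the simplified sample data.
df = pd.DataFrame({
    "time": pd.to_datetime([
        "2021-12-20 00:00", "2021-12-20 03:00",
        "2021-12-21 00:00", "2021-12-21 03:00",
    ]),
    "MSL": [1019.4, 1019.4, 1019.2, 1019.2],
    "TEMPERATURE": [7.9, 7.5, 7.9, 7.9],
})
df["date"] = df["time"].dt.date
df["hour"] = df["time"].dt.time

indexed = df.drop(columns="time").set_index(["date", "hour"])

# Select all rows of one day via the first index level
one_day = indexed.loc[datetime.date(2021, 12, 20)]

# Aggregate per day without touching the original rows
daily_mean = indexed.groupby(level="date").mean()

print(one_day)
print(daily_mean)
```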
I have a table of hourly weather data like this:
Year Month Day Time Temp
Date/Time
2005-01-01 00:00:00 2005 1 1 0:00 6.0
2005-01-01 01:00:00 2005 1 1 1:00 6.1
2005-01-01 02:00:00 2005 1 1 2:00 6.7
2005-01-01 03:00:00 2005 1 1 3:00 6.8
2005-01-01 04:00:00 2005 1 1 4:00 6.3
2005-01-01 05:00:00 2005 1 1 5:00 6.6
2005-01-01 06:00:00 2005 1 1 6:00 6.9
2005-01-01 07:00:00 2005 1 1 7:00 7.1
2005-01-01 08:00:00 2005 1 1 8:00 6.9
2005-01-01 09:00:00 2005 1 1 9:00 6.7
2005-01-01 10:00:00 2005 1 1 10:00 7.1
2005-01-01 11:00:00 2005 1 1 11:00 7.1
2005-01-01 12:00:00 2005 1 1 12:00 7.2
2005-01-01 13:00:00 2005 1 1 13:00 7.7
2005-01-01 14:00:00 2005 1 1 14:00 8.8
2005-01-01 15:00:00 2005 1 1 15:00 8.6
2005-01-01 16:00:00 2005 1 1 16:00 7.4
2005-01-01 17:00:00 2005 1 1 17:00 6.8
2005-01-01 18:00:00 2005 1 1 18:00 6.3
2005-01-01 19:00:00 2005 1 1 19:00 5.9
2005-01-01 20:00:00 2005 1 1 20:00 5.6
2005-01-01 21:00:00 2005 1 1 21:00 3.6
2005-01-01 22:00:00 2005 1 1 22:00 2.6
2005-01-01 23:00:00 2005 1 1 23:00 1.7
I wanted to save the dataframe in this format:
How can I transpose the dataframe and create new columns for each record?
1) Copy below data
Date/Time Year Month Day Time Temp
0 1/1/2005 0:00 2005 1 1 0:00 6.0
1 1/1/2005 1:00 2005 1 1 1:00 6.1
2 1/1/2005 2:00 2005 1 1 2:00 6.7
3 1/1/2005 3:00 2005 1 1 3:00 6.8
4 1/1/2005 4:00 2005 1 1 4:00 6.3
5 1/1/2005 5:00 2005 1 1 5:00 6.6
6 1/1/2005 6:00 2005 1 1 6:00 6.9
7 1/1/2005 7:00 2005 1 1 7:00 7.1
8 1/1/2005 8:00 2005 1 1 8:00 6.9
9 1/1/2005 9:00 2005 1 1 9:00 6.7
10 1/1/2005 10:00 2005 1 1 10:00 7.1
11 1/1/2005 11:00 2005 1 1 11:00 7.1
12 1/1/2005 12:00 2005 1 1 12:00 7.2
2) Use pd.read_clipboard with a separator of two or more spaces, because the Date/Time column itself contains a space
import pandas as pd
df = pd.read_clipboard(sep=r'\s\s+')
df
3) Format the date/time columns, create a pivot table, and reset/rename the axes.
df['Date/Time'] = pd.to_datetime(df['Date/Time'], format='%m/%d/%Y %H:%M').dt.strftime('%m/%d/%Y')
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M').dt.time
df=pd.pivot_table(df, index='Date/Time', columns='Time', values='Temp').reset_index().rename_axis(index=None, columns=None)
df['Date/Time']=df['Date/Time'].apply(lambda x:(x + ' 0:00'))
df
Output:
Date/Time 00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00 06:00:00 07:00:00 08:00:00 09:00:00 10:00:00 11:00:00 12:00:00
01/01/2005 6.0 6.1 6.7 6.8 6.3 6.6 6.9 7.1 6.9 6.7 7.1 7.1 7.2
This does the trick similarly to the previous answer, but on a new date column, so that only one row per day is created:
import numpy as np
import pandas as pd
datetimes = pd.date_range("2005-01-01 00:00:00","2005-01-02 23:00:00", freq="1h")
df = pd.DataFrame({"Date/Time": datetimes, "temp": np.random.rand(len(datetimes))})
df["Date"] = df["Date/Time"].dt.date
df["Hour"] = df["Date/Time"].dt.hour
reshaped= df.pivot(index='Date', columns='Hour', values='temp')
reshaped.columns = ['HR'+str(hour) for hour in reshaped.columns]
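If an hourly record is missing, the same pivot still produces one row per day and simply leaves NaN in that hour's column. A small sketch of this, assuming deterministic dummy temperatures (0, 1, 2, ...) instead of random ones and one deliberately dropped row:

```python
import pandas as pd

datetimes = pd.date_range("2005-01-01 00:00:00", "2005-01-02 23:00:00", freq="1h")
# Dummy temperatures 0..47 so the result is easy to check by eye.
df = pd.DataFrame({"Date/Time": datetimes, "temp": range(len(datetimes))})

# Simulate a gap: drop the 03:00 record of the first day.
df = df.drop(index=3)

df["Date"] = df["Date/Time"].dt.date
df["Hour"] = df["Date/Time"].dt.hour

reshaped = df.pivot(index="Date", columns="Hour", values="temp")
reshaped.columns = [f"HR{hour}" for hour in reshaped.columns]

print(reshaped.shape)
print(reshaped[["HR2", "HR3", "HR4"]])
```

The HR3 column is still created because the second day has a 03:00 record; only the first day's cell is empty.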
I have two data frames eg.
Shorter time frame ( 4 hourly )
Time Data_4h
1/1/01 00:00 1.1
1/1/01 06:00 1.2
1/1/01 12:00 1.3
1/1/01 18:00 1.1
2/1/01 00:00 1.1
2/1/01 06:00 1.2
2/1/01 12:00 1.3
2/1/01 18:00 1.1
3/1/01 00:00 1.1
3/1/01 06:00 1.2
3/1/01 12:00 1.3
3/1/01 18:00 1.1
Longer time frame ( 1 day )
Time Data_1d
1/1/01 00:00 1.1
2/1/01 00:00 1.6
3/1/01 00:00 1.0
I want to label the shorter time frame data with the data from the longer time frame data but n-1 days, leaving NaN where the n-1 day doesn't exist.
For example,
Final merged data combining 4h and 1d
Time Data_4h Data_1d
1/1/01 00:00 1.1 NaN
1/1/01 06:00 1.2 NaN
1/1/01 12:00 1.3 NaN
1/1/01 18:00 1.1 NaN
2/1/01 00:00 1.1 1.1
2/1/01 06:00 1.2 1.1
2/1/01 12:00 1.3 1.1
2/1/01 18:00 1.1 1.1
3/1/01 00:00 1.1 1.6
3/1/01 06:00 1.2 1.6
3/1/01 12:00 1.3 1.6
3/1/01 18:00 1.1 1.6
So for 1/1, it tried to find 31/12 but couldn't find it, so those rows were labelled NaN. For 2/1, it searched for 1/1 and labelled those entries with 1.1, the value for 1/1. For 3/1, it searched for 2/1 and labelled those entries with 1.6, the value for 2/1.
It is important to note that the data in either time frame may have large gaps, so I can't access the rows in the larger time frame directly by position.
What is the best way to do this?
Currently I am iterating through all the rows of the smaller timeframe and then searching for the larger time frame date using a filter like:
large_tf_data[(large_tf_data.index <= target_timestamp)][0]
Where target_timestamp is calculated on each row in the smaller time frame data frame.
This is extremely slow! Any suggestions on how to speed it up?
First, take care of dates
dayfirstme = lambda d: pd.to_datetime(d.Time, dayfirst=True)
df = df.assign(Time=dayfirstme)
df2 = df2.assign(Time=dayfirstme)
Then convert df2 to something useful: shift its dates forward by one day and make it a Series indexed by Time
d2 = df2.assign(Time=lambda d: d.Time + pd.Timedelta(1, 'D')).set_index('Time').Data_1d
Apply magic
df.join(df.Time.dt.date.map(d2).rename(d2.name))
Time Data_4h Data_1d
0 2001-01-01 00:00:00 1.1 NaN
1 2001-01-01 06:00:00 1.2 NaN
2 2001-01-01 12:00:00 1.3 NaN
3 2001-01-01 18:00:00 1.1 NaN
4 2001-01-02 00:00:00 1.1 1.1
5 2001-01-02 06:00:00 1.2 1.1
6 2001-01-02 12:00:00 1.3 1.1
7 2001-01-02 18:00:00 1.1 1.1
8 2001-01-03 00:00:00 1.1 1.6
9 2001-01-03 06:00:00 1.2 1.6
10 2001-01-03 12:00:00 1.3 1.6
11 2001-01-03 18:00:00 1.1 1.6
I'm sure there are other ways but I didn't want to think about this anymore.
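For reference, here is one self-contained variant of the same idea: shift the daily series forward by one day, then look up each intraday row by its calendar day. It is a sketch rather than the answer's exact code. I use dt.normalize() instead of dt.date so the lookup keys stay datetime64, I pass an explicit day-first format, and the sample is cut down to five intraday rows:

```python
import pandas as pd

fmt = "%d/%m/%y %H:%M"  # day-first dates, as in the question

# Cut-down versions of the two frames from the question.
df = pd.DataFrame({
    "Time": pd.to_datetime(["1/1/01 00:00", "1/1/01 18:00", "2/1/01 00:00",
                            "2/1/01 18:00", "3/1/01 06:00"], format=fmt),
    "Data_4h": [1.1, 1.1, 1.1, 1.1, 1.2],
})
df2 = pd.DataFrame({
    "Time": pd.to_datetime(["1/1/01 00:00", "2/1/01 00:00", "3/1/01 00:00"],
                           format=fmt),
    "Data_1d": [1.1, 1.6, 1.0],
})

# Move each daily value onto the following day, indexed by that day.
shifted = (df2.assign(Time=df2["Time"] + pd.Timedelta(days=1))
              .set_index("Time")["Data_1d"])

# Look up every intraday row by its calendar day (midnight timestamp).
# Days without a previous-day entry get NaN.
df["Data_1d"] = df["Time"].dt.normalize().map(shifted)

print(df)
```

Because this is a vectorised lookup rather than a per-row filter, gaps in either frame only produce NaN and do not need special handling.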
Any ideas on how to turn off the outliers for the box plot?
Or could you send a link to the documentation for kind="box"? I've not been able to find it.
Code is:
df9=df.loc[Month+"-2015":Month+"-2015"]
df9=df9.rename(columns={'hour_mean': "2015"})
x=pd.concat([df1['2004'],df2['2005'],df3['2006'],df4['2007'],df5['2008'],df6['2012'],df7['2013'],df8['2014'],df9['2015']],axis=1)
ax=x.plot(kind="box")
Because it's plotting a series of Series, this seemed to be the only way of doing a box plot. So it plots it for 2004, then 2005, etc.
x looks like:
Date 2004 2005 2006 2007 2008
01/12/2004 00:00 9.8
01/12/2004 01:00 4.5
01/12/2004 02:00 2.7
01/12/2004 03:00 5.7
01/12/2004 04:00 10.7
01/12/2004 05:00 10.2
01/12/2004 06:00 11.3
01/12/2004 07:00 7.3
01/12/2004 08:00 7.2
01/12/2004 09:00 6.6
01/12/2004 10:00 9.7
01/12/2004 11:00 1.7
01/12/2004 12:00 3.3
01/12/2004 13:00 8.8
01/12/2004 14:00 5.4
01/12/2004 15:00 8.5
01/12/2004 16:00 1.9
01/12/2004 17:00 3.1
01/12/2004 18:00 6.1
01/12/2004 19:00 -4.1
01/12/2004 20:00 4.8
01/12/2004 21:00 2.1
01/12/2004 22:00 2.6
Have you tried
ax=x.plot(kind="box", showfliers=False)
Outliers in pyplot are called fliers. Check this doc.