I'm pulling data from Google Trends and the output values come out as follows:
date value
0 2017-01-01 03:00:00 [65]
1 2017-01-01 03:01:00 [66]
2 2017-01-01 03:02:00 [77]
3 2017-01-01 03:03:00 [64]
4 2017-01-01 03:04:00 [94]
I've trimmed what I don't need. My issue is that I need to remove the brackets and make the value column an int. I've tried the following:
result['value'].apply(lambda x: pd.Series(str(x).replace('[', '').replace(']', '')))
But I get the same output either way. Any thoughts or suggestions?
You can also do:
df['value'] = df['value'].explode()
Output:
date value
0 2017-01-01 03:00:00 65
1 2017-01-01 03:01:00 66
2 2017-01-01 03:02:00 77
3 2017-01-01 03:03:00 64
4 2017-01-01 03:04:00 94
You can do
df['value'] = df['value'].str[0]
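Both one-liners can be sketched end to end; note that the original attempt also computed the cleaned strings but never assigned them back to the column. Here `df` is a hypothetical frame mimicking the trimmed Google Trends output, with each value wrapped in a single-element list:

```python
import pandas as pd

# Hypothetical frame mimicking the trimmed Google Trends output.
df = pd.DataFrame({'value': [[65], [66], [77], [64], [94]]})

# Either unwrap the single-element lists positionally...
unwrapped = df['value'].str[0]

# ...or explode them; astype(int) then guarantees an integer dtype.
exploded = df['value'].explode().astype(int)
```

Either result can be assigned back with `df['value'] = ...`, which is the step the original attempt was missing.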
I have two dataframes: the first holds the output of a model simulation and the second the real values. I would like to compute the RMSE between all the values that share the same hour. Basically I should compute 24 RMSE values, one for each hour.
These are the first columns of my dataframes:
date;model
2017-01-01 00:00:00;53
2017-01-01 01:00:00;52
2017-01-01 02:00:00;51
2017-01-01 03:00:00;47.27
2017-01-01 04:00:00;45.49
2017-01-01 05:00:00;45.69
2017-01-01 06:00:00;48.07
2017-01-01 07:00:00;45.67
2017-01-01 08:00:00;45.48
2017-01-01 09:00:00;42.06
2017-01-01 10:00:00;46.86
2017-01-01 11:00:00;48.02
2017-01-01 12:00:00;49.57
2017-01-01 13:00:00;48.69
2017-01-01 14:00:00;46.91
2017-01-01 15:00:00;49.43
2017-01-01 16:00:00;50.45
2017-01-01 17:00:00;53.3
2017-01-01 18:00:00;59.07
2017-01-01 19:00:00;61.71
2017-01-01 20:00:00;56.26
2017-01-01 21:00:00;55
2017-01-01 22:00:00;54
2017-01-01 23:00:00;52
2017-01-02 00:00:00;53
and
date;real
2017-01-01 00:00:00;55
2017-01-01 01:00:00;55
2017-01-01 02:00:00;55
2017-01-01 03:00:00;48.27
2017-01-01 04:00:00;48.49
2017-01-01 05:00:00;48.69
2017-01-01 06:00:00;49.07
2017-01-01 07:00:00;49.67
2017-01-01 08:00:00;49.48
2017-01-01 09:00:00;50.06
2017-01-01 10:00:00;50.86
2017-01-01 11:00:00;50.02
2017-01-01 12:00:00;33.57
2017-01-01 13:00:00;33.69
2017-01-01 14:00:00;33.91
2017-01-01 15:00:00;33.43
2017-01-01 16:00:00;33.45
2017-01-01 17:00:00;33.3
2017-01-01 18:00:00;33.07
2017-01-01 19:00:00;33.71
2017-01-01 20:00:00;33.26
2017-01-01 21:00:00;33
2017-01-01 22:00:00;33
2017-01-01 23:00:00;33
2017-01-02 00:00:00;33
Since I am considering one year, I have to use 365 values for each RMSE computation.
So far I am only able to read the dataframes. One option could be to set up a loop over 1-24 and create 24 new dataframes by means of dfr[dfr.index.hour == i].
Do you have a more elegant and efficient solution?
Thanks
RMSE depends on the pairing order, so you should join the model to the real data first, then group by hour and calculate your RMSE:
import numpy as np

def rmse(group):
    if len(group) == 0:
        return np.nan
    s = (group['model'] - group['real']).pow(2).sum()
    return np.sqrt(s / len(group))

result = (
    df1.merge(df2, on='date')
       .assign(hour=lambda x: x['date'].dt.hour)
       .groupby('hour')
       .apply(rmse)
)
Result:
hour
0 14.21267
1 3.00000
2 4.00000
3 1.00000
4 3.00000
5 3.00000
6 1.00000
7 4.00000
8 4.00000
9 8.00000
10 4.00000
11 2.00000
12 16.00000
13 15.00000
14 13.00000
15 16.00000
16 17.00000
17 20.00000
18 26.00000
19 28.00000
20 23.00000
21 22.00000
22 21.00000
23 19.00000
dtype: float64
Explanation
Here is what the code does:
merge: combine the two data frames together based on the date index
assign: create a new column hour, extracted from the date index
groupby: group rows based on their hour values
apply lets you write a custom aggregator. All the rows with hour = 0 are sent to the rmse function (our custom function) first, then all the rows with hour = 1, and so on. As an illustration:
date hour model real
2017-01-01 00:00:00 0 ... ...
2017-01-02 00:00:00 0 ... ...
2017-01-03 00:00:00 0 ... ...
2017-01-04 00:00:00 0 ... ...
--------------------------------------
2017-01-01 01:00:00 1 ... ...
2017-01-02 01:00:00 1 ... ...
2017-01-03 01:00:00 1 ... ...
2017-01-04 01:00:00 1 ... ...
--------------------------------------
2017-01-01 02:00:00 2 ... ...
2017-01-02 02:00:00 2 ... ...
2017-01-03 02:00:00 2 ... ...
2017-01-04 02:00:00 2 ... ...
--------------------------------------
2017-01-01 03:00:00 3 ... ...
2017-01-02 03:00:00 3 ... ...
2017-01-03 03:00:00 3 ... ...
2017-01-04 03:00:00 3 ... ...
Each chunk is then sent to our custom function: rmse(group=<a chunk>). Within the function, we reduce that chunk down into a single number: its RMSE. That's how you get the 24 RMSE numbers back as a result.
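Putting the whole answer together on a tiny synthetic dataset (two days, two hours, numbers invented here for illustration) gives a runnable sketch:

```python
import numpy as np
import pandas as pd

# Made-up miniature versions of the model and real frames.
dates = pd.to_datetime(['2017-01-01 00:00', '2017-01-01 01:00',
                        '2017-01-02 00:00', '2017-01-02 01:00'])
df1 = pd.DataFrame({'date': dates, 'model': [53.0, 52.0, 53.0, 50.0]})
df2 = pd.DataFrame({'date': dates, 'real': [55.0, 55.0, 53.0, 54.0]})

def rmse(group):
    if len(group) == 0:
        return np.nan
    s = (group['model'] - group['real']).pow(2).sum()
    return np.sqrt(s / len(group))

# Merge on date, derive the hour, then reduce each hour-group to its RMSE.
result = (
    df1.merge(df2, on='date')
       .assign(hour=lambda x: x['date'].dt.hour)
       .groupby('hour')
       .apply(rmse)
)
print(result)
```

With one year of data, each of the 24 groups would hold 365 rows instead of the 2 shown here; the code is unchanged.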
You need to pass to by= a function that takes each index label, looks up the date, and extracts the hour.
import pandas as pd
from time import strptime
df = pd.DataFrame([
    ['2017-01-01 00:00:00', 53],
    ['2017-01-01 01:00:00', 52],
    ['2017-01-02 00:00:00', 53],
    ['2017-01-03 01:00:00', 50],
    ['2017-01-04 00:00:00', 53]
], columns=['date', 'model'])

def group_fun(ix):
    return strptime(df['date'][ix], '%Y-%m-%d %H:%M:%S').tm_hour
print(df.groupby(by=group_fun).std())
model
0 0.000000
1 1.414214
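If the date column is parsed with pd.to_datetime first, the same grouping works without time.strptime or an index-based lookup; a sketch under that assumption, reproducing the output above:

```python
import pandas as pd

df = pd.DataFrame([
    ['2017-01-01 00:00:00', 53],
    ['2017-01-01 01:00:00', 52],
    ['2017-01-02 00:00:00', 53],
    ['2017-01-03 01:00:00', 50],
    ['2017-01-04 00:00:00', 53],
], columns=['date', 'model'])

# Parse once, then group directly on the hour component.
df['date'] = pd.to_datetime(df['date'])
hourly_std = df.groupby(df['date'].dt.hour)['model'].std()
print(hourly_std)
```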
I have the following dataframe:
;h0;h1;h2;h3;h4;h5;h6;h7;h8;h9;h10;h11;h12;h13;h14;h15;h16;h17;h18;h19;h20;h21;h22;h23
2017-01-01;52.72248155184351;49.2949899678983;46.57492391198069;44.087373768731766;44.14801243124734;42.17606224526609;43.18529986793594;39.58391124876044;41.63499969987035;41.40594457169249;47.58107920806581;46.56963630932529;47.377935483897694;37.99479190229543;38.53347417483357;40.62674178535282;45.81503347748674;49.0590694393733;52.73183568074295;54.37213882189341;54.737087166843295;50.224872755157314;47.874441844531056;47.8848916244788
2017-01-02;49.08874087825248;44.998912615866075;45.92457207636786;42.38001388673675;41.66922093408655;43.02027406525752;49.82151473221541;53.23401784350719;58.33805556091773;56.197239473200206;55.7686948361035;57.03099874898539;55.445563603040405;54.929102019056195;55.85170734639889;57.98929007227575;56.65821961018764;61.01309728212006;63.63384537162659;61.730431501017684;54.40180394585544;50.27375006416599;51.229656340500156;51.22066846069472
2017-01-03;50.07885876956572;47.00180020415448;44.47243045246001;42.62192562660052;40.15465704760352;43.48422695796396;50.01631022884173;54.8674584250141;60.434849010428685;61.47694796693493;60.766557330286844;59.12019178422993;53.97447369962696;51.85242030255539;53.604945764469065;56.48188852869667;59.12301823257856;72.05688032286155;74.61342126987793;70.76845988290785;64.13311592022278;58.7237387203283;55.2422389373486;52.63648285910918
As you can notice, the days are in the rows and the hours in the columns.
I would like to create a new dataframe with only two columns:
the first with the day (including the hour) and the second with the value. Something like the following:
2017-01-01 00:00:00 ; 52.72248
2017-01-01 01:00:00 ; 49.2949899678983
...
I could create a new dataframe and use a loop to fill it. This is what I do now:
icount = 0
for idd in range(0, 365):
    for ih in range(0, 24):
        df.loc[df.index.values[icount]] = ecodf.iloc[idd, ih]
        icount = icount + 1
What do you think?
Thanks
Turn the column names into a new column, convert them to hours, and use pd.to_datetime:
s = df.stack()
pd.concat([
    pd.to_datetime(s.reset_index()
                    .replace({'level_1': r'h(\d+)'}, {'level_1': '\\1:00'}, regex=True)
                    [['level_0', 'level_1']].apply(' '.join, axis=1)),
    s.reset_index(drop=True)],
    axis=1, sort=False)
0 1
0 2017-01-01 00:00:00 52.722482
1 2017-01-01 01:00:00 49.294990
2 2017-01-01 02:00:00 46.574924
3 2017-01-01 03:00:00 44.087374
4 2017-01-01 04:00:00 44.148012
.. ... ...
67 2017-01-03 19:00:00 70.768460
68 2017-01-03 20:00:00 64.133116
69 2017-01-03 21:00:00 58.723739
70 2017-01-03 22:00:00 55.242239
71 2017-01-03 23:00:00 52.636483
[72 rows x 2 columns]
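An alternative sketch of the same reshape that builds a DatetimeIndex directly from the two levels of the stacked index (the frame is shrunk here to two days and two hours, with the question's numbers rounded):

```python
import pandas as pd

# Miniature version of the wide frame: days in the index, hours as h0, h1, ...
df = pd.DataFrame(
    [[52.72, 49.29], [49.09, 45.00]],
    index=['2017-01-01', '2017-01-02'],
    columns=['h0', 'h1'],
)

s = df.stack()  # MultiIndex (day, hour label) -> value

# Glue "day" and "hour" level strings into one parseable timestamp each.
idx = pd.to_datetime(
    s.index.get_level_values(0) + ' '
    + s.index.get_level_values(1).str.lstrip('h') + ':00'
)
long_df = pd.Series(s.to_numpy(), index=idx, name='value').reset_index()
print(long_df)
```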
I have the following pandas data frame df:
Actual Scheduled
2017-01-01 04:03:00.000 2017-01-01 04:25:00.000
2017-01-01 04:56:00.000 2017-01-01 04:55:00.000
2017-01-01 04:36:00.000 2017-01-01 05:05:00.000
2017-01-01 06:46:00.000 2017-01-01 06:55:00.000
2017-01-01 06:46:00.000 2017-01-01 07:00:00.000
I need to create an additional column DIFF_MINUTES that contains the difference (in minutes) between Actual and Scheduled (Actual - Scheduled).
This is how I tried to solve this task:
import pandas as pd
import datetime
df["Actual"] = df.apply(lambda row: datetime.datetime.strptime(str(row["Actual"]),"%Y-%m-%d %H:%M:%S.%f"), axis=1)
df["Scheduled"] = df.apply(lambda row: datetime.datetime.strptime(str(row["Scheduled"]),"%Y-%m-%d %H:%M:%S.%f"), axis=1)
df["DIFF_MINUTES"] = df.apply(lambda row: (pd.Timedelta(row["Actual"]-row["Scheduled"]).seconds)/60, axis=1)
However, I got wrong results for the negative-difference cases (e.g. 04:03:00 - 04:25:00 should give -22 minutes instead of 1418 minutes):
Actual Scheduled DIFF_MINUTES
2017-01-01 04:03:00 2017-01-01 04:25:00 1418.0
2017-01-01 04:56:00 2017-01-01 04:55:00 1.0
2017-01-01 04:36:00 2017-01-01 05:05:00 1411.0
2017-01-01 06:46:00 2017-01-01 06:55:00 1431.0
2017-01-01 06:46:00 2017-01-01 07:00:00 1426.0
How to fix it?
Expected result:
Actual Scheduled DIFF_MINUTES
2017-01-01 04:03:00 2017-01-01 04:25:00 -22.0
2017-01-01 04:56:00 2017-01-01 04:55:00 1.0
2017-01-01 04:36:00 2017-01-01 05:05:00 -29.0
2017-01-01 06:46:00 2017-01-01 06:55:00 -9.0
2017-01-01 06:46:00 2017-01-01 07:00:00 -14.0
Use dt.total_seconds() instead (and also check whether the day or the month comes first in your columns):
df['Actual'] = pd.to_datetime(df['Actual'], dayfirst=True)
df['Scheduled'] = pd.to_datetime(df['Scheduled'], dayfirst=True)
df['DIFF_MINUTES'] = (df['Actual']-df['Scheduled']).dt.total_seconds()/60
print(df)
Actual Scheduled DIFF_MINUTES
0 2017-01-01 04:03:00 2017-01-01 04:25:00 -22.0
1 2017-01-01 04:56:00 2017-01-01 04:55:00 1.0
2 2017-01-01 04:36:00 2017-01-01 05:05:00 -29.0
3 2017-01-01 06:46:00 2017-01-01 06:55:00 -9.0
4 2017-01-01 06:46:00 2017-01-01 07:00:00 -14.0
Assuming that both columns are already datetime, just run:
df['DIFF_MINUTES'] = (df.Actual - df.Scheduled).dt.total_seconds() / 60
(a one-liner).
If you read this DataFrame from e.g. an Excel or CSV file, add the
parse_dates=[0, 1] parameter to have these columns converted into dates,
so there is no need to cast them in your code.
And if for some reason you have these columns as text, convert them
by running:
df.Actual = pd.to_datetime(df.Actual)
df.Scheduled = pd.to_datetime(df.Scheduled)
(another quicker solution than "plain Python" functions).
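The root cause of the 1418 figure is worth seeing in isolation: Timedelta.seconds is a non-negative component that wraps negative deltas into the previous day, whereas total_seconds() is signed:

```python
import pandas as pd

td = pd.Timestamp('2017-01-01 04:03:00') - pd.Timestamp('2017-01-01 04:25:00')
# td is Timedelta('-1 days +23:38:00'): days == -1, seconds == 85080

print(td.seconds / 60)          # 1418.0  (the wrapped, always-positive component)
print(td.total_seconds() / 60)  # -22.0   (the signed duration you actually want)
```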
I have a time series and I want to group the rows by hour of day (regardless of date) and visualize these as boxplots. So I'd want 24 boxplots starting from hour 1, then hour 2, then hour 3 and so on.
The way I see this working is splitting the dataset up into 24 series (1 for each hour of the day), creating a boxplot for each series and then plotting this on the same axes.
The only way I can think of to do this is to manually select all the values between each hour, is there a faster way?
some sample data:
Date Actual Consumption
2018-01-01 00:00:00 47.05
2018-01-01 00:15:00 46
2018-01-01 00:30:00 44
2018-01-01 00:45:00 45
2018-01-01 01:00:00 43.5
2018-01-01 01:15:00 43.5
2018-01-01 01:30:00 43
2018-01-01 01:45:00 42.5
2018-01-01 02:00:00 43
2018-01-01 02:15:00 42.5
2018-01-01 02:30:00 41
2018-01-01 02:45:00 42.5
2018-01-01 03:00:00 42.04
2018-01-01 03:15:00 41.96
2018-01-01 03:30:00 44
2018-01-01 03:45:00 44
2018-01-01 04:00:00 43.54
2018-01-01 04:15:00 43.46
2018-01-01 04:30:00 43.5
2018-01-01 04:45:00 43
2018-01-01 05:00:00 42.04
This is what I've tried so far:
zero = df.between_time('00:00', '00:59')
one = df.between_time('01:00', '01:59')
two = df.between_time('02:00', '02:59')
and then I would plot a boxplot for each of these on the same axes. However it's very tedious to do this for all 24 hours in a day.
This is the kind of output I want:
https://www.researchgate.net/figure/Boxplot-of-the-NOx-data-by-hour-of-the-day_fig1_24054015
There are two steps to achieve this:
convert Actual to date time:
df.Actual = pd.to_datetime(df.Actual)
Group by the hour:
df.groupby([df.Date, df.Actual.dt.hour+1]).Consumption.sum().reset_index()
I assumed you wanted to sum the Consumption (if you want the mean or anything else, just change it). One note: hour + 1 makes the hours start from 1 instead of 0 (remove it if you want 0 to be midnight).
desired result:
Date Actual Consumption
0 2018-01-01 1 182.05
1 2018-01-01 2 172.50
2 2018-01-01 3 169.00
3 2018-01-01 4 172.00
4 2018-01-01 5 173.50
5 2018-01-01 6 42.04
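To get the actual hour-of-day boxplots the question asks for, one sketch is to extract the hour once and hand each hour's values to the plotting call; matplotlib's plt.boxplot accepts one sequence per box (the plotting lines are commented out below to keep the sketch self-contained):

```python
import pandas as pd

# Hypothetical sample shaped like the question's data: 15-minute readings.
df = pd.DataFrame({
    'Date': pd.date_range('2018-01-01', periods=8, freq='15min'),
    'Consumption': [47.05, 46, 44, 45, 43.5, 43.5, 43, 42.5],
})

# One list of readings per hour of day -- exactly what plt.boxplot wants.
by_hour = df.groupby(df['Date'].dt.hour)['Consumption'].apply(list)

# import matplotlib.pyplot as plt
# plt.boxplot(by_hour.tolist(), labels=by_hour.index + 1)
# plt.xlabel('hour of day'); plt.ylabel('Consumption'); plt.show()
```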
I have a Series with a DatetimeIndex and an integer value. I want to make a table that shows the change in value from each time to all the other subsequent times.
Below is a visual representation of what I want. The gray and orange cells are irrelevant data.
I can't figure out a way to create this in a vectorized style inside pandas.
import random
import pandas as pd

z = pd.date_range(start='2018-12-1', periods=10, freq='H')
df = pd.DataFrame(random.sample(range(1, 100), 10), index=z, columns=['foo'])
I've tried things like:
df['foo'].sub(df['foo'].transpose())
But that doesn't work.
The output DataFrame could either have a multindex (beforeTime, AfterTime) or could be a single index "beforeTime" and then have a column for each possible "aftertime". I think they're equivalent, as I can use the unstack() and related functions to get the shape I want?
I think you can use np.subtract.outer to calculate all the values and create the dataframe like:
df_output = pd.DataFrame(np.subtract.outer(df.foo, df.foo),
                         columns=df.index.time, index=df.index.time)
print (df_output.head())
00:00:00 01:00:00 02:00:00 03:00:00 04:00:00 05:00:00 \
00:00:00 0 6 -7 -57 -33 3
01:00:00 -6 0 -13 -63 -39 -3
02:00:00 7 13 0 -50 -26 10
03:00:00 57 63 50 0 24 60
04:00:00 33 39 26 -24 0 36
06:00:00 07:00:00 08:00:00 09:00:00
00:00:00 -53 -28 5 17
01:00:00 -59 -34 -1 11
02:00:00 -46 -21 12 24
03:00:00 4 29 62 74
04:00:00 -20 5 38 50
You can use np.triu to zero out all the values shown in grey in your example, such as:
pd.DataFrame(np.triu(np.subtract.outer(df.foo, df.foo)), columns = ...)
Note that the .time is not necessary when creating columns= and index=; it was just to make the printed dataframe readable when copy-pasting.
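A compact end-to-end sketch with a fixed toy series (values invented here so the output is deterministic):

```python
import numpy as np
import pandas as pd

idx = pd.date_range(start='2018-12-01', periods=4, freq='h')
foo = pd.Series([10, 16, 3, 60], index=idx)

# diff[i, j] = foo[i] - foo[j], for every pair of times.
diff = pd.DataFrame(np.subtract.outer(foo.to_numpy(), foo.to_numpy()),
                    index=idx, columns=idx)

# Keep only the upper triangle (each time vs. the later times); the
# lower triangle -- the grey cells in the question -- becomes zero.
upper = pd.DataFrame(np.triu(diff.to_numpy()), index=idx, columns=idx)
```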