read excel and convert index to DatetimeIndex pandas - python

I read an Excel file in pandas like this:
df = pd.read_excel("Test.xlsx", index_col=[0])
The dataframe looks like this, with the index containing a date and time, and one column:
01.01.2015 00:15:00 47.2
01.01.2015 00:30:00 46.6
01.01.2015 00:45:00 19.4
01.01.2015 01:00:00 14.8
01.01.2015 01:15:00 14.8
01.01.2015 01:30:00 16.4
01.01.2015 01:45:00 16.2
...
I want to convert the index to a DatetimeIndex. I tried
df.index = pd.to_datetime(df.index)
and got: "ValueError: Unknown string format"
What is the best way to convert the index to a datetime format containing date and time, so that I can use datetime-based functions?

I think you need to add the format parameter - see http://strftime.org/:
df.index = pd.to_datetime(df.index, format='%d.%m.%Y %H:%M:%S')
print (df)
a
2015-01-01 00:15:00 47.2
2015-01-01 00:30:00 46.6
2015-01-01 00:45:00 19.4
2015-01-01 01:00:00 14.8
2015-01-01 01:15:00 14.8
2015-01-01 01:30:00 16.4
2015-01-01 01:45:00 16.2
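Once the index is a proper DatetimeIndex, the datetime-based functions you're after just work. A quick illustration (a sketch, assuming the converted df from above):
# partial-string indexing selects a whole day by its date
print (df.loc['2015-01-01'])
# resampling works too, e.g. hourly means of the 15-minute values
print (df.resample('H').mean())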

Related

box plotting time series with pandas and matplotlib

I have a time series dataframe and I want to plot it as a box plot, grouping the data by days.
This is what I did:
import pandas as pd
import matplotlib.pyplot as plt

groups = df.groupby(pd.Grouper(freq='D'))  # one group per calendar day
data = pd.concat([pd.DataFrame(g.values) for _, g in groups], axis=1)  # one column per day
data.boxplot(figsize=(20, 5))
plt.show()
This is the result,
Why have I lost the dates on the x axis? Can I achieve my goal in an easier way? I feel my code is not pythonic.
I also tried the code below, but it stacks together days from different months.
sns.boxplot(x=d1.index.day, y=d1['Temperature'])
An example of my dataframe:
Temp
2019/01/01 00:00:00 25.3
2019/01/01 00:30:00 22.0
2019/01/01 01:00:00 22.1
2019/01/01 01:30:00 28.1
2019/01/01 02:00:00 26.8
2019/01/01 02:30:00 25.3
...
2019/01/02 00:00:00 20.2
2019/01/02 00:30:00 27.0
2019/01/02 01:00:00 27.5
2019/01/02 01:30:00 28.1
2019/01/02 02:00:00 28.8
2019/01/02 02:30:00 26.3
...
2019/02/10 23:30:00 21.6
Thank you!
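For what it's worth, one way to avoid stacking days from different months is to group on the full calendar date rather than the day-of-month number (a sketch, not from the original thread; it assumes d1 is the dataframe above with a DatetimeIndex and a 'Temperature' column as in the snippet):
import seaborn as sns
import matplotlib.pyplot as plt

# index.date yields full dates, so 2019-01-02 and 2019-02-02 get separate boxes
sns.boxplot(x=d1.index.date, y=d1['Temperature'])
plt.xticks(rotation=90)  # many day labels; rotate so they stay readable
plt.show()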

Pandas sort_index only within a given timeframe

I have a pandas Series object consisting of a DatetimeIndex and some values, which looks like the following:
df
2020-01-01 00:00:00 39.6
2020-01-01 00:15:00 35.6
2020-01-01 00:30:00 35.6
2020-01-01 00:45:00 39.2
2020-01-01 01:00:00 56.7
...
2020-12-31 23:45:00 56.3
I am adding some values to this df with .append(). Since the result is not sorted, I then sort its index via .sort_index(). However, what I would like to achieve is to sort only a given day.
So, for example, I add some values to the day 2020-01-01; since the added values land after the end of that day, I only need to sort the first day of the year, NOT THE WHOLE DATAFRAME.
Here is an example; the NaN value was added with .append():
df
2020-01-01 00:00:00 39.6
2020-01-01 00:15:00 35.6
...
2020-01-01 23:45:00 34.3
2020-01-01 15:00:00 NaN
...
2020-12-31 23:45:00 56.3
Now I cannot just call df.sort_index(), because it breaks the order of the other days. That is why I want to apply .sort_index() only to the day 2020-01-01. How do I do that?
WHAT I TRIED SO FAR, WHICH DOES NOT WORK:
df.loc['2020-01-01'] = df.loc['2020-01-01'].sort_index()
Filter the rows for the day 2020-01-01, sort them, and join them back with the non-matching rows:
mask = df.index.normalize() == '2020-01-01'
df = pd.concat([df[mask].sort_index(), df[~mask]])
print (df)
2020-01-01 00:00:00 39.6
2020-01-01 00:15:00 35.6
2020-01-01 15:00:00 NaN
2020-01-01 23:45:00 34.3
2020-12-31 23:45:00 56.3
Name: a, dtype: float64
Another idea:
df1 = df['2020-01-01'].sort_index()
df = pd.concat([df1, df.drop(df1.index)])
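As for why the attempt df.loc['2020-01-01'] = df.loc['2020-01-01'].sort_index() does nothing: assignment in pandas aligns the right-hand side on its index labels before writing, so the re-sorted values are put straight back at their original labels. A minimal, self-contained sketch of the masking approach above (toy values assumed):
import pandas as pd
import numpy as np

idx = pd.to_datetime(['2020-01-01 00:00', '2020-01-01 23:45',
                      '2020-01-01 15:00', '2020-12-31 23:45'])
s = pd.Series([39.6, 34.3, np.nan, 56.3], index=idx, name='a')

# sort only the rows that fall on 2020-01-01, keep the rest untouched
mask = s.index.normalize() == '2020-01-01'
s = pd.concat([s[mask].sort_index(), s[~mask]])
print (s)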

How to extract hourly data from a df in python?

I have the following df
dates Final
2020-01-01 00:15:00 94.7
2020-01-01 00:30:00 94.1
2020-01-01 00:45:00 94.1
2020-01-01 01:00:00 95.0
2020-01-01 01:15:00 96.6
2020-01-01 01:30:00 98.4
2020-01-01 01:45:00 99.8
2020-01-01 02:00:00 99.8
2020-01-01 02:15:00 98.0
2020-01-01 02:30:00 95.1
2020-01-01 02:45:00 91.9
2020-01-01 03:00:00 89.5
The entire dataset runs until 2021-01-01 00:00:00 (value 95.6), at 15-minute intervals.
Since the frequency is 15 minutes, I would like to change it to 1 hour and drop the intermediate values.
Expected output
dates Final
2020-01-01 01:00:00 95.0
2020-01-01 02:00:00 99.8
2020-01-01 03:00:00 89.5
With the last row being 2021-01-01 00:00:00 95.6
How can this be done?
Thanks
Use Series.dt.minute to perform boolean indexing:
df_filtered = df.loc[df['dates'].dt.minute.eq(0)]
#if necessary
#df_filtered = df.loc[pd.to_datetime(df['dates']).dt.minute.eq(0)]
print(df_filtered)
dates Final
3 2020-01-01 01:00:00 95.0
7 2020-01-01 02:00:00 99.8
11 2020-01-01 03:00:00 89.5
If you're doing data analysis or data science, I don't think dropping the intermediate values is a good approach at all! You should probably aggregate them instead, e.g. by summing or averaging (I don't know your use case, but I know a bit about time series data) - see the sketch below.
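A sketch of that resampling alternative, assuming a 'dates' column as in the question; mean() is used here, but sum() works the same way. closed='right' and label='right' make each hourly bucket cover (HH-1:00, HH:00], which lines up with the 15-minute stamps above:
df['dates'] = pd.to_datetime(df['dates'])
df_hourly = df.resample('H', on='dates', closed='right', label='right').mean()
print (df_hourly)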

Data grouping into weekly, monthly and yearly for large datasets using python?

I have a dataset that records an 'X' value over 20 years in DataFrame format. X records the data as 3-hour averages, and a sample of the data is given below.
Time_stamp X
1992-01-01 03:00:00 10.2
1992-01-01 06:00:00 10.4
1992-01-01 09:00:00 11.8
1992-01-01 12:00:00 12.0
1992-01-01 15:00:00 10.4
1992-01-01 18:00:00 9.4
1992-01-01 21:00:00 10.4
1992-01-02 00:00:00 13.6
1992-01-02 03:00:00 13.2
1992-01-02 06:00:00 11.8
1992-01-02 09:00:00 12.0
1992-01-02 12:00:00 12.8
1992-01-02 15:00:00 12.6
1992-01-02 18:00:00 11.0
1992-01-02 21:00:00 12.2
1992-01-03 00:00:00 13.8
1992-01-03 03:00:00 14.0
1992-01-03 06:00:00 13.4
1992-01-03 09:00:00 14.2
1992-01-03 12:00:00 16.2
1992-01-03 15:00:00 13.2
1992-01-03 18:00:00 13.4
1992-01-03 21:00:00 13.8
1992-01-04 00:00:00 14.8
1992-01-04 03:00:00 13.8
1992-01-04 06:00:00 7.6
1992-01-04 09:00:00 5.8
1992-01-04 12:00:00 4.4
1992-01-04 15:00:00 5.6
1992-01-04 18:00:00 6.0
1992-01-04 21:00:00 7.0
1992-01-05 00:00:00 6.8
1992-01-05 03:00:00 3.4
1992-01-05 06:00:00 5.8
1992-01-05 09:00:00 10.6
1992-01-05 12:00:00 9.2
1992-01-05 15:00:00 10.6
1992-01-05 18:00:00 9.8
1992-01-05 21:00:00 11.2
1992-01-06 00:00:00 12.0
1992-01-06 03:00:00 10.2
1992-01-06 06:00:00 9.0
1992-01-06 09:00:00 9.0
1992-01-06 12:00:00 8.6
1992-01-06 15:00:00 8.4
1992-01-06 18:00:00 8.2
1992-01-06 21:00:00 8.8
1992-01-07 00:00:00 10.0
1992-01-07 03:00:00 9.6
1992-01-07 06:00:00 8.0
1992-01-07 09:00:00 9.6
1992-01-07 12:00:00 10.8
1992-01-07 15:00:00 10.2
1992-01-07 18:00:00 9.8
1992-01-07 21:00:00 10.2
1992-01-08 00:00:00 9.4
1992-01-08 03:00:00 11.4
1992-01-08 06:00:00 12.6
1992-01-08 09:00:00 12.8
1992-01-08 12:00:00 10.4
1992-01-08 15:00:00 11.2
1992-01-08 18:00:00 9.0
1992-01-08 21:00:00 10.2
1992-01-09 00:00:00 8.2
I would like to create separate dataframes that calculate and record the yearly mean, weekly mean and daily mean of the given dataset. I am new to Python and have just started working with time series data. I found some related questions here on Stack Overflow, but no appropriate answer, and no idea how to start. Any help on this?
This is the code I have written so far:
import datetime
import pandas as pd
import numpy as np

df['date_minus_time'] = df["Time_stamp"].apply(
    lambda ts: datetime.datetime(year=ts.year, month=ts.month, day=ts.day))
df.set_index("date_minus_time", inplace=True)
df['count'].resample('D', how='sum')
df['count'].resample('W', how='sum')
df['count'].resample('M', how='sum')
But I have no clue how to account for the fact that the data is recorded every 3 hrs, or what to do next to get my desired result.
Use to_datetime to convert the column to datetimes (for improved performance), and then DataFrame.resample with the on parameter to specify the datetime column:
df['Time_stamp'] = pd.to_datetime(df['Time_stamp'])
df_daily = df.resample('D', on='Time_stamp').mean()
df_monthly = df.resample('M', on='Time_stamp').mean()
df_weekly = df.resample('W', on='Time_stamp').mean()
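The question also asks for a yearly mean; the same pattern should extend to it with the 'Y' rule (a sketch):
df_yearly = df.resample('Y', on='Time_stamp').mean()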
You may use:
df['Time_stamp'] = pd.to_datetime(df['Time_stamp'], format='%Y-%m-%d %H:%M:%S')
df.set_index('Time_stamp',inplace=True)
df_monthly = df.resample('M').mean()
df_monthly outputs:
X
Time_stamp
1992-01-31 10.403125
For daily mean use: df_daily = df.resample('D').mean() which outputs:
X
Time_stamp
1992-01-01 10.657143
1992-01-02 12.400000
1992-01-03 14.000000
1992-01-04 8.125000
1992-01-05 8.425000
1992-01-06 9.275000
1992-01-07 9.775000
1992-01-08 10.875000
1992-01-09 8.200000

Grouping dates by 5 minute periods irrespective of day

I have a DataFrame with data similar to the following
import pandas as pd
import numpy as np
import datetime
from datetime import timedelta

df = pd.DataFrame(index=pd.date_range(start='20160102', end='20170301', freq='5min'))
df['value'] = np.random.randn(df.index.size)
df.index += pd.Series([timedelta(seconds=np.random.randint(-60, 60))
                       for _ in range(df.index.size)])
which looks like this
In[37]: df
Out[37]:
value
2016-01-02 00:00:33 0.546675
2016-01-02 00:04:52 1.080558
2016-01-02 00:10:46 -1.551206
2016-01-02 00:15:52 -1.278845
2016-01-02 00:19:04 -1.672387
2016-01-02 00:25:36 -0.786985
2016-01-02 00:29:35 1.067132
2016-01-02 00:34:36 -0.575365
2016-01-02 00:39:33 0.570341
2016-01-02 00:44:56 -0.636312
...
2017-02-28 23:14:57 -0.027981
2017-02-28 23:19:51 0.883150
2017-02-28 23:24:15 -0.706997
2017-02-28 23:30:09 -0.954630
2017-02-28 23:35:08 -1.184881
2017-02-28 23:40:20 0.104017
2017-02-28 23:44:10 -0.678742
2017-02-28 23:49:15 -0.959857
2017-02-28 23:54:36 -1.157165
2017-02-28 23:59:10 0.527642
Now, I'm aiming to get the mean per 5-minute period over the course of a 24-hour day - without considering which day the values actually come from.
How can I do this effectively? I would like to think I could somehow remove the actual dates from my index and then use something like pd.TimeGrouper, but I haven't figured out how to do so.
My not-so-great solution
My solution so far has been to use between_time in a loop like this, just using an arbitrary day.
aggregates = []
start_time = datetime.datetime(1990, 1, 1, 0, 0, 0)
while start_time < datetime.datetime(1990, 1, 1, 23, 59, 0):
    aggregates.append(
        (
            start_time,
            df.between_time(start_time.time(),
                            (start_time + timedelta(minutes=5)).time(),
                            include_end=False).value.mean()
        )
    )
    start_time += timedelta(minutes=5)
result = pd.DataFrame(aggregates, columns=['time', 'value'])
which works as expected
In[68]: result
Out[68]:
time value
0 1990-01-01 00:00:00 0.032667
1 1990-01-01 00:05:00 0.117288
2 1990-01-01 00:10:00 -0.052447
3 1990-01-01 00:15:00 -0.070428
4 1990-01-01 00:20:00 0.034584
5 1990-01-01 00:25:00 0.042414
6 1990-01-01 00:30:00 0.043388
7 1990-01-01 00:35:00 0.050371
8 1990-01-01 00:40:00 0.022209
9 1990-01-01 00:45:00 -0.035161
.. ... ...
278 1990-01-01 23:10:00 0.073753
279 1990-01-01 23:15:00 -0.005661
280 1990-01-01 23:20:00 -0.074529
281 1990-01-01 23:25:00 -0.083190
282 1990-01-01 23:30:00 -0.036636
283 1990-01-01 23:35:00 0.006767
284 1990-01-01 23:40:00 0.043436
285 1990-01-01 23:45:00 0.011117
286 1990-01-01 23:50:00 0.020737
287 1990-01-01 23:55:00 0.021030
[288 rows x 2 columns]
But this doesn't feel like a very Pandas-friendly solution.
IIUC then the following should work:
In [62]:
df.groupby(df.index.floor('5min').time).mean()
Out[62]:
value
00:00:00 -0.038002
00:05:00 -0.011646
00:10:00 0.010701
00:15:00 0.034699
00:20:00 0.041164
00:25:00 0.151187
00:30:00 -0.006149
00:35:00 -0.008256
00:40:00 0.021389
00:45:00 0.016851
00:50:00 -0.074825
00:55:00 0.012861
01:00:00 0.054048
01:05:00 0.041907
01:10:00 -0.004457
01:15:00 0.052428
01:20:00 -0.021518
01:25:00 -0.019010
01:30:00 0.030887
01:35:00 -0.085415
01:40:00 0.002386
01:45:00 -0.002189
01:50:00 0.049720
01:55:00 0.032292
02:00:00 -0.043642
02:05:00 0.067132
02:10:00 -0.029628
02:15:00 0.064098
02:20:00 0.042731
02:25:00 -0.031113
... ...
21:30:00 -0.018391
21:35:00 0.032155
21:40:00 0.035014
21:45:00 -0.016979
21:50:00 -0.025248
21:55:00 0.027896
22:00:00 -0.117036
22:05:00 -0.017970
22:10:00 -0.008494
22:15:00 -0.065303
22:20:00 -0.014623
22:25:00 0.076994
22:30:00 -0.030935
22:35:00 0.030308
22:40:00 -0.124668
22:45:00 0.064853
22:50:00 0.057913
22:55:00 0.002309
23:00:00 0.083586
23:05:00 -0.031043
23:10:00 -0.049510
23:15:00 0.003520
23:20:00 0.037135
23:25:00 -0.002231
23:30:00 -0.029592
23:35:00 0.040335
23:40:00 -0.021513
23:45:00 0.104421
23:50:00 -0.022280
23:55:00 -0.021283
[288 rows x 1 columns]
Here I floor the index down to 5-minute boundaries (which removes the stray seconds), then group on the time attribute, so only the time of day matters and the date is discarded, and aggregate with the mean.
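If you'd rather keep the bucket labels as offsets from midnight (handy for plotting) instead of datetime.time objects, a variation on the same idea should work (a sketch):
# time elapsed since each day's midnight, floored to 5 minutes
df.groupby(df.index.floor('5min') - df.index.floor('D')).mean()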
