pandas quick way to break dataframe by timeindex - python

I have a dataframe with a time index, but the index is not consecutive. The df has a microsecond-resolution timestamp index.
Time Bid
2014-03-03 23:30:30.383002 1.37315
2014-03-03 23:30:30.383042 1.37318
2014-03-03 23:30:30.383067 1.37318
2014-03-03 23:30:31.174442 1.37315
2014-03-03 23:30:32.028966 1.37315
2014-03-03 23:30:32.052447 1.37315
I want to check whether there are minutes without any data, so I resampled:
tick_count = df.resample('1Min').count()
Time Bid
2014-03-04 00:15:00 73
2014-03-04 00:16:00 298
2014-03-04 00:17:00 124
2014-03-04 00:18:00 318
2014-03-04 00:19:00 27
2014-03-04 00:20:00 0
2014-03-04 00:21:00 0
2014-03-04 00:22:00 241
2014-03-04 00:23:00 97
2014-03-04 00:24:00 52
2014-03-04 00:25:00 446
2014-03-04 00:26:00 867
so here I find two minutes with no data. How do I separate the original df into multiple
dfs, each of which has data for every minute? In the case above,
the first df would run from 00:15 to 00:19, the second from 00:22 to 00:26, and so on.
Thank you!

Assuming the times are sorted, you could use
df['group'] = (df['Time'].diff() > np.timedelta64(60,'s')).cumsum()
to add a column to your DataFrame, which will classify the rows according to which group they belong to. The result looks like this:
Time Bid group
0 2014-03-04 00:15:00 73 0
1 2014-03-04 00:16:00 298 0
2 2014-03-04 00:17:00 124 0
3 2014-03-04 00:18:00 318 0
4 2014-03-04 00:19:00 27 0
5 2014-03-04 00:22:00 241 1
6 2014-03-04 00:23:00 97 1
7 2014-03-04 00:24:00 52 1
8 2014-03-04 00:25:00 446 1
9 2014-03-04 00:26:00 867 1
This is better than having multiple DataFrames, because you can apply fast numpy/pandas operations to the entire DataFrame, whereas with a list of DataFrames you would be forced to use a Python loop to operate on each sub-DataFrame individually (assuming you want to perform the same operation on each one). That is generally slower.
Typically, the pandas-way to operate on the sub-DataFrames would be to use a groupby operation. For example,
>>> grouped = df.groupby(['group'])
>>> grouped['Bid'].sum()
group
0 840
1 1703
Name: Bid, dtype: int64
to find the sum of the bids in each group.
However, if you really wish to have a list of sub-DataFrames, you could obtain it using
subdfs = [subdf for key, subdf in grouped]
For those wanting to reproduce the result above, I put this in a file called data:
Time Bid
2014-03-04 00:15:00 73
2014-03-04 00:16:00 298
2014-03-04 00:17:00 124
2014-03-04 00:18:00 318
2014-03-04 00:19:00 27
2014-03-04 00:22:00 241
2014-03-04 00:23:00 97
2014-03-04 00:24:00 52
2014-03-04 00:25:00 446
2014-03-04 00:26:00 867
and ran
import pandas as pd
import numpy as np
df = pd.read_csv('data', sep=r'\s{2,}', engine='python', parse_dates=[0])
print(df.dtypes)
# Time datetime64[ns] # It is important that Time has dtype datetime64[ns]
# Bid int64
# dtype: object
df['group'] = (df['Time'].diff() > np.timedelta64(60,'s')).cumsum()
print(df)
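On current pandas the same split can be done against a DatetimeIndex directly, without the helper Time column; a minimal sketch with made-up data in the shape of the question:

```python
import pandas as pd

idx = pd.to_datetime([
    "2014-03-04 00:15:00", "2014-03-04 00:16:00", "2014-03-04 00:17:00",
    "2014-03-04 00:22:00", "2014-03-04 00:23:00",
])
df = pd.DataFrame({"Bid": [73, 298, 124, 241, 97]}, index=idx)

# any gap longer than one minute starts a new group
group = (df.index.to_series().diff() > pd.Timedelta("1min")).cumsum()
sums = df.groupby(group)["Bid"].sum()
print(sums)  # group 0 -> 495, group 1 -> 338
```

The `diff()` on the index is NaT for the first row, which compares as False and so lands in group 0, exactly as in the column-based version.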

Related

Slicing pandas DateTimeIndex with steps

I often deal with pandas DataFrames with DateTimeIndexes, where I want to - for example - select only the parts where the hour of the index = 6. The only way I currently know how to do this is with reindexing:
df.reindex(pd.date_range(*df.index.to_series().agg([min, max]).apply(lambda ts: ts.replace(hour=6)), freq="24H"))
But this is quite unreadable and complex, which gets even worse when there is a MultiIndex with multiple DateTimeIndex levels. I know of methods that use .reset_index() and then either df.where or df.loc with conditional statements, but is there a simpler way to do this with regular IndexSlicing? I tried it as follows
df.loc[df.index.min().replace(hour=6)::pd.Timedelta(24, unit="H")]
but this gives a TypeError:
TypeError: '>=' not supported between instances of 'Timedelta' and 'int'
If your index is a DatetimeIndex, you can use:
>>> df[df.index.hour == 6]
val
2022-03-01 06:00:00 7
2022-03-02 06:00:00 31
2022-03-03 06:00:00 55
2022-03-04 06:00:00 79
2022-03-05 06:00:00 103
2022-03-06 06:00:00 127
2022-03-07 06:00:00 151
2022-03-08 06:00:00 175
2022-03-09 06:00:00 199
2022-03-10 06:00:00 223
2022-03-11 06:00:00 247
2022-03-12 06:00:00 271
2022-03-13 06:00:00 295
2022-03-14 06:00:00 319
2022-03-15 06:00:00 343
2022-03-16 06:00:00 367
2022-03-17 06:00:00 391
2022-03-18 06:00:00 415
2022-03-19 06:00:00 439
2022-03-20 06:00:00 463
2022-03-21 06:00:00 487
Setup:
dti = pd.date_range('2022-3-1', '2022-3-22', freq='1H')
df = pd.DataFrame({'val': range(1, len(dti)+1)}, index=dti)
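For selecting a single fixed time of day (rather than every row within an hour), `DataFrame.at_time` is a readable alternative; a sketch on the same dummy setup:

```python
import pandas as pd

dti = pd.date_range('2022-3-1', '2022-3-22', freq='1h')
df = pd.DataFrame({'val': range(1, len(dti) + 1)}, index=dti)

# rows whose index time is exactly 06:00
sub = df.at_time('06:00')
print(sub.head())
```

There is also `DataFrame.between_time('06:00', '07:00')` if a window rather than an exact time is needed.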

How to use Pandas to find 30 min average flows and then find max 30 min average flow per day?

I have flow rate data per minute of each day. I want to take average flow rates for every 30 minutes each day. Then I want to find the maximum 30 minute average flow rate for each day. Once I have the max 30 min average flow rate per day I would like to save them into an excel sheet displaying max average flow rate per day.
import pandas as pd
import numpy as np
peakflow = pd.read_csv(r'P:\Waste Water\Totalizer Data\Main DAF\July_1_17_July_20_17.xls.csv')
peakflow['DateTime'] = pd.to_datetime(peakflow.DateTime)
Here is a sample of my data frame called peakflow:
DateTime Gallons
0 NaT Average
1 NaT gpm
2 2017-07-01 00:00:00 743
3 2017-07-01 00:01:00 1273
4 2017-07-01 00:02:00 1256
5 2017-07-01 00:03:00 723
6 2017-07-01 00:04:00 0
7 2017-07-01 00:05:00 0
8 2017-07-01 00:06:00 0
9 2017-07-01 00:07:00 455
10 2017-07-01 00:08:00 1279
11 2017-07-01 00:09:00 1258
12 2017-07-01 00:10:00 1052
13 2017-07-01 00:11:00 0
14 2017-07-01 00:12:00 0
15 2017-07-01 00:13:00 0
16 2017-07-01 00:14:00 919
17 2017-07-01 00:15:00 1271
18 2017-07-01 00:16:00 1244
19 2017-07-01 00:17:00 343
20 2017-07-01 00:18:00 0
21 2017-07-01 00:19:00 0
22 2017-07-01 00:20:00 0
23 2017-07-01 00:21:00 1248
24 2017-07-01 00:22:00 1258
25 2017-07-01 00:23:00 836
26 2017-07-01 00:24:00 0
27 2017-07-01 00:25:00 0
28 2017-07-01 00:26:00 451
29 2017-07-01 00:27:00 1284
I tried using the following code:
df2 = peakflow.resample(rule = '30Min').mean()
to resample the data frame peakflow and take an average every 30 minutes, then save it to a new data frame called df2, where I was going to use this code:
df3 = df2.resample(rule = '1D').max()
to resample df2 every day, find the daily max values, and save them to df3.
However my code did not work to create df2 and I got the following error:
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex,
but got an instance of 'RangeIndex'
Does anyone have ideas on what would work for this application, or on what went wrong with this code? Any help would be appreciated.
Thanks.
The dataframe that you try to resample must have a DatetimeIndex.
peakflow.set_index('DateTime', inplace=True)
peakflow.index = pd.to_datetime(peakflow.index)
df2 = peakflow.resample(rule = '30Min').mean()
# Gallons
#DateTime
#2017-07-01 603.321429
df3 = df2.resample(rule = '1D').max()
# Gallons
#DateTime
#2017-07-01 603.321429
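Putting the pieces together, one possible end-to-end sketch (with synthetic stand-in data; the Excel filename is a placeholder):

```python
import pandas as pd
import numpy as np

# synthetic stand-in for the per-minute flow data: two full days
idx = pd.date_range("2017-07-01", periods=2 * 24 * 60, freq="1min")
peakflow = pd.DataFrame({"Gallons": np.random.randint(0, 1300, len(idx))}, index=idx)

half_hour = peakflow.resample("30Min").mean()  # 30-minute average flow
daily_max = half_hour.resample("1D").max()     # max 30-min average per day

# daily_max.to_excel("max_daily_flow.xlsx")    # needs openpyxl; filename is a placeholder
print(daily_max)
```

With the real file, note that the two non-data rows (`Average`, `gpm`) force Gallons to be read as strings, so something like `pd.to_numeric(peakflow['Gallons'], errors='coerce')` followed by `dropna()` would be needed before resampling.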

python pandas String to Timestamp conversion ambiguous

I'm trying to slice a DataFrame using a DatetimeIndex, but I've hit an issue.
When the new DataFrame changes month, the day and the month get switched.
Here is my dataframe:
Valeur
date
2015-01-08 00:00:00 93
2015-01-08 00:10:00 90
2015-01-08 00:20:00 88
2015-01-08 00:30:00 103
2015-01-08 00:40:00 86
2015-01-08 00:50:00 88
2015-01-08 01:00:00 86
2015-01-08 01:10:00 84
2015-01-08 01:20:00 95
2015-01-08 01:30:00 88
2015-01-08 01:40:00 85
2015-01-08 01:50:00 92
... ...
2016-10-30 22:20:00 98
2016-10-30 22:30:00 94
2016-10-30 22:40:00 94
2016-10-30 22:50:00 103
2016-10-30 23:00:00 92
2016-10-30 23:10:00 85
2016-10-30 23:20:00 98
2016-10-30 23:30:00 96
2016-10-30 23:40:00 95
2016-10-30 23:50:00 101
[65814 rows x 1 columns]
Here are my two Timestamps:
startingDate : 2015-10-31 23:50:00
lastDate : 2016-10-30 23:50:00
When I slice my df like this:
dfconso = dfconso[startingDate:lastDate]
I get something like this:
Valeur
date
2015-10-31 23:50:00 88
2015-01-11 00:00:00 83
2015-01-11 00:10:00 82
2015-01-11 00:20:00 87
2015-01-11 00:30:00 77
2015-01-11 00:40:00 72
2015-01-11 00:50:00 86
2015-01-11 01:00:00 77
2015-01-11 01:10:00 80
... ...
2016-10-30 23:10:00 85
2016-10-30 23:20:00 98
2016-10-30 23:30:00 96
2016-10-30 23:40:00 95
2016-10-30 23:50:00 101
The problem is that the slice starts at the right date, but when the DatetimeIndex changes month, something goes wrong: it jumps from 31 October 2015 to 11 January 2015, and I don't understand why.
I tried printing the month and day to check, and got this:
In:
print("Index 0 : month", dfconso.index[0].month, ", day", dfconso.index[0].day)
print("Index 1 : month", dfconso.index[1].month, ", day", dfconso.index[1].day)
Out:
Index 0 : month 10 , day 31
Index 1 : month 1 , day 11
If someone has an idea, I'd appreciate it.
EDIT:
After calling df.sort_index() on my df, I can see that the conversion of the string dates to Timestamps sometimes failed, switching month and day.
String format:
"31/08/2015 20:00:00"
My code to transform from String to TimeStamps:
dfconso.index = pd.to_datetime(dfconso.index, infer_datetime_format=True, format="%d/%m/%Y")
SOLUTION:
That was a bad use of pd.to_datetime; I changed infer_datetime_format to dayfirst:
dfconso.index = pd.to_datetime(dfconso.index, dayfirst=True)
That solved my problem.
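The effect of dayfirst can be seen on a small example (dates chosen to show the ambiguity):

```python
import pandas as pd

# "31/08/2015" is unambiguous, but "01/11/2015" is not: the default parser
# reads it month-first (11 Jan), while dayfirst=True reads it day-first (1 Nov)
s = pd.to_datetime(["31/08/2015 20:00:00", "01/11/2015 00:00:00"], dayfirst=True)
print(list(s.month))  # [8, 11]
```

Note that dayfirst is only a hint; passing an explicit format string is the strictest option.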
The error might not be a mix-up of day and month, but just an ordering problem. Try reordering the data before slicing it (the provided part of your data looks fine, but who knows about the rest).
Here is how reordering works: Sort a pandas datetime index

Query same time value every day in Pandas timeseries

I would like to get the 07h00 value every day, from a multiday DataFrame that has 24 hours of minute data in it each day.
import numpy as np
import pandas as pd
aframe = pd.DataFrame([np.arange(10000), np.arange(10000) * 2]).T
aframe.index = pd.date_range("2015-09-01", periods = 10000, freq = "1min")
aframe.head()
Out[174]:
0 1
2015-09-01 00:00:00 0 0
2015-09-01 00:01:00 1 2
2015-09-01 00:02:00 2 4
2015-09-01 00:03:00 3 6
2015-09-01 00:04:00 4 8
aframe.tail()
Out[175]:
0 1
2015-09-07 22:35:00 9995 19990
2015-09-07 22:36:00 9996 19992
2015-09-07 22:37:00 9997 19994
2015-09-07 22:38:00 9998 19996
2015-09-07 22:39:00 9999 19998
In this 10 000 row DataFrame spanning 7 days, how would I get the 7am value each day as efficiently as possible? Assume I might have to do this for very large tick databases so I value speed and low memory usage highly.
I know I can index with strings such as:
aframe.loc["2015-09-02 07:00:00"]
Out[176]:
0 1860
1 3720
Name: 2015-09-02 07:00:00, dtype: int64
But what I need is basically a wildcard-style query, for example:
aframe.loc["* 07:00:00"]
You can use indexer_at_time:
>>> locs = aframe.index.indexer_at_time('7:00:00')
>>> aframe.iloc[locs]
0 1
2015-09-01 07:00:00 420 840
2015-09-02 07:00:00 1860 3720
2015-09-03 07:00:00 3300 6600
2015-09-04 07:00:00 4740 9480
2015-09-05 07:00:00 6180 12360
2015-09-06 07:00:00 7620 15240
2015-09-07 07:00:00 9060 18120
There's also indexer_between_time if you need to select all the indices that lie between two particular times of day.
Both of these methods return the integer locations of the desired values; the corresponding rows of the Series or DataFrame can be fetched with iloc, as shown above.
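indexer_between_time works the same way; a small sketch on the same dummy frame (the 07:00–07:02 window is arbitrary):

```python
import pandas as pd
import numpy as np

aframe = pd.DataFrame([np.arange(10000), np.arange(10000) * 2]).T
aframe.index = pd.date_range("2015-09-01", periods=10000, freq="1min")

# integer locations of every row between 07:00 and 07:02 (inclusive) on any day
locs = aframe.index.indexer_between_time("07:00", "07:02")
sel = aframe.iloc[locs]
print(sel.head())
```

Both endpoints are included by default; `include_start`/`include_end` control that.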

matplotlib plots strange horizontal lines on graph

I have used openpyxl to read data from an Excel spreadsheet into a pandas data frame, called 'tides'. The dataset contains over 32,000 rows of data (of tide times in the UK measured every 15 minutes). One of the columns contains date and time information (variable called 'datetime') and another contains the height of the tide (called 'tide'):
I want to plot datetime along the x-axis and tide on the y axis using:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import openpyxl
import datetime as dt
from matplotlib.dates import date2num
<-- Data imported from Excel spreadsheet into DataFrame using openpyxl. -->
<-- Code omitted for ease of reading. -->
# Convert datatime variable to datetime64 format:
tides['datetime'] = pd.to_datetime(tides['datetime'])
# Plot figure of 'datetime' vs 'tide':
fig = plt.figure()
ax_tides = fig.add_subplot(1,1,1)
ax_tides.plot_date(date2num(tides['datetime']),tides['tide'],'-',xdate=True,label='Tides 2011',linewidth=0.5)
min_datetime = dt.datetime.strptime('01/01/2011 00:00:00',"%d/%m/%Y %H:%M:%S")
max_datetime = dt.datetime.strptime('03/01/2011 23:59:45',"%d/%m/%Y %H:%M:%S")
ax_tides.set_xlim( [min_datetime, max_datetime] )
plt.show()
The plot shows just the first few days of data. However, at the change from one day to the next, something strange happens; after the last point of day 1, the line disappears off to the right and then returns to plot the first point of the second day - but the data is plotted incorrectly on the y axis. This happens throughout the dataset. A printout shows that the data seems to be OK.
number datetime tide
0 1 2011-01-01 00:00:00 4.296
1 2 2011-01-01 00:15:00 4.024
2 3 2011-01-01 00:30:00 3.768
3 4 2011-01-01 00:45:00 3.521
4 5 2011-01-01 01:00:00 3.292
5 6 2011-01-01 01:15:00 3.081
6 7 2011-01-01 01:30:00 2.887
7 8 2011-01-01 01:45:00 2.718
8 9 2011-01-01 02:00:00 2.577
9 10 2011-01-01 02:15:00 2.470
10 11 2011-01-01 02:30:00 2.403
11 12 2011-01-01 02:45:00 2.389
12 13 2011-01-01 03:00:00 2.417
13 14 2011-01-01 03:15:00 2.492
14 15 2011-01-01 03:30:00 2.611
15 16 2011-01-01 03:45:00 2.785
16 17 2011-01-01 04:00:00 3.020
17 18 2011-01-01 04:15:00 3.314
18 19 2011-01-01 04:30:00 3.665
19 20 2011-01-01 04:45:00 4.059
20 21 2011-01-01 05:00:00 4.483
[21 rows x 3 columns]
number datetime tide
90 91 2011-01-01 22:30:00 7.329
91 92 2011-01-01 22:45:00 7.014
92 93 2011-01-01 23:00:00 6.690
93 94 2011-01-01 23:15:00 6.352
94 95 2011-01-01 23:30:00 6.016
95 96 2011-01-01 23:45:00 5.690
96 97 2011-02-01 00:00:00 5.366
97 98 2011-02-01 00:15:00 5.043
98 99 2011-02-01 00:30:00 4.729
99 100 2011-02-01 00:45:00 4.426
100 101 2011-02-01 01:00:00 4.123
101 102 2011-02-01 01:15:00 3.832
102 103 2011-02-01 01:30:00 3.562
103 104 2011-02-01 01:45:00 3.303
104 105 2011-02-01 02:00:00 3.055
105 106 2011-02-01 02:15:00 2.827
106 107 2011-02-01 02:30:00 2.620
107 108 2011-02-01 02:45:00 2.434
108 109 2011-02-01 03:00:00 2.268
109 110 2011-02-01 03:15:00 2.141
110 111 2011-02-01 03:30:00 2.060
[21 rows x 3 columns]
number datetime tide
35020 35021 2011-12-31 19:00:00 5.123
35021 35022 2011-12-31 19:15:00 4.838
35022 35023 2011-12-31 19:30:00 4.551
35023 35024 2011-12-31 19:45:00 4.279
35024 35025 2011-12-31 20:00:00 4.033
35025 35026 2011-12-31 20:15:00 3.803
35026 35027 2011-12-31 20:30:00 3.617
35027 35028 2011-12-31 20:45:00 3.438
35028 35029 2011-12-31 21:00:00 3.278
35029 35030 2011-12-31 21:15:00 3.141
35030 35031 2011-12-31 21:30:00 3.019
35031 35032 2011-12-31 21:45:00 2.942
35032 35033 2011-12-31 22:00:00 2.909
35033 35034 2011-12-31 22:15:00 2.918
35034 35035 2011-12-31 22:30:00 2.923
35035 35036 2011-12-31 22:45:00 2.985
35036 35037 2011-12-31 23:00:00 3.075
35037 35038 2011-12-31 23:15:00 3.242
35038 35039 2011-12-31 23:30:00 3.442
35039 35040 2011-12-31 23:45:00 3.671
I am at a loss to explain this. Can anyone explain what is happening, why it is happening and how can I correct it?
Thanks in advance.
Phil
Doh! Finally found the answer. The original workflow was quite complicated. I stored the data in an Excel spreadsheet and used openpyxl to read data from a named cell range. This was then converted to a pandas DataFrame. The date-and-time variable was converted to datetime format using pandas' .to_datetime() function. And finally the data were plotted using matplotlib. As I was preparing the data to post to this forum (as suggested by rauparaha) and paring down the script to it bare essentials, I noticed that Day1 data were plotted on 01 Jan 2011 but Day2 data were plotted on 01 Feb 2011. If you look at the output in the original post, the dates are mixed formats: The last date given is '2011-12-31' (i.e. year-month-day') but the 2nd date representing 2nd Jan 2011 is '2011-02-01' (i.e. year-day-month).
So it looks like I misunderstood how pandas' .to_datetime() function interprets datetime information. I had purposely not set the infer_datetime_format attribute (default=False) and had assumed any problems would be flagged up. But it seems pandas assumes dates are in a month-first format unless that fails, in which case it switches to a day-first format. I should have picked that up!
I have corrected the problem by providing a string that explicitly defines the datetime format. All is fine again.
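The fix described above amounts to passing an explicit format string to pd.to_datetime; a sketch on standalone data (the `%d/%m/%Y %H:%M:%S` format is an assumption matching a day-first source):

```python
import pandas as pd

raw = pd.Series(["01/01/2011 23:45:00", "02/01/2011 00:00:00"])  # day-first strings
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M:%S")
print(parsed.dt.day.tolist())  # [1, 2] -- 2 Jan, not 1 Feb
```

With an explicit format, a string that does not match raises an error instead of being silently reinterpreted.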
Thanks again for your suggestions. And apologies for any confusion.
Cheers.
I have been unable to replicate your error but perhaps my working dummy code can help diagnose the problem. I generated dummy data and plotted it with this code:
import pandas as pd
import numpy as np
ydata = np.sin(np.linspace(0, 10, num=200))
time_index = pd.date_range(start="2000-01-01", periods=200, freq="15min")
df = pd.DataFrame({'tides': ydata, 'datetime': time_index})
df.plot(x='datetime', y='tides')
My data looks like this
datetime tides
0 2000-01-01 00:00:00 0.000000
1 2000-01-01 00:15:00 0.050230
2 2000-01-01 00:30:00 0.100333
3 2000-01-01 00:45:00 0.150183
4 2000-01-01 01:00:00 0.199654
[200 rows]
and generates the following plot
