Doing calculations on higher frequency data in lower frequency bins in Pandas - python

I have some data in a pandas dataframe that has entries at the per-second level over the course of a few hours. Entries are indexed by datetime format as TIMESTAMP. I would like to group all data within each minute and do some calculations and manipulations. That is, I would like to take all data within 09:00:00 to 09:00:59 and report some things about what happened in this minute. I would then like to do the same calculations and manipulations from 09:01:00 to 09:01:59 and so on through to the end of my dataset.
I've been fiddling around with groupby() and .resample() but I have had no success so far. I can think of a very inelegant way to do it with a series of for loops and if statements but I was wondering if there was an easier way here.

You didn't provide any data or code, so I'll just make some up. You also don't specify what calculations you want to do, so I'm just taking the mean:
>>> import numpy as np
>>> import pandas as pd
>>> dates = pd.date_range("1/1/2020 00:00:00", "1/1/2020 04:00:00", freq="S")
>>> values = np.random.random(len(dates))
>>> df = pd.DataFrame({"dates": dates, "values": values})
>>> df.resample("1Min", on="dates").mean().reset_index()
                   dates    values
0    2020-01-01 00:00:00  0.486985
1    2020-01-01 00:01:00  0.454880
2    2020-01-01 00:02:00  0.467397
3    2020-01-01 00:03:00  0.543838
4    2020-01-01 00:04:00  0.502764
..                   ...       ...
236  2020-01-01 03:56:00  0.478224
237  2020-01-01 03:57:00  0.460435
238  2020-01-01 03:58:00  0.508211
239  2020-01-01 03:59:00  0.415030
240  2020-01-01 04:00:00  0.050993

[241 rows x 2 columns]
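If you need more than one statistic per minute, the same resampler accepts an agg spec. A minimal sketch along the same lines, assuming the df from the example above (the choice of mean/max/count is just illustrative):

# several per-minute statistics in one pass over the "values" column
stats = df.resample("1Min", on="dates")["values"].agg(["mean", "max", "count"])
print(stats.head())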

Related

How to plot part of date time in Python

I have a massive CSV dataset with hourly values spanning a whole year. It has not been difficult to plot the whole dataset (or specific parts of it) across the year.
However, I would like to take a closer look at a single month (e.g. just plot January or February), and for the life of me I haven't found out how to do that.
Date                 Company1  Company2
2020-01-01 00:00:00       100       200
2020-01-01 01:00:00       110       180
2020-01-01 02:00:00        90       210
2020-01-01 03:00:00       100       200
...                       ...       ...
2020-12-31 21:00:00       100       200
2020-12-31 22:00:00        80       230
2020-12-31 23:00:00       120       220
All of the columns are correctly formatted, the datetime is correctly formatted. How can I slice or define exactly the period I want to plot?
You can extract the month portion of a pandas datetime using .dt.month on a datetime series. Then check if that is equal to the month in question:
df_january = df[df['Date'].dt.month == 1]
You can then plot using your df_january dataframe. N.B. this will pick up data from other years as well if your dataset expands to cover other years.
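If the dataset does grow to span several years, you can guard against that by filtering on the year as well. A small sketch, assuming the same 'Date' and 'Company1' columns as in the question:

# keep only January 2020 by checking both year and month
df_jan_2020 = df[(df['Date'].dt.year == 2020) & (df['Date'].dt.month == 1)]
df_jan_2020.plot(x='Date', y='Company1')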
@WakemeUpNow had the solution I hadn't noticed: defining xlim while plotting did the trick.
df.DateTime.plot(x='Date', y='Company', xlim=('2020-01-01 00:00:00', '2020-12-31 23:00:00'))
plt.show()
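To zoom in on a single month with this same approach, narrow xlim to that month's boundaries. A sketch mirroring the call above (passing pd.Timestamp objects rather than raw strings avoids any ambiguity in how the limits are parsed):

# restrict the x-axis to January only
df.DateTime.plot(x='Date', y='Company',
                 xlim=(pd.Timestamp('2020-01-01'), pd.Timestamp('2020-02-01')))
plt.show()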

Create Multiple DataFrames using Rolling Window from DataFrame Timestamps

I have one year's worth of data at four-minute intervals. I need to always load 24 hours of data and run a function on this dataframe at intervals of eight hours. I need to repeat this process for all the data between 2021's start and end dates.
For example:
Load year_df containing ranges between 2021-01-01 00:00:00 and 2021-01-01 23:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 08:00:00 and 2021-01-02 07:56:00 and run a function on this.
Load year_df containing ranges between 2021-01-01 16:00:00 and 2021-01-02 15:56:00 and run a function on this.
# Proxy DataFrame: one Timestamp column at 4-minute frequency covering 2021
start = pd.to_datetime('2021-01-01 00:00:00')
end = pd.to_datetime('2021-12-31 23:56:00')
myIndex = pd.date_range(start, end, freq='4T')
year_df = pd.DataFrame({'Timestamp': myIndex})
year_df.head()
Timestamp
0 2021-01-01 00:00:00
1 2021-01-01 00:04:00
2 2021-01-01 00:08:00
3 2021-01-01 00:12:00
4 2021-01-01 00:16:00
This approach avoids explicit for loops, but the apply method is essentially a for loop under the hood, so it isn't especially efficient. Until more functionality based on rolling datetime windows is introduced to pandas, though, this might be the only option.
The example uses the mean of the timestamps; knowing exactly what function you want to apply may help produce a better answer.
s = pd.Series(myIndex, index=myIndex)

def myfunc(e):
    # all timestamps within 24 hours of e; the mean is a stand-in for the real calculation
    temp = s[s.between(e, e + pd.Timedelta("24h"))]
    return temp.mean()

s.apply(myfunc)
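To hit the stated eight-hour cadence rather than evaluating every timestamp, you could build the window start times explicitly and slice a 24-hour chunk for each one. A rough sketch along the same lines, reusing start, end and year_df from the proxy code above (run_model is a made-up placeholder for your function):

# window starts every 8 hours; each window covers the following 24 hours
window_starts = pd.date_range(start, end - pd.Timedelta("24h"), freq="8H")

def run_model(window):
    # placeholder: just count the rows in the 24-hour slice
    return len(window)

results = [
    run_model(year_df[(year_df["Timestamp"] >= t) &
                      (year_df["Timestamp"] < t + pd.Timedelta("24h"))])
    for t in window_starts
]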

Pandas dataframe with multiple datetime

I have a pandas dataframe that has datetime in multiple columns and looks similar to below but with hundreds of columns, almost pushing 1k.
datetime, battery, datetime, temperature, datetime, pressure
2020-01-01 01:01:01, 13.8, 2020-01-01 01:01:02, 97, 2020-01-01 01:01:03, 10
2020-01-01 01:01:04, 13.8, 2020-01-01 01:01:05, 97, 2020-01-01 01:01:06, 11
What I have done is imported it and then converted every datetime column using pd.to_datetime. This reduces the memory usage by more than half (2.4GB to 1.0GB), but I'm wondering if this is still inefficient and maybe a better way.
Would I benefit from converting this down to 3 columns where I have datetime, data name, data measurement? If so what is the best method of doing this? I've tried this but end up with a lot of empty spaces.
Would there be another way to handle this data that I'm just not presenting?
Or does what I'm doing make sense and is it efficient enough?
I eventually want to plot some of this data by selecting specific data names.
I ran a small experiment with the above data and converting the data to date / type / value columns reduces the overall memory consumption:
print(df)
datetime battery datetime.1 temperature datetime.2 pressure
0 2020-01-01 01:01:01 13.8 2020-01-01 01:01:02 97 2020-01-01 01:01:03 10
1 2020-01-01 01:01:04 13.8 2020-01-01 01:01:05 97 2020-01-01 01:01:06 11
print(df.memory_usage().sum())
==> 224
After converting the dataframe:
dfs = []
for i in range(0, 6, 2):
    # take each (datetime, measurement) pair of columns
    d = df.iloc[:, i:i+2].copy()
    d["type"] = d.columns[1]
    d.columns = ["datetime", "value", "type"]
    dfs.append(d)
new_df = pd.concat(dfs)
print(new_df)
==>
datetime value type
0 2020-01-01 01:01:01 13.8 battery
1 2020-01-01 01:01:04 13.8 battery
0 2020-01-01 01:01:02 97.0 temperature
1 2020-01-01 01:01:05 97.0 temperature
0 2020-01-01 01:01:03 10.0 pressure
1 2020-01-01 01:01:06 11.0 pressure
print(new_df.memory_usage().sum())
==> 192
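A nice side effect of the long format is that selecting a specific data name for plotting becomes a simple filter (and converting the repeated type strings to a categorical dtype would shave the memory down further). A sketch, assuming the reshaped new_df above:

# plot a single measurement type from the long-format frame
temp = new_df[new_df["type"] == "temperature"]
temp.plot(x="datetime", y="value", title="temperature")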

How can I hourly resample a dataframe that has a column of tweets in it? (I would like to concatenate all tweets per hour)

I have a DataFrame that has a datetime index and tweets in one column, as well as other stats like the number of likes. I would like to resample the df at an hourly interval so that I get all tweets and the sum of all stats per hour, which I have done with the following code:
df.resample('60min').sum()
The problem is that my tweets column disappears, and I need it for a sentiment analysis.
I'm new to programming so thanks in advance for reading this!
IIUC, you want to group by the hour and use agg:
import numpy as np
import pandas as pd

# sample data (note: np.transpose makes both columns strings here)
np.random.seed(1)
df = pd.DataFrame(np.transpose([np.random.randint(1, 10, 1489), ['abc']*1489]),
                  index=pd.date_range('2020-01-01', '2020-02-01', freq='30T'),
                  columns=['num', 'tweet'])

# groupby the index floored to the hour, sum the num col
# and join the tweets with a semicolon or whatever you want
df.groupby(df.index.floor('H')).agg({'num': sum, 'tweet': '; '.join})
num tweet
2020-01-01 00:00:00 69 abc; abc
2020-01-01 01:00:00 61 abc; abc
2020-01-01 02:00:00 12 abc; abc
2020-01-01 03:00:00 87 abc; abc
2020-01-01 04:00:00 35 abc; abc
Or if you just want to join the strings as is then sum everything:
df.groupby(df.index.floor('H')).agg(sum)
num tweet
2020-01-01 00:00:00 69 abcabc
2020-01-01 01:00:00 61 abcabc
2020-01-01 02:00:00 12 abcabc
2020-01-01 03:00:00 87 abcabc
2020-01-01 04:00:00 35 abcabc
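The same idea also works with resample directly, which keeps the shape of your original call. A sketch, assuming the num/tweet frame from above (num is cast to int first so that 'sum' adds numbers rather than concatenating strings):

# resample hourly, summing the numeric column and joining tweets with a space
df['num'] = df['num'].astype(int)
hourly = df.resample('60min').agg({'num': 'sum', 'tweet': ' '.join})
print(hourly.head())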

Pandas and HDF5 aggregate performance

I'm trying to understand the ideal way to organise data within Pandas to achieve the best aggregation performance. The data I am dealing with comes in files of the form yyyy-mm.csv, which I just read_csv in and then to_hdf out. It generally looks something like this:
ObjectID  Timestamp            ParamA  ParamB  -->  ParamZ
       1  2013-01-01 00:00:00       1       9
       2  2013-01-01 00:00:00       3       2
       1  2013-01-01 00:10:00       8      11
       2  2013-01-01 00:10:00       6      14
There are about 50 object IDs, with a reading for each ID every 10 minutes for the whole month. The end result I want to achieve is aggregated data (e.g. the mean) for a single parameter, grouped by month (or potentially a finer resolution eventually), over say 5 years.
What I've discovered so far is that an HDFStore.select of a single column isn't really a great deal quicker than bringing all of the params into a single data frame at once. It therefore feels very wasteful and the performance is not great. Without knowing exactly why this is, I can't really decide the best way to move forward. It seems that if the data were transposed so that yyyy-mm ran along the x axis, with dd hh:mm:ss down the y axis, and there were one of these data frames per parameter, then performance would improve massively, since more data could be brought in per hit. The groupbys are really quick once things have been read in from disk. However, I'm not at all convinced that this is how it is supposed to be used. Can anyone advise on the best way to organise and store the data?
Thanks
Please review the HDFStore docs here, and the cookbook recipes here.
PyTables stores data in a row-oriented format, so it generally behooves you to have long and not-so-wide tables. However, if you tend to query for and need/want the entire row, then the width does not present a problem.
On the other hand, if you are generally after a small subset of columns, you will want to shard the table into multiple tables (possibly with the same indexing scheme), so you can use a 'master' table to run the query and then select 'columns' (other tables) as needed. You can accomplish this via the append_to_multiple/select_as_multiple methods, for example. Taken to the extreme, you could store each single column in a separate group and make yourself a column-oriented table. However, this will slow things down substantially if you tend to select a lot of columns.
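A rough sketch of that sharding pattern, assuming a DataFrame df shaped like the sample at the top of the question and indexed by ObjectID/Timestamp (the file and table names here are made up):

import pandas as pd

# df has columns ObjectID, Timestamp, ParamA ... ParamZ as in the question
df = df.set_index(["ObjectID", "Timestamp"])

with pd.HDFStore("readings.h5") as store:
    # 'master' carries ParamA and acts as the selector; everything else goes to 'params'
    store.append_to_multiple(
        {"master": ["ParamA"], "params": None},   # None = all remaining columns
        df,
        selector="master",
    )
    # run the query against the selector table, pull matching rows from both
    subset = store.select_as_multiple(
        ["master", "params"],
        where="ObjectID=1",
        selector="master",
    )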
Furthermore you always want to have the queryable columns as indexes or data_columns, as these allow queries in the first place and are indexed.
So it comes down to the ratio of queries that select lots of columns vs single-column selections.
For example
In [5]: df = DataFrame(np.random.randn(16,2),
   ...:                columns=['A','B'],
   ...:                index=MultiIndex.from_tuples(
   ...:                    [ (i,j) for i in range(4) for j in date_range(
   ...:                          '20130101 00:00:00',periods=4,freq='10T') ],
   ...:                    names=['id','date']))
In [6]: df
Out[6]:
A B
id date
0 2013-01-01 00:00:00 -0.247945 0.954260
2013-01-01 00:10:00 1.035678 -0.657710
2013-01-01 00:20:00 -2.399376 -0.188057
2013-01-01 00:30:00 -1.043764 0.510098
1 2013-01-01 00:00:00 -0.009998 0.239947
2013-01-01 00:10:00 2.038563 0.640080
2013-01-01 00:20:00 1.123922 -0.944170
2013-01-01 00:30:00 -1.757766 -1.398392
2 2013-01-01 00:00:00 -1.053324 -1.015211
2013-01-01 00:10:00 0.062408 -1.476484
2013-01-01 00:20:00 -1.202875 -0.747429
2013-01-01 00:30:00 -0.798126 -0.485392
3 2013-01-01 00:00:00 0.496098 0.700073
2013-01-01 00:10:00 -0.042914 1.099115
2013-01-01 00:20:00 -1.762597 -0.239100
2013-01-01 00:30:00 -0.344125 -1.607524
[16 rows x 2 columns]
In 0.12, use table=True rather than format='table'.
In [7]: df.to_hdf('test.h5','df',mode='w',format='table')
In [8]: store = pd.HDFStore('test.h5')
In [9]: store
Out[9]:
<class 'pandas.io.pytables.HDFStore'>
File path: test.h5
/df frame_table (typ->appendable_multi,nrows->16,ncols->4,indexers->[index],dc->[date,id])
In [10]: store.select('df',where='id=0')
Out[10]:
A B
id date
0 2013-01-01 00:00:00 -0.247945 0.954260
2013-01-01 00:10:00 1.035678 -0.657710
2013-01-01 00:20:00 -2.399376 -0.188057
2013-01-01 00:30:00 -1.043764 0.510098
[4 rows x 2 columns]
This is 0.13 syntax; it is a bit trickier in 0.12.
In [18]: store.select('df',where='date>"20130101 00:10:00" & date<"20130101 00:30:00"')
Out[18]:
A B
id date
0 2013-01-01 00:20:00 -2.399376 -0.188057
1 2013-01-01 00:20:00 1.123922 -0.944170
2 2013-01-01 00:20:00 -1.202875 -0.747429
3 2013-01-01 00:20:00 -1.762597 -0.239100
[4 rows x 2 columns]
In [19]: store.close()
So, for example, to do a groupby on the id, you can select all of the unique ids (using the select_column method), then iterate over these, doing a query and performing your function on the results. This will be quite fast, and these are indexed columns. Something like this:
In [24]: ids = store.select_column('df','id').unique()
In [25]: ids
Out[25]: array([0, 1, 2, 3])
In [27]: pd.concat([ store.select('df',where='id={0}'.format(i)).sum() for i in ids ],axis=1)
Out[27]:
0 1 2 3
A -2.655407 1.394721 -2.991917 -1.653539
B 0.618590 -1.462535 -3.724516 -0.047436
[2 rows x 4 columns]
A multi-groupby is just a combination query, e.g. id=1 & date>='20130101 00:10:00' & date<='20130101 00:30:00'
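In code, that combined query is just another where string against the same store and table as above:
In [28]: store.select('df', where="id=1 & date>='20130101 00:10:00' & date<='20130101 00:30:00'")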
You might find this example instructive as well here
