Creating multiple subsets of a timeseries pandas dataframe by weekly intervals - python

New to Python. I have a DataFrame with a datetime column (essentially a huge time series). I basically want to divide this into multiple subsets, where each subset DataFrame contains one week's worth of data, starting from the first timestamp. I have been trying this with groupby and Grouper, but iterating over the result returns tuples that don't themselves contain a week's worth of data. In addition, the Grouper (formerly TimeGrouper) documentation isn't very clear on this.
This is what I tried. Any better ideas or approaches?
grouped = uema_label_format.groupby(pd.Grouper(key='HEADER_START_TIME', freq='W'))
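Note that iterating over grouped yields (label, sub-DataFrame) pairs, so the weekly frames can in fact be collected directly. A minimal sketch, assuming pandas 1.1 or newer for the origin argument (which anchors the bins to the first timestamp instead of calendar week boundaries):
import pandas as pd

# 7-day bins anchored to the earliest timestamp rather than calendar weeks
grouped = uema_label_format.groupby(
    pd.Grouper(key='HEADER_START_TIME', freq='7D', origin='start')
)

# each group is itself a DataFrame holding one week's worth of rows
weekly_subsets = [week_df for _, week_df in grouped]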

If your dataset is really big, it could be worth externalising this work to a time-series database and then querying it for each week you are interested in. The results can then be loaded into pandas, while the database handles the heavy lifting. For example, in QuestDB you could get the current week as follows:
select * from yourTable where timestamp = '2020-06-22;7d'
Although this would return the data for a single week, you could iterate on it to get the individual weekly objects quickly, since the results come back near-instantly. Also, you can easily change the sample interval after the fact, for example to monthly using 1M, and still get an immediate response.
You can try this here using this query as an example to get one week's worth of data (roughly 3M rows) out of a 1.6-billion-row NYC taxi dataset.
select * from trips where pickup_datetime = '2015-08-01;7d';
If this solves your use case, there is a tutorial on how to get query results from QuestDB into pandas here.
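As a rough illustration of that last step (not the tutorial's exact code), QuestDB's HTTP export endpoint returns CSV that pandas can read directly; the port and query below are assumptions:
import io

import pandas as pd
import requests

# /exp exports query results as CSV (QuestDB's default HTTP port is 9000)
resp = requests.get(
    "http://localhost:9000/exp",
    params={"query": "select * from trips where pickup_datetime = '2015-08-01;7d'"},
)
df = pd.read_csv(io.StringIO(resp.text))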

Related

Approach to storing forecast time series data using python

So I want to scrape a weather forecast table once a day and store my results for future analysis. I want to store the data, but I'm not sure how.
Example of the data: Forecast Table
My four variables of interest are wind speed, wind gusts, wave height, and wave period.
This is my first python project involving time series data and I’m fairly new to databases so go easy on me and ELI5 please.
In the Python for Everybody course I recently took, I learned about relational databases and using SQLite. The main idea was to be efficient in storing data and never store the same data twice. However, none of the examples involved time series data, so now I'm not sure what the best approach is.
Suppose I create a table for each variable, plus one for the date I scraped the forecast, with the scrape date serving as the primary key. In this scheme, the first column of a variable table such as wind speed would be the date of scraping, followed by columns for the forecasted values at each timestamp. Although this makes storage more efficient than creating a new table every day, there are a few problems. The timestamps are not uniform (see image; forecast times run only from 3am to 9pm). Also, depending on the time of day the forecast is scraped, the date and time values of the timestamps keep changing, so the next timestamp is not always two hours later.
Seeing as each time I scrape the forecast I get a new table, should I create a new database table each time in SQLite? This seems like a rather rudimentary solution, and I'm sure there are better ways to store the data.
How would you go about this?
Summarizing my comments:
You may want to append forecast data from each new scrape to the existing data in the same database table.
From each new web scrape you will get approx. 40 new records with the same scraping timestamp but different forecast timestamps.
e.g., these would be the columns of the table, using ID as the primary key with AUTOINCREMENT:
ID
Scraping_time
Forecast_hours
Wind_speed
Wind_gusts
Wind_direction
Wave_height
Wave_period
Wave_direction
Note:
if you use SQLite, you could leave out the ID column, as SQLite adds such a ROWID column by default if no other primary key has been specified
(https://www.sqlite.org/autoinc.html)
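A minimal sketch of that single-table design using Python's built-in sqlite3 module (the table and column names here are just illustrative):
import sqlite3

conn = sqlite3.connect("forecasts.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS forecast (
        id             INTEGER PRIMARY KEY AUTOINCREMENT,
        scraping_time  TEXT,   -- when the forecast was scraped
        forecast_time  TEXT,   -- the timestamp the forecast applies to
        wind_speed     REAL,
        wind_gusts     REAL,
        wind_direction TEXT,
        wave_height    REAL,
        wave_period    REAL,
        wave_direction TEXT
    )
""")

# each scrape appends ~40 rows sharing the same scraping_time
rows = [("2023-01-01 08:00", "2023-01-01 15:00", 12.0, 18.5, "NW", 1.4, 9.0, "W")]
conn.executemany(
    "INSERT INTO forecast (scraping_time, forecast_time, wind_speed, wind_gusts,"
    " wind_direction, wave_height, wave_period, wave_direction)"
    " VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    rows,
)
conn.commit()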

Extract data faster from Redis and store in a pandas DataFrame by avoiding key generation

I am using Redis with Python to store my per-second ticker data (price and volume of an instrument). I am performing r.hget(instrument, key) and facing the following issue.
My key (string) looks like 01/01/2020-09:32:01 and increments per second up to the user-specified interval.
For example: 01/01/2020-09:32:01, 01/01/2020-09:32:02, 01/01/2020-09:32:03, ...
My r.hget(instrument, key) result looks like b'672.2,432' (price and volume separated by a comma).
The issue I am facing is that a user can specify a long time interval, like 2 years; that is, they want the data from 01/01/2020 to 31/12/2020 (d/m/y format). So to perform the get operation I first have to generate the timestamps for that period and then perform the get operation to form a pandas DataFrame. Generating these timestamps to use as keys is slowing down my process terribly (though it also ensures strict ordering; for example, 01/01/2020-09:32:01 will definitely come before 01/01/2020-09:32:02). Is there another way to achieve the same?
If I simply do r.hgetall(...), I won't be able to satisfy the user's time interval condition.
Redis sorted sets are a good fit for such range queries. Sorted sets are made up of unique members, each with a score; in your case the timestamp (in epoch seconds) can be the score, and price and volume can be the member. However, since members in a sorted set must be unique, you may consider appending the timestamp to make each member unique.
zadd instrument 1577883600 672.2,432,1577883600
zadd instrument 1577883610 672.2,412,1577883610
After adding members to the set, you can do range queries using zrangebyscore as below:
zrangebyscore instrument 1577883600 1577883610
If your instrument set contains many values, then consider sharding it into multiple sets, for example one per month: instrument:202001, instrument:202002, and so on.
The following are good reads on this topic:
Sorted Set Time Series
Sharding Structure
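For completeness, the same approach from Python with redis-py might look like this sketch (key names and values taken from the commands above):
import redis

r = redis.Redis()

# score = epoch seconds; the timestamp is embedded in the member to keep it unique
r.zadd("instrument", {"672.2,432,1577883600": 1577883600})
r.zadd("instrument", {"672.2,412,1577883610": 1577883610})

# inclusive range query over the time window
for member in r.zrangebyscore("instrument", 1577883600, 1577883610):
    price, volume, ts = member.decode().split(",")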
So to perform the get operation I have to first generate timestamps for that period and then perform the get operation...
No. This is the problem.
Make a function that calculates the timestamps and yields smaller sets of values, for a smaller time span (one week or one month).
So the new workflow will run in batches; see this loop:
generate a small set of timestamps
fetch items from Redis
Pros:
minimizes memory usage
easy to adapt your current code to this new algorithm
I don't know the Redis-specific functions, so other, more specific solutions may be better. My idea is a general approach that I have used successfully for other problems.
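A rough sketch of that batched workflow, assuming the d/m/Y-H:M:S key format from the question and redis-py's hmget (which fetches a whole batch of keys in one round trip):
from datetime import datetime, timedelta

import pandas as pd
import redis

r = redis.Redis()

def key_batches(start, end, seconds_per_batch=86400):
    """Yield lists of per-second keys covering [start, end), one batch at a time."""
    t = start
    while t < end:
        batch_end = min(t + timedelta(seconds=seconds_per_batch), end)
        n = int((batch_end - t).total_seconds())
        yield [(t + timedelta(seconds=i)).strftime("%d/%m/%Y-%H:%M:%S")
               for i in range(n)]
        t = batch_end

rows = []
for keys in key_batches(datetime(2020, 1, 1), datetime(2020, 1, 8)):
    # missing keys come back as None and are skipped
    for key, raw in zip(keys, r.hmget("instrument", keys)):
        if raw is not None:
            price, volume = raw.decode().split(",")
            rows.append((key, float(price), int(volume)))

df = pd.DataFrame(rows, columns=["timestamp", "price", "volume"])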
Have you considered using RedisTimeSeries for this task? It is a redis module that is tailored exactly for the sort of task you are describing.
You can keep two time series per instrument to hold price and volume.
With RedisTimeSeries it is easy to query over different ranges, and you can use the filtering mechanism to group different series (instrument families, for example) and query all of them at once.
// create your timeseries
TS.CREATE instrument:price LABELS instrument violin type string
TS.CREATE instrument:volume LABELS instrument violin type string
// add values
TS.ADD instrument:price 123456 9.99
TS.ADD instrument:volume 123456 42
// query timeseries
TS.RANGE instrument:price - +
TS.RANGE instrument:volume - +
// query multiple timeseries by filtering according to labels
TS.MRANGE - + FILTER instrument=violin
TS.MRANGE - + FILTER type=string
RedisTimeSeries allows running queries with aggregations such as average and standard deviation, and uses double-delta compression, which can reduce your memory usage by over 90%.
You can check out a benchmark here.
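If you are on Python, recent redis-py versions (4+) expose these commands through a ts() client, assuming the RedisTimeSeries module is loaded on the server; a small sketch mirroring the commands above:
import redis

r = redis.Redis()

r.ts().create("instrument:price", labels={"instrument": "violin", "type": "string"})
r.ts().add("instrument:price", 123456, 9.99)

# full range, then hourly averages via server-side aggregation
samples = r.ts().range("instrument:price", "-", "+")
hourly = r.ts().range("instrument:price", "-", "+",
                      aggregation_type="avg", bucket_size_msec=3600000)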

what does "Aggregate the data to weekly level, so that there is one row per product-week combination" mean and how can I do it using python (pandas)

I am working on a transactions DataFrame using Python (Anaconda), and I was told to "aggregate the data to a weekly level so that there is one row per product-week combination".
I want to make sure the following code is correct, because I don't think I fully understood what I need to do:
dataset.groupby(['id', dataset['history_date'].dt.strftime('%W')])['sales'].sum()
Note my dataset contains the following:
id history_date item_id price inventory sales category_id
Aggregating data means combining records based on a certain criterion, in order to narrow the data down.
For example, it sounds like your dataset may be broken down by daily dates, where each row corresponds to a specific date.
What you need to do is aggregate the data into weekly segments, instead of having it broken down on a daily basis.
This is achieved by grouping your dataset by the date and by the most granular, specific field in it (here, the product), as sketched below.
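Concretely, a minimal sketch using the columns listed in the question (assuming id identifies the product and history_date is already a datetime column). Note that pd.Grouper keeps weeks from different years separate, whereas strftime('%W') alone would merge week 01 of every year:
import pandas as pd

# one row per product-week combination
weekly = (
    dataset
    .groupby(["id", pd.Grouper(key="history_date", freq="W")])["sales"]
    .sum()
    .reset_index()
)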

pandas, trying to sample only 5 rows per movie_id from a DataFrame with too many rows

I have a huge DataFrame df in terms of total rows.
In fact, it has too many rows, and I need to limit the row count in a sensible way while still ensuring each movie has the same number of reviews in the DataFrame (currently this varies greatly).
The DataFrame has a shape such as this:
The first column is userID, the second column is animeID (essentially a movieID), and the third column is that user's own rating of the movie. Each row is a movie review. There should be about 300 movieIDs in the animeID column.
What I need to do in pandas is limit the number of rows by resampling the DataFrame so it has only something like 5 rows per animeID (i.e. the movieID), with the new DataFrame containing only those newly sampled rows. I got totally stuck on how to do this in pandas. I could maybe have done it in Excel somewhat easily, but I don't want to split my preprocessing between Excel stages and pandas stages...
I'm pretty sure each animeID has at least 1000 rows (each row is an individual movie review, possibly by the same user or a different one). I just need to limit the number of rows (movie reviews) so that all movies still have reviews about them, but I'm still able to process the data.
I will have about 300 movies (300 animeIDs), each of which I know has at least 1000 reviews. The main problem is that some of the movies have a huge number of reviews, like tens of thousands.
ratingsDataframe
I cannot think of any single function applicable to your case. Instead, you may try the following lines, where df is the original DataFrame you want to sample from:
import numpy as np
import pandas as pd

df_sample = pd.DataFrame([], columns=df.columns)
for i in np.unique(df['animeID']):
    # draw 5 random reviews for this anime and append them to the running sample
    new_sample = df[df['animeID'] == i].sample(n=5)
    df_sample = pd.concat([df_sample, new_sample], axis=0)
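For what it's worth, if your pandas version is 1.1 or newer, there is a single call that does this:
df_sample = df.groupby("animeID").sample(n=5)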
Try converting the DataFrame into a NumPy array; the problem then reduces to just playing around with arrays. The code for converting the DataFrame into an array is:
<numpy_array_name> = <dataframe_name>.values
Hope this helps you. If you still want to work with DataFrames, then check out this article.

Python: aggregating data by row count

I'm trying to aggregate this call center data in various ways in Python, for example mean q_time by type and priority. This is fairly straightforward using df.groupby.
However, I would also like to be able to aggregate by call volume. The problem is that each line of the data represents a single call, so I'm not sure how to do it. If I'm just grouping by date, then I can use 'count' as the aggregate function, but how would I aggregate by e.g. weekday, i.e. create a data frame like:
weekday mean_row_count
1 100
2 150
3 120
4 220
5 200
6 30
7 35
Is there a good way to do this? All I can think of is looping through each weekday and counting the number of unique dates, then dividing the counts per weekday by the number of unique dates. But I think this could get messy, and it might really slow things down if I also need to group by other variables, or do it by date and hour of the day.
Since the date of each call is given, one idea is to implement a function that determines the day of the week from a given date. There are many ways to do this, such as Conway's Doomsday algorithm:
https://en.wikipedia.org/wiki/Doomsday_rule
One can then go through each line, determine the weekday, and add to the count for that weekday.
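In practice, pandas' datetime accessor already knows the weekday, so no calendar algorithm is needed by hand. A minimal sketch of the weekday means, assuming the calls DataFrame df has a datetime column named date:
import pandas as pd

# calls per calendar day, then the average of those counts per weekday (1 = Monday)
daily_counts = df.groupby(df["date"].dt.normalize()).size()
mean_by_weekday = daily_counts.groupby(daily_counts.index.dayofweek + 1).mean()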
When I find myself wondering how to aggregate and query data in a versatile way, I think the solution is probably a database. SQLite is a lightweight embedded database with high performance for simple use cases, and Python has native support for it.
My advice here is: create a database and a table for your data, optionally add ancillary tables depending on your needs, load the data into it, and use the interactive sqlite3 shell or Python scripts for your queries.
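A small sketch of that approach, assuming the call data is already in a pandas DataFrame df with a date column (SQLite's strftime('%w') returns 0 for Sunday through 6 for Saturday):
import sqlite3

import pandas as pd

conn = sqlite3.connect("calls.db")
df.to_sql("calls", conn, if_exists="replace", index=False)

# count calls per day, then average those counts per weekday
query = """
    SELECT weekday, AVG(n) AS mean_row_count
    FROM (
        SELECT strftime('%w', date) AS weekday, date(date) AS d, COUNT(*) AS n
        FROM calls
        GROUP BY d
    )
    GROUP BY weekday
"""
print(pd.read_sql_query(query, conn))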
