Calculating average of multiple sets of data (performance issue)

Calculating average of multiple sets of data (performance issue) - python

I have a needs to do calculation like average of selected data grouped by time rage collections.
Example:
Table which is storing data has several main columns which are:
| time_stamp | external_id | value |
Now i want to calculate average for 20 (or more) groups of date ranges:
1) 2000-01-01 00-00-00 -> 2000-01-04 00-00-00
2) 2000-01-04 00-00-00 -> 2000-01-15 00-00-00
...
The important thing is that there are no gaps and intersections between groups so it means that first date and last date are covering full time range.
The other important thing is that in set of "date_from" to "date_to" there can be rows for outside of the collection (unneeded external_id's).
I have tried 2 approaches:
1) Execute query for each "time range" step with average function in SQL query (but i don't like that - it's consuming too much time for all queries, plus executing multiple queries sounds like not good approach)
2) I have selected all required rows (at one SQL request) and then i made loop over the results. The problem is that i have to check on each step to which "data group" current datetime belongs. This seams like a better approach (from SQL perspective) but right now i have not too good performance because of loop in the loop. I need to figure out how to avoid executing loop (checking to which group current timestamp belongs) in the main loop.
Any suggestions would be much helpful.

Actually both approaches are nice, and both could benefit on the index on the time_stamp column in your database, if you have it. I will try to provide advice on them:
Multiple queries are not such a bad idea, your data looks to be pretty static, and you can run 20 select avg(value) from data where time_stamp between date_from and date_to-like queries in 20 different connections to speed up the total operation. You'll eliminate need of transferring a lot of data to your client from DB as well. The downside would be that you need to include an additional where condition to exclude rows with unneeded external_id values. This complicates the query and can slow the processing down a little if there are a lot of these values.
Here you could sort the data on server by time_stamp index before sending and then just checking if your current item is from a new data range (because of sorting you will be sure later items will be from later dates). This would reduce the inner loop to an if statement. I am unsure this is the bottleneck here, though. Maybe you'd like to look into streaming the results instead of waiting them all to be fetched.

Related

Creating multiple subsets of a timeseries pandas dataframe by weekly intervals

New to python. I have a dataframe with a date time column (essentially a huge time series data). I basically want to divide this into multiple subsets where each subset data frame contains one week worth of data (starting from the first timestamp). I have been trying this with groupBy and Grouper but it returns tuples which themselves don't contain a week's worth of data. In addition, the Grouper (Erstwhile TimeGrouper) documentation isn't very clear on this.
This is what I tried. Any better ideas or approaches?
grouped = uema_label_format.groupby(pd.Grouper(key='HEADER_START_TIME', freq='W'))

If your dataset is really big, it could be worth externalising this work to a time-series database and then query it to get each week you are interested in. These results can then be loaded into pandas, but the database handles the heavy lifting. For example in QuestDB you could get the current week as follows
select * from yourTable where timestamp = '2020-06-22;7d'
Although this would return the data for a single week, you could iterate on this to get the individual objects quickly since the results are instantaneous. Also, you can easily change the sample interval after the fact, for example to monthly using 1M. This would still be an instant response.
You can try this here using this query as an example to get one week worth of data (roughly 3M rows) out of a 1.6 billion rows NYC taxi dataset.
select * from trips where pickup_datetime = '2015-08-01;7d';
If this would solve your use case, there is a tutorial on how to get query results from QuestDB to pandas here.

Extract data faster from Redis and store in Panda Dataframe by avoiding key generation

I am using Redis with Python to store my per second ticker data (price and volume of an instrument). I am performing r.hget(instrument,key) and facing the following issue.
My key (string) looks like 01/01/2020-09:32:01 and goes on incrementing per second till the user specified interval.
For example 01/01/2020-09:32:01
01/01/2020-09:32:02 01/01/2020-09:32:03 ....
My r.hget(instrument,key) result looks likeb'672.2,432'(price and volume separated by a comma).
The issue am facing is that a user can specify a long time interval, like 2 years, that is, he/she wants the data from 01/01/2020 to 31/12/2020 (d/m/y format).So to perform the get operation I have to first generate timestamps for that period and then perform the get operation to form a panda dataframe. The generation of this datastamp to use as key for get operation is slowing down my process terribly (but it also ensures that the data is in strict ordering. For example 01/01/2020-09:32:01 will definitely be before 01/01/2020-09:32:02). Is there another way to achieve the same?
If I simply do r.hgetall(...) I wont be able to satisfy the time interval condition of user.

redis sorted set's are good fit for such range queries, sorted sets are made up of unique member's with a score, in your case timestamp can be score in epoch seconds and price and volume can be member, however member in sorted set is unique you may consider adding timestamp to make it unique.
zadd instrument 1577883600 672.2,432,1577883600
zadd instrument 1577883610 672.2,412,1577883610
After adding members to the set you can do range queries using zrangebyscore as below
zrangebyscore instrument 1577883600 1577883610
If your instrument contains many values then consider sharding it into multiple for example per month each set like instrument:202001, instrument:202002 and so on.
following are good read on this topic
Sorted Set Time Series
Sharding Structure

So to perform the get operation I have to first generate timestamps for that period and then perform the get operation...
No. This is the problem.
Make a function that calculates the timestamps and yield a smaller set of values, for a smaller time span (one week or one month).
So the new workflow will be in batches, see this loop:
generate a small set of timestamps
fetch items from redis
Pros:
minimize the memory usage
easy to change your current code to this new algo.
I don't know about redis specific functions, so other specific solutions can be better. My idea is a general approach, I used it with success for other problems.

Have you considered using RedisTimeSeries for this task? It is a redis module that is tailored exactly for the sort of task you are describing.
You can keep two timeseries per instrument that will hold price and value.
With RedisTimeSeries is it easy the query over different ranges and you can use the filtering mechanism to group different series, instrument families for example, and query all of them at once.
// create your timeseries
TS.CREATE instrument:price LABELS instrument violin type string
TS.CREATE instrument:volume LABELS instrument violin type string
// add values
TS.ADD instrument:price 123456 9.99
TS.ADD instrument:volume 123456 42
// query timeseries
TS.RANGE instrument:price - +
TS.RANGE instrument:volume - +
// query multiple timeseries by filtering according to labels
TS.MRANGE - + FILTER instrument=violin
TS.MRANGE - + FILTER type=string
RedisTimeSeries allows running queries with aggregations such as average standard-deviation, and uses double-delta compression which can reduce your memory usage by over 90%.
You can checkout a benchmark here.

Python: aggregating data by row count

I'm trying to aggregate this call center data in various different ways in Python, for example mean q_time by type and priority. This is fairly straightforward using df.groupby.
However, I would also like to be able to aggregate by call volume. The problem is that each line of the data represents a call, so I'm not sure how to do it. If I'm just grouping by date then I can just use 'count' as the aggregate function, but how would I aggregate by e.g. weekday, i.e. create a data frame like:
weekday mean_row_count
1 100
2 150
3 120
4 220
5 200
6 30
7 35
Is there a good way to do this? All I can think of is looping through each weekday and counting the number of unique dates, then dividing the counts per weekday by the number of unique dates, but I think this could get messy and maybe really slow it down if I need to also group by other variables, or do it by date and hour of the day.

Since the date of each call is given, one idea is to implement a function to determine the day of the week from a given date. There are many ways to do this such as Conway's Doomsday algorithm.
https://en.wikipedia.org/wiki/Doomsday_rule
One can then go through each line, determine the week day, and add to the count for each weekday.

When I find myself thinking how to aggregate and query data in a versatile way, it think that the solution is probably a database. SQLite is a lightweight embedded database with high performances for simple use cases, and Python and a native support for it.
My advice here is : create a database and a table for your data, eventually add ancillary tables depending on your needs, load data into it, and use interative sqlite or Python scripts for your queries.

Django database planning - time series data

I would like some advice on how to best organize my django models/database tables to hold the data in my webapp
Im designing a site that will hold a users telemetry data from a racing sim game. So there will be a desktop companion app that will sample the game data every 0.1 seconds for a variety of information (car, track, speed, gas, brake, clutch, rpm, etc). For example, in a 2 minute race, each of those variables will hold 1200 data points (10 samples a second * 120 seconds).
The important thing here is that this list of data can be as many as 20 variables, and could potentially grow in the future. So 1200 * the number of variables you have is the amount of data for an individual race session. If a single user submits 100 sessions, and there are 100 users....the amount of data adds up very quickly.
The app will then ship all this data for a race session off to the database for the website. The data MUST be transferred between game and website via a CSV file. So structurally I am limited to what CSV can do. The website will then allow you to choose a race session/lap and plot this information on separate time series graphs (for each variable), and importantly allow you to plot your session against somebody elses to see where differences lie
My question here is how do you structure such a database to hold this much information?
The simplest structure I have in my mind is to have a separate table for each race track, then each row/entry will be a race session on that track. Fields in this table will be the variables above.
The problem I have is:
1) most of the variables in the list above are time series data and not individual values (e.g. var speed might look like: 70, 72, 74, 77, 72, 71, 65 where the values are samples spaced 0.1 seconds apart over the course of the entire lap). How do you store this type of information in a table/field?
2) The length of each var in the list above will always be the same length for any single race session (if your lap took 1min 35 then you all your vars will only capture data for that length of time), but given that I want to be able to compare different laps with each other, session times will be different for each lap. In other words, however I store the time series data for those variables, it must be variable in size
Any thoughts would be appreciated

One thing that may help you with HUGE tables is partitioning. Judging by the postgresql tag that you set for your question, take a look here: http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
But for a start I would go with a one, simple table, supported by a reasonable set of indexes. From what I understand, each data entry in the table will be identified by race session id, player id and time indicator. Those columns should be covered with indexes according to your querying requirements.
As for your two questions:
1) You store those informations as simple integers. Remember to set a proper data types for those columns. For e.g. if you are 100% sure that some values will be very small, you can use smallint data type. More on integer data types here: http://www.postgresql.org/docs/9.3/static/datatype-numeric.html#DATATYPE-INT
2) That won't be a problem if you every var list will be different row in the table. You will be able to insert as many as you'd like.
So, to sum things up. I would start with a VERY simple single table schema. From django perspective this would look something like this:
class RaceTelemetryData(models.Model):
user = models.ForeignKey(..., index_db=True)
race = models.ForeignKey(YourRaceModel, index_db=True)
time = models.IntegerField()
gas = models.IntegerField()
speed = models.SmallIntegerField()
# and so on...
Additionaly, you should create an index (manually) for (user_id, race_id, time) columns, so looking up, data about one race session (and sorting it) would be quick.
In the future, if you'll find the performance of this single table too slow, you'll be able to experiment with additional indexes, or partitioning. PostgreSQL is quite flexible in modifying existing database structures, so you shouldn't have many problems with it.
If you decide to add a new variable to the collection, you will simply need to add a new column to the table.
EDIT:
In the end you end up with one table, that has at least these columns:
user_id - To specify which users data this row is about.
race_id - To specify which race data this row is about.
time - To identify the correct order in which to represent the data.
This way, when you want to get information on Joe's 5th race, you would look up rows that have user_id = 'Joe_ID' and race_id = 5, then sort all those rows by the time column.

Optimizing Sqlite3 for 20,000+ Updates

I have lists of about 20,000 items that I want to insert into a table (with about 50,000 rows in it). Most of these items update certain fields in existing rows and a minority will insert entirely new rows.
I am accessing the database twice for each item. First is a select query that checks whether the row exists. Next I insert or update a row depending on the result of the select query. I commit each transaction right after the update/insert.
For the first few thousand entries, I am getting through about 3 or 4 items per second, then it starts to slow down. By the end it takes more than 1/2 second for each iteration. Why might it be slowing down?
My average times are: 0.5 seconds for an entire run divided up as .18s per select query and .31s per insert/update. The last 0.01 is due to a couple of unmeasured processes to do with parsing the data before entering into the database.
Update
I've commented out all the commits as a test and got no change, so that's not it (any more thoughts on optimal committing would be welcome, though).
As to table structure:
Each row has twenty columns. The first four are TEXT fields (all set with the first insert) and the 16 are REAL fields, one of which is inputted with the initial insert statement.
Over time the 'outstanding' REAL fields will be populated with the process I'm trying to optimize here.
I don't have an explicit index, though one of the fields is unique key to each row.
I should note that as the database has gotten larger both the SELECT and UPDATE queries have taken more and more time, with a particularly remarkable deterioration in performance in the SELECT operation.
I initially thought this might be some kind of structural problem with SQLITE (whatever that means), but haven't been able to find any documentation anywhere that suggests there are natural limits to the program.
The database is about 60ish megs, now.

I think your bottleneck is that you commit with/avec each insert/update:
I commit each transaction right after the update/insert.
Either stop doing that, or at least switch to WAL journaling; see this answer of mine for why:
SQL Server CE 4.0 performance comparison
If you have a primary key you can optimize out the select by using the ON CONFLICT clause with INSERT INTO:
http://www.sqlite.org/lang_conflict.html
EDIT : Earlier I meant to write "if you have a primary key " rather than foreign key; I fixed it.

Edit: shame on me. I misread the question and somehow understood this was for mySQL rather that SQLite... Oops.
Please disregard this response, other than to get generic ideas about upating DBMSes. The likely solution to the OP's problem is with the overly frequent commits, as pointed in sixfeetsix' response.
A plausible explanation is that the table gets fragmented.
You can verify this fact by defragmenting the table every so often, and checking if the performance returns to the 3 or 4 items per seconds rate. (Which BTW, is a priori relatively slow, but then may depend on hardware, data schema and other specifics.) Of course, you'll need to consider the amount of time defragmentation takes, and balance this against the time lost by slow update rate to find an optimal frequency for the defragmentation.
If the slowdown is effectively caused, at least in part, by fragmentation, you may also look into performing the updates in a particular order. It is hard to be more specific without knowing details of the schema of of the overall and data statistical profile, but fragmentation is indeed sensitive to the order in which various changes to the database take place.
A final suggestion, to boost the overall update performance, is (if this is possible) drop a few indexes on the table, perform the updates, and recreate the indexes anew. This counter-intuitive approach works for relative big updates because the cost for re-creating new indexes is often less that the cumulative cost for maintaining them as the update progresses.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.