My users can supply a start and end date and my server will return a list of points between those two dates.
However, there are too many points within each hour, and I am only interested in picking one random point per 15-minute interval.
Is there an easy way to do this in App Engine?
You should add an indexed property to each Datastore entity to query on.
For example, you could create a "hash" property that contains the entity's date (in ms since epoch) modulo 15 minutes (in ms).
Then you just have to query with a filter such as hash = 0, or rather a random value between 0 and 15 minutes (in ms).
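A minimal sketch of that idea with the ndb client library (the model, property names and helper below are made up for illustration, not taken from the question):

from google.appengine.ext import ndb

FIFTEEN_MIN_MS = 15 * 60 * 1000

class Point(ndb.Model):
    timestamp = ndb.DateTimeProperty()
    # ms since epoch modulo 15 minutes, computed when the entity is written
    hash = ndb.IntegerProperty()

def sample_points(start, end, bucket_offset):
    # bucket_offset is a random value in [0, FIFTEEN_MIN_MS)
    return Point.query(
        Point.hash == bucket_offset,
        Point.timestamp >= start,
        Point.timestamp <= end,
    ).fetch()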
I am using Redis with Python to store my per-second ticker data (price and volume of an instrument). I am performing r.hget(instrument, key) and facing the following issue.
My key (a string) looks like 01/01/2020-09:32:01 and keeps incrementing every second until the end of the user-specified interval.
For example: 01/01/2020-09:32:01, 01/01/2020-09:32:02, 01/01/2020-09:32:03, ....
My r.hget(instrument, key) result looks like b'672.2,432' (price and volume separated by a comma).
The issue I am facing is that a user can specify a long time interval, like an entire year: he/she wants the data from 01/01/2020 to 31/12/2020 (d/m/y format). To perform the get operation I first have to generate the timestamps for that period and then perform one get per key to build a pandas DataFrame. Generating these timestamps to use as keys is slowing my process down terribly (although it also guarantees strict ordering; for example 01/01/2020-09:32:01 will definitely come before 01/01/2020-09:32:02). Is there another way to achieve the same result?
If I simply do r.hgetall(...), I won't be able to satisfy the user's time-interval condition.
Redis sorted sets are a good fit for such range queries. Sorted sets are made up of unique members with a score; in your case the timestamp (in epoch seconds) can be the score and the price and volume can be the member. However, since members in a sorted set must be unique, you may consider appending the timestamp to make each member unique.
zadd instrument 1577883600 672.2,432,1577883600
zadd instrument 1577883610 672.2,412,1577883610
After adding members to the set, you can do range queries using ZRANGEBYSCORE as below:
zrangebyscore instrument 1577883600 1577883610
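The same idea from Python with redis-py (a minimal sketch; the key name and the "price,volume,timestamp" member layout follow the commands above):

import redis

r = redis.Redis()

# score = epoch seconds, member = "price,volume,timestamp" so members stay unique
r.zadd("instrument", {"672.2,432,1577883600": 1577883600})
r.zadd("instrument", {"672.2,412,1577883610": 1577883610})

# fetch everything between two timestamps, already ordered by score
for raw in r.zrangebyscore("instrument", 1577883600, 1577883610):
    price, volume, ts = raw.decode().split(",")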
If your instrument contains many values, consider sharding it into multiple sets, for example one set per month, like instrument:202001, instrument:202002 and so on.
The following are good reads on this topic:
Sorted Set Time Series
Sharding Structure
So to perform the get operation I have to first generate timestamps for that period and then perform the get operation...
No. This is the problem.
Make a function that calculates the timestamps and yields a smaller set of values for a smaller time span (one week or one month).
So the new workflow will run in batches; see this loop (a concrete sketch follows at the end of this answer):
generate a small set of timestamps
fetch items from redis
Pros:
minimize the memory usage
easy to change your current code to this new algo.
I don't know about Redis-specific functions, so other, more specific solutions may be better. My idea is a general approach that I have used successfully on other problems.
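A rough sketch of that batching idea in Python (the key format matches the question; the one-week batch size and the use of HMGET are my own assumptions):

from datetime import datetime, timedelta

def timestamp_batches(start, end, batch=timedelta(days=7)):
    # yield lists of per-second key strings, one batch (e.g. one week) at a time
    current = start
    while current < end:
        batch_end = min(current + batch, end)
        keys = []
        t = current
        while t < batch_end:
            keys.append(t.strftime("%d/%m/%Y-%H:%M:%S"))
            t += timedelta(seconds=1)
        yield keys
        current = batch_end

# usage: fetch each batch with one HMGET instead of one HGET per key
# for keys in timestamp_batches(datetime(2020, 1, 1, 9, 32, 1), datetime(2020, 1, 2, 0, 0, 0)):
#     values = r.hmget("instrument", keys)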
Have you considered using RedisTimeSeries for this task? It is a Redis module tailored exactly to the sort of task you are describing.
You can keep two time series per instrument, one holding the price and one the volume.
With RedisTimeSeries it is easy to query over different ranges, and you can use the filtering mechanism to group different series (instrument families, for example) and query all of them at once.
// create your timeseries
TS.CREATE instrument:price LABELS instrument violin type string
TS.CREATE instrument:volume LABELS instrument violin type string
// add values
TS.ADD instrument:price 123456 9.99
TS.ADD instrument:volume 123456 42
// query timeseries
TS.RANGE instrument:price - +
TS.RANGE instrument:volume - +
// query multiple timeseries by filtering according to labels
TS.MRANGE - + FILTER instrument=violin
TS.MRANGE - + FILTER type=string
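The same commands can be issued from Python; here is a minimal sketch using redis-py's generic execute_command (this assumes the RedisTimeSeries module is loaded on the server and avoids needing a dedicated client library):

import redis

r = redis.Redis()

# create the two series with labels, as in the commands above
r.execute_command("TS.CREATE", "instrument:price", "LABELS", "instrument", "violin", "type", "string")
r.execute_command("TS.CREATE", "instrument:volume", "LABELS", "instrument", "violin", "type", "string")

# add samples: key, timestamp, value
r.execute_command("TS.ADD", "instrument:price", 123456, 9.99)
r.execute_command("TS.ADD", "instrument:volume", 123456, 42)

# query one series over its full range, or several series by label
prices = r.execute_command("TS.RANGE", "instrument:price", "-", "+")
by_label = r.execute_command("TS.MRANGE", "-", "+", "FILTER", "instrument=violin")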
RedisTimeSeries allows running queries with aggregations such as average and standard deviation, and it uses double-delta compression, which can reduce your memory usage by over 90%.
You can check out a benchmark here.
I need to do calculations, such as averaging selected data, grouped by a collection of time ranges.
Example:
The table storing the data has these main columns:
| time_stamp | external_id | value |
Now I want to calculate the average for 20 (or more) date ranges:
1) 2000-01-01 00-00-00 -> 2000-01-04 00-00-00
2) 2000-01-04 00-00-00 -> 2000-01-15 00-00-00
...
The important thing is that there are no gaps or intersections between the groups, which means the first and last dates cover the full time range.
The other important thing is that within a given "date_from" to "date_to" range there can be rows that do not belong to the collection (unneeded external_id values).
I have tried 2 approaches:
1) Execute a query for each time range, with an average function in the SQL query (but I don't like that: it consumes too much time across all the queries, and executing multiple queries does not sound like a good approach).
2) Select all required rows (in one SQL request) and then loop over the results. The problem is that at each step I have to check which "data group" the current datetime belongs to. This seems like a better approach (from the SQL perspective), but right now performance is not great because of a loop inside the loop. I need to figure out how to avoid running the inner loop (checking which group the current timestamp belongs to) inside the main loop.
Any suggestions would be very helpful.
Actually both approaches are fine, and both could benefit from an index on the time_stamp column in your database, if you have one. I will try to give advice on each:
Multiple queries are not such a bad idea. Your data looks fairly static, and you can run 20 queries like select avg(value) from data where time_stamp between date_from and date_to in 20 different connections to speed up the total operation. You also eliminate the need to transfer a lot of data from the DB to your client. The downside is that you need an additional where condition to exclude rows with unneeded external_id values; this complicates the query and can slow processing down a little if there are many such values.
For the second approach, you could sort the data on the server by the time_stamp index before sending it, and then simply check whether the current item belongs to a new date range (thanks to the sorting, you can be sure that later items come from later dates). This reduces the inner loop to an if statement, as in the sketch below. I am not sure this is the bottleneck here, though. You may also want to look into streaming the results instead of waiting for them all to be fetched.
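A minimal sketch of that single pass in Python (the row layout, range list and wanted_ids set are assumptions for illustration):

def averages_per_range(rows, ranges, wanted_ids):
    # rows: (time_stamp, external_id, value) tuples, already sorted by time_stamp
    # ranges: sorted list of (date_from, date_to) tuples with no gaps or overlaps
    results = []
    i = 0                       # index of the current range
    total, count = 0.0, 0
    for time_stamp, external_id, value in rows:
        # the stream is sorted, so crossing a boundary closes the current range
        while i < len(ranges) and time_stamp >= ranges[i][1]:
            results.append(total / count if count else None)
            total, count = 0.0, 0
            i += 1
        if i == len(ranges):
            break
        if external_id in wanted_ids:
            total += value
            count += 1
    # close out the current range and any remaining empty ranges
    while i < len(ranges):
        results.append(total / count if count else None)
        total, count = 0.0, 0
        i += 1
    return results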
I am building a very simple database in MySQL to track the movement of items. The current paper form looks like this:
Date       totalFromPreviousDay  NewToday  LeftToday  RemainAtEndOfDay
1.1.2017   5                     5         2          8  (5 + 5 - 2)
2.1.2017   8                     3         0          11 (8 + 3 - 0)
3.1.2017   11                    0         5          6  (11 + 0 - 5)
And so forth. In my table, I want to make totalFromPreviousDay and RemainAtEndOfDay calculated fields which I show only in my front end. That is mainly because we tend to erase things on paper due to errors; I want these fields to be derived from changes to the other two fields. So my table looks like this:
id
date
NewToday
LeftToday
Now the problem I am facing is that I want to select any date and be able to say "there were 5 items at the start of the day (carried over from the previous day), then 5 were added, 0 left, and the day ended with 10 items".
So far, I can't really think of a way to go about it. Theoretically, I want to try something like this: if the requested day is Feb. 1, 2017, start at 0, because that's the day we started collecting data. If not, loop through the records starting at 0 and do the math until the requested date is reached.
But that is obviously inefficient, because I have to go from the first date to the requested one every time.
Is my approach OK, or should I include the columns in the table? If the former, what would be the way to do it in Python/MySQL?
I think you have to step back a little and define the business needs first (it is worthwhile to talk to somebody who has worked with stock before), because these determine your table structure.
A system always tracks the current stock level and the movements. It is a business decision how often you save the historical stock level, and this influences how you store the data.
You may save the current stock level along with every transaction. In this case you would store the stock level in the transactions table. You do not even have to sum up the transactions per day, because the last transaction of each day carries that day's closing stock level anyway.
Alternatively, you may choose to save historic stock levels regularly (on a daily / weekly / monthly, etc. basis). In this case you will have a separate historic stock levels table with the stock id, the stock name (the name may change over time, so it may be a good idea to save it), the date and the level. If you want to know the historic stock level for a point in time that falls between your saved points, take the latest saved stock level before the point you are looking for and add up all transactions from that saved point onwards.
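A rough sketch of the second option in Python with mysql.connector (the snapshot and movement table/column names are hypothetical, only to illustrate the snapshot-plus-transactions idea):

import mysql.connector

def stock_level_at(conn, day):
    # closing stock level for `day`: latest snapshot before it plus later movements
    cur = conn.cursor()
    cur.execute(
        "SELECT snapshot_date, level FROM stock_snapshots "
        "WHERE snapshot_date <= %s ORDER BY snapshot_date DESC LIMIT 1",
        (day,),
    )
    row = cur.fetchone()
    snapshot_date, level = row if row else (None, 0)
    cur.execute(
        "SELECT COALESCE(SUM(NewToday - LeftToday), 0) FROM movements "
        "WHERE date > COALESCE(%s, '1000-01-01') AND date <= %s",
        (snapshot_date, day),
    )
    return level + cur.fetchone()[0]

# conn = mysql.connector.connect(user="...", password="...", database="...")
# print(stock_level_at(conn, "2017-01-03"))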
I'm trying to aggregate this call center data in various ways in Python, for example the mean q_time by type and priority. This is fairly straightforward using df.groupby.
However, I would also like to be able to aggregate by call volume. The problem is that each line of the data represents a call, so I'm not sure how to do it. If I'm just grouping by date then I can simply use 'count' as the aggregate function, but how would I aggregate by e.g. weekday, i.e. create a data frame like:
weekday mean_row_count
1 100
2 150
3 120
4 220
5 200
6 30
7 35
Is there a good way to do this? All I can think of is looping through each weekday and counting the number of unique dates, then dividing the counts per weekday by the number of unique dates, but I think this could get messy and might really slow things down if I also need to group by other variables, or do it by date and hour of the day.
Since the date of each call is given, one idea is to implement a function that determines the day of the week for a given date. There are many ways to do this, such as Conway's Doomsday algorithm.
https://en.wikipedia.org/wiki/Doomsday_rule
One can then go through each line, determine the weekday, and add to the count for that weekday.
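In Python you do not have to implement the weekday calculation yourself; here is a minimal pandas sketch (assuming a DataFrame df with a datetime column named 'date' holding the call timestamps, which is a hypothetical column name):

import pandas as pd

# number of calls per calendar date
calls_per_date = df.groupby(df["date"].dt.date).size().reset_index(name="count")

# weekday of each date (0 = Monday ... 6 = Sunday), then average the daily counts
calls_per_date["weekday"] = pd.to_datetime(calls_per_date["date"]).dt.dayofweek
mean_row_count = calls_per_date.groupby("weekday")["count"].mean()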
When I find myself thinking about how to aggregate and query data in a versatile way, I conclude that the solution is probably a database. SQLite is a lightweight embedded database with high performance for simple use cases, and Python has native support for it.
My advice here is: create a database and a table for your data, possibly add ancillary tables depending on your needs, load the data into it, and use the interactive sqlite3 shell or Python scripts for your queries.
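A minimal sqlite3 sketch of that setup (the table and column names are made up for illustration; the query computes the mean number of calls per weekday):

import sqlite3

conn = sqlite3.connect("calls.db")
conn.execute("CREATE TABLE IF NOT EXISTS calls (ts TEXT, type TEXT, priority INTEGER, q_time REAL)")
# conn.executemany("INSERT INTO calls VALUES (?, ?, ?, ?)", rows)  # load your data here

# count calls per date first, then average those counts per weekday (0 = Sunday in strftime('%w'))
query = """
SELECT strftime('%w', day) AS weekday, AVG(n) AS mean_row_count
FROM (SELECT date(ts) AS day, COUNT(*) AS n FROM calls GROUP BY date(ts))
GROUP BY weekday
"""
for weekday, mean_count in conn.execute(query):
    print(weekday, mean_count)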
I'm performing a repetitive update operation to add documents into my MongoDB as part of some performance evaluation. I've discovered a huge non-linearity in execution time based on the number of updates (w/ upserts) I'm performing:
Looping with the following command in Python...
collection.update({'timestamp': x}, {'$set': {'value1':y, v1 : y/2, v2 : y/4}}, upsert=True)
Gives me these results...
500 document upserts: 2 seconds
1000 document upserts: 3 seconds
2000 document upserts: 3 seconds
4000 document upserts: 6 seconds
8000 document upserts: 14 seconds
16000 document upserts: 77 seconds
32000 document upserts: 280 seconds
Notice how after 8k document updates the performance starts to rapidly degrade, and by 32k document updates we're seeing a 6x reduction in throughput. Why is this? It seems strange that "manually" running 4k document updates 8 times in a row would be 6x faster than having Python perform them all consecutively.
I've seen in mongostat that I'm getting a ridiculously high locked db ratio (>100%), and top is showing >85% CPU usage while this is running. I've got an i7 processor with 4 cores available to the VM.
You should put an ascending index on your "timestamp" field:
collection.ensure_index("timestamp") # shorthand for single-key, ascending index
If this index should contain unique values:
collection.ensure_index("timestamp", unique=True)
Since the spec is not indexed and you are performing updates, the database has to check every document in the collection to see if any document already exists with that spec. When you do this for 500 documents (in a blank collection), the effects are not so bad... but when you do it for 32k, it does something like this (in the worst case):
document 1 - assuming blank collection, definitely gets inserted
document 2 - check document 1, update or insert occurs
document 3 - check documents 1-2, update or insert occurs
...etc...
document 32000 - check documents 1-31999, update or insert
When you add the index, the database no longer has to check every document in the collection; instead, it can use the index to find any possible matches much more quickly using a B-tree cursor instead of a basic cursor.
You should compare the results of collection.find({"timestamp": x}).explain() with and without the index (note you may need to use the hint() method to force it to use the index). The critical factor is how many documents you have to iterate over (the "nscanned" result of explain()) versus how many documents match your query (the "n" key). If the db only has to scan exactly what matches or close to that, that is very efficient; if you scan 32000 items but only found 1 or a handful of matches, that is terribly inefficient, especially if the db has to do something like that for each and every upsert.
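A short pymongo sketch of that comparison (the "nscanned" and "n" keys belong to the legacy explain() output this answer refers to; newer servers report similar numbers under executionStats):

# before adding the index
plan = collection.find({"timestamp": x}).explain()
print(plan.get("nscanned"), plan.get("n"))   # documents scanned vs. documents matched

collection.ensure_index("timestamp")

# after adding the index (hint() forces the index if the planner does not pick it)
plan = collection.find({"timestamp": x}).hint([("timestamp", 1)]).explain()
print(plan.get("nscanned"), plan.get("n"))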
A notable wrinkle for you to double-check: since you have not set multi=True in your update call, if an update operation finds a matching document, it will update just that one document and not continue to check the entire collection.
Sorry for the link spam, but these are all must-reads:
http://docs.mongodb.org/manual/core/indexes/
http://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection.ensure_index
http://api.mongodb.org/python/current/api/pymongo/collection.html#pymongo.collection.Collection.update
http://docs.mongodb.org/manual/reference/method/cursor.explain/