I need to get data for a certain period of time through the Elasticsearch API, run some customized analysis on it in Python, and display the result on a dashboard.
There are about two hundred thousand records every 15 minutes, indexed by date.
Right now I use scroll/scan to fetch the data, but it takes nearly a minute to pull 200,000 records, which seems too slow.
Is there any way to process this data more quickly? And can I use something like Redis to save the results and avoid repetitive work?
Is it possible to do the analysis on the Elasticsearch side using aggregations?
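If the analysis can be expressed as aggregations, something along these lines pushes the work onto the cluster and avoids pulling 200,000 documents at all. This is only a sketch: the index name, field names and the one-minute interval are made-up placeholders, not taken from the question.

from elasticsearch import Elasticsearch

es = Elasticsearch()

body = {
    "size": 0,  # we only want the aggregation buckets, not the hits
    "query": {"range": {"@timestamp": {"gte": "now-15m"}}},
    "aggs": {
        "per_minute": {
            # on Elasticsearch 7+ the parameter is "fixed_interval" instead of "interval"
            "date_histogram": {"field": "@timestamp", "interval": "1m"},
            "aggs": {"avg_value": {"avg": {"field": "value"}}},
        }
    },
}
resp = es.search(index="my-index-*", body=body)
for bucket in resp["aggregations"]["per_minute"]["buckets"]:
    print(bucket["key_as_string"], bucket["avg_value"]["value"])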
Assuming you're not doing it already, you should use _source filtering to download only the absolute minimum data required. You could also try increasing the size parameter to scan() from the default of 1000, though I would expect only modest speed improvements from that.
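A minimal sketch of both ideas with the elasticsearch-py helpers (again, the index name and field names are invented for the example):

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

query = {
    "_source": ["@timestamp", "value"],  # only the fields the analysis actually needs
    "query": {"range": {"@timestamp": {"gte": "now-15m"}}},
}

# size is the per-shard batch size for each scroll request; 1000 is the default
for hit in helpers.scan(es, index="my-index-*", query=query, size=5000):
    doc = hit["_source"]
    # ... custom analysis here ...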
If the historical data doesn't change, then a cache like Redis (or even just a local file) could be a good solution. If the historical data can change, then you'd have to manage cache invalidation.
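If you go the Redis route, the caching wrapper could look like this small redis-py sketch (the key format and the one-week expiry are arbitrary choices, assuming each 15-minute window can be keyed by its start time and the result is JSON-serialisable):

import json
import redis

r = redis.Redis()

def cached_analysis(window_key, compute):
    # window_key could be e.g. "analysis:2024-01-31T13:15"; compute() runs the real work
    cached = r.get(window_key)
    if cached is not None:
        return json.loads(cached)
    result = compute()
    r.set(window_key, json.dumps(result), ex=7 * 24 * 3600)  # keep results for a week
    return result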
I am storing data (JSON) into blobs in Azure, capturing it hourly to create relatively small JSON documents. Between backup times, I may have produced tens or hundreds (unlikely thousands, but possible) of these documents, which I then want to back up into blobs and organise by year, month, day, and hour.
Two approaches I came up with are:
Making the hour a folder and storing a separate blob for every backup within it
Making each hour its own blob under the day's folder and appending all new documents to that blob so they are stored together
The access pattern will usually be somewhat frequent reads for a while, before the blobs are moved off to cool/archive storage once they get old.
My question is: should I be favouring one method over the other for best practice, resource, or logical reasons, or is it basically personal preference with negligible performance hits? I'm especially interested in any resource differences in terms of reads and writes as I couldn't find or work out any useful information about that.
I'm also curious whether there are any access benefits to the append method, since the per-hour data is always stored together in the same blob (the trade-off being that you have to make sure you don't corrupt the blob as you append to it), as well as how well either method fits with how the Python SDK is architected.
For this scenario I am using Python and making use of the Azure Python SDK packages.
Any other suggestions/methods also very welcome. Thanks.
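For what it's worth, the two approaches above might look roughly like this with the azure-storage-blob v12 package (the container name, blob paths and payload are invented for the sketch):

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("backups")
doc_json = '{"example": "payload"}'

# Option 1: a separate blob for every backup, with the hour as a virtual folder
blob = container.get_blob_client("2024/01/31/13/backup-0001.json")
blob.upload_blob(doc_json)

# Option 2: one append blob per hour, each backup appended as a new block
hour_blob = container.get_blob_client("2024/01/31/13.json")
if not hour_blob.exists():
    hour_blob.create_append_blob()
hour_blob.append_block(doc_json + "\n")

The append-blob layout keeps each hour in a single object at the cost of coordinating concurrent appenders, while one blob per backup keeps every write independent.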
If the read/write requirements are low, then it won’t matter; if you need high throughput, you might opt not to name your files this way.
Take a look at this, specifically the partitioning section.
Performance and scalability checklist for Blob storage - Azure Storage | Microsoft Learn
Additional information: note that “relatively small” and “somewhat frequent” mean different things to different people. Some users might interpret that to mean < 1 KB and several times an hour, while someone else might interpret it to mean < 1 MB and several times a second (or even several times a millisecond). If it's the former, there is nothing to worry about.
If you still have any questions about performance, I would recommend contacting support.
I am writing my bachelor thesis on a project with a massive database that tracks around 8000 animals, three times a second. After a few months, we now have approximately 127 million entries, and each row includes a column with an array of 1000-3000 entries holding the coordinates of every animal that was tracked in that square at that moment. All of that lies in a SQL database that now easily exceeds 2 TB in size.
To export the data and analyse the animals' movement patterns, they used to do it online through phpMyAdmin as a CSV export, which would take hours to finish and break down almost every time.
I wrote them a Python script (they wanted me to use Python) with mysql-connector-python that fetches the data for them automatically. The problem is that, since the database is so massive, one query can take minutes or technically even hours to complete. (Downloading a day of tracking data would be 3*60*60*24 entries.)
The moment anything goes wrong (the connection fails, the computer is overloaded, etc.), the whole query is aborted and has to start all over again, because the results aren't cached anywhere.
I then rewrote the whole thing as a class that will fetch the data by using smaller multithreaded queries.
I start about 5-7 threads; each takes a connection out of a connection pool, runs its query, writes the result to a CSV file, and puts the connection back in the pool once the query is done.
My solution works well: the queries are about 5-6 times faster, depending on the number of threads and the size of the chunks I download. The data gets written to the file as it arrives, so when the connection breaks or anything else happens, the CSV file still holds all the data that has been downloaded up to that point.
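For illustration, a stripped-down sketch of that pattern using mysql-connector-python's connection pool and a thread pool (table, column and credential names are placeholders, and the chunking is simply one hour per query):

import csv
import threading
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timedelta
from mysql.connector import pooling

pool = pooling.MySQLConnectionPool(pool_name="export", pool_size=6,
                                   host="localhost", user="reader",
                                   password="secret", database="tracking")
write_lock = threading.Lock()

def fetch_chunk(start, end, writer):
    conn = pool.get_connection()               # borrow a connection from the pool
    try:
        cur = conn.cursor()
        cur.execute("SELECT ts, coords FROM positions WHERE ts >= %s AND ts < %s",
                    (start, end))
        rows = cur.fetchall()
        cur.close()
    finally:
        conn.close()                           # hands the connection back to the pool
    with write_lock:                           # serialise writes so the CSV stays intact
        writer.writerows(rows)

day = datetime(2024, 1, 1)
chunks = [(day + timedelta(hours=h), day + timedelta(hours=h + 1)) for h in range(24)]

with open("export.csv", "w", newline="") as f:
    writer = csv.writer(f)
    with ThreadPoolExecutor(max_workers=6) as ex:
        futures = [ex.submit(fetch_chunk, s, e, writer) for s, e in chunks]
        for fut in futures:
            fut.result()                       # re-raise any error from the workers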
But when looking for ways to improve my method, I can find absolutely nothing about a similar approach; no one seems to do it that way for large datasets.
What am I missing? Why does it seem like everyone is using a single-query approach to fetch their massive datasets, instead of splitting it into threads and avoiding these annoying issues with connection breaks and whatnot?
Is my solution even usable and sound in a commercial environment, or are there things I just don't see right now that would make my approach useless or even much worse?
Or maybe it is a matter of the programming language, and if I had used C# to do the same thing it would have been faster anyway?
EDIT:
To clear some things up: I am not responsible for the database. While I can tinker with it, since I also have admin rights, someone else who (hopefully) actually knows what he is doing has set it up and writes the data. My job is only to fetch it as simply and effectively as possible. And since exporting from phpMyAdmin is too slow, and so is a single Python query for 100k rows (I do it using pd.read_sql), I switched to multithreading. So my question is only about SELECTing the data efficiently, not about changing the DB.
I hope this is not becoming too long of a question...
There are many issues in a database of that size. We need to do the processing fast enough so that it never gets behind. (Once it lags, it will keel over, as you see.)
Ingestion. It sounds like a single client is receiving 8000 lat/lng values every 3 seconds, then INSERTing a single, quite wide row. Is that correct?
When you "process" the data, are you looking at each of the 8000 animals? Or looking at a selected animal? Fetching one out of a lat/lng from a wide row is messy and slow.
If the primary way things are SELECTed is one animal at a time, then your matrix needs to be transposed. That will make selecting all the data for one animal much faster, and we can mostly avoid the impact that Inserting and Selecting have on each other.
Are you inserting while you are reading?
What is the value of innodb_buffer_pool_size? You must plan carefully with the 2TB versus the much smaller RAM size. Depending on the queries, you may be terribly I/O-bound and maybe the data structure can be changed to avoid that.
"...csv file and put it back..." -- Huh? Are you deleting data, then re-inserting it? That sees 'wrong'. And very inefficient.
Do minimize the size of every column in the table. How big is the range for the animals? Your backyard? The Pacific Ocean? How much precision is needed in the location? Meters for whales; millimeters for ants. Maybe the coordinates can be scaled to a pair of SMALLINTs (2 bytes, 16-bit precision) or MEDIUMINTs (3 bytes each)?
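To make the scaling concrete, here is a tiny sketch of mapping a coordinate onto a SMALLINT, assuming a bounded enclosure of, say, 50 metres per axis (the numbers are purely illustrative):

# A SMALLINT holds -32768..32767, i.e. roughly 65000 distinct steps per axis.
X_MIN, X_MAX = 0.0, 50.0     # metres, hypothetical enclosure bounds
STEPS = 65000

def to_smallint(x):
    """Scale a coordinate in [X_MIN, X_MAX] to a 16-bit integer (~0.8 mm resolution here)."""
    return round((x - X_MIN) / (X_MAX - X_MIN) * STEPS) - STEPS // 2

def from_smallint(s):
    return (s + STEPS // 2) / STEPS * (X_MAX - X_MIN) + X_MIN

print(to_smallint(25.0), from_smallint(to_smallint(25.0)))   # 0 25.0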
I haven't dwelled on threading; I would like to wait until the rest of the issues are ironed out. Threads interfere with each other to some extent.
I find this topic interesting. Let's continue the discussion.
I'm running a New Relic server agent on a couple of Linux boxes (in the R&D stage right now) to gather performance data: CPU utilization, memory, etc. I've used the NR API to get back the available metrics and the names that can be passed for them. However, I'm not entirely sure how to get that data back correctly (not convinced it's even possible at this point). The one I'm most concerned about at this point is:
System/Disk/^dev^xvda1/Utilization/percent.
With available names:
[u'average_response_time', u'calls_per_minute', u'call_count', u'min_response_time', u'max_response_time', u'average_exclusive_time', u'average_value', u'total_call_time_per_minute', u'requests_per_minute', u'standard_deviation']
According to the NR API docs, the proper endpoint for this is https://api.newrelic.com/v2/servers/${APP_ID}/metrics/data.xml, where I assume ${APP_ID} is the server ID.
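A minimal sketch of such a request with the requests library (the API key and server ID are placeholders, and the X-Api-Key header plus the names[]/values[] query parameters are assumptions based on the usual v2 REST conventions, not taken from the post):

import requests

API_KEY = "<your-api-key>"     # placeholder
SERVER_ID = "1234567"          # placeholder for ${APP_ID} above

resp = requests.get(
    "https://api.newrelic.com/v2/servers/%s/metrics/data.xml" % SERVER_ID,
    headers={"X-Api-Key": API_KEY},
    params={
        "names[]": "System/Disk/^dev^xvda1/Utilization/percent",
        "values[]": "average_exclusive_time",
    },
)
print(resp.status_code)
print(resp.text)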
So, I'm able to send the request, however, the data I'm getting back is not at all what I'm looking for.
Response:
<average_response_time>0</average_response_time>
<calls_per_minute>1.4</calls_per_minute>
<call_count>1</call_count>
<min_response_time>0</min_response_time>
<max_response_time>0</max_response_time>
<average_exclusive_time>0</average_exclusive_time>
<average_value>0</average_value>
<total_call_time_per_minute>0</total_call_time_per_minute>
<requests_per_minute>1.4</requests_per_minute>
<standard_deviation>0</standard_deviation>
Which is what I would expect. I think the data in these metrics is accurate, but I suspect the metric names are not to be taken at face value. The reason I say that is based upon this statement in the NR API docs:
Metric values include:
Total disk space used, indicated by average_response_time
Capacity of the disk, indicated by average_exclusive_time.
Which would lead one to believe that the data we want is listed within one of the available name parameters for the request. So, essentially, my question is: is there a more specific way I need to hit the NR API to actually get the disk utilization as a percentage? Or is that not possible, even though I'm given to believe otherwise based upon the aforementioned information? I'm hoping maybe there is information I'm missing here... Thanks!
I've been working on a project to evaluate MongoDB's speed compared to another data store. To this end I'm trying to perform a full scan over a collection I've made. I found out about the profiler, so I have it enabled and set to log every query. I have a collection of a million objects, and I'm trying to time how long it takes to scan the collection. Unfortunately, when I run
db.sampledata.find()
it returns immediately with a cursor to 1000 or so objects. So I wrote a Python script to iterate through the cursor and handle all the results. Here it is:
from pymongo import MongoClient

# Connect to the local mongod and select the collection under test.
client = MongoClient()
db = client.argocompdb
data = db.sampledata

# Iterate the cursor so every document is actually pulled from the server.
count = 0
my_info = data.find()
for row in my_info:
    count += 1
print count
This seems to be taking the requisite time. However, when I check the profiler, there's no overall figure for the full query time; it's just a whole whack of "getmore" ops that take 3-6 ms each. Is there any way to do what I'm trying to do using the profiler instead of timing it in Python? I essentially just want to:
Be able to execute a query and have it return all results, instead of just the few in the cursor.
Get time for the "full query" in the profiler. The time it took to get all results.
Is what I want to do feasible?
I'm very new to MongoDB so I'm very sorry if this has been asked before but I couldn't find anything on it.
The profiler is measuring the correct thing. The Mongo driver is not returning all the records in the collection at once; it is first giving you a cursor, and then feeding the documents one by one as you iterate through the cursor. So the profiler is measuring exactly what is being done.
And I argue that this is a more correct metric than the one you are seeking, which I believe is the time that it takes to actually read all the documents into your client. You actually don't want the Mongo driver to read all the documents into memory before returning. No application would perform well if written that way, except for the smallest of collections. It's much faster for a client to read documents on demand, so that the smallest total memory footprint is necessary.
Also, what are you comparing this against? If you are comparing to a relational database, then it matters a great deal what your schema is in the relational DB, and what your collections and documents look like in Mongo. And of course, how each is indexed. Different choices can produce very different performance results, at no fault of the database engine.
The simplest, and therefore fastest, operations in Mongo will probably be lookups of tiny documents retrieved by their _id, which is always indexed: db.collection.find({_id: ...}). But if you really want to measure a linear scan, then the smaller the documents are, the faster the scan will be. But really, this isn't very useful, as it basically only measures how quickly the server can read data from disk.
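If all you want is a single server-side figure for the scan rather than a pile of getmore entries, one option (a sketch, not part of the original answer) is the explain command with executionStats verbosity, which reports the total execution time and the number of documents examined:

from pymongo import MongoClient

client = MongoClient()
db = client.argocompdb

# Ask the server to run the full scan and report timing, instead of timing it client-side.
stats = db.command("explain",
                   {"find": "sampledata", "filter": {}},
                   verbosity="executionStats")
exec_stats = stats["executionStats"]
print(exec_stats["executionTimeMillis"], exec_stats["totalDocsExamined"])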
Currently, I'm using Google's two-step method to back up the Datastore and then import it into BigQuery.
I also reviewed the code using pipeline.
Both methods are inefficient and costly, since all the data is imported every time.
I only need to add the records that were added since the last import.
What is the right way of doing it?
Is there a working example on how to do it in python?
You can look at Streaming inserts. I'm actually looking at doing the same thing in Java at the moment.
If you want to do it every hour, you could maybe add your inserts to a pull queue (either as serialised entities or keys/IDs) each time you put a new entity to Datastore. You could then process the queue hourly with a cron job.
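In Python, the streaming-insert part could look something like this with the google-cloud-bigquery client (project, dataset, table and row contents are placeholders, and this is the current client library rather than whatever was available on App Engine at the time):

from google.cloud import bigquery

client = bigquery.Client()

table_id = "my-project.my_dataset.my_table"   # placeholder
rows = [
    {"id": "entity-1", "created": "2024-01-31T13:00:00Z", "value": 42},
]

# Streams the rows into BigQuery; only send the entities added since the last run.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("insert errors:", errors)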
There is no full working example (as far as I know), but I believe the following process could help you:
1- You'd need to add a "last time changed" to your entities, and update it.
2- Every hour you can run a MapReduce job, where your mapper has a filter on the last-updated timestamp and only picks up the entities that changed in the last hour (sketched below).
3- Manually add what needs to be added to your backup.
As I said, this is pretty high level, but the actual answer would require a bunch of code. I honestly don't think it is suited to Stack Overflow's format.
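A rough sketch of the filter in step 2, using the google-cloud-datastore client as a stand-in for whatever the mapper would actually do (the kind and property names are assumptions):

from datetime import datetime, timedelta, timezone
from google.cloud import datastore

client = datastore.Client()

# Relies on step 1: every entity carries an indexed "updated" timestamp.
cutoff = datetime.now(timezone.utc) - timedelta(hours=1)
query = client.query(kind="MyKind")
query.add_filter("updated", ">=", cutoff)

for entity in query.fetch():
    pass  # hand the entity to the backup / BigQuery load step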