elastic search performance using pyes

Sorry for cross-posting. The following question is also posted on the Elasticsearch Google group.
In short, I am trying to find out why I am not getting optimal performance when searching an ES index that contains about 1.5 million records.
Currently I get about 500-1000 searches in 2 seconds. I would think this should be orders of magnitude faster. I am also not using Thrift at the moment.
Here is how I am checking the performance.
Using 0.19.1 version of pyes (tried both stable and dev version from github)
Using 0.13.8 version of requests
import time
from pyes import ES, TermQuery

conn = ES(['localhost:9201'], timeout=20, bulk_size=1000)
loop_start = time.clock()
q1 = TermQuery("tax_name", "cellvibrio")
for x in xrange(1000000):
    # report the elapsed time every 1000 searches
    if x % 1000 == 0 and x > 0:
        loop_check_point = time.clock()
        print 'took %s secs to search %d records' % (loop_check_point - loop_start, x)
    results = conn.search(query=q1)
    if results:
        for r in results:
            pass
        # print len(results)
    else:
        pass
I appreciate any help you can give me in scaling up the searches.
Thanks!

Isn't it just a matter of concurrency?
You're doing all your queries in sequence. So a query has to finish before the next one can come in to play. If you have a 1ms RTT to the server, this will limit you to 1000 requests per second.
Try running a few instances of your script in parallel and see what kind of performance you get.
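As a rough sketch of the same idea inside one process, the queries could be issued from a small thread pool instead of from several script instances. This is written for Python 3 (concurrent.futures), unlike the Python 2 code in the question. Here run_search is a hypothetical stand-in for whatever performs a single query (for example a function wrapping conn.search(query=q1)), fake_search only simulates a 1 ms round trip, and the worker count is an arbitrary assumption. Whether a single pyes connection can be shared safely between threads is not established in this thread, so one connection per worker may be the safer choice.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(run_search, n_queries, workers=8):
    """Call run_search() n_queries times on a thread pool.

    Returns (results, elapsed_seconds).
    """
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each task ignores its index and simply performs one search.
        results = list(pool.map(lambda _: run_search(), range(n_queries)))
    elapsed = time.perf_counter() - start
    return results, elapsed


def fake_search():
    # Hypothetical stand-in: simulates a 1 ms round trip to the server.
    time.sleep(0.001)
    return ["hit"]


if __name__ == "__main__":
    results, elapsed = run_concurrently(fake_search, n_queries=200, workers=8)
    print("%d searches in %.3f s" % (len(results), elapsed))
```

Comparing the elapsed time for workers=1 and workers=8 gives a quick indication of whether the limit is round-trip latency or the server itself.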

There are several ways to improve that when using pyes.
First, try to get rid of the DottedDict class, which is used to convert every JSON dict into an object for every result you get.
Second, switch the JSON encoder to ujson.
These two changes improved performance a lot.
The disadvantage is that you have to use regular dict access instead of the dotted version: instead of "result.facets.attribute.term" you have to write something like "result.facets['attribute']['term']" or "result.facets.get('attribute', {}).get('term', None)".
I did this by extending the ES class and replacing the "_send_request" function.
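A minimal sketch of what plain dict access looks like once responses are decoded straight into dicts. It does not show the "_send_request" override itself, since its signature is not given here. The payload in raw is made up for illustration, and ujson is treated as optional with a fallback to the standard json module.

```python
try:
    import ujson as json  # faster C implementation, if installed
except ImportError:
    import json  # standard-library fallback

# Hypothetical response fragment, for illustration only.
raw = '{"facets": {"attribute": {"term": "cellvibrio"}}}'
result = json.loads(raw)

# Direct indexing raises KeyError if a key is missing.
term = result["facets"]["attribute"]["term"]

# Chained .get() calls return None instead of raising.
missing = result.get("facets", {}).get("other", {}).get("term")
```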

Related

Elasticsearch : retrieve all documents from index with python

I need to retrieve documents from Elasticsearch in Python.
So I wrote this small code :
es = Elasticsearch(
    myHost,
    port=myPort,
    scheme="http")
request = '''{"query": {"match_all": {}}}'''
results = es.search(index=myIndex, body=request)['hits']['hits']
print(len(results))
>> 10
The problem is that it only returns 10 documents from my index when I expect a few hundred. How can I retrieve all documents from the index?

You have several ways to solve this.
If you know the maximum number of documents you will have in the index, you can set the size parameter of the search to that number or more. For example, if you know you will have fewer than 100, you can retrieve them this way:
results = es.search(index=myIndex, body=request, size=100)['hits']['hits']
If you don't know that number, and you still want all of them, you will have to use the scan function, instead of the search function. The documentation for that is here
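For illustration, here is a minimal sketch of the scroll loop that a scan-style helper performs (the elasticsearch package ships one as elasticsearch.helpers.scan). The function name iter_all_hits, the page size, and the keep-alive value are assumptions made for this example. It relies only on the client's search, scroll, and clear_scroll methods and takes the client as an argument.

```python
def iter_all_hits(es, index, query, page_size=1000, keep_alive="2m"):
    """Yield every hit matching `query`, one scroll page at a time."""
    page = es.search(index=index, body=query, scroll=keep_alive, size=page_size)
    scroll_id = page.get("_scroll_id")
    try:
        # An empty page means the scroll is exhausted.
        while page["hits"]["hits"]:
            for hit in page["hits"]["hits"]:
                yield hit
            page = es.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        # Free the server-side scroll context once iteration ends.
        if scroll_id:
            es.clear_scroll(scroll_id=scroll_id)
```

With the names from the question, this would be used as results = list(iter_all_hits(es, myIndex, {"query": {"match_all": {}}})).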

Why does ArangoDB (using Python-Arango) return ERR 1600 ERROR_CURSOR_NOT_FOUND?

The problem
I iterate over an entire vertex collection, e.g. journals, and use it to create edges, author, from a person to the given journal.
I use python-arango and the code is something like:
for journal in journals.all():
    create_author_edge(journal)
I have a relatively small dataset, and the journals-collection has only ca. 1300 documents. However: this is more than 1000, which is the batch size in the Web Interface - but I don't know if this is of relevance.
The problem is that it raises a CursorNextError, and returns HTTP 404 and ERR 1600 from the database, which is the ERROR_CURSOR_NOT_FOUND error:
Will be raised when a cursor is requested via its id but a cursor with that id cannot be found.
Insights to the cause
From ArangoDB Cursor Timeout, and from this issue, I suspect that the cursor's TTL has expired in the database. The Python stack trace shows something like this:
# Part of the stacktrace in the error:
(...)
    if not cursor.has_more():
        raise StopIteration
    cursor.fetch() <---- error raised here
(...)
If I iterate over the entire collection quickly, i.e. if I do print(len(journals.all())), it outputs "1361" with no errors.
When I replace journals.all() with AQL and increase the TTL parameter, it works without errors:
for journal in db.aql.execute("FOR j IN journals RETURN j", ttl=3600):
    create_author_edge(journal)
However, without the ttl parameter, the AQL approach gives the same error as journals.all().
More information
A last piece of information is that I'm running this on my personal laptop when the error is raised. On my work computer, the same code was used to create the graph and populate it with the same data, but there no errors were raised. Because I'm on holiday I don't have access to my work computer to compare versions, but both systems were installed during the summer so there's a big chance the versions are the same.
The question
I don't know if this is an issue with python-arango or with ArangoDB. Because there is no problem when the TTL is increased, I believe it could indicate an issue with ArangoDB and not the Python driver, but I cannot be sure.
(I've added a feature request to add ttl-param to the .all()-method here.)
Any insights into why this is happening?
I don't have the rep to create the tag "python-arango", so it would be great if someone would create it and tag my question.

Inside of the server the simple queries will be translated to all().
As discussed on the referenced github issue, simple queries don't support the TTL parameter, and won't get them.
The preferred solution here is to use an AQL query on the client, so that you can specify the TTL parameter.
In general you should refrain from pulling all documents from the database at once, since this may introduce other scaling issues. You should use proper AQL with FILTER statements backed by indices (use explain() to revalidate) to fetch the documents you require.
If you need to iterate over all documents in the database, use paging. This is usually implemented the best way by combining a range FILTER with a LIMIT clause:
FOR x IN docs
  FILTER x.offsetteableAttribute > #lastDocumentWithThisID
  LIMIT 200
  RETURN x
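The paging pattern above could be driven from python-arango roughly as follows. This is a sketch under assumptions: it pages on the _key attribute, adds an explicit SORT so that "last key seen" is well defined, and uses the hypothetical helper name iter_collection. Each page is a separate short query that is drained immediately, so no cursor has to stay alive while the per-document work runs.

```python
PAGE_QUERY = """
FOR d IN @@collection
    FILTER d._key > @last_key
    SORT d._key
    LIMIT @page_size
    RETURN d
"""


def iter_collection(db, collection, page_size=200):
    """Yield all documents of `collection`, one short-lived query per page."""
    last_key = ""  # every non-empty _key sorts after the empty string
    while True:
        cursor = db.aql.execute(
            PAGE_QUERY,
            bind_vars={
                "@collection": collection,
                "last_key": last_key,
                "page_size": page_size,
            },
        )
        page = list(cursor)  # drain the cursor before doing any slow work
        if not page:
            return
        for doc in page:
            yield doc
        # Resume after the last document of this page.
        last_key = page[-1]["_key"]
```

With the names from the question, the loop would become for journal in iter_collection(db, "journals"): create_author_edge(journal).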

Here is how I did it with pyArango. The **moreArgs parameter makes it easy to pass a TTL.
Looking at the source, the docstring says what to do:
def AQLQuery(self, query, batchSize = 100, rawResults = False, bindVars = None, options = None, count = False, fullCount = False,
             json_encoder = None, **moreArgs):
    """Set rawResults = True if you want the query to return dictionnaries instead of Document objects.
    You can use **moreArgs to pass more arguments supported by the api, such as ttl=60 (time to live)"""
So the query from the question can be run like this:
from pyArango.connection import Connection
conn = Connection(username=usr, password=pwd, arangoURL=url)  # set this up as you need
db = conn['databaseName']  # set this to the name of your database
aql = "FOR journal IN journals RETURN journal"  # the query itself must be AQL, not Python
docs = db.AQLQuery(aql, rawResults=True, ttl=300)  # ttl is passed through **moreArgs
for journal in docs:
    create_author_edge(journal)
That's all you need to do!

how to use pyknackhq python library for getting whole objects/tables from my knack builder

I am trying to connect knack online database with my python data handling scripts in order to renew objects/tables directly into my knack app builder. I discovered pyknackhq Python API for KnackHQ can fetch objects and return json objects for the object's records. So far so good.
However, following the documentation (http://www.wbh-doc.com.s3.amazonaws.com/pyknackhq/quick%20start.html) I have tried to fetch all rows (records in knack) for my object-table (having in total 344 records).
My code was:
i = 0
for rec in undec_obj.find():
    print(rec)
    i = i + 1
print(i)
>> 25
The first 25 records were indeed returned, but the rest, up to the 344th, never were. The documentation of the pyknackhq library is relatively small, so I couldn't find a way around my problem there. Is there a way to get all my records/rows? (I have also changed the settings in Knack so that all my records appear on the same page, page 1.)
The ultimate goal is to take all the records and turn them into a pandas DataFrame.
Thank you!

I haven't worked with that library, but I've written another python Knack API wrapper that should help:
https://github.com/cityofaustin/knackpy
The docs should get you where you want to go. Here's an example:
>>> from knackpy import Knack
# download data from knack object
# will fetch records in chunks of 1000 until all records have been downloaded
# optionally pass a rows_per_page and/or page_limit parameter to limit record count
>>> kn = Knack(
        obj='object_3',
        app_id='someappid',
        api_key='topsecretapikey',
        page_limit=10,  # not needed; this is the default
        rows_per_page=1000  # not needed; this is the default
    )
>>> for row in kn.data:
        print(row)
{'store_id': 30424, 'inspection_date': 1479448800000, 'id': '58598262bcb3437b51194040'},...
Hope that helps. Open a GitHub issue if you have any questions using the package.

Why is the reported number of hits from elasticsearch different depending on the query method?

I have an elasticsearch index which has 60k elements. I know that by checking the head plugin and I get the same information via Sense (the result is in the lower right corner)
I then wanted to query the same index from Python in two different ways: via a direct requests call and using the elasticsearch module:
import elasticsearch
import json
import requests
# the requests version
data = {"query": {"match_all": {}}}
r = requests.get('http://elk.example.com:9200/nessus_current/_search', data=json.dumps(data))
print(len(r.json()['hits']['hits']))
# the elasticsearch module version
es = elasticsearch.Elasticsearch(hosts='elk.example.com')
res = es.search(index="nessus_current", body={"query": {"match_all": {}}})
print(len(res['hits']['hits']))
In both cases the result is 10 - far from the expected 60k. The results of the query make sense (the content is what I expect), it is just that there are only a few of them.
I took one of these 10 hits and queried for its _id with Sense to close the loop. As expected, it is found.
So it looks like the 10 hits are a subset of the whole index. Why aren't all elements reported in the Python version of the calls?

10 is the default size of the results returned by Elasticsearch. If you want more, specify "size": 100, for example. But be careful: returning all the docs using size is not recommended, as it can bring down your cluster. To get back all the results, use scan and scroll.
And I think it should be res['hits']['total'] not res['hits']['hits'] to get the number of total hits.
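To make the distinction concrete, here is a small sketch that reads the reported total from a search response and compares it with the number of hits actually included in the page. The response dict res is a trimmed-down, made-up example, and total_hits is a hypothetical helper name. The helper also handles the object form of hits.total that Elasticsearch 7 and later return.

```python
def total_hits(response):
    """Return the total number of matches reported by a search response."""
    total = response["hits"]["total"]
    # Elasticsearch 7+ reports an object such as {"value": 60000, "relation": "eq"};
    # older versions report a plain integer.
    if isinstance(total, dict):
        return total["value"]
    return total


# Hypothetical response: 60000 matches, but only the default 10 hits returned.
res = {"hits": {"total": 60000, "hits": [{"_id": str(i)} for i in range(10)]}}

print(total_hits(res))            # total number of matching documents
print(len(res["hits"]["hits"]))   # number of hits included in this page
```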

How do I evade the limit of 100 entries in python splunk query

When executing a query via the Splunk SDK, the results are apparently clipped after 100 entries. How do I get around this limit?
I tried:
job = service.jobs.create(qstring, max_count=0, max_time=0, count=10000)
while not job.is_ready():
    time.sleep(1)
out = list(results.ResultsReader(job.results()))
print(len(out))
>> 100
but the same query in the Splunk web interface produces over 100 lines of results.

Try job.results(count=0)
count=0 means no limit.

Here is a hack which appears to work (but this is surely not the right way to do it):
In splunklib.binding, add the following line to the beginning of both HttpLib.get and HttpLib.post:
kwargs['count'] = 100000
