Elasticsearch: retrieve all documents from index with Python

I need to retrieve documents from Elasticsearch in Python.
So I wrote this small piece of code:
es = Elasticsearch(
    myHost,
    port=myPort,
    scheme="http")
request = '''{"query": {"match_all": {}}}'''
results = es.search(index=myIndex, body=request)['hits']['hits']
print(len(results))
>> 10
The problem is that it only returns 10 documents from my index, when I expect a few hundred. How can I retrieve all documents from the index?

You have several ways to solve this.
If you know the maximum number of documents you will have in the index, you can set the size parameter of the search to that number or more. For example, if you know you will have fewer than 100, you can retrieve them this way: results = es.search(index=myIndex, body=request, size=100)['hits']['hits']
If you don't know that number and you still want all of them, you will have to use the scan function instead of the search function. The documentation for it is here.
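For reference, here is a minimal sketch of the scan approach, assuming the helpers module of the elasticsearch-py client (myHost, myPort and myIndex are the values from the question):
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch(myHost, port=myPort, scheme="http")

# scan() pages through the whole index with the Scroll API under the hood,
# so it is not capped by the default size of 10
results = [hit["_source"] for hit in scan(es, query={"query": {"match_all": {}}}, index=myIndex)]
print(len(results))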

Related

Can't get the Place details (Gmaps Places API) more than 20 data

I'm quite a newbie in Python, especially when it comes to using the Gmaps API to get place details.
I want to search for places with these parameters:
places_result = gmaps.places_nearby(location=' -6.880270,107.60794', radius = 300, type = 'cafe')
But I actually want to get as much data as I can for the given lat/lng and radius. So I tried the additional parameter that the Google API provides: page_token. This is the relevant part of the documentation:
pagetoken — Returns up to 20 results from a previously run search. Setting a pagetoken parameter will execute a search with the same parameters used previously — all parameters other than pagetoken will be ignored.
https://developers.google.com/places/web-service/search
So I tried to get more data (the next page of data) like this:
places_result = gmaps.places_nearby(location=' -6.880270,107.60794', radius = 300, type = 'cafe')
time.sleep(5)
place_result = gmaps.places_nearby(page_token = places_result['next_page_token'])
And this is my whole output function:
for place in places_result['results']:
    my_place_id = place['place_id']
    my_fields = ['name', 'formatted_address', 'business_status', 'rating', 'user_ratings_total', 'formatted_phone_number']
    places_details = gmaps.place(place_id=my_place_id, fields=my_fields)
    pprint.pprint(places_details['result'])
But unfortunately, when I run it, I only get 20 place details at most. I don't know whether my use of the page token parameter is correct, because the output never contains more than 20 results.
I would appreciate any advice on how to solve this problem. Thank you very much :)
As stated in the documentation here:
By default, each Nearby Search or Text Search returns up to 20 establishment results per query; however, each search can return as many as 60 results, split across three pages.
So basically, what you are currently experiencing is an intended behavior. There is no way for you to get more than 20 results in a single nearby search query.
If a next_page_token was returned upon sending your first nearby search query, then, this means that a second page with results is available.
To access this second page of results, just like you did, you have to send another nearby search request, but this time use the pagetoken parameter and set its value to the next_page_token you got from the first response.
And if next_page_token also exists in the response of your second nearby search query, then the third (and last) page of results is available as well. You can access the third page the same way you accessed the second page.
Going back to your query, I tried the parameters you've specified but I could only get around 9 results. Is it intended that your radius parameter is only set at 300 meters?
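For illustration, here is a minimal sketch of walking through all available pages (at most around 60 results), assuming the googlemaps client used in the question; note that the next_page_token needs a short delay before it becomes valid:
import time

all_results = []
response = gmaps.places_nearby(location=' -6.880270,107.60794', radius=300, type='cafe')
all_results.extend(response['results'])

# follow up to two additional pages; the token takes a moment to become active
while 'next_page_token' in response:
    time.sleep(5)
    response = gmaps.places_nearby(page_token=response['next_page_token'])
    all_results.extend(response['results'])

print(len(all_results))  # capped at roughly 60 by the Places API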

Elasticsearch-Py bulk not indexing all documents

I am using the elasticsearch-py Python package to interact with Elasticsearch through code. I have a script that is meant to take each document from one index, generate a field + value, then re-index it into a new index.
The issue is that there are 1216 documents in the first index, but only 1000 documents make it to the second one. Typically it is exactly 1000 documents, occasionally a bit higher at around 1100, but it never reaches the full 1216.
I usually keep the batch_size at 200, but changing it around seems to have some effect on the number of documents that make it to the second index. Changing it to 400 will typically result in 800 documents being transferred. Using parallel_bulk seems to produce the same results as using bulk.
I believe the issue is with the generating process I am performing. For each document I am generating its ancestry (they are organized in a tree structure) by recursively getting its parent from the first index. This involves rapid document GET requests interwoven with Bulk API calls to index the documents and Scroll API calls to get the documents from the index in the first place.
Would activity like this cause the documents to not go through? If I remove (comment out) the recursive GET requests, all documents seem to go through every time. I have tried creating multiple Elasticsearch clients, but that wouldn't even help if ES itself is the bottleneck.
Here is the code if you're curious:
def complete_resources():
    for result in helpers.scan(client=es, query=query, index=TEMP_INDEX_NAME):
        resource = result["_source"]
        ancestors = []
        parent = resource.get("parent")
        while parent is not None:
            ancestors.append(parent)
            parent = es.get(
                index=TEMP_INDEX_NAME,
                doc_type=TEMPORARY_DOCUMENT_TYPE,
                id=parent["uid"]
            ).get("_source").get("parent")
        resource["ancestors"] = ancestors
        resource["_id"] = resource["uid"]
        yield resource
This generator is consumed by helpers.parallel_bulk()
for success, info in helpers.parallel_bulk(
    client=es,
    actions=complete_resources(),
    thread_count=10,
    queue_size=12,
    raise_on_error=False,
    chunk_size=INDEX_BATCH_SIZE,
    index=new_primary_index_name,
    doc_type=PRIMARY_DOCUMENT_TYPE,
):
    if success:
        successful += 1
    else:
        failed += 1
        print('A document failed:', info)
This gives me the following result:
Time: 7 seconds
Successful: 1000
Failed: 0

Is there a way of setting a range when querying data of firestore?

I have a collection of documents, all with random IDs and a field called date.
docs = collection_ref.order_by(u'date', direction=firestore.Query.ASCENDING).get()
Imagine I had limited the search to the first ten:
docs = collection_ref.order_by(u'date', direction=firestore.Query.ASCENDING).limit(10).get()
How would I continue with my next query when I want to get the items from 11 to 20?
You can use offset(), but every doc skipped counts as a read. For example, if you did query.offset(10).limit(5), you would be charged for 15 reads: the 10 skipped by the offset plus the 5 that you actually got.
If you want to avoid unnecessary reads, use startAt() or startAfter().
Example (Java, sorry, I don't speak Python; here's a link to the docs though):
QuerySnapshot querySnapshot = // your query snapshot
List<DocumentSnapshot> docs = querySnapshot.getDocuments();
// reference doc; the next query starts at or after this one
DocumentSnapshot indexDocSnapshot = docs.get(docs.size() - 1);
// to 'start after', i.e. paginate, you can do the following:
query.startAfter(indexDocSnapshot).limit(10).get()
    .addOnSuccessListener(queryDocumentSnapshots -> {
        // next 10 docs here, no extra reads necessary
    });
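For the Python client used in the question, a rough equivalent might look like the sketch below (assuming the google-cloud-firestore client's start_after(); the date field is the one from the question):
# first page: items 1-10
first_page = list(collection_ref.order_by(u'date', direction=firestore.Query.ASCENDING).limit(10).get())

# use the last document of the first page as the pagination cursor
last_doc = first_page[-1]

# next page: items 11-20, with no charged reads for the skipped documents
next_page = (collection_ref
             .order_by(u'date', direction=firestore.Query.ASCENDING)
             .start_after(last_doc)
             .limit(10)
             .get())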

Why is the reported number of hits from elasticsearch different depending on the query method?

I have an Elasticsearch index which has 60k elements. I know that by checking the head plugin, and I get the same information via Sense.
I then wanted to query the same index from Python, in two different ways: via a direct requests call and using the elasticsearch module:
import elasticsearch
import json
import requests
# the requests version
data = {"query": {"match_all": {}}}
r = requests.get('http://elk.example.com:9200/nessus_current/_search', data=json.dumps(data))
print(len(r.json()['hits']['hits']))
# the elasticsearch module version
es = elasticsearch.Elasticsearch(hosts='elk.example.com')
res = es.search(index="nessus_current", body={"query": {"match_all": {}}})
print(len(res['hits']['hits']))
In both cases the result is 10, far from the expected 60k. The results of the query make sense (the content is what I expect); it is just that there are only a few of them.
I took one of these 10 hits and queried Sense for its _id to close the loop. It is, as expected, found.
So it looks like the 10 hits are a subset of the whole index, why aren't all elements reported in the Python version of the calls?
10 is the default size of the results returned by Elasticsearch. If you want more, specify "size": 100 for example. But be careful: returning all the docs using size is not recommended, as it can bring down your cluster. For getting back all the results, use scan & scroll.
And I think it should be res['hits']['total'], not len(res['hits']['hits']), to get the total number of hits.
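For example, a minimal sketch covering both points (the host and index names are the ones from the question):
res = es.search(index="nessus_current", body={"query": {"match_all": {}}}, size=100)
print(res['hits']['total'])      # total number of matching documents
print(len(res['hits']['hits']))  # number of documents actually returned (at most size)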

Cloudsearch Request Exceed 10,000 Limit

When I search a query that has more than 10,000 matches I get the following error:
{u'message': u'Request depth (10100) exceeded, limit=10000', u'__type': u'#SearchException', u'error': {u'rid': u'zpXDxukp4bEFCiGqeQ==', u'message': u'[*Deprecated*: Use the outer message field] Request depth (10100) exceeded, limit=10000'}}
When I search with more narrowed-down keywords and queries with fewer results, everything works fine and no error is returned.
I guess I have to limit the search somehow, but I'm unable to figure out how. My search function looks like this:
def execute_query_string(self, query_string):
    amazon_query = self.search_connection.build_query(q=query_string, start=0, size=100)
    json_search_results = []
    for json_blog in self.search_connection.get_all_hits(amazon_query):
        json_search_results.append(json_blog)
    results = []
    for json_blog in json_search_results:
        results.append(json_blog['fields'])
    return results
And it's being called like this:
results = searcher.execute_query_string(request.GET.get('q', ''))[:100]
As you can see, I've tried to limit the results with the start and size attributes of build_query(). I still get the error though.
I must have misunderstood how to avoid getting more than 10,000 matches on a search result. Can someone tell me how to do it?
All I can find on this topic is Amazon's Limits where it says that you can only request 10,000 results. It does not say how to limit it.
You're calling get_all_hits, which gets ALL results for your query. That is why your size param is being ignored.
From the docs:
get_all_hits(query): Get a generator to iterate over all search results. Transparently handles the results paging from Cloudsearch search results, so even if you have many thousands of results you can iterate over all of them in a reasonably efficient manner.
http://boto.readthedocs.org/en/latest/ref/cloudsearch2.html#boto.cloudsearch2.search.SearchConnection.get_all_hits
You should be calling search instead -- http://boto.readthedocs.org/en/latest/ref/cloudsearch2.html#boto.cloudsearch2.search.SearchConnection.search
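As a rough sketch, the method from the question could be rewritten along these lines; this assumes that search() accepts the same q/start/size keywords as build_query() and that the returned result object exposes the matched documents via .docs, as described in the linked docs:
def execute_query_string(self, query_string):
    # search() returns a single page of at most `size` hits, instead of
    # transparently paging through everything like get_all_hits() does
    results = self.search_connection.search(q=query_string, start=0, size=100)
    return [doc['fields'] for doc in results.docs]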
