How to load Elasticsearch data in Python using scroll?

I have an index in Elasticsearch that holds a huge amount of data. I am trying to load some of it (more than 10,000 records) into Python for further processing. According to the documentation and web searches, scroll is what should be used, but it is able to fetch only a few records. After some time this exception occurs:
errorNotFoundError(404, 'search_phase_execution_exception', 'No search context found for id [101781]')
My code is as follows:
from elasticsearch import Elasticsearch

########## elastic configuration
host = 'localhost'
port = 9200
user = ''
pasw = ''
el_index_name = 'test'

es = Elasticsearch([{'host': host, 'port': port}], http_auth=(user, pasw))
res = es.search(index=el_index_name, body={"query": {"match_all": {}}}, scroll='10m')

rows = []
while True:
    try:
        rows.append(es.scroll(scroll_id=res['_scroll_id'])['hits']['hits'])
    except Exception as esl:
        print('error{}'.format(esl))
        break

## deleting scroll
es.clear_scroll(scroll_id=res['_scroll_id'])
I have changed the value of scroll='10m', but this exception still occurs.

You need to change your scroll request line to this:
rows.append(es.scroll(scroll_id=res['_scroll_id'], body={"scroll": "10m","scroll_id": res['_scroll_id']})['hits']['hits'])
As a piece of advice, it is better to increase the number of documents retrieved per request. Retrieving only a handful of documents in each request hurts your performance and adds overhead to your cluster as well. For example:
{
    "query": {
        "match_all": {}
    },
    "size": 100
}
I have added the part below to answer the question in the comments. It is not stopping because you have put while True in your code. You need to change it to this:
res = es.search(index=el_index_name, body={"query": {"match_all": {}}}, scroll='10m')
scroll_id = res['_scroll_id']
query = {
    "scroll": "10m",
    "scroll_id": scroll_id
}
rows = []
while len(res['hits']['hits']):
    for item in res['hits']['hits']:
        rows.append(item)
    res = es.scroll(scroll_id=scroll_id, body=query)
Please let me know if there was any problem with this.
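If you are on the official elasticsearch-py client, the helpers.scan utility wraps this scroll-and-clear bookkeeping for you. A minimal sketch, assuming the same es client and el_index_name from the question:
from elasticsearch import helpers

rows = []
for hit in helpers.scan(es,
                        index=el_index_name,
                        query={"query": {"match_all": {}}},
                        scroll='10m',
                        size=1000):
    # scan() pages through the index with the scroll API under the hood
    # and clears the scroll context when it is done
    rows.append(hit)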

Related

Get data for JSON Python second List

Hope you are doing well.
I am trying to get data from Zendesk via its API and Python/JSON.
I can get any value under audits, like ticket_id and author_id, but when I try to get data under events, such as body, I keep getting this error:
print(audit['body'])
KeyError: 'body'
JSON Output
{
   "audits":[
      {
         "id":1727876301271,
         "ticket_id":54010951,
         "created_at":"2021-10-21T10:58:06Z",
         "author_id":12306596687,
         "metadata":{
            "system":{
               "client":"GuzzleHttp/6.2.1 curl/7.29.0 PHP/7.1.2",
               "ip_address":"x.x.x.x",
               "location":"Boardman, OR, United States",
               "latitude":45.8491,
               "longitude":-119.7143
            },
            "custom":{
            }
         },
         "events":[
            {
               "id":1727876301291,
               "type":"Comment",
               "author_id":366289833251,
               "body":"Sehr geehrte Damen und Herren,\n\nIn unserer Bestellung fehlt das Kleid, es war nicht mit dabei, obwohl es hätte drin sein müssen.\nFreundliche Grüße",
               "attachments":[
               ],
               "audit_id":1727876301271
            },
            {
               "id":1727876301311,
               "type":"Create",
               "value":"366289833251",
               "field_name":"requester_id"
            },
Python Code
import requests
import csv

# Settings
auth = 'xxxxxxx', 'xxxxxx'
view_tickets = []
view_id = 214459268
view_audits = []
ticket_id = 54010951
view_events = []

print(f'Getting tickets from ticket_id ID {ticket_id}')
url = f'https://xxxx.zendesk.com/api/v2/tickets/54010951/audits.json'

while url:
    response = requests.get(url, auth=auth)
    page_data = response.json()
    audits = page_data['audits']  # extract the "tickets" list from the page
    view_audits.extend(audits)
    url = page_data['next_page']

for audit in audits:
    print(audit['body'])
You know you're overwriting, not adding, to audits right? (in this line: audits = page_data['audits']). I don't think that makes sense, but it's hard to know your intent.
To fix the error itself: your JSON structure has the body key inside the events list, so you can access it with:
print(audit['events'][0]['body'])
or, using another loop:
for audit in audits:
    for event in audit['events']:
        print(event['body'])
You might get an error for the second event because it doesn't appear to have the body key. You can add an if statement to handle that if you want, as in the sketch below.
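A minimal sketch of that guarded loop, assuming the same audits list from the question; events without a body field (in the sample JSON, the "Create" event) are simply skipped:
for audit in audits:
    for event in audit['events']:
        if 'body' in event:  # only events that carry a body are printed
            print(event['body'])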

MongoDB: Update function does not return a response in Python Flask

I built a small script in Flask that looks up a document in my MongoDB by its vkey and updates the credits and level fields of that document:
client = pymongo.MongoClient("mongodb+srv://root:<password>#cluster0.3ypph.mongodb.net/data?retryWrites=true&w=majority")
db = client["data"]
col = db["user_currency"]

h = hmac.new(b"test", b"", hashlib.sha3_512)
credits_update = credits - cost
h.update(vkey.encode("utf8"))

try:
    db.col.update_one(
        {"vkey": h.hexdigest()},
        {"$set": {"credits": str(credits_update)}}
    )
    db.col.update_one(
        {"vkey": h.hexdigest()},
        {
            "$set": {"level": count},
            "$currentDate": {"lastModified": True}
        }
    )
except:
    return redirect("/currency?error=02")
else:
    return redirect("/currency?bought=lvlboost")
However, nothing is updated in MongoDB after execution, and it only redirects to the target page /currency?bought=lvlboost. I reloaded the database and also checked the vkey for correctness. Both are identical. Does anyone know what the problem could be?
Assign the result of the update to a variable so that you can inspect the return value:
result = db.col.update_one(
    {"vkey": h.hexdigest()},
    {
        "$set": {"level": count, "credits": str(credits_update)},
        "$currentDate": {"lastModified": True}
    }
)
Now you can check result.matched_count and result.modified_count to see whether the filter matched a document and whether anything was actually modified.
Doc: https://docs.mongodb.com/manual/reference/method/db.collection.updateOne/#returns
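A minimal sketch of inspecting those counters, reusing the result variable from the snippet above:
# result is the UpdateResult returned by update_one above
print(result.matched_count)   # how many documents matched the {"vkey": ...} filter
print(result.modified_count)  # how many documents were actually changed

if result.matched_count == 0:
    # the filter found no document, which would explain why nothing changes in MongoDB
    print("no document found for this vkey")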

I want to first create a database and then update it as per the values in MongoDB

I want to update the value of Entry1 using upsert. I have a sensor that returns the value of Entry1. If the sensor is blocked, the value is True; if the sensor is not blocked, the value is False.
machineOne = None
oneIn = 1
while True:
    global machineOneId
    global userId
    try:
        if Entry1.get_value() and oneIn < 2:
            machineOne = Entry1.get_value()
            print('entered looopp ONeeeE', machineOne)
            machine1 = {
                'Entry1': Entry1.get_value(),
                'Exit1': Exit1.get_value(),
                'id': 'test'
            }
            result = Machine1.insert_one(machine1)
            myquery = {"Entry1": 'true'}
            newvalues = {"$set": {"id": result.inserted_id}}
            #result = Machine1.insert_one(machine1)
            Machine1.update_one(myquery, newvalues)
            userId = result.inserted_id
            oneIn += 1
            print('added', result.inserted_id, oneIn)
        elif machineOne:
            print('entered looopp', userId)
            myquery = {"id": userId}
            newvalues = {"$set": {"id": Entry1.get_value()}}
            upsert = True
            #result = Machine1.insert_one(machine1)
            Machine1.update_one(myquery, newvalues)
            if Exit1.get_value():
                print('added',)
    finally:
        print('nothings happened', machineOne)
What is expected: I should be able to update Entry1 from true to false in the same document, as displayed in Robo 3T.
Good afternoon @digs10,
I read your post and I think the error is in how you locate the document that you want to update.
Remember that a MongoDB document's primary key is "_id", not "id". You could take a look here: MongoDB Documents.
From what I see in the code (I don't know Python, but it is readable), you are referring to the document "Entry1" using the field "id" instead of "_id".
Try modifying the line myquery = {"id": userId} to myquery = {"_id": userId}.
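A minimal sketch of that change, assuming the Machine1 collection and userId from the question, and assuming the field you actually want to flip is Entry1 (in the original code the $set writes to "id"); note that upsert has to be passed as a keyword argument to update_one rather than assigned to a loose variable:
Machine1.update_one(
    {"_id": userId},                           # match on _id, not "id"
    {"$set": {"Entry1": Entry1.get_value()}},  # write the current sensor value
    upsert=True                                # insert the document if it does not exist yet
)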
I hope this answer can help you.
Best Regards,
JB
P.S.: I saw this question in my email and took a quick read of it; if I misunderstood it, please let me know.

How to get all documents under an Elasticsearch index with the Python client?

I'm trying to get all of an index's documents using the Python client, but the result shows me only the first document.
This is my Python code:
res = es.search(index="92c603b3-8173-4d7a-9aca-f8c115ff5a18", doc_type="doc", body = {
'size' : 10000,
'query': {
'match_all' : {}
}
})
print("%d documents found" % res['hits']['total'])
data = [doc for doc in res['hits']['hits']]
for doc in data:
print(doc)
return "%s %s %s" % (doc['_id'], doc['_source']['0'], doc['_source']['5'])
try "_doc" instead of "doc"
res = es.search(index="92c603b3-8173-4d7a-9aca-f8c115ff5a18", doc_type="_doc", body = {
'size' : 100,
'query': {
'match_all' : {}
}
})
Elasticsearch retrieves only 10 documents by default. You can change this behaviour (doc here). The best practices for pagination are the search_after query and the scroll query; it depends on your needs. Please read this answer: Elastic search not giving data with big number for page size.
To show all the results:
for doc in res['hits']['hits']:
    print(doc['_id'], doc['_source'])
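As a rough illustration of the search_after approach mentioned above, a minimal sketch, assuming the same es client and index, and assuming the documents contain a unique, sortable field (here a hypothetical id field used only as a sort tiebreaker):
body = {
    "size": 1000,
    "query": {"match_all": {}},
    "sort": [{"id": "asc"}]   # "id" is a hypothetical unique field in the documents
}
all_hits = []
while True:
    res = es.search(index="92c603b3-8173-4d7a-9aca-f8c115ff5a18", body=body)
    hits = res['hits']['hits']
    if not hits:
        break
    all_hits.extend(hits)
    # resume the next page from the sort values of the last hit
    body["search_after"] = hits[-1]["sort"]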
You can try the following query. It will return all the documents.
result = es.search(index="index_name", body={"query":{"match_all":{}}})
You can also use elasticsearch_dsl and its Search API which allows you to iterate over all your documents via the scan method.
import elasticsearch
from elasticsearch_dsl import Search

client = elasticsearch.Elasticsearch()
search = Search(using=client, index="92c603b3-8173-4d7a-9aca-f8c115ff5a18")

for hit in search.scan():
    print(hit)
I don't see it mentioned that the index must be refreshed if you have just added data. Use this:
es.indices.refresh(index="index_name")

Import a list of dicts or a JSON file to Elasticsearch with Python

I have a .json.gz file that I wish to load into Elasticsearch.
My first attempt involved using the json module to convert the JSON to a list of dicts.
import gzip
import json
from pprint import pprint
from elasticsearch import Elasticsearch
nodes_f = gzip.open("nodes.json.gz")
nodes = json.load(nodes_f)
Dict example:
pprint(nodes[0])
{u'index': 1,
u'point': [508163.122, 195316.627],
u'tax': u'fehwj39099'}
Using Elasticsearch:
es = Elasticsearch()
data = es.bulk(index="index",body=nodes)
However, this returns:
elasticsearch.exceptions.RequestError: TransportError(400, u'illegal_argument_exception', u'Malformed action/metadata line [1], expected START_OBJECT or END_OBJECT but found [VALUE_STRING]')
Beyond this, I wish to be able to find the tax for a given point query, in case this has an impact on how I should index the data with Elasticsearch.
Alfe pointed me in the right direction, but I couldn't get his code to work.
I found two solutions:
Line by line with a for loop:
es = elasticsearch.Elasticsearch()

for node in nodes:
    _id = node['index']
    es.index(index='nodes', doc_type='external', id=_id, body=node)
In bulk, using the bulk helper:
actions = [
    {
        "_index": "nodes_bulk",
        "_type": "external",
        "_id": str(node['index']),
        "_source": node
    }
    for node in nodes
]

helpers.bulk(es, actions)
Bulk was around 22 times faster for a list of 343724 dicts.
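helpers.bulk also accepts a generator of actions, so you do not have to build the whole actions list in memory first; a minimal sketch, reusing the nodes list and es client from above:
from elasticsearch import helpers

def generate_actions(nodes):
    # yield one action per node instead of materialising the whole list
    for node in nodes:
        yield {
            "_index": "nodes_bulk",
            "_type": "external",
            "_id": str(node['index']),
            "_source": node,
        }

helpers.bulk(es, generate_actions(nodes))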
Here is my working code using the bulk API.
Define a list of dicts:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

doc = [{'_id': 1, 'price': 10, 'productID': 'XHDK-A-1293-#fJ3'},
       {'_id': 2, 'price': 20, 'productID': 'KDKE-B-9947-#kL5'},
       {'_id': 3, 'price': 30, 'productID': 'JODL-X-1937-#pV7'},
       {'_id': 4, 'price': 30, 'productID': 'QQPX-R-3956-#aD8'}]

helpers.bulk(es, doc, index='products', doc_type='_doc', request_timeout=200)
The ES bulk library showed several problems for us, including performance trouble and not being able to set specific _ids. But since the bulk API of ES is not very complicated, we did it ourselves:
import json
import requests

# url should point to the _bulk endpoint of your index,
# e.g. http://localhost:9200/<index>/_bulk
headers = {'Content-type': 'application/json',
           'Accept': 'text/plain'}

jsons = []
for d in docs:
    _id = d.pop('_id')  # take _id out of the dict
    jsons.append('{"index":{"_id":"%s"}}\n%s\n' % (_id, json.dumps(d)))

data = ''.join(jsons)
response = requests.post(url, data=data, headers=headers)
We needed to set a specific _id but I guess you can skip this part in case you want a random _id set by ES automatically.
Hope that helps.
