Optimize Elasticsearch update script body using Python bulk update

Consider this script:
import hashlib

hashed_ids = [hashlib.md5(doc.encode('utf-8')).hexdigest() for doc in shingles]

update_by_query_body = {
    "query": {
        "terms": {
            "id": ["id1", "id2"]
        }
    },
    "script": {
        "source": "long weightToAdd = params.hashed_ids.stream().filter(idFromList -> ctx._source.id.equals(idFromList)).count(); ctx._source.weight += weightToAdd;",
        "params": {
            "hashed_ids": ["id1", "id1", "id1", "id2"]
        }
    }
}
This script does what it's supposed to do, but it is very slow and from time to time raises a timeout error.
What is it supposed to do?
I need to update a field of a doc in Elasticsearch and add the count of that doc in a list inside my Python code. The weight field contains the count of the doc in a dataset. The dataset needs to be updated from time to time, so the count of each document must be updated too. hashed_ids is a list of document ids that are in the new batch of data. The weight of each matched id must be increased by the count of that id in hashed_ids.
For example, say a doc with id=d1b145716ce1b04ea53d1ede9875e05a and weight=5 is already present in the index, and the string d1b145716ce1b04ea53d1ede9875e05a is repeated three times in hashed_ids, so the update_by_query request will match the doc in the database. I need to add 3 to 5 and end up with 8 as the final weight.
I need ideas to improve the efficiency of the code.
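To make the intended arithmetic concrete, the counting step can be restated in plain Python (a sketch only; apart from shingles and hashed_ids, the variable names and values are illustrative):
import hashlib
from collections import Counter

shingles = ["some doc", "other doc", "some doc", "some doc"]  # illustrative batch
hashed_ids = [hashlib.md5(doc.encode('utf-8')).hexdigest() for doc in shingles]

# How many times each id occurs in the new batch.
counts = Counter(hashed_ids)

# A doc already indexed with weight=5 whose id occurs 3 times in hashed_ids
# should end up with weight 5 + 3 = 8.
existing_weight = 5
doc_id = hashed_ids[0]
new_weight = existing_weight + counts[doc_id]
Passing such a precomputed counts map in the script params, so that the painless script does a single map lookup per matched document instead of streaming over the whole hashed_ids list, is one possible direction for the efficiency question; that is an assumption to benchmark, not part of the original script.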

Related

How to ensure all data is captured in ES API?

I am trying to create an API in Python to pull the data from ES and feed it in a data warehouse. The data is live and being filled every second so I am going to create a near-real-time pipeline.
The current URL format is {{url}}/{{index}}/_search and the test payload I am sending is:
{
"from" : 0,
"size" : 5
}
On the next refresh it will pull using payload:
{
"from" : 6,
"size" : 5
}
And so on until it reaches the total number of records. The PROD environment has about 250M rows and I'll set the size to 10K per extract.
I am worried, though, because I don't know whether the records are being reordered within ES. Currently there is a plugin which uses a timestamp generated by the user, but that is flawed: sometimes documents are skipped because of a delay in the JSON documents being made available for extraction in ES, and possibly because of the way the time is generated.
Does anyone know what the default sorting is when pulling the data using /_search?
I suppose what you're looking for is a streaming / changes API, which is nicely described by @Val here and is also an open feature request.
In the meantime, you cannot really count on the size and from parameters -- you could probably make redundant queries and handle the duplicates before they reach your data warehouse.
Another option would be to skip ES in this regard and stream directly to the warehouse. What I mean is: take an ES snapshot up until a given time once (so you keep the historical data), feed it to the warehouse, and then stream directly from wherever you're getting your data into the warehouse.
Addendum
AFAIK the default sorting is by insertion date, but there's no internal _insertTime or similar. You can use cursors -- it's called scrolling and here's a py implementation. But this goes from the 'latest' doc to the 'first', not vice versa, so it'll give you all the existing docs, though I'm not so sure about the ones newly added while you were scrolling. You'd then want to run the scroll again, which is suboptimal.
You could also pre-sort your index, which should work quite nicely for your use case when combined with scrolling.
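For reference, a minimal scrolling sketch with the elasticsearch-py helpers (assuming the elasticsearch-py client; the index name, query and process() callback are illustrative):
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")

# helpers.scan wraps the scroll API and yields every matching hit.
# preserve_order=True keeps the sort order at the cost of a heavier scroll.
for hit in helpers.scan(
        es,
        index="my-index",
        query={"query": {"match_all": {}}},
        scroll="2m",
        preserve_order=True,
        size=1000):
    process(hit["_source"])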
Thanks for the responses. After consideration with my colleagues, we decided to implement and use the _ingest API instead to create a pipeline in ES which inserts the server document ingestion date on each doc.
Steps:
Create the timestamp pipeline
PUT _ingest/pipeline/timestamp_pipeline
{
"description" : "Inserts timestamp field for all documents",
"processors" : [
{
"set" : {
"field": "insert_date",
"value": "{{_ingest.timestamp}}"
}
}
]
}
Update the indices so that the new pipeline is applied by default
PUT /*/_settings
{
"index" : {
"default_pipeline": "timestamp_pipeline"
}
}
Then, in Python, I use the scroll API like so:
from elasticsearch import Elasticsearch

es = Elasticsearch(cfg.esUrl, port=cfg.esPort, timeout=200)
doc = {
    "query": {
        "range": {
            "insert_date": {
                "gte": lastRowDateOffset
            }
        }
    }
}
res = es.search(
    index=Index,
    sort="insert_date:asc",
    scroll="2m",
    size=NumberOfResultsPerPage,
    body=doc
)
Where lastRowDateOffset is the date of the last run
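The search call above only returns the first page; subsequent pages are fetched with the scroll id it returns. A continuation sketch (same elasticsearch-py client; error handling omitted):
scroll_id = res["_scroll_id"]
hits = res["hits"]["hits"]

while hits:
    # hand the current page (hits) to the warehouse loader here
    page = es.scroll(scroll_id=scroll_id, scroll="2m")
    scroll_id = page["_scroll_id"]
    hits = page["hits"]["hits"]

# release the server-side scroll context when finished
es.clear_scroll(scroll_id=scroll_id)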

Is there a way of setting a range when querying data of firestore?

I have a collection of documents, all with random ids and a field called date.
docs = collection_ref.order_by(u'date', direction=firestore.Query.ASCENDING).get()
Imagine I had limited the search to the first ten:
docs = collection_ref.order_by(u'date', direction=firestore.Query.ASCENDING).limit(10).get()
How would I continue with my next query when I want to get the items from 11 to 20?
You can use offset(), but every doc skipped still counts as a read. For example, if you did query.offset(10).limit(5), you would be charged for 15 reads: the 10 offset plus the 5 that you actually got.
If you want to avoid unnecessary reads, use startAt() or startAfter().
Example (Java, sorry, I don't speak Python; here's a link to the docs though):
QuerySnapshot querySnapshot = // your query snapshot
List<DocumentSnapshot> docs = querySnapshot.getDocuments();
// reference doc; the next query starts at or after this one
DocumentSnapshot indexDocSnapshot = docs.get(docs.size() - 1);
// to 'start after', i.e. paginate, do the following:
query.startAfter(indexDocSnapshot).limit(10).get()
    .addOnSuccessListener(queryDocumentSnapshots -> {
        // next 10 docs here, no extra reads necessary
    });
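A rough Python equivalent with the google-cloud-firestore client (a sketch; the collection name is illustrative, while collection_ref and the ordering follow the question):
from google.cloud import firestore

db = firestore.Client()
collection_ref = db.collection(u'mycollection')

# First page of ten, ordered by date.
first_page = list(collection_ref.order_by(u'date', direction=firestore.Query.ASCENDING).limit(10).stream())

# Use the last snapshot of the previous page as the cursor for the next page.
last_doc = first_page[-1]
next_page = list(
    collection_ref.order_by(u'date', direction=firestore.Query.ASCENDING)
    .start_after(last_doc)
    .limit(10)
    .stream()
)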

Cleansing Data with Updates - Mongodb + Python

I have imported the dataset into MongoDB but am not able to cleanse the data in Python. Please see the question and the script below. I need the answers for Script 1 and Script 2.
The task is to import it into MongoDB, cleanse the data in Python, and update MongoDB with the cleaned data. Specifically, you'll be taking a people dataset where some of the birthday fields look like this:
{
...
"birthday": ISODate("2011-03-17T11:21:36Z"),
...
}
And other birthday fields look like this:
{
...
"birthday": "Thursday, March 17, 2011 at 7:21:36 AM",
...
}
MongoDB natively supports a Date datatype through BSON. This datatype is used in the first example, but a plain string is used in the second example. In this assessment, you'll complete the attached notebook to script a fix that makes all of the document's birthday field a Date.
Download the notebook and dataset to your notebook directory. Once you have the notebook up and running, and after you've updated your connection URI in the third cell, continue through the cells until you reach the fifth cell, where you'll import the dataset. This can take up to 10 minutes depending on the speed of your Internet connection and computing power of your computer.
After verifying that all of the documents have successfully been inserted into your cluster, you'll write a query in the 7th cell to find all of the documents that use a string for the birthday field.
To verify your understanding of the first part of this assessment, how many documents had a string value for the birthday field (the output of cell 8)?
Script1
Replace YYYY with a query on the people-raw collection that will return a cursor with only
documents where the birthday field is a string
people_with_string_birthdays = YYYY
This is the answer to verify you completed the lab:
people_with_string_birthdays.count()
Script2
updates = []
count = 0  # number of updates queued in the current batch; batch_size is assumed to be defined earlier in the notebook
# Again, we're updating several thousand documents, so this will take a little while
for person in people_with_string_birthdays:
# Pymongo converts datetime objects into BSON Dates. The dateparser.parse function
# returns a datetime object, so we can simply do the following to update the field
# properly. Replace ZZZZ with the correct update operator
updates.append(UpdateOne(
{"_id": person["_id"]},
{ZZZZ: { "birthday": dateparser.parse(person["birthday"]) } }
))
count += 1
if count == batch_size:
people_raw.bulk_write(updates)
updates = []
count = 0
if updates:
people_raw.bulk_write(updates)
count = 0
# If everything went well this should be zero
people_with_string_birthdays.count()
import json

# Count how many documents store the birthday field as each Python type.
with open("./people-raw.json") as dataset:
    array = {}
    for i in dataset:
        a = json.loads(i)
        if type(a["birthday"]) not in array:
            array[type(a["birthday"])] = 1
        else:
            array[type(a["birthday"])] += 1
print(array)
Give the path of your people-raw.json file in the open() call if the JSON file is not in the same directory.
Ans: 10382
Script 1: YYYY = people_raw.find({"birthday": {"$type": "string"}})
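For Script 2, the placeholder is presumably the $set update operator, which replaces the value of the birthday field. A minimal pymongo sketch outside the lab template (assuming pymongo and dateparser, with person being one document from the cursor):
from pymongo import UpdateOne
import dateparser

# dateparser.parse returns a datetime, which pymongo stores as a BSON Date.
op = UpdateOne(
    {"_id": person["_id"]},
    {"$set": {"birthday": dateparser.parse(person["birthday"])}}
)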

Adding value to array in sub-document (nested within main doc) without duplication - MongoDB

It is quite complicated with the nested documents, but please let me know if you have any solutions, thanks.
To summarize, I would like to:
Add a value to an array (without duplication), where the array is within a sub-document that is itself within an array of the main document. (Document > Array > Subdoc > Array)
The subdocument itself might not exist, so if it does not exist it needs to be added, i.e. upsert
The command should be the same for both actions (i.e. adding a value to the subdoc's array, and adding the subdoc itself)
I have tried the following, but it doesn't work:
key = {'username':'user1'}
update1 = {
'$addToSet':{'clients':{
'$set':{'fname':'Jessica'},
'$set':{'lname':'Royce'},
'$addToSet':{'cars':'Toyota'}
}
}
}
#the document with 'Jessica' and 'Royce' does not exist in clients array, so a new document should be created
update2 = {
'$addToSet':{'clients':{
'$set':{'fname':'Jessica'},
'$set':{'lname':'Royce'},
'$addToSet':{'cars':'Honda'}
}
}
}
#now that the document with 'Jessica' and 'Royce' already exist in clients array, only the value of 'Honda' should be added to the cars array
mongo_collection.update(key, update1 , upsert=True)
mongo_collection.update(key, update2 , upsert=True)
error message: $set is not valid for storage
My intended outcome:
Before:
{
'username':'user1',
'clients':[
{'fname':'John',
'lname':'Baker',
'cars':['Merc','Ferrari']}
]
}
1st After:
{
'username':'user1',
'clients':[
{'fname':'John',
'lname':'Baker',
'cars':['Merc','Ferrari']},
{'fname':'Jessica',
'lname':'Royce',
'cars':['Toyota']}
]
}
2nd After:
{
'username':'user1',
'clients':[
{'fname':'John',
'lname':'Baker',
'cars':['Merc','Ferrari']},
{'fname':'Jessica',
'lname':'Royce',
'cars':['Toyota','Honda']}
]
}
My understanding is that you won't be able to achieve the intended behaviour directly. You can do a nested update or upsert, but probably not the duplication check, as there is no direct way to test whether an item is already contained in an array of subdocuments.
For the upsert operation you can refer to the MongoDB update documentation or bulk operations. For the duplication check you will probably need separate logic, as sketched below.
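One way to sketch that two-step logic with pymongo (the names follow the question; this is an illustration under those assumptions, not a verified solution):
# Step 1: if the client subdocument already exists, add the car to its array
# ($addToSet avoids duplicate cars).
res = mongo_collection.update_one(
    {'username': 'user1', 'clients': {'$elemMatch': {'fname': 'Jessica', 'lname': 'Royce'}}},
    {'$addToSet': {'clients.$.cars': 'Toyota'}}
)

# Step 2: if no such subdocument matched, push a new one carrying the car.
if res.matched_count == 0:
    mongo_collection.update_one(
        {'username': 'user1'},
        {'$push': {'clients': {'fname': 'Jessica', 'lname': 'Royce', 'cars': ['Toyota']}}},
        upsert=True
    )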

Building an ElasticSearch search with exists using pyes

The goal of this example code is to figure out how to create a query consisting out of multiple filters and queries.
The below example is not working as expected.
I want to be able to execute my search only on documents which contain a certain "key". That is what I'm trying to achieve with the ExistsFilter, but when I enable it I don't get any results back.
Any pointers to clear up this question?
#!/usr/bin/python
import pyes
conn = pyes.ES('sandbox:9200')
conn.index('{"test":{"field1":"value1","field2":"value2"}}','2012.9.23','test')
filter = pyes.filters.BoolFilter()
filter.add_must(pyes.filters.LimitFilter(1))
filter.add_must(pyes.filters.ExistsFilter('test')) # commenting this line out returns the documents
query = pyes.query.BoolQuery()
query.add_must(pyes.query.TextQuery('test.field1','value1'))
query.add_must(pyes.query.TextQuery('test.field2','value2'))
search = pyes.query.FilteredQuery(query, filter)
for reference in conn.search(query=search,indices=['2012.9.23']):
print reference
I don't use pyes (nor Python). But what I can see here is that some information seems to be missing in the ExistsFilter, if I compare with the ExistsFilter documentation:
{
"constant_score" : {
"filter" : {
"exists" : { "field" : "user" }
}
}
}
Could it be your issue?
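For comparison, the raw request body the pyes code above is meant to produce would look roughly like this, in the pre-2.0 filtered-query form (a sketch as a Python dict; whether the exists filter should target the object field 'test' or a leaf field such as 'test.field1' is worth testing):
search_body = {
    "query": {
        "filtered": {
            "query": {
                "bool": {
                    "must": [
                        {"text": {"test.field1": "value1"}},
                        {"text": {"test.field2": "value2"}}
                    ]
                }
            },
            "filter": {
                "exists": {"field": "test.field1"}
            }
        }
    }
}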
