Elasticsearch: update multiple docs with a Python script?

I have a scenario where we have an array field on documents and we're trying to update a key/value in that array on each document, so for instance a doc would look like:
_source: {
  "type": 1,
  "items": [
    {"item1": "value1"},
    {"item2": "value2"}
  ]
}
We're trying to efficiently update "value1", for instance, on every doc of "type": 1. We'd like to avoid conflicts, and we're hoping we can do this all with a script, preferably in Python, but I can't find any examples of how to update fields in Python, let alone across multiple documents.
So, is it possible to do this with a script and if so, does anyone have a good example?
Thanks

I know this is a little late, but I came across this in a search so I figured I would answer for anyone else who comes by. You can definitely do this using the elasticsearch Python library.
You can find all of the info and examples you need in the library's Read the Docs documentation.
More specifically, I would look into the document "update" operations, as you can update specific pieces of documents within an index using Elasticsearch.
So your script should do a few things:
Gather list of documents using the "search" operation
Depending on size, or if you want to thread, you can store in a queue
Loop through list of docs / pop doc off queue
Create updated field list
Call "update" operations EX: self.elasticsearch.update(index=MY_INDEX, doc_type=1, id=123456, body={"item1": "updatedvalue1"})

Related

How to handle an update of many docs in a single query in Elasticsearch and Python script

I'm working on a Python script which, in simple terms, will update the stock field of every document whose _id matches an id I get from a DB2 query; that query returns two columns: catentry_id and stock. So the idea is to find every id from DB2 among the documents in ES and update the stock in ES with the value from DB2.
I'm new to the world of ES; I did many searches, read many sites and the documentation, looking for a way to handle this.
I first tried to get all the docs in the index with the query below, so I could keep them in an object in Python for the next iteration against the result set from DB2.
GET /_search
{
  "_source": {
    "includes": ["_id", "stock"],
    "excludes": ["_index", "_score", "_type", "boost", "brand", "cat_1", "cat_1_id",
                 "cat_1_url", "cat_2", "cat_2_id", "cat_2_url", "cat_3", "cat_3_id",
                 "cat_3_url", "cat_4", "cat_4_id", "cat_4_url", "cat_5", "cat_5_id",
                 "cat_5_url", "category", "category_breadcrumbs", "children",
                 "children_tmp", "delivery", "discount", "fullImage", "id", "keyword",
                 "longDescription", "name", "partNumber", "pickupinstore", "price",
                 "price_internet", "price_m2", "price_tc", "product_can", "published",
                 "ribbon_ads", "shipping_normal", "shortDescription", "specs",
                 "specs_open", "thumb", "ts", "url"]
  },
  "query": {
    "range": {
      "stock": {
        "gte": 0
      }
    }
  }
}
But I don't know how to build the proper query to update all the docs. I was thinking of doing it with a Painless script or the _bulk API, but I couldn't find any example of anyone doing a similar task.
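For reference, a minimal sketch of a per-document partial update driven by a Painless script through the Python client; the endpoint, index name and helper function are assumptions:

# Hypothetical helper: update the stock of one document by id with a Painless
# script via the Update API.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")   # assumed endpoint

def update_stock(doc_id, new_stock, index="products"):   # "products" is a placeholder
    es.update(
        index=index,
        id=doc_id,
        body={
            "script": {
                "source": "ctx._source.stock = params.stock",
                "lang": "painless",
                "params": {"stock": new_stock},
            }
        },
    )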
Update:
I managed to solve the task following the guidelines in the next link, but in my case it takes roughly 20 minutes to update all the docs in Elasticsearch.
First I tried to solve the update with bulk or parallel_bulk, but then I figured out that those bulk updates replace the whole source, and it didn't work with a Painless script; if I'm wrong about that, maybe I just couldn't get it to work.
Second, I tried to update only the ids whose stock actually differs between DB2 and ES, which reduced the time a lot, but for some reason I'm not 100% sure it is updating the correct number of docs.
The last crazy thing I tried was to pass the whole dictionary into a Painless script as a param and iterate over it inside the script, but that didn't work. As I'm new to this, and I read that Painless syntax is similar to Groovy, I tried to iterate over the dictionary as a map; again it didn't work, and the API threw a syntax error.
I would like to optimize this, but my sprint finished last week and now I have other tasks.
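For what it's worth, a minimal sketch of the partial-update route with the bulk helper, assuming the DB2 rows are available as (catentry_id, stock) tuples and the index is called products (both placeholders); an update action with a "doc" body only touches the stock field instead of replacing the whole source:

# Sketch: push DB2 stock values to Elasticsearch as partial "doc" updates.
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")   # assumed endpoint

def build_actions(db2_rows, index="products"):
    for catentry_id, stock in db2_rows:
        yield {
            "_op_type": "update",      # partial update, not a full reindex
            "_index": index,
            "_id": catentry_id,
            "doc": {"stock": stock},
        }

db2_rows = [(10001, 5), (10002, 0)]    # example rows from the DB2 query
success, errors = bulk(es, build_actions(db2_rows), raise_on_error=False)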
Sorry, this is really not my day, I had 9 tabs open. How much experience do you have with parsing and Python in general? Maybe I'm just wasting your time, but your project sounds fun, and if you want to share code live I've got some hours free now.

Is there a "$elemMatch" for consult text in a object (not array)?

I have this object:
I want to set all the values of "estado" to "false"
I tried to use $elemMatch to find all the fields and pass this as a filter to the $set method, but I think $elemMatch only works with arrays.
I think you've realized this, but while mongo doesn't have good operators to deal with nested objects with arbitrary keys, it does have great array operators that make it easy (and quick!) to update & deal with nested array documents. You can even create indexes that work on keys within arrays to make querying quicker.
Depending on how much your application code depends on this array structure, you might want to first transform your documents into arrays and then use https://docs.mongodb.com/manual/reference/operator/update/positional-filtered/ to update your documents. To transform your documents, you can write a script in application code, write a script in the mongo shell in js, or use an aggregation pipeline with $out to write your documents to a new collection.
If updating your schema isn't feasible, I think you're going to have to write a script to change those nested document fields.
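If the documents are reshaped that way, a hedged pymongo sketch of the positional-filtered update could look like this; the database, collection and field names are assumptions, with the nested data moved into an "items" array:

# Sketch: set every "estado" inside the "items" array to False using arrayFilters.
from pymongo import MongoClient

coll = MongoClient()["mydb"]["mycoll"]   # assumed database/collection names

coll.update_many(
    {"items.estado": True},
    {"$set": {"items.$[elem].estado": False}},
    array_filters=[{"elem.estado": True}],   # only touch elements still set to True
)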

What is the best choice of field for this project, using MongoEngine?

I'm currently working on a project and was wondering if I could get a little bit of advice. I aim to store information about a number of URLs, each of which will have a number of parameters and will look something like this:
{
  Name: "Report1",
  Host: "127.0.0.1",
  Description: "This is a test report",
  Date: "00/00/00",
  Time: "00:00:00",
  Pages: {
    Page: "test_page_url",
    Parameters: {
      Parameter: "test_param",
      Value: "test_parm_value"
    }
  }
}
I've not been able to find much information or many examples of using a one-to-many within a one-to-many relationship in MongoEngine, and was wondering what the best approach would be. Is it possible to use EmbeddedDocumentListField in this manner, or would it be best practice to use ReferenceField? Any advice would be greatly appreciated, as I'm still quite new to the NoSQL approach.
As a simple rule, embedded documents are well suited if:
a Parameter belongs to only one Page, and a Page and its Parameters belong to only one Report
your document won't grow over 16 MB.
AFAIU, this is the case here, so I don't see any reason to use ReferenceFields, which would introduce complexity and a performance penalty.
To put it simply, if you can represent your data as you did in your question, laid out as a single document rather than linking several documents together like in a relational model, then it's probably safe to use embedded documents.
More details here as suggested in a comment.
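For illustration, a minimal MongoEngine sketch of that embedded layout; the class and field names are assumptions based on the document in the question:

# Hypothetical MongoEngine models mirroring the document above.
from mongoengine import (Document, EmbeddedDocument, EmbeddedDocumentListField,
                         StringField, connect)

connect("mydb")   # assumed database name

class Parameter(EmbeddedDocument):
    parameter = StringField(required=True)
    value = StringField()

class Page(EmbeddedDocument):
    page = StringField(required=True)
    parameters = EmbeddedDocumentListField(Parameter)

class Report(Document):
    name = StringField(required=True)
    host = StringField()
    description = StringField()
    date = StringField()   # kept as strings to match the example values
    time = StringField()
    pages = EmbeddedDocumentListField(Page)

report = Report(
    name="Report1",
    host="127.0.0.1",
    pages=[Page(page="test_page_url",
                parameters=[Parameter(parameter="test_param", value="test_parm_value")])],
)
report.save()   # one document, everything embedded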

Quicker way of updating subdocuments

My JSON documents (called "i") have subdocuments (called "elements").
I am looping through these subdocuments and updating them one at a time. However, to do so (once the value I need is computed), I have Mongo scan through all the documents in the database, then through all the subdocuments, and then find the subdocument it needs to update.
I am having major time issues, as I have ~3000 documents and this is taking about 4 minutes.
I would like to know if there is a quicker way to do this, without Mongo having to scan all the documents, but by doing it within the loop.
Here is the code:
for i in db.stuff.find():
    for element in i['counts']:
        computed_value = element[a] + element[b]
        db.stuff.update({'id': i['id'], 'counts.timestamp': element['timestamp']},
                        {'$set': {'counts.$.total': computed_value}})
I am identifying the overall document by "id" and then the subdocument by its timestamp (which is unique to each subdocument). I need to find a quicker way than this. Thank you for your help.
What indexes do you have on your collection? This could probably be sped up by creating an index on your embedded documents. You can do this using dot notation -- there's a good explanation and example here.
In your case, you'd do something like
db.stuff.ensureIndex( { "i.elements.timestamp" : 1 });
This will make your searches through embedded documents run much faster.
Your update is based on id (and I assume it is different from MongoDB's default _id).
Put an index on your id field.
Do you want to set the new field for all documents in the collection, or only for documents matching a given criteria? If it's only for matching documents, use a query operator (with an index if possible).
Don't fetch the full document; fetch only the fields that are being used.
What is your average document size? Use explain and mongostat to understand where the actual bottleneck is.
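On top of the indexing advice, the per-subdocument round trips themselves can be batched; here is a hedged sketch with pymongo's bulk_write that reuses the field names from the question (the connection details and the 'a'/'b' keys are assumptions):

# Sketch: collect the updates and send them in one batch instead of one
# round trip per subdocument.
from pymongo import MongoClient, UpdateOne

db = MongoClient()["mydb"]   # assumed connection/database name

ops = []
# fetch only the fields the loop actually uses
for doc in db.stuff.find({}, {"id": 1, "counts": 1}):
    for element in doc["counts"]:
        computed_value = element["a"] + element["b"]   # 'a'/'b' stand in for the real keys
        ops.append(UpdateOne(
            {"id": doc["id"], "counts.timestamp": element["timestamp"]},
            {"$set": {"counts.$.total": computed_value}},
        ))

if ops:
    result = db.stuff.bulk_write(ops, ordered=False)
    print(result.modified_count, "subdocuments updated")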

Django ORM: Organizing massive amounts of data, the right way

I have a Django app that uses django-piston to send out XML feeds to internal clients. Generally, these work pretty well but we have some XML feeds that currently run over 15 minutes long. This causes timeouts, and the feeds become unreliable.
I'm trying to ponder ways that I can improve this setup. If it requires some re-structuring of the data, that could be possible too.
Here is how the data collection currently looks:
class Data(models.Model):
    # fields

class MetadataItem(models.Model):
    data = models.ForeignKey(Data)

# handlers.py
data = Data.objects.filter(**kwargs)
for d in data:
    for metaitem in d.metadataitem_set.all():
        # There are usually anywhere between 55 - 95 entries in this loop
        label = metaitem.get_label()  # does some formatting here
        data_metadata[label] = metaitem.body
Obviously, the core of the program is doing much more, but I'm just pointing out where the problem lies. When we have a data list of 300 it just becomes unreliable and times out.
What I've tried:
Getting a collection of all the Data ids, then doing a single large query to get all the MetadataItems, and finally filtering those in my loop. This was meant to save some queries, and it did reduce them.
Using .values() to reduce model instance overhead, which did speed it up but not by much.
One simpler idea I'm considering is to write to a cache in steps, to avoid the timeout: write the first 50 data sets, save to cache, adjust some counter, write the next 50, and so on. I still need to think this through.
Hoping someone can help lead me into the right direction with this.
The problem in the piece of code you posted is that Django doesn't include objects that are connected through a reverse relationship automatically, so you have to make a query for each object. There's a nice way around this, as Daniel Roseman points out in his blog!
If this doesn't solve your problem well, you could also have a look at trying to get everything in one raw SQL query...
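The usual shape of that workaround is to fetch all the related rows in one extra query and group them in Python; a rough sketch that reuses the models and kwargs from the question (the import path is an assumption):

# Sketch: one query for the Data rows, one for all their MetadataItems, then
# group the items per Data object in Python instead of querying per row.
from collections import defaultdict
from myapp.models import Data, MetadataItem   # hypothetical app path

data = list(Data.objects.filter(**kwargs))     # same kwargs as in handlers.py
items = MetadataItem.objects.filter(data__in=[d.pk for d in data])

metadata_by_data = defaultdict(dict)
for metaitem in items:
    metadata_by_data[metaitem.data_id][metaitem.get_label()] = metaitem.body

for d in data:
    data_metadata = metadata_by_data[d.pk]     # no per-object queries here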
You could maybe further reduce the query count by first getting all Data ids and then using select_related to get the data and its metadata in a single big query. This would greatly reduce the number of queries, but the size of the queries might be impractical/too big. Something like:
data_ids = Data.objects.filter(**kwargs).values_list('id', flat=True)
for i in data_ids:
    data = Data.objects.select_related().get(pk=i)
    # data.metadataitem_set.all() can now be called without querying the database
    for metaitem in data.metadataitem_set.all():
        # ...
However, I would suggest, if possible, precomputing the feeds somewhere outside the web server. Maybe you could store the result in memcache if it's smaller than 1 MB. Or you could be one of the cool new kids on the block and store the result in a "NoSQL" database like Redis. Or you could just write it to a file on disk.
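For the memcache route, a small sketch with Django's cache framework; the key scheme, the timeout and the build_feed callable are assumptions:

# Hypothetical helper: cache the rendered feed so the expensive query work
# only runs when the cached copy has expired.
from django.core.cache import cache

def get_feed_xml(feed_key, build_feed):
    cache_key = "piston-feed-%s" % feed_key    # placeholder key scheme
    xml = cache.get(cache_key)
    if xml is None:
        xml = build_feed()                     # the slow part: queries + serialization
        cache.set(cache_key, xml, 60 * 15)     # keep it for 15 minutes (arbitrary)
    return xml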
If you can change the structure of the data, maybe you can also change the datastore?
The "NoSQL" databases which allow some structure, like CouchDB or MongoDB could actually be useful here.
Let's say for every Data item you have a document. The document would have your normal fields. You would also add a 'metadata' field which is a list of metadata. What about the following data structure:
{
  'id': 'someid',
  'field': 'value',
  'metadata': [
    { 'key': 'value' },
    { 'key': 'value' }
  ]
}
You would then be able to easily get to a data record and fetch all its metadata. For searching, add indexes to the fields in the 'data' document.
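For example, with pymongo such an index could look like this; the database and collection names are placeholders:

# Sketch: index a key inside the embedded metadata list so lookups on it
# don't have to scan every document.
from pymongo import MongoClient

coll = MongoClient()["mydb"]["data"]     # assumed database/collection names
coll.create_index("metadata.key")        # dotted path reaches into the list elements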
I've worked on a system in Erlang/OTP that used Mnesia which is basically a key-value database with some indexing and helpers. We used nested records heavily to great success.
I added this as a separate answer as it's totally different than the other.
Another idea is to use Celery (www.celeryproject.com), which is a task management system for Python and Django. You can use it to perform any long-running tasks asynchronously without holding up your main app server.
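A bare-bones sketch of that idea, with the task name, import path and serialize_to_xml step all hypothetical: the handler enqueues the work, and the worker builds and caches the feed.

# Hypothetical Celery task: build the feed outside the request/response cycle
# and stash the result in the cache for the handler to serve.
from celery import shared_task
from django.core.cache import cache
from myapp.models import Data            # the question's model; app path assumed

@shared_task
def build_feed(cache_key, **filter_kwargs):
    data = Data.objects.filter(**filter_kwargs)
    xml = serialize_to_xml(data)          # placeholder for the existing piston serialization
    cache.set(cache_key, xml, 60 * 60)    # keep the result for an hour
    return cache_key

# In the handler: build_feed.delay("feed-daily", type=1), then serve the cached copy.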
