How to ensure all data is captured in ES API? - python

I am trying to create an API in Python to pull data from ES and feed it into a data warehouse. The data is live and being written every second, so I am going to create a near-real-time pipeline.
The current URL format is {{url}}/{{index}}/_search and the test payload I am sending is:
{
  "from" : 0,
  "size" : 5
}
On the next refresh it will pull using payload:
{
  "from" : 6,
  "size" : 5
}
And so on until it reaches the total number of records. The PROD environment has about 250M rows, and I'll set the size to 10K per extract.
I am worried, though, because I don't know whether the records are being reordered within ES. Currently there is a plugin which uses a timestamp generated by the user, but that is flawed: documents are sometimes skipped due to a delay in the JSON documents being made available for extract in ES, and possibly due to the way the time is generated.
Does anyone know what is the default sorting when pulling the data using /_search?

I suppose what you're looking for is a streaming / changes API, which is nicely described by @Val here and is also an open feature request.
In the meantime, you cannot really count on the size and from parameters -- you could make redundant, overlapping queries and handle the duplicates before they reach your data warehouse.
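As a rough sketch of that overlap-and-dedupe idea (the client endpoint, index name, window sizes, and the ship() writer are all placeholders I'm assuming, not your actual setup):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint
seen = set()  # _ids already shipped to the warehouse

# Step the window by less than its size so consecutive pulls overlap;
# duplicates are then dropped by _id. Note that from + size is capped
# by index.max_result_window (10k by default), so deep pulls need
# scrolling instead.
size, step = 1000, 800
for frm in range(0, 9000, step):
    body = {"from": frm, "size": size, "sort": ["_doc"]}
    for hit in es.search(index="my-index", body=body)["hits"]["hits"]:
        if hit["_id"] not in seen:
            seen.add(hit["_id"])
            ship(hit["_source"])  # placeholder warehouse writer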
Another option would be to skip ES in this regard and stream directly to the warehouse. What I mean is: take an ES snapshot up until a given time once (so you keep the historical data), feed it to the warehouse, and then stream directly from wherever you're getting your data into the warehouse.
Addendum
AFAIK the default sorting is by insertion date, but there's no internal _insertTime field or similar. You can use cursors -- it's called scrolling, and here's a py implementation. But scrolling goes from the 'latest' doc to the 'first', not vice versa, so it'll give you all the existing docs, but I'm not so sure about ones newly added while you were scrolling. You'd then want to run the scroll again, which is suboptimal.
You could also pre-sort your index, which should work quite nicely for your use case when combined with scrolling.

Thanks for the responses. After consideration with my colleagues, we decided to use the _ingest API instead and create a pipeline in ES which stamps each document with the server-side ingestion date.
Steps:
Create the timestamp pipeline
PUT _ingest/pipeline/timestamp_pipeline
{
  "description" : "Inserts timestamp field for all documents",
  "processors" : [
    {
      "set" : {
        "field": "insert_date",
        "value": "{{_ingest.timestamp}}"
      }
    }
  ]
}
Update the indexes to set the new default pipeline
PUT /*/_settings
{
  "index" : {
    "default_pipeline": "timestamp_pipeline"
  }
}
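Before relying on the pipeline, it can be dry-run against a sample document with the simulate endpoint. A minimal sketch using elasticsearch-py (the endpoint and the sample doc are placeholders):
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

# The response shows the document as it would be indexed,
# including the generated insert_date field
res = es.ingest.simulate(
    id="timestamp_pipeline",
    body={"docs": [{"_source": {"some_field": "some_value"}}]}
)
print(res["docs"][0]["doc"]["_source"]["insert_date"])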
In Python I then use the _scroll API like so:
from elasticsearch import Elasticsearch

es = Elasticsearch(cfg.esUrl, port=cfg.esPort, timeout=200)
doc = {
    "query": {
        "range": {
            "insert_date": {
                "gte": lastRowDateOffset
            }
        }
    }
}
res = es.search(
    index=Index,
    sort="insert_date:asc",
    scroll="2m",
    size=NumberOfResultsPerPage,
    body=doc
)
Where lastRowDateOffset is the timestamp of the last run.
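For completeness, the search above only returns the first page; the rest come from the scroll cursor. A sketch of the continuation loop, assuming the res from the call above and a placeholder warehouse writer:
scroll_id = res["_scroll_id"]
hits = res["hits"]["hits"]
while hits:
    for hit in hits:
        load_into_warehouse(hit["_source"])  # placeholder for the warehouse insert
    # Each scroll call returns the next page and renews the cursor timeout
    res = es.scroll(scroll_id=scroll_id, scroll="2m")
    scroll_id = res["_scroll_id"]
    hits = res["hits"]["hits"]
es.clear_scroll(scroll_id=scroll_id)  # release the server-side scroll context
After the loop, lastRowDateOffset can be advanced to the largest insert_date seen, ready for the next run.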

Related

Optimize Elasticsearch update script body using python bulk update

Consider this script:
import hashlib

hashed_ids = [hashlib.md5(doc.encode('utf-8')).hexdigest() for doc in shingles]
update_by_query_body = {
    "query": {
        "terms": {
            "id": ["id1", "id2"]
        }
    },
    "script": {
        "source": "long weightToAdd = params.hashed_ids.stream().filter(idFromList -> ctx._source.id.equals(idFromList)).count(); ctx._source.weight += weightToAdd;",
        "params": {
            "hashed_ids": ["id1", "id1", "id1", "id2"]
        }
    }
}
This script does what it's supposed to do, but it is very slow and from time to time raises a timeout error.
What is it supposed to do?
I need to update a field of a doc in Elasticsearch, adding the count of that doc in a list inside my Python code. The weight field contains the count of the doc in a dataset. The dataset needs to be updated from time to time, so the count of each document must be updated too. hashed_ids is a list of document ids in the new batch of data; the weight of each matched id must be increased by the count of that id in hashed_ids.
For example, say a doc with id=d1b145716ce1b04ea53d1ede9875e05a and weight=5 is already present in the index, and the string d1b145716ce1b04ea53d1ede9875e05a is repeated three times in hashed_ids. The update_by_query request will match the doc, and I need to add 3 to 5 to get a final weight of 8.
I need ideas to improve the efficiency of the code.
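One direction that might help, sketched below: precompute the per-id counts in Python with a Counter, then send one scripted increment per unique id through the bulk helper. This assumes the documents are indexed with the hash as their _id (if not, one update_by_query per unique id with a plain counter param would still avoid streaming over hashed_ids inside the script):
from collections import Counter
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # placeholder endpoint

counts = Counter(hashed_ids)  # e.g. {"id1": 3, "id2": 1}

actions = (
    {
        "_op_type": "update",
        "_index": "my-index",   # placeholder index name
        "_id": doc_id,          # assumes the hash is the document _id
        "script": {
            "source": "ctx._source.weight += params.n",
            "params": {"n": n},
        },
    }
    for doc_id, n in counts.items()
)
helpers.bulk(es, actions)  # counting happens client-side, not in the script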

Save a subset of MongoDB(3.0) collection to another collection in Python

I found this answer - Answer link
db.full_set.aggregate([ { $match: { date: "20120105" } }, { $out: "subset" } ]);
I want to do the same thing but with the first 15000 documents in the collection. I couldn't find how to apply a limit to such a query (I tried using $limit : 15000, but it doesn't recognize $limit).
Also, when I tried
db.subset.insert(db.full_set.find({}).limit(15000).toArray())
there is no toArray() function for the output type cursor.
How can I accomplish this?
Well, in Python this is how things work: $limit needs to be wrapped in quotes (i.e. passed as the string key '$limit'), and you need to create a pipeline to execute it as a command.
In my code -
pipeline = [{ '$limit': 15000 },{'$out': "destination_collection"}]
db.command('aggregate', "source_collection", pipeline=pipeline)
You need to wrap everything in double quotes, including your source and destination collection names.
And in db.command, db is your database object (i.e. dbclient.database_name).
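Putting it together with the date filter from the original question -- a sketch assuming PyMongo and placeholder connection details:
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["database_name"]

pipeline = [
    {"$match": {"date": "20120105"}},  # the filter from the original question
    {"$limit": 15000},
    {"$out": "destination_collection"},
]
db.command("aggregate", "source_collection", pipeline=pipeline)
On newer servers (MongoDB 3.6+), the raw aggregate command also requires a cursor argument, so db["source_collection"].aggregate(pipeline) is the safer spelling there.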
As per this answer -
It works about 100 times faster than forEach at least in my case. This is because the entire aggregation pipeline runs in the mongod process, whereas a solution based on find() and insert() has to send all of the documents from the server to the client and then back. This has a performance penalty, even if the server and client are on the same machine.
The one that really helped me figure this answer out - Reference 1
And official documentation

Adding value to array in sub-document (nested within main doc) without duplication - MongoDB

It is quite complicated with nested documents, but please let me know if any of you have a solution, thanks.
To summarize, I would like to:
Add a value to an array (without duplication), where the array is within a sub-document, which is within an array of a main document. (Document > Array > Subdoc > Array)
The subdocument itself might not exist; if it doesn't, the subdocument needs to be added, i.e. an upsert
The command should be the same for both actions (i.e. adding a value to the subdoc's array, and adding the subdoc)
I have tried the following, but it doesn't work:
key = {'username': 'user1'}
update1 = {
    '$addToSet': {'clients': {
        '$set': {'fname': 'Jessica'},
        '$set': {'lname': 'Royce'},
        '$addToSet': {'cars': 'Toyota'}
    }}
}
#the document with 'Jessica' and 'Royce' does not exist in clients array, so a new document should be created
update2 = {
    '$addToSet': {'clients': {
        '$set': {'fname': 'Jessica'},
        '$set': {'lname': 'Royce'},
        '$addToSet': {'cars': 'Honda'}
    }}
}
#now that the document with 'Jessica' and 'Royce' already exist in clients array, only the value of 'Honda' should be added to the cars array
mongo_collection.update(key, update1, upsert=True)
mongo_collection.update(key, update2, upsert=True)
error message: $set is not valid for storage
My intended outcome:
Before:
{
  'username': 'user1',
  'clients': [
    {'fname': 'John',
     'lname': 'Baker',
     'cars': ['Merc', 'Ferrari']}
  ]
}
1st After:
{
  'username': 'user1',
  'clients': [
    {'fname': 'John',
     'lname': 'Baker',
     'cars': ['Merc', 'Ferrari']},
    {'fname': 'Jessica',
     'lname': 'Royce',
     'cars': ['Toyota']}
  ]
}
2nd After:
{
  'username': 'user1',
  'clients': [
    {'fname': 'John',
     'lname': 'Baker',
     'cars': ['Merc', 'Ferrari']},
    {'fname': 'Jessica',
     'lname': 'Royce',
     'cars': ['Toyota', 'Honda']}
  ]
}
My understanding is that you won't be able to achieve the intended outcome completely with a single direct command. You can certainly do a nested update or upsert, but probably not the duplication check, as there is no direct way to test whether an item is contained in an array of documents.
For the upsert operation you can refer to the MongoDB update operation docs or bulk operations. For the duplication check you'll probably need separate logic to identify matches.
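One common shape for that separate logic, sketched with PyMongo (names mirror the example; note this is two commands rather than the single one asked for):
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["mydb"]["mycoll"]  # placeholder names

def add_car(username, fname, lname, car):
    # Try to add the car to an existing client subdocument;
    # the positional $ targets the matched array element and
    # $addToSet keeps the cars array free of duplicates
    res = coll.update_one(
        {'username': username,
         'clients': {'$elemMatch': {'fname': fname, 'lname': lname}}},
        {'$addToSet': {'clients.$.cars': car}},
    )
    if res.matched_count == 0:
        # No such client yet: push a fresh subdocument instead
        coll.update_one(
            {'username': username},
            {'$push': {'clients': {'fname': fname, 'lname': lname, 'cars': [car]}}},
            upsert=True,
        )

add_car('user1', 'Jessica', 'Royce', 'Toyota')  # creates the subdocument
add_car('user1', 'Jessica', 'Royce', 'Honda')   # only appends to cars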

Building an ElasticSearch search with exists using pyes

The goal of this example code is to figure out how to create a query consisting out of multiple filters and queries.
The example below is not working as expected.
I want to be able to execute my search only on documents which contain a certain "key". That's what I'm trying to achieve with the ExistsFilter, but when I enable it I don't get any results back.
Any pointers to clear up this question?
#!/usr/bin/python
import pyes

conn = pyes.ES('sandbox:9200')
conn.index('{"test":{"field1":"value1","field2":"value2"}}', '2012.9.23', 'test')

filter = pyes.filters.BoolFilter()
filter.add_must(pyes.filters.LimitFilter(1))
filter.add_must(pyes.filters.ExistsFilter('test'))  # commenting out this line returns the documents

query = pyes.query.BoolQuery()
query.add_must(pyes.query.TextQuery('test.field1', 'value1'))
query.add_must(pyes.query.TextQuery('test.field2', 'value2'))

search = pyes.query.FilteredQuery(query, filter)
for reference in conn.search(query=search, indices=['2012.9.23']):
    print reference
I don't use pyes (nor Python). But what I can see here is that some information seems to be missing from the ExistsFilter, if I compare it to the exists filter documentation:
{
  "constant_score" : {
    "filter" : {
      "exists" : { "field" : "user" }
    }
  }
}
Could it be your issue?
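Another guess, purely an assumption on my part: the filter targets the object field test, while the indexed leaves are test.field1 and test.field2, and exists matches concrete (leaf) fields. Pointing the filter at a leaf might behave differently:
filter.add_must(pyes.filters.ExistsFilter('test.field1'))  # a leaf field instead of the object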

How to build "Tagging" support using CouchDB?

I'm using the following view function to iterate over all items in the database (in order to find a tag), but I think the performance is very poor if the dataset is large.
Any other approach?
def by_tag(tag):
    return '''
    function(doc) {
        if (doc.tags.length > 0) {
            for (var tag in doc.tags) {
                if (doc.tags[tag] == "%s") {
                    emit(doc.published, doc);
                }
            }
        }
    };
    ''' % tag
Disclaimer: I didn't test this and don't know if it can perform better.
Create a single permanent view:
function(doc) {
    for (var tag in doc.tags) {
        emit([tag, doc.published], doc);
    }
};
And query with:
_view/your_view/all?startkey=["your_tag_here"]&endkey=["your_tag_here", {}]
Resulting JSON structure will be slightly different but you will still get the publish date sorting.
You can define a single permanent view, as Bahadir suggests. When doing this sort of indexing, though, don't output the doc for each key. Instead, emit([tag, doc.published], null). In current release versions you'd then have to do a separate lookup for each doc, but SVN trunk now has support for specifying include_docs=true in the query string, and CouchDB will automatically merge the docs into your view results for you, without the space overhead.
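A sketch of that emit-null pattern with couchdb-python (server URL, database name, and tag are placeholders, and it assumes a build where include_docs is supported):
from couchdb import Server

db = Server("http://localhost:5984")["my_database"]

by_tag = """
function(doc) {
    if (doc.tags) {
        for (var i = 0; i < doc.tags.length; i++) {
            emit([doc.tags[i], doc.published], null);  // keep the index small
        }
    }
}
"""

# include_docs asks CouchDB to join each row with its source document
for row in db.query(by_tag, startkey=["mytag"], endkey=["mytag", {}], include_docs=True):
    print(row.doc)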
You are very much on the right track with the view. A list of thoughts, though:
View generation is incremental. If your read traffic is greater than your write traffic, then your views won't cause an issue at all. People who are concerned about this generally shouldn't be. As a frame of reference, you should be worried if you're dumping hundreds of records into the view between view updates.
Emitting an entire document will slow things down. You should only emit what is necessary for the use of the view.
Not sure what the val == "%s" performance would be, but you shouldn't overthink things. If there's a tag array you should emit the tags. Granted, if you expect a tags array that may contain non-strings, then ignore this.
# Works on CouchDB 0.8.0
from couchdb import Server  # http://code.google.com/p/couchdb-python/

byTag = """
function(doc) {
    if (doc.type == 'post' && doc.tags) {
        doc.tags.forEach(function(tag) {
            emit(tag, doc);
        });
    }
}
"""

def findPostsByTag(tag):
    server = Server("http://localhost:1234")
    db = server['my_table']
    return [row for row in db.query(byTag, key=tag)]
The byTag map function emits each unique tag as the key, with each post carrying that tag as the value, so when you query with key="mytag" it will retrieve all posts tagged "mytag".
I've tested it against about 10 entries and it seems to take about 0.0025 seconds per query; not sure how efficient it is with large data sets.
