Building an ElasticSearch search with exists using pyes - python

The goal of this example code is to figure out how to build a query that combines multiple filters and queries.
The example below is not working as expected.
I want to execute my search only on documents which contain a certain "key". That is what I'm trying to achieve with the ExistsFilter, but when I enable it I don't get any results back.
Any pointers to clear up this question?
#!/usr/bin/python
import pyes
conn = pyes.ES('sandbox:9200')
conn.index('{"test":{"field1":"value1","field2":"value2"}}','2012.9.23','test')
filter = pyes.filters.BoolFilter()
filter.add_must(pyes.filters.LimitFilter(1))
filter.add_must(pyes.filters.ExistsFilter('test')) #uncommenting this line returns the documents
query = pyes.query.BoolQuery()
query.add_must(pyes.query.TextQuery('test.field1','value1'))
query.add_must(pyes.query.TextQuery('test.field2','value2'))
search = pyes.query.FilteredQuery(query, filter)
for reference in conn.search(query=search, indices=['2012.9.23']):
    print reference

I don't use pyes (or Python). But what I can see here is that some information seems to be missing from the ExistsFilter, if I compare it with the ExistsFilter documentation:
{
    "constant_score" : {
        "filter" : {
            "exists" : { "field" : "user" }
        }
    }
}
Could it be your issue?
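If pyes keeps getting in the way, one way to sanity-check the filter is to send the equivalent raw Query DSL straight to the cluster and compare the results. The sketch below is only a hedged debugging aid, not the pyes API: it assumes the legacy "filtered"/"text" syntax that matches the pyes calls above, and trying the full dotted field name in the exists filter is an assumption.
import json
import requests

# A minimal debugging sketch: send the equivalent raw Query DSL directly
# (legacy "filtered" + "text" syntax). The dotted field name in the exists
# filter is an assumption worth trying against the plain 'test' field name.
raw_query = {
    "query": {
        "filtered": {
            "query": {
                "bool": {
                    "must": [
                        {"text": {"test.field1": "value1"}},
                        {"text": {"test.field2": "value2"}},
                    ]
                }
            },
            "filter": {
                "exists": {"field": "test.field1"}
            },
        }
    }
}

resp = requests.post("http://sandbox:9200/2012.9.23/_search", data=json.dumps(raw_query))
print(resp.json())
If the raw query behaves as expected, the problem is in how pyes serializes the filter; if not, the field name in the exists filter is the likely culprit.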

Related

How do I use the Elasticsearch Query DSL to query an API using Python?

I'm trying to query the public Art Institute of Chicago API to only show me results that match certain criteria. For example:
classification_title = "painting"
colorfulness <= 13
material_titles includes "paper (fiber product)"
The API documentation states:
Behind the scenes, our search is powered by Elasticsearch. You can use its Query DSL to interact with our API.
I can't figure out how to take an Elasticsearch DSL JSON-like object and pass it into the API URL, beyond a single criterion.
Here are some working single-criteria examples specific to this API:
requests.get("https://api.artic.edu/api/v1/artworks/search?q=woodblock[classification_title]").json()
requests.get("https://api.artic.edu/api/v1/artworks/search?q=monet[artist_title]").json()
And here are some of my failed attempts to return only items that pass 2+ criteria:
requests.get("https://api.artic.edu/api/v1/artworks/search?q=woodblock[classification_title]monet[artist_title]")
requests.get("https://api.artic.edu/api/v1/artworks/search?q=woodblock[classification_title],monet[artist_title]")
requests.get("https://api.artic.edu/api/v1/artworks/search?q=%2Bclassification_title%3A(woodblock)+%2Bartist_title%3A(monet)")
And lastly some of my failed attempts to return more complex criteria, like range:
requests.get("https://api.artic.edu/api/v1/artworks/search?q={range:lte:10}&query[colorfulness]").json()
requests.get("https://api.artic.edu/api/v1/artworks/search?q=<10&query[colorfulness]").json()
requests.get("https://api.artic.edu/api/v1/artworks/search?q=%2Bdate_display%3A%3C1900").json()
All of these failed attempts return data but not within my passed criteria. For example, woodblock[classification_title]monet[artist_title] should return no results.
How could I query all of these criteria, only returning results (if any) that match all these conditions? The JSON-like Query DSL does not seem compatible with a requests.get.
Solved. I was lacking knowledge on GET and POST. I can indeed use the JSON-like Query DSL. It just needs to be sent as part of a requests.post instead of a requests.get, like so:
import requests
fields = "id,image_id,title,artist_id,classification_title,colorfulness,material_titles"
url = f"https://api.artic.edu/api/v1/artworks/search?&fields={fields}"
criteria = {
    "query": {
        "bool": {
            "must": [
                {"match": {"classification_title": "painting"}},
                {"range": {"colorfulness": {"lte": 13}}},
                {"match": {"material_titles": "paper (fiber product)"}},
            ],
        }
    }
}
r = requests.post(url, json=criteria)
art = r.json()
print(art)
Notice within the requests.post that the desired criteria query is passed through as a json argument, separate from the url.

Get documents near a match in pyMongo

I'm trying to get the documents that surround a match in pyMongo. So I would search for a string, get the matches, and also the entries that are around each match (using the '_index' field as, well, an index), so the user has some context on the result.
I'm trying to do it using $setWindowFields with no success, as I'm getting no results. Probably I'm using the wrong syntax? This is the aggregation I'm trying:
show_near = [
    {'$setWindowFields': {
        'partitionBy': None,
        'sortBy': {'_index': 1},
        'output': {
            'nearIds': {
                '$addToSet': '$_id',
                'window': {'documents': [-2, 2]}
            }
        }
    }},
    {'$match':
        {field: {'$regex': f'({s})'}}
    },
    {'$lookup':
        {'from': 'collection',
         'localField': 'nearIds',
         'foreignField': '_id',
         'as': 'nearDocs'}
    },
    {'$unwind': '$nearDocs'},
    {'$replaceRoot': {
        'newRoot': '$nearDocs'}}
]
cursor = self.collection.aggregate(show_near)
Where 's' is the string I want to match and '_index' is the order of the entries.
Any idea? Maybe there is another method to do this? This feature looks perfect for what I want, but maybe I'm mistaken and there is another way. I've tried going back and forth with $gte and $lte, but that is not feasible when results start to pile up.
Thanks!

How to ensure all data is captured in ES API?

I am trying to create an API in Python to pull the data from ES and feed it into a data warehouse. The data is live and being filled every second, so I am going to create a near-real-time pipeline.
The current URL format is {{url}}/{{index}}/_search and the test payload I am sending is:
{
    "from" : 0,
    "size" : 5
}
On the next refresh it will pull using payload:
{
    "from" : 6,
    "size" : 5
}
And so on until it reaches the total number of records. The PROD environment has about 250M rows and I'll set the size to 10K per extract.
I am worried though, as I don't know whether the records are being reordered within ES. Currently there is a plugin which uses a timestamp generated by the user, but that is flawed: sometimes documents are skipped due to a delay in the JSONs being made available for extract in ES, and possibly because of the way the time is generated.
Does anyone know what is the default sorting when pulling the data using /_search?
I suppose what you're looking for is a streaming / changes API, which is nicely described by @Val here and is also an open feature request.
In the meantime, you cannot really count on the size and from parameters -- you could probably make redundant queries and handle the duplicates before they reach your data warehouse.
Another option would be to skip ES in this regard and stream directly to the warehouse. What I mean is: take an ES snapshot up until a given time once (so you keep the historical data), feed it to the warehouse, and then stream directly from wherever you're getting your data into the warehouse.
Addendum
AFAIK the default sorting is by the insertion date, but there's no internal _insertTime or similar. You can use cursors -- it's called scrolling and here's a py implementation. But this goes from the 'latest' doc to the 'first', not vice versa. So it'll give you all the existing docs, but I'm not so sure about the ones newly added while you were scrolling. You'd then want to run the scroll again, which is suboptimal.
You could also pre-sort your index which should work quite nicely for your use case when combined w/ scrolling.
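For reference, a minimal scrolling sketch using the scan helper from elasticsearch-py; the index name and the match_all query are placeholders, not taken from the question:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

# Minimal scrolling sketch with the elasticsearch-py scan helper.
# The host, index name and query below are placeholders.
es = Elasticsearch("http://localhost:9200")

for hit in scan(es, index="my-index", query={"query": {"match_all": {}}}, size=1000):
    print(hit["_id"], hit["_source"])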
Thanks for the responses. After consideration with my colleagues, we decided to use the _ingest API instead to create a pipeline in ES which inserts the server's document ingestion date on each doc.
Steps:
Create the timestamp pipeline
PUT _ingest/pipeline/timestamp_pipeline
{
    "description" : "Inserts timestamp field for all documents",
    "processors" : [
        {
            "set" : {
                "field": "insert_date",
                "value": "{{_ingest.timestamp}}"
            }
        }
    ]
}
Update indexes to add the new default field
PUT /*/_settings
{
    "index" : {
        "default_pipeline": "timestamp_pipeline"
    }
}
In Python then I use the _scroll API like so:
es = Elasticsearch(cfg.esUrl, port=cfg.esPort, timeout=200)

doc = {
    "query": {
        "range": {
            "insert_date": {
                "gte": lastRowDateOffset
            }
        }
    }
}

res = es.search(
    index=Index,
    sort="insert_date:asc",
    scroll="2m",
    size=NumberOfResultsPerPage,
    body=doc
)
Where lastRowDateOffset is the date of the last run
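The snippet above only fetches the first page; a hedged sketch of the follow-up scroll calls is below, where process() is a hypothetical placeholder for whatever feeds the warehouse:
# Continuation sketch: page through the remaining results with the scroll API.
# process() is a placeholder, not part of the question's code.
scroll_id = res["_scroll_id"]
hits = res["hits"]["hits"]

while hits:
    process(hits)  # e.g. write the batch to the warehouse
    res = es.scroll(scroll_id=scroll_id, scroll="2m")
    scroll_id = res["_scroll_id"]
    hits = res["hits"]["hits"]

es.clear_scroll(scroll_id=scroll_id)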

How to delete documents from Elasticsearch

I can't find any example of deleting documents from Elasticsearch in Python. What I've seen so far are the definitions of the delete and delete_by_query functions. But for some reason the documentation does not provide even a microscopic example of using these functions. The bare list of parameters does not tell me much if I do not know how to correctly feed them into the function call. So, let's say I've just inserted one new doc like so:
doc = {'name':'Jacobian'}
db.index(index="reestr",doc_type="some_type",body=doc)
Who in the world knows how I can now delete this document using delete and delete_by_query?
Since you are not giving a document id while indexing your document, you have to get the auto-generated document id from the return value and delete according to that id. Or you can define the id yourself; try the following:
db.index(index="reestr",doc_type="some_type",id=1919, body=doc)
db.delete(index="reestr",doc_type="some_type",id=1919)
In the other case, you need to look at the return value:
r = db.index(index="reestr",doc_type="some_type", body=doc)
# r = {u'_type': u'some_type', u'_id': u'AU36zuFq-fzpr_HkJSkT', u'created': True, u'_version': 1, u'_index': u'reestr'}
db.delete(index="reestr",doc_type="some_type",id=r['_id'])
Another example for delete_by_query. Let's say after adding several documents with name='Jacobian', run the following to delete all documents with name='Jacobian':
db.delete_by_query(index='reestr', doc_type='some_type', q='name:Jacobian')
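With clients and servers where delete-by-query accepts a request body, the same thing can be expressed with the Query DSL instead of the Lucene query string; a hedged variant:
# Same delete expressed with a request body instead of the q= query string
# (for client/server versions that accept a body on delete-by-query).
db.delete_by_query(
    index='reestr',
    doc_type='some_type',
    body={'query': {'match': {'name': 'Jacobian'}}}
)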
The Delete-By-Query API was removed from the ES core in version 2 for several reasons. This function became a plugin. You can look for more details here:
Why Delete-By-Query is a plugin
Delete By Query Plugin
Because I didn't want to add another dependency (I need this to run later in a Docker image), I wrote my own function to solve this problem. My solution is to search for all documents with the specified index and type, and then remove them using the Bulk API:
def delete_es_type(es, index, type_):
    try:
        count = es.count(index, type_)['count']
        response = es.search(
            index=index,
            filter_path=["hits.hits._id"],
            body={"size": count, "query": {"filtered": {"filter": {
                "type": {"value": type_}}}}})
        ids = [x["_id"] for x in response["hits"]["hits"]]
        if not ids:  # nothing to delete
            return
        bulk_body = [
            '{{"delete": {{"_index": "{}", "_type": "{}", "_id": "{}"}}}}'
            .format(index, type_, x) for x in ids]
        es.bulk('\n'.join(bulk_body))
        # es.indices.flush_synced([index])
    except elasticsearch.exceptions.TransportError as ex:
        print("Elasticsearch error: " + ex.error)
        raise ex
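A possible usage sketch, assuming a client created elsewhere; the host, index and type names are placeholders:
import elasticsearch

# Hedged usage sketch: delete every document of type "some_type" in "reestr".
es = elasticsearch.Elasticsearch("http://localhost:9200")
delete_es_type(es, "reestr", "some_type")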
I hope that helps future googlers ;)
One can also do something like this:
from pprint import pprint

def delete_by_ids(index, ids):
    # assumes an existing Elasticsearch client instance named `es`
    query = {"query": {"terms": {"_id": ids}}}
    res = es.delete_by_query(index=index, body=query)
    pprint(res)

# Pass the index and the list of ids that you want to delete.
delete_by_ids('my_index', ['test1', 'test2', 'test3'])
This will perform the delete operation on those documents in bulk.
I came across this post while searching for a way to delete a document on ElasticSearch using their Python library, ElasticSearch-DSL.
In case it helps anyone, this part of their documentation describes the document lifecycle.
https://elasticsearch-dsl.readthedocs.io/en/latest/persistence.html#document-life-cycle
And at the end of it, it details how to delete a document:
To delete a document just call its delete method:
first = Post.get(id=42)
first.delete()
Hope that helps 🤞

Adding value to array in sub-document (nested within main doc) without duplication - MongoDB

It is quite complicated with nested documents, but please let me know if any of you have a solution, thanks.
To summarize, I would like to:
Add a value to an array (without duplication); the array is within a sub-document, which itself sits within an array of the main document (Document > Array > Subdoc > Array)
The sub-document itself might not exist; if it does not exist, it needs to be added, i.e. an upsert
The command should be the same for both actions (i.e. adding a value to the sub-document's array, and adding the sub-document)
I have tried the following, but it doesn't work:
key = {'username': 'user1'}

update1 = {
    '$addToSet': {'clients': {
        '$set': {'fname': 'Jessica'},
        '$set': {'lname': 'Royce'},
        '$addToSet': {'cars': 'Toyota'}
    }}
}
# the document with 'Jessica' and 'Royce' does not exist in the clients array,
# so a new document should be created

update2 = {
    '$addToSet': {'clients': {
        '$set': {'fname': 'Jessica'},
        '$set': {'lname': 'Royce'},
        '$addToSet': {'cars': 'Honda'}
    }}
}
# now that the document with 'Jessica' and 'Royce' already exists in the clients array,
# only the value 'Honda' should be added to the cars array

mongo_collection.update(key, update1, upsert=True)
mongo_collection.update(key, update2, upsert=True)
error message: $set is not valid for storage
My intended outcome:
Before:
{
    'username': 'user1',
    'clients': [
        {'fname': 'John',
         'lname': 'Baker',
         'cars': ['Merc', 'Ferrari']}
    ]
}
1st After:
{
    'username': 'user1',
    'clients': [
        {'fname': 'John',
         'lname': 'Baker',
         'cars': ['Merc', 'Ferrari']},
        {'fname': 'Jessica',
         'lname': 'Royce',
         'cars': ['Toyota']}
    ]
}
2nd After:
{
    'username': 'user1',
    'clients': [
        {'fname': 'John',
         'lname': 'Baker',
         'cars': ['Merc', 'Ferrari']},
        {'fname': 'Jessica',
         'lname': 'Royce',
         'cars': ['Toyota', 'Honda']}
    ]
}
My understanding is that you won't be able to achieve the intended result completely in one command. You can certainly do a nested update or upsert, but probably not the duplication check, as there is no direct way to test whether an item is already contained in an array inside a sub-document as part of the same upsert.
For the upsert operation you can refer to the MongoDB update documentation or bulk operations. For the duplication check, you will probably need separate logic to identify it.
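If a single command turns out not to be possible, a hedged two-step sketch with pymongo (update_one, $elemMatch and the positional $ operator) produces the intended before/after documents shown above, at the cost of two round trips:
# Two-step sketch: not a single command, but it matches the intended outcome.
# Step 1: if the client sub-document already exists, add the car to its array
# ($addToSet keeps the cars array free of duplicates).
result = mongo_collection.update_one(
    {'username': 'user1',
     'clients': {'$elemMatch': {'fname': 'Jessica', 'lname': 'Royce'}}},
    {'$addToSet': {'clients.$.cars': 'Honda'}}
)

# Step 2: if no such sub-document matched, push a brand new one instead.
if result.matched_count == 0:
    mongo_collection.update_one(
        {'username': 'user1'},
        {'$push': {'clients': {'fname': 'Jessica', 'lname': 'Royce', 'cars': ['Honda']}}},
        upsert=True
    )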
