Elasticsearch DocumentSimilarity dense_vector got multiple values for argument 'body' - python

I want to store Document Vectors in an Elasticsearch index in order to calculate document similarity. I'm using the Python client for Elasticsearch 7.8.0.
I have a (dummy) Elasticsearch index with the following mapping:
mapping = {
    "mappings": {
        "properties": {
            "title_vector": {
                "type": "dense_vector",
                "dims": 3
            }
        }
    }
}
es.indices.create(index="test_vector", body=mapping)
And I stored a bunch of vectors in the following way:
vectors = [[1,2,3],[2,2,2],[1,2,2],[2,2,2],[4,5,6],[1,1,1]]
for i, v in enumerate(vectors):
    doc = {"title_vector": v}
    es.create("test_vector", id=i, body=doc)
According to the documentation, my query to get the most similar documents should be as follows:
doc = {
    "query": {
        "script_score": {
            "query": {
                "match_all": {}
            },
            "script": {
                "source": "cosineSimilarity(params.queryVector, 'title_vector') + 1.0",
                "params": {
                    "queryVector": [1,1,1]
                }
            }
        }
    }
}
es.search("test_vector", body=doc)
But I'm getting
TypeError: search() got multiple values for argument 'body'
It seems more like a Python error than an Elastic error. But I can't really find the cause of the error and how I should structure my query differently in order to solve it.
Thanks in advance!
Edit: added Elasticsearch version

You are correct, it is a Python error. Below is how es.search is defined in the client documentation:
search(body=None, index=None, params=None, headers=None)
As you can see, the first parameter is body.
In your es.search call, the first argument is passed positionally, without a keyword (body, index, params or headers). Python therefore binds "test_vector" to body, the first parameter in the declaration above, and the explicit body=doc then supplies body a second time.
Pass the index as index="test_vector" instead of a bare "test_vector" and that should do the trick:
es.search(index="test_vector", body=doc)
Hope it helps!
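The failure can be reproduced without a cluster. The sketch below uses a stand-in function named search that only copies the parameter order quoted above; it is an illustration, not the real client method.

```python
# Stand-in with the same parameter order as the signature quoted above.
# It is NOT the real client method; it only echoes its arguments.
def search(body=None, index=None, params=None, headers=None):
    return {"index": index, "body": body}

doc = {"query": {"match_all": {}}}

# The positional "test_vector" binds to the first parameter, body,
# and body=doc then tries to bind body a second time.
try:
    search("test_vector", body=doc)
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Naming both arguments removes the ambiguity.
result = search(index="test_vector", body=doc)

print(error_message)
print(result)
```

Passing every argument by keyword keeps the call independent of the parameter order, which may differ between client versions.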

Related

How to index list of object in Elasticsearch?

A document format I ingest into ElasticSearch looks like this:
{
'id':'514d4e9f-09e7-4f13-b6c9-a0aa9b4f37a0',
'created':'2019-09-06 06:09:33.044433',
'meta':{
'userTags':[
{
'intensity':'1',
'sentiment':'0.84',
'keyword':'train'
},
{
'intensity':'1',
'sentiment':'-0.76',
'keyword':'amtrak'
}
]
}
}
...ingested with python:
r = requests.put(itemUrl, auth = authObj, json = document, headers = headers)
The idea here is that ElasticSearch will treat keyword, intensity and sentiment as fields that can be later queried. However, on ElasticSearch side I can observe that this is not happening (I use Kibana for search UI) -- instead, I see field "meta.userTags" with the value that is the whole list of objects.
How can I make ElasticSearch index elements within a list?
I used the document body you provided to create a new index 'testind' and type 'testTyp' using the Postman REST client:
POST http://localhost:9200/testind/testTyp
{
"id":"514d4e9f-09e7-4f13-b6c9-a0aa9b4f37a0",
"created":"2019-09-06 06:09:33.044433",
"meta":{
"userTags":[
{
"intensity":"1",
"sentiment":"0.84",
"keyword":"train"
},
{
"intensity":"1",
"sentiment":"-0.76",
"keyword":"amtrak"
}
]
}
}
When I queried for the index's mapping, this is what I got:
GET http://localhost:9200/testind/testTyp/_mapping
{
"testind":{
"mappings":{
"testTyp":{
"properties":{
"created":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"id":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"meta":{
"properties":{
"userTags":{
"properties":{
"intensity":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"keyword":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"sentiment":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
}
}
}
}
}
}
}
}
}
As you can see, the fields are part of the mapping and can be queried as needed, so I don't see a problem here as long as the field names are not one of these reserved keywords: https://www.elastic.co/guide/en/elasticsearch/reference/6.4/sql-syntax-reserved.html (you might want to avoid the name 'keyword', since it could be confusing later when writing search queries: the field name and the type would both be 'keyword'). Also note that the mapping was created via dynamic mapping (https://www.elastic.co/guide/en/elasticsearch/reference/6.3/dynamic-field-mapping.html#dynamic-field-mapping), so the data types were inferred by Elasticsearch from the values you provided. This inference is not always accurate. To avoid relying on it, you can use the PUT _mapping API to define your own mapping for the index, and then prevent new fields from being added to the mapping.
You don't need a special mapping to index a list: every field can contain one or more values of the same type. See the array datatype.
A list of objects can be indexed with either the object or the nested datatype. By default, Elasticsearch uses the object datatype. In that case you can query meta.userTags.keyword and/or meta.userTags.sentiment. The result always contains whole documents, with the values matched independently of each other, i.e. searching for keyword=train and sentiment=-0.76 WILL find the document with id=514d4e9f-09e7-4f13-b6c9-a0aa9b4f37a0.
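To make the "matched independently" behaviour concrete, here is a plain-Python illustration of the two matching modes. It is only a simplified model (it ignores text analysis and is not Elasticsearch code), and the helper names are made up for the example.

```python
# Plain-Python model of the two matching behaviours; this is not
# Elasticsearch code, and the helper names are made up for the example.
user_tags = [
    {"intensity": "1", "sentiment": "0.84", "keyword": "train"},
    {"intensity": "1", "sentiment": "-0.76", "keyword": "amtrak"},
]

def matches_as_object(tags, keyword, sentiment):
    # "object" datatype: values are flattened per field, so each
    # condition may be satisfied by a different list element.
    keywords = {t["keyword"] for t in tags}
    sentiments = {t["sentiment"] for t in tags}
    return keyword in keywords and sentiment in sentiments

def matches_as_nested(tags, keyword, sentiment):
    # "nested" datatype: both conditions must hold within one element.
    return any(
        t["keyword"] == keyword and t["sentiment"] == sentiment for t in tags
    )

print(matches_as_object(user_tags, "train", "-0.76"))   # True
print(matches_as_nested(user_tags, "train", "-0.76"))   # False
print(matches_as_nested(user_tags, "amtrak", "-0.76"))  # True
```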
If this is not what you want, you need to define a nested datatype mapping for the userTags field and use a nested query.
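A nested mapping and a matching nested query could look roughly like the request bodies below, written as Python dicts. This is a sketch under assumptions: the mapping uses the typeless form of Elasticsearch 7 (on older versions the properties sit under the type name), the field types are guesses based on the sample document, and the index name in the comment is hypothetical.

```python
# Mapping that declares userTags as "nested". The field types chosen
# here are assumptions; adjust them to your data.
nested_mapping = {
    "mappings": {
        "properties": {
            "meta": {
                "properties": {
                    "userTags": {
                        "type": "nested",
                        "properties": {
                            "keyword": {"type": "keyword"},
                            "sentiment": {"type": "float"},
                            "intensity": {"type": "integer"},
                        },
                    }
                }
            }
        }
    }
}

# Nested query: both conditions must match inside the same userTags element.
nested_query = {
    "query": {
        "nested": {
            "path": "meta.userTags",
            "query": {
                "bool": {
                    "must": [
                        {"term": {"meta.userTags.keyword": "amtrak"}},
                        {"range": {"meta.userTags.sentiment": {"lt": 0}}},
                    ]
                }
            },
        }
    }
}

# With a client instance es (hypothetical index name):
# es.indices.create(index="testind_nested", body=nested_mapping)
# es.search(index="testind_nested", body=nested_query)
```

An existing field cannot be switched from object to nested in place, so this mapping has to be applied to a new index before the documents are indexed.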

Upsert in mongoengine is not generating ObjectId

I am trying to execute an upsert in mongoengine. That is, if a document is present, I want to update it with new values, and if it isn't present, I want to create and insert it.
I have a list of objects. These objects may or may not have ObjectIds. For example:
[
{
"id" : ObjectId("5c1791b7397df4a9c8518342"),
"type": "Line"
},
{
"type": "Line"
}
]
As you can see the second object does not have an Id.
I have written my query as:
updates = Collection.objects(
    id=obj.get('id', None)).modify(
        new=True,
        upsert=True,
        **update_dict
    )
obj is each object when I iterate through the list.
Note: update_dict is another dict that gets its value from a function that returns the attributes to set. (For example: set__type: "Line")
Problem
The first object is getting modified just fine. However, there is an error:
"'None' is not a valid ObjectId, it must be a 12-byte input or a 24-character hex string"
Clearly it's because of the obj.get('id', None) part.
So, is there a way that an id can be generated if it is passed as None?
I tried the same thing with mongoose and Node.js, and it works for me when done as below.
Here is my array of objects:
var arr = [
    {
        _id: "5c13de7d47zfe91e3484362f",
        email: 'test1#gmail.com',
    },
    {
        _id: "5c13de7d47zfe91e3484362f",
        email: 'test2#gmail.com',
    },
    {
        // _id: "5c66aa87751fz5368759f9bc", // Commented
        email: 'test3#gmail.com',
    }
]
Now I iterate through the array as below with Node.js:
arr.forEach(async element => {
    await Driver.findOneAndUpdate(
        {
            _id: Types.ObjectId(element._id)
        },
        {
            email: element.email
        },
        { upsert: true, new: true }
    ).lean().exec();
});
And it works for me. It updates the documents in the first two cases and inserts a new doc in the last case.
The main thing is to use Types.ObjectId, which specifies that the value is an ObjectId. If I do it without specifying Types.ObjectId, it does not work.
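On the Python side, one way to get a comparable effect is to resolve the id before building the query, generating one when it is missing. The helper below is a hypothetical sketch, not part of mongoengine: resolve_id and its new_id hook are made-up names, and the default generator only produces a random 24-character hex string (the format the error message asks for). With pymongo installed you would normally pass bson.ObjectId as new_id instead.

```python
import secrets

def resolve_id(obj, new_id=lambda: secrets.token_hex(12)):
    """Return obj's id, or a freshly generated one when it is missing.

    new_id is a hypothetical hook: the default yields a random
    24-character hex string; with pymongo installed you would
    normally pass bson.ObjectId instead.
    """
    existing = obj.get("id")
    return existing if existing is not None else new_id()

objs = [
    {"id": "5c1791b7397df4a9c8518342", "type": "Line"},
    {"type": "Line"},
]
ids = [resolve_id(o) for o in objs]
print(ids)
```

The resolved value would then take the place of obj.get('id', None) in the Collection.objects(id=...) call from the question, so the upsert always receives a usable id.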

Find in referenced/linked MongoDB document, using pymongo

I have 3 linked documents like:
{'_id':1, 'name':'abc', 'label':'actionA', 'prev':null}
{'_id':2, 'name':'pqr', 'label':'actionB', 'prev':ObjectId('1')}
{'_id':3, 'name':'xyz', 'label':'actionC', 'prev':ObjectId('2')}
Now I want to query the document whose 'name' is 'pqr' and whose previous/linked document has 'label' equal to 'actionA'.
That is, it should find the document by 'name', check whether a previous linked doc is available and, if so, check that this previous doc has the 'label' I want.
A one-line command would be preferable, something like:
db.collection.find({'$and': [{'name':'pqr'},{'prev': <gotoprev>({'label':'actionA'})}]})
You can achieve this using aggregation.
MongoDB 3.4 solution
Take advantage of the $graphLookup operator:
db.collection.aggregate([
{
$match:{
"name":"pqr"
}
},
{
$graphLookup:{
from:"collection",
startWith:"$prev",
connectFromField:"prev",
connectToField:"_id",
as:"parent",
maxDepth:1,
restrictSearchWithMatch:{
label:"actionA"
}
}
}
])
MongoDB 3.2 solution
Filter out documents where name != 'pqr' in a $match stage.
Link parent and child with $lookup.
Unwind the resulting array with $unwind.
Finally, filter out documents where prev.label != 'actionA'.
Here is the query:
db.collection.aggregate([
{
$match:{
"name":"pqr"
}
},
{
$lookup:{
from:"collection",
localField:"prev",
foreignField:"_id",
as:"prev"
}
},
{
$unwind:"$prev"
},
{
$match:{
"prev.label":"actionA"
}
}
])
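To see what the $match / $lookup / $unwind / $match pipeline computes, here is a plain-Python equivalent over an in-memory list. It is only an illustration of the logic, not pymongo code; the function name is made up, and the ids are plain integers as in the simplified documents from the question.

```python
# In-memory stand-in for the collection; ids are plain integers here,
# mirroring the simplified documents in the question.
docs = [
    {"_id": 1, "name": "abc", "label": "actionA", "prev": None},
    {"_id": 2, "name": "pqr", "label": "actionB", "prev": 1},
    {"_id": 3, "name": "xyz", "label": "actionC", "prev": 2},
]

def find_with_prev_label(collection, name, prev_label):
    by_id = {d["_id"]: d for d in collection}   # plays the role of $lookup
    results = []
    for doc in collection:
        if doc["name"] != name:                 # first $match
            continue
        prev = by_id.get(doc["prev"])           # join prev -> _id
        if prev is None:                        # $unwind drops empty joins
            continue
        if prev["label"] == prev_label:         # final $match
            results.append({**doc, "prev": prev})
    return results

print(find_with_prev_label(docs, "pqr", "actionA"))
```

As in the pipeline, the matching document comes back with its prev field replaced by the linked document itself.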
You can denormalise and also store prev_label in the document that holds the reference (the NoSQL way).
{'_id':1, 'name':'abc', 'label':'actionA', 'prev':null}
{'_id':2, 'name':'pqr', 'label':'actionB', 'prev':ObjectId('1'),'prev_label': 'actionA'}
{'_id':3, 'name':'xyz', 'label':'actionC', 'prev':ObjectId('2'),'prev_label': 'actionB'}
Then you can use a plain find query to get the result:
db.collection.find({'$and': [{'name':'pqr'},{'prev_label': 'actionA'}]})
If the label in the original document changes, you can keep the copies in the referencing documents up to date with an update query:
db.collection.update({'prev': updatedDocumentId},{'$set': {'prev_label': newLabel}}, multi=True)

Elasticsearch/Python - Re-index data after changing the mappings?

I'm a little stuck on how to re-index data in Elasticsearch after a mapping or a data type has been changed.
According to the Elasticsearch docs:
Pull the documents in from your old index, using a scrolled search and index them into the new index using the bulk API. Many of the client APIs provide a reindex() method which will do all of this for you. Once you are done, you can delete the old index.
This is my old mapping
{
"test-index2": {
"mappings": {
"business": {
"properties": {
"address": {
"type": "nested",
"properties": {
"country": {
"type": "string"
},
"full_address": {
"type": "string"
}
}
}
}
}
}
}
}
New Index mapping, I'm changing full_address -> location_address
{
"test-index2": {
"mappings": {
"business": {
"properties": {
"address": {
"type": "nested",
"properties": {
"country": {
"type": "string"
},
"location_address": {
"type": "string"
}
}
}
}
}
}
}
}
I'm using the python client for elasticsearch
https://elasticsearch-py.readthedocs.org/en/master/helpers.html#elasticsearch.helpers.reindex
from elasticsearch import Elasticsearch
from elasticsearch.helpers import reindex
es = Elasticsearch(["es.node1"])
reindex(es, "source_index", "target_index")
However, this only transfers the data from one index to another.
How can I use this to change the mappings (data types etc.) for my case above?
It's straightforward if you use the scan&scroll and Bulk API helpers already implemented in the Python client for Elasticsearch:
First, fetch all the documents with the scan&scroll method.
Loop through them and make the necessary modifications to each document.
Insert the modified documents into a new index using the Bulk API.
from elasticsearch import Elasticsearch, helpers
es = Elasticsearch()
# Use the scan&scroll method to fetch all documents from your old index
res = helpers.scan(es, query={
    "query": {
        "match_all": {}
    },
    "size": 1000
}, index="old_index")
new_insert_data = []
# Apply the field changes by looping through all your documents
for x in res:
    x['_index'] = 'new_index'
    # Rename "address" to "location_address" (adapt the keys to your own
    # mapping; in the question the renamed field is address.full_address)
    x['_source']['location_address'] = x['_source']['address']
    del x['_source']['address']
    # This is a useless field; pop() avoids a KeyError if it is absent
    x.pop('_score', None)
    # Add the new data into a list
    new_insert_data.append(x)
print(new_insert_data)
# Use the Bulk API to insert the list of your modified documents into the new index
helpers.bulk(es, new_insert_data)
# Refresh so the newly indexed documents become searchable
es.indices.refresh(index="new_index")
The reindex() API simply "moves" documents from one index to another. It cannot detect or infer that the field full_address in documents of the old index should become location_address in the new index. I doubt there is any API provided by standard Elasticsearch clients that can do what you want. The only way I can think of is additional custom logic on the client side: maintain a dictionary that maps field names in the old index to field names in the new index, read the documents from the old index, and index each one into the new index with the field names taken from that dictionary.
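The dictionary-driven renaming described above could be sketched as follows. The names FIELD_RENAMES, rename_fields and convert_source are hypothetical, and the sketch assumes the renamed field lives inside the address object from the question's mapping, which may hold either one object or a list of objects.

```python
# Hypothetical mapping from old field names to new ones, applied inside
# each "address" object; extend it as the new mapping requires.
FIELD_RENAMES = {"full_address": "location_address"}

def rename_fields(obj, renames):
    """Return a copy of obj with keys renamed according to renames."""
    return {renames.get(key, key): value for key, value in obj.items()}

def convert_source(source, renames=FIELD_RENAMES):
    """Rename fields in source['address'] (one object or a list of them)."""
    converted = dict(source)
    address = converted.get("address")
    if isinstance(address, list):
        converted["address"] = [rename_fields(a, renames) for a in address]
    elif isinstance(address, dict):
        converted["address"] = rename_fields(address, renames)
    return converted

old = {"address": [{"country": "US", "full_address": "1 Main St"}]}
new = convert_source(old)
print(new)
```

In a scan-and-bulk loop, each hit's _source would be passed through convert_source before the hit is handed to the bulk helper.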
After updating the mapping, this can be done by updating the existing documents using the bulk API.
POST /_bulk
{"update":{"_id":"59519","_type":"asset","_index":"assets"}}
{"doc":{"facility_id":491},"detect_noop":false}
Note: detect_noop controls no-op detection; setting it to false applies the update even when it would not change the document.

get filtered embeded elements from all class on mongoengine

I have two classes in Mongoengine:
class UserPoints(EmbeddedDocument):
    user = ReferenceField(User, verbose_name='user')
    points = IntField(verbose_name='points', required=True)
    def __unicode__(self):
        return self.points
And
class Local(Document):
    token = StringField(max_length=250,verbose_name='token_identifier',unique=True)
    points = ListField(EmbeddedDocumentField(UserPoints),required=False)
    def __unicode__(self):
        return self.name
If I do something like "LP = Local.objects.filter(points__user=user)", I get all the Locals that contain UserPoints for my user. But I want all the UserPoints of a user. How can I do that?
I also tried "lUs = UserPoints.objects.filter(user=user)", but I got an empty array.
PS: I did something like this to solve the problem, but it's not efficient.
lDPoints = []
LP = Local.objects.filter(points__user=user)
print('List P: ' + str(len(LP)))
for local in LP:
    for points in local.points:
        if points.user == user:
            dPoints = parsePoints(points)
            lDPoints.append(dPoints)
As an addition to the original and by now venerable answer: the aggregation framework has had $filter for some time now, which is a lot cleaner than the $map and $setDifference method used in the original answer.
Local._get_collection().aggregate([
{ "$match": { "points.user": user } },
{ "$project": {
"token": 1,
"points": {
"$filter": {
"input": "$points",
"as": "el",
"cond": { "$eq": [ "$$el.user", user ] }
}
}
}}
])
The same principles still apply, though: to obtain "multiple" matches from an array, you use the aggregate() method of the underlying driver on the collection, as returned by _get_collection().
Original
To return only the embedded documents for the selected "user", rather than filtering them yourself, use the aggregation framework. This allows you to manipulate the "array content" on the database server rather than filtering the results in your client code.
Aggregation is done with the raw pymongo driver methods, but since Mongoengine is built on top of this driver, you can access the raw collection object from your class with the ._get_collection() method:
Local._get_collection().aggregate([
# Match the documents that have the required user
{ "$match": {
"points.user": user
}},
# unwind the embedded array to de-normalize
{ "$unwind": "$points" },
# Matching now filters the elements
{ "$match": {
"points.user": user
}},
# Group back as an array
{ "$group": {
"_id": "$_id",
"token": { "$first": "$token" },
"points": { "$push": "$points" }
}}
])
If you have MongoDB 2.6 or greater on your server and your "user/points" combination is always unique you can alternately filter without the $unwind|$match|$group cycle using the $map and $setDifference operators available there:
Local._get_collection().aggregate([
# Match the documents that have the required user
{ "$match": {
"points.user": user
}},
# Filter the array in place
{ "$project": {
"token": 1,
"points": {
"$setDifference": [
{
"$map": {
"input": "$points",
"as": "el",
"in": {
"$cond": [
{ "$eq": [ "$$el.user", user ] },
"$$el",
false
]
}
}
},
[false]
]
}
}}
])
In the second case, $cond is a ternary operator which takes a logical expression as its first argument and the values to return when that expression is true or false as its other arguments. Inside the $map, each element is tested to see whether the condition is true, in this case "is the user field equal to the selected user".
Either the content of that array position is returned or otherwise false. The $setDifference takes the resulting array and "filters" the false values out, so only the matching elements are returned.
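The $map / $cond / $setDifference combination can be emulated in plain Python to show what happens to the array. This is an illustration only: the sample data and the function name are made up, and the real operator does not guarantee the order of the output array.

```python
# Plain-Python emulation of the $map + $setDifference projection; the
# sample data and the function name are made up for illustration.
points = [
    {"user": "alice", "points": 10},
    {"user": "bob", "points": 5},
    {"user": "alice", "points": 7},
]

def filter_points(elements, user):
    # $map with $cond: keep the element if it matches, otherwise False
    mapped = [el if el["user"] == user else False for el in elements]
    # $setDifference with [False]: drop the False placeholders. As a set
    # operation it also collapses duplicate elements, which is why the
    # user/points combination has to be unique.
    result = []
    for el in mapped:
        if el is not False and el not in result:
            result.append(el)
    return result

print(filter_points(points, "alice"))
```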
In the legacy approach, the $unwind pipeline operator effectively turns each array element into its own document that carries all the other parent properties. This allows you to apply the same $match condition again, which, unlike the initial query, actually removes the documents that, now holding single elements, no longer match your condition. You always want the first $match stage, as there is no point processing this $unwind|$match combination on documents that cannot contain a match.
The $group stage brings everything back into line per document, using the $first operator to return the other fields that were essentially duplicated by the $unwind, and the $push operator to rebuild the array with the matching elements.
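The $match / $unwind / $match / $group cycle can likewise be traced in plain Python. The documents and the function name below are invented for the illustration; each step is labelled with the pipeline stage it stands in for.

```python
# Made-up sample documents shaped like the Local collection.
local_docs = [
    {"_id": 1, "token": "a", "points": [
        {"user": "alice", "points": 10},
        {"user": "bob", "points": 5},
    ]},
    {"_id": 2, "token": "b", "points": [{"user": "bob", "points": 3}]},
]

def unwind_match_group(documents, user):
    # First $match: keep documents with at least one matching element
    matched = [d for d in documents
               if any(p["user"] == user for p in d["points"])]
    # $unwind: one document per array element, parent fields repeated
    unwound = [{**d, "points": p} for d in matched for p in d["points"]]
    # Second $match: now drops the individual non-matching elements
    kept = [d for d in unwound if d["points"]["user"] == user]
    # $group: $first for token, $push to rebuild the points array
    grouped = {}
    for d in kept:
        entry = grouped.setdefault(
            d["_id"], {"_id": d["_id"], "token": d["token"], "points": []})
        entry["points"].append(d["points"])
    return list(grouped.values())

print(unwind_match_group(local_docs, "alice"))
```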
So while there are no "built-in" methods in MongoEngine to do this sort of query, you can do this the MongoDB way by accessing the raw driver.
Also note that if you only expected one element to match in any array for your given "user" or other query, then you could alternately use the field projection form available to the raw driver as well. But the aggregation method is required for any more than one matching element of the array.
