Why does my query using a MinHash analyzer fail to retrieve duplicates?

I am trying to query an Elasticsearch index for near-duplicates using its MinHash implementation.
I use the Python client running in containers to index and perform the search.
My corpus is a JSONL file a bit like this:
{"id":1, "text":"I'd just like to interject for a moment"}
{"id":2, "text":"I come up here for perception and clarity"}
...
I create the Elasticsearch index successfully, using custom settings and a custom analyzer, taking inspiration from the official examples and the MinHash docs:
def create_index(client):
    client.indices.create(
        index="documents",
        body={
            "settings": {
                "analysis": {
                    "filter": {
                        "my_shingle_filter": {
                            "type": "shingle",
                            "min_shingle_size": 5,
                            "max_shingle_size": 5,
                            "output_unigrams": False
                        },
                        "my_minhash_filter": {
                            "type": "min_hash",
                            "hash_count": 10,
                            "bucket_count": 512,
                            "hash_set_size": 1,
                            "with_rotation": True
                        }
                    },
                    "analyzer": {
                        "my_analyzer": {
                            "tokenizer": "standard",
                            "filter": [
                                "my_shingle_filter",
                                "my_minhash_filter"
                            ]
                        }
                    }
                }
            },
            "mappings": {
                "properties": {
                    "name": {"type": "text", "analyzer": "my_analyzer"}
                }
            },
        },
        ignore=400,
    )
I verify via Kibana that index creation has no major problems, and by visiting http://localhost:9200/documents/_settings I get a response that seems in order.
However, querying the index with:
def get_duplicate_documents(body, K, es):
    doc = {
        '_source': ['_id', 'body'],
        'size': K,
        'query': {
            'match': {
                'body': {
                    'query': body,
                    'analyzer': 'my_analyzer'
                }
            }
        }
    }
    res = es.search(index='documents', body=doc)
    top_matches = [hit['_source']['_id'] for hit in res['hits']['hits']]
My res['hits'] is consistently empty, even if I set body to match exactly the text of one of the entries in my corpus. In other words, I don't get any results if I try values for body like
"I come up here for perception and clarity"
or substrings like
"I come up here for perception"
while ideally, I'd like the procedure to return near-duplicates, with a score being an approximation of the Jaccard similarity of the query and the near-duplicates, obtained via MinHash.
Is there something wrong with my query and/or the way I index into Elasticsearch? Am I missing something else entirely?
P.S.: You can have a look at https://github.com/davidefiocco/dockerized-elasticsearch-duplicate-finder/tree/ea0974363b945bf5f85d52a781463fba76f4f987 for a non-functional, but hopefully reproducible example (I will also update the repo as I find a solution!)

Here are some things that you should double-check as they are likely culprits:
When you create your mapping, you should change "name" to "text" in the body param of your client.indices.create method, because your JSON documents have a field called text:
"mappings": {
"properties": {
"text": {"type": "text", "analyzer": "my_analyzer"}
}
In the indexing phase, you could also rework your generate_actions() method following the documentation, with something like:
for elem in corpus:
    yield {
        "_op_type": "index",
        "_index": "documents",
        "_id": elem["id"],
        "_source": {"text": elem["text"]}
    }
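Wrapped up as a self-contained function and wired to the bulk helper, the sketch above could look like the following (a sketch, assuming corpus is the list of dicts parsed from your JSONL file; the helpers.bulk call is commented out because it needs a running cluster):

```python
def generate_actions(corpus):
    """Yield one bulk action per corpus entry. _source must be a dict,
    so the text is wrapped in a {"text": ...} object matching the mapping."""
    for elem in corpus:
        yield {
            "_op_type": "index",
            "_index": "documents",
            "_id": elem["id"],
            "_source": {"text": elem["text"]},
        }

# Usage against a live cluster:
# from elasticsearch import Elasticsearch, helpers
# es = Elasticsearch("http://localhost:9200")
# helpers.bulk(es, generate_actions(corpus))
```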
Incidentally, if you are indexing pandas dataframes, you may want to check the experimental official library eland.
Also, according to your mapping, you are using a min_hash token filter, so Lucene will transform the text inside your text field into hashes. This means you can query against this field with a hash, not with a plain string as you have done in your example ("I come up here for perception and clarity").
So the best way to use it is to retrieve the content of the text field and then query Elasticsearch for that same retrieved value. Also, the _id meta field is not inside the _source meta field, so you should change your get_duplicate_documents() method to:
def get_duplicate_documents(body, K, es):
    doc = {
        '_source': ['text'],
        'size': K,
        'query': {
            'match': {
                'text': {  # I changed this line!
                    'query': body
                }
            }
        }
    }
    res = es.search(index='documents', body=doc)
    # also changed the list comprehension: _id lives outside _source
    top_matches = [(hit['_id'], hit['_source']) for hit in res['hits']['hits']]
    return top_matches

Related

How to paginate terms aggregation results in Elasticsearch

I've been trying to figure out a way to paginate the results of a terms aggregation in Elasticsearch and so far I have not been able to achieve the desired result.
Here's the problem I am trying to solve. In my index, I have a bunch of documents that have a score (separate to the ES _score) that is calculated based on the values of the other fields in the document. Each document "belongs" to a customer, referenced by the customer_id field. The document also has an id, referenced by the doc_id field, and is the same as the ES meta-field _id. Here is an example.
{
'_id': '1',
'doc_id': '1',
'doc_score': '85',
'customer_id': '123'
}
For each customer_id there are multiple documents, all with different document ids and different scores. What I want to be able to do is, given a list of customer ids, return the top document for each customer_id (only 1 per customer) and be able to paginate those results similar to the size, from method in the regular ES search API. The field that I want to use for the document score is the doc_score field.
So far in my current Python script, I've tried a nested aggs with a "top hits" aggregation to get only the top document for each customer:
{
    "size": 0,
    "query": {
        "bool": {
            "must": [
                {
                    "match_all": {}
                },
                {
                    "terms": {
                        "customer_id": customer_ids  # a list of the customer ids I want documents for
                    }
                },
                {
                    "exists": {
                        "field": "score"  # sometimes it's possible a document does not have a score
                    }
                }
            ]
        }
    },
    "aggs": {
        "customers": {
            "terms": {"field": "customer_id", "min_doc_count": 1},
            "aggs": {
                "top_documents": {
                    "top_hits": {
                        "sort": [
                            {"score": {"order": "desc"}}
                        ],
                        "size": 1
                    }
                }
            }
        }
    }
}
I then "paginate" by going through each customer bucket, appending the top document blob to a list and then sorting the list based on the value of the score field and finally taking a slice documents_list[from:from+size].
The issue with this is that, say I have 500 customers in the list but I only want the 2nd 20 documents, i.e. size = 20, from=20. So each time I call the function I have to first get the list for each of the 500 customers and then slice. This sounds very inefficient and is also a speed issue, since I need that function to be as fast as I can possibly make it.
Ideally, I could just get the 2nd 20 directly from ES without having to do any slicing in my function.
I have looked into Composite aggregations that ES offers, but it looks to me like I would not be able to use it in my case, since I need to get the entire doc, i.e. everything in the _source field in the regular search API response.
I would greatly appreciate any suggestions.
The best way to do this would be to use partitions.
According to the documentation:
GET /_search
{
    "size": 0,
    "aggs": {
        "expired_sessions": {
            "terms": {
                "field": "account_id",
                "include": {
                    "partition": 1,
                    "num_partitions": 25
                },
                "size": 20,
                "order": {
                    "last_access": "asc"
                }
            },
            "aggs": {
                "last_access": {
                    "max": {
                        "field": "access_date"
                    }
                }
            }
        }
    }
}
https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-aggregations-bucket-terms-aggregation.html#_filtering_values_with_partitions
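A sketch of how this could look from Python for the question's index: the helper below only builds the request body for one partition (field and aggregation names are taken from the question; the es.search call itself is commented out since it needs a live cluster):

```python
def partition_body(partition, num_partitions, page_size=20):
    """Build a search body that buckets only one partition of the
    customer_id values, returning the top document per customer by doc_score."""
    return {
        "size": 0,
        "aggs": {
            "customers": {
                "terms": {
                    "field": "customer_id",
                    "include": {
                        "partition": partition,
                        "num_partitions": num_partitions,
                    },
                    "size": page_size,
                },
                "aggs": {
                    "top_documents": {
                        "top_hits": {
                            "sort": [{"doc_score": {"order": "desc"}}],
                            "size": 1,
                        }
                    }
                },
            }
        },
    }

# One "page" per request (requires a live cluster):
# res = es.search(index="my_index", body=partition_body(partition=1, num_partitions=25))
```

Each request then returns only the buckets of one partition, so no client-side slicing is needed.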

How to filter many values at the same time?

I'm using Elasticsearch to filter one document at a time, and I use a loop to filter many documents. But now I want to filter many documents in one request, to optimize my script.
For the moment I have this query, and I'm using a "for" loop to filter by uuid.
for id in id_list:
    filter(id)

def filter(id):
    result = requests.get(
        settings + '/data/_search?size=10000',
        json={
            "query": {
                "bool": {
                    "filter": {
                        "terms": {
                            "id": id
                        }
                    }
                }
            },
            "_source": {
                "exclude": ["type", "date"]
            }
        }
    )
I would like to do only one request to get all my documents at once, to optimize my code.
The terms query takes an array of values; see the reference for an example.
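A minimal sketch of what that could look like, assuming the same settings URL and field names as in the question. The terms clause receives the whole id_list, so a single request replaces the loop (the requests.get call is commented out since it needs a reachable cluster):

```python
def build_query(id_list):
    """Build one search body whose terms filter matches any id in id_list."""
    return {
        "query": {
            "bool": {
                "filter": {
                    "terms": {"id": id_list}  # terms accepts a list of values
                }
            }
        },
        "_source": {
            "exclude": ["type", "date"]
        },
    }

# Single request instead of the loop (requires a reachable cluster):
# result = requests.get(settings + '/data/_search?size=10000', json=build_query(id_list))
```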

Is it better to query by type in ElasticSearch?

My question is about performance.
I am using filtered queries a lot, and I am not certain of the proper way to query by type.
So first, let's have a look at the mappings:
{
    "my_index": {
        "mappings": {
            "type_Light_Yellow": {
                "properties": {
                    "color_type": {
                        "properties": {
                            "color": {
                                "type": "string",
                                "index": "not_analyzed"
                            },
                            "brightness": {
                                "type": "string",
                                "index": "not_analyzed"
                            }
                        }
                    },
                    "details": {
                        "properties": {
                            "FirstName": {
                                "type": "string",
                                "index": "not_analyzed"
                            },
                            "LastName": {
                                "type": "string",
                                "index": "not_analyzed"
                            },
                            ...
                        }
                    }
                }
            }
        }
    }
}
Above, we can see an example of one mapping, for the type light Yellow. There are many more mappings for various types (colors), e.g. dark Yellow, light Brown, and so on.
Please notice color_type's sub-fields.
For type type_Light_Yellow, the values are always "color": "Yellow" and "brightness": "Light", and so on for all other types.
And now, my performance question: I wonder if there is a favorite method for querying my index.
For example, let's search for all documents where "details.FirstName": "John" and "details.LastName": "Doe" under type type_Light_Yellow.
Current method I'm using:
curl -XPOST 'http://somedomain.com:1234/my_index/_search' -d '{
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [
                        {"term": {"color_type.color": "Yellow"}},
                        {"term": {"color_type.brightness": "Light"}},
                        {"term": {"details.FirstName": "John"}},
                        {"term": {"details.LastName": "Doe"}}
                    ]
                }
            }
        }
    }
}'
As can be seen above, by defining "color_type.color": "Yellow" and "color_type.brightness": "Light", I am querying the whole index and treating type type_Light_Yellow as if it were just another field of the documents I'm searching.
The alternate method is to query directly under the type:
curl -XPOST 'http://somedomain.com:1234/my_index/type_Light_Yellow/_search' -d '{
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [
                        {"term": {"details.FirstName": "John"}},
                        {"term": {"details.LastName": "Doe"}}
                    ]
                }
            }
        }
    }
}'
Please notice the first line: my_index/type_Light_Yellow/_search.
Which way of querying would be more efficient, performance-wise?
Would the answer be different if I am querying via code (I am using Python with the Elasticsearch package)?
Types in Elasticsearch work by adding a _type attribute to documents, and every time you search a specific type it automatically filters by the _type attribute. So, performance-wise, there shouldn't be much of a difference. Types are an abstraction, not actual data. What I mean here is that fields across multiple document types are flattened out over the entire index, i.e. fields of one type occupy space in documents of other types as well, even when they are not indexed (think of it the same way as null occupying space).
But it's important to keep in mind that the order of filtering impacts performance. You should aim to exclude as many documents as possible in one go. So, if you think it's better not to filter by type first, the first way of filtering is preferable. Otherwise, I don't think there would be much of a difference if the ordering is the same.
Since the Python API also queries over HTTP in the default settings, using Python shouldn't impact performance.
In your case there is a certain degree of data duplication, though, as the color is captured both in the _type meta field and in the color field.

Pymongo: How to update a specific value of a row in an array

I have this simple JSON in mongodb:
{
    "pwd_list": [
        {
            "pwd": "password1",
            "pwd_date": str(int(time.time()))
        },
        {
            "pwd": "password2",
            "pwd_date": str(int(time.time()))
        }
    ]
}
What I am simply trying to do is to update one of the rows of the pwd_list array using an index...
I tried to use MongoDB's $position, but it seems to only work with $push, and I don't want to push!
I'm using pymongo.
So I tried different things, like this one:
self.collection.update(
    {"email": "email"},
    {
        "$set": {
            "pwd_list": {
                "temp_passwd": "NEW VALUE",
                "temp_date": "str(int(time.time()))"
            }
        }
    }
)
But it is not working as expected. (The above example transforms the array into an object...)
If it is not possible, can I at least update the first row (it's always that one)?
You have two options when doing this.
If you want to search for a specific item to change, you should do the following:
self.collection.update(
    {"email": "email", "pwd_list.pwd": "Password2"},
    {
        "$set": {
            "pwd_list.$": {
                "pwd": "NEW VALUE",
                "pwd_date": "str(int(time.time()))"
            }
        }
    }
)
If you want to change a specific item, and you know the position of the item in the array, you should do the following (in my example I am changing the first item):
self.collection.update(
    {"email": "email"},
    {
        "$set": {
            "pwd_list.0": {
                "pwd": "NEW VALUE",
                "pwd_date": "str(int(time.time()))"
            }
        }
    }
)
You can find more details about this on this page.
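If the position is only known at runtime, the dot-notation path can be built dynamically; a small sketch (the helper name is mine, and the update_one call assumes a pymongo collection):

```python
def set_pwd_at(index, new_pwd, new_date):
    """Build an update document that replaces the array element at
    position `index`, using MongoDB dot notation (e.g. "pwd_list.2")."""
    return {
        "$set": {
            "pwd_list.%d" % index: {
                "pwd": new_pwd,
                "pwd_date": new_date,
            }
        }
    }

# e.g. replace the third entry (requires a pymongo collection):
# self.collection.update_one({"email": "email"}, set_pwd_at(2, "NEW VALUE", str(int(time.time()))))
```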

highlighting based on term or bool query match in elasticsearch

I have the following query with a highlight configuration:
{
    'query': {
        'bool': {
            'must': [
                {'terms': {'metadata.loc': ['ten', 'twenty']}},
                {'terms': {'metadata.doc': ['prince', 'queen']}}
            ],
            'should': [
                {'match': {'text': 'kingdom of dreams'}}
            ]
        }
    },
    'highlight': {
        'fields': {
            'text': {
                'type': 'fvh',
                'matched_fields': ['metadata.doc', 'text']
            }
        }
    }
}
I have two questions:
Why are documents matching the should clause getting highlighted, whereas documents matching only the must terms clauses are not?
Is there any way to specify a highlight condition specific to each terms query above? That is, one highlight condition for {'terms': {'metadata.loc': ['ten', 'twenty']}}
and a separate highlight condition for {'terms': {'metadata.doc': ['prince', 'queen']}}.
1) Only documents matching the should clause are getting highlighted because you are highlighting against only the text field, which is basically your should clause. Although you are using matched_fields, you are still considering only the text field.
From the Docs
All matched_fields must have term_vector set to with_positions_offsets but only the field to which the matches are combined is loaded so only that field would benefit from having store set to yes.
Also, you are combining two very different fields ('matched_fields': ['metadata.doc', 'text']), which is hard to make sense of; again, from the Docs:
Technically it is also fine to add fields to matched_fields that don’t share the same underlying string as the field to which the matches are combined. The results might not make much sense and if one of the matches is off the end of the text then the whole query will fail.
2) You can write a highlight condition specific to a terms query with a Highlight Query.
Try this in the highlight part of your query:
{
    "query": {
        ...your query...
    },
    "highlight": {
        "fields": {
            "text": {
                "type": "fvh",
                "matched_fields": [
                    "text",
                    "metadata.doc"
                ]
            },
            "metadata.doc": {
                "highlight_query": {
                    "terms": {
                        "metadata.doc": [
                            "prince",
                            "queen"
                        ]
                    }
                }
            },
            "metadata.loc": {
                "highlight_query": {
                    "terms": {
                        "metadata.loc": [
                            "ten",
                            "twenty"
                        ]
                    }
                }
            }
        }
    }
}
Does this help?
