I'm using ElasticSearch 8.3.2 to store some data I have. The data consists of metabolites and several "studies" for each metabolite, with each study in turn containing concentration values. I am also using the Python ElasticSearch client to communicate with the backend, which works fine.
To associate metabolites with studies, I was considering using a join field as described here.
I have defined this index mapping:
INDEXMAPPING_MET = {
    "mappings": {
        "properties": {
            "id": {"type": "keyword"},
            "entry_type": {"type": "text"},
            "pc_relation": {
                "type": "join",
                "relations": {
                    "metabolite": "study"
                }
            },
            "concentration": {
                "type": "nested",
            }
        }
    }
}
pc_relation is the join field here, with metabolites being the parent documents of each study document.
I can create metabolite entries (the parent documents) just fine using the Python client, for example
self.client.index(index="metabolitesv2", id=metabolite, body=json.dumps({
#[... some other fields here]
"pc_relation": {
"name": "metabolite",
},
}))
However, once I try adding child documents, I get a mapping_parser_exception. Notably, I only get this exception when trying to add the pc_relation field; any other fields work just fine, and I can create documents if I omit the join field. Here is an example for a study document I am trying to create (on the same index):
self.client.index(index="metabolitesv2", id=study, body=json.dumps({
#[... some other fields here]
"pc_relation": {
"name": "study",
"parent": metabolite_id
},
}))
At first I thought there might be some typing issues, but casting everything to a string sadly does not change the outcome. I would really appreciate any help with regard to where the error could be, as I am not sure what the issue is. From what I can tell from the official ES documentation and other Python+ES projects, I am not doing anything differently.
Tried: Creating an index with a join field, creating a parent document, creating a child document with a join relation to the parent.
Expectation: Documents get created and can be queried using has_child or has_parent tags.
Result: MappingParserException when trying to create the child document
TL;DR
You need to provide a routing value at indexing time for the child document.
The routing value is mandatory because parent and child documents must be indexed on the same shard.
By default the routing value of a document is its _id, so in practice you need to provide the _id of the parent document when indexing the child.
Solution
self.client.index(index="metabolitesv2", id=study, routing=metabolite, body=json.dumps({
#[... some other fields here]
"pc_relation": {
"name": "study",
"parent": metabolite_id
},
}))
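Once the child is indexed with the routing value, you can sanity-check the relation with a has_parent (or has_child) query. Below is a minimal sketch with the 8.x Python client, reusing the index, field and variable names from the snippets above; treat it as an illustration rather than part of the original code:

# Fetch all study documents attached to a given parent metabolite.
res = self.client.search(
    index="metabolitesv2",
    query={
        "has_parent": {
            "parent_type": "metabolite",
            "query": {"ids": {"values": [metabolite_id]}},
        }
    },
)
print(res["hits"]["hits"])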
To reproduce
PUT 75224800
{
  "settings": {
    "number_of_shards": 4
  },
  "mappings": {
    "properties": {
      "id": {
        "type": "keyword"
      },
      "pc_relation": {
        "type": "join",
        "relations": {
          "metabolite": "study"
        }
      }
    }
  }
}

PUT 75224800/_doc/1
{
  "id": "1",
  "pc_relation": "metabolite"
}

# No routing id: this is going to fail
PUT 75224800/_doc/2
{
  "id": "2",
  "pc_relation": {
    "name": "study",
    "parent": "1"
  }
}

PUT 75224800/_doc/3
{
  "id": "3",
  "pc_relation": "metabolite"
}

PUT 75224800/_doc/4?routing=3
{
  "id": "4",
  "pc_relation": {
    "name": "study",
    "parent": "3"
  }
}
Related
I am trying to query an Elasticsearch index for near-duplicates using its MinHash implementation.
I use the Python client running in containers to index and perform the search.
My corpus is a JSONL file a bit like this:
{"id":1, "text":"I'd just like to interject for a moment"}
{"id":2, "text":"I come up here for perception and clarity"}
...
I create an Elasticsearch index successfully, trying to use custom settings and analyzer, taking inspiration from the official examples and MinHash docs:
def create_index(client):
    client.indices.create(
        index="documents",
        body={
            "settings": {
                "analysis": {
                    "filter": {
                        "my_shingle_filter": {
                            "type": "shingle",
                            "min_shingle_size": 5,
                            "max_shingle_size": 5,
                            "output_unigrams": False
                        },
                        "my_minhash_filter": {
                            "type": "min_hash",
                            "hash_count": 10,
                            "bucket_count": 512,
                            "hash_set_size": 1,
                            "with_rotation": True
                        }
                    },
                    "analyzer": {
                        "my_analyzer": {
                            "tokenizer": "standard",
                            "filter": [
                                "my_shingle_filter",
                                "my_minhash_filter"
                            ]
                        }
                    }
                }
            },
            "mappings": {
                "properties": {
                    "name": {"type": "text", "analyzer": "my_analyzer"}
                }
            },
        },
        ignore=400,
    )
I verify that index creation has no big problems via Kibana, and by visiting http://localhost:9200/documents/_settings I get something that seems in order.
However, querying the index with:
def get_duplicate_documents(body, K, es):
    doc = {
        '_source': ['_id', 'body'],
        'size': K,
        'query': {
            "match": {
                "body": {
                    "query": body,
                    "analyzer": "my_analyzer"
                }
            }
        }
    }
    res = es.search(index='documents', body=doc)
    top_matches = [hit['_source']['_id'] for hit in res['hits']['hits']]
My res['hits'] is consistently empty, even if I set body to match exactly the text of one of the entries in my corpus. In other words, I don't get any results if I try, as values for body, e.g.
"I come up here for perception and clarity"
or substrings like
"I come up here for perception"
while ideally, I'd like the procedure to return near-duplicates, with a score being an approximation of the Jaccard similarity of the query and the near-duplicates, obtained via MinHash.
Is there something wrong in my query and/or way I index Elasticsearch? Am I missing something else entirely?
P.S.: You can have a look at https://github.com/davidefiocco/dockerized-elasticsearch-duplicate-finder/tree/ea0974363b945bf5f85d52a781463fba76f4f987 for a non-functional, but hopefully reproducible example (I will also update the repo as I find a solution!)
Here are some things that you should double-check as they are likely culprits:
When you create your mapping, you should change "name" to "text" inside the body param of your client.indices.create call, because your JSON documents have a field called text:
"mappings": {
"properties": {
"text": {"type": "text", "analyzer": "my_analyzer"}
}
In the indexing phase you could also rework your generate_actions() method, following the documentation, with something like:
for elem in corpus:
    yield {
        "_op_type": "index",
        "_index": "documents",
        "_id": elem["id"],
        "_source": {"text": elem["text"]}
    }
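These actions can then be fed straight to the bulk helper. A minimal sketch, assuming client is the Elasticsearch instance used elsewhere in your script and that corpus is in scope for generate_actions():

from elasticsearch import helpers

# Bulk-index everything yielded by generate_actions();
# helpers.bulk returns (number_of_successes, errors).
success, errors = helpers.bulk(client, generate_actions())
print(f"indexed {success} documents")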
Incidentally, if you are indexing pandas dataframes, you may want to check the experimental official library eland.
Also, according to your mapping, you are using a min_hash token filter, so Lucene will transform the text inside the text field into hashes. This means you can only query against this field with a hash, not with a plain string as you did in your example "I come up here for perception and clarity".
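To see what the analyzer actually emits, you can run a quick check with the _analyze API. A small sketch with the Python client (older client versions accept a body dict as below; newer ones use keyword arguments instead):

# Shows the (hashed) tokens produced by my_analyzer for a sample sentence.
tokens = es.indices.analyze(
    index="documents",
    body={"analyzer": "my_analyzer", "text": "I'd just like to interject for a moment"},
)
print(tokens["tokens"])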
So the best way to use it is to retrieve the content of the text field and then query Elasticsearch for that exact retrieved value. Also, the _id metafield is not inside the _source metafield, so you should change your get_duplicate_documents() method to:
def get_duplicate_documents(body, K, es):
    doc = {
        '_source': ['text'],
        'size': K,
        'query': {
            "match": {
                "text": {  # I changed this line!
                    "query": body
                }
            }
        }
    }
    res = es.search(index='documents', body=doc)
    # also changed the list comprehension!
    top_matches = [(hit['_id'], hit['_source']) for hit in res['hits']['hits']]
    return top_matches
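For example, a hypothetical usage that looks up near-duplicates of a document already in the index (assuming a document with id "1" exists and es is your client):

# Fetch the stored text of document "1" and search for its near-duplicates.
text = es.get(index="documents", id="1")["_source"]["text"]
top_matches = get_duplicate_documents(text, K=10, es=es)
print(top_matches)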
I'm passing the id and category id fields as parameters to get the document. With my code below I'm only able to fetch those fields of the document. How do I get the entire document using multiple fields as parameters?
Code:
@reviews.route('/<inp_id>/<cat_id>', methods=['GET'])
def index(inp_id, cat_id):
    collection = mongo_connection.db.db_name
    document = collection.find_one({'id': inp_id}, {'category.id': cat_id})
Result:
{
  "category": {
    "id": "13"
  },
  "_id": "5cdd36cd8a348e81d8995d3b"
}
I want:
{
  "customer": {
    "id": "1",
    "name": "Kit Data"
  },
  "category": {
    "id": "13",
    "name": "TrainKit"
  },
  "review_date": "2019-05-06",
  "phrases": null,
  .....
}
Pass all your filters in the first dict; the second one is for projection.
document = collection.find_one({'id': inp_id, 'category.id': cat_id})
Your original query, collection.find_one({'id': inp_id}, {'category.id': cat_id}), means: give me only category.id (and nothing else, apart from _id, which is returned by default) of a document in which the value of id equals inp_id.
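If you do want to trim the returned document, a projection can still be passed as the second argument. A small sketch (the field names are simply taken from your desired output; 1 means include):

# Filter on both ids, but return only customer, category and review_date
# (plus _id, which is included by default).
document = collection.find_one(
    {'id': inp_id, 'category.id': cat_id},
    {'customer': 1, 'category': 1, 'review_date': 1}
)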
Say you have this AVDL as a simplified example:
@namespace("example.avro")
protocol User {
  record Man {
    int age;
  }
  record Woman {
    int age;
  }
  record User {
    union {
      Man,
      Woman
    } user_info;
  }
}
In Python you are not able to properly serialize objects stating the union branch type, because this syntax is not allowed:
{"user_info": {"Woman": {"age": 18}}}
and the only object that gets serialized is
{"user_info": {"age": 18}}
losing all the type information, with the DatumWriter usually picking the first record that matches the set of fields, in this case a Man.
The same scenario works perfectly well when using the Java API.
So, what am I doing wrong here? Is it possible that serialization and deserialization are not idempotent in Python's Avro implementation?
You are correct that the standard avro library has no way to specify which schema to use in cases like this. However, fastavro (an alternative implementation) does have a way to do this. In that implementation, a record can be specified as a tuple where the first value is the schema name and the second value is the actual record data. The record would look like this:
{"user_info": ("Woman", {"age": 18})}
Here's an example script:
from io import BytesIO
from fastavro import writer

schema = {
    "type": "record",
    "name": "User",
    "fields": [{
        "name": "user_info",
        "type": [
            {
                "type": "record",
                "name": "Man",
                "fields": [{
                    "name": "age",
                    "type": "int"
                }]
            },
            {
                "type": "record",
                "name": "Woman",
                "fields": [{
                    "name": "age",
                    "type": "int"
                }]
            }
        ]
    }]
}

records = [{"user_info": ("Woman", {"age": 18})}]
bio = BytesIO()
writer(bio, schema, records)
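To check the round trip, the records can be read back from the same buffer. A small sketch; note that recent fastavro versions also offer a return_record_name option if you need the union branch name back when reading:

from fastavro import reader

# Rewind the buffer and read the records back.
bio.seek(0)
for record in reader(bio):
    print(record)  # e.g. {'user_info': {'age': 18}}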
My question is about performance.
I am using filtered queries a lot, and I am not certain what the proper way to query by type is.
So first, let's have a look at the mappings:
{
  "my_index": {
    "mappings": {
      "type_Light_Yellow": {
        "properties": {
          "color_type": {
            "properties": {
              "color": {
                "type": "string",
                "index": "not_analyzed"
              },
              "brightness": {
                "type": "string",
                "index": "not_analyzed"
              }
            }
          },
          "details": {
            "properties": {
              "FirstName": {
                "type": "string",
                "index": "not_analyzed"
              },
              "LastName": {
                "type": "string",
                "index": "not_analyzed"
              },
              ...
            }
          }
        }
      }
    }
  }
}
Above, we can see an example of one mapping, for the type light Yellow. There are also many more mappings for various types (colors), e.g. dark Yellow, light Brown, and so on.
Please notice color_type's subfields.
For type type_Light_Yellow the values are always "color": "Yellow" and "brightness": "Light", and similarly for all the other types.
And now, my performance question: I wonder if there is a preferred method for querying my index.
For example, let's search for all documents where "details.FirstName": "John" and "details.LastName": "Doe" under type type_Light_Yellow.
Current method I'm using:
curl -XPOST 'http://somedomain.com:1234/my_index/_search' -d '{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "color_type.color": "Yellow"
              }
            },
            {
              "term": {
                "color_type.brightness": "Light"
              }
            },
            {
              "term": {
                "details.FirstName": "John"
              }
            },
            {
              "term": {
                "details.LastName": "Doe"
              }
            }
          ]
        }
      }
    }
  }
}'
As can be seen above, by filtering on "color_type.color": "Yellow" and "color_type.brightness": "Light", I am querying the whole index and treating type type_Light_Yellow as if it were just another field of the documents I'm searching.
The alternate method is to query directly under the type:
curl -XPOST 'http://somedomain.com:1234/my_index/type_Light_Yellow/_search' -d '{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            {
              "term": {
                "details.FirstName": "John"
              }
            },
            {
              "term": {
                "details.LastName": "Doe"
              }
            }
          ]
        }
      }
    }
  }
}'
Please notice the first line: my_index/type_Light_Yellow/_search.
Which method would be more efficient, performance-wise?
Would the answer be different if I am querying via code (I am using Python with the elasticsearch package)?
Types in Elasticsearch work by adding a _type attribute to documents, and every time you search a specific type it automatically filters by that _type attribute. So, performance-wise, there shouldn't be much of a difference. Types are an abstraction and not actual data. What I mean here is that fields across multiple document types are flattened out over the entire index, i.e. fields of one type take up space in documents of other types as well, even though they are not indexed there (think of it the same way as null occupying space).
But it's important to keep in mind that the order of filtering impacts performance. You must aim to exclude as many documents as possible in one go. So, if you think it's better not to filter by type first, the first way is preferable. Otherwise, I don't think there would be much of a difference if the ordering is the same.
Since the Python API also queries over HTTP with the default settings, using Python shouldn't impact performance.
In your case there is a certain degree of data duplication, though, as the color is captured both in the _type meta field and in the color field.
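For reference, the second variant could look roughly like this with the Python client. This is only a sketch, and it assumes an elasticsearch-py version old enough to still support doc_type and the filtered query, matching the ES version in the question:

from elasticsearch import Elasticsearch

es = Elasticsearch(["somedomain.com:1234"])

body = {
    "query": {
        "filtered": {
            "filter": {
                "bool": {
                    "must": [
                        {"term": {"details.FirstName": "John"}},
                        {"term": {"details.LastName": "Doe"}}
                    ]
                }
            }
        }
    }
}

# Restrict the search to one type; drop doc_type (and add the color_type terms
# back into the bool filter) to query the whole index instead.
res = es.search(index="my_index", doc_type="type_Light_Yellow", body=body)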
I used the following function in Python to initialize an index in Elasticsearch.
def init_index():
    constants.ES_CLIENT.indices.create(
        index=constants.INDEX_NAME,
        body={
            "settings": {
                "index": {
                    "type": "default"
                },
                "number_of_shards": 1,
                "number_of_replicas": 1,
                "analysis": {
                    "filter": {
                        "ap_stop": {
                            "type": "stop",
                            "stopwords_path": "stoplist.txt"
                        },
                        "shingle_filter": {
                            "type": "shingle",
                            "min_shingle_size": 2,
                            "max_shingle_size": 5,
                            "output_unigrams": True
                        }
                    },
                    "analyzer": {
                        constants.ANALYZER_NAME: {
                            "type": "custom",
                            "tokenizer": "standard",
                            "filter": ["standard",
                                       "ap_stop",
                                       "lowercase",
                                       "shingle_filter",
                                       "snowball"]
                        }
                    }
                }
            }
        }
    )

    new_mapping = {
        constants.TYPE_NAME: {
            "properties": {
                "text": {
                    "type": "string",
                    "store": True,
                    "index": "analyzed",
                    "term_vector": "with_positions_offsets_payloads",
                    "search_analyzer": constants.ANALYZER_NAME,
                    "index_analyzer": constants.ANALYZER_NAME
                }
            }
        }
    }

    constants.ES_CLIENT.indices.put_mapping(
        index=constants.INDEX_NAME,
        doc_type=constants.TYPE_NAME,
        body=new_mapping
    )
Using this function I was able to create an index with user-defined specs.
I recently started to work with Scala and Spark. For integrating Elasticsearch into this I can either use Spark's API, i.e. org.elasticsearch.spark, or I can use the Hadoop one, org.elasticsearch.hadoop. Most of the examples I see relate to Hadoop's methodology, but I don't wish to use Hadoop here. I went through the Spark-Elasticsearch documentation and was able to at least index documents without including Hadoop, but I noticed that this created everything with defaults; I can't even specify the _id there. It generates the _id on its own.
In Scala I use the following code for indexing (not the complete code):
val document = mutable.Map[String, String]()
document("id") = docID
document("text") = textChunk.mkString(" ") //textChunk is a list of Strings
sc.makeRDD(Seq(document)).saveToEs("es_park_ap/document")
This created an index this way:
{
  "es_park_ap": {
    "mappings": {
      "document": {
        "properties": {
          "id": {
            "type": "string"
          },
          "text": {
            "type": "string"
          }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1433006647684",
        "uuid": "QNXcTamgQgKx7RP-h8FVIg",
        "number_of_replicas": "1",
        "number_of_shards": "5",
        "version": {
          "created": "1040299"
        }
      }
    }
  }
}
So if I pass a document to it, the following document is created:
{
  "_index": "es_park_ap",
  "_type": "document",
  "_id": "AU2l2ixcAOrl_Gagnja5",
  "_score": 1,
  "_source": {
    "text": "some large text",
    "id": "12345"
  }
}
Just like in Python, how can I use Spark and Scala to create an index with user-defined specifications?
I think we should divide your question into several smaller issues.
If you want to create an index with specific mappings/settings, you should use the Elasticsearch Java API directly (you can use it from Scala code, of course).
You can use the following sources for examples of index creation using Scala:
Creating index and adding mapping in Elasticsearch with java api gives missing analyzer errors
Define custom ElasticSearch Analyzer using Java API
The Elasticsearch Hadoop/Spark plugin is used to transport data easily from HDFS to ES. ES maintenance should be done separately.
The fact that you are still seeing an automatically generated id is because you must tell the plugin which field to use as the id, using the following syntax:
EsSpark.saveToEs(rdd, "spark/docs", Map("es.mapping.id" -> "your_id_field"))
Or in your case:
sc.makeRDD(Seq(document)).saveToEs("es_park_ap/document", Map("es.mapping.id" -> "your_id_field"))
You can find more details about syntax and proper use here:
https://www.elastic.co/guide/en/elasticsearch/hadoop/master/spark.html
Michael