Elasticsearch/Python - Re-index data after changing the mappings?

Elasticsearch/Python - Re-index data after changing the mappings? - python

I'm a little stuck on how to re-index data in elastic search after a mapping or a data type has been changed.
According to elastic search docs
Pull the documents in from your old index, using a scrolled search and index them into the new index using the bulk API. Many of the client APIs provide a reindex() method which will do all of this for you. Once you are done, you can delete the old index.
This is my old mapping
{
"test-index2": {
"mappings": {
"business": {
"properties": {
"address": {
"type": "nested",
"properties": {
"country": {
"type": "string"
},
"full_address": {
"type": "string"
}
}
}
}
}
}
}
}
New Index mapping, I'm changing full_address -> location_address
{
"test-index2": {
"mappings": {
"business": {
"properties": {
"address": {
"type": "nested",
"properties": {
"country": {
"type": "string"
},
"location_address": {
"type": "string"
}
}
}
}
}
}
}
}
I'm using the python client for elasticsearch
https://elasticsearch-py.readthedocs.org/en/master/helpers.html#elasticsearch.helpers.reindex
from elasticsearch import Elasticsearch
from elasticsearch.helpers import reindex
es = Elasticsearch(["es.node1"])
reindex(es, "source_index", "target_index")
However this transfers the data from one index to another.
How may i use this to change the mappings/(data types etc) for my case above?

It's Straightforward if you use the scan&scroll and the Bulk API already implemented in the python client of elasticsearch
First -> Fetch all the documents by scan&scroll method
Loop through and make neccessary modifications to each document
Insert the modified documents into a new index using the Bulk API
from elasticsearch import Elasticsearch, helpers
es = Elasticsearch()
# Use the scan&scroll method to fetch all documents from your old index
res = helpers.scan(es, query={
"query": {
"match_all": {}
},
"size":1000
},index="old_index")
new_insert_data = []
# Change the mapping and everything else by looping through all your documents
for x in res:
x['_index'] = 'new_index'
# Change "address" to "location_address"
x['_source']['location_address'] = x['_source']['address']
del x['_source']['address']
# This is a useless field
del x['_score']
es.indices.refresh(index="testing_index3")
# Add the new data into a list
new_insert_data.append(x)
es.indices.refresh(index="new_index")
print new_insert_data
#Use the Bulk API to insert the list of your modified documents into the database
helpers.bulk(es,new_insert_data)

The reindex() API simply "moves" documents from one index to another. There is no way it can detect/infer that the field name full_address in documents of the old index should be location_address in documents in the new index. I doubt there is any API provided by standard Elasticsearch clients that can do what you desire. The only way I can think of achieving this is through additional custom logic on the client side which maintains a dictionary of field names from old index to new index and then read documents from old index and indexes the corresponding document to the new index with new field names obtained from the field name dictionary.

After updating the mapping, this can be done by updating the exiting documents using bulk API.
POST /_bulk
{"update":{"_id":"59519","_type":"asset","_index":"assets"}}
{"doc":{"facility_id":491},"detect_noop":false}
Note - Use 'detect_noop' for detecting the noop update.

Related

How to search inside json file with redis?

Here I set a json object inside a key in a redis. Later I want to perform search on the json file stored in the redis. My search key will always be a json string like in the example below and i want to match this inside the stored json file.
Currently here i am doing this by iterating and comparing but instead i want to do it with redis. How can I do it ?
rd = redis.StrictRedis(host="localhost",port=6379, db=0)
if not rd.get("mykey"):
with open(os.path.join(BASE_DIR, "my_file.josn")) as fl:
data = json.load(fl)
rd.set("mykey", json.dumps(data))
else:
key_values = json.loads(rd.get("mykey"))
search_json_key = {
"key":"value",
"key2": {
"key": "val"
}
}
# here i am searching by iterating and comparing instead i want to do it with redis
for i in key_values['all_data']:
if json.dumps(i) == json.dumps(search_json_key):
# return
# mykey format looks like this:
{
"all_data": [
{
"key":"value",
"key2": {
"key": "val"
}
},
{
"key":"value",
"key2": {
"key": "val"
}
},
{
"key":"value",
"key2": {
"key": "val"
}
},
]
}

To do search with Redis and JSON you have two options - you can use the FT CREATE command to create an index that you can then use FT SEARCH over, (while both of these web pages show the CLI syntax you can do
rd.ft().create() / search() in your python script)
OR you can check out the python OM client that will take care of that to some extent for you.
Either way you'll have to do a bit of a rework to fully take advantage of Redis' search capabilities.

Export part of data filed - MongoDB

I'm using MongoDB Compass to export my data as csv file, but I have only the choice to select which field I want and not elements in a specific field.
MongoDB export data:
Actually, I'm interested to save only the "scores" for object "0,1,2".
Here a ScreenShot from MongDB Compas:
It is something that I should deal with python?

One option could be to "rewrite" "scoreTable" so that there are a maximum of 3 elements in the "scores" array and then "$out" to a new collection that can be exported in full.
db.devicescores.aggregate([
{
"$set": {
"scoreTable": {
"$map": {
"input": "$scoreTable",
"as": "player",
"in": {
"$mergeObjects": [
"$$player",
{"scores": {"$slice": ["$$player.scores", 3]}}
]
}
}
}
}
},
{"$out": "outCollection"}
])
Try it on mongoplayground.net.

Automatically entering next JSON level using Python in a similar way to JQ in bash

I am trying to use Python to extract pricePerUnit from JSON. There are many entries, and this is just 2 of them -
{
"terms": {
"OnDemand": {
"7Y9ZZ3FXWPC86CZY": {
"7Y9ZZ3FXWPC86CZY.JRTCKXETXF": {
"offerTermCode": "JRTCKXETXF",
"sku": "7Y9ZZ3FXWPC86CZY",
"effectiveDate": "2020-11-01T00:00:00Z",
"priceDimensions": {
"7Y9ZZ3FXWPC86CZY.JRTCKXETXF.6YS6EN2CT7": {
"rateCode": "7Y9ZZ3FXWPC86CZY.JRTCKXETXF.6YS6EN2CT7",
"description": "Processed translation request in AWS GovCloud (US)",
"beginRange": "0",
"endRange": "Inf",
"unit": "Character",
"pricePerUnit": {
"USD": "0.0000150000"
},
"appliesTo": []
}
},
"termAttributes": {}
}
},
"CQNY8UFVUNQQYYV4": {
"CQNY8UFVUNQQYYV4.JRTCKXETXF": {
"offerTermCode": "JRTCKXETXF",
"sku": "CQNY8UFVUNQQYYV4",
"effectiveDate": "2020-11-01T00:00:00Z",
"priceDimensions": {
"CQNY8UFVUNQQYYV4.JRTCKXETXF.6YS6EN2CT7": {
"rateCode": "CQNY8UFVUNQQYYV4.JRTCKXETXF.6YS6EN2CT7",
"description": "$0.000015 per Character for TextTranslationJob:TextTranslationJob in EU (London)",
"beginRange": "0",
"endRange": "Inf",
"unit": "Character",
"pricePerUnit": {
"USD": "0.0000150000"
},
"appliesTo": []
}
},
"termAttributes": {}
}
}
}
}
}
The issue I run into is that the keys, which in this sample, are 7Y9ZZ3FXWPC86CZY, CQNY8UFVUNQQYYV4.JRTCKXETXF, and CQNY8UFVUNQQYYV4.JRTCKXETXF.6YS6EN2CT7 are a changing string that I cannot just type out as I am parsing the dictionary.
I have python code that works for the first level of these random keys -
with open('index.json') as json_file:
data = json.load(json_file)
json_keys=list(data['terms']['OnDemand'].keys())
#Get the region
for i in json_keys:
print((data['terms']['OnDemand'][i]))
However, this is tedious, as I would need to run the same code three times to get the other keys like 7Y9ZZ3FXWPC86CZY.JRTCKXETXF and 7Y9ZZ3FXWPC86CZY.JRTCKXETXF.6YS6EN2CT7, since the string changes with each JSON entry.
Is there a way that I can just tell python to automatically enter the next level of the JSON object, without having to parse all keys, save them, and then iterate through them? Using JQ in bash I can do this quite easily with jq -r '.terms[][][]'.

If you are really sure, that there is exactly one key-value pair on each level, you can try the following:
def descend(x, depth):
for i in range(depth):
x = next(iter(x.values()))
return x

You can use dict.values() to iterate over the values of a dict. You can also use next(iter(dict.values())) to get a first (only) element of a dict.
for demand in data['terms']['OnDemand'].values():
next_level = next(iter(demand.values()))
print(next_level)
If you expect other number of children than 1 in the second level, you can just nest the fors:
for demand in data['terms']['OnDemand'].values():
for sub_demand in demand.values()
print(sub_demand)
If you are insterested in the keys too, you can use dict.items() method to iterate over dict keys and values at the same time:
for demand_key, demand in data['terms']['OnDemand'].items():
for sub_demand_key, sub_demand in demand.items()
print(demand_key, sub_demand_key, sub_demand)

Elasticsearch DocumentSimilarity dense_vector got multiple values for argument 'body'

I want to store Document Vectors in an Elasticsearch index in order to calculate document similarity. I'm using the Python client for Elasticsearch 7.8.0.
I have a (dummy) Elasticsearch index with the following mapping:
mapping = {
"mappings": {
"properties": {
"title_vector":{
"type": "dense_vector",
"dims": 3
}
}
}
}
es.indices.create(index="test_vector", body=mapping)
And I stored a bunch of vectors in the following way:
vectors = [[1,2,3],[2,2,2],[1,2,2],[2,2,2],[4,5,6],[1,1,1]]
for i, v in enumerate(vectors):
doc = {"title_vector": v}
es.create("test_vector", id=i, body=doc)
According to the documentation, my query to get the most similar documents, should be as follows:
doc = {
"query": {
"script_score": {
"query": {
"match_all": {}
},
"script": {
"source": "cosineSimilarity(params.queryVector, 'title_vector') + 1.0",
"params": {
"queryVector": [1,1,1]
}
}
}
}}
es.search("test_vector", body=doc)
But I'm getting
TypeError: search() got multiple values for argument 'body'
It seems more like a Python error than an Elastic error. But I can't really find the cause of the error and how I should structure my query differently in order to solve it.
Thanks in advance!
Edit: added Elasticsearch version

You are correct, it is a python error. So below is how the es.search is defined according to this link
search(body=None, index=None, params=None, headers=None)
As you see the first parameter is body.
Notice the es.search you have, you haven't specified the key in the first parameter i.e. body, index, params, headers. As a result, python interprets that as value for body according to the above method declaration.
Just add index="test_vector" instead of just "test_vector" in the first parameter and that should do the trick.
es.search(index="test_vector", body=doc)
Hope it helps!

How to index list of object in Elasticsearch?

A document format I ingest into ElasticSearch looks like this:
{
'id':'514d4e9f-09e7-4f13-b6c9-a0aa9b4f37a0'
'created':'2019-09-06 06:09:33.044433',
'meta':{
'userTags':[
{
'intensity':'1',
'sentiment':'0.84',
'keyword':'train'
},
{
'intensity':'1',
'sentiment':'-0.76',
'keyword':'amtrak'
}
]
}
}
...ingested with python:
r = requests.put(itemUrl, auth = authObj, json = document, headers = headers)
The idea here is that ElasticSearch will treat keyword, intensity and sentiment as fields that can be later queried. However, on ElasticSearch side I can observe that this is not happening (I use Kibana for search UI) -- instead, I see field "meta.userTags" with the value that is the whole list of objects.
How can I make ElasticSearch index elements within a list?

I used the document body you provided to create a new index 'testind' and type 'testTyp' using the Postman REST client.:
POST http://localhost:9200/testind/testTyp
{
"id":"514d4e9f-09e7-4f13-b6c9-a0aa9b4f37a0",
"created":"2019-09-06 06:09:33.044433",
"meta":{
"userTags":[
{
"intensity":"1",
"sentiment":"0.84",
"keyword":"train"
},
{
"intensity":"1",
"sentiment":"-0.76",
"keyword":"amtrak"
}
]
}
}
When I queried for the index's mapping this is what i get :
GET http://localhost:9200/testind/testTyp/_mapping
{
"testind":{
"mappings":{
"testTyp":{
"properties":{
"created":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"id":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"meta":{
"properties":{
"userTags":{
"properties":{
"intensity":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"keyword":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
},
"sentiment":{
"type":"text",
"fields":{
"keyword":{
"type":"keyword",
"ignore_above":256
}
}
}
}
}
}
}
}
}
}
}
}
As you can see in the mapping the fields are part of the mapping and can be queried as per need in future, so I don't see the problem here as long as the field names are not one of these - https://www.elastic.co/guide/en/elasticsearch/reference/6.4/sql-syntax-reserved.html ( you might want to avoid the term 'keyword' as it might be confusing later when writing search queries as the fieldname and type are both same - 'keyword') . Also, note one thing, the mapping gets created via dynamic mapping (https://www.elastic.co/guide/en/elasticsearch/reference/6.3/dynamic-field-mapping.html#dynamic-field-mapping ) in Elasticsearch and so the data types are determined by elasticsearch based on the values you have provided.However, this may not be always accurate , so to prevent that you can use the PUT _mapping API to define your own mapping for the index, and then prevent new fields within a type from being added to mappings.

You don't need a special mapping to index a list - every field can contain one or more values of the same type. See array datatype.
In the case of a list of objects, they can be indexed as object or nested datatype. Per default elastic uses object datatype. In this case you can query meta.userTags.keyword or/and meta.userTags.sentiment. The result will allways contains whole documents with values matched independently, ie. searching keyword=train and sentiment=-0.76 you WILL find document with id=514d4e9f-09e7-4f13-b6c9-a0aa9b4f37a0.
If this is not what you want, you need to define nested datatype mapping for field userTags and use a nested query.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Elasticsearch/Python - Re-index data after changing the mappings? - python

After updating the mapping, this can be done by updating the exiting documents using bulk API. POST /_bulk {"update":{"_id":"59519","_type":"asset","_index":"assets"}} {"doc":{"facility_id":491},"detect_noop":false} Note - Use 'detect_noop' for detecting the noop update.

Related

How to search inside json file with redis?

Export part of data filed - MongoDB

Automatically entering next JSON level using Python in a similar way to JQ in bash

Elasticsearch DocumentSimilarity dense_vector got multiple values for argument 'body'

How to index list of object in Elasticsearch?

Categories

Resources