Bulk update for Elasticsearch documents using Python

I have Elasticsearch documents like the ones below, where I need to correct the age value based on creationtime and currentdate:
age = currentdate - creationtime
hits = [
    {
        "_id": "CrRvuvcC_uqfwo-WSwLi",
        "creationtime": "2018-05-20T20:57:02",
        "currentdate": "2021-02-05 00:00:00",
        "age": "60 months"
    },
    {
        "_id": "CrRvuvcC_uqfwo-WSwLi",
        "creationtime": "2013-07-20T20:57:02",
        "currentdate": "2021-02-05 00:00:00",
        "age": "60 months"
    },
    {
        "_id": "CrRvuvcC_uqfwo-WSwLi",
        "creationtime": "2014-08-20T20:57:02",
        "currentdate": "2021-02-05 00:00:00",
        "age": "60 months"
    },
    {
        "_id": "CrRvuvcC_uqfwo-WSwLi",
        "creationtime": "2015-09-20T20:57:02",
        "currentdate": "2021-02-05 00:00:00",
        "age": "60 months"
    }
]
I want to do a bulk update based on each document's ID. The problem is that I need to correct six months of data, and the per-day data size (doc count of the index) is almost 535329 documents, so I want to efficiently bulk-update age based on _id for each day across all documents using Python.
Is there a way to do this without looping through everything? All the examples I came across that use Pandas dataframes for updates rely on a known value, but here I only get the _id as and when the code runs.
The logic I had written was to fetch all documents, store their _id values, and then update the age for each _id. But that is not an efficient way if I want to update all documents in bulk for each day of the six months.
Can anyone give me some ideas for this or point me in the right direction?

As mentioned in the comments, fetching the IDs won't be necessary. You don't even need to fetch the documents themselves!
A single _update_by_query call will be enough. You can use ChronoUnit to get the difference after you've parsed the dates:
POST your-index-name/_update_by_query
{
  "query": {
    "match_all": {}
  },
  "script": {
    "source": """
      def created = LocalDateTime.parse(ctx._source.creationtime, DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss"));
      def currentdate = LocalDateTime.parse(ctx._source.currentdate, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
      def months = ChronoUnit.MONTHS.between(created, currentdate);
      ctx._source.age = months + ' month' + (months > 1 ? 's' : '');
    """,
    "lang": "painless"
  }
}
The official Python client supports this as well, through its update_by_query method.
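For reference, here is a minimal sketch of the same call through the Python client (assuming a 7.x-style elasticsearch-py client where update_by_query accepts a request body; adjust the host and index name to your setup):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

script = """
def created = LocalDateTime.parse(ctx._source.creationtime, DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss"));
def currentdate = LocalDateTime.parse(ctx._source.currentdate, DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"));
def months = ChronoUnit.MONTHS.between(created, currentdate);
ctx._source.age = months + ' month' + (months > 1 ? 's' : '');
"""

resp = es.update_by_query(
    index="your-index-name",
    body={
        "query": {"match_all": {}},
        "script": {"source": script, "lang": "painless"},
    },
    conflicts="proceed",        # don't abort on version conflicts
    wait_for_completion=False,  # run as a background task for large indices
)
print(resp)  # contains a task id when wait_for_completion=False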
🔑 Try running this update script on a small subset of your documents before letting it loose on your whole index, by adding a query other than the match_all I put there.
💡 It's worth mentioning that unless you search on this age field, it doesn't need to be stored in your index because it can be calculated at query time.
You see, if your index mapping's date fields are properly defined like so:
{
  "mappings": {
    "properties": {
      "creationtime": {
        "type": "date",
        "format": "yyyy-MM-dd'T'HH:mm:ss"
      },
      "currentdate": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      },
      ...
    }
  }
}
the age can be calculated as a script field:
POST ttimes/_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "age_calculated": {
      "script": {
        "source": """
          def months = ChronoUnit.MONTHS.between(
            doc['creationtime'].value,
            doc['currentdate'].value );
          return months + ' month' + (months > 1 ? 's' : '');
        """
      }
    }
  }
}
The only caveat is that the value won't be inside of _source but rather inside its own group called fields (which also implies that multiple script fields are possible at once!).
"hits" : [
{
...
"_id" : "FFfPuncBly0XYOUcdIs5",
"fields" : {
"age_calculated" : [ "32 months" ] <--
}
},
...
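If you go the script-field route, reading the calculated value from Python is straightforward; a minimal sketch, assuming a 7.x-style elasticsearch-py client and the ttimes example index used above:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

body = {
    "query": {"match_all": {}},
    "script_fields": {
        "age_calculated": {
            "script": {
                "source": """
                    def months = ChronoUnit.MONTHS.between(
                        doc['creationtime'].value, doc['currentdate'].value);
                    return months + ' month' + (months > 1 ? 's' : '');
                """
            }
        }
    }
}

res = es.search(index="ttimes", body=body)
for hit in res["hits"]["hits"]:
    # script fields come back under "fields", not "_source"
    print(hit["_id"], hit["fields"]["age_calculated"][0])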

Related

JSON jq/python file manipulation with specific key name aggregation

I need to modify the structure of this json file:
[
  {
    "id": "3333",
    "properties": {
      "label": "Computer",
      "name": "My-Laptop"
    }
  },
  {
    "id": "9998",
    "type": "file_system",
    "properties": {
      "mount_point": "/opt",
      "name": "/dev/mapper/rhel-opt",
      "root_container": "3333"
    },
    "label": "FileSystem"
  },
  {
    "id": "9999",
    "type": "file_system",
    "properties": {
      "mount_point": "/var",
      "name": "/dev/mapper/rhel-var",
      "root_container": "3333"
    },
    "label": "FileSystem"
  }
]
in order to have this kind of output:
[
  {
    "id": "3333",
    "properties": {
      "label": "Computer",
      "name": "My-Laptop",
      "file_system": [
        "/opt",
        "/var"
      ]
    }
  }
]
The idea is to have, in the new JSON structure, the visibility of my laptop with the two file-system partitions in an array named "file_system".
As you can see, the two partitions are related to the first entry through the id and root_container keys.
Now imagine having not just one laptop but thousands of laptops, each with a different id, and each with different partitions related to its laptop by the root_container key.
Is there a way to do this with jq functions or a Python script?
Many thanks
You could employ reduce to iterate over the items while extracting their id, mount_point and root_container. Then, if a root_container is present, delete that entry and add its mount_point to the entry whose id matches that root_container. For convenience, I also employed INDEX on the items' id fields to simplify their access as .[$id] and .[$root_container], which has to be undone at the end using map(.).
jq '
  reduce .[] as {$id, properties: {$mount_point, $root_container}} (
    INDEX(.id);
    if $root_container then
      del(.[$id])
      | .[$root_container].properties.file_system += [$mount_point]
    else . end
  )
  | map(.)
'
[
  {
    "id": "3333",
    "properties": {
      "label": "Computer",
      "name": "My-Laptop",
      "file_system": [
        "/opt",
        "/var"
      ]
    }
  }
]
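Since the question also asks about a Python script, here is a rough Python equivalent of the same grouping logic (the function name group_file_systems and the input.json filename are placeholders for illustration):

import json

def group_file_systems(items):
    """Fold file_system entries into their root_container's properties."""
    by_id = {item["id"]: item for item in items}
    for item in items:
        props = item.get("properties", {})
        root = props.get("root_container")
        if root is not None:
            # remove the partition entry and attach its mount point to the parent
            del by_id[item["id"]]
            parent_props = by_id[root]["properties"]
            parent_props.setdefault("file_system", []).append(props["mount_point"])
    return list(by_id.values())

with open("input.json") as f:
    data = json.load(f)

print(json.dumps(group_file_systems(data), indent=2))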

Pymongo include only the fields which are starting with a name

For example, if this is my record
{
  "_id": "123",
  "name": "google",
  "ip_1": "10.0.0.1",
  "ip_2": "10.0.0.2",
  "ip_3": "10.0.1",
  "ip_4": "10.0.1",
  "description": ""
}
I want to get only those fields starting with 'ip_'. Assume I have 500 fields and only 15 of them start with 'ip_'.
Can we do something like this to get the output -
db.collection.find({id:"123"}, {'ip*':1})
Output -
{
  "ip_1": "10.0.0.1",
  "ip_2": "10.0.0.2",
  "ip_3": "10.0.1",
  "ip_4": "10.0.1"
}
The following aggregation query, using PyMongo, returns the documents with only the fields whose names start with "ip_".
Note the various aggregation operators used: $filter, $regexMatch, $objectToArray, $arrayToObject. The aggregation pipeline has two stages, $project and $replaceWith.
import pprint

pipeline = [
    {
        "$project": {
            "ipFields": {
                "$filter": {
                    "input": { "$objectToArray": "$$ROOT" },
                    "cond": { "$regexMatch": { "input": "$$this.k", "regex": "^ip" } }
                }
            }
        }
    },
    {
        "$replaceWith": { "$arrayToObject": "$ipFields" }
    }
]
pprint.pprint(list(collection.aggregate(pipeline)))
I am unaware of a way to specify an expression that would decide which hash keys would be projected. MongoDB has projection operators but they deal with arrays and text search.
If you have a fixed possible set of ip fields, you can simply request all of them regardless of which fields are present in a particular document, e.g. project with
{ip_1: true, ip_2: true, ...}
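A minimal PyMongo sketch of that fixed-projection approach (the connection string, database and collection names are placeholders):

import pprint
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["testdb"]["mycoll"]

# request every possible ip_* field explicitly; absent fields are simply omitted
projection = {f"ip_{i}": True for i in range(1, 16)}  # ip_1 .. ip_15
projection["_id"] = False

pprint.pprint(collection.find_one({"_id": "123"}, projection))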

An efficient way to unpack nested JSON into a dataframe

I have a nested JSON, and I want to transform it into a pandas DataFrame. I was able to normalize it with json_normalize.
However, there are still JSON layers within the DataFrame, which I also want to unpack. What is the best way to do this? I will likely have to deal with this a few more times within the project I am currently working on.
The JSON I have is the following:
{
  "data": {
    "allOpportunityApplication": {
      "data": [
        {
          "id": "111111111",
          "opportunity": {
            "programme": {
              "short_name": "XX"
            }
          },
          "person": {
            "home_lc": {
              "name": "NAME"
            }
          },
          "standards": [
            {
              "constant_name": "constant1",
              "standard_option": {
                "option": "true"
              }
            },
            {
              "constant_name": "constant2",
              "standard_option": {
                "option": "true"
              }
            }
          ]
        }
      ]
    }
  }
}
I used json_normalize:
standards_df = json_normalize(
    standard_json['allOpportunityApplication']['data'],
    record_path=['standards'],
    meta=['id', 'person', 'opportunity']
)
With that I get a DataFrame with the columns constant_name, standard_option, id, person and opportunity. The problem is that standard_option, person and opportunity are still JSON, each with a single value inside.
The current output and expected output for each column are as follows.
Standard_option
Currently an item in the column "standard_option" looks like:
{'option': 'true'}
I want it to be just true
Person
Currently an item in the column "person" looks like:
{'home_lc': {'name': 'NAME'}}
I want it to look like: NAME
Opportunity
Currently an item in the column "opportunity" looks like:
{'programme': {'short_name': 'XX'}}
I want it to look like: XX
Might not be the best way, but I think it works.
standards_df['person'] = (standards_df.loc[:, 'person']
                          .apply(lambda x: x['home_lc']['name']))
standards_df['opportunity'] = (standards_df.loc[:, 'opportunity']
                               .apply(lambda x: x['programme']['short_name']))
  constant_name standard_option.option         id person opportunity
0     constant1                   true  111111111   NAME          XX
1     constant2                   true  111111111   NAME          XX
standard_option was already flattened to standard_option.option when I ran your code, so it needed no extra processing.
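As an alternative to the extra apply calls, json_normalize itself accepts nested meta paths, so the inner values can be pulled out directly. A sketch, assuming standard_json holds the allOpportunityApplication object as in the question and a pandas version where json_normalize is available as pd.json_normalize:

import pandas as pd

standards_df = pd.json_normalize(
    standard_json["allOpportunityApplication"]["data"],
    record_path=["standards"],
    meta=[
        "id",
        ["person", "home_lc", "name"],
        ["opportunity", "programme", "short_name"],
    ],
)
# columns come out as person.home_lc.name etc.; rename them to the short names
standards_df = standards_df.rename(columns={
    "standard_option.option": "standard_option",
    "person.home_lc.name": "person",
    "opportunity.programme.short_name": "opportunity",
})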

How to generate a word cloud using elasticsearch?

I have an elasticsearch DB with data of the form
record = {  # all but age are strings
    'diagnosis': self.diagnosis,
    'vignette': self.vignette,
    'symptoms': self.symptoms_list,
    'care': self.care_level_string,
    'age': self.age,  # float
    'gender': self.gender
}
I want to create a word cloud of the data in vignette.
I tried all sorts of queries and I get error 400, meaning I don't understand how to query the database.
I am using Python.
This is the only successful query I was able to come up with:
def search_phrase_in_vignettes(self, phrase):
    body = {
        "_source": ["vignette"],
        "query": {
            "match_phrase": {
                "vignette": {
                    "query": phrase,
                }
            }
        }
    }
    res = self.es.search(index=self.index_name, doc_type=self.doc_type, body=body)
which finds any record with phrase contained in the 'vignette' field.
I am thinking some aggregation should do the trick, but I can't seem to be able to write a correct query with aggs.
I would love some help on how to correctly write even the simplest aggregation query in Python.
Use a terms aggregation to get the word counts. Your query will be:
{
  "query": {
    "match_phrase": {
      "vignette": {
        "query": phrase
      }
    }
  },
  "aggs": {
    "cloud": {
      "terms": { "field": "vignette" }
    }
  }
}
When you receive the results, take the buckets from the aggregations key:
res = self.es.search(index=self.index_name, doc_type=self.doc_type, body=body)
for bucket in res['aggregations']['cloud']['buckets']:
    # bucket['key'] holds the term and bucket['doc_count'] its count; build the cloud from these
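One caveat not covered above: a terms aggregation on an analyzed text field such as vignette only works if fielddata is enabled on it (or if you aggregate on a vignette.keyword sub-field, which buckets whole phrases rather than single words). With that assumption, a rough end-to-end sketch using the third-party wordcloud package (index name and host are placeholders):

from elasticsearch import Elasticsearch
from wordcloud import WordCloud

es = Elasticsearch("http://localhost:9200")

body = {
    "size": 0,  # we only need the aggregation, not the hits
    "aggs": {
        "cloud": {
            "terms": {"field": "vignette", "size": 200}
        }
    },
}
res = es.search(index="your-index-name", body=body)

# map each term to its document count, then render the cloud
frequencies = {
    bucket["key"]: bucket["doc_count"]
    for bucket in res["aggregations"]["cloud"]["buckets"]
}
WordCloud(width=800, height=400).generate_from_frequencies(frequencies).to_file("cloud.png")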

Why are my index fields still shown as "analyzed" even after I index them as "not_analyzed"?

I have a lot of data (JSON format) in Amazon SQS. I basically have a simple Python script which pulls data from the SQS queue and then indexes it in ES. My problem is that even though I have specified "not_analyzed" for strings in my script, I still see my index fields shown as "analyzed" in the index settings of the Kibana 4 dashboard.
Here is my Python code:
doc = {
    "settings": {
        "number_of_shards": 1
    },
    "mappings": {
        "type_name": {
            "dynamic_templates": [
                {
                    "strings": {
                        "match_mapping_type": "string",
                        "mapping": {
                            "type": "string",
                            "index": "not_analyzed"
                        }
                    }
                }
            ]
        }
    }
}
import json
import requests
from elasticsearch import Elasticsearch

es = Elasticsearch()
h = {"Content-type": "application/json"}
res = requests.request("POST", "http://localhost:9200/" + index_name + "/", headers=h, data=json.dumps(doc))
post = es.index(index=index_name, doc_type='server', id=1, body=json.dumps(new_list))
print "------------------------------"
print "Data Pushed Successfully to ES"
I am not sure what's wrong here?
The doc_type you're using when indexing (= server) doesn't match the one you have in your index mappings (= type_name).
So if you index your documents like this instead, it will work:
post = es.index(index=index_name, doc_type='type_name', id=1, body=json.dumps(new_list))
                                  ^^^^^^^^^^^^^^^^^^^^^
                                  change this
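As a side note, the index and its mappings can also be created through the client instead of a raw requests call; a minimal sketch reusing the doc, index_name and new_list variables from the question (assuming a pre-7.x elasticsearch-py client where mapping types like type_name still exist):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# create the index with the settings/mappings only if it doesn't exist yet
if not es.indices.exists(index=index_name):
    es.indices.create(index=index_name, body=doc)

# index the document under the same type the mapping defines
post = es.index(index=index_name, doc_type="type_name", id=1, body=new_list)
print("Data Pushed Successfully to ES")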
