Bulk Index data in Elasticsearch with sequential IDs - python

I am using this code to bulk index all data in Elasticsearch using python:
from elasticsearch import Elasticsearch, helpers
import json
import os
import sys

es = Elasticsearch()

def load_json(directory):
    for filename in os.listdir(directory):
        if filename.endswith('.json'):
            # join the directory so files are found regardless of the cwd
            with open(os.path.join(directory, filename), 'r') as open_file:
                yield json.load(open_file)

helpers.bulk(es, load_json(sys.argv[1]), index='v1_resume', doc_type='candidate')
I know that if no ID is specified, Elasticsearch generates a 20-character ID by itself, but I want the documents indexed starting from ID = 1 up to the number of documents.
How can I achieve this?

In Elasticsearch, if you don't pick an ID for your document, one is created automatically for you; see the elastic docs:
Autogenerated IDs are 20 character long, URL-safe, Base64-encoded GUID
strings. These GUIDs are generated from a modified FlakeID scheme which
allows multiple nodes to be generating unique IDs in parallel with
essentially zero chance of collision.
If you want custom IDs you need to build them yourself, using a structure like this:
[
    {'_id': 1,
     '_index': 'index-name',
     '_type': 'document',
     '_source': {
         "title": "Hello World!",
         "body": "..."}
    },
    {'_id': 2,
     '_index': 'index-name',
     '_type': 'document',
     '_source': {
         "title": "Hello World!",
         "body": "..."}
    }
]
helpers.bulk(es, load_json(sys.argv[1]))
Since you are declaring the type and index inside each action, you don't have to pass them to the helpers.bulk() method. You need to change the output of load_json to yield dicts like the ones above to be saved in ES (see the python elastic client docs), as sketched below.
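For example, a minimal sketch of such a load_json, numbering documents from 1 with enumerate; the index and doc type names are taken from the question:

from elasticsearch import Elasticsearch, helpers
import json
import os
import sys

es = Elasticsearch()

def load_json(directory):
    """Yield one bulk action per JSON file, with sequential IDs."""
    json_files = (f for f in os.listdir(directory) if f.endswith('.json'))
    # enumerate(..., start=1) numbers the documents 1, 2, 3, ...
    for doc_id, filename in enumerate(json_files, start=1):
        with open(os.path.join(directory, filename), 'r') as open_file:
            yield {
                '_id': doc_id,
                '_index': 'v1_resume',
                '_type': 'candidate',
                '_source': json.load(open_file),
            }

helpers.bulk(es, load_json(sys.argv[1]))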

Related

Find all unique values for field in Elasticsearch through python

I've been scouring the web for some good python documentation for Elasticsearch. I've got a query term that I know returns the information I need, but I'm struggling to convert the raw string into something Python can interpret.
This query returns a list of all unique 'VALUE's in the dataset:
{"find": "terms", "field": "hierarchy1.hierarchy2.VALUE"}
I took it from a dashboarding tool that accesses this data, but I don't seem to be able to convert it into correct Python.
I've tried this:
body_test = {"find": "terms", "field": "hierarchy1.hierarchy2.VALUE"}
es = Elasticsearch(SETUP CONNECTION)
es.search(
    index="INDEX_NAME",
    body=body_test
)
but it doesn't like the find value. I can't find anything in the documentation about find.
RequestError: RequestError(400, 'parsing_exception', 'Unknown key for
a VALUE_STRING in [find].')
The only way I've got it to slightly work is with
es_search = (
Search(
using=es,
index=db_index
).source(['hierarchy1.hierarchy2.VALUE'])
)
But I think this is pulling the entire dataset and then filtering (which I obviously don't want to be doing each time I run this code). This needs to be done through python and so I cannot simply POST the query I know works.
I am completely new to ES and so this is all a little confusing. Thanks in advance!
So it turns out that the find in this case was specific to Grafana (the dashboarding tool I took the query from).
In the end I used this site and took the code from there. It's a LOT more complicated than I thought it was going to be, but it works very quickly and doesn't put a strain on the database (which my alternative method was doing).
In case the link dies in future years, here's the code I used:
from elasticsearch import Elasticsearch

es = Elasticsearch()

def iterate_distinct_field(es, fieldname, pagesize=250, **kwargs):
    """
    Helper to get all distinct values from ElasticSearch
    (ordered by number of occurrences)
    """
    compositeQuery = {
        "size": pagesize,
        "sources": [{
            fieldname: {
                "terms": {
                    "field": fieldname
                }
            }
        }]
    }
    # Iterate over pages
    while True:
        result = es.search(**kwargs, body={
            "aggs": {
                "values": {
                    "composite": compositeQuery
                }
            }
        })
        # Yield each bucket
        for aggregation in result["aggregations"]["values"]["buckets"]:
            yield aggregation
        # Set "after" field
        if "after_key" in result["aggregations"]["values"]:
            compositeQuery["after"] = \
                result["aggregations"]["values"]["after_key"]
        else:
            # Finished!
            break

# Usage example
for result in iterate_distinct_field(es, fieldname="pattern.keyword", index="strings"):
    print(result)  # e.g. {'key': {'pattern': 'mypattern'}, 'doc_count': 315}

Does python provide a hook to customize json stringification based on key name?

I'm trying to write an efficient stringification routine for logging dicts, but I want to redact certain values based on their key names. I see that JSONDecoder provides object_pairs_hook, which supplies both key and value, but I don't see a corresponding hook for JSONEncoder, just default, which only supplies the value. In my case the values are plain strings, so I can't base the processing on the value alone. Is there something I missed?
For example, if I have a dict with:
{
    "username": "Todd",
    "role": "Editor",
    "privateKey": "1234ad1234e434134"
}
I would want to log:
'{"username":"Todd","role":"Editor","privateKey":"**redacted**"}'
Any good tools in python to do this? Or should I just recursively iterate the (possibly nested) dict directly?
You can "reload" it using the object hook then dump it again.
def redact(o):
if 'privateKey' in o:
o['privateKey'] = '***redacted***'
return o
>>> d
{'username': 'Todd', 'role': 'Editor', 'privateKey': '1234ad1234e434134', 'foo': ['bar', {'privateKey': 'baz'}]}
>>> json.dumps(json.loads(json.dumps(d), object_hook=redact))
'{"username": "Todd", "role": "Editor", "privateKey": "***redacted***", "foo": ["bar", {"privateKey": "***redacted***"}]}'
The json library has two functions for this: dumps() and loads(). To convert a Python object to a JSON string use dumps(); to parse a JSON string back into a Python object use loads().
import json

your_dict = {
    "username": "Todd",
    "role": "Editor",
    "privateKey": "1234ad1234e434134"
}
string_of_your_json = json.dumps(your_dict)

Updating a collection based on the value extracted from another collection

I'm revising my previous question. I have a collection named FileCollection with the following document:
{
    "_id": {
        "$oid": "5e791a53185fbb070378660a"
    },
    "selectedfiles": [{
        "inputfile": "https://localhost/_HAC-154_1584994899979.jpg",
        "Selectedby": "Joe"
    }]
}
I need to read the value of selectedfiles.inputfile as a string variable. I'm trying to do this in Python using this code:
from pymongo import MongoClient

mydb = MongoClient(mongodbConnection)
myCollection = mydb.FileCollection
myValue = myCollection.selectedfile[0].inputfile.value
print(myValue)
client.close
The output is JSON without the actual value of inputfile. Please help.
Thanks
Isn't it just because you're missing an s?
You had:
myValue=myCollection.selectedfile[0].inputfile.value
instead of:
myValue=myCollection.selectedfiles[0].inputfile.value
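Note that attribute access on PyMongo objects only navigates databases and collections; it doesn't fetch a document by itself. A minimal sketch of reading the field with find_one (the connection string and database name are assumptions, not from the original post):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")    # assumed connection string
collection = client["mydatabase"]["FileCollection"]  # assumed database name

# Project only the field we need from the first matching document
doc = collection.find_one({}, {"selectedfiles.inputfile": 1})
myValue = doc["selectedfiles"][0]["inputfile"]
print(myValue)  # https://localhost/_HAC-154_1584994899979.jpg

client.close()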

Indexing "large" (>40Mb) documents in Elasticsearch

I am trying to add a 43 MB document to an index in Elasticsearch. I use the bulk API in python. Here is a snippet of my code:
from elasticsearch import Elasticsearch, helpers

document = <read a 43Mb json file, with two fields>
action = [
    {
        "_index": "test_index",
        "_type": "test_type",
        "_id": 1
    }
]
action[0]["_source"] = document
es = Elasticsearch(hosts=<HOST>:9200, timeout=30)
helpers.bulk(es, action)
This code always times out. I have also tried with different timeout values. Am I missing something here?
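For what it's worth, the bulk helper forwards extra keyword arguments to the underlying client call, so the per-request timeout can be raised beyond the client default; a minimal sketch (the host and timeout values here are assumptions):

from elasticsearch import Elasticsearch, helpers

# host and timeout values are assumptions, not from the original post
es = Elasticsearch(hosts="localhost:9200", timeout=120)

# extra kwargs such as request_timeout are passed through to the bulk call
helpers.bulk(es, action, request_timeout=120)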

JSON parsing using python

I'm attempting to understand the basics of JSON and thought using some Google translate examples would be interesting. I'm not actually making requests via the API but they have the following example I have saved as "file.json":
{
    "data": {
        "detections": [
            [
                {
                    "language": "en",
                    "isReliable": false,
                    "confidence": 0.18397073
                }
            ]
        ]
    }
}
I'm reading in the file and using simplejson:
json_data = open('file.json').read()
json = simplejson.loads(json_data)
>>> json
{'data': {'detections': [[{'isReliable': False, 'confidence': 0.18397073, 'language': 'en'}]]}}
I've tried multiple ways to print the value of 'language' with no success. For example, this fails:
print json['detections']['language']
Any pointers would be appreciated!
You need json['data']['detections'][0][0]['language']. As your example data shows, 'language' is a key of a dict that is inside a list, which is inside another list, which is the value of 'detections' inside the 'data' dict.
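Spelled out step by step, a short sketch (the variable name parsed is used here instead of json to avoid shadowing the module, unlike the question's snippet):

import simplejson

parsed = simplejson.loads(open('file.json').read())

detections = parsed['data']['detections']  # outer list
first_group = detections[0]                # inner list
first_detection = first_group[0]           # the detection dict
print(first_detection['language'])         # -> 'en'

# or in one step:
print(parsed['data']['detections'][0][0]['language'])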
