Loading irregular json into Elasticsearch index with mapping using Python client - python

I have some .json files where not all fields are present in all records; e.g. caseclass.json looks like:
[{
    "name": "john smith",
    "age": 12,
    "cars": ["ford", "toyota"],
    "comment": "i am happy"
},
{
    "name": "a. n. other",
    "cars": "",
    "comment": "i am panicking"
}]
Using Elasticsearch 7.6.1 via the Python client elasticsearch:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
import json
import os
from elasticsearch_dsl import Document, Text, Date, Integer, analyzer

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

class Person(Document):
    class Index:
        using = es
        name = 'person_index'

    name = Text()
    age = Integer()
    cars = Text()
    comment = Text(analyzer='snowball')

Person.init()

with open("caseclass.json") as json_file:
    data = json.load(json_file)
    for indexid in range(len(data)):
        document = Person(name=data[indexid]['name'], age=data[indexid]['age'], cars=data[indexid]['cars'], comment=data[indexid]['comment'])
        document.meta.id = indexid
        document.save()
Naturally I get KeyError: 'age' when the second record is read. My question is: is it possible to load such records into an Elasticsearch index using the Python client and a pre-defined mapping, instead of dynamic mapping? The above code works if all fields are present in all records, but is there a way to do this without checking the presence of each field per record? The actual records have a complex structure and there are millions of them. Thanks

The error has nothing to do with your mapping -- it's just telling you that age could not be accessed in one of your records.
The index mapping is created when you call Person.init() -- you can verify that by calling print(es.indices.get_mapping(Person.Index.name)) right after Person.init().
I've cleaned up your code a bit:
import json
import os

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Document, Text, Date, Integer, analyzer

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

class Person(Document):
    class Index:
        using = es
        name = 'person_index'

    name = Text()
    age = Integer()
    cars = Text()
    comment = Text(analyzer='snowball')

Person.init()
print(es.indices.get_mapping(Person.Index.name))

with open("caseclass.json") as json_file:
    data = json.load(json_file)
    for indexid, case in enumerate(data):
        document = Person(**case)
        document.meta.id = indexid
        document.save()
Notice how I used **case to spread all key-value pairs of a case instead of accessing data[indexid][property_key] explicitly. Keys that are missing from a record are simply never set on the document, so no KeyError is raised.
The generated mapping is as follows:
{
  "person_index" : {
    "mappings" : {
      "properties" : {
        "age" : {
          "type" : "integer"
        },
        "cars" : {
          "type" : "text"
        },
        "comment" : {
          "type" : "text",
          "analyzer" : "snowball"
        },
        "name" : {
          "type" : "text"
        }
      }
    }
  }
}
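Since the question mentions millions of records, here is a minimal, untested sketch of how the same documents could be indexed with the bulk helper instead of one save() call per document; the index name and file path are taken from the question, and the chunk size is an arbitrary assumption:
from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk
import json

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

def generate_actions(path, index_name):
    # Yield one bulk action per record; missing fields are simply absent
    # from the source document, exactly as with Person(**case).
    with open(path) as json_file:
        for i, case in enumerate(json.load(json_file)):
            yield {"_index": index_name, "_id": i, "_source": case}

# Assumes Person.init() has already created the index and its mapping.
bulk(es, generate_actions("caseclass.json", "person_index"), chunk_size=1000)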

Related

boto3, python: appending value to DynamoDB String Set

I have an object in DynamoDB:
{ 'UserID' : 'Hank', ConnectionList : {'con1', 'con2'} }
By using boto3 in lambda functions, I would like to add 'con3' to the String Set.
So far, I have been trying with the following code without success:
ddbClient = boto3.resource('dynamodb')
table = ddbClient.Table("UserInfo")

table.update_item(
    Key={
        "UserId": 'Hank'
    },
    UpdateExpression="SET ConnectionList = list_append(ConnectionList, :i)",
    ExpressionAttributeValues={
        ":i": { "S": "Something" }
    },
    ReturnValues="ALL_NEW"
)
However, no matter how I try to put the information inside the String Set, it always results in an error.
Since you're using the resource API, you have to use the Python set data type in your statement, and ADD rather than list_append (ConnectionList is a string set, not a list):
table.update_item(
    Key={
        "UserId": 'Hank'
    },
    UpdateExpression="ADD ConnectionList :i",
    ExpressionAttributeValues={
        ":i": {"Something"},  # needs to be a set type
    },
    ReturnValues="ALL_NEW"
)
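For comparison, a minimal sketch of the same update through the low-level client, where the typed attribute-value syntax from the question does apply ("SS" marks a string set; the table and attribute names are the ones used above):
import boto3

client = boto3.client('dynamodb')

# The low-level client requires explicitly typed values; "SS" is a string set.
client.update_item(
    TableName="UserInfo",
    Key={"UserId": {"S": "Hank"}},
    UpdateExpression="ADD ConnectionList :i",
    ExpressionAttributeValues={":i": {"SS": ["con3"]}},
    ReturnValues="ALL_NEW",
)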

How to append a new array of values to an existing array document in mongodb using pymongo?

An existing collection looks like this:
"_id" : "12345",
"vals" : {
    "dynamickey1" : {}
}
I need to add
"vals" : {
    "dynamickey2" : {}
}
I have tried the following in Python 2.7 with pymongo 2.8:
col.update({'_id': id}, {'$push': {'vals': {"dynamickey2": {"values"}}}})
Error log:
pymongo.errors.OperationFailure: The field 'vals' must be an array but is of type object in document
Expected Output:
"_id" : "12345",
"vals" : {
    "dynamickey1" : {},
    "dynamickey2" : {}
}
Edited following question edit:
Two options: use $set with dot notation, or use Python dict manipulation.
The first method is more MongoDB-native and is one line of code; the second is a bit more work but gives more flexibility if your use case is more nuanced.
Method 1:
from pymongo import MongoClient
from bson.json_util import dumps

db = MongoClient()['mydatabase']
db.mycollection.insert_one({
    "_id": "12345",
    "vals": {
        "dynamickey1": {},
    }
})

db.mycollection.update_one({'_id': '12345'}, {'$set': {'vals.dynamickey2': {}}})
print(dumps(db.mycollection.find_one({}), indent=4))
Method 2:
from pymongo import MongoClient
from bson.json_util import dumps

db = MongoClient()['mydatabase']
db.mycollection.insert_one({
    "_id": "12345",
    "vals": {
        "dynamickey1": {},
    }
})

record = db.mycollection.find_one({'_id': '12345'})
vals = record['vals']
vals['dynamickey2'] = {}
record = db.mycollection.update_one({'_id': record['_id']}, {'$set': {'vals': vals}})
print(dumps(db.mycollection.find_one({}), indent=4))
Either way gives:
{
    "_id": "12345",
    "vals": {
        "dynamickey1": {},
        "dynamickey2": {}
    }
}
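Because the key names in the question are dynamic, here is a small sketch of Method 1 with the field path built at runtime; new_key is just an illustrative placeholder:
new_key = 'dynamickey3'  # hypothetical key name decided at runtime

# Dot notation can target a nested field whose name is only known at runtime.
db.mycollection.update_one({'_id': '12345'}, {'$set': {f'vals.{new_key}': {}}})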
Previous answer
Your expected output has an object with duplicate fields (vals); this isn't allowed.
So whatever you are trying to do, it isn't going to work.

How to write JSON data to Dynamodb by ignoring empty elements in boto3

I would like to write the following group of data to DynamoDB.
There are about 100 records. Since images are not always required, some records have the image_url element and some do not.
(questionsList.json)
{
    "q_id" : "001",
    "q_body" : "Where is the capital of the United States?",
    "q_answer" : "Washington, D.C.",
    "image_url" : "/Washington.jpg",
    "keywords" : [
        "UnitedStates",
        "Washington"
    ]
},
{
    "q_id" : "002",
    "q_body" : "Where is the capital city of the UK?",
    "q_answer" : "London",
    "image_url" : "",
    "keywords" : [
        "UK",
        "London"
    ]
},
Since I am still in the write-testing phase, the DynamoDB instance to write to is running at localhost:8000 via the serverless-dynamodb-local plugin of the Serverless Framework, not in the production environment.
To write the above JSON data to this DynamoDB instance, I wrote the following code with boto3 (the AWS SDK for Python).
from __future__ import print_function
import boto3
import codecs
import json

dynamodb = boto3.resource('dynamodb', region_name='us-east-1', endpoint_url="http://localhost:8000")
table = dynamodb.Table('questionListTable')

with open("questionList.json", "r", encoding='utf-8') as json_file:
    items = json.load(json_file)
    for item in items:
        q_id = item['q_id']
        q_body = item['q_body']
        q_answer = item['q_answer']
        image_url = item['image_url']
        keywords = item['keywords']
        print("Adding detail:", q_id, q_body)
        table.put_item(
            Item={
                'q_id': q_id,
                'q_body': q_body,
                'q_answer': q_answer,
                'image_url': image_url,
                'keywords': keywords,
            }
        )
When this code is executed, the following error occurs for the records where the value is an empty string.
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the PutItem operation: One or more parameter values were invalid: An AttributeValue may not contain an empty string
Apparently it is caused by the empty string in the JSON.
If I exclude the image_url containing the empty string from the write, as below, the write completes without any problem.
from __future__ import print_function
import boto3
import codecs
import json

dynamodb = boto3.resource('dynamodb', region_name='us-east-1', endpoint_url="http://localhost:8000")
table = dynamodb.Table('questionListTable')

with open("questionList.json", "r", encoding='utf-8') as json_file:
    items = json.load(json_file)
    for item in items:
        q_id = item['q_id']
        q_body = item['q_body']
        q_answer = item['q_answer']
        #image_url = item['image_url']
        keywords = item['keywords']
        print("Adding detail:", q_id, q_body)
        table.put_item(
            Item={
                'q_id': q_id,
                'q_body': q_body,
                'q_answer': q_answer,
                #'image_url': image_url,
                'keywords': keywords,
            }
        )
Since DynamoDB is NoSQL, there may be other approaches that make better use of its characteristics, but how should I correct the code to write the above data while ignoring empty strings? In other words: if image_url has a value, write it; if it does not, skip it.
Thank you.
I solved my problem. You can set the value to null (None in Python) as follows.
from __future__ import print_function
import boto3
import codecs
import json

dynamodb = boto3.resource('dynamodb', region_name='ap-northeast-1', endpoint_url="http://localhost:8000")
table = dynamodb.Table('questionListTable')

with open("questionList.json", "r", encoding='utf-8_sig') as json_file:
    items = json.load(json_file)
    for item in items:
        q_id = item['q_id']
        q_body = item['q_body']
        q_answer = item['q_answer']
        image_url = item['image_url'] if item['image_url'] else None
        keywords = item['keywords'] if item['keywords'] else None
        print("Adding detail:", q_id, q_body)
        table.put_item(
            Item={
                'q_id': q_id,
                'q_body': q_body,
                'q_answer': q_answer,
                'image_url': image_url,
                'keywords': keywords,
            }
        )
To check the state of DynamoDB, I use the offline plugin of the Serverless Framework to run API Gateway in the local environment. When I actually called the API using Postman, Null had been properly inserted as the value.
{
    "q_id" : "001",
    "q_body" : "Where is the capital of the United States?",
    "q_answer" : "Washington, D.C.",
    "image_url" : "/Washington.jpg",
    "keywords" : [
        "UnitedStates",
        "Washington"
    ]
},
{
    "q_id" : "002",
    "q_body" : "Where is the capital city of the UK?",
    "q_answer" : "London",
    "image_url" : "null",
    "keywords" : [
        "UK",
        "London"
    ]
},
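As an alternative to storing Null, here is a minimal sketch of the "write it only if it has a value" behaviour the question asks for, using the same table and file as above: the Item dict is built first and empty values are dropped before calling put_item.
import json

import boto3

dynamodb = boto3.resource('dynamodb', region_name='us-east-1', endpoint_url="http://localhost:8000")
table = dynamodb.Table('questionListTable')

with open("questionList.json", "r", encoding='utf-8') as json_file:
    for item in json.load(json_file):
        # Keep only attributes that actually have a value; empty strings,
        # empty lists, and None are dropped before the write.
        cleaned = {key: value for key, value in item.items() if value not in ("", None, [])}
        print("Adding detail:", cleaned['q_id'], cleaned['q_body'])
        table.put_item(Item=cleaned)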

python elasticsearch bulk index datatype

I am using the following code to create an index and load data into Elasticsearch:
from elasticsearch import helpers, Elasticsearch
import csv

es = Elasticsearch()
es = Elasticsearch('localhost:9200')
index_name = 'wordcloud_data'

with open('./csv-data/' + index_name + '.csv') as f:
    reader = csv.DictReader(f)
    helpers.bulk(es, reader, index=index_name, doc_type='my-type')

print("done")
My CSV data is as follows
date,word_data,word_count
2017-06-17,luxury vehicle,11
2017-06-17,signifies acceptance,17
2017-06-17,agency imposed,16
2017-06-17,customer appreciation,11
The data loads fine, but the inferred datatypes are not accurate.
How do I force word_count to be mapped as an integer and not text?
See how it figures out the date type?
Is there a way it can figure out the int datatype automatically, or by passing some parameter?
Also, what do I do to increase ignore_above, or remove it entirely for some of the fields if I wanted to, so there is basically no limit on the number of characters?
{
  "wordcloud_data" : {
    "mappings" : {
      "my-type" : {
        "properties" : {
          "date" : {
            "type" : "date"
          },
          "word_count" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "word_data" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}
You need to create a mapping that describes the field types.
With the elasticsearch-py client this can be done using the es.indices.put_mapping or indices.create methods, by passing a JSON document that describes the mappings, as shown in this SO answer. It would be something like this:
es.indices.put_mapping(
    index="wordcloud_data",
    doc_type="my-type",
    body={
        "properties": {
            "date": {"type": "date"},
            "word_data": {"type": "text"},
            "word_count": {"type": "integer"}
        }
    }
)
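The ignore_above part of the question isn't covered by the mapping above; that setting lives on the keyword sub-field of a text field, so a hedged sketch of raising it for word_data (the 8191 value is an arbitrary choice; index and type names are the same as above) could look like this:
es.indices.put_mapping(
    index="wordcloud_data",
    doc_type="my-type",
    body={
        "properties": {
            "word_data": {
                "type": "text",
                "fields": {
                    # Raise ignore_above (the dynamic-mapping default is 256),
                    # or omit it entirely so keyword values of any length are indexed.
                    "keyword": {"type": "keyword", "ignore_above": 8191}
                }
            }
        }
    }
)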
However, I'd suggest taking a look at the elasticsearch-dsl package, which provides a much nicer declarative API to describe things. It would be something along these lines (untested):
import csv

from elasticsearch_dsl import DocType, Date, Integer, Text
from elasticsearch_dsl.connections import connections
from elasticsearch.helpers import bulk

connections.create_connection(hosts=["localhost"])

class WordCloud(DocType):
    word_data = Text()
    word_count = Integer()
    date = Date()

    class Index:
        name = "wordcloud_data"
        doc_type = "my_type"  # If you need it to be called so

WordCloud.init()

with open("./csv-data/%s.csv" % index_name) as f:
    reader = csv.DictReader(f)
    bulk(
        connections.get_connection(),
        (WordCloud(**row).to_dict(True) for row in reader)
    )
Please note, I haven't tried the code I've posted - I just wrote it and don't have an ES server at hand to test against. There could be some small mistakes or typos (please point them out if there are), but the general idea should be correct.
Thanks, @drdaeman's solution worked for me. However, it's worth mentioning that in elasticsearch-dsl 6+ this snippet:
class Meta:
    index = "wordcloud_data"
    doc_type = "my-type"
will raise a "cannot write to wildcard index" exception. Change it to the following:
class Index:
    name = 'wordcloud_data'
    doc_type = 'my_type'
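Putting the two fragments together, a minimal, untested sketch of the document class for newer elasticsearch-dsl releases (where Document replaces DocType and the inner Index class carries the index name) might look like this; whether the doc_type attribute is still needed depends on your Elasticsearch version:
from elasticsearch_dsl import Document, Date, Integer, Text
from elasticsearch_dsl.connections import connections

connections.create_connection(hosts=["localhost"])

class WordCloud(Document):
    word_data = Text()
    word_count = Integer()
    date = Date()

    class Index:
        name = "wordcloud_data"  # a concrete index name, not a wildcard

# Creates the index with a mapping derived from the field definitions above.
WordCloud.init()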

MongoDB generating same ID between inserts

I am using pymongo and I am trying to insert dicts into a MongoDB database. My dictionaries look like this:
{
    "name" : "abc",
    "Jobs" : [
        {
            "position" : "Systems Engineer (Data Analyst)",
            "time" : [
                "October 2014",
                "May 2015"
            ],
            "is_current" : 1,
            "location" : "xyz",
            "organization" : "xyz"
        },
        {
            "position" : "Systems Engineer (MDM Support Lead)",
            "time" : [
                "January 2014",
                "October 2014"
            ],
            "is_current" : 1,
            "location" : "xxx",
            "organization" : "xxx"
        },
        {
            "position" : "Asst. Systems Engineer (ETL Support Executive)",
            "time" : [
                "May 2012",
                "December 2013"
            ],
            "is_current" : 1,
            "location" : "zzz",
            "organization" : "xzx"
        },
    ],
    "location" : "Buffalo, New York",
    "education" : [
        {
            "school" : "State University of New York at Buffalo - School of Management",
            "major" : "Management Information Systems, General",
            "degree" : "Master of Science (MS), "
        },
        {
            "school" : "Rajiv Gandhi Prodyogiki Vishwavidyalaya",
            "major" : "Electrical and Electronics Engineering",
            "degree" : "Bachelor of Engineering (B.E.), "
        }
    ],
    "id" : "abc123",
    "profile_link" : "example.com",
    "html_source" : "<html> some_source_code </html>"
}
I am getting this error:
pymongo.errors.DuplicateKeyError: E11000 duplicate key error index:
Linkedin_DB.employee_info.$id dup key: { :
ObjectId('56b64f6071c54604f02510a8') }
When I run my program, the first document gets inserted properly, but when I insert the second document I get this error. When I start my script again, the document that was not inserted because of this error gets inserted properly, the error occurs for the next document, and this continues.
Clearly MongoDB is using the same ObjectId for two inserts. I don't understand why it is failing to generate a unique ID for new documents.
My code to save the passed data:
class Mongosave:
    """
    Pass collection_name and dict data
    This module stores the passed dict in collection
    """
    def __init__(self):
        self.connection = pymongo.MongoClient()
        self.db = self.connection['Linkedin_DB']

    def _exists(self, id):
        # To check if user already exists
        return True if list(self.collection.find({'id': id})) else False

    def save(self, collection_name, data):
        self.collection = self.db[collection_name]
        if not self._exists(data['id']):
            print(data['id'])
            self.collection.insert(data)
        else:
            self.collection.update({'id': data['id']}, {"$set": data})
I can't figure out why this is happening. Any help is appreciated.
The problem is that your save method is using a field called "id" to decide if it should do an insert or an upsert. You want to use "_id" instead. You can read about the _id field and index here. PyMongo automatically adds an _id to your document if one is not already present. You can read more about that here.
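A minimal sketch of what that suggestion could look like in the save method, assuming the scraped id value is copied into _id so MongoDB's unique index does the duplicate check; update_one with upsert=True then replaces the separate existence check:
def save(self, collection_name, data):
    collection = self.db[collection_name]
    # Use the scraped profile id as MongoDB's primary key.
    data['_id'] = data['id']
    # Insert if the _id is new, update the existing document otherwise.
    collection.update_one({'_id': data['_id']}, {'$set': data}, upsert=True)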
You might have inserted two copies of the same document into your collection in one run.
I cannot quite understand what you mean by:
When I start my script again the document which was not inserted because of this error get inserted properly and error comes for next document and this continues.
What I do know is that if you do:
from pymongo import MongoClient

client = MongoClient()
db = client['someDB']
collection = db['someCollection']

someDocument = {'x': 1}
for i in range(10):
    collection.insert_one(someDocument)
You'll get a:
pymongo.errors.DuplicateKeyError: E11000 duplicate key error index:
This makes me think that although pymongo generates a unique _id for you when you don't provide one, it only does so if the document doesn't already have an _id: insert_one adds the generated _id to the dict you pass in, so inserting that same dict object again reuses the _id and collides.
Try generating your own _id (or passing a fresh dict) and see if it happens again.
Edit:
I just tried this and it works:
for i in range(10):
    collection.insert_one({'x': 1})
This makes me think the _id pymongo generates is attached to the object you feed in; this time I'm not referencing the same object anymore and the problem disappeared.
Are you giving your database two references to the same object?
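A small sketch illustrating that behaviour and one way around it (the database and collection names are just for the demo): insert_one adds the generated _id to the dict it is given, so passing a fresh copy each time avoids the collision.
from pymongo import MongoClient

collection = MongoClient()['someDB']['demoCollection']

doc = {'x': 1}
collection.insert_one(doc)
print('_id' in doc)  # True: insert_one added the generated _id to the dict

# Inserting `doc` again would raise DuplicateKeyError because it now carries
# that _id; inserting a copy without the _id works fine.
for i in range(10):
    collection.insert_one({k: v for k, v in doc.items() if k != '_id'})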
