Python Elasticsearch bulk index datatype

I am using the following code to create an index and load data into Elasticsearch:

from elasticsearch import helpers, Elasticsearch
import csv

es = Elasticsearch('localhost:9200')
index_name = 'wordcloud_data'

with open('./csv-data/' + index_name + '.csv') as f:
    reader = csv.DictReader(f)
    helpers.bulk(es, reader, index=index_name, doc_type='my-type')

print("done")
My CSV data is as follows:
date,word_data,word_count
2017-06-17,luxury vehicle,11
2017-06-17,signifies acceptance,17
2017-06-17,agency imposed,16
2017-06-17,customer appreciation,11
The data loads fine, but the datatypes are not accurate. How do I force it to map word_count as an integer and not text? See how it figures out the date type? Is there a way it can figure out the int datatype automatically, or by passing some parameter?

Also, what do I do to increase ignore_above, or remove it for some of the fields if I wanted to? Basically, no limit to the number of characters? The current mapping looks like this:
{
  "wordcloud_data" : {
    "mappings" : {
      "my-type" : {
        "properties" : {
          "date" : {
            "type" : "date"
          },
          "word_count" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          },
          "word_data" : {
            "type" : "text",
            "fields" : {
              "keyword" : {
                "type" : "keyword",
                "ignore_above" : 256
              }
            }
          }
        }
      }
    }
  }
}

You need to create a mapping that describes the field types.

With the elasticsearch-py client this can be done using the es.indices.put_mapping or es.indices.create methods, by passing them a JSON document that describes the mappings, as shown in this SO answer. It would be something like this:
es.indices.put_mapping(
    index="wordcloud_data",
    doc_type="my-type",
    body={
        "properties": {
            "date": {"type": "date"},
            "word_data": {"type": "text"},
            "word_count": {"type": "integer"}
        }
    }
)
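If you also want to raise or drop ignore_above, note that it lives on the keyword sub-field, so the cleanest route is to create the index with an explicit mapping before bulk loading instead of relying on dynamic mapping. Here is a rough sketch with the plain elasticsearch-py client, assuming the index does not exist yet and an Elasticsearch 5.x/6.x setup (the question still uses doc types, which 7+ removes):

from elasticsearch import Elasticsearch

es = Elasticsearch('localhost:9200')

# Create the index with an explicit mapping instead of relying on dynamic mapping.
# ignore_above is raised here purely as an illustration; omit the "keyword"
# sub-field entirely if you don't need exact-match queries on word_data.
es.indices.create(
    index="wordcloud_data",
    body={
        "mappings": {
            "my-type": {
                "properties": {
                    "date": {"type": "date"},
                    "word_count": {"type": "integer"},
                    "word_data": {
                        "type": "text",
                        "fields": {
                            "keyword": {"type": "keyword", "ignore_above": 1024}
                        }
                    }
                }
            }
        }
    }
)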
However, I'd suggest taking a look at the elasticsearch-dsl package, which provides a much nicer declarative API for describing these things. It would be something along these lines (untested):
import csv

from elasticsearch_dsl import DocType, Date, Integer, Text
from elasticsearch_dsl.connections import connections
from elasticsearch.helpers import bulk

connections.create_connection(hosts=["localhost"])

class WordCloud(DocType):
    word_data = Text()
    word_count = Integer()
    date = Date()

    class Index:
        name = "wordcloud_data"
        doc_type = "my_type"  # If you need it to be called so

WordCloud.init()

index_name = "wordcloud_data"
with open("./csv-data/%s.csv" % index_name) as f:
    reader = csv.DictReader(f)
    bulk(
        connections.get_connection(),
        (WordCloud(**row).to_dict(True) for row in reader)
    )
Please note, I haven't tried the code I've posted - just written it. I don't have an ES server at hand to test, so there could be some small mistakes or typos (please point them out if there are), but the general idea should be correct.

Thanks, @drdaeman's solution worked for me. Although, I thought it worth mentioning that in elasticsearch-dsl 6+ the old style

class Meta:
    index = "wordcloud_data"
    doc_type = "my-type"

will raise a "cannot write to wildcard index" exception. Change it to:

class Index:
    name = 'wordcloud_data'
    doc_type = 'my_type'
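For reference, a minimal sketch of the whole document class under elasticsearch-dsl 6.2+, where Document is the replacement for the older DocType (field and index names taken from the answer above):

from elasticsearch_dsl import Document, Date, Integer, Text
from elasticsearch_dsl.connections import connections

connections.create_connection(hosts=["localhost"])

class WordCloud(Document):
    word_data = Text()
    word_count = Integer()
    date = Date()

    class Index:
        name = "wordcloud_data"

# Creates the index and puts the mapping derived from the field definitions.
WordCloud.init()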

Related

Loading irregular json into Elasticsearch index with mapping using Python client

I have some .json where not all fields are present in all records; for example, caseclass.json looks like:
[{
    "name" : "john smith",
    "age" : 12,
    "cars": ["ford", "toyota"],
    "comment": "i am happy"
},
{
    "name": "a. n. other",
    "cars": "",
    "comment": "i am panicking"
}]
Using Elasticsearch 7.6.1 via the Python client elasticsearch:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
import json
import os
from elasticsearch_dsl import Document, Text, Date, Integer, analyzer

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

class Person(Document):
    class Index:
        using = es
        name = 'person_index'

    name = Text()
    age = Integer()
    cars = Text()
    comment = Text(analyzer='snowball')

Person.init()

with open("caseclass.json") as json_file:
    data = json.load(json_file)
    for indexid in range(len(data)):
        document = Person(name=data[indexid]['name'], age=data[indexid]['age'], cars=data[indexid]['cars'], comment=data[indexid]['comment'])
        document.meta.id = indexid
        document.save()
Naturally I get KeyError: 'age' when the second record is read. My question is: is it possible to load such records into an Elasticsearch index using the Python client and a pre-defined mapping, instead of dynamic mapping? The above code works if all fields are present in all records, but is there a way to do this without checking the presence of each field per record? The actual records have a complex structure and there are millions of them. Thanks
The error has nothing to do with your mapping -- it's just telling you that age could not be accessed in one of your caseclasses.
The index mapping is created when you call Person.init() -- you can verify that by calling print(es.indices.get_mapping(Person.Index.name)) right after Person.init().
I've cleaned up your code a bit:
import json
import os

from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search, Document, Text, Date, Integer, analyzer

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

class Person(Document):
    class Index:
        using = es
        name = 'person_index'

    name = Text()
    age = Integer()
    cars = Text()
    comment = Text(analyzer='snowball')

Person.init()
print(es.indices.get_mapping(Person.Index.name))

with open("caseclass.json") as json_file:
    data = json.load(json_file)
    for indexid, case in enumerate(data):
        document = Person(**case)
        document.meta.id = indexid
        document.save()
Notice how I used **case to spread all key-value pairs inside of a case instead of using data[property_key].
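Because missing keys simply never reach the Person(...) constructor, records without age just get indexed without that field. If you ever want explicit defaults instead, one hypothetical variation (the default values below are made up purely for illustration) is to merge each case over a defaults dict before spreading it:

defaults = {"age": None, "cars": "", "comment": ""}  # hypothetical defaults

for indexid, case in enumerate(data):
    document = Person(**{**defaults, **case})  # values in `case` override the defaults
    document.meta.id = indexid
    document.save()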
The generated mapping is as follows:
{
  "person_index" : {
    "mappings" : {
      "properties" : {
        "age" : {
          "type" : "integer"
        },
        "cars" : {
          "type" : "text"
        },
        "comment" : {
          "type" : "text",
          "analyzer" : "snowball"
        },
        "name" : {
          "type" : "text"
        }
      }
    }
  }
}

Jsonify data not returning to ajax call

I have an app where I am using Flask, Python, Ajax, JSON, JavaScript, and Leaflet. This app reads a CSV file, puts it into JSON format, then returns it to an Ajax call. My issue is that the GeoJSON is not being returned; in the console log I am getting a 5000 NetworkError. The end result is to use the returned GeoJSON in a Leaflet map layer. If I remove the jsonify, the return works fine, but it is a string of course, and this won't work for the layer.
As you can see, I have a simple alert("success") in the Ajax success handler. This is not being executed, nor is the alert(data).
I do have jsonify in the from flask import statement.
Thank you for the help.
Ajax call
$.ajax({
    type : "POST",
    url : '/process',
    data: {
        chks: chks
    }
})
.success(function(data){
    alert("success"); // I am doing this just to see if I get back here. I do not
    alert(data);
python/flask
@app.route('/process', methods=['POST'])
def process():
    data = request.form['chks']
    rawData = csv.reader(open('static/csvfile.csv', 'r'), dialect='excel')
    count = sum(1 for row in open('static/csvfile.csv'))
    template = '''\
    {"type" : "Feature",
     "geometry" : {
         "type" : "Point",
         "coordinates" : [%s, %s]},
     "properties" : {"name" : "%s" }
    }%s
    '''
    output = '''\
    {"type" : "Feature Collection",
     "features" : [
    '''
    iter = 0
    separator = ","
    lastrow = ""
    for row in rawData:
        iter += 1  # this is used to skip the first line of the csv file
        if iter >= 2:
            id = row[0]
            lat = row[1]
            long = row[2]
            if iter != count:
                output += template % (row[2], row[1], row[0], separator)
            else:
                output += template % (row[2], row[1], row[0], lastrow)
    output += '''\
    ]}
    '''
    return jsonify(output)
More info: taking David Knipe's advice on board, if I remove the jsonify from my return statement, it returns what I expect, and I can output the return in an alert. It looks like this:
{ "type" : "Feature Collection",
  "features" : [
    {"type" : "Feature",
     "geometry" : {
        "type" : "Point",
        "coordinates" : [ -86.28, 32.36]},
     "properties" : {"name" : "Montgomery"}
    },
    { "type" : "Feature",
      "geometry" : {
        "type" : "Point",
        "coordinates" : [ -105.42, 40.30]},
      "properties" : {"name" : "Boulder"}
    },
  ]}
If I take that data and hard-code it into the Ajax success handler, then pass it to the Leaflet layer code like this, it works and my points are displayed on my map:
...
.success(function(data){
    var pointsHC = { "type" : "Feature Collection",
        "features" : [
            {"type" : "Feature",
             "geometry" : {
                "type" : "Point",
                "coordinates" : [ -86.28, 32.36]},
             "properties" : {"name" : "Montgomery"}
            },
            { "type" : "Feature",
              "geometry" : {
                "type" : "Point",
                "coordinates" : [ -105.42, 40.30]},
              "properties" : {"name" : "Boulder"}
            },
        ]};

    // leaflet part
    var layer = L.geoJson(pointsHC, {
        pointToLayer: function(feature, latlng){
            return L.circleMarker( ...
If I do not hard-code it and instead pass the data via the variable, it does not work and I get an invalid GeoJSON object. I have tried it with the final semicolon both removed and kept, and no love either way:
...
.success(function(data){
    // leaflet part
    var layer = L.geoJson(data, {
        pointToLayer: function(feature, latlng){
            return L.circleMarker( ...
So it works if you don't try to parse the JSON, but if you do then it fails. Your JSON is invalid:
As loganbertram pointed out, you're missing a " on "Feature Collection".
You're missing a " on "properties".
output = template % ... should be output += template % ... - you want to append to output, not replace it.
The features array will have a trailing comma (unless it is empty).
Although actually, in your code features will always be empty anyway: you set iter = 0, never change its value, and then don't do the output = ... bit because iter < 2.
Are you sure you actually want to use jsonify? As I understand it, that turns any object into a JSON string. But output is already a JSON string - or should be, if you fix the various bugs loganbertram and I have spotted. In that case the client-side code will not fail trying to parse JSON. But if you jsonify something that's already JSON, you'll get something like this:
"{\"type\" : \"Feature\",
\"geometry\" : {
...
which the JavaScript will then decode back into the original JSON string, instead of into a JSON object.
Actually, it would be better to rewrite the whole thing so it constructs an object instead of a string, and then calls jsonify on that object. But I don't know enough Python to give more details easily.
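To make that last suggestion concrete, here is a rough, untested sketch of the /process route built from Python dicts, so jsonify does the JSON encoding exactly once. It assumes the CSV layout implied by the question (a header row, then name, lat, long columns); note also that valid GeoJSON spells the type "FeatureCollection" with no space:

import csv
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/process', methods=['POST'])
def process():
    chks = request.form['chks']  # still unused, as in the original code
    features = []
    with open('static/csvfile.csv', newline='') as f:
        reader = csv.reader(f, dialect='excel')
        next(reader)  # skip the header line
        for name, lat, lng in reader:  # assumes exactly three columns per row
            features.append({
                "type": "Feature",
                "geometry": {
                    "type": "Point",
                    "coordinates": [float(lng), float(lat)],
                },
                "properties": {"name": name},
            })
    # Build the collection as a dict and let jsonify serialize it exactly once.
    return jsonify({"type": "FeatureCollection", "features": features})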

working with JSON with a bash script

I'm writing a bash script that does a few things. Right now it copies a few files into the correct directories and runs a few commands. I need this bash script to edit a JSON file: essentially it would append a snippet of JSON to an existing JSON object that lives in file.JSON. I cannot just append the data, because the JSON snippet must become part of an existing JSON object (it should be added to the tracks array). So is this possible to do with a bash script? Should I just write another Python or R script to handle this JSON logic, or is there a more elegant solution? Thanks for any help.
file.JSON looks like this...
{
   "formatVersion" : 1,
   "tracks" : [
      {
         "key" : "Reference sequence",
         "chunkSize" : 20000,
         "urlTemplate" : "seq/{refseq_dirpath}/{refseq}-",
         "storeClass" : "JBrowse/Store/Sequence/StaticChunked",
         "type" : "SequenceTrack",
         "seqType" : "dna",
         "category" : "Reference sequence",
         "label" : "DNA"
      },
      {
         "type" : "FeatureTrack",
         "label" : "gff_track1",
         "trackType" : null,
         "key" : "gff_track1",
         "compress" : 0,
         "style" : {
            "className" : "feature"
         },
         "storeClass" : "JBrowse/Store/SeqFeature/NCList",
         "urlTemplate" : "tracks/gff_track1/{refseq}/trackData.json"
      },
      {
         "storeClass" : "JBrowse/Store/SeqFeature/NCList",
         "style" : {
            "className" : "feature"
         },
         "urlTemplate" : "tracks/ITAG2.4_gene_models.gff3/{refseq}/trackData.json",
         "key" : "ITAG2.4_gene_models.gff3",
         "compress" : 0,
         "trackType" : null,
         "label" : "ITAG242.4_gene_models.gff3",
         "type" : "FeatureTrack"
      },
      {
         "urlTemplate" : "g-231FRL.bam",
         "storeClass" : "JBrowse/Store/SeqFeature/BAM",
         "label" : "g-1FRL.bam",
         "type" : "JBrowse/View/Track/Alignments2",
         "key" : "g-1FRL.bam"
      }
   ]
}
The JSON snippet looks like this:
{
   "urlTemplate": "AX2_filtered.vcf.gz",
   "label": "AX2_filtered.vcf.gz",
   "storeClass": "JBrowse/Store/SeqFeature/VCFTabix",
   "type": "CanvasVariants"
}
Do yourself a favor and install jq, then it's as simple as:
jq -n 'input | .tracks += [inputs]' file.json snippet.json > out.json
Trying to modify structured data (like JSON) without a proper parser is a fool's errand, and jq really makes it easy.
However, if you prefer doing it through Python (although it would be overkill for this kind of task), it's pretty much as straightforward as with jq:
import json

with open("file.json", "r") as f, open("snippet.json", "r") as s, open("out.json", "w") as u:
    data = json.load(f)                   # parse `file.json`
    data["tracks"].append(json.load(s))   # parse `snippet.json` and append it to `.tracks[]`
    json.dump(data, u, indent=4)          # encode the data back to JSON and write it to `out.json`
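Both variants above write the merged result to out.json. If the bash script should update file.json in place, one cautious pattern (a sketch, not part of the original answer; it assumes Python 3.3+ for os.replace) is to write to a temporary file and only swap it in once the dump succeeds:

import json
import os
import tempfile

# Load the existing config and the snippet to insert.
with open("file.json") as f, open("snippet.json") as s:
    data = json.load(f)
    data["tracks"].append(json.load(s))

# Write to a temp file in the same directory, then atomically replace file.json.
fd, tmp_path = tempfile.mkstemp(dir=".", suffix=".json")
with os.fdopen(fd, "w") as tmp:
    json.dump(data, tmp, indent=4)
os.replace(tmp_path, "file.json")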

Pull from a list in a dict using mongoengine

I have this Document in mongoengine:

class Mydoc(db.Document):
    x = db.DictField()
    item_number = db.IntField()
And I have this data in the Document:
{
    "_id" : ObjectId("55e360cce725070909af4953"),
    "x" : {
        "mongo" : [
            {
                "list" : "lista"
            },
            {
                "list" : "listb"
            }
        ],
        "hello" : "world"
    },
    "item_number" : 1
}
OK, if I want to push to the mongo list using mongoengine, I do this:

Mydoc.objects(item_number=1).update_one(push__x__mongo={"list" : "listc"})

That works pretty well; if I query the database again I get this:
{
    "_id" : ObjectId("55e360cce725070909af4953"),
    "x" : {
        "mongo" : [
            {
                "list" : "lista"
            },
            {
                "list" : "listb"
            },
            {
                "list" : "listc"
            }
        ],
        "hello" : "world"
    },
    "item_number" : 1
}
But when I try to pull from the same list using pull in mongoengine:
Mydoc.objects(item_number=1).update_one(pull__x__mongo={'list': 'lista'})
I get this error:

mongoengine.errors.OperationError: Update failed (Cannot apply $pull to a non-array value)

Comparing the two statements:
Mydoc.objects(item_number=1).update_one(push__x__mongo={"list" : "listc"}) # Works
Mydoc.objects(item_number=1).update_one(pull__x__mongo={"list" : "listc"}) # Error
How can I pull from this list?
I appreciate any help
I believe the problem is that mongoengine doesn't know the structure of your x document. You declared it as a DictField, so mongoengine thinks you are pulling from a DictField, not from a ListField. Declare x as a ListField and both queries should work just fine.
I suggest you should also create an issue for this:
https://github.com/MongoEngine/mongoengine/issues
As a workaround, you can use a raw query:
Mydoc.objects(item_number=1).update_one(__raw__={'$pull': {'x.mongo': {'list': 'listc'}}})
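If restructuring the document is an option, a rough sketch of what the ListField suggestion could look like (field names kept from the question, but note this changes the stored shape, since the list moves out of x into its own field):

class Mydoc(db.Document):
    # the list gets its own field, so mongoengine knows it is a real array
    mongo = db.ListField(db.DictField())
    hello = db.StringField()
    item_number = db.IntField()

Mydoc.objects(item_number=1).update_one(push__mongo={"list": "listc"})  # works
Mydoc.objects(item_number=1).update_one(pull__mongo={"list": "listc"})  # now works too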

using key as value in Mongoengine

I use mongoengine for MongoDB in Django, but mongoengine fields (like StringField) make me build up the schema in a way that I don't want. I mean, it strictly insists that I pre-write the key name before I know what it will be. For example, in the case that I do not know what key name will be put into the database:
> for(var i=0; i<10; i++){
... o = {};
... o[i.toString()] = i + 100;
... db.test.save(o)
... }
> db.test.find()
{ "_id" : ObjectId("4ed623aa45c8729573313811"), "0" : 100 }
{ "_id" : ObjectId("4ed623aa45c8729573313812"), "1" : 101 }
{ "_id" : ObjectId("4ed623aa45c8729573313813"), "2" : 102 }
{ "_id" : ObjectId("4ed623aa45c8729573313814"), "3" : 103 }
{ "_id" : ObjectId("4ed623aa45c8729573313815"), "4" : 104 }
{ "_id" : ObjectId("4ed623aa45c8729573313816"), "5" : 105 }
{ "_id" : ObjectId("4ed623aa45c8729573313817"), "6" : 106 }
{ "_id" : ObjectId("4ed623aa45c8729573313818"), "7" : 107 }
{ "_id" : ObjectId("4ed623aa45c8729573313819"), "8" : 108 }
{ "_id" : ObjectId("4ed623aa45c872957331381a"), "9" : 109 }
[addition]
As you can see above, the keys are all different from each other. Just assume that I do not know what key name will be put into the document ahead of time. As dcrosta replied, I am looking for a way to use mongoengine without specifying the fields ahead of time.
[/addition]
How can I do the same thing through mongoengine?
Please give me a schema design like:

class Test(Document):
    tag = StringField(db_field='xxxx')
[addition]
I don't know what 'xxxx' will be as the key name.
Sorry, I'm Korean so my English is awkward. Please share some of your knowledge. Thanks for reading this.
[/addition]
Have you considered using PyMongo directly instead of using Mongoengine? Mongoengine is designed to declare and validate a schema for your documents, and provides many tools and conveniences around that. If your documents are going to vary, I'm not sure Mongoengine is the right choice for you.
If, however, you have some fields in common across all documents, and then each document has some set of fields specific to itself, you can use Mongoengine's DictField. The downside of this is that the keys will not be "top-level", for instance:
class UserThings(Document):
    # you can look this document up by username
    username = StringField()
    # you can store whatever you want here
    things = DictField()

dcrosta_things = UserThings(username='dcrosta')
dcrosta_things.things['foo'] = 'bar'
dcrosta_things.things['baz'] = 'quack'
dcrosta_things.save()
Results in a MongoDB document like:
{ _id: ObjectId(...),
_types: ["UserThings"],
_cls: "UserThings",
username: "dcrosta",
things: {
foo: "bar",
baz: "quack"
}
}
Edit: I should also note, there's work in progress on the development branch of Mongoengine for "dynamic" documents, where attributes on the Python document instances will be saved when the model is saved. See https://github.com/hmarr/mongoengine/pull/112 for details and history.
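In current mongoengine releases that work is available as DynamicDocument, which is probably the closest fit for the original question of not knowing key names ahead of time. A minimal sketch (the class and attribute names here are mine, purely for illustration):

from mongoengine import DynamicDocument, StringField, connect

connect('testdb')

class Test(DynamicDocument):
    # declared fields are validated as usual
    tag = StringField()

doc = Test(tag='example')
doc.anything = 'goes'   # not declared anywhere, still persisted
doc.word_count = 101    # dynamic attributes may hold any simple type
doc.save()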
