I wrote this short, simple script:
from elasticsearch import Elasticsearch
from fastavro import reader

es = Elasticsearch(['someIP:somePort'])

with open('data.avro', 'rb') as fo:
    avro_reader = reader(fo)
    for record in avro_reader:
        es.index(index="my_index", body=record)
It works absolutely fine. Each record is a JSON object, and Elasticsearch can index JSON documents. But rather than going one by one in a for loop, is there a way to do this in bulk? Because this is very slow.
There are two ways to do this:
1. Use the Elasticsearch Bulk API with the requests Python library (a rough sketch of this option follows the code below).
2. Use the Elasticsearch Python library, which internally calls the same Bulk API:
from elasticsearch import Elasticsearch
from elasticsearch import helpers
from fastavro import reader

es = Elasticsearch(['someIP:somePort'])

with open('data.avro', 'rb') as fo:
    avro_reader = reader(fo)
    records = [
        {
            "_index": "my_index",
            "_type": "record",
            "_id": j,
            "_source": record
        }
        for j, record in enumerate(avro_reader)
    ]
    helpers.bulk(es, records)
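For the first option, which the code above does not show, a rough sketch of calling the Bulk API directly with requests could look like the following. The host, port, and index name are placeholders carried over from the question, and it assumes the Avro records serialize cleanly with json.dumps:

# Sketch of option 1 (not from the original answer): POST newline-delimited JSON
# (NDJSON) straight to the _bulk endpoint. Host, port, and index are placeholders.
import json

import requests
from fastavro import reader

bulk_url = "http://someIP:somePort/_bulk"

with open('data.avro', 'rb') as fo:
    lines = []
    for record in reader(fo):
        lines.append(json.dumps({"index": {"_index": "my_index"}}))  # action line
        lines.append(json.dumps(record))                             # document line

payload = "\n".join(lines) + "\n"  # the Bulk API requires a trailing newline
resp = requests.post(bulk_url, data=payload,
                     headers={"Content-Type": "application/x-ndjson"})
print(resp.json().get("errors"))  # True if any individual action failed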
To upload a JSON file to an AWS DynamoDB table in Python, I am happily using the script found on this page, but I can't figure out whether I can tell Python to split a single string from the JSON file on a specific character, in order to create an array of elements in DynamoDB.
For example, let's use this data.json file:
[
    {
        "artist": "Romero Allen",
        "song": "Atomic Dim",
        "id": "b4b0da3f-36e3-4569-b196-3ad982f72bbd",
        "priceUsdCents": 392,
        "publisher": "QUAREX|IME|RUME"
    },
    {
        "artist": "Hilda Barnes",
        "song": "Almond Dutch",
        "id": "eeb58c73-603f-4d6b-9e3b-cf587488f488",
        "priceUsdCents": 161,
        "publisher": "LETPRO|SOUNDSCARE"
    }
]
and this script.py file
import boto3
import json

dynamodb = boto3.client('dynamodb')

def upload():
    with open('data.json', 'r') as datafile:
        records = json.load(datafile)
        for song in records:
            print(song)
            item = {
                'artist': {'S': song['artist']},
                'song': {'S': song['song']},
                'id': {'S': song['id']},
                'priceUsdCents': {'S': str(song['priceUsdCents'])},
                'publisher': {'S': song['publisher']}
            }
            print(item)
            response = dynamodb.put_item(
                TableName='basicSongsTable',
                Item=item
            )
            print("UPLOADING ITEM")
            print(response)

upload()
My goal is to edit the script so that the publisher column won't contain the string
publisher: "QUAREX|IME|RUME"
but a nested array of elements
publisher:["QUAREX","IME","RUME"]
For me, an extra edit of the JSON file with Python before running the upload script is an option.
You can just use .split('|'). One detail: with the low-level client, each element of a DynamoDB list ('L') has to be wrapped in its own type descriptor, so the publisher entry becomes:

item = {
    'artist': {'S': song['artist']},
    'song': {'S': song['song']},
    'id': {'S': song['id']},
    'priceUsdCents': {'S': str(song['priceUsdCents'])},
    'publisher': {'L': [{'S': p} for p in song['publisher'].split('|')]}
}
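As a side note (not part of the original answer), if you switch to the higher-level boto3 resource interface, the type descriptors go away entirely and a plain Python list is stored as a DynamoDB list. A rough sketch, where song is the loop variable from the question and the table name is reused from it:

import boto3

# Higher-level resource interface: boto3 serializes native Python types itself,
# so no {'S': ...} / {'L': ...} wrappers are needed.
dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('basicSongsTable')

table.put_item(Item={
    'artist': song['artist'],
    'song': song['song'],
    'id': song['id'],
    'priceUsdCents': str(song['priceUsdCents']),
    'publisher': song['publisher'].split('|'),  # stored as a DynamoDB list
})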
I have to overcome a problem that is probably nothing for most advanced users of MongoDB. I have to upload a few JSON files to a MongoDB database with authentication, but it is not working as easily as I thought it would. I am an amateur at MongoDB and still new to Python, so please be kind to me :)
# PART FOR CREATING JSON FILES (WORKING)
pliki = "/users/user/CSVtoGD/"
files = Path(pliki).glob('*.csv')
for f in files:
    print(json.dumps(csv_rows, indent=19))
    read_CSV(f, str(f.with_suffix('.json')))
# PART FOR UPLOADING EXAMPLE FILE TO MONGODB
import json
import pymongo
from pymongo import MongoClient

#conn = pymongo.MongoClient('mongodb://user:password@1.1.1.1/')
#db = conn['TTF-Files']
#coll = db['JSON files of TTF from game assets']

uri = "mongodb://user:password@1.1.1.1/default_db?authSource=admin"
client = MongoClient(uri)

with open('/Users/user/CSVtoGD/FILE.json', 'r') as data_file:
    data = json.loads(data_file)
    # if pymongo >= 3.0 use insert_many() for inserting many documents
    collection_currency.insert_one(data_file)

client.close()
I receive this error:
TypeError: the JSON object must be str, bytes or bytearray, not TextIOWrapper
which honestly I cannot understand at my level of knowledge.
The JSON files look like this:
[
    {
        "Name": "path/to/file/font2.ttf",
        "Vendor": "Comfortaa",
        "Other1": "Regular",
        "Other2": "Comfortaa",
        "Other3": "Comfortaa",
        "Other4": "Comfortaa",
        "Other5": "Version 2.004 2013",
        "Other6": "JohanAakerlund: Comfortaa Regular: 2011",
        "Other7": "Johan Aakerlund - aajohan",
        "Other8": "Johan Aakerlund",
        "Other9": "Copyright (c) 26.12.2011, Johan Aakerlund (aajohan@gmail.com), with Reserved Font Name \"Comfortaa\". This Font Software is licensed under the SIL Open Font License, Version 1.1. http://scripts.sil.org/OFL",
        "Other10": "http://scripts.sil.org/OFL",
        "Other11": "",
        "Other12": "Comfortaa"
    },
[
    {
        "Name": "path/to/file/font2.ttf",
        "Vendor": "Comfortaa",
        "Other1": "Regular",
        "Other2": "Comfortaa",
        "Other3": "Comfortaa",
        "Other4": "Comfortaa",
        "Other5": "Version 2.004 2013",
        "Other6": "JohanAakerlund: Comfortaa Regular: 2011",
        "Other7": "Johan Aakerlund - aajohan",
        "Other8": "Johan Aakerlund",
        "Other9": "Copyright (c) 26.12.2011, Johan Aakerlund (aajohan@gmail.com), with Reserved Font Name \"Comfortaa\". This Font Software is licensed under the SIL Open Font License, Version 1.1. http://scripts.sil.org/OFL",
        "Other10": "http://scripts.sil.org/OFL",
        "Other11": "",
        "Other12": "Comfortaa"
    },
EDIT:
I have modified the script:
# saving to the database
import pymongo
import json
#from pymongo import MongoClient, InsertOne

myclient = pymongo.MongoClient("mongodb://user:password@10.1.1.205:27017/default_db?authSource=admin")
db = myclient.TTF
collection = myclient.TTF

with open("/users/user/CSVtoGD/file.json") as file:
    file_data = json.load(file)

collection.insert_many(file_data)
But now I get this error:
TypeError: 'Collection' object is not callable. If you meant to call the 'insert_many' method on a 'Database' object it is failing because no such method exists.
Does that mean that I am not even connected to the database?
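The thread has no accepted fix, but based on the two error messages the likely changes are (1) parsing the file object with json.load rather than json.loads, and (2) selecting an actual collection from the database before calling insert_many. A minimal sketch, assuming the file contains a flat JSON array of objects; the collection name is a placeholder, not from the post:

import json
import pymongo

myclient = pymongo.MongoClient(
    "mongodb://user:password@10.1.1.205:27017/default_db?authSource=admin")
db = myclient["TTF"]            # database
collection = db["ttf_files"]    # collection name is a placeholder assumption

with open("/users/user/CSVtoGD/file.json") as f:
    file_data = json.load(f)    # json.load takes a file object; json.loads takes a string

# insert_many expects a list of documents (dicts)
collection.insert_many(file_data)
myclient.close()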
I am new to Python and Elasticsearch.
I wrote a Python script that reads data from a very large JSON file and indexes some attributes in Elasticsearch.
import elasticsearch
import json

es = elasticsearch.Elasticsearch()  # use default of localhost, port 9200

with open('j.json') as f:
    n = 0
    for line in f:
        try:
            j_content = json.loads(line)
            event_type = j_content['6000000']
            device_id = j_content['6500048']
            raw_event_msg = j_content['6000012']
            event_id = j_content["0"]
            body = {
                '6000000': str(event_type),
                '6500048': str(device_id),
                '6000012': str(raw_event_msg),
                '6000014': str(event_id),
            }
            n = n + 1
            es.index(index='coredb', doc_type='json_data', body=body)
        except:
            pass
But it's too slow, and I have plenty of free hardware resources. How can I improve the performance of the code with multithreading or bulk indexing?
You probably want to look into the Elasticsearch helpers, one in particular called bulk. It seems you are already aware of it: instead of having Elasticsearch index the data on every iteration of the loop, collect the actions in a list, and once that list reaches a certain length, send it with the bulk function. This dramatically increases performance.
You can get a rough idea from the following example. I had a very large text file with 72,873,471 lines (counted from the command line with wc -l big-file.txt). Using the same method you posted resulted in an estimated ETA of 10 days:
# Slow Method ~ 10 days
from elasticsearch import Elasticsearch
import progressbar  # pip3 install progressbar2
import re

es = Elasticsearch()

file = open("/path/to/big-file.txt")

with progressbar.ProgressBar(max_value=72873471) as bar:
    for idx, line in enumerate(file):
        bar.update(idx)
        clean = re.sub("\n", "", line).lstrip().rstrip()
        doc = {'tag1': clean, "tag2": "some extra data"}
        es.index(index="my_index", doc_type='index_type', body=doc)
Importing helpers from elasticsearch cut that time down to about 3.5 hours:
# Fast Method ~ 3.5 hours
from elasticsearch import Elasticsearch, helpers
import progressbar  # pip3 install progressbar2
import re

es = Elasticsearch()

with progressbar.ProgressBar(max_value=72873471) as bar:
    actions = []
    file = open("/path/to/big-file.txt")
    for idx, line in enumerate(file):
        bar.update(idx)
        if len(actions) > 10000:
            helpers.bulk(es, actions)
            actions = []
        clean = re.sub("\n", "", line).lstrip().rstrip()
        actions.append({
            "_index": "my_index",    # The index on Elasticsearch
            "_type": "index_type",   # The document type
            "_source": {'tag1': clean, "tag2": "some extra data"}
        })
    if actions:
        helpers.bulk(es, actions)    # flush whatever is left after the loop
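A variation on the same idea (not in the original answer): helpers.bulk also accepts a generator, so the manual buffer and the final leftover flush can be delegated to the helper via chunk_size. The index and type names are reused from the example above:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch()

def generate_actions(path):
    # Yield one bulk action per line; the helper handles the batching.
    with open(path) as big_file:
        for line in big_file:
            yield {
                "_index": "my_index",
                "_type": "index_type",
                "_source": {"tag1": line.strip(), "tag2": "some extra data"},
            }

# chunk_size controls how many actions go into each bulk request.
helpers.bulk(es, generate_actions("/path/to/big-file.txt"), chunk_size=10000)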
What you want is called Cython ;)
You can speed up your code by as much as 20x just by adding static typing to your variables.
The code below should go into Cython; give it a try, you'll see:
try:
    j_content = json.loads(line)  # Here you might want to work with cython structs.
                                  # I can see you have a json per line, so it should be easy
    event_type = j_content['6000000']
    device_id = j_content['6500048']
    raw_event_msg = j_content['6000012']
    event_id = j_content["0"]
    body = {
        '6000000': str(event_type),
        '6500048': str(device_id),
        '6000012': str(raw_event_msg),
        '6000014': str(event_id),
    }
    n = n + 1
I am trying to add a 43 MB document to an index in Elasticsearch. I use the bulk API in Python. Here is a snippet of my code:
from elasticsearch import helpers
from elasticsearch import Elasticsearch

document = <read a 43Mb json file, with two fields>

action = [
    {
        "_index": "test_index",
        "_type": "test_type",
        "_id": 1
    }
]
action[0]["_source"] = document

es = Elasticsearch(hosts=<HOST>:9200, timeout=30)
helpers.bulk(es, action)
This code always times out. I have also tried with different timeout values. Am I missing something here?
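The thread does not include an answer, but one thing worth double-checking (an assumption on my part, not from the post) is where the timeout is applied: keyword arguments not consumed by helpers.bulk are forwarded to the underlying bulk call, so a per-request timeout can be set there rather than only on the client. A sketch, with an arbitrary value:

from elasticsearch import Elasticsearch, helpers

# 'document' would be the 43 MB JSON payload from the question; <HOST> stays a placeholder.
es = Elasticsearch(hosts="<HOST>:9200", timeout=30)

action = [{"_index": "test_index", "_type": "test_type", "_id": 1, "_source": document}]

# Extra keyword arguments are forwarded to the underlying bulk call,
# so the read timeout for this single large request can be raised here.
helpers.bulk(es, action, request_timeout=120)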
I am using this code to bulk index all data in Elasticsearch using Python:
from elasticsearch import Elasticsearch, helpers
import json
import os
import sys

es = Elasticsearch()

def load_json(directory):
    for filename in os.listdir(directory):
        if filename.endswith('.json'):
            # join with the directory so files outside the current directory are found
            with open(os.path.join(directory, filename), 'r') as open_file:
                yield json.load(open_file)

helpers.bulk(es, load_json(sys.argv[1]), index='v1_resume', doc_type='candidate')
I know that if no ID is specified, ES generates a 20-character ID by itself, but I want the documents indexed with IDs starting from 1 up to the number of documents.
How can I achieve this?
In Elasticsearch, if you don't pick an ID for your document, an ID is automatically created for you; see the elastic docs:
Autogenerated IDs are 20 character long, URL-safe, Base64-encoded GUID
strings. These GUIDs are generated from a modified FlakeID scheme which
allows multiple nodes to be generating unique IDs in parallel with
essentially zero chance of collision.
If you would like to have custom IDs, you need to build them yourself, using syntax similar to this:
[
    {
        '_id': 1,
        '_index': 'index-name',
        '_type': 'document',
        '_source': {
            "title": "Hello World!",
            "body": "..."
        }
    },
    {
        '_id': 2,
        '_index': 'index-name',
        '_type': 'document',
        '_source': {
            "title": "Hello World!",
            "body": "..."
        }
    }
]
helpers.bulk(es, load_json(sys.argv[1]))
Since you are declaring the type and index inside each document, you don't have to pass them to the helpers.bulk() method. You need to change the output of load_json to produce dicts like the ones above to be saved in ES (see the python elastic client docs); a sketch of that change follows.
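Putting that together, a minimal sketch of load_json rewritten to yield full bulk actions with sequential IDs starting at 1 (the index and type names are taken from the question):

from elasticsearch import Elasticsearch, helpers
import json
import os
import sys

es = Elasticsearch()

def load_json(directory):
    doc_id = 1
    for filename in os.listdir(directory):
        if filename.endswith('.json'):
            with open(os.path.join(directory, filename), 'r') as open_file:
                # Each yielded dict is a complete bulk action with an explicit _id.
                yield {
                    '_index': 'v1_resume',
                    '_type': 'candidate',
                    '_id': doc_id,
                    '_source': json.load(open_file),
                }
                doc_id += 1

helpers.bulk(es, load_json(sys.argv[1]))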