I wanted to transfer the following CSV to Elasticsearch:
|hcode|hname|
|1|aaaa|
|2|bbbbb|
|3|ccccc|
|4|dddd|
|5|eeee|
|6|ffff|
and I need to insert the hcode field as the document _id. I am getting the error below:
File "C:\Users\Namali\Anaconda3\lib\site-packages\elasticsearch\connection\base.py", line 181, in _raise_error
status_code, error_message, additional_info
RequestError: RequestError(400, 'mapper_parsing_exception', 'failed to parse')
The Elasticsearch version is 7.1.1 and the Python version is 3.7.6.
Python code:
import csv
import json
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

def csv_reader(file_obj, delimiter=','):
    reader_ = csv.reader(file_obj, delimiter=delimiter, quotechar='"')
    i = 1
    results = []
    for row in reader_:
        #try:
        #    es.index(index='hb_hotel_raw', doc_type='hb_hotel_raw', id=row[0],
        #             body=json.dump([row for row in reader_], file_obj))
        es.index(index='test', doc_type='test', id=row[0], body=json.dumps(row))
        #except:
        #    print("error")
        i = i + 1
        results.append(row)
        print(row)

if __name__ == "__main__":
    with open("D:\\namali\\rez\\data_mapping\\test.csv") as f_obj:
        csv_reader(f_obj)
First, doc_type should be omitted in Elasticsearch 7. Second, you need to pass valid JSON to Elasticsearch. I edited your code as below:
for row in reader_:
    _id = row[0].split("|")[1]
    text = row[0].split("|")[2]
    my_dict = {"hname": text}
    es.index(index='test', id=_id, body=my_dict)
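For completeness, here is a rough sketch of the same idea that lets csv.reader do the splitting instead of calling split() by hand; it assumes the file really is |-delimited, has a header row, and that the index name test from the question is what you want:
import csv
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

with open("test.csv") as f_obj:
    reader_ = csv.reader(f_obj, delimiter='|')
    next(reader_)  # skip the header row (hcode, hname)
    for row in reader_:
        # the leading/trailing '|' produce empty fields,
        # so hcode is row[1] and hname is row[2]
        es.index(index='test', id=row[1], body={"hname": row[2]})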
<disclosure: I'm the developer of Eland and employed by Elastic>
If you're willing to load the CSV into a Pandas DataFrame, you can use Eland to create/append the tabular data to an Elasticsearch index with all data types resolved properly.
I would recommend reading the pandas.read_csv() and eland.pandas_to_eland() function documentation for ideas on how to accomplish this.
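For example, a minimal sketch with Eland (the index name test, the local cluster address, and the read_csv options are assumptions; check the Eland documentation for the exact parameters of your version):
import eland as ed
import pandas as pd
from elasticsearch import Elasticsearch

es = Elasticsearch([{'host': 'localhost', 'port': 9200}])

# adjust the read_csv options to how the file is actually delimited
df = pd.read_csv("test.csv").set_index("hcode")   # hcode becomes the document _id

ed.pandas_to_eland(
    pd_df=df,
    es_client=es,
    es_dest_index="test",
    es_if_exists="replace",
    es_refresh=True,
    use_pandas_index_for_es_ids=True,
)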
Related
I have a Glue script to create new partitions using create_partition(). The Glue script runs successfully, and I can see the partitions in the Athena console when using SHOW PARTITIONS. For the create_partitions part of the Glue script, I referred to this sample code: https://medium.com/#bv_subhash/demystifying-the-ways-of-creating-partitions-in-glue-catalog-on-partitioned-s3-data-for-faster-e25671e65574
When I try to run an Athena query for a newly added partition, I get no results.
Do I need to trigger the MSCK command even if I add the partitions using create_partitions? I'd appreciate any suggestions.
I found the solution myself and wanted to share it with the SO community so it might be useful to someone. The following code, when run as a Glue job, creates partitions that can also be queried in Athena by the new partition columns. Please change/add the parameter values (database name, table name, partition columns) as needed.
import boto3
import urllib.parse
import os
import copy
import sys

# Configure database / table name and emp_id, file_id from workflow params?
DATABASE_NAME = 'my_db'
TABLE_NAME = 'enter_table_name'
emp_id_tmp = ''
file_id_tmp = ''

# Initialise the Glue client using Boto 3
glue_client = boto3.client('glue')

# Get the current table schema for the given database name & table name
def get_current_schema(database_name, table_name):
    try:
        response = glue_client.get_table(
            DatabaseName=database_name,
            Name=table_name
        )
    except Exception as error:
        print("Exception while fetching table info:", error)
        sys.exit(-1)

    # Parse the table info required to create partitions from the table
    table_data = {}
    table_data['input_format'] = response['Table']['StorageDescriptor']['InputFormat']
    table_data['output_format'] = response['Table']['StorageDescriptor']['OutputFormat']
    table_data['table_location'] = response['Table']['StorageDescriptor']['Location']
    table_data['serde_info'] = response['Table']['StorageDescriptor']['SerdeInfo']
    table_data['partition_keys'] = response['Table']['PartitionKeys']

    return table_data

# Prepare the partition input list using table_data
def generate_partition_input_list(table_data):
    input_list = []  # Initializing empty list
    part_location = "{}/emp_id={}/file_id={}/".format(
        table_data['table_location'], emp_id_tmp, file_id_tmp)

    input_dict = {
        'Values': [
            emp_id_tmp, file_id_tmp
        ],
        'StorageDescriptor': {
            'Location': part_location,
            'InputFormat': table_data['input_format'],
            'OutputFormat': table_data['output_format'],
            'SerdeInfo': table_data['serde_info']
        }
    }
    input_list.append(input_dict.copy())
    return input_list

# Create the partition dynamically using the partition input list
table_data = get_current_schema(DATABASE_NAME, TABLE_NAME)
input_list = generate_partition_input_list(table_data)

try:
    create_partition_response = glue_client.batch_create_partition(
        DatabaseName=DATABASE_NAME,
        TableName=TABLE_NAME,
        PartitionInputList=input_list
    )
    print('Glue partition created successfully.')
    print(create_partition_response)
except Exception as e:
    # Handle exception as per your business requirements
    print(e)
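If you want to confirm that the partition actually landed in the Glue catalog without going back to the Athena console, a quick sanity check with the same Boto3 client could look like this (same database/table placeholders as above):
# Optional check: list the partitions Glue now knows about.
# batch_create_partition also reports per-partition failures in the
# 'Errors' field of its response, so that is worth inspecting too.
partitions = glue_client.get_partitions(
    DatabaseName=DATABASE_NAME,
    TableName=TABLE_NAME
)
for p in partitions.get('Partitions', []):
    print(p['Values'], p['StorageDescriptor']['Location'])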
Hi, I'm trying to save data that I get from this API into a JSON column in PostgreSQL using SQLAlchemy and Python requests.
r = requests.get(api)
content = r.content
data = json.loads(content)

crawl_item = {}
crawl_item = session.query(CrawlItem).filter_by(site_id=3, href=list_id).first()
crawl_item.description = data['ad']['body']
crawl_item.meta_data = {}
crawl_item.meta_data["ward"] = data['ad_params']['ward']['value']

try:
    session.commit()
except:
    session.rollback()
    raise
finally:
    ret_id = crawl_item.id
    session.close()
My model:
class CrawlItem(Base):
    ...
    description = Column(Text)
    meta_data = Column(postgresql.JSON)
I want to get the value of ward:
"ward": {
"id": "ward",
"value": "Thị trấn Trạm Trôi",
"label": " Phường, thị xã, thị trấn"
}
I already set my PostgreSQL encoding to UTF-8, so fields that are not JSON columns (description = Column(Text)) save UTF-8 characters normally; only my JSON column data is not decoded:
{
    "ward": "Th\u1ecb tr\u1ea5n Tr\u1ea1m Tr\u00f4i"
}
I tried using:
crawl_item.meta_data["ward"] = data['ad_params']['ward']['value'].decode('utf-8')
but then the ward data doesn't get saved at all.
I have no idea what is wrong; I hope someone can help me.
EDIT:
I checked the data with psql: the description column stores the UTF-8 characters correctly, but the meta_data JSON column stores the escaped values. It seems that only the meta_data JSON column has trouble with these characters.
SQLAlchemy serializes the JSON field before saving it to the database (see the PostgreSQL dialect source in SQLAlchemy):
json_serializer = dialect._json_serializer or json.dumps
By default, the PostgreSQL dialect uses json.dumps and json.loads.
When you work with a Text column, the data is converted in the following flow:
str -> bytes in UTF-8 encoding
When you work with a JSON column on the PostgreSQL dialect, the data is converted in the following flow:
dict -> str with escaped non-ASCII symbols -> bytes in UTF-8 encoding
You can override the serializer in your engine configuration using the json_serializer field:
json_serializer=partial(json.dumps, ensure_ascii=False)
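A minimal sketch of the engine setup (the connection URL is a placeholder):
import json
from functools import partial
from sqlalchemy import create_engine

# keep non-ASCII characters as-is instead of \uXXXX escapes
engine = create_engine(
    "postgresql://user:password@localhost/mydb",
    json_serializer=partial(json.dumps, ensure_ascii=False),
)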
use "jsonb" data type for your json column or cast "meta_data" field to "jsonb" like this:
select meta_data::jsonb from your_table;
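In SQLAlchemy terms that would mean declaring the column with the PostgreSQL JSONB type, roughly like this (only a sketch of the model change; the rest of CrawlItem stays the same):
from sqlalchemy.dialects.postgresql import JSONB

class CrawlItem(Base):
    ...
    description = Column(Text)
    meta_data = Column(JSONB)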
I would like to create a report in Lambda using Python and save it to a CSV file. Below is the code of the function:
import boto3
import datetime
import re

def lambda_handler(event, context):
    client = boto3.client('ce')

    now = datetime.datetime.utcnow()
    end = datetime.datetime(year=now.year, month=now.month, day=1)
    start = end - datetime.timedelta(days=1)
    start = datetime.datetime(year=start.year, month=start.month, day=1)
    start = start.strftime('%Y-%m-%d')
    end = end.strftime('%Y-%m-%d')

    response = client.get_cost_and_usage(
        TimePeriod={
            'Start': "2019-02-01",
            'End': "2019-08-01"
        },
        Granularity='MONTHLY',
        Metrics=['BlendedCost'],
        GroupBy=[
            {
                'Type': 'TAG',
                'Key': 'Project'
            },
        ]
    )
How can I create a CSV file from it?
Here is a sample function to create a CSV file in Lambda using Python:
Assuming the variable 'response' has the required data for creating the report, the following piece of code will help you create a temporary CSV file in the /tmp folder of the Lambda function:
import csv

temp_csv_file = csv.writer(open("/tmp/csv_file.csv", "w+"))

# writing the column names
temp_csv_file.writerow(["Account Name", "Month", "Cost"])

# writing rows in to the CSV file
for detail in response:
    temp_csv_file.writerow([detail['account_name'],
                            detail['month'],
                            detail['cost']])
Once you have created the CSV file, you can upload it to S3 and send it as an email or share it as a link using the following piece of code:
client = boto3.client('s3')
client.upload_file('/tmp/csv_file.csv', BUCKET_NAME,'final_report.csv')
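To "share it as a link", one option is a pre-signed URL (a sketch; BUCKET_NAME is the same placeholder as above):
# time-limited download link for the uploaded report
url = client.generate_presigned_url(
    'get_object',
    Params={'Bucket': BUCKET_NAME, 'Key': 'final_report.csv'},
    ExpiresIn=3600  # link valid for one hour
)
print(url)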
Points to remember:
/tmp is ephemeral storage of 512 MB which can be used to store a few temporary files.
You should not rely on this storage to maintain state across subsequent Lambda invocations.
The above answer by Repakula Srushith is correct but will create an empty CSV, as the file is never closed. You can change the code to:
f = open("/tmp/csv_file.csv", "w+")
temp_csv_file = csv.writer(f)
temp_csv_file.writerow(["Account Name", "Month", "Cost"])

# writing rows in to the CSV file
for detail in response:
    temp_csv_file.writerow([detail['account_name'],
                            detail['month'],
                            detail['cost']])
f.close()
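Alternatively, a with block closes the file for you even if an exception is raised; a minimal sketch of the same loop:
with open("/tmp/csv_file.csv", "w+", newline="") as f:
    temp_csv_file = csv.writer(f)
    temp_csv_file.writerow(["Account Name", "Month", "Cost"])
    for detail in response:
        temp_csv_file.writerow([detail['account_name'],
                                detail['month'],
                                detail['cost']])
# the file is flushed and closed when the with block exits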
Looking to index a CSV file into Elasticsearch, without using Logstash.
I am using the elasticsearch-dsl high level library.
Given a CSV with header for example:
name,address,url
adam,hills 32,http://rockit.com
jane,valleys 23,http://popit.com
What would be the best way to index all the data by these fields? Eventually I'm looking to get each row to look like this:
{
    "name": "adam",
    "address": "hills 32",
    "url": "http://rockit.com"
}
This kind of task is easier with the lower-level elasticsearch-py library:
from elasticsearch import helpers, Elasticsearch
import csv
es = Elasticsearch()
with open('/tmp/x.csv') as f:
    reader = csv.DictReader(f)
    helpers.bulk(es, reader, index='my-index', doc_type='my-type')
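If you also want to control the document _id (say, use the name column), helpers.bulk treats keys that start with an underscore in each action as metadata; a sketch, assuming name is unique:
with open('/tmp/x.csv') as f:
    reader = csv.DictReader(f)
    actions = ({'_index': 'my-index', '_id': row['name'], **row}
               for row in reader)
    helpers.bulk(es, actions)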
If you want to create an Elasticsearch index from a .tsv/.csv file with strict types and a model for better filtering, you can do something like this:
class ElementIndex(DocType):
    ROWNAME = Text()
    ROWNAME = Text()

    class Meta:
        index = 'index_name'

def indexing(self):
    obj = ElementIndex(
        ROWNAME=str(self['NAME']),
        ROWNAME=str(self['NAME'])
    )
    obj.save(index="index_name")
    return obj.to_dict(include_meta=True)

def bulk_indexing(args):
    # ElementIndex.init(index="index_name")
    ElementIndex.init()
    es = Elasticsearch()
    # here your result dict with data from source
    r = bulk(client=es, actions=(indexing(c) for c in result))
    es.indices.refresh()
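The result variable above is not shown in the answer; one way to build it (an assumption, using csv.DictReader and a placeholder file name) is:
import csv

with open('data.tsv') as f:
    # one dict per row, keyed by the header names; use delimiter=',' for a .csv
    result = list(csv.DictReader(f, delimiter='\t'))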
The examples I've found so far stream JSON to BQ, e.g. https://cloud.google.com/bigquery/streaming-data-into-bigquery
How do I stream a CSV, or any other file type, into BQ? Below is a block of code for streaming; the issue seems to be in insert_all_data, where 'row' is defined as JSON. Thanks.
# [START stream_row_to_bigquery]
def stream_row_to_bigquery(bigquery, project_id, dataset_id, table_name, row,
                           num_retries=5):
    insert_all_data = {
        'rows': [{
            'json': row,
            # Generate a unique id for each row so retries don't accidentally
            # duplicate insert
            'insertId': str(uuid.uuid4()),
        }]
    }
    return bigquery.tabledata().insertAll(
        projectId=project_id,
        datasetId=dataset_id,
        tableId=table_name,
        body=insert_all_data).execute(num_retries=num_retries)
# [END stream_row_to_bigquery]
This is how I did it, quite easily, using the bigquery-python library.
def insert_data(datasetname, table_name, DataObject):
    client = get_client(project_id, service_account=service_account,
                        private_key_file=key, readonly=False, swallow_results=False)
    insertObject = DataObject
    try:
        result = client.push_rows(datasetname, table_name, insertObject)
    except Exception as err:
        print(err)
        raise
    return result
Here insertObject is a list of dictionaries, where one dictionary contains one row.
e.g. [{field1: value1, field2: value2}, {field1: value3, field2: value4}]
The CSV can be read as follows:
import pandas as pd

fileCsv = pd.read_csv(file_path + '/' + filename, parse_dates=C, infer_datetime_format=True)

data = []
for row_x in range(len(fileCsv.index)):
    i = 0
    row = {}
    for col_y in schema:
        row[col_y['name']] = _sorted_list[i]['col_data'][row_x]
        i += 1
    data.append(row)

insert_data(datasetname, table_name, data)
The data list can then be passed to insert_data.
This will do that, but there is still a limitation, which I have already raised in a separate question.
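As a side note, if the CSV columns already match the table schema, the row-building loop above could be replaced by letting pandas produce the list of row dictionaries directly (a sketch, reusing the same names):
data = fileCsv.to_dict(orient='records')   # one dict per CSV row
insert_data(datasetname, table_name, data)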