How to convert a column of dictionaries to separate columns in pandas? - python

Given the following dictionary created from df['statistics'].head().to_dict()
{0: {'executions': {'total': '1',
'passed': '1',
'failed': '0',
'skipped': '0'},
'defects': {'product_bug': {'total': 0, 'PB001': 0},
'automation_bug': {'AB001': 0, 'total': 0},
'system_issue': {'total': 0, 'SI001': 0},
'to_investigate': {'total': 0, 'TI001': 0},
'no_defect': {'ND001': 0, 'total': 0}}},
1: {'executions': {'total': '1',
'passed': '1',
'failed': '0',
'skipped': '0'},
'defects': {'product_bug': {'total': 0, 'PB001': 0},
'automation_bug': {'AB001': 0, 'total': 0},
'system_issue': {'total': 0, 'SI001': 0},
'to_investigate': {'total': 0, 'TI001': 0},
'no_defect': {'ND001': 0, 'total': 0}}},
2: {'executions': {'total': '1',
'passed': '1',
'failed': '0',
'skipped': '0'},
'defects': {'product_bug': {'total': 0, 'PB001': 0},
'automation_bug': {'AB001': 0, 'total': 0},
'system_issue': {'total': 0, 'SI001': 0},
'to_investigate': {'total': 0, 'TI001': 0},
'no_defect': {'ND001': 0, 'total': 0}}},
3: {'executions': {'total': '1',
'passed': '1',
'failed': '0',
'skipped': '0'},
'defects': {'product_bug': {'total': 0, 'PB001': 0},
'automation_bug': {'AB001': 0, 'total': 0},
'system_issue': {'total': 0, 'SI001': 0},
'to_investigate': {'total': 0, 'TI001': 0},
'no_defect': {'ND001': 0, 'total': 0}}},
4: {'executions': {'total': '1',
'passed': '1',
'failed': '0',
'skipped': '0'},
'defects': {'product_bug': {'total': 0, 'PB001': 0},
'automation_bug': {'AB001': 0, 'total': 0},
'system_issue': {'total': 0, 'SI001': 0},
'to_investigate': {'total': 0, 'TI001': 0},
'no_defect': {'ND001': 0, 'total': 0}}}}
Is there a way to expand the dictionary key/value pairs into their own columns and prefix these columns with the name of the original column, i.e. statistics.executions.total would become statistics_executions_total or even executions_total?
I have demonstrated that I can create the columns using the following:
pd.concat([df.drop(['statistics'], axis=1), df['statistics'].apply(pd.Series)], axis=1)
However, you will notice that several of these newly created columns share the duplicate name "total".
I have not been able to find a way to prefix the newly created columns with the original column name, i.e. executions_total.
For additional insight, statistics will expand into executions and defects. executions will expand into passed | failed | skipped | total, and defects will expand into automation_bug | system_issue | to_investigate | product_bug | no_defect. Each of the latter will then expand into total | **001 columns, where total is duplicated several times.
Any ideas are greatly appreciated. Thanks!

.apply(pd.Series) is slow, don't use it.
See timing in Splitting dictionary/list inside a Pandas Column into Separate Columns
Create a DataFrame with a 'statistics' column from the dict in the OP.
This will create a DataFrame with a column of dictionaries.
Use pandas.json_normalize on the 'statistics' column.
The default sep is '.'.
Nested records generate column names whose parts are joined by sep.
import pandas as pd
# this is for setting up the test dataframe from the data in the question, where data is the name of the dict
df = pd.DataFrame({'statistics': [v for v in data.values()]})
# display(df)
statistics
0 {'executions': {'total': '1', 'passed': '1', 'failed': '0', 'skipped': '0'}, 'defects': {'product_bug': {'total': 0, 'PB001': 0}, 'automation_bug': {'AB001': 0, 'total': 0}, 'system_issue': {'total': 0, 'SI001': 0}, 'to_investigate': {'total': 0, 'TI001': 0}, 'no_defect': {'ND001': 0, 'total': 0}}}
1 {'executions': {'total': '1', 'passed': '1', 'failed': '0', 'skipped': '0'}, 'defects': {'product_bug': {'total': 0, 'PB001': 0}, 'automation_bug': {'AB001': 0, 'total': 0}, 'system_issue': {'total': 0, 'SI001': 0}, 'to_investigate': {'total': 0, 'TI001': 0}, 'no_defect': {'ND001': 0, 'total': 0}}}
2 {'executions': {'total': '1', 'passed': '1', 'failed': '0', 'skipped': '0'}, 'defects': {'product_bug': {'total': 0, 'PB001': 0}, 'automation_bug': {'AB001': 0, 'total': 0}, 'system_issue': {'total': 0, 'SI001': 0}, 'to_investigate': {'total': 0, 'TI001': 0}, 'no_defect': {'ND001': 0, 'total': 0}}}
3 {'executions': {'total': '1', 'passed': '1', 'failed': '0', 'skipped': '0'}, 'defects': {'product_bug': {'total': 0, 'PB001': 0}, 'automation_bug': {'AB001': 0, 'total': 0}, 'system_issue': {'total': 0, 'SI001': 0}, 'to_investigate': {'total': 0, 'TI001': 0}, 'no_defect': {'ND001': 0, 'total': 0}}}
4 {'executions': {'total': '1', 'passed': '1', 'failed': '0', 'skipped': '0'}, 'defects': {'product_bug': {'total': 0, 'PB001': 0}, 'automation_bug': {'AB001': 0, 'total': 0}, 'system_issue': {'total': 0, 'SI001': 0}, 'to_investigate': {'total': 0, 'TI001': 0}, 'no_defect': {'ND001': 0, 'total': 0}}}
# normalize the statistics column
dfs = pd.json_normalize(df.statistics)
# display(dfs)
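executions.total executions.passed executions.failed executions.skipped defects.product_bug.total defects.product_bug.PB001 defects.automation_bug.AB001 defects.automation_bug.total defects.system_issue.total defects.system_issue.SI001 defects.to_investigate.total defects.to_investigate.TI001 defects.no_defect.ND001 defects.no_defect.total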
0 1 1 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 0 0 0 0 0 0 0 0 0 0 0 0
3 1 1 0 0 0 0 0 0 0 0 0 0 0 0
4 1 1 0 0 0 0 0 0 0 0 0 0 0 0
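To get the underscore-separated, prefixed names the question asks for, one possible sketch is to pass sep='_' to json_normalize and then call add_prefix. The data below is a shortened stand-in for the dict in the question, and the 'run' column is a hypothetical extra column added only to show the join back onto the rest of the DataFrame.

```python
import pandas as pd

# Shortened stand-in for the dict in the question (two rows, one defect type).
data = {
    0: {'executions': {'total': '1', 'passed': '1', 'failed': '0', 'skipped': '0'},
        'defects': {'product_bug': {'total': 0, 'PB001': 0}}},
    1: {'executions': {'total': '1', 'passed': '1', 'failed': '0', 'skipped': '0'},
        'defects': {'product_bug': {'total': 0, 'PB001': 0}}},
}

# 'run' is a hypothetical column standing in for the other columns of the real df.
df = pd.DataFrame({'run': ['a', 'b'], 'statistics': list(data.values())})

# Flatten the nested dicts with '_' as the separator,
# then prefix every new column with the name of the source column.
flat = pd.json_normalize(df['statistics'].tolist(), sep='_').add_prefix('statistics_')

# Swap the dict column for the flattened columns.
# Both frames use the default RangeIndex here, so join aligns row by row.
result = df.drop(columns='statistics').join(flat)
print(result.columns.tolist())
```

If df has a non-default index, the flattened frame would need the same index (for example flat.index = df.index) before joining.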

Related

python nested dictionary to pandas DataFrame

main_dict = {
'NSE:ACC': {'average_price': 0,
'buy_quantity': 0,
'depth': {'buy': [{'orders': 0, 'price': 0, 'quantity': 0},
{'orders': 0, 'price': 0, 'quantity': 0},
{'orders': 0, 'price': 0, 'quantity': 0},
{'orders': 0, 'price': 0, 'quantity': 0},
{'orders': 0, 'price': 0, 'quantity': 0}],
'sell': [{'orders': 0, 'price': 0, 'quantity': 0},
{'orders': 0, 'price': 0, 'quantity': 0},
{'orders': 0, 'price': 0, 'quantity': 0},
{'orders': 0, 'price': 0, 'quantity': 0},
{'orders': 0, 'price': 0, 'quantity': 0}]},
'instrument_token': 5633,
'last_price': 2488.9,
'last_quantity': 0,
'last_trade_time': '2022-09-23 15:59:10',
'lower_circuit_limit': 2240.05,
'net_change': 0,
'ohlc': {'close': 2555.7,
'high': 2585.5,
'low': 2472.2,
'open': 2575},
'oi': 0,
'oi_day_high': 0,
'oi_day_low': 0,
'sell_quantity': 0,
'timestamp': '2022-09-23 18:55:17',
'upper_circuit_limit': 2737.75,
'volume': 0},
}
I want to convert this dict to a pandas DataFrame, for example:
symbol last_price net_change Open High Low Close
NSE:ACC 2488.9 0 2575 2585.5 2472.2 2555.7
I am trying pd.DataFrame.from_dict(main_dict), but it does not work.
Please suggest the best approach.
I would first select the necessary data from your dict and then pass that as input to pd.DataFrame()
import pandas as pd
# build one flat record per symbol, pulling the ohlc values up a level
df_input = [{
    "symbol": symbol,
    "last_price": main_dict.get(symbol).get("last_price"),
    "net_change": main_dict.get(symbol).get("net_change"),
    "open": main_dict.get(symbol).get("ohlc").get("open"),
    "high": main_dict.get(symbol).get("ohlc").get("high"),
    "low": main_dict.get(symbol).get("ohlc").get("low"),
    "close": main_dict.get(symbol).get("ohlc").get("close")
} for symbol in main_dict]
df = pd.DataFrame(df_input)
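As an alternative sketch (not the approach from the answer above), pandas.json_normalize can flatten the nested 'ohlc' dict without listing each key by hand. The main_dict below is a shortened version of the one in the question; fields such as 'depth', which hold lists, are left out here and would otherwise stay as list-valued columns.

```python
import pandas as pd

# Shortened version of main_dict from the question.
main_dict = {
    'NSE:ACC': {
        'last_price': 2488.9,
        'net_change': 0,
        'ohlc': {'close': 2555.7, 'high': 2585.5, 'low': 2472.2, 'open': 2575},
    },
}

# Turn the outer keys into a 'symbol' field so each quote becomes one record.
records = [{'symbol': symbol, **quote} for symbol, quote in main_dict.items()]

# Nested dicts become columns such as 'ohlc_open' when sep='_'.
df = pd.json_normalize(records, sep='_')

# Keep only the columns of interest, in the order the question asks for.
df = df[['symbol', 'last_price', 'net_change',
         'ohlc_open', 'ohlc_high', 'ohlc_low', 'ohlc_close']]
print(df)
```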

printing into a text file from dictionary key and values

I need some help with some code where it needs to go into the log file and it should look like this:
I have a nested dictionary whose keys are the event IDs and whose values hold the count. I want to write each event ID with its count one entry at a time, but I do not know how: everything comes out all at once instead of one by one.
This is an example of the dictionary holding the counts and keys that need to be printed.
eventIDs = {1102: {'count': 0}, 4611: {'count': 0}, 4624: {'count': 0}, 4634: {'count': 0}, 4648: {'count': 0}, 4661: {'count': 0},
4662: {'count': 0}, 4663: {'count': 0}, 4672: {'count': 0}, 4673: {'count': 0}, 4688: {'count': 0}, 4698: {'count': 0},
4699: {'count': 0}, 4702: {'count': 0}, 4703: {'count': 0}, 4719: {'count': 0}, 4732: {'count': 0}, 4738: {'count': 0},
4742: {'count': 0}, 4776: {'count': 0}, 4798: {'count': 0}, 4799: {'count': 0}, 4985: {'count': 0}, 5136: {'count': 0},
5140: {'count': 0}, 5142: {'count': 0}, 5156: {'count': 0}, 5158: {'count': 0}}
This is the code I have tried:
def log_output():
    with open('path' + timeStamp + '.txt', 'a') as VisualiseLog:
        event_id = list{eventIDs.keys()}
        event_count = list(eventIDs.values)
        for item in eventIDs:
            print(f'Event ID: {event_id}')
            VisualiseLog.write('Event ID: {event_id}')
            print(f'Event Count: {event_count}')
            VisualiseLog.write(f'Event Count: {event_count}')
Try this code:
eventIDs = {
1102: {'count': 0},
4611: {'count': 0}
}
timeStamp = "1234"
def log_output():
    with open('path' + timeStamp + '.txt', 'a') as VisualiseLog:
        # look up each event's count and write one entry per event
        for event_id in eventIDs:
            count = eventIDs[event_id]['count']
            print(f'Event ID: {event_id}')
            VisualiseLog.write(f'Event ID: {event_id}\n')
            print(f'Event Count: {count}')
            VisualiseLog.write(f'Event Count: {count}\n\n')
log_output()
# Outputs:
# Event ID: 1102
# Event Count: 0
#
# Event ID: 4611
# Event Count: 0

Why do index entries keep being created each time I hit the API in Elasticsearch?

The code is below:
r = [{'eid': '1', 'data': 'Health'},
{'eid': '2', 'data': 'countries'},
{'eid': '3', 'data': 'countries currency'},
{'eid': '4', 'data': 'countries language'}]
from elasticsearch import Elasticsearch
es = Elasticsearch()
es.cluster.health()
es.indices.create(index='my-index_1', ignore=400)
for e in enumerate(r):
    es.index(index="my-index_1", body=e[1])
search1 = es.search(index="my-index_1", body={'query': {'term' : {'data.keyword': 'Health'}}})
search1
The output of the first run is below:
{'took': 0,
'timed_out': False,
'_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
'hits': {'total': {'value': 0, 'relation': 'eq'},
'max_score': None,
'hits': []}}
Second time
{'took': 0,
'timed_out': False,
'_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
'hits': {'total': {'value': 1, 'relation': 'eq'},
'max_score': 1.2039728,
'hits': [{'_index': 'my-index_1',
'_type': '_doc',
'_id': 'Rov4UHMBpo0uANDoY2_5',
'_score': 1.2039728,
'_source': {'eid': '1', 'data': 'Health'}}]}}
Third time
{'took': 0,
'timed_out': False,
'_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
'hits': {'total': {'value': 2, 'relation': 'eq'},
'max_score': 1.2809337,
'hits': [{'_index': 'my-index_1',
'_type': '_doc',
'_id': 'Rov4UHMBpo0uANDoY2_5',
'_score': 1.2809337,
'_source': {'eid': '1', 'data': 'Health'}},
{'_index': 'my-index_1',
'_type': '_doc',
'_id': 'aov4UHMBpo0uANDonm_E',
'_score': 1.2809337,
'_source': {'eid': '1', 'data': 'Health'}}]}}
The entry below keeps being repeated each time I run the code:
{'_index': 'my-index_1',
'_type': '_doc',
'_id': 'aov4UHMBpo0uANDonm_E',
'_score': 1.2809337,
'_source': {'eid': '1', 'data': 'Health'}}
Is it because of enumerate? My input is a list of dictionaries, each with multiple keys; how else should I iterate over it?
My expected output is that each document shows up only once, however many times I run the code.
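A likely explanation, offered as a sketch rather than a confirmed diagnosis: enumerate is not the cause. When es.index is called without an id, Elasticsearch generates a new _id for every call, so each run of the script stores another copy of every document (the different _id values in the outputs above are consistent with this). The empty result on the first run is probably because the search ran before the newly indexed documents became searchable. Passing each document's 'eid' as the id should make reruns overwrite the existing documents instead of adding new ones. index_documents is a hypothetical helper name, and the body= keyword is kept to match the client version used in the question.

```python
docs = [
    {'eid': '1', 'data': 'Health'},
    {'eid': '2', 'data': 'countries'},
    {'eid': '3', 'data': 'countries currency'},
    {'eid': '4', 'data': 'countries language'},
]


def index_documents(client, index_name, documents):
    """Hypothetical helper: index each document under its own 'eid'.

    Because the id is fixed, indexing the same document again replaces
    the stored copy instead of creating a new one.
    """
    for doc in documents:
        client.index(index=index_name, id=doc['eid'], body=doc)


# Usage with a real client (assumes a reachable cluster; not executed here):
# from elasticsearch import Elasticsearch
# es = Elasticsearch()
# index_documents(es, 'my-index_1', docs)
```

Copies already created by earlier runs would stay in the index until it is deleted or recreated.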

Python read particular data from response JSON

I am new to Python and JSON. I am calling an API, and I get the following response body:
{'product': 'Cycle', 'available': 20, 'blocked': 0, 'orderBooked': 0, 'transfer': 0, 'restock': 0, 'unavailable': 0, 'total': 0, 'lCode': '2000112', 'locationId': '745', 'locationCode': '425', 'stockType': 'IN STOCK', 'adminStock': {'rp': 0, 'management': 0, 'rc': 0, 'total': 0, 'default': 0}, 'isBlocked': False, 'plannedDate': None, 'plannedUpdate': True, 'bookedQuantity': 0}
{'product': 'Cooker', 'available': 958, 'blocked': 10, 'orderBooked': 10, 'transfer': 30, 'restock': 0, 'unavailable': 0, 'total': 0, 'lCode': '589620', 'locationId': '420', 'locationCode': '695', 'stockType': 'PRE ORDER', 'adminStock': {'rp': 0, 'management': 0, 'rc': 0, 'total': 0, 'default': 0}, 'isBlocked': False, 'plannedDate': None, 'plannedUpdate': True, 'bookedQuantity': 0}
{'product': 'Cycle', 'available': 96220, 'blocked': 0, 'orderBooked': 0, 'transfer': 0, 'restock': 0, 'unavailable': 0, 'total': 0, 'lCode': '2000112', 'locationId': '745', 'locationCode': '425', 'stockType': 'CONFIRMED', 'adminStock': {'rp': 0, 'management': 0, 'rc': 0, 'total': 0, 'default': 0}, 'isBlocked': False, 'plannedDate': None, 'plannedUpdate': True, 'bookedQuantity': 0}
{'product': 'Lapms', 'available': 89958, 'blocked': 1890, 'orderBooked': 1045, 'transfer': 230, 'restock': 0, 'unavailable': 0, 'total': 0, 'lCode': '78963', 'locationId': '896', 'locationCode': '463', 'stockType': 'TRANSIT', 'adminStock': {'rp': 0, 'management': 0, 'rc': 0, 'total': 0, 'default': 0}, 'isBlocked': False, 'plannedDate': None, 'plannedUpdate': True, 'bookedQuantity': 0}
The data above will vary with the API request. Whatever the response is, I need to print one line of data per product. I want to read this JSON and get the following data:
Name:<'product'>, Code:<'lCode'>, Location:<'locationCode'>, Stock Type:<'stockType'>, Availability:<'available'>
So for the above JSON, the output should be:
Name:Cycle, Code:2000112, Location:425, Stock Type:IN STOCK, Availability:20
Name:Cooker, Code:589620, Location:695, Stock Type:PRE ORDER, Availability:958
Name:Cycle, Code:2000112, Location:425, Stock Type:CONFIRMED, Availability:96220
Name:Lapms, Code:78963, Location:463, Stock Type:TRANSIT, Availability:89958
The output should have as many lines as there are product entries in the response.
I don't have any idea how to parse JSON in Python. Please help me understand how I can get the data in the format above. I haven't tried anything yet, as I am stuck.
This is what I believe you want. As some comments say, these outputs should be treated as dictionaries or lists, with dictionaries and/or lists nested within them. The difference matters: a dictionary is addressed by its key, whereas a list is addressed by its index.
import pandas as pd
json_1 = {'product': 'Cycle', 'available': 20, 'blocked': 0, 'orderBooked': 0, 'transfer': 0, 'restock': 0, 'unavailable': 0, 'total': 0, 'lCode': '2000112', 'locationId': '745', 'locationCode': '425', 'stockType': 'IN STOCK', 'adminStock': {'rp': 0, 'management': 0, 'rc': 0, 'total': 0, 'default': 0}, 'isBlocked': False, 'plannedDate': None, 'plannedUpdate': True, 'bookedQuantity': 0}
json_2 = {'product': 'Cooker', 'available': 958, 'blocked': 10, 'orderBooked': 10, 'transfer': 30, 'restock': 0, 'unavailable': 0, 'total': 0, 'lCode': '589620', 'locationId': '420', 'locationCode': '695', 'stockType': 'PRE ORDER', 'adminStock': {'rp': 0, 'management': 0, 'rc': 0, 'total': 0, 'default': 0}, 'isBlocked': False, 'plannedDate': None, 'plannedUpdate': True, 'bookedQuantity': 0}
json_3 = {'product': 'Cycle', 'available': 96220, 'blocked': 0, 'orderBooked': 0, 'transfer': 0, 'restock': 0, 'unavailable': 0, 'total': 0, 'lCode': '2000112', 'locationId': '745', 'locationCode': '425', 'stockType': 'CONFIRMED', 'adminStock': {'rp': 0, 'management': 0, 'rc': 0, 'total': 0, 'default': 0}, 'isBlocked': False, 'plannedDate': None, 'plannedUpdate': True, 'bookedQuantity': 0}
json_4 = {'product': 'Lapms', 'available': 89958, 'blocked': 1890, 'orderBooked': 1045, 'transfer': 230, 'restock': 0, 'unavailable': 0, 'total': 0, 'lCode': '78963', 'locationId': '896', 'locationCode': '463', 'stockType': 'TRANSIT', 'adminStock': {'rp': 0, 'management': 0, 'rc': 0, 'total': 0, 'default': 0}, 'isBlocked': False, 'plannedDate': None, 'plannedUpdate': True, 'bookedQuantity': 0}
support_list = []
support_list.append([json_1,json_2,json_3,json_4])
support_dict = {'Name':[],'Code':[],'Location':[],'Stock type':[],'Availability':[]}
for i in range(len(support_list[0])):
    support_dict['Name'].append(support_list[0][i]['product'])
    support_dict['Code'].append(support_list[0][i]['lCode'])
    support_dict['Location'].append(support_list[0][i]['locationCode'])
    support_dict['Stock type'].append(support_list[0][i]['stockType'])
    support_dict['Availability'].append(support_list[0][i]['available'])
df = pd.DataFrame(support_dict)
print(df)
Output:
Name Code Location Stock type Availability
0 Cycle 2000112 425 IN STOCK 20
1 Cooker 589620 695 PRE ORDER 958
2 Cycle 2000112 425 CONFIRMED 96220
3 Lapms 78963 463 TRANSIT 89958
EDIT: OP says the response is a single list with multiple JSON objects in it.
The same logic applies:
import pandas as pd
json_output= [{'product': 'Cycle', 'available': 20, 'blocked': 0, 'orderBooked': 0, 'transfer': 0, 'restock': 0, 'unavailable': 0, 'total': 0, 'lCode': '2000112', 'locationId': '745', 'locationCode': '425', 'stockType': 'IN STOCK', 'adminStock': {'rp': 0, 'management': 0, 'rc': 0, 'total': 0, 'default': 0}, 'isBlocked': False, 'plannedDate': None, 'plannedUpdate': True, 'bookedQuantity': 0},{'product': 'Cooker', 'available': 958, 'blocked': 10, 'orderBooked': 10, 'transfer': 30, 'restock': 0, 'unavailable': 0, 'total': 0, 'lCode': '589620', 'locationId': '420', 'locationCode': '695', 'stockType': 'PRE ORDER', 'adminStock': {'rp': 0, 'management': 0, 'rc': 0, 'total': 0, 'default': 0}, 'isBlocked': False, 'plannedDate': None, 'plannedUpdate': True, 'bookedQuantity': 0},{'product': 'Cycle', 'available': 96220, 'blocked': 0, 'orderBooked': 0, 'transfer': 0, 'restock': 0, 'unavailable': 0, 'total': 0, 'lCode': '2000112', 'locationId': '745', 'locationCode': '425', 'stockType': 'CONFIRMED', 'adminStock': {'rp': 0, 'management': 0, 'rc': 0, 'total': 0, 'default': 0}, 'isBlocked': False, 'plannedDate': None, 'plannedUpdate': True, 'bookedQuantity': 0},{'product': 'Lapms', 'available': 89958, 'blocked': 1890, 'orderBooked': 1045, 'transfer': 230, 'restock': 0, 'unavailable': 0, 'total': 0, 'lCode': '78963', 'locationId': '896', 'locationCode': '463', 'stockType': 'TRANSIT', 'adminStock': {'rp': 0, 'management': 0, 'rc': 0, 'total': 0, 'default': 0}, 'isBlocked': False, 'plannedDate': None, 'plannedUpdate': True, 'bookedQuantity': 0}]
support_dict = {'Name':[],'Code':[],'Location':[],'Stock type':[],'Availability':[]}
for i in range(len(json_output)):
    support_dict['Name'].append(json_output[i]['product'])
    support_dict['Code'].append(json_output[i]['lCode'])
    support_dict['Location'].append(json_output[i]['locationCode'])
    support_dict['Stock type'].append(json_output[i]['stockType'])
    support_dict['Availability'].append(json_output[i]['available'])
df = pd.DataFrame(support_dict)
print(df)
Output:
Name Code Location Stock type Availability
0 Cycle 2000112 425 IN STOCK 20
1 Cooker 589620 695 PRE ORDER 958
2 Cycle 2000112 425 CONFIRMED 96220
3 Lapms 78963 463 TRANSIT 89958
EDIT 2: If you want the output as lines:
json_output= [{'product': 'Cycle', 'available': 20, 'blocked': 0, 'orderBooked': 0, 'transfer': 0, 'restock': 0, 'unavailable': 0, 'total': 0, 'lCode': '2000112', 'locationId': '745', 'locationCode': '425', 'stockType': 'IN STOCK', 'adminStock': {'rp': 0, 'management': 0, 'rc': 0, 'total': 0, 'default': 0}, 'isBlocked': False, 'plannedDate': None, 'plannedUpdate': True, 'bookedQuantity': 0},{'product': 'Cooker', 'available': 958, 'blocked': 10, 'orderBooked': 10, 'transfer': 30, 'restock': 0, 'unavailable': 0, 'total': 0, 'lCode': '589620', 'locationId': '420', 'locationCode': '695', 'stockType': 'PRE ORDER', 'adminStock': {'rp': 0, 'management': 0, 'rc': 0, 'total': 0, 'default': 0}, 'isBlocked': False, 'plannedDate': None, 'plannedUpdate': True, 'bookedQuantity': 0},{'product': 'Cycle', 'available': 96220, 'blocked': 0, 'orderBooked': 0, 'transfer': 0, 'restock': 0, 'unavailable': 0, 'total': 0, 'lCode': '2000112', 'locationId': '745', 'locationCode': '425', 'stockType': 'CONFIRMED', 'adminStock': {'rp': 0, 'management': 0, 'rc': 0, 'total': 0, 'default': 0}, 'isBlocked': False, 'plannedDate': None, 'plannedUpdate': True, 'bookedQuantity': 0},{'product': 'Lapms', 'available': 89958, 'blocked': 1890, 'orderBooked': 1045, 'transfer': 230, 'restock': 0, 'unavailable': 0, 'total': 0, 'lCode': '78963', 'locationId': '896', 'locationCode': '463', 'stockType': 'TRANSIT', 'adminStock': {'rp': 0, 'management': 0, 'rc': 0, 'total': 0, 'default': 0}, 'isBlocked': False, 'plannedDate': None, 'plannedUpdate': True, 'bookedQuantity': 0}]
for i in range(len(json_output)):
    print('Name: ' + str(json_output[i]['product']) + ', Code: ' + str(json_output[i]['lCode']) + ', Location: ' + str(json_output[i]['locationCode']) + ', Stock type: ' + str(json_output[i]['stockType']) + ', Availability: ' + str(json_output[i]['available']))
Output:
Name: Cycle, Code: 2000112, Location: 425, Stock type: IN STOCK, Availability: 20
Name: Cooker, Code: 589620, Location: 695, Stock type: PRE ORDER, Availability: 958
Name: Cycle, Code: 2000112, Location: 425, Stock type: CONFIRMED, Availability: 96220
Name: Lapms, Code: 78963, Location: 463, Stock type: TRANSIT, Availability: 89958
If you parse JSON text with json.loads, you get standard Python objects: a dictionary for a JSON object, a list for a JSON array.
import json
json_data = '{"a": 1, "b": 2, "c": 3, "d": 4, "e": 5}'
parsed_json = json.loads(json_data)
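Applied to the question, a minimal sketch might look like the following. It assumes the API body arrives as JSON text containing an array of product objects; response_text is a hypothetical, shortened stand-in for that body.

```python
import json

# Hypothetical, shortened stand-in for the API response body (a JSON array).
response_text = '''[
  {"product": "Cycle", "available": 20, "lCode": "2000112",
   "locationCode": "425", "stockType": "IN STOCK"},
  {"product": "Cooker", "available": 958, "lCode": "589620",
   "locationCode": "695", "stockType": "PRE ORDER"}
]'''

# json.loads turns the JSON array into a list of dictionaries.
items = json.loads(response_text)

# Build one output line per product entry.
lines = [
    f"Name:{item['product']}, Code:{item['lCode']}, "
    f"Location:{item['locationCode']}, Stock Type:{item['stockType']}, "
    f"Availability:{item['available']}"
    for item in items
]
print("\n".join(lines))
```

If the response is fetched with a library that already decodes JSON, the json.loads step would not be needed and the loop can run over the decoded list directly.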

elasticsearch-py and multiprocessing

What is the correct way to use elasticsearch-py in a multiprocessing script? Should I create one client object before starting the processes and share it, or should I create a new client inside each process? The second approach gives me connection errors from Elasticsearch.
Thanks
Kiran
It seems the first method works for me, when I declare the client object as a global variable.
from multiprocessing import Pool
from elasticsearch import Elasticsearch
import time
def task(body):
    # uses the module-level client created in the __main__ block
    result = es.index(index='test', doc_type='test', body=body)
    return result
def main():
    pool = Pool(processes=MAX_CONNECTS)
    result = []
    for x in range(10):
        result.append(pool.apply_async(task, ({'id': x},)))
    time.sleep(1)
    for rs in result:
        print(rs.get())
if __name__ == "__main__":
    MAX_CONNECTS = 5
    es = Elasticsearch(hosts="localhost", maxsize=MAX_CONNECTS)
    main()
The output looks like
{'_index': 'test', '_type': 'test', '_id': 'xEjqBWcB9xsUYKqz-P6U', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 1, '_primary_term': 1}
{'_index': 'test', '_type': 'test', '_id': 'w0jqBWcB9xsUYKqz-P6U', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1}
{'_index': 'test', '_type': 'test', '_id': 'x0jqBWcB9xsUYKqz-P6X', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 4, '_primary_term': 1}
{'_index': 'test', '_type': 'test', '_id': 'xkjqBWcB9xsUYKqz-P6X', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 3, '_primary_term': 1}
{'_index': 'test', '_type': 'test', '_id': 'xUjqBWcB9xsUYKqz-P6W', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 2, '_primary_term': 1}
{'_index': 'test', '_type': 'test', '_id': 'yEjqBWcB9xsUYKqz-P66', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 4, '_primary_term': 1}
{'_index': 'test', '_type': 'test', '_id': 'ykjqBWcB9xsUYKqz-P7I', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 2, '_primary_term': 1}
{'_index': 'test', '_type': 'test', '_id': 'yUjqBWcB9xsUYKqz-P7I', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 3, '_primary_term': 1}
{'_index': 'test', '_type': 'test', '_id': 'y0jqBWcB9xsUYKqz-P7P', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 4, '_primary_term': 1}
{'_index': 'test', '_type': 'test', '_id': 'zEjqBWcB9xsUYKqz-P7V', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 5, '_primary_term': 1}
The recommended way is to create a single client object; you can increase the number of simultaneous connections it allows with the maxsize parameter (10 by default).
es = Elasticsearch( "host1", maxsize=25)
Source
