What is the correct way to use elasticsearch-py in a multiprocessing script? Should I create a single client object before starting the processes and share it across them, or should I create a new client inside each process? The second approach gives me connection errors from Elasticsearch.
Thanks
Kiran
It seems the first method works for me, provided I declare the client object as a global variable.
from multiprocessing import Pool
from elasticsearch import Elasticsearch
import time

def task(body):
    # the global client created in the parent process is inherited
    # by the forked worker processes
    result = es.index(index='test', doc_type='test', body=body)
    return result

def main():
    pool = Pool(processes=MAX_CONNECTS)
    result = []
    for x in range(10):
        result.append(pool.apply_async(task, ({'id': x},)))
    time.sleep(1)
    for rs in result:
        print(rs.get())

if __name__ == "__main__":
    MAX_CONNECTS = 5
    es = Elasticsearch(hosts="localhost", maxsize=MAX_CONNECTS)
    main()
The output looks like
{'_index': 'test', '_type': 'test', '_id': 'xEjqBWcB9xsUYKqz-P6U', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 1, '_primary_term': 1}
{'_index': 'test', '_type': 'test', '_id': 'w0jqBWcB9xsUYKqz-P6U', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 0, '_primary_term': 1}
{'_index': 'test', '_type': 'test', '_id': 'x0jqBWcB9xsUYKqz-P6X', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 4, '_primary_term': 1}
{'_index': 'test', '_type': 'test', '_id': 'xkjqBWcB9xsUYKqz-P6X', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 3, '_primary_term': 1}
{'_index': 'test', '_type': 'test', '_id': 'xUjqBWcB9xsUYKqz-P6W', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 2, '_primary_term': 1}
{'_index': 'test', '_type': 'test', '_id': 'yEjqBWcB9xsUYKqz-P66', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 4, '_primary_term': 1}
{'_index': 'test', '_type': 'test', '_id': 'ykjqBWcB9xsUYKqz-P7I', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 2, '_primary_term': 1}
{'_index': 'test', '_type': 'test', '_id': 'yUjqBWcB9xsUYKqz-P7I', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 3, '_primary_term': 1}
{'_index': 'test', '_type': 'test', '_id': 'y0jqBWcB9xsUYKqz-P7P', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 4, '_primary_term': 1}
{'_index': 'test', '_type': 'test', '_id': 'zEjqBWcB9xsUYKqz-P7V', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 1, 'failed': 0}, '_seq_no': 5, '_primary_term': 1}
The recommended way is to create a single client object; you can increase the number of simultaneous connections with the maxsize parameter (10 by default).
es = Elasticsearch("host1", maxsize=25)
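If you do want the second approach (one client per worker process), a common pattern is to build the client in a Pool initializer so no connection state crosses the fork boundary. This is a minimal sketch, not the answer's method; the host and pool size are illustrative:

from multiprocessing import Pool
from elasticsearch import Elasticsearch

es = None  # set in each worker by init_worker

def init_worker():
    # create the client after the worker process starts, so no
    # sockets are shared with the parent
    global es
    es = Elasticsearch(hosts="localhost")

def task(body):
    return es.index(index='test', doc_type='test', body=body)

pool = Pool(processes=5, initializer=init_worker)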
I have a list with barline ticks and midi notes that can overlap the barlines. So I made a list of 'barlineticks':
barlinepos = [0, 768.0, 1536.0, 2304.0, 3072.0, 3840.0, 4608.0, 5376.0, 6144.0, 6912.0, 0, 576.0, 1152.0, 1728.0, 2304.0, 2880.0, 3456.0, 4032.0, 4608.0, 5184.0, 5760.0, 6336.0, 6912.0, 7488.0]
And a MidiFile:
{'type': 'time_signature', 'numerator': 4, 'denominator': 4, 'time': 0, 'duration': 768, 'ID': 0}
{'type': 'set_tempo', 'tempo': 500000, 'time': 0, 'ID': 1}
{'type': 'track_name', 'name': 'Tempo Track', 'time': 0, 'ID': 2}
{'type': 'track_name', 'name': 'New Instrument', 'time': 0, 'ID': 3}
{'type': 'note_on', 'time': 0, 'channel': 0, 'note': 48, 'velocity': 100, 'ID': 4, 'duration': 956}
{'type': 'time_signature', 'numerator': 3, 'denominator': 4, 'time': 768, 'duration': 6911, 'ID': 5}
{'type': 'note_on', 'time': 768, 'channel': 0, 'note': 46, 'velocity': 100, 'ID': 6, 'duration': 575}
{'type': 'note_off', 'time': 956, 'channel': 0, 'note': 48, 'velocity': 0, 'ID': 7}
{'type': 'note_off', 'time': 1343, 'channel': 0, 'note': 46, 'velocity': 0, 'ID': 8}
{'type': 'end_of_track', 'time': 7679, 'ID': 9}
And I want to check if a MIDI note overlaps a barline. Every note_on message has a 'time' and a 'duration' value. I have to check whether one of the barline ticks (in the list) falls inside the range of the note, from 'time' to 'time' + 'duration'. I tried:
if barlinepos in range(0, 956):
    print(True)
Of course this doesn't work because barlinepos is a list. How can I check if one of the values in the list results in True?
Simple iteration solves the requirement:
for msg in midifile:
    if msg['type'] != 'note_on':
        continue  # only note_on messages carry a 'duration'
    start, end = msg['time'], msg['time'] + msg['duration']
    for tick in barlinepos:
        if start <= tick <= end:
            print(True)
            break
    else:
        # no barline tick fell inside this note's span
        print(False)
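Since the test is literally "does any tick fall in the range", the built-in any() expresses it directly; a minimal sketch under the same assumptions (note_on messages carrying 'time' and 'duration'):

for msg in midifile:
    if msg['type'] != 'note_on':
        continue
    start, end = msg['time'], msg['time'] + msg['duration']
    # True if at least one barline tick lies inside the note's span
    print(any(start <= tick <= end for tick in barlinepos))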
Given the following dictionary created from df['statistics'].head().to_dict()
{0: {'executions': {'total': '1',
'passed': '1',
'failed': '0',
'skipped': '0'},
'defects': {'product_bug': {'total': 0, 'PB001': 0},
'automation_bug': {'AB001': 0, 'total': 0},
'system_issue': {'total': 0, 'SI001': 0},
'to_investigate': {'total': 0, 'TI001': 0},
'no_defect': {'ND001': 0, 'total': 0}}},
1: {'executions': {'total': '1',
'passed': '1',
'failed': '0',
'skipped': '0'},
'defects': {'product_bug': {'total': 0, 'PB001': 0},
'automation_bug': {'AB001': 0, 'total': 0},
'system_issue': {'total': 0, 'SI001': 0},
'to_investigate': {'total': 0, 'TI001': 0},
'no_defect': {'ND001': 0, 'total': 0}}},
2: {'executions': {'total': '1',
'passed': '1',
'failed': '0',
'skipped': '0'},
'defects': {'product_bug': {'total': 0, 'PB001': 0},
'automation_bug': {'AB001': 0, 'total': 0},
'system_issue': {'total': 0, 'SI001': 0},
'to_investigate': {'total': 0, 'TI001': 0},
'no_defect': {'ND001': 0, 'total': 0}}},
3: {'executions': {'total': '1',
'passed': '1',
'failed': '0',
'skipped': '0'},
'defects': {'product_bug': {'total': 0, 'PB001': 0},
'automation_bug': {'AB001': 0, 'total': 0},
'system_issue': {'total': 0, 'SI001': 0},
'to_investigate': {'total': 0, 'TI001': 0},
'no_defect': {'ND001': 0, 'total': 0}}},
4: {'executions': {'total': '1',
'passed': '1',
'failed': '0',
'skipped': '0'},
'defects': {'product_bug': {'total': 0, 'PB001': 0},
'automation_bug': {'AB001': 0, 'total': 0},
'system_issue': {'total': 0, 'SI001': 0},
'to_investigate': {'total': 0, 'TI001': 0},
'no_defect': {'ND001': 0, 'total': 0}}}}
Is there a way to expand the dictionary key/value pairs into their own columns and prefix these columns with the name of the original column, i.e. statistics.executions.total would become statistics_executions_total or even executions_total?
I have demonstrated that I can create the columns using the following:
pd.concat([df.drop(['statistics'], axis=1), df['statistics'].apply(pd.Series)], axis=1)
However, you will notice that several of the newly created columns share the duplicate name "total".
So far I have not been able to find a way to prefix the newly created columns with the original column name, i.e. executions_total.
For additional insight, statistics expands into executions and defects; executions expands into passed | failed | skipped | total, and defects expands into automation_bug | system_issue | to_investigate | product_bug | no_defect. The latter then expand into total | **001 columns, where total is duplicated several times.
Any ideas are greatly appreciated. -Thanks!
.apply(pd.Series) is slow; don't use it.
See timing in Splitting dictionary/list inside a Pandas Column into Separate Columns
Create a DataFrame with a 'statistics' column from the dict in the OP.
This will create a DataFrame with a column of dictionaries.
Use pandas.json_normalize on the 'statistics' column.
The default sep is '.'.
Nested records will generate names separated by sep.
import pandas as pd
# this is for setting up the test dataframe from the data in the question, where data is the name of the dict
df = pd.DataFrame({'statistics': list(data.values())})
# display(df)
statistics
0 {'executions': {'total': '1', 'passed': '1', 'failed': '0', 'skipped': '0'}, 'defects': {'product_bug': {'total': 0, 'PB001': 0}, 'automation_bug': {'AB001': 0, 'total': 0}, 'system_issue': {'total': 0, 'SI001': 0}, 'to_investigate': {'total': 0, 'TI001': 0}, 'no_defect': {'ND001': 0, 'total': 0}}}
1 {'executions': {'total': '1', 'passed': '1', 'failed': '0', 'skipped': '0'}, 'defects': {'product_bug': {'total': 0, 'PB001': 0}, 'automation_bug': {'AB001': 0, 'total': 0}, 'system_issue': {'total': 0, 'SI001': 0}, 'to_investigate': {'total': 0, 'TI001': 0}, 'no_defect': {'ND001': 0, 'total': 0}}}
2 {'executions': {'total': '1', 'passed': '1', 'failed': '0', 'skipped': '0'}, 'defects': {'product_bug': {'total': 0, 'PB001': 0}, 'automation_bug': {'AB001': 0, 'total': 0}, 'system_issue': {'total': 0, 'SI001': 0}, 'to_investigate': {'total': 0, 'TI001': 0}, 'no_defect': {'ND001': 0, 'total': 0}}}
3 {'executions': {'total': '1', 'passed': '1', 'failed': '0', 'skipped': '0'}, 'defects': {'product_bug': {'total': 0, 'PB001': 0}, 'automation_bug': {'AB001': 0, 'total': 0}, 'system_issue': {'total': 0, 'SI001': 0}, 'to_investigate': {'total': 0, 'TI001': 0}, 'no_defect': {'ND001': 0, 'total': 0}}}
4 {'executions': {'total': '1', 'passed': '1', 'failed': '0', 'skipped': '0'}, 'defects': {'product_bug': {'total': 0, 'PB001': 0}, 'automation_bug': {'AB001': 0, 'total': 0}, 'system_issue': {'total': 0, 'SI001': 0}, 'to_investigate': {'total': 0, 'TI001': 0}, 'no_defect': {'ND001': 0, 'total': 0}}}
# normalize the statistics column
dfs = pd.json_normalize(df.statistics)
# display(dfs)
total passed failed skipped product_bug.total product_bug.PB001 automation_bug.AB001 automation_bug.total system_issue.total system_issue.SI001 to_investigate.total to_investigate.TI001 no_defect.ND001 no_defect.total
0 1 1 0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
2 1 1 0 0 0 0 0 0 0 0 0 0 0 0
3 1 1 0 0 0 0 0 0 0 0 0 0 0 0
4 1 1 0 0 0 0 0 0 0 0 0 0 0 0
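To get the underscore-separated names asked for in the question, json_normalize also accepts a sep argument; a minimal sketch:

# join nested keys with '_' instead of the default '.'
dfs = pd.json_normalize(df.statistics, sep='_')

Nested names are then joined with underscores, e.g. product_bug.total becomes product_bug_total.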
Code is below:
r = [{'eid': '1', 'data': 'Health'},
     {'eid': '2', 'data': 'countries'},
     {'eid': '3', 'data': 'countries currency'},
     {'eid': '4', 'data': 'countries language'}]

from elasticsearch import Elasticsearch

es = Elasticsearch()
es.cluster.health()
es.indices.create(index='my-index_1', ignore=400)

for e in enumerate(r):
    es.index(index="my-index_1", body=e[1])

search1 = es.search(index="my-index_1", body={'query': {'term': {'data.keyword': 'Health'}}})
search1
First time, the output is below:
{'took': 0,
'timed_out': False,
'_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
'hits': {'total': {'value': 0, 'relation': 'eq'},
'max_score': None,
'hits': []}}
Second time
{'took': 0,
'timed_out': False,
'_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
'hits': {'total': {'value': 1, 'relation': 'eq'},
'max_score': 1.2039728,
'hits': [{'_index': 'my-index_1',
'_type': '_doc',
'_id': 'Rov4UHMBpo0uANDoY2_5',
'_score': 1.2039728,
'_source': {'eid': '1', 'data': 'Health'}}]}}
Third time
{'took': 0,
'timed_out': False,
'_shards': {'total': 1, 'successful': 1, 'skipped': 0, 'failed': 0},
'hits': {'total': {'value': 2, 'relation': 'eq'},
'max_score': 1.2809337,
'hits': [{'_index': 'my-index_1',
'_type': '_doc',
'_id': 'Rov4UHMBpo0uANDoY2_5',
'_score': 1.2809337,
'_source': {'eid': '1', 'data': 'Health'}},
{'_index': 'my-index_1',
'_type': '_doc',
'_id': 'aov4UHMBpo0uANDonm_E',
'_score': 1.2809337,
'_source': {'eid': '1', 'data': 'Health'}}]}}
The hit below keeps repeating each time I run the script again:
{'_index': 'my-index_1',
'_type': '_doc',
'_id': 'aov4UHMBpo0uANDonm_E',
'_score': 1.2809337,
'_source': {'eid': '1', 'data': 'Health'}}
Is it because of enumerate? My input is a list of dictionaries, each with multiple keys; how should I index it instead?
My expected output is that each matching document appears only once per search.
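For what it's worth, the duplication is probably not caused by enumerate: calling es.index without an explicit id auto-generates a fresh _id on every call, so re-running the script inserts the same payloads again as new documents. A minimal sketch that makes re-runs idempotent, assuming eid is unique per record:

for doc in r:
    # use the record's own eid as the document _id, so re-running
    # the script overwrites the document instead of duplicating it
    es.index(index="my-index_1", id=doc['eid'], body=doc)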
input={11: {'perc': 0, 'name': u'B test', 'cid': 11, 'total': 0, 'pending': 0, 'complete': 0}, 10: {'perc': 0, 'name': u'C test', 'cid': 10, 'total': 0, 'pending': 0,'complete': 0}, 3: {'perc': 9, 'name': u'Atest Pre-requisites', 'cid': 3, 'total': 11, 'pending': 10, 'complete': 1}}
I want to sort this dict based on the name field. I'm new to Python; can anyone please help me?
First, you should avoid reusing built-in names (such as input) for variables: after the assignment, input no longer refers to the built-in function input().
Also, a dictionary cannot be sorted. If you don't need the keys, you can transform the dictionary into a list, and then sort it. The code would be like this:
input_dict = {11: {'perc': 0, 'name': u'B test', 'cid': 11, 'total': 0, 'pending': 0, 'complete': 0}, 10: {'perc': 0, 'name': u'C test', 'cid': 10, 'total': 0, 'pending': 0,'complete': 0}, 3: {'perc': 9, 'name': u'Atest Pre-requisites', 'cid': 3, 'total': 11, 'pending': 10, 'complete': 1}}
input_list = sorted(input_dict.values(), key=lambda x: x['name'])
print(input_list)
# prints [{'perc': 9, 'complete': 1, 'cid': 3, 'total': 11, 'pending': 10, 'name': u'Atest Pre-requisites'}, {'perc': 0, 'complete': 0, 'cid': 11, 'total': 0, 'pending': 0, 'name': u'B test'}, {'perc': 0, 'complete': 0, 'cid': 10, 'total': 0, 'pending': 0, 'name': u'C test'}]
EDIT
If you wish to keep the keys and use iteritems() as you said in the comments, use this code instead:
input_dict = {11: {'perc': 0, 'name': u'B test', 'cid': 11, 'total': 0, 'pending': 0, 'complete': 0}, 10: {'perc': 0, 'name': u'C test', 'cid': 10, 'total': 0, 'pending': 0,'complete': 0}, 3: {'perc': 9, 'name': u'Atest Pre-requisites', 'cid': 3, 'total': 11, 'pending': 10, 'complete': 1}}
input_list = sorted(input_dict.iteritems(), key=lambda x: x[1]['name'])
print(input_list)
# prints [(3, {'perc': 9, 'complete': 1, 'cid': 3, 'total': 11, 'pending': 10, 'name': u'Atest Pre-requisites'}), (11, {'perc': 0, 'complete': 0, 'cid': 11, 'total': 0, 'pending': 0, 'name': u'B test'}), (10, {'perc': 0, 'complete': 0, 'cid': 10, 'total': 0, 'pending': 0, 'name': u'C test'})]
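Note that iteritems() exists only in Python 2; on Python 3 use items(). And since dicts preserve insertion order on Python 3.7+, you can even rebuild a dict sorted by name. A minimal sketch:

# Python 3: items() replaces iteritems()
input_list = sorted(input_dict.items(), key=lambda kv: kv[1]['name'])

# Python 3.7+: dicts keep insertion order, so this is a dict sorted by name
sorted_dict = dict(sorted(input_dict.items(), key=lambda kv: kv[1]['name']))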
I've created a complex query builder in my project, and during tests I stumbled upon a strange issue: the same query with the same plan produces different results on different clients. cx_Oracle ignores the order by clause, while Oracle SQL Developer processes the query correctly; in both cases the order by is present in the plan.
Query in question is:
select *
from
(
    select
        a.*,
        ROWNUM tmp__rnum
    from
    (
        select base.*
        from
        (
            select id
            from
            (
                (
                    select
                        profile_id as id,
                        surname as sort__col
                    from names
                )
                /* here usually are several other subqueries chained by unions */
            )
            group by id
            order by min(sort__col) asc
        ) tmp
        left join (profiles) base
        on tmp.id = base.id
        where exists
        (
            select t.object_id
            from object_rights t
            where
                t.object_id = base.id
                and t.subject_id = :a__subject_id
                and t.rights in ('r','w')
        )
    ) a
    where ROWNUM < :rows_to
)
where tmp__rnum >= :rows_from
and the plan from cx_Oracle, in case I missed anything:
{'operation': 'SELECT STATEMENT', 'position': 9225, 'cardinality': 2164, 'time': 1, 'cost': 9225, 'depth': 0, 'bytes': 84396, 'optimizer': 'ALL_ROWS', 'id': 0, 'cpu_cost': 1983805801},
{'operation': 'VIEW', 'position': 1, 'filter_predicates': '"TMP__RNUM">=TO_NUMBER(:ROWS_FROM)', 'parent_id': 0, 'object_instance': 1, 'cardinality': 2164, 'projection': '"from$_subquery$_001"."ID"[NUMBER,22], "from$_subquery$_001"."CREATION_TIME"[TIMESTAMP,11], "TMP__RNUM"[NUMBER,22]', 'time': 1, 'cost': 9225, 'depth': 1, 'bytes': 84396, 'id': 1, 'cpu_cost': 1983805801},
{'operation': 'COUNT', 'position': 1, 'filter_predicates': 'ROWNUM<TO_NUMBER(:ROWS_TO)', 'parent_id': 1, 'projection': '"BASE"."ID"[NUMBER,22], "BASE"."CREATION_TIME"[TIMESTAMP,11], ROWNUM[8]', 'options': 'STOPKEY', 'depth': 2, 'id': 2},
{'operation': 'HASH JOIN', 'position': 1, 'parent_id': 2, 'access_predicates': '"TMP"."ID"="BASE"."ID"', 'cardinality': 2164, 'projection': '(#keys=1) "BASE"."ID"[NUMBER,22], "BASE"."CREATION_TIME"[TIMESTAMP,11]', 'time': 1, 'cost': 9225, 'depth': 3, 'bytes': 86560, 'id': 3, 'cpu_cost': 1983805801},
{'operation': 'JOIN FILTER', 'position': 1, 'parent_id': 3, 'object_owner': 'SYS', 'cardinality': 2219, 'projection': '"BASE"."ID"[NUMBER,22], "BASE"."CREATION_TIME"[TIMESTAMP,11]', 'object_name': ':BF0000', 'time': 1, 'cost': 662, 'options': 'CREATE', 'depth': 4, 'bytes': 59913, 'id': 4, 'cpu_cost': 223290732},
{'operation': 'HASH JOIN', 'position': 1, 'parent_id': 4, 'access_predicates': '"T"."OBJECT_ID"="BASE"."ID"', 'cardinality': 2219, 'projection': '(#keys=1) "BASE"."ID"[NUMBER,22], "BASE"."CREATION_TIME"[TIMESTAMP,11]', 'time': 1, 'cost': 662, 'options': 'RIGHT SEMI', 'depth': 5, 'bytes': 59913, 'id': 5, 'cpu_cost': 223290732},
{'operation': 'TABLE ACCESS', 'position': 1, 'filter_predicates': '"T"."SUBJECT_ID"=TO_NUMBER(:A__SUBJECT_ID) AND ("T"."RIGHTS"=\'r\' OR "T"."RIGHTS"=\'w\')', 'parent_id': 5, 'object_type': 'TABLE', 'object_instance': 8, 'cardinality': 2219, 'projection': '"T"."OBJECT_ID"[NUMBER,22]', 'object_name': 'OBJECT_RIGHTS', 'time': 1, 'cost': 5, 'options': 'FULL', 'depth': 6, 'bytes': 24409, 'optimizer': 'ANALYZED', 'id': 6, 'cpu_cost': 1823386},
{'operation': 'TABLE ACCESS', 'position': 2, 'parent_id': 5, 'object_type': 'TABLE', 'object_instance': 6, 'cardinality': 753862, 'projection': '"BASE"."ID"[NUMBER,22], "BASE"."CREATION_TIME"[TIMESTAMP,11]', 'object_name': 'PROFILES', 'time': 1, 'cost': 654, 'options': 'FULL', 'depth': 6, 'bytes': 12061792, 'optimizer': 'ANALYZED', 'id': 7, 'cpu_cost': 145148296},
{'operation': 'VIEW', 'position': 2, 'parent_id': 3, 'object_instance': 3, 'cardinality': 735296, 'projection': '"TMP"."ID"[NUMBER,22]', 'time': 1, 'cost': 8559, 'depth': 4, 'bytes': 9558848, 'id': 8, 'cpu_cost': 1686052619},
{'operation': 'SORT', 'position': 1, 'parent_id': 8, 'cardinality': 735296, 'projection': '(#keys=1) MIN("SURNAME")[50], "PROFILE_ID"[NUMBER,22]', 'time': 1, 'cost': 8559, 'options': 'ORDER BY', 'temp_space': 18244000, 'depth': 5, 'bytes': 10294144, 'id': 9, 'cpu_cost': 1686052619},
{'operation': 'HASH', 'position': 1, 'parent_id': 9, 'cardinality': 735296, 'projection': '(#keys=1; rowset=200) "PROFILE_ID"[NUMBER,22], MIN("SURNAME")[50]', 'time': 1, 'cost': 8559, 'options': 'GROUP BY', 'temp_space': 18244000, 'depth': 6, 'bytes': 10294144, 'id': 10, 'cpu_cost': 1686052619},
{'operation': 'JOIN FILTER', 'position': 1, 'parent_id': 10, 'object_owner': 'SYS', 'cardinality': 756586, 'projection': '(rowset=200) "PROFILE_ID"[NUMBER,22], "SURNAME"[VARCHAR2,50]', 'object_name': ':BF0000', 'time': 1, 'cost': 1202, 'options': 'USE', 'depth': 7, 'bytes': 10592204, 'id': 11, 'cpu_cost': 190231639},
{'operation': 'TABLE ACCESS', 'position': 1, 'filter_predicates': 'SYS_OP_BLOOM_FILTER(:BF0000,"PROFILE_ID")', 'parent_id': 11, 'object_type': 'TABLE', 'object_instance': 5, 'cardinality': 756586, 'projection': '(rowset=200) "PROFILE_ID"[NUMBER,22], "SURNAME"[VARCHAR2,50]', 'object_name': 'NAMES', 'time': 1, 'cost': 1202, 'options': 'FULL', 'depth': 8, 'bytes': 10592204, 'optimizer': 'ANALYZED', 'id': 12, 'cpu_cost': 190231639}
cx_Oracle output (appears to be ordered by id):
ID, Created, rownum
(1829, 2016-08-24, 1)
(2438, 2016-08-24, 2)
SQLDeveloper Output (ordered by surname, as expected):
ID, Created, rownum
(518926, 2016-08-28, 1)
(565556, 2016-08-29, 2)
I don't see an ORDER BY clause that would affect the ordering of the results of the query. In SQL, the only way to guarantee the ordering of a result set is to have an ORDER BY clause for the outer-most SELECT.
In almost all cases, an ORDER BY in a subquery is not necessarily respected (Oracle makes an exception when there are rownum comparisons in the next level of the query -- and even that is now out of date with the support of FETCH FIRST <n> ROWS).
So, there is no reason to expect that an ORDER BY in the innermost subquery would have any effect, particularly with the JOIN that then happens.
Suggestions:
Move the ORDER BY to the outermost query.
Use FETCH FIRST syntax, if you are using Oracle 12c+.
Move the ORDER BY after the JOIN.
Use ROW_NUMBER() instead of rownum (see the sketch below).
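A minimal sketch combining these suggestions, using illustrative names from the question (names, profiles, sort__col); it assumes the grouped subquery also exposes the sort key, so treat it as a pattern rather than a drop-in rewrite of the full query:

select *
from
(
    select
        base.*,
        -- rank after the join, so the join cannot disturb the ordering
        ROW_NUMBER() over (order by tmp.sort__col asc) as tmp__rnum
    from
    (
        select profile_id as id, min(surname) as sort__col
        from names
        group by profile_id
    ) tmp
    join profiles base
    on tmp.id = base.id
)
where tmp__rnum >= :rows_from
and tmp__rnum < :rows_to
order by tmp__rnum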