Elasticsearch scroll upper limit - Python API

Is there a way, using the Python API, to set an upper limit on the number of documents retrieved when scrolling in chunks of a specific size? Say I want a maximum of 100K documents scrolled in chunks of 2K, where the index has over 10 million documents available.
I've implemented a counter-like object, but I want to know if there is a more natural solution.
es_query = {"query": {"function_score": {"functions": [{"random_score": {"seed": "1234"}}]}}}
es = Elasticsearch(ADDRESS, port=PORT)
result = es.search(
index="INDEX",
doc_type="DOC_TYPE",
body=es_query,
size=2000,
scroll="1m")
data = []
for hit in result["hits"]["hits"]:
for d in hit["_source"]["attributes"]["data_of_interest"]:
data.append(d)
do_something(*args)
scroll_id = result['_scroll_id']
scroll_size = result["hits"]["total"]
i = 0
while(scroll_size>0):
if i % 10000 == 0:
print("Scrolling ({})...".format(i))
result = es.scroll(scroll_id=scroll_id, scroll="1m")
scroll_id = result["_scroll_id"]
scroll_size = len(result['hits']['hits'])
data = []
for hit in result["hits"]["hits"]:
for d in hit["_source"]["attributes"]["data_of_interest"]:
data.append(d)
do_something(*args)
i += 1
if i == 100000:
break

To me, if you only want the first 100K documents, you should narrow your query in the first place; that will speed up your process. You can add a filter on date, for example.
Regarding the code, I do not know of another way than using the counter. I would just correct the indentation and remove the if statement for readability.
es_query = {"query": {"function_score": {"functions": [{"random_score": {"seed": "1234"}}]}}}
es = Elasticsearch(ADDRESS, port=PORT)
result = es.search(
index="INDEX",
doc_type="DOC_TYPE",
body=es_query,
size=2000,
scroll="1m")
data = []
for hit in result["hits"]["hits"]:
for d in hit["_source"]["attributes"]["data_of_interest"]:
data.append(d)
do_something(*args)
scroll_id = result['_scroll_id']
scroll_size = result["hits"]["total"]
i = 0
while(scroll_size > 0 & i < 100000):
print("Scrolling ({})...".format(i))
result = es.scroll(scroll_id=scroll_id, scroll="1m")
scroll_id = result["_scroll_id"]
scroll_size = len(result['hits']['hits'])
# data = [] why redefining the list ?
for hit in result["hits"]["hits"]:
for d in hit["_source"]["attributes"]["data_of_interest"]:
data.append(d)
do_something(*args)
i ++
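If the explicit counter feels unnatural, a sketch of an alternative (assuming the elasticsearch-py helpers module is available, and reusing ADDRESS, PORT, INDEX and do_something from the question) is to let elasticsearch.helpers.scan drive the scroll and cap it with itertools.islice:
from itertools import islice

from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch(ADDRESS, port=PORT)
es_query = {"query": {"function_score": {"functions": [{"random_score": {"seed": "1234"}}]}}}

# scan() manages the scroll_id bookkeeping and yields one hit at a time;
# preserve_order=True keeps the random_score ordering instead of the default _doc sort;
# islice() stops the generator after 100K hits, so no counter variable is needed.
hits = scan(es, query=es_query, index="INDEX", size=2000, scroll="1m", preserve_order=True)
for hit in islice(hits, 100000):
    for d in hit["_source"]["attributes"]["data_of_interest"]:
        do_something(d)  # stand-in for the original processing step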

Related

How to order a Python dictionary containing a list of values

I'm not sure I am approaching this in the right way.
Scenario:
I have two SQL tables that contain rent information. One table contains rent due, and the other contains rent received.
I'm trying to build a rent book which takes the data from both tables for a specific lease and generates a date ordered statement which will be displayed on a webpage.
I'm using Python, Flask and SQL Alchemy.
I am currently learning Python, so I'm not sure if my approach is the best.
I've created a dictionary with the keys 'Date', 'Payment type' and 'Payment Amount', and under each key I store a list containing the data from my SQL queries. The bit I'm struggling with is how to sort the dictionary by the date key while keeping the values under the other keys aligned to their date.
lease_id = 5
dates_list = []
type_list = []
amounts_list = []
rentbook_dict = {}
payments_due = Expected_Rent_Model.query.filter(Expected_Rent_Model.lease_id == lease_id).all()
payments_received = Rent_And_Fee_Income_Model.query.filter(Rent_And_Fee_Income_Model.lease_id == lease_id).all()
for item in payments_due:
    dates_list.append(item.expected_rent_date)
    type_list.append('Rent Due')
    amounts_list.append(item.expected_rent_amount)
for item in payments_received:
    dates_list.append(item.payment_date)
    type_list.append(item.payment_type)
    amounts_list.append(item.payment_amount)
rentbook_dict.setdefault('Date', []).append(dates_list)
rentbook_dict.setdefault('Type', []).append(type_list)
rentbook_dict.setdefault('Amount', []).append(amounts_list)
I was then going to use a for loop within the flask template to iterate through each value and display it in a table on the page.
Or am I approaching this in the wrong way?
So I managed to get this working just using zipped lists. I'm sure there is a better way to accomplish this, but I'm pleased I've got it working.
lease_id = 5
payments_due = Expected_Rent_Model.query.filter(Expected_Rent_Model.lease_id == lease_id).all()
payments_received = Rent_And_Fee_Income_Model.query.filter(Rent_And_Fee_Income_Model.lease_id == lease_id).all()
total_due = 0
for debit in payments_due:
    total_due = total_due + int(debit.expected_rent_amount)
total_received = 0
for income in payments_received:
    total_received = total_received + int(income.payment_amount)
balance = total_received - total_due
if balance < 0:
    arrears = "This account is in arrears"
else:
    arrears = ""
dates_list = []
type_list = []
amounts_list = []
for item in payments_due:
    dates_list.append(item.expected_rent_date)
    type_list.append('Rent Due')
    amounts_list.append(item.expected_rent_amount)
for item in payments_received:
    dates_list.append(item.payment_date)
    type_list.append(item.payment_type)
    amounts_list.append(item.payment_amount)
payment_data = zip(dates_list, type_list, amounts_list)
sorted_payment_data = sorted(payment_data)
tuples = zip(*sorted_payment_data)
list1, list2, list3 = [list(tup) for tup in tuples]
return render_template('rentbook.html',
                       payment_data=zip(list1, list2, list3),
                       total_due=total_due,
                       total_received=total_received,
                       balance=balance)
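For what it's worth, a slightly more direct variant of the same idea (a sketch only, reusing the model objects queried above and assuming the date columns compare cleanly) builds one list of (date, type, amount) tuples and sorts it, rather than keeping three parallel lists in step:
from operator import itemgetter

rows = [(item.expected_rent_date, 'Rent Due', item.expected_rent_amount)
        for item in payments_due]
rows += [(item.payment_date, item.payment_type, item.payment_amount)
         for item in payments_received]

# sort the combined statement by the first element of each tuple (the date)
rows.sort(key=itemgetter(0))

return render_template('rentbook.html',
                       payment_data=rows,
                       total_due=total_due,
                       total_received=total_received,
                       balance=balance)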

Why do I have a large gap between Elasticsearch and Snowflake?

I have been tasked with building a process in Python that extracts data from Elasticsearch, drops it in an Azure Blob, after which Snowflake ingests it. The process runs on Azure Functions: it extracts an index group (like game_name.*) and, for each index in the group, creates a thread to scroll on. I save the last date of each result and on the next run pass it into the range query. I run the process every five minutes and have offset the end of the range by 5 minutes (we have a refresh running every 2 minutes). I let the process run for a while and then do a gap analysis by taking a count(*) in both Elasticsearch and Snowflake by hour (or by day), expecting a gap of at most 1%. However, for one index pattern which groups about 127 indexes, when I run a catch-up job (for a day or more) the resulting gap is as expected, but as soon as I let it run on the cron job (every 5 min), after a while I get gaps of 6-10%, and only for this index group.
It looks as if the scroller function picks up an N amount of documents within the queried range but then, for some reason, documents are later added (PUT) with an earlier date. Or I might be wrong and my code is doing something funny. I've talked to our team: they don't cache any docs on the client, and the data is synced to a network clock (not the client's) and sent in UTC.
Please see below the query I am using to paginate through Elasticsearch:
def query(searchSize, lastRowDateOffset, endDate, pit, keep_alive):
    body = {
        "size": searchSize,
        "query": {
            "bool": {
                "must": [
                    {
                        "exists": {
                            "field": "baseCtx.date"
                        }
                    },
                    {
                        "range": {
                            "baseCtx.date": {
                                "gt": lastRowDateOffset,
                                "lte": endDate
                            }
                        }
                    }
                ]
            }
        },
        "pit": {
            "id": pit,
            "keep_alive": keep_alive
        },
        "sort": [
            {
                "baseCtx.date": {"order": "asc", "unmapped_type": "long"}
            },
            {
                "_shard_doc": "asc"
            }
        ],
        "track_total_hits": False
    }
    return body
def scroller(pit,
             threadNr,
             index,
             lastRowDateOffset,
             endDate,
             maxThreads,
             es,
             lastRowCount,
             keep_alive="1m",
             searchSize=10000):
    cumulativeResultCount = 0
    iterationResultCount = 0
    data = []
    dataBytes = b''
    lastIndexDate = ''
    startScroll = time.perf_counter()
    while 1:
        if lastRowCount == 0: break
        #if lastRowDateOffset == endDate: lastRowCount = 0; break
        try:
            page = es.search(body=body)
        except:  # It is believed that the point in time is getting closed, hence the below opens a new one
            pit = es.open_point_in_time(index=index, keep_alive=keep_alive)['id']
            body = query(searchSize, lastRowDateOffset, endDate, pit, keep_alive)
            page = es.search(body=body)
        pit = page['pit_id']
        data += page['hits']['hits']
        body['pit']['id'] = pit
        if len(data) > 0: body['search_after'] = [x['sort'] for x in page['hits']['hits']][-1]
        cumulativeResultCount += len(page['hits']['hits'])
        iterationResultCount = len(page['hits']['hits'])
        #print(f"This Iteration Result Count: {iterationResultCount} -- Cumulative Results Count: {cumulativeResultCount} -- {time.perf_counter() - startScroll} seconds")
        if iterationResultCount < searchSize: break
        if len(data) > rowsPerMB * maxSizeMB / maxThreads: break
        if time.perf_counter() - startScroll > maxProcessTimeSeconds: break
    if len(data) != 0:
        dataBytes = gzip.compress(bytes(json.dumps(data)[1:-1], encoding='utf-8'))
        lastIndexDate = max([x['_source']['baseCtx']['date'] for x in data])
    response = {
        "pit": pit,
        "index": index,
        "threadNr": threadNr,
        "dataBytes": dataBytes,
        "lastIndexDate": lastIndexDate,
        "cumulativeResultCount": cumulativeResultCount
    }
    return response
def batch(game_name, env='prod', startDate='auto', endDate='auto', writeDate=True, minutesOffset=5):
    es = Elasticsearch(
        esUrl,
        port=9200,
        timeout=300)
    lowerFormat = game_name.lower().replace(" ", "_")
    indexGroup = lowerFormat + "*"
    if env == 'dev': lowerFormat, indexGroup = 'dev_' + lowerFormat, 'dev.' + indexGroup
    azFormat = re.sub(r'[^0-9a-zA-Z]+', '-', game_name).lower()
    storageContainerName = azFormat
    curFileName = f"{lowerFormat}_cursors.json"
    curBlobFilePath = f"cursors/{curFileName}"
    compressedTools = [gzip.compress(bytes('[', encoding='utf-8')), gzip.compress(bytes(',', encoding='utf-8')), gzip.compress(bytes(']', encoding='utf-8'))]
    pits = []
    lastRowCounts = []
    # Parameter and state settings
    if os.getenv(f"{lowerFormat}_maxSizeMB") is not None: maxSizeMB = int(os.getenv(f"{lowerFormat}_maxSizeMB"))
    if os.getenv(f"{lowerFormat}_maxThreads") is not None: maxThreads = int(os.getenv(f"{lowerFormat}_maxThreads"))
    if os.getenv(f"{lowerFormat}_maxProcessTimeSeconds") is not None: maxProcessTimeSeconds = int(os.getenv(f"{lowerFormat}_maxProcessTimeSeconds"))
    # Get all indices for the indexGroup
    indicesEs = list(set([(re.findall(r"^.*-", x)[0][:-1] if '-' in x else x) + '*' for x in list(es.indices.get(indexGroup).keys())]))
    indices = [{"indexName": x, "lastOffsetDate": (datetime.datetime.utcnow()-datetime.timedelta(days=5)).strftime("%Y/%m/%d 00:00:00")} for x in indicesEs]
    # Load Cursors
    cursors = getCursors(curBlobFilePath, indices)
    # Offset the current time by -5 minutes to account for the 2-3 min delay in Elasticsearch
    initTime = datetime.datetime.utcnow()
    if endDate == 'auto': endDate = f"{initTime-datetime.timedelta(minutes=minutesOffset):%Y/%m/%d %H:%M:%S}"
    print(f"Less than or Equal to: {endDate}, {keep_alive}")
    # Start Multi-Threading
    while 1:
        dataBytes = []
        dataSize = 0
        start = time.perf_counter()
        if len(pits) == 0: pits = ['' for x in range(len(cursors))]
        if len(lastRowCounts) == 0: lastRowCounts = ['' for x in range(len(cursors))]
        with concurrent.futures.ThreadPoolExecutor(max_workers=len(cursors)) as executor:
            results = [
                executor.submit(
                    scroller,
                    pit,
                    threadNr,
                    x['indexName'],
                    x['lastOffsetDate'] if startDate == 'auto' else startDate,
                    endDate,
                    len(cursors),
                    es,
                    lastRowCount,
                    keep_alive,
                    searchSize) for x, pit, threadNr, lastRowCount in (zip(cursors, pits, list(range(len(cursors))), lastRowCounts))
            ]
            for f in concurrent.futures.as_completed(results):
                if f.result()['lastIndexDate'] != '': cursors[f.result()['threadNr']]['lastOffsetDate'] = f.result()['lastIndexDate']
                pits[f.result()['threadNr']] = f.result()['pit']
                lastRowCounts[f.result()['threadNr']] = f.result()['cumulativeResultCount']
                dataSize += f.result()['cumulativeResultCount']
                if len(f.result()['dataBytes']) > 0: dataBytes.append(f.result()['dataBytes'])
                print(f"Thread {f.result()['threadNr']+1}/{len(cursors)} -- Index {f.result()['index']} -- Results pulled {f.result()['cumulativeResultCount']} -- Cumulative Results: {dataSize} -- Process Time: {round(time.perf_counter()-start, 2)} sec")
        if dataSize == 0: break
        lastRowDateOffsetDT = datetime.datetime.strptime(max([x['lastOffsetDate'] for x in cursors]), '%Y/%m/%d %H:%M:%S')
        outFile = f"elasticsearch/live/{lastRowDateOffsetDT:%Y/%m/%d/%H}/{lowerFormat}_live_{lastRowDateOffsetDT:%Y%m%d%H%M%S}_{datetime.datetime.utcnow():%Y%m%d%H%M%S}.json.gz"
        print(f"Starting compression of {dataSize} rows -- {round(time.perf_counter()-start, 2)} sec")
        dataBytes = compressedTools[0] + compressedTools[1].join(dataBytes) + compressedTools[2]
        # Upload to Blob
        print(f"Comencing to upload data to blob -- {round(time.perf_counter()-start, 2)} sec")
        uploadJsonGzipBlobBytes(outFile, dataBytes, storageContainerName, len(dataBytes))
        print(f"File compiled: {outFile} -- {dataSize} rows -- Process Time: {round(time.perf_counter()-start, 2)} sec\n")
        # Update cursors
        if writeDate: postCursors(curBlobFilePath, cursors)
    # Clean Up
    print("Closing PITs")
    for pit in pits:
        try: es.close_point_in_time({"id": pit})
        except: pass
    print(f"Closing Connection to {esUrl}")
    es.close()
    return

# Start the process
while 1:
    batch("My App")
I think I just need a second pair of eyes to point out where the issue might be in the code. I've tried increasing the minutesOffset argument to 60 (so every 5 minutes it pulls the data from the last run until Now()-60 minutes) but had the same issue. Please help.
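One way to verify the late-indexing hypothesis (a sketch only; it reuses the es client and the baseCtx.date field from the code above, while the index pattern and window are illustrative) is to count an already-processed window, wait, and count it again; a higher second count means documents are still being indexed into that window after the fact:
import time

window_query = {
    "query": {
        "range": {
            "baseCtx.date": {
                "gte": "2021/01/01 10:00:00",  # illustrative, an already-processed hour
                "lt": "2021/01/01 11:00:00"
            }
        }
    }
}

first = es.count(index="game_name.*", body=window_query)['count']
time.sleep(600)  # wait ten minutes
second = es.count(index="game_name.*", body=window_query)['count']
print(f"documents added to the window after the first count: {second - first}")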
So the "baseCtx.date" is triggered by the client and it seems that in some cases there is a delay between when the event is triggered and when it is available to be searched. We fixed this by using the ingest pipeline as follows:
PUT _ingest/pipeline/indexDate
{
  "description": "Creates a timestamp when a document is initially indexed",
  "version": 1,
  "processors": [
    {
      "set": {
        "field": "indexDate",
        "value": "{{{_ingest.timestamp}}}",
        "tag": "indexDate"
      }
    }
  ]
}
We then set index.default_pipeline to "indexDate" in the template settings. The index name changes every month (we append the year and month), and this approach gives us a server-side date to scroll on.
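A hedged sketch of the template side via the Python client (the template name and index pattern below are illustrative, not taken from the original setup):
from elasticsearch import Elasticsearch

es = Elasticsearch(esUrl, port=9200)

# Make every new monthly index run the indexDate pipeline by default.
es.indices.put_template(
    name="game-events",                     # hypothetical template name
    body={
        "index_patterns": ["game_name.*"],  # illustrative pattern for the monthly indices
        "settings": {
            "index.default_pipeline": "indexDate"
        }
    }
)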

How to create a function to tell whether a value is increasing or decreasing?

I want to create comments from a dataset that details the growth rate, market share, etc. for various markets and products. The dataset is in the form of a pd.DataFrame(). I would like the comment to include keywords like increase/decrease based on the calculations; for example, if 2020 Jan has sales of 1000 and 2021 Jan has sales of 1600, that necessarily means an increase of 60%.
I defined a function outside my main code as follows, and I would like to ask whether this method is too clumsy; if so, how should I improve on it?
from collections import namedtuple

GrowthIncDec = namedtuple('gr_tuple', ['annual_growth_rate', 'quarterly_growth_rate'])

def increase_decrease(annual_gr, quarter_gr):
    if annual_gr > 0:
        annual_growth_rate = 'increased'
    elif annual_gr < 0:
        annual_growth_rate = 'decreased'
    else:
        annual_growth_rate = 'stayed the same'
    if quarter_gr > 0:
        quarterly_growth_rate = 'increased'
    elif quarter_gr < 0:
        quarterly_growth_rate = 'decreased'
    else:
        quarterly_growth_rate = 'stayed the same'
    gr_named_tuple = GrowthIncDec(annual_growth_rate=annual_growth_rate, quarterly_growth_rate=quarterly_growth_rate)
    return gr_named_tuple
myfunc = increase_decrease(5, -1)
myfunc.annual_growth_rate
output: 'increased'
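If the duplication is the main concern, one possible tidy-up (a sketch that keeps the same namedtuple, so calling code does not change) is to factor the repeated if/elif/else into a small helper:
from collections import namedtuple

GrowthIncDec = namedtuple('gr_tuple', ['annual_growth_rate', 'quarterly_growth_rate'])

def direction(rate):
    """Map a numeric growth rate to a descriptive word."""
    if rate > 0:
        return 'increased'
    if rate < 0:
        return 'decreased'
    return 'stayed the same'

def increase_decrease(annual_gr, quarter_gr):
    return GrowthIncDec(annual_growth_rate=direction(annual_gr),
                        quarterly_growth_rate=direction(quarter_gr))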
A snippet of my main code is as follows to illustrate the use of the above function:
def get_comments(grp, some_dict: Dict[str, List[str]]):
    .......
    try:
        subdf = the dataframe
        annual_gr = subdf['Annual_Growth'].values[0]
        quarter_gr = subdf['Quarterly_Growth'].values[0]
        inc_dec_named_tup = increase_decrease(annual_gr, quarter_gr)
        inc_dec_annual_gr = inc_dec_named_tup.annual_growth_rate
        inc_dec_quarterly_gr = inc_dec_named_tup.quarterly_growth_rate
        comment = "The {} has {} by {:.1%} in {} {} compared to {} {}"\
            .format(market, inc_dec_annual_gr, annual_gr, timeperiod, curr_date, timeperiod, prev_year)
        comments_df = pd.DataFrame(columns=['Date', 'Comments'])
        # comments_df['Date'] = [curr_date]
        comments_df['Comments'] = [comment]
        return comments_df
    except (IndexError, KeyError) as e:
        # this is for all those nan values which are empty
        annual_gr = 0
        quarter_gr = 0

Parsing Security Matrix Spreadsheet - NoneType is not Iterable

Trying to nest the no's and yes's with their respective applications and services.
That way, when a request comes in for a specific zone-to-zone sequence, a check can be run against this logic to verify accepted requests.
I have tried calling Decision_List[Zone_Name][yes_no].update, and I tried .append when it was a list type and not a dict, but there is no update method?
Base_Sheet = range(5, sh.ncols)
Column_Rows = range(1, sh.nrows)
for colnum in Base_Sheet:
    Zone_Name = sh.col_values(colnum)[0]
    Zone_App_Header = {sh.col_values(4)[0]: {}}
    Zone_Svc_Header = {sh.col_values(3)[0]: {}}
    Zone_Proto_Header = {sh.col_values(2)[0]: {}}
    Zone_DestPort_Header = {sh.col_values(1)[0]: {}}
    Zone_SrcPort_Header = {sh.col_values(0)[0]: {}}
    Decision_List = {Zone_Name: {}}
    for rows in Column_Rows:
        app_object = sh.col_values(4)[rows]
        svc_object = sh.col_values(3)[rows]
        proto_object = sh.col_values(3)[rows]
        dst_object = sh.col_values(2)[rows]
        src_object = sh.col_values(1)[rows]
        yes_no = sh.col_values(colnum)[rows]
        if yes_no not in Decision_List[Zone_Name]:
            Decision_List[Zone_Name][yes_no] = [app_object]
        else:
            Decision_List[Zone_Name] = [yes_no].append(app_object)
I would like it to present the info as follows:
Decision_List{Zone_Name: {yes: [ssh, ssl, soap], no: [web-browsing, facebook]}}
I would still like to know why I couldn't call the append method on that specific yes_no key whose value was a list.
But in the meantime, I made a workaround of sorts. I created a set as the key and gave the yes_no as the value. This allows me to pair many no-type values with keys that are a set of the application, port, service, etc., and then I can search for yes values and create additional dicts out of them for logic.
Any better ideas out there, I am all ears.
for rownum in range(0, sh.nrows):
    # row_val is all the values in the row of cell.index[rownum] as determined by rownum
    row_val = sh.row_values(rownum)
    col_val = sh.col_values(rownum)
    print rownum, col_val[0], col_val[1: CoR]
    header.append({col_val[0]: col_val[1: CoR]})
print header[0]['Start Port']

dec_tree = {}
count = 1
Base_Sheet = range(5, sh.ncols)
Column_Rows = range(1, sh.nrows)
for colnum in Base_Sheet:
    Zone_Name = sh.col_values(colnum)[0]
    Zone_App_Header = {sh.col_values(4)[0]: {}}
    Zone_Svc_Header = {sh.col_values(3)[0]: {}}
    Zone_Proto_Header = {sh.col_values(2)[0]: {}}
    Zone_DestPort_Header = {sh.col_values(1)[0]: {}}
    Zone_SrcPort_Header = {sh.col_values(0)[0]: {}}
    Decision_List = {Zone_Name: {}}
    for rows in Column_Rows:
        app_object = sh.col_values(4)[rows]
        svc_object = sh.col_values(3)[rows]
        proto_object = sh.col_values(3)[rows]
        dst_object = sh.col_values(2)[rows]
        src_object = sh.col_values(1)[rows]
        yes_no = sh.col_values(colnum)[rows]
        for rule_name in Decision_List.iterkeys():
            Decision_List[Zone_Name][(app_object, svc_object, proto_object)] = yes_no
Thanks again.
I think a still better way is to use collections.defaultdict.
This ensures that I am able to append to the specific yes_no key as I had originally intended.
from collections import defaultdict

for colnum in Base_Sheet:
    Zone_Name = sh.col_values(colnum)[0]
    Zone_App_Header = {sh.col_values(4)[0]: {}}
    Zone_Svc_Header = {sh.col_values(3)[0]: {}}
    Zone_Proto_Header = {sh.col_values(2)[0]: {}}
    Zone_DestPort_Header = {sh.col_values(1)[0]: {}}
    Zone_SrcPort_Header = {sh.col_values(0)[0]: {}}
    Decision_List = {Zone_Name: defaultdict(list)}
    for rows in Column_Rows:
        app_object = sh.col_values(4)[rows]
        svc_object = sh.col_values(3)[rows]
        proto_object = sh.col_values(2)[rows]
        dst_object = sh.col_values(1)[rows]
        src_object = sh.col_values(0)[rows]
        yes_no = sh.col_values(colnum)[rows]
        if yes_no not in Decision_List[Zone_Name]:
            Decision_List[Zone_Name][yes_no] = [app_object, svc_object, proto_object, dst_object, src_object]
        else:
            Decision_List[Zone_Name][yes_no].append([(app_object, svc_object, proto_object, dst_object, src_object)])
This allows me to then set the values as a set and append them as needed
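For what it's worth, with defaultdict(list) the membership check is unnecessary, since a missing yes_no key is created as an empty list on first access. A minimal sketch reusing the names from the loop above, appending one tuple per row so each rule stays grouped:
from collections import defaultdict

Decision_List = {Zone_Name: defaultdict(list)}
for rows in Column_Rows:
    app_object = sh.col_values(4)[rows]
    svc_object = sh.col_values(3)[rows]
    proto_object = sh.col_values(2)[rows]
    dst_object = sh.col_values(1)[rows]
    src_object = sh.col_values(0)[rows]
    yes_no = sh.col_values(colnum)[rows]
    # defaultdict(list) creates the list on first access, so no if/else is needed
    Decision_List[Zone_Name][yes_no].append(
        (app_object, svc_object, proto_object, dst_object, src_object))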

Python: How to Speed Up API Requests?

Problem: I am trying to extract data through an API service. A single request can take anywhere from 3 to 10 seconds. There are roughly 20,000 rows of data from a Pandas DataFrame to feed into the API call. I have managed to speed it up a bit through multiprocessing, but it's still running very slowly. Any suggestions?
Code:
import json
import multiprocessing
import time
from multiprocessing import Pool

import numpy as np
import pandas as pd
import requests

def scored_card_features2(source, n_batches):
    """Multiprocessing version of Scored Card Features Function
    Returns reason for rating
    """
    # read in source data and convert to list of lists for inputs
    data = pd.read_excel(source)
    data = data[['primary_bank_report_id', 'primary_tu_credit_report_id', 'purpose']]
    inputs = data.values.tolist()

    def scored_card_map(i):
        """form request to scored card service and retrieve values"""
        url = "url/FourthGen?bank_report_id=%s&credit_report_id=%s&" \
              "&loan_purpose=%s" % (i[0], i[1], i[2].replace(" ", "%20"))
        r = requests.get(url)
        try:
            d = json.loads(r.text)
            l = [d['probability_of_default'],
                 d['condition'],
                 d['purpose_of_loan'],
                 d['rating'],
                 d['bank_report_id'],
                 d['reason_for_rating'],
                 d['credit_report_id']]
            return l
        except:
            l = [np.nan] * 7
            return l

    # initiate multiprocessing
    with Pool(n_batches) as p:
        vals = p.map(scored_card_map, inputs)

    result = pd.DataFrame(vals, columns=['Probability of Default', 'Condition', 'Purpose of Loan', 'Rating', 'Bank Report ID',
                                         'Reason for Rating', 'Credit Report ID'])
    result = result.dropna(how='all')
    return result

if __name__ == '__main__':
    # model features
    start = time.time()
    df = scored_card_features2('BankCreditPortalIDsPurpose.xlsx', multiprocessing.cpu_count()-1)
    df.to_csv('scored_card_features.csv', index=False)
    end = time.time()
    print(end-start)
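Since each call spends most of its time waiting on the network rather than the CPU, threads usually scale better than processes here, and reusing a single requests.Session avoids re-establishing a connection for every request. A minimal sketch of that idea (the URL template and field names are carried over from the post; the worker count of 50 is an assumption to tune against what the API service can tolerate):
from concurrent.futures import ThreadPoolExecutor

import requests

session = requests.Session()  # reuses TCP connections across calls

def fetch(i, timeout=30):
    url = "url/FourthGen?bank_report_id=%s&credit_report_id=%s&" \
          "&loan_purpose=%s" % (i[0], i[1], i[2].replace(" ", "%20"))
    try:
        d = session.get(url, timeout=timeout).json()
        return [d['probability_of_default'], d['condition'], d['purpose_of_loan'],
                d['rating'], d['bank_report_id'], d['reason_for_rating'],
                d['credit_report_id']]
    except Exception:
        return [None] * 7

# inputs is the same list of lists built in scored_card_features2
with ThreadPoolExecutor(max_workers=50) as executor:
    vals = list(executor.map(fetch, inputs))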
