How to speed up ElasticSearch indexing? - python

I am a beginner with Elasticsearch and I have to write 1 million random events into an Elasticsearch cluster (hosted in the cloud) with a Python script...
es = Elasticsearch(
    [host_name],
    port=9243,
    http_auth=("*****", "*******"),
    use_ssl=True,
    verify_certs=True,
    ca_certs=certifi.where(),
    sniff_on_start=True
)
Here's my code for the indexing:
for i in range(1000000):
    src_centers = ['data center a', 'data center b', 'data center c', 'data center d', 'data center e']
    transfer_src = np.random.choice(src_centers, p=[0.3, 0.175, 0.175, 0.175, 0.175])
    dst_centers = [x for x in src_centers if x != transfer_src]
    transfer_dst = np.random.choice(dst_centers)
    final_transfer_status = ['transfer-success', 'transfer-failure']
    transfer_starttime = generate_timestamp()
    file_size = random.choice(range(1024, 10000000000))
    ftp = {
        'event_type': 'transfer-queued',
        'uuid': uuid.uuid4(),
        'src_site': transfer_src,
        'dst_site': transfer_dst,
        'timestamp': transfer_starttime,
        'bytes': file_size
    }
    print(i)
    es.index(index='ft_initial', id=(i+1), doc_type='initial_transfer_details', body=ftp)

    transfer_status = ['transfer-success', 'transfer-failure']
    final_status = np.random.choice(transfer_status, p=[0.95, 0.05])
    ftp['event_type'] = final_status
    if final_status == 'transfer-failure':
        time_delay = 10
    else:
        time_delay = int(transfer_time(file_size))  # ranges roughly from 0-10000 s
    ftp['timestamp'] = transfer_starttime + timedelta(seconds=time_delay)
    es.index(index='ft_final', id=(i+1), doc_type='final_transfer_details', body=ftp)
Is there an alternative way to speed up the process?
Any help/pointers will be appreciated. Thanks.

Use bulk requests, otherwise you have a lot of overhead for each single request: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
Change the refresh interval, ideally disabling it entirely until you're done: https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-update-settings.html#bulk
Use monitoring (there's a free basic license) to see what the actual bottleneck is (IO, memory, CPU): https://www.elastic.co/guide/en/x-pack/current/xpack-monitoring.html
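Concretely, a minimal sketch of the first two points with the Python client's bulk helper could look like the following (the chunk size, the placeholder document, and doing only ft_initial here are illustrative; the question's loop builds two documents per event, which would simply mean yielding two actions per iteration):
from elasticsearch import helpers

# Assumes `es` is the client created in the question.

def generate_actions(n):
    for i in range(n):
        doc = {'event_type': 'transfer-queued', 'bytes': 1024}  # build ftp exactly as in the question's loop
        yield {
            '_op_type': 'index',
            '_index': 'ft_initial',
            '_type': 'initial_transfer_details',  # only needed while the cluster still uses mapping types
            '_id': i + 1,
            '_source': doc,
        }

# Disable refresh for the duration of the load, then restore the 1s default.
es.indices.put_settings(index='ft_initial', body={'index': {'refresh_interval': '-1'}})
helpers.bulk(es, generate_actions(1000000), chunk_size=5000, request_timeout=120)
es.indices.put_settings(index='ft_initial', body={'index': {'refresh_interval': '1s'}})
A chunk_size in the low thousands is a common starting point; the monitoring from the third point tells you whether IO, memory, or CPU is the limit before you tune further.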

Related

Getting Expired Options Contract pricing from Interactive Brokers

I am looking to reconstruct expired options pricing with the help of the ib_insync library and the data available from Interactive Brokers.
Because IB provides OPTION_IMPLIED_VOLATILITY as an output for reqHistoricalData, I was thinking of proceeding this way:
Have a function to infer expired options contract prices from the Black-Scholes model:
def black_scholes(stock_price, strike_price, vol, time, rate, right="Call"):
    d1 = (np.log(stock_price/strike_price) + (rate + 0.5*vol**2)*time)/(vol*np.sqrt(time))
    d2 = (np.log(stock_price/strike_price) + (rate - 0.5*vol**2)*time)/(vol*np.sqrt(time))
    nd1 = norm.cdf(d1)
    nd2 = norm.cdf(d2)
    n_d1 = norm.cdf(-1*d1)
    n_d2 = norm.cdf(-1*d2)
    if right.capitalize()[0] == "C":
        return round((stock_price*nd1) - (strike_price*np.exp(-1*rate*time)*nd2), 2)
    else:
        return round((strike_price*np.exp(-1*rate*time)*n_d2) - (stock_price*n_d1), 2)
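As a quick sanity check of the function (purely illustrative numbers, using the same numpy/scipy imports the function already relies on):
import numpy as np
from scipy.stats import norm

# 30 days to expiry, 25% implied vol, 3% rate -- all made-up inputs
print(black_scholes(150.0, 148.0, 0.25, 30/365, 0.03, right="Call"))
print(black_scholes(150.0, 148.0, 0.25, 30/365, 0.03, right="Put"))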
Use the contract on the underlying stock to retrieve the data, assuming I have a valid ib connection opened elsewhere in my code:
def get_stock_history(symbol, whattoshow_string):
    contract = Stock(symbol, 'SMART', 'USD')
    ib.reqMarketDataType(2)
    bars = ib.reqHistoricalData(
        contract,
        endDateTime='',
        durationStr='2 Y',
        barSizeSetting='1 Hour',
        whatToShow=whattoshow_string,
        useRTH=True,
        formatDate=1)
    ib.sleep(5)
    df = util.df(bars)
    df['date'] = pd.to_datetime(df['date']).dt.date
    return df
I also have a handy function to compute maturity in the BSM based on an hourly time decay:
def hourCount(DF, expiry):
    DF["maturity"] = ((dt.datetime.strptime(expiry, "%Y-%m-%d") - pd.to_datetime(DF.index)) / pd.Timedelta(hours=1)) / (365*24)
I could then get the data as below, assuming I have an expiration date and a strike (from elsewhere) that I wish to backtest:
strike = 148
expiration_date = '2022-12-02'
symbol = 'AAPL'

historicalData = get_stock_history(symbol, 'ADJUSTED_LAST')
impVolData = get_stock_history(symbol, 'OPTION_IMPLIED_VOLATILITY')

option_price_call = pd.DataFrame(columns=["open", "high", "low", "close"])
option_price_put = pd.DataFrame(columns=["open", "high", "low", "close"])

hourCount(historicalData, expiration_date)
hourCount(impVolData, expiration_date)

historicalData = historicalData[(historicalData["maturity"] > 0)]
impVolData = impVolData[(impVolData["maturity"] > 0)]

for column in ["open", "high", "low", "close"]:
    option_price_call[column] = black_scholes(historicalData[column], strike, impVolData[column], historicalData["maturity"], 0.03, right="Call")
    option_price_put[column] = black_scholes(historicalData[column], strike, impVolData[column], historicalData["maturity"], 0.03, right="Put")
Would that be a good approach to reconstruct/backtest the expired options contract pricing, or am I overlooking something here? Is there maybe a smarter way to achieve this?
Thanks in advance for your suggestions!

To find Oplog size using python

How do I find the oplog size in MongoDB using Python?
For example:
replSetGetStatus is equivalent to rs.status().
Is there a similar command for rs.printReplicationInfo()?
uri = "mongodb://usernamen:password#host:port/admin"
conn = pymongo.MongoClient(uri)
db = conn['admin']
db_stats = db.command({'replSetGetStatus' :1})
primary_optime = 0
secondary_optime = 0
for key in db_stats['members'] :
if key['stateStr'] == 'SECONDARY' :
secondary_optime = key['optimeDate']
if key['stateStr'] == 'PRIMARY' :
primary_optime =key['optimeDate']
print 'primary_optime : ' + str(primary_optime)
print 'secondary_optime : ' + str(secondary_optime)
seconds_lag = (primary_optime - secondary_optime ).total_seconds()
#total_seconds() userd to get the lag in seconds rather than datetime object
print 'secondary_lag : ' + str(seconds_lag)
This is my code. The db.command({'replSetGetStatus': 1}) call works.
I need something similar for the oplog size.
The following commands, executed on any replica set member, will give you the size of the oplog:
Uncompressed size in MB:
db.getReplicationInfo().logSizeMB
Uncompressed current size in Bytes:
db.getSiblingDB('local').oplog.rs.stats().size
Compressed current size in Bytes:
db.getSiblingDB('local').oplog.rs.stats().storageSize
Max configured size:
db.getSiblingDB('local').oplog.rs.stats().maxSize
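Since the question asks for Python, a sketch of the same reads through PyMongo could look like this (reusing the conn client from the question; collStats is the command behind those shell helpers):
local = conn['local']
oplog_stats = local.command('collStats', 'oplog.rs')

print('uncompressed size (bytes): ' + str(oplog_stats['size']))
print('compressed size on disk (bytes): ' + str(oplog_stats['storageSize']))
print('max configured size (bytes): ' + str(oplog_stats['maxSize']))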

get results from kafka for a specific period of time

Here is my code, which uses kafka-python:
now = datetime.now()
month_ago = now - relativedelta(months=1)

topic = 'some_topic_name'
consumer = KafkaConsumer(topic, bootstrap_servers=PROD_KAFKA_SERVER,
                         security_protocol=PROTOCOL,
                         group_id=GROUP_ID,
                         enable_auto_commit=False,
                         sasl_mechanism=SASL_MECHANISM, sasl_plain_username=SASL_USERNAME,
                         sasl_plain_password=SASL_PASSWORD)

for msg in consumer:
    print(msg)
I want to get results from the topic between month_ago and now, in a loop. How can I do this?
Thanks for any help!
Get the topic partitions assigned to your consumer:
partitions = consumer.assignment()
Get offsets for partitions by datetime:
month_ago_timestamp = int(month_ago.timestamp() * 1000)
partition_to_timestamp = {part: month_ago_timestamp for part in partitions}
mapping = consumer.offsets_for_times(partition_to_timestamp)
Seek each partition to its offset (the dictionary returned above is called mapping):
for partition, offset_and_timestamp in mapping.items():
    consumer.seek(partition, offset_and_timestamp[0])
Warning! offsets_for_times can return None for a partition (or an offset of zero), or block indefinitely, in cases like a missing topic, a missing partition, or messages without timestamps.
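A minimal guard against those cases before seeking could look like this (same mapping variable as above; skipping the partition is just one possible choice):
for partition, offset_and_timestamp in mapping.items():
    if offset_and_timestamp is None:
        # No message at or after month_ago in this partition; skip it
        # (or seek_to_beginning / seek_to_end, depending on what you want).
        continue
    consumer.seek(partition, offset_and_timestamp.offset)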
In the end, I did it like this :) My code looks like this:
topic = 'some_topic_name'
consumer = KafkaConsumer(bootstrap_servers=PROD_KAFKA_SERVER,
                         security_protocol=PROTOCOL,
                         group_id=GROUP_ID,
                         sasl_mechanism=SASL_MECHANISM, sasl_plain_username=SASL_USERNAME,
                         sasl_plain_password=SASL_PASSWORD)

month_ago = (datetime.now() - relativedelta(months=1)).timestamp()

topic_partition = TopicPartition(topic, 0)
assigned_topic = [topic_partition]
consumer.assign(assigned_topic)
partitions = consumer.assignment()

partition_to_timestamp = {part: int(month_ago * 1000) for part in partitions}
end_offsets = consumer.end_offsets(list(partition_to_timestamp.keys()))
mapping = consumer.offsets_for_times(partition_to_timestamp)

for partition, ts in mapping.items():
    end_offset = end_offsets.get(partition)
    consumer.seek(partition, ts[0])
    for msg in consumer:
        value = json.loads(msg.value.decode('utf-8'))
        # do something
        if msg.offset == end_offset - 1:
            consumer.close()
            break

Why do I have a large gap between Elasticsearch and Snowflake?

I have been tasked with building a process in Python that extracts data from Elasticsearch and drops it in an Azure Blob, after which Snowflake ingests it. The process runs on Azure Functions: it extracts an index group (like game_name.*) and, for each index in the group, creates a thread to scroll on. I save the last date of each result and pass it into the range query on the next run. The process runs every five minutes, and I offset the end of the range by 5 minutes (we have a refresh running every 2 minutes).
I let the process run for a while and then do a gap analysis by taking a count(*) in both Elasticsearch and Snowflake by hour (or by day), expecting a gap of at most 1%. For one index pattern, which groups about 127 indices, a catch-up job (covering a day or more) produces the expected gap; however, as soon as I let it run on the cron job (every 5 minutes), after a while I get gaps of 6-10%, and only for this index group.
It looks as if the scroller function picks up an N amount of documents within the queried range, but then for some reason documents are later added (PUT) with an earlier date. Or I might be wrong and my code is doing something funny. I've talked to our team: they don't cache any docs on the client, and the data is synced to a network clock (not the client's) and sent as UTC.
Please see below the query I am using to paginate through Elasticsearch:
def query(searchSize, lastRowDateOffset, endDate, pit, keep_alive):
    body = {
        "size": searchSize,
        "query": {
            "bool": {
                "must": [
                    {
                        "exists": {
                            "field": "baseCtx.date"
                        }
                    },
                    {
                        "range": {
                            "baseCtx.date": {
                                "gt": lastRowDateOffset,
                                "lte": endDate
                            }
                        }
                    }
                ]
            }
        },
        "pit": {
            "id": pit,
            "keep_alive": keep_alive
        },
        "sort": [
            {
                "baseCtx.date": {"order": "asc", "unmapped_type": "long"}
            },
            {
                "_shard_doc": "asc"
            }
        ],
        "track_total_hits": False
    }
    return body
def scroller(pit,
             threadNr,
             index,
             lastRowDateOffset,
             endDate,
             maxThreads,
             es,
             lastRowCount,
             keep_alive="1m",
             searchSize=10000):
    cumulativeResultCount = 0
    iterationResultCount = 0
    data = []
    dataBytes = b''
    lastIndexDate = ''
    startScroll = time.perf_counter()
    while 1:
        if lastRowCount == 0: break
        # if lastRowDateOffset == endDate: lastRowCount = 0; break
        try:
            page = es.search(body=body)
        except:  # It is believed that the point in time is getting closed, hence the below opens a new one
            pit = es.open_point_in_time(index=index, keep_alive=keep_alive)['id']
            body = query(searchSize, lastRowDateOffset, endDate, pit, keep_alive)
            page = es.search(body=body)
        pit = page['pit_id']
        data += page['hits']['hits']
        body['pit']['id'] = pit
        if len(data) > 0: body['search_after'] = [x['sort'] for x in page['hits']['hits']][-1]
        cumulativeResultCount += len(page['hits']['hits'])
        iterationResultCount = len(page['hits']['hits'])
        # print(f"This Iteration Result Count: {iterationResultCount} -- Cumulative Results Count: {cumulativeResultCount} -- {time.perf_counter() - startScroll} seconds")
        if iterationResultCount < searchSize: break
        if len(data) > rowsPerMB * maxSizeMB / maxThreads: break
        if time.perf_counter() - startScroll > maxProcessTimeSeconds: break
    if len(data) != 0:
        dataBytes = gzip.compress(bytes(json.dumps(data)[1:-1], encoding='utf-8'))
        lastIndexDate = max([x['_source']['baseCtx']['date'] for x in data])
    response = {
        "pit": pit,
        "index": index,
        "threadNr": threadNr,
        "dataBytes": dataBytes,
        "lastIndexDate": lastIndexDate,
        "cumulativeResultCount": cumulativeResultCount
    }
    return response
def batch(game_name, env='prod', startDate='auto', endDate='auto', writeDate=True, minutesOffset=5):
    es = Elasticsearch(
        esUrl,
        port=9200,
        timeout=300)
    lowerFormat = game_name.lower().replace(" ", "_")
    indexGroup = lowerFormat + "*"
    if env == 'dev': lowerFormat, indexGroup = 'dev_' + lowerFormat, 'dev.' + indexGroup
    azFormat = re.sub(r'[^0-9a-zA-Z]+', '-', game_name).lower()
    storageContainerName = azFormat
    curFileName = f"{lowerFormat}_cursors.json"
    curBlobFilePath = f"cursors/{curFileName}"
    compressedTools = [gzip.compress(bytes('[', encoding='utf-8')), gzip.compress(bytes(',', encoding='utf-8')), gzip.compress(bytes(']', encoding='utf-8'))]
    pits = []
    lastRowCounts = []

    # Parameter and state settings
    if os.getenv(f"{lowerFormat}_maxSizeMB") is not None: maxSizeMB = int(os.getenv(f"{lowerFormat}_maxSizeMB"))
    if os.getenv(f"{lowerFormat}_maxThreads") is not None: maxThreads = int(os.getenv(f"{lowerFormat}_maxThreads"))
    if os.getenv(f"{lowerFormat}_maxProcessTimeSeconds") is not None: maxProcessTimeSeconds = int(os.getenv(f"{lowerFormat}_maxProcessTimeSeconds"))

    # Get all indices for the indexGroup
    indicesEs = list(set([(re.findall(r"^.*-", x)[0][:-1] if '-' in x else x) + '*' for x in list(es.indices.get(indexGroup).keys())]))
    indices = [{"indexName": x, "lastOffsetDate": (datetime.datetime.utcnow()-datetime.timedelta(days=5)).strftime("%Y/%m/%d 00:00:00")} for x in indicesEs]

    # Load cursors
    cursors = getCursors(curBlobFilePath, indices)

    # Offset the current time by -5 minutes to account for the 2-3 min delay in Elasticsearch
    initTime = datetime.datetime.utcnow()
    if endDate == 'auto': endDate = f"{initTime-datetime.timedelta(minutes=minutesOffset):%Y/%m/%d %H:%M:%S}"
    print(f"Less than or Equal to: {endDate}, {keep_alive}")

    # Start multi-threading
    while 1:
        dataBytes = []
        dataSize = 0
        start = time.perf_counter()
        if len(pits) == 0: pits = ['' for x in range(len(cursors))]
        if len(lastRowCounts) == 0: lastRowCounts = ['' for x in range(len(cursors))]
        with concurrent.futures.ThreadPoolExecutor(max_workers=len(cursors)) as executor:
            results = [
                executor.submit(
                    scroller,
                    pit,
                    threadNr,
                    x['indexName'],
                    x['lastOffsetDate'] if startDate == 'auto' else startDate,
                    endDate,
                    len(cursors),
                    es,
                    lastRowCount,
                    keep_alive,
                    searchSize) for x, pit, threadNr, lastRowCount in (zip(cursors, pits, list(range(len(cursors))), lastRowCounts))
            ]
            for f in concurrent.futures.as_completed(results):
                if f.result()['lastIndexDate'] != '': cursors[f.result()['threadNr']]['lastOffsetDate'] = f.result()['lastIndexDate']
                pits[f.result()['threadNr']] = f.result()['pit']
                lastRowCounts[f.result()['threadNr']] = f.result()['cumulativeResultCount']
                dataSize += f.result()['cumulativeResultCount']
                if len(f.result()['dataBytes']) > 0: dataBytes.append(f.result()['dataBytes'])
                print(f"Thread {f.result()['threadNr']+1}/{len(cursors)} -- Index {f.result()['index']} -- Results pulled {f.result()['cumulativeResultCount']} -- Cumulative Results: {dataSize} -- Process Time: {round(time.perf_counter()-start, 2)} sec")
        if dataSize == 0: break
        lastRowDateOffsetDT = datetime.datetime.strptime(max([x['lastOffsetDate'] for x in cursors]), '%Y/%m/%d %H:%M:%S')
        outFile = f"elasticsearch/live/{lastRowDateOffsetDT:%Y/%m/%d/%H}/{lowerFormat}_live_{lastRowDateOffsetDT:%Y%m%d%H%M%S}_{datetime.datetime.utcnow():%Y%m%d%H%M%S}.json.gz"
        print(f"Starting compression of {dataSize} rows -- {round(time.perf_counter()-start, 2)} sec")
        dataBytes = compressedTools[0] + compressedTools[1].join(dataBytes) + compressedTools[2]

        # Upload to blob
        print(f"Commencing upload of data to blob -- {round(time.perf_counter()-start, 2)} sec")
        uploadJsonGzipBlobBytes(outFile, dataBytes, storageContainerName, len(dataBytes))
        print(f"File compiled: {outFile} -- {dataSize} rows -- Process Time: {round(time.perf_counter()-start, 2)} sec\n")

        # Update cursors
        if writeDate: postCursors(curBlobFilePath, cursors)

    # Clean up
    print("Closing PITs")
    for pit in pits:
        try: es.close_point_in_time({"id": pit})
        except: pass
    print(f"Closing Connection to {esUrl}")
    es.close()
    return

# Start the process
while 1:
    batch("My App")
I think I just need a second pair of eyes to point out where the issue might be in the code. I've tried increasing the minutesOffset argument to 60 (so every 5 minutes it pulls the data from the last run up to now minus 60 minutes) but had the same issue. Please help.
So the "baseCtx.date" is triggered by the client and it seems that in some cases there is a delay between when the event is triggered and when it is available to be searched. We fixed this by using the ingest pipeline as follows:
PUT _ingest/pipeline/indexDate
{
  "description": "Creates a timestamp when a document is initially indexed",
  "version": 1,
  "processors": [
    {
      "set": {
        "field": "indexDate",
        "value": "{{{_ingest.timestamp}}}",
        "tag": "indexDate"
      }
    }
  ]
}
We then set index.default_pipeline to "indexDate" in the template settings. The index name changes every month (we append the year and month), and this approach gives us a server-side date to scroll on instead.
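For reference, the same setup through the Python client might look like the sketch below; the template name and index pattern are assumptions, not taken from the answer:
# Create the ingest pipeline shown above.
es.ingest.put_pipeline(
    id='indexDate',
    body={
        "description": "Creates a timestamp when a document is initially indexed",
        "version": 1,
        "processors": [
            {"set": {"field": "indexDate", "value": "{{{_ingest.timestamp}}}", "tag": "indexDate"}}
        ]
    })

# Attach it as the default pipeline in a (hypothetical) legacy index template,
# so every new monthly index picks it up automatically.
es.indices.put_template(
    name='game-events',                     # assumed template name
    body={
        "index_patterns": ["game_name.*"],  # assumed pattern matching the question's index group
        "settings": {"index.default_pipeline": "indexDate"}
    })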

Elasticsearch scroll upper limit - python api

Is there a way, using the Python API, to set an upper limit on the number of documents that are retrieved when scrolling in chunks of a specific size? Say I want a maximum of 100K documents scrolled in chunks of 2K, where over 10 million documents are available.
I've implemented a counter-like object, but I want to know if there is a more natural solution.
es_query = {"query": {"function_score": {"functions": [{"random_score": {"seed": "1234"}}]}}}
es = Elasticsearch(ADDRESS, port=PORT)
result = es.search(
index="INDEX",
doc_type="DOC_TYPE",
body=es_query,
size=2000,
scroll="1m")
data = []
for hit in result["hits"]["hits"]:
for d in hit["_source"]["attributes"]["data_of_interest"]:
data.append(d)
do_something(*args)
scroll_id = result['_scroll_id']
scroll_size = result["hits"]["total"]
i = 0
while(scroll_size>0):
if i % 10000 == 0:
print("Scrolling ({})...".format(i))
result = es.scroll(scroll_id=scroll_id, scroll="1m")
scroll_id = result["_scroll_id"]
scroll_size = len(result['hits']['hits'])
data = []
for hit in result["hits"]["hits"]:
for d in hit["_source"]["attributes"]["data_of_interest"]:
data.append(d)
do_something(*args)
i += 1
if i == 100000:
break
To me, if you only want the first 100K documents, you should narrow your query in the first place; that will speed up your process. You can add a filter on date, for example.
Regarding the code, I do not know of another way than using the counter. I would just correct the indentation and remove the if statement for readability.
es_query = {"query": {"function_score": {"functions": [{"random_score": {"seed": "1234"}}]}}}
es = Elasticsearch(ADDRESS, port=PORT)
result = es.search(
index="INDEX",
doc_type="DOC_TYPE",
body=es_query,
size=2000,
scroll="1m")
data = []
for hit in result["hits"]["hits"]:
for d in hit["_source"]["attributes"]["data_of_interest"]:
data.append(d)
do_something(*args)
scroll_id = result['_scroll_id']
scroll_size = result["hits"]["total"]
i = 0
while(scroll_size > 0 & i < 100000):
print("Scrolling ({})...".format(i))
result = es.scroll(scroll_id=scroll_id, scroll="1m")
scroll_id = result["_scroll_id"]
scroll_size = len(result['hits']['hits'])
# data = [] why redefining the list ?
for hit in result["hits"]["hits"]:
for d in hit["_source"]["attributes"]["data_of_interest"]:
data.append(d)
do_something(*args)
i ++
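If the counter still feels unnatural, one alternative (my own suggestion, not something discussed above) is to wrap the scan helper from elasticsearch.helpers in itertools.islice, which caps the number of documents for you; names like ADDRESS, PORT and do_something come from the question's snippet:
from itertools import islice
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch(ADDRESS, port=PORT)
es_query = {"query": {"function_score": {"functions": [{"random_score": {"seed": "1234"}}]}}}

# scan() yields hits one at a time while scrolling in pages of `size`;
# islice() stops the generator after 100K documents.
# preserve_order=True keeps the random_score ordering (the default overrides it with a _doc sort).
hits = scan(es, query=es_query, index="INDEX", size=2000, preserve_order=True)

data = []
for hit in islice(hits, 100000):
    for d in hit["_source"]["attributes"]["data_of_interest"]:
        data.append(d)
do_something(*args)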
