ElasticSearch 7.10.2
Python 3.8.5
elasticsearch-py 7.12.1
I'm trying to do a bulk insert of 100,000 records to ElasticSearch using elasticsearch-py bulk helper.
Here is the Python code:
import sys
import datetime
import json
import os
import logging
from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

# ES Configuration start
es_hosts = [
    "http://localhost:9200",
]
es_api_user = 'user'
es_api_password = 'pw'
index_name = 'index1'
chunk_size = 10000
errors_before_interrupt = 5
refresh_index_after_insert = False
max_insert_retries = 3
yield_ok = False  # if set to False will skip successful documents in the output

# ES Configuration end
# =======================

filename = 'file.json'

logging.info('Importing data from {}'.format(filename))

es = Elasticsearch(
    es_hosts,
    #http_auth=(es_api_user, es_api_password),
    sniff_on_start=True,  # sniff before doing anything
    sniff_on_connection_fail=True,  # refresh nodes after a node fails to respond
    sniffer_timeout=60,  # and also every 60 seconds
    retry_on_timeout=True,  # should timeout trigger a retry on different node?
)

def data_generator():
    f = open(filename)
    for line in f:
        yield {**json.loads(line), **{
            "_index": index_name,
        }}

errors_count = 0

for ok, result in streaming_bulk(es, data_generator(), chunk_size=chunk_size, refresh=refresh_index_after_insert,
                                 max_retries=max_insert_retries, yield_ok=yield_ok):
    if ok is not True:
        logging.error('Failed to import data')
        logging.error(str(result))
        errors_count += 1
        if errors_count == errors_before_interrupt:
            logging.fatal('Too many import errors, exiting with error code')
            exit(1)

print("Documents loaded to Elasticsearch")
When the json file contains a small number of documents (~100), this code runs without issue. But I just tested it with a file of 100k documents, and I got this error:
WARNING:elasticsearch:POST http://127.0.0.1:9200/_bulk?refresh=false [status:N/A request:10.010s]
Traceback (most recent call last):
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 426, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 421, in _make_request
httplib_response = conn.getresponse()
File "/Users/me/opt/anaconda3/lib/python3.8/http/client.py", line 1347, in getresponse
response.begin()
File "/Users/me/opt/anaconda3/lib/python3.8/http/client.py", line 307, in begin
version, status, reason = self._read_status()
File "/Users/me/opt/anaconda3/lib/python3.8/http/client.py", line 268, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/Users/me/opt/anaconda3/lib/python3.8/socket.py", line 669, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/elasticsearch/connection/http_urllib3.py", line 251, in perform_request
response = self.pool.urlopen(
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 726, in urlopen
retries = retries.increment(
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/util/retry.py", line 386, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/packages/six.py", line 735, in reraise
raise value
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 428, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 335, in _raise_timeout
raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='127.0.0.1', port=9200): Read timed out. (read timeout=10)
I have to admit this one is a bit over my head. I don't typically like to paste large error messages here, but in this case I'm not sure which part of the message is relevant.
I can't help but think that maybe I need to adjust some of the params in the es object, or the configuration variables? I don't know enough about the params to make an educated decision on my own.
And the last but certainly not least point - it looks like some documents were loaded into the ES index nonetheless. But even stranger, the count shows 110k when the json file only has 100k.
TL;DR:
Reduce the chunk_size from 10000 to the default of 500 and I'd expect it to work. You probably also want to disable automatic retries, since they can give you duplicates.
What happened?
In your streaming_bulk call, you specified chunk_size=10000. This means each bulk request tries to insert 10000 documents at once. The connection to Elasticsearch has a configurable timeout, which by default is 10 seconds. So, if your Elasticsearch server takes more than 10 seconds to process the 10000 documents you want to insert, a timeout happens and is handled as an error.
When creating your Elasticsearch object, you also set retry_on_timeout to True, and in the streaming_bulk call you set max_retries=max_insert_retries, which is 3.
This means that when such a timeout happens, the library will retry the request up to 3 times; however, if the insert still times out after that, it gives you the error you noticed. (Documentation)
Also, when the timeout happens, the library cannot know whether the documents were inserted successfully, so it has to assume they were not. Thus, it will try to insert the same documents again. I don't know what your input lines look like, but if they do not contain an _id field, this would create duplicates in your index, which would also explain why your index reports 110k documents for a 100k-line file. You probably want to prevent this, either by adding some kind of _id, or by disabling the automatic retry and handling it manually.
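For example, if each line happens to contain a field that uniquely identifies the document (I'm assuming a field called "id" here, purely for illustration), you could map it to _id in your generator, so a retried chunk overwrites documents instead of duplicating them:

def data_generator():
    with open(filename) as f:
        for line in f:
            doc = json.loads(line)
            yield {
                **doc,
                "_index": index_name,
                # "id" is an assumed field name; use whatever uniquely identifies your records.
                # Re-inserting the same _id overwrites the document instead of creating a duplicate.
                "_id": doc["id"],
            }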
What to do?
There are two ways you can go about this:
Increase the timeout
Reduce the chunk_size
streaming_bulk by default has chunk_size set to 500. Your 10000 is much higher. I wouldn't expect a big performance gain from increasing this beyond 500, so I'd advise you to just use the default of 500 here. If 500 still fails with a timeout, you may even want to reduce it further. This could happen if the documents you want to index are very complex.
You could also increase the timeout for the streaming_bulk call, or, alternatively, for your es object. To only change it for the streaming_bulk call, you can provide the request_timeout keyword argument:
for ok, result in streaming_bulk(
        es,
        data_generator(),
        chunk_size=chunk_size,
        refresh=refresh_index_after_insert,
        request_timeout=60*3,  # 3 minutes
        yield_ok=yield_ok):
    # handle like you did
    pass
However, this also means that an Elasticsearch node failure will only be detected after this higher timeout. See the documentation for more details.
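Alternatively, to raise the timeout for every request made through your client instead of just the bulk call, you can pass a default timeout when constructing the Elasticsearch object (a sketch, keeping your sniffing settings):

es = Elasticsearch(
    es_hosts,
    sniff_on_start=True,
    sniff_on_connection_fail=True,
    sniffer_timeout=60,
    retry_on_timeout=True,
    timeout=60*3,  # default request timeout in seconds for all requests made by this client
)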
Related
I have some working code that gets data from a queue, processes it, and then emits the data via Flask-SocketIO to the browser.
This works when there aren't many messages to emit, however when the workload increases it simply can't cope.
Additionally I have tried processing the queue without transmitting and this appears to work correctly.
So what I was hoping to do is, rather than emitting every time queue.get() fires, simply emit whatever the current dataset is on every ping.
In theory this should mean that even if the queue.get() fires multiple times between ping and pong, the messages being sent should remain constant and won't overload the system.
It means of course, some data will not be sent, but the most up-to date data as of the last ping should be sent, which is sufficient for what I need.
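To make the idea concrete, here is a stripped-down sketch of the pattern I'm after (made-up names and plain threads instead of socketio, not my actual code): the consumer keeps only the newest item, and the "ping" side sends that snapshot no matter how many items arrived in between.

import queue
import threading
import time

data_queue = queue.Queue()
latest = None  # only the most recent dataset is kept


def producer():
    # stand-in for the market/order streams filling the queue
    for i in range(100):
        data_queue.put({'tick': i})
        time.sleep(0.01)


def consumer():
    # drain whatever is currently in the queue and remember only the newest item
    global latest
    while True:
        try:
            while True:
                latest = data_queue.get_nowait()
        except queue.Empty:
            pass
        time.sleep(0.05)


threading.Thread(target=producer, daemon=True).start()
threading.Thread(target=consumer, daemon=True).start()

for _ in range(5):  # stand-in for the ping/pong cycle
    time.sleep(0.2)
    print('emit on ping:', latest)  # emit only the latest snapshot, not every queue item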
Hopefully that makes sense, so on to the (sort of working) code...
This works when there are not a lot of messages to emit (the socketio sleep needs to be there, otherwise the processing doesn't occur before it tries to emit):
def handle_message(*_args, **_kwargs):
    try:
        order_books = order_queue.get(block=True, timeout=1)
    except Empty:
        order_books = None
    market_books = market_queue.get()
    uo = update_orders(order_books, mkt_runners, profitloss, trading, eo, mb)
    update_market_book(market_books, mb, uo, market_catalogues, mkt_runners, profitloss, trading)
    socketio.sleep(0.2)
    emit('initial_market', {'message': 'initial_market', 'mb': json.dumps(mb), 'ob': json.dumps(eo)})
    socketio.sleep(0.2)
    emit('my_response2', {'message': 'pong'})

def main():
    socketio.run(app, debug=True, port=3000)
    market_stream.stop()
    order_stream.stop()

if __name__ == '__main__':
    main()
This works (but I'm not trying to emit any messages here, this is just the Python script getting from the queue and processing):
while True:
    try:
        order_books = order_queue.get(block=True, timeout=1)
    except Empty:
        order_books = None
    market_books = market_queue.get()
    uo = update_orders(order_books, mkt_runners, profitloss, trading, eo, mb)
    update_market_book(market_books, mb, uo, market_catalogues, mkt_runners, profitloss, trading)
    print(mb)
(The mb at the end is the current dataset, which is returned by the update_market_book function).
Now, I was hoping that by having the while True loop at the end, this could simply run and the function would return the latest dataset on every ping. However, using the above, the while True loop only runs when the main function is taken out... which of course stops the socketio section from working.
So, is there a way I can combine both these, to achieve what I am trying to do and/or is there an alternative method I haven't considered that might work?
As always, I appreciate your advice and if the question is not clear, please let me know, so I can clarify any bits.
Many thanks!
Just adding a stack trace as requested:
Traceback (most recent call last):
File "D:\Python37\lib\site-packages\betfairlightweight\endpoints\login.py", line 38, in request
response = session.post(self.url, data=self.data, headers=self.client.login_headers, cert=self.client.cert)
File "D:\Python37\lib\site-packages\requests\api.py", line 116, in post
return request('post', url, data=data, json=json, **kwargs)
File "D:\Python37\lib\site-packages\requests\api.py", line 60, in request
return session.request(method=method, url=url, **kwargs)
File "D:\Python37\lib\site-packages\requests\sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "D:\Python37\lib\site-packages\requests\sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "D:\Python37\lib\site-packages\requests\adapters.py", line 449, in send
timeout=timeout
File "D:\Python37\lib\site-packages\urllib3\connectionpool.py", line 600, in urlopen
chunked=chunked)
File "D:\Python37\lib\site-packages\urllib3\connectionpool.py", line 343, in _make_request
self._validate_conn(conn)
File "D:\Python37\lib\site-packages\urllib3\connectionpool.py", line 839, in _validate_conn
conn.connect()
File "D:\Python37\lib\site-packages\urllib3\connection.py", line 332, in connect
cert_reqs=resolve_cert_reqs(self.cert_reqs),
File "D:\Python37\lib\site-packages\urllib3\util\ssl_.py", line 279, in create_urllib3_context
context.options |= options
File "D:\Python37\lib\ssl.py", line 507, in options
super(SSLContext, SSLContext).options.__set__(self, value)
File "D:\Python37\lib\ssl.py", line 507, in options
super(SSLContext, SSLContext).options.__set__(self, value)
File "D:\Python37\lib\ssl.py", line 507, in options
super(SSLContext, SSLContext).options.__set__(self, value)
[Previous line repeated 489 more times]
RecursionError: maximum recursion depth exceeded while calling a Python object
I am using the pandas.read_sql function with a Hive connection to extract a really large amount of data. I have a script like this:
df = pd.read_sql(query_big, hive_connection)
df2 = pd.read_sql(query_simple, hive_connection)
The big query takes a long time, and after it is executed, Python returns the following error when trying to execute the second line:
raise NotSupportedError("Hive does not have transactions") # pragma: no cover
It seems there is something wrong with the connection.
Moreover, if I replace the second line with multiprocessing.Manager().Queue(), it returns the following error:
File "/usr/lib64/python3.6/multiprocessing/managers.py", line 662, in temp
token, exp = self._create(typeid, *args, **kwds)
File "/usr/lib64/python3.6/multiprocessing/managers.py", line 554, in _create
conn = self._Client(self._address, authkey=self._authkey)
File "/usr/lib64/python3.6/multiprocessing/connection.py", line 493, in Client
answer_challenge(c, authkey)
File "/usr/lib64/python3.6/multiprocessing/connection.py", line 732, in answer_challenge
message = connection.recv_bytes(256) # reject large message
File "/usr/lib64/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/usr/lib64/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/usr/lib64/python3.6/multiprocessing/connection.py", line 383, in _recv
raise EOFError
EOFError
It seems this kind of error is related to the exit function being messed up in connection.py. Moreover, when I changed the query in the first command to extract smaller data that doesn't take too long, everything works fine. So I assume that because it takes too long to execute the first query, something is improperly terminated, which caused the two errors, both of which are very different in nature yet both related to broken connection issues.
I'm on version 1.9.9 of the SDK and I'm having issues with the devserver. I have a manually scaled module with 1 instance. I created a webapp2.RequestHandler for /_ah/start. In that handler I start a background thread. When I run my app in the devserver, the /_ah/start handler returns a 200, but /_ah/background will randomly return 500 errors for a while. After some time (usually a minute or two, but sometimes more), the 500 errors stop, but will randomly occur again every few hours. It also seems that every time I open a new browser tab (Chrome), I get the same error. Anyone know what could be causing this?
Here is the RequestHandler for /_ah/start:
class StartupHandler(webapp2.RequestHandler):
    def get(self):
        runtime.set_shutdown_hook(shutdown_hook)
        global foo
        if foo is None:
            foo = Foo()
        background_thread.start_new_background_thread(do_foo, [])
        self.response.http_status_message(200)
Here is the 500 error:
ERROR 2014-08-18 07:39:36,256 module.py:717] Request to '/_ah/background' failed
Traceback (most recent call last):
File "\appengine\tools\devappserver2\module.py", line 694, in _handle_request
environ, wrapped_start_response)
File "\appengine\tools\devappserver2\request_rewriter.py", line 311, in _rewriter_middleware
response_body = iter(application(environ, wrapped_start_response))
File "\appengine\tools\devappserver2\module.py", line 1672, in _handle_script_request
request_type)
File "\appengine\tools\devappserver2\module.py", line 1624, in _handle_instance_request
request_id, request_type)
File "\appengine\tools\devappserver2\instance.py", line 382, in handle
request_type))
File "\appengine\tools\devappserver2\http_proxy.py", line 190, in handle
response = connection.getresponse()
File "E:\Programing\Python27\lib\httplib.py", line 1030, in getresponse
response.begin()
File "E:\Programing\Python27\lib\httplib.py", line 407, in begin
version, status, reason = self._read_status()
File "E:\Programing\Python27\lib\httplib.py", line 365, in _read_status
line = self.fp.readline()
File "E:\Programing\Python27\lib\socket.py", line 430, in readline
data = recv(1)
error: [Errno 10054] An existing connection was forcibly closed by the remote host
INFO 2014-08-18 07:39:36,257 module.py:1890] Waiting for instances to restart
INFO 2014-08-18 07:39:36,262 module.py:642] lease: "GET /_ah/background HTTP/1.1" 500 -
Well, this might not be the answer, but how long does it take to complete a specific task assigned to a backend? It seems like an issue with concurrency.
Looks like the issue (as far as I can currently tell) is that I'm using PyCharm, which synchronizes the project's files when its window is entered or exited. This rewrites the project files even if there are no changes, which causes the devserver to restart all instances, leading to the 500 errors.
More info on PyCharm Synchronization
Link to issue at PyCharm
I have a web-service deployed in my box. I want to check the result of this service with various input. Here is the code I am using:
import sys
import httplib
import urllib
apUrl = "someUrl:somePort"
fileName = sys.argv[1]
conn = httplib.HTTPConnection(apUrl)
titlesFile = open(fileName, 'r')
try:
    for title in titlesFile:
        title = title.strip()
        params = urllib.urlencode({'search': 'abcd', 'text': title})
        conn.request("POST", "/somePath/", params)
        response = conn.getresponse()
        data = response.read().strip()
        print(data + "\t" + title)
        conn.close()
finally:
    titlesFile.close()
This code gives an error after the same number of lines is printed (28233). Error message:
Traceback (most recent call last):
File "testService.py", line 19, in ?
conn.request("POST", "/somePath/", params)
File "/usr/lib/python2.4/httplib.py", line 810, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python2.4/httplib.py", line 833, in _send_request
self.endheaders()
File "/usr/lib/python2.4/httplib.py", line 804, in endheaders
self._send_output()
File "/usr/lib/python2.4/httplib.py", line 685, in _send_output
self.send(msg)
File "/usr/lib/python2.4/httplib.py", line 652, in send
self.connect()
File "/usr/lib/python2.4/httplib.py", line 636, in connect
raise socket.error, msg
socket.error: (99, 'Cannot assign requested address')
I am using Python 2.4.3. I am doing conn.close() also. But why is this error being given?
This is not a Python problem.
In Linux kernel 2.4 the ephemeral port range is from 32768 through 61000, so the number of available ports = 61000 - 32768 + 1 = 28233. From what I understood, because the web service in question is quite fast (<5 ms actually), all the ports get used up. The program has to wait for about a minute or two for the ports to close.
What I did was count the number of conn.close() calls. When the count reached 28000, I waited for 90 seconds and reset the counter.
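As a rough sketch of that workaround (continuing from the variables in the question's script; the 28000 threshold and the 90-second pause are just the values I used):

import time

close_count = 0
for title in titlesFile:
    title = title.strip()
    params = urllib.urlencode({'search': 'abcd', 'text': title})
    conn = httplib.HTTPConnection(apUrl)
    conn.request("POST", "/somePath/", params)
    print(conn.getresponse().read().strip() + "\t" + title)
    conn.close()
    close_count += 1
    if close_count >= 28000:
        time.sleep(90)  # give the kernel time to release ports stuck in TIME_WAIT
        close_count = 0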
BIGYaN identified the problem correctly and you can verify that by calling "netstat -tn" right after the exception occurs. You will see very many connections with state "TIME_WAIT".
The alternative to waiting for port numbers to become available again is to simply use one connection for all requests. You are not required to call conn.close() after each call of conn.request(). You can simply leave the connection open until you are done with your requests.
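Applied to the script from the question, that just means closing the connection once, after the loop (a sketch, keeping the placeholder URL and path):

conn = httplib.HTTPConnection(apUrl)
titlesFile = open(fileName, 'r')
try:
    for title in titlesFile:
        title = title.strip()
        params = urllib.urlencode({'search': 'abcd', 'text': title})
        conn.request("POST", "/somePath/", params)
        response = conn.getresponse()
        print(response.read().strip() + "\t" + title)  # read the full response before the next request
finally:
    titlesFile.close()
    conn.close()  # one close at the end instead of one per request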
I too faced a similar issue while executing multiple POST requests using Python's requests library in Spark. To make it worse, I used multiprocessing on each executor to post to a server. So thousands of connections were created in seconds, each taking a few seconds to leave the TIME_WAIT state and release the ports for the next set of connections.
Out of all the solutions available on the internet that speak of disabling keep-alive, using requests.Session(), etc., I found the one that works to be setting 'Connection': 'close' in the request headers. You may need to define the header content on a separate line outside the post call, though.
import requests

headers = {
    'Connection': 'close'
}

with requests.Session() as session:
    response = session.post('https://xx.xxx.xxx.x/xxxxxx/x', headers=headers, files=files, verify=False)
    results = response.json()
    print(results)
This is my answer to the similar issue using the above solution.
Quite often GAE is not able to upload the file and I am getting the following error:
ApplicationError: 2
Traceback (most recent call last):
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 636, in __call__
handler.post(*groups)
File "/base/data/home/apps/picasa2vkontakte/1.348093606241250361/picasa2vkontakte.py", line 109, in post
headers=headers
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/urlfetch.py", line 260, in fetch
return rpc.get_result()
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/apiproxy_stub_map.py", line 592, in get_result
return self.__get_result_hook(self)
File "/base/python_runtime/python_lib/versions/1/google/appengine/api/urlfetch.py", line 355, in _get_fetch_result
raise DownloadError(str(err))
DownloadError: ApplicationError: 2
How should I perform retries in case of such error?
try:
    result = urlfetch.fetch(url=self.request.get('upload_url'),
                            payload=''.join(data),
                            method=urlfetch.POST,
                            headers=headers
                            )
except DownloadError:
    # how to retry 2 more times?
    # and how to verify result here?
    pass
If you can, move this work into the task queue. When tasks fail, they retry automatically. If they continue to fail, the system gradually backs off the retry frequency to as slow as once per hour. This is an easy way to handle API requests to rate-limited services without implementing one-off retry logic.
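For example (a rough sketch; '/upload-worker' is a hypothetical handler of yours that would perform the urlfetch, not something from your current code):

from google.appengine.api import taskqueue

# Enqueue the upload; App Engine retries the task automatically if the handler returns an error.
taskqueue.add(
    url='/upload-worker',  # hypothetical worker endpoint that performs the urlfetch and upload
    params={'upload_url': self.request.get('upload_url')},
)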
If you really need to handle requests synchronously, something like this should work:
for i in range(3):
    try:
        result = urlfetch.fetch(...)
        # run success conditions here
        break
    except DownloadError:
        #logging.debug("urlfetch failed!")
        pass
You can also pass deadline=10 to urlfetch.fetch to double the default timeout deadline.
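For example, combined with the fetch call from the question (just a sketch):

result = urlfetch.fetch(url=self.request.get('upload_url'),
                        payload=''.join(data),
                        method=urlfetch.POST,
                        headers=headers,
                        deadline=10)  # seconds; the default is 5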