redis has timeouts every 5-7 minutes - python

I have installed a Redis server using Google Click to Deploy. It has been unstable. Here is the health check I run every minute:
import redis  # redis-py

redisClientConnection = redis.StrictRedis(host=config['redis']['host'], port=config['redis']['port'], socket_timeout=3)
redisClientConnection.get('bla_bla')
So I decided to install a new one myself. This is its redis.conf file:
daemonize yes
pidfile /var/run/redis.pid
port 6379
tcp-backlog 511
timeout 0
tcp-keepalive 0
loglevel notice
logfile ""
databases 16
#save 900 1
#save 300 10
#save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir ./
slave-serve-stale-data yes
slave-read-only yes
repl-diskless-sync no
repl-diskless-sync-delay 5
repl-ping-slave-period 10
repl-disable-tcp-nodelay no
slave-priority 100
# maxmemory-samples 3
appendonly no
appendfilename "appendonly.aof"
# appendfsync always
appendfsync everysec
# appendfsync no
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
lua-time-limit 5000
slowlog-max-len 128
notify-keyspace-events ""
list-max-ziplist-entries 512
list-max-ziplist-value 64
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
aof-rewrite-incremental-fsync yes
It still fails with a timeout roughly every 5-7 minutes on average (a 2 in the history below means a failed check):
last execution: 2015-03-18 09:14:09
history: [0,0,0,0,2,0,0,2,0,0,0,0,2,2,2,2,0,0,0,0,0]
My app has 10 clients connected to Redis (checked with CLIENT LIST).
You can see in my conf that I have disabled persistence (the save directives are commented out); I only use Redis as a cache.
My question is: why is it timing out every ~5-7 minutes?
Here is the error and the stack trace:
TimeoutError: Timeout reading from socket
Traceback (most recent call last):
  File "healthCheck.py", line 14, in get
    redisClientConnection.get('bla_bla')
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 863, in get
    return self.execute_command('GET', name)
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 565, in execute_command
    return self.parse_response(connection, command_name, **options)
  File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 577, in parse_response
    response = connection.read_response()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 569, in read_response
    response = self._parser.read_response()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 224, in read_response
    response = self._buffer.readline()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 162, in readline
    self._read_from_socket()
  File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 133, in _read_from_socket
    raise TimeoutError("Timeout reading from socket")
TimeoutError: Timeout reading from socket
Edit:
I've added a health check on the actual redis server:
check-app-running.rb
#!/usr/bin/env ruby
# system (not exec) runs each command and returns false on failure; exec would
# replace this process after the first command and the rest would never run.
exit 1 unless system "redis-cli set mykey somevalue"
exit 1 unless system "redis-cli get mykey"
exit 0
and it's running fine:
last execution: 2015-03-18 10:14:51
history: [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
So now I suspect the network. Any ideas?
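One way to test the network theory is to enable TCP keepalive on the health-check connection, so an idle connection cannot be silently dropped by an intermediate firewall or NAT between checks. This is a sketch, not the original health check: socket_keepalive and socket_connect_timeout are redis-py (>= 2.10) options I am adding here, and the config values are placeholders.

import redis

config = {'redis': {'host': '127.0.0.1', 'port': 6379}}  # placeholder; use the real config dict

# Same health check as above, but the connection sends TCP keepalives while idle
# and fails fast if it cannot (re)connect.
redisClientConnection = redis.StrictRedis(
    host=config['redis']['host'],
    port=config['redis']['port'],
    socket_timeout=3,
    socket_connect_timeout=3,
    socket_keepalive=True,
)
redisClientConnection.get('bla_bla')

If the failures stop with keepalive enabled, the problem is most likely idle connections being dropped on the network path rather than the Redis server itself.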

Related

How to start a Ray cluster on one local server using yaml file config without docker

Could anyone help me start Ray on a local server using a config file?
My current local server runs Ray successfully with the command below:
ray start --head --node-ip-address 127.0.0.1 --port 6379 --dashboard-host 0.0.0.0 --dashboard-port 8265 --gcs-server-port 8075 --object-manager-port 8076 --node-manager-port 8077 --min-worker-port 10002 --max-worker-port 19999
But now I need to move this into a config file so that another service can control it. I tried creating a cluster.yaml file and starting it with the command ray up cluster.yaml, but it fails with an error.
The content of the cluster.yaml file is:
cluster_name: default
max_workers: 1
upscaling_speed: 1.0
idle_timeout_minutes: 5
provider:
    type: local
    head_ip: 0.0.0.0
    worker_ips:
        - 127.0.0.1
auth:
    ssh_user: root
    ssh_private_key: ~/.ssh/id_rsa
file_mounts: {}
cluster_synced_files: []
file_mounts_sync_continuously: False
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"
initialization_commands: []
setup_commands: []
head_setup_commands: []
worker_setup_commands: []
head_start_ray_commands:
    - ray stop
    - ray start --head --port 6379 --dashboard-host '0.0.0.0' --dashboard-port 8265 --gcs-server-port 8075 --object-manager-port 8076 --node-manager-port 8077 --min-worker-port 10002 --max-worker-port 19999 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
head_node: {}
worker_nodes: {}
But it fails to start with the error below:
<1/1> Setting up head node
  Prepared bootstrap config
2022-01-27 05:52:26,910 INFO node_provider.py:103 -- ClusterState: Writing cluster state: ['127.0.0.1', '0.0.0.0']
[1/7] Waiting for SSH to become available
  Running `uptime` as a test.
  Fetched IP: 0.0.0.0
05:52:33 up 3:43, 0 users, load average: 0.10, 0.18, 0.16
Shared connection to 0.0.0.0 closed.
  Success.
[2/7] Processing file mounts
[3/7] No worker file mounts to sync
[4/7] No initialization commands to run.
[5/7] Initalizing command runner
[6/7] No setup commands to run.
[7/7] Starting the Ray runtime
Shared connection to 0.0.0.0 closed.
Local node IP: 192.168.1.18
2022-01-27 05:52:37,541 INFO services.py:1340 -- View the Ray dashboard at http://192.168.1.18:8265
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/ray/node.py", line 240, in __init__
    self.redis_password)
  File "/usr/local/lib/python3.7/dist-packages/ray/_private/services.py", line 324, in wait_for_node
    raise TimeoutError("Timed out while waiting for node to startup.")
TimeoutError: Timed out while waiting for node to startup.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/ray", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.7/dist-packages/ray/scripts/scripts.py", line 1989, in main
    return cli()
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.7/dist-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/ray/scripts/scripts.py", line 633, in start
    ray_params, head=True, shutdown_at_exit=block, spawn_reaper=block)
  File "/usr/local/lib/python3.7/dist-packages/ray/node.py", line 243, in __init__
    "The current node has not been updated within 30 "
Exception: The current node has not been updated within 30 seconds, this could happen because of some of the Ray processes failed to startup.
Shared connection to 0.0.0.0 closed.
2022-01-27 05:53:07,718 INFO node_provider.py:103 -- ClusterState: Writing cluster state: ['127.0.0.1', '0.0.0.0']
  New status: update-failed
  !!!
  SSH command failed.
  !!!
  Failed to setup head node.
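Since the manual ray start works but ray up times out waiting for the node, one way to narrow this down (a hypothetical check, not from the original post) is to run the head_start_ray_commands entry by hand on the head machine and then confirm from Python that the node actually registered on the port the YAML uses:

import ray

# Connects to the already-running head node started by hand; 6379 is the same
# port used in head_start_ray_commands above.
ray.init(address="127.0.0.1:6379")
for node in ray.nodes():
    print(node["NodeManagerAddress"], node["Alive"])
ray.shutdown()

If this also hangs or shows no alive node, the failure is in the Ray runtime startup itself rather than in the autoscaler/SSH layer that ray up adds on top.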

gunicorn threads getting killed silently

gunicorn version 19.9.0
Got the following gunicorn config:
accesslog = "access.log"
worker_class = 'sync'
workers = 1
worker_connections = 1000
timeout = 300
graceful_timeout = 300
keepalive = 300
proc_name = 'server'
bind = '0.0.0.0:8080'
name = 'server.py'
preload = True
log_level = "info"
threads = 7
max_requests = 0
backlog = 100
As you can see, the server is configured to run 7 threads.
The server is started with:
gunicorn -c gunicorn_config.py server:app
Here are the number of lines and thread IDs from our log file at the beginning (with the last line being the thread of the main server):
10502 140625414080256
10037 140624842843904
9995 140624859629312
9555 140625430865664
9526 140624851236608
9409 140625405687552
2782 140625422472960
6 140628359804736
So 7 threads are processing the requests. (Already we can see that thread 140625422472960 is processing substantially fewer requests than the other threads.)
But after the lines examined above, thread 140625422472960 just vanishes and the log file only has:
19602 140624859629312
18861 140625405687552
18766 140624851236608
18765 140624842843904
12523 140625414080256
2111 140625430865664
(excluding the main thread here)
From the server logs we could see that the thread received a request and started processing it, but never finished. The client received no response either.
There is no error/warning in the log file, nor in stderr.
And running the app for a little longer, two more threads are gone:
102 140624842843904
102 140624851236608
68 140624859629312
85 140625405687552
How to debug this?
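One generic way to make this kind of silent thread death visible (a hypothetical debugging aid, not part of the original setup) is to log a periodic census of live threads from inside the app, so the log shows exactly when a worker thread disappears:

import logging
import threading
import time

def start_thread_census(interval=60):
    # Hypothetical helper: log the idents of all live threads every `interval`
    # seconds; call it once at import time from the app module.
    def _census():
        log = logging.getLogger("thread-census")
        while True:
            idents = sorted(t.ident for t in threading.enumerate())
            log.info("live threads (%d): %s", len(idents), idents)
            time.sleep(interval)
    threading.Thread(target=_census, name="thread-census", daemon=True).start()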
Digging into the stderr logs further, I finally found an exception stack trace like this:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/gunicorn/workers/sync.py", line 134, in handle
req = six.next(parser)
File "/usr/local/lib/python3.6/site-packages/gunicorn/http/parser.py", line 41, in __next__
self.mesg = self.mesg_class(self.cfg, self.unreader, self.req_count)
File "/usr/local/lib/python3.6/site-packages/gunicorn/http/message.py", line 181, in __init__
super(Request, self).__init__(cfg, unreader)
File "/usr/local/lib/python3.6/site-packages/gunicorn/http/message.py", line 54, in __init__
unused = self.parse(self.unreader)
File "/usr/local/lib/python3.6/site-packages/gunicorn/http/message.py", line 230, in parse
self.headers = self.parse_headers(data[:idx])
File "/usr/local/lib/python3.6/site-packages/gunicorn/http/message.py", line 74, in parse_headers
remote_addr = self.unreader.sock.getpeername()
OSError: [Errno 107] Transport endpoint is not connected
[2018-11-04 17:57:55 +0330] [31] [ERROR] Socket error processing request.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/gunicorn/workers/sync.py", line 134, in handle
req = six.next(parser)
File "/usr/local/lib/python3.6/site-packages/gunicorn/http/parser.py", line 41, in __next__
self.mesg = self.mesg_class(self.cfg, self.unreader, self.req_count)
File "/usr/local/lib/python3.6/site-packages/gunicorn/http/message.py", line 181, in __init__
super(Request, self).__init__(cfg, unreader)
File "/usr/local/lib/python3.6/site-packages/gunicorn/http/message.py", line 54, in __init__
unused = self.parse(self.unreader)
File "/usr/local/lib/python3.6/site-packages/gunicorn/http/message.py", line 230, in parse
self.headers = self.parse_headers(data[:idx])
File "/usr/local/lib/python3.6/site-packages/gunicorn/http/message.py", line 74, in parse_headers
remote_addr = self.unreader.sock.getpeername()
OSError: [Errno 107] Transport endpoint is not connected
This is due to this gunicorn bug.
An interim solution until this bug is fixed is to monkey patch gunicorn as done by asantoni.

Unable to flush task queue within 1200 seconds

When running my AutoML pipeline, I am consistently getting an error during the MetricsAndSaveModel activity which causes my model training run to fail:
2019-12-06 22:48:01,233 - INFO - 295 : ActivityCompleted: Activity=MetricsAndSaveModel, HowEnded=Failure, Duration=1200977.92[ms]
2019-12-06 22:48:01,235 - CRITICAL - 295 : Type: Unclassified
Class: AzureMLException
Message: AzureMLException:
Message: Failed to flush task queue within 1200 seconds
InnerException None
ErrorResponse
{
"error": {
"message": "Failed to flush task queue within 1200 seconds"
}
}
Traceback:
File "fit_pipeline.py", line 222, in fit_pipeline
automl_run_context.batch_save_artifacts(strs_to_save, models_to_upload)
File "automl_run_context.py", line 201, in batch_save_artifacts
timeout_seconds=ARTIFACT_UPLOAD_TIMEOUT_SECONDS)
File "run.py", line 49, in wrapped
return func(self, *args, **kwargs)
File "run.py", line 1824, in upload_files
timeout_seconds=timeout_seconds)
File "artifacts_client.py", line 167, in upload_files
results.append(task)
File "task_queue.py", line 53, in __exit__
self.flush(self.identity)
File "task_queue.py", line 126, in flush
raise AzureMLException("Failed to flush task queue within {} seconds".format(timeout_seconds))
The current timeout limit is set to 20 minutes in the AutoML service, and our product team is working to make this a configurable setting in a future release. For now, to increase the limit you can modify the script automl_run_context.py, update ARTIFACT_UPLOAD_TIMEOUT_SECONDS to a higher value, and retry running the pipeline.
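A small helper along those lines (hypothetical, not part of the AzureML SDK; it only locates the file and the constant you would then edit by hand, and prints nothing if the constant is imported from another module):

import os
import re
import site

# Hypothetical locator: find automl_run_context.py under site-packages and print
# the line defining ARTIFACT_UPLOAD_TIMEOUT_SECONDS, so you know what to raise
# (e.g. from 1200 to a higher value) before retrying the pipeline.
pattern = re.compile(r"ARTIFACT_UPLOAD_TIMEOUT_SECONDS\s*=\s*\d+")
for base in site.getsitepackages():
    for root, _dirs, files in os.walk(base):
        if "automl_run_context.py" in files:
            path = os.path.join(root, "automl_run_context.py")
            with open(path) as fh:
                for line in fh:
                    if pattern.search(line):
                        print(path + ": " + line.strip())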

python script stops after a while

I have a Python script on an Amazon EC2 server that requests data from two different servers (using urllib3 and http.request); the data is then recorded in a text file. It has to run for a long time, so I am using nohup to keep it running in the background.
The problem is that it stops after a while (sometimes it lasts 24 hours, sometimes 2 hours; it varies). I don't get an error message or anything: it just stops, and the last string of characters received is saved to the text file as an incomplete string (just whatever it managed to read from the remote server).
What could be causing this problem?
This is the code I have:
import urllib3  # sudo pip install urllib3 --upgrade
import time

urllib3.disable_warnings()  # Disable urllib3 warnings about unverified connections
http = urllib3.PoolManager()

f = open('okcoin.txt', 'w')
f2 = open('bitvc.txt', 'w')

while True:
    try:
        r = http.request("GET", "https://www.okcoin.com/api/v1/future_ticker.do?symbol=btc_usd&contract_type=this_week")
        r2 = http.request("GET", "http://market.bitvc.com/futures/ticker_btc_week.js")
    except:  # catch all exceptions
        continue
    # Status code 200 means the server returned OK; there should also be at least
    # 5 commas in the payload, otherwise the data is blank/incomplete.
    if r.status != 200 or r2.status != 200 or r.data.count(',') < 5 or r2.data.count(',') < 5:
        continue  # Try to read again if there was a problem with either reading
    received = str(time.time())  # Timestamp of when the data reached the server running this script
    data = r.data + "," + received + "\r\n"
    data2 = r2.data + "," + received + "\r\n"
    print data, r.status
    print data2, r2.status
    f.write(data)
    f2.write(data2)
    time.sleep(0.5)
    f.flush()  # flush files
    f2.flush()

f.close()  # unreachable as written; the loop above never exits normally
f2.close()
EDIT: I left the program open in a screen session over SSH. It stopped again. If I press CTRL+C to stop it, this is what I get:
^CTraceback (most recent call last):
File "tickersave.py", line 72, in <module>
r2 = http.request("GET","http://market.bitvc.com/futures/ticker_btc_week.js")
File "/usr/local/lib/python2.7/dist-packages/urllib3/request.py", line 68, in request
**urlopen_kw)
File "/usr/local/lib/python2.7/dist-packages/urllib3/request.py", line 81, in request_encode_url
return self.urlopen(method, url, **urlopen_kw)
File "/usr/local/lib/python2.7/dist-packages/urllib3/poolmanager.py", line 153, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/usr/local/lib/python2.7/dist-packages/urllib3/connectionpool.py", line 541, in urlopen
**response_kw)
File "/usr/local/lib/python2.7/dist-packages/urllib3/response.py", line 284, in from_httplib
**response_kw)
File "/usr/local/lib/python2.7/dist-packages/urllib3/response.py", line 104, in __init__
self._body = self.read(decode_content=decode_content)
File "/usr/local/lib/python2.7/dist-packages/urllib3/response.py", line 182, in read
data = self._fp.read()
File "/usr/lib/python2.7/httplib.py", line 551, in read
s = self._safe_read(self.length)
File "/usr/lib/python2.7/httplib.py", line 658, in _safe_read
chunk = self.fp.read(min(amt, MAXAMOUNT))
File "/usr/lib/python2.7/socket.py", line 380, in read
data = self._sock.recv(left)
Any clues? Should I add timeouts or something?
Collect all the output of your program and you will find a very good hint about what is wrong. That is, definitely collect stdout and stderr. To do that, you might want to start your program like this:
$ nohup python script.py > log.outerr 2>&1 &
That collects both stdout and stderr in the file log.outerr and starts your program decoupled from the tty, in the background. Investigating log.outerr after the program has stopped working will probably be quite revealing.
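As for the timeout idea in the question: the Ctrl+C traceback shows the script blocked inside self._sock.recv, i.e. a read that never returns. A per-request timeout turns that indefinite hang into an exception that the existing except/continue would catch. A minimal sketch, assuming urllib3 1.x as in the traceback (the 5 s / 15 s values and retry count are illustrative, not recommendations):

import urllib3

# Every request made through this pool gets a hard connect/read timeout and up
# to 3 retries, so a stalled server response raises instead of hanging forever.
http = urllib3.PoolManager(timeout=urllib3.Timeout(connect=5.0, read=15.0), retries=3)
r = http.request("GET", "https://www.okcoin.com/api/v1/future_ticker.do?symbol=btc_usd&contract_type=this_week")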

nmap http-form-brute fails for admin form in django

I'm trying to use nmap on localhost to brute-force the admin name/password of a Django application. I've created a simple Django application with one admin user, started it, and now I'm trying to brute-force the password like this:
nmap -p 8000 --script http-form-brute --script-args userdb=test.txt,passdb=test.txt,path=/admin/,hostname=localhost -vv -d -sT 127.0.0.1
test.txt contains two lines of words, one of which is the correct password.
nmap produces:
Scanned at 2014-10-23 11:42:12 ope for 0s
PORT STATE SERVICE REASON
8000/tcp open http-alt syn-ack
Final times for host: srtt: 1000 rttvar: 5000 to: 100000
NSE: Script Post-scanning.
NSE: Starting runlevel 1 (of 1) scan.
Read from C:\Program Files (x86)\Nmap: nmap-payloads nmap-services.
Nmap done: 1 IP address (1 host up) scanned in 6.90 seconds
Raw packets sent: 0 (0B) | Rcvd: 0 (0B)
The server log contains this for each request:
----------------------------------------
Exception happened during processing of request from ('127.0.0.1', 12312)
Traceback (most recent call last):
File "/usr/lib/python2.7/SocketServer.py", line 595, in process_request_thread
self.finish_request(request, client_address)
File "/usr/lib/python2.7/SocketServer.py", line 334, in finish_request
self.RequestHandlerClass(request, client_address, self)
File "/usr/lib/python2.7/site-packages/django/core/servers/basehttp.py", line
129, in __init__
super(WSGIRequestHandler, self).__init__(*args, **kwargs)
File "/usr/lib/python2.7/SocketServer.py", line 651, in __init__
self.handle()
File "/usr/lib/python2.7/wsgiref/simple_server.py", line 116, in handle
self.raw_requestline = self.rfile.readline()
File "/usr/lib/python2.7/socket.py", line 447, in readline
data = self._sock.recv(self._rbufsize)
error: [Errno 104] Connection reset by peer
----------------------------------------
The error 104 happens when the socket connection has been disconnected by one side and the now-dead socket is then used by the other.
I think in your case this happened because the Django login involves HTTP cookies for 'session-id' and 'csrf-token'; those are required to log in properly, and I don't know how nmap handles HTTP cookies. You could use curl instead, which has options for setting all the HTTP client specifics.
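To make the cookie point concrete, here is a sketch of a single CSRF-aware login attempt against the Django admin, using the requests library (my choice for illustration; the answer itself suggests curl). It first fetches the login page to obtain the csrftoken cookie, then posts the credentials together with csrfmiddlewaretoken, which is the handling the answer suspects nmap's generic form brute-forcing does not do:

import requests

BASE = "http://127.0.0.1:8000"
LOGIN_URL = BASE + "/admin/login/"

def try_login(username, password):
    s = requests.Session()
    s.get(LOGIN_URL)                          # sets the csrftoken cookie
    token = s.cookies.get("csrftoken")
    s.post(LOGIN_URL, data={
        "username": username,
        "password": password,
        "csrfmiddlewaretoken": token,         # must match the csrftoken cookie
        "next": "/admin/",
    }, headers={"Referer": LOGIN_URL})
    # A successful login sets the sessionid cookie.
    return s.cookies.get("sessionid") is not None

print(try_login("admin", "some-password"))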
