Airflow SSHOperator's Socket exception: Bad file descriptor - python

In Airflow, I use an SSHOperator to call an API that performs some automation work. The work runs successfully and the report is generated, but Airflow marks the task as failed because of a socket exception.
This error occurs intermittently, and I would like to know what causes it.
The error message received:
[2021-07-20 08:00:07,345] {ssh.py:109} INFO - Running command: curl -u <user:pw> <URL>
[2021-07-20 08:00:07,414] {ssh.py:145} WARNING - % Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
[2021-07-20 08:00:08,420] {ssh.py:145} WARNING -
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
[2021-07-20 08:00:09,421] {ssh.py:145} WARNING -
0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0
[2021-07-20 08:00:10,423] {ssh.py:145} WARNING -
0 0 0 0 0 0 0 0 --:--:-- 0:00:03 --:--:-- 0
[2021-07-20 08:00:10,615] {ssh.py:141} INFO - Report Sent Successfully.
[2021-07-20 08:00:10,616] {transport.py:1819} ERROR - Socket exception: Bad file descriptor (9)
[2021-07-20 08:00:10,633] {taskinstance.py:1481} ERROR - Task failed with exception
Traceback (most recent call last):
File "/u01/airflow-venv/lib/python3.8/site-packages/airflow/providers/ssh/operators/ssh.py", line 152, in execute
stdout.channel.close()
File "/u01/airflow-venv/lib/python3.8/site-packages/paramiko/channel.py", line 671, in close
self.transport._send_user_message(m)
File "/u01/airflow-venv/lib/python3.8/site-packages/paramiko/transport.py", line 1863, in _send_user_message
self._send_message(data)
File "/u01/airflow-venv/lib/python3.8/site-packages/paramiko/transport.py", line 1839, in _send_message
self.packetizer.send_message(data)
File "/u01/airflow-venv/lib/python3.8/site-packages/paramiko/packet.py", line 431, in send_message
self.write_all(out)
File "/u01/airflow-venv/lib/python3.8/site-packages/paramiko/packet.py", line 367, in write_all
raise EOFError()
EOFError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/u01/airflow-venv/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1137, in _run_raw_task
self._prepare_and_execute_task_with_callbacks(context, task)
File "/u01/airflow-venv/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1311, in _prepare_and_execute_task_with_callbacks
result = self._execute_task(context, task_copy)
File "/u01/airflow-venv/lib/python3.8/site-packages/airflow/models/taskinstance.py", line 1336, in _execute_task
result = task_copy.execute(context=context)
File "/u01/airflow-venv/lib/python3.8/site-packages/airflow/providers/ssh/operators/ssh.py", line 171, in execute
raise AirflowException(f"SSH operator error: {str(e)}")
airflow.exceptions.AirflowException: SSH operator error:
--- edit ---
from airflow.providers.ssh.operators.ssh import SSHOperator

generate_report = SSHOperator(
    task_id='generate_report',
    ssh_conn_id='ssh_123',
    command='curl -u user:password "http://localhost:1234/path/to/trigger/report_creation_API?async=false"',
)

This is a race condition in the paramiko library. A line above this close() we call shutdown on the read side of the channel, and at the very beginning of the same method we call shutdown on the write side. This means that after the second shutdown the channel should be closed, and this is likely what happens under the hood in paramiko.
However, this seems to happen asynchronously in a separate thread. Depending on which thread gets there first, the socket can already be closed when we call close() in the operator. If the async thread in paramiko was faster, we are attempting to close an already closed socket, and the error is thrown.
This is a classic race condition.
Since closing an already closed connection is basically a no-op, we can safely ignore such an exception. This is what I just did in the PR here:
https://github.com/apache/airflow/pull/17528
It will most likely be released in the next wave of providers (likely at the end of August).
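For context, here is a minimal, self-contained sketch of the same idea outside Airflow. It is a hedged example: the host, credentials, and command are placeholders, and it mirrors the intent of the fix (ignore the close-time error) rather than the exact diff in the PR.

import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("some-host", username="someuser", password="secret")  # placeholder connection details

stdin, stdout, stderr = client.exec_command('curl -u user:password "http://localhost:1234/..."')
output = stdout.read()  # drain the command output before closing

try:
    # paramiko's transport thread may have already torn the channel down;
    # closing an already closed channel is a no-op, so EOFError can be ignored.
    stdout.channel.close()
except EOFError:
    pass

client.close()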

Related

Streaming Dataflow job throws gRPC error from the worker

My streaming Dataflow job (reading from a Pub/Sub source) throws multiple error messages from the worker:
Failed to read inputs in the data plane.
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/data_plane.py", line 581, in _read_inputs
for elements in elements_iterator:
File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 416, in __next__
return self._next()
File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 706, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "keepalive watchdog timeout"
debug_error_string = "{"created":"@1619783831.069194941","description":"Error received from peer ipv6:[::1]:12371","file":"src/core/lib/surface/call.cc","file_line":1061,"grpc_message":"keepalive watchdog timeout","grpc_status":14}"
>
which leads to another error message:
Python sdk harness failed:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker_main.py", line 155, in main
SdkHarness(
File "/usr/local/lib/python3.8/site-packages/apache_beam/runners/worker/sdk_worker.py", line 256, in run
for work_request in self._control_stub.Control(get_responses()):
File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 416, in __next__
return self._next()
File "/usr/local/lib/python3.8/site-packages/grpc/_channel.py", line 689, in _next
raise self
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "keepalive watchdog timeout"
debug_error_string = "{"created":"@1619786284.366500317","description":"Error received from peer ipv6:[::1]:12371","file":"src/core/lib/surface/call.cc","file_line":1061,"grpc_message":"keepalive watchdog timeout","grpc_status":14}"
>
Judging from the "description":"Error received from peer ipv6:[::1]:12371","file":"src/core/lib/surface/call.cc" part, it seems that, for some reason, the job cannot reach the gRPC backend of the Dataflow service.
Is that correct, or is there another reason for that error message? I use Apache Beam SDK 2.28 for Python 3.8. I also use VPC settings, configured via the --subnetwork=https://www.googleapis.com/compute/v1/projects/.../regions/.../subnetworks/... pipeline option, if that is worth mentioning.

Facing error: OSError: [Errno 98] Address already in use: ('', 8089)

I'm trying to run Locust. After successfully running it for the first time, I made some changes to my locustfile.py. Now, running the locust command gives the error "OSError: [Errno 98] Address already in use: ('', 8089)". I have tried killing the PID, but it doesn't work. I'm using a Linux 18.04 system.
Sharing the complete log trace:
[2020-06-23 14:22:25,687] sonali-Latitude-3490/WARNING/locust.main: System open file limit setting is not high enough for load testing, and the OS wouldnt allow locust to increase it by itself. See https://docs.locust.io/en/stable/installation.html#increasing-maximum-number-of-open-files-limit for more info.
[2020-06-23 14:22:25,687] sonali-Latitude-3490/INFO/locust.main: Starting web interface at http://:8089
[2020-06-23 14:22:25,694] sonali-Latitude-3490/INFO/locust.main: Starting Locust 1.0.3
Traceback (most recent call last):
File "src/gevent/greenlet.py", line 854, in gevent._gevent_cgreenlet.Greenlet.run
File "/home/sonali/.local/lib/python3.6/site-packages/locust/web.py", line 288, in start
self.server.serve_forever()
File "/home/sonali/.local/lib/python3.6/site-packages/gevent/baseserver.py", line 398, in serve_forever
self.start()
File "/home/sonali/.local/lib/python3.6/site-packages/gevent/baseserver.py", line 336, in start
self.init_socket()
File "/home/sonali/.local/lib/python3.6/site-packages/gevent/pywsgi.py", line 1500, in init_socket
StreamServer.init_socket(self)
File "/home/sonali/.local/lib/python3.6/site-packages/gevent/server.py", line 180, in init_socket
self.socket = self.get_listener(self.address, self.backlog, self.family)
File "/home/sonali/.local/lib/python3.6/site-packages/gevent/server.py", line 192, in get_listener
return _tcp_listener(address, backlog=backlog, reuse_addr=cls.reuse_addr, family=family)
File "/home/sonali/.local/lib/python3.6/site-packages/gevent/server.py", line 288, in _tcp_listener
sock.bind(address)
OSError: [Errno 98] Address already in use: ('', 8089)
2020-06-23T08:52:25Z <Greenlet at 0x7fcb12f1f148: <bound method WebUI.start of <locust.web.WebUI object at 0x7fcb12ef2668>>> failed with OSError
[2020-06-23 14:22:25,696] sonali-Latitude-3490/CRITICAL/locust.web: Unhandled exception in greenlet: <Greenlet at 0x7fcb12f1f148: <bound method WebUI.start of <locust.web.WebUI object at 0x7fcb12ef2668>>>
Traceback (most recent call last):
File "src/gevent/greenlet.py", line 854, in gevent._gevent_cgreenlet.Greenlet.run
File "/home/sonali/.local/lib/python3.6/site-packages/locust/web.py", line 288, in start
self.server.serve_forever()
File "/home/sonali/.local/lib/python3.6/site-packages/gevent/baseserver.py", line 398, in serve_forever
self.start()
File "/home/sonali/.local/lib/python3.6/site-packages/gevent/baseserver.py", line 336, in start
self.init_socket()
File "/home/sonali/.local/lib/python3.6/site-packages/gevent/pywsgi.py", line 1500, in init_socket
StreamServer.init_socket(self)
File "/home/sonali/.local/lib/python3.6/site-packages/gevent/server.py", line 180, in init_socket
self.socket = self.get_listener(self.address, self.backlog, self.family)
File "/home/sonali/.local/lib/python3.6/site-packages/gevent/server.py", line 192, in get_listener
return _tcp_listener(address, backlog=backlog, reuse_addr=cls.reuse_addr, family=family)
File "/home/sonali/.local/lib/python3.6/site-packages/gevent/server.py", line 288, in _tcp_listener
sock.bind(address)
OSError: [Errno 98] Address already in use: ('', 8089)
[2020-06-23 14:22:25,696] sonali-Latitude-3490/INFO/locust.main: Running teardowns...
[2020-06-23 14:22:25,696] sonali-Latitude-3490/INFO/locust.main: Shutting down (exit code 2), bye.
[2020-06-23 14:22:25,696] sonali-Latitude-3490/INFO/locust.main: Cleaning up runner...
Name # reqs # fails Avg Min Max | Median req/s failures/s
--------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------------
Aggregated 0 0(0.00%) 0 0 0 | 0 0.00 0.00
Percentage of the requests completed within given times
Type Name # reqs 50% 66% 75% 80% 90% 95% 98% 99% 99.9% 99.99% 100%
------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------
If I remember correctly...
Sockets are not closed immediately by default. There is a "grace" period handled by the kernel, which is associated with the SO_LINGER value. See for example http://deepix.github.io/2016/10/21/tcprst.html
It means that you need to wait a bit before your server socket becomes available again.
An example of how to handle this in Python is given at https://www.programcreek.com/python/example/67443/socket.SO_LINGER.
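As a rough illustration (a sketch under the assumptions above, not code taken from the linked page), this is how a listening socket in Python can be given the relevant options so that the port from the error, 8089, becomes reusable immediately after a previous run:

import socket
import struct

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# Allow rebinding to a port whose previous listener is still lingering in TIME_WAIT.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)

# SO_LINGER takes an (l_onoff, l_linger) pair; enabling it with a timeout of 0
# makes close() reset the connection instead of waiting out the grace period.
sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))

sock.bind(("", 8089))
sock.listen(5)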

gunicorn threads getting killed silently

gunicorn version 19.9.0
Got the following gunicorn config:
accesslog = "access.log"
worker_class = 'sync'
workers = 1
worker_connections = 1000
timeout = 300
graceful_timeout = 300
keepalive = 300
proc_name = 'server'
bind = '0.0.0.0:8080'
name = 'server.py'
preload = True
log_level = "info"
threads = 7
max_requests = 0
backlog = 100
As you can see, the server is configured to run 7 threads.
The server is started with:
gunicorn -c gunicorn_config.py server:app
Here are the number of lines and thread IDs from our log file at the beginning (with the last line being the thread of the main server):
10502 140625414080256
10037 140624842843904
9995 140624859629312
9555 140625430865664
9526 140624851236608
9409 140625405687552
2782 140625422472960
6 140628359804736
So 7 threads are processing the requests. (Already we can see that thread 140625422472960 is processing substantially fewer requests than the other threads.)
But after the lines examined above, thread 140625422472960 just vanishes and the log file only has:
19602 140624859629312
18861 140625405687552
18766 140624851236608
18765 140624842843904
12523 140625414080256
2111 140625430865664
(excluding the main thread here)
From the server logs we could see that the thread received a request and started processing it, but never finished. The client received no response either.
There is no error/warning in the log file, nor in stderr.
And running the app for a little longer, two more threads are gone:
102 140624842843904
102 140624851236608
68 140624859629312
85 140625405687552
How to debug this?
Digging further into the stderr logs, we finally found an exception stack trace like this:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/gunicorn/workers/sync.py", line 134, in handle
req = six.next(parser)
File "/usr/local/lib/python3.6/site-packages/gunicorn/http/parser.py", line 41, in __next__
self.mesg = self.mesg_class(self.cfg, self.unreader, self.req_count)
File "/usr/local/lib/python3.6/site-packages/gunicorn/http/message.py", line 181, in __init__
super(Request, self).__init__(cfg, unreader)
File "/usr/local/lib/python3.6/site-packages/gunicorn/http/message.py", line 54, in __init__
unused = self.parse(self.unreader)
File "/usr/local/lib/python3.6/site-packages/gunicorn/http/message.py", line 230, in parse
self.headers = self.parse_headers(data[:idx])
File "/usr/local/lib/python3.6/site-packages/gunicorn/http/message.py", line 74, in parse_headers
remote_addr = self.unreader.sock.getpeername()
OSError: [Errno 107] Transport endpoint is not connected
[2018-11-04 17:57:55 +0330] [31] [ERROR] Socket error processing request.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/gunicorn/workers/sync.py", line 134, in handle
req = six.next(parser)
File "/usr/local/lib/python3.6/site-packages/gunicorn/http/parser.py", line 41, in __next__
self.mesg = self.mesg_class(self.cfg, self.unreader, self.req_count)
File "/usr/local/lib/python3.6/site-packages/gunicorn/http/message.py", line 181, in __init__
super(Request, self).__init__(cfg, unreader)
File "/usr/local/lib/python3.6/site-packages/gunicorn/http/message.py", line 54, in __init__
unused = self.parse(self.unreader)
File "/usr/local/lib/python3.6/site-packages/gunicorn/http/message.py", line 230, in parse
self.headers = self.parse_headers(data[:idx])
File "/usr/local/lib/python3.6/site-packages/gunicorn/http/message.py", line 74, in parse_headers
remote_addr = self.unreader.sock.getpeername()
OSError: [Errno 107] Transport endpoint is not connected
This is due to this gunicorn bug.
An interim solution until this bug is fixed is to monkey patch gunicorn as done by asantoni.
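A hedged sketch of what such a monkey patch could look like (the exact patch is in the linked discussion; the wrapper and log message below are my own illustration): wrap the sync worker's handle method so that an ENOTCONN raised while parsing a request from a client that has already disconnected is logged instead of silently taking the handler thread down.

import errno

from gunicorn.workers.sync import SyncWorker

_original_handle = SyncWorker.handle

def _patched_handle(self, *args, **kwargs):
    try:
        return _original_handle(self, *args, **kwargs)
    except OSError as exc:
        if exc.errno != errno.ENOTCONN:
            raise
        # The client went away before getpeername() could be called;
        # log it and keep the handler thread alive.
        self.log.warning("Client disconnected before the request was parsed; ignoring.")

SyncWorker.handle = _patched_handle

This would typically live in the gunicorn config file or in a module imported before the workers start serving requests, so the patch is applied before any request is handled.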

ssh-copy-id hanging sometimes

This is a "follow up" to my question from last week. Basically I am seeing that some python code that ssh-copy-id using pexpect occasionally hangs.
I thought it might be a pexect problem, but I am not so sure any more after I was able to collect a "stacktrace" from such a hang.
Here you can see some traces from my application; followed by the stack trace after running into the timeout:
2016-07-01 13:23:32 DEBUG copy command: ssh-copy-id -i /yyy/.ssh/id_rsa.pub someuser@some.ip
2016-07-01 13:23:33 DEBUG first expect: 1
2016-07-01 13:23:33 DEBUG sending PASSW0RD
2016-07-01 13:23:33 DEBUG consuming output from remote side ...
2016-07-01 13:24:03 INFO Timeout occured ... stack trace information ...
2016-07-01 13:24:03 INFO Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/pexpect-3.3-py3.5.egg/pexpect/__init__.py", line 1535, in expect_loop
c = self.read_nonblocking(self.maxread, timeout)
File "/usr/local/lib/python3.5/site-packages/pexpect-3.3-py3.5.egg/pexpect/__init__.py", line 968, in read_nonblocking
raise TIMEOUT('Timeout exceeded.')
pexpect.TIMEOUT: Timeout exceeded.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "xxx/PrepareSsh.py", line 28, in execute
self.copy_keys(context, user, timeout)
File "xxx/PrepareSsh.py", line 83, in copy_keys
child.expect('[#\$]')
File "/usr/local/lib/python3.5/site-packages/pexpect-3.3-py3.5.egg/pexpect/__init__.py", line 1451, in expect
timeout, searchwindowsize)
File "/usr/local/lib/python3.5/site-packages/pexpect-3.3-py3.5.egg/pexpect/__init__.py", line 1466, in expect_list
timeout, searchwindowsize)
File "/usr/local/lib/python3.5/site-packages/pexpect-3.3-py3.5.egg/pexpect/__init__.py", line 1568, in expect_loop
raise TIMEOUT(str(err) + '\n' + str(self))
pexpect.TIMEOUT: Timeout exceeded.
<pexpect.spawn object at 0x2b74694995c0>
version: 3.3
command: /usr/bin/ssh-copy-id
args: ['/usr/bin/ssh-copy-id', '-i', '/yyy/.ssh/id_rsa.pub', 'someuser@some.ip']
searcher: <pexpect.searcher_re object at 0x2b746ae1c748>
buffer (last 100 chars): b'\r\n/usr/bin/xauth: creating new authority file /home/hmcmanager/.Xauthority\r\n'
before (last 100 chars): b'\r\n/usr/bin/xauth: creating new authority file /home/hmcmanager/.Xauthority\r\n'
after: <class 'pexpect.TIMEOUT'>
So, what I find kind of strange: xauth showing up in the messages that pexpect received.
You see, today I created another VM for testing and did all the setup manually. This is what I see when doing so:
> ssh-copy-id -i ~/.ssh/id_rsa.pub someuser@some.ip
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/xxx/.ssh/id_rsa.pub"
The authenticity of host 'some.ip (some.ip)' can't be established.
ECDSA key fingerprint is SHA256:7...
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
someuser#some.ip's password:
Number of key(s) added: 1
Now try logging into the machine, with: ....
So, let's recap:
when I run ssh-copy-id manually, everything works, and the string "xauth" doesn't show up in the output coming back
when I run ssh-copy-id programmatically, it works most of the time, but sometimes there are timeouts ... and that message about xauth is sent to my client
This is driving me crazy. Any ideas are welcome.
The xauth reference smells like you are requesting X11 forwarding. That would be configured in your ~/.ssh/config. It might be the difference in your configuration that causes the hangs.
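If X11 forwarding turns out to be the difference, one way to rule it out (a hedged suggestion; the Host value is just the placeholder from the question) is to disable it explicitly for that host in ~/.ssh/config:

Host some.ip
    ForwardX11 no
    ForwardX11Trusted no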

redis has timeouts every 5-7 minutes

I have installed a Redis server using Google Click to Deploy. It has been unstable. Here is the health check I run every minute:
import redis
redisClientConnection = redis.StrictRedis(host=config['redis']['host'], port=config['redis']['port'], socket_timeout=3)
redisClientConnection.get('bla_bla')
So I decided to install a new one. This is its redis.conf file:
daemonize yes
pidfile /var/run/redis.pid
port 6379
tcp-backlog 511
timeout 0
tcp-keepalive 0
loglevel notice
logfile ""
databases 16
#save 900 1
#save 300 10
#save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir ./
slave-serve-stale-data yes
slave-read-only yes
repl-diskless-sync no
repl-diskless-sync-delay 5
repl-ping-slave-period 10
repl-disable-tcp-nodelay no
slave-priority 100
# maxmemory-samples 3
appendonly no
appendfilename "appendonly.aof"
# appendfsync always
appendfsync everysec
# appendfsync no
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
lua-time-limit 5000
slowlog-max-len 128
notify-keyspace-events ""
list-max-ziplist-entries 512
list-max-ziplist-value 64
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
hll-sparse-max-bytes 3000
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60
aof-rewrite-incremental-fsync yes
It still fails due to a timeout every 5-7 minutes on average (a 2 in the history means a failure):
last execution: 2015-03-18 09:14:09
history: [0,0,0,0,2,0,0,2,0,0,0,0,2,2,2,2,0,0,0,0,0]
According to CLIENT LIST, my app has 10 clients connected to Redis.
You can see in my conf that I have disabled persistence (the save directives); I only use Redis as a cache.
My question is: why is it timing out every ~5-7 minutes?
Here is the error and the stack trace:
TimeoutError: Timeout reading from socket
Traceback (most recent call last):
File "healthCheck.py", line 14, in get
redisClientConnection.get('bla_bla')
File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 863, in get
return self.execute_command('GET', name)
File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 565, in execute_command
return self.parse_response(connection, command_name, **options)
File "/usr/local/lib/python2.7/dist-packages/redis/client.py", line 577, in parse_response
response = connection.read_response()
File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 569, in read_response
response = self._parser.read_response()
File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 224, in read_response
response = self._buffer.readline()
File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 162, in readline
self._read_from_socket()
File "/usr/local/lib/python2.7/dist-packages/redis/connection.py", line 133, in _read_from_socket
raise TimeoutError("Timeout reading from socket")
TimeoutError: Timeout reading from socket
Edit:
I've added a health check on the actual redis server:
check-app-running.rb
#!/usr/bin/env ruby
system "redis-cli set mykey somevalue"
system "redis-cli get mykey"
exit 0
and it's running fine:
last execution: 2015-03-18 10:14:51
history: [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]
So now I suspect the network. Any ideas?
