This is a "follow up" to my question from last week. Basically I am seeing that some python code that ssh-copy-id using pexpect occasionally hangs.
I thought it might be a pexect problem, but I am not so sure any more after I was able to collect a "stacktrace" from such a hang.
Here you can see some traces from my application; followed by the stack trace after running into the timeout:
2016-07-01 13:23:32 DEBUG copy command: ssh-copy-id -i /yyy/.ssh/id_rsa.pub someuser@some.ip
2016-07-01 13:23:33 DEBUG first expect: 1
2016-07-01 13:23:33 DEBUG sending PASSW0RD
2016-07-01 13:23:33 DEBUG consuming output from remote side ...
2016-07-01 13:24:03 INFO Timeout occured ... stack trace information ...
2016-07-01 13:24:03 INFO Traceback (most recent call last):
File "/usr/local/lib/python3.5/site-packages/pexpect-3.3-py3.5.egg/pexpect/__init__.py", line 1535, in expect_loop
c = self.read_nonblocking(self.maxread, timeout)
File "/usr/local/lib/python3.5/site-packages/pexpect-3.3-py3.5.egg/pexpect/__init__.py", line 968, in read_nonblocking
raise TIMEOUT('Timeout exceeded.')
pexpect.TIMEOUT: Timeout exceeded.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "xxx/PrepareSsh.py", line 28, in execute
self.copy_keys(context, user, timeout)
File "xxx/PrepareSsh.py", line 83, in copy_keys
child.expect('[#\$]')
File "/usr/local/lib/python3.5/site-packages/pexpect-3.3-py3.5.egg/pexpect/__init__.py", line 1451, in expect
timeout, searchwindowsize)
File "/usr/local/lib/python3.5/site-packages/pexpect-3.3-py3.5.egg/pexpect/__init__.py", line 1466, in expect_list
timeout, searchwindowsize)
File "/usr/local/lib/python3.5/site-packages/pexpect-3.3-py3.5.egg/pexpect/__init__.py", line 1568, in expect_loop
raise TIMEOUT(str(err) + '\n' + str(self))
pexpect.TIMEOUT: Timeout exceeded.
<pexpect.spawn object at 0x2b74694995c0>
version: 3.3
command: /usr/bin/ssh-copy-id
args: ['/usr/bin/ssh-copy-id', '-i', '/yyy/.ssh/id_rsa.pub', 'someuser@some.ip']
searcher: <pexpect.searcher_re object at 0x2b746ae1c748>
buffer (last 100 chars): b'\r\n/usr/bin/xauth: creating new authority file /home/hmcmanager/.Xauthority\r\n'
before (last 100 chars): b'\r\n/usr/bin/xauth: creating new authority file /home/hmcmanager/.Xauthority\r\n'
after: <class 'pexpect.TIMEOUT'>
So, what I find kind of strange: xauth showing up in the messages that pexpect received.
You see, today I created another VM for testing and did all the setup manually. This is what I see when doing so:
> ssh-copy-id -i ~/.ssh/id_rsa.pub someuser@some.ip
/usr/bin/ssh-copy-id: INFO: Source of key(s) to be installed: "/xxx/.ssh/id_rsa.pub"
The authenticity of host 'some.ip (some.ip)' can't be established.
ECDSA key fingerprint is SHA256:7...
Are you sure you want to continue connecting (yes/no)? yes
/usr/bin/ssh-copy-id: INFO: attempting to log in with the new key(s), to filter out any that are already installed
/usr/bin/ssh-copy-id: INFO: 1 key(s) remain to be installed -- if you are prompted now it is to install the new keys
someuser@some.ip's password:
Number of key(s) added: 1
Now try logging into the machine, with: ....
So, let's recap:
when I run ssh-copy-id manually, everything works, and the string "xauth" doesn't show up in the output coming back
when I run ssh-copy-id programmatically, it works most of the time; but sometimes there are timeouts, and that message about xauth is sent to my client
This is driving me crazy. Any ideas are welcome.
The xauth reference smells like you are requesting X11 forwarding. That would be configured in your ~/.ssh/config. It might be the difference between the two configurations that causes the hangs.
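A minimal sketch of how you might act on that (this is an assumption on my part, not the asker's copy_keys code): turn X11 forwarding off for this one run so the xauth banner never lands in the output pexpect is matching on. The -o pass-through only exists in newer ssh-copy-id versions, and the password, key path, and host below are placeholders taken from the log above.

import pexpect

# Sketch only: spawn ssh-copy-id with X11 forwarding disabled and drive the
# prompts explicitly. Placeholders come from the question's log output.
child = pexpect.spawn('ssh-copy-id',
                      ['-o', 'ForwardX11=no',        # avoid the xauth chatter
                       '-i', '/yyy/.ssh/id_rsa.pub',
                       'someuser@some.ip'],
                      timeout=30)
i = child.expect(['assword:', r'\(yes/no\)\?'])
if i == 1:                      # first contact: accept the host key first
    child.sendline('yes')
    child.expect('assword:')
child.sendline('PASSW0RD')
child.expect(pexpect.EOF)       # ssh-copy-id exits once the key is installed

If your ssh-copy-id does not accept -o, putting "ForwardX11 no" in the ~/.ssh/config entry for that host has the same effect, which is exactly the per-host difference hinted at above.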
Related
I am trying to port an automation script I wrote in Python to Ansible (a company request), and I have NEVER worked with Ansible before. I have tried "wait_for:", but I have not gotten that to work either. In the script, I could set dev.timeout=None or whatever I needed; I am finding it hard to figure out where I can do this in Ansible. I have tried setting the timeout in the "ansible.cfg" file, but that doesn't work. I can do simple commands, like:
cli="show version", or
cli="show system firmware".
The following is my playbook:
- hosts: local
  roles:
    - Juniper.junos
  connection: local
  gather_facts: no
  tasks:
    - junos_cli:
        host={{ inventory_hostname }}
        user=root
        passwd=Hardware1
        cli="request system snapshot slice alternate"
        dest="{{ inventory_hostname }}.txt"
After I run that, around 120 seconds later, I get the following error:
fatal: [192.168.2.254]: FAILED! => {"changed": false, "failed": true, "module_stderr": "/usr/local/lib/python2.7/dist-packages/jnpr/junos/device.py:652: RuntimeWarning: CLI command is for debug use only!\n warnings.warn(\"CLI command is for debug use only!\", RuntimeWarning)\nTraceback (most recent call last):\n File \"/home/pkb/.ansible/tmp/ansible-tmp-1457428640.58-63826456311723/junos_cli\", line 2140, in <module>\n main()\n File \"/home/pkb/.ansible/tmp/ansible-tmp-1457428640.58-63826456311723/junos_cli\", line 177, in main\n dev.close()\n File \"/usr/local/lib/python2.7/dist-packages/jnpr/junos/device.py\", line 504, in close\n self._conn.close_session()\n File \"/usr/local/lib/python2.7/dist-packages/ncclient/manager.py\", line 158, in wrapper\n return self.execute(op_cls, *args, **kwds)\n File \"/usr/local/lib/python2.7/dist-packages/ncclient/manager.py\", line 228, in execute\n raise_mode=self._raise_mode).request(*args, **kwds)\n File \"/usr/local/lib/python2.7/dist-packages/ncclient/operations/session.py\", line 28, in request\n return self._request(new_ele(\"close-session\"))\n File \"/usr/local/lib/python2.7/dist-packages/ncclient/operations/rpc.py\", line 342, in _request\n raise TimeoutExpiredError('ncclient timed out while waiting for an rpc reply.')\nncclient.operations.errors.TimeoutExpiredError: ncclient timed out while waiting for an rpc reply.\n", "module_stdout": "", "msg": "MODULE FAILURE", "parsed": false}
I think it is the timeout, but I could be wrong. It is doing my head in that such a simple task is eluding me.
Okay, I managed to fix the problem.
The Ansible module that runs the CLI (junos_cli) doesn't support a timeout. I therefore went into:
/etc/ansible/roles/Juniper.junos/library/junos_cli
and, below the line:
dev = Device(args['host'], user=args['user'], password=args['passwd'],
             port=args['port'], gather_facts=False).open()
I added:
dev.timeout=None
This sets the timeout to infinity, so there is enough time for the formatting to finish when running "request system snapshot slice alternate".
Hope this helps anyone else driving the Junos CLI through Ansible automation.
It is better to increase the timeout to some reasonable value (say 300 seconds) that should comfortably cover the call, rather than making it infinite.
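As a concrete sketch of that suggestion (300 seconds is an assumed value, and args is the module's already-parsed argument dict from the snippet above), the same patch location would become:

from jnpr.junos import Device

# Sketch: identical to the patched line in junos_cli, but with a finite RPC
# timeout so a wedged device cannot hang the play forever.
dev = Device(args['host'], user=args['user'], password=args['passwd'],
             port=args['port'], gather_facts=False).open()
dev.timeout = 300  # seconds allowed per RPC, e.g. the snapshot request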
I am trying to get my bottle server to a point where, when one person in a game logs out, everyone else can see it immediately. As I am using long polling, there is an open request for every user.
The bit I am having trouble with is catching the exception that is thrown by the long-polling request when the user leaves the page and the response can no longer be delivered. The error message is here.
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/gevent/pywsgi.py", line 438, in handle_one_response
self.run_application()
File "/usr/lib/python2.7/dist-packages/gevent/pywsgi.py", line 425, in run_application
self.process_result()
File "/usr/lib/python2.7/dist-packages/gevent/pywsgi.py", line 416, in process_result
self.write(data)
File "/usr/lib/python2.7/dist-packages/gevent/pywsgi.py", line 373, in write
self.socket.sendall(msg)
File "/usr/lib/python2.7/dist-packages/gevent/socket.py", line 509, in sendall
data_sent += self.send(_get_memory(data, data_sent), flags)
File "/usr/lib/python2.7/dist-packages/gevent/socket.py", line 483, in send
return sock.send(data, flags)
error: [Errno 32] Broken pipe
<WSGIServer fileno=3 address=0.0.0.0:8080>: Failed to handle request:
request = GET /refreshlobby/1 HTTP/1.1 from ('127.0.0.1', 53331)
application = <bottle.Bottle object at 0x7f9c05672750>
127.0.0.1 - - [2013-07-07 10:59:30] "GET /refreshlobby/1 HTTP/1.1" 200 160 6.038377
The function to handle that page is this.
@route('/refreshlobby/<id>')
def refreshlobby(id):
    while True:
        yield lobby.refresh()
        gevent.sleep(1)
I tried catching the exception within the function, and in a decorator that I wrapped around @route, neither of which worked. I tried adding an @error(500) handler, but that didn't trigger either. It seems that this is happening in the internals of bottle.
Edit: I know now that I need to be catching socket.error, but I don't know whereabouts in my code to do it.
The WSGI runner
Look closely at the traceback: this is not happening in your function, but in the WSGI runner.
Traceback (most recent call last):
File "/usr/lib/python2.7/dist-packages/gevent/pywsgi.py", line 438, in handle_one_response
self.run_application()
The way the WSGI runner works, in your case, is (see the sketch below):
1. Receives a request
2. Gets a partial response from your code
3. Sends it to the client (this is where the exception is raised)
4. Repeats steps 2-3
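A heavily simplified sketch of that loop (not gevent.pywsgi's real code) shows where the error belongs: the runner, not your generator, performs the socket write.

# Sketch of what a WSGI runner does with a generator response. Your view only
# produces chunks; the runner writes them, and that write is what raises
# "Broken pipe" once the browser has closed the long-poll connection.
def handle_one_response(app, environ, start_response, sock):
    result = app(environ, start_response)    # your generator
    for chunk in result:                     # step 2: pull a partial response
        sock.sendall(chunk)                  # step 3: send it to the client;
                                             # raises if the client went away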
You can't catch this exception
This error is not raised in your code.
It happens when you try to send a response to a client that closed the connection.
You'll therefore not be able to catch this error from within your code.
Alternate solutions
Unfortunately, it's not possible to tell from within the generator (your code) when it stops being consumed.
It's also not a good idea to rely on your generator being garbage collected.
You have a couple of other solutions.
"Last seen"
Another way to know when a user disconnects is to record a "last seen" timestamp after your yield statement.
You will then be able to identify clients that have disconnected, because their last seen will be far in the past.
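A minimal sketch of that idea, reusing the names from the question (lobby and the per-client id come from the code above; the 5-second staleness threshold is an assumption):

import time

import gevent
from bottle import route

last_seen = {}  # client id -> time the last chunk was handed to the runner

@route('/refreshlobby/<id>')
def refreshlobby(id):
    while True:
        yield lobby.refresh()        # lobby comes from the question's code
        last_seen[id] = time.time()  # only reached after the previous chunk
        gevent.sleep(1)              # was written out to the client

def disconnected_clients(max_age=5):
    # Anyone whose last chunk is older than max_age seconds has likely gone.
    now = time.time()
    return [cid for cid, ts in last_seen.items() if now - ts > max_age]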
Other runner
A different, non-WSGI runner would be more appropriate for a realtime application. You could give Tornado a try.
I've been googling for days trying to find a straight answer to why this is happening, but I can't find anything useful. I have a web2py application that simply reads a database and makes some requests to a REST API. It is a health-check monitor, so it refreshes itself every minute, and there are about 20 or so users at any given time. Here is the error I'm seeing very consistently in the log file:
ERROR:Rocket.Errors.Port8080:Traceback (most recent call last):
File "/opt/apps/web2py/gluon/rocket.py", line 562, in listen
sock = self.wrap_socket(sock)
File "/opt/apps/web2py/gluon/rocket.py", line 506, in wrap_socket
ssl_version = ssl.PROTOCOL_SSLv23)
File "/usr/local/lib/python2.7/ssl.py", line 342, in wrap_socket
ciphers=ciphers)
File "/usr/local/lib/python2.7/ssl.py", line 121, in __init__
self.do_handshake()
File "/usr/local/lib/python2.7/ssl.py", line 281, in do_handshake
self._sslobj.do_handshake()
error: [Errno 104] Connection reset by peer
Based on some googling, the most promising piece of information is that someone is trying to connect through a firewall which then kills the connection. However, I don't understand why this takes the whole application down: the process is still running, but no one can connect and I have to restart web2py.
I would be very appreciative of any input here. I'm beyond frustrated.
Thanks!
The most common source of "Connection reset by peer" errors is that the remote client decides it doesn't want to talk to you anymore and cancels the interaction (with a shutdown/RST packet). This happens, for example, if the user navigates to a different page while the site is loading.
In your case, the remote host gave up on the connection even before you got to read or write anything on it. With a current web2py, this should only produce the warning you're seeing and not terminate anything.
If you have a current web2py, the inability to connect is unrelated to these error messages. If you have an old version of web2py, you should update.
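To illustrate why a reset during the handshake should only cost that one connection, here is a generic sketch (not web2py's or Rocket's actual code) of an accept loop that logs the failure and keeps listening:

import socket
import ssl

def serve_forever(listener, ssl_context, handle_request):
    # Generic sketch: a per-connection SSL handshake failure (e.g. ECONNRESET
    # from a client behind a picky firewall) is logged; the listener lives on.
    while True:
        conn, addr = listener.accept()
        try:
            tls_conn = ssl_context.wrap_socket(conn, server_side=True)
        except (ssl.SSLError, socket.error) as exc:
            print('handshake with %s failed: %s' % (addr, exc))
            conn.close()
            continue                 # move on to the next client
        handle_request(tls_conn, addr)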
Here is a python code snippet that uses paramiko:
import paramiko
sshClient = paramiko.SSHClient()
sshClient.set_missing_host_key_policy(paramiko.AutoAddPolicy)
sshClient.connect(_peerIp, username=_username, password=_password, timeout=3.0)
As soon as I run the script, I unplug _peerIp's network cable, and the connect() method hangs. Even though the timeout is 3.0, it has been 10 minutes and it is still hanging.
(I think the TCP connection was established in a split second and I unplugged the cable during the SSH establishment.)
So, do you know any workaround for this? My script will run in a manufacturing plant; it must not hang in such a scenario and has to handle it properly.
EDIT:
It just gave an exception:
No handlers could be found for logger "paramiko.transport"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/pymodules/python2.6/paramiko/client.py", line 327, in connect
self._auth(username, password, pkey, key_filenames, allow_agent, look_for_keys)
File "/usr/lib/pymodules/python2.6/paramiko/client.py", line 438, in _auth
self._transport.auth_publickey(username, key)
File "/usr/lib/pymodules/python2.6/paramiko/transport.py", line 1234, in auth_publickey
return self.auth_handler.wait_for_response(my_event)
File "/usr/lib/pymodules/python2.6/paramiko/auth_handler.py", line 163, in wait_for_response
raise e
socket.error: [Errno 113] No route to host
OK, at least it eventually raised an exception, but I believe this is not the expected behaviour. If the timeout is 3.0, the connect() method should return shortly after the timeout expires.
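A hedged workaround sketch (not something from the question itself): since the timeout= argument only bounds the TCP connect, you can put a hard deadline around the whole connect() call yourself, for example by running it in a daemon thread. The 10-second deadline is an assumption.

import threading

import paramiko

def connect_with_deadline(host, user, password, deadline=10.0):
    # Sketch: enforce an overall deadline on connect(), which covers the SSH
    # banner and authentication exchange that timeout= does not.
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    result = {}

    def _connect():
        try:
            client.connect(host, username=user, password=password, timeout=3.0)
            result['ok'] = True
        except Exception as exc:
            result['error'] = exc

    worker = threading.Thread(target=_connect)
    worker.daemon = True             # don't keep the process alive if it hangs
    worker.start()
    worker.join(deadline)
    if worker.is_alive():            # still handshaking after the deadline
        client.close()
        raise RuntimeError('SSH connect exceeded %.1fs deadline' % deadline)
    if 'error' in result:
        raise result['error']
    return client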
When I attempt to connect to one of our internal servers using paramiko (inside of fabric, for what it's worth) I get this error:
Retrieving packages from server p-websvr-004
[p-websvr-004] run: /usr/sbin/pkg_info -aD|grep "Information for"
starting thread (client mode): 0x179f090L
Banner: ----------------------------------------------------------------------
Banner: Welcome to Mycompany, Inc. Unauthorized access, is strictly prohibited
Connected (version 2.0, client OpenSSH_4.5p1)
Exception: Invalid packet blocking
Traceback (most recent call last):
File "/Users/crose/virtualenv/mycompany/lib/python2.6/site-packages/paramiko/transport.py", line 1491, in run
ptype, m = self.packetizer.read_message()
File "/Users/crose/virtualenv/mycompany/lib/python2.6/site-packages/paramiko/packet.py", line 344, in read_message
raise SSHException('Invalid packet blocking')
Every other host we have works, as far as I can tell. What's causing this to happen, and how can I fix it?
First obvious question: how is this host different from the rest?
By the looks of it, it could be a bug in the SSH server. Does OpenSSH on the command line work against it, and is it using a different cipher?
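If you want more data from the client side first (this is a suggestion of mine, not something the answer prescribes), you can turn on paramiko's transport log and compare the key-exchange and cipher lines against a host that works:

import paramiko

# Sketch: write paramiko's transport chatter to a file, then diff it against
# the same log taken while connecting to a known-good host. The hostname comes
# from the question; the username is a placeholder.
paramiko.util.log_to_file('paramiko_debug.log')

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect('p-websvr-004', username='crose')
client.exec_command('/usr/sbin/pkg_info -aD | grep "Information for"')
client.close()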