Hi all.
I have a really really weird problem that I've been working on for days.
The problem: users cannot connect to our web servers via SSH when they're on our wireless network. Here's where it gets weird:
- Clients from anywhere other than the wireless subnet can connect fine.
- Wireless clients can connect to ssh servers on subnets other than the one our web servers are on (both onsite and offsite).
- I can run nc -l 22 on one of the web servers and transfer big files from a wireless client with cat bigfile | nc webserver 22.
- If I run telnetd on port 22 on one of our web servers, I cannot connect. It fails in a very similar way to ssh.
- Update (Three days later): I can recreate the problem in netcat by typing into the client and server alternately. If I just send one-way in netcat, the problem never comes up.
- The TCP handshake succeeds, then packets stop arriving and the client starts resending packets. The server seems to be waiting.
- When I kill the ssh or sshd process, a bunch of TCP packets start flowing. If I kill the client, the server will actually show a completed key exchange (for ssh, obviously). Said another way: the connection stalls, I kill the client, the connection continues a bit with the client dead, and then it closes.
- Googling around, I found lots of folks who recommended fiddling with MTU and some IP /proc variables, but that did not help. The problem is too consistent to be that anyway, and I can nc big files (10 MB) with no problem (md5 checked).
- I thought it might be a DNS problem, but tcpdump shows no DNS queries while the connection hangs (set UseDNS no in sshd_config).
- Update (Two days later...): I plugged in a machine that is not a Xen host or client, and it shows the same behaviour, so we can rule out any Xen strangeness as the culprit.
- Update (Three days later...): After the TCP handshake, the client can send as many packets as it wants UNTIL the server sends anything (again, after the initial handshake), after which any packets from the client do not reach the server.
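
For anyone who wants to reproduce the alternating-traffic stall without typing into netcat by hand, here is a rough Python sketch of the same test. It assumes plain TCP with one line per message; the names serve_once and ping_pong are made up for illustration, and the bottom part only runs a loopback demo. For the real test, the server half would run on a web server and the client half on a wireless client.

```python
import socket
import threading


def serve_once(listener, rounds):
    """Accept one connection and answer each client line with a reply line."""
    conn, _ = listener.accept()
    with conn:
        reader = conn.makefile("rb")
        for i in range(rounds):
            if not reader.readline():  # client closed the connection
                break
            conn.sendall(b"pong %d\n" % i)


def ping_pong(host, port, rounds=5, timeout=5.0):
    """Alternate send and receive; return how many round trips completed."""
    completed = 0
    with socket.create_connection((host, port), timeout=timeout) as sock:
        reader = sock.makefile("rb")
        for i in range(rounds):
            try:
                sock.sendall(b"ping %d\n" % i)
                if not reader.readline():  # server closed the connection
                    break
            except OSError:  # covers timeouts and connection resets
                break
            completed += 1
    return completed


if __name__ == "__main__":
    # Loopback demo only: bind an ephemeral port and talk to ourselves.
    listener = socket.create_server(("127.0.0.1", 0))
    port = listener.getsockname()[1]
    threading.Thread(target=serve_once, args=(listener, 5), daemon=True).start()
    print("completed round trips:", ping_pong("127.0.0.1", port, rounds=5))
    listener.close()
```

If the behaviour described above holds, I'd expect the count from a wireless client to stop short once the server has sent its first reply, while a wired client completes every round. That is a guess about what the script would show, not something I have measured.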
Other important info:
I only control the client I'm testing with and the web servers. I do not control the wireless setup or the routers or the firewalls. Those are all controlled by my boss. He's checked his config and it looks good to him, so if it really is something wrong on his end, I need really good evidence before I waste his time some more. Really, the clues so far point to my servers being the source of the problem.
The servers are all CentOS 5.5. They are virtualized under Xen. (Tcpdump shows the same stuff on the Xen host/Dom0 as on the client/DomU, so I don't think it's a Xen problem, but then again....)
Update: My client is also Linux (Fedora 11). The problem was initially reported by a Mac user, version unknown.
Okay, I gotta go soak my head.... Thanks All!
-Pileofrogs