Ticket #50 (closed defect)

Opened 10 years ago

Last modified 10 years ago

libssh2_sftp_readdir() hangs forever

Reported by: tonyspoken Owned by: bagder
Priority: normal Milestone:
Component: SFTP Version:
Keywords: Cc: tonyspoken, bagder
Blocked By: Blocks:

Description

An abnormal situation of hanging is noted when doing a remote dir read on a particular directory with the ibssh2_sftp_readdir_ex().
In attach the complete stdout log.

Attachments

OutLogComplete_hangs.zip (43.4 KB) - added by tonyspoken 10 years ago.

Download all attachments as: .zip

Change History

Changed 10 years ago by tonyspoken

comment:1 Changed 10 years ago by tonyspoken

I'm further investigating this problem.
I got one interesting thing from logs and network snoops.
The problem seems connected to the amount of datas we're grabbing from the network.
In particular after calling libssh2_sftp_readdir() the SSH server is responding us with a 4100 bytes packet, that from libssh2 point of view is got in two different stages (4096 bytes raw data + further 4 bytes raw data, as can be seen from the logs) because 4096 is the actual fixed dimension of the chunk size, as defined in lissh2_priv.h (PACKETBUFSIZE).

The faulty and hanging situation is not appearing anynmore in two different cases:

  • for example if we add a new file in the directory
  • we change the PACKETBUFSIZE value

Both of the two options leads to a change in the number of bytes actually exchanged between server and client (for example when adding a new file in the remote directory to be ls-ed, the received 4100 bytes returned from SSH server became 4196, and everything went fine).

I can deep analyze it more, but probably someone who's inside the transport and packet routines could speed up things a lot.
Any help is appreciated.

comment:2 Changed 10 years ago by bagder

Well, I wrote the transport.c file and functions but what you describe sounds like there's a bug somewhere that makes the code get stuck in a way it shouldn't. And then I doubt it helps much that I wrote it, since I didn't mean to introduce bugs of this kind and I can't say I know where they are now...

Your report doesn't say what libssh2 version you're using, but if it's not the current CVS (recent daily snapshot) please try that since I've fixed a few bugs since 0.17 that would result in something like this.

I would also be interested in knowing how to repeat this problem so that I can debug/assist in my end, should I get time for it.

comment:3 Changed 10 years ago by tonyspoken

Hi Daniel,

i'm utilizing the latest CVS version.
from a further analysis it seems that the problem relies on transport.c.
I explain you briefly what happens.
After a call to libssh2_sftp_readdir(), the libssh2_packet_read() is called.
The SSH server is responding with a packet of 4100 Bytes.
Since we put 4096 as embedded PACKETBUFSIZE in libssh2_priv.h and the recv()in libssh2_packet_read() takes exactly PACKETBUFSIZE bytes from network.
When the 5 initial bytes are decoded, looking at the size we have to deal with it is clear that we have to do another recv(), because there are more bytes waiting for us.
These are 4100 - 4096 = 4 bytes more to take.

Here it is the code that gives problem
numbytes = remainbuf; Here numbytes = 4

if (numbytes < blocksize) { blocksize is 16 here.

/* we can't act on anything less than blocksize */
return PACKET_EAGAIN;

}

So we enter in a PACKET_EAGAIN loop. Deleting this check everything seems working, but i don't know which side effects can rise from this deletion.
This means that every time the fragmentation of packets is in a number inferior to 16 the read will hang.
In our case this faulty situation will rise if SSH server return 4096+X byte where 1 < X <= 15.
Don't know if the situation is now clear. I can send network dump for a follow-up, or other details if it's needed.

comment:4 Changed 10 years ago by tonyspoken

Hi.

Here it is only an idea of how we could manage, but i expect feedbacks from Daniel.
Probably the check "if (numbytes < blocksize)" should be done, but only the very first time we get a chunk, not always.

So my suggestion would be to move this check after we're doing the other check (the one that controls whether we're at first chunk or not)

The resulting code would look like this:
............

if (!p->total_num) {

/* No payload package area allocated yet. To know the

size of this payload, we need to decrypt the first
blocksize data. */

Beginning of the moved piece
if (numbytes < blocksize) {

/* we can't act on anything less than blocksize */
return PACKET_EAGAIN;

}

End of the moved piece
if (!p->total_num) {

...........

I don't know if it makes sense.

comment:5 Changed 10 years ago by bagder

It makes a lot of sense. Thanks for the hard work and suggested fix. I committed it just now. I take it this fixes this problem of yours?

comment:6 Changed 10 years ago by tonyspoken

Yes. That fixed our problems. We made some basic functional tests and it worked nicely.
Could you put the status of this as "Closed"?
Thanks for all the support

comment:7 Changed 10 years ago by bagder

Thanks for verifying, closed!

Note: See TracTickets for help on using tickets.