 Another rsync over ssh hang (repeatable, with 2.4.1 on both ends)

I have a fairly repeatable rsync over ssh stall that I'm seeing between
two Linux boxes, both running identical 2.4.1 kernels.  The stall is
fairly easy to repeat in our environment -- it can happen up to several
times per minute, and usually happens at least once per minute.  It
doesn't really seem to be data-sensitive.  The stall will last until the
session times out *unless* I take one of two steps to "unstall" it.  The
easiest way to do this is to run 'strace -p $PID' against the sending ssh
process.  As soon as the strace is started, rsync starts working again,
but will stall again (even with strace still running) after a short period
of time.

We've seen this bug (or a *very* similar one) with 2.2.16 and 2.4.[01].  I
haven't tried a newer 2.2.x or 2.4.2 or -acX.

One system is a P2/400, the other is a P3/800.  The two boxes are
communicating over a mostly idle Ethernet, through 3 switches.  One end is
an EEPro 100, the other end is an Acenic, although that shouldn't matter.

During a stall, the sending end shows a lot of data stuck in the Recv-Q:

Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp    72848      0 ref.lab.ocp.interna:840 ref-0.sys.pnap.net:ssh  ESTABLISHED

The receiving end shows a similar problem, but on the sending queue:

Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0  28960 ref-0.sys.pnap.net:ssh  ref.lab.ocp.interna:840 ESTABLISHED

Like I said, I don't believe that this is a network issue, because I can
un-stall the rsync by either stracing the *sending* ssh process, or by
putting the sending rsync into the background with ^Z and then popping it
back into the foreground.  I have tcpdumps that I can send, but they look
pretty straightforward to me -- the window fills, so data stops flowing.
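
To make the two netstat snapshots above easier to compare, here is a minimal C sketch that pulls Recv-Q and Send-Q out of a "tcp" line. The names parse_netstat_line() and looks_stalled() are hypothetical, and the stall test is only a guess at a useful heuristic for a single snapshot; a real check would sample the queues repeatedly and see whether they drain.

```c
#include <assert.h>
#include <stdio.h>

/* Queue depths parsed from one "tcp" line of netstat output. */
struct tcp_queues {
    unsigned long recv_q;  /* bytes received but not yet read by the application */
    unsigned long send_q;  /* bytes queued but not yet acknowledged by the peer */
};

/* Parse a line such as
 *   "tcp    72848      0 host:840  peer:ssh  ESTABLISHED"
 * Returns 0 on success, -1 if the line does not have that shape
 * (for example the "Proto Recv-Q Send-Q ..." header line). */
int parse_netstat_line(const char *line, struct tcp_queues *out)
{
    char proto[16];

    assert(line != NULL && out != NULL);
    if (sscanf(line, "%15s %lu %lu", proto, &out->recv_q, &out->send_q) != 3)
        return -1;
    return 0;
}

/* Hypothetical heuristic: one end has unread data sitting in Recv-Q
 * while the other end has data sitting in Send-Q, which is the
 * pattern shown in the report above. */
int looks_stalled(const struct tcp_queues *local,
                  const struct tcp_queues *remote)
{
    return local->recv_q > 0 && remote->send_q > 0;
}
```

Feeding it the two lines from the report gives recv_q = 72848 on one end and send_q = 28960 on the other, so looks_stalled() returns non-zero for that pair.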

Strace doesn't seem to be particularly informative:

<blocked, strace starts>
select(4, [0], [1], NULL, NULL) = 1 (out [1])
write(1, "xxxxxxxxxxxxx"..., 66156) = 66156
...
select(4, [0], [1 3], NULL, NULL)       = 2 (out [1 3])
write(1, "\0\0\0\0\274\2\0\0\0\0\0\0\271\30\0\0\0\0\0\0\274\2\0\0"..., 69526
<blocked again>

Strace on the receiving end shows the obvious -- it's sitting in select
waiting for data to arrive.

According to 'ps l', the ssh process is waiting in 'sock_wait_for_wmem'.

We've tried changing versions of rsync and ssh without any success.  FWIW,
this kernel was compiled with GCC 2.95.2, from Debian potato.

Scott

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



 Tue, 19 Aug 2003 08:50:03 GMT   
 Another rsync over ssh hang (repeatable, with 2.4.1 on both ends)

I've also reported this recently, and got told that it was because I was
running 2.2.15pre13 on one end.  Thanks for confirming that 2.2.15pre13
is not the cause.

--
Russell King (r...@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html




 Tue, 19 Aug 2003 18:40:05 GMT   
 Another rsync over ssh hang (repeatable, with 2.4.1 on both ends)
Hello!

The report by Scott Laird is sane, unlike yours.
It can be explained by a bug rather than only by a poltergeist. 8)

Russel, you are warned that kernels <2.2.17 and rsync are an incompatible
combination.

Alexey



 Tue, 19 Aug 2003 23:50:04 GMT   
 Another rsync over ssh hang (repeatable, with 2.4.1 on both ends)

So, what you're saying is that because these kernels have known problems
with rsync, the fact that my symptoms on 2.4.0 are 100% _precisely_ the
same means it's not the same bug?

In addition, the fact that the tcp _retries_ indicate that both sides
are behaving correctly _in this instance_ means that it's not a 2.4 bug?

If you still insist that it is purely a 2.2.15pre13 bug despite the
growing evidence against this, then I shall see if I can get everything
together to put 2.2.18 on this machine.  I can't guarantee when I'll
be able to do this though.

Also, as I pointed out, since the machines are 40+ miles away for
most of the week, and are without a reasonable net connection, I
can only comment on what is _currently_ running, and I thought it at
least useful to indicate that both my and Scott's symptoms are identical.

PS, could you please spell my name correctly?

PPS, rather than arguing about this, can people proceed to investigate
Scott's problem, and I'll "tag along" to see if my problem gets fixed.
Thanks.

--
Russell King (r...@arm.linux.org.uk)                The developer of ARM Linux
             http://www.arm.linux.org.uk/personal/aboutme.html




 Wed, 20 Aug 2003 01:30:05 GMT   
 Another rsync over ssh hang (repeatable, with 2.4.1 on both ends)

Well, I can tell you that going from a 2.4.2pre2 sparc64 box via rsync
over ssh to a 2.4.2 or 2.4.1-pre8 i686 gives me the same problems.

However with slight differences. With the 2.4.1-pre8 kernel on the i686
I see "protocol error, different version of rsync?", and with the 2.4.2
kernel I get segv's in the remote rsync (I'm running the rsync -e ssh
from the sparc64).

Both systems are running IDE, ext2 only on both, no special config
options (pretty bare to be honest).

So no, this is not a 2.2.x interaction bug.

--
 -----------=======-=-======-=========-----------=====------------=-=------
/  Ben Collins  --  ...on that fantastic voyage...  --  Debian GNU/Linux   \
`  bcoll...@debian.org  --  bcoll...@openldap.org  --  bcoll...@linux.com  '
 `---=========------=======-------------=-=-----=-===-======-------=--=---'



 Wed, 20 Aug 2003 01:30:08 GMT   
 Another rsync over ssh hang (repeatable, with 2.4.1 on both ends)
Hello!

It is the same, I think.

I never said this. I said that your strace is _wrong_, so how can I be
sure that the tcpdump is not wrong too? You can understand this. 8)

You planned to make a more accurate strace on Monday, if I remember correctly.
Now it is not necessary; Scott's is enough to understand that
some problem exists and cannot be explained by a buggy 2.2.15.

My apologies.

Alexey



 Wed, 20 Aug 2003 02:20:03 GMT   
 Another rsync over ssh hang (repeatable, with 2.4.1 on both ends)

On Fri, 2 Mar 2001 kuz...@ms2.inr.ac.ru wrote:

One data point on my hang -- I increased
/proc/sys/net/core/wmem_{max,default} from 64k to 256k, and then increased
/proc/sys/net/ipv4/tcp_wmem from "4096 16384 131072" to "16384 65536
262144", and the hangs seem to have either stopped or (more likely)
drastically reduced in frequency.  I was able to rsync a couple GB without
stalling.
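
For anyone who wants to record these settings before and after a test run, the following is a minimal C sketch that reads the three tcp_wmem fields (minimum, default, maximum, in bytes). The struct and function names are hypothetical, and the ordering check is an assumption of this sketch, not something taken from the kernel source.

```c
#include <assert.h>
#include <stdio.h>

/* The three fields of /proc/sys/net/ipv4/tcp_wmem, in bytes. */
struct tcp_wmem {
    long min;
    long def;   /* the "default" field; "default" is a C keyword */
    long max;
};

/* Parse a string such as "4096 16384 131072" (spaces or tabs).
 * Returns 0 on success, -1 on malformed input. */
int parse_tcp_wmem(const char *text, struct tcp_wmem *out)
{
    assert(text != NULL && out != NULL);
    if (sscanf(text, "%ld %ld %ld", &out->min, &out->def, &out->max) != 3)
        return -1;
    /* Assumption of this sketch: a sane setting has min <= default <= max. */
    if (out->min > out->def || out->def > out->max)
        return -1;
    return 0;
}

/* Read and parse the live setting.  Returns -1 if the file cannot be
 * read, for example on a system without /proc. */
int read_tcp_wmem(struct tcp_wmem *out)
{
    char buf[128];
    FILE *fp = fopen("/proc/sys/net/ipv4/tcp_wmem", "r");

    if (fp == NULL)
        return -1;
    if (fgets(buf, sizeof buf, fp) == NULL) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    return parse_tcp_wmem(buf, out);
}
```

Calling read_tcp_wmem() on each end and logging the result next to the stall count would show which setting a given test actually ran with.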

I can perform more tests, if anyone has anything in particular that they'd
like to see.

Scott




 Wed, 20 Aug 2003 02:40:04 GMT   
 Another rsync over ssh hang (repeatable, with 2.4.1 on both ends)

Be very careful here. He did nothing of the sort. He merely indicated that
there is at least one problem running rsync over ssh between 2.4.1 systems.
There is no guarantee that your problem and his are identical. As Alexey
pointed out, there are bad bugs in 2.2.15 which can cause a TCP connection to
get stuck. Given that you are running 2.2.15, you'd need a tcpdump to
determine whether you hit one of these or not.

I've been bitten too many times assuming something was one big problem only
to find out later it was actually several smaller ones.

Regards,

Tim

--
Tim Wright - t...@splhi.com or t...@aracnet.com or twri...@us.ibm.com
IBM Linux Technology Center, Beaverton, Oregon
Interested in Linux scalability ? Look at http://lse.sourceforge.net/
"Nobody ever said I was charming, they said "Rimmer, you're a git!"" RD VI



 Wed, 20 Aug 2003 03:50:04 GMT   
 Another rsync over ssh hang (repeatable, with 2.4.1 on both ends)
Hello!

This is a hint.

Could you do the following:

1. Disassemble tcp_poll() (the easiest way is to gdb vmlinux, to
   say x/i tcp_poll and to hold enter pressed long enough, copying screen
   to file) and send the result to me.
2. Apply the enclosed patchlet.
3. If 2 does not change anything, recompile with egcs-1.1.2.

Alexey

--- ../vger3-010223/linux/net/ipv4/tcp.c        Fri Feb 23 21:28:34 2001
+++ linux/net/ipv4/tcp.c        Sat Mar  3 18:37:22 2001
@@ -442,6 +443,8 @@
                                set_bit(SOCK_ASYNC_NOSPACE, &sk->socket->flags);
                                set_bit(SOCK_NOSPACE, &sk->socket->flags);

+                               barrier();
+
                                /* Race breaker. If space is freed after
                                 * wspace test but before the flags are set,
                                 * IO signal will be lost.



 Thu, 21 Aug 2003 00:00:05 GMT   
 Another rsync over ssh hang (repeatable, with 2.4.1 on both ends)
Notice also that by default ssh opens stdin/stdout blocking, and can
relatively easily deadlock if the pipes it talks over really want to do
a write before a read or the other way round.

You can try compiling the following file, putting it in the same directory
as ssh, and then running rsync over this instead of plain ssh (I use it in
fact in all places where I connect to ssh over pipes).

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#ifndef HAVE_NO_UNISTD_H
# include <unistd.h>
#endif /* HAVE_NO_UNISTD_H */
#include <fcntl.h>

static char ssh[] = "ssh";

/* Put the descriptor behind fp into non-blocking mode.
 * Terminals are left alone.  Returns 0 on success, 1 on failure. */
int unblock(FILE *fp) {
    int fd, rc, flags;

    fd = fileno(fp);
    if (isatty(fd)) return 0;

    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) {
        fprintf(stderr, "Could not query fd %d: %s\n", fd, strerror(errno));
        return 1;
    }
    rc = fcntl(fd, F_SETFL, flags | O_NONBLOCK);
    if (rc < 0) {
        fprintf(stderr, "Could not unblock fd %d: %s\n", fd, strerror(errno));
        return 1;
    }
    return 0;
}

int main(int argc, char **argv) {
    int rc;
    char *ptr, *work;

    (void)argc;  /* unused */

    if (unblock(stdin))  return 1;
    if (unblock(stdout)) return 1;
    if (unblock(stderr)) return 1;

    /* Build "<directory part of argv[0]>ssh" so that the ssh sitting
     * next to this wrapper is the one that gets executed. */
    ptr = strrchr(argv[0], '/');
    if (ptr == NULL) ptr = argv[0];
    else ptr++;
    work = malloc(ptr-argv[0]+sizeof(ssh));
    if (!work) {
        fprintf(stderr, "Out of memory. Buy more ?\n");
        return 1;
    }
    memcpy(work, argv[0], ptr-argv[0]);
    memcpy(work+(ptr-argv[0]), ssh, sizeof(ssh));
    argv[0] = work;

    /* execvp() only returns if the exec failed. */
    rc = execvp(work, argv);
    fprintf(stderr, "Could not exec %.300s: %s\n", work, strerror(errno));
    return rc;
}
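
To see what the wrapper above changes, here is a small C sketch of the difference O_NONBLOCK makes on a full pipe: a blocking write() would simply sleep, while a non-blocking one returns EAGAIN so the process can go back to select() and service reads. The helper names set_nonblocking() and fill_until_eagain() are hypothetical and the 4096-byte chunk size is an arbitrary choice.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Put fd into non-blocking mode.  Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Write zero-filled chunks to wfd until the kernel refuses more.
 * Returns the number of bytes accepted before EAGAIN, or -1 on an
 * unexpected error.  wfd must already be non-blocking, otherwise
 * this call would block once the pipe is full. */
long fill_until_eagain(int wfd)
{
    static const char chunk[4096];
    long total = 0;

    assert(wfd >= 0);
    for (;;) {
        ssize_t n = write(wfd, chunk, sizeof chunk);

        if (n > 0) {
            total += n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue;                 /* interrupted, just retry */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            return total;             /* pipe is full */
        return -1;
    }
}
```

A caller would create a pipe(), call set_nonblocking() on the write end, and then fill_until_eagain() reports how much the pipe took before it filled up. Without the non-blocking flag the same loop would hang at that point until someone read from the other end.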



 Fri, 22 Aug 2003 05:40:03 GMT   
 