It is currently Thu, 17 Apr 2014 18:23:17 GMT



 
Author Message
 {sys_,/dev/}epoll waiting timeout
Hi!

Both /dev/epoll EP_POLL and sys_epoll_wait, when converting the passed
timeout value in msec to jiffies, round down instead of up.  This
occasionally causes these functions to return without any active fd's
before the given timeout period has passed.

This can cause fun situations like these:
        epoll_wait(epfd, events, maxevents, timeout_until_next_timer_expiry)
                [ returns too early ]

        gettimeofday(&now, NULL)
                [ notice that the first timer has not yet expired, do nothing ]

        epoll_wait(epfd, events, maxevents, random_small_value_less_than_jiffy)
                [ returns immediately ]

        gettimeofday(&now, NULL)
                [ notice that first timer still didn't expire yet, do nothing ]

        etc.

Effectively causing busy-wait loops of on average half a jiffy.

nanosleep(2) always rounds timeout values up (I think it is required to do
so by some specification which says that this call should sleep _at_least_
the given amount of time), and this approach to me makes sense for
{sys_,/dev/}epoll also.  See <linux/time.h>:timespec_to_jiffies and
kernel/time.c:sys_nanosleep.

Will you accept a patch to do this?

cheers,
Lennert
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at   http://www.**-**.com/
Please read the FAQ at   http://www.**-**.com/



 Sun, 10 Jul 2005 08:00:17 GMT   
 {sys_,/dev/}epoll waiting timeout

So you would like it to round up the timeout like poll(), select(),
and io_getevents() do?  Fair enough, for the sake of consistency!

sys_poll() says:

        if (timeout) {
                /* Careful about overflow in the intermediate values */
                if ((unsigned long) timeout < MAX_SCHEDULE_TIMEOUT / HZ)
                        timeout = (unsigned long)(timeout*HZ+999)/1000+1;
                else /* Negative or overflow */
                        timeout = MAX_SCHEDULE_TIMEOUT;
        }

sys_io_getevents() does something more complicated in a function
called set_timeout(), but it essentially comes to the same thing.  It
takes a value in nanseconds (which I prefer, btw, for future usefulness).

There is also a {*filter*} overflow in ep_poll().  If HZ == 1000 and
you ask to wait for > approx. 2000 seconds (not unreasonable, say if
epoll is running an ftp server and the client timeout is set to 1
hour), then ep_poll gets the calculation wrong.

I suggest that this line in eventpoll.c:

        jtimeout = timeout == -1 ? MAX_SCHEDULE_TIMEOUT: (timeout * HZ) / 1000;

be changed to this:

        jtimeout = 0;
        if (timeout) {
                /* Careful about overflow in the intermediate values */
                if ((unsigned long) timeout < MAX_SCHEDULE_TIMEOUT / HZ)
                        jtimeout = (unsigned long)(timeout*HZ+999)/1000+1;
                else /* Negative or overflow */
                        jtimeout = MAX_SCHEDULE_TIMEOUT;
        }

And that the prototypes for ep_poll() and sys_epoll_wait() be changed
to take a "long timeout" instead of an "int", just like sys_poll().

Sometimes busy waiting is the right thing to do.  With HZ == 100,
anything involving video animation needs to busy wait after select()
or equivalent, otherwise there is too much display jitter.

But all that sort of code needs to know how much select() et al. tend
to overrun by anyway, so they can just do the same if using epoll.

-- Jamie

ps.  sys_* system-call functions should never return "int".  They
should always return "long" or a pointer - even if the user-space
equivalent returns "int".  Take a look at sys_open() for an example.
Technical requirement of the system call return path on 64-bit targets.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at   http://www.**-**.com/
Please read the FAQ at   http://www.**-**.com/



 Sun, 10 Jul 2005 09:10:08 GMT   
 {sys_,/dev/}epoll waiting timeout

Why assume HZ=1000?  Would not:

timeout = (unsigned long)(timeout*HZ+(HZ-1))/HZ+1;

make more sense?

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



 Sun, 10 Jul 2005 13:50:12 GMT   
 {sys_,/dev/}epoll waiting timeout

No, that's silly.  Why do you want to multiply by HZ and then divide by HZ?

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



 Sun, 10 Jul 2005 14:30:13 GMT   
 {sys_,/dev/}epoll waiting timeout

| > Jamie Lokier wrote:
| >
| > >         jtimeout = 0;
| > >         if (timeout) {
| > >                 /* Careful about overflow in the intermediate values */
| > >                 if ((unsigned long) timeout < MAX_SCHEDULE_TIMEOUT / HZ)
| > >                         jtimeout = (unsigned long)(timeout*HZ+999)/1000+1;
| > >                 else /* Negative or overflow */
| > >                         jtimeout = MAX_SCHEDULE_TIMEOUT;
| > >         }
| >
| > Why assume HZ=1000?  Would not:
| >
| > timeout = (unsigned long)(timeout*HZ+(HZ-1))/HZ+1;
| >
| > make more sense?
|
| No, that's silly.  Why do you want to multiply by HZ and then divide by HZ?

OK, I don't get it.  All Ed did was replace 1000 with HZ and
999 with (HZ-1).  What's bad about that?  Seems to me like
the right thing to do.  Much more portable.

What if HZ changes?  Who's going to audit the kernel for changes?

--
~Randy

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



 Sun, 10 Jul 2005 20:30:13 GMT   
 {sys_,/dev/}epoll waiting timeout

You're being dense.  The input timeout is measured in milliseconds;
see poll(2).  The calculated timeout is measured in jiffies.  Hence
multiply by jiffies and divide by milliseconds.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



 Sun, 10 Jul 2005 20:40:13 GMT   
 {sys_,/dev/}epoll waiting timeout

| > | > Why assume HZ=1000?  Would not:
| > | >
| > | > timeout = (unsigned long)(timeout*HZ+(HZ-1))/HZ+1;
| > | >
| > | > make more sense?
| > |
| > | No, that's silly.  Why do you want to multiply by HZ and then divide by HZ?
| >
| > OK, I don't get it.  All Ed did was replace 1000 with HZ and
| > 999 with (HZ-1).  What's bad about that?  Seems to me like
| > the right thing to do.  Much more portable.
| >
| > What if HZ changes?  Who's going to audit the kernel for changes?
|
| You're being dense.  The input timeout is measured in milliseconds;
| see poll(2).  The calculated timeout is measured in jiffies.  Hence
| multiply by jiffies and divide by milliseconds.

Like I said, I didn't get it.  Now I do.  Thanks.

--
~Randy

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



 Sun, 10 Jul 2005 20:40:16 GMT   
 {sys_,/dev/}epoll waiting timeout

From a mathematical point of view this is a ceil(v)+1, so this is wrong.
It should be :

t = (t * HZ + 999) / 1000;

The +999 already gives you the round up. Different is if we want to be
sure to sleep at least that amount of jiffies ( the rounded up ), in that
case since the timer tick might arrive immediately after we go to sleep by
making us to lose immediately a jiffie, we need another +1. Anyway I'll do
the round up. Same for the overflow check.

I don't see why. The poll(2) timeout is an int.

True, I'll change it.

- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



 Mon, 11 Jul 2005 15:10:09 GMT   
 {sys_,/dev/}epoll waiting timeout

I wonder if it's appropriate to copy sys_poll(), which has the +1, or
sys_select(), which doesn't!

poll(2) takes an int, but sys_poll() takes a long.
I think everyone is confused :)

The reason I suggested "long timeout" for ep_poll is because the
multiply in the expression:

        jtimeout = (unsigned long)(timeout*HZ+999)/1000;

can overflow if you don't.  If you stick with the int, you'll need to
write:

        jtimeout = (((unsigned long)timeout)*HZ+999)/1000;

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



 Mon, 11 Jul 2005 16:50:13 GMT   
 {sys_,/dev/}epoll waiting timeout

Or, fix sys_poll(). With the +1, this means that sys_poll() would have
a 1 in 1001 chance per second of returning one jiffie too early.

On a 16 bit platform, perhaps... :-)

mark

--
m...@mielke.cc/ma...@ncf.ca/ma...@nortelnetworks.com __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   |
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

                           http://mark.mielke.cc/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



 Mon, 11 Jul 2005 18:30:07 GMT   
 {sys_,/dev/}epoll waiting timeout

Nope.  Read the expression again.

Nope.  It will overflow on a _64_ bit platform, if you give it a value
of MAXINT for example.

-- Jamie
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



 Mon, 11 Jul 2005 19:30:16 GMT   
 {sys_,/dev/}epoll waiting timeout

Sorry... not 1 in 1001... almost 100% chance of returning of one
jiffie too many. In practice, even on a relatively idle system,
the process will not be able to wake up as frequently as it might
be able to expect.

mark

--
m...@mielke.cc/ma...@ncf.ca/ma...@nortelnetworks.com __________________________
.  .  _  ._  . .   .__    .  . ._. .__ .   . . .__  | Neighbourhood Coder
|\/| |_| |_| |/    |_     |\/|  |  |_  |   |/  |_   |
|  | | | | \ | \   |__ .  |  | .|. |__ |__ | \ |__  | Ottawa, Ontario, Canada

  One ring to rule them all, one ring to find them, one ring to bring them all
                       and in the darkness bind them...

                           http://mark.mielke.cc/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



 Mon, 11 Jul 2005 21:40:07 GMT   
 {sys_,/dev/}epoll waiting timeout

It's curious that select() is different from poll() - almost is if
completely different people wrote the code :)

We have to wonder whether it was a design decision.
Perhaps the unix specifications require it (see below).

You're right - it's unfortunate that using poll() lets you sleep and
wake up no faster than every _two_ ticks.  That's actually caused by
poll()'s treating zero differently though, not by +1 as such.

There's a strange contradiction between rounding up the waiting time
to the next number of jiffies, and then rounding it down (in a
time-dependent way) by waiting until the next N'th timer interrupt.

If, as someone said, the appropriate unix specification says that
"wait for 10ms" means to wait for _at minimum_ 10ms, then you do need
the +1.

(Davide), IMHO epoll should decide whether it means "at minimum" (in
which case the +1 is a requirement), or it means "at maximum" (in
which case rounding up is wrong).

The current method of rounding up and then effectively down means that
you get an unpredictable mixture of both.

-- Jamie

ps. I would always prefer an absolute wakeup time anyway - it avoids a
race condition too.  What a shame none of the system calls work that way.

pps. To summarise, all the time APIs are a complete mess in unix, and
there's nothing you can do in user space to make up for the b0rken
system call interface.  Except not duplicate past errors in new interfaces :)
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



 Mon, 11 Jul 2005 23:30:08 GMT   
 {sys_,/dev/}epoll waiting timeout

|> (Davide), IMHO epoll should decide whether it means "at minimum" (in
|> which case the +1 is a requirement), or it means "at maximum" (in
|> which case rounding up is wrong).

Since Linux isn't a RTOS you can't meet "at maximum".  POSIX says "poll ()
shall wait at least timeout milliseconds"

Andreas.

--
Andreas Schwab, SuSE Labs, sch...@suse.de
SuSE Linux AG, Deutschherrnstr. 15-19, D-90429 Nrnberg
Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



 Tue, 12 Jul 2005 15:50:08 GMT   
 {sys_,/dev/}epoll waiting timeout

I think I'll go with :

Tj = (Tms * HZ + 999) / 1000

Somehow I feel it more correct :)

- Davide

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/



 Wed, 13 Jul 2005 02:10:08 GMT   
 
   [ 30 post ]  Go to page: [1] [2] [3]

Similar Threads

1. Linux patches for dev/poll or dev/epoll

2. remove extra sys_ in epoll system call

3. epoll timeout and syscall return types ...

4. lt-epoll ( level triggered epoll ) ...

5. lt-epoll ( level triggered epoll ) ...

6. sys_epoll system call interface to /dev/epoll

7. /dev/epoll port to 2.5.42

8. /dev/epoll update ...

9. /dev/epoll update

10. /dev/epoll bug fix ...


 
Powered by phpBB © 2000, 2002, 2005, 2007 phpBB Group.
Designed by ST Software