 Strange SMP (pthreads) scalability problem

        I am experiencing an extremely strange SMP-related problem with an
application that I am developing. In a nutshell, the problem is that I have
a test program that, when run with 1 thread and 2 threads respectively,
should show linear scalability but is scaling much worse than linearly instead.
Each thread executes the same function, called worker(). When the test
program is started with 1 thread, the time it takes to execute is, on
average, 20 seconds.

        Now, when the test program is run with 2 threads, the tests take,
on average, 78 seconds to execute. There aren't any mutex locks between
them, or anywhere for that matter. Here's the output from a test program
that was developed to exemplify this problem:

        Threads: 1, Stack: 0, Loops: 250000000, Reps: 3, Elapsed: 61, Avg: 20
                [20 21 20 ]
        Threads: 2, Stack: 0, Loops: 250000000, Reps: 3, Elapsed: 235, Avg: 78
                [78 79 78 ]

        What follows is an explanation of the above output:

        Threads:   - The number of thread(s) the test is using
        Stack:     - Whether to use stack-based or malloc'ed memory. default=0 (malloc'ed)
        Loops:     - The number of loops each thread will execute
        Reps:      - The number of repetitions
        Elapsed:   - The elapsed time of the test
        Avg:       - The average amount of time for each repetition to finish
        [...]      - The actual completion times for each repetition

        All threads are bound to their own LWP, so the kernel does the
scheduling.

        So given that, when 1 thread executes, the average amount of time
it takes to complete is 20 seconds. When 2 threads execute, the test takes
78 seconds to complete.

        The machine it is running on is a dual-processor Sun Ultra 2, 300 MHz,
running Solaris 8:

        SunOS pine 5.8 Generic_108528-11 sun4u sparc SUNW,Ultra-2

        So given 2 processors, why is it that when 2 threads are executing
it takes 78 seconds to run? It should take the same amount of time as with
a single thread, right? One thread will run on processor 0, and the other
thread will run on processor 1. It should take 20 seconds with 2 threads.
But it is taking a lot longer, unacceptably longer.  Scalability should be
linear in this type of test. But it is not. Is it exponential? Perhaps to
a degree.

        Both processors are pegged during the testing (using mpstat), and
nothing else is happening on this system other than what I am doing to it,
which is running these tests.

        Here is the source for the worker thread:

**************** BEGIN CODE SNIPPET *********************
void*
worker(void *data)
{
int             i, j, n=(int) data;
hrtime_t        starttime, endtime;
time_t          tstarttime, tendtime;
int             *localbuf=NULL, *src, *dst, stackbuf[BUF_LENGTH];

        if(verbose == 2)
                printf("\t\tT%d starting\n",
                        pthread_self());

        if(stack == 0)
                localbuf = (int*) malloc(sizeof(int)*BUF_LENGTH);

        if(stack)
                src = dst = stackbuf;
        else
                src = dst = localbuf;  

        starttime = gethrvtime();

        for(i=0; i<n; i++)
        {
                for(j=0; j<BUF_LENGTH; j++)
                        src[j] = ~dst[j];
        }

        endtime = gethrvtime() - starttime;

        if(stack == 0)
                free(localbuf);

        if(verbose == 2)
                printf("\t\tT%d done: %f\n",
                        pthread_self(),
                        endtime/1000.0/1000.0/1000.0);

        return NULL;
}
**************** END CODE SNIPPET *********************

        As you can see, there isn't anything in worker() that should cause
one thread to block another thread's execution.

        By default, the worker thread uses malloc'ed memory. Now, when I set
'stack' to 1, the worker() thread uses a stack-based array for its work.
When this is done, guess what, scalability is linear:

        Threads: 1, Stack: 1, Loops: 250000000, Reps: 3, Elapsed: 61, Avg: 20
                [20 21 20 ]
        Threads: 2, Stack: 1, Loops: 250000000, Reps: 3, Elapsed: 62, Avg: 20
                [21 20 21 ]

        The above shows linear scalability using stack-based data. But
why is scalability not linear when each thread is using its own malloc'ed
memory?

        As you can see from the source for worker(), when verbose == 2 each
thread will output the high-resolution time each LWP is charged with. Here's
the output when stack is 0 (worker thread using malloc'ed memory):

                        T4 starting
                        T4 done: 20.338812
                Rep 0 complete in 20 seconds
                        T5 starting
                        T5 done: 20.392725
                Rep 1 complete in 21 seconds
                        T6 starting
                        T6 done: 20.376893
                Rep 2 complete in 20 seconds
        Threads: 1, Stack: 0, Loops: 250000000, Reps: 3, Elapsed: 61, Avg: 20
                [20 21 20 ]
                        T7 starting
                        T8 starting
                        T7 done: 73.645825
                        T8 done: 78.255280
                Rep 0 complete in 79 seconds
                        T9 starting
                        T10 starting
                        T10 done: 73.285507
                        T9 done: 78.389890
                Rep 1 complete in 78 seconds
                        T11 starting
                        T12 starting
                        T12 done: 74.132950
                        T11 done: 78.802745
                Rep 2 complete in 79 seconds
        Threads: 2, Stack: 0, Loops: 250000000, Reps: 3, Elapsed: 236, Avg: 78

        And now here's the same test when stack == 1 (each thread using
stack-based data):

                        T4 starting
                        T4 done: 20.398696
                Rep 0 complete in 20 seconds
                        T5 starting
                        T5 done: 20.384777
                Rep 1 complete in 21 seconds
                        T6 starting
                        T6 done: 20.386374
                Rep 2 complete in 20 seconds
        Threads: 1, Stack: 1, Loops: 250000000, Reps: 3, Elapsed: 61, Avg: 20
                [20 21 20 ]
                        T7 starting
                        T8 starting
                        T8 done: 20.342467
                        T7 done: 20.401012
                Rep 0 complete in 20 seconds
                        T9 starting
                        T10 starting
                        T10 done: 20.342904
                        T9 done: 20.404875
                Rep 1 complete in 21 seconds
                        T11 starting
                        T12 starting
                        T12 done: 20.340070
                        T11 done: 20.403368
                Rep 2 complete in 20 seconds
        Threads: 2, Stack: 1, Loops: 250000000, Reps: 3, Elapsed: 61, Avg: 20

        From the above output, each thread takes the expected 20
seconds to execute regardless of how many threads are running.

        Here's the snippet of code that binds each thread to an LWP:

                /*
                ** Bind each thread to an LWP (Solaris).
                */

                pthread_attr_init(&attr);

                pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

                ...

                        for(k=0; k<threadnum; k++)
                                pthread_create(&tid[k], &attr, worker,
                                        (void *) nloops);

        I thought about appending the entire source of the test program to
this message but decided against it because it might be confusing. I'll
post the test program in its entirety in another message.

        Any help in solving this mystery is greatly appreciated. Also,
when more than 2 threads are executing, the problem becomes stranger still,
but I figure that if I can solve the 1 --> 2 thread scalability mystery,
the problem with > 2 threads may fall into place as well.

        Thanks for taking the time to slog through all of this. This has
me stumped at the moment.

        Jim

--
-----------------------
Jim
balson AT attbi DOT com



 Mon, 28 Mar 2005 22:42:33 GMT   
 Strange SMP (pthreads) scalability problem

        As promised, the test program in its entirety:

To compile:

        cc -fast -o smp1e smp1e.c -lpthread

To run using malloc'ed memory:

        $ smp1e -n 250000000 -v -t 2 -r 3

To run using stack-based memory:

        $ smp1e -s -n 250000000 -v -t 2 -r 3

To run and set verbose to 2 to see the high-resolution timer output:

        $ ptime smp1e -s -n 250000000 -v -v -t 2 -r 3

*************** BEGIN smp1e.c ****************

#include        <stdio.h>
#include        <stdlib.h>
#include        <string.h>
#include        <time.h>
#include        <sys/time.h>
#include        <pthread.h>

#define BUF_LENGTH      8

int     verbose=0, stack=0;

void    *worker(void *data);

int
main(int argc, char **argv)
{
pthread_attr_t  attr;
pthread_t       *tid;
int     rep, reps=1, i, nloops=0, nthreads=4, k, threadnum, startthread=1;
time_t          tstarttime, tendtime, ttstarttime, ttendtime, *avgtime;
hrtime_t        elapsedtime, intstarttime;
hrtime_t        starttime, endtime;

        for(i=1; i<argc; i++)
        {
                if(!strcmp(argv[i], "-n"))
                        nloops = atoi(argv[++i]);
                else if(!strcmp(argv[i], "-r"))
                        reps = atoi(argv[++i]);
                else if(!strcmp(argv[i], "-t"))
                        nthreads = atoi(argv[++i]);
                else if(!strcmp(argv[i], "-st"))
                        startthread = atoi(argv[++i]);
                else if(!strcmp(argv[i], "-v"))
                        verbose++;
                else if(!strcmp(argv[i], "-s"))
                        stack = 1;
        }

        tid = (pthread_t*) malloc(sizeof(pthread_t) * nthreads);

        avgtime = (time_t*) malloc(sizeof(time_t) * reps);

        /*
        ** Bind each thread to an LWP (Solaris).
        */

        pthread_attr_init(&attr);

        pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

        ttstarttime = time(NULL);

        starttime = gethrtime();

        for(threadnum=startthread; threadnum<=nthreads; threadnum++)
        {
        time_t  avg;

                tstarttime = time(NULL);

                for(rep=0; rep<reps; rep++)
                {
                        intstarttime = time(NULL);

                        for(k=0; k<threadnum; k++)
                                pthread_create(&tid[k], &attr,
                                        worker, (void *) nloops);

                        for(k=0; k<threadnum; k++)
                                pthread_join(tid[k], NULL);

                        endtime = time(NULL);

                        avgtime[rep] = endtime - intstarttime;

                        if(verbose == 2)
                                printf("\tRep %d complete in %d seconds\n",
                                        rep, avgtime[rep]);
                }

                tendtime = time(NULL);

                avg = 0;
                for(i=0; i<reps; i++)
                        avg += avgtime[i];

                avg = avg / reps;

                if(verbose)
                {
                        printf("Threads: %d, Stack: %d, Loops: %d, Reps: %d, Elapsed: %d, Avg: %d\n",
                                threadnum, stack, nloops, reps,
                                tendtime - tstarttime, avg);

                        printf("\t[");
                        for(i=0; i<reps; i++)
                                printf("%d ", avgtime[i]);
                        printf("]\n");
                }
        }

        elapsedtime = (gethrtime() - starttime);
        ttendtime = time(NULL);

        if(verbose)
                printf("Elapsed Time: %d - %6.2f\n", ttendtime - ttstarttime,
                        elapsedtime/1000.00/1000/1000);

        return 0;
}

/******************************************************************************
**
**
**
**
**
*/

void*
worker(void *data)
{
int             i, j, n=(int) data;
hrtime_t        starttime, endtime;
time_t          tstarttime, tendtime;
int             *localbuf=NULL, *src, *dst, stackbuf[BUF_LENGTH];

        if(verbose == 2)
                printf("\t\tT%d starting\n",
                        pthread_self());

        if(stack == 0)
                localbuf = (int*) malloc(sizeof(int)*BUF_LENGTH);

        if(stack)
                src = dst = stackbuf;
        else
                src = dst = localbuf;  

        /*
        ** This loop within this thread should not interfere with any
        ** other threads that may be running within this process.
        ** Why does it not scale? There aren't any mutex locks and all
        ** data is local to this thread. When 2 threads are
        ** executing the following loop, 2 threads take much longer to
        ** complete than does 1 thread. When 2 threads are executing,
        ** the program should take the same amount of time as if
        ** 1 thread were executing.
        */

        starttime = gethrvtime();

        for(i=0; i<n; i++)
        {
                for(j=0; j<BUF_LENGTH; j++)
                        src[j] = ~dst[j];
        }

        endtime = gethrvtime() - starttime;

        if(stack == 0)
                free(localbuf);

        if(verbose == 2)
                printf("\t\tT%d done: %f\n",
                        pthread_self(),
                        endtime/1000.0/1000.0/1000.0);

        return NULL;
}
*************** END smp1e.c ****************

--
-----------------------
Jim
balson AT attbi DOT com



 Mon, 28 Mar 2005 23:38:05 GMT   
 Strange SMP (pthreads) scalability problem

The results you have gotten actually make more sense than you think; if
you profile your program (always a good idea in a case where performance
is not what you think it should be) you are likely to find that the
bottleneck is in your malloc/free calls -- owing to the internal locks
in the allocator.

The other point worth making is that toy examples like this rarely reveal
anything particularly interesting about performance; all they reveal is
the fundamental level of overhead in multithreading, which in a
real-world example would be amortized over the actual processing that
goes on in the threads.

One further note:
If the example you've provided is indicative of the real-world problem
you're trying to solve (i.e. the solution would be fundamentally
allocator-bound), you might want to look at Hoard (http://www.hoard.org),
a drop-in allocator for SMPs.
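
One way to try it without recompiling is to preload it; the install path
below is just an example, not necessarily where it lands on your system:

    $ LD_PRELOAD=/opt/hoard/libhoard.so smp1e -n 250000000 -v -t 2 -r 3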

HTH,
--ag

--
Artie Gold -- Austin, Texas



 Tue, 29 Mar 2005 02:18:23 GMT   
 Strange SMP (pthreads) scalability problem
In comp.unix.solaris Artie Gold <ag...@bga.com> wrote:

[snip]

        Hi Artie. Point well taken on the malloc/free locks in multithreaded
applications. However, the worker() thread calls malloc once, and then free
once, exactly. So when running the application with 2 threads, malloc is
called twice (at the beginning of worker()) and free is called twice (at the
end of worker()). So the malloc/free locks have nothing to do with the SMP
performance of this test program.

        And I don't think profiling is necessary, because all of the time in
this program is spent in the following section of worker():

                for(i=0; i<n; i++)
                {
                        for(j=0; j<BUF_LENGTH; j++)
                                src[j] = ~dst[j];
                }

        It's just memory accesses, either from the stack or from the heap. Stack
accesses seem to scale, heap accesses do not. Weird ...

        Thanks, Jim

--
-----------------------
Jim
balson AT attbi DOT com



 Tue, 29 Mar 2005 03:33:47 GMT   
 Strange SMP (pthreads) scalability problem

[snip]

This is just a WAG - is it possible that the processor, when reading
from the stack buffer, does not do cache snooping? You can read text
regions without snooping, since they're read-only. A stack region,
in theory, can only be modified by the processor running the thread
that owns the stack, so perhaps you could safely not snoop addresses
in a stack region?

Since your test thread consists of very intense reading/writing with
little computation, it seems possible that the cache snooping bus is
being saturated when two processors run. The difference with stack-
based buffers would be because they are not subject to the same
cache-coherency protocol as regular memory.

Pure speculation on my part. Maybe someone more knowledgeable will
chime in?

-Sean



 Tue, 29 Mar 2005 03:38:24 GMT   
 Strange SMP (pthreads) scalability problem
In comp.unix.solaris Sean P. Burke <sbu...@dev0.welkyn.com> wrote:

        Hmmm. This is interesting. But wait, the stack is going to be
treated the same way as heap memory, no? I mean, the CPU isn't going to know
whether the address is within heap storage or stack storage, right? Further,
both sets of memory, stack and heap, are ultimately going to reside on a DIMM
plugged into the motherboard, so I'm not sure how the CPU is going to be
able to detect whether or not any particular address is stack or heap and
do the cache snooping thing based on that. But honestly, I don't know. I'm
just playing devil's advocate.

        Like Sean said, can anyone else more knowledgeable chime in ...

Jim

--
-----------------------
Jim
balson AT attbi DOT com



 Tue, 29 Mar 2005 04:35:38 GMT   
 Strange SMP (pthreads) scalability problem

    Stack data is not processor-private.

        int main(void) {
            double stack_data;
            pthread_create (..., &stack_data);
            ...
        }

Bingo!  Two threads (potentially running on two processors) both
able to read and write data residing in one thread's stack.

--
Eric.Sos...@sun.com



 Tue, 29 Mar 2005 04:22:33 GMT   
 Strange SMP (pthreads) scalability problem

Jim,

    I don't have a duplicate of your 2x300MHz Ultra2 available, so
I tried your code on a 12x900MHz UltraSPARC-III machine.  The bad
news is that I can't reproduce the slowdown you describe, so I can't
really study what's going on.  The good news is that you can easily
solve your problem by purchasing a SunFire 4800 ;-)

    One possible guess, though.  You mention that the location of
the pounded-on data areas makes a difference: the slowdown occurs
when the memory is obtained from malloc(), but not when it resides
in the threads' own stacks.  One difference between two 32-byte
areas obtained from malloc() and two 32-byte areas on two different
thread stacks is that the latter will assuredly be in different
memory pages, while the former most likely will not be.  Indeed,
the small malloc()ated areas may even occupy the same cache line.
It could be that your two CPUs are spending their time invalidating
each other's caches, forcing nearly all the data accesses back to
physical RAM.

    A suggestion: When you malloc() the two 32-byte areas, inflate
the request size by an extra 8192 bytes so the areas will assuredly
not be on the same page.  Another experiment would be to have one
thread use stack memory while the other calls malloc(); again, this
will put the two areas on different pages.  Let us know what happens.
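
    For the first experiment, only the malloc() call in worker() needs to
change; a minimal sketch, assuming an 8 KB page size (PAGE_PAD is just a
made-up name for the extra padding):

        #define PAGE_PAD 8192   /* assumed page size */

        if(stack == 0)
                localbuf = (int*) malloc(sizeof(int)*BUF_LENGTH + PAGE_PAD);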

--
Eric.Sos...@sun.com



 Tue, 29 Mar 2005 05:18:51 GMT   
 Strange SMP (pthreads) scalability problem
When you see scalability problems like this, it's often some form of
"false sharing".  In shared-memory multiprocessors, the unit of memory
coherence is the cache line.

If both buffers end up in the same cache line (plausible in the malloc
case, implausible in the stack case) and the threads end up on two
different CPUs, those CPUs end up spending all their time playing
ping-pong with that cache line.

The usual fix is to round data structures up to be a multiple of the
cache line in size.
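
For instance, a rough sketch assuming a 64-byte line (use the real
E-cache line size of the machine; note that malloc() alignment is a
separate question):

        #define CACHE_LINE      64      /* assumed line size */

        /*
        ** Pad each thread's buffer out to a full cache line so two
        ** adjacent allocations never share a line (this relies on
        ** sizeof(int)*BUF_LENGTH being <= CACHE_LINE).
        */
        union padded_buf {
                int     data[BUF_LENGTH];
                char    pad[CACHE_LINE];
        };

        union padded_buf *p = (union padded_buf*) malloc(sizeof(union padded_buf));
        src = dst = p->data;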

                                        - Bill



 Tue, 29 Mar 2005 06:16:20 GMT   
 Strange SMP (pthreads) scalability problem

    The RWE bits are in the memory mapping structures as described,
but that doesn't imply that the CPU knows whether an address is in
stack space or elsewhere.  (Well, yes: of course it "knows," in the
sense that the kernel can figure out the answer.  But this isn't
the kind of "knowledge" that can be used on each memory reference.)

    All that happens with noexec_user_stack is that the kernel either
does or doesn't set the E bit for the pages assigned to the stack.
When the CPU reads or writes a page it can tell whether E is set or
clear, but that doesn't imply it can distinguish stack accesses from
other accesses.

--
Eric.Sos...@sun.com



 Tue, 29 Mar 2005 06:23:55 GMT   
 Strange SMP (pthreads) scalability problem

Yeah, sometimes I actually *read* all the code in question ;-(

Eric's comments about cache invalidation (elsethread) are certainly the
most likely answer.

Cheers,
--ag

--
Artie Gold -- Austin, Texas



 Tue, 29 Mar 2005 09:04:49 GMT   
 Strange SMP (pthreads) scalability problem

Having reasonably established (in this and other posts in this thread)
that the slowdown is related to whether accesses are to the same or
different pages, do you have an explanation of why the slowdown can't
be reproduced on your SunFire 4800? I know it's a much more modern
machine and all that, but what is it about the architecture that
makes the slowdown disappear? Finer-grained cache lines?

And are you using the same version of Solaris 8 as the OP?

- MikeM.



 Tue, 29 Mar 2005 15:19:50 GMT   
 Strange SMP (pthreads) scalability problem

What is the buffer length, and which exact addresses do you get back for
the buffers?

Casper
--
Expressed in this posting are my opinions.  They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.



 Tue, 29 Mar 2005 20:29:53 GMT   
 Strange SMP (pthreads) scalability problem

Only if you can guarantee alignment at that size as well; if not, you
need to make sure that they're at least one cache line apart.

In this particular case, the false sharing is almost guaranteed because
of the size of the buffers (32 bytes); a large part of both will fall on
the same cache line.

Printing the buffer addresses gives:

        buf =21df8
        buf =21e20

Which tells you that all but the first 8 bytes of the first buffer are on
the same cache line as the second buffer.

If you add a cache line's worth of padding to the allocations, you will
find linear scaling for this problem.

Similar false-sharing effects happen when you break up a lock into, e.g.,
8 locks, and the mutex_t lockarray[8] ends up on one cache line;
effectively, you still have a single lock.
Casper
--
Expressed in this posting are my opinions.  They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth.



 Tue, 29 Mar 2005 20:48:54 GMT   
 