[parsec-users] ferret deadlock

Chris Fensch c.fensch at ed.ac.uk
Wed Mar 3 13:49:16 EST 2010


Hello Marc,
   hello ChrisB.

I just encountered the same problem running the parallel version of
ferret on a single-core SPARC machine. The cause is rather simple:

ferret uses the global variable "int input_end" to communicate to the
last stage of the pipeline that all input files have been read:

void *t_load (void *dummy)
{
	/* ... */

	input_end = 1;
	return NULL;
}

The condition for the last stage to finish is:

void *t_out (void *dummy)
{
	struct rank_data *rank;
	for (;;)
	{
		/* ... */

		fprintf(stderr, "(%d,%d)\n", cnt_enqueue, cnt_dequeue);
		if (input_end && (cnt_enqueue == cnt_dequeue))
		{
			/* signal main thread that work is done */
		}
	}
	return NULL;
}


It is quite easy to imagine an interleaving of threads in which the
following happens:

- t_load reads the last file from the disk and enqueues it for
processing, but a context switch happens before it can set input_end =
1.
- The other stages process the data (t_load is still suspended).
- t_out reads the data, processes it and checks the condition (input_end
&& (cnt_enqueue == cnt_dequeue)), which fails because input_end == 0. It
then waits for another element to arrive in its input queue, which
never happens.
- t_load completes and sets input_end = 1, but this no longer matters,
as t_out is stuck in dequeue.
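
The four steps above can be replayed deterministically in a single
thread. This is only an illustrative sketch: the struct and the function
names are hypothetical, and only input_end, cnt_enqueue and cnt_dequeue
come from ferret. It shows that the exit test is evaluated once, too
early, and never again.

```c
#include <stdbool.h>

/* Shared state, named after the variables in ferret. */
struct state {
	int input_end;
	int cnt_enqueue;
	int cnt_dequeue;
};

/* The exit test performed by t_out. */
static bool exit_condition(const struct state *s)
{
	return s->input_end && (s->cnt_enqueue == s->cnt_dequeue);
}

/*
 * Replays the interleaving step by step (hypothetical helper).
 * Returns true if t_out ever observed the exit condition.
 */
bool replay_bad_interleaving(void)
{
	struct state s = { 0, 0, 0 };
	bool t_out_saw_exit;

	s.cnt_enqueue++;                     /* t_load enqueues the last file, then is preempted */
	s.cnt_dequeue++;                     /* other stages run; t_out dequeues the result */
	t_out_saw_exit = exit_condition(&s); /* t_out tests: input_end is still 0 */
	s.input_end = 1;                     /* t_load resumes; t_out is blocked and never re-tests */

	return t_out_saw_exit;
}
```

At the end of the replay the condition does hold, but t_out is blocked
in dequeue, so nothing evaluates it again.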

A probably better method to signal the end of input would be to enqueue
a special end-of-input data token into the pipeline (possibly bypassing
all stages and inserting it directly into q_rank_out). Once this token
has been received by the last stage AND (cnt_enqueue == cnt_dequeue),
the last stage signals the main thread that the work is done.
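
A minimal sketch of the end-of-input token idea, assuming a two-stage
pipeline with one bounded FIFO queue. Everything here (struct queue,
END_OF_INPUT, load_stage, out_stage, run_pipeline) is hypothetical and
not ferret's code. Because this sketch has a single FIFO queue, the
token always arrives after the last data item, so the extra
cnt_enqueue == cnt_dequeue check is not needed here; it would be needed
if the token bypassed stages as suggested above.

```c
#include <pthread.h>
#include <stddef.h>

#define QUEUE_CAP 4
#define END_OF_INPUT (-1)	/* hypothetical sentinel; real items are >= 0 */

/* Minimal bounded FIFO guarded by a mutex and two condition variables. */
struct queue {
	int items[QUEUE_CAP];
	size_t head, count;
	pthread_mutex_t lock;
	pthread_cond_t not_empty, not_full;
};

static void queue_init(struct queue *q)
{
	q->head = q->count = 0;
	pthread_mutex_init(&q->lock, NULL);
	pthread_cond_init(&q->not_empty, NULL);
	pthread_cond_init(&q->not_full, NULL);
}

static void enqueue(struct queue *q, int item)
{
	pthread_mutex_lock(&q->lock);
	while (q->count == QUEUE_CAP)
		pthread_cond_wait(&q->not_full, &q->lock);
	q->items[(q->head + q->count) % QUEUE_CAP] = item;
	q->count++;
	pthread_cond_signal(&q->not_empty);
	pthread_mutex_unlock(&q->lock);
}

static int dequeue(struct queue *q)
{
	int item;

	pthread_mutex_lock(&q->lock);
	while (q->count == 0)
		pthread_cond_wait(&q->not_empty, &q->lock);
	item = q->items[q->head];
	q->head = (q->head + 1) % QUEUE_CAP;
	q->count--;
	pthread_cond_signal(&q->not_full);
	pthread_mutex_unlock(&q->lock);
	return item;
}

struct pipeline {
	struct queue q;
	int n_items;	/* number of items the load stage produces */
	int n_output;	/* number of items the output stage consumed */
};

static void *load_stage(void *arg)
{
	struct pipeline *p = arg;

	for (int i = 0; i < p->n_items; i++)
		enqueue(&p->q, i);
	enqueue(&p->q, END_OF_INPUT);	/* takes the place of "input_end = 1" */
	return NULL;
}

static void *out_stage(void *arg)
{
	struct pipeline *p = arg;

	/* The only exit path is receiving the token, so no wake-up can be missed. */
	while (dequeue(&p->q) != END_OF_INPUT)
		p->n_output++;
	return NULL;
}

/* Runs both stages to completion and returns the number of items output. */
int run_pipeline(int n_items)
{
	struct pipeline p = { .n_items = n_items, .n_output = 0 };
	pthread_t loader, out;

	queue_init(&p.q);
	pthread_create(&loader, NULL, load_stage, &p);
	pthread_create(&out, NULL, out_stage, &p);
	pthread_join(loader, NULL);
	pthread_join(out, NULL);
	return p.n_output;
}
```

The point of the design is that the termination signal travels through
the same blocking channel the last stage is already waiting on, so it
cannot arrive while that stage is asleep.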

Depending on the details of the cache coherence protocol in use, the
problem can even happen if t_load sets input_end = 1 before the
condition check, but the change is not propagated in time. In general,
it is a bad idea to use normal global variables for synchronisation. I
think the variable should at least have been declared volatile, to make
sure that the compiler never keeps it in a register.
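
As a possible alternative to volatile (a swapped-in technique, not what
is proposed above): in C, volatile by itself gives no ordering
guarantees between threads, whereas C11 atomics do. A minimal sketch,
assuming a C11 compiler; the two helper functions are hypothetical
names. Note that this only addresses visibility of the flag and does
not remove the missed wake-up described earlier.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Zero-initialised: input has not ended yet. */
static atomic_int input_end;

/* Called by the loading thread after its last enqueue (hypothetical helper). */
void signal_input_end(void)
{
	/* release: writes made before this store are visible to an acquiring reader */
	atomic_store_explicit(&input_end, 1, memory_order_release);
}

/* Called by the output thread when testing its exit condition (hypothetical helper). */
bool input_has_ended(void)
{
	return atomic_load_explicit(&input_end, memory_order_acquire) != 0;
}
```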

Cheers
   Chris F 


> Hello Marc,
> 
> We know of no deadlocks in any of the PARSEC workloads, but due to the 
> nondeterministic nature of multithreading we expected that problems like that 
> would show up. We are very interested in fixing this bug, can you provide us 
> with more information? You seem to be able to reproduce the deadlock 
> consistently, could you tell us which synchronization primitives are 
> involved?
> 
> - Chris
> 
> 
> On Monday 21 April 2008 11:54 am, Marc de Kruijf wrote:
> > I am seeing frequent deadlock running ferret on any input size
> > (happens approx. 50% of the time) .  My platform is "i686-linux.gcc",
> > which uses gcc 3.4.4 running on an Intel Core2 Duo.  I see the same
> > problem with versions of gcc 4.x as well.  Below is a sample output.
> > The deadlock happens as the program is completing.
> >
> > Has nobody seen this before?
> >
> > ----
> >
> > $ bin/parsecmgmt -a run -p ferret -c gcc -i simsmall
> > [PARSEC] Benchmarks to run:  ferret
> >
> > [PARSEC] [========== Running benchmark ferret ==========]
> > [PARSEC] Deleting old run directory.
> > [PARSEC] Setting up run directory.
> > [PARSEC] Unpacking benchmark input 'simsmall'.
> > corel/
> > corel/__cass.env
> > corel/corel.raw
> > corel/lsh.lsh
> > corel/map_corel.map
> > queries/
> > queries/acorn.jpg
> > queries/air-fighter.jpg
> > queries/airplane-2.jpg
> > queries/airplane-takeoff-3.jpg
> > queries/alcatraz-island-prison.jpg
> > queries/american-flag-3.jpg
> > queries/apartment.jpg
> > queries/apollo-2.jpg
> > queries/apollo-earth.jpg
> > queries/apple-11.jpg
> > queries/apple-14.jpg
> > queries/apple-16.jpg
> > queries/apple-7.jpg
> > queries/aquarium-fish-25.jpg
> > queries/arches-9.jpg
> > queries/arches.jpg
> > [PARSEC] Running 'time /media/usbdisk/top/users/dekruijf/parsec-1.0/pkgs/apps/ferret/../../../pkgs/apps/ferret/inst/i686-Linux.gcc/bin/ferret corel lsh queries 10 20 1 output.txt':
> > [PARSEC] [---------- Beginning of output ----------]
> > (12,1)
> > (12,2)
> > (12,3)
> > (12,4)
> > (12,5)
> > (12,6)
> > (12,7)
> > (12,8)
> > (12,9)
> > (12,10)
> > (12,11)
> > (12,12)
> > (13,13)
> > (14,14)
> > (15,15)
> > (16,16)
> >
> > ----
> >
> > Marc


-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.


