In the

following passes, the input to each thread is always two sorted vector arrays A and B and the output is

the sorted merge of these two arrays. One vector, a, will be taken

from A and one vector, b, from B, and the

components of these two vectors,

a and b, will be sorted so

that a contains the

lowest four floats and b

the highest four floats.

This is

easily done if one realizes that, for an element an to be among the

four highest elements, there must be at least 4 elements lower than it, and

since there will be n

lower elements in a

that there must be 4 ?

n lower elements in b. I.e., b4 ? n ? 1 must be lower than an. Thus this can be accomplished by

two compare-and-swap instructions, see Figure 1(a).

// get the four lowest floats

a.xyzw = (a.xyzw = a.wzyx) ? b.xyzw

: a.wzyx

The vectors a and b are then

internally sorted by calling sortElements(a)

and

sortElements(b). a

is output as the

next vector on the output and b

takes the place of a

(see Figure 1(b)). A new b

vector is fetched from A

or B depending on which

of these contains the lowest value. This process, adding one sorted vector to

the output, is repeated until either A

or B is

empty. When either input array is empty, the (already sorted) remains in

the other array are simply appended to the output.

Each thread

will be working on some specific part of the input stream, so the arguments

passed on to each thread in each pass is the offset where its two consecutive

lists begins and the number of elements it shall merge. The thread will start

reading its input values from this offset in the input list and will start

writing the result into the same offset of the output list. In the next pass,

the input and output lists will be swapped.

This method

of merge-sorting a list works very well in the first passes, where many threads

can merge many lists in parallel. Looking at the execution times for each pass,

we can easily see that the algorithm will become slow when there is not enough

parallel threads left to keep all processors busy (see Figure 2). In the final

step, the entire sorted result will be created by a single processor of a

single multiprocessor, on the G80 effectively leaving all but one processor

idle. See Figure 3 for pseudo code.