Project

General

Profile

[logo] 
 
Home
About/Contact
Major Tools
  Dinotrace
  Verilator
  Verilog-mode
  Verilog-Perl
Other Tools
  IPC::Locker
  Parallel::Forker
  Voneline
General Info
  Papers

Multithreaded/multicore/GPU support

Added by Wilson Snyder over 8 years ago

This thread is to discuss how to proceed with development of multithreaded/multicore/GPU support.

A parallel x86 implementation is the easiest and most obvious starting point and might as well be done before trying for GPU style massive parallelism.

As far as getting started, I'd suggest reading the internals.* document, running a few tests with --debug, and looking at the tree outputs.

Looking at V3Order for example, what is going on is building a graph of the code to execute. It knows what can go in parallel or not, and currently picks what it considers a good route to a sequential ordering. That can be extended to a parallel model. Then a few downstream optimizations need to know there's more than one code stream, and the final include/* runtime files are certain to need some thread cleanups.


Replies (5)

RE: Multithreaded/multicore/GPU support - Added by Jie Xu over 5 years ago

I have tried a bit different approach than you suggested to implement multithread support. Basically I start from the result of V3Order where a golden sequential order has been calculated. Then I check the adjacent function calls in _eval() method to see if they can be made to simulate in parallel. The criteria for checking if two adjacent function calls (e.g. Fa and Fb) can be run in parallel:
* output_set(Fa) and input_set(Fb) has NO common element
* output_set(Fb) and input_set(Fa) has NO common element
* output_set(Fa) and output_set(Fb) has NO common element

The checking is executed recursively to determine the maximum set of adjacent parallel function calls. That is if Fa and Fb can be parallel, the combined set of output_set(Fa & Fb) and input_set(Fa & Fb) are further checked with next function call Fc.

During the checking part, parallel functions calls are moved to different thread_eval() functions and synchronization info is added both to _eval and other thread_eval() functions. For example, if (Fa Fb Fc) is the maximum adjacent functions calls can be run in parallel, then the generated code would look like

_eval() {
    pthread_mutex_lock(&vlTOPp->mtx);
    vlTOPp->jobs_waiting_1 = 1;
    vlTOPp->jobs_waiting_2 = 1;
    pthread_cond_broadcast(&vlTOPp->cv);
    pthread_mutex_unlock(&vlTOPp->mtx);

    Fa();

    pthread_mutex_lock(&vlTOPp->mtx);
    while (vlTOPp->jobs_waiting_1 || vlTOPp->jobs_waiting_2 || vlTOPp->jobs_waiting_3)
        pthread_cond_wait(&vlTOPp->cv, &vlTOPp->mtx);
    pthread_mutex_unlock(&vlTOPp->mtx);
}

thread_eval_1() {
     pthread_mutex_lock(&vlTOPp->mtx);
     while (vlTOPp->jobs_waiting_1== 0)
          pthread_cond_wait(&vlTOPp->cv, &vlTOPp->mtx);
     pthread_mutex_unlock(&vlTOPp->mtx);

     Fb();

     pthread_mutex_lock(&vlTOPp->mtx);
     --vlTOPp->jobs_waiting_1;
     pthread_cond_broadcast(&vlTOPp->cv);
     pthread_mutex_unlock(&vlTOPp->mtx);
}

thread_eval_2() {
     pthread_mutex_lock(&vlTOPp->mtx);
     while (vlTOPp->jobs_waiting_2== 0)
          pthread_cond_wait(&vlTOPp->cv, &vlTOPp->mtx);
     pthread_mutex_unlock(&vlTOPp->mtx);

     Fc();

     pthread_mutex_lock(&vlTOPp->mtx);
     --vlTOPp->jobs_waiting_2;
     pthread_cond_broadcast(&vlTOPp->cv);
     pthread_mutex_unlock(&vlTOPp->mtx);
}

The threads running thread_eval_() will be started when the model is created but only runs when the main _eval() tells there are jobs for it.

I have tested the idea on one of our design and it actually works. But I didn't get performance improvement, the main reason I believe is the cost of synchronization between threads is too high. There are number of parameters here which can be tuned for decrease the frequency of synchronization:

* Number of threads: 
* size of functions called by _eval function: this can be tuned by using the option --output-split-cfuncs.
* only using another thread if both the adjacent function calls are large enough
* use Pthread barrier instead of mutex and conditional variable

Currently I start to tune the parameters to see if there will be performance gains, any suggestions will be highly appreciated.

RE: Multithreaded/multicore/GPU support - Added by Wilson Snyder over 5 years ago

Cool experiment. Rather than barriers it's better to have completion flags, where each process can mark completion then start their next step when the new function to execute knows that all of the functions feeding it inputs have indicated their result is ready. (Somewhat similar to how tracing sets the activity flags.)

Also you need to change the way variables are laid out in the data structures so that they won't thrash, this will have a huge impact; multithreaded is really all about minimizing cache movement. At a minimum you could try putting each output function's variables together. Then slightly better all communication that's only between func_a and func_b should be packed in minimal cache lines. If done well this could help single threaded performance, but be very careful not to increase the model size as that greatly harms performance. This is where knowing it's multithreaded will require different optimization - you'll want to pad to cache lines if multithreaded, but not pack if single threaded.

Also I think it's unlikely you'll get very good performance without a smarter partitioning scheme, but please prove me wrong.

RE: Multithreaded/multicore/GPU support - Added by Jie Xu over 5 years ago

Caching

I have done some analysis using cachegrind. Indeed I see more cache misses in the multithreaded case. Most of the cache misses are L1 instruction cache misses. Your suggestion about putting output variables together maybe is for data cache? In don't really have lots of knowledge about how/what should I do to get less instruction cache misses.

Partition scheme

My current scheme is definitely not optimal at all. The problem is really an algorithm/mathematical problem:
a) We have an ordered sequence of function calls, 
b) each with its own number of instruction, given the 
c) dependency relations between each function and 
d) number of multithread we have,
we need to find a scheme to finish the sequence as soon as possible.
In terms of function dependeny, basically if two functions meet the following requirement, then they can be run in parallel, otherwise strict order should be preserved between the functions calls.
* output_set(Fa) and input_set(Fb) has NO common element
* output_set(Fb) and input_set(Fa) has NO common element
* output_set(Fa) and output_set(Fb) has NO common element
For example, we have a sequence of functions as below:
funcs:
   f_1(); f_2(); f_3(); f_4(); f_5(); f_6();
Instruction count:
   c1     c2     c3     c4     c5     c6
Input set:
   I1     I2     I3     I4     I5     I6
Output set:
   O1     O2     O3     O4     O5     O6
We need to find a grouping solution like: where functions inside { } can be run in parallel.
f_1(); {f_2(); f_3();} {f_4(); f_5(); f_6()};

The problem is kind of similar to the maximum subarray/slice problem in addition we have the limitation of dependency. Do you happen to know an algorithm for this problem?

NOTE in this formulation of partition problem, I didn't take the overhead/cost of multi-thread synchronization into consideration yet.

RE: Multithreaded/multicore/GPU support - Added by Wilson Snyder over 5 years ago

Re Instruction cache misses. Yes, IC misses are the primary performance problem with Verilator, the second being dmisses. Ideally there would be an optimization which tries to make functions it can use in common by passing different pointers into them. V3Combine was a step in this direction to get the low-hanging fruit.

The algorithms you want are the "partitioning problem" algorithms.

Note that V3Order has a very simple algorithm for picking what blocks to combine in a function, this could easily be tweaked to get you part of what you want (e.g. optimize to first combine calls which have the maximum shared dstream with no other interaction).

Also you probably already saw it but the instrCount() method can be used to estimate the relative runtime of functions, see V3Table for an example usage. The instrCount weights haven't been experimentally validated and should be if this starts having more important usage. (You can use your existing Verilog code to do this - benchmark each little CFUNC separately then use regression analysis against the node types used in each CFUNC function to determine the weights.)

RE: Multithreaded/multicore/GPU support - Added by Wilson Snyder about 2 years ago

This is an old thread, but note the develop-v4 and eventual 4.000 release now supports multithreaded model generation. If there's questions/comments etc please start a new thread.

    (1-5/5)