Major Tools
Other Tools
General Info

UNOPTTHREADS even for parallel designs

Added by Al Grant over 1 year ago

I'm using --threads=4 and seeing UNOPTTHREADS, and the --stats output says "MTask graph, final, parallelism factor 4".

Intuitively, the response to "consider asking for fewer threads" might be "look at --stats and use as many threads as the parallelism factor says" but clearly this isn't the case. In fact, UNOPTTHREADS seems to be sensitive to logic size, independently of the parallelism in the design - so when I cut down the depth of combinational logic to try and produce a test case, the message goes away.

Are there other reasons why the thread scheduler might not be able to partition the design?

What is the parallelism factor if it's not the maximum number of threads the scheduler could use?

And is there something in --stats which will tell me what number of threads to use?

Replies (1)

RE: UNOPTTHREADS even for parallel designs - Added by John Coiner over 1 year ago

Hi Al,

Par factor is defined in internals.pod:

'The available parallelism or "par-factor" of a DAG is the total cost to execute all nodes, divided by the cost to execute the longest critical path through the graph. This is the speedup you would get from running the graph in parallel, if given infinite CPU cores available and communication and synchronization are zero.'

The partitioner starts with a fine-grained DAG and attempts to coarsen it by merging adjacent nodes, while minimizing the increase in critical path with each merge.

The overall critical path will increase as the partitioner works, and so par factor will fall. (Though, the assumption that communication and synchronization costs are negligible becomes a better assumption as the DAG shrinks from millions of tiny nodes to maybe just a few dozen big ones.)

If your critical path is high to begin with, the partitioner may not have much room to work. We've seen this in the past; for example there was a design where all the flops were in a single giant always block, and verilator (due to a bug, since fixed) wasn't splitting that always block into 100k separate flops, instead it was treating it as a single atom with 100k execution cost. The partitioner could not make much progress with such a long critical path in the pre-partitioned graph.

Could you share your entire stats output? It should show the critical-path cost and total-graph cost at the start of partitioning (eg "Mtask graph, initial, critical path cost") and that would suggest if there's an unusually long critical path somewhere.