Increasing performance in a moderately clock gated design

Added by James Connolly 12 months ago

I've been looking at speeding up Verilator for a design. A few months ago I added a large, mostly idle module that caused a 60% performance degradation; I have it clock gated for a large percentage (~80%) of simulation time. I've been profiling with --prof-cfuncs and have noticed the following:

  • ~20% of execution time can be attributed directly to the idle module
  • ~ 5% of execution time is attributed to _eval()
  • I have annotated all clock_enable signals with the Verilator macro, although interestingly no IMPERFECTSCH warnings pop up from this; all clock gating happens within a clock gater module, but I doubt that is an issue
  • I did have some UNOPTFLAT and IMPERFECTSCH warnings in my design; it is my understanding that these can lead to _eval() being called multiple times per loop. I have since removed these warnings, but doing so did not seem to affect performance. Across all simulations, the ratio of eval() to _eval() calls is about 1:1.002.

Stats output is attached. My vague suspicion is that clock gating isn't being recognized, given the lack of IMPERFECTSCH warnings and the speed of the clkgater pass, but this is only a guess; any insight would be greatly appreciated.

Aside: I used oprofile and perf instead of gprof for profiling because I noticed an immense slowdown from binary instrumentation.

stats.txt View (357 KB)


Replies (5)

RE: Increasing performance in a moderately clock gated design - Added by Wilson Snyder 12 months ago

The Clkgate stage was experimental, for adding gating (not your case, I think), and is not currently enabled as it generally slowed things down.

If you look at where the time is going, can you tell what RTL causes it, or what pattern of C code seems to cause it?

If you can get it down to a Verilog example where you also can say what you think the output C code should look like to be faster, we can try to improve to get to that.

RE: Increasing performance in a moderately clock gated design - Added by James Connolly 12 months ago

Nasty combinational logic with nested loops seems to be where I'm sinking most of my cycles (~30% of time). Attached is a module that represents the type of logic I seem to be sinking time into; in my design I have instantiated many arbiters that use the same always@* block described at top.v:78. Across all the arbiters in my larger design, about 5% of time can be attributed directly to loops that look very similar to this example.

There is some weirdness with repeated assignments starting at Vtop.cpp:5886, but this seems to only affect output cells and isn't an issue for my design.

A larger problem, perhaps, is that many nested loops in my design are idle for a large slice of sim time; I would be interested in writing a pass similar to V3Clock but for always blocks, where we only execute a block if we notice a change in its sensitivity list. What do you think? This could bloat memory usage and push Verilator to be a bit more event driven.
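The idea of executing a block only when its inputs change can be sketched in C++ as follows. This is only an illustration of the concept, not Verilator output or an existing pass: GatedBlock, body(), and the evals counter are hypothetical names, and the placeholder body (lowest set bit, a fixed-priority grant) stands in for a much heavier always @* block. Note that the comparison and the cached copy have their own cost.

```cpp
#include <cstdint>

// Hypothetical wrapper around one combinational block (not Verilator code).
struct GatedBlock {
    uint32_t prev_in = 0;    // cached copy of the sensitivity-list inputs
    bool     valid   = false; // false until the body has run once
    uint32_t out     = 0;    // cached output of the last evaluation
    unsigned evals   = 0;    // how many times the body actually ran

    // Placeholder for the expensive logic: isolate the lowest set bit.
    static uint32_t body(uint32_t in) { return in & (~in + 1u); }

    // Re-run the body only if the inputs differ from the cached copy.
    uint32_t eval(uint32_t in) {
        if (valid && in == prev_in) return out;  // unchanged: skip the body
        prev_in = in;
        valid = true;
        out = body(in);
        ++evals;
        return out;
    }
};
```

The extra prev_in and out fields per block are the memory cost mentioned above; whether the skip pays off depends on how expensive the body is relative to the comparison.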

Makefile (2.52 KB)

top.v (1.75 KB)

Vtop.cpp View (296 KB)

RE: Increasing performance in a moderately clock gated design - Added by Wilson Snyder 11 months ago

As to the for loops, it looks like a new optimization should be added to make this code perform better, e.g. a sequence like o_gnt10 = ..., o_gnt0 = ... o_gnt10 should collapse to one expression. replaceAssignMultiSel is an optimization that attempts to do something somewhat similar.

The 5886 stuff is a side effect of the loop unrolling. Certainly the optimization should be improved to remove the redundant ones.

Moving to something that looks for differences is a massive change that is difficult, and in the general case it is likely slower, as the check for differences is typically slower than doing the calculations. If you change your code to calculate o_gnt1 and o_gnt2 in two separate loops, then with the above improvements V3Table should recognize it and change this into a table lookup. As a result, many thousands of lines should reduce to

o_gnt1 = TABLE[i_s1];
o_gnt2 = TABLE[i_s2];
o_tgnt = o_gnt1 | o_gnt2;

This will run in 10ish CPU clocks and be faster than any check for differences would be.
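To make the table idea concrete, here is a minimal C++ sketch, assuming a 4-bit fixed-priority arbiter where bit 0 wins. The width, the function names (grantLoop, grantTable, totalGrant), and the arbitration policy are assumptions for illustration; this is not the code V3Table generates, and the real top.v logic may differ.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr unsigned kWidth   = 4;            // assumed request width
constexpr unsigned kEntries = 1u << kWidth; // one table entry per input value

// Loop form: scan the request bits and grant the lowest one that is set.
inline uint8_t grantLoop(uint8_t req) {
    for (unsigned i = 0; i < kWidth; ++i) {
        if (req & (1u << i)) return static_cast<uint8_t>(1u << i);
    }
    return 0;
}

// Table form: evaluate the loop once per possible input, then just index.
inline const std::array<uint8_t, kEntries>& grantTableData() {
    static const std::array<uint8_t, kEntries> table = [] {
        std::array<uint8_t, kEntries> t{};
        for (unsigned r = 0; r < kEntries; ++r) {
            t[r] = grantLoop(static_cast<uint8_t>(r));
        }
        return t;
    }();
    return table;
}

inline uint8_t grantTable(uint8_t req) {
    assert(req < kEntries);  // input must fit the assumed width
    return grantTableData()[req];
}

// Mirrors the three-line form above: two lookups and an OR.
inline uint8_t totalGrant(uint8_t i_s1, uint8_t i_s2) {
    return grantTable(i_s1) | grantTable(i_s2);
}
```

A table like this only works when each output depends on a small number of input bits, which is presumably why computing o_gnt1 and o_gnt2 in separate loops matters: each loop then has a narrow enough input to enumerate.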

Perhaps you could look at some of these enhancements?

RE: Increasing performance in a moderately clock gated design - Added by James Connolly 11 months ago

Thanks for the tips here Wilson; I massaged some of the arbiters to be optimized by V3Table and got great speedup within the arbiter modules.

During profiling I also noticed that i$ and d$ utilization are not the best, with 44 L1 i$ misses per thousand instructions.

I've used the llvm-bolt utility from Facebook (https://github.com/facebookincubator/BOLT) to get a marked improvement (20.9%) in performance, but this is only suitable for frozen RTL, as it is expensive to run (hour runtime, 34G of memory). To contextualize this increase in performance, I do not use LTO, and Verilator output is split.

Going through the Verilator internals documentation, this seems to be a well understood problem, and I$ / D$ packing seems to be a priority; has there been any progress on this front I could extend?

RE: Increasing performance in a moderately clock gated design - Added by Wilson Snyder 11 months ago

Thanks for the pointer to BOLT. It would be very interesting if you could create a before and after istream heat map.

I suspect that if we analyze what BOLT is doing we could tune the outputted code to get some fraction of that benefit directly (maybe half, i.e. 10%). FWIW the current branch prediction algorithm (V3Branch) is very simple but gave 10% when first introduced.

There hasn't been any work on icache/dcache packing beyond the extremely simple V3Combine, which looks for completely identical functions. I would think istream is where the most benefit is; the thought there is that if we can abstract common heavyweight routines, Verilator would make functions and function calls for them.

E.g.

foo1 = heavy operations bara,barb
foo2 = heavy operations barc,bard
foo3 = heavy operations bare,barf

Becomes

heavy(foo1, bara, barb)
heavy(foo2, barc, bard)
heavy(foo3, bare, barf)
void heavy(a&, b&, c&) { ... }

The hard part is abstracting functions from variables (at which point it's easy to hash and look for duplicates). Then tuning the algorithm to know where this is worth doing.
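As a compilable illustration of the transformation above, the following C++ sketch shows three expanded copies of the same operation next to a version that calls one shared function. The Signals struct, the arithmetic inside heavy(), and the function names evalExpanded and evalShared are all hypothetical; they are placeholders, not code Verilator emits.

```cpp
#include <cstdint>

// Hypothetical signal storage, named after the pseudocode above.
struct Signals {
    uint32_t foo1, foo2, foo3;
    uint32_t bara, barb, barc, bard, bare, barf;
};

// One shared copy of the "heavy" operation, parameterized by its variables.
// The arithmetic is an arbitrary placeholder for a large expression.
void heavy(uint32_t& out, const uint32_t& a, const uint32_t& b) {
    uint32_t x = a ^ (b << 3);
    x = (x * 2654435761u) ^ (x >> 15);
    out = x + b;
}

// Before: the same instruction sequence is repeated three times.
void evalExpanded(Signals& s) {
    { uint32_t x = s.bara ^ (s.barb << 3);
      x = (x * 2654435761u) ^ (x >> 15); s.foo1 = x + s.barb; }
    { uint32_t x = s.barc ^ (s.bard << 3);
      x = (x * 2654435761u) ^ (x >> 15); s.foo2 = x + s.bard; }
    { uint32_t x = s.bare ^ (s.barf << 3);
      x = (x * 2654435761u) ^ (x >> 15); s.foo3 = x + s.barf; }
}

// After: three calls share one body, shrinking the instruction footprint.
void evalShared(Signals& s) {
    heavy(s.foo1, s.bara, s.barb);
    heavy(s.foo2, s.barc, s.bard);
    heavy(s.foo3, s.bare, s.barf);
}
```

The istream saving only materializes if heavy() stays out of line; a C++ compiler is free to inline it again, so in practice this would need to be checked in the generated binary.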

Any progress you can contribute, even if only informational, in optimizations would be well appreciated.
