
Large increase in compile time and run time #1354

Closed
veripoolbot opened this issue Sep 27, 2018 · 13 comments
Labels
area: performance (Issue involves performance issues)
resolution: abandoned (Closed; not enough information or otherwise never finished)

Comments

@veripoolbot


Author Name: Stephan Teran
Original Redmine Issue: 1354 from https://www.veripool.org


Hello,

I have downloaded the newest version of Verilator and noticed that compile and run time have increased nearly 50%. I was able to trace the increase to commit ef3c7bb. I am not sure exactly why this has impacted our project so heavily. We have a lot of smaller files with lots of small tests that we compile. I know that is not much information, but I am not sure what to provide, as the source code is out of the question and it is difficult to replicate this issue with a small test example.

Thanks,
Stephan

@veripoolbot


Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2018-09-28T07:54:25Z


Can you please try with -Os and see how it compares with old and new?

@veripoolbot


Original Redmine Comment
Author Name: Stephan Teran
Original Date: 2018-09-28T16:01:07Z


I limited the build to show the differences, but this holds pretty universally across our design. The build with multiple threads will almost always take longer, as we have lots (up to 160k) of small tests. We take advantage of ctest to run each individual test in parallel. Each test typically takes < 1 s (very few over 1 minute).

Verilator 3.926 (VM_PARALLEL_BUILDS=1)

 -O0
     Compile: 66.164s
     Run    : 166.85 sec

 -O1
     Compile: 70.857s
     Run    : 140.41 sec

 -O2
     Compile: 69.203s
     Run    : 136.74 sec

 -O3
     Compile: 69.216s
     Run    : 131.08 sec

Verilator 4.003 (VM_PARALLEL_BUILDS=1)

 -O0 Threads 8 (I did not use numactl which would help for an individual test, but makes ctest worse)
     Compile: 134.226s
     Run    : > 2691 seconds (killed early)

 -O0
     Compile: 106.503s
     Run    : 256.17 sec

 -O1
     Compile: 109.140s
     Run    : 179.96 sec

 -O2
     Compile: 107.952s
     Run    : 175.53 sec

 -O3
     Compile: 105.462s
     Run    : 180.10 sec

Verilator 3.926 (VM_PARALLEL_BUILDS=0)

 -O2
     Compile: 93.195s
     Run    : 86.83 sec

Verilator 4.003 (VM_PARALLEL_BUILDS=0)

 -O2
     Compile: 192.181s
     Run    : 96.20 sec

@veripoolbot


Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2018-09-30T10:33:28Z


Can you comment on whether -Os "solves" the issue (as this will narrow the problem down to one Verilator source file)?

Also, attaching a --stats output text file might help; if nothing else, it shows the runtime per step.

@veripoolbot


Original Redmine Comment
Author Name: John Coiner (@jcoiner)
Original Date: 2018-09-30T14:31:07Z


Hi Stephan,

I wrote the fix for bug 1244, which became the ef3c7bb commit.

Do your "Compile" time figures include verilator time, C compiler time, or both?

It's interesting that using VM_PARALLEL_BUILDS is consistently slower at runtime for you, all else being equal, for both the 3.x and 4.x versions.

VM_PARALLEL_BUILDS=1 causes verilator to produce a large set of '.cc' files that can be built in parallel; setting it to 0 causes verilator to produce a single '.cc' file, to reduce the overhead of starting many compiler jobs. It's the same code in either case; only the way it's partitioned into .cc files changes. In the VM_PARALLEL_BUILDS=0 case, the compiler has more opportunities to inline function calls. My guess is that this is where the performance delta arises. (But that's not the performance delta between 3.x and 4.x, so it's maybe irrelevant.)

A code sample would help. Otherwise we're just speculating. Could you produce a synthetic design using the same coding style as the real design, and post that publicly?

I would echo what Wilson said about -Os (to disable splitAlwaysAll(), the always block split routine). That would be an interesting data point and possible workaround. You could also try -Or (to disable splitReorderAll(), the always block reordering routine) as another data point. Both routines are implemented in V3Split.cpp which changed quite a bit at the ef3c commit. Both routines are supposed to be optimizations, and disabling them should (if anything) reduce runtime performance. If it instead helps runtime performance, that would be a concern.

Thanks,

John

@veripoolbot


Original Redmine Comment
Author Name: John Coiner (@jcoiner)
Original Date: 2018-09-30T14:45:45Z


Another possibility is that V3Split encounters an always like this:


always @(posedge clk) begin
  if (/*very long expression goes here*/) begin
      a_q <= a_d;
      b_q <= b_d;
      c_q <= c_d;
      // etc etc
  end
end

... and splits it into many small always blocks, each of which repeats the very long expression:


always @ (posedge clk) begin
  if (/*very long expression goes here*/) begin
     a_q <= a_d;
  end
end
// etc etc


That's the most obvious way I would see the performance getting worse from the V3Split changes. Does that look like the coding style you're working with?

@veripoolbot


Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2018-09-30T14:48:31Z


John, that's a good theory. If it turns out being that, perhaps we should make a temporary wire? If it ends up a simple expression then V3Gate will inline it back to what it was originally.
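A rough sketch of that idea (hypothetical output for illustration; the signal name is made up, not an actual V3Split-generated name): the shared condition is factored into a temporary signal once, so each split block references it instead of re-evaluating the long expression.

```verilog
// Hypothetical: factor the duplicated condition into one temporary.
wire split_cond_tmp = /* very long expression goes here */;

always @(posedge clk) begin
  if (split_cond_tmp) a_q <= a_d;
end

always @(posedge clk) begin
  if (split_cond_tmp) b_q <= b_d;
end
// etc etc
```

If the expression turns out to be cheap, V3Gate can inline the temporary back, restoring the original form.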

@veripoolbot


Original Redmine Comment
Author Name: Stephan Teran
Original Date: 2018-10-01T15:58:40Z


The times listed above have a caveat: I noticed that in our cmake script we set the optimization level differently for the C++ code and the Verilated code, and I did not properly change it for the Verilated code, so it was always at -O3.

I have tried with -Os (sorry, I misunderstood above; I thought you wanted to see the different optimization levels). When I enabled -Os, for some reason I got a lot of UNOPTFLAT warnings, so I disabled those (not ideal).

The compile times I have listed are combined times.

With -Os I get the following

Compile: 49.215s
Run: 165.74s

With -Or I get the following

Compile: 112.215s
Run: 199.90s

John, we definitely do have code that looks like your example, though most often without a very long if condition on the sequential write. We do have lots of split blocks, though, with always_comb and always_ff (we are almost entirely in SystemVerilog). The always_comb portion may have a large if/case. I will try to write some test code to see what happens with a sample like the one below.

always_comb
begin
    a_q = a_d;
    b_q = b_d;
    if <something here>
    begin
        a_q = 1'b1;
        b_q = 1'b0;
    end
end

always_ff @(posedge clk)
begin
    a_d <= a_q;
    b_d <= b_q;
end

I have attached 4 stats file, 3.926 with -O3, 4.003 with -O3, 4.003 with -Os, and 4.003 with -Or.

As for the large runtime differences between VM_PARALLEL_BUILDS=0 and 1, we had the same speculation about inlining but have not dug into it; maybe a separate issue to investigate later ;)

@veripoolbot


Original Redmine Comment
Author Name: John Coiner (@jcoiner)
Original Date: 2018-10-03T02:02:49Z


Assigning this to myself.

@veripoolbot


Original Redmine Comment
Author Name: Stephan Teran
Original Date: 2018-10-09T18:11:04Z


I spent a bit of time yesterday trying to replicate this issue in test code, without much luck. I will continue attempting to replicate it with test code, unless it is already understood why this is occurring. Thanks!

@veripoolbot


Original Redmine Comment
Author Name: John Coiner (@jcoiner)
Original Date: 2018-10-14T23:14:47Z


Stephan,

Here's a patch you could try. It creates a temporary wire when duplicating 'if' conditions in split always blocks, to avoid duplicating the (maybe lengthy and expensive) conditional expression.

This isn't production quality yet, but it should pass the regression and be good enough to give a signal about whether this approach will help.

The patch makes runtime performance neither worse nor better for some testcases I have on hand. I haven't measured its impact on verilator-time or gcc-time.

Please confirm if you see a change in performance with this, when you have a chance.

BTW I also started on another patch and then discarded it. That patch would have limited the granularity of always-block splitting, so that instead of producing maximally-split fully-atomized blocks, V3Split would produce fewer larger blocks. The intuition was: we know that very large (1000s of lines) always blocks cause various scaling issues, but maybe fully atomizing them isn't the best for performance either, and maybe a middle ground would be better -- blocks in the range of 10-100 statements, say. I got that working and the performance was not good. In multithreaded mode, larger always blocks are worse. Combining unrelated state elements into a single always block is bad for the thread partitioner. Even with only 10s of statements per block, performance degrades. At least for the design I was using, there was a clear signal that full atomization really is the best.
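To make the discarded middle-ground idea concrete (a sketch only, not the actual patch output; names are illustrative): instead of one always block per assignment, statements would be grouped into medium-sized blocks.

```verilog
// Sketch: medium-granularity split of one large sequential block.
// Fully-atomized splitting would instead emit one always block per assignment.
always @(posedge clk) begin  // group 1 of ~10-100 statements
  a_q <= a_d;
  b_q <= b_d;
end

always @(posedge clk) begin  // group 2
  c_q <= c_d;
  d_q <= d_d;
end
```

As noted above, grouping unrelated state elements into one block hurts the thread partitioner, which is why full atomization still won out.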

Thanks,

John

@veripoolbot


Original Redmine Comment
Author Name: Stephan Teran
Original Date: 2018-10-15T16:58:15Z


Hi John,
Thanks for all the work trying to fix this issue. Unfortunately, it does not seem to help. I have improved some of the compile and run times from above purely by making our make script smarter. It is good to see that with the -Os flag compile times are the same, but the 45% increase in run time is concerning.

All times in seconds

3.926
-Os Compile: 46.195 Run: 20.89
-O3 Compile: 67.115 Run: 20.47

Using 4.004 hash 304a24d
-Os Compile 46.45   Run: 29.98
-O3 Compile 118.077 Run: 25.09

4.004 with Patch
-Os Compile 45.613 Run: 28.85
-O3 Compile 136.65 Run: 38.14 (however, these test runs failed with the error "Verilated model didn't converge")


@veripoolbot


Original Redmine Comment
Author Name: John Coiner (@jcoiner)
Original Date: 2018-12-03T00:32:20Z


Hi Stephan,

If I agreed to an NDA, could you send me some of the testcases to examine?

Another unturned stone would be to look closely at the change in compile time or runtime for Verilator's built-in test suite. The test suite isn't really designed to check performance; there's an opportunity to do better on performance testing...

So I made a few local hacks, to get some rough aggregate performance measurements of compile time from the regression tests:

  • Verilator.cpp main() routine prints the number of microseconds elapsed at exit.

  • The test driver will always execute verilator twice for each test. The first run should warm up the OS file cache, so the 2nd run should see low I/O overhead. It also passes --no-skip-identical to ensure the 2nd run actually happens; otherwise it would not.

  • A perl script compares the recorded compile times for two trees' test outputs, and finds tests that became faster or slower beyond some tunable threshold (which I'm usually setting to something in the 20-50% range.)

It looks like v3 usually verilates faster than v4; we may have a regression, whoops. For example, I see ~200 tests that verilate >20% faster on v3 than v4, versus ~60 tests that verilate >20% faster on v4 than v3. Both arms of this experiment were built with the same CC / CXX settings, the same ./configure options, bog standard build options, same machine, the regressions even ran at the same time, with half the machine's cores going to each.

OK, so how much of that is due to the V3Split edits? To find out I made an "old_split" branch from master, backported the v3 version of V3Split into this branch, then compared it against master. Between "old_split" and master there's no apparent divergence in verilation time. There are 150 tests that verilate >20% faster on old_split than master, and 148 tests that verilate >20% faster on master than old_split -- it's a draw, there is no signal.

Attached you can find the hacks I was working with:

  • timing.patch are the edits which cause each verilator to write its runtime into its log. Should apply to master; probably applies to v3.xxx with minor fixups.

  • timing.old_split.patch is the same, plus the edits to roll back V3Split to the version 3.xxx version of itself. Should apply to master.

  • check_times.pl compares verilation times for the 'vlt' regressions from two trees.

So we may have two separate issues:

  • whatever is going on at Stephan's work that's caused by V3Split changes

  • the performance regression in v4 that's independent of the V3Split changes

Thanks,

--John

@veripoolbot


Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2019-12-21T11:43:14Z


Pretty sure there's a bug here, but think we need a specific test case to take this further.

@veripoolbot added the labels "area: performance" (Issue involves performance issues) and "resolution: abandoned" (Closed; not enough information or otherwise never finished) on Dec 22, 2019