Large increase in compile time and run time #1354
Comments
Original Redmine Comment Can you please try with -Os and see how it compares between the old and new versions?
Original Redmine Comment Limited the build to show the differences, but this holds pretty universally across our design. The build with multiple threads will almost always take longer, as we have lots (up to 160k) of small tests. We take advantage of ctest to run each individual test in parallel. Each test typically takes < 1 s (very few take over 1 minute).
Verilator 3.926 (VM_PARALLEL_BUILDS=1)
Verilator 4.003 (VM_PARALLEL_BUILDS=1)
Verilator 3.926 (VM_PARALLEL_BUILDS=0)
Verilator 4.003 (VM_PARALLEL_BUILDS=0)
Original Redmine Comment Can you comment on whether -Os "solves" the issue (as this will narrow down the problem to one Verilator src file)? Also, attaching a --stats output text file might help; if nothing else, it shows the runtime per step.
Original Redmine Comment Hi Stephan, I wrote the fix for bug 1244 which became the ef3c commit. Do your "Compile" time figures include verilator time, C compiler time, or both?

It's interesting that using VM_PARALLEL_BUILDS is consistently slower at runtime for you, all else being equal, for both the 3.x and 4.x versions. VM_PARALLEL_BUILDS=1 causes verilator to produce a large set of '.cc' files that can be built in parallel; setting it to 0 causes verilator to produce a single '.cc' file, to reduce the overhead of starting many compiler jobs. It's the same code in either case; only the way it's partitioned into .cc files changes. For the VM_PARALLEL_BUILDS=0 case, the compiler has more opportunities to inline function calls. My guess is that this is where that performance delta arises. (But that's not the performance delta between 3.x and 4.x, so it's maybe irrelevant.)

A code sample would help; otherwise we're just speculating. Could you produce a synthetic design using the same coding style as the real design, and post that publicly?

I would echo what Wilson said about -Os (to disable splitAlwaysAll(), the always-block split routine). That would be an interesting data point and a possible workaround. You could also try -Or (to disable splitReorderAll(), the always-block reordering routine) as another data point. Both routines are implemented in V3Split.cpp, which changed quite a bit at the ef3c commit. Both routines are supposed to be optimizations, and disabling them should (if anything) reduce runtime performance. If disabling them instead helps runtime performance, that would be a concern.

Thanks, John
Original Redmine Comment Another possibility is that V3Split encounters an always like this:
... and splits it into many small always blocks, each of which repeats the very long expression:
That's the most obvious way I would see the performance getting worse from the V3Split changes. Does that look like the coding style you're working with?
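The original code snippets did not survive the Redmine migration. A hypothetical sketch of the pattern being described (the module, signal names, and the condition are invented for illustration, not taken from the design):

```verilog
// Hypothetical illustration only; names are invented, not from the design.
module split_demo (
   input  wire       clk,
   input  wire       en, valid, ready, mode,   // pieces of a "long" condition
   input  wire [7:0] a_next, b_next, c_next,
   output reg  [7:0] a, b, c
);
   // Original source: one always block, with the condition written once:
   //   always @(posedge clk)
   //     if (en && valid && ready && !mode) begin
   //       a <= a_next; b <= b_next; c <= c_next;
   //     end

   // After V3Split's full atomization, each signal gets its own block,
   // and each block repeats the whole conditional expression, so it may
   // be evaluated three times per cycle instead of once:
   always @(posedge clk) if (en && valid && ready && !mode) a <= a_next;
   always @(posedge clk) if (en && valid && ready && !mode) b <= b_next;
   always @(posedge clk) if (en && valid && ready && !mode) c <= c_next;
endmodule
```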
Original Redmine Comment John, that's a good theory. If it turns out to be that, perhaps we should make a temporary wire? If it ends up a simple expression then V3Gate will inline it back to what it was originally.
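A minimal sketch of the temporary-wire idea (hypothetical names; this illustrates the transformation being proposed, not the actual V3Split implementation):

```verilog
// Hypothetical illustration only; names are invented.
module split_tmpwire (
   input  wire       clk,
   input  wire       en, valid, ready, mode,
   input  wire [7:0] a_next, b_next,
   output reg  [7:0] a, b
);
   // Factor the duplicated condition into one temporary wire...
   wire split_cond_tmp = en && valid && ready && !mode;

   // ...so each split block repeats only a cheap reference:
   always @(posedge clk) if (split_cond_tmp) a <= a_next;
   always @(posedge clk) if (split_cond_tmp) b <= b_next;

   // If the expression is simple, V3Gate's inlining is expected to fold
   // the wire back in, recovering the original form.
endmodule
```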
Original Redmine Comment The times listed above come with a caveat: I noticed that in our cmake script we set the optimization level differently for the C++ code and the verilated code, and I did not properly change it for the verilated code, so the verilated code was always built at -O3. I have now tried with -Os (sorry, I misunderstood above; I thought you wanted to see the different optimization levels). When I enabled -Os, for some reason I got a lot of UNOPTFLAT warnings, so I disabled those (not ideal). The compile times I have listed are combined times. With -Os I get the following
With -Or I get the following
John, we definitely do have code that looks like your example, though most often without a very long if condition on the sequential write. We do have lots of split blocks, with always_comb and always_ff (we are almost entirely in SystemVerilog). The always_comb portion may have a large if/case. I will try to write some test code and see what happens with a sample like below.
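The sample itself was lost in the migration. A hypothetical sketch of the paired always_comb/always_ff style described above (module, states, and signal names are invented for illustration):

```systemverilog
// Hypothetical sample only; names and states are invented.
module comb_ff_style (
   input  logic clk, rst, start, done,
   output logic busy
);
   typedef enum logic {IDLE, RUN} state_e;
   state_e state, next_state;

   // Combinational block with a (potentially large) if/case:
   always_comb begin
      next_state = state;
      case (state)
         IDLE: if (start) next_state = RUN;
         RUN:  if (done)  next_state = IDLE;
      endcase
   end

   // Small sequential write paired with the combinational block:
   always_ff @(posedge clk) begin
      if (rst) state <= IDLE;
      else     state <= next_state;
   end

   assign busy = (state == RUN);
endmodule
```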
I have attached four stats files: 3.926 with -O3, 4.003 with -O3, 4.003 with -Os, and 4.003 with -Or. As for the large runtime differences between VM_PARALLEL_BUILDS=0 and 1, we had the same speculation about inlining, but have not dug into that; maybe a separate issue to investigate later ;)
Original Redmine Comment Assigning this to myself. |
Original Redmine Comment I spent a bit of time yesterday trying to replicate this issue in test code, without much luck. I will continue attempting to replicate it with test code, unless it is already understood why this is occurring. Thanks!
Original Redmine Comment Stephan, Here's a patch you could try. It creates a temporary wire when duplicating 'if' conditions in split always blocks, to avoid duplicating the (maybe lengthy and expensive) conditional expression. This isn't production quality yet, but it should pass the regression and be good enough to give a signal about whether this approach will help. The patch makes runtime performance neither worse nor better for some testcases I have on hand. I haven't measured its impact on verilator time or gcc time. Please confirm whether you see a change in performance with this, when you have a chance.

BTW, I also started on another patch and then discarded it. That patch would have limited the granularity of always-block splitting, so that instead of producing maximally-split, fully-atomized blocks, V3Split would produce fewer, larger blocks. The intuition was: we know that very large (1000s of lines) always blocks cause various scaling issues, but maybe fully atomizing them isn't best for performance either, and maybe a middle ground would be better -- blocks in the range of 10-100 statements, say. I got that working, and the performance was not good. In multithreaded mode, larger always blocks are worse: combining unrelated state elements into a single always block is bad for the thread partitioner. Even with only 10s of statements per block, performance degrades. At least for the design I was using, there was a clear signal that full atomization really is the best.

Thanks, John
Original Redmine Comment Hi John, all times below are in seconds.
Original Redmine Comment Hi Stephan, If I agreed to an NDA, could you send me some of the testcases to examine?

Another unturned stone would be to look closely at the change in compile time or runtime for verilator's built-in test suite. The test suite isn't really designed to check performance; there's an opportunity to do better on performance testing. So I made a few local hacks to get some rough aggregate performance measurements of compile time from the regression tests.

It looks like v3 usually verilates faster than v4; we may have a regression, whoops. For example, I see ~200 tests that verilate >20% faster on v3 than v4, versus ~60 tests that verilate >20% faster on v4 than v3. Both arms of this experiment were built with the same CC / CXX settings, the same ./configure options, and bog-standard build options, on the same machine; the regressions even ran at the same time, with half the machine's cores going to each.

OK, so how much of that is due to the V3Split edits? To find out, I made an "old_split" branch from master, backported the v3 version of V3Split into this branch, then compared it against master. Between "old_split" and master there's no apparent divergence in verilation time. There are 150 tests that verilate >20% faster on old_split than master, and 148 tests that verilate >20% faster on master than old_split -- it's a draw; there is no signal.

Attached you can find the hacks I was working with:
So we may have two separate issues: the compile/runtime regression Stephan reported, which bisects to the V3Split changes in the ef3c commit, and a more general verilation-time regression from v3 to v4 that does not appear to come from V3Split.
Thanks, --John
Original Redmine Comment Pretty sure there's a bug here, but I think we need a specific test case to take this further.
Author Name: Stephan Teran
Original Redmine Issue: 1354 from https://www.veripool.org
Hello,
I have downloaded the newest version of Verilator and noticed that compile and run time have increased nearly 50%. I was able to trace the increase to commit ef3c7bb. I am not sure exactly why this has impacted our project so heavily. We have a lot of smaller files with lots of small tests that we compile. I know that is not much information, but I am not sure what to provide: source code is out of the question, and it is difficult to replicate this issue with a small test example.
Thanks,
Stephan