V3Combine not working, and use of this-> #1224
Comments
Original Redmine Comment Undoubtedly V3Combine can be improved. There are at least 3 tests for it; grep for "Combined" in the test_regress directory. The tests check the stats file to make sure combining happened.
Original Redmine Comment Thanks, whoops, I missed those tests on my first grep. In the tests, V3Combine only combines 'settle' routines with 'sequent' routines; it never combines sequent/comb routines with one another. There's probably not much Icache performance gain from combining settle routines, since uncombined settle routines fall out of cache once the sim moves beyond time 0. There could be a lot of Icache performance gain from combining sequent/comb routines with one another. Here's a proposal. Instead of producing this _eval() code:
... where those functions are 'static' to their class, let's make them non-static and call them like this:
Within the functions, references to local class members would be relative, not absolute from vlSymsp. If we do these transformations prior to (or concurrently with) V3Combine, it should be possible to combine away most of the logic in a large, repetitive, nested design. Does this make sense? Could I start on a patch?
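The code blocks from this comment did not survive the migration. As a stand-in, here is a minimal sketch of the two styles under discussion; all names (Vtop, Vtop_cpu, sequent) are invented for illustration, not actual Verilator output:

```cpp
// Hypothetical sketch. Today's style makes routines static and reaches every
// signal absolutely through vlSymsp; the proposal makes them non-static member
// functions that reach signals through 'this', so one body can serve every
// instance of the module and V3Combine can merge the duplicates.
struct Vtop_cpu {
    int r = 0;
    // this-relative sequent routine: textually identical for every instance
    void sequent(int clk) {
        if (clk) this->r = this->r + 1;
    }
};

struct Vtop {
    Vtop_cpu cpu0, cpu1;
    void eval(int clk) {
        // Calls differ only in the receiver object, not in the function body,
        // so the two instances can share one compiled routine.
        cpu0.sequent(clk);
        cpu1.sequent(clk);
    }
};
```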
Original Redmine Comment It looks like verilator used to work just this way, and it's pretty easy to reenable this latent behavior: the attached patch (just a demonstration, please do not commit) will do it. That short patch gets V3Combine merging sequential functions in the 2-cpu version of the m68 design. It passes the regression, except for a few tests that see more v3combining than they expect. Wilson, it looks like you had some trouble in the past when it worked this way, judging from the comments you left in DescopeVisitor::descopedName() about ambiguous and/or aliasing pointers. Do you remember what happened and how to provoke it? I wonder whether we would still have this problem with more recent compilers, and whether we might find a fix that doesn't require writing a unique routine per scope (restrict pointers...?). Thanks.
Original Redmine Comment Unfortunately, when I last looked at this, GCC did not do at all well with using "this" versus a single pointer, which is why I changed it all to statics. Basically it uses both "this" and the other symbol pointer, and despite all the __restrict hints, GCC assumed every time that it needed to guard against one pointer dereference changing something accessible from the other. The result is a lot of lost optimizations, as no temporaries could cross the compiler barrier created by the (incorrect) assumption of an alias. What performance change did you see? Can you look at the assembler deltas (with -O2) to see what changes between the current scheme and your/the old scheme with a more recent GCC? As to further combining, it would be good to think about whether Verilator should build more generic functions (that is, ones that don't dereference a pointer) that it could use in many places; that's what GCC is best at optimizing. Likewise, Verilator is so aggressive with unrolling that when it gets close to done, it might be useful to go the other way and re-roll some loops. It could also use structures to accomplish this, e.g. common code with 2 usages becomes a struct with an array of 2. That could build all the way up to a module.
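The aliasing concern can be sketched with a toy example; the names (Mod, Syms, step) are invented and this is not Verilator output. With __restrict on both pointers the compiler may keep self->a in a register across the store through sym; without the hint it must assume the two stores could overlap and reload from memory:

```cpp
// Toy sketch of the aliasing hazard described above (hypothetical names).
struct Syms { int flag; };
struct Mod  { int a; };

int step(Mod* __restrict self, Syms* __restrict sym) {
    self->a += 1;   // with __restrict, self->a can stay in a register...
    sym->flag = 0;  // ...because this store is promised not to alias self->a
    self->a += 1;
    return self->a;
}
```

If the compiler does not trust the restrict promise, it must emit a reload of self->a after the store through sym, which is exactly the kind of lost optimization described here.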
Original Redmine Comment Here's a contrived example. I took the Verilog sim benchmarks and added a few modules:
That's 64 CPUs grand total. With the original combine code, the test runs at 59 KCPS, m68_k68_cpu.cpp is 60 MB, m68.cpp is 547 kB, and m68__ALLcls.o is 2.6 MB. With 'this'-relative combining, the test runs at 184 KCPS, m68_k68_cpu.cpp is 338 kB, m68.cpp is 547 kB, and m68__ALLcls.o is 113 kB. My guess is the speedup comes from fitting the entire code stream in L2 instead of L3. This one compiles a lot faster too, though I didn't time it. Here's a typical routine from m68_k68_cpu.cpp with the combining fix. I haven't looked at the assembly yet, but I'd be surprised if gcc is guarding much against aliasing, given that every variable is referenced through 'this' and none through vlTOPp:
That's the common case, at least for this design built with these options. In all of m68_k68_cpu.cpp, which is a 10K-line file, there are only five variable references through vlTOPp. Are there different design styles or options that make vlTOPp references a lot more common (so that aliasing would be a bigger problem)? I'll look for evidence of gcc doing extra work to support aliasing; those five vlTOPp references should be enough to see it. I'm running gcc-7.2.0 on this system.
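The routine quoted in the comment above was lost in the migration. As a rough stand-in, a this-relative 'sequent' routine of the shape described (all member references through 'this', none through vlTOPp) might look like this; the class and signal names are invented:

```cpp
// Hypothetical stand-in for the lost routine: every load and store goes
// through 'this', so one compiled body can serve both cpu instances.
struct Vm68_k68_cpu {
    unsigned pc = 0, ir = 0, acc = 0;
    void _sequent__TOP__k68_cpu__1() {
        this->ir  = this->acc & 0xffu;    // all references are this-relative
        this->acc = this->acc + this->pc;
        this->pc  = this->pc + 1u;
    }
};
```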
Original Redmine Comment Some data points. First, I tried removing the 'restrict' keyword everywhere verilator emits it, to see if GCC 7.2.0 is doing anything with it at all. I compiled twice, with and without 'restrict'. (In both trials the more aggressive combining was enabled.) The only difference in the generated code between these two trials was in the '_eval()' function, which mixes 'this' references and TOPp references. (And hmm, those WOULD actually be aliases... probably we should not mix them here.) The generated code without restrict didn't look too bad; it was a little shorter with restrict. More relevant: there were no other differences, and all the 'sequent' routines were unaffected. Maybe the very small number of TOPp references is too few for the aliasing problem to show up. Second experiment: I looked at the difference between the generated assembly before and after the combining change. (The restrict keyword is back in place for both arms of this experiment.) Before the combining change, this C code becomes this assembly:
After the combining change, this C code becomes this assembly:
The new code is a little shorter. I'm no assembly ninja, but it doesn't look like anything crazy is happening.
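The '_eval()' aliasing noted earlier in this comment can be pinned down with a toy example (invented names, not Verilator output): in the top module's own _eval, 'this' and vlTOPp genuinely refer to the same object, so telling the compiler they never alias would be a false promise.

```cpp
// Toy example: inside the top module's _eval(), 'this' and vlTOPp point at
// the SAME object, so the two names truly alias.
struct Vtop {
    int counter = 0;
    void _eval() {
        Vtop* vlTOPp = this;    // in generated top-level code, these alias
        this->counter = 1;
        vlTOPp->counter = 2;    // writes the same storage as this->counter
        // Marking both pointers __restrict here would be an incorrect promise
        // to the compiler, which is why mixing the two styles is suspect.
    }
};
```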
Original Redmine Comment I found my original broken case, along with the assembler from that GCC way back in the 3.2 era. With current GCC I now get identical code for each.
Original Redmine Comment Oh awesome! That's great to have the original testcase. What's the oldest GCC version that Verilator should support? Are there other important compilers? I could test them to see if they all honor restrict. Are you open to reenabling the use of 'this' if the compilers test healthy? |
Original Redmine Comment It's best if every compiler works in terms of source-code compatibility, but optimization-wise about 5 years of support seems reasonable. I tried it on 4.4.7, which we use a lot, and unfortunately it's not optimal, though not as bad as it was. If you have a patch I can try that's not merge-ready, I can see how much is lost; it's ok if we lose a bit on older compilers to gain a lot on newer ones.
Original Redmine Comment A couple more data points.
In gcc 7.2.0, if I remove the restrict keyword, the bad function is still bad, with the same number of loads and stores as the 4.4.1-with-restrict-keyword case above. This is gcc 7.2.0 with no restrict:
So we'll still need restrict with newer compilers. (Strictly, 'restrict' is standardized in C99; C++ has no standard restrict keyword, but compilers widely support it as the '__restrict' extension.) The gcc 4.5 release notes mention improved support for restrict, so that's probably the cutoff where things improved: https://gcc.gnu.org/gcc-4.5/changes.html
Original Redmine Comment Here's a prospective patch. It passes the regression and works well on the repeating-cpus dummy design. If you test this, use "-O3 -inline-mult 2000" so that inlining stops at medium-sized modules. Bare "-O3" may inline everything into one giant module, leaving no opportunity to V3Combine anything. (If we are able to improve V3Combine, it may make sense to stop setting infinite inlining for -O3.) You may wonder what's going on in V3Delayed.cpp. It fixes this problem: suppose we have a module 'CPU' with a delayed assignment to array 'ARY', and suppose there are two instances, cpu0 and cpu1. Prior to the V3Delayed patch, cpu0 would get an automatic delay variable called '__Vdlyvdim0__ARY__v0', and cpu1 would get one called '__Vdlyvdim0__ARY__v1'. This was a performance bug twice over:
Original Redmine Comment V3Delayed seems reasonable. It seems separate; is it merge-ready? V3Descope seems easy enough, especially as there was a lot of code for this from before. I wonder if we should still use "this->" when there's only a single instance; what gain could there be? I ran some code on the broken GCC and saw only a 2% drop, but I wonder if this is worth an optimization-disable switch?
Original Redmine Comment Yes, V3Delayed is merge-ready, and it is separate. I guess we can easily test whether a scope has sibling scopes; if it has none, it must be the only instance. Sure, let's do that. I suspect it's rare for there to be only one instance; only-instances are usually inlined away, but not always. I'll add a chicken bit to turn the whole this-relative feature off too, just in case.
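The sibling-scope test can be sketched in a few lines; the types and the scopeIsOnlyInstance() signature here are hypothetical, not Verilator's actual data structures:

```cpp
#include <vector>

// Hedged sketch: a scope is the only instance of its module if no other
// scope in the design points at the same module.
struct Module {};
struct Scope { Module* modp; };

bool scopeIsOnlyInstance(const std::vector<Scope*>& allScopes,
                         const Scope* scopep) {
    for (const Scope* otherp : allScopes) {
        if (otherp != scopep && otherp->modp == scopep->modp) {
            return false;  // found a sibling scope of the same module
        }
    }
    return true;
}
```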
Original Redmine Comment One instance is common: every module has a "top", and any package has just one instance. Also, if you can, please add a test with appropriate verilator_flags2 and file_grep to make sure it uses an appropriate "this". Perhaps you can use an existing combine test, or add another one if you think it doesn't hit enough. Thanks.
Original Redmine Comment V3Delayed change pushed to git towards 3.913. |
Original Redmine Comment Here's a maybe-production-ready patch. I've changed the handling of the 'hierThisr' output from descopedName(). Previously, I believe we had a bug: for non-isFuncLocal() variables in non-top scopes, we'd always set hierThisr=false, which prevents V3Localize from optimizing the variable into a function local. Please double-check me on this, but I believe that if the variable is in the same scope as the function, we can safely set hierThisr=true and allow V3Localize to try its optimization. (I believe this is true regardless of whether we're emitting a relative or absolute reference.) That's how I have it set up now, and it seems to behave OK and passes the regression. Can you recommend a publicly available, large-scale design that works with verilator? I'd like to test this on a more realistic design to see what the impact is on code size.
Original Redmine Comment My bad, that last patch doesn't pass the regression; it needs this patch on top of it.
Original Redmine Comment Third try at a production-ready patch. This adds an obvious optimization missed in post#16 and also fixes a bug: I was passing the wrong scope pointer to scopeIsOnlyInstance(). This patch is a rollup of the previous two plus the new fix.
Original Redmine Comment Thanks. Here's another rollup with those fixes and all prior. |
Original Redmine Comment Patch looks good, great work. Pushed to git towards 3.913. |
Original Redmine Comment In 3.914. |
Author Name: John Coiner (@jcoiner)
Original Redmine Issue: 1224 from https://www.veripool.org
Original Assignee: John Coiner (@jcoiner)
To reproduce:
When I do this, Verilator elects not to inline the 'cpu' module. This is expected: it has more than 2000 statements and there's more than one instance of it. Literally every other non-top module gets inlined, both above and below 'cpu'. This is also expected: all of them have fewer than 2000 statements or only one instance.
I also expected V3Combine to do something. We have two cpu scopes with identical logic inside, and with similar functions generated for each. There should be opportunity to combine.
What's the state of V3Combine? Do you consider it stable?
Are there any tests for V3Combine? I did 'grep -i combine *' in the tests dir and didn't get any hits, so maybe not. It's worth a test. I'm assuming V3Combine is worth fixing and maintaining: the instruction stream will blast away data in the L2 and L3 caches. I guess for some designs we might be able to get the full I+D working set into L3, and that may depend on combining away a lot of the istream.
I'm still digging, either to find root cause or to learn why inlining should not occur in this case.