
V3Combine not working, and use of this-> #1224

Closed

veripoolbot opened this issue Oct 2, 2017 · 22 comments
Labels: area: performance (Issue involves performance issues); resolution: fixed (Closed; fixed)


Author Name: John Coiner (@jcoiner)
Original Redmine Issue: 1224 from https://www.veripool.org

Original Assignee: John Coiner (@jcoiner)


To reproduce:

  • Start with the verilog-sim-benchmarks design (the m68 toy cpu and its testbench from http://git.veripool.org/git/verilog-sim-benchmarks)
  • Add the attached k68_cpu_multi.v module, which instantiates two k68_cpu's.
  • In k68_soc.v, replace the instance of k68_cpu with k68_cpu_multi.
  • Run verilator with '-inline-mult 2000'. This must come after -O3 on the command line so the limit takes effect; otherwise -O3's own inlining setting overrides it. Also build with --debug.

When I do this, Verilator elects not to inline the 'cpu' module. This is expected: it has more than 2000 statements and there's more than one instance of it. Every other non-top module, both above and below 'cpu', does get inlined. This is also expected: each of them has fewer than 2000 statements or only one instance.

I also expected V3Combine to do something. We have two cpu scopes with identical logic inside, and with similar functions generated for each. There should be opportunity to combine.

What's the state of V3Combine? Do you consider it stable?

Are there any tests for V3Combine? I did 'grep -i combine *' in the tests dir and didn't get any hits, so maybe not. It's worth a test. I'm assuming V3Combine is worth fixing and maintaining: the istream (instruction stream) will blast away data in the L2 and L3 caches. For some designs we might be able to fit the full I+D working set into L3, and that may depend on combining away a lot of the istream.

I'm still digging, either to find root cause or to learn why inlining should not occur in this case.



Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2017-10-02T15:55:16Z


Undoubtedly V3Combine can be improved. There are at least 3 tests for it, grep for "Combined" in the test_regress directory; the tests check the stats file to make sure combining happened.



Original Redmine Comment
Author Name: John Coiner (@jcoiner)
Original Date: 2017-10-02T19:09:26Z


Thanks, whoops, I missed those tests on first grep.

In the tests V3Combine is only combining 'settle' routines with 'sequent' routines. It never combines sequent/comb routines with one another. There's probably not much Icache performance gain from combining settle routines; uncombined settle routines fall out of cache once the sim moves beyond time 0. There could be a lot of Icache performance gain from combining sequent/comb routines with one another.

Here's a proposal, instead of producing this _eval() code:

  vlSymsp->TOP__t__DOT__s0._sequent__TOP__t__DOT__s0__1(vlSymsp);
  vlSymsp->TOP__t__DOT__s1._sequent__TOP__t__DOT__s1__2(vlSymsp);

... where those functions are 'static' to their class, let's make them non-static and call them like this:

  vlSymsp->TOP__t__DOT__s0._sequent__1(vlSymsp);
  vlSymsp->TOP__t__DOT__s1._sequent__1(vlSymsp);  // same routine twice now

Within the functions, references to local class members would be relative, not absolute from vlSymsp.

If we do these transformations prior to (or concurrent with) V3Combine, it should be possible to combine away most of the logic in a large, repetitive, nested design.
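A minimal C++ sketch (hypothetical names, not Verilator's actual emitted code) of the difference between the two calling styles — one static routine per scope versus one shared non-static routine selected by 'this':

```cpp
#include <cstdint>

struct Syms;  // forward declaration of the symbol table

struct Cpu {
    uint32_t cnt = 0;
    // Old style (sketch): one static function per instance, each with
    // that instance's absolute path baked into its body, e.g.:
    //   static void sequent__s0(Syms* symsp);  // touches symsp->s0.*
    //   static void sequent__s1(Syms* symsp);  // touches symsp->s1.*
    // Proposed style: a single non-static member function; which
    // instance it acts on is chosen by the object it is invoked on.
    void sequent(Syms* symsp) { cnt = cnt + 1U; }  // members via 'this'
};

struct Syms {
    Cpu s0;
    Cpu s1;
};

// _eval() now calls the same routine body for both scopes:
void eval(Syms* symsp) {
    symsp->s0.sequent(symsp);
    symsp->s1.sequent(symsp);  // same code, different 'this'
}
```

Because both calls share one function body, the combined code only exists once in the instruction stream, regardless of the instance count.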

Does this make sense? Could I start on a patch?



Original Redmine Comment
Author Name: John Coiner (@jcoiner)
Original Date: 2017-10-02T22:43:57Z


It looks like Verilator used to work just this way, and it's pretty easy to reenable this latent behavior: the attached patch (just a demonstration, do not commit please) will do it.

That short patch gets V3Combine merging sequential functions in the 2-cpu version of the m68 design. It passes the regression, except for a few tests that see more v3combining than they expect.

Wilson, it looks like you had some trouble in the past when it worked this way, from the comments you left in DescopeVisitor::descopedName() about ambiguous and/or aliasing pointers. Do you remember what happened and how to provoke it? I wonder if we would still have this problem with more recent compilers, and whether we might find a fix that doesn't require writing a unique routine per scope (restrict pointers...?)

Thanks.



Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2017-10-03T00:36:04Z


Unfortunately, when I last looked at this, GCC did not do at all well with using "this" versus a single pointer, which is why I changed it to all statics. Basically the code uses both "this" and the other symbol pointer, and despite all the __restrict hints, GCC assumed every time that one pointer dereference might change something accessible through the other. The result is a lot of lost optimizations, as no temporaries could be carried across the compiler barrier created by the (incorrectly assumed) alias.

What performance change did you see? Can you look at the assembler deltas (with -O2) to see what changes between the current and your/old with a more recent GCC?

As to further combining, it would be good to think about whether Verilator should build more generic functions (that is, ones that don't dereference a pointer) that it could use in many places. That's what GCC is best at optimizing. Likewise, Verilator is so aggressive with unrolling that when it gets close to done it might be useful to go the other way and re-roll some loops. It could also use structures to accomplish this, e.g. common code with 2 usages becomes a struct with an array of 2. That could build all the way up to a module.
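A toy sketch (invented names) of that struct-with-array idea: two cloned scopes become an array of a common struct, and the duplicated statement bodies re-roll into a loop:

```cpp
#include <cstdint>

struct SubLogic { uint32_t cnt; };

struct Top {
    SubLogic sub[2] = {};  // was: two separately-named, identical scopes

    void eval() {
        // was: two unrolled copies of the same statement body, one per
        // scope; now a single loop over the array of common structs
        for (int i = 0; i < 2; ++i) {
            sub[i].cnt = (sub[i].cnt + 1U) & 3U;
        }
    }
};
```

The loop body exists once in the instruction stream however many instances there are, which is the same cache benefit V3Combine is after.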



Original Redmine Comment
Author Name: John Coiner (@jcoiner)
Original Date: 2017-10-03T02:17:56Z


Here's a contrived example. I took the verilog sim benchmarks and added a few modules:

  • k68_cpu_multi instances two of k68_cpu
  • k68_cpu_multi2 instances two of k68_cpu_multi
  • ... and so on until ...
  • k68_cpu_multi6 instances two of k68_cpu_multi5
  • k68_soc instances k68_cpu_multi6

That's 64 CPUs grand total.

With the original combine code, the test runs at 59 KCPS, m68_k68_cpu.cpp is 60MB, m68.cpp is 547kB, and m68__ALLcls.o is 2.6MB.

With 'this'-relative combining, the test runs at 184 KCPS, m68_k68_cpu.cpp is 338kB, m68.cpp is 547kB, and m68__ALLcls.o is 113kB. My guess is the speedup comes from fitting the entire code stream in L2 instead of L3. This one compiles a lot faster too, though I didn't time it.

Here's a typical routine from m68_k68_cpu.cpp with the combining fix. I haven't looked at the assembly yet, but I'd be surprised if gcc is guarding much against aliasing, given that every variable is referenced through 'this', none are referenced through vlTOPp:

VL_INLINE_OPT void m68_k68_cpu::_sequent__TOP__k68_soc_test__DOT__dut0__DOT__cpu0__DOT__cpu0__cpu0__DOT__cpu0__DOT__cpu0__DOT__cpu0__DOT__cpu0__2(m68__Syms* __restrict vlSymsp) {
     VL_DEBUG_IF(VL_PRINTF("                      m68_k68_cpu::_sequent__TOP__k68_soc_test__DOT__dut0__DOT__cpu0__DOT__cpu0__cpu0__DOT__cpu0__DOT__cpu0__DOT__cpu0__DOT__cpu0__2\n"); );
     m68* __restrict vlTOPp VL_ATTR_UNUSED = vlSymsp->TOPp;
     // Body
     this->__Vdlyvset__regbank0__DOT__bank0__DOT__mem__v0 = 0U;
     this->__Vdlyvset__regbank0__DOT__bank1__DOT__mem__v0 = 0U;
     this->__Vdly__fetch0__DOT__mem_cnt = this->__PVT__fetch0__DOT__mem_cnt;
     this->__Vdly__unify0__DOT__uni_cnt = this->__PVT__unify0__DOT__uni_cnt;
     this->__Vdly__a_dat_b_o = this->__PVT__a_dat_b_o;
     this->__Vdly__a_skip_o = this->__PVT__a_skip_o;
     this->__Vdly__addmode0__DOT__skipa = this->__PVT__addmode0__DOT__skipa;
     this->__Vdly__addmode0__DOT__skipb = this->__PVT__addmode0__DOT__skipb;
     this->__Vdly__m_add_o = this->__PVT__m_add_o;
     this->__Vdly__addmode0__DOT__m_cnt = this->__PVT__addmode0__DOT__m_cnt;
     this->__Vdly__a_dat_a_o = this->__PVT__a_dat_a_o;
     this->__Vdly__a_dat_c_o = this->__PVT__a_dat_c_o;
     this->__Vdly__clkgen0__DOT__cnt = this->__PVT__clkgen0__DOT__cnt;
     // ALWAYS at ../rtl/k68_dpmem.v:64
     if (this->__PVT__regbank0__DOT__we) {
	this->__Vdlyvval__regbank0__DOT__bank0__DOT__mem__v0 
	    = this->__PVT__regbank0__DOT__w_dat;
	this->__Vdlyvset__regbank0__DOT__bank0__DOT__mem__v0 = 1U;
	this->__Vdlyvdim0__regbank0__DOT__bank0__DOT__mem__v0 
	    = this->__PVT__regbank0__DOT__w_add;
     }
     // ALWAYS at ../rtl/k68_dpmem.v:64
     if (this->__PVT__regbank0__DOT__we) {
	this->__Vdlyvval__regbank0__DOT__bank1__DOT__mem__v0 
	    = this->__PVT__regbank0__DOT__w_dat;
	this->__Vdlyvset__regbank0__DOT__bank1__DOT__mem__v0 = 1U;
	this->__Vdlyvdim0__regbank0__DOT__bank1__DOT__mem__v0 
	    = this->__PVT__regbank0__DOT__w_add;
     }
     // ALWAYS at ../rtl/k68_fetch.v:205
     this->__Vdly__fetch0__DOT__mem_cnt = (3U & ((IData)(this->__PVT__rst_o)
						 ? 0U
						 : 
						((IData)(1U) 
						 + (IData)(this->__PVT__fetch0__DOT__mem_cnt))));
     // ALWAYS at ../rtl/k68_buni.v:81
     this->__Vdly__unify0__DOT__uni_cnt = (3U & ((IData)(this->__PVT__rst_o)
						 ? 0U
						 : 
						((IData)(1U) 
						 + (IData)(this->__PVT__unify0__DOT__uni_cnt))));
     // ALWAYS at ../rtl/k68_clkgen.v:62
     this->__Vdly__clkgen0__DOT__cnt = (3U & ((IData)(1U) 
					     + (IData)(this->__PVT__clkgen0__DOT__cnt)));
     // ALWAYSPOST at ../rtl/k68_dpmem.v:65
     if (this->__Vdlyvset__regbank0__DOT__bank0__DOT__mem__v0) {
	this->__PVT__regbank0__DOT__bank0__DOT__mem[this->__Vdlyvdim0__regbank0__DOT__bank0__DOT__mem__v0] 
	    = this->__Vdlyvval__regbank0__DOT__bank0__DOT__mem__v0;
     }
     // ALWAYSPOST at ../rtl/k68_dpmem.v:65
     if (this->__Vdlyvset__regbank0__DOT__bank1__DOT__mem__v0) {
	this->__PVT__regbank0__DOT__bank1__DOT__mem[this->__Vdlyvdim0__regbank0__DOT__bank1__DOT__mem__v0] 
	    = this->__Vdlyvval__regbank0__DOT__bank1__DOT__mem__v0;
     }
     this->__PVT__clkgen0__DOT__cnt = this->__Vdly__clkgen0__DOT__cnt;
     // ALWAYS at ../rtl/k68_regbank.v:73
     this->__PVT__regbank0__DOT__w_dat = this->__PVT__a_rd_dat_o;
     // ALWAYS at ../rtl/k68_regbank.v:72
     this->__PVT__regbank0__DOT__w_add = this->__PVT__a_rd_add_o;
     // ALWAYS at ../rtl/k68_regbank.v:74
     this->__PVT__regbank0__DOT__we = this->__PVT__a_we_o;
     this->__PVT__clk4_i = (1U & ((IData)(this->__PVT__clkgen0__DOT__cnt) 
				 >> 1U));
}

That's the common case, at least for this design built with these options. In all of m68_k68_cpu.cpp, which is a 10K line file, there are only five variable references through vlTOPp.

Are there different design styles or options that cause vlTOPp references to become a lot more common (so that aliasing would be a bigger problem)?

I'll look for evidence of gcc doing extra work to support aliasing, those five vlTOPp references should be enough to see it. I'm running gcc-7.2.0 on this system.



Original Redmine Comment
Author Name: John Coiner (@jcoiner)
Original Date: 2017-10-03T03:07:15Z


Some data points.

First I tried removing the 'restrict' keyword everywhere that verilator emits it, to see if GCC 7.2.0 is doing anything with it at all. I compiled twice, with and without 'restrict'. (In both trials the more aggressive combining was enabled.) The only difference in the generated code between these two trials was in the '_eval()' function, which mixes 'this' references and TOPp references. (And hmm, those WOULD actually be aliases... probably we should not mix them there...) The generated code without restrict didn't look too bad; it was a little shorter with restrict. More relevant: there were no other differences, and all the 'sequent' routines were unaffected. Maybe the very few TOPp references are too few for the aliasing problem to show up.

Second experiment. I looked at the difference between generated assembly before and after the combining change. (The restrict keyword is back in place for both arms of this experiment.)

Before the combining change, this C code becomes this assembly:

VL_INLINE_OPT void m68_k68_cpu::_multiclk__TOP__k68_soc_test__DOT__dut0__DOT__cpu0__DOT__cpu0__14(m68__Syms* __restrict vlSymsp) {
     VL_DEBUG_IF(VL_PRINTF("            m68_k68_cpu::_multiclk__TOP__k68_soc_test__DOT__dut0__DOT__cpu0__DOT__cpu0__14\n"); );
     m68* __restrict vlTOPp VL_ATTR_UNUSED = vlSymsp->TOPp;
     // Body
     vlSymsp->TOP__k68_soc_test__DOT__dut0__DOT__cpu0__DOT__cpu0.__PVT__m_dat_i 
	= ((0U == (IData)(vlSymsp->TOP__k68_soc_test__DOT__dut0__DOT__cpu0__DOT__cpu0.__PVT__unify0__DOT__uni_cnt))
	    ? 0U : vlTOPp->k68_soc_test__DOT__dut0__DOT__m_dat_i);
     vlSymsp->TOP__k68_soc_test__DOT__dut0__DOT__cpu0__DOT__cpu0.__PVT__p_dat_i 
	= (0xffffU & ((0U == (IData)(vlSymsp->TOP__k68_soc_test__DOT__dut0__DOT__cpu0__DOT__cpu0.__PVT__unify0__DOT__uni_cnt))
		       ? vlTOPp->k68_soc_test__DOT__dut0__DOT__m_dat_i
		       : 0U));
     vlSymsp->TOP__k68_soc_test__DOT__dut0__DOT__cpu0__DOT__cpu0.__PVT__fetch0__DOT__s_dat_i 
	= ((0xff00U & ((IData)(vlSymsp->TOP__k68_soc_test__DOT__dut0__DOT__cpu0__DOT__cpu0.__PVT__p_dat_i) 
		       << 8U)) | (0xffU & ((IData)(vlSymsp->TOP__k68_soc_test__DOT__dut0__DOT__cpu0__DOT__cpu0.__PVT__p_dat_i) 
					   >> 8U)));
}

0000000000007f50 <_ZN11m68_k68_cpu65_multiclk__TOP__k68_soc_test__DOT__dut0__DOT__cpu0__DOT__cpu0__14EP9m68__Syms>:
     7f50:	80 bf b9 00 00 00 00 	cmpb   $0x0,0xb9(%rdi)
     7f57:	48 8b 47 10          	mov    0x10(%rdi),%rax
     7f5b:	8b 80 ac 00 00 00    	mov    0xac(%rax),%eax
     7f61:	74 1d                	je     7f80 <_ZN11m68_k68_cpu65_multiclk__TOP__k68_soc_test__DOT__dut0__DOT__cpu0__DOT__cpu0__14EP9m68__Syms+0x30>
     7f63:	89 87 f8 00 00 00    	mov    %eax,0xf8(%rdi)
     7f69:	31 d2                	xor    %edx,%edx
     7f6b:	31 c0                	xor    %eax,%eax
     7f6d:	66 89 97 da 00 00 00 	mov    %dx,0xda(%rdi)
     7f74:	66 89 87 ea 00 00 00 	mov    %ax,0xea(%rdi)
     7f7b:	c3                   	retq   
     7f7c:	0f 1f 40 00          	nopl   0x0(%rax)
     7f80:	89 c2                	mov    %eax,%edx
     7f82:	66 c1 c0 08          	rol    $0x8,%ax
     7f86:	c7 87 f8 00 00 00 00 	movl   $0x0,0xf8(%rdi)
     7f8d:	00 00 00 
     7f90:	66 89 97 da 00 00 00 	mov    %dx,0xda(%rdi)
     7f97:	66 89 87 ea 00 00 00 	mov    %ax,0xea(%rdi)
     7f9e:	c3                   	retq   
     7f9f:	90                   	nop

After the combining change, this C code becomes this assembly:

VL_INLINE_OPT void m68_k68_cpu::_multiclk__TOP__k68_soc_test__DOT__dut0__DOT__cpu0__DOT__cpu0__14(m68__Syms* __restrict vlSymsp) {
     VL_DEBUG_IF(VL_PRINTF("            m68_k68_cpu::_multiclk__TOP__k68_soc_test__DOT__dut0__DOT__cpu0__DOT__cpu0__14\n"); );
     m68* __restrict vlTOPp VL_ATTR_UNUSED = vlSymsp->TOPp;
     // Body
     this->__PVT__m_dat_i = ((0U == (IData)(this->__PVT__unify0__DOT__uni_cnt))
			     ? 0U : vlTOPp->k68_soc_test__DOT__dut0__DOT__m_dat_i);
     this->__PVT__p_dat_i = (0xffffU & ((0U == (IData)(this->__PVT__unify0__DOT__uni_cnt))
				        ? vlTOPp->k68_soc_test__DOT__dut0__DOT__m_dat_i
				        : 0U));
     this->__PVT__fetch0__DOT__s_dat_i = ((0xff00U & 
					  ((IData)(this->__PVT__p_dat_i) 
					   << 8U)) 
					 | (0xffU & 
					    ((IData)(this->__PVT__p_dat_i) 
					     >> 8U)));
}

00000000000073b0 <_ZN11m68_k68_cpu65_multiclk__TOP__k68_soc_test__DOT__dut0__DOT__cpu0__DOT__cpu0__14EP9m68__Syms>:
     73b0:	80 7f 39 00          	cmpb   $0x0,0x39(%rdi)
     73b4:	48 8b 46 10          	mov    0x10(%rsi),%rax
     73b8:	74 16                	je     73d0 <_ZN11m68_k68_cpu65_multiclk__TOP__k68_soc_test__DOT__dut0__DOT__cpu0__DOT__cpu0__14EP9m68__Syms+0x20>
     73ba:	8b 80 ac 00 00 00    	mov    0xac(%rax),%eax
     73c0:	31 d2                	xor    %edx,%edx
     73c2:	66 89 57 5a          	mov    %dx,0x5a(%rdi)
     73c6:	89 47 78             	mov    %eax,0x78(%rdi)
     73c9:	31 c0                	xor    %eax,%eax
     73cb:	66 89 47 6a          	mov    %ax,0x6a(%rdi)
     73cf:	c3                   	retq   
     73d0:	c7 47 78 00 00 00 00 	movl   $0x0,0x78(%rdi)
     73d7:	8b 80 ac 00 00 00    	mov    0xac(%rax),%eax
     73dd:	89 c2                	mov    %eax,%edx
     73df:	66 c1 c0 08          	rol    $0x8,%ax
     73e3:	66 89 57 5a          	mov    %dx,0x5a(%rdi)
     73e7:	66 89 47 6a          	mov    %ax,0x6a(%rdi)
     73eb:	c3                   	retq   
     73ec:	0f 1f 40 00          	nopl   0x0(%rax)

The new code is a little shorter. I'm no assembly ninja, but it doesn't look like anything crazy is happening.



Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2017-10-03T03:22:36Z


I found my original broken case, and the assembler from that GCC way back in the 3.2 era. With current GCC I now get identical code for each.



Original Redmine Comment
Author Name: John Coiner (@jcoiner)
Original Date: 2017-10-03T12:55:48Z


Oh awesome! That's great to have the original testcase.

What's the oldest GCC version that Verilator should support? Are there other important compilers? I could test them to see if they all honor restrict.

Are you open to reenabling the use of 'this' if the compilers test healthy?



Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2017-10-03T13:07:47Z


It's best if every compiler works in terms of source code compatibility, but optimization-wise about 5 years of support seems reasonable. I tried it on 4.4.7, which we use a lot, and unfortunately it's not optimal, though not as bad as it was. If you have a patch I can try, even one that's not merge ready, I can see how much is lost - it's ok if we lose a bit on older compilers to gain a lot on newer ones.



Original Redmine Comment
Author Name: John Coiner (@jcoiner)
Original Date: 2017-10-03T15:21:31Z


A couple more data points.

  • On gcc 7.2.0, restrict4 produces the same object code for bad() and better().
  • On gcc 4.8.5 (June 2015), restrict4 produces the same object code for bad() and better().
  • On gcc 4.4.1 (July 2009), restrict4 produces some extra stores/loads:
.globl _ZN3Mod3badEP3Rmt
         .type   _ZN3Mod3badEP3Rmt, @function
_ZN3Mod3badEP3Rmt:
.LFB1:
         .cfi_startproc
         .cfi_personality 0x3,__gxx_personality_v0
         movq    (%rsi), %rdx    # <variable>.remoteb, <variable>.remoteb
         addq    %rdx, %rdx      # D.2115
         orq     (%rdi), %rdx    # <variable>.locala, D.2115
         movq    %rdx, (%rdi)    # D.2115, <variable>.locala
         movq    (%rsi), %rax    # <variable>.remoteb, <variable>.remoteb
         salq    $2, %rax        #, D.2121
         orq     %rdx, %rax      # D.2115, D.2121
         movq    %rax, (%rdi)    # D.2121, <variable>.locala
         movq    (%rsi), %rdx    # <variable>.remoteb, <variable>.remoteb
         salq    $3, %rdx        #, tmp68
         orq     %rax, %rdx      # D.2121, tmp68
         movq    %rdx, (%rdi)    # tmp68, <variable>.locala
         ret
         .cfi_endproc
.LFE1:
         .size   _ZN3Mod3badEP3Rmt, .-_ZN3Mod3badEP3Rmt
         .align 2
         .p2align 4,,15
.globl _ZN3Mod6betterEP3Rmt
         .type   _ZN3Mod6betterEP3Rmt, @function
_ZN3Mod6betterEP3Rmt:
.LFB2:
         .cfi_startproc
         .cfi_personality 0x3,__gxx_personality_v0
         movq    (%rsi), %rdx    # <variable>.remoteb, localb
         leaq    (%rdx,%rdx), %rax       #, tmp66
         leaq    0(,%rdx,4), %rcx        #, tmp62
         salq    $3, %rdx        #, tmp65
         orq     %rcx, %rax      # tmp62, tmp66
         orq     (%rdi), %rax    # <variable>.locala, tmp66
         orq     %rdx, %rax      # tmp65, tmp66
         movq    %rax, (%rdi)    # tmp66, <variable>.locala
         ret
         .cfi_endproc

In gcc 7.2.0, if I remove the restrict keyword, the bad function is still bad, with the same number of loads and stores as the 4.4.1-with-restrict-keyword case above. This is gcc 7.2.0 with no restrict:

_ZN3Mod3badEP3Rmt:
.LFB1:
         .cfi_startproc
# restrict4.cpp:24:     lhs = lhs | rhs<<(bit);
         movq    (%rsi), %rax    # rmtp_4(D)->remoteb, tmp106
         addq    %rax, %rax      # tmp99
         orq     (%rdi), %rax    # MEM[(long int &)this_6(D)], tmp99
         movq    %rax, (%rdi)    # _13, MEM[(long int &)this_6(D)]
         movq    %rax, %rdx      # tmp99, _13
         movq    (%rsi), %rax    # rmtp_4(D)->remoteb, tmp108
         salq    $2, %rax        #, tmp101
         orq     %rdx, %rax      # _13, _10
         movq    %rax, (%rdi)    # _10, MEM[(long int &)this_6(D)]
         movq    (%rsi), %rdx    # rmtp_4(D)->remoteb, rmtp_4(D)->remoteb
         salq    $3, %rdx        #, tmp103
         orq     %rdx, %rax      # tmp103, tmp105
         movq    %rax, (%rdi)    # tmp105, MEM[(long int &)this_6(D)]
# restrict4.cpp:48: }
         ret
         .cfi_endproc
.LFE1:
         .size   _ZN3Mod3badEP3Rmt, .-_ZN3Mod3badEP3Rmt
         .align 2
         .p2align 4,,15
         .globl  _ZN3Mod6betterEP3Rmt
         .type   _ZN3Mod6betterEP3Rmt, @function
_ZN3Mod6betterEP3Rmt:
.LFB2:
         .cfi_startproc
# restrict4.cpp:53:     long localb = rmtp->remoteb;
         movq    (%rsi), %rdx    # rmtp_2(D)->remoteb, localb
# restrict4.cpp:24:     lhs = lhs | rhs<<(bit);
         leaq    0(,%rdx,4), %rax        #, tmp97
         leaq    (%rdx,%rdx), %rcx       #, tmp98
         salq    $3, %rdx        #, tmp101
         orq     %rcx, %rax      # tmp98, tmp99
         orq     (%rdi), %rax    # MEM[(long int &)this_4(D)], tmp100
         orq     %rdx, %rax      # tmp101, tmp102
         movq    %rax, (%rdi)    # tmp102, MEM[(long int &)this_4(D)]
# restrict4.cpp:67: }
         ret
         .cfi_endproc

So we'll still need restrict even with newer compilers. Restrict is standardized in C99 (C++ compilers provide it as the __restrict__ extension), so it's at least becoming less implementation-specific over time.

The gcc 4.5 release notes mention improved support for restrict; that's probably the cutoff where things improved: https://gcc.gnu.org/gcc-4.5/changes.html
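For reference, here's a reconstruction of roughly what the bad()/better() pair in restrict4.cpp might look like, inferred from the source annotations in the assembly listings above (the actual file isn't attached here, so treat this as a sketch):

```cpp
struct Rmt { long remoteb; };

struct Mod {
    long locala;

    // bad: writes to this->locala interleave with reads of
    // rmtp->remoteb; unless the compiler can prove the pointers don't
    // alias, it must reload and store through memory on every statement.
    void bad(Rmt* rmtp) {
        locala = locala | rmtp->remoteb << 1;
        locala = locala | rmtp->remoteb << 2;
        locala = locala | rmtp->remoteb << 3;
    }

    // better: cache the remote value (and the accumulator) in locals,
    // so the whole computation stays in registers regardless of any
    // aliasing assumptions.
    void better(Rmt* rmtp) {
        long localb = rmtp->remoteb;
        long a = locala;
        a = a | localb << 1;
        a = a | localb << 2;
        a = a | localb << 3;
        locala = a;
    }
};
```

Both functions compute the same result; the difference is only in how much freedom the compiler has to keep values in registers.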



Original Redmine Comment
Author Name: John Coiner (@jcoiner)
Original Date: 2017-10-03T18:45:48Z


Here's a prospective patch. It passes the regression and works well on the repeating-cpus dummy design.

If you test this, use "-O3 -inline-mult 2000" so that inlining stops at medium-sized modules. Bare "-O3" may inline everything into one giant module, leaving no opportunity for V3Combine to do anything. (If we're able to improve V3Combine, it may make sense to stop setting infinite inlining for -O3.)

You may wonder what's going on in V3Delayed.cpp. It fixes this problem: suppose we have a module 'CPU' with a delayed assignment to array 'ARY', and suppose there are two instances, cpu0 and cpu1. Prior to the V3Delayed patch, cpu0 would get an automatic delay variable called '__Vdlyvdim0__ARY__v0', and cpu1 would get one called '__Vdlyvdim0__ARY__v1'. This was a performance bug twice over:

  • It prevented V3Combine from working, due to asymmetrical code streams.
  • Each copy of CPU carried both the '__v0' and '__v1' variables even though it only used one of them.
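A small sketch (simplified, with invented names patterned after the generated code earlier in this thread) of how a nonblocking array write is lowered to value/index/set shadow variables, and why giving every instance the same shadow names keeps the per-instance code streams identical:

```cpp
#include <cstdint>

struct Cpu {
    uint32_t mem[16] = {};
    // Shadow variables for the delayed write to 'mem'. With the fix,
    // every instance of Cpu names these identically ('__v0'), so the
    // generated bodies are symmetric and V3Combine can merge them.
    uint32_t Vdlyvval_mem_v0 = 0;
    uint32_t Vdlyvdim_mem_v0 = 0;
    bool     Vdlyvset_mem_v0 = false;

    void sequent(bool we, uint32_t addr, uint32_t data) {
        Vdlyvset_mem_v0 = false;
        // ALWAYS: capture the write now instead of performing it
        if (we) {
            Vdlyvval_mem_v0 = data;
            Vdlyvdim_mem_v0 = addr & 0xfU;
            Vdlyvset_mem_v0 = true;
        }
        // ALWAYSPOST: commit the delayed write at the end of the eval
        if (Vdlyvset_mem_v0) mem[Vdlyvdim_mem_v0] = Vdlyvval_mem_v0;
    }
};
```

With per-instance suffixes ('__v0' in one instance, '__v1' in the other), the two bodies would differ textually and V3Combine could not merge them, even though they compute exactly the same thing.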



Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2017-10-03T22:42:27Z


V3Delayed seems reasonable. This seems separate, is it merge ready?

V3Descope seems easy enough - esp as there was a lot of code from before. I wonder if we should still use "this->" when there's only a single instance? What gain could there be?

I ran some code on the broken GCC and saw only a 2% drop, but I wonder if this is worth an optimization-disable switch?



Original Redmine Comment
Author Name: John Coiner (@jcoiner)
Original Date: 2017-10-04T00:33:22Z


Yes, V3Delayed is merge ready. It is separate.

I guess we can easily test whether a scope has sibling scopes; if it has none, it must be an only instance. Sure, let's do that. I suspect it's rare for there to be only one instance; only-instances are usually inlined away, but not always.

I'll do a chicken bit to turn the whole this-relative feature off too, just in case.



Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2017-10-04T00:57:53Z


One instance is common - every module has a "top" and any package has just one instance.

Also, if you can, please add some test with appropriate verilator_flags2 and file_grep to make sure it uses an appropriate "this". Perhaps you can use an existing combine test, or add another one if you think that doesn't hit it enough. Thanks.



Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2017-10-04T01:27:24Z


V3Delayed change pushed to git towards 3.913.



Original Redmine Comment
Author Name: John Coiner (@jcoiner)
Original Date: 2017-10-04T14:56:11Z


Here's a maybe production-ready patch.

I've changed the handling of the 'hierThisr' output from descopedName(). Previously I believe we had a bug: for non-isFuncLocal() variables in non-top scopes, we'd always set hierThisr=false, which prevents V3Localize from optimizing the variable into a function local. Please double check me on this, but I believe if the variable is in the same scope as the function, we can safely set hierThisr=true and allow V3Localize to try its optimization. (I believe this is true regardless of whether we're emitting a relative or absolute reference.) That's how I have it set up now, and it seems to behave OK and pass the regression.

Can you recommend a publicly available, large-scale design that works with Verilator? I'd like to test this on a more realistic design to see what the impact is on code size.



Original Redmine Comment
Author Name: John Coiner (@jcoiner)
Original Date: 2017-10-04T15:37:39Z


My bad, that last patch doesn't pass the regression, it needs this patch on top of it.



Original Redmine Comment
Author Name: John Coiner (@jcoiner)
Original Date: 2017-10-04T18:34:28Z


Third try at a production-ready patch. This adds an obvious optimization missed in post#16 and also fixes a bug: I was passing the wrong scope pointer to scopeIsOnlyInstance().

This patch is a rollup of the previous two plus the new fix.



Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2017-10-04T22:35:30Z



+=item --relative-cfuncs
+
+In generated functions, allow the use of 'this->' to reference variables
+instead of an absolute path starting from 'vlTOPp->'. This is an
+optimization: it permits V3Combine to merge functions from multiple
+instances of a module, reducing the footprint of the instruction stream.
+
+This option may reduce performance when using older compilers (gcc < 4.5)
+which don't honor the __restrict__ keyword well.

I was originally thinking about an -O{letter} flag but this is fine too. I think we want this to be the direction going forward (right?), so probably make it on by default, and document it as --no-relative-cfuncs. (Note the V3Option.cpp decoder remains the same - it automatically understands the no- prefix.)


+++ b/src/V3Descope.cpp
+    bool               m_scopeIsOnlyInstance;  // m_scopep is the only
+    bool               m_needThis;     // Make function non-static

These should be initialized to false in the constructor (for completeness,
even though won't cause problem)


+    static bool isScope(const AstNode* nodep) {
+        return NULL != dynamic_cast<const AstScope*>(nodep);
+    }

Remove and replace in caller with nodep->castScope()


+    static bool scopeIsOnlyInstance(const AstScope* scopep) {
+        // Iterate through scope's siblings, look for other scopes.
+        // If any found there are multiple instances.
+        //
+        // This could be faster if it assumes that scopes are
+        // contiguous in the list, but let's not assume that.
+        const AstNode* peerp = scopep->backp();
+        const AstNode* prevp = scopep;
+        while (peerp->nextp() == prevp) { // while not the parent
+            if (isScope(peerp)) { return false; }
+            prevp = peerp;
+            peerp = peerp->backp();
+        }
+        peerp = scopep->nextp();
+        while (peerp) {
+            if (isScope(peerp)) { return false; }
+            peerp = peerp->nextp();
+        }
+        return true;
+    }

For N scopes, you call this N times, so it's O(N^2) which can be bad.

I think you're really looking for how many scopes are in a module.

So in AstNodeModule, roughly:

     int instances = 0;
     m_modOnlyInstance = true;
     for (AstNode* stmtp = nodep->stmtsp(); stmtp; stmtp = stmtp->nextp()) {
         if (stmtp->castScope()) {
             if (++instances >= 2) { m_modOnlyInstance = false; break; }
         }
     }




Original Redmine Comment
Author Name: John Coiner (@jcoiner)
Original Date: 2017-10-05T13:58:51Z


Thanks. Here's another rollup with those fixes and all prior.



Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2017-10-05T22:18:36Z


Patch looks good, great work.

Pushed to git towards 3.913.



Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2017-10-14T20:21:54Z


In 3.914.
