Performance delta between 4.020 and 4.022 (V3Case case handling) #1644

veripoolbot · 2019-12-17T18:41:44Z

Author Name: Julien Margetts
Original Redmine Issue: 1644 from https://www.veripool.org

For my design I am finding a very large runtime delta between Verilator 4.020 and either 4.022 or 4.024

I am seeing a 30x increase in bytes of C++ emitted, a 10x increase in compiled object size, and a 5x increase in verilator run time

Not ruling out user error, and I am trying to narrow down the area of the design which is blowing up, but in the meantime, are there any recent changes you can think of which may cause such an issue?

Thanks

veripoolbot · 2019-12-17T18:52:35Z

Original Redmine Comment
Author Name: Julien Margetts
Original Date: 2019-12-17T18:52:35Z

Possibly something to do with casez expansion?

function [3:0] LowestBitSet(input [15:0] v);
begin // Return the index of the lowest set bit in a 16-bit vector
     casez (v)
     16'b???????????????1 : LowestBitSet = 4'd0;
     16'b??????????????10 : LowestBitSet = 4'd1;
     16'b?????????????100 : LowestBitSet = 4'd2;
     16'b????????????1000 : LowestBitSet = 4'd3;
     16'b???????????10000 : LowestBitSet = 4'd4;
     16'b??????????100000 : LowestBitSet = 4'd5;
     16'b?????????1000000 : LowestBitSet = 4'd6;
     16'b????????10000000 : LowestBitSet = 4'd7;
     16'b???????100000000 : LowestBitSet = 4'd8;
     16'b??????1000000000 : LowestBitSet = 4'd9;
     16'b?????10000000000 : LowestBitSet = 4'd10;
     16'b????100000000000 : LowestBitSet = 4'd11;
     16'b???1000000000000 : LowestBitSet = 4'd12;
     16'b??10000000000000 : LowestBitSet = 4'd13;
     16'b?100000000000000 : LowestBitSet = 4'd14;
     default              : LowestBitSet = 4'd15;
     endcase
end
endfunction

veripoolbot · 2019-12-17T23:22:52Z

Original Redmine Comment
Author Name: Julien Margetts
Original Date: 2019-12-17T23:22:52Z

Yes, this is it. Re-coding the above as a for loop works-round the issue.
Is this a side-effect of the workaround for #�, which went in at 4.022?

veripoolbot · 2019-12-17T23:28:38Z

Original Redmine Comment
Author Name: Julien Margetts
Original Date: 2019-12-17T23:28:38Z

The above function emits an impressive 11MB of C++ on 4.022 compared to 14KB with 4.020 :)

veripoolbot · 2019-12-18T00:06:35Z

Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2019-12-18T00:06:35Z

Wow, that's an unanticipated consequence! Can you try reverting the define in #�? If that is it, I suspect it would still have done stupid code in earlier versions if you have a 10-bit encoder instead, you just happened to cross the 16-bit line.

Assuming that's it, it should know the instruction count is being increased and abort.

A longer term ideal fix would be to recognize all the common "diagonal" 0/1 cases and convert to just a few instructions.

veripoolbot · 2019-12-18T13:05:00Z

Original Redmine Comment
Author Name: Julien Margetts
Original Date: 2019-12-18T13:05:00Z

To your first point: Yes, simply reverting CASE_OVERLAP_WIDTH to 12 and we're back down to 14K
To your second point: Yes, a 12-bit search also emits lots of code regardless of version.

If you are interested, the attached patch reverts CASE_OVERLAP_WIDTH to 12, and performs a new overlap check earlier (up to 64-bit input width regardless of CASE_OVERLAP_WIDTH). This probably renders some of the existing code which follows redundant, but I have left it all in place.

I also think isCaseTreeFast would benefit from a test similar to the following to prevent the code explosion:

#define CASE_SPARSENESS 2
if ((m_caseItems == m_caseWidth) || m_caseItems < ((1UL<<m_caseWidth)/CASE_SPARSENESS)
     return false;

(Bit search cases will typically have m_caseItems == m_caseWidth)

veripoolbot · 2019-12-18T17:30:10Z

Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2019-12-18T17:30:10Z

Thanks for the patch, I'll check & merge it tonight.

Can you also try out the CASE_SPARSENESS test to tune your performance? I think the comparison should be caseItems == width, or items == width+1 (as there might be a default). But maybe only enable that part of the expression if the width is say 5 or larger as small statements/widths will do better optimizing.

veripoolbot · 2019-12-18T19:15:16Z

Original Redmine Comment
Author Name: Julien Margetts
Original Date: 2019-12-18T19:15:16Z

Sounds good. Will do.

If I get time I might look at a patch to recognise CLZ, CTZ type casez constructs, e.g.

width == items
each items const value has only a single bit set
each items const is prev items const >> or << 1 bit (i.e. spot the diagonal)
each items 'action' is an ASSIGN ( CONST, VARREF ) and the VARREF is the same for each item
the constant assigned to VARREF is equal to CTZ/CLZ(item_const)

Then I guess you could convert the whole CASE into something like ASSIGN ( CLZ( VARREF ), VARREF ) where CLZ is a new Math Op that emits a call to __builtin_clz or similar?

Do you think this would be worth it? Might be a fun self contained exercise either way!

Could then possibly recognise a for loop based implementation too, and expand to include popcount … :)

veripoolbot · 2019-12-18T22:57:30Z

Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2019-12-18T22:57:30Z

Short/Medium term the patch you mention seems good.

The really ideal long term improvement IMO would be to use a BDD or otherwise to make a function signature for N inputs, then map common signatures to fast implementations, this is one of many. E.g. it could even recognize low-level gates forming an adder. V3Table is an example of part of this concept, finding a series of inputs and making a truth table, it just doesn't make a signature and reduce it yet, which would replace making the table for known signatures.

veripoolbot · 2019-12-18T23:01:18Z

Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2019-12-18T23:01:18Z

As to the patch, it's O(n^2). I've seen real tables with thousands of entries, so this will effectively hang. It at least needs a size bound.

veripoolbot · 2019-12-19T10:41:49Z

Original Redmine Comment
Author Name: Julien Margetts
Original Date: 2019-12-19T10:41:49Z

Yes, I did wonder about that, but thought it would be OK since the existing overlap check (although bounded) is also O(N^2) (actually O(N*M) where M=1<<Width, which must be >= N)

Anyhow, it's easy to make it O(N) so I will do that (as it stands the patch takes 6s to check a 65,536 entry case)

margej · 2019-12-23T11:24:12Z

Been wrestling with this for a while, and after writing an O(N) overlap check which is 10,000x faster than my original lame effort, I came to the conclusion its not really worth including, as the work all has to be done again to populate m_value anyway, so its just adding effort to increase the width of the overlap check, which doesn't seem much of a value add.

I've come to the conclusion that my immediate issue is best resolved by this very simple change (as discussed above) but doing this check at the end at least allows CASE_OVERLAP_WIDTH to remain at 16

Changing this:
if (m_caseItems <= 3) return false; // Not worth simplifying
To this:
if (m_caseItems <= 3 || (m_caseWidth >= 8 && (m_caseItems <= (m_caseWidth+1)))) return false; // Not worth simplifying

margej · 2019-12-23T11:36:43Z

For reference, here is the patch for the wider overlap check.
Can't drag and drop ".patch" or ".diff" files BTW

V3Case_patch.txt

wsnyder · 2019-12-23T12:52:08Z

Thanks for your work, smaller patch pushed.

veripoolbot added area: performance Issue involves performance issues effort: hours Expect this issue to require roughly hours of invested effort to resolve labels Dec 22, 2019

wsnyder closed this as completed in c1fb938 Dec 23, 2019

wsnyder added the resolution: fixed Closed; fixed label Dec 23, 2019

wsnyder changed the title ~~Performance delta between 4.020 and 4.022~~ Performance delta between 4.020 and 4.022 (V3Case case handling) Dec 23, 2019

wsnyder mentioned this issue May 3, 2020

Verilator 4.026 2020-01-11 released verilator/verilator-announce#28

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Performance delta between 4.020 and 4.022 (V3Case case handling) #1644

Performance delta between 4.020 and 4.022 (V3Case case handling) #1644

veripoolbot commented Dec 17, 2019

veripoolbot commented Dec 17, 2019

veripoolbot commented Dec 17, 2019

veripoolbot commented Dec 17, 2019

veripoolbot commented Dec 18, 2019

veripoolbot commented Dec 18, 2019

veripoolbot commented Dec 18, 2019

veripoolbot commented Dec 18, 2019

veripoolbot commented Dec 18, 2019

veripoolbot commented Dec 18, 2019

veripoolbot commented Dec 19, 2019

margej commented Dec 23, 2019

margej commented Dec 23, 2019

wsnyder commented Dec 23, 2019

Navigation Menu

Performance delta between 4.020 and 4.022 (V3Case case handling) #1644

Performance delta between 4.020 and 4.022 (V3Case case handling) #1644

Comments

veripoolbot commented Dec 17, 2019

veripoolbot commented Dec 17, 2019

veripoolbot commented Dec 17, 2019

veripoolbot commented Dec 17, 2019

veripoolbot commented Dec 18, 2019

veripoolbot commented Dec 18, 2019

veripoolbot commented Dec 18, 2019

veripoolbot commented Dec 18, 2019

veripoolbot commented Dec 18, 2019

veripoolbot commented Dec 18, 2019

veripoolbot commented Dec 19, 2019

margej commented Dec 23, 2019

margej commented Dec 23, 2019

wsnyder commented Dec 23, 2019