New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Performance delta between 4.020 and 4.022 (V3Case case handling) #1644
Comments
Original Redmine Comment Possibly something to do with casez expansion?
|
Original Redmine Comment Yes, this is it. Re-coding the above as a for loop works-round the issue. |
Original Redmine Comment The above function emits an impressive 11MB of C++ on 4.022 compared to 14KB with 4.020 :) |
Original Redmine Comment Wow, that's an unanticipated consequence! Can you try reverting the define in #�? If that is it, I suspect it would still have done stupid code in earlier versions if you have a 10-bit encoder instead, you just happened to cross the 16-bit line. Assuming that's it, it should know the instruction count is being increased and abort. A longer term ideal fix would be to recognize all the common "diagonal" 0/1 cases and convert to just a few instructions. |
Original Redmine Comment To your first point: Yes, simply reverting CASE_OVERLAP_WIDTH to 12 and we're back down to 14K If you are interested, the attached patch reverts CASE_OVERLAP_WIDTH to 12, and performs a new overlap check earlier (up to 64-bit input width regardless of CASE_OVERLAP_WIDTH). This probably renders some of the existing code which follows redundant, but I have left it all in place. I also think isCaseTreeFast would benefit from a test similar to the following to prevent the code explosion:
(Bit search cases will typically have m_caseItems == m_caseWidth) |
Original Redmine Comment Thanks for the patch, I'll check & merge it tonight. Can you also try out the CASE_SPARSENESS test to tune your performance? I think the comparison should be caseItems == width, or items == width+1 (as there might be a default). But maybe only enable that part of the expression if the width is say 5 or larger as small statements/widths will do better optimizing. |
Original Redmine Comment Sounds good. Will do. If I get time I might look at a patch to recognise CLZ, CTZ type casez constructs, e.g.
Then I guess you could convert the whole CASE into something like ASSIGN ( CLZ( VARREF ), VARREF ) where CLZ is a new Math Op that emits a call to __builtin_clz or similar? Do you think this would be worth it? Might be a fun self contained exercise either way! Could then possibly recognise a for loop based implementation too, and expand to include popcount … :) |
Original Redmine Comment Short/Medium term the patch you mention seems good. The really ideal long term improvement IMO would be to use a BDD or otherwise to make a function signature for N inputs, then map common signatures to fast implementations, this is one of many. E.g. it could even recognize low-level gates forming an adder. V3Table is an example of part of this concept, finding a series of inputs and making a truth table, it just doesn't make a signature and reduce it yet, which would replace making the table for known signatures. |
Original Redmine Comment As to the patch, it's O(n^2). I've seen real tables with thousands of entries, so this will effectively hang. It at least needs a size bound. |
Original Redmine Comment Yes, I did wonder about that, but thought it would be OK since the existing overlap check (although bounded) is also O(N^2) (actually O(N*M) where M=1<<Width, which must be >= N) Anyhow, it's easy to make it O(N) so I will do that (as it stands the patch takes 6s to check a 65,536 entry case) |
Been wrestling with this for a while, and after writing an O(N) overlap check which is 10,000x faster than my original lame effort, I came to the conclusion its not really worth including, as the work all has to be done again to populate m_value anyway, so its just adding effort to increase the width of the overlap check, which doesn't seem much of a value add. I've come to the conclusion that my immediate issue is best resolved by this very simple change (as discussed above) but doing this check at the end at least allows CASE_OVERLAP_WIDTH to remain at 16 Changing this: |
For reference, here is the patch for the wider overlap check. |
Thanks for your work, smaller patch pushed. |
Author Name: Julien Margetts
Original Redmine Issue: 1644 from https://www.veripool.org
For my design I am finding a very large runtime delta between Verilator 4.020 and either 4.022 or 4.024
I am seeing a 30x increase in bytes of C++ emitted, a 10x increase in compiled object size, and a 5x increase in verilator run time
Not ruling out user error, and I am trying to narrow down the area of the design which is blowing up, but in the meantime, are there any recent changes you can think of which may cause such an issue?
Thanks
The text was updated successfully, but these errors were encountered: