
multicore support #1216

Closed
veripoolbot opened this issue Sep 19, 2017 · 8 comments

Labels: resolution: fixed (Closed; fixed) · type: feature-IEEE (Request to add new feature, described in IEEE 1800)

Comments



Author Name: John Coiner (@jcoiner)
Original Redmine Issue: 1216 from https://www.veripool.org

Original Assignee: Wilson Snyder (@wsnyder)


It would be nice if Verilator's output executed efficiently in parallel on multicore CPUs.

Single-thread performance stopped scaling at Moore's-law rates around 2005 and has scaled more slowly since. Around the same time, CPU core counts began to grow at a rate approaching Moore's law. Combined, these two effects have allowed aggregate MIPS and FLOPS to keep pace with Moore's law, provided software can take advantage of all the cores.

Gate counts in the largest chip designs are also scaling with Moore's law. If we simulate on a single thread, our simulations get slower each year as chips grow. If we could take advantage of parallelism, simulation performance should stay roughly stable over time ... and we'd enjoy a one-time ~10x boost in cycles-per-second in the transition to multi-core.

Teams working on large designs are already using multicore CPUs, typically with a separate single-threaded simulation on each core. (At least that was my experience in the industry up to 2015; has anything changed?) This saturates the CPU, but it's not ideal: latency to run each simulation is high, and the simulations thrash the cache. If there's any hope of fitting in L2/L3 cache, we're much more likely to achieve it with a single simulation spanning all cores. If that fits in cache, we could see super-linear scaling, where one N-threaded simulation achieves better throughput on the same machine than N single-threaded simulations.



Original Redmine Comment
Author Name: John Coiner (@jcoiner)
Original Date: 2017-09-19T18:41:11Z


Here's my thinking about how this could be done. Broadly, there are two aspects: we must partition the code into parallelizable segments, and we'll probably want to partition the data to minimize cache movement.

  1. Partitioning the code:

V3Order already builds a dependency graph of all logic vertices. We can partition this graph into "trains". A train is a sequence of logic operations in a single clock domain that executes sequentially on one core. Within any given clock domain, the trains themselves form a coarse dependency graph that we'll evaluate at runtime. At runtime, each train is an atom: its dependencies run, then the train runs, then trains that depend on it may run. Scheduling overhead will be low if the trains are large enough. We'll keep a fixed pool of threads ready to evaluate trains at runtime, to avoid the overhead of spawning and exiting threads.
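To make the runtime side concrete, here is a minimal sketch of such a dependency-graph follower, assuming a fixed pool of worker threads and a per-train pending-dependency counter; all names here (Train, TrainScheduler, trainEval) are invented for illustration and are not Verilator code. A worker that finishes a train decrements the counters of its dependents and queues any that reach zero.

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Train {
    std::function<void()> trainEval;   // emitted code for this train (hypothetical)
    std::vector<Train*> dependents;    // trains that consume this train's outputs
    int nDepsTotal = 0;                // number of upstream trains
    std::atomic<int> nDepsPending{0};  // reset to nDepsTotal each eval pass
};

class TrainScheduler {
public:
    explicit TrainScheduler(unsigned nThreads) {
        for (unsigned i = 0; i < nThreads; ++i)
            m_workers.emplace_back([this] { workerLoop(); });
    }
    ~TrainScheduler() {
        { std::lock_guard<std::mutex> lk(m_mutex); m_stop = true; }
        m_cv.notify_all();
        for (auto& t : m_workers) t.join();
    }
    // Run one eval pass over all trains of a clock domain; returns when all have run.
    void evalDomain(std::vector<Train*>& trains) {
        m_remaining.store(trains.size());
        for (Train* tp : trains) tp->nDepsPending.store(tp->nDepsTotal);
        for (Train* tp : trains)
            if (tp->nDepsTotal == 0) enqueue(tp);  // roots are ready immediately
        std::unique_lock<std::mutex> lk(m_mutex);
        m_doneCv.wait(lk, [this] { return m_remaining.load() == 0; });
    }
private:
    void enqueue(Train* tp) {
        { std::lock_guard<std::mutex> lk(m_mutex); m_ready.push(tp); }
        m_cv.notify_one();
    }
    void workerLoop() {
        for (;;) {
            Train* tp;
            {
                std::unique_lock<std::mutex> lk(m_mutex);
                m_cv.wait(lk, [this] { return m_stop || !m_ready.empty(); });
                if (m_stop) return;
                tp = m_ready.front();
                m_ready.pop();
            }
            tp->trainEval();  // the train is an atom: run it to completion
            // Release downstream trains whose last dependency just finished.
            for (Train* dep : tp->dependents)
                if (dep->nDepsPending.fetch_sub(1) == 1) enqueue(dep);
            if (m_remaining.fetch_sub(1) == 1) {
                std::lock_guard<std::mutex> lk(m_mutex);
                m_doneCv.notify_all();
            }
        }
    }
    std::vector<std::thread> m_workers;
    std::queue<Train*> m_ready;
    std::mutex m_mutex;
    std::condition_variable m_cv, m_doneCv;
    std::atomic<std::size_t> m_remaining{0};
    bool m_stop = false;
};
```

The top-level eval() would call evalDomain() once per triggered clock domain, with each Train::trainEval pointing at the emitted per-train function.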

Here's how the partitioning might work:

  • Start with a train per logic vertex.
  • Merge the trains:
    1. First, merge adjacent trains in the same domscope. Link up as much as possible from a given domscope, with no maximum size. This should keep V3Combine viable. ("No maximum size" means we might end up with very large trains that limit how much parallelism is available. My guess is that most designs have so many layers of repeating modules that this is unlikely to cause real trouble.)
    2. Second, in each domain, look across scopes for adjacent trains that are too small. Try to merge these as well, to reach a target average train size. (A rough sketch of this second merge phase follows the list.)
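Here is a rough sketch of that second merge phase, under the assumption that each train carries an estimated execution cost and that a fusion is only legal if it cannot create a cycle in the coarse train graph. The names (TrainVtx, fuse, mergeSmallTrains) are invented; this is not the V3Order implementation.

```cpp
#include <cstdint>
#include <set>
#include <vector>

struct TrainVtx {
    uint64_t cost = 0;              // estimated execution cost of this train
    std::set<TrainVtx*> ins, outs;  // coarse dependency edges
};

// True if 'to' is reachable from 'from' without using the direct edge from->to.
// Fusing across such a pair would create a cycle in the coarse graph.
static bool reachableIndirectly(TrainVtx* from, TrainVtx* to) {
    std::vector<TrainVtx*> stack;
    std::set<TrainVtx*> seen;
    for (TrainVtx* s : from->outs)
        if (s != to) stack.push_back(s);
    while (!stack.empty()) {
        TrainVtx* v = stack.back();
        stack.pop_back();
        if (v == to) return true;
        if (!seen.insert(v).second) continue;
        for (TrainVtx* s : v->outs) stack.push_back(s);
    }
    return false;
}

// Fuse 'smallp' into its direct successor 'bigp': edges and cost move to 'bigp'.
static void fuse(TrainVtx* smallp, TrainVtx* bigp) {
    bigp->cost += smallp->cost;
    bigp->ins.erase(smallp);
    for (TrainVtx* in : smallp->ins) {
        if (in == bigp) continue;
        in->outs.erase(smallp); in->outs.insert(bigp); bigp->ins.insert(in);
    }
    for (TrainVtx* out : smallp->outs) {
        if (out == bigp) continue;
        out->ins.erase(smallp); out->ins.insert(bigp); bigp->outs.insert(out);
    }
    smallp->ins.clear(); smallp->outs.clear(); smallp->cost = 0;
}

// One greedy pass: fuse each undersized train into a successor when legal.
static void mergeSmallTrains(std::vector<TrainVtx*>& trains, uint64_t targetCost) {
    for (TrainVtx* tp : trains) {
        if (tp->cost == 0 || tp->cost >= targetCost) continue;
        std::set<TrainVtx*> succs(tp->outs);  // copy: fuse() mutates tp->outs
        for (TrainVtx* succ : succs) {
            if (!reachableIndirectly(tp, succ)) { fuse(tp, succ); break; }
        }
    }
}
```

A real pass would presumably also consider fusing into predecessors and weight its choices by the critical path, but the cost/legality structure would look similar.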

Once the trains are partitioned, we set up CFuncs for each train. This will be similar to the existing processMove...() routines, right down to the order in which code is emitted (depth-first, for best dcache locality). Except, instead of calling the CFuncs from a serial top-level eval routine, we'll emit a trainEval() routine to call the CFuncs within each train. The top-level eval() routine will dispatch trainEval()s into a runtime dependency-graph follower to keep threads fed with ready trains.

  2. Partitioning data

It would be nice to group the data together such that each train writes to a set of cache lines not written by any other train. That should minimize thrashing: while a train is writing its outputs, no other core should be reading them, as all consumers of this data should be waiting to run, and no other core should be writing the same cache lines either.

Right now, data is partitioned first by scope, and then by the size of the data type. That may intersperse trains in memory and cause thrashing.
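For illustration, here is a minimal sketch of the layout goal, assuming we are free to group each train's outputs into its own cache-line-aligned block; the struct and member names are invented, and this is not how Verilator currently declares model data. Aligning each block to the cache-line size guarantees no two trains ever write the same line, so there is no false sharing between their producing cores.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLineBytes = 64;

// Written only by train 3; aligned (and therefore sized) to its own cache line(s).
struct alignas(kCacheLineBytes) Train3Outputs {
    uint32_t pipe_stage_q;
    uint64_t result_bus;
};

// Written only by train 7.
struct alignas(kCacheLineBytes) Train7Outputs {
    uint8_t valid_q;
    uint32_t addr_q;
};

// Because each group occupies whole cache lines, readers of train 3's outputs
// never invalidate a line that train 7 is writing.
struct ModelData {
    Train3Outputs t3;
    Train7Outputs t7;
};
```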

It's not trivial to just sort the data within a scope by train: different instances of the same module might be segmented into trains differently, yet they share a class definition. We might handle that by segmenting the data into the lowest-common-denominator of the possible train-segmentations. (A more aggressive option is to actually emit different module classes per segmentation -- though that sounds like a difficult change, and it would reduce the viability of V3Combine. So maybe not that.)

  3. Crazy follow-on: Core affinity?

It may help to steer the same trains to the same threads (cores) for icache locality; I'm not sure. There are some options for doing this, but I haven't thought much about them. It's also possible that even the sum total of all cores' icaches cannot hold the working set, and we're doomed to stream instructions from L3, making multicore no better or worse than single-threaded execution in that respect.
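One concrete option on Linux would be to pin each pool thread to a fixed core and always hand a given train to the same thread. The sketch below uses the GNU pthread_setaffinity_np() extension; it only illustrates the mechanism and makes no claim that the affinity actually pays off.

```cpp
// Requires _GNU_SOURCE (defined by default with g++) for pthread_setaffinity_np.
#include <pthread.h>
#include <sched.h>
#include <thread>

// Pin a pool thread to one core so trains steered to it keep icache locality.
inline void pinThreadToCore(std::thread& t, int core) {
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(core, &cpuset);
    pthread_setaffinity_np(t.native_handle(), sizeof(cpu_set_t), &cpuset);
}

// Usage (hypothetical): pin worker i of the train pool to core i.
//   for (unsigned i = 0; i < workers.size(); ++i) pinThreadToCore(workers[i], i);
```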

  4. Crazy follow-on: Async eval()?

If a simulation alternates between a multicore eval() and a single-threaded testbench, maybe we can further accelerate it by having eval() generate its outputs and return early, while continuing to process internal state in the background, in parallel with the testbench code. That's aggressive, and I'm sure it breaks things (testbench access to internal signals, alignment of testbench prints and $display prints in the log, etc.). That said, you could imagine environments where speed isn't everything, it's the only thing; in which case that could be a nice optimization, unless there's a more fundamental reason why this doesn't work.

I'll start with (1); that will let us start testing and getting some experience. There's still time to think about (2) and beyond.



Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2017-09-19T23:06:12Z


First, a disclaimer for others who may come across this bug wanting to get their code to go faster: before considering multicore, make sure your code executes as fast as possible on a single core and, especially, is loop free; see the manual. This is also not to imply there isn't opportunity to improve single-threaded performance through additional internal optimizations in Verilator itself.

That out of the way, obviously multicore can provide great speedups. You might want to see the previous research done on this at http://www.veripool.org/issues/1216-Verilator-multicore-support#change-3943; that early work got a factor of 3 on 8 cores.

For past work in this area, see http://www.ecs.umass.edu/ece/labs/vlsicad/papers/ISVLSI_final_Tariq_B_Ahmad.pdf. I don't have the code, but it should be available if you contact the authors.

I'd suggest a first step 0: Verilator should (in theory) already work with multiple separately-Verilated modules, each running in a thread, with verilated.cpp compiled to be multithreaded. I'm not sure how well tested this is; adding a test to the regressions would be a good idea, and would help you establish the user-side framework and API before getting into the optimization details. This also suggests another possible middle step that has been previously requested: manually splitting a large design on module boundaries. Each segment would be Verilated separately, and a separate Verilator run done at the top. Each segment can then run on a core.
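For reference, a rough sketch of what that step 0 might look like from the user side, assuming two independently Verilated blocks with hypothetical model classes Vblock_a and Vblock_b (each with a clk port), no combinational paths crossing between them within a cycle, and verilated.cpp built thread-safe:

```cpp
#include <thread>
#include "Vblock_a.h"  // hypothetical, from: verilator --cc block_a.v
#include "Vblock_b.h"  // hypothetical, from: verilator --cc block_b.v

int main() {
    Vblock_a a;
    Vblock_b b;
    for (int cycle = 0; cycle < 1000; ++cycle) {
        // Evaluate each separately-Verilated model in its own thread.
        std::thread ta([&] { a.clk = 1; a.eval(); a.clk = 0; a.eval(); });
        std::thread tb([&] { b.clk = 1; b.eval(); b.clk = 0; b.eval(); });
        ta.join();
        tb.join();
        // ... exchange any signals crossing between the blocks here,
        //     single-threaded, before the next cycle ...
    }
    a.final();
    b.final();
    return 0;
}
```

A regression test along these lines would also pin down exactly which thread-safety guarantees verilated.cpp needs to provide.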

As to your other comments, the partitioning steps you outline seem reasonable. If this does a good cache-aware job, it should also speed up single-threaded execution, which would be great.

I'd suggest trying to convert what you want to do in V3Order into a different kind of graph, then optimizing that appropriately. There are pretty well-known graph-partitioning algorithms that should help. Try to minimize changes to the basic V3Order (ordering-related) graph code, as it's very hard to debug, and it's hard to get enough real-world test cases to stress changes appropriately.



Original Redmine Comment
Author Name: John Coiner (@jcoiner)
Original Date: 2017-09-20T00:50:45Z


Wilson, could you check your first link, after "previous research done on this"?



Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2017-09-20T01:40:13Z


As to previous splitting work, see "topic 412: modules into objects": https://www.veripool.org/boards/2/topics/412-Verilator-verilating-compiling-modules-into-separate-object-files

Are you working with Xavier Delacour? It seems a strange coincidence that he just started discussing a major patch to do exactly this: "topic 2348: verilator foreign models": https://www.veripool.org/boards/3/topics/2348-Verilator-Adding-foreign-module-support-to-Verilator



Original Redmine Comment
Author Name: John Coiner (@jcoiner)
Original Date: 2017-09-20T02:40:52Z


Thanks.

I wasn't aware of Xavier's work; it's a coincidence. Encryption was the exact stumbling block I ran into when evaluating Verilator on a design at AMD. A foreign module interface sounds like a nice way to get around that.



Original Redmine Comment
Author Name: John Coiner (@jcoiner)
Original Date: 2018-01-31T17:40:09Z


(deleted)



Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2018-02-02T02:34:45Z


I have created the "develop-v4" branch on veripool. The plan is that this feature will be pushed to that branch in pieces and, when complete, released as version 4.000.



Original Redmine Comment
Author Name: Wilson Snyder (@wsnyder)
Original Date: 2018-08-25T14:46:22Z


In git master towards 4.000.

veripoolbot added the "resolution: fixed" and "type: feature-IEEE" labels on Dec 22, 2019