Author: Chris Raeuber, Vice President Engineering
When developing RTL code for a high-performance design, or really any design, whether for an ASIC or FPGA, it’s important to think about how best to structure the code. Synthesis can work wonders fine-tuning the translation from behavioral RTL to gates, given the proper constraints. But much can be done in the RTL itself, before synthesis, to make the logic simpler and faster, and in some cases to make the required clock rates achievable at all for a given technology.
Ever run a synthesis job that took hours and hours and wondered ‘why so long’? Long run times may be unavoidable in large, complex systems, but they can be mitigated by following some simple design rules and knowing when to apply them.
A recent design from a client we were working with employed the same standard RTL implementation everywhere inputs needed to be multiplexed onto a single output. This approach is fine in some situations, and fully functional in simulation. A mux is a mux, right? Well, behavioral simulations don’t consider timing in most cases. If the synthesis tool meets the speed requirements with nothing but default constraints, reworking muxes may not be needed, and you probably don’t need to keep reading.
But in today’s world everything wants to run faster! So what happens when you want to deliver an IP to a client? Let’s say they want to run a complex CPU core design at 900 MHz in a target 16 nm ASIC technology. At first thought this seems very achievable based on previous experience with other IP. Unfortunately, when you adjust the default constraints for the desired clock rate, perhaps setting some setup/hold constraints and timing uncertainty, and run your initial synthesis, you find large negative slack! What now?
Multiple possibilities exist to improve or fix unexpected delays in the design. One possibility is to review the constraint files, and provide some more specific direction to the synthesis tool. The tool can be instructed to concentrate on certain areas of the design to meet timing. This can be highly effective but also a big time and effort sink. The first thing that should be considered, which is discussed below, is the RTL implementation itself. The synthesis topic will require a whole other writeup.
There are many things to look at when reviewing RTL code. Some require deep thought and a thorough understanding of the design architecture and possible changes. If the reviewer is not the designer of the IP, a complete understanding of every detail may prove elusive and very time consuming. In reality, with today’s schedules there may not be time to sit back and study the intricacies and design choices of the original developer. Or the client may not allow time to do so but still wants effective results. This topic itself is loaded, because consultants have a common saying: “pay me now or pay me later”. That said, there are some straightforward things to look for in any RTL implementation that can help immensely.
One of those items is multiplexor implementation. This is a highly used construct in most designs.
Assume that, after synthesis, a multiplexor in a general sense is made up of AND and OR gates. Figure 1 shows a simple implementation of a two-input mux, which ends up having two levels of logic.
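That two-level AND–OR structure can be written directly in RTL. A minimal sketch, with illustrative signal names (sel, a, b, y are not from the design above):

```systemverilog
// Two-input mux as two levels of logic: an AND plane feeding an OR.
// y takes a when sel is 1, b when sel is 0.
assign y = (sel & a) | (~sel & b);
```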
A simple form of a four-input mux written in SystemVerilog is below.
always_comb begin
  if (variable1 == COMP_VAL1) begin
    result = input_A;
  end else if (variable2 == COMP_VAL2) begin
    result = input_B;
  end else if (variable3 == COMP_VAL3) begin
    result = input_C;
  end else begin
    result = input_D;
  end
end
It can be seen that there are four inputs to the mux: input_A, input_B, input_C, and input_D. One of them is chosen to drive the output result by testing variable1, variable2, and variable3 against the specific values COMP_VAL1, COMP_VAL2, and COMP_VAL3.
The important point is that when this construct is synthesized, the generated gates will prioritize input_A above the other choices. This creates multiple levels of logic, at least one for each selection.
To implement the code above you might draw the logic shown in Figure 2 after synthesis. With today’s synthesis tools you will likely see something else, unless your technology library is very simple and consists only of two-input gates. The thing to note is that, as described by the RTL, the multiplexor prioritizes input_A, then input_B, and so on. No matter how hard the synthesis tool tries to eliminate delay, it will always keep the priority structure in place. In this example, there are six levels of logic in the data path, which culminates with ‘result’ as an output.
There are many other ways to create multiplexor logic in Verilog. The same code from above can be written as an assign statement:
assign result = (variable1 == COMP_VAL1) ? input_A :
                (variable2 == COMP_VAL2) ? input_B :
                (variable3 == COMP_VAL3) ? input_C : input_D;
The assign statement still describes a priority-based mux. There may be good reasons for a priority mux, in which case the above styles of modeling work fine. There are other ways too, including case statements, but this write-up is not about the intricacies of case statements, which are many.
Here we are discussing how to make basic muxes faster, since they are ubiquitous in RTL designs. One reason to use a priority mux is that the selections are not exclusive; that is, more than one select line can be active at once. If so, a priority mux lets the first selection take priority. For instance, in the above example, “variable1 == COMP_VAL1” and “variable2 == COMP_VAL2” may both be true at the same time. With a priority mux, “variable1 == COMP_VAL1” wins and input_A drives the result; the mux priority logic masks input_B.
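If you believe the selects are mutually exclusive, it is worth proving it rather than assuming it. One way is a SystemVerilog assertion over the three compare results from the example above; this is a sketch, and the clock name clk is an assumption:

```systemverilog
// One-bit select conditions, as in the priority-mux example.
wire s1 = (variable1 == COMP_VAL1);
wire s2 = (variable2 == COMP_VAL2);
wire s3 = (variable3 == COMP_VAL3);

// $onehot0 passes when zero or one bit of its argument is set,
// so this fires only if two or more selects are active together.
assert property (@(posedge clk) $onehot0({s1, s2, s3}))
  else $error("mux selects are not mutually exclusive");
```

If the assertion never fires across simulation and formal runs, the parallel implementation discussed next is safe.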
If you find that only one select can ever be active at a time, then there is no need for a priority mux. In that case we can implement the above multiplexor differently and end up with fewer levels of logic and less delay through the data path of the multiplexor. Figure 3 shows a gate-level implementation of a non-priority, or parallel, four-to-one mux. It’s easy to see that the data path has been reduced from six levels of logic to three using standard gates. The time saved equates directly to the smaller number of gates the data must propagate through in the four-input multiplexor. Also note that in the priority implementation of Figure 2 the inputs do not see the same delays to the ‘result’ pin, while the parallel implementation, ignoring loading and other potential factors, can have more balanced timing paths.
Code for the parallel mux implementation can vary, and again case statements and pragmas can be used to allow a high level of modeling implementation.
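One common way to convey parallel intent is a case statement qualified with SystemVerilog’s unique keyword, which tells simulation and synthesis that exactly one item matches at a time. A sketch under that assumption, using one-bit compare results:

```systemverilog
// Select conditions, one bit each.
wire s1 = (variable1 == COMP_VAL1);
wire s2 = (variable2 == COMP_VAL2);
wire s3 = (variable3 == COMP_VAL3);

// 'unique case' asserts the items are mutually exclusive, letting
// synthesis build a parallel (non-priority) mux.
always_comb begin
  unique case (1'b1)
    s1:      result = input_A;
    s2:      result = input_B;
    s3:      result = input_C;
    default: result = input_D;
  endcase
end
```

Unlike the older full_case/parallel_case pragmas, the unique qualifier is also checked in simulation, so a violated exclusivity assumption shows up as a runtime warning rather than a silent synthesis mismatch.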
One non-case RTL implementation can be seen below; it is no more complex than the priority-mux RTL above.
Figure 3: A four-to-one mux implemented in a non-priority fashion
assign s1 = (variable1 == COMP_VAL1);
assign s2 = (variable2 == COMP_VAL2);
assign s3 = (variable3 == COMP_VAL3);
assign result = (s1 & input_A) | (s2 & input_B) |
                (s3 & input_C) | (~s1 & ~s2 & ~s3 & input_D);
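Note that the AND-OR form above works as written for one-bit data. If the inputs are buses, each one-bit select must be replicated across the data width so the AND masks every bit. A sketch assuming an illustrative parameter W for the bus width:

```systemverilog
// W-bit variant: replicate each 1-bit select across the data path.
parameter int W = 8;  // illustrative width

assign result = ({W{s1}} & input_A) |
                ({W{s2}} & input_B) |
                ({W{s3}} & input_C) |
                ({W{~s1 & ~s2 & ~s3}} & input_D);
```

The structure, and therefore the three levels of logic per bit, is unchanged; only the masking is widened.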
In summary, this write-up discussed the importance of knowing how, and when, to employ parallel vs priority-based multiplexors in designs where timing can be a critical issue to meet desired performance requirements. Obviously, the above example is simplistic, but the concept can be applied to many implementations large and small.
A recent project that we completed allowed our design and synthesis team to meet timing with a complex third-party IP by applying fixes similar to those described above to many of the multiplexors in the design. Without the modifications, the design was 10-15% too slow to meet a clock rate approaching 1 GHz. Other design modifications were also identified, and we will be discussing some of those in future papers. Stay tuned!