RTL for Performance:  Priority Muxes

Author: Chris Raeuber, Vice President Engineering

When developing RTL for a high-performance design, or really any design, whether for an ASIC or an FPGA, it’s important to think about how best to structure the code. Given proper constraints, synthesis can work wonders in fine-tuning the translation from behavioral RTL to gates. But much can be done in the RTL before synthesis to make it easier, faster, and in some cases possible at all, to achieve the required clock rate in a given technology.

Ever run a synthesis job that took hours and hours, and wondered ‘why so long’? This may be unavoidable in large and complex systems, but long run times can be mitigated by following some simple design rules and knowing when to apply them.

A client design we worked on recently used the same standard RTL style across the board whenever it needed to multiplex several inputs onto a single output. This approach is fine in some situations, and fully functional in simulation. A mux is a mux, right? Well, behavioral simulations generally don’t consider timing. If the synthesis tool meets your speed requirements easily, that is, if nothing beyond default constraints is needed to achieve the desired results, you probably don’t need to keep reading: reworking muxes may not be necessary.

But in today’s world everything wants to run faster! So what happens when you want to deliver an IP to a client? Let’s say they want to run a complex CPU core design at 900 MHz in a 16nm ASIC technology. At first thought this seems very achievable based on previous experience with other IP. Unfortunately, when you adjust the default constraints for the desired clock rate, perhaps setting some setup/hold constraints and timing uncertainty, and run your initial synthesis, you find large negative slack! What now?

Multiple possibilities exist to improve or fix unexpected delays in the design. One is to review the constraint files and give the synthesis tool more specific direction. The tool can be instructed to concentrate on certain areas of the design to meet timing. This can be highly effective, but also a big sink of time and effort. The first thing to consider, and the subject of the discussion below, is the RTL implementation itself. The synthesis topic will require a whole other writeup.

There are many things to look at when reviewing RTL code. Some require deep thought and a thorough understanding of the design architecture and of the changes that are possible. If the reviewer is not the designer of the IP, a complete understanding of every detail may prove elusive and very time-consuming. In reality, with today’s schedules there may not be time to sit back and study the intricacies and design choices of the original developer. Or the client may not allow time to do so, but still want effective results. This topic is itself loaded, because consultants have a common saying: “pay me now or pay me later”. That said, there are some straightforward things to look for in any RTL implementation that can help immensely.

One of those items is multiplexor implementation.  This is a highly used construct in most designs.

Assume that, after synthesis, a multiplexor is in a general sense made up of AND and OR gates. Figure 1 shows a simple implementation of a two-input mux, which ends up with two levels of logic.

Figure 1: A basic two-input multiplexor

A simple form of a four-input mux written in SystemVerilog is below.

always_comb begin
    // Conditions are tested in order, so earlier branches take priority
    if (variable1 == COMP_VAL1) begin
        result = input_A;
    end else if (variable2 == COMP_VAL2) begin
        result = input_B;
    end else if (variable3 == COMP_VAL3) begin
        result = input_C;
    end else begin
        // Default when none of the comparisons match
        result = input_D;
    end
end

It can be seen that there are four inputs to the mux: input_A, input_B, input_C, and input_D. One of them is assigned to the output, result, by testing whether variable1, variable2, and variable3 are equal to the specific values COMP_VAL1, COMP_VAL2, and COMP_VAL3.

The important point is that this construct, when synthesized, generates logic that prioritizes input_A above the other choices. This creates multiple levels of logic, at least one for each selection.

To implement the code above, you might draw the logic shown in Figure 2. With today’s synthesis tools you will likely see something else after synthesis, unless your technology library is very simple and consists only of two-input gates. The thing to note is that, as described by the RTL, the multiplexor is designed to prioritize input_A, then input_B, and so on. No matter how hard the synthesis tool tries to eliminate delay, it will always keep the priority structure in place. In this example, there are six levels of logic in the data path, which culminates with ‘result’ as an output.

Figure 2: A four-input priority-based multiplexor
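One way to see where the six levels come from is to write the same priority chain as three nested two-input muxes, each built from the AND-OR structure of Figure 1. This is only an illustrative sketch: the module name `priority_mux_gates`, the intermediate nets `stage2` and `stage3`, and the single-bit ports are assumptions made for the example, and a real synthesis tool is free to map the logic differently.

```systemverilog
// Illustrative sketch only: module name, net names, and 1-bit ports are assumptions.
// s1..s3 stand for the three comparison results from the RTL above.
module priority_mux_gates (
    input  logic s1, s2, s3,
    input  logic input_A, input_B, input_C, input_D,
    output logic result
);
    logic stage3, stage2;

    // Innermost 2:1 mux: input_C vs. input_D (one AND level, one OR level)
    assign stage3 = (s3 & input_C) | (~s3 & input_D);

    // Middle 2:1 mux: input_B vs. the stage below (two more levels)
    assign stage2 = (s2 & input_B) | (~s2 & stage3);

    // Outermost 2:1 mux: input_A wins whenever s1 is set (two more levels)
    assign result = (s1 & input_A) | (~s1 & stage2);
endmodule
```

Counting AND and OR gates only, input_C and input_D pass through six levels before reaching result, while input_A passes through two, so the data paths are also unbalanced.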

There are lots of other ways to create multiplexor logic with Verilog. The same code from above can be written as an assign statement:

assign result = (variable1 == COMP_VAL1) ? input_A :
                (variable2 == COMP_VAL2) ? input_B :
                (variable3 == COMP_VAL3) ? input_C : input_D;

The assign statement is still a priority-based mux. There may be good reasons for a priority mux, in which case the above styles of modeling work fine. There are other ways too, including case statements. But this writeup is not about the intricacies of case statements, which are many.

Here we are discussing how to make basic muxes faster, since they are ubiquitous in RTL designs.  One reason to use a priority mux is if the selections are not exclusive, that is, more than one select line can be active at once.  If that is true, then a priority mux allows the first selection to take priority.  For instance, in the above example, two of the selects may be active.  So maybe “variable1 == COMP_VAL1” and “variable2 == COMP_VAL2” are both true.  With a priority mux “variable1 == COMP_VAL1” will allow input_A to drive the result.  The mux priority logic will mask input_B.
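The masking behavior described above can be checked with a few lines of simulation. The following is a minimal sketch under stated assumptions: the testbench name `tb_priority_overlap`, the 8-bit data values, and the use of pre-computed select bits `s1`..`s3` in place of the three comparisons are all hypothetical choices made for the example.

```systemverilog
// Minimal simulation sketch; the testbench name and data values are assumptions.
module tb_priority_overlap;
    logic       s1, s2, s3;   // stand-ins for the three comparison results
    logic [7:0] input_A, input_B, input_C, input_D;
    logic [7:0] result;

    // Same priority structure as the conditional-operator form above
    assign result = s1 ? input_A :
                    s2 ? input_B :
                    s3 ? input_C : input_D;

    initial begin
        input_A = 8'hAA; input_B = 8'hBB; input_C = 8'hCC; input_D = 8'hDD;

        // Two selects active at once: the earlier test masks the later one
        s1 = 1'b1; s2 = 1'b1; s3 = 1'b0;
        #1;
        assert (result == input_A) else $error("expected input_A to win");

        // Drop the first select and input_B is no longer masked
        s1 = 1'b0;
        #1;
        assert (result == input_B) else $error("expected input_B");

        $finish;
    end
endmodule
```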

If you find that only one select can ever be active at a time, then there is no need for a priority mux. In that case we can implement the above multiplexor differently and end up with fewer levels of logic and less delay through its data path. Figure 3 shows a gate-level implementation of a non-priority, or parallel, four-to-one mux. It is easy to see that the data path has been reduced from six levels of logic to three using standard gates. The time saved corresponds directly to the smaller number of gates the data must propagate through in the four-input multiplexor. Also, in Figure 2 the data paths from the inputs do not all have the same delay to the ‘result’ pin, while the parallel implementation, ignoring loading and other potential factors, can have more balanced timing paths.

Code for the parallel mux implementation can vary, and again case statements and pragmas can be used to allow a high level of modeling implementation.
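As one example of the case-based route, SystemVerilog’s `unique` keyword declares that no two case items will match at the same time. The sketch below is an assumption about how that might look for this mux, not code from the design discussed here: the module name `parallel_mux_case`, the `WIDTH` parameter, and the pre-computed select ports `s1`..`s3` are hypothetical. How a given synthesis tool makes use of `unique` varies, so check its documentation before relying on it.

```systemverilog
// Hypothetical sketch: module name, WIDTH, and select ports are assumptions.
module parallel_mux_case #(
    parameter int WIDTH = 8
) (
    input  logic             s1, s2, s3,
    input  logic [WIDTH-1:0] input_A, input_B, input_C, input_D,
    output logic [WIDTH-1:0] result
);
    always_comb begin
        // "Reverse case": the constant 1'b1 is compared against each select.
        // 'unique' states that no two items match at once, and simulators
        // report a violation if that ever happens.
        unique case (1'b1)
            s1:      result = input_A;
            s2:      result = input_B;
            s3:      result = input_C;
            default: result = input_D;  // no select active
        endcase
    end
endmodule
```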

One non-case RTL implementation can be seen below. It is no more complex than the RTL implementation of the priority mux above.

Figure 3: A four-to-one mux implemented in a non-priority fashion

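// One select per comparison; at most one may be true at a time
assign s1 = (variable1 == COMP_VAL1);
assign s2 = (variable2 == COMP_VAL2);
assign s3 = (variable3 == COMP_VAL3);

// AND-OR form. As written, this assumes 1-bit data inputs: a 1-bit select
// ANDed with a wider bus gates only bit 0, so replicate the select for buses.
assign result = (s1 & input_A) | (s2 & input_B) |
                (s3 & input_C) | (~s1 & ~s2 & ~s3 & input_D);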
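A quick way to gain confidence in this kind of rewrite is to simulate the priority and parallel forms side by side. The sketch below is an example under stated assumptions rather than a definitive flow: the testbench name `tb_mux_equiv`, the `WIDTH` constant, the data values, and the net names `prio` and `par` are hypothetical. It uses replication, `{WIDTH{s1}}`, so that each 1-bit select gates the full bus, and it checks that the two forms agree whenever at most one select is active.

```systemverilog
// Hypothetical self-checking sketch; names, WIDTH, and data values are assumptions.
module tb_mux_equiv;
    localparam int WIDTH = 8;

    logic             s1, s2, s3;
    logic [WIDTH-1:0] input_A, input_B, input_C, input_D;
    logic [WIDTH-1:0] prio, par;

    // Priority form
    assign prio = s1 ? input_A :
                  s2 ? input_B :
                  s3 ? input_C : input_D;

    // Parallel AND-OR form; replication widens each 1-bit select to the bus width
    assign par = ({WIDTH{s1}} & input_A) |
                 ({WIDTH{s2}} & input_B) |
                 ({WIDTH{s3}} & input_C) |
                 ({WIDTH{~s1 & ~s2 & ~s3}} & input_D);

    initial begin
        input_A = 8'h11; input_B = 8'h22; input_C = 8'h44; input_D = 8'h88;

        // Walk all eight select combinations
        for (int i = 0; i < 8; i++) begin
            {s1, s2, s3} = i[2:0];
            #1;
            if ($countones({s1, s2, s3}) <= 1) begin
                // Exclusive (or no) select: both forms must agree
                assert (par == prio) else $error("mismatch with exclusive selects");
            end else begin
                // Overlapping selects: the parallel form ORs the selected inputs
                assert (par != prio) else $error("expected the forms to differ");
            end
        end
        $finish;
    end
endmodule
```

With these data values the two forms differ for every overlapping combination, which is a reminder that the parallel form is only a safe replacement when the selects are truly exclusive.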

In summary, this write-up discussed the importance of knowing how, and when, to employ parallel vs priority-based multiplexors in designs where timing can be a critical issue to meet desired performance requirements.  Obviously, the above example is simplistic, but the concept can be applied to many implementations large and small. 

On a recently completed project, our design and synthesis team met timing on a complex third-party IP by applying fixes similar to those described above to many of the multiplexors in the design. Without the modifications, the design was 10-15% too slow to meet a clock rate approaching 1 GHz. Other design modifications were also identified, and we will be discussing some of those in future papers. Stay tuned!

XtremeEDA is an experienced partner you can trust!!
