Inferring DSP48 slices in Verilog

The full DSP48E1 primitive has so many ports that Xilinx has started providing “macro” primitives for common DSP48 uses.  The DSP48E1 in the Virtex-6 has a lot of advanced features.  XST supports inferring multipliers, and there are even some code examples, but I didn’t see any that really cover the advanced cases.

One issue with inferring DSP slices is that the code must be written in such a way that it will actually map to the DSP slice correctly.  This means imposing some constraints on the way the code is written.  For example, nodes internal to the DSP slice can’t be used outside of the DSP slice unless extra hardware is generated.  I also figured this would give me a chance to brush up on my Verilog skills.

The first thing to note is that the DSP slices use signed math.  Thus signals should be declared as “reg signed” where possible.  Unsigned math is also possible, but will map to DSP slices less optimally.

always @ (posedge clk) begin
  a_d <= A;
  b_d <= B;
  c_d <= C;
  m   <= a_d * b_d;
  p   <= m + c_d;           // works
  //p   <= p + c_d;         // works
  //p   <= p + m;           // works
  //p   <= (p >>> 17) + m;  // works
end
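
For anyone who wants to try this, here is a minimal sketch of the fragment above wrapped in a complete module.  The module name, port names, and widths are assumptions chosen to match the 25x18 multiplier and 48-bit adder of the DSP48E1; they are not taken from the original test.

```verilog
// Hypothetical wrapper around the multiply-add fragment.
// Name, ports, and widths are illustrative assumptions.
module dsp_mult_add (
  input  wire               clk,
  input  wire signed [24:0] A,   // 25-bit multiplier input
  input  wire signed [17:0] B,   // 18-bit multiplier input
  input  wire signed [47:0] C,   // 48-bit adder input
  output wire signed [47:0] P
);
  // Everything is declared signed so the arithmetic stays signed end to end.
  reg signed [24:0] a_d;
  reg signed [17:0] b_d;
  reg signed [47:0] c_d;
  reg signed [42:0] m;           // 25 + 18 = 43-bit product
  reg signed [47:0] p;

  always @(posedge clk) begin
    a_d <= A;                    // input registers
    b_d <= B;
    c_d <= C;
    m   <= a_d * b_d;            // multiplier register
    p   <= m + c_d;              // output register; m is sign-extended to 48 bits
  end

  assign P = p;
endmodule
```

Note that C passes through one register before the adder while A and B pass through two, so in this sketch P combines the product of inputs from two cycles earlier with the C value from one cycle earlier.  If the operands need to line up, C would have to be delayed by one more cycle upstream.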

I did a quick test to see what opmodes I could infer.  I was pleasantly surprised to find that (P>>>17) was picked up as opmode 110xxxx.  Multipliers, addition, and accumulation were also easy.  Unfortunately, dynamically changing opmodes was hit-or-miss.  The tools preferred to use the C port without its register.

always @ (posedge clk) begin
  a_d <= A;
  b_d <= B;
  c_d <= C;
  m   <= a_d * b_d;
  if (sel == 1'b0) begin
    p <= p + m;
  end else begin
    p <= p + 0;             // dynamic opmode
    // p <= p - m;          // dynamic alumode
    // p <= p + c_d;        // moves C reg into fabric.  Uses mux.
    // p <= (p >>> 17) + m; // uses C port for (p >>> 17)
  end

end
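
As a rough sketch, the accumulate-or-hold case could be written as a complete module like the one below.  The module name, port names, widths, and the power-up initial value on p are assumptions, and whether a given tool version maps the select onto a dynamic opmode rather than a fabric mux is not guaranteed.

```verilog
// Hypothetical accumulate/hold wrapper for the dynamic-opmode experiment.
// Name, ports, and widths are illustrative assumptions.
module dsp_mac_hold (
  input  wire               clk,
  input  wire               sel,  // 0: accumulate, 1: hold
  input  wire signed [24:0] A,
  input  wire signed [17:0] B,
  output wire signed [47:0] P
);
  reg signed [24:0] a_d;
  reg signed [17:0] b_d;
  reg signed [42:0] m;
  // Assumes the tool honors register initial values at configuration.
  reg signed [47:0] p = 48'sd0;

  always @(posedge clk) begin
    a_d <= A;
    b_d <= B;
    m   <= a_d * b_d;
    if (sel == 1'b0) begin
      p <= p + m;                 // accumulate the product
    end else begin
      p <= p + 0;                 // hold: same adder, zero operand
    end
  end

  assign P = p;
endmodule
```

Writing the hold branch as "p + 0" rather than a plain "p <= p" keeps both branches in the same adder form, which is the shape the snippet above uses to coax out an opmode change.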

One of the more useful DSP slice features is the P-chain.  Each P output can connect directly to the DSP slice above it.  This provides dedicated routing for a large number of signals that would otherwise have to go through the fabric.  I was curious to see if a multiply-add-add was possible using 2 DSP slices, and how the tools would handle the routing.  It was nice to see that it used the PCOUT/PCIN connections correctly.

  always @(posedge clk) begin
    // DSP tile 1
    a_d <= A;
    b_d <= B;
    m   <= a_d * b_d;
    p   <= m + pc;

    // DSP tile 2
    ca_d <= Ca;
    cb_d <= Cb;
    pc   <= ca_d + cb_d;
  end
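
A self-contained version of the two-slice test might look like the sketch below.  The module name, port names, and widths are assumptions.  The comments about which slice each statement lands in reflect the intent of the code, not a guarantee about what the mapper will do.

```verilog
// Hypothetical two-slice multiply-add-add using the P cascade.
// Name, ports, and widths are illustrative assumptions.
module dsp_cascade (
  input  wire               clk,
  input  wire signed [24:0] A,
  input  wire signed [17:0] B,
  input  wire signed [47:0] Ca,
  input  wire signed [47:0] Cb,
  output wire signed [47:0] P
);
  reg signed [24:0] a_d;
  reg signed [17:0] b_d;
  reg signed [42:0] m;
  reg signed [47:0] p;
  reg signed [47:0] ca_d, cb_d;
  reg signed [47:0] pc;           // intended to travel over PCOUT/PCIN

  always @(posedge clk) begin
    // DSP tile 1: multiply, then add the cascaded sum
    a_d <= A;
    b_d <= B;
    m   <= a_d * b_d;
    p   <= m + pc;

    // DSP tile 2: 48-bit add feeding the cascade
    ca_d <= Ca;
    cb_d <= Cb;
    pc   <= ca_d + cb_d;
  end

  assign P = p;
endmodule
```

For the cascade to be usable, pc must not be read anywhere else in the design.  As noted earlier, a node internal to the DSP slice that is also used outside it forces extra hardware.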

There are other DSP features, but these were the main ones that I’d wanted to see.  The conclusion from all of this is that inferring DSP slices is possible, but still isn’t a replacement for instantiating the primitives.  This becomes an issue mainly for low-area designs where a single DSP slice might be time-shared.  The inability to reliably infer dynamic opmodes is the limitation for this usage.  The other issue is that the synthesis tool can still choose sub-optimal implementations, such as moving the C input register out into fabric.  This can dramatically reduce the performance of the design.

(As always, Xilinx, Virtex-6, XST, etc. are trademarks and/or products of Xilinx.)
