Sandbox, Bit Counting

Bit-counting is a rare operation. This makes it a good candidate for sandboxing — it isn’t clear what the best way to express the problem will be. This article looks at four different ways to express the bit-couting operation with unexpected results.

Chained Addition

The first method is the first-year student “for-loop”. This is the highest level method, and the one that would be preferred if it offered suitable performance. This method can be expressed as:

p_bc1 : process (clk) is
  variable tmp : integer range 0 to 32;
begin
 if rising_edge(clk) then
   tmp := 0;
   for ii in 0 to 31 loop
     if din_reg(ii) = '1' then
       tmp := tmp + 1;
     end if;
   end loop;
   dout_reg <= std_logic_vector(to_unsigned(tmp,6));
  end if;
end process;

A quick run for a Virtex-6 shows fairly poor results. A 32b bit-count limits the speed for most applications. The key results:

Max Clock: 140 MHz
LUTs Used: 173

This isn’t that surprising. In the past, I’ve found that turning on global optimizations in MAP has had a significant impact on this type of design. Sadly, Xilinx support ends one year after you get a dev-board, so I wasn’t able to test this.

Adder Tree

The second method is the very similar, but performs the operations in a tree. Interestingly enough, synthesis reports that it infers a “6 bit, 24 inputs adder tree”, which suggest that it is making use of its own optimizations.

Here is the code:

p_bc2 : process (clk) is
  type int1_array  is array (natural range <>) of integer range 0 to 1;
  type int2_array  is array (natural range <>) of integer range 0 to 2;
  type int4_array  is array (natural range <>) of integer range 0 to 4;
  type int8_array  is array (natural range <>) of integer range 0 to 8;
  type int16_array is array (natural range <>) of integer range 0 to 16;
  variable l0  : int1_array(0 to 31);
  variable l1  : int2_array(0 to 15);
  variable l2  : int4_array(0 to 7);
  variable l3  : int8_array(0 to 3);
  variable l4  : int16_array(0 to 1);
  variable tmp : integer range 0 to 32;
begin
  if rising_edge(clk) then
    for ii in 0 to 31 loop
      l0(ii) := to_integer(unsigned(din_reg(ii downto ii)));
    end loop;
    for ii in 0 to 15 loop
      l1(ii) := l0(2*ii) + l0(2*ii+1);
    end loop;
    for ii in 0 to 7 loop
      l2(ii) := l1(2*ii) + l1(2*ii+1);
    end loop;
    for ii in 0 to 3 loop
      l3(ii) := l2(2*ii) + l2(2*ii+1);
    end loop;
    for ii in 0 to 1 loop
      l4(ii) := l3(2*ii) + l3(2*ii+1);
    end loop;
    tmp := l4(0) + l4(1);
    dout_reg <= std_logic_vector(to_unsigned(tmp,6));
  end if;
end process;

In the Virtex-6, the results were significantly better. The 32b bit-count isn’t a lightweight operation, but can be reasonably done in one cycle for designs in the 100-200MHz range without too much worry.

Max Clock: 294 MHz
LUTs Used: 67

LUT6 Optimization

The third method makes use of knowledge of the Virtex-6 fabric. Six input LUTs exist in the FPGA, so it would seem reasonable to remove the first layers of the adder tree by using the LUT6’s to process 6b at a time. For 32b, this results in six small three-bit values that are added to get three terms, and finally the result.

p_bc3 : process (clk) is
  type int6_array  is array (natural range <>) of integer range 0 to 6;
  type int12_array is array (natural range <>) of integer range 0 to 12;
  function mklut( void : integer ) return int6_array is
    variable ret  : int6_array(0 to 63);
    variable ctmp : unsigned(5 downto 0);
    variable bc   : integer;
  begin
    for ii in 0 to 63 loop
      ctmp := to_unsigned(ii,6);
      bc  := 0;
      for jj in 0 to 5 loop
        if ctmp(jj) = '1' then
          bc := bc+1;
        end if;
      end loop;
      ret(ii) := bc;
    end loop;
    return ret;
  end function;
  constant LUT6   : int6_array(0 to 63) := mklut(0);
  variable l1     : int6_array(5 downto 0);
  variable l2     : int12_array(2 downto 0);
  variable tmp    : integer range 0 to 32;
  variable ext_in : std_logic_vector(35 downto 0) := (others => '0');
begin
  if rising_edge(clk) then
    tmp := 0;
    ext_in(31 downto 0) := din_reg;
    for ii in 0 to 5 loop
      l1(ii) := LUT6(to_integer(unsigned(ext_in(6*ii+5 downto 6*ii))));
    end loop;
    for ii in 0 to 2 loop
      l2(ii) := l1(2*ii) + l1(2*ii+1);
    end loop;
    tmp := l2(0) + l2(1) + l2(2);
    dout_reg <= std_logic_vector(to_unsigned(tmp,6));
  end if;
end process;

There might be some concern over the use of higher level features like “integer” in the above testing. This appears to be unfounded. A test of the LUT6 approach shows that replacing the integer types with the correctly sized unsigned values resulted in the same performance in both area and speed.

The results for both integer and unsigned coding styles was favorable, resulting in a smaller and faster design:

Max Clock: 319 MHz
LUTs Used: 43

Of course, this isn’t expected to scale up very well, it can’t be re-applied at each stage, so it can only trim of around 2 levels of the adder tree.

More LUT6 Optimization

The last method attempted here again uses more knowledge of the Virtex-6. The LUT6 above took six inputs and generated 3b values. The addition could also be performed using two 3b values with a LUT6.

function mkadd ( void : integer) return int12_array is
  variable ret : int12_array(0 to 63) := (others => 0);
begin
  for ii in 0 to 6 loop
    for jj in 0 to 6 loop
      ret(8*ii+jj) := ii+jj;
    end loop;
  end loop;
  return ret;
end function;

The last method isn’t the fastest, but it is smaller than the second best by a very small margin.

Max Clock: 292 MHz
LUTs Used: 41

The code for the full process isn’t shown here.

Conclusions

The conclusion here is once again that sandboxing is a valuable tool. It is possible to get better performance by utilizing some knowledge of the architecture, but it is certainly possible to go overboard and even to make detrimental choices that make things hard to read without adding any performance.

Even though this seems simple, it actually was error-prone when I first wrote it. I had mistakenly attempted to add two 1b unsigned values to get a 2b result. Such problems are hard to spot, and again remind that it is easy to get small details wrong when trying to write “tricky” code.

Also, while writing the code for the second LUT6 approach, I noticed that indexing a ROM/LUT using “8*x + y” resulted in extra logic vs casting x and y to 3b values and concatenating them. This is another reminder that EDA tools still have a long way to go.

Sandbox, Bit Counting

Chained Addition

Adder Tree

LUT6 Optimization

More LUT6 Optimization

Conclusions

Welcome

Categories

Recent Posts

Archives

Meta