Bit-counting is a rare operation. This makes it a good candidate for sandboxing — it isn’t clear what the best way to express the problem will be. This article looks at four different ways to express the bit-couting operation with unexpected results.
Chained Addition
The first method is the first-year student “for-loop”. This is the highest level method, and the one that would be preferred if it offered suitable performance. This method can be expressed as:
p_bc1 : process (clk) is
variable tmp : integer range 0 to 32;
begin
if rising_edge(clk) then
tmp := 0;
for ii in 0 to 31 loop
if din_reg(ii) = '1' then
tmp := tmp + 1;
end if;
end loop;
dout_reg <= std_logic_vector(to_unsigned(tmp,6));
end if;
end process;
A quick run for a Virtex-6 shows fairly poor results. A 32b bit-count limits the speed for most applications. The key results:
- Max Clock: 140 MHz
- LUTs Used: 173
This isn’t that surprising. In the past, I’ve found that turning on global optimizations in MAP has had a significant impact on this type of design. Sadly, Xilinx support ends one year after you get a dev-board, so I wasn’t able to test this.
Adder Tree
The second method is the very similar, but performs the operations in a tree. Interestingly enough, synthesis reports that it infers a “6 bit, 24 inputs adder tree”, which suggest that it is making use of its own optimizations.
Here is the code:
p_bc2 : process (clk) is
type int1_array is array (natural range <>) of integer range 0 to 1;
type int2_array is array (natural range <>) of integer range 0 to 2;
type int4_array is array (natural range <>) of integer range 0 to 4;
type int8_array is array (natural range <>) of integer range 0 to 8;
type int16_array is array (natural range <>) of integer range 0 to 16;
variable l0 : int1_array(0 to 31);
variable l1 : int2_array(0 to 15);
variable l2 : int4_array(0 to 7);
variable l3 : int8_array(0 to 3);
variable l4 : int16_array(0 to 1);
variable tmp : integer range 0 to 32;
begin
if rising_edge(clk) then
for ii in 0 to 31 loop
l0(ii) := to_integer(unsigned(din_reg(ii downto ii)));
end loop;
for ii in 0 to 15 loop
l1(ii) := l0(2*ii) + l0(2*ii+1);
end loop;
for ii in 0 to 7 loop
l2(ii) := l1(2*ii) + l1(2*ii+1);
end loop;
for ii in 0 to 3 loop
l3(ii) := l2(2*ii) + l2(2*ii+1);
end loop;
for ii in 0 to 1 loop
l4(ii) := l3(2*ii) + l3(2*ii+1);
end loop;
tmp := l4(0) + l4(1);
dout_reg <= std_logic_vector(to_unsigned(tmp,6));
end if;
end process;
In the Virtex-6, the results were significantly better. The 32b bit-count isn’t a lightweight operation, but can be reasonably done in one cycle for designs in the 100-200MHz range without too much worry.
- Max Clock: 294 MHz
- LUTs Used: 67
LUT6 Optimization
The third method makes use of knowledge of the Virtex-6 fabric. Six input LUTs exist in the FPGA, so it would seem reasonable to remove the first layers of the adder tree by using the LUT6’s to process 6b at a time. For 32b, this results in six small three-bit values that are added to get three terms, and finally the result.
p_bc3 : process (clk) is
type int6_array is array (natural range <>) of integer range 0 to 6;
type int12_array is array (natural range <>) of integer range 0 to 12;
function mklut( void : integer ) return int6_array is
variable ret : int6_array(0 to 63);
variable ctmp : unsigned(5 downto 0);
variable bc : integer;
begin
for ii in 0 to 63 loop
ctmp := to_unsigned(ii,6);
bc := 0;
for jj in 0 to 5 loop
if ctmp(jj) = '1' then
bc := bc+1;
end if;
end loop;
ret(ii) := bc;
end loop;
return ret;
end function;
constant LUT6 : int6_array(0 to 63) := mklut(0);
variable l1 : int6_array(5 downto 0);
variable l2 : int12_array(2 downto 0);
variable tmp : integer range 0 to 32;
variable ext_in : std_logic_vector(35 downto 0) := (others => '0');
begin
if rising_edge(clk) then
tmp := 0;
ext_in(31 downto 0) := din_reg;
for ii in 0 to 5 loop
l1(ii) := LUT6(to_integer(unsigned(ext_in(6*ii+5 downto 6*ii))));
end loop;
for ii in 0 to 2 loop
l2(ii) := l1(2*ii) + l1(2*ii+1);
end loop;
tmp := l2(0) + l2(1) + l2(2);
dout_reg <= std_logic_vector(to_unsigned(tmp,6));
end if;
end process;
There might be some concern over the use of higher level features like “integer” in the above testing. This appears to be unfounded. A test of the LUT6 approach shows that replacing the integer types with the correctly sized unsigned values resulted in the same performance in both area and speed.
The results for both integer and unsigned coding styles was favorable, resulting in a smaller and faster design:
- Max Clock: 319 MHz
- LUTs Used: 43
Of course, this isn’t expected to scale up very well, it can’t be re-applied at each stage, so it can only trim of around 2 levels of the adder tree.
More LUT6 Optimization
The last method attempted here again uses more knowledge of the Virtex-6. The LUT6 above took six inputs and generated 3b values. The addition could also be performed using two 3b values with a LUT6.
function mkadd ( void : integer) return int12_array is variable ret : int12_array(0 to 63) := (others => 0); begin for ii in 0 to 6 loop for jj in 0 to 6 loop ret(8*ii+jj) := ii+jj; end loop; end loop; return ret; end function;
The last method isn’t the fastest, but it is smaller than the second best by a very small margin.
- Max Clock: 292 MHz
- LUTs Used: 41
The code for the full process isn’t shown here.
Conclusions
The conclusion here is once again that sandboxing is a valuable tool. It is possible to get better performance by utilizing some knowledge of the architecture, but it is certainly possible to go overboard and even to make detrimental choices that make things hard to read without adding any performance.
Even though this seems simple, it actually was error-prone when I first wrote it. I had mistakenly attempted to add two 1b unsigned values to get a 2b result. Such problems are hard to spot, and again remind that it is easy to get small details wrong when trying to write “tricky” code.
Also, while writing the code for the second LUT6 approach, I noticed that indexing a ROM/LUT using “8*x + y” resulted in extra logic vs casting x and y to 3b values and concatenating them. This is another reminder that EDA tools still have a long way to go.