## Friday, March 9, 2012

### Counting the number of "1" bits

Another "ridiculous" interview I had was when an interviewer asked me how to count the number of set bits in a 32-bit or 8-bit. I told him that there are various algorithms to do this, but the easiest way but yet good enough in performance is a function like this:

```uint32_t count_bits(uint32_t b)
{
int i;
int n = 0;

for(i=0; i<sizeof(b)*8; i++)
{
n += (b & 1);
b >>= 1;
}
return n;
}

```

I also told him that there are many ways to do this (referred him to look at "Beautiful Code" to see some of the beautiful algorithms).

But then the interviewer argued.  FIrst, he said the upper border of the for-loop should be to "sizeof(b)*8" (meaning, instead of using "<", I should have used "<="). To bad, I said yes to him without thinking further (Probably got nervous.  Later at home, I checked that he was wrong).  Second, he said that the fastest is to use look-up table (n[0] = 0, n[1]=1, n[2]=1,...,n[255]=8,...). Although I agreed with his argument, but then he become inconsistent with his original question and there is a big downside of his argument. I said that this wouldn't be a choice in embedded programming where memory resource is limited. Secondly, his original question was to compute ones in a 32-bit number. That means we have to 2^32 (4,294,967,296 or about 4 GBytes) items set manually in an array. That's so ridiculous. Who wants to waste that much space just to calculate the number of bits in an integer? For a byte, although it seems acceptable ("only" 256 bytes of table), but compare that to the above algorithm:

```movl \$32,  %edx
xorl %ecx, %ecx```
```.L2:
movl %ecx, %ebx
shrl %ecx
andl \$1,   %ebx
decl %edx
jne .L2

```
It's only 7 x86 mnemonic instructions (about 12-18 bytes) and no memory involvement in the loop!. Yes, it is O(32) for 32-bit, but it's a constant speed.  Even further, if the function is called only once, the fully optimized version of the instructions would be inserted inline in the place (without calling the function), so we can CPU time from doing push-pop stack frames. That's another tiny speedup.  It's probably fine for table lookup (yes, it's the fastest, O(1)), but it is limited to 8-bit.  For 16 or higher, seems the space-time trade-off is too big not to consider.  Besides, the above algorithm extendable into polymorphism in Object-Oriented code.

When I searched Google, I found the following (on Stanford's web):

```v = v - ((v >> 1) & 0x55555555);                    // reuse input as temporary
v = (v & 0x33333333) + ((v >> 2) & 0x33333333);     // temp
c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count```

Which renders to instruction codes like below:

```shrl %edx
andl \$1431655765, %edx
subl %edx, %eax
movl %eax, %edx
shrl \$2, %eax
andl \$858993459, %edx
andl \$858993459, %eax
movl %eax, %edx
shrl \$4, %edx
leal (%edx,%eax), %eax
andl \$252645135, %eax
imull \$16843009, %eax, %eax
shrl \$24, %eax
```

(14 mnemonics).  This seems the fastest and most optimized.  It took some time for me to understand it.

Seems the interviewer doesn't have enough experience in embedded or microcontroller  or doesn't think much about optimization. Nevertheless, I failed an interview again.   I start to think that many interview questions are really not to dig the thinking ability of a candidate, but rather to get instant answers meet their mind.  I just hope big software companies like IBM, Microsoft, Google or Apple shouldn't fall to this kind of trap, but rather ask logical questions to know if the candidate can think systematically.

What a waste!

#### 1 comment:

1. Hi

Tks very much for post:

I like it and hope that you continue posting.

Let me show other source that may be good for community.

Source: Costco interview questions

Best rgs
David