#include <stdint.h>

uint32_t count_bits(uint32_t b)
{
    int i;
    int n = 0;
    /* Test the low bit, then shift right, once for each bit of b. */
    for (i = 0; i < sizeof(b) * 8; i++) {
        n += (b & 1);
        b >>= 1;
    }
    return n;
}
I also told him that there are many ways to do this (and referred him to "Beautiful Code" to see some of the beautiful algorithms in it).
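One well-known alternative, for example, is Kernighan's method (a sketch I am adding here for illustration, not what I wrote on the whiteboard; the function name is mine):

#include <stdint.h>

/* Kernighan's method: b & (b - 1) clears the lowest set bit, so the
   loop body runs once per set bit rather than once per bit position. */
uint32_t count_bits_kernighan(uint32_t b)
{
    uint32_t n = 0;
    while (b) {
        b &= b - 1;   /* drop the lowest 1 bit */
        n++;
    }
    return n;
}

For values with only a few bits set, this iterates far fewer than 32 times.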
But then the interviewer argued. First, he said the upper bound of the for-loop should include "sizeof(b)*8" (meaning that instead of "<" I should have used "<="). Too bad I said yes to him without thinking further (I probably got nervous; later at home, I checked and confirmed he was wrong). Second, he said the fastest way is a look-up table (n[0] = 0, n[1] = 1, n[2] = 1, ..., n[255] = 8, ...). I agreed with that argument, but it is inconsistent with his original question and has a big downside. I told him a table wouldn't be a choice in embedded programming, where memory is limited. Moreover, his original question was to count the ones in a 32-bit number; a direct table for that would need 2^32 (4,294,967,296) entries, about 4 GBytes, all set up in an array. That's ridiculous. Who wants to waste that much space just to count the bits in an integer? For a byte the table seems acceptable ("only" 256 bytes), but compare that to the algorithm above:
    movl    $32, %edx          # loop counter: 32 bits to examine
    xorl    %eax, %eax         # n = 0 (b is assumed to be in %ecx)
.L2:
    movl    %ecx, %ebx
    shrl    %ecx               # b >>= 1
    andl    $1, %ebx           # isolate the low bit
    addl    %ebx, %eax         # n += low bit
    decl    %edx
    jne     .L2

It's only 8 x86 instructions (about 12-18 bytes) and no memory access inside the loop! Yes, it is O(32) for 32 bits, but that is a constant cost. Even better, if the function is called only once, the fully optimized instructions can be inserted inline at the call site (without a function call), so we save the CPU time of pushing and popping a stack frame. That's another tiny speedup. Table look-up is probably fine (yes, it's the fastest, O(1)), but it is only practical for 8 bits; already at 16 bits the table would need 65,536 entries, so at 16 bits and above the space-time trade-off seems too big to accept. Besides, the algorithm above can be wrapped polymorphically in object-oriented code.
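To be fair to the table idea, here is a minimal sketch (my own illustration; the table and function names are hypothetical) of how an 8-bit table is usually applied to a 32-bit word: fill the 256-byte table once at startup, then count with four look-ups, one per byte:

#include <stdint.h>

static uint8_t bit_table[256];   /* bit_table[i] = number of 1 bits in byte i */

/* Fill the table once at startup instead of writing 256 entries by hand:
   popcount(i) = (i & 1) + popcount(i / 2). */
void init_bit_table(void)
{
    int i;
    for (i = 1; i < 256; i++)
        bit_table[i] = (uint8_t)((i & 1) + bit_table[i / 2]);
}

/* Count bits in a 32-bit word with four table look-ups. */
uint32_t count_bits_table(uint32_t b)
{
    return bit_table[b & 0xFF] +
           bit_table[(b >> 8) & 0xFF] +
           bit_table[(b >> 16) & 0xFF] +
           bit_table[(b >> 24) & 0xFF];
}

Even so, those four look-ups touch memory on every call, which is exactly what the register-only loop above avoids.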
When I searched Google, I found the following (on a Stanford web page):
v = v - ((v >> 1) & 0x55555555);                     // reuse input as temporary
v = (v & 0x33333333) + ((v >> 2) & 0x33333333);      // temp
c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24;  // count
That compiles to instruction code like the following:
    shrl    %edx
    andl    $1431655765, %edx      # 0x55555555
    subl    %edx, %eax
    movl    %eax, %edx
    shrl    $2, %eax
    andl    $858993459, %edx       # 0x33333333
    andl    $858993459, %eax
    addl    %edx, %eax
    movl    %eax, %edx
    shrl    $4, %edx
    leal    (%edx,%eax), %eax
    andl    $252645135, %eax       # 0x0F0F0F0F
    imull   $16843009, %eax, %eax  # 0x01010101
    shrl    $24, %eax
(14 instructions, with no branches). This seems to be the fastest and most optimized version. It took me some time to understand it.
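To record what I worked out, here is the same snippet annotated (my own sketch; the wrapper name is mine). It is a SWAR-style count: each step adds neighboring bit-fields in parallel, doubling the field width, and the final multiply gathers the four byte sums into the top byte:

#include <stdint.h>

/* Annotated version of the Stanford snippet: a SWAR (SIMD-within-a-
   register) population count. Each step sums adjacent fields in place. */
uint32_t count_bits_swar(uint32_t v)
{
    /* 16 two-bit fields, each holding the bit count of its pair.
       (v >> 1) & 0x55555555 extracts the high bit of every pair;
       subtracting it from v leaves 00->0, 01->1, 10->1, 11->2. */
    v = v - ((v >> 1) & 0x55555555);

    /* 8 four-bit fields: add each pair of two-bit sums. */
    v = (v & 0x33333333) + ((v >> 2) & 0x33333333);

    /* 4 byte-wide sums: add each pair of four-bit sums; each byte is
       at most 8, so nothing carries between bytes. */
    v = (v + (v >> 4)) & 0x0F0F0F0F;

    /* Multiplying by 0x01010101 adds all four bytes into the top byte;
       the final sum is at most 32, so it fits there. */
    return (v * 0x01010101) >> 24;
}

No branches, no table, and no memory access at all.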
It seems the interviewer doesn't have much experience in embedded or microcontroller programming, or doesn't think much about optimization. Nevertheless, I failed an interview again. I am starting to think that many interview questions are not really meant to probe a candidate's ability to think, but to fish for the instant answer already in the interviewer's mind. I just hope big software companies like IBM, Microsoft, Google, or Apple don't fall into this kind of trap, and instead ask logical questions to learn whether a candidate can think systematically.
What a waste!