## Thursday, March 8, 2012

### Parallelizable integer multiplication

At a recent interview with an Amazon, the hiring manager asked me about an algorithm to do integer multiplication without using any multiplication/division symbol in parallel computing.  The following algorithm was suggested to do integer multiplication without using any "MULT" instruction:

```int mult(int x, int y)
{
int z;

if ((x==0) || (y==0))
return 0;
if (x==1)
return y;

if (y==1)
return x;

z = mult(x, y>>1);
if (y % 2 == 0)
return z<<1;
else
return x + (z<<1);
}

```
My question came up, is this parallelizable?  My argument to the interviewer was that this is not a good approach if he is asking from parallel computation perspective (See Robertazzi's papers on ACM journals to know more detail about parallel computation)

When a recursive call is made, the parameters are pushed into the stack before calling each subsequence recursive call.  Involving stack in this manner is hardly parallelizable, because there is coupling of current result based on previous results.  One of the principle of parallel computing is to decouple two or more computation as much as we can.

A dig into highly-optimized assembly language of the above code:

```mult:
pushl %ebp
movl %esp, %ebp
subl \$24, %esp
movl %ebx, -8(%ebp)
movl %esi, -4(%ebp)
movl 12(%ebp), %ebx
movl 8(%ebp), %esi
testl %ebx, %ebx
je .L4
testl %esi, %esi
je .L4
cmpl \$1, %esi
je .L2
cmpl \$1, %ebx
.p2align 4,,3
je .L5
movl %ebx, %eax
movl %esi, (%esp)
sarl %eax
movl %eax, 4(%esp)
call mult
andl \$1, %ebx
je .L7
leal (%esi,%eax,2), %ebx
.L2:
movl %ebx, %eax
movl -4(%ebp), %esi
movl -8(%ebp), %ebx
leave
ret
.p2align 4,,10
.p2align 3
.L4:
xorl %ebx, %ebx
movl -4(%ebp), %esi
movl %ebx, %eax
movl -8(%ebp), %ebx
leave
ret
.p2align 4,,10
.p2align 3
.L7:
leal (%eax,%eax), %ebx
movl -4(%ebp), %esi
movl %ebx, %eax
movl -8(%ebp), %ebx
leave
ret
.p2align 4,,10
.p2align 3
.L5:
movl %esi, %ebx
movl -4(%ebp), %esi
movl %ebx, %eax
movl -8(%ebp), %ebx
leave
ret

```

As can be seen, "SARL" (Shift Arithmetic Left) for the shift-left operation.  Some issues after analyzing this code.  First, there are multiple register-to-memory flow while doing the recursive:

```movl %ebx, %eax
movl %esi, (%esp)
sarl %eax
movl %eax, 4(%esp)
```

Memory access is expensive and the instructions might cause cache miss thus forcing CPU to reread data from RAM.
Another issue, how do we parallelize this, if there is stack push-pop coupling for the result?

My argument was to decompose the computation into two (or more) portions.

For example:

```int pivot = b/2;

for(i=0; i<pivot; i++)
sum1 += a;

for(i=pivot+1; i<b; i++)
sum2 += a;
sum = sum1 + sum2;

```

No, let's check what we can see in the x86 assembly (gcc-generated):

```mult:
pushl %ebp
movl %esp, %ebp
subl \$8, %esp
movl 12(%ebp), %eax
movl %ebx, (%esp)
movl %esi, 4(%esp)
movl 8(%ebp), %edx
testl %eax, %eax
je .L5
testl %edx, %edx
je .L5
cmpl \$1, %edx
je .L2
cmpl \$1, %eax
.p2align 4,,3
je .L6
movl %eax, %ecx
xorl %esi, %esi
shrl \$31, %ecx
xorl %ebx, %ebx
sarl %ecx
testl %ecx, %ecx
jle .L3
movl %ecx, %esi
movl %ecx, %ebx
imull %edx, %esi
.L3:
xorl %ecx, %ecx
cmpl %ebx, %eax
jle .L4
movl %eax, %ecx
subl %ebx, %ecx
imull %edx, %ecx
.L4:
leal (%ecx,%esi), %eax
.L2:
movl (%esp), %ebx
movl 4(%esp), %esi
leave
ret
.p2align 4,,10
.p2align 3
.L5:
xorl %eax, %eax
movl (%esp), %ebx
movl 4(%esp), %esi
leave
ret
.p2align 4,,10
.p2align 3
.L6:
movl %edx, %eax
jmp .L2
```

As can be seen, most of operations are in registers, also there is decoupling between 2 partitions (e.g, sum1 does not rely on sum2). The more we partition the calculation (e.g, pivot1 = b/4, and do 4 for-next loops for each partition), the more divisible partition can be accomplished by parallel computer. I was arguing that to optimize computation, hence lowering O(n), we should use divisible loops ("divide-et-empera") instead of recursive approach (seems the interviewer disagreed)

PS: My argument above caused me not to land job at Amazon.