int mult(int x, int y) { int z; if ((x==0) || (y==0)) return 0; if (x==1) return y; if (y==1) return x; z = mult(x, y>>1); if (y % 2 == 0) return z<<1; else return x + (z<<1); }My question came up, is this parallelizable? My argument to the interviewer was that this is not a good approach if he is asking from parallel computation perspective (See Robertazzi's papers on ACM journals to know more detail about parallel computation)
When a recursive call is made, the parameters are pushed into the stack before calling each subsequence recursive call. Involving stack in this manner is hardly parallelizable, because there is coupling of current result based on previous results. One of the principle of parallel computing is to decouple two or more computation as much as we can.
A dig into highly-optimized assembly language of the above code:
mult: pushl %ebp movl %esp, %ebp subl $24, %esp movl %ebx, -8(%ebp) movl %esi, -4(%ebp) movl 12(%ebp), %ebx movl 8(%ebp), %esi testl %ebx, %ebx je .L4 testl %esi, %esi je .L4 cmpl $1, %esi je .L2 cmpl $1, %ebx .p2align 4,,3 je .L5 movl %ebx, %eax movl %esi, (%esp) sarl %eax movl %eax, 4(%esp) call mult andl $1, %ebx je .L7 leal (%esi,%eax,2), %ebx .L2: movl %ebx, %eax movl -4(%ebp), %esi movl -8(%ebp), %ebx leave ret .p2align 4,,10 .p2align 3 .L4: xorl %ebx, %ebx movl -4(%ebp), %esi movl %ebx, %eax movl -8(%ebp), %ebx leave ret .p2align 4,,10 .p2align 3 .L7: leal (%eax,%eax), %ebx movl -4(%ebp), %esi movl %ebx, %eax movl -8(%ebp), %ebx leave ret .p2align 4,,10 .p2align 3 .L5: movl %esi, %ebx movl -4(%ebp), %esi movl %ebx, %eax movl -8(%ebp), %ebx leave ret
As can be seen, "SARL" (Shift Arithmetic Left) for the shift-left operation. Some issues after analyzing this code. First, there are multiple register-to-memory flow while doing the recursive:
movl %ebx, %eax movl %esi, (%esp) sarl %eax movl %eax, 4(%esp)
Memory access is expensive and the instructions might cause cache miss thus forcing CPU to reread data from RAM.
Another issue, how do we parallelize this, if there is stack push-pop coupling for the result?
My argument was to decompose the computation into two (or more) portions.
For example:
int pivot = b/2; for(i=0; i<pivot; i++) sum1 += a; for(i=pivot+1; i<b; i++) sum2 += a; sum = sum1 + sum2;
No, let's check what we can see in the x86 assembly (gcc-generated):
mult: pushl %ebp movl %esp, %ebp subl $8, %esp movl 12(%ebp), %eax movl %ebx, (%esp) movl %esi, 4(%esp) movl 8(%ebp), %edx testl %eax, %eax je .L5 testl %edx, %edx je .L5 cmpl $1, %edx je .L2 cmpl $1, %eax .p2align 4,,3 je .L6 movl %eax, %ecx xorl %esi, %esi shrl $31, %ecx xorl %ebx, %ebx addl %eax, %ecx sarl %ecx testl %ecx, %ecx jle .L3 movl %ecx, %esi movl %ecx, %ebx imull %edx, %esi .L3: xorl %ecx, %ecx cmpl %ebx, %eax jle .L4 movl %eax, %ecx subl %ebx, %ecx imull %edx, %ecx .L4: leal (%ecx,%esi), %eax .L2: movl (%esp), %ebx movl 4(%esp), %esi leave ret .p2align 4,,10 .p2align 3 .L5: xorl %eax, %eax movl (%esp), %ebx movl 4(%esp), %esi leave ret .p2align 4,,10 .p2align 3 .L6: movl %edx, %eax jmp .L2
As can be seen, most of operations are in registers, also there is decoupling between 2 partitions (e.g, sum1 does not rely on sum2). The more we partition the calculation (e.g, pivot1 = b/4, and do 4 for-next loops for each partition), the more divisible partition can be accomplished by parallel computer. I was arguing that to optimize computation, hence lowering O(n), we should use divisible loops ("divide-et-empera") instead of recursive approach (seems the interviewer disagreed)
PS: My argument above caused me not to land job at Amazon.