# Fibonacci numbers module def fib(n): # write Fibonacci series up to n a, b = 0, 1 while b < n: #print b, a, b = b, a+b def fib2(n): # return Fibonacci series up to n result = [] a, b = 0, 1 while b < n: result.append(b) a, b = b, a+b print b return result # make it executable as a script as well as module_exit if __name__ == "__main__": import sys if (len(sys.argv) > 1): f = fib2(int(sys.argv[1])) str = "" print "Fibonacci sequence: ", f else: print sys.argv[0], ""
Sunday, March 11, 2012
Fibonacci in Python
Downloading book "Foundations of Computer Science"
This small python script would download all chapters in the book "Foundations of Computer Science".
#!/usr/bin/python import sys from subprocess import call # for 'call' urlbase = 'http://infolab.stanford.edu/~ullman/focs/' others = ['preface.pdf', 'toc.pdf', 'index.pdf'] chapters = range(1,15) links = [] links.extend(others) def download(index, f): url = urlbase + f print "file = ", index+1, url call(["wget", url]) for ch in chapters: links.append('ch' + '%02d' %(ch) + '.pdf') out = [download(index,obj) for index,obj in enumerate(links)] print "out=", out
Regular Expression in Python
The following is a Python snippet to do regular expression.
First it reads a file passed as an argument, the it goes to a for-loop for each line and do string regular-expression search
#!/usr/bin/python import sys # for sys.argv, etc. import re # regular expression if (len(sys.argv) > 1): filename = sys.argv[1] else: print sys.argv[0], "~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ readfile.py 17,3-9 All "readfile.py" 21L, 386C" sys.exit(0) with open(filename, 'r') as f: #f.read() for line in f: pattern = re.compile(r"print (.*)") match = pattern.search(line) if match: print "print is used to print '", match.group(1), "'"
Computer Design Fun Party!
- To compute expression A = (B-C)*(D-E)
Using accumulator:
Load B ; acc = B; opcode=1 B, oper=2B; mem.traf = 2 B
Sub C ; acc = B – C; 1 byte opcode, 2-bytes operand = 3 bytes
Store A ; A = B -C; 1 byte opcode, 2-bytes operand = 3 bytes
Load D ; acc = D; 1 byte opcode, 2-bytes operand = 3 bytes
Sub E ; acc = D -E; 1 byte opcode, 2-bytes operand = 3 bytes
Mul A ; acc = (B-C)* (D-E) ; 1 byte opcode, 2-bytes operand = 3 bytes
Store A ; A = (B-C) * (D-E) ; 1 byte opcode, 2-bytes operand = 3 bytes
Assume each opcode takes 8-bit, operands are 16-bit and data is move in 16-bit chunks:
Total bytes: 7 * 3 byte = 21 bytes
Memory traffic: 21 + 7 * 2 = 35 bytes
Using load-store:
Ld B, r1 ; r1 = B; opcode=1 B, op = 2 B; mem. traff = 2 bytes
Ld C, r2 ; r2 = C; memory traffic = 2 bytes
Sub r1, r2,r3 ; r3 = r1 * r2 = B – C; size=2 bytes, memory traffic = 0
Ld D, r1 ; r1 = D; memory traffic = 2 bytes
Ld E, r2 ; r2 = E; memory traffic = 2 bytes
Sub r1,r2,r4 ; r4 = r1 * r2 = D-E; memory traffic = 0 bytes
Mul r3, r4, r1 ; r1 = r3 * r4 = (B-C) * (D-E); memory traffic = 0 bytes
St r1,A ; A = r1 = (B-C) * (D-E); memory traffic = 2 bytes
Number of bytes for instructions: 3*7 = 21 bytes
Memory traffic: 21 + 5 *2 = 31 bytes
Using stack:
Push B ; mem. Traf = 2 bytes
Push C ; mem. Traf = 2 bytes
Sub ; B – C; mem. Traf. = 2 bytes
Push D ; mem traf = 2 bytes
Push E ; mem.traf = 2 bytes
Sub ; top of the stack contains D-E, below it is B-C
Mul ; top of the stack now contains (B-C) * (D-E)
Each instruction occupies 3 bytes (1 byte opcode and 2-byte operand), so total instructions use 3*7 = 21 bytes.
Memory traffic: 21 + 2*7 = 35 bytes
- Register 0, 1, 2 (R0, R1, R2) are GPRs. R3 is initialized to 1 and make sure it is unchanged.
- Control sequence that forms the 2’s complement difference (or, R0 ← R0 + (-R1) = R0 + 2’s complement of R1) of the contents of R0 and R1:
- Write EnableA-bus enablesB-bus enablesF0F1Time012301230123F0F1T010001000000110010001000001001100011001000002
Explanation:
At T=0:
F0F1 = 11 (A’), A-bus enable = 1, B-bus enable = all zeros. Data (say, Y) is placed in A-bus. C-bus contains Y’, Write-enable = pin 1. So Y is written to register 1 (R1 = Y’).
At T = 1:
F0F1 = 00 (add), A-bus enable = pin 1 (content of R1 which is X, is in A-bus), B-bus enable= pin 3 (content of R3 which is 1 is in B-bus), write-enable = pin 1 (C-bus contains Y’ + 1). So R1 = Y’ + 1 = 2’s compl. Of X.(or R1 = -Y)
At T = 2
B-bus = 0, hence data from R0, say X, is placed in B-bus. F0F1 = 00 (add), A-bus enable = 1 (which contains –Y). C-bus contains R0 + R1 = X - Y . Write-enable = pin 0, so result is stored in R0.
- R0 XOR R1 (R0’ & R1 + R0 & R1’) = (R0 + R1)(R0’ + R1’), leaving the result in R0.
Write Enable | A-bus enables | B-bus enables | F0F1 | Time | Note | ||||||||||
0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | 0 | 1 | 2 | 3 | F0 | F1 | T | (Movements etc.) |
0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | R2 ← A + B |
1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | R0 ← A’ |
0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 | R1 ← B’ |
0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 3 | R1 ← A’ + B’ |
1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 4 | R0 ← (A+B)&(A’+B’) |
- Booth Multiplication Algorithm:
/*********************************************************** Title: Booth Algorithm sample Desc: To simulate booth multiplication algorithm Author: Buhadram Notes: Ref: - Computer Architecture: A Quantitative Approach *************************************************************/ #include#include #include #include #define NUMTYPE short int #define NUMBITS (8*sizeof(NUMTYPE)) #define MAX(x,y) ((x) > (y) ? x : y) #define MIN(x,y) ((x) < (y) ? x : y) #define BITMASK(x,mask) ((x) & mask) /********************************* proc: bitmask desc: return 1 if bit[pos] is set pos=0 for LSB, and p=sizeof(x)-1 for MSB *********************************/ #define BIT(x,pos) ( (BITMASK(x,1<<(pos))) >> (pos) ) #define ASCII_BIT(x) ((x) + '0') char *get_input(const char *prompt, char *buf, int length) { char *p; int i; printf("%s",prompt); // fgets is safer to avoid buffer overflow p = fgets(buf, length, stdin); // remove newline character buf[strlen(buf)-1] = 0; return p; } /******************************************** * Swapbits * Desc: To swap bit orders, e.g LSB becomes MSB Note: b[0] is LSB, b[n-1] is MSB ********************************************/ void swapbits(NUMTYPE *b) { int i; int tmp; tmp = 0; for(i=0; i base) base = 10; /* can only use 0-9, A-Z */ tail = &buf[BUFSIZE - 1]; /* last character position */ *tail-- = '\0'; if (10 == base && N < 0L) { *head++ = '-'; uarg = -N; } else uarg = N; if (uarg) { for (i = 1; uarg; ++i) { register ldiv_t r; r = ldiv(uarg, base); *tail-- = (char)(r.rem + ((9L < r.rem) ? ('A' - 10L) : '0')); uarg = r.quot; } } else *tail-- = '0'; memcpy(head, ++tail, i); return str; } #endif /****************************** proc: btol desc: converts binary string to long integer ******************************/ long btol(const char *bin) { int n,i; long d; n = strlen(bin); d = 0; for(i=n-1; i>=0; i--) { d += (bin[i]-'0') * (1 << i); } return d; } /************************************** Proc: Ashift_bits_right desc: perform arithmetic shift right on bits in x, but preserving the sign bit. ***************************************/ void ashift_bits_right(long *x, int numbits) { int neg; neg = (*x<0) ? 1 : 0; *x >>= numbits; if (neg) //*x |= (1< >= 1; qm = q0; q >>= 1; if (a0) r |= b1; b1 <<= 1; } return r; } int main(int argc, char *argv[]) { #define BUF_SIZE (1024+1) char *buf; NUMTYPE A, B; long res; char str[256]; buf = (char *)malloc(BUF_SIZE+1); assert(buf); get_input("A = ", buf, BUF_SIZE); A = atol(buf); printf("A (in Binary) = %s\n", ltoa(A, str, 2)); get_input("B = ", buf, BUF_SIZE); B = atol(buf); printf("B (in Binary) = %s\n", ltoa(B, str, 2)); res = A+B; printf("A + B = %ld = 0x%0X\n", res, res); res = A - B; printf("A - B = %ld = 0x%0X\n", res, res); puts("\n\nBOOTH MULTIPLICATION\n"); res = booth_mult(A,B); printf("A * B = %ld = 0x%0X = %sb\n", res, res, ltoa(B,str,2)); free(buf); return 0; }
Friday, March 9, 2012
Counting the number of "1" bits
Another "ridiculous" interview I had was when an interviewer asked me how to count the number of set bits in a 32-bit or 8-bit. I told him that there are various algorithms to do this, but the easiest way but yet good enough in performance is a function like this:
I also told him that there are many ways to do this (referred him to look at "Beautiful Code" to see some of the beautiful algorithms).
But then the interviewer argued. FIrst, he said the upper border of the for-loop should be to "sizeof(b)*8" (meaning, instead of using "<", I should have used "<="). To bad, I said yes to him without thinking further (Probably got nervous. Later at home, I checked that he was wrong). Second, he said that the fastest is to use look-up table (n[0] = 0, n[1]=1, n[2]=1,...,n[255]=8,...). Although I agreed with his argument, but then he become inconsistent with his original question and there is a big downside of his argument. I said that this wouldn't be a choice in embedded programming where memory resource is limited. Secondly, his original question was to compute ones in a 32-bit number. That means we have to 2^32 (4,294,967,296 or about 4 GBytes) items set manually in an array. That's so ridiculous. Who wants to waste that much space just to calculate the number of bits in an integer? For a byte, although it seems acceptable ("only" 256 bytes of table), but compare that to the above algorithm:
When I searched Google, I found the following (on Stanford's web):
Which renders to instruction codes like below:
(14 mnemonics). This seems the fastest and most optimized. It took some time for me to understand it.
Seems the interviewer doesn't have enough experience in embedded or microcontroller or doesn't think much about optimization. Nevertheless, I failed an interview again. I start to think that many interview questions are really not to dig the thinking ability of a candidate, but rather to get instant answers meet their mind. I just hope big software companies like IBM, Microsoft, Google or Apple shouldn't fall to this kind of trap, but rather ask logical questions to know if the candidate can think systematically.
What a waste!
uint32_t count_bits(uint32_t b) { int i; int n = 0; for(i=0; i<sizeof(b)*8; i++) { n += (b & 1); b >>= 1; } return n; }
I also told him that there are many ways to do this (referred him to look at "Beautiful Code" to see some of the beautiful algorithms).
But then the interviewer argued. FIrst, he said the upper border of the for-loop should be to "sizeof(b)*8" (meaning, instead of using "<", I should have used "<="). To bad, I said yes to him without thinking further (Probably got nervous. Later at home, I checked that he was wrong). Second, he said that the fastest is to use look-up table (n[0] = 0, n[1]=1, n[2]=1,...,n[255]=8,...). Although I agreed with his argument, but then he become inconsistent with his original question and there is a big downside of his argument. I said that this wouldn't be a choice in embedded programming where memory resource is limited. Secondly, his original question was to compute ones in a 32-bit number. That means we have to 2^32 (4,294,967,296 or about 4 GBytes) items set manually in an array. That's so ridiculous. Who wants to waste that much space just to calculate the number of bits in an integer? For a byte, although it seems acceptable ("only" 256 bytes of table), but compare that to the above algorithm:
movl $32, %edx xorl %ecx, %ecx
.L2: movl %ecx, %ebx shrl %ecx andl $1, %ebx addl %ebx, %eax decl %edx jne .L2It's only 7 x86 mnemonic instructions (about 12-18 bytes) and no memory involvement in the loop!. Yes, it is O(32) for 32-bit, but it's a constant speed. Even further, if the function is called only once, the fully optimized version of the instructions would be inserted inline in the place (without calling the function), so we can CPU time from doing push-pop stack frames. That's another tiny speedup. It's probably fine for table lookup (yes, it's the fastest, O(1)), but it is limited to 8-bit. For 16 or higher, seems the space-time trade-off is too big not to consider. Besides, the above algorithm extendable into polymorphism in Object-Oriented code.
When I searched Google, I found the following (on Stanford's web):
v = v - ((v >> 1) & 0x55555555); // reuse input as temporary v = (v & 0x33333333) + ((v >> 2) & 0x33333333); // temp c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count
Which renders to instruction codes like below:
shrl %edx andl $1431655765, %edx subl %edx, %eax movl %eax, %edx shrl $2, %eax andl $858993459, %edx andl $858993459, %eax addl %edx, %eax movl %eax, %edx shrl $4, %edx leal (%edx,%eax), %eax andl $252645135, %eax imull $16843009, %eax, %eax shrl $24, %eax
(14 mnemonics). This seems the fastest and most optimized. It took some time for me to understand it.
Seems the interviewer doesn't have enough experience in embedded or microcontroller or doesn't think much about optimization. Nevertheless, I failed an interview again. I start to think that many interview questions are really not to dig the thinking ability of a candidate, but rather to get instant answers meet their mind. I just hope big software companies like IBM, Microsoft, Google or Apple shouldn't fall to this kind of trap, but rather ask logical questions to know if the candidate can think systematically.
What a waste!
Thursday, March 8, 2012
Parallelizable integer multiplication
At a recent interview with an Amazon, the hiring manager asked me about an algorithm to do integer multiplication without using any multiplication/division symbol in parallel computing. The following algorithm was suggested to do integer multiplication without using any "MULT" instruction:
When a recursive call is made, the parameters are pushed into the stack before calling each subsequence recursive call. Involving stack in this manner is hardly parallelizable, because there is coupling of current result based on previous results. One of the principle of parallel computing is to decouple two or more computation as much as we can.
A dig into highly-optimized assembly language of the above code:
As can be seen, "SARL" (Shift Arithmetic Left) for the shift-left operation. Some issues after analyzing this code. First, there are multiple register-to-memory flow while doing the recursive:
Memory access is expensive and the instructions might cause cache miss thus forcing CPU to reread data from RAM.
Another issue, how do we parallelize this, if there is stack push-pop coupling for the result?
My argument was to decompose the computation into two (or more) portions.
For example:
No, let's check what we can see in the x86 assembly (gcc-generated):
As can be seen, most of operations are in registers, also there is decoupling between 2 partitions (e.g, sum1 does not rely on sum2). The more we partition the calculation (e.g, pivot1 = b/4, and do 4 for-next loops for each partition), the more divisible partition can be accomplished by parallel computer. I was arguing that to optimize computation, hence lowering O(n), we should use divisible loops ("divide-et-empera") instead of recursive approach (seems the interviewer disagreed)
PS: My argument above caused me not to land job at Amazon.
int mult(int x, int y) { int z; if ((x==0) || (y==0)) return 0; if (x==1) return y; if (y==1) return x; z = mult(x, y>>1); if (y % 2 == 0) return z<<1; else return x + (z<<1); }My question came up, is this parallelizable? My argument to the interviewer was that this is not a good approach if he is asking from parallel computation perspective (See Robertazzi's papers on ACM journals to know more detail about parallel computation)
When a recursive call is made, the parameters are pushed into the stack before calling each subsequence recursive call. Involving stack in this manner is hardly parallelizable, because there is coupling of current result based on previous results. One of the principle of parallel computing is to decouple two or more computation as much as we can.
A dig into highly-optimized assembly language of the above code:
mult: pushl %ebp movl %esp, %ebp subl $24, %esp movl %ebx, -8(%ebp) movl %esi, -4(%ebp) movl 12(%ebp), %ebx movl 8(%ebp), %esi testl %ebx, %ebx je .L4 testl %esi, %esi je .L4 cmpl $1, %esi je .L2 cmpl $1, %ebx .p2align 4,,3 je .L5 movl %ebx, %eax movl %esi, (%esp) sarl %eax movl %eax, 4(%esp) call mult andl $1, %ebx je .L7 leal (%esi,%eax,2), %ebx .L2: movl %ebx, %eax movl -4(%ebp), %esi movl -8(%ebp), %ebx leave ret .p2align 4,,10 .p2align 3 .L4: xorl %ebx, %ebx movl -4(%ebp), %esi movl %ebx, %eax movl -8(%ebp), %ebx leave ret .p2align 4,,10 .p2align 3 .L7: leal (%eax,%eax), %ebx movl -4(%ebp), %esi movl %ebx, %eax movl -8(%ebp), %ebx leave ret .p2align 4,,10 .p2align 3 .L5: movl %esi, %ebx movl -4(%ebp), %esi movl %ebx, %eax movl -8(%ebp), %ebx leave ret
As can be seen, "SARL" (Shift Arithmetic Left) for the shift-left operation. Some issues after analyzing this code. First, there are multiple register-to-memory flow while doing the recursive:
movl %ebx, %eax movl %esi, (%esp) sarl %eax movl %eax, 4(%esp)
Memory access is expensive and the instructions might cause cache miss thus forcing CPU to reread data from RAM.
Another issue, how do we parallelize this, if there is stack push-pop coupling for the result?
My argument was to decompose the computation into two (or more) portions.
For example:
int pivot = b/2; for(i=0; i<pivot; i++) sum1 += a; for(i=pivot+1; i<b; i++) sum2 += a; sum = sum1 + sum2;
No, let's check what we can see in the x86 assembly (gcc-generated):
mult: pushl %ebp movl %esp, %ebp subl $8, %esp movl 12(%ebp), %eax movl %ebx, (%esp) movl %esi, 4(%esp) movl 8(%ebp), %edx testl %eax, %eax je .L5 testl %edx, %edx je .L5 cmpl $1, %edx je .L2 cmpl $1, %eax .p2align 4,,3 je .L6 movl %eax, %ecx xorl %esi, %esi shrl $31, %ecx xorl %ebx, %ebx addl %eax, %ecx sarl %ecx testl %ecx, %ecx jle .L3 movl %ecx, %esi movl %ecx, %ebx imull %edx, %esi .L3: xorl %ecx, %ecx cmpl %ebx, %eax jle .L4 movl %eax, %ecx subl %ebx, %ecx imull %edx, %ecx .L4: leal (%ecx,%esi), %eax .L2: movl (%esp), %ebx movl 4(%esp), %esi leave ret .p2align 4,,10 .p2align 3 .L5: xorl %eax, %eax movl (%esp), %ebx movl 4(%esp), %esi leave ret .p2align 4,,10 .p2align 3 .L6: movl %edx, %eax jmp .L2
As can be seen, most of operations are in registers, also there is decoupling between 2 partitions (e.g, sum1 does not rely on sum2). The more we partition the calculation (e.g, pivot1 = b/4, and do 4 for-next loops for each partition), the more divisible partition can be accomplished by parallel computer. I was arguing that to optimize computation, hence lowering O(n), we should use divisible loops ("divide-et-empera") instead of recursive approach (seems the interviewer disagreed)
PS: My argument above caused me not to land job at Amazon.
Subscribe to:
Posts (Atom)