My Stuff

Friday, March 16, 2012

Google Interview (3)

Question:
How would you reverse the image on an n by n matrix where each pixel is represented by a bit?

Answer:
I: n x n matrix
I[0][0] represents 1 byte at the left-most, top-most corner (8 pixels).

Just image it as columns and rows of pixels. For example, a line of an image is represented on the screen as columns of pixels:

+----+----+----+ ....+-----+-----+---+
| p₁ | p₂ | p₃ |     |p_m-2| p_m-1| p_m|
+----+----+----+ ....+-----+-----+---+

In reversed image, it looks like below:

+---+----+----+ ....+----+----+----+
|p_m |p_m-1|p_m-2|     | p₃ | p₂ | p₁ |
+---+----+----+ ....+----+----+----+

The reverse version would have the column position reversed as well as the bit order in each bit. To do this, we need to read each row of the original image in reverse and also do bit reversal.

We have to be careful when swapping bytes above. On 32-bit computers, we can swap left-right of every 32-bit integer of columns, or on 64-bit PCs we can do 64-bit (long integer) swapping before doing the bit reverse.

For example:

n = m/(sizeof(long integer)
for i=0 to n
    right_idx = m-1-i;
    swapLongInteger(image[i], image[right_idx])
    Reverse8Bit(image[i])
    Reverse8Bit(image[i+1])
Reverse8Bit(image[i+2])
Reverse8Bit(image[i+3])
    Reverse8Bit(image[right_idx])
Reverse8Bit(image[right_idx-1])
Reverse8Bit(image[right_idx-2])
Reverse8Bit(image[right_idx-3])
    i = i + sizeof(long integer)
done

The call to multiple "Reverse8Bit" above can be put in multithreading/can be done using parallelism to speed up computation.

The pseudo-code is as below:

# I : list original image
# R : reversed image
for i in range(0,n):
    for j in range(0,n):
        R[i][j] = Reverse8Bit(I[i][n-1-j])

The Reverse8Bit() function is like below (from http://graphics.stanford.edu/~seander/bithacks.html#BitReverseObvious):

unsigned int v;     // input bits to be reversed
unsigned int r = v; // r will be reversed bits of v; first get LSB of v
int s = sizeof(v) * CHAR_BIT - 1; // extra shift needed at end

for (v >>= 1; v; v >>= 1)
{   
  r <<= 1;
  r |= v & 1;
  s--;
}
r <<= s; // shift when v's highest bits are zero

Thursday, March 15, 2012

Google Interview (2)

Question:
How many degrees are there in the angle between the hour and minute hands of a clock when the time is a quarter past three?

Answer:
The stupid and quick answer would be 0 degree.

But, if we rethink again, it's not the case exactly. When the minute hand points to '15' (quarter), the hour hand has moved little bit. How many? Let's analyze!

To make a full-circle (360 degrees rotation), the hour hand has to pass 12 hours, thus for one hour, it takes 360 degrees/12 = 30 degrees. For a quarter hour = 1/4 * 30 = 7.5 degrees.

So the answer is: 7.5 degrees.

A Google Interview

Question:
If a person dials a sequence of numbers on the telephone, what possible words/strings can be formed from the letters associated with those numbers?

Answer:
Let's analyze this.

On a handset, there are keys with number '0' to '9'. Assume we have to key in full phone number (area code + number), it's gonna be 10 digits. Only certain keys have alphabets: 2 [ABC], 3 [DEF], 4 [GHI], 5 [JKL], 6 [MNO], 7 [PQRS], 8 [WXYZ], with total number of sets = 8. The question is to find what is the total number of permutation of these 10-digits out of n set of possible phone numbers

Using permutation formula:

Formula:
	Note: , where _nP_r is the formula for permutations of n objects taken r at a time.

The set of possible numbers then is: [222-222-2222 to 888-888-8888] or 6,666,666,666 possible phone numbers.

We then compute:
n = 10,000,000,000
r = 10

Permutation =6,666,666,666 !/(10!(6,666,666,666 - 10)!) = (6,666,666,656 * 6,666,666,657 * 6,666,666,658 *6,666,666,669 * 6,666,666,660 * 6,666,666,661 * 6,666,666,662 * 6,666,666,663 * 6,666,666,664 * 6,666,666,665 * 6,666,666,666)/3628800 = <just figure out your self!>

If the allowed valid number is only 1 digit, then computation would be easier:

Permutation = 6,666,666,666!/(1!*(6,666,666,666-1)!) = 6,666,666,666 possible words

Fast Hashing of Variable-length text strings

The following Hash algorithm based on a research paper written by Peter Pearson (a scientist at Lawrence Livermore National Lab) has an ability to generate unique hash numbers for strings with lengths no longer than 256 bytes. That means, we can have up to 256 buckets in a hash table, but it guarantees the uniqueness (free of hash-collision), thus no need to have linked-list in buckets.

The beauty of this algorithm is that it is designed to work on small microprocessors or microcontrollers, which sometimes don't have luxury to do complex arithmetics in their Instruction set (or, require more instructions). Instead, the following algorithm uses XOR and table lookup which should be easy to do in small CPUs. On more modern and high-end CPUs, XOR operation translated directly to mnemonic XOR in machine-code (a single instruction) and done even in almost wire-speed.

One of the use I can think of is for creating MAC table for virtual bridging. An Ethernet MAC address takes only 6 octets, so this algorithm can be used to lookup 256-bucket MAC table. The hash-key is MACaddr, while the output is port.

For example:

A MAC table structure might look like below:

+-------------------+---------+------+
|      MACAddr      +  port   |  Age |
+-------------------+---------+------+
| XX:XX:XX:XX:XX:XX |    1    |  6   |
| YY:YY:YY:YY:YY:YY |    2    |  7   |
| ZZ:ZZ:ZZ:ZZ:ZZ:ZZ |    3    |  102 |
| AA:BB:BB:BB:BB:BB |    4    |  33  |
| AA:AA:AA:AA:AA:AA |    5    |  76  |
+-------------------+---------+------+

Learning:
port = PearsonHash(SourceMacAddr);
MACTable[port].addr = SourceMacAddr;
MACTable[port].port = port;

Forwarding:
inspect destination MAC address from incoming ethernet frame

//Flooding
if (DestMacAddr == Broadcast)
Send to all ports, except the original incoming port, with this frame

// a small filtering is done here to prevent loop
if (DestMacAddr != SourceMacAddr)
{
port = PearsonHash(DestMacAddr);
if port is invalid:
flooding
else
DataForward(EthernetFrame, port);
}

Aging:
The following statements can be executed periodically.

for each entry/bucket in the MAC table:
if MACTable[i].age < 1:
Flush(MACTable[i])

A challenge might be how to expand this algorithm to handle, say, 64000 buckets/entries? How to generate the pseudorandom hash table?

#include <stdio.h>
#include <string.h>
#include <stdlib.h>


static unsigned char PseudoRandomHash[256] = {
     1,   87,   49,  12, 176, 178,  102, 166, 121, 193,   6,  84, 249, 230,  44,  163, 
     14, 197,  213, 181, 161,  85,  218,  80,  64, 239,  24, 226, 236, 142,  38,  200, 
    110, 177,  104, 103, 141, 253,  255,  50,  77, 101,  81,  18,  45,  96,  31,  222, 
     25, 107,  190,  70,  86, 237,  240,  34,  72, 242,  20, 214, 244, 227, 149,  235, 
     97, 234,   57,  22,  60, 250,   82,  175, 208,   5, 127, 199, 111,  62, 135,  248, 
    174, 169,  211,  58,  66, 154,  106, 195, 245, 171,  17, 187, 182, 179,   0,  243, 
    132,  56,  148,  75, 128, 133,  158, 100, 130, 126,  91,  13, 153, 246, 216,  219,
    119,  68,  223,  78,  83,  88,  201,  99, 122,  11,  92,  32, 136, 114,  52,   10, 
    138,  30,   48, 183, 156,  35,   61,  26, 143,  74, 251,  94, 129, 162,  63,  152, 
    170,   7,  115, 167, 241, 206,    3, 150,  55,  59, 151, 220,  90,  53,  23,  131, 
    125, 173,   15, 238,  79,  95,   89,  16, 105, 137, 225, 224, 217, 160,  37,  123, 
    118,  73,    2, 157,  46, 116,    9, 145, 134, 228, 207, 212, 202, 215,  69,  229, 
     27, 188,   67, 124, 168, 252,   42,   4,  29, 108,  21, 247,  19, 205,  39,  203, 
    233,  40,  186, 147, 198, 192,  155,  33, 164, 191,  98, 204, 165, 180, 117,   76, 
    140,  36,  210, 172,  41,  54,  159,   8, 185, 232, 113, 196, 231,  47, 146,  120, 
     51,  65,   28, 144, 254, 221,   93, 189, 194, 139, 112,  43,  71, 109, 184,  209}; 
     
#define N   256
     
     
int PearsonHash(const char *str)
{
    int n = strlen(str);
    int i;
    unsigned char *h;
    int result;
    
    if (n > N)
    {
        puts("Pearson Hashing can only be used for string length <256 characters");
        return -1;
    }
    h = (unsigned char *)malloc(N);
    bzero(h, sizeof(char)*N);
    h[0] = 0;
    for(i=1; i<n; i++)
    {
        h[i] = PseudoRandomHash[(h[i-1] ^ str[i])& 0xFF];
        //printf("h[%u]=%0x\n", i, h[i]);
        result = h[i];
    }
    free(h);
    return result;
}

     
int main()
{
    int i;
    int n;
    const char MyStringBase[] = "Ahlan Wa Sahlan Bib, ";
    char *StrTable[N];
    
    for(i=0; i<N; i++)
    {
        printf(" %03u ", PseudoRandomHash[i]);
        if ((i+1)%16 == 0)
            printf("\n");
    }
    for(i=0; i<N; i++)
    {
        StrTable[i] = (char *)malloc(N);
        sprintf(StrTable[i], "%s%u", MyStringBase, i);
    }    
    for(i=0; i<N; i++)
    {
        printf("\n%s: \tHash Index=%u\n", StrTable[i], PearsonHash(StrTable[i]));
        free(StrTable[i]);
    }    
}

Wednesday, March 14, 2012

Print Canonical Addresses of a Site

This 2 lines of Python queries DNS and display the canonical names of a host. It's like a simple DNS lookup (nslookup).

#!/usr/bin/python

import sys
import socket

HOST=''
PORT=80

if (len(sys.argv) < 2):
 print sys.argv[0], " "
 sys.exit(0)

HOST = sys.argv[1]
for res in socket.getaddrinfo(HOST, PORT, socket.AF_UNSPEC, socket.SOCK_STREAM):
   print res[4][0]

For example, if we execute the script and pass parameter 'www.google.com', we'd get this:

$ ./getdns.py www.google.com
74.125.224.50
74.125.224.51
74.125.224.48
74.125.224.49
74.125.224.52

Python GUI

This small Python script creates a window application and handles menu events as well as updating statusbar at the bottom.

#!/usr/bin/python

from Tkinter import *
import tkMessageBox

class StatusBar(Frame):

    def __init__(self, master):
        Frame.__init__(self, master)
        self.label = Label(self, text="", bd=1, relief=SUNKEN, anchor=W)
        self.label.pack(side=BOTTOM, fill=X)

    def set(self, format, *args):
        self.label.config(text=format % args)
        self.label.update_idletasks()

    def clear(self):
        self.label.config(text="")
        self.label.update_idletasks()

class App:

    def __init__(self, parent):
        self.this = self
        frame = Frame(parent)
        parent.protocol("WM_DELETE_WINDOW", ConfirmQuit)

        parent.geometry("%dx%d%+d%+d" % (600, 400, 100, 50))

        # create a menu
        menu = Menu(parent)
        parent.config(menu=menu)

        filemenu = Menu(menu)
        menu.add_cascade(label="File", menu=filemenu)
        filemenu.add_command(label="New", command=self.newFile)
        filemenu.add_command(label="Open...", command=self.OpenFile)
        filemenu.add_separator()
        filemenu.add_command(label="Exit", command=self.Exit)

        runmenu = Menu(menu)
        menu.add_cascade(label="Run", menu=runmenu)
        runmenu.add_command(label="Run now", command=self.callback)

        helpmenu = Menu(menu)
        menu.add_cascade(label="Help", menu=helpmenu)
        helpmenu.add_command(label="About...", command=self.About)

    def callback(event):
        print "callback was called from event class ",event.__class__
        print "event.__name__=",__name__
        print "event type:", type(event)
        print "event.this.__class__=", event.this.__class__
        print "this: type=", type(event.this), ",class=", event.this.__class__
        print "status_bar", status_bar.__class__
        status_bar.set("%s", "Ahlan bib")

    def newFile(event):
        print "New file", event
        status_bar.set("%s", "Creating a new file")

    def OpenFile(event):
        print "Open File", event
        status_bar.set("%s", "Opening a file")

    def About(event):
        print "(c) 2012, mlutfi", event
        status_bar.clear()

    def say_hi(self):
        print "hi there, everyone!"

    def Exit(event):
        ConfirmQuit()

class BaseWindow:
    def __init__(self, parent):
        Toplevel()


def ConfirmQuit():
    if tkMessageBox.askokcancel("Quit", "Do you really want to quit?"):
        root.destroy()

root = Tk()
root.protocol("WM_DELETE_WINDOW", ConfirmQuit)

status_bar = StatusBar(root)
status_bar.pack(side=BOTTOM, fill=X)

app = App(root)

root.mainloop()

Sunday, March 11, 2012

Fibonacci in Python

# Fibonacci numbers module

def fib(n):    # write Fibonacci series up to n
    a, b = 0, 1
    while b < n:
        #print b,
        a, b = b, a+b

def fib2(n): # return Fibonacci series up to n
    result = []
    a, b = 0, 1
    while b < n:
        result.append(b)
        a, b = b, a+b
    print b
    return result
    
# make it executable as a script as well as module_exit
if __name__ == "__main__":
    import sys
    if (len(sys.argv) > 1):
        f = fib2(int(sys.argv[1]))
        str = ""
        print "Fibonacci sequence: ", f
    else:
        print sys.argv[0], " "

Downloading book "Foundations of Computer Science"

This small python script would download all chapters in the book "Foundations of Computer Science".


#!/usr/bin/python

import sys
from subprocess import call  # for 'call'


urlbase = 'http://infolab.stanford.edu/~ullman/focs/'

others = ['preface.pdf', 'toc.pdf', 'index.pdf']
chapters = range(1,15)
links = []
links.extend(others)

def download(index, f):
    url = urlbase + f
    print "file = ", index+1, url
    call(["wget", url])


for ch in chapters:
    links.append('ch' + '%02d' %(ch) + '.pdf')

out = [download(index,obj) for index,obj in enumerate(links)]

print "out=", out

Regular Expression in Python

The following is a Python snippet to do regular expression. First it reads a file passed as an argument, the it goes to a for-loop for each line and do string regular-expression search

#!/usr/bin/python

import sys # for sys.argv, etc.
import re  # regular expression

if (len(sys.argv) > 1):
    filename = sys.argv[1]
else:
    print sys.argv[0], " "
    sys.exit(0)


with open(filename, 'r') as f:
    #f.read()
    for line in f:
        pattern = re.compile(r"print (.*)")
        match = pattern.search(line)
        if match:
            print "print is used to print '", match.group(1), "'"

~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ readfile.py 17,3-9 All "readfile.py" 21L, 386C

Computer Design Fun Party!

To compute expression A = (B-C)*(D-E)

Using accumulator:

Load B ; acc = B; opcode=1 B, oper=2B; mem.traf = 2 B

Sub C ; acc = B – C; 1 byte opcode, 2-bytes operand = 3 bytes

Store A ; A = B -C; 1 byte opcode, 2-bytes operand = 3 bytes

Load D ; acc = D; 1 byte opcode, 2-bytes operand = 3 bytes

Sub E ; acc = D -E; 1 byte opcode, 2-bytes operand = 3 bytes

Mul A ; acc = (B-C)* (D-E) ; 1 byte opcode, 2-bytes operand = 3 bytes

Store A ; A = (B-C) * (D-E) ; 1 byte opcode, 2-bytes operand = 3 bytes

Assume each opcode takes 8-bit, operands are 16-bit and data is move in 16-bit chunks:

Total bytes: 7 * 3 byte = 21 bytes

Memory traffic: 21 + 7 * 2 = 35 bytes

Using load-store:

Ld B, r1 ; r1 = B; opcode=1 B, op = 2 B; mem. traff = 2 bytes

Ld C, r2 ; r2 = C; memory traffic = 2 bytes

Sub r1, r2,r3 ; r3 = r1 * r2 = B – C; size=2 bytes, memory traffic = 0

Ld D, r1 ; r1 = D; memory traffic = 2 bytes

Ld E, r2 ; r2 = E; memory traffic = 2 bytes

Sub r1,r2,r4 ; r4 = r1 * r2 = D-E; memory traffic = 0 bytes

Mul r3, r4, r1 ; r1 = r3 * r4 = (B-C) * (D-E); memory traffic = 0 bytes

St r1,A ; A = r1 = (B-C) * (D-E); memory traffic = 2 bytes

Number of bytes for instructions: 3*7 = 21 bytes

Memory traffic: 21 + 5 *2 = 31 bytes

Using stack:

Push B ; mem. Traf = 2 bytes

Push C ; mem. Traf = 2 bytes

Sub ; B – C; mem. Traf. = 2 bytes

Push D ; mem traf = 2 bytes

Push E ; mem.traf = 2 bytes

Sub ; top of the stack contains D-E, below it is B-C

Mul ; top of the stack now contains (B-C) * (D-E)

Each instruction occupies 3 bytes (1 byte opcode and 2-byte operand), so total instructions use 3*7 = 21 bytes.

Memory traffic: 21 + 2*7 = 35 bytes

Register 0, 1, 2 (R0, R1, R2) are GPRs. R3 is initialized to 1 and make sure it is unchanged.
1. Control sequence that forms the 2’s complement difference (or, R0 ← R0 + (-R1) = R0 + 2^’s complement of R1) of the contents of R0 and R1:

Write Enable				A-bus enables				B-bus enables				F₀F₁		Time
0	1	2	3	0	1	2	3	0	1	2	3	F₀	F₁	T
0	1	0	0	0	1	0	0	0	0	0	0	1	1	0
0	1	0	0	0	1	0	0	0	0	0	1	0	0	1
1	0	0	0	1	1	0	0	1	0	0	0	0	0	2

Explanation:

At T=0:

F₀F₁ = 11 (A’), A-bus enable = 1, B-bus enable = all zeros. Data (say, Y) is placed in A-bus. C-bus contains Y’, Write-enable = pin 1. So Y is written to register 1 (R1 = Y’).

At T = 1:

F₀F₁ = 00 (add), A-bus enable = pin 1 (content of R1 which is X, is in A-bus), B-bus enable= pin 3 (content of R3 which is 1 is in B-bus), write-enable = pin 1 (C-bus contains Y’ + 1). So R1 = Y’ + 1 = 2’s compl. Of X.(or R1 = -Y)

At T = 2

B-bus = 0, hence data from R0, say X, is placed in B-bus. F₀F₁ = 00 (add), A-bus enable = 1 (which contains –Y). C-bus contains R0 + R1 = X - Y . Write-enable = pin 0, so result is stored in R0.

R0 XOR R1 (R0’ & R1 + R0 & R1’) = (R0 + R1)(R0’ + R1’), leaving the result in R0.

Write Enable				A-bus enables				B-bus enables				F₀F₁		Time	Note
0	1	2	3	0	1	2	3	0	1	2	3	F₀	F₁	T	(Movements etc.)
0	0	1	0	1	0	0	0	0	1	0	0	0	0	0	R2 ← A + B
1	0	0	0	1	0	0	0	0	0	0	0	1	1	1	R0 ← A’
0	1	0	0	0	1	0	0	0	0	0	0	1	1	2	R1 ← B’
0	1	0	0	1	0	0	0	0	1	0	0	0	0	3	R1 ← A’ + B’
1	0	0	0	0	1	0	0	0	0	1		0	1	4	R0 ← (A+B)&(A’+B’)

Booth Multiplication Algorithm:

/***********************************************************
Title: Booth Algorithm sample
Desc: To simulate booth multiplication algorithm
Author: Buhadram
Notes:

Ref:
- Computer Architecture: A Quantitative Approach
*************************************************************/
#include 
#include 
#include 
#include 

#define NUMTYPE	   	short int
#define NUMBITS		(8*sizeof(NUMTYPE))
#define MAX(x,y) 		((x) > (y) ? x : y)
#define MIN(x,y) 		((x) < (y) ? x : y)
#define BITMASK(x,mask) 	((x) & mask)
/*********************************
proc: bitmask
desc: return 1 if bit[pos] is set
      pos=0 for LSB, and p=sizeof(x)-1
      for MSB
*********************************/
#define BIT(x,pos)		( (BITMASK(x,1<<(pos))) >> (pos) )
#define ASCII_BIT(x)	((x) + '0')

char *get_input(const char *prompt, char *buf, int length)
{
	char *p;
	int i;
    printf("%s",prompt);
	// fgets is safer to avoid buffer overflow
	p = fgets(buf, length, stdin);
	// remove newline character
	buf[strlen(buf)-1] = 0;
	return p;
}

/********************************************
* Swapbits
* Desc: To swap bit orders, e.g LSB becomes MSB
        Note: b[0] is LSB, b[n-1] is MSB
********************************************/
void swapbits(NUMTYPE *b)
{
    int i;
    int tmp;

	tmp = 0;
	for(i=0; i base)
       base = 10;                    /* can only use 0-9, A-Z        */
    tail = &buf[BUFSIZE - 1];           /* last character position      */
    *tail-- = '\0';

    if (10 == base && N < 0L) {
	   *head++ = '-';
       uarg = -N;
    }
    else uarg = N;

    if (uarg) {
  	   	for (i = 1; uarg; ++i) {
       	   register ldiv_t r;
		   r = ldiv(uarg, base);
           *tail-- = (char)(r.rem + ((9L < r.rem) ? ('A' - 10L) : '0'));
           uarg    = r.quot;
    	}
    }
    else  *tail-- = '0';
	memcpy(head, ++tail, i);
    return str;
}

#endif


/******************************
proc: btol
desc: converts binary string to
	  long integer
******************************/
long btol(const char *bin)
{
    int n,i;
    long d;

	n = strlen(bin);
	d = 0;
	for(i=n-1; i>=0; i--) {
		d += (bin[i]-'0') * (1 << i);
    }
    return d;
}

/**************************************
Proc: Ashift_bits_right
desc: perform arithmetic shift right on
    bits in x, but preserving the sign bit.
***************************************/
void ashift_bits_right(long *x, int numbits)
{
    int neg;

    neg = (*x<0) ? 1 : 0;
    *x >>= numbits;

    if (neg)
       //*x |= (1<>= 1;
		qm = q0;
		q >>= 1;
		if (a0)
		   r |= b1;
		b1 <<= 1;
    }
    return r;
}


int main(int argc, char *argv[])
{
	#define BUF_SIZE	(1024+1)
	char *buf;
	NUMTYPE A, B;
	long res;
	char str[256];

	buf = (char *)malloc(BUF_SIZE+1);
	assert(buf);
    get_input("A = ", buf, BUF_SIZE);
    A = atol(buf);
    printf("A (in Binary) = %s\n", ltoa(A, str, 2));
    get_input("B = ", buf, BUF_SIZE);
    B = atol(buf);
    printf("B (in Binary) = %s\n", ltoa(B, str, 2));
    res = A+B;
    printf("A + B = %ld = 0x%0X\n", res, res);
    res = A - B;
	printf("A - B = %ld = 0x%0X\n", res, res);
    puts("\n\nBOOTH MULTIPLICATION\n");
    res = booth_mult(A,B);
	printf("A * B = %ld = 0x%0X = %sb\n", res, res, ltoa(B,str,2));
    free(buf);
	return 0;
}

Friday, March 9, 2012

Counting the number of "1" bits

Another "ridiculous" interview I had was when an interviewer asked me how to count the number of set bits in a 32-bit or 8-bit. I told him that there are various algorithms to do this, but the easiest way but yet good enough in performance is a function like this:

uint32_t count_bits(uint32_t b)
{
    int i;
    int n = 0;
    
    for(i=0; i<sizeof(b)*8; i++)
    {
        n += (b & 1);
        b >>= 1;
    }   
    return n;
}

I also told him that there are many ways to do this (referred him to look at "Beautiful Code" to see some of the beautiful algorithms).

But then the interviewer argued. FIrst, he said the upper border of the for-loop should be to "sizeof(b)*8" (meaning, instead of using "<", I should have used "<="). To bad, I said yes to him without thinking further (Probably got nervous. Later at home, I checked that he was wrong). Second, he said that the fastest is to use look-up table (n[0] = 0, n[1]=1, n[2]=1,...,n[255]=8,...). Although I agreed with his argument, but then he become inconsistent with his original question and there is a big downside of his argument. I said that this wouldn't be a choice in embedded programming where memory resource is limited. Secondly, his original question was to compute ones in a 32-bit number. That means we have to 2^32 (4,294,967,296 or about 4 GBytes) items set manually in an array. That's so ridiculous. Who wants to waste that much space just to calculate the number of bits in an integer? For a byte, although it seems acceptable ("only" 256 bytes of table), but compare that to the above algorithm:

movl $32,  %edx
 xorl %ecx, %ecx

.L2:
 movl %ecx, %ebx
 shrl %ecx
 andl $1,   %ebx
 addl %ebx, %eax
 decl %edx
 jne .L2

It's only 7 x86 mnemonic instructions (about 12-18 bytes) and no memory involvement in the loop!. Yes, it is O(32) for 32-bit, but it's a constant speed. Even further, if the function is called only once, the fully optimized version of the instructions would be inserted inline in the place (without calling the function), so we can CPU time from doing push-pop stack frames. That's another tiny speedup. It's probably fine for table lookup (yes, it's the fastest, O(1)), but it is limited to 8-bit. For 16 or higher, seems the space-time trade-off is too big not to consider. Besides, the above algorithm extendable into polymorphism in Object-Oriented code.

When I searched Google, I found the following (on Stanford's web):

v = v - ((v >> 1) & 0x55555555);                    // reuse input as temporary
v = (v & 0x33333333) + ((v >> 2) & 0x33333333);     // temp
c = ((v + (v >> 4) & 0xF0F0F0F) * 0x1010101) >> 24; // count

Which renders to instruction codes like below:

shrl %edx
 andl $1431655765, %edx
 subl %edx, %eax
 movl %eax, %edx
 shrl $2, %eax
 andl $858993459, %edx
 andl $858993459, %eax
 addl %edx, %eax
 movl %eax, %edx
 shrl $4, %edx
 leal (%edx,%eax), %eax
 andl $252645135, %eax
 imull $16843009, %eax, %eax
 shrl $24, %eax

(14 mnemonics). This seems the fastest and most optimized. It took some time for me to understand it.

Seems the interviewer doesn't have enough experience in embedded or microcontroller or doesn't think much about optimization. Nevertheless, I failed an interview again. I start to think that many interview questions are really not to dig the thinking ability of a candidate, but rather to get instant answers meet their mind. I just hope big software companies like IBM, Microsoft, Google or Apple shouldn't fall to this kind of trap, but rather ask logical questions to know if the candidate can think systematically.

What a waste!

Thursday, March 8, 2012

Parallelizable integer multiplication

At a recent interview with an Amazon, the hiring manager asked me about an algorithm to do integer multiplication without using any multiplication/division symbol in parallel computing. The following algorithm was suggested to do integer multiplication without using any "MULT" instruction:

int mult(int x, int y)
{
 int z;

 if ((x==0) || (y==0))
  return 0;
 if (x==1)
  return y;

 if (y==1)
  return x;

 z = mult(x, y>>1);
 if (y % 2 == 0)
  return z<<1;
 else
  return x + (z<<1);
}

My question came up, is this parallelizable? My argument to the interviewer was that this is not a good approach if he is asking from parallel computation perspective (See Robertazzi's papers on ACM journals to know more detail about parallel computation)

When a recursive call is made, the parameters are pushed into the stack before calling each subsequence recursive call. Involving stack in this manner is hardly parallelizable, because there is coupling of current result based on previous results. One of the principle of parallel computing is to decouple two or more computation as much as we can.

A dig into highly-optimized assembly language of the above code:

mult:
 pushl %ebp
 movl %esp, %ebp
 subl $24, %esp
 movl %ebx, -8(%ebp)
 movl %esi, -4(%ebp)
 movl 12(%ebp), %ebx
 movl 8(%ebp), %esi
 testl %ebx, %ebx
 je .L4
 testl %esi, %esi
 je .L4
 cmpl $1, %esi
 je .L2
 cmpl $1, %ebx
 .p2align 4,,3
 je .L5
 movl %ebx, %eax
 movl %esi, (%esp)
 sarl %eax
 movl %eax, 4(%esp)
 call mult
 andl $1, %ebx
 je .L7
 leal (%esi,%eax,2), %ebx
.L2:
 movl %ebx, %eax
 movl -4(%ebp), %esi
 movl -8(%ebp), %ebx
 leave
 ret
 .p2align 4,,10
 .p2align 3
.L4:
 xorl %ebx, %ebx
 movl -4(%ebp), %esi
 movl %ebx, %eax
 movl -8(%ebp), %ebx
 leave
 ret
 .p2align 4,,10
 .p2align 3
.L7:
 leal (%eax,%eax), %ebx
 movl -4(%ebp), %esi
 movl %ebx, %eax
 movl -8(%ebp), %ebx
 leave
 ret
 .p2align 4,,10
 .p2align 3
.L5:
 movl %esi, %ebx
 movl -4(%ebp), %esi
 movl %ebx, %eax
 movl -8(%ebp), %ebx
 leave
 ret

As can be seen, "SARL" (Shift Arithmetic Left) for the shift-left operation. Some issues after analyzing this code. First, there are multiple register-to-memory flow while doing the recursive:

movl %ebx, %eax
 movl %esi, (%esp)
 sarl %eax
 movl %eax, 4(%esp)

Memory access is expensive and the instructions might cause cache miss thus forcing CPU to reread data from RAM.
Another issue, how do we parallelize this, if there is stack push-pop coupling for the result?

My argument was to decompose the computation into two (or more) portions.

For example:

int pivot = b/2;

for(i=0; i<pivot; i++)
    sum1 += a;

for(i=pivot+1; i<b; i++)
    sum2 += a;
sum = sum1 + sum2;

No, let's check what we can see in the x86 assembly (gcc-generated):

mult:
 pushl %ebp
 movl %esp, %ebp
 subl $8, %esp
 movl 12(%ebp), %eax
 movl %ebx, (%esp)
 movl %esi, 4(%esp)
 movl 8(%ebp), %edx
 testl %eax, %eax
 je .L5
 testl %edx, %edx
 je .L5
 cmpl $1, %edx
 je .L2
 cmpl $1, %eax
 .p2align 4,,3
 je .L6
 movl %eax, %ecx
 xorl %esi, %esi
 shrl $31, %ecx
 xorl %ebx, %ebx
 addl %eax, %ecx
 sarl %ecx
 testl %ecx, %ecx
 jle .L3
 movl %ecx, %esi
 movl %ecx, %ebx
 imull %edx, %esi
.L3:
 xorl %ecx, %ecx
 cmpl %ebx, %eax
 jle .L4
 movl %eax, %ecx
 subl %ebx, %ecx
 imull %edx, %ecx
.L4:
 leal (%ecx,%esi), %eax
.L2:
 movl (%esp), %ebx
 movl 4(%esp), %esi
 leave
 ret
 .p2align 4,,10
 .p2align 3
.L5:
 xorl %eax, %eax
 movl (%esp), %ebx
 movl 4(%esp), %esi
 leave
 ret
 .p2align 4,,10
 .p2align 3
.L6:
 movl %edx, %eax
 jmp .L2

As can be seen, most of operations are in registers, also there is decoupling between 2 partitions (e.g, sum1 does not rely on sum2). The more we partition the calculation (e.g, pivot1 = b/4, and do 4 for-next loops for each partition), the more divisible partition can be accomplished by parallel computer. I was arguing that to optimize computation, hence lowering O(n), we should use divisible loops ("divide-et-empera") instead of recursive approach (seems the interviewer disagreed)

PS: My argument above caused me not to land job at Amazon.

Sunday, February 26, 2012

Simple UDP Client

<pre>

/* UDP client in the internet domain */
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>

#define MIN(x, y) ((x) < (y) ? (x) : (y))
#define MAX(x, y) ((x) > (y) ? (x) : (y))

void error(const char *);

int main(int argc, char *argv[])
{
int sock, n;
unsigned int length;
struct sockaddr_in server, from;
struct hostent *hp;
char buffer[256];

if (argc < 3) {
printf("Usage: %s <port>\n", argv[0]);
exit(1);
}
sock= socket(AF_INET, SOCK_DGRAM, 0);
if (sock < 0) error("socket");

server.sin_family = AF_INET;
hp = gethostbyname(argv[1]);
if (hp==0) error("Unknown host");

bcopy((char *)hp->h_addr,
(char *)&server.sin_addr,
hp->h_length);
server.sin_port = htons(atoi(argv[2]));
length=sizeof(struct sockaddr_in);
printf("Please enter the message: ");
bzero(buffer, sizeof(buffer));
fgets(buffer, sizeof(buffer)-1, stdin);
n=sendto(sock,buffer,
strlen(buffer),0,(const struct sockaddr *)&server,length);
if (n < 0)
error("Sendto");
n = recvfrom(sock, buffer, sizeof(buffer), 0, (struct sockaddr *)&from, &length);
if (n < 0)
error("recvfrom");
printf("Got an ack: %d bytes\n", n);
write(1, buffer, MIN(n, strlen(buffer)));
close(sock);
return 0;
}

void error(const char *msg)
{
perror(msg);
exit(0);
}

</pre>

Simple UDP Server

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <netdb.h>
#include <unistd.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <arpa/inet.h>

extern int h_errno;

void error(const char *msg);

int UdpSocket(int protocol)
{
 int sockfd;

    sockfd = socket(AF_INET, SOCK_DGRAM, 0);
 if (-1 == sockfd)
 {
  error(__func__);
 }
 return sockfd;
}


void error(const char *msg)
{
 printf("Error Code: %d", errno);
 perror(msg);
}


int BindSocketToAddr(int sock, struct sockaddr_in *addr, int port)
{
 int rc;

    if (!addr)
    {
        printf("%s: NULL pointer in addr\n", __func__);
        return -1;
    }
 bzero(addr, sizeof(struct sockaddr_in));
 addr->sin_family = AF_INET;
 addr->sin_addr.s_addr = INADDR_ANY;
 addr->sin_port = htons(port);
 rc = bind(sock, (struct sockaddr *)addr, sizeof(struct sockaddr));
 if (rc < 0)
 {
  perror(__func__);
 }
    return rc;
}



int main(int argc, char **argv)
{
 int sockfd;
 struct sockaddr_in fromSockAddr;
 struct sockaddr_in serverSockAddr;
 socklen_t fromlen;
 char buf[1024];
 char fromAddrStr[512];
 int n;

    if (argc < 2)
    {
        printf("\n%s <PORT NUMBER>\n\n", argv[0]);
        exit(0);
    }
    setuid(0);

 printf("UDP Server\n");

 sockfd = UdpSocket(0);
 if (sockfd > 0)
 {
  BindSocketToAddr(sockfd, &serverSockAddr, atoi(argv[1]));
  fromlen = sizeof(fromSockAddr);
  while (1) {
   /* wait for client to send a message */
   n = recvfrom(sockfd, buf, sizeof(buf), 0, (struct sockaddr *)&fromSockAddr, &fromlen);
   if (n < 0)
   {
    error("recvfrom");
    break;
   }
   else {
    inet_ntop(fromSockAddr.sin_family, &fromSockAddr.sin_addr, fromAddrStr, sizeof(fromAddrStr));
    printf("GOT msg from: %s\n", fromAddrStr);
    /* echo back */
    n = sendto(sockfd, buf, n, 0, (struct sockaddr *)&fromSockAddr, sizeof(fromSockAddr));
    if (n < 0)
    {
     error("sendto");
     break;
    }
   }
  }
 }
 return 0;
}

Saturday, February 4, 2012

Intro to Ruby

Just got my hands dirty with Ruby today. It's an interesting language, yet simple to learn. Unlike some other dynamic programming languages that too cryptic to learn, Ruby is easy to learn, similar to Tcl.

Here's an example of a program I tested:

#!/usr/bin/ruby

presidents = ["Ford", "Carter", "Reagan", "Bush1", "Clinton", "Bush2", "Obama"]

for fwd in [-1,1]
    if fwd == 1 
        print "Forward/ascending\n"
        dec = 0
    else
        print "Backward/descending\n"
        dec = -1
    end
    for ss in 0...presidents.length
        print ss+1, ": ", presidents[ss*fwd+dec], "\n";
    end
end

The output:

Backward/descending
1: Obama
2: Bush2
3: Clinton
4: Bush1
5: Reagan
6: Carter
7: Ford
Forward/ascending
1: Ford
2: Carter
3: Reagan
4: Bush1
5: Clinton
6: Bush2
7: Obama