Percolator

Notes

Bigtable doesn’t support multi-row or multi-table transactions.

Why does Google need multi-table transactions?
- Removing duplicates: multiple URLs may lead to the same website, and leaving the duplicates in place affects the PageRank calculation.

Percolator was built on top of Bigtable because there weren’t many people working on it, and they didn’t have source code access to Bigtable.

Locks

Locks in Percolator could have been implemented in two ways:
- In place (in the database). The problems with this are that you can’t maintain complex locks with queues and techniques like wound-wait, wait-die, etc., and that it adds database overhead.
- As a standalone lock service.

Why is Snapshot Isolation used instead of Optimistic Concurrency Control? Because OCC requires complex locking mechanisms like queues to hold pending locks. ...
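To make the snapshot isolation idea concrete, here is a minimal single-process sketch in Python. It is a generic illustration of snapshot isolation and not Percolator's actual protocol: the names `SnapshotStore` and `Transaction`, and the in-memory counter standing in for a timestamp service, are all assumptions made for this example. Each transaction reads from the snapshot taken at its start timestamp, and a commit is rejected if another transaction has committed a write to the same key since that snapshot.

```python
class SnapshotStore:
    """Toy multi-version store (hypothetical, for illustration only)."""

    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), oldest first
        self.clock = 0      # stand-in for a timestamp service

    def next_ts(self):
        self.clock += 1
        return self.clock

    def begin(self):
        return Transaction(self, self.next_ts())


class Transaction:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.writes = {}  # buffered writes, applied only at commit

    def read(self, key):
        # A transaction sees its own buffered writes first.
        if key in self.writes:
            return self.writes[key]
        # Otherwise return the newest version committed at or before start_ts.
        for commit_ts, value in reversed(self.store.versions.get(key, [])):
            if commit_ts <= self.start_ts:
                return value
        return None

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Write-write conflict check: abort if any key we wrote has a
        # version committed after our snapshot was taken.
        for key in self.writes:
            versions = self.store.versions.get(key, [])
            if versions and versions[-1][0] > self.start_ts:
                return False
        commit_ts = self.store.next_ts()
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((commit_ts, value))
        return True
```

In this sketch, if two transactions start from the same snapshot and both write the same key, the first to commit succeeds and the second gets `False` back and would have to retry. Nothing here is safe under real concurrency; it only shows the visibility and conflict rules.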

MapReduce

TIL: POSIX

Reason for MapReduce (why use a distributed system?)
- Lots of data (1 petabyte); machines had 160 GB of storage, so it can’t be processed on one machine.
- I/O speed is very low (performance).
- Fault tolerance (tolerate machine and disk failures).
- Application programmers don’t need to work on systems or make sure their job runs in a distributed fashion.

Workflow

Input -> Map -> Reduce -> Result
- Input - key, value
- Map - intermediate pairs such as K2, v2 and K3, v3
- Reduce - K1, <v1, v1`, v1``> (all values grouped under one key)
- Result - K1, R(v1, v1`, …)

Examples
- Word Count
- Sort
- Reverse Links (used for page ranking)
  Input - key: URL, value: HTML
  Map - (<target1, src1>, <target2, src2>, …)
  Reduce - target1, <src1, src1`, …>
  Result - rank

Why was MapReduce developed as a library and not as a service? Because loading a user-defined function (UDF) into an already running process was not possible in the case of C++. It is easier to compile the code with the library than to load the UDF dynamically. ...
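The Input, Map, Reduce, Result workflow and two of the examples (Word Count and Reverse Links) can be sketched in a few lines of single-process Python. This is only an illustration of the data flow, not the distributed library itself: `map_reduce`, `wc_map`, `wc_reduce`, `links_map`, and `links_reduce` are names made up for this sketch, and the `href` regular expression is a simplifying assumption about what the HTML looks like. The Reverse Links reducer stops at the list of sources per target; turning that into a rank is not shown.

```python
import re
from collections import defaultdict


def map_reduce(inputs, map_fn, reduce_fn):
    """Run map over (key, value) inputs, group by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)  # the "shuffle" step, in memory
    return {k: reduce_fn(k, values) for k, values in groups.items()}


# Word Count: emit (word, 1) for every word, then sum the ones.
def wc_map(doc_id, text):
    for word in text.lower().split():
        yield word, 1


def wc_reduce(word, counts):
    return sum(counts)


# Reverse Links: emit (target, source) for every link on a page,
# then collect all sources that point at each target.
def links_map(url, html):
    for target in re.findall(r'href="([^"]+)"', html):
        yield target, url


def links_reduce(target, sources):
    return sorted(set(sources))


if __name__ == "__main__":
    docs = [("d1", "a b a"), ("d2", "b c")]
    print(map_reduce(docs, wc_map, wc_reduce))

    pages = [
        ("x.com", '<a href="z.com">z</a>'),
        ("y.com", '<a href="z.com">z</a> <a href="x.com">x</a>'),
    ]
    print(map_reduce(pages, links_map, links_reduce))
```

The user-defined `map_fn` and `reduce_fn` are passed in as ordinary functions and linked into the same program, which mirrors the library-style design described in the notes: the application code is compiled together with the framework instead of being loaded into a separately running service.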