TIL:

POSIX

Reasons for MapReduce (why use a distributed system?)

Lots of data (about 1 petabyte); machines had 160 GB of storage, so the data can’t be stored or processed on one machine
I/O on a single machine is too slow (performance)
Fault tolerance (tolerate machine and disk failures)
Application programmers don’t need to do systems work or make sure their job runs in a distributed fashion
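
A rough back-of-the-envelope check of the first two points, as a sketch only. The 1 PB and 160 GB figures come from the notes above; the 100 MB/s per-disk read rate is an assumed number chosen just to make the arithmetic concrete, and the helper names are made up for illustration.

```python
# Back-of-the-envelope numbers for the reasons listed above.
PETABYTE = 10**15   # bytes (decimal units)
GIGABYTE = 10**9

data_size = 1 * PETABYTE            # from the notes
disk_per_machine = 160 * GIGABYTE   # from the notes

# ASSUMPTION: 100 MB/s sequential read per disk, picked only for illustration.
assumed_read_rate = 100 * 10**6     # bytes per second


def machines_needed(data_bytes, disk_bytes):
    """Minimum number of machines needed just to hold the data (ceiling division)."""
    return -(-data_bytes // disk_bytes)


def scan_seconds(data_bytes, rate, machines=1):
    """Seconds to read the data once when it is split evenly across machines."""
    return data_bytes / (rate * machines)


n = machines_needed(data_size, disk_per_machine)
one_machine_days = scan_seconds(data_size, assumed_read_rate) / 86400
parallel_minutes = scan_seconds(data_size, assumed_read_rate, n) / 60

print(n, "machines just to store the data")
print(round(one_machine_days, 1), "days to scan on one disk (assumed rate)")
print(round(parallel_minutes, 1), "minutes to scan in parallel (assumed rate)")
```

Under that assumed read rate, a single disk would need months for one pass over the data, while spreading it over the minimum number of machines brings a pass down to well under an hour. The exact figures depend entirely on the assumed rate.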

Workflow

Input	Map	Reduce
(key, value)	K1, <v1, v1', v1''>	K1, R(v1, v1', …)
	K2, v2
	K3, v3
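
The table above can be sketched as a tiny single-process program. This is not how the real distributed library works (no partitioning, no workers, no fault tolerance); `run_mapreduce`, `wc_map`, and `wc_reduce` are hypothetical names used only to show the map, group-by-key, reduce flow, with word count as the job.

```python
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process sketch: map, group values by key, then reduce each group."""
    groups = defaultdict(list)
    # Map phase: each (key, value) input may emit any number of (K, v) pairs.
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            # Grouping step: collect all values emitted for the same key.
            groups[out_key].append(out_value)
    # Reduce phase: K1, R(v1, v1', ...) for every distinct key, in sorted key order.
    return {k: reduce_fn(k, vs) for k, vs in sorted(groups.items())}


def wc_map(doc_name, text):
    """Emit (word, 1) for every word in the document."""
    for word in text.split():
        yield word.lower(), 1


def wc_reduce(word, counts):
    """Sum the 1s emitted for a word."""
    return sum(counts)


docs = [("a.txt", "the quick fox"), ("b.txt", "The lazy dog the end")]
counts = run_mapreduce(docs, wc_map, wc_reduce)
print(counts)
```

The user only writes the two small functions; everything else (here a dictionary, in the real system the distributed machinery) belongs to the framework.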

Examples

Word Count
Sort
Reverse Links (used for page ranking)
1. Input - key: URL, value: HTML
2. Map - (<target1, src1>, <target2, src2>, …)
3. Reduce - target1, <src1, src1', …>
4. Result - inbound links per target, used to compute rank
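
The reverse-links steps can be sketched in a few lines. This is a toy, in-memory version under stated assumptions: links are pulled out with a simplistic `href="..."` regex (real HTML needs a proper parser), the page URLs are made up, and `map_links`, `reduce_links`, and `reverse_links` are hypothetical names. It stops at the inbound-link lists and does not compute any rank.

```python
import re
from collections import defaultdict

# ASSUMPTION: a simplistic pattern for double-quoted href attributes only.
HREF = re.compile(r'href="([^"]+)"')


def map_links(url, html):
    """Step 2: for each link found in the page, emit <target, source>."""
    for target in HREF.findall(html):
        yield target, url


def reduce_links(target, sources):
    """Step 3: collect the distinct sources that point at one target."""
    return sorted(set(sources))


def reverse_links(pages):
    """Run map, group by target, then reduce, all in one process."""
    grouped = defaultdict(list)
    for url, html in pages:
        for target, src in map_links(url, html):
            grouped[target].append(src)
    return {t: reduce_links(t, s) for t, s in grouped.items()}


# Step 1: input is (URL, HTML) pairs; these pages are invented for the example.
pages = [
    ("a.example", '<a href="c.example">c</a>'),
    ("b.example", '<a href="c.example">c</a> <a href="a.example">a</a>'),
]
inbound = reverse_links(pages)
print(inbound)
```

Each target ends up with the list of pages linking to it, which is the kind of input a ranking computation would consume.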

Why was MapReduce developed as a library and not as a service?

Because loading a user-defined function (UDF) into an already running process was not practical in C++. It is easier to compile the user's code together with the library than to load the UDF dynamically.