Notes | Akshat Sharma

TIL: POSIX

Reason for MapReduce (Why use a distributed system?)
- Lots of data (1 petabyte); machines had 160 GB of storage, so the data can’t be processed on one machine
- I/O speed on a single machine is very low (performance)
- Fault tolerance (tolerate machine and disk failures)
- Application programmers don’t need to work on the systems side or make sure their job runs in a distributed fashion

Workflow
Input - key, value
Map - intermediate pairs: K1, v1; K2, v2; K3, v3
Reduce - K1, <v1, v1', v1''> (all values emitted for the same key)
Result - K1, R(v1, v1', …)

Examples
- Word Count
- Sort
- Reverse Links (used for page ranking)
  Input - key: URL, value: HTML
  Map - (<target1, src1>, <target2, src2>, …)
  Reduce - target1, <src1, src1', …>
  Result - rank

Why was MapReduce developed as a library and not as a service?
Because loading a user-defined function (UDF) into an already running process was not practical in C++. It is easier to compile the user's code together with the library than to load the UDF dynamically.

...
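To make the Input, Map, Reduce, Result flow concrete, here is a minimal single-process sketch in Python. It is not the real MapReduce library (which these notes describe as a C++ library running across many machines). The helper `run_mapreduce`, the function names, and the sample documents and URLs are all made up for illustration. The reverse-links reducer stops at the list of source pages per target and leaves out the ranking step.

```python
import re
from collections import defaultdict


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical in-memory driver: map every input pair, group the
    intermediate pairs by key, then reduce each group."""
    groups = defaultdict(list)
    for key, value in inputs:
        # Map: one input pair may emit many intermediate (key, value) pairs.
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    # Reduce: K1, <v1, v1', ...> becomes K1, R(v1, v1', ...).
    return {k: reduce_fn(k, values) for k, values in groups.items()}


# Word Count: emit (word, 1) for each word, then sum the ones per word.
def wc_map(doc_name, text):
    for word in text.lower().split():
        yield word, 1


def wc_reduce(word, counts):
    return sum(counts)


# Reverse Links: emit (target, source) for each link found in a page,
# then collect every source that points at the same target.
def links_map(url, html):
    # Assumes links look like href="..."; real HTML needs a proper parser.
    for target in re.findall(r'href="([^"]+)"', html):
        yield target, url


def links_reduce(target, sources):
    return sorted(set(sources))


# Made-up sample inputs.
docs = [("d1", "the quick fox"), ("d2", "the lazy dog the end")]
word_counts = run_mapreduce(docs, wc_map, wc_reduce)

pages = [
    ("a.example", '<a href="c.example">c</a>'),
    ("b.example", '<a href="c.example">c</a> <a href="a.example">a</a>'),
]
reverse_links = run_mapreduce(pages, links_map, links_reduce)

print(word_counts)
print(reverse_links)
```

The user only writes the map and reduce functions and the driver owns the grouping step. This is the split the notes describe: application programmers supply the UDFs, and the library takes care of running the job.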