TIL:
POSIX
Reasons for MapReduce (why use a distributed system?)
- Lots of data (about 1 petabyte), while each machine had only 160 GB of storage, so the data can’t be stored or processed on one machine
- Performance: I/O on a single machine is too slow
- Fault tolerance: tolerate machine and disk failures
- Application programmers don’t need to do systems work or make sure their job runs in a distributed fashion
Workflow
Input | Map | Reduce | Result |
---|---|---|---|
key, value | K1 <v1, v1', v1''> | K1 R(v1, v1', …) | |
K2, v2 | | | |
K3, v3 | | | |
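
The flow in the table can be sketched in a few lines of Python. This is a minimal single-process sketch, not how a real distributed run is scheduled; `run_mapreduce`, `wc_map`, and `wc_reduce` are hypothetical names chosen for illustration, and word count is used as the job.

```python
from collections import defaultdict

def run_mapreduce(inputs, map_fn, reduce_fn):
    """Single-process sketch of Input -> Map -> Reduce -> Result."""
    grouped = defaultdict(list)
    # Map: every (key, value) input record emits intermediate pairs.
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            # Group: collect all values that share an intermediate key,
            # giving K1 -> [v1, v1', v1''] as in the Map column.
            grouped[out_key].append(out_value)
    # Reduce: apply R to each key's list of values.
    return {k: reduce_fn(k, vs) for k, vs in grouped.items()}

# Hypothetical word-count job.
def wc_map(doc_name, text):
    for word in text.split():
        yield word, 1

def wc_reduce(word, counts):
    return sum(counts)

docs = [("doc1", "a b a"), ("doc2", "b c")]
counts = run_mapreduce(docs, wc_map, wc_reduce)
```

In a real deployment the map and reduce calls would run on many machines and the grouping step would move data over the network; the sketch only shows the shape of the data at each stage.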
Examples
- Word Count
- Sort
- Reverse links (used for page ranking)
  - Input - key: URL, value: HTML
  - Map - (<target1, src1>, <target2, src2>, …)
  - Reduce - target1, <src1, src1', …>
  - Result - rank
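
The reverse-links example might look roughly like this in Python. It is a sketch under assumptions: links are pulled out with a simple `href="..."` regular expression rather than a real HTML parser, the names `map_links`, `reduce_links`, and `reverse_links` are hypothetical, and the output is only the list of inbound sources per target, which a separate ranking step would then consume.

```python
import re
from collections import defaultdict

# Assumption: links appear as href="..." attributes; real HTML needs a parser.
HREF = re.compile(r'href="([^"]+)"')

def map_links(src_url, html):
    # Map: emit <target, src> for every link found on the source page.
    for target in HREF.findall(html):
        yield target, src_url

def reduce_links(target, sources):
    # Reduce: all sources that point at this target, deduplicated.
    return sorted(set(sources))

def reverse_links(pages):
    grouped = defaultdict(list)
    for src_url, html in pages:
        for target, src in map_links(src_url, html):
            grouped[target].append(src)
    return {t: reduce_links(t, srcs) for t, srcs in grouped.items()}

pages = [
    ("a.example", '<a href="c.example">c</a>'),
    ("b.example", '<a href="c.example">c</a> <a href="a.example">a</a>'),
]
inbound = reverse_links(pages)
```

Here `inbound` maps each target URL to the pages that link to it, which corresponds to the Reduce line above.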
Why was MapReduce developed as a library and not as a service?
Because loading a user-defined function (UDF) into an already running process was not practical in C++. It is easier to compile the user’s code together with the library than to load the UDF dynamically.