Distributed programming

Hadoop

  • Storage unit -> HDFS

    • Data is stored across nodes, sliced into blocks (128 MB each by default)

    • Blocks are replicated on different nodes with a replication factor of 3 (three copies of each block in total)
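The block-and-replica arithmetic above can be sketched in plain Python. This is an illustration only (the names `split_into_blocks` and `total_storage` are made up, not HDFS APIs); it just shows how a file is cut into 128 MB blocks and how much raw storage replication consumes:

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default HDFS block size

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of this size occupies (illustrative helper)."""
    full, rest = divmod(file_size_bytes, block_size)
    return [block_size] * full + ([rest] if rest else [])

def total_storage(file_size_bytes, replication_factor=3):
    """Raw cluster storage consumed once every block has its replicas (illustrative helper)."""
    return sum(split_into_blocks(file_size_bytes)) * replication_factor

# A 300 MB file occupies three blocks: 128 MB + 128 MB + 44 MB,
# and with replication factor 3 it consumes 900 MB of raw storage.
blocks = split_into_blocks(300 * 1024 * 1024)
raw = total_storage(300 * 1024 * 1024)
```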

  • Processing -> MapReduce

  • YARN (resource manager and job scheduler, often described as a large-scale distributed operating system for Big Data processing)
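The MapReduce processing model can be sketched in plain Python. This is not Hadoop's API (a real job defines Mapper and Reducer classes on the cluster); it only mirrors the three phases — map, shuffle, reduce — on a word-count example:

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in an input line.
    return [(word, 1) for word in line.split()]

def shuffle_phase(pairs):
    # Shuffle: group all emitted values by key across mapper outputs.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Reducer: sum the counts collected for one word.
    return key, sum(values)

lines = ["spark overcomes mapreduce", "mapreduce writes to disk"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
# counts["mapreduce"] == 2, counts["spark"] == 1
```

In real Hadoop MapReduce, each phase boundary is materialized to disk, which is exactly the overhead Spark's in-memory model avoids.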

Apache Spark vs MapReduce

Spark was designed to overcome limitations of MapReduce.

  • RDD (Resilient Distributed Dataset) — the foundational abstraction

    • Read-only (immutable) collections of objects distributed across the cluster

    • A dataset can be built from: files, SQL databases, NoSQL databases, HDFS

  • ⚠️ Processing of RDDs happens in RAM (in-memory), avoiding MapReduce's disk writes between steps

  • RDDs are the core abstraction of Spark
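The RDD idea — immutable datasets transformed lazily into new datasets — can be sketched without a Spark installation. The `MiniRDD` class below is a made-up toy, not PySpark's API, but its `parallelize`/`map`/`filter`/`collect` names mirror the real methods on `SparkContext` and `RDD`:

```python
class MiniRDD:
    """Toy stand-in for a Spark RDD: read-only, with lazy transformations."""

    def __init__(self, compute):
        self._compute = compute  # deferred computation kept in memory

    @classmethod
    def parallelize(cls, data):
        items = tuple(data)  # immutable snapshot of the input
        return cls(lambda: iter(items))

    def map(self, f):
        # Transformation: returns a NEW dataset; the parent is never modified.
        return MiniRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._compute() if pred(x)))

    def collect(self):
        # Action: only here does the chained work actually run.
        return list(self._compute())

rdd = MiniRDD.parallelize([1, 2, 3, 4])
squares = rdd.map(lambda x: x * x).filter(lambda x: x > 4)
result = squares.collect()        # [9, 16]
original = rdd.collect()          # [1, 2, 3, 4] — the source RDD is unchanged
```

Real Spark adds the "resilient" part: the chain of transformations (the lineage) lets a lost partition be recomputed from its source instead of being restored from replicated disk copies.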
