Distributed programming
Storage unit -> HDFS
Data is stored across nodes, split into blocks (128 MB each by default)
Blocks are replicated on different nodes with a default replication factor of 3 (three copies of each block in total)
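The block layout above can be sketched with a bit of arithmetic. A stdlib-only sketch (the constants mirror the HDFS defaults mentioned above; the function name is made up for illustration):

```python
import math

BLOCK_SIZE_MB = 128   # default HDFS block size
REPLICATION = 3       # default replication factor: three copies of each block

def hdfs_footprint(file_size_mb: float):
    """Return (number of blocks, raw storage used) for a file stored in HDFS."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # blocks are not padded, so raw storage is just file size times replication
    raw_storage_mb = file_size_mb * REPLICATION
    return blocks, raw_storage_mb

print(hdfs_footprint(500))  # a 500 MB file -> 4 blocks, 1500 MB raw storage
```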
Processing -> MapReduce
YARN (Yet Another Resource Negotiator): resource manager and job scheduler, often described as a large-scale distributed operating system for Big Data processing
Spark was designed to overcome limitations of MapReduce.
RDD - Resilient Distributed Dataset (the foundation of Spark)
Read-only (immutable) collections of objects partitioned across the cluster
A dataset can be built from: files, SQL databases, NoSQL databases, HDFS
⚠️ RDD processing happens in RAM (in-memory), which is the core of Spark and the main source of its speed advantage over MapReduce
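In PySpark the pattern is `sc.parallelize(data).map(f).filter(p).collect()`. A stdlib-only toy sketch of the two RDD properties noted above (read-only data, lazy transformations that only run when an action is called); `MiniRDD` is a made-up class, not part of Spark:

```python
class MiniRDD:
    """Toy stand-in for an RDD: immutable data, lazy transformations."""

    def __init__(self, compute):
        self._compute = compute  # zero-arg function yielding the elements

    @classmethod
    def parallelize(cls, data):
        items = list(data)  # snapshot: the dataset itself is never mutated
        return cls(lambda: iter(items))

    def map(self, f):
        # transformations return a NEW RDD; nothing executes yet (lazy)
        return MiniRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._compute() if pred(x)))

    def collect(self):
        # actions trigger the actual in-memory computation
        return list(self._compute())

rdd = MiniRDD.parallelize(range(10))
result = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0).collect()
print(result)  # [0, 4, 16, 36, 64]
```

A real RDD adds what the sketch leaves out: partitioning across nodes and lineage tracking, so lost partitions can be recomputed (the "resilient" part).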