Distributed programming
Storage unit -> HDFS
Data is stored across nodes, split into blocks (128 MB each by default)
Blocks are replicated across different nodes with a replication factor of 3 (three copies of each block in total)
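The block-splitting and replication rules above amount to simple arithmetic. A minimal sketch, assuming the default 128 MB block size and replication factor 3 (the function and constant names are illustrative, not part of HDFS):

```python
import math

BLOCK_SIZE_MB = 128      # default HDFS block size (assumed default config)
REPLICATION_FACTOR = 3   # default HDFS replication factor

def block_layout(file_size_mb: float) -> tuple[int, int]:
    """Return (number of blocks, total block copies stored cluster-wide)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, blocks * REPLICATION_FACTOR

# A 300 MB file is split into 3 blocks; with 3 copies of each,
# the cluster stores 9 block copies in total.
print(block_layout(300))  # -> (3, 9)
```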
Processing -> MapReduce
YARN (Yet Another Resource Negotiator): the cluster resource manager, often described as a large-scale distributed operating system for Big Data processing
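The MapReduce model mentioned above can be illustrated without a cluster. A minimal single-machine sketch of word counting, with explicit map, shuffle, and reduce phases (plain Python standing in for the Hadoop framework):

```python
from collections import defaultdict
from itertools import chain

def map_phase(line: str):
    # Mapper: emit a (word, 1) pair for every word in the input split.
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reducer: sum the counts emitted for each word.
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["big data needs big clusters", "big data"]
pairs = chain.from_iterable(map_phase(line) for line in lines)
counts = reduce_phase(shuffle(pairs))
print(counts)  # -> {'big': 3, 'data': 2, 'needs': 1, 'clusters': 1}
```

On a real cluster, mappers run on the nodes holding the HDFS blocks, and the shuffle moves grouped pairs across the network to the reducers.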
Spark was designed to overcome limitations of MapReduce.
RDD: Resilient Distributed Dataset (the foundation)
Read-only (immutable) objects partitioned across the cluster
A dataset can be built from: files, SQL databases, NoSQL databases, HDFS
RDDs are the core abstraction of Spark
RDD processing happens in RAM (in-memory)
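The RDD properties above can be mimicked in plain Python. This `ToyRDD` class is a hypothetical stand-in, not Spark's real API: it shows a read-only dataset split into partitions, where transformations build a new dataset in memory instead of mutating the old one.

```python
class ToyRDD:
    """Toy stand-in for Spark's RDD (illustrative, not the real API):
    an immutable dataset split into partitions, processed in RAM."""

    def __init__(self, partitions):
        # Store partitions as tuples so the dataset is read-only.
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def parallelize(cls, data, num_partitions=3):
        # Slice the input into partitions, as a cluster would
        # distribute the data across nodes.
        chunks = [data[i::num_partitions] for i in range(num_partitions)]
        return cls(chunks)

    def map(self, fn):
        # Transformations never mutate: they return a new dataset.
        return ToyRDD([[fn(x) for x in p] for p in self._partitions])

    def collect(self):
        # Gather all partitions back into one local list.
        return [x for p in self._partitions for x in p]

rdd = ToyRDD.parallelize([1, 2, 3, 4, 5, 6])
doubled = rdd.map(lambda x: x * 2)   # rdd itself is unchanged
print(sorted(doubled.collect()))     # -> [2, 4, 6, 8, 10, 12]
```

In real Spark, `parallelize`, `map`, and `collect` exist on `SparkContext`/`RDD`, and partitions live in the RAM of the worker nodes rather than one process.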