Apache Spark - PySpark
What is Apache Spark?
An open-source framework for distributed data processing, from batch to streaming, built for big data workloads.
It provides an analytics engine that is much faster than Hadoop MapReduce.
It has 3 main data structures, namely RDDs (Resilient Distributed Datasets), DataFrames, and Datasets.
What is an RDD?
It is a low-level, object-oriented data structure that gives more control but is less optimized.
It is a fault-tolerant, immutable, distributed collection of objects that can be operated on in parallel.
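A minimal PySpark sketch of creating and operating on an RDD in parallel (the sample data and app name are invented for illustration):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Distribute a local list across the cluster as an RDD.
nums = sc.parallelize([1, 2, 3, 4, 5])

# Transformations are lazy; nothing runs until an action is called.
squares = nums.map(lambda x: x * x)

# reduce() is an action: it triggers parallel execution and returns 55.
print(squares.reduce(lambda a, b: a + b))
```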
What is a DataFrame?
It is a distributed collection of data with a schema (like a table in a relational database), optimized by Catalyst.
Properties: Schema-aware, optimized with the Catalyst Optimizer, and language-agnostic.
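A small sketch of building a schema-aware DataFrame in PySpark (the schema and rows here are made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("df-demo").getOrCreate()

# An explicit schema makes the DataFrame behave like a relational table.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

df = spark.createDataFrame([("Alice", 30), ("Bob", 25)], schema)
df.printSchema()
df.filter(df.age > 26).show()  # declarative, SQL-like operations
```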
What is a Dataset?
It is a strongly typed collection of JVM objects that combines the best of RDDs and DataFrames.
It is only available in Scala and Java; since Python is not a compile-time-typed JVM language, PySpark users work with DataFrames instead.
Properties: Strongly typed, optimized with Catalyst and encoders.
What is the difference among RDD, DataFrame and Dataset?
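A summary, drawn from the answers above:

| Aspect | RDD | DataFrame | Dataset |
| --- | --- | --- | --- |
| Abstraction | Low-level, object-oriented | High-level, tabular | High-level, strongly typed |
| Schema | No | Yes | Yes |
| Catalyst optimization | No | Yes | Yes (with encoders) |
| Availability | Scala, Java, Python, R | Scala, Java, Python, R | Scala and Java only |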
Which data structure should we prefer for Spark workloads?
DataFrames strike the best balance between developer productivity and execution performance, so they are preferred for the following reasons (see the sketch after this list):

- Easy to use: They provide a high-level API, similar to SQL or pandas, that abstracts away much of the complexity of distributed processing.
- Catalyst Optimizer: It builds a logical and a physical plan for our transformations, then optimizes them with:
    - Rule-based optimization: predicate pushdown, column pruning.
    - Cost-based optimization: choosing the most efficient physical execution plan based on statistics.
- Tungsten Execution Engine: Spark's Tungsten engine performs operations directly on serialized binary data in memory, reducing memory overhead and improving CPU utilization, which leads to significantly faster execution, via:
    - Memory management
    - Code generation
    - Cache locality
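To make these factors concrete, here is a hedged sketch (the data, column names, and view name are invented for illustration) showing the same query through the high-level DataFrame API and plain SQL, and how explain() exposes the Catalyst plans:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("catalyst-demo").getOrCreate()

df = spark.createDataFrame(
    [("Alice", 30, "NY"), ("Bob", 25, "SF")],
    ["name", "age", "city"],
)
df.createOrReplaceTempView("people")

# The same query, written two ways; both compile to the same Catalyst plan.
api_result = df.filter(col("age") > 26).select("name")
sql_result = spark.sql("SELECT name FROM people WHERE age > 26")
sql_result.show()  # same rows as api_result

# explain(True) prints the parsed, analyzed, and optimized logical plans plus
# the physical plan; column pruning drops the unused city column here.
api_result.explain(True)
```

In the printed physical plan, operators prefixed with `*` are compiled by Tungsten's whole-stage code generation.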