Apache Spark - PySpark

What is Apache Spark?

An open-source framework for distributed data processing, from batch to streaming, built for big data workloads.

It provides an analytics engine that is much faster than Hadoop MapReduce.

It has three main data structures, namely RDDs (Resilient Distributed Datasets), DataFrames, and Datasets.

What is an RDD?

It is a low-level, object-oriented data structure that offers more control but less optimization.

It is a fault-tolerant, immutable, distributed collection of objects that can be operated on in parallel.
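
A minimal PySpark sketch of the RDD API (assuming a local SparkSession; the sample numbers are made up). Transformations such as `map` are lazy; the `reduce` action triggers parallel execution across partitions:

```python
from pyspark.sql import SparkSession

# Build a local SparkSession; its SparkContext exposes the RDD API.
spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Distribute a local collection across partitions as an RDD.
numbers = sc.parallelize([1, 2, 3, 4, 5])

# map is a lazy transformation; reduce is an action that runs the job.
squares = numbers.map(lambda x: x * x)
print(squares.reduce(lambda a, b: a + b))  # 55
```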

What is a DataFrame?

It is a distributed collection of data with a schema (like a table in a relational database), optimized by the Catalyst optimizer.

Properties: schema-aware, optimized with the Catalyst Optimizer, and language-agnostic.
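
A minimal sketch of the DataFrame API (assuming a local SparkSession; the rows and column names are illustrative). The same high-level query could also be written in SQL:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("df-demo").getOrCreate()

# Create a DataFrame with named columns; Spark infers the schema.
df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])

# High-level, SQL-like transformations; Catalyst plans and optimizes them.
df.filter(F.col("age") > 30).select("name").show()
```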

What is a Dataset?

It is a strongly typed collection of JVM objects that combines the best of RDDs and DataFrames.

It is only available in Scala and Java; in PySpark, the DataFrame fills this role, since Python is dynamically typed.

Properties: strongly typed, optimized with Catalyst and Encoders.

What is the difference among RDD, DataFrame and Dataset?

|  | RDD | DataFrame | Dataset |
| --- | --- | --- | --- |
| Abstraction | Low-level, more control | High-level | High-level |
| Schema / typing | No schema; plain objects | Schema-aware rows | Strongly typed JVM objects |
| Optimization | Not optimized by Catalyst | Catalyst Optimizer | Catalyst Optimizer + Encoders |
| Languages | Scala, Java, Python | Scala, Java, Python, R | Scala and Java only |

Which data structure should we prefer for Spark workloads?

The DataFrame offers the best balance between developer productivity and execution performance, so it is preferred, for the following reasons:

  1. Easy to use: it provides a high-level API, similar to SQL or Pandas, that abstracts away much of the complexity of distributed processing.
  2. Catalyst Optimizer: it builds logical and physical plans for our transformations and then optimizes them (the sketch after this list shows how to inspect these plans) via:
    • Rule-based optimization: predicate pushdown, column pruning
    • Cost-based optimization: choosing the most efficient physical execution plan based on table statistics
  3. Tungsten Execution Engine: operations run on Spark's Tungsten engine directly on serialized binary data in memory, which reduces memory overhead and improves CPU utilization, leading to significantly faster execution:
    • Memory management
    • Code generation
    • Cache locality
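
A small sketch of how to inspect these Catalyst plans (the `people.parquet` path and the column names are hypothetical). `explain(True)` prints the parsed, analyzed, and optimized logical plans plus the physical plan, where predicate pushdown and column pruning show up in the Parquet scan as `PushedFilters` and a pruned `ReadSchema`:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").appName("catalyst-demo").getOrCreate()

# Hypothetical columnar source; any Parquet file will show pushdown in the plan.
df = spark.read.parquet("people.parquet")

# Only 'name' is selected and the filter is on 'age', so Catalyst prunes the
# remaining columns and pushes the predicate down into the Parquet scan.
query = df.filter(F.col("age") > 30).select("name")

# Print logical and physical plans; look for PushedFilters and ReadSchema.
query.explain(True)
```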