Ans: Apache Spark is a cluster computing framework that runs on a cluster of commodity hardware and performs data unification, i.e., reading and writing a wide variety of data from multiple sources. In Spark, a task is an operation that can be a map task or a reduce task. SparkContext handles the execution of the job and also provides APIs in different languages (Scala, Java, and Python) to develop applications, and it delivers faster execution compared to MapReduce.
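As a minimal, hedged sketch in PySpark (the app name, master URL, and data are placeholder values, not part of the original answer), creating a SparkContext and running a small job looks roughly like this:

```python
# Minimal sketch: create a SparkContext and run a simple job.
# "local[*]" and the app name are placeholder values.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("example-app").setMaster("local[*]")
sc = SparkContext(conf=conf)

rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * 2).collect())  # [2, 4, 6, 8]

sc.stop()
```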
Ans: Spark Core acts as the base engine for large-scale parallel and distributed data processing. It is the distributed execution engine used in conjunction with the Java, Python, and Scala APIs that offer a platform for distributed ETL (Extract, Transform, Load) application development.
Various functions of Spark Core include memory management, fault recovery, scheduling and monitoring of jobs, and task dispatching.
Ans: When Spark operates on any dataset, it only remembers the instructions. When a transformation such as map() is called on an RDD, the operation is not performed instantly. Transformations in Spark are not evaluated until you perform an action. This is known as lazy evaluation, and it aids in optimizing the overall data processing workflow.
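A small illustrative sketch of lazy evaluation in PySpark (the local master and sample data are assumptions for the example): the transformations are only recorded, and nothing runs until the action is called.

```python
# Lazy evaluation demo: transformations are recorded, not executed.
from pyspark import SparkContext

sc = SparkContext("local[*]", "lazy-eval-demo")  # assumed local master

numbers = sc.parallelize(range(10))
doubled = numbers.map(lambda x: x * 2)         # transformation: only recorded
evens = doubled.filter(lambda x: x % 4 == 0)   # transformation: still not run

print(evens.count())  # action: triggers the actual computation
sc.stop()
```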
Ans: A sparse vector has two parallel arrays, one for indices and the other for values. These vectors are used for storing non-zero entries to save space.
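A brief sketch using Spark MLlib's Vectors.sparse (the size, indices, and values are arbitrary example numbers): the two parallel arrays are the indices and the values.

```python
# Sparse vector of size 6 with non-zero entries at indices 1 and 4.
from pyspark.ml.linalg import Vectors

sv = Vectors.sparse(6, [1, 4], [3.0, 5.5])
print(sv)            # (6,[1,4],[3.0,5.5])
print(sv.toArray())  # [0.  3.  0.  0.  5.5 0. ]
```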
Ans: In general, Apache Spark is a DAG (graph) execution engine that enables users to analyze massive data sets with high performance. To achieve this, the data first needs to be held in memory, which improves performance drastically when it has to be manipulated through multiple stages of processing.
Ans: Resilient Distributed Datasets are sets of data items that are so huge that they are not suitable for a single node and have to be partitioned across several nodes. Spark automatically partitions RDDs and distributes the partitions across different nodes. A partition in Spark is an atomic chunk of data stored on a node in the cluster. RDDs in Apache Spark are sets of such partitions.
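A small PySpark sketch of partitioning (the partition count of 4 is an arbitrary example value):

```python
# Show how an RDD is split into partitions.
from pyspark import SparkContext

sc = SparkContext("local[*]", "partition-demo")

rdd = sc.parallelize(range(100), numSlices=4)
print(rdd.getNumPartitions())          # 4
print(rdd.glom().map(len).collect())   # elements held in each partition
sc.stop()
```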
Ans: The Spark SQL module helps in integrating relational processing with Spark’s functional programming API. It supports querying data via SQL or HiveQL (Hive Query Language).
Also, Spark SQL supports a wide range of data sources and allows weaving SQL queries with code transformations. The DataFrame API, Data Source API, Interpreter & Optimizer, and SQL Service are the four libraries contained in Spark SQL.
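A minimal Spark SQL sketch (the column names and rows are made up for illustration): a DataFrame is registered as a temporary view and then queried with SQL.

```python
# Register a DataFrame as a temporary view and query it with SQL.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE age > 30").show()
spark.stop()
```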
Ans: Parquet is a columnar format that is supported by several data processing systems. With it, Spark SQL performs both read and write operations. Columnar storage has the following advantages:
Able to fetch specific columns for access
Consumes less space
Follows type-specific encoding
Limited I/O operations
Offers better-summarized data
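As a hedged illustration of these points (the output path and sample data are placeholders), writing and then reading a Parquet file with Spark SQL looks like this:

```python
# Write and read Parquet; selecting one column benefits from columnar storage.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-demo").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 29)], ["name", "age"])
df.write.mode("overwrite").parquet("/tmp/people.parquet")   # write (placeholder path)

people = spark.read.parquet("/tmp/people.parquet")          # read
people.select("name").show()  # column pruning: only 'name' is fetched
spark.stop()
```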
Ans: Some of the limitations of using PySpark are:
It is sometimes difficult to express a problem in MapReduce fashion.
Also, it is sometimes not as efficient as other programming models.
Ans: It is assumed that readers already know what a programming language and a framework are before proceeding with the various concepts given in this tutorial. Prior knowledge of Spark and Python will also be very helpful.
Ans: Spark has the following benefits over MapReduce:
Due to the availability of in-memory processing, Spark executes processing around 10 to 100 times faster than Hadoop MapReduce, which uses persistent storage for all of its data processing tasks.
Unlike Hadoop, Spark provides inbuilt libraries to perform multiple tasks from the same core, such as batch processing, streaming, machine learning, and interactive SQL queries. Hadoop, however, only supports batch processing.
Hadoop is highly disk-dependent whereas Spark promotes caching and in-memory data storage.
Spark is capable of performing computations multiple times on the same dataset; this is called iterative computation. Hadoop does not implement iterative computing.
Ans: Similar to Hadoop MapReduce, Spark can run on YARN, which is one of the key features of Hadoop, providing a central resource management platform to deliver scalable operations across the cluster. YARN is a distributed container manager, like Mesos for example, whereas Spark is a data processing tool. Spark can run on YARN the same way Hadoop MapReduce can run on YARN. Running Spark on YARN necessitates a binary distribution of Spark that is built with YARN support.
Ans: No, because Spark runs on top of YARN, independently of its installation. Spark has options to use YARN when dispatching jobs to the cluster, rather than its own built-in manager or Mesos. Further, there are some configurations for running on YARN, including master, deploy-mode, driver-memory, executor-memory, executor-cores, and queue.
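As a hedged sketch (the memory, core, and queue values are placeholders, and a working YARN setup with HADOOP_CONF_DIR is assumed), these settings can be expressed from PySpark roughly as follows; they correspond to the spark-submit options listed above.

```python
# Point a PySpark application at YARN; all values below are example placeholders.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("yarn-demo")
         .master("yarn")
         .config("spark.submit.deployMode", "client")
         .config("spark.driver.memory", "2g")
         .config("spark.executor.memory", "4g")
         .config("spark.executor.cores", "2")
         .config("spark.yarn.queue", "default")
         .getOrCreate())

print(spark.range(10).count())
spark.stop()
```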
Ans: Accumulators are write-only variables that are initialized once and sent to the workers. The workers update them based on the logic written in the program and send them back to the driver, which aggregates or processes the results based on that logic.
Only the driver can access the accumulator’s value; for tasks, accumulators are write-only. For example, an accumulator can be used to count the number of errors seen in an RDD across workers.
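A short PySpark sketch of the error-counting example (the sample records are made up): workers only add to the accumulator, and the driver reads the final value.

```python
# Count "error" records across workers with an accumulator.
from pyspark import SparkContext

sc = SparkContext("local[*]", "accumulator-demo")

error_count = sc.accumulator(0)
lines = sc.parallelize(["ok", "error: disk", "ok", "error: net"])

def check(line):
    if line.startswith("error"):
        error_count.add(1)   # workers only add; they cannot read the value

lines.foreach(check)
print(error_count.value)     # driver reads the aggregated value: 2
sc.stop()
```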
Ans: Parquet is a columnar format that is supported by several data processing systems. With the Parquet file, Spark can perform both read and write operations.
Some of the advantages of having a Parquet file are:
It enables you to fetch specific columns for access.
It consumes less space.
It follows type-specific encoding.
It supports limited I/O operations.
Ans: Spark Core is the engine for parallel and distributed processing of large data sets. The various functionalities supported by Spark Core include:
Scheduling and monitoring jobs
Memory management
Fault recovery
Task dispatching
Ans: The FS API can read data from different storage systems like HDFS, S3, or the local file system. Spark uses the FS API to read data from these different storage engines.
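A brief sketch (all three paths are placeholders, and the S3 example assumes the appropriate Hadoop/S3 libraries are available) showing the same read API pointed at different storage back-ends:

```python
# Same read API, different storage systems; all paths are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fs-demo").getOrCreate()

local_df = spark.read.text("file:///tmp/input.txt")                  # local file system
hdfs_df  = spark.read.text("hdfs://namenode:8020/data/input.txt")    # HDFS
s3_df    = spark.read.text("s3a://my-bucket/data/input.txt")         # S3 (needs hadoop-aws)
```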
Ans: Every transformation generates a new partition. Partitions use the HDFS API so that they are immutable, distributed, and fault-tolerant. Partitions are also aware of data locality.
Ans: Actions are RDD operations whose values are returned back to the Spark driver program; they kick off a job to execute on the cluster. A transformation’s output is an input to actions. reduce, collect, takeSample, take, first, saveAsTextFile, saveAsSequenceFile, countByKey, and foreach are common actions in Apache Spark.
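A short sketch of a few of these actions in PySpark (the data and output path are placeholders); each call triggers a job on the cluster.

```python
# Common RDD actions; each one launches a job.
from pyspark import SparkContext

sc = SparkContext("local[*]", "actions-demo")
rdd = sc.parallelize([1, 2, 3, 4, 5])

print(rdd.collect())                    # [1, 2, 3, 4, 5]
print(rdd.reduce(lambda a, b: a + b))   # 15
print(rdd.take(2))                      # [1, 2]
print(rdd.first())                      # 1
rdd.saveAsTextFile("/tmp/actions-demo-output")  # placeholder output path
sc.stop()
```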
Ans: Yes, Apache Spark provides checkpoints. They allow a program to run around the clock in addition to making it resilient towards failures not related to the application logic. Lineage graphs are used for recovering RDDs from a failure.
Apache Spark comes with an API for adding and managing checkpoints. The user then decides which data to checkpoint. Checkpoints are preferred over lineage graphs when the latter are long and have wide dependencies.
Ans: Although the intermediate data from different shuffle operations automatically persists in Spark, it is recommended to use the persist() method on an RDD if the data is to be reused.
Apache Spark features several persistence levels for storing the RDDs on disk, memory, or a combination of the two with distinct replication levels; a usage sketch follows the list. These various persistence levels are:
DISK_ONLY - Stores the RDD partitions only on the disk.
MEMORY_AND_DISK - Stores RDD as deserialized Java objects in the JVM. In case the RDD isn’t able to fit in the memory, additional partitions are stored on the disk. These are read from here each time the requirement arises.
MEMORY_ONLY_SER - Stores RDD as serialized Java objects with one-byte array per partition.
MEMORY_AND_DISK_SER - Identical to MEMORY_ONLY_SER with the exception of storing partitions not able to fit in the memory to the disk in place of recomputing them on the fly when required.
MEMORY_ONLY - The default level, it stores the RDD as deserialized Java objects in the JVM. In case the RDD isn’t able to fit in the memory available, some partitions won’t be cached, resulting in recomputing the same on the fly every time they are required.
OFF_HEAP - Works like MEMORY_ONLY_SER but stores the data in off-heap memory.
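A brief usage sketch of choosing a persistence level explicitly (MEMORY_AND_DISK is used here only as an example; any of the levels above could be passed):

```python
# Persist an RDD so later actions reuse the cached partitions.
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[*]", "persist-demo")
rdd = sc.parallelize(range(1000)).map(lambda x: x * x)

rdd.persist(StorageLevel.MEMORY_AND_DISK)  # keep partitions for reuse
print(rdd.count())   # first action materializes and caches the RDD
print(rdd.sum())     # reuses the persisted partitions
sc.stop()
```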
Ans: The program that runs on the master node of a machine and declares transformations and actions on data RDDs is called the Spark Driver. In simple words, a driver in Spark creates the SparkContext, connected to a given Spark Master.
The Spark Driver also delivers RDD graphs to the Master when the standalone Cluster Manager runs.
Ans: Yes, Apache Spark provides an API for adding and managing checkpoints. Checkpointing is the process of making streaming applications resilient to failures. It allows you to save the data and metadata into a checkpointing directory. In case of a failure, Spark can recover this data and start from wherever it stopped.
There are 2 types of data for which we can use checkpointing in Spark.
Metadata Checkpointing: Metadata means the data about data. It refers to saving the metadata to fault-tolerant storage like HDFS. Metadata includes configurations, DStream operations, and incomplete batches.
Data Checkpointing: Here, we save the RDD to reliable storage because some stateful transformations need it. In such cases, the upcoming RDD depends on the RDDs of previous batches.
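A hedged Spark Streaming sketch of enabling checkpointing (the checkpoint directory, socket source, and batch interval are placeholder values):

```python
# Enable checkpointing for a stateful streaming word count.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "checkpoint-demo")
ssc = StreamingContext(sc, batchDuration=10)

ssc.checkpoint("hdfs://namenode:8020/checkpoints/demo")  # placeholder checkpoint directory

lines = ssc.socketTextStream("localhost", 9999)          # placeholder source
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .updateStateByKey(lambda new, old: sum(new) + (old or 0)))  # stateful op needs checkpointing
counts.pprint()

ssc.start()
ssc.awaitTermination()
```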
Ans: Yes, it is possible if you use the Spark Cassandra Connector.
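A hedged PySpark sketch, assuming the spark-cassandra-connector package is on the classpath (for example via --packages) and using placeholder keyspace, table, and host values:

```python
# Read a Cassandra table as a DataFrame via the Spark Cassandra Connector.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-demo")
         .config("spark.cassandra.connection.host", "127.0.0.1")  # placeholder host
         .getOrCreate())

df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="my_table")          # placeholder names
      .load())
df.show()
```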
Ans: Minimizing data transfers and avoiding shuffling helps write Spark programs that run in a fast and reliable manner. The various ways in which data transfers can be minimized when working with Apache Spark are:
Using Broadcast Variables - Broadcast variables enhance the efficiency of joins between small and large RDDs (see the sketch after this list).
Using Accumulators – Accumulators help update the values of variables in parallel while executing.
The most common way is to avoid ByKey operations, repartition, or any other operations that trigger shuffles.
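A small sketch of the broadcast-variable approach from the list above (the lookup table and order data are made up for illustration): each task reads a local copy of the small table instead of shuffling it.

```python
# Join a small lookup table against a large RDD without a shuffle.
from pyspark import SparkContext

sc = SparkContext("local[*]", "broadcast-demo")

country_codes = sc.broadcast({"IN": "India", "US": "United States"})
orders = sc.parallelize([("IN", 100), ("US", 250), ("IN", 75)])

# Each task reads the broadcast copy locally instead of shuffling the small table.
named = orders.map(lambda kv: (country_codes.value.get(kv[0], "Unknown"), kv[1]))
print(named.collect())
sc.stop()
```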