PySpark Interview Questions and Answers
by Sathish, on Aug 10, 2020 3:35:55 PM
Q1. What is Apache Spark?
Ans: Apache Spark is a cluster computing framework that runs on a cluster of commodity hardware and performs data unification, i.e., reading and writing a wide variety of data from multiple sources. In Spark, a task is an operation that can be a map task or a reduce task. The SparkContext handles the execution of the job and also provides APIs in different languages (Scala, Java, and Python) for developing applications, with faster execution than MapReduce.
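As a minimal sketch (assuming a local PySpark installation; the master and app name are arbitrary choices), a SparkContext is typically obtained through a SparkSession:

```python
from pyspark.sql import SparkSession

# Build a SparkSession; "local[*]" and the app name are illustrative.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("interview-prep")
         .getOrCreate())

sc = spark.sparkContext   # the underlying SparkContext
print(sc.version)

spark.stop()
```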
Q2. What are the various functions of Spark Core?
Ans: Spark Core acts as the base engine for large-scale parallel and distributed data processing. It is the distributed execution engine used in conjunction with the Java, Python, and Scala APIs that offer a platform for distributed ETL (Extract, Transform, Load) application development.
Various functions of Spark Core are:
- Distributing, monitoring, and scheduling jobs on a cluster.
- Interacting with storage systems.
- Memory management and fault recovery.
Q3. What is lazy evaluation in Spark?
Ans: When Spark operates on any dataset, it remembers the instructions rather than executing them immediately. When a transformation such as map() is called on an RDD, the operation is not performed instantly. Transformations in Spark are not evaluated until you perform an action, which helps optimize the overall data processing workflow. This behavior is known as lazy evaluation.
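A small illustration of this behavior (the data and names are made up): the map() call only records the transformation, and nothing executes until the collect() action runs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("lazy-demo").getOrCreate()

rdd = spark.sparkContext.parallelize([1, 2, 3, 4])

# Transformation: recorded in the lineage, not executed yet.
squared = rdd.map(lambda x: x * x)

# Action: triggers the actual computation.
print(squared.collect())   # [1, 4, 9, 16]
```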
Q4. What is a Sparse Vector?
Ans: A sparse vector has two parallel arrays, one for indices and the other for values. These vectors are used for storing non-zero entries to save space.
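For example, a sparse vector can be built with pyspark.ml.linalg.Vectors (the size and values below are arbitrary):

```python
from pyspark.ml.linalg import Vectors

# A length-6 vector with non-zero values only at indices 1 and 4.
sparse = Vectors.sparse(6, [1, 4], [3.0, 7.5])

print(sparse)            # (6,[1,4],[3.0,7.5])
print(sparse.toArray())  # [0.  3.  0.  0.  7.5 0. ]
```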
Q5. Explain Spark Execution Engine?
Ans: In general, Apache Spark is a graph execution engine that enables users to analyze massive data sets with high performance. If data needs to be manipulated through multiple stages of processing, holding it in memory allows Spark to improve performance drastically.
Q6. What is a partition in Apache Spark?
Ans: Resilient Distributed Datasets are sets of data items so large that they do not fit on a single node and must be partitioned across several nodes. Spark automatically partitions RDDs and distributes the partitions across different nodes. A partition in Spark is an atomic chunk of data stored on a node in the cluster; RDDs in Apache Spark are collections of partitions.
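A quick sketch of how partitions can be inspected and changed (the data and partition counts are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[4]").appName("partition-demo").getOrCreate()

# Ask for 4 partitions explicitly; otherwise Spark picks a default.
rdd = spark.sparkContext.parallelize(range(100), numSlices=4)
print(rdd.getNumPartitions())                  # 4

# repartition() reshuffles the data into a new number of partitions.
print(rdd.repartition(8).getNumPartitions())   # 8
```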
Q7. Tell us how will you implement SQL in Spark?
Ans: The Spark SQL module integrates relational processing with Spark's functional programming API. It supports querying data via SQL or HiveQL (Hive Query Language).
Spark SQL also supports a wide range of data sources and allows weaving SQL queries with code transformations. The DataFrame API, Data Source API, Interpreter & Optimizer, and SQL Service are the four libraries contained in Spark SQL.
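A minimal sketch of mixing SQL with the DataFrame API (the table name, columns, and rows are made up):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()

df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")   # register a temporary SQL view

# Query the view with plain SQL; the result is again a DataFrame.
spark.sql("SELECT name FROM people WHERE age > 40").show()
```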
Q8. What do you understand by the Parquet file?
Ans: Parquet is a columnar format that is supported by several data processing systems. With it, Spark SQL performs both read and write operations. Columnar storage has the following advantages:
- Able to fetch specific columns for access
- Consumes less space
- Follows type-specific encoding
- Limited I/O operations
- Offers better-summarized data
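A minimal read/write round trip with Parquet might look like the following (the output path is a placeholder):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("parquet-demo").getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.write.mode("overwrite").parquet("/tmp/example_parquet")   # placeholder path

# Column pruning: only the selected column needs to be read from disk.
spark.read.parquet("/tmp/example_parquet").select("id").show()
```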
Q9. Cons of PySpark?
Ans: Some of the limitations of using PySpark are:
- It is sometimes difficult to express a problem in MapReduce fashion.
- Sometimes, it is not as efficient as other programming models.
Q10. Prerequisites to learn PySpark?
Ans: It is assumed that readers already know what a programming language and a framework are before proceeding with the various concepts given in this tutorial. Some prior knowledge of Spark and Python will also be very helpful.
Q11. What are the benefits of Spark over MapReduce?
Ans: Spark has the following benefits over MapReduce:
- Due to the availability of in-memory processing, Spark performs processing around 10 to 100 times faster than Hadoop MapReduce, which relies on persistent storage for its data processing tasks.
- Unlike Hadoop, Spark provides built-in libraries to perform multiple tasks from the same core, such as batch processing, streaming, machine learning, and interactive SQL queries. Hadoop only supports batch processing.
- Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
- Spark is capable of performing computations multiple times on the same dataset. This is called iterative computation, and there is no built-in support for iterative computing in Hadoop.
Q12. What is YARN?
Ans: Similar to Hadoop, YARN is one of the key features in Spark, providing a central resource management platform that delivers scalable operations across the cluster. YARN is a distributed container manager (like Mesos, for example), whereas Spark is a data processing tool. Spark can run on YARN the same way Hadoop MapReduce can. Running Spark on YARN requires a binary distribution of Spark that is built with YARN support.
Q13. Do you need to install Spark on all nodes of the YARN cluster?
Ans: No, because Spark runs on top of YARN; Spark runs independently of where it is installed. Spark has options to use YARN when dispatching jobs to the cluster, rather than its own built-in manager or Mesos. Further, there are several configurations for running on YARN, including master, deploy-mode, driver-memory, executor-memory, executor-cores, and queue, as sketched below.
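A sketch of those settings expressed as SparkSession configuration keys (the values are placeholders, and this assumes the script runs on a machine that can reach a YARN cluster; in practice most of these are passed to spark-submit instead):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("yarn")
         .appName("yarn-demo")
         .config("spark.submit.deployMode", "client")
         .config("spark.executor.memory", "4g")
         .config("spark.executor.cores", "2")
         .config("spark.yarn.queue", "default")
         # driver memory is usually given to spark-submit, since the driver
         # JVM has already started by the time this code runs
         .getOrCreate())
```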
Q14. What are Accumulators?
Ans: Accumulators are write-only variables that are initialized once and sent to the workers. The workers update them based on the logic written and send them back to the driver, which aggregates or processes the results.
Only the driver can access an accumulator's value; for tasks, accumulators are write-only. For example, an accumulator can be used to count the number of errors seen in an RDD across workers.
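A small sketch of counting errors with an accumulator (the log lines are invented):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("acc-demo").getOrCreate()
sc = spark.sparkContext

error_count = sc.accumulator(0)

def count_errors(line):
    if "ERROR" in line:
        error_count.add(1)      # tasks can only add to the accumulator

logs = sc.parallelize(["ok", "ERROR: disk full", "ok", "ERROR: timeout"])
logs.foreach(count_errors)

print(error_count.value)        # 2 -- only the driver can read the value
```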
Q15. What is a Parquet file and what are its advantages?
Ans: Parquet is a columnar format that is supported by several data processing systems. With the Parquet file, Spark can perform both read and write operations.
Some of the advantages of having a Parquet file are:
- It enables you to fetch specific columns for access.
- It consumes less space.
- It follows type-specific encoding.
- It limits I/O operations.
Q16. What are the various functionalities supported by Spark Core?
Ans: Spark Core is the engine for parallel and distributed processing of large data sets. The various functionalities supported by Spark Core include:
- Scheduling and monitoring jobs
- Memory management
- Fault recovery
- Task dispatching
Q17. What is File System API?
Ans: The FileSystem (FS) API can read data from different storage systems such as HDFS, S3, or the local file system. Spark uses the FS API to read data from different storage engines.
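For illustration, the same reader API works across storage backends (all paths below are placeholders and assume the relevant connectors and credentials are configured):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("fs-demo").getOrCreate()

local_df = spark.read.csv("file:///tmp/data.csv", header=True)    # local file system
hdfs_df = spark.read.text("hdfs://namenode:8020/logs/app.log")    # HDFS
s3_df = spark.read.parquet("s3a://my-bucket/events/")             # S3 via the s3a connector
```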
Q18. Why Partitions are immutable?
Ans: Every transformation generates a new partition rather than modifying an existing one. Partitions use the HDFS API, so a partition is immutable, distributed, and fault tolerant. Partitions are also aware of data locality.
Q19. What is Action in Spark?
Ans: Actions are RDD operations that return a value back to the Spark driver program and kick off a job to execute on the cluster. A transformation's output is the input of an action. reduce(), collect(), takeSample(), take(), first(), saveAsTextFile(), saveAsSequenceFile(), countByKey(), and foreach() are common actions in Apache Spark.
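A few of these actions in use (the numbers and the output path are arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("actions-demo").getOrCreate()
rdd = spark.sparkContext.parallelize([3, 1, 4, 1, 5, 9])

print(rdd.reduce(lambda a, b: a + b))   # 23 -- aggregate all elements
print(rdd.collect())                    # bring every element to the driver
print(rdd.take(2))                      # [3, 1]
print(rdd.first())                      # 3
print(rdd.count())                      # 6

rdd.saveAsTextFile("/tmp/numbers_out")  # placeholder path; writes the RDD to text files
```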
Q20. Does Apache Spark provide checkpoints?
Ans: Yes, Apache Spark provides checkpoints. They allow a program to run around the clock and make it resilient to failures unrelated to the application logic. Lineage graphs are used for recovering RDDs from a failure.
Apache Spark comes with an API for adding and managing checkpoints. The user decides which data to checkpoint. Checkpoints are preferred over lineage graphs when the latter are long and have wide dependencies.
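A minimal sketch of RDD checkpointing (the checkpoint directory is a placeholder; on a cluster it should be reliable storage such as HDFS):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("checkpoint-demo").getOrCreate()
sc = spark.sparkContext

sc.setCheckpointDir("/tmp/spark_checkpoints")   # placeholder directory

rdd = sc.parallelize(range(10)).map(lambda x: x * 2)
rdd.checkpoint()                # mark the RDD for checkpointing
rdd.count()                     # an action triggers the job and the checkpoint
print(rdd.isCheckpointed())     # True
```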
Q21. What are the different levels of persistence in Spark?
Ans: Although the intermediary data from different shuffle operations automatically persists in Spark, it is recommended to use the persist() method on the RDD if the data is to be reused.
Apache Spark features several persistence levels for storing the RDDs on disk, memory, or a combination of the two with distinct replication levels. These various persistence levels are:
DISK_ONLY - Stores the RDD partitions only on the disk.
MEMORY_AND_DISK - Stores RDD as deserialized Java objects in the JVM. In case the RDD isn’t able to fit in the memory, additional partitions are stored on the disk. These are read from here each time the requirement arises.
MEMORY_ONLY_SER - Stores RDD as serialized Java objects with one-byte array per partition.
MEMORY_AND_DISK_SER - Identical to MEMORY_ONLY_SER with the exception of storing partitions not able to fit in the memory to the disk in place of recomputing them on the fly when required.
MEMORY_ONLY - The default level, it stores the RDD as deserialized Java objects in the JVM. In case the RDD isn’t able to fit in the memory available, some partitions won’t be cached, resulting in recomputing the same on the fly every time they are required.
OFF_HEAP - Works like MEMORY_ONLY_SER but stores the data in off-heap memory.
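A minimal sketch of choosing one of these levels explicitly (the data is arbitrary):

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("persist-demo").getOrCreate()
rdd = spark.sparkContext.parallelize(range(1000))

# cache() uses the default level; persist() lets you pick one explicitly.
rdd.persist(StorageLevel.MEMORY_AND_DISK)
rdd.count()        # the first action materializes the persisted data

rdd.unpersist()    # release the storage when it is no longer needed
```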
Q22. Describe Spark Driver
Ans: The program that runs on the master node of a machine and declares transformations and actions on data RDDs is called the Spark Driver. In simple words, a driver in Spark creates the SparkContext, connected to a given Spark Master.
Spark Driver also delivers RDD graphs to Master, when the standalone Cluster Manager runs.
Q23. Does Apache Spark provide checkpoints?
Ans: Yes, Apache Spark provides an API for adding and managing checkpoints. Checkpointing is the process of making streaming applications resilient to failures. It allows you to save the data and metadata into a checkpointing directory. In case of a failure, Spark can recover this data and resume from wherever it stopped.
There are two types of data for which we can use checkpointing in Spark:
Metadata Checkpointing: Metadata means data about data. It refers to saving the metadata to fault-tolerant storage like HDFS. Metadata includes configurations, DStream operations, and incomplete batches.
Data Checkpointing: Here, we save the RDD to reliable storage because it is needed by some of the stateful transformations, in which the upcoming RDD depends on the RDDs of previous batches.
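A sketch of enabling checkpointing with the (legacy) DStream API; the checkpoint directory, host, and port are placeholders:

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-checkpoint-demo")
ssc = StreamingContext(sc, batchDuration=5)

# Enables metadata checkpointing (and data checkpointing for stateful ops);
# the directory should live on fault-tolerant storage such as HDFS.
ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")

lines = ssc.socketTextStream("localhost", 9999)   # placeholder source
counts = (lines.flatMap(lambda l: l.split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

# ssc.start(); ssc.awaitTermination()
```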
Q24. Can you use Spark to access and analyze data stored in Cassandra databases?
Ans: Yes, it is possible if you use the Spark Cassandra Connector.
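A sketch of reading a Cassandra table through the connector (this assumes the spark-cassandra-connector package is on the classpath, e.g. via --packages; the host, keyspace, table, and column names are placeholders):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cassandra-demo")
         .config("spark.cassandra.connection.host", "127.0.0.1")   # placeholder host
         .getOrCreate())

df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_keyspace", table="users")              # placeholder names
      .load())

df.filter("age > 30").show()
```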
Q25. How can you minimize data transfers when working with Spark?
Ans: Minimizing data transfers and avoiding shuffling helps write Spark programs that run in a fast and reliable manner. The various ways in which data transfers can be minimized when working with Apache Spark are:
- Using broadcast variables – broadcast variables enhance the efficiency of joins between small and large RDDs (see the sketch below).
- Using accumulators – accumulators help update the values of variables in parallel while executing.
- Avoiding ByKey operations, repartition, or any other operations that trigger shuffles, which is the most common approach.
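For instance, the broadcast-variable approach can be sketched as follows (the lookup table and data are invented):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("broadcast-demo").getOrCreate()
sc = spark.sparkContext

# A small lookup table shipped once to every executor instead of with every task.
country_lookup = sc.broadcast({"IN": "India", "US": "United States"})

codes = sc.parallelize(["IN", "US", "IN"])
names = codes.map(lambda c: country_lookup.value.get(c, "unknown"))
print(names.collect())   # ['India', 'United States', 'India']
```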