Ans: Apache Spark automatically persists the intermediary data from various shuffle operations; however, it is often suggested that users call the persist() method on an RDD if they plan to reuse it. Spark has various persistence levels to store the RDDs on disk, in memory, or as a combination of both, with different replication levels.
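For example, a minimal sketch of persisting an RDD that is reused by more than one action (sc is assumed to be an existing SparkContext and the input path is illustrative):

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs:///data/input.txt")   // illustrative path
val words = lines.flatMap(_.split(" "))

// Persist because the RDD is reused by two separate actions below.
words.persist(StorageLevel.MEMORY_AND_DISK)

val totalWords = words.count()                // first action: computes and caches
val uniqueWords = words.distinct().count()    // second action: reuses the persisted data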
Ans: Most data users know only SQL and are not good at programming. Shark is a tool developed for people from a database background to access Scala MLlib capabilities through a Hive-like SQL interface. The Shark tool helps data users run Hive on Spark, offering compatibility with the Hive metastore, queries and data.
Ans:
Ans: A sparse vector has two parallel arrays: one for indices and the other for values. These vectors are used for storing non-zero entries to save space.
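As a sketch, using Spark MLlib's vector factory methods:

import org.apache.spark.mllib.linalg.Vectors

// A 7-element vector with non-zero values only at indices 0, 3 and 6;
// the indices and the values are stored in two parallel arrays.
val sparse = Vectors.sparse(7, Array(0, 3, 6), Array(1.0, 5.0, 2.0))

// The equivalent dense representation stores all 7 doubles, zeros included.
val dense = Vectors.dense(1.0, 0.0, 0.0, 5.0, 0.0, 0.0, 2.0)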
Ans: RDDs (Resilient Distributed Datasets) are the basic abstraction in Apache Spark and represent the data coming into the system in object format. RDDs are used for in-memory computations on large clusters in a fault-tolerant manner. RDDs are read-only, partitioned collections of records that are immutable (they cannot be modified once created), distributed across the nodes of a cluster, resilient (lost partitions can be recomputed using lineage information), and lazily evaluated.
Ans: Transformations are functions executed on demand to produce a new RDD. They are lazy: transformations are only carried out once an action requires their result. Some examples of transformations include map, filter and reduceByKey.
Actions trigger the execution of RDD transformations and return a result. After an action is performed, the data from the RDD moves back to the local (driver) machine. Some examples of actions include reduce, collect, first, and take.
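A minimal sketch illustrating the difference (sc is assumed to be an existing SparkContext):

val numbers = sc.parallelize(1 to 10)
val doubled = numbers.map(_ * 2)       // transformation: lazily defines a new RDD
val large = doubled.filter(_ > 10)     // transformation: still no work done

val first = large.first()              // action: triggers the computation
val all = large.collect()              // action: brings the results back to the driver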
Ans: Scala, Java, Python and R. (Other JVM languages such as Clojure can be used through third-party bindings, but they are not officially supported.)
Ans: Yes, it is possible if you use the Spark Cassandra Connector.
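A hedged sketch of what this typically looks like with the Spark Cassandra Connector (the contact point, keyspace and table names here are illustrative):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("CassandraExample")
  .set("spark.cassandra.connection.host", "127.0.0.1")   // Cassandra contact point
val sc = new SparkContext(conf)

// The connector adds cassandraTable() to SparkContext, exposing the table as an RDD.
val rows = sc.cassandraTable("my_keyspace", "my_table")
println(rows.count())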
Ans: Yes, Apache Spark can be run on the hardware clusters managed by Mesos.
Ans: The three different cluster managers supported in Apache Spark are the Standalone cluster manager, Apache Mesos, and Hadoop YARN.
Ans: To connect Spark with Mesos, configure the Spark driver program to connect to the Mesos cluster (by pointing the master URL at Mesos), place the Spark binary package in a location accessible by Mesos, and install Spark in the same location as Mesos (or set the property spark.mesos.executor.home to the location where it is installed).
Ans: Minimizing data transfers and avoiding shuffling helps write Spark programs that run in a fast and reliable manner. The various ways in which data transfers can be minimized when working with Apache Spark are: using broadcast variables to efficiently distribute read-only lookup data to all nodes, using accumulators to update variable values in parallel during execution, and avoiding operations that trigger shuffles, such as repartition and the various ByKey operations, wherever possible.
Ans: These are read-only variables that are kept in an in-memory cache on every machine. When working with Spark, the use of broadcast variables eliminates the need to ship a copy of the variable with every task, so data can be processed faster. Broadcast variables help in storing a lookup table in memory, which improves retrieval efficiency compared to an RDD lookup().
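For example, a small lookup table can be broadcast once per executor instead of being shipped with every task (sc is an assumed SparkContext; the data is illustrative):

val countryNames = Map("IN" -> "India", "US" -> "United States")
val broadcastNames = sc.broadcast(countryNames)

val codes = sc.parallelize(Seq("IN", "US", "IN"))
val resolved = codes.map(code => broadcastNames.value.getOrElse(code, "Unknown"))
resolved.collect()   // Array(India, United States, India)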
Ans: Yes, it is possible to run Spark and Mesos with Hadoop by launching each of these as a separate service on the machines. Mesos acts as a unified scheduler that assigns tasks to either Spark or Hadoop.
Ans: The RDDs in Spark depend on one or more other RDDs. The representation of these dependencies between RDDs is known as the lineage graph. Lineage graph information is used to compute each RDD on demand, so that whenever part of a persisted RDD is lost, the lost data can be recovered using the lineage graph information.
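The lineage of an RDD can be inspected with toDebugString, which prints the chain of parent RDDs that Spark would use to recompute lost partitions (sc is an assumed SparkContext):

val base = sc.parallelize(1 to 100)
val squared = base.map(x => x * x)
val evens = squared.filter(_ % 2 == 0)

println(evens.toDebugString)   // shows the dependency (lineage) graph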
Ans: You can trigger the clean-ups by setting the parameter ‘spark.cleaner.ttl’ or by dividing the long running jobs into different batches and writing the intermediary results to the disk.
Ans:
Ans: It provides scalable partitioning of resources among various Spark instances and dynamic partitioning of resources between Spark and other big data frameworks.
Ans: A sliding window controls the transmission of data packets between various computer networks. The Spark Streaming library provides windowed computations, where transformations on RDDs are applied over a sliding window of data. Whenever the window slides, the RDDs that fall within that particular window are combined and operated upon to produce new RDDs of the windowed DStream.
Ans: A Discretized Stream (DStream) is a sequence of Resilient Distributed Datasets that represents a stream of data. DStreams can be created from various sources like Apache Kafka, HDFS, and Apache Flume. DStreams support two kinds of operations: transformations, which yield a new DStream, and output operations, which write data to an external system.
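A minimal Spark Streaming sketch that combines both ideas: a DStream created from a socket source, with word counts computed over a 30-second window that slides every 10 seconds (the host and port are illustrative):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowedWordCount").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

val lines = ssc.socketTextStream("localhost", 9999)
val pairs = lines.flatMap(_.split(" ")).map(word => (word, 1))

// Transformation over a sliding window: 30-second window, sliding every 10 seconds.
val windowedCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

windowedCounts.print()   // output operation
ssc.start()
ssc.awaitTermination()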
Ans: Spark need not be installed when running a job under YARN or Mesos, because Spark can execute on top of YARN or Mesos clusters without requiring any change to the cluster.
Ans: Catalyst framework is a new optimization framework present in Spark SQL. It allows Spark to automatically transform SQL queries by adding new optimizations to build a faster processing system.
Ans: Pinterest, Conviva, Shopify, Open Table.
Ans: Tachyon.
Ans: BlinkDB is a query engine for executing interactive SQL queries on huge volumes of data and renders query results marked with meaningful error bars. BlinkDB helps users balance ‘query accuracy’ with response time.
Ans: Hadoop MapReduce requires programming in Java which is difficult, though Pig and Hive make it considerably easier. Learning Pig and Hive syntax takes time. Spark has interactive APIs for different languages like Java, Python or Scala and also includes Shark i.e. Spark SQL for SQL lovers - making it comparatively easier to use than Hadoop.
Ans: Developers often make the mistake of:
Developers need to be careful with this, as Spark makes use of memory for processing.
Ans: A Parquet file is a columnar-format file that helps limit I/O operations, consume less storage space, and fetch only the columns required by a query.
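A short sketch (spark is an assumed SparkSession; the paths and the column name are illustrative):

val events = spark.read.json("hdfs:///data/events.json")

// Write the data in Parquet's columnar format.
events.write.parquet("hdfs:///data/events.parquet")

// Reading back and selecting one column only scans that column on disk.
val parquetEvents = spark.read.parquet("hdfs:///data/events.parquet")
parquetEvents.select("userId").show()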
Ans:
Ans: Spark has its own cluster manager for standalone deployments and mainly uses Hadoop (HDFS) for storage.
Ans:
Ans: Special operations can be performed on RDDs in Spark using key/value pairs; such RDDs are referred to as pair RDDs. Pair RDDs allow users to access each key in parallel. They have a reduceByKey() method that aggregates data based on each key and a join() method that combines different RDDs, based on the elements having the same key.
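For example (sc is an assumed SparkContext; the data is illustrative):

val sales = sc.parallelize(Seq(("apples", 10), ("oranges", 5), ("apples", 3)))
val prices = sc.parallelize(Seq(("apples", 0.5), ("oranges", 0.75)))

val totals = sales.reduceByKey(_ + _)   // (apples,13), (oranges,5)
val joined = totals.join(prices)        // (apples,(13,0.5)), (oranges,(5,0.75))
joined.collect()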
Ans: Apache Spark is mainly used for iterative machine learning, interactive data analytics and processing, stream processing, and processing of sensor data.
Ans: No. Apache Spark works well only for simple machine learning algorithms such as clustering, regression and classification.
Ans: It has all the basic functionalities of Spark, like - memory management, fault recovery, interacting with storage systems, scheduling tasks, etc.
Ans: Use the subtractByKey() function.
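For example, to keep only the keys of one pair RDD that do not appear in another (sc is an assumed SparkContext):

val rdd1 = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val rdd2 = sc.parallelize(Seq(("b", 99)))

rdd1.subtractByKey(rdd2).collect()   // Array((a,1), (c,3))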
Ans: persist() allows the user to specify the storage level, whereas cache() uses the default storage level (MEMORY_ONLY).
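A minimal illustration of the difference:

import org.apache.spark.storage.StorageLevel

val a = sc.parallelize(1 to 1000)
a.cache()                                     // equivalent to a.persist(StorageLevel.MEMORY_ONLY)

val b = sc.parallelize(1 to 1000)
b.persist(StorageLevel.MEMORY_AND_DISK_SER)   // persist() lets the caller choose the level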
Ans: Scala is an object-functional programming and scripting language for general software applications, designed to express solutions in a concise manner.
Ans: A Scala Set is a collection of unique elements of the same type; it does not contain any duplicate elements. There are two kinds of Sets: mutable and immutable.
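For example:

val fruits = Set("apple", "banana", "apple")   // immutable by default
println(fruits)                                // Set(apple, banana): the duplicate is dropped

val basket = scala.collection.mutable.Set("apple")
basket += "banana"                             // a mutable Set can be updated in place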
Ans: A Scala Map is a collection of key/value pairs. Any value can be retrieved based on its key. Keys are unique in a Map, but values need not be.
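For example:

val capitals = Map("India" -> "New Delhi", "France" -> "Paris")

capitals("India")      // "New Delhi": the value is retrieved by its unique key
capitals.get("Spain")  // None: get() returns an Option for safe lookups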
Ans:
Ans:
Ans: Values and variables are the two kinds of bindings in Scala. A value, declared with val, is constant and cannot be changed once assigned: it is immutable. A variable, declared with var, on the other hand, is mutable and its value can be changed.
The two types of variables are therefore val (immutable) and var (mutable).
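For example:

val pi = 3.14159        // value: immutable, reassignment is a compile-time error
// pi = 3.2             // does not compile: "reassignment to val"

var counter = 0         // variable: mutable
counter = counter + 1   // allowed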
Ans: A class defines a type in terms of its methods and its composition of other types; it is a blueprint for objects. An object, in contrast, is a singleton: a unique instance of a class. An anonymous class is created for every object declaration in the code; it inherits from whatever classes or traits you declared the object to implement.
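A small sketch of the difference:

// A class is a blueprint: each `new` creates a distinct instance.
class Counter {
  private var count = 0
  def increment(): Int = { count += 1; count }
}

// An object is a singleton: exactly one instance, created lazily on first use.
object GlobalCounter {
  private var count = 0
  def increment(): Int = { count += 1; count }
}

val c1 = new Counter()      // independent instance
val c2 = new Counter()      // another independent instance
GlobalCounter.increment()   // always the same single instance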
Ans: ‘Recursion’ occurs when a function calls itself. It is a technique used frequently in functional programming. For a function to be tail recursive, the recursive call must be the last operation the function performs.
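For example, a tail-recursive factorial; the @tailrec annotation asks the compiler to verify that the recursive call really is in tail position:

import scala.annotation.tailrec

@tailrec
def factorial(n: Int, acc: BigInt = 1): BigInt =
  if (n <= 1) acc
  else factorial(n - 1, acc * n)   // the recursive call is the last operation

factorial(5)   // 120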
Ans: ‘Traits’ are used to define object types specified by the signatures of the supported methods. Scala allows traits to be partially implemented, but traits may not have constructor parameters. A trait consists of method and field definitions; by mixing them into classes, they can be reused.
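For example:

// A trait can mix abstract members with concrete (partially implemented) ones,
// but it cannot take constructor parameters.
trait Greeter {
  def name: String                        // abstract member
  def greet(): String = s"Hello, $name"   // concrete method reused by every class that mixes in the trait
}

class Employee(val name: String) extends Greeter

new Employee("Ravi").greet()   // "Hello, Ravi"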
Ans: There is no specific rule for when to use traits, but there are guidelines you can consider.
Ans: Case classes are regular classes that export their constructor parameters and provide a recursive decomposition mechanism via pattern matching. The constructor parameters of case classes can be accessed directly and are treated as public values.
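For example:

case class Person(name: String, department: String)

val p = Person("Asha", "Engineering")
p.name                                  // constructor parameters are public values

p match {                               // recursive decomposition via pattern matching
  case Person(n, "Engineering") => s"$n is an engineer"
  case Person(n, _)             => s"$n works elsewhere"
}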
Ans: Scala tuples combine a fixed number of items together so that they can be passed around as a whole. Unlike an array or list, a tuple is immutable and can hold objects of different types.
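For example:

val record = ("Spark", 3, true)    // a Tuple3 holding a String, an Int and a Boolean

record._1                          // "Spark"
record._2                          // 3
record._3                          // true

val (name, version, stable) = record   // a tuple can also be destructured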
Ans: Currying is the technique of transforming a function that takes multiple arguments into a function that takes a single argument. Scala supports many of the same techniques as languages like Haskell and LISP. Function currying is one of the least used and most misunderstood of them.
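For example:

def add(x: Int)(y: Int): Int = x + y   // a curried function takes its arguments one at a time

add(2)(3)               // 5
val addFive = add(5) _  // partially applied: a function of type Int => Int
addFive(10)             // 15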
Ans: An implicit parameter is a way to allow parameters of a method to be “found” automatically. It is similar to default parameters, but it has a different mechanism for finding the “default” value. An implicit parameter is a parameter to a method or constructor that is marked as implicit. This means that if a parameter value is not supplied, the compiler will search for an “implicit” value defined within scope.
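For example (the names and the discount value are illustrative):

def finalPrice(amount: Double)(implicit discount: Double): Double =
  amount - amount * discount

implicit val standardDiscount: Double = 0.10

finalPrice(100.0)         // 90.0: the discount is found implicitly in scope
finalPrice(100.0)(0.25)   // 75.0: an explicitly supplied value still wins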
Ans: A closure is a function whose return value depends on the value of the variables declared outside the function.
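For example:

var factor = 3
val multiplier = (x: Int) => x * factor   // closes over the free variable `factor`

multiplier(10)   // 30
factor = 5
multiplier(10)   // 50: the closure sees the updated value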
Ans: A monad is an object that wraps another object. You pass the monad mini-programs, i.e. functions, to perform the data manipulation of the underlying object, instead of manipulating the object directly. The monad chooses how to apply the program to the underlying object.
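As a sketch, Scala's Option behaves like a simple monad: the wrapped value is manipulated by passing small functions to map and flatMap rather than by touching it directly:

def parseInt(s: String): Option[Int] = scala.util.Try(s.toInt).toOption

val product = parseInt("21").flatMap(a => parseInt("2").map(b => a * b))
// Some(42); if either parse had failed, the whole chain would be None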
Ans: In a source code, anonymous functions are called ‘function literals’ and at run time, function literals are instantiated into objects called function values. Scala provides a relatively easy syntax for defining anonymous functions.
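For example:

val square = (x: Int) => x * x     // a function literal, instantiated as a function value at run time

square(4)                          // 16
List(1, 2, 3).map(_ * 2)           // the underscore form is an even shorter literal: List(2, 4, 6)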
Ans: Scala allows the definition of higher-order functions. These are functions that take other functions as parameters, or whose result is a function. In the following example, the apply() function takes another function ‘f’ and a value ‘v’ and applies the function to v.
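The example referred to above is not shown in the text; a reconstructed sketch along those lines:

def apply(f: Int => String, v: Int): String = f(v)   // takes a function `f` and a value `v`

def layout(x: Int): String = s"[$x]"

apply(layout, 10)   // "[10]"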