Which method below is used to create a temporary view on a DataFrame? |
DataFrame.createOrReplaceTempView("view_name") |
This method registers the DataFrame as a temporary view that can be queried with Spark SQL. |
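For example, a minimal PySpark sketch (the view name and sample data here are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("TempViewDemo").getOrCreate()
    df = spark.createDataFrame([("Alice", 34), ("Bob", 28)], ["name", "age"])

    # Register the DataFrame as a temporary view scoped to this SparkSession
    df.createOrReplaceTempView("customers")

    # Query the view with Spark SQL; the result is itself a DataFrame
    spark.sql("SELECT name FROM customers WHERE age > 30").show()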
Which of the below Spark Core operations are wide transformations and result in data shuffling? |
groupBy |
groupBy triggers data shuffling across the network, which makes it a wide transformation. |
Which of the below options is correct to persist an RDD only in primary memory? |
rdd.persist(StorageLevel.MEMORY_ONLY) |
This persists the RDD in memory only; partitions that do not fit are recomputed from lineage when needed. |
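A quick sketch (the sample data is arbitrary):

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize(range(1000))

    # Keep partitions in memory only; partitions that do not fit are
    # recomputed from lineage instead of being spilled to disk
    rdd.persist(StorageLevel.MEMORY_ONLY)
    print(rdd.count())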
Which method can be used to verify the number of partitions in RDD? |
RDD.getNumPartitions() |
getNumPartitions() returns the number of partitions in an RDD. |
Which code snippet correctly converts a dataset to a DataFrame using a namedtuple? |
transDF = sc.textFile(...).map(...).map(lambda c: Cust_Trans(c[0], c[1], c[2], int(c[3]))).toDF() |
This code splits the line, maps it to a namedtuple, and converts it to a DataFrame. |
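A fuller sketch of the same pattern; the input path and the four-field comma-separated layout are assumptions for illustration:

    from collections import namedtuple
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("NamedTupleDemo").getOrCreate()
    sc = spark.sparkContext

    Cust_Trans = namedtuple("Cust_Trans", ["cust_id", "name", "city", "amount"])

    # Hypothetical input: comma-separated lines of cust_id,name,city,amount
    transDF = (sc.textFile("/data/transactions.txt")   # assumed path
                 .map(lambda line: line.split(","))
                 .map(lambda c: Cust_Trans(c[0], c[1], c[2], int(c[3])))
                 .toDF())
    transDF.show()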
Which of the below method(s) is/are Spark Core action operations? |
collect(), foreach(), reduce() |
These are Spark actions that trigger computation and return results. |
Which method is used to read a JSON file as a DataFrame? |
sparkSessionObj.read.json("Json file path") |
This is the standard way to load a JSON file using SparkSession. |
Which method is used to increase RDD partitions for better parallelism? |
RDD.repartition(Number of partitions) |
repartition increases partitions and involves shuffling for balanced distribution. |
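A quick sketch showing getNumPartitions() before and after (the partition counts are arbitrary):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize(range(100), 2)
    print(rdd.getNumPartitions())   # 2

    # repartition() shuffles the data to rebalance it across more partitions
    rdd8 = rdd.repartition(8)
    print(rdd8.getNumPartitions())  # 8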
Which transformation aggregates values by key efficiently in paired RDD? |
reduceByKey() |
reduceByKey performs local aggregation before shuffling, making it efficient. |
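For example (the sample pairs are made up):

    from operator import add
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

    # Values are combined within each partition first (map-side), so far less
    # data crosses the network than with groupByKey
    print(pairs.reduceByKey(add).collect())  # [('a', 4), ('b', 2)] (key order may vary)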
Which method is used to save a DataFrame as a Parquet file in HDFS? |
DataFrame.write.parquet("File path") |
This method saves a DataFrame in Parquet format to the given path. |
Which object acts as a unified entry point for Spark SQL including Hive? |
SparkSession |
SparkSession is the main entry point for DataFrame and SQL functionality. |
Which transformation can only be applied on paired RDDs? |
mapValues() |
mapValues transforms only values, keeping keys unchanged; used only on key-value RDDs. |
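For example (the sample pairs are made up):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    prices = sc.parallelize([("pen", 10), ("book", 50)])

    # Only the value side is transformed; keys, and hence any partitioning
    # by key, are preserved
    print(prices.mapValues(lambda v: v * 2).collect())  # [('pen', 20), ('book', 100)]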
Which method is used to save DataFrame to a Hive table? |
DataFrame.write.option("path", "/path").saveAsTable("Banking.CreditCardData") |
This saves the DataFrame as a Hive table; the path option makes it an external table at the given location. |
Which of the following is a Spark Action? |
collect(), first(), take() |
These actions trigger execution and return results from the RDD/DataFrame. |
How do you extract the first and third columns from an RDD? |
data1.map(lambda col: (col[0], col[2])) |
Accessing elements by index allows extraction of specific columns from an RDD. |
What is the output of foldByKey with add on [('a',1), ('b',2), ('a',3), ('a',4)]? |
[('a', 8), ('b', 2)] |
foldByKey with 0 as initial value sums values grouped by key using add. |
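Reproduced as a runnable sketch:

    from operator import add
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize([('a', 1), ('b', 2), ('a', 3), ('a', 4)])

    # 0 is the zero value each key's fold starts from
    print(rdd.foldByKey(0, add).collect())  # [('a', 8), ('b', 2)] (order may vary)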
What is the output of countByValue() on an RDD with [(11,1), (1,), (11,1)]? |
[((11, 1), 2), ((1,), 1)] |
countByValue() is an action that returns a dictionary mapping each element to its count; the list above shows its items. |
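Reproduced as a runnable sketch:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize([(11, 1), (1,), (11, 1)])

    # countByValue() is an action: it returns a dict of element -> count
    # to the driver; the quiz answer above lists that dict's items
    print(dict(rdd.countByValue()))  # {(11, 1): 2, (1,): 1}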
Which method displays contents of DataFrame as a collection of Row? |
data1.collect() |
collect() returns the content as a list of Row objects. |
Which object is created by the system in Spark interactive mode? |
SparkSession |
A SparkSession is created automatically in the interactive shell and exposed as the variable spark. |
What is the difference between persist() and cache() in Spark? |
cache() is equivalent to persist() with the default storage level: MEMORY_ONLY for RDDs and MEMORY_AND_DISK for DataFrames |
cache() is shorthand for persist() with the default level; persist() additionally accepts a custom StorageLevel. |
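A sketch of both defaults (note they differ between the RDD and DataFrame APIs):

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("CacheDemo").getOrCreate()
    sc = spark.sparkContext

    spark.range(1000000).cache()          # DataFrame default: MEMORY_AND_DISK
    sc.parallelize(range(100)).cache()    # RDD default: MEMORY_ONLY

    # persist() lets you choose the level explicitly
    sc.parallelize(range(100)).persist(StorageLevel.DISK_ONLY)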
How does Spark handle data shuffling, and why is it expensive? |
Spark redistributes data across partitions, causing I/O, network, and memory overhead. |
Shuffling involves disk and network operations which slow down performance and require more resources. |
What is a broadcast variable in Spark and when should it be used? |
A broadcast variable caches a read-only value on every node so it does not have to be shipped with each task. |
Broadcast variables are efficient for small datasets that are reused across many tasks. |
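For example (the lookup table is made up):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # The lookup dict is shipped to each executor once, not with every task
    lookup = sc.broadcast({"IN": "India", "US": "United States"})

    codes = sc.parallelize(["IN", "US", "IN"])
    print(codes.map(lambda c: lookup.value[c]).collect())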
Explain the difference between narrow and wide transformations with examples. |
Narrow (e.g., map): each output partition depends on a single input partition. Wide (e.g., groupByKey): output partitions depend on many input partitions, requiring a shuffle. |
Narrow transformations don't require shuffling. Wide ones do and are more expensive. |
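For example:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

    narrow = rdd.map(lambda kv: (kv[0], kv[1] * 2))   # narrow: stays within a partition
    wide = rdd.groupByKey()                           # wide: triggers a shuffle
    print(wide.mapValues(list).collect())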
What are the different storage levels in Spark? |
MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, MEMORY_ONLY_SER, etc. |
Storage levels define how RDDs are cached: in memory, on disk, or in serialized form. |
What is a DAG in Spark, and how is it used in job execution? |
DAG is a Directed Acyclic Graph of stages representing computation lineage. |
Spark builds a DAG of execution for transformations before running any action. |
What are accumulators in Spark and how are they different from broadcast variables? |
Accumulators are shared variables that tasks can only add to, used for aggregations; broadcast variables are read-only. |
Accumulators are useful for debugging or counters; broadcast for small lookup data. |
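For example:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    accum = sc.accumulator(0)

    # Tasks may only add to the accumulator; only the driver reads .value
    sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
    print(accum.value)  # 10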
How do DataFrame APIs differ from RDD APIs in Spark? |
DataFrames are optimized using Catalyst and Tungsten; RDDs offer more control. |
DataFrames are higher-level APIs with better performance; RDDs are more flexible but slower. |
What are some best practices for optimizing Spark jobs? |
Use partitioning, caching, avoid shuffles, use broadcast joins, and monitor jobs. |
Performance improves by reducing shuffles, tuning partitions, and reusing cached data. |
Explain checkpointing and why it is used in Spark streaming applications. |
Checkpointing saves RDD or state data to stable storage (e.g., HDFS) so jobs can recover from failures without replaying the full lineage. |
It helps prevent long lineage chains and supports recovery in streaming jobs. |
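A minimal batch-side sketch (the checkpoint directory is an assumed path; a streaming application passes a similar directory to its context):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    sc.setCheckpointDir("/tmp/spark-checkpoints")  # assumed stable-storage path

    rdd = sc.parallelize(range(10)).map(lambda x: x * 2)
    rdd.checkpoint()   # truncates lineage; data is written on the next action
    print(rdd.count())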