PySpark Interview Questions and Answers
Question | Answer | Explanation |
---|---|---|
Which method below is used to create a temporary view on DataFrame? | DataFrame.createOrReplaceTempView("View Name") | This method registers a DataFrame as a temporary table for SQL queries in Spark. |
Which of the below Spark Core operations are wide transformations and result in data shuffling? | groupBy | groupBy triggers data shuffling across the network, which makes it a wide transformation. |
Which of the below options is correct to persist an RDD only in primary memory? | rdd.persist(StorageLevel.MEMORY_ONLY) | This keeps the RDD in memory only; partitions that do not fit are recomputed when needed (see the sketch after the table). |
Which method can be used to verify the number of partitions in RDD? | RDD.getNumPartitions() | getNumPartitions() returns the number of partitions in an RDD. |
Which code snippet correctly converts dataset to DataFrame using namedtuple? | transDF = sc.textFile(...).map(...).map(lambda c: Cust_Trans(c[0], c[1], c[2], int(c[3]))).toDF() | This code splits the line, maps it to a namedtuple, and converts it to a DataFrame. |
Which of the below method(s) is/are Spark Core action operations? | collect(), foreach(), reduce() | These are Spark actions that trigger computation and return results. |
Which method is used to read a JSON file as a DataFrame? | sparkSessionObj.read.json("Json file path") | This is the standard way to load a JSON file using SparkSession. |
Which method is used to increase RDD partitions for better parallelism? | RDD.repartition(Number of partitions) | repartition increases partitions and involves shuffling for balanced distribution. |
Which transformation aggregates values by key efficiently in paired RDD? | reduceByKey() | reduceByKey performs local aggregation before shuffling, making it efficient. |
Which method is used to save a DataFrame as a Parquet file in HDFS? | DataFrame.write.parquet("File path") | This method saves a DataFrame in Parquet format to the given path. |
Which object acts as a unified entry point for Spark SQL including Hive? | SparkSession | SparkSession is the main entry point for DataFrame and SQL functionality. |
Which transformation can only be applied on paired RDDs? | mapValues() | mapValues transforms only values, keeping keys unchanged; used only on key-value RDDs. |
Which method is used to save a DataFrame to a Hive table? | DataFrame.write.option("path", "/path").saveAsTable("Banking.CreditCardData") | saveAsTable persists the DataFrame as a Hive table; the optional "path" setting stores it as an external table at the given location (see the sketch after the table). |
Which of the following is a Spark Action? | collect(), first(), take() | These actions trigger execution and return results from the RDD/DataFrame. |
How to extract the first and third column from an RDD? | data1.map(lambda col: (col[0], col[2])) | Accessing elements by index allows extraction of specific columns from an RDD. |
What is the output of foldByKey with add on [('a',1), ('b',2), ('a',3), ('a',4)]? | [('a', 8), ('b', 2)] | foldByKey with 0 as initial value sums values grouped by key using add. |
What is the output of countByValue() on an RDD with [(11,1), (1,), (11,1)]? | {(11, 1): 2, (1,): 1} | countByValue() returns a dictionary mapping each distinct element to the number of times it occurs in the RDD (see the sketch after the table). |
Which method displays contents of DataFrame as a collection of Row? | data1.collect() | collect() returns the content as a list of Row objects. |
Which object is created by the system in Spark interactive mode? | SparkSession | SparkSession is automatically created in interactive mode for convenience. |
What is the difference between persist() and cache() in Spark? | cache() is persist() with the default storage level; persist() accepts any StorageLevel. | For RDDs, cache() equals persist(StorageLevel.MEMORY_ONLY); for DataFrames it equals persist(StorageLevel.MEMORY_AND_DISK). persist() lets you choose a custom storage level (see the sketch after the table). |
How does Spark handle data shuffling, and why is it expensive? | Spark redistributes data across partitions, causing I/O, network, and memory overhead. | Shuffling involves disk and network operations which slow down performance and require more resources. |
What is a broadcast variable in Spark and when should it be used? | Used to cache a read-only variable on all nodes to avoid shipping with tasks. | Broadcast variables are efficient for small datasets that are reused across many tasks. |
Explain the difference between narrow and wide transformations with examples. | Narrow (e.g., map): data from one partition. Wide (e.g., groupByKey): requires shuffle. | Narrow transformations don't require shuffling. Wide ones do and are more expensive. |
What are the different storage levels in Spark? | MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, MEMORY_ONLY_SER, etc. | Storage levels define how RDDs are cached – in memory, disk, or serialized form. |
What is a DAG in Spark, and how is it used in job execution? | DAG is a Directed Acyclic Graph of stages representing computation lineage. | Spark builds a DAG of execution for transformations before running any action. |
What are accumulators in Spark and how are they different from broadcast variables? | Accumulators are shared variables that tasks can only add to; broadcast variables are read-only. | Only the driver can read an accumulator's value, which makes accumulators useful for counters and debugging; broadcast variables ship small lookup data to every executor once. |
How do DataFrame APIs differ from RDD APIs in Spark? | DataFrames are optimized using Catalyst and Tungsten; RDDs offer more control. | DataFrames are higher-level APIs with better performance; RDDs are more flexible but slower. |
What are some best practices for optimizing Spark jobs? | Use partitioning, caching, avoid shuffles, use broadcast joins, and monitor jobs. | Performance improves by reducing shuffles, tuning partitions, and reusing cached data. |
Explain checkpointing and why it is used in Spark streaming applications. | Checkpointing saves RDD lineage info to stable storage to recover from failures. | It helps prevent long lineage chains and supports recovery in streaming jobs. |
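
Worked examples
The minimal sketches below expand on several of the answers in the table above; sample data, paths, and names are illustrative assumptions, not code from the original post. First, registering a DataFrame as a temporary view and querying it with Spark SQL:

```python
from pyspark.sql import SparkSession

# In the interactive pyspark shell a SparkSession named `spark` already exists;
# in a standalone script it is built explicitly.
spark = SparkSession.builder.appName("TempViewExample").getOrCreate()

# Illustrative sample data
df = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

# Register the DataFrame as a temporary view so it can be queried with SQL
df.createOrReplaceTempView("customers")

spark.sql("SELECT name FROM customers WHERE id = 1").show()
```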
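A sketch of the namedtuple-to-DataFrame conversion; the field names, file path, and comma delimiter are assumptions for illustration:

```python
from collections import namedtuple

# Assumed record layout: customer id, name, category, amount
Cust_Trans = namedtuple("Cust_Trans", ["cust_id", "name", "category", "amount"])

transDF = (spark.sparkContext.textFile("/data/cust_trans.csv")   # illustrative path
           .map(lambda line: line.split(","))
           .map(lambda c: Cust_Trans(c[0], c[1], c[2], int(c[3])))
           .toDF())

transDF.printSchema()
transDF.show(5)
```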
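Reading JSON, writing Parquet, and saving to a Hive table, assuming a Hive-enabled SparkSession and an existing Banking database; the paths are illustrative:

```python
# Read a JSON file into a DataFrame
cards = spark.read.json("/data/credit_cards.json")

# Save the DataFrame as Parquet in HDFS
cards.write.mode("overwrite").parquet("/warehouse/credit_cards_parquet")

# Save as a Hive table; the optional "path" creates an external table at that location
cards.write.option("path", "/warehouse/creditcard").saveAsTable("Banking.CreditCardData")
```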
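Checking partitions, persisting only in primary memory, and repartitioning; the partition counts are illustrative:

```python
from pyspark import StorageLevel

rdd = spark.sparkContext.parallelize(range(100), 4)

# Keep the RDD only in memory; partitions that do not fit are recomputed when needed
rdd.persist(StorageLevel.MEMORY_ONLY)

print(rdd.getNumPartitions())                  # 4

# repartition() shuffles data to produce the requested number of partitions
print(rdd.repartition(8).getNumPartitions())   # 8
```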
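Illustrating the cache()/persist() distinction described above:

```python
from pyspark import StorageLevel

rdd = spark.sparkContext.parallelize(range(10))
rdd.cache()                          # shorthand for persist(StorageLevel.MEMORY_ONLY) on RDDs

df = spark.range(10)
df.cache()                           # DataFrames default to MEMORY_AND_DISK

df2 = spark.range(10)
df2.persist(StorageLevel.DISK_ONLY)  # persist() accepts any storage level
```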
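Narrow versus wide transformations on a paired RDD; the sample pairs are illustrative:

```python
pairs = spark.sparkContext.parallelize([("a", 1), ("b", 2), ("a", 3), ("a", 4)])

# mapValues is narrow: each output partition depends on exactly one input partition
doubled = pairs.mapValues(lambda v: v * 2)

# reduceByKey is wide, but it combines values locally before the shuffle,
# which is why it is preferred over groupByKey for aggregations
sums = pairs.reduceByKey(lambda x, y: x + y)
print(sorted(sums.collect()))    # [('a', 8), ('b', 2)]
```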
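Reproducing the foldByKey and countByValue outputs quoted in the table:

```python
from operator import add

sc = spark.sparkContext

rdd = sc.parallelize([('a', 1), ('b', 2), ('a', 3), ('a', 4)])
print(sorted(rdd.foldByKey(0, add).collect()))   # [('a', 8), ('b', 2)]

rdd2 = sc.parallelize([(11, 1), (1,), (11, 1)])
print(dict(rdd2.countByValue()))                 # {(11, 1): 2, (1,): 1}
```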
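Broadcast variables and accumulators side by side; the lookup data and counter are assumptions for illustration:

```python
sc = spark.sparkContext

# Broadcast: a read-only lookup shipped to each executor once instead of with every task
lookup = sc.broadcast({"a": "alpha", "b": "beta"})

# Accumulator: tasks can only add to it; only the driver reads the final value
missing = sc.accumulator(0)

def resolve(key):
    if key not in lookup.value:
        missing.add(1)
        return None
    return lookup.value[key]

print(sc.parallelize(["a", "b", "c"]).map(resolve).collect())   # ['alpha', 'beta', None]
print(missing.value)                                            # 1
```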
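One of the optimization practices mentioned above, a broadcast join, can be requested explicitly; the table sizes are illustrative:

```python
from pyspark.sql.functions import broadcast

large = spark.range(1000000).withColumnRenamed("id", "key")
small = spark.createDataFrame([(0, "zero"), (1, "one")], ["key", "label"])

# Broadcasting the small side avoids shuffling the large side
joined = large.join(broadcast(small), "key")
joined.explain()    # physical plan shows BroadcastHashJoin
```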
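A minimal RDD checkpointing sketch; the checkpoint directory is an assumed example (in Structured Streaming the equivalent is the checkpointLocation option on the query):

```python
sc = spark.sparkContext
sc.setCheckpointDir("/tmp/spark-checkpoints")    # illustrative directory on stable storage

rdd = sc.parallelize(range(1000)).map(lambda x: x * 2)
rdd.checkpoint()     # truncates the lineage; data is written to the checkpoint dir
rdd.count()          # the checkpoint is materialized on the first action
```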