Which method below is used to create a temporary view on a DataFrame? |
DataFrame.createOrReplaceTempView("view_name") |
This method registers the DataFrame as a temporary view that can be queried with Spark SQL. |
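For example, a minimal PySpark sketch (the view name and sample data here are made up):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("TempViewDemo").getOrCreate()
    df = spark.createDataFrame([("Alice", 34), ("Bob", 28)], ["name", "age"])

    # Register the DataFrame as a temporary view scoped to this SparkSession
    df.createOrReplaceTempView("customers")

    # Query the view with Spark SQL; the result is itself a DataFrame
    spark.sql("SELECT name FROM customers WHERE age > 30").show()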
Which of the below Spark Core operations are wide transformations and result in data shuffling? |
groupBy |
groupBy triggers data shuffling across the network, which makes it a wide transformation. |
Which of the below options is correct to persist an RDD only in primary memory? |
rdd.persist(StorageLevel.MEMORY_ONLY) |
This persists the RDD in memory only; partitions that do not fit are recomputed from lineage when needed. |
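A quick sketch (the sample data is arbitrary):

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize(range(1000))

    # Keep partitions in memory only; partitions that do not fit are
    # recomputed from lineage instead of being spilled to disk
    rdd.persist(StorageLevel.MEMORY_ONLY)
    print(rdd.count())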
Which method can be used to verify the number of partitions in RDD? |
RDD.getNumPartitions() |
getNumPartitions() returns the number of partitions in an RDD. |
Which code snippet correctly converts a dataset to a DataFrame using a namedtuple? |
transDF = sc.textFile(...).map(...).map(lambda c: Cust_Trans(c[0], c[1], c[2], int(c[3]))).toDF() |
This code splits the line, maps it to a namedtuple, and converts it to a DataFrame. |
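A fuller sketch of the same pattern; the input path and the four-field comma-separated layout are assumptions for illustration:

    from collections import namedtuple
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("NamedTupleDemo").getOrCreate()
    sc = spark.sparkContext

    Cust_Trans = namedtuple("Cust_Trans", ["cust_id", "name", "city", "amount"])

    # Hypothetical input: comma-separated lines of cust_id,name,city,amount
    transDF = (sc.textFile("/data/transactions.txt")   # assumed path
                 .map(lambda line: line.split(","))
                 .map(lambda c: Cust_Trans(c[0], c[1], c[2], int(c[3])))
                 .toDF())
    transDF.show()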
Which of the below method(s) is/are Spark Core action operations? |
collect(), foreach(), reduce() |
These are Spark actions that trigger computation and return results. |
Which method is used to read a JSON file as a DataFrame? |
sparkSessionObj.read.json("Json file path") |
This is the standard way to load a JSON file using SparkSession. |
Which method is used to increase RDD partitions for better parallelism? |
RDD.repartition(Number of partitions) |
repartition increases partitions and involves shuffling for balanced distribution. |
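A quick sketch showing getNumPartitions() before and after (the partition counts are arbitrary):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize(range(100), 2)
    print(rdd.getNumPartitions())   # 2

    # repartition() shuffles the data to rebalance it across more partitions
    rdd8 = rdd.repartition(8)
    print(rdd8.getNumPartitions())  # 8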
Which transformation aggregates values by key efficiently in paired RDD? |
reduceByKey() |
reduceByKey performs local aggregation before shuffling, making it efficient. |
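For example (the sample pairs are made up):

    from operator import add
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

    # Values are combined within each partition first (map-side), so far less
    # data crosses the network than with groupByKey
    print(pairs.reduceByKey(add).collect())  # [('a', 4), ('b', 2)] (key order may vary)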
Which method is used to save a DataFrame as a Parquet file in HDFS? |
DataFrame.write.parquet("File path") |
This method saves a DataFrame in Parquet format to the given path. |
Which object acts as a unified entry point for Spark SQL including Hive? |
SparkSession |
SparkSession is the main entry point for DataFrame and SQL functionality. |
Which transformation can only be applied on paired RDDs? |
mapValues() |
mapValues transforms only values, keeping keys unchanged; used only on key-value RDDs. |
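For example (the sample pairs are made up):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    prices = sc.parallelize([("pen", 10), ("book", 50)])

    # Only the value side is transformed; keys, and hence any partitioning
    # by key, are preserved
    print(prices.mapValues(lambda v: v * 2).collect())  # [('pen', 20), ('book', 100)]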
Which method is used to save DataFrame to a Hive table? |
DataFrame.write.option("path", "/path").saveAsTable("Banking.CreditCardData") |
This saves the DataFrame as a Hive table; the path option makes it an external table at the given location. |
Which of the following is a Spark Action? |
collect(), first(), take() |
These actions trigger execution and return results from the RDD/DataFrame. |
How do you extract the first and third columns from an RDD? |
data1.map(lambda col: (col[0], col[2])) |
Accessing elements by index allows extraction of specific columns from an RDD. |
What is the output of foldByKey with add on [('a',1), ('b',2), ('a',3), ('a',4)]? |
[('a', 8), ('b', 2)] |
foldByKey with 0 as initial value sums values grouped by key using add. |
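Reproduced as a runnable sketch:

    from operator import add
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize([('a', 1), ('b', 2), ('a', 3), ('a', 4)])

    # 0 is the zero value each key's fold starts from
    print(rdd.foldByKey(0, add).collect())  # [('a', 8), ('b', 2)] (order may vary)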
What is the output of countByValue() on an RDD with [(11,1), (1,), (11,1)]? |
[((11, 1), 2), ((1,), 1)] |
countByValue() is an action that returns a dictionary mapping each element to its count; the list above shows its items. |
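Reproduced as a runnable sketch:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize([(11, 1), (1,), (11, 1)])

    # countByValue() is an action: it returns a dict of element -> count
    # to the driver; the quiz answer above lists that dict's items
    print(dict(rdd.countByValue()))  # {(11, 1): 2, (1,): 1}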
Which method displays contents of DataFrame as a collection of Row? |
data1.collect() |
collect() returns the content as a list of Row objects. |
Which object is created by the system in Spark interactive mode? |
SparkSession |
A SparkSession is created automatically in the interactive shell and exposed as the variable spark. |
What is the difference between persist() and cache() in Spark? |
cache() is equivalent to persist() with the default storage level: MEMORY_ONLY for RDDs and MEMORY_AND_DISK for DataFrames |
cache() is shorthand for persist() with the default level; persist() additionally accepts a custom StorageLevel. |
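A sketch of both defaults (note they differ between the RDD and DataFrame APIs):

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("CacheDemo").getOrCreate()
    sc = spark.sparkContext

    spark.range(1000000).cache()          # DataFrame default: MEMORY_AND_DISK
    sc.parallelize(range(100)).cache()    # RDD default: MEMORY_ONLY

    # persist() lets you choose the level explicitly
    sc.parallelize(range(100)).persist(StorageLevel.DISK_ONLY)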
How does Spark handle data shuffling, and why is it expensive? |
Spark redistributes data across partitions, causing I/O, network, and memory overhead. |
Shuffling involves disk and network operations which slow down performance and require more resources. |
What is a broadcast variable in Spark and when should it be used? |
A broadcast variable caches a read-only value on every node so it does not have to be shipped with each task. |
Broadcast variables are efficient for small datasets that are reused across many tasks. |
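For example (the lookup table is made up):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    # The lookup dict is shipped to each executor once, not with every task
    lookup = sc.broadcast({"IN": "India", "US": "United States"})

    codes = sc.parallelize(["IN", "US", "IN"])
    print(codes.map(lambda c: lookup.value[c]).collect())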
Explain the difference between narrow and wide transformations with examples. |
Narrow (e.g., map): each output partition depends on a single input partition. Wide (e.g., groupByKey): output partitions depend on many input partitions, requiring a shuffle. |
Narrow transformations don't require shuffling. Wide ones do and are more expensive. |
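For example:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    rdd = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

    narrow = rdd.map(lambda kv: (kv[0], kv[1] * 2))   # narrow: stays within a partition
    wide = rdd.groupByKey()                           # wide: triggers a shuffle
    print(wide.mapValues(list).collect())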
What are the different storage levels in Spark? |
MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, MEMORY_ONLY_SER, etc. |
Storage levels define how RDDs are cached: in memory, on disk, or in serialized form. |
What is a DAG in Spark, and how is it used in job execution? |
DAG is a Directed Acyclic Graph of stages representing computation lineage. |
Spark builds a DAG of execution for transformations before running any action. |
What are accumulators in Spark and how are they different from broadcast variables? |
Accumulators are shared variables that tasks can only add to, used for aggregations; broadcast variables are read-only. |
Accumulators are useful for debugging or counters; broadcast for small lookup data. |
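For example:

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    accum = sc.accumulator(0)

    # Tasks may only add to the accumulator; only the driver reads .value
    sc.parallelize([1, 2, 3, 4]).foreach(lambda x: accum.add(x))
    print(accum.value)  # 10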
How do DataFrame APIs differ from RDD APIs in Spark? |
DataFrames are optimized using Catalyst and Tungsten; RDDs offer more control. |
DataFrames are higher-level APIs with better performance; RDDs are more flexible but slower. |
What are some best practices for optimizing Spark jobs? |
Use partitioning, caching, avoid shuffles, use broadcast joins, and monitor jobs. |
Performance improves by reducing shuffles, tuning partitions, and reusing cached data. |
Explain checkpointing and why it is used in Spark streaming applications. |
Checkpointing saves RDD or state data to stable storage (e.g., HDFS) so jobs can recover from failures without replaying the full lineage. |
It helps prevent long lineage chains and supports recovery in streaming jobs. |
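A minimal batch-side sketch (the checkpoint directory is an assumed path; a streaming application passes a similar directory to its context):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    sc.setCheckpointDir("/tmp/spark-checkpoints")  # assumed stable-storage path

    rdd = sc.parallelize(range(10)).map(lambda x: x * 2)
    rdd.checkpoint()   # truncates lineage; data is written on the next action
    print(rdd.count())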