Free Databricks Associate-Developer-Apache-Spark Practice Test & Real Exam Questions, Page 1

Question #1

Which of the following describes how Spark achieves fault tolerance?

A. Due to the mutability of DataFrames after transformations, Spark reproduces them using observed lineage in case of worker node failure.

B. If an executor on a worker node fails while calculating an RDD, that RDD can be recomputed by another executor using the lineage.

C. Spark is only fault-tolerant if this feature is specifically enabled via the spark.fault_recovery.enabled property.

D. Spark builds a fault-tolerant layer on top of the legacy RDD data system, which by itself is not fault tolerant.

E. Spark helps fast recovery of data in case of a worker fault by providing the MEMORY_AND_DISK storage level option.

Correct Answer: B
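Lineage-based recovery, as described in option B, can be illustrated without a cluster. The sketch below is a plain-Python model, not Spark code: the names `source_partitions`, `lineage`, and `compute_partition` are hypothetical, and a lost partition is rebuilt simply by replaying the recorded transformations on its source rows.

```python
# Plain-Python model of lineage-based recomputation (not the Spark API).
# All names here are hypothetical and chosen for illustration only.

# Source data split into partitions, as it might be read from durable storage.
source_partitions = {0: [1, 2, 3], 1: [4, 5, 6]}

# The lineage: an ordered record of the transformations applied to each partition.
lineage = [
    lambda rows: [r * 10 for r in rows],       # map-like step
    lambda rows: [r for r in rows if r > 10],  # filter-like step
]

def compute_partition(partition_id):
    """Rebuild one partition by replaying the lineage on its source rows."""
    rows = source_partitions[partition_id]
    for transform in lineage:
        rows = transform(rows)
    return rows

# First computation of every partition.
results = {pid: compute_partition(pid) for pid in source_partitions}

# Simulate losing partition 1 when the executor holding it fails ...
del results[1]

# ... and recompute it from the lineage alone, as another executor could.
results[1] = compute_partition(1)
```

Because the transformations are deterministic and the inputs are immutable, the recomputed partition is identical to the lost one.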

Question #2

Which of the following code blocks returns a 2-column DataFrame that shows the distinct values in column productId and the number of rows with that productId in DataFrame transactionsDf?

A. transactionsDf.groupBy("productId").count()

B. transactionsDf.count("productId").distinct()

C. transactionsDf.groupBy("productId").select(count("value"))

D. transactionsDf.count("productId")

E. transactionsDf.groupBy("productId").agg(col("value").count())

Correct Answer: A
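What `groupBy("productId").count()` computes can be mirrored in plain Python. The sketch below is only a model of the result, not Spark code, and the sample rows in `transactions` are hypothetical.

```python
from collections import Counter

# Hypothetical rows standing in for transactionsDf: (transactionId, productId).
transactions = [(1, 3), (2, 2), (3, 3), (4, 3), (5, 2), (6, 7)]

# Count rows per productId, as groupBy("productId").count() would.
counts = Counter(product_id for _, product_id in transactions)

# Two "columns" per result row: the distinct productId and its row count.
result = sorted(counts.items())
```

In Spark the second column of the result is named `count`.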

Question #3

The code block displayed below contains an error. The code block is intended to write DataFrame transactionsDf to disk as a parquet file in location /FileStore/transactions_split, using column storeId as key for partitioning. Find the error.
Code block:
transactionsDf.write.format("parquet").partitionOn("storeId").save("/FileStore/transactions_split")

A. Partitioning data by storeId is possible with the partitionBy expression, so partitionOn should be replaced by partitionBy.

B. partitionOn("storeId") should be called before the write operation.

C. The format("parquet") expression should be removed and instead, the information should be added to the write expression like so: write("parquet").

D. The format("parquet") expression is inappropriate to use here, "parquet" should be passed as first argument to the save() operator and "/FileStore/transactions_split" as the second argument.

E. Partitioning data by storeId is possible with the bucketBy expression, so partitionOn should be replaced by bucketBy.

Correct Answer: A
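With `partitionBy("storeId")`, Spark writes one subdirectory per distinct key, named `storeId=<value>`. The sketch below models that directory layout in plain Python. It is an illustration only: it writes CSV instead of parquet, and the sample rows, the helper `write_partitioned`, and the file name `part-0.csv` are hypothetical.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path

# Corrected PySpark call from the question, shown for reference only:
# transactionsDf.write.format("parquet").partitionBy("storeId").save("/FileStore/transactions_split")

# Hypothetical rows: (transactionId, storeId, value).
rows = [(1, 25, 4.0), (2, 2, 7.5), (3, 25, 1.0)]

def write_partitioned(rows, base_dir):
    """Write rows into one subdirectory per storeId, named storeId=<value>."""
    groups = defaultdict(list)
    for transaction_id, store_id, value in rows:
        # The partition key is encoded in the directory name, not in the file.
        groups[store_id].append((transaction_id, value))
    for store_id, group in groups.items():
        part_dir = Path(base_dir) / f"storeId={store_id}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-0.csv", "w", newline="") as handle:
            csv.writer(handle).writerows(group)

base_dir = tempfile.mkdtemp()
write_partitioned(rows, base_dir)
partition_dirs = sorted(p.name for p in Path(base_dir).iterdir())
```

A reader that filters on storeId only needs to open the matching directories, which is the main benefit of partitioning on write.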

Question #4

Which of the following statements about Spark's configuration properties is incorrect?

A. The maximum number of tasks that an executor can process at the same time is controlled by the spark.executor.cores property.

B. The default value for spark.sql.autoBroadcastJoinThreshold is 10MB.

C. The default number of partitions returned from certain transformations can be controlled by the spark.default.parallelism property.

D. The maximum number of tasks that an executor can process at the same time is controlled by the spark.task.cpus property.

E. The default number of partitions to use when shuffling data for joins or aggregations is 300.

Correct Answer: E
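Option E is incorrect because the default of `spark.sql.shuffle.partitions` is 200, not 300. The sketch below records the defaults touched on by the options and, as an assumption about how options A and D relate, computes the concurrent task slots of one executor as `spark.executor.cores` divided by `spark.task.cpus`. The names `DEFAULTS` and `task_slots` are hypothetical, and a platform may override any of these defaults.

```python
# Defaults as listed in the Spark configuration documentation.
DEFAULTS = {
    "spark.sql.shuffle.partitions": 200,                       # not 300
    "spark.sql.autoBroadcastJoinThreshold": 10 * 1024 * 1024,  # 10MB, in bytes
    "spark.task.cpus": 1,                                      # cores per task
}

def task_slots(executor_cores, task_cpus=DEFAULTS["spark.task.cpus"]):
    """Number of tasks one executor can run at the same time."""
    return executor_cores // task_cpus

# With the default of one core per task, the slots equal spark.executor.cores.
slots_default = task_slots(executor_cores=4)

# Reserving two cores per task halves the number of concurrent tasks.
slots_two_cpus = task_slots(executor_cores=4, task_cpus=2)
```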

Question #5

Which of the following code blocks returns a DataFrame with approximately 1,000 rows from the 10,000-row DataFrame itemsDf, without any duplicates, returning the same rows even if the code block is run twice?

A. itemsDf.sample(withReplacement=True, fraction=0.1, seed=23536)

B. itemsDf.sample(fraction=0.1, seed=87238)

C. itemsDf.sample(fraction=0.1)

D. itemsDf.sample(fraction=1000, seed=98263)

E. itemsDf.sampleBy("row", fractions={0: 0.1}, seed=82371)

Correct Answer: B
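Option B relies on two properties: sampling without replacement (the default of `sample`) never returns a row twice, and a fixed seed makes the draw repeatable. The sketch below models this in plain Python by keeping each row independently with probability `fraction`. It is not Spark's sampler, and `bernoulli_sample` and `items` are hypothetical names.

```python
import random

def bernoulli_sample(rows, fraction, seed):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return [row for row in rows if rng.random() < fraction]

# Hypothetical stand-in for the 10,000-row itemsDf.
items = list(range(10_000))

first = bernoulli_sample(items, fraction=0.1, seed=87238)
second = bernoulli_sample(items, fraction=0.1, seed=87238)
```

The sample size is only approximately 1,000 because every row is decided independently, which matches the wording "approximately" in the question.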

Question #6

Which of the following describes Spark's way of managing memory?

A. Spark's memory usage can be divided into three categories: Execution, transaction, and storage.

B. Storage memory is used for caching partitions derived from DataFrames.

C. Disabling serialization potentially greatly reduces the memory footprint of a Spark application.

D. Spark uses a subset of the reserved system memory.

E. As a general rule for garbage collection, Spark performs better on many small objects than few big objects.

Correct Answer: B

Question #7

Which of the following code blocks returns a one-column DataFrame of the distinct values in column supplier of DataFrame itemsDf that do not contain the letter X?

A. itemsDf.filter(~col('supplier').contains('X')).select('supplier').distinct()

B. itemsDf.filter(not(col('supplier').contains('X'))).select('supplier').unique()

C. itemsDf.filter(!col('supplier').contains('X')).select(col('supplier')).unique()

D. itemsDf.select(~col('supplier').contains('X')).distinct()

E. itemsDf.filter(col(supplier).not_contains('X')).select(supplier).distinct()

Correct Answer: A
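Option A negates the `contains` condition with `~`, keeps only column supplier, and removes duplicates with `distinct()`. Python's `not` does not work on column expressions, `!` is not a Python operator, and DataFrames have no `unique()` method. The plain-Python sketch below models the result only, and the values in `suppliers` are hypothetical.

```python
# Hypothetical values of column supplier in itemsDf.
suppliers = [
    "Sports Company Inc.",
    "YetiX",
    "Sports Company Inc.",
    "XYZ Corp",
    "Alpine Gear",
]

# Keep values without the letter X, like filter(~col('supplier').contains('X')),
# then deduplicate them, like distinct(). The check is case-sensitive.
result = sorted({s for s in suppliers if "X" not in s})
```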

Question #8

Which of the following code blocks creates a new one-column, two-row DataFrame dfDates with column date of type timestamp?

A. dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])

B. dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])
dfDates = dfDates.withColumnRenamed("date", to_datetime("date", "yyyy-MM-dd HH:mm:ss"))

C. dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
dfDates = dfDates.withColumn("date", to_timestamp("date", "dd/MM/yyyy HH:mm:ss"))

D. dfDates = spark.createDataFrame(["23/01/2022 11:28:12","24/01/2022 10:58:34"], ["date"])
dfDates = dfDates.withColumn("date", to_timestamp("dd/MM/yyyy HH:mm:ss", "date"))

E. dfDates = spark.createDataFrame([("23/01/2022 11:28:12",),("24/01/2022 10:58:34",)], ["date"])
dfDates = dfDates.withColumnRenamed("date", to_timestamp("date", "yyyy-MM-dd HH:mm:ss"))

Correct Answer: C
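Option C works because the pattern `dd/MM/yyyy HH:mm:ss` matches the layout of the strings, whereas `yyyy-MM-dd HH:mm:ss` does not. The same idea can be checked with Python's `datetime.strptime`, used here as an analogy rather than as Spark code. The constants `MATCHING_FORMAT` and `MISMATCHED_FORMAT` are hypothetical names holding the strptime counterparts of the two Spark patterns.

```python
from datetime import datetime

raw_dates = ["23/01/2022 11:28:12", "24/01/2022 10:58:34"]

# strptime counterpart of the Spark pattern "dd/MM/yyyy HH:mm:ss".
MATCHING_FORMAT = "%d/%m/%Y %H:%M:%S"
# strptime counterpart of "yyyy-MM-dd HH:mm:ss", which does not fit the data.
MISMATCHED_FORMAT = "%Y-%m-%d %H:%M:%S"

# Parsing with the matching pattern yields real timestamp objects.
parsed = [datetime.strptime(value, MATCHING_FORMAT) for value in raw_dates]

# Parsing with the mismatched pattern fails.
try:
    datetime.strptime(raw_dates[0], MISMATCHED_FORMAT)
    mismatch_parses = True
except ValueError:
    mismatch_parses = False
```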

Question #9

The code block displayed below contains an error. The code block should use Python method find_most_freq_letter to find the letter present most in column itemName of DataFrame itemsDf and return it in a new column most_frequent_letter. Find the error.
Code block:
find_most_freq_letter_udf = udf(find_most_freq_letter)
itemsDf.withColumn("most_frequent_letter", find_most_freq_letter("itemName"))

A. The UDF method is not registered correctly, since the return type is missing.

B. UDFs do not exist in PySpark.

C. Spark is not adding a column.

D. The "itemName" expression should be wrapped in col().

E. Spark is not using the UDF method correctly.

Correct Answer: E
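The code block wraps the function in a UDF but then calls the plain Python function, which receives the literal string "itemName" instead of the column's values. The sketch below makes this visible. The body of `find_most_freq_letter` is a hypothetical implementation, since the question does not show one, and the PySpark lines appear only as comments.

```python
from collections import Counter

def find_most_freq_letter(text):
    """Hypothetical implementation: most common letter, case-insensitive."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    return Counter(letters).most_common(1)[0][0] if letters else None

# In PySpark, the wrapped function is the one to apply to the column:
# find_most_freq_letter_udf = udf(find_most_freq_letter)
# itemsDf.withColumn("most_frequent_letter", find_most_freq_letter_udf("itemName"))

# The plain function only ever sees the string it is handed.
letter_of_column_name = find_most_freq_letter("itemName")
letter_of_value = find_most_freq_letter("Outdoors Backpack")
```

Here `letter_of_column_name` is derived from the characters of the column name itself, which is not what the code block intends.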

Question #10

The code block displayed below contains an error. The code block should count the number of rows that have a predError of either 3 or 6. Find the error.
Code block:
transactionsDf.filter(col('predError').in([3, 6])).count()

A. Instead of a list, the values need to be passed as single arguments to the in operator.

B. Instead of filter, the select method should be used.

C. The method used on column predError is incorrect.

D. Numbers 3 and 6 need to be passed as string variables.

E. The number of rows cannot be determined with the count() operator.

Correct Answer: C
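The method for membership tests on a column is `isin`, not `in`. Since `in` is a reserved word in Python, the original expression does not even parse. The sketch below checks this with the standard library and then models the intended count in plain Python. The helper `is_valid_python` and the values in `pred_errors` are hypothetical.

```python
import keyword

# `in` is a reserved word in Python, so it cannot be used as a method name.
in_is_reserved = keyword.iskeyword("in")

def is_valid_python(source):
    """Return True if `source` parses as a Python expression."""
    try:
        compile(source, "<example>", "eval")  # parses only, does not run
        return True
    except SyntaxError:
        return False

original_parses = is_valid_python("col('predError').in([3, 6])")
corrected_parses = is_valid_python("col('predError').isin([3, 6])")

# Plain-Python model of the intended count, on hypothetical predError values.
pred_errors = [3, 6, 1, 3, None, 5]
matching_rows = sum(1 for value in pred_errors if value in (3, 6))
```

The corrected code block would read `transactionsDf.filter(col('predError').isin([3, 6])).count()`.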

Question #11

Which of the following describes the conversion of a computational query into an execution plan in Spark?

A. The catalog assigns specific resources to the physical plan.

B. The catalog assigns specific resources to the optimized memory plan.

C. Depending on whether DataFrame API or SQL API are used, the physical plan may differ.

D. Spark uses the catalog to resolve the optimized logical plan.

E. The executed physical plan depends on a cost optimization from a previous stage.

Correct Answer: E

Question #12

Which of the following describes a shuffle?

A. A shuffle is a process that allocates partitions to executors.

B. A shuffle is a Spark operation that results from DataFrame.coalesce().

C. A shuffle is a process that compares data across executors.

D. A shuffle is a process that compares data across partitions.

E. A shuffle is a process that is executed during a broadcast hash join.

Correct Answer: D
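One way to picture a shuffle is as a redistribution of rows between partitions so that rows with the same key end up together, for example before an aggregation or join. The sketch below is a plain-Python model under that assumption, not Spark's implementation, and `shuffle_by_key` and `input_partitions` are hypothetical names.

```python
def shuffle_by_key(partitions, num_output_partitions):
    """Redistribute (key, value) rows so equal keys share an output partition."""
    output = [[] for _ in range(num_output_partitions)]
    for partition in partitions:
        for key, value in partition:
            # Small non-negative integers hash to themselves in CPython,
            # so the target partition is deterministic for these keys.
            output[hash(key) % num_output_partitions].append((key, value))
    return output

# Hypothetical input: rows with the same key are spread over two partitions.
input_partitions = [
    [(1, "a"), (2, "b"), (3, "c")],
    [(1, "d"), (2, "e"), (4, "f")],
]

shuffled = shuffle_by_key(input_partitions, num_output_partitions=2)
```

After the shuffle, both rows with key 1 sit in the same output partition, as do both rows with key 2.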

Question #13

Which of the following statements about storage levels is incorrect?

A. MEMORY_AND_DISK replicates cached DataFrames both on memory and disk.

B. The cache operator on DataFrames is evaluated like a transformation.

C. In client mode, DataFrames cached with the MEMORY_ONLY_2 level will not be stored in the edge node's memory.

D. Caching can be undone using the DataFrame.unpersist() operator.

E. DISK_ONLY will not use the worker node's memory.

Correct Answer: A

Question #14

Which of the following code blocks returns only rows from DataFrame transactionsDf in which values in column productId are unique?

A. transactionsDf.distinct("productId")

B. transactionsDf.drop_duplicates(subset="productId")

C. transactionsDf.dropDuplicates(subset=["productId"])

D. transactionsDf.dropDuplicates(subset="productId")

E. transactionsDf.unique("productId")

Correct Answer: C
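`dropDuplicates(subset=["productId"])` keeps one full row per distinct productId. The sketch below models this in plain Python. It is not Spark code: the helper `drop_duplicates` and the sample rows are hypothetical, and the model keeps the first row it encounters, whereas Spark does not guarantee which of the duplicate rows survives.

```python
# Hypothetical rows standing in for transactionsDf: (transactionId, productId).
transactions = [(1, 3), (2, 2), (3, 3), (4, 7), (5, 2)]

def drop_duplicates(rows, key_index):
    """Keep one row per distinct value at position `key_index`."""
    seen = set()
    kept = []
    for row in rows:
        if row[key_index] not in seen:
            seen.add(row[key_index])
            kept.append(row)  # this model keeps the first row it meets
    return kept

result = drop_duplicates(transactions, key_index=1)
```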

Question #15

Which of the following code blocks performs an inner join between DataFrame itemsDf and DataFrame transactionsDf, using columns itemId and transactionId as join keys, respectively?

A. itemsDf.join(transactionsDf, itemsDf.itemId == transactionsDf.transactionId, "inner")

B. itemsDf.join(transactionsDf, "inner", itemsDf.itemId == transactionsDf.transactionId)

C. itemsDf.join(transactionsDf, "itemsDf.itemId == transactionsDf.transactionId", "inner")

D. itemsDf.join(transactionsDf, col(itemsDf.itemId) == col(transactionsDf.transactionId))

E. itemsDf.join(transactionsDf, itemId == transactionId)

Correct Answer: A
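Option A passes the arguments in the order `join` expects: the other DataFrame, the join expression, then the join type. The result of such an inner join can be modelled in plain Python as below. This is an illustration only, and the sample rows in `items` and `transactions` are hypothetical.

```python
# Hypothetical rows: items as (itemId, itemName),
# transactions as (transactionId, value).
items = [(1, "Coat"), (2, "Boots"), (3, "Backpack")]
transactions = [(1, 4.0), (3, 9.5), (5, 2.0)]

# Inner join: keep only the pairs where itemId equals transactionId.
joined = [
    item + transaction
    for item in items
    for transaction in transactions
    if item[0] == transaction[0]
]
```

Rows without a partner on the other side (item 2 and transaction 5) are dropped, and both key columns remain in the output because the join condition is an expression rather than a shared column name.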
