Free Cloudera CDP-3002 Practice Test & Real Exam Questions

Exam Code/Number: CDP-3002
Exam Name/Title: CDP Data Engineer - Certification Exam
Certification Provider: Cloudera
Corresponding Certification: Cloudera Certification

Exam Questions: 320
Updated On: Jul 05, 2026

Question #10

You are deploying a Spark application on Kubernetes and need to specify the amount of memory allocated to each Executor. In your PySpark code, which configuration setting will you use?

A. 'spark.executor.memory'

B. 'spark.executor.memoryOverhead'

C. 'spark.driver.memory'

D. 'spark.executor.instances'

Correct Answer: A
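
To make the executor sizing properties from this question concrete, here is a minimal sketch. The helper names (`executor_conf`, `to_submit_args`, `build_session`) and the values `4g`, `512m`, and `2` are illustrative assumptions, not taken from the exam. `build_session` imports PySpark lazily, so the two pure helpers run without Spark installed.

```python
from typing import Dict, List


def executor_conf(memory: str = "4g", overhead: str = "512m", instances: int = 2) -> Dict[str, str]:
    """Return Spark properties that size executors (values are illustrative)."""
    return {
        "spark.executor.memory": memory,             # heap per executor: the setting asked about
        "spark.executor.memoryOverhead": overhead,   # extra non-heap memory per executor
        "spark.executor.instances": str(instances),  # how many executors, not how large
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render the properties as spark-submit `--conf key=value` arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


def build_session(conf: Dict[str, str], app_name: str = "executor-memory-demo"):
    """Create a SparkSession with the given properties (requires pyspark)."""
    from pyspark.sql import SparkSession  # lazy import: helpers above work without Spark

    builder = SparkSession.builder.appName(app_name)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

On Kubernetes the executor pod's memory request is derived from the executor memory plus the overhead, so both values usually need to fit within the node's capacity.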

Question #11

An Airflow DAG designed to run a sequence of data validation checks generates a dynamic number of validation tasks based on the incoming data's characteristics. Each validation task must complete successfully before a final data processing task can begin. Which Airflow feature is most suitable for implementing this pattern?

A. The Dynamic Task Mapping feature

B. A BranchPythonOperator with a follow-up Join task

C. The TriggerDagRunOperator

D. SubDAGs

Correct Answer: A
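
The pattern in this question can be sketched with Dynamic Task Mapping roughly as follows. This assumes Airflow 2.4 or later; the DAG name, the task names, the `plan_checks` helper, and the sample profile are all hypothetical. The Airflow imports sit inside `build_dag` so that `plan_checks` can be exercised without Airflow installed.

```python
from typing import Dict, List


def plan_checks(profile: Dict[str, List[str]]) -> List[Dict[str, str]]:
    """Turn a (hypothetical) data profile into one dict per validation check."""
    return [
        {"column": column, "rule": rule}
        for column, rules in profile.items()
        for rule in rules
    ]


def build_dag():
    """Define the DAG; requires Apache Airflow 2.4+ to be installed."""
    import pendulum
    from airflow.decorators import dag, task

    @dag(schedule=None, start_date=pendulum.datetime(2024, 1, 1, tz="UTC"), catchup=False)
    def validation_pipeline():
        @task
        def list_checks() -> List[Dict[str, str]]:
            # Placeholder profile; a real DAG would inspect the incoming data here.
            return plan_checks({"order_id": ["not_null", "unique"], "amount": ["non_negative"]})

        @task
        def validate(check: Dict[str, str]) -> str:
            # Raising an exception here fails this mapped task instance.
            return f"{check['column']}:{check['rule']}:ok"

        @task
        def process(results):
            # With the default trigger rule (all_success), this runs only
            # after every mapped validation task has succeeded.
            print(f"{len(list(results))} checks passed")

        # One `validate` task instance is created per element at run time.
        process(validate.expand(check=list_checks()))

    return validation_pipeline()
```

The number of `validate` instances is decided when `list_checks` finishes, which is what makes the task count dynamic.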

Question #12

In the context of Cloudera's Optimization Framework, what role does data statistics collection play?

A. It reduces the need for data compression

B. It is used to generate more data

C. It helps the optimizer make informed decisions about data layout and query execution plans

D. It provides metadata for security enforcement

Correct Answer: C

Question #13

What mechanism does Apache Airflow provide to delay the execution of a task until a certain condition is met?

A. The delay parameter in task definitions.

B. The execution_timeout attribute, which can be set to postpone task execution.

C. The wait_for_downstream setting in the DAG configuration.

D. Sensors, which are special operators that wait for a certain condition to be true.

Correct Answer: D
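
Sensors, the last option listed, can be sketched as follows. This is a minimal example assuming the Airflow 2.x import paths; the DAG id, task ids, file path, and the `file_has_arrived` condition are hypothetical. The Airflow imports are kept inside `build_dag` so the condition function can be run on its own.

```python
from pathlib import Path


def file_has_arrived(path: str) -> bool:
    """Condition polled by the sensor: True once the file exists and is non-empty."""
    p = Path(path)
    return p.is_file() and p.stat().st_size > 0


def build_dag(path: str = "/tmp/incoming/data.csv"):
    """Define the DAG; requires Apache Airflow 2.4+ to be installed."""
    import pendulum
    from airflow import DAG
    from airflow.operators.empty import EmptyOperator
    from airflow.sensors.python import PythonSensor

    with DAG(
        dag_id="wait_for_file_demo",
        schedule=None,
        start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
        catchup=False,
    ) as dag:
        wait = PythonSensor(
            task_id="wait_for_file",
            python_callable=file_has_arrived,
            op_args=[path],
            poke_interval=60,    # re-check the condition every 60 seconds
            timeout=60 * 60,     # fail the task if it is still false after an hour
            mode="reschedule",   # release the worker slot between checks
        )
        process = EmptyOperator(task_id="process_file")
        wait >> process          # downstream work starts only after the sensor succeeds
    return dag
```

By contrast, `execution_timeout` caps how long a task may run, and `wait_for_downstream` makes a task wait on the previous run's downstream tasks; neither waits for an arbitrary condition.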

Question #14

In Apache Spark, which of the following is the most effective strategy for minimizing data shuffling across nodes in a cluster?

A. Decreasing the number of partitions

B. Increasing the number of partitions

C. Using broadcast variables for small data

D. Filtering data after a wide transformation

Correct Answer: C
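
The idea behind broadcasting small data can be illustrated with a short sketch. `map_side_join` is a plain-Python stand-in (an assumption for illustration, not Spark code) showing that each partition is joined locally against a copy of the small table, so rows of the large side never move. `spark_broadcast_join` shows the corresponding PySpark hint and imports PySpark lazily.

```python
from typing import Dict, Hashable, List, Tuple

Row = Tuple[Hashable, object]


def map_side_join(partitions: List[List[Row]], lookup: Dict[Hashable, object]) -> List[List[tuple]]:
    """Join every partition in place against a small lookup table copied to each worker."""
    joined = []
    for partition in partitions:  # partitions stay where they are: no shuffle
        joined.append([(k, v, lookup[k]) for k, v in partition if k in lookup])
    return joined


def spark_broadcast_join(large_df, small_df, key: str):
    """Inner-join two DataFrames, hinting Spark to broadcast the small one (requires pyspark)."""
    from pyspark.sql.functions import broadcast

    return large_df.join(broadcast(small_df), on=key, how="inner")
```

The benefit holds only while the broadcast side fits comfortably in each executor's memory.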

Question #15

In the context of Cloudera's Optimization Framework, what is the purpose of dynamic partition pruning?

A. To dynamically eliminate unnecessary partitions from a query plan based on runtime statistics

B. To update partition metadata in real-time

C. To increase the size of partitions dynamically based on data volume

D. To partition data dynamically based on query execution plans

Correct Answer: A

Question #16

Considering Hive's optimization mechanisms, under which scenario might partition pruning fail to improve query performance?

A. When the partitioned table contains a small number of partitions

B. When the table is partitioned on a column frequently used in query filters

C. When querying data using the exact partition key in the WHERE clause

D. When querying data using a non-partition column as a filter

Correct Answer: D

Question #17

You need to process data stored in AWS S3 using SparkSQL. Which of the following options correctly reads a JSON file stored in S3 into a DataFrame and performs a SQL query on it?

Correct Answer: D
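
The answer options for this question are not reproduced here, so the following is only a general sketch of the usual pattern, not a reconstruction of any option. It assumes an existing SparkSession with S3 credentials and the `s3a` connector already configured; the function names, view name, and default query are hypothetical.

```python
from typing import Optional


def s3a_uri(bucket: str, key: str) -> str:
    """Build an s3a:// URI, the scheme used by the Hadoop S3A connector."""
    return f"s3a://{bucket}/{key.lstrip('/')}"


def query_json_on_s3(spark, path: str, view: str = "events", sql: Optional[str] = None):
    """Read JSON into a DataFrame, register it as a temp view, and query it with SQL."""
    df = spark.read.json(path)        # DataFrameReader.json infers the schema
    df.createOrReplaceTempView(view)  # makes the DataFrame addressable from SQL
    return spark.sql(sql or f"SELECT * FROM {view} LIMIT 10")
```

A call might look like `query_json_on_s3(spark, s3a_uri("example-bucket", "data/events.json"))`, where the bucket and key are placeholders.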

Question #18

You're working with an ETL pipeline that extracts data from multiple sources. How can you ensure that the pipeline only processes the latest data and avoids re-processing already processed data?

A. Use timestamps or versioning information provided by the data sources to identify new data.

B. Rely on Airflow's built-in mechanisms to handle data freshness automatically.

C. Configure the data sources to only provide new data by default.

D. Implement a custom mechanism to track the last processed record for each source and filter data accordingly.

Correct Answer: A, D
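
Options A and D combine naturally into a per-source watermark. The sketch below is one possible shape, assuming every record carries an `updated_at` timestamp; the field name, the function names, and the in-memory `watermarks` dict (which a real pipeline would persist) are illustrative assumptions.

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple

Record = Dict[str, object]


def select_new(records: List[Record], last_seen: Optional[datetime]) -> Tuple[List[Record], Optional[datetime]]:
    """Keep records strictly newer than the watermark and return the advanced watermark."""
    fresh = [r for r in records if last_seen is None or r["updated_at"] > last_seen]
    new_mark = max((r["updated_at"] for r in fresh), default=last_seen)
    return fresh, new_mark


def run_increment(sources: Dict[str, List[Record]], watermarks: Dict[str, Optional[datetime]]) -> List[Record]:
    """Collect unprocessed records from every source and update each source's watermark."""
    batch: List[Record] = []
    for name, records in sources.items():
        fresh, mark = select_new(records, watermarks.get(name))
        batch.extend(fresh)
        watermarks[name] = mark  # persist this in a real pipeline, after the load succeeds
    return batch
```

Running `run_increment` twice over unchanged sources yields an empty second batch, which is the re-processing guard the question asks for.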
