Databricks Certified Professional Data Engineer Exam (Databricks-Certified-Professional-Data-Engineer) Free Practice Test

Question 1

The DevOps team has configured a production workload as a collection of notebooks scheduled to run daily using the Jobs UI. A new data engineering hire is onboarding to the team and has requested access to one of these notebooks to review the production logic.
What are the maximum notebook permissions that can be granted to the user without allowing accidental changes to production code or data?

A. Can Read

B. Can Run

C. Can Manage

D. Can Edit

Correct Answer: A

Question 2

The data engineering team is configuring environments for development, testing, and production before beginning migration of a new data pipeline. The team requires extensive testing on both the code and the data resulting from code execution, and the team wants to develop and test against data that is as similar to production data as possible.
A junior data engineer suggests that production data can be mounted to the development and testing environments, allowing pre-production code to execute against production data. Because all users have Admin privileges in the development environment, the junior data engineer has offered to configure permissions and mount this data for the team.
Which statement captures best practices for this situation?

A. In environments where interactive code will be executed, production data should only be accessible with read permissions; creating isolated databases for each environment further reduces risks.

B. All development, testing, and production code and data should exist in a single unified workspace; creating separate environments for testing and development further reduces risks.

C. Because Delta Lake versions all data and supports time travel, it is not possible for user error or malicious actors to permanently delete production data; as such, it is generally safe to mount production data anywhere.

D. Because access to production data will always be verified using passthrough credentials, it is safe to mount data to any Databricks development environment.

Correct Answer: A

Question 3

All records from an Apache Kafka producer are being ingested into a single Delta Lake table with the following schema:
key BINARY, value BINARY, topic STRING, partition LONG, offset LONG, timestamp LONG
There are 5 unique topics being ingested. Only the "registration" topic contains Personally Identifiable Information (PII). The company wishes to restrict access to PII. The company also wishes to retain records containing PII in this table for only 14 days after initial ingestion. However, for non-PII information, it would like to retain these records indefinitely.
Which of the following solutions meets the requirements?

A. Data should be partitioned by the topic field, allowing ACLs and delete statements to leverage partition boundaries.

B. Because the value field is stored as binary data, this information is not considered PII and no special precautions should be taken.

C. All data should be deleted biweekly; Delta Lake's time travel functionality should be leveraged to maintain a history of non-PII information.

D. Data should be partitioned by the registration field, allowing ACLs and delete statements to be set for the PII directory.

E. Separate object storage containers should be specified based on the partition field, allowing isolation at the storage level.

Correct Answer: A
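
The retention rule behind option A can be sketched in plain Python. This is only an illustration of the predicate a partition-aligned delete would express, not Delta Lake code. It assumes the `timestamp` column holds epoch milliseconds and approximates ingestion time; the names `PII_TOPICS`, `is_expired`, and `apply_retention` are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumption: only the "registration" topic carries PII, as stated in the question.
PII_TOPICS = {"registration"}
RETENTION = timedelta(days=14)

def is_expired(record, now):
    """True when a PII-topic record is older than the retention window."""
    if record["topic"] not in PII_TOPICS:
        return False  # non-PII topics are retained indefinitely
    # Assumption: timestamp is epoch milliseconds.
    ingested = datetime.fromtimestamp(record["timestamp"] / 1000, tz=timezone.utc)
    return now - ingested > RETENTION

def apply_retention(records, now):
    """Return only the records that should remain in the table."""
    return [r for r in records if not is_expired(r, now)]
```

Because the predicate depends first on `topic`, partitioning by that field lets a delete touch only the files of the PII partition.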

Question 4

Which statement regarding Spark configuration on the Databricks platform is true?

A. Spark configuration properties set for an interactive cluster with the Clusters UI will impact all notebooks attached to that cluster.

B. Spark configuration set within a notebook will affect all SparkSessions attached to the same interactive cluster.

C. The Databricks REST API can be used to modify the Spark configuration properties for an interactive cluster without interrupting jobs.

D. When the same Spark configuration property is set for an interactive to the same interactive cluster.

Correct Answer: A

Question 5

An organization processes customer data from web and mobile applications. Data includes names, emails, phone numbers, and location history. Data arrives both as batch files (from SFTP daily) and as streaming JSON events (from Kafka in real time).
To comply with data privacy policies, the following requirements must be met:
* Personally Identifiable Information (PII) such as email, phone number, and IP address must be masked or anonymized before storage.
* Both batch and streaming pipelines must apply consistent PII handling.
* Masking logic must be auditable and reproducible.
* The masked data must remain usable for downstream analytics.
How should the data engineer design a compliant data pipeline on Databricks that supports both batch and streaming modes, applies data masking to PII, and maintains traceability for audits?

A. Ingest both batch and streaming data using Lakeflow Declarative Pipelines, and apply masking via Unity Catalog column masks at read time to avoid modifying the data during ingestion.

B. Load batch data with notebooks and ingest streaming data with SQL Warehouses; use Unity Catalog column masks on Silver tables to redact fields after storage.

C. Allow PII to be stored unmasked in Bronze for lineage tracking, then apply masking logic in Gold tables used for reporting.

D. Use Lakeflow Declarative Pipelines for batch and streaming ingestion, define a PII masking function, and apply it during Bronze ingestion before writing to Delta Lake.

Correct Answer: D
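
The question does not specify the masking function named in option D. A minimal sketch, assuming a salted SHA-256 hash is an acceptable mask: the names `mask_pii`, `mask_record`, and `SALT`, and the field names, are hypothetical. A deterministic function like this gives the same token for the same input, so masked columns can still be joined and counted downstream, and the same function can be shared by the batch and streaming code paths.

```python
import hashlib

# Hypothetical salt for illustration; a real pipeline would read it from a secret store.
SALT = "example-salt"

def mask_pii(value, salt=SALT):
    """Return a deterministic one-way token for a PII value (None stays None)."""
    if value is None:
        return None
    return hashlib.sha256((salt + str(value)).encode("utf-8")).hexdigest()

def mask_record(record, pii_fields=("email", "phone", "ip_address")):
    """Return a copy of the record with the listed PII fields masked."""
    return {k: mask_pii(v) if k in pii_fields else v for k, v in record.items()}
```

Keeping the logic in one versioned function is one way to make the masking reproducible for audits.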

Question 6

A data engineer is designing a Lakeflow Spark Declarative Pipeline to process streaming order data. The pipeline uses Auto Loader to ingest data and must enforce data quality by ensuring customer_id is not null and amount is greater than zero. Invalid records should be dropped. Which Lakeflow Spark Declarative Pipelines configuration implements this requirement using Python?

A. @dlt.table
def silver_orders():
    return dlt.read_stream("bronze_orders") \
        .expect("valid_customer", "customer_id IS NOT NULL") \
        .expect("valid_amount", "amount > 0")

B. @dlt.table
@dlt.expect_or_drop("valid_customer", "customer_id IS NOT NULL")
@dlt.expect_or_drop("valid_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders")

C. @dlt.table
def silver_orders():
    return dlt.read_stream("bronze_orders") \
        .expect_or_drop("valid_customer", "customer_id IS NOT NULL") \
        .expect_or_drop("valid_amount", "amount > 0")

D. @dlt.table
@dlt.expect("valid_customer", "customer_id IS NOT NULL")
@dlt.expect("valid_amount", "amount > 0")
def silver_orders():
    return dlt.read_stream("bronze_orders")

Correct Answer: B
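
The difference between keeping, dropping, and failing on invalid records can be illustrated with a plain-Python stand-in. This is not the `dlt` API: `apply_expectations` and its `on_violation` modes are hypothetical names that loosely mirror the behavior of `expect` (keep and count), `expect_or_drop` (drop), and `expect_or_fail` (abort). Treating a missing amount as invalid is also an assumption of this sketch.

```python
def apply_expectations(rows, expectations, on_violation="warn"):
    """Plain-Python stand-in for expectation handling; not the dlt API.

    on_violation: "warn" keeps invalid rows, "drop" removes them, "fail" raises.
    Returns the surviving rows and the number of rows that violated any rule.
    """
    kept, violations = [], 0
    for row in rows:
        ok = all(check(row) for check in expectations.values())
        if not ok:
            violations += 1
            if on_violation == "fail":
                raise ValueError(f"expectation violated by {row}")
            if on_violation == "drop":
                continue  # invalid row is excluded from the output
        kept.append(row)
    return kept, violations

orders = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": None, "amount": 5.0},
    {"customer_id": 2, "amount": 0.0},
]
rules = {
    "valid_customer": lambda r: r["customer_id"] is not None,
    # Assumption: a missing amount counts as invalid.
    "valid_amount": lambda r: r["amount"] is not None and r["amount"] > 0,
}
```

With `on_violation="drop"`, only the first order survives, which matches the requirement that invalid records be dropped.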

Question 7

A data engineering team uses Databricks Lakehouse Monitoring to track the percent_null metric for a critical column in their Delta table. The profile metrics table (prod_catalog.prod_schema.customer_data_profile_metrics) stores hourly percent_null values. The team wants to trigger an alert when the daily average of percent_null exceeds 5% for three consecutive days, while ensuring notifications are not spammed during sustained issues. Which SQL alert configuration achieves this goal while minimizing false positives and redundant notifications?

A. SELECT AVG(percent_null) AS daily_avg
FROM prod_catalog.prod_schema.customer_data_profile_metrics
WHERE window.end >= CURRENT_TIMESTAMP - INTERVAL '3' DAY
Alert Condition: daily_avg > 5
Notification Frequency: Each time alert is evaluated

B. WITH daily_avg AS (
    SELECT
        DATE_TRUNC('DAY', window.end) AS day,
        AVG(percent_null) AS avg_null
    FROM prod_catalog.prod_schema.customer_data_profile_metrics
    GROUP BY DATE_TRUNC('DAY', window.end)
)
SELECT day, avg_null
FROM daily_avg
ORDER BY day DESC
LIMIT 3
Alert Condition: ALL avg_null > 5 for the latest 3 rows
Notification Frequency: Just once

C. SELECT SUM(CASE WHEN percent_null > 5 THEN 1 ELSE 0 END) AS violation_days
FROM prod_catalog.prod_schema.customer_data_profile_metrics
WHERE window.end >= CURRENT_TIMESTAMP - INTERVAL '3' DAY
Alert Condition: violation_days >= 3
Notification Frequency: Just once

D. SELECT percent_null
FROM prod_catalog.prod_schema.customer_data_profile_metrics
WHERE window.end >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
Alert Condition: percent_null > 5
Notification Frequency: At most every 24 hours

Correct Answer: B
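
The condition in option B (aggregate hourly values per day, then require each of the latest three daily averages to exceed the threshold) can be sketched in plain Python. The function names `daily_averages` and `should_alert` are hypothetical. Like the SQL in option B, the sketch takes the three most recent days that have data and does not check that they are adjacent calendar days.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def daily_averages(samples):
    """samples: iterable of (window_end: datetime, percent_null: float)."""
    buckets = defaultdict(list)
    for window_end, pct in samples:
        buckets[window_end.date()].append(pct)
    return {day: sum(vals) / len(vals) for day, vals in buckets.items()}

def should_alert(samples, threshold=5.0, days=3):
    """True when each of the latest `days` daily averages exceeds the threshold."""
    averages = daily_averages(samples)
    latest = sorted(averages, reverse=True)[:days]
    return len(latest) == days and all(averages[d] > threshold for d in latest)

# Example: three days with two samples each, all above the threshold.
start = datetime(2024, 1, 1)
samples = [(start + timedelta(days=d, hours=h), 6.0) for d in range(3) for h in (0, 12)]
alert = should_alert(samples)
```

Averaging per day before comparing avoids firing on a single hourly spike, which is the weakness of option D.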

Question 8

Review the following error traceback:

Which statement describes the error being raised?

A. There is a type error because a DataFrame object cannot be multiplied.

B. The code executed was PySpark but was executed in a Scala notebook.

C. There is no column in the table named heartrateheartrateheartrate.

D. There is a type error because a column object cannot be multiplied.

E. There is a syntax error because the heartrate column is not correctly identified as a column.

Correct Answer: C
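
The traceback itself is missing from this page, so the code that raised the error is unknown. As an assumption only, one way a column name such as heartrateheartrateheartrate can arise in Python is string repetition: multiplying a string by an integer repeats the string rather than producing a numeric expression.

```python
# Assumption: the column name was written as a plain string, not a column object.
column_name = "heartrate"

# In Python, int * str repeats the string.
repeated = 3 * column_name
```

If a name built this way is passed to a column selection, the lookup is for a column that does not exist in the table.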

Question 9

Two of the most common data locations on Databricks are the DBFS root storage and external object storage mounted with dbutils.fs.mount().
Which of the following statements is correct?

A. By default, both the DBFS root and mounted data sources are only accessible to workspace administrators.

B. The DBFS root is the most secure location to store data, because mounted storage volumes must have full public read and write permissions.

C. The DBFS root stores files in ephemeral block volumes attached to the driver, while mounted directories will always persist saved data to external storage between sessions.

D. DBFS is a file system protocol that allows users to interact with files stored in object storage using syntax and guarantees similar to Unix file systems.

E. Neither the DBFS root nor mounted storage can be accessed when using %sh in a Databricks notebook.

Correct Answer: D

Question 10

A healthcare analytics team is implementing a dimensional model in Delta Lake for patient care analysis.
They have a date dimension table and are evaluating design options to ensure it supports a wide range of time-based analyses.
Which design approach for the date dimension will support efficient time-based querying and aggregation?

A. Store only the date value and calculate all time attributes dynamically in queries.

B. Pre-calculate attributes like fiscal_period, quarter, month_name, day_of_week, and holiday.

C. Create separate dimension tables for different calendar systems (fiscal, academic, etc.).

D. Store the date as a string in the format YYYY-MM-DD for readability.

Correct Answer: B
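
Pre-calculating the attributes named in option B can be sketched with the Python standard library. This is a minimal illustration rather than a prescribed design: the function name `build_date_dimension`, the `date_key` format, the July fiscal-year start, and the holiday set are all assumptions.

```python
import calendar
from datetime import date, timedelta

def build_date_dimension(start, end, fiscal_year_start_month=7, holidays=frozenset()):
    """Return one row per calendar day with pre-computed attributes."""
    rows = []
    day = start
    while day <= end:
        rows.append({
            "date_key": int(day.strftime("%Y%m%d")),  # e.g. 20240101
            "date": day,
            "quarter": (day.month - 1) // 3 + 1,
            "month_name": calendar.month_name[day.month],
            "day_of_week": calendar.day_name[day.weekday()],
            # Fiscal period counted from the assumed fiscal-year start month.
            "fiscal_period": (day.month - fiscal_year_start_month) % 12 + 1,
            "holiday": day in holidays,
        })
        day += timedelta(days=1)
    return rows
```

Queries can then filter and group on these stored columns directly instead of recomputing date functions for every row.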
