Welcome to TestSimulate

Pass Your Next Certification Exam Fast!

Everything you need to prepare, learn & pass your certification exam easily.

365 days free updates. First attempt guaranteed success.

[Apr-2023] Databricks Databricks-Certified-Professional-Data-Engineer Exam Basic Questions With Answers [Q52-Q72]

Share

[Apr-2023] Databricks Databricks-Certified-Professional-Data-Engineer Exam: Basic Questions With Answers

New 2023 Realistic Free Databricks Databricks-Certified-Professional-Data-Engineer Exam Dump Questions and Answer


Databricks is a leading cloud-based data platform that enables organizations to accelerate innovation and achieve their data-driven goals. To showcase their expertise in using the Databricks platform, data professionals can earn the Databricks-Certified-Professional-Data-Engineer (Databricks Certified Professional Data Engineer) certification. This certification is designed to validate the skills and knowledge required to design, build, and maintain data solutions on the Databricks platform.

 

NEW QUESTION # 52
Below table temp_data has one column called raw contains JSON data that records temperature for every four hours in the day for the city of Chicago, you are asked to calculate the maximum temperature that was ever recorded for 12:00 PM hour across all the days. Parse the JSON data and use the necessary array function to calculate the max temp.
Table: temp_date
Column: raw
Datatype: string

Expected output: 58

  • A. 1.select max(raw.chicago.temp[3]) from temp_data
  • B. 1.select array_max(raw.chicago[*].temp[3]) from temp_data
  • C. 1.select max(from_json(raw:chicago[3].temp[3],'array<int>')) from temp_data
  • D. 1.select array_max(from_json(raw:chicago[*].temp[3],'array<int>')) from temp_data
  • E. 1.select array_max(from_json(raw['chicago'].temp[3],'array<int>')) from temp_data

Answer: D

Explanation:
Explanation
Note: This is a difficult question, more likely you may see easier questions similar to this but the more you are prepared for the exam easier it is to pass the exam.
Use this below link to look for more examples, this will definitely help you,
https://docs.databricks.com/optimizations/semi-structured.html
Here is the solution, step by step
Text Description automatically generated

Use this below link to look for more examples, this will definitely help you,
https://docs.databricks.com/optimizations/semi-structured.html
If you want to try this solution use below DDL,
1.create or replace table temp_data
2. as select ' {
3. "chicago":[
4.{"date":"01-01-2021",
5."temp":[25,28,45,56,39,25]
6.},
7.{"date":"01-02-2021",
8."temp":[25,28,49,54,38,25]
9.},
10.{"date":"01-03-2021",
11."temp":[25,28,49,58,38,25]
12. }]
13. }
14. ' as raw
15.
16.select array_max(from_json(raw:chicago[*].temp[3],'array<int>')) from temp_data
17.


NEW QUESTION # 53
A particular job seems to be performing slower and slower over time, the team thinks this started to happen when a recent production change was implemented, you were asked to take look at the job history and see if we can identify trends and root cause, where in the workspace UI can you perform this analysis?

  • A. Under jobs UI select the job cluster, under spark UI select the application job logs, then you can access last 60 day historical runs
  • B. Historical job runs can only be accessed by REST API
  • C. Under jobs UI select the job you are interested, under runs we can see current active runs and last 60 days historical run
  • D. Under Compute UI, select Job cluster and select the job cluster to see last 60 day his-torical runs
  • E. Under Workspace logs, select job logs and select the job you want to monitor to view the last 60 day historical runs

Answer: C

Explanation:
Explanation
The answer is,
Under jobs UI select the job you are interested, under runs we can see current active runs and last 60 days historical run


NEW QUESTION # 54
What is the purpose of the bronze layer in a Multi-hop architecture?

  • A. Provides efficient storage and querying of full unprocessed history of data
  • B. Contains aggregated data that is to be consumed into Silver
  • C. Used as a data source for Machine learning applications.
  • D. Can be used to eliminate duplicate records
  • E. Perform data quality checks, corrupt data quarantined

Answer: A

Explanation:
Explanation
The answer is Provides efficient storage and querying of full unprocessed history of data Medallion Architecture - Databricks Bronze Layer:
1.Raw copy of ingested data
2.Replaces traditional data lake
3.Provides efficient storage and querying of full, unprocessed history of data
4.No schema is applied at this layer
Exam focus: Please review the below image and understand the role of each layer(bronze, silver, gold) in medallion architecture, you will see varying questions targeting each layer and its purpose.
Sorry I had to add the watermark some people in Udemy are copying my content.


NEW QUESTION # 55
When using the complete mode to write stream data, how does it impact the target table?

  • A. Delta commits transaction once the stream is stopped
  • B. Target table cannot be updated while stream is pending
  • C. Target table is overwritten for each batch
  • D. Stream must complete to write the data
  • E. Entire stream waits for complete data to write

Answer: C

Explanation:
Explanation
The answer is Target table is overwritten for each batch
Complete mode - The whole Result Table will be outputted to the sink after every trigger. This is supported for aggregation queries


NEW QUESTION # 56
Newly joined data analyst requested read-only access to tables, assuming you are owner/admin which section of Databricks platform is going to facilitate granting select access to the user

  • A. Azure RBAC
  • B. Data explorer
  • C. Admin console
  • D. User settings
  • E. Azure Databricks control pane IAM

Answer: B

Explanation:
Explanation
Anser is Data Explorer
https://docs.databricks.com/sql/user/data/index.html
Data explorer lets you easily explore and manage permissions on databases and tables. Users can view schema details, preview sample data, and see table details and properties. Administrators can view and change owners, and admins and data object owners can grant and revoke permissions.
To open data explorer, click Data in the sidebar.


NEW QUESTION # 57
You are currently asked to work on building a data pipeline, you have noticed that you are currently working with a data source that has a lot of data quality issues and you need to monitor data quality and enforce it as part of the data ingestion process, which of the following tools can be used to address this problem?

  • A. DELTA LIVE TABLES
  • B. STRUCTURED STREAMING with MULTI HOP
  • C. AUTO LOADER
  • D. JOBS and TASKS
  • E. UNITY Catalog and Data Governance

Answer: A

Explanation:
Explanation
The answer is, DELTA LIVE TABLES
Delta live tables expectations can be used to identify and quarantine bad data, all of the data quality metrics are stored in the event logs which can be used to later analyze and monitor.
DELTA LIVE Tables expectations
Below are three types of expectations, make sure to pay attention differences between these three.
Retain invalid records:
Use the expect operator when you want to keep records that violate the expectation. Records that violate the expectation are added to the target dataset along with valid records:
Python
[email protected]("valid timestamp", "col("timestamp") > '2012-01-01'")
SQL
1.CONSTRAINT valid_timestamp EXPECT (timestamp > '2012-01-01')
Drop invalid records:
Use the expect or drop operator to prevent the processing of invalid records. Records that violate the expectation are dropped from the target dataset:
Python
[email protected]_or_drop("valid_current_page", "current_page_id IS NOT NULL AND cur-rent_page_title IS NOT NULL") SQL
1.CONSTRAINT valid_current_page EXPECT (current_page_id IS NOT NULL and cur-rent_page_title IS NOT NULL) ON VIOLATION DROP ROW Fail on invalid records:
When invalid records are unacceptable, use the expect or fail operator to halt execution imme-diately when a record fails validation. If the operation is a table update, the system atomically rolls back the transaction:
Python
[email protected]_or_fail("valid_count", "count > 0")
SQL
1.CONSTRAINT valid_count EXPECT (count > 0) ON VIOLATION FAIL UPDATE


NEW QUESTION # 58
Which of the following commands can be used to query a delta table?

  • A. Both A & B
    (Correct)
  • B. 1.%python
    2.delta.sql("select * from table")
  • C. 1.%sql
    2.Select * from table_name
  • D. 1.%python
    2.spark.sql("select * from table_name")
  • E. 1.%python
    2.execute.sql("select * from table")

Answer: A

Explanation:
Explanation
The answer is both options A and B
Options C and D are incorrect because there is no command in Spark called execute.sql or delta.sql


NEW QUESTION # 59
A data engineer has created a Delta table as part of a data pipeline. Downstream data analysts now need
SELECT permission on the Delta table.
Assuming the data engineer is the Delta table owner, which part of the Databricks Lakehouse Plat-form can
the data engineer use to grant the data analysts the appropriate access?

  • A. Databricks Filesystem
  • B. Jobs
    B Dashboards
  • C. Data Explorer
  • D. Repos

Answer: D


NEW QUESTION # 60
Once a cluster is deleted, below additional actions need to performed by the administrator

  • A. No action needs to be performed. All resources are automatically removed.
  • B. Remove networking but Virtual machines and storage disks are automatically dropped
  • C. Drop storage disks but Virtual machines and networking are automatically dropped
  • D. Remove virtual machines but storage and networking are automatically dropped
  • E. Remove logs

Answer: A

Explanation:
Explanation
What is Delta?
Delta lake is
* Open source
* Builds up on standard data format
* Optimized for cloud object storage
* Built for scalable metadata handling
Delta lake is not
* Proprietary technology
* Storage format
* Storage medium
* Database service or data warehouse


NEW QUESTION # 61
The data analyst team had put together queries that identify items that are out of stock based on orders and replenishment but when they run all together for final output the team noticed it takes a really long time, you were asked to look at the reason why queries are running slow and identify steps to improve the performance and when you looked at it you noticed all the code queries are running sequentially and using a SQL endpoint cluster. Which of the following steps can be taken to resolve the issue?
Here is the example query
1.--- Get order summary
2.create or replace table orders_summary
3.as
4.select product_id, sum(order_count) order_count
5.from
6. (
7. select product_id,order_count from orders_instore
8. union all
9. select product_id,order_count from orders_online
10. )
11.group by product_id
12.-- get supply summary
13.create or repalce tabe supply_summary
14.as
15.select product_id, sum(supply_count) supply_count
16.from supply
17.group by product_id
18.
19.-- get on hand based on orders summary and supply summary
20.
21.with stock_cte
22.as (
23.select nvl(s.product_id,o.product_id) as product_id,
24. nvl(supply_count,0) - nvl(order_count,0) as on_hand
25.from supply_summary s
26.full outer join orders_summary o
27. on s.product_id = o.product_id
28.)
29.select *
30.from
31.stock_cte
32.where on_hand = 0

  • A. Increase the maximum bound of the SQL endpoint's scaling range.
  • B. Turn on the Serverless feature for the SQL endpoint and change the Spot Instance Pol-icy to "Reliability Optimized."
  • C. Turn on the Serverless feature for the SQL endpoint.
  • D. Turn on the Auto Stop feature for the SQL endpoint.
  • E. Increase the cluster size of the SQL endpoint.

Answer: E

Explanation:
Explanation
The answer is to increase the cluster size of the SQL Endpoint, here queries are running sequentially and since the single query can not span more than one cluster adding more clusters won't improve the query but rather increasing the cluster size will improve performance so it can use additional compute in a warehouse.
In the exam please note that additional context will not be given instead you have to look for cue words or need to understand if the queries are running sequentially or concurrently. if the que-ries are running sequentially then scale up(more nodes) if the queries are running concurrently (more users) then scale out(more clusters).
Below is the snippet from Azure, as you can see by increasing the cluster size you are able to add more worker nodes.

SQL endpoint scales horizontally(scale-out) and vertically (scale-up), you have to understand when to use what.
Scale-up-> Increase the size of the cluster from x-small to small, to medium, X Large....
If you are trying to improve the performance of a single query having additional memory, additional nodes and cpu in the cluster will improve the performance.
Scale-out -> Add more clusters, change max number of clusters
If you are trying to improve the throughput, being able to run as many queries as possible then having an additional cluster(s) will improve the performance.
SQL endpoint
A picture containing diagram Description automatically generated


NEW QUESTION # 62
A data engineer needs to create a database called customer360 at the loca-tion /customer/customer360. The
data engineer is unsure if one of their colleagues has already created the database.
Which of the following commands should the data engineer run to complete this task?

  • A. CREATE DATABASE IF NOT EXISTS customer360;
  • B. CREATE DATABASE customer360 LOCATION '/customer/customer360';
  • C. CREATE DATABASE IF NOT EXISTS customer360 LOCATION '/customer/customer360';
  • D. CREATE DATABASE IF NOT EXISTS customer360 DELTA LOCATION '/customer/customer360';
  • E. CREATE DATABASE customer360 DELTA LOCATION '/customer/customer360';

Answer: C


NEW QUESTION # 63
Which of the following commands results in the successful creation of a view on top of the delta stream(stream on delta table)?

  • A. Spark.read.format("delta").table("sales").trigger("stream").createOrReplaceTempView("streaming_vw")
  • B. You can not create a view on streaming data source.
  • C. Spark.readStream.format("delta").table("sales").createOrReplaceTempView("streaming_vw")
  • D. Spark.read.format("delta").table("sales").createOrReplaceTempView("streaming_vw")
  • E. Spark.read.format("delta").table("sales").mode("stream").createOrReplaceTempView("streaming_vw")
  • F. Spark.read.format("delta").stream("sales").createOrReplaceTempView("streaming_vw")

Answer: C

Explanation:
Explanation
The answer is
Spark.readStream.table("sales").createOrReplaceTempView("streaming_vw") When you load a Delta table as a stream source and use it in a streaming query, the query processes all of the data present in the table as well as any new data that arrives after the stream is started.
You can load both paths and tables as a stream, you also have the ability to ignore deletes and changes(updates, Merge, overwrites) on the delta table.
Here is more information,
https://docs.databricks.com/delta/delta-streaming.html#delta-table-as-a-source


NEW QUESTION # 64
Question-3: In machine learning, feature hashing, also known as the hashing trick (by analogy to the kernel
trick), is a fast and space-efficient way of vectorizing features (such as the words in a language), i.e., turning
arbitrary features into indices in a vector or matrix. It works by applying a hash function to the features and
using their hash values modulo the number of features as indices directly, rather than looking the indices up in
an associative array. So what is the primary reason of the hashing trick for building classifiers?

  • A. Noisy features are removed
  • B. It reduces the non-significant features e.g. punctuations
  • C. It creates the smaller models
  • D. It requires the lesser memory to store the coefficients for the model

Answer: D

Explanation:
Explanation
This hashed feature approach has the distinct advantage of requiring less memory and one less pass through
the training data, but it can make it much harder to reverse engineer vectors to determine which original
feature mapped to a vector location. This is because multiple features may hash to the same location. With
large vectors or with multiple locations per feature, this isn't a problem for accuracy but it can make it hard to
understand what a classifier is doing.
Models always have a coefficient per feature, which are stored in memory during model building. The hashing
trick collapses a high number of features to a small number which reduces the number of coefficients and thus
memory requirements. Noisy features are not removed; they are combined with other features and so still have
an impact.
The validity of this approach depends a lot on the nature of the features and problem domain; knowledge of
the domain is important to understand whether it is applicable or will likely produce poor results. While
hashing features may produce a smaller model, it will be one built from odd combinations of real-world
features, and so will be harder to interpret.
An additional benefit of feature hashing is that the unknown and unbounded vocabularies typical of word-like
variables aren't a problem.


NEW QUESTION # 65
What is the main difference between the silver layer and the gold layer in medalion architecture?

  • A. Data quality checks are applied in gold
  • B. Silver may contain aggregated data
  • C. Gold may contain aggregated data
  • D. God is a copy of silver data
  • E. Silver is a copy of bronze data

Answer: C

Explanation:
Explanation
Medallion Architecture - Databricks
Exam focus: Please review the below image and understand the role of each layer(bronze, silver, gold) in medallion architecture, you will see varying questions targeting each layer and its purpose.
Sorry I had to add the watermark some people in Udemy are copying my content.
A diagram of a house Description automatically generated with low confidence


NEW QUESTION # 66
You are working on a process to query the table based on batch date, and batch date is an input parameter and expected to change every time the program runs, what is the best way to we can parameterize the query to run without manually changing the batch date?

  • A. Store the batch date in the spark configuration and use a spark data frame to filter the data based on the spark configuration.
  • B. Create a dynamic view that can calculate the batch date automatically and use the view to query the data
  • C. Manually edit code every time to change the batch date
  • D. Create a notebook parameter for batch date and assign the value to a python variable and use a spark data frame to filter the data based on the python variable
  • E. There is no way we can combine python variable and spark code

Answer: D

Explanation:
Explanation
The answer is, Create a notebook parameter for batch date and assign the value to a python variable and use a spark data frame to filter the data based on the python variable


NEW QUESTION # 67
You are working on a email spam filtering assignment, while working on this you find there is new word e.g.
HadoopExam comes in email, and in your solutions you never come across this word before, hence probability
of this words is coming in either email could be zero. So which of the following algorithm can help you to
avoid zero probability?

  • A. Naive Bayes
  • B. Logistic Regression
  • C. All of the above
  • D. Laplace Smoothing

Answer: D

Explanation:
Explanation
Laplace smoothing is a technique for parameter estimation which accounts for unobserved events. It is more
robust and will not fail completely when data that has never been observed in training shows up.


NEW QUESTION # 68
Which of the following scenarios is the best fit for the AUTO LOADER solution?

  • A. Efficiently move data incrementally from one delta table to another delta table
  • B. Incrementally process new data from relational databases like MySQL
  • C. Efficiently process new data incrementally from cloud object storage
  • D. Incrementally process new streaming data from Apache Kafa into delta lake
  • E. Efficiently copy data from data lake location to another data lake location

Answer: C

Explanation:
Explanation
The answer is, Efficiently process new data incrementally from cloud object storage.
Please note: AUTO LOADER only works on data/files located in cloud object storage like S3 or Azure Blob Storage it does not have the ability to read other data sources, although AU-TO LOADER is built on top of structured streaming it only supports files in the cloud object stor-age. If you want to use Apache Kafka then you can just use structured streaming.
Diagram Description automatically generated

Auto Loader and Cloud Storage Integration
Auto Loader supports a couple of ways to ingest data incrementally
1.Directory listing - List Directory and maintain the state in RocksDB, supports incremental file listing
2.File notification - Uses a trigger+queue to store the file notification which can be later used to retrieve the file, unlike Directory listing File notification can scale up to millions of files per day.
[OPTIONAL]
Auto Loader vs COPY INTO?
Auto Loader
Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without any additional setup. Auto Loader provides a new Structured Streaming source called cloudFiles. Given an input directory path on the cloud file storage, the cloudFiles source automatically processes new files as they arrive, with the option of also processing existing files in that directory.
When to use Auto Loader instead of the COPY INTO?
*You want to load data from a file location that contains files in the order of millions or higher. Auto Loader can discover files more efficiently than the COPY INTO SQL command and can split file processing into multiple batches.
*You do not plan to load subsets of previously uploaded files. With Auto Loader, it can be more difficult to reprocess subsets of files. However, you can use the COPY INTO SQL command to reload subsets of files while an Auto Loader stream is simultaneously running.
Refer to more documentation here,
https://docs.microsoft.com/en-us/azure/databricks/ingestion/auto-loader


NEW QUESTION # 69
If you create a database sample_db with the statement CREATE DATABASE sample_db what will be the default location of the database in DBFS?

  • A. Default location, DBFS:/user/
  • B. Default Location, dbfs:/user/hive/warehouse
  • C. Default Storage account
  • D. Default location, /user/db/
  • E. Statement fails "Unable to create database without location"

Answer: B

Explanation:
Explanation
The Answer is dbfs:/user/hive/warehouse this is the default location where spark stores user data-bases, the default can be changed using spark.sql.warehouse.dir a parameter. You can also provide a custom location using the LOCATION keyword.
Here is how this works,
Graphical user interface, text, application, email Description automatically generated

Default location

FYI, This can be changed used using cluster spark config or session config.
Modify spark.sql.warehouse.dir location to change the default location
Graphical user interface, text, application Description automatically generated


NEW QUESTION # 70
What is the type of table created when you issue SQL DDL command CREATE TABLE sales (id int, units int)

  • A. Query fails due to missing location
  • B. External Table
  • C. Query fails due to missing format
  • D. Managed Parquet table
  • E. Managed Delta table

Answer: E

Explanation:
Explanation
Answer is Managed Delta table
Anytime a table is created without the Location keyword it is considered a managed table, by de-fault all managed tables DELTA tables Syntax CREATE TABLE table_name ( column column_data_type...)


NEW QUESTION # 71
Your team has hundreds of jobs running but it is difficult to track cost of each job run, you are asked to provide a recommendation on how to monitor and track cost across various workloads

  • A. Use a single cluster for all the jobs, so cost can be easily tracked
  • B. Use Tags, during job creation so cost can be easily tracked
  • C. Create jobs in different workspaces, so we can track the cost easily
  • D. Use job logs to monitor and track the costs
  • E. Use workspace admin reporting

Answer: B

Explanation:
Explanation
The answer is Use Tags, during job creation so cost can be easily tracked Review below link for more details
https://docs.databricks.com/administration-guide/account-settings/usage-detail-tags-aws.html Here is a view how tags get propagated from pools to clusters and clusters without pools, Diagram Description automatically generated


NEW QUESTION # 72
......

Guaranteed Success in Databricks Certification Databricks-Certified-Professional-Data-Engineer Exam Dumps: https://www.testsimulate.com/Databricks-Certified-Professional-Data-Engineer-study-materials.html