Last Week Professional-Machine-Learning-Engineer Exam Results
145 Customers Passed the Google Professional-Machine-Learning-Engineer Exam
94% Average Score in the Real Professional-Machine-Learning-Engineer Exam
96% of Questions Came from Our Professional-Machine-Learning-Engineer Dumps
Choosing the Right Path for Your Professional-Machine-Learning-Engineer Exam Preparation
Welcome to PassExamHub's comprehensive study guide for the Google Professional Machine Learning Engineer exam. Our Professional-Machine-Learning-Engineer dumps are designed to equip you with the knowledge and resources you need to confidently prepare for and succeed in the Professional-Machine-Learning-Engineer certification exam.
What Our Google Professional-Machine-Learning-Engineer Study Material Offers
PassExamHub's Professional-Machine-Learning-Engineer dumps PDF is carefully crafted to provide you with a comprehensive and effective learning experience. Our study material includes:
In-depth Content: Our study guide covers all the key concepts, topics, and skills you need to master for the Professional-Machine-Learning-Engineer exam. Each topic is explained in a clear and concise manner, making it easy to understand even the most complex concepts.
Online Test Engine: Test your knowledge and build your confidence with a wide range of practice questions that simulate the actual exam format. Our test engine covers every exam objective and provides detailed explanations for both correct and incorrect answers.
Exam Strategies: Get valuable insights into exam-taking strategies, time management, and how to approach different types of questions.
Real-world Scenarios: Gain practical insights into applying your knowledge in real-world scenarios, ensuring you're well-prepared to tackle challenges in your professional career.
Why Choose PassExamHub?
Expertise: Our Professional-Machine-Learning-Engineer exam questions and answers are developed by experienced Google-certified professionals who have a deep understanding of the exam objectives and industry best practices.
Comprehensive Coverage: We leave no stone unturned in covering every topic and skill that could appear on the Professional-Machine-Learning-Engineer exam, ensuring you're fully prepared.
Engaging Learning: Our content is presented in a user-friendly and engaging format, making your study sessions enjoyable and effective.
Proven Success: Countless students have used our study materials to achieve their Professional-Machine-Learning-Engineer certifications and advance their careers.
Start Your Journey Today!
Embark on your journey to Google Professional Machine Learning Engineer success with PassExamHub. Our study material is your trusted companion in preparing for the Professional-Machine-Learning-Engineer exam and unlocking exciting career opportunities.
Google Professional-Machine-Learning-Engineer Sample Question Answers
Question # 1
You want to train an AutoML model to predict house prices by using a small public dataset stored in
BigQuery. You need to prepare the data and want to use the simplest, most efficient approach. What
should you do?
A. Write a query that preprocesses the data by using BigQuery and creates a new table. Create a Vertex AI managed dataset with the new table as the data source.
B. Use Dataflow to preprocess the data. Write the output in TFRecord format to a Cloud Storage bucket.
C. Write a query that preprocesses the data by using BigQuery. Export the query results as CSV files, and use those files to create a Vertex AI managed dataset.
D. Use a Vertex AI Workbench notebook instance to preprocess the data by using the pandas library. Export the data as CSV files, and use those files to create a Vertex AI managed dataset.
Answer: A
Explanation:
The simplest and most efficient approach for preparing the data for AutoML is to use BigQuery and
Vertex AI. BigQuery is a serverless, scalable, and cost-effective data warehouse that can perform fast
and interactive queries on large datasets. BigQuery can preprocess the data by using SQL functions
such as filtering, aggregating, joining, transforming, and creating new features. The preprocessed
data can be stored in a new table in BigQuery, which can be used as the data source for Vertex AI.
Vertex AI is a unified platform for building and deploying machine learning solutions on Google
Cloud. Vertex AI can create a managed dataset from a BigQuery table, which can be used to train an
AutoML model. Vertex AI can also evaluate, deploy, and monitor the AutoML model, and provide
online or batch predictions. By using BigQuery and Vertex AI, users can leverage the power and
simplicity of Google Cloud to train an AutoML model to predict house prices.
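For illustration, here is a minimal Python sketch of option A using the google-cloud-aiplatform SDK; the project, table, and column names are placeholders and not part of the question.

# Minimal sketch of option A: BigQuery table -> Vertex AI managed dataset -> AutoML.
# All names below (project, dataset, table, label column) are hypothetical.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# Create a managed dataset directly from the preprocessed BigQuery table.
dataset = aiplatform.TabularDataset.create(
    display_name="house-prices",
    bq_source="bq://my-project.housing.preprocessed_sales",
)

# Train an AutoML regression model on the managed dataset.
job = aiplatform.AutoMLTabularTrainingJob(
    display_name="house-price-automl",
    optimization_prediction_type="regression",
)
model = job.run(
    dataset=dataset,
    target_column="sale_price",  # hypothetical label column
    budget_milli_node_hours=1000,
)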
The other options are not as simple or efficient as option A, for the following reasons:
Option B: Using Dataflow to preprocess the data and write the output in TFRecord format to a Cloud
Storage bucket would require more steps and resources than using BigQuery and Vertex AI. Dataflow
is a service that can create scalable and reliable pipelines to process large volumes of data from
various sources. Dataflow can preprocess the data by using Apache Beam, a programming model for
defining and executing data processing workflows. TFRecord is a binary file format that can store
sequential data efficiently. However, using Dataflow and TFRecord would require writing code,
setting up a pipeline, choosing a runner, and managing the output files. Moreover, TFRecord is not a
supported format for Vertex AI managed datasets, so the data would need to be converted to CSV or
JSONL files before creating a Vertex AI managed dataset.
Option C: Writing a query that preprocesses the data by using BigQuery and exporting the query
results as CSV files would require more steps and storage than using BigQuery and Vertex AI. CSV is a
text file format that can store tabular data in a comma-separated format. Exporting the query results
as CSV files would require choosing a destination Cloud Storage bucket, specifying a file name or a
wildcard, and setting the export options. Moreover, CSV files can have limitations such as size,
schema, and encoding, which can affect the quality and validity of the data. Exporting the data as
CSV files would also incur additional storage costs and reduce the performance of the queries.
Option D: Using a Vertex AI Workbench notebook instance to preprocess the data by using the
pandas library and exporting the data as CSV files would require more steps and skills than using
BigQuery and Vertex AI. Vertex AI Workbench is a service that provides an integrated development
environment for data science and machine learning. Vertex AI Workbench allows users to create and
run Jupyter notebooks on Google Cloud, and access various tools and libraries for data analysis and
machine learning. Pandas is a popular Python library that can manipulate and analyze data in a
tabular format. However, using Vertex AI Workbench and pandas would require creating a notebook
instance, writing Python code, installing and importing pandas, connecting to BigQuery, loading and
preprocessing the data, and exporting the data as CSV files. Moreover, pandas can have limitations
such as memory usage, scalability, and compatibility, which can affect the efficiency and reliability of
the data processing.
Reference:
Preparing for Google Cloud Certification: Machine Learning Engineer, Course 2: Data Engineering for
ML on Google Cloud, Week 1: Introduction to Data Engineering for ML
Google Cloud Professional Machine Learning Engineer Exam Guide, Section 1: Architecting low-code
ML solutions, 1.3 Training models by using AutoML
Official Google Cloud Certified Professional Machine Learning Engineer Study Guide, Chapter 4: Low-code ML Solutions, Section 4.3: AutoML
BigQuery
Vertex AI
Dataflow
TFRecord
CSV
Vertex AI Workbench
Pandas
Question # 2
You are training an ML model using data stored in BigQuery that contains several values that are
considered Personally Identifiable Information (PII). You need to reduce the sensitivity of the dataset
before training your model. Every column is critical to your model. How should you proceed?
A. Using Dataflow, ingest the columns with sensitive data from BigQuery, and then randomize the values in each sensitive column.
B. Use the Cloud Data Loss Prevention (DLP) API to scan for sensitive data, and use Dataflow with the DLP API to encrypt sensitive values with Format Preserving Encryption.
C. Use the Cloud Data Loss Prevention (DLP) API to scan for sensitive data, and use Dataflow to replace all sensitive data by using the encryption algorithm AES-256 with a salt.
D. Before training, use BigQuery to select only the columns that do not contain sensitive data. Create an authorized view of the data so that sensitive values cannot be accessed by unauthorized individuals.
Answer: B
Explanation:
The best option for reducing the sensitivity of the dataset before training the model is to use the
Cloud Data Loss Prevention (DLP) API to scan for sensitive data, and use Dataflow with the DLP API to
encrypt sensitive values with Format Preserving Encryption. This option allows you to keep every
column in the dataset, while protecting the sensitive data from unauthorized access or exposure. The
Cloud DLP API can detect and classify various types of sensitive data, such as names, email
addresses, phone numbers, credit card numbers, and more1. Dataflow can create scalable and
reliable pipelines to process large volumes of data from BigQuery and other sources2. Format
Preserving Encryption (FPE) is a technique that encrypts sensitive data while preserving its original
format and length, which can help maintain the utility and validity of the data3. By using Dataflow
with the DLP API, you can apply FPE to the sensitive values in the dataset, and store the encrypted
data in BigQuery or another destination. You can also use the same pipeline to decrypt the data
when needed, by using the same encryption key and method4.
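As a rough sketch of how the DLP API's Format Preserving Encryption transform could be configured (outside the full Dataflow pipeline), the following Python example uses the google-cloud-dlp client; the project ID, info type, and key material are illustrative assumptions only, and production keys would normally be KMS-wrapped.

# Hedged sketch: de-identify a numeric finding with Format Preserving Encryption.
from google.cloud import dlp_v2

client = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # assumed project

deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {
                "info_types": [{"name": "PHONE_NUMBER"}],
                "primitive_transformation": {
                    "crypto_replace_ffx_fpe_config": {
                        # Illustrative 32-byte key; use a KMS-wrapped key in practice.
                        "crypto_key": {"unwrapped": {"key": b"0123456789abcdef0123456789abcdef"}},
                        "common_alphabet": "NUMERIC",
                    }
                },
            }
        ]
    }
}

response = client.deidentify_content(
    request={
        "parent": parent,
        "deidentify_config": deidentify_config,
        "inspect_config": {"info_types": [{"name": "PHONE_NUMBER"}]},
        "item": {"value": "My phone number is 4155550100"},
    }
)
print(response.item.value)  # digits replaced with format-preserving ciphertext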
The other options are not as suitable as option B, for the following reasons:
Option A: Using Dataflow to ingest the columns with sensitive data from BigQuery, and then
randomize the values in each sensitive column, would reduce the sensitivity of the data, but also the
utility and accuracy of the data. Randomization is a technique that replaces sensitive data with
random values, which can prevent re-identification of the data, but also distort the distribution and
relationships of the data3. This can affect the performance and quality of the ML model, especially if
every column is critical to the model.
Option C: Using the Cloud DLP API to scan for sensitive data, and use Dataflow to replace all sensitive
data by using the encryption algorithm AES-256 with a salt, would reduce the sensitivity of the data,
but also the utility and validity of the data. AES-256 is a symmetric encryption algorithm that uses a
256-bit key to encrypt and decrypt data. A salt is a random value that is added to the data before
encryption, to increase the randomness and security of the encrypted data. However, AES-256 does
not preserve the format or length of the original data, which can cause problems when storing or
processing the data. For example, if the original data is a 10-digit phone number, AES-256 would
produce a much longer and different string, which can break the schema or logic of the dataset3.
Option D: Before training, using BigQuery to select only the columns that do not contain sensitive
data, and creating an authorized view of the data so that sensitive values cannot be accessed by
unauthorized individuals, would reduce the exposure of the sensitive data, but also the
completeness and relevance of the data. An authorized view is a BigQuery view that allows you to
share query results with particular users or groups, without giving them access to the underlying
tables. However, this option assumes that you can identify the columns that do not contain sensitive
data, which may not be easy or accurate. Moreover, this option would remove some columns from
the dataset, which can affect the performance and quality of the ML model, especially if every
column is critical to the model.
Reference:
Preparing for Google Cloud Certification: Machine Learning Engineer, Course 5: Responsible AI,
Week 2: Privacy
Google Cloud Professional Machine Learning Engineer Exam Guide, Section 5: Developing
responsible AI solutions, 5.2 Implementing privacy techniques
Official Google Cloud Certified Professional Machine Learning Engineer Study Guide, Chapter 9:
Responsible AI, Section 9.4: Privacy
De-identification techniques
Cloud Data Loss Prevention (DLP) API
Dataflow
Using Dataflow and Sensitive Data Protection to securely tokenize and import data from a relational
database to BigQuery
[AES encryption]
[Salt (cryptography)]
[Authorized views]
Question # 3
You have trained a DNN regressor with TensorFlow to predict housing prices using a set of predictive
features. Your default precision is tf.float64, and you use a standard TensorFlow estimator:
estimator = tf.estimator.DNNRegressor(
feature_columns=[YOUR_LIST_OF_FEATURES],
hidden_units=[1024, 512, 256],
dropout=None)
Your model performs well, but just before deploying it to production, you discover that your current
serving latency is 10ms @ the 90th percentile, and you currently serve on CPUs. Your production
requirements expect a model latency of 8ms @ the 90th percentile. You are willing to accept a small
decrease in performance in order to reach the latency requirement. Therefore, your plan is to improve
latency while evaluating how much the model's prediction performance decreases. What should you first
try to quickly lower the serving latency?
A. Increase the dropout rate to 0.8 in _PREDICT mode by adjusting the TensorFlow Serving parameters.
B. Increase the dropout rate to 0.8 and retrain your model.
C. Switch from CPU to GPU serving.
D. Apply quantization to your SavedModel by reducing the floating point precision to tf.float16.
Answer: D
Explanation:
Quantization is a technique that reduces the numerical precision of the weights and activations of a
neural network, which can improve the inference speed and reduce the memory footprint of the
model1.
Reducing the floating point precision from tf.float64 to tf.float16 can potentially halve the latency and
memory usage of the model, while having minimal impact on the accuracy2.
Increasing the dropout rate to 0.8 in either mode would not affect the latency, but would likely
degrade the performance of the model significantly, as dropout is a regularization technique that
randomly drops out units during training to prevent overfitting3.
Switching from CPU to GPU serving may or may not improve the latency, depending on the hardware
specifications and the model complexity, but it would also incur additional costs and complexity for
deployment4
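One common way to apply reduced-precision quantization to a SavedModel is post-training float16 quantization with the TensorFlow Lite converter; the sketch below assumes a hypothetical model path and a serving stack that can consume the quantized artifact.

# Hedged sketch of post-training float16 quantization of a SavedModel.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("./housing_model")  # assumed export path
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]  # store weights as float16

quantized_model = converter.convert()
with open("housing_model_fp16.tflite", "wb") as f:
    f.write(quantized_model)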
Question # 4
You developed a Vertex AI ML pipeline that consists of preprocessing and training steps, and each set of steps runs on a separate custom Docker image. Your organization uses GitHub and GitHub Actions as CI/CD to run unit and integration tests. You need to automate the model retraining workflow so that it can be initiated both manually and when a new version of the code is merged in the main branch. You want to minimize the steps required to build the workflow while also allowing for maximum flexibility. How should you configure the CI/CD workflow?
A. Trigger a Cloud Build workflow to run tests, build custom Docker images, push the images to Artifact Registry, and launch the pipeline in Vertex AI Pipelines.
B. Trigger GitHub Actions to run the tests, launch a job on Cloud Run to build custom Docker images, push the images to Artifact Registry, and launch the pipeline in Vertex AI Pipelines.
C. Trigger GitHub Actions to run the tests, build custom Docker images, push the images to Artifact Registry, and launch the pipeline in Vertex AI Pipelines.
D. Trigger GitHub Actions to run the tests, launch a Cloud Build workflow to build custom Docker images, push the images to Artifact Registry, and launch the pipeline in Vertex AI Pipelines.
Answer: D
Explanation:
The best option for automating the model retraining workflow is to use GitHub Actions and Cloud
Build. GitHub Actions is a service that can create and run workflows for continuous integration and
continuous delivery (CI/CD) on GitHub. GitHub Actions can run tests, build and deploy code, and
trigger other actions based on events such as code changes, pull requests, or manual triggers. Cloud
Build is a service that can create and run scalable and reliable pipelines to build, test, and deploy
software on Google Cloud. Cloud Build can build custom Docker images, push the images to Artifact
Registry, and launch the pipeline in Vertex AI Pipelines. Vertex AI Pipelines is a service that can
orchestrate machine learning (ML) workflows using Vertex AI. Vertex AI Pipelines can run
preprocessing and training steps on custom Docker images, and evaluate, deploy, and monitor the
ML model. By using GitHub Actions and Cloud Build, users can leverage the power and flexibility of
Google Cloud to automate the model retraining workflow, while minimizing the steps required to
build the workflow.
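As a hedged sketch, the final step of such a workflow might submit the compiled pipeline to Vertex AI Pipelines from Python after the tests pass and Cloud Build has pushed the images; the pipeline spec path, image URIs, and parameter names below are assumptions.

# Hedged sketch: submit the retraining pipeline to Vertex AI Pipelines.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

pipeline_job = aiplatform.PipelineJob(
    display_name="model-retraining",
    template_path="gs://my-bucket/pipelines/retraining_pipeline.json",  # compiled pipeline spec
    parameter_values={
        "preprocess_image": "us-docker.pkg.dev/my-project/ml/preprocess:latest",
        "train_image": "us-docker.pkg.dev/my-project/ml/train:latest",
    },
)
pipeline_job.submit()  # returns immediately; the pipeline runs on Vertex AI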
The other options are not as good as option D, for the following reasons:
Option A: Triggering a Cloud Build workflow to run tests, build custom Docker images, push the
images to Artifact Registry, and launch the pipeline in Vertex AI Pipelines would require more
configuration and maintenance than using GitHub Actions and Cloud Build. Cloud Build is a service
that can create and run pipelines to build, test, and deploy software on Google Cloud, but it is not
designed to integrate with GitHub or other source code repositories. To trigger a Cloud Build
workflow from GitHub, users would need to set up a webhook, a Cloud Pub/Sub topic, and a Cloud
Function1. Moreover, Cloud Build does not support manual triggers, which limits the flexibility of the
workflow2.
Option B: Triggering GitHub Actions to run the tests, launching a job on Cloud Run to build custom
Docker images, pushing the images to Artifact Registry, and launching the pipeline in Vertex AI
Pipelines would require more steps and resources than using GitHub Actions and Cloud Build. Cloud
Run is a service that can run stateless containers on a fully managed environment or on Anthos.
Cloud Run can build custom Docker images, but it is not optimized for this task. Users would need to
write a Dockerfile, a cloudbuild.yaml file, and a Cloud Run service configuration file, and use the
gcloud command-line tool to build and deploy the image3. Moreover, Cloud Run is designed for
serving HTTP requests, not for running ML pipelines, which can have different performance and
scalability requirements.
Option C: Triggering GitHub Actions to run the tests, building custom Docker images, pushing the
images to Artifact Registry, and launching the pipeline in Vertex AI Pipelines would require more
skills and tools than using GitHub Actions and Cloud Build. GitHub Actions can run tests and build
code, but it is not specialized for building Docker images. Users would need to install and configure
Docker on the GitHub Actions runner, write a Dockerfile, and use the docker command-line tool to
build and push the image. Moreover, GitHub Actions has limitations on the disk space, memory, and
CPU of the runner, which can affect the speed and reliability of the image building process.
Reference:
Building CI/CD for Vertex AI pipelines: The first solution
Cloud Build
GitHub Actions
Vertex AI Pipelines
Triggering builds from GitHub
Triggering builds manually
Building containers
Cloud Run
Question # 5
You work on the data science team at a manufacturing company. You are reviewing the company's
historical sales data, which has hundreds of millions of records. For your exploratory data analysis,
you need to calculate descriptive statistics such as mean, median, and mode; conduct complex
statistical tests for hypothesis testing; and plot variations of the features over time You want to use as
much of the sales data as possible in your analyses while minimizing computational resources. What
should you do?
A. Spin up a Vertex AI Workbench user-managed notebooks instance and import the dataset. Use this data to create statistical and visual analyses.
B. Visualize the time plots in Google Data Studio. Import the dataset into Vertex AI Workbench user-managed notebooks. Use this data to calculate the descriptive statistics and run the statistical analyses.
C. Use BigQuery to calculate the descriptive statistics. Use Vertex AI Workbench user-managed notebooks to visualize the time plots and run the statistical analyses.
D. Use BigQuery to calculate the descriptive statistics, and use Google Data Studio to visualize the time plots. Use Vertex AI Workbench user-managed notebooks to run the statistical analyses.
Answer: C
Explanation:
The best option for analyzing large and complex datasets while minimizing computational resources
is to use a combination of BigQuery and Vertex AI Workbench. BigQuery is a serverless, scalable, and
cost-effective data warehouse that can perform fast and interactive queries on petabytes of data.
BigQuery can calculate descriptive statistics such as mean, median, and mode by using SQL functions
such as AVG, PERCENTILE_CONT, and MODE. Vertex AI Workbench is a managed service that
provides an integrated development environment for data science and machine learning. Vertex AI
Workbench allows users to create and run Jupyter notebooks on Google Cloud, and access various
tools and libraries for data visualization and statistical analysis. Vertex AI Workbench can connect to
BigQuery and use the results of the queries to create time plots and run statistical tests for
hypothesis testing. By using BigQuery and Vertex AI Workbench, users can leverage the power and
flexibility of Google Cloud to perform exploratory data analysis on large and complex
datasets.
Reference:
Preparing for Google Cloud Certification: Machine Learning Engineer, Course 2: Data Engineering for
ML on Google Cloud, Week 1: Introduction to Data Engineering for ML
Google Cloud Professional Machine Learning Engineer Exam Guide, Section 1: Architecting low-code
ML solutions, 1.1 Developing ML models by using BigQuery ML
Official Google Cloud Certified Professional Machine Learning Engineer Study Guide, Chapter 3: Data
Engineering for ML, Section 3.2: BigQuery for ML
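As a concrete illustration of the division of labor described above, a notebook could push the aggregations down to BigQuery and pull back only the small result set; the project, table, and column names below are hypothetical.

# Hedged sketch: compute descriptive statistics in BigQuery, keep only aggregates locally.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # assumed project

query = """
SELECT
  AVG(sale_amount) AS mean_sales,
  APPROX_QUANTILES(sale_amount, 2)[OFFSET(1)] AS median_sales,
  APPROX_TOP_COUNT(sale_amount, 1)[OFFSET(0)].value AS mode_sales
FROM `my-project.sales.transactions`
"""
stats = client.query(query).to_dataframe()  # tiny result set, cheap to pull into the notebook
print(stats)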
Question # 6
Your organization manages an online message board A few months ago, you discovered an increase
in toxic language and bullying on the message board. You deployed an automated text classifier that
flags certain comments as toxic or harmful. Now some users are reporting that benign comments
referencing their religion are being misclassified as abusive Upon further inspection, you find that
your classifier's false positive rate is higher for comments that reference certain underrepresented
religious groups. Your team has a limited budget and is already overextended. What should you do?
A. Add synthetic training data where those phrases are used in non-toxic ways.
B. Remove the model and replace it with human moderation.
C. Replace your model with a different text classifier.
D. Raise the threshold for comments to be considered toxic or harmful.
Answer: A
Explanation:
The problem of the text classifier is that it has a high false positive rate for comments that reference
certain underrepresented religious groups. This means that the classifier is not able to distinguish
between toxic and non-toxic language when those groups are mentioned. One possible reason for
this is that the training data does not have enough examples of non-toxic comments that reference
those groups, leading to a biased model. Therefore, a possible solution is to add synthetic training
data where those phrases are used in non-toxic ways, which can help the model learn to generalize
better and reduce the false positive rate. Synthetic data is artificially generated data that mimics the
characteristics of real data, and can be used to augment the existing data when the real data is scarce
or imbalanced.
Reference:
Preparing for Google Cloud Certification: Machine Learning Engineer, Course 5: Responsible AI,
Week 3: Fairness
Google Cloud Professional Machine Learning Engineer Exam Guide, Section 4: Ensuring solution
quality, 4.4 Evaluating fairness and bias in ML models
Official Google Cloud Certified Professional Machine Learning Engineer Study Guide, Chapter 9:
Responsible AI, Section 9.3: Fairness and Bias
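A minimal, purely illustrative sketch of option A might generate templated non-toxic comments that mention the affected groups and append them to the training set; the templates, group placeholders, and label values below are assumptions, not part of the question.

# Hedged sketch: template-based synthetic non-toxic examples for data augmentation.
import random

groups = ["group_a", "group_b"]  # placeholders for the affected religious groups
templates = [
    "I celebrated a {g} holiday with my family this weekend.",
    "Our {g} community center is hosting a charity drive.",
    "As a {g} person, I really enjoyed this discussion.",
]

synthetic_rows = [
    {"text": t.format(g=random.choice(groups)), "label": "non_toxic"}
    for t in templates
    for _ in range(100)  # repeat each template with random group substitution
]
# synthetic_rows would then be merged with the real training data before retraining.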
Question # 7
You are working with a dataset that contains customer transactions. You need to build an ML model to predict customer purchase behavior. You plan to develop the model in BigQuery ML, and export it to Cloud Storage for online prediction. You notice that the input data contains a few categorical features, including product category and payment method. You want to deploy the model as quickly as possible. What should you do?
A. Use the TRANSFORM clause with the ML.ONE_HOT_ENCODER function on the categorical features at model creation, and select the categorical and non-categorical features.
B. Use the ML.ONE_HOT_ENCODER function on the categorical features, and select the encoded categorical features and non-categorical features as inputs to create your model.
C. Use the CREATE MODEL statement and select the categorical and non-categorical features.
D. Use the ML.ONE_HOT_ENCODER function on the categorical features, and select the encoded categorical features and non-categorical features as inputs to create your model.
Answer: A
Explanation:
The best option for building an ML model to predict customer purchase behavior in BigQuery ML is to
use the transform clause with the ML.ONE_HOT_ENCODER function on the categorical features at
model creation and select the categorical and non-categorical features. This option allows you to
encode the categorical features as one-hot vectors, which are binary vectors that have only one non-zero
element. One-hot encoding is a common technique for handling categorical features in ML
models, as it can reduce the dimensionality and sparsity of the data, and avoid the ordinality
problem that arises when using numerical labels for categorical values1. The transform clause is a
feature of BigQuery ML that lets you apply SQL expressions to transform the input data at model
creation time. The transform clause can perform feature engineering, such as one-hot encoding, on
the fly, without requiring you to create and store a new table with the transformed data2. By using
the transform clause with the ML.ONE_HOT_ENCODER function, you can create and train an ML
model in BigQuery ML with a single SQL statement, and export it to Cloud Storage for online
prediction.
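A hedged sketch of option A is shown below; the project, table, column, and label names are hypothetical, and the exact ML.ONE_HOT_ENCODER signature and supported model types should be confirmed against the current BigQuery ML documentation.

# Hedged sketch: CREATE MODEL with a TRANSFORM clause applying one-hot encoding.
from google.cloud import bigquery

client = bigquery.Client(project="my-project")

create_model_sql = """
CREATE OR REPLACE MODEL `my-project.sales.purchase_model`
TRANSFORM(
  ML.ONE_HOT_ENCODER(product_category) OVER () AS product_category_enc,
  ML.ONE_HOT_ENCODER(payment_method) OVER () AS payment_method_enc,
  purchase_amount,          -- non-categorical feature passes through unchanged
  purchased                 -- label column
)
OPTIONS(model_type = 'logistic_reg', input_label_cols = ['purchased']) AS
SELECT product_category, payment_method, purchase_amount, purchased
FROM `my-project.sales.transactions`
"""
client.query(create_model_sql).result()  # the trained model can then be exported to Cloud Storage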
The other options are not as good as option A, for the following reasons:
Option B: Using the ML.ONE_HOT_ENCODER function on the categorical features, and selecting the
encoded categorical features and non-categorical features as inputs to create your model, would
require more steps and storage than using the transform clause. The ML.ONE_HOT_ENCODER
function is a BigQuery ML function that returns a one-hot encoded vector for a given categorical
value. However, using this function alone would not apply the one-hot encoding to the input data at
model creation time. You would need to create a new table with the encoded features, and use that
table as the input to create your model. This would incur additional storage costs and reduce the
performance of the queries.
Option C: Using the create model statement and selecting the categorical and non-categorical
features, would not handle the categorical features properly and could result in a poor model
performance. The create model statement is a BigQuery ML statement that creates and trains an ML
model from a SQL query. However, if the input data contains categorical features, you need to
encode them as one-hot vectors or use the category_count option to specify the number of
categories for each feature. Otherwise, BigQuery ML would treat the categorical features as
numerical values, which can introduce bias and noise into the model3.
Option D: Using the ML.ONE_HOT_ENCODER function on the categorical features, and selecting the
encoded categorical features and non-categorical features as inputs to create your model, is the
same as option B, and has the same drawbacks.
Reference:
Preparing for Google Cloud Certification: Machine Learning Engineer, Course 2: Data Engineering for
ML on Google Cloud, Week 2: Feature Engineering
Google Cloud Professional Machine Learning Engineer Exam Guide, Section 1: Architecting low-code
ML solutions, 1.1 Developing ML models by using BigQuery ML
Official Google Cloud Certified Professional Machine Learning Engineer Study Guide, Chapter 3: Data
Engineering for ML, Section 3.2: BigQuery for ML
One-hot encoding
Using the TRANSFORM clause for feature engineering
Creating a model
ML.ONE_HOT_ENCODER function
Question # 8
You are an ML engineer at a manufacturing company You are creating a classification model for a
predictive maintenance use case You need to predict whether a crucial machine will fail in the next
three days so that the repair crew has enough time to fix the machine before it breaks. Regular
maintenance of the machine is relatively inexpensive, but a failure would be very costly You have
trained several binary classifiers to predict whether the machine will fail. where a prediction of 1
means that the ML model predicts a failure.
You are now evaluating each model on an evaluation dataset. You want to choose a model that
prioritizes detection while ensuring that more than 50% of the maintenance jobs triggered by your
model address an imminent machine failure. Which model should you choose?
A. The model with the highest area under the receiver operating characteristic curve (AUC ROC) and precision greater than 0.5.
B. The model with the lowest root mean squared error (RMSE) and recall greater than 0.5.
C. The model with the highest recall where precision is greater than 0.5.
D. The model with the highest precision where recall is greater than 0.5.
Answer: C
Explanation:
The best option for choosing a model that prioritizes detection while ensuring that more than 50% of
the maintenance jobs triggered by the model address an imminent machine failure is to choose the
model with the highest recall where precision is greater than 0.5. This option has the following
advantages:
It maximizes the recall, which is the proportion of actual failures that are correctly predicted by the
model. Recall is also known as sensitivity or true positive rate (TPR), and it is calculated as:
Recall = TP / (TP + FN)
where TP is the number of true positives (actual failures that are predicted as failures) and FN is the
number of false negatives (actual failures that are predicted as non-failures). By maximizing the
recall, the model can reduce the number of false negatives, which are the most costly and
undesirable outcomes for the predictive maintenance use case, as they represent missed failures
that can lead to machine breakdown and downtime.
It constrains the precision, which is the proportion of predicted failures that are actual failures.
Precision is also known as positive predictive value (PPV), and it is calculated as:
Precision = TP / (TP + FP)
where FP is the number of false positives (actual non-failures that are predicted as failures). By
constraining the precision to be greater than 0.5, the model can ensure that more than 50% of the
maintenance jobs triggered by the model address an imminent machine failure, which can avoid
unnecessary or wasteful maintenance costs.
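A small Python sketch of this selection rule, using placeholder evaluation labels and predictions, might look like the following: filter the candidates by the precision constraint, then maximize recall among the survivors.

# Hedged sketch: pick the model with the highest recall subject to precision > 0.5.
from sklearn.metrics import precision_score, recall_score

def pick_model(candidates, y_true):
    """candidates: dict mapping model name -> list of 0/1 predictions on the evaluation set."""
    best_name, best_recall = None, -1.0
    for name, y_pred in candidates.items():
        precision = precision_score(y_true, y_pred)
        recall = recall_score(y_true, y_pred)
        if precision > 0.5 and recall > best_recall:  # constraint first, then maximize recall
            best_name, best_recall = name, recall
    return best_name

# Toy evaluation labels and two hypothetical candidate models:
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
candidates = {
    "model_a": [1, 0, 1, 0, 0, 1, 1, 0],
    "model_b": [1, 1, 1, 1, 0, 1, 1, 0],
}
print(pick_model(candidates, y_true))  # prints "model_b" (both pass the constraint, b has higher recall)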
The other options are less optimal for the following reasons:
Option A: Choosing the model with the highest area under the receiver operating characteristic curve
(AUC ROC) and precision greater than 0.5 may not prioritize detection, as the AUC ROC does not
directly measure the recall. The AUC ROC is a summary metric that evaluates the overall
performance of a binary classifier across all possible thresholds. The ROC curve plots the TPR (recall)
against the false positive rate (FPR), which is the proportion of actual non-failures that are incorrectly
predicted by the model. The AUC ROC is the area under the ROC curve, and it ranges from 0 to 1,
where 1 represents a perfect classifier. However, choosing the model with the highest AUC ROC may
not maximize the recall, as the AUC ROC is influenced by both the TPR and the FPR, and it does not
account for the precision or the specificity (the proportion of actual non-failures that are correctly
predicted by the model).
Option B: Choosing the model with the lowest root mean squared error (RMSE) and recall greater
than 0.5 may not prioritize detection, as the RMSE is not a suitable metric for binary classification.
The RMSE is a regression metric that measures the average magnitude of the error between the
predicted and the actual values. The RMSE is calculated as:
RMSE = sqrt((1/n) * Σ (y_i − ŷ_i)²)
where y_i is the actual value, ŷ_i is the predicted value, and n is the number of observations.
However, choosing the model with the lowest RMSE may not optimize the detection of failures, as
the RMSE is sensitive to outliers and does not account for the class imbalance or the cost of
misclassification.
Option D: Choosing the model with the highest precision where recall is greater than 0.5 may not
prioritize detection, as the precision may not be the most important metric for the predictive
maintenance use case. The precision measures the accuracy of the positive predictions, but it does
not reflect the sensitivity or the coverage of the model. By choosing the model with the highest
precision, the model may sacrifice the recall, which is the proportion of actual failures that are
correctly predicted by the model. This may increase the number of false negatives, which are the
most costly and undesirable outcomes for the predictive maintenance use case, as they represent
missed failures that can lead to machine breakdown and downtime.
Reference:
Evaluation Metrics (Classifiers) - Stanford University
Evaluation of binary classifiers - Wikipedia
Predictive Maintenance: The greatest benefits and smart use cases
Question # 9
You need to develop an image classification model by using a large dataset that contains labeled images in a Cloud Storage bucket. What should you do?
A. Use Vertex AI Pipelines with the Kubeflow Pipelines SDK to create a pipeline that reads the images from Cloud Storage and trains the model.
B. Use Vertex AI Pipelines with TensorFlow Extended (TFX) to create a pipeline that reads the images from Cloud Storage and trains the model.
C. Import the labeled images as a managed dataset in Vertex AI, and use AutoML to train the model.
D. Convert the image dataset to a tabular format using Dataflow. Load the data into BigQuery, and use BigQuery ML to train the model.
Answer: C
Explanation:
The best option for developing an image classification model by using a large dataset that contains
labeled images in a Cloud Storage bucket is to import the labeled images as a managed dataset in
Vertex AI and use AutoML to train the model. This option allows you to leverage the power and
simplicity of Google Cloud to create and deploy a high-quality image classification model with
minimal code and configuration. Vertex AI is a unified platform for building and deploying machine
learning solutions on Google Cloud. Vertex AI can create a managed dataset from a Cloud Storage
bucket that contains labeled images, which can be used to train an AutoML model. AutoML is a
service that can automatically build and optimize machine learning models for various tasks, such as
image classification, object detection, natural language processing, and tabular data analysis.
AutoML can handle the complex aspects of machine learning, such as feature engineering, model
architecture, hyperparameter tuning, and model evaluation. AutoML can also evaluate, deploy, and
monitor the image classification model, and provide online or batch predictions. By using Vertex AI
and AutoML, users can develop an image classification model by using a large dataset with ease and
efficiency.
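A minimal sketch of option C with the google-cloud-aiplatform SDK follows; the bucket, import file, and display names are assumptions, and the import file must list image URIs and labels in the format Vertex AI expects for single-label image classification.

# Hedged sketch: Cloud Storage images -> Vertex AI managed image dataset -> AutoML training.
from google.cloud import aiplatform
from google.cloud.aiplatform import schema

aiplatform.init(project="my-project", location="us-central1")

dataset = aiplatform.ImageDataset.create(
    display_name="product-images",
    gcs_source="gs://my-bucket/labels/import_file.csv",  # CSV/JSONL of image URIs and labels
    import_schema_uri=schema.dataset.ioformat.image.single_label_classification,
)

job = aiplatform.AutoMLImageTrainingJob(
    display_name="product-image-classifier",
    prediction_type="classification",
)
model = job.run(
    dataset=dataset,
    model_display_name="product-image-classifier",
    budget_milli_node_hours=8000,
)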
The other options are not as good as option C, for the following reasons:
Option A: Using Vertex AI Pipelines with the Kubeflow Pipelines SDK to create a pipeline that reads
the images from Cloud Storage and trains the model would require more skills and steps than using
Vertex AI and AutoML. Vertex AI Pipelines is a service that can orchestrate machine learning
workflows using Vertex AI. Vertex AI Pipelines can run preprocessing and training steps on custom
Docker images, and evaluate, deploy, and monitor the machine learning model. Kubeflow Pipelines
SDK is a Python library that can create and run pipelines on Vertex AI Pipelines or on Kubeflow, an
open-source platform for machine learning on Kubernetes. However, using Vertex AI Pipelines and
Kubeflow Pipelines SDK would require writing code, building Docker images, defining pipeline
components and steps, and managing the pipeline execution and artifacts. Moreover, Vertex AI
Pipelines and Kubeflow Pipelines SDK are not specialized for image classification, and users would
need to use other libraries or frameworks, such as TensorFlow or PyTorch, to build and train the
image classification model.
Option B: Using Vertex AI Pipelines with TensorFlow Extended (TFX) to create a pipeline that reads
the images from Cloud Storage and trains the model would require more skills and steps than using
Vertex AI and AutoML. TensorFlow Extended (TFX) is a framework that can create and run end-to-end
machine learning pipelines on TensorFlow, a popular library for building and training deep learning
models. TFX can preprocess the data, train and evaluate the model, validate and push the model,
and serve the model for online or batch predictions. However, using Vertex AI Pipelines and TFX
would require writing code, building Docker images, defining pipeline components and steps, and
managing the pipeline execution and artifacts. Moreover, TFX is not optimized for image
classification, and users would need to use other libraries or tools, such as TensorFlow Data
Validation, TensorFlow Transform, and TensorFlow Hub, to handle the image data and the model
architecture.
Option D: Converting the image dataset to a tabular format using Dataflow, loading the data into
BigQuery, and using BigQuery ML to train the model would not handle the image data properly and
could result in a poor model performance. Dataflow is a service that can create scalable and reliable
pipelines to process large volumes of data from various sources. Dataflow can preprocess the data by
using Apache Beam, a programming model for defining and executing data processing workflows.
BigQuery is a serverless, scalable, and cost-effective data warehouse that can perform fast and
interactive queries on large datasets. BigQuery ML is a service that can create and train machine
learning models by using SQL queries on BigQuery. However, converting the image data to a tabular
format would lose the spatial and semantic information of the images, which are essential for image
classification. Moreover, BigQuery ML is not specialized for image classification, and users would
need to use other tools or techniques, such as feature hashing, embedding, or one-hot encoding, to
handle the categorical features.
Question # 10
You are developing an image recognition model using PyTorch based on ResNet50 architecture. Your
code is working fine on your local laptop on a small subsample. Your full dataset has 200k labeled
images. You want to quickly scale your training workload while minimizing cost. You plan to use 4
V100 GPUs. What should you do?
A. Configure a Compute Engine VM with all the dependencies that launches the training. Train your model with Vertex AI using a custom tier that contains the required GPUs.
B. Package your code with Setuptools, and use a pre-built container. Train your model with Vertex AI using a custom tier that contains the required GPUs.
C. Create a Vertex AI Workbench user-managed notebooks instance with 4 V100 GPUs, and use it to train your model.
D. Create a Google Kubernetes Engine cluster with a node pool that has 4 V100 GPUs. Prepare and submit a TFJob operator to this node pool.
Answer: B
Explanation:
The best option for scaling the training workload while minimizing cost is to package the code with
Setuptools, and use a pre-built container. Train the model with Vertex AI using a custom tier that
contains the required GPUs. This option has the following advantages:
It allows the code to be easily packaged and deployed, as Setuptools is a Python tool that helps to
create and distribute Python packages, and pre-built containers are Docker images that contain all
the dependencies and libraries needed to run the code. By packaging the code with Setuptools, and
using a pre-built container, you can avoid the hassle and complexity of building and maintaining your
own custom container, and ensure the compatibility and portability of your code across different
environments.
It leverages the scalability and performance of Vertex AI, which is a fully managed service that
provides various tools and features for machine learning, such as training, tuning, serving, and
monitoring. By training the model with Vertex AI, you can take advantage of the distributed and
parallel training capabilities of Vertex AI, which can speed up the training process and improve the
model quality. Vertex AI also supports various frameworks and models, such as PyTorch and
ResNet50, and allows you to use custom containers and custom tiers to customize your training
configuration and resources.
It reduces the cost and complexity of the training process, as Vertex AI allows you to use a custom
tier that contains the required GPUs, which can optimize the resource utilization and allocation for
your training job. By using a custom tier that contains 4 V100 GPUs, you can match the number and
type of GPUs that you plan to use for your training job, and avoid paying for unnecessary or
underutilized resources. Vertex AI also offers various pricing options and discounts, such as per-second
billing, sustained use discounts, and preemptible VMs, that can lower the cost of the training
process.
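A minimal sketch of option B with the google-cloud-aiplatform SDK follows; the package URI, pre-built container image tag, machine type, and trainer arguments are assumptions and should be checked against the current list of pre-built training containers.

# Hedged sketch: Setuptools package + pre-built PyTorch container + 4 V100 GPUs on Vertex AI.
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1", staging_bucket="gs://my-staging-bucket")

job = aiplatform.CustomPythonPackageTrainingJob(
    display_name="resnet50-training",
    python_package_gcs_uri="gs://my-staging-bucket/trainer-0.1.tar.gz",  # built with `python setup.py sdist`
    python_module_name="trainer.task",
    container_uri="us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-13:latest",  # assumed pre-built image
)

job.run(
    replica_count=1,
    machine_type="n1-standard-16",
    accelerator_type="NVIDIA_TESLA_V100",
    accelerator_count=4,
    args=["--epochs", "10", "--data-dir", "gs://my-bucket/images"],  # hypothetical trainer flags
)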
The other options are less optimal for the following reasons:
Option A: Configuring a Compute Engine VM with all the dependencies that launches the training.
Train the model with Vertex AI using a custom tier that contains the required GPUs, introduces
additional complexity and overhead. This option requires creating and managing a Compute Engine
VM, which is a virtual machine that runs on Google Cloud. However, using a Compute Engine VM to
launch the training may not be necessary or efficient, as it requires installing and configuring all the
dependencies and libraries needed to run the code, and maintaining and updating the VM.
Moreover, using a Compute Engine VM to launch the training may incur additional cost and latency,
as it requires paying for the VM usage and transferring the data and the code between the VM and
Vertex AI.
Option C: Creating a Vertex AI Workbench user-managed notebooks instance with 4 V100 GPUs, and
using it to train the model, introduces additional cost and risk. This option requires creating and
managing a Vertex AI Workbench user-managed notebooks instance, which is a service that allows
you to create and run Jupyter notebooks on Google Cloud. However, using a Vertex AI Workbench
user-managed notebooks instance to train the model may not be optimal or secure, as it requires
paying for the notebooks instance usage, which can be expensive and wasteful, especially if the
notebooks instance is not used for other purposes. Moreover, using a Vertex AI Workbench user-managed
notebooks instance to train the model may expose the model and the data to potential
security or privacy issues, as the notebooks instance is not fully managed by Google Cloud, and may
be accessed or modified by unauthorized users or malicious actors.
Option D: Creating a Google Kubernetes Engine cluster with a node pool that has 4 V100 GPUs.
Prepare and submit a TFJob operator to this node pool, introduces additional complexity and cost.
This option requires creating and managing a Google Kubernetes Engine cluster, which is a fully
managed service that runs Kubernetes clusters on Google Cloud. Moreover, this option requires
creating and managing a node pool that has 4 V100 GPUs, which is a group of nodes that share the
same configuration and resources. Furthermore, this option requires preparing and submitting a
TFJob operator to this node pool, which is a Kubernetes custom resource that defines a TensorFlow
training job. However, using Google Kubernetes Engine, node pool, and TFJob operator to train the
model may not be necessary or efficient, as it requires configuring and maintaining the cluster, the
node pool, and the TFJob operator, and paying for their usage. Moreover, using Google Kubernetes
Engine, node pool, and TFJob operator to train the model may not be compatible or scalable, as they
are designed for TensorFlow models, not PyTorch models, and may not support distributed or parallel
training.
Reference:
[Vertex AI: Training with custom containers]
[Vertex AI: Using custom machine types]
[Setuptools documentation]
[PyTorch documentation]
[ResNet50 | PyTorch]
Question # 11
You are developing a model to detect fraudulent credit card transactions. You need to prioritize detection because missing even one fraudulent transaction could severely impact the credit card holder. You used AutoML to train a model on users' profile information and credit card transaction data. After training the initial model, you notice that the model is failing to detect many fraudulent transactions. How should you adjust the training parameters in AutoML to improve model performance? (Choose 2 answers.)
A. Increase the score threshold.
B. Decrease the score threshold.
C. Add more positive examples to the training set.
D. Add more negative examples to the training set.
E. Reduce the maximum number of node hours for training.
Answer: B, C
Explanation:
The best options for adjusting the training parameters in AutoML to improve model performance are
to decrease the score threshold and add more positive examples to the training set. These options
can help increase the detection rate of fraudulent transactions, which is the priority for this use case.
The score threshold is a parameter that determines the minimum probability score that a prediction
must have to be classified as positive. Decreasing the score threshold can increase the recall of the
model, which is the proportion of actual positive cases that are correctly identified. Increasing the
recall can help reduce the number of false negatives, which are fraudulent transactions that are
missed by the model. However, decreasing the score threshold can also decrease the precision of the
model, which is the proportion of positive predictions that are actually correct. Decreasing the
precision can increase the number of false positives, which are legitimate transactions that are
flagged as fraudulent by the model. Therefore, there is a trade-off between recall and precision, and
the optimal score threshold depends on the business objective and the cost of errors1. Adding more
positive examples to the training set can help balance the data distribution and improve the model
performance. Positive examples are the instances that belong to the target class, which in this case
are fraudulent transactions. Negative examples are the instances that belong to the other class,
which in this case are legitimate transactions. Fraudulent transactions are usually rare and
imbalanced compared to legitimate transactions, which can cause the model to be biased towards
the majority class and fail to learn the characteristics of the minority class. Adding more positive
examples can help the model learn more features and patterns of the fraudulent transactions, and
increase the detection rate2.
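When genuinely new positive examples are hard to collect, one simple (if imperfect) way to add positives before retraining is to oversample the existing fraudulent records; the sketch below is illustrative only, with hypothetical file and column names, and is a stand-in for collecting or synthesizing truly new positives.

# Hedged sketch: upsample the minority (fraud) class before retraining.
import pandas as pd

df = pd.read_csv("transactions.csv")  # assumed file with an `is_fraud` label column

fraud = df[df["is_fraud"] == 1]
legit = df[df["is_fraud"] == 0]

# Sample the minority class with replacement until it is roughly 20% of the data.
target_n = int(0.25 * len(legit))
fraud_upsampled = fraud.sample(n=target_n, replace=True, random_state=42)

balanced = pd.concat([legit, fraud_upsampled]).sample(frac=1, random_state=42)  # shuffle rows
balanced.to_csv("transactions_balanced.csv", index=False)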
The other options are not as good as options B and C, for the following reasons:
Option A: Increasing the score threshold would decrease the detection rate of fraudulent
transactions, which is the opposite of the desired outcome. Increasing the score threshold would
decrease the recall of the model, which is the proportion of actual positive cases that are correctly
identified. Decreasing the recall would increase the number of false negatives, which are fraudulent
transactions that are missed by the model. Increasing the score threshold would increase the
precision of the model, which is the proportion of positive predictions that are actually correct.
Increasing the precision would decrease the number of false positives, which are legitimate
transactions that are flagged as fraudulent by the model. However, in this use case, the cost of false
negatives is much higher than the cost of false positives, so increasing the score threshold is not a
good option1.
Option D: Adding more negative examples to the training set would not improve the model
performance, and could worsen the data imbalance. Negative examples are the instances that
belong to the other class, which in this case are legitimate transactions. Legitimate transactions are
usually abundant and dominant compared to fraudulent transactions, which can cause the model to
be biased towards the majority class and fail to learn the characteristics of the minority class. Adding
more negative examples would exacerbate this problem, and decrease the detection rate of the
fraudulent transactions2.
Option E: Reducing the maximum number of node hours for training would not improve the model
performance, and could limit the model optimization. Node hours are the units of computation that
are used to train an AutoML model. The maximum number of node hours is a parameter that
determines the upper limit of node hours that can be used for training. Reducing the maximum
number of node hours would reduce the training time and cost, but also the model quality and
accuracy. Reducing the maximum number of node hours would limit the number of iterations, trials,
and evaluations that the model can perform, and prevent the model from finding the optimal
hyperparameters and architecture3.
Reference:
Preparing for Google Cloud Certification: Machine Learning Engineer, Course 5: Responsible AI,
Week 4: Evaluation
Google Cloud Professional Machine Learning Engineer Exam Guide, Section 2: Developing high-quality ML models, 2.2 Handling imbalanced data
Official Google Cloud Certified Professional Machine Learning Engineer Study Guide, Chapter 4: Low-code ML Solutions, Section 4.3: AutoML
Understanding the score threshold slider
Handling imbalanced data sets in machine learning
AutoML Vision pricing
Question # 12
You are developing an ML model using a dataset with categorical input variables. You have randomly
split half of the data into training and test sets. After applying one-hot encoding on the categorical
variables in the training set, you discover that one categorical variable is missing from the test set.
What should you do?
A. Randomly redistribute the data, with 70% for the training set and 30% for the test set.
B. Use sparse representation in the test set.
C. Apply one-hot encoding on the categorical variables in the test data.
D. Collect more data representing all categories.
Answer: C
Explanation:
The best option for dealing with the missing categorical variable in the test set is to apply one-hot
encoding on the categorical variables in the test data. This option has the following advantages:
It ensures the consistency and compatibility of the data format for the ML model, as the one-hot
encoding transforms the categorical variables into binary vectors that can be easily processed by the
model. By applying one-hot encoding on the categorical variables in the test data, you can match the
number and order of the features in the test data with the training data, and avoid any errors or
discrepancies in the model prediction.
It preserves the information and relevance of the data for the ML model, as the one-hot encoding
creates a separate feature for each possible value of the categorical variable, and assigns a value of 1
to the feature corresponding to the actual value of the variable, and 0 to the rest. By applying one-hot
encoding on the categorical variables in the test data, you can retain the original meaning and
importance of the categorical variable, and avoid any loss or distortion of the data.
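A minimal sketch of option C with scikit-learn follows; fitting the encoder on the training set and reusing it on the test set keeps the feature columns aligned even when a category is absent from the test data. The column values are hypothetical, and sparse_output assumes scikit-learn 1.2 or later.

# Hedged sketch: one encoder fit on the training data, reused on the test data.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"color": ["red", "green", "blue"], "size": ["S", "M", "L"]})
test = pd.DataFrame({"color": ["red", "blue"], "size": ["M", "M"]})  # "green" is missing from the test set

encoder = OneHotEncoder(handle_unknown="ignore", sparse_output=False)
X_train = encoder.fit_transform(train)  # learns the full category vocabulary from the training set
X_test = encoder.transform(test)        # same columns; absent categories simply become all-zero columns

print(X_train.shape, X_test.shape)      # both have the same number of features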
The other options are less optimal for the following reasons:
Option A: Randomly redistributing the data, with 70% for the training set and 30% for the test set,
introduces additional complexity and risk. This option requires reshuffling and splitting the data
again, which can be tedious and time-consuming. Moreover, this option may not guarantee that the
missing categorical variable will be present in the test set, as it depends on the randomness of the
data distribution. Furthermore, this option may affect the quality and validity of the ML model, as it
may change the data characteristics and patterns that the model has learned from the original
training set.
Option B: Using sparse representation in the test set introduces additional overhead and inefficiency.
This option requires converting the categorical variables in the test set into sparse vectors, which are
vectors that have mostly zero values and only store the indices and values of the non-zero elements.
However, using sparse representation in the test set may not be compatible with the ML model, as
the model expects the input data to have the same format and dimensionality as the training data,
which uses one-hot encoding. Moreover, using sparse representation in the test set may not be
efficient or scalable, as it requires additional computation and memory to store and process the
sparse vectors.
Option D: Collecting more data representing all categories introduces additional cost and delay. This
option requires obtaining and labeling more data that contains the missing categorical variable,
which can be expensive and time-consuming. Moreover, this option may not be feasible or
necessary, as the missing categorical variable may not be available or relevant for the test data,
depending on the data source or the business problem.
Question # 13
You have built a model that is trained on data stored in Parquet files. You access the data through a
Hive table hosted on Google Cloud. You preprocessed these data with PySpark and exported it as a
CSV file into Cloud Storage. After preprocessing, you execute additional steps to train and evaluate
your model. You want to parametrize this model training in Kubeflow Pipelines. What should you do?
A. Remove the data transformation step from your pipeline.
B. Containerize the PySpark transformation step, and add it to your pipeline.
C. Add a ContainerOp to your pipeline that spins up a Dataproc cluster, runs a transformation, and then saves the transformed data in Cloud Storage.
D. Deploy Apache Spark at a separate node pool in a Google Kubernetes Engine cluster. Add a ContainerOp to your pipeline that invokes a corresponding transformation job for this Spark instance.
Answer: C
Explanation:
The best option for parametrizing the model training in Kubeflow Pipelines is to add a ContainerOp
to the pipeline that spins a Dataproc cluster, runs a transformation, and then saves the transformed
data in Cloud Storage. This option has the following advantages:
It allows the data transformation to be performed as part of the Kubeflow Pipeline, which can ensure
the consistency and reproducibility of the data processing and the model training. By adding a
ContainerOp to the pipeline, you can define the parameters and the logic of the data transformation
step, and integrate it with the other steps of the pipeline, such as the model training and evaluation.
It leverages the scalability and performance of Dataproc, which is a fully managed service that runs
Apache Spark and Apache Hadoop clusters on Google Cloud. By spinning a Dataproc cluster, you can
run the PySpark transformation on the Parquet files stored in the Hive table, and take advantage of
the parallelism and speed of Spark. Dataproc also supports various features and integrations, such as
autoscaling, preemptible VMs, and connectors to other Google Cloud services, that can optimize the
data processing and reduce the cost.
It simplifies the data storage and access, as the transformed data is saved in Cloud Storage, which is a
scalable, durable, and secure object storage service. By saving the transformed data in Cloud
Storage, you can avoid the overhead and complexity of managing the data in the Hive table or the
Parquet files. Moreover, you can easily access the transformed data from Cloud Storage, using
various tools and frameworks, such as TensorFlow, BigQuery, or Vertex AI.
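As a rough sketch of what option C could look like with the Kubeflow Pipelines v1 SDK, the ContainerOp below shells out to the gcloud CLI to create an ephemeral Dataproc cluster, submit the PySpark transformation, and delete the cluster afterwards. The project ID, region, bucket paths, and script names are placeholders, not values from the question.

```python
from kfp import dsl


def dataproc_transform_op(project: str, region: str, cluster: str,
                          pyspark_script: str, output_path: str) -> dsl.ContainerOp:
    """ContainerOp that creates a Dataproc cluster, runs a PySpark job, then deletes the cluster."""
    commands = " && ".join([
        f"gcloud dataproc clusters create {cluster} --project {project} --region {region} --single-node",
        f"gcloud dataproc jobs submit pyspark {pyspark_script} "
        f"--project {project} --region {region} --cluster {cluster} -- --output {output_path}",
        f"gcloud dataproc clusters delete {cluster} --project {project} --region {region} --quiet",
    ])
    return dsl.ContainerOp(
        name="dataproc-transform",
        image="google/cloud-sdk:slim",  # image with the gcloud CLI preinstalled
        command=["bash", "-c"],
        arguments=[commands],
    )


@dsl.pipeline(name="parametrized-training", description="Transform on Dataproc, then train.")
def pipeline(project: str,
             region: str = "us-central1",
             pyspark_script: str = "gs://my-bucket/transform.py",   # hypothetical path
             output_path: str = "gs://my-bucket/transformed/"):     # hypothetical path
    transform = dataproc_transform_op(project, region, "ephemeral-prep", pyspark_script, output_path)
    # Downstream training and evaluation steps would consume output_path and run .after(transform).
```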
The other options are less optimal for the following reasons:
Option A: Removing the data transformation step from the pipeline eliminates the parametrization
of the model training, as the data processing and the model training are decoupled and independent.
This option requires running the PySpark transformation separately from the Kubeflow Pipeline,
which can introduce inconsistency and unreproducibility in the data processing and the model
training. Moreover, this option requires managing the data in the Hive table or the Parquet files,
which can be cumbersome and inefficient.
Option B: Containerizing the PySpark transformation step, and adding it to the pipeline introduces
additional complexity and overhead. This option requires creating and maintaining a Docker image
that can run the PySpark transformation, which can be challenging and time-consuming. Moreover,
this option requires running the PySpark transformation on a single container, which can be slow and
inefficient, as it does not leverage the parallelism and performance of Spark.
Option D: Deploying Apache Spark at a separate node pool in a Google Kubernetes Engine cluster,
and adding a ContainerOp to the pipeline that invokes a corresponding transformation job for this
Spark instance introduces additional complexity and cost. This option requires creating and managing
a separate node pool in a Google Kubernetes Engine cluster, which is a fully managed service that
runs Kubernetes clusters on Google Cloud. Moreover, this option requires deploying and running
Apache Spark on the node pool, which can be tedious and costly, as it requires configuring and
maintaining the Spark cluster, and paying for the node pool usage.
Question # 14
You work for a magazine publisher and have been tasked with predicting whether customers will
cancel their annual subscription. In your exploratory data analysis, you find that 90% of individuals
renew their subscription every year, and only 10% of individuals cancel their subscription. After
training a NN Classifier, your model predicts those who cancel their subscription with 99% accuracy
and predicts those who renew their subscription with 82% accuracy. How should you interpret these
results?
A. This is not a good result because the model should have a higher accuracy for those who renew their subscription than for those who cancel their subscription.
B. This is not a good result because the model is performing worse than predicting that people will always renew their subscription.
C. This is a good result because predicting those who cancel their subscription is more difficult, since there is less data for this group.
D. This is a good result because the accuracy across both groups is greater than 80%.
Answer: B
Explanation:
This is not a good result because the model is performing worse than predicting that people will
always renew their subscription. This option has the following reasons:
It indicates that the model is performing worse than a trivial baseline. Since 90% of the individuals renew their subscription every year, a model that simply predicts that everyone will renew achieves 90% accuracy without using any features at all. Weighting the reported per-class accuracies by the class proportions gives an overall accuracy of roughly 0.9 × 0.82 + 0.1 × 0.99 ≈ 83.7%, which is below that 90% baseline. This suggests that the model has been tuned toward the minority class (those who cancel their subscription) at the expense of the majority class (those who renew their subscription).
It implies that the model may not be reliable for the business problem of preventing churn and increasing customer retention. The 82% accuracy on renewing customers means that roughly 18% of loyal subscribers are incorrectly flagged as likely to cancel, which can waste retention offers on customers who were never at risk. At the same time, the near-perfect 99% accuracy on the small cancellation class is suspicious and may indicate that the model is exploiting leakage in the data, such as a feature that indirectly reveals the outcome of the prediction. Taken together, the per-class numbers hide a model that performs worse overall than the naive always-renew rule, so the result should not be considered good.
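The comparison with the always-renew baseline can be checked with a few lines of arithmetic; the sketch below simply restates the per-class accuracies and class proportions given in the question.

```python
# Per-class accuracies and class proportions as stated in the question.
acc_renew, acc_cancel = 0.82, 0.99
p_renew, p_cancel = 0.90, 0.10

overall_acc = p_renew * acc_renew + p_cancel * acc_cancel
baseline_acc = p_renew  # trivial model: always predict "renew"

print(f"Model overall accuracy: {overall_acc:.1%}")   # ~83.7%
print(f"Always-renew baseline:  {baseline_acc:.1%}")  # 90.0%
print("Model beats baseline:", overall_acc > baseline_acc)  # False
```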
Reference:
How to Evaluate Machine Learning Models: Classification Metrics | Machine Learning Mastery
Imbalanced Classification: Predicting Subscription Churn | Machine Learning Mastery
Question # 15
You work for a retailer that sells clothes to customers around the world. You have been tasked with
ensuring that ML models are built in a secure manner. Specifically, you need to protect sensitive
customer data that might be used in the models. You have identified four fields containing sensitive
data that are being used by your data science team: AGE, IS_EXISTING_CUSTOMER,
LATITUDE_LONGITUDE, and SHIRT_SIZE. What should you do with the data before it is made
available to the data science team for training purposes?
A. Tokenize all of the fields using hashed dummy values to replace the real values.
B. Use principal component analysis (PCA) to reduce the four sensitive fields to one PCA vector.
C. Coarsen the data by putting AGE into quantiles and rounding LATITUDE_LONGITUDE into single precision. The other two fields are already as coarse as possible.
D. Remove all sensitive data fields, and ask the data science team to build their models using non-sensitive data.
Answer: C
Explanation:
The best option for protecting sensitive customer data that might be used in the ML models is to
coarsen the data by putting AGE into quantiles and rounding LATITUDE_LONGITUDE into single
precision. This option has the following advantages:
It preserves the utility and relevance of the data for the ML models, as the coarsened data still
captures the essential information and patterns that the models need to learn. For example, putting
AGE into quantiles can group the customers into different age ranges, which can be useful for
predicting their preferences or behavior. Rounding LATITUDE_LONGITUDE into single precision can
reduce the precision of the location data, but still retain the general geographic region of the
customers, which can be useful for personalizing the recommendations or offers.
It reduces the risk of exposing the personal or private information of the customers, as the coarsened
data makes it harder to identify or re-identify the individual customers from the data. For example,
putting AGE into quantiles can hide the exact age of the customers, which can be considered
sensitive or confidential. Rounding LATITUDE_LONGITUDE into single precision can obscure the exact
location of the customers, which can be considered sensitive or confidential.
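As a rough illustration of option C, the sketch below coarsens a toy customer table with pandas. The column names, the toy values, and the split of LATITUDE_LONGITUDE into two numeric columns are assumptions made for the example, not part of the question.

```python
import numpy as np
import pandas as pd

# Toy customer table standing in for the real dataset (all names and values hypothetical).
df = pd.DataFrame({
    "AGE": [23, 37, 41, 58, 64],
    "IS_EXISTING_CUSTOMER": [1, 0, 1, 1, 0],
    "LATITUDE": [37.4219983, 40.7127753, 51.5072178, -33.8688197, 35.6761919],
    "LONGITUDE": [-122.0840000, -74.0059728, -0.1275862, 151.2092955, 139.6503106],
    "SHIRT_SIZE": ["S", "M", "L", "XL", "M"],
})

# Coarsen AGE into quartile buckets instead of exposing exact ages.
df["AGE_BUCKET"] = pd.qcut(df["AGE"], q=4, labels=False)

# Reduce location precision: cast to single precision and keep about two decimal places.
for col in ["LATITUDE", "LONGITUDE"]:
    df[col] = df[col].astype(np.float32).round(2)

coarsened = df.drop(columns=["AGE"])
print(coarsened)
```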
The other options are less optimal for the following reasons:
Option A: Tokenizing all of the fields using hashed dummy values to replace the real values
eliminates the utility and relevance of the data for the ML models, as the tokenized data loses all the
information and patterns that the models need to learn. For example, tokenizing AGE using hashed
dummy values can make the data meaningless and irrelevant, as the models cannot learn anything
from the random tokens. Tokenizing LATITUDE_LONGITUDE using hashed dummy values can make
the data meaningless and irrelevant, as the models cannot learn anything from the random tokens.
Option B: Using principal component analysis (PCA) to reduce the four sensitive fields to one PCA
vector reduces the utility and relevance of the data for the ML models, as the PCA vector may not
capture all the information and patterns that the models need to learn. For example, using PCA to
reduce AGE, IS_EXISTING_CUSTOMER, LATITUDE_LONGITUDE, and SHIRT_SIZE to one PCA vector
can lose some information or introduce noise in the data, as the PCA vector is a linear combination
of the original features, which may not reflect their true relationship or importance. Moreover, using
PCA to reduce the four sensitive fields to one PCA vector may not reduce the risk of exposing the
personal or private information of the customers, as the PCA vector may still be reversible or linkable
to the original data, depending on the amount of variance explained by the PCA vector and the
availability of the PCA transformation matrix.
Option D: Removing all sensitive data fields, and asking the data science team to build their models
using non-sensitive data reduces the utility and relevance of the data for the ML models, as the non-sensitive
data may not contain enough information and patterns that the models need to learn. For
example, removing AGE, IS_EXISTING_CUSTOMER, LATITUDE_LONGITUDE, and SHIRT_SIZE from the
data can make the data insufficient and unrepresentative, as the models may not be able to learn the
factors that influence the customers' preferences or behavior. Moreover, removing all sensitive data
fields from the data may not be necessary or feasible, as the data protection legislation may allow
the use of sensitive data for the ML models, as long as the data is processed in a secure and ethical
manner, and the customers' consent and rights are respected.
Reference:
Protecting Sensitive Data and AI Models with Confidential Computing | NVIDIA Technical Blog
Training machine learning models from sensitive data | Fast Data Science
Securing ML applications. Model security and protection - Medium
Security of AI/ML systems, ML model security | Cossack Labs
Vulnerabilities, security and privacy for machine learning models
Question # 16
You work for a company that manages a ticketing platform for a large chain of cinemas. Customers
use a mobile app to search for movies they're interested in and purchase tickets in the app. Ticket
purchase requests are sent to Pub/Sub and are processed with a Dataflow streaming pipeline
configured to conduct the following steps:
1. Check for availability of the movie tickets at the selected cinema.
2. Assign the ticket price and accept payment.
3. Reserve the tickets at the selected cinema.
4. Send successful purchases to your database.
Each step in this process has low latency requirements (less than 50 milliseconds). You have
developed a logistic regression model with BigQuery ML that predicts whether offering a promo code
for free popcorn increases the chance of a ticket purchase, and this prediction should be added to
the ticket purchase process. You want to identify the simplest way to deploy this model to production
while adding minimal latency. What should you do?
A. Run batch inference with BigQuery ML every five minutes on each new set of tickets issued.
B. Export your model in TensorFlow format, and add a tfx_bsl.public.beam.RunInference step to the Dataflow pipeline.
C. Export your model in TensorFlow format, deploy it on Vertex AI, and query the prediction endpoint from your streaming pipeline.
D. Convert your model with TensorFlow Lite (TFLite), and add it to the mobile app so that the promo code and the incoming request arrive together in Pub/Sub.
Answer: B
Explanation:
The simplest way to deploy a logistic regression model with BigQuery ML to production while adding
minimal latency is to export the model in TensorFlow format, and add a
tfx_bsl.public.beam.RunInference step to the Dataflow pipeline. This option has the following
advantages:
It allows the model prediction to be performed in real time, as part of the Dataflow streaming
pipeline that processes the ticket purchase requests. This ensures that the promo code offer is based
on the most recent data and customer behavior, and that the offer is delivered to the customer
without delay.
It leverages the compatibility and performance of TensorFlow and Dataflow, which are both part of
the Google Cloud ecosystem. TensorFlow is a popular and powerful framework for building and
deploying machine learning models, and Dataflow is a fully managed service that runs Apache Beam
pipelines for data processing and transformation. By using the tfx_bsl.public.beam.RunInference
step, you can easily integrate your TensorFlow model with your Dataflow pipeline, and take
advantage of the parallelism and scalability of Dataflow.
It simplifies the model deployment and management, as the model is packaged with the Dataflow
pipeline and does not require a separate service or endpoint. The model can be updated by
redeploying the Dataflow pipeline with a new model version.
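A minimal sketch of option B is shown below, assuming the BigQuery ML model has been exported as a TensorFlow SavedModel in Cloud Storage and that each Pub/Sub message can be converted to a tf.train.Example. The topic name, model path, and feature names are hypothetical.

```python
import json

import apache_beam as beam
import tensorflow as tf
from apache_beam.options.pipeline_options import PipelineOptions
from tfx_bsl.public.beam import RunInference
from tfx_bsl.public.proto import model_spec_pb2


def to_example(message: bytes) -> tf.train.Example:
    # Convert an incoming Pub/Sub ticket request into a tf.train.Example (features are placeholders).
    request = json.loads(message.decode("utf-8"))
    return tf.train.Example(features=tf.train.Features(feature={
        "ticket_price": tf.train.Feature(
            float_list=tf.train.FloatList(value=[request["ticket_price"]])),
        "cinema_id": tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[request["cinema_id"].encode()])),
    }))


options = PipelineOptions(streaming=True)
with beam.Pipeline(options=options) as p:
    _ = (
        p
        | "ReadRequests" >> beam.io.ReadFromPubSub(
            topic="projects/my-project/topics/ticket-requests")  # hypothetical topic
        | "ToExample" >> beam.Map(to_example)
        | "PredictPromo" >> RunInference(
            model_spec_pb2.InferenceSpecType(
                saved_model_spec=model_spec_pb2.SavedModelSpec(
                    model_path="gs://my-bucket/exported_bqml_model")))  # exported SavedModel
        # Downstream steps would attach the promo decision to the purchase flow.
    )
```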
The other options are less optimal for the following reasons:
Option A: Running batch inference with BigQuery ML every five minutes on each new set of tickets
issued introduces additional latency and complexity. This option requires running a separate
BigQuery job every five minutes, which can incur network overhead and latency. Moreover, this
option requires storing and retrieving the intermediate results of the batch inference, which can
consume storage space and increase the data transfer time.
Option C: Exporting the model in TensorFlow format, deploying it on Vertex AI, and querying the
prediction endpoint from the streaming pipeline introduces additional latency and cost. This option
requires creating and managing a Vertex AI prediction endpoint, a managed service for serving online
predictions from deployed models.
However, querying the Vertex AI endpoint from the streaming pipeline requires making an HTTP
request, which can incur network overhead and latency. Moreover, this option requires paying for
the Vertex AI endpoint usage, which can increase the cost of the model deployment.
Option D: Converting the model with TensorFlow Lite (TFLite), and adding it to the mobile app so that
the promo code and the incoming request arrive together in Pub/Sub introduces additional
challenges and risks. This option requires converting the model to a TFLite format, which is a
lightweight and optimized format for running TensorFlow models on mobile and embedded devices.
However, converting the model to TFLite may not preserve the accuracy or functionality of the
original model, as some operations or features may not be supported by TFLite. Moreover, this
option requires updating the mobile app with the TFLite model, which can be tedious and time-consuming,
and may depend on the user's willingness to update the app. Additionally, this option
may expose the model to potential security or privacy issues, as the model is running on the user's
device and may be accessed or modified by malicious actors.
Reference:
[Exporting models for prediction | BigQuery ML]
[tfx_bsl.public.beam.run_inference | TensorFlow Extended]
[Vertex AI documentation]
[TensorFlow Lite documentation]
Question # 17
You deployed an ML model into production a year ago. Every month, you collect all raw requests that
were sent to your model prediction service during the previous month. You send a subset of these
requests to a human labeling service to evaluate your model's performance. After a year, you notice
that your model's performance sometimes degrades significantly after a month, while other times it
takes several months to notice any decrease in performance. The labeling service is costly, but you
also need to avoid large performance degradations. You want to determine how often you should
retrain your model to maintain a high level of performance while minimizing cost. What should you
do?
A. Train an anomaly detection model on the training dataset, and run all incoming requests through this model. If an anomaly is detected, send the most recent serving data to the labeling service.
B. Identify temporal patterns in your model's performance over the previous year. Based on these patterns, create a schedule for sending serving data to the labeling service for the next year.
C. Compare the cost of the labeling service with the lost revenue due to model performance degradation over the past year. If the lost revenue is greater than the cost of the labeling service, increase the frequency of model retraining; otherwise, decrease the model retraining frequency.
D. Run training-serving skew detection batch jobs every few days to compare the aggregate statistics of the features in the training dataset with recent serving data. If skew is detected, send the most recent serving data to the labeling service.
Answer: D
Explanation:
The best option for determining how often to retrain your model to maintain a high level of
performance while minimizing cost is to run training-serving skew detection batch jobs every few
days. Training-serving skew refers to the discrepancy between the distributions of the features in the
training dataset and the serving data. This can cause the model to perform poorly on the new data,
as it is not representative of the data that the model was trained on. By running training-serving
skew detection batch jobs, you can monitor the changes in the feature distributions over time, and
identify when the skew becomes significant enough to affect the model performance. If skew is
detected, you can send the most recent serving data to the labeling service, and use the labeled data
to retrain your model. This option has the following benefits:
It allows you to retrain your model only when necessary, based on the actual data changes, rather
than on a fixed schedule or a heuristic. This can save you the cost of the labeling service and the
retraining process, and also avoid overfitting or underfitting your model.
It leverages the existing tools and frameworks for training-serving skew detection, such as
TensorFlow Data Validation (TFDV) and Vertex Data Labeling. TFDV is a library that can compute and
visualize descriptive statistics for your datasets, and compare the statistics across different datasets.
Vertex Data Labeling is a service that can label your data with high quality and low latency, using
either human labelers or automated labelers.
It integrates well with the MLOps practices, such as continuous integration and continuous delivery
(CI/CD), which can automate the workflow of running the skew detection jobs, sending the data to
the labeling service, retraining the model, and deploying the new model version.
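A minimal sketch of such a skew detection job with TensorFlow Data Validation is shown below; the file paths, feature name, and threshold are placeholders.

```python
import tensorflow_data_validation as tfdv

# Hypothetical locations for the original training data and last period's serving logs.
TRAIN_CSV = "gs://my-bucket/data/train.csv"
SERVING_CSV = "gs://my-bucket/logs/serving_recent.csv"

train_stats = tfdv.generate_statistics_from_csv(TRAIN_CSV)
serving_stats = tfdv.generate_statistics_from_csv(SERVING_CSV)
schema = tfdv.infer_schema(train_stats)

# Flag a feature as skewed when the L-infinity distance between the training and
# serving distributions exceeds the chosen threshold.
tfdv.get_feature(schema, "payment_type").skew_comparator.infinity_norm.threshold = 0.01

anomalies = tfdv.validate_statistics(
    statistics=train_stats, schema=schema, serving_statistics=serving_stats)

if anomalies.anomaly_info:
    print("Skew detected; send recent serving data to the labeling service and retrain.")
else:
    print("No significant skew; skip labeling and retraining this cycle.")
```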
The other options are less optimal for the following reasons:
Option A: Training an anomaly detection model on the training dataset, and running all incoming
requests through this model, introduces additional complexity and overhead. This option requires
building and maintaining a separate model for anomaly detection, which can be challenging and
time-consuming. Moreover, this option requires running the anomaly detection model on every
request, which can increase the latency and resource consumption of the prediction service.
Additionally, this option may not capture the subtle changes in the feature distributions that can
affect the model performance, as anomalies are usually defined as rare or extreme events.
Option B: Identifying temporal patterns in your model's performance over the previous year, and
creating a schedule for sending serving data to the labeling service for the next year, introduces
additional assumptions and risks. This option requires analyzing the historical data and model
performance, and finding the patterns that can explain the variations in the model performance over
time. However, this can be difficult and unreliable, as the patterns may not be consistent or
predictable, and may depend on various factors that are not captured by the data. Moreover, this
option requires creating a schedule based on the past patterns, which may not reflect the future
changes in the data or the environment. This can lead to either sending too much or too little data to
the labeling service, resulting in either wasted cost or degraded performance.
Option C: Comparing the cost of the labeling service with the lost revenue due to model
performance degradation over the past year, and adjusting the frequency of model retraining
accordingly, introduces additional challenges and trade-offs. This option requires estimating the cost
of the labeling service and the lost revenue due to model performance degradation, which can be
difficult and inaccurate, as they may depend on various factors that are not easily quantifiable or
measurable. Moreover, this option requires finding the optimal balance between the cost and the
performance, which can be subjective and variable, as different stakeholders may have different
preferences and expectations. Furthermore, this option may not account for the potential impact of
the model performance degradation on other aspects of the business, such as customer satisfaction,
retention, or loyalty.
Question # 18
You work for an online publisher that delivers news articles to over 50 million readers. You have built
an AI model that recommends content for the company's weekly newsletter. A recommendation is
considered successful if the article is opened within two days of the newsletter's published date and
the user remains on the page for at least one minute.
All the information needed to compute the success metric is available in BigQuery and is updated
hourly. The model is trained on eight weeks of data, on average its performance degrades below the
acceptable baseline after five weeks, and training time is 12 hours. You want to ensure that the
model's performance is above the acceptable baseline while minimizing cost. How should you
monitor the model to determine when retraining is necessary?
A. Use Vertex AI Model Monitoring to detect skew of the input features with a sample rate of 100% and a monitoring frequency of two days.
B. Schedule a cron job in Cloud Tasks to retrain the model every week before the newsletter is created.
C. Schedule a weekly query in BigQuery to compute the success metric.
D. Schedule a daily Dataflow job in Cloud Composer to compute the success metric.
Answer: C
Explanation:
The best option for monitoring the model to determine when retraining is necessary is to schedule a
weekly query in BigQuery to compute the success metric. This option has the following advantages:
It allows the model performance to be evaluated regularly, based on the actual outcome of the
recommendations. By computing the success metric, which is the percentage of articles that are
opened within two days and read for at least one minute, you can measure how well the model is
achieving its objective and compare it with the acceptable baseline.
It leverages the scalability and efficiency of BigQuery, which is a serverless, fully managed, and highly
scalable data warehouse that can run complex queries over petabytes of data in seconds. By using
BigQuery, you can access and analyze all the information needed to compute the success metric,
such as the newsletter publication date, the article opening date, and the user reading time, without
worrying about the infrastructure or the cost.
It simplifies the model monitoring and retraining workflow, as the weekly query can be scheduled
and executed automatically using BigQuery's built-in scheduling feature. You can also set up alerts or
notifications to inform you when the success metric falls below the acceptable baseline, and trigger
the model retraining process accordingly.
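The sketch below shows the kind of query a weekly BigQuery scheduled query (or a small wrapper script using the BigQuery client) could run; the table, column names, and baseline value are hypothetical.

```python
from google.cloud import bigquery

client = bigquery.Client()

# The query body is what a weekly BigQuery scheduled query would execute.
QUERY = """
SELECT
  COUNTIF(opened_at <= TIMESTAMP_ADD(published_at, INTERVAL 2 DAY)
          AND read_seconds >= 60) / COUNT(*) AS success_rate
FROM `my-project.newsletter.recommendation_events`   -- hypothetical table
WHERE published_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
"""

row = next(iter(client.query(QUERY).result()))
BASELINE = 0.35  # hypothetical acceptable success rate

if row.success_rate < BASELINE:
    print(f"Success rate {row.success_rate:.2%} is below baseline; trigger retraining.")
```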
The other options are less optimal for the following reasons:
Option A: Using Vertex AI Model Monitoring to detect skew of the input features with a sample rate
of 100% and a monitoring frequency of two days introduces additional complexity and overhead.
This option requires setting up and managing a Vertex AI Model Monitoring job, a managed capability
that detects skew and drift in the feature values sent to a deployed model. However, using Vertex AI
Model Monitoring to detect skew of the
input features may not reflect the actual performance of the model, as skew is the discrepancy
between the distributions of the features in the training dataset and the serving data, which may not
affect the outcome of the recommendations. Moreover, using a sample rate of 100% and a
monitoring frequency of two days may incur unnecessary cost and latency, as it requires analyzing all
the input features every two days, which may not be needed for the model monitoring.
Option B: Scheduling a cron job in Cloud Tasks to retrain the model every week before the newsletter
is created introduces additional cost and risk. This option requires creating and running a cron job in
Cloud Tasks, which is a fully managed service that allows you to schedule and execute tasks that are
invoked by HTTP requests. However, using Cloud Tasks to retrain the model every week may not be
optimal, as it may retrain the model more often than necessary, wasting compute resources and cost.
Moreover, using Cloud Tasks to retrain the model before the newsletter is created may introduce
risk, as it may deploy a new model version that has not been tested or validated, potentially affecting
the quality of the recommendations.
Option D: Scheduling a daily Dataflow job in Cloud Composer to compute the success metric
introduces additional complexity and cost. This option requires creating and running a Dataflow job
in Cloud Composer, which is a fully managed service that runs Apache Airflow pipelines for workflow
orchestration. Dataflow is a fully managed service that runs Apache Beam pipelines for data
processing and transformation. However, using Dataflow and Cloud Composer to compute the
success metric may not be necessary, as it may add more steps and overhead to the model
monitoring process. Moreover, using Dataflow and Cloud Composer to compute the success metric
daily may not be optimal, as it may compute the success metric more often than needed, consuming
more compute resources and cost.
Reference:
[BigQuery documentation]
[Vertex AI Model Monitoring documentation]
[Cloud Tasks documentation]
[Cloud Composer documentation]
[Dataflow documentation]
Question # 19
You need to deploy a scikit-learn classification model to production. The model must be able to serve requests 24/7, and you expect millions of requests per second to the production application from 8 am to 7 pm. You need to minimize the cost of deployment. What should you do?
A. Deploy an online Vertex AI prediction endpoint. Set the max replica count to 1.
B. Deploy an online Vertex AI prediction endpoint. Set the max replica count to 100.
C. Deploy an online Vertex AI prediction endpoint with one GPU per replica. Set the max replica count to 1.
D. Deploy an online Vertex AI prediction endpoint with one GPU per replica. Set the max replica count to 100.
Answer: B
Explanation:
The best option for deploying a scikit-learn classification model to production is to deploy an online
Vertex AI prediction endpoint and set the max replica count to 100. This option allows you to
leverage the power and scalability of Google Cloud to serve requests 24/7 and handle millions of
requests per second. Vertex AI is a unified platform for building and deploying machine learning
solutions on Google Cloud. Vertex AI can deploy a trained scikit-learn model to an online prediction
endpoint, which can provide low-latency predictions for individual instances. An online prediction
endpoint consists of one or more replicas, which are copies of the model that run on virtual
machines. The max replica count is a parameter that determines the maximum number of replicas
that can be created for the endpoint. By setting the max replica count to 100, you can enable the
endpoint to scale up to 100 replicas when the traffic increases, and scale back down to the configured
minimum number of replicas when the traffic decreases. This can help minimize the cost of deployment, as you only pay for the
resources that you use. Moreover, you can use the autoscaling algorithm option to optimize the
scaling behavior of the endpoint based on the latency and utilization metrics1.
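A minimal deployment sketch with the Vertex AI SDK for Python is shown below; the project, bucket, and prebuilt scikit-learn serving image are placeholders, and an online endpoint always keeps at least the minimum replica count running.

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # hypothetical project/region

model = aiplatform.Model.upload(
    display_name="sklearn-classifier",
    artifact_uri="gs://my-bucket/model/",  # directory containing the exported model artifact
    serving_container_image_uri=(
        # Placeholder: pick the prebuilt scikit-learn prediction image matching your version.
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"),
)

endpoint = model.deploy(
    machine_type="n1-standard-4",
    min_replica_count=1,    # online endpoints keep at least one replica running
    max_replica_count=100,  # scale out to absorb the 8 am - 7 pm peak traffic
)

# endpoint.predict(instances=[[0.1, 0.2, 0.3]]) would then serve low-latency requests.
```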
The other options are not as good as option B, for the following reasons:
Option A: Deploying an online Vertex AI prediction endpoint and setting the max replica count to 1
would not be able to serve requests 24/7 and handle millions of requests per second. Setting the
max replica count to 1 would limit the endpoint to only one replica, which can cause performance
issues and service disruptions when the traffic increases. Moreover, setting the max replica count to
1 leaves the endpoint no room to scale out during the 8 am to 7 pm peak, so requests would be
queued or dropped instead of served1.
Option C: Deploying an online Vertex AI prediction endpoint with one GPU per replica and setting the
max replica count to 1 would not be able to serve requests 24/7 and handle millions of requests per
second, and would increase the cost of deployment. Adding a GPU to each replica would increase the
computational power of the endpoint, but it would also increase the cost of deployment, as GPUs
are more expensive than CPUs. Moreover, setting the max replica count to 1 would limit the
endpoint to only one replica, which can cause performance issues and service disruptions when the
traffic increases, and leaves no capacity to scale out during peak hours1. Furthermore, scikit-learn
models do not benefit from GPUs, as scikit-learn is not
optimized for GPU acceleration2.
Option D: Deploying an online Vertex AI prediction endpoint with one GPU per replica and setting the
max replica count to 100 would be able to serve requests 24/7 and handle millions of requests per
second, but it would increase the cost of deployment. Adding a GPU to each replica would increase
the computational power of the endpoint, but it would also increase the cost of deployment, as
GPUs are more expensive than CPUs. Setting the max replica count to 100 would enable the
endpoint to scale up to 100 replicas when the traffic increases, and scale back down to the configured
minimum when the traffic decreases, which can help minimize the cost of deployment. However, scikit-learn models
do not benefit from GPUs, as scikit-learn is not optimized for GPU acceleration2. Therefore, using
GPUs for scikit-learn models would be unnecessary and wasteful.
Reference:
Preparing for Google Cloud Certification: Machine Learning Engineer, Course 3: Production ML
Systems, Week 2: Serving ML Predictions
Google Cloud Professional Machine Learning Engineer Exam Guide, Section 3: Scaling ML models in
production, 3.1 Deploying ML models to production
Official Google Cloud Certified Professional Machine Learning Engineer Study Guide, Chapter 6:
Production ML Systems, Section 6.2: Serving ML Predictions
Online prediction
Scaling online prediction
scikit-learn FAQ
Question # 20
You work with a team of researchers to develop state-of-the-art algorithms for financial analysis. Your team develops and debugs complex models in TensorFlow. You want to maintain the ease of debugging while also reducing the model training time. How should you set up your training environment?
A. Configure a v3-8 TPU VM. SSH into the VM to train and debug the model.
B. Configure a v3-8 TPU node. Use Cloud Shell to SSH into the host VM to train and debug the model.
C. Configure an n1-standard-4 VM with 4 NVIDIA P100 GPUs. SSH into the VM and use ParameterServerStrategy to train the model.
D. Configure an n1-standard-4 VM with 4 NVIDIA P100 GPUs. SSH into the VM and use MultiWorkerMirroredStrategy to train the model.
Answer: A
Explanation:
A TPU VM is a virtual machine that has direct access to a Cloud TPU device. TPU VMs provide a
simpler and more flexible way to use Cloud TPUs, as they eliminate the need for a separate host VM
and network setup. TPU VMs also support interactive debugging tools such as TensorFlow Debugger
(tfdbg) and Python Debugger (pdb), which can help researchers develop and troubleshoot complex
models. A v3-8 TPU VM has 8 TPU cores, which can provide high performance and scalability for
training large models. SSHing into the TPU VM allows the user to run and debug the TensorFlow code
directly on the TPU device, without any network overhead or data transfer issues. Reference:
1: TPU VMs Overview
2: TPU VMs Quickstart
3: Debugging TensorFlow Models on Cloud TPUs
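As a rough illustration, the TensorFlow snippet below shows how training code typically connects to the TPU cores on a TPU VM while remaining an ordinary Python script that can be debugged interactively with print statements or pdb; the model itself is a placeholder. It only runs on an actual TPU VM.

```python
import tensorflow as tf

# On a Cloud TPU VM the TPU chips are attached to the VM itself, so the resolver
# connects to the local runtime instead of reaching a separate TPU node over the network.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="local")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# model.fit(...) then runs across the 8 TPU cores, while the script itself can be
# stepped through and inspected like any local Python program.
```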
Question # 21
You work for the AI team of an automobile company, and you are developing a visual defect
detection model using TensorFlow and Keras. To improve your model performance, you want to
incorporate some image augmentation functions such as translation, cropping, and contrast
tweaking. You randomly apply these functions to each training batch. You want to optimize your data
processing pipeline for run time and compute resources utilization. What should you do?
A. Embed the augmentation functions dynamically in the tf.Data pipeline.
B. Embed the augmentation functions dynamically as part of Keras generators.
C. Use Dataflow to create all possible augmentations, and store them as TFRecords.
D. Use Dataflow to create the augmentations dynamically per training run, and stage them as TFRecords.
Answer: A
Explanation:
The best option for optimizing the data processing pipeline for run time and compute resources
utilization is to embed the augmentation functions dynamically in the tf.Data pipeline. This option
has the following advantages:
It allows the data augmentation to be performed on the fly, without creating or storing additional
copies of the data. This saves storage space and reduces the data transfer time.
It leverages the parallelism and performance of the tf.Data API, which can efficiently apply the
augmentation functions to multiple batches of data in parallel, using multiple CPU cores or GPU
devices. The tf.Data API also supports various optimization techniques, such as caching, prefetching,
and autotuning, to improve the data processing speed and reduce the latency.
It integrates seamlessly with the TensorFlow and Keras models, which can consume the tf.Data
datasets as inputs for training and evaluation. The tf.Data API also supports various data formats,
such as images, text, audio, and video, and various data sources, such as files, databases, and web
services.
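A minimal sketch of option A is shown below, using randomly generated images in place of the real dataset; the augmentation choices, image sizes, and batch size are placeholders.

```python
import tensorflow as tf

AUTOTUNE = tf.data.AUTOTUNE


def augment(image, label):
    # Augmentations applied on the fly, inside the tf.data pipeline, per training element.
    image = tf.image.random_crop(image, size=[224, 224, 3])
    image = tf.image.random_flip_left_right(image)
    image = tf.image.random_contrast(image, lower=0.8, upper=1.2)
    return image, label


# Toy stand-in for the real defect-detection images.
images = tf.random.uniform([100, 256, 256, 3])
labels = tf.random.uniform([100], maxval=2, dtype=tf.int32)

train_ds = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .shuffle(512)
    .map(augment, num_parallel_calls=AUTOTUNE)  # parallel, on-the-fly augmentation
    .batch(32)
    .prefetch(AUTOTUNE)                          # overlap preprocessing with training
)

# train_ds can be passed directly to model.fit(train_ds, epochs=...).
```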
The other options are less optimal for the following reasons:
Option B: Embedding the augmentation functions dynamically as part of Keras generators introduces
some limitations and overhead. Keras generators are Python generators that yield batches of data for
training or evaluation. However, Keras generators are not compatible with the tf.distribute API,
which is used to distribute the training across multiple devices or machines. Moreover, Keras
generators are not as efficient or scalable as the tf.Data API, as they run on a single Python thread
and do not support parallelism or optimization techniques.
Option C: Using Dataflow to create all possible augmentations, and store them as TFRecords
introduces additional complexity and cost. Dataflow is a fully managed service that runs Apache
Beam pipelines for data processing and transformation. However, using Dataflow to create all
possible augmentations requires generating and storing a large number of augmented images, which
can consume a lot of storage space and incur storage and network costs. Moreover, using Dataflow to
create the augmentations requires writing and deploying a separate Dataflow pipeline, which can be
tedious and time-consuming.
Option D: Using Dataflow to create the augmentations dynamically per training run, and stage them
as TFRecords introduces additional complexity and latency. Dataflow is a fully managed service that
runs Apache Beam pipelines for data processing and transformation. However, using Dataflow to
create the augmentations dynamically per training run requires running a Dataflow pipeline every
time the model is trained, which can introduce latency and delay the training process. Moreover,
using Dataflow to create the augmentations requires writing and deploying a separate Dataflow
pipeline, which can be tedious and time-consuming.
Reference:
[tf.data: Build TensorFlow input pipelines]
[Image augmentation | TensorFlow Core]
[Dataflow documentation]
Question # 22
You created an ML pipeline with multiple input parameters. You want to investigate the tradeoffs between different parameter combinations. The parameter options are:
Input dataset
Max tree depth of the boosted tree regressor
Optimizer learning rate
You need to compare the pipeline performance of the different parameter combinations measured in F1 score, time to train, and model complexity. You want your approach to be reproducible and track all pipeline runs on the same platform. What should you do?
A. 1. Use BigQuery ML to create a boosted tree regressor and use the hyperparameter tuning capability. 2. Configure the hyperparameter syntax to select different input datasets, max tree depths, and optimizer learning rates. Choose the grid search option.
B. 1. Create a Vertex AI pipeline with a custom model training job as part of the pipeline. Configure the pipeline's parameters to include those you are investigating. 2. In the custom training step, use the Bayesian optimization method with F1 score as the target to maximize.
C. 1. Create a Vertex AI Workbench notebook for each of the different input datasets. 2. In each notebook, run different local training jobs with different combinations of the max tree depth and optimizer learning rate parameters. 3. After each notebook finishes, append the results to a BigQuery table.
D. 1. Create an experiment in Vertex AI Experiments. 2. Create a Vertex AI pipeline with a custom model training job as part of the pipeline. Configure the pipeline's parameters to include those you are investigating. 3. Submit multiple runs to the same experiment using different values for the parameters.
Answer: D
Explanation:
The best option for investigating the tradeoffs between different parameter combinations is to
create an experiment in Vertex AI Experiments, create a Vertex AI pipeline with a custom model
training job as part of the pipeline, configure the pipeline's parameters to include those you are
investigating, and submit multiple runs to the same experiment using different values for the
parameters. This option allows you to leverage the power and flexibility of Google Cloud to compare
the pipeline performance of the different parameter combinations measured in F1 score, time to
train, and model complexity. Vertex AI Experiments is a service that can track and compare the
results of multiple machine learning runs. Vertex AI Experiments can record the metrics, parameters,
and artifacts of each run, and display them in a dashboard for easy visualization and analysis. Vertex
AI Experiments can also help users optimize the hyperparameters of their models by using different
search algorithms, such as grid search, random search, or Bayesian optimization1. Vertex AI Pipelines
is a service that can orchestrate machine learning workflows using Vertex AI. Vertex AI Pipelines can
run preprocessing and training steps on custom Docker images, and evaluate, deploy, and monitor
the machine learning model. A custom model training job is a type of pipeline step that can train a
custom model by using a user-provided script or container. A custom model training job can accept
pipeline parameters as inputs, which can be used to control the training logic or data source. By
creating an experiment in Vertex AI Experiments, creating a Vertex AI pipeline with a custom model
training job as part of the pipeline, configuring the pipeline's parameters to include those you are
investigating, and submitting multiple runs to the same experiment using different values for the
parameters, you can create a reproducible and trackable approach to investigate the tradeoffs
between different parameter combinations.
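A rough sketch of option D with the Vertex AI SDK for Python is shown below; the experiment name, pipeline template, and parameter values are placeholders, and associating a PipelineJob with an experiment at submission time assumes a recent version of the SDK.

```python
from google.cloud import aiplatform

EXPERIMENT = "boosted-tree-tradeoffs"  # hypothetical names throughout
aiplatform.init(project="my-project", location="us-central1", experiment=EXPERIMENT)

param_combinations = [
    {"input_dataset": "gs://my-bucket/sales_2022.csv", "max_tree_depth": 6,  "learning_rate": 0.10},
    {"input_dataset": "gs://my-bucket/sales_2023.csv", "max_tree_depth": 10, "learning_rate": 0.01},
]

for i, params in enumerate(param_combinations):
    job = aiplatform.PipelineJob(
        display_name=f"tradeoff-run-{i}",
        template_path="training_pipeline.json",  # compiled pipeline with a custom training step
        parameter_values=params,
    )
    # Associating each run with the experiment records its parameters and the metrics
    # (F1 score, training time, model complexity) logged by the pipeline for comparison.
    job.submit(experiment=EXPERIMENT)
```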
The other options are not as good as option D, for the following reasons:
Option A: Using BigQuery ML to create a boosted tree regressor and use the hyperparameter tuning
capability, configuring the hyperparameter syntax to select different input datasets, max tree depths,
and optimizer learning rates, and choosing the grid search option would not be able to handle
different input datasets as a hyperparameter, and would not be as flexible and scalable as using
Vertex AI Experiments and Vertex AI Pipelines. BigQuery ML is a service that can create and train
machine learning models by using SQL queries on BigQuery. BigQuery ML can perform
hyperparameter tuning by setting the NUM_TRIALS option in the CREATE MODEL statement, together
with search spaces for individual hyperparameters. BigQuery ML can also use different search algorithms, such as grid search,
random search, or Bayesian optimization, to find the optimal hyperparameters. However, BigQuery
ML can only tune the hyperparameters that are related to the model architecture or training process,
such as max tree depth or learning rate. BigQuery ML cannot tune the hyperparameters that are
related to the data source, such as input dataset. Moreover, BigQuery ML is not designed to work
with Vertex AI Experiments or Vertex AI Pipelines, which can provide more features and flexibility for
tracking and orchestrating machine learning workflows2.
Option B: Creating a Vertex AI pipeline with a custom model training job as part of the pipeline,
configuring the pipeline's parameters to include those you are investigating, and using the Bayesian
optimization method with F1 score as the target to maximize in the custom training step would not
be able to track and compare the results of multiple runs, and would require more skills and steps
than using Vertex AI Experiments and Vertex AI Pipelines. Vertex AI Pipelines is a service that can
orchestrate machine learning workflows using Vertex AI. Vertex AI Pipelines can run preprocessing
and training steps on custom Docker images, and evaluate, deploy, and monitor the machine
learning model. A custom model training job is a type of pipeline step that can train a custom model
by using a user-provided script or container. A custom model training job can accept pipeline
parameters as inputs, which can be used to control the training logic or data source. However, using
the Bayesian optimization method with F1 score as the target to maximize in the custom training step
would require writing code, implementing the optimization algorithm, and defining the objective
function. Moreover, this option would not be able to track and compare the results of multiple runs,
as Vertex AI Pipelines does not have a built-in feature for recording and displaying the metrics,
parameters, and artifacts of each run3.
Option C: Creating a Vertex AI Workbench notebook for each of the different input datasets, running
different local training jobs with different combinations of the max tree depth and optimizer learning
rate parameters, and appending the results to a BigQuery table would not be able to track and
compare the results of multiple runs on the same platform, and would require more skills and steps
than using Vertex AI Experiments and Vertex AI Pipelines. Vertex AI Workbench is a service that
provides an integrated development environment for data science and machine learning. Vertex AI
Workbench allows users to create and run Jupyter notebooks on Google Cloud, and access various
tools and libraries for data analysis and machine learning. However, creating a Vertex AI Workbench
notebook for each of the different input datasets, running different local training jobs with different
combinations of the max tree depth and optimizer learning rate parameters, and appending the
results to a BigQuery table would require creating multiple notebooks, writing code, setting up local
environments, connecting to BigQuery, loading and preprocessing the data, training and evaluating
the model, and writing the results to a BigQuery table. Moreover, this option would not be able to
track and compare the results of multiple runs on the same platform, as BigQuery is a separate
service from Vertex AI Workbench, and does not have a dashboard for visualizing and analyzing the
metrics, parameters, and artifacts of each run4.
Reference:
Preparing for Google Cloud Certification: Machine Learning Engineer, Course 3: Production ML
Systems, Week 3: MLOps
Google Cloud Professional Machine Learning Engineer Exam Guide, Section 1: Architecting low-code
ML solutions, 1.1 Developing ML models by using BigQuery ML
Official Google Cloud Certified Professional Machine Learning Engineer Study Guide, Chapter 3: Data
Engineering for ML, Section 3.2: BigQuery for ML
Vertex AI Experiments
Vertex AI Pipelines
BigQuery ML
Vertex AI Workbench
Question # 23
You are the Director of Data Science at a large company, and your Data Science team has recently
begun using the Kubeflow Pipelines SDK to orchestrate their training pipelines. Your team is
struggling to integrate their custom Python code into the Kubeflow Pipelines SDK. How should you
instruct them to proceed in order to quickly integrate their code with the Kubeflow Pipelines SDK?
A. Use the func_to_container_op function to create custom components from the Python code.
B. Use the predefined components available in the Kubeflow Pipelines SDK to access Dataproc, and run the custom code there.
C. Package the custom Python code into Docker containers, and use the load_component_from_file function to import the containers into the pipeline.
D. Deploy the custom Python code to Cloud Functions, and use Kubeflow Pipelines to trigger the Cloud Function.
Answer: A
Explanation:
The easiest way to integrate custom Python code into the Kubeflow Pipelines SDK is to use the
func_to_container_op function, which converts a Python function into a pipeline component. This
function automatically builds a Docker image that executes the Python function, and returns a
factory function that can be used to create kfp.dsl.ContainerOp instances for the pipeline. This option
has the following benefits:
It allows the data science team to reuse their existing Python code without rewriting it or packaging
it into containers manually.
It simplifies the component specification and implementation, as the function signature defines the
component interface and the function body defines the component logic.
It supports various types of inputs and outputs, such as primitive types, files, directories, and
dictionaries.
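A minimal sketch of option A with the Kubeflow Pipelines v1 SDK is shown below; the normalize_text function stands in for the team's custom Python code and is purely illustrative.

```python
from kfp import dsl
from kfp.components import func_to_container_op


def normalize_text(text: str) -> str:
    """Stand-in for the team's existing custom Python logic."""
    return text.strip().lower()


# Wrap the plain Python function as a reusable pipeline component.
normalize_op = func_to_container_op(normalize_text, base_image="python:3.9")


@dsl.pipeline(name="custom-python-pipeline",
              description="Runs custom Python code as a pipeline component.")
def pipeline(raw_text: str = "  Hello Kubeflow  "):
    normalized = normalize_op(raw_text)
    # Downstream components can consume normalized.output as an input.
```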
The other options are less optimal for the following reasons:
Option B: Using the predefined components available in the Kubeflow Pipelines SDK to access
Dataproc, and run the custom code there, introduces additional complexity and cost. This option
requires creating and managing Dataproc clusters, which are ephemeral and scalable clusters of
Compute Engine instances that run Apache Spark and Apache Hadoop. Moreover, this option
requires writing the custom code in PySpark or Hadoop MapReduce, which may not be compatible
with the existing Python code.
Option C: Packaging the custom Python code into Docker containers, and using the
load_component_from_file function to import the containers into the pipeline, introduces additional
steps and overhead. This option requires creating and maintaining Dockerfiles, building and pushing
Docker images, and writing component specifications in YAML files. Moreover, this option requires
managing the dependencies and versions of the Python code and the Docker images.
Option D: Deploying the custom Python code to Cloud Functions, and using Kubeflow Pipelines to
trigger the Cloud Function, introduces additional latency and limitations. This option requires
creating and deploying Cloud Functions, which are serverless functions that execute in response to
events. Moreover, this option requires invoking the Cloud Functions from the Kubeflow Pipelines
using HTTP requests, which can incur network overhead and latency. Additionally, this option is
subject to the quotas and limits of Cloud Functions, such as the maximum execution time and
memory usage.
Reference:
Building Python function-based components | Kubeflow
Building Python Function-based Components | Kubeflow
Question # 24
You received a training-serving skew alert from a Vertex AI Model Monitoring job running in production. You retrained the model with more recent training data, and deployed it back to the Vertex AI endpoint, but you are still receiving the same alert. What should you do?
A. Update the model monitoring job to use a lower sampling rate.
B. Update the model monitoring job to use the more recent training data that was used to retrain the model.
C. Temporarily disable the alert. Enable the alert again after a sufficient amount of new production traffic has passed through the Vertex AI endpoint.
D. Temporarily disable the alert until the model can be retrained again on newer training data. Retrain the model again after a sufficient amount of new production traffic has passed through the Vertex AI endpoint.
Answer: B
Explanation:
The best option for resolving the training-serving skew alert is to update the model monitoring job to
use the more recent training data that was used to retrain the model. This option can help align the
baseline distribution of the model monitoring job with the current distribution of the production
data, and eliminate the false positive alerts. Model Monitoring is a service that can track and
compare the results of multiple machine learning runs. Model Monitoring can monitor the model's
prediction input data for feature skew and drift. Training-serving skew occurs when the feature data
distribution in production deviates from the feature data distribution used to train the model. If the
original training data is available, you can enable skew detection to monitor your models for training-serving skew. Model Monitoring uses TensorFlow Data Validation (TFDV) to calculate the
distributions and distance scores for each feature, and compares them with a baseline distribution.
The baseline distribution is the statistical distribution of the feature's values in the training data. If
the distance score for a feature exceeds an alerting threshold that you set, Model Monitoring sends
you an email alert. However, if you retrain the model with more recent training data, and deploy it
back to the Vertex AI endpoint, the baseline distribution of the model monitoring job may become
outdated and inconsistent with the current distribution of the production data. This can cause the
model monitoring job to generate false positive alerts, even if the model performance is not
deteriorated. To avoid this problem, you need to update the model monitoring job to use the more
recent training data that was used to retrain the model. This can help the model monitoring job to
recalculate the baseline distribution and the distance scores, and compare them with the current
distribution of the production data. This can also help the model monitoring job to detect any true
positive alerts, such as a sudden change in the production data that causes the model performance
to degrade1.
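A rough sketch of what option B could look like with the Vertex AI SDK's model monitoring helpers is shown below; the dataset, endpoint, feature names, and thresholds are placeholders, and the exact configuration classes and arguments may differ between SDK versions.

```python
from google.cloud import aiplatform
from google.cloud.aiplatform import model_monitoring

aiplatform.init(project="my-project", location="us-central1")  # hypothetical project/region

# Point skew detection at the dataset that was actually used to retrain the model,
# so the monitoring baseline matches the currently deployed model version.
skew_config = model_monitoring.SkewDetectionConfig(
    data_source="bq://my-project.sales.training_data_recent",  # the recent training data
    target_field="label",
    skew_thresholds={"feature_a": 0.01, "feature_b": 0.01},
)

objective_config = model_monitoring.ObjectiveConfig(skew_detection_config=skew_config)

aiplatform.ModelDeploymentMonitoringJob.create(
    display_name="monitoring-refreshed-baseline",
    endpoint="projects/123/locations/us-central1/endpoints/456",  # placeholder endpoint
    objective_configs=objective_config,
    logging_sampling_strategy=model_monitoring.RandomSampleConfig(sample_rate=0.8),
    schedule_config=model_monitoring.ScheduleConfig(monitor_interval=1),  # interval in hours
    alert_config=model_monitoring.EmailAlertConfig(user_emails=["ml-team@example.com"]),
)
```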
The other options are not as good as option B, for the following reasons:
Option A: Updating the model monitoring job to use a lower sampling rate would not resolve the
training-serving skew alert, and could reduce the accuracy and reliability of the model monitoring
job. The sampling rate is a parameter that determines the percentage of prediction requests that are
logged and analyzed by the model monitoring job. Using a lower sampling rate can reduce the
storage and computation costs of the model monitoring job, but also the quality and validity of the
data. Using a lower sampling rate can introduce sampling bias and noise into the data, and make the
model monitoring job miss some important features or patterns of the data. Moreover, using a lower
sampling rate would not address the root cause of the training-serving skew alert, which is the
mismatch between the baseline distribution and the current distribution of the production data2.
Option C: Temporarily disabling the alert, and enabling the alert again after a sufficient amount of
new production traffic has passed through the Vertex AI endpoint, would not resolve the training-serving
skew alert, and could expose the model to potential risks and errors. Disabling the alert
would stop the model monitoring job from sending email notifications when the distance score for a
feature exceeds the alerting threshold, but it would not stop the model monitoring job from
calculating and comparing the distributions and distance scores. Therefore, disabling the alert would
not address the root cause of the training-serving skew alert, which is the mismatch between the
baseline distribution and the current distribution of the production data. Moreover, disabling the
alert would prevent the model monitoring job from detecting any true positive alerts, such as a
sudden change in the production data that causes the model performance to degrade. This can
expose the model to potential risks and errors, and affect the user satisfaction and trust1.
Option D: Temporarily disabling the alert until the model can be retrained again on newer training
data, and retraining the model again after a sufficient amount of new production traffic has passed
through the Vertex AI endpoint, would not resolve the training-serving skew alert, and could cause
unnecessary costs and efforts. Disabling the alert would stop the model monitoring job from sending
email notifications when the distance score for a feature exceeds the alerting threshold, but it would
not stop the model monitoring job from calculating and comparing the distributions and distance
scores. Therefore, disabling the alert would not address the root cause of the training-serving skew
alert, which is the mismatch between the baseline distribution and the current distribution of the
production data. Moreover, disabling the alert would prevent the model monitoring job from
detecting any true positive alerts, such as a sudden change in the production data that causes the
model performance to degrade. This can expose the model to potential risks and errors, and affect
the user satisfaction and trust. Retraining the model again on newer training data would create a
new model version, but it would not update the model monitoring job to use the newer training data
as the baseline distribution. Therefore, retraining the model again on newer training data would not
resolve the training-serving skew alert, and could cause unnecessary costs and efforts1.
Reference:
Preparing for Google Cloud Certification: Machine Learning Engineer, Course 3: Production ML
Systems, Week 4: Evaluation
Google Cloud Professional Machine Learning Engineer Exam Guide, Section 3: Scaling ML models in
production, 3.3 Monitoring ML models in production
Official Google Cloud Certified Professional Machine Learning Engineer Study Guide, Chapter 6:
Production ML Systems, Section 6.3: Monitoring ML Models
Using Model Monitoring
Understanding the score threshold slider
Sampling rate
Question # 25
You have recently created a proof-of-concept (POC) deep learning model. You are satisfied with the
overall architecture, but you need to determine the value for a couple of hyperparameters. You want
to perform hyperparameter tuning on Vertex AI to determine both the appropriate embedding
dimension for a categorical feature used by your model and the optimal learning rate. You configure
the following settings:
For the embedding dimension, you set the type to INTEGER with a minValue of 16 and maxValue of
64.
For the learning rate, you set the type to DOUBLE with a minValue of 10e-05 and a maxValue of 10e-02.
You are using the default Bayesian optimization tuning algorithm, and you want to maximize model
accuracy. Training time is not a concern. How should you set the hyperparameter scaling for each
hyperparameter and the maxParallelTrials?
A. Use UNIT_LINEAR_SCALE for the embedding dimension, UNIT_LOG_SCALE for the learning rate, and a large number of parallel trials.
B. Use UNIT_LINEAR_SCALE for the embedding dimension, UNIT_LOG_SCALE for the learning rate, and a small number of parallel trials.
C. Use UNIT_LOG_SCALE for the embedding dimension, UNIT_LINEAR_SCALE for the learning rate, and a large number of parallel trials.
D. Use UNIT_LOG_SCALE for the embedding dimension, UNIT_LINEAR_SCALE for the learning rate, and a small number of parallel trials.
Answer: A
Explanation:
The best option for performing hyperparameter tuning on Vertex AI to determine the appropriate
embedding dimension and the optimal learning rate is to use UNIT_LINEAR_SCALE for the
embedding dimension, UNIT_LOG_SCALE for the learning rate, and a large number of parallel trials.
This option has the following advantages:
It matches the appropriate scaling type to each hyperparameter's range and distribution. The embedding dimension is an integer hyperparameter with a narrow linear range from 16 to 64, so UNIT_LINEAR_SCALE is a natural fit. The learning rate is a double hyperparameter that spans several orders of magnitude, from 10e-05 to 10e-02, so searching it on a log scale with UNIT_LOG_SCALE is more suitable.
It maximizes exploration of the hyperparameter space by running a large number of parallel trials. Since training time is not a concern, more trials can help find the combination of hyperparameters that maximizes model accuracy, and the default Bayesian optimization tuning algorithm can efficiently sample the search space and converge to good values.
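As an illustration of these settings, a hedged sketch of such a tuning job with the Vertex AI Python SDK follows. The project, staging bucket, container image, hyperparameter argument names, and trial counts are placeholder assumptions rather than values given in the question.

```python
# Sketch only: assumes google-cloud-aiplatform is installed and a training container
# that accepts --embedding_dim and --learning_rate and reports an "accuracy" metric.
from google.cloud import aiplatform
from google.cloud.aiplatform import hyperparameter_tuning as hpt

aiplatform.init(
    project="my-project",             # placeholder project
    location="us-central1",           # placeholder region
    staging_bucket="gs://my-bucket",  # placeholder staging bucket
)

# One training trial; the image URI is a placeholder.
custom_job = aiplatform.CustomJob(
    display_name="poc-trial",
    worker_pool_specs=[{
        "machine_spec": {"machine_type": "n1-standard-8"},
        "replica_count": 1,
        "container_spec": {"image_uri": "gcr.io/my-project/trainer:latest"},
    }],
)

hp_job = aiplatform.HyperparameterTuningJob(
    display_name="poc-hp-tuning",
    custom_job=custom_job,
    metric_spec={"accuracy": "maximize"},
    parameter_spec={
        # Integer range 16-64 searched on a linear scale.
        "embedding_dim": hpt.IntegerParameterSpec(min=16, max=64, scale="linear"),
        # Learning rate searched on a log scale across several orders of magnitude.
        "learning_rate": hpt.DoubleParameterSpec(min=1e-5, max=1e-2, scale="log"),
    },
    max_trial_count=64,       # placeholder; training time is not a concern
    parallel_trial_count=16,  # a large number of parallel trials, per the answer
)
hp_job.run()
```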
The other options are less optimal for the following reasons:
Option B: Using UNIT_LINEAR_SCALE for the embedding dimension, UNIT_LOG_SCALE for the learning rate, and a small number of parallel trials matches the appropriate scaling types but reduces the exploration of the hyperparameter space. Since training time is not a concern, using fewer
trials can miss some potentially good combinations of hyperparameters that maximize model
accuracy. The default Bayesian optimization tuning algorithm can benefit from more trials to sample
the hyperparameter space and converge to the optimal values.
Option C: Using UNIT_LOG_SCALE for the embedding dimension, UNIT_LINEAR_SCALE for the
learning rate, and a large number of parallel trials, mismatches the appropriate scaling type for each
hyperparameter, based on their range and distribution. The embedding dimension is an integer
hyperparameter that varies linearly between 16 and 64, so using UNIT_LOG_SCALE is not suitable.
The learning rate is a double hyperparameter that varies exponentially between 10e-05 and 10e-02,
so using UNIT_LINEAR_SCALE makes less sense.
Option D: Using UNIT_LOG_SCALE for the embedding dimension, UNIT_LINEAR_SCALE for the
learning rate, and a small number of parallel trials, combines the drawbacks of option B and option
C. It mismatches the appropriate scaling type for each hyperparameter and reduces the exploration of the hyperparameter space by using a small number of parallel trials.
Reference:
[Vertex AI: Hyperparameter tuning overview]
[Vertex AI: Configuring the hyperparameter tuning job]
Question # 26
You developed a custom model by using Vertex AI to forecast the sales of your company's products based on historical transactional data. You anticipate changes in the feature distributions and the correlations between the features in the near future. You also expect to receive a large volume of prediction requests. You plan to use Vertex AI Model Monitoring for drift detection and you want to minimize the cost. What should you do?
A. Use the features for monitoring. Set a monitoring-frequency value that is higher than the default.
B. Use the features for monitoring. Set a prediction-sampling-rate value that is closer to 1 than 0.
C. Use the features and the feature attributions for monitoring. Set a monitoring-frequency value that is lower than the default.
D. Use the features and the feature attributions for monitoring. Set a prediction-sampling-rate value that is closer to 0 than 1.
Answer: D
Explanation:
The best option for using Vertex AI Model Monitoring for drift detection and minimizing the cost is to
use the features and the feature attributions for monitoring, and set a prediction-sampling-rate value
that is closer to 0 than 1. This option allows you to leverage the power and flexibility of Google Cloud
to detect feature drift in the input predict requests for custom models, and reduce the storage and
computation costs of the model monitoring job. Vertex AI Model Monitoring is a service that monitors a deployed model's prediction input data for feature skew and drift. Feature drift occurs when the
feature data distribution in production changes over time. If the original training data is not available,
you can enable drift detection to monitor your models for feature drift. Vertex AI Model Monitoring
uses TensorFlow Data Validation (TFDV) to calculate the distributions and distance scores for each
feature, and compares them with a baseline distribution. The baseline distribution is the statistical
distribution of the feature values in the training data. If the training data is not available, the
baseline distribution is calculated from the first 1000 prediction requests that the model receives. If
the distance score for a feature exceeds an alerting threshold that you set, Vertex AI Model
Monitoring sends you an email alert. However, if you use a custom model, you can also enable
feature attribution monitoring, which can provide more insights into the feature drift. Feature
attribution monitoring analyzes the feature attributions, which are the contributions of each feature
to the prediction output. Feature attribution monitoring can help you identify the features that have
the most impact on the model performance, and the features that have the most significant drift
over time. Feature attribution monitoring can also help you understand the relationship between the
features and the prediction output, and the correlation between the features1. The prediction-sampling-rate is a parameter that determines the percentage of prediction requests that are logged
and analyzed by the model monitoring job. Using a lower prediction-sampling-rate can reduce the
storage and computation costs of the model monitoring job, but also the quality and validity of the
data. Using a lower prediction-sampling-rate can introduce sampling bias and noise into the data,
and make the model monitoring job miss some important features or patterns of the data. However,
using a higher prediction-sampling-rate can increase the storage and computation costs of the model
monitoring job, and also the amount of data that needs to be processed and analyzed. Therefore,
there is a trade-off between the prediction-sampling-rate and the cost and accuracy of the model
monitoring job, and the optimal prediction-sampling-rate depends on the business objective and the
data characteristics2. By using the features and the feature attributions for monitoring, and setting a
prediction-sampling-rate value that is closer to 0 than 1, you can use Vertex AI Model Monitoring for
drift detection and minimize the cost.
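A hedged sketch of such a monitoring job, configured with the Vertex AI Python SDK, is shown below. The project, endpoint, feature name, thresholds, sampling rate, and email address are placeholder assumptions; feature attribution monitoring additionally assumes the deployed model has explanations configured.

```python
# Sketch only: assumes google-cloud-aiplatform is installed and the custom model is
# already deployed with explanations configured on a Vertex AI endpoint.
from google.cloud import aiplatform
from google.cloud.aiplatform import model_monitoring

aiplatform.init(project="my-project", location="us-central1")  # placeholder project/region
endpoint = aiplatform.Endpoint(
    "projects/my-project/locations/us-central1/endpoints/1234567890"  # placeholder endpoint
)

# Log only a small fraction of prediction traffic to keep storage and compute costs low.
sampling = model_monitoring.RandomSampleConfig(sample_rate=0.1)  # closer to 0 than 1

# Monitor drift for both the features and the feature attributions.
drift = model_monitoring.DriftDetectionConfig(
    drift_thresholds={"feature_a": 0.3},            # placeholder feature/threshold
    attribute_drift_thresholds={"feature_a": 0.3},  # placeholder attribution threshold
)
objective = model_monitoring.ObjectiveConfig(
    drift_detection_config=drift,
    explanation_config=model_monitoring.ExplanationConfig(),  # enables attribution monitoring
)

job = aiplatform.ModelDeploymentMonitoringJob.create(
    display_name="sales-forecast-monitoring",
    endpoint=endpoint,
    logging_sampling_strategy=sampling,
    schedule_config=model_monitoring.ScheduleConfig(monitor_interval=24),  # hours
    alert_config=model_monitoring.EmailAlertConfig(user_emails=["team@example.com"]),
    objective_configs=objective,
)
```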
The other options are not as good as option D, for the following reasons:
Option A: Using the features for monitoring and setting a monitoring-frequency value that is higher
than the default would not enable feature attribution monitoring, and could increase the cost of the
model monitoring job. The monitoring-frequency is a parameter that determines how often the
model monitoring job analyzes the logged prediction requests and calculates the distributions and
distance scores for each feature. Using a higher monitoring-frequency can increase the frequency
and timeliness of the model monitoring job, but also the computation costs of the model monitoring
job. Moreover, using the features for monitoring would not enable feature attribution monitoring,
which can provide more insights into the feature drift and the model performance1.
Option B: Using the features for monitoring and setting a prediction-sampling-rate value that is
closer to 1 than 0 would not enable feature attribution monitoring, and could increase the cost of the
model monitoring job. The prediction-sampling-rate is a parameter that determines the percentage
of prediction requests that are logged and analyzed by the model monitoring job. Using a higher
prediction-sampling-rate can increase the quality and validity of the data, but also the storage and
computation costs of the model monitoring job. Moreover, using the features for monitoring would
not enable feature attribution monitoring, which can provide more insights into the feature drift and
the model performance12.
Option C: Using the features and the feature attributions for monitoring and setting a monitoring-frequency
value that is lower than the default would enable feature attribution monitoring, but
could reduce the frequency and timeliness of the model monitoring job. The monitoring-frequency is
a parameter that determines how often the model monitoring job analyzes the logged prediction
requests and calculates the distributions and distance scores for each feature. Using a lower
monitoring-frequency can reduce the computation costs of the model monitoring job, but also the
frequency and timeliness of the model monitoring job. This can make the model monitoring job less
responsive and effective in detecting and alerting the feature drift1.
Reference:
Preparing for Google Cloud Certification: Machine Learning Engineer, Course 3: Production ML
Systems, Week 4: Evaluation
Google Cloud Professional Machine Learning Engineer Exam Guide, Section 3: Scaling ML models in
production, 3.3 Monitoring ML models in production
Official Google Cloud Certified Professional Machine Learning Engineer Study Guide, Chapter 6:
Production ML Systems, Section 6.3: Monitoring ML Models
Using Model Monitoring
Understanding the score threshold slider
Question # 27
You work on a data science team at a bank and are creating an ML model to predict loan default risk.
You have collected and cleaned hundreds of millions of records worth of training data in a BigQuery
table, and you now want to develop and compare multiple models on this data using TensorFlow and
Vertex AI. You want to minimize any bottlenecks during the data ingestion stage while considering
scalability. What should you do?
A. Use the BigQuery client library to load data into a dataframe, and use tf.data.Dataset.from_tensor_slices() to read it.
B. Export data to CSV files in Cloud Storage, and use tf.data.TextLineDataset() to read them.
C. Convert the data into TFRecords, and use tf.data.TFRecordDataset() to read them.
D. Use TensorFlow I/O's BigQuery reader to directly read the data.
Answer: D
Explanation:
The best option for developing and comparing multiple models on a large-scale BigQuery table using
TensorFlow and Vertex AI is to use TensorFlow I/O's BigQuery reader to directly read the data. This
option has the following advantages:
It minimizes any bottlenecks during the data ingestion stage, as the BigQuery Reader can stream data
from BigQuery to TensorFlow in parallel and in batches, without loading the entire table into
memory or disk. The reader is built on the BigQuery Storage Read API, so it can select only the required columns and apply row restrictions at read time, reducing the need for additional preprocessing steps in TensorFlow.
It leverages the scalability and performance of BigQuery, as the BigQuery Reader can handle
hundreds of millions of records worth of training data efficiently and reliably. BigQuery is a
serverless, fully managed, and highly scalable data warehouse that can run complex queries over
petabytes of data in seconds.
It simplifies the integration with Vertex AI, as the BigQuery Reader can be used with both custom and
pre-built TensorFlow models on Vertex AI. Vertex AI is a unified platform for machine learning that
provides various tools and features for data ingestion, data labeling, data preprocessing, model
training, model tuning, model deployment, model monitoring, and model explainability.
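As an illustration of this approach, the sketch below streams rows from BigQuery directly into a tf.data pipeline with TensorFlow I/O's BigQuery reader. The project, dataset, table, and column names are placeholder assumptions.

```python
# Sketch only: assumes tensorflow and tensorflow-io are installed; the project,
# dataset, table, and column names below are placeholders.
import tensorflow as tf
from tensorflow_io.bigquery import BigQueryClient

PROJECT = "my-project"     # placeholder GCP project
DATASET = "loans"          # placeholder BigQuery dataset
TABLE = "training_data"    # placeholder BigQuery table

client = BigQueryClient()
read_session = client.read_session(
    parent=f"projects/{PROJECT}",
    project_id=PROJECT,
    dataset_id=DATASET,
    table_id=TABLE,
    selected_fields=["loan_amount", "credit_score", "defaulted"],  # placeholder columns
    output_types=[tf.float64, tf.int64, tf.int64],
    requested_streams=4,  # number of parallel read streams
)

# Stream rows in parallel from BigQuery straight into a tf.data pipeline.
dataset = (
    read_session.parallel_read_rows()
    .map(lambda row: (
        {"loan_amount": row["loan_amount"], "credit_score": row["credit_score"]},
        row["defaulted"],
    ))
    .shuffle(10_000)
    .batch(1024)
    .prefetch(tf.data.AUTOTUNE)
)
```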
The other options are less optimal for the following reasons:
Option A: Using the BigQuery client library to load data into a dataframe, and using
tf.data.Dataset.from_tensor_slices() to read it, introduces memory and performance issues. This
option requires loading the entire BigQuery table into a Pandas dataframe, which can consume a lot
of memory and cause out-of-memory errors. Moreover, using tf.data.Dataset.from_tensor_slices() to
read the dataframe can be slow and inefficient, as it creates one slice per row of the dataframe,
resulting in a large number of small tensors.
Option B: Exporting data to CSV files in Cloud Storage, and using tf.data.TextLineDataset() to read
them, introduces additional steps and complexity. This option requires exporting the BigQuery table
to one or more CSV files in Cloud Storage, which can take a long time and consume a lot of storage
space. Moreover, using tf.data.TextLineDataset() to read the CSV files can be slow and error-prone, as
it requires parsing and decoding each line of text, handling missing values and invalid data, and
applying data transformations and validations.
Option C: Converting the data into TFRecords, and using tf.data.TFRecordDataset() to read them,
introduces additional steps and complexity. This option requires converting the BigQuery table into
one or more TFRecord files, which are binary files that store serialized TensorFlow examples. This can
take a long time and consume a lot of storage space. Moreover, using tf.data.TFRecordDataset() to
read the TFRecord files requires defining and parsing the schema of the TensorFlow examples, which
can be tedious and error-prone.
Reference:
[TensorFlow I/O documentation]
[BigQuery documentation]
[Vertex AI documentation]