Add bigquery_kms_key Dataflow sample #2402
**Merged:** davidcavazos merged 2 commits into `GoogleCloudPlatform:master` from `davidcavazos:dataflow-cmek` on Sep 18, 2019.
# Getting started with Google Cloud Dataflow

[Open in Cloud Shell](https://console.cloud.google.com/cloudshell/editor)

[Apache Beam](https://beam.apache.org/)
is an open source, unified model for defining both batch and streaming data-parallel processing pipelines.
This guide walks you through the steps needed to run an Apache Beam pipeline on the
[Google Cloud Dataflow](https://cloud.google.com/dataflow) runner.
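To make the model concrete, here is a minimal sketch of a Beam pipeline (the element values are illustrative only; it runs locally on the `DirectRunner` and needs no Google Cloud resources):

```python
import apache_beam as beam

# A minimal pipeline: create a few elements, transform them, and print them.
# With no options, Beam runs this locally on the DirectRunner.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | 'Create words' >> beam.Create(['Hello', 'World!'])
        | 'Pair with length' >> beam.Map(lambda word: (word, len(word)))
        | 'Print' >> beam.Map(print)
    )
```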
## Setting up your Google Cloud project

The following instructions help you prepare your Google Cloud project.

1. Install the [Cloud SDK](https://cloud.google.com/sdk/docs/).
   > *Note:* This is not required in
   > [Cloud Shell](https://console.cloud.google.com/cloudshell/editor)
   > since it already has the Cloud SDK pre-installed.

1. Create a new Google Cloud project via the
   [*New Project* page](https://console.cloud.google.com/projectcreate),
   or via the `gcloud` command line tool.

   ```sh
   export PROJECT=your-google-cloud-project-id
   gcloud projects create $PROJECT
   ```
1. Set up the Cloud SDK for your GCP project.

   ```sh
   gcloud init
   ```
1. [Enable billing](https://cloud.google.com/billing/docs/how-to/modify-project).

1. [Enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=dataflow,compute_component,storage_component,storage_api,logging,cloudresourcemanager.googleapis.com,iam.googleapis.com):
   Dataflow, Compute Engine, Cloud Storage, Cloud Storage JSON,
   Stackdriver Logging, Cloud Resource Manager, and IAM API.
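   Alternatively, you can enable them from the command line with `gcloud`.
   A sketch; the service names below are my mapping of the API list above
   (an assumption, not taken from this page):

   ```sh
   # Enable the same set of APIs from the command line.
   gcloud services enable \
       dataflow.googleapis.com \
       compute.googleapis.com \
       storage-component.googleapis.com \
       storage-api.googleapis.com \
       logging.googleapis.com \
       cloudresourcemanager.googleapis.com \
       iam.googleapis.com
   ```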
1. Create a service account JSON key via the
   [*Create service account key* page](https://console.cloud.google.com/apis/credentials/serviceaccountkey),
   or via the `gcloud` command line tool.
   Here is how to do it through the *Create service account key* page.

   * From the **Service account** list, select **New service account**.
   * In the **Service account name** field, enter a name.
   * From the **Role** list, select **Project > Owner** **(*)**.
   * Click **Create**. A JSON file that contains your key downloads to your computer.

   Alternatively, you can use `gcloud` through the command line.

   ```sh
   export PROJECT=$(gcloud config get-value project)
   export SA_NAME=samples
   export IAM_ACCOUNT=$SA_NAME@$PROJECT.iam.gserviceaccount.com

   # Create the service account.
   gcloud iam service-accounts create $SA_NAME --display-name $SA_NAME

   # Set the role to Project Owner (*).
   gcloud projects add-iam-policy-binding $PROJECT \
       --member serviceAccount:$IAM_ACCOUNT \
       --role roles/owner

   # Create a JSON file with the service account credentials.
   gcloud iam service-accounts keys create path/to/your/credentials.json \
       --iam-account=$IAM_ACCOUNT
   ```

   > **(*)** *Note:* The **Role** field authorizes your service account to access resources.
   > You can view and change this field later by using the
   > [GCP Console IAM page](https://console.cloud.google.com/iam-admin/iam).
   > If you are developing a production app, specify more granular permissions than **Project > Owner**.
   > For more information, see
   > [Granting roles to service accounts](https://cloud.google.com/iam/docs/granting-roles-to-service-accounts).

   For more information, see
   [Creating and managing service accounts](https://cloud.google.com/iam/docs/creating-managing-service-accounts).
1. Set your `GOOGLE_APPLICATION_CREDENTIALS` environment variable to point to your service account key file.

   ```sh
   export GOOGLE_APPLICATION_CREDENTIALS=path/to/your/credentials.json
   ```
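   Google client libraries pick this variable up automatically. As a quick
   sanity check, a minimal sketch, assuming the BigQuery client library is
   installed (Apache Beam's `[gcp]` extra pulls it in):

   ```python
   from google.cloud import bigquery

   # The client reads GOOGLE_APPLICATION_CREDENTIALS automatically
   # and infers the project from the credentials.
   client = bigquery.Client()
   print(client.project)
   ```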
## Setting up a Python development environment

For instructions on how to install Python, virtualenv, and the Cloud SDK, see the
[Setting up a Python development environment](https://cloud.google.com/python/setup)
guide.
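As a quick reference, a typical setup looks like the following sketch
(assuming Python and `pip` are already installed; see the guide above for
the authoritative steps):

```sh
# Install virtualenv, then create and activate an isolated environment.
pip install --user virtualenv
virtualenv env
source env/bin/activate

# Install the Apache Beam SDK with the GCP extra.
pip install 'apache-beam[gcp]'
```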
# Using customer-managed encryption keys

[Open in Cloud Shell](https://console.cloud.google.com/cloudshell/editor)

This sample demonstrates how to use
[customer-managed encryption keys](https://cloud.google.com/kms/)
for the I/O connectors in an
[Apache Beam](https://beam.apache.org) pipeline.
For more information, see the
[Using customer-managed encryption keys](https://cloud.google.com/dataflow/docs/guides/customer-managed-encryption-keys)
docs page.
## Before you begin

Follow the
[Getting started with Google Cloud Dataflow](../README.md)
page, and make sure you have a Google Cloud project with billing enabled
and a *service account JSON key* set up in your `GOOGLE_APPLICATION_CREDENTIALS` environment variable.
Additionally, for this sample you need the following:

1. [Enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=bigquery,cloudkms.googleapis.com):
   BigQuery and Cloud KMS API.

1. Create a Cloud Storage bucket.

   ```sh
   export BUCKET=your-gcs-bucket
   gsutil mb gs://$BUCKET
   ```
1. [Create a symmetric key ring](https://cloud.google.com/kms/docs/creating-keys).
   For best results, use a [regional location](https://cloud.google.com/kms/docs/locations).
   This example uses a `global` key for simplicity.

   ```sh
   export KMS_KEYRING=samples-keyring
   export KMS_KEY=samples-key

   # Create a key ring.
   gcloud kms keyrings create $KMS_KEYRING --location global

   # Create a key.
   gcloud kms keys create $KMS_KEY --location global \
       --keyring $KMS_KEYRING --purpose encryption
   ```
   > *Note:* Although you can destroy the
   > [*key version material*](https://cloud.google.com/kms/docs/destroy-restore),
   > you [cannot delete keys and key rings](https://cloud.google.com/kms/docs/object-hierarchy#lifetime).
   > Key rings and keys do not have billable costs or quota limitations,
   > so their continued existence does not impact costs or production limits.
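   If you later want to render the key unusable, you can destroy its version
   material. A sketch, assuming the key has a single version `1` (list the
   versions first to confirm):

   ```sh
   # List the key's versions.
   gcloud kms keys versions list --location global \
       --keyring $KMS_KEYRING --key $KMS_KEY

   # Destroy the key version material (version 1 assumed here).
   gcloud kms keys versions destroy 1 --location global \
       --keyring $KMS_KEYRING --key $KMS_KEY
   ```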
1. Grant Encrypter/Decrypter permissions to the *Dataflow*, *Compute Engine*, and *BigQuery*
   [service accounts](https://cloud.google.com/iam/docs/service-accounts).
   This grants your Dataflow, Compute Engine, and BigQuery service accounts
   permission to encrypt and decrypt with the CMEK you specify.
   The Dataflow workers use these service accounts when running the pipeline,
   which is different from the *user* service account used to start the pipeline.

   ```sh
   export PROJECT=$(gcloud config get-value project)
   export PROJECT_NUMBER=$(gcloud projects list --filter $PROJECT --format "value(PROJECT_NUMBER)")

   # Grant Encrypter/Decrypter permissions to the Dataflow service account.
   gcloud projects add-iam-policy-binding $PROJECT \
       --member serviceAccount:service-$PROJECT_NUMBER@dataflow-service-producer-prod.iam.gserviceaccount.com \
       --role roles/cloudkms.cryptoKeyEncrypterDecrypter

   # Grant Encrypter/Decrypter permissions to the Compute Engine service account.
   gcloud projects add-iam-policy-binding $PROJECT \
       --member serviceAccount:service-$PROJECT_NUMBER@compute-system.iam.gserviceaccount.com \
       --role roles/cloudkms.cryptoKeyEncrypterDecrypter

   # Grant Encrypter/Decrypter permissions to the BigQuery service account.
   gcloud projects add-iam-policy-binding $PROJECT \
       --member serviceAccount:bq-$PROJECT_NUMBER@bigquery-encryption.iam.gserviceaccount.com \
       --role roles/cloudkms.cryptoKeyEncrypterDecrypter
   ```
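   If the `--filter` lookup above matches more than one project, an
   equivalent and more direct sketch for getting the project number is:

   ```sh
   # Look up the project number for exactly this project.
   export PROJECT_NUMBER=$(gcloud projects describe $PROJECT --format "value(projectNumber)")
   ```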
1. Clone the `python-docs-samples` repository.

   ```sh
   git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git
   ```

1. Navigate to the sample code directory.

   ```sh
   cd python-docs-samples/dataflow/encryption-keys
   ```

1. Create a virtual environment and activate it.

   ```sh
   virtualenv env
   source env/bin/activate
   ```

   > Once you are done, you can deactivate the virtualenv and go back to your
   > global Python environment by running `deactivate`.

1. Install the sample requirements.

   ```sh
   pip install -U -r requirements.txt
   ```
## BigQuery KMS Key example

* [bigquery_kms_key.py](bigquery_kms_key.py)

The following sample gets some data from the
[NASA wildfires public BigQuery dataset](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=nasa_wildfire&t=past_week&page=table)
using a customer-managed encryption key, and dumps that data into the specified `output_bigquery_table`
using the same customer-managed encryption key.

Make sure you have the following variables set up:
```sh
# Set the project ID, GCS bucket and KMS key.
export PROJECT=$(gcloud config get-value project)
export BUCKET=your-gcs-bucket

# Set the region for the Dataflow job.
# https://cloud.google.com/compute/docs/regions-zones/
export REGION=us-central1

# Set the KMS key ID.
export KMS_KEYRING=samples-keyring
export KMS_KEY=samples-key
export KMS_KEY_ID=$(gcloud kms keys list --location global --keyring $KMS_KEYRING --filter $KMS_KEY --format "value(NAME)")

# Output BigQuery dataset and table name.
export DATASET=samples
export TABLE=dataflow_kms
```
Create the BigQuery dataset where the output table will reside.

```sh
# Create the BigQuery dataset.
bq mk --dataset $PROJECT:$DATASET
```
To run the sample using the Dataflow runner:

```sh
python bigquery_kms_key.py \
    --output_bigquery_table $PROJECT:$DATASET.$TABLE \
    --kms_key $KMS_KEY_ID \
    --project $PROJECT \
    --runner DataflowRunner \
    --temp_location gs://$BUCKET/samples/dataflow/kms/tmp \
    --region $REGION
```
> *Note:* To run locally, omit the `--runner` command line argument; it then defaults to the `DirectRunner`.
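For example, a local run would look like the following sketch (it keeps the
`--temp_location` flag because the BigQuery I/O connector may still stage
files there; that detail is an assumption on my part):

```sh
# Run locally on the DirectRunner (no --runner flag).
python bigquery_kms_key.py \
    --output_bigquery_table $PROJECT:$DATASET.$TABLE \
    --kms_key $KMS_KEY_ID \
    --project $PROJECT \
    --temp_location gs://$BUCKET/samples/dataflow/kms/tmp
```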
You can check your submitted Cloud Dataflow jobs in the
[GCP Console Dataflow page](https://console.cloud.google.com/dataflow) or by using `gcloud`.

```sh
gcloud dataflow jobs list
```
Finally, check the contents of the BigQuery table.

```sh
bq query --use_legacy_sql=false "SELECT * FROM \`$PROJECT.$DATASET.$TABLE\`"
```
## Cleanup

To avoid incurring charges to your GCP account for the resources used:

```sh
# Remove only the files created by this sample.
gsutil -m rm -rf "gs://$BUCKET/samples/dataflow/kms"

# [optional] Remove the Cloud Storage bucket.
gsutil rb gs://$BUCKET

# Remove the BigQuery table.
bq rm -f -t $PROJECT:$DATASET.$TABLE

# [optional] Remove the BigQuery dataset and all its tables.
bq rm -rf -d $PROJECT:$DATASET

# Revoke Encrypter/Decrypter permissions from the Dataflow service account.
gcloud projects remove-iam-policy-binding $PROJECT \
    --member serviceAccount:service-$PROJECT_NUMBER@dataflow-service-producer-prod.iam.gserviceaccount.com \
    --role roles/cloudkms.cryptoKeyEncrypterDecrypter

# Revoke Encrypter/Decrypter permissions from the Compute Engine service account.
gcloud projects remove-iam-policy-binding $PROJECT \
    --member serviceAccount:service-$PROJECT_NUMBER@compute-system.iam.gserviceaccount.com \
    --role roles/cloudkms.cryptoKeyEncrypterDecrypter

# Revoke Encrypter/Decrypter permissions from the BigQuery service account.
gcloud projects remove-iam-policy-binding $PROJECT \
    --member serviceAccount:bq-$PROJECT_NUMBER@bigquery-encryption.iam.gserviceaccount.com \
    --role roles/cloudkms.cryptoKeyEncrypterDecrypter
```
#!/usr/bin/env python
#
# Copyright 2019 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse


def run(output_bigquery_table, kms_key, beam_args):
    # [START dataflow_cmek]
    import apache_beam as beam

    # output_bigquery_table = '<project>:<dataset>.<table>'
    # kms_key = 'projects/<project>/locations/<kms-location>/keyRings/<kms-keyring>/cryptoKeys/<kms-key>'  # noqa
    # beam_args = [
    #     '--project', 'your-project-id',
    #     '--runner', 'DataflowRunner',
    #     '--temp_location', 'gs://your-bucket/samples/dataflow/kms/tmp',
    #     '--region', 'us-central1',
    # ]

    # Query from the NASA wildfires public dataset:
    # https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=nasa_wildfire&t=past_week&page=table
    query = """
    SELECT latitude,longitude,acq_date,acq_time,bright_ti4,confidence
    FROM `bigquery-public-data.nasa_wildfire.past_week`
    LIMIT 10
    """

    # Schema for the output BigQuery table.
    schema = {
        'fields': [
            {'name': 'latitude', 'type': 'FLOAT'},
            {'name': 'longitude', 'type': 'FLOAT'},
            {'name': 'acq_date', 'type': 'DATE'},
            {'name': 'acq_time', 'type': 'TIME'},
            {'name': 'bright_ti4', 'type': 'FLOAT'},
            {'name': 'confidence', 'type': 'STRING'},
        ],
    }

    options = beam.options.pipeline_options.PipelineOptions(beam_args)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | 'Read from BigQuery with KMS key' >>
            beam.io.Read(beam.io.BigQuerySource(
                query=query,
                use_standard_sql=True,
                kms_key=kms_key,
            ))
            | 'Write to BigQuery with KMS key' >>
            beam.io.WriteToBigQuery(
                output_bigquery_table,
                schema=schema,
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                kms_key=kms_key,
            )
        )
    # [END dataflow_cmek]

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--kms_key',
        required=True,
        help='Cloud Key Management Service key name',
    )
    parser.add_argument(
        '--output_bigquery_table',
        required=True,
        help="Output BigQuery table in the format 'PROJECT:DATASET.TABLE'",
    )
    args, beam_args = parser.parse_known_args()

    run(args.output_bigquery_table, args.kms_key, beam_args)
apache-beam[gcp]