Add bigquery_kms_key Dataflow sample #2402

Merged · 2 commits · Sep 18, 2019
dataflow/README.md: 90 additions & 0 deletions
@@ -0,0 +1,90 @@
# Getting started with Google Cloud Dataflow

[![Open in Cloud Shell](http://gstatic.com/cloudssh/images/open-btn.svg)](https://console.cloud.google.com/cloudshell/editor)

[Apache Beam](https://beam.apache.org/)
is an open source, unified model for defining both batch and streaming data-parallel processing pipelines.
This guide walks you through the steps needed to run an Apache Beam pipeline using the
[Google Cloud Dataflow](https://cloud.google.com/dataflow) runner.

## Setting up your Google Cloud project

The following instructions help you prepare your Google Cloud project.

1. Install the [Cloud SDK](https://cloud.google.com/sdk/docs/).
> *Note:* This is not required in
> [Cloud Shell](https://console.cloud.google.com/cloudshell/editor)
> since it already has the Cloud SDK pre-installed.

1. Create a new Google Cloud project via the
[*New Project* page](https://console.cloud.google.com/projectcreate),
or via the `gcloud` command line tool.

```sh
export PROJECT=your-google-cloud-project-id
gcloud projects create $PROJECT
```

1. Set up the Cloud SDK with your GCP project.

```sh
gcloud init
```
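
   If you already know which project to use, a non-interactive alternative (a sketch, assuming `PROJECT` is set as in the previous step) is to point the SDK at it directly:

   ```sh
   # Point the Cloud SDK at the project without the interactive prompts.
   gcloud config set project $PROJECT
   ```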

1. [Enable billing](https://cloud.google.com/billing/docs/how-to/modify-project).
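
   Billing can also be linked from the command line with `gcloud` (a sketch; the billing account ID below is a placeholder you must replace with your own):

   ```sh
   # List your billing accounts to find the ID.
   gcloud beta billing accounts list

   # Link the project to a billing account (placeholder ID).
   gcloud beta billing projects link $PROJECT \
     --billing-account=0X0X0X-0X0X0X-0X0X0X
   ```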

1. [Enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=dataflow,compute_component,storage_component,storage_api,logging,cloudresourcemanager.googleapis.com,iam.googleapis.com):
Dataflow, Compute Engine, Cloud Storage, Cloud Storage JSON,
Stackdriver Logging, Cloud Resource Manager, and IAM API.
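
   These APIs can also be enabled from the command line; a sketch, assuming the service names below match the products listed above:

   ```sh
   # Enable the required services for this guide.
   gcloud services enable \
     dataflow.googleapis.com \
     compute.googleapis.com \
     storage-component.googleapis.com \
     storage-api.googleapis.com \
     logging.googleapis.com \
     cloudresourcemanager.googleapis.com \
     iam.googleapis.com
   ```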

1. Create a service account JSON key via the
[*Create service account key* page](https://console.cloud.google.com/apis/credentials/serviceaccountkey),
or via the `gcloud` command line tool.
Here is how to do it through the *Create service account key* page:

* From the **Service account** list, select **New service account**.
* In the **Service account name** field, enter a name.
* From the **Role** list, select **Project > Owner** **(*)**.
* Click **Create**. A JSON file that contains your key downloads to your computer.

Alternatively, you can use `gcloud` through the command line.

```sh
export PROJECT=$(gcloud config get-value project)
export SA_NAME=samples
export IAM_ACCOUNT=$SA_NAME@$PROJECT.iam.gserviceaccount.com

# Create the service account.
gcloud iam service-accounts create $SA_NAME --display-name $SA_NAME

# Set the role to Project Owner (*).
gcloud projects add-iam-policy-binding $PROJECT \
--member serviceAccount:$IAM_ACCOUNT \
--role roles/owner

# Create a JSON file with the service account credentials.
gcloud iam service-accounts keys create path/to/your/credentials.json \
--iam-account=$IAM_ACCOUNT
```

> **(*)** *Note:* The **Role** field authorizes your service account to access resources.
> You can view and change this field later by using the
> [GCP Console IAM page](https://console.cloud.google.com/iam-admin/iam).
> If you are developing a production app, specify more granular permissions than **Project > Owner**.
> For more information, see
> [Granting roles to service accounts](https://cloud.google.com/iam/docs/granting-roles-to-service-accounts).

For more information, see
[Creating and managing service accounts](https://cloud.google.com/iam/docs/creating-managing-service-accounts).
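
   For a tighter setup than **Project > Owner**, one possible combination of narrower roles is sketched below; the role choice is an assumption, and the exact set you need depends on which samples you run:

   ```sh
   # Example of narrower roles instead of roles/owner (role choice is an assumption).
   gcloud projects add-iam-policy-binding $PROJECT \
     --member serviceAccount:$IAM_ACCOUNT \
     --role roles/dataflow.admin

   gcloud projects add-iam-policy-binding $PROJECT \
     --member serviceAccount:$IAM_ACCOUNT \
     --role roles/storage.objectAdmin
   ```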

1. Set your `GOOGLE_APPLICATION_CREDENTIALS` environment variable to point to your service account key file.

```sh
export GOOGLE_APPLICATION_CREDENTIALS=path/to/your/credentials.json
```
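
   To verify that the credentials load correctly, one quick check (a sketch) is to resolve the Application Default Credentials from Python; it assumes the `google-auth` library is installed, which it is once you install `apache-beam[gcp]` for the samples:

   ```sh
   # Print the project resolved from GOOGLE_APPLICATION_CREDENTIALS.
   python -c "import google.auth; creds, project = google.auth.default(); print(project)"
   ```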

## Setting up a Python development environment

For instructions on how to install Python, virtualenv, and the Cloud SDK, see the
[Setting up a Python development environment](https://cloud.google.com/python/setup)
guide.
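
A minimal sketch of that setup, assuming you end up using Python 3 and `virtualenv` (the guide above covers other options):

```sh
# Create and activate an isolated environment for the samples.
python3 -m pip install --upgrade virtualenv
python3 -m virtualenv env
source env/bin/activate
```
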
dataflow/encryption-keys/README.md: 202 additions & 0 deletions
@@ -0,0 +1,202 @@
# Using customer-managed encryption keys

[![Open in Cloud Shell](http://gstatic.com/cloudssh/images/open-btn.svg)](https://console.cloud.google.com/cloudshell/editor)

This sample demonstrates how to use
[customer-managed encryption keys](https://cloud.google.com/kms/)
for the I/O connectors in an
[Apache Beam](https://beam.apache.org) pipeline.
For more information, see the
[Using customer-managed encryption keys](https://cloud.google.com/dataflow/docs/guides/customer-managed-encryption-keys)
docs page.

## Before you begin

Follow the
[Getting started with Google Cloud Dataflow](../README.md)
page, and make sure you have a Google Cloud project with billing enabled
and a *service account JSON key* set up in your `GOOGLE_APPLICATION_CREDENTIALS` environment variable.
Additionally, for this sample you need the following:

1. [Enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=bigquery,cloudkms.googleapis.com):
BigQuery and Cloud KMS API.
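
   A command-line alternative (a sketch, assuming these service names correspond to the two APIs above):

   ```sh
   # Enable the BigQuery and Cloud KMS APIs.
   gcloud services enable bigquery.googleapis.com cloudkms.googleapis.com
   ```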

1. Create a Cloud Storage bucket.

```sh
export BUCKET=your-gcs-bucket
gsutil mb gs://$BUCKET
```

1. [Create a key ring and a symmetric key](https://cloud.google.com/kms/docs/creating-keys).
For best results, use a [regional location](https://cloud.google.com/kms/docs/locations).
This example uses a `global` key for simplicity.

```sh
export KMS_KEYRING=samples-keyring
export KMS_KEY=samples-key

# Create a key ring.
gcloud kms keyrings create $KMS_KEYRING --location global

# Create a key.
gcloud kms keys create $KMS_KEY --location global \
--keyring $KMS_KEYRING --purpose encryption
```

> *Note:* Although you can destroy the
> [*key version material*](https://cloud.google.com/kms/docs/destroy-restore),
> you [cannot delete keys and key rings](https://cloud.google.com/kms/docs/object-hierarchy#lifetime).
> Key rings and keys do not have billable costs or quota limitations,
> so their continued existence does not impact costs or production limits.
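
   If you later want to stop using a key, a sketch of destroying its version material follows; the version number `1` is an assumption, so list the versions first:

   ```sh
   # List the key's versions, then destroy the material of a specific version.
   gcloud kms keys versions list \
     --key $KMS_KEY --keyring $KMS_KEYRING --location global

   gcloud kms keys versions destroy 1 \
     --key $KMS_KEY --keyring $KMS_KEYRING --location global
   ```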

1. Grant Encrypter/Decrypter permissions to the *Dataflow*, *Compute Engine*, and *BigQuery*
[service accounts](https://cloud.google.com/iam/docs/service-accounts).
This grants your Dataflow, Compute Engine, and BigQuery service accounts
permission to encrypt and decrypt with the CMEK you specify.
The Dataflow workers use these service accounts when running the pipeline;
these are different from the *user* service account used to start the pipeline.

```sh
export PROJECT=$(gcloud config get-value project)
export PROJECT_NUMBER=$(gcloud projects list --filter $PROJECT --format "value(PROJECT_NUMBER)")

# Grant Encrypter/Decrypter permissions to the Dataflow service account.
gcloud projects add-iam-policy-binding $PROJECT \
--member serviceAccount:service-$PROJECT_NUMBER@dataflow-service-producer-prod.iam.gserviceaccount.com \
--role roles/cloudkms.cryptoKeyEncrypterDecrypter

# Grant Encrypter/Decrypter permissions to the Compute Engine service account.
gcloud projects add-iam-policy-binding $PROJECT \
--member serviceAccount:service-$PROJECT_NUMBER@compute-system.iam.gserviceaccount.com \
--role roles/cloudkms.cryptoKeyEncrypterDecrypter

# Grant Encrypter/Decrypter permissions to the BigQuery service account.
gcloud projects add-iam-policy-binding $PROJECT \
--member serviceAccount:bq-$PROJECT_NUMBER@bigquery-encryption.iam.gserviceaccount.com \
--role roles/cloudkms.cryptoKeyEncrypterDecrypter
```
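
   To double-check that the bindings took effect, one way (a sketch) is to filter the project's IAM policy for the Encrypter/Decrypter role:

   ```sh
   # List the members that hold the Encrypter/Decrypter role on the project.
   gcloud projects get-iam-policy $PROJECT \
     --flatten="bindings[].members" \
     --filter="bindings.role:roles/cloudkms.cryptoKeyEncrypterDecrypter" \
     --format="value(bindings.members)"
   ```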

1. Clone the `python-docs-samples` repository.

```sh
git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git
```

1. Navigate to the sample code directory.

```sh
cd python-docs-samples/dataflow/encryption-keys
```

1. Create a virtual environment and activate it.

```sh
virtualenv env
source env/bin/activate
```

> Once you are done, you can deactivate the virtualenv and go back to your global Python environment by running `deactivate`.

1. Install the sample requirements.

```sh
pip install -U -r requirements.txt
```

## BigQuery KMS Key example

* [bigquery_kms_key.py](bigquery_kms_key.py)

The following sample reads some data from the
[NASA wildfires public BigQuery dataset](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=nasa_wildfire&t=past_week&page=table)
using a customer-managed encryption key, and writes that data to the specified `output_bigquery_table`
using the same customer-managed encryption key.

Make sure you have the following variables set up:

```sh
# Set the project ID and Cloud Storage bucket.
export PROJECT=$(gcloud config get-value project)
export BUCKET=your-gcs-bucket

# Set the region for the Dataflow job.
# https://cloud.google.com/compute/docs/regions-zones/
export REGION=us-central1

# Set the KMS key ID.
export KMS_KEYRING=samples-keyring
export KMS_KEY=samples-key
export KMS_KEY_ID=$(gcloud kms keys list --location global --keyring $KMS_KEYRING --filter $KMS_KEY --format "value(NAME)")

# Output BigQuery dataset and table name.
export DATASET=samples
export TABLE=dataflow_kms
```

Create the BigQuery dataset where the output table will reside.

```sh
# Create the BigQuery dataset.
bq mk --dataset $PROJECT:$DATASET
```

To run the sample using the Dataflow runner:

```sh
python bigquery_kms_key.py \
--output_bigquery_table $PROJECT:$DATASET.$TABLE \
--kms_key $KMS_KEY_ID \
--project $PROJECT \
--runner DataflowRunner \
--temp_location gs://$BUCKET/samples/dataflow/kms/tmp \
--region $REGION
```

> *Note:* To run locally, you can omit the `--runner` command line argument; it defaults to the `DirectRunner`.
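
For example, a local run might look like the following (a sketch; it keeps the same project and temp location, which the BigQuery I/O still uses):

```sh
# Run the pipeline locally with the DirectRunner (no --runner flag).
python bigquery_kms_key.py \
  --output_bigquery_table $PROJECT:$DATASET.$TABLE \
  --kms_key $KMS_KEY_ID \
  --project $PROJECT \
  --temp_location gs://$BUCKET/samples/dataflow/kms/tmp
```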

You can check your submitted Cloud Dataflow jobs in the
[GCP Console Dataflow page](https://console.cloud.google.com/dataflow) or by using `gcloud`.

```sh
gcloud dataflow jobs list
```
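
The listing can also be narrowed down (a sketch, assuming the `--region` and `--status` flags of `gcloud dataflow jobs list`):

```sh
# Show only active jobs in the region used above.
gcloud dataflow jobs list --region $REGION --status active
```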

Finally, check the contents of the BigQuery table.

```sh
bq query --use_legacy_sql=false "SELECT * FROM \`$PROJECT.$DATASET.$TABLE\`"
```
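
To confirm the table is actually protected by your key, one option (a sketch) is to inspect the table's metadata and look for `encryptionConfiguration.kmsKeyName`:

```sh
# Show the table metadata; the KMS key appears under encryptionConfiguration.
bq show --format=prettyjson $PROJECT:$DATASET.$TABLE
```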

## Cleanup

To avoid incurring charges to your GCP account for the resources used:

```sh
# Remove only the files created by this sample.
gsutil -m rm -rf "gs://$BUCKET/samples/dataflow/kms"

# [optional] Remove the Cloud Storage bucket.
gsutil rb gs://$BUCKET

# Remove the BigQuery table.
bq rm -f -t $PROJECT:$DATASET.$TABLE

# [optional] Remove the BigQuery dataset and all its tables.
bq rm -r -f -d $PROJECT:$DATASET

# Revoke Encrypter/Decrypter permissions from the Dataflow service account.
gcloud projects remove-iam-policy-binding $PROJECT \
--member serviceAccount:service-$PROJECT_NUMBER@dataflow-service-producer-prod.iam.gserviceaccount.com \
--role roles/cloudkms.cryptoKeyEncrypterDecrypter

# Revoke Encrypter/Decrypter permissions from the Compute Engine service account.
gcloud projects remove-iam-policy-binding $PROJECT \
--member serviceAccount:service-$PROJECT_NUMBER@compute-system.iam.gserviceaccount.com \
--role roles/cloudkms.cryptoKeyEncrypterDecrypter

# Revoke Encrypter/Decrypter permissions from the BigQuery service account.
gcloud projects remove-iam-policy-binding $PROJECT \
--member serviceAccount:bq-$PROJECT_NUMBER@bigquery-encryption.iam.gserviceaccount.com \
--role roles/cloudkms.cryptoKeyEncrypterDecrypter
```
dataflow/encryption-keys/bigquery_kms_key.py: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
#!/usr/bin/env python
#
# Copyright 2019 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse


def run(output_bigquery_table, kms_key, beam_args):
# [START dataflow_cmek]
import apache_beam as beam

# output_bigquery_table = '<project>:<dataset>.<table>'
# kms_key = 'projects/<project>/locations/<kms-location>/keyRings/<kms-keyring>/cryptoKeys/<kms-key>' # noqa
# beam_args = [
# '--project', 'your-project-id',
# '--runner', 'DataflowRunner',
# '--temp_location', 'gs://your-bucket/samples/dataflow/kms/tmp',
# '--region', 'us-central1',
# ]

# Query from the NASA wildfires public dataset:
# https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=nasa_wildfire&t=past_week&page=table
query = """
SELECT latitude,longitude,acq_date,acq_time,bright_ti4,confidence
FROM `bigquery-public-data.nasa_wildfire.past_week`
LIMIT 10
"""

# Schema for the output BigQuery table.
schema = {
'fields': [
{'name': 'latitude', 'type': 'FLOAT'},
{'name': 'longitude', 'type': 'FLOAT'},
{'name': 'acq_date', 'type': 'DATE'},
{'name': 'acq_time', 'type': 'TIME'},
{'name': 'bright_ti4', 'type': 'FLOAT'},
{'name': 'confidence', 'type': 'STRING'},
],
}

options = beam.options.pipeline_options.PipelineOptions(beam_args)
with beam.Pipeline(options=options) as pipeline:
(
pipeline
| 'Read from BigQuery with KMS key' >>
beam.io.Read(beam.io.BigQuerySource(
query=query,
use_standard_sql=True,
kms_key=kms_key,
))
| 'Write to BigQuery with KMS key' >>
beam.io.WriteToBigQuery(
output_bigquery_table,
schema=schema,
write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
kms_key=kms_key,
)
)
# [END dataflow_cmek]


if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument(
'--kms_key',
required=True,
help='Cloud Key Management Service key name',
)
parser.add_argument(
'--output_bigquery_table',
required=True,
help="Output BigQuery table in the format 'PROJECT:DATASET.TABLE'",
)
args, beam_args = parser.parse_known_args()

run(args.output_bigquery_table, args.kms_key, beam_args)
dataflow/encryption-keys/requirements.txt: 1 addition & 0 deletions
@@ -0,0 +1 @@
apache-beam[gcp]