Commit db332fd

Add bigquery_kms_key Dataflow sample (GoogleCloudPlatform#2402)

* Add bigquery_kms_key Dataflow sample
* Clarified description on service accounts

1 parent 851525c commit db332fd
File tree

4 files changed: +381 lines, -0 lines
dataflow/README.md

90 additions & 0 deletions
# Getting started with Google Cloud Dataflow

[![Open in Cloud Shell](http://gstatic.com/cloudssh/images/open-btn.svg)](https://console.cloud.google.com/cloudshell/editor)

[Apache Beam](https://beam.apache.org/)
is an open source, unified model for defining both batch and streaming data-parallel processing pipelines.
This page guides you through the steps required to run an Apache Beam pipeline on the
[Google Cloud Dataflow](https://cloud.google.com/dataflow) runner.

## Setting up your Google Cloud project

The following instructions help you prepare your Google Cloud project.

1. Install the [Cloud SDK](https://cloud.google.com/sdk/docs/).
    > *Note:* This is not required in
    > [Cloud Shell](https://console.cloud.google.com/cloudshell/editor),
    > since Cloud Shell already has the Cloud SDK pre-installed.

1. Create a new Google Cloud project via the
    [*New Project* page](https://console.cloud.google.com/projectcreate),
    or via the `gcloud` command line tool.

    ```sh
    export PROJECT=your-google-cloud-project-id
    gcloud projects create $PROJECT
    ```

1. Set up the Cloud SDK for your GCP project.

    ```sh
    gcloud init
    ```

1. [Enable billing](https://cloud.google.com/billing/docs/how-to/modify-project).

1. [Enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=dataflow,compute_component,storage_component,storage_api,logging,cloudresourcemanager.googleapis.com,iam.googleapis.com):
    Dataflow, Compute Engine, Cloud Storage, Cloud Storage JSON,
    Stackdriver Logging, Cloud Resource Manager, and IAM API.
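
    You can also enable them from the command line. A minimal `gcloud` sketch, assuming these are the current service IDs for those APIs:

    ```sh
    gcloud services enable dataflow.googleapis.com compute.googleapis.com \
        storage-component.googleapis.com storage-api.googleapis.com \
        logging.googleapis.com cloudresourcemanager.googleapis.com \
        iam.googleapis.com
    ```
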
1. Create a service account JSON key via the
    [*Create service account key* page](https://console.cloud.google.com/apis/credentials/serviceaccountkey),
    or via the `gcloud` command line tool.
    Here is how to do it through the *Create service account key* page:

    * From the **Service account** list, select **New service account**.
    * In the **Service account name** field, enter a name.
    * From the **Role** list, select **Project > Owner** **(*)**.
    * Click **Create**. A JSON file that contains your key downloads to your computer.

    Alternatively, you can use `gcloud` through the command line.

    ```sh
    export PROJECT=$(gcloud config get-value project)
    export SA_NAME=samples
    export IAM_ACCOUNT=$SA_NAME@$PROJECT.iam.gserviceaccount.com

    # Create the service account.
    gcloud iam service-accounts create $SA_NAME --display-name $SA_NAME

    # Set the role to Project Owner (*).
    gcloud projects add-iam-policy-binding $PROJECT \
      --member serviceAccount:$IAM_ACCOUNT \
      --role roles/owner

    # Create a JSON file with the service account credentials.
    gcloud iam service-accounts keys create path/to/your/credentials.json \
      --iam-account=$IAM_ACCOUNT
    ```

    > **(*)** *Note:* The **Role** field authorizes your service account to access resources.
    > You can view and change this field later by using the
    > [GCP Console IAM page](https://console.cloud.google.com/iam-admin/iam).
    > If you are developing a production app, specify more granular permissions than **Project > Owner**.
    > For more information, see
    > [Granting roles to service accounts](https://cloud.google.com/iam/docs/granting-roles-to-service-accounts).

    For more information, see
    [Creating and managing service accounts](https://cloud.google.com/iam/docs/creating-managing-service-accounts).

1. Set your `GOOGLE_APPLICATION_CREDENTIALS` environment variable to point to your service account key file.

    ```sh
    export GOOGLE_APPLICATION_CREDENTIALS=path/to/your/credentials.json
    ```

## Setting up a Python development environment

For instructions on how to install Python, virtualenv, and the Cloud SDK, see the
[Setting up a Python development environment](https://cloud.google.com/python/setup)
guide.
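
In short, a typical setup looks something like this (a sketch only; the guide above has the OS-specific details):

```sh
# Create and activate an isolated Python environment.
virtualenv env
source env/bin/activate

# Install Apache Beam with the GCP extras used by these samples.
pip install 'apache-beam[gcp]'
```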

dataflow/encryption-keys/README.md

202 additions & 0 deletions
# Using customer-managed encryption keys

[![Open in Cloud Shell](http://gstatic.com/cloudssh/images/open-btn.svg)](https://console.cloud.google.com/cloudshell/editor)

This sample demonstrates how to use
[customer-managed encryption keys](https://cloud.google.com/kms/)
for the I/O connectors in an
[Apache Beam](https://beam.apache.org) pipeline.
For more information, see the
[Using customer-managed encryption keys](https://cloud.google.com/dataflow/docs/guides/customer-managed-encryption-keys)
docs page.

## Before you begin

Follow the
[Getting started with Google Cloud Dataflow](../README.md)
page, and make sure you have a Google Cloud project with billing enabled
and a *service account JSON key* set up in your `GOOGLE_APPLICATION_CREDENTIALS` environment variable.
Additionally, for this sample you need the following:

1. [Enable the APIs](https://console.cloud.google.com/flows/enableapi?apiid=bigquery,cloudkms.googleapis.com):
    BigQuery and Cloud KMS API.
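
    Or enable them from the command line (a sketch, assuming these service IDs):

    ```sh
    gcloud services enable bigquery.googleapis.com cloudkms.googleapis.com
    ```
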
1. Create a Cloud Storage bucket.

    ```sh
    export BUCKET=your-gcs-bucket
    gsutil mb gs://$BUCKET
    ```

1. [Create a symmetric key ring](https://cloud.google.com/kms/docs/creating-keys).
    For best results, use a [regional location](https://cloud.google.com/kms/docs/locations).
    This example uses a `global` key for simplicity.

    ```sh
    export KMS_KEYRING=samples-keyring
    export KMS_KEY=samples-key

    # Create a key ring.
    gcloud kms keyrings create $KMS_KEYRING --location global

    # Create a key.
    gcloud kms keys create $KMS_KEY --location global \
      --keyring $KMS_KEYRING --purpose encryption
    ```

    > *Note:* Although you can destroy the
    > [*key version material*](https://cloud.google.com/kms/docs/destroy-restore),
    > you [cannot delete keys and key rings](https://cloud.google.com/kms/docs/object-hierarchy#lifetime).
    > Key rings and keys do not have billable costs or quota limitations,
    > so their continued existence does not impact costs or production limits.
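
    To confirm the key's full resource name, which the pipeline needs later, you can ask `gcloud` for it (a sketch):

    ```sh
    # Prints something like:
    # projects/<project>/locations/global/keyRings/samples-keyring/cryptoKeys/samples-key
    gcloud kms keys describe $KMS_KEY --location global \
        --keyring $KMS_KEYRING --format "value(name)"
    ```
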
1. Grant Encrypter/Decrypter permissions to the *Dataflow*, *Compute Engine*, and *BigQuery*
    [service accounts](https://cloud.google.com/iam/docs/service-accounts).
    This grants your Dataflow, Compute Engine, and BigQuery service accounts
    permission to encrypt and decrypt with the CMEK you specify.
    The Dataflow workers use these service accounts when running the pipeline;
    they are different from the *user* service account used to start the pipeline.

    ```sh
    export PROJECT=$(gcloud config get-value project)
    export PROJECT_NUMBER=$(gcloud projects list --filter $PROJECT --format "value(PROJECT_NUMBER)")

    # Grant Encrypter/Decrypter permissions to the Dataflow service account.
    gcloud projects add-iam-policy-binding $PROJECT \
      --member serviceAccount:service-$PROJECT_NUMBER@dataflow-service-producer-prod.iam.gserviceaccount.com \
      --role roles/cloudkms.cryptoKeyEncrypterDecrypter

    # Grant Encrypter/Decrypter permissions to the Compute Engine service account.
    gcloud projects add-iam-policy-binding $PROJECT \
      --member serviceAccount:service-$PROJECT_NUMBER@compute-system.iam.gserviceaccount.com \
      --role roles/cloudkms.cryptoKeyEncrypterDecrypter

    # Grant Encrypter/Decrypter permissions to the BigQuery service account.
    gcloud projects add-iam-policy-binding $PROJECT \
      --member serviceAccount:bq-$PROJECT_NUMBER@bigquery-encryption.iam.gserviceaccount.com \
      --role roles/cloudkms.cryptoKeyEncrypterDecrypter
    ```
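
    To verify that a binding took effect, you can inspect the project's IAM policy (a sketch):

    ```sh
    # List the roles granted to the Dataflow service account.
    gcloud projects get-iam-policy $PROJECT \
        --flatten "bindings[].members" \
        --filter "bindings.members:service-$PROJECT_NUMBER@dataflow-service-producer-prod.iam.gserviceaccount.com" \
        --format "table(bindings.role)"
    ```
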
1. Clone the `python-docs-samples` repository.

    ```sh
    git clone https://github.com/GoogleCloudPlatform/python-docs-samples.git
    ```

1. Navigate to the sample code directory.

    ```sh
    cd python-docs-samples/dataflow/encryption-keys
    ```

1. Create a virtual environment and activate it.

    ```sh
    virtualenv env
    source env/bin/activate
    ```

    > Once you are done, you can deactivate the virtualenv and go back to your global Python environment by running `deactivate`.

1. Install the sample requirements.

    ```sh
    pip install -U -r requirements.txt
    ```

## BigQuery KMS Key example

* [bigquery_kms_key.py](bigquery_kms_key.py)

The following sample reads data from the
[NASA wildfires public BigQuery dataset](https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=nasa_wildfire&t=past_week&page=table)
using a customer-managed encryption key, and writes it to the specified `output_bigquery_table`
using the same customer-managed encryption key.

Make sure you have the following variables set up:

```sh
# Set the project ID, GCS bucket and KMS key.
export PROJECT=$(gcloud config get-value project)
export BUCKET=your-gcs-bucket

# Set the region for the Dataflow job.
# https://cloud.google.com/compute/docs/regions-zones/
export REGION=us-central1

# Set the KMS key ID.
export KMS_KEYRING=samples-keyring
export KMS_KEY=samples-key
export KMS_KEY_ID=$(gcloud kms keys list --location global --keyring $KMS_KEYRING --filter $KMS_KEY --format "value(NAME)")

# Output BigQuery dataset and table name.
export DATASET=samples
export TABLE=dataflow_kms
```

Create the BigQuery dataset where the output table will reside.

```sh
# Create the BigQuery dataset.
bq mk --dataset $PROJECT:$DATASET
```

To run the sample using the Dataflow runner:

```sh
python bigquery_kms_key.py \
  --output_bigquery_table $PROJECT:$DATASET.$TABLE \
  --kms_key $KMS_KEY_ID \
  --project $PROJECT \
  --runner DataflowRunner \
  --temp_location gs://$BUCKET/samples/dataflow/kms/tmp \
  --region $REGION
```

> *Note:* To run locally, you can omit the `--runner` command line argument; it then defaults to the `DirectRunner`.
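
For example, a local run might look like this (a sketch; exact requirements can vary by Apache Beam version):

```sh
# DirectRunner run; the project is still needed for the BigQuery calls.
python bigquery_kms_key.py \
  --output_bigquery_table $PROJECT:$DATASET.$TABLE \
  --kms_key $KMS_KEY_ID \
  --project $PROJECT
```
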
You can check your submitted Cloud Dataflow jobs in the
[GCP Console Dataflow page](https://console.cloud.google.com/dataflow) or by using `gcloud`.

```sh
gcloud dataflow jobs list
```

Finally, check the contents of the BigQuery table.

```sh
# Escape the backticks so the shell does not treat them as command substitution.
bq query --use_legacy_sql=false "SELECT * FROM \`$PROJECT.$DATASET.$TABLE\`"
```

## Cleanup

To avoid incurring charges to your GCP account for the resources used:

```sh
# Remove only the files created by this sample.
gsutil -m rm -rf "gs://$BUCKET/samples/dataflow/kms"

# [optional] Remove the Cloud Storage bucket.
gsutil rb gs://$BUCKET

# Remove the BigQuery table.
bq rm -f -t $PROJECT:$DATASET.$TABLE

# [optional] Remove the BigQuery dataset and all its tables.
bq rm -rf -d $PROJECT:$DATASET

# Revoke Encrypter/Decrypter permissions from the Dataflow service account.
gcloud projects remove-iam-policy-binding $PROJECT \
  --member serviceAccount:service-$PROJECT_NUMBER@dataflow-service-producer-prod.iam.gserviceaccount.com \
  --role roles/cloudkms.cryptoKeyEncrypterDecrypter

# Revoke Encrypter/Decrypter permissions from the Compute Engine service account.
gcloud projects remove-iam-policy-binding $PROJECT \
  --member serviceAccount:service-$PROJECT_NUMBER@compute-system.iam.gserviceaccount.com \
  --role roles/cloudkms.cryptoKeyEncrypterDecrypter

# Revoke Encrypter/Decrypter permissions from the BigQuery service account.
gcloud projects remove-iam-policy-binding $PROJECT \
  --member serviceAccount:bq-$PROJECT_NUMBER@bigquery-encryption.iam.gserviceaccount.com \
  --role roles/cloudkms.cryptoKeyEncrypterDecrypter
```
dataflow/encryption-keys/bigquery_kms_key.py

88 additions & 0 deletions
```python
#!/usr/bin/env python
#
# Copyright 2019 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse


def run(output_bigquery_table, kms_key, beam_args):
    # [START dataflow_cmek]
    import apache_beam as beam

    # output_bigquery_table = '<project>:<dataset>.<table>'
    # kms_key = 'projects/<project>/locations/<kms-location>/keyRings/<kms-keyring>/cryptoKeys/<kms-key>'  # noqa
    # beam_args = [
    #     '--project', 'your-project-id',
    #     '--runner', 'DataflowRunner',
    #     '--temp_location', 'gs://your-bucket/samples/dataflow/kms/tmp',
    #     '--region', 'us-central1',
    # ]

    # Query from the NASA wildfires public dataset:
    # https://console.cloud.google.com/bigquery?p=bigquery-public-data&d=nasa_wildfire&t=past_week&page=table
    query = """
        SELECT latitude,longitude,acq_date,acq_time,bright_ti4,confidence
        FROM `bigquery-public-data.nasa_wildfire.past_week`
        LIMIT 10
    """

    # Schema for the output BigQuery table.
    schema = {
        'fields': [
            {'name': 'latitude', 'type': 'FLOAT'},
            {'name': 'longitude', 'type': 'FLOAT'},
            {'name': 'acq_date', 'type': 'DATE'},
            {'name': 'acq_time', 'type': 'TIME'},
            {'name': 'bright_ti4', 'type': 'FLOAT'},
            {'name': 'confidence', 'type': 'STRING'},
        ],
    }

    options = beam.options.pipeline_options.PipelineOptions(beam_args)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | 'Read from BigQuery with KMS key' >>
            beam.io.Read(beam.io.BigQuerySource(
                query=query,
                use_standard_sql=True,
                kms_key=kms_key,
            ))
            | 'Write to BigQuery with KMS key' >>
            beam.io.WriteToBigQuery(
                output_bigquery_table,
                schema=schema,
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
                kms_key=kms_key,
            )
        )
    # [END dataflow_cmek]


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--kms_key',
        required=True,
        help='Cloud Key Management Service key name',
    )
    parser.add_argument(
        '--output_bigquery_table',
        required=True,
        help="Output BigQuery table in the format 'PROJECT:DATASET.TABLE'",
    )
    args, beam_args = parser.parse_known_args()

    run(args.output_bigquery_table, args.kms_key, beam_args)
```
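
For quick experiments, you can also call `run()` directly from Python instead of the command line. A sketch with placeholder values (these names are illustrative, not from the sample):

```python
# Hypothetical direct invocation; replace the placeholders with real resources.
run(
    output_bigquery_table='your-project:samples.dataflow_kms',
    kms_key='projects/your-project/locations/global/keyRings/samples-keyring/cryptoKeys/samples-key',
    beam_args=['--project', 'your-project',
               '--temp_location', 'gs://your-bucket/samples/dataflow/kms/tmp'],
)
```
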
dataflow/encryption-keys/requirements.txt

1 addition & 0 deletions

```
apache-beam[gcp]
```
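
If you need reproducible installs, you can pin the dependency to a released version. For example (the version shown is illustrative; any Apache Beam release with the `gcp` extra works):

```
apache-beam[gcp]==2.16.0
```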
