Commit 583cb93

vertex-sdk-bot authored and copybara-github committed
docs: Update the documentation for the image_dataset class
PiperOrigin-RevId: 642377218
1 parent fe15b18 commit 583cb93

1 file changed

Lines changed: 98 additions & 64 deletions


‎google/cloud/aiplatform/datasets/text_dataset.py‎

@@ -27,7 +27,35 @@
2727

2828

2929
class TextDataset(datasets._Dataset):
30-
"""Managed text dataset resource for Vertex AI."""
30+
"""A managed text dataset resource for Vertex AI.
31+
32+
Use this class to work with a managed text dataset. To create a managed
33+
text dataset, you need a datasource file in CSV format and a schema file in
34+
YAML format. A schema is optional for a custom model. The CSV file and the
35+
schema are accessed in Cloud Storage buckets.
36+
37+
Use text data for the following objectives:
38+
39+
* Classification. For more information, see
40+
[Prepare text training data for classification](https://cloud.google.com/vertex-ai/docs/text-data/classification/prepare-data).
41+
* Entity extraction. For more information, see
42+
[Prepare text training data for entity extraction](https://cloud.google.com/vertex-ai/docs/text-data/entity-extraction/prepare-data).
43+
* Sentiment analysis. For more information, see
44+
[Prepare text training data for sentiment analysis](https://cloud.google.com/vertex-ai/docs/text-data/sentiment-analysis/prepare-data).
45+
46+
The following code shows you how to create and import a text dataset with
47+
a CSV datasource file and a YAML schema file. The schema file you use
48+
depends on whether your text dataset is used for single-label
49+
classification, multi-label classification, entity extraction, or sentiment analysis.
50+
51+
```py
52+
my_dataset = aiplatform.TextDataset.create(
53+
display_name="my-text-dataset",
54+
gcs_source=['gs://path/to/my/text-dataset.csv'],
55+
import_schema_uri='gs://path/to/my/schema.yaml',
56+
)
57+
```
58+
"""
3159

3260
_supported_metadata_schema_uris: Optional[Tuple[str]] = (
3361
schema.dataset.metadata.text,
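
The class docstring in the hunk above passes `gcs_source` as a list, while the `create` arguments documented in the next hunk describe `gcs_source` as a string or a sequence of strings and `import_schema_uri` as a single string. The following is a minimal sketch of checking those shapes before calling `TextDataset.create`. The helper name `build_create_kwargs` is hypothetical and is not part of the Vertex AI SDK; the bucket paths are the placeholders from the docstring.

```python
from typing import Dict, List, Sequence, Union


def build_create_kwargs(
    display_name: str,
    gcs_source: Union[str, Sequence[str]],
    import_schema_uri: str,
) -> Dict[str, object]:
    """Hypothetical helper: checks argument shapes for TextDataset.create."""
    # Accept a single URI or a sequence of URIs, as the docstring describes.
    if isinstance(gcs_source, str):
        sources: List[str] = [gcs_source]
    else:
        sources = list(gcs_source)
    for uri in sources:
        if not uri.startswith("gs://"):
            raise ValueError(f"Not a Cloud Storage URI: {uri!r}")
    # import_schema_uri is documented as a single string, not a list.
    if not isinstance(import_schema_uri, str):
        raise TypeError("import_schema_uri must be a str")
    return {
        "display_name": display_name,
        "gcs_source": sources,
        "import_schema_uri": import_schema_uri,
    }


kwargs = build_create_kwargs(
    display_name="my-text-dataset",
    gcs_source="gs://path/to/my/text-dataset.csv",
    import_schema_uri="gs://path/to/my/schema.yaml",
)

# With the SDK installed and credentials configured, the call would be:
#   from google.cloud import aiplatform
#   my_dataset = aiplatform.TextDataset.create(**kwargs)
```

The SDK call is left as a comment so that the sketch runs without network access or credentials.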
@@ -49,91 +77,97 @@ def create(
4977
sync: bool = True,
5078
create_request_timeout: Optional[float] = None,
5179
) -> "TextDataset":
52-
"""Creates a new text dataset and optionally imports data into dataset
53-
when source and import_schema_uri are passed.
80+
"""Creates a new text dataset.
81+
82+
Optionally imports data into this dataset when a source and
83+
`import_schema_uri` are passed in. The following is an example of how
84+
this method is used:
5485

55-
Example Usage:
56-
ds = aiplatform.TextDataset.create(
57-
display_name='my-dataset',
58-
gcs_source='gs://my-bucket/dataset.csv',
59-
import_schema_uri=aiplatform.schema.dataset.ioformat.text.multi_label_classification
60-
)
86+
```py
87+
ds = aiplatform.TextDataset.create(
88+
display_name='my-dataset',
89+
gcs_source='gs://my-bucket/dataset.csv',
90+
import_schema_uri=aiplatform.schema.dataset.ioformat.text.multi_label_classification
91+
)
92+
```
6193

6294
Args:
6395
display_name (str):
64-
Optional. The user-defined name of the Dataset.
65-
The name can be up to 128 characters long and can be consist
66-
of any UTF-8 characters.
96+
Optional. The user-defined name of the dataset. The name must
97+
contain 128 or fewer UTF-8 characters.
6798
gcs_source (Union[str, Sequence[str]]):
68-
Google Cloud Storage URI(-s) to the
69-
input file(s).
70-
71-
Examples:
72-
str: "gs://bucket/file.csv"
73-
Sequence[str]: ["gs://bucket/file1.csv", "gs://bucket/file2.csv"]
99+
Optional. The URI or URIs of one or more Cloud Storage files
100+
that contain your input data. For example, `str:
101+
"gs://bucket/file.csv"` or `Sequence[str]:
102+
["gs://bucket/file1.csv", "gs://bucket/file2.csv"]`.
74103
import_schema_uri (str):
75-
Points to a YAML file stored on Google Cloud
76-
Storage describing the import format. Validation will be
77-
done against the schema. The schema is defined as an
78-
`OpenAPI 3.0.2 Schema
79-
Object <https://tinyurl.com/y538mdwt>`__.
104+
Optional. A URI for a YAML file stored in Cloud Storage that
105+
describes the import schema used to validate the
106+
dataset. The schema is an
107+
[OpenAPI 3.0.2 Schema](https://tinyurl.com/y538mdwt) object.
80108
data_item_labels (Dict):
81-
Labels that will be applied to newly imported DataItems. If
82-
an identical DataItem as one being imported already exists
83-
in the Dataset, then these labels will be appended to these
84-
of the already existing one, and if labels with identical
85-
key is imported before, the old label value will be
86-
overwritten. If two DataItems are identical in the same
87-
import data operation, the labels will be combined and if
88-
key collision happens in this case, one of the values will
89-
be picked randomly. Two DataItems are considered identical
90-
if their content bytes are identical (e.g. image bytes or
91-
pdf bytes). These labels will be overridden by Annotation
92-
labels specified inside index file referenced by
93-
``import_schema_uri``,
94-
e.g. jsonl file.
109+
Optional. A dictionary of label information. Each dictionary
110+
item contains a label and a label key. Each item in the dataset
111+
includes one dictionary of label information. If a data item is
112+
added or merged into a dataset, and that data item contains
113+
text that's identical to text that's already in the
114+
dataset, then the data items are merged. If two identical labels
115+
are detected during the merge, each with a different label key,
116+
then one of the label and label key dictionary items is randomly
117+
chosen for the merged data item. Data items are
118+
compared by their binary data (bytes), not by their content.
119+
If annotation labels are referenced in a schema specified by the
120+
`import_schema_uri` parameter, then the labels in the
121+
`data_item_labels` dictionary are overridden by the annotations.
95122
project (str):
96-
Project to upload this dataset to. Overrides project set in
97-
aiplatform.init.
123+
Optional. The name of the Google Cloud project to which this
124+
`TextDataset` is uploaded. This overrides the project that
125+
was set by `aiplatform.init`.
98126
location (str):
99-
Location to upload this dataset to. Overrides location set in
100-
aiplatform.init.
127+
Optional. The Google Cloud region where this dataset is uploaded. This
128+
region overrides the region that was set by `aiplatform.init`.
101129
credentials (auth_credentials.Credentials):
102-
Custom credentials to use to upload this dataset. Overrides
103-
credentials set in aiplatform.init.
130+
Optional. The credentials that are used to upload the `TextDataset`.
131+
These credentials override the credentials set by
132+
`aiplatform.init`.
104133
request_metadata (Sequence[Tuple[str, str]]):
105-
Strings which should be sent along with the request as metadata.
134+
Optional. Strings that contain metadata that's sent with the request.
106135
labels (Dict[str, str]):
107-
Optional. Labels with user-defined metadata to organize your Tensorboards.
108-
Label keys and values can be no longer than 64 characters
109-
(Unicode codepoints), can only contain lowercase letters, numeric
110-
characters, underscores and dashes. International characters are allowed.
111-
No more than 64 user labels can be associated with one Tensorboard
112-
(System labels are excluded).
113-
See https://goo.gl/xmQnxf for more information and examples of labels.
114-
System reserved label keys are prefixed with "aiplatform.googleapis.com/"
115-
and are immutable.
136+
Optional. Labels with user-defined metadata to organize your
137+
text datasets. The maximum length of a key and of a
138+
value is 64 Unicode characters. Keys and values can contain only
139+
lowercase letters, numeric characters, underscores, and dashes.
140+
International characters are allowed. No more than 64 user
141+
labels can be associated with one dataset (system labels are
142+
excluded). For more information and examples of using labels, see
143+
[Using labels to organize Google Cloud Platform resources](https://goo.gl/xmQnxf).
144+
System reserved label keys are prefixed with
145+
`aiplatform.googleapis.com/` and are immutable.
116146
encryption_spec_key_name (Optional[str]):
117147
Optional. The Cloud KMS resource identifier of the customer
118-
managed encryption key used to protect the dataset. Has the
119-
form:
120-
``projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key``.
148+
managed encryption key that's used to protect the dataset. The
149+
format of the key is
150+
`projects/my-project/locations/my-region/keyRings/my-kr/cryptoKeys/my-key`.
121151
The key needs to be in the same region as where the compute
122152
resource is created.
123153

124-
If set, this Dataset and all sub-resources of this Dataset will be secured by this key.
154+
If `encryption_spec_key_name` is set, this `TextDataset` and
155+
all of its sub-resources are secured by this key.
125156

126-
Overrides encryption_spec_key_name set in aiplatform.init.
127-
create_request_timeout (float):
128-
Optional. The timeout for the create request in seconds.
157+
This `encryption_spec_key_name` overrides the
158+
`encryption_spec_key_name` set by `aiplatform.init`.
129159
sync (bool):
130-
Whether to execute this method synchronously. If False, this method
131-
will be executed in concurrent Future and any downstream object will
132-
be immediately returned and synced when the Future has completed.
160+
If `True`, the `create` method creates a text dataset
161+
synchronously. If `False`, the `create` method creates a text
162+
dataset asynchronously.
163+
create_request_timeout (float):
164+
Optional. The number of seconds for the timeout of the create
165+
request.
133166

134167
Returns:
135168
text_dataset (TextDataset):
136-
Instantiated representation of the managed text dataset resource.
169+
An instantiated representation of the managed `TextDataset`
170+
resource.
137171
"""
138172
if not display_name:
139173
display_name = cls._generate_display_name()
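
The `labels` argument documented above carries several constraints: at most 64 user labels, keys and values of at most 64 Unicode characters, and only lowercase letters, numeric characters, underscores, and dashes, with international characters allowed. The following is a rough sketch of a client-side check for those rules. The function name `label_problems` and its character test are assumptions made for illustration and are not part of the Vertex AI SDK; the service performs its own validation.

```python
from typing import Dict, List

MAX_LABELS = 64
MAX_LENGTH = 64
RESERVED_PREFIX = "aiplatform.googleapis.com/"


def _allowed(ch: str) -> bool:
    # Underscores, dashes, digits, and letters that are not uppercase.
    # Caseless letters pass, which is one reading of the
    # "international characters are allowed" clause (an assumption).
    return ch in "_-" or ch.isdigit() or (ch.isalpha() and not ch.isupper())


def label_problems(labels: Dict[str, str]) -> List[str]:
    """Hypothetical helper: lists rule violations in a labels dictionary."""
    problems: List[str] = []
    if len(labels) > MAX_LABELS:
        problems.append(f"{len(labels)} labels exceed the limit of {MAX_LABELS}")
    for key, value in labels.items():
        if key.startswith(RESERVED_PREFIX):
            # System-reserved keys are immutable and not user-settable.
            problems.append(f"key {key!r} uses the system-reserved prefix")
            continue
        for kind, text in (("key", key), ("value", value)):
            # len() counts Unicode code points in Python 3.
            if len(text) > MAX_LENGTH:
                problems.append(f"{kind} {text!r} is longer than {MAX_LENGTH}")
            if not all(_allowed(ch) for ch in text):
                problems.append(f"{kind} {text!r} has a disallowed character")
    return problems


ok = label_problems({"team": "nlp-research", "équipe": "données"})
bad = label_problems({"Team": "nlp"})
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass.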
