* Initialize test file and add comments
* add a couple tests and pull out functions
* lint
* PR comments
* add another ValueError
* Add back the mkdir step
* Add component to version.py
* change to protected functions
tfx_addons/copy_example_gen/README.md
+13-13
@@ -10,7 +10,7 @@
**Project name:** CopyExampleGen component
## Project Description
-
CopyExampleGen will allow the user to copy pre-existing tfrecords and ingest it into the pipeline as examples, ultimately skipping the process of shuffling and running the Beam job that is in the standard component, ExampleGen. This process will require a dict input with split names as keys and their respective URIs as the value from the user. Following suit, the component will set the artifact’s properties, generate output dict, and register contexts and execution for downstream components to use. Lastly, tfrecord file(s) in uri must resemble same `.gz` file format as the output of ExampleGen component.
+
CopyExampleGen will allow the user to copy pre-existing TFRecords and ingest them into the pipeline as examples, skipping the shuffling and the Beam job that the standard ExampleGen component runs. The user must supply a dict with split names as keys and their respective URIs as values. The component will then set the artifact’s properties, generate the output dict, and register contexts and execution for downstream components to use. Lastly, the TFRecord file(s) at each URI must use the same `.gz` file format as the output of the ExampleGen component.
As of April 10th, 2023, `tfx.dsl.components.Parameter` only supports primitive types. Therefore, to use CopyExampleGen, the 'input_dict' of type Dict[str, str] needs to be converted into a JSON str. This can be done by passing 'tfrecords_dict' as the argument to `json.dumps()`.
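A minimal sketch of that conversion, using only the standard library. The split names and URIs are placeholder values, and the commented-out `CopyExampleGen(...)` call is an assumption about how the component would be invoked, not a confirmed signature.

```python
import json

# Placeholder split names and URIs; substitute your own locations.
tfrecords_dict = {
    "train": "gs://example-bucket/tfrecords/train",
    "eval": "gs://example-bucket/tfrecords/eval",
}

# A Parameter[str] can carry the dict once it is serialized to JSON.
input_json_str = json.dumps(tfrecords_dict)

# Inside the component, the original dict can be recovered with json.loads.
restored = json.loads(input_json_str)

# Hypothetical call site (assumed, not executed here):
# copy_example_gen = CopyExampleGen(input_json_str=input_json_str)
```

The round trip through `json.loads` is lossless here because both keys and values are plain strings.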
@@ -31,9 +31,9 @@ As of April 10th, 2023, tfx.dsl.components.Parameter only supports primitive typ
Addon Component
## Project Use-Case(s)
-
CopyExampleGen will replace ExampleGen when tfrecords and split names are already in the possession of the user. Hence, a Beam job will not be run nor will the tfrecords be shuffled and/ or randomized saving data ingestion pipeline process time.
+
CopyExampleGen will replace ExampleGen when the user already has TFRecords and split names. Hence, no Beam job is run and the TFRecords are not shuffled or randomized, which shortens the data ingestion step of the pipeline.
-
Currently, ingesting data with the ExampleGen component does not provide a way to split without random data shuffling and always runs a beam job. This component will save significant time (hours for large amounts of data) per pipeline run when a pipeline run does not require data to be shuffled. Some challenges users have had:
+
Currently, the ExampleGen component provides no way to split data without random shuffling, and it always runs a Beam job. This component will save significant time (hours for large amounts of data) on every pipeline run that does not require the data to be shuffled. Some challenges users have had:
1. “Reshuffle doesn't work well with DirectRunner and causes OOMing. Users have been patching out shuffling in every release and doing it in the DB query. They have given up on Beam based ExampleGen and have created an entire custom ExampleGen that reads from the database and doesn’t use Beam”.
@@ -47,29 +47,29 @@ Custom Python function component: CopyExampleGen
-`input_json_str`: the input parameter for CopyExampleGen, of type `tfx.dsl.components.Parameter[str]`, through which the user passes their Dict[str, str] input, `tfrecords_dict`. Because parameters of Python custom components only support primitive types, `input_json_str` must be set to `json.dumps(tfrecords_dict)`.
-
-`output_example`: Output artifact can be referenced as an object of its' specified type ArtifactType in the component function being declared. For example, if the ArtifactType is Examples, one can reference properties in an Examples ArtifactType (span, version, split_names, etc.) by calling the OutputArtifact object. This will be the variable we reference to build and register our Examples Artifact after pasrsing the tfrecords_dict input.
+
-`output_example`: the output artifact, which can be referenced in the component function as an object of its declared ArtifactType. For example, if the ArtifactType is Examples, its properties (span, version, split_names, etc.) can be accessed through the OutputArtifact object. This is the variable used to build and register the Examples artifact after parsing the `tfrecords_dict` input.
Using fileio.mkdir and fileio.copy, the component will then create a directory folder for each name in `split_name`. Following the creation of the `Split-name` folder, the files in the uri path will then be copied into the designated `Split-name` folder.
+
Using `fileio.mkdir` and `fileio.copy`, the component will then create a directory for each name in `split_name`. Once the `Split-name` directory exists, the files in the URI path are copied into it.
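The directory-per-split copy can be sketched roughly as follows. This is an illustration only: it uses the standard library (`os`, `shutil`) as a local stand-in for `fileio`, the helper name `copy_splits` is hypothetical, and the file name and temporary directories are placeholders rather than real pipeline paths.

```python
import os
import shutil
import tempfile


def copy_splits(tfrecords_dict, output_uri):
    """Copy each split's files into <output_uri>/Split-<name>/ (hypothetical helper)."""
    for split_name, source_dir in tfrecords_dict.items():
        split_dir = os.path.join(output_uri, f"Split-{split_name}")
        # Create the split directory before copying anything into it.
        os.makedirs(split_dir, exist_ok=True)
        for file_name in sorted(os.listdir(source_dir)):
            source_path = os.path.join(source_dir, file_name)
            if os.path.isfile(source_path):
                shutil.copy(source_path, os.path.join(split_dir, file_name))


# Demo with throwaway local directories standing in for real URIs.
root = tempfile.mkdtemp()
sources = {}
for name in ("train", "eval"):
    sources[name] = os.path.join(root, "input", name)
    os.makedirs(sources[name])
    with open(os.path.join(sources[name], "data_tfrecord-00000-of-00001.gz"), "wb") as f:
        f.write(b"placeholder bytes")

output_uri = os.path.join(root, "output")
copy_splits(sources, output_uri)
copied = sorted(os.listdir(output_uri))
```

A real component would need the copy to work against remote filesystems as well, which is the reason the README points at `fileio` rather than local-only calls.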
Thoughts from original implementation in phase 1:
This step can possibly use the [importer.generate_output_dict](https://github.com/tensorflow/tfx/blob/f8ce19339568ae58519d4eecfdd73078f80f84a2/tfx/dsl/components/common/importer.py#L153) function:
Create standard ‘output_dict’ variable. The value will be created by calling the worker function. If file copying is done before this step, this method can probably be used as is to register the artifact.
Using the keys and values from `tfrecords_dict`:
Convert `input_dict.keys()` to a str in the format required by the `split-names` property, e.g. '["train","eval"]'
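One way to produce a string in that shape, sketched with the standard library only. The dict contents are placeholders, and whether the property requires the compact form without a space after the comma is an assumption taken from the example above.

```python
import json

# Placeholder input; keys are the split names.
tfrecords_dict = {"train": "/path/to/train", "eval": "/path/to/eval"}

# Compact separators avoid the space json.dumps normally puts after each comma.
split_names = json.dumps(list(tfrecords_dict.keys()), separators=(",", ":"))
```

Because dicts preserve insertion order, the split names appear in the order the user supplied them.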
-
+
## Possible Future Development Directions
-
1. There's a few open questions about how the file copying should actually done. Where does the copying that importer does actually happen? And what's the best way to change that? Are there other ways in TFX to do copying in a robust way? Maybe something in tfx.io? If there's an existing method, what has to happen in the `parse_tfrecords_dict`. Depending on the copying capabilities available, will there be a need to detect the execution environment? Does TFX rely on other tools to execute a copy that handle this? Is detection of the execution environment and the copying itself separate? What could be reused?
-
+
1. There are a few open questions about how the file copying should actually be done. Where does the copying that importer does actually happen? And what's the best way to change that? Are there other ways in TFX to do copying in a robust way? Maybe something in tfx.io? If there's an existing method, what has to happen in `parse_tfrecords_dict`? Depending on the copying capabilities available, will there be a need to detect the execution environment? Does TFX rely on other tools that handle this to execute a copy? Is detection of the execution environment separate from the copying itself? What could be reused?
+
- If it's not easy to detect the execution environment without also performing a copy, will the user have to specify the execution environment, and therefore how to do the copy (e.g., local copy, GCS, S3)? If so, what's the best way to handle that?
-
+
2. Should the dictionary of file inputs take a path to a folder? Globs? Lists of individual files?
3. Assuming file copying is done entirely separately, can [importer.generate_output_dict](https://github.com/tensorflow/tfx/blob/f8ce19339568ae58519d4eecfdd73078f80f84a2/tfx/dsl/components/common/importer.py#L153) be used as is to register the artifacts, or does separate code that uses [MLMD](https://www.tensorflow.org/tfx/guide/mlmd) in a different way need to be written?