Support TensorFlow-1.5.0 and MXNet-1.0.0 #82
nadiaya merged 5 commits into aws:master from aws/sagemaker-python-sdk:master
Conversation
tests/unit/test_estimator.py
Outdated
     framework, py_ver, tag = framework_name_from_image(image_name)
     assert framework == 'mxnet'
     assert py_ver == 'py2'
     assert tag == '2.5.6-gpu-py2'
src/sagemaker/fw_utils.py
Outdated
-def create_image_uri(region, framework, instance_type, py_version='py2', tag='1.0', account='520713654638'):
+def create_image_uri(region, framework, instance_type, framework_version, py_version='py2', account='520713654638'):
I've noticed that create_image_uri does not have unit tests. Can you write unit tests for this function?
It doesn't; it should 👍
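The unit tests requested above could start from something like the following sketch. The stand-in implementation and the exact URI layout are assumptions inferred from the tags shown elsewhere in this PR (e.g. `2.5.6-gpu-py2`), not the SDK's actual code:

```python
# Sketch only: a simplified stand-in for create_image_uri with the new
# signature, so the tests are self-contained. The repository/tag layout
# ('sagemaker-<framework>:<version>-<device>-<py_version>') is an assumption.
def create_image_uri(region, framework, instance_type, framework_version,
                     py_version='py2', account='520713654638'):
    family = instance_type.split('.')[1]  # 'ml.p2.xlarge' -> 'p2'
    device_type = 'gpu' if family[0] in ('g', 'p') else 'cpu'
    return '{}.dkr.ecr.{}.amazonaws.com/sagemaker-{}:{}-{}-{}'.format(
        account, region, framework, framework_version, device_type, py_version)


def test_create_image_uri_cpu():
    uri = create_image_uri('us-west-2', 'mxnet', 'ml.c4.2xlarge', '1.0')
    assert uri == ('520713654638.dkr.ecr.us-west-2.amazonaws.com/'
                   'sagemaker-mxnet:1.0-cpu-py2')


def test_create_image_uri_gpu():
    uri = create_image_uri('us-west-2', 'tensorflow', 'ml.p2.xlarge', '1.5',
                           py_version='py3')
    assert uri == ('520713654638.dkr.ecr.us-west-2.amazonaws.com/'
                   'sagemaker-tensorflow:1.5-gpu-py3')
```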
tests/unit/test_mxnet.py
Outdated
-def test_model(sagemaker_session):
+def test_model(sagemaker_session, mxnet_version):
     model = MXNetModel("s3://some/data.tar.gz", role=ROLE, entry_point=SCRIPT_PATH, sagemaker_session=sagemaker_session)
This test does not need to run for both versions.
tests/unit/test_tf_estimator.py
Outdated
-def test_tf_deploy_model_server_workers(sagemaker_session):
+def test_tf_deploy_model_server_workers(sagemaker_session, tf_version):
     tf = _build_tf(sagemaker_session)
This test does not need to run for both versions.
tests/unit/test_tf_estimator.py
Outdated
-def test_tf_deploy_model_server_workers_unset(sagemaker_session):
+def test_tf_deploy_model_server_workers_unset(sagemaker_session, tf_version):
     tf = _build_tf(sagemaker_session)
This test does not need to run for both versions.
tests/unit/test_tf_estimator.py
Outdated
 @patch('subprocess.call')
 @patch('os.access', return_value=False)
-def test_run_tensorboard_locally_without_tensorboard_binary(time, strftime, popen, call, access, sagemaker_session):
+def test_run_tensorboard_locally_without_tensorboard_binary(time, strftime, popen, call, access,
This test does not need to run for both versions.
tests/unit/test_tf_estimator.py
Outdated
 @patch('time.time', return_value=TIME)
 @pytest.mark.skip(reason="this test fails sometimes and it needs further investigation")
-def test_run_tensorboard_locally_port_in_use(time, strftime, popen, call, access, socket, sagemaker_session):
+def test_run_tensorboard_locally_port_in_use(time, strftime, popen, call, access, socket,
This test does not need to run for both versions.
tests/unit/test_tf_estimator.py
Outdated
-def test_tf_checkpoint_not_set(sagemaker_session):
+def test_tf_checkpoint_not_set(sagemaker_session, tf_version):
This test does not need to run for both versions.
tests/unit/test_tf_estimator.py
Outdated
-def test_tf_training_and_evaluation_steps_not_set(sagemaker_session):
+def test_tf_training_and_evaluation_steps_not_set(sagemaker_session, tf_version):
This test does not need to run for both versions.
tests/unit/test_tf_estimator.py
Outdated
-def test_tf_training_and_evaluation_steps(sagemaker_session):
+def test_tf_training_and_evaluation_steps(sagemaker_session, tf_version):
This test does not need to run for both versions.
tests/unit/test_tf_estimator.py
Outdated
-def test_tf_checkpoint_set(sagemaker_session):
+def test_tf_checkpoint_set(sagemaker_session, tf_version):
     tf = _build_tf(sagemaker_session, checkpoint_path='s3://my_checkpoint_bucket')
This test does not need to run for both versions.
winstonaws left a comment
Something I'm not set on, but worth thinking about: right now, if the user passes in an invalid framework_version, it won't fail until EASE tries to pull the image, right? If we have some kind of validation, that could save frustration.
On the other hand, we don't want to over-validate - when we release new images, people should be able to use them without updating the python SDK.
Maybe a good compromise would be to validate just the format of the tag?
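The format-only validation suggested here could look like the sketch below. The pattern and function name are hypothetical, not part of the SDK; the point is to reject obviously malformed versions without maintaining a whitelist, so newly released images keep working without an SDK update:

```python
import re

# Hypothetical format-only check: accepts 'X.Y' or 'X.Y.Z' version strings.
_VERSION_FORMAT = re.compile(r'^\d+(\.\d+){1,2}$')


def validate_framework_version_format(framework_version):
    # Fail fast at estimator construction instead of waiting for the
    # image pull to fail, per the suggestion above.
    if not _VERSION_FORMAT.match(framework_version):
        raise ValueError(
            "framework_version '{}' does not look like 'X.Y' or 'X.Y.Z'"
            .format(framework_version))
```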
README.rst
Outdated
 Your MXNet training script will be run on version 1.0.0 (by default) or 0.12 of MXNet, built for either GPU or CPU use. The decision to use the GPU or CPU version of MXNet is made by the ``train_instance_type``, set on the MXNet constructor. If you choose a GPU instance type, your training job will be run on a GPU version of MXNet. If you choose a CPU instance type, your training job will be run on a CPU version of MXNet. Similarly, when you call deploy, specifying a GPU or CPU deploy_instance_type, will control which MXNet build your Endpoint runs.
-Each Docker container has the following dependencies installed:
+The Docker containers have the following dependencies installed:
Let's change this to "Docker images".
 SageMaker runs MXNet Estimator scripts in either Python 2.7 or Python 3.5. You can select the Python version by passing a ``py_version`` keyword arg to the MXNet Estimator constructor. Setting this to ``py2`` (the default) will cause your training script to be run on Python 2.7. Setting this to ``py3`` will cause your training script to be run on Python 3.5. This Python version applies to both the Training Job, created by fit, and the Endpoint, created by deploy.
-Your MXNet training script will be run on version 0.12 of MXNet, built for either GPU or CPU use. The decision to use the GPU or CPU version of MXNet is made by the train_instance_type, set on the MXNet constructor. If you choose a GPU instance type, your training job will be run on a GPU version of MXNet. If you choose a CPU instance type, your training job will be run on a CPU version of MXNet. Similarly, when you call deploy, specifying a GPU or CPU deploy_instance_type, will control which MXNet build your Endpoint runs.
+Your MXNet training script will be run on version 1.0.0 (by default) or 0.12 of MXNet, built for either GPU or CPU use. The decision to use the GPU or CPU version of MXNet is made by the ``train_instance_type``, set on the MXNet constructor. If you choose a GPU instance type, your training job will be run on a GPU version of MXNet. If you choose a CPU instance type, your training job will be run on a CPU version of MXNet. Similarly, when you call deploy, specifying a GPU or CPU deploy_instance_type, will control which MXNet build your Endpoint runs.
Can we write this to be more future-proof? Maybe something like, "will run using the latest supported version of MXNet by default. See (something) for our currently supported versions. You can choose older versions by..."
But we would still need to make an SDK change to change the default. I believe it adds value to say what version we support by default.
 Your TensorFlow training script must be a **Python 2.7** source file. The current supported TensorFlow
-version is **1.4.0**. This training script **must contain** the following functions:
+versions are **1.5.0 (default)** and **1.4.1**. This training script **must contain** the following functions:
Same comment here about future-proofing.
src/sagemaker/fw_utils.py
Outdated
-def create_image_uri(region, framework, instance_type, py_version='py2', tag='1.0', account='520713654638'):
+def create_image_uri(region, framework, instance_type, framework_version, py_version='py2', account='520713654638'):
It might make sense to take out the default argument for py_version as well - what do you think?
I don't have an explanation for why it was defaulted at all.
src/sagemaker/fw_utils.py
Outdated
     """
-    device_version = 'cpu'
+    device_type = 'cpu'
+    # Instance types that start with G, P are GPU powered: https://aws.amazon.com/ec2/instance-types/
Should link to https://aws.amazon.com/sagemaker/pricing/instance-types/ instead.
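The device-type selection in this hunk amounts to a one-liner. A self-contained sketch (the helper name is hypothetical; the 'g'/'p' family rule comes from the comment in the diff):

```python
# Instance families whose name starts with 'g' or 'p' (e.g. ml.p2.xlarge,
# ml.g3.4xlarge) are GPU powered; everything else gets the CPU image.
def device_type_from_instance(instance_type):
    family = instance_type.split('.')[1]  # 'ml.p2.xlarge' -> 'p2'
    return 'gpu' if family[0] in ('g', 'p') else 'cpu'
```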
 # distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
 # ANY KIND, either express or implied. See the License for the specific
 # language governing permissions and limitations under the License.
+MXNET_VERSION = '1.0'
Maybe "DEFAULT_MXNET_VERSION"?
defaults.DEFAULT_MXNET_VERSION? I think it's a little bit redundant.
 # distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF
 # ANY KIND, either express or implied. See the License for the specific
 # language governing permissions and limitations under the License.
+TF_VERSION = '1.5'
defaults.DEFAULT_TF_VERSION? I think it's a little bit redundant.
     framework, py_version, tag = framework_name_from_image(init_params.pop('image'))
     init_params['py_version'] = py_version
+    # For backward compatibility map deprecated container tag to a framework version
container tag -> image tag
     container_log_level=self.container_log_level, code_location=self.code_location,
-    py_version=self.py_version, model_server_workers=model_server_workers,
+    py_version=self.py_version, framework_version=self.framework_version,
     sagemaker_session=self.sagemaker_session)
We probably want an extra assertion in the appropriate unit test that we pass along the framework version correctly to the created model (for both TF and MXNet).
https://github.com/aws/sagemaker-python-sdk/pull/82/files#diff-8e5bb247678c2ede89b070ed8c1926f2R192
We already do this indirectly. I guess it would be good to have a unit test that tests just model creation.
         to convert them before training.
     py_version (str): Python version you want to use for executing your model training code (default: 'py2').
         One of 'py2' or 'py3'.
+    framework_version (str): MXNet version you want to use for executing your model training code.
Can you add info to this docstring to tell users how they can figure out what valid values are? (Same for TF as well.)
 1.1.0
 =====
+* feature: Estimators: add support for TensorFlow-1.5.0
Please update after the master branch merge.
     init_params['py_version'] = py_version
+    # For backward compatibility map deprecated container tag to a framework version
+    # otherwise extract framework version from the tag itself.
Could we have a utility function that returns the version from the tag? Otherwise we hardcode this logic (element [0] of the tag) in 2 different places (for both frameworks).
I really thought about it, but the version is specific to MXNet and TensorFlow, and it's too little to generalize. Also, we should be able to remove it at some point in the future.
What I had in mind is that the logic to deal with tag parsing should be externalized, because we have the same pattern for all the frameworks, the same way we do with 'framework_name_from_image'. The version is specific to each framework, but getting that version is common. If the pattern changes in the future, we will only update it in one place.
Got it; yeah, it probably makes sense even for a one-line function.
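The shared helper being discussed could be as small as the sketch below. The function name and tag layout (`<version>-<device>-<py_version>`) are inferred from the tags appearing in this PR, e.g. `2.5.6-gpu-py2`:

```python
import re

# Extract the framework version from an image tag of the form
# '<version>-<device>-<py_version>', e.g. '1.5-gpu-py2' -> '1.5'.
# Returns None when the tag does not match the expected layout.
def framework_version_from_tag(image_tag):
    tag_pattern = re.compile(r'^(.*)-(cpu|gpu)-(py2|py3)$')
    match = tag_pattern.match(image_tag)
    return None if match is None else match.group(1)
```

Centralizing this keeps the version-extraction pattern in one place for both frameworks, the same way `framework_name_from_image` centralizes repository-name parsing.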
 import pytest
+@pytest.fixture(scope='module', params=["1.4", "1.4.1", "1.5", "1.5.0"])
It would be great to cover the default case, e.g. pass a special word like 'default' and in the test do not pass a version at all.
But it would run one of the versions again for every test. We have a specific test for testing the default value.
That's true about the double run, but I didn't see the specific (integ) test for the default value.
https://github.com/aws/sagemaker-python-sdk/pull/82/files/bc748929fc3088d99d1e5c00787e3ddad1a3a51b#diff-8e5bb247678c2ede89b070ed8c1926f2R321
And we have another similar one for MXNet.
tests/unit/test_estimator.py
Outdated
     framework, py_ver, tag = framework_name_from_image('123.dkr.ecr.us-west-2.amazonaws.com/sagemaker-myown-py2-gpu:1')
     assert framework is None
     assert py_ver is None
     assert tag is None
Why would we get 'None' here? The tag will be '1', right?
Not the way framework_name_from_image works. In this case:
name_pattern = re.compile('^sagemaker-(tensorflow|mxnet)-(py2|py3)-(cpu|gpu):(.*)$')
name_match = name_pattern.match(sagemaker_match.group(8))
name_match is None, and everything is None as a result.
I don't see the value in changing this behavior, since we can't really use the tag if we don't have the framework, similar to the python version.
Ah yes, sorry - I missed the 'myown' framework name.
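The behavior described here can be reproduced with the regex quoted above; `parse_repo` below is a hypothetical wrapper for illustration, not the SDK function:

```python
import re

# The inner repository-name pattern quoted above: only sagemaker-tensorflow
# and sagemaker-mxnet repositories in this naming scheme match; anything
# else (like 'sagemaker-myown-py2-gpu:1') yields all Nones.
name_pattern = re.compile(r'^sagemaker-(tensorflow|mxnet)-(py2|py3)-(cpu|gpu):(.*)$')


def parse_repo(repo_and_tag):
    match = name_pattern.match(repo_and_tag)
    if match is None:
        return None, None, None  # unknown framework: tag is unusable too
    return match.group(1), match.group(2), match.group(4)
```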
-def __init__(self, entry_point, source_dir=None, hyperparameters=None, py_version='py2', **kwargs):
+def __init__(self, entry_point, source_dir=None, hyperparameters=None, py_version='py2',
+             framework_version=MXNET_VERSION, **kwargs):
Something to consider: should framework_version be part of the Framework class rather than repeated in each of the child classes?
Maybe in the future; we decided not to do it as part of this change and to keep it similar to py_version.
* Add data_type to hyperparameters (aws#54)
  When we describe a training job the data type of the hyperparameters is lost because we use a dict[str, str]. This adds a new field to Hyperparameter so that we can convert the data types at runtime. Instead of validating with isinstance(), we cast the hp value to the type it is meant to be. This enforces a "strongly typed" value. When we deserialize from the API string responses it becomes easier to deal with too.
* Add wrapper for LDA (aws#56). Update CHANGELOG and bump the version number.
* Add support for async fit() (aws#59)
  When calling fit(wait=False) it will return immediately. The training job will carry on even if the process exits. By using attach() the estimator can be retrieved by providing the training job name. _prepare_init_params_from_job_description() is now a classmethod instead of being a static method. Each class is responsible for implementing its specific logic to convert a training job description into arguments that can be passed to its own __init__().
* Fix Estimator role expansion (aws#68)
  Instead of manually constructing the role ARN, use the IAM boto client to do it. This properly expands service roles and regular roles.
* Add FM and LDA to the documentation (aws#66)
* Fix description of an argument of sagemaker.session.train (aws#69)
  'input_config' should be an array which has channel objects. Add a link to the botocore docs. Use 'list' instead of 'array' in the description.
* Add ntm algorithm with doc, unit tests, integ tests (aws#73)
* JSON serializer: predictor.predict accepts dictionaries (aws#62)
  Add support for serializing python dictionaries to json. Add prediction with dictionary in tf iris integ test.
* Fix timeouts for PCA async integration test (aws#78)
  Execute tf_cifar test without logs to eliminate the delay in detecting that the job has finished.
* Fixes in LinearLearner and addition of unit tests (aws#77)
* Print out billable seconds after training completes (aws#30)
  Added: print out billable seconds after training completes. Fixed: test_session.py to pass unit tests. Fixed: removed offending tzlocal().
* Use sagemaker_timestamp when creating endpoint names in integration tests (aws#81)
* Support TensorFlow-1.5.0 and MXNet-1.0.0 (aws#82)
  Update .gitignore to ignore pytest_cache. Update and refactor tests. Add tests for fw_utils. Fix typo.
* Update changelog for 1.1.0 (aws#85)