Commit 1bbe055

holtskinner and galz10 authored
feat: Added Batch creation for Cloud Storage documents. (#66)
* feat: Added Batch creation for Cloud Storage documents.
* Ran Black format on samples
* Update noxfile.py
* Changed Client to use custom user agent header
* Updates to tests and docs
* Fixed Test inputs
* Add link to send processing request page
* Change Import for sample

---------

Co-authored-by: Gal Zahavi <38544478+galz10@users.noreply.github.com>
1 parent 448389a commit 1bbe055

10 files changed: +367 -6 lines changed

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+Document AI Toolbox Utilities
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+.. automodule:: google.cloud.documentai_toolbox.utilities.utilities
+    :members:
+    :private-members:
+    :noindex:

packages/google-cloud-documentai-toolbox/docs/index.rst
Lines changed: 1 addition & 0 deletions
@@ -6,6 +6,7 @@ API Reference
    :maxdepth: 2
 
    documentai_toolbox/wrappers
+   documentai_toolbox/utilities
 
 Changelog
 ---------

packages/google-cloud-documentai-toolbox/google/cloud/documentai_toolbox/__init__.py
Lines changed: 4 additions & 5 deletions
@@ -28,9 +28,8 @@
     converters,
 )
 
-__all__ = (
-    document,
-    page,
-    entity,
-    converters,
+from .utilities import (
+    utilities,
 )
+
+__all__ = (document, page, entity, converters, utilities)
Collapse file

‎packages/google-cloud-documentai-toolbox/google/cloud/documentai_toolbox/constants.py‎

Copy file name to clipboardExpand all lines: packages/google-cloud-documentai-toolbox/google/cloud/documentai_toolbox/constants.py
+16Lines changed: 16 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -20,3 +20,19 @@
2020
JSON_MIMETYPE = "application/json"
2121

2222
FILE_CHECK_REGEX = r"(.*[.].*$)"
23+
24+
# https://cloud.google.com/document-ai/quotas
25+
BATCH_MAX_FILES = 50
26+
# 1GB in Bytes
27+
BATCH_MAX_FILE_SIZE = 1073741824
28+
BATCH_MAX_REQUESTS = 5
29+
30+
# https://cloud.google.com/document-ai/docs/file-types
31+
VALID_MIME_TYPES = {
32+
"application/pdf",
33+
"image/bmp" "image/gif",
34+
"image/jpeg",
35+
"image/png",
36+
"image/tiff",
37+
"image/webp",
38+
}
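
The constants above feed two per-file checks later in the commit. As a minimal sketch of how they could be applied to a single file (the helper name `skip_reason` is hypothetical and not part of the toolbox; the values are copied from the diff, and `image/bmp` and `image/gif` are assumed to be separate set entries, since adjacent Python string literals without a comma are concatenated into one string):

```python
from typing import Optional

# Values copied from the constants in the diff above.
BATCH_MAX_FILE_SIZE = 1073741824
VALID_MIME_TYPES = {
    "application/pdf",
    "image/bmp",
    "image/gif",
    "image/jpeg",
    "image/png",
    "image/tiff",
    "image/webp",
}


def skip_reason(content_type: str, size: int) -> Optional[str]:
    """Hypothetical helper: say why a file would be skipped, or None if it is usable."""
    if content_type not in VALID_MIME_TYPES:
        return f"invalid MIME type {content_type}"
    if size > BATCH_MAX_FILE_SIZE:
        return f"file size {size} exceeds {BATCH_MAX_FILE_SIZE} bytes"
    return None


print(skip_reason("application/pdf", 1024))
print(skip_reason("application/json", 1024))
print(skip_reason("image/gif", BATCH_MAX_FILE_SIZE + 1))
```

A file exactly at the size limit passes this check, because the comparison is a strict `>`.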

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+# -*- coding: utf-8 -*-
+# Copyright 2023 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#

Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
+# -*- coding: utf-8 -*-
+# Copyright 2023 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+"""Document AI utilities."""
+
+from typing import List, Optional
+
+from google.cloud import documentai
+
+from google.cloud.documentai_toolbox import constants
+from google.cloud.documentai_toolbox.wrappers.document import _get_storage_client
+
+
+def create_batches(
+    gcs_bucket_name: str,
+    gcs_prefix: str,
+    batch_size: Optional[int] = constants.BATCH_MAX_FILES,
+) -> List[documentai.BatchDocumentsInputConfig]:
+    """Create batches of documents in Cloud Storage to process with `batch_process_documents()`.
+
+    Args:
+        gcs_bucket_name (str):
+            Required. The name of the gcs bucket.
+
+            Format: `gs://bucket/optional_folder/target_folder/` where gcs_bucket_name=`bucket`.
+        gcs_prefix (str):
+            Required. The prefix of the files in the `target_folder`.
+
+            Format: `gs://bucket/optional_folder/target_folder/` where gcs_prefix=`optional_folder/target_folder`.
+        batch_size (Optional[int]):
+            Optional. Size of each batch of documents. Default is `50`.
+
+    Returns:
+        List[documentai.BatchDocumentsInputConfig]:
+            A list of `BatchDocumentsInputConfig`, each corresponding to one batch.
+    """
+    if batch_size > constants.BATCH_MAX_FILES:
+        raise ValueError(
+            f"Batch size must be less than or equal to {constants.BATCH_MAX_FILES}. You provided {batch_size}."
+        )
+
+    storage_client = _get_storage_client()
+    blob_list = storage_client.list_blobs(gcs_bucket_name, prefix=gcs_prefix)
+    batches: List[documentai.BatchDocumentsInputConfig] = []
+    batch: List[documentai.GcsDocument] = []
+
+    for blob in blob_list:
+        # Skip Directories
+        if blob.name.endswith("/"):
+            continue
+
+        if blob.content_type not in constants.VALID_MIME_TYPES:
+            print(f"Skipping file {blob.name}. Invalid Mime Type {blob.content_type}.")
+            continue
+
+        if blob.size > constants.BATCH_MAX_FILE_SIZE:
+            print(
+                f"Skipping file {blob.name}. File size must be less than or equal to {constants.BATCH_MAX_FILE_SIZE} bytes. File size is {blob.size} bytes."
+            )
+            continue
+
+        if len(batch) == batch_size:
+            batches.append(
+                documentai.BatchDocumentsInputConfig(
+                    gcs_documents=documentai.GcsDocuments(documents=batch)
+                )
+            )
+            batch = []
+
+        batch.append(
+            documentai.GcsDocument(
+                gcs_uri=f"gs://{gcs_bucket_name}/{blob.name}",
+                mime_type=blob.content_type,
+            )
+        )
+
+    if batch != []:
+        # Append the last batch, which could be less than `batch_size`
+        batches.append(
+            documentai.BatchDocumentsInputConfig(
+                gcs_documents=documentai.GcsDocuments(documents=batch)
+            )
+        )
+
+    return batches
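
The grouping step in `create_batches()` can be illustrated without any Cloud dependencies. The following is only a sketch of the same fixed-size chunking idea on plain strings; the function name `chunk_uris` and the sample URIs are made up for illustration, and slicing is used here in place of the accumulate-and-flush loop in the commit:

```python
from typing import List


def chunk_uris(uris: List[str], batch_size: int = 50) -> List[List[str]]:
    """Hypothetical helper: split URIs into consecutive groups of at most batch_size."""
    if batch_size < 1:
        raise ValueError(f"batch_size must be at least 1. You provided {batch_size}.")
    return [uris[i : i + batch_size] for i in range(0, len(uris), batch_size)]


# 96 files with the default size of 50 give one full batch and one partial batch.
uris = [f"gs://bucket/path/to/folder/file{i}.pdf" for i in range(96)]
sizes = [len(group) for group in chunk_uris(uris)]
print(sizes)  # [50, 46]
```

As in the commit, the last group may be smaller than `batch_size`, and an empty input yields no groups at all.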

Lines changed: 52 additions & 0 deletions
@@ -0,0 +1,52 @@
+# Copyright 2023 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+# [START documentai_toolbox_create_batches]
+
+from google.cloud import documentai
+from google.cloud.documentai_toolbox import utilities
+
+# TODO(developer): Uncomment these variables before running the sample.
+# Given unprocessed documents in path gs://bucket/path/to/folder
+# gcs_bucket_name = "bucket"
+# gcs_prefix = "path/to/folder"
+# batch_size = 50
+
+
+def create_batches_sample(
+    gcs_bucket_name: str,
+    gcs_prefix: str,
+    batch_size: int = 50,
+) -> None:
+    # Creating batches of documents for processing
+    batches = utilities.create_batches(
+        gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_prefix, batch_size=batch_size
+    )
+
+    print(f"{len(batches)} batch(es) created.")
+    for batch in batches:
+        print(f"{len(batch.gcs_documents.documents)} files in batch.")
+        print(batch.gcs_documents.documents)
+
+        # Use as input for batch_process_documents()
+        # Refer to https://cloud.google.com/document-ai/docs/send-request
+        # for how to send a batch processing request
+        request = documentai.BatchProcessRequest(
+            name="processor_name", input_documents=batch
+        )
+
+
+# [END documentai_toolbox_create_batches]
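
The sample expects the bucket name and prefix as separate arguments. If the starting point is a full `gs://` URI, a small splitter along these lines could produce both values. This is a sketch only: `split_gcs_uri` is a hypothetical helper, not a toolbox function, and it drops any trailing slash so the prefix matches the `optional_folder/target_folder` form shown in the docstring:

```python
from typing import Tuple


def split_gcs_uri(gcs_uri: str) -> Tuple[str, str]:
    """Hypothetical helper: split gs://bucket/path/to/folder into (bucket, prefix)."""
    scheme = "gs://"
    if not gcs_uri.startswith(scheme):
        raise ValueError(f"Expected a URI starting with {scheme}. You provided {gcs_uri}.")
    # Everything up to the first slash is the bucket; the rest is the prefix.
    bucket, _, prefix = gcs_uri[len(scheme):].partition("/")
    if not bucket:
        raise ValueError(f"No bucket name found in {gcs_uri}.")
    return bucket, prefix.rstrip("/")


gcs_bucket_name, gcs_prefix = split_gcs_uri("gs://bucket/path/to/folder")
print(gcs_bucket_name)  # bucket
print(gcs_prefix)  # path/to/folder
```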

packages/google-cloud-documentai-toolbox/samples/snippets/noxfile.py
Lines changed: 1 addition & 1 deletion
@@ -282,4 +282,4 @@ def readmegen(session: nox.sessions.Session, path: str) -> None:
     in_file = os.path.join(dir_, "README.rst.in")
     session.run(
         "python", _get_repo_root() + "/scripts/readme-gen/readme_gen.py", in_file
-    )
+    )

Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+# Copyright 2023 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+
+import pytest
+from samples.snippets import create_batches_sample
+
+gcs_bucket_name = "cloud-samples-data"
+gcs_input_uri = "documentai_toolbox/document_batches/"
+batch_size = 50
+
+
+def test_create_batches_sample(capsys: pytest.CaptureFixture) -> None:
+    create_batches_sample.create_batches_sample(
+        gcs_bucket_name=gcs_bucket_name, gcs_prefix=gcs_input_uri, batch_size=batch_size
+    )
+    out, _ = capsys.readouterr()
+
+    assert "2 batch(es) created." in out
+    assert "50 files in batch." in out
+    assert "47 files in batch." in out

Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
+# pylint: disable=protected-access
+# -*- coding: utf-8 -*-
+# Copyright 2023 Google LLC
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+import pytest
+
+from google.cloud.documentai_toolbox.utilities import utilities
+
+# try/except added for compatibility with python < 3.8
+try:
+    from unittest import mock
+except ImportError:  # pragma: NO COVER
+    import mock
+
+
+test_bucket = "test-directory"
+test_prefix = "documentai/input"
+
+
+@mock.patch("google.cloud.documentai_toolbox.wrappers.document.storage")
+def test_create_batches_with_3_documents(mock_storage, capfd):
+    client = mock_storage.Client.return_value
+    mock_bucket = mock.Mock()
+    client.Bucket.return_value = mock_bucket
+
+    mock_blobs = []
+    for i in range(3):
+        mock_blob = mock.Mock(
+            name=f"test_file{i}.pdf", content_type="application/pdf", size=1024
+        )
+        mock_blob.name.endswith.return_value = False
+        mock_blobs.append(mock_blob)
+    client.list_blobs.return_value = mock_blobs
+
+    actual = utilities.create_batches(
+        gcs_bucket_name=test_bucket, gcs_prefix=test_prefix
+    )
+
+    mock_storage.Client.assert_called_once()
+
+    out, err = capfd.readouterr()
+    assert out == ""
+    assert len(actual) == 1
+    assert len(actual[0].gcs_documents.documents) == 3
+
+
+def test_create_batches_with_invalid_batch_size(capfd):
+    with pytest.raises(ValueError, match="Batch size must be less than"):
+        utilities.create_batches(
+            gcs_bucket_name=test_bucket, gcs_prefix=test_prefix, batch_size=51
+        )
+
+    out, err = capfd.readouterr()
+    assert out == ""
+    assert err == ""
+
+
+@mock.patch("google.cloud.documentai_toolbox.wrappers.document.storage")
+def test_create_batches_with_large_folder(mock_storage, capfd):
+    client = mock_storage.Client.return_value
+    mock_bucket = mock.Mock()
+    client.Bucket.return_value = mock_bucket
+
+    mock_blobs = []
+    for i in range(96):
+        mock_blob = mock.Mock(
+            name=f"test_file{i}.pdf", content_type="application/pdf", size=1024
+        )
+        mock_blob.name.endswith.return_value = False
+        mock_blobs.append(mock_blob)
+    client.list_blobs.return_value = mock_blobs
+
+    actual = utilities.create_batches(
+        gcs_bucket_name=test_bucket, gcs_prefix=test_prefix
+    )
+
+    mock_storage.Client.assert_called_once()
+
+    out, err = capfd.readouterr()
+    assert out == ""
+    assert len(actual) == 2
+    assert len(actual[0].gcs_documents.documents) == 50
+    assert len(actual[1].gcs_documents.documents) == 46
+
+
+@mock.patch("google.cloud.documentai_toolbox.wrappers.document.storage")
+def test_create_batches_with_invalid_file_type(mock_storage, capfd):
+    client = mock_storage.Client.return_value
+    mock_bucket = mock.Mock()
+    client.Bucket.return_value = mock_bucket
+
+    mock_blob = mock.Mock(
+        name="test_file.json", content_type="application/json", size=1024
+    )
+    mock_blob.name.endswith.return_value = False
+    client.list_blobs.return_value = [mock_blob]
+
+    actual = utilities.create_batches(
+        gcs_bucket_name=test_bucket, gcs_prefix=test_prefix
+    )
+
+    mock_storage.Client.assert_called_once()
+
+    out, err = capfd.readouterr()
+    assert "Invalid Mime Type" in out
+    assert actual == []
+
+
+@mock.patch("google.cloud.documentai_toolbox.wrappers.document.storage")
+def test_create_batches_with_large_file(mock_storage, capfd):
+    client = mock_storage.Client.return_value
+    mock_bucket = mock.Mock()
+    client.Bucket.return_value = mock_bucket
+
+    mock_blob = mock.Mock(
+        name="test_file.pdf", content_type="application/pdf", size=2073741824
+    )
+    mock_blob.name.endswith.return_value = False
+    client.list_blobs.return_value = [mock_blob]
+
+    actual = utilities.create_batches(
+        gcs_bucket_name=test_bucket, gcs_prefix=test_prefix
+    )
+
+    mock_storage.Client.assert_called_once()
+
+    out, err = capfd.readouterr()
+    assert "File size must be less than" in out
+    assert actual == []
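
A side note on the mocks in these tests, offered as a sketch rather than a change to the commit: in `unittest.mock`, the `name` keyword of the `Mock` constructor labels the mock for its repr and does not set a `.name` attribute. Reading `.name` on such a mock returns another mock, which is why the tests stub `name.endswith` explicitly. If a real string name were wanted (for example, to assert on the generated `gs://` URIs), assigning the attribute after construction is one option. The file and bucket names below are only illustrative:

```python
from unittest import mock

# Passing name= to the constructor only labels the mock;
# reading .name afterwards yields a child Mock, not the string.
labelled = mock.Mock(name="test_file.pdf", content_type="application/pdf")
name_is_str = isinstance(labelled.name, str)

# Assigning the attribute after construction gives a real string,
# so string methods such as endswith() behave normally.
blob = mock.Mock(content_type="application/pdf", size=1024)
blob.name = "test_file.pdf"
gcs_uri = f"gs://test-directory/{blob.name}"

print(name_is_str)  # False
print(gcs_uri)  # gs://test-directory/test_file.pdf
```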
