performance

Performance

This is a collection of tools helpful for inspecting and tracking performance of the Unstructured library.

The benchmarking script allows a user to track performance time to partitioning results against a fixed set of test documents and store those results with indication of architecture, instance type, and git hash, in S3.

The profiling script allows a user to inspect how time time and memory are spent across called functions when performing partitioning on a given document.

Install

Benchmarking requires no additional dependencies and should work without any initial setup. Profiling has a few dependencies which can be installed with:

pip install -r scripts/performance/requirements.txt
npm install -g speedscope

The second dependency speedscope provides a tool to view profiling results from py-spy locally. Alternatively you can also drop the profile result *.speedscope into https://www.speedscope.app/ to view the results online.

Run

Benchmark

Export / assign desired environment variable settings:

DOCKER_TEST: Set to true to run benchmark inside a Docker container (default: false)
NUM_ITERATIONS: Number of iterations for benchmark (e.g., 100) (default: 3)
INSTANCE_TYPE: Type of benchmark instance (e.g., "c5.xlarge") (default: unspecified)
PUBLISH_RESULTS: Set to true to publish results to S3 bucket (default: false)

Usage: ./scripts/performance/benchmark.sh

Profile

Export / assign desired environment variable settings:

DOCKER_TEST: Set to true to run profiling inside a Docker container (default: false)

Usage:

on Linux: ./scripts/performance/profile.sh

on macOS: sudo -E ./scripts/performance/profile.sh; py-spy requires su to run on macOS

Run the script and choose the profiling mode: 'run' or 'view'.
In the 'run' mode, you can profile custom files or select existing test files.
In the 'view' mode, you can view previously generated profiling results.
The script supports time profiling with cProfile and memory profiling with memray.
Users can choose different visualization options such as flamegraphs, tables, trees, summaries, and statistics.
Test documents are synced from an S3 bucket to a local directory before running the profiles

Name	Name	Last commit message	Last commit date
parent directory ..
docs	docs
warmup_docs	warmup_docs
.gitignore	.gitignore
README.md	README.md
benchmark-local.sh	benchmark-local.sh
benchmark.sh	benchmark.sh
get-stats-name.sh	get-stats-name.sh
profile.sh	profile.sh
requirements.txt	requirements.txt
run_partition.py	run_partition.py
time_partition.py	time_partition.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Expand file tree

README.md

Performance

Install

Run

Benchmark

Profile

Search code, repositories, users, issues, pull requests...

FilesExpand file tree

performance

Directory actions

More options

Directory actions

More options

Latest commit

History

performance

Folders and files

parent directory

README.md

Performance

Install

Run

Benchmark

Profile

Expand file tree