Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Appearance settings

fix: Use PSS instead of RSS to estimate children process memory usage on Linux #1210

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jun 9, 2025

Conversation

Pijukatel
Copy link
Collaborator

@Pijukatel Pijukatel commented May 21, 2025

Description

To estimate process memory usage use Proportional Set Size (PSS) to estimate process memory usage of the process and all it's children to avoid overestimation of used memory due to same shared memory being counted multiple times when using Resident Set Size (RSS).

PSS is available only on Linux, so this improved estimation will work only there.

Add test.

Issues

@github-actions github-actions bot added this to the 115th sprint - Tooling team milestone May 21, 2025
@github-actions github-actions bot added t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics. labels May 21, 2025
@Pijukatel Pijukatel requested a review from Copilot May 21, 2025 14:12
src/crawlee/_utils/system.py Outdated Show resolved Hide resolved

total_size_bytes = psutil.virtual_memory().total

return MemoryInfo(
total_size=ByteSize(total_size_bytes),
current_size=ByteSize(current_size_bytes),
)


def _get_used_memory(memory_full_info: Any) -> int:
Copy link
Collaborator Author

@Pijukatel Pijukatel May 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This internal type hint does not seem to be available. The actual type is dependent on the OS as well.

Copilot

This comment was marked as outdated.

@Pijukatel
Copy link
Collaborator Author

Pijukatel commented May 23, 2025

Based on the docs:

Using uss might actually underestimate memory usage as it would not count the shared memory at all. Seems like pss might be the best approximation of used memory in our case, but is available only on Linux. Maybe we iteratively improve at least the Linux based estimation for now and leave improving Windows and other OS based estimation for later

@Pijukatel Pijukatel changed the title fix: Use USS instead of RSS to estimate children process memory usage fix: Use PSS instead of RSS to estimate children process memory usage on Linux May 30, 2025
@Pijukatel Pijukatel marked this pull request as ready for review May 30, 2025 20:12
@Pijukatel
Copy link
Collaborator Author

To come up with the test was really hard. The test is not nice at all but testing the memory usage estimation is really tricky due to to Python being too high-level for some precise memory control.

@Pijukatel Pijukatel requested review from janbuchar and vdusek May 30, 2025 20:14
Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for looking into this.

Btw. so in Windows are giving up, since the PSS isn't a concept there? Or is there a chance to use some alternative metric? If so, maybe we can open a follow-up issue?

@janbuchar
Copy link
Collaborator

janbuchar commented Jun 5, 2025

Couple of questions 🙂

  1. how does this compare to what we do in the JS version? It looks like it also uses RSS (https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/memory-info.ts#L85-L85) - I'd like to know why it doesn't cause problems there (or does it?)
  2. when running locally, Crawlee thinks "memory is overloaded" means "stuff that we own takes up more memory than some configured part of the total system memory" - if our calculation of the used memory fails to take something into account, we could easily smother the system
    • are you aware of anything "missed" by USS in a Crawlee workload?
    • do we have some safety mechanism that would say the system is overloaded if the system memory has something like >95% utilization, regardless of CRAWLEE_AVAILABLE_MEMORY_RATIO? shouldn't we?

@Pijukatel
Copy link
Collaborator Author

Pijukatel commented Jun 5, 2025

  1. how does this compare to what we do in the JS version? It looks like it also uses RSS (https://github.com/apify/crawlee/blob/master/packages/utils/src/internals/memory-info.ts#L85-L85) - I'd like to know why it doesn't cause problems there (or does it?)

So far I can only guess. I have to make some experiments with the JS version to have some data. But if it relies on RSS only, then I think it could also overestimate used memory.

do we have some safety mechanism that would say the system is overloaded if the system memory has something like >95% utilization, regardless of CRAWLEE_AVAILABLE_MEMORY_RATIO? shouldn't we?

Yes, I think it could be good safety measure to bound it like that, regardless of this change.

are you aware of anything "missed" by USS in a Crawlee workload?

In Crawlee only probably not, but my guess is, that multiple Playwright processes could actually use some shared memory which would be overestimated by RSS and probably underestimated with USS. So PSS seems to me like the best in our case as it takes into account shared memory in somewhat predictable way.

@B4nan
Copy link
Member

B4nan commented Jun 5, 2025

So far I can only guess. I have to make some experiments with the JS version to have some data. But if it relies on RSS only, then I think it could also overestimate used memory.

Keep in mind that on the platform, memory usage (and pretty much all the scaling metrics) is coming over websockets, we don't measure it ourselves, so it's very much possible we dont do it perfectly, and nobody noticed, since on localhost, we use 1/4 of the available memory by default. Also given the memory scales with CPU, you usually run things with enough memory.

@Pijukatel Pijukatel merged commit 436032f into master Jun 9, 2025
23 checks passed
@Pijukatel Pijukatel deleted the memory-tests branch June 9, 2025 08:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
t-tooling Issues with this label are in the ownership of the tooling team. tested Temporary label used only programatically for some analytics.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

The memory usage shown by the snapshotter is inconsistent with that shown by htop.
4 participants
Morty Proxy This is a proxified and sanitized view of the page, visit original site.