I created a Python 3.11 utility that truncates an input string to a fixed word count—splitting on any whitespace, collapsing runs, and dropping trailing stop-words—so you get clean, concise snippets ready for downstream NLP tasks.
What it should do:
- Truncate an input string to at most `max_words` words.
- Split on any whitespace (collapsing runs of spaces, tabs, newlines).
- If the last retained word is a common stop-word (e.g. “of”, “the”), drop it so you don’t end on an article/preposition.
- Return a single string of the truncated words.
- Handle edge cases: empty input, exact fits, `max_words = 1`, and invalid parameters (`max_words < 1`).
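For reference, the whitespace requirement above rests on how `str.split()` behaves when called with no arguments. A minimal sketch of that behaviour follows; the sample string is made up for illustration.

```python
# str.split() with no separator splits on runs of any whitespace
# (spaces, tabs, newlines) and ignores leading/trailing whitespace.
sample = "  run\tin \n the   park  "
words = sample.split()
print(words)            # ['run', 'in', 'the', 'park']
print(" ".join(words))  # run in the park

# Empty and whitespace-only inputs both produce an empty list,
# so joining them yields an empty string.
empty_cases = ["".split(), " \t\n".split()]
print(empty_cases)      # [[], []]
```

One consequence is that a `strip()` call placed before a no-argument `split()` does not change the result.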
Environment & background
- Python 3.11
- Preprocessing predicates or short text snippets before feeding into downstream logic (e.g. building knowledge-graph edges).
- Not a homework or interview question—just looking for best practices and bug-checks.
from typing import Set

_STOP_WORDS: Set[str] = {
    "a",
    "an",
    "the",
    "of",
    "with",
    "by",
    "to",
    "from",
    "in",
    "on",
    "for",
}

def truncate(
    text: str,
    max_words: int = 3
) -> str:
    """
    Truncate `text` to at most `max_words` whitespace-separated words,
    dropping a trailing common stop-word if present.

    Splits on any whitespace (spaces, tabs, newlines), collapsing runs
    into single separators.

    Args:
        text: Input string to truncate.
        max_words: Maximum number of words to retain (must be ≥1).

    Returns:
        A string consisting of up to `max_words` words joined by single spaces.

    Raises:
        ValueError: if `max_words < 1`.

    Examples:
        >>> truncate("run in the park", 3)
        'run in'
        >>> truncate("of the", 2)
        'of the'
    """
    if max_words < 1:
        raise ValueError("max_words must be ≥ 1")
    words = text.strip().split()
    if len(words) <= max_words:
        return " ".join(words)
    head = words[:max_words]
    # Drop trailing stop-word so we don’t end on “of”, “the”, etc.
    if head and head[-1].lower() in _STOP_WORDS:
        head.pop()
    return " ".join(head)

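One thing worth checking against the stated goal: the function above removes at most one trailing stop-word, and only when truncation actually happens, so `"run in the park"` with `max_words=3` still ends on "in". If the intent is never to end on a stop-word, a variant could look like the sketch below. The name `truncate_strict` and the constant `STOP_WORDS` are hypothetical, and whether an empty result is acceptable for inputs made up entirely of stop-words is an assumption that depends on the downstream use.

```python
# Hypothetical variant, not the function under review.
STOP_WORDS = frozenset(
    {"a", "an", "the", "of", "with", "by", "to", "from", "in", "on", "for"}
)


def truncate_strict(text: str, max_words: int = 3) -> str:
    """Keep at most `max_words` words, then strip every trailing stop-word."""
    if max_words < 1:
        raise ValueError("max_words must be >= 1")
    head = text.split()[:max_words]
    # Keep popping while the last retained word is a stop-word.
    while head and head[-1].lower() in STOP_WORDS:
        head.pop()
    return " ".join(head)


print(truncate_strict("run in the park", 3))      # run
print(truncate_strict("alpha in beta gamma", 3))  # alpha in beta
print(repr(truncate_strict("of the", 2)))         # ''
```

Unlike the original, this version also trims stop-words when the input already fits within `max_words`, which is why `"of the"` comes back empty here.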
Unit Tests (basic)
import pytest


def test_error_on_invalid_max_words():
    try:
        truncate("some text", 0)
        assert False, "Expected ValueError for max_words < 1"
    except ValueError as e:
        assert "max_words" in str(e)


def test_no_truncation_if_shorter():
    assert truncate("one two", 3) == "one two"
    assert truncate("a b c", 3) == "a b c"


def test_simple_truncation():
    assert truncate("one two three four", 2) == "one two"


def test_drop_trailing_stopword():
    # 'in' is a stopword, should be removed
    assert truncate("alpha beta in the", 3) == "alpha beta"
    # last kept word not a stopword, so stays
    assert truncate("alpha in beta gamma", 3) == "alpha in beta"


def test_strip_whitespace_and_split():
    # leading/trailing spaces collapse
    assert truncate(" hello world ", 1) == "hello"


def test_mixed_case_and_stopword():
    # stopword removal is case-insensitive
    assert truncate("Run In The Park", 3) == "Run In"


if __name__ == "__main__":
    pytest.main()
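The tests above do not yet touch several of the edge cases listed in the requirements (empty input, `max_words = 1` landing on a stop-word, the default argument). A possible extra test is sketched below. To keep the snippet runnable on its own it carries a condensed copy of the function from the question, and the name `test_edge_cases` is only a suggestion. The expected values reflect the current behaviour, not necessarily the desired one.

```python
_STOP_WORDS = {"a", "an", "the", "of", "with", "by", "to", "from", "in", "on", "for"}


def truncate(text: str, max_words: int = 3) -> str:
    # Condensed copy of the function under review, behaviour unchanged.
    if max_words < 1:
        raise ValueError("max_words must be ≥ 1")
    words = text.strip().split()
    if len(words) <= max_words:
        return " ".join(words)
    head = words[:max_words]
    if head and head[-1].lower() in _STOP_WORDS:
        head.pop()
    return " ".join(head)


def test_edge_cases():
    # Empty and whitespace-only input produce an empty string.
    assert truncate("", 3) == ""
    assert truncate("   \t\n", 3) == ""
    # max_words = 1 on a stop-word leaves nothing behind.
    assert truncate("the quick fox", 1) == ""
    assert truncate("one two three", 1) == "one"
    # Exact fit: no truncation, so the trailing stop-word is kept.
    assert truncate("of the", 2) == "of the"
    # Only one trailing stop-word is removed, so the result ends on "in".
    assert truncate("run in the park", 3) == "run in"
    # The default of 3 is exercised when max_words is omitted.
    assert truncate("one two three four") == "one two three"
    # Negative values are rejected just like zero.
    try:
        truncate("some text", -1)
    except ValueError:
        pass
    else:
        raise AssertionError("Expected ValueError for max_words < 1")
```

The empty-string result for `truncate("the quick fox", 1)` may be worth a deliberate decision, since a downstream consumer might not expect a non-empty input to yield nothing.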
The default value (3) seems absolutely arbitrary. I can't think of a usage where any particular default value would be useful. It seems an unnecessary wrinkle (imho). None of the presented "test cases" capitalise on the default being there.