Performance/caching issue: tokenizer fails to reset has_special flag after encountering special span, effectively disabling caching #13950

How to reproduce the behaviour

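from spacy.lang.en import English

# A blank English pipeline is enough: only the tokenizer is involved.
nlp = English()
doc = nlp("I can't believe you have done this")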

"can't" is a tokenizer exception because of the funky contraction (yay English). The exception is defined in spaCy/spacy/lang/en/tokenizer_exceptions.py, line 233 in 0069cf9:

{ORTH: "ca", NORM: "can"},

Spans that contain these exceptions are marked as has_special. The flag is declared in spaCy/spacy/tokenizer.pyx, line 179 in 0069cf9:

cdef int has_special = 0

It is set in spaCy/spacy/tokenizer.pyx, line 375 in 0069cf9:

has_special[0] = 1

And has_special spans are not cached, per the check in spaCy/spacy/tokenizer.pyx, line 523 in 0069cf9:

if has_special[0]:

The problem is that has_special, once set to a nonzero value, is never reset. As a result, once the tokenizer encounters a special case, every subsequent span is also treated as special, and none of them get cached, even when they should be.
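To make the effect concrete, here is a minimal pure-Python sketch of the behaviour described above. It is a toy model, not spaCy's actual code: `ToyTokenizer`, `SPECIAL_CASES`, and the `reset_flag` switch are hypothetical names, and it assumes the flag is declared once per tokenizer call and checked once per whitespace-delimited span.

```python
# Toy model of the caching logic described in this issue; NOT spaCy's code.
# ToyTokenizer, SPECIAL_CASES and reset_flag are hypothetical names.

SPECIAL_CASES = {"can't": ["ca", "n't"]}


class ToyTokenizer:
    def __init__(self, reset_flag):
        # reset_flag=False models the reported bug, True models a possible fix.
        self.reset_flag = reset_flag
        self.cache = {}

    def _split(self, span, has_special):
        # Setting the flag here mirrors `has_special[0] = 1`.
        if span in SPECIAL_CASES:
            has_special[0] = 1
            return list(SPECIAL_CASES[span])
        return [span]

    def tokenize(self, text):
        has_special = [0]  # assumed: declared once per call, like the cdef int
        tokens = []
        for span in text.split():
            if self.reset_flag:
                has_special[0] = 0  # the reset the buggy path never performs
            if span in self.cache:
                tokens.extend(self.cache[span])
                continue
            pieces = self._split(span, has_special)
            if not has_special[0]:
                # Only spans not flagged as special are cached.
                self.cache[span] = pieces
            tokens.extend(pieces)
        return tokens


text = "I can't believe you have done this"
buggy = ToyTokenizer(reset_flag=False)
fixed = ToyTokenizer(reset_flag=True)
buggy_tokens = buggy.tokenize(text)
fixed_tokens = fixed.tokenize(text)

print(sorted(buggy.cache))  # ['I']
print(sorted(fixed.cache))  # ['I', 'believe', 'done', 'have', 'this', 'you']
```

Both variants produce the same tokens; the only difference is how much ends up in the cache. In the buggy variant, nothing after "can't" is cached.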

This has some fairly significant tokenizer performance implications. It should be much faster than it is. I'll put some benchmarking in my PR.

Your Environment

Info about spaCy

spaCy version: 3.8.14
Platform: macOS-26.3.1-arm64-arm-64bit
Python version: 3.12.4
Pipelines: en_core_web_md (3.8.0), en_core_web_sm (3.8.0)
