Content sniffing implementation details

Last month I spent two weeks implementing content sniffing that was behaviorally identical to Firefox's implementation. Unfortunately, I lost the laptop before I pushed the changes, so I am documenting what is necessary in case anyone (maybe me?) is interested in implementing a content sniffer.

The full implementation (code and comments) consisted of about 3-5k lines of JS (unit tests were written but are not included in this count).

The implementation details are as follows (this is a brain dump from my recollection):

The new webRequest.filterResponseData API can be used to inspect and modify the response body. This filter is activated after the webRequest.onHeadersReceived event stage, for http(s) only. There are several bugs; see the list that I appended to the bug that introduced this new webRequest method: https://bugzilla.mozilla.org/show_bug.cgi?id=1255894#a48785057_447061
Content sniffing happens in two stages (many more details below):
- First, entries in the NS_CONTENT_SNIFFER_CATEGORY (aka "net-content-sniffers") category are used to estimate the MIME type.
- If the type is still unknown, then basically the logic of nsUnknownDecoder::DetermineContentType is used (which includes entries from the NS_DATA_SNIFFER_CATEGORY (aka "content-sniffing-services") category).
The extension can force a specific content type after the onHeadersReceived stage by using webRequest.filterResponseData to change the response body. For some types, prepending magic bytes can be done in a transparent way (e.g. HTML and plain text). For others, the response can be replaced with an HTML document that in turn embeds a full-page iframe that requests the original URL (with a cache buster). The extension can then intercept this request and pipe the original response to this new request. The reason for using an iframe is to ensure that the original response stream is not aborted. If the original response is not important, redirecting would work too.
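The prepending approach can be sketched roughly as follows. This is a minimal illustration, not the lost implementation: `prependToResponse` is a hypothetical helper, and the `"<!DOCTYPE html>"` prefix is only a placeholder for whatever bytes make the sniffer pick the intended type. The wiring at the bottom assumes a background script with the `webRequest` and `webRequestBlocking` permissions.

```javascript
// Hypothetical helper: prepend `prefix` (a Uint8Array) to a filtered response.
// `filter` is expected to behave like the StreamFilter returned by
// browser.webRequest.filterResponseData(requestId).
function prependToResponse(filter, prefix) {
  let prefixWritten = false;

  filter.ondata = (event) => {
    if (!prefixWritten) {
      filter.write(prefix);
      prefixWritten = true;
    }
    // Pass the original chunk through unchanged.
    filter.write(event.data);
  };

  filter.onstop = () => {
    // An empty body never triggers ondata, so write the prefix here.
    if (!prefixWritten) {
      filter.write(prefix);
      prefixWritten = true;
    }
    filter.close();
  };
}

// Wiring inside an extension; skipped when no `browser` global exists.
if (typeof browser !== "undefined" && browser.webRequest) {
  browser.webRequest.onHeadersReceived.addListener(
    (details) => {
      const filter = browser.webRequest.filterResponseData(details.requestId);
      // Placeholder prefix (assumption): pick bytes that match the target type.
      prependToResponse(filter, new TextEncoder().encode("<!DOCTYPE html>"));
    },
    { urls: ["<all_urls>"], types: ["main_frame", "sub_frame"] },
    ["blocking"]
  );
}
```

Keeping the stream logic in a function that only touches `ondata`, `onstop`, `write` and `close` makes it possible to unit test it with a plain mock object instead of a real StreamFilter.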
Basically, Firefox follows this logic to determine what to do with a given response body:
- Extract the MIME type from the Content-Type header.
  - Implementation: https://searchfox.org/mozilla-central/rev/8a6a6bef7c54425970aa4fb039cc6463a19c0b7f/netwerk/base/nsURLHelper.cpp#978-1030
- If the MIME is not set or an empty string, treat it as "application/x-unknown-content-type" and continue at the next bullet point.
- If the MIME is supported by Firefox, display inline and don't sniff (follow the logic at nsDocumentOpenInfo::DispatchContent, as I mentioned at https://github.com/Rob--W/open-in-browser/issues/1#issuecomment-331710653)
  - Exception: for the text/plain, application/octet-stream and application/x-unknown-content-type MIME types, Firefox MAY activate content sniffing, and open a download dialog even if the content would otherwise be displayed inline (text/plain), or display the content inline even though the content usually triggers a download dialog (application/octet-stream).
- If the MIME is not recognized by Firefox, open a download dialog.
- If the MIME is application/octet-stream or application/x-unknown-content-type, perform media sniffing:
  - Implementation: https://searchfox.org/mozilla-central/rev/8a6a6bef7c54425970aa4fb039cc6463a19c0b7f/toolkit/components/mediasniffer/nsMediaSniffer.cpp#141-210
  - Note: If a document was sniffed as media, Firefox will immediately switch to a media document, and the webRequest.filterResponseData method can NOT be used to modify the response stream. To replace the document, you must run a content script in this new media document.
- If the MIME is text/html or application/octet-stream, or contains "xml", then the feed sniffer is activated.
  - Implementation: https://searchfox.org/mozilla-central/rev/8a6a6bef7c54425970aa4fb039cc6463a19c0b7f/browser/components/feeds/nsFeedSniffer.cpp#206-336
  - Note: I did not implement this because of the rare conditions, and the fact that the type was already inline (I only need to implement content sniffing if the type is potentially going to display a download dialog, since Open in Browser is only relevant for that situation).
- If the Content-Type is a case-sensitive match for text/plain, text/plain; charset=ISO-8859-1, text/plain; charset=iso-8859-1 or text/plain; charset=UTF-8, AND the Content-Encoding response header is NOT set, then the sniffer will either force a download dialog or display inline:
  - Implementation: https://searchfox.org/mozilla-central/rev/091894faeac5b54b7e40b0a304c3d3268f7b645d/netwerk/streamconv/converters/nsUnknownDecoder.cpp#895-943
  - Basically, if the body starts with a Unicode BOM, or the first 512 bytes (or fewer if the response ends early) consist only of text characters: treat as text. Otherwise treat as application/octet-stream, i.e. a download dialog.
    - Implementation: https://searchfox.org/mozilla-central/rev/091894faeac5b54b7e40b0a304c3d3268f7b645d/netwerk/streamconv/converters/nsUnknownDecoder.cpp#666-714
- If the MIME is "application/x-unknown-content-type" (or empty, as mentioned before), sniff magic bytes.
  - Implementation: https://searchfox.org/mozilla-central/rev/091894faeac5b54b7e40b0a304c3d3268f7b645d/netwerk/streamconv/converters/nsUnknownDecoder.cpp#434-530
  - Basically, the MIME is found in the following order:
    1. Look at magic bytes.
    2. Call the sniffers in the NS_DATA_SNIFFER_CATEGORY (aka "content-sniffing-services") category
      - Media sniffer - https://searchfox.org/mozilla-central/rev/8a6a6bef7c54425970aa4fb039cc6463a19c0b7f/toolkit/components/mediasniffer/nsMediaSniffer.cpp#141-210 (complicated - magic bytes and structure parsing)
      - Image sniffer - https://searchfox.org/mozilla-central/rev/8a6a6bef7c54425970aa4fb039cc6463a19c0b7f/image/imgLoader.cpp#2646-2701 (simple - magic bytes only)
    3. Try HTML sniffing.
    4. Try sniffing from the URL.
    5. Fall back to the same method as text/plain sniffing (which would result in text/plain or application/octet-stream).
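The text/plain fallback described above can be sketched as follows. This is a minimal reconstruction, not Firefox's code: the function names are hypothetical, and both the definition of a "text character" (any byte above 31, plus TAB through CR, plus ESC) and the exact set of BOMs checked are assumptions that should be verified against the linked nsUnknownDecoder.cpp source.

```javascript
// Number of leading bytes inspected, as stated in the list above.
const SNIFF_LENGTH = 512;

// Assumption: bytes above 31, TAB..CR (9-13) and ESC (27) count as text.
function isTextByte(b) {
  return b > 31 || (b >= 9 && b <= 13) || b === 27;
}

// Assumption: only the UTF-16 (BE/LE) and UTF-8 BOMs are checked here.
function hasUnicodeBom(bytes) {
  const [a, b, c] = bytes;
  return (
    (a === 0xfe && b === 0xff) || // UTF-16 big endian
    (a === 0xff && b === 0xfe) || // UTF-16 little endian
    (a === 0xef && b === 0xbb && c === 0xbf) // UTF-8
  );
}

// `bytes` is a Uint8Array holding the start of the response body.
function sniffTextOrBinary(bytes) {
  const head = bytes.subarray(0, SNIFF_LENGTH);
  if (hasUnicodeBom(head) || head.every(isTextByte)) {
    return "text/plain";
  }
  return "application/octet-stream";
}
```

For example, a PNG signature contains the byte 0x1A, which is not a text character under the assumption above, so `sniffTextOrBinary` would classify it as application/octet-stream.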

Other notes relevant for the implementation:

Content sniffing relies on up to 512 bytes of data, but the media sniffer may try to use more if available.
At least for text and HTML, Firefox will only display the response after 512 bytes of data have been written (or 1024, I don't remember).
For images and media, Firefox will switch to a special image/media document upon detecting the type (typically via magic bytes; for media sniffer more than magic bytes).
There is a draft for a specification at https://mimesniff.spec.whatwg.org/. This specification is close to Firefox's content sniffing. It does not have any mention of media sniffing for application/octet-stream, and it does not mention the special application/x-unknown-content-type either (this MIME is an artefact of Firefox's implementation; internally it represents the default value for a MIME type in an HTTP channel).
Character encoding should be respected/supported. For text/plain the UTF-8 and UTF-16 BOM can be used. For text/html, the content can be transcoded via the TextDecoder/TextEncoder APIs (except for UTF-16, which should not be used for HTML anyway).
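Transcoding a streamed response to UTF-8 with TextDecoder/TextEncoder could look roughly like this. `createUtf8Transcoder` is a hypothetical helper, and the source charset is assumed to be known already (e.g. taken from the Content-Type header). The `{ stream: true }` option matters because a multi-byte sequence may be split across two chunks.

```javascript
// Hypothetical helper: converts chunks in `sourceCharset` to UTF-8 bytes.
function createUtf8Transcoder(sourceCharset) {
  const decoder = new TextDecoder(sourceCharset);
  const encoder = new TextEncoder(); // TextEncoder always produces UTF-8.

  return {
    // Convert one chunk (ArrayBuffer or typed array). With { stream: true },
    // an incomplete trailing sequence is buffered until the next call.
    transcode(chunk) {
      return encoder.encode(decoder.decode(chunk, { stream: true }));
    },
    // Flush whatever the decoder still holds at the end of the response.
    finish() {
      return encoder.encode(decoder.decode());
    },
  };
}
```

In a StreamFilter, `transcode` would be called from `ondata` and `finish` from `onstop`, with each result passed to `filter.write()`.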

Bugs in the webRequest.filterResponseData API that I haven't reported upstream (yet?):

If the Content-Type is application/x-unknown-content-type and the response is content-encoded, then the filtered response must also be encoded using the same type (e.g. gzipped) (for other types, e.g. text/html, the encoding is transparent, i.e. the value of the Content-Encoding header does not matter). The easiest way around this is to remove the Accept-Encoding request header or the Content-Encoding response header (or set it to "identity"). The more difficult way to get around this is to implement gzipping (and possibly other (obscure) encoding schemes such as deflate/brotli).
If a StreamFilter is closed, Firefox will always commit a navigation to a new document, even if no data was written to that StreamFilter, and even if the tab/frame has navigated to a different page. The only work-around that I could think of is to keep the StreamFilter open forever (yuck).
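For the content-encoding bug, the simplest workaround mentioned above (dropping the Accept-Encoding request header so that the server has no reason to compress the response) could be sketched as follows. `withoutAcceptEncoding` is a hypothetical helper name, and the listener assumes a background script with the `webRequest` and `webRequestBlocking` permissions. A real extension would likely restrict the URL filter instead of using `<all_urls>`.

```javascript
// Hypothetical helper: returns a copy of the headers without Accept-Encoding.
// Header names are compared case-insensitively.
function withoutAcceptEncoding(requestHeaders) {
  return requestHeaders.filter(
    (header) => header.name.toLowerCase() !== "accept-encoding"
  );
}

// Wiring inside an extension; skipped when no `browser` global exists.
if (typeof browser !== "undefined" && browser.webRequest) {
  browser.webRequest.onBeforeSendHeaders.addListener(
    (details) => ({
      requestHeaders: withoutAcceptEncoding(details.requestHeaders),
    }),
    { urls: ["<all_urls>"], types: ["main_frame", "sub_frame"] },
    ["blocking", "requestHeaders"]
  );
}
```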
