Conversation

@ShiroYasha18

Why are these changes needed?

Bridges Docling's ASR (Automatic Speech Recognition) support into DPK.
Currently supported models:
WHISPER_TINY
WHISPER_SMALL
WHISPER_MEDIUM
WHISPER_BASE
WHISPER_LARGE
WHISPER_TURBO

These are all the ASR models that Docling supports as of 3/07/2025.
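
As a rough illustration of how a model option could be checked against the list above, here is a minimal sketch. The helper name `resolve_asr_model` and its normalization rules are hypothetical and are not taken from this PR or from Docling; only the model names come from the list.

```python
# Model names copied from the list above; everything else is illustrative.
SUPPORTED_ASR_MODELS = (
    "WHISPER_TINY",
    "WHISPER_SMALL",
    "WHISPER_MEDIUM",
    "WHISPER_BASE",
    "WHISPER_LARGE",
    "WHISPER_TURBO",
)


def resolve_asr_model(name: str) -> str:
    """Normalize a user-supplied model name and check it against the list.

    Hypothetical helper: accepts spellings such as "whisper-tiny" or
    " Whisper_Tiny " and returns the canonical upper-case name.
    """
    normalized = name.strip().upper().replace("-", "_")
    if normalized not in SUPPORTED_ASR_MODELS:
        raise ValueError(
            f"Unsupported ASR model {name!r}; expected one of "
            f"{', '.join(SUPPORTED_ASR_MODELS)}"
        )
    return normalized
```

Rejecting unknown names early, as this sketch does, would surface a typo in configuration before any audio is processed.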

Related issue number (if any).

Related to issue #1346

Collaborator

@touma-I touma-I left a comment

@shahrokhDaijavad What is the use case for this?

Just to be up-to-date with the latest capabilities of Docling.

@ShiroYasha18
Author

Hi, do I need to add a test file, a sample data file, or anything else for this?

@shahrokhDaijavad
Collaborator

Hi, do I need to add a test file, a sample data file, or anything else for this?

@ShiroYasha18 Any change in docling2parquet requires generating updated "expected" files, for the test-src to pass. Please run make generate-expected (for reference, see https://github.com/data-prep-kit/data-prep-kit/blob/dev/transforms/language/docling2parquet/Makefile). If you are adding ASR files for testing in the test-data/input directory, the corresponding expected output files are needed.

@touma-I
Collaborator

touma-I commented Jul 8, 2025

@shahrokhDaijavad What is the use case for this?

Just to be up-to-date with the latest capabilities of Docling.

@shahrokhDaijavad: No need to be up-to-date. Let's discuss. I need a viable use case before we can proceed.

@ShiroYasha18
Author

ShiroYasha18 commented Jul 9, 2025

Hi @touma-I

Thanks for the feedback!

The idea behind integrating ASR (Automatic Speech Recognition) support is to allow docling2parquet to process audio transcription data directly—this is especially useful in projects dealing with oral histories, interviews, podcasts, or user feedback recordings. These types of datasets are becoming increasingly common in research and user experience domains.

This integration makes the DPK pipeline compatible with speech data workflows, enabling users to extract structured insights from spoken content with minimal setup. It aligns with Docling's existing support and helps bridge that capability into DPK for broader utility.

Example real-world use case: Companies often conduct video calls (e.g., via Zoom or Google Meet) with users. These are saved as .mp4 files. This ASR integration allows automatic transcription and ingestion of those interviews within the data-prep-kit pipeline.
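
A minimal sketch of the first step in that scenario, collecting saved recordings from a folder so they can be handed to a transcription step. The function name `find_recordings` is hypothetical, and the extension set is an assumption: only .mp4 is mentioned above. No Docling or DPK API is used here.

```python
from pathlib import Path

# Assumed extension set: only .mp4 is mentioned in the comment above;
# .mp3 and .wav are illustrative additions.
RECORDING_EXTENSIONS = {".mp4", ".mp3", ".wav"}


def find_recordings(root: Path) -> list[Path]:
    """Return recordings under root, sorted for deterministic processing.

    Hypothetical helper: matches on file extension only (case-insensitive)
    and does not inspect file contents.
    """
    return sorted(
        p
        for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in RECORDING_EXTENSIONS
    )
```

Each returned path would then be passed to whatever transcription step the pipeline uses.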

@shahrokhDaijavad
Collaborator

@ShiroYasha18 Sure. It is good for DPK to keep up with the latest Docling capabilities, but for us, it only makes sense to add ASR features when there is a specific use case (or client need) in which processing of sound files is followed by one or more DPK transforms in a real application recipe, in either pre-training or post-training LLM applications. As soon as we can find such a use case, we can come back to this PR.

@touma-I touma-I marked this pull request as draft July 11, 2025 13:35