Added support for ASR #1359

ShiroYasha18 · Jul 3, 2025

Why are these changes needed?

Bridges Docling's Automatic Speech Recognition (ASR) feature into DPK.
Currently supported models:
WHISPER_TINY
WHISPER_SMALL
WHISPER_MEDIUM
WHISPER_BASE
WHISPER_LARGE
WHISPER_TURBO

These are all the ASR models that Docling supports as of July 3, 2025.
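As an illustration only, here is a minimal sketch of how a transform could validate a user-supplied model name against the six presets listed above. The `AsrModel` enum, its string values, and `parse_asr_model` are hypothetical names introduced for this example; they are not taken from the PR diff or from Docling, and wiring the chosen preset into Docling is not shown.

```python
from enum import Enum


class AsrModel(Enum):
    """The six Whisper presets named in this PR (hypothetical enum, not DPK code)."""

    WHISPER_TINY = "whisper_tiny"
    WHISPER_SMALL = "whisper_small"
    WHISPER_MEDIUM = "whisper_medium"
    WHISPER_BASE = "whisper_base"
    WHISPER_LARGE = "whisper_large"
    WHISPER_TURBO = "whisper_turbo"


def parse_asr_model(name: str) -> AsrModel:
    """Map a user-supplied name (any case, surrounding spaces ignored) to a preset."""
    key = name.strip().upper()
    try:
        return AsrModel[key]
    except KeyError:
        supported = ", ".join(m.name for m in AsrModel)
        raise ValueError(
            f"Unsupported ASR model {name!r}; expected one of: {supported}"
        ) from None
```

Rejecting unknown names up front would keep a typo in a configuration value from surfacing later as a harder-to-read failure inside the conversion step.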

Related issue number (if any).

Related to issue #1346

Modified the README to acknowledge issue data-prep-kit#1042.

touma-I · Jul 3, 2025 (comment edited by shahrokhDaijavad)

@shahrokhDaijavad What is the use case for this?

Just to be up-to-date with the latest capabilities of Docling.

ShiroYasha18 · Jul 7, 2025

Hi, do I need to add a test file, a sample data file, or anything else for this?

shahrokhDaijavad · Jul 7, 2025

Hi, do I need to add a test file, a sample data file, or anything else for this?

@ShiroYasha18 Any change in docling2parquet requires regenerating the "expected" files for test-src to pass. Please run make generate-expected (for reference, see https://github.com/data-prep-kit/data-prep-kit/blob/dev/transforms/language/docling2parquet/Makefile). If you are adding ASR files for testing in the test-data/input directory, the corresponding expected output files are needed as well.
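A quick way to spot new inputs that still lack expected outputs might look like the sketch below. It assumes, purely for illustration, that each input file produces one expected file with the same stem and a `.parquet` suffix; the actual layout written by make generate-expected may differ, and `inputs_missing_expected` is a hypothetical helper, not part of DPK.

```python
from pathlib import Path


def inputs_missing_expected(
    input_dir: Path, expected_dir: Path, suffix: str = ".parquet"
) -> list[str]:
    """Return names of input files with no same-stem file in expected_dir.

    Assumption (not confirmed by the repository): one expected output per
    input file, named <input stem><suffix>.
    """
    expected_stems = {p.stem for p in expected_dir.glob(f"*{suffix}")}
    return sorted(
        p.name
        for p in input_dir.iterdir()
        if p.is_file() and p.stem not in expected_stems
    )
```

Calling it with the test-data input directory and the expected directory would list any newly added audio files that have not been covered yet.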

touma-I · Jul 8, 2025

@shahrokhDaijavad What is the use case for this?

Just to be up-to-date with the latest capabilities of Docling.

@shahrokhDaijavad: No need to be up-to-date. Let's discuss. I need a viable use case before we can proceed.

ShiroYasha18 · Jul 9, 2025

Hi @touma-I

Thanks for the feedback!

The idea behind integrating ASR (Automatic Speech Recognition) support is to allow docling2parquet to transcribe and process audio data directly. This is especially useful in projects dealing with oral histories, interviews, podcasts, or user feedback recordings. These types of datasets are becoming increasingly common in research and user experience domains.

This integration makes the DPK pipeline compatible with speech data workflows, enabling users to extract structured insights from spoken content with minimal setup. It aligns with Docling's existing support and helps bridge that capability into DPK for broader utility.

Example real-world use case: Companies often conduct video calls (e.g., via Zoom or Google Meet) with users. These are saved as .mp4 files. This ASR integration allows automatic transcription and ingestion of those interviews within the data-prep-kit pipeline.

shahrokhDaijavad · Jul 9, 2025

@ShiroYasha18 Sure. It is good for DPK to keep up with the latest Docling capabilities, but for us, it only makes sense to add ASR features when there is a specific use case (or client need) in which processing of sound files is followed by one or more DPK transforms in a real application recipe, either in pre-training or post-training LLM applications. As soon as we can find such a use case, we can come back to this PR.

ShiroYasha18 added 9 commits April 26, 2025 08:49

Update README.md

cedf052

modified the readme with acknowledging the issue data-prep-kit#1042

fixed typo bug in code2parquet transform.py

9d13cbe

Update transform.py

518c3fe

Update README.md

ff6097c

Merge branch 'data-prep-kit:dev' into dev

91184fa

Merge branch 'data-prep-kit:dev' into dev

9bc435d

Merge branch 'data-prep-kit:dev' into dev

7d79a3b

Merge branch 'data-prep-kit:dev' into dev

6a0f34a

added asr functionality

5c45f41

touma-I requested changes Jul 3, 2025


touma-I marked this pull request as draft July 11, 2025 13:35
