How to extract the text from the LayoutItem objects when we set extract_layout=True in the parser? #695

Apr 25, 2025

michelle-unia-mermich
Apr 25, 2025

To parse a research paper that is laid out in two columns and has images and tables at various positions on the page, I connected to your API and used premium_mode:

from llama_cloud_services import LlamaParse

parser = LlamaParse(
    api_key=API_KEY,
    premium_mode=True,
    extract_layout=True,
    target_pages="1",
    verbose=False,
)
documents = parser.parse(PDF_PATH)

If I leave out extract_layout=True and parse as normal, the parsed text is not accurate, because caption text is mixed in with the actual paragraph text. The reading order is also wrong. For example, if the page has (A) a text box at the bottom of the left column and (B) a text box at the top of the right column, a human reads (A) first and then (B), following the usual column order, but in my attempts the parser reads (B) first and then (A).

To make sure that the reading order is correct and that the final document contains only section headers and paragraph text, with no caption text, I do the following:

1. Use extract_layout=True. With this setting, each page comes with a list of LayoutItem objects, and each LayoutItem is labelled with one of several categories, including:
   - 'picture'
   - 'caption' (caption of a picture or table)
   - 'table'
   - 'pageHeader'
   - 'pageFooter'
   - 'sectionHeader'
   - 'text' (the normal text paragraphs)
2. Select only the LayoutItem objects that are labelled as either 'text' or 'sectionHeader'.
3. Write a function to sort those LayoutItem objects by the x, y coordinates in their bbox attribute.
4. Copy the text within each object, in that order, to the final document. This is the problem.

The LayoutItem object only has these attributes:

class LayoutItem(BaseModel):
    """The layout of a page."""
    image: str = Field(description="The name of the image containing the layout item")
    confidence: float = Field(description="The confidence of the layout item.")
    label: str = Field(description="The label of the layout item.")
    bbox: BBox = Field(description="The bounding box of the layout item.")
    isLikelyNoise: bool = Field(description="Whether the layout item is likely noise.")
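
For illustration, the filtering and sorting steps described above could look roughly like the sketch below. It is only a sketch under assumptions: Box and Item are hypothetical stand-ins for the library's BBox and LayoutItem, the x/y/w/h field names are assumed rather than taken from the class shown here, y is assumed to grow downwards, and the page is assumed to have exactly two columns split at the horizontal midpoint.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the library's BBox and LayoutItem.
# The x/y/w/h field names are an assumption.
@dataclass
class Box:
    x: float  # left edge
    y: float  # top edge (assumed to grow downwards)
    w: float
    h: float


@dataclass
class Item:
    label: str
    bbox: Box


KEEP_LABELS = {"text", "sectionHeader"}


def reading_order(items, page_width):
    """Keep body-text items and sort them left column first, top to bottom."""
    kept = [it for it in items if it.label in KEEP_LABELS]
    middle = page_width / 2

    def key(it):
        center_x = it.bbox.x + it.bbox.w / 2
        column = 0 if center_x < middle else 1
        return (column, it.bbox.y, it.bbox.x)

    return sorted(kept, key=key)


# (A) sits at the bottom of the left column, (B) at the top of the right one.
items = [
    Item("text", Box(0.55, 0.10, 0.40, 0.20)),  # B
    Item("caption", Box(0.05, 0.50, 0.40, 0.05)),
    Item("text", Box(0.05, 0.70, 0.40, 0.20)),  # A
    Item("sectionHeader", Box(0.05, 0.05, 0.40, 0.03)),
]
ordered = reading_order(items, page_width=1.0)
print([(it.label, it.bbox.x, it.bbox.y) for it in ordered])
```

With this key, the caption is dropped and (A) comes before (B). Items that span both columns, such as a full-width title, are not handled and would need a separate rule.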

The only information I can get from this LayoutItem is the image queried using GET requests. How do I get the text within each LayoutItem object that is labeled as "text" or "sectionHeader"? I can pass this image through an OCR reader or the parser again, but that just seems expensive and wasteful, since the LlamaParse parser has already gone through those words once; it's just that I cannot associate each LayoutItem image to a text block in the final parsed text document.

I also tried to use the bbox attribute of LayoutItem to locate the corresponding text in the parsed document, via the PageItem objects. For example:

documents = parser.parse(PDF_PATH)
page = documents.pages[0]  # example of a page
page.layout  # list of LayoutItem objects
page.items   # list of PageItem objects

Each PageItem object has a bbox and a text value, but its bbox does not match any bbox in the LayoutItem list. Basically, the PageItem objects recognised on a page are different from the LayoutItem objects on that page, and the recognition/classification of PageItem objects is nowhere near as good as that of LayoutItem objects. For example, a page with 19 LayoutItem objects has only 9 PageItem objects, and the text of a single PageItem may combine the text of several caption and text LayoutItem objects, in the same wrong reading order described above.
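
To make the bbox comparison concrete, here is a minimal sketch of checking how much of a LayoutItem box falls inside each PageItem box, instead of looking for exact matches. Box is a hypothetical stand-in with assumed x/y/w/h fields, and the sketch assumes both sets of boxes use the same coordinate system, which has not been verified here. It only shows which PageItem contains a given LayoutItem; it does not split the merged text, so it does not solve the problem by itself.

```python
from dataclasses import dataclass


@dataclass
class Box:  # hypothetical stand-in; x, y = top-left corner, w, h = size
    x: float
    y: float
    w: float
    h: float


def overlap_fraction(inner, outer):
    """Fraction of `inner`'s area that lies inside `outer` (0.0 to 1.0)."""
    if inner.w <= 0 or inner.h <= 0:
        return 0.0
    left = max(inner.x, outer.x)
    top = max(inner.y, outer.y)
    right = min(inner.x + inner.w, outer.x + outer.w)
    bottom = min(inner.y + inner.h, outer.y + outer.h)
    if right <= left or bottom <= top:
        return 0.0  # the boxes do not intersect
    return (right - left) * (bottom - top) / (inner.w * inner.h)


def best_page_item(layout_box, page_boxes):
    """Index and score of the page box that covers most of `layout_box`."""
    if not page_boxes:
        return None, 0.0
    scores = [overlap_fraction(layout_box, pb) for pb in page_boxes]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


layout_box = Box(10, 10, 20, 10)
page_boxes = [Box(0, 0, 15, 15), Box(5, 5, 40, 40)]
index, score = best_page_item(layout_box, page_boxes)
print(index, score)
```

A score close to 1.0 means the LayoutItem lies almost entirely inside that PageItem, which is the situation described above where one PageItem merges several layout regions.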

Is there a way to retrieve the words from each object of LayoutItem without using another OCR or parsing those images for the second time?

I would really appreciate your help! Please let me know if I need to provide any more details/documents.

Replies: 0 comments
