I am parsing a research paper that is laid out in two columns, with images and tables at different positions on the page. I have connected to the LlamaParse API and am using premium_mode:

from llama_cloud_services import LlamaParse  # import omitted in my original snippet

parser = LlamaParse(
    api_key=API_KEY,
    premium_mode=True,
    extract_layout=True,
    target_pages="1",
    verbose=False,
)
documents = parser.parse(PDF_PATH)

If I parse normally, without extract_layout=True, the parsed text is inaccurate because caption text gets mixed into the paragraph text. The reading order is also wrong. For example, if the page has (A) a text box at the bottom of the left column and (B) a text box at the top of the right column, a human reads (A) first and then (B), but in my attempts the parser reads (B) first and then (A).

To get the correct reading order, and a final document that contains only section headers and paragraph text with no caption text, my plan is:

  • Set extract_layout=True.
  • With extract_layout=True, each page exposes a list of LayoutItem objects. Each LayoutItem is labelled with one of several categories, including:
    • 'picture'
    • 'caption' (caption of a picture or table)
    • 'table'
    • 'pageHeader'
    • 'pageFooter'
    • 'sectionHeader'
    • 'text' (normal paragraph text)
  • Select only the LayoutItem objects labelled 'text' or 'sectionHeader'.
  • Write a function that sorts those LayoutItem objects by the x, y coordinates in their bbox attribute.
  • Copy the text of each object, in that order, into the final document. This is where I am stuck, because LayoutItem has only these attributes:
class LayoutItem(BaseModel):
    """The layout of a page."""
    image: str = Field(description="The name of the image containing the layout item")
    confidence: float = Field(description="The confidence of the layout item.")
    label: str = Field(description="The label of the layout item.")
    bbox: BBox = Field(description="The bounding box of the layout item.")
    isLikelyNoise: bool = Field(description="Whether the layout item is likely noise.")
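
The filter-and-sort steps in the list above can be sketched as follows. This is only a minimal sketch under my own assumptions: Box and Item are hypothetical stand-ins for the real bbox and LayoutItem types, I assume the bbox exposes x, y, w, h with y increasing down the page, and I assume a plain two-column page whose column boundary is at half the page width.

```python
from dataclasses import dataclass


@dataclass
class Box:
    # Hypothetical stand-in for LayoutItem.bbox; the x, y, w, h names are assumed.
    x: float
    y: float
    w: float
    h: float


@dataclass
class Item:
    # Hypothetical stand-in for LayoutItem, with only the fields used here.
    label: str
    bbox: Box


# Labels that should end up in the final document.
KEEP = {"text", "sectionHeader"}


def reading_order(items, page_width=1.0):
    """Keep body items and sort them left column first, then top to bottom."""
    body = [it for it in items if it.label in KEEP]
    mid = page_width / 2

    def key(it):
        center_x = it.bbox.x + it.bbox.w / 2
        column = 0 if center_x < mid else 1  # 0 = left column, 1 = right column
        return (column, it.bbox.y)

    return sorted(body, key=key)
```

With this ordering, the bottom-left box (A) comes before the top-right box (B), and caption items are dropped. Items that span both columns, such as a full-width title, would need separate handling.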

The only content I can get from a LayoutItem is its image, which I fetch with a GET request. How do I get the text inside each LayoutItem that is labelled "text" or "sectionHeader"? I could pass the image through an OCR reader, or through the parser again, but that seems expensive and wasteful: LlamaParse has already read those words once, and I simply cannot associate each LayoutItem image with a text block in the final parsed document.

I also tried using the bbox attribute of each LayoutItem to locate the matching text in the parsed document, via the PageItem objects. For example, we have:

documents = parser.parse(PDF_PATH)
page = documents.pages[0]  # example of a page
page.layout  # list of LayoutItem objects
page.items   # list of PageItem objects

Each PageItem object has a bbox and a text value, but its bbox does not match any bbox in the LayoutItem list. In other words, the PageItem objects recognised on a page are different from the LayoutItem objects on that page, and the recognition/classification of PageItem objects is nowhere near as good as that of LayoutItem objects. For example, a page with 19 LayoutItem objects has only 9 PageItem objects, and the text of a single PageItem may combine the text of several caption and text LayoutItem objects, in the same wrong reading order as the normally parsed document.
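
To illustrate the mismatch, a check like the one below would list which layout boxes fall mostly inside a single PageItem box. It is a rough sketch, not something taken from the library: both function names are hypothetical, boxes are passed as plain (x, y, w, h) tuples, and I am assuming the two bbox types share one coordinate system, which I have not confirmed.

```python
def overlap_fraction(inner, outer):
    """Fraction of inner's area that lies inside outer.

    Both boxes are (x, y, w, h) tuples in the same coordinate system (assumed).
    """
    ix, iy, iw, ih = inner
    ox, oy, ow, oh = outer
    # Width and height of the intersection rectangle (non-positive if disjoint).
    dx = min(ix + iw, ox + ow) - max(ix, ox)
    dy = min(iy + ih, oy + oh) - max(iy, oy)
    if dx <= 0 or dy <= 0 or iw <= 0 or ih <= 0:
        return 0.0
    return (dx * dy) / (iw * ih)


def layouts_inside(page_item_box, layout_boxes, threshold=0.8):
    """Indices of layout boxes that lie mostly inside one page-item box."""
    return [
        i
        for i, layout_box in enumerate(layout_boxes)
        if overlap_fraction(layout_box, page_item_box) >= threshold
    ]
```

If one PageItem box returns several indices whose labels mix 'caption' and 'text', that would confirm it has merged several LayoutItem regions into one block.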

Is there a way to retrieve the words from each object of LayoutItem without using another OCR or parsing those images for the second time?

I would really appreciate your help! Please let me know if I need to provide any more details/documents.
