NEXT-EVAL: From Web URLs to Structured Tables – Extraction and Evaluation

Welcome to NEXT-EVAL, a comprehensive toolkit for the rigorous evaluation and comparison of methods for extracting tabular data records from web pages. This framework supports both traditional algorithms and modern Large Language Model (LLM)-based approaches. We provide the necessary components to generate datasets, preprocess web data, evaluate model performance, and conduct standardized benchmarking.

NEXT-EVAL is an open-source library accompanying the NeurIPS 2025 paper:

📄 NEXT-EVAL: Next Evaluation of Traditional and LLM Web Data Record Extraction [https://arxiv.org/abs/2505.17125]

🎥 Demo Video

🏁 Getting Started

npm install @wordbricks/next-eval

🔧 Components

1. HTML Processing Tool

Convert real-world webpage HTML into compact formats optimized for LLM processing:

HTML to Slim HTML: Clean and simplify raw HTML for model input
HTML to Hierarchical JSON: Convert webpage HTML into nested JSON that preserves the original structure
HTML to Flat JSON: Convert webpage HTML into a flat JSON object in which each key is an XPath and each value is the text at that node

import { processHtmlContent } from "@wordbricks/next-eval/html/utils/processHtmlContent";

const htmlString = `<!DOCTYPE html>
<html lang="en">
<body>
  <div class="container">
    <h1>Main Page</h1>
    <div class="card">
      <div class="card-title">User Profile</div>
      <div class="card-content">
        <ul>
          <li><strong>Name:</strong> Jane Doe</li>
          <li><strong>Email:</strong> jane@example.com</li>
          <li>
            <strong>Skills:</strong>
            <ul>
              <li>JavaScript</li>
              <li>Python</li>
              <li>HTML & CSS</li>
            </ul>
          </li>
        </ul>
      </div>
    </div>
  </div>
</body>
</html>`;

// A single call returns all three representations of the page.
const { html: slimmedHtml, textMapFlat, textMap } = await processHtmlContent(htmlString);

console.log("[Slim HTML]", slimmedHtml);
console.log("[Hierarchical JSON]", textMap);
console.log("[Flat JSON]", textMapFlat);
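
To make the Flat JSON idea concrete, here is a minimal sketch of how an XPath-to-text map could be built from a small element tree. It is an illustration only, not the implementation behind processHtmlContent: the `ElementNode` type and the `toFlatTextMap` function are hypothetical names, and the exact XPath style the library emits may differ.

```typescript
// Hypothetical minimal element tree; not the library's internal representation.
interface ElementNode {
  tag: string;
  text?: string; // direct text content of this node, if any
  children?: ElementNode[];
}

// Walk the tree and record each node's text under its positional XPath.
function toFlatTextMap(
  node: ElementNode,
  path: string = `/${node.tag}`,
  out: Record<string, string> = {},
): Record<string, string> {
  const text = node.text?.trim();
  if (text) out[path] = text;

  // Count same-tag siblings so every child gets a 1-based index, e.g. li[2].
  const seen = new Map<string, number>();
  for (const child of node.children ?? []) {
    const index = (seen.get(child.tag) ?? 0) + 1;
    seen.set(child.tag, index);
    toFlatTextMap(child, `${path}/${child.tag}[${index}]`, out);
  }
  return out;
}

const tree: ElementNode = {
  tag: "body",
  children: [
    {
      tag: "ul",
      children: [
        { tag: "li", text: "JavaScript" },
        { tag: "li", text: "Python" },
      ],
    },
  ],
};

const flat = toFlatTextMap(tree);
console.log(flat);
// { "/body/ul[1]/li[1]": "JavaScript", "/body/ul[1]/li[2]": "Python" }
```

Because every value is addressed by its XPath, a model that reads this format can answer with XPaths, which is the same shape the evaluation example in this README uses for records.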

2. Table Generation Tool

Generate tabular data from web content using LLM-based extraction with customizable prompts:

import { getLLMResponse } from "@wordbricks/next-eval/llm/utils/getLLMResponse";

const temperature = 1.0; // Control randomness (0.0 to 2.0)

// Option 1: Using Slim HTML format
const { text: slimText, usage: slimUsage } = await getLLMResponse(slimmedHtml, "slim", temperature);

// Option 2: Using Hierarchical JSON format
const { text: hierText, usage: hierUsage } = await getLLMResponse(textMap, "hier", temperature);

// Option 3: Using Flat JSON format  
const { text: flatText, usage: flatUsage } = await getLLMResponse(textMapFlat, "flat", temperature);

console.log("Slim HTML result:", slimText);
console.log("Hierarchical JSON result:", hierText);
console.log("Flat JSON result:", flatText);
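
The `text` field is a string, so it has to be parsed before it can be compared with ground truth. The sketch below assumes the model answers with a JSON array of XPath arrays, possibly wrapped in extra prose or a code fence. That response format is an assumption made for illustration, and `parseRecords` is a hypothetical helper, not part of the package.

```typescript
// Hypothetical helper; the response format is an assumption, not documented behavior.
type XPathRecords = string[][];

function parseRecords(responseText: string): XPathRecords {
  // Take the outermost [...] span so surrounding prose or fences are ignored.
  const start = responseText.indexOf("[");
  const end = responseText.lastIndexOf("]");
  if (start === -1 || end <= start) return [];

  let parsed: unknown;
  try {
    parsed = JSON.parse(responseText.slice(start, end + 1));
  } catch {
    return []; // Treat unparseable output as "no records extracted".
  }
  if (!Array.isArray(parsed)) return [];

  // Keep only arrays of strings so malformed entries never reach the evaluator.
  return parsed.filter(
    (record): record is string[] =>
      Array.isArray(record) && record.every((x) => typeof x === "string"),
  );
}

const sample = 'Result: [["/body/div[1]/span[1]", "/body/div[1]/a[1]"], [1, 2]]';
console.log(parseRecords(sample)); // keeps only the first, well-formed record
```

Returning an empty list on failure keeps a benchmarking loop running when a single response is malformed; a stricter pipeline might prefer to throw instead.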

3. Evaluation Framework

Score predicted records against ground-truth records with precision, recall, F1-score, and a detailed overlap analysis:

import { calculateEvaluationMetrics } from "@wordbricks/next-eval/evaluation/utils/calculateEvaluationMetrics";

const predictedRecords = [
  [
    "/body/section[1]/div[4]/span[1]",
    "/body/section[1]/div[4]/span[2]",
    "/body/section[1]/div[4]/a[1]",
  ],
  [
    "/body/section[1]/div[2]/span[1]",
    "/body/section[1]/div[2]/span[2]",
    "/body/section[1]/div[2]/a[1]",
  ],
  [
    "/body/section[1]/div[3]/span[1]",
    "/body/section[1]/div[3]/span[2]",
    "/body/section[1]/div[3]/a[1]",
  ],
];

const groundTruthRecords = [
  [
    "/body/section[1]/div[3]/span[1]",
    "/body/section[1]/div[3]/a[1]",
    "/body/section[1]/div[3]/button[1]",
  ],
  [
    "/body/section[1]/div[2]/span[1]",
    "/body/section[1]/div[2]/a[1]",
    "/body/section[1]/div[2]/span[3]",
  ],
  [
    "/body/section[1]/div[5]/span[1]",
    "/body/section[1]/div[5]/span[2]",
    "/body/section[1]/div[5]/a[1]",
  ],
];

const { precision, recall, f1, totalOverlap, matches } = calculateEvaluationMetrics(
  predictedRecords, 
  groundTruthRecords
);

console.log(`Precision: ${precision.toFixed(3)}`);
console.log(`Recall: ${recall.toFixed(3)}`);
console.log(`F1-Score: ${f1.toFixed(3)}`);
console.log(`Total Overlap: ${totalOverlap}`);
console.log(`Matches: ${matches}`);
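
This README does not spell out how the scores are derived, so the following is only one plausible reading, written as a self-contained sketch. It assumes that record similarity is the Jaccard overlap of XPath sets, that records are paired one-to-one greedily by best overlap, and that precision and recall divide the summed overlap by the number of predicted and ground-truth records respectively. `sketchMetrics` is a hypothetical name, and calculateEvaluationMetrics may use a different matching strategy or different definitions.

```typescript
type DataRecord = string[]; // one record = the XPaths of its fields

// Jaccard overlap between two records: |A ∩ B| / |A ∪ B|.
function jaccard(a: DataRecord, b: DataRecord): number {
  const setA = new Set(a);
  const setB = new Set(b);
  let shared = 0;
  setA.forEach((x) => {
    if (setB.has(x)) shared++;
  });
  const union = setA.size + setB.size - shared;
  return union === 0 ? 0 : shared / union;
}

function sketchMetrics(predicted: DataRecord[], truth: DataRecord[]) {
  // Score every predicted/ground-truth pair that overlaps at all.
  const pairs: { p: number; g: number; score: number }[] = [];
  predicted.forEach((pr, p) =>
    truth.forEach((gt, g) => {
      const score = jaccard(pr, gt);
      if (score > 0) pairs.push({ p, g, score });
    }),
  );

  // Greedy one-to-one matching: take the best-scoring pairs first.
  pairs.sort((x, y) => y.score - x.score);
  const usedP = new Set<number>();
  const usedG = new Set<number>();
  let totalOverlap = 0;
  for (const { p, g, score } of pairs) {
    if (usedP.has(p) || usedG.has(g)) continue;
    usedP.add(p);
    usedG.add(g);
    totalOverlap += score;
  }

  const precision = predicted.length ? totalOverlap / predicted.length : 0;
  const recall = truth.length ? totalOverlap / truth.length : 0;
  const f1 =
    precision + recall > 0 ? (2 * precision * recall) / (precision + recall) : 0;
  return { precision, recall, f1, totalOverlap };
}

const result = sketchMetrics(
  [["/a[1]", "/b[1]"], ["/c[1]"]],
  [["/a[1]", "/b[1]"], ["/d[1]"]],
);
console.log(result.precision, result.recall, result.f1); // 0.5 0.5 0.5
```

Under these assumptions a prediction earns partial credit when it captures only some fields of a record, rather than being scored as an outright miss.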

🧪 Citation

If you use NEXT-EVAL in your research, please cite:

@inproceedings{next-eval2025,
  title={NEXT-EVAL: Next Evaluation of Traditional and LLM Web Data Record Extraction},
  author={arXiv},
  year={2025},
  url={https://arxiv.org/abs/2505.17125}
}

🤝 Contributing

We welcome contributions to improve tool coverage, add datasets, or refine evaluation metrics. Please see CONTRIBUTING.md for guidelines.

📬 Contact

Have questions or ideas? We'd love to hear from you. Contact us at research@wordbricks.ai

Inspired by our research? We are looking for innovative thinkers to join our team. Please email your resume to hr@wordbricks.ai and be sure to mention our paper.

To see what else we're building, explore our latest technologies at nextrows.com
