GanjinZero/math401-llm

math401-llm

Source code and datasets for "How well do Large Language Models perform in Arithmetic tasks?"


Dataset

MATH 401 = 1 Euler equation + 16 groups × 25 problems

  • Euler Equation.
  • Add & Subtract of two integers within 10.
  • Add & Subtract of two integers within 100.
  • Add & Subtract of two integers within 1,000.
  • Add & Subtract of two integers within 1,000,000,000,000.
  • Add & Subtract of two integers within -10~10.
  • Add & Subtract of two decimal numbers within -100~100.
  • Multiply two integers within 100.
  • Multiply two decimal numbers within 10.
  • Multiply two integers within 100,000.
  • Division of two integers within 100.
  • Exponentiation with an integer base within 10 and an integer exponent within 2~4.
  • Exponentiation with a decimal base within 10 and a decimal exponent within 2~4.
  • Add, Subtract & Multiply with one integer within 10 and a common irrational number (i.e. $e$ or $\pi$).
  • Long arithmetic expressions with brackets; all involved integers are within 100, and the operators include addition, subtraction, multiplication, and division.
  • Trigonometric functions including $\sin$, $\cos$, and $\tan$. Inputs can be in degrees or radians ($\pi$ can also appear in the inputs).
  • Logarithm of integers within 1000 of different bases: $2,e,10$.
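The composition above can be sanity-checked in a few lines. The group labels below are illustrative placeholders, not the repo's actual identifiers:

```python
# Illustrative labels for the 16 groups listed above; the Euler
# equation is the single extra problem, so 1 + 16 * 25 = 401.
GROUPS = [
    "add_sub_int_10", "add_sub_int_100", "add_sub_int_1000",
    "add_sub_int_1e12", "add_sub_int_neg10_10", "add_sub_decimal_100",
    "mult_int_100", "mult_decimal_10", "mult_int_100000",
    "div_int_100", "exp_int", "exp_decimal",
    "irrational_e_pi", "long_expression", "trigonometry", "logarithm",
]
PROBLEMS_PER_GROUP = 25
TOTAL = 1 + len(GROUPS) * PROBLEMS_PER_GROUP  # 401 problems in total
```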

Metric

Accuracy

If the absolute difference between the decoded number and the target number is less than $10^{-3}$, we consider it a correct prediction. Accuracy is the fraction of correct predictions.
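A minimal sketch of this accuracy check, assuming a decoded prediction of `None` when no number could be extracted (function names are illustrative, not the repo's actual API):

```python
def is_correct(pred, target, tol=1e-3):
    """A prediction counts as correct when |pred - target| < 1e-3.
    `pred` is None when no number was decoded from the model output."""
    return pred is not None and abs(pred - target) < tol

def accuracy(preds, targets):
    """Fraction of correct predictions over the whole dataset."""
    return sum(is_correct(p, t) for p, t in zip(preds, targets)) / len(targets)
```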

Relative error

We denote the decoded number by $\hat{y}$ and the target by $y$. The relative error is:

$RE = \min(10, \frac{|\hat{y}-y|}{\max(|y|, 1)})$

If the LLM does not decode any number, we set $RE=10$. Truncating the relative error at 10 prevents a single large mistake from dominating the average relative error.
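The truncated relative error can be sketched directly from the formula above (an assumed helper, not the repo's exact code):

```python
def relative_error(pred, target, cap=10.0):
    """RE = min(10, |pred - target| / max(|target|, 1)).
    Returns the cap (10) when no number was decoded (pred is None)."""
    if pred is None:
        return cap
    return min(cap, abs(pred - target) / max(abs(target), 1.0))
```

The `max(|target|, 1)` denominator keeps the error well-defined for targets near zero, and the cap bounds the contribution of any single answer to the average.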

Non-number ratio

If the decoded content does not contain any number, we count it as a non-number failure. The non-number ratio is the fraction of such outputs.
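One way to sketch this check is a regex scan over the model output. The regex and the choice of taking the last matched number are assumptions for illustration, not necessarily the repo's actual parsing logic:

```python
import re

# Matches integers, decimals, and scientific notation (an assumed
# pattern, not the repo's actual parser).
NUMBER_RE = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?")

def decode_number(text):
    """Return the last number found in the model output, or None."""
    matches = NUMBER_RE.findall(text)
    return float(matches[-1]) if matches else None

def non_number_ratio(outputs):
    """Fraction of outputs from which no number could be decoded."""
    return sum(decode_number(o) is None for o in outputs) / len(outputs)
```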

Citation

@misc{yuan2023large,
      title={How well do Large Language Models perform in Arithmetic tasks?}, 
      author={Zheng Yuan and Hongyi Yuan and Chuanqi Tan and Wei Wang and Songfang Huang},
      year={2023},
      eprint={2304.02015},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
