A guide to author gender ratio data on Nature Index
About the Nature Index
The Nature Index is a database of author affiliations and institutional relationships. The index tracks contributions to research articles published in high-quality natural-science and health-science journals, chosen by an independent group of researchers, based on reputation.
The Nature Index provides absolute and fractional counts of article publication at the institutional and national level and, as such, is an indicator of global high-quality research output and collaboration. Data in the Nature Index are updated regularly, with the most recent 12 months made available under a Creative Commons licence at nature.com/nature-index. The database is compiled by Nature Portfolio.
Author gender inference
Each authorship in the Nature Index is defined as a unique combination of DOI, first name, middle name and last name. Authorships for which only an initial is available for the first name are excluded, because an initial alone gives too little information to infer gender.
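The definition above can be illustrated with a minimal Python sketch. The record layout, the function names and the rule for spotting initials are assumptions made for illustration, not the Nature Index pipeline itself.

```python
import re
from typing import Iterable, List, NamedTuple, Tuple


class Authorship(NamedTuple):
    """One authorship: a unique (DOI, first, middle, last) combination."""
    doi: str
    first_name: str
    middle_name: str
    last_name: str


def has_full_first_name(first_name: str) -> bool:
    """Return False when the first name consists only of initials (e.g. 'J.' or 'J.-P.')."""
    # Assumption: a usable first name has at least one run of two or more letters.
    tokens = re.findall(r"[^\W\d_]+", first_name)
    return any(len(token) > 1 for token in tokens)


def unique_authorships(records: Iterable[Tuple[str, str, str, str]]) -> List[Authorship]:
    """Deduplicate on (DOI, first, middle, last) and drop initial-only first names."""
    seen = set()
    kept = []
    for doi, first, middle, last in records:
        # Assumption: DOIs are compared case-insensitively and names are trimmed.
        key = Authorship(doi.lower(), first.strip(), middle.strip(), last.strip())
        if not has_full_first_name(key.first_name) or key in seen:
            continue
        seen.add(key)
        kept.append(key)
    return kept
```

The same person on two different DOIs yields two authorships, which is consistent with counting authorship instances rather than individuals.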
To enhance data accuracy and resolve ambiguities, authorship records are matched with disambiguated author profiles from Digital Science’s Dimensions database, where available. This process ensures that researchers with multiple name variations or affiliations are consistently recognized.
Determining country of origin
Each author’s associated country or territory was inferred using the following process, which is similar to the method used by the CWTS Leiden Ranking’s gender indicator:
- If an author’s first publication country or territory matches the most frequently occurring country or territory in their publication history, they are assigned to that location.
- If no single country or territory is predominant, the author is linked to all countries or territories appearing in their publications.
- If an author remains ambiguous (a single publication with no country information, for example), the country or territory is inferred from their institutional affiliation at the time of authorship.
This method allows for a balanced representation of an author’s research footprint while minimizing errors from temporary institutional movements.
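The three rules above can be sketched in Python as follows. This is an illustrative reading, not the production code: the function name and input layout are hypothetical, and the handling of an author whose most frequent country differs from their first publication country (a case the rules do not spell out) is an assumption.

```python
from collections import Counter
from typing import List, Optional


def assign_countries(
    pub_countries: List[Optional[str]],
    affiliation_country: Optional[str] = None,
) -> List[str]:
    """Assign countries/territories to one author.

    pub_countries lists the country of each publication in chronological
    order, with None where no country information is available.
    """
    known = [country for country in pub_countries if country]
    if not known:
        # Rule 3: no usable publication history, so fall back to the
        # institutional affiliation at the time of authorship.
        return [affiliation_country] if affiliation_country else []

    counts = Counter(known)
    top_count = max(counts.values())
    leaders = [country for country, n in counts.items() if n == top_count]

    # Rule 1: the first publication country is also the single most frequent one.
    if len(leaders) == 1 and leaders[0] == known[0]:
        return [known[0]]

    # Rule 2: no single predominant country, so link the author to all of them.
    # Assumption: a predominant country that differs from the first
    # publication country is also handled by this branch.
    return sorted(counts)
```

For example, `assign_countries(["DE", "DE", "US"])` assigns the author to Germany only, whereas `assign_countries(["DE", "US"])` links the author to both countries.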
We emphasize that the analysis counts authorship instances rather than individual authors, meaning that authors could appear multiple times in the totals and ratios. As a result, the data reflect publication activity rather than a strict headcount of individuals. This matters for interpretation: an author who publishes often contributes more authorships, so their inferred gender weighs more heavily in the ratios.
Inferring gender using large language models
Unique combinations of first name, last name and origin country were submitted to OpenAI’s GPT-4o mini model to infer each author’s gender. The following prompt was used:
“I need to pick up someone from {country/territory} named {first_name} {last_name}. Am I more likely looking for a male or a female? Report only ‘Male’ or ‘Female’, and a score from 0 to 1 on how certain you are.”
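As a rough illustration, the prompt can be filled in once per unique name and country combination. In the sketch below the placeholder `{country/territory}` is renamed `{country}` so that it works with Python string formatting, the curly quotation marks are replaced with straight ones, and the function names are hypothetical. No model call is shown, because the source does not describe how requests were sent.

```python
from typing import Dict, Iterable, Tuple

# Prompt text taken from the guide; only the placeholder name and quote style differ.
PROMPT_TEMPLATE = (
    "I need to pick up someone from {country} named {first_name} {last_name}. "
    "Am I more likely looking for a male or a female? "
    "Report only 'Male' or 'Female', and a score from 0 to 1 on how certain you are."
)


def build_prompt(country: str, first_name: str, last_name: str) -> str:
    """Fill the template for one author."""
    return PROMPT_TEMPLATE.format(
        country=country, first_name=first_name, last_name=last_name
    )


def unique_prompts(
    authors: Iterable[Tuple[str, str, str]]
) -> Dict[Tuple[str, str, str], str]:
    """Build one prompt per distinct (first_name, last_name, country) combination."""
    combos = sorted(set(authors))
    return {
        (first, last, country): build_prompt(country, first, last)
        for first, last, country in combos
    }
```

Deduplicating before building prompts means that an author with many authorships triggers only one inference per combination.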
This prompt is adapted from a paper (ref. 1) in which the authors compare ChatGPT to commercial gender inference tools. Authorships were limited to binary gender classifications (male or female) because non-binary identities are under-represented or absent in the training data used by large language models.
The response format strictly follows: Gender, Score (e.g., Male, 0.95), ensuring standardized output for further analysis. An error margin of 95% was used for each author-gender inference. Cases where the model returned a score below 0.9 were excluded, along with instances where the model assigned different genders to the same combination of first name and country.
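The parsing and exclusion steps described above might look like the following sketch. The regular expression, the function names and the row layout are assumptions. The source also does not say whether conflicting genders are checked before or after the score filter; this sketch checks them among responses that pass the 0.9 threshold.

```python
import re
from collections import defaultdict
from typing import Iterable, List, Optional, Tuple

# Matches responses such as "Male, 0.95" or "Female, 1".
RESPONSE_RE = re.compile(r"^\s*(Male|Female)\s*,\s*([01](?:\.\d+)?)\s*$")
MIN_SCORE = 0.9  # threshold stated in the guide


def parse_response(text: str) -> Optional[Tuple[str, float]]:
    """Return (gender, score) for a well-formed response, else None."""
    match = RESPONSE_RE.match(text)
    if match is None:
        return None
    score = float(match.group(2))
    if score > 1.0:
        return None
    return match.group(1), score


def filter_inferences(
    rows: Iterable[Tuple[str, str, str]]
) -> List[Tuple[str, str, str, float]]:
    """rows holds (first_name, country, raw_response) tuples.

    Drops malformed or low-score responses, then drops every
    (first_name, country) pair that received more than one gender.
    """
    parsed = []
    genders_seen = defaultdict(set)
    for first, country, raw in rows:
        result = parse_response(raw)
        if result is None or result[1] < MIN_SCORE:
            continue
        gender, score = result
        parsed.append((first, country, gender, score))
        genders_seen[(first, country)].add(gender)
    return [row for row in parsed if len(genders_seen[(row[0], row[1])]) == 1]
```

With this approach, a name that the model labels male for one author and female for another in the same country is removed entirely rather than resolved by majority.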
We also report upper and lower bounds. We can say with 95% confidence that the true value falls within the given interval.
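The guide does not state how these bounds are computed. As one plausible illustration, a 95% interval for a proportion (such as the share of female authorships) can be obtained with the Wilson score method. Both the choice of method and the function name below are assumptions, not the documented procedure.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, total: int, z: float = 1.959964) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.959964 corresponds to a two-sided 95% confidence level.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Example: 30 female authorships out of 100 gender-classified authorships.
lower, upper = wilson_interval(30, 100)
print(f"{lower:.3f} to {upper:.3f}")
```

Intervals of this kind widen as the number of classified authorships falls, so small institutions or subject areas carry more uncertainty.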
Challenges and limitations of gender inference
Gender inference provides valuable demographic insights, but it is inherently limited by:
- Cultural variability: Name-gender associations differ across regions, leading to potential inaccuracies in global datasets.
- Model training biases: Large language models are trained on historical datasets, which often reflect societal norms that emphasize binary gender identification.
Some countries and territories have stronger name–gender associations than others (in other words, a larger majority of people with a given name identify as one gender), which can make the results of the analysis uneven. Authorships for which the model’s confidence in the inferred gender is low have been excluded from the analysis; this includes the majority of names in some countries, such as China and Singapore.
Gender indicators
We report the following gender diversity indicators, recognizing that authors may be counted multiple times as we analyze authorship at the publication level:
- Total authorships: The total number of authorships per institution, country, subject, etc., where each publication-authorship combination is considered separately.
- Female authorships: The number of authorships attributed to female authors per institution, country, subject, etc.
- Proportion of female authorships: The percentage of female authorships relative to the total number of male and female authorships within a given institution, country/territory, or subject area (excluding authorships where gender inference did not pass the error threshold).
- Expected proportion of female authorships: An estimate of what an institution’s or journal’s ratio of male to female authorship could be, based on how much research they publish in male- or female-heavy sciences.
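The first three indicators can be computed along the following lines. This is a minimal sketch: the input layout and function name are hypothetical, and it assumes that total authorships include those whose gender inference was excluded, whereas the proportion uses only male and female authorships.

```python
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def gender_indicators(
    authorships: Iterable[Tuple[str, Optional[str]]]
) -> Dict[str, Dict[str, Optional[float]]]:
    """authorships holds (group, gender) pairs, one per authorship.

    group is an institution, country/territory or subject; gender is
    'Male', 'Female' or None when the inference was excluded.
    """
    total = Counter()
    female = Counter()
    male = Counter()
    for group, gender in authorships:
        total[group] += 1
        if gender == "Female":
            female[group] += 1
        elif gender == "Male":
            male[group] += 1

    result = {}
    for group in total:
        classified = female[group] + male[group]
        result[group] = {
            "total_authorships": total[group],
            "female_authorships": female[group],
            # Percentage of female among male + female authorships only.
            "female_proportion_pct": (
                100 * female[group] / classified if classified else None
            ),
        }
    return result
```

A group with no gender-classified authorships gets `None` for the proportion rather than zero, so that missing data is not mistaken for an absence of female authors.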
We built a linear regression model to predict the proportion of female authorships based on historical data. The model learns patterns from past data, including how the proportion of female authors varies across topics and years.
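A regression of this kind could be set up as in the sketch below, which uses NumPy least squares. The feature design (one indicator column per topic plus a linear year term), the authorship-weighted averaging and all names are assumptions made for illustration; the guide does not specify the actual model.

```python
from typing import List, Sequence, Tuple

import numpy as np

Model = Tuple[List[str], int, np.ndarray]


def _features(topics: Sequence[str], year0: int, topic: str, year: int) -> List[float]:
    """One indicator column per topic, plus years elapsed since year0."""
    return [1.0 if topic == name else 0.0 for name in topics] + [float(year - year0)]


def fit_baseline(rows: Sequence[Tuple[str, int, float]]) -> Model:
    """rows holds historical (topic, year, female_share) observations."""
    topics = sorted({topic for topic, _, _ in rows})
    year0 = min(year for _, year, _ in rows)
    X = np.array([_features(topics, year0, t, y) for t, y, _ in rows])
    target = np.array([share for _, _, share in rows])
    # Ordinary least squares fit of female share on topic and year.
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    return topics, year0, coef


def predict(model: Model, topic: str, year: int) -> float:
    """Predicted female share for one topic and year."""
    topics, year0, coef = model
    if topic not in topics:
        raise ValueError(f"unknown topic: {topic}")
    return float(np.array(_features(topics, year0, topic, year)) @ coef)


def expected_share(model: Model, portfolio: Sequence[Tuple[str, int, int]]) -> float:
    """portfolio holds (topic, year, n_authorships) for one institution or journal."""
    total = sum(n for _, _, n in portfolio)
    return sum(n * predict(model, t, y) for t, y, n in portfolio) / total
```

Under these assumptions, an institution that publishes mostly in topics with historically low female shares receives a lower expected proportion, which gives a fairer baseline against which to compare its observed proportion.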
The methodologies outlined ensure a structured and replicable approach to identifying authors in the Nature Index dataset. While gender inference and country attribution provide valuable demographic insights, it is essential to acknowledge their limitations and apply results with appropriate caution. Future improvements in author disambiguation and gender inference models will further enhance the accuracy and inclusivity of the analyses.
References
1. Alexopoulos, M. et al. Gender Inference: Can ChatGPT Outperform Common Commercial Tools? Preprint at arXiv:2312.00805v1 [cs.CL] (24 November 2023).