A guide to identifying author gender for bibliometric analyses

In 2018, I wrote a post for The Bibliomagician blog on identifying authors' genders through name analysis, prompted by a lively discussion on the LIS-Bibliometrics listserv. I’m reposting that post here under a CC-BY license.

Recently on the LIS-Bibliometrics listserv, Ruth Harrison (Imperial College London) posed a question on behalf of a patron who was interested in identifying authors' genders based upon names listed on ~2,000 journal articles, too large a corpus for manual analysis. The community weighed in with many good suggestions for ways to approach a large-scale gender analysis of author names. We thought it would be helpful to others to share what Ruth learned (with permission from the original posters).

Here are some recommendations from LIS-Bibliometrics listserv members on the best places to find author names, APIs and software you can use to analyze gender, consultants you can hire the analysis out to, and previous approaches to analysis from other gender bibliometrics researchers.

Where to find author name lists

Web of Science was the most frequently recommended source for downloading full author names for publication lists. Programmatic access via the Web of Science API is usually available for licensing. Libraries are usually the purchasers of Web of Science access for institutions, so contact your library to ask whether API access is included in your institution’s contract.

New citation index Dimensions also makes authors' full names available for download (though only for up to 50 papers at once in the free version of the app) and via the Dimensions API, which is freely available for those doing scientometrics research.

On the other hand, listserv members pointed out that Scopus only makes authors' first initials available both in metadata downloads for publication lists and via the Scopus API. Therefore, it is unsuitable to use in isolation for finding author names.

APIs and software

Automated gender analysis requires a bit of programming knowledge (or at least a willingness to learn). In particular, calling APIs and parsing publication metadata are two essential programming skills.
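To give a feel for the metadata-parsing half of that, here is a minimal Python sketch that pulls given names out of an author field. It assumes the names arrive as "Last, First" entries separated by semicolons, which is how full author names are typically laid out in Web of Science exports as far as I know; other sources will differ. The function name and the sample string are hypothetical.

```python
def given_names(author_field):
    """Return the first given name for each author in a
    'Last, First; Last, First' string (assumed layout)."""
    names = []
    for author in author_field.split(";"):
        author = author.strip()
        if not author:
            continue
        # Everything after the first comma is treated as the given name(s).
        _, _, given = author.partition(",")
        tokens = given.split()
        # Keep only the first token; None marks an author with no given name.
        names.append(tokens[0] if tokens else None)
    return names


# Hypothetical author field, for illustration only.
sample = "Konkiel, Stacy; Harrison, Ruth; Smith, J. A."
print(given_names(sample))
```

Only the first given name is kept because name-gender services generally expect a single first name. Authors with no comma in their entry come back as None so they can be counted rather than silently dropped.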

Gender API is a recommended service that allows you to look up the likely gender (and degree of confidence) for a particular name or list of names. For example, you could query the name “Diana” and learn that the name is classified as ‘female’, with a 93% accuracy rate based on a sample of 523 names. The providers offer clients for interacting with the API in PHP, Python, and several other programming languages.

Namsor is another recommended API for looking up gender based on names, and it has the added feature of looking up ethnicity, as well. The free API allows for a limited number of monthly calls; you can also pay for API access to increase your API call limit.

GenderChecker is a recommended name list that can be downloaded for less than $200 USD, then analyzed. As one listserv poster explained, “It’s not 100 percent accurate, but works for most American/European first names, especially if you have a large dataset. Be very careful with Chinese/Japanese/Korean names; most of the time they should be neutral unless you further checked.”

Genderize.io is yet another API that was not recommended by listserv members, but appears in several recent studies and reports. The Genderize database reportedly contains 216,286 distinct names across 79 countries and 89 languages. It is free to use but rate-limited to 1000 requests per day.
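To make the lookup step concrete, here is a minimal sketch of querying Genderize.io from Python using only the standard library. It assumes the endpoint is https://api.genderize.io with a name query parameter, and that the JSON reply carries gender and probability fields; that matches the service's public documentation as I understand it, so check the current docs before relying on it. The helper names and the 0.9 cut-off are my own illustrative choices, and the sample reply is invented for the example rather than copied from the API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

GENDERIZE_URL = "https://api.genderize.io"  # assumed endpoint


def build_url(name):
    """Build the lookup URL for a single first name."""
    return GENDERIZE_URL + "?" + urlencode({"name": name})


def parse_response(body, min_probability=0.9):
    """Turn a JSON reply into 'male', 'female', or 'unknown'.

    Replies with no gender, or with a probability below the
    (arbitrary) threshold, are reported as 'unknown'.
    """
    record = json.loads(body)
    gender = record.get("gender")
    probability = record.get("probability") or 0.0
    if gender is None or probability < min_probability:
        return "unknown"
    return gender


def lookup(name):
    """Query the live service; each call counts against the daily limit."""
    with urlopen(build_url(name)) as response:
        return parse_response(response.read().decode("utf-8"))


# Invented reply body, used so the example runs without network access.
example = '{"name": "diana", "gender": "female", "probability": 0.98, "count": 1000}'
print(parse_response(example))
```

Because the free tier is rate-limited, it is worth deduplicating first names and caching results before calling lookup on a corpus of a few thousand articles.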

Finally, the recommended Python package SexMachine allows you to look up the gender for around 40,000 names. For each name you query, you will get a response for one of the following categories: andy (androgynous), male, female, mostly_male, or mostly_female. For example, the query “Paul” would return “male”, whereas the name “Stacy” would return “mostly_female”.
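If you want to report results in broader groups, the five labels can be collapsed after lookup. The sketch below is one hedged way to do that; the grouping (folding mostly_male into Male, and so on) is my own choice, not something the package prescribes. The lookup function is passed in as an argument so the example runs without the package installed. With SexMachine, I believe you would pass in the get_gender method of a sexmachine.detector.Detector instance, but treat that as an assumption and check the package's README.

```python
from collections import Counter

# Illustrative grouping of the labels described above.
COLLAPSE = {
    "male": "Male",
    "mostly_male": "Male",
    "female": "Female",
    "mostly_female": "Female",
    "andy": "Unisex",
}


def collapse(label):
    """Map a detector label to a reporting group; anything else is Unknown."""
    return COLLAPSE.get(label, "Unknown")


def tally(names, get_gender):
    """Count reporting groups for a list of first names.

    get_gender is any callable that takes a name and returns a label.
    """
    return Counter(collapse(get_gender(name)) for name in names)


def stub_lookup(name):
    """Hypothetical stand-in for a real detector, for demonstration only."""
    return {"Paul": "male", "Stacy": "mostly_female"}.get(name, "andy")


print(tally(["Paul", "Stacy", "Robin"], stub_lookup))
```

Folding the "mostly" labels into the definite ones inflates apparent certainty, so it may be better to keep them separate in your working data and collapse them only for summary tables.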

Other gender researchers' approaches

Listserv members also suggested that Ruth and her patron look to existing author gender analysis studies to find methods to borrow. Two in particular, a 2013 commentary from Nature and a more recent Elsevier report, were mentioned most often:

The Nature study’s supplementary files include a thorough discussion of how to parse Web of Science names data for a variety of countries of origin.

One listserv respondent pointed out that “The Elsevier report’s methodology implies they didn’t have an easier way to [identify author gender] (“Scopus Author Profiles were combined with gender-name data from social media, applied onomastics, and Wikipedia”).” More details on the study’s methods can be found in a report appendix. Particularly useful is a discussion of the various name-gender APIs' suitability for multi-country analysis.

Consultants

For those who want to hire out the work, Science-Metrix, Elsevier Analytical Services, and Digital Science Consultancy are all businesses that offer a variety of bibliometrics analysis services, which may include gender analysis. Contact the consultancies themselves for more information.

Challenges

We would be remiss if we did not point out the challenges that face anyone seeking to do a study that determines a person’s gender, based on name alone. First and foremost, there is the question of ethics: does this kind of study rob authors of their right to be identified as a particular gender that might not match the expected gender for someone with their name?

Related to that issue is the problem of the assumption of a gender binary. Studies in this area tend to identify authors as “Male”, “Female”, “Unisex” (as in, a name that is suitable for both men and women), or “Unknown”. How can researchers more accurately identify the gender of someone who identifies as genderqueer or agender, for example? It doesn’t seem possible to do so using a simple names analysis, meaning that these kinds of studies should be approached and described with that caveat in mind.

Then there are technical issues related to the dearth of useful author metadata and regional name-gender data. “What about cases where the author info only includes initials?” one listserv respondent wrote. Other respondents pointed out that many name-gender analysis tools are biased towards Western names, making it difficult to do accurate analysis on authors from other areas of the world.
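One practical response to the initials problem is to measure it before running any gender lookup. The sketch below flags given names that consist only of initials and reports what share of a list is usable. The function names are hypothetical and the rule is deliberately simple: a name counts as initials-only when every piece, after splitting on dots, spaces, and hyphens, is a single character. Unpunctuated initials such as "JA" will slip through.

```python
import re


def is_initials_only(given):
    """True if the given name is empty or made up solely of single letters."""
    pieces = [p for p in re.split(r"[.\s-]+", given or "") if p]
    # all() of an empty list is True, so missing names count as unusable.
    return all(len(p) == 1 for p in pieces)


def usable_share(given_names):
    """Fraction of names that contain at least one full given name."""
    if not given_names:
        return 0.0
    usable = sum(1 for g in given_names if not is_initials_only(g))
    return usable / len(given_names)


# Hypothetical sample, for illustration only.
sample = ["Stacy", "J.", "Ruth", "R. K."]
print(usable_share(sample))
```

Reporting this share alongside the results makes it clear how much of the corpus the gender analysis actually covers.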

Stacy Konkiel
Professional Data Wrangler 🤠