A Novel Method for Analysing Racial Bias: Collection of Person Level References

Abstract

Long term exposure to biased content in literature or media can significantly influence people’s perceptions of reality, leading to the development of implicit biases that are difficult to detect and address (Gerbner 1998). In this study, we propose a novel method to analyze the differences in representation between two groups and use it examine the representation of African Americans and White Americans in books between 1850 to 2000 with the Google Books dataset (Goldberg and Orwant 2013). By developing better tools to understand differences in representation, we aim to contribute to the ongoing efforts to recognize and mitigate biases. To improve upon the more common phrase based (men, women, white, black, etc) methods to differentiate context (Tripodi et al. 2019, Lucy; Tadimeti, and Bamman 2022), we propose collecting a comprehensive list of historically significant figures and using their names to select relevant context. This novel approach offers a more accurate and nuanced method for detecting implicit biases through reducing the risk of selection bias. We create group representations for each decade and analyze them in an aligned semantic space (Hamilton, Leskovec, and Jurafsky 2016). We further support our results by assessing the time adjusted toxicity (Bassignana, Basile, and Patti 2018) in the context for each group and identifying the semantic axes (Lucy, Tadimeti, and Bamman 2022) that exhibit the most significant differences between the groups across decades. We support our method by showing that our proposed method can capture known socio political changes accurately and our findings indicate that while the relative number of African American names mentioned in books have increased over time, the context surrounding them remains more toxic than white Americans.

Publication
Arxiv Preprint
Muhammed Yusuf Kocyigit
Muhammed Yusuf Kocyigit
PhD Student at Boston University

My research interests include better evaluating and improving LLMs, understanding the pre-training data of LLMs and computational social sciences