AI Shows Racial Bias When Grading Essays — and Can’t Tell Good Writing From Bad

Smith: Study finds ChatGPT replicates human prejudices and fails to recognize exceptional work — reinforcing the inequalities it's intended to fix.

Every day, artificial intelligence reaches deeper into the nation’s classrooms, helping teachers personalize learning, tutor students and develop lesson plans. But the jury is still out on how well it does some of those jobs, notably grading student writing. A new study found that while ChatGPT can mimic human scoring when it comes to essays, it struggles to distinguish good writing from bad. And that has serious implications for students.

To better understand those implications, we evaluated ChatGPT’s essay scoring ability using the ASAP 2.0 (Automated Student Assessment Prize) dataset. It includes essays written by U.S. middle and high school students. What makes ASAP 2.0 particularly useful for this type of research is that each essay was scored by humans, and it includes demographic data, such as race, English learner status, gender and the economic status of each student author. That means researchers can look at how AI performs not just in comparison to human scorers, but across different student groups.
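For readers who want a concrete picture of that kind of analysis, here is a minimal sketch in Python of a group-level comparison such a dataset makes possible. The file name and column names (human_score, ai_score, race) are hypothetical placeholders, not the actual ASAP 2.0 schema.

```python
# A minimal sketch of comparing AI and human essay scores by group.
# The file name and column names are hypothetical placeholders,
# not the actual ASAP 2.0 schema.
import pandas as pd

essays = pd.read_csv("asap_scores.csv")

# How far AI scores sit from human scores overall.
essays["gap"] = essays["ai_score"] - essays["human_score"]
print(f"Mean AI-minus-human gap: {essays['gap'].mean():.2f}")

# The same gap broken out by demographic group. A gap that is near
# zero for most groups but large for one is the kind of disparity
# a study like this flags.
print(essays.groupby("race")["gap"].mean())
```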

So what did we find? ChatGPT did score essays differently across demographic groups, but most of those differences were so small that they probably wouldn’t matter much in practice. There was one exception, however: a gap along racial lines, and it was large enough to warrant some attention.

But here’s the thing: the human scorers showed the same gap. In other words, ChatGPT didn’t introduce new bias, but rather replicated the bias that already existed in the human scoring data. While that might suggest the model accurately reflects current standards, it also highlights a serious risk. When training data reflects existing demographic disparities, those inequalities can be baked into the model itself. The result is predictable: The same students who’ve historically been overlooked stay overlooked.

And that matters a lot. If AI models reinforce existing scoring disparities, students could see lower grades not because of poor writing, but because of how performance has been historically judged. Over time, this could impact academic confidence, access to advanced coursework or even college admissions, amplifying educational inequities rather than closing them.

Our study also found that ChatGPT struggled to distinguish between great and poor writing. Unlike human graders, who gave out more As and Fs, ChatGPT handed out a lot of Cs. That means strong writers may not get the recognition they deserve, while weaker writing could go unchecked. For students from marginalized backgrounds, who often have to work harder to be noticed, that’s potentially a serious loss.
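There is a rough way to check for that kind of compression toward the middle: if a scorer clusters essays around the center of the scale, its scores will show a smaller spread and a smaller share at the extremes than the human scores do. A sketch, again with hypothetical file and column names:

```python
# Sketch: compare the spread of human and AI score distributions.
# A smaller standard deviation and fewer scores at the extremes for
# the AI would match the "lots of Cs" pattern described above.
# File and column names are hypothetical placeholders.
import pandas as pd

essays = pd.read_csv("asap_scores.csv")

for col in ("human_score", "ai_score"):
    scores = essays[col]
    at_extremes = scores.isin([scores.min(), scores.max()]).mean()
    print(f"{col}: std = {scores.std():.2f}, "
          f"share at extremes = {at_extremes:.1%}")
```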

To be clear, human grading isn’t perfect. Teachers can harbor unconscious biases or apply inconsistent standards when scoring essays. But if AI both replicates those biases and fails to recognize exceptional work, it doesn’t fix the problem. It reinforces the very inequalities that so many advocates and educators are working to eliminate.

That’s why schools and educators must carefully consider when and how to use AI for scoring. Rather than replacing grading outright, AI tools could provide feedback on grammar or paragraph structure while leaving the final assessment to the teacher. Meanwhile, ed tech developers have a responsibility to evaluate their tools critically. It’s not enough to measure accuracy; developers need to ask: Who is it accurate for, and under what circumstances? Who benefits and who gets left behind?

Benchmark datasets like ASAP 2.0, which include demographic details and human scores, are essential for anyone trying to evaluate fairness in an AI system. But there is a need for more. Developers need access to more high-quality datasets, researchers need the funding to create them and the industry needs clear guidelines that prioritize equity from the start, not as an afterthought.

AI is beginning to reshape how students are taught and judged. But if that future is going to be fair, developers must build AI tools that account for bias, and educators must use them with clear boundaries in place. These tools should help all students shine, not flatten their potential to fit the average. The promise of educational AI isn’t just about efficiency. It’s about equity. And nobody can afford to get that part wrong.
