Author: Jill Maschio, PhD
Some research articles on AI in education read more like a marketing campaign than a credible study. Let me explain. Marketing is the process of influencing consumers’ perceptions of a product, and companies spend millions of dollars a year to persuade consumers. The golden rule is that if you hear a brand name three times, it is more likely to stick in your mind when something triggers it. For example, suppose the same laundry soap commercial airs repeatedly on your favorite radio station. Now there’s a greater chance that you will at least think of that brand the next time you need laundry soap. This occurs through word association, and the brain has been forming word associations throughout life. As AI advances and we hear the hype about what it can do for us, we form associations, whether good or bad; the point is that this is how the brain works. So, whenever we hear of a research study showing positive results, the brain automatically associates AI with being effective for education.
But words matter. In the laundry commercial, the words suggest to the listener that the soap is effective or the best on the market, again influencing the connections we form. So, how is this happening in research studies about AI? It comes down to the words. I want to review the study “A Review of The GPT Surprise: Offering Large Language Model Chat in a Massive Coding Class Reduced Engagement but Increased Adopter’s Exam Performance” from Stanford University (2024). The Abstract reads, “We estimate positive benefits for adopters, the students who used the tool… there may be promising benefits to using LLMs in class…” (p. 1). They are talking about the students who used GPT-4.
Let’s look closer at the study. The Stanford researchers studied 5,831 students in two groups while they took a free online class about coding: those who had access to a customized in-class ChatGPT interface and those who didn’t. Students came from 146 countries. Students were offered an exam, but it wasn’t required to earn a course certificate. Of those students, 45.8% took the exam. Of those offered access to GPT-4, 14.2% did (N=510).
The researchers sought to address two questions: did students engage with GPT-4, and did doing so increase exam scores? The researchers framed GPT-4 to the participants as a positive tool for programming that also has some limitations, such as hallucinations and potential harm to the learning process.
The researchers found that those who used GPT-4 had a significantly higher average exam score than those who didn’t. The researchers acknowledge that age and user experience may have affected the results. One of their biggest observations, in the conclusion, concerned what caused the improved exam scores. They believe it was how the students used GPT-4: they tended to use it as a tool for better understanding. However, the researchers are transparent in stating that they did not do enough to safeguard against cheating.
In my opinion, here is perhaps the most significant limitation of the study, one that should have been addressed to help establish causality: the independent variable (IV) was not manipulated. In this case, comparing a group that used the product to a group that did not isn’t sufficient.
To help establish causality, the independent variable, AI usage, should have been manipulated by creating more groups (e.g., those who used AI for 1 hour, 3 hours, 5 hours, or more). In addition, there should have been a group that used other means of studying, such as reviewing videos online, rereading the material and taking notes, or testing themselves with a quiz bank (e.g., Quizlet).
So, basically, the study, perhaps in some cases, compared studying with AI to no study method at all and did not include manipulation of the IV. It is logical to assume that studying with AI may produce greater academic performance than no studying. Based on the written article, we don’t know what the control group did. By manipulating the IV, the researchers could have isolated its effects on the dependent variable and made a stronger inference about causal outcomes.
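To make the idea of manipulating the IV concrete, here is a minimal sketch of how students could be randomly assigned to study conditions. The arm names come from the groups I suggested above; the function name `assign_arms`, the seed, and the use of 5,831 placeholder student IDs are my own assumptions for illustration, not anything taken from the Stanford study.

```python
import random
from collections import Counter

# Hypothetical study arms, based on the groups suggested in this article.
ARMS = [
    "AI 1 hour",
    "AI 3 hours",
    "AI 5 hours",
    "online videos",
    "rereading and note-taking",
    "quiz bank",
]

def assign_arms(student_ids, arms=ARMS, seed=0):
    """Randomly assign each student to exactly one study arm."""
    rng = random.Random(seed)  # fixed seed so the assignment is reproducible
    shuffled = list(student_ids)
    rng.shuffle(shuffled)
    # Deal the shuffled students out round-robin so arm sizes differ by at most one.
    return {sid: arms[i % len(arms)] for i, sid in enumerate(shuffled)}

# Placeholder IDs 0..5830 stand in for real participants.
assignment = assign_arms(range(5831))
print(Counter(assignment.values()))
```

Because the researcher, not the student, decides who lands in which arm, differences in motivation or prior experience should be spread roughly evenly across the groups.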
Another factor is GPA. Any professor knows that some students care more about their grades than others do and that some students have extenuating circumstances that may prevent them from achieving high grades. How long did the experiment run? We are not provided with that information, but the time provided to complete the course could affect the results. These are critical confounding variables that these researchers did not address.
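A small simulation can show why an unaddressed confounder matters. This is only a sketch with invented numbers: I assume a single hidden trait called `motivation` that raises both exam scores and the likelihood of choosing to use AI, and I set the true effect of AI to zero. None of these values come from the study.

```python
import math
import random
import statistics

def simulate(n=20000, true_effect=0.0, self_selected=True, seed=1):
    """Return mean exam score of AI users minus mean score of non-users."""
    rng = random.Random(seed)
    users, non_users = [], []
    for _ in range(n):
        motivation = rng.gauss(0, 1)  # hypothetical unmeasured confounder
        if self_selected:
            # Assumption: more motivated students are more likely to opt in.
            p_use = 1 / (1 + math.exp(-2 * motivation))
        else:
            p_use = 0.5  # coin-flip assignment by the researcher
        uses_ai = rng.random() < p_use
        # Invented score model: baseline + motivation + AI effect + noise.
        score = 70 + 5 * motivation + true_effect * uses_ai + rng.gauss(0, 10)
        (users if uses_ai else non_users).append(score)
    return statistics.mean(users) - statistics.mean(non_users)

naive_gap = simulate(self_selected=True)
randomized_gap = simulate(self_selected=False)
print(f"self-selected gap: {naive_gap:.2f} points")
print(f"randomized gap:    {randomized_gap:.2f} points")
```

Even with the true effect set to zero, the self-selected comparison shows users scoring several points higher, while the randomized gap stays close to zero. In this toy model, the apparent benefit comes entirely from who chose to use the tool.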
With AI being pushed on educators, we need sound and reliable studies. I am not attacking this research, and I respect the research process. In my opinion, we need more studies that examine authentic learning. But I am also concerned that studies may prime people to believe that AI can do more than it really can. You can read my critique of Sam Altman’s statement urging educators to be more savvy with AI.
In the article “Why Most Published Research Findings Are False,” Ioannidis (2005) explains why false positives are a problem for research studies. Research articles that read as positive but may be flawed have the potential to mislead the field of education. Perhaps Dunnigan et al. (2023) have the right idea with their article, “‘Can we just please slow it all down?’ School leaders take on ChatGPT.” In this case, let’s do highly credible research so that educational leaders and educators can make the best decisions for student success. I see educators, even from top universities, supporting questionable research. As professionals in education, we must be more critical of the studies promoting AI for education so that we have solid science upon which to base our curriculum.
References
Dunnigan, J., Henriksen, D., Mishra, P., & Lake, R. (2023). “Can we just please slow it all down?” School leaders take on ChatGPT. TechTrends, 67, 878-884. https://doi.org/10.1007/s11528-023-00914-1
Ioannidis, J. P. A. (2005). Why most published research findings are false. PLoS Medicine, 2(8), e124. https://doi.org/10.1371/journal.pmed.0020124
Nie, A., Chandak, Y., Suzara, M., Ali, M., Woodrow, J., Peng, M., Sahami, M., Brunskill, E., & Piech, C. (2024). A review of The GPT surprise: Offering large language model chat in a massive coding class reduced engagement but increased adopter’s exam performance. Computer Science, arXiv. https://doi.org/10.31219/osf.io/qy8zd