I do not think that AI does what you think it does

INFORMATIVITY

I do not really care about AI.¹ I was exhausted by it almost immediately, feeling (fairly or unfairly) that it was a natural extension of other loathsome “Web 3.0” ventures like NFTs and “The Blockchain.” Even at first, when people were just having fun with the image generators, I did not really find it all that amusing: I found its images grotesque and the novelty of that grotesquerie wore off quickly, and I could not get it to generate the one image I desired.² This was not initially a moral stance but an aesthetic and philosophical one: I found the images (and, later, text) aesthetically-repellant and I have a tendency to be a contrarian luddite crank.

Since then, I have accrued the same myriad concerns shared among many reasonable people: concerns about ethics; about labour; about costs environmental, cognitive, pedagogical, and “spiritual” in nature. On a higher level, I believe that LLMs (and related technologies) appetize and indulge all of the worst human instincts: ceaseless consumption, sloth, prioritizing convenience and ease over effort and struggle and growth. Putting all of those other Very Big and Very Serious concerns aside, I am still just not all that interested in “AI.”³ I do not think that most humans should be looking to offload thinking most of the time.⁴ I am not convinced that having instant access to whatever we want is a good thing and I do not think that convenience and ease should be the primary pillars propping up one’s existence. I think that that having an easy shortcut around ever learning a skill will be detrimental, for at least some people, some of the time. Being able to mindlessly generate utterly-worthless text and images, being forced to consume utterly-worthless text and images, I do not think that these are good for us. This discussion is not about those concerns, however. There are a great many intelligent and learned people talking about the very real concerns AI presents. I am not one of those people.

This is also not a discussion of the technology itself. I understand the broad strokes of how these new technologies work—especially because the technologies themselves are not new—but I am no expert on them. It is not my discipline and, as has been intimated, I am not especially interested in it. Indeed, my natural reaction to being inundated with the topic has been to withdraw even further, paying it mind only when it supported my interest in something else. If you are looking for a technical dismantling of some particular LLM, you will find this discussion wanting.

I do, however, have another set of experiences and expertise that may be relevant to the present discussion: I am a career academic and research scientist.

CLARITY

Being trained as an academic is not especially or specifically-relevant to AI, despite how often it intrudes into my professional life. Perspectives on the use of AI in research vary somewhat: younger and apolitical grad students seem fairly keen on it as a way of reducing their considerable workload; early-career or left-leaning academics tend to be sceptical of its use, given that it is ethically-dubious and prone to failure; late-career academics are probably just letting their grad students do whatever they want. Its use in the classroom is far less controversial, however: almost everyone agrees that it is cheating and that it utterly undermines learning. Research papers written by ChatGPT are ugly and awkward and, worse, tend to be laden with errors. For whatever reason, GPT loves to hallucinate and cite papers that do not exist, or cite papers that do exist for studies they did not conduct and results they did not obtain.⁵ I do not really know why it does this but it seems to do so with some consistency, or at least consistently enough that professors and TAs across disciplines are having the completely new experience of checking a reference only to find that the paper does not exist or it covers another topic entirely.⁶

Of course, many academics are already aware of AI’s struggles with reading research. When ChatGPT was first becoming popular, it was a common diversion for academics to feed it their own work or the work of someone else and to observe how badly it would botch the interpretation. This struggle should not come as a great surprise, however, as GPT is not trained to read the academic papers of any particular discipline. It is trained to read and respond to lay language, whereas research papers only appear to be written as such. In truth, they are written in disciplinarian cant and are not fully-legible to anyone outside that discipline. For example, I am not an epidemiologist but I do find myself occasionally reading epidemiology papers. My training is such that I can understand the research and its methodologies, but there is undoubtedly still a dysfluency in my reading. This is because research, like any communication, is laden with group-specific sub-text, connotation, and is meant to be understood within a broader context.⁷ Because the intended audience of a research paper are academics, they often omit things like basic definitions, statements of group values and mores, commentary on social context within the broader scientific community, or provocative interpretations of one’s work or that of another researcher. This generally does not compromise communication, however, so long as the audience is indeed a trained academic.⁸ ⁹ A LLM trained on domain-general communication (or on thousands of other types of context-specific communication) will believe that it has correctly understood a research paper despite having missed crucial domain-specific information. A layperson would also miss this domain-specific information, of course, but he would (likely) understand that the paper’s broader significance had eluded him. The LLMs do not seem to understand this.¹⁰

I recently had occasion to share these thoughts with a non-academic friend and she found them surprising. Is this not exactly the sort of task at which a LLM should excel? And she is not wrong, in a sense. Summarizing a text, translating it into something more legible and identifying its main points, these are indeed well-suited tasks for a LLM, and are by no means a recent innovation. This particular party trick has been around for quite some time, though it has (to mine eyes) improved markedly in recent years. For these reasons I have just outlined, however, research papers seem to represent a particular struggle. This is seemingly at odds with the beliefs of many AI-enthusiasts and -agnostics alike; from what I have gathered, it is commonly-accepted that LLMs are indeed very good at reading, interpreting, and summarizing research.¹¹ Curious, she asked: Could I provide an example of this sort of difficulty? Do I have evidence of this tendency towards failure?

I was a little apprehensive, to be honest. It had been some time since I had personally seen ChatGPT dramatically fail to read an academic paper, even if undergraduate students continue to provide anecdotal evidence of the phenomenon. I am quite certain that the technology has improved considerably over the years and, technically-speaking, I do think that a LLM could reliably read academic research (even if it currently does not). Even if I believed that it occasionally failed at the task in spectacular ways—which I did—it seemed unlikely that I would be able to produce that sort of result without a great or systematic effort. It turns out that I was deeply mistaken.

TRUTH

This was not a scientific investigation. It was not systematic, it was not pre-registered or documented, and it is not meant to be presented or interpreted as a Serious Examination of AI. It was conducted over an hour or so at a laundromat and is being presented on an anonymous blog whose most common topic of discussion is consumer goods marketed to children and degenerates. Please do not mistake my semi-serious tone for gravity,¹² nor my pseudo-academic pontification for scholarship.¹³ I have no doubt that there are scientific investigations into this matter out there that are absolutely worth reading and non-scientific investigations that probably are not. Here is what I do provide: the brief experiences of an academic dipping his toe into this technology who happens to have the acumen to evaluate how well the technology performed.

I only ended up asking ChatGPT to interpret two papers for me. It turns out that that was all I needed. I used the venerated “ChatGPT 4o”,¹⁴ which I believed to be the one of the newer-and-fancier models and therefore one with the greatest chance of success. The first paper I asked it to interpret was Kidd and Castano’s “Reading literary fiction improves theory of mind”, which was published in Science¹⁵ in 2013. I chose this paper for a few reasons: first, it is highly-cited, widely-read, and was extremely impactful at the time of publication; second, it is a foundational paper in one of my areas of research specialization and so I know the paper very well; and third, it is now considered a highly-controversial paper and its findings are considered dubious at best.¹⁶ Across five studies, these researchers attempt to demonstrate that a brief (non-volitional¹⁷) exposure to literary fiction¹⁸ could improve performance on skills related to theory of mind and empathy. Subsequent high-powered attempts at replication have been mixed, to say the least. The effect also all but disappears in meta-analysis and has failed to stand up to the scrutiny of other meta-analytic analyses. I was curious as to how ChatGPT would interpret a highly-impactful paper that later became highly-disputed.¹⁹

Unsurprisingly, ChatGPT did a reasonably-good job of summarizing Kidd and Castano’s findings. The paper is written in the disjointed staccato common in Science, which can be pretty harrowing to read even for grizzled veterans of the field. Interpreting five studies presented in this style might be arduous for a simple academic but it should be no trouble for a LLM. GPT’s summary of the paper generally hit the most important points and it also emphasized some of the paper’s theoretical nuances,²⁰ which I am sure would have pleased the authors. It did not volunteer any critique of the paper, nor any broader commentary on the paper’s ignoble reputation. Fair enough. ChatGPT asked if I would like a table summarizing the findings across the five studies, which, yes absolutely I would. Who doesn’t love a nice table or graph? Here is where the trouble begins, unfortunately. GPT attempted to summarize the findings across these five studies—such-and-such group did better than so-and-so group on this-or-that measure—but on closer inspection, some of these were incorrect. The most important findings, the ones that supported the paper’s conclusions, they were generally intact, but not all of the condition comparisons were accurate. This did not strike me as catastrophic but it certainly was not ideal. The errors were nothing like the “hallucinations” I had witnessed in years prior, but they were failures (if minor). I chalked it up to the unusually-spartan text of this article, however, and allayed my concerns.

Next, I gave ChatGPT a fair shot at the task of criticism and asked it to critique the paper. This too was a mixed success but was honestly far more effective than I anticipated. GPT correctly identified several of the methodological issues with the paper and, much more importantly, informed me that subsequent high-powered replication attempts have failed to find the purported effect. Pretty impressive! For some reason though, GPT then rated the paper 4 out of 5 stars, which is a preposterous thing to do in general, but especially questionable given that it had just informed me that the finding does not replicate.

I considered these to be marks in GPT’s favour. It had incorrectly relayed a few results, sure, but it had done a much better job critiquing the paper than I had expected and it hadn’t utterly fabricated anything. Glorified copy-and-pasting it may be, but it had nonetheless largely succeeded at the task. If a student or layperson had asked for a summary and critique of the paper, they would have largely been fine (if slightly misinformed about some of the results and confounded by the paper’s “four-star rating”). If they had not requested a critique specifically, they would have been considerably less-well-informed, though no less so than if they had just read the paper themselves without ever reading anything else on the topic. I then moved onto a paper that I knew even better than Kidd and Castano’s: one of mine own.

Asking ChatGPT to interpret mine own work seemed like a natural next step. This was what I had seen academics do previously, after all, and there is no one more qualified than me to evaluate how effectively it had understood and translated the work. It also seemed like an interesting counterpoint to the previous paper: my published work would be hideously-verbose compared to the austere writing in Science; anything I have published would (naturally) be far less popular and impactful than the aforediscussed paper; and my work is also somewhat more complex than what I had previously fed ChatGPT. Whereas Kidd and Castano were generally conducting analyses on a singular outcome (at a time) and evaluating the result as a binary (i.e., there is an effect or there is not), my work often avoids such crass simplicities. In the paper I intended to share with ChatGPT, many of the analyses are “multivariate,” meaning there are several relations and outcomes being evaluated simultaneously. Furthermore, I avoid binary interpretations of effects whenever I can (except where obvious, such as in the case of a clearly-null relation), preferring instead to focus on interpreting the size of the effect and contextualizing that effect size within my work and the literature at large. These sorts of findings are somewhat more delicate than those the previous paper and are resistant to simple and single-sentence summaries like “x causes y”.²¹ I was curious as to how ChatGPT would handle this additional nuance. It turns out that it handled it reasonably-well, first, and then catastrophically-poorly, later.

The paper I had ChatGPT analyze comprised five studies, with several different designs and analytic approaches.²² The initial summary of the complete manuscript was reasonably-good, if a little light on actual findings—perhaps an early indication of difficulty with multivariate analyses. I decided to ask it to dig a little deeper and so requested additional detail on Study 4, which contained a great number of analyses and described a great number of relations. It summarized the findings in text and, again, did generally fine. It correctly identified which of the findings were of most interest to me and to the reader—as well it should, since I explicitly identify and highlight them in the text—and did not have any errors, as best as I could tell. It then asked if I wanted a table summarizing the study’s findings. Well, of course I did. I had (necessarily) presented the findings across no fewer than four tables and four graphs. Some of them might have been extraneous, sure, but I was very curious as to how GPT would boil down all of this information into a single two-dimensional table. Here is where the proverbial wheels fall off.

The table immediately struck me as strange. It reported effect sizes verbally and emoji-ly, using some sort of a system where a single checkmark sometimes meant “Possible” and sometimes meant “Moderate”, and “Strong” was variantly associated with either two or three checkmarks. Other words sometimes appeared in the tables alongside the effect size without explanation. The study’s findings could also be generally separated in two separate sets and ChatGPT had simply ignored one full half of them. Even if summarizing the findings would necessitate cutting many of the findings—as I said, this sort of study naturally rebuffs attempts at swift summarization—ignoring half of the findings was tantamount to misunderstanding the study entirely, as the two halves were meant to be interpreted alongside eachother. Putting those qualms aside though, I kept staring at the table, feeling like something was off. I looked at each of the predictors in the table’s first column and it started to dawn on me.. did I look at all of these relations? I mean, I did look at a lot of stuff but I do not remember looking at a few of these. They do not seem to fit the structure of the study. It had been a few years since I wrote this manuscript though, so I just went and checked: did I look at all of these relations?

No, of course not. You’ve been reading this whole thing,²³ you knew where this was headed. Of course I didn’t look at those relations and of course I didn’t forget them. The reason they jumped out to me as weird is because they didn’t even make sense within the study. Allow me a moment to explain in some very broad terms: This study included the creation of several “factors”, which were created using lower-level variables. The major analyses rely primarily on the factor-level predictions, but it would not be incorrect to evaluate variable-level predictions, so long as you did not combine the two. This table mixed together factors and variables, which is just patently incorrect within the logic of the study (and, obviously, not something I had done). ChatGPT had effectively just tossed together a bunch of the elements of analyses from the study and slammed them together into some sort of a pastiche of my findings, but added in a few utterly-fabricated ones along the way. It had created plausible but completely-imaginary analyses and then presented them to me as if they were my study’s findings. It wasn’t just that it was disastrously wrong—it had mistaken its own dreams for reality and then presented them with perfect confidence and absolutely zero equivocation.

What is especially strange about these errors is that they are not even replicable. In the course of writing this text, I tested ChatGPT again on Kidd and Castano’s paper and obtained completely different results. Its summary was largely the same and it again asked if I wanted a table collecting the paper’s results, but the table it provided was completely different from the one it had provided on my previous attempt. The table was in a different format and it summarized the results in a more generalized form (i.e., it did not show specific condition differences).²⁴ When I asked ChatGPT to critique the paper, it provided completely different and vastly-worse commentary. Broadly speaking, its observations this time around were correct but far more pat. Gone were the surprisingly-insightful references to other research and replication attempts. Gone, too, was the four-star rating. It did not see fit to provide a star-based rating at all this time around.

RELEVANCE

What you should take from this exercise is not that ChatGPT has never and will never correctly interpret an academic article. I am quite certain that it can and does, at least some of the time. The rub is that you don’t know when it has or it has not. You will not know whether it has accurately captured and relayed the results, whether it has ignored crucial results, or whether it has fabricated new results entirely²⁵—not without reading the paper yourself, anyway. Given this uncertainty, you cannot actually believe what it tells you, not if you actually care about the results or if there are some stakes involved.²⁶ And, after reading all of this, if you still believe that ChatGPT is excellent at reading academic papers but are not trained in said academic field, then please send me your address because I would like to try my hand at some construction on your house.

This exercise was fairly bracing for me. As has been discussed, I am an AI sceptic and a hater, but I also try to be realistic and reasonable. Everyone seems to believe that ChatGPT can read academic papers and, despite what I witnessed in years past and what I have seen in student papers, I figured there had to be more than a grain of truth to that. I realize now that this belief is likely the result of shared hype and naïvety. Everyone interested in these tools has been feeding GPT papers, reading the summaries, and telling one another what a good job it did. Few of them probably ever bother to read the papers carefully (if at all) and if they did they probably still would not understand them.²⁷ Never read the papers, never notice the inaccuracies, keep the hype up.²⁸ Meanwhile, the technology breaks down the second an expert glances at it.

If you have read this far and you have said aloud at your computer screen, “Well, you just didn’t use the correct prompts”, I would ask you which prompts I should be using and why basic prompts like “Can you explain this paper to me?” result in utterly-incorrect and hallucinated results. What kind of tool is that? And do you think others are using the “correct prompts”? Or do you think that they are defaulting to the plain language requests that seemingly act as a dark spell and (secretly) curse the LLM into failing completely?

As for what you should do with this information, far be it from me to give you advice on how to live your life. I just do not really have any interest in the thing at all and, with a few exceptions,²⁹ do not really connect with the desire to use it in the first place. I would say though that if you are relying on it to read academic papers then you should approach it with a considerable amount of caution. If you were a friend of mine and asked me if you should use it read to papers, having now had this experience I would say: No, of course not. You would be better off just skimming the thing for 45 seconds like grad students do before seminars. No matter how good a friend you were though, I probably would not ask you why it was so important to you to avoid a bit of reading.

Some final errant thoughts:

As I have stated a couple times now, I am quite confident that a LLM can and will be trained to do this task specifically and it will probably get good enough at it as to be reliable. There have already been some promising developments in that regard. If you must use AI to read an academic paper, maybe seek out one of these specifically-trained tools instead.
A few weeks ago, it dawned on me that I am probably going to be looking at grotesque AI art for the rest of my life and there is very little I can do about that. It was a devastating realization, I can tell you. I guess that is contingent on a few things though. Legislative and regulatory changes are certainly on the way, though if history is any indication then they will take hold in the EU and China long before they reach North America. In much the way the GDPR radically shifted the way corporations use the internet, forthcoming regulation (and other acts of legislation or justice) may also change the way companies like OpenAI operate. They may also limit consumer access to these tools—that is, if the AI companies don’t do it themselves, pulling the same bait-and-switch that companies like Uber, Google, and Netflix have.
I was recently exposed to a long bit of text ostensibly written by a private individual that seemed like vapid corpo-speak. It was filled with buzzwords and nothing-phrases and read like the sort of pablum companies like OpenAI force down our throats when they are angling for more funding—but this was from a person to a cohort of would-be friends and acquaintances. I found it strange. It had to be pointed out to me by someone else that, of course, it was all AI-generated.³⁰ Why would someone “write” that sort of meaningless bullshit and then force it on others? Did the person who produced it believe that it had meaning? Did they expect that others would find it meaningful? It reminded me of pseudo-profound bullshit and it made me a little sad.
It is no coincidence, of course, that this post comes on the heels of the paper demonstrating that using an LLM to write leads to worse performance across several domains and that students given the opportunity to do so naturally relied on it more-and-more over time. The paper itself is quite long and dense but there is a decent lay language interpretation in Time Magazine. For what it is worth, nothing about their findings will register as surprising to anyone with any experience with pedagogy.
My favourite AI-based pun is in the videogame AI: The Somnium Files. The protagonist of AI has a fake eyeball, called an AI-Ball, that contains a little digital assistant named Aiba (which is “close enough” to the Japanese pronunciation of the word “eyeball”). “Ai” is also one of the words for “love” in Japanese, so that gets wrapped in there too. Pretty good.

Footnotes

by which i mean of course the recent spat of LLMs and generative tools that have descended on modern life like a plague of locusts↩︎
Goro Majima wearing a keffiyeh. I am not sure why I wanted to generate that image. It was the first thing I thought of and whatever the popular generators were at the time (DALL-E? Midjourney? who cares) utterly failed at the task. Since then, whenever I am forced to interact with an AI-generation tool, I once again try this prompt. It has never generated a satisfying result (though I am quite certain that there is a tool out there that could).↩︎
Although I resent the synecdoche of using “AI” to refer to these new technologies, I will do so for the sake of readability.↩︎
which is to say that, yes: it is a good thing for some humans to offload some thinking some of the time.↩︎
If you are an AI enthusiast, you may now be thinking to yourself: No, it doesn’t, to which I respond: Yes, it does.↩︎
I was recently told about a grad student who referenced a particular paper of his supervisor’s to his supervisor, which she politely informed him did not exist.↩︎
Even denotation may differ between academic language and lay language, or between academic disciplines. The same words presented in the same order may mean different things depending on where you encounter them (this is true for the rest of human existence as well, so hopefully it does not come as a great shock to you). A word like “reactionary”, for example, is highly relevant to many different fields, despite having completely different meanings in each. These distinctions may also be widening further as academic language is increasingly adopted by the lay public and then assigned new (sometimes contradictory) definitions (e.g., cultural appropriation, parasocial relationship).↩︎
A trained academic, that is, with training that is comparable to your own. Reading 70-year old papers from mine own discipline can be a very confusing experience despite them being foundational in my field. This is because researchers in the 1950’s, despite sharing my discipline, had a very different set of standards around communication and were conducting research in a completely different social and scientific context.↩︎
Have you ever read a novel from the 19th century and, despite understanding the text itself, felt that you were missing something? Communication is not just linguistically-bound but culturally-bound and the rules and methods of communication change over time and between cultural contexts.↩︎
I stand by all of this speculative theorizing, of course, but none of it explains why LLMs so frequently hallucinate papers, studies, or analyses.↩︎
This would certainly explain why so many undergraduates continue to use it for that purpose, despite how often it betrays them.↩︎
I am compelled to write in this manner by a deeply-serious neurological condition inflicted upon me by a malevolent sorcerer when I was a mere child.↩︎
I certainly don’t.↩︎
Did you know that the “o” in “4o” refers to “omni”? I assume that this is because the model can accept and generate information of many different modalites: images, audio, video, and so on. Is it also the model responsible for all of those repugnant and increasingly-yellow “ghibli-style” images? Presumably. I don’t know. I told you I’m not really interested in this stuff. I would consider asking ChatGPT whether it was responsible for them and then linking a screenshot of the exchange here “for the bit,” but to do so would feel at odds with the opening paragraphs of this text.↩︎
For those unaware, Science is one of the most prestigious (domain-general) journals and it is exceptionally-difficult to publish an article in it.↩︎
One of the researchers also happens to be a figure of some notoriety, though I don’t think that his alleged “orgies” impacted this research, for better or worse.↩︎
meaning that subjects were randomly assigned to do it, as opposed to volitionally choosing to do it, as would be common in real life.↩︎
as opposed to popular fiction; the researchers propose that this effect would be found for, say, Crime and Punishment, but not for Twilight.↩︎
This is, to my mind, both a fair and unfair test. On the one hand, it is absolutely beyond the glorified copy-and-pasting that a typical text-summarization task might comprise. On the other hand, interpretation of academic work necessitates an understanding of theoretical and methodological mores and of other academic work. If I ask ChatGPT to interpret a paper that has been all-but-retracted and it neglects to tell me of that fact, then it has failed at the task of scientific interpretation, even if it has succeeded at the (rote, base) task of text summarization.↩︎
These nuances—which, for those interested, include the strict distinction between “literary” and “popular” fiction and the definition and relevance of the term “theory of mind”—have been divisive since the paper’s first publication. The paper’s methodology and replicability came under question some years later and are the true source of its controversy but these complex theoretical claims have always rankled some in the field. Understandably, ChatGPT did not comment on this. I do not really hold this against GPT, but I do think that it is worthy of highlighting: if a researcher says “I am studying theory of mind, here is what theory of mind is, here is how it is relevant, and here is how I am measuring it,” ChatGPT will repeat all of those claims back to you. It will not evaluate those claims for feasibility, however, and so may lure you into a false confidence about the paper’s theoretical claims. This would be less likely if one were to read a paper in full, but would prove challenging for a layperson (or negligent academic) regardless.↩︎
To be clear, I would not consider my work to be complex within the broader context of my field. For the purposes of this task, however, and in comparison to the other paper under consideration, my work does qualify as “somewhat complex.”↩︎
I do not intend to identify this paper by name because I consciously write this blog sort-of-anonymously. If this leads you to doubt my little exercise, then so be it!↩︎
or getting a robot do it for you↩︎
Based on a brief scan of the table, I believe the results were correct, but I did not bother verifying them exactly—that no longer seemed necessary.↩︎
all of which I encountered in this brief exercise↩︎
Of course, if you do not care about the results, then by all means, have at ’er. Just, maybe don’t repeat them to anyone else, lest you risk humiliation.↩︎
This is not a gibe. As has been discussed, academic papers are difficult to read for those outside the discipline for all kinds of reasons. It is not a fault to have difficulty reading papers from a field in which you have not been trained.↩︎
The Web 3.0 comparison seems fair after all.↩︎
It is undoubtedly a useful tool for programming if you either a) already know how to program or b) do not care about the results. I have seen students attempt to use it for statistical analysis programming, however, and it had a tendency to fail spectacularly. This is especially problematic for statistics as the incorrect analysis may still produce an intelligible result. Someone who did not understand the code that had been generated for them would not know that they had obtained a nonsensical result.↩︎
Truly, I am too innocent for this world.↩︎