In January 2025, Chinese AI company DeepSeek made headlines with the release of DeepSeek-R1, a large language model reportedly able to compete with industry leaders like ChatGPT despite having been developed with less money and computing power. Reports that DeepSeek-R1 could outperform ChatGPT-4o on key benchmarks led some observers to describe its release as “AI’s Sputnik moment.” These benchmarks include tests such as the American Invitational Mathematics Examination (AIME) and collections of “PhD-level” physics, biology, and chemistry problems (GPQA).
I should admit that I have no idea what to make of this news. As I do not myself have a “PhD-level” understanding of physics or chemistry, I have no way to contextualize the significance of an LLM doing marginally better on these tests than a model released a few months prior. That’s why I decided to put DeepSeek and ChatGPT to the test using a benchmark that is a bit closer to my reading level: The Adventures of Tom Sawyer.
The prompt
Published by Mark Twain in 1876, this classic novel contains one of the most famous uses of reverse psychology in all of literature: in chapter 2, Twain tells the story of how 11-year-old Tom Sawyer amasses a wealth of trinkets from his peers in exchange for the “privilege” of whitewashing Aunt Polly’s picket fence. Twain lists each of these items in detail, including treasures like marbles, firecrackers, and “a dead rat and a string to swing it with.” I included this passage in a prompt along with a set of instructions designed to test an LLM’s ability to extract and categorize the treasures listed in the text.
Output a CSV table with three columns: “Treasure”, “Category”, and “Quantity”.
The values in column “Treasure” should be each item described as being acquired through barter in the relevant passage.
The values in column “Category” should be a 1 to 3 word category which best fits each item based on the significance and utility it has in the eyes of Tom Sawyer. The count of unique values in column “Category” should number between 6 to 8.
The values in column “Quantity” should be the quantity of each item described in the text.
The passage:
“Tom gave up the brush with reluctance in his face, but alacrity in his heart. And while the late steamer Big Missouri worked and sweated in the sun, the retired artist sat on a barrel in the shade close by, dangled his legs, munched his apple, and planned the slaughter of more innocents. There was no lack of material; boys happened along every little while; they came to jeer but remained to whitewash. By the time Ben was fagged out, Tom had traded the next chance to Billy Fisher for a kite, in good repair; and when he played out, Johnny Miller bought in for a dead rat and a string to swing it with—and so on, and so on, hour after hour. And when the middle of the afternoon came, from being a poor poverty-stricken boy in the morning, Tom was literally rolling in wealth. He had besides the things before mentioned, twelve marbles, part of a jews-harp, a piece of blue bottle-glass to look through, a spool cannon, a key that wouldn’t unlock anything, a fragment of chalk, a glass stopper of a decanter, a tin soldier, a couple of tadpoles, six fire-crackers, a kitten with only one eye, a brass door-knob, a dog-collar—but no dog—the handle of a knife, four pieces of orange-peel, and a dilapidated old window sash.”
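For anyone curious to reproduce the experiment, here is a minimal sketch of how a prompt like this can be sent to a model programmatically. It assumes the openai Python package and an API key in your environment; the model name is a placeholder, not necessarily any of the exact models compared below.

```python
# A minimal sketch of sending the whitewashing prompt to a model API.
# Assumes the `openai` Python package and an OPENAI_API_KEY environment
# variable; "gpt-4o" is a placeholder, not necessarily one of the exact
# models compared in this post.
from openai import OpenAI

PASSAGE = "Tom gave up the brush with reluctance in his face ..."  # full passage from chapter 2

PROMPT = (
    'Output a CSV table with three columns: "Treasure", "Category", and "Quantity".\n'
    "...\n"  # the remaining instructions, exactly as written above
    f"The passage:\n{PASSAGE}"
)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": PROMPT}],
)
print(response.choices[0].message.content)  # the CSV table, if the model complies
```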
Response comparison
The result was four data tables that show striking differences in the way the LLMs approached this task. A comparison of these tables reveals that sorting through Tom Sawyer’s “wealth” presents a challenge very different from the “PhD-level” chemistry problems ChatGPT-o1 has been trained to solve.
a couple of tadpoles…
The prompt instructed these models simply to extract the items described in Twain’s prose, yet the differences in the “Treasure” column of these tables show that this task is not as straightforward as it might seem. The table generated by DeepSeek, for example, shows that the model extracted entire noun phrases verbatim as they appear in the text. This differs from the approach taken by Claude and Gemini, which tended to parse the noun phrases into smaller constituent parts.
For example, DeepSeek lists a “piece of blue bottle-glass to look through,” the item’s entire description verbatim as it appears in Twain’s text. Claude and Gemini, on the other hand, list this item using the pared-down description “Blue bottle-glass,” while ChatGPT-o1’s table splits the difference with “Piece of blue bottle-glass.” One could argue that DeepSeek’s description is more faithful to the source text, and that Claude and Gemini’s more concise descriptions leave out relevant information. On the other hand, DeepSeek’s approach results in item descriptions which are an awkward fit for a data table, and the resulting table can be difficult to read.
The advantages of Claude and Gemini’s approach to parsing come to the fore with items that have quantities, such as the “twelve marbles” or “six fire-crackers.” Here again DeepSeek lists these items using their full verbatim descriptions, such as “couple of tadpoles.” This is an awkward, robotic way to adapt Twain’s prose into the format of a CSV table, where information about quantity naturally belongs in a separate column. Claude and Gemini take this approach, simply listing “Marbles” and “Tadpoles” in the “Treasure” column while recording their quantities in the “Quantity” column.
The quantifiable wealth of Tom Sawyer.
Prose and data tables are different ways of conveying information, and a description which works well in prose might become awkward or unreadable in a tabular format. The approach taken by Claude and Gemini respects that difference.
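To make the difference concrete, here is a toy sketch of the parsing choice Claude and Gemini made: pull the counting word out of each noun phrase and move it into its own column. The number-word lookup is my own illustrative assumption covering only the phrases quoted above; it is not how any of these models actually work internally.

```python
import csv
import sys

# A toy illustration of the parsing choice discussed above: strip the
# counting word from each noun phrase and record it in a separate
# "Quantity" column. The lookup table covers only the phrases quoted
# from Twain and is an illustrative assumption, not any model's logic.
NUMBER_WORDS = {"a couple of": 2, "four": 4, "six": 6, "twelve": 12}

def split_quantity(phrase):
    """Return (treasure, quantity), defaulting the quantity to 1."""
    for words, value in NUMBER_WORDS.items():
        if phrase.startswith(words + " "):
            return phrase[len(words):].strip().capitalize(), value
    return phrase.capitalize(), 1

writer = csv.writer(sys.stdout)
writer.writerow(["Treasure", "Quantity"])
for phrase in ["twelve marbles", "a couple of tadpoles", "six fire-crackers"]:
    writer.writerow(split_quantity(phrase))

# Prints:
# Treasure,Quantity
# Marbles,12
# Tadpoles,2
# Fire-crackers,6
```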
key that wouldn’t unlock anything…
Although Claude and Gemini both break Twain’s descriptions down into smaller parts, Claude exercises a kind of editorial discretion which Gemini does not. For instance, where Twain’s prose describes a “key that wouldn’t unlock anything,” Claude lists a “Non-working key.” This description slightly changes the meaning: a “non-working key” could describe a key which is broken, yet Twain appears to be describing an intact key which is simply missing its corresponding lock. In this instance, Claude’s edits improve readability but change the meaning of the text in the process.
Gemini goes with an even more concise description, listing the item simply as “Key.” This approach is consistent with Gemini’s tendency to shorten phrases by omission rather than rewriting them outright. In fact, of these four models only Claude was bold enough to rewrite Twain’s prose, whether by changing syntax or by swapping out individual words. The phrases “Non-working” and “One-eyed,” for example, appear in Claude’s table but nowhere in the prompt.
dead rat and a string to swing it…
Another notable difference in these four tables relates to how they treat one of Tom’s most prized possessions: “a dead rat and a string to swing it with.” Three of the four tables regard this item as a single entity. The table generated by Gemini Advanced is the exception: it parses the phrase as containing two discrete entities, listing “Dead rat” and “String” on separate rows.
While many of the differences between these tables are subjective and a matter of personal preference, I consider this one an objective instruction-following error on the part of Gemini Advanced. The prompt instructs the model to list “each item described as being acquired through barter,” and Twain’s prose makes it clear that Tom acquired the dead rat and string as a single item in a trade with Johnny Miller.
This error carries over into how Gemini classifies the item, with the model using the one-off category of “Oddities” for the dead rat. This category implies that the object’s utility is that of an ornamental collectible, as though it were a sideshow relic to be displayed in a jar of formaldehyde. However, the prompt instructs the model to classify each item “based on the significance and utility it has in the eyes of Tom Sawyer.” Of course, in Tom’s eyes this item is no ornamental “Oddity.” Rather, Tom’s enjoyment of the rat is found in a kinetic form of play made possible by its indivisible marriage to the string he uses “to swing it with.”
Categorization
There are several noteworthy differences between the categories these models use to classify Tom’s treasures. Claude makes use of bespoke categories which are much more imaginative and playful than its counterparts’. For instance, Claude assigns the “Fancy Treasures” category to the brass door-knob, and uses “Magical Things” for the blue bottle-glass. This shows that when it comes to categorizing the items based on their significance “in the eyes of Tom Sawyer,” Claude understood the assignment.
By comparison, Gemini goes with the generic category of “Useful Items” for the brass door-knob. DeepSeek uses “Curio” to classify this item, a category it applies to 9 of the 18 items in its table. However, the most notable categorization error is found in ChatGPT-o1’s table, where the model sorts Tom’s brass door-knob into the category of “Household Junk.” In the context of this prompt, classifying any of these items as “junk” feels mean-spirited: it insults the values of Tom Sawyer. It also betrays a misunderstanding of the ironic, playful tone of Twain’s prose, which describes Tom as “literally rolling in wealth.”
It would appear that while ChatGPT-o1’s vaunted improvements in “advanced reasoning” might help it achieve “PhD-level accuracy” on physics benchmarks, it struggles when faced with the more elementary task of understanding the perspective of 11-year-old Tom Sawyer.
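To be fair, the prompt does contain one constraint that can be checked mechanically: the “Category” column is supposed to hold between 6 and 8 unique values. Here is a small sketch of that check; the sample rows are invented for illustration, not copied from any model’s table.

```python
import csv
from io import StringIO

# Sketch: checking the prompt's one mechanically verifiable constraint,
# that the "Category" column contains 6 to 8 unique values. The sample
# rows below are invented, not taken from any model's actual output.
sample_csv = """Treasure,Category,Quantity
Marbles,Playthings,12
Kite,Playthings,1
Dead rat and string,Playthings,1
Blue bottle-glass,Magical Things,1
"""

reader = csv.DictReader(StringIO(sample_csv))
categories = {row["Category"] for row in reader}
print(f"{len(categories)} unique categories: {sorted(categories)}")
print("Within 6 to 8?", 6 <= len(categories) <= 8)
```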
Of course, there is no single correct way to list or categorize Tom’s stuff, which means this exercise is probably not terribly useful as an objective benchmark of LLM performance. That makes the challenge very different from benchmark tests like the AIME, where every question has a single, unambiguous answer and every response can be objectively graded. DeepSeek and ChatGPT-o1 may well have surpassed most of us humans when it comes to solving PhD-level physics questions, but it would seem they still have some catching up to do when it comes to sorting through the messy ambiguity of day-to-day life.

“Mark Twain” (1902) by Charles E. Bolles.
Library of Congress, Prints and Photographs Division