Machine translation systems and literary translation

by Peter Winslow, published on November 21, 2022

October 25, 2022, saw the publication of a research paper titled “Exploring Document-Level Literary Machine Translation with Parallel Paragraphs from World Literature,” whose primary aim is “to understand how both state-of-the-art MT systems and MT evaluation metrics fail in the literary domain” (see Introduction). While understanding failure can be instructive, this research provides more than instruction; it provides a surprise, to me at least. According to this paper, “expert literary translators prefer reference human translations over machine-translated paragraphs at a rate of 84%,” as stated in the abstract, or “professional translators prefer reference translations at a rate of 85%,” as stated in the introduction. Those numbers are incredible; it strains credulity that anyone with expert, no, with even advanced translation knowledge and/or with even advanced literary knowledge could prefer machine translations of literary texts over human reference translations of literary texts at a rate of 15% or 16%. Machine translation is not that good, Punkt.

My first thought was that something must be wrong with the human reference translations. Section 2.1 describes the criteria for selecting works of literature, but fails to offer any real information regarding the reference translations. The criteria for selecting works of literature consisted of a literary work’s being in (1) “the public domain of its country of publication by 2022 with (2) a published electronic version along with (3) multiple versions of human-written, English translations.” Nothing in these criteria or in the research offers real insight into the overall quality of the “human-written, English translations” used. The best one receives is certain language in 3.1 Experimental setup, which indicates the translations were published, but without any indication by whom. Were they self-published? published by reputable publishing houses known for good literary translation? In other words, one must withhold judgment here.

But this problem got me thinking about who these “expert literary translators” are, what their qualifications and expertise would have to be. It turns out that the authors of this research fail to define this operative term. The authors of this research disclose only that they hired “human experts ([…] literary translators fluent in both languages) to perform A/B tests” on the translated paragraphs, translated by both humans and Google Translate (3.1). The authors go on to call these expert translators “professional literary translators” (3.1 Experimental setup) – who, we learn from footnote 10, were required to have “experience translating” in certain language pairs. But that’s it. One learns nothing more of these translators’ qualifications or expertise. Even rudimentary information is omitted. What are their degrees? What experience do they have – beyond “translating”? Have they published literary translations? Do they teach literature? literary translation? A curious omission.

It’s curious, because it turns the designation “expert literary translator” into a perfect example of the kind of “critical key word” that the literary critic and rhetorician I.A. Richards (1893–1979) damns in his book Practical Criticism. In his chapter titled “Technical Presuppositions and Critical Preconceptions,” Richards writes that critical key words, at home in certain critical preconceptions, excel in a kind of duplicity; they disguise “great vagueness and ambiguity behind an appearance of simplicity and precision.” Like Richards’s critical key words, “expert literary translator” excels in this kind of duplicity; absent any specification of the translators’ qualifications and expertise, it is impossible to know what the apparently simple and precise designation is supposed to mean. It is vague and ambiguous, at best.

At worst, it is, perhaps most obviously, an invitation to believe that a common designation entails a common education or that expert literary translators all share certain common qualifications – beyond, it would seem, “experience translating.” Yet, there exist, say, no licensing requirements to become a literary translator, expert, professional, or other, and the translation profession remains, by and large, unregulated in most, if not all, parts of the world, including in the United States, where the research was conducted. Hence, one must greet this invitation with skepticism; in translation, designations are not of necessity credentials. Are these literary translators subject-matter experts specialized, say, in literary criticism, a certain author, a certain time period? Questions abound, these and more. And answers are wanting.

Now I do not wish to suggest that the translators who participated in the research are not “expert literary translators.” They very well may be, though I do have doubts. Neither do I wish to suggest that the authors of this research do not mean something definite with this designation. I wish to suggest only that readers have no idea what “expert literary translator” means in this context, as the authors of this research fail to offer any information readers need to understand what it means.

Unfortunately, the problems do not stop here. In addition to that question of designation, there exists another question, given that the “expert literary translators” each “completed 50 tasks in their language of expertise” (3.1 Experimental setup). Fifty is not an insignificant number of tasks – hence, the question: were the source texts, the texts to be translated, in each language pair homogeneous enough that a single translator per language pair would be competent enough to evaluate all the translations of all the source texts? I mean, were all the source texts in each language pair roughly from the same country/region? roughly from the same time period? roughly from the same literary tradition? In other words, did the source texts in each language pair exhibit commonalities enough that would render it likely that a single translator could evaluate all the translations in his or her language pair?[1]

Of course I am not qualified to answer this question, or parts of it, in relation to all three source languages tested, French, German, and Russian (see 3.1 Monolingual vs translator ratings). But I am qualified to answer this question, and all its parts, in relation to German. And the answer is no. Allow me to explain.

Here, our primary source is Table 5, whose caption assures us it contains a “full list of the literary texts from which the source paragraphs [] are sampled,” with among other things the authors’ names and each text’s year of publication. This table lists nine German-language writers and sixteen German-language books. At first blush, this general data suggests a homogeneity; however, closer examination reveals important intricacies.[2]

And one does not even have to examine that data very closely. The writers are Goethe, Sacher-Masoch, Spyri, Hesse, Mann, Kafka, Schnitzler, Zweig, and Rilke, all active between the eighteenth and mid-twentieth centuries. This alone should be reason enough to give rise to doubts about the homogeneity in, and to sense the intricacies of, the literary works sampled in this research. Still, one need not rely on this sense of the intricacies. The underlying facts bear them out.

Five or roughly 55% of these nine writers were from the Austrian Empire (Sacher-Masoch) or the Austro-Hungarian Empire (Kafka, Zweig, Rilke, Schnitzler), but all from different regions; three or roughly 33% from Germany (Thomas Mann, Hesse, Goethe), but all from different regions; and one or roughly 11% from Switzerland (Spyri).

The dates fall very roughly into three categories, spanning three centuries: (1) late eighteenth/early nineteenth centuries (comprising two or roughly 12.5% of the publications, both by Goethe in 1774 and 1809), (2) late-nineteenth century (comprising two or roughly 12.5% of the publications, one by Sacher-Masoch in 1870, one by Spyri in 1881), and (3) first half of the twentieth century (comprising twelve or roughly 75% of the publications, four by Kafka in 1915, 1924, 1925, and 1927; three by Thomas Mann in 1901, 1912, and 1924; two by Hesse in 1922 and 1927; one by Zweig in 1939; one by Rilke in 1910; and one by Schnitzler in 1926).

Of the nine writers, only three, or roughly 33%, can be subsumed under a common category, but perhaps with certain caveats: Kafka, Schnitzler, and Rilke can be subsumed under a sort of Austrian modernism. This is not true of the rest of the writers. While, today, Thomas Mann is perhaps most closely associated with so-called Exilliteratur, he wrote various kinds of literature, novels, a Bildungsroman, short stories, novellas, essays, etc. To some extent, the same is true of Zweig, historical studies, biographies, plays, essays, fiction broadly defined, etc. Hermann Hesse is perhaps best known for his themes of self-realization and Eastern mysticism. Spyri is perhaps best known for children’s literature; Sacher-Masoch – if he is generally known at all – for being the namesake of masochism. And while Goethe was an important figure first in the Sturm und Drang movement and later in the Weimar Classicism movement, he is known quite simply for being one of the best writers to have ever written in the German language, as poet, as playwright, as epistolary novelist, etc.

Put summarily, the answer to the three-part question posed above – whether all the German-language texts are roughly from the same country/region, roughly from the same time period, and roughly from the same literary tradition – is a hard no. As such, the German-language source texts do not exhibit commonalities enough that would render it likely that a single translator could evaluate all the German to English translations. I have translated literature and philosophy and epistles, and I know I would not be confident in my evaluation of translations of such disparate German literature. Among other things, I could not be confident in my ability to identify and to understand all the regionalisms, all the literary and other allusions, earlier or later meanings of words and phrases which may or may not have been deployed in those texts.

Quite simply, my expertise as a translator has its limits, despite my liberal arts training, despite a degree in philosophy, despite a degree in German Studies, despite more than fifteen years of translation experience, despite a number of published translations. None of these facts would eradicate or alleviate my reasonable fear that misunderstandings, misinterpretations, and anachronisms would crop up in my evaluations. I am not so well-read in all the various German and Austrian and Swiss literary traditions from the eighteenth to the twentieth centuries, which are relevant here, that I could be confident in my ability to avoid those things. To the best of my knowledge, there exist very few translators who would be so well-read and who would therefore have such confidence. In point of fact, I can think of only one in my field who was so well-read and who might have justifiably had such confidence, were he alive today. And that is Michael Hamburger (1924–2007), who translated – not without success – writers as diverse as Goethe and Hölderlin, Brecht and Hofmannsthal, Celan and Sebald. But, to my knowledge, Hamburger is not the rule; he is the exception.

Hence, this research seems to be marred by one of two assumptions. It assumes either that evaluating literary translation is a purely language-related matter independent of geography and accompanying differences in language usage, time periods, and/or literary traditions or that evaluating literary translations of such diverse writers is easy for “expert literary translators.” As for the first assumption, it is questionable, for the reasons given above; in some sense, expertise is defined by limits. As for the second assumption – I don’t know what my response should be. Perhaps, this: if evaluating literary translations of such diverse writers is easy, one should sit down and try. Try evaluating a translation of Goethe’s The Sorrows of Young Werther and then of Venus in Furs by Sacher-Masoch and then, say, of The Metamorphosis by Kafka. It is easier to say a thing than to do it. I suspect at least two of the evaluations will be found wanting in important respects. Like chess, translation is a full-knowledge activity; all information is available to all. Get something wrong, and your ignorance shows.

These facts and misgivings bring us back to my doubts that the translators engaged for this research are “expert literary translators.” The “expert literary translator” engaged for the German to English evaluations gets his or her evaluation of a passage from Thomas Mann’s The Magic Mountain horribly wrong – see Table 13. This table gives this passage in German, a translation rendered by Google Translate (the “GT Translation”), and a translation done by a human (the “Human Translation”). Given my credentials, I am confident I can evaluate the translation of this passage and assess the evaluation given by the “expert literary translator.”

The passage from Mann’s The Magic Mountain reads:

Joachim ging, und es kam die »Mittagssuppe«: ein einfältig symbolischer Name für das, was kam! Denn Hans Castorp war nicht auf Krankenkost gesetzt, – warum auch hätte man ihn darauf setzen sollen? Krankenkost, schmale Kost war auf keine Art indiziert bei seinem Zustande. Er lag hier und zahlte den vollen Preis, und was man ihm bringt in der stehenden Ewigkeit dieser Stunde, das ist keine »Mittagssuppe«, es ist das sechsgängige Berghof-Diner ohne Abzug und in aller Ausführlichkeit, – am Alltage üppig, am Sonntage ein Gala-, Lust- und Parademahl, von einem europäisch erzogenen Chef in der Luxushotelküche der Anstalt bereitet. Die Saaltochter, deren Amt es war, die Bettlägrigen zu versorgen, brachte es ihm unter vernickelten Hohldeckeln und in leckeren Tiegeln; sie schob den Krankentisch, der sich eingefunden, dies einbeinige Wunder von Gleichgewichtskonstruktion, quer über sein Bett vor ihn hin, und Hans Castorp tafelte daran wie der Sohn des Schneiders am Tischlein deck dich.

The GT Translation reads:

Joachim went, and “Lunchtime Soup” came: a simple symbolic name for what was coming! Because Hans Castorp was not put on sick food - why should he have been put on it? Sick diet, small fare, was in no way indicated in his condition. He lay here and paid the full price, and what is brought to him in the standing eternity of this hour is not a “lunchtime soup,” it is the six-course Berghof dinner without deduction and in great detail - sumptuous in everyday life, closed on Sundays Gala, pleasure and parade meal, prepared by a European-educated chef in the luxury hotel kitchen of the institution. The maid, whose job it was to look after the bedridden, brought it to him under nickel-plated hollow lids and in delicious jars; She pushed the patient’s table that appeared, this one-legged marvel of balanced construction, across his bed in front of him, and Hans Castorp ate at it like the tailor’s son at the little table, cover yourself.

The Human Translation reads:

Joachim would leave, and the “midday soup” would arrive—soup was the simplified, symbolic name for what came. Because Hans Castorp was not on a restricted diet—why should he have been? A restricted diet, short commons, would hardly have been appropriate to his condition. There he lay, paying full price, and what they brought him at this hour of fixed eternity was “midday soup,” the six-course Berghof dinner in all its splendor, with nothing missing—a hearty meal six days a week, a sumptuous showpiece, a gala banquet, prepared by a trained European chef in the sanatorium’s deluxe hotel kitchen. The dining attendant whose job it was to care for bedridden patients would bring it to him, a series of tasty dishes arranged under domed nickel covers. She would shove over the bed table, which was now part of the furniture, a marvel of one-legged equilibrium, adjust it across his bed in front of him, and Hans Castorp would dine from it like the tailor’s son who dined from a magic table.

According to the caption given at the end of Table 13, the translator’s evaluation favored the GT Translation over the Human Translation. But why? According to 3.1 Monolingual vs translator ratings, the “expert literary translator” evaluates the Human Translation as containing a “catastrophic error,” an error defined to invalidate the translation completely (see Table 3); according to this “expert literary translator,” the Human Translation

contains several mistakes, mainly small omissions that change the meaning of the sentence [sic], but also wrong translations (‘trained European chef’ instead of ‘European-educated chef’).

Save for the criticism of the translation of “einem europäisch erzogenen Chef,” this evaluation lacks determinate criticism (what mistakes, what omissions, what changes in meaning?). Now, this indeterminacy might be owed to a constraint imposed by the authors of this research; they “asked each rater to choose the 'better' translation and also to give written justification for their choice (2-3 sentences)” (3.1 Experimental setup). Still, this constraint does not explain why this “expert literary translator” added only one example to illustrate his or her meaning; surely, additional examples would have conformed to this constraint. For the sake of argument, however, let us grant the sentiment: the Human Translation is not the best translation a human could do. Even if one grants this sentiment, there still exist a number of reasons why the Human Translation is objectively preferable to the GT Translation. Here’s why, rudimentarily, indicatively, with no claim to completeness.

The German passage is obviously part of a literary narrative and wears the movement of its prose on its sleeve. And that’s just in the first two sentences. Mann deploys the rhetorical figure of epistrophe, repetition at the end of a sentence, clause, etc. (Krankenkost, schmale Kost). There is a play on words: “Kost war auf keine Art” is seemingly meant to invoke, playfully, in the reader’s mind common names of German dishes such as “Schnitzel Wiener Art,” “Matjes Hausfrauenart,” and the like. At about the half-way point of the passage, a known literary/rhetorical device is used: diction follows substance; certain words announce what follows – “Ausführlichkeit,” for instance, announces a kind of “ausführlich” description. And the passage ends with an allusion to Grimms’ fairy tale “Tischchen deck dich, Goldesel und Knüppel aus dem Sack” (Mann writes “Tischlein deck dich”).

The Human Translation is obviously part of a literary narrative and more or less wears the movement of its prose on its sleeve. The GT Translation is little more than a report; there’s nothing literary about it. This is to say: while the GT Translation captures the movement of the prose, it does so awkwardly, with a matter-of-factness inappropriate to the passage. (Why does the “expert literary translator” not evaluate this as a “catastrophic error”? Surely, this completely invalidates the translation.) Google Translate also mistranslates “Krankenkost” as “sick food,” but “sick food” does not lend itself to ready comprehension. (Again: why does the “expert literary translator” not evaluate this as a “catastrophic error”? Surely, this completely invalidates the translation.) The human came up with a plausible translation, “restricted diet,” which is not a stumbling block, a huge plus in translation. The rhetorical figure cannot – anyway, I cannot think of any way it can – be reproduced in English here, but the human handles this bit of text much better than Google Translate; the Human Translation uses a known English term, “short commons,” whereas the GT Translation, “small fare,” is overly literal and seems out of place in this context. (Again: why does the “expert literary translator” not evaluate this as a “catastrophic error”? Surely, this completely invalidates the translation.) The play on words “auf keine Art” gets lost in both translations, understandably so. While, in the GT Translation, the medical term used by Mann, “indiziert,” is translated with the corresponding English medical term “indicated,” this translation is – and this is not an exaggeration – the only thing right in this sentence. In the Human Translation, it is true, the translator opts for a different rendering, but the rendering is one of everyday language; it is not imprecise or inaccurate – the charitable reading here is that the translator opted for a different register to fit his or her translation of this term into a cohesive style – to make it of a piece.

Google Translate captures the literary/rhetorical device, I guess. Yet, Google Translate makes a fine mess of it; “without deduction and in great detail” is overly literal and nonsensical. (Again: why does the “expert literary translator” not evaluate this as a “catastrophic error”? Surely, this completely invalidates the translation.) The Human Translation captures this device – if not very well, then at least competently. It is true that Google Translate handles “einem europäisch erzogenen Chef” better than the human does, as evaluated by the “expert literary translator,” but this advantage is not enough to justify an evaluation that the Human Translation contains a “catastrophic error” or that the GT Translation is preferable to the Human Translation.

As for the allusion to Grimms’ fairy tale: Google Translate missed it, and missed it badly – “little table, cover yourself” is, quite simply, nonsense. (Again: why does the “expert literary translator” not evaluate this as a “catastrophic error”? Surely, this completely invalidates the translation.) It is difficult to say whether the human translator missed it. The Human Translation certainly makes sense in this context; the idea of a magic table no doubt works here: the imagery is right. But one cannot properly evaluate the Human Translation without more information concerning the translation and without certain historical/literary facts such as: when was the translation published? what translation(s) of Grimms’ fairy tales existed and how widely known in the English-speaking world were they at the time?

Because I do not know when the Human Translation was published and because I am not an authority on Grimms’ fairy tales, I performed a quick Google search. And it revealed that this fairy tale has been translated into English at least twice, both dating back to the 1880s, prior to the German publication of Thomas Mann’s book in 1924. In one of those translations “Tischchen deck dich” is rendered simply as “table”; in the other, quite wonderfully, as “wishing-table.” In other words, even rudimentary research should convince any “expert literary translator” that an evaluation here must remain inconclusive without such information and facts. Without them, possibilities abound, and with those the incertitude about the choice of translation grows.

It is possible the human translator missed this allusion. But it is also possible – and, I believe, a charitable interpretation demands the affirmation of the possibility – that the human translator recognized the allusion. But even so: given the rudimentary facts outlined above, this charitable interpretation opens up further possibilities. It is possible that the translator knew the translation of “Tischchen deck dich” only as “table” and felt compelled to invoke the fairy-tale overtone by adding the word “magic” – in fear, say, that readers would misunderstand the allusion without that addition. It is also possible that the translator translated the allusion himself or herself, without regard to the existing translations of that fairy tale. It is also possible that the translator (1) believed the allusion would be lost on an English-speaking readership – because he or she believed, for whatever reason, that such readership lacks sufficient familiarity with Grimms’ fairy tales to recognize the allusion – and (2) therefore chose to translate generally. Today, I imagine, one would translate “Tischlein deck dich” as “wishing-table” and not think twice about it.

The “expert literary translator” engaged to evaluate the German to English translations of literary texts fails to weigh the literary merit exhibited by the Human Translation against the absence of literary merit exhibited by the GT Translation. This failure, together with the indeterminate nature of his or her criticism, makes his or her evaluation seem flippant – no, unbefitting of an “expert literary translator,” however defined.

In conclusion, this research is flawed. I am not confident that, in its current form, it can survive the general criticism unmasking the duplicity of its operative term “expert literary translator” – apparently simple and precise, but truly vague and ambiguous. This criticism seems to me to inflict a mortal wound on this research, at least as it relates to the claim that expert literary translators prefer reference human translations over machine-translated paragraphs at a rate of 84% or 85%. The remaining criticisms do not, perhaps, inflict mortal wounds, but they do inflict serious wounds on this claim – at least one third of the languages tested was tested with serious flaws. And these require further examination and treatment, if this claim is to survive.



[1] Posed in this way, the question presumes there was only one translator per language pair, though it is difficult to know whether this presumption holds or no. The authors are not explicit on this point. In any case, my presumption is based on the language used in the captions to Tables 13, 14, and 15 appended to the research paper. That language seems to suggest there was only one translator per language pair; those captions each speak only of “the translator.”

[2] Any one of the distinctions/categorizations made here ought to be taken as (very) rough distinctions/categorizations, made for the purpose of illustration and brevity in composition; finer distinctions/categorizations are no doubt possible.


Further reading

“In Literary Translation, Humans Prefer Humans and Machines Prefer Machines,” dated November 4, 2022 (accessed on November 21, 2022)

“Humans still beat machines when it comes to literary translation,” dated November 8, 2022 (accessed on November 21, 2022)

4 comments

So obviously IT still hasn't developed a feeling for literary atmosphere. Thank god.

Apart from the fact that every translator has to make a choice between clinging more to the literal meaning of words or emphasizing the atmosphere or the flow of the language at every turn, which the reader can "like" or "dislike", I agree that in the chosen example it's hard to see how anyone could prefer the IT translation. Basically and not surprisingly the IT completely missed the magical / fairy-tale undertones of the German in the last part of the passage, even considering the time when the text was written. "ate", "push", "construction" simply feel wrong here. The human translation is definitely superior in these aspects. One may discuss whether it's preferable to simply leave out the strange combination of "leckere Tiegel". While "jar" seems ok, the humour of the expression is lost; the human translator simply left this out, although I like the "domed nickel covers".

I agree. The omission you mention is not the only one, however; "Sonntag" was also omitted earlier in the passage. I also agree that certain omissions in literature are issues for discussion – occasions for the best sort of criticism. One way I like to think about this point is this, that literature is not an inventory of words. As you point out, sometimes the point is not whether every word has been translated, but whether the overall atmosphere (great word, by the way!) has been conveyed.

Yes, and AI would have to decide whether the use of a certain word is meant as everyday or artificial language or even ironically, which would mean knowledge of the kind of text and of the time at which it was written. Then it would have to decide whether the prime importance of a word in that context would be its meaning, its sound, like in a rhyme, or its rhythm, e.g. for a metre. And then AI would have to make a weighted decision...

There is a situation in a Harry Potter book where someone says in an Astrology lesson: "Yes, I want to see Uranus too." I remember reading a literal translation in my German edition, which gives the completely wrong impression of someone interested in a planet instead of someone interested in making a dirty joke. If lacking astronomy-related German dirty jokes, probably any stupid joke would have captured the situation better ...

I've recently read a book by Pascal Mercier, "Das Gewicht der Worte". I don't know whether I really liked it, as it was a bit long. But at its heart there is a discussion of creativity as a necessary part of translating, even in conflict with the author's own creativity. Perhaps this spark of creativity is what is missing in machine translations.

The book also discusses how far the technical language of different professions is used as a tool for excluding others and strengthening their own subculture, and therefore ultimately as an instrument of power. This is also interesting, as there is an ongoing discussion whether the dominance of the legal language at family courts and the willingness of psychologists and social workers to adopt it are preventing those professions from bringing their own strengths to bear on the proceedings.

Your point about an AI having to make a weighted decision seems right. And that is surely why AI fails so often. It cannot distinguish – and I do not believe it will ever be able to distinguish – the sorts of language use in which humans engage as a matter of course. At its base, AI-powered machine translation appears to take neutral exposition as its norm – apparently coupled with a very bad theory of meaning (very roughly it seems to be something like: words have fixed, definite meanings independent of any context or if not independent of any context, the term 'context' would have to be defined so narrowly as to be meaningless). In other words, yes, I suspect you're right about machine translation's missing that spark of creativity. 

It's interesting that you mention that the adoption of jargon at home in fields in which neutral exposition is a norm (and rightfully so in those fields) might be preventing other professionals from bringing their own strengths to certain matters. That thought strikes me as right; at any rate, I've seen it personally in my own life.

It looks like I might just have to read Mercier's book. Thanks for the tip!

PS: The Potter joke is very good; it appeals to my inner child. What a shame it got lost in translation in the dullest possible way!
