Language mystery: Computers


Tuesday, 14 March 2023

Translating numbers 2: thousands and decimals

Translating numerals is more complicated than it seems. The number 20.525, for example, would be just over 20 in English, but more than twenty thousand in German. But that is just the beginning. There is a chaotic variety of conventions for writing numerals in different languages. Let us start exploring.
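The ambiguity of 20.525 can be made concrete with a few lines of Python. This is only a minimal sketch: the function name `parse_numeral` and its parameters are invented for this illustration, and it assumes that the reader already knows which convention a text follows, which is exactly the information a bare numeral does not carry.

```python
from decimal import Decimal


def parse_numeral(text: str, thousands: str, decimal: str) -> Decimal:
    """Interpret a numeral string under one explicit separator convention.

    Hypothetical helper for illustration only; it does no validation.
    """
    # Drop the grouping character first, then normalise the decimal marker.
    without_groups = text.replace(thousands, "")
    return Decimal(without_groups.replace(decimal, "."))


# The same characters, read under the two traditional conventions.
english_reading = parse_numeral("20.525", thousands=",", decimal=".")
german_reading = parse_numeral("20.525", thousands=".", decimal=",")

print(english_reading)  # 20.525
print(german_reading)   # 20525
```

The string itself never changes; only the assumed convention does, and the value differs by a factor of one thousand.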

Tradition

Historically, there was a fairly simple rule for translating numerals between English and German. English has a comma as the thousands separator and a dot (full stop) as the decimal point. German uses a dot (point/full stop) to separate the digits in thousands, and a comma to signify decimals. Therefore:

English 100,000,000 becomes German 100.000.000
and
English 23.52 becomes German 23,52.
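The traditional rule amounts to swapping the two characters. A minimal Python sketch, assuming the numeral contains only digits, dots and commas (the function name `swap_separators` is made up for this example):

```python
def swap_separators(numeral: str) -> str:
    """Swap dots and commas in one pass (traditional English <-> German).

    Hypothetical helper: it assumes the input is a bare numeral and does
    not check which convention the input actually follows.
    """
    # str.translate maps every character simultaneously, so a comma that
    # has just become a dot is not turned back into a comma.
    return numeral.translate(str.maketrans(".,", ",."))


print(swap_separators("100,000,000"))  # 100.000.000
print(swap_separators("23.52"))        # 23,52
```

Because the swap is symmetrical, the same function converts in both directions. It does not cover the Swiss forms or the spaced forms discussed below.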

Swiss German is a special case. Thousands are often written with an apostrophe to separate groups of three digits, i.e. as 100’000’000. And there are special rules for decimals in Swiss German. General numbers with decimals are usually written with a decimal comma (23,52), but currencies are written with a decimal point (23.52).

Standardisation

This is where things get complicated. As early as 1948, the international standardisation body “General Conference on Weights and Measures” defined how numerals should be written. It stated that “the decimal marker shall be either the point on the line or the comma on the line”. In other words, the conference was unable to decide between the different national traditions and left both of them in place. But it was far stricter with numbers above a thousand. It stated that the groups of three digits should only be divided by spaces (e.g. 100 000 000), and that “neither dots nor commas are ever inserted in the spaces between groups”.

Never dots or commas? Seventy-five years later this utopia has still not been achieved. In the German source texts which I translate, I now see three different conventions for numbers over a thousand (100.000.000, 100’000’000 and 100 000 000). In English I see two conventions in Internet texts and printed books (100,000,000 and 100 000 000).

This is partly due to the eternal tension between natural language development and centralised language control. Many people have never heard of the standardised regulations, or they do not accept that a centrally imposed convention should take precedence over their traditional patterns. But even the official statements made by standardisation authorities, publishers and other major institutions show a surprising variety.

Different standards and style guides

The German standards DIN 1333 and DIN 5008 and the international standard ISO 80000 state that the thin space is the correct form in German, but the use of a dot to separate thousands is permitted for amounts of money.

The EU interinstitutional style guide requires that a space must be used to group the digits in thousands in English, and it prohibits the use of a comma.

The house style of the British Office for National Statistics states: “Use commas to separate thousands ... and never spaces”.

The style manual of the Australian government also stipulates commas to separate thousands, and forbids the use of a non-breaking space.

The Chicago Manual of Style also stipulates commas as the thousand separator.

Oxford University Press issues a mini style checklist for its academic journals. For “HUMSOC” (humanities and social sciences) it prescribes commas as the thousand separator, but for “SCIMED” (science and medicine) it prescribes thin spaces.

Wikipedia: the English style manual stipulates that digit groups should be separated either by commas or by “narrow gaps” (i.e. as 100,000,000 or 100 000 000). The use of narrow gaps is particularly recommended for articles on science, technology, engineering and mathematics. The German style manual only suggests the use of dots (100.000.000) and states that the use of non-breaking spaces is controversial within the German Wikipedia organisation.

The Microsoft globalization documentation states that the thousands separator is a comma in the USA, a dot in Germany and a space in Sweden.

What sort of “spaces between groups”?

Care is needed if we use spaces as the separators. A normal space is not a good solution, because the number could easily be split at the end of a line in a normal paragraph, e.g. 100 000 (line break) 000. Therefore, the space must at least be a non-breaking space (CTRL-Shift-Space in Microsoft Word). But most regulations state that it should be a “thin space”, otherwise known as a “narrow no-break space” (German: schmales geschütztes Leerzeichen). Typographers can create this space character as “U+202F”. On my computer I can create it with “Alt-8239”.
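For anyone who produces numbers with a script rather than typing them, the narrow no-break space can be inserted automatically. The following Python sketch uses the built-in comma grouping and then replaces the commas. The names `NNBSP` and `group_with_thin_space` are invented for this example, and it only handles whole numbers.

```python
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE, i.e. U+202F


def group_with_thin_space(value: int) -> str:
    """Group the digits of a whole number in threes with U+202F.

    Hypothetical helper: it relies on Python's "," format option for the
    grouping and then swaps the comma for the narrow no-break space.
    """
    return f"{value:,}".replace(",", NNBSP)


formatted = group_with_thin_space(100000000)
print(formatted)   # looks like 100 000 000, but the gaps cannot break
print(ord(NNBSP))  # 8239, the decimal number behind "Alt-8239"
```

The hexadecimal value 202F and the decimal value 8239 are the same code point, which is why both input methods produce the same character.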

What should translators do?

The decimal marker (point or comma) is fairly clear: follow the traditional convention of the target language. The thousand separator is more complicated. Translating into English, I would use the traditional format (with commas for thousands) unless I have specific information that the other convention (with thin spaces) should be used. For translations into German, the simple answer is “it depends”. In texts for casual readers and in financial texts I would tend to use the traditional form (with dots) unless there is a specific reason to use a different version. In academic and formal texts, the standardised “thin space” is probably best. For Swiss German, of course, specific knowledge of the Swiss conventions for the text type and audience is needed.
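These rules of thumb can be written down as a small decision function. This is only a sketch of my own preferences as described above, not a standard. The function name, the language codes and the text type labels ("casual", "financial", "academic", "formal") are all hypothetical choices made for this illustration.

```python
from typing import Optional

THIN_SPACE = "\u202f"  # narrow no-break space, U+202F


def suggest_thousands_separator(
    target_language: str,
    text_type: str = "casual",
    client_rule: Optional[str] = None,
) -> str:
    """Suggest a thousands separator for a translation.

    Hypothetical sketch of the rules of thumb in the text above.
    """
    # Specific information from the client or a style guide always wins.
    if client_rule is not None:
        return client_rule
    if target_language == "en":
        return ","  # traditional English form
    if target_language == "de":
        if text_type in {"academic", "formal"}:
            return THIN_SPACE  # standardised form
        return "."  # traditional form, e.g. casual and financial texts
    # Swiss German and other targets need specific knowledge of the
    # conventions for the text type and audience, so no default is given.
    raise ValueError(f"no default convention for {target_language!r}")


print(repr(suggest_thousands_separator("en")))              # ','
print(repr(suggest_thousands_separator("de", "financial")))  # '.'
print(repr(suggest_thousands_separator("de", "academic")))   # '\u202f'
```

The function deliberately refuses to guess for Swiss German, because the right answer there depends on the text type and the audience.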

This article is not exhaustive. I have not covered the formatting of dates or the grouping of digits in phone numbers, bank account numbers or other contexts. And there are many countries and languages which have completely different ways of writing numbers. Wikipedia is a useful starting point for research into the many different numeral systems in the world.

Wednesday, 8 May 2013

Humpty Dumpty and the TAUS quality concept

The “Translation Automation User Society” (TAUS) is a think tank which promotes the use of machine translation and technology within the translation industry. It organises events and offers services such as data sharing and language technology training. A recent article on the TAUS blog focused on the problem of quality evaluation in automated translation. It proposes a model called “dynamic quality evaluation”. This model has also been discussed on the LinkedIn group “Translation Automation”, and Rahzeb Choudhury of Leeds University kindly sent me a link to a longer report in PDF format, the Dynamic Quality Framework Report.

Looking at these materials, the underlying logic looks to me rather suspect, like a circular argument. It is worth considering the reasons for this.

The TAUS demographics

The Dynamic Quality Evaluation Framework report is based on a study conducted with a number of major multinational organisations (“reviewers”) which have a high volume of text which needs translation. Most of these organisations are large businesses with high volume technical products such as Dell, Google, Microsoft, Philips and Siemens. The organisations also include the EU, which has a high volume of translations between the national languages in the European Community.

In other words, the work of TAUS, at least in this particular instance, is based on a very limited sample, i.e. major international organisations with an extremely high volume of multilingual text requirements, most of which service a limited range of subject areas. There is no consideration given to highly complex and confidential legal texts which will be read in different jurisdictions, no mention of complicated architectural texts, of urban planning, high-powered business management documents and much more. Given this highly selective demographic situation, it is not surprising that TAUS claims broad agreement on certain priorities in its reports and other documents. I would suggest, however, that the translation industry is much broader than the demographic group represented by TAUS.

The part and the whole

This limited demographic sample would not in itself be a problem if TAUS freely admitted that the study deliberately focuses on a certain scenario and certain types of translation work. But the wording of the report exacerbates the problem and is often misleading. For example, there are frequent references to “the translation industry”, although the descriptions and conclusions really apply only to clients (and perhaps selected suppliers) in the translation technology industry working on high volume automated translation in specified subject domains.

If the work of TAUS claimed to be impartial academic research, it would take a far more self-critical approach to its own sampling procedures and would openly point out the limitations of its material. Instead, it acts like a political pressure group, presenting its results in the way that most suits its own agenda. In some of the TAUS material that I have read, I have wondered whether this confusion is deliberate, or whether it reflects a genuine inability to perceive that there are different perspectives on the issues.

Dynamic quality evaluation – a definition of convenience?

The report on “dynamic quality evaluation” uses this very problem as its starting point. It states, for example, “Quality evaluation (QE) in the translation industry is problematic”. The blog post claims “The industry needs common measurable definitions”. Both of these statements pose more questions than they answer. Which sector(s) of the translation industry is TAUS referring to? What quality is referred to, who wants to evaluate this quality, for what purpose and in what kinds of text? What measurements could be used to define something as flowing and variable as language? To what extent would industrial-scale evaluation and defined measurements miss the essential characteristics of the material they are used on?

Instead of dealing with these fundamental issues, TAUS posits a quality evaluation system with three main elements, which it calls utility, time and sentiment. We are told that utility refers to the functionality of the content, time refers to how quickly the translation is needed and sentiment denotes the effect of the resulting text on the brand image. You may notice that the actual quality of a text is not one of the three elements. So where does it come in? As far as I can gather, it seems to be relegated to a sub-category of “Utility” and to be marginally touched on in the category “Sentiment”. At the stroke of the categoriser's computer keyboard, the quality of the text itself is relegated to a mere sub-category.

The pinnacle of the “dynamic quality” logic is reached in the blog post. At the conference which is reported on the blog, there were apparently some participants who did not agree with the majority opinion – they advocated absolute rather than relative quality, and they felt that universal measurable standards did not do justice to the phenomenon of translation. Then comes the classic conclusion: most participants at the conference felt that “unless we maintain the simplicity of the model we get lost in endless details and personal requirements, and we end up … having no generalizable reference …”

Get yourself a cup of coffee and sit down and consider this sentence for a few moments. I would paraphrase it like this: some people argue that the world of language and translation is complicated, but we can’t handle a complex world because we could then not create the simple and measurable system that we want. We must have simplicity, so let there be simplicity. Simplicity rules, simply because we want it to rule.

This is rather like the semantic principles expressed by Humpty Dumpty in Lewis Carroll's novel “Through the Looking-Glass”: “When I use a word, it means just what I choose it to mean – neither more nor less.” It would be a wonderfully simple way to use language: I say what I want, and it means what I want. The only problem is the puzzled expression on the faces of my listeners.

The toxic disclaimer

The final section of the blog is where TAUS dances on the borderline of imperialism. In the title of this section, and three times in the paragraphs, it mentions the possibility of applying for the “dynamic quality” system to be certified as a standard. Each time, the possibility is retracted, at least partially, rather like the song of the Mock Turtle in Carroll's “Alice in Wonderland”: “Will you, won't you, will you, won't you, will you join the dance?” In a TAUS context, this translates as “we would not be so sure that we would want to apply for official standardisation” and “Whether we go for standard certification is a decision we can take together when we get to this crossroads”.

Together? Dear TAUS, does this mean that you will gather all of the translators in the world and involve us in deciding whether to apply for certification of a standard? I think not. Your agenda seems to be domination of the translation industry rather than cooperation with real life translators. You do not look kindly on people like me who have differing opinions, far less do you take us seriously. For you, we are unwelcome “quality gatekeepers” who are “blinkered by prior assumptions”. Ho hum, I suppose Humpty would be proud of these sweeping allegations.

Unintended consequences

The occupation of Gaul by the Roman Empire gave rise to the insurrection by Asterix and Obelix in the wonderful French comics and films. Many other literary parallels come to mind, such as Luke Skywalker and the Empire, Thursday Next and Goliath Corporation, etc. If you continue to play Humpty with the values which translators hold dear, please do not be surprised when you meet opposition. Every group which aspires to global domination must expect resistance. The rhetoric adopted by TAUS and others will bring forth a myriad Luke Skywalkers, and your glorious automated future will be lit up by the flash of lightsabres all over the globe.

Previous related posts on this blog

Would I advise my grandchildren to translate?

Still building Babel?

Fight the machine? (1)

Fight the machine? (2)

Wednesday, 25 April 2012

Computer language mystery solved by humans

Computers have languages, too. According to an article in the American Scientist, even the experts do not agree how many programming languages there are – estimates range from 2,500 to over 8,500.

One recent example which highlighted this variety was the mystery of the programming language used in the creation of “Duqu”, a computer Trojan which has been studied by heavyweight anti-virus companies like Symantec, Kaspersky Lab and F-Secure. These IT giants were able to see the code which this Trojan consisted of, but they were not able to identify which programming language had been used to create this code.

Why didn’t they ask a computer?

To me, as a mere computer user without a programming background, the solution appears simple. It is a computer language, and a computer is obviously able to follow the instructions in the code (otherwise the Trojan would be of no use to the crooks who created it). So a computer should be able to identify what language it is. This seems to be an obvious logical conclusion.

But it is not so. Igor Soumenkov, a Kaspersky Lab Expert, wrote a blog article “The Mystery of the Duqu Framework”. The article outlines the history of the study of Duqu and the structure of the threat which it poses, and it ends with an appeal which amazed me: “We would like to make an appeal to the programming community and ask anyone who recognizes the framework, toolkit or the programming language that can generate similar code constructions, to contact us or drop us a comment in this blogpost.”

Digital guesswork?

Soumenkov received a flood of blog comments and e-mail responses, and the mystery of the programming language has now been solved. But it is interesting to check out the wording of the 159 comments on the original blog article. They are peppered with phrases like:

That code looks familiar

It may be a tool developed by ...

I think it's a ...

What about ...?

Just a guess ... the first thing that pops to my mind is ...

Sounds a lot like ...

I am not a specialist but I would say it could be ...

One more guess ...

This does smell to me a little bit like ...

I'm gonna take a wild guess ...

Plus a generous sprinkling of words like might, perhaps, maybe, probably, similar, clue, feel, remember, possibility and similar vague terms.

Data or brains?

For me, this throws an interesting light on the use of computers in natural language processing. The human guesswork in the comments on Duqu included many ideas that turned out to be wrong, but the brainstorming process was helpful to the computer experts involved, and the fuzzy process of human thinking led to a solution which evidently was not possible with the computer alone. And all of this for a language which is only useful in computers and has no meaning for human communication (when did you last _class_2.setup_class13)[esi]?).

The situation in translation between human languages is comparable. Automatic translation programs from Google, Microsoft, IBM and others can achieve a certain amount of pattern recognition and sometimes come up with plausible solutions. But only a competent human being can evaluate whether this solution is really accurate or appropriate. So these programs can be a useful tool in the hands of an expert, but there is a distinct risk that they may get the wrong end of the stick.

Friday, 11 November 2011

DVX2 screenshot gallery

At first sight, the screen of the Translation Memory program DéjàVuX2 (DVX2) is just a mass of boxes, a chaotic pattern of vertical and horizontal lines. What are they all for? Where in this enormous jigsaw puzzle can I find the text I want to translate? What other information is provided on the screen, and how is it helpful? The best way to explore this is with screenshots.

The classic layout

When you start working on a project with DVX2, the screen will probably look something like this. The pane at the top left is the working area. The left column is headed "German" - that is my source language. The right column, English (United Kingdom), is where my translation goes.

At the bottom left and bottom right of the screen I can see my reference material. At the bottom right I have terminology suggestions ("AutoSearch Portions"), and at the bottom left I have similar sentences ("AutoSearch Segments"). The top right ("Project Explorer") shows me the files in the project. When I am working on the translation, I normally hide this pane so that I have the full window height for the terminology.

There are various ways to personalise this layout. I can change the font and type size in the various windows, and I can also change the arrangement of the different panes in the working window.

My personal layout

Modern monitors, laptops and netbooks tend to have a wide screen. There is not much space to display elements above each other, so it is sometimes better to display the elements side by side. Therefore, my normal DVX2 screen looks like this:

In this "tramline" layout, the working area is in the middle of the screen and the reference material is arranged to the right and left. It provides more context (i.e. the text before and after the active sentence). The shorter lines could be a disadvantage for longer sentences, and especially on smaller screens. The above screenshot is taken from my 22" monitor. On my 10" netbook, this layout is rather more cramped, although it would be just about workable:

One way to make the lines longer in the working area is to work in a separate text area at the bottom of the screen and to split this text area vertically (Tools>Options>Environment). The active sentence is highlighted in the grid, but the working area is now at the bottom, i.e.:

I often get jobs with very long sentences, and sometimes the reference pane on the left is empty for most segments. In such jobs, I can simply hide this column, which gives me longer text lines even without using the separate text area:

Hide and display

In the last screenshot, note the little tabs on the left and right of the screen. They are "mouse-over" tabs. If I want to have a quick look at "AutoSearch Segments", I simply move the mouse over the tab, and the AS Segments pane opens up, but closes again when I return the mouse to the main grid.

Note also the little drawing pin icon at the top right of the "AS Portions" pane. This is a three-way switch for the display of this pane. It can either be fully displayed, as it is here, folded away like the "AS Segments" pane, or it can hover as in the mouse-over function. The combination of the tabs and the drawing pin icons takes a bit of practice, but it helps me to be flexible in using the screen layout.

Smaller details

There are a number of smaller details in the screen layout which can be useful.

The top of the DVX2 window shows the name and path of the current project. For example, the project I used for these screenshots is on drive D at the location shown.

These six icons are in the middle of the bottom edge of the DVX2 window. Mousing over them displays what they mean - here I had the mouse over the first icon (AutoWrite). The background colour shows me whether the function is on or off. Here, for example, AutoWrite, AutoAssemble, AutoPropagate and AutoCheck are enabled, but AutoSearch and AutoSend are disabled. These functions can also be switched on or off via Tools>Options>Environment, but the icons are quicker.

This is the area above the working part of the grid, and it contains a few hidden details. The grid language heading boxes (here "German" and "English") switch between alphabetical and chronological view of the project sentences. The language field with the flag has a little arrow to the right, which leads to a list of the target languages in the project (useful for project managers, but not usually for freelancers like me). The box "All segments" also has a little arrow, which opens up a list of types of sentence (all fuzzy matches, all exact matches etc.). The empty box on the left is a row finder. If I know the number of a segment, I can type it here, and DVX2 jumps to that segment (useful if I am proofreading and notice that a segment needs more work when I have finished proofing - I simply jot down the number and jump to the segment afterwards).

The tabs above this row show the name of the files which I have opened, so I can move to another file simply by clicking the tab. That in itself does not sound special. But these tabs can also be used to display files side by side (or one above the other). I can then compare my work on two files in context, for example like this:

This article only looks at the main grid, in other words the screen which I usually see when I work on a project. It does not explore the menu or any of the subsidiary screens, nor does it examine the efficiency of the many functions of the program. But I hope that this visual summary gives a general impression of the working environment.
