Ciarán's free software notes
Ciaran O'Riordan's irregularly kept software freedom journal
Gettext for static websites
Here's how I implemented a translation management system for a static website, using GNU gettext. For the impatient, I've distilled it to 11 instructions at the end.
Goal
This system allows block-by-block translation (string-by-string), which is better than page-by-page because:
- Changes to non-translated parts will be applied to all translations automatically (formatting, tags, images, maybe dates, names, links, etc.).
- By storing the text blocks of all pages together, repeated blocks will only have to be translated once (menu text, copyright notices, headings, etc.).
- You won't get lost when the original changes while the translation is still in progress.
- When you change a paragraph in the original, it's easier to see what parts of the translations need to be updated.
For such a system, the abstract steps are:
- Somehow mark each translatable text block in your webpage. The non-translatable parts will become a shared frame.
- Extract the blocks into a database. Translate.
- Find or write some software to merge the blocks back into the frame to remake the original webpage - but with the option of taking the text blocks from either the English database or one of the translated versions of the database.
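To make those three abstract steps concrete, here is a minimal Python sketch of the idea, before gettext enters the picture. It is only an illustration under stated assumptions: the names FRAME, CATALOGUES and render are hypothetical, the "database" is just a dictionary, and the sample page and French strings are borrowed from the example later in this article.

```python
# The shared frame: ("raw", html) items are never translated,
# ("text", block) items are the marked, translatable blocks.
FRAME = [
    ("raw", "<html><head>\n<title>"),
    ("text", "Cow"),
    ("raw", "</title>\n</head>\n<body>\n<p>"),
    ("text", "See also: "),
    ("raw", '<a href="http://fsfe.org/">'),
    ("text", "FSFE"),
    ("raw", "</a></p>\n</body></html>\n"),
]

# The "database": one dictionary per language, keyed by the English block.
CATALOGUES = {
    "fr": {"Cow": "Vache", "See also: ": "Voir aussi : ", "FSFE": "La FSFE"},
}


def render(frame, language):
    """Merge the text blocks back into the frame for one language."""
    catalogue = CATALOGUES.get(language, {})
    parts = []
    for kind, value in frame:
        if kind == "text":
            # Fall back to the English block if no translation exists yet.
            parts.append(catalogue.get(value, value))
        else:
            parts.append(value)
    return "".join(parts)


if __name__ == "__main__":
    print(render(FRAME, "fr"))
```

Changing a "raw" item changes every language version at once, which is the main advantage listed above.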
Gettext seemed like an obvious possibility, and everything's working perfectly now, but it took me eight hours. The difficulty was that the existing documentation is all geared toward using gettext for computer programs, not for websites or documents. That's when I realised that I must document what I did:
What I did
I started by minimally turning my webpage into a computer program. This involved five steps:
- Write a tiny program that prints some text (a string) into a file.
- Copy the webpage into the program in place of the string.
- Insert some standard bits of code required by gettext.
- Break the string into smaller strings, separating translatable from non-translatable.
- Mark the translatable strings with gettext's tag (the format of the tag depends on which programming language you use, but it usually involves an underscore _ ).
Gettext works with lots of programming languages, so take your pick from the examples that come with the package. On my computer, these are in this folder:
/usr/share/doc/gettext-doc/examples/
The choice of language isn't important. The code will be dead simple.
Here's my original index.html:
<html>
<head>
<title>Cow</title>
</head>
<body>
<p>See also: <a href="http://fsfe.org/">FSFE</a></p>
</body>
</html>
Of the supported programming languages, I chose Scheme (a dialect of Lisp). At first glance, the code below looks complex, but you'll only have to modify the first and third chunks. The first chunk defines three variables which should be self-explanatory. All the webpage text is in the third chunk. It's broken up into blocks and I've put gettext tags for Scheme (_ ) around the translatable blocks. Here it is, generate-index.scm:
#!/usr/bin/guile -s
!#

;; Chunk 1: the three settings you need to adjust.
(define output-filename "index.html")
(define project-name "ciarans-website")
(define build-directory "/home/ciaran/website-build/")

;; Chunk 2: standard gettext setup.
(use-modules (ice-9 format))
(catch #t (lambda () (setlocale LC_ALL "")) (lambda args #f))
(textdomain project-name)
(bindtextdomain project-name build-directory)
(define _ gettext)

;; Chunk 3: the webpage, with translatable blocks wrapped in (_ ...).
(define page-text (string-append
  "<html><head>\n<title>"
  (_ "Cow")
  "</title>\n</head>\n<body>\n<p>"
  (_ "See also: ")
  "<a href=\"http://fsfe.org/\">"
  (_ "FSFE")
  "</a></p>\n"
  "</body></html>\n\n"))

;; Chunk 4: write the page to the output file.
(define the-file (open-file output-filename "w"))
(display page-text the-file)
(close-port the-file)
Three of the eight strings are marked as translatable. The other five are part of the shared frame that will be the same no matter what language version of the page is being generated.
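Since the choice of language isn't important, here is roughly what the same generator might look like in Python, for comparison only. It's a sketch, not what I used: it relies on Python's standard gettext module, and the function names build_page and write_page are made up for the illustration. The fallback=True argument makes it return the original English strings when no compiled catalogue is found.

```python
import gettext

OUTPUT_FILENAME = "index.html"
PROJECT_NAME = "ciarans-website"
BUILD_DIRECTORY = "/home/ciaran/website-build/"


def build_page(language=None):
    """Return the page text. With language=None, gettext reads the
    LANGUAGE, LC_ALL, LC_MESSAGES and LANG environment variables."""
    languages = [language] if language else None
    catalogue = gettext.translation(
        PROJECT_NAME,
        localedir=BUILD_DIRECTORY,
        languages=languages,
        fallback=True,  # no .mo file found: use the original strings
    )
    _ = catalogue.gettext
    return "".join([
        "<html><head>\n<title>",
        _("Cow"),
        "</title>\n</head>\n<body>\n<p>",
        _("See also: "),
        '<a href="http://fsfe.org/">',
        _("FSFE"),
        "</a></p>\n",
        "</body></html>\n\n",
    ])


def write_page(language=None, filename=OUTPUT_FILENAME):
    """Write the page for one language to the output file."""
    with open(filename, "w", encoding="utf-8") as the_file:
        the_file.write(build_page(language))


if __name__ == "__main__":
    print(build_page())
```

The structure is the same as the Scheme version: three settings, a little gettext setup, and the page split into eight strings of which three are wrapped in _( ).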
Remember to replace any quote marks in your HTML with backslash-quote (\"), and to add a few line breaks (\n) to make the output readable. Those are the quote and newline escape sequences for Scheme. Some other languages use the same sequences, but not all of them do.
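If you have a lot of HTML to paste in, the escaping can be automated. Here's a small Python sketch that turns a chunk of HTML into a Scheme string literal using the two escapes mentioned above (plus backslash itself). The function name scheme_string_literal is hypothetical, and it only handles these three characters.

```python
def scheme_string_literal(text):
    """Wrap text in double quotes, escaping it for a Scheme string."""
    escaped = (
        text.replace("\\", "\\\\")  # escape backslashes first
        .replace('"', '\\"')        # then quote marks
        .replace("\n", "\\n")       # then real line breaks
    )
    return '"' + escaped + '"'


if __name__ == "__main__":
    print(scheme_string_literal('<a href="http://fsfe.org/">'))
    # prints: "<a href=\"http://fsfe.org/\">"
```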
Before you continue, you must set the "build-directory" variable to the directory where generate-index.scm is. If you don't, everything will seem to work but your program will never access the translated strings.
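The reason for the silent failure is that gettext looks for the compiled translations at a fixed place under that directory (language, then LC_MESSAGES, then the project name with a .mo extension) and quietly falls back to the original strings if nothing is there. As a rough way to check, here is a Python sketch using the standard gettext.find function, which follows the same directory layout. The two helper names are made up for the example.

```python
import gettext
import os


def expected_mo_path(build_directory, project_name, language):
    """Path where a compiled catalogue is expected for one language."""
    return os.path.join(
        build_directory, language, "LC_MESSAGES", project_name + ".mo"
    )


def catalogue_is_visible(build_directory, project_name, language):
    """True if Python's gettext lookup finds a .mo file for the language."""
    found = gettext.find(
        project_name, localedir=build_directory, languages=[language]
    )
    return found is not None


if __name__ == "__main__":
    build_directory = "/home/ciaran/website-build/"
    print(expected_mo_path(build_directory, "ciarans-website", "fr"))
    print(catalogue_is_visible(build_directory, "ciarans-website", "fr"))
```

If the second line printed is False, the generator would still run, but only ever produce the English page.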
That done, you extract the translatable strings with these two commands:
$ xgettext --language=scheme -d ciarans-website -k_ generate-index.scm
$ mv ciarans-website.po ciarans-website.pot
And then you can create a file (a "po" file) for French translations with this command:
$ msginit --locale=fr
One part of the gettext manual says that "msginit" is optional and that you can do it manually instead, but this didn't work for me at all. I spent two hours diagnosing that problem. Use msginit.
This creates fr.po, which you can edit with any text editor. There will be a line at the top like this:
"Content-Type: text/plain; charset=UTF-8\n"
If your charset is "ASCII", you should probably change it to UTF-8. If your charset is something else and you get error messages from other gettext tools (such as msgmerge) about invalid characters, then changing charset to UTF-8 might also be the answer. There'll also be a field for content-transfer-encoding. The manual says that should always be "8bit".
Emacs is particularly good for editing po files because it has a special editing mode for them.
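A po file is plain text: each entry pairs a msgid (the original string) with a msgstr (its translation), and an empty msgstr means "not translated yet". As an illustration, here is a small Python sketch that lists the entries still waiting for a translation. The sample content is a hypothetical, half-finished fr.po, and the parser is deliberately naive: it ignores comments, plural forms and fuzzy flags.

```python
SAMPLE_PO = r"""
msgid ""
msgstr ""
"Content-Type: text/plain; charset=UTF-8\n"

msgid "Cow"
msgstr "Vache"

msgid "See also: "
msgstr ""

msgid "FSFE"
msgstr "La FSFE"
"""


def unquote(quoted):
    """Strip the surrounding quotes and undo the two common escapes."""
    return quoted[1:-1].replace('\\"', '"').replace("\\n", "\n")


def parse_po(text):
    """Return a list of (msgid, msgstr) pairs from simple po text."""
    entries = []
    msgid, msgstr, field = None, "", None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("msgid "):
            if msgid is not None:
                entries.append((msgid, msgstr))
            msgid, msgstr, field = unquote(line[6:]), "", "msgid"
        elif line.startswith("msgstr "):
            msgstr, field = unquote(line[7:]), "msgstr"
        elif line.startswith('"') and field == "msgid":
            msgid += unquote(line)   # continuation of a long msgid
        elif line.startswith('"') and field == "msgstr":
            msgstr += unquote(line)  # continuation of a long msgstr
    if msgid is not None:
        entries.append((msgid, msgstr))
    return entries


def untranslated_msgids(text):
    """List msgids with an empty msgstr, skipping the header entry."""
    return [i for i, s in parse_po(text) if i != "" and s == ""]


if __name__ == "__main__":
    print(untranslated_msgids(SAMPLE_PO))
```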
Next you have to convert your po file into the special mo format and put it in the subdirectory where gettext expects it to be with these two commands:
$ mkdir -p fr/LC_MESSAGES
$ msgfmt --output-file=fr/LC_MESSAGES/ciarans-website.mo fr.po
Make the Scheme file executable, and that's it!
ciaran@hide:~/tests/simple-page$ LANGUAGE=fr ./generate-index.scm; cat index.html
<html><head>
<title>Vache</title>
</head>
<body>
<p>Voir aussi : <a href="http://fsfe.org/">La FSFE</a></p>
</body></html>
ciaran@hide:~/tests/simple-page$ LANGUAGE=en ./generate-index.scm; cat index.html
<html><head>
<title>Cow</title>
</head>
<body>
<p>See also: <a href="http://fsfe.org/">FSFE</a></p>
</body></html>
ciaran@hide:~/tests/simple-page$
Ok, so there's your proof-of-concept. Next, I have to convert my site to this system and maintain it (using msgmerge). I'll try to keep notes to publish here. Lastly, thanks to the ILUG community, who suggested some alternatives.
The instructions
- Make an empty file generate-index.scm
- Copy my generate-index.scm (above) into your file
- Adjust the build-directory (3rd defined variable) in generate-index.scm to point to the directory where your generate-index.scm is
- $ xgettext --language=scheme -d ciarans-website -k_ generate-index.scm
- $ mv ciarans-website.po ciarans-website.pot
- $ msginit --locale=fr
- Edit fr.po to add translations of the three text strings
- $ mkdir -p fr/LC_MESSAGES
- $ msgfmt --output-file=fr/LC_MESSAGES/ciarans-website.mo fr.po
- $ chmod +x generate-index.scm
- $ LANGUAGE=fr ./generate-index.scm; cat index.html
UPDATE: (2008-12-15) Some nice people sent me info about existing systems, so I've put that info in a recent blog post.
--

Ciarán O'Riordan, (RSS)
Support free software: Join FSFE's
Fellowship
Japanese PDFs part 2: XeTeX
(Last month's article: Using LaTeX to make PDF documents with Japanese characters)
I've found a better TeX tool for making Japanese PDFs: XeTeX. Below are first the technical advantages, and then an analysis of community and sustainability.
XeTeX is a version of TeX that has been modified to work with Unicode, reading UTF-8 input by default. It is also configured to work with modern font tools such as FreeType and fontconfig. With XeTeX, the minimal example from my last article becomes:
\documentclass[12pt]{article}
\usepackage{fontspec}
\setmainfont{Sazanami Mincho}
\begin{document}
\section{What I learned today}
I can write this 私はキランです in Japanese.
\end{document}
This is converted to a PDF with the command-line tool xelatex. XeTeX has been part of the very common TeX Live bundle since TeX-Live-2007. So if LaTeX is available for your GNU/Linux distro, I'm sure TeX Live is too, and thus XeTeX. (TeX-Live-2008 will be released soon.)
(For a more complex example, see jlesson002.tex, and the output jlesson002.pdf.)
One improvement in this example is that I wrote the file in the very common UTF-8 encoding. This means I don't have to tell my applications to use the EUC-JP encoding that LaTeX+CJK would have required, and it means I'm less likely to have compatibility problems with other text processing tools. (This article was actually supposed to be about converting Japanese TeX to plain text, but an application's lack of support for EUC-JP encoding led me to research UTF-8 versions of TeX.)
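To see what the encoding difference means in practice, here is a short Python sketch that encodes the Japanese phrase from the example above in both encodings and converts one to the other. It uses Python's built-in "euc_jp" codec, which is the encoding usually called EUC-JP; the helper name eucjp_to_utf8 is made up for the example.

```python
text = "私はキランです"

utf8_bytes = text.encode("utf-8")
eucjp_bytes = text.encode("euc_jp")


def eucjp_to_utf8(data):
    """Re-encode EUC-JP bytes (e.g. an old LaTeX+CJK file) as UTF-8."""
    return data.decode("euc_jp").encode("utf-8")


if __name__ == "__main__":
    print(len(utf8_bytes))   # prints: 21 (three bytes per character)
    print(len(eucjp_bytes))  # prints: 14 (two bytes per character)
    print(eucjp_to_utf8(eucjp_bytes) == utf8_bytes)  # prints: True
```

The same seven characters are stored as completely different bytes, which is why a tool that only understands one of the encodings shows garbage when given the other.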
A second improvement is that I could use the standard "article" document class. When using CJK, you can only use document classes that have been specifically written to work with CJK. There is a CJK-enabled equivalent for "article", called "scrartcl", but for some other classes, there's no equivalent that works with CJK.
Another improvement is that the font is specified in a much more readable way ("Sazanami Mincho"), and if I want to use another font, I can use this fontconfig command at the shell to find all fonts on my system that include Japanese characters:
fc-list :lang=ja
On my system, this finds six fonts. The difference between Gothic and Mincho is roughly equivalent to the difference between sans-serif and serif fonts in Western scripts.
It's hard to find a list of free Japanese fonts. It seems that many Japanese font developers have invented their own licences. Two free fonts available are Kochi and Sazanami, of which some say the latter is slightly better, but I can't see any difference. There is also a font called "UmePlus", which seems to be free, but is missing from some distributions (such as Debian) because the licence is somewhat unclear (but it looks fine to me). When I say "free", I mean it in the free software sense, e.g. that everyone can use, copy, modify, and redistribute (modified or unmodified).
Note: I set the default font to a Japanese font because my documents are wholly/mostly in Japanese. If you just wanted to add some Japanese to a mostly English document, XeTeX is still a good option, but I won't go into how to do that (it involves defining a Japanese environment and beginning the environment, entering Japanese, then ending the environment).
A last, minor technical improvement is output file size. For a one-line test file, pdflatex made a file of 19.6kB, and xelatex made one of only 7.5kB. For a more complex 1-page file (jlesson002.tex), the XeTeX output was 15.1kB, and when I converted it to LaTeX-CJK, pdflatex made a file of 65.2kB.
What about community support and sustainability?
Is it safe to move from the old reliable LaTeX+CJK package to this new XeTeX thing? Will XeTeX still have a developer community in the future? Will developers of other TeX tools take care to ensure their packages work with XeTeX? What do Japanese TeX users use?
My searches suggest that Japanese TeX users are using a mix of tools. Some use pTeX, which is a version of TeX modified specifically to work with Japanese. Others use LaTeX+CJK. But there seems to be consensus that these are tools of the past and that Unicode is the future. So change is coming.
Top Japanese TeX expert Haruhiko Okumura said in April 2007: "Since pTeX for Unicode is now being developed and XeTeX is acquiring pTeX-like versatility, next year I'll be using either the new pTeX or XeTeX."
The pTeX for Unicode project he's referring to is upTeX. It exists, but seems to be still in alpha (early testing) stage. It isn't available in the Debian archives, but someone has made Debian upTeX packages. (I haven't tested them.)
If Mr. Okumura has now adopted upTeX or XeTeX, I bet he chose XeTeX.
Next, I got really scientific. I put a few combinations of words into search engines, each time including "2008", a Japanese word, and either "uptex" or "xetex". Each time, XeTeX won by miles. So I guess Japanese people are not currently using uptex. I think XeTeX is winning the battle for Unicode TeX in Japan.
XeTeX being accepted into the TeX Live bundle is also a strong endorsement that XeTeX's future is safe, and the maintainer of LaTeX-CJK is discussing whether it and XeTeX can be merged.
The only bad sign I saw about XeTeX is that the maintainer has recently resigned from his job, but he says this shouldn't affect his ability to maintain XeTeX.
Ok, so that's this month's TeX wisdom from a newbie :-) Hopefully next month's article will be about generating plain text files from the same Japanese TeX source files used for generating PDFs. Final note: I'm pretty sure all these tips work for Chinese, Korean, and other foreign characters, but I haven't tried that yet.
For more info and links about computers, free software, and Japanese, see my Learning Japanese page.
UPDATE: I just found Dave Crossland's summary of the recent 4-day TeX Users Group conference: day 1, day 2, day 3, and day 4. There are also videos of the event.

