P18 Internationalizing Preprocessor
Documentation
Encoding
P18 can process only 8 bit data files that contain no null characters and
that are compatible with the US ASCII character set. Fortunately, this is true
for most relevant character encodings. Input files that are not compatible
with US ASCII cannot be processed directly and must first be converted to an
ASCII compatible encoding. The following encodings can be handled without
conversion:
- US ASCII
- ISO/IEC 8859-n
The number n denotes the character table from the 8859 encoding
family.
- UTF-8
UTF-8 stands for UCS Transformation Format, 8 bit (and UCS stands for
Universal Character Set). For a definition of UTF-8 see
RFC 2279 (since obsoleted by RFC 3629).
Since UTF-8 can encode every UCS character, the 8 bit limitation of P18 is
rarely a limitation in practice. It is possible that future versions of P18 will
support UCS-2, UCS-4, or other non-ASCII compatible encodings by performing an
initial transformation to UTF-8 on the input and a final transformation back
to the original encoding on the output.
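The input constraints above can be checked before a file is handed to P18. The following Python sketch is an illustration only and is not part of P18; the helper names `is_p18_safe`, `to_utf8` and `from_utf8` are hypothetical. It tests for null characters and shows the kind of round trip through UTF-8 that is described above for non-ASCII compatible encodings, using UTF-16 as the example input. Note that the absence of null characters is a necessary condition only; it does not prove that an encoding is ASCII compatible.

```python
def is_p18_safe(data: bytes) -> bool:
    """Return True if the byte string contains no null characters."""
    return b"\x00" not in data


def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Re-encode data from source_encoding to UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


def from_utf8(data: bytes, target_encoding: str) -> bytes:
    """Re-encode UTF-8 data back to target_encoding."""
    return data.decode("utf-8").encode(target_encoding)


sample = "Grüße"
utf16 = sample.encode("utf-16-le")   # 16 bit code units, contains null bytes
utf8 = to_utf8(utf16, "utf-16-le")   # ASCII compatible, no null bytes

print(is_p18_safe(utf16))                     # False
print(is_p18_safe(utf8))                      # True
print(from_utf8(utf8, "utf-16-le") == utf16)  # True
```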
Recognized message types never start with an underscore, and future versions
of P18 won't define additional recognized message types starting with an
underscore. You may want to prefix your own message types with an underscore
to avoid a collision with a recognized message type.
The message type of a message determines the encoding, but may also determine
some other aspects of how the message text is interpreted. The following
message types are recognized by the current version of P18:
- TEXT
This type tells the preprocessor that the message should be treated as raw
8 bit data. Note that null characters are not allowed in messages, even in
messages of type TEXT.
This type should be used only if there's no other message type reflecting
the message encoding. The character codes of a message of type
TEXT are mapped straight to ISO/IEC 10646-1 codes (which is
equivalent to ISO 8859-1 here, since we're dealing with 8 bit character
codes). This may lead to difficulties using translation files.
- HTML
This type should be used for messages using HTML character references to
encode non-ASCII characters. P18 will transform HTML character references
to the referenced character codes when creating a translation file (see
section Translation Files).
- XHTML
This type should be used for messages using both UTF-8 encoded non-ASCII
characters and HTML character references. This is often the case
for XHTML documents.
- JAVA
This type should be used for messages that are part of Java string literals.
Note that all backslashes that should appear in the output file
have to be escaped by another backslash; see
Java Encoding below.
- UTF-7
This type should be used for messages using the UTF-7 encoding (see
RFC 2152).
- UTF-8
This type should be used for messages using the UTF-8 encoding (see
RFC 2279).
- ASCII
7 bit US-ASCII encoding.
- LATIN-1
ISO/IEC 8859-1 (also known as ISO Latin 1). Used for Western
European languages.
- LATIN-2
ISO/IEC 8859-2 (also known as ISO Latin 2). Used for Eastern
European languages.
- LATIN-3
ISO/IEC 8859-3 (also known as ISO Latin 3). Used for southeast
European and miscellaneous languages.
- LATIN-4
ISO/IEC 8859-4 (also known as ISO Latin 4). Used for
Scandinavian/Baltic languages.
- CYRILLIC
ISO/IEC 8859-5. Latin/Cyrillic.
- ARABIC
ISO/IEC 8859-6. Latin/Arabic.
- GREEK
ISO/IEC 8859-7. Latin/Greek.
- HEBREW
ISO/IEC 8859-8. Latin/Hebrew.
- LATIN-5
ISO/IEC 8859-9 (also known as ISO Latin 5). Latin 1 with
modification for Turkish.
- LATIN-6
ISO/IEC 8859-10 (also known as ISO Latin 6). Used for
Lappish/Nordic/Eskimo languages.
- LATIN-7
ISO/IEC 8859-13 (also known as ISO Latin 7). Used for Baltic
Rim languages.
- LATIN-8
ISO/IEC 8859-14 (also known as ISO Latin 8). Celtic.
- LATIN-9
ISO/IEC 8859-15 (also known as ISO Latin 9). Latin 1 with Euro
symbol.
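How a message type governs the interpretation of message bytes can be sketched in Python. This is not P18's implementation. The function `decode_message` and the table `CODECS` are hypothetical, the mapping from P18 type names to Python codec names is an assumption, and only a few of the types listed above are covered. The sketch also assumes that HTML messages contain only ASCII bytes besides their character references.

```python
import html

# Assumed mapping from some P18 message type names to Python codec names.
CODECS = {
    "ASCII": "ascii",
    "UTF-7": "utf-7",
    "UTF-8": "utf-8",
    "LATIN-1": "iso-8859-1",
    "LATIN-2": "iso-8859-2",
    "CYRILLIC": "iso-8859-5",
    "GREEK": "iso-8859-7",
    "LATIN-9": "iso-8859-15",
}


def decode_message(raw: bytes, message_type: str) -> str:
    """Turn the raw bytes of a message into a string of UCS characters."""
    if message_type == "TEXT":
        # Each byte value is mapped straight to the code point with the
        # same number, which is exactly what an ISO 8859-1 decode does.
        return raw.decode("iso-8859-1")
    if message_type == "HTML":
        # Resolve character references such as &uuml; or &#8364;.
        return html.unescape(raw.decode("ascii"))
    if message_type == "XHTML":
        # UTF-8 encoded characters plus HTML character references.
        return html.unescape(raw.decode("utf-8"))
    return raw.decode(CODECS[message_type])


print(decode_message(b"Gr&uuml;&szlig;e", "HTML"))  # Grüße
print(decode_message(b"\xe1\xe2", "GREEK"))         # αβ
print(decode_message(b"\xe1\xe2", "TEXT"))          # áâ
```

The last two calls decode the same bytes. Declared as TEXT, Greek text is read as Latin 1 characters, which illustrates why TEXT may lead to difficulties with translation files.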
Java Encoding
Java uses the backslash character as an escape character, just like P18.
In P18, every backslash character that is to appear in the output file has
to be quoted with a second backslash. This also applies to backslash
characters that act as escape characters in the output file. When handling
Java string literals, P18 recognizes the resulting double-backslash quoting
sequences and transforms these strings to UCS-4 internally. The only thing to
remember when writing P18ized Java files is that the escape sequences in the
Java string literals have to be introduced by a double backslash instead of a
single backslash.
If an I18N escape of type JAVA contains non-ASCII characters, these characters
are interpreted as ISO Latin 1 characters.
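The double-backslash rule can be illustrated with a small Python sketch. It is not P18's implementation: the function names are hypothetical, and only Java Unicode escapes of the form backslash, u, four hex digits are decoded. Other Java escapes are left untouched, and an escaped backslash followed by the letter u is not treated specially.

```python
import re


def unquote_backslashes(p18_text: str) -> str:
    """Collapse each doubled backslash into a single backslash."""
    return p18_text.replace("\\\\", "\\")


# Matches a backslash, the letter u, and exactly four hex digits.
_UNICODE_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_java_unicode_escapes(java_text: str) -> str:
    """Replace Java Unicode escapes with the characters they denote."""
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), java_text)


# Message text as written in the P18ized file: escapes use two backslashes.
message = r"Gr\\u00fc\\u00dfe\\n"

# Text as it should appear inside the Java string literal in the output.
output = unquote_backslashes(message)
print(output)                               # Gr\u00fc\u00dfe\n

# Characters denoted by the Unicode escapes (the \n escape is left as is).
print(decode_java_unicode_escapes(output))  # Grüße\n
```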
Translation files may be written in different encodings. Only the message
text parts are written using the specified encoding; the meta information of
the translation file is written in plain ASCII. The default encoding for
translation files is ISO 8859-1 (Latin 1). The encoding of a translation file
has to be roughly ASCII compatible (i.e. UTF-7 and HTML are considered ASCII
compatible, while EBCDIC is certainly not). I don't recommend using UTF-7 for
translation files (try it and you'll see why).
A translation file is written only if all characters of all messages can be
encoded using the requested translation file encoding. For most translations,
one of the ISO 8859 encodings will do. However, if you wish to translate
Hebrew to Russian, for example, you'll probably want to use UTF-8 or HTML as
the translation file encoding.
Another possibility is to use the -f option of the db export command (forced
export). This option tolerates unencodable characters when writing the
translation file. This is useful if messages of the source language are likely
to remain readable even if some special characters are missing (e.g. when
translating from German to Greek one might want to use the Greek encoding
ISO 8859-7, even if the German umlaut characters can't be encoded correctly).
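The difference between a normal and a forced export can be sketched as follows. This Python fragment only models the rule described above and is not P18 code. The names `unencodable_chars` and `export_messages` are hypothetical, and replacing an unencodable character with a question mark is the behaviour of Python's "replace" error handler, not a documented behaviour of P18.

```python
def unencodable_chars(messages, encoding):
    """Return the set of characters that cannot be encoded in encoding."""
    bad = set()
    for text in messages:
        for ch in text:
            try:
                ch.encode(encoding)
            except UnicodeEncodeError:
                bad.add(ch)
    return bad


def export_messages(messages, encoding, force=False):
    """Encode all messages, refusing to do so unless every character fits.

    With force=True, unencodable characters are tolerated and replaced
    (here by '?', which is an assumption made for this sketch).
    """
    bad = unencodable_chars(messages, encoding)
    if bad and not force:
        raise ValueError(f"cannot encode {sorted(bad)} in {encoding}")
    return [text.encode(encoding, errors="replace") for text in messages]


german = ["Grüße"]
print(sorted(unencodable_chars(german, "iso-8859-7")))       # ['ß', 'ü']
print(export_messages(german, "iso-8859-7", force=True))     # [b'Gr??e']
print(unencodable_chars(german + ["Καλημέρα"], "utf-8"))     # set()
```

The last line shows why UTF-8 is a convenient translation file encoding when source and target languages use different scripts: every character of both messages can be encoded.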
The translation file encoding can be specified using the -e option of the
db export command (see section Commands).