Code Pages in DBT
A code page is a widely recognized system for matching characters and other symbols with the numbers that represent them in the computer. Modern computer systems have a code page, called Unicode, that includes virtually all characters in use. Before Unicode, a code page was typically limited to 256 characters, because each character was stored in a single byte. The solution then was to allow multiple code pages, each with a different group of characters.
The history and proliferation of code pages mirrors the development of computers. One of the first code pages, EBCDIC, was invented by IBM; it is now rare outside IBM mainframe systems. ASCII then became a standard, but it defines only the first 128 characters. Many code pages use the same 128 characters at the beginning (the low characters, numbered 0 to 127), but then vary extensively on the next 128 characters (the high characters, numbered 128 to 255, whose high bit is set).
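The split between the shared low range and the varying high range can be seen by decoding the same bytes under several code pages. The following is a minimal sketch in Python; the codec names (`cp1252`, `iso8859_2`, `koi8_r`, `mac_roman`) are Python's own names, not the names DBT uses, and the choice of sample bytes is only for illustration.

```python
# Decode the same bytes under several code pages.
# Codec names are Python's, not DBT's (an assumption for this sketch).
CODE_PAGES = ["cp1252", "iso8859_2", "koi8_r", "mac_roman"]

low_byte = b"A"       # 0x41, inside the shared ASCII range (0 to 127)
high_byte = b"\xe9"   # 0xE9, in the range where code pages disagree


def decode_all(raw):
    """Return {code page name: decoded text} for the given bytes."""
    return {name: raw.decode(name) for name in CODE_PAGES}


low_results = decode_all(low_byte)
high_results = decode_all(high_byte)

for name in CODE_PAGES:
    print(f"{name:10} 0x41 -> {low_results[name]!r}   0xE9 -> {high_results[name]!r}")
```

Every code page in the list gives the same letter for 0x41, while 0xE9 comes out as different characters depending on the code page chosen.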
After ASCII was introduced, IBM created a new set of code pages. Originally Microsoft co-operated with IBM, and the two companies used identical code pages. In the 1990s, they stopped co-operating, and Microsoft made its own versions of key code pages. Before Unicode, various Asian nations also set up their own systems for handling their scripts on computers, and many of these systems are still popular today. At one point there was an Icelandic code page, and then another Icelandic code page was created just to add the Euro symbol.
This history explains why there are a lot of code pages, but it does not explain how to handle them. A document imported into DBT is interpreted according to its code page, which you can set on the Import File dialog that comes up automatically during the import process. Here are some basic suggestions on getting the correct code page.
- When in doubt, use WINDOWS-1252 as your code page. This is the most commonly used code page.
- If you import with the WINDOWS-1252 code page and everything is correct except for the curly quotes, then you have a fixable code page problem. Specifically, if the curly quotes appear as accented uppercase O's, then you really have a Macintosh text file; use a Mac code page such as MacRoman instead.
- Several of the Macintosh code pages for Asian languages are the same as a popular existing text file format (for example, MacJapanese corresponds to Shift JIS and MacKorean to EUC-KR; see the Other Names column of the table below). This may prove useful.
- Some users will come to this issue knowing exactly what code page they want. They have a Polish file using WINDOWS-1250 or a Japanese file using Shift JIS. They know what the file contains and can select the right choice.
- Some software, such as Microsoft Word, has a feature called "Code Page Sniffing". This means that Word can detect and suggest the most appropriate code page for a file. Since DBT cannot do this, you can use Word to detect the code page in use, and then select that code page in DBT for that file and all similar files.
- The list of DBT supported code pages includes all of the variations of Unicode (including UTF-8).
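The curly quote symptom described in the list above can be reproduced outside DBT. This is a small Python sketch, assuming Python's `mac_roman` and `cp1252` codecs stand in for the MacRoman and WINDOWS-1252 code pages; the sample text is made up for illustration.

```python
# A short text with curly double quotes (U+201C and U+201D).
text = "\u201cHello\u201d"

# How a classic Macintosh text file would store it.
mac_bytes = text.encode("mac_roman")

# Reading those bytes with the wrong code page turns the quotes
# into accented uppercase O's; the Mac code page restores them.
wrong = mac_bytes.decode("cp1252")
right = mac_bytes.decode("mac_roman")

print("bytes:          ", mac_bytes.hex(" "))
print("as WINDOWS-1252:", wrong)
print("as MacRoman:    ", right)
```

The letters in the ASCII range survive either way, which is why the rest of such a document looks correct while only the quotes are wrong.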
DBT Name | Other Names | Language / Region |
---|---|---|
ISO-8859-1 | Latin-1 | Western European |
ISO-8859-2 | Latin-2 | Central European |
ISO-8859-3 | Latin-3 | South European and Esperanto |
ISO-8859-4 | Latin-4 | Baltic, old |
ISO-8859-5 | Cyrillic | Russian and related languages |
ISO-8859-6 | Arabic | Arabic, Farsi, and Urdu |
ISO-8859-7 | Greek | Greece |
ISO-8859-8 | Hebrew | Israel |
ISO-8859-9 | Latin-5 | Turkish |
ISO-8859-10 | Latin-6 | Nordic |
ISO-8859-11 | Thai | Thailand |
ISO-8859-13 | Latin-7 | Baltic, new |
ISO-8859-14 | Latin-8 | Celtic |
ISO-8859-15 | Latin-9 | Revised Western European |
KOI8-R | RFC 1489 | Russian, Bulgarian |
KOI8-U | RFC 2319 | Ukrainian |
WINDOWS-437 | Old MS-DOS character set | European |
WINDOWS-866 | Cyrillic | Russian and related languages |
WINDOWS-874 | Thai | Thailand |
WINDOWS-932 | 31J, MS version of Shift JIS | Japanese |
WINDOWS-936 | MS version of GB2312 | Simplified Chinese |
WINDOWS-949 | MS version of EUC-KR | Korean |
WINDOWS-950 | MS version of Big5 | Chinese |
WINDOWS-1250 | Microsoft Windows Central European | Central Europe |
WINDOWS-1251 | Microsoft Windows Cyrillic | Cyrillic |
WINDOWS-1252 | Microsoft Windows Western European | Western Europe |
WINDOWS-1253 | Microsoft Windows Greek | Greece |
WINDOWS-1254 | Microsoft Windows Turkish | Turkey |
WINDOWS-1255 | Microsoft Windows Hebrew | Israel |
WINDOWS-1256 | Microsoft Windows Arabic | Middle East |
WINDOWS-1257 | Microsoft Windows Baltic | Baltic countries |
WINDOWS-1258 | Microsoft Windows Vietnamese | Vietnam |
UTF-7 | Unicode (all characters) | Every region |
UTF-8 | Unicode (all characters) | Every region |
UTF 16BE | Unicode (all characters) | Every region |
UTF 16LE | Unicode (all characters) | Every region |
UTF 32BE | Unicode (all characters) | Every region |
UTF 32LE | Unicode (all characters) | Every region |
MacRoman | none | Western Europe |
MacJapanese Code Page | Shift JIS | Japanese |
MacChineseTrad Code Page | Big5 | Chinese |
MacKorean Code Page | EUC-KR | Korean |
MacArabic | none | Arabic |
MacArabic-Farsi | none | Arabic and Farsi |
MacHebrew | none | Hebrew |
MacGreek | none | Greek |
MacCyrillic | none | Cyrillic |
MacDevanagari | none | India, Hindi |
MacGurmukhi | none | India |
MacGujarati | none | India |
MacOriya | none | India |
MacBengali | none | India, Bangladesh |
MacTamil | none | India, Sri Lanka |
MacTelegu | none | India |
MacKanada | none | India |
MacMalayalam | none | India |
MacSingalese | none | India, Sri Lanka |
MacKhmer | none | Khmer, Cambodia |
MacThai | none | Thai |
MacLaotian | none | Laos |
MacGeorgian | none | Georgia |
MacArmenian | none | Armenia |
MacChineseSimp | EUC-CN | China |
MacTibetian | none | Tibet |
MacMongolian | none | Mongolia |
MacEthiopic | none | Ethiopia |
Mac-Central-European | none | Central Europe |
MacVietnamese | none | Vietnam |
MacExtArabic | none | Arabic Scripts |
MacSymbol | none | Any region |
MacDingbats | none | Any region |
MacTurkish | none | Turkey |
MacCroatian | none | Croatia |
MacIcelandic | Icelandic | Iceland |
MacRomanian | Romanian | Romania |
MacCeltic | Scottish | Scotland |
MacGaelic | Irish | Ireland |
MacKeyboardGlyphs | Odd Symbols | Every region |
ISO-2022-JP | Japanese | Japan |
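
Because DBT does not detect the code page itself, a short script can sometimes narrow down which of the Unicode entries in the table applies before you import. The sketch below is a rough heuristic in Python, not a description of how Word or DBT works: it looks for a byte order mark at the start of the file and otherwise tests whether the bytes are valid UTF-8. The function name `guess_unicode_code_page` is hypothetical, and the returned labels are simply copied from the DBT Name column.

```python
import codecs

# Check the longer marks first: the UTF-32LE mark begins with the
# same two bytes as the UTF-16LE mark.
BOMS = [
    (codecs.BOM_UTF32_BE, "UTF 32BE"),
    (codecs.BOM_UTF32_LE, "UTF 32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF 16BE"),
    (codecs.BOM_UTF16_LE, "UTF 16LE"),
]


def guess_unicode_code_page(data):
    """Return a Unicode entry name from the table, or None if undetermined."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    try:
        # Strict decoding rejects most files that are not UTF-8.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return None
    # Plain ASCII also passes this test; it reads the same in UTF-8
    # and in most of the other code pages listed.
    return "UTF-8"


if __name__ == "__main__":
    import sys

    with open(sys.argv[1], "rb") as handle:
        print(guess_unicode_code_page(handle.read()))
```

A result of `None` only means the file is not recognizably Unicode; the code page then has to be chosen by hand, using the suggestions earlier on this page.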