DBT Supported Code Pages

Code Pages in DBT

A code page is a widely recognized system for matching characters and other symbols with the numbers that represent them in the computer. Modern computer systems have a character set (called Unicode) that includes virtually all characters in use. But before Unicode, a code page was typically limited to 256 characters. The solution then was to allow multiple code pages, each with a different group of characters.

The history and proliferation of code pages mirrors the development of computers. One of the first code pages, EBCDIC, was invented by IBM. Today it is rarely encountered outside IBM mainframe systems. Then ASCII became a standard, but it defines only the first 128 characters. Many code pages use the same 128 characters at the beginning (called the low bit characters), but then vary extensively in the next 128 characters (called the high bit characters).
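The shared low half and the divergent high half can be seen by decoding the same bytes under several code pages. The sketch below uses Python's standard codecs purely for illustration; it is not how DBT works internally, and the codec names (`cp1252`, `iso8859_2`, `koi8_r`, `mac_roman`) are Python's names rather than DBT's.

```python
# Decode the same bytes under several code pages.
# These are Python codec names, not the names DBT shows in its dialog.
CODE_PAGES = ["cp1252", "iso8859_2", "koi8_r", "mac_roman"]

def decode_everywhere(raw: bytes) -> dict[str, str]:
    """Return the text each code page produces for the same raw bytes."""
    return {name: raw.decode(name) for name in CODE_PAGES}

low = decode_everywhere(b"Braille")      # every byte is below 128 (ASCII range)
high = decode_everywhere(bytes([0xE9]))  # a single byte above 127

for name in CODE_PAGES:
    print(f"{name:10} {low[name]!r:12} {high[name]!r}")
```

The low-bit text comes out the same under every code page in the list, while the single high-bit byte is read as a different character depending on which code page is chosen.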

After ASCII was introduced, IBM created a new set of code pages. Originally Microsoft co-operated with IBM, and they used identical code pages. In the 1990s, these two companies stopped co-operating, and Microsoft made its own version of key code pages. Before Unicode, various Asian nations also set up their own systems for handling their scripts on computers, and many of these systems are still popular today. At one point there was an Icelandic code page, and then another Icelandic code page was created just to add the Euro symbol.

This history explains why there are a lot of code pages, but it does not explain how to handle them. A document imported into DBT is interpreted according to its code page, which you can set on the Import File dialog that comes up automatically during the import process. Here are some basic suggestions on getting the correct code page.

When in doubt, use WINDOWS-1252 as your code page. This is the most commonly used code page.
If you import with the WINDOWS-1252 code page, and everything is correct except for curly quotes, then you have a fixable code page problem. Specifically, if you import with the WINDOWS-1252 code page and curly quotes appear as accented upper case O's, then you really have a Macintosh text file (use a Mac code page instead).
Several of the Macintosh code pages for Asian languages match a popular existing text file format: MacJapanese corresponds to Shift JIS, MacChineseTrad to Big5, MacKorean to EUC-KR, and MacChineseSimp to EUC-CN. If you have a file in one of those formats, the matching Mac code page may be worth trying.
Some users will come to this issue knowing exactly what code page they want. They have a Polish file using WINDOWS-1250 or a Japanese file using Shift JIS. They know what the file contains and can select the right choice.
Some software, such as Microsoft Word, has a feature called "Code Page Sniffing". This means that Word can detect and suggest the most appropriate code page for the file. Since DBT cannot do this, you can use Word to detect the code page in use, and then select that code page in DBT for that file and all similar files.
The list of DBT supported code pages includes all of the variations of Unicode (including UTF-8).
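The curly-quote symptom described in the suggestions above can be reproduced and checked mechanically. The following is a minimal sketch in Python, not part of DBT. The function name `looks_like_mac_roman` and its heuristic are hypothetical, and `mac_roman` and `cp1252` are Python codec names. It relies on the fact that MacRoman stores curly quotes at bytes 0xD2 through 0xD5, which WINDOWS-1252 reads as Ò, Ó, Ô, and Õ.

```python
# Bytes MacRoman uses for curly quotes; WINDOWS-1252 reads them as accented O's.
MAC_QUOTE_BYTES = bytes([0xD2, 0xD3, 0xD4, 0xD5])

def looks_like_mac_roman(raw: bytes) -> bool:
    """Hypothetical heuristic: True if the data contains MacRoman quote bytes
    but none of the WINDOWS-1252 curly-quote bytes (0x91 to 0x94)."""
    has_mac_quotes = any(b in MAC_QUOTE_BYTES for b in raw)
    has_win_quotes = any(0x91 <= b <= 0x94 for b in raw)
    return has_mac_quotes and not has_win_quotes

# Simulate a text file saved on a Macintosh.
mac_file = "She said \u201chello\u201d.".encode("mac_roman")

print(mac_file.decode("cp1252"))       # read with the wrong code page
print(mac_file.decode("mac_roman"))    # read with the right code page
print(looks_like_mac_roman(mac_file))
```

This is only a hint, not a detector: a genuine WINDOWS-1252 file that contains words with Ò, Ó, Ô, or Õ would also be flagged, so the result should be confirmed by looking at the imported text.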

List of Code Pages (Character Sets)
DBT Name	Other Names	Language / Region
ISO-8859-1	Latin-1	Western European
ISO-8859-2	Latin-2	Central European
ISO-8859-3	Latin-3	South European and Esperanto
ISO-8859-4	Latin-4	Baltic, old
ISO-8859-5	Cyrillic	Russian and related languages
ISO-8859-6	Arabic	Arabic, Farsi, and Urdu
ISO-8859-7	Greek	Greece
ISO-8859-8	Hebrew	Israel
ISO-8859-9	Latin-5	Turkish
ISO-8859-10	Latin-6	Nordic
ISO-8859-11	Thai	Thailand
ISO-8859-13	Latin-7	Baltic, new
ISO-8859-14	Latin-8	Celtic
ISO-8859-15	Latin-9	Revised Western European
KOI8-R	RFC 1489	Russian, Bulgarian
KOI8-U	RFC 2319	Ukrainian
WINDOWS-437	Old MS-DOS character set	European
WINDOWS-866	Cyrillic	Russian and related languages
WINDOWS-874	Thai	Thailand
WINDOWS-932	31J, MS version of Shift JIS	Japanese
WINDOWS-936	MS version of GB2312	Simplified Chinese
WINDOWS-949	MS version of EUC-KR	Korean
WINDOWS-950	MS version of Big5	Chinese
WINDOWS-1250	Microsoft Windows Central European	Central Europe
WINDOWS-1251	Microsoft Windows Cyrillic	Cyrillic
WINDOWS-1252	Microsoft Windows Western European	Western Europe
WINDOWS-1253	Microsoft Windows Greek	Greece
WINDOWS-1254	Microsoft Windows Turkish	Turkey
WINDOWS-1255	Microsoft Windows Hebrew	Israel
WINDOWS-1256	Microsoft Windows Arabic	Middle East
WINDOWS-1257	Microsoft Windows Baltic	Baltic countries
WINDOWS-1258	Microsoft Windows Vietnamese	Vietnam
UTF-7	Unicode (all characters)	Every region
UTF-8	Unicode (all characters)	Every region
UTF 16BE	Unicode (all characters)	Every region
UTF 16LE	Unicode (all characters)	Every region
UTF 32BE	Unicode (all characters)	Every region
UTF 32LE	Unicode (all characters)	Every region
MacRoman	none	Western Europe
MacJapanese Code Page	Shift JIS	Japanese
MacChineseTrad Code Page	Big5	Chinese
MacKorean Code Page	EUC-KR	Korean
MacArabic	none	Arabic
MacArabic-Farsi	none	Arabic and Farsi
MacHebrew	none	Hebrew
MacGreek	none	Greek
MacCyrillic	none	Cyrillic
MacDevanagari	none	India, Hindi
MacGurmukhi	none	India
MacGujarati	none	India
MacOriya	none	India
MacBengali	none	India, Bangladesh
MacTamil	none	India, Sri Lanka
MacTelegu	none	India
MacKanada	none	India
MacMalayalam	none	India
MacSingalese	none	Sri Lanka
MacKhmer	none	Khmer, Cambodia
MacThai	none	Thai
MacLaotian	none	Laos
MacGeorgian	none	Georgia
MacArmenian	none	Armenia
MacChineseSimp	EUC-CN	China
MacTibetian	none	Tibet
MacMongolian	none	Mongolia
MacEthiopic	none	Ethiopia
Mac-Central-European	none	Central Europe
MacVietnamese	none	Vietnam
MacExtArabic	none	Arabic Scripts
MacSymbol	none	Any region
MacDingbats	none	Any region
MacTurkish	none	Turkey
MacCroatian	none	Croatia
MacIcelandic	Icelandic	Iceland
MacRomanian	Romanian	Romania
MacCeltic	Scottish	Scotland
MacGaelic	Irish	Ireland
MacKeyboardGlyphs	Odd Symbols	Every region
ISO-2022-JP	Japanese	Japan