Monday, January 10, 2005

Converting ISCII to Unicode

Computers spread into non-English-speaking nations long before a unified global standard for representing text (Unicode) took shape. Different countries adopted different approaches to encoding data, the most notable being the DBCS/MBCS encodings we see for Far Eastern languages (such as Simplified/Traditional Chinese and Japanese). The idea was this: two bytes (each with a value between 128 and 255), when interpreted together, designate one character. For example, combining Ê (0xCA) and ¯ (0xAF) gives you the ideogram for stone, 石 (Unicode 0x77F3), in Simplified Chinese. Besides being prone to programming errors, the system had a limitation: because the same lead-byte/trail-byte combination can represent different characters depending on which “codepage” you are looking at, it is not possible to have a document with more than one Far Eastern language. For example, the same lead-byte (0xCA), trail-byte (0xAF) sequence gives you 坒 (Unicode 0x5752) under the Traditional Chinese codepage: two totally different characters!
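To make the lead-byte/trail-byte ambiguity concrete, here is a minimal sketch that decodes the same two bytes under two code pages. The code page numbers 936 (Simplified Chinese) and 950 (Traditional Chinese) are my assumption for the "codepages" meant above, and the class and method names are made up for illustration. It prints code points rather than the characters themselves, since a console may not be able to show them.

```csharp
using System;
using System.Text;

public static class DbcsDemo
{
    // Decodes the given bytes under the given Windows code page.
    public static string Decode(int codePage, byte[] bytes)
    {
        return Encoding.GetEncoding(codePage).GetString(bytes);
    }

    // Formats each UTF-16 code unit of a string as U+XXXX.
    public static string CodePoints(string s)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.AppendFormat("U+{0:X4}", (int)c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Assumption: on .NET Core / .NET 5+ these code pages are only
        // available after registering the code pages provider:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        byte[] bytes = { 0xCA, 0xAF };

        // Same two bytes, two different code pages.
        Console.WriteLine("936 (Simplified Chinese):  " + CodePoints(Decode(936, bytes)));
        Console.WriteLine("950 (Traditional Chinese): " + CodePoints(Decode(950, bytes)));
    }
}
```

If the two lines print different code points, that is exactly the ambiguity described above: the bytes alone do not tell you which character was meant.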

In India too, we had similar indigenous encoding schemes, the most widespread being ISCII (Indian Script Code for Information Interchange). The idea is very similar: bytes between 128 and 255 denote Indian-language characters. Depending on the ISCII encoding you choose, the same set of bytes can represent text in a different language.

The Encoding class in .NET allows you to convert between these encodings and Unicode. Recently, while having a discussion with Dr. Pavanaja, it occurred to me that you can also use the Encoding class to do the conversion from ISCII to Unicode.

Let’s see an example (I chose to write it as a web page rather than a console app because the final Unicode result will not show up on the console):

<%@Page Language="C#"%>
<%@Import Namespace="System.Text"%>
<%@Import Namespace="System.Globalization"%>

<script runat="server">
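void Page_Load(Object o, EventArgs e)
{
    // The ISCII bytes are typed here as Windows-1252 characters,
    // so code page 1252 recovers the original byte values.
    Encoding encFrom = Encoding.GetEncoding(1252);
    // 57002 is the ISCII Devanagari (Hindi) code page, x-iscii-de.
    Encoding encTo = Encoding.GetEncoding(57002);
    String str = "ØÛÆèÄÜ";

    // Get the raw ISCII bytes...
    Byte[] b = encFrom.GetBytes(str);
    // ...and decode them as ISCII into a (Unicode) .NET string.
    String strUnicode = encTo.GetString(b);
    Response.Write(strUnicode);
}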
</script>
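If you would rather not host a page, a rough console variant can write the result to a UTF-8 text file instead of printing it. This is only a sketch: the class name, the method name and the file name "iscii-output.txt" are hypothetical, and the byte values are simply what "ØÛÆèÄÜ" becomes under code page 1252.

```csharp
using System;
using System.IO;
using System.Text;

public static class IsciiToFile
{
    // Decodes raw ISCII bytes under the given ISCII code page (e.g. 57002).
    public static string IsciiToUnicode(byte[] isciiBytes, int isciiCodePage)
    {
        return Encoding.GetEncoding(isciiCodePage).GetString(isciiBytes);
    }

    public static void Main()
    {
        // The bytes behind "ØÛÆèÄÜ" when that string is read as Windows-1252.
        byte[] iscii = { 0xD8, 0xDB, 0xC6, 0xE8, 0xC4, 0xDC };

        // 57002 is the ISCII Devanagari code page.
        string text = IsciiToUnicode(iscii, 57002);

        // Hypothetical output file; Encoding.UTF8 keeps the Indic characters intact.
        File.WriteAllText("iscii-output.txt", text, Encoding.UTF8);
    }
}
```

Open the file in an editor that understands UTF-8 and has an Indic font available. Passing the encoding explicitly matters here, because the characters cannot survive a round trip through a legacy single-byte encoding.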

Codepage 57002 denotes the ISCII Hindi (Devanagari) encoding. The ISCII encodings include:

Codepage  Name        Language
57002     x-iscii-de  Devanagari
57003     x-iscii-be  Bengali
57004     x-iscii-ta  Tamil
57005     x-iscii-te  Telugu
57006     x-iscii-as  Assamese
57007     x-iscii-or  Oriya
57008     x-iscii-ka  Kannada
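To see how one byte sequence lands in a different script under each codepage, here is a small sketch that decodes the same six ISCII bytes under every codepage in the table and prints the resulting code points. The class and method names are made up for illustration. I am assuming these codepages are available on your runtime: they are on the .NET Framework, while newer runtimes may need the code pages provider registered first.

```csharp
using System;
using System.Text;

public static class IsciiScripts
{
    // Codepages taken from the table above.
    public static readonly int[] CodePages =
        { 57002, 57003, 57004, 57005, 57006, 57007, 57008 };

    // Decodes the bytes under one codepage and formats the result as U+XXXX values.
    public static string DecodeToCodePoints(int codePage, byte[] bytes)
    {
        string s = Encoding.GetEncoding(codePage).GetString(bytes);
        StringBuilder sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.AppendFormat("U+{0:X4}", (int)c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // The bytes behind "ØÛÆèÄÜ" when that string is read as Windows-1252.
        byte[] iscii = { 0xD8, 0xDB, 0xC6, 0xE8, 0xC4, 0xDC };

        foreach (int cp in CodePages)
        {
            Encoding enc = Encoding.GetEncoding(cp);
            Console.WriteLine("{0} {1}: {2}", cp, enc.WebName, DecodeToCodePoints(cp, iscii));
        }
    }
}
```

Each line should show code points from a different Unicode block, which is the point: the byte values stay the same and only the chosen codepage decides the script.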


'ØÛÆèÄÜ' is an Indic string; with the right software/font you should be able to view it. You can also create an HTML document (thanks again to Dr. Pavanaja for the tip!) with the following meta tag to view the contents of the ISCII string without explicitly converting it to Unicode (though IE does the conversion internally for you before rendering it with the system-installed Indic OpenType fonts):

<meta http-equiv="Content-Type" content="text/html; charset=x-iscii-de">

5 comments:

Anonymous said...

So what happens if you happen to have some of the CDAC fonts installed, and have used the appropriate font tags? Does IE still convert internally to Unicode and display using the supplied Indic fonts?
While on the topic, has anyone created any more Unicode Indic fonts?

Deepak said...

Hi,

a.) If you use the meta tag, IE will still do a conversion and use the default OpenType font for the display. If you want the CDAC font to be used, you'll need to skip the meta tag altogether.

b.) I haven't seen anything from MS, and unfortunately, I haven't seen any free OpenType fonts either. The tools to create these fonts are freely available but their uptake has been slow.

Regards,
Deepak

Anonymous said...

Hi,

CDAC fonts have to be converted into ISCII first and then into Unicode. Please visit the discussion forums at www.bhashaindia.com wherein these are discussed heavily.

Regards,
Pavanaja

Unknown said...

Hi,
Can you please tell me how Indic languages are encoded and interpreted in the ISCII coding method, given that all of them share the same numerical character set?

Anonymous said...

Hi,
How do I write that to a text file? I tried using System.IO but failed.
Any suggestions on how to properly write the Unicode text to a file?
Thanks!!