
Unicode/UTF Encoding Demo

The following lines should all end with a small umlaut-U character that is marked up or encoded in different ways (indicated by the preceding text). How your browser displays each one depends on the charset encoding set by the webpage.

ISO8859-1/Latin single byte:
Raw UTF-8 byte-pair: ü
Unicode entity Number: ü
Special entity Name: ü

The encoding for this page is currently set to UTF-8. Note how the last two are consistent no matter what encoding is selected, because entities are written in plain ASCII. Try selecting one of the following to see how it affects the rendering of the character.

Encoding Options: UTF-8 | ISO8859-1
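
The effect of switching the page encoding can be approximated outside the browser. The following is a minimal Python sketch, not the page's own code: the helper name render is hypothetical, and Python's latin-1 codec is assumed to be a fair stand-in for a browser's ISO8859-1 handling of these particular bytes.

```python
import html

# The same character stored three ways, mirroring the demo lines above.
latin1_bytes = "ü".encode("latin-1")   # one byte: FC
utf8_bytes = "ü".encode("utf-8")       # two bytes: C3 BC
entity_bytes = b"&#252;"               # pure ASCII numeric entity

def render(raw: bytes, charset: str) -> str:
    """Hypothetical helper: decode bytes under a charset, then resolve entities."""
    text = raw.decode(charset, errors="replace")  # invalid bytes become U+FFFD
    return html.unescape(text)

for charset in ("utf-8", "latin-1"):
    print(charset,
          repr(render(latin1_bytes, charset)),
          repr(render(utf8_bytes, charset)),
          repr(render(entity_bytes, charset)))
```

Under UTF-8 the lone FC byte is invalid and comes out as a replacement character, while under Latin-1 the C3 BC pair comes out as two separate characters. The entity resolves to the umlaut-U either way.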


UTF-8 Encoding Notes:

1) Latin-1/Unicode character: LATIN SMALL LETTER U WITH DIAERESIS
	Hex = FCh, Decimal equivalent = 252,
	HTML-friendly numeric entity = &#252;

2) Because its value is 128 or greater (and below 2048), it requires two UTF-8 bytes

3) Byte1 = 192 + (Ucode-value div 64)
	   = 192 + (252 div 64)
	   = 192 + 3
	   = 195

  note: 192 is added because the two leading 1 bits (128+64), followed
	by a 0 bit, signal that the UTF-8 sequence is multi-byte and
	two bytes in length

4) Byte2 = 128 + (Ucode-value mod 64)
	   = 128 + (252 mod 64)
	   = 128 + 60
	   = 188

  note: 128 is added because a set high-order bit (followed by a 0 bit)
	marks a continuation byte of a UTF-8 sequence and not a normal
	low-ordinal (<128) ASCII char

5) So the final UTF-8 encoding is...
	Decimal <195><188> = Hex <C3><BC>

   and these are the two bytes embedded in the UTF-8 example at
   the top of this page
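
The arithmetic in steps 3 to 5 can be checked with a short Python sketch. The function name encode_two_byte is made up for illustration, and the range check (128 to 2047) is an added assumption about where the two-byte form applies. Python's built-in encoder is used only to confirm the result.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Hypothetical helper: build a two-byte UTF-8 sequence by hand."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("two-byte UTF-8 covers code points 128..2047 only")
    byte1 = 192 + (code_point // 64)   # lead byte, bit pattern 110xxxxx
    byte2 = 128 + (code_point % 64)    # continuation byte, bit pattern 10xxxxxx
    return bytes([byte1, byte2])

encoded = encode_two_byte(252)
print(list(encoded))                    # [195, 188]
print(encoded.hex())                    # c3bc
print(encoded == "ü".encode("utf-8"))   # True
```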

References:

http://www1.tip.nl/~t876506/utf8tbl.html
ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT

