Notepad++: Encode In vs Convert To

In this article I will explain the difference between Encode In and Convert To.

You will find these menu options in Notepad++, a popular text editor.

Encode In means keep the raw information (the byte sequence) the same and reinterpret it using the specified character set.

Convert To means keep the characters the same and rewrite the byte sequence so that it represents them in the specified character set.

Let us take a hypothetical example to understand it.

We have two character sets, X and Y, as follows:

Character Set X

Character - A, B, C
Hex Format - 3C, 49, 2D

Character Set Y

Character - A, B, C
Hex Format - 49, 2D, 3C

We have a file named message.txt in character set X with the content ACB BC. Its hex content is 3C2D49 492D.

After Encode In character set Y, the file becomes:

Hex Content - 3C2D49 492D
Character - CBA AB

After Convert To character set Y, the file becomes:

Hex Content - 493C2D 2D3C
Character - ACB BC

(I have not replaced the space character with its hex value.)
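
The hypothetical example above can be replayed in a few lines of Python. This is only a sketch: X and Y are the made-up character sets from this article, the helper names (decode, encode, encode_in, convert_to) are my own for illustration, and I assume the space character is byte 0x20 in both sets.

```python
# Hypothetical character sets from the example: byte value -> character.
# Assumption: the space character is byte 0x20 in both sets.
X = {0x3C: "A", 0x49: "B", 0x2D: "C", 0x20: " "}
Y = {0x49: "A", 0x2D: "B", 0x3C: "C", 0x20: " "}


def decode(data: bytes, charset: dict) -> str:
    """Map each byte to the character it stands for in `charset`."""
    return "".join(charset[b] for b in data)


def encode(text: str, charset: dict) -> bytes:
    """Map each character back to its byte in `charset`."""
    to_byte = {ch: b for b, ch in charset.items()}
    return bytes(to_byte[ch] for ch in text)


def encode_in(data: bytes, target: dict):
    """Encode In: bytes stay the same, characters are re-read in `target`."""
    return data, decode(data, target)


def convert_to(data: bytes, source: dict, target: dict):
    """Convert To: characters stay the same, bytes are rewritten for `target`."""
    text = decode(data, source)
    return encode(text, target), text


message = bytes.fromhex("3C2D49 20 492D")  # "ACB BC" in character set X

same_bytes, new_text = encode_in(message, Y)
new_bytes, same_text = convert_to(message, X, Y)

print(same_bytes.hex().upper(), new_text)   # 3C2D4920492D CBA AB
print(new_bytes.hex().upper(), same_text)   # 493C2D202D3C ACB BC
```

Apart from the 20 that stands for the space, the printed values match the hex content and characters listed above.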

In short:
  • Encode In means keep the bytes and change the characters they map to.
Byte Sequence (source character set) ==> Character (target character set)
  • Convert To means keep the characters and change the bytes that represent them.
Byte Sequence (source character set) ==> Character (source character set) ==> Character (target character set) ==> Byte Sequence (target character set)
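
The two pipelines can also be tried with real character sets using Python's built-in bytes.decode and str.encode. This is a rough sketch of the idea, not how Notepad++ implements it; the function names reinterpret and convert are my own.

```python
def reinterpret(data: bytes, target: str) -> str:
    """Encode In: the bytes are untouched, only their reading changes."""
    return data.decode(target)


def convert(data: bytes, source: str, target: str) -> bytes:
    """Convert To: decode with the source set, re-encode with the target set."""
    return data.decode(source).encode(target)


data = "é".encode("utf-8")                      # two bytes: 0xC3 0xA9

print(reinterpret(data, "iso-8859-1"))          # Ã©  (same bytes, different characters)
print(convert(data, "utf-8", "iso-8859-1"))     # b'\xe9'  (same character, different bytes)
```

The first call shows the familiar garbled text you get when a UTF-8 file is read as ISO-8859-1; the second keeps the letter intact and changes what is stored.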
 
For an Encode In operation to complete without error, the incoming byte sequence from the source (say, a file) must be valid in the target character set.

For a Convert To operation to complete without error, the incoming byte sequence (say, from a file) must be valid in the source character set, and the characters it decodes to must exist in the target character set. Why do I say this? Because two character sets can contain the same character but represent it with different byte sequences. Take the copyright symbol, for example. It is 0xA9 in ISO-8859-1, whereas it is 0xC2 0xA9 in UTF-8. Now suppose a file stores the symbol as the single byte 0xA9. If I try to convert this file to another character set and state that the source character set is UTF-8, the conversion will fail, because a lone 0xA9 is not a valid byte sequence in UTF-8.
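
The copyright example can be checked in Python, whose decode and encode steps raise UnicodeDecodeError and UnicodeEncodeError when the bytes or characters are not understood. The helper try_convert below is a hypothetical name used only for this sketch.

```python
def try_convert(data: bytes, source: str, target: str) -> str:
    """Attempt a Convert To and report which half of the pipeline failed."""
    try:
        text = data.decode(source)            # bytes must be valid in the source set
    except UnicodeDecodeError:
        return "error: bytes are not valid in " + source
    try:
        return text.encode(target).hex()      # characters must exist in the target set
    except UnicodeEncodeError:
        return "error: character not available in " + target


copyright_byte = b"\xa9"   # the copyright symbol as stored in ISO-8859-1

print(try_convert(copyright_byte, "iso-8859-1", "utf-8"))   # c2a9
print(try_convert(copyright_byte, "utf-8", "iso-8859-1"))   # error: bytes are not valid in utf-8
print(try_convert(copyright_byte, "iso-8859-1", "ascii"))   # error: character not available in ascii
```

The second call is the case described above, where the wrong source character set is named. The third shows the other condition: the character is read correctly but the target set has no byte for it.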

Further Readings

Character encodings for beginners 


Why converting and encoding, in Encodings menu, differ

Please note that a file may begin with a BOM (Byte Order Mark). Some applications do not handle it gracefully, so you need to act accordingly.
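
As one illustration of acting accordingly, Python ships a "utf-8-sig" codec that drops a leading UTF-8 BOM when decoding, while the plain "utf-8" codec keeps it as the character U+FEFF. This is a small sketch with made-up sample data; other tools have their own ways of handling the BOM.

```python
import codecs

# Sample data: the UTF-8 BOM (bytes EF BB BF) followed by the text "hello".
with_bom = codecs.BOM_UTF8 + "hello".encode("utf-8")

kept = with_bom.decode("utf-8")          # BOM survives as the first character
stripped = with_bom.decode("utf-8-sig")  # BOM is removed if present

print(len(kept), len(stripped))          # 6 5
print(kept[0] == "\ufeff")               # True
```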
