Character Set in JavaScript

Published on 5 Nov, 2020

JavaScript programs are written using the Unicode character set. Unicode is a superset of ASCII and supports most of the world's languages.

Unicode is a standard for the consistent representation of text, maintained by the Unicode Consortium. The Unicode Consortium is a non-profit organization based in Mountain View, California.

Non-English Text

Let us try to print some Japanese text using a console.log() statement. I translated "hope" into Japanese using Google Translate, and the result is 望む (romanized as "nozomu").

console.log("望む");

The code above prints the Japanese text to the console exactly as written.

望む

Since JavaScript supports the Unicode character set, it is also possible to use words from other languages as variable names.

const പേര് = "Backbencher";
console.log(പേര്); // "Backbencher"

The code above uses a Malayalam word as an identifier, which is valid in JavaScript.

Escape Sequence

If hardware or software limitations prevent us from typing a particular Unicode character, we can use an escape sequence instead. Any Unicode character in the range U+0000 to U+FFFF can be written using 6 characters: a \, a u and 4 hexadecimal digits. Characters beyond that range need either two such escapes (a surrogate pair) or the ES2015 form \u{...}, which accepts any code point up to U+10FFFF.

console.log("\u2764"); // "❤"

The code above logs a heart symbol to the console.

Another useful case is writing accented Latin letters. How do we write an é? We can use a Unicode escape here as well.
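For characters outside the four-digit range, such as most emoji, the following is a minimal sketch of the two options. It assumes an ES2015 or later engine; the variable names are only for illustration.

```javascript
// Code point escape (ES2015+): the braces allow more than 4 hex digits.
const grinningFace = "\u{1F600}";

// The same character written as a surrogate pair of two \uXXXX escapes.
const surrogatePair = "\uD83D\uDE00";

console.log(grinningFace === surrogatePair); // true
console.log(grinningFace.length); // 2, because length counts UTF-16 code units
console.log(grinningFace.codePointAt(0).toString(16)); // "1f600"
```

Both spellings produce the same string, so the braces form is mostly a readability choice.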

console.log("\u00e9"); // "é"

As far as the JavaScript engine is concerned, é and \u00e9 are the same.

console.log("é" === "\u00e9"); // true
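Going the other way, we sometimes have a character and want to know its escape. A rough sketch follows; toUnicodeEscape is a hypothetical helper written for this example, not a built-in, and it only handles characters that fit in 4 hexadecimal digits.

```javascript
// Hypothetical helper: returns the \uXXXX escape for a single character
// in the range U+0000 to U+FFFF.
function toUnicodeEscape(char) {
  const hex = char.charCodeAt(0).toString(16).padStart(4, "0");
  return "\\u" + hex;
}

console.log(toUnicodeEscape("\u2764")); // \u2764
console.log(toUnicodeEscape("A")); // \u0041

// The built-in String.fromCharCode() converts a number back to a character.
console.log(String.fromCharCode(0x00e9) === "\u00e9"); // true
```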

Normalization

Unicode allows some characters to be written in more than one way. Let us take the case of é. It can be written as a single Unicode character, as seen above.

console.log("\u00e9"); // "é"

é can also be written by following the plain ASCII e with the combining acute accent (\u0301). A combining mark attaches its accent to the character that precedes it.

console.log("e\u0301"); // "é"
console.log("f\u0301"); // "f́"

Even though both techniques produce the same visible output, the strings are not equal, because they contain different code points.

console.log("\u00e9" === "e\u0301"); // false
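When such strings need to be compared, the built-in String.prototype.normalize() method (ES2015+) can convert them to a common form first. A minimal sketch, with variable names chosen only for this example:

```javascript
const precomposed = "\u00e9"; // a single code point
const decomposed = "e\u0301"; // "e" followed by the combining acute accent

console.log(precomposed.length); // 1
console.log(decomposed.length); // 2

// NFC composes base character + combining mark into one code point.
console.log(precomposed === decomposed.normalize("NFC")); // true

// NFD does the opposite and decomposes the single code point.
console.log(precomposed.normalize("NFD") === decomposed); // true
```

Normalizing both sides to the same form before comparing is usually enough for equality checks on user-entered text.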

Unicode Application

Even though we can use Unicode in variable names and string literals, doing so directly is rare. I have not seen anyone use a Japanese word as a variable name. For maximum readability, it is best to name variables in English.

There can be scenarios where we need to insert a special character, such as the copyright symbol. In that case, using a Unicode escape can save us from inserting an additional image.

console.log("\u00A9"); // "©"