Living with UTF-8 character encoding

From a strictly technical point of view, one of the biggest gains in application size, speed and memory use would come from making sure that nothing in a GNU/Linux distribution is compiled with UTF-8 support. This is certainly true, but it doesn’t match the objectives of RULE, for the following reasons:

One encoding for all languages is a good thing: it’s silly that, even among languages using some derivation of the same alphabet, there is more than one sequence of bytes for the same letter.
Users in developing countries absolutely need a multi-byte encoding.
The world (OK, at least all Linux distros and Free Software in general) is going UTF-8, for the reasons above. Compiling without UTF-8 means building and supporting (almost) a whole separate distro from source. Even ignoring the first two points above, we simply have no time, expertise or CPU power to do it.

The problem today, at least in certain cases, is due to applications that have not yet been rewritten to live in a UTF-8 world. So, for example, you may filter your files much more slowly than before because the grep or diff programs haven’t been told yet that only bytes whose first bit is “1” belong to multibyte characters, while every other byte is already a whole character. Hence, those programs crawl one byte at a time even when there is no need.
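To make the byte rule above concrete, here is a minimal sketch in Python. It is not how grep or diff are implemented; the function names are hypothetical and chosen only for illustration. It relies on the standard UTF-8 layout: a byte whose first bit is 0 is a complete ASCII character, a byte starting with bits 10 continues a multibyte character, and a byte starting with 110, 1110 or 11110 opens a sequence of two, three or four bytes.

```python
def classify_byte(b: int) -> str:
    """Classify one byte (0-255) by its leading bits (hypothetical helper)."""
    if b & 0x80 == 0x00:
        return "ascii"         # 0xxxxxxx: a whole character by itself
    if b & 0xC0 == 0x80:
        return "continuation"  # 10xxxxxx: inside a multibyte character
    if b & 0xE0 == 0xC0:
        return "lead2"         # 110xxxxx: starts a 2-byte character
    if b & 0xF0 == 0xE0:
        return "lead3"         # 1110xxxx: starts a 3-byte character
    if b & 0xF8 == 0xF0:
        return "lead4"         # 11110xxx: starts a 4-byte character
    return "invalid"           # 11111xxx never appears in UTF-8


def count_characters(data: bytes) -> int:
    """Count characters in well-formed UTF-8 by skipping continuation bytes."""
    return sum(1 for b in data if b & 0xC0 != 0x80)


def is_pure_ascii(data: bytes) -> bool:
    """True when no byte has its first bit set, so bytes and characters coincide."""
    return all(b < 0x80 for b in data)


if __name__ == "__main__":
    sample = "naïve €".encode("utf-8")
    print([classify_byte(b) for b in sample])
    print(len(sample), count_characters(sample), is_pure_ascii(sample))
```

A tool that first checks something like `is_pure_ascii` could, in principle, treat such input as plain bytes and skip multibyte handling entirely, which is the kind of shortcut the paragraph above says is still missing in some programs.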

In other words, since it is good to have UTF-8, the solution is not to recompile programs without it, but to modify them so that they handle it properly. Since RULE is not a distribution, and the real distros, including Fedora Core, are already (if slowly) doing the dirty work for RULE, let’s just take advantage of that.

RULE = Run Up to Date Linux Everywhere
