UTF-8 = TeH SuX0r
All right, you think UTF-8 is great because it lets you display and store data in virtually any language while keeping some compatibility with plain old ASCII strings. You set your whole system to a UTF-8 locale, so that you never have to worry about character encodings again. You even speak UTF-8 on IRC channels, even though everyone else is using an iso8859 charset (hah! what a bunch of retrogrades). But what you don't know yet is that UTF-8 = TeH SuX0r !!!!!111
Parsing UTF-8 strings is slow
So, I have this 2 MB foo.txt file containing only ASCII characters (generated with yes aaaaaaaaaaaaaaa | head -131072: that's 131072 lines of fifteen a's plus a newline, 16 bytes each, exactly 2 MB), and I want to know how many characters there are in it. Here are the two attempts:
% time LC_CTYPE=fr_FR.UTF-8 wc foo.txt >/dev/null
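The second attempt, for comparison, is the same command under the single-byte fr_FR locale, along these lines:

% time LC_CTYPE=fr_FR wc foo.txt >/dev/null

(In a single-byte locale, one byte is one character, so wc can essentially just count bytes; in a UTF-8 locale it has to decode every multibyte sequence, with mbrtowc() and friends, to find characters and word boundaries.)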
Woah! Counting characters is 11 times slower with a UTF-8 locale.
Well, some people can live with that. But the problem is that counting characters is not the most complicated thing you'd do. Let's try something else now, like looking for non-empty lines in foo.txt:
% time LC_CTYPE=fr_FR.UTF-8 grep "." foo.txt >/dev/null
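And the comparison run under the latin-1 locale, presumably:

% time LC_CTYPE=fr_FR grep "." foo.txt >/dev/null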
Hmm, that was too fast for the fr_FR locale to be meaningful, so I made this a bit more difficult for it by concatenating foo.txt a hundred times, creating a 200 MB bar.txt file. Timing the same greps on bar.txt shows that grep "." is more than 6000 times slower when using a UTF-8 locale.
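For reference, here is one way to build bar.txt and redo the measurement, as a sketch (any concatenation method will do):

% for i in `seq 100`; do cat foo.txt; done > bar.txt
% time LC_CTYPE=fr_FR grep "." bar.txt >/dev/null
% time LC_CTYPE=fr_FR.UTF-8 grep "." bar.txt >/dev/null

The usual explanation is that grep in a single-byte locale gets to use its fast byte-oriented matcher, while in a multibyte locale this version of GNU grep decodes the whole input character by character before matching.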
Notes: I have been using GNU grep 2.5.1 and wc 5.2.1 on a glibc 2.3.2 system. The machine has a 1.7 GHz Pentium M and 1 GB of RAM. All tests were run five times; the two extreme values were dropped and the three remaining values averaged. YMMV, of course.
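If you want to replicate the protocol, here is a rough harness (a sketch assuming GNU time; adapt the command under test as needed):

# five runs; drop the fastest and the slowest, average the other three
: > times.log
for i in `seq 5`; do
    /usr/bin/time -f "%e" -a -o times.log \
        env LC_CTYPE=fr_FR.UTF-8 wc foo.txt > /dev/null
done
sort -n times.log | sed '1d;$d' | awk '{ s += $1 } END { print s / NR }'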