UTF-8 = TeH SuX0r
All right, you think UTF-8 is great because it lets you display and store data in virtually any language while keeping some compatibility with plain old ASCII strings. You set your whole system to a UTF-8 locale, so that you never have to worry about character encodings again. You even speak UTF-8 on IRC channels, even though everyone else is using an iso8859 charset (hah! what a bunch of retrogrades). But what you don't know yet is that UTF-8 = TeH SuX0r !!!!!111
Parsing UTF-8 strings is slow
So, I have this 2 MB foo.txt file containing only ASCII characters (generated with yes aaaaaaaaaaaaaaa | head -131072: that's 131072 lines of fifteen a's plus a newline, 16 bytes each, exactly 2 MB), and I want to know how many characters there are in it. Here are the two attempts:
% time LC_CTYPE=fr_FR.UTF-8 wc foo.txt >/dev/null
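The second attempt, for comparison, is the same command under the single-byte fr_FR locale, along these lines:

% time LC_CTYPE=fr_FR wc foo.txt >/dev/null

(In a single-byte locale, one byte is one character, so wc can essentially just count bytes; in a UTF-8 locale it has to decode every multibyte sequence, with mbrtowc() and friends, to find characters and word boundaries.)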
Woah! Counting characters is 11 times slower with a UTF-8 locale.
Well, some people can live with that. But the problem is that counting characters is not the most complicated thing you'd do. Let's try something else now, like looking for non-empty lines in foo.txt:
% time LC_CTYPE=fr_FR.UTF-8 grep "." foo.txt >/dev/null
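And the comparison run under the latin-1 locale, presumably:

% time LC_CTYPE=fr_FR grep "." foo.txt >/dev/null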
Hmm, that was too fast for the fr_FR locale to be meaningful, so I made this a bit more difficult for it by concatenating foo.txt a hundred times, creating a 200 MB bar.txt file. Timing the same greps on bar.txt shows that grep "." is more than 6000 times slower when using a UTF-8 locale.
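For reference, here is one way to build bar.txt and redo the measurement, as a sketch (any concatenation method will do):

% for i in `seq 100`; do cat foo.txt; done > bar.txt
% time LC_CTYPE=fr_FR grep "." bar.txt >/dev/null
% time LC_CTYPE=fr_FR.UTF-8 grep "." bar.txt >/dev/null

The usual explanation is that grep in a single-byte locale gets to use its fast byte-oriented matcher, while in a multibyte locale this version of GNU grep decodes the whole input character by character before matching.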
Notes: I have been using GNU grep 2.5.1 and wc 5.2.1 on a glibc 2.3.2 system. The machine has a 1.7 GHz Pentium M and 1 GB of RAM. All tests were run five times; the two extreme values were dropped and the three remaining values averaged. YMMV, of course.
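If you want to replicate the protocol, here is a rough harness (a sketch assuming GNU time; adapt the command under test as needed):

# five runs; drop the fastest and the slowest, average the other three
: > times.log
for i in `seq 5`; do
    /usr/bin/time -f "%e" -a -o times.log \
        env LC_CTYPE=fr_FR.UTF-8 wc foo.txt > /dev/null
done
sort -n times.log | sed '1d;$d' | awk '{ s += $1 } END { print s / NR }'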