UTF-8 = TeH SuX0r

All right, you think UTF-8 is great because it lets you display and store data in virtually any language, while keeping some compatibility with the usual ASCII strings. You set your whole system to a UTF-8 locale, so that you never have to worry about character encodings again. You even speak UTF-8 on IRC channels, even though everyone else is using an iso8859 charset (hah! what a bunch of retrogrades).

But what you don’t know yet is that UTF-8 = TeH SuX0r !!!!!111

Parsing UTF-8 strings is slow

So, I have this 2 MB
Woah! Counting characters is 11 times slower with a UTF-8 locale.
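
To give a rough idea of where the extra work comes from, here is a minimal Python sketch. It is not what wc does internally; the function names and the sample string are made up for illustration, and the UTF-8 counter assumes the input is already valid UTF-8 (a real tool also has to validate, which costs more). In a single-byte locale the character count is just the byte count, while in UTF-8 every byte has to be inspected to tell lead bytes from continuation bytes.

```python
def count_bytes(data: bytes) -> int:
    """Single-byte charset: one byte is one character, so the length is enough."""
    return len(data)


def count_utf8_chars(data: bytes) -> int:
    """UTF-8: count every byte that is NOT a continuation byte (10xxxxxx).

    Assumes `data` is valid UTF-8; no validation is done here.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


# Hypothetical sample: two accented Latin letters (2 bytes each)
# and three CJK characters (3 bytes each), plus plain ASCII.
sample = "na\u00efve caf\u00e9, \u65e5\u672c\u8a9e\n".encode("utf-8")

print("bytes:     ", count_bytes(sample))
print("characters:", count_utf8_chars(sample))
```

The first function never looks at the data, the second one has to walk through all of it, and that is before any locale machinery gets involved.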
Well, some people can live with that. But the problem is that counting
characters is not the most complicated thing you’d do. Let’s try something
else, now, like looking for non-empty lines in
Hum, that was too fast for the

Notes: I have been using GNU grep 2.5.1 and wc 5.2.1 on a glibc 2.3.2 system. The system has a Pentium M 1.7 and 1 GB of RAM. Each test was run five times; the two extreme values were dropped and the three remaining values were averaged. YMMV, of course.

More to come!