Sam Hocevar’s .plan

shlib-with-non-pic-code: have inline assembly and PIC mix well

Posted on Fri, 13 Apr 2007 04:52:19 +0200 - Keywords: debian, devel

Perhaps the most accessible documentation on what PIC code is and how an ELF dynamic linker works is John Levine’s Linkers and Loaders (and it has amazing sketches, too!). The Gentoo documentation also has an Introduction to Position Independent Code. I’d like to give a few hints on how to fix the shlib-with-non-pic-code lintian error caused by inline assembly on the i386 and amd64 platforms, as well as build errors that may occur due to inline assembly being used.

I’m not going to cover the trivial “not all objects were built using gcc’s -fPIC flag” problem. It usually requires a fix to the build system, not to the code.

gcc can’t find a register (i386)

PIC on i386 uses a register to store the GOT (global offset table) address. This register is usually %ebx, making it unavailable for use by inline assembly (and also restricting the compiler’s register usage when compiling C or C++ code). So the following perfectly valid code will not build with the -fPIC flag:

 void cpuid(uint32_t op, uint32_t reg[4])
 {
    asm volatile("cpuid"
                 : "=a"(reg[0]), "=b"(reg[1]), "=c"(reg[2]), "=d"(reg[3])
                 : "a"(op)
                 : "cc");
 }

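Using -fPIC, gcc will say something along the lines of error: can't find a register in class ‘BREG’ while reloading ‘asm’. Two things need to be done to fix this: %ebx must be saved before cpuid overwrites it and restored afterwards, and the value cpuid leaves in %ebx must be copied to a register that gcc is free to choose, which means replacing the b output constraint with r.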

And here is the PIC-compliant version:

 void cpuid(uint32_t op, uint32_t reg[4])
 {
    asm volatile("pushl %%ebx      \n\t" /* save %ebx */
                 "cpuid            \n\t"
                 "movl %%ebx, %1   \n\t" /* save what cpuid just put in %ebx */
                 "popl %%ebx       \n\t" /* restore the old %ebx */
                 : "=a"(reg[0]), "=r"(reg[1]), "=c"(reg[2]), "=d"(reg[3])
                 : "a"(op)
                 : "cc");
 }
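As a usage sketch, here is how such a wrapper might be called to read the CPU vendor string from cpuid leaf 0. The cpu_vendor helper, the amd64 branch and the non-x86 stub are my own additions for illustration, and the i386 branch forces the second output into %esi (an assumption of mine, not something the version above does) so that it stays correct even when built without -fPIC. It also assumes the CPU supports the cpuid instruction at all.

```c
#include <stdint.h>
#include <string.h>

/* PIC-safe cpuid wrapper; returns 1 on x86, 0 elsewhere.
 * __asm__ is used instead of asm so that it also builds with -std=c99. */
static int cpuid(uint32_t op, uint32_t reg[4])
{
#if defined(__i386__)
    __asm__ volatile("pushl %%ebx      \n\t" /* save %ebx (GOT pointer under PIC) */
                     "cpuid            \n\t"
                     "movl %%ebx, %1   \n\t" /* copy what cpuid put in %ebx */
                     "popl %%ebx       \n\t" /* restore the old %ebx */
                     /* "S" forces %esi: without -fPIC, a plain "r" operand
                        could be allocated to %ebx itself */
                     : "=a"(reg[0]), "=S"(reg[1]), "=c"(reg[2]), "=d"(reg[3])
                     : "a"(op)
                     : "cc");
    return 1;
#elif defined(__x86_64__)
    /* %rbx is not reserved for PIC on amd64, so "=b" is fine here */
    __asm__ volatile("cpuid"
                     : "=a"(reg[0]), "=b"(reg[1]), "=c"(reg[2]), "=d"(reg[3])
                     : "a"(op)
                     : "cc");
    return 1;
#else
    (void)op;
    memset(reg, 0, 4 * sizeof(uint32_t)); /* not an x86 CPU: nothing to query */
    return 0;
#endif
}

/* Hypothetical helper: writes the 12-byte vendor string plus a
 * terminating NUL into out, and returns what cpuid() returned. */
int cpu_vendor(char out[13])
{
    uint32_t reg[4];
    int ok = cpuid(0, reg);
    memcpy(out + 0, &reg[1], 4); /* leaf 0 returns the string in %ebx, */
    memcpy(out + 4, &reg[3], 4); /* then %edx,                          */
    memcpy(out + 8, &reg[2], 4); /* then %ecx                           */
    out[12] = '\0';
    return ok;
}
```

Calling cpu_vendor() with a 13-byte buffer and printing the result with puts() should be enough to check that the wrapper builds and runs both with and without -fPIC.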

using variables from assembly code (i386)

Directly using variables from inline assembly always creates non-PIC code even if -fPIC is being used. There are at least three strategies to consider when trying to fix this. Here is an example of PIC-incompatible code:

 extern uint32_t sym1, sym2, sym3;
 
 void store_in_symbols(uint32_t x)
 {
     asm volatile("movl %0, sym1   \n\t"
                  /* ... */
                  "movl %0, sym2   \n\t"
                  /* ... */
                  "movl %0, sym3   \n\t"
                  : : "r"(x));
 }

The first strategy is to pass the variable through the usual gcc inline assembly syntax. This is not always possible because there might be a shortage of registers, but here is what it looks like:

 extern uint32_t sym1, sym2, sym3;
 
 void store_in_symbols(uint32_t x)
 {
     asm volatile("movl %3, %0   \n\t"
                  /* ... */
                  "movl %3, %1   \n\t"
                  /* ... */
                  "movl %3, %2   \n\t"
                  : "=r"(sym1), "=r"(sym2), "=r"(sym3) : "r"(x));
 }
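A variant of this first strategy, sketched here on the assumption that the stores have to happen inside the assembly code itself: with m constraints instead of r, gcc builds a PIC-correct memory operand for each symbol and the instruction writes straight to it. The symbols are defined locally and a plain C fallback is provided only so that the sketch builds on its own on any host. Note that on i386 the address of an external symbol still has to be loaded from the GOT into a register, so this does not relieve register pressure.

```c
#include <stdint.h>

uint32_t sym1, sym2, sym3; /* defined here so the sketch links on its own */

void store_in_symbols(uint32_t x)
{
#if defined(__i386__) || defined(__x86_64__)
    /* %0..%2 are memory operands generated by gcc, %3 is a register */
    __asm__ volatile("movl %3, %0   \n\t"
                     "movl %3, %1   \n\t"
                     "movl %3, %2   \n\t"
                     : "=m"(sym1), "=m"(sym2), "=m"(sym3)
                     : "r"(x));
#else
    sym1 = sym2 = sym3 = x; /* fallback for non-x86 hosts */
#endif
}
```

Because the symbols appear as outputs, gcc knows they are modified and no memory clobber is needed.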

If there are too many variables and not enough registers, one can use the second strategy: put all required variables (or addresses) in a table, thus requiring only one extra register:

 extern uint32_t sym1, sym2, sym3;
 
 void store_in_symbols(uint32_t x)
 {
     uint32_t tab[3];
     asm volatile("movl %1, (%0)   \n\t"
                  /* ... */
                  "movl %1, 4(%0)  \n\t"
                  /* ... */
                  "movl %1, 8(%0)  \n\t"
                  : : "r"(tab), "r"(x)
                  : "memory"); /* the asm writes to tab behind gcc's back */
     sym1 = tab[0];
     sym2 = tab[1];
     sym3 = tab[2];
 }
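The “(or addresses)” flavour of this strategy could look like the following sketch: gcc fills the table with PIC-safe addresses, and the assembly code loads each address into a scratch register before storing through it. The choice of %eax / %rax as scratch register, the amd64 branch and the C fallback are assumptions made for the sake of a self-contained example.

```c
#include <stdint.h>

uint32_t sym1, sym2, sym3; /* defined here so the sketch links on its own */

void store_in_symbols(uint32_t x)
{
    /* gcc computes the addresses, so they are PIC-safe */
    uint32_t *tab[3] = { &sym1, &sym2, &sym3 };
#if defined(__i386__)
    __asm__ volatile("movl (%0), %%eax    \n\t" /* address of sym1 */
                     "movl %1, (%%eax)    \n\t"
                     "movl 4(%0), %%eax   \n\t" /* address of sym2 */
                     "movl %1, (%%eax)    \n\t"
                     "movl 8(%0), %%eax   \n\t" /* address of sym3 */
                     "movl %1, (%%eax)    \n\t"
                     : : "r"(tab), "r"(x)
                     : "eax", "memory");
#elif defined(__x86_64__) && !defined(__ILP32__)
    /* same thing with 8-byte table entries */
    __asm__ volatile("movq (%0), %%rax    \n\t"
                     "movl %1, (%%rax)    \n\t"
                     "movq 8(%0), %%rax   \n\t"
                     "movl %1, (%%rax)    \n\t"
                     "movq 16(%0), %%rax  \n\t"
                     "movl %1, (%%rax)    \n\t"
                     : : "r"(tab), "r"(x)
                     : "rax", "memory");
#else
    int i;
    for (i = 0; i < 3; i++) /* fallback for other hosts */
        *tab[i] = x;
#endif
}
```

Only two operand registers and one scratch register are needed, however many symbols the table holds. The memory clobber tells gcc both that the table must be written before the asm runs and that the symbols change behind its back.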

If the second method happens to be unsuitable because variable types differ too much, there is a third, more complicated strategy: retrieve the GOT address using some linker-specific magic, then use @GOT addressing to access variables:

 extern uint32_t sym1, sym2, sym3;
 
 void store_in_symbols(uint32_t x)
 {
     uint32_t got;
     asm volatile("call get_got                               \n\t"
                  "get_got:                                   \n\t"
                  "popl %0                                    \n\t"
                  "addl $_GLOBAL_OFFSET_TABLE_ - get_got, %0  \n\t"
                  : "=r"(got));
     /* ... */
     asm volatile("pushl %%ebx                \n\t" /* save ebx */
                  "movl sym1@GOT(%0), %%ebx   \n\t" /* retrieve sym1 address */
                  "movl %1, (%%ebx)           \n\t"
                  /* ... */
                  "movl sym2@GOT(%0), %%ebx   \n\t" /* retrieve sym2 address */
                  "movl %1, (%%ebx)           \n\t"
                  /* ... */
                  "movl sym3@GOT(%0), %%ebx   \n\t" /* retrieve sym3 address */
                  "movl %1, (%%ebx)           \n\t"
                  "popl %%ebx                 \n\t" /* restore ebx */
                  : : "r"(got), "r"(x)
                  : "memory"); /* the symbols are written behind gcc's back */
 }

It is of course possible to merge the two above asm statements into one. However, the cost of retrieving the GOT address should not be overlooked, especially when the assembly code sits inside a critical loop: the loop might have to be unrolled and the GOT address retrieval done once, outside the loop.

Note also that _GLOBAL_OFFSET_TABLE_ is ELF-specific. On BSD and a.out systems you should use __GLOBAL_OFFSET_TABLE_. I am not aware of a way to do the same on Darwin / Mac OS X, which is a shame because it is the only widespread i386 platform where shared objects cannot be non-PIC.

gcc silently generates non-PIC code (amd64)

The amd64 architecture has the same problem as i386 when trying to access variables from assembly code. However, the solution is a lot easier: GOT entries can always be addressed relative to the special %rip register, so there is no GOT address to retrieve first. This is called rip-relative addressing:

 extern uint32_t sym1, sym2, sym3;
 
 void store_in_symbols(uint32_t x)
 {
     asm volatile("movq sym1@GOTPCREL(%%rip), %%rax   \n\t" /* retrieve sym1 address */
                  "movl %0, (%%rax)                   \n\t"
                  /* ... */
                  "movq sym2@GOTPCREL(%%rip), %%rax   \n\t" /* retrieve sym2 address */
                  "movl %0, (%%rax)                   \n\t"
                  /* ... */
                  "movq sym3@GOTPCREL(%%rip), %%rax   \n\t" /* retrieve sym3 address */
                  "movl %0, (%%rax)                   \n\t"
                  : : "r"(x)
                  /* rax is listed as clobbered rather than pushed and popped:
                     gcc then keeps %0 out of rax, and the red zone below
                     %rsp is left untouched */
                  : "rax", "memory");
 }

amd64 also has more registers than i386, which makes the first discussed technique less prone to register shortage.
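When the variables are private to the shared object, the GOT can be skipped altogether. The following is a sketch under the assumption of an ELF target and a compiler that supports the GCC visibility attribute: a symbol with hidden visibility cannot be overridden from outside the object, so a plain rip-relative reference to it is resolved at link time and stays PIC. The names priv1, priv2 and store_in_private are hypothetical, and the C fallback is there only for other targets.

```c
#include <stdint.h>

/* hidden: visible to the whole shared object, but not exported from it */
__attribute__((visibility("hidden"))) uint32_t priv1 = 0, priv2 = 0;

void store_in_private(uint32_t x)
{
#if defined(__x86_64__) && defined(__ELF__)
    /* no GOT lookup and no scratch register: the displacement from
       %rip to each symbol is fixed when the object is linked */
    __asm__ volatile("movl %0, priv1(%%rip)   \n\t"
                     "movl %0, priv2(%%rip)   \n\t"
                     : : "r"(x)
                     : "memory");
#else
    priv1 = priv2 = x; /* fallback for other targets */
#endif
}
```

This does not work for symbols that must remain exported, since those have to go through the GOT as shown above.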

Why bother?

Good question.

Currently many shared libraries simply disable entire chunks of inline assembly on the amd64 architecture. Whether these routines bring any performance benefit compared to the C alternative is uncertain, but without being able to build the code there is no way to tell.

Even fewer developers care about i386, because this architecture can handle non-PIC shared objects just fine (but then they are no longer really shared). I am in favour of making all shared objects PIC-friendly (i.e. not necessarily building them as PIC if there is a valid reason not to, but at least not making it too difficult to build PIC versions) because that makes it easier to port the code to Darwin. I also believe the memory gain, and especially the cache gain, of using real shared libraries is underestimated.
