Discussion:
[avr-libc-dev] Request: gcrt1.S with empty section .init9
Marko Mäkelä
2017-01-06 15:21:03 UTC
I am trying to move from assembler to C programming on the AVR while
avoiding unnecessary overhead. I see that crt1/gcrt1.S contains the
following code:

.section .init9,"ax",@progbits
#ifdef __AVR_ASM_ONLY__
XJMP main
#else /* !__AVR_ASM_ONLY__ */
XCALL main
XJMP exit
#endif /* __AVR_ASM_ONLY__ */

The above references to main() and exit() are needed for complying with
the C standard. However, I would tend to believe that normally programs
written for bare metal (such as the AVR) never terminate. Such programs
do not need an exit() function or even a call or jump to main().

Would it be possible to introduce a (necessarily non-standard) option
that allows the .init9 section of the runtime library to be omitted?
Then, the user could declare their infinite main loop something like
this:

__attribute__((naked)) __attribute__((section(".init9")))
static
void
mainloop (void)
{
  for (;;) do_my_stuff ();
}

This would save 3 instructions and some RAM (call main/ret, jump exit).

I am aware of the linker options -nostartfiles -nostdlib, but I do want
the interrupt table and the sections .init0 through .init8.

Perhaps .linkonce or some clever use of .weak could help here? Or
perhaps an alternative crt.o file could be provided for a minimal
startup?

I would have tried to patch gcrt1.S myself, but I am having trouble
setting up the build environment.

Best regards,

Marko
Georg-Johann Lay
2017-01-06 18:31:26 UTC
Post by Marko Mäkelä
I am trying to move from assembler to C programming on the AVR while
avoiding unnecessary overhead. I see that crt1/gcrt1.S contains the
#ifdef __AVR_ASM_ONLY__
XJMP main
#else /* !__AVR_ASM_ONLY__ */
XCALL main
XJMP exit
#endif /* __AVR_ASM_ONLY__ */
When you need optimizations at a level where 2 instructions matter,
then it's very likely you need project-specific start-up code and a
linker description anyway. For example, you might truncate the
vector table after the last vector used by the application.

For an easy fix, you can

1) Set up own start-up code omitting .init9

2) Provide own linker description file without input section .init9

3) Or, as a quick fix: 3a) Link with -Wl,--unique=.init9 so that
.init9 becomes an output section, and then 3b) drop it by means
of avr-objcopy --remove-section=.init9 foo.elf

All of these approaches require main in .init8 or earlier.
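
For illustration, a minimal sketch of that placement on the C side
(main_loop and do_my_stuff are made-up names; the dummy main is only
there so that the call in the stock gcrt1.S's .init9 still links, e.g.
when using the avr-objcopy quick fix above):

static void do_my_stuff (void)
{
  /* application work would go here */
}

__attribute__((naked, used, section(".init8")))
static void main_loop (void)
{
  for (;;)
    do_my_stuff ();
}

int main (void)
{
  return 0; /* never reached; only satisfies the reference from .init9 */
}
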
Post by Marko Mäkelä
The above references to main() and exit() are needed for complying with
the C standard. However, I would tend to believe that normally programs
written for bare metal (such as the AVR) never terminate. Such programs
do not need an exit() function or even a call or jump to main().
One approach is to handle it similarly to those bits of the start-up code
which are only dragged in if needed. For example, if avr-gcc sees some stuff
being put into .bss or COMMON, it emits ".global __do_clear_bss" where
the latter is implemented in libgcc, cf. http://gcc.gnu.org/PR18145
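
As a tiny illustration of that drag-only-if-needed mechanism (a sketch;
the variable names are invented, and the .data / __do_copy_data pair is
the analogous mechanism for initialized data):

char rx_buffer[32]; /* lands in .bss: avr-gcc emits ".global __do_clear_bss" */
char tx_state = 1;  /* lands in .data: avr-gcc emits ".global __do_copy_data" */

Only because those symbols are referenced does the linker pull the
clearing and copying loops in from libgcc.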

Using a similar approach for, say, call_main would lead to the unpleasant
effect that you get the same code twice unless you are using a libc
version which removed .init9 from gcrt1.S :-), and you would still need
a command-line option to /not/ drag in the call to main by having the
compiler /not/ emit the ".global __init9_call_main".

gcrt1.S would need yet another #if __GNUC__ >= 7 or so, and because
toolchain distributors are usually not aware of such subtleties, you
will observe complaints of "brain dead misoptimization" à la

CALL main
JMP exit
CALL main
JMP exit

throughout avr forums all over the net if someone bundles avr-gcc
with the new feature together with avr-libc without conditional removal.

Johann

One last note: As you are coming straight from asm programming, you will
have a hard time reading the compiler generated code. Maybe you are
shocked enough to jump into contributing to GCC :-)
Post by Marko Mäkelä
Would it be possible to introduce a (necessarily non-standard) option
that allows the .init9 section of the runtime library to be omitted?
__attribute__((naked)) __attribute__((section(".init9")))
static
void
mainloop (void)
{
  for (;;) do_my_stuff ();
}
This would save 3 instructions and some RAM (call main/ret, jump exit).
I am aware of the linker options -nostartfiles -nostdlib, but I do want
the interrupt table and the sections .init0 through .init8.
Perhaps .linkonce or some clever use of .weak could help here? Or
perhaps an alternative crt.o file could be provided for a minimal startup?
I would have tried to patch gcrt1.S myself, but I am having trouble
setting up the build environment.
Best regards,
Marko
Marko Mäkelä
2017-01-07 10:01:53 UTC
Hello Johann,
Post by Georg-Johann Lay
When you need optimizations at a level where 2 instructions matter,
then it's very likely you need project-specific start-up code and a
linker description anyway. For example, you might truncate the vector
table after the last vector used by the application.
Good idea, thanks! I did think about the interrupt vector table already,
and that approach would allow me to trim it too.
Post by Georg-Johann Lay
For an easy fix, you can
1) Set up own start-up code omitting .init9
2) Provide own linker description file without input section .init9
3) Or, as a quick fix: 3a) Link with -Wl,--unique=.init9 so that
.init9 becomes an output section, and then 3b) drop it by means
of avr-objcopy --remove-section=.init9 foo.elf
All of these approaches require main in .init8 or earlier.
Right. I had already successfully put main in .init3 before posting.

The quick fix 3) works and shortens the program by 4 words and reduces
stack usage by 2 bytes. The .init9 section is emitted at the end
of the ELF binary and is indeed omitted from the avr-objcopy output.

[snip]
Post by Georg-Johann Lay
gcrt1.S would need yet another #if __GNUC__ >= 7 or so, and because
toolchain distributors are usually not aware of such subtleties, you
will observe complaints of "brain dead misoptimization" à la
CALL main
JMP exit
CALL main
JMP exit
throughout avr forums all over the net if someone bundles avr-gcc with
the new feature together with avr-libc without conditional removal.
Right, so the risk could be greater than the savings.
Post by Georg-Johann Lay
One last note: As you are coming straight from asm programming, you
will have a hard time reading the compiler generated code.
Actually I write C and C++ code for bigger systems for a living.

The 8-bit processors are just a hobby, and my ‘first love’ is the 6502,
not the AVR. I was happy to learn that the avr-llvm changes were
recently merged to the upstream repository. The experimental AVR target
for clang generates some code, but it still needs work. I am hoping that
one day clang generates similar code as avr-gcc. Also clang++ works,
which is nice if you watched the CppCon 2016 talks touting zero-overhead
abstraction, such as these:
http://youtu.be/zBkNBP00wJE
http://youtu.be/uzF4u9KgUWI
http://youtu.be/D7Sd8A6_fYU
Post by Georg-Johann Lay
Maybe you are shocked enough to jump into contributing to GCC :-)
Not an impossible idea, but I find the idea of LLVM more promising,
because it could be easier to add other 8-bit processor targets there.

So far I found the generated code surprisingly good. I feared that GCC
would target a ‘virtual machine’ with 32-bit registers, but that does
not seem to be the case, or there are good peephole optimizations in
place, and my input is so simple. I am using the Debian package gcc-avr
1:4.9.2+Atmel3.5.3-1.

My only complaint is that avr-gcc does not allow me to assign a pointer
to the Z register even when my simple program does not need that
register for anything else:

register const __flash char* usart_tx_next asm("r28"); // better: r30:31

ISR(USART_TX_vect)
{
  char c;
  if (!usart_tx_next);
  else if ((c = *usart_tx_next++))
    UDR0 = c;
  else
    usart_tx_next = 0;
}

In its current form, this program is generating quite a few push/pop to
preserve the value of the Z register while copying the Y register to it.

I got the impression that LLVM is a 16-bit (or wider) virtual machine.
It could be an acceptable design choice, given that 8-bit processors
usually have a 16-bit or wider address space. But currently llc (the
LLVM-to-AVR translator) is lacking optimizations, generating very
bloated code.

Best regards,

Marko
Georg-Johann Lay
2017-01-07 12:05:44 UTC
Post by Marko Mäkelä
Hello Johann,
Post by Georg-Johann Lay
When you need optimizations at a level where 2 instructions matter,
then it's very likely you need project-specific start-up code and a
linker description anyway. For example, you might truncate the vector
table after the last vector used by the application.
Good idea, thanks! I did think about the interrupt vector table already,
and that approach would allow me to trim it too.
Post by Georg-Johann Lay
For an easy fix, you can
1) Set up own start-up code omitting .init9
2) Provide own linker description file without input section .init9
3) Or, as a quick fix: 3a) Link with -Wl,--unique=.init9 so that
.init9 becomes an output section, and then 3b) drop it by means
of avr-objcopy --remove-section=.init9 foo.elf
All of these approaches require main in .init8 or earlier.
Right. I had already successfully put main in .init3 before posting.
The quick fix 3) works and shortens the program by 4 words and reduces
stack usage by 2 bytes. The .init9 section is emitted at the end
of the ELF binary and is indeed omitted from the avr-objcopy output.
[snip]
Post by Georg-Johann Lay
gcrt1.S would need yet another #if __GNUC__ >= 7 or so, and because
toolchain distributors are usually not aware of such subtleties, you
will observe complaints of "brain dead misoptimization" à la
CALL main
JMP exit
CALL main
JMP exit
throughout avr forums all over the net if someone bundles avr-gcc with
the new feature together with avr-libc without conditional removal.
Right, so the risk could be greater than the savings.
Post by Georg-Johann Lay
One last note: As you are coming straight from asm programming, you
will have a hard time reading the compiler generated code.
Actually I write C and C++ code for bigger systems for a living.
Ya, but on such systems you won't dive into generated assembly and
propose library changes when you come across a pair of instructions
you don't need in your specific context :-P
Post by Marko Mäkelä
The 8-bit processors are just a hobby, and my ‘first love’ is the 6502,
not the AVR. I was happy to learn that the avr-llvm changes were
recently merged to the upstream repository. The experimental AVR target
for clang generates some code, but it still needs work. I am hoping that
one day clang generates similar code as avr-gcc. Also clang++ works,
which is nice if you watched the CppCon 2016 talks touting zero-overhead
http://youtu.be/zBkNBP00wJE
http://youtu.be/uzF4u9KgUWI
http://youtu.be/D7Sd8A6_fYU
Post by Georg-Johann Lay
Maybe you are shocked enough to jump into contributing to GCC :-)
Not an impossible idea, but I find the idea of LLVM more promising,
Many developers find llvm more attractive than gcc because it is
not GPL and the newer code is "pure doctrine" C++, whereas gcc might
deliberately use macros (for host performance), which many developers
find disgusting. When you are coding a backend, it hardly matters
whether you write XX_FOO (y) or xx.foo (y), what's paramount is that
you are able to express what you want to express and get your job
done w.r.t. target features, target code performance, etc.

As gcc supports way more hosts and targets, in particular in the realm
of embedded, I cannot really tell what the best choice is. But yes, llvm
appears to be much more attractive and appealing these days.
Post by Marko Mäkelä
because it could be easier to add other 8-bit processor targets there.
ymmv
Post by Marko Mäkelä
So far I found the generated code surprisingly good. I feared that GCC
would target a ‘virtual machine’ with 32-bit registers, but that does
GCC targets a target, and the description should match the real hardware
as closely as it can :-)
Post by Marko Mäkelä
not seem to be the case, or there are good peephole optimizations in
place, and my input is so simple. I am using the Debian package gcc-avr
1:4.9.2+Atmel3.5.3-1.
avr-gcc implements some peepholes, but imho peepholes are a last resort
optimization to clean up mess from other passes which didn't perform as
expected.
Post by Marko Mäkelä
My only complaint is that avr-gcc does not allow me to assign a pointer
to the Z register even when my simple program does not need that
register const __flash char* usart_tx_next asm("r28"); // better: r30:31
This is not the Z pointer, R28 is the Y register, which in turn might be
the frame pointer. Even if avr-gcc allowed you to reserve it globally,
you would get non-functional code. Same with reserving Z: __flash will
try to access via Z, and if you take that register away by fixing it,
then the compiler will no longer be able to use Z for its job.

My strong impression is that you are inventing hacks to push the
compiler into generating the exact same sequence as you would write
down as a smart assembler programmer. Don't do it. You will come
up with code that is cluttered up with hard-to-maintain kludges,
or it might even be non-functional (as with globally reserving
registers indispensable to the compiler).
Post by Marko Mäkelä
ISR(USART_TX_vect)
{
  char c;
  if (!usart_tx_next);
  else if ((c = *usart_tx_next++))
    UDR0 = c;
  else
    usart_tx_next = 0;
}
In its current form, this program is generating quite a few push/pop to
preserve the value of the Z register while copying the Y register to it.
ISRs will come with some overhead, which will add a performance drop
that will be noticeable in particular with small ISRs. Part of the
overhead is that R0 and R1 are fixed and the compiler doesn't track their
contents, hence they will be saved / restored; same for SREG.

You could write that ISR and avoid push / pop SREG by using CPSE, but
that needs asm, of course. Cf. PR20296. Everybody familiar with avr
is aware of this, but also aware that it will be quite some work to
come up with optimal code. The general recommendation is to use
assembler if you need specific instruction sequences.

Even if the compiler generated code that's close to optimal, it
would be very hard to force CPSE and block any other comparison
instructions provided respective code exists.
Post by Marko Mäkelä
I got the impression that LLVM is a 16-bit (or wider) virtual machine.
It could be an acceptable design choice, given that 8-bit processors
usually have a 16-bit or wider address space. But currently llc (the
LLVM-to-AVR translator) is lacking optimizations, generating very
bloated code.
Best regards,
Marko
Marko Mäkelä
2017-01-07 16:55:32 UTC
Hello Johann,
Post by Georg-Johann Lay
Post by Marko Mäkelä
Actually I write C and C++ code for bigger systems for a living.
Ya, but on such systems you won't dive into generated assembly and
propose library changes when you come across a pair of instructions you
don't need in your specific context :-P
True. Also, in big software systems consisting of multiple subsystems
maintained by separate organizations, the amount of bloat is at a
completely different level. I would typically not look at the generated
machine code except when it shows up in a profiler.

In a hobby project, I am free to optimize every single bit of
performance. I brought up the call to main, because I think that the 2
wasted stack bytes can be significant when targeting a small unit, such
as the ATtiny2313 with 128 bytes of SRAM. My biggest AVR program so far
was an interrupt-driven interface adapter that uses only 4 bytes of
stack (it'd strictly only need 2, but I did not figure out a trick),
using the remaining 124 bytes of RAM for buffers.
Post by Georg-Johann Lay
Many developers find llvm more attractive than gcc because it is not
GPL and the newer code is "pure doctrine" C++, whereas gcc might
deliberately use macros (for host performance), which many developers
find disgusting.
I do not think that macros are necessarily faster than inline functions.
It may be true for GCC, but with clang the opposite can hold. I recently
rewrote a puzzle solver in C++14, and to my surprise, clang generated
slightly faster code for the C++ than GCC did for the C where I had used
macros. I am hoping to run the code on an AVR some day, just to see how
fast it would do the 64-bit math:

http://www.iki.fi/~msmakela/software/pentomino/
Post by Georg-Johann Lay
When you are coding a backend, it hardly matters whether you write
XX_FOO (y) or xx.foo (y), what's paramount is that you are able to
express what you want to express and get your job done w.r.t. target
features, target code performance, etc.
Very true. As far as I understand, in avr-gcc there are some
difficult-to-change design limitations with regard to what can be
expressed. I have already encountered the inability to preserve a value
in r0 until some code really needs the register, and the inability
to precisely track which registers need to be saved and restored in a
function, i.e. PR20296.

I am not claiming that LLVM is better, but given that it is different,
maybe these particular problems can be solved.
Post by Georg-Johann Lay
Post by Marko Mäkelä
So far I found the generated code surprisingly good. I feared that GCC
would target a ‘virtual machine’ with 32-bit registers, but that does
GCC targets a target, and the description should match the real
hardware as closely as it can :-)
I think that GCC (or any compiler for that matter) targets something
that resides above the bare metal. There are layers of constraints and
assumptions in the form of ABI (mainly calling conventions) and run-time
library. Each of these layers (including a possible operating system)
could also be thought of as a lightweight virtual machine residing above
the previous layer. The bare metal would be the lowest layer.
Post by Georg-Johann Lay
Post by Marko Mäkelä
My only complaint is that avr-gcc does not allow me to assign a
pointer to the Z register even when my simple program does not need
register const __flash char* usart_tx_next asm("r28"); // better: r30:31
This is not the Z pointer, R28 is the Y register, which in turn might be
the frame pointer. Even if avr-gcc allowed you to reserve it globally,
you would get non-functional code. Same with reserving Z.
I did get properly working code for the above with -O3 and -Os, but
admittedly, maybe it would not work in a bigger program where some
function call is not inlined. If I used "r30" instead, the program would
refuse to compile.
Post by Georg-Johann Lay
__flash will try to access via Z, and if you take that register away by
fixing it, then the compiler will no longer be able to use Z for its job.
My very reason for attempting to reserve Z for this pointer was that it
is the only __flash pointer in the program.
Post by Georg-Johann Lay
My strong impression is that you are inventing hacks to push the
compiler into generating the exact same sequence as you would write
down as a smart assembler programmer. Don't do it. You will come up
with code that is cluttered up with hard-to-maintain kludges, or it
might even be non-functional (as with globally reserving registers
indispensable to the compiler).
I admit I am trying to find and push the limits with these experiments.
I would not use these tricks in a program that is intended to be
portable.
Post by Georg-Johann Lay
You could write that ISR and avoid push / pop SREG by using CPSE, but
that needs asm, of course. Cf. PR20296. Everybody familiar with avr
is aware of this, but also aware that it will be quite some work to
come up with optimal code. The general recommendation is to use
assembler if you need specific instruction sequences.
Yes, it seems that small interrupt handlers are indeed better written in
assembler. However, given that avr-gcc does not let me reserve the Z
register pair, I would still have to save and restore Z in the assembler
code so that it can use the LPM instruction.
Post by Georg-Johann Lay
Even if the compiler generated code that's close to optimal, it would
be very hard to force CPSE and block any other comparison instructions
provided respective code exists.
Right. It would be an additional constraint for the compiler to try to
avoid generating instructions that affect SREG. It is possible but
tricky, and in the end some code might end up forcing the SREG to be
saved and restored anyway.

In a small embedded system with at most some kilobytes to hundreds of
kilobytes of instruction space, I think that it might be worthwhile to
compile the whole program at once, instead of linking separately
compiled compilation units together. This would allow additional
whole-program optimizations and warnings, such as detecting possible
stack overflow.
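
One crude way to approximate that with today's tools is a "unity build":
a single translation unit that includes all the sources, so the compiler
sees the whole program at once (a sketch; the file names are invented):

/* whole_program.c -- compile and link only this file */
#include "usart.c"
#include "timer.c"
#include "main_loop.c"

Link-time optimization aims at the same goal without merging the sources,
but the single-translation-unit route is the easiest to reason about.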

On the 6502, where the ALU instructions can work directly with memory
operands and where the stack pointer is only 8 bits, it would be
beneficial to statically allocate RAM locations to the local variables
in non-recursive procedure calls, to save the precious stack address
space. This is only possible with whole-program optimization.

On the AVR, maybe a whole-program optimization could assign the most
commonly used global or static variables to registers.
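
avr-gcc already allows a small piece of that to be done by hand with a
global register variable, along these lines (a sketch; the variable name
and the choice of r2 are assumptions, and every object linked into the
program would have to be compiled to respect the reservation):

#include <stdint.h>

register uint8_t hot_counter asm("r2"); /* kept in r2 for the whole program */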

Marko
