Hello Johann,
Post by Georg-Johann Lay
Post by Marko Mäkelä
Actually I write C and C++ code for bigger systems for a living.
Ya, but on such systems you won't dive into generated assembly and
propose library changes when you come across a pair of instructions you
don't need in your specific context :-P
True. Also, in big software systems consisting of multiple subsystems
maintained by separate organizations, the amount of bloat is at a
completely different level. I would typically not look at the generated
machine code except when it shows up in a profiler.
In a hobby project, I am free to optimize every single bit of
performance. I brought up the call to main, because I think that the 2
wasted stack bytes can be significant when targeting a small unit, such
as the ATtiny2313 with 128 bytes of SRAM. My biggest AVR program so far
was an interrupt-driven interface adapter that uses only 4 bytes of
stack (it'd strictly only need 2, but I did not figure out a trick),
using the remaining 124 bytes of RAM for buffers.
Post by Georg-Johann Lay
Many developers find llvm more attractive than gcc because it is not
GPL and the newer code "pure doctrine" C++, whereas gcc might
deliberately use macros (for host performance) which many developers
find disgusting.
I do not think that macros are necessarily faster than inline functions.
It may be true for GCC, but with clang the opposite can hold. I recently
rewrote a puzzle solver in C++14, and to my surprise, clang generated
slightly faster code for the C++ than GCC did for the C where I had used
macros. I am hoping to run the code on an AVR some day, just to see how
fast it would do the 64-bit math:
http://www.iki.fi/~msmakela/software/pentomino/
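To make the comparison concrete, here is a minimal C sketch of one 64-bit operation written once as a macro and once as an inline function. The names ROTL64_MACRO and rotl64_inline are made up for this example and are not taken from the puzzle solver. I would expect an optimizing compiler to inline the function, so the macro is unlikely to be faster, but that is an assumption to check in the generated code.

```c
#include <stdint.h>

/* Hypothetical example: rotate a 64-bit word left by n bits (0 < n < 64).
   As a macro, x and n are each expanded twice, so arguments with side
   effects (for example i++) would be evaluated twice. */
#define ROTL64_MACRO(x, n) (((x) << (n)) | ((x) >> (64 - (n))))

/* The same operation as an inline function: each argument is evaluated
   exactly once and is converted to the declared parameter type. */
static inline uint64_t rotl64_inline(uint64_t x, unsigned n)
{
    return (x << n) | (x >> (64 - n));
}
```

Both forms give the same result for plain arguments; the difference is in type checking and in how often the arguments are evaluated.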
Post by Georg-Johann Lay
When you are coding a backend, it hardly matters whether you write
XX_FOO (y) or xx.foo (y), what's paramount is that you are able to
express what you want to express and get your job done w.r.t. target
features, target code performance, etc.
Very true. As far as I understand, in avr-gcc there are some
difficult-to-change design limitations with regard to what can be
expressed. I have already encountered the inability to preserve a value
in r0 until some code really needs the register, and the inability
to precisely track which registers need to be saved and restored in a
function, i.e. PR20296.
I am not claiming that LLVM is better, but given that it is different,
maybe these particular problems can be solved.
Post by Georg-Johann Lay
Post by Marko Mäkelä
So far I found the generated code surprisingly good. I feared that GCC
would target a ‘virtual machine’ with 32-bit registers, but that does
GCC targets a target, and the description should match the real
hardware as close as it can :-)
I think that GCC (or any compiler for that matter) targets something
that resides above the bare metal. There are layers of constraints and
assumptions in the form of ABI (mainly calling conventions) and run-time
library. Each of these layers (including a possible operating system)
could also be thought of as a lightweight virtual machine residing above
the previous layer. The bare metal would be the lowest layer.
Post by Georg-Johann Lay
Post by Marko Mäkelä
My only complaint is that avr-gcc does not allow me to assign a
pointer to the Z register even when my simple program does not need
register const __flash char* usart_tx_next asm("r28"); // better: r30:31
This is not the Z pointer, R28 is the Y register, which in turn might be
the frame pointer. Even if avr-gcc allowed to reserve it globally, you
would get non-functional code. Same with reserving Z.
I did get properly working code for the above with -O3 and -Os, but
admittedly, maybe it would not work in a bigger program where some
function call is not inlined. If I used "r30" instead, the program would
refuse to compile.
Post by Georg-Johann Lay
__flash will try to access via Z, and if you take that register away by
fixing it then the compiler will no more be able to use Z for its job.
My very reason for attempting to reserve Z for this pointer was that it
is the only __flash pointer in the program.
Post by Georg-Johann Lay
My strong impression is that you are inventing hacks to push the
compiler into generating the exact same sequence as you would write
down as a smart assembler programmer. Don't do it. You will come up
with code that is cluttered up with hard-to-maintain kludges or it
might even be non-functional (as with globally reserving registers
indispensable to the compiler).
I admit I am trying to find and push the limits with these experiments.
I would not use these tricks in a program that is intended to be
portable.
Post by Georg-Johann Lay
You could write that ISR and avoid push / pop SREG by using CPSE, but
that needs asm, of course. Cf. PR20296. Everybody familiar with avr
is aware of this, but also aware that it will be quite some work to
come up with optimal code. The general recommendation is to use
assembler if you need specific instruction sequences.
Yes, it seems that small interrupt handlers are indeed better written in
assembler. However, given that avr-gcc does not let me reserve the Z
register pair, I would still have to save and restore Z in the assembler
code so that it can use the LPM instruction.
Post by Georg-Johann Lay
Even if the compiler generated code that's close to optimal, it would
be very hard to force CPSE and block any other comparison instructions
provided respective code exists.
Right. It would be an additional constraint for the compiler to try to
avoid generating instructions that affect SREG. It is possible but
tricky, and in the end some code might end up forcing the SREG to be
saved and restored anyway.
In a small embedded system with anywhere from a few kilobytes to a few
hundred kilobytes of instruction space, I think that it might be worthwhile to
compile the whole program at once, instead of linking separately
compiled compilation units together. This would allow additional
whole-program optimizations and warnings, such as detecting possible
stack overflow.
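As a rough illustration of the stack-overflow check, here is a minimal C sketch that computes the worst-case stack depth from a call graph. It assumes the whole program is visible and that there is no recursion. The struct fn_info, the function worst_case_stack and the byte counts in example_table are all made up for this example; real per-function numbers would have to come from the compiler (GCC can report them with -fstack-usage).

```c
#include <stddef.h>

/* Hypothetical description of one function: the stack bytes it uses
   (frame plus return address) and the functions it may call. */
struct fn_info {
    unsigned stack_bytes;
    size_t n_callees;
    const size_t *callees;   /* indices into the function table */
};

/* Worst-case stack depth in bytes when function f is entered.
   Assumes the call graph is acyclic, i.e. there is no recursion. */
static unsigned worst_case_stack(const struct fn_info *table, size_t f)
{
    unsigned deepest = 0;
    for (size_t i = 0; i < table[f].n_callees; i++) {
        unsigned d = worst_case_stack(table, table[f].callees[i]);
        if (d > deepest)
            deepest = d;
    }
    return table[f].stack_bytes + deepest;
}

/* Made-up call graph: main calls uart_send and timer_tick,
   and uart_send calls put_byte. */
static const size_t main_calls[] = { 1, 2 };
static const size_t uart_calls[] = { 3 };
static const struct fn_info example_table[] = {
    { 2, 2, main_calls },  /* 0: main */
    { 4, 1, uart_calls },  /* 1: uart_send */
    { 6, 0, NULL },        /* 2: timer_tick */
    { 3, 0, NULL },        /* 3: put_byte */
};
```

Calling worst_case_stack(example_table, 0) gives the deepest path starting at main. A real check would also have to add the worst case of any interrupt handler on top and compare the total with the RAM left over after static data.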
On the 6502, where the ALU instructions can work directly with memory
operands and where the stack pointer is only 8 bits, it would be
beneficial to statically allocate RAM locations to the local variables
in non-recursive procedure calls, to save the precious stack address
space. This is only possible with whole-program optimization.
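Until a compiler does this automatically, the effect can be approximated by hand in C with static locals. The following is a minimal sketch; the function names checksum_auto and checksum_static are made up for this example. It is only safe under the assumption stated above: the function must not be recursive, and it must not be called from both the main program and an interrupt handler.

```c
#include <stdint.h>

/* Automatic local: acc lives in a register or on the stack. */
static uint8_t checksum_auto(const uint8_t *buf, uint8_t len)
{
    uint8_t acc = 0;
    while (len--)
        acc = (uint8_t)(acc + *buf++);
    return acc;
}

/* Static local: acc gets a fixed RAM address, so it needs no stack
   space. The function is not reentrant because every call shares
   the same acc. */
static uint8_t checksum_static(const uint8_t *buf, uint8_t len)
{
    static uint8_t acc;
    acc = 0;
    while (len--)
        acc = (uint8_t)(acc + *buf++);
    return acc;
}
```

Both functions return the same 8-bit sum. A whole-program optimizer could additionally overlay the static locals of functions that are never active at the same time, which manual static allocation does not do.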
On the AVR, maybe a whole-program optimization could assign the most
commonly used global or static variables to registers.
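The manual counterpart that exists today is a GCC global register variable. Here is a minimal sketch; the names tx_count and note_byte_sent are made up, and the choice of r2 is an assumption based on the avr-libc FAQ, which suggests that r2 through r7 are typically safe for this. All code linked into the program, including libraries, would have to leave that register alone. The host fallback is only there so that the sketch also compiles when not targeting AVR.

```c
#include <stdint.h>

#ifdef __AVR__
/* Global register variable: tx_count is bound to r2 for the whole
   program. Assumption: no other code in the program uses r2. */
register uint8_t tx_count __asm__("r2");
#else
/* Host fallback so the sketch can be compiled and tried off-target. */
static uint8_t tx_count;
#endif

/* Count one transmitted byte and return the new count. */
static uint8_t note_byte_sent(void)
{
    return ++tx_count;
}
```

A whole-program optimizer could make this choice itself, because it would know which registers are free in every function.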
Marko