Discussion:
[avr-libc-dev] Interested in 64-bit printf support?
George Spelvin
2016-12-04 22:53:06 UTC
Permalink
The developers of the TICC time interval counter are having
a problem printing 64-bit integers:
https://github.com/TAPR/TICC/blob/master/TO-DO.txt

I went to work figuring out how to do that conversion, and after some
false starts, came up with some quite efficient arbitrary-precision
printing code. It's small, fast, and works for any number of bytes.
(Implementation currently limited to 255 bytes.)

The code, in its current state, is appended; it's been tested extensively
on x86 and more lightly on simulavr. Any comments from experienced AVR
programmers is definitely appreciated. Like the current __ultoa_invert,
it formats the number backward in a caller-provided buffer.

It's (currently) slightly larger, but slightly faster than __ultoa_invert.
Timing in cycles for a simulated atmega1280:

Input genprint __ultoa_invert
0 160
0xff 316 480
0xffff 584 792
0xffffff 1005 1260
0xffffffff 1434 1572
0xffffffffff 2024
48 ones 2626
56 ones 3286
64 ones 4103

I could just pass this over to the TICC project, but is there any interest
in me making the necessary overhauls to vfprintf to incorporate this?

I'd have to change vfprintf to pass around a pointer and length internally
rather than copying the value. The printing code requires the input
in a modifiable little-endian buffer, but that's how varargs works on
AVR anyway. (Adding the hex/octal and negative number handling is quite
simple.)


If I get really sneaky, I could eliminate the output buffer by having
it push the formatted number on the stack and call down to an output
routine that supplies the necessary prefix padding once the length is
known. But that might not be doable in C unless there's a way to force
allocation of a stack frame.


(If anyone cares, I have some other code for 16- and 32-bit numbers based
on the ideas at http://homepage.divms.uiowa.edu/~jones/bcd/decimal.html.
It converts 2 digits at a time, using a 100-byte binary-to-BCD lookup
table, but is a lot larger, extending it to 64 bits is a mess, and doing
without a hardware multiplier bloats it with shift-and-add sequences.)


#if __AVR_ARCH__
typedef _Bool bool;
typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
#define false (bool)0
#define true (bool)1
#else
#include <stdbool.h>
#include <stdint.h>
#endif
/*
* Print an arbitrary buffer of bytes. This uses only byte operations
* and only multiplies by the constant 0x33, so is suitable for an
* 8-bit microcontroller.
*
* The basic operation involves three steps:
* 1) Sum the bytes mod 255 to find the remainder mod 5. Combine
* this with the lsbit to compute the least significant digit.
* 2) Subtract this from the number
* 3) Then do a 1-bit shift and multiply by 0xcc..cccd to accomplish a
* divide by 10.
*
* Because of the simple pattern of that multiplier (it's -0x33333...),
* it can be managed in O(n) time and O(1) additional space.
*/


/* Reduce a byte mod 5. */
static uint8_t __attribute__((const))
mod5(uint8_t x)
{
#if __AVR_ARCH__
uint8_t tmp;
/* Reduce x mod 15. We add the high halves, to get a carry. */
asm( "mov __tmp_reg__,%0"
"\n swap __tmp_reg__"
"\n cbr %0,15"
"\n add %0,__tmp_reg__"
"\n cbr %0,15"
"\n swap %0" /* Swap and add carry bit to low */
"\n adc %0,__zero_reg__" /* End-around carry */
: "+d" (x) :: "r0");
#else
x = (x >> 4) + (x & 0xf); /* 1 <= x <= 30 */
x = (x >> 4) + (x & 0xf); /* 1 <= x <= 15 */
#endif
/* 1 <= x <= 15; now for final reduction mod 5 */
if (x >= 10)
x -= 10; /* 0 <= x <= 9 */
if (x >= 5)
x -= 5; /* 0 <= x <= 4 */
return x;
}

/*
* Convert the arbitrary-precision number in bin[0..len-1] from
* little-endian binary to to decimal. The length can be up to 127 bytes
* (but the algorithm can go forever).
* The digits are written little-endian, so they must be reversed
* for output. A pointer to the end of the formatted number is returned.
* (The caller is responsible for ensuring the output is big enough.)
*/
char *
genprint(char *out, uint8_t *bin, uint8_t len)
{
uint8_t digit;

/* This shouldn't happen, but don't crash */
if (!len)
return out;
/*
* Pre-pass: strip trailing zero bytes, but not the last.
* Also leave "bin" pointing to the end of the input.
*/
bin += len;
while (!*--bin) {
if (--len)
continue;
len++;
break;
}
bin++; /* Leave bin pointing just past end of input */

do {
uint16_t accum;
uint8_t i = len;
bool lsbit = false;

/*
* Pass 1, msb-to-lsb: divide bin[] by 2, and return the
* value mod 10 (the new value mod 5, combined with the
* previous lsbit).
*/
digit = 255; /* Could also be zero */
#if __AVR_ARCH__
/*
* The one implementation hassle is the need for two
* carry bits. One for the shift, and the other for
* the end-around carries modulo 255.
*
* Unfamiliar AVR inline asm feature: given a memory
* constraint, %2 prints X, Y or Z. %E2 and %F2 are the
* low and high registers of the pointer, respectively.
* So if %2 = Z, then %E2 = r30 and %F2 = r31.
*/
asm(""
"\n1: ld __tmp_reg__,-%4"
"\n lsr %1" /* lsbit to carry bit */
"\n ror __tmp_reg__"
"\n st %4,__tmp_reg__"
"\n rol %1" /* carry bit to lsbit */
"\n add %0,__tmp_reg__"
"\n adc %0,__zero_reg__" /* End-around carry */
"\n dec %2"
"\n brne 1b"
"\n.ifnc %A3,%E4"
"\n .error \"GCC constraints not working as expected.\""
"\n.endif"
"\n.ifnc %B3,%F4"
"\n .error \"GCC constraints not working as expected.\""
"\n.endif"
: "+&r" (digit), "+&r" (lsbit), "+&r" (i), "+e" (bin)
: "m" (*bin)
: "r0");
#else
do {
uint8_t byte = *--bin;
uint16_t t = digit;

t += *bin = lsbit << 7 | byte >> 1;
lsbit = byte & 1;
digit = (t >> 8) + (t & 0xff);
} while (--i);
#endif
digit = mod5(digit);
*out++ = digit + digit + lsbit + '0';

/*
* Pass 2, lsb-to-msb: the exciting part!
*
* We have to subtract the digit, then multiply by
* 0xCCCC...CD to achieve a divide by 5. We actually
* multiply by 0x333...33, and negate. To do the negate,
* we don't complement and add 1, but rather subtract
* 1 and then complement. That works because then the
* subtraction can be combined with the accumulation
* for the multiply.
*
* For each byte, we multiply it by 0x33, and add that
* to the "value to be stored in all higher bytes"
* accumulator. The accumulator needs to be 16 bits, but
* we can prevent overflow by adding the msbyte to the
* lsbyte after each iteration.
*
* We can also use this to do the necessary subtracts.
* To subtract 1, we just initialize the accumulator
* to 0xff. To subtract the digit without having a
* separate carry-propagation pass through bin[], we
* further subtract 0x33 times the digit.
* (Since digit <= 4, this fits into 1 byte.)
*/
if (digit > 4)
__builtin_unreachable();
accum = 255 - (uint8_t)(digit * 0x33);

i = len;
do {
accum += 0x33 * *bin;
*bin++ = digit = ~(uint8_t)accum;
accum = (accum >> 8) + (accum & 0xff);
} while (--i);

/* Decrease len as high bytes become zero */
} while (digit || (--bin, --len));
return out;
}

#if __AVR_ARCH__
#define READ_PORT 0x38
#define WRITE_PORT 0x39
#define EXIT_PORT 0x3a

static uint8_t inline __attribute__((always_inline))
inb(uint8_t port)
{
uint8_t x;
asm("in %0,%1" : "=r" (x) : "I" (port));
return x;
}

static void inline __attribute__((always_inline))
outb(uint8_t port, uint8_t x)
{
asm("out %0,%1" :: "I" (port), "r" (x));
}

static void
putch(char c)
{
outb(WRITE_PORT, c);
}

static void
revwrite(char const *p, uint8_t len)
{
while (len--)
putch(*--p);
}

static void
revput(char *buf, uint8_t *bin, uint8_t len)
{
char const *p = genprint(buf, bin, len);
revwrite(p, p-buf);
putch('\n');
}

static void
putst(char const __flash *p)
{
char c;
while ((c = *p++) != 0)
putch(c);
putch('\n');
}

static void __attribute__((noreturn))
my_exit(uint8_t status)
{
outb(EXIT_PORT, status);
__builtin_unreachable();
}

int
main(void)
{
unsigned i;
char buf[21];
union {
unsigned i;
uint8_t b[2];
} icopy;
uint8_t bin[8];

static char const __flash hello[] = "Hello, world!";

putst(hello);

for (i = 0; i < 8; i++) {
int j;
for (j = 0; j < 8; j++)
bin[j] = 0xff;

revput(buf, bin, i+1);
}


for (i = 0; i < 100; i++) {
icopy.i = i;
revput(buf, icopy.b, 2);
}
my_exit(0);
}

#else
static bool __attribute__((pure))
revcmp(char const *a, char const *b, unsigned len)
{
while (len--)
if (*a++ != b[len])
return false;
return true;
}

#include <stdio.h>

static bool
test(uint64_t x)
{
char buf1[21], buf2[21];
int len = sprintf(buf1, "%llu", x);
union {
uint64_t x;
uint8_t buf[8];
} u = { x };
char *p = genprint(buf2, u.buf, 8);

*p = '\0';
//printf("Result: %u -> '%s' vs '%s'\n", x, buf1, buf2);
if (p - buf2 == len && revcmp(buf1, buf2, len))
return true;
printf("Mismatch: %llu (%#llx) -> '%s' vs '%s'\n", x, x, buf1, buf2);
return false;
}

int
main(void)
{
unsigned i;

for (i = 0; i < 10000000; i++) {
//printf("Testing %u\n", i);
if (!test(i))
return 1;
if (!test(~i))
return 1;
if (!test(~(uint64_t)i))
return 1;
}
printf("All succeeded, up to %u\n", i);
return 0;
}

#endif /* !__AVR_ARCH__ */
George Spelvin
2016-12-05 19:26:28 UTC
Permalink
Looking at GCC's assembly output, I was disgusted by what it was doing
(note that I'm still using avr-gcc 4.9.1) and wrote an assembly version.

That let me do some useful space-saving tricks like allocating a different
zero register so that r1 is free for the multiplier.

Again, feel free to kibitz the code; this is literally my first AVR code ever.

I'm not quite clear whether I need to support the avr1 architecture.
I'm assuming not, and make significant use of adiw/sdiw and Z+/-Z.
Is that right?


It's now the same length (0x8c bytes) as the decimal-only part of
__ultoa_invert, and 3/4 the time of my previous C version. Compared to
__ultoa_invert, it's 2/3 of the time for large outputs, and less than
1/2 the time for small.

Updating the previous table (__ultoa_invert numbers reduced to account
for deleting the tests for bases 8 or 16):

Time in clock cycles
Input genprint genprint asm asm, !MUL __ultoa_invert !MUL
0 104 114 156 165
0xff 316 193 227 476 479
0xffff 584 393 479 788 793
0xffffff 1005 705 873 1256 1264
0xffffffff 1434 1045 1310 1568 1578
0xffffffffff 2024 1497 1889
48 ones 2626 1977 2511
56 ones 3286 2513 3207
64 ones 4103 3161 4045

For the !__AVR_HAVE_MUL__ case, the size is 0xa2 (my code) vs. 0x90 bytes.


While the time saving is nice, I realize that space is very critical in
!__AVR_HAVE_MUL__ code, so it might not be worth it. Unless you want
64-bit printing support, in which case this can provide it for very
little additional space.

Note that changing vfprintf() to keep track of a 16-bit pointer rather
than a 32-bit value will reduce register pressure in it, saving a little
bit there, which will partially compensate for the extra %llu code.




#ifndef __tmp_reg__
# define __tmp_reg__ r0
#endif
#ifndef __zero_reg__
# define __zero_reg__ r1
#endif

/* Arguments */
#define out X /* Arrives in r24:r25, but we move it immediately */
#define out_lo r26
#define out_hi r27
#define bin Z /* Arrives in r22:r23, but we move it immediately */
#define bin_lo r30
#define bin_hi r31
#define len r20

/* Local variables */
#define acc_hi r25
#define acc_lo r24
#define digit r23
#define lsbit r22
#define tlen r21 /* Copy of len used for loop counter */

#if __AVR_HAVE_MUL__
#define zero r19 /* Used instead of r1 to free up multiplier */
#define k r18 /* Multiplier 0x33 */
#else
#define zero __zero_reg__
#endif

.text
.global genprint
.type genprint, @function
1: ret
genprint:
tst len /* Handle zero-length gracefully */
breq 1b

#if __AVR_HAVE_MUL__
clr zero
ldi k,0x33
#endif
movw out_lo,r24

/* bin += len, point to msbyte */
movw bin_lo,r22
add bin_lo,len
adc bin_hi,zero

/* Strip trailing (most-significant) zeros from bin */
2: ld __tmp_reg__,-bin
cpse __tmp_reg__,zero
rjmp 3f /* Found a non-zero byte, stop */
dec len
brne 2b
inc len /* But stop at 1 byte, so we print "0" */

3: adiw bin_lo,1

/* The main loop, repeated while len > 0 */
4: clr lsbit
ser digit /* Sum of all bytes, mod 255 */
mov tlen,len

/*
* Pass 1, msb-to-lsb: Finding the input mod 10.
*
* We do two things here: divide by 2 (saving the lsbit), and sum
* the result mod 255. This is then used to compute the result
* mod 5, which combined with the lsbit gives the decimal digit
* we want.
*/
5:
ld __tmp_reg__,-bin
lsr lsbit /* lsbit to carry bit */
ror __tmp_reg__
st bin,__tmp_reg__
rol lsbit /* carry bit to lsbit */
add digit,__tmp_reg__
adc digit,zero /* End-around carry */

dec tlen
brne 5b

/* Reduce digit mod 15 (from 1 <= digit <= 255 to 1 <= digit <= 15) */
mov __tmp_reg__,digit
swap __tmp_reg__
cbr digit,15
add digit,__tmp_reg__ /* Add high halves to get carry bit */
cbr digit,15
swap digit
adc digit,zero /* End-around carry */

/* Reduce digit mod 5 */
cpi digit,10
brlo 6f
subi digit,10
6: cpi digit,5
brlo 7f
subi digit,5
7:
/* Form and store ASCII digit (2*digit + lsbit) */
add lsbit,digit
add lsbit,digit
ori lsbit,'0'
st out+,lsbit

/*
* Pass 2, lsb-to-msb: dividing by 5
*
* Rather than do a general divide by 5, we can subtract the digit
* to produce a multiple of 5, and then do an exact division by
* multiplying by the 2-adic inverse of 5, 0xCCC...CCD.
*
* To get this into an even simpler form, we multiply by
* 0x333...333 and negate. Each byte is multiplied by 0x33 and
* added to an accumulator to be used for each higher byte.
*
* The accumulator has to be 16 bits wide, but after storing
* each output byte, we can fold the msbyte into the lsbyte.
*
* Negating the output can be "complement and add one", but
* we do it as "subtract one and complement", initializing the
* accumulator to 0xff, then complementing before storing.
*
* To subtract the digit without an additional carry propagation
* pass, subtract 0x33 times the digit from the accumulator
* to start. (Since 0 <= digit <= 4, this is very easy.)
*/

/* acc = 255 - (digit * 0x33) */
#if __AVR_HAVE_MUL__
mul digit,k
mov acc_lo,r0
#else
mov acc_lo,digit
swap acc_lo /* Digit < 16, so this is accum <<= 4 */
add acc_lo,digit /* Multiply by 0x11 */
mov r0,acc_lo
add acc_lo,r0
add acc_lo,r0 /* Multiply by 3 */
#endif
com acc_lo
clr acc_hi

/* Here's the actual loop */
mov tlen,len
8: ld r0,bin
/* acc += 0x33 * r0 */
#if __AVR_HAVE_MUL__
mul r0,k
add acc_lo,r0
adc acc_hi,r1
#else
/* Compute 0x11*r0 into digit:lsbit */
mov lsbit,r0
swap lsbit
mov digit,lsbit
andi digit,15 /* Mask off high 4 bits */
eor lsbit,digit /* Mask off low 4 bits */
add lsbit,r0
adc digit,zero

/* Now add it to the accumulator 3 times (there's no faster way) */
add acc_lo,lsbit
adc acc_hi,digit
add acc_lo,lsbit
adc acc_hi,digit
add acc_lo,lsbit
adc acc_hi,digit
#endif
/* Store the complemented accumulator (*bin++ = ~accum) */
mov r0,acc_lo
com r0
st bin+,r0

/* Fold the accumulator: acc = acc_hi + acc_lo */
add acc_lo,acc_hi
clr acc_hi
adc acc_hi,zero

dec tlen
brne 8b

/*
* End of main loop: check if the new msbyte was zero. If so,
* drop it (reduce len by 1), and test for termination.
*/
cpse r0,zero
rjmp 4b
sbiw bin_lo,1
dec len
#if __AVR_HAVE_MUL
brne 4b
clr __zero_reg__
#else
breq 8f
rjmp 4b
8:
#endif

/* Finally, put the return value in the expected place */
movw r24,out_lo
ret
.size genprint, .-genprint
Joerg Wunsch
2016-12-05 22:25:20 UTC
Permalink
Post by George Spelvin
I could just pass this over to the TICC project, but is there any interest
in me making the necessary overhauls to vfprintf to incorporate this?
Basically, yes, having 64-bit integer printing support in vfprintf()
would certainly be cool.

My only concern from avr-libc's point of view would be how to enable
this. So far, we've got three different link-time options for
vfprintf(), "standard", "reduced", and "floating-point". If we add
the 64-bit support to the latter (in the sense of "full-featured"), it
might hit people who are really only intersted in floating-point
support. OTOH, people who are actually only interested in 64-bit
integer formatting would be forced to also take the load for FP
support.

In contrast, if we make it more fine-grained, we'd end up with

. reduced (minimal size) (-lprintf_min)
. standard (no 64-bit integer, no FP) (plain -lc)
. floating-point (no 64-bit integer) (-lprintf_flt)
. 64-bit integer (no floating-point) (-lprintf_64)
. full-featured (FP and 64-bit int) (-lprintf_full)

I think it's possible to have it that way, even though buld-time for
the library will increase.

(Well, it would really get interesting if someone came up with 64-bit
double numbers in the compiler and library then. :-)
--
cheers, Joerg .-.-. --... ...-- -.. . DL8DTL

http://www.sax.de/~joerg/
Never trust an operating system you don't have sources for. ;-)
George Spelvin
2016-12-06 00:36:02 UTC
Permalink
[Re-send to list this time.]

Thanks for the reply! I know this is a minimal-time hobby for you and
I'll try not to be a pest. But some response is very motivating.
Post by Joerg Wunsch
Basically, yes, having 64-bit integer printing support in vfprintf()
would certainly be cool.
Okay, great! I'll start working in that direction. Step 1: check out
avr-libc into git-svn and build it.
Post by Joerg Wunsch
My only concern from avr-libc's point of view would be how to enable
this. So far, we've got three different link-time options for
vfprintf(), "standard", "reduced", and "floating-point". If we add
the 64-bit support to the latter (in the sense of "full-featured"), it
might hit people who are really only intersted in floating-point
support. OTOH, people who are actually only interested in 64-bit
integer formatting would be forced to also take the load for FP
support.
I suggest adding it to "standard". It would add *very* little code.

As I mentioned, the cleanest way to do this is to change the argument
handling in vfprintf() to use a pointer & size (to the argument on the
stack), rather than a copy of the value. That actually shrinks the
code and register pressure inside vfprintf() slightly. Then all you
need to do is add code to parse %ll and learn that it means "length=8".

The only overhead is needing a 20-byte (or 22-byte, for octal) buffer
on the stack, rather than the current 11-byte one. If that's a problem,
I can do a bit of extra work to get around that:

I change the calling convention so that the caller doesn't pass in a
buffer pointer, but a FILE * and some precision+flags. Then the
formatting code *pushes the number on the stack* and calls
vfprintf_helper(FILE *, char *ptr, uint8_t len, uint8_t prec, uint8_t flags)

vfprintf_helper() supplies the necessary prefixes based on the precision
and length, makes the necessary fputc() calls, then the buffer gets
deallocated when it returns.

I can't find a way to tail-call vfprintf_helper() and teach it to do
the deallocation itself (I can add my own asm() block ending in ret,
followed by __builtin_unreachable(), but that would omit the helper's
epilogue to restore spilled registers), but if you really need to
save stack I can do the whole helper in asm.
Joerg Wunsch
2016-12-06 07:20:13 UTC
Permalink
Post by George Spelvin
I suggest adding it to "standard". It would add *very* little code.
That's even better to hear.
Post by George Spelvin
As I mentioned, the cleanest way to do this is to change the argument
handling in vfprintf() to use a pointer & size (to the argument on the
stack), rather than a copy of the value.
I don't mind any changes like that.
Post by George Spelvin
The only overhead is needing a 20-byte (or 22-byte, for octal) buffer
on the stack, rather than the current 11-byte one.
Given that people using printf() usually don't use something as small
as an ATtiny13 (with only 64 bytes of RAM), I think that's not an
issue here. We might document this as one of the differences between
the minimal and the standard version then.

I think, overall, having useable printf() support in avr-libc is one
of the features that make using the AVR a nice job, so making it more
useful is a Good Thing. There are always people who "as a matter of
principle" believe printf() is evil on a small controller, though I
bet, the majority of them would actually produce larger code by
rolling their own formatting and output functions. ;-)
--
cheers, Joerg .-.-. --... ...-- -.. . DL8DTL

http://www.sax.de/~joerg/
Never trust an operating system you don't have sources for. ;-)
Georg-Johann Lay
2016-12-06 11:38:22 UTC
Permalink
Post by George Spelvin
The developers of the TICC time interval counter are having
https://github.com/TAPR/TICC/blob/master/TO-DO.txt
I went to work figuring out how to do that conversion, and after some
false starts, came up with some quite efficient arbitrary-precision
printing code. It's small, fast, and works for any number of bytes.
(Implementation currently limited to 255 bytes.)
The code, in its current state, is appended; it's been tested extensively
on x86 and more lightly on simulavr. Any comments from experienced AVR
programmers is definitely appreciated. Like the current __ultoa_invert,
it formats the number backward in a caller-provided buffer.
It's (currently) slightly larger, but slightly faster than __ultoa_invert.
Input genprint __ultoa_invert
0 160
0xff 316 480
0xffff 584 792
0xffffff 1005 1260
0xffffffff 1434 1572
0xffffffffff 2024
48 ones 2626
56 ones 3286
64 ones 4103
I could just pass this over to the TICC project, but is there any interest
in me making the necessary overhauls to vfprintf to incorporate this?
I'd have to change vfprintf to pass around a pointer and length internally
rather than copying the value. The printing code requires the input
in a modifiable little-endian buffer, but that's how varargs works on
AVR anyway. (Adding the hex/octal and negative number handling is quite
simple.)
It it possible to do it with the same code and not special-casing
different radices? Even if the code runs slower (which is not really
important for output routines) the code size might drop.
Post by George Spelvin
If I get really sneaky, I could eliminate the output buffer by having
it push the formatted number on the stack and call down to an output
routine that supplies the necessary prefix padding once the length is
known. But that might not be doable in C unless there's a way to force
allocation of a stack frame.
(If anyone cares, I have some other code for 16- and 32-bit numbers based
on the ideas at http://homepage.divms.uiowa.edu/~jones/bcd/decimal.html.
It converts 2 digits at a time, using a 100-byte binary-to-BCD lookup
table, but is a lot larger, extending it to 64 bits is a mess, and doing
without a hardware multiplier bloats it with shift-and-add sequences.)
Usually, we are after code-size, not speed; in particular for such
output and conversion routines.
Post by George Spelvin
#if __AVR_ARCH__
In the library you'll have to protect the code against AVR_TINY, i.e.

#ifndef __AVR_TINY__
Post by George Spelvin
typedef _Bool bool;
typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
#define false (bool)0
#define true (bool)1
Dunno how libc's modules are compiled, but they are using stdint.h all
over the place, so stdbool.h and stdint.h should be in order, no?
Post by George Spelvin
/* Reduce a byte mod 5. */
static uint8_t __attribute__((const))
mod5(uint8_t x)
{
#if __AVR_ARCH__
uint8_t tmp;
/* Reduce x mod 15. We add the high halves, to get a carry. */
asm( "mov __tmp_reg__,%0"
"\n swap __tmp_reg__"
"\n cbr %0,15"
"\n add %0,__tmp_reg__"
"\n cbr %0,15"
"\n swap %0" /* Swap and add carry bit to low */
"\n adc %0,__zero_reg__" /* End-around carry */
: "+d" (x) :: "r0");
#else
x = (x >> 4) + (x & 0xf); /* 1 <= x <= 30 */
x = (x >> 4) + (x & 0xf); /* 1 <= x <= 15 */
#endif
/* 1 <= x <= 15; now for final reduction mod 5 */
if (x >= 10)
x -= 10; /* 0 <= x <= 9 */
if (x >= 5)
x -= 5; /* 0 <= x <= 4 */
return x;
}
Again, we can safe code size by slightly slowing things down, e.g.

mod5 (uint8_t x)
{
#if __AVR_ARCH__
asm ("0: $ subi %0,%1 $ brcc 0b $ subi %0,%n1" : "+d" (x) : "n" (35));
asm ("0: $ subi %0,%1 $ brcc 0b $ subi %0,%n1" : "+d" (x) : "n" (5));
return x;
#else
...

The intermediate step via 35 is not essential, it's just a speed-up.
Post by George Spelvin
do {
uint16_t accum;
uint8_t i = len;
bool lsbit = false;
/*
* Pass 1, msb-to-lsb: divide bin[] by 2, and return the
* value mod 10 (the new value mod 5, combined with the
* previous lsbit).
*/
digit = 255; /* Could also be zero */
If 0 is just as fine, use 0. I am getting better code size and less
stack usage (using -mcall-prologues).
Post by George Spelvin
#if __AVR_ARCH__
/*
* The one implementation hassle is the need for two
* carry bits. One for the shift, and the other for
* the end-around carries modulo 255.
*
* Unfamiliar AVR inline asm feature: given a memory
* constraint, %2 prints X, Y or Z. %E2 and %F2 are the
* low and high registers of the pointer, respectively.
* So if %2 = Z, then %E2 = r30 and %F2 = r31.
*/
asm(""
"\n1: ld __tmp_reg__,-%4"
"\n lsr %1" /* lsbit to carry bit */
"\n ror __tmp_reg__"
"\n st %4,__tmp_reg__"
"\n rol %1" /* carry bit to lsbit */
"\n add %0,__tmp_reg__"
"\n adc %0,__zero_reg__" /* End-around carry */
"\n dec %2"
"\n brne 1b"
"\n.ifnc %A3,%E4"
"\n .error \"GCC constraints not working as expected.\""
"\n.endif"
"\n.ifnc %B3,%F4"
"\n .error \"GCC constraints not working as expected.\""
"\n.endif"
: "+&r" (digit), "+&r" (lsbit), "+&r" (i), "+e" (bin)
: "m" (*bin)
: "r0");
I don't quite get this, shouldn't this be something like

asm ("\n1: ld __tmp_reg__,-%a3"
"\n lsr %1" /* lsbit to carry bit */
"\n ror __tmp_reg__"
"\n st %a3,__tmp_reg__"
"\n rol %1" /* carry bit to lsbit */
"\n add %0,__tmp_reg__"
"\n adc %0,__zero_reg__" /* End-around carry */
"\n dec %2"
"\n brne 1b"
: "+r" (digit), "+r" (lsbit), "+r" (i), "+e" (bin)
:
: "memory");

IMO it's not worth the hassle to describe explicitly which portion of
memory gets changed, hence "memory".

Try to avoid "m" constraints in your code, except for making side
effects on memory explicit, i.e. don't use the respective %-Operand in
the asm template as "m" allows much more address modes than any AVR
instruction can chew.
Post by George Spelvin
if (digit > 4)
__builtin_unreachable();
What's the purpose of this test? An optimization hint?
Post by George Spelvin
#if __AVR_ARCH__
#define READ_PORT 0x38
#define WRITE_PORT 0x39
#define EXIT_PORT 0x3a
static uint8_t inline __attribute__((always_inline))
inb(uint8_t port)
{
uint8_t x;
asm("in %0,%1" : "=r" (x) : "I" (port));
return x;
}
static void inline __attribute__((always_inline))
outb(uint8_t port, uint8_t x)
{
asm("out %0,%1" :: "I" (port), "r" (x));
}
static void
putch(char c)
{
outb(WRITE_PORT, c);
}
static void
revwrite(char const *p, uint8_t len)
{
while (len--)
putch(*--p);
}
static void
revput(char *buf, uint8_t *bin, uint8_t len)
{
char const *p = genprint(buf, bin, len);
revwrite(p, p-buf);
putch('\n');
}
static void
putst(char const __flash *p)
{
char c;
while ((c = *p++) != 0)
putch(c);
putch('\n');
}
static void __attribute__((noreturn))
my_exit(uint8_t status)
{
outb(EXIT_PORT, status);
__builtin_unreachable();
}
FYI, avrtest comes with such functions for you, you just have to link
against, say, exit-atmega128.o

https://sourceforge.net/p/winavr/code/HEAD/tree/trunk/avrtest/

avrtest is just a core simulator, i.e. has reduced functionality
compared to other simulators -- but it's /fast/ and well suited for such
algorithmic torture tests.


Johann
George Spelvin
2016-12-06 15:50:29 UTC
Permalink
Thanks for the comments! May of your comments have been obviated by my
later work (particularly the assembler version), but I really appreciate
them anyway.

I should mention that the code you commented on was a proof of concept.
It's something that *could be adapated* to avr-libc, so a couple of your
code style comments are appreciated, but premature.

When writing the first version that you saw, I was trying to keep the asm()
blocks as small as practicable and let gcc handle the parts that could
be expressed well in C.

Then I had a look at the generated asm and was very sad. :-(
Gcc *could not* be convinced to put the output and input pointers in
the X and Z register pairs where they belonged, but kept copying
the values in and out.

So I rewrote the whole thing in asm and produced a massive improvement
in code size.
Post by Georg-Johann Lay
It it possible to do it with the same code and not special-casing
different radices? Even if the code runs slower (which is not really
important for output routines) the code size might drop.
That's certainly possible, but that uses many calls to a general 16/8->8+8
bit division algorithm and is *hugely* slower. Still, I could try coding
it up for a "tiny" version.
Post by Georg-Johann Lay
Usually, we are after code-size, not speed; in particular for such
output and conversion routines.
That part, I understand, although I'm so used to coding the other way
around that it may take me a while to unlearn bad habits. :-)
Thank you very much for pointing out my mistakes in this area!
Post by Georg-Johann Lay
Post by George Spelvin
#if __AVR_ARCH__
In the library you'll have to protect the code against AVR_TINY, i.e.
#ifndef __AVR_TINY__
Thank you! This is one of those "premature" comments. Helpful to me for
the future, but I wasn't trying to fit the library code style *at all* with
this code.
Post by Georg-Johann Lay
Post by George Spelvin
typedef _Bool bool;
typedef unsigned char uint8_t;
typedef unsigned short uint16_t;
#define false (bool)0
#define true (bool)1
Dunno how libc's modules are compiled, but they are using stdint.h all
over the place, so stdbool.h and stdint.h should be in order, no?
At the time, I *only* had avr-gcc and not avr-libc installed, and *no*
#include directives worked (not even the compiler-supplied ones like
those), so I just stubbed them manually as you see. For a "release"
version, you're absolutely right, of course.
Post by Georg-Johann Lay
Again, we can safe code size by slightly slowing things down, e.g.
mod5 (uint8_t x)
{
#if __AVR_ARCH__
asm ("0: $ subi %0,%1 $ brcc 0b $ subi %0,%n1" : "+d" (x) : "n" (35));
asm ("0: $ subi %0,%1 $ brcc 0b $ subi %0,%n1" : "+d" (x) : "n" (5));
return x;
#else
...
The intermediate step via 35 is not essential, it's just a speed-up.
Damn, nice spotting!

When writing that, I though od the "subtract/skip-if-no-carry/add-back"
sequence, but it was the same three instructions as the "compare/skip/suntract"
sequence. I totally missed that the former makes a *loop* possible!

(And the "n" modifier isn't documented either! Damn gcc.)

And good call on "35". Being the geometric average of 5 and 255, it's
obviously the "middle" value, but I did a more careful measurement. I summed
the total number of backward branches for all x from 1..255, and tried
various values of the intermediate threshold. It comes out:

20 1890
25 1686
30 1586
35 1554
40 1554
45 1586
50 1656
55 1686
60 1762
65 1890
Post by Georg-Johann Lay
Post by George Spelvin
digit = 255; /* Could also be zero */
If 0 is just as fine, use 0. I am getting better code size and less
stack usage (using -mcall-prologues).
It's the math geek in me liking the property that 1 <= digit <= 255.
That's preserved by the summation loop (overflow to 256 gets wrapped to
1), so it's "natural" to start the loop with that invariant even though
it doesn't really matter.

As long as it ends up in a caller-saved register, it's in the high registers
and there's no advantage to "clr" over "ldi".

(I wonder if I initialize it to 1, I could reduce to 1..5, which corresponds to
the desired digits 0..4, replacing some reduction with an unconditional
drcrement. Could I fold the offset into the constant '0' or something?
Doesn't look like it would work, unfortunately...)
Post by Georg-Johann Lay
Post by George Spelvin
asm(""
"\n1: ld __tmp_reg__,-%4"
"\n lsr %1" /* lsbit to carry bit */
"\n ror __tmp_reg__"
"\n st %4,__tmp_reg__"
"\n rol %1" /* carry bit to lsbit */
"\n add %0,__tmp_reg__"
"\n adc %0,__zero_reg__" /* End-around carry */
"\n dec %2"
"\n brne 1b"
"\n.ifnc %A3,%E4"
"\n .error \"GCC constraints not working as expected.\""
"\n.endif"
"\n.ifnc %B3,%F4"
"\n .error \"GCC constraints not working as expected.\""
"\n.endif"
: "+&r" (digit), "+&r" (lsbit), "+&r" (i), "+e" (bin)
: "m" (*bin)
: "r0");
I don't quite get this, shouldn't this be something like
asm ("\n1: ld __tmp_reg__,-%a3"
"\n lsr %1" /* lsbit to carry bit */
"\n ror __tmp_reg__"
"\n st %a3,__tmp_reg__"
"\n rol %1" /* carry bit to lsbit */
"\n add %0,__tmp_reg__"
"\n adc %0,__zero_reg__" /* End-around carry */
"\n dec %2"
"\n brne 1b"
: "+r" (digit), "+r" (lsbit), "+r" (i), "+e" (bin)
: "memory");
IMO it's not worth the hassle to describe explicitly which portion of
memory gets changed, hence "memory".
Thank you! I went to all that trouble because the 'a' modifier isn't documented in
https://gcc.gnu.org/viewcvs/gcc/trunk/gcc/config/avr/avr.md?view=markup or
https://gcc.gnu.org/viewcvs/gcc/trunk/gcc/config/avr/avr.c?view=markup and I didn't
know about it! I could go from the X/Y/Z name to the rXX name with %E and %F, but
I didn't know any way to do the opposite! The only way I could get X/Y/Z was to use
an 'm" constraint and then do all that Evil Hackery to try to tell gcc that I
was modifying the pointer.

Your way of expressing it is clearly much better.
Post by Georg-Johann Lay
Try to avoid "m" constraints in your code, except for making side
effects on memory explicit, i.e. don't use the respective %-Operand in
the asm template as "m" allows much more address modes than any AVR
instruction can chew.
Um, really? I know that it allows offset addressing modes as well,
but I thought that it matched exactly what the LD and ST instructions
could chew.

That is, X, -X, X+, Y, -Y, Y+, Y+{0..63}, Z, -Z, Z+, Z+{0..63}.

I know I was abusing it in my code (which wanted X, Y or Z only), but I thought
it *could* be used safely.
Post by Georg-Johann Lay
Post by George Spelvin
if (digit > 4)
__builtin_unreachable();
What's the purpose of this test? An optimization hint?
Exactly. I knew the multiplication was easier because of the restricted
range of the input (in particular, since it's less than 15, you could
implement "<< 4" using a swap instruction) and wanted to tell gcc
about it.
Post by Georg-Johann Lay
FYI, avrtest comes with such functions for you, you just have to link
against, say, exit-atmega128.o
https://sourceforge.net/p/winavr/code/HEAD/tree/trunk/avrtest/
avrtest is just a core simulator, i.e. has reduced functionality
compared to other simulators -- but it's /fast/ and well suited for such
algorithmic torture tests.
Ooh, thank you! I hadn't looked at winavr at all, so I've been doing
everything with simulavr and it's definitely a pain. I'm having to
count line numbers in the trace file.

Although since the existing avr-libc test suite is designed for simulavr,
I may end up sticking with it anyway.
George Spelvin
2016-12-06 17:14:29 UTC
Permalink
Post by Georg-Johann Lay
Again, we can safe code size by slightly slowing things down, e.g.
mod5 (uint8_t x)
{
#if __AVR_ARCH__
asm ("0: $ subi %0,%1 $ brcc 0b $ subi %0,%n1" : "+d" (x) : "n" (35));
asm ("0: $ subi %0,%1 $ brcc 0b $ subi %0,%n1" : "+d" (x) : "n" (5));
return x;
#else
...
The intermediate step via 35 is not essential, it's just a speed-up.
More detailed measurements...

The reduction loop is 3 instructions, and 3 + 3*loops cycles.
My code for reducing mod 15 is 7 instructions and 7 cycles:
mov __tmp_reg__,digit
swap __tmp_reg__
cbr digit,15
add digit,__tmp_reg__ /* Add high halves to get carry bit */
cbr digit,15
swap digit
adc digit,zero /* End-around carry */

So we have three code options:

1) Above code + mod-5 loop: 10 instructions, 17.835 cycles average
2) Mod-35 + mod-5 loops: 6 instructions, 24.282 cycles average
3) Mod-70 + mod-20 + mod-5 loops: 9 instructions, 20.718 cycles average
4) Mod-5 loop only: 3 instructions, 78.600 cycles average (ouch!)

The third option makes very little sense (I just wanted to measure it),
and the fourth is a little dear for my taste, but your suggestion costs
6.45 cycles per output digit, and saves 4 instructions.


Inspired by you, I saved one more instruction rather sneakily.

Rather than

clr lsbit

5: lsr lsbit
adc
rol lsbit
dec tlen
brne 5b

add lsbit,digit
add lsbit,digit
add lsbit,'0'
st out+,lsbit


I started with "ldi lsbit,'0'" and deleted the final add.
All the intermediate fiddling doesn't modify the high 7 bits of
the "lsbit" register, so I can load it right up front.

(I should think about renaming those variables.)
George Spelvin
2016-12-06 21:09:24 UTC
Permalink
Perhaps the two different reduction-mod-5 schemes should depend on
OPTIMIZE_SPEED?

Speaking of optimization, there are a significant number of places in
libc/string (and libx/pmstring) where adding one instruction could save
one cycle per byte.

Most of the loops end with

subi len_lo, lo8(1)
sbci len_hi, hi8(1)
brcc .L_loop

The first thing that leaps to mind is that r25:r24 is free (since the
first argument is a pointer which gets moved to X or Z), so adding one
movw to the top would let the loop be counted with sbiw.

But although that saves an instruction, sbiw takes 2 cycles, so we've
gained nothing in the loop and lost one cycle to the movw.


However, there's a second trick, and this one does work. When the amount
subtracted fits into one byte, you can write:

subi len_lo, lo8(1)
brcc .L_loop
sbci len_hi, 0
brcc .L_loop

That's one instruction more, but 255/256 times, the loop will be one
cycle shorter. (1/256 times, including the last time, the loop will be
one cycle *longer*.)

What's nice about this is that it's easy to define a macro (O_brcc?) which
expands to nothing if OPTIMIZE_SPEED is off.

Anyway, back to what I'm supposed to be working on. I'm just poking around
the code to familiarize myself with it and noticing low-hanging fruit.
Joerg Wunsch
2016-12-06 22:59:07 UTC
Permalink
Post by George Spelvin
Perhaps the two different reduction-mod-5 schemes should depend on
OPTIMIZE_SPEED?
Doesn't really matter much. Since the library is pre-compiled, you
cannot map it to the user's -Ox compiler option anyway.

As Johann already explained, most AVR users care for saved flash
and RAM more than for saving a few CPU cycles.
--
cheers, Joerg .-.-. --... ...-- -.. . DL8DTL

http://www.sax.de/~joerg/
Never trust an operating system you don't have sources for. ;-)
Georg-Johann Lay
2016-12-08 13:42:33 UTC
Permalink
Post by Joerg Wunsch
Post by George Spelvin
Perhaps the two different reduction-mod-5 schemes should depend on
OPTIMIZE_SPEED?
Doesn't really matter much. Since the library is pre-compiled, you
cannot map it to the user's -Ox compiler option anyway.
As Johann already explained, most AVR users care for saved flash
and RAM more than for saving a few CPU cycles.
Skimming generated code, IMO modules like vfprintf should be
compiled with -mcall-prologues; at least the versions that
supply float support (as many float functions use call-prologues
anyway).

Johann
George Spelvin
2016-12-06 23:26:22 UTC
Permalink
Post by Joerg Wunsch
Post by George Spelvin
Perhaps the two different reduction-mod-5 schemes should depend on
OPTIMIZE_SPEED?
Doesn't really matter much. Since the library is pre-compiled, you
cannot map it to the user's -Ox compiler option anyway.
Er... AFAICT, that option has nothing to do with -O compiler
flags, but is set as part of library compilation.

I assumed it was set based on the amount of flash on the target
processor. 4K flash? Turn off. 128K flash? Turn on. 32K
flash? Not sure.
Post by Joerg Wunsch
As Johann already explained, most AVR users care for saved flash
and RAM more than for saving a few CPU cycles.
RAM, I understand, but flash is either "enough" or "not enough", and if
you have enough, spending a few bytes of it for speed isn't a horrible
thing.

To quote libc/string/memcpy.S:

#if OPTIMIZE_SPEED
; 15 words, (14 + len * 6 - (len & 1)) cycles
(Unrolled code, copies 2 bytes per loop iteration)
#else
; 11 words, (13 + len * 8) cycles
(Rolled code)
#endif
Joerg Wunsch
2016-12-07 16:13:18 UTC
Permalink
Post by George Spelvin
Er... AFAICT, that option has nothing to do with -O compiler
flags, but is set as part of library compilation.
I know. I've never really been happy with that, but it's been in
use for so long, so it won't be changed.
--
cheers, Joerg .-.-. --... ...-- -.. . DL8DTL

http://www.sax.de/~joerg/
Never trust an operating system you don't have sources for. ;-)
Georg-Johann Lay
2016-12-07 21:48:27 UTC
Permalink
Post by George Spelvin
The developers of the TICC time interval counter are having
https://github.com/TAPR/TICC/blob/master/TO-DO.txt
As an aside, attached is an AVR asm code (needs MUL) that
transforms uint64_t to decimal string.

Despite most algorithms, it starts with the highest digit
and need no string reversion, and it doesn't need div or mod.

The show stopper is presumably the code size because of the
multiplication by 10: The value to convert is passed in regs
as of

extern void put64 (uint64_t val, char *buf);

The algo is rather slow because it always iterates over all
digits, i.e. it won't run faster for small numbers.

Have fun!

Code size is ~140 bytes.


Johann
George Spelvin
2016-12-07 23:34:05 UTC
Permalink
Post by Georg-Johann Lay
The algo is rather slow because it always iterates over all
digits, i.e. it won't run faster for small numbers.
Have fun!
Code size is ~140 bytes.
Well, it's bigger (140 > 124), slower, and doesn't handle sizes *other*
than 64 bits, so that's not terribly useful.

I think you could shrink it a bit, replacing these 16 instructions of messy
digit output code (why are you looping incrementing DIGIT2 when you know it
is never more than 1?):

clr DIGIT2
1:
inc DIGIT2
subi DIGIT, 10
brcc 1b

brts 2f
;; T = 0 is the first round. Output the high digit if it's not '0'.
set
subi DIGIT2, 1-'0'
;; Initialize nonZero. We only output digits if we saw a digit != '0'.
mov nonZero, DIGIT2
cpi nonZero, '0'
breq 2f
st X+, DIGIT2
2:
;; Output digits except the highest (except that for 10^19).
subi DIGIT, -10-'0'
or nonZero, DIGIT
;; We only output digits if we saw a digit != '0', i.e. strip leading '0's.
cpi nonZero, '0'
breq 3f
st X+, DIGIT

With these 9 instructions:
cpi DIGIT, 10 ;; First "digit" can be as high as 18
brcs 2f
ldi nonZero, '1' ;; '1' is non-zero, which is perfect
st X+, nonZero
subi DIGIT, 10
2:
or nonZero, DIGIT
breq 3f ;; Don't print leading zeros
subi DIGIT, -10-'0'
st X+, DIGIT
3:

With this, you can also delete the leading clt. It eliminates DIGIT2,
but unfortunately that doesn't save a spill. You also have to adapt
the final "lone zero" printing code to print if nonZero == 0, but that's
the same size.

Also, this is just silly:
dec Count
cpse Count, Zero
rjmp .Loop

"dec" sets the zero flag, so that can just be "dec Count $ brnz .Loop".

And finally, your multiply loop is wasting two instructions:

mul A0,Ten $ mov A0,r0 $ add A0,Cy $ mov Cy,r1 $ adc Cy,Zero
mov __tmp_reg__,A0
mov A0,A1 $ mov A1,A2 $ mov A2,A3 $ mov A3,A4
mov A4,A5 $ mov A5,A6 $ mov A6,A7 $ mov A7,__tmp_reg__

"mov A0,r0" and "mov __tmp_reg__,A0" are cancelling each other out
and should both be deleted (with the "A0 += Cy" adjusted to add to r0,
of course). Just make it:

mul A0,Ten $ mov A0,r0 $ add r0,Cy $ adc r1,Zero $ mov Cy,r1
mov A0,A1 $ mov A1,A2 $ mov A2,A3 $ mov A3,A4
mov A4,A5 $ mov A5,A6 $ mov A6,A7 $ mov A7,r0


That saves 22 bytes, leaving it 6 bytes smaller than mine. Nice to have available!
Georg-Johann Lay
2016-12-08 13:16:17 UTC
Permalink
Post by George Spelvin
Post by Georg-Johann Lay
The algo is rather slow because it always iterates over all
digits, i.e. it won't run faster for small numbers.
Have fun!
Code size is ~140 bytes.
Well, it's bigger (140 > 124), slower, and doesn't handle sizes *other*
than 64 bits, so that's not terribly useful.
I think you could shrink it a bit, replacing these 16 instructions of messy
digit output code (why are you looping incrementing DIGIT2 when you know it
It's a transcript from antique C-code for 32-bit values, which
don't have this nice property. I shouldn't write code to late in
the evening.

And I didn't actually intend nor expect to beat your code, just was
interested in how far it can be pushed...
Post by George Spelvin
clr DIGIT2
inc DIGIT2
subi DIGIT, 10
brcc 1b
brts 2f
;; T = 0 is the first round. Output the high digit if it's not '0'.
set
subi DIGIT2, 1-'0'
;; Initialize nonZero. We only output digits if we saw a digit != '0'.
mov nonZero, DIGIT2
cpi nonZero, '0'
breq 2f
st X+, DIGIT2
;; Output digits except the highest (except that for 10^19).
subi DIGIT, -10-'0'
or nonZero, DIGIT
;; We only output digits if we saw a digit != '0', i.e. strip leading '0's.
cpi nonZero, '0'
breq 3f
st X+, DIGIT
cpi DIGIT, 10 ;; First "digit" can be as high as 18
brcs 2f
ldi nonZero, '1' ;; '1' is non-zero, which is perfect
st X+, nonZero
subi DIGIT, 10
or nonZero, DIGIT
breq 3f ;; Don't print leading zeros
subi DIGIT, -10-'0'
st X+, DIGIT
With this, you can also delete the leading clt. It eliminates DIGIT2,
but unfortunately that doesn't save a spill. You also have to adapt
the final "lone zero" printing code to print if nonZero == 0, but that's
the same size.
dec Count
cpse Count, Zero
rjmp .Loop
"dec" sets the zero flag, so that can just be "dec Count $ brnz .Loop".
mul A0,Ten $ mov A0,r0 $ add A0,Cy $ mov Cy,r1 $ adc Cy,Zero
mov __tmp_reg__,A0
mov A0,A1 $ mov A1,A2 $ mov A2,A3 $ mov A3,A4
mov A4,A5 $ mov A5,A6 $ mov A6,A7 $ mov A7,__tmp_reg__
"mov A0,r0" and "mov __tmp_reg__,A0" are cancelling each other out
Nice spotting!
Post by George Spelvin
and should both be deleted (with the "A0 += Cy" adjusted to add to r0,
mul A0,Ten $ mov A0,r0 $ add r0,Cy $ adc r1,Zero $ mov Cy,r1
mov A0,A1 $ mov A1,A2 $ mov A2,A3 $ mov A3,A4
mov A4,A5 $ mov A5,A6 $ mov A6,A7 $ mov A7,r0
That saves 22 bytes, leaving it 6 bytes smaller than mine. Nice to have available!
The reworked version comes up with 110 bytes (still asserting MUL).

But I like your approach more, as it does not rely on special properties
of the numbers and comes with some nice ideas.

perf-metering with avrtest reveals a run time from ~3100 up to < 4800
ticks; high as expected.

Johann
George Spelvin
2016-12-08 17:20:58 UTC
Permalink
Post by Georg-Johann Lay
The reworked version comes up with 110 bytes (still asserting MUL).
Nicely done.
Post by Georg-Johann Lay
perf-metering with avrtest reveals a run time from ~3100 up to < 4800
ticks; high as expected.
While mine is 3161 cycles worst case (64 ones), or 4045 if !MUL.

So yours is actually not too unreasonable *if* the numbers are
uniformly distributed. With more common distributions which obey
Benford's law, of course, it's pretty awful speed-wise.

I really wish I could find a way to skip the totally unnecessary final
multiplication of 0 * 10, without adding one extra instruction.

One slight speed saving: "Ten" is never overwritten anywhere in the
function. You can load it once in the preamble and leave it.

Or you could get rid of the Ten register entirely, save a spill/fill
(106 bytes!) and "ldi r0,10" in the multiplication loop.

Loading...