George Spelvin
2016-12-14 06:51:49 UTC
If anyone wants to flex their asm-hacking skills, there's some draft
code here. If not, don't panic; this is primarily notes to myself
that someone else might be interested in.
To finish long long support in printf, I need to print 64-bit numbers
(preferably, arbitrary-precision numbers) in octal, and it seems to
be taking as long to write as the decimal code.
Hex is easy to do byte at a time, but octal is messy because 3 doesn't
divide 8. It's not that it's difficult, per se, or that I need to
stress about efficiency (to a first approximation, nobody uses octal,
so it matters even less than usual).
The problem is that messy translates to large code (boo!), and
it's hard to see how large an alternative will be without writing
it in detail.
The old code just masked and printed the low-order bits, did a full-width
shift by 1 bit, repeated it 3x, and continued as long as the result
wasn't zero. It did that with a 12-instruction helper routine to shift
the buffer by cnt bits:
.L_lsr_4:
ldi cnt, 4
.L_lsr:
lsr v_fifth
ror v_hhi
ror v_hlo
ror v_hi
ror v_lo
dec cnt
brne .L_lsr
; tst
sbiw v_hlo, 0 ; only Z flag is needed
cpc v_lo, rzero
cpc v_hi, rzero
ret
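In C, the old algorithm amounts to the following minimal sketch (the helper name is mine, not from the real source; digits come out least-significant first, in the order the assembly stores them):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the old fixed-width approach: mask off the low bits, emit
 * that digit, shift right by the digit width (3 for octal, 4 for hex),
 * and stop when the remaining value is zero. */
static int emit_lsfirst(uint64_t v, unsigned shift, char *out)
{
    int n = 0;
    do {
        out[n++] = "0123456789abcdef"[v & ((1u << shift) - 1)];
        v >>= shift;
    } while (v != 0);
    return n;   /* number of digits produced, least-significant first */
}
```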
This was shared between the hex/octal print code and the decimal print
code, so the rest of the hex/octal print code was only 20 instructions
(+4 to branch to it depending on "base" on function entry).
The equivalent for a variable-length memory buffer is rather more awkward,
14 instructions which *aren't* shared with the decimal print code:
.L_lsr_4:
ldi cnt, 4
.L_lsr:
movw ZL, r24
sub merge, merge ; Set to zero and clear carry
mov tlen, len
1: ld r0, -Z
ror r0
st Z, r0
or merge, r0
dec tlen
brne 1b
dec cnt
brne .L_lsr
tst merge
ret
Worse, the inner loop is 9 cycles to shift 1 byte by 1 bit, and the outer
loop is another 5 per bit. Shifting an 8-byte value is 9 * 8 + 5 = 77
cycles per bit.
If I have a 64-bit input to print, then that's 77 * 64 = 4928 cycles
just spent shifting! Given that I can print a *decimal* 64-bit number
in 4103 cycles, that seems Just Wrong.
And even assuming I could keep the rest to 20 instructions, that's a
total of 34. (It's not obvious I can, because while the shift leaves the
lsbyte in r0 ready for printing, the first digit doesn't have a shift,
so it needs to be handled specially.)
After a lot of messing about, I came up with the following O(n) scheme.
It buffers two registers, keeping 8..15 bits of the input buffered at any
given time. Shifting is bit at a time, with the buffer refilled every 8
bits, and a digit printed every 3. (Or 4. The code is not only shared
with hex, but can handle bases 2, 4 or 32 as well.)
A trick I use repeatedly is bit shifting for loop counters. Rather than
initialize a counter to "3" and decrement it each iteration, I initialize
a counter to "7" (a value I already have available for digit-extraction
purposes) and shift it right each iteration.
For the bit buffer, I have 8..15 bits of buffered data, and rather than
have a separate counter, mark the end of the valid data with a 1 bit.
So the data starts out like "1abcdefg hijklmno" and shifts down until
it's "00000001 abcdefgh". At that point, we start the next shift,
but the msbyte becomes zero (with the 1 bit stored in the carry).
That's the signal to fetch another byte.
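A C sketch of the sentinel idea (the name extract_bits is mine; for clarity it keeps the sentinel in a 9-bit int, where the assembly keeps it within 8 bits plus the carry flag):

```c
#include <assert.h>
#include <stdint.h>

/* Demo of the sentinel-bit buffer: instead of a separate bit counter,
 * a 1 bit above the data marks the end of valid bits.  Each step
 * shifts the buffer right; when it becomes zero, the bit that just
 * fell out was the sentinel, which is the signal to refill.  This
 * extracts the bits of a little-endian byte string LSB-first. */
static int extract_bits(const uint8_t *in, int len, int *bits, int max)
{
    unsigned msb = 0x100 | in[0];   /* sentinel above 8 data bits */
    int pos = 1, n = 0;
    while (n < max) {
        unsigned bit = msb & 1;
        msb >>= 1;
        if (msb == 0) {             /* that bit was the sentinel */
            if (pos >= len)
                break;              /* input exhausted */
            msb = 0x100 | in[pos++];/* refill; sentinel is discarded */
            continue;
        }
        bits[n++] = (int)bit;
    }
    return n;                       /* number of data bits delivered */
}
```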
This reduces the loop overhead in both time and code size.
Here's the current draft. Suggestions solicited!
Registers:
len: Length of input (in bytes). Shared code with
the decimal print has stripped off leading zero
bytes.
X: Points to output buffer
Z: Points to input buffer
flags: Input flags distinguishing decimal, octal,
and upper/lower case hex
msb, lsb: Current bit buffer. Most significant set bit of msb
counts bits per byte.
mask: A bitmask the size of the current base's digits (7 or 15)
tmask: A loop counter initialized to "mask" which counts bits per digit
digit: a temporary used to form the ASCII digit
delta: The amount to add to digits past '9' to make
them start at 'a' or 'A'.
.L_octhex:
ldi mask, 7
lsr flags ; bit 0 in carry, bit 1 in register
breq 1f
ldi mask, 15
ldi delta, 'A'-'0'-10
brcc 1f
ldi delta, 'a'-'0'-10
1:
ldi msb, 1
ldi tmask, 1
ld lsb, Z+
rjmp 2f
.L_bitloop: ; Main loop
ror lsb
2: lsr tmask ; Entry point, tmask >>= 1
brne 4f ; if (tmask == 0) {
; Spit out a digit of (lsb & mask)
mov digit, lsb
and digit, mask
cpi digit, 10
brcs 3f
add digit, delta
3: subi digit, -'0' ; add '0' (AVR has no add-immediate)
st X+, digit
mov tmask, mask
4: ; }
lsr msb
brne .L_bitloop ; End of main loop
; Load another byte
; Note: we know that carry is SET
dec len ; Load another byte
breq .L_lastbyte
ld msb, Z+
ror msb ; Shift carry=1 into msbit
rjmp .L_bitloop
.L_lastbyte:
adc msb, tmask ; Pad until end of digit (and clear carry)
inc len
cpse lsb, __zero_reg__
rjmp 4b
movw r24, r26 ; Copy X to return value
ret
This is 35 instructions in its current state. I can probably shave
one off the end by sharing the epilogue with the decimal print code,
and I'm thinking of making the "mask" value be what's passed in from
the caller to select the base, which will simplify the prologue.
The main loop has two conditional branches, either of which could be
the loop end. I could save a cycle (but cost an instruction) in the
main loop by placing both conditions out of line.
The other tricky thing is the "lastbyte" section, invoked when we need
another byte but len == 0. When we get to the end of the input, we need
to pad with zeros until the significant digits have all been printed.
This involves setting msb to some suitable all-zero, and keeping going
until lsb == 0 as well.
We stop when len == 0 and lsb == 0. len is decremented *after* using
the byte, so when len is decremented to 0, the last byte has already
been used and we need to load an msbyte of padding.
There are three ways to check for this, and I *think* I've picked the one
with minimum code size (and fastest among those), but let me analyze them
in detail.
In the first two cases, we can pad with a full byte of 8 zero bits,
then the "lastbyte" code is simply "ror msb" (which sets msb to 0x80
and clears the carry in one instruction, since we know we started with
msb = 0 and carry = 1) and "rjmp .L_bitloop". Oh! Those instructions
already exist at the end of the "load another byte" branch, so maybe
the savings is more than I thought.
The three ways of checking and the additional costs are:
1) Check each iteration of the main bit loop. This is slowest, so only
use it if it has no rivals for smallest. This can be placed after
the "ror lsb", adding three more instructions: branch if not equal,
test len == 0, and branch if equal to epilogue code.
2) Check with each digit printed. This is identical code to the above,
but inside the "if (!tmask)" condition to reduce overhead.
The problem is that we don't get the test of lsb for free any more,
so I think it's four instructions. No! We can do it in three:
cp lsb, __zero_reg__ $ cpc len,__zero_reg__ $ breq epilogue
CPC ANDs its result with the previous zero flag. *If* the zero flag
was set (the only case we care about) then the carry flag is clear,
and the cpc will compute the test we want. If the carry is set,
the cpc will do the wrong thing, but that doesn't matter because
zero is already clear.
3) The way I did it, checking with each byte fetched. We have to
test len == 0 anyway, so we just have to check lsb == 0.
But we need four additional instructions:
- We have to use "adc msb,tmask" to set the msb correctly, so we'll
fall back into the byte-load check after only one more digit, and
- We have to "inc len" so it will decrement again properly.
- And we need two instructions to test lsb == 0
I thought option 3 was smallest because I thought the "ror msb" the
other cases required was the same size as "adc msb,tmask". I didn't
notice the 2-instruction code sharing.
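The flag behaviour behind option 2's three-instruction test can be modelled in C (this just mirrors the AVR cp/cpc semantics as described above; the function name is mine):

```c
#include <assert.h>

/* Flag-level model of
 *     cp  lsb, __zero_reg__
 *     cpc len, __zero_reg__
 *     breq .L_epilogue
 * cp computes lsb - 0, so Z = (lsb == 0), and a borrow never occurs
 * when subtracting zero, so C = 0.  cpc then computes len - 0 - C and
 * ANDs its zero result into the existing Z flag, so after the pair,
 * Z is set exactly when lsb == 0 and len == 0. */
static int breq_taken(unsigned lsb, unsigned len)
{
    int Z = ((lsb & 0xFF) == 0);        /* cp lsb, zero */
    int C = 0;                          /* borrow from lsb - 0: never */
    unsigned r = (len - 0 - (unsigned)C) & 0xFF;  /* cpc len, zero */
    Z = (r == 0) && Z;                  /* cpc ANDs into previous Z */
    return Z;                           /* breq taken iff Z set */
}
```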
So now it seems that option 2 is the smallest. Here it is in 34
instructions:
.L_octhex:
ldi mask, 7
lsr flags ; bit 0 in carry, bit 1 in register
breq 1f
ldi mask, 15
ldi delta, 'A'-'0'-10
brcc 1f
ldi delta, 'a'-'0'-10
1:
ldi msb, 1
ldi tmask, 1
ld lsb, Z+
rjmp 2f
.L_bitloop: ; Main loop
ror lsb
2: lsr tmask ; Entry point, tmask >>= 1
brne 4f ; if (tmask == 0) {
; Check if we're done
cp lsb, __zero_reg__
cpc len, __zero_reg__
breq .L_epilogue
; Spit out a digit of (lsb & mask)
mov digit, lsb
and digit, mask
cpi digit, 10
brcs 3f
add digit, delta
3: subi digit, -'0' ; add '0' (AVR has no add-immediate)
st X+, digit
mov tmask, mask
4: ; }
lsr msb
brne .L_bitloop ; End of main loop
; Load another byte
; Note: we know that msb == 0 and carry is SET
dec len ; Load another byte
breq 5f
ld msb, Z+
5: ror msb ; Shift carry=1 into msbit
rjmp .L_bitloop
.L_epilogue:
movw r24, r26 ; Copy X to return value
ret
This is reasonably small (although there is definitely a size hit
relative to the previous code, sigh) and reasonably efficient. As I said,
the 2-instruction epilogue can be shared with the decimal code, and the
prologue code might be shortened by 3 instructions by choosing the "flags"
values so they can be used directly as "mask". That would cut it to 29.
(Plus two in the common code to branch to this depending on the flags.)
I have to code it up and test it, but I wanted to write down my thoughts
about it first, and that seemed post-worthy.