George Spelvin
2016-12-14 06:51:49 UTC
If anyone wants to flex their asm-hacking skills, there's some draft
code here. If not, don't panic; this is primarily notes to myself
that someone else might be interested in.
To finish long long support in printf, I need to print 64-bit numbers
(preferably, arbitrary-precision numbers) in octal, and it seems to
be taking as long to write as the decimal code.
Hex is easy to do byte at a time, but octal is messy because 3 doesn't
divide 8. It's not that it's difficult, per se, or that I need to
stress about efficiency (to a first approximation, nobody uses octal,
so it matters even less than usual).
The problem is that messy translates to large code (boo!), and
it's hard to see how large an alternative will be without writing
it in detail.
The old code just masked and printed the low-order bits, did a full-width
shift by 1 bit, repeated it 3x, and continued as long as the result
wasn't zero. It did that with a 12-instruction helper routine to shift
the buffer by cnt bits:
.L_lsr_4:
ldi cnt, 4
.L_lsr:
lsr v_fifth
ror v_hhi
ror v_hlo
ror v_hi
ror v_lo
dec cnt
brne .L_lsr
; tst
sbiw v_hlo, 0 ; only Z flag is needed
cpc v_lo, rzero
cpc v_hi, rzero
ret
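In C, the old algorithm amounts to the following minimal sketch (the helper name is mine, not from the real source; digits come out least-significant first, in the order the assembly stores them):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sketch of the old fixed-width approach: mask off the low bits, emit
 * that digit, shift right by the digit width (3 for octal, 4 for hex),
 * and stop when the remaining value is zero. */
static int emit_lsfirst(uint64_t v, unsigned shift, char *out)
{
    int n = 0;
    do {
        out[n++] = "0123456789abcdef"[v & ((1u << shift) - 1)];
        v >>= shift;
    } while (v != 0);
    return n;   /* number of digits produced, least-significant first */
}
```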
This was shared between the hex/octal print code and the decimal print
code, so the rest of the hex/octal print code was only 20 instructions
(+4 to branch to it depending on "base" on function entry).
The equivalent for a variable-length memory buffer is rather more awkward,
14 instructions which *aren't* shared with the decimal print code:
.L_lsr_4:
ldi cnt, 4
.L_lsr:
movw ZL, r24
sub merge, merge ; Set to zero and clear carry
mov tlen, len
1: ld r0, -Z
ror r0
st Z, r0
or merge, r0
dec tlen
brne 1b
dec cnt
brne .L_lsr
tst merge
ret
Worse, the inner loop is 9 cycles to shift 1 byte by 1 bit, and the outer
loop is another 5 per bit. Shifting an 8-byte value is 9 * 8 + 5 = 77
cycles per bit.
If I have a 64-bit input to print, then that's 77 * 64 = 4928 cycles
just spent shifting! Given that I can print a *decimal* 64-bit number
in 4103 cycles, that seems Just Wrong.
And even assuming I could keep the rest to 20 instructions, that's a
total of 34. (It's not obvious I can, because while the shift leaves the
lsbyte in r0 ready for printing, the first digit doesn't have a shift,
so it needs to be handled specially.)
After a lot of messing about, I came up with the following O(n) scheme.
It buffers two registers, keeping 8..15 bits of the input buffered at any
given time. Shifting is bit at a time, with the buffer refilled every 8
bits, and a digit printed every 3. (Or 4. The code is not only shared
with hex, but can handle bases 2, 4 or 32 as well.)
A trick I use repeatedly is bit shifting for loop counters. Rather than
initialize a counter to "3" and decrement it each iteration, I initialize
a counter to "7" (a value I already have available for digit-extraction
purposes) and shift it right each iteration.
For the bit buffer, I have 8..15 bits of buffered data, and rather than
have a separate counter, mark the end of the valid data with a 1 bit.
So the data starts out like "1abcdefg hijklmno" and shifts down until
it's "00000001 abcdefgh". At that point, we start the next shift,
but the msbyte becomes zero (with the 1 bit stored in the carry).
That's the signal to fetch another byte.
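A C sketch of the sentinel idea (the name extract_bits is mine; for clarity it keeps the sentinel in a 9-bit int, where the assembly keeps it within 8 bits plus the carry flag):

```c
#include <assert.h>
#include <stdint.h>

/* Demo of the sentinel-bit buffer: instead of a separate bit counter,
 * a 1 bit above the data marks the end of valid bits.  Each step
 * shifts the buffer right; when it becomes zero, the bit that just
 * fell out was the sentinel, which is the signal to refill.  This
 * extracts the bits of a little-endian byte string LSB-first. */
static int extract_bits(const uint8_t *in, int len, int *bits, int max)
{
    unsigned msb = 0x100 | in[0];   /* sentinel above 8 data bits */
    int pos = 1, n = 0;
    while (n < max) {
        unsigned bit = msb & 1;
        msb >>= 1;
        if (msb == 0) {             /* that bit was the sentinel */
            if (pos >= len)
                break;              /* input exhausted */
            msb = 0x100 | in[pos++];/* refill; sentinel is discarded */
            continue;
        }
        bits[n++] = (int)bit;
    }
    return n;                       /* number of data bits delivered */
}
```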
This reduces the loop overhead in both time and code size.
Here's the current draft. Suggestions solicited!
Registers:
len: Length of input (in bytes). Shared code with
the decimal print has stripped off leading zero
bytes.
X: Points to output buffer
Z: Points to input buffer
flags: Input flags distinguishing decimal, octal,
and upper/lower case hex
msb, lsb: Current bit buffer. Most significant set bit of msb
counts bits per byte.
mask: A bitmask the size of the current base's digits (7 or 15)
tmask: A loop counter initialized to "mask" which counts bits per digit
digit: a temporary used to form the ASCII digit
delta: The amount to add to digits past '9' to make
them start at 'a' or 'A'.
.L_octhex:
ldi mask, 7
lsr flags ; bit 0 in carry, bit 1 in register
breq 1f
ldi mask, 15
ldi delta, 'A'-'0'-10
brcc 1f
ldi delta, 'a'-'0'-10
1:
ldi msb, 1
ldi tmask, 1
ld lsb, Z+
rjmp 2f
.L_bitloop: ; Main loop
ror lsb
2: lsr tmask ; Entry point, tmask >>= 1
brne 4f ; if (tmask == 0) {
; Spit out a digit of (lsb & mask)
mov digit, lsb
and digit, mask
cpi digit, 10
brcs 3f
add digit, delta
3: subi digit, -'0' ; add '0' (AVR has no add-immediate)
st X+, digit
mov tmask, mask
4: ; }
lsr msb
brne .L_bitloop ; End of main loop
; Load another byte
; Note: we know that carry is SET
dec len ; Load another byte
breq .L_lastbyte
ld msb, Z+
ror msb ; Shift carry=1 into msbit
rjmp .L_bitloop
.L_lastbyte:
adc msb, tmask ; Pad until end of digit (and clear carry)
inc len
cpse lsb, __zero_reg__
rjmp 4b
movw r24, r26 ; Copy X to return value
ret
This is 35 instructions in its current state. I can probably shave
one off the end by sharing the epilogue with the decimal print code,
and I'm thinking of making the "mask" value be what's passed in from
the caller to select the base, which will simplify the prologue.
The main loop has two conditional branches, either of which could be
the loop end. I could save a cycle (but cost an instruction) in the
main loop by placing both conditions out of line.
The other tricky thing is the "lastbyte" section, invoked when we need
another byte but len == 0. When we get to the end of the input, we need
to pad with zeros until the significant digits have all been printed.
This involves setting msb to some suitable all-zero, and keeping going
until lsb == 0 as well.
We stop when len == 0 and lsb == 0. len is decremented *after* using
the byte, so when len is decremented to 0, the last byte has already
been used and we need to load an msbyte of padding.
There are three ways to check for this, and I *think* I've picked the one
with minimum code size (and fastest among those), but let me analyze them
in detail.
In the first two cases, we can pad with a full byte of 8 zero bits,
then the "lastbyte" code is simply "ror msb" (which sets msb to 0x80
and clears the carry in one instruction, since we know we started with
msb = 0 and carry = 1) and "rjmp .L_bitloop". Oh! Those instructions
already exist at the end of the "load another byte" branch, so maybe
the savings is more than I thought.
The three ways of checking and the additional costs are:
1) Check each iteration of the main bit loop. This is slowest, so only
use it if it has no rivals for smallest. This can be placed after
the "ror lsb", adding three more instructions: branch if not equal,
test len == 0, and branch if equal to epilogue code.
2) Check with each digit printed. This is identical code to the above,
but inside the "if (!tmask)" condition to reduce overhead.
The problem is that we don't get the test of lsb for free any more,
so I think it's four instructions. No! We can do it in three:
cp lsb, __zero_reg__ $ cpc len,__zero_reg__ $ breq epilogue
CPC ANDs its result with the previous zero flag. *If* the zero flag
was set (the only case we care about) then the carry flag is clear,
and the cpc will compute the test we want. If the carry is set,
the cpc will do the wrong thing, but that doesn't matter because
zero is already clear.
3) The way I did it, checking with each byte fetched. We have to
test len == 0 anyway, so we just have to check lsb == 0.
But we need four additional instructions:
- We have to use "adc msb,tmask" to set the msb correctly, so we'll
fall back into the byte-load check after only one more digit, and
- We have to "inc len" so it will decrement again properly.
- And we need two instructions to test lsb == 0
I thought option 3 was smallest because I thought the "ror msb" the
other cases required was the same size as "adc msb,tmask". I didn't
notice the 2-instruction code sharing.
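The flag behaviour behind option 2's three-instruction test can be modelled in C (this just mirrors the AVR cp/cpc semantics as described above; the function name is mine):

```c
#include <assert.h>

/* Flag-level model of
 *     cp  lsb, __zero_reg__
 *     cpc len, __zero_reg__
 *     breq .L_epilogue
 * cp computes lsb - 0, so Z = (lsb == 0), and a borrow never occurs
 * when subtracting zero, so C = 0.  cpc then computes len - 0 - C and
 * ANDs its zero result into the existing Z flag, so after the pair,
 * Z is set exactly when lsb == 0 and len == 0. */
static int breq_taken(unsigned lsb, unsigned len)
{
    int Z = ((lsb & 0xFF) == 0);        /* cp lsb, zero */
    int C = 0;                          /* borrow from lsb - 0: never */
    unsigned r = (len - 0 - (unsigned)C) & 0xFF;  /* cpc len, zero */
    Z = (r == 0) && Z;                  /* cpc ANDs into previous Z */
    return Z;                           /* breq taken iff Z set */
}
```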
So now it seems that option 2 is the smallest. Here it is in 34
instructions:
.L_octhex:
ldi mask, 7
lsr flags ; bit 0 in carry, bit 1 in register
breq 1f
ldi mask, 15
ldi delta, 'A'-'0'-10
brcc 1f
ldi delta, 'a'-'0'-10
1:
ldi msb, 1
ldi tmask, 1
ld lsb, Z+
rjmp 2f
.L_bitloop: ; Main loop
ror lsb
2: lsr tmask ; Entry point, tmask >>= 1
brne 4f ; if (tmask == 0) {
; Check if we're done
cp lsb, __zero_reg__
cpc len, __zero_reg__
breq .L_epilogue
; Spit out a digit of (lsb & mask)
mov digit, lsb
and digit, mask
cpi digit, 10
brcs 3f
add digit, delta
3: subi digit, -'0' ; add '0' (AVR has no add-immediate)
st X+, digit
mov tmask, mask
4: ; }
lsr msb
brne .L_bitloop ; End of main loop
; Load another byte
; Note: we know that msb == 0 and carry is SET
dec len ; Load another byte
breq 5f
ld msb, Z+
5: ror msb ; Shift carry=1 into msbit
rjmp .L_bitloop
.L_epilogue:
movw r24, r26 ; Copy X to return value
ret
This is reasonably small (although there is definitely a size hit
relative to the previous code, sigh) and reasonably efficient. As I said,
the 2-instruction epilogue can be shared with the decimal code, and the
prologue code might be shortened by 3 instructions by choosing the "flags"
values so they can be used directly as "mask". That would cut it to 29.
(Plus two in the common code to branch to this depending on the flags.)
I have to code it up and test it, but I wanted to write down my thoughts
about it first, and that seemed post-worthy.