yes but >> 8 is so much faster

xigoi · 2026-06-01T20:14:47 1780344887

You don’t divide a float by 256 by shifting it right eight bits; that would yield complete garbage. You subtract 8 from the exponent, then check if you got an underflow.
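
A minimal sketch of that exponent trick, assuming `float` is an IEEE 754 binary32 value. The helper name `div256_by_exponent` is made up for illustration, and the fallback to an ordinary multiply for the awkward cases is one possible choice, not the only one:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: divide a float by 256 by subtracting 8 from its
   biased exponent. Assumes IEEE 754 binary32 layout (1 sign bit,
   8 exponent bits, 23 fraction bits). */
float div256_by_exponent(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);            /* well-defined type pun */

    uint32_t exponent = (bits >> 23) & 0xFFu;  /* biased exponent field */

    /* exponent == 255: infinity or NaN.
       exponent <= 8: zero, a denormal, or a value whose quotient would
       underflow into the denormal range. Use a plain multiply there. */
    if (exponent == 255u || exponent <= 8u)
        return x * (1.0f / 256.0f);

    bits -= 8u << 23;                          /* exponent -= 8 */
    memcpy(&x, &bits, sizeof x);
    return x;
}
```

For example, `div256_by_exponent(256.0f)` gives `1.0f`. The branch is the underflow check mentioned above; whether any of this beats a single multiply instruction is a separate question.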

dheera · 2026-06-01T21:05:57 1780347957

Same point; divide by power of 2 is a fast subtraction operation in float world, while divide by 255 shits all over the whole float

yongjik · 2026-06-01T23:56:05 1780358165

If your input is an arbitrary float, you need to check for denormals (and maybe NaNs). You can do a bitmasking trick to avoid conditional jumps, but I'm skeptical you can do it faster than a SIMD multiply instruction.

StilesCrisis · 2026-06-01T20:06:24 1780344384

It's just multiplication. Floating multiply is extraordinarily fast.

lacedeconstruct · 2026-06-01T20:13:22 1780344802

The difference between 20 cycles and 1 clock cycle in a hot loop is very noticeable

exyi · 2026-06-01T20:49:51 1780346991

It's 3 cycles for float multiplication (and 1 for shift right):

https://uops.info/table.html?search=mulss&cb_lat=on&cb_tp=on...

https://uops.info/table.html?search=shr&cb_lat=on&cb_tp=on&c...

In throughput it's even less of a difference: 2 per cycle vs 3 per cycle.

userbinator · 2026-06-02T03:27:45 1780370865

> It's 3 cycles for float multiplication (and 1 for shift right):

3x faster

> In throughput it's even less of a difference: 2 per cycle vs 3 per cycle.

50% faster

Tuna-Fish · 2026-06-01T20:46:50 1780346810

FP Division by constant is optimized by a compiler into a multiply. Graphics processing typically happens on the GPU these days, and on all recent GPUs FPMUL belongs to the class of lowest-latency operations. That is, there are no other instructions that complete faster.

pixelesque · 2026-06-01T21:17:56 1780348676

Only with things like -ffast-math enabled will compilers do the reciprocal. It can make a fair difference in some cases, but it's often better to selectively use it in code locations you know are acceptable by doing it manually in the code.

mgaunard · 2026-06-01T21:03:25 1780347805

That's only valid to do if the reciprocal is representable exactly.

hansvm · 2026-06-01T22:14:49 1780352089

That's not totally true. It's sufficient to be exactly representable, but you only need the reciprocal rounding error to be small enough to guarantee the multiplication rounding step fixes it across the entire range of numerators. For IEEE754 f16 values, there are 28 such extra values, the positive and negative sides of 1705/x where x is a power of 2 at least as great as 2048.
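
A rough way to see this for single-precision floats is to compare the two forms directly. The function below is a hypothetical test harness, not anything from a compiler; it assumes IEEE 754 `float` arithmetic and a build without `-ffast-math` (which would let the compiler rewrite the division itself):

```c
/* Hypothetical check: counts the integers x in 1..n for which
   x / d and x * (1 / d) give different float results. */
unsigned count_reciprocal_mismatches(float d, unsigned n)
{
    float inv = 1.0f / d;              /* reciprocal, rounded once */
    unsigned mismatches = 0;

    for (unsigned i = 1; i <= n; ++i) {
        float x = (float)i;
        float quotient = x / d;        /* correctly rounded division */
        float product  = x * inv;      /* rounded reciprocal, then rounded product */
        if (quotient != product)
            ++mismatches;
    }
    return mismatches;
}
```

With `d = 256.0f` the reciprocal is exact, so the count is zero over any range that does not underflow. With `d = 255.0f` the count is nonzero: `x = 765` is one case, where the division gives exactly 3 and the multiply lands one ulp higher.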

Sesse__ · 2026-06-01T20:35:09 1780346109

Useful, then, that you can start several vectorized floating-point muls each cycle. (E.g., most modern x86 are 3/0.5 cycles for vmulps. No 20 cycles in sight.)

dist-epoch · 2026-06-01T19:41:20 1780342880

Only in micro-benchmarks.

For real usage, today's CPUs are limited by memory bandwidth.

lacedeconstruct · 2026-06-01T19:51:01 1780343461

What are you talking about? In a hot loop in my software renderer this is like 10x faster:

    // color4_t result = {
    //     .r = (src.r * src.a + dst.r * inv_alpha) * INV_255,
    //     .g = (src.g * src.a + dst.g * inv_alpha) * INV_255,
    //     .b = (src.b * src.a + dst.b * inv_alpha) * INV_255,
    //     .a = src.a + (dst.a * inv_alpha) * INV_255
    // };

    // 1/256 but much faster
    color4_t result = {
        .r = (src.r * src.a + dst.r * inv_alpha) >> 8,
        .g = (src.g * src.a + dst.g * inv_alpha) >> 8,
        .b = (src.b * src.a + dst.b * inv_alpha) >> 8,
        .a = src.a + ((dst.a * inv_alpha) >> 8)
    };
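
Note that `>> 8` divides by 256, not 255, so the two versions do not compute the same thing. A sketch of the difference on a single 8-bit channel, assuming `inv_alpha` means `255 - src.a` (the snippet above does not show its definition) and using hypothetical function names:

```c
#include <stdint.h>

/* round(x / 255) for 0 <= x <= 65025, using only a multiply and a shift.
   This is a well-known integer identity, not code from the renderer. */
uint8_t div255_round(uint32_t x)
{
    return (uint8_t)(((x + 128u) * 257u) >> 16);
}

/* Blend one channel as in the shift version above (divides by 256). */
uint8_t blend_shift(uint8_t src, uint8_t dst, uint8_t alpha)
{
    uint32_t inv_alpha = 255u - alpha;          /* assumed definition */
    uint32_t sum = (uint32_t)src * alpha + (uint32_t)dst * inv_alpha;
    return (uint8_t)(sum >> 8);
}

/* Same blend with an exact, rounded division by 255. */
uint8_t blend_exact(uint8_t src, uint8_t dst, uint8_t alpha)
{
    uint32_t inv_alpha = 255u - alpha;          /* assumed definition */
    uint32_t sum = (uint32_t)src * alpha + (uint32_t)dst * inv_alpha;
    return div255_round(sum);                   /* sum is at most 65025 */
}
```

Fully opaque white over white comes out as 254 with `blend_shift(255, 255, 255)` and as 255 with `blend_exact(255, 255, 255)`, so the shift version darkens slightly on every pass.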

imtringued · 2026-06-02T09:34:19 1780392859

How is this supposed to be 10x faster if all you did was drop one out of three multiplications?

virtualritz · 2026-06-02T08:49:09 1780390149

And both are wrong, since the values would have to be in a linear color space for the compositing math to make sense, but in some non-linear space (e.g. non-linear sRGB) to be useful when mapped to 0..255.

Which happens right after the Porter-Duff Over operator above -- a smoking gun. Which one is it gonna be?

I.e. the display transform is omitted from this and the math involved with the latter makes your whole argument moot.

It can't be expressed well enough with bitshifts to keep your purported 10x speedup anyway (and which I strongly doubt btw).

And lastly: in a software renderer that stuff is usually <0.01% of the compute in the absolute worst case.

P.S.: I'm speaking from 30 years of experience with software rendering in the context of VFX.
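
For reference, the pipeline being described can be written out as follows. This is a sketch assuming straight (non-premultiplied) alpha, an opaque destination, channel values normalized to [0, 1], and the standard piecewise sRGB transfer function as the display transform; the thread does not say which transform the renderer actually uses.

```latex
% Decode: sRGB-encoded value V to linear light C, before compositing
C = \begin{cases}
  V / 12.92 & V \le 0.04045 \\
  \left( \dfrac{V + 0.055}{1.055} \right)^{2.4} & V > 0.04045
\end{cases}

% Porter-Duff "over" on linear values (source s over destination d)
C_{\text{out}} = \alpha_s C_s + (1 - \alpha_s)\, C_d

% Encode: linear result back to an sRGB value, before scaling to 0..255
V_{\text{out}} = \begin{cases}
  12.92\, C_{\text{out}} & C_{\text{out}} \le 0.0031308 \\
  1.055\, C_{\text{out}}^{1/2.4} - 0.055 & C_{\text{out}} > 0.0031308
\end{cases}
```

The power functions in the decode and encode steps are the part that does not reduce to a bit shift.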

Tuna-Fish · 2026-06-01T20:53:44 1780347224

If the latter is 10x faster, the issue is some kind of weird compilation failure for the above version. For one, it only cuts a third of the multiplies.

dist-epoch · 2026-06-01T19:51:38 1780343498

Because you are working in the cache.

Also, you should use SIMD.

lacedeconstruct · 2026-06-01T19:59:03 1780343943

> Also, you should use SIMD.

Ironically, no: clang is better at auto-vectorizing.

spider-mario · 2026-06-02T09:22:12 1780392132

Better than what? And do you use `-mavx2` or do you let it target baseline x86_64 and miss out on 8-float vectors? How do you make sure its autovectorisation is successful?
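
One way to check is to ask clang for vectorization remarks. The sketch below uses a made-up function name and file name (`scale.c`); the `-Rpass` flags are real clang options, and `-mavx2` is the target flag mentioned above:

```c
#include <stddef.h>

/* Hypothetical hot loop: scale n floats by 1/255.
 *
 * Compile with remarks enabled to see whether the loop was vectorized:
 *   clang -O2 -mavx2 -Rpass=loop-vectorize \
 *         -Rpass-missed=loop-vectorize -c scale.c
 *
 * `restrict` promises the compiler that out and in do not overlap,
 * which removes one common obstacle to vectorization. */
void scale_by_inv255(float *restrict out, const float *restrict in, size_t n)
{
    const float inv255 = 1.0f / 255.0f;   /* multiply instead of divide */
    for (size_t i = 0; i < n; ++i)
        out[i] = in[i] * inv255;
}
```

If the loop is vectorized, clang reports a remark naming the vectorization width it chose; if not, the `-Rpass-missed` remark is emitted instead.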