I suspect SIMD would help with the encoding. The lookup table (64 four-byte entries, 256 bytes) is small enough to fit into 8 AVX2 registers, so instead of hashing, you could use direct lookup, which would improve the compression ratio further (a little bit).
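To make the difference concrete, here is a rough scalar sketch in C. `qoi_hash` follows the index formula from the QOI spec; `rgba_t`, `lookup_hashed` and `lookup_direct` are names made up for illustration. It assumes a hypothetical format variant in which the table is no longer filled by hash slot, since the decoder has to maintain the same table as the encoder. A SIMD version would replace the loop with eight 32-byte compares.

```c
#include <stdint.h>
#include <string.h>

#define QOI_INDEX_SIZE 64

/* One RGBA pixel, 4 bytes, no padding. */
typedef struct { uint8_t r, g, b, a; } rgba_t;

/* Slot selection as defined by the QOI spec. */
int qoi_hash(rgba_t p) {
    return (p.r * 3 + p.g * 5 + p.b * 7 + p.a * 11) % QOI_INDEX_SIZE;
}

/* Hashed lookup: inspects a single slot. Returns the slot, or -1 on a miss. */
int lookup_hashed(const rgba_t table[QOI_INDEX_SIZE], rgba_t p) {
    int i = qoi_hash(p);
    return memcmp(&table[i], &p, sizeof p) == 0 ? i : -1;
}

/* Direct lookup (hypothetical variant): compares against all 64 slots. */
int lookup_direct(const rgba_t table[QOI_INDEX_SIZE], rgba_t p) {
    for (int i = 0; i < QOI_INDEX_SIZE; i++) {
        if (memcmp(&table[i], &p, sizeof p) == 0) {
            return i;
        }
    }
    return -1;
}
```

If a pixel sits in any slot other than its hash slot, `lookup_hashed` misses while `lookup_direct` still finds it. That is where the small gain would come from: recently seen pixels would no longer be lost to hash collisions.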
SIMD registers aren’t indexable (at least not on AMD64); the register number has to be known at compile time.
Lanes within each register aren’t indexable either: the insert and extract instructions encode the lane index as an immediate in the instruction itself. There are workarounds for this one, like abusing vpshufb or vpermd, but with that amount of overhead I doubt SIMD would pay off at all.
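For reference, the vpermd workaround for a single register looks roughly like this. It is a minimal sketch, assuming GCC or Clang on x86-64 and a CPU with AVX2; `extract_lane_runtime` is a made-up name, and the eight-entry array stands in for one of the eight registers. Picking the right register out of eight at run time would still need a branch or a blend chain on top of this.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Return table[lane] for a lane index that is only known at run time. */
__attribute__((target("avx2")))
uint32_t extract_lane_runtime(const uint32_t table[8], unsigned lane) {
    assert(lane < 8);
    /* Load the 8 entries into one 256-bit register. */
    __m256i v = _mm256_loadu_si256((const __m256i *)table);
    /* Broadcast the index; vpermd then copies v[lane] into every lane. */
    __m256i idx = _mm256_set1_epi32((int)lane);
    __m256i picked = _mm256_permutevar8x32_epi32(v, idx);
    /* Lane 0 can be moved to a general-purpose register directly. */
    return (uint32_t)_mm_cvtsi128_si32(_mm256_castsi256_si128(picked));
}
```

That is a broadcast, a cross-lane permute and a move for every lookup, before even counting the register selection, which is the overhead in question.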
There are some clever tricks that can be pulled off with the latest instruction sets like AVX-512. The registers are huge and the available instructions are so varied that they can be used in "off label" ways to implement lookup tables and bit-level parsers.
This is neat. I wonder if the author would be willing to write a Kaitai Struct definition for it.
Something else interesting: QOI is 1, 1, 2 letters off from PNG (P to Q, N to O, G to I). I'm quite certain this is an accident but it's interesting nonetheless.