
What's the performance like (quality- and speed-wise)?


Yesterday I tried Mixtral 8x7B running on the CPU. With an 11th-gen Intel chip and 64 GB of DDR4-3200, I got around 2-4 tokens/second at small context sizes, and it gets progressively slower as the context grows.
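
If anyone wants to reproduce the measurement, here's a minimal sketch against Ollama's local REST API (assuming the default port 11434 and the "mixtral" model tag); the eval_count and eval_duration fields in the response give the decode speed directly.

    import requests

    # Ask the local Ollama server for a completion and read its timing stats.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mixtral",  # assumed model tag
              "prompt": "Explain mixture-of-experts routing in one paragraph.",
              "stream": False},
    ).json()

    # eval_count = tokens generated, eval_duration = nanoseconds spent decoding.
    tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    print(f"{tps:.1f} tokens/s")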

You'd get a much better experience with Apple Silicon and lots of RAM.


Can confirm. My M3 Max gets about 22 t/s, putting the bottleneck between keyboard and chair.


That's a 10x speed increase. What's the secret behind the Apple M3? Faster-clocked RAM? Dedicated AI hardware?


Unified memory and optimizations in llama.cpp (which Ollama wraps).
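
Mostly it's bandwidth: token generation is memory-bound, so the decode-speed ceiling is roughly memory bandwidth divided by the bytes of weights read per token. A back-of-envelope sketch (the active-parameter count, quantization width, and bandwidth figures below are assumptions, not measurements):

    # Decode-speed ceiling ~ memory bandwidth / bytes of weights read per token.
    ACTIVE_PARAMS = 13e9   # assumption: Mixtral 8x7B routes each token through
                           # 2 of 8 experts, so ~13B active parameters per token
    BYTES_PER_PARAM = 0.5  # assumption: ~4-bit quantization

    bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM  # ~6.5 GB read per token

    for name, gbps in [("dual-channel DDR4-3200", 51.2),   # assumed bandwidth
                       ("M3 Max unified memory", 400.0)]:  # assumed bandwidth
        print(f"{name}: ~{gbps * 1e9 / bytes_per_token:.0f} tokens/s ceiling")

That works out to a ceiling of roughly 8 tokens/s on the DDR4 box and roughly 60 on the M3 Max, which lines up with the ~10x gap reported above.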


Is that using the GPU?


It's configurable. There are details in the repo, but llama.cpp makes use of Metal on Apple Silicon.
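
If you want to see the effect, Ollama exposes a num_gpu option (the number of layers to offload, analogous to llama.cpp's -ngl flag). A rough sketch, assuming the default local server and that 99 covers every layer of the model:

    import requests

    # Compare CPU-only decoding vs GPU offload by toggling num_gpu.
    for label, layers in [("cpu only", 0), ("gpu offload", 99)]:
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "mixtral", "prompt": "Say hello.", "stream": False,
                  "options": {"num_gpu": layers}},  # layers offloaded to GPU
        ).json()
        tps = r["eval_count"] / (r["eval_duration"] / 1e9)
        print(f"{label}: {tps:.1f} tokens/s")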


Mistral 7B is serviceable for short contexts. If you have a longer conversation, token generation can start to lag a lot.
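
One way to see the falloff is to pad the prompt and read the decode stats at each size (same assumed local Ollama setup as above, with the "mistral" tag):

    import requests

    # Measure decode speed as the context grows by padding the prompt.
    for n_words in [100, 1000, 4000]:
        prompt = "word " * n_words + "\nSummarize the above in one sentence."
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "mistral", "prompt": prompt, "stream": False},
        ).json()
        tps = r["eval_count"] / (r["eval_duration"] / 1e9)
        print(f"~{n_words} words of context: {tps:.1f} tokens/s")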



