
What's the performance like (quality- and speed-wise)?


Yesterday I tried Mixtral 8x7B running on the CPU. With an 11th-gen Intel chip and 64 GB of DDR4-3200, I got around 2-4 tokens/second at small context sizes, and it gets progressively slower as the context grows.
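
If anyone wants to reproduce the measurement, here's a minimal sketch against Ollama's local REST API (assuming the default port 11434 and the "mixtral" model tag); the eval_count and eval_duration fields in the response give the decode speed directly.

    import requests

    # Ask the local Ollama server for a completion and read its timing stats.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "mixtral",  # assumed model tag
              "prompt": "Explain mixture-of-experts routing in one paragraph.",
              "stream": False},
    ).json()

    # eval_count = tokens generated, eval_duration = nanoseconds spent decoding.
    tps = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    print(f"{tps:.1f} tokens/s")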

You'd get a much better experience with Apple Silicon and lots of RAM.


Can confirm. My M3 Max gets about 22 t/s, putting the bottleneck between keyboard and chair.


That's a 10x speed increase. What's the secret behind the Apple M3? Faster-clocked RAM? Dedicated AI hardware?


Unified memory and optimizations in llama.cpp (which Ollama wraps).
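
Mostly it's bandwidth: token generation is memory-bound, so the decode-speed ceiling is roughly memory bandwidth divided by the bytes of weights read per token. A back-of-envelope sketch (the active-parameter count, quantization width, and bandwidth figures below are assumptions, not measurements):

    # Decode-speed ceiling ~ memory bandwidth / bytes of weights read per token.
    ACTIVE_PARAMS = 13e9   # assumption: Mixtral 8x7B routes each token through
                           # 2 of 8 experts, so ~13B active parameters per token
    BYTES_PER_PARAM = 0.5  # assumption: ~4-bit quantization

    bytes_per_token = ACTIVE_PARAMS * BYTES_PER_PARAM  # ~6.5 GB read per token

    for name, gbps in [("dual-channel DDR4-3200", 51.2),   # assumed bandwidth
                       ("M3 Max unified memory", 400.0)]:  # assumed bandwidth
        print(f"{name}: ~{gbps * 1e9 / bytes_per_token:.0f} tokens/s ceiling")

That works out to a ceiling of roughly 8 tokens/s on the DDR4 box and roughly 60 on the M3 Max, which lines up with the ~10x gap reported above.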


Is that using the GPU?


It's configurable. There are details in the repo, but llama.cpp makes use of Metal on Apple Silicon.
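
If you want to see the effect, Ollama exposes a num_gpu option (the number of layers to offload, analogous to llama.cpp's -ngl flag). A rough sketch, assuming the default local server and that 99 covers every layer of the model:

    import requests

    # Compare CPU-only decoding vs GPU offload by toggling num_gpu.
    for label, layers in [("cpu only", 0), ("gpu offload", 99)]:
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "mixtral", "prompt": "Say hello.", "stream": False,
                  "options": {"num_gpu": layers}},  # layers offloaded to GPU
        ).json()
        tps = r["eval_count"] / (r["eval_duration"] / 1e9)
        print(f"{label}: {tps:.1f} tokens/s")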


Mistral 7B is serviceable for short contexts. If you have a longer conversation, token generation can start to lag a lot.
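
One way to see the falloff is to pad the prompt and read the decode stats at each size (same assumed local Ollama setup as above, with the "mistral" tag):

    import requests

    # Measure decode speed as the context grows by padding the prompt.
    for n_words in [100, 1000, 4000]:
        prompt = "word " * n_words + "\nSummarize the above in one sentence."
        r = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": "mistral", "prompt": prompt, "stream": False},
        ).json()
        tps = r["eval_count"] / (r["eval_duration"] / 1e9)
        print(f"~{n_words} words of context: {tps:.1f} tokens/s")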



