I don't quite understand how 'retain' and 'release' can be more memory efficient on Apple Silicon than x86.... I can understand how they can be more efficient from a performance standpoint in terms of more efficient reference counting, but I don't understand how that translates to less memory usage which is apparently what's being argued... ?
Unless on x86 some of the 'free's when the ref counts hit 0 were being batched up and deferred, and that doesn't need to happen now?
I don't think retain/release perf has anything to do with memory consumption, but I have seen a bunch of reviews claiming that 8GB is perfectly fine.
This is fascinating to me, because:
(a) every 8GB Mac I've used in the past has been unusably slow
(b) since upgrading my 32GB Hackintosh to Big Sur, my usual 40GB working set is only about 20GB.
(c) My 2015 16GB MBPr with Big Sur is also using about half as much physical memory on the same workload. Swappiness is up a little, but I haven't noticed.
So my guess is that something in Big Sur has dramatically reduced memory consumption and that fix is being commingled with the M1 announce.
Seriously, I'm utterly baffled by all the people claiming that 8 GB isn't enough for the average user.
The only situation I ever ran into where it was a problem was in trying to run multiple VM's at once.
Otherwise it's just a non-issue. Programs often reserve a lot more memory than they actually use (zero hit in performance) so memory stats are misleading, and the OS is really good at swapping memory not touched in a while to the SSD without you noticing.
Yes, sometimes it takes a couple seconds to switch to a tab I haven't touched in Chrome in days because it's got to swap it back in from the SSD. Who cares?
> people claiming that 8 GB isn't enough for the average user
I'm not claiming anything of the sort.
My point is that memory consumption seems to be greatly reduced in Big Sur, and that might make 8GB machines much better to use than before. All of my testing is on Intel machines. It's not exclusively an M1 phenomenon.
I would still recommend 16GB to anyone, and if the extra $200 was a factor, I would recommend that they buy last year's Intel with 16GB of RAM.
Nah, sorry, but you're wrong. I had to upgrade my laptop because I wanted to run Firefox, IntelliJ IDEA and an Android emulator on the same machine. Nothing else. That was not possible on 8 GB of RAM.
So it's not like multiple VMs are needed, and the above scenario is pretty average for a common mobile developer (though still not an average user, I admit).
Second thing is, lots of games require 16 GB RAM. Maybe gamers are still not average users, I don't know.
For me with 16GB in an MBP, there is currently 20.5GB used + swap, and I haven't even started Firefox today, that would add another ~6GB or so.
Usually if I'm running Safari, Firefox and my 4GB Linux VM, that's 16-18GB used up in those. At the moment I have a few other things open, PDF viewer, Word, iTerms, Emacs etc, but nothing huge.
Most of the time this level of usage is ok, but I've had times where I've had to wait 30+ seconds for the UI to respond at all (even the Dock or switching workspaces) and wondered if the system had crashed.
For that reason I'm generally waiting for the next 32GB model before committing, that's assuming I stick with Apple instead of switching back to Linux (which I used for ~20 years before trying the MBP).
> Programs often reserve a lot more memory than they actually use (zero hit in performance) so memory stats are misleading, and the OS is really good at swapping memory not touched in a while to the SSD without you noticing.
The stats are absolutely reliable because no physical memory page is allocated until it is actually used to store something. So allocating a large chunk of unused memory wouldn't show in the (physical) memory usage stat.
I pretty much daily have to do a closing round to not run out of my 24 GiB. That's all web browsers (usually 100-200 tabs), VS Code with some extensions, and 2x4K displays.
But what do you even mean "run out"? This is what I don't get.
If you have multiple browsers with hundreds of tabs, the majority of those tabs are probably swapped out to your SSD already.
With swap files and SSDs, physical memory is less and less relevant, except when you're performing very specific computational tasks that actually require everything to be in memory simultaneously -- things like highly complex video effects rendering.
How do you measure "running out" of your 24 GiB? And what happens when you do "run out"?
As a human, when I have many tabs open, I observe that everything gets really slow. All applications get slow, but especially the browser.
So I put on my engineering hat and pull up Activity Monitor and further observe (a) high memory pressure, (b) high memory consumption attributed to Chrome or Firefox, (c) high levels of swap usage, (d) high levels of disk I/O attributed to kerneltask or nothing, depending on macOS version, which is the swapper task.
I close some tabs. I then observe that the problems go away.
Swap isn't a silver bullet, not even at 3 GB/s. It is slow. I haven't even touched on GPU memory pressure, which spills back into system RAM and puts further pressure on disk swap.
It's the equivalent of having 50 stacks of paper documents & magazines sitting unorganized on your desk and complaining about not having space to work on.
A bigger desk is not the solution to this problem.
If your tabs are swapped out to SSD, your computer feels incredibly _slow_. SSDs are fast, yeah, but multiple orders of magnitude slower than the slowest RAM module.
You can run 4GB if you're fine with having most of your applications swapped out, but the experience will be excruciating.
Physical memory is still as relevant as it was 30 years ago. No offense but if you can't see the problem, you probably have never used a computer with enough RAM to fit everything in memory + have enough spare for file caching.
I don't swap. You can do all your arguments about why I should if you want but yes, there are legit reasons not to and there is such a thing as running out of memory in 2020.
4GB MBA user here, don't have any problems either running Chrome or Firefox with 10-20 tabs and iTerm (Safari does feel much faster than the other two, and my dev environment is on a remote server, though).
iPhones and iPads also have relatively small amounts of RAM compared to Android devices in the same class, so I wonder if Apple is doing something smart with offloading memory to fast SSD storage in a way that isn't noticeable to the user.
This is most probably more linked to Java/Kotlin vs Objective-C/Swift. Want an array of 1000 objects in Java? You'll end up with 1001 allocations and 1000 pointers.
In Swift you can put value types into the heap-backed array directly; in ObjC you can use stack-allocated arrays (since you have all of C), and there are optimizations such as NSNumber using tagged pointers.
> Theoretically Java should be more memory efficient because it makes fewer guarantees and can move memory around.
Java makes a lot of memory guarantees that are hard to make efficient. Specifically, it becomes extremely hard to have a scoped allocation. Escape analysis helps, but Java's GC plus its lack of value types means it's basically never good at memory efficiency. Memory performance can be theoretically good, but efficiency not really. That's just part of the tradeoff it's making. And nearly everything is behind a reference, making everything far larger than it could be.
Compaction helps reduce fragmentation, but it comes at the cost of necessarily doubling the size of everything being compacted. Only temporarily, but those high-water spikes are what kick things to swap, too.
Big difference is that Objective-C is a superset of C. Any Objective-C developer worth his/her salt will drop down to C code when you need to optimize. The object-oriented parts of Objective-C are way slower than Java. But the reason Objective-C programs can still outcompete Java programs is that you have the opportunity to pick hotspots and optimize the hell out of them using C code.
Object-oriented programs in Objective-C are written in a very different fashion from Java programs. Java programs tend to have very fine granularity on their objects. Objective-C programs tend to have interfaces which are bulkier, and larger objects.
That is partly why you can have a high-performance 3D API like Metal written in a language such as Objective-C, which has very slow method dispatch. It works because the granularity of the objects has been designed with that in mind.
For those, Apple's favored approach to memory management (mostly reference counting) absolutely _is_ an advantage over Android's (mostly GC). That's not relevant when comparing an Intel and ARM Mac, tho.
I think the argument they were trying to get to, but totally failed to make, is possibly along these lines:
huge memory bandwidth relative to RAM size + OS-level memory compression => massive reduction in memory pressure for many, many workloads.
macOS has supported memory compression for a while now -- I would hypothesize that the M1 may have massively improved that subsystem in ways that actually do translate into needing less memory on average for a lot of common real-world workloads that amount to "human-timescale multitasking" between large working sets -- e.g. I click into this app with its huge working set, then into another app with a large working set, then back -- with those clicks (application context switches) occurring very, very rarely on a machine timescale.
If the memory compression subsystem can move working sets into and out of compressed memory space insanely quickly with low power usage, then the OS might have gotten very aggressive about using that feature to put not-recently-accessed memory into compressed memory space.
I believe it was being brought up as an example of "Apple has designed their hardware around their software" and then that translates to "Apple's software does well on machines with less memory".
Compared to something like Android, sure, I get that, but compared to Objective-C/Swift on x86 (which I think was being argued -- i.e. against the Intel Macs)?
I guess it makes reference counting in general more efficient; I'm just saying I don't see why that would mean Apple Silicon Macs running Objective-C/Swift code would use less memory than the same code compiled and run on x86.
I'm not necessarily convinced by the posted argument. That being said, I tend to think that people running a bunch of VMs and Electron apps and Docker cause them to use a bunch more RAM than I would consider to be "reasonable", and they've lost sight of how much you can do in a lesser amount of memory. (Typing this from a computer with 8 GB of RAM, which I have repeatedly been told is "below adequate" for development.)
The problem is, by now development practices in many companies effectively force using multiple large containers. I know an x stack could use 4x less memory if I spent considerable time ripping out unnecessary cruft, but few people in the company would agree that it's time well spent, and the home office allowance suffices for a machine with 32-64 GB of RAM (especially in 2020, when I don't really see that much value in laptops for dev work anymore).
I believe the idea was that reference counting is more memory efficient than other forms of garbage collection, such as the copying and mark-and-sweep collectors that commonly make up generational garbage collectors.
Languages like Java also do not yet support stack-allocated value types outside a few primitives like integers, and heap allocations are both slower and less space efficient due to the indirection and memory management.
It is a simple process: everything that you do in a language needs to be mapped onto lower-level instructions.
If the lower-level hardware instruction does not exist, you emulate it with multiple other instructions.
If you add a low-level instruction that maps a very common high-level operation into hardware, you don't need to call 5 to 10 software functions (extremely expensive), each executing lots of opcodes; you can execute a single opcode, which the hardware implements extremely fast.
It is not hard to be better than Microsoft here. From my personal experience and having disassembled lots of their code they always were lazy bastards. They cared 0 about efficiency. Why should they? They had monopolies like Office or Windows giving them over 95% margins. They could just use the money they printed to buy everything instead of competing.
Lisp machines did that (adding opcodes that map to high-level language operations) with the most common Lisp operators. Those machines were extremely expensive, in the hundreds of thousands of dollars, because so few were built. Apple sells at massive scale, hundreds of millions of CPUs per year, making this cheap for them.
> each executing lots of opcodes; you can execute a single opcode, which the hardware implements extremely fast
Typically these language-oriented instructions need to be implemented in microcode on the CPU. Often this does not create a fast system, but it helps keep the compiler simple. Examples are the typical Lisp Machines you've mentioned. With RISC CPUs, OTOH, the idea is to make the CPU instructions more primitive and put more effort into optimizing compilers instead. There were a few attempts to combine a (high-level) language-supporting architecture with the RISC principle, but I personally have never seen such a machine.
Reference counting releases memory as soon as the last reference to it goes away, while GC cleans up memory periodically, which means higher peak memory usage (more than what's actually live at any moment).
That explains an iOS vs Android difference (ARC vs garbage collection), but it doesn't explain the article's (and Gruber's) apparent argument that Apple Silicon machines running native Objective-C/Swift code use less memory than the same apps natively built from Objective-C/Swift code on Intel running the same OS (but different machine code, obviously).
Systems that can reap no-longer-needed objects without walking them can help here. The automatic approach is a copying collector, which is typically how the young generation of a generational garbage collector works. Since a copying collector typically works by following references, it also increases data locality on machines with a small amount of L1 cache.
Garbage Collectors and JITs typically work best with hardware support, as you need to check pointer reads and writes as objects are being moved around or code is being rewritten. A lot of these systems use MMU gymnastics, such as mapping the same memory page into multiple locations with different permissions.
You also have systems where you create the objects knowing that they will be tiny and short-lived with a fixed lifetime, which can be hugely efficient. This is how Apache Bucket brigades work, since they know that other than a few special cases all memory allocated while handling a request will be garbage once a response is returned.
Lots of tiny memory allocations are inefficient no matter what. Making slight deallocation refinements to poorly made software (the reference counting part is not 'hugely inefficient') is focusing on the wrong thing.
Lots of tiny memory allocations are pretty efficient in Java. The VM will already have allocated memory from the kernel, so there's no context switch, and once the tiny objects are no longer referenced, deallocation is a free (0 machine instructions) side effect of garbage collection. Garbage collection isn't free, but it can be cheaper than reference counting millions of objects with explicit, individual allocation and deallocation.
There are lots of problems that aren't being addressed here.
First, Java ends up doing a huge number of heap allocations for what are just stack allocations in other systems languages.
Second, Java might have some heap allocation optimizations, but it's still a huge performance sink to allocate in a tight loop.
Third, reference counting is not slow. Incrementing or decrementing an integer only when a variable isn't moved can be both cheap and rare. Even better, it is deterministic. Garbage collection gets its optimization from doing bulk operations, which is exactly what becomes a problem. Any speed up pales in comparison to the speed advantage of avoiding those allocations all together. Once allocations are not weighing down performance, the lack of pauses and deterministic behavior of reference counting is an even larger advantage.
You can say that memory 'has already been allocated from the kernel' but that is what heap allocators do in any language. Jemalloc maps virtual memory and puts it into pools for sizes and threads.
At the end of the day, taking out excessive allocations is usually a trivial optimization to make, and usually trivial to avoid in the first place. Languages fighting their garbage collector and promising that the next version will be faster and/or lower latency is a cycle that has been going on since before Java was first released. At a certain point I think people should accept that stack allocations and moves of heap allocations take care of the vast majority of scenarios, and actual reference counting in this context is not a problem. Variables with unknown lifetimes should only be needed when communicating with unknown components. Garbage collection, on the other hand, has been a constant problem as soon as there is any necessity for interactivity.
Yep. It’s actually a pointer to the class instance for the object, which is a full object that contains more information than a typical vtable might, but it serves as a “type ID” that the runtime can use to dispatch on.
x86-64 was designed to prevent (or at least discourage) efficient use of tagged pointers, with the higher half/lower half split in the virtual address space. All the excess high-order bits you don't need for actual addressing are required to have the same value, so you effectively only get at most one tag bit.
They’re required to have the same value upon dereference; there are no restrictions before that, as assembly doesn’t care what’s in a register. The bits are appropriately masked off when necessary prior to using the pointer.
Yikes. That's the same shenanigans that got them into trouble with the 68000. Everyone stuffed data into the top 8 bits of pointers because even though the 68000 had 32-bit address registers, it only had a 24-bit address bus and the top 8 bits were don't-cares. Then the 6802x came out with more address lines and...
...and that's basically why x86_64 was specified to require a particular bit pattern in high-order bits - it was to stop applications and OS programmers from writing a bunch of software with tagged pointers which would tie Intel's and AMD's hands when adding address lines. I guess Apple is ok with tying their own hands.
Tagged pointers are an officially accepted thing in ARM -- the relevant feature is called top-byte ignore (TBI). It only applies to the upper 8 bits of a pointer, leaving 56 bits for addressing.
Eh, the jump from 16-bit addressing to 32-bit was a factor of 65,536. The jump from 32-bit to 64-bit is 4,294,967,296x. Throwing away the top 8 bits drops it to an address space "only" 16,777,216 times bigger than 4GB. It seems like there's some headroom for growth in there.
Doesn't this become less and less of an issue the more bits you add to your pointers? With 32 bits you can't even have one memory address per person on Earth; at 64 bits you get about 1.8×10^19 addresses, more than the estimated number of grains of sand on Earth; and at 128 bits you're in the ballpark of one address per atom of every living human combined (I haven't crunched the numbers exactly, this is more to give a flavor of the orders of magnitude we're talking about).
So if you cut off the top 8 bits of a 32-bit pointer and leave yourself with 24 bits, you can't even give a pointer to everyone in Greater Tokyo, but if you cut off the top 8 bits of a 64-bit pointer, you can still hand out roughly nine million pointers to every person on Earth.