Ask HN: How does a CPU communicate with a GPU?
148 points by pedrolins on March 30, 2022 | hide | past | favorite | 98 comments
I've been learning about computer architecture [1] and I've become comfortable with my understanding of how a processor communicates with main memory - be it directly, with the presence of caches or even virtual memory - and I/O peripherals.

But something that seems weirdly absent from the courses I took and what I have found online is how the CPU communicates with other processing units, such as GPUs - not only that, but an in-depth description of interconnecting different systems with buses (by in-depth I mean an RTL example/description).

I understand that as you add more hardware to a machine, complexity increases and software must intervene - so a fully general answer won't exist and the answer will depend on the implementation being talked about. That's fine by me.

What I'm looking for is a description of how a CPU tells a GPU to start executing a program. Through what means do they communicate - a bus? What does such a communication instance look like?

I'd love to get pointers to resources such as books and lectures that are more hands-on/implementation-aware.

[1] Just so that my background knowledge is clear: I've completed NAND2TETRIS, watched and completed Berkeley's 2020 CS61C, have read a good chunk of H&P (both Computer Architecture: A Quantitative Approach and Computer Organization and Design: RISC-V edition), and am now moving on to Onur Mutlu's lectures on advanced computer architecture.



Typically CPU and GPU communicate over the PCI Express bus. (It’s not technically a bus but a point to point connection.) From the perspective of software running on the CPU, these days, that communication is typically in the form of memory-mapped IO. The GPU has registers and memory mapped into the CPU address space using PCIE. A write to a particular address generates a message on the PCIE bus that’s received by the GPU and produces a write to a GPU register or GPU memory.

The GPU also has access to system memory through the PCIE bus. Typically, the CPU will construct buffers in memory with data (textures, vertices), commands, and GPU code. It will then store the buffer address in a GPU register and ring some sort of “doorbell” by writing to another GPU register. The GPU (specifically, the GPU command processor) will then read the buffers from system memory and start executing the commands. Those commands can include, for example, loading GPU shader programs into shader memory and triggering the shader cores to execute them.


If OP or anyone else wants to see this firsthand.. well shit, I feel old now, but.. try an exercise in assembly programming on the Commodore 64. Get the VICE emulator and dig into it for a few weeks. It's real easy to get into: CPU (6502-based), video chip (VIC-II), sound chip (the famous SID), ROM chips.. they all live in this address space (yeah, not mentioning pages), the CPU has three registers.. it's also real fun to get into, even to this day.


There's a nice guide by Ben Eater on Youtube about breadboard computers: https://www.youtube.com/playlist?list=PLowKtXNTBypFbtuVMUVXN...

It doesn't sport any modern features like DMA, but builds up from the core basics: a 6502 chip, a clock, and a blinking LED, all hooked up on a breadboard. He also built a basic VGA card and explains protocols like PS/2, USB, and SPI. It's a great introduction or refresher into the low level hardware concepts behind computers. You can even buy kits to play along at home!


I was going to say, watching Ben Eater and building his 6502 machine myself filled in so many gaps in knowledge or fuzzy areas around basic computer hardware. If OP sees this he should definitely consider building some of these after Nand2Tetris.


This! Except I (slightly redesigned and) built a common Z80 setup - no comparison for the knowledge gained from this single effort.


Is my understanding correct that compared to those historical architectures, modern GPUs are a lot more asynchronous?

What I mean is that these days you'd issue a data transfer or program execution on the GPU, it will complete at its own pace, and the CPU meanwhile continues executing other code; in contrast, on those 8-bitters you'd poke a video register or whatever and expect it to have a more immediate effect, allowing those famous race-the-beam effects, etc.?


It depends.

First, in modern systems you usually don't have direct access to the GPU: you make a syscall and then the magic happens, something like that. And these systems are all multilayered architectures - it's now normal for a systrace dump of some exception to list a few hundred functions; imagine how slow that is.

This is because all modern systems, except consoles, are direct descendants of business minicomputers, where the most important thing was that they could simultaneously run a wallet and a word processor (or spreadsheet, or email, etc.) without these programs seeing each other's memory, so an error in the word processor could not lose money from the wallet.

Second, modern systems behave like NUMA architectures - some people even call them distributed architectures - meaning that only userspace CPU computation is fast.

In many cases you have a choice of doing the magic synchronously or asynchronously, but either way syscalls are extremely slow - in some cases a few orders of magnitude slower than userspace CPU computation.

Sometimes a message-passing architecture is a lifesaver: you send a message to the GPU and can immediately get on with your own tasks without any waits at all (the kernel periodically checks the message mailboxes of all processes, and when your turn comes it reads your message, does the work, and writes the answer).

But message passing is now rare; as far as I know the dominant paradigm is the direct syscall, where async means the kernel releases your process right after receiving the syscall, and sync means your process is released only after the syscall has finished processing the data.

In 8-bit systems, CPU frequencies were very low - sometimes a fraction of the bus speed - and there were very few layers; a user-level program could basically access the hardware directly. So even though their behavior was formally synchronous, in practice they were very fast in most cases, apart from understandable waits, like when the Atari ANTIC accesses video RAM while you are also trying to.

Reliability issues in 8-bit systems and consoles were solved in very simple ways: first, most could run only one program at a time; and second, console software was extremely reliable (and expensive) - much more reliable than business software, sometimes exceeding the reliability of military software or mainframe system software.


There were interrupts telling you when certain things happened. If anything, it was asynchronous. The big thing is also that you had to tally the cost of whatever you were doing. There was a budget of how many cycles you got per line, per screen, and then you fit whatever you had to into that. When playing sound it was common to change the border color while feeding the music into the SID so you could tell - like a crude debug/ad hoc printf - how many cycles your music routine ate.


Nice exercise. Similarly, I learned the most about basic computer architecture by programming the 8050 in ASM as well as C.

And I'm 32. Am I old yet? I'm not right? Right?


Sorry pal!

I remember playing Halo in my early 20's, and chatting with a guy from LA who was 34. Wow, he's so old, why was he still playing video games.

Here I sit in my late 30's...still playing games when I have time, denying that I'm old, despite the noises I make getting up and random aches and pains.


OK. I am definitely old...

- I have mostly no joy playing video games anymore these days.

- I found myself refusing some contemporary stuff in favour of stuff "I am used to".

Damn! I was always looking at my dad, not understanding those very same things :) Now I am a dad myself...


Haha same. I think the problem is that it seems to hit you all at once. I was pretty open minded and go with the flow until at least my early 30s, now I hate any kind of change at all.

Re: video games, I had a similar experience but play for different reasons. I used to have fun blasting people and chatting on xbox live... don't enjoy any of that anymore. All single player for me, and no FPS. A cool physics game like Rocket League(admittedly, not single player), or a puzzle game like The Turing Test, or a mystery game like Kona are way more interesting to me than say, Call of Duty.


> Am I old yet? I'm not right? Right?

Yes, but being old isn't bad. I, for one, long for the day when I'll be unable to produce and contribute to society.


The 40s are the new thirties, my friend. Also, painkillers help.


IMO memory-mapped IO is the coolest thing since sliced bread. It's a great example in computing where many different kinds of hardware can all be brought together under a relatively simple abstraction.


It was a glorious "click" when learning embedded programming. Even when writing Rust in typical desktop uses, it all feels... abstract. Computer program logic. Where does the magic happen? Where do you go from abstract logic to making things happen? The answer is in volatile memory reads and writes to memory-mapped IO. You write a word to a memory address, and a voltage changes. Etc.


relatively simple until you get into cache coherence (not an issue if you mark the memory as volatile)


The volatile keyword in C has nothing to do with cache coherence and does not prevent cache coherence issues.

It just forces the compiler to generate all memory accesses to this variable. And if the variable is in a cached memory region, then it leaves the door open to the usual cache coherency problems.


Memory mapped IO is never cached as far as I know, so at least that's not a concern.


You'd still have to tell the compiler it's volatile if you were writing into the DMA buffer directly for some reason. GP just used the wrong words to say what they meant.


One of the best classes I took was building Space Invaders on an FPGA dev board. It had 2 PowerPC CPUs, an AC97 sound chip, VGA output, etc.

We built the game in C, wired up the interrupts to the sound chip, wrote to the mapped addresses, etc.

Then we re-did the sound code by writing a DMA controller in the fpga. It was so educational. We had to read and write to the shared bus, follow the commands of the arbiter, send and receive interrupts, etc.

The final bit was a creative assignment, where you could add whatever hardware component you wanted. Some did old Atari controllers, some worked in a separate scoreboard. We did wireless chips via SPI to make it a two player game.

All in, it was one of the best hands on educational experiences. There was no way to complete the class and not have built an intuition for how things work.


Going one deeper, how does the communication work on a physical level? I’m guessing the wires of the PCI Express bus passively propagate the voltage and the CPU and GPU do “something” with that voltage?


Before you get that deep, you need to step back for a bit. The CPU is itself several different processors and controllers. Look at a modern Intel CPU: https://www.anandtech.com/show/3922/intels-sandy-bridge-arch.... The individual x86 cores are connected via a ring bus to a system agent. The ring bus is a kind of parallel bus. In general, a parallel bus works by having every device on the bus operating on a clock. At each clock tick (or after some number of clock ticks), data can be transferred by pulling address lines high or low to signify an address, and pulling data lines high or low to signify the data value to be written to that address.

The system agent then receives the memory operation and looks at the system address map. If the target address is PCI-E memory, it generates a PCI-E transaction using its built-in PCI-E controller. The PCI-E bus is actually a multi-lane serial bus. Each lane is a pair of wires using differential signaling (https://en.wikipedia.org/wiki/Differential_signalling). Bits are sent on each lane according to a clock by manipulating the voltages on the differential pairs. The voltage swings don't correspond directly to 0s and 1s. Because of the data rates involved and the potential for interference, cross-talk, etc., an extremely complex mechanism is used to turn bits into voltage swings on the differential pairs: https://pcisig.com/sites/default/files/files/PCI_Express_Ele...

From the perspective of software, however, it's just bits sent over a wire. The bits encode a PCI-E message packet: https://www.semisaga.com/2019/07/pcie-tlp-header-packet-form.... The packet has headers, address information, and data information. But basically the packet can encode transactions such as a memory write or read or register write or read.


Oh! I know this one! PCI itself is composed of several layers, from a physical layer, to a link layer, to transaction and application layers.

At the physical level we're literally pushing the limits of semiconductor physics as an industry. :-) There are all sorts of tricks, like 8b10b encodings[1] to address error handling and recovery, and then different ways to transmit the PCIe signal itself. You can, for example, encode a PCIe signal on an optical cable. :-) The bus itself is divided into "lanes" which can be aggregated into various "widths" - so you'll see x1, x4, x8, x16 etc. This compounds the problem, in that each lane is negotiated. See this video[2] for what appears to be a nice overview. I haven't watched the video in depth, but clicking around, this seems reasonable to get a sense of what's what.

There's a good reference on the rest of the protocol here: https://www.mindshare.com/files/ebooks/pci%20express%20syste... . Put bluntly - it's amazing that anything works at all, technology is as cheap as it is, ubiquitous as it is, and continues to get cheaper, and more powerful on a daily basis.

If you want your mind blown - look up how modern WiFi works, and what we take for granted today. Beam forming. [3] 256-QAM encoding.[4]

Source: I wrote firmware for very, very large servers at one point in my career. Code that I wrote is (AFAIK) still running, writing a 32-bit test pattern into a register once every second to validate that a particular chip is working ok. That test pattern? 0x4D494B45, which is "MIKE" in ASCII. :-)

Math (and science) is amazing. :-)

[1] https://en.wikipedia.org/wiki/8b/10b_encoding

[2] https://www.youtube.com/watch?v=EHkuzkNWXFk

[3] https://www.youtube.com/watch?v=xGkyZw98Tug

[4] https://www.minim.com/blog/what-is-qam


8b10b is like, 3 PCIe generations old (PCIe 2)! The next thing was 128b/130b (PCIe 3-5) and apparently 6 is moving to 242B/256B.


Fascinating. So I guess I date myself there. But - it makes sense having read a bit. There’s remarkable complexity. “Strong statistical bounds” is again a reason for me to say “it is amazing anything works, anywhere.” I say this typing on a capacitive piece of glass that happens to be a multitouch display, glued to a battery, a quad core processor, and a very precisely arranged lump of sand and gold that’s able to literally shout into the microwave aether to a thingie on a pole in the middle of the air, which then is relaying (encrypted btw) data to some box somewhere that results in an interrupt occurring, a cpu servicing that, and many other things happening many times to result in random people somewhere eventually reading this message.

We’ve come pretty far from the telegraph, haven’t we?


Voltages, yes.. usually it's all binary digital signals, running serial/parallel and following some communication protocol. Maybe you should have a look at something really simple/old like UART communication to get some idea of how this works, and then study how this is scaled up over PCIE to understand the chat between CPU and GPU?

Or maybe not, one does not need all the details, so often just scaled concepts :)

https://en.m.wikipedia.org/wiki/Universal_asynchronous_recei...

Edit: Wait, is it really already QAM over PCIE? Yeah, then UART is a gross simplification, but maybe still a good one to start with, depending on knowledge level?


PCI-E isn't QAM. It's NRZ (a type of pulse amplitude modulation with just two amplitude levels), with 8/10b, 64/66b, or 128/130b encoding, and then scrambled to reduce long runs of 0s or 1s. I think PCI-E 6.0 uses PAM-4 with 4 amplitude levels.


https://pcisig.com/sites/default/files/files/PCI_Express_Ele... It doesn't say QAM explicitly but it has all the QAM terminology, like 128 codes, inter-symbol interference, etc. I'm not an RF guy by any stretch but it sounds like QAM to me.

This is an old spec. I think it’s like equivalent to QAM-512 for PCIe 6


128/130b encoding is used to help embed a clock signal in the data stream. After every 128 bits of data, the receiver will see a preamble of 01 or 10. That voltage transition at predictable intervals helps the receiver keep the clock in sync.

Intersymbol interference likewise applies to pretty much any high-speed digital transmissions. At high frequencies, you have to worry about things like the signal reflecting off the other end of the circuit trace, which creates inter-symbol interference.

QAM is a modulation technique. It specifies how symbols of one or more bits are represented as analog waves on the wire. PCI-E does not use QAM. It simply represents 0s and 1s as high and low voltage swings - what's called Pulse Amplitude Modulation. Other modulation techniques encode symbols by varying the amplitude or phase (or both) of a carrier wave. QAM works by modulating the amplitude of two carrier waves 90 degrees out of phase (e.g. sine and cosine) based on a data signal and summing them together: https://www.techtarget.com/searchnetworking/definition/QAM


Oh cool. Thanks for that. So basically it uses similar terminology because it's modulated; it's just that QAM is a different kind of modulation?


It’s similar terminology because it’s dealing with sending digital signals at high speeds using analog signals. “Modulation” just means the scheme by which sequences of bits are represented in analog signals. There necessarily has to be some sort of modulation. But the terminology is more related to common issues that arise from the receiving end reading the small, rapidly changing voltages and recovering the bit stream. Regardless of the modulation you face similar problems with keeping the clock in sync etc.


Oh, that is several levels deeper! PCIe is a big standard with several layers of abstraction, and it's far from passive.

The different versions of PCIe use a different encoding, so it's hard to sum it all up in a couple sentences in terms of what the voltage does.


Older CPUs - the CPU had a bunch of A pins (address), a bunch of D pins (data).

The A pins would be a binary representation of an address, and the D pins would be the binary representation of data.

A couple of other pins would select behavior (read or write) and allow handshaking.

Those pins were connected to everything else that needed to talk with the CPU on a physical level, such as RAM, I/O devices, and connectors for expansion. Think 10-base-T networking where multiple nodes are physically modulating one common wire on an electrical level. Same concept, but you have many more wires (and they're way shorter).

Arbitration logic was needed so things didn't step on each other. Sometimes things did anyway and you couldn't talk to certain devices in certain ways or your system would lock up or misbehave.

Were there "switches" to isolate and select among various banks of components? Sure, they are known as "gate arrays" - those could be ASICs or implemented with simple 74xxx ICs.

Things like NuBus and PCI came about - the bus controller is directly connected and addressable to the CPU as a device, but everything else is connected to the bus controller, so now the new-style bus isn't tied to the CPU and can operate at a different speed and CPU and bus speed are now decoupled. (This was done on video controllers in the old 8-bit days as well - to get to video RAM you had to talk to the video chip, and couldn't talk to video RAM directly on some 8-bit systems).

PCIE is no longer a bus, it's more like switched Ethernet - there's packets and switching and data goes over what's basically one wire - this ends up being faster and more reliable if you use advanced modulation schemes than keeping multiple wires in sync at high speeds. The controllers facing the CPU still implement the same interface, though.


> Older CPUs - the CPU had a bunch of A pins (address), a bunch of D pins (data).

And on some buses, some or all A/D pins were multiplexed, so a transaction was divided into an address-setup phase and a data read/write phase.


It’s signaled similarly to QAM - far more complicated than GPIO-type stuff. Think FM radio / spread spectrum rather than bit-banging / old-school serial/parallel ports.

Similar to old-school modems, if the line is noisy it can drop to lower “baud” rates. You can manually try to recover higher rates once the noise is gone, but it’s simpler to just reboot.


It might be easier to start with older or simpler/slower buses. ISA, SPI, I2C. In some ways ISA is very different - latching multiple parallel channels together instead of ganging independent serial lanes. But it makes sense to start off simple and consider the evolution. Modern PCIe layers several awesome technologies together, especially FEC. Originally they used 8b10b but I see now they're using 242b256b.


Others are not wrong in saying memory-mapped IO. Taking a look at the Amiga Hardware Reference Manual [1] and a simple example [2], or a NES programming guide [3], would be a good way to see this in operation.

A more modern CPU/GPU setup is likely to use a ring buffer. The buffer will be in CPU memory, and that memory is also mapped into the GPU address space. The driver on the CPU will write commands into the buffer, which the GPU will execute. These will be different from the shader units' instruction set.

Commands would set some internal GPU register to a value: setting the output resolution, the framebuffer base pointer, the mouse pointer position; referencing a texture in system memory; loading a shader; executing a shader; setting a fence value (useful for seeing when a resource - a texture or shader - is no longer in use).

Hierarchical DMA buffers are a useful feature of some DMA engines. You can think of them as similar to subroutines: the command buffer can contain an instruction to switch execution to another chunk of memory. This allows the driver to reuse common or expensive-to-generate sequences. OpenGL display lists were commonly compiled down to a separate buffer.

[1] https://archive.org/details/amiga-hardware-reference-manual-...

[2] https://www.reaktor.com/blog/crash-course-to-amiga-assembly-...

[3] https://www.nesdev.org/wiki/Programming_guide


A lot of things happen there.

But most importantly, the PCIe bus is a serial bus with a virtualized interface, so there is no single physical communication process; what happens is more similar to an Ethernet network: on each device there are a few endpoints, each with its own controller, its own address, a few registers to store state and transactions, and memory buffer(s).

Video cards usually have many behaviors. In the simplest modes, they behave just as RAM mapped into a large chunk of the system address space, plus video registers to control the video output, the address mapping of video RAM, and mode switching.

In more complex modes, video cards generate interrupts (just a special type of message on PCIe).

In 3D modes, which are the most complex, the video controller takes data from its own memory (which is mapped into the system address space), where a tree of graphics primitives is stored. Some of it is drawn directly from video RAM, but for the rest the bus-master capability of PCIe is used, with the video controller reading additional data (textures) from predefined chunks of system RAM.

As for GPU compute: usually the CPU copies data into video RAM directly, then asks the video controller to run a program in video RAM; when it completes, the GPU issues an interrupt, and then the CPU copies the result back from video RAM.

A recent addition gives the GPU the ability to read data from system disks, using the aforementioned bus mastering, but it is not yet widely implemented.


> A recent addition gives the GPU the ability to read data from system disks, using the aforementioned bus mastering, but it is not yet widely implemented.

My impression is that high-end graphics cards (Nvidia RTX 30x and professional equivalents) more and more replicate parts of the PC architecture and become sort of mini-computers within a computer.

Following that logic, I wonder when we'll see the first card with its own dedicated flash memory - or why not a PCIe controller, so you can hook up an SSD...


Some years ago AMD produced an enterprise GPU accelerator with an SSD interface (or NVMe - I couldn't find many details). For Nvidia I can't remember one. This addition hasn't become a commodity yet, but it may in the near future.


I mean, for gaming cards, it would even make some sense.

When the user installs a new game, the installer could just pass on all assets of the game right to the GPU. The GPU could precompile the shaders and also store textures, models etc in whatever way would be optimal for that particular card.

During play, loading should be considerably faster as the GPU could just load the data from storage into graphics memory, without touching the main system at all.

A funny side-effect could be that the game install would also probably appear significantly smaller as it would take up less space on the filesystem.

Finally, Intellectual Property people would probably be happy as well, as the precious game assets wouldn't touch any user-accessible storage at all...


> During play, loading should be considerably faster as the GPU could just load the data from storage into graphics memory, without touching the main system at all.

This wouldn't affect performance much, because PCIe speed far surpasses the speed of the best SSD/NVMe drives.

Currently the only problem is that all SSD<->GPU transactions involve the main CPU, which takes on a lot of load just to accept the transactions. But this is already being addressed by a new technology, which appeared recently and allows the GPU to access the SSD directly; unfortunately not many details are known.

As for size I'm not sure, because GPU hardware is too specialized to deal with filesystems, so I think there would be a special FS. It's also possible that this part of the disk would be read-only for the GPU (all writes and maintenance done by the host CPU), to simplify the GPU software.


> This wouldn't affect performance much, because PCIe speed far surpasses the speed of the best SSD/NVMe drives.

Ah, I wasn't aware of that. Thanks for the info.

I would think there could still be some speed advantage if the GPU does not have to share storage with the CPU, coordinate access, etc.

> As for size I'm not sure, because GPU hardware is too specialized to deal with filesystems, so I think there would be a special FS. It's also possible that this part of the disk would be read-only for the GPU (all writes and maintenance done by the host CPU), to simplify the GPU software.

That was sort of what I meant. In that scenario, a game would take up less space in the "ordinary" filesystem, i.e. storage of the main system - because part of the game data would be moved to GPU storage instead. Hence appear to be smaller: If you added the space taken up in main storage and GPU storage back together, you'd arrive back at the old size. It's more a sleight of hand, not an actual reduction in size.


> I would think there could still be some speed advantage if the GPU does not have to share storage with the CPU

In the enterprise the advantage is really big - because they typically use RAID disk arrays, and enterprise systems usually have lots of disks: 5-10 is normal, and there are machines with 64 disks or even more than 100. Also important: server hardware usually uses much lower bus frequencies than a gaming PC, so those systems just don't get the opportunity to see the bus's maximum speed.

That's why AMD first created such a card for the enterprise - speed is a real pain point in enterprise computing.

> a game would take up less space in the "ordinary" filesystem, i.e. storage of the main system - because part of the game data would be moved to GPU storage instead

I mean, because the GPU needs low overhead to access the FS, simplified filesystems would be used - or ones heavily optimized for access speed - and such FSs would be much less space-efficient than ordinary ones.

For example, read Knuth on searching and sorting: you'll learn that searching/sorting with an index is much faster - many orders of magnitude faster - but the index may be orders of magnitude larger than the data.



For a beginner, I think it's best to start by reading about the Atari consoles, the Atari 65/130, and the NES, as their ideas were later implemented in all commodity video cards, just slightly extended.

BTW, all modern video cards use bank switching.


The OSDev Wiki is a great resource on how this all works from the perspective of actually programming it, at least on x86.

For example here's the page on talking PCI-E https://wiki.osdev.org/PCI_Express


It is old and I am not sure everything still applies but I found this course useful to understand how GPUs work:

Intro to Parallel Programming:

https://classroom.udacity.com/courses/cs344

https://developer.nvidia.com/udacity-cs344-intro-parallel-pr...


There is some good information on how PCI-Express works here: https://blog.ovhcloud.com/how-pci-express-works-and-why-you-...


This was a good intro


At a high level, it's actually really simple. Your PCIe devices are each given a region of the address space, say, 0x8428000000000000-0x8428000000000fff. Just write to that region from kernel mode. But what do you write? Well, that isn't standardized. It's not even really documented. The best documentation is the source code to the GPU drivers in the Linux kernel, which are usually added to by engineers working at GPU vendors, and they don't discuss it much.


AMD does have some GPU register documentation for GCN at the bottom of https://developer.amd.com/resources/developer-guides-manuals... but not for RDNA / RDNA2.


And if you're looking for some strange architecture forgotten by time :) https://www.copetti.org/writings/consoles/sega-saturn/


You'll find a very good introduction in the comparch book "Write Great Code, Volume 1", chapter 12 ("Input and Output"), which also explains the history of system buses (therefore, you'll find an explanation of how ISA works).

Interestingly, there is a footnote explaining that "Computer Architecture: A Quantitative Approach provided a good chapter on I/O devices and buses; sadly, as it covered very old peripheral devices, the authors dropped the chapter rather than updating it in subsequent revisions."


Wow. Just skimmed across that chapter and that looks like a great resource. No wonder I couldn't find it in any of my searching sessions, I'd never think a book titled like that would cover hardware concepts so extensively. This will definitely help me in understanding buses better. Thank you.


While we're here: is there any reasonable prospect of keeping one's GPU from being able to read and write to literally anywhere in physical memory?

I.e., a practical way a kernel and driver might be able to forward to the GPU only commands and shaders that can access only your process memory, and nobody else's, and your process's pixels, and no other process's pixels, when they live in GPU RAM?

For all I know, this is the norm for all GPUs, but I wonder why it is hard, then, for VMs to share a GPU.


> I wonder why it is hard, then, for VMs to share a GPU.

It isn't. Nvidia & AMD just charge a massive premium for the privilege. Nvidia calls it vGPU https://docs.nvidia.com/grid/13.0/grid-vgpu-user-guide/index... and AMD calls it MXGPU https://www.amd.com/en/graphics/workstation-virtual-graphics

Both have been around for a while now, and both refuse to bring it to their consumer cards.


> is there any reasonable prospect of keeping one's GPU from being able to read and write to literally anywhere in physical memory?

This is the purpose of an IOMMU.

> I.e., a practical way a kernel and driver might be able to forward to the GPU only commands and shaders that can access only your process memory, and nobody else's, and your process's pixels, and no other process's pixels, when they live in GPU RAM?

So IOMMU and the GPU's MMU.

> For all I know, this is the norm for all GPUs

What?

> but I wonder why it is hard, then, for VMs to share a GPU.

Engineering is hard.


TL;DR: bi-directional memory access with some means to notify the other party that "something has changed".

It's not that different from any other PCI/PCIe device, be it a network card or a disk/HBA/RAID controller.

If you want to understand how it came to this - look at the history of ISA, PCI/PCI-X, a short stint for AGP and finally PCI-E.

Other comments provide a good ELI15 for the topic.

A minor note about "bus" - for PCIe it is mostly a historic term, because it's a serial, P2P connection, though the process of enumerating and querying the devices is still very akin to what you would do on some bus-based system. E.g. SAS is a serial "bus" compared to SCSI, but you still operate with it as a "logical" bus, because it is easier for humans to grok it this way.


I find it very interesting that you mention looking at the history of ISA's first in order to understand the current iteration of the technology.

I was reading the RISC-V privileged ISA recently and the amount of seemingly arbitrary registers and behaviours that must be implemented to support a UNIX-like OS is crazy, and that got me thinking about the history behind all of these things that the hardware must support in order to support the OS.

But thank you for the pointers, I'll definitely use this.


The "ISA" mentioned above is the "Industry Standard Architecture", the 8/16bit bus used by PCs and PC clones back in the day, not "Instruction Set Architecture (x86, ARM, RISC-V, etc): https://en.wikipedia.org/wiki/Industry_Standard_Architecture



Ha! I definitely meant ISA bus, as others had mentioned.

Kudos to swetland and phendrenad2!


Others have mentioned MMIO. MMIO comes in several kinds:

1. CPU accessing GPU hw with uncache-able MMIO, such as lower level register access

2. GPU accessing CPU memory with cache-able MMIO, or DMA. such as command and data stream

3. CPU accessing GPU memory with cache-able MMIO, such as textures

They all happen on the bus with different latency and bandwidth.


BTW I believe memory maps are set up by the ioctl() system call on Unix (including OS X), which is kind of a "catch all" hole poked through the kernel. Not sure about Windows.

I didn't understand that for a long time ...

I would like to see a "hello world GPU" example. I think you open() the device and then ioctl() it ... But what happens when things go wrong?

Similar to this "Hello JIT", where it shows you have to call mmap() to change permissions on the memory to execute dynamically generated code.

https://blog.reverberate.org/2012/12/hello-jit-world-joy-of-...

I guess one problem is that this is typically done in vendor code, and they don't necessarily commit to an interface? They make you link their huge SDK instead.


On my system, the CPU sees the GPU as a PCI device. The "PCI config space" [0] is a standard thing and so the CPU can read it and figure out its device ID, vendor ID, revision, class, etc. From that, the OS looks at its PCI drivers and tries to find which one claims to drive that specific PCI device_id/vendor_id combination (or class in case there's some kind of generic universal driver for a certain class).

From there, the driver pretty much knows what to do. But primarily the driver will map the registers to memory addresses, so accessing offset 0xF0 from that map is equivalent to accessing register 0xF0. The definition of what each register does is something that the HW developers provide to the SW developers [1].

Setting modes (screen resolution) and a lot of other stuff is done directly by reading and writing to these registers. At some point they also have to talk about memory (and virtual addresses) and there's quite a complicated dance to map GPU virtual memory to CPU virtual memory. On discrete GPUs the data is actually "sent" to the memory somehow through the PCI bus (I suppose the GPU can read directly from the memory without going through the CPU?), but in the driver this is usually abstracted to "this is another memory map". On integrated systems both the CPU and GPU read directly from the system memory, but they may not share all caches so extra care is required here. In fact, caches may also mess up the communication on discrete graphics, so extra care is always required. This paragraph describes work mostly done by the kernel driver in Linux.

At some point the CPU will tell the GPU that a certain region of memory is the framebuffer to be displayed. And then the CPU will formulate binary programs that are written in the GPU's machine code, and the CPU will submit those programs (batches) and the GPU will execute them. These programs are generally in the form of "I'm using textures from these addresses, this memory holds the fragment shader, this other holds the geometry shader, the configuration of threading and execution units is described in this structure as you specified, SSBO index 0 is at this address, now go and run everything". After everything is done the CPU may even get an interrupt from the GPU saying things are done, so they can notify user space. This paragraph describes mostly the work done by the user space driver (in Linux, this is Mesa), which implements OpenGL/Vulkan/etc abstractions.

[0]: https://en.wikipedia.org/wiki/PCI_configuration_space [1]: https://01.org/linuxgraphics/documentation/hardware-specific...


Everyone in tech should read the book "Understanding the Digital World" by Brian W. Kernighan.


Is it very in-depth or more for layman readers?


It doesn't really answer my question, but from what I've seen in the TOC I'd say it's equivalent to an introductory course on computer architecture + computer systems and some cryptography as well. Kind of an introduction (don't get me wrong with the word 'introduction', it covers a decent amount of material) to the most important concepts and technologies that guide computers and the internet.


Most normal people and junior devs would be left scratching their heads reading it; techies and seniors would nod along and sometimes say "ah... so that's how it really works". It's in between, but a good primer on the essentials.


Is this before or after they read Knuth?


Mmm, do you know anybody who has read Knuth, really?


I own the books and I read parts of it.

But regardless, my point was that there are many many things that people say every programmer should read, and that sometimes includes Knuth. In fact, so many things that even if I stop everything I'm doing and just keep reading everything "every programmer should read", I will die before I finish.

tl;dr: it was a (bad) joke.


> What I'm looking for is a description of how a CPU tells a GPU to start executing a program. Through what means do they communicate - a bus? How does such a communication instance look like?

For most modern computers, through the PCI Express bus. Take a look at the output of "lspci -v" and you'll see something like:

    00:02.0 VGA compatible controller: [...]
        [...]
        Flags: bus master, fast devsel, latency 0, IRQ 128
        Memory at ee000000 (64-bit, non-prefetchable) [size=16M]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        I/O ports at f000 [size=64]
        Expansion ROM at 000c0000 [virtual] [disabled] [size=128K]
That is, the GPU on this particular laptop makes available a region of memory sized 16 megabytes at physical address 0xee000000, and another region of memory sized 256 megabytes at physical address 0xd0000000. Whenever the CPU writes to or reads from these memory regions, it is writing to memory on the GPU, not on the normal RAM chips. And not all of that "memory" on the GPU is real memory; some of it is registers, which are used to control the GPU.

The same happens on the opposite direction: for code running on the GPU, some regions of memory are actually the RAM normally used by the CPU. In either case, the memory read and/or write transactions go through the PCI Express bus to the other device.

The exact details of what is written to (and read from) that memory vary depending on the device. For most GPUs, the driver sets up a list of commands in memory (either "host" memory, which is the RAM on the CPU, or "device" memory, which is the RAM on the GPU accessible through these PCI Express "memory windows"), and writes the address of that command list to a register on the GPU; the GPU then reads the list and executes the commands found in it. These commands can include things like "start N threads of the program found at X with Y as the input" (GPU programs are commonly called "shaders", and they are highly parallel), but also things like "wait for event W to happen before doing Z".


Woah there, my dude. Let's try to understand a simple model first.

A CPU can access memory. When a CPU performs loads & stores it initiates transactions containing the address of the memory. Therefore, it is a bus master--it initiates transactions. A slave accepts transactions and services them. The interconnect routes those transactions to the appropriate hardware, e.g. the DDR controller, based on the system address map.

Let's add a CPU, interconnect, and 2GB of DRAM memory:

  +-------+
  |  CPU  |
  +---m---+
      |
  +---s--------------------+
  |      Interconnect      |
  +-------m----------------+
          |
     +----s-----------+
     | DDR controller |
     +----------------+
  
  
  System Address Map:
  
    0x8000_0000 - 0x0000_0000  DDR controller
So, a memory access to 0x0004_0000 is going to DRAM memory storage.

Let's add a GPU.

  +-------+    +-------+
  |  CPU  |    |  GPU  |
  +---m---+    +---s---+
      |            |
  +---s------------m-------+
  |      Interconnect      |
  +-------m----------------+
          |
     +----s-----------+
     | DDR controller |
     +----------------+
  
  
  System Address Map:
  
    0x9000_0000 - 0x8000_0000  GPU
    0x8000_0000 - 0x0000_0000  DDR controller
Now the CPU can perform loads & stores from/to the GPU. The CPU can read/write registers in the GPU. But that's only one-way communication. Let's make the GPU a bus master as well:

  +-------+    +-------+
  |  CPU  |    |  GPU  |
  +---m---+    +--s-m--+
      |           | |
  +---s-----------m-s------+
  |      Interconnect      |
  +-------m----------------+
          |
     +----s-----------+
     | DDR controller |
     +----------------+
  
  
  System Address Map:
  
    0x9000_0000 - 0x8000_0000  GPU
    0x8000_0000 - 0x0000_0000  DDR controller
Now, the GPU can not only receive transactions, but it can also initiate transactions. Which also means it has access to DRAM memory too.

But this is still only one-way communication (CPU->GPU). How can the GPU communicate to the CPU? Well, both have access to DRAM memory. The CPU can store information in DRAM memory (0x8000_0000 - 0x0000_0000) and then write to a register in the GPU (0x9000_0000 - 0x8000_0000) to inform the GPU that the information is ready. The GPU then reads that information from DRAM memory.

In the other direction, the GPU can store information in DRAM memory, and then send an interrupt to the CPU to inform the CPU that the information is ready. The CPU then reads that information from DRAM memory.

An alternative to using interrupts is to have the CPU poll. The GPU stores information in DRAM memory and then sets some bit in DRAM memory. The CPU polls on this bit in DRAM memory, and when it changes, the CPU knows that it can read the information in DRAM memory that was previously written by the GPU.

Hope this helps. It's very fun stuff!


Yeah, your explanation really hits the nail on the head in regards to what I was trying to understand - MMIO coupled with all of that bus dynamic of a master and slave going on. It's clear to me now that my knowledge gap resides in not knowing enough about interconnects. Thanks a lot!

I do wonder, why aren't interconnects more emphasized in the courses I took? All I've seen was just oversimplified pictures of the process. Your explanation goes just enough into the lower-level aspects of the process to allow me to piece it together.


> Yeah, your explanation really hits the nail on the head in regards to what I was trying to understand

:)

There's a lot of knowledge, and acronyms, and BS out there. E.g. there is no need to discuss PCIe here. It's much easier, and enjoyable, to cultivate a simple understanding of fundamentals. Build up from there. Reduce it down to your own simple model.

Interconnects focus on the transfer of data between components in the system. Topics like topology, switching/routing, and performance come into play. But, for the purposes of the simple model described above, all you really need to grasp is topology. I.e., how are things connected and where is data flowing?

The memory model is another extension to the simple model described above. Both the CPU and the GPU have access to DRAM memory (shared memory). The CPU can send transactions to the GPU, and the GPU can interrupt the CPU. These are all different paths through the system. But, remember that we described a very specific order of events that need to happen for shared memory communication between CPU and GPU. E.g. (1) the CPU sends transactions to the DDR controller to store some information in DRAM memory, and then (2) sends a transaction to the GPU to inform the GPU that it can now (3) send transactions to the DDR controller to retrieve that information from DRAM memory. But what if (1) and (3) happen much faster than (2)? The GPU will get old data, not the new data that was written by the CPU. Managing this ordering of events in the system is what the memory model is all about. What if shared memory exists not only in DRAM memory but also in caches elsewhere in the system?

Edit:

Back to your question of "how a CPU tells a GPU to start executing a program"...

In the simple model, you could imagine something like: (1) CPU stores the shader program in DRAM memory. (2) CPU writes a GPU register informing the GPU of what address the shader program is located at in DRAM memory. (3) CPU also informs the GPU of the size of the shader program. (4) GPU loads the shader program from DRAM memory. (5) GPU starts executing shader program.


In a SoC (all phones) and most laptops with "integrated graphics", that's the way it works. Basically a multi-processor system with shared memory. It is actually more common in the wild than the PCIe model, where the GPU has dedicated memory.


:)


Do you have a website to create such ´artwork´?

I find all these comments with the same typographic difficulty and I just want to work faster.

Can you help?

Ron.


Not sure if this addresses the speed issue, but here is one such tool https://asciiflow.com/#/ The download arrow lets you copy to clipboard.

And if you are an emacs user, you can try the artist-mode to draw with the mouse.


Another tool similar to this is https://dot-to-ascii.ggerganov.com


Wow! At least this one works. Thanks! But it's not for mouse drawing, it's more like scripting.


Wow, this is awesome. Definitely will speed up creating such drawings. Thank you!


Through the electrical wires in the PCI express port.


Nit pick…

Technically it’s not “through” the electrical wires, it’s actually through the electrical field created around the electrical wires.

Veritasium explains https://youtu.be/bHIhgxav9LY


Nitpicking the nitpick: the energy is what's in the fields, but the electrical wires aren't just for show, the electrons do need to be able to move in the wire for there to be a current, and the physical properties of the wire have a big impact on the signal.

So things get very complicated and unintuitive, especially at high frequencies, but it's okay to say through the wire!


And as you might be alluding to, particularly at high frequencies: in the skin (via skin effect) of the wire!

I'll confess I have never seen a plot of actual rms current density vs radius related to skin effect.


I could be misunderstanding the context of the question, but I think OP is imagining some sophisticated communication logic involved at the chip level. The CPU doesn't know anything much about the GPU other than it's there and data can be sent back and forth to it. It doesn't know what any of the data means.

I think the logic OP imagines does exist, but it's actually in the compiler (eg the cuda compiler), figuring out exactly what bytes to send which will start a program etc.


Not in the compiler but in the GPU driver. A graphics program (or compute program) just calls the APIs (DirectX/Vulkan/CUDA) of a driver, which then knows how to do that at a low level, writing to particular regions of RAM mapped to GPU registers.


Yes! This is correct. My bad, it's been too long. I guess either way the point is that it's done in software, not hardware.


There's also odd/interesting architectures like one of the earlier Raspberry Pis, where the GPU was actually running its own operating system that would take care of things like shader compilation.

In that case, what's actually being written to shared/mapped memory is very high level instructions that are then compiled or interpreted on the GPU (which is really an entire computer, CPU and all) itself.


Drivers make all hardware communicate.


I'm no expert on PCIe, but it's been described to me as a network.

PCIe has switches, addresses, and so forth. Very much like IP-addresses, except PCIe operates on a significantly faster level.

At its lowest level, PCIe x1 is a single "lane", a singular stream of zeros-and-ones (with various framing / error correction on top). PCIe x2, x4, x8, and x16 are simply 2, 4, 8, or 16 lanes running in parallel and independently.

-------

PCIe is a very large and complex protocol however. This "serial" comms can become abstracted into Memory-mapped I/O. Instead of programming at the "packet" level, most PCIe operations are seen as just RAM.

> even virtual memory

So you understand virtual memory? PCIe abstractions go up to and include the virtual memory system. When your OS sets aside some virtual memory for PCIe devices, and programs read/write to those memory addresses, the OS (and PCIe bridge) will translate those RAM reads/writes into PCIe messages.

--------

I now handwave a few details and note: GPUs do the same thing on their end. GPUs can also have a "virtual memory" that they read/write to, and translates into PCIe messages.

This leads to a system called "Shared Virtual Memory" which has become very popular in a lot of GPGPU programming circles. When the CPU (or GPU) read/write to a memory address, it is then automatically copied over to the other device as needed. Caching layers are layered on top to improve the efficiency (Some SVM may exist on the CPU-side, so the GPU will fetch the data and store it in its own local memory / caches, but always rely upon the CPU as the "main owner" of the data. The reverse, GPU-side shared memory, also exists, where the CPU will communicate with the GPU).

To coordinate access to RAM properly, the entire set of atomic operations + memory barriers have been added to PCIe 3.0+. So you can perform "compare-and-swap" to shared virtual memory, and read/write to these virtual memory locations in a standardized way across all PCIe devices.

PCIe 4.0 and PCIe 5.0 are adding more and more features, making PCIe feel more-and-more like a "shared memory system", akin to cache-coherence strategies that multi-CPU / multi-socket CPUs use to share RAM with each other. In the long term, I expect Future PCIe standards to push the interface even further in this "like a dual-CPU-socket" memory-sharing paradigm.

This is great because you can have 2-CPUs + 4 GPUs on one system, and when GPU#2 writes to Address#0xF1235122, the shared-virtual-memory system automatically translates that to its "physical" location (wherever it is), and the lower-level protocols pass the data to the correct location without any assistance from the programmer.

This means that a GPU can do things like perform a linked-list traversal (or tree traversal), even if all of the nodes of the tree/list are in CPU#1, CPU#2, GPU#4, and GPU#1. The shared-virtual-memory paradigm just handwaves the details and lets PCIe 3.0 / 4.0 / 5.0 protocols handle the details automatically.


I agree that PCIe is mostly a shared memory system.

But for video cards this sharing is unequal, because their RAM sizes exceed the 32-bit address space, and many mainboards still in use have 32-bit PCIe controllers, so all PCIe addresses have to fit inside a 4GB address space. You can see this on Windows machines, where the total installed memory shown is roughly 0.5GB less than what's physically there, of which 256MB is the video RAM access window.

So in most cases the old rule still holds: the video card shares all of its memory through a 256MB window using bank switching.

As for the GPU reading main system memory, that is usually useless, because VRAM is magnitudes faster, even before considering other devices (like HDDs/SSDs) using up bus bandwidth.

In most cases, the only use of GPU access to main system memory is the traditional reading of textures (for a 3D accelerator) from system RAM - and even then, for example, ALL 3D software using GPU rendering can only render out of video RAM; none use system RAM.


Yeah, as I've read other responses to my post I've been able to better define my difficulties in understanding CPU-GPU communication. I was having a hard time separating the MMIO concept from the communications protocol that ties together all of these devices (based on what you've explained that'd be PCIe). I actually haven't learned about PCIe as of yet, so the way you've introduced the concept has set me up to further look into it, thanks.


On the PC side start by reading some basics like https://archive.org/details/URP_8th_edition/ (newer editions require logging in and borrowing)

>What I'm looking for is a description of how a CPU tells a GPU to start executing a program. Through what means do they communicate - a bus? How does such a communication instance look like?

Long time ago you would memory map the framebuffer and just write directly to it.

Then the first 2D acceleration showed up in 1987 in the form of the IBM 8514 (later cloned by ATI/Matrox/S3/Tseng and others). You wrote commands one at a time using I/O port access to a FIFO, polling for idle/full, with no direct access to the framebuffer http://www.os2museum.com/wp/the-8514a-graphics-accelerator/

Next evolution was MMIO - memory mapped IO. You no longer executed dedicated CPU IO instructions (assembler IN/OUT); IO ports were simply addresses in memory. You still had FIFOs and wrote one command at a time http://www.o3one.org/hwdocs/video/voodoo_graphics.pdf

Then someone threw DMA into the mix. Now you could DMA contents of a circular buffer filled with your commands http://www.bitsavers.org/components/s3/DB019-B_ViRGE_Integra...

We finally got command list/command buffer/bundle copied directly to the GPU.

Nowadays you have multiple command lists/command buffers/bundles going in parallel https://developer.nvidia.com/blog/advanced-api-performance-c...

On the hardware side, the 8/16 bit ISA bus was a shared parallel connection to the CPU bus at a fixed clock (4.77-10MHz, 4 clocks per transfer, ~5MB/s max speed).

It took us up to 1992 to get the next commonly used solution: a "rogue" consortium of companies tired of IBM's shit designed VESA Local Bus (a true hack) in the form of slapping expansion cards directly onto the raw 32-bit CPU bus of 486 processors. Cheap, no licensing fees, extremely fast (40MHz x 32bit = potentially faster than later PCI), easy to implement.

This got replaced with the advent of Pentium (64bit external CPU data bus) and introduction of PCI. PCI is still a shared parallel bus, but this time 32bits at 33MHz with packetized transactions.

AGP was "just" a faster PCI on its own dedicated separate controller (no contention with other PCI devices) and optimized addressing (sideband). 32bit at 66MHz, then x2 DDR, x4 QDR, x8 ODR. The last one means there are 8 transfers taking place within one clock cycle, for a nice 2GB/s.

PCI-E is a faster, bidirectional, serial, point-to-point PCI with the ability to combine links into bundles (x1-x16). PCI-E devices live on a network switch and don't block each other from talking simultaneously. You could think of PCI-E as every PCI device getting its own dedicated dual-direction AGP connector.

Some vintage hands on coding examples:

2D Tseng Labs ET4000 coding https://www.youtube.com/watch?v=K8kZ4BFxOtc

2D Cirrus Logic https://www.youtube.com/watch?v=WoAE7x-u1g0

"How 3D acceleration started 20 years ago: S3/Virge register level programming" https://www.youtube.com/watch?v=fXJ11_wG_0U

"Acceleration code working on real S3 Virge/DX" https://www.youtube.com/watch?v=Hsg1N4IqXac

"Direct hardware accelerated 3d in 20kB code" https://www.youtube.com/watch?v=n509_wN02u8

"Bare metal hardware 3d texturing in 23kb of code w/ S3/Virge" https://www.youtube.com/watch?v=UgvBGXiw6LY

"Testing our latest low-level hardware 3d code on real S3/Virge hardware" https://www.youtube.com/watch?v=px--LWdRoYA

"Live coding and testing more low-level 3D w/ S3/Virge" https://www.youtube.com/watch?v=l3lH0cIZUSA

"Finishing low-level hardware S3/Virge acceleration demo" https://www.youtube.com/watch?v=JmfeB2LEDbc

"3dfx Voodoo: Low-level & bare-metal driver-less code" https://www.youtube.com/watch?v=LDT6KlfOG2k

"Finally 3dfx Voodoo triangles" https://www.youtube.com/watch?v=ZWaDqY4gqhw

"More GPU programming Voodoo case study" https://www.youtube.com/watch?v=AYZvNyxFHqk

"Quite final 3dfx Voodo low-level code working" https://www.youtube.com/watch?v=2ADQgIEWrx4



