
Typically the CPU and GPU communicate over the PCI Express bus. (It's not technically a bus but a point-to-point connection.) From the perspective of software running on the CPU, these days, that communication is typically in the form of memory-mapped IO. The GPU has registers and memory mapped into the CPU address space using PCIe. A write to a particular address generates a message on the PCIe bus that's received by the GPU and produces a write to a GPU register or GPU memory.

The GPU also has access to system memory through the PCIe bus. Typically, the CPU will construct buffers in memory with data (textures, vertices), commands, and GPU code. It will then store the buffer address in a GPU register and ring some sort of "doorbell" by writing to another GPU register. The GPU (specifically, the GPU command processor) will then read the buffers from system memory and start executing the commands. Those commands can include, for example, loading GPU shader programs into shader memory and triggering the shader cores to execute them.



If OP or anyone else wants to see this firsthand.. well shit, I feel old now, but.. try an exercise in assembly programming on the Commodore 64. Get the VICE emulator and dig into it for a few weeks. It's real easy to get into: CPU (6502-based), video chip (VIC-II), sound chip (the famous SID), ROM chips.. they all live in the same address space (yeah, not mentioning pages), the CPU has three registers.. it's also real fun to get into, even to this day.


There's a nice guide by Ben Eater on Youtube about breadboard computers: https://www.youtube.com/playlist?list=PLowKtXNTBypFbtuVMUVXN...

It doesn't sport any modern features like DMA, but builds up from the core basics: a 6502 chip, a clock, and a blinking LED, all hooked up on a breadboard. He also built a basic VGA card and explains protocols like PS/2, USB, and SPI. It's a great introduction or refresher into the low level hardware concepts behind computers. You can even buy kits to play along at home!


I was going to say, watching Ben Eater and building his 6502 machine myself filled in so many gaps in knowledge or fuzzy areas around basic computer hardware. If OP sees this he should definitely consider building some of these after Nand2Tetris.


This! Except I (slightly redesigned and) built a common Z80 setup; nothing compares to the knowledge gained from this single effort.


Is my understanding correct that compared to those historical architectures, modern GPUs are a lot more asynchronous?

What I mean is that these days you'd issue a data transfer or program execution on the GPU, it will complete at its own pace, and the CPU meanwhile continues executing other code; in contrast, on those 8-bitters you'd poke a video register or whatever and expect that to have a more immediate effect, allowing those famous race-the-beam effects etc.?


It depends.

First, in modern systems you usually don't have direct access to the GPU; you make a syscall and then magic happens, something like that. And they are all multilayered architectures, meaning it's now normal for a systrace dump of some exception to list a few hundred functions; imagine how slow this is.

This is because all modern systems, except consoles, are direct descendants of business minicomputers, where the most important thing was that they could simultaneously run a wallet and a text processor (or a spreadsheet, or email, etc.), and these programs would not see each other's memory, so errors in the text processor would not lead to money lost from the wallet.

Second, modern systems behave like NUMA architectures (some people even call them distributed architectures), meaning that only CPU computation in userspace is fast.

In many cases you have a choice to do the magic in a sync or an async way, but in any case syscalls are extremely slow; in some cases they can be a few orders of magnitude slower than userspace CPU computation.

Sometimes some sort of message-passing architecture is a lifesaver: you send a message to the GPU and can immediately get on with your tasks without any waiting at all (the kernel periodically checks the message mailboxes of all processes, and when your turn comes it reads your message, does its thing, and writes the answer).

But message passing is now rare; as far as I know the direct-syscall paradigm is mostly used, where async means the kernel releases your process right after receiving the syscall, and sync means your process is released only after the syscall has finished processing the data.

In 8-bit systems the CPU frequency was very low, sometimes a fraction of the bus speed, and there were very few layers; basically a user-level program could directly access the hardware. So even though their behavior was formally called synchronous, in reality they were very fast in most cases, except for understandable waits, like when the Atari ANTIC accesses video RAM while you're also trying to.

Reliability issues in 8-bit systems and consoles were solved in very simple ways: first, most could run only one program at a time, and second, console software was extremely reliable, and expensive; much more reliable than business software, sometimes exceeding the reliability of military software or mainframe system software.


There were interrupts telling you when certain things happened. If anything, it was asynchronous. The big thing is also that you had to tally the cost of what you were doing. There was a budget of how many cycles you got per line, per screen, and then you fit whatever you had into that. When playing sound it was common to change the border color as you fed the music into the SID so you could tell, like a crude debug/ad hoc printf, how many cycles your music routines ate.


Nice exercise. Similarly, I learned the most about basic computer architecture by programming the 8050 in ASM as well as C.

And I'm 32. Am I old yet? I'm not right? Right?


Sorry pal!

I remember playing Halo in my early 20's, and chatting with a guy from LA who was 34. Wow, he's so old, why was he still playing video games.

Here I sit in my late 30's...still playing games when I have time, denying that I'm old, despite the noises I make getting up and random aches and pains.


OK. I am definitely old...

- I mostly get no joy from playing video games anymore these days.

- I find myself rejecting some contemporary stuff in favour of stuff "I am used to".

Damn! I was always looking at my dad, not understanding those very same things :) Now I am a dad myself...


Haha same. I think the problem is that it seems to hit you all at once. I was pretty open minded and go with the flow until at least my early 30s, now I hate any kind of change at all.

Re: video games, I had a similar experience but play for different reasons. I used to have fun blasting people and chatting on xbox live... don't enjoy any of that anymore. All single player for me, and no FPS. A cool physics game like Rocket League(admittedly, not single player), or a puzzle game like The Turing Test, or a mystery game like Kona are way more interesting to me than say, Call of Duty.


> Am I old yet? I'm not right? Right?

Yes, but being old isn't bad. I, for one, long for the day when I'll be unable to produce and contribute to society.


40s are new thirties, my friend. Also, painkillers help.


IMO memory-mapped IO is the coolest thing since sliced bread. It's a great example in computing where many different kinds of hardware can all be brought together under a relatively simple abstraction.


It was a glorious "click" when learning embedded programming. Even when writing Rust in typical desktop uses, it all feels... abstract. Computer program logic. Where does the magic happen? Where do you go from abstract logic to making things happen? The answer is in volatile memory reads and writes to memory-mapped IO. You write a word to a memory address, and a voltage changes. Etc.


relatively simple until you get into cache coherence (not an issue if you mark the memory as volatile)


The volatile keyword in C has nothing to do with cache coherence and does not prevent cache coherence issues.

It just forces the compiler to generate all memory accesses to this variable. And if this variable is in a cached memory region, then it opens the door to the usual cache coherency problems.


Memory mapped IO is never cached as far as I know, so at least that's not a concern.


You'd still have to tell the compiler it's volatile if you were writing into the DMA buffer directly for some reason. GP just used the wrong words to say what they meant.


One of the best classes I took was building Space Invaders on an FPGA dev board. It had 2 PowerPC CPUs, an AC97 sound chip, VGA output, etc.

We built the game in C, wired up the interrupts to the sound chip, wrote to the mapped addresses, etc.

Then we re-did the sound code by writing a DMA controller in the fpga. It was so educational. We had to read and write to the shared bus, follow the commands of the arbiter, send and receive interrupts, etc.

The final bit was a creative assignment, where you could add whatever hardware component you wanted. Some did old Atari controllers, some worked in a separate scoreboard. We did wireless chips via SPI to make it a two player game.

All in, it was one of the best hands on educational experiences. There was no way to complete the class and not have built an intuition for how things work.


Going one deeper, how does the communication work on a physical level? I’m guessing the wires of the PCI Express bus passively propagate the voltage and the CPU and GPU do “something” with that voltage?


Before you get that deep, you need to step back for a bit. The CPU is itself several different processors and controllers. Look at a modern Intel CPU: https://www.anandtech.com/show/3922/intels-sandy-bridge-arch.... The individual x86 cores are connected via a ring bus to a system agent. The ring bus is a kind of parallel bus. In general, a parallel bus works by having every device on the bus operating on a clock. At each clock tick (or after some number of clock ticks), data can be transferred by pulling address lines high or low to signify an address, and pulling data lines high or low to signify the data value to be written to that address.

The system agent then receives the memory operation and looks at the system address map. If the target address is PCI-E memory, it generates a PCI-E transaction using its built-in PCI-E controller. The PCI-E bus is actually a multi-lane serial bus. Each lane is a pair of wires using differential signaling (https://en.wikipedia.org/wiki/Differential_signalling). Bits are sent on each lane according to a clock by manipulating the voltages on the differential pairs. The voltage swings don't correspond directly to 0s and 1s. Because of the data rates involved and the potential for interference, cross-talk, etc., an extremely complex mechanism is used to turn bits into voltage swings on the differential pairs: https://pcisig.com/sites/default/files/files/PCI_Express_Ele...

From the perspective of software, however, it's just bits sent over a wire. The bits encode a PCI-E message packet: https://www.semisaga.com/2019/07/pcie-tlp-header-packet-form.... The packet has headers, address information, and data information. But basically the packet can encode transactions such as a memory write or read or register write or read.


Oh! I know this one! PCI itself is composed of several layers: a physical layer, a link layer, a transaction layer, and application layers.

At the physical level we're literally pushing the limits of semiconductor physics as an industry. :-) There's all sorts of tricks, like 8b10b encodings[1] to address error handling and recovery, and then different ways to itself transmit the PCI signal. You can, for example, encode a PCIe signal on an optical cable. :-) The bus itself is divided into "lanes" which can be aggregated into various "widths" - so you'll see x1, x4, x8, x16 etc. This compounds the problem, in that each lane is negotiated. See this video[2] for what appears to be a nice overview. I haven't watched the video in depth, but clicking around, this seems reasonable to get a sense of what's what.

There's a good reference on the rest of the protocol here: https://www.mindshare.com/files/ebooks/pci%20express%20syste... . Put bluntly - it's amazing that anything works at all, technology is as cheap as it is, ubiquitous as it is, and continues to get cheaper, and more powerful on a daily basis.

If you want your mind blown - look up how modern WiFi works, and what we take for granted today. Beam forming. [3] 256-QAM encoding.[4]

Source: I wrote firmware for very, very large servers at one point in my career. Code that I wrote is (AFAIK) still running, writing a 32-bit test pattern into a register once every second to validate that a particular chip is working OK. That test pattern? 0x4D494B45, which is "MIKE" in ASCII. :-)

Math (and science) is amazing. :-)

[1] https://en.wikipedia.org/wiki/8b/10b_encoding

[2] https://www.youtube.com/watch?v=EHkuzkNWXFk

[3] https://www.youtube.com/watch?v=xGkyZw98Tug

[4] https://www.minim.com/blog/what-is-qam


8b/10b is, like, 3 PCIe generations old (PCIe 2)! The next thing was 128b/130b (PCIe 3-5) and apparently 6 is moving to 242B/256B.


Fascinating. So I guess I date myself there. But - it makes sense having read a bit. There’s remarkable complexity. “Strong statistical bounds” is again a reason for me to say “it is amazing anything works, anywhere.” I say this typing on a capacitive piece of glass that happens to be a multitouch display, glued to a battery, a quad core processor, and a very precisely arranged lump of sand and gold that’s able to literally shout into the microwave aether to a thingie on a pole in the middle of the air, which then is relaying (encrypted btw) data to some box somewhere that results in an interrupt occurring, a cpu servicing that, and many other things happening many times to result in random people somewhere eventually reading this message.

We’ve come pretty far from the telegraph, haven’t we?


Voltages, yes.. usually it's all binary digital signals, running serial/parallel and following some communication protocol. Maybe you should have a look at something really simple/old like UART communication to get some idea how this works, and then study how this is scaled up over PCIe to understand the chat between CPU/GPU?

Or maybe not, one does not need all the details, so often just scaled concepts :)

https://en.m.wikipedia.org/wiki/Universal_asynchronous_recei...

Edit: Wait it is really already QAM over PCIE? Yeah then UART is a gross simplification, but maybe still a good one to start with depending on knowledge level?


PCI-E isn't QAM. It's NRZ (a type of pulse amplitude modulation with just two amplitude levels), with 8/10b, 64/66b, or 128/130b encoding, and then scrambled to reduce long runs of 0s or 1s. I think PCI-E 6.0 uses PAM-4 with 4 amplitude levels.


https://pcisig.com/sites/default/files/files/PCI_Express_Ele... It doesn’t say QAM explicitly but it has all the QAM terminology like 128 codes. Inter symbol interference etc. I’m not an RF guy by any stretch but it sounds like QAM to me.

This is an old spec. I think it’s like equivalent to QAM-512 for PCIe 6


128/130b encoding is used to help embed a clock signal in the data stream. After every 128 bits of data, the receiver will see a preamble of 01 or 10. That voltage transition at predictable intervals helps the receiver keep the clock in sync.

Intersymbol interference likewise applies to pretty much any high-speed digital transmissions. At high frequencies, you have to worry about things like the signal reflecting off the other end of the circuit trace, which creates inter-symbol interference.

QAM is a modulation technique. It specifies how symbols of one or more bits are represented as analog waves on the wire. PCI-E does not use QAM. It simply represents 0s and 1s as high and low voltage swings, what's called Pulse Amplitude Modulation. Other modulation techniques encode symbols by varying the amplitude or phase (or both) of a carrier wave. QAM works by modulating the amplitude of two carrier waves 90 degrees out of phase (e.g. sine and cosine) based on a data signal and summing them together: https://www.techtarget.com/searchnetworking/definition/QAM


Oh cool. Thanks for that. So basically it uses similar terminology because it's modulated; it's just that QAM is a different kind of modulation?


It’s similar terminology because it’s dealing with sending digital signals at high speeds using analog signals. “Modulation” just means the scheme by which sequences of bits are represented in analog signals. There necessarily has to be some sort of modulation. But the terminology is more related to common issues that arise from the receiving end reading the small, rapidly changing voltages and recovering the bit stream. Regardless of the modulation you face similar problems with keeping the clock in sync etc.


Oh, that is several levels deeper! PCIe is a big standard with several layers of abstraction, and it's far from passive.

The different versions of PCIe use a different encoding, so it's hard to sum it all up in a couple sentences in terms of what the voltage does.


Older CPUs - the CPU had a bunch of A pins (address), a bunch of D pins (data).

The A pins would be a binary representation of an address, and the D pins would be the binary representation of data.

A couple of other pins would select behavior (read or write) and allow handshaking.

Those pins were connected to everything else that needed to talk with the CPU on a physical level, such as RAM, I/O devices, and connectors for expansion. Think 10-base-T networking where multiple nodes are physically modulating one common wire on an electrical level. Same concept, but you have many more wires (and they're way shorter).

Arbitration logic was needed so things didn't step on each other. Sometimes things did anyway and you couldn't talk to certain devices in certain ways or your system would lock up or misbehave.

Were there "switches" to isolate and select among various banks of components? Sure, they are known as "gate arrays" - those could be ASICs or implemented with simple 74xxx ICs.

Things like NuBus and PCI came about - the bus controller is directly connected and addressable to the CPU as a device, but everything else is connected to the bus controller, so now the new-style bus isn't tied to the CPU and can operate at a different speed and CPU and bus speed are now decoupled. (This was done on video controllers in the old 8-bit days as well - to get to video RAM you had to talk to the video chip, and couldn't talk to video RAM directly on some 8-bit systems).

PCIE is no longer a bus, it's more like switched Ethernet - there's packets and switching and data goes over what's basically one wire - this ends up being faster and more reliable if you use advanced modulation schemes than keeping multiple wires in sync at high speeds. The controllers facing the CPU still implement the same interface, though.


> Older CPUs - the CPU had a bunch of A pins (address), a bunch of D pins (data).

And on some buses, some or all A/D pins were multiplexed, so a transaction was divided into an address-setup part and a data read/write part.


It's signaled similarly to QAM. It's far more complicated than GPIO-type stuff. Think FM radio / spread spectrum rather than bitbanging / old-school serial / parallel ports.

Similar to old school modems if the line is noisy it can drop to lower “baud” rates. You can manually try to recover higher rates if the noise is gone but it’s simpler to just reboot.


It might be easier to start with older or simpler/slower buses. ISA, SPI, I2C. In some ways ISA is very different - latching multiple parallel channels together instead of ganging independent serial lanes. But it makes sense to start off simple and consider the evolution. Modern PCIe layers several awesome technologies together, especially FEC. Originally they used 8b10b but I see now they're using 242b256b.



