Speaker: Dominik Ernst, NHR@FAU
What’s so special about those NVIDIA H100 GPUs that are (almost) as valuable as gold? This month’s HPC Cafe shines a light on the capabilities and differences of all the different GPUs currently sold for HPC. We try to answer questions like: How does the older A100 GPU, which the NHR@FAU has in spades, hold up to its newer sibling? Why does Alex also have another type of GPU, the A40? What’s the deal with the GPUs by “the other vendor,” which are powering the fastest (documented) supercomputer in the world? Can the new competitor, Intel, build GPUs that power supercomputers instead of just notebooks? What is an “APU” and why does everyone want to build one?
Slides: https://hpc.fau.de/files/2023/10/toptrumps.pdf
Material from past HPC Cafe events is available at: https://hpc.fau.de/teaching/hpc-cafe/
Today’s topic is GPU Top Trumps — that’s what I called it because that’s what I’ve been reminded of. In German it’s called Supertrumpf or Quartett: you have these car cards where you say “my car accelerates faster,” and the other person says “yeah, but mine is heavier” or whatever; you pick a category, and if your card wins you get both cards, and that way the cards change hands. I have mocked this up — this game doesn’t exist, it’s only in my imagination, and I only have those three cards, even though I pretend with the serial numbering that there are more. But I’ve mocked it up a little bit, just for fun, with what the categories could be. And actually I think, at least for these three cards, the game would be kind of balanced, because in every category you would be able to beat the other cards at least somewhat. For example, with the A100 you could beat everyone in the bandwidth category. On the other hand — wait, does the Intel GPU lose in everything? Oh, that looks bad. Yeah, I think it loses in everything to the AMD one. Too bad; there’s always a card that sucks, I guess that’s how it is.
Okay, but let’s dive in. These are some of the categories I have in here: the DRAM capacity, clock frequencies, bandwidth, and then a bunch of floating-point execution rates — and these differ for all of the GPUs. All of the GPUs have a somewhat similar structure; that’s what GPUs have kind of converged to. This is an NVIDIA marketing graph: the big image in the background is a schematic of the whole chip, and you see a lot of these repetitive structures, these subunits called SMs, or streaming multiprocessors in NVIDIA speak, all centered around this large level-two cache in the middle. And then — it’s probably too small to see properly — around the edges there are the HBM interfaces, the memory interfaces to the memory chips, to move data to and from memory. The breakdown of one of these SMs shows a bunch of execution units, and the number of execution units each vendor puts into each of these units differs a little bit. They all have fairly similar numbers of SMs — those don’t differ too much — but the number of execution units they put in differs somewhat. Not too different is this other marketing picture, this time from AMD, just to show the balance: you see again lots of repetitive structures, and if you zoom in on one of these — definitely too small for you to see — there’s a bunch of units that do stuff.
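As an aside — if you want to read a few of those “card categories” off a GPU yourself rather than from a data sheet, the CUDA runtime exposes them. A minimal sketch; the bandwidth line uses the usual double-data-rate rule of thumb, and the reported clocks are nominal, not what the chip sustains under load:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Read a few of the "card categories" straight from the driver instead of a
// data sheet. Peak flop rates are not reported directly; those you still work
// out from unit counts and clocks, or look up from the vendor.
int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    printf("name:             %s\n", p.name);
    printf("SM count:         %d\n", p.multiProcessorCount);
    printf("boost clock:      %.2f GHz\n", p.clockRate / 1e6);     // reported in kHz
    printf("DRAM capacity:    %.1f GB\n", p.totalGlobalMem / 1e9);
    printf("L2 cache:         %.1f MB\n", p.l2CacheSize / 1e6);
    printf("memory bus width: %d bit\n", p.memoryBusWidth);
    // Rule-of-thumb theoretical bandwidth: 2 transfers per clock (DDR) times bus width.
    printf("theoretical BW:   %.0f GB/s\n",
           2.0 * p.memoryClockRate * 1e3 * (p.memoryBusWidth / 8) / 1e9);
    return 0;
}
```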
What do these units do, and why are there different units at all? Because we need to differentiate a bunch of things, and this depends very much on the application. The thing here is floating-point number formats: there are different kinds of floating-point operations depending on the precision they operate with. The big thing that GPUs traditionally do very well is the top one, single precision — 32-bit, or FP32. Computer graphics usually just uses FP32; at some point everyone standardized on that because it works well, and this is what GPUs traditionally do very well. For the traditional HPC market, single precision is not quite enough — or people want more, either because they’re too lazy to think about whether less precision would also work, or because they simply standardized on double precision. That is the format at the bottom, and I have these bars as a kind of representation of how many bits are allocated to each part: the mantissa, the exponent, and one sign bit. You can see double precision is, well, double as wide as single precision — makes sense, I guess. So most traditional HPC applications use double precision. There are a bunch of HPC applications that can use single precision; for us it’s mostly MD. MD is fine with single precision, I guess because it’s so chaotic that pretty much everything diverges very quickly anyway, and then it’s only statistics, so the full precision doesn’t buy much. I think Lattice Boltzmann also works somewhat well with reduced precision, and like I’ve said, computer graphics uses it a lot.
Where all these precision things have really come into play in recent years is with the even smaller formats. The oldest of them is FP16, or half precision. Then there are formats like TF32, as NVIDIA calls it, which I think is cheating, because nothing in TF32 is 32 — it’s actually only 19 bits if I count correctly. They use the same 8-bit exponent as single precision, but just chop off some digits of the mantissa. That way it’s very easy to convert: to go from TF32 to single precision you just pad with zeros, and the other way around you chop off some digits; aside from the precision they have similar properties. Then there’s stuff like bfloat16 — I think the B stands for “brain float,” I don’t know who comes up with these names — where they shift a few more bits toward the exponent, because they think we don’t need that much precision and having a large range is more important. That is apparently better for machine learning; all these formats are mainly used in the training of neural networks. I have left out stuff with even smaller precision — FP8 is something people do, integer 8, sometimes integer 4 — but that’s really only usable for inference. For training a network you need at least something like FP16, because you have a lot of weights and you change them very slightly: you compute a new gradient, and the change is very small, otherwise it becomes unstable, I guess. And because this change is so small and you accumulate over a lot of small contributions, you need at least a certain precision for those contributions to actually make a difference. So for training, somewhat higher precisions; for inference, pretty much anything goes — like I said, even integer 4. But that’s not what I’m going to talk about today.
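To make the bit-layout point concrete: a TF32 value is just an FP32 value with the low 13 mantissa bits dropped (so padding with zeros gets you back), and BF16 is simply the top 16 bits of the FP32 pattern. A minimal host-side sketch of that, using plain truncation, which glosses over the hardware’s actual rounding:

```cuda
#include <cstdint>
#include <cstdio>
#include <cstring>

// Reinterpret a float's bits as a 32-bit integer and back.
static uint32_t bits_of(float f)     { uint32_t u; std::memcpy(&u, &f, sizeof u); return u; }
static float    float_of(uint32_t u) { float f;    std::memcpy(&f, &u, sizeof f); return f; }

// TF32: keep the sign and the full 8-bit exponent, but only 10 mantissa bits,
// i.e. zero out the low 13 mantissa bits (19 significant bits in total).
static float to_tf32(float f) { return float_of(bits_of(f) & 0xFFFFE000u); }

// BF16: keep the sign, the full 8-bit exponent and 7 mantissa bits,
// i.e. simply the top 16 bits of the FP32 pattern.
static float to_bf16(float f) { return float_of(bits_of(f) & 0xFFFF0000u); }

int main() {
    float x = 1.2345678f;
    printf("fp32: %.9f\n", x);
    printf("tf32: %.9f  (same range as fp32, coarser mantissa)\n", to_tf32(x));
    printf("bf16: %.9f  (same range as fp32, even coarser)\n", to_bf16(x));
    return 0;
}
```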
Okay, now I have a table — ah, sorry, one more thing first. Like I’ve said, floating-point operations are not all equal: there are different number formats, but also some of the execution units they put in there are not as flexible as others. What NVIDIA introduced are those tensor cores that do small matrix-matrix multiplications in hardware. That’s a very nice thing because you save, I think, mostly on register bandwidth: you can reuse a lot of data and need to move less of it around, and these days, in chips, you spend more energy moving data around than doing the actual computation. So it’s more efficient to implement a full matrix-matrix multiplication than to implement the same number of vector floating-point operations per cycle. But those floating-point operations are only really useful if you have a problem that maps to these kinds of small matrix-matrix multiplications. That works very well for neural networks, where you very often have exactly these small matrix-matrix multiplications, and it also works for, well, matrix-matrix multiplications of course — the most prominent one being LINPACK. For the HPC GPUs — for example the A100 or the H100 — the tensor cores are actually FP64 capable and then deliver double the FP64 rate; in the consumer GPUs they don’t put that in because it would be wasted, but the bigger GPUs, the HPC GPUs, can actually use the matrix cores or tensor cores, or whatever they’re called, for LINPACK, for matrix-matrix multiplications. That’s nice, but I think matrix-matrix multiplications in HPC are not the norm, rather somewhat of an exception; only a few problems really map to a pure matrix-matrix multiplication. So in the rest of this talk I’m going to count vector FP64, vector FP32, and tensor-core or matrix-core reduced precision, because that’s what works for machine learning.
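For a feel of the granularity the tensor cores work at, here is a minimal sketch using CUDA’s public WMMA interface: one warp computes a single 16×16 FP16 tile with an FP32 accumulator. Real libraries (cuBLAS, the ML frameworks) tile much larger matrices over many warps and use shared memory; this only illustrates the small fixed-size multiply the hardware provides, and it needs a tensor-core-capable GPU (compile with e.g. -arch=sm_80).

```cuda
#include <cstdio>
#include <cuda_fp16.h>
#include <mma.h>

using namespace nvcuda;

// One warp multiplies a single 16x16 tile: C = A * B, FP16 inputs, FP32 accumulator.
__global__ void tile_mma(const half *A, const half *B, float *C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;

    wmma::fill_fragment(c_frag, 0.0f);        // start from C = 0
    wmma::load_matrix_sync(a_frag, A, 16);    // leading dimension 16
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);   // one tensor-core MMA
    wmma::store_matrix_sync(C, c_frag, 16, wmma::mem_row_major);
}

int main() {
    half *A, *B;
    float *C;
    cudaMallocManaged(&A, 16 * 16 * sizeof(half));
    cudaMallocManaged(&B, 16 * 16 * sizeof(half));
    cudaMallocManaged(&C, 16 * 16 * sizeof(float));
    for (int i = 0; i < 16 * 16; i++) { A[i] = __float2half(1.0f); B[i] = __float2half(1.0f); }

    tile_mma<<<1, 32>>>(A, B, C);             // a single warp drives the tile
    cudaDeviceSynchronize();
    printf("C[0] = %.1f (expect 16.0 for all-ones inputs)\n", C[0]);
    return 0;
}
```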
So now I have a table with a bunch of GPUs, many of which we actually have here in our computing center. The RTX 3080 is in TinyGPU — a few of them, I think. The NVIDIA A40 is a big part of Alex. Those two GPUs are actually somewhat similar because they use the same chip, the same silicon, but the A40 has many more units enabled; in the RTX 3080 a lot of units are disabled, and it’s also an actual consumer gaming product that you put into your computer and plug your display into, as opposed to the A40, which is a server-only product. Then we have the NVIDIA A100, also a big part of Alex, which is the HPC-focused counterpart to the A40, and the NVIDIA H100 of the Hopper generation, the next generation. I have to say something about the versions, because for both the A100 and the H100 there are several different versions around. In Alex we have an SXM4 version — that is the version we saw right at the beginning, where you have these boards that plug directly onto the main board. It’s not like in the past, where the graphics card was something you put into a PCI Express slot; these are much more integrated, this works differently these days. So for the A100 we have this SXM version, specifically the 80 GB version. For the NVIDIA H100 we have so far only had access to the PCI Express version, and the PCI Express version is slower than the SXM version in a number of things — mostly the power budget, and with a reduced power budget come reduced clocks and a smaller unit count, and in this case actually also less memory bandwidth. So if we now compare, say, the A100 to the H100, it’s not quite a fair comparison, because the A100 we have is the high-end variant and the H100 we have access to is a low-end variant — some of the generational progress looks smaller here than it actually is, or will be. Then we have two AMD graphics cards. The MI210 is the chip that powers the currently fastest supercomputer in the Top 500 list — although the one in Frontier is a special version, a bigger version, where they combine two of these chips into one GPU, and we only have the smaller version. So again, for the MI210 we have the small one. And on the very right we have two Intel GPUs: the Max 1100, which we actually have here, and this again is roughly a half version of the Max 1550, which we don’t have access to yet.
Okay, with that out of the way — there’s a question: are those specs from the docs, or did you measure them? — Those are specs, yes. I did not measure those rates, because if you have an application that is built the right way, you can probably get pretty close to the peak rate; sometimes there are limits — I think for the RTX 3080 you need the right instruction mix to really get that rate — but rarely is the peak rate here really the actual limiter. So these are values that I just looked up on the Internet.
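The way such a spec number comes about is just unit count × operations per unit per cycle × clock; a tiny sketch, with illustrative values that roughly correspond to the A100’s published FP32 figures:

```cuda
#include <cstdio>

// How a spec-sheet peak rate comes about: unit count x flops per unit per cycle x clock.
// Illustrative values, roughly the A100's published FP32 numbers.
int main() {
    double sms       = 108;    // streaming multiprocessors
    double lanes     = 64;     // FP32 lanes per SM
    double flops_fma = 2;      // one fused multiply-add = 2 flops per lane per cycle
    double clock_ghz = 1.41;   // boost clock
    printf("theoretical FP32 peak: %.1f TFlop/s\n",
           sms * lanes * flops_fma * clock_ghz / 1000.0);
    return 0;
}
```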
In the FP16 domain we can see a large generational increase for the NVIDIA GPUs: NVIDIA has invested quite a bit of new silicon in every generation to make that faster, and I think the throughput per core roughly doubles from the A40 to the A100 to the H100 with each generation; together with some clock increases and unit-count increases we get quite large increases in those throughputs. The other GPUs, the AMD GPUs, haven’t put that much focus on the machine-learning side yet. That’s probably going to change, because everyone wants a piece not just of our cake but of the AI cake too. And we can see that the Intel GPUs have gone all in: they have the highest value there, so they can pretty much tie with the H100. For the vector compute — sorry, excuse me, single-precision FP32 compute — I think the differences are smaller; we even see a slight regression, the A100 has slightly less, which has reasons I won’t go into right now.
For 64-bit, for double precision, there is one big difference: the two leftmost GPUs — those are not typos, they really are that little. It’s only half a teraflop per second, where other GPUs put out something like 50 teraflops. That’s because they’re consumer GPUs and they have a very, very low double-precision throughput on purpose, because gaming doesn’t need it. For comparison I’ve also put in the TDPs, the power consumption values. These are spec values; how much power the GPU actually needs can vary. I also wanted to include them because the Intel GPU on the very right looks like it takes the crown in every single discipline — it looks like the absolute champion here — but you can also see that it uses up to double the power of some of the GPUs. So how good it really is remains to be seen; we also don’t have one. Based on the specs, I think in terms of energy efficiency it is slightly, but only slightly, better than the top model from the other vendors, which was the top candidate before. — Based on specs, you said, right? Just specs? — Based on specs, yeah, of course. What is still open, and what I really don’t know, is which GPUs can actually keep up their clocks if you fully use some of the units; it might be that some of them drop the clocks a bit to stay within the TDP. And also, this chip here on the right is massive, a massively big chip, so one could probably also put in area, to have an area efficiency or something; other chips, like the MI210 for example, are pretty small,
and the A40 and the RTX 3080 are pretty small chips too. Okay. — I don’t know that much about chip design, so maybe it’s a stupid question, but is there a technical reason why you can’t simply do twice as many SP flops as DP flops everywhere? Why didn’t Intel do that if they went all in — if I already have a 64-bit register anyway? — Good question: why is the usual ratio one to two? I think for additions it makes sense — you can reuse some of the circuitry you use for FP64 to do twice as many FP32 operations. That’s actually not true for multiplications, where you would need roughly four times as much circuitry. And the data path is very often laid out so that double rate just makes sense; I think double rate is technically the simplest thing. Why the Intel GPUs, for example, have an equal rate for double precision and single precision, I don’t quite know. — [From the audience:] I probably know no more than you, but this thing is based on — the core is based on — their embedded GPU; it’s the same core as the previous-generation embedded GPU, and I think the internal architecture is simply SIMD with a fixed word width, so with the same register size you would fit twice as many FP32 values as FP64 values. — I don’t know; I guess it could also be an instruction-throughput thing. For example, on the AMD side they doubled their FP64 throughput from the MI100 to the MI210, so that’s single-cycle throughput now, and they have packed SIMD-2 throughput for FP32: you need to pack two FP32 values into one instruction to actually get the quoted single-precision number for the MI210. So they increased that number and came up with a trick to actually make it usable for single precision.
Okay, but I guess enough of that. The other big part besides computation is bandwidth, and memory bandwidth is the biggest one. On the top line here we have the spec number, the theoretical rate of the memory interface, and right below it, in the second row, a measured number — I haven’t measured numbers for all of the GPUs. The bars for the first two rows have the same scaling, so you can kind of see, for example, that for the A100 you get 1,732 GB/s out of 2,039: there’s always a certain efficiency involved in how much of the bandwidth you can actually get, and you could compute that ratio here. Some GPUs have worse efficiency than others — I think, for example, the Max 1100 here would only be at something like two thirds efficiency; on the other hand you have the A40, which gets 655 out of 696, which is something like 95% efficiency. So those numbers always vary a bit.
Something else we can see: the two gaming GPUs have less bandwidth, because you don’t need as much for gaming. The two HPC GPUs we have, the A100 and the H100, show the same bandwidth, because the PCI Express version of the H100 doesn’t have as much; the SXM version I think goes up to 3 TB/s, though how much of that it can actually realize I’m not sure. And then the others are in similar ranges, I guess. How much of its 3.3 TB/s the Max 1550 can realize we also don’t know — probably only something like two terabytes per second, if we extrapolate from the numbers of the smaller Max 1100 — so it probably won’t even be that much faster than the other GPUs in actually realized bandwidth. One other thing I’ve measured is the level-two cache bandwidth, just as a data point — an interesting data point, where you see a very clear generational improvement from the A40 to the A100 to the H100, and the other GPUs are in similar ranges. Let’s keep it at that.
Next, the cache values: how much L2 cache these GPUs have. I’m talking about this because I think it’s somewhat relevant, or could be relevant, for the benchmarks I show next. There’s a split: some of the GPUs have only in the range of five, six, or eight megabytes of level-two cache, and that used to be the traditional value for computer graphics, where you get good reuse out of textures and 3D models and such. The A100 and the H100 have increased that quite a bit, to 40 and 50 megabytes, and the Intel GPUs go really all out — they even call it their “Rambo cache” — with something in the range of a hundred to several hundred megabytes, and that suddenly opens possibilities that weren’t there before. The DRAM size, I think, is not so interesting: gaming chips have less DRAM because there it’s not important, others have more. Question in the chat: what’s the architectural reason for the difference between measured and spec DRAM bandwidth? — That’s a good question that I can’t really answer.
There’s a memory controller that gets tons of requests, and I think it tries to batch as many requests to the same memory page together as it can, because that makes things faster, and there’s a trade-off: does it wait longer for more requests to come in, or does it process them sooner and get lower latency but also less throughput? So there’s tuning involved in the memory controller, and the theoretical rate is really only the electrical interface width, while the other number is how many actually useful transactions you can get through. It’s also an interesting question what happens if we do accesses that are not as regular as the ones measured here, where I iterate through an array in linear fashion: in many cases the bandwidth then drops even more, and I have seen, for example, that the AMD GPUs are less robust there — the achievable memory bandwidth drops more quickly than on the NVIDIA GPUs. Why that is, I really don’t know.
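What “less regular” can do to the achieved bandwidth is easy to try out with a variation of the copy kernel — same idea as before, but only every stride-th element is touched, so entire cache lines are fetched but only partly used. This is just a crude probe of the effect, not the benchmark behind that statement:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Touch only every stride-th double: whole cache lines are still fetched from
// DRAM, but only partly used, so the *useful* bandwidth drops with the stride.
__global__ void strided_copy(double *dst, const double *src, size_t nthreads, int stride) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < nthreads) dst[i * stride] = src[i * stride];
}

int main() {
    const size_t n = 1ul << 27;                       // 1 GiB per array
    double *a, *b;
    cudaMalloc(&a, n * sizeof(double));
    cudaMalloc(&b, n * sizeof(double));
    cudaMemset(a, 0, n * sizeof(double));

    for (int stride = 1; stride <= 8; stride *= 2) {
        size_t nthreads = n / stride;
        cudaEvent_t t0, t1;
        cudaEventCreate(&t0);
        cudaEventCreate(&t1);
        strided_copy<<<(nthreads + 255) / 256, 256>>>(b, a, nthreads, stride);  // warm-up
        cudaEventRecord(t0);
        strided_copy<<<(nthreads + 255) / 256, 256>>>(b, a, nthreads, stride);
        cudaEventRecord(t1);
        cudaEventSynchronize(t1);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, t0, t1);
        double useful_gb = 2.0 * nthreads * sizeof(double) / 1e9;   // read + write
        printf("stride %d: %.0f GB/s of useful data\n", stride, useful_gb / (ms / 1e3));
        cudaEventDestroy(t0);
        cudaEventDestroy(t1);
    }
    return 0;
}
```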
Okay, now I have one more application where we have measured benchmarks to compare. These are GROMACS benchmarks from Anakala, which she measured and generously shared; they may actually have been shown here in the HPC Café before, I’m not 100% sure. We have six different systems, which vary in size — I think they increase in size from one to six. One observation we can make: the H100 is always the fastest, that’s the first observation. Then we can see that the relative performance of the smaller GPUs — the A100, the A40, and the RTX 3080 — decreases as the systems get bigger. Do they get slower because the systems get bigger? Well, yes, in the sense that the numbers get smaller, but I think it’s mostly that the bigger the GPU is, the more difficult it is to actually utilize it fully. We can of course try to correlate some of the numbers from before — for example the single-precision rate or the memory bandwidth — with the resulting performance. One thing that’s pretty clear is that it doesn’t correlate at all with the DRAM bandwidth: we have very little DRAM bandwidth on the two leftmost GPUs, and the A100 and the H100 have the same DRAM bandwidth, but the performance differences are nowhere near that ratio. That’s not surprising, because molecular dynamics — or at least these GROMACS benchmarks — usually isn’t, and shouldn’t be, memory bound. Instead, about the data-set sizes: I think one of the systems has on the order of 500,000 atoms, and I don’t quite know what the others have, but I estimate that the data footprint of those systems ranges somewhere between 5 megabytes and 500 megabytes, which clearly spans those L2 cache sizes. So one could think that some of the bigger GPUs would fare better there — but can we really cluster the results according to the L2 cache sizes? No, we can’t; it’s really not that clear. The speed-up from the smallest to the biggest GPU isn’t actually that large, and that’s where other things come in, like there still being CPU work that I don’t know the details of, which just limits the amount of scaling that you can see. And I think the GPUs that get relatively slower with increasing system size are the smaller ones that are already fairly fully utilized even for the small systems, and the ones that get faster are the ones that need more parallelism, or have higher latencies or higher startup overheads or whatever, to really get going. That is especially true, I think, for the two AMD GPUs, which actually get relatively faster when the systems get bigger, even compared to the very big H100 GPU.
I have one more, one last slide, where I try to measure this overhead, this startup overhead. This is a stream kernel, a streaming kernel that just copies data from one array to the other, for two of the GPUs: the MI210, the AMD GPU, and the V100, a slightly older NVIDIA GPU — I chose that one because they’re in similar ballparks in terms of speed. We have the grid sizes, or the array sizes, on the x-axis, and we can see that for very small grid sizes the startup overhead is so large — it takes much longer than anything else — that the performance is very low, and as we increase the grid sizes we get this ramp-up effect. I actually have a fitted red line here that tries to model this with a very simple startup time plus bandwidth, and I fit something like five or ten microseconds.
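The fitted red line is nothing more than this two-parameter model: total time = startup time + data volume / asymptotic bandwidth, and the effective bandwidth is the volume divided by that total. A small sketch that just evaluates the model — the two parameter values below are made-up ballpark numbers, not the actual fit results:

```cuda
#include <cstdio>

// Effective bandwidth of a copy kernel under a "startup time + streaming time" model.
int main() {
    double t_startup_s = 5e-6;     // e.g. ~5 microseconds of launch/ramp-up overhead
    double bw_bytes_s  = 1.5e12;   // e.g. ~1.5 TB/s asymptotic copy bandwidth
    for (double kib = 64; kib <= 16 * 1024 * 1024; kib *= 8) {
        double bytes = 2.0 * kib * 1024.0;                // read + write stream
        double t     = t_startup_s + bytes / bw_bytes_s;  // total kernel time
        printf("%10.0f KiB  ->  effective %7.1f GB/s\n", kib, bytes / t / 1e9);
    }
    return 0;
}
```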
Especially for the smaller systems, like the top one here, system one, I think we estimated the time for one time step to be in the hundreds of microseconds — but that of course involves a lot of steps, a lot of different kernels, and maybe also some synchronization with the CPU. So one can imagine that the time for each single kernel is probably in the range of this startup overhead, so it probably plays a significant role. In this plot we can see that the V100, the NVIDIA GPU, ramps up much more quickly for small grid sizes, even though in the end, for very large grid sizes, the AMD GPU, the MI210, is faster — it ultimately has the higher memory bandwidth — but for the smaller grid sizes its large startup overhead is just too slow and holds it back. — Is this difference in startup overhead rooted in hardware, or is it a software issue?
That’s an interesting question; I’m not sure — I think it’s both. One thing is certainly the software layer and how it handles initiating a kernel launch. Another is probably spreading the work across the actual execution units. Yet another is how long a single thread of a kernel takes from start to finish: the longer that time is, the longer it takes for the very last thread to finish executing, and everyone else has to wait for that very last thread. So I think it’s both.
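The software part of that overhead — just getting a kernel launched at all — can be probed with an empty kernel. A minimal sketch along those lines; note that this measures back-to-back launch cost on an otherwise idle GPU, which is only a lower bound on the full start-to-finish overhead the fit sees:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void empty() {}

int main() {
    const int reps = 10000;
    empty<<<1, 1>>>();                 // warm-up: context creation, module load
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < reps; i++) empty<<<1, 1>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("back-to-back launch cost: %.2f us per kernel\n", 1000.0f * ms / reps);
    return 0;
}
```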
And for the V100 you can actually also see that there is a very clear L2 peak — a bump here, or a peak rather — where we compute out of the level-two cache. For the MI210 there is something like a bump, but the ramp-up is so slow that it doesn’t profit from it at all. — Do you have some preliminary data for the Intel GPU as well? — Yes, I do. I don’t remember the values, but I think the V100 is pretty quick with those: I have a fitted value here of about 3,000 cycles of startup overhead, I think some of the bigger GPUs go to 5,000, and maybe the Intel GPU was in the range of 7,000 cycles — something like that.
— Because with this big cache you should clearly see this. — Yeah, the bump is very clearly visible for the Intel GPU, yes. Okay, so much for the GPUs of today that we actually have; now one very quick outlook about the APUs, which are a future development. Both AMD and NVIDIA are doing this; Intel, I think, said a few months ago that they wanted to as well, but they cancelled the product they had planned, where they wanted to integrate CPU cores with the GPU — now they say they’re not there yet. On the right side we have AMD, who are very actively doing this, and I think this is an annotated die shot of the MI300, which you probably can’t buy yet but which is probably already shipping or sampling or whatever — I don’t know what the state is. Here you can see that these three quadrants contain GPU dies, and on the lower right there is a bunch of CPU dies, and now you have both on the same die, or on the same package, sharing the same memory chips — working on the same memory. That is quite useful, because you no longer have to copy data between the GPU memory and the CPU memory, which so far have been distinct memory pools, and that makes the coupling much tighter. I think for GROMACS, for example — I’m not a GROMACS expert, but I have heard that some of its components still run on the CPU — especially where you have this very fine-grained sharing and switching between CPU and GPU, that could be quite useful. Of course, people have programmed around this limitation of separate CPU and GPU memory pools for the last 15 years and have gotten pretty good at working around it, but especially for new ports there’s a lot less work to actually get the speed out.
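To illustrate what “not having to copy” buys you in code, here is a sketch contrasting the two styles with the CUDA runtime: explicit host/device copies versus a single allocation both sides touch. On a discrete GPU the managed allocation is migrated by the driver behind the scenes; on a coherent CPU–GPU system like the APUs discussed here the same programming style maps onto memory that both sides can genuinely reach. Kernel name and sizes are just for illustration.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void scale(double *x, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0;
}

int main() {
    const size_t n = 1 << 20;

    // Discrete-GPU style: separate host and device allocations, explicit copies.
    double *h = (double *)malloc(n * sizeof(double));
    double *d;
    for (size_t i = 0; i < n; i++) h[i] = 1.0;
    cudaMalloc(&d, n * sizeof(double));
    cudaMemcpy(d, h, n * sizeof(double), cudaMemcpyHostToDevice);
    scale<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h, d, n * sizeof(double), cudaMemcpyDeviceToHost);

    // Shared-pool style: one allocation that both sides touch. Managed memory
    // stands in here for what an APU provides physically in hardware.
    double *u;
    cudaMallocManaged(&u, n * sizeof(double));
    for (size_t i = 0; i < n; i++) u[i] = 1.0;    // CPU writes ...
    scale<<<(n + 255) / 256, 256>>>(u, n);        // ... GPU updates in place ...
    cudaDeviceSynchronize();
    printf("u[0] = %.1f (no explicit copies)\n", u[0]);  // ... CPU reads the result.

    cudaFree(d); cudaFree(u); free(h);
    return 0;
}
```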
On the left-hand side we have the Grace Hopper project, with a Grace CPU — an Arm CPU — and on the right a Hopper GPU. They don’t share exactly the same memory: up here there are LPDDR DRAM chips, and around the GPU there are the HBM chips, but they’re connected with a rather fast interconnect, 900 GB/s I think. For the CPU, that interconnect is as fast as its own memory bandwidth, so the idea is that the CPU can work on the GPU memory pretty much as fast as on its own memory — it looks kind of as if they had the same memory, or at least it doesn’t hurt as much that it’s a separate memory space. So technologically the AMD approach is cooler, or more advanced,
but NVIDIA should get a lot of the benefits of that too. Question in the chat: what distinguishes APUs from many-core processors? So, the term APU was coined back when AMD bought ATI — they bought a graphics card manufacturer — and wanted to include GPUs on a CPU die, and then they said it’s really nice because now it’s not just a CPU, it’s an “accelerated processing unit.” They envisioned a future where a lot of things — spreadsheet processing, JPEG compression, sound, and whatever — would be done by the GPU. That vision never really materialized, also because AMD just didn’t have the resources to actually build such a product, but there were APUs in the desktop consumer market, and there still are; these days it’s common that there’s a GPU on the CPU die. And this actual vision of having the GPU do lots of the work kind of slowly comes to pass now. So the point is: an APU has GPU cores and CPU cores combined on the same chip, whereas many-core processors are just very many small CPU cores. With many-core you still have a core that is a bit of a mix — somewhere in the design space between a CPU core and a GPU core — but a many-core chip is homogeneous, whereas an APU has two different kinds of units on it.
Okay, more questions? — Well, in the desktop and notebook space, no, they just use LPDDR there. In the HPC market, for sure, because in the HPC market you cannot get around HBM, simply because it has so much more bandwidth. The two chips here on the very left use GDDR, and you can see they only do something like 700 GB/s; I think the fastest gaming GPU, the RTX 4080, is around a terabyte per second with GDDR memory, but that is about what HBM GPUs had five, six, seven years ago. So HBM is just so much faster, and that’s why there’s no alternative for HPC. Expensive, though — which doesn’t matter for HPC.
Question in the chat: how does the L2 cache work for APUs — specifically, is it shared between CPU and GPU cores? — No, it’s not, at least today. I know that Intel, in their integrated GPUs, used to have a shared level-three cache, but even that isn’t shared in today’s consumer desktop GPUs. For the MI300 there is no shared level-two cache, because the level-two cache of a CPU is a very core-private thing — each core has its own level two, very close to the core — but they do have a shared… well, what level would you even call it? It’s somewhat difficult, because they all have their own level two, and the CPU also has its own level three, which is still private because it’s on the CPU die — you can actually kind of see it in between here, that there’s a level-three cache. For the GPUs they call it the Infinity Cache, or last-level cache, and that one is shared on the MI300, which is also kind of nice, or kind of exciting, because it also speeds up the data exchange between CPUs and GPUs quite a bit. I don’t actually know the exact size of that cache, but it should be in the hundreds of megabytes, so that will be kind of nice. For the NVIDIA one, for Grace Hopper, there is no shared cache at all, simply because it’s two distinct chips. — But it is cache coherent, right? — Yes, but I think you can opt out, because it’s often slower. — That was going to be my question: how do they manage that? These are incredible speeds and really low latencies to keep coherent. — I think if you read from the CPU, from the Grace CPU, into GPU memory, you go via the GPU’s level-two cache, so that kind of enforces the coherency, and the other way around probably too — probably, I’m not 100% sure — but usually for these kinds of things there are opt-outs, because it’s slower.