rorschach200

[–] [email protected] 1 points 10 months ago

The bus width needed to be what it needed to be. That left 2 possibilities - 12 GB and 24 GB. The former was way too low for the 4090 to work in its target applications. 24 it became.
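A back-of-envelope sketch of that constraint (the 384-bit bus and the 1 GB / 2 GB GDDR6X chip densities are the publicly known figures; the rest is just arithmetic):

```python
# Capacity on a 4090-class card is forced by bus width and available chip densities.
BUS_WIDTH_BITS = 384        # AD102's memory bus
CHIP_WIDTH_BITS = 32        # each GDDR6X device is 32 bits wide
CHIP_DENSITIES_GB = [1, 2]  # 8 Gb and 16 Gb GDDR6X parts

chips = BUS_WIDTH_BITS // CHIP_WIDTH_BITS         # 12 memory chips
options = [chips * d for d in CHIP_DENSITIES_GB]  # [12, 24] GB - the two possibilities
print(chips, options)
```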

This is exactly what drives these decisions.

What do you think drives them?

[–] [email protected] 1 points 10 months ago

by improving the RT & tensor cores

and HW support for DLSS features and CUDA as a programming platform.

It might be "a major architecture update" in terms of the amount of work Nvidia engineering will have to put in to pull off all the new features and RT/TC/DLSS/CUDA improvements without regressing PPA - that's where the years of effort will be sunk. The result may be large improvements in perf in selected application categories and operating modes, but a very minor improvement in "perf per SM per clock" in no-DLSS rasterization on average.

[–] [email protected] 1 points 10 months ago (5 children)

Why actually build the 36 GB one though? What gaming application will be able to take advantage of more than 24 for the lifetime of the 5090? The 5090 will be irrelevant by the time the next gen of consoles releases, and the current one has 16 GB for VRAM and system RAM combined. 24 is basically perfect for a top-end gaming card.

And 36 will be even more self-cannibalizing for the professional cards market.

So it's unnecessary, expensive, and cannibalizing. Not happening.

[–] [email protected] 1 points 10 months ago (2 children)

GA102 to AD102 increased by about 80%

without scaling DRAM bandwidth anywhere near as much, only partially compensating for that with a much bigger L2.

For the 5090, on the other hand, we might also have a clock increase going (another 1.15x?), plus a proportional 1:1 (unlike Ampere -> Ada) DRAM bandwidth increase by a factor of 1.5 thanks to GDDR7 (no bus width increase necessary; 1.5 = 1.3 * 1.15). So that's a 1.5x perf increase 4090 -> 5090, which has to be further multiplied by whatever the u-architectural improvements bring, like Qesa is saying.

Unlike Qesa, though, I'm personally not very optimistic about those u-architectural improvements being very major. To get from the 1.5x that comes out of the node speed increase and the node shrink (subdued and downscaled by the node cost increase) to the recently rumored 1.7x, one would need another 13% perf and perf/W improvement (1.7 / 1.5 ≈ 1.13), which sounds just about realistic. I'm betting it'll be even a little bit less, yielding more like a 1.6x proper average; that 1.7x might have been the result of measuring very few apps, or an outright "up to 1.7x" with the "up to" getting lost during the leak (if there even was a leak).
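The same napkin math in one place (nothing new here, just the numbers from this comment restated as a rough sketch):

```python
# Guesses restated from above: ~1.3x from the bigger/faster chip, ~1.15x from clocks,
# with GDDR7 scaling DRAM bandwidth 1:1 alongside it.
baseline = 1.3 * 1.15              # ≈ 1.5x generational uplift before u-arch changes
rumored = 1.7
uarch_needed = rumored / baseline  # ≈ 1.13-1.14, i.e. the ~13% from u-architecture
my_bet = baseline * 1.07           # a bit less than that -> roughly a 1.6x average
print(round(baseline, 2), round(uarch_needed, 2), round(my_bet, 2))
```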

1.6x is absolutely huge, and no wonder nobody's increasing the bus width: it's unnecessary for yielding a great product and even more expensive now than it was on 5nm (DRAM controllers almost don't shrink and are big).

[–] [email protected] 1 points 10 months ago
  1. The "gain" is largely a weighted average over all apps, not a max realized in a couple of outliers. It's the bulk that determines the economics of the question, not singular exceptions (see the tiny numeric illustration after this list).
  2. The current status is heavily dominated by the historical state of affairs, as not enough time has passed to do much yet. Complex heterogeneous cache hierarchies that generalize poorly are a very recent thing in CPUs; in GPUs they have been the norm for decades, and in GPUs that is not the only source of high sensitivity to tuning.
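An illustration of point 1 with made-up numbers (hypothetical gains, purely to show how little a couple of outliers move the bulk):

```python
from statistics import geometric_mean

# Hypothetical per-app speedups: 98 titles barely move, 2 outliers gain a lot.
gains = [1.02] * 98 + [1.7, 1.7]

print(max(gains))                       # 1.7  - the headline number
print(round(geometric_mean(gains), 3))  # ~1.03 - what the average title sees
```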
[–] [email protected] 0 points 10 months ago (2 children)

Also, GPUs are full of sharp performance cliffs and tuning opportunities - there is a lot to be gained. CPUs are a lot more resilient and generic - a lot less to be gained there.

[–] [email protected] 1 points 10 months ago (1 children)

but all they would need to do is look at like the top 100 games played every year

My main hypothesis on this subject: perhaps they already did, and out of the top 100 games only 2 turned out to be possible to accelerate via this method, even after exhaustively checking all possible affinities and scheduling schemes, and only on CPUs with 2 or more 4-core clusters of E-cores.

The support for this hypothesis comes from the following suggestive points:

  1. how many behavioral requirements the game threads might need to satisfy
  2. how temporally stable the thread behaviors might need to be, probably disqualifying apps with any in-app task scheduling / load balancing
  3. the signal that they possibly didn't find a single game where one 4-core E-cluster is enough (how rarely must this be applicable if they apparently needed 2+, for... some reason?)
  4. the odd choice of Metro Exodus, as pointed out by HUB - it's a single-player game with very high visual fidelity, pretty far down the list of CPU-limited games (nothing else benefited?)
  5. the fact that neither of the games supported (Metro and Rainbow 6) is based on either of the two most popular game engines (Unity and Unreal), possibly reducing how many other apps could be hoped to show similar behavior and benefit.

Now, perhaps the longer list of games they show on their screenshot is actually the list of games that benefit, and we only got 2 for now because those are the only ones they've figured out (at the moment) how to detect thread identities in (possibly in a way not too far off from something as curious as this), or maybe that list is something else entirely and not indicative of anything. Who knows.

And then there comes the discussion you're having re: implementation, scaling, and maintenance, with its own can of worms.

[–] [email protected] 1 points 10 months ago

From the video for convenience:

"Why did Intel only choose to enable Intel® Application Optimization on select 14th Gen processors? Settings within Intel® Application Optimization are custom determined for each supported processor, as they consider the number of P-cores, E-cores, and Intel® Hyperthreading Technology. Due to the massive amount of custom testing that went into the optimized setting parameters specifically gaming applications [sic], Intel chose to align support for our gaming-focused processors."
- Intel

(original page quoted from)

[–] [email protected] 1 points 10 months ago

flags

Slide 7 https://ee.usc.edu/~redekopp/cs356/slides/CS356Unit5_x86_Control

See also https://dougallj.wordpress.com/2022/11/09/why-is-rosetta-2-fast/ on the entire general subject.

However, according to the very author of that article, the contribution of these extensions to the overall performance is quite minor; see the discussion starting at https://news.ycombinator.com/item?id=33537213, which gives very compact descriptions of both the extensions in question and an assessment of their realistic contribution.

[–] [email protected] 1 points 10 months ago

could easily be caused by not having enough RAM and having to shuffle art assets between storage and memory

In iOS there is no swap:

In iOS, there is no backing store and so pages are never paged out to disk, but read-only pages are still paged in from disk as needed.

(https://developer.apple.com/library/archive/documentation/Performance/Conceptual/ManagingMemory/Articles/AboutMemory.html )

Which, counterintuitively, actually makes it more demanding of RAM capacity than macOS. macOS's 2-stage swapping with compression (and absurdly fast NVMe) is in fact quite good; iOS has to fit in RAM and RAM only.

[–] [email protected] 1 points 10 months ago

I wholeheartedly agree about MaxTech on average, but in this particular instance... the video is actually pretty good.

Or rather, if one watches it with the right filter, the material that remains after the filtering is quite good. Namely, one should discard all claims based on pure numerology - numbers from the config, the current year, "RAM used" as shown by Activity Monitor (a big chunk of that figure is very optional file cache that only marginally improves perf but uses up more RAM the more you give it, for starters, plus a lot more), etc.

The actual experiments with applications done on the machine, the performance measurements (seconds, etc.), and the demonstrations of responsiveness (switching from tab to tab on camera) are actually quite well done. In fact, the other YouTube videos on the subject rarely include quantifiable performance / timing measurements and limit themselves to demos (or pure handwaving and numerology).

Of course, the conclusions, "recommendations", etc. in their exact wording also need to be taken with half a metric ton of salt, but there is still a lot of surprisingly good signal in that video, as noted.

[–] [email protected] 1 points 10 months ago (1 children)

Let's put some science to it, shall we?

Using Digital Foundry's vid as the main perf orientation source for ballpark estimates, it seems that in gaming applications, depending on the game, M1 Max is anywhere from 2.1 to a staggering 4.5 times slower than a desktop 3090 (a 350W GPU), with the geomean sitting at an embarrassing 2.76x. In rendering pro apps, on the other hand, using Blender as an example, the difference is quite a bit smaller (though still huge): 1.78x.

From Apple's event today it seems pretty clear that the information on the generic slides pertains to gaming performance, and on the dedicated pro apps slides to pro apps (with ray tracing). It appears that M3 Max / M1 Max in gaming applications, therefore, is 1.5x, which would put M3 Max at still 1.84x slower than a 3090 (2.76 / 1.5). Looks like it will take an M3 Ultra to beat a 3090 in games.

In pro apps (rendering), however, M3 Max / M1 Max is declared to have a staggering 2.5x advantage, going from M1 Max being 1.78x slower than a 3090 to M3 Max being about 1.4x faster than it (desktop, at 350W) - or, alternatively, the 3090 delivering only 0.71x of M3 Max's performance (1.78 / 2.5).
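The arithmetic in one place (just the numbers already quoted above, as a ballpark sketch, not a benchmark):

```python
# How much slower M1 Max is than a desktop 3090 (Digital Foundry ballparks).
m1_vs_3090_games = 2.76    # geomean across games
m1_vs_3090_blender = 1.78  # Blender rendering

# Apple's claimed M3 Max over M1 Max uplifts from today's event.
uplift_games = 1.5
uplift_pro = 2.5

m3_vs_3090_games = m1_vs_3090_games / uplift_games  # ≈ 1.84x slower in games
m3_vs_3090_pro = m1_vs_3090_blender / uplift_pro    # ≈ 0.71 -> ~1.4x faster in rendering
print(round(m3_vs_3090_games, 2), round(m3_vs_3090_pro, 2), round(1 / m3_vs_3090_pro, 2))
```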

Translating all of this to the 4000 series using TechPowerUp ballpark figures, it appears that in gaming applications M3 Max is going to be only very slightly faster than... a desktop 4060 (non-Ti; 115W). At the same time, the very same M3 Max is going to be a bit faster than a desktop 4080 (a 320W GPU) in ray-tracing 3D rendering pro applications (like Redshift and Blender).

Add to that the detail that a desktop 4080 is a 16 GB VRAM GPU, with the largest consumer-grade card - the 4090 - having 24 GB of VRAM, while M3 Max can be configured with up to 128 GB of unified RAM even in a laptop enclosure, of which probably about 100 GB or so will be available as VRAM - roughly 5x more than on the Nvidia side. As the other commenter said (unjustly downvoted), that makes a number of pro tasks that are comically impossible (do not run) on Nvidia very much possible on M3 Max.

So, anywhere from a desktop 4060 to a desktop 4080 depending on the application: in games, a 4060; in pro apps, "up to a 4080" depending on the app (and a 4080 in at least some of the ray-tracing 3D rendering applications).

Where that puts a CAD app I've no idea - probably something like 1/3 of the way from games and 2/3 of the way from pro apps? Like 1.45x slower than a desktop 3090? Which puts it somewhere between a desktop 4060 Ti and a desktop 4070.
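That interpolation, spelled out (pure guesswork, same numbers as above):

```python
# Place a hypothetical CAD app 1/3 of the way from the games figure toward
# the pro-apps figure (both expressed as "x slower than a desktop 3090").
games = 1.84
pro = 0.71
cad = games + (pro - games) / 3  # ≈ 1.46x slower, i.e. the ~1.45x above
print(round(cad, 2))
```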

I'm sure you can figure out how to translate all of that from the desktop Nvidia cards used here to their laptop variants (which are very notably slower).

I have to highlight for the audience once again the absolutely massive difference in performance improvement between games and 3D rendering pro apps: M3 Max / M1 Max, as announced by Apple today, is 1.5x in games but 2.5x in 3D rendering pro apps - and M1 Max was already noticeably slower in games than it presumably should have been, given how it performed in 3D rendering apps relative to Nvidia.
