Lion Cove: Intel's P-Core Roars

▲

Lion Cove: Intel's P-Core Roars(chipsandcheese.com)

126 points by luyu_wu 277 days ago | 8 comments

▲kristianp 277 days ago

About 94.9 GB/s DRAM bandwidth for the Core Ultra 7 258V they measured. Aren't Intel going to respond to the 200GB/s bandwidth of the M1 Pro introduced 3 years ago? Not to mention 400GB/s of Max and 800GB/s of the Ultra?

Most of the bandwidth comes from cache hits, but for those rare workloads larger than the caches, Apples products may be 2-8x faster?

▲adrian_b 276 days ago

AMD Strix Halo, to be launched in early 2025, will have a 256-bit memory interface for LPDDR5x of 8 or 8.5 GHz, so it will match M1 Pro.

However, Strix Halo, which has a much bigger GPU, is designed for a maximum power consumption for CPU+GPU of 55 W or more (up to 120 W), while Lunar Lake is designed for 17 W, which explains the choices for the memory interfaces.

▲Dylan16807 276 days ago

That's good. And better than match, that's 30% faster, at least until the M4 Pro launches with a RAM frequency upgrade.

On the other hand, I do think it's fair to compare to the Max too, and it loses by a lot to that 512 bit bus.

▲kvemkon 276 days ago

> LPDDR5x of 8 or 8.5 GHz

Not 8000 or 8500 MT/s and thus the frequency is halved?

▲adrian_b 276 days ago

Sorry, I meant the frequency of the transfers, not that of some synchronization signal, so 8000 or 8500 MT/s, as you say.

However that should have been obvious, because there will be no LPDDR5x of 16000 MT/s. That throughput might be reached in the future, but a different memory standard will be needed, perhaps a derivative of the MRDIMMs that are beginning to be used in servers (with multiplexed ranks).

MHz and MT/s is really the same unit. What differs is the quantity that is measured, e.g. the frequency of oscillations and the frequency of transfers. I do not agree with the method of giving multiple names to a unit of measurement in order to suggest what kind of quantity has been measured. The number of different quantities that can be measured with the same unit is very large. If the method of giving multiple unit names had been applied consistently, there would have been a huge number of unit names. I believe that the right way is to use a unique unit name, but to always specify separately what kind of quantity had been measured, because having only a numeric value and the unit is never sufficient information, without having also which quantity has been measured.

▲wtallis 276 days ago

Lunar Lake is very clearly a response to the M1, not its larger siblings: the core counts, packaging, and power delivery changes all line up with the M1 and successors. Lunar Lake isn't intended to scale up to the power (or price) ranges of Apple's Pro/Max chips. So this is definitely not the product where you could expect Intel to start using a wider memory bus.

And there's very little benefit to widening the memory bus past 128-bit unless you have a powerful GPU to make good use of that bandwidth. There are comparatively few consumer workloads for CPUs that are sufficiently bandwidth-hungry.

▲formerly_proven 276 days ago

Is the full memory bandwidth actually available to the CPU on M-series CPUs? Because that would seem like a waste of silicon to me, to have 200+ GB/s of past-LLC bandwidth for eight cores or so.

▲inkyoto 275 days ago

100 Gb/sec per a CPU core.

200+ Gb/sec (IIRC Anandtech measured 240 Gb/sec) per a cluster.

Pro is one cluster, Max is two clusters and Ultra is four clusters, so the accumulative bandwidth is 200, 400 and 800 Gb/sec respectively[*].

The bandwidth is also shared with GPU and NPU cores within the same cluster, so on combined loads it is plausible that the memory bus may become fully saturated.

[*] Starting with M3, Apple has pivoted Pro models into more Pro and less Pro versions that have differing memory bus widths.

▲adrian_b 276 days ago

I do not know for the latest models, but at least in M1 the CPU was limited to a fraction of the total memory throughput, IIRC to about 100 GB/s.

▲nox101 276 days ago

with all of the local ML being introduced by Apple and Google and Microsoft this thinking seems close to "640k is all you need"

I suspect consumer workloads to rise

▲throwuxiytayq 276 days ago

I think the number of people interested in running ML models locally might be greatly overestimated [here]. There is no killer app in sight that needs to run locally. People work and store their stuff in the cloud. Most people just want a lightweight laptop, and AI workloads would drain the battery and cook your eggs in a matter of minutes, assuming you can run them. Production quality models are pretty much cloud only, and I don’t think open source models, especially ones viable for local inference will close the gap anytime soon. I’d like all of those things to be different, but I think that’s just the way things are.

Of course there are enthusiasts, but I suspect that they prefer and will continue to prefer dedicated inference hardware.

▲0x000xca0xfe 276 days ago

Microsoft wants to bring Recall back. When ML models come as part of the OS suddenly there are hundreds of millions of users.

▲throwuxiytayq 276 days ago

I have some difficulty with estimating how heavy Recall’s workload is, but either way, I have little faith in Microsoft’s ability to implement this feature efficiently. They struggle with much simpler features, such as search. I wouldn’t be surprised if a lot of people disable the feature to save battery life and improve system performance.

▲layer8 276 days ago

It remains to be seen whether users want what Microsoft wants.

▲fasa99 276 days ago

Huh? All the files are local and models are gonna be made to have a lot of or ultimately all of them in the "context window". You can't really have an AI for your local documents on the cloud because the cloud doesn't have access. Same logic for businesses. The use case follows from data availability and barriers.

We've observed the same on web pages where more and more functionality gets pushed to the frontend. One could push the last (x) layers of the neural net for example, to the frontend, for lower expense and if rightly engineered, better speed and scalability.

AIs will be local, super-AIs still in the cloud. Local AIs will be proprietary and they will have strings to the mothership The strings will have business value both related to the consumer and for further AI training.

▲throwuxiytayq 275 days ago

> All the files are local

From what I’ve seen, most people tend to use cloud document platforms. Microsoft has made Office into one. This has been the steady direction for the last few years; they’ll keep pushing for it, because it gives them control. Native apps with local files is an inconvenient model for them. This sadly applies to most other types of apps. On many of these cloud platforms, you can’t even download the project files.

> You can't really have an AI for your local documents on the cloud because the cloud doesn't have access

Yes, up to the cloud it all goes. They don’t mind, they can charge you for it. Microsoft literally openly wants to move the whole OS to the cloud.

> Same logic for businesses

Businesses hate local files. They’re a huge liability. When firing people it’s very convenient that you can just cut someone off by blocking their cloud credentials.

> We've observed the same on web pages where more and more functionality gets pushed to the frontend

It will never go all the way.

> One could push the last (x) layers of the neural net for example, to the frontend, for lower expense and if rightly engineered, better speed and scalability

I’ll believe it when I see it. I don’t think the incentive is there. Sounds like a huge complicating factor. It’s much simpler to keep everything running in the cloud, and software architects strongly prefer simple designs. How much do these layers weigh? How many MB/GB of data will I need to transfer? How often? Does that really give me better latency than just transferring a few KBs of the AIs output?

▲tucnak 276 days ago

> AI workloads would drain the battery and cook your eggs in a matter of minutes, assuming you can run them

M2 Max is passively cooled... and does 1/2 of 4090's token bandwidth in inference.

▲MindSpunk 276 days ago

Um, no? The MBP has fans? The base level M2 in a MacBook Air is passively cooled and thermal throttles very aggressively when you push it.

▲tucnak 274 days ago

That sounds right, I was judging by mac studio I own that is passively cooled.

▲Onavo 276 days ago

> I think the number of people interested in running ML models locally might be greatly overestimated [here]. There is no killer app in sight that needs to run locally. People work and store their stuff in the cloud. Most people just want a lightweight laptop, and AI workloads would drain the battery and cook your eggs in a matter of minutes, assuming you can run them. Production quality models are pretty much cloud only, and I don’t think open source models, especially ones viable for local inference will close the gap anytime soon. I’d like all of those things to be different, but I think that’s just the way things are. Of course there are enthusiasts, but I suspect that they prefer and will continue to prefer dedicated inference hardware.

Do you use FTP instead of Dropbox?

▲cvs268 276 days ago

Do you use Dropbox instead of rsync?

( Couldn't resist (^.^) )

▲wtallis 276 days ago

Local ML isn't a CPU workload. The NPUs in mobile processors (both laptop and smartphone) are optimized for low power and low precision, which limit how much memory bandwidth they can demand. So as I said, demand for more memory bandwidth depends mainly on how powerful the GPU is.

▲epolanski 276 days ago

The few reviews we have seen now show that lunar lake is competitive with m3s too depending on the application.

▲phonon 276 days ago

M3 Pro is 150 GB/s (and that should be compared to Lunar Lake's nominal memory bandwidth of 128 GB/s) and the cheapest model with it starts at $2000 ($2400 if you want 36 GB of RAM).

At those price levels, PC laptops have discrete GPUs with their own RAM with 256 GB/s and up just for the GPU.

▲wmf 277 days ago

The "response" to those is discrete GPUs that have been available all along.

▲Aaargh20318 276 days ago

Discrete GPUs are a dead end street. They are fine for gaming, but for GPGPU tasks unified memory is a game changer.

▲kristianp 276 days ago

True, but I thought Intel might start using more channels to make that metric look less unbalanced in Apple's favour. Especially now that they are putting RAM on package.

▲tjoff 276 days ago

Why the obsession of this particular metric? And how can one claim something is unbalanced while focusing on one metric?

▲kristianp 275 days ago

Availability bias. But it's the metric that stands out as being higher than an Intel system without a graphics card.

▲sudosysgen 276 days ago

Not really, the killer is latency, not throughput. It's very rare that a CPU actually runs out of memory bandwidth. It's much more useful for the GPU.

95GB/s is 24GB/s per core, at 4.8Ghz that's 40 bits per core per cycle. You would have to be doing basically nothing useful with the data to be able to get through that much bandwidth.

▲adrian_b 276 days ago

For scientific/technical computing, which uses a lot of floating-point operations and a lot of array operations, when the memory is limiting the performance almost always the limit is caused by the memory throughput and almost never by the memory latency (in correctly written programs, which allow the hardware prefetchers to do their job of hiding the memory latency).

The resemblance to the behavior of GPUs is not a coincidence, but it is because the GPUs are also mostly doing array operations.

So the general rule is that the programs dominated by array operations are sensitive mostly to the memory throughput.

This can be seen in the different effect of the memory bandwidth on the SPECint and SPECfp benchmark results, where the SPECfp results are usually greatly improved when memory with a higher throughput is used, unlike the SPECint results.

▲sudosysgen 276 days ago

You are right that it's a limiting factor in general for that use case, just not in the case of this specific chip - this chip has far less cores per lanes, so latency will be the limiting factor. Even then, I assure you that no scientific workload is going to be consuming 40 bits/clock/core. It's just a staggering amount of memory, no correctly written program would hit this, you'd need to have abysmal cache hit ratios.

This processor has two lanes over 4 P-cores. Something like an EPYC-9754 has 12 lanes over 128 cores.

▲adrian_b 276 days ago

I agree that for CPU-only tasks, Lunar Lake has ample available memory bandwidth, but high memory latency.

However, the high memory bandwidth is intended mainly for the benefit of its relatively big GPU, which might have been able to use even higher memory throughputs.

▲fulafel 276 days ago

40 bits per clock in a 8-wide core gets you 5 bits per instruction, and we have AVX512 instructions to feed, with operand sizes 100x that (and there are multiple operands).

Modern chips do face the memory wall. See eg here (though about Zen 5) where they in the same vein conclude "A loop that streams data from memory must do at least 340 AVX512 instructions for every 512-bit load from memory to not bottleneck on memory bandwidth."

▲adrian_b 276 days ago

The throughput of the AVX-512 computation instructions is matched to the throughput of loads from the L1 cache memory, on all CPUs.

Therefore to reach the maximum throughput, you must have the data in the L1 cache memory. Because L1 is not shared, the throughput of the transfers from L1 scales proportionally with the number of cores, so it can never become a bottleneck.

So the most important optimization target for the programs that use AVX-512 is to ensure that the data is already located in L1 whenever it is needed. To achieve this, one of the most important things is to use memory access patterns that will trigger the hardware prefetchers, so that they will fill the L1 cache ahead of time.

The main memory throughput is not much lower than that of the L1 cache, but the main memory is shared by all cores, so if all cores want data from the main memory at the same time, the performance can drop dramatically.

▲sudosysgen 276 days ago

The processors that hit this wall have many many cores per memory lane. It's just not realistic for this to be a problem with 2 lanes of DDR5 feeding 4 cores.

These cores cannot process 8 AVX512 instructions at once, in fact they can't do it at all, as it's disabled on consumer Intel chips.

Also, AVX instructions operate on registers, not on memory, so you cannot have more than one register being loaded at once.

If you are running at ~4 instruction per clock, to actually go ahead and saturate 40 bits per clock on 64 bit loads, you'd need 1/6 of instructions to hit main memory (not cache)!

▲unsigner 276 days ago

There might be a chicken-and-egg situation here - one often hears that there’s no point having wider SIMD vectors or more ALU units, as they would spend all their time waiting for the memory anyway.

▲adrian_b 276 days ago

The width and count of the SIMD execution units are matched to the load throughput from the L1 cache memory, which is not shared between cores.

Any number of cores with any count and any width of SIMD functional units can reach the maximum throughput, as long as it can be ensured that the data can be found in the L1 cache memories at the right time.

So the limitations on the number of cores and/or SIMD width and count are completely determined by whether in the applications of interest it is possible to bring the data from the main memory to the L1 cache memories at the right times, or not.

This is what must be analyzed in discussions about such limits.

▲immibis 276 days ago

CPUs generally achieve around 4-8 FLOPS per cycle. That means 256-512 bits per cycle. We're all doing AI which means matrix multiplications which means frequently rereading the same data bigger than the cache, and doing one MAC with each piece of data read.

▲jart 276 days ago

The most important algorithm in the world, matrix multiplication, just does a fused multiply add on the data. Memory bandwidth is a real bottleneck.

▲adrian_b 276 days ago

The importance of the matrix multiplication algorithm is precisely due to the fact that it is the main algorithm where the ratio between computational operations and memory transfers can be very large, therefore the memory bandwidth is not a bottleneck for it.

The right way to express a matrix multiplication is not that wrongly taught in schools, with scalar products of vectors, but as a sum of tensor products between the column vectors of the first matrix with those row vectors of the second matrix that share with them the same position of the element on the main diagonal of the matrix.

Computing a tensor product of two vectors, with the result accumulated in registers, requires a number of memory loads equal to the sum of the lengths of the vectors, but a number of FMA operations equal to the product of the lengths (i.e. for square matrices of size NxN, there are 2N loads and N^2 FMA for one tensor product, which multiplied with N tensor products give 2N^2 loads and N^3 FMA operations for the matrix multiplication).

Whenever the lengths of both vectors are are no less than 2 and at least one length is no less than 3, the product is greater than the sum. With greater vector lengths, the ratio between product and sum grows very quickly, so when the CPU has enough registers to hold the partial sum, the ratio between the counts of FMA operations and of memory loads can be very great.

▲svantana 276 days ago

Is it though? The matmul of two NxN matrices takes N^3 macs and 2*N^2 memory access. So the larger the matrices, the more the arithmetic dominates (with some practical caveats, obviously).

▲perryh2 277 days ago

It looks awesome. I am definitely going to purchase a 14" Lunar Lake laptop from either Asus (Zenbook S14) or Lenovo (Yoga Slim). I really like my 14" MBP form factor and these look like they would be great for running Linux.

▲jjmarr 277 days ago

I constantly get graphical glitches on my Zenbook Duo 2024. Would recommend against going Intel if you want to use Linux.

▲skavi 277 days ago

Intel has historically been pretty great at Linux support. Especially for peripherals like WiFi cards and GPUs.

▲jauntywundrkind 277 days ago

Their "PCIe" wifi cards "mysteriously" not working in anything but Intel systems is enraging.

I bought a wifi7 card & tried it in a bunch of non-Intel systems, straight up didn't work. Bought a wifi6 card and it sort of works, ish, but I have to reload the wifi module and sometimes it just dies. (And no these are not cnvio parts).

I think Intel has a great amazing legacy & does super things. Usually their driver support is amazing. But these wifi cards have been utterly enraging & far below what's acceptable in the PC world; they are not fit to be called PCIe devices.

Something about wifi really brings out the worst in companies. :/

▲transpute 276 days ago

> not fit to be called PCIe devices

They might be CNVi in M.2 form factor, with the rest of the "wifi card" inside the Intel SoC.

  In CNVi, the network adapter's large and usually expensive functional blocks (MAC components, memory, processor and associated logic/firmware) are moved inside the CPU and chipset (Platform Controller Hub). Only the signal processor, analog and Radio frequency (RF) functions are left on an external upgradeable CRF (Companion RF) module which, as of 2019 comes in M.2 form factor.

Wifi7 has 3-D radar features for gestures, heartbeat, keystrokes and human activity recognition, which requires the NPU inside Intel SoC. The M.2 card is only a subset.

▲zaptrem 276 days ago

> Wifi7 has 3-D radar features for gestures, heartbeat, keystrokes and human activity recognition, which requires the NPU inside Intel SoC. The M.2 card is only a subset.

Source? Google turned up nothing.

▲zxexz 276 days ago

Sounds like some LLM hallucination to me.

EDIT: Right after that I found another HN comment [0] by the same user (through a google search!)!

[-1] Interesting IEEE rfc email thread on related to preamble puncturing

misc (I have not yet read these through beyond the abstracts): A preprint in ArXiV related to the proposed spec [1] A paper in IEEE Xplore on 802.11bf [2] NIST publication on 802.11bf [3] (basically [2] but on NIST)

[-1] https://www.ieee802.org/11/email/stds-802-11-tgbe/msg00711.h... [0] https://news.ycombinator.com/item?id=38811036 [1] https://arxiv.org/pdf/2207.04859 [2] https://ieeexplore.ieee.org/document/10467185 [3] https://www.nist.gov/publications/ieee-80211bf-enabling-wide...

▲adrian_b 276 days ago

The part about the supposed features of WiFi 7 looks like a hallucination or perhaps only a misinterpretation of proposed features.

How to do the actual sensing functions does not belong in a communication standard. What has been proposed, but I do not think that the standard amendment has been finalized, is to make some changes in the signals transmitted by WiFi stations, which would enable those desiring to implement sensing functions with the WiFi equipment to do that, without interfering with the normal communication functions.

So if Intel would create some program for Lunar Lake CPUs, possibly using the internal NPU, for purposes like detecting the room occupancy, that application would not be covered by the WiFi standard, the standard will just enable the creation of such applications and such an application would be equally implementable with any PCIe WiFi card conforming to the new standard, not only with an Intel CNVi card, whicd uses an internal WiFi controller.

However it is correct that there are 2 kinds of Intel WiFi cards that look the same (but they have different part numbers, e.g. AX200 for PCIe and AX201 for CNVi), but one kind of cards are standard PCIe cards that work in any computer, while the other kind of cards (CNVi) works only when connected to compatible Intel laptop CPUs.

▲transpute 276 days ago

> Even if it is possible to implement such features

It has been possible for years with custom firmware, search "device free wireless sensing" in Google Scholar.

> they would not be incorporated in a communication standard at this time.

One would hope so, right?

https://www.technologyreview.com/2024/02/27/1088154/wifi-sen...

  When the new standard comes out in 2025, it will allow “every Wi-Fi device to easily and reliably extract the signal measurements,” Yang says. That alone should help get more Wi-Fi sensing products on the market. “It will be explosive,” Liu believes.. cases imagined by the committee include counting and finding people in homes or in stores, detecting children left in the back seats of cars, and identifying gestures, along with long-standing goals like detecting falls, heart rates, and respiration.

  Wi-Fi 7, which rolls out this year, will open up an extra band of radio frequencies for new Wi-Fi devices to use, which means more channel state information for algorithms to play with. It also adds support for more tiny antennas on each Wi-Fi device, which should help algorithms triangulate positions more accurately. With Wi-Fi 7, Yang says, “the sensing capability can improve by one order of magnitude..”

  WiGig already allows Wi-Fi devices to operate in the millimeter-wave space used by radar chips like the one in the Google Nest.. [use cases include] reidentifying known faces or bodies, identifying drowsy drivers, building 3D maps of objects in rooms, or sensing sneeze intensity (the task group, after all, convened in 2020).. There is one area that the IEEE is not working on, at least not directly: privacy and security.

In advance of the IEEE 802.11bf standard, Intel implemented presence detection with WiFi 6E starting with 2023 Meteor Lake sensors and NPU, https://www.intel.com/content/www/us/en/content-details/7651...

   Intel Wi-Fi can intelligently sense when to lock or wake your laptop
   Walk Away Lock: Wi-Fi senses your departure & locks the PC in seconds
   Wake on Approach: Wi-Fi senses your return & wakes the PC in seconds

Intel Labs 2023 PoC demo of breathing detection, https://community.intel.com/t5/Blogs/Tech-Innovation/Client/...

  The solution detects the rhythmic change in CSI due to chest movement during breathing. It then uses that information to detect the presence of a person near a device, even if the person is sitting silently without moving. The respiration rates gathered by this technology could play an important role in stress detection and other wellness applications.

> such an application would be equally implementable with any PCIe WiFi card conforming to the new standard

Yes, this would be possible on AMD, Qualcomm and other "AI" PCs.

Some Arm SoCs include NPUs that could be used by routers and other devices for WiFi sensing applications.

> looks like a hallucination or perhaps only a misinterpretation of proposed features

Is there a good term for reality conflicting with claims of hallucination and misinterpretation?

▲zxexz 275 days ago

> Is there a good term for reality conflicting with claims of hallucination and misinterpretation?

I can't think of a good one yet, I'm sure a year from now there will be one commonly used.

What triggered the assumption on my end (not the comment you're replying to) was your statement "Wifi7 has 3-D radar features for gestures, heartbeat, keystrokes and human activity recognition, which requires the NPU inside Intel SoC. The M.2 card is only a subset." - I initially couldn't find any good references, because Google was so filled with ChatGPT'd SEO spam (notable a bunch of circular references citing a Medium article, itself obviously LLM'd).

Apologies for my mistake there. You seem to know quite a bit about the subject. I'm definitely constantly on-guard with any claims these days, especially ones that have potentially terrifying implications, without a primary source.

▲transpute 275 days ago

Thanks for the response. Web search has sadly deteriorated.

Here are some recent HN threads on through-wall 2.4Ghz Wi-Fi Sensing with CSI radar and IEEE 802.11bf:

Surveilling the masses with wi-fi-based positioning systems (2024), 140 comments, https://news.ycombinator.com/item?id=40492234

The spies in your home: How WiFi companies monitor your private life (2024), 40 comments, https://news.ycombinator.com/item?id=41272294

Wild new Wi-Fi routers turn your home network into a security radar (2024), 40 comments, https://news.ycombinator.com/item?id=40897828

Inside a $1 radar motion sensor (2024), 100 comments, https://news.ycombinator.com/item?id=40834349

▲jauntywundrkind 276 days ago

> They might be CNVi in M.2 form factor, with the rest of the "wifi card" inside the Intel SoC.

Already answered:

> And no these are not cnvio parts

(Oops, CNVi not cnvio.)

▲transpute 276 days ago

Which model number(s) exhibited this behavior?

▲silisili 277 days ago

I get them also in my Lunar Lake NUC. Usually in the browser, and presents as missing/choppy text oddly enough. Annoying but not really a deal breaker. Hoping it sorts out in the next couple kernel updates.

▲jjmarr 276 days ago

Do you get weird checkerboard patterns as well?

▲silisili 276 days ago

Yep, but not as often. They seem to flash and disappear when I'm moving or resizing a window or something.

▲gigatexal 277 days ago

Give it some time. Probably needs updated drivers, intel and Linux have been rock solid for me too. If your hardware is really new it’s likely a kernel and time issue. 6.12 or 6.13 should have everything sorted.

▲rafaelmn 276 days ago

Given the layoffs and the trajectory spiral I wouldn't be holding my breath for this.

▲amanzi 277 days ago

I'm really curious about how well they run Linux. e.g. will the NPU work under Linux in the same way it does on Windows? Or does it require specific drivers? Same with the batter life - if there a Windows-specific driver that helps with this, or can we expect the same under Linux?

▲ac29 277 days ago

You can look at the NPU software stack here:

https://github.com/intel/linux-npu-driver/blob/main/docs/ove...

The Linux driver is specific to Linux, but the software on top of that like oneAPI and OpenVINO are cross platform I think.

▲wmf 277 days ago

All NPUs require drivers. https://www.phoronix.com/news/Intel-Linux-NPU-Driver-1.5

▲formerly_proven 276 days ago

I’d wait for LG honestly

▲RicoElectrico 277 days ago

> A plain memory latency test sees about 131.4 ns of DRAM latency. Creating some artificial bandwidth load drops latency to 112.4 ns.

Can someone put this in context? The values seem order of magnitude higher than here: https://www.anandtech.com/show/16143/insights-into-ddr5-subt...

▲toast0 277 days ago

The chips and cheese number feels like an all-in number; get a timestamp, do a memory read (that you know will not be served from cache), get another timestamp.

The anandtech article is latencies for parts of a memory operation, between the memory controller and the ram. End to end latency is going to be a lot more than just CAS latency, because CAS latency only applies once you've got the proper row open, etc.

▲wtallis 277 days ago

Getting requests up through the cache hierarchy to the DRAM controller, and data back down to the requesting core's load/store units is also a non-trivial part of this total latency.

▲foota 277 days ago

I think the numbers in that article (the CAS latency) are the latency numbers "within" the DRAM module itself, not the end to end latency between the processor and the RAM.

You could read the article on the latest AMD top of the line desktop chip to compare: https://chipsandcheese.com/2024/08/14/amds-ryzen-9950x-zen-5... (although that's a desktop chip, the original article compares the Intel performance to 128 ns of DRAM latency for AMD's mobile platform Strix Point)

▲Tuna-Fish 276 days ago

CAS latency is only the latency of doing an access from an open row. This is in no way representative of a normal random access latency. (Because caches are so large that if you were frequently hitting open rows, you'd just load from cache instead.)

The way CAS has been widely understood as "memory latency" is just wrong.

▲Sakos 276 days ago

That article is about RAM latency in isolation. See this Anandtech article that shows similar numbers to chips and cheese when evaluating a CPU's DRAM latency (further down on the page): https://www.anandtech.com/show/16214/amd-zen-3-ryzen-deep-di...

▲jart 276 days ago

Use Intel's mlc (memory latency checker) tool to measure your system. On a GCE instance I see about 97ns for RAM access. On a highly overclocked gaming computer with a small amount of RAM I see 60ns. Under load, latency usually drops to about 200ns. On workstation with a lot of RAM and cores I see it drop to a microsecond.

▲adrian_b 276 days ago

I completely agree with the author that renaming the L1 cache memory as L0 and introducing a new L1 cache, as done by Intel is a completely misleading terminology.

The correct solution is that from the parent article, to continue to call the L1 cache memory as the L1 cache memory, because there is no important difference between it and the L1 cache memories of the previous CPUs, and to call the new cache memory that has been inserted between the L1 and L2 cache memories as the L1.5 cache memory.

Perhaps Intel did this to give the very wrong impression that the new CPUs have a bigger L1 cache memory than the old CPUs. To believe this would be incorrect, because the so called new L1 cache has a much lower throughput and a worse latency than a true L1 cache memory of any other CPU.

The new L1.5 is not a replacement for an L1 cache, but it functions as a part of the L2 cache memory, with identical throughput as the L2 cache, but with a lower latency. As explained in the article, this has been necessary to allow Intel to expand the L2 cache to 2.5 MB in Lunar Lake and to 3 MB in Arrow Lake S (desktop CPU), in comparison with AMD, which has an only 1 MB L2 cache (but a bigger L3 cache).

According to rumors, while the top AMD desktop CPUs without stacked cache memory have an 80 MB L2+L3 cache (16 MB L2 + 64 MB L3), the top Intel model 285K might have 78 MB of cache, i.e. about the same amount, but with a different distribution on levels: 2 MB L1.5 + 40 MB L2 + 36 MB L3. Nevertheless, until now there is no official information from Intel about Arrow Lake S, whose launch is expected in a month from now, so the amount of L3 cache is not certain, only the amounts of L2 and L1.5 are known from earlier Intel presentations.

Lunar Lake is an excellent design for all applications where adequate cooling is impossible, i.e. thin and light notebooks and tablets or fanless small computers.

Nevertheless, Intel could not abstain from not using unfair marketing tactics. Almost all the benchmarks presented by Intel at the launch of Lunar Lake have been based on the top model 288V. Both top models 288V and 268V are likely to be unobtainium for most computer models, while at the few manufacturers that will offer this option they will be extremely overpriced.

Most available and affordable computers with Lunar Lake will not offer any better CPU than 258V, which is the one tested in the parent article. 258V has only 4.8 GHz/2.2 GHz turbo/base clock frequencies, vs. 5.1 GHz/3.3 GHz of the 288V used in the Intel benchmarks and in many other online benchmarks. So the actual experience of most Lunar Lake users will not match most published benchmarks, even if it will be good enough in comparison with any competitors in the same low-power market segment.

▲AzzyHN 277 days ago

We'll have to see how this compared to Zen 5 once 24H2 drops.

And once more than like three Zen 5 laptops come out.

▲deaddodo 277 days ago

The last couple of generations have had plenty of AMD options. Razer 14, Zephyrus G14, TUFbook, etc. If you get out of the performance/enthusiast segment, they're even more plentiful (Inspirons, Lenovos, Zenbooks, etc).

▲nahnahno 277 days ago

the review guide had everyone on 24H2; there were some issues with one of the updates that messed up performance for lunar lake pre-release, but appears to have been fixed in time for release.

I’d expect lunar lake’s position to improve a bit in coming months as they tweak scheduling, but AMD should be good at this point.

Edit: around 16 mark, https://youtu.be/5OGogMfH5pU?si=ILhVwWFEJlcA3HLO. The laptops came with 24H2

▲AStonesThrow 275 days ago

I apologize in advance for my possibly off-topic linguistics-nerd pun:

Q: What do you call Windows with its UI translated to Hebrew? A: The L10N of Judah