This work from Google (original Nature paper: https://www.nature.com/articles/s41586-021-03544-w) has been credibly criticized by several researchers in the EDA CAD discipline. These papers are of interest:
- A rebuttal by a researcher within Google who wrote this at the same time as the "AlphaChip" work was going on ("Stronger Baselines for Evaluating Deep Reinforcement Learning in Chip Placement"): http://47.190.89.225/pub/education/MLcontra.pdf
- A paper from Igor Markov which critically evaluates the "AlphaChip" algorithm ("The False Dawn: Reevaluating Google's Reinforcement Learning for Chip Macro Placement"): https://arxiv.org/pdf/2306.09633
In short, the Google authors did not fairly evaluate their RL macro placement algorithm against other SOTA algorithms; rather, they claim to perform better than a human at macro placement, which falls far short of what mixed-placement algorithms are capable of today. The RL technique also requires significantly more compute than other algorithms, and it ultimately learns a surrogate function for placement iteration rather than any novel representation of the placement problem itself.
- The 2023 ISPD paper didn't pre-train at all. This means no learning from experience, for a learning-based algorithm. I feel like you can stop reading there.
- The ISPD paper and the MLcontra paper both used much older technology nodes with larger feature sizes, which have quite different physical properties. TPU uses a sub-10 nm node, whereas ISPD uses 45 nm and 12 nm. These are really different from a physical design perspective. Even worse, MLcontra uses a truly ancient benchmark with a >100 nm node.
Markov's paper just summarizes the other two.
(Incidentally, none of ISPD / MLcontra / Markov were peer reviewed - ISPD 2023 was an invited paper.)
There's a lot of other stuff wrong with the ISPD paper and the MLcontra paper - happy to go into it - and a ton of weird financial incentives lurking in the background. Commercial EDA companies do NOT want a free open-source tool like AlphaChip to take over.
Reading your post, I appreciate the thoroughness, but it seems like you are too quick to let ISPD 2023 off the hook for failing to pre-train and using less compute. The code for pre-training is just the code for training --- you train on some chips, and you save and reuse the weights between runs. There's really no excuse for failing to do this, and the original Nature paper described at length how valuable pre-training was. Given how different TPU is from the chips they were evaluating on, they should have done their own pre-training, regardless of whether the AlphaChip team released a pre-trained checkpoint on TPU.
(Using less compute isn't just about making it take longer - ISPD 2023 used half as many GPUs and 1/20th as many RL experience collectors, which may screw with the dynamics of the RL job. And... why not just match the original authors' compute, anyway? Isn't this supposed to be a reproduction attempt? I really do not understand their decisions here.)
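Concretely, the save-and-reuse mechanics are a few lines of ordinary ML plumbing. A toy sketch (plain NumPy least-squares standing in for the RL policy; all names, paths, and numbers here are illustrative, not from the circuit_training repo):

```python
import numpy as np

def train(weights, data, steps=100, lr=0.1):
    """Toy gradient-descent loop standing in for an RL training run:
    fits `weights` to least-squares data in place of policy updates."""
    X, y = data
    for _ in range(steps):
        grad = X.T @ (X @ weights - y) / len(y)
        weights = weights - lr * grad
    return weights

rng = np.random.default_rng(0)
# Two similar "chips": closely related synthetic regression tasks.
X_a = rng.normal(size=(50, 3)); y_a = X_a @ np.array([1.0, -2.0, 0.5])
X_b = rng.normal(size=(50, 3)); y_b = X_b @ np.array([1.1, -1.9, 0.4])

# "Pre-training": train on chip A and save the weights.
w_pre = train(np.zeros(3), (X_a, y_a))
np.save("/tmp/pretrained.npy", w_pre)

# "Fine-tuning": reload the checkpoint and continue training on chip B,
# instead of re-randomizing and learning from scratch every time.
w_ft = train(np.load("/tmp/pretrained.npy"), (X_b, y_b), steps=20)
```

With the warm start, the same 20 steps on chip B land much closer to its optimum than 20 steps from scratch, which is the whole argument for pre-training.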
Why does pre-training (or the lack of it) matter in the ISPD 2023 paper? The circuit_training repo, as noted in the ISPD 2023 authors' rebuttal of the rebuttal, claims training from scratch is "comparable or better" than fine-tuning the pre-trained model. So whatever your opinion on the importance of the pre-training step, this result isn't replicable, at which point the ball is in Google's court to release code/checkpoints showing otherwise.
The quick-start guide in the repo said you don't have to pre-train for the sample test case, meaning that you can validate your setup without pre-training. That does not mean you don't need to pre-train! Again, the paper talks at length about the importance of pre-training.
In reinforcement learning, pre-training reduces peak performance. We can argue about this, but on its own it is not a strong enough point to justify stopping reading.
Oh, man... this is the same old stuff from the 2023 Anna Goldie statement (is this Anna Goldie's comment?). This was all addressed by Kahng in 2023 - no valid criticisms. Where do I start?
Kahng's ISPD 2023 paper is not in dispute - no established experts objected to it. The Nature paper is in dispute. Dozens of experts objected to it: Kahng, Cheng, Markov, Madden, Lienig, and Swartz objected publicly.
The fact that Kahng's paper was invited doesn't mean it wasn't peer reviewed. I checked with ISPD chairs in 2023 - Kahng's paper was thoroughly reviewed and went through multiple rounds of comments. Do you accept it now? Would you accept peer-reviewed versions of other papers?
Kahng is the most prominent active researcher in this field. If anyone knows this stuff, it's Kahng. There were also five other authors on that paper, including another celebrated professor, Cheng.
The pre-training thing was disclaimed in the Google release. No code, data or instructions for pretraining were given by Google for years. The instructions said clearly: you can get results comparable to Nature without pre-training.
The "much older technology" point is also a bogus issue, because HPWL scales linearly and is reported by all commercial tools. Rectangles are rectangles. This is textbook material. But Kahng et al. prepared some very fresh examples, including NVDLA, with two recent technologies. Guess what, RL did poorly on those. Are you accepting this result?
The bit about financial incentives and open-source is blatantly bogus, as Kahng leads OpenROAD - the main open-source EDA framework. He is not employed by any EDA company. It is Google who has huge incentives here - see Demis Hassabis's tweet "our chips are so good...".
The "Stronger Baselines" paper matched compute resources exactly. Kahng and his coauthors performed fair comparisons between annealing and RL, giving the same resources to each. Giving greater resources is unlikely to change the results. This was thoroughly addressed in Kahng's FAQ - if only you would read it.
The resources used by Google were huge. The Cadence tools in Kahng's paper ran hundreds of times faster and produced better results. That is as conclusive as it gets.
It doesn't take a Ph.D. to understand fair comparisons.
An appeal to authority would mean relying only on the expert's opinion, with no actual evidence; but the GP actually gave a lot of evidence for the researcher's expertise, in the form of peer-reviewed papers and other links. That's not an appeal to authority at all.
Please check the textbook - that fallacy asks you to accept a claim based on authority. In this case, we have a negative-spin troll (negativeonehalf) spreading FUD about an established subject expert ("a lot of other stuff wrong") with one falsehood after another (paper not peer reviewed, Kahng funded by Cadence, etc.). No authority is needed to debunk the troll's claims - they are not supported by anything, basically made-up nonsense.
For AlphaChip, pre-training is just training. You train, and save the weights in between. This has always been supported by Google's open-source repository. I've read Kahng's FAQ, and he fails to address this, which is unsurprising, because there's simply no excuse for cutting pre-training out of a learning-based method. In his setup, every time AlphaChip sees a new chip, he re-randomizes the weights and makes it learn from scratch. This is obviously a terrible move.
HPWL (half-perimeter wirelength) is an approximation of wirelength, which is only one component of the chip floorplanning objective function. It is relatively easy to optimize HPWL on its own --- minimizing actual wirelength while avoiding congestion issues is much harder.
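To make the term concrete for readers outside EDA: a net's HPWL is the half-perimeter of the bounding box of its pins, summed over all nets. A minimal sketch (pin coordinates invented for illustration):

```python
def hpwl(nets):
    """Half-perimeter wirelength: for each net, the width plus height of
    the bounding box of its pin coordinates, summed over all nets."""
    total = 0.0
    for pins in nets:
        xs = [x for x, _ in pins]
        ys = [y for _, y in pins]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

# Two nets: a 2-pin net and a 3-pin net.
nets = [
    [(0, 0), (3, 4)],            # bbox 3 wide, 4 tall -> 7
    [(1, 1), (5, 2), (2, 6)],    # bbox 4 wide, 5 tall -> 9
]
print(hpwl(nets))  # -> 16.0
```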
Simulated annealing is good at quickly converging on a bad solution to the problem, with relatively little compute. So what? We aren't compute-limited here. Chip design is a lengthy, expensive process where even a few-percent wirelength reduction can be worth millions of dollars. What matters is the end result, and ML has SA beat.
(As for conflict of interest, my understanding is that Cadence has been funding Kahng's lab for years, and Markov's LinkedIn says he works for Synopsys. Meanwhile, Google has released a free, open-source tool.)
It's not that one needs an excuse. The Google CT repo said clearly that you don't need to pre-train. "Supported" usually includes at least an illustration and some scripts to get it going - no such thing was there before Kahng's paper. Pre-training was not recommended and was not supported.
Everything optimized in Nature RL is an approximation. HPWL is where you start, and RL uses it in the objective function too. As shown in "Stronger Baselines", RL loses a lot by HPWL - so much that nothing else can save it. If your wires are very long, you need routing tracks to route them, and you end up with congestion too.
SA consistently produces better solutions than RL for various time budgets. That's what matters. Both papers have shown that SA produces competent solutions. You give SA more time, you get better solutions. In a fair comparison, you give equal budgets to SA and RL. RL loses. This was confirmed using Google's RL code and two independent SA implementations, on many circuits. Very definitively. No, ML did not have SA beat - please read the papers.
Cadence hasn't funded Kahng for a long time. In fact, Google funded Kahng more recently, so he has all the incentives to support Google. Markov's LinkedIn page says he worked at Google before. Even Chatterjee, of all people, worked at Google.
Google's open-source tool is a head fake, it's practically unusable.
Update: I'll respond to the next comment here since there's no Reply button.
1. The Nature paper said one thing, the code did something else, as we've discovered. The RL method does some training as it goes. So, pre-training is not the same as training. Hence "pre". Another problem with pretraining in Google work is data contamination - we can't compare test and training data. The Google folks admitted to training and testing on different versions of the same design. That's bad. Rejection-level bad.
2. HPWL is indeed a nice simple objective. So nice that Jeff Dean's recent talks use it. It is chip design. All commercial circuit placers, without exception, optimize it and report it. All EDA publications report it. Google's RL optimized HPWL + density + congestion.
3. This shows you aren't familiar with EDA. Simulated annealing was the king of placement from the mid-1980s to the mid-1990s. Most chips were placed by SA. But you don't have to go far - as I recall, the Nature paper says they used SA to postprocess macro placements.
SA can indeed find mediocre solutions quickly, but it keeps on improving them, just like RL. Perhaps you aren't familiar with SA. I am. There are provable results showing SA finds an optimal solution if given enough time. Not so for RL.
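For readers following along, the SA loop itself is tiny. Here is a toy sketch on a 1-D placement problem (the netlist, move set, and cooling schedule are all invented for illustration, not taken from any of the papers):

```python
import math, random

def anneal(cost, state, neighbor, t0=1.0, alpha=0.995, steps=5000, seed=0):
    """Minimal simulated annealing: accept worse moves with probability
    exp(-delta/T); T decays geometrically, so early exploration gives
    way to hill-climbing. Tracks the best state ever seen."""
    rng = random.Random(seed)
    best = cur = state
    best_c = cur_c = cost(cur)
    t = t0
    for _ in range(steps):
        cand = neighbor(cur, rng)
        c = cost(cand)
        if c < cur_c or rng.random() < math.exp((cur_c - c) / t):
            cur, cur_c = cand, c
            if c < best_c:
                best, best_c = cand, c
        t *= alpha
    return best, best_c

# Toy 1-D placement: order 8 cells in a row to minimize total net span.
nets = [(0, 7), (1, 2), (3, 6), (4, 5), (0, 3), (2, 5)]

def wirelength(order):
    pos = {cell: slot for slot, cell in enumerate(order)}
    return sum(abs(pos[a] - pos[b]) for a, b in nets)

def swap(order, rng):
    i, j = rng.sample(range(len(order)), 2)
    new = list(order)
    new[i], new[j] = new[j], new[i]
    return tuple(new)

best, best_len = anneal(wirelength, tuple(range(8)), swap)
```

The identity ordering costs 18 here; a few thousand swap moves are enough for the annealer to find something noticeably shorter.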
The Nature paper describes the importance of pre-training repeatedly. The ability to learn from experience is the whole point of the method. Pre-training is just training and saving the weights -- this is ML 101.
I'm glad you agree that HPWL is a proxy metric. Optimizing HPWL is a fun applied math puzzle, but it's not chip design.
I am unaware of a single instance of someone using SA to generate real-world, usable macro layouts that were actually taped out, much less for modern chip design. This is in part due to SA's struggles to manage congestion, which result in unusable layouts. SA converges quickly to a bad solution, but that is of little practical value.
SA and HPWL are most definitely used as of today for the chips that power the GPUs used for "ML 101". But frankly this has the same value as saying "some sorting algorithm is used somewhere" -- they're well-entrenched basics of the field. To claim that SA produces "bad congestion" is like claiming that using steel pans produces bad cooking -- it needs a shitton of context and qualification, since you cannot generalize this way.
The DeepMind chess paper was also criticized for unfair evaluation, as they were using an older version of Stockfish for comparison. Apparently, the gap between AlphaZero and that old version of Stockfish (about 50 Elo, IIRC) was about the same as the gap between consecutive versions of Stockfish.
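For scale, the standard Elo logistic formula converts a rating gap into an expected score, and a 50-point edge is modest:

```python
def expected_score(elo_gap):
    """Standard Elo win expectancy for the higher-rated player,
    given the rating difference `elo_gap`."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / 400.0))

print(round(expected_score(50), 3))  # -> 0.571
```

So a 50-Elo advantage means winning roughly 57% of the points, about one extra game in seven.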
Indeed, six years later, the AlphaZero algorithm is not the best performing algorithm for chess. LCZero (uses AlphaZero algorithm) won some TCECs after it came out but for the past few years Stockfish (does not use AlphaZero algorithm) has been winning consistently.
There’s a lot of co-development happening in the space, where positions are evaluated by Leela and then used to train the NNUE net within Stockfish. And Leela comes from AlphaZero. So basically AlphaZero was directly responsible for opening up new avenues of research, letting a more specialized chess engine reach levels it could not have reached otherwise.
> Generally considered to be the strongest GPU engine, it continues to provide open data which is essential for training our NNUE networks. They released version 0.31.1 of their engine a few weeks ago, check it out!
[1]
I’d say the impact AlphaZero has had on chess and Go can’t be overstated, considering it’s a general algorithm that at worst is highly competitive with purpose-built engines. And that’s ignoring the actual point of why DeepMind is doing any of this, which is AGI (that’s why they’re not constantly trying to compete with existing engines).
To be fair, some of these criticisms are a few years old. Which normally would be fair game, but the progress in AI has been breakneck. Criticisms of other AI tech from 2021 or 2022 are pretty dated today.
Dated or not, if half of the criticisms are right, the original paper may need to be retracted. No progress on RL for chip design was published by Google since 2022, as far as I can tell. So, it looks like most if not all criticisms remain valid.
I don't really understand all the fuss about this particular paper. Nearly all papers on AI techniques are pretty much impossible to reproduce, due to details that the authors don't understand or are trying to cover up.
This is what you get if you make academic researchers compete for citation counts.
Pretraining seems to be an important aspect here, and it makes sense that such pretraining requires good examples, which unfortunately for the free lunch people, is not available to the public.
That's what you get when you let big companies do fundamental research. Would it be better if the companies did not publish anything about their research at all?
It all feels a bit unproductive to attack one another.
When I first read about AlphaChip yesterday, my first question was how it compares to other optimization algorithms such as genetic algorithms or simulated annealing. Thank you for confirming that my questions are valid.
What is your opinion of the addendum? I think the addendum and the pre-trained checkpoint are the substance of the announcement, and it is surprising to see little mention of those here.
It seems like this is multiple parties pursuing distinct arguments. Is Google saying that this technique is applicable in the way that the rebuttals are saying it is not? When I read the paper and the update I did not feel as though Google claimed that it is general, that you can just rip it off and run it and get a win. They trained it to make TPUs, then they used it to make TPUs. The fact that it doesn't optimize whatever "ibm14" is seems beside the point.
Good question. It's not just ibm14: everything people outside Google have tried shows that RL is much worse than prior methods - NVDLA, BlackParrot, etc. There is a strong possibility that Google pre-trained RL on certain TPU designs, then tested on them, and submitted to Nature.
EDA claims in the digital domain are fairly easy to evaluate. Look at the picture of the layout.
When you see a chip that has the datapath identified and laid out properly by a computer algorithm, you've got something. If not, it's vapor.
So, if your layout still looks like a random rat's nest? Nope.
If even a random person can see that your layout actually follows the obvious symmetric patterns from bit 0 to bit 63, maybe you've got something worth looking at.
Analog/RF is a little tougher to evaluate, because the smaller number of building blocks means you can use Moore's Law to brute-force things much more exhaustively, but if things "look pretty" then you've got something. If it looks weird, you don't.
That doesn't mean the fabricated netlist doesn't work. I'm not supporting Google by any means, but the test should be: Does it fabricate and function as intended? If not, clearly gibberish. If so, we now have computers building computers, which is one step closer to SkyNet. The truth is probably somewhere in between.
But even if some of the samples, with the terrible layouts, are actually functional, then we might learn something new. Maybe the gibberish design has reduced crosstalk, which would be fascinating.
Vindicated indeed. The senior researcher and others on the project were bullied for raising concerns of fraud by the two researchers [1]. They filed a lawsuit against Google that has a lot of detailed allegations of fraud [2].
It's actually not clear who was bullied. The two researchers ganged up on Chatterjee and got him fired because he used the word "fraud" - wrongful termination of a whistleblower. Google only recently settled with Chatterjee for an undisclosed amount.
TSMC made a point of calling out that their latest generation of software for automating chip design has features that allow you to select logic designs for TDP over raw speed. I think that’s our answer to keep Dennard scaling alive in spirit if not in body. Speed of light is still going to matter, so physical proximity of communicating components will always matter, but I wonder how many wins this will represent versus avoiding thermal throttling.
Questions for those in the know about chip design. How are they measuring the quality of a chip design? Does the metric that Google is reporting make sense? Or is it just something to make themselves look good?
Without knowing much, my guess is that “quality” of a chip design is multifaceted and heavily dependent on the use case. That is the ideal chip for a data center would look very different from those for a mobile phone camera or automobile.
So again what does “better” mean in the context of this particular problem / task.
I have not read the latest paper, but their previous work was really unclear about metrics being used. Researchers trying to replicate results had a hard time getting reliable details/benchmarks out of Google. Also, my recollection is that Google did not even compute timing, just wirelength and congestion; i.e. extremely primitive metrics.
Floorplanning/placement/synthesis is a billion dollar industry, so if their approach were really revolutionary they would be selling the technology, not wasting their time writing blog posts about it.
I am not sure these publications were intended to generate sales of these technologies. My assumption is that they mostly help the company in terms of recruitment. This lets potential employees see cool stuff Google is doing, and see them as an industry leader.
Spanner is literally a Google Cloud product you can buy, ignoring that it underpins a good amount of Google tech internally. The same is true of other stuff. Dismissing it as a recruitment tool indicates you haven’t worked at Google and don't really know much about their product lines.
The Spanner paper was in 2012, Bigtable in 2006, GFS in 2003. The last decade has been a 'lost decade' for Google. Not much innovation, to be honest.
From what I saw in the rebuttal papers, the Google cost-function is wirelength based. You can still get good TNS from that if your timing is very simplistic -- or if you choose your benchmark carefully.
They optimize using a fast heuristic based on wirelength, congestion, and density, but they evaluate with full P&R. It is definitely interesting that they get good timing without explicitly including it in their reward function!
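Schematically, that proxy objective is a weighted sum of the three components; a sketch with made-up weights (the real components are normalized and computed on a placement grid, which I'm eliding):

```python
def proxy_cost(wirelength, congestion, density, w_cong=0.5, w_dens=0.5):
    """Shape of the Nature-style proxy objective: wirelength plus weighted
    congestion and density penalties. Weights here are illustrative only;
    the RL reward is the negative of this cost."""
    return wirelength + w_cong * congestion + w_dens * density

cost = proxy_cost(100.0, 0.2, 0.6)
```

Timing (TNS/WNS) never appears in this sum, which is why getting good post-P&R timing from it is the surprising part.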
The odd thing is that they don't compute timing in RL, but claim that somehow TNS and WNS improved. Does anyone believe this? With five circuits and three wins, the results are a coin toss.
Yes in combination. Customers generally buy these tools as a package deal. If the placer/floorplanner blows everything else out of the water, then a CAD vendor can upsell a lot of related tools.
Oh man, if only it were that simple. A floorplanner has to guesstimate what the P&R tools are going to do with the initial layout. That can be very hard to predict -- even if the floorplanner and P&R tool are from the same vendor.
What is Google doing here? At best, the quality of their "computer chip design" work can be described as "controversial" https://spectrum.ieee.org/chip-design-controversy . What is there to gain by just making a PR now without doing anything new?
What's more, Eurisko was then used to design a fleet of battle spaceships for the Traveller TCS game. And Eurisko applied symmetry-based placement heuristics learned from VLSI design to the design of the fleet.
Doesn’t look like it. In fact the original paper claimed that their RL method could be used for all sorts of combinatorial optimization problems. Yet they chose an obscure problem in chip design and showed their results on proprietary data instead of standard public benchmarks.
Instead they could have demonstrated their amazing method on any number of standard NP hard optimization problems e.g. traveling salesman, bin packing, ILP, etc. where we can generate tons of examples and verify easily whether it produces better results than other solvers or not.
This is why many in the chip design and optimization community felt that the paper was suspicious. Even with this addendum they adamantly refuse to share any results that can be independently verified.
Have we decided when are we deprecating it? I'm already cultivating another team in a remote location to work on a competing product that we will include into Google Cloud a month before deprecating this one.
Believe it or not, there was a time when algorithms were worse than humans at laying out transistors - in particular at the higher-level design decisions.
The original paper from DeepMind evaluates what they are now calling AlphaChip versus existing optimizers, including simulated annealing. They conclude that AlphaChip outperforms them with much less compute and real time.
Does this make any sense, really? - Define some common words and then let the media run wild with them. How about we redefine "better" and "revolutionize"? Oh, wait, I think people are doing that already...
Prior to AlphaChip, macro placement was done manually by human engineers in any production setting. Prior algorithmic methods especially struggled to manage congestion, resulting in chips that weren't manufacturable.
I think the real value would be in ease of use. I imagine the top N chip creators represent a fair bit of the marginal value in pushing the state of the art forward. E.g., for hobbyists or small shops, there's likely not much value in tiny marginal improvements, but for the big ones it's worth the investment.
How far are we from memory-based computing going from research into competitive products? I get the impression that we are already well past the point where it makes sense to invest very aggressively in scaling up experiments with things like memristors, because they are talking about how many new nuclear reactors they are going to need just for the AI datacenters.
The cognitive mismatch between Von Neumann's folly and other compute architectures is vast. He slowed down the ENIAC by 66% when he got ahold of it.
We're in the timeline that took the wrong path. The other world has isolinear memory, which can be used for compute or as memory, down to the LUT level. Everything runs at a consistent speed, and faulty LUTs can be routed around easily.
If you don't worry about the programming model, it's pretty easy to be way better than existing methodologies in terms of pure compute.
But if you do pay attention to the programming model, they're unusable. You'll see that dozens of these approaches have come and gone, because it's impossible to write software for them.
I think it's just ignorance and timidity on the part of investors. Memristor or memory-computing startups are surely the next trend in investing within a few years.
I don't think it's necessarily demand or any particular calculation that makes things happen. I think people including investors are just herd animals. They aren't enthusiastic until they see the herd moving and then they want in.
I have seen at least one experiment running a language model or other neural network on (small-scale) memory-based computing substrates. That suggests it would take only a year or two to apply them to existing tasks once they are scaled up in compute capacity.
I would have assumed it would take many years longer than that to scale something like this up, based on how long it takes traditional CPU manufacturers to design state of the art chips and manufacturing processes.
I must be old because first thing I thought reading AlphaChip was why is deepmind talking about chips in DEC Alpha :-) https://en.wikipedia.org/wiki/DEC_Alpha.
How good are TPUs in comparison with state of the art Nvidia datacenter GPUs, or Groq's ASICs? Per watt, per chip, total cost, etc.? Is there any published data?
PS: It would also be nice to apply a similar algorithm to graph drawing (e.g. trying to optimize for human readability instead of electrical performance).
The technological singularity is around the corner as soon as chips (mostly) design themselves. There will be a few engineers, zillions of semi-skilled maintenance people making a pittance, and most of the world will be underemployed or unemployed. Technical people had better understand this and unionize, or they will find themselves going the way of piano tuners and Russian physicists. Slow-boiling frog...
Discounting fraud... what if the AI produces something genuinely better. Genuinely moving you to tears? What then?
Imagine your favorite movie, the most moving book. You read it, it changed you, then you found out it was an AI that generated it in a mere 10 seconds.
Artificial sentimentality is useless in the face of reality. That human endeavor is simply data points along a multi-dimensional best-fit curve.
It feels too close to being a rat with a dopamine button, meaningless hedonism.
I haven’t thought it through particularly thoroughly, though; I’d be interested in hearing other opinions. These philosophical questions quickly approach the unanswerable.
Given the trendline of AI progress over the last decade, there is a high chance the question gets answered by being actualized in reality.
It's not a random question either. With AI quickly entrenching itself into every aspect of human creation from art, music, to chip design, this is all I can think about.
Anything that needs very real-time info. AIs will always be limited by us feeding them info, or by collecting it themselves. But humans can travel to more places than an AI can - until robots are everywhere too, I suppose.
I'm only tangential to the area, but my impression over the decades is that, eventually, designing the next generation will require more resources than the current generation can provide, putting a hard stop to the exponential growth stage.
I'd even dare to claim we are already at the point where growth has stopped, but you will only see the effect in a decade or so, as there are still many small, low-hanging fruits to pick, but no big improvements.
Definitely a big part of it. Chips enable better EDA tools, which enable better chips. First it was analytic solvers and simulated annealing, now ML. Exciting times!
To clarify what the parent is getting at: Moore's law is an observation about the density (and, really about the cost) of transistors. So it's about the fabrication process, not about the logic design.
Practically speaking, though, maintaining Moore's law would have been economically prohibitive if circuit design and layout had not been automated.
Synopsys tools can use ML, but not for the layout itself - rather for tuning variables that go into the physical design flow.
> Synopsys DSO.ai autonomously explores multiple design spaces to optimize PPA metrics while minimizing tradeoffs for the target application. It uses AI to navigate the design-technology solution space by automatically adjusting or fine-tuning the inputs to the design (e.g., settings, constraints, process, flow, hierarchy, and library) to find the best PPA targets.
So you're basically saying that Google should have used existing tools to layout their chip designs, instead of their ML solution, and that these existing tools would have produced even better chips than the ones they are actually manufacturing?
It’s more like no one outside of Google has been able to reproduce Google’s results. And not for lack of trying. So if you’re outside of Google, at this moment, it’s vapor.
So, AI designing its own chips. Now that is moving towards exponential growth. Like at the end of the movie "Colossus".
Forget LLMs. What DeepMind is doing seems more like how an AI will rule in the world: building real-world models and applying game logic, like winning.
LLMs will just be the text/voice interface to what DeepMind is building.
Seems to me the article claims a lot of things but is very light on the actual comparisons that matter to you and me, namely: how do these fabled AI-designed chips compare to their competition?
For example, how much better are these latest gen TPU's when compared to NVidia's equivalent offering ?
A marvellous achievement from DeepMind, as usual. I am quite surprised that Google acquired them for as little as $400M, when I would have expected something in the range of $20BN - but then again, DeepMind wasn’t making any money back then.
Why aren’t they using this technique to design better transformer architectures or completely novel machine learning architectures in general? Are plain or mostly plain transformers really peak? I find that hard to believe.
Because chip placement and the design of neural network architectures are entirely different problems, so this solution won't magically transfer from one to the other.
And AlphaGo is trained to play Go? The point is training a model through self play to build neural network architectures. If it can play Go and architect chip placements, I don’t see why it couldn’t be trained to build novel ML architectures.
I understand the achievement, but can't square it with my belief that uniform systolic arrays will prove to be the best general purpose compute engine for neural networks. Those are almost trivial to route, by nature.
Imagine a bit-level systolic array: just a sea of LUTs, with latches that let the magic of graph coloring remove all timing concerns by clocking everything in two phases.
GPUs still treat memory as separate from compute, they just have wider bottlenecks than CPUs.
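To make the "trivial to route" claim concrete: in a systolic array, every processing element only ever talks to its nearest neighbors, so the wiring is a uniform mesh. Below is a toy, cycle-by-cycle sketch of an output-stationary systolic matrix multiply - pure Python, all names are mine, just to illustrate the dataflow, not any real hardware.

```python
def systolic_matmul(A, B):
    """Cycle-accurate toy model of an n x n output-stationary systolic array.

    PE (i, j) accumulates C[i][j]. A streams in from the left edge, B from the
    top edge, each skewed by one cycle per row/column; operands hop one PE per
    cycle, so PE (i, j) sees A[i][k] and B[k][j] together at cycle t = i+j+k.
    """
    n = len(A)
    C = [[0] * n for _ in range(n)]
    a_reg = [[0] * n for _ in range(n)]  # horizontal (A) register in each PE
    b_reg = [[0] * n for _ in range(n)]  # vertical (B) register in each PE
    for t in range(3 * n - 2):  # enough cycles for the last operand to drain
        # Iterate in reverse so reads see the *previous* cycle's registers.
        for i in reversed(range(n)):
            for j in reversed(range(n)):
                a_in = a_reg[i][j - 1] if j > 0 else (A[i][t - i] if 0 <= t - i < n else 0)
                b_in = b_reg[i - 1][j] if i > 0 else (B[t - j][j] if 0 <= t - j < n else 0)
                a_reg[i][j], b_reg[i][j] = a_in, b_in
                C[i][j] += a_in * b_in
    return C

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # → [[19, 22], [43, 50]]
```

Note that no wire in this model spans more than one PE - that locality is what makes the physical routing so regular.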
- A rebuttal by a researcher within Google who wrote this at the same time as the "AlphaChip" work was going on ("Stronger Baselines for Evaluating Deep Reinforcement Learning in Chip Placement"): http://47.190.89.225/pub/education/MLcontra.pdf
- The 2023 ISPD paper from a group at UCSD ("Assessment of Reinforcement Learning for Macro Placement"): https://vlsicad.ucsd.edu/Publications/Conferences/396/c396.p...
- A paper from Igor Markov which critically evaluates the "AlphaChip" algorithm ("The False Dawn: Reevaluating Google's Reinforcement Learning for Chip Macro Placement"): https://arxiv.org/pdf/2306.09633
In short, the Google authors did not fairly evaluate their RL macro placement algorithm against other SOTA algorithms: rather, they claim to perform better than a human at macro placement, which falls far short of what mixed-placement algorithms are capable of today. The RL technique also requires significantly more compute than other algorithms, and ultimately it learns a surrogate function for placement iteration rather than any novel representation of the placement problem itself.
In full disclosure, I am quite skeptical of their work and wrote a detailed post on my website: https://vighneshiyer.com/misc/ml-for-placement/
The AlphaChip authors address criticism in their addendum, and in a prior statement from the co-lead authors: https://www.nature.com/articles/s41586-024-08032-5 , https://www.annagoldie.com/home/statement
- The 2023 ISPD paper didn't pre-train at all. This means no learning from experience, for a learning-based algorithm. I feel like you can stop reading there.
- The ISPD paper and the MLcontra paper both used much larger, older technology nodes, which have quite different physical properties. TPU is on a sub-10nm node, whereas ISPD uses 45nm and 12nm. These are really different from a physical design perspective. Even worse, MLcontra uses a truly ancient benchmark on a >100nm node.
Markov's paper just summarizes the other two.
(Incidentally, none of ISPD / MLcontra / Markov were peer reviewed - ISPD 2023 was an invited paper.)
There's a lot of other stuff wrong with the ISPD paper and the MLcontra paper - happy to go into it - and a ton of weird financial incentives lurking in the background. Commercial EDA companies do NOT want a free open-source tool like AlphaChip to take over.
Reading your post, I appreciate the thoroughness, but it seems like you are too quick to let ISPD 2023 off the hook for failing to pre-train and using less compute. The code for pre-training is just the code for training --- you train on some chips, and you save and reuse the weights between runs. There's really no excuse for failing to do this, and the original Nature paper described at length how valuable pre-training was. Given how different TPU is from the chips they were evaluating on, they should have done their own pre-training, regardless of whether the AlphaChip team released a pre-trained checkpoint on TPU.
(Using less compute isn't just about making it take longer - ISPD 2023 used half as many GPUs and 1/20th as many RL experience collectors, which may screw with the dynamics of the RL job. And... why not just match the original authors' compute, anyway? Isn't this supposed to be a reproduction attempt? I really do not understand their decisions here.)
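A minimal sketch of the point that pre-training is just training plus checkpointing - note that `PlacementPolicy` and its methods are hypothetical stand-ins of my own, not AlphaChip's actual API:

```python
import pickle

class PlacementPolicy:
    """Hypothetical stand-in for an RL placement agent."""
    def __init__(self, weights=None):
        self.weights = weights if weights is not None else {"w": 0.0}

    def train_on(self, design, steps):
        # Stand-in for an RL training loop over one netlist.
        self.weights["w"] += steps * 0.001

def pretrain(designs, path="policy.ckpt"):
    """'Pre-training' is just training across many designs, then saving weights."""
    policy = PlacementPolicy()
    for design in designs:
        policy.train_on(design, steps=1000)
    with open(path, "wb") as f:
        pickle.dump(policy.weights, f)

def finetune(new_design, path="policy.ckpt"):
    """Warm-start from the saved checkpoint, then briefly train on the new design."""
    with open(path, "rb") as f:
        policy = PlacementPolicy(pickle.load(f))
    policy.train_on(new_design, steps=100)
    return policy
```

Nothing about the save/load step depends on a vendor-supplied checkpoint, which is why "they didn't release a TPU checkpoint" doesn't excuse skipping pre-training entirely.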
Kahng's ISPD 2023 paper is not in dispute - no established experts objected to it. The Nature paper is in dispute: dozens of experts objected to it, and Kahng, Cheng, Markov, Madden, Lienig, and Swartz objected publicly.
The fact that Kahng's paper was invited doesn't mean it wasn't peer reviewed. I checked with ISPD chairs in 2023 - Kahng's paper was thoroughly reviewed and went through multiple rounds of comments. Do you accept it now? Would you accept peer-reviewed versions of other papers?
Kahng is the most prominent active researcher in this field. If anyone knows this stuff, it's Kahng. There were also five other authors on that paper, including another celebrated professor, Cheng.
The pre-training thing was disclaimed in the Google release. No code, data, or instructions for pre-training were provided by Google for years. The instructions said clearly: you can get results comparable to Nature without pre-training.
The "much older technology" complaint is also bogus, because HPWL scales linearly and is reported by all commercial tools. Rectangles are rectangles. This is textbook material. But Kahng et al. also prepared some very fresh examples, including NVDLA, in two recent technologies. Guess what: RL did poorly on those. Do you accept this result?
The bit about financial incentives and open-source is blatantly bogus, as Kahng leads OpenROAD - the main open-source EDA framework. He is not employed by any EDA companies. It is Google who has huge incentives here, see Demis Hassabis tweet "our chips are so good...".
The "Stronger Baselines" paper matched compute resources exactly. Kahng and his coauthors performed fair comparisons between annealing and RL, giving the same resources to each. Giving greater resources is unlikely to change the results. This was thoroughly addressed in Kahng's FAQ - if only you could read it.
The resources used by Google were huge. The Cadence tools in Kahng's paper ran hundreds of times faster and produced better results. That is as conclusive as it gets.
It doesn't take a Ph.D. to understand fair comparisons.
This is written as a textbook example of the logical fallacy of appeal to authority.
HPWL (half-perimeter wirelength) is an approximation of wirelength, which is only one component of the chip floorplanning objective function. It is relatively easy to crunch all the components together and optimize HPWL --- minimizing actual wirelength while avoiding congestion issues is much harder.
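For readers outside EDA, HPWL is simple to state: each net contributes the half-perimeter of the bounding box of its pins. A minimal sketch (the function names are mine, not from any tool):

```python
def hpwl(net):
    """Half-perimeter of the bounding box enclosing one net's (x, y) pins."""
    xs = [x for x, _ in net]
    ys = [y for _, y in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))

def total_hpwl(nets):
    """The placement proxy objective: sum of HPWL over all nets."""
    return sum(hpwl(net) for net in nets)

# A 3-pin net spanning a 4-wide, 2-tall bounding box has HPWL 4 + 2 = 6:
print(hpwl([(0, 0), (4, 1), (2, 2)]))  # → 6
```

The objective is cheap to evaluate, which is exactly why it's used as a proxy - and why a low HPWL alone says little about routability or congestion.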
Simulated annealing is good at quickly converging on a bad solution to the problem, with relatively little compute. So what? We aren't compute-limited here. Chip design is a lengthy, expensive process where even a few-percent wirelength reduction can be worth millions of dollars. What matters is the end result, and ML has SA beat.
(As for conflict of interest, my understanding is that Cadence has been funding Kahng's lab for years, and Markov's LinkedIn says he works for Synopsys. Meanwhile, Google has released a free, open-source tool.)
Everything optimized in Nature RL is an approximation. HPWL is where you start, and RL uses it in the objective function too. As shown in "Stronger Baselines", RL loses a lot by HPWL - so much that nothing else can save it. If your wires are very long, you need routing tracks to route them, and you end up with congestion too.
SA consistently produces better solutions than RL for various time budgets. That's what matters. Both papers have shown that SA produces competent solutions. You give SA more time, you get better solutions. In a fair comparison, you give equal budgets to SA and RL. RL loses. This was confirmed using Google's RL code and two independent SA implementations, on many circuits. Very definitively. No, ML did not have SA beat - please read the papers.
Cadence hasn't funded Kahng for a long time. In fact, Google funded Kahng more recently, so he has all the incentives to support Google. Markov's LinkedIn page says he worked at Google before. Even Chatterjee, of all people, worked at Google.
Google's open-source tool is a head fake, it's practically unusable.
Update: I'll respond to the next comment here since there's no Reply button.
1. The Nature paper said one thing, the code did something else, as we've discovered. The RL method does some training as it goes. So, pre-training is not the same as training. Hence "pre". Another problem with pretraining in Google work is data contamination - we can't compare test and training data. The Google folks admitted to training and testing on different versions of the same design. That's bad. Rejection-level bad.
2. HPWL is indeed a nice simple objective. So nice that Jeff Dean's recent talks use it. It is chip design. All commercial circuit placers without exception optimize it and report it. All EDA publications report it. Google's RL optimized HPWL + density + congestion.
3. This shows you aren't familiar with EDA. Simulated Annealing was the king of placement from the mid-1980s to the mid-1990s. Most chips were placed by SA. But you don't have to go far - as I recall, the Nature paper says they used SA to postprocess macro placements.
SA can indeed find mediocre solutions quickly, but it keeps improving them, just like RL. Perhaps you aren't familiar with SA. I am. There are provable results showing that SA finds the optimal solution if given enough time. There are none for RL.
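For readers who haven't seen it, the core of SA fits in a screenful: propose a random move, always accept improvements, and accept uphill moves with probability exp(-delta/T) while the temperature T decays. A toy version for macro placement, minimizing the HPWL proxy by swapping macro slots - illustrative only, not the implementation from either paper:

```python
import math
import random

def anneal_placement(positions, nets, t0=1.0, cooling=0.999, steps=5000, seed=0):
    """Toy simulated annealing: swap two macros' slots each step; accept any
    improvement, and accept uphill moves with probability exp(-delta / T)."""
    rng = random.Random(seed)
    pos = list(positions)  # pos[i] = (x, y) slot of macro i

    def hpwl_cost(p):
        # Sum of half-perimeter wirelengths; each net is a list of macro indices.
        total = 0.0
        for net in nets:
            xs = [p[i][0] for i in net]
            ys = [p[i][1] for i in net]
            total += (max(xs) - min(xs)) + (max(ys) - min(ys))
        return total

    cur, t = hpwl_cost(pos), t0
    for _ in range(steps):
        i, j = rng.sample(range(len(pos)), 2)
        pos[i], pos[j] = pos[j], pos[i]
        new = hpwl_cost(pos)
        if new <= cur or rng.random() < math.exp((cur - new) / t):
            cur = new                        # accept (possibly uphill) move
        else:
            pos[i], pos[j] = pos[j], pos[i]  # reject: undo the swap
        t = max(t * cooling, 1e-6)           # cool slowly; never reach zero
    return pos, cur
```

The theoretical convergence result referenced above applies to the idealized schedule (logarithmically slow cooling); practical schedules like the geometric one here trade that guarantee for speed.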
I'm glad you agree that HPWL is a proxy metric. Optimizing HPWL is a fun applied math puzzle, but it's not chip design.
I am unaware of a single instance of someone using SA to generate real-world, usable macro layouts that were actually taped out - much less for modern chip design - in part due to SA's struggles to manage congestion, which result in unusable layouts. SA converges quickly to a bad solution, but this is of little practical value.
https://en.wikipedia.org/wiki/Top_Chess_Engine_Championship
So perhaps the critics had a point there.
> Generally considered to be the strongest GPU engine, it continues to provide open data which is essential for training our NNUE networks. They released version 0.31.1 of their engine a few weeks ago, check it out!
[1]
I'd say the impact AlphaZero has had on chess and Go can't be overstated, considering it's a general algorithm that at worst is highly competitive with purpose-built engines. And that's ignoring the actual point of why DeepMind is doing any of this, which is AGI (that's why they're not constantly trying to compete with existing engines).
[1] https://lichess.org/@/StockfishNews/blog/stockfish-17-is-her...
This is what you get if you make academic researchers compete for citation counts.
Pretraining seems to be an important aspect here, and it makes sense that such pretraining requires good examples, which, unfortunately for the free-lunch crowd, are not available to the public.
That's what you get when you let big companies do fundamental research. Would it be better if the companies did not publish anything about their research at all?
It all feels a bit unproductive to attack one another.
Whichever approach ends up winning is improved by careful evaluation and replication of results.
When you see a chip that has the datapath identified and laid out properly by a computer algorithm, you've got something. If not, it's vapor.
So, if your layout still looks like a random rat's nest? Nope.
If even a random person can see that your layout actually follows the obvious symmetric patterns from bit 0 to bit 63, maybe you've got something worth looking at.
Analog/RF is a little tougher to evaluate, because the smaller number of building blocks means you can use Moore's Law to brute-force things much more exhaustively, but if things look pretty then you've got something. If it looks weird, you don't.
They must feel vindicated by their work turning out to be so fruitful now.
[1] https://www.theregister.com/AMP/2023/03/27/google_ai_chip_pa...
[2] https://regmedia.co.uk/2023/03/26/satrajit_vs_google.pdf
Without knowing much, my guess is that “quality” of a chip design is multifaceted and heavily dependent on the use case. That is the ideal chip for a data center would look very different from those for a mobile phone camera or automobile.
So again: what does "better" mean in the context of this particular problem/task?
Floorplanning/placement/synthesis is a billion dollar industry, so if their approach were really revolutionary they would be selling the technology, not wasting their time writing blog posts about it.
https://research.google/pubs/spanner-googles-globally-distri...
or Bigtable?
https://research.google/pubs/bigtable-a-distributed-storage-...
or GFS?
or MapReduce?
or Borg?
or...I think you get the idea.
(no paywall): https://www.cl.cam.ac.uk/~ey204/teaching/ACS/R244_2021_2022/...
Maybe all together, but I don't think automatic placement algorithms are a billion dollar industry. There's so much more to it than that.
[1] https://en.wikipedia.org/wiki/Eurisko
What's more, Eurisko was then used to design the fleet of battle spaceships for the Traveller TCS game. And Eurisko applied symmetry-based placement, learned from VLSI design, to the design of the spaceship fleet.
Can AlphaChip's heuristics be used anywhere else?
Instead they could have demonstrated their amazing method on any number of standard NP-hard optimization problems - e.g. traveling salesman, bin packing, ILP, etc. - where we can generate tons of examples and easily verify whether it produces better results than other solvers.
This is why many in the chip design and optimization communities felt that the paper was suspicious. Even with this addendum, they adamantly refuse to share any results that can be independently verified.
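The "easy to verify" point is worth emphasizing: for a problem like TSP, checking a claimed tour takes a few lines of code even though finding the best one is NP-hard. A hedged sketch (helper names are mine) - the brute-force reference is only feasible for tiny instances, which is exactly what makes claimed improvements independently checkable:

```python
import itertools
import math

def tour_length(points, order):
    """Total length of the closed tour that visits points in the given order."""
    return sum(math.dist(points[order[i]], points[order[(i + 1) % len(order)]])
               for i in range(len(order)))

def brute_force_tsp(points):
    """Exact optimum by enumerating all tours starting from city 0."""
    n = len(points)
    best = min(itertools.permutations(range(1, n)),
               key=lambda p: tour_length(points, (0,) + p))
    return (0,) + best
```

Any claimed solver's output can be scored with `tour_length` alone - no trust in the solver required.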
I assume that the human benchmark is a human using existing EDA tools, not a guy with a pocket protector and a roll of tape.
https://www.cl.cam.ac.uk/~ey204/teaching/ACS/R244_2021_2022/...
Of course they do. I'm waiting for their products.
To quote a certain popular TV series: "Sorry, are you from the past?" Do your "production" chips only have a couple dozen macros, or what?
What nonsense! XD
We're in the timeline that took the wrong path. The other world has isolinear memory, which can be used for compute or as memory, down to the LUT level. Everything runs at a consistent speed, and faulty LUTs can be routed around easily.
An architecture that is better today, but lacks the yearly investment train, will quite quickly no longer be better.
You would need to be 100x to 1000x better in order to pull the investment train onto your tracks.
"Don't" has been impossible for decades.
Even so, I think we will see such a change in my lifetime.
AI could be that use case that has a strong enough demand pull to make it happen.
We will see.
But if you do pay attention to the programming model, they're unusable. You'll see that dozens of these approaches have come and gone, because it's impossible to write software for them.
I don't think it's necessarily demand or any particular calculation that makes things happen. I think people including investors are just herd animals. They aren't enthusiastic until they see the herd moving and then they want in.
TPU v5e [1]: not available for purchase, only through GCP, storage=5B, LLM-Model=7B, efficiency=393TFLOP.
[1] https://cloud.google.com/tpu/docs/v5e
Also: when is this coming to KiCad? :)
PS: It would also be nice to apply a similar algorithm to graph drawing (e.g. trying to optimize for human readability instead of electrical performance).
I don’t want art that wasn’t made by a human, no matter how visually stunning or indistinguishable it is.
Imagine your favorite movie, the most moving book. You read it, it changed you; then you found out an AI generated it in a mere 10 seconds.
Artificial sentimentality is useless in the face of reality. That human endeavor is simply data points along a multi-dimensional best-fit curve.
I think it would feel hollowed out, disingenuous.
It feels too close to being a rat with a dopamine button, meaningless hedonism.
I haven't thought it through particularly thoroughly, though; I'd be interested in hearing other opinions. These philosophical questions quickly become unanswerable.
Given the trendline of AI progress over the last decade, there's a good chance the question will be answered by being actualized in reality.
It's not a random question either. With AI quickly entrenching itself into every aspect of human creation from art, music, to chip design, this is all I can think about.
I'd even dare to claim we are already at the point where the growth has stopped, but even then you will only see the effect in a decade or so, as there are still many small low-hanging fruits to fix - just no big improvements.
Practically speaking, though, maintaining Moore's law would have been economically prohibitive if circuit design and layout had not been automated.
> Synopsys DSO.ai autonomously explores multiple design spaces to optimize PPA metrics while minimizing tradeoffs for the target application. It uses AI to navigate the design-technology solution space by automatically adjusting or fine-tuning the inputs to the design (e.g., settings, constraints, process, flow, hierarchy, and library) to find the best PPA targets.
Re: using RL and other types of AI assistance for chip design - Nvidia and others are doing this too.
I think the next step is arrays of memory-based compute.