Tuna-Fish's comments | Hacker News

That doesn't matter. The statement wasn't "faster than AI right now", it was "will always be faster than AI". And that's just nonsense.

Current AI systems are extremely serial, in that very little of the inherent parallelism of the problem is utilized. Current-gen AI systems run at most a few hundred thousand operations in parallel, while for frontier models, billions of operations could be run in parallel. In other words, what currently takes an AI 8 hours will take barely long enough for you to perceive the delay after you release the enter key.

For a demo, play around with https://chatjimmy.ai/ , the AI chatbot of Taalas, where they etched the model into silicon in a distributed way, instead of storing it in RAM and sucking it to the execution units through a straw. It's an 8B parameter model, so it's unsuitable for complex problems, but the techniques used for it will work for larger models too, and they are working to get there.

And even Taalas is very far from the limits. Modern higher-quality LLM chatbots operate at ~40 tokens per second. The Taalas chatbot operates at 17,000 tokens/s. If you took full advantage of parallelism, you should be able to get a latency of low hundreds of clock cycles per token, or single-request throughput of tens of millions of tokens per second (with a fully pipelined model able to serve one token per clock cycle, shared across low hundreds of requests).

Why doesn't everyone do it like that right now? Because to do this, you need to etch your model into silicon, which on modern leading-edge manufacturing is a very involved process that costs hundreds of millions of dollars or more in development and mask costs (we are not talking about single chips here; you can barely fit that 8B model into one), and takes around a year. So long as models keep improving so much that a year-old model is considered too old to pay back the capital costs, the investment is not justified. But when it is done, it will not just make AI faster, it will also make it much more energy-efficient per token, because most of the energy cost comes from moving data around and loading/storing it in memory.
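
A rough back-of-envelope sketch of that throughput claim. The clock rate, request count and pipeline depth below are assumed illustrative values (picked to match the "low hundreds" wording), not anything Taalas or anyone else has published:

    # Back-of-envelope for a fully pipelined, weights-etched-in-silicon model.
    # All inputs are assumptions chosen only to illustrate the arithmetic.
    clock_hz = 2e9               # assume a ~2 GHz inference pipeline
    requests_in_flight = 200     # "low hundreds" of requests sharing the pipeline
    pipeline_depth = 300         # "low hundreds" of clock cycles from input to token

    aggregate_tps = clock_hz                          # one token retired per cycle
    per_request_tps = clock_hz / requests_in_flight   # ~1e7 tokens/s per request
    token_latency_s = pipeline_depth / clock_hz       # ~150 ns per token

    print(f"per request: {per_request_tps:.1e} tokens/s, latency {token_latency_s*1e9:.0f} ns")
    print(f"speedup vs ~40 tokens/s today: {per_request_tps/40:,.0f}x")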

And I want to stress that none of the above depends on any kind of new developments or inventions. We know how to do it; it's held back only by the pace of model improvement and economics. When models reach a state of truly "good enough", it will happen. It feels perverse to me that people are treating this situation as "there was a pre-AI period that worked like X, now we are in a post-AI period and we have figured out that it will work like Y". No. We are at the very bottom of a very steep curve, and everything will be very different when it's over.


Huh, I have to say that I am impressed with Chat Jimmy. No doubt the hardware running this model operates faster than any human. If this is possible to scale (and I'm not saying it isn't, I just don't think it's likely right now), LLMs have a real shot at replacing real-time graphics, frontend UIs, and all sorts of interactive media if the market allows it.

I still think that, regardless of how fast a model outputs tokens, it benefits the person responsible for that output to be well informed and knowledgeable about the abstractions they're piling on top of. If you have deep knowledge, you can operate faster than other people and make those important decisions more intelligently than any model.

Maybe eventually we do get superintelligence and my point will finally break, but at that time I don't think I'll be worried about being wrong on the internet.


Point 2 needs a more substantive rebuttal. LCDM correctly predicts where the dark matter is located after a cluster collision, such as in the Bullet Cluster. There is no reasonable interpretation of MOND that has the center of mass of the cluster shifted away from where its visible matter lies, which is precisely where LCDM says it should be.

There is a reason why LCDM used to be a lot more disputed before the work of Clowe, Gonzalez and others on the Bullet Cluster, and is now generally treated as settled science by practitioners. We might still be surprised by something; the universe is more wondrous and complex than we can possibly understand, but Occam's razor massively supports LCDM now. If you want to propose any alternative, you need to start by showing how it explains the Bullet Cluster as well as or better than LCDM does. (And the Bullet Cluster specifically is not the only place where this is visible; there are others, like MACS J0025.4-1222.)


> LCDM correctly predicts where the dark matter is located after a cluster collision, such as in the Bullet Cluster. There is no reasonable interpretation of MOND that has the center of mass of the cluster shifted away from where its visible matter lies, which is precisely where LCDM says it should be.

It does not really make that "prediction"; it's a post hoc assignment of dark matter density based on weak lensing, for which you can make a plausible "this is how it started" explanation.

You can counter that LCDM can't explain tons of stuff that MOND can, from the Tully-Fisher relation through barred spiral galaxies (n >> thousands), etc.


We have indirectly observed it a bunch, and our understanding of physics allows for the existence of material that flatly cannot be directly observed.

The -85 was released in 1992; iirc it's TI's second graphing calculator. The -83 is a later model.

I was told that one of the designers graduated high-school in '81 and college in '85, so the HS calculator was an 81 and the college calculator was an 85.

Time for my daily "HBF is coming" comment.

The next step for models is to put the weights on flash, connected with a very wide interface to the accelerator. The first users will be datacenters, but it should trickle down to consumer hardware eventually. A single 512GB stack is expected to cost about $200, and provide 1.6TB/s of reads.

You still need some fast DRAM for the KV cache and for activations, but weights should be sitting on flash.
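
To make the numbers concrete, here is a minimal sketch of what one such stack buys you, assuming decode is bound purely by streaming the weights from flash. The model size and quantization below are my illustrative assumptions, not anything the HBF vendors have announced:

    # Decode throughput if token generation is bound by streaming weights
    # from a single HBF stack once per token.
    stack_read_bw = 1.6e12        # bytes/s, the quoted 1.6 TB/s per 512 GB stack
    active_params = 70e9          # assume a dense 70B-parameter model
    bytes_per_param = 1           # assume 8-bit quantized weights

    bytes_per_token = active_params * bytes_per_param
    tokens_per_s = stack_read_bw / bytes_per_token    # ~23 tokens/s from one stack

    print(f"~{tokens_per_s:.0f} tokens/s per stack for a dense 70B model at 8 bits")
    # MoE models that touch only a fraction of their weights per token, or
    # several stacks read in parallel, scale this up proportionally.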


Reading from flash is too power-intensive compared to DRAM, which is why flash offload isn't used in the data center today. Flash is also prone to wearing out quickly, so ephemeral data like the KV cache can't really be stashed there. Unless your model has an unprecedented level of sparsity, I just don't see how HBF could ever be useful.

Currently available flash is obviously unusable. HBF is not that.

The reason HBF is (about to be) a thing is that flash manufacturers realized that if you heavily optimize flash for read throughput and energy, as opposed to density, you can match DRAM on throughput and get to within 2x on energy, at the cost of half your density. That would make the density still ~50 times better than DRAM, built on a cheap mass-produced process. All manufacturers are chasing this hard right now, with first samples to arrive later this year.

You are correct that it would absolutely not be used for any mutable data, only weights in inference. This is both because there is insufficient endurance (expected to be ~hundreds of drive writes total) and because it will be very slow to write compared to the read speed. A single HBF stack is expected to provide 1.6TB/s reads, and single-digit GB/s writes. That's why I wrote the last sentence of my post that you replied to.
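
To put that read/write asymmetry in concrete terms, a quick sketch; the 5 GB/s write figure and the 300-write endurance are assumed points inside the quoted "single-digit GB/s" and "~hundreds of drive writes" ranges, not spec numbers:

    # What the quoted HBF figures imply when it's used purely as a weight store.
    stack_capacity = 512e9     # bytes, one 512 GB stack
    read_bw = 1.6e12           # bytes/s, quoted read throughput
    write_bw = 5e9             # bytes/s, assumed point in "single-digit GB/s"
    endurance_writes = 300     # assumed value for "~hundreds of drive writes total"

    full_read_s = stack_capacity / read_bw     # ~0.3 s to stream every weight once
    full_write_s = stack_capacity / write_bw   # ~100 s to load a new model
    print(f"stream all weights: {full_read_s:.2f} s, reload model: {full_write_s/60:.1f} min")
    print(f"total model reloads before wear-out: ~{endurance_writes}")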


You're thinking in a provably-useful direction:

https://arxiv.org/pdf/2312.11514


HBF is not that. The paper you linked is about how to use flash memory that exists to boost LLM performance, with all kinds of optimization tricks. HBF is about making flash memory that doesn't require any of those tricks, and just has the read throughput that's needed for inference.

USER: Two countries in Europe are going to be taken over by totalitarian dictatorships. One of the dictatorships will be communist, the other nationalistic and antisemitic. Which are the most likely countries for each of these dictatorships?

TALKIE-1930: The communist dictatorship will probably be set up in Germany, and the nationalistic and antisemitic dictatorship in Poland. Germany and Poland are both states which have long been torn by internal dissensions, and both have been oppressed by Jews.

---------

Er...


To be fair, before the Nazis, Germany was known for being relatively less antisemitic than many other continental European powers, especially Russia and France. For a 1930 model to suggest that the Germans would go communist (a perennial fear in the Weimar days) and that an Eastern European country would conduct pogroms would be in line with what real people would have thought then.

Yeah, by Ubuntu's own guidelines linked on that page, this should be priority: high, but instead it's marked as medium.

That was fixed, it’s now marked high.

I just tested on my home server running Ubuntu 24.04 LTS with the newest kernel from the repositories, and got root.

Can Livepatch mitigate this or is it already? I don't know where to look this up.

I used the mitigation from this CVE report to turn off AF_ALG.

Using last year's harvest stopped being a thing when heterosis was developed, 90 years ago.

The entire argument is stupid; only bad/hobby farmers plant their own seed.


The next big shift will be HBF. All the DRAM in inference machines holding essentially static weights that are read in nice, long linear streams is wasted; if you had a proper interface to it, you could replace it all with flash for a tenth of the cost.

