South Korean AI chip startup FuriosaAI scored a major customer win this week after LG’s AI Research division tapped the startup’s AI accelerators to power servers running LG’s Exaone family of large language models.
But while floating point compute capability, memory capacity, and bandwidth all play a major role in AI performance, LG didn’t choose Furiosa’s RNGD — pronounced “renegade” — inference accelerators for speeds and feeds. Rather, it was power efficiency.
“RNGD provides a compelling combination of benefits: excellent real-world performance, a dramatic reduction in our total cost of ownership, and a surprisingly straightforward integration,” Kijeong Jeon, product unit leader at LG AI Research, said in a canned statement.
A quick peek at RNGD’s spec sheet reveals what appears to be a rather modest chip, with floating point performance coming in at between 256 and 512 teraFLOPS, depending on whether you opt for 16- or 8-bit precision. Memory capacity is also rather meager at 48GB across a pair of HBM3 stacks, which is good for about 1.5TB/s of bandwidth.

Here’s a quick overview of FuriosaAI’s RNGD PCIe card
Compared to AMD and Nvidia’s latest crop of GPUs, RNGD doesn’t look all that competitive until you consider that Furiosa has managed to do all this using just 180 watts of power. In testing, LG AI Research found the parts were as much as 2.25x more power efficient than GPUs for LLM inference on its homegrown Exaone family of models.
Before you get too excited, the GPUs in question are Nvidia’s A100s, which are getting rather long in the tooth — they made their debut just as the pandemic was kicking off in 2020.
But as FuriosaAI CEO June Paik tells El Reg, while Nvidia’s GPUs have certainly gotten more powerful in the five years since the A100’s debut, that performance has come at the expense of higher energy consumption and die area.
While a single RNGD PCIe card can’t compete with Nvidia’s H100 or B200 accelerators on raw performance, in terms of efficiency — the number of FLOPS you can squeeze from each watt — the chips are more competitive than you might think.
Paik credits much of the company’s efficiency advantage here to RNGD’s Tensor Contraction Processor architecture, which he says requires far fewer instructions to perform matrix multiplication than a GPU does, and minimizes data movement.
The chips also benefit from RNGD’s use of HBM, which Paik says requires far less power than relying on GDDR, as seen on some of Nvidia’s lower-end offerings, such as the L40S or RTX Pro 6000 Blackwell cards.
At roughly 1.4 teraFLOPS per watt, RNGD is actually closer to Nvidia’s Hopper generation than to the A100. RNGD’s efficiency becomes even more apparent if we shift focus to memory bandwidth, which is arguably the more important factor when it comes to LLM inference. As a general rule, the more memory bandwidth you’ve got, the faster it’ll spit out tokens.
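If you want to check our maths, that per-watt figure falls straight out of the spec sheets. Here is a rough back-of-the-envelope comparison, using Nvidia’s published dense FP16 tensor throughput and board power for the A100 and H100; real-world efficiency will obviously vary with the workload.

```python
# Rough FLOPS-per-watt comparison from publicly listed specs:
# dense FP16 throughput divided by board power. Figures are approximate.
chips = {
    "RNGD":     {"fp16_tflops": 256, "watts": 180},
    "A100 SXM": {"fp16_tflops": 312, "watts": 400},  # Nvidia's dense FP16/BF16 tensor figure
    "H100 SXM": {"fp16_tflops": 989, "watts": 700},  # dense, without sparsity
}

for name, c in chips.items():
    print(f"{name:9s} {c['fp16_tflops'] / c['watts']:.2f} TFLOPS/W")

# RNGD      1.42 TFLOPS/W
# A100 SXM  0.78 TFLOPS/W
# H100 SXM  1.41 TFLOPS/W
```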
Here again, at 1.5TB/s, RNGD’s memory isn’t particularly fast. Nvidia’s H100 offers both higher capacity at 80GB and between 3.35TB/s and 3.9TB/s of bandwidth. However, that chip uses anywhere from 2 to 3.9 times the power.
For roughly the same wattage as an H100 SXM module, you could have four RNGD cards totaling 2 petaFLOPS of dense FP8, 192GB of HBM, and 6TB/s of memory bandwidth. That’s still a ways behind Nvidia’s latest generation of Blackwell parts, but far closer than RNGD’s raw speeds and feeds would have you believe.
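The four-card arithmetic is just as straightforward. Here is a quick sketch using nothing but the spec-sheet numbers above; the roughly 700W comparison point is Nvidia’s published maximum for an H100 SXM module.

```python
# Four 180W RNGD cards stacked up against one ~700W H100 SXM module,
# using spec-sheet numbers only.
rngd = {"fp8_tflops": 512, "hbm_gb": 48, "bandwidth_tbs": 1.5, "watts": 180}
quad = {k: v * 4 for k, v in rngd.items()}
print(quad)
# {'fp8_tflops': 2048, 'hbm_gb': 192, 'bandwidth_tbs': 6.0, 'watts': 720}
# i.e. ~2 petaFLOPS of dense FP8, 192GB of HBM, and 6TB/s of aggregate
# bandwidth for roughly the power envelope of a single H100 SXM.
```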
And, since RNGD is designed solely with inference in mind, models can readily be spread across multiple accelerators using techniques like tensor parallelism, or even across multiple systems using pipeline parallelism.
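For the uninitiated, tensor parallelism slices each layer’s weight matrices across the cards, with every card computing a partial result that then has to be stitched back together over the interconnect. Here is a minimal numpy sketch of the idea, purely illustrative and nothing to do with Furiosa’s actual software stack.

```python
import numpy as np

# Tensor parallelism in miniature: split a weight matrix column-wise across
# four "cards", let each compute its slice, then reassemble the output.
# In a real deployment the reassembly is an all-gather/all-reduce over the
# interconnect, which is where the communication cost comes from.
rng = np.random.default_rng(0)
x = rng.standard_normal((1, 4096))       # one token's activations
w = rng.standard_normal((4096, 8192))    # a full weight matrix

shards = np.split(w, 4, axis=1)          # one column shard per card
partials = [x @ shard for shard in shards]
y = np.concatenate(partials, axis=1)

assert np.allclose(y, x @ w)             # matches the single-device matmul
```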
Real-world testing
LG AI Research actually used four RNGD PCIe cards in a tensor-parallel configuration to run its in-house Exaone 32B model at 16-bit precision. According to Paik, LG had very specific performance targets in mind when validating the chip for use.
Notably, the constraints included time to first token (TTFT), which measures how long you wait before the LLM starts generating a response: roughly 0.3 seconds for more modest 3,000-token prompts, or 4.5 seconds for larger 30,000-token prompts.
In case you’re wondering, these tests are analogous to medium-to-large summarization tasks, which put more pressure on the chip’s compute subsystem than a shorter prompt would.
LG found that it was able to achieve this level of performance while churning out about 50-60 tokens a second at a batch size of one.
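Put those two numbers together and you get a feel for end-to-end response times. As a rough illustration only: the 500-token output length below is our assumption, not LG’s.

```python
# Rough single-request latency: time to first token plus decode time.
# The 500-token output length is an assumption for illustration only.
def response_latency(ttft_s: float, output_tokens: int, tokens_per_s: float) -> float:
    return ttft_s + output_tokens / tokens_per_s

print(f"{response_latency(0.3, 500, 55):.1f} s")  # 3,000-token prompt: ~9.4 s
print(f"{response_latency(4.5, 500, 55):.1f} s")  # 30,000-token prompt: ~13.6 s
```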
According to Paik, these tests were conducted using FP16, since the A100s LG compared against do not natively support 8-bit floating-point activations. Presumably dropping down to FP8 would essentially double the model’s throughput and further reduce the TTFT.
Using multiple cards does come with some inherent challenges. In particular, the tensor parallelism that allows both the model’s weights and computation to be spread across four or more cards is rather communication-intensive.
Unlike Nvidia’s GPUs, which often feature speedy proprietary NVLink interconnects that shuttle data between chips at more than a terabyte a second, Furiosa stuck with good old PCIe 5.0, which tops out at 128GB/s of bidirectional bandwidth per card.
To avoid interconnect bottlenecks and overheads, Furiosa says it optimized the chip’s communication scheduling and compiler to overlap inter-chip direct memory access operations with compute.
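The general trick, which long predates RNGD, is to kick off a tile’s inter-chip transfer as soon as its partial result is ready, so the DMA engine works while the next tile is still computing. Here is a toy Python sketch of that overlap pattern, not Furiosa’s compiler.

```python
import threading

def compute_tile(tile: int) -> None:    # stand-in for a slice of the matmul
    print(f"compute: tile {tile}")

def send_partials(tile: int) -> None:   # stand-in for an inter-chip DMA over PCIe
    print(f"dma: shipping partials for tile {tile}")

# Kick off each tile's transfer as soon as it is computed, so communication
# overlaps with the next tile's compute rather than serializing behind it.
transfers = []
for tile in range(4):
    compute_tile(tile)
    t = threading.Thread(target=send_partials, args=(tile,))
    t.start()
    transfers.append(t)

for t in transfers:   # all partials must land before the next layer consumes them
    t.join()
```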
But because Furiosa hasn’t shared figures for higher batch sizes, it’s hard to say just how well this approach scales. At a batch size of one, the number of tensor-parallel operations is relatively small, Paik admitted.
According to Paik, individual performance should drop by only 20-30 percent at a batch size of 64. That suggests the same setup should be able to achieve close to 2,700 tokens a second of total throughput and support a fairly large number of concurrent users. But without hard details, we can only speculate.
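For what it’s worth, the 2,700 figure is simple arithmetic on the numbers quoted above, assuming the worst-case 30 percent hit.

```python
# Where the ~2,700 tokens/sec estimate comes from: simple arithmetic on
# the figures quoted above, not a measured benchmark.
batch_1_tps = 60          # upper end of LG's batch-1 result
slowdown = 0.30           # Paik's worst-case per-user drop at batch 64
batch = 64

print(f"{batch_1_tps * (1 - slowdown) * batch:.0f} tokens/sec")  # ~2,688
```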
Competitive landscape
In any case, Furiosa’s chips are good enough that LG’s AI Research division now plans to offer servers powered by RNGD to enterprises using its Exaone models.
“After extensively testing a wide range of options, we found RNGD to be a highly effective solution for deploying Exaone models,” Jeon said.
Similar to Nvidia’s RTX Pro Blackwell-based systems, LG’s RNGD boxes will be available with up to eight PCIe accelerators. These systems will run what Furiosa describes as a highly mature software stack, which includes a version of vLLM, a popular model serving runtime.
LG will also offer its agentic AI platform, called ChatExaone, which bundles up a bunch of frameworks for document analysis, deep research, data analysis, and retrieval augmented generation (RAG).
Furiosa’s powers of persuasion don’t stop at LG, either. As you may recall, Meta reportedly made an $800 million bid to acquire the startup earlier this year, but ultimately failed to convince Furiosa’s leaders to hand over the keys to the kingdom.
Furiosa also benefits from the growing demand for sovereign AI: models, software, and infrastructure designed and trained on homegrown hardware.
However, to compete on a global scale, Furiosa faces some challenges. Most notably, Nvidia and AMD’s latest crop of GPUs not only offer much higher performance, memory capacity, and bandwidth than RNGD, but by our estimate are a fair bit more energy-efficient. Nvidia’s architectures also allow for greater degrees of parallelism thanks to its early investments in rack-scale designs, a design point we’re only now seeing other chipmakers embrace.
Having said that, it’s worth noting that the design process for RNGD began in 2022, before OpenAI’s ChatGPT kicked off the AI boom. At the time, models like BERT were the mainstream when it came to language models. Paik, however, bet that GPT was going to take off and that the underlying architecture would become the new norm, and that informed decisions like using HBM rather than GDDR memory.
“In retrospect I think I should have made an even more aggressive bet and had four HBM [stacks] and put more compute dies on a single package,” Paik said.
We’ve seen a number of chip companies, including Nvidia, AMD, SambaNova, and others, embrace this approach in order to scale their chips beyond the reticle limit.
Hindsight being what it is, Paik says now that Furiosa has managed to prove out its Tensor Contraction Processor architecture, HBM integration, and software stack, the company simply needs to scale up its architecture.
“We have a very solid building block,” he said. “We’re quite confident that when you scale up this chip architecture it will be quite competitive against all the latest GPU chips.” ®