a quick intro to FLOPs
Lately, I’ve been low-key obsessed with SemiAnalysis articles, and one in particular made me pause: even 8x H100 cannot serve a 1 trillion parameter dense model at 33.33 tokens per second. Furthermore, the FLOPS utilization rate of the 8x H100s at 20 tokens per second would still be under 5%, resulting in horribly high inference costs. While the claim is from 2023 and inference is still largely memory-bandwidth-bound, the hardware game has evolved since then...
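
To see where numbers like these come from, here's a rough back-of-envelope sketch in Python. It assumes ~2·N FLOPs per generated token for a dense N-parameter model and H100 spec-sheet figures (989 TFLOPS dense BF16, 3.35 TB/s HBM3); these are my assumptions, not SemiAnalysis's exact inputs, and the sub-1% utilization below is for a single decode stream, so their under-5% figure presumably reflects some batching.

```python
# Back-of-envelope: is 8x H100 compute-bound or bandwidth-bound
# when serving a 1T-parameter dense model? (Assumed spec numbers.)

n_params = 1e12                # 1 trillion parameters, dense
flops_per_token = 2 * n_params # common ~2*N FLOPs/token approximation
tokens_per_sec = 20

# Compute side: FLOPs actually consumed vs. aggregate peak.
peak_flops = 8 * 989e12        # 8 GPUs * dense BF16 TFLOPS (spec sheet)
needed_flops = flops_per_token * tokens_per_sec
print(f"utilization at {tokens_per_sec} tok/s: "
      f"{needed_flops / peak_flops:.2%}")          # ~0.51%

# Memory side: each decode step streams every weight from HBM,
# so bandwidth caps single-stream speed no matter how many FLOPs you have.
bytes_per_param = 2            # fp16/bf16 weights
agg_bandwidth = 8 * 3.35e12    # bytes/s across 8 GPUs (spec sheet)
ceiling = agg_bandwidth / (n_params * bytes_per_param)
print(f"bandwidth-bound ceiling: {ceiling:.1f} tok/s")  # ~13.4
```

The two printed numbers tell the whole story: the compute units sit nearly idle at these generation rates, while the ~13.4 tok/s bandwidth ceiling explains why 33.33 tok/s is out of reach for a single stream at this scale.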