- Benchmark testing shows that splitting the prefill and decode phases of AI workloads across separate GPU pools speeds up large language model inference.
- The approach keeps response times steady even as user prompts become much longer.
- The architecture outperformed a commercial competitor on key metrics like first-token latency.
- Disaggregation prevents heavy users with long queries from slowing down the system for everyone else.
The team behind Theta EdgeCloud recently completed a benchmark demonstrating a more efficient method for serving large language models. Their tests split the two distinct phases of LLM inference, prefill and decode, across separate pools of NVIDIA H200 GPUs to improve performance.
Prefill, the compute-heavy prompt processing phase, was handled on one set of hardware. Meanwhile, the memory-sensitive decode phase, which generates the response, ran on another. These pools communicated over a high-speed RDMA network link, transferring the model’s working memory between them. This architecture prevents the two workload types from competing for resources on the same GPUs.
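The split described above can be illustrated with a toy sketch. This is not Theta EdgeCloud's code: every name here (`KVCache`, `prefill`, `transfer`, `decode`, `serve`) is hypothetical, the "model" only emits placeholder tokens, and the RDMA link is stood in for by a plain in-memory copy. It only shows the order of the hand-off between the two pools.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KVCache:
    """Hypothetical stand-in for the working memory produced by prefill."""
    request_id: int
    prompt_tokens: List[str]


def prefill(request_id: int, prompt: str) -> KVCache:
    # Compute-heavy phase: the whole prompt is processed once (prefill pool).
    return KVCache(request_id, prompt.split())


def transfer(cache: KVCache) -> KVCache:
    # Placeholder for the network hand-off between pools; here just a copy.
    return KVCache(cache.request_id, list(cache.prompt_tokens))


def decode(cache: KVCache, max_new_tokens: int) -> List[str]:
    # Memory-sensitive phase: one token per step (decode pool).
    # A real model would sample tokens; this toy labels them by position.
    start = len(cache.prompt_tokens)
    return [f"tok{start + i}" for i in range(max_new_tokens)]


def serve(request_id: int, prompt: str, max_new_tokens: int = 3) -> List[str]:
    cache = prefill(request_id, prompt)    # runs on the prefill pool
    moved = transfer(cache)                # crosses the link between pools
    return decode(moved, max_new_tokens)   # runs on the decode pool


if __name__ == "__main__":
    print(serve(1, "hello world"))
```

Because each phase is a separate function with the cache as its only shared state, the two stages could in principle be scheduled on different hardware without one blocking the other, which is the point the benchmark is making.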
Consequently, response times remained remarkably consistent even as prompts grew longer. For instance, the time to first token was around 783ms for a 1,000-word prompt and only 794ms for a 4,000-word prompt. This steadiness makes performance more predictable under real-world conditions where query length varies. The deployment’s results were compared to a commercial offering from Together.ai.
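To put the first-token figures quoted above in proportion, the short calculation below compares the growth in prompt length with the growth in latency. The two latency values and prompt sizes come from the article; the function and variable names are illustrative only.

```python
def relative_increase_pct(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# Figures reported in the benchmark summary.
ttft_short_ms = 783.0   # ~1,000-word prompt
ttft_long_ms = 794.0    # ~4,000-word prompt

prompt_growth_factor = 4000 / 1000
latency_growth_pct = relative_increase_pct(ttft_short_ms, ttft_long_ms)

if __name__ == "__main__":
    print(f"Prompt length grew {prompt_growth_factor:.0f}x")
    print(f"Time to first token grew {latency_growth_pct:.1f}%")
```

In other words, a fourfold increase in prompt length added 11ms, or roughly 1.4%, to the time to first token.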
Under matched workloads, the EdgeCloud setup outperformed Together.ai’s serverless endpoint on first-token latency and burst performance. It also demonstrated stronger throughput under steady load. However, Together.ai held a slight edge during very long continuous tests, a caveat that makes the comparison fairer for potential users. This technical advancement is part of a broader effort to get more output from existing, scarce GPU supply.
