Top Python Libraries

PyTorch Officially Accelerates Inference Without CUDA: Is the Triton Era Coming?

PyTorch explores non-CUDA inference with Triton kernels, challenging NVIDIA’s dominance in large model training and inference. Discover the future of AI computing.

Meng Li
Sep 09, 2024

Recently, PyTorch shared insights into implementing non-CUDA computation, including micro-benchmark comparisons of different kernels and a discussion of future improvements to Triton kernels aimed at closing the gap with CUDA.

For training, fine-tuning, and inference of large language models (LLMs), NVIDIA GPUs and CUDA are commonly used.

Across machine learning more broadly, CUDA is also heavily relied upon, delivering significant performance gains for GPU-accelerated models.

CUDA's dominance in accelerated computing has also become one of NVIDIA's key competitive advantages.

However, other efforts are emerging to challenge CUDA. One example is Triton, introduced by OpenAI, which offers advantages in usability, memory efficiency, and as a foundation for AI compiler stacks.

Recently, PyTorch announced its plans for large model inference without using NVIDIA CUDA.
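
For context, PyTorch's torch.compile already moves in this direction: its default Inductor backend lowers models to Triton kernels for much of the GPU code it generates (though by default it may still dispatch some operations, such as matrix multiplies, to vendor libraries). Below is a minimal, illustrative sketch; the toy model and shapes are my own and are not taken from the PyTorch post.

```python
# Illustrative sketch: compiling a toy model with torch.compile, whose
# default Inductor backend emits Triton kernels for GPU targets.
import torch

model = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024),
    torch.nn.GELU(),
    torch.nn.Linear(1024, 1024),
).to("cuda").eval()

compiled = torch.compile(model)  # "inductor" is the default backend

with torch.no_grad():
    out = compiled(torch.randn(8, 1024, device="cuda"))
```

On recent PyTorch releases, running with the environment variable TORCH_LOGS=output_code prints the code Inductor generates, much of which is Triton on GPU targets. The 100% Triton effort described in the post goes further, aiming to replace the remaining vendor-library calls with Triton kernels as well.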

The PyTorch team explained why it is exploring 100% Triton:

“Triton offers a path to run large models on various GPUs, including those from NVIDIA, AMD, Intel, and other GPU-based accelerators. It also provides a higher-level abstraction for GPU programming in Python, making it faster to write high-performance kernels with PyTorch than vendor-specific APIs.”
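
To give a sense of what that higher-level abstraction looks like, here is a minimal Triton kernel for element-wise vector addition. It follows the standard Triton tutorial pattern and is illustrative only; it is not taken from the PyTorch post.

```python
# A minimal Triton kernel: element-wise vector addition, written in Python.
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one contiguous block of elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard against out-of-bounds accesses
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_elements = out.numel()
    # Launch a 1D grid with one program per BLOCK_SIZE elements.
    grid = lambda meta: (triton.cdiv(n_elements, meta["BLOCK_SIZE"]),)
    add_kernel[grid](x, y, out, n_elements, BLOCK_SIZE=1024)
    return out


x = torch.rand(4096, device="cuda")
y = torch.rand(4096, device="cuda")
assert torch.allclose(add(x, y), x + y)
```

The kernel is plain Python: block decomposition, masking, and memory access are expressed with triton.language primitives, and Triton's compiler handles the low-level scheduling and tuning that a hand-written CUDA kernel would require.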
