Show HN: Llama 3.1 70B on a single RTX 3090 via NVMe-to-GPU bypassing the CPU

Introduction to NVMe-to-GPU Bypassing

As a tech enthusiast, I'm always excited to see innovative solutions that push the boundaries of what's possible with consumer-grade hardware. Recently, a developer shared an impressive project on Hacker News: running the 70B variant of Llama 3.1 on a single RTX 3090, a card with only 24 GB of VRAM. The trick is to stream the model's weights from an NVMe drive directly into GPU memory, bypassing the CPU and system RAM as intermediaries and effectively cutting out the middleman.
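To see why this is a notable claim, a quick back-of-envelope calculation helps. The figures below are rough public specifications, not measurements from the project:

```python
# Rough feasibility math: why a 70B model cannot simply be loaded into an
# RTX 3090, and what NVMe streaming implies for throughput.
# All figures are approximate public specs, not project measurements.

params = 70e9                      # Llama 3.1 70B parameter count (approx.)
bytes_fp16 = params * 2            # 16-bit weights
bytes_int4 = params * 0.5          # 4-bit quantized weights

vram_3090 = 24e9                   # RTX 3090 VRAM in bytes
nvme_bw = 7e9                      # ~7 GB/s sequential read, PCIe 4.0 x4 NVMe

print(f"fp16 weights : {bytes_fp16 / 1e9:.0f} GB")   # ~140 GB
print(f"int4 weights : {bytes_int4 / 1e9:.0f} GB")   # ~35 GB
print(f"3090 VRAM    : {vram_3090 / 1e9:.0f} GB")    # 24 GB

# If every layer's weights must be streamed from NVMe for each generated
# token, the drive's read bandwidth becomes the hard ceiling on speed:
print(f"upper bound, fp16 streaming: {nvme_bw / bytes_fp16:.3f} tok/s")
print(f"upper bound, int4 streaming: {nvme_bw / bytes_int4:.2f} tok/s")
```

Even quantized to 4 bits, the weights are larger than the card's 24 GB of VRAM, so some form of streaming is unavoidable, and the NVMe link rather than the GPU sets the ceiling on tokens per second.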

How it Works

The idea behind this project is to use the high-speed interface of NVMe drives to feed data directly to the GPU rather than taking the usual route through the CPU and system RAM. Normally, data is read from disk into a buffer in system RAM and only then copied over PCIe to the GPU; a direct NVMe-to-GPU transfer removes that extra hop and the CPU copy work. This can reduce latency and raise throughput, especially for workloads that are dominated by moving large amounts of data onto the GPU, such as machine learning and other AI workloads.
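The repository's exact mechanism isn't reproduced here, but NVIDIA's GPUDirect Storage (cuFile) is the standard way to DMA data from an NVMe drive straight into GPU memory on Linux. A minimal sketch using the kvikio Python bindings, with a placeholder file name and buffer shape, might look like this:

```python
# Illustrative only: a direct NVMe -> GPU read using NVIDIA GPUDirect
# Storage (cuFile) via the kvikio Python bindings. This is NOT the
# project's code; the file name and buffer shape are made-up placeholders.
import cupy as cp
import kvikio

# Destination buffer lives in GPU memory (e.g. one layer's weights).
weights = cp.empty((8192, 8192), dtype=cp.float16)

# cuFile issues a DMA transfer from the NVMe drive straight into VRAM,
# skipping the usual bounce buffer in system RAM.
f = kvikio.CuFile("layer_00.bin", "r")
nbytes = f.read(weights)
f.close()

print(f"read {nbytes / 1e6:.1f} MB directly into GPU memory")
```

True GPUDirect transfers require compatible driver and filesystem support; without it, kvikio quietly falls back to a conventional read through a CPU bounce buffer, so the bypass should be verified rather than assumed.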

The developer has published the library's repository, which includes a README with more details on the project. According to the author, the approach works not only on high-end consumer GPUs like the RTX 3090 but also on professional-grade cards, where it is expected to perform even better.

Potential Benefits

The implications of this project are fascinating, and I can think of several scenarios where this technology could be extremely useful:

  • Retro gaming: reportedly the original motivation behind the project; streaming assets straight from NVMe to the GPU could allow smoother, more responsive gameplay, even on older systems.
  • Machine learning: streaming weights from NVMe lets a single consumer GPU run models far larger than its VRAM, which is exactly what the Llama 3.1 70B demo shows, and it could also cut the time spent loading large checkpoints.
  • Other AI workloads: any GPU-accelerated workload that is bottlenecked on moving data from storage to the GPU could see a meaningful speedup.

Features and Pros

Some key features and pros of this project include:

  • Direct NVMe-to-GPU transfers: removes the CPU and system RAM from the data path, cutting copy overhead and latency.
  • Works on consumer-grade GPUs: runs on cards like the RTX 3090, putting the technique within reach of a wide range of users.
  • Runs models larger than VRAM: makes it possible to execute models whose weights do not fit on the card, with throughput bounded mainly by NVMe read bandwidth.

Code and Implementation

I don't have the author's exact code to share, but the repository and its README are available if you want to dig into the details, and I recommend checking it out and experimenting with it. To give a feel for the overall approach, a rough sketch of the general streaming pattern follows.
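The sketch below is not the author's implementation: it uses plain file reads rather than a CPU bypass, and the file name, key prefix, and layer_forward() helper are hypothetical placeholders. It only illustrates why just one layer ever needs to be resident in VRAM at a time:

```python
# Sketch of the general "stream one layer at a time" pattern that makes a
# 70B model runnable on a 24 GB card. This is NOT the author's library: it
# uses ordinary file reads (no CPU bypass), and the file name, key prefix,
# and layer_forward() helper are hypothetical placeholders.
import torch
from safetensors import safe_open

NUM_LAYERS = 80  # Llama 3.1 70B has 80 transformer layers

def layer_forward(hidden, weights):
    # Placeholder: a real implementation would apply the layer's attention
    # and MLP blocks using `weights`; omitted here for brevity.
    return hidden

hidden = torch.zeros(1, 1, 8192, dtype=torch.float16, device="cuda")

with safe_open("llama-3.1-70b.safetensors", framework="pt", device="cuda") as f:
    all_keys = list(f.keys())
    for i in range(NUM_LAYERS):
        prefix = f"model.layers.{i}."
        # Pull only this layer's tensors into VRAM...
        weights = {k: f.get_tensor(k) for k in all_keys if k.startswith(prefix)}
        # ...run the layer...
        hidden = layer_forward(hidden, weights)
        # ...and free the weights so the next layer fits in 24 GB.
        del weights
        torch.cuda.empty_cache()
```

A real system would replace the plain reads with direct NVMe-to-GPU transfers and overlap the I/O for layer i+1 with the compute for layer i, which is presumably where the project's engineering effort goes.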

Verdict

Who is this for? This project is ideal for:

  • Retro gaming enthusiasts: Looking to improve performance and responsiveness in older games.
  • Machine learning and AI researchers: Seeking to accelerate their workflows and reduce latency.
  • Developers and tinkerers: Interested in exploring the potential of NVMe-to-GPU bypassing and its applications.

What do you think about this innovative solution? Can you envision any other use cases where NVMe-to-GPU bypassing could be beneficial? Share your thoughts and ideas in the comments below!
