Show HN: RunAnywhere – Faster AI Inference on Apple Silicon

Introduction to RunAnywhere

As a developer, I'm always excited to see projects that push the boundaries of what's possible on-device. Recently I came across RunAnywhere, a fast inference engine for Apple Silicon that's been making waves in the AI community. In this post I'll dig into what RunAnywhere does, how it performs, and what makes it interesting.

Why this matters

On-device AI powers everything from voice assistants to dictation, but shipping it is brutal, especially when it comes to latency. Demoing on-device AI is easy; making it feel instant in a real application is a different story. That's the problem RunAnywhere targets: a fast inference engine that runs large language models (LLMs), speech-to-text, and text-to-speech entirely on-device.

How to install and use RunAnywhere

To get started with RunAnywhere, install the RCLI (RunAnywhere Command-Line Interface) via Homebrew:

brew tap RunanywhereAI/rcli https://github.com/RunanywhereAI/RCLI.git
brew install rcli
rcli setup   # downloads ~1 GB of models
rcli         # interactive mode with push-to-talk

Alternatively, you can use the installation script:

curl -fsSL https://raw.githubusercontent.com/RunanywhereAI/RCLI/main/install.sh | bash

Once installed, you can use RCLI to drive the full voice pipeline. The project reports the following numbers on Apple Silicon:

  • LLM decode: 1.67x faster than llama.cpp, 1.19x faster than Apple MLX
  • STT (Speech-to-Text): 70 seconds of audio transcribed in 101 ms, 714x real-time, 4.6x faster than mlx-whisper
  • TTS (Text-to-Speech): 178 ms synthesis, 2.8x faster than mlx-audio and sherpa-onnx
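A quick way to get a feel for these figures is the real-time factor: seconds of audio handled per second of wall-clock time. A tiny Python check using the rounded numbers quoted above (the post's 714x presumably comes from slightly different raw timings):

```python
# Real-time factor = audio duration / processing time.
audio_seconds = 70.0   # audio transcribed
latency_ms = 101.0     # reported STT processing time

rtf = audio_seconds / (latency_ms / 1000.0)
print(f"real-time factor: {rtf:.0f}x")  # ~693x with these rounded inputs
```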

Under the hood

So, what makes RunAnywhere so fast? The answer lies in its custom Metal compute shaders, which are compiled ahead of time and dispatched directly to the GPU. This skips the overhead of graph schedulers, runtime dispatchers, and memory managers, and that shows up directly in latency. RunAnywhere also uses a single unified engine for all three modalities (LLM, STT, and TTS), which eliminates the need to stitch separate runtimes together.
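One of those ideas, keeping the memory manager out of the hot path by allocating everything up front, is easy to illustrate outside of Metal. A minimal NumPy sketch (class and buffer names are mine, not RunAnywhere's API): every buffer is created once at init, and the per-token step writes into it with `out=` instead of allocating:

```python
import numpy as np

class DecodeStep:
    """Toy decode step: all buffers are allocated once at init and
    reused on every call, so the hot loop never hits the allocator."""
    def __init__(self, hidden: int, vocab: int):
        self.w = np.random.randn(hidden, vocab).astype(np.float32)
        self.logits = np.empty(vocab, dtype=np.float32)  # pre-allocated output

    def __call__(self, h: np.ndarray) -> np.ndarray:
        # np.dot's `out=` argument writes into the pre-allocated buffer
        # instead of returning a fresh array each step.
        np.dot(h, self.w, out=self.logits)
        return self.logits

step = DecodeStep(hidden=64, vocab=256)
h = np.zeros(64, dtype=np.float32)
logits = step(h)
```

The same call pattern repeated per token reuses the identical memory, which is the property the engine relies on for steady-state latency.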

Features and benefits

Some of the key features and benefits of RunAnywhere include:

  • Custom Metal shaders: compiled ahead of time for optimal performance
  • Zero allocations during inference: all memory pre-allocated at init
  • Unified engine: handles all three modalities natively on Apple Silicon
  • Lock-free ring buffers: let pipeline stages run on concurrent threads without blocking
  • Double-buffered TTS: ensures seamless audio synthesis
  • 38 macOS actions by voice: control your Mac with voice commands
  • Local RAG (~4 ms over 5K+ chunks): fast and efficient retrieval of information
  • 20 hot-swappable models: easily switch between different models for various tasks

Who is this for?

RunAnywhere is aimed at developers and researchers who want to build on-device AI applications that need fast, efficient processing. With its open-source RCLI and MetalRT engine, it's a powerful tool for anyone pushing the boundaries of what's possible on-device. Whether you're building a voice assistant, a live transcription tool, or any other speech- or LLM-powered application, RunAnywhere is worth checking out.

So, what would you build if on-device AI were genuinely as fast as cloud? Share your ideas and thoughts in the comments below!
