Show HN: RunAnywhere – Faster AI Inference on Apple Silicon

Introduction to RunAnywhere

As a developer, I'm always excited to see projects that push the boundaries of what's possible on-device. Recently I came across RunAnywhere, a fast inference engine for Apple Silicon that's been making waves in the AI community. In this post I'll dig into what RunAnywhere does, how it performs, and what makes it interesting.

Why this matters

On-device AI powers everything from voice assistants to dictation, but shipping it is brutal, especially when it comes to latency. Demoing on-device AI is easy; making it feel instant in a real application is a different story. That's the problem RunAnywhere targets: a fast inference engine that runs large language models (LLMs), speech-to-text, and text-to-speech entirely on-device.

How to install and use RunAnywhere

To get started with RunAnywhere, install the RCLI (RunAnywhere Command-Line Interface) via Homebrew:

brew tap RunanywhereAI/rcli https://github.com/RunanywhereAI/RCLI.git
brew install rcli
rcli setup   # downloads ~1 GB of models
rcli         # interactive mode with push-to-talk

Alternatively, you can use the installation script:

curl -fsSL https://raw.githubusercontent.com/RunanywhereAI/RCLI/main/install.sh | bash

Once installed, you can use RCLI to drive the full voice pipeline. The project reports the following numbers on Apple Silicon:

  • LLM decode: 1.67x faster than llama.cpp, 1.19x faster than Apple MLX
  • STT (Speech-to-Text): 70 seconds of audio transcribed in 101 ms, 714x real-time, 4.6x faster than mlx-whisper
  • TTS (Text-to-Speech): 178 ms synthesis, 2.8x faster than mlx-audio and sherpa-onnx
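A quick way to get a feel for these figures is the real-time factor: seconds of audio handled per second of wall-clock time. A tiny Python check using the rounded numbers quoted above (the post's 714x presumably comes from slightly different raw timings):

```python
# Real-time factor = audio duration / processing time.
audio_seconds = 70.0   # audio transcribed
latency_ms = 101.0     # reported STT processing time

rtf = audio_seconds / (latency_ms / 1000.0)
print(f"real-time factor: {rtf:.0f}x")  # ~693x with these rounded inputs
```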

Under the hood

So, what makes RunAnywhere so fast? The answer lies in its custom Metal compute shaders, which are compiled ahead of time and dispatched directly to the GPU. This skips the overhead of graph schedulers, runtime dispatchers, and memory managers, and that shows up directly in latency. RunAnywhere also uses a single unified engine for all three modalities (LLM, STT, and TTS), which eliminates the need to stitch separate runtimes together.
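One of those ideas, keeping the memory manager out of the hot path by allocating everything up front, is easy to illustrate outside of Metal. A minimal NumPy sketch (class and buffer names are mine, not RunAnywhere's API): every buffer is created once at init, and the per-token step writes into it with `out=` instead of allocating:

```python
import numpy as np

class DecodeStep:
    """Toy decode step: all buffers are allocated once at init and
    reused on every call, so the hot loop never hits the allocator."""
    def __init__(self, hidden: int, vocab: int):
        self.w = np.random.randn(hidden, vocab).astype(np.float32)
        self.logits = np.empty(vocab, dtype=np.float32)  # pre-allocated output

    def __call__(self, h: np.ndarray) -> np.ndarray:
        # np.dot's `out=` argument writes into the pre-allocated buffer
        # instead of returning a fresh array each step.
        np.dot(h, self.w, out=self.logits)
        return self.logits

step = DecodeStep(hidden=64, vocab=256)
h = np.zeros(64, dtype=np.float32)
logits = step(h)
```

The same call pattern repeated per token reuses the identical memory, which is the property the engine relies on for steady-state latency.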

Features and benefits

Some of the key features and benefits of RunAnywhere include:

  • Custom Metal shaders: compiled ahead of time for optimal performance
  • Zero allocations during inference: all memory pre-allocated at init
  • Unified engine: handles all three modalities natively on Apple Silicon
  • Lock-free ring buffers: let pipeline stages run on concurrent threads without blocking
  • Double-buffered TTS: ensures seamless audio synthesis
  • 38 macOS actions by voice: control your Mac with voice commands
  • Local RAG (~4 ms over 5K+ chunks): fast and efficient retrieval of information
  • 20 hot-swappable models: easily switch between different models for various tasks

Who is this for?

RunAnywhere is aimed at developers and researchers who want to build on-device AI applications that need fast, efficient processing. With its open-source RCLI and MetalRT engine, it's a powerful tool for anyone pushing the boundaries of what's possible on-device. Whether you're building a voice assistant, a live transcription tool, or any other speech- or LLM-powered application, RunAnywhere is worth checking out.

So, what would you build if on-device AI were genuinely as fast as cloud? Share your ideas and thoughts in the comments below!
