Updated: June 28, 2025
Mantle is an iOS-based chat application that runs powerful Large Language Models (LLMs) entirely on-device, leveraging the full potential of Apple Silicon. This project is a deep dive into the world of on-device AI, focusing on the significant engineering challenges involved in making large, complex models performant within the constrained environment of a mobile phone.
The primary goal of this project is to explore and implement advanced optimization techniques for running LLMs on iOS devices. It serves as a portfolio piece to showcase deep technical expertise in Core ML model conversion and quantization, stateful inference with a KV cache, performance profiling, and custom Metal compute layers.
Running a multi-billion parameter LLM on a phone is non-trivial. The key is a multi-stage optimization process.
The first step is to convert a pre-trained model into Apple's Core ML format; this project uses meta-llama/Llama-3.2-3B. To shrink the model, we apply aggressive quantization, converting its weights from 16-bit floating-point numbers to smaller types such as 8-bit integers, which drastically reduces the model's size and memory footprint.
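As an illustration, a minimal coremltools sketch of this weight-quantization pass might look like the following. The .mlpackage filenames and the choice of 8-bit linear_symmetric quantization are assumptions for the example, not the project's exact settings.

```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

# Load the previously converted FP16 model (hypothetical filename)
mlmodel = ct.models.MLModel("Llama-3.2-3B.mlpackage")

# Quantize every weight tensor from 16-bit float to 8-bit integers;
# activations remain floating point at runtime
config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(mode="linear_symmetric", dtype="int8")
)
mlmodel_int8 = linear_quantize_weights(mlmodel, config=config)

mlmodel_int8.save("Llama-3.2-3B-int8.mlpackage")
```

At 8 bits per weight, a 3-billion-parameter model shrinks from roughly 6 GB of FP16 weights to about 3 GB, which is often the difference between fitting within an iPhone's memory budget and not.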
To hold a conversation, an LLM must remember context, which is managed by a Key-Value (KV) cache. This project implements stateful inference: the Core ML model accepts the KV cache as an input state and outputs the updated cache after each prediction. Because the keys and values of earlier tokens are never recomputed, generation is significantly faster.
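To make the caching idea concrete, here is a toy, framework-independent sketch (illustrative only; in the project the cache lives inside the Core ML model's state rather than in application code). Each step stores the newest token's key and value once and then attends over everything already cached.

```python
import numpy as np

def attend(q, keys, values):
    """Scaled dot-product attention for one query over all cached positions."""
    scores = keys @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

d_head = 64
k_cache = np.empty((0, d_head))  # keys of every token generated so far
v_cache = np.empty((0, d_head))  # values of every token generated so far

for step in range(4):
    # In the real model, q/k/v are projections of the newest token's hidden state
    q, k, v = np.random.randn(3, d_head)

    # Store this token's key and value once; they are never recomputed later
    k_cache = np.vstack([k_cache, k])
    v_cache = np.vstack([v_cache, v])

    # The new token attends over all cached positions in a single pass
    out = attend(q, k_cache, v_cache)
```

Without the cache, every generation step would have to re-run the model over the entire prefix; with it, each step only processes the newest token and looks up the stored keys and values.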
Using Xcode Instruments, we measure latency, memory usage, power consumption, and compute-unit utilization. This analysis reveals the performance bottlenecks, typically the attention layers, which are the computational heart of a Transformer model.
When profiling shows a layer is a bottleneck, we can replace Core ML's default implementation with our own highly optimized version using Metal, Apple's low-level GPU programming framework. This involves writing custom layers in Swift and integrating them through Core ML's MLCustomLayer protocol.
Model conversion is handled with Apple's coremltools. The process involves setting up a Python environment, running a conversion script with your Hugging Face token, and then integrating the resulting .mlpackage model into the Xcode project.
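The conversion script itself is not reproduced here, but a simplified sketch of its general shape, assuming the Hugging Face transformers package, a fixed 128-token context for tracing, and an iOS 17 deployment target (all illustrative choices, not the project's actual parameters), could look like this:

```python
import numpy as np
import torch
import coremltools as ct
from transformers import AutoModelForCausalLM

MODEL_ID = "meta-llama/Llama-3.2-3B"
HF_TOKEN = "hf_..."  # your Hugging Face access token (the model repo is gated)

class LogitsOnly(torch.nn.Module):
    """Wrap the HF model so tracing sees a plain tensor output instead of a dict."""
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids):
        return self.model(input_ids, use_cache=False).logits

base = AutoModelForCausalLM.from_pretrained(MODEL_ID, token=HF_TOKEN)
base.eval()

# Trace with a fixed-shape dummy input so Core ML receives a static graph
example = torch.zeros((1, 128), dtype=torch.long)
traced = torch.jit.trace(LogitsOnly(base), example)

# Convert to an .mlpackage (ML Program weights default to FP16);
# the quantization pass described above runs afterwards
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input_ids", shape=(1, 128), dtype=np.int32)],
    minimum_deployment_target=ct.target.iOS17,
)
mlmodel.save("Llama-3.2-3B.mlpackage")
```

The project's real pipeline additionally has to expose the KV cache as model state and handle flexible sequence lengths, which this sketch omits; the saved .mlpackage is then added to the Xcode project, where it is compiled into the app bundle.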
Successfully deployed a multi-billion parameter LLM on an iPhone using a fully stateful inference pipeline. The key learning was the critical importance of KV cache management and the trade-offs between model size, quantization, and output quality.