Updated: June 28, 2025
Mantle is an iOS-based chat application that runs powerful Large Language Models (LLMs) entirely on-device, leveraging the full potential of Apple Silicon. This project is a deep dive into the world of on-device AI, focusing on the significant engineering challenges involved in making large, complex models performant within the constrained environment of a mobile phone.
The primary goal of this project is to explore and implement advanced optimization techniques for running LLMs on iOS devices. It serves as a portfolio piece to showcase deep technical expertise in Core ML model conversion and quantization, stateful inference with a KV cache, performance profiling, and custom Metal compute layers.
Running a multi-billion parameter LLM on a phone is non-trivial. The key is a multi-stage optimization process.
The first step is to convert a pre-trained model into Apple's Core ML format; this project uses meta-llama/Llama-3.2-3B. To shrink the model, we apply aggressive quantization, converting its weights from 16-bit floating-point numbers to smaller types such as 8-bit integers, which drastically reduces the model's size and memory footprint.
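As an illustration, a minimal coremltools sketch of this weight-quantization pass might look like the following. The .mlpackage filenames and the choice of 8-bit linear_symmetric quantization are assumptions for the example, not the project's exact settings.

```python
import coremltools as ct
from coremltools.optimize.coreml import (
    OpLinearQuantizerConfig,
    OptimizationConfig,
    linear_quantize_weights,
)

# Load the previously converted FP16 model (hypothetical filename)
mlmodel = ct.models.MLModel("Llama-3.2-3B.mlpackage")

# Quantize every weight tensor from 16-bit float to 8-bit integers;
# activations remain floating point at runtime
config = OptimizationConfig(
    global_config=OpLinearQuantizerConfig(mode="linear_symmetric", dtype="int8")
)
mlmodel_int8 = linear_quantize_weights(mlmodel, config=config)

mlmodel_int8.save("Llama-3.2-3B-int8.mlpackage")
```

At 8 bits per weight, a 3-billion-parameter model shrinks from roughly 6 GB of FP16 weights to about 3 GB, which is often the difference between fitting within an iPhone's memory budget and not.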
To hold a conversation, an LLM must remember context, which is managed by a Key-Value (KV) cache. This project implements stateful inference: the Core ML model accepts the KV cache as an input state and outputs the updated cache after each prediction. Because the keys and values of earlier tokens are never recomputed, generation is significantly faster.
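To make the caching idea concrete, here is a toy, framework-independent sketch (illustrative only; in the project the cache lives inside the Core ML model's state rather than in application code). Each step stores the newest token's key and value once and then attends over everything already cached.

```python
import numpy as np

def attend(q, keys, values):
    """Scaled dot-product attention for one query over all cached positions."""
    scores = keys @ q / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values

d_head = 64
k_cache = np.empty((0, d_head))  # keys of every token generated so far
v_cache = np.empty((0, d_head))  # values of every token generated so far

for step in range(4):
    # In the real model, q/k/v are projections of the newest token's hidden state
    q, k, v = np.random.randn(3, d_head)

    # Store this token's key and value once; they are never recomputed later
    k_cache = np.vstack([k_cache, k])
    v_cache = np.vstack([v_cache, v])

    # The new token attends over all cached positions in a single pass
    out = attend(q, k_cache, v_cache)
```

Without the cache, every generation step would have to re-run the model over the entire prefix; with it, each step only processes the newest token and looks up the stored keys and values.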
Using Xcode Instruments, we measure latency, memory usage, power consumption, and compute-unit utilization. This analysis reveals the performance bottlenecks, typically the attention layers, which are the computational heart of a Transformer model.
When profiling shows a layer is a bottleneck, we can replace Core ML's default implementation with our own highly optimized version using Metal, Apple's low-level GPU programming framework. This involves writing custom layers in Swift and integrating them through Core ML's MLCustomLayer protocol.
Model conversion is handled with Apple's coremltools. The process involves setting up a Python environment, running a conversion script with your Hugging Face token, and then integrating the resulting .mlpackage model into the Xcode project.
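The conversion script itself is not reproduced here, but a simplified sketch of its general shape, assuming the Hugging Face transformers package, a fixed 128-token context for tracing, and an iOS 17 deployment target (all illustrative choices, not the project's actual parameters), could look like this:

```python
import numpy as np
import torch
import coremltools as ct
from transformers import AutoModelForCausalLM

MODEL_ID = "meta-llama/Llama-3.2-3B"
HF_TOKEN = "hf_..."  # your Hugging Face access token (the model repo is gated)

class LogitsOnly(torch.nn.Module):
    """Wrap the HF model so tracing sees a plain tensor output instead of a dict."""
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, input_ids):
        return self.model(input_ids, use_cache=False).logits

base = AutoModelForCausalLM.from_pretrained(MODEL_ID, token=HF_TOKEN)
base.eval()

# Trace with a fixed-shape dummy input so Core ML receives a static graph
example = torch.zeros((1, 128), dtype=torch.long)
traced = torch.jit.trace(LogitsOnly(base), example)

# Convert to an .mlpackage (ML Program weights default to FP16);
# the quantization pass described above runs afterwards
mlmodel = ct.convert(
    traced,
    inputs=[ct.TensorType(name="input_ids", shape=(1, 128), dtype=np.int32)],
    minimum_deployment_target=ct.target.iOS17,
)
mlmodel.save("Llama-3.2-3B.mlpackage")
```

The project's real pipeline additionally has to expose the KV cache as model state and handle flexible sequence lengths, which this sketch omits; the saved .mlpackage is then added to the Xcode project, where it is compiled into the app bundle.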
Successfully deployed a multi-billion parameter LLM on an iPhone using a fully stateful inference pipeline. The key learning was the critical importance of KV cache management and the trade-offs between model size, quantization, and output quality.