How to run a local, uncensored LLM on your phone
Introduction
LLMs are cool; restrictions and relying on big corporations are not. In this post, I describe how to self-host your own little unhinged assistant that you can always carry in your pocket.
Main challenges we're going to face
- Computational power - mobile phones are generally not the most powerful, but we should be able to squeeze just enough power out of them to make it work.
- Liberation of the model - our little buddy should be a freak.
- Software - whilst we could, of course, write some large shell script, having to open Termux each time we want to generate a brand new crystal meth recipe is not fun.
- Memory - models can be pretty big nowadays.
Solving the challenges
Computational power
First of all, we need a model small enough to run on a phone - going over 5B parameters is probably not a very good idea. You can distil one yourself, but I'm too lazy for that and don't really know how to do it, so I'm going to stick to something popular that already meets these restrictions. Llama 3.2 is a good choice; it's new and comes in 1B and 3B variants out of the box.
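To get a feel for why the parameter count matters so much, here's some back-of-the-envelope arithmetic for how big the weights get at different quantisation levels. These are rough numbers: real GGUF files carry extra metadata and mixed-precision layers, and you still need headroom for the KV cache and the OS itself.

```python
# Rough estimate of model weight size at a given quantisation level.
# Real GGUF files are a bit larger (metadata, mixed-precision layers),
# and you also need extra RAM for the KV cache and the app itself.

def weights_size_gib(n_params_billions: float, bits_per_weight: float) -> float:
    bytes_total = n_params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1024**3

for params in (1, 3, 8):
    for bits in (16, 8, 4):
        print(f"{params}B @ {bits}-bit ~ {weights_size_gib(params, bits):.1f} GiB")

# 3B @ 4-bit lands around 1.4 GiB of weights, which a mid-range phone can handle;
# an 8B model at the same quant is already close to 4 GiB before anything else.
```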
Liberation of the model
We want our little buddy to be a freak; however, the people building big models don't always approve of that. To solve this, we can uncensor our model of choice ourselves, which, again, I'm not going to do here because I'm lazy. Here's a nice guide on how to do it though: https://erichartford.com/uncensored-models. Luckily, there's a second option waiting for us - publicly available models that someone else has already uncensored. I'm going to use this one.
Software
So, we have our model. How do we run it? As I mentioned before, using Termux directly is not fun; a user-friendly UI is much nicer. ChatterUI is one of the few projects I was able to find that does exactly that!
Tutorial
- Download and install ChatterUI. Consider using Obtainium if you're on Android. Auto-updates are cool.
- Download your model of choice in the GGUF format. 4-bit quantisation or lower is the way to go if you don't feel like waiting forever for a response - grab yourself an IQ-quant if it's available. (A small download-and-sanity-check sketch follows after this list.)
- (Optional) Quantise your model (convert it to GGUF) if the one you've chosen isn't available as a download. This website is pretty nice if you have an HF account; if you don't, follow the guide in llama.cpp's README. A rough sketch of the convert-and-quantise flow also follows below.
- Open ChatterUI, write a l33t system prompt, and load your model's GGUF file.
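If you'd rather script the download (and sanity-check the file on a desktop before copying it over to your phone), here is a minimal sketch using huggingface_hub and llama-cpp-python. The repo ID and filename below are placeholders, not the actual model I used - swap in whichever uncensored GGUF you picked.

```python
# pip install huggingface_hub llama-cpp-python
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Placeholder repo/filename - replace with the uncensored GGUF you chose.
REPO_ID = "someone/some-uncensored-llama-3.2-3b-GGUF"
FILENAME = "model-Q4_K_M.gguf"

# Download the quantised model file from Hugging Face.
path = hf_hub_download(repo_id=REPO_ID, filename=FILENAME)
print("Saved to:", path)

# Quick smoke test: load the model and generate a short completion.
llm = Llama(model_path=path, n_ctx=2048)
out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a very helpful, very unhinged assistant."},
        {"role": "user", "content": "Say hi in one sentence."},
    ],
    max_tokens=64,
)
print(out["choices"][0]["message"]["content"])
```

If it prints something coherent, the file is good; copy it to your phone and point ChatterUI at it.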
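And if your model of choice only exists as safetensors, this is roughly what the convert-then-quantise flow looks like with llama.cpp's tooling, wrapped in Python for convenience. Treat the script and binary names as assumptions - they have moved around between llama.cpp versions (convert_hf_to_gguf.py and llama-quantize at the time of writing), so double-check against the README. The paths are placeholders for your own setup.

```python
# Rough sketch of converting a Hugging Face model to GGUF and quantising it.
# Assumes llama.cpp is cloned and built locally; adjust paths and names to your setup.
import subprocess

LLAMA_CPP = "/path/to/llama.cpp"        # where you cloned (and built) llama.cpp
HF_MODEL_DIR = "/path/to/hf-model"      # the downloaded safetensors model
F16_GGUF = "model-f16.gguf"
QUANT_GGUF = "model-Q4_K_M.gguf"

# Step 1: convert the HF checkpoint to an f16 GGUF.
subprocess.run(
    ["python", f"{LLAMA_CPP}/convert_hf_to_gguf.py", HF_MODEL_DIR,
     "--outfile", F16_GGUF, "--outtype", "f16"],
    check=True,
)

# Step 2: quantise the f16 GGUF down to 4-bit (Q4_K_M).
# The llama-quantize binary may live under build/bin/ depending on how you built it.
subprocess.run(
    [f"{LLAMA_CPP}/llama-quantize", F16_GGUF, QUANT_GGUF, "Q4_K_M"],
    check=True,
)
```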
That's it! You can now try not to get laughed at by your friends when you flex on them. Good luck!