Running Phi MoE 3.5 on Macbook Pro

The relatively recently released Phi 3.5 model series includes a mixture-of-experts model featuring 16 x 3.3 Billion parameter expert models. It activates these experts two at a time resulting in pretty good performance but only 6.6 billion parameters held in memory at once. I recently wanted to try running Phi MoE 3.5 on my macbook but was blocked from doing so using my usual method whilst support is built into llama.cpp and then ollama.

I decided to try out another library, mistral.rs, which is written in the rust programming language and already supports these newer models. It required a little bit of fiddling around but I did manage to get it working and the model is relatively responsive.

Getting Our Dependencies and Building Mistral.RS

To get started you will need to have the rust compiler toolchain installed on your macbook including rustc and cargo. The easiest way to do this is via brew:

brew install rust

You’ll also need to grab the code for the project

git clone https://github.com/EricLBuehler/mistral.rs.git

Once you have both of these in place we can build the project. Since we’re running on Mac, we want the compiler to make use of apple Metal which allows the model to use the GPU capabilities of the M-series chip to accelerate the model.

cd mistral.rs
cargo install --path mistralrs-server --features metal

This command may take a couple of minutes to run. The compiled server will be saved in the target/release folder relative to your project folder.

Running the Model with Quantization

The default instructions in the project readme work but you might find it takes up a lot of memory and takes a really long time to run. That’s because, by default mistral.rs does not do any quantization so running the model requires 12GB of memory.

mistral.rs supports in-situ-quantisation which essentially means that the framework loads the model up and does the quantisation at run time (as opposed to requiring you to download a GGUF file that was already quantized). I recommend running the following:

./target/release/mistralrs-server --isq Q4_0 -i plain -m microsoft/Phi-3.5-mini-instruct -a phi3

In this mode we use ISQ to quantize the model down to 4bit mode (--isq Q4_0). You should be able to chat to the model through the terminal

Running as a Server

Mistral.rs provides a HTTP API that is compatible with OpenAI standards. To run in server mode we remove the -i argument and replace it with a port number to run on --port 1234:

./target/release/mistralrs-server --port 1234 --isq Q4_0 plain -m microsoft/Phi-3.5-mini-instruct -a phi3

You can then use an app like Postman or Bruno to interact with your model:

Screenshot of a REST tooling interface. A pane on the left shows a json payload that was sent to the server containing messages to the model telling it to behave as a useful assistant and write a poem.

On the right is the response which contains a message and the beginning of a poem as written by the model.

Running the Vision Model

To run the vision model, we just need to make a couple of changes to our command line arguments:

./target/release/mistralrs-server --port 1234 --isq Q4_0 vision-plain -m microsoft/Phi-3.5-vision-instruct -a phi3v

We still want to use ISQ but this time we swap plain for vision-plain, we swap the model name for the vision equivalent and we change the architecture -a phi3 to -a phi3v.

Likewise we can now interact with the model via HTTP tooling. Here’s a response based on the example from the documentation:

Screenshot of a REST interface. A pane on the left shows a json payload that was sent to the server containing messages to the model telling it to analyse an image url.

On the right is the response which describes the mountain in the picture that was sent.

Running on Linux and Nvidia

I am still struggling to get mistral.rs to build on Linux at the moment, the docker images that are provided by the project don’t seem to play ball with my systems. Once I figure this out I’ll release an updated version of this blog.

The post Running Phi MoE 3.5 on Macbook Pro appeared first on Brainsteam.

Also on:

Bluesky

Running Phi MoE 3.5 on Macbook Pro

Getting Our Dependencies and Building Mistral.RS

Running the Model with Quantization

Running as a Server

Running the Vision Model

Running on Linux and Nvidia

Trending Articles

Practice Sheet of Right form of verbs for HSC Students

The Nightmare Before Christmas 1993 3D HSBS MULTISUBS 1080p BluRay x264...

Program RSUSR003 Reports "Security violation" in SM21 system log

The Teskey Brothers – Run Home Slow (2019) [FLAC 24bit/44,1kHz]

Bugatti T35B 1926 (TKPapercraft) PDF File

Isilon CLI Command Reference

Throw Back: Kwaw Kese — Ma Kwan (Ft. Edem) Prod by Hammer

Outlook のコマンドラインスイッチと初期化される情報について

Re: Premiere Elementsで実装されている、かすみ除去(Dehaze)、Premiere Pro CC 2017では利用可能でしょうか。

How to assign the custom BDXXX scripts to NPCs?

Bureau of Internal Revenue: Regional Offices (Directory)

Cleethorpes pair jailed for savage attack on man in street

James Cecil Marquette

Re: 古いPDFが開けない

KAITLIN THIEL THOMPSON Arrested by Clackamas County Sheriff's Office on Jan...

Child Kidnapping: Amy McNeil was kidnapped on her way to school by 5 adults;...

CALVIN ESSIX Arrested by Miami-Dade County Corrections on Feb 14, 2017

HResult: 0x80240033 Context: uecGeneral Msg: The license terms of one or more...

Top ten worst airports in Asia - List of the 10 worst airports in Asia

The Mother and the Murderer: Woman confronts son’s killer in prison