Llama2 Ports Extensive Benchmark Results on Mac M1 Max

Mojo 🔥 almost matches llama.cpp speed (!!!) with much simpler code and beats llama2.c across the board in multi-threading benchmarks

Date: Oct 18, 2023

TL;DR

         Recently, I obtained early access to the Mojo SDK for Mac from the Modular AI team. I set up a testing environment to evaluate the performance of llama2.🔥 inference using the Mojo SDK on Apple Silicon M1 Max hardware. All tests were carried out in CPU-only mode. Along the way I also tested other ports of llama2.c to see how they would perform on the M1 Max. This article presents benchmark results comparing the inference performance of 3 baby Llama2 models across 12 different implementations in 7 programming languages on Mac M1 Max hardware. Given the community's strong interest in these comparisons, I was eager to see how the three primary implementations, llama.cpp, C, and Mojo, would stack up, and each delivered notable results in this competitive evaluation.

Llama2 Inference on Mac M1 Max, multi-threaded [ stories15M.bin ]

  1. Average tokens per second: In processing speed, llama.cpp leads the pack by a considerable margin, achieving approximately 1000 tokens per second. Mojo 🔥, which has gained traction among developers, follows closely, demonstrating its robust capabilities. Both llama.cpp and Mojo 🔥 substantially outpace the other languages, including Zig, Rust, Julia, and Go.
  2. Average time per inference: Evaluating average inference time reveals Mojo as a top contender, closely followed by C; both maintain an impressive sub-one-second inference time. Next in line are Zig (b) and llama.cpp, while languages such as Rust, Julia, and Go show varying results, with some taking nearly 4 seconds per inference. (How both metrics are derived from raw runs is sketched below.)
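
        For reference, both metrics are simple aggregates over repeated runs. The sketch below, in Python, is illustrative only; the function and the sample numbers are hypothetical and not taken from the actual benchmark data:

    # Sketch: derive the two reported metrics from raw per-run measurements.
    def summarize(runs):
        """runs: list of (tokens_generated, elapsed_seconds), one entry per run."""
        avg_tokens_per_second = sum(t / s for t, s in runs) / len(runs)
        avg_time_per_inference = sum(s for _, s in runs) / len(runs)
        return avg_tokens_per_second, avg_time_per_inference

    # Hypothetical example: three runs of a stories15M.bin port
    print(summarize([(256, 0.26), (256, 0.25), (256, 0.27)]))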

Llama2 Inference on Mac M1 Max, multi-threaded [ stories42M.bin ]


Llama2 Inference on Mac M1 Max, multi-threaded [ stories110M.bin ]


Delving into single-threaded performance on Mac M1 Max

        To enable a fair comparison, single-threaded benchmarks are also needed, since most implementations do not support multi-threaded inference. The analysis below is based on the single-threaded data.

Llama2 Inference on Mac M1 Max, single-threaded [ stories15M.bin ]

  1. Average tokens per second: Zig (a) emerges as the top performer in processing speed, achieving nearly 700 tokens per second. Notably, single-threaded llama2.c compiled in runfast mode surpasses multi-threaded llama2.c. Interestingly, Zig (a) outpaces C++ and Mojo in this setup while delivering the same performance as in multi-threaded mode. Typically, switching from multi-threaded to single-threaded workloads causes a performance drop because parallel processing is lost, which raises the question of whether Zig (a) spawned additional threads during the single-threaded tests. If the configuration was not set up correctly, it would not allow an apples-to-apples comparison (a sketch of pinning a run to a single thread follows this list).
  2. Average time per inference: For inference time, llama2.c sets the benchmark with the lowest average of all implementations. Both Zig configurations also exhibit excellent speed, underscoring their efficiency on this metric.
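
        To make the single-threaded numbers comparable, each port has to be genuinely restricted to one thread. A minimal sketch of pinning a run to a single thread, assuming the port honours OMP_NUM_THREADS (as llama2.c's OpenMP build does; other runtimes may need their own flags or environment variables), looks like this:

    import os
    import subprocess

    # Illustrative only: the binary name and arguments are placeholders,
    # not the exact commands used in these benchmarks.
    env = dict(os.environ, OMP_NUM_THREADS="1")   # honoured by OpenMP builds of llama2.c
    subprocess.run(["./run", "stories15M.bin"], env=env, check=True)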

Llama2 Inference on Mac M1 Max, single-threaded [ stories42M.bin ]


Llama2 Inference on Mac M1 Max, single-threaded [ stories110M.bin ]


        In summary, these single-threaded benchmarks illuminate optimization trade-offs on the M1 Max. While multi-threading leverages modern hardware, single-threaded performance remains vital for applications where parallelism is unsuitable. The results provide researchers and developers a roadmap to guide optimizations and focus areas.

Methods

        The llama2.c ports were executed in both single-threaded and multi-threaded configurations. For this purpose I implemented a small benchmarking framework that ensures consistent measurement across implementations. The hypertune tool (a custom fork of the excellent hyperfine CLI benchmarking utility with a few additional features) captured granular performance metrics, including tokens per second, time per inference, and memory usage. For llama.cpp inference, the baby Llama2 models were converted to fp32 GGUF format using a converter, for an equitable comparison.
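
        As a rough illustration of the measurement loop (the actual runs used hypertune, not this code), each implementation can be timed over a few warm-up runs followed by measured runs; the command below is a placeholder:

    import subprocess
    import time

    def bench(cmd, warmup=2, runs=10):
        """Average wall-clock time of a command over several runs (hypothetical harness)."""
        for _ in range(warmup):                       # warm caches; discard timings
            subprocess.run(cmd, capture_output=True, check=True)
        times = []
        for _ in range(runs):
            start = time.perf_counter()
            subprocess.run(cmd, capture_output=True, check=True)
            times.append(time.perf_counter() - start)
        return sum(times) / len(times)                # average time per inference

    # Placeholder usage; the real benchmarks swept all 12 ports and 3 models.
    print(bench(["./run", "stories15M.bin"]))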

(c) Aydyn Tairov, 2023, MIT License