Llama2 Inference on Mac M1 Max, multi-threaded [ stories15M.bin ]
- Average tokens per second: In processing speed, as measured by average tokens per second, llama.cpp leads the pack by a considerable margin, achieving approximately 1000 tokens per second. Mojo 🔥, which has gained traction among developers, follows closely, demonstrating its robust capabilities. Both llama.cpp and Mojo 🔥 substantially outpace the other languages, including Zig, Rust, Julia, and Go (see the computation sketch after this list).
- Average time per inference: Evaluating average inference time reveals Mojo 🔥 as a top contender, closely followed by C; both maintain an impressive sub-one-second inference time. Next in line are Zig (b) and llama.cpp, while languages such as Rust, Julia, and Go show varying results, with some taking nearly 4 seconds per inference.
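To make the two metrics concrete, here is a minimal sketch of how such averages can be computed from raw runs; the run counts and timings below are illustrative placeholders, not the actual benchmark output.

```python
from statistics import mean

# Illustrative measurements for one implementation: (tokens generated,
# wall-clock seconds) per inference run. The numbers are made up.
runs = [(256, 0.26), (256, 0.25), (256, 0.27)]

# Average tokens per second: mean of the per-run generation rates.
avg_tokens_per_second = mean(tokens / seconds for tokens, seconds in runs)

# Average time per inference: plain mean of the per-run wall-clock times.
avg_time_per_inference = mean(seconds for _, seconds in runs)

print(f"avg tokens/s: {avg_tokens_per_second:.1f}")
print(f"avg time per inference: {avg_time_per_inference:.3f} s")
```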
Llama2 Inference on Mac M1 Max, multi-threaded [ stories42M.bin ]
Llama2 Inference on Mac M1 Max, multi-threaded [ stories110M.bin ]
Delving into single-threaded performance on Mac M1 Max
To enable a fair comparison, single-threaded benchmarks are also needed, since most implementations do not support multi-threaded inference. The following analysis is based on the single-threaded data.
Llama2 Inference on Mac M1 Max, single-threaded [ stories15M.bin ]
- Average tokens per second: Zig (a) emerges as the top performer in processing speed, achieving nearly 700 tokens per second. Notably, single-threaded llama2.c compiled in runfast mode surpasses multi-threaded llama2.c. Interestingly, Zig (a) outpaces C++ and Mojo 🔥 in this setup, while delivering the same performance as in multi-threaded mode. Typically, switching from multi-threaded to single-threaded workloads causes a performance drop due to the loss of parallel processing. This raises the question of whether Zig (a) improperly spawned additional threads during single-threaded testing; if the configuration was not set up correctly, the comparison is not apples-to-apples (a verification sketch follows this list).
- Average time per inference: For inference time, llama2.c sets the benchmark with the lowest average, making it the fastest among all implementations. Both Zig configurations also exhibit excellent speed, underscoring their efficiency on this metric.
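One way to check whether an implementation really stays on one thread during a single-threaded run is to watch the OS-reported thread count of the benchmarked process. A rough sketch using the third-party psutil package (the binary path and arguments are placeholders):

```python
import subprocess
import time

import psutil  # third-party: pip install psutil

# Placeholder command; substitute the benchmark binary under test.
proc = subprocess.Popen(["./run", "stories15M.bin"])
ps = psutil.Process(proc.pid)

# Sample the thread count while the inference runs. A genuinely
# single-threaded run should stay at (or very close to) one thread.
peak_threads = 0
while proc.poll() is None:
    try:
        peak_threads = max(peak_threads, ps.num_threads())
    except psutil.NoSuchProcess:
        break  # process exited between poll() and num_threads()
    time.sleep(0.05)

print(f"peak thread count observed: {peak_threads}")
```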
Llama2 Inference on Mac M1 Max, single-threaded [ stories42M.bin ]
Llama2 Inference on Mac M1 Max, single-threaded [ stories110M.bin ]
In summary, these single-threaded benchmarks illuminate optimization trade-offs on the M1 Max. While multi-threading leverages modern hardware, single-threaded performance remains vital for applications where parallelism is unsuitable. The results provide researchers and developers with a roadmap to guide optimizations and focus areas.
Methods
llama2.c ports were executed in both single-threaded and multi-threaded configurations.
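For the OpenMP-based builds (llama2.c's multi-threaded target uses OpenMP, which honors OMP_NUM_THREADS), the two configurations can be pinned with an environment variable. A minimal sketch with placeholder paths; other ports may expose their own thread flags:

```python
import os
import subprocess

def run_inference(binary: str, model: str, threads: int) -> None:
    """Run one inference with an explicit thread budget.

    OMP_NUM_THREADS covers OpenMP builds such as llama2.c's `runomp`
    target; other ports may need their own flags or env vars.
    """
    env = dict(os.environ, OMP_NUM_THREADS=str(threads))
    subprocess.run([binary, model], env=env, check=True)

# Placeholder paths: single-threaded vs. multi-threaded configuration.
run_inference("./run", "stories15M.bin", threads=1)
run_inference("./run", "stories15M.bin", threads=8)
```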
For this purpose, I implemented a small benchmarking framework that ensures consistent benchmarking across implementations. The hypertune tool (a custom fork of the incredible hyperfine CLI benchmarking utility, with a few advanced features) captured granular performance metrics, including tokens per second, time per inference, and memory usage.
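I won't reproduce hypertune here, but the upstream hyperfine workflow it builds on looks roughly like this: repeat the command, export the timings as JSON, and post-process them (the benchmarked command and file names are placeholders):

```python
import json
import subprocess

# Benchmark a placeholder command with upstream hyperfine. --warmup,
# --runs, and --export-json are standard hyperfine options.
subprocess.run(
    [
        "hyperfine",
        "--warmup", "2",
        "--runs", "10",
        "--export-json", "results.json",
        "./run stories15M.bin",
    ],
    check=True,
)

# hyperfine's JSON export holds one entry per benchmarked command, with
# mean/stddev and the individual wall-clock times, all in seconds.
with open("results.json") as f:
    result = json.load(f)["results"][0]

print(f"mean: {result['mean']:.3f} s ± {result['stddev']:.3f} s")
```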
For llama.cpp inference, the baby-llama2 models were converted to fp32 GGUF format using a converter, for an equitable comparison.
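As an illustration of that conversion step: llama.cpp ships a convert-llama2c-to-ggml example for exactly this model family, but the flags below are written from memory and should be treated as assumptions; check the tool's --help before relying on them.

```python
import subprocess

# Hypothetical invocation of llama.cpp's convert-llama2c-to-ggml example;
# all flags and file names below are assumptions, not verified usage.
subprocess.run(
    [
        "./convert-llama2c-to-ggml",
        "--copy-vocab-from-model", "models/ggml-vocab.bin",  # assumed flag
        "--llama2c-model", "stories15M.bin",                 # assumed flag
        "--llama2c-output-model", "stories15M-f32.gguf",     # assumed flag
    ],
    check=True,
)
```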
(c) Aydyn Tairov, 2023, MIT License