From 2bafee56030f3aa631ed773b194fda94aed54ba3 Mon Sep 17 00:00:00 2001
From: Raj Ghosh <148113238+bxbee@users.noreply.github.com>
Date: Sat, 31 Jan 2026 21:28:00 +0530
Subject: [PATCH] Revise README for performance improvements and features

Updated README to improve clarity and fix formatting issues.
---
 src/README.md | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/src/README.md b/src/README.md
index f713b9ab2..658e67488 100644
--- a/src/README.md
+++ b/src/README.md
@@ -1,20 +1,18 @@
 # BitNet CPU Inference Optimization
 
-This update provides significant performance improvements for BitNet inference on CPU through paralleled kernel implementations, native I2_S GEMM/GEMV support, configurable tiling block size and embedding quantization.
+This update delivers significant performance improvements for BitNet inference on CPU through parallelized kernel implementations, native I2_S GEMM/GEMV support, configurable tiling block sizes, and embedding quantization.
 
 ## Update
 
-- **Parallel Weight & Activation Computation**
-  Implemented parallel processing of weights and activations in the W2A8 vet_dot kernel, achieving improved throughput on both x86 and ARM architectures.
+- **Parallel Weight & Activation Computation**
+  Implemented parallel processing of weights and activations in the W2A8 vec_dot kernel, achieving higher throughput on both x86 and ARM architectures.
+- **Native I2_S GEMM & GEMV Support**
+  Integrated I2_S GEMM and GEMV operations into the ggml library, ensuring full compatibility with the llama.cpp architecture. This enables seamless integration with existing inference pipelines.
+- **Configurable Tiling & Parallelism**
+  Introduced configurable GEMM and GEMV block sizes along with adjustable parallelism levels, allowing fine-tuned performance optimization across different CPU architectures.
+- **Embedding Quantization**
+  Added support for embedding layer quantization using the Q6_K format, reducing memory footprint and improving inference speed while maintaining high accuracy.
 
-- **Native I2_S GEMM & GEMV Support**
-  Integrated I2_S GEMM and GEMV operations into ggml library, making them fully compatible with the llama.cpp architecture. This enables seamless integration with existing inference pipelines.
-
-- **Configurable Tiling & Parallelism**
-  Introduced configurable GEMM & GEMV block sizes and parallelism levels, allowing performance fine-tuning for different CPU architectures.
-
-- **Embedding Quantization**
-  Added support for embedding layer quantization with Q6_K format, reducing memory footprint and improving inference speed while maintaining high accuracy.
 
 
 ## Usage
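
Aside for reviewers, not part of the patch above: the first bullet describes parallelizing the W2A8 (2-bit weight, 8-bit activation) vec_dot path. Below is a minimal sketch of that idea; the packed weight layout, the {-1, 0, +1} decode table, the OpenMP pragma, and all names are assumptions for illustration and do not reflect the actual kernel this change adds to ggml.

```c
/* Illustrative sketch only: 2-bit ternary weights packed four per byte,
 * multiplied against int8 activations, with rows processed in parallel. */
#include <stdint.h>
#include <stddef.h>

/* Assumed mapping from a 2-bit code to a ternary weight value. */
static const int8_t k_ternary[4] = { -1, 0, 1, 0 };

/* Dot product of one packed 2-bit weight row with an int8 activation vector.
 * Four weights per byte, so n must be a multiple of 4. */
static int32_t w2a8_vec_dot(const uint8_t *w_packed, const int8_t *act, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n / 4; ++i) {
        const uint8_t byte = w_packed[i];
        for (int j = 0; j < 4; ++j) {
            const int8_t w = k_ternary[(byte >> (2 * j)) & 0x3];
            acc += (int32_t)w * act[4 * i + j];
        }
    }
    return acc;
}

/* GEMV: every output row is an independent dot product, so rows can be
 * distributed across threads (here via OpenMP). */
void w2a8_gemv(const uint8_t *w_packed, const int8_t *act,
               const float *row_scales, float *out,
               size_t rows, size_t cols)
{
    #pragma omp parallel for
    for (size_t r = 0; r < rows; ++r) {
        out[r] = row_scales[r] * (float)w2a8_vec_dot(w_packed + r * (cols / 4), act, cols);
    }
}
```

Built with -fopenmp the row loop spreads across cores; without it the pragma is ignored and the code remains correct single-threaded.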