From 5219479196f3d1ed44b48ca122ed737e28bddd0d Mon Sep 17 00:00:00 2001 From: tfwang Date: Mon, 19 Jan 2026 18:08:36 +0800 Subject: [PATCH 1/5] add: Evaluating Large Language Models with lm-evaluation-harness --- docs/en/solutions/How_to_Evaluate_LLM.md | 541 +++++++++++++++++++ docs/public/lm-eval/lm-eval_quick_star.ipynb | 401 ++++++++++++++ 2 files changed, 942 insertions(+) create mode 100644 docs/en/solutions/How_to_Evaluate_LLM.md create mode 100644 docs/public/lm-eval/lm-eval_quick_star.ipynb diff --git a/docs/en/solutions/How_to_Evaluate_LLM.md b/docs/en/solutions/How_to_Evaluate_LLM.md new file mode 100644 index 0000000..9cbf929 --- /dev/null +++ b/docs/en/solutions/How_to_Evaluate_LLM.md @@ -0,0 +1,541 @@ +--- +products: + - Alauda AI +kind: + - Solution +ProductsVersion: + - 4.x +--- +# Evaluating Large Language Models with lm-evaluation-harness + +## Overview + +The **lm-evaluation-harness** (lm-eval) is a unified framework developed by EleutherAI for testing generative language models on a large number of evaluation tasks. It provides a standardized way to measure and compare LLM performance across different benchmarks. + +### Key Features + +- **60+ Standard Academic Benchmarks**: Includes hundreds of subtasks and variants for comprehensive evaluation +- **Multiple Model Backends**: Support for HuggingFace Transformers, vLLM, API-based models (OpenAI, Anthropic), and local inference servers +- **Flexible Task Types**: Supports various evaluation methods including: + - `generate_until`: Generation tasks with stopping criteria + - `loglikelihood`: Log-likelihood evaluation for classification + - `loglikelihood_rolling`: Perplexity evaluation + - `multiple_choice`: Multiple-choice question answering +- **Reproducible Evaluations**: Public prompts ensure reproducibility and comparability +- **API Support**: Evaluate models via OpenAI-compatible APIs, Anthropic, and custom endpoints +- **Optimized Performance**: Data-parallel evaluation, vLLM acceleration, and automatic batch sizing + +### Common Use Cases + +lm-evaluation-harness is particularly valuable for: + +- **Model Development**: Benchmark base models and track performance across training checkpoints +- **Fine-tuning Validation**: Compare fine-tuned models against base models to measure improvement or regression +- **Model Compression**: Evaluate quantized, pruned, or distilled models to assess the performance-efficiency tradeoff +- **Model Selection**: Compare different models on the same benchmarks to select the best fit for your use case +- **Reproducible Research**: Ensure consistent evaluation methodology across experiments and publications + +The framework is used by Hugging Face's Open LLM Leaderboard, referenced in hundreds of research papers, and adopted by organizations including NVIDIA, Cohere, and Mosaic ML. + +## Quickstart + +### Installation + +Install the base package: + +```bash +pip install lm-eval +``` + +For API-based evaluation (recommended for production model services): + +```bash +pip install "lm_eval[api]" +``` + +### Basic Usage + +#### 1. List Available Tasks + +```bash +lm-eval ls tasks +``` + +#### 2. Evaluate via OpenAI-Compatible API + +This is the recommended approach for evaluating model services deployed with OpenAI-compatible APIs. 
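Before launching an evaluation, it helps to confirm that the endpoint is reachable and to check the exact model name it serves, since that name is what you pass as `model=` in `--model_args`. A minimal check, assuming a local OpenAI-compatible server (the URL and model name below are placeholders):

```bash
# List the models served by the endpoint; use the returned name as `model=` in --model_args
curl http://localhost:8000/v1/models

# Optional smoke test of the chat completions route
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen2.5-7B-Instruct", "messages": [{"role": "user", "content": "Hello"}]}'
```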
+ +**Example** (evaluate a local model service): + +```bash +lm-eval --model local-chat-completions \ + --model_args model=Qwen/Qwen2.5-7B-Instruct,base_url=http://localhost:8000/v1 \ + --tasks gsm8k,arc_easy,hellaswag \ + --batch_size 8 \ + --output_path ./results +``` + +**Key Parameters**: +- `--model`: Use `local-chat-completions` for local API servers, `openai-chat-completions` for OpenAI +- `--model_args`: + - `model`: Model name or identifier + - `base_url`: API endpoint (for local services only) + - `api_key`: API key if required (can also use environment variable) + - `tokenizer` (optional): Path to tokenizer for accurate token counting + - `tokenized_requests` (optional): Whether to use local tokenization (default: False) +- `--tasks`: Comma-separated list of evaluation tasks +- `--batch_size`: Number of requests to process in parallel (adjust based on API rate limits) +- `--output_path`: Directory to save evaluation results + +**About Tokenization**: + +lm-eval supports two tokenization modes via the `tokenized_requests` parameter: + +- **`tokenized_requests=False` (default)**: Text is sent to the API server, which handles tokenization. Simpler setup, suitable for `generate_until` tasks. +- **`tokenized_requests=True`**: lm-eval tokenizes text locally and sends token IDs to the API. Required for tasks needing token-level log probabilities. + +**Task-specific requirements**: + +- **`generate_until` tasks** (GSM8K, HumanEval, MATH, DROP, SQuAD, etc.): + - Work with `tokenized_requests=False` (server-side tokenization) + - No logprobs needed + - ✅ Fully supported with chat APIs + +- **`multiple_choice` tasks** (MMLU, ARC, HellaSwag, PIQA, etc.): + - Internally use `loglikelihood` to score each choice + - Work with `tokenized_requests=False` but less accurate + - ⚠️ Work better with logprobs support (not available in most chat APIs) + +- **`loglikelihood` / `loglikelihood_rolling` tasks** (LAMBADA, perplexity evaluation): + - Require `tokenized_requests=True` + token-level log probabilities from API + - ❌ Not supported by most chat APIs (OpenAI ChatCompletions, etc.) 
+ - Use local models (HuggingFace, vLLM) for these tasks + +**Optional tokenizer configuration** (for accurate token counting or local tokenization): + +```bash +lm-eval --model local-chat-completions \ + --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1,tokenizer=MODEL_NAME,tokenized_requests=False \ + --tasks gsm8k +``` + +Available tokenization parameters in `model_args`: +- `tokenizer`: Path or name of the tokenizer (e.g., HuggingFace model name) +- `tokenizer_backend`: Tokenization system - `"huggingface"` (default), `"tiktoken"`, or `"none"` +- `tokenized_requests`: `True` (client-side) or `False` (server-side, default) + +### Advanced Options + +#### Save Results and Sample Responses + +Enable `--log_samples` to save individual model responses for detailed analysis: + +```bash +lm-eval --model local-chat-completions \ + --model_args model=Qwen/Qwen2.5-7B-Instruct,base_url=http://localhost:8000/v1 \ + --tasks gsm8k,hellaswag \ + --output_path ./results \ + --log_samples +``` + +This creates a `results/` directory containing: +- `results.json`: Overall evaluation metrics +- `*_eval_samples.json`: Individual samples with model predictions and references + +#### Use Configuration File + +For complex evaluations, use a YAML configuration file: + +```yaml +model: local-chat-completions +model_args: + model: Qwen/Qwen2.5-7B-Instruct + base_url: http://localhost:8000/v1 +tasks: + - mmlu + - gsm8k + - arc_easy + - arc_challenge + - hellaswag +batch_size: 8 +output_path: ./results +log_samples: true +``` + +Run with config: + +```bash +lm-eval --config config.yaml +``` + +#### Quick Testing with Limited Examples + +Test your setup with a small number of examples before running full evaluations: + +```bash +lm-eval --model local-chat-completions \ + --model_args model=Qwen/Qwen2.5-7B-Instruct,base_url=http://localhost:8000/v1 \ + --tasks mmlu \ + --limit 10 +``` + +#### Compare Multiple Models + +Evaluate multiple model endpoints by running separate evaluations: + +```bash +# Evaluate base model +lm-eval --model local-chat-completions \ + --model_args model=Qwen/Qwen2.5-7B,base_url=http://localhost:8000/v1 \ + --tasks gsm8k,mmlu \ + --output_path ./results/base_model + +# Evaluate fine-tuned model +lm-eval --model local-chat-completions \ + --model_args model=Qwen/Qwen2.5-7B-finetuned,base_url=http://localhost:8001/v1 \ + --tasks gsm8k,mmlu \ + --output_path ./results/finetuned_model +``` + +**Note**: lm-eval outputs separate `results.json` files for each evaluation. To compare results, you need to read and analyze the JSON files manually. 
Here's a simple Python script to compare results: + +```python +import json + +# Load results +with open('./results/base_model/results.json') as f: + base_results = json.load(f)['results'] + +with open('./results/finetuned_model/results.json') as f: + finetuned_results = json.load(f)['results'] + +# Compare results +print("Model Comparison:") +print("-" * 60) +for task in base_results.keys(): + print(f"\n{task}:") + for metric in base_results[task].keys(): + if not metric.endswith('_stderr'): + base_score = base_results[task][metric] + finetuned_score = finetuned_results[task][metric] + diff = finetuned_score - base_score + print(f" {metric}:") + print(f" Base: {base_score:.4f}") + print(f" Fine-tuned: {finetuned_score:.4f}") + print(f" Difference: {diff:+.4f}") +``` + +#### API-Specific Considerations + +**Controlling Request Rate**: Adjust these parameters to match your API capacity: + +```bash +lm-eval --model local-chat-completions \ + --model_args model=MODEL_NAME,base_url=API_URL,num_concurrent=1,max_retries=3,timeout=60 \ + --tasks gsm8k \ + --batch_size 1 +``` + +**Available parameters in `model_args`**: +- `num_concurrent`: Number of concurrent requests. Typical values: 1 (sequential), 10, 50, or 128 depending on API capacity. +- `max_retries`: Number of retries for failed requests. Common values: 3, 5, or more. +- `timeout`: Request timeout in seconds. Adjust based on model size and API speed (e.g., 60, 300, or higher for large models). +- `batch_size`: Number of requests to batch together (set via `--batch_size` flag, not in `model_args`) + +**Authentication**: Set API keys via environment variables or model_args: + +```bash +# Via environment variable +export OPENAI_API_KEY=your_key + +# Or via model_args +lm-eval --model local-chat-completions \ + --model_args model=MODEL_NAME,base_url=API_URL,api_key=YOUR_KEY \ + --tasks gsm8k +``` + +### Alternative Model Backends + +While API-based evaluation is recommended for production services, lm-eval also supports: + +- **HuggingFace Transformers** (`--model hf`): For local model evaluation with full access to logprobs +- **vLLM** (`--model vllm`): For optimized local inference with tensor parallelism +- Other backends: See the [official documentation](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs) for details + +## Datasets + +lm-eval includes 60+ standard academic benchmarks. Below is a comprehensive overview of available datasets. + +### Understanding Task Types + +Before reviewing the datasets, it's important to understand the different task types: + +- **`generate_until`**: Generate text until a stopping condition (e.g., newline, max tokens). Best for open-ended generation tasks. Works with both chat and completion APIs. +- **`multiple_choice`**: Select from multiple options. Can work with or without logprobs (more accurate with logprobs). Works with both chat and completion APIs. +- **`loglikelihood`**: Calculate token-level log probabilities. Requires API to return logprobs. Only works with completion APIs or local models. +- **`loglikelihood_rolling`**: Calculate perplexity over sequences. Requires logprobs. Only works with completion APIs or local models. 
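To make the last two task types concrete, here is a minimal sketch of running a `loglikelihood` task (LAMBADA) through a completions-style backend. The model name, URL, and tokenizer below are placeholders, and the endpoint must actually return token-level logprobs for this to work:

```bash
# Loglikelihood tasks go through the completions API and need a local tokenizer
lm-eval --model local-completions \
  --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1/completions,tokenizer=MODEL_NAME \
  --tasks lambada_openai \
  --batch_size 1 \
  --output_path ./results
```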
+ +### Complete Dataset Reference + +| Category | Dataset | Task Name | Task Type | Output Metrics | API Interface | Tokenization | Description | +|----------|---------|-----------|-----------|----------------|---------------|--------------|-------------| +| **General Knowledge** | MMLU | `mmlu_*` (57 subjects) | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | 57 subjects covering STEM, humanities, social sciences: abstract_algebra, anatomy, astronomy, business_ethics, clinical_knowledge, college_biology, college_chemistry, college_computer_science, college_mathematics, college_medicine, college_physics, computer_security, conceptual_physics, econometrics, electrical_engineering, elementary_mathematics, formal_logic, global_facts, high_school_biology, high_school_chemistry, high_school_computer_science, high_school_european_history, high_school_geography, high_school_government_and_politics, high_school_macroeconomics, high_school_mathematics, high_school_microeconomics, high_school_physics, high_school_psychology, high_school_statistics, high_school_us_history, high_school_world_history, human_aging, human_sexuality, international_law, jurisprudence, logical_fallacies, machine_learning, management, marketing, medical_genetics, miscellaneous, moral_disputes, moral_scenarios, nutrition, philosophy, prehistory, professional_accounting, professional_law, professional_medicine, professional_psychology, public_relations, security_studies, sociology, us_foreign_policy, virology, world_religions | +| | MMLU-Pro | `mmlu_pro` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Enhanced MMLU with 10 options per question and higher difficulty | +| | AGIEval | `agieval` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Academic exams including LSAT, SAT, GaoKao (Chinese & English) | +| | C-Eval | `ceval` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Chinese comprehensive evaluation across 52 subjects | +| **Instruction Following** | IFEval | `ifeval` | `generate_until` | `prompt_level_strict_acc`, `inst_level_strict_acc` | Chat / Completion | Server-side | Instruction following evaluation with verifiable constraints | +| **Commonsense Reasoning** | HellaSwag | `hellaswag` | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | Sentence completion with commonsense reasoning | +| | ARC | `arc_easy` | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | Easy grade-school science questions | +| | | `arc_challenge` | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | Challenging grade-school science questions | +| | WinoGrande | `winogrande` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Pronoun resolution reasoning | +| | OpenBookQA | `openbookqa` | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | Open book question answering | +| | CommonsenseQA | `commonsense_qa` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Commonsense question answering | +| | Social IQA | `social_iqa` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Social interaction question answering | +| **Mathematics** | GSM8K | `gsm8k` | `generate_until` | `exact_match` | Chat / Completion | Server-side | Grade-school math word problems | +| | | `gsm8k_cot` | `generate_until` | `exact_match` | Chat / Completion | Server-side | GSM8K with chain-of-thought prompting | +| | MATH | `minerva_math` | `generate_until` | `exact_match` | Chat / 
Completion | Server-side | Competition-level mathematics problems | +| | MathQA | `mathqa` | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | Math word problems with multiple choice | +| | MGSM | `mgsm_direct`, `mgsm_native_cot` | `generate_until` | `exact_match` | Chat / Completion | Server-side | Multilingual grade-school math (10 languages: Bengali, Chinese, French, German, Japanese, Russian, Spanish, Swahili, Telugu, Thai) | +| **Coding** | HumanEval | `humaneval` | `generate_until` | `pass@1`, `pass@10`, `pass@100` | Chat / Completion | Server-side | Python code generation from docstrings | +| | MBPP | `mbpp` | `generate_until` | `pass@1`, `pass@10`, `pass@100` | Chat / Completion | Server-side | Basic Python programming problems | +| **Reading Comprehension** | RACE | `race` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Reading comprehension from exams | +| | SQuAD | `squad_v2` | `generate_until` | `exact`, `f1`, `HasAns_exact`, `HasAns_f1` | Chat / Completion | Server-side | Extractive question answering | +| | DROP | `drop` | `generate_until` | `em`, `f1` | Chat / Completion | Server-side | Reading comprehension requiring discrete reasoning | +| **Language Understanding** | LAMBADA | `lambada_openai` | `loglikelihood` | `perplexity`, `acc` | ❌ Requires logprobs | Client-side | Word prediction in context | +| | PIQA | `piqa` | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | Physical interaction question answering | +| | LogiQA | `logiqa` | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | Logical reasoning questions | +| | COPA | `copa` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Causal reasoning | +| | StoryCloze | `storycloze_2016` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Story completion task | +| **Truthfulness & Safety** | TruthfulQA | `truthfulqa_mc1` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Single-correct answer truthfulness | +| | | `truthfulqa_mc2` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Multiple-correct answer truthfulness | +| | | `truthfulqa_gen` | `generate_until` | `bleu_max`, `rouge1_max`, `rougeL_max`, `bleurt_max` | Chat / Completion | Server-side | Generative truthfulness evaluation (also outputs _acc and _diff variants) | +| | BBQ | `bbq_*` (11 categories) | `multiple_choice` | `acc` | Chat / Completion | Server-side | Bias benchmark: age, disability, gender, nationality, physical_appearance, race_ethnicity, religion, ses (socio-economic status), sexual_orientation, race_x_gender (intersectional), race_x_ses (intersectional) | +| **Multilingual** | Belebele | `belebele_zho_Hans` | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | Chinese (Simplified) reading comprehension | +| | | `belebele_zho_Hant` | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | Chinese (Traditional) reading comprehension | +| | | `belebele_eng_Latn` | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | English reading comprehension | +| | | `belebele_*` | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | 122 languages total (see full list with `lm-eval ls tasks`) | +| | XCOPA | `xcopa_*` (11 languages) | `multiple_choice` | `acc` | Chat / Completion | Server-side | Causal reasoning: et (Estonian), ht (Haitian), id (Indonesian), it (Italian), qu (Quechua), sw (Swahili), ta (Tamil), th (Thai), tr (Turkish), vi 
(Vietnamese), zh (Chinese) | +| | XWinograd | `xwinograd_*` (6 languages) | `multiple_choice` | `acc` | Chat / Completion | Server-side | Winograd schema: en (English), fr (French), jp (Japanese), pt (Portuguese), ru (Russian), zh (Chinese) | +| **Factual Knowledge** | Natural Questions | `nq_open` | `generate_until` | `exact_match` | Chat / Completion | Server-side | Open-domain question answering | +| | TriviaQA | `triviaqa` | `generate_until` | `exact_match` | Chat / Completion | Server-side | Trivia question answering | +| | Web Questions | `webqs` | `multiple_choice` | `exact_match` | Chat / Completion | Server-side | Question answering from web search queries | +| **Summarization** | CNN/DailyMail | `cnn_dailymail` | `generate_until` | `rouge1`, `rouge2`, `rougeL` | Chat / Completion | Server-side | News article summarization | +| **Translation** | WMT | `wmt14`, `wmt16`, `wmt20` | `generate_until` | `bleu`, `chrf` | Chat / Completion | Server-side | Machine translation benchmarks (multiple language pairs) | +| **BIG-Bench** | BIG-Bench Hard (BBH) | `bbh_cot_fewshot` (23 tasks) | `generate_until` | `acc`, `exact_match` | Chat / Completion | Server-side | 23 challenging tasks: boolean_expressions, causal_judgement, date_understanding, disambiguation_qa, dyck_languages, formal_fallacies, geometric_shapes, hyperbaton, logical_deduction (3/5/7 objects), movie_recommendation, multistep_arithmetic_two, navigate, object_counting, penguins_in_a_table, reasoning_about_colored_objects, ruin_names, salient_translation_error_detection, snarks, sports_understanding, temporal_sequences, tracking_shuffled_objects (3/5/7 objects), web_of_lies, word_sorting | +| **Domain-Specific** | MedQA | `medqa` | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | Medical question answering from USMLE exams | +| | MedMCQA | `medmcqa` | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | Medical multiple choice questions from Indian medical exams | +| | PubMedQA | `pubmedqa` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Biomedical question answering from PubMed abstracts | + +**Legend**: +- **Output Metrics**: These are the actual metric keys that appear in the output JSON (e.g., `acc`, `exact_match`, `pass@1`) +- **API Interface**: + - `Chat / Completion`: Works with both OpenAI-compatible chat and completion APIs + - `❌ Requires logprobs`: Only works with APIs that return token-level log probabilities, or local models +- **Tokenization**: + - `Server-side`: Uses `tokenized_requests=False` (default). Text is sent to API server, which handles tokenization. Works for `generate_until` and `multiple_choice` tasks. + - `Client-side`: Uses `tokenized_requests=True`. lm-eval tokenizes locally and sends token IDs. Required for `loglikelihood` tasks. Improves accuracy for `multiple_choice` tasks but requires logprobs support from API. 
+ +**Finding More Tasks**: +- Run `lm-eval ls tasks` to see all available tasks (60+ datasets with hundreds of variants) +- Many datasets have language-specific variants (e.g., `belebele_*`, `xcopa_*`) +- Task groups are available (e.g., `mmlu` runs all 57 MMLU subjects) + +### Usage Examples + +**Single task evaluation**: + +```bash +lm-eval --model local-chat-completions \ + --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1 \ + --tasks gsm8k \ + --output_path ./results +``` + +**Multiple tasks evaluation**: + +```bash +lm-eval --model local-chat-completions \ + --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1 \ + --tasks mmlu,gsm8k,arc_easy,arc_challenge,hellaswag \ + --output_path ./results +``` + +**Task group evaluation** (all MMLU subjects): + +```bash +lm-eval --model local-chat-completions \ + --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1 \ + --tasks mmlu \ + --output_path ./results +``` + +**Wildcard pattern** (specific MMLU subjects): + +```bash +lm-eval --model local-chat-completions \ + --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1 \ + --tasks "mmlu_mathematics,mmlu_physics,mmlu_chemistry" \ + --output_path ./results +``` + +**Multilingual evaluation** (Chinese Belebele): + +```bash +lm-eval --model local-chat-completions \ + --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1 \ + --tasks belebele_zho_Hans \ + --output_path ./results +``` + +### Common Task Combinations + +**General LLM Benchmark Suite** (recommended for API evaluation): + +```bash +lm-eval --model local-chat-completions \ + --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1 \ + --tasks mmlu,gsm8k,arc_challenge,hellaswag,winogrande,truthfulqa_mc2 \ + --output_path ./results +``` + +**Math & Reasoning Suite**: + +```bash +lm-eval --model local-chat-completions \ + --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1 \ + --tasks gsm8k,math,arc_challenge \ + --output_path ./results +``` + +**Code Generation Suite**: + +```bash +lm-eval --model local-chat-completions \ + --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1 \ + --tasks humaneval,mbpp \ + --output_path ./results +``` + +**Open LLM Leaderboard Suite**: + +```bash +lm-eval --model local-chat-completions \ + --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1 \ + --tasks leaderboard \ + --output_path ./results +``` + +### Finding Tasks + +**List all available tasks**: + +```bash +lm-eval ls tasks +``` + +**Search for specific tasks**: + +```bash +# Search for MMLU tasks +lm-eval ls tasks | grep mmlu + +# Search for math-related tasks +lm-eval ls tasks | grep -i math + +# Search for Chinese language tasks +lm-eval ls tasks | grep zho +``` + +**Task naming patterns**: +- Dataset groups: `mmlu`, `belebele` (runs all variants) +- Specific variants: `mmlu_mathematics`, `belebele_zho_Hans` +- Task variants: `gsm8k` vs `gsm8k_cot` (with/without chain-of-thought) + +### Understanding Output Results + +After running an evaluation, results are saved in JSON format. Here's what the key metrics mean: + +**Common Metrics** (as they appear in `results.json`): + +| Metric Key | Full Name | Description | Range | Higher is Better? 
| +|------------|-----------|-------------|-------|-------------------| +| `acc` | Accuracy | Proportion of correct answers | 0.0 - 1.0 | ✅ Yes | +| `acc_norm` | Normalized Accuracy | Accuracy using length-normalized probabilities | 0.0 - 1.0 | ✅ Yes | +| `exact_match` | Exact Match | Exact string match between prediction and reference | 0.0 - 1.0 | ✅ Yes | +| `exact` | Exact Match (SQuAD) | Exact match metric for SQuAD tasks | 0.0 - 100.0 | ✅ Yes | +| `em` | Exact Match (DROP) | Exact match metric for DROP task | 0.0 - 1.0 | ✅ Yes | +| `pass@1` | Pass at 1 | Percentage of problems solved on first attempt | 0.0 - 1.0 | ✅ Yes | +| `pass@10` | Pass at 10 | Percentage of problems solved in 10 attempts | 0.0 - 1.0 | ✅ Yes | +| `f1` | F1 Score | Harmonic mean of precision and recall | 0.0 - 1.0 | ✅ Yes | +| `bleu`, `bleu_max` | BLEU Score | Text similarity metric for generation/translation | 0.0 - 100.0 | ✅ Yes | +| `rouge1`, `rouge2`, `rougeL` | ROUGE Scores | Recall-oriented text similarity | 0.0 - 1.0 | ✅ Yes | +| `perplexity` | Perplexity | Model's uncertainty (lower is better) | > 0 | ❌ No (lower is better) | + +**Example output structure**: + +```json +{ + "results": { + "mmlu": { + "acc": 0.6234, + "acc_norm": 0.6456, + "acc_stderr": 0.0123, + "acc_norm_stderr": 0.0115 + }, + "gsm8k": { + "exact_match": 0.5621, + "exact_match_stderr": 0.0142 + } + }, + "versions": { + "mmlu": 0, + "gsm8k": 1 + }, + "config": { + "model": "local-chat-completions", + "model_args": "model=Qwen/Qwen2.5-7B-Instruct,base_url=http://localhost:8000/v1", + "batch_size": 8 + } +} +``` + +**Notes**: +- `*_stderr`: Standard error of the metric (indicates confidence in the result) +- Multiple metrics per task: Some tasks report several metrics (e.g., both `acc` and `acc_norm`) +- Use the metric field names exactly as shown when referring to results in reports + +### API Requirements by Task Type + +| Task Type | Logprobs Required | Best Interface | Tokenization | Notes | +|-----------|------------------|----------------|--------------|-------| +| `generate_until` | No | Chat API | Server-side | Recommended for API evaluation. No local tokenizer needed. | +| `multiple_choice` | Recommended | Both | Server-side | Works with APIs but accuracy improves with logprobs | +| `loglikelihood` | Yes | Completion only | Client-side | Requires token-level probabilities. Not supported by most chat APIs. | +| `loglikelihood_rolling` | Yes | Completion only | Client-side | For perplexity evaluation. Not supported by most chat APIs. | + +**Important Notes**: +- **For API-based evaluation**: Focus on `generate_until` tasks (GSM8K, ARC, MMLU, etc.). These don't require local tokenization or logprobs. +- **Tokenization**: With chat APIs, tokenization is handled server-side automatically. You only need to specify a local tokenizer if you want accurate token counting for cost estimation. +- **Logprobs limitation**: OpenAI ChatCompletions and most chat APIs don't provide token-level logprobs, making `loglikelihood` tasks unavailable. Use local models (HuggingFace, vLLM) if you need these task types. 
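If you do need `loglikelihood` or `loglikelihood_rolling` tasks, run them against a local backend instead. A sketch with placeholder model names (`pretrained=` is the model argument used by the `hf` and `vllm` backends):

```bash
# HuggingFace Transformers backend: full access to token-level logprobs
lm-eval --model hf \
  --model_args pretrained=MODEL_NAME \
  --tasks lambada_openai \
  --batch_size auto \
  --output_path ./results

# vLLM backend: faster local inference, optionally with tensor parallelism
lm-eval --model vllm \
  --model_args pretrained=MODEL_NAME,tensor_parallel_size=1 \
  --tasks lambada_openai \
  --batch_size auto \
  --output_path ./results
```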
+ +## Additional Resources + +### Official Documentation + +- **GitHub Repository**: [https://github.com/EleutherAI/lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) +- **Documentation**: [https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs) +- **Task Implementation Guide**: [https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md](https://github.com/EleutherAI/lm-evaluation-harness/blob/main/docs/new_task_guide.md) + +### Example Notebook + +Download the Jupyter notebook example: [lm-eval Quick Start Notebook](../public/lm-eval/lm-eval_quick_star.ipynb) + +### Tips & Best Practices + +1. **Start Small**: Use `--limit 10` to test your setup before running full evaluations +2. **Use Auto Batch Size**: Set `--batch_size auto` for optimal GPU utilization +3. **Save Results**: Always use `--output_path` and `--log_samples` for reproducibility +4. **Cache Results**: Use `--use_cache ` to resume interrupted evaluations +5. **Check Task Compatibility**: Verify your model supports the required output format (logprobs, generation, etc.) +6. **Monitor Resources**: Large evaluations can take hours; use tools like `htop` or `nvidia-smi` to monitor +7. **Validate First**: Use `lm-eval validate --tasks ` to check task configuration diff --git a/docs/public/lm-eval/lm-eval_quick_star.ipynb b/docs/public/lm-eval/lm-eval_quick_star.ipynb new file mode 100644 index 0000000..2a91a39 --- /dev/null +++ b/docs/public/lm-eval/lm-eval_quick_star.ipynb @@ -0,0 +1,401 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# LM-Evaluation-Harness Quick Start Guide\n", + "\n", + "This notebook demonstrates how to use the lm-evaluation-harness (lm-eval) to evaluate language models using command-line interface.\n", + "\n", + "## Prerequisites\n", + "\n", + "Install lm-eval with the required backends:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Install lm-eval with API support\n", + "!pip install \"lm_eval[api]\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. List Available Tasks\n", + "\n", + "First, let's see what evaluation tasks are available:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# List all available tasks (showing first 20 lines)\n", + "!lm-eval ls tasks | head -20" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Search for Specific Tasks\n", + "\n", + "You can search for specific tasks using grep:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Search for MMLU tasks\n", + "!lm-eval ls tasks | grep mmlu | head -20\n", + "\n", + "# Search for math-related tasks\n", + "!lm-eval ls tasks | grep -i math\n", + "\n", + "# Search for Chinese language tasks\n", + "!lm-eval ls tasks | grep zho" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. 
Quick Test with Limited Examples\n", + "\n", + "Before running a full evaluation, it's good practice to test with a small number of examples:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Test with 5 examples from hellaswag\n", + "# Replace the base_url and model name with your local API endpoint\n", + "!lm-eval --model local-chat-completions \\\n", + " --model_args model=Qwen/Qwen2.5-0.5B-Instruct,base_url=http://localhost:8000/v1 \\\n", + " --tasks hellaswag \\\n", + " --limit 5" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Evaluate on Multiple Tasks\n", + "\n", + "Run evaluation on multiple tasks suitable for API models (generation-based tasks):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Evaluate on GSM8K (math reasoning)\n", + "!lm-eval --model local-chat-completions \\\n", + " --model_args model=Qwen/Qwen2.5-7B-Instruct,base_url=http://localhost:8000/v1 \\\n", + " --tasks gsm8k \\\n", + " --batch_size 8 \\\n", + " --output_path ./results \\\n", + " --log_samples" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Evaluate with Configuration File\n", + "\n", + "For more complex evaluations, use a YAML configuration file:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Create a configuration file\n", + "config = \"\"\"\n", + "model: local-chat-completions\n", + "model_args:\n", + " model: Qwen/Qwen2.5-7B-Instruct\n", + " base_url: http://localhost:8000/v1\n", + "tasks:\n", + " - gsm8k\n", + " - arc_easy\n", + " - hellaswag\n", + "batch_size: 8\n", + "output_path: ./results\n", + "log_samples: true\n", + "\"\"\"\n", + "\n", + "with open('eval_config.yaml', 'w') as f:\n", + " f.write(config)\n", + "\n", + "print(\"Configuration file created!\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Run evaluation with config file\n", + "!lm-eval --config eval_config.yaml --limit 10" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. OpenAI API Evaluation\n", + "\n", + "To evaluate models via OpenAI API:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import os\n", + "\n", + "# Set your OpenAI API key\n", + "os.environ['OPENAI_API_KEY'] = 'your-api-key-here'\n", + "\n", + "# Evaluate GPT-4o-mini\n", + "!lm-eval --model openai-chat-completions \\\n", + " --model_args model=gpt-4o-mini \\\n", + " --tasks gsm8k \\\n", + " --limit 20" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 6. Comprehensive Evaluation Suite\n", + "\n", + "Run a comprehensive evaluation on multiple benchmarks:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Comprehensive evaluation (generation-based tasks for API models)\n", + "!lm-eval --model local-chat-completions \\\n", + " --model_args model=Qwen/Qwen2.5-7B-Instruct,base_url=http://localhost:8000/v1 \\\n", + " --tasks gsm8k,arc_easy,arc_challenge,boolq,piqa \\\n", + " --batch_size 8 \\\n", + " --output_path ./comprehensive_results \\\n", + " --log_samples" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 7. 
View Results\n", + "\n", + "After evaluation completes, you can view the results:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import json\n", + "\n", + "# Load and display results\n", + "with open('./results/results.json', 'r') as f:\n", + " results = json.load(f)\n", + "\n", + "# Display the results\n", + "print(\"=== Evaluation Results ===\")\n", + "print(json.dumps(results, indent=2))\n", + "\n", + "# Explain common metrics\n", + "print(\"\\n=== Common Output Metrics ===\")\n", + "print(\"- acc: Accuracy (proportion of correct answers)\")\n", + "print(\"- acc_norm: Normalized accuracy (using length-normalized probabilities)\")\n", + "print(\"- exact_match: Exact string match between prediction and reference\")\n", + "print(\"- pass@1, pass@10: Percentage of problems solved (for code generation)\")\n", + "print(\"- f1: F1 score (harmonic mean of precision and recall)\")\n", + "print(\"- bleu, rouge: Text similarity metrics for generation tasks\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 8. Advanced: Task-Specific Examples\n", + "\n", + "### Mathematics Evaluation (GSM8K with Chain-of-Thought)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# GSM8K with chain-of-thought reasoning\n", + "!lm-eval --model local-chat-completions \\\n", + " --model_args model=Qwen/Qwen2.5-7B-Instruct,base_url=http://localhost:8000/v1 \\\n", + " --tasks gsm8k_cot \\\n", + " --batch_size 8 \\\n", + " --output_path ./results/gsm8k_cot" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Multilingual Evaluation" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Evaluate on Chinese Belebele\n", + "!lm-eval --model local-chat-completions \\\n", + " --model_args model=Qwen/Qwen2.5-7B-Instruct,base_url=http://localhost:8000/v1 \\\n", + " --tasks belebele_zho_Hans \\\n", + " --batch_size 8 \\\n", + " --output_path ./results/belebele_chinese" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Multiple MMLU Subjects" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Evaluate on specific MMLU subjects\n", + "!lm-eval --model local-chat-completions \\\n", + " --model_args model=Qwen/Qwen2.5-7B-Instruct,base_url=http://localhost:8000/v1 \\\n", + " --tasks mmlu_abstract_algebra,mmlu_anatomy,mmlu_astronomy \\\n", + " --batch_size 8 \\\n", + " --output_path ./results/mmlu_subset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 9. Caching and Resume\n", + "\n", + "Use caching to resume interrupted evaluations:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Run with caching enabled\n", + "!lm-eval --model local-chat-completions \\\n", + " --model_args model=Qwen/Qwen2.5-7B-Instruct,base_url=http://localhost:8000/v1 \\\n", + " --tasks gsm8k \\\n", + " --batch_size 8 \\\n", + " --use_cache ./cache \\\n", + " --output_path ./results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Tips and Best Practices\n", + "\n", + "1. **Always test first**: Use `--limit 5` or `--limit 10` to verify your setup\n", + "2. **Save results**: Use `--output_path` and `--log_samples` for reproducibility\n", + "3. 
**Choose appropriate tasks**: \n", + " - Use generation tasks (`gsm8k`, `arc_easy`, etc.) for API models without logprobs\n", + " - Use all task types for local models that provide logprobs\n", + "4. **Monitor resources**: Large evaluations can take time; monitor with `htop` or `nvidia-smi`\n", + "5. **Use caching**: Enable `--use_cache` for long evaluations that might be interrupted\n", + "6. **Batch size**: Adjust based on your API rate limits and model capacity\n", + "\n", + "## Common Task Categories\n", + "\n", + "### Generation Tasks (Work with all API types)\n", + "- `gsm8k`, `gsm8k_cot` - Math reasoning\n", + "- `humaneval`, `mbpp` - Code generation\n", + "- `truthfulqa_gen` - Truthfulness (generation variant)\n", + "\n", + "### Multiple Choice Tasks (May work without logprobs)\n", + "- `mmlu` - General knowledge (57 subjects)\n", + "- `arc_easy`, `arc_challenge` - Science questions\n", + "- `hellaswag` - Commonsense reasoning\n", + "- `winogrande` - Pronoun resolution\n", + "- `boolq` - Yes/no questions\n", + "- `piqa` - Physical commonsense\n", + "\n", + "### Loglikelihood Tasks (Require logprobs - local models only)\n", + "- `lambada_openai` - Word prediction\n", + "- Perplexity evaluation tasks\n", + "\n", + "## Resources\n", + "\n", + "- **Documentation**: https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs\n", + "- **GitHub**: https://github.com/EleutherAI/lm-evaluation-harness\n", + "- **Open LLM Leaderboard**: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 5f78680451c86c32a724cd36d6de38df6a5d0a82 Mon Sep 17 00:00:00 2001 From: tfwang Date: Mon, 19 Jan 2026 18:39:18 +0800 Subject: [PATCH 2/5] update --- docs/en/solutions/How_to_Evaluate_LLM.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/en/solutions/How_to_Evaluate_LLM.md b/docs/en/solutions/How_to_Evaluate_LLM.md index 9cbf929..6d50dcf 100644 --- a/docs/en/solutions/How_to_Evaluate_LLM.md +++ b/docs/en/solutions/How_to_Evaluate_LLM.md @@ -528,7 +528,7 @@ After running an evaluation, results are saved in JSON format. Here's what the k ### Example Notebook -Download the Jupyter notebook example: [lm-eval Quick Start Notebook](../public/lm-eval/lm-eval_quick_star.ipynb) +Download the Jupyter notebook example: [lm-eval Quick Start Notebook](/lm-eval/lm-eval_quick_star.ipynb) ### Tips & Best Practices From f6d94be6815ef6145cee67d4869861d754a9b7ef Mon Sep 17 00:00:00 2001 From: tfwang Date: Mon, 19 Jan 2026 18:46:04 +0800 Subject: [PATCH 3/5] update --- docs/public/lm-eval/lm-eval_quick_star.ipynb | 61 +++++--------------- 1 file changed, 13 insertions(+), 48 deletions(-) diff --git a/docs/public/lm-eval/lm-eval_quick_star.ipynb b/docs/public/lm-eval/lm-eval_quick_star.ipynb index 2a91a39..372ac54 100644 --- a/docs/public/lm-eval/lm-eval_quick_star.ipynb +++ b/docs/public/lm-eval/lm-eval_quick_star.ipynb @@ -163,35 +163,20 @@ { "cell_type": "markdown", "metadata": {}, - "source": [ - "## 5. 
OpenAI API Evaluation\n", - "\n", - "To evaluate models via OpenAI API:" - ] + "source": [] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], - "source": [ - "import os\n", - "\n", - "# Set your OpenAI API key\n", - "os.environ['OPENAI_API_KEY'] = 'your-api-key-here'\n", - "\n", - "# Evaluate GPT-4o-mini\n", - "!lm-eval --model openai-chat-completions \\\n", - " --model_args model=gpt-4o-mini \\\n", - " --tasks gsm8k \\\n", - " --limit 20" - ] + "source": [] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## 6. Comprehensive Evaluation Suite\n", + "## 5. Comprehensive Evaluation Suite\n", "\n", "Run a comprehensive evaluation on multiple benchmarks:" ] @@ -215,7 +200,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## 7. View Results\n", + "## 6. View Results\n", "\n", "After evaluation completes, you can view the results:" ] @@ -250,7 +235,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## 8. Advanced: Task-Specific Examples\n", + "## 7. Advanced: Task-Specific Examples\n", "\n", "### Mathematics Evaluation (GSM8K with Chain-of-Thought)" ] @@ -315,7 +300,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## 9. Caching and Resume\n", + "## 8. Caching and Resume\n", "\n", "Use caching to resume interrupted evaluations:" ] @@ -341,39 +326,19 @@ "source": [ "## Tips and Best Practices\n", "\n", - "1. **Always test first**: Use `--limit 5` or `--limit 10` to verify your setup\n", + "1. **Always test first**: Use `--limit 5` or `--limit 10` to verify your setup before running full evaluations\n", "2. **Save results**: Use `--output_path` and `--log_samples` for reproducibility\n", - "3. **Choose appropriate tasks**: \n", - " - Use generation tasks (`gsm8k`, `arc_easy`, etc.) for API models without logprobs\n", - " - Use all task types for local models that provide logprobs\n", + "3. **Choose appropriate tasks**: Refer to the complete task list in the documentation for detailed task information\n", "4. **Monitor resources**: Large evaluations can take time; monitor with `htop` or `nvidia-smi`\n", "5. **Use caching**: Enable `--use_cache` for long evaluations that might be interrupted\n", - "6. **Batch size**: Adjust based on your API rate limits and model capacity\n", - "\n", - "## Common Task Categories\n", - "\n", - "### Generation Tasks (Work with all API types)\n", - "- `gsm8k`, `gsm8k_cot` - Math reasoning\n", - "- `humaneval`, `mbpp` - Code generation\n", - "- `truthfulqa_gen` - Truthfulness (generation variant)\n", - "\n", - "### Multiple Choice Tasks (May work without logprobs)\n", - "- `mmlu` - General knowledge (57 subjects)\n", - "- `arc_easy`, `arc_challenge` - Science questions\n", - "- `hellaswag` - Commonsense reasoning\n", - "- `winogrande` - Pronoun resolution\n", - "- `boolq` - Yes/no questions\n", - "- `piqa` - Physical commonsense\n", - "\n", - "### Loglikelihood Tasks (Require logprobs - local models only)\n", - "- `lambada_openai` - Word prediction\n", - "- Perplexity evaluation tasks\n", + "6. **Batch size**: Adjust `--batch_size` based on your API rate limits and model capacity\n", + "7. 
**API configuration**: Ensure your local model service is running and accessible at the `base_url` you specify\n", "\n", "## Resources\n", "\n", - "- **Documentation**: https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs\n", - "- **GitHub**: https://github.com/EleutherAI/lm-evaluation-harness\n", - "- **Open LLM Leaderboard**: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard" + "- **Complete Task Documentation**: See the main documentation for a comprehensive list of all evaluation tasks and their capabilities\n", + "- **lm-eval Documentation**: https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs\n", + "- **GitHub Repository**: https://github.com/EleutherAI/lm-evaluation-harness" ] } ], From 85f97a8577ad44efaf493ce4528463df01d18473 Mon Sep 17 00:00:00 2001 From: tfwang Date: Fri, 30 Jan 2026 19:04:13 +0800 Subject: [PATCH 4/5] update --- docs/en/solutions/How_to_Evaluate_LLM.md | 170 +++++++----- docs/public/lm-eval/lm-eval_quick_star.ipynb | 256 +++++-------------- 2 files changed, 168 insertions(+), 258 deletions(-) diff --git a/docs/en/solutions/How_to_Evaluate_LLM.md b/docs/en/solutions/How_to_Evaluate_LLM.md index 6d50dcf..ce5a976 100644 --- a/docs/en/solutions/How_to_Evaluate_LLM.md +++ b/docs/en/solutions/How_to_Evaluate_LLM.md @@ -41,12 +41,6 @@ The framework is used by Hugging Face's Open LLM Leaderboard, referenced in hund ### Installation -Install the base package: - -```bash -pip install lm-eval -``` - For API-based evaluation (recommended for production model services): ```bash @@ -67,24 +61,28 @@ This is the recommended approach for evaluating model services deployed with Ope **Example** (evaluate a local model service): +For local API servers, use the model name returned by the server (e.g. from `GET /v1/models`). Replace `MODEL_NAME` and `BASE_URL` with your actual values: + ```bash lm-eval --model local-chat-completions \ - --model_args model=Qwen/Qwen2.5-7B-Instruct,base_url=http://localhost:8000/v1 \ - --tasks gsm8k,arc_easy,hellaswag \ - --batch_size 8 \ + --model_args model=MODEL_NAME,base_url=BASE_URL \ + --apply_chat_template \ + --tasks gsm8k,minerva_math \ + --batch_size 1 \ --output_path ./results ``` **Key Parameters**: - `--model`: Use `local-chat-completions` for local API servers, `openai-chat-completions` for OpenAI - `--model_args`: - - `model`: Model name or identifier - - `base_url`: API endpoint (for local services only) + - `model`: Model name or identifier. For local APIs, use the name returned by the server (e.g. query `GET /v1/models` at your `base_url`). + - `base_url`: API base URL (for local services only). Use the full endpoint path: e.g. `http://localhost:8000/v1/chat/completions` for chat API, or `http://localhost:8000/v1/completions` for completions API. - `api_key`: API key if required (can also use environment variable) - - `tokenizer` (optional): Path to tokenizer for accurate token counting + - `tokenizer`: **Required for `local-completions`** (used for loglikelihood / multiple_choice tasks). Optional for `local-chat-completions` (only for token counting); **`--apply_chat_template`** does not require a tokenizer. - `tokenized_requests` (optional): Whether to use local tokenization (default: False) - `--tasks`: Comma-separated list of evaluation tasks -- `--batch_size`: Number of requests to process in parallel (adjust based on API rate limits) +- `--apply_chat_template`: Format prompts as chat messages (list[dict]). Use this flag when using `local-chat-completions`. 
+- `--batch_size`: Number of requests to process in parallel. **For API-based evaluation, prefer `1`** to avoid rate limits and timeouts; increase only if your API supports higher concurrency. - `--output_path`: Directory to save evaluation results **About Tokenization**: @@ -111,18 +109,21 @@ lm-eval supports two tokenization modes via the `tokenized_requests` parameter: - ❌ Not supported by most chat APIs (OpenAI ChatCompletions, etc.) - Use local models (HuggingFace, vLLM) for these tasks -**Optional tokenizer configuration** (for accurate token counting or local tokenization): +**Tokenizer**: +- **`local-completions`**: A tokenizer is **required** when running loglikelihood or multiple_choice tasks (e.g. MMLU, ARC, HellaSwag). Pass it in `model_args`, e.g. `tokenizer=MODEL_NAME` or a path to the tokenizer. +- **`local-chat-completions`**: Tokenizer is optional (e.g. for token counting). **If you see `LocalChatCompletion expects messages as list[dict]` or `AssertionError`**, pass the **`--apply_chat_template`** flag so that prompts are formatted as chat messages; a tokenizer is not required for this. ```bash lm-eval --model local-chat-completions \ - --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1,tokenizer=MODEL_NAME,tokenized_requests=False \ + --model_args model=MODEL_NAME,base_url=BASE_URL \ + --apply_chat_template \ --tasks gsm8k ``` Available tokenization parameters in `model_args`: -- `tokenizer`: Path or name of the tokenizer (e.g., HuggingFace model name) +- `tokenizer`: Path or name of the tokenizer. **Required for `local-completions`** (loglikelihood tasks). Optional for `local-chat-completions` (only for token counting). - `tokenizer_backend`: Tokenization system - `"huggingface"` (default), `"tiktoken"`, or `"none"` -- `tokenized_requests`: `True` (client-side) or `False` (server-side, default) +- `tokenized_requests`: `True` (client-side) or `False` (server-side, default). Keep `False` for chat API. ### Advanced Options @@ -132,8 +133,9 @@ Enable `--log_samples` to save individual model responses for detailed analysis: ```bash lm-eval --model local-chat-completions \ - --model_args model=Qwen/Qwen2.5-7B-Instruct,base_url=http://localhost:8000/v1 \ - --tasks gsm8k,hellaswag \ + --model_args model=MODEL_NAME,base_url=BASE_URL \ + --apply_chat_template \ + --tasks gsm8k,minerva_math \ --output_path ./results \ --log_samples ``` @@ -144,20 +146,21 @@ This creates a `results/` directory containing: #### Use Configuration File -For complex evaluations, use a YAML configuration file: +For complex evaluations, use a YAML configuration file. For benchmarks that rely heavily on `multiple_choice` scoring (MMLU, AGIEval, C-Eval, etc.), **use completion or local backends instead of chat backends**: ```yaml -model: local-chat-completions +model: local-completions model_args: - model: Qwen/Qwen2.5-7B-Instruct - base_url: http://localhost:8000/v1 + model: MODEL_NAME + base_url: BASE_URL # e.g. 
http://localhost:8000/v1/completions + tokenizer: TOKENIZER_PATH_OR_NAME # required for local-completions (loglikelihood tasks) tasks: - mmlu - gsm8k - arc_easy - arc_challenge - hellaswag -batch_size: 8 +batch_size: 1 output_path: ./results log_samples: true ``` @@ -173,8 +176,8 @@ lm-eval --config config.yaml Test your setup with a small number of examples before running full evaluations: ```bash -lm-eval --model local-chat-completions \ - --model_args model=Qwen/Qwen2.5-7B-Instruct,base_url=http://localhost:8000/v1 \ +lm-eval --model local-completions \ + --model_args model=MODEL_NAME,base_url=BASE_URL \ --tasks mmlu \ --limit 10 ``` @@ -184,15 +187,15 @@ lm-eval --model local-chat-completions \ Evaluate multiple model endpoints by running separate evaluations: ```bash -# Evaluate base model -lm-eval --model local-chat-completions \ - --model_args model=Qwen/Qwen2.5-7B,base_url=http://localhost:8000/v1 \ +# Evaluate base model (multiple_choice-heavy benchmarks; use completions/local backends) +lm-eval --model local-completions \ + --model_args model=MODEL_NAME,base_url=BASE_URL,tokenizer=TOKENIZER_PATH_OR_NAME \ --tasks gsm8k,mmlu \ --output_path ./results/base_model # Evaluate fine-tuned model -lm-eval --model local-chat-completions \ - --model_args model=Qwen/Qwen2.5-7B-finetuned,base_url=http://localhost:8001/v1 \ +lm-eval --model local-completions \ + --model_args model=MODEL_NAME_FINETUNED,base_url=BASE_URL_FINETUNED,tokenizer=TOKENIZER_PATH_OR_NAME \ --tasks gsm8k,mmlu \ --output_path ./results/finetuned_model ``` @@ -225,6 +228,21 @@ for task in base_results.keys(): print(f" Difference: {diff:+.4f}") ``` +#### Caching and Resume + +For long or multi-task evaluations, use `--use_cache ` so that lm-eval writes intermediate request results to a directory. If the run is interrupted (timeout, crash, or Ctrl+C), you can **resume** by running the same command again: lm-eval will load already-computed results from the cache and only run the remaining requests. + +```bash +lm-eval --model local-chat-completions \ + --model_args model=MODEL_NAME,base_url=BASE_URL \ + --tasks gsm8k \ + --batch_size 1 \ + --use_cache ./cache \ + --output_path ./results +``` + +To resume after an interruption, run the same command (same `--tasks`, `--model`, `--model_args`, and `--use_cache ./cache`). Do not change the cache path or task list between runs if you want to resume correctly. + #### API-Specific Considerations **Controlling Request Rate**: Adjust these parameters to match your API capacity: @@ -237,10 +255,10 @@ lm-eval --model local-chat-completions \ ``` **Available parameters in `model_args`**: -- `num_concurrent`: Number of concurrent requests. Typical values: 1 (sequential), 10, 50, or 128 depending on API capacity. +- `num_concurrent`: Number of concurrent requests. **For API-based evaluation, use `1`** (sequential) to avoid rate limits and timeouts; increase only if your API supports higher concurrency. - `max_retries`: Number of retries for failed requests. Common values: 3, 5, or more. - `timeout`: Request timeout in seconds. Adjust based on model size and API speed (e.g., 60, 300, or higher for large models). -- `batch_size`: Number of requests to batch together (set via `--batch_size` flag, not in `model_args`) +- `batch_size`: Number of requests to batch together (set via `--batch_size` flag, not in `model_args`). Use `1` for API evaluation to avoid rate limits and timeouts. 
**Authentication**: Set API keys via environment variables or model_args: @@ -266,12 +284,38 @@ While API-based evaluation is recommended for production services, lm-eval also lm-eval includes 60+ standard academic benchmarks. Below is a comprehensive overview of available datasets. +### Offline Dataset Preparation + +In restricted or offline environments, you can pre-download all required datasets and point lm-eval to the local cache to avoid repeated downloads or external network access. + +- **Step 1: Pre-download datasets on an online machine** + - Use `lm-eval` or a small Python script with `datasets`/`lm_eval` to download the benchmarks you need (MMLU, GSM8K, ARC, HellaSwag, etc.). + - Verify that the datasets are stored under a known cache directory (for example, a shared NAS path). +- **Step 2: Sync the cache to your offline environment** + - Copy the entire dataset cache directory (e.g., HuggingFace datasets cache and any lm-eval specific cache) to your offline machines. +- **Step 3: Configure environment variables on offline machines** + - Point lm-eval and the underlying HuggingFace libraries to the local cache and enable offline mode, for example: + +```bash +export HF_HOME=/path/to/offline_cache +export TRANSFORMERS_CACHE=/path/to/offline_cache +export HF_DATASETS_CACHE=/path/to/offline_cache + +export HF_DATASETS_OFFLINE=1 +export HF_HUB_OFFLINE=1 +export HF_EVALUATE_OFFLINE=1 +export TRANSFORMERS_OFFLINE=1 +``` + +- **Step 4: Run lm-eval normally** + - Once the cache and environment variables are configured, you can run `lm-eval` commands as usual. The framework will read datasets from the local cache without trying to access the internet. + ### Understanding Task Types Before reviewing the datasets, it's important to understand the different task types: - **`generate_until`**: Generate text until a stopping condition (e.g., newline, max tokens). Best for open-ended generation tasks. Works with both chat and completion APIs. -- **`multiple_choice`**: Select from multiple options. Can work with or without logprobs (more accurate with logprobs). Works with both chat and completion APIs. +- **`multiple_choice`**: Select from multiple options. Can work with or without logprobs (more accurate with logprobs). **If a multiple_choice benchmark uses `loglikelihood` (e.g., MMLU, AGIEval, C-Eval, and most exam-style datasets), you must run it with completion or local backends, not chat backends.** - **`loglikelihood`**: Calculate token-level log probabilities. Requires API to return logprobs. Only works with completion APIs or local models. - **`loglikelihood_rolling`**: Calculate perplexity over sequences. Requires logprobs. Only works with completion APIs or local models. 
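A note on the offline preparation described above: one pragmatic way to populate the cache on the internet-connected machine is to run each planned task once with `--limit 1`, which forces lm-eval to download and process the underlying datasets into the configured cache. A minimal warm-up sketch (cache path, endpoint, and model name are placeholders; any reachable OpenAI-compatible endpoint works for this step):

```bash
# Warm up the dataset cache on an online machine, then copy the cache directory to the offline environment
export HF_HOME=/path/to/offline_cache
export HF_DATASETS_CACHE=/path/to/offline_cache

lm-eval --model local-chat-completions \
  --model_args model=MODEL_NAME,base_url=BASE_URL \
  --apply_chat_template \
  --tasks gsm8k,minerva_math \
  --limit 1 \
  --batch_size 1 \
  --output_path /tmp/warmup_results
```

Repeat the warm-up with whichever tasks you plan to evaluate offline; a single `--limit 1` pass per task is normally enough, since the full dataset split is downloaded when the task is loaded.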
@@ -279,7 +323,7 @@ Before reviewing the datasets, it's important to understand the different task t | Category | Dataset | Task Name | Task Type | Output Metrics | API Interface | Tokenization | Description | |----------|---------|-----------|-----------|----------------|---------------|--------------|-------------| -| **General Knowledge** | MMLU | `mmlu_*` (57 subjects) | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | 57 subjects covering STEM, humanities, social sciences: abstract_algebra, anatomy, astronomy, business_ethics, clinical_knowledge, college_biology, college_chemistry, college_computer_science, college_mathematics, college_medicine, college_physics, computer_security, conceptual_physics, econometrics, electrical_engineering, elementary_mathematics, formal_logic, global_facts, high_school_biology, high_school_chemistry, high_school_computer_science, high_school_european_history, high_school_geography, high_school_government_and_politics, high_school_macroeconomics, high_school_mathematics, high_school_microeconomics, high_school_physics, high_school_psychology, high_school_statistics, high_school_us_history, high_school_world_history, human_aging, human_sexuality, international_law, jurisprudence, logical_fallacies, machine_learning, management, marketing, medical_genetics, miscellaneous, moral_disputes, moral_scenarios, nutrition, philosophy, prehistory, professional_accounting, professional_law, professional_medicine, professional_psychology, public_relations, security_studies, sociology, us_foreign_policy, virology, world_religions | +| **General Knowledge** | MMLU | `mmlu_*` (57 subjects) | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | 57 subjects covering STEM, humanities, social sciences, law, medicine, business, and more | | | MMLU-Pro | `mmlu_pro` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Enhanced MMLU with 10 options per question and higher difficulty | | | AGIEval | `agieval` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Academic exams including LSAT, SAT, GaoKao (Chinese & English) | | | C-Eval | `ceval` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Chinese comprehensive evaluation across 52 subjects | @@ -309,19 +353,19 @@ Before reviewing the datasets, it's important to understand the different task t | **Truthfulness & Safety** | TruthfulQA | `truthfulqa_mc1` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Single-correct answer truthfulness | | | | `truthfulqa_mc2` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Multiple-correct answer truthfulness | | | | `truthfulqa_gen` | `generate_until` | `bleu_max`, `rouge1_max`, `rougeL_max`, `bleurt_max` | Chat / Completion | Server-side | Generative truthfulness evaluation (also outputs _acc and _diff variants) | -| | BBQ | `bbq_*` (11 categories) | `multiple_choice` | `acc` | Chat / Completion | Server-side | Bias benchmark: age, disability, gender, nationality, physical_appearance, race_ethnicity, religion, ses (socio-economic status), sexual_orientation, race_x_gender (intersectional), race_x_ses (intersectional) | +| | BBQ | `bbq_*` (11 categories) | `multiple_choice` | `acc` | Chat / Completion | Server-side | Bias benchmark covering age, disability, gender, nationality, appearance, race, religion, SES, sexual orientation, and their intersections | | **Multilingual** | Belebele | `belebele_zho_Hans` | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side 
| Chinese (Simplified) reading comprehension | | | | `belebele_zho_Hant` | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | Chinese (Traditional) reading comprehension | | | | `belebele_eng_Latn` | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | English reading comprehension | -| | | `belebele_*` | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | 122 languages total (see full list with `lm-eval ls tasks`) | -| | XCOPA | `xcopa_*` (11 languages) | `multiple_choice` | `acc` | Chat / Completion | Server-side | Causal reasoning: et (Estonian), ht (Haitian), id (Indonesian), it (Italian), qu (Quechua), sw (Swahili), ta (Tamil), th (Thai), tr (Turkish), vi (Vietnamese), zh (Chinese) | -| | XWinograd | `xwinograd_*` (6 languages) | `multiple_choice` | `acc` | Chat / Completion | Server-side | Winograd schema: en (English), fr (French), jp (Japanese), pt (Portuguese), ru (Russian), zh (Chinese) | +| | | `belebele_*` | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | 122-language multilingual reading comprehension suite (see full list with `lm-eval ls tasks`) | +| | XCOPA | `xcopa_*` (11 languages) | `multiple_choice` | `acc` | Chat / Completion | Server-side | Multilingual causal reasoning in 11 languages | +| | XWinograd | `xwinograd_*` (6 languages) | `multiple_choice` | `acc` | Chat / Completion | Server-side | Multilingual Winograd schema benchmark in 6 languages | | **Factual Knowledge** | Natural Questions | `nq_open` | `generate_until` | `exact_match` | Chat / Completion | Server-side | Open-domain question answering | | | TriviaQA | `triviaqa` | `generate_until` | `exact_match` | Chat / Completion | Server-side | Trivia question answering | | | Web Questions | `webqs` | `multiple_choice` | `exact_match` | Chat / Completion | Server-side | Question answering from web search queries | | **Summarization** | CNN/DailyMail | `cnn_dailymail` | `generate_until` | `rouge1`, `rouge2`, `rougeL` | Chat / Completion | Server-side | News article summarization | | **Translation** | WMT | `wmt14`, `wmt16`, `wmt20` | `generate_until` | `bleu`, `chrf` | Chat / Completion | Server-side | Machine translation benchmarks (multiple language pairs) | -| **BIG-Bench** | BIG-Bench Hard (BBH) | `bbh_cot_fewshot` (23 tasks) | `generate_until` | `acc`, `exact_match` | Chat / Completion | Server-side | 23 challenging tasks: boolean_expressions, causal_judgement, date_understanding, disambiguation_qa, dyck_languages, formal_fallacies, geometric_shapes, hyperbaton, logical_deduction (3/5/7 objects), movie_recommendation, multistep_arithmetic_two, navigate, object_counting, penguins_in_a_table, reasoning_about_colored_objects, ruin_names, salient_translation_error_detection, snarks, sports_understanding, temporal_sequences, tracking_shuffled_objects (3/5/7 objects), web_of_lies, word_sorting | +| **BIG-Bench** | BIG-Bench Hard (BBH) | `bbh_cot_fewshot` (23 tasks) | `generate_until` | `acc`, `exact_match` | Chat / Completion | Server-side | 23 challenging reasoning and comprehension tasks (boolean logic, causal reasoning, arithmetic, navigation, etc.) 
| | **Domain-Specific** | MedQA | `medqa` | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | Medical question answering from USMLE exams | | | MedMCQA | `medmcqa` | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | Medical multiple choice questions from Indian medical exams | | | PubMedQA | `pubmedqa` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Biomedical question answering from PubMed abstracts | @@ -329,8 +373,8 @@ Before reviewing the datasets, it's important to understand the different task t **Legend**: - **Output Metrics**: These are the actual metric keys that appear in the output JSON (e.g., `acc`, `exact_match`, `pass@1`) - **API Interface**: - - `Chat / Completion`: Works with both OpenAI-compatible chat and completion APIs - - `❌ Requires logprobs`: Only works with APIs that return token-level log probabilities, or local models + - `Chat / Completion`: Conceptually works with both OpenAI-compatible chat and completion APIs. **Any benchmark that uses `loglikelihood` (including most exam-style `multiple_choice` tasks such as MMLU, AGIEval, C-Eval, etc.) should be treated as “completion/local only”.** + - `❌ Requires logprobs`: Only works with APIs that return token-level log probabilities, or local models. - **Tokenization**: - `Server-side`: Uses `tokenized_requests=False` (default). Text is sent to API server, which handles tokenization. Works for `generate_until` and `multiple_choice` tasks. - `Client-side`: Uses `tokenized_requests=True`. lm-eval tokenizes locally and sends token IDs. Required for `loglikelihood` tasks. Improves accuracy for `multiple_choice` tasks but requires logprobs support from API. @@ -346,7 +390,7 @@ Before reviewing the datasets, it's important to understand the different task t ```bash lm-eval --model local-chat-completions \ - --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1 \ + --model_args model=MODEL_NAME,base_url=BASE_URL \ --tasks gsm8k \ --output_path ./results ``` @@ -355,16 +399,16 @@ lm-eval --model local-chat-completions \ ```bash lm-eval --model local-chat-completions \ - --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1 \ - --tasks mmlu,gsm8k,arc_easy,arc_challenge,hellaswag \ + --model_args model=MODEL_NAME,base_url=BASE_URL \ + --tasks gsm8k,minerva_math,drop,squad_v2 \ --output_path ./results ``` **Task group evaluation** (all MMLU subjects): ```bash -lm-eval --model local-chat-completions \ - --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1 \ +lm-eval --model local-completions \ + --model_args model=MODEL_NAME,base_url=BASE_URL \ --tasks mmlu \ --output_path ./results ``` @@ -372,8 +416,8 @@ lm-eval --model local-chat-completions \ **Wildcard pattern** (specific MMLU subjects): ```bash -lm-eval --model local-chat-completions \ - --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1 \ +lm-eval --model local-completions \ + --model_args model=MODEL_NAME,base_url=BASE_URL \ --tasks "mmlu_mathematics,mmlu_physics,mmlu_chemistry" \ --output_path ./results ``` @@ -381,8 +425,8 @@ lm-eval --model local-chat-completions \ **Multilingual evaluation** (Chinese Belebele): ```bash -lm-eval --model local-chat-completions \ - --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1 \ +lm-eval --model local-completions \ + --model_args model=MODEL_NAME,base_url=BASE_URL \ --tasks belebele_zho_Hans \ --output_path ./results ``` @@ -392,8 +436,8 @@ lm-eval --model local-chat-completions \ **General LLM Benchmark Suite** 
(recommended for API evaluation): ```bash -lm-eval --model local-chat-completions \ - --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1 \ +lm-eval --model local-completions \ + --model_args model=MODEL_NAME,base_url=BASE_URL \ --tasks mmlu,gsm8k,arc_challenge,hellaswag,winogrande,truthfulqa_mc2 \ --output_path ./results ``` @@ -401,9 +445,9 @@ lm-eval --model local-chat-completions \ **Math & Reasoning Suite**: ```bash -lm-eval --model local-chat-completions \ - --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1 \ - --tasks gsm8k,math,arc_challenge \ +lm-eval --model local-completions \ + --model_args model=MODEL_NAME,base_url=BASE_URL \ + --tasks gsm8k,minerva_math,arc_challenge \ --output_path ./results ``` @@ -411,7 +455,7 @@ lm-eval --model local-chat-completions \ ```bash lm-eval --model local-chat-completions \ - --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1 \ + --model_args model=MODEL_NAME,base_url=BASE_URL \ --tasks humaneval,mbpp \ --output_path ./results ``` @@ -419,8 +463,8 @@ lm-eval --model local-chat-completions \ **Open LLM Leaderboard Suite**: ```bash -lm-eval --model local-chat-completions \ - --model_args model=MODEL_NAME,base_url=http://localhost:8000/v1 \ +lm-eval --model local-completions \ + --model_args model=MODEL_NAME,base_url=BASE_URL \ --tasks leaderboard \ --output_path ./results ``` @@ -493,8 +537,8 @@ After running an evaluation, results are saved in JSON format. Here's what the k }, "config": { "model": "local-chat-completions", - "model_args": "model=Qwen/Qwen2.5-7B-Instruct,base_url=http://localhost:8000/v1", - "batch_size": 8 + "model_args": "model=MODEL_NAME,base_url=BASE_URL", + "batch_size": 1 } } ``` @@ -514,7 +558,7 @@ After running an evaluation, results are saved in JSON format. Here's what the k | `loglikelihood_rolling` | Yes | Completion only | Client-side | For perplexity evaluation. Not supported by most chat APIs. | **Important Notes**: -- **For API-based evaluation**: Focus on `generate_until` tasks (GSM8K, ARC, MMLU, etc.). These don't require local tokenization or logprobs. +- **For API-based evaluation**: For **chat API**, use `generate_until` tasks (GSM8K, minerva_math, DROP, etc.). For **completion API** with tokenizer, you can run loglikelihood tasks (MMLU, ARC, HellaSwag, etc.). - **Tokenization**: With chat APIs, tokenization is handled server-side automatically. You only need to specify a local tokenizer if you want accurate token counting for cost estimation. - **Logprobs limitation**: OpenAI ChatCompletions and most chat APIs don't provide token-level logprobs, making `loglikelihood` tasks unavailable. Use local models (HuggingFace, vLLM) if you need these task types. @@ -533,7 +577,7 @@ Download the Jupyter notebook example: [lm-eval Quick Start Notebook](/lm-eval/l ### Tips & Best Practices 1. **Start Small**: Use `--limit 10` to test your setup before running full evaluations -2. **Use Auto Batch Size**: Set `--batch_size auto` for optimal GPU utilization +2. **Batch size**: For API-based evaluation use `--batch_size 1` to avoid rate limits and timeouts. For local GPU (e.g. vLLM, HuggingFace), `--batch_size auto` can improve utilization. 3. **Save Results**: Always use `--output_path` and `--log_samples` for reproducibility 4. **Cache Results**: Use `--use_cache ` to resume interrupted evaluations 5. **Check Task Compatibility**: Verify your model supports the required output format (logprobs, generation, etc.) 
diff --git a/docs/public/lm-eval/lm-eval_quick_star.ipynb b/docs/public/lm-eval/lm-eval_quick_star.ipynb index 372ac54..2e6cf72 100644 --- a/docs/public/lm-eval/lm-eval_quick_star.ipynb +++ b/docs/public/lm-eval/lm-eval_quick_star.ipynb @@ -71,83 +71,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## 2. Quick Test with Limited Examples\n", + "## 2. Set MODEL_NAME and BASE_URL\n", "\n", - "Before running a full evaluation, it's good practice to test with a small number of examples:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Test with 5 examples from hellaswag\n", - "# Replace the base_url and model name with your local API endpoint\n", - "!lm-eval --model local-chat-completions \\\n", - " --model_args model=Qwen/Qwen2.5-0.5B-Instruct,base_url=http://localhost:8000/v1 \\\n", - " --tasks hellaswag \\\n", - " --limit 5" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 3. Evaluate on Multiple Tasks\n", + "Before running evaluations, set these to your API server:\n", "\n", - "Run evaluation on multiple tasks suitable for API models (generation-based tasks):" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Evaluate on GSM8K (math reasoning)\n", - "!lm-eval --model local-chat-completions \\\n", - " --model_args model=Qwen/Qwen2.5-7B-Instruct,base_url=http://localhost:8000/v1 \\\n", - " --tasks gsm8k \\\n", - " --batch_size 8 \\\n", - " --output_path ./results \\\n", - " --log_samples" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 4. Evaluate with Configuration File\n", - "\n", - "For more complex evaluations, use a YAML configuration file:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Create a configuration file\n", - "config = \"\"\"\n", - "model: local-chat-completions\n", - "model_args:\n", - " model: Qwen/Qwen2.5-7B-Instruct\n", - " base_url: http://localhost:8000/v1\n", - "tasks:\n", - " - gsm8k\n", - " - arc_easy\n", - " - hellaswag\n", - "batch_size: 8\n", - "output_path: ./results\n", - "log_samples: true\n", - "\"\"\"\n", + "- **BASE_URL**: The full endpoint URL of your API.\n", + " - For **chat API**: `http://:/v1/chat/completions` (e.g. `http://localhost:8000/v1/chat/completions`).\n", + " - For **completions API**: `http://:/v1/completions` (e.g. `http://localhost:8000/v1/completions`).\n", + "- **MODEL_NAME**: The model ID exposed by your server. Query the server (e.g. `GET http://localhost:8000/v1/models`) and use the `id` of the model you want to evaluate.\n", "\n", - "with open('eval_config.yaml', 'w') as f:\n", - " f.write(config)\n", - "\n", - "print(\"Configuration file created!\")" + "Run the cell below to list models (optional). If your API is not on `localhost:8000`, set `BASE_URL_FOR_MODELS` in that cell and `BASE_URL` in the evaluation cells below to your server URL. Then set `MODEL_NAME` and `BASE_URL` in the evaluation cells." 
] }, { @@ -156,29 +89,26 @@ "metadata": {}, "outputs": [], "source": [ - "# Run evaluation with config file\n", - "!lm-eval --config eval_config.yaml --limit 10" + "# Optional: list models from your API (adjust the base URL if your server is not on localhost:8000)\n", + "import urllib.request\n", + "import json\n", + "BASE_URL_FOR_MODELS = \"http://localhost:8000/v1\" # without /chat/completions or /completions\n", + "try:\n", + " with urllib.request.urlopen(f\"{BASE_URL_FOR_MODELS}/models\") as resp:\n", + " data = json.load(resp)\n", + " for m in data.get(\"data\", []):\n", + " print(m.get(\"id\", m))\n", + "except Exception as e:\n", + " print(\"Could not list models (is your API server running?):\", e)" ] }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## 5. Comprehensive Evaluation Suite\n", + "### Run Evaluation (chat API)\n", "\n", - "Run a comprehensive evaluation on multiple benchmarks:" + "Set `MODEL_NAME` and `BASE_URL` below. Use the **`--apply_chat_template`** flag so prompts are sent as chat messages and avoid \"messages as list[dict]\" / AssertionError. Then run the cell." ] }, { @@ -187,12 +117,18 @@ "metadata": {}, "outputs": [], "source": [ - "# Comprehensive evaluation (generation-based tasks for API models)\n", - "!lm-eval --model local-chat-completions \\\n", - " --model_args model=Qwen/Qwen2.5-7B-Instruct,base_url=http://localhost:8000/v1 \\\n", - " --tasks gsm8k,arc_easy,arc_challenge,boolq,piqa \\\n", - " --batch_size 8 \\\n", - " --output_path ./comprehensive_results \\\n", + "%%bash\n", + "# Set these to your API server.\n", + "export MODEL_NAME=\"your-model-id\"\n", + "export BASE_URL=\"http://localhost:8000/v1/chat/completions\"\n", + "\n", + "lm-eval --model local-chat-completions \\\n", + " --model_args model=$MODEL_NAME,base_url=$BASE_URL \\\n", + " --apply_chat_template \\\n", + " --tasks gsm8k \\\n", + " --batch_size 1 \\\n", + " --limit 10 \\\n", + " --output_path ./results \\\n", " --log_samples" ] }, @@ -200,9 +136,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "## 6. View Results\n", + "## 3. View Results\n", "\n", - "After evaluation completes, you can view the results:" + "After evaluation completes, view the results:" ] }, { @@ -212,74 +148,24 @@ "outputs": [], "source": [ "import json\n", + "import os\n", "\n", - "# Load and display results\n", - "with open('./results/results.json', 'r') as f:\n", - " results = json.load(f)\n", - "\n", - "# Display the results\n", - "print(\"=== Evaluation Results ===\")\n", - "print(json.dumps(results, indent=2))\n", - "\n", - "# Explain common metrics\n", - "print(\"\\n=== Common Output Metrics ===\")\n", - "print(\"- acc: Accuracy (proportion of correct answers)\")\n", - "print(\"- acc_norm: Normalized accuracy (using length-normalized probabilities)\")\n", - "print(\"- exact_match: Exact string match between prediction and reference\")\n", - "print(\"- pass@1, pass@10: Percentage of problems solved (for code generation)\")\n", - "print(\"- f1: F1 score (harmonic mean of precision and recall)\")\n", - "print(\"- bleu, rouge: Text similarity metrics for generation tasks\")" + "if not os.path.exists('./results/results.json'):\n", + " print(\"No results yet. 
Run 'Run Evaluation (chat API)' above first.\")\n", + "else:\n", + " with open('./results/results.json', 'r') as f:\n", + " results = json.load(f)\n", + " print(\"=== Evaluation Results ===\")\n", + " print(json.dumps(results, indent=2))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## 7. Advanced: Task-Specific Examples\n", + "## 4. Example: local-completions with tokenizer (ARC)\n", "\n", - "### Mathematics Evaluation (GSM8K with Chain-of-Thought)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# GSM8K with chain-of-thought reasoning\n", - "!lm-eval --model local-chat-completions \\\n", - " --model_args model=Qwen/Qwen2.5-7B-Instruct,base_url=http://localhost:8000/v1 \\\n", - " --tasks gsm8k_cot \\\n", - " --batch_size 8 \\\n", - " --output_path ./results/gsm8k_cot" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Multilingual Evaluation" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Evaluate on Chinese Belebele\n", - "!lm-eval --model local-chat-completions \\\n", - " --model_args model=Qwen/Qwen2.5-7B-Instruct,base_url=http://localhost:8000/v1 \\\n", - " --tasks belebele_zho_Hans \\\n", - " --batch_size 8 \\\n", - " --output_path ./results/belebele_chinese" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Multiple MMLU Subjects" + "Tasks like ARC and MMLU use **loglikelihood** and require the **completions API** and a **tokenizer**. This demo uses **arc_easy** (small, single task) with `--limit 5` so it won't overload smaller models; for full MMLU use the same command with `--tasks mmlu` (see full doc). Set `MODEL_NAME`, `BASE_URL` (`/v1/completions`), and `TOKENIZER_PATH`, then run." ] }, { @@ -288,57 +174,37 @@ "metadata": {}, "outputs": [], "source": [ - "# Evaluate on specific MMLU subjects\n", - "!lm-eval --model local-chat-completions \\\n", - " --model_args model=Qwen/Qwen2.5-7B-Instruct,base_url=http://localhost:8000/v1 \\\n", - " --tasks mmlu_abstract_algebra,mmlu_anatomy,mmlu_astronomy \\\n", - " --batch_size 8 \\\n", - " --output_path ./results/mmlu_subset" + "%%bash\n", + "# Completions API: use /v1/completions. Tokenizer: required for loglikelihood; replace with your model's tokenizer (HuggingFace name or path).\n", + "export MODEL_NAME=\"your-model-id\"\n", + "export BASE_URL=\"http://localhost:8000/v1/completions\"\n", + "export TOKENIZER_PATH=\"Qwen/Qwen2.5-7B\" # replace according to your actual model\n", + "\n", + "lm-eval --model local-completions \\\n", + " --model_args model=$MODEL_NAME,base_url=$BASE_URL,tokenizer=$TOKENIZER_PATH \\\n", + " --tasks arc_easy \\\n", + " --batch_size 1 \\\n", + " --limit 5 \\\n", + " --output_path ./results_arc" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## 8. Caching and Resume\n", + "## 5. Learn more\n", "\n", - "Use caching to resume interrupted evaluations:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# Run with caching enabled\n", - "!lm-eval --model local-chat-completions \\\n", - " --model_args model=Qwen/Qwen2.5-7B-Instruct,base_url=http://localhost:8000/v1 \\\n", - " --tasks gsm8k \\\n", - " --batch_size 8 \\\n", - " --use_cache ./cache \\\n", - " --output_path ./results" + "For config file, multiple tasks, MMLU/completion API, and caching, see the documentation." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Tips and Best Practices\n", - "\n", - "1. **Always test first**: Use `--limit 5` or `--limit 10` to verify your setup before running full evaluations\n", - "2. **Save results**: Use `--output_path` and `--log_samples` for reproducibility\n", - "3. **Choose appropriate tasks**: Refer to the complete task list in the documentation for detailed task information\n", - "4. **Monitor resources**: Large evaluations can take time; monitor with `htop` or `nvidia-smi`\n", - "5. **Use caching**: Enable `--use_cache` for long evaluations that might be interrupted\n", - "6. **Batch size**: Adjust `--batch_size` based on your API rate limits and model capacity\n", - "7. **API configuration**: Ensure your local model service is running and accessible at the `base_url` you specify\n", - "\n", - "## Resources\n", + "## Tips\n", "\n", - "- **Complete Task Documentation**: See the main documentation for a comprehensive list of all evaluation tasks and their capabilities\n", - "- **lm-eval Documentation**: https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs\n", - "- **GitHub Repository**: https://github.com/EleutherAI/lm-evaluation-harness" + "- Use `--batch_size 1` for API evaluation. Use model name from `GET /v1/models` and full `base_url` (e.g. `http://localhost:8000/v1/chat/completions`).\n", + "- For more tasks, config file, MMLU/completion API, and caching, see the full documentation (*How to Evaluate LLM*) and [lm-eval docs](https://github.com/EleutherAI/lm-evaluation-harness/tree/main/docs)." ] } ], From 8b85b62744b5a00041fffa67d36bc87691aea26b Mon Sep 17 00:00:00 2001 From: tfwang Date: Fri, 30 Jan 2026 19:14:03 +0800 Subject: [PATCH 5/5] update --- docs/en/solutions/How_to_Evaluate_LLM.md | 4 ++-- .../{lm-eval_quick_star.ipynb => lm-eval_quick_start.ipynb} | 0 2 files changed, 2 insertions(+), 2 deletions(-) rename docs/public/lm-eval/{lm-eval_quick_star.ipynb => lm-eval_quick_start.ipynb} (100%) diff --git a/docs/en/solutions/How_to_Evaluate_LLM.md b/docs/en/solutions/How_to_Evaluate_LLM.md index ce5a976..3ba4980 100644 --- a/docs/en/solutions/How_to_Evaluate_LLM.md +++ b/docs/en/solutions/How_to_Evaluate_LLM.md @@ -352,7 +352,7 @@ Before reviewing the datasets, it's important to understand the different task t | | StoryCloze | `storycloze_2016` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Story completion task | | **Truthfulness & Safety** | TruthfulQA | `truthfulqa_mc1` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Single-correct answer truthfulness | | | | `truthfulqa_mc2` | `multiple_choice` | `acc` | Chat / Completion | Server-side | Multiple-correct answer truthfulness | -| | | `truthfulqa_gen` | `generate_until` | `bleu_max`, `rouge1_max`, `rougeL_max`, `bleurt_max` | Chat / Completion | Server-side | Generative truthfulness evaluation (also outputs _acc and _diff variants) | +| | | `truthfulqa_gen` | `generate_until` | `bleu_max`, `rouge1_max`, `rougeL_max`, `bleurt_max` | Chat / Completion | Server-side | Generative truthfulness evaluation; also reports acc and diff metric variants | | | BBQ | `bbq_*` (11 categories) | `multiple_choice` | `acc` | Chat / Completion | Server-side | Bias benchmark covering age, disability, gender, nationality, appearance, race, religion, SES, sexual orientation, and their intersections | | **Multilingual** | Belebele | `belebele_zho_Hans` | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | Chinese (Simplified) reading 
comprehension | | | | `belebele_zho_Hant` | `multiple_choice` | `acc`, `acc_norm` | Chat / Completion | Server-side | Chinese (Traditional) reading comprehension | @@ -572,7 +572,7 @@ After running an evaluation, results are saved in JSON format. Here's what the k ### Example Notebook -Download the Jupyter notebook example: [lm-eval Quick Start Notebook](/lm-eval/lm-eval_quick_star.ipynb) +Download the Jupyter notebook example: [lm-eval Quick Start Notebook](/lm-eval/lm-eval_quick_start.ipynb) ### Tips & Best Practices diff --git a/docs/public/lm-eval/lm-eval_quick_star.ipynb b/docs/public/lm-eval/lm-eval_quick_start.ipynb similarity index 100% rename from docs/public/lm-eval/lm-eval_quick_star.ipynb rename to docs/public/lm-eval/lm-eval_quick_start.ipynb