# dengfeng-ai/tangshi-lora
## Overview
tangshi-lora (GitHub) is a project for fine-tuning the Qwen2.5-1.5B-Instruct large language model on Tang dynasty poetry using QLoRA (quantized low-rank adaptation). It is a modern, parameter-efficient follow-up to tangshi-gpt, leveraging PEFT (parameter-efficient fine-tuning) methods to enable high-quality, domain-specific text generation with minimal compute resources.
The repository provides a full pipeline: from downloading and preprocessing a large-scale Tang poetry corpus, to instruction-tuning dataset creation, QLoRA-based training, and both quantitative (BLEU, perplexity) and qualitative evaluation. A pre-trained LoRA adapter is included for immediate use, and the codebase is designed for easy reproducibility and extension.
## Architecture
The project is organized into modular components:
- **Data Preparation:** Scripts in `data/` download the chinese-poetry corpus via sparse HTTP, then preprocess it into an Alpaca-style instruction-tuning dataset. Only regular-form poems (绝句/律诗, quatrains and regulated verse) by the top 50 poets are included, with traditional Chinese converted to simplified using OpenCC.
- **Training:** `train/train.py` loads the Qwen model with 4-bit quantization (via bitsandbytes), applies LoRA adapters to key attention modules, and fine-tunes using Hugging Face's `SFTTrainer` from the `trl` library. Training configuration is managed via `train/config.yaml`.
- **Evaluation & Inference:** Scripts in `eval/` provide quantitative evaluation (BLEU-4, perplexity), qualitative side-by-side comparison, and single-shot generation with the fine-tuned model.
- **Outputs:** The `outputs/checkpoint/` directory contains a ready-to-use LoRA adapter, tokenizer config, and chat template.
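The preprocessing step can be sketched roughly as follows. The field names (`author`, `paragraphs`) follow the chinese-poetry JSON layout; the exact instruction template used by `data/preprocess.py` is an assumption here, modeled on the example shown under Key Features:

```python
# Sketch: turn one chinese-poetry record into an Alpaca-style training sample.
# The instruction wording is an assumption, not necessarily what
# data/preprocess.py emits verbatim.

def to_alpaca_sample(poem: dict) -> dict:
    """Convert a chinese-poetry record into an instruction-tuning sample."""
    # "Imitate {author}'s style and write a poem"
    instruction = f"模仿{poem['author']}的风格,写一首诗"
    return {
        "instruction": instruction,
        "input": "",
        "output": "\n".join(poem["paragraphs"]),
    }

sample = to_alpaca_sample({
    "author": "李白",
    "title": "静夜思",
    "paragraphs": ["床前明月光,疑是地上霜。", "举头望明月,低头思故乡。"],
})
print(sample["instruction"])  # 模仿李白的风格,写一首诗
```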
## Key Features
- **Efficient Fine-Tuning:** Uses QLoRA (rank 16, alpha 32, 4-bit NF4 quantization) to adapt a 1.5B-parameter model with only ~0.28% of parameters trainable, making training feasible on a single T4 GPU (EC2 g4dn.xlarge).
- **Domain-Specific Instruction Tuning:** Converts ~49K Tang poems into ~13K style-imitation instructions in Alpaca format, e.g., `模仿李白的风格,写一首诗` ("Imitate Li Bai's style and write a poem").
- **Modern Evaluation:** Provides both character-level and word-level BLEU-4 (with jieba segmentation), as well as perplexity, for rigorous quantitative assessment. Qualitative scripts allow side-by-side comparison of base vs. fine-tuned generations.
- **Reproducibility:** All steps are scriptable, with clear configuration and deterministic data splits to prevent train/test leakage.
- **Plug-and-Play Adapter:** Pre-trained LoRA weights are included, so users can skip training and immediately run evaluation or inference.
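As a back-of-the-envelope check on why the trainable fraction is so small: a rank-r LoRA adapter on a d_in×d_out weight matrix adds only r·(d_in + d_out) parameters (the two low-rank factors). The dimensions below are illustrative, not Qwen's actual shapes, and the whole-model fraction (~0.28%) is smaller still because most weights receive no adapter:

```python
# A rank-r adapter factors the update as B @ A, with A of shape (r, d_in)
# and B of shape (d_out, r), so it adds r * (d_in + d_out) trainable
# parameters versus d_in * d_out for the frozen base weight.

def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

d, r = 1536, 16                  # hypothetical hidden size; rank 16 as in the repo config
adapter = lora_params(d, d, r)   # 16 * 3072 = 49,152 trainable parameters
full = d * d                     # 2,359,296 frozen parameters
print(f"trainable fraction per adapted matrix: {adapter / full:.2%}")
```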
## Project Structure
| Path | Purpose |
|---|---|
| `data/download.py` | Downloads the Tang poetry corpus from GitHub via HTTP |
| `data/preprocess.py` | Preprocesses and formats the dataset for instruction tuning |
| `train/train.py` | Runs QLoRA training using `SFTTrainer` |
| `eval/eval.py` | Computes BLEU and perplexity metrics |
| `eval/qualitative.py` | Prints side-by-side base vs. fine-tuned generations |
| `eval/generate.py` | Single-shot inference with the fine-tuned model |
| `outputs/checkpoint/` | Pre-trained LoRA adapter and configs |
## How It Works
- **Data Download:** Run `data/download.py` to fetch the raw JSON corpus from chinese-poetry.
- **Preprocessing:** Use `data/preprocess.py` to filter, convert, and format the poems into instruction-tuning samples, with a deterministic train/test split.
- **Training:** Launch training via `train/train.py` and `train/config.yaml`, which apply QLoRA to the Qwen model and fine-tune it on the dataset.
- **Evaluation/Inference:** Evaluate the model quantitatively (`eval/eval.py`) or qualitatively (`eval/qualitative.py`), or generate new poems (`eval/generate.py`) using the provided adapter.
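For intuition about the character-level metric, here is a minimal pure-Python BLEU-4 sketch (single reference, no smoothing) — not the repo's actual `eval/eval.py`, which may rely on standard tooling:

```python
import math
from collections import Counter

def char_bleu4(candidate: str, reference: str) -> float:
    """Character-level BLEU-4: geometric mean of modified n-gram
    precisions (n = 1..4) times a brevity penalty."""
    def ngrams(chars, n):
        return Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))

    cand, ref = list(candidate), list(reference)
    precisions = []
    for n in range(1, 5):
        c_counts, r_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        precisions.append(overlap / max(1, sum(c_counts.values())))
    if min(precisions) == 0:
        return 0.0  # any empty precision zeroes the geometric mean
    log_avg = sum(math.log(p) for p in precisions) / 4
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_avg)
```

The word-level variant would first segment both strings with jieba and apply the same computation over tokens instead of characters.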
## Results
Fine-tuning yields dramatic improvements: BLEU-4 scores increase by orders of magnitude, and perplexity drops from 54.3 to 19.8 on the test set. The fine-tuned model generates poetry much closer in style and content to reference Tang poems, as shown in both quantitative metrics and qualitative samples.
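Since perplexity is the exponential of the mean per-token cross-entropy loss, the reported drop can be read directly as a loss improvement. A quick check:

```python
import math

# Perplexity = exp(mean per-token cross-entropy), so the reported
# 54.3 -> 19.8 drop corresponds to about one nat less loss per token.
loss_base = math.log(54.3)  # ~3.99 nats/token (base model)
loss_ft = math.log(19.8)    # ~2.99 nats/token (fine-tuned)
print(f"loss reduction: {loss_base - loss_ft:.2f} nats/token")
```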