# dengfeng-ai/tangshi-lora
## Overview
tangshi-lora (GitHub) is a project for fine-tuning the Qwen2.5-1.5B-Instruct large language model on Tang dynasty poetry using QLoRA (quantized low-rank adaptation). It is a modern, parameter-efficient follow-up to tangshi-gpt, leveraging PEFT (parameter-efficient fine-tuning) methods to enable high-quality, domain-specific text generation with minimal compute resources.
The repository provides a full pipeline: from downloading and preprocessing a large-scale Tang poetry corpus, to instruction-tuning dataset creation, QLoRA-based training, and both quantitative (BLEU, perplexity) and qualitative evaluation. A pre-trained LoRA adapter is included for immediate use, and the codebase is designed for easy reproducibility and extension.
## Architecture
The project is organized into modular components:
- **Data Preparation:** Scripts in `data/` download the chinese-poetry corpus via sparse HTTP, then preprocess it into an Alpaca-style instruction-tuning dataset. Only regular-form poems (绝句/律诗, quatrains and regulated verse) by the top 50 poets are included, with traditional Chinese converted to simplified using OpenCC.
- **Training:** `train/train.py` loads the Qwen model with 4-bit quantization (via bitsandbytes), applies LoRA adapters to key attention modules, and fine-tunes using Hugging Face's `SFTTrainer` from the `trl` library. Training configuration is managed via `train/config.yaml`.
- **Evaluation & Inference:** Scripts in `eval/` provide quantitative evaluation (BLEU-4, perplexity), qualitative side-by-side comparison, and single-shot generation with the fine-tuned model.
- **Outputs:** The `outputs/checkpoint/` directory contains a ready-to-use LoRA adapter, tokenizer config, and chat template.
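The preprocessing step can be sketched roughly as follows. The field names (`author`, `paragraphs`) follow the chinese-poetry JSON layout; the exact instruction template used by `data/preprocess.py` is an assumption here, modeled on the example shown under Key Features:

```python
# Sketch: turn one chinese-poetry record into an Alpaca-style training sample.
# The instruction wording is an assumption, not necessarily what
# data/preprocess.py emits verbatim.

def to_alpaca_sample(poem: dict) -> dict:
    """Convert a chinese-poetry record into an instruction-tuning sample."""
    # "Imitate {author}'s style and write a poem"
    instruction = f"模仿{poem['author']}的风格,写一首诗"
    return {
        "instruction": instruction,
        "input": "",
        "output": "\n".join(poem["paragraphs"]),
    }

sample = to_alpaca_sample({
    "author": "李白",
    "title": "静夜思",
    "paragraphs": ["床前明月光,疑是地上霜。", "举头望明月,低头思故乡。"],
})
print(sample["instruction"])  # 模仿李白的风格,写一首诗
```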
## Key Features
- **Efficient Fine-Tuning:** Uses QLoRA (rank 16, alpha 32, 4-bit NF4 quantization) to adapt a 1.5B-parameter model with only ~0.28% of parameters trainable, making training feasible on a single T4 GPU (EC2 g4dn.xlarge).
- **Domain-Specific Instruction Tuning:** Converts ~49K Tang poems into ~13K style-imitation instructions in Alpaca format, e.g., `模仿李白的风格,写一首诗` ("Imitate Li Bai's style and write a poem").
- **Modern Evaluation:** Provides both character-level and word-level BLEU-4 (with jieba segmentation), as well as perplexity, for rigorous quantitative assessment. Qualitative scripts allow side-by-side comparison of base vs. fine-tuned generations.
- **Reproducibility:** All steps are scriptable, with clear configuration and deterministic data splits to prevent train/test leakage.
- **Plug-and-Play Adapter:** Pre-trained LoRA weights are included, so users can skip training and immediately run evaluation or inference.
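As a back-of-the-envelope check on why the trainable fraction is so small: a rank-r LoRA adapter on a d_in×d_out weight matrix adds only r·(d_in + d_out) parameters (the two low-rank factors). The dimensions below are illustrative, not Qwen's actual shapes, and the whole-model fraction (~0.28%) is smaller still because most weights receive no adapter:

```python
# A rank-r adapter factors the update as B @ A, with A of shape (r, d_in)
# and B of shape (d_out, r), so it adds r * (d_in + d_out) trainable
# parameters versus d_in * d_out for the frozen base weight.

def lora_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)

d, r = 1536, 16                  # hypothetical hidden size; rank 16 as in the repo config
adapter = lora_params(d, d, r)   # 16 * 3072 = 49,152 trainable parameters
full = d * d                     # 2,359,296 frozen parameters
print(f"trainable fraction per adapted matrix: {adapter / full:.2%}")
```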
## Project Structure
| Path | Purpose |
|---|---|
| `data/download.py` | Downloads the Tang poetry corpus from GitHub via HTTP |
| `data/preprocess.py` | Preprocesses and formats the dataset for instruction tuning |
| `train/train.py` | Runs QLoRA training using `SFTTrainer` |
| `eval/eval.py` | Computes BLEU and perplexity metrics |
| `eval/qualitative.py` | Prints side-by-side base vs. fine-tuned generations |
| `eval/generate.py` | Single-shot inference with the fine-tuned model |
| `outputs/checkpoint/` | Pre-trained LoRA adapter and configs |
## How It Works
- **Data Download:** Run `data/download.py` to fetch the raw JSON corpus from chinese-poetry.
- **Preprocessing:** Use `data/preprocess.py` to filter, convert, and format the poems into instruction-tuning samples, with a deterministic train/test split.
- **Training:** Launch training via `train/train.py` and `train/config.yaml`, which apply QLoRA to the Qwen model and fine-tune it on the dataset.
- **Evaluation/Inference:** Evaluate the model quantitatively (`eval/eval.py`) or qualitatively (`eval/qualitative.py`), or generate new poems (`eval/generate.py`) using the provided adapter.
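For intuition about the character-level metric, here is a minimal pure-Python BLEU-4 sketch (single reference, no smoothing) — not the repo's actual `eval/eval.py`, which may rely on standard tooling:

```python
import math
from collections import Counter

def char_bleu4(candidate: str, reference: str) -> float:
    """Character-level BLEU-4: geometric mean of modified n-gram
    precisions (n = 1..4) times a brevity penalty."""
    def ngrams(chars, n):
        return Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))

    cand, ref = list(candidate), list(reference)
    precisions = []
    for n in range(1, 5):
        c_counts, r_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        precisions.append(overlap / max(1, sum(c_counts.values())))
    if min(precisions) == 0:
        return 0.0  # any empty precision zeroes the geometric mean
    log_avg = sum(math.log(p) for p in precisions) / 4
    # Brevity penalty discourages overly short candidates.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_avg)
```

The word-level variant would first segment both strings with jieba and apply the same computation over tokens instead of characters.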
## Results
Fine-tuning yields dramatic improvements: BLEU-4 scores increase by orders of magnitude, and perplexity drops from 54.3 to 19.8 on the test set. The fine-tuned model generates poetry much closer in style and content to reference Tang poems, as shown in both quantitative metrics and qualitative samples.
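Since perplexity is the exponential of the mean per-token cross-entropy loss, the reported drop can be read directly as a loss improvement. A quick check:

```python
import math

# Perplexity = exp(mean per-token cross-entropy), so the reported
# 54.3 -> 19.8 drop corresponds to about one nat less loss per token.
loss_base = math.log(54.3)  # ~3.99 nats/token (base model)
loss_ft = math.log(19.8)    # ~2.99 nats/token (fine-tuned)
print(f"loss reduction: {loss_base - loss_ft:.2f} nats/token")
```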