A domain-specific Small Language Model fine-tuned on Australian legal documents. The model learns patterns from the AustLII legal corpus and generates answers using domain-adapted knowledge, with a strong bias toward Australian legal language and concepts.
A Small Language Model (SLM) fine-tuned from DistilGPT2 (~82M parameters) on Australian legal documents scraped from AustLII. The model learns legal terminology, citation patterns, and reasoning structures through causal language modeling.
- Domain-Adapted: Fine-tuned specifically on Australian legal documents
- Learns Patterns: Internalizes legal terminology, citations, and reasoning structures
- Standalone: No external retrieval needed (unlike RAG systems)
- Lightweight: 82M parameters, runs efficiently on CPU or GPU
- A true SLM: Fine-tuning DistilGPT2 on the legal corpus updates the model's weights
- Domain-adapted: Knowledge internalized in model parameters
- Standalone: No external retrieval at inference time
- Learns patterns: Legal terminology, citations, reasoning structures
- Not trained from scratch: Built on DistilGPT2's pre-training
- Not hallucination-proof: Reduces but doesn't eliminate hallucinations
- Not legal advice: Research/educational tool only
Requirements: Python 3.8+, PyTorch, CUDA (optional for GPU)
# Clone repository
git clone https://github.com/JamesANZ/auslegal-slm.git
cd auslegal-slm
# Create virtual environment
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt

- Clean Data (one-time):
python clean_data.py
- Prepare Data:
python prepare_data.py
- Train Model:
python train_slm.py

Training Time:
- CPU: Several hours
- GPU: 30-60 minutes
Interactive Mode:
python query_slm.py

Single Question:
python query_slm.py --question "What is negligence?"

Best Prompting Strategy:
Since the model was trained on raw legal documents (not Q&A pairs), it works best when prompts are framed as legal text continuations. The default continuation strategy automatically converts questions to legal definition prompts.
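The conversion itself is just string templating. A hypothetical sketch of the idea (the actual logic lives in query_slm.py and may differ in detail):

```python
# Illustrative only: rephrase a question as the opening of a legal definition so the
# prompt resembles the raw legal text the model was fine-tuned on.
def question_to_continuation(question: str) -> str:  # hypothetical helper
    topic = question.strip().rstrip("?").strip()
    for prefix in ("what is ", "what are ", "define "):
        if topic.lower().startswith(prefix):
            topic = topic[len(prefix):]
            break
    return f"In Australian law, {topic} is defined as"

print(question_to_continuation("What is negligence?"))
# -> In Australian law, negligence is defined as
```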
Recommended Settings:
python query_slm.py \
  --question "What is negligence?" \
  --strategy continuation \
  --greedy \
  --max-length 300

Available Strategies:
- continuation (default, recommended): Converts questions to legal definition prompts
- few_shot: Shows examples of desired format
- direct: Simple Q&A format
- structured: Uses XML-like delimiters
Python API:
from query_slm import LegalSLM, PromptStrategy
slm = LegalSLM()
answer = slm.generate_answer(
    "What is negligence?",
    use_greedy=True,                        # Most deterministic
    max_length=300,                         # Complete answers
    strategy=PromptStrategy.CONTINUATION,
    stop_sequences=["\n\n", ".", "?"]
)

- Base Model: distilgpt2 (82M parameters)
- Architecture: Transformer decoder (GPT-2 style)
- Training Objective: Causal language modeling (next token prediction)
- Layers: 6 transformer decoder blocks
- Hidden size: 768
- Attention heads: 12
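For reference, these figures can be checked directly against the stock distilgpt2 checkpoint with the Transformers library (this inspects the base model, not the fine-tuned weights):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("distilgpt2")
n_params = sum(p.numel() for p in model.parameters())

print(f"{n_params / 1e6:.0f}M parameters")                              # ~82M
print(model.config.n_layer, model.config.n_embd, model.config.n_head)   # 6 768 12
```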
Training Hyperparameters:
Epochs: 5
Learning rate: 2e-5
Batch size: 4 (effective: 16 with gradient accumulation)
Max sequence length: 512 tokens
Optimizer: AdamW with warmup
Mixed precision: FP16 (if GPU available)
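As a rough guide, these settings map onto Hugging Face TrainingArguments as sketched below; the authoritative configuration is in train_slm.py, and the warmup step count here is an assumption:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="models/legal_slm",
    num_train_epochs=5,
    learning_rate=2e-5,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # 4 x 4 = effective batch size of 16
    warmup_steps=100,                # "AdamW with warmup"; exact step count assumed
    fp16=True,                       # set False when training on CPU
)
```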
Recommended Settings:
Strategy: continuation (converts questions to legal definition prompts)
Decoding: greedy (most deterministic)
Max length: 300 tokens (complete answers)
Temperature: 0.2 (when using sampling)
Repetition penalty: 1.2

Why Continuation Strategy Works Best: The model was trained on raw legal documents, not Q&A pairs. Framing questions as legal text continuations (e.g., "In Australian law, negligence is defined as") matches the training data format and produces better results than Q&A style prompts.
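Put together, the recommended inference settings correspond roughly to the following Transformers call. query_slm.py wraps this in LegalSLM.generate_answer(); loading the checkpoint directly like this assumes train_slm.py saves it in standard from_pretrained format:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("models/legal_slm")
model = AutoModelForCausalLM.from_pretrained("models/legal_slm")

prompt = "In Australian law, negligence is defined as"   # continuation-style prompt
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(
    **inputs,
    max_new_tokens=300,
    do_sample=False,                       # greedy decoding
    repetition_penalty=1.2,
    pad_token_id=tokenizer.eos_token_id,   # GPT-2 has no dedicated pad token
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```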
- Legal documents scraped from AustLII
- Format: Plain text files with metadata headers
- Processing: Automatic cleaning and tokenization
- Tokenizer: GPT-2 tokenizer (BPE-based)
- Vocabulary size: 50,257 tokens
- Sequence length: 512 tokens (fixed)
- Sliding window: 256 token stride (50% overlap)
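The sliding-window chunking can be pictured as follows; this is an illustrative sketch, not the exact code in prepare_data.py:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")  # GPT-2 BPE tokenizer

def chunk_document(text, seq_len=512, stride=256):
    """Split one document into fixed 512-token windows that overlap by 50%."""
    ids = tokenizer(text)["input_ids"]
    windows = []
    for start in range(0, len(ids), stride):
        window = ids[start:start + seq_len]
        if len(window) == seq_len:   # assumption: short tail windows are dropped
            windows.append(window)
    return windows
```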
Hallucination Mitigation:
- Domain fine-tuning on legal corpus only
- Greedy decoding or low temperature (0.2-0.3) during inference
- Capped generation length (300 tokens default)
- Continuation strategy prompts that match training data format
Note: Fine-tuning reduces but cannot guarantee absence of hallucinations. The model may still generate incorrect or mixed-domain content.
- Corpus size: The training corpus of 119 documents is relatively small
- Coverage: May not cover all areas of Australian law
- Temporal: Documents reflect law at scraping time
- Context window: 512 tokens limits context
- Generalization: May overfit to specific documents
- No citations: Doesn't explicitly cite sources (unlike RAG)
Training metrics saved to models/legal_slm/training_metrics.json:
{
"training_loss": 2.3456,
"eval_loss": 2.4567,
"perplexity": 11.67,
"num_epochs": 5
}

Perplexity: Lower is better. ~10-15 is reasonable for domain-adapted models.
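Perplexity is just the exponential of the mean cross-entropy (the eval loss), so the example numbers above are consistent:

```python
import math

eval_loss = 2.4567                    # from training_metrics.json (example values above)
print(round(math.exp(eval_loss), 2))  # 11.67
```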
- Research – Explore domain-specific language modeling
- Education – Learn about fine-tuning and SLM training
- Prototyping – Test legal domain adaptation approaches
- Comparison – Baseline for hybrid SLM+RAG systems
- SLM + RAG: Combine with retrieval for stricter factual grounding
- LoRA fine-tuning: More parameter-efficient approach (see the sketch after this list)
- Comparison models: N-gram, Char-RNN, tiny transformer from scratch
- Gradient checkpointing: Reduce memory for larger batches
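For the LoRA direction, a minimal sketch with the peft library might look like this; the rank, alpha, and target modules are assumptions, not settings used in this repo:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("distilgpt2")
lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    r=8,                         # low-rank dimension (assumed)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["c_attn"],   # GPT-2's fused QKV projection
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()   # only a small fraction of the 82M weights train
```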
auslegal-slm/
├── data/ # Legal documents (scraped, cleaned)
├── preprocessed_data/ # Tokenized training data
├── models/ # Trained models
│ └── legal_slm/ # Fine-tuned DistilGPT2
├── scraper/ # Data collection tools
├── clean_data.py # Data cleaning script
├── prepare_data.py # Data preparation script
├── train_slm.py # Training script
├── query_slm.py # Query interface
└── requirements.txt # Dependencies
@software{auslegal_slm,
title = {Australian Legal Small Language Model},
author = {James Sangalli},
year = {2025},
url = {https://github.com/JamesANZ/auslegal-slm}
}

Contributions welcome! Please open an issue or submit a pull request.
- Legal documents from AustLII
- Model architecture: DistilGPT2 by Hugging Face
- Built with Transformers library
If you find this project useful, consider supporting it:
Lightning Network:
lnbc1pjhhsqepp5mjgwnvg0z53shm22hfe9us289lnaqkwv8rn2s0rtekg5vvj56xnqdqqcqzzsxqyz5vqsp5gu6vh9hyp94c7t3tkpqrp2r059t4vrw7ps78a4n0a2u52678c7yq9qyyssq7zcferywka50wcy75skjfrdrk930cuyx24rg55cwfuzxs49rc9c53mpz6zug5y2544pt8y9jflnq0ltlha26ed846jh0y7n4gm8jd3qqaautqa
Bitcoin: bc1ptzvr93pn959xq4et6sqzpfnkk2args22ewv5u2th4ps7hshfaqrshe0xtp
Ethereum/EVM: 0x42ea529282DDE0AA87B42d9E83316eb23FE62c3f