This project improves detection quality for degraded Sino-Nom and Chinese character images by combining curated data collection with targeted OCR model fine-tuning.
Highlights
- Built a high-concurrency scraping pipeline with Goroutines to crawl and curate the CWKB historical dataset.
- Assembled a 3.4K+ image corpus by combining CWKB data with NomNaOCR.
- Fine-tuned the PP-OCRv5 detection architecture with a PP-HGNetV2_B4 backbone, DB Algorithm, PaddlePaddle GPU training, Cosine Annealing, and optimized DBLoss.
- Improved H-mean from 0.731 to 0.952 and reached 0.966 precision on the NomNaOCR test set.
