Sino-Nom/Chinese OCR

This project improves detection quality for degraded Sino-Nom and Chinese character images by combining curated data collection with targeted OCR model fine-tuning.

Highlights

Built a high-concurrency scraping pipeline in Go, using goroutines, to crawl and curate the CWKB historical dataset.
Assembled a 3.4K+ image corpus by combining CWKB data with NomNaOCR.
Fine-tuned the PP-OCRv5 detection architecture (PP-HGNetV2_B4 backbone, DB (Differentiable Binarization) algorithm) with PaddlePaddle on GPU, using a cosine annealing learning-rate schedule and an optimized DBLoss.
Improved H-mean from 0.731 to 0.952 and reached 0.966 precision on the NomNaOCR test set.
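
For readers unfamiliar with the metric: H-mean is the harmonic mean of precision and recall, which is how text-detection results are usually summarized. The sketch below is illustrative and not taken from the project's evaluation code; the helper names `hmean` and `implied_recall` are hypothetical. It shows the formula and, assuming the reported 0.952 H-mean and 0.966 precision come from the same evaluation run, the recall those two figures would imply (about 0.938).

```python
def hmean(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the H-mean / F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def implied_recall(h: float, precision: float) -> float:
    """Solve H = 2PR / (P + R) for R, given H and P (hypothetical helper)."""
    denom = 2 * precision - h
    if denom <= 0:
        raise ValueError("no valid recall for this H-mean/precision pair")
    return h * precision / denom


if __name__ == "__main__":
    # Figures from the highlights above; assumes both come from the same run.
    r = implied_recall(0.952, 0.966)
    print(f"implied recall: {r:.3f}")
    print(f"round-trip H-mean: {hmean(0.966, r):.3f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, an H-mean this close to the precision suggests that recall is not far behind it.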