padie-extended is the first open-source Nigerian language text classifier package on PyPI. It is designed to detect Nigerian languages, including Pidgin, Yoruba, Hausa, and Igbo, and it provides AI-powered tools for language detection while fostering community collaboration to enhance its capabilities.
padie-extended is a work in progress. It is an extension developed by Ayooluwaposi Olomo, building on the original Padie repository by @sir-temi and @pythonisoft, whose open-source work laid the foundation for this project. Contributions are welcome. Be sure to check out their repository!
- 🚀 Fast and accurate language detection for Nigerian languages
- 🤖 Pre-trained transformer model for high-quality predictions
- 🌍 Supports 5 languages: English, Nigerian Pidgin, Yoruba, Hausa, and Igbo
- 📦 Simple API - just a few lines of code
- 🔧 Easy integration into existing Python projects
- 💻 Lightweight and efficient for production use
Please do NOT submit datasets to this repository. All dataset contributions should be made to the original Padie repository. This ensures all Padie-based projects benefit from your contributions.
We welcome contributions from developers, linguists, and data scientists interested in improving Nigerian language technology.
Here are some impactful ways you can help:
- **Expand Language Coverage:** Add support for more Nigerian and African languages beyond those currently included.
- **Improve Short-Form Text Handling:** The model performs better on long-form text. Training and fine-tuning it on short-form (social media, chat, etc.) data can boost performance.
- **Optimize Inference Efficiency:** Reduce model size or latency for deployment in resource-limited environments (mobile, low-bandwidth servers).
- **Enhance Evaluation Metrics:** Add multilingual or domain-specific benchmarks (e.g., dialectal variations, code-switching). A minimal evaluation sketch follows this list.
- **Augment the Dataset:** Contribute curated, diverse, and balanced text data to the main Padie repository, not this one.
- **Improve Documentation & Examples:** Add usage examples, Jupyter notebooks, or tutorials showing real-world use cases.
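Short-form evaluation is one concrete place to start. Below is a minimal sketch, assuming scikit-learn is installed, of how the detector could be benchmarked on a small hand-labelled set of short texts. The sample sentences and label strings (which follow the `all_scores` keys shown later in this README) are illustrative assumptions, not an official benchmark.

```python
# Minimal short-form evaluation sketch (illustrative; texts and labels are assumptions).
from padie_extended import LanguageDetector
from sklearn.metrics import classification_report  # assumes scikit-learn is installed

detector = LanguageDetector()

# A tiny, hand-labelled short-form test set (replace with a real benchmark).
samples = [
    ("How you dey?", "pidgin"),
    ("Bawo ni?", "yoruba"),
    ("Sannu", "hausa"),
    ("Kedu?", "igbo"),
    ("Hello, how are you?", "english"),
]

y_true = [label for _, label in samples]
y_pred = [detector.predict(text)["language"] for text, _ in samples]

# Per-language precision/recall/F1 on the short-form set.
print(classification_report(y_true, y_pred, zero_division=0))
```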
To contribute, follow the standard fork-and-pull-request workflow:

1. **Fork the Repository:** Click the "Fork" button at the top of the repository page to create your copy.
2. **Clone Your Fork:**
   ```bash
   git clone https://github.com/sir-temi/Padie.git
   ```
3. **Create a Branch:**
   ```bash
   git checkout -b feature-name
   ```
4. **Make Your Changes**, for example:
   - Model improvements and training techniques
   - Bug fixes and code optimizations
   - Documentation and examples
   - Evaluation tools and metrics
5. **Commit and Push:**
   ```bash
   git commit -m "Describe your changes"
   git push origin feature-name
   ```
6. **Submit a Pull Request:** Open a pull request against the `dev` branch with a clear description of your changes.
```bash
pip install padie-extended
```

If you're using this package to detect languages in your own projects (not for model training or development), you only need the following dependencies:
- Python 3.8+
- PyTorch 2.0+
- Transformers 4.30+
- SentencePiece 0.1.99+
- bitsandbytes 0.48.0+
```bash
pip install transformers[torch] sentencepiece bitsandbytes
```

```python
from padie_extended import LanguageDetector

# Initialize the detector
detector = LanguageDetector()

# Detect language from text
text = "Bawo ni, se daadaa ni?"
result = detector.predict(text)

print(f"Language: {result['language']}")
print(f"Confidence: {result['confidence']:.2%}")
```

Output:

```
Language: Yoruba
Confidence: 98.50%
```
| Language | Code | Example |
|---|---|---|
| English | en | "Hello, how are you?" |
| Nigerian Pidgin | pidgin | "How you dey?" |
| Yoruba | yo | "Bawo ni?" |
| Hausa | ha | "Sannu" |
| Igbo | ig | "Kedu?" |
```python
from padie_extended import LanguageDetector

detector = LanguageDetector()

# Single text
text = "I dey kampe, na God"
result = detector.predict(text)
print(result)
# {'language': 'pidgin', 'all_scores': {...}, 'confidence': 0.96}
```

```python
texts = [
    "Good morning everyone",
    "Ẹ káàárọ̀",
    "Sannu da safe",
    "Wetin dey happen?"
]

results = detector.predict_batch(texts)
for text, result in zip(texts, results):
    print(f"{text} -> {result['language']}")
```

```python
result = detector.predict("This is a mixed text")
print(result['all_scores'])
# {
#   'english': 0.85,
#   'pidgin': 0.10,
#   'yoruba': 0.03,
#   'hausa': 0.01,
#   'igbo': 0.01
# }
```

```python
detector = LanguageDetector(model_path="path/to/your/model")
```

```python
# Set threshold at initialization (default is 0.5)
detector = LanguageDetector(confidence_threshold=0.7)

# Or override for a specific prediction
result = detector.predict("Maybe pidgin", threshold=0.8)

# Change threshold after initialization
detector.set_threshold(0.6)
```

- Base Model: afro-xlmr-base, a Transformer-based model
- Training Data: Diverse corpus of Nigerian language texts
- Model Size: 1GB
Tested on a diverse dataset of Nigerian texts:
| Metric | Score |
|---|---|
| Overall Accuracy | 95.3% |
| F1 Score (weighted) | 95.3% |
| Inference Speed | ~4.5 ms per text (measured on GPU) |
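To get a rough latency figure on your own hardware, a simple micro-benchmark like the sketch below (timing repeated `predict` calls after a warm-up) is one option. This is an illustrative assumption about methodology, not how the table above was measured; warm-up, batching, and hardware all affect the number.

```python
import time
from padie_extended import LanguageDetector

detector = LanguageDetector()
text = "Wetin dey happen?"

# Warm up once so model loading / first-call overhead is excluded.
detector.predict(text)

n_runs = 100
start = time.perf_counter()
for _ in range(n_runs):
    detector.predict(text)
elapsed = time.perf_counter() - start

print(f"~{elapsed / n_runs * 1000:.2f} ms per text")
```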
- 🌐 Content moderation - Detect language in user-generated content
- 📱 Social media analysis - Analyze multilingual Nigerian social media posts
- 🤖 Chatbots - Route conversations based on detected language
- 📊 Research - Analyze language distribution in datasets
- 🎯 Language-specific processing - Trigger different pipelines per language (see the routing sketch after this list)
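As an illustration of the chatbot and per-language routing use cases above, the sketch below dispatches each message to a language-specific handler and falls back when confidence is low. The handler functions and the 0.7 cutoff are assumptions; only `predict` and its `language`/`confidence` fields come from this README.

```python
from padie_extended import LanguageDetector

detector = LanguageDetector()

# Illustrative per-language handlers (assumed names, not part of the package).
def handle_english(text):
    return f"[english pipeline] {text}"

def handle_pidgin(text):
    return f"[pidgin pipeline] {text}"

def handle_default(text):
    return f"[fallback pipeline] {text}"

HANDLERS = {
    "english": handle_english,
    "pidgin": handle_pidgin,
}

def route(text, min_confidence=0.7):
    result = detector.predict(text)
    # Fall back when the detector is not confident enough.
    if result["confidence"] < min_confidence:
        return handle_default(text)
    handler = HANDLERS.get(result["language"], handle_default)
    return handler(text)

print(route("How you dey?"))
```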
If you use this package in your research, please cite:
```bibtex
@software{padie_extended,
  author = {Olomo, Ayooluwaposi},
  title = {padie-extended: AI-powered Nigerian Language Detection},
  year = {2025},
  url = {https://github.com/posi-olomo/padie-extended}
}
```

- Built upon the Padie project
- Built with AWS cloud credits generously provided by Dr. Wálé Akínfadérìn
- Built with Hugging Face Transformers
- Inspired by the need for better Nigerian language NLP tools
- Thanks to all future contributors and the Nigerian NLP community
- GitHub: posi-olomo/padie-extended
- PyPI: padie-extended
- Issues: Report a bug
- Documentation: Full Documentation
If you encounter any issues or have questions:
- Check the documentation
- Search existing issues
- Create a new issue
padie-extended is licensed under the MIT License, ensuring it remains free and open for everyone to use, contribute to, and enhance.
Made with ❤️ for the Nigerian tech community