Obtaining Better Turkish Corpora with High Quality Language Identification

Multilingual corpora like mC4 and OSCAR include Turkish subsets, but they have two significant limitations:

Language Identification: These corpora rely on general-purpose models that cover hundreds of languages, which can lead to inaccurate language identification for Turkish.
Content Extraction: Web pages with Turkish content often include headers, footers, and metadata, which these pipelines struggle to separate from the main content.

This project aims to apply a Turkish-specific language identification model and better main-content extraction techniques within the OSCAR pipeline, producing a higher-quality pretraining corpus for Turkish LLMs.
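
The two steps described above (filtering out boilerplate lines, then keeping only lines identified as Turkish) can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration: the function names, the thresholds, and the character and stopword heuristic, which merely stands in for a trained Turkish-specific language identification model. It does not reflect how OSCAR or Ungoliant actually work.

```python
import re

# Letters that occur in Turkish but not in English (illustrative heuristic only).
TURKISH_CHARS = set("çğıöşüÇĞİÖŞÜ")
# A tiny, hypothetical stopword list; a real system would use a trained model.
TURKISH_STOPWORDS = {"ve", "bir", "bu", "için", "ile", "de", "da",
                     "daha", "çok", "gibi", "ama", "olarak"}


def turkish_lower(word: str) -> str:
    """Lowercase using Turkish casing rules (İ -> i, I -> ı)."""
    return word.replace("İ", "i").replace("I", "ı").lower()


def turkish_score(text: str) -> float:
    """Fraction of words that look Turkish: a stand-in for a real LID model."""
    words = re.findall(r"\w+", text)
    if not words:
        return 0.0
    hits = 0
    for word in words:
        has_turkish_char = any(ch in TURKISH_CHARS for ch in word)
        is_stopword = turkish_lower(word) in TURKISH_STOPWORDS
        if has_turkish_char or is_stopword:
            hits += 1
    return hits / len(words)


def looks_like_boilerplate(line: str, min_words: int = 5) -> bool:
    """Treat short lines or lines without sentence-final punctuation as
    navigation, header, or footer text (assumed rule of thumb)."""
    words = re.findall(r"\w+", line)
    ends_like_sentence = line.rstrip().endswith((".", "!", "?"))
    return len(words) < min_words or not ends_like_sentence


def extract_turkish_content(page_text: str, min_score: float = 0.2) -> list[str]:
    """Keep lines that are neither boilerplate nor below the Turkish score threshold."""
    kept = []
    for line in page_text.splitlines():
        line = line.strip()
        if not line or looks_like_boilerplate(line):
            continue
        if turkish_score(line) >= min_score:
            kept.append(line)
    return kept


if __name__ == "__main__":
    page = "\n".join([
        "Ana Sayfa | Hakkımızda | İletişim",
        "Bu proje, Türkçe için daha kaliteli bir derlem oluşturmayı amaçlıyor.",
        "This paragraph is written in English and should be filtered out.",
        "Bizi takip edin",
    ])
    for kept_line in extract_turkish_content(page):
        print(kept_line)
```

In practice, the heuristic scorer would be replaced by a classifier trained specifically on Turkish and closely related languages, and the line-level rule by a dedicated main-content extractor; the sketch only shows how the two filters compose.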

Relevant links:

https://github.com/oscar-project/ungoliant?tab=readme-ov-file

Suitable for Cmpe492
