Obtaining Better Turkish Corpora with High Quality Language Identification

Multilingual corpora like mC4 and OSCAR include Turkish subsets, but they have two significant limitations:

  1. Language Identification: These corpora rely on language identification models that cover hundreds of languages, so their accuracy on Turkish text can suffer.
  2. Content Extraction: Turkish web pages often include headers, footers, and metadata, which these codebases struggle to separate from the main content.

This project aims to address both limitations by applying a Turkish-specific language identification model and better scraping techniques within the OSCAR project, improving main-content extraction and yielding a higher-quality pretraining corpus for Turkish LLMs.
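To make the two steps concrete, here is a minimal sketch of how such a filter could be wired together. It is not the OSCAR or Ungoliant implementation: `turkish_score` is a toy stopword-and-character heuristic standing in for a real Turkish-specific language identification model, and `strip_repeated_lines` is a simple stand-in for main-content extraction that drops lines shared by most pages (typical of headers and footers). All function names, the stopword list, and the thresholds are assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical toy resources; a real system would use a trained model instead.
TURKISH_STOPWORDS = {"ve", "bir", "bu", "için", "ile", "de", "da", "olarak", "çok", "daha"}
TURKISH_CHARS = set("çğıöşüÇĞİÖŞÜ")


def turkish_score(text: str) -> float:
    """Return a rough score in [0, 1]; higher means more likely Turkish."""
    words = re.findall(r"\w+", text.lower())
    letters = [c for c in text if c.isalpha()]
    if not words or not letters:
        return 0.0
    stopword_ratio = sum(w in TURKISH_STOPWORDS for w in words) / len(words)
    char_ratio = sum(c in TURKISH_CHARS for c in letters) / len(letters)
    return min(1.0, 2.0 * stopword_ratio + 2.0 * char_ratio)


def strip_repeated_lines(pages: list[str], max_share: float = 0.5) -> list[str]:
    """Drop lines that occur in more than max_share of the pages (likely boilerplate)."""
    line_counts = Counter()
    for page in pages:
        # Count each distinct non-empty line once per page.
        line_counts.update({line.strip() for line in page.splitlines() if line.strip()})
    limit = max_share * len(pages)
    cleaned = []
    for page in pages:
        kept = [
            line.strip()
            for line in page.splitlines()
            if line.strip() and line_counts[line.strip()] <= limit
        ]
        cleaned.append("\n".join(kept))
    return cleaned


def build_corpus(pages: list[str], threshold: float = 0.2) -> list[str]:
    """Strip shared boilerplate, then keep pages that score as Turkish."""
    cleaned = strip_repeated_lines(pages)
    return [page for page in cleaned if turkish_score(page) >= threshold]


pages = [
    "Ana Sayfa\nBu proje Türkçe için daha iyi bir derlem oluşturmayı amaçlıyor.\nBize Ulaşın",
    "Ana Sayfa\nThis page is written in English and should be dropped.\nBize Ulaşın",
    "Ana Sayfa\nDil tanıma modeli çok dilli modellerden daha doğru olabilir.\nBize Ulaşın",
]
corpus = build_corpus(pages)
```

Boilerplate is removed before scoring so that a Turkish header or footer does not make a foreign-language page look Turkish. In a real pipeline, both stand-ins would be replaced by a trained language identification model and a dedicated content extractor.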

 

Relevant links:

https://github.com/oscar-project/ungoliant?tab=readme-ov-file

 

Suitable for Cmpe492

Project Advisor: Suzan Üsküdarlı

Project Status: 

Project Year: 2024, Fall
