Harnessing Common Crawl for AI and ML applications
May 7 • 14:20 - 14:40
Location: Central Room (Updated)
This presentation looks at effective strategies for using Common Crawl’s web archive in large-scale research, specifically for AI and other machine learning (ML) applications. We will discuss practical approaches to processing and filtering Common Crawl’s datasets, with a focus on overcoming computational challenges and optimising data pipelines. We will also address challenges that users may encounter because of the multilingual and heterogeneous nature of Common Crawl’s data. The talk will cover best practices for data filtering, pre-processing, and storage that help ensure the quality and relevance of the information extracted for research tasks. Finally, we will briefly discuss the ranking mechanism used to determine whether a URL is crawled, and demonstrate how to use the Web Graph as a framework for further research.