Guilherme was previously a member of the Falcon team, where he was in charge of creating RefinedWeb, the pretraining dataset for the first iteration of the Falcon LLM. He is now a member of the Hugging Face Science Team, where he works on improving pretraining datasets and led the FineWeb and FineWeb2 projects, two large-scale datasets for LLM pretraining. More recently, he's been involved in Open-R1, Hugging Face's fully open effort to replicate the DeepSeek-R1 model.