Senior Data Scientist (LLM)

Multiverse Computing • Full-time • Donostia, ES • 1d ago

Multiverse Computing

Multiverse is a well-funded, fast-growing deep-tech company founded in 2019. We are the largest quantum software company in the EU and were named one of the 100 most promising AI companies in the world (CB Insights, 2023). Our team of 150+ employees is growing, multicultural, and international.

We provide hyper-efficient software to companies seeking to gain an edge with quantum computing and artificial intelligence. Our main products, Singularity and CompactifAI, address critical needs across various industries. Singularity is a trusted solution for blue-chip companies in finance, energy, manufacturing, cybersecurity, and more. CompactifAI is a groundbreaking compression tool for foundation models that uses tensor networks to drastically compress AI systems, such as large language models, making them efficient and portable.

You will work alongside world-leading experts to build solutions that tackle real-life problems. We look for passionate people who want to grow in an ethics-driven environment that promotes sustainability and diversity. We aim to keep building our truly inclusive culture, so come and join us.

We are seeking a Senior Data Scientist with deep expertise in creating high-quality datasets for training and fine-tuning Large Language Models (LLMs). You will be responsible for designing and implementing scalable data pipelines and strategies to support all stages of LLM development: pretraining, supervised fine-tuning, and reinforcement learning from human feedback (RLHF).

This role is critical to ensuring the robustness, safety, and alignment of our AI models. You will have the autonomy to explore innovative data sourcing and curation methods and the opportunity to directly influence the capabilities of state-of-the-art LLMs.

As a Senior Data Scientist, you will:

Design and implement strategies for creating, sourcing, and augmenting datasets tailored for LLM training and fine-tuning.

Develop scalable pipelines to collect, clean, filter, annotate, and validate large volumes of text data.

Conduct data audits to ensure quality, diversity, ethical compliance, and bias mitigation.

Collaborate with ML engineers and researchers to align datasets with training objectives and model evaluation needs.

Use techniques such as active learning, synthetic data generation, and self-supervised learning to maximize dataset efficiency.

Leverage human-in-the-loop (HITL) workflows for data labeling and validation where necessary.

Contribute to building data documentation and metadata standards (e.g., Datasheets for Datasets).

Keep up to date with research trends in dataset curation, LLM pretraining data, and benchmarking.

Required Qualifications

Bachelor’s, Master’s, or Ph.D. in Computer Science, AI, Data Science, or a related field.

3+ years of experience in data science, machine learning, or related roles, with demonstrated experience in dataset creation for NLP or LLMs.

In-depth knowledge of the LLM lifecycle: pretraining, fine-tuning, alignment, and evaluation.

Proficient in Python and data tooling ecosystems (Pandas, NumPy, spaCy, Hugging Face Datasets & Transformers).

Hands-on experience with text data collection from diverse sources: web scraping, APIs, proprietary corpora, etc.

Strong understanding of data quality metrics including bias detection, toxicity, and readability.

Experience working with annotation tools (e.g., Prodigy, Label Studio) and managing annotation teams or workflows.

Preferred Qualifications

Experience building or contributing to datasets used in LLM pretraining or supervised fine-tuning.

Familiarity with RLHF workflows and alignment techniques (e.g., preference modeling, reward modeling).

Exposure to multilingual and low-resource language datasets.

Contributions to open-source datasets, tools, or publications in dataset-centric research.

Knowledge of ethical AI, data governance, privacy laws (e.g., GDPR), and responsible data use.

Perks & Benefits

Indefinite contract.

Equal pay guaranteed.

Variable performance bonus.

Signing bonus.

Work visa sponsorship (if applicable).

Relocation package (if applicable).

Private health insurance.

Eligibility for educational budget according to internal policy.

Hybrid working option.

Flexible working hours.

Language classes and discounted lunch options.

A fast-paced environment working on cutting-edge technologies.

Career plan. Opportunity to learn and teach.

Progressive company. Happy people culture.

As an equal opportunity employer, Multiverse Computing is committed to building an inclusive workplace. The company welcomes applications from people of all backgrounds, regardless of age, citizenship, ethnic or racial origin, gender identity, disability, marital status, religion or ideology, or sexual orientation.
