Senior Software Engineer – LLM Evaluation & Repository Validation

Part-time • Turing

Role overview

This role focuses on building and evaluating large language model (LLM) training and evaluation datasets based on real-world software engineering tasks derived from public repository histories. The work centers on creating verifiable software engineering tasks using a synthetic, human-in-the-loop approach, while expanding dataset coverage across programming languages and difficulty levels.

You will contribute as an experienced software engineer working with high-quality, widely used public repositories. The role combines hands-on engineering work, such as repository setup, issue triage, and test coverage evaluation, with collaboration with researchers working on LLM evaluation.

What you’ll actually be doing

Analyze and triage GitHub issues across trending open-source libraries.
Set up and configure code repositories, including Dockerization and environment setup.
Evaluate unit test coverage and overall test quality.
Modify and run codebases locally to assess LLM performance in bug-fixing scenarios.
Collaborate with researchers to identify repositories and issues that are challenging for LLMs and to design tasks based on them.
Contribute to development environment automation and software pipeline setup.
Lead junior engineers in collaborative project work when applicable.

Who this role is for

Software engineers at a tech lead level.
Engineers familiar with high-quality public GitHub repositories.
Professionals who have worked with well-maintained, widely used repositories with 500+ stars.
Engineers comfortable working hands-on with real-world codebases.

Who this role is likely NOT for

Engineers without experience working in established public repositories.
Candidates who are not comfortable running, modifying, and testing real-world projects locally.
Professionals without proficiency in Git, Docker, or basic software pipeline setup.
Developers without strong experience in at least one of the programming languages listed under Technical background.

Technical background

Strong experience with at least one of the following languages: Python, JavaScript, Java, Go, Rust, C/C++, C#, or Ruby.
Proficiency with Git, Docker, and basic software pipeline setup.
Ability to understand and navigate complex codebases.
Comfortable running, modifying, and testing real-world projects locally.
Experience contributing to or evaluating open-source projects (preferred).
Previous participation in LLM research or evaluation projects (nice to have).
Experience building or testing developer tools or automation agents (nice to have).

Project scope

Duration: 3 months, with an expected start date of next week.

Focused on building LLM evaluation and training datasets based on public repository histories.

Synthetic task creation using a human-in-the-loop approach.

Collaboration with researchers evaluating LLM interaction with real code.

Commitment required: 20 hours per week, including some overlap with PST.