Senior Software Engineer – LLM Evaluation & Repository Validation

Part-time · Turing

Role overview

This role focuses on building and evaluating datasets used to train and evaluate large language models (LLMs), based on real-world software engineering tasks derived from public repository histories. The work centers on creating verifiable software engineering tasks using a synthetic, human-in-the-loop approach, while expanding dataset coverage across programming languages and difficulty levels.

You will contribute as an experienced software engineer working with high-quality, widely used public repositories. The role combines hands-on engineering tasks (such as repository setup, issue triaging, and test coverage evaluation) with collaboration with researchers working on LLM evaluation.


What you’ll actually be doing

  • Analyze and triage GitHub issues across trending open-source libraries.
  • Set up and configure code repositories, including Dockerization and environment setup.
  • Evaluate unit test coverage and overall test quality.
  • Modify and run codebases locally to assess LLM performance in bug-fixing scenarios.
  • Collaborate with researchers to identify repositories and issues that are challenging for LLMs, and to design tasks around them.
  • Contribute to development environment automation and software pipeline setup.
  • Lead junior engineers in collaborative project work when applicable.

Who this role is for

  • Software engineers at a tech lead level.
  • Engineers familiar with high-quality public GitHub repositories.
  • Professionals who have worked with well-maintained, widely used repositories with 500+ stars.
  • Engineers comfortable working hands-on with real-world codebases.

Who this role is likely NOT for

  • Engineers without experience working in established public repositories.
  • Candidates who are not comfortable running, modifying, and testing real-world projects locally.
  • Professionals without proficiency in Git, Docker, or basic software pipeline setup.
  • Developers without strong experience in at least one of the listed programming languages.

Technical background

  • Strong experience with at least one of the following languages: Python, JavaScript, Java, Go, Rust, C/C++, C#, or Ruby.
  • Proficiency with Git, Docker, and basic software pipeline setup.
  • Ability to understand and navigate complex codebases.
  • Comfortable running, modifying, and testing real-world projects locally.
  • Experience contributing to or evaluating open-source projects (preferred).
  • Previous participation in LLM research or evaluation projects (nice to have).
  • Experience building or testing developer tools or automation agents (nice to have).

Project scope

  • Duration: 3 months, with an expected start date of next week.
  • Focus: building LLM evaluation and training datasets based on public repository histories.
  • Synthetic task creation using a human-in-the-loop approach.
  • Collaboration with researchers evaluating how LLMs interact with real code.
  • Commitment required: 20 hours per week, with some overlap with PST working hours.