Part-time • Turing
Role overview
This role focuses on building and evaluating datasets for large language model (LLM) training and evaluation, grounded in real-world software engineering tasks derived from public repository histories. The work centers on creating verifiable software engineering tasks through a synthetic, human-in-the-loop approach, while expanding dataset coverage across programming languages and difficulty levels.
You will contribute as an experienced software engineer working with high-quality, widely used public repositories. The role combines hands-on engineering tasks, such as repository setup, issue triaging, and test coverage evaluation, with close collaboration with researchers working on LLM evaluation.
What you’ll actually be doing
- Analyze and triage GitHub issues across trending open-source libraries.
- Set up and configure code repositories, including Dockerization and environment setup.
- Evaluate unit test coverage and overall test quality.
- Modify and run codebases locally to assess LLM performance in bug-fixing scenarios.
- Collaborate with researchers to identify repositories and issues that are challenging for LLMs and to design evaluation tasks around them.
- Contribute to development environment automation and software pipeline setup.
- Lead junior engineers in collaborative project work when applicable.
Who this role is for
- Software engineers at a tech lead level.
- Engineers familiar with high-quality public GitHub repositories.
- Professionals who have worked with well-maintained, widely used repositories with 500+ stars.
- Engineers comfortable working hands-on with real-world codebases.
Who this role is likely NOT for
- Engineers without experience working in established public repositories.
- Candidates who are not comfortable running, modifying, and testing real-world projects locally.
- Professionals without proficiency in Git, Docker, or basic software pipeline setup.
- Developers without strong experience in at least one of the programming languages listed under Technical background.
Technical background
- Strong experience with at least one of the following languages: Python, JavaScript, Java, Go, Rust, C/C++, C#, or Ruby.
- Proficiency with Git, Docker, and basic software pipeline setup.
- Ability to understand and navigate complex codebases.
- Comfortable running, modifying, and testing real-world projects locally.
- Experience contributing to or evaluating open-source projects (preferred).
- Previous participation in LLM research or evaluation projects (nice to have).
- Experience building or testing developer tools or automation agents (nice to have).
Project scope
- Duration: 3 months, with an expected start date next week.
- Focus: building LLM evaluation and training datasets based on public repository histories.
- Approach: synthetic task creation using a human-in-the-loop process.
- Collaboration: working with researchers evaluating how LLMs interact with real code.
- Commitment: 20 hours per week, with some working-hours overlap with PST.
