Senior Software Engineer – LLM Evaluation

Flexible • Turing

Role overview

This role focuses on evaluating and improving large language models through software engineering expertise. You will contribute to AI model training initiatives by creating and curating high-quality code datasets, assessing AI-generated outputs, and designing verification mechanisms.

The work involves hands-on coding across multiple programming languages, structured evaluation of model performance across the software engineering lifecycle, and collaboration with cross-functional teams to strengthen AI-driven coding systems.

This role is suited for experienced software engineers with strong full-stack and production experience who can critically assess code quality, architecture, and scalability in a structured and analytical way.

What you’ll actually be doing

Curating code examples, building solutions, and correcting code in Python, JavaScript (including ReactJS), C/C++, Java, Rust, and Go
Evaluating and refining AI-generated code to ensure efficiency, scalability, and reliability
Collaborating with cross-functional teams to improve AI-driven coding solutions, measured against industry performance benchmarks
Building agents that verify code quality and identify recurring error patterns
Hypothesizing steps across the software engineering lifecycle (prototyping, architecture design, API design, production implementation, launch, experiments, monitoring, operations, and maintenance) and evaluating model capabilities within those stages
Designing verification mechanisms that automatically validate solutions to software engineering tasks

Who this role is for

Software engineers with several years of experience
Engineers with 2+ years of continuous full-time experience at a top-tier product company (e.g., Google, Stripe, Amazon, Apple, Meta, Netflix, Microsoft, Datadog, Dropbox, Shopify, PayPal, IBM Research)
Professionals experienced in building full-stack applications and deploying scalable, production-grade software
Engineers with deep understanding of software architecture, system design, debugging, and code quality evaluation
Individuals with strong oral and written communication skills who can deliver structured evaluation rationales

Who this role is likely NOT for

Engineers without several years of software engineering experience
Candidates without at least 2 years of continuous full-time experience at a top-tier product company as specified
Professionals without production-grade full-stack development experience
Individuals without strong software architecture and code review expertise
Candidates who lack clear written and verbal communication skills

Technical background

Several years of software engineering experience
2+ years of continuous full-time experience at a top-tier product company (as specified)
Strong expertise in full-stack application development
Experience deploying scalable, production-grade software
Deep understanding of software architecture, design, development, debugging, and code quality assessment
Proficiency in Python, JavaScript (including ReactJS), C/C++, Java, Rust, and Go

Project scope

Contractor engagement

Flexible engagement

Minimum 10 hours per week, up to 40 hours per week

Partial PST overlap required

Duration: 1 month (starting next week; potential extensions based on performance and fit)