Senior Software Engineer – LLM Evaluation

Flexible Turing

Role overview

This role focuses on evaluating and improving large language models through software engineering expertise. You will contribute to AI model training initiatives by creating and curating high-quality code datasets, assessing AI-generated outputs, and designing verification mechanisms.

The work involves hands-on coding across multiple programming languages, structured evaluation of model performance across the software engineering lifecycle, and collaboration with cross-functional teams to strengthen AI-driven coding systems.

This role is suited for experienced software engineers with strong full-stack and production experience who can critically assess code quality, architecture, and scalability in a structured and analytical way.


What you’ll actually be doing

  • Curating code examples, building solutions, and correcting code in Python, JavaScript (including ReactJS), C/C++, Java, Rust, and Go
  • Evaluating and refining AI-generated code to ensure efficiency, scalability, and reliability
  • Collaborating with cross-functional teams to improve AI-driven coding solutions against industry performance benchmarks
  • Building agents that verify code quality and identify recurring error patterns
  • Hypothesizing the steps of the software engineering lifecycle (prototyping, architecture design, API design, production implementation, launch, experiments, monitoring, operations and maintenance) and evaluating model capabilities at each stage
  • Designing verification mechanisms that automatically validate solutions to software engineering tasks

Who this role is for

  • Software engineers with several years of experience
  • Engineers with 2+ years of continuous full-time experience at a top-tier product company (e.g., Google, Stripe, Amazon, Apple, Meta, Netflix, Microsoft, Datadog, Dropbox, Shopify, PayPal, IBM Research)
  • Professionals experienced in building full-stack applications and deploying scalable, production-grade software
  • Engineers with deep understanding of software architecture, system design, debugging, and code quality evaluation
  • Individuals with strong oral and written communication skills who can deliver structured evaluation rationales

Who this role is likely NOT for

  • Engineers without several years of software engineering experience
  • Candidates without at least 2 years of continuous full-time experience at a top-tier product company as specified
  • Professionals without production-grade full-stack development experience
  • Individuals without strong software architecture and code review expertise
  • Candidates who lack clear written and verbal communication skills

Technical background

  • Several years of software engineering experience
  • 2+ years of continuous full-time experience at a top-tier product company (as specified)
  • Strong expertise in full-stack application development
  • Experience deploying scalable, production-grade software
  • Deep understanding of software architecture, design, development, debugging, and code quality assessment
  • Proficiency in Python, JavaScript (including ReactJS), C/C++, Java, Rust, and Go

Project scope

Contractor engagement

Flexible engagement

Minimum 10 hours per week, up to 40 hours per week

Partial PST overlap required

Duration: 1 month (starting next week; potential extensions based on performance and fit)