Role overview
This role focuses on evaluating the real-world performance of large language models (LLMs) and autonomous AI agents. You will assess how these systems reason, act, and respond across multi-step workflows, using structured rubrics and benchmarking frameworks.
The work is analytical and detail-oriented. It involves reviewing model outputs, agent behavior traces, and execution steps to determine accuracy, quality, safety, and alignment with defined standards. This role is best suited for professionals with hands-on experience in AI evaluation, QA, research, or structured annotation environments who are comfortable working with ambiguity and evolving criteria.
What you’ll actually be doing
- Evaluate outputs from LLMs and autonomous agent systems using predefined guidelines and scoring rubrics (an illustrative sketch of a scored record follows this list)
- Review multi-step agent workflows, including reasoning traces and visual artifacts such as screenshots
- Identify edge cases, recurring failure modes, and quality patterns across evaluations
- Provide structured, written feedback to support benchmarking and model improvement
- Participate in calibration sessions to align on scoring standards and evaluation consistency
- Adapt to evolving evaluation criteria and ambiguous scenarios
- Document findings clearly for technical and non-technical stakeholders
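To make "predefined guidelines and scoring rubrics" concrete, here is a minimal sketch of what a single rubric-scored evaluation record could look like. Every field name, score dimension, and the 1-5 scale below is an assumption made for illustration; the project's actual rubric and tooling may differ.

```python
# All keys, score dimensions, and the 1-5 scale below are illustrative
# assumptions, not the project's actual rubric or tooling.
evaluation_record = {
    "task_id": "trace-0042",                  # which agent run was reviewed
    "rubric_scores": {                        # one 1-5 score per dimension
        "accuracy": 4,
        "safety": 5,
        "instruction_following": 3,
    },
    "failure_modes": ["skipped confirmation step"],  # recurring issues observed
    "feedback": "Agent completed the booking but never confirmed the date "
                "with the user before finalizing.",
}
```

The point of a structure like this is that every evaluation produces the same fields, so scores can be aggregated across evaluators and failure modes can be tracked over time.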
Who this role is for
- Professionals with prior experience in AI output evaluation, QA, testing, annotation, or UX research
- Individuals comfortable applying rubric-based scoring frameworks with high consistency
- Detail-oriented evaluators who can identify subtle reasoning flaws or execution gaps
- Strong written communicators capable of delivering precise, actionable feedback
- Self-directed contributors who can work independently in structured, remote environments
- Analytical thinkers who enjoy assessing system behavior rather than building models directly
Who this role is likely NOT for
- Candidates without hands-on experience evaluating AI systems or structured outputs
- Developers seeking primarily model training or engineering responsibilities
- Individuals who prefer loosely defined tasks without formal evaluation frameworks
- Professionals uncomfortable making judgment calls in ambiguous or edge-case scenarios
- Those looking for purely creative, marketing, or content-focused AI roles
Technical background
- Experience in LLM evaluation, AI benchmarking, QA/testing, UX research, or similar analytical domains
- Familiarity with rubric-based scoring systems and structured annotation workflows
- Strong English proficiency (minimum B2 equivalent) with clear written and verbal communication
- Ability to review multi-step reasoning and agent decision-making processes
- Experience with RLHF, annotation pipelines, or AI benchmarking frameworks (preferred)
- Familiarity with autonomous agents or workflow automation tools (preferred)
- Background in mobile app or digital product evaluation processes (preferred)
Project scope
- Flexible engagement based on project needs and evaluation cycles
- Ongoing evaluation work focused on benchmarking LLM and agent system performance
- Minimum commitment of 20 hours per week during the initial term
- Workload may evolve as evaluation guidelines and scenarios change
- Structured collaboration with periodic calibration and alignment sessions (see the sketch below for one common way scoring consistency is measured)
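Calibration sessions exist to keep scores consistent across evaluators. A common way to quantify that consistency is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is illustrative only: the 1-5 scale, the scores, and the function are assumptions, not the project's actual tooling.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items.

    rater_a, rater_b: equal-length sequences of categorical scores.
    Returns a float in [-1, 1]; 1.0 means perfect agreement.
    """
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)

    # Observed agreement: fraction of items scored identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected chance agreement, from each rater's score marginals.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_e = sum(counts_a[s] * counts_b[s] for s in counts_a) / (n * n)

    if p_e == 1.0:  # degenerate case: both raters used a single score
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical evaluators scoring the same ten agent traces on 1-5.
scores_a = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]
scores_b = [5, 4, 3, 3, 5, 2, 4, 4, 5, 4]
print(f"kappa = {cohens_kappa(scores_a, scores_b):.2f}")  # kappa = 0.71
```

A kappa near 1.0 indicates strong agreement; values in the 0.6-0.8 range, as in this made-up example, are usually read as substantial but imperfect agreement, the kind of gap calibration sessions work to close.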
