Role overview
This role focuses on evaluating the real-world performance of large language models (LLMs) and autonomous AI agents. You will assess how these systems reason, act, and respond across multi-step workflows, using structured rubrics and benchmarking frameworks.
The work is analytical and detail-oriented. It involves reviewing model outputs, agent behavior traces, and execution steps to determine accuracy, quality, safety, and alignment with defined standards. This role is best suited for professionals with hands-on experience in AI evaluation, QA, research, or structured annotation environments who are comfortable working with ambiguity and evolving criteria.
What you’ll actually be doing
- Evaluate outputs from LLMs and autonomous agent systems using predefined guidelines and scoring rubrics (an illustrative sketch of a scored record follows this list)
- Review multi-step agent workflows, including reasoning traces and visual artifacts such as screenshots
- Identify edge cases, recurring failure modes, and quality patterns across evaluations
- Provide structured, written feedback to support benchmarking and model improvement
- Participate in calibration sessions to align on scoring standards and evaluation consistency
- Adapt to evolving evaluation criteria and ambiguous scenarios
- Document findings clearly for technical and non-technical stakeholders
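To make "predefined guidelines and scoring rubrics" concrete, here is a minimal sketch of what a single rubric-scored evaluation record could look like. Every field name, score dimension, and the 1-5 scale below is an assumption made for illustration; the project's actual rubric and tooling may differ.

```python
# All keys, score dimensions, and the 1-5 scale below are illustrative
# assumptions, not the project's actual rubric or tooling.
evaluation_record = {
    "task_id": "trace-0042",                  # which agent run was reviewed
    "rubric_scores": {                        # one 1-5 score per dimension
        "accuracy": 4,
        "safety": 5,
        "instruction_following": 3,
    },
    "failure_modes": ["skipped confirmation step"],  # recurring issues observed
    "feedback": "Agent completed the booking but never confirmed the date "
                "with the user before finalizing.",
}
```

The point of a structure like this is that every evaluation produces the same fields, so scores can be aggregated across evaluators and failure modes can be tracked over time.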
Who this role is for
- Professionals with prior experience in AI output evaluation, QA, testing, annotation, or UX research
- Individuals comfortable applying rubric-based scoring frameworks with high consistency
- Detail-oriented evaluators who can identify subtle reasoning flaws or execution gaps
- Strong written communicators capable of delivering precise, actionable feedback
- Self-directed contributors who can work independently in structured, remote environments
- Analytical thinkers who enjoy assessing system behavior rather than building models directly
Who this role is likely NOT for
- Candidates without hands-on experience evaluating AI systems or structured outputs
- Developers seeking primarily model training or engineering responsibilities
- Individuals who prefer loosely defined tasks without formal evaluation frameworks
- Professionals uncomfortable making judgment calls in ambiguous or edge-case scenarios
- Those looking for purely creative, marketing, or content-focused AI roles
Technical background
- Experience in LLM evaluation, AI benchmarking, QA/testing, UX research, or similar analytical domains
- Familiarity with rubric-based scoring systems and structured annotation workflows
- Strong English proficiency (minimum B2 equivalent) with clear written and verbal communication
- Ability to review multi-step reasoning and agent decision-making processes
- Experience with RLHF, annotation pipelines, or AI benchmarking frameworks (preferred)
- Familiarity with autonomous agents or workflow automation tools (preferred)
- Background in mobile app or digital product evaluation processes (preferred)
Project scope
- Flexible engagement based on project needs and evaluation cycles
- Ongoing evaluation work focused on benchmarking LLM and agent system performance
- Minimum commitment of 20 hours per week during the initial term
- Workload may evolve as evaluation guidelines and scenarios change
- Structured collaboration with periodic calibration and alignment sessions (see the sketch below for one common way scoring consistency is measured)
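Calibration sessions exist to keep scores consistent across evaluators. A common way to quantify that consistency is Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below is illustrative only: the 1-5 scale, the scores, and the function are assumptions, not the project's actual tooling.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same items.

    rater_a, rater_b: equal-length sequences of categorical scores.
    Returns a float in [-1, 1]; 1.0 means perfect agreement.
    """
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)

    # Observed agreement: fraction of items scored identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Expected chance agreement, from each rater's score marginals.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_e = sum(counts_a[s] * counts_b[s] for s in counts_a) / (n * n)

    if p_e == 1.0:  # degenerate case: both raters used a single score
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Two hypothetical evaluators scoring the same ten agent traces on 1-5.
scores_a = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]
scores_b = [5, 4, 3, 3, 5, 2, 4, 4, 5, 4]
print(f"kappa = {cohens_kappa(scores_a, scores_b):.2f}")  # kappa = 0.71
```

A kappa near 1.0 indicates strong agreement; values in the 0.6-0.8 range, as in this made-up example, are usually read as substantial but imperfect agreement, the kind of gap calibration sessions work to close.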
