Lead QA - Automated Testing & AI Validation

Overview

This organization is a global managed services provider delivering end-to-end IT and business solutions to enterprises. Its offerings include consulting, custom application development, AI, machine learning, data science, and technology operations. The company was formed through the merger of two established entities and focuses on leveraging advanced technologies—especially AI—to drive business outcomes and long-term value for clients.

It serves a diverse range of industries and emphasizes strong partnerships, adaptability, and delivering ethical and socially responsible solutions.

Job Description

  • 6+ years QA experience with 3+ years leading test automation initiatives
  • Expert knowledge of LLM evaluation frameworks and methodologies:
    • LLM-as-a-Judge techniques (G-Eval, custom evaluators) for semantic quality assessment
    • Observability platforms: LangFuse, LangSmith, or similar for trace analysis and monitoring
    • Evaluation metrics: Hallucination detection, faithfulness, answer relevance, context precision/recall
    • RAG evaluation: Experience with RAGAS or similar frameworks for retrieval-augmented systems
  • Strong Python test automation (Pytest, DeepEval) integrated with CI/CD (GitHub Actions)
  • Experience designing evaluation pipelines for multi-agent systems: agent collaboration testing, tool usage validation, reasoning chain analysis
  • Adversarial and red-team testing: Prompt injection, jailbreak attempts, bias/toxicity detection
  • API testing expertise for microservices validation (REST, async workflows)
  • Azure monitoring experience (Application Insights, Log Analytics)
  • Can define quality metrics for generative AI outputs and build automated scoring systems
  • Experience with synthetic dataset generation and golden dataset management

Skills & Requirements

  • Test Automation (Python, PyTest, DeepEval)

  • LLM Evaluation (G-Eval, custom evaluators, LLM-as-a-Judge)

  • RAG Evaluation (RAGAS, retrieval quality metrics)

  • Evaluation Metrics (hallucination detection, faithfulness, relevance, precision/recall)

  • Observability & Monitoring (LangFuse, LangSmith)

  • CI/CD Integration (GitHub Actions)

  • Multi-Agent System Testing (reasoning chains, tool-use validation)

  • Adversarial/Red-Team Testing (prompt injection, jailbreaks, bias/toxicity testing)

  • API Testing (REST, async workflows)

  • Azure Monitoring (App Insights, Log Analytics)

  • Synthetic & Golden Dataset Management

  • Automated Scoring System Design for GenAI Outputs

Join Our Community

Let us know the skills you need and we'll find the best talent for you