About FM:
FM is a 190-year-old, Fortune 500 commercial property insurance company of 6,000+ employees with a unique focus on science and risk engineering. Serving over a quarter of the Fortune 500 and major corporations globally, FM delivers data-driven strategies that enhance resilience, ensure business continuity, and empower organizations to thrive.

FM India, located in Bengaluru, is a strategic location for driving FM's global operational efficiency. It allows FM to leverage the country's talented workforce and advance its capabilities to serve clients better.

Job Title: GenAI Engineer (AI Evaluation Engineer)
Reports To: Manager of GenAI Engineering

Job Summary:
The GenAI Evaluation Engineer leads the design, implementation, and operation of enterprise-grade evaluation, quality, and governance frameworks for Generative AI systems in a highly regulated, responsible AI environment. This role ensures the quality, reliability, safety, and performance of LLMs, vision models, RAG pipelines, and agentic workflows deployed in production.

Building on strong GenAI engineering foundations, this position focuses on AI-specific testing, experimentation, automation, and continuous evaluation pipelines, integrating quality gates into CI/CD workflows and aligning GenAI solutions with enterprise architecture, compliance, and risk standards. The GenAI Evaluation Engineer partners closely with product, data science, ML engineering, and platform teams to drive trustworthy, scalable, and production-ready AI systems.

Essential Functions & Responsibilities:
GenAI Application Design & Implementation:
Design and implement comprehensive AI evaluation and experimentation frameworks for LLMs, vision models, RAG pipelines, and agentic workflows.
Build automated evaluation systems to assess model outputs for accuracy, relevance, bias, hallucinations, safety, and regression stability.
Develop quality benchmarks and continuous testing pipelines covering content quality, safety, alignment, and enterprise compliance.
Establish AI-specific quality gates and acceptance criteria integrated into Agile sprints and CI/CD pipelines.
Design, develop, and evaluate data pipelines and RAG workflows using Promptflow, Azure AI Search, Azure Data Factory (ADF) pipelines, Databricks, Spark, and vector databases.
Validate prompt engineering strategies, prompt consistency, and inference pipelines using GenAI-specific testing tools.
Perform prompt-based scenario testing, hallucination detection, and regression validation across model versions.
System Support & Operational Excellence:
Develop and maintain monitoring capabilities for model drift detection, data quality validation, inference latency, and system reliability.