Job Description
The AI/ML Engineer will play a key role within the AI Center of Excellence (CoE), focusing on building, scaling, and maintaining robust ML and GenAI operational infrastructure. This position is responsible for developing and automating end-to-end machine learning pipelines, deploying models into production, and ensuring their performance and stability over time. The ideal candidate is a hands-on engineer with strong experience in MLOps, LLMOps, and cloud-native tools, and a passion for reliable, scalable, and efficient AI systems.
ACCOUNTABILITIES: (The primary functions, scope and responsibilities of the role)
Engineering and Operations:
- Develop, deploy, and maintain production-grade ML/GenAI pipelines using AWS cloud-native and open-source MLOps tools.
- Automate model training, evaluation, testing, deployment, and monitoring workflows.
- Implement LLMOps practices for prompt versioning, model tracking, and continuous evaluation of GenAI systems.
- Integrate ML systems with CI/CD pipelines and infrastructure-as-code tools.
- Support model inference at scale via APIs, containers, and microservices.
- Work closely with data engineering to ensure high-quality, real-time, and batch data availability for ML workflows.
- Ensure high availability, reliability, and performance of AI services in production environments.
- Maintain robust monitoring and observability across the AWS, Snowflake, Salesforce, and Oracle ecosystems.
- Implement feature stores and data versioning systems to ensure reproducible ML experiments and deployments.
- Deploy and optimize vector databases and embedding models for semantic search and RAG applications.
- Configure GPU-enabled cloud infrastructure and implement monitoring solutions to optimize resource utilization, cost, and performance for ML training and inference workloads.
- Establish automated model validation, testing, and rollback procedures for safe production deployments.
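To illustrate the validation-and-rollback responsibility above, the following is a minimal Python sketch of a promotion gate: a candidate model is deployed only if it matches or beats the current production model on a held-out evaluation set; otherwise the existing deployment is kept. The metric, threshold, and helper names are illustrative assumptions, not a prescribed implementation.

```python
# Minimal promotion-gate sketch; the metric choice, threshold, and helper
# names are illustrative assumptions, not a prescribed implementation.
from sklearn.metrics import f1_score


def evaluate(model, X_eval, y_eval) -> float:
    """Score a fitted model on a held-out evaluation set."""
    return f1_score(y_eval, model.predict(X_eval), average="macro")


def promote_or_keep(candidate, production, X_eval, y_eval, min_gain=0.0):
    """Promote the candidate only if it does not regress versus production."""
    cand_score = evaluate(candidate, X_eval, y_eval)
    prod_score = evaluate(production, X_eval, y_eval)
    if cand_score >= prod_score + min_gain:
        return candidate, "promote"
    return production, "keep-production"  # effectively a rollback of the candidate
```

In a real pipeline a gate like this would typically run inside the CI/CD workflow, with the losing model archived in the model registry so a later rollback is a metadata change rather than a retrain.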
Tooling and Infrastructure:
- Build and manage model registries, feature stores, and metadata tracking systems.
- Leverage containerization (e.g., Docker) and orchestration (e.g., Kubernetes, Airflow, Kubeflow) for scalable deployment.
- Implement role-based access control, auditing, and governance for ML infrastructure.
- Manage cost-effective cloud infrastructure using AWS.
- Build and maintain data quality monitoring systems with automated alerting for data drift and anomalies.
- Implement cost optimization strategies including auto-scaling, spot instances, and resource right-sizing for ML workloads.
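One way to realize the auto-scaling portion of the cost-optimization item above on AWS is to register a SageMaker endpoint variant with Application Auto Scaling and attach a target-tracking policy on invocations per instance. The sketch below is illustrative only; the endpoint name, variant name, capacity bounds, and target value are assumptions.

```python
# Sketch: target-tracking auto-scaling for a SageMaker endpoint variant.
# Endpoint/variant names and numeric values are illustrative assumptions.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/my-ml-endpoint/variant/AllTraffic"  # hypothetical endpoint

# Allow the variant to scale between 1 and 4 instances.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale out/in to hold roughly 70 invocations per instance per minute.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```

Target tracking scales the variant out when per-instance traffic exceeds the target and back in when it falls, and is usually paired with right-sized instance types and, for fault-tolerant batch work, spot capacity.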
Collaboration and Support:
- Partner with data engineers, data scientists, ML engineers, architects, software engineers, infrastructure and security teams to support scalable and efficient AI/ML workflows.
- Contribute to incident response, performance tuning, and continuous improvement of ML pipelines.
- Provide guidance and documentation to promote reproducibility and best practices across teams.
- Work as part of an agile development team and participate in planning and code reviews.
REQUIRED QUALIFICATIONS: (Minimum qualifications needed for this position including education, experience, certification, knowledge and/or physical requirements)
Knowledge of:
- Cloud-native AI/ML development with AWS.
- MLOps/LLMOps frameworks and lifecycle tools on AWS.
- Monitoring and observability platforms on AWS.
- ML model deployment strategies (e.g., batch, real-time, streaming).
- Feature stores and data versioning tools on AWS and Snowflake.
- Model serving frameworks such as Amazon SageMaker and Amazon Bedrock for scalable inference deployment.
- Vector databases and embedding deployment (e.g., Pinecone, Weaviate, FAISS, pgvector) for LLM and RAG applications (see the sketch after this list).
- LLMOps-specific tools including prompt management platforms and LLM serving optimization on AWS.
- Docker registries and artifact management.
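To ground the vector-database item in the list above, here is a minimal semantic-search sketch with FAISS: document embeddings are indexed for inner-product (cosine) similarity and queried for the top-k nearest neighbors. The embedding dimension and random vectors are placeholders; in practice the vectors would come from an embedding model.

```python
# Minimal FAISS semantic-search sketch; random vectors stand in for
# real document embeddings produced by an embedding model.
import faiss
import numpy as np

dim = 384                                # assumed embedding dimension
rng = np.random.default_rng(0)

doc_vectors = rng.random((1000, dim), dtype=np.float32)
faiss.normalize_L2(doc_vectors)          # normalize so inner product = cosine

index = faiss.IndexFlatIP(dim)           # exact inner-product index
index.add(doc_vectors)

query = rng.random((1, dim), dtype=np.float32)
faiss.normalize_L2(query)

scores, ids = index.search(query, 5)     # top-5 most similar documents
print(ids[0], scores[0])
```

A RAG application would add a retrieval step that feeds the top-scoring documents into the LLM prompt; a managed vector store such as Pinecone or pgvector replaces the in-memory index when persistence and scale are needed.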
Required Skills and Abilities:
- Strong Python programming and scripting skills.
- Hands-on experience deploying and managing ML/GenAI models in production.
- Experience with Docker, Kubernetes, and workflow orchestration tools like Airflow or Kubeflow.
- Proficiency in infrastructure-as-code tools (e.g., Terraform, CloudFormation).
- Ability to debug, troubleshoot, and optimize AI/ML pipelines and systems.
- Comfortable working in agile teams and collaborating cross-functionally.
- Proven ability to automate processes and build reusable ML operational frameworks.
- Experience with A/B testing frameworks and canary deployments for ML models in production environments (see the sketch after this list).
- Knowledge of GPU resource management and optimization for training and inference workloads.
- Understanding of data pipeline quality monitoring, drift detection, and automated retraining triggers.
- Experience with secrets management, role-based access control, and secure credential handling for ML systems.
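As a sketch of the canary-deployment skill listed above, a common AWS pattern is to host the current and candidate models as two production variants behind a single SageMaker endpoint and shift a small share of traffic to the candidate by adjusting variant weights. The endpoint and variant names below are hypothetical.

```python
# Sketch: shift ~10% of traffic to a canary variant on a SageMaker endpoint.
# Endpoint and variant names are hypothetical.
import boto3

sagemaker = boto3.client("sagemaker")

sagemaker.update_endpoint_weights_and_capacities(
    EndpointName="churn-model-endpoint",
    DesiredWeightsAndCapacities=[
        {"VariantName": "Production", "DesiredWeight": 90.0},
        {"VariantName": "Canary", "DesiredWeight": 10.0},
    ],
)
```

Traffic is split in proportion to the weights, so a promotion gradually raises the canary weight while error and latency metrics are monitored, and a rollback is a single call that restores the original weights.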
Education and/or Experience:
- Bachelor's degree in Computer Science, Engineering, or a related field (master's degree preferred).
- 2-3 years of experience in ML engineering, DevOps, or MLOps roles.
- Demonstrated experience managing production AI/ML workloads and systems.
PREFERRED QUALIFICATIONS: (Additional qualifications that may make a person even more effective in the role, but are not required for consideration)
- Experience with LLMOps and GenAI pipeline monitoring.
- Cloud certifications in AWS, Azure, or GCP.
- Experience supporting AI applications in regulated industries (e.g., healthcare, finance).
- Contributions to open-source MLOps tools or infrastructure projects.
- Experience with edge deployment and model optimization techniques (quantization, pruning, distillation).
- Knowledge of compliance frameworks (SOC2, GDPR, HIPAA) and security best practices for AI/ML systems.
- Experience with real-time streaming data pipelines (Kafka, Kinesis) and event-driven ML architectures.