Principal AWS DevOps Engineer

Overview

We are seeking an experienced AWS DevOps Engineer to build, automate, and manage the cloud infrastructure powering an AI-driven enterprise platform. The platform consists of Python-based backend services, AI/ML workloads, LLM integrations, ReactJS frontend applications, and cloud-native microservices deployed on AWS.
 
The ideal candidate will possess deep expertise in Infrastructure as Code (Terraform), CI/CD automation, container orchestration, cloud security, and scalable deployment architectures for AI/ML applications.

Job Description

Key Responsibilities
Cloud Infrastructure & Platform Engineering
  • Design, implement, and manage highly available AWS infrastructure supporting AI/ML workloads.
  • Architect cloud-native environments optimized for scalability, performance, reliability, and cost efficiency.
  • Manage AWS networking, security, storage, and compute services.
  • Support multi-environment deployments (Development, QA, UAT, Production).
Infrastructure as Code (IaC)
  • Develop and maintain infrastructure using Terraform.
  • Create reusable Terraform modules for cloud resources and platform services.
  • Automate provisioning and configuration management across environments.
  • Maintain version-controlled infrastructure repositories and deployment standards.
CI/CD Automation
  • Design and maintain CI/CD pipelines for:
    • Python backend services
    • ReactJS frontend applications
    • AI/ML model deployments
    • Infrastructure deployments using Terraform
  • Automate code quality checks, testing, security scans, container builds, and releases.
  • Implement GitOps and DevSecOps practices.
AI/ML Platform Operations
  • Deploy and manage AI/ML services on AWS.
  • Support model training, inference, and deployment workflows.
  • Manage GPU-enabled infrastructure where required.
  • Automate model packaging and deployment processes.
  • Integrate AI/ML services with enterprise applications and APIs.
Containerization & Orchestration
  • Build and manage containerized applications using Docker.
  • Deploy and manage workloads on Amazon EKS (Kubernetes).
  • Configure auto-scaling, rolling deployments, blue-green deployments, and canary releases.
  • Optimize container performance and resource utilization.
Monitoring, Logging & Reliability
  • Implement platform observability using monitoring and logging tools such as CloudWatch, Prometheus, Grafana, and ELK/OpenSearch.
  • Create dashboards and alerts for infrastructure, applications, APIs, and AI workloads.
  • Conduct root cause analysis and incident resolution.
  • Define and maintain SLOs, SLAs, and operational metrics.
Security & Compliance
  • Implement IAM policies, secrets management, encryption, and security controls.
  • Integrate vulnerability scanning and compliance checks into CI/CD pipelines.
  • Enforce security best practices across infrastructure, containers, and applications.
  • Support enterprise-grade governance and compliance requirements.
Required Technical Skills
Cloud Platform
Strong hands-on experience with AWS services including:
  • EC2
  • ECS/EKS
  • Lambda
  • S3
  • CloudFront
  • VPC
  • IAM
  • CloudWatch
  • SQS
  • API Gateway
  • Secrets Manager
  • AWS Systems Manager
Infrastructure as Code
  • Terraform (Mandatory)
  • Terraform Cloud/Enterprise (Preferred)
  • Remote State Management
  • Module Development
  • Environment Automation
CI/CD & DevOps
  • GitHub Actions
  • Jenkins
  • GitLab CI/CD
  • AWS CodePipeline
  • AWS CodeBuild
  • GitOps methodologies
Container & Orchestration
  • Docker
  • Kubernetes
  • Amazon EKS
  • Helm Charts
Backend Technologies
Experience supporting deployment and operations of:
  • Python
  • FastAPI
  • Flask
  • Django
  • REST APIs
  • Microservices Architecture
Frontend Technologies
Experience supporting deployment and release management of:
  • ReactJS
AI/ML & Data Platforms
Experience with one or more of:
  • OpenAI integrations
  • Vector Databases
  • ML Model Deployment Pipelines
  • MLflow
  • LangChain
  • RAG-based Architectures
Monitoring & Observability
  • CloudWatch
  • Prometheus
  • Grafana
  • ELK/OpenSearch
Scripting & Automation
  • Python
  • Bash/Shell Scripting
  • YAML
  • JSON
Preferred Experience
  • Experience managing AI-powered SaaS or enterprise platforms.
  • Experience deploying LLM-based applications and AI agents.
  • Experience supporting RAG architectures and vector databases.
  • Experience implementing MLOps best practices.
  • Experience working in Agile/Scrum product teams.
Key Deliverables
  • Fully automated cloud infrastructure using Terraform.
  • End-to-end CI/CD pipelines for Python backend and ReactJS frontend.
  • Secure and scalable deployment of AI/ML workloads.
  • High availability and reliability of production environments.
  • Continuous optimization of cloud cost, performance, and security.
Success Metrics
  • Reduction in infrastructure provisioning time.
  • Deployment frequency and release success rate.
  • Platform uptime and reliability.
  • Adherence to security and compliance requirements.
  • Cloud cost optimization.
  • Reduced operational overhead through automation.

Skills & Requirements

AWS, Terraform, Terraform Cloud, Infrastructure As Code, CI/CD, GitHub Actions, Jenkins, GitLab CI/CD, AWS CodePipeline, AWS CodeBuild, GitOps, DevSecOps, Docker, Kubernetes, Amazon EKS, Helm Charts, Python, FastAPI, Flask, Django, REST APIs, Microservices Architecture, ReactJS, OpenAI, Vector Databases, ML Model Deployment, MLflow, LangChain, RAG Architectures, CloudWatch, Prometheus, Grafana, ELK, OpenSearch, Bash Scripting, Shell Scripting, YAML, JSON, EC2, ECS, Lambda, S3, CloudFront, VPC, IAM, SQS, API Gateway, Secrets Manager, AWS Systems Manager, MLOps, AI/ML, LLM, AI Agents, Cloud Security, Monitoring, Logging, Observability, Automation, DevOps, Agile, Scrum.

 
 
