Overview
We are seeking an experienced AWS DevOps Engineer to build, automate, and manage the cloud infrastructure powering an AI-driven enterprise platform. The platform consists of Python-based backend services, AI/ML workloads, LLM integrations, ReactJS frontend applications, and cloud-native microservices deployed on AWS.
The ideal candidate will possess deep expertise in Infrastructure as Code (Terraform), CI/CD automation, container orchestration, cloud security, and scalable deployment architectures for AI/ML applications.
Job Description
Key Responsibilities
Cloud Infrastructure & Platform Engineering
- Design, implement, and manage highly available AWS infrastructure supporting AI/ML workloads.
- Architect cloud-native environments optimized for scalability, performance, reliability, and cost efficiency.
- Manage AWS networking, security, storage, and compute services.
- Support multi-environment deployments (Development, QA, UAT, Production).
Infrastructure as Code (IaC)
- Develop and maintain infrastructure using Terraform.
- Create reusable Terraform modules for cloud resources and platform services.
- Automate provisioning and configuration management across environments.
- Maintain version-controlled infrastructure repositories and deployment standards.
CI/CD Automation
- Design and maintain CI/CD pipelines for:
  - Python backend services
  - ReactJS frontend applications
  - AI/ML model deployment pipelines
  - Infrastructure deployments using Terraform
- Automate code quality checks, testing, security scans, container builds, and releases.
- Implement GitOps and DevSecOps practices.
AI/ML Platform Operations
- Deploy and manage AI/ML services on AWS.
- Support model training, inference, and deployment workflows.
- Manage GPU-enabled infrastructure where required.
- Automate model packaging and deployment processes.
- Integrate AI/ML services with enterprise applications and APIs.
Containerization & Orchestration
- Build and manage containerized applications using Docker.
- Deploy and manage workloads on Amazon EKS (Kubernetes).
- Configure auto-scaling, rolling deployments, blue-green deployments, and canary releases.
- Optimize container performance and resource utilization.
Monitoring, Logging & Reliability
- Implement platform observability using monitoring and logging tools such as CloudWatch, Prometheus, Grafana, and ELK/OpenSearch.
- Create dashboards and alerts for infrastructure, applications, APIs, and AI workloads.
- Conduct root cause analysis and incident resolution.
- Define and maintain SLOs, SLAs, and operational metrics.
Security & Compliance
- Implement IAM policies, secrets management, encryption, and security controls.
- Integrate vulnerability scanning and compliance checks into CI/CD pipelines.
- Enforce security best practices across infrastructure, containers, and applications.
- Support enterprise-grade governance and compliance requirements.
Required Technical Skills
Cloud Platform
Strong hands-on experience with AWS services including:
- EC2
- ECS/EKS
- Lambda
- S3
- CloudFront
- VPC
- IAM
- CloudWatch
- SQS
- API Gateway
- Secrets Manager
- AWS Systems Manager
Infrastructure as Code
- Terraform (Mandatory)
- Terraform Cloud/Enterprise (Preferred)
- Remote State Management
- Module Development
- Environment Automation
CI/CD & DevOps
- GitHub Actions
- Jenkins
- GitLab CI/CD
- AWS CodePipeline
- AWS CodeBuild
- GitOps methodologies
Container & Orchestration
- Docker
- Kubernetes
- Amazon EKS
- Helm Charts
Backend Technologies
Experience supporting deployment and operations of:
- Python
- FastAPI
- Flask
- Django
- REST APIs
- Microservices Architecture
Frontend Technologies
Experience supporting deployment and release management of ReactJS frontend applications.
AI/ML & Data Platforms
Experience with one or more of:
- OpenAI integrations
- Vector Databases
- ML Model Deployment Pipelines
- MLflow
- LangChain
- RAG-based Architectures
Monitoring & Observability
- CloudWatch
- Prometheus
- Grafana
- ELK/OpenSearch
Scripting & Automation
- Python
- Bash/Shell Scripting
- YAML
- JSON
Preferred Experience
- Experience managing AI-powered SaaS or enterprise platforms.
- Experience deploying LLM-based applications and AI agents.
- Experience supporting RAG architectures and vector databases.
- Experience implementing MLOps best practices.
- Experience working in Agile/Scrum product teams.
Key Deliverables
- Fully automated cloud infrastructure using Terraform.
- End-to-end CI/CD pipelines for Python backend and ReactJS frontend.
- Secure and scalable deployment of AI/ML workloads.
- High availability and reliability of production environments.
- Continuous optimization of cloud cost, performance, and security.
Success Metrics
- Infrastructure provisioning time reduction.
- Deployment frequency and release success rate.
- Platform uptime and reliability.
- Security compliance adherence.
- Cloud cost optimization.
- Reduced operational overhead through automation.