Amazon EKS Best Practices: Optimizing Your Kubernetes Clusters
Learn essential best practices for designing, deploying, and managing Amazon EKS clusters, covering security, scalability, and operational efficiency.
February 21, 2024
DevHub Team
Amazon Elastic Kubernetes Service (EKS) provides a managed Kubernetes service that simplifies container orchestration at scale. This guide explores best practices for optimizing your EKS clusters across security, scalability, and operational aspects.
%%{init: {'theme': 'base', 'themeVariables': { 'background': 'transparent', 'primaryColor': '#FF9900', 'primaryTextColor': '#232F3E', 'lineColor': '#147EB4' }}}%%
flowchart TB
subgraph Network["VPC & Networking"]
direction TB
subgraph AZ1["Availability Zone 1"]
direction TB
Subnet1["fa:fa-network-wired Public Subnet"]
Subnet2["fa:fa-network-wired Private Subnet"]
MNG1["fa:fa-server Managed Node Group"]
end
subgraph AZ2["Availability Zone 2"]
direction TB
Subnet3["fa:fa-network-wired Public Subnet"]
Subnet4["fa:fa-network-wired Private Subnet"]
MNG2["fa:fa-server Managed Node Group"]
end
end
subgraph Control["Control Plane & Security"]
direction LR
EKS["fa:fa-cogs EKS Control Plane"]
OIDC["fa:fa-key OIDC Provider"]
IAM["fa:fa-shield-alt IAM Roles"]
SG["fa:fa-lock Security Groups"]
end
subgraph Workloads["Workload Management"]
direction TB
subgraph Resources["Resource Management"]
direction LR
HPA["fa:fa-balance-scale HPA"]
VPA["fa:fa-arrows-alt VPA"]
CA["fa:fa-expand Cluster Autoscaler"]
end
subgraph Storage["Storage Solutions"]
direction LR
EBS["fa:fa-hdd EBS CSI"]
EFS["fa:fa-folder EFS CSI"]
S3["fa:fa-database S3"]
end
end
subgraph Observability["Monitoring & Logging"]
direction LR
CW["fa:fa-chart-line CloudWatch"]
CT["fa:fa-history CloudTrail"]
Prometheus["fa:fa-chart-bar Prometheus"]
Fluentbit["fa:fa-stream Fluent Bit"]
end
EKS --> MNG1
EKS --> MNG2
EKS --> OIDC
OIDC --> IAM
MNG1 --> Resources
MNG2 --> Resources
Resources --> Storage
EKS --> Observability
%% Styling
classDef networkNode fill:#FF9900,stroke:#FF9900,color:#232F3E,stroke-width:2px
classDef controlNode fill:#232F3E,stroke:#232F3E,color:#FFFFFF,stroke-width:2px
classDef workloadNode fill:#147EB4,stroke:#147EB4,color:#FFFFFF,stroke-width:2px
classDef observeNode fill:#147EB4,stroke:#147EB4,color:#FFFFFF,stroke-width:2px
classDef groupStyle fill:transparent,stroke:#147EB4,stroke-width:2px
class Subnet1,Subnet2,Subnet3,Subnet4,MNG1,MNG2 networkNode
class EKS,OIDC,IAM,SG controlNode
class HPA,VPA,CA,EBS,EFS,S3 workloadNode
class CW,CT,Prometheus,Fluentbit observeNode
class Network,Control,Workloads,Resources,Storage,Observability groupStyle
Cluster Design Best Practices
1. Node Group Configuration
# Illustrative schema: confirm apiVersion, kind, and field names against your
# provisioning tool (for example eksctl or the ACK EKS controller).
apiVersion: eks.amazonaws.com/v1alpha1
kind: NodeGroup
metadata:
  name: production-nodes
spec:
  clusterName: production-cluster
  nodeRole: arn:aws:iam::111122223333:role/eks-node-role
  subnets:
    - subnet-0123456789abcdef0
    - subnet-0123456789abcdef1
  instanceTypes:
    - m5.large
    - m5a.large
  scaling:
    minSize: 2
    maxSize: 10
    desiredSize: 3
  labels:
    role: application
    environment: production
  taints:            # only pods that tolerate this taint are scheduled here
    - key: dedicated
      value: production
      effect: NoSchedule
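Because the node group above carries the taint dedicated=production:NoSchedule, a pod needs a matching toleration before the scheduler will place it there, plus a node selector if it should run only on those nodes. A minimal sketch, assuming the labels and taint from the node group definition are applied to the nodes; the pod name, container name, and image are hypothetical placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: dedicated-node-demo      # hypothetical name
spec:
  nodeSelector:                  # restrict the pod to the labeled nodes
    role: application
    environment: production
  tolerations:                   # allow scheduling onto the tainted nodes
    - key: dedicated
      operator: Equal
      value: production
      effect: NoSchedule
  containers:
    - name: app                  # hypothetical container
      image: nginx:stable        # placeholder image
```

The toleration alone only permits scheduling on the tainted nodes; it is the node selector that keeps the pod off other node groups.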
2. High Availability Setup
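apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample-app
spec:
  replicas: 3
  selector:                      # required by apps/v1; must match the template labels
    matchLabels:
      app: sample-app
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0
  template:
    metadata:
      labels:
        app: sample-app
    spec:
      topologySpreadConstraints:   # spread replicas evenly across Availability Zones
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: sample-app
      affinity:
        podAntiAffinity:           # prefer placing replicas on different nodes
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: app
                      operator: In
                      values:
                        - sample-app
                topologyKey: kubernetes.io/hostname
      containers:
        - name: sample-app
          image: sample-app:1.0.0  # placeholder image (assumption); replace with your own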
Security Best Practices
1. IAM Roles for Service Accounts (IRSA)
apiVersion: v1
kind: ServiceAccount
metadata:
  name: app-service-account
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/app-role
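The role annotation only takes effect for pods that run under this service account. A minimal sketch of a pod that opts in and prints the identity it receives, assuming the ServiceAccount above exists in the same namespace; the pod name, container name, and image choice are hypothetical:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: irsa-demo                        # hypothetical name
spec:
  serviceAccountName: app-service-account
  restartPolicy: Never
  containers:
    - name: aws-cli                      # hypothetical container
      image: amazon/aws-cli:latest       # placeholder image; its entrypoint is the aws CLI
      args: ["sts", "get-caller-identity"]
```

If the cluster's OIDC provider is registered in IAM and the role's trust policy allows this service account, the pod log should show the assumed app-role rather than the node's instance role.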
2. Network Policies
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-traffic
spec:
  podSelector:
    matchLabels:
      app: secure-app
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        # Separate list items are ORed: any pod in a namespace labeled
        # environment=production, or any role=frontend pod in this namespace.
        - namespaceSelector:
            matchLabels:
              environment: production
        - podSelector:
            matchLabels:
              role: frontend
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              environment: production
      ports:
        - protocol: TCP
          port: 5432
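Once a policy lists Egress in policyTypes, all outbound traffic that is not explicitly allowed is dropped for the selected pods, and that includes DNS lookups. With only the port 5432 rule above, secure-app would typically be unable to resolve service names. A minimal sketch of a companion policy that allows DNS, assuming cluster DNS runs in kube-system and the cluster sets the kubernetes.io/metadata.name namespace label automatically; the policy name is hypothetical:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress         # hypothetical name
spec:
  podSelector:
    matchLabels:
      app: secure-app
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:                     # DNS uses UDP and falls back to TCP
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```

Network policies are additive, so this rule widens what restrict-traffic allows without replacing it.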
Resource Management
1. Resource Quotas
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"
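When a quota constrains CPU or memory requests and limits, the API server rejects pods in that namespace that do not set those values. A LimitRange can supply defaults so that such pods are still admitted. A minimal sketch, in which the object name and the default values are illustrative assumptions rather than recommendations:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-resources   # hypothetical name
spec:
  limits:
    - type: Container
      defaultRequest:                  # applied when a container sets no requests
        cpu: 100m
        memory: 128Mi
      default:                         # applied when a container sets no limits
        cpu: 500m
        memory: 512Mi
```

The defaults count against the quota, so 20 pods at these values would stay within the 4 CPU and 8Gi request budget above.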
2. Horizontal Pod Autoscaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
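The architecture diagram and the checklist also mention the Vertical Pod Autoscaler. VPA is not part of core Kubernetes and has to be installed separately. A minimal sketch, assuming the VPA components are installed in the cluster; it runs in recommendation-only mode so that it does not fight the HPA above, which already scales on CPU and memory. The object name is hypothetical:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa                  # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sample-app
  updatePolicy:
    updateMode: "Off"            # compute recommendations only; never evict pods
```

In this mode the recommendations appear in the object's status and can be used to tune the Deployment's resource requests by hand.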
Monitoring and Logging
1. CloudWatch Container Insights
apiVersion: v1
kind: ConfigMap
metadata:
  name: cwagent-config
  namespace: amazon-cloudwatch
data:
  cwagentconfig.json: |
    {
      "logs": {
        "metrics_collected": {
          "kubernetes": {
            "cluster_name": "production-cluster",
            "metrics_collection_interval": 60
          }
        },
        "force_flush_interval": 5
      }
    }
2. Prometheus and Grafana Setup
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-monitor
spec:
  selector:
    matchLabels:
      app: sample-app
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
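Scraping metrics is only half of the setup; the checklist also calls for alerts. With the Prometheus Operator, which provides the ServiceMonitor kind used above, alerting rules are declared as PrometheusRule objects. A minimal sketch, assuming kube-state-metrics is installed and that your Prometheus instance selects this rule (operators often filter rules by label); the rule name, alert name, threshold, and duration are illustrative assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts               # hypothetical name
spec:
  groups:
    - name: sample-app.rules
      rules:
        - alert: SampleAppReplicasUnavailable
          # Metric exposed by kube-state-metrics (assumed to be installed)
          expr: kube_deployment_status_replicas_unavailable{deployment="sample-app"} > 0
          for: 10m               # fire only if the condition holds for 10 minutes
          labels:
            severity: warning
          annotations:
            summary: sample-app has had unavailable replicas for 10 minutes
```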
Cost Optimization
1. Spot Instances Configuration
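# Illustrative schema: confirm apiVersion, kind, and field names against your
# provisioning tool (for example eksctl or the ACK EKS controller).
apiVersion: eks.amazonaws.com/v1alpha1
kind: NodeGroup
metadata:
  name: spot-nodes
spec:
  clusterName: production-cluster
  nodeRole: arn:aws:iam::111122223333:role/eks-node-role
  instanceTypes:       # several similar types widen the available Spot capacity pools
    - m5.large
    - m5a.large
    - m5d.large
  capacityType: SPOT
  scaling:
    minSize: 2
    maxSize: 10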
2. Cluster Autoscaler Settings
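# Upstream Cluster Autoscaler takes these settings as command-line flags on its
# Deployment; treat this ConfigMap as a record of the intended values.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-config
data:
  config.yaml: |
    scaleDownUnneededTime: 5m
    scaleDownDelayAfterAdd: 5m
    scaleDownUtilizationThreshold: 0.5
    skipNodesWithSystemPods: true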
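Cluster Autoscaler itself is configured through command-line flags on its Deployment, so the values above need to reach the container as arguments. A minimal sketch of the relevant fragment of the Deployment's pod spec, assuming the standard upstream binary and flag names; the container name is an assumption, and the image and node group discovery flags are omitted because they depend on your cluster:

```yaml
# Fragment of the cluster-autoscaler Deployment's pod spec (not a complete manifest)
containers:
  - name: cluster-autoscaler     # assumed container name
    command:
      - ./cluster-autoscaler
      - --cloud-provider=aws
      - --scale-down-unneeded-time=5m
      - --scale-down-delay-after-add=5m
      - --scale-down-utilization-threshold=0.5
      - --skip-nodes-with-system-pods=true
```

If you install Cluster Autoscaler through a Helm chart, the same settings are usually passed as chart values instead of editing the Deployment directly.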
Operational Excellence
1. GitOps Implementation
apiVersion: source.toolkit.fluxcd.io/v1beta2
kind: GitRepository
metadata:
  name: app-source
spec:
  interval: 1m
  url: https://github.com/org/app-config
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1beta2
kind: Kustomization
metadata:
  name: app-config
spec:
  interval: 10m
  path: ./overlays/production
  prune: true
  sourceRef:
    kind: GitRepository
    name: app-source
2. Backup and Disaster Recovery
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-backup
spec:
  schedule: "0 0 * * *"      # every day at 00:00
  template:
    includedNamespaces:
      - production
    storageLocation: default
    volumeSnapshotLocations:
      - default
    ttl: 720h                # keep each backup for 30 days
Best Practices Checklist
- Cluster Management
  - Use managed node groups
  - Implement multi-AZ deployment
  - Keep control plane updated
  - Use cluster autoscaling
- Security
  - Enable IRSA
  - Implement network policies
  - Use security groups
  - Regular security audits
- Resource Management
  - Set resource quotas
  - Configure HPA/VPA
  - Implement pod disruption budgets
  - Use efficient pod scheduling
- Monitoring
  - Enable Container Insights
  - Set up Prometheus/Grafana
  - Configure alerts
  - Implement logging
- Cost Optimization
  - Use Spot Instances
  - Implement autoscaling
  - Regular cost analysis
  - Resource cleanup
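The checklist mentions pod disruption budgets, which protect workloads during voluntary disruptions such as node drains for upgrades or Cluster Autoscaler scale-down. A minimal sketch for the sample-app Deployment used earlier; the object name is hypothetical, and maxUnavailable is chosen over minAvailable here so that evictions are never blocked entirely when the HPA scales the Deployment down to its minimum of two replicas:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: sample-app-pdb           # hypothetical name
spec:
  maxUnavailable: 1              # evict at most one sample-app pod at a time
  selector:
    matchLabels:
      app: sample-app
```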
Troubleshooting Guide
Common issues and solutions:
- Node Group Issues
  - Check IAM roles
  - Verify security groups
  - Review capacity issues
- Networking Problems
  - Validate CNI configuration
  - Check network policies
  - Review service mesh setup
- Resource Constraints
  - Monitor resource usage
  - Review quota limits
  - Check scaling policies