Amazon EKS Best Practices: Optimizing Your Kubernetes Clusters

Learn essential best practices for designing, deploying, and managing Amazon EKS clusters, including security, scalability, and operational efficiency.

February 21, 2024
DevHub Team

Amazon Elastic Kubernetes Service (EKS) provides a managed Kubernetes service that simplifies container orchestration at scale. This guide explores best practices for optimizing your EKS clusters across security, scalability, and operational aspects.

Architecture overview (diagram):

    • VPC & Networking: two Availability Zones, each with a public subnet, a private subnet, and a managed node group
    • Control Plane & Security: EKS control plane, OIDC provider, IAM roles, security groups
    • Workload Management: resource management (HPA, VPA, Cluster Autoscaler) and storage solutions (EBS CSI, EFS CSI, S3)
    • Monitoring & Logging: CloudWatch, CloudTrail, Prometheus, Fluent Bit

Relationships shown: the EKS control plane connects to both managed node groups, to the OIDC provider (which maps to IAM roles), and to the monitoring and logging stack. Both node groups connect to the resource management components, which in turn connect to the storage solutions.

Cluster Design Best Practices

1. Node Group Configuration

    apiVersion:
    kind: NodeGroup
    metadata:
      name: production-nodes
    spec:
      clusterName: production-cluster
      nodeRole: arn:aws:iam::111122223333:role/eks-node-role
      subnets:
        - subnet-0123456789abcdef0
        - subnet-0123456789abcdef1
      instanceTypes:
        - m5.large
        - m5a.large
      scaling:
        minSize: 2
        maxSize: 10
        desiredSize: 3
      labels:
        role: application
        environment: production
      taints:
        - key: dedicated
          value: production
          effect: NoSchedule

2. High Availability Setup

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: sample-app
    spec:
      replicas: 3
      strategy:
        type: RollingUpdate
        rollingUpdate:
          maxSurge: 1        # add one new pod at a time
          maxUnavailable: 0  # never drop below the desired replica count
      # apps/v1 requires a selector that matches the template labels
      selector:
        matchLabels:
          app: sample-app
      template:
        metadata:
          labels:
            app: sample-app
        spec:
          topologySpreadConstraints:
            - maxSkew: 1
              topologyKey:
              whenUnsatisfiable: DoNotSchedule
              labelSelector:
                matchLabels:
                  app: sample-app
          affinity:
            podAntiAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 100
                  podAffinityTerm:
                    labelSelector:
                      matchExpressions:
                        - key: app
                          operator: In
                          values:
                            - sample-app
                    topologyKey:
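To make the effect of `maxSkew: 1` concrete, the following sketch computes which topology domains could accept one more replica. It is a simplified illustration, not the scheduler's implementation: it assumes the constraint spreads pods across Availability Zones (the manifest's `topologyKey` value is not shown), the zone names and the helper `allowed_zones` are hypothetical, and real scheduling also considers taints, node capacity, and other filters.

```python
from typing import Dict, List


def allowed_zones(pods_per_zone: Dict[str, int], max_skew: int = 1) -> List[str]:
    """Return the zones where placing one more pod keeps the skew within max_skew.

    Skew for a zone is (pods in that zone after placement) minus
    (the current minimum pod count across all zones).
    """
    global_min = min(pods_per_zone.values())
    return sorted(
        zone
        for zone, count in pods_per_zone.items()
        if (count + 1) - global_min <= max_skew
    )


# Hypothetical placement of three sample-app replicas plus one already surged pod.
current = {"zone-a": 2, "zone-b": 1, "zone-c": 1}
print(allowed_zones(current))
```

With `DoNotSchedule`, a pod that has no allowed domain stays pending, which is why the replica count and the number of zones should be chosen together.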

Security Best Practices

1. IAM Roles for Service Accounts (IRSA)

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: app-service-account
      annotations:
        # IRSA: pods using this service account assume the annotated IAM role
        eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/app-role
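The service account annotation is only half of IRSA: the IAM role must also trust the cluster's OIDC provider for this specific service account. The sketch below builds such a trust policy as a plain dictionary. The region, the OIDC provider ID, and the `default` namespace are placeholder assumptions, and `irsa_trust_policy` is a hypothetical helper, not an AWS API.

```python
import json
from typing import Any, Dict


def irsa_trust_policy(account_id: str, region: str, oidc_id: str,
                      namespace: str, service_account: str) -> Dict[str, Any]:
    """Build an IAM trust policy that lets one service account assume a role."""
    provider = f"oidc.eks.{region}.amazonaws.com/id/{oidc_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                # Restrict the role to exactly one namespace/service account pair.
                f"{provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
                f"{provider}:aud": "sts.amazonaws.com",
            }},
        }],
    }


# Placeholder values: region, OIDC ID, and namespace are assumptions.
policy = irsa_trust_policy(
    "111122223333", "us-east-1", "EXAMPLE0123456789", "default", "app-service-account"
)
print(json.dumps(policy, indent=2))
```

Scoping the `sub` condition to a single service account keeps other workloads in the cluster from assuming the same role.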

2. Network Policies

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: restrict-traffic
    spec:
      podSelector:
        matchLabels:
          app: secure-app
      policyTypes:
        - Ingress
        - Egress
      ingress:
        # Two separate peers: traffic is allowed if either one matches
        - from:
            - namespaceSelector:
                matchLabels:
                  environment: production
            - podSelector:
                matchLabels:
                  role: frontend
          ports:
            - protocol: TCP
              port: 8080
      egress:
        - to:
            - namespaceSelector:
                matchLabels:
                  environment: production
          ports:
            - protocol: TCP
              port: 5432
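A common point of confusion is that the two entries under `from` are alternatives, not a combined condition. The sketch below models the ingress rule of the policy above to show which sources are admitted. It is a simplified model under stated assumptions: the policy is assumed to live in the `default` namespace (the manifest does not set one), only `matchLabels` selectors are handled, and the function names are hypothetical.

```python
from typing import Dict, List, Optional

Labels = Dict[str, str]

# Ingress peers and port copied from the restrict-traffic policy.
INGRESS_PEERS: List[Dict[str, Labels]] = [
    {"namespaceSelector": {"environment": "production"}},
    {"podSelector": {"role": "frontend"}},
]
INGRESS_PORT = 8080
POLICY_NAMESPACE = "default"  # assumption: the manifest does not set a namespace


def matches(selector: Labels, labels: Labels) -> bool:
    """matchLabels semantics: every key/value in the selector must be present."""
    return all(labels.get(key) == value for key, value in selector.items())


def peer_matches(peer: Dict[str, Labels], src_namespace: str,
                 src_namespace_labels: Labels, src_pod_labels: Labels) -> bool:
    ns_selector: Optional[Labels] = peer.get("namespaceSelector")
    pod_selector: Optional[Labels] = peer.get("podSelector")
    if ns_selector is None:
        # A podSelector on its own only selects pods in the policy's namespace.
        namespace_ok = src_namespace == POLICY_NAMESPACE
    else:
        namespace_ok = matches(ns_selector, src_namespace_labels)
    pod_ok = pod_selector is None or matches(pod_selector, src_pod_labels)
    return namespace_ok and pod_ok


def ingress_allowed(src_namespace: str, src_namespace_labels: Labels,
                    src_pod_labels: Labels, port: int) -> bool:
    """Allow TCP traffic on the policy port if any peer matches the source."""
    if port != INGRESS_PORT:
        return False
    return any(
        peer_matches(peer, src_namespace, src_namespace_labels, src_pod_labels)
        for peer in INGRESS_PEERS
    )


# Any pod in a production-labelled namespace is admitted, frontend or not.
print(ingress_allowed("shop", {"environment": "production"}, {"role": "backend"}, 8080))
# A frontend pod in an unrelated namespace is not.
print(ingress_allowed("staging", {"environment": "staging"}, {"role": "frontend"}, 8080))
```

If the intent is "frontend pods in production namespaces only", both selectors would need to sit in the same list entry rather than in two separate ones.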

Resource Management

1. Resource Quotas

    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: compute-resources
    spec:
      hard:
        requests.cpu: "4"
        requests.memory: 8Gi
        limits.cpu: "8"
        limits.memory: 16Gi
        pods: "20"
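A quota like this caps how many replicas a namespace can hold, and whichever resource runs out first sets the limit. The sketch below estimates that ceiling for a given per-pod request. It assumes the namespace is otherwise empty and handles only the quantity suffixes used here (`m`, `Ki`, `Mi`, `Gi`); the helper names are hypothetical.

```python
from typing import Dict

# Request-side limits copied from the compute-resources quota.
QUOTA: Dict[str, str] = {"requests.cpu": "4", "requests.memory": "8Gi", "pods": "20"}

_MEMORY_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_cpu(quantity: str) -> int:
    """Convert a CPU quantity to millicores: '500m' -> 500, '4' -> 4000."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity with a binary suffix to bytes."""
    for suffix, factor in _MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[:-2]) * factor
    return int(quantity)  # plain bytes


def max_replicas(cpu_request: str, memory_request: str,
                 quota: Dict[str, str] = QUOTA) -> int:
    """Largest replica count whose total requests still fit the quota."""
    return min(
        parse_cpu(quota["requests.cpu"]) // parse_cpu(cpu_request),
        parse_memory(quota["requests.memory"]) // parse_memory(memory_request),
        int(quota["pods"]),
    )


print(max_replicas("500m", "1Gi"))     # CPU and memory are the binding limits
print(max_replicas("100m", "128Mi"))   # the pod count is the binding limit
```

Comparing this number with the autoscaler's maximum replica count helps catch cases where scaling would be blocked by the quota before the autoscaler's own limit is reached.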

2. Horizontal Pod Autoscaling

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: app-hpa
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: sample-app
      minReplicas: 2
      maxReplicas: 10
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70
        - type: Resource
          resource:
            name: memory
            target:
              type: Utilization
              averageUtilization: 80
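The HPA derives a replica count from each metric as ceil(currentReplicas × currentValue / targetValue) and acts on the largest result. The sketch below applies that rule to the CPU and memory targets above. It is a simplified model: it ignores the controller's tolerance band, stabilization windows, and pods that are not yet ready, and the utilization figures are made-up inputs.

```python
import math
from typing import Dict

# Targets and bounds copied from the app-hpa manifest.
TARGETS: Dict[str, int] = {"cpu": 70, "memory": 80}
MIN_REPLICAS = 2
MAX_REPLICAS = 10


def desired_replicas(current_replicas: int, utilization: Dict[str, float]) -> int:
    """Apply the per-metric scaling formula, take the maximum, then clamp."""
    proposals = [
        math.ceil(current_replicas * utilization[metric] / target)
        for metric, target in TARGETS.items()
    ]
    return max(MIN_REPLICAS, min(MAX_REPLICAS, max(proposals)))


# Hypothetical readings: CPU is above target, memory is below it.
print(desired_replicas(4, {"cpu": 90, "memory": 60}))
```

Because the largest proposal wins, a single hot metric is enough to trigger a scale-up, while scale-down requires every metric to be below its target.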

Monitoring and Logging

1. CloudWatch Container Insights

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cwagent-config
      namespace: amazon-cloudwatch
    data:
      cwagentconfig.json: |
        {
          "logs": {
            "metrics_collected": {
              "kubernetes": {
                "cluster_name": "production-cluster",
                "metrics_collection_interval": 60
              }
            },
            "force_flush_interval": 5
          }
        }

2. Prometheus and Grafana Setup

    apiVersion: monitoring.coreos.com/v1
    kind: ServiceMonitor
    metadata:
      name: app-monitor
    spec:
      selector:
        matchLabels:
          app: sample-app
      endpoints:
        - port: metrics
          interval: 15s
          path: /metrics

Cost Optimization

1. Spot Instances Configuration

    apiVersion:
    kind: NodeGroup
    metadata:
      name: spot-nodes
    spec:
      clusterName: production-cluster
      nodeRole: arn:aws:iam::111122223333:role/eks-node-role
      instanceTypes:
        - m5.large
        - m5a.large
        - m5d.large
      capacityType: SPOT
      scaling:
        minSize: 2
        maxSize: 10

2. Cluster Autoscaler Settings

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: cluster-autoscaler-config
    data:
      config.yaml: |
        scaleDownUnneededTime: 5m
        scaleDownDelayAfterAdd: 5m
        scaleDownUtilizationThreshold: 0.5
        skipNodesWithSystemPods: true
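The four settings above interact: a node is only removed when it is underused, has stayed that way long enough, and no recent scale-up is still settling. The sketch below models that decision. It is an illustration of the settings' intent rather than the autoscaler's actual code; in particular, treating any system pod as a blocker is a simplification, and the function names and sample numbers are hypothetical.

```python
from datetime import timedelta

# Values copied from the cluster-autoscaler-config settings.
SCALE_DOWN_UNNEEDED_TIME = timedelta(minutes=5)
SCALE_DOWN_DELAY_AFTER_ADD = timedelta(minutes=5)
SCALE_DOWN_UTILIZATION_THRESHOLD = 0.5
SKIP_NODES_WITH_SYSTEM_PODS = True


def node_utilization(cpu_requested: float, cpu_allocatable: float,
                     memory_requested: float, memory_allocatable: float) -> float:
    """Utilization is based on pod requests; the busier resource decides."""
    return max(cpu_requested / cpu_allocatable, memory_requested / memory_allocatable)


def is_scale_down_candidate(utilization: float, unneeded_for: timedelta,
                            since_last_scale_up: timedelta,
                            has_system_pods: bool) -> bool:
    """Return True when a node could be removed under the settings above."""
    if SKIP_NODES_WITH_SYSTEM_PODS and has_system_pods:
        return False
    if since_last_scale_up < SCALE_DOWN_DELAY_AFTER_ADD:
        return False  # a recent scale-up pauses scale-down evaluation
    return (
        utilization < SCALE_DOWN_UTILIZATION_THRESHOLD
        and unneeded_for >= SCALE_DOWN_UNNEEDED_TIME
    )


# Hypothetical node: 500m of 2000m CPU and 1 of 8 GiB memory requested.
usage = node_utilization(500, 2000, 1, 8)
print(usage, is_scale_down_candidate(
    usage, timedelta(minutes=6), timedelta(minutes=10), has_system_pods=False
))
```

Short windows such as 5m reclaim capacity quickly but can cause churn for bursty workloads, so they are worth tuning against observed traffic patterns.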

Operational Excellence

1. GitOps Implementation

    apiVersion:
    kind: GitRepository
    metadata:
      name: app-source
    spec:
      interval: 1m
      url:
      ref:
        branch: main
    ---
    apiVersion:
    kind: Kustomization
    metadata:
      name: app-config
    spec:
      interval: 10m
      path: ./overlays/production
      prune: true
      sourceRef:
        kind: GitRepository
        name: app-source

2. Backup and Disaster Recovery

    apiVersion: velero.io/v1
    kind: Schedule
    metadata:
      name: daily-backup
    spec:
      schedule: "0 0 * * *"   # every day at 00:00
      template:
        includedNamespaces:
          - production
        storageLocation: default
        volumeSnapshotLocations:
          - default
        ttl: 720h             # keep each backup for 30 days
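The schedule and `ttl` together determine how far back a restore can reach and how many backups accumulate in storage. The sketch below works through that arithmetic for a daily schedule with a 720h retention period. The function names and the sample timestamp are hypothetical, and the calculation assumes every scheduled backup succeeds.

```python
from datetime import datetime, timedelta, timezone

TTL = timedelta(hours=720)            # ttl from the daily-backup schedule
BACKUP_INTERVAL = timedelta(days=1)   # "0 0 * * *" runs once per day


def expires_at(created: datetime) -> datetime:
    """A backup becomes eligible for deletion once its TTL has elapsed."""
    return created + TTL


def retained_backups(ttl: timedelta = TTL,
                     interval: timedelta = BACKUP_INTERVAL) -> int:
    """Approximate number of backups kept once the schedule reaches steady state."""
    return ttl // interval


created = datetime(2024, 2, 21, tzinfo=timezone.utc)
print(retained_backups(), expires_at(created).isoformat())
```

Multiplying the retained count by the typical backup size gives a rough storage estimate, which is useful when weighing a longer `ttl` against cost.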

Best Practices Checklist

  1. Cluster Management

    • Use managed node groups
    • Implement multi-AZ deployment
    • Keep control plane updated
    • Use cluster autoscaling
  2. Security

    • Enable IRSA
    • Implement network policies
    • Use security groups
    • Run regular security audits
  3. Resource Management

    • Set resource quotas
    • Configure HPA/VPA
    • Implement pod disruption budgets
    • Use efficient pod scheduling
  4. Monitoring

    • Enable Container Insights
    • Set up Prometheus/Grafana
    • Configure alerts
    • Implement logging
  5. Cost Optimization

    • Use Spot Instances
    • Implement autoscaling
    • Review costs regularly
    • Clean up unused resources

Troubleshooting Guide

Common issues and what to check first:

  1. Node Group Issues

    • Check IAM roles
    • Verify security groups
    • Review capacity issues
  2. Networking Problems

    • Validate CNI configuration
    • Check network policies
    • Review service mesh setup
  3. Resource Constraints

    • Monitor resource usage
    • Review quota limits
    • Check scaling policies

