NexusCS

AWS EKS

Kubernetes
Amazon Elastic Kubernetes Service (EKS) - Fully managed Kubernetes on AWS. Reference for EKS architecture, node types (EC2/Fargate), IAM/RBAC, networking, storage, add-ons, and troubleshooting common issues.

Getting started

Introduction

Amazon Elastic Kubernetes Service (EKS) is a fully managed Kubernetes service by AWS. AWS manages the control plane (API server, etcd, scheduler, controller manager) while you manage worker nodes.

What is EKS

  • Managed Control Plane: AWS handles patching, scaling, and high availability
  • Two Modes: Standard EKS (AWS manages the control plane) and EKS Auto Mode (AWS manages the control plane and the data plane)
  • AWS Integration: Works with IAM, VPC, EBS, EFS, ALB, NLB
  • Multi-AZ: API server runs across multiple availability zones

Architecture Overview

Control Plane (AWS-Managed):
  - API Server (Multi-AZ)
  - etcd
  - Scheduler
  - Controller Manager

Data Plane (Customer-Managed):
  - EC2 Managed Nodes
  - EC2 Self-Managed Nodes
  - Fargate (Serverless)

Quick Start

# Create cluster
aws eks create-cluster \
  --name my-cluster \
  --role-arn arn:aws:iam::123456789012:role/eksServiceRole \
  --resources-vpc-config subnetIds=subnet-xxx,subnet-yyy

# Update kubeconfig
aws eks update-kubeconfig --name my-cluster

# Verify connection
kubectl get svc

VPC Requirements

Requirement        Value
Minimum Subnets    2 (in different AZs)
IPs per Subnet     8+ available (16 recommended)
Upgrade IPs        Up to 5 for cluster upgrades
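A minimal sketch of checking a subnet plan against the table above. The function name, the `(availability_zone, free_ip_count)` input shape, and the zone names are illustrative assumptions; the free IP counts would come from something like `aws ec2 describe-subnets`.

```python
# Thresholds taken from the VPC requirements table above.
MIN_SUBNETS = 2
MIN_FREE_IPS = 8
RECOMMENDED_FREE_IPS = 16


def check_subnets(subnets):
    """Return a list of problems for (availability_zone, free_ip_count) pairs."""
    problems = []
    zones = {zone for zone, _ in subnets}
    # Both the subnet count and the number of distinct AZs must reach the minimum.
    if len(subnets) < MIN_SUBNETS or len(zones) < MIN_SUBNETS:
        problems.append("need at least 2 subnets in different AZs")
    for zone, free_ips in subnets:
        if free_ips < MIN_FREE_IPS:
            problems.append(f"{zone}: {free_ips} free IPs, minimum is {MIN_FREE_IPS}")
        elif free_ips < RECOMMENDED_FREE_IPS:
            problems.append(f"{zone}: {free_ips} free IPs, {RECOMMENDED_FREE_IPS} recommended")
    return problems


# Hypothetical plan: two subnets in different zones, one below the recommendation.
print(check_subnets([("us-east-1a", 20), ("us-east-1b", 12)]))
```

An empty list means the plan meets both the minimum and the recommended values.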

Versions

Support Types

Type        Duration      Cost          Details
Standard    14 months     Included      After version release
Extended    +12 months    Extra cost    Total 26 months
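As a rough illustration of the durations above, the sketch below projects the end of standard and extended support from a release date. The release date in the example is hypothetical, and the calendar-month arithmetic is an assumption; the dates AWS actually publishes for each version take precedence.

```python
import calendar
from datetime import date

STANDARD_MONTHS = 14  # standard support, from the table above
EXTENDED_MONTHS = 12  # extended support adds 12 more months (26 total)


def add_months(start, months):
    """Return the date `months` calendar months after `start`, clamping the day."""
    total = start.year * 12 + (start.month - 1) + months
    year, month_index = divmod(total, 12)
    month = month_index + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def support_windows(release):
    """Return (end of standard support, end of extended support)."""
    standard_end = add_months(release, STANDARD_MONTHS)
    extended_end = add_months(standard_end, EXTENDED_MONTHS)
    return standard_end, extended_end


# Hypothetical release date, used only to show the arithmetic.
standard_end, extended_end = support_windows(date(2024, 1, 31))
print(standard_end, extended_end)
```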

Currently Available

Standard Support:

  • 1.34
  • 1.33
  • 1.32

Extended Support:

  • 1.31
  • 1.30
  • 1.29

Upgrade Rules

# Can only upgrade one minor version at a time
1.28 → 1.29 ✓
1.28 → 1.30 ✗ (must go 1.28 → 1.29 → 1.30)

# Auto-upgrade when extended support ends
# (control plane only, not worker nodes)
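The one-minor-version-at-a-time rule can be sketched as a small helper that lists each intermediate upgrade. The function name is illustrative, and the checks for a matching major version and for downgrades are assumptions added for safety.

```python
def upgrade_path(current, target):
    """List every version to pass through, one minor version per upgrade."""
    cur_major, cur_minor = (int(part) for part in current.split("."))
    tgt_major, tgt_minor = (int(part) for part in target.split("."))
    if cur_major != tgt_major:
        raise ValueError("current and target must share a major version")
    if tgt_minor < cur_minor:
        raise ValueError("target is older than current")
    # Each step moves exactly one minor version forward.
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]


print(upgrade_path("1.28", "1.30"))  # ['1.29', '1.30']
```

Each entry corresponds to one `aws eks update-cluster-version` call, run only after the previous upgrade has finished.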

Version Skew

Kubernetes Version    Max kubelet Lag
1.28+                 3 minor versions behind
Before 1.28           2 minor versions behind
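A minimal sketch of the skew table as a check, assuming 1.x version strings. The function names are illustrative; the rule that a kubelet may not be newer than the control plane comes from the upstream Kubernetes version skew policy rather than from this table.

```python
def max_kubelet_lag(control_plane_minor):
    """Allowed kubelet lag in minor versions, per the table above."""
    return 3 if control_plane_minor >= 28 else 2


def kubelet_skew_ok(control_plane, kubelet):
    """Return True if the kubelet version is within the supported skew."""
    cp_minor = int(control_plane.split(".")[1])
    kubelet_minor = int(kubelet.split(".")[1])
    lag = cp_minor - kubelet_minor
    # A negative lag means the kubelet is newer than the control plane.
    return 0 <= lag <= max_kubelet_lag(cp_minor)


print(kubelet_skew_ok("1.30", "1.27"))  # True: 3 behind, control plane is 1.28+
print(kubelet_skew_ok("1.27", "1.24"))  # False: 3 behind, only 2 allowed
```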

Upgrade Process

  1. New API server nodes launched
  2. Health checks performed
  3. Old nodes replaced
  4. Rolling update (cannot pause/stop)
  5. Requires up to 5 available IPs in subnets

# Update cluster version
aws eks update-cluster-version \
  --name my-cluster \
  --kubernetes-version 1.30

Upgrade Insights

  • Automatically scans for deprecated API usage
  • Identifies upgrade blockers
  • Refreshes every 24 hours
  • Cannot upgrade if deprecated APIs used in last 30 days

Node Types

Managed Node Groups (EC2)

# Create managed node group
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name my-nodes \
  --node-role arn:aws:iam::123456789012:role/NodeRole \
  --subnets subnet-xxx subnet-yyy \
  --instance-types t3.medium

Features:

  • AWS automates provisioning and lifecycle
  • Part of EC2 Auto Scaling group
  • Labeled with eks.amazonaws.com/capacityType

Allocation Strategies:

  • On-Demand: prioritized
  • Spot: price-capacity-optimized (K8s 1.28+) or capacity-optimized (1.27 and earlier)

Self-Managed Nodes (EC2)

# Manual management required
# More control over configuration
# Requires manual updates

Use Cases:

  • Custom AMIs
  • Specific instance configurations
  • Advanced networking requirements

Fargate (Serverless)

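# Fargate profile (created through the EKS API; it is not a Kubernetes object)
# The pod execution role ARN is a placeholder
aws eks create-fargate-profile \
  --cluster-name my-cluster \
  --fargate-profile-name my-profile \
  --pod-execution-role-arn arn:aws:iam::123456789012:role/FargatePodExecutionRole \
  --subnets subnet-xxx subnet-yyy \
  --selectors namespace=default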

Features:

  • On-demand, right-sized compute
  • Dedicated VM boundary per Pod
  • No shared kernel, CPU, memory, or ENI

Limitations:

  • ❌ No HostPort/HostNetwork
  • ❌ No DaemonSets
  • ❌ No GPUs
  • ❌ No Spot instances
  • ✓ Requires private subnets with NAT gateway

Comparison Table

Feature        Managed Nodes    Self-Managed    Fargate
Management     AWS automated    Manual          Fully serverless
Cost           EC2 pricing      EC2 pricing     Per Pod pricing
Control        Medium           High            Low
DaemonSets     ✓                ✓               ✗
GPUs           ✓                ✓               ✗
Spot           ✓                ✓               ✗
HostNetwork    ✓                ✓               ✗

Spot Capacity Rebalancing

# Enabled by default for Spot nodes
# 2-minute interruption notice
# Recommend 30s or less termination grace periods

spec:
  terminationGracePeriodSeconds: 30

Warning: Pods may be forcibly terminated during concurrent reclamations.

IAM & Authentication

aws-iam-authenticator

# Uses IAM for cluster authentication
# Integrates with OpenID Connect (OIDC)

# Check authentication
aws sts get-caller-identity

OIDC Provider

# Hosts public OIDC discovery endpoint per cluster
# Contains signing keys for service account tokens

# Private keys rotate every 7 days
# Public keys kept until expiry

IRSA (IAM Roles for Service Accounts)

# Service Account with IAM role
apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-sa
  namespace: default
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/MyRole

Injected Environment Variables:

  • AWS_ROLE_ARN
  • AWS_WEB_IDENTITY_TOKEN_FILE

Benefits:

  • Least privilege
  • Credential isolation
  • Auditability

Create IAM Role for IRSA

# Create OIDC provider
eksctl utils associate-iam-oidc-provider \
  --cluster my-cluster \
  --approve

# Create IAM role
eksctl create iamserviceaccount \
  --name my-sa \
  --namespace default \
  --cluster my-cluster \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
  --approve

aws-auth ConfigMap (Legacy)

# Maps IAM roles/users to Kubernetes RBAC groups
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    - rolearn: arn:aws:iam::123456789012:role/NodeRole
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes

Note: The aws-auth ConfigMap is deprecated in favor of access entries.

Access Entries (New Method)

# Replaces aws-auth ConfigMap
# Requires minimum platform version

aws eks create-access-entry \
  --cluster-name my-cluster \
  --principal-arn arn:aws:iam::123456789012:role/MyRole

Required Roles

eks:node-manager:

  • ClusterRole and ClusterRoleBinding
  • Required for managed node groups
  • Missing/broken causes AccessDenied errors

# Verify role exists
kubectl get clusterrole eks:node-manager
kubectl get clusterrolebinding eks:node-manager

Networking

VPC CNI Plugin

# Check VPC CNI
kubectl get pods -n kube-system -l k8s-app=aws-node

# View CNI configuration
kubectl get daemonset -n kube-system aws-node -o yaml

Features:

  • Manages Pod networking
  • Allocates VPC IP addresses directly to Pods
  • Uses secondary IPs from ENI or prefix delegation
  • Requires IAM permissions (AmazonEKS_CNI_Policy)

Security Groups for Pods

# Assign different VPC security groups to Pods
apiVersion: vpcresources.k8s.aws/v1beta1
kind: SecurityGroupPolicy
metadata:
  name: my-sg-policy
spec:
  podSelector:
    matchLabels:
      app: my-app
  securityGroups:
    groupIds:
      - sg-0123456789abcdef0

Availability:

  • ✓ Fargate
  • ✓ EC2 nodes

Application Load Balancer (ALB)

# Ingress with ALB
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-ingress
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
spec:
  ingressClassName: alb
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-service
                port:
                  number: 80

Traffic Modes:

  • Instance: NodePort proxy
  • IP: Direct to Pod (required for Fargate)

Network Load Balancer (NLB)

# Service with NLB
apiVersion: v1
kind: Service
metadata:
  name: my-service
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: nlb
spec:
  type: LoadBalancer
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080

Features:

  • Layer 4 load balancing
  • Provisioned via Service type LoadBalancer

Subnet Tagging

Required for load balancer discovery:

Subnet Type    Tag                                Value
Private        kubernetes.io/role/internal-elb    1
Public         kubernetes.io/role/elb             1

# Tag private subnet
aws ec2 create-tags \
  --resources subnet-xxx \
  --tags Key=kubernetes.io/role/internal-elb,Value=1

IPv6 Support

# Enable IPv6 for ALB
metadata:
  annotations:
    alb.ingress.kubernetes.io/ip-address-type: dualstack
    alb.ingress.kubernetes.io/target-type: ip # Required

Note: Only works with IP target type.

Add-ons

Default Add-ons (Self-managed)

# VPC CNI
kubectl get daemonset -n kube-system aws-node

# kube-proxy
kubectl get daemonset -n kube-system kube-proxy

# CoreDNS
kubectl get deployment -n kube-system coredns

EKS-Managed Add-ons

# List available add-ons
aws eks describe-addon-versions

# Install add-on
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --addon-version v1.18.0-eksbuild.1

# Update add-on
aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --addon-version v1.18.1-eksbuild.1

Add-on Types

Type               Description                             Examples
AWS                AWS-curated, latest security patches    VPC CNI, kube-proxy, CoreDNS
AWS Marketplace    Third-party verified                    Datadog, New Relic
Community          Open-source, AWS-validated              Metrics Server, Cluster Autoscaler

AWS Load Balancer Controller

# Install via Helm
helm repo add eks https://aws.github.io/eks-charts
helm install aws-load-balancer-controller \
  eks/aws-load-balancer-controller \
  -n kube-system \
  --set clusterName=my-cluster

Required for:

  • ALB Ingress
  • NLB with IP targets
  • Fargate load balancing

Storage

EBS CSI Driver

# Install EBS CSI driver add-on
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name aws-ebs-csi-driver

# Create IAM role for CSI driver
eksctl create iamserviceaccount \
  --name ebs-csi-controller-sa \
  --namespace kube-system \
  --cluster my-cluster \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve

Compatibility:

  • ✓ EC2 nodes
  • ✗ Fargate
  • ✗ Hybrid Nodes

Note: Node DaemonSet only runs on EC2, controller can run on Fargate.

StorageClass with EBS

# EBS StorageClass
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  encrypted: "true"
  kmsKeyId: arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012
volumeBindingMode: WaitForFirstConsumer

KMS Encryption for EBS

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["kms:CreateGrant", "kms:Encrypt", "kms:Decrypt"],
      "Resource": "arn:aws:kms:*:*:key/*"
    }
  ]
}

Required IAM Permissions:

  • kms:CreateGrant
  • kms:Encrypt
  • kms:Decrypt

EFS Support

# EFS PersistentVolume (Static Provisioning)
apiVersion: v1
kind: PersistentVolume
metadata:
  name: efs-pv
spec:
  capacity:
    storage: 5Gi
  volumeMode: Filesystem
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  csi:
    driver: efs.csi.aws.com
    volumeHandle: fs-12345678

Fargate Support:

  • ✓ Automatic EFS mount (no driver installation)
  • ✓ Static provisioning only
  • ✗ Dynamic provisioning

FSx for Lustre

# Supported via CSI driver
# EC2 nodes only
# Not available on Fargate

Storage Comparison

Storage Type    Fargate       EC2    Dynamic Provisioning    Use Case
EBS             ✗             ✓      ✓                       Block storage, single AZ
EFS             ✓ (static)    ✓      ✓ (EC2 only)            Shared file storage, multi-AZ
FSx Lustre      ✗             ✓      ✓                       High-performance computing

Troubleshooting

Node Join Failures

aws-auth ConfigMap Issues:

# Check aws-auth ConfigMap
kubectl get configmap aws-auth -n kube-system -o yaml

# Common issues:
# - Missing or incorrect aws-auth entries
# - Role ARN includes a path (only role/ROLE_NAME is accepted, not role/some/path/ROLE_NAME)

Fix:

# Correct format
mapRoles: |
  - rolearn: arn:aws:iam::123456789012:role/NodeRole  # No path
    username: system:node:{{EC2PrivateDNSName}}
    groups:
      - system:bootstrappers
      - system:nodes
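A minimal sketch of normalizing a role ARN before adding it to aws-auth, assuming the ARN has the usual `arn:aws:iam::ACCOUNT:role/[path/]NAME` shape. The function name and the `team/platform/` path are hypothetical.

```python
def strip_role_path(role_arn):
    """Drop any path segment between 'role/' and the role name."""
    prefix, separator, resource = role_arn.partition(":role/")
    if not separator:
        raise ValueError("not an IAM role ARN")
    # Keep only the final segment, which is the role name itself.
    role_name = resource.rsplit("/", 1)[-1]
    return f"{prefix}:role/{role_name}"


print(strip_role_path("arn:aws:iam::123456789012:role/team/platform/NodeRole"))
```

An ARN without a path is returned unchanged, so the helper is safe to apply to every entry.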

Node Tags:

# Nodes must have cluster tag
aws ec2 create-tags \
  --resources i-1234567890abcdef0 \
  --tags Key=kubernetes.io/cluster/my-cluster,Value=owned

Public IP Issues:

# Public subnet nodes need public IP or Elastic IP
# Private subnets need NAT gateway route

# Check route table
aws ec2 describe-route-tables --route-table-id rtb-xxx

DNS Issues:

# Node needs private DNS entry
# VPC requires DHCP options set

# Check DHCP options
aws ec2 describe-dhcp-options --dhcp-options-id dopt-xxx

# Required:
# - domain-name
# - domain-name-servers

NodeCreationFailure

# 15-minute timeout
# Windows AMIs may need fast launch enabled

# Check node group events
aws eks describe-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name my-nodes \
  --query 'nodegroup.health'

InsufficientFreeAddresses

# Not enough available IPs in subnet
# Need to free IPs or change subnets

# Check available IPs
aws ec2 describe-subnets --subnet-ids subnet-xxx \
  --query 'Subnets[0].AvailableIpAddressCount'

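# Solution: the subnets of a managed node group cannot be changed after creation,
# so create a new node group in a subnet with free IPs
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name my-nodes-2 \
  --node-role arn:aws:iam::123456789012:role/NodeRole \
  --subnets subnet-new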

AccessDenied Error

# Missing eks:node-manager ClusterRole or ClusterRoleBinding

# Verify role exists
kubectl get clusterrole eks:node-manager
kubectl get clusterrolebinding eks:node-manager

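# Recreate if missing: apply the eks:node-manager ClusterRole and ClusterRoleBinding
# manifest from the Amazon EKS troubleshooting guide.
# The aws-auth-cm.yaml template only defines the aws-auth ConfigMap and does not restore them.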

Container Runtime Not Ready

# Missing or incorrect aws-auth/access entry for node IAM role

# Check VPC CNI (aws-node) Pod logs
kubectl logs -n kube-system -l k8s-app=aws-node

# Verify IAM role in aws-auth
kubectl get configmap aws-auth -n kube-system -o yaml

# Check node authorization
kubectl describe node NODE_NAME | grep -i auth

TLS Handshake Timeout

# Node cannot reach public API endpoint
# Check route table & security groups

# Test from node
curl -k https://API_SERVER_ENDPOINT

# Check security group rules
aws ec2 describe-security-groups --group-ids sg-xxx

# Required: Allow HTTPS (443) from node security group

HTTP 401 Unauthorized

# Stale service account tokens (>90 days old)
# Kubernetes client SDK must be recent version

# Recreate service account
kubectl delete sa SERVICE_ACCOUNT_NAME -n NAMESPACE
kubectl create sa SERVICE_ACCOUNT_NAME -n NAMESPACE

# Update client SDK
# Use latest aws-sdk, kubectl, or client library

Too Many Requests

# Launching many nodes causes describeCluster throttling

# Solution: Pass bootstrap arguments
--apiserver-endpoint ENDPOINT \
--b64-cluster-ca CERTIFICATE_AUTHORITY \
--dns-cluster-ip DNS_CLUSTER_IP

# Reduces API calls during node bootstrap

EKS Log Collector

# Pre-built script on nodes
/etc/eks/log-collector-script/eks-log-collector.sh

# Run on node
sudo /etc/eks/log-collector-script/eks-log-collector.sh

# Collects:
# - kubelet logs
# - Container runtime logs
# - VPC CNI logs
# - System information

Cluster Health Issues

Detection:

# EKS detects infrastructure/configuration issues
# Stores in health object
# Can take up to 3 hours to detect

aws eks describe-cluster --name my-cluster \
  --query 'cluster.health'

Notifications:

  • AWS sends email
  • Personal Health Dashboard notification

Recoverable Errors

Error                       Cause                     Recovery
SUBNET_NOT_FOUND            Cluster subnet deleted    update-cluster-config
SECURITY_GROUP_NOT_FOUND    Cluster SG deleted        update-cluster-config
KMS_KEY_DISABLED            KMS key disabled          Re-enable key

# Recover from SUBNET_NOT_FOUND
aws eks update-cluster-config \
  --name my-cluster \
  --resources-vpc-config subnetIds=subnet-new1,subnet-new2

Non-Recoverable Errors

Error                Cause                  Action
VPC_NOT_FOUND        Cluster VPC deleted    Delete and recreate cluster
KMS_KEY_NOT_FOUND    KMS key deleted        Delete and recreate cluster

Warning: These errors require cluster recreation. Backup data before deleting.

Commands

AWS CLI - Cluster Operations

# Describe cluster
aws eks describe-cluster --name my-cluster

# List clusters
aws eks list-clusters

# Get cluster versions
aws eks describe-cluster-versions

# Update cluster version
aws eks update-cluster-version \
  --name my-cluster \
  --kubernetes-version 1.30

# Update cluster config
aws eks update-cluster-config \
  --name my-cluster \
  --resources-vpc-config subnetIds=subnet-xxx,subnet-yyy

# Delete cluster
aws eks delete-cluster --name my-cluster

AWS CLI - Node Groups

# List node groups
aws eks list-nodegroups --cluster-name my-cluster

# Describe node group
aws eks describe-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name my-nodes

# Create node group
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name my-nodes \
  --node-role arn:aws:iam::123456789012:role/NodeRole \
  --subnets subnet-xxx subnet-yyy

# Update node group config
aws eks update-nodegroup-config \
  --cluster-name my-cluster \
  --nodegroup-name my-nodes \
  --scaling-config minSize=2,maxSize=10,desiredSize=4

# Delete node group
aws eks delete-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name my-nodes

AWS CLI - Add-ons

# List add-ons
aws eks list-addons --cluster-name my-cluster

# Describe add-on versions
aws eks describe-addon-versions --addon-name vpc-cni

# Create add-on
aws eks create-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --addon-version v1.18.0-eksbuild.1

# Update add-on
aws eks update-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni \
  --addon-version v1.18.1-eksbuild.1

# Delete add-on
aws eks delete-addon \
  --cluster-name my-cluster \
  --addon-name vpc-cni

kubectl - Node Management

# View nodes
kubectl get nodes

# Check node labels
kubectl get nodes --show-labels

# Check node capacity type
kubectl get nodes -L eks.amazonaws.com/capacityType

# Describe node
kubectl describe node NODE_NAME

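# Check kubelet version per node (kubectl version only reports client and API server versions)
kubectl get nodes -o custom-columns=NAME:.metadata.name,KUBELET:.status.nodeInfo.kubeletVersion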

# Drain node
kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data

# Cordon node
kubectl cordon NODE_NAME

# Uncordon node
kubectl uncordon NODE_NAME

kubectl - Add-ons

# Check VPC CNI
kubectl get pods -n kube-system -l k8s-app=aws-node

# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check kube-proxy
kubectl get pods -n kube-system -l k8s-app=kube-proxy

# Check EBS CSI driver
kubectl get pods -n kube-system \
  -l app.kubernetes.io/name=aws-ebs-csi-driver

# Check AWS Load Balancer Controller
kubectl get pods -n kube-system \
  -l app.kubernetes.io/name=aws-load-balancer-controller

kubectl - IRSA & IAM

# Check service account
kubectl get sa SERVICE_ACCOUNT_NAME -n NAMESPACE -o yaml

# View annotations (should show eks.amazonaws.com/role-arn)
kubectl describe sa SERVICE_ACCOUNT_NAME -n NAMESPACE

# Check aws-auth ConfigMap (legacy)
kubectl get configmap aws-auth -n kube-system -o yaml

# Edit aws-auth ConfigMap
kubectl edit configmap aws-auth -n kube-system

# Create service account with IRSA
kubectl create sa my-sa -n default
kubectl annotate sa my-sa -n default \
  eks.amazonaws.com/role-arn=arn:aws:iam::123456789012:role/MyRole

kubectl - Troubleshooting

# View events (sorted)
kubectl get events --sort-by='.lastTimestamp' -A

# Check pod on specific node
kubectl get pods -A -o wide --field-selector spec.nodeName=NODE_NAME

# Describe failing pod
kubectl describe pod POD_NAME -n NAMESPACE

# Check logs
kubectl logs POD_NAME -n NAMESPACE

# Check previous logs (crashed pod)
kubectl logs POD_NAME -n NAMESPACE --previous

# Execute into pod
kubectl exec -it POD_NAME -n NAMESPACE -- /bin/sh

# Check resource usage
kubectl top nodes
kubectl top pods -A

eksctl - Quick Operations

# Create cluster
eksctl create cluster \
  --name my-cluster \
  --region us-east-1 \
  --nodegroup-name my-nodes \
  --nodes 3

# Create IRSA
eksctl create iamserviceaccount \
  --name my-sa \
  --namespace default \
  --cluster my-cluster \
  --attach-policy-arn arn:aws:iam::aws:policy/AmazonS3ReadOnlyAccess \
  --approve

# Associate OIDC provider
eksctl utils associate-iam-oidc-provider \
  --cluster my-cluster \
  --approve

# Delete cluster
eksctl delete cluster --name my-cluster

Security

Pod Security Standards

# Replace deprecated PodSecurityPolicies
apiVersion: v1
kind: Namespace
metadata:
  name: my-namespace
  labels:
    pod-security.kubernetes.io/enforce: restricted
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted

Levels:

  • Privileged: Unrestricted
  • Baseline: Minimally restrictive
  • Restricted: Heavily restricted

Secrets Encryption

# Enable KMS encryption at cluster creation
aws eks create-cluster \
  --name my-cluster \
  --encryption-config '[{"resources":["secrets"],"provider":{"keyArn":"arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012"}}]' \
  # ... other parameters

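# Can also be enabled on an existing cluster; once enabled, it cannot be removed
aws eks associate-encryption-config \
  --cluster-name my-cluster \
  --encryption-config '[{"resources":["secrets"],"provider":{"keyArn":"arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012"}}]'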

Encrypts:

  • Kubernetes Secret objects at rest in etcd (envelope encryption with your KMS key)

Private Clusters

# API server endpoint only accessible from within VPC
aws eks create-cluster \
  --name my-cluster \
  --resources-vpc-config endpointPrivateAccess=true,endpointPublicAccess=false \
  # ... other parameters

# Requires VPN or Direct Connect for external access

Options:

  • Public Only: Default, accessible from internet
  • Public + Private: Both access methods
  • Private Only: VPC-only access (requires VPN/Direct Connect)

IMDS Restriction

# Block Pod access to instance metadata service
# Prevents credential access from Pods

# Require IMDSv2 with hop limit 1 on each node (instance ID is a placeholder)
aws ec2 modify-instance-metadata-options \
  --instance-id i-1234567890abcdef0 \
  --http-tokens required \
  --http-put-response-hop-limit 1

Recommended:

  • Use IRSA instead of instance roles
  • Block IMDSv1
  • Require IMDSv2 with hop limit 1

Network Policies

# Restrict Pod-to-Pod communication
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all
  namespace: default
spec:
  podSelector: {}
  policyTypes:
    - Ingress
    - Egress

Requires:

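  • A network policy engine: Amazon VPC CNI v1.14+ with network policy support enabled, or Calico or another CNI with network policy support
  • Security groups for Pods are a separate VPC-level control and do not replace network policies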

Audit Logging

# Enable control plane logging
aws eks update-cluster-config \
  --name my-cluster \
  --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'

# View logs in CloudWatch
# Log group: /aws/eks/my-cluster/cluster

Log Types:

  • api: API server logs
  • audit: Audit logs
  • authenticator: Authenticator logs
  • controllerManager: Controller manager logs
  • scheduler: Scheduler logs

Best Practices

Cluster Design

  • Multi-AZ: Deploy across 3+ availability zones
  • Separate Subnets: Use different subnets for public/private workloads
  • Control Plane Logging: Enable all log types
  • Managed Nodes: Use managed node groups for easier lifecycle
  • Version Currency: Keep clusters within standard support window

Networking

# Use VPC CNI prefix delegation for more IPs per node
# (on Linux nodes this is set through environment variables on the aws-node DaemonSet)
kubectl set env daemonset aws-node -n kube-system \
  ENABLE_PREFIX_DELEGATION=true \
  WARM_PREFIX_TARGET=1

Best Practices:

  • Tag subnets for load balancer discovery
  • Use security groups for Pods for fine-grained control
  • Implement network policies (VPC CNI network policy support or Calico)
  • Use private subnets for worker nodes
  • Use public subnets only for load balancers

Security

IAM & Authentication:

  • Enable secrets encryption with KMS
  • Use IRSA for Pod IAM permissions (avoid node IAM roles)
  • Restrict IMDS access from Pods (use IMDSv2)
  • Use private API endpoint when possible
  • Enable audit logging

Pod Security:

  • Apply Pod Security Standards
  • Run containers as non-root
  • Use read-only root filesystems
  • Drop unnecessary capabilities
  • Set resource limits

Cost Optimization

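# Use Spot instances for fault-tolerant workloads
# (node groups are created through the EKS API, not as Kubernetes objects)
aws eks create-nodegroup \
  --cluster-name my-cluster \
  --nodegroup-name spot-nodes \
  --node-role arn:aws:iam::123456789012:role/NodeRole \
  --subnets subnet-xxx subnet-yyy \
  --capacity-type SPOT \
  --instance-types t3.medium t3a.medium t2.medium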

Strategies:

  • Right-size node instance types
  • Use Fargate for unpredictable workloads
  • Implement Cluster Autoscaler or Karpenter
  • Monitor and optimize resource requests/limits
  • Use Savings Plans or Reserved Instances
  • Clean up unused resources (volumes, snapshots, AMIs)

Operations

Cluster Maintenance:

  • Keep clusters updated (within standard support)
  • Automate cluster creation with IaC (eksctl, Terraform)
  • Monitor with CloudWatch Container Insights
  • Use Upgrade Insights before upgrades
  • Test upgrades in non-production first

Monitoring:

  • Enable CloudWatch Container Insights
  • Use Prometheus + Grafana for metrics
  • Implement log aggregation (Fluent Bit, CloudWatch Logs)
  • Set up alerting for critical events

Disaster Recovery:

  • etcd backups: handled by AWS as part of the managed control plane
  • Version control Kubernetes manifests
  • Use GitOps (ArgoCD, Flux)
  • Document runbooks
  • Test disaster recovery procedures

Resource Management

# Set resource requests and limits
apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
    - name: app
      image: my-app
      resources:
        requests:
          memory: "64Mi"
          cpu: "250m"
        limits:
          memory: "128Mi"
          cpu: "500m"

Recommendations:

  • Always set requests (for scheduling)
  • Set limits for memory (prevent OOM)
  • Use Vertical Pod Autoscaler for right-sizing
  • Implement Horizontal Pod Autoscaler
  • Use PodDisruptionBudgets for availability

Pricing

Cluster Pricing

Standard Support:

  • $0.10 per hour per cluster
  • $73 per month per cluster
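The monthly figure follows from the hourly rate. A minimal sketch of the arithmetic, assuming an average month of 730 hours (8,760 hours per year divided by 12); the function name is illustrative, and the rate is the one listed above.

```python
HOURLY_RATE_USD = 0.10  # standard support cluster fee listed above
HOURS_PER_MONTH = 730   # assumption: 8,760 hours per year / 12 months


def monthly_cluster_fee(clusters=1, hourly_rate=HOURLY_RATE_USD, hours=HOURS_PER_MONTH):
    """Return the control plane fee in USD for the given number of clusters."""
    return round(clusters * hourly_rate * hours, 2)


print(monthly_cluster_fee())   # 73.0
print(monthly_cluster_fee(3))  # 219.0
```

This covers the cluster fee only; compute, networking, and storage are billed separately.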

Extended Support:

  • Additional cost per cluster
  • Varies by Kubernetes version

Compute Pricing

EC2 Nodes:

  • Standard EC2 pricing applies
  • On-Demand, Reserved, Spot available
  • Separate from cluster fees

Fargate:

  • Per-vCPU and per-GB memory pricing
  • Each Pod billed individually
  • No node management overhead

Example Fargate:

  • 0.25 vCPU, 0.5 GB: ~$0.012/hour
  • 1 vCPU, 2 GB: ~$0.046/hour
  • 4 vCPU, 8 GB: ~$0.185/hour

Additional Costs

Networking:

  • VPC NAT Gateway: $0.045/hour + data transfer
  • ALB: $0.0225/hour + LCU pricing
  • NLB: $0.0225/hour + NLCU pricing
  • Data transfer: $0.09/GB (varies by region)

Storage:

  • EBS: $0.08-0.10/GB-month (gp3)
  • EFS: $0.30/GB-month (standard)
  • FSx Lustre: $0.14/GB-month

Data Transfer:

  • Within same AZ: Free
  • Between AZs: $0.01/GB
  • To internet: $0.09/GB (first 10 TB)

Cost Optimization

  • Use Spot instances (up to 90% savings)
  • Use Fargate for unpredictable workloads
  • Right-size instances and Pods
  • Use Savings Plans (20-50% savings)
  • Monitor and eliminate waste
  • Use Cluster Autoscaler to scale down
