sigs.k8s.io/cluster-api-provider-aws@v1.5.5/docs/proposal/20180827-project-inception/lessons-learned.md

# AWS History and Lessons Learned

To gather our collective knowledge of deploying and managing Kubernetes on AWS, we ask that if you have a lesson to share, please share it here.

Please add anything that you have found notably challenging, difficult, or surprising while deploying or managing Kubernetes on AWS.

* [Nova] The rate limits on the AWS API need to be taken into consideration at the software level.
* [Nova] Not every resource in AWS can be “tagged”.
* [Dolezal] Working with IAM roles to secure access has been difficult to implement on a per-deployment or per-pod basis.
* [Ashish] Using ASGs for worker nodes has worked well with the cluster-autoscaler, and it simplifies the machine-controller implementation in scenarios where machines (EC2 instances) fail status checks. However, these are some of the things to consider:
  * ASG per zone (e.g. us-west-2a) to tolerate zone outages
  * Storage class per zone when using EBS volumes for persistent disks
  * By default, the cluster autoscaler makes ASG API calls too aggressively and can hit throttling errors.
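The rate-limit and throttling items above are commonly handled by retrying with capped exponential backoff and jitter. The following is a minimal sketch of that general technique, not what any contributor here necessarily ran. `ThrottledError` and `call_with_backoff` are hypothetical names introduced for illustration, and the default delays are assumptions rather than AWS-recommended values.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for an AWS API throttling error."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=20.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn(), retrying on ThrottledError with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Ceiling doubles each attempt (base_delay * 2^attempt), capped at max_delay.
            ceiling = min(max_delay, base_delay * (2 ** attempt))
            # Full jitter: sleep a random fraction of the ceiling so that
            # many clients do not retry in lockstep.
            sleep(ceiling * rng())
```

In a real controller, `fn` would wrap the SDK call and the `except` clause would match the SDK's own throttling error. The `sleep` and `rng` parameters exist only so that the behaviour can be tested deterministically.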
* [Ashish] Security in clusters:
  * Credentials to the cluster:
    * Certs issued by the cluster cert authority: not scalable, but may get us out of the door sooner
    * JWT credentials using OIDC and group-based RBAC (group being a field in the credentials JWT) for namespace admins
  * IAM roles as pod identity, using Kube2Iam, to control access to AWS resources. Works in multi-account environments. Kube2Iam doesn’t work very well under load: we’ve seen AWS clients fail with “unable to load credentials” and had to mitigate by increasing timeouts and retry counts, which works most of the time.
    * [Matt Reynolds] Try [Kiam](https://github.com/uswitch/kiam); it resolved the load issues we had with kube2iam
* [Ashish] Single-account environments don’t work very well, as users may run into resource limits (not specific to Kubernetes, but something to keep in mind). Similarly, account peering doesn’t scale.
* [Ashish] SSL termination at the load balancer in multi-account environments hasn’t worked well.
* [Ashish] Ability to deploy support applications that make the cluster usable. This would include things like a monitoring and alerting stack, kube2iam (if we choose it), and a credential-getting service. It may be more useful to solve this upstream than specifically for AWS.
* [Naadir] Subnet allocation:
  * Use at least /19s for VPCs and at least /22s for subnets.
  * For multi-region preparedness, have a larger supernetwork from which the VPCs are carved out.
  * Use RFC 3531 to carve smaller CIDR ranges from bigger ones (e.g. subnets from VPCs) so that they can be resized/shared with DCs later.
    * We used netaddr-rb 1.x (the newer version and the golang port don’t have it)
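To make the subnet-carving advice above concrete, here is a small sketch in the spirit of RFC 3531's leftmost allocation: blocks are handed out in bit-reversed order so that early allocations sit far apart and can later be widened. It uses Python's standard `ipaddress` module rather than netaddr-rb. The function names and the `10.0.0.0/19` range are assumptions for illustration, and this is not a complete implementation of the RFC.

```python
import ipaddress


def reverse_bits(value, width):
    """Reverse the lowest `width` bits of `value` (e.g. 0b001 -> 0b100 for width 3)."""
    result = 0
    for _ in range(width):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def carve_leftmost(parent, new_prefix):
    """Yield every /new_prefix block inside parent in bit-reversed (leftmost-first) order."""
    parent = ipaddress.ip_network(parent)
    width = new_prefix - parent.prefixlen
    if width < 0:
        raise ValueError("new_prefix must not be shorter than the parent prefix")
    block_size = 2 ** (parent.max_prefixlen - new_prefix)
    for counter in range(2 ** width):
        # Reversing the counter's bits sets the leftmost free bit first,
        # which spreads consecutive allocations across the parent range.
        index = reverse_bits(counter, width)
        address = int(parent.network_address) + index * block_size
        yield type(parent)((address, new_prefix))


vpc = "10.0.0.0/19"  # example VPC range (an assumption, not from the source)
subnets = list(carve_leftmost(vpc, 22))
first_four = [str(n) for n in subnets[:4]]
# first_four == ['10.0.0.0/22', '10.0.16.0/22', '10.0.8.0/22', '10.0.24.0/22']
```

If only the first four /22s are handed out, each one can later grow to a /21 without overlapping its neighbours, which is the resizing property the RFC 3531 item refers to.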
* [Naadir] Bucket / DNS names:
  * For multi-region preparedness, copy AWS endpoint naming schemes, e.g.:
    * `<service>.<region>.<account identifier>.<domain_name>`
    * `k8s-apiserver.eu-west-1.123456789012.lucernepublishing.com`
    * `k8s-apiserver.eu-west-1.prod.adatum.com`
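The naming scheme above can be generated mechanically so that every region and account follows the same pattern. The helper below is a hypothetical sketch: `endpoint_name` and its validation rule are assumptions for illustration, not part of any existing tool.

```python
def endpoint_name(service, region, account, domain):
    """Build <service>.<region>.<account identifier>.<domain_name>."""
    labels = [service, region, account]
    for label in labels:
        # Each of the first three parts must be a single DNS label,
        # so it may not be empty or contain a dot.
        if not label or "." in label:
            raise ValueError(f"invalid label: {label!r}")
    return ".".join(labels + [domain])
```

For example, `endpoint_name("k8s-apiserver", "eu-west-1", "prod", "adatum.com")` yields the third name in the list above.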