Prefix Mode for Windows¶

In Amazon EKS, each Pod that runs on a Windows host is assigned a secondary IP address by the VPC resource controller by default. This IP address is a VPC-routable address that is allocated from the host's subnet. On Linux, each ENI attached to the instance has multiple slots that can be populated by a secondary IP address or a /28 CIDR (a prefix). Windows hosts, however, only support a single ENI and its available slots. Using only secondary IP addresses can artifically limit the number of pods you can run on a Windows host, even when there is an abundance of IP addresses available for assignment.

In order to increase the pod density on Windows hosts, especially when using smaller instance types, you can enable Prefix Delegation for Windows nodes. When prefix delegation is enabled, /28 IPv4 prefixes are assigned to ENI slots rather than secondary IP addresses. Prefix delegation can be enabled by adding the enable-windows-prefix-delegation: "true" entry to the amazon-vpc-cni config map. This is the same config map where you need to set enable-windows-ipam: "true" entry for enabling Windows support.

Please follow the instructions mentioned in the EKS user guide to enable Prefix Delegation mode for Windows nodes.

illustration of two worker subnets, comparing ENI secondary IPvs to ENIs with delegated prefixes

Figure: Comparison of Secondary IP mode with Prefix Delegation mode

The maximum number of IP addresses you can assign to a network interface depends on the instance type and its size. Each prefix assigned to a network interface consumes an available slot. For example, a c5.large instance has a limit of 10 slots per network interface. The first slot on a network interface is always consumed by the interface's primary IP address, leaving you with 9 slots for prefixes and/or secondary IP addresses. If these slots are assigned prefixes, the node can support (9 * 16) 144 IP address whereas if they're assigned secondary IP addresses it can only support 9 IP addresses. See the documentation on IP addresses per network interface per instance type and assigning prefixes to network interfaces for further information.

During worker node initialization, the VPC Resource Controller assigns one or more prefixes to the primary ENI for faster pod startup by maintaining a warm pool of the IP addresses. The number of prefixes to be held in warm pool can be controlled by setting the following configuration parameters in amazon-vpc-cni config map.

warm-prefix-target, the number of prefixes to be allocated in excess of current need.
warm-ip-target, the number of IP addresses to be allocated in excess of current need.
minimum-ip-target, the minimum number of IP addresses to be available at any time.
warm-ip-target and/or minimum-ip-target if set will override warm-prefix-target.

As more Pods are scheduled on the node, additional prefixes will be requested for the existing ENI. When a Pod is scheduled on the node, VPC Resource Controller would first try to assign an IPv4 address from the existing prefixes on the node. If that is not possible, then a new IPv4 prefix will be requested as long as the subnet has the required capacity.

flow chart of procedure for assigning IP to pod

Figure: Workflow during assignment of IPv4 address to the Pod

Recommendations¶

Use Prefix Delegation when¶

Use prefix delegation if you are experiencing Pod density issues on the worker nodes. To avoid errors, we recommend examining the subnets for contiguous block of addresses for /28 prefix before migrating to prefix mode. Please refer “Use Subnet Reservations to Avoid Subnet Fragmentation (IPv4)” section for Subnet reservation details.

By default, the max-pods on Windows nodes is set to 110. For the vast majority of instance types, this should be sufficient. If you want to increase or decrease this limit, then add the following to the bootstrap command in your user data:

-KubeletExtraArgs '--max-pods=example-value'

For more details about the bootstrap configuration parameters for Windows nodes, please visit the documentation here.

Avoid Prefix Delegation when¶

If your subnet is very fragmented and has insufficient available IP addresses to create /28 prefixes, avoid using prefix mode. The prefix attachment may fail if the subnet from which the prefix is produced is fragmented (a heavily used subnet with scattered secondary IP addresses). This problem may be avoided by creating a new subnet and reserving a prefix.

Configure parameters for prefix delegation to conserve IPv4 addresses¶

warm-prefix-target, warm-ip-target, and minimum-ip-target can be used to fine tune the behaviour of pre-scaling and dynamic scaling with prefixes. By default, the following values are used:

warm-ip-target: "1"
minimum-ip-target: "3"

By fine tuning these configuration parameters, you can achieve an optimal balance of conserving the IP addresses and ensuring decreased Pod latency due to assignment of IP address. For more information about these configuration parameters, visit the documentation here.

Use Subnet Reservations to Avoid Subnet Fragmentation (IPv4)¶

When EC2 allocates a /28 IPv4 prefix to an ENI, it has to be a contiguous block of IP addresses from your subnet. If the subnet that the prefix is generated from is fragmented (a highly used subnet with scattered secondary IP addresses), the prefix attachment may fail, and you will see the following node event:

InsufficientCidrBlocks: The specified subnet does not have enough free cidr blocks to satisfy the request

To avoid fragmentation and have sufficient contiguous space for creating prefixes, use VPC Subnet CIDR reservations to reserve IP space within a subnet for exclusive use by prefixes. Once you create a reservation, the IP addresses from the reserved blocks will not be assigned to other resources. That way, VPC Resource Controller will be able to get available prefixes during the assignment call to the node ENI.

It is recommended to create a new subnet, reserve space for prefixes, and enable prefix assignment for worker nodes running in that subnet. If the new subnet is dedicated only to Pods running in your EKS cluster with prefix delegation enabled, then you can skip the prefix reservation step.

Replace all nodes when migrating from Secondary IP mode to Prefix Delegation mode or vice versa¶

It is highly recommended that you create new node groups to increase the number of available IP addresses rather than doing rolling replacement of existing worker nodes.

When using self-managed node groups, the steps for transition would be:

Increase the capacity in your cluster such that the new nodes would be able to accomodate your workloads
Enable/Disable the Prefix Delegation feature for Windows
Cordon and drain all the existing nodes to safely evict all of your existing Pods. To prevent service disruptions, we suggest implementing Pod Disruption Budgets on your production clusters for critical workloads.
After you confirm the Pods are running, you can delete the old nodes and node groups. Pods on new nodes will be assigned an IPv4 address from a prefix assigned to the node ENI.

When using managed node groups, the steps for transition would be:

Enable/Disable the Prefix Delegation feature for Windows
Update the node group using the steps mentioned here. This performs similar steps as above but are managed by EKS.

Warning

Run all Pods on a node in the same mode

For Windows, we recommend that you avoid running Pods in both secondary IP mode and prefix delegation mode at the same time. Such a situation can arise when you migrate from secondary IP mode to prefix delegation mode or vice versa with running Windows workloads.

While this will not impact your running Pods, there can be inconsistency with respect to the node's IP address capacity. For example, consider that a t3.xlarge node which has 14 slots for secondary IPv4 addresses. If you are running 10 Pods, then 10 slots on the ENI will be consumed by secondary IP addresses. After you enable prefix delegation the capacity advertised to the kube-api server would be (14 slots * 16 ip addresses per prefix) 244 but the actual capacity at that moment would be (4 remaining slots * 16 addresses per prefix) 64. This inconsistency between the amount of capacity advertised and the actual amount of capacity (remaining slots) can cause issues if you run more Pods than there are IP addresses available for assignment.

That being said, you can use the migration strategy as described above to safely transition your Pods from secondary IP address to addresses obtained from prefixes. When toggling between the modes, the Pods will continue running normally and:

When toggling from secondary IP mode to prefix delegation mode, the secondary IP addresses assigned to the running pods will not be released. Prefixes will be assigned to the free slots. Once a pod is terminated, the secondary IP and slot it was using will be released.
When toggling from prefix delegation mode to secondary IP mode, a prefix will be released when all the IPs within its range are no longer allocated to pods. If any IP from the prefix is assigned to a pod then that prefix will be kept until the pods are terminated.

Debugging Issues with Prefix Delegation¶

You can use our debugging guide here to deep dive into the issue you are facing with prefix delegation on Windows.