VPN demystified with HA options – part 1 – Single IP HA

Recently one of my customer asked me how we can improve his current VPN setup to 3^rd party environment for maintenance window. On customer’s side we have Aviatrix spoke GW with HA option enabled. 3^rd party is just AZURE VNG.

Before we jump into optimization, let’s talk about VPN itself (from Aviatrix point of view). I needed to concrete my knowledge and let me share with you a few TIPs while taking that path.

Few facts about VPN

VPN is a broad topic and there are many great articles about it – I assume you have basic knowledge so apologies if I don’t explain everything in details. It is also not about encryption algorithms, IKE negotiations and IPSEC itself. It is more about they way you do it on Aviatrix Platform.

There are 2 Types of VPNs:

Policy-Based
Route-Based

Here is very good Comparison between Policy Based vs Route Based VPN done by Design Expert – Shahzad Ali of Aviatrix (worth reading as there is no point in writing about it one more time) – LINK to AVIATRIX community

There are 2 places to configure this on Aviatrix Controller:

Multi-Cloud Transit -> Setup -> external connections

Only Route-Based
Supports both BGP and static Route-selector (similar concept to VNG route-based with this knob enabled – “use policy based traffic selector”. Here is very good explanation how VNG works for that https://github.com/MicrosoftDocs/azure-docs/blob/main/articles/vpn-gateway/vpn-gateway-connect-multiple-policybased-rm-ps.md

If Aviatrix HA GW is deployed, both will actively try to establish connections to the same remote-GW

Site2Cloud -> Setup

Supports Policy-Based and Route-Based
For Route-Based VPN – VTI tunnel is created
doesn’t support BGP
Gives you an option to establish backup tunnel from HAGW
Supports MAPPED NAT – very useful feature when we have overlapping IPs on both ends. That allows to establish NAT in a subnet-to-subnet manner – simplifying configuration of it.
CIDR of remote-subnet is not automatically propagated into Active-Mesh (towards Transits – if tunnel terminated on SPOKE). We need to use “Customize Spoke Advertised VPC CIDRs” to inject it into Aviatrix infrastructure or use “Auto Advertise Spoke Site2Cloud CIDRs” (mapped tunnels only!)

Let’s get back to customer’s case:

Initial setup

Problem here is that there is one VPN tunnel only and no HA at all. Assuming Azure handles failover scenarios in its own way (this is just A/S deployment with PIP move to 2^nd instance in case of failover). Because it’s traffic-selector being used for tunnel we can’t establish 2^nd backup tunnel from Aviatrix HA GW, and here comes the problem – what do we do for maintenance of Primary GW (upgrade lets say)? We loose connectivity.

Let’s take a look at Spoke’s GW route tables:

As there is only 1 tunnel from primary one we see the next hop for 10.222.0.0/16 VTI interface (tun-14E5DC4F-0) with metric 100. HA GW on the other hand is pointing to primary spoke GW – makes sense right?

TIP Tunnel interface ID is not random. Taking a deeper look you can notice its HEX of public IP of the RemoteGW of VPN.

Let’s try to make it better…

Option 1 - Single IP HA

This feature was introduced some time ago (Controller version of 6.4) but is not being used that often what I can see. The idea behind it is to migrate Public IP of one of the GW to standby one and re-establish the VPN. Simple right? That feature requires standalone GW and HAGW (Transit and spoke are not supported as it’s harder to migrate PIP having active/mesh enabled on them). Interesting use case for this one is if you are migrating VPN from on-PREM world where you have HA but based on FHRP (VRRP or HSRP) and actually have only 1 active tunnel there. Here with both Aviatrix GWs we are simulating such environment.

Key things to remember:

Route-based & Policy-based IPSEC are supported
Mapped and Unmapped connection types are supported
Only PUPLIC IP address is supported
Only for standalone GWs (deployed in HA mode)
AWS / AZURE support only
No BGP support

This is a standalone GW so workflow for deploying HAGW is slightly different. You need to go to

GATEWAY -> EDIT -> Gateway for High Availability Peering

Or with TF by just specifying 2 additional attributes (peering_ha_subnet, peering_ha_gw_size)

resource “aviatrix_gateway” “standalone_1” {

cloud_type = 8

account_name = var.avx_ctrl_account_azure

gw_name = “${local.env_prefix}-standalone-1″

vpc_id = module.mc_spoke_1.vpc.vpc_id

vpc_reg = var.azure_region

gw_size = “Standard_B1ms”

subnet = module.mc_spoke_1.vpc.public_subnets[0].cidr

zone = “az-1”

peering_ha_subnet = module.mc_spoke_1.vpc.public_subnets[0].cidr

peering_ha_gw_size = “Standard_B1ms”

}

When having HAGW we just need to create new Site2Cloud connection with the following attributes specified:

resource “aviatrix_site2cloud” “site2cloud_1” {

vpc_id = “vpntest-spoke1:rg-av-vpntest-spoke1-308843:ecae61e1-f8df-40fc-a0cb-b2160528d87b”

connection_name = “conn2”

connection_type = “unmapped”

remote_gateway_type = “generic”

tunnel_type = “route”

primary_cloud_gateway_name = “vpntest-standalone-1”

remote_gateway_ip = azurerm_public_ip.VNG-PIP.ip_address

ha_enabled = true

enable_single_ip_ha = true

private_route_encryption = false

remote_subnet_cidr = “10.222.0.0/16”

local_subnet_cidr = “10.0.0.0/8, 192.168.0.0/16, 172.16.0.0/12, 149.1.1.0/24”

backup_gateway_name = “vpntest-standalone-1-hagw”

backup_remote_gateway_ip = azurerm_public_ip.VNG-PIP.ip_address

custom_algorithms = false

pre_shared_key = “secret$123”

backup_pre_shared_key = “secret$123”

enable_dead_peer_detection = true

enable_event_triggered_ha = true

enable_ikev2 = false

}

As soon as tunnel goes up Controller is smart enough to program all UDRs in this spoke VNET so 10.222.0.0/16 will appear.

Of course normally we want this prefix to be accessible from the rest of the spokes. To make it happen we need to inject it into transit from the spoke.

TIP You want to add original spoke CIDR (10.201.0.0/16) as from now on we are advertising only these prefixes which are listed there.

TESTING

Lets check how long does it take to failover. Initial state is that only 1 GW has tunnel being UP

And both GWs are running

Keep in mind that you will see public IPs swapped but that’s not all what is happening behind the scenes. Controller does UDR programming for you and IPSEC tunnel still needs to come up.

TIP There is no pre-emption so when we restore primary GW – they will still have PIP swapped. To revert back we need to shutdown hagw and start the whole process from scratch.

Tunning

There are a few options that have impact on the time it takes to failover. In my case these were the following:

Conclusion

This option is quite nice but would not meet customer’s goal – maintenance purpose. With some tuning of parameters above I was able to lower the time needed for failover to 150s. It is not a matter of detection it’s slowness of AZURE’s API itself. For AWS I was surprised to see time around 30s which is acceptable in my opinion. In Azure I think we could upgrade GW faster than doing failover .

It serves its main purpose though – HA. Second GW is deployed in a different AZ so we have our resiliency there and remote ends points only to single IP which is what we wanted.

Follow-up – coming soon
next option will be BGP with 2 different scenarios (Active / Passive vs Active / Active – with VNG)