Customizing your VMware on IBM Cloud environment

I like to apply the following customizations to my vCenter Server (VCS) on IBM Cloud instance after deploying it:

SSH customizations

Out of the box, the vCenter customerroot user is not enabled to use the bash shell. After logging in as customerroot, you can self-enable shell access by running:

shell.set --enabled true

Note that in a future release, IBM expects to remove the need for the customerroot user and will provide the root credentials directly to you for newly deployed instances.

Additionally, I like to install my SSH key into vCenter so that I don’t need to provide a password to log in. This involves two steps:

  1. Copy my SSH public key to either /root/.ssh/authorized_keys or /home/customerroot/.ssh/authorized_keys. Note that if you create the folder you should set its permissions to 700, and if you create the file you should set its permissions to 600.
  2. vCenter will only allow you to use key-based login if you set your login shell to bash:
    chsh -s /bin/bash

Note that your authorized key will persist across a major release upgrade of vCenter, but your choice of default shell will not. You will have to perform step 2 again after upgrading vCenter to the next major release.
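If you want to script step 1, here is a minimal Python sketch of it. It assumes Python 3 is available where you run it and that you run it as the user who owns the home directory. The function name and its arguments are my own illustration and are not part of any vCenter tooling. Unlike the wording of step 1, it applies the 700 and 600 permissions whether or not it created the folder and file.

```python
import os
from pathlib import Path


def install_public_key(home: Path, public_key: str) -> Path:
    """Add public_key to <home>/.ssh/authorized_keys with 700/600 permissions."""
    ssh_dir = home / ".ssh"
    ssh_dir.mkdir(exist_ok=True)
    os.chmod(ssh_dir, 0o700)  # folder: owner-only access

    auth_file = ssh_dir / "authorized_keys"
    text = auth_file.read_text() if auth_file.exists() else ""
    key = public_key.strip()
    if key not in text.splitlines():  # do not add the same key twice
        if text and not text.endswith("\n"):
            text += "\n"  # keep the new key on its own line
        auth_file.write_text(text + key + "\n")
    os.chmod(auth_file, 0o600)  # file: owner read/write only
    return auth_file
```

For example, `install_public_key(Path("/home/customerroot"), my_key)` where `my_key` holds the contents of your public key file. You still need to perform step 2 yourself.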

Although SSH is initially disabled on the hosts, I also add my key to each host’s authorized keys list. For ESXi, the file you should edit is /etc/ssh/keys-<username>/authorized_keys as noted in KB 1002866.

Public connectivity

Some of your activities in vCenter benefit from public connectivity. For example, vCenter is able to refresh the vSAN hardware compatibility list proactively.

vCenter supports the use of proxy servers for some of its internet connectivity. Since I have access only to an http but not an https proxy, I configure this by manually editing /etc/sysconfig/proxy as follows:

PROXY_ENABLED="yes"
HTTP_PROXY="http://10.11.12.13:3128/"
HTTPS_PROXY="http://10.11.12.13:3128/"
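The three settings above can be generated for a different proxy address with a small Python sketch like the following. The function name is hypothetical, and it renders only these three lines; the real /etc/sysconfig/proxy file contains other settings that you should leave in place. Note that HTTPS_PROXY deliberately points at an http:// URL, matching the case where only an http proxy is available.

```python
def render_proxy_settings(proxy_host: str, proxy_port: int) -> str:
    """Render the three proxy settings for an http-only proxy."""
    url = f"http://{proxy_host}:{proxy_port}/"
    lines = [
        'PROXY_ENABLED="yes"',
        f'HTTP_PROXY="{url}"',
        f'HTTPS_PROXY="{url}"',  # same http:// URL is used for https traffic
    ]
    return "\n".join(lines) + "\n"
```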

Alternatively, if your instance has public connectivity enabled, you can configure vCenter routes to use your services NSX edge to SNAT to the public network. This involves the following steps:

  1. Log in to NSX manager and select Security | Gateway Firewall, then manage the firewall for the T0 gateway with “service” in its name. Add a new policy for “vCenter internet” and add a rule to this policy with the same name and set to allow traffic. The source IP for this rule should be your vCenter appliance IP, and the destination and allowed services can be Any. Publish your changes. Note that these changes may be overwritten later by IBM Cloud automation in some cases if you deploy or remove add-on services like Zerto and Veeam.
  2. Still in NSX manager, select Networking | NAT. Verify that there is already an SNAT configured for the T0 service gateway that allows all 10.0.0.0/8 traffic to SNAT to the public internet.
  3. Identify the NSX edge’s private IP so that we can configure a route to it later. Still in NSX manager, navigate to Networking | Tier-0 Gateways, and expand the gateway with “service” in its name. Click the number next to “HA VIP configuration” and note the IP address associated with the private uplinks, for example, 10.20.21.22.
  4. Log in to the vCenter appliance shell (or run appliancesh from the bash prompt). Run the following command to identify the IBM Cloud private router IP address. It will be the Gateway address associated with the 0.0.0.0 destination, for example, 10.30.31.1:
    com.vmware.appliance.version1.networking.routes.list
  5. Now we need to configure three static routes to direct all private network traffic to the private router, substituting the address you learned in step 4 above. IBM Cloud uses the following IP networks on its private network:
    com.vmware.appliance.version1.networking.routes.add --destination 10.0.0.0 --prefix 8 --gateway 10.30.31.1 --interface nic0
    com.vmware.appliance.version1.networking.routes.add --destination 161.26.0.0 --prefix 16 --gateway 10.30.31.1 --interface nic0
    com.vmware.appliance.version1.networking.routes.add --destination 166.8.0.0 --prefix 14 --gateway 10.30.31.1 --interface nic0
  6. Finally we can reconfigure the default gateway. First display the nic0 configuration:
    com.vmware.appliance.version1.networking.ipv4.list
  7. In this configuration we want to modify only the default gateway address. Keeping all the other details we learned from step 6, and substituting the edge private IP address we learned in step 3, run the following command:
    com.vmware.appliance.version1.networking.ipv4.set --interface nic0 --mode static --address 10.1.2.3 --prefix 26 --defaultGateway 10.20.21.22

Note: If you follow the approach of setting up SNAT and customizing routes, in my experience this can cause problems when you upgrade vCenter to the next major release. It appears that the static routes configured in step 5 do not persist across the upgrade, resulting in no traffic being routed to the private network. Before starting a major release upgrade, you should set the vCenter default route that you configured in step 7 back to the IBM Cloud private router. After the release upgrade, you need to reintroduce the three routes you added in step 5 above, as well as updating the default route you set in step 7 to point to the NSX edge.
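Since the routes from step 5 and the default gateway from step 7 may need to be reapplied after a major upgrade, it can help to generate the appliance shell commands rather than retype them. The following Python sketch only builds the command strings shown in steps 5 and 7; it does not run anything on vCenter. The function names are my own, and the addresses you pass in are the ones you noted in steps 3, 4, and 6.

```python
import ipaddress
from typing import List

# IBM Cloud private networks listed in step 5.
PRIVATE_NETWORKS = ["10.0.0.0/8", "161.26.0.0/16", "166.8.0.0/14"]


def route_add_commands(private_router: str, interface: str = "nic0") -> List[str]:
    """Build the three static route commands from step 5."""
    ipaddress.ip_address(private_router)  # raises ValueError if malformed
    commands = []
    for cidr in PRIVATE_NETWORKS:
        net = ipaddress.ip_network(cidr)
        commands.append(
            "com.vmware.appliance.version1.networking.routes.add "
            f"--destination {net.network_address} --prefix {net.prefixlen} "
            f"--gateway {private_router} --interface {interface}"
        )
    return commands


def default_gateway_command(address: str, prefix: int, gateway: str,
                            interface: str = "nic0") -> str:
    """Build the ipv4.set command from step 7 for the given default gateway."""
    ipaddress.ip_interface(f"{address}/{prefix}")  # validate address and prefix
    ipaddress.ip_address(gateway)                  # validate gateway
    return (
        "com.vmware.appliance.version1.networking.ipv4.set "
        f"--interface {interface} --mode static --address {address} "
        f"--prefix {prefix} --defaultGateway {gateway}"
    )
```

Before an upgrade, `default_gateway_command("10.1.2.3", 26, "10.30.31.1")` produces the command that points the default route back at the private router. After the upgrade, `route_add_commands("10.30.31.1")` and `default_gateway_command("10.1.2.3", 26, "10.20.21.22")` reproduce the commands from steps 5 and 7.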

vSAN configuration

I customize my vSAN configuration as follows:

  1. In vCenter, navigate to the cluster’s Configuration | vSAN | Services and edit the Performance Service; set it to Enabled.
  2. Navigate to the cluster’s Configuration | vSAN | Services and edit the Data Services; enable Data-In-Transit encryption.

Firmware updates

Your host may be provisioned with optional firmware updates pending, and additional firmware updates may be issued by IBM Cloud at any time thereafter. Available firmware updates will be displayed on the Firmware tab of your bare metal server resource in the IBM Cloud console. You can update firmware for a host with the following steps:

  1. In vCenter, place the host in maintenance mode and wait for it to enter successfully.
  2. In the IBM Cloud console, perform Actions | Power off and wait for the host to power off.
  3. In the IBM Cloud console, perform Actions | Update firmware. This action may take several hours to complete.
  4. In vCenter, remove the host from maintenance mode.

Occasionally I have found that either the firmware update fails, or it succeeds but the success is not reflected in the IBM Cloud console and an update still appears to be available. In cases like this you can resolve the issue by opening an IBM Cloud support ticket.

IPMI

At deploy time, your bare metal servers have IPMI interfaces enabled. Although these interfaces are on your dedicated private VLAN, it is still a best practice to disable them to reduce the internal management surface area. You can do this using the SoftLayer CLI and providing the bare metal server ID that is displayed in the server details page in the IBM Cloud console:

slcli hardware toggle-ipmi --disable 1234567
slcli hardware toggle-ipmi --disable 3456789
. . .
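If you have many hosts, you can generate one command per server ID instead of typing each one. This is a minimal Python sketch that only builds the same slcli command lines shown above; the function name is hypothetical, and the IDs are whatever you collected from the server details pages.

```python
from typing import Iterable, List


def disable_ipmi_commands(server_ids: Iterable) -> List[str]:
    """Build one toggle-ipmi command per unique bare metal server ID."""
    commands = []
    seen = set()
    for sid in server_ids:
        sid = int(sid)  # raises ValueError if an ID is not numeric
        if sid in seen:
            continue    # skip IDs that were listed more than once
        seen.add(sid)
        commands.append(f"slcli hardware toggle-ipmi --disable {sid}")
    return commands
```

You could print the resulting lines into a shell script, or review them before running each one by hand.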

Planning for VMware vSAN ESA

I wrote previously about some considerations for migrating from VMware vSAN Original Storage Architecture (OSA) to Express Storage Architecture (ESA). There are some additional important planning considerations for your hardware choice for vSAN ESA. Even if you are already leveraging NVMe drives using vSAN OSA, your existing hardware may not be supported for ESA. Here are some important considerations:

  • Although OSA was certified on a component level, ESA is certified at the node level using vSAN ESA ReadyNode.
  • These ReadyNode configurations are limited to newer processors.
  • The minimum ReadyNode configuration for compute is 32 cores and 512GB of memory.
  • Although vSAN ESA does not use cache drives, the minimum storage configuration for ESA is four NVMe devices per host. The minimum capacity required for each drive is 1.6TB. At the time of this writing, the largest certified drives are 6.4TB.
  • The minimum network configuration for ESA is 25GbE.
  • The use of TPM 2.0 is recommended.
  • With a RAID-5 configuration (erasure coding, FTT=1) you can now deploy as few as three hosts using ESA. All other configurations have the same fixed and recommended minimums as with OSA. As always, with any FTT=1 configuration, you must perform a “full data migration” during host maintenance if you want your storage to remain resilient against host or drive loss during the maintenance window.
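As a quick first pass over candidate hardware, the numeric minimums in the list above can be expressed as a simple check. This Python sketch only restates those bullets; the class and function names are my own, and passing this check does not mean a configuration is a certified ReadyNode. You still need to confirm the exact node and processor against the vSAN ESA ReadyNode listings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HostSpec:
    cores: int
    memory_gb: int
    nvme_drive_tb: List[float]  # capacity of each NVMe device, in TB
    network_gbe: int


def esa_minimum_gaps(host: HostSpec) -> List[str]:
    """Return the minimums from the list above that this host does not meet."""
    gaps = []
    if host.cores < 32:
        gaps.append("fewer than 32 cores")
    if host.memory_gb < 512:
        gaps.append("less than 512GB of memory")
    if len(host.nvme_drive_tb) < 4:
        gaps.append("fewer than four NVMe devices")
    if any(tb < 1.6 for tb in host.nvme_drive_tb):
        gaps.append("an NVMe device smaller than 1.6TB")
    if host.network_gbe < 25:
        gaps.append("network slower than 25GbE")
    return gaps
```

An empty list means the host meets the numeric minimums; anything else names what to fix before going further.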

VMware NFS resiliency considerations

Here are some important resiliency considerations if you are using NFS datastores for your VMware vSphere cluster. You should be aware of these considerations so that you can evaluate the tradeoffs of your NFS version choice in planning your storage architecture.

NFSv3 considerations

For NFSv3 datastores, ESXi supports storage I/O control (SIOC), which allows you to enable congestion control for your NFS datastore. This helps ensure that your hosts do not overrun the storage array’s IOPS allocation for the datastore. Hosts that detect congestion will adaptively back off the operations they are driving. You should test your congestion thresholds to ensure that they are sufficient to detect and react to problems.

However, NFSv3 does not support multipathing. This is not just a limitation on possible throughput, but a limitation on resiliency. You cannot configure multiple IP addresses for your datastore. Even if your datastore is known by a hostname, ESXi does not allow you to leverage DNS-based load balancing to redirect hosts to a new IP address during interface maintenance at your storage array, because ESXi will not reattempt to resolve the hostname after a connection failure. Thus, with NFSv3 you may lose the connection to your datastore during interface maintenance on your storage array.

NFSv4.1 considerations

NFSv4.1 datastores have the opposite characteristics for the above issues:

NFSv4.1 supports multipathing, so you are able to configure multiple IP addresses for your datastore connection. This possibly allows you to obtain better network throughput, but more importantly it helps to ensure that your connection to the datastore is resilient in case one of those paths is lost.

However, at this time NFSv4.1 does not support SIOC congestion control. Therefore, if you are using NFSv4.1 you run the risk of triggering a disconnection from your datastore if your host—or especially if multiple hosts—exceeds your storage array’s IOPS allocation for the datastore.

VMware vSAN ESA migration and licensing considerations

With the new vSAN Express Storage Architecture (ESA), you may need to carefully plan your migration path from vSAN 7 to vSAN 8. At the moment, VMware only supports greenfield deployments of vSAN ESA. As a result, even if you have a vSAN cluster with NVMe storage, you will need to migrate your workloads to a new cluster to reach vSAN ESA. Furthermore, if you are moving from SSD to NVMe, you’ll need to ensure your order of operations is correct.

The following graph illustrates your possible migration and upgrade paths:

Your fastest path to ESA is to leave your existing cluster at the vSphere 7 level and create a vSphere 8 ESA cluster after upgrading to vCenter 8.

It’s important to consider both your vSphere and vSAN licensing during this process. For one, you will incur dual licensing for the duration of the migration. But you should also be aware that your vSAN license is tied to your vCenter version rather than your vSphere version. KB 80691 documents the fact that after upgrading to vCenter 8, your vSAN cluster will be operating under an evaluation license until you obtain vSAN 8 licenses. You should work with VMware to ensure both proper vSphere and vSAN licensing throughout this transition process.

VMware’s vExpert program

VMware maintains and supports an evangelism and advocacy program for technologists who have made demonstrated contributions to the VMware community, whom they call vExperts. VMware makes a significant investment in the vExpert program, providing opportunities such as webinars and evaluation licenses. vExperts are appointed for an annual term, and reappointments require demonstrated merit. I’ve been appointed as a vExpert now for three years, and I’m very honored to have received this recognition.

If you’re actively involved in the VMware user community, you should apply! Here are a few testimonials to encourage you:

Updated VMware Solutions API samples

It’s been a while since I first posted sample IBM Cloud for VMware Solutions API calls. Since then, our offering has moved from NSX-V to NSX-T, and to vSphere 7.0. This results in some changes to the structure of the API calls you need to make for ordering instances, clusters, and hosts.

I’ve updated the sample ordering calls on GitHub. This includes the following changes:

  • Migrate some of the utility APIs to version 2
  • Send transaction ids with each request to aid in problem diagnosis
  • Order NSX-T and vSphere 7.0 instead of NSX-V and vSphere 6.7
  • Restructure some of the ordering code to order a concrete example instead of a random one
  • Use the new price check parameter to obtain price estimates
  • Ensure that the add-cluster path leverages an existing cluster’s location and VLANs

Highly available key management in IBM Cloud for VMware Solutions

IBM Cloud’s KMIP for VMware offering provides the foundation for cloud-based key management when using VMware vSphere encryption or vSAN encryption. KMIP for VMware is highly available within a single region:

  • KMIP for VMware and Key Protect are highly available when you configure vCenter connections to both regional endpoints. If any one of the three zones in that region fails entirely, key management continues to be available to your VMware workloads.
  • KMIP for VMware and Hyper Protect Crypto Services (HPCS) are highly available if you deploy two or more crypto units for your HPCS instance. If you do so and any one of the three zones in that region fails entirely, key management continues to be available to your VMware workloads.

If you need to migrate or failover your workloads outside of a region, your plan depends on whether you are using vSAN encryption or vSphere encryption:

When you are using vSAN encryption, each site is protected by its own key provider. If you are using vSAN encryption to protect workloads that you replicate between multiple sites, you must create separate KMIP for VMware instances in each site that are connected to separate Key Protect or HPCS instances in those sites. You must connect your vCenter Server in each site to the local KMIP for VMware instance as its key provider.

When you are using vSphere encryption, most VMware replication and migration techniques today (for example, cross-vCenter vMotion and vSphere replication) rely on having a common key manager between the two sites. This topology is not supported by KMIP for VMware. Instead, you must create separate KMIP for VMware instances in each site that are connected to separate Key Protect or HPCS instances in those sites. You must connect your vCenter Server in each site to the local KMIP for VMware instance as its key provider, and then use a replication technology that supports the attachment and replication of decrypted disks.

Veeam Backup and Replication supports this replication technique. To implement this technique correctly, see the steps that you must take as indicated in the Veeam documentation.

Note that this technique currently does not support the replication of virtual machines with a vTPM device.

Rekeying all of your VMware objects

We saw previously that we could use PowerCLI to rekey objects to a different key provider. It is much more common that you simply want to rekey objects within the same key provider, perhaps to meet a compliance requirement. We can use the same set of commands without specifying a key provider to perform rekey operations.

The simplest and fastest of the three is a vSAN rekey, which only needs to reissue one root key for each cluster protected by vSAN encryption:

PS C:\Users\Administrator> Invoke-VsanEncryptionRekey -Cluster cluster1 -DeepRekey $false
Executing shallow rekey of vSAN Cluster cluster1
PS C:\Users\Administrator>

This performs a shallow rekey. You can perform a deep rekey by changing $false to $true. This will take much longer to complete.

We can also rekey each of our VMs that is protected by vSphere encryption, as follows:

PS C:\Users\Administrator> foreach($myvm in Get-VM){
>>  if($myvm.KMSserver){
>>   echo $myvm.name
>>   Set-VMEncryptionKey -VM $myvm
>>  }
>> }
scott-test

Type Value
---- -----
Task task-23093


PS C:\Users\Administrator>

This took a couple of minutes to complete for each VM. You can perform a deep rekey, which will take longer to complete, by adding the -Deep parameter to the Set-VMEncryptionKey cmdlet.

Finally, if you wish to rekey the host encryption keys used to protect core dumps, you can run the following:

PS C:\Users\Administrator> foreach($myhost in Get-VMHost){
>>  echo $myhost.name
>>  Set-VMHostCryptoKey -VMHost $myhost
>> }
host003.smoonen.example.com
host004.smoonen.example.com
host000.smoonen.example.com
host001.smoonen.example.com
host002.smoonen.example.com
PS C:\Users\Administrator>

This took a few minutes to complete for each host. There is no notion of deep rekeying for host encryption keys.

Active Directory and SSO integration for VMware Solutions in IBM Cloud

VMware Solutions instances in IBM Cloud are deployed with a built-in Active Directory domain with one or two directory controllers. Recently IBM Cloud changed the domain name requirements to require three qualifiers (e.g., cloud.example.com) rather than two (e.g., example.com). The reason for this is that we want to ensure you can integrate with your existing domain and forest without experiencing conflict. The domain controllers are configured as the SSO provider for vCenter and NSX, and also as the DNS provider for the infrastructure components. IBM Cloud creates an administrator userid in this domain which it uses for subsequent operations, such as logging into vCenter to add a new host, updating DNS records for that host, and creating utility accounts for add-on services like Veeam.
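If you are planning domain names ahead of an order, the three-qualifier requirement described above is easy to check. This is a minimal Python sketch with a hypothetical function name. The text above does not say whether more than three qualifiers are accepted, so treating three or more as passing is my assumption, and this is not the validation that IBM Cloud itself performs.

```python
def has_three_qualifiers(domain: str) -> bool:
    """Check that a domain name has at least three non-empty dot-separated parts."""
    labels = domain.strip().rstrip(".").split(".")
    # Assumption: three or more qualifiers pass; empty parts never do.
    return len(labels) >= 3 and all(labels)
```

With this check, cloud.example.com passes and example.com does not.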

This Active Directory domain is your responsibility to secure and manage, including backup, patching, group policy, etc.

Here are your options for integrating the instance domain with your existing Active Directory, in order from loosest to tightest coupling:

1. No integration

You are free to leverage your instance domain directly for user management within the instance. You can point additional components to the instance’s domain controllers for SSO; for example, the IBM Cloud automation does this for you when it deploys and configures HyTrust Cloud Control. You can join other devices to the domain and also use this for DNS management beyond the instance infrastructure.

2. Additional SSO provider

This option and all of the following options entail some kind of integration between your instance and your existing Active Directory forest. You will first need to establish network connectivity between your instance and your existing Active Directory forest. You might accomplish this with either a VPN connection or a direct link between IBM Cloud and your on-premises environment. As always, you should take great care to secure your domain controllers, so you should explore security measures such as the use of read-only directory controllers, session recording, bastion servers, and gateway firewalls.

You can leverage your own Active Directory domain for SSO purposes by configuring your directory controllers as additional SSO providers for vCenter and NSX manager and by granting your users and groups appropriate permissions. You will need to determine how you configure DNS; some customers manually duplicate the DNS records from their instance domain into their existing Active Directory domain, but it is also possible to establish mutual DNS delegation between the two Active Directory domains.

This approach may allow you to limit the cloud connections to your directory controllers so that you are only opening up LDAPS and DNS ports.

3. One-way trust

You can establish one-way trust from your instance’s Active Directory domain controllers to your existing Active Directory domain. This will enable you to expose and authorize your existing users and groups to vCenter and NSX manager without having to add these directly as SSO providers. You may need to make additional provision for DNS updates, either copying them to your existing domain or establishing DNS delegation to the instance’s domain.

4. Two-way trust

This option requires your existing domain to establish mutual trust with your instance’s domain. If you are comfortable doing this, it could simplify your DNS management between the two domains.

5. Forest merge

I am not aware of any IBM Cloud customers who have done this, and I do not recommend it since it is a disruptive and potentially risky operation. The idea here is to merge the instance’s forest with your existing forest and to configure the instance’s domain as a child domain of your existing domain.

6. Rebuild

IBM Cloud’s VMware Solutions Shared offering implements a variation of the forest merge. It deploys VCS instances and builds VMware Cloud Director environments on top of them. This solution leverages an existing internal Active Directory forest and domain. After each new VCS instance is deployed, our process removes the VCS instance from its domain and reconfigures it to point to the existing domain.

A variation of this option is to create a new child domain in your existing forest for your VCS instance, and leverage the controllers for this child domain for use with your VCS instance.

There are a few important points to observe:

  1. You should either deploy your instance with the same domain name that you intend to convert it to, or else you should accept the fact that your infrastructure components will have host names in a different DNS domain from your Active Directory domain. Changing the DNS domain of infrastructure components is not supported by IBM Cloud automation.
  2. You will need to re-create the IBM Cloud automation user in your existing domain as an administrator and ensure that this user has administrative permissions in vCenter and NSX manager. This user may in the future create additional users or DNS entries. After performing the reconfiguration, you should open a support ticket to the VMware Solutions team asking them to update the automation user’s password in the IBM Cloud database for your instance, and provide the updated password.

This process is complex and therefore error prone, so you should consider this option only if the options above do not work for you. Additionally, you should practice it with a non-production or pre-production VCS deployment, including a test of adding a new host to the environment, before you implement it in production.