Updated instructions for multipath iSCSI in IBM Cloud

Several years ago I blogged detailed instructions to configure multipath iSCSI in IBM Cloud’s classic infrastructure using Endurance block storage.

Since then I’ve learned that VMware documents that you should not use port binding in this topology. I was skeptical of this since I wasn’t confident that an HBA scan would make attempts on all vmkernel ports. However, I’ve retested my instructions without port binding, and I can confirm that I’m able to achieve MPIO connectivity to the storage without it.
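
If you want to verify this yourself, a small PowerCLI sketch along these lines (assuming an existing Connect-VIServer session; this is not from the original post) rescans each host and reports the path count per LUN, where the iSCSI LUNs should show more than one active path:

# Rescan all HBAs on each host and report per-LUN path counts
foreach($esx in Get-VMHost) {
  Get-VMHostStorage -VMHost $esx -RescanAllHba | Out-Null

  foreach($lun in Get-ScsiLun -VmHost $esx -LunType disk) {
    $paths = Get-ScsiLunPath -ScsiLun $lun
    [PSCustomObject]@{
      Host          = $esx.Name
      CanonicalName = $lun.CanonicalName
      PathCount     = $paths.Count
      ActivePaths   = ($paths | Where-Object { $_.State -eq "Active" }).Count
    }
  }
}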

I’ve updated my instructions to remove the port binding step.

Firefox exits full-screen mode when you press Esc

I’m switching back from Google Chrome to Firefox as my primary web browser. One thing that annoys me with Firefox is that it exits full-screen mode when you press the Escape key.

You can change this behavior with the following steps:

  1. Type about:config in the address bar to access Advanced Preferences
  2. If Firefox displays a warning message, click “Accept the Risk and Continue”
  3. In the search box, type “escape”
  4. Firefox should display a configuration item named browser.fullscreen.exit_on_escape
  5. If this item’s value is false, you are good to go. If it is true, click the toggle button (⇌) on the right-hand side to set it to false.
  6. Close the Advanced Preferences tab.
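
Alternatively, if you manage your Firefox preferences through a user.js file, the same setting can be pinned there (a one-line sketch; place it in your profile directory):

user_pref("browser.fullscreen.exit_on_escape", false);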

Failure to decrypt VM or disk

You should upgrade your vCenter to 7.0u3o anyway because of VMSA-2023-0023.

However, you may also want to upgrade to this version if you are using vSphere encryption. I have found that some earlier versions of vCenter 7.0u3 may at times fail to decrypt VMs or disks. This seems to occur when moving a VM from one host to another, when starting a stopped VM, and when creating a snapshot. I’m not sure what the cause of this error is; in our case it seemed to happen for recently rekeyed VMs, and I hypothesize that it occurred in cases where the rekey succeeded but the key provider took a long time to generate the key.

Initially we were able to recover from this state by attempting to vMotion VMs to alternate hosts until successful. However, VMware support recommended we upgrade to 7.0u3o, and we haven’t seen the problem since then. There is a relevant release note in 7.0u3o referring to a failure to “apply an encryption storage policy to a virtual machine,” which I believe is related to the issue we saw.
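
If you want to confirm which release your vCenter is currently running before planning the upgrade, a quick PowerCLI check (assuming an existing Connect-VIServer session) shows the version and build to compare against the 7.0u3o release notes:

# Show the connected vCenter version and build number
$global:DefaultVIServer | Select-Object Name, Version, Build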

VMware encryption: leaked keys and key inventory

VMware vSphere generally leaks keys when objects are deleted or decrypted. The reason, I believe, is that VMware assumes you might have a backup copy of the object and may need the key in the future to restore it. For example, consider the case of a VM that is removed from inventory but remains on disk. VMware cannot know whether you will permanently delete this VM or add it back to inventory. Therefore VMware allows the key to remain in your key provider.

Over time this results in the growth of unused keys in your key provider. In order to clean up unused keys, you first need to inventory the keys that are in use by active objects. The following PowerCLI script uses the VMware.VMEncryption and VMware.VsanEncryption modules from VMware’s PowerCLI community repository. It inventories all keys in use by your hosts (for core dumps), by vSAN clusters (for vSAN disk encryption), and by VMs and disks (for vSphere encryption).

# Requires an existing Connect-VIServer session and the community
# VMware.VMEncryption and VMware.VsanEncryption modules mentioned above
# (assumes the modules are installed in your module path)
Import-Module VMware.VMEncryption
Import-Module VMware.VsanEncryption

$keydata = @()

# Collect host keys
foreach($myhost in Get-VMHost) {
  if($myhost.CryptoSafe) {
    $hostdata = [PSCustomObject]@{
      type        = "host"
      name        = $myhost.Name
      keyprovider = $myhost.KMSserver
      keyid       = $myhost.ExtensionData.Runtime.CryptoKeyId.KeyId
    }
    $keydata += $hostdata
  }
}

# collect vSAN keys
$vsanClusterConfig = Get-VsanView -Id "VsanVcClusterConfigSystem-vsan-cluster-config-system"
foreach($mycluster in Get-Cluster) {
  $vsanEncryption = $vsanClusterConfig.VsanClusterGetConfig($mycluster.ExtensionData.MoRef).DataEncryptionConfig

  if($mycluster.VsanEnabled -and $vsanEncryption.EncryptionEnabled) {
    $clusterdata = [PSCustomObject]@{
      type        = "cluster"
      name        = $mycluster.Name
      keyprovider = $vsanEncryption.kmsProviderId.Id
      keyid       = $vsanEncryption.kekId
    }
    $keydata += $clusterdata
  }
}

# collect VM and disk keys
foreach($myvm in Get-VM) {
  if($myvm.encrypted) {
    $vmdata = [PSCustomObject]@{
      type        = "vm"
      name        = $myvm.Name
      keyprovider = $myvm.KMSserver
      keyid       = $myvm.EncryptionKeyId.KeyId
    }
    $keydata += $vmdata
  }

  foreach($mydisk in Get-HardDisk -vm $myvm) {
    if($mydisk.encrypted) {
      $diskdata = [PSCustomObject]@{
        type        = "harddisk"
        name        = $myvm.Name + " | " + $mydisk.Name
        keyprovider = $mydisk.EncryptionKeyId.ProviderId.Id
        keyid       = $mydisk.EncryptionKeyId.KeyId
      }
      $keydata += $diskdata
    }
  }
}

$keydata | Export-CSV -Path keys.csv -NoTypeInformation 

There are some important caveats to note:

  1. This script is over-zealous; it may report that a key is in use multiple times (e.g., host encryption keys shared by multiple hosts, or VM encryption keys shared by the disks of a VM).
  2. Your vCenter may be connected to multiple key providers. Before deleting any keys, take care that you identify which keys are in use for each key provider.
  3. You may have multiple vCenters connected to the same key provider. Before deleting any keys, take care to collect inventory across all vCenters and any other clients connected to each key provider.
  4. As noted above, you may have VM backups or other resources that are still dependent on an encryption key, even after the original object has been deleted. Before deleting any keys, take care to ensure you have identified which keys may still be in use by your backups.
  5. This script does not address the case of environments using VMware vSphere Trust Authority (vTA).
  6. Importantly, this script does not address the case of “first-class disks,” or what VMware Cloud Director calls “named disks.”
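
To help with caveat 1, you can post-process the exported CSV with standard PowerShell to collapse it to one row per key provider and key ID (a small sketch; unique-keys.csv is just an example output name):

# Deduplicate the inventory to one row per (key provider, key ID) pair
Import-Csv keys.csv |
  Select-Object keyprovider, keyid -Unique |
  Export-Csv unique-keys.csv -NoTypeInformation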

Converting from VUM to vLCM

IBM Cloud vCenter Server (VCS) instances are currently deployed with VUM baselines enabled. After customizing my VCS environment, here is how I switched to vLCM images:

  1. Depending on the vSphere version, it is possible that the drivers provided by the default image may not be at the version level you need. Consult the VMware HCL for your ESXi version and hardware, and identify the driver version you need. Typically for 10GbE you need i40en, for 25GbE you need icen, and for the RAID controller you need lsi_mr3.
  2. Locate the needed driver version for your vSphere release at VMware Customer Connect, or work with IBM Cloud support to obtain it. Download the ZIP file and expand it to locate the inner ZIP file.
  3. In vCenter, navigate in the main menu to Lifecycle Manager. Select Actions | Import Updates and upload the ZIP file(s) you obtained in step 2 above.
  4. Navigate to vCenter inventory, select your cluster, and select Updates | Image. Then click Setup Image Manually.
  5. In Step 1, choose the vSphere version you desire for your image. Display the details for Components and click Add Components. Change the filter to show “Independent Components and Vendor Addon Components,” then review the drivers you identified earlier in steps 1-2. If the default version differs from the one you need, add the needed version to your image. For your convenience you may also want to include vmware-storcli. Then save the image you defined.
  6. In Step 2, after the compliance check completes, review the compliance of each of your hosts and resolve any issues.
  7. Click Finish Image Setup and confirm.
  8. At this point you can remediate your cluster to the new image!
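
If you want to double-check which driver versions are currently installed while working through step 1, a small PowerCLI sketch like the following (assuming an existing Connect-VIServer session; the driver names are just the examples from step 1) can list them per host:

# List the installed versions of the drivers called out in step 1 on each host
foreach($esx in Get-VMHost) {
  $esxcli = Get-EsxCli -VMHost $esx -V2
  $esxcli.software.vib.list.Invoke() |
    Where-Object { $_.Name -in @("i40en", "icen", "lsi_mr3") } |
    Select-Object @{N="Host";E={$esx.Name}}, Name, Version
}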

Customizing your VMware on IBM Cloud environment

I like to apply the following customizations to my vCenter Server (VCS) on IBM Cloud instance after deploying it:

SSH customizations

Out of the box, the vCenter customerroot user is not enabled to use the bash shell. After logging in as customerroot, you can self-enable shell access by running:

shell.set --enabled true

Note that in a future release, IBM expects to remove the need for the customerroot user and will provide the root credentials directly to you for newly deployed instances.

Additionally, I like to install my SSH key into vCenter so that I don’t need to provide a password to log in. This involves two steps:

  1. Copy my SSH public key to either /root/.ssh/authorized_keys or /home/customerroot/.ssh/authorized_keys. Note that if you create the folder you should set its permissions to 700, and if you create the file you should set its permissions to 600.
  2. vCenter will only allow you to use key-based login if you set your login shell to bash:
    chsh -s /bin/bash

Note that your authorized key will persist across a major release upgrade of vCenter, but your choice of default shell will not. You will have to perform step 2 again after upgrading vCenter to the next major release.

Although SSH is initially disabled on the hosts, I also add my key to each host’s authorized keys list. For ESXi, the file you should edit is /etc/ssh/keys-<username>/authorized_keys as noted in KB 1002866.

Public connectivity

Some of your activities in vCenter benefit from public connectivity. For example, vCenter is able to refresh the vSAN hardware compatibility list proactively.

vCenter supports the use of proxy servers for some of its internet connectivity. Since I have access only to an HTTP but not an HTTPS proxy, I configure this by manually editing /etc/sysconfig/proxy as follows:

PROXY_ENABLED="yes"
HTTP_PROXY="http://10.11.12.13:3128/"
HTTPS_PROXY="http://10.11.12.13:3128/"

Alternately, if your instance has public connectivity enabled, you can configure vCenter routes to use your services NSX edge to SNAT to the public network. This involves the following steps:

  1. Log in to NSX manager and select Security | Gateway Firewall, then manage the firewall for the T0 gateway with “service” in its name. Add a new policy for “vCenter internet” and add a rule to this policy with the same name, set to allow traffic. The source IP for this rule should be your vCenter appliance IP, and the destination and allowed services can be Any. Publish your changes. Note that these changes may be overwritten later by IBM Cloud automation in some cases if you deploy or remove add-on services like Zerto and Veeam.
  2. Still in NSX manager, select Networking | NAT. Verify that there is already an SNAT configured for the T0 service gateway that allows all 10.0.0.0/8 traffic to SNAT to the public internet.
  3. Identify the NSX edge’s private IP so that we can configure a route to it later. Still in NSX manager, navigate to Networking | Tier-0 Gateways, and expand the gateway with “service” in its name. Click the number next to “HA VIP configuration” and note the IP address associated with the private uplinks, for example, 10.20.21.22.
  4. Log in to the vCenter appliance shell (or run appliancesh from the bash prompt). Run the following command to identify the IBM Cloud private router IP address. It will be the Gateway address associated with the 0.0.0.0 destination, for example, 10.30.31.1:
    com.vmware.appliance.version1.networking.routes.list
  5. Now we need to configure three static routes to direct all private network traffic to the private router, substituting the address you learned in step 4 above. IBM Cloud uses the following IP networks on its private network:
    com.vmware.appliance.version1.networking.routes.add --destination 10.0.0.0 --prefix 8 --gateway 10.30.31.1 --interface nic0
    com.vmware.appliance.version1.networking.routes.add --destination 161.26.0.0 --prefix 16 --gateway 10.30.31.1 --interface nic0
    com.vmware.appliance.version1.networking.routes.add --destination 166.8.0.0 --prefix 14 --gateway 10.30.31.1 --interface nic0
  6. Finally we can reconfigure the default gateway. First display the nic0 configuration:
    com.vmware.appliance.version1.networking.ipv4.list
  7. In this configuration we want to modify only the default gateway address. Keeping all the other details we learned from step 6, and substituting the edge private IP address we learned in step 3, run the following command:
    com.vmware.appliance.version1.networking.ipv4.set --interface nic0 --mode static --address 10.1.2.3 --prefix 26 --defaultGateway 10.20.21.22

Note: If you follow the approach of setting up SNAT and customizing routes, in my experience this can cause problems when you upgrade vCenter to the next major release. It appears that the static routes configured in step 5 do not persist across the upgrade, resulting in no traffic being routed to the private network. Before starting a major release upgrade, you should set the vCenter default route that you configured in step 7 back to the IBM Cloud private router. After the release upgrade, you need to reintroduce the three routes you added in step 5, as well as update the default route from step 7 to point back to the NSX edge.

vSAN configuration

I customize my vSAN configuration as follows:

  1. In vCenter, navigate to the cluster’s Configuration | vSAN | Services and edit the Performance Service; set it to Enabled.
  2. Navigate to the cluster’s Configuration | vSAN | Services and edit the Data Services; enable Data-In-Transit encryption.
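
If you prefer to script step 1, here is a minimal PowerCLI sketch (assuming the VMware.VimAutomation.Storage module, an existing Connect-VIServer session, and a placeholder cluster name); I still enable Data-In-Transit encryption in step 2 through the UI:

# Enable the vSAN performance service on the cluster (step 1 above)
$cluster = Get-Cluster -Name "cluster01"   # placeholder name
Get-VsanClusterConfiguration -Cluster $cluster |
  Set-VsanClusterConfiguration -PerformanceServiceEnabled $true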

Firmware updates

Your host may be provisioned with optional firmware updates pending, and additional firmware updates may be issued by IBM Cloud at any time thereafter. Available firmware updates will be displayed on the Firmware tab of your bare metal server resource in the IBM Cloud console. You can update firmware for a host with the following steps:

  1. In vCenter, place the host in maintenance mode and wait for it to enter successfully.
  2. In the IBM Cloud console, perform Actions | Power off and wait for the host to power off.
  3. In the IBM Cloud console, perform Actions | Update firmware. This action may take several hours to complete.
  4. In vCenter, remove the host from maintenance mode.
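
If you want to script the vCenter-side steps (1 and 4), a PowerCLI sketch along these lines works (assuming an existing Connect-VIServer session and a placeholder host name; with DRS in fully automated mode, running VMs are migrated off automatically):

# Step 1: enter maintenance mode; -Evacuate also migrates powered-off/suspended VMs
$vmhost = Get-VMHost -Name "host01.example.com"   # placeholder name
Set-VMHost -VMHost $vmhost -State Maintenance -Evacuate | Out-Null

# ...power off, update firmware, and power on via the IBM Cloud console (steps 2-3)...

# Step 4: once the host reconnects, exit maintenance mode
Set-VMHost -VMHost $vmhost -State Connected | Out-Null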

Occasionally I have found that either the firmware update fails, or it succeeds but the success is not reflected in the IBM Cloud console and an update still appears to be available. In cases like this you can resolve the issue by opening an IBM Cloud support ticket.

IPMI

At deploy time, your bare metal servers have IPMI interfaces enabled. Although these interfaces are on your dedicated private VLAN, it is still a best practice to disable them to reduce the internal management surface area. You can do this using the SoftLayer CLI and providing the bare metal server ID that is displayed in the server details page in the IBM Cloud console:

slcli hardware toggle-ipmi --disable 1234567
slcli hardware toggle-ipmi --disable 3456789
. . .

Planning for VMware vSAN ESA

I wrote previously about some considerations for migrating from VMware vSAN Original Storage Architecture (OSA) to Express Storage Architecture (ESA). There are some additional important planning considerations for your hardware choice for vSAN ESA. Even if you are already leveraging NVMe drives using vSAN OSA, your existing hardware may not be supported for ESA. Here are some important considerations:

  • Although OSA was certified on a component level, ESA is certified at the node level using vSAN ESA ReadyNode.
  • These ReadyNode configurations are limited to newer processors.
  • The minimum ReadyNode configuration for compute is 32 cores and 512GB of memory.
  • Although vSAN ESA does not use cache drives, the minimum storage configuration for ESA is four NVMe devices per host. The minimum capacity required for each drive is 1.6TB. At the time of this writing, the largest certified drives are 6.4TB.
  • The minimum network configuration for ESA is 25GbE.
  • The use of TPM 2.0 is recommended.
  • With a RAID-5 configuration (erasure coding, FTT=1) you can now deploy as few as three hosts using ESA. All other configurations have the same fixed and recommended minimums as with OSA. As always, with any FTT=1 configuration, you must perform a “full data migration” during host maintenance if you want your storage to remain resilient against host or drive loss during the maintenance window.
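
As a rough first pass against the compute minimums above, you can inventory your existing hosts with a PowerCLI sketch like this (assuming an existing Connect-VIServer session; it checks only cores and memory, not the actual ReadyNode certification):

# Compare each host against the 32-core / 512GB ESA ReadyNode compute minimums
foreach($esx in Get-VMHost) {
  [PSCustomObject]@{
    Host            = $esx.Name
    Cores           = $esx.ExtensionData.Hardware.CpuInfo.NumCpuCores
    MemoryGB        = [math]::Round($esx.MemoryTotalGB)
    MeetsEsaCompute = ($esx.ExtensionData.Hardware.CpuInfo.NumCpuCores -ge 32 -and $esx.MemoryTotalGB -ge 512)
  }
}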