vCenter key provider client certificates, part 2

Previously I explained how vCenter creates a new client certificate with each key provider connection. This is a good thing; it enables you to connect vCenter to the same provider multiple times as a different identity, which can be valuable in certain multitenant use cases.

However, the vCenter UI dialog that generates this certificate has a bug. For a split second, the UI presents one certificate, then switches to a new value. If you click the copy button too quickly, you will copy the wrong certificate:

Be sure to wait for the screen to refresh before copying your certificate!
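
If you are unsure whether you copied the final value, you can inspect the certificate before sending it onward and compare its fingerprint with what the UI shows after the refresh. Here is a sketch using openssl; the file names are invented, and the first command just generates a throwaway self-signed certificate to stand in for the one you would paste from the UI:

```shell
# Stand-in for the certificate copied from the vCenter UI (names are examples):
# generate a throwaway self-signed cert so the check below is runnable as-is.
openssl req -x509 -newkey rsa:2048 -nodes -days 3650 -subj "/CN=vcenter-kmip-demo" \
  -keyout demo-key.pem -out vcenter-kmip-client.pem 2>/dev/null

# The actual check: show the subject, expiration, and SHA-256 fingerprint of the
# copied certificate, and compare the fingerprint against the refreshed UI value.
openssl x509 -in vcenter-kmip-client.pem -noout -subject -enddate -fingerprint -sha256
```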

Customizing root login for VMware Cloud Director

Your Linux VM running in VMware Cloud Director might come with the best-practice sshd configuration that disables root password login. This can prevent you from using the root password that you set with Director’s Guest OS Customization:

#PermitRootLogin prohibit-password

You can override this behavior using a Guest OS Customization script in a couple of ways. The simplest approach is to use your customization script to set the sshd configuration to allow root password logins:

#!/bin/bash
sed -i -e "s/#PermitRootLogin prohibit-password/PermitRootLogin yes/" /etc/ssh/sshd_config
systemctl restart sshd   # apply the change without waiting for a reboot

Or, if you prefer, you can use the customization script to insert an SSH public key for the root user:

#!/bin/bash
mkdir -p /root/.ssh && chmod 700 /root/.ssh
echo "ssh-rsa AAAAB3...DswrcTw==" >> /root/.ssh/authorized_keys
chmod 600 /root/.ssh/authorized_keys

vSAN sizer

I find that there are several commonly overlooked considerations when sizing a vSAN environment:

  • It is not recommended to operate a vSAN environment at over 70% capacity
  • If you use a resilience strategy of FTT=1, you should plan to perform a full evacuation during host maintenance or else you will be at risk of data loss due to drive failure during maintenance. Depending on your configuration and usage, the time required for a full evacuation can easily take 24 hours or more. In addition, a maintenance strategy of full evacuation requires you to leave one host’s worth of capacity empty.
  • Because of these considerations, I recommend a resilience strategy of FTT=2. With this strategy you have the option of performing host maintenance using ensure-accessibility rather than full evacuation, which is much faster but is still resilient to one failure during maintenance.
  • If you size your environment strictly to the minimum number of nodes for your configuration, then you will fail to create virtual machines or snapshots during host maintenance—including any snapshots used for backups or replication—unless you force provisioning of the object contrary to the storage policy. For this reason, you should consider provisioning at least one more host than is strictly required.
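
To see how these rules compound, here is a back-of-the-envelope calculation. All of the numbers are illustrative assumptions, not sizing guidance: a 6-host cluster with 20 TB raw per host, FTT=2 via RAID-6 (a ~1.5x capacity overhead), one spare host's worth of capacity reserved for maintenance, and the 70% utilization ceiling:

```shell
# Illustrative only: rough usable-capacity estimate for a hypothetical cluster.
hosts=6
tb_per_host=20
raw=$((hosts * tb_per_host))   # 120 TB raw across the cluster
spare=$tb_per_host             # reserve one host's worth of capacity
awk -v raw="$raw" -v spare="$spare" 'BEGIN {
  usable = (raw - spare) / 1.5 * 0.70   # RAID-6 overhead, then 70% ceiling
  printf "Estimated usable capacity: %.1f TB\n", usable
}'
# prints "Estimated usable capacity: 46.7 TB"
```

The point of the sketch: of 120 TB raw, well under half ends up usable once resilience, maintenance headroom, and the utilization ceiling are accounted for.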

Many of these considerations are summarized in this VMware blog post, which includes a helpful table documenting host minimums and RAID ratios: Adaptive RAID-5 Erasure Coding with the Express Storage Architecture in vSAN 8.

I’ve taken these considerations and created a vSAN sizer Excel workbook, to help both with planning and sizing a vSAN environment.

vCenter key provider client certificates

When you are configuring a standard key provider in VMware vCenter, you can authenticate vCenter to the key provider using any of the following options:

  • vCenter Root CA Certificate
  • vCenter Certificate
  • Upload [a custom] certificate and private key
  • New Certificate Signing Request (CSR) [to be processed by the key provider]

I commonly choose “vCenter certificate.” This uses a certificate signed by the vCenter CA. The certificate is generated specifically for use with KMIP and it has a 10-year expiration.

Importantly, a new certificate is generated by vCenter for each key provider that you configure.

Furthermore, if you cancel the trust process before completing it, vCenter will generate a new certificate the next time you perform the trust process. I’ve been bitten by this in the past—I generated a certificate, cancelled the dialog, and sent the certificate to my cryptographic administrator. When I received confirmation the certificate had been configured, I re-initiated the trust process, but this time it used a new certificate. This took quite some time to debug. Make sure that you complete the trust process even if you expect there to be a waiting period before the certificate is configured in your key provider!

Updated instructions for multipath iSCSI in IBM Cloud

Several years ago I blogged detailed instructions to configure multipath iSCSI in IBM Cloud’s classic infrastructure using Endurance block storage.

Since then I’ve learned that VMware documents that you should not use port binding in this topology. I was skeptical of this, since I wasn’t confident that an HBA rescan would attempt connections on all vmkernel ports. However, I’ve retested my instructions without port binding and can confirm that I’m able to achieve MPIO connectivity to the storage without it.

I’ve updated my instructions to remove the port binding step.

Failure to decrypt VM or disk

You should upgrade your vCenter to 7.0u3o anyway because of VMSA-2023-0023.

However, you may also want to upgrade to this version if you are using vSphere encryption. I have found that some earlier versions of vCenter 7.0u3 may at times fail to decrypt VMs or disks. This seems to occur when moving a VM from one host to another, when starting a stopped VM, and when creating a snapshot. I’m not sure of the cause; in our case it seemed to happen for recently rekeyed VMs, and I hypothesize that it occurred when the rekey succeeded but the key provider took a long time to generate the key.

Initially we were able to recover from this state by attempting to vMotion VMs to alternate hosts until successful. However, VMware support recommended we upgrade to 7.0u3o, and we haven’t seen the problem since then. There is a relevant release note in 7.0u3o referring to a failure to “apply an encryption storage policy to a virtual machine,” and I believe this is related to the issue we saw.

VMware encryption: leaked keys and key inventory

VMware vSphere generally leaks keys when objects are deleted or decrypted. The reason for doing so, I believe, is because VMware supposes that you might have a backup copy of the object and may need the key in the future to restore that object. For example, consider the case of a VM that is removed from inventory but remains on disk. VMware cannot know whether you will permanently delete this VM or add it back to inventory. Therefore VMware allows the key to remain in your key provider.

Over time this results in the growth of unused keys in your key provider. In order to clean up unused keys, you first need to inventory the keys that are in use by active objects. The following PowerCLI script uses the VMware.VMEncryption and VMware.VsanEncryption modules in VMware’s PowerCLI community repository. It will inventory all keys in use by your hosts (for core dumps), in use by vSAN clusters (for vSAN disk encryption), and in use by VMs and disks (for vSphere encryption).

$keydata = @()

# Collect host keys
foreach($myhost in Get-VMHost) {
  if($myhost.CryptoSafe) {
    $hostdata = [PSCustomObject]@{
      type        = "host"
      name        = $myhost.Name
      keyprovider = $myhost.KMSserver
      keyid       = $myhost.ExtensionData.Runtime.CryptoKeyId.KeyId
    }
    $keydata += $hostdata
  }
}

# collect vSAN keys
$vsanClusterConfig = Get-VsanView -Id "VsanVcClusterConfigSystem-vsan-cluster-config-system"
foreach($mycluster in Get-Cluster) {
  # Skip clusters without vSAN before querying their encryption config
  if(-not $mycluster.vSanEnabled) { continue }
  $vsanEncryption = $vsanClusterConfig.VsanClusterGetConfig($mycluster.ExtensionData.MoRef).DataEncryptionConfig

  if($vsanEncryption.EncryptionEnabled) {
    $clusterdata = [PSCustomObject]@{
      type        = "cluster"
      name        = $mycluster.Name
      keyprovider = $vsanEncryption.kmsProviderId.Id
      keyid       = $vsanEncryption.kekId
    }
    $keydata += $clusterdata
  }
}

# collect VM and disk keys
foreach($myvm in Get-VM) {
  if($myvm.encrypted) {
    $vmdata = [PSCustomObject]@{
      type        = "vm"
      name        = $myvm.Name
      keyprovider = $myvm.KMSserver
      keyid       = $myvm.EncryptionKeyId.KeyId
    }
    $keydata += $vmdata
  }

  foreach($mydisk in Get-HardDisk -vm $myvm) {
    if($mydisk.encrypted) {
      $diskdata = [PSCustomObject]@{
        type        = "harddisk"
        name        = $myvm.Name + " | " + $mydisk.Name
        keyprovider = $mydisk.EncryptionKeyId.ProviderId.Id
        keyid       = $mydisk.EncryptionKeyId.KeyId
      }
      $keydata += $diskdata
    }
  }
}

$keydata | Export-CSV -Path keys.csv -NoTypeInformation 

There are some important caveats to note:

  1. This script is over-zealous; it may report that a key is in use multiple times (e.g., host encryption keys shared by multiple hosts, or VM encryption keys shared by the disks of a VM).
  2. Your vCenter may be connected to multiple key providers. Before deleting any keys, take care that you identify which keys are in use for each key provider.
  3. You may have multiple vCenters connected to the same key provider. Before deleting any keys, take care to collect inventory across all vCenters and any other clients connected to each key provider.
  4. As noted above, you may have VM backups or other resources that are still dependent on an encryption key, even after that resource has been deleted. Before deleting any keys, take care to ensure you have identified which keys may still be in use for your backups.
  5. This script does not address the case of environments using VMware vSphere Trust Authority (vTA).
  6. Importantly, this script does not address the case of “first-class disks,” or what VMware Cloud Director calls “named disks.”
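
With those caveats in mind, one way to find deletion candidates is to diff the inventory against a key list exported from your key provider. This sketch assumes a provider export with one key ID per line; the file names and the inline sample data (standing in for the script's keys.csv output) are invented for illustration:

```shell
# Inline sample data standing in for the PowerCLI script's keys.csv output and
# for a key-ID export from the key provider (both invented for illustration).
cat > keys.csv <<'EOF'
"type","name","keyprovider","keyid"
"vm","web01","kp1","11111111-aaaa"
"harddisk","web01 | Hard disk 1","kp1","11111111-aaaa"
"vm","db01","kp1","22222222-bbbb"
EOF
printf '%s\n' 11111111-aaaa 22222222-bbbb 33333333-cccc > provider-keys.txt

# Key IDs referenced by at least one vSphere object (column 4 of keys.csv):
cut -d, -f4 keys.csv | tr -d '"' | tail -n +2 | sort -u > inuse.txt
sort -u provider-keys.txt > provider.txt

# Keys the provider holds that no object references: cleanup candidates,
# subject to the backup and multi-vCenter caveats above.
comm -23 provider.txt inuse.txt   # prints 33333333-cccc
```

Note that this naive CSV parsing assumes key IDs and names contain no commas; for messier data, parse the CSV properly before comparing.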

Converting from VUM to vLCM

IBM Cloud vCenter Server (VCS) instances are currently deployed with VUM baselines enabled. After customizing my VCS environment, here is how I switched to vLCM images:

  1. Depending on the vSphere version, the drivers provided by the default image may not be at the version level you need. Consult the VMware HCL for your ESXi version and hardware, and identify the driver version you need. Typically for 10GbE you need i40en, for 25GbE you need icen, and for a RAID controller you need lsi_mr3.
  2. Locate the needed driver version for your vSphere release at VMware Customer Connect, or work with IBM Cloud support to obtain it. Download the ZIP file and expand it to find the offline-bundle ZIP inside.
  3. In vCenter, navigate in the main menu to Lifecycle Manager. Select Actions | Import Updates and upload the ZIP file(s) you obtained in step 2 above.
  4. Navigate to vCenter inventory, select your cluster, and select Updates | Image. Then click Setup Image Manually.
  5. In Step 1, choose the vSphere version you desire for your image. Display the details for Components and click Add Components. Change the filter to show “Independent Components and Vendor Addon Components,” then review the drivers you identified earlier in steps 1-2. If the default version differs from the one you need, add it to your image. For your convenience you may want to include vmware-storcli. Then save the image you defined.
  6. In Step 2, after the compliance check completes, review the compliance of each of your hosts and resolve any issues.
  7. Click Finish Image Setup and confirm.
  8. At this point you can remediate your cluster to the new image!
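
A note on step 2: the download is often a ZIP that merely wraps the offline-bundle ZIP that Lifecycle Manager actually expects. Listing the outer archive shows the inner bundle to upload. In this sketch the nested archive is generated as a stand-in, and all file names are invented; real driver downloads will differ:

```shell
# Build a stand-in nested archive so the listing below is runnable as-is.
printf 'demo' > component.txt
python3 -m zipfile -c inner-offline-bundle.zip component.txt
python3 -m zipfile -c VMW-ESX-7.0.3-i40en-driver.zip inner-offline-bundle.zip

# List the outer archive; the .zip entry inside it is what you import in
# Lifecycle Manager (Actions | Import Updates).
python3 -m zipfile -l VMW-ESX-7.0.3-i40en-driver.zip
```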