Fixed: Unexpected reboot when rekeying a virtual machine

vSphere 8.0u3e (see release notes) fixes the issue where a rekeyed VM may experience an unexpected reboot:

PR 3477772: Encrypted virtual machines with active Change Block Tracking (CBT) might intermittently power off after a rekey operation

Due to a race condition between the VMX and hostd services for a specific VMcrypt-related reconfiguration, encrypted VMs with active CBT might unexpectedly power off during a rekey operation.

This issue is resolved in this release.

From what I can tell, this issue is not currently fixed in the vSphere 7 stream.

Unexpected reboot when rekeying a virtual machine

In recent builds of vSphere 7 and vSphere 8, my team has experienced unexpected spontaneous reboots of virtual machines while rekeying them. In our case we were rekeying these machines against a new key provider.

EDIT: Broadcom support has now published KB 387897 documenting this issue. The issue is a kind of race condition between the rekey task and some other activity that is touching the changed block tracking (CBT) file for the virtual machine. Under some conditions the latter activity fails to open the CBT file, and vSphere HA reboots the virtual machine.

The reboots seem unpredictable. Although we are using CBT for backup, we had no in-flight backup job running at the time (since you cannot rekey a virtual machine with snapshots). At times as few as 1% of the rekeyed machines were spontaneously rebooted, but at other times as many as 20% were affected.

We understand that Broadcom will fix this race condition in a future release, but in the meantime if you plan to rekey a virtual machine that is using CBT for backup or replication, you should either:

  1. Perform an orderly shutdown of the virtual machine before the rekey if you cannot tolerate a spontaneous reboot, or
  2. Disable CBT for the duration of the rekey. You need to evaluate whether your BCDR software can tolerate this, or if you need to perform a full backup or replication to recover from the loss of CBT.

Common vCenter KMS problems and optimizations

Here I collect some blog posts with vCenter key provider configuration recommendations:

And here are some additional VMware encryption resources:

Intermittent vCenter KMS connectivity alarms

I’ve seen a number of cases where vCenter issues intermittent KMS connectivity alarms. This often happens in environments where the network or KMS latency is relatively high. One tip provided by VMware / Broadcom support is to remove expired KMS certificates from the vCenter trust store. This is only my impression, but as best as I can tell, these expired certificates do not prevent successful connectivity; they do, however, add processing delay, which makes the health alarms more likely to trigger.

If you are experiencing one of the following alarms intermittently, you should consider a cleanup of expired CA certificates:

  • Certificate Status
  • Key Management Server Health Status Alarm
  • KMS Server Certificate Status

Broadcom support referred us to the following Knowledge Base articles to view and remove certificates from the vCenter trust store:

In particular, for KMS related alarms, you want to evaluate the certificates in the KMS_ENCRYPTION trust store.
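
If you want to inspect this store from the vCenter appliance shell, here is a rough sketch using vecs-cli. Run it as root; the store name comes from the alarm context above, and <alias> is a placeholder you replace with a value taken from the list output. Confirm the exact procedure against the KB articles before deleting anything:

# List the entries in the KMS_ENCRYPTION store along with their expiration dates
/usr/lib/vmware-vmafd/bin/vecs-cli entry list --store KMS_ENCRYPTION --text | grep -E 'Alias|Not After'

# Remove an expired entry by its alias, taken from the list output above
/usr/lib/vmware-vmafd/bin/vecs-cli entry delete --store KMS_ENCRYPTION --alias <alias> -y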

Migrating to the IBM Cloud native KMIP provider

The IBM Cloud Key Protect key management offering has introduced a native KMIP provider to replace the existing “KMIP for VMware” provider. The new native provider offers the following advantages:

  • Improved performance because KMIP-to-key-provider calls are closer in network distance and no longer cross service-to-service authorization boundaries.
  • Improved visibility and management for the KMIP keys.

You can find documentation here: Using the key management interoperability protocol (KMIP)

IBM Cloud’s Hyper Protect Crypto Services (HPCS) offering is exploring the possibility of supporting native KMIP providers as well. Stay tuned if you are a user of HPCS.

If you already use the KMIP for VMware provider with Key Protect, you should switch to the new native provider for improved performance. Here’s how you can migrate to the new provider.

First, navigate to your Key Protect instance:

Create a KMIP adapter:

You don’t need to upload a vCenter certificate immediately; in fact, remember that vCenter generates a new certificate with each connection attempt.

Click the Endpoints tab to identify the KMIP endpoint you need to configure in vCenter:

Note that, unlike the KMIP for VMware offering, there is only one endpoint. This single hostname is load balanced and is highly available in each region. Now go to vCenter, select the vCenter object, select Configure | Key Providers, then add a standard key provider:

Examine and trust the certificate:

Now select the new key provider, and select the single server in the provider. Click Establish Trust | Make KMS trust vCenter. I prefer to use the vCenter Certificate option which will generate a new certificate just for this connection.

Remember to wait a few seconds before copying the certificate because it may change. Then copy the certificate and click Done:

Importantly, at this step you need to follow my instructions to reconfigure vCenter to trust the KMIP CA certificate instead of the end-entity certificate. You should do this for two reasons. First, you won’t have to re-trust the certificate every time it is rotated. More importantly, in some cases the native KMIP provider serves alternate certificates on the private connection, and this can confuse vSAN encryption. (The alternate certificates both include the private hostname among their subject alternative names, so they are valid. The underlying reason for this difference is that VMware is in the process of adding SNI support to their KMIP connections, and the server behavior differs depending on whether the client sends SNI.) Trusting the CA certificate ensures that the connection is trusted even if an alternate certificate is served on the connection.
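
If you want to observe this SNI-dependent behavior for yourself, you can compare the certificate the endpoint serves with and without SNI. This is just a sketch: substitute your provider’s private endpoint, and note that the -noservername option requires OpenSSL 1.1.1 or later.

HOST=private.eu-de.kms.cloud.ibm.com   # replace with your provider's private KMIP endpoint

# Leaf certificate when the client sends SNI
openssl s_client -connect $HOST:5696 -servername $HOST </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer

# Leaf certificate when the client does not send SNI
openssl s_client -connect $HOST:5696 -noservername </dev/null 2>/dev/null | openssl x509 -noout -subject -issuer

If the subject or issuer differs between the two, you are seeing the alternate certificate, and trusting the CA covers both cases.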

Then return to the IBM Cloud and view the details of your KMIP adapter:

Select the SSL certificates tab and click Add certificate:

Paste in the certificate you copied from vCenter:

Back in vCenter, it may take several minutes before the key provider status changes to healthy:

First we need to ensure that any new encrypted objects leverage the new key provider. Select the new provider and click Set as Default. You will be prompted to confirm:

Next we need to migrate all existing objects to the new key provider.

I previously wrote about how you can accomplish this using PowerCLI. You would have to combine the techniques from connecting to multiple key providers and rekeying all objects, adding the key provider parameter to each command. After importing the VMEncryption and VsanEncryption modules and connecting to vCenter, this would look something like the following.

WARNING: Since first publishing this, I have learned that in some configurations, vSphere HA may reboot a virtual machine that is encrypted with vSphere encryption and is being rekeyed. Please read that linked post for information on how you can work around this problem.

# Rekey host keys used for core dumps
# In almost all cases hosts in the same cluster are protected by the same provider and key,
# but this process ensures they are protected by the new key provider
# It is assumed here that all hosts are already in clusters enabled for encryption.
# Beware: If not, this command will initialize hosts and clusters for encryption.
foreach($myhost in Get-VMHost) {
  echo $myhost.name
  Set-VMHostCryptoKey -VMHost $myhost -KMSClusterId new-key-provider
}

# Display host key providers to verify result
Get-VMHost | Select Name,KMSserver

# Rekey a vSAN cluster
# It is assumed here that the cluster is already enabled for encryption.
# Beware: If not, this command will enable encryption for an unencrypted cluster.
Set-VsanEncryptionKms -Cluster cluster1 -KMSCluster new-key-provider

# Display cluster key provider to verify result
Get-VsanEncryptionKms -Cluster cluster1

# Rekey all encrypted virtual machines
# Each rekey operation starts a task which may take a brief time to complete for each encrypted VM
# Note that this will fail for any virtual machine that has snapshots; you must remove snapshots first
foreach($myvm in Get-VM) {
  if($myvm.KMSserver){
    echo $myvm.name
    Set-VMEncryptionKey -VM $myvm -KMSClusterId new-key-provider
  }
}

# Display all virtual machines' key providers (some are unencrypted) to verify result
Get-VM | Select Name,KMSserver

Note: currently the Set-VsanEncryptionKms function does not appear to work with vCenter 8. Until my bug report is fixed, you will have to use the vCenter UI for the vSAN step. For your cluster, go to Configuration | vSAN | Services. Under Data Services, click Edit. Choose the new key provider and click Apply:

Unfortunately, it is not possible to make all of these changes in the vCenter UI. You can rekey an individual VM against the new key provider, and as we’ve done above, you can rekey your vSAN cluster against the new key provider. And if you have vSAN encryption enabled, reconfiguring vSAN will also rekey your cluster encryption against the new key provider. But if you are not using vSAN, or if you do not have vSAN encryption enabled, I don’t know of a way to rekey your hosts against the new provider in the UI. (In fact, the cluster configuration UI is somewhat misleading as it indicates you have a choice of key provider, and you can even select the new key provider. But this will only influence the creation of new VMs; it will not rekey the hosts against the new provider.) As a result, you should use PowerCLI to rekey your hosts, and I recommend using it for your VMs as well.

After you have rekeyed all objects, you can remove the original key provider from vCenter:

Now you can delete your KMIP for VMware resource from the cloud UI:

For completeness, you should also delete all of the original keys created by the KMIP for VMware adapter. Recall that VMware leaks keys; if you have many keys to delete, you may wish to use the Key Protect CLI to remove them. You can identify these keys by name; they will have a vmware_kmip prefix:
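
If you have a large number of keys to clean up, here is a rough sketch of a script against the Key Protect REST API. The region, instance GUID, and token handling are assumptions you will need to adapt, and you may need to page through the key list if you have more keys than a single response returns:

#!/bin/bash
REGION=us-south                                   # your Key Protect region
INSTANCE_ID=00000000-0000-0000-0000-000000000000  # your Key Protect instance GUID
TOKEN=$(ibmcloud iam oauth-tokens | awk '/IAM token/{print $NF}')

# List keys, select those whose names begin with vmware_kmip, and delete them
curl -s -H "Authorization: Bearer $TOKEN" -H "bluemix-instance: $INSTANCE_ID" \
  "https://$REGION.kms.cloud.ibm.com/api/v2/keys" \
  | jq -r '.resources[] | select(.name | startswith("vmware_kmip")) | .id' \
  | while read -r keyid; do
      echo "Deleting key $keyid"
      curl -s -X DELETE -H "Authorization: Bearer $TOKEN" -H "bluemix-instance: $INSTANCE_ID" \
        "https://$REGION.kms.cloud.ibm.com/api/v2/keys/$keyid"
    done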

You may notice that there are no standard keys representing the KMIP keys created by the new native adapter. Instead, its keys are visible within the KMIP symmetric keys tab of your KMIP adapter:

VMware Cloud Director HTTP error 431, part 2

Previously I posted an improved NSX LB configuration for use with VMware Cloud Director that can help to restrict unnecessary cookies and avoid errors with excessively large headers.

If instead you are using VMware Avi Load Balancer in front of your Director cells, I want to highlight a recommended DataScript that you can use to accomplish the same result. My colleague Fahad Ladhani posted this in the comments of Tomas Fojta’s blog, but I’m highlighting it here for greater awareness:

-- HTTP_REQUEST
-- get cookies
cookies, count = avi.http.get_cookie_names()
avi.vs.log("cookies_count_before=" .. count)
-- if cookie(s) exists, validate cookie(s) name
if count >= 1 then
  for cookie_num= 1, #cookies do
    -- only keep cookies: JSESSIONID, rstd, vcloud_session_id, vcloud_jwt, sso-preferred, sso_redirect_org, xxxxx.redirectTo and xxxxx.state
    cookie_name = cookies[cookie_num]
    if cookie_name == "JSESSIONID" or  cookie_name == "rstd" or cookie_name == "vcloud_session_id" or cookie_name == "vcloud_jwt" or cookie_name == "sso-preferred" or cookie_name == "sso_redirect_org" then
      avi.vs.log("keep_cookie=" .. cookie_name)
    elseif string.endswith(cookie_name, ".redirectTo") or string.endswith(cookie_name, ".state") then
      avi.vs.log("keep_cookie=" .. cookie_name)
    else
      -- avi.vs.log("delete_cookie=" .. cookie_name)  -- not logging this because log gets truncated
      avi.http.remove_cookie(cookie_name)
    end
  end
end
-- get cookies
cookies, count = avi.http.get_cookie_names()
avi.vs.log("cookies_count_after=" .. count)

VMware Cloud Director HTTP error 431: Request Header Fields Too Large

Tomas Fojta wrote previously about issues with Cloud Director errors having to do with excessively large cookies. This is a common problem for cloud providers where there may be multiple web applications, some of which fail to properly limit their cookie scope. At the moment I am writing this, my browser is sending about 6.7kB of cookie data when visiting cloud.ibm.com. This is close to the limit supported by Cloud Director, and sometimes it goes over that limit.

Tomas suggested an approach using the NSX load balancer haproxy configuration to filter cookies. Unfortunately, Tomas’s approach does not cover all possible cases. For example, it does not handle the case where only one of the two cookies is present, or where additional cookies follow those two in the header. Furthermore, there are additional cookies used by Cloud Director; at a minimum this includes the following:

  • JSESSIONID
  • rstd
  • vcloud_session_id
  • vcloud_jwt
  • sso-preferred
  • sso_redirect_org
  • *.redirectTo
  • *.state

If you have a known limited list of cookies (or cookie name patterns) like this that you want to pass to your application, it is relatively easy to program a positive cookie filter with an advanced load balancer such as VMware Avi Load Balancer. But if you are using the NSX embedded load balancer and are limited to the haproxy approach of using reqirep with regular expressions, it is an intractable problem. Therefore, instead of using reqirep to selectively include the cookies that Director needs, I recommend the approach of using reqirep to selectively and iteratively delete cookies that you know are likely to be large and to overflow Director’s supported limit. It may take some iterative experimentation over a period of time for you to identify all of the offending cookies.

For example, we can use the following four rules to remove two of the larger cookies for cloud.ibm.com, neither of which is needed by Director. For each cookie I am removing, I have written a pair of rules: the first rule removes the cookie if it appears anywhere other than the end of the cookie list, and the second removes it if it is at the end of the list:

reqirep ^(Cookie:.*)com\.ibm\.cloud\.iam\.iamcookie\.prod=[^;]*;(.*)$ \1\ \2
reqirep ^(Cookie:.*)com\.ibm\.cloud\.iam\.iamcookie\.prod=[^;]*$ \1
reqirep ^(Cookie:.*)com\.ibm\.cloud\.iam\.Identity\.prod=[^;]*;(.*)$ \1\ \2
reqirep ^(Cookie:.*)com\.ibm\.cloud\.iam\.Identity\.prod=[^;]*$ \1
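
If you want to confirm the failure mode for yourself, or to verify how much headroom your filtering leaves, you can send a request carrying an oversized synthetic cookie. This is a sketch with a placeholder hostname, and the exact status code you see depends on which layer rejects the request:

# ~9 KB of filler cookie data; expect an error such as 431 from Director or the load balancer
curl -s -o /dev/null -w '%{http_code}\n' \
  -H "Cookie: padding=$(head -c 9000 /dev/zero | tr '\0' 'a')" \
  https://vcd.example.com/tenant/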

vCenter key provider server certificates

I’ve written a couple of posts on vCenter key provider client certificates and caveats related to configuring them. In this post I shift to discussing server certificates.

When you connect to a key provider, vCenter only offers you the option of trusting the provider’s end-entity certificate:

Typically an end-entity certificate has a lifetime of a year or less. This means that you will be revisiting the provider configuration to trust the renewed certificate at least annually.

However, after you have trusted this certificate, vCenter gives you the option of configuring an alternate certificate to be trusted. You can use this to establish trust with one of your key provider’s CA certificates instead of the end-entity certificate. Typically these have longer lifetimes, so your key provider connectivity will be interrupted much less frequently.

You may have to work with your security admin to obtain the CA certificate, or, depending on how your key provider is configured, you may be able to obtain it directly from the KMIP connection using a tool like openssl:

root@smoonen-vc [ ~ ]# openssl s_client -connect private.eu-de.kms.cloud.ibm.com:5696 -showcerts
CONNECTED(00000003)
depth=2 C = US, O = DigiCert Inc, OU = www.digicert.com, CN = DigiCert Global Root G2
verify return:1
depth=1 C = US, O = DigiCert Inc, CN = DigiCert Global G2 TLS RSA SHA256 2020 CA1
verify return:1
depth=0 C = US, ST = New York, L = Armonk, O = International Business Machines Corporation, CN = private.eu-de.kms.cloud.ibm.com
verify return:1
---
Certificate chain
 0 s:C = US, ST = New York, L = Armonk, O = International Business Machines Corporation, CN = private.eu-de.kms.cloud.ibm.com
   i:C = US, O = DigiCert Inc, CN = DigiCert Global G2 TLS RSA SHA256 2020 CA1
   a:PKEY: rsaEncryption, 2048 (bit); sigalg: RSA-SHA256
   v:NotBefore: Jun 18 00:00:00 2024 GMT; NotAfter: Jun 17 23:59:59 2025 GMT
-----BEGIN CERTIFICATE-----
MIIHWDCCBkCgAwIBAgIQCK1qBW4aHA51Yl6cJVq96TANBgkqhkiG9w0BAQsFADBZ
. . .
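
With -showcerts, the chain is printed in order, so the CA certificate you want is the second PEM block in the output. Here is a small sketch to capture just that block non-interactively, assuming a two-certificate chain like the one above:

openssl s_client -connect private.eu-de.kms.cloud.ibm.com:5696 -showcerts </dev/null 2>/dev/null \
  | awk '/BEGIN CERTIFICATE/{n++} n==2{print} /END CERTIFICATE/ && n==2{exit}' > kms-ca.pem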

You can then paste this certificate directly into the vCenter UI:

After doing this, vCenter will still display the validity period of the end-entity certificate rather than that of the CA certificate. But it will now be trusting the CA certificate, so this trust will extend to the next version of the end-entity certificate, as long as it is signed by the same CA.

vCenter key provider client certificates, part 2

Previously I explained how vCenter creates a new client certificate with each key provider connection. This is a good thing; it enables you to connect vCenter to the same provider multiple times as a different identity, which can be valuable in certain multitenant use cases.

However, there is also a bug in the vCenter UI screen that generates this certificate. For a split second, the UI presents one certificate, but then it switches to a new value. If you click the copy button too quickly, you will copy the wrong certificate:

Be sure to wait for the screen to refresh before copying your certificate!

Customizing root login for VMware Cloud Director

Your Linux VM running in VMware Cloud Director might be preconfigured with the best practice of disabling root password login in sshd. This can prevent you from using the root password that you set with Director’s Guest OS Customization:

#PermitRootLogin prohibit-password

You can override this behavior using a Guest OS Customization script in a couple of ways. The simplest approach is to use your customization script to set the sshd configuration to allow root password logins:

#!/bin/bash
# Use -i and -e separately; "-ie" would treat "e" as a backup suffix
sed -i -e "s/#PermitRootLogin prohibit-password/PermitRootLogin yes/" /etc/ssh/sshd_config

Or, if you prefer, you can use the customization script to insert an SSH public key for the root user:

#!/bin/bash
# Make sure the root .ssh directory exists before appending the key
mkdir -p /root/.ssh && chmod 700 /root/.ssh
echo "ssh-rsa AAAAB3...DswrcTw==" >> /root/.ssh/authorized_keys
chmod 644 /root/.ssh/authorized_keys