vSAN: No witness! No problem?

When you study how vSAN works, you will read that objects are backed by one or more components, depending on the fault tolerance level, and that a tie-breaker witness is created. The witness makes sure that when a network partition occurs, the servers in a partition can decide whether together they hold more than 50% of the components and/or witnesses of an object.

Sometimes, however, there are no witnesses, and that can be just fine. In this article I will explain which other mechanism servers use to decide whether they own more than 50% of the necessary components when a network partition or another type of failure occurs.

You may already know that with RAID 5 or RAID 6 no witnesses are used either, due to the distributed nature of these layouts.

'Normal' placement including a witness

First, let's take a look at a 'normal' virtual machine with the default vSAN policy: Failures to tolerate = 1 with RAID 1 mirroring. In the screenshot below you can see that a witness is placed on one server and that two other servers each hold a component of the VMDK object, each containing a full copy of the data. Whichever two servers remain alive after a single host failure, they will always be able to decide that they own more than 50% of the witnesses and components. This is where the witness acts as the tiebreaker.

vm with two components and one witness

For the servers in the previous screenshot, any combination of two servers forms a majority of 2/3, which is more than 50%:

sa-esxi-02 witness (no data) + sa-esxi-01 component (data) = 2/3 is >50%
sa-esxi-02 witness (no data) + sa-esxi-04 component (data) = 2/3 is >50%
sa-esxi-01 component (data) + sa-esxi-04 component (data) = 2/3 is >50%
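
The three combinations above can be checked with a few lines of Python. This is only a minimal sketch of the arithmetic, not how vSAN implements it; the `has_quorum` helper is a hypothetical name, and each witness or component is assumed to carry one vote.

```python
from itertools import combinations

# One vote per witness/component, keyed by the host that holds it.
votes = {"sa-esxi-02": 1, "sa-esxi-01": 1, "sa-esxi-04": 1}


def has_quorum(surviving, votes):
    """True if the surviving hosts hold more than 50% of all votes."""
    total = sum(votes.values())
    alive = sum(votes[host] for host in surviving)
    return 2 * alive > total


# Every pair of surviving hosts corresponds to one single-host failure.
for pair in combinations(votes, 2):
    print(pair, has_quorum(pair, votes))
```

Every pair reports a majority, while a single surviving host (1/3) would not.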

In such a failure scenario there will always be data available to run the virtual machine in the vSAN cluster.
With RAID 1 you can also choose to tolerate 2 failures, which leads to three data components and two witnesses, placed on five servers. The maximum number of failures to tolerate is 3, which creates four data components and three witnesses and requires seven servers.
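
The numbers for the three RAID 1 levels follow a simple pattern, sketched below. The function name `raid1_layout` and the dictionary keys are made up for illustration, and the sketch assumes the simple case in which every data copy is a single component.

```python
def raid1_layout(ftt):
    """Return the simple RAID 1 layout for a failures-to-tolerate value.

    Assumes each data copy is one component (no striping or splitting).
    """
    if ftt not in (1, 2, 3):
        raise ValueError("RAID 1 supports failures to tolerate of 1, 2 or 3")
    return {
        "data_copies": ftt + 1,    # one more copy than failures tolerated
        "witnesses": ftt,          # tiebreakers to make the total odd
        "min_hosts": 2 * ftt + 1,  # every copy and witness on its own host
    }


for ftt in (1, 2, 3):
    print(ftt, raid1_layout(ftt))
```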

You can also investigate the placement of a virtual machine's objects in the Ruby vSphere Console (RVC) on the vCenter Server appliance or with the esxcli vsan command on the ESXi host. In the screenshot below you can see the output of the vsan.vm_object_info command for the virtual machine from the first screenshot. It shows the placement of the witness and the components, and you can also see that each has a single vote when it comes to deciding whether they are part of the majority of remaining servers. This voting mechanism is also what the other mechanism, which I will explain later, relies on.

rvc info for vm

Placement without a witness

Now let's take a look at another virtual machine that is also under the default vSAN storage policy. In the image below you can see that the virtual machine home directory is placed on three servers with a witness and two components. The placement of this object can differ from that of the VMDK object.

vm home object with witness

For the disk you can see that there are four components in the RAID 1 set: each half of the mirror is split into two components in a RAID 0 and there is no witness involved. In this case the reason for the RAID 0 split is that the virtual machine disk was too large to fit onto a single disk group of a single server, so it is spread over multiple servers. You will see the same behavior when you use a policy with a stripe width of 2. When you do the math with four components, however, there could be a scenario where only two of them survive. That would be exactly 50%, which is not more than 50%.

vm with four components and no witness
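
The tie described above is easy to verify. The sketch below assumes a hypothetical layout in which all four components have exactly one vote; `is_majority` is an illustrative helper, not a vSAN function.

```python
def is_majority(surviving_votes, total_votes):
    """True only if strictly more than 50% of the votes survive."""
    return 2 * surviving_votes > total_votes


# Hypothetical layout: four components with one vote each.
equal_votes = [1, 1, 1, 1]
total = sum(equal_votes)

print(is_majority(3, total))  # one component lost: 3/4
print(is_majority(2, total))  # two components lost: 2/4, a tie
```

With equal votes, two surviving components reach exactly half of the votes and therefore cannot claim the majority.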

This is, however, where the voting mechanism comes in. When you look at the VMDK object in the screenshot below, again taken from RVC, you can see that the components do not all have 1 vote: one of them has 2 votes.

rvc output for vm without witness

We can now do the math to decide when a majority of more than 50% remains. Remember that the policy still tolerates 1 host failure. This placement requires four servers. When you try to place this virtual machine in a three-node cluster, the CLOM service decides that placement is not possible and does not create the virtual machine disk object.

There are five votes in total:

sa-esxi-02 with 2 votes
sa-esxi-01 with 1 vote
sa-esxi-04 with 1 vote
sa-esxi-03 with 1 vote

So there are four hosts that can fail (one at a time), leading to four possible scenarios:

1) Host sa-esxi-02 fails and therefore the remaining hosts will form a majority:

sa-esxi-01 (1 vote) + sa-esxi-04 (1 vote) + sa-esxi-03 (1 vote) = 3/5 is >50%

2) Host sa-esxi-01 fails and therefore the remaining hosts will form a majority:

sa-esxi-02 (2 votes) + sa-esxi-04 (1 vote) + sa-esxi-03 (1 vote) = 4/5 is >50%

3) Host sa-esxi-04 fails and therefore the remaining hosts will form a majority:

sa-esxi-02 (2 votes) + sa-esxi-01 (1 vote) + sa-esxi-03 (1 vote) = 4/5 is >50%

4) Host sa-esxi-03 fails and therefore the remaining hosts will form a majority:

sa-esxi-02 (2 votes) + sa-esxi-01 (1 vote) + sa-esxi-04 (1 vote) = 4/5 is >50%

In this specific placement there is also a double failure that the object survives. If hosts sa-esxi-03 and sa-esxi-04 both fail, the two remaining hosts can still provide access to the disk object:

5) Hosts sa-esxi-03 and sa-esxi-04 fail and the remaining hosts still form a majority, but the object only stays accessible because together they also hold one complete half of the mirrored set:

sa-esxi-02 (2 votes) + sa-esxi-01 (1 vote) = 3/5 is >50%
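
The five scenarios can be replayed with a small Python model. Treat it as a rough sketch under assumptions: the vote counts come from the RVC output, but the split of hosts over the two mirror halves (`MIRROR_HALVES`) is inferred from scenario 5, and the helper names are hypothetical. The model treats an object as accessible when the surviving hosts hold more than 50% of the votes and one complete half of the mirror.

```python
from itertools import combinations

# Votes per host, as shown in the RVC output for the VMDK object.
VOTES = {"sa-esxi-02": 2, "sa-esxi-01": 1, "sa-esxi-04": 1, "sa-esxi-03": 1}

# Assumption: hosts per mirror half, inferred from scenario 5.
MIRROR_HALVES = [
    {"sa-esxi-02", "sa-esxi-01"},
    {"sa-esxi-03", "sa-esxi-04"},
]


def has_quorum(alive):
    """True if the surviving hosts hold more than 50% of all votes."""
    return 2 * sum(VOTES[host] for host in alive) > sum(VOTES.values())


def has_full_copy(alive):
    """True if at least one complete half of the mirror survives."""
    return any(half <= alive for half in MIRROR_HALVES)


def is_accessible(failed):
    """Combine both conditions for a given collection of failed hosts."""
    alive = set(VOTES) - set(failed)
    return has_quorum(alive) and has_full_copy(alive)


# Try every single and every double host failure.
for count in (1, 2):
    for failed in combinations(sorted(VOTES), count):
        print(failed, is_accessible(failed))
```

In this model every single host failure leaves the object accessible, and the loss of sa-esxi-03 together with sa-esxi-04 is the only double failure that does. Losing sa-esxi-01 and sa-esxi-03, for example, still leaves 3 of 5 votes but no complete copy of the data.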

So as you can see, votes provide a second tiebreaker mechanism when there is no witness. You can now investigate for yourself how components are placed on your servers and how vSAN decides whether more than 50% remains available when a host fails.
