I recently experienced two major failures in my virtual environment. The first involved the VMware vNetwork Distributed Switch, and the second involved VMware vShield. Both led to catastrophic failures that could easily have been avoided if these two subsystems had failed safe instead of failing closed. VMware vSphere is all about availability, but when critical systems like these fail, not even VMware HA can assist in recovery. You have to fix the problems yourself, usually by hand. Now that the problem has been solved and should not recur, I began to wonder how I had missed this, which led me to the total lack of information on how these subsystems actually work. So without further ado, here is how they work and what I consider to be the definition of fail-safe.

There are three failure modes available for security tools:
Fail-Safe is the holy grail of security tools; every tool aims to fail safe. In hardware, this is achieved with redundant systems that can take over the workload when the primary device fails. Any good security and networking plan accounts for such failures. This is why we so often see network switches deployed in pairs, and why load balancers, clustering technology, and the like are in use.
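To make the redundancy idea concrete, here is a minimal sketch in Python. It is a toy model rather than any VMware API: the `Device` class, the `forward_with_failover` function, and the names `switch-a` and `switch-b` are all hypothetical and exist only for illustration. It assumes that a device is simply up or down, and that a failed primary hands its workload to the next healthy device in the list.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Device:
    """A hypothetical security or network device that is either up or down."""
    name: str
    healthy: bool = True

    def forward(self, packet: str) -> str:
        # A device that is down cannot carry any traffic.
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{packet} via {self.name}"


def forward_with_failover(devices: List[Device], packet: str) -> str:
    """Send the packet through the first healthy device in the list."""
    for device in devices:
        if device.healthy:
            return device.forward(packet)
    # No redundant device is left, so traffic stops: the fail-closed outcome.
    raise RuntimeError("all devices are down; traffic is blocked")


# A redundant pair (illustrative names only).
primary = Device("switch-a")
standby = Device("switch-b")
pair = [primary, standby]

# The primary fails, and the standby takes over the workload.
primary.healthy = False
result = forward_with_failover(pair, "frame-1")
print(result)
```

With only one device in the list, the same failure would raise an error and block all traffic. That is the situation a fail-safe design is meant to avoid.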
Networking
However, when we enter a virtual environment, our network is flattened, and no matter how we try to change this, the network remains flat. I have described in detail how the network stack works in several other articles, but now it is time to consider VMware vNetwork Distributed Switches in more detail. In all of my other diagrams they are nothing more than a layer within the stack, but in reality they are much more than that.
The vNetwork Distributed Switch (vDS) control plane actually extends from the hypervisor into VMware vCenter, as shown in Figure 1. While the traditional VMware vSwitch’s control plane lives wholly within the hypervisor, the vDS control plane does not. This implies that for the vDS to function properly, VMware vCenter must ALWAYS be running. This is a case of a fail-closed system. The following catch-22 problem can occur.
Problem