One of the biggest problems with the current debate about Container vs Hypervisor security is that no-one has actually developed a way of measuring security, so the debate is conducted entirely in qualitative terms (hypervisors “feel” more secure than containers because of the interface breadth) but no-one has actually done a quantitative comparison. The purpose of this blog post is to move the debate forwards by suggesting a quantitative methodology for measuring the Horizontal Attack Profile (HAP). For more details about Attack Profiles, see this blog post. I don’t expect this will be the final word in the debate, but by describing how we did it I hope others can develop quantitative measurements as well.
We’ll begin by looking at the Nabla technology through the relatively uncontroversial metric of performance. In most security debates, it’s accepted that some performance will be lost in securing the application. As a rule of thumb, placing an application in a hypervisor costs anywhere between 10% and 30% of its native performance. Our goal here is to show that, for a variety of web tasks, the Nabla containers mechanism has an acceptable performance penalty.
Performance Measurements
We took some standard benchmarks: redis-bench-set, redis-bench-get, python-tornado and node-express, and for the latter two we loaded up the web servers with simple external transactional clients. We then performed the same tests for docker, gVisor, Kata Containers (as our benchmark for hypervisor containment) and nabla. In all the figures, higher is better (meaning more throughput):
The red Docker measure is included as the baseline. As expected, the Kata Containers measure is around 10-30% down on the docker one in each case because of the hypervisor penalty. However, in each case the Nabla performance is the same as or higher than the Kata one, showing we pay less performance overhead for our security. A final note is that since these are network benchmarks, there’s something of a penalty paid by userspace networking stacks (which nabla necessarily has) for plugging into the docker network, so we show two values: one for the bridging plug-in (nabla-containers) required to orchestrate nabla with kubernetes, and one for a direct connection (nabla-raw) showing where the performance would be without the network penalty.
Finally, and as expected, gVisor performs poorly because ptrace is a really inefficient way of connecting the syscalls to the sandbox. However, it is more surprising that gVisor-kvm (where the sandbox connects to the system calls of the container using hypercalls instead) also performs poorly. I speculate this is because hypercalls exact their own penalty: hypervisors usually try to minimise them, and using a hypercall for every system call does exactly the opposite.
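For anyone who wants to get a feel for these numbers themselves, here is a minimal sketch of how the redis comparison can be driven (this is not our actual harness): the runtime names runsc, kata-runtime and runnc, and the use of the stock redis image, are assumptions about how the engines are registered on the test machine — nabla in particular needs its own nabla-built redis image rather than the stock one.

```python
#!/usr/bin/env python3
# Minimal sketch: run the same redis workload under each OCI runtime and
# drive it with redis-benchmark from the host.  Runtime names and the image
# are assumptions about the local docker configuration, not our exact setup.
import subprocess
import time

RUNTIMES = ["runc", "runsc", "kata-runtime", "runnc"]  # docker, gVisor, Kata, nabla

def bench(runtime):
    # Start redis under the given runtime, publishing the default port.
    cid = subprocess.check_output(
        ["docker", "run", "-d", "--rm", f"--runtime={runtime}",
         "-p", "6379:6379", "redis"], text=True).strip()
    time.sleep(2)  # crude wait for the server to come up
    try:
        # redis-bench-set / redis-bench-get style load: 100k SET and GET ops
        out = subprocess.check_output(
            ["redis-benchmark", "-h", "127.0.0.1", "-p", "6379",
             "-t", "set,get", "-n", "100000", "-q"], text=True)
        print(f"--- {runtime} ---\n{out}")
    finally:
        subprocess.run(["docker", "stop", cid],
                       stdout=subprocess.DEVNULL, check=False)

if __name__ == "__main__":
    for rt in RUNTIMES:
        bench(rt)
```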
HAP Measurement Methodology
The quantitative approach to measuring the Horizontal Attack Profile (HAP) says that we take the bug density of the Linux kernel code and multiply it by the amount of unique code traversed by the running system after it has reached a steady state (meaning that it doesn’t appear to be traversing any new kernel paths). For the sake of this method, we assume the bug density to be uniform, and thus the HAP is approximated by the amount of code traversed in the steady state. Measuring this for a running system is another matter entirely but, fortunately, the kernel has a mechanism called ftrace which can be used to provide a trace of all of the functions called by a given userspace process, and thus gives a reasonable approximation of the number of lines of code traversed (note this is an approximation because we count the total number of lines in each function, taking no account of internal code flow, primarily because ftrace doesn’t give that much detail). This methodology works very well for containers, where all of the control flow enters the kernel from a well known group of processes via the system call interface, but it works less well for hypervisors where, in addition to the direct hypercall interface, you also have to add traces from the back-end daemons (like the kvm vhost kernel threads, or dom0 in the case of Xen).
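To make the approximation concrete, here is a rough sketch (not the exact tooling we used) of how such a measurement can be computed: capture an ftrace function trace of the host kernel while the workload is at steady state, extract the set of unique kernel functions hit, then sum their sizes from a separately built table of per-function line counts. The trace file path and the "name line-count" table format are assumptions for illustration.

```python
#!/usr/bin/env python3
# Rough HAP approximation: unique kernel functions hit in an ftrace
# 'function' trace, weighted by each function's line count.
import re
import sys

# ftrace function-tracer lines look roughly like:
#   <task>-<pid> [cpu] flags timestamp: func_name <-caller_name
TRACE_RE = re.compile(r':\s+(\S+)\s+<-')

def unique_functions(trace_path):
    funcs = set()
    with open(trace_path) as f:
        for line in f:
            m = TRACE_RE.search(line)
            if m:
                funcs.add(m.group(1))
    return funcs

def load_line_counts(table_path):
    # One "function_name line_count" pair per line, built beforehand from
    # the kernel source (e.g. with ctags or debug info).
    counts = {}
    with open(table_path) as f:
        for line in f:
            name, nlines = line.split()
            counts[name] = int(nlines)
    return counts

if __name__ == "__main__":
    trace_file, table_file = sys.argv[1], sys.argv[2]
    funcs = unique_functions(trace_file)
    counts = load_line_counts(table_file)
    # HAP proxy: total lines of all unique kernel functions traversed
    hap = sum(counts.get(fn, 0) for fn in funcs)
    print(f"{len(funcs)} unique kernel functions, ~{hap} lines traversed")
```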
HAP Results
The results are for the same set of tests as the performance ones except that this time we measure the amount of code traversed in the host kernel:
As stated in our methodology, the height of the bar should be directly proportional to the HAP, where lower is obviously better. On these results we can say that in all cases the Nabla runtime tender actually has a better HAP than the hypervisor-contained Kata technology, meaning that we’ve achieved a container system with a better HAP (i.e. more secure) than hypervisors.
Some of the other results in this set also bear discussing. For instance, the Docker result certainly isn’t 10x the Kata result, as a naive analysis would suggest. In fact, the containment provided by docker looks to be only marginally worse than that provided by the hypervisor. Given all the hoopla about hypervisors being much more secure than containers, this result looks surprising, but you have to consider what’s going on: what we’re measuring in the docker case is the system call penetration of the normal execution of the system. Clearly anything malicious could explode this result by exercising all sorts of system calls that the application doesn’t normally use. However, this does show clearly that a docker container with a well crafted seccomp profile (which blocks unexpected system calls) provides roughly equivalent security to a hypervisor.
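To make “well crafted” concrete: a docker seccomp profile is simply a JSON whitelist of system calls, applied with docker run --security-opt seccomp=profile.json. A minimal sketch might look like the following; the syscall list here is purely illustrative, and in practice you would generate it from a trace of the application, much as we do for the HAP measurement.

```json
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "architectures": ["SCMP_ARCH_X86_64"],
  "syscalls": [
    {
      "names": ["read", "write", "openat", "close", "fstat", "mmap",
                "epoll_wait", "accept4", "sendto", "recvfrom",
                "futex", "exit_group"],
      "action": "SCMP_ACT_ALLOW"
    }
  ]
}
```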
The other surprising result is that, in spite of its claims to reduce exposure to Linux system calls, gVisor is either equivalent to the docker case or, for the python-tornado test, significantly worse. This too is explicable in terms of what’s going on under the covers: gVisor tries to improve containment by rewriting the Linux system call interface in Go. However, no-one has paid any attention to the number of system calls the Go runtime itself uses, which is what these results are really showing. Thus, while gVisor doesn’t currently achieve any containment improvement by this methodology, it’s not impossible that a future version of the Go runtime could be written that is much less profligate in its use of system calls, by developing a Secure Go using the same methodology we used to develop Nabla.
Conclusions
On both tests, Nabla is far and away the best containment technology for secure workloads, given that it sacrifices the least performance over docker to achieve the containment and, on the published results, is 2x more secure even than hypervisor-based containment.
Hopefully these results show that it is perfectly possible to have containers that are more secure than hypervisors, and lay to rest, finally, the arguments about which is the more secure technology. The next step, of course, is to establish the full extent of exposure to a malicious application, and to do that some type of fuzz testing needs to be employed. Unfortunately, right at the moment, gVisor simply crashes when subjected to fuzz testing, so it needs to become more robust before realistic measurements can be taken.