Introduction
The idea behind this blog post is to take a new look at how cloud security is measured and what its impact is on the various actors in the cloud ecosystem. From the measurement point of view, we look at the vertical stack: all code that is traversed to provide a service, all the way from input web request to database update to output response, potentially contains bugs. The bug density varies between components, but the more code you traverse the higher your chance of exposure to exploitable vulnerabilities. We’ll call this the Vertical Attack Profile (VAP) of the stack. However, even this axis is too narrow because the primary actors are the cloud tenant and the cloud service provider (CSP). In an IaaS cloud, part of the vertical profile belongs to the tenant (the guest kernel, guest OS and application) and part (the hypervisor and host OS) belongs to the CSP. However, the CSP vertical has the additional problem that any exploit in this piece of the stack can be used to jump into either the host itself or any of the other tenant virtual machines running on the host. We’ll call the exposure to this kind of exploit, which causes a failure of containment, the Horizontal Attack Profile (HAP). We should also note that any Horizontal Security failure is a potentially business-destroying event for the CSP, so they care deeply about preventing them. Conversely, any exploit occurring in the VAP owned by the Tenant can be seen by the CSP as a tenant-only problem and one which the Tenant is responsible for locating and fixing. We correlate size of profile with attack risk, so the larger the profile the greater the probability of being exploited.
From the Tenant point of view, improving security can be done in one of two ways: the first (and mostly aspirational) is to improve the security and monitoring of the part of the Vertical the Tenant is responsible for, and the second is to shift responsibility to the CSP, making the CSP responsible for more of the Vertical. Additionally, for most Tenants, a Horizontal failure mostly just means they lose trust in the CSP, unless the Tenant is trusting the CSP with sensitive data which can be exfiltrated by the Horizontal exploit. In this latter case, the Tenant still cannot do anything to protect the CSP part of the Security Profile, so it’s mostly a contractual problem: SLAs and penalties for SLA failures.
Examples
To see how these interpretations apply to the various cloud environments, let’s look at some of the Cloud (and pre-Cloud) models:
Physical Infrastructure
The left hand diagram shows a standard IaaS rented physical system. Since the Tenant rents the hardware, it is shown in red, indicating CSP ownership, and the two Tenants are shown in green and yellow. In this model, barring attacks from the actual hardware, the Tenant owns the entirety of the VAP. The nice thing for the CSP is that hardware provides air gap security, so there is no HAP, which means it is incredibly secure.
However, there is another (much older) model shown on the right, called the shared login model, where the Tenant only rents a login on the physical system. In this model, only the application belongs to the Tenant, so the CSP is responsible for much of the VAP (the expanded red area). Here the total VAP is the same, but the Tenant’s VAP is much smaller: the CSP is responsible for maintaining and securing everything apart from the application. From the Tenant point of view this is a much more secure system since they’re responsible for much less of the security. From the CSP point of view there is now a horizontal security problem because a tenant compromising the kernel can control the entire system and jump to other tenant processes. This actually has the worst HAP of all the systems considered in this blog.
Hypervisor based Virtual Infrastructure
In this model, the total VAP is unquestionably larger (worse) than the physical system above because there’s simply more code to traverse (a guest and a host kernel). However, from the Tenant’s point of view, the VAP should be identical to that of unshared physical hardware because the CSP owns all the additional parts. However, there is the possibility that the Tenant may be compromised by vulnerabilities in the Virtual Hardware Emulation. This can be a worry because an exploit here doesn’t lead to a Horizontal security problem, so the CSP is apt to pay less attention to vulnerabilities in the Virtual Hardware simply because each guest has its own copy (even though that copy is wholly under the control of the CSP).
The HAP is definitely larger (worse) than the physical host because of the shared code in the Host Kernel/Hypervisor, but it has often been argued that, because this is so deep in the Vertical stack, the chances of exploit are practically zero (although VENOM gave the lie to this hope: stack depth represents obscurity, not security).
However, there is another way of improving the VAP and that’s to reduce the number of vulnerabilities that can be hit. One way that this can be done is to reduce the bug density (the argument for rewriting code in safer languages) but another is to restrict the amount of code which can be traversed by narrowing the interface (for example, see arguments in this HotCloud paper). On this latter argument, the host kernel or hypervisor does have a much lower VAP than the guest kernel because the hypercall interface used for emulating the virtual hardware is very narrow (much narrower than the syscall interface).
The important takeaways here are firstly that simply transferring ownership of elements in the VAP doesn’t necessarily improve the Tenant VAP unless you have some assurance that the CSP is actively monitoring and fixing them. Conversely, when the threat is great enough (Horizontal Exploit), you can trust to the natural preservation instincts of the CSP to ensure correct monitoring and remediation because a successful Horizontal attack can be a business destroying event for the CSP.
Container Based Virtual Infrastructure
The total VAP here is identical to that of physical infrastructure. However, the Tenant component is much smaller (the kernel accounting for around 50% of all vulnerabilities). It is this reduction in the Tenant VAP that makes containers so appealing: the CSP is now responsible for monitoring and remediating about half of the physical system VAP, which is a great improvement for the Tenant. Plus, when the CSP remediates on the host, every container benefits at once, which is much better than having to crack open every virtual machine image to do it. Best of all, the Tenant images don’t have to be modified to benefit from these fixes; simply running on an updated CSP host is enough. However, the cost for this is that the HAP is the entire Linux kernel syscall interface, meaning the HAP is much larger than in the hypervisor virtual infrastructure case because the latter benefits from interface narrowing to only the hypercalls (as a rough estimate, assuming the hypercall interface is ~30 calls and the syscall interface is ~300 calls, the HAP is 10x larger in the container case than the hypervisor case); however, thanks to protections from the kernel namespace code, the HAP is less than the shared login server case. Best of all, from the Tenant point of view, this entire HAP cost is borne by the CSP, which makes this an incredible deal: not only does the Tenant get a significant reduction in their VAP but the CSP is hugely motivated to keep on top of all vulnerabilities in their part of the VAP and remediate very fast because of the business implications of a successful horizontal attack. The flip side of this is that a large number of the world’s CSPs are very unhappy about these potential risks and costs and actually try to shift responsibility (and risk) back to the Tenant by advocating nested virtualization solutions like running containers in hypervisors.
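The back-of-the-envelope comparison above can be written out as a small calculation. This is only a sketch: the call counts are the rough figures assumed in the text (~30 hypercalls, ~300 syscalls), the 50% kernel share is the figure quoted above, and interface width is used as a crude stand-in for HAP size. The function and variable names are illustrative, not part of any tool.

```python
# Rough interface sizes assumed in the text, not measured values.
HYPERCALLS = 30
SYSCALLS = 300


def relative_hap(shared_interface_calls: int, baseline_calls: int) -> float:
    """Ratio of two HAPs, using interface width as a crude proxy."""
    return shared_interface_calls / baseline_calls


def tenant_vap_share(kernel_share: float) -> float:
    """Fraction of the physical-system VAP left with the Tenant once the
    CSP owns the kernel (containers on bare metal)."""
    return 1.0 - kernel_share


container_vs_hypervisor = relative_hap(SYSCALLS, HYPERCALLS)
tenant_share = tenant_vap_share(0.5)  # kernel assumed to be ~50% of the VAP

print(f"container HAP / hypervisor HAP: {container_vs_hypervisor:.0f}x")
print(f"tenant share of VAP with containers: {tenant_share:.0%}")
```

Counting calls ignores how much code sits behind each call, so the ratio should be read as an order-of-magnitude indication rather than a measurement.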
So remember, you’re only benefiting from the CSP motivation to actively maintain their share of the VAP if your CSP runs bare metal containers because otherwise they’ve quietly palmed the problem back off on you.
Other Avenues for Controlling Attack Profiles
The assumption above was that defect density per component is roughly constant, so effectively the more code the more defects. However, it is definitely true that different code bases have different defect densities, so one way of minimizing your VAP is to choose the code you rely on carefully and, of course, follow bug reduction techniques in the code you write.
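One way to make the "more code, more defects" assumption concrete is to estimate each component's contribution as code size times defect density and sum over the stack, split by owner. The sketch below does that. Every number in it is a made-up placeholder chosen only to show the arithmetic (and so that the kernel comes out at half the total), and the `Component` type and component names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    owner: str               # "tenant" or "csp"
    kloc: float              # thousands of lines traversed (placeholder)
    defects_per_kloc: float  # defect density (placeholder)

    @property
    def expected_defects(self) -> float:
        return self.kloc * self.defects_per_kloc


def vap_by_owner(stack):
    """Sum expected defects per owner across the vertical stack."""
    totals = {}
    for component in stack:
        totals[component.owner] = (
            totals.get(component.owner, 0.0) + component.expected_defects
        )
    return totals


# Placeholder container-style stack: the CSP owns the kernel.
stack = [
    Component("application", "tenant", kloc=100, defects_per_kloc=0.5),
    Component("libraries", "tenant", kloc=200, defects_per_kloc=0.25),
    Component("kernel", "csp", kloc=400, defects_per_kloc=0.25),
]

totals = vap_by_owner(stack)
print(totals)
```

Swapping a component for one with a lower density, or moving it to a different owner, changes the totals directly, which is the lever the rest of this section discusses.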
Density Reduction
The simplest way of reducing defects is to find and fix the ones in the existing code base (while additionally being careful about introducing new ones). This means it is important to know how actively defects are being searched for and how quickly they are being remediated. In general, the greater the user base for the component, the more people are searching for defects and the faster those defects are remediated, which means that although the Linux Kernel is a big component in the VAP and HAP, a diligent patch routine is a reasonable line of defence because a fixed bug is not an exploitable bug.
Another way of reducing defect density is to write (or rewrite) the component in a language which is less prone to exploitable defects. While this approach has many advocates, particularly among language partisans, it suffers from the defect decay issue: the idea that the maximum number of defects occurs in freshly minted code and the number goes down over time because the more time from release the more chance they’ve been found. This means that a newly rewritten component, even in a shiny bug reducing language, can still contain more bugs than an older component written in a more exploitable language, simply because a significant number of bugs introduced on creation have been found in the latter.
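The defect decay argument can be illustrated with a toy model. The sketch below assumes, purely for illustration, that the defects remaining in a component decay exponentially with time since release. The decay rate, the initial defect counts, and the function name are all invented for the example and are not measurements of any real code base.

```python
import math


def remaining_defects(initial: float, years_since_release: float,
                      decay_rate: float = 0.5) -> float:
    """Toy model: defects still present after a period of finding and fixing.

    decay_rate is an assumed constant, not an empirical value.
    """
    return initial * math.exp(-decay_rate * years_since_release)


# Hypothetical older component in a more exploitable language:
# many defects at creation, but ten years of finding and fixing.
old = remaining_defects(initial=1000, years_since_release=10)

# Hypothetical fresh rewrite in a safer language:
# a fifth of the defects at creation, but no time in the field yet.
new = remaining_defects(initial=200, years_since_release=0)

print(f"old component: {old:.1f} remaining, new rewrite: {new:.1f} remaining")
```

Under these assumed numbers the fresh rewrite starts out with more remaining defects than the aged original, even though it was created with far fewer. The rewrite decays too, so the comparison depends entirely on the ages and rates chosen.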
Code Reduction (Minimization Techniques)
It also stands to reason that, for a complex component, simply reducing the amount of code that is accessible to the upper components reduces the VAP because it directly reduces the number of defects. However, reducing the amount of code isn’t as simple as it sounds: it can only really be done by components that are configurable and then only if you’re not using the actual features you eliminate. Elimination may be done in two ways, either physically, by actually removing the code from the component or virtually by blocking access using a guard (see below).
Guarding and Sandboxing
Guarding is mostly used to do virtual code elimination by blocking access to certain code paths that the upper layers do not use. For instance, seccomp in the Linux Kernel can be used to block access to system calls you know the application doesn’t use, meaning it also blocks any attempt to exploit code that would be in those system calls, thus reducing the VAP (and also reducing the HAP if the kernel is shared).
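The effect of such a guard on the attack profile can be sketched without touching the kernel. The following is a toy model, not a seccomp policy: it treats the profile as the set of system calls an attacker can reach. The short syscall lists are an illustrative sample and the `guard` helper is a hypothetical name.

```python
# Toy model of virtual code elimination; this does not install a real filter.
KERNEL_SYSCALLS = frozenset({
    "read", "write", "openat", "close", "mmap",
    "ptrace", "keyctl", "bpf", "mount", "execve",
})

# Assumed for the example: the only calls this application ever makes.
APP_NEEDS = frozenset({"read", "write", "openat", "close", "mmap"})


def guard(exposed: frozenset, allowlist: frozenset) -> frozenset:
    """Return the syscalls still reachable once everything off the
    allowlist is blocked."""
    return exposed & allowlist


reachable = guard(KERNEL_SYSCALLS, APP_NEEDS)
blocked = KERNEL_SYSCALLS - reachable

print("reachable:", sorted(reachable))
print("blocked:  ", sorted(blocked))
```

Any defect living behind a blocked call is no longer reachable from the application, which is why the guard shrinks the profile without removing a line of kernel code.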
The deficiencies in the above are obvious: if the application needs to use a system call, you cannot block it, although you can filter it, which leads to huge and ever more complex seccomp policies. For system calls the application has to use, the solution can sometimes be guarding emulation. In this mode the guard code emulates all the effects of the system call without making the actual system call into the kernel. This approach, often called sandboxing, is certainly effective at reducing the HAP since the guards usually run in their own address space, which cannot be used to launch a horizontal attack. However, the sandbox may or may not reduce the VAP depending on the bugs in the emulation code vs the bugs in the original. One of the biggest potential disadvantages to watch out for with sandboxing is the fact that the address space the sandbox runs in is often that of the tenant, often meaning the CSP has quietly switched ownership of that component back to the tenant as well.
Conclusions
First and foremost: security is hard. As a cloud Tenant, you really want to offload as much of it as possible to people who are much more motivated to actually do it than you are (i.e. the Cloud Service Provider).
The complete Vertical Attack Profile of a container bare metal system in the cloud is identical to a physical system and better than a Hypervisor based system; plus the tenant owned portion is roughly 50% of the total VAP meaning that Containers are by far the most secure virtualization technology available today from the Tenant perspective.
The increased Horizontal Attack profile that containers bring should all rightly belong to the Cloud Service Provider. However, CSPs are apt to shirk this responsibility and try to find creative ways to shift responsibility back to the tenant including spreading misinformation about the container Attack profiles to try to make Tenants demand nested solutions.
Before you, as a Tenant, start worrying about the CSP-owned Horizontal Attack Profile, check whether contractual remedies (like SLAs or reputational damage to the CSP) would really be insufficient to cover the consequences of any data loss that might result from a containment breach. Also remember that unless you, as the tenant, are under external compliance obligations like HIPAA or PCI, contractual remedies for a containment failure are likely sufficient and you should keep responsibility for the HAP where it belongs: with the CSP.
Great post. There is a word missing here between ‘a’ and ‘because’.
“From the CSP point of view there is now a because a tenant compromising the kernel can control the entire system and jump to other tenant processes.”
Sorry about that (and thanks for spotting it). It should say “From the CSP point of view there is now a horizontal security problem because a tenant can …”
I’ve fixed it and republished
I still see the same omission; I also notice a paragraph that ends rather abruptly: “the hypercall interface used for emulating the virtual hardware is very narrow (much narrower than the sys
The important takeaways here…”
Nice read but you might want to consider correcting the spelling of HIPAA. It is a common mistake but nonetheless distracts from the credibility of your thesis.
Really high-level and yet inspiring post. Thanks James!
Regarding the usage of a “secure” PL, I think one point is to reduce the number of new bugs introduced once we have a “stable” code base. For example, Go was created to deal with a problem Google has been facing for years: fresh graduates do not have much experience with (secure) coding, while they need to contribute to Google’s “stable” code bases. Nevertheless, this does not mean PLs like C could not achieve the same security goal: the development of LLVM and GCC (plugins) shows some interesting progress on this.
I am hoping to see Intel SGX mentioned here as well. AFAICT, different programming models of SGX (e.g., program partitioning and LibOS approaches) might have different impact on VAP and HAP, tenants and CSPs.
I do not see how shared login is worse than containers. Each and every mitigation technique available in containers is also available for the shared login system. Moreover, the shared login uses the proven security isolation by user accounts, while containers use various forms of restricted root and often rely on the latest kernel features with time-unproven code. The only drawback of the shared login is that a tenant runs all its applications under the same user id, while containers allow extra partitioning using the accounts within the application.
“make sure that contractual remedies (like SLAs or Reputational damage to the CSP) would be insufficient to cover the consequences of any data loss that might result from a containment breach.”
Do you mean “contractual remedies (like SLAs or Reputational damage to the CSP) would be sufficient” ?