Container Deep Diving: Part 2

Okay, so here we are in part 2 of the container post series.

At the end of part 1 we identified the problems of using only chroot to achieve process isolation on a machine. The goal of this post is to get the same functionality, running bash with test-root as the new root directory, but with the same technologies that containers use. Once that is running we will address the two remaining problems: the process can still see all the network interfaces and can still kill arbitrary processes on the machine.

A new root dir

This time around the complexity will increase quite a bit.

While we will ultimately still use chroot the idea is to execute it inside a pre-secured environment. A container. So how can this secure environment be created?

Enter kernel NAMESPACES(7):

       A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global
       resource.  Changes to the global resource are visible to other processes that are members of the namespace, but are invisible to other processes.  One use of namespaces is to
       implement containers.
      
        ....
       The  following  table shows the namespace types available on Linux.  The second column of the table shows the flag value that is used to specify the namespace type in various APIs.
       The third column identifies the manual page that provides details on the namespace type.  The last column is a summary of the resources that are isolated by the namespace type.

       Namespace Flag            Page                  Isolates
       Cgroup    CLONE_NEWCGROUP cgroup_namespaces(7)  Cgroup root directory
       IPC       CLONE_NEWIPC    ipc_namespaces(7)     System V IPC, POSIX message queues
       Network   CLONE_NEWNET    network_namespaces(7) Network devices, stacks, ports, etc.
       Mount     CLONE_NEWNS     mount_namespaces(7)   Mount points
       PID       CLONE_NEWPID    pid_namespaces(7)     Process IDs
       Time      CLONE_NEWTIME   time_namespaces(7)    Boot and monotonic clocks
       User      CLONE_NEWUSER   user_namespaces(7)    User and group IDs
       UTS       CLONE_NEWUTS    uts_namespaces(7)     Hostname and NIS domain name
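
To see these namespaces on a running system, the kernel exposes them as symlinks under /proc/<pid>/ns. A minimal sketch, assuming a Linux machine with /proc mounted; list_namespaces is a hypothetical helper name, not part of any tool:

```shell
# List the namespaces a process belongs to.
# Each entry under /proc/<pid>/ns is a symlink whose target names the
# namespace type and an inode number that identifies the namespace.
list_namespaces() {
    # $1: PID to inspect (defaults to the current shell)
    for entry in /proc/"${1:-$$}"/ns/*; do
        # Skip the unexpanded pattern if the directory does not exist
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        printf '%s -> %s\n' "${entry##*/}" "$(readlink "$entry")"
    done
}

list_namespaces
```

Two processes that print the same target for an entry are in the same namespace of that type.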

The namespace that we need for the initial step is the Mount namespace, the first of all the namespaces that were introduced to Linux.

According to mount_namespaces(7) this namespace was introduced in Linux 2.4.19, while the fs/namespace.c file seems to have been introduced in 2.4.12 by Al Viro and somehow made it into the changelog of 2.4.11. This happened more than 20 years ago and is the basis for the modern container stack.

But how exactly can Mount namespaces help us out in restricting a process to a new root directory?

By using a technique called bind-mount. With bind-mounts it is possible to have a directory on the machine show up as a mount point. Basically a portal into a directory which we can then move into our secure environment before starting a chroot in it.
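
One detail worth knowing: pivot_root(2), the call that switches the root directory, only accepts a mount point as the new root. Bind-mounting a directory onto itself turns a plain directory into a mount point. Whether a directory currently is one can be checked like this (a sketch, assuming mountpoint(1) from util-linux is installed; is_mount_point and the temporary directory are made up for illustration):

```shell
# Report whether a directory is currently a mount point.
# mountpoint(1) exits with 0 only if the given path is a mount point.
is_mount_point() {
    mountpoint -q "$1"
}

demo_dir=$(mktemp -d)   # hypothetical stand-in for test-root
if is_mount_point "$demo_dir"; then
    echo "$demo_dir is a mount point"
else
    echo "$demo_dir is a plain directory; pivot_root would refuse it"
fi
rmdir "$demo_dir"
```

A freshly created directory takes the second branch. After mount --bind onto itself the same check would take the first one.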

In the root namespace of the machine (every process on a modern Linux system is inside a namespace) test-root is just a normal directory that can be inspected.

So how does this look in commands?

# Precheck all the available mounts in the root namespace
findmnt

# Step 1: create test-root with bash binary and dependencies
cd /
mkdir -p test-root/{bin,proc,old-root}
cp /bin/bash test-root/bin/bash
cp -a /usr /lib /lib64 test-root

# Step 2: jump into a new namespaced environment -> our secure environment
unshare --mount

# Step 3: create a bind mount for test-root
mount --bind test-root test-root

# Step 4: switch the root folder to test-root and keep the old root mounted at old-root
## also mount the process information via the procfs after switching the root
cd test-root
pivot_root . old-root
mount -t proc proc /proc

# Step 5: unmount the old root from the secure environment so that only the new root is available
## `--lazy` is needed as our /bin/bash process is still attached to the old mount
## if we made the namespace persistent and executed a process in it later, then the mount would not be available in it
umount --lazy old-root

# Step 6: chroot into the new secure environment
exec chroot . /bin/bash

Here is the process as a small video:

Nice, the first step is done and we are on par with part 1. Now let's reap some benefits and get to the unsolved isolation issues.

Processes

When creating the secure environment, the binary that actually did the namespace switching was unshare.

In Step 2 it is called with --mount, which will only create a new Mount namespace. To further restrict the environment we can now add the PID namespace, which separates all the processes into a new namespace of their own.

At this point it is important to note that the information for PID and Mount namespaces is stored in separate structures inside the kernel. This applies to all namespaces and means that you can mix and match them as you desire, giving the environments the exact configurations that you want.
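
Because each namespace type is tracked separately, two processes can share some namespaces and differ in others. A sketch to compare them, assuming /proc is mounted; compare_namespaces is a hypothetical helper and the three types it checks are simply the ones used in this post:

```shell
# Print, per namespace type, whether two processes share that namespace.
# Two processes are in the same namespace when the symlinks under
# /proc/<pid>/ns point to the same target.
compare_namespaces() {
    # $1, $2: PIDs to compare
    for type in mnt pid net; do
        a=$(readlink "/proc/$1/ns/$type" 2>/dev/null)
        b=$(readlink "/proc/$2/ns/$type" 2>/dev/null)
        if [ -z "$a" ] || [ -z "$b" ]; then
            echo "$type: unknown"      # entry missing or not readable
        elif [ "$a" = "$b" ]; then
            echo "$type: shared"
        else
            echo "$type: separate"
        fi
    done
}

# Comparing the current shell with itself: nothing is reported as separate
compare_namespaces "$$" "$$"
```

Called from a shell on the host with the PID of the bash inside the secure environment, it should report mnt as separate. Reading the entries of another user's process requires root.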

For our example this means that we can simply continue right after Step 5, the last command before entering the chroot.

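To create the new PID namespace let's simply call unshare again, but with the --pid flag this time.

Hm, that does not work as expected: the first command in the new shell (ls in this case) runs fine, but every command after it typically fails with an error such as "fork: Cannot allocate memory". So what is the issue here? Let's go digging.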

UNSHARE(1)
    -p, --pid[=file]
           Unshare the PID namespace. If file is specified, then a persistent namespace is created by a bind mount. (Creation of a persistent PID namespace will fail if the --fork option
           is not also specified.)

           See also the --fork and --mount-proc options.
           
           ....
           
    -f, --fork
           Fork the specified program as a child process of unshare rather than running it directly. This is useful when creating a new PID namespace. Note that when unshare is waiting
           for the child process, then it ignores SIGINT and SIGTERM and does not forward any signals to the child. It is necessary to send signals to the child process.
           
           ....
           
    --mount-proc[=mountpoint]
           Just before running the program, mount the proc filesystem at mountpoint (default is /proc). This is useful when creating a new PID namespace. It also implies creating a new
           mount namespace since the /proc mount would otherwise mess up existing programs on the system. The new proc filesystem is explicitly mounted as private (with
           MS_PRIVATE|MS_REC).

--mount-proc is not relevant for us as we have already created a new Mount namespace and do not want another one. But more on that in a second. The fix is in the --fork flag.

So why is this?

Checking the man page of pid_namespaces(7) gives the answer:

   The namespace init process
       The first process created in a new namespace (i.e., the process created using clone(2) with the CLONE_NEWPID flag, or the first child created by a process after a call to
       unshare(2) using the CLONE_NEWPID flag) has the PID 1, and is the "init" process for the namespace (see init(1)).  This process becomes the parent of any child processes that are
       orphaned because a process that resides in this PID namespace terminated (see below for further details).

The important part is "or the first child created by a process after a call to unshare(2)".

The first process that was started after unshare was ls. So ls became PID 1 of the new namespace and then exited. Once the init process of a PID namespace has exited, no new processes can be created in that namespace, so every later command fails.

In order to get the correct behaviour we want bash to be forked when calling unshare, therefore making it the init process (PID 1) of the new namespace.

So the correct way is to run unshare --pid --fork.
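
The effect of --fork can also be checked non-interactively. The sketch below asks a shell inside a fresh PID namespace for its own PID; pid_inside_new_namespace is a made-up helper name, and the fallback message covers the case where the namespace cannot be created, for example when not running as root:

```shell
# Print the PID a shell sees for itself inside a fresh PID namespace.
# With --fork, unshare forks the shell as its child, so the shell is the
# first process created in the new namespace.
pid_inside_new_namespace() {
    unshare --pid --fork sh -c 'echo $$' 2>/dev/null \
        || echo "could not create a PID namespace (root required?)"
}

pid_inside_new_namespace
```

Run as root, this should print 1, because the forked shell is the init process of the new namespace.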

But at this point there is still one thing missing and that relates to the --mount-proc flag mentioned above.

When checking the active processes with ps -aux the following is printed on a Fedora Workstation:

-bash-5.1# ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
0              1  0.7  0.8 171724 16900 ?        Ss   11:35   0:01 /usr/lib/systemd/systemd rhgb --switched-root --sys
 
...........
 
0            156  0.0  0.0      0     0 ?        I<   11:35   0:00 [ipv6_addrconf]
0            157  0.0  0.0      0     0 ?        I    11:35   0:00 [kworker/u8:8]
0            701  0.0  0.6 393636 13424 ?        Ssl  11:35   0:00 /usr/libexec/udisks2/udisksd
983          703  0.0  0.1  84648  3676 ?        S    11:35   0:00 /usr/sbin/chronyd -F 2
0            705  0.0  0.4 238196  8248 ?        Ssl  11:35   0:00 /usr/libexec/upowerd
0            706  0.0  0.7 255924 14712 ?        Ssl  11:35   0:00 /usr/sbin/abrtd -d -s
70           708  0.0  0.0   8516   356 ?        S    11:35   0:00 avahi-daemon: chroot helper
1000        1351  0.6  0.5 310964 10624 ?        Sl   11:35   0:01 ibus-daemon --panel disable -r --xim
1000        1353  0.1  3.5 581004 71036 ?        Ssl  11:35   0:00 /usr/libexec/gsd-xsettings
 
...........

0           1798  0.0  0.4  18684  9744 ?        Ss   11:37   0:00 /usr/lib/systemd/systemd-hostnamed
0           1822  0.0  0.2   8804  5684 ?        S    11:38   0:00 -bash
0           1848  0.0  0.2   8804  5720 ?        S    11:38   0:00 -bash
0           1877  0.2  0.2   8804  5680 ?        S    11:38   0:00 -bash
0           1904  0.2  0.2   8804  5656 ?        S    11:38   0:00 -bash
0           1933  0.0  0.0   5580   252 ?        S    11:39   0:00 unshare --pid --fork
0           1934  0.0  0.1   7380  3984 ?        S    11:39   0:00 -bash
0           1935  0.0  0.1   9888  2368 ?        R+   11:39   0:00 ps -aux

All processes are displayed.

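Why does this happen? Because the proc filesystem at /proc was mounted in the previous PID namespace, and a proc mount shows the processes of the PID namespace it was mounted from.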
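
ps does not ask the kernel for a process list directly. It reads the numbered directories under /proc, so it shows whatever the proc filesystem mounted there exposes. Roughly the same listing can be produced by hand (a sketch; visible_pids is a made-up helper name):

```shell
# List the PIDs visible through the proc filesystem mounted at /proc.
# Every process that is visible there has a directory named after its PID.
visible_pids() {
    for dir in /proc/[0-9]*; do
        [ -d "$dir" ] && echo "${dir##*/}"
    done
    return 0
}

echo "visible processes: $(visible_pids | wc -l)"
```

With a stale /proc mount this count covers the whole machine, no matter which PID namespace the shell itself is in.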

Skipping the --mount-proc flag keeps us in the desired Mount namespace, but it also means that the proc filesystem belonging to our new PID namespace is not mounted automatically. Simply unmounting the old one and mounting a new one with umount /proc && mount -t proc proc /proc fixes this and voila, ps -aux now prints:

-bash-5.1# ps -aux
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
0              1  0.0  0.2   7380  4088 ?        S    11:39   0:00 -bash
0              5  0.0  0.1   9888  2380 ?        R+   11:41   0:00 ps -aux

and we will not be able to kill arbitrary processes anymore.

Now there is only one thing left to address from part 1.

Network

A quick man unshare reveals:

       -n, --net[=file]
           Unshare the network namespace. If file is specified, then a persistent namespace is created by a bind mount.

And that's it: calling unshare --net puts us into a new network namespace. We are now in an isolated environment to run our program, notably without any network access.
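
What "without any network access" means can be seen by listing the interfaces from inside a fresh network namespace. A sketch that reads them from /proc/net/dev, which always reflects the network namespace of the reading process; interfaces_in_new_netns is a hypothetical helper name, and the fallback message covers the case where the namespace cannot be created:

```shell
# Print the network interfaces visible inside a fresh network namespace.
# /proc/net/dev starts with two header lines, followed by one
# "name: counters..." line per interface.
interfaces_in_new_netns() {
    unshare --net awk -F: 'NR > 2 { gsub(/ /, "", $1); print $1 }' /proc/net/dev 2>/dev/null \
        || echo "could not create a network namespace (root required?)"
}

interfaces_in_new_netns
```

Run as root this typically prints only lo, the loopback device, which is down by default in a new network namespace. Depending on the loaded kernel modules a few fallback tunnel devices can show up as well.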

Here are all the commands again:

# Step 1: create test-root with bash binary and dependencies
cd /
mkdir -p test-root/{bin,proc,old-root}
cp /bin/bash test-root/bin/bash
cp -a /usr /lib /lib64 test-root

# Step 2: jump into a new namespaced environment -> our secure environment
unshare --mount

# Step 3: create a bind mount for test-root
mount --bind test-root test-root

# Step 4: switch the root folder to test-root and keep the old root mounted at old-root
## also mount the process information via the procfs after switching the root
cd test-root
pivot_root . old-root
mount -t proc proc /proc

# Step 5: unmount the old root from the secure environment so that only the new root is available
## `--lazy` is needed as our /bin/bash process is still attached to the old mount
umount --lazy old-root

# Step 6: enter a new PID namespace and remount /proc so that it belongs to this namespace
unshare --pid --fork
umount /proc && mount -t proc proc /proc

# Step 7: enter a new network namespace
unshare --net

# Step 8: chroot into the new secure environment
exec chroot . /bin/bash

Recap

Let's look at the environment we have now: a folder mounted as the root, a new process hierarchy and an isolated network:

Quite a few moving parts, but really elegant when they work together like this and become the modern container.

What now?

The current Linux kernel 5.16 has 8 namespace types and this post only covered 3 of them. The User and Cgroup namespaces are especially important for modern containers.

The first one can map an unprivileged user from the root namespace to the root user inside a namespace, which is really important for a safe container. The Cgroup namespace isolates the view of the cgroup hierarchy, and cgroups themselves are what limit the amount of resources the processes will get (RAM and CPU for example). You should definitely try extending the example from this post with those two. Check man namespaces for all the documentation needed for this.
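
The user mapping is easy to try out, since unshare can create a User namespace and map the calling user to root in one go with --user and --map-root-user. A sketch, assuming the kernel allows unprivileged user namespaces (some distributions restrict them); uid_inside_user_namespace is a hypothetical helper name:

```shell
# Print the user ID that a process sees for itself inside a new User
# namespace in which the calling user is mapped to root.
uid_inside_user_namespace() {
    unshare --user --map-root-user id -u 2>/dev/null \
        || echo "could not create a user namespace"
}

uid_inside_user_namespace
```

Where user namespaces are available this prints 0 even for a regular user, while on the host the process still runs with the original, unprivileged user ID.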

As far as this series of posts is concerned the basics are now covered. Part 3 is going to dig into the current ecosystem. Think podman vs docker.

Until then, happy hacking!
