Unprivileged Build Containers

A while ago, I set myself the goal of maintaining my build and test environments for architecture emulation containers without having to do any of the tasks as root and without creating any suid binaries. One of the big problems here is that distributions get annoyed (and don’t run correctly) if root doesn’t own most of the files: for instance, the installers all check that each file got installed with the correct ownership and permissions, and fail if it didn’t. Debian has an interesting mechanism, called fakeroot, to get around this using a preload library that intercepts the chmod and chown system calls, but it’s getting a bit hackish to try to extend this to work permanently for an emulation container.

The correct way to do this is with user namespaces, so the rest of this post will show you how. Before we get into how to use them, let’s begin with the theory of how user namespaces actually work.

Theory of User Namespaces

A user namespace is the single namespace that can be created by an unprivileged user. Its job is to map a set of interior (inside the user namespace) uids, gids and projids¹ to a set of exterior (outside the user namespace) ones.

The way this works is that the root user namespace simply has a 1:1 identity mapping of all 2^32 identifiers, meaning it fully covers the space. However, any new user namespace only need remap a subset of these. Any id that is not mapped into the user namespace becomes inaccessible to that namespace. This doesn’t mean completely inaccessible: it just means that an attempted access to any resource owned by an unmapped id (even from root in the namespace) is treated as though it were completely unprivileged, so if the resource is readable by any id, it can still be read even in a user namespace where its owning id is unmapped.

User namespaces can also be nested but the nested namespace can only map ids that exist in its parent, so you can only reduce but not expand the id space by nesting. The way the nested mapping works is that it remaps through all the parent namespaces, so what appears on the resource is still the original exterior ids.
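
The nested remapping can be illustrated with a few lines of shell. This is only a sketch of the idea, not the kernel's implementation: it assumes the three-column map format (first interior id, first exterior id, count) that appears later in this post, and the function name to_exterior and the nested map used in the note below are hypothetical.

```shell
# Translate an interior id to the id seen on the resource by walking the
# chain of maps from the innermost namespace outwards.
# Usage: to_exterior ID MAPFILE...   (innermost map first)
# Each map line is "<first interior id> <first exterior id> <count>".
# Prints "unmapped" as soon as one level has no mapping for the id.
to_exterior() (
    id=$1
    shift
    for map in "$@"; do
        id=$(awk -v id="$id" '
            id >= $1 && id < $1 + $3 {
                printf "%d\n", $2 + (id - $1); found = 1; exit
            }
            END { if (!found) print "unmapped" }
        ' "$map")
        if [ "$id" = "unmapped" ]; then
            break
        fi
    done
    printf '%s\n' "$id"
)
```

With an outer map file containing 0 100000 1000 and a hypothetical nested map file containing 0 500 100, to_exterior 10 inner outer prints 100510: interior id 10 becomes 510 in the parent and 100510 on the resource. Interior id 200 prints unmapped, because the nested map only covers 100 ids.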

User Namespaces also come with an owner (the uid/gid of the process that created the container). The reason for this is that this owner is allowed to execute setuid/setgid to any id mapped into the namespace, so the owning uid/gid pair is the effective “root” of the container. Note that this setuid/setgid property works on entry to the namespace even if uid 0 is not mapped inside the namespace, but won’t survive once the first setuid/setgid is issued.

The final piece of the puzzle is that every other namespace also has an owning user namespace, so while I cannot create a mount namespace as unprivileged user jejb, I can as remapped root inside my user namespace

jejb@jarvis:~> unshare --mount
unshare: unshare failed: Operation not permitted
jejb@jarvis:~> nsenter --user=/tmp/userns
root@jarvis:~# unshare --mount
root@jarvis:~#

And once created, I can always enter this mount namespace provided I’m also in my user namespace.

Setting up Unprivileged Namespaces

Any system user can actually create a user namespace. However, a non-root (meaning not uid zero in the parent namespace) user cannot remap any exterior id except their own. This means that, because a build container needs a range of ids, it’s not possible to set up the initial remapped namespace without the help of root. However, once that is done, the user can pretty much do every other operation.²

The way remap ranges are set up is via the uid_map, gid_map and projid_map files sitting inside the /proc/<pid> directory. These files may only be written to once and never updated.³

As an example, to set up a build container, I need a remapping for every id that would be created during installation. Traditionally for Linux, these are ids 0-999. I want to remap them down to something unprivileged, say 100,000 so my line entry for this is

0 100000 1000

However, I also want an identity mapping for my own id (currently I’m at uid 1000), so I can still use my home directory from within the build container. This also means I can create the roots for the containers within my home directory. Finally, the nobody user and nobody,nogroup groups also need to be mapped, so the final uid map entries look like

0 100000 1000
1000 1000 1
65534 101001 1
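
To see what these entries mean for files that already exist, the map can also be read in the other direction. The following is a rough sketch rather than anything the kernel runs: the function name to_interior is made up, and it assumes the default overflow id of 65534 (nobody), which is how an unmapped exterior id is displayed inside the namespace.

```shell
# Show how an exterior (on-disk) uid appears inside the namespace.
# Usage: to_interior EXTERIOR_ID < map
# Map lines are "<first interior id> <first exterior id> <count>".
# Assumption: unmapped ids are reported as the overflow id 65534.
to_interior() {
    awk -v id="$1" -v overflow=65534 '
        id >= $2 && id < $2 + $3 {
            printf "%d\n", $1 + (id - $2); found = 1; exit
        }
        END { if (!found) printf "%d\n", overflow }
    '
}
```

Fed the three uid map lines above, to_interior 100033 prints 33 and to_interior 1000 prints 1000, while to_interior 0 prints 65534 because exterior root is not mapped into the namespace.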

For the groups, it’s even more complex because on openSUSE, I’m a member of the users group (gid 100) which sits in the middle of the privileged 0-999 group range, so the gid_map entry I use is

0 100000 100
100 100 1
101 100100 899
65533 101000 2

This is almost up to the kernel-imposed limit of five separate lines.
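
Since a map file can only be written once, it can be worth checking a proposed map before handing it to root. Here is a minimal sketch under a couple of assumptions: the function name check_map is hypothetical, the default limit of five lines is simply the one described in this post (it can be overridden with an argument), and the only other rules checked are the three-field format and that ranges on either side do not overlap.

```shell
# Check a proposed map (read from stdin) before it is written.
# Usage: check_map [MAX_LINES] < map
# Prints "ok", or the first problem found and returns non-zero.
check_map() {
    awk -v max="${1:-5}" '
        NF != 3 { print "line " NR ": expected 3 fields"; bad = 1; exit }
        {
            for (i = 1; i < NR; i++) {
                # Two ranges overlap when each starts before the other ends.
                if ($1 < in_start[i] + len[i] && in_start[i] < $1 + $3) {
                    print "line " NR ": interior range overlaps line " i
                    bad = 1; exit
                }
                if ($2 < out_start[i] + len[i] && out_start[i] < $2 + $3) {
                    print "line " NR ": exterior range overlaps line " i
                    bad = 1; exit
                }
            }
            in_start[NR] = $1 + 0; out_start[NR] = $2 + 0; len[NR] = $3 + 0
        }
        END {
            if (bad) exit 1
            if (NR > max) { print "too many lines: " NR " > " max; exit 1 }
            print "ok"
        }
    '
}
```

The four gid_map lines above pass this check and print ok.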

Finally, here’s how to set this up and create a binding for the user namespace. As myself (I’m uid 1000 user name jejb) I do

jejb@jarvis:~> unshare --user
nobody@jarvis:~> echo $$
20211
nobody@jarvis:~>

Note that I become nobody inside the container because currently the map files are unwritten so there are no mapped ids at all. Now as root, I have to write the mapping files and bind the entry file to the namespace somewhere

jarvis:/home/jejb # echo 1|awk '{print "0 100000 1000\n1000 1000 1\n65534 101001 1"}' > /proc/20211/uid_map
jarvis:/home/jejb # echo 1|awk '{print "0 100000 100\n100 100 1\n101 100100 899\n65533 101000 2"}' > /proc/20211/gid_map
jarvis:/home/jejb # touch /tmp/userns
jarvis:/home/jejb # mount --bind /proc/20211/ns/user /tmp/userns
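
If the awk line looks too cryptic, the same write can be wrapped in a small helper. This is a sketch of one alternative, not the method used above: the name write_map is hypothetical, and it relies on the assumption that dd receives the whole here-document in a single read (it is far smaller than the 4k block size) and so issues a single write.

```shell
# Write all map entries to TARGET in one go.
# Usage: write_map TARGET ENTRY...
# Assumption: the entries fit in one 4k block, so dd should read the
# here-document once and write it once, with every line in that write.
write_map() (
    target=$1
    shift
    entries=$(printf '%s\n' "$@")
    dd of="$target" bs=4k 2>/dev/null <<EOF
$entries
EOF
)
```

As root, write_map /proc/20211/uid_map "0 100000 1000" "1000 1000 1" "65534 101001 1" should then have the same effect as the awk command for uid_map.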

Now I can exit my user namespace because it’s permanently bound and the next time I enter it I become root inside the container (although with uid 100000 outside)

jejb@jarvis:~> nsenter --user=/tmp/userns
root@jarvis:~# id
uid=0(root) gid=0(root) groups=0(root)
root@jarvis:~# su - jejb
jejb@jarvis:~> id
uid=1000(jejb) gid=100(users) groups=100(users)

Giving me a user namespace with sufficient mapped ids to create a build container.

Unprivileged Architecture Emulation Containers

Essentially, I can use the user namespace constructed above to bootstrap and enter the entire build container and its mount namespace, with the one proviso that I need a pre-created devices directory: I don’t possess the mknod capability as myself, so my container root doesn’t possess it either. The way I get around this is to create the initial dev directory as root and then change the ownership to 100000.100000 (my unprivileged ids)

jejb@jarvis:~/containers/debian-amd64/dev> ls -l
total 0
lrwxrwxrwx 1 100000 100000 13 Feb 20 09:45 fd -> /proc/self/fd/
crw-rw-rw- 1 100000 100000 1, 7 Feb 20 09:45 full
crw-rw-rw- 1 100000 100000 1, 3 Feb 20 09:45 null
lrwxrwxrwx 1 100000 100000 8 Feb 20 09:45 ptmx -> pts/ptmx
drwxr-xr-x 2 100000 100000 6 Feb 20 09:45 pts/
crw-rw-rw- 1 100000 100000 1, 8 Feb 20 09:45 random
drwxr-xr-x 2 100000 100000 6 Feb 20 09:45 shm/
lrwxrwxrwx 1 100000 100000 15 Feb 20 09:45 stderr -> /proc/self/fd/2
lrwxrwxrwx 1 100000 100000 15 Feb 20 09:45 stdin -> /proc/self/fd/0
lrwxrwxrwx 1 100000 100000 15 Feb 20 09:45 stdout -> /proc/self/fd/1
crw-rw-rw- 1 100000 100000 5, 0 Feb 20 09:45 tty
crw-rw-rw- 1 100000 100000 1, 9 Feb 20 09:45 urandom
crw-rw-rw- 1 100000 100000 1, 5 Feb 20 09:45 zero
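
For reference, the listing above can be turned back into the commands root would need to run. The following is a sketch only: the function name dev_skeleton_commands is hypothetical, the device numbers and link targets are copied from the ls output, and the function just prints the commands so they can be reviewed before being piped to sh as root.

```shell
# Print the commands that would build the dev skeleton listed above.
# Usage: dev_skeleton_commands DIR [OWNER]
# Assumptions: DIR contains no spaces; OWNER defaults to 100000, the
# exterior id that container root maps to in this post.
# Nothing is executed here; the output is meant to be reviewed first.
dev_skeleton_commands() (
    dir=$1
    owner=${2:-100000}
    echo "mkdir -p $dir/pts $dir/shm"
    # name, major and minor numbers as shown in the listing
    while read -r name major minor; do
        echo "mknod -m 666 $dir/$name c $major $minor"
    done <<EOF
full 1 7
null 1 3
random 1 8
tty 5 0
urandom 1 9
zero 1 5
EOF
    echo "ln -s /proc/self/fd $dir/fd"
    echo "ln -s pts/ptmx $dir/ptmx"
    echo "ln -s /proc/self/fd/0 $dir/stdin"
    echo "ln -s /proc/self/fd/1 $dir/stdout"
    echo "ln -s /proc/self/fd/2 $dir/stderr"
    # -h changes the symlinks themselves rather than their targets
    echo "chown -h $owner:$owner $dir $dir/*"
)
```

Running dev_skeleton_commands ~/containers/debian-amd64/dev prints thirteen commands: one mkdir, six mknod, five ln and a final chown.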

This seems to be a sufficient dev skeleton to function with. For completeness’ sake, I placed my bound user namespace into /run/build-container/userns and followed the original architecture emulation container post, with modified susebootstrap and build-container scripts. The net result is that as myself I can now enter, administer and update the build and test architecture emulation container with no suid required

jejb@jarvis:~> nsenter --user=/run/build-container/userns --mount=/run/build-container/ppc64
root@jarvis:/# id
uid=0(root) gid=0(root) groups=0(root)
root@jarvis:/# uname -m
ppc64
root@jarvis:/# su - jejb
jejb@jarvis:~> id
uid=1000(jejb) gid=100(users) groups=100(users)

The only final wrinkle is that root has to set up the user namespace on every boot, but that’s only because there’s no currently defined operating system way of doing this.

¹ A projid or project id is an identifier used to accumulate a group quota on a filesystem. They function very much like groups except that all uids in the projid share the same quota. As of writing only ext4 and xfs really support the group quota, so I’ll basically ignore the projid mapping from now on.
² Except one: setting up devices, but we’ll deal with that later.
³ This actually causes a problem for the shell, because our standard write command is echo, but echo writes a line at a time, so cannot be used to create multi-line entries for any of these files. We use a trick later on involving awk to do this, but there are a variety of ways around this problem.

8 thoughts on “Unprivileged Build Containers”

Harald Hoyer 28 April 2016 at 13:32

What’s the purpose of echo and awk in this line?
# echo 1|awk '{print "0 100000 1000\n1000 1000 1\n65534 101001 1"}' > /proc/20211/uid_map
What’s wrong with printf?
# printf '0 100000 1000\n1000 1000 1\n65534 101001 1\n' > /proc/20211/uid_map

1. jejb (post author) 28 April 2016 at 15:29
  
  printf was the second thing I tried: it has the same line at a time behaviour as echo; awk was the third.
  
  1. Pavel Machek 1 May 2016 at 06:33
    
    So basically /proc/…/uid_map breaks when more than one write() syscall is used to write to it? That’s not exactly nice :-(. Actually you might be able to use echo …. | cat > file. Someone is going to cache it ;-).
    
  2. Aaron Hope 16 February 2017 at 12:49
    
    If you want precise control of the write(2) buffer size from the shell, wouldn’t dd(1) be the logical choice?
    
    jarvis:/home/jejb # dd bs=4k of=/proc/20211/uid_map <<EOF
    0 100000 1000
    1000 1000 1
    65534 101001 1
    EOF
    
    jarvis:/home/jejb # dd bs=4k of=/proc/20211/gid_map <<EOF
    0 100000 100
    100 100 1
    101 100100 899
    65533 101000 2
    EOF
    
Colin Walters 28 April 2016 at 15:32

FWIW gnome-continuous (https://git.gnome.org/browse/gnome-continuous) has been doing builds unprivileged using https://git.gnome.org/browse/linux-user-chroot for years now. I’ll define “unprivileged” as “having zero linux capabilities”, i.e. non-root without retained caps.

xdg-app also does unprivileged builds using a similar mechanism.

The core linux-user-chroot engine is now https://github.com/projectatomic/bubblewrap for systems without userns.

1. jejb (post author) 28 April 2016 at 15:53
  
  Well, there’s a couple of reasons why I do it this way. The first is that I actually wanted to demonstrate how to do this using the raw interfaces without any app help. I know there are a large number of orchestration things (openvz, lxc, docker, rkt …) that can also do this, but I wanted to show it’s not as hard as people think. The second, which only really applies to sandboxers, is that I want to be able to administer the build container, which means using the native installer for at least my preferred OSs of debian and openSUSE; this is hard for a lot of sandboxers because (as I explained in the post) OS installers crash if they don’t get their way and require a range of uid/gids. I think bubblewrap has this problem if I read how it works correctly. The third is there’s a lot of “usernamespaces are buggy and evil” FUD going around (paragraph three of the bubblewrap readme) so there’s a useful purpose served in describing exactly what they do and how they work.
  
  1. Colin Walters 2 May 2016 at 17:25
    
    I wouldn’t say user namespaces are buggy and evil, and neither does the bubblewrap README. The security issues are real though – I just submitted a PR to link to one: https://github.com/projectatomic/bubblewrap/pull/41
    
    1. jejb (post author) 4 May 2016 at 12:30
      
      That’s a bit shoot the messenger since the exploit isn’t in the userns. A computer is meant to be used, so the more features it has, the bigger the attack surface. Blaming userns for exposing an existing exploit is a bit like blaming the user because they insisted on using a feature which also contained an exploit. There has to be a tradeoff between security and usability otherwise a fully secure system in these terms would be switched off with all the cables removed.
      