Monthly Archives: April 2016

Unprivileged Build Containers

A while ago, I set myself the goal of maintaining my build and test environments for architecture emulation containers without doing any of the tasks as root and without creating any suid binaries.  One of the big problems here is that distributions get annoyed (and don’t run correctly) if root doesn’t own most of the files: the installers, for instance, all check that each file was installed with the correct ownership and permissions, and fail if it wasn’t.  Debian has an interesting mechanism, called fakeroot, to get around this using a preload library that intercepts the chmod and chown system calls, but extending this to work permanently for an emulation container gets a bit hackish.

The correct way to do this is with user namespaces, so the rest of this post will show you how.  Before we get into how to use them, let’s begin with the theory of how user namespaces actually work.

Theory of User Namespaces

A user namespace is the only kind of namespace that can be created by an unprivileged user.  Its job is to map a set of interior (inside the user namespace) uids, gids and projids[1] to a set of exterior (outside the user namespace) ids.

The way this works is that the root user namespace simply has a 1:1 identity mapping of all 2^32 identifiers, meaning it fully covers the space.  However, any new user namespace only need remap a subset of these.  Any id that is not mapped into the user namespace becomes inaccessible to that namespace.  This doesn’t mean completely inaccessible, it just means any resource owned or accessed by an unmapped id treats an attempted access (even from root in the namespace) as though it were completely unprivileged, so if the resource is readable by any id, it can still be read even in a user namespace where its owning id is unmapped.

User namespaces can also be nested but the nested namespace can only map ids that exist in its parent, so you can only reduce but not expand the id space by nesting.  The way the nested mapping works is that it remaps through all the parent namespaces, so what appears on the resource is still the original exterior ids.

User namespaces also come with an owner (the uid/gid of the process that created the namespace).  The reason for this is that the owner is allowed to execute setuid/setgid to any id mapped into the namespace, so the owning uid/gid pair is the effective “root” of the container.  Note that this setuid/setgid property works on entry to the namespace even if uid 0 is not mapped inside the namespace, but won’t survive once the first setuid/setgid is issued.

The final piece of the puzzle is that every other namespace also has an owning user namespace, so while I cannot create a mount namespace as unprivileged user jejb, I can as remapped root inside my user namespace:

jejb@jarvis:~> unshare --mount
unshare: unshare failed: Operation not permitted
jejb@jarvis:~> nsenter --user=/tmp/userns
root@jarvis:~# unshare --mount
root@jarvis:~#

And once created, I can always enter this mount namespace provided I’m also in my user namespace.

Setting up Unprivileged Namespaces

Any system user can actually create a user namespace.  However, a non-root (meaning not uid zero in the parent namespace) user cannot remap any exterior id except their own.  This means that, because a build container needs a range of ids, it’s not possible to set up the initial remapped namespace without the help of root.  However, once that is done, the user can pretty much do every other operation[2].

The way remap ranges are set up is via the uid_map, gid_map and projid_map files sitting inside the /proc/<pid> directory.  These files may only be written to once and never updated[3].
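To check whether a process's maps have been written yet, the files can simply be read.  The following is a minimal sketch; show_maps is a hypothetical helper name, not part of any tool, and it assumes a Linux /proc is mounted.

```shell
#!/bin/sh
# show_maps [PID]: print the three id map files of a process.
# (show_maps is a hypothetical helper; PID defaults to "self".)
show_maps() {
    pid="${1:-self}"
    for m in uid_map gid_map projid_map; do
        f="/proc/$pid/$m"
        # Reading fails if the file is missing or we lack permission.
        if ! content=$(cat "$f" 2>/dev/null); then
            echo "$m: not readable"
        elif [ -z "$content" ]; then
            # An empty file means no ids are mapped yet.
            echo "$m: unwritten, no ids mapped"
        else
            echo "$m:"
            printf '%s\n' "$content" | sed 's/^/  /'
        fi
    done
}

show_maps self
```

Run from a shell inside a freshly unshared user namespace, each file should show up as unwritten until the maps are filled in from outside.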

As an example, to set up a build container, I need a remapping for every id that would be created during installation.  Traditionally for Linux, these are ids 0-999.  I want to remap them to something unprivileged, say 100,000.  Each map line has three fields (the first interior id, the first exterior id and the length of the range), so my line entry for this is:

0 100000 1000

However, I also want an identity mapping for my own id (currently I’m at uid 1000), so I can still use my home directory from within the build container.  This also means I can create the roots for the containers within my home directory.  Finally, the nobody user and the nobody and nogroup groups also need to be mapped, so the final uid map entries look like:

0 100000 1000
1000 1000 1
65534 101001 1
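To make the arithmetic concrete, here is a small sketch that translates an interior id to its exterior id through a map in this three-field format.  The helper name map_id is made up for illustration; the map itself is the uid map above.

```shell
#!/bin/sh
# map_id ID: read map lines ("interior exterior count") on stdin and print
# the exterior id that interior ID maps to, or "unmapped" if no line covers it.
# (map_id is a hypothetical helper, not an existing tool.)
map_id() {
    awk -v id="$1" '
        BEGIN { id += 0 }
        NF == 3 && id >= $1 + 0 && id < $1 + $3 {
            print $2 + (id - $1)   # offset into the exterior range
            found = 1
            exit
        }
        END { if (!found) print "unmapped" }
    '
}

# The uid map from the text above.
uid_map='0 100000 1000
1000 1000 1
65534 101001 1'

printf '%s\n' "$uid_map" | map_id 0       # → 100000
printf '%s\n' "$uid_map" | map_id 999     # → 100999
printf '%s\n' "$uid_map" | map_id 1000    # → 1000
printf '%s\n' "$uid_map" | map_id 65534   # → 101001
printf '%s\n' "$uid_map" | map_id 1001    # → unmapped
```

So a file created by uid 999 inside the container appears on disk as owned by uid 100999, while my own files keep uid 1000 on both sides.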

For the groups, it’s even more complex because on openSUSE, I’m a member of the users group (gid 100) which sits in the middle of the privileged 0-999 group range, so the gid_map entry I use is:

0 100000 100
100 100 1
101 100100 899
65533 101000 2

This is almost up to the kernel-imposed limit of five separate lines.
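Since the map files can only be written once, it is worth sanity-checking a map before handing it to root.  The sketch below checks the line count against the five-line limit mentioned above (passed as a parameter, since it is a property of the kernel in use) and checks that no two ranges overlap on either side, which the kernel also rejects.  The name check_map is hypothetical.

```shell
#!/bin/sh
# check_map [MAX_LINES]: validate an id map read from stdin.
# Exits 0 if the map looks consistent, 1 otherwise.
# (check_map is a hypothetical helper; MAX_LINES defaults to 5.)
check_map() {
    awk -v max="${1:-5}" '
        NF != 3 { printf "line %d: expected 3 fields\n", NR; bad = 1; next }
        { lo_in[NR] = $1 + 0; lo_out[NR] = $2 + 0; len[NR] = $3 + 0 }
        END {
            if (NR > max + 0) {
                printf "%d lines exceeds the limit of %d\n", NR, max
                bad = 1
            }
            # Compare every pair of ranges, interior and exterior side.
            for (i = 1; i <= NR; i++)
                for (j = i + 1; j <= NR; j++) {
                    if (lo_in[i] < lo_in[j] + len[j] && lo_in[j] < lo_in[i] + len[i]) {
                        printf "lines %d and %d overlap on interior ids\n", i, j
                        bad = 1
                    }
                    if (lo_out[i] < lo_out[j] + len[j] && lo_out[j] < lo_out[i] + len[i]) {
                        printf "lines %d and %d overlap on exterior ids\n", i, j
                        bad = 1
                    }
                }
            exit (bad ? 1 : 0)
        }
    '
}

# The gid map from the text above.
gid_map='0 100000 100
100 100 1
101 100100 899
65533 101000 2'

if printf '%s\n' "$gid_map" | check_map; then
    echo "gid map looks consistent"
fi
```

The check is only a convenience: the kernel remains the final arbiter of what it accepts when the file is written.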

Finally, here’s how to set this up and create a binding for the user namespace.  As myself (I’m uid 1000, user name jejb) I do:

jejb@jarvis:~> unshare --user
nobody@jarvis:~> echo $$
20211
nobody@jarvis:~>

Note that I become nobody inside the container because the map files are currently unwritten, so there are no mapped ids at all.  Now, as root, I have to write the mapping files and bind the namespace entry file somewhere:

jarvis:/home/jejb # echo 1|awk '{print "0 100000 1000\n1000 1000 1\n65534 101001 1"}' > /proc/20211/uid_map
jarvis:/home/jejb # echo 1|awk '{print "0 100000 100\n100 100 1\n101 100100 899\n65533 101000 2"}' > /proc/20211/gid_map
jarvis:/home/jejb # touch /tmp/userns
jarvis:/home/jejb # mount --bind /proc/20211/ns/user /tmp/userns

Now I can exit my user namespace because it’s permanently bound, and the next time I enter it I become root inside the container (although with uid 100000 outside):

jejb@jarvis:~> nsenter --user=/tmp/userns
root@jarvis:~# id
uid=0(root) gid=0(root) groups=0(root)
root@jarvis:~# su - jejb
jejb@jarvis:~> id
uid=1000(jejb) gid=100(users) groups=100(users)

This gives me a user namespace with sufficient mapped ids to create a build container.

Unprivileged Architecture Emulation Containers

Essentially, I can use the user namespace constructed above to bootstrap and enter the entire build container and its mount namespace, with one proviso: I have to have a pre-created devices directory because I don’t possess the mknod capability as myself, so my container root doesn’t possess it either.  The way I get around this is to create the initial dev directory as root and then change the ownership to 100000.100000 (my unprivileged ids):

jejb@jarvis:~/containers/debian-amd64/dev> ls -l
total 0
lrwxrwxrwx 1 100000 100000 13 Feb 20 09:45 fd -> /proc/self/fd/
crw-rw-rw- 1 100000 100000 1, 7 Feb 20 09:45 full
crw-rw-rw- 1 100000 100000 1, 3 Feb 20 09:45 null
lrwxrwxrwx 1 100000 100000 8 Feb 20 09:45 ptmx -> pts/ptmx
drwxr-xr-x 2 100000 100000 6 Feb 20 09:45 pts/
crw-rw-rw- 1 100000 100000 1, 8 Feb 20 09:45 random
drwxr-xr-x 2 100000 100000 6 Feb 20 09:45 shm/
lrwxrwxrwx 1 100000 100000 15 Feb 20 09:45 stderr -> /proc/self/fd/2
lrwxrwxrwx 1 100000 100000 15 Feb 20 09:45 stdin -> /proc/self/fd/0
lrwxrwxrwx 1 100000 100000 15 Feb 20 09:45 stdout -> /proc/self/fd/1
crw-rw-rw- 1 100000 100000 5, 0 Feb 20 09:45 tty
crw-rw-rw- 1 100000 100000 1, 9 Feb 20 09:45 urandom
crw-rw-rw- 1 100000 100000 1, 5 Feb 20 09:45 zero
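A directory like this could be populated with something along the following lines.  This is a sketch reconstructed from the listing, not the script actually used: run and make_dev_skeleton are hypothetical helper names, the device numbers and link targets are read off the ls output, and the real commands need to be run as root in the parent namespace.  Setting DRY_RUN=1 only prints the commands.

```shell
#!/bin/sh
# run CMD...: execute a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# make_dev_skeleton ROOT [OWNER]: create ROOT/dev as in the listing above.
# OWNER defaults to the exterior ids of container root, 100000:100000.
make_dev_skeleton() {
    dev="$1/dev"
    owner="${2:-100000:100000}"

    run mkdir -p "$dev/pts" "$dev/shm" || return 1

    # Character devices: name, major, minor (from the listing).
    while read -r name major minor; do
        run mknod -m 666 "$dev/$name" c "$major" "$minor" || return 1
    done <<EOF
full 1 7
null 1 3
random 1 8
tty 5 0
urandom 1 9
zero 1 5
EOF

    run ln -s /proc/self/fd "$dev/fd" &&
    run ln -s pts/ptmx "$dev/ptmx" &&
    run ln -s /proc/self/fd/0 "$dev/stdin" &&
    run ln -s /proc/self/fd/1 "$dev/stdout" &&
    run ln -s /proc/self/fd/2 "$dev/stderr" &&
    # -h changes the symlinks themselves rather than their targets.
    run chown -hR "$owner" "$dev"
}

# Example (prints the commands without running them):
#   DRY_RUN=1 make_dev_skeleton "$HOME/containers/debian-amd64"
```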

These few entries seem to be a sufficient dev skeleton to function with.  For completeness’ sake, I placed my bound user namespace into /run/build-container/userns and followed the original architecture emulation container post, with modified susebootstrap and build-container scripts.  The net result is that, as myself, I can now enter, administer and update the build and test architecture emulation container with no suid required:

jejb@jarvis:~> nsenter --user=/run/build-container/userns --mount=/run/build-container/ppc64
root@jarvis:/# id
uid=0(root) gid=0(root) groups=0(root)
root@jarvis:/# uname -m
ppc64
root@jarvis:/# su - jejb
jejb@jarvis:~> id
uid=1000(jejb) gid=100(users) groups=100(users)

The one remaining wrinkle is that root has to set up the user namespace on every boot, but that’s only because there’s currently no defined operating system way of doing this.