User:Doskanoness/Unprivileged LXC containers
Unprivileged containers
Unprivileged containers are the safest containers. Ordinary privileged LXC containers should be considered unsafe because, while running in a separate namespace, UID 0 in the container is still equal to UID 0 (root) outside of the container. This means that if you somehow get access to any host resource through proc, sys, or some random syscall, you can potentially escape the container and become root on the host. That is what user namespaces were designed for: each user that is allowed to use them gets assigned a range of unused UIDs and GIDs on the system. An unprivileged container might, for instance, map user and group IDs 0 through 65,000 in the container to IDs 100,000 through 165,000 on the host. That means that UID 0 (root) in the container maps to UID 100,000 outside the container. So, in case something goes wrong and an attacker manages to escape the container, they find themselves with no more rights than the nobody user.
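The kernel exposes the active mapping of a running process through /proc, which makes it easy to check from the host. As an illustration only (the container "alpha" is created later on this page, and the exact offsets depend on your configuration), the map of an unprivileged container's root user can be shown like this:
lxc@localhost $
lxc-attach -n alpha -- cat /proc/self/uid_map
         0     100000      65536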
The standard paths also have their unprivileged equivalents:
- /etc/lxc/lxc.conf => ~/.config/lxc/lxc.conf
- /etc/lxc/default.conf => ~/.config/lxc/default.conf
- /var/lib/lxc => ~/.local/share/lxc
- /var/lib/lxcsnaps => ~/.local/share/lxcsnaps
- /var/cache/lxc => ~/.cache/lxc
Your user can create new user namespaces in which it will be UID 0 and will have some of root's privileges over resources tied to that namespace, but it will not be granted any extra privilege on the host. Unfortunately, this also means that the following common operations are not allowed:
- Mounting most of the filesystems.
- Creating device nodes.
- Any operation against a UID/GID outside of the mapped set.
This also means that your user cannot create new network devices on the host or change the bridge configuration. To work around that, the LXC team wrote a tool called "lxc-user-nic", which is the only setuid binary shipped with LXC 1.0 and which performs one simple task: it parses a configuration file and, based on its content, creates network devices for the user and bridges them. To prevent abuse, you can restrict the number of devices a user may request and the bridges they may be added to by editing the /etc/lxc/lxc-usernet file; an example of its format is shown below.
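Each line in /etc/lxc/lxc-usernet follows the pattern "user type bridge count". As a sketch (the actual values used on this page are set up later in the configuration section):
# /etc/lxc/lxc-usernet
# <user> <type> <bridge> <number of devices>
lxc veth br0.1 2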
Prerequisites
Prerequisites for well working unprivileged containers include:
- Kernel: 3.13 plus a couple of staging patches, or a later version
- User namespaces enabled in the kernel (CONFIG_USER_NS=y)
- A very recent version of shadow that supports subuid/subgid (sys-apps/shadow-4.2.1 or later)
- Per-user cgroups on all controllers
- LXC 1.0 or higher
- A version of PAM with a loginuid patch (it is a dependency of the recent shadow version mentioned above, so it is installed automatically with shadow-4.2.1 or later)
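A quick way to verify the kernel prerequisite, assuming your running kernel exposes its configuration through /proc/config.gz (otherwise check the .config in your kernel source tree):
root #
zgrep CONFIG_USER_NS /proc/config.gz
CONFIG_USER_NS=y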
LXC pre-built containers
Because of the limitations mentioned above, you won't be allowed to use mknod to create a block or character device in a user namespace, as being allowed to do so would let you access anything on the host. The same goes for some filesystems: you won't, for example, be allowed to do loop mounts or mount an ext partition, even if you can access the block device. Those limitations are a big problem during the initial bootstrap of a container, as tools like debootstrap, yum, … usually try to perform some of those restricted actions and will fail pretty badly.
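You can observe the device-node restriction without any container at all by entering a throwaway user namespace with unshare from util-linux. The device path and numbers below are arbitrary, and the command is expected to fail with "Operation not permitted":
lxc@localhost $
unshare --user --map-root-user mknod /tmp/sda b 8 0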
Some templates may be tweaked to work, and a workaround such as a modified fakeroot could be used to bypass some of those limitations, but the current state is that most distribution templates (including Gentoo) simply won't work with them. Instead, you should use the "download" template, which provides pre-built images of the distributions that are known to work in such an environment. Rather than assembling the rootfs and configuration locally, this template contacts a server which hosts daily pre-built rootfs and configuration files for the most common distributions.
Those images are built on the LXC project's Jenkins server. The actual build process is pretty straightforward: a basic chroot is assembled, the current git master is downloaded and built, and the standard templates are run with the right release and architecture. The resulting rootfs is compressed, a basic config and metadata (expiry, files to template, …) are saved, and the result is pulled by the LXC project's main server, signed with a dedicated GPG key and published on the public webserver.
The client side is a simple template that contacts the server over HTTPS (the domain is also DNSSEC-enabled and available over IPv6), grabs signed indexes of all the available images, checks whether the requested combination of distribution, release, and architecture is supported and, if it is, grabs the rootfs and metadata tarballs, validates their signatures and stores them in a local cache. Any container creation after that point is done using that cache until the cache entries expire, at which point a new copy is grabbed from the server. You can also pass the "--flush-cache" parameter to flush the local copy (if present).
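For example, to force a fresh download instead of reusing a cached image (the container name and image choice are only illustrative):
lxc@localhost $
lxc-create -t download -n alpha -- -d ubuntu -r trusty -a amd64 --flush-cache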
The template has been carefully written to work on any system that has a POSIX-compliant shell with wget. gpg is recommended but can be disabled if your host doesn't have it (at your own risk). The current list of images can be requested by passing the --list parameter:
root #
lxc-create -t download -n alpha -- --list
Setting up the GPG keyring
Downloading the image index
---
DIST RELEASE ARCH VARIANT BUILD
---
alpine 3.3 amd64 default 20171116_17:50
alpine 3.3 armhf default 20170103_17:50
alpine 3.3 i386 default 20171116_17:50
alpine 3.4 amd64 default 20171116_17:50
alpine 3.4 armhf default 20170111_20:27
alpine 3.4 i386 default 20171116_17:50
alpine 3.5 amd64 default 20171116_17:50
alpine 3.5 i386 default 20171116_17:50
alpine 3.6 amd64 default 20171116_18:00
alpine 3.6 i386 default 20171116_17:50
alpine edge amd64 default 20171116_17:50
alpine edge armhf default 20170111_20:27
alpine edge i386 default 20171116_17:50
archlinux current amd64 default 20171117_01:27
archlinux current i386 default 20171116_01:27
centos 6 amd64 default 20171117_02:16
centos 6 i386 default 20171117_02:16
centos 7 amd64 default 20171117_02:16
debian buster amd64 default 20171116_22:42
debian buster arm64 default 20171116_22:42
debian buster armel default 20171116_22:42
debian buster armhf default 20171117_04:09
debian buster i386 default 20171116_22:42
debian buster ppc64el default 20171116_22:42
debian buster s390x default 20171116_22:42
debian jessie amd64 default 20171116_22:42
debian jessie arm64 default 20171116_22:42
debian jessie armel default 20171116_22:42
debian jessie armhf default 20171116_22:42
debian jessie i386 default 20171116_22:42
debian jessie powerpc default 20171116_22:42
debian jessie ppc64el default 20171116_22:42
debian jessie s390x default 20171116_22:42
debian sid amd64 default 20171116_22:42
debian sid arm64 default 20171116_22:42
debian sid armel default 20171116_22:42
debian sid armhf default 20171117_04:09
debian sid i386 default 20171116_22:42
debian sid powerpc default 20171116_22:42
debian sid ppc64el default 20171116_22:42
debian sid s390x default 20171116_22:42
debian stretch amd64 default 20171116_22:42
debian stretch arm64 default 20171116_22:42
debian stretch armel default 20171116_22:42
debian stretch armhf default 20171116_22:42
debian stretch i386 default 20171116_22:42
debian stretch powerpc default 20161104_22:42
debian stretch ppc64el default 20171116_22:42
debian stretch s390x default 20171116_22:42
debian wheezy amd64 default 20171116_22:42
debian wheezy armel default 20171116_22:42
debian wheezy armhf default 20171116_22:42
debian wheezy i386 default 20171116_22:42
debian wheezy powerpc default 20171116_22:42
debian wheezy s390x default 20171116_22:42
fedora 24 amd64 default 20171117_01:27
fedora 24 i386 default 20171117_01:27
fedora 25 amd64 default 20171117_02:20
fedora 25 i386 default 20171117_01:27
fedora 26 amd64 default 20171117_01:27
fedora 26 i386 default 20171117_01:27
gentoo current amd64 default 20171116_14:12
gentoo current i386 default 20171116_14:12
opensuse 42.2 amd64 default 20171117_00:53
opensuse 42.3 amd64 default 20171117_00:53
oracle 6 amd64 default 20171117_11:40
oracle 6 i386 default 20171117_11:40
oracle 7 amd64 default 20171117_11:40
plamo 5.x amd64 default 20171116_21:36
plamo 5.x i386 default 20171116_21:36
plamo 6.x amd64 default 20171116_21:36
plamo 6.x i386 default 20171116_21:36
ubuntu artful amd64 default 20171117_03:49
ubuntu artful arm64 default 20171117_03:49
ubuntu artful armhf default 20171117_03:49
ubuntu artful i386 default 20171117_03:49
ubuntu artful ppc64el default 20171117_03:49
ubuntu artful s390x default 20171117_03:49
ubuntu bionic amd64 default 20171117_03:49
ubuntu bionic arm64 default 20171117_03:49
ubuntu bionic armhf default 20171117_03:49
ubuntu bionic i386 default 20171117_03:49
ubuntu bionic ppc64el default 20171117_03:49
ubuntu bionic s390x default 20171117_03:49
ubuntu precise amd64 default 20171025_03:49
ubuntu precise armel default 20171024_03:49
ubuntu precise armhf default 20171024_08:01
ubuntu precise i386 default 20171025_03:49
ubuntu precise powerpc default 20171025_03:49
ubuntu trusty amd64 default 20171117_03:49
ubuntu trusty arm64 default 20171117_03:49
ubuntu trusty armhf default 20171117_03:49
ubuntu trusty i386 default 20171117_03:49
ubuntu trusty powerpc default 20171117_03:49
ubuntu trusty ppc64el default 20171117_03:49
ubuntu xenial amd64 default 20171117_03:49
ubuntu xenial arm64 default 20171117_03:49
ubuntu xenial armhf default 20171117_03:49
ubuntu xenial i386 default 20171117_03:49
ubuntu xenial powerpc default 20171117_03:49
ubuntu xenial ppc64el default 20171117_03:49
ubuntu xenial s390x default 20171117_03:49
ubuntu zesty amd64 default 20171117_03:49
ubuntu zesty arm64 default 20171117_03:49
ubuntu zesty armhf default 20171117_03:49
ubuntu zesty i386 default 20171117_03:49
ubuntu zesty powerpc default 20170317_03:49
ubuntu zesty ppc64el default 20171117_03:49
ubuntu zesty s390x default 20171117_03:49
---
While the template was designed to work around the limitations of unprivileged containers, it works just as well with system containers, so even on a system that doesn’t support unprivileged containers you can do:
root #
lxc-create -t download -n alpha -f /etc/lxc/guest.conf -- -d ubuntu -r trusty -a amd64
And you'll get a new container running the latest build of Ubuntu 14.04 LTS (Trusty Tahr) amd64.
Configuring unprivileged LXC
Install the required packages:
root #
emerge shadow pambase
Create files necessary for assigning subuids and subgids:
root #
touch /etc/subuid /etc/subgid
Create a new user if one does not exist yet, set its password, and log in. In this example we use the name "lxc":
root #
useradd -m -G users lxc
root #
passwd lxc
root #
su - lxc
Make sure your user has a UID and GID map defined in /etc/subuid and /etc/subgid:
root #
grep lxc /etc/sub* 2>/dev/null
/etc/subgid:lxc:165537:65536
/etc/subuid:lxc:165537:65536
On Gentoo, a default allocation of 65536 UIDs and GIDs is given to every new user on the system, so you should already have an allocation. If not, you'll have to assign a set of subuids and subgids to the user manually:
root #
usermod --add-subuids 100000-165536 lxc
root #
usermod --add-subgids 100000-165536 lxc
root #
chmod +x /home/lxc
That last command is required because LXC needs to access ~/.local/share/lxc/ after it has switched to the mapped UIDs. If you're using ACLs, you may instead grant "u:100000:x" as a more specific ACL, for example as shown below.
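A sketch of the ACL variant, assuming the filesystem is mounted with ACL support and that 100,000 is the first UID of your mapped range:
root #
setfacl -m u:100000:x /home/lxc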
Now create /home/lxc/.config/lxc/guest.conf with the following content:
lxc.net.0.type = veth
lxc.net.0.flags = up
lxc.net.0.link = br0.1
lxc.net.0.name = eth0
lxc.net.0.ipv4.address = 192.168.10.101/24
lxc.net.0.ipv4.gateway = 192.168.10.1
lxc.idmap = u 0 100000 65536
lxc.idmap = g 0 100000 65536
The last two lines mean that you have one UID map and one GID map defined for the container, which map UIDs and GIDs 0 through 65,535 in the container to UIDs and GIDs 100,000 through 165,535 on the host. Those values should match those found in /etc/subuid and /etc/subgid; the values above are just illustrative.
And create /etc/lxc/lxc-usernet with:
lxc veth br0.1 2
This declares that the user "lxc" is allowed to create up to 2 veth-type devices and add them to the bridge called br0.1.
Don't forget to add /usr/sbin to the PATH environment variable, either in /etc/env.d/90lxc to take effect for all users or in ~/.bashrc for the current user. Otherwise the lxc-* commands will not work in your user environment (this is not the case for lxc-1.1.0-r5, lxc-1.1.1 and later versions, because they install their commands under the standard /usr/bin/ path). Example:
# /etc/skel/.bashrc
...
# Put your fun stuff here.
PATH="/usr/sbin:${PATH}"
Now let’s create our first unprivileged container with:
lxc@localhost $
lxc-create -t download -n alpha -f ~/.config/lxc/guest.conf -- -d ubuntu -r trusty -a amd64
Don't forget to set the root password of the unprivileged container by running the following commands as your user:
lxc@localhost $
lxc-start -n alpha -- /bin/bash
lxc@localhost $
passwd
lxc@localhost $
exit
Then you can log in easily with your new password as usual under your user:
lxc@localhost $
lxc-start -n alpha
If you get the error "Permission denied, can't create directory /sys/fs/cgroup/alpha", please see the section LXC#Create_user_namespace_manually_.28no_systemd.29
P.S. Still to be done: a "Creating cgroups" section has to be added, covering setups with and without cgmanager under OpenRC and systemd accordingly (see the "Creating cgroups" paragraph there as an example for the moment).
Create user namespace using systemd
Running unprivileged containers as an unprivileged user only works if you delegate a cgroup in advance (the cgroup2 delegation model enforces this restriction, not liblxc). Use the following systemd command to delegate the cgroup (per LXC - Getting started: Creating unprivileged containers as a user):
lxc@localhost $
systemd-run --unit=my-unit --user --scope -p "Delegate=yes" -- lxc-start my-container
This works similarly for other lxc commands.
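For instance, attaching to the same container could look like this (the unit and container names are the same placeholders as above):
lxc@localhost $
systemd-run --unit=my-attach-unit --user --scope -p "Delegate=yes" -- lxc-attach -n my-container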
It is also possible to delegate the cgroup controllers to unprivileged users permanently by creating a systemd unit containing:
[Service]
Delegate=cpu cpuset io memory pids
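One common way to apply this permanently (an assumption about your setup rather than an LXC requirement) is a drop-in for the user@.service unit, so that the controllers are delegated to every user session. Create /etc/systemd/system/user@.service.d/delegate.conf with the content above, then reload systemd:
root #
systemctl daemon-reload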
Create user namespace manually (no systemd)
OpenRC configuration pre-check
For systems that are booted by OpenRC, check that OpenRC mounts cgroups v2.
Open /etc/rc.conf and check this line:
...
rc_cgroup_mode="unified"
...
By default (when the rc_cgroup_mode line is commented out) it is set to "hybrid".
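After setting rc_cgroup_mode and rebooting, you can verify that the unified hierarchy is mounted; the command below should list a cgroup2 filesystem on /sys/fs/cgroup (in hybrid mode it appears under /sys/fs/cgroup/unified instead):
lxc@localhost $
grep cgroup2 /proc/mounts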
Manage cgroups with libcgroup (cgroups v2)
On systems without systemd, the user cgroup has to be created by some external means: all required cgroup directories must be created for the lxc user, given the right permissions, and the currently active shell must be moved into that cgroup (a manual sketch of this is shown below). Instead of doing this by hand every time, we can use libcgroup, which makes managing these cgroups much easier.
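For reference, a minimal manual sketch of what such a script would do on a pure cgroup v2 system; the PID 12345 stands for the lxc user's shell as printed by the first command, and the controller list is only illustrative:
lxc@localhost $
echo $$
12345
root #
mkdir /sys/fs/cgroup/lxc
root #
echo "+cpu +cpuset +io +memory +pids" > /sys/fs/cgroup/cgroup.subtree_control
root #
chown -R lxc:lxc /sys/fs/cgroup/lxc
root #
echo 12345 > /sys/fs/cgroup/lxc/cgroup.procs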
Install the required packages:
root #
emerge libcgroup
Add the following to the file /etc/cgroup/cgconfig.conf:
group lxc {
perm {
task {
uid = lxc;
gid = lxc;
}
admin {
uid = lxc;
gid = lxc;
}
}
cpu {}
cpuset {}
hugetlb {}
io {}
memory {}
pids {}
}
Make sure the file /etc/cgroup/cgrules.conf contains the line:
lxc * lxc
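For the configuration to be applied, the libcgroup services have to be started and added to the default runlevel. This sketch assumes the OpenRC service names installed by sys-libs/libcgroup are cgconfig (applies cgconfig.conf) and cgred (the rules daemon for cgrules.conf); adjust them if your installation differs:
root #
rc-service cgconfig start
root #
rc-service cgred start
root #
rc-update add cgconfig default
root #
rc-update add cgred default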
Now LXC should manage cgroups by itself, and both systemd and non-systemd containers should work.
Validate configuration
After logging in again as the lxc user, the shell should be placed in the lxc cgroup. Let's verify it:
lxc@localhost $
cat /proc/self/cgroup
Should be:
0::/lxc
Create container example
Now, we can execute any lxc-* command from the lxc user without any permission problems. For example:
lxc@localhost $
lxc-create -t download -n my_ubuntu_container -f ~/.config/lxc/guest.conf -- -d ubuntu -r xenial -a amd64
lxc@localhost $
lxc-start --name my_ubuntu_container