User:Doskanoness/Unprivileged LXC containers
Unprivileged containers
Unprivileged containers are the safest containers. Ordinary privileged LXC containers should be considered unsafe because, while running in a separate namespace, UID 0 in the container is still equal to UID 0 (root) outside of the container. This means that if you somehow get access to any host resource through proc, sys, or some random syscall, you can potentially escape the container and become root on the host. That is what user namespaces were designed for: each user that is allowed to use them gets assigned a range of unused UIDs and GIDs on the system. An unprivileged container might, for instance, map user and group IDs 0 through 65,000 in the container to IDs 100,000 through 165,000 on the host. That means that UID 0 (root) in the container maps to UID 100,000 outside the container. So, in case something goes wrong and an attacker manages to escape the container, they find themselves with no more rights than the nobody user.
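The kernel exposes the active mapping of a running process through /proc, which makes it easy to check from the host. As an illustration only (the container "alpha" is created later on this page, and the exact offsets depend on your configuration), the map of an unprivileged container's root user can be shown like this:
lxc@localhost $
lxc-attach -n alpha -- cat /proc/self/uid_map
         0     100000      65536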
The standard paths also have their unprivileged equivalents:
- /etc/lxc/lxc.conf => ~/.config/lxc/lxc.conf
- /etc/lxc/default.conf => ~/.config/lxc/default.conf
- /var/lib/lxc => ~/.local/share/lxc
- /var/lib/lxcsnaps => ~/.local/share/lxcsnaps
- /var/cache/lxc => ~/.cache/lxc
Your user can create new user namespaces in which it will be UID 0 and will have some of root's privileges over resources tied to that namespace, but it will not be granted any extra privilege on the host. Unfortunately, this also means that the following common operations are not allowed:
- Mounting most of the filesystems.
- Creating device nodes.
- Any operation against a UID/GID outside of the mapped set.
This also means that your user cannot create new network devices on the host or change the bridge configuration. To work around that, the LXC team wrote a tool called "lxc-user-nic", which is the only setuid binary shipped with LXC 1.0 and which performs one simple task: it parses a configuration file and, based on its content, creates network devices for the user and bridges them. To prevent abuse, you can restrict the number of devices a user may request and the bridges they may be added to by editing the /etc/lxc/lxc-usernet file; an example of its format is shown below.
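Each line in /etc/lxc/lxc-usernet follows the pattern "user type bridge count". As a sketch (the actual values used on this page are set up later in the configuration section):
# /etc/lxc/lxc-usernet
# <user> <type> <bridge> <number of devices>
lxc veth br0.1 2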
Prerequisites
Prerequisites for well working unprivileged containers include:
- Kernel: 3.13 plus a couple of staging patches, or a later version
- User namespaces enabled in the kernel (CONFIG_USER_NS=y)
- A very recent version of shadow that supports subuid/subgid (sys-apps/shadow-4.2.1 or later)
- Per-user cgroups on all controllers
- LXC 1.0 or higher
- A version of PAM with a loginuid patch (it is a dependency of the recent shadow version mentioned above, so it is installed automatically with shadow-4.2.1 or later)
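A quick way to verify the kernel prerequisite, assuming your running kernel exposes its configuration through /proc/config.gz (otherwise check the .config in your kernel source tree):
root #
zgrep CONFIG_USER_NS /proc/config.gz
CONFIG_USER_NS=y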
LXC pre-built containers
Because of the limitations mentioned above, you won't be allowed to use mknod to create a block or character device in a user namespace, as being allowed to do so would let you access anything on the host. The same goes for some filesystems: you won't, for example, be allowed to do loop mounts or mount an ext partition, even if you can access the block device. Those limitations are a big problem during the initial bootstrap of a container, as tools like debootstrap, yum, … usually try to perform some of those restricted actions and will fail pretty badly.
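You can observe the device-node restriction without any container at all by entering a throwaway user namespace with unshare from util-linux. The device path and numbers below are arbitrary, and the command is expected to fail with "Operation not permitted":
lxc@localhost $
unshare --user --map-root-user mknod /tmp/sda b 8 0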
Some templates may be tweaked to work, and a workaround such as a modified fakeroot could be used to bypass some of those limitations, but the current state is that most distribution templates (including Gentoo) simply won't work with them. Instead, you should use the "download" template, which provides pre-built images of the distributions that are known to work in such an environment. Rather than assembling the rootfs and configuration locally, this template contacts a server which hosts daily pre-built rootfs and configuration files for the most common distributions.
Those images are built on the LXC project's Jenkins server. The actual build process is pretty straightforward: a basic chroot is assembled, the current git master is downloaded and built, and the standard templates are run with the right release and architecture. The resulting rootfs is compressed, a basic config and metadata (expiry, files to template, …) are saved, and the result is pulled by the LXC project's main server, signed with a dedicated GPG key and published on the public webserver.
The client side is a simple template that contacts the server over HTTPS (the domain is also DNSSEC-enabled and available over IPv6), grabs signed indexes of all the available images, checks whether the requested combination of distribution, release, and architecture is supported and, if it is, grabs the rootfs and metadata tarballs, validates their signatures and stores them in a local cache. Any container creation after that point is done using that cache until the cache entries expire, at which point a new copy is grabbed from the server. You can also pass the "--flush-cache" parameter to flush the local copy (if present).
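For example, to force a fresh download instead of reusing a cached image (the container name and image choice are only illustrative):
lxc@localhost $
lxc-create -t download -n alpha -- -d ubuntu -r trusty -a amd64 --flush-cache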
The template has been carefully written to work on any system that has a POSIX-compliant shell with wget. gpg is recommended but can be disabled if your host doesn't have it (at your own risk). The current list of images can be requested by passing the --list parameter:
root #
lxc-create -t download -n alpha -- --list
Setting up the GPG keyring
Downloading the image index
---
DIST RELEASE ARCH VARIANT BUILD
---
alpine 3.3 amd64 default 20171116_17:50
alpine 3.3 armhf default 20170103_17:50
alpine 3.3 i386 default 20171116_17:50
alpine 3.4 amd64 default 20171116_17:50
alpine 3.4 armhf default 20170111_20:27
alpine 3.4 i386 default 20171116_17:50
alpine 3.5 amd64 default 20171116_17:50
alpine 3.5 i386 default 20171116_17:50
alpine 3.6 amd64 default 20171116_18:00
alpine 3.6 i386 default 20171116_17:50
alpine edge amd64 default 20171116_17:50
alpine edge armhf default 20170111_20:27
alpine edge i386 default 20171116_17:50
archlinux current amd64 default 20171117_01:27
archlinux current i386 default 20171116_01:27
centos 6 amd64 default 20171117_02:16
centos 6 i386 default 20171117_02:16
centos 7 amd64 default 20171117_02:16
debian buster amd64 default 20171116_22:42
debian buster arm64 default 20171116_22:42
debian buster armel default 20171116_22:42
debian buster armhf default 20171117_04:09
debian buster i386 default 20171116_22:42
debian buster ppc64el default 20171116_22:42
debian buster s390x default 20171116_22:42
debian jessie amd64 default 20171116_22:42
debian jessie arm64 default 20171116_22:42
debian jessie armel default 20171116_22:42
debian jessie armhf default 20171116_22:42
debian jessie i386 default 20171116_22:42
debian jessie powerpc default 20171116_22:42
debian jessie ppc64el default 20171116_22:42
debian jessie s390x default 20171116_22:42
debian sid amd64 default 20171116_22:42
debian sid arm64 default 20171116_22:42
debian sid armel default 20171116_22:42
debian sid armhf default 20171117_04:09
debian sid i386 default 20171116_22:42
debian sid powerpc default 20171116_22:42
debian sid ppc64el default 20171116_22:42
debian sid s390x default 20171116_22:42
debian stretch amd64 default 20171116_22:42
debian stretch arm64 default 20171116_22:42
debian stretch armel default 20171116_22:42
debian stretch armhf default 20171116_22:42
debian stretch i386 default 20171116_22:42
debian stretch powerpc default 20161104_22:42
debian stretch ppc64el default 20171116_22:42
debian stretch s390x default 20171116_22:42
debian wheezy amd64 default 20171116_22:42
debian wheezy armel default 20171116_22:42
debian wheezy armhf default 20171116_22:42
debian wheezy i386 default 20171116_22:42
debian wheezy powerpc default 20171116_22:42
debian wheezy s390x default 20171116_22:42
fedora 24 amd64 default 20171117_01:27
fedora 24 i386 default 20171117_01:27
fedora 25 amd64 default 20171117_02:20
fedora 25 i386 default 20171117_01:27
fedora 26 amd64 default 20171117_01:27
fedora 26 i386 default 20171117_01:27
gentoo current amd64 default 20171116_14:12
gentoo current i386 default 20171116_14:12
opensuse 42.2 amd64 default 20171117_00:53
opensuse 42.3 amd64 default 20171117_00:53
oracle 6 amd64 default 20171117_11:40
oracle 6 i386 default 20171117_11:40
oracle 7 amd64 default 20171117_11:40
plamo 5.x amd64 default 20171116_21:36
plamo 5.x i386 default 20171116_21:36
plamo 6.x amd64 default 20171116_21:36
plamo 6.x i386 default 20171116_21:36
ubuntu artful amd64 default 20171117_03:49
ubuntu artful arm64 default 20171117_03:49
ubuntu artful armhf default 20171117_03:49
ubuntu artful i386 default 20171117_03:49
ubuntu artful ppc64el default 20171117_03:49
ubuntu artful s390x default 20171117_03:49
ubuntu bionic amd64 default 20171117_03:49
ubuntu bionic arm64 default 20171117_03:49
ubuntu bionic armhf default 20171117_03:49
ubuntu bionic i386 default 20171117_03:49
ubuntu bionic ppc64el default 20171117_03:49
ubuntu bionic s390x default 20171117_03:49
ubuntu precise amd64 default 20171025_03:49
ubuntu precise armel default 20171024_03:49
ubuntu precise armhf default 20171024_08:01
ubuntu precise i386 default 20171025_03:49
ubuntu precise powerpc default 20171025_03:49
ubuntu trusty amd64 default 20171117_03:49
ubuntu trusty arm64 default 20171117_03:49
ubuntu trusty armhf default 20171117_03:49
ubuntu trusty i386 default 20171117_03:49
ubuntu trusty powerpc default 20171117_03:49
ubuntu trusty ppc64el default 20171117_03:49
ubuntu xenial amd64 default 20171117_03:49
ubuntu xenial arm64 default 20171117_03:49
ubuntu xenial armhf default 20171117_03:49
ubuntu xenial i386 default 20171117_03:49
ubuntu xenial powerpc default 20171117_03:49
ubuntu xenial ppc64el default 20171117_03:49
ubuntu xenial s390x default 20171117_03:49
ubuntu zesty amd64 default 20171117_03:49
ubuntu zesty arm64 default 20171117_03:49
ubuntu zesty armhf default 20171117_03:49
ubuntu zesty i386 default 20171117_03:49
ubuntu zesty powerpc default 20170317_03:49
ubuntu zesty ppc64el default 20171117_03:49
ubuntu zesty s390x default 20171117_03:49
---
While the template was designed to work around the limitations of unprivileged containers, it works just as well with system containers, so even on a system that doesn’t support unprivileged containers you can do:
root #
lxc-create -t download -n alpha -f /etc/lxc/guest.conf -- -d ubuntu -r trusty -a amd64
And you'll get a new container running the latest build of Ubuntu 14.04 LTS (Trusty Tahr) amd64.
Configuring unprivileged LXC
Install the required packages:
root #
emerge shadow pambase
Create files necessary for assigning subuids and subgids:
root #
touch /etc/subuid /etc/subgid
Create a new user if one does not exist yet, set its password, and log in. In this example we use the name "lxc":
root #
useradd -m -G users lxc
root #
passwd lxc
root #
su - lxc
Make sure your user has a UID and GID map defined in /etc/subuid and /etc/subgid:
root #
grep lxc /etc/sub* 2>/dev/null
/etc/subgid:lxc:165537:65536
/etc/subuid:lxc:165537:65536
On Gentoo, a default allocation of 65536 UIDs and GIDs is given to every new user on the system, so you should already have an allocation. If not, you'll have to assign a set of subuids and subgids to the user manually:
root #
usermod --add-subuids 100000-165536 lxc
root #
usermod --add-subgids 100000-165536 lxc
root #
chmod +x /home/lxc
That last command is required because LXC needs to access ~/.local/share/lxc/ after it has switched to the mapped UIDs. If you're using ACLs, you may instead grant "u:100000:x" as a more specific ACL, for example as shown below.
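A sketch of the ACL variant, assuming the filesystem is mounted with ACL support and that 100,000 is the first UID of your mapped range:
root #
setfacl -m u:100000:x /home/lxc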
Now create /home/lxc/.config/lxc/guest.conf with the following content:
lxc.net.0.type = veth
lxc.net.0.flags = up
lxc.net.0.link = br0.1
lxc.net.0.name = eth0
lxc.net.0.ipv4.address = 192.168.10.101/24
lxc.net.0.ipv4.gateway = 192.168.10.1
lxc.idmap = u 0 100000 65536
lxc.idmap = g 0 100000 65536
The last two lines mean that you have one UID map and one GID map defined for the container, which map UIDs and GIDs 0 through 65,535 in the container to UIDs and GIDs 100,000 through 165,535 on the host. Those values should match those found in /etc/subuid and /etc/subgid; the values above are just illustrative.
And create /etc/lxc/lxc-usernet with:
lxc veth br0.1 2
This declares that the user "lxc" is allowed to create up to 2 veth-type devices and add them to the bridge called br0.1.
Don't forget to add /usr/sbin to the PATH environment variable, either in /etc/env.d/90lxc to take effect for all users or in ~/.bashrc for the current user. Otherwise the lxc-* commands will not work in your user environment (this is not the case for lxc-1.1.0-r5, lxc-1.1.1 and later versions, because they install their commands under the standard /usr/bin/ path). Example:
# /etc/skel/.bashrc
...
# Put your fun stuff here.
PATH="/usr/sbin:${PATH}"
Now let’s create our first unprivileged container with:
lxc@localhost $
lxc-create -t download -n alpha -f ~/.config/lxc/guest.conf -- -d ubuntu -r trusty -a amd64
Don't forget to set the root password of the unprivileged container by running the following commands as your user:
lxc@localhost $
lxc-start -n alpha -- /bin/bash
lxc@localhost $
passwd
lxc@localhost $
exit
Then you can log in easily with your new password as usual under your user:
lxc@localhost $
lxc-start -n alpha
If you get the error "Permission denied, can't create directory /sys/fs/cgroup/alpha", please see the section LXC#Create_user_namespace_manually_.28no_systemd.29
P.S. Still to be done: a "Creating cgroups" section has to be added, covering setups with and without cgmanager under OpenRC and systemd accordingly (see the "Creating cgroups" paragraph there as an example for the moment).
Create user namespace using systemd
Running unprivileged containers as an unprivileged user only works if you delegate a cgroup in advance (the cgroup2 delegation model enforces this restriction, not liblxc). Use the following systemd command to delegate the cgroup (per LXC - Getting started: Creating unprivileged containers as a user):
lxc@localhost $
systemd-run --unit=my-unit --user --scope -p "Delegate=yes" -- lxc-start my-container
This works similarly for other lxc commands.
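For instance, attaching to the same container could look like this (the unit and container names are the same placeholders as above):
lxc@localhost $
systemd-run --unit=my-attach-unit --user --scope -p "Delegate=yes" -- lxc-attach -n my-container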
It is also possible to delegate the cgroup controllers to unprivileged users permanently by creating a systemd unit containing:
[Service]
Delegate=cpu cpuset io memory pids
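One common way to apply this permanently (an assumption about your setup rather than an LXC requirement) is a drop-in for the user@.service unit, so that the controllers are delegated to every user session. Create /etc/systemd/system/user@.service.d/delegate.conf with the content above, then reload systemd:
root #
systemctl daemon-reload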
Create user namespace manually (no systemd)
OpenRC configuration pre-check
For systems that are booted by OpenRC, check that OpenRC mounts cgroups v2.
Open /etc/rc.conf and check this line:
...
rc_cgroup_mode="unified"
...
By default (when the rc_cgroup_mode line is commented out) it is set to "hybrid".
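After setting rc_cgroup_mode and rebooting, you can verify that the unified hierarchy is mounted; the command below should list a cgroup2 filesystem on /sys/fs/cgroup (in hybrid mode it appears under /sys/fs/cgroup/unified instead):
lxc@localhost $
grep cgroup2 /proc/mounts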
Manage cgroups with libcgroup (cgroups v2)
On systems without systemd, the user cgroup has to be created by some external means: all required cgroup directories must be created for the lxc user, given the right permissions, and the currently active shell must be moved into that cgroup (a manual sketch of this is shown below). Instead of doing this by hand every time, we can use libcgroup, which makes managing these cgroups much easier.
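For reference, a minimal manual sketch of what such a script would do on a pure cgroup v2 system; the PID 12345 stands for the lxc user's shell as printed by the first command, and the controller list is only illustrative:
lxc@localhost $
echo $$
12345
root #
mkdir /sys/fs/cgroup/lxc
root #
echo "+cpu +cpuset +io +memory +pids" > /sys/fs/cgroup/cgroup.subtree_control
root #
chown -R lxc:lxc /sys/fs/cgroup/lxc
root #
echo 12345 > /sys/fs/cgroup/lxc/cgroup.procs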
Install the required packages:
root #
emerge libcgroup
Add the following to the file /etc/cgroup/cgconfig.conf:
group lxc {
perm {
task {
uid = lxc;
gid = lxc;
}
admin {
uid = lxc;
gid = lxc;
}
}
cpu {}
cpuset {}
hugetlb {}
io {}
memory {}
pids {}
}
Make sure the file /etc/cgroup/cgrules.conf contains the line:
lxc * lxc
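For the configuration to be applied, the libcgroup services have to be started and added to the default runlevel. This sketch assumes the OpenRC service names installed by sys-libs/libcgroup are cgconfig (applies cgconfig.conf) and cgred (the rules daemon for cgrules.conf); adjust them if your installation differs:
root #
rc-service cgconfig start
root #
rc-service cgred start
root #
rc-update add cgconfig default
root #
rc-update add cgred default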
Now LXC should manage cgroups by itself, and both systemd and non-systemd containers should work.
Validate configuration
After logging in again as the lxc user, the shell should be placed in the lxc cgroup. Let's verify it:
lxc@localhost $
cat /proc/self/cgroup
Should be:
0::/lxc
Create container example
Now, we can execute any lxc-* command from the lxc user without any permission problems. For example:
lxc@localhost $
lxc-create -t download -n my_ubuntu_container -f ~/.config/lxc/guest.conf -- -d ubuntu -r xenial -a amd64
lxc@localhost $
lxc-start --name my_ubuntu_container