systemd/systemd-nspawn

From Gentoo Wiki

This article is a stub. Please help out by expanding it - how to get started.

systemd-nspawn is a lightweight, loosely chroot-like, OS-level OCI container environment native to systemd. Each container runs in its own namespaces but shares the host's running kernel. Thus, no hardware emulation takes place and, unlike QEMU and VirtualBox, systemd-nspawn does not directly support non-native CPU instruction sets.

Like many technologies, containerization has trade-offs. A core benefit is that containers carry much less overhead than traditional virtual machines, so a large number of containers can be spawned much more quickly than the same number of VMs. On the other hand, exploits leading to container escapes, though uncommon, have happened and are more prevalent than VM escapes. Further, any containerized process that causes a kernel crash will bring down the host system, since containers share the host's kernel. Lastly, containers are not, by default, more secure than any other process on the host system. Containers can be hardened through a mix of technologies such as cgroups, to constrain resource utilization, and SELinux, to prevent privilege escalation and enforce access controls.
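
As a rough illustration of the cgroup-based constraints mentioned above, the sketch below passes resource-control properties to the container's scope unit through the --property= option. The container path and the specific limits (MemoryMax=2G, CPUQuota=50%) are assumptions chosen for illustration, and MemoryMax= assumes a host using cgroup v2. The run helper is hypothetical: it only prints the command unless RUN=1 is set, since actually starting a container requires root.

```shell
#!/bin/sh
# Hypothetical helper: print the command by default, execute it when RUN=1.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Boot the container with a memory ceiling and a CPU quota (illustrative values).
run systemd-nspawn -b -D /var/lib/machines/larrythecow \
    --property=MemoryMax=2G \
    --property=CPUQuota=50%
```

The same properties could instead be applied to an already running machine with systemctl set-property on its unit; which approach fits better depends on how the container is started.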

Installation

To use systemd-nspawn, the system must be set to a profile that uses the systemd init system.

Files

  • /var/lib/machines/* — the canonical location for systemd-nspawn container file systems.

To prevent confusion, it is best practice to name the subdirectory holding the container's root file system after the container's hostname.

Service

Assuming a properly structured and syntactically valid unit file, containers should be discoverable by machinectl. The unit file needs to be located at /etc/systemd/nspawn/<machine_name>.nspawn. Thereafter, the container can be managed like any other service.
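
A minimal sketch of such a file follows. It assumes a container named larrythecow and uses only the Boot= and VirtualEthernet= settings documented in systemd.nspawn(5); the write_nspawn helper is hypothetical. The file is staged in a scratch directory so the sketch can be run without root; on a real system it would be placed in /etc/systemd/nspawn/.

```shell
#!/bin/sh
# Hypothetical helper: write <name>.nspawn into <dest_dir> and print its path.
write_nspawn() {
    name=$1
    dest_dir=$2
    mkdir -p "$dest_dir"
    cat > "$dest_dir/$name.nspawn" <<'EOF'
[Exec]
# Boot the container's init, like -b on the command line.
Boot=yes

[Network]
# Add a virtual Ethernet link between host and container, like -n.
VirtualEthernet=yes
EOF
    printf '%s\n' "$dest_dir/$name.nspawn"
}

# Stage the file in a scratch directory (assumption: copied into place later).
scratch=$(mktemp -d)
write_nspawn larrythecow "$scratch"
```

With the file in place and the root file system under /var/lib/machines/larrythecow, the container can typically be started with machinectl start larrythecow, or enabled at boot through the systemd-nspawn@larrythecow.service unit.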

Usage

Assuming, for example, that a Gentoo root file system extracted from a stage3 tarball for the host's instruction set architecture exists at /var/lib/machines/larrythecow/, the following command should bring the container up:

root #systemd-nspawn -b -D /var/lib/machines/larrythecow

The handbook can be followed as normal from this point forward, excluding unnecessary bits such as kernel and bootloader configuration. Once done, the container can be used by itself or as an up-to-date template from which other containers can be spawned. The latter case is made easier if the container's root file system is stored on a Btrfs subvolume.
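
The template workflow can be sketched with the --template= option, which populates a new root directory from an existing tree (as a Btrfs snapshot where supported, otherwise as a plain copy) if the target directory does not exist yet. The clone name larry-clone and both helper functions are hypothetical; the command is only printed unless RUN=1 is set, because running it requires root.

```shell
#!/bin/sh
# Hypothetical helper: print the command by default, execute it when RUN=1.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Boot a new container whose root directory is initialized from a template.
spawn_from_template() {
    template=$1
    name=$2
    run systemd-nspawn --template="$template" -b -D "/var/lib/machines/$name"
}

spawn_from_template /var/lib/machines/larrythecow larry-clone
```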

Invocation

user $systemd-nspawn --help
systemd-nspawn [OPTIONS...] [PATH] [ARGUMENTS...]

Spawn a command or OS in a light-weight container.

  -h --help                 Show this help
     --version              Print version string
  -q --quiet                Do not show status information
     --no-pager             Do not pipe output into a pager
     --settings=BOOLEAN     Load additional settings from .nspawn file

Image:
  -D --directory=PATH       Root directory for the container
     --template=PATH        Initialize root directory from template directory,
                            if missing
  -x --ephemeral            Run container with snapshot of root directory, and
                            remove it after exit
  -i --image=PATH           Root file system disk image (or device node) for
                            the container
     --oci-bundle=PATH      OCI bundle directory
     --read-only            Mount the root directory read-only
     --volatile[=MODE]      Run the system in volatile mode
     --root-hash=HASH       Specify verity root hash for root disk image
     --root-hash-sig=SIG    Specify pkcs7 signature of root hash for verity
                            as a DER encoded PKCS7, either as a path to a file
                            or as an ASCII base64 encoded string prefixed by
                            'base64:'
     --verity-data=PATH     Specify hash device for verity
     --pivot-root=PATH[:PATH]
                            Pivot root to given directory in the container

Execution:
  -a --as-pid2              Maintain a stub init as PID1, invoke binary as PID2
  -b --boot                 Boot up full system (i.e. invoke init)
     --chdir=PATH           Set working directory in the container
  -E --setenv=NAME[=VALUE]  Pass an environment variable to PID 1
  -u --user=USER            Run the command under specified user or UID
     --kill-signal=SIGNAL   Select signal to use for shutting down PID 1
     --notify-ready=BOOLEAN Receive notifications from the child init process
     --suppress-sync=BOOLEAN
                            Suppress any form of disk data synchronization

System Identity:
  -M --machine=NAME         Set the machine name for the container
     --hostname=NAME        Override the hostname for the container
     --uuid=UUID            Set a specific machine UUID for the container

Properties:
  -S --slice=SLICE          Place the container in the specified slice
     --property=NAME=VALUE  Set scope unit property
     --register=BOOLEAN     Register container as machine
     --keep-unit            Do not register a scope for the machine, reuse
                            the service unit nspawn is running in

User Namespacing:
     --private-users=no     Run without user namespacing
     --private-users=yes|pick|identity
                            Run within user namespace, autoselect UID/GID range
     --private-users=UIDBASE[:NUIDS]
                            Similar, but with user configured UID/GID range
     --private-users-ownership=MODE
                            Adjust ('chown') or map ('map') OS tree ownership
                            to private UID/GID range
  -U                        Equivalent to --private-users=pick and
                            --private-users-ownership=auto

Networking:
     --private-network      Disable network in container
     --network-interface=INTERFACE
                            Assign an existing network interface to the
                            container
     --network-macvlan=INTERFACE
                            Create a macvlan network interface based on an
                            existing network interface to the container
     --network-ipvlan=INTERFACE
                            Create an ipvlan network interface based on an
                            existing network interface to the container
  -n --network-veth         Add a virtual Ethernet connection between host
                            and container
     --network-veth-extra=HOSTIF[:CONTAINERIF]
                            Add an additional virtual Ethernet link between
                            host and container
     --network-bridge=INTERFACE
                            Add a virtual Ethernet connection to the container
                            and attach it to an existing bridge on the host
     --network-zone=NAME    Similar, but attach the new interface to an
                            an automatically managed bridge interface
     --network-namespace-path=PATH
                            Set network namespace to the one represented by
                            the specified kernel namespace file node
  -p --port=[PROTOCOL:]HOSTPORT[:CONTAINERPORT]
                            Expose a container IP port on the host

Security:
     --capability=CAP       In addition to the default, retain specified
                            capability
     --drop-capability=CAP  Drop the specified capability from the default set
     --ambient-capability=CAP
                            Sets the specified capability for the started
                            process. Not useful if booting a machine.
     --no-new-privileges    Set PR_SET_NO_NEW_PRIVS flag for container payload
     --system-call-filter=LIST|~LIST
                            Permit/prohibit specific system calls
  -Z --selinux-context=SECLABEL
                            Set the SELinux security context to be used by
                            processes in the container
  -L --selinux-apifs-context=SECLABEL
                            Set the SELinux security context to be used by
                            API/tmpfs file systems in the container

Resources:
     --rlimit=NAME=LIMIT    Set a resource limit for the payload
     --oom-score-adjust=VALUE
                            Adjust the OOM score value for the payload
     --cpu-affinity=CPUS    Adjust the CPU affinity of the container
     --personality=ARCH     Pick personality for this container

Integration:
     --resolv-conf=MODE     Select mode of /etc/resolv.conf initialization
     --timezone=MODE        Select mode of /etc/localtime initialization
     --link-journal=MODE    Link up guest journal, one of no, auto, guest,
                            host, try-guest, try-host
  -j                        Equivalent to --link-journal=try-guest

Mounts:
     --bind=PATH[:PATH[:OPTIONS]]
                            Bind mount a file or directory from the host into
                            the container
     --bind-ro=PATH[:PATH[:OPTIONS]
                            Similar, but creates a read-only bind mount
     --inaccessible=PATH    Over-mount file node with inaccessible node to mask
                            it
     --tmpfs=PATH:[OPTIONS] Mount an empty tmpfs to the specified directory
     --overlay=PATH[:PATH...]:PATH
                            Create an overlay mount from the host to
                            the container
     --overlay-ro=PATH[:PATH...]:PATH
                            Similar, but creates a read-only overlay mount
     --bind-user=NAME       Bind user from host to container

Input/Output:
     --console=MODE         Select how stdin/stdout/stderr and /dev/console are
                            set up for the container.
  -P --pipe                 Equivalent to --console=pipe

Credentials:
     --set-credential=ID:VALUE
                            Pass a credential with literal value to container.
     --load-credential=ID:PATH
                            Load credential to pass to container from file or
                            AF_UNIX stream socket.

See the systemd-nspawn(1) man page for details.
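
To show how several of the options listed above combine, the following sketch boots a throwaway snapshot of a container (-x) with a virtual Ethernet link to the host (--network-veth) and a read-only bind mount (--bind-ro=). The container path and the bind-mounted /var/db/repos/gentoo directory are assumptions for illustration. The run helper is hypothetical and prints the command unless RUN=1 is set.

```shell
#!/bin/sh
# Hypothetical helper: print the command by default, execute it when RUN=1.
run() {
    if [ "${RUN:-0}" = 1 ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Ephemeral boot: changes inside the container are discarded on exit.
run systemd-nspawn -x -b -D /var/lib/machines/larrythecow \
    --network-veth \
    --bind-ro=/var/db/repos/gentoo
```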

Troubleshooting

Can I combine QEMU and systemd-nspawn to cross-compile binaries?

Yes. Follow the instructions to build QEMU with static-user support and make sure the systemd-binfmt service is enabled. Then start the container as normal:

root #systemd-nspawn -D /var/lib/machines/<container_with_different_cpu_isa>
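
If the container fails to start with an "Exec format error", one thing worth checking is whether a binfmt_misc handler for the foreign architecture is registered. The sketch below looks for a handler entry under /proc/sys/fs/binfmt_misc. The handler name qemu-aarch64 is an assumption; the actual name depends on how the handlers were registered on the host. The has_binfmt_handler function is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a binfmt_misc handler with the given name exists.
# The second argument overrides the directory to inspect (useful for testing).
has_binfmt_handler() {
    handler=$1
    binfmt_dir=${2:-/proc/sys/fs/binfmt_misc}
    [ -e "$binfmt_dir/$handler" ]
}

# "qemu-aarch64" is an assumed handler name; adjust to the target architecture.
if has_binfmt_handler qemu-aarch64; then
    echo "qemu-aarch64 handler registered"
else
    echo "qemu-aarch64 handler not found"
fi
```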

See also

  • Docker — a container virtualization environment
  • Podman — a daemonless container engine for developing, managing, and running OCI Containers on Linux.
  • LXC — a virtualization system making use of Linux's namespaces and cgroups.
  • LXD — a next-generation system container manager.

External resources

Image Repositories

Container setup

OpenRC

Unpack a stage3 tarball into a subdirectory of /var/lib/machines.

Edit /var/lib/machines/.../etc/inittab to disable agetty on tty[1-6] and enable it on console.

FILE /etc/inittab
# TERMINALS
x1:12345:respawn:/sbin/agetty 38400 console linux
#c1:12345:respawn:/sbin/agetty --noclear 38400 tty1 linux
#c2:2345:respawn:/sbin/agetty 38400 tty2 linux
#c3:2345:respawn:/sbin/agetty 38400 tty3 linux
#c4:2345:respawn:/sbin/agetty 38400 tty4 linux
#c5:2345:respawn:/sbin/agetty 38400 tty5 linux
#c6:2345:respawn:/sbin/agetty 38400 tty6 linux
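
The edit above could be scripted roughly as follows. This is a sketch under assumptions: the console_getty_only function is hypothetical, it expects the stock c1 to c6 entry names shown above, and it is demonstrated on a scratch file rather than on a real container's inittab.

```shell
#!/bin/sh
# Hypothetical helper: disable agetty on tty1..tty6 and enable it on console.
console_getty_only() {
    inittab=$1
    tmp=$(mktemp)
    # Comment out the active c1..c6 agetty entries; lines already
    # starting with '#' do not match and are left alone.
    sed 's/^\(c[1-6]:.*agetty.*tty[1-6].*\)$/#\1/' "$inittab" > "$tmp" &&
        cat "$tmp" > "$inittab"   # rewrite in place, keeping the file's mode
    rm -f "$tmp"
    # Append a console getty unless an x1 entry already exists.
    grep -q '^x1:' "$inittab" ||
        printf '%s\n' 'x1:12345:respawn:/sbin/agetty 38400 console linux' >> "$inittab"
}

# Demonstrate on a scratch file containing two sample entries.
sample=$(mktemp)
printf '%s\n' \
    'c1:12345:respawn:/sbin/agetty --noclear 38400 tty1 linux' \
    'c2:2345:respawn:/sbin/agetty 38400 tty2 linux' > "$sample"
console_getty_only "$sample"
cat "$sample"
```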

Clear the root password in /var/lib/machines/.../etc/shadow.

FILE /etc/shadow
root::10770:0:::::
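
Clearing the password field can also be done non-interactively. The sketch below blanks the second field of the root entry and leaves all other entries untouched. The clear_root_password function is hypothetical and is demonstrated on a scratch file with made-up entries. An empty root password is only reasonable for a local test container.

```shell
#!/bin/sh
# Hypothetical helper: empty the password field of the root entry in a shadow file.
clear_root_password() {
    shadow=$1
    tmp=$(mktemp)
    # Replace whatever sits between the first and second colon on the root line.
    sed 's/^root:[^:]*:/root::/' "$shadow" > "$tmp" &&
        cat "$tmp" > "$shadow"   # rewrite in place, keeping the file's mode
    rm -f "$tmp"
}

# Demonstrate on a scratch file with illustrative entries.
sample=$(mktemp)
printf '%s\n' 'root:*:10770:0:::::' 'larry:!:10770:0:99999:7:::' > "$sample"
clear_root_password "$sample"
cat "$sample"
```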

Invoke systemd-nspawn with the -b option to boot.

root #systemd-nspawn -M openrc-20240317 -b