Loading Seccomp Rules From File
The default seccomp profile provides a sane default for running containers with seccomp and disables around 44 system calls out of 300+. It is moderately protective while providing wide application compatibility. The default Docker profile can be found here.
In effect, the profile is an allowlist which denies access to system calls by default, then allowlists specific system calls. The profile works by defining a defaultAction of SCMP_ACT_ERRNO and overriding that action only for specific system calls. The effect of SCMP_ACT_ERRNO is to cause a Permission Denied error. Next, the profile defines a specific list of system calls which are fully allowed, because their action is overridden to be SCMP_ACT_ALLOW. Finally, some rules apply to individual system calls, such as personality and others, to allow variants of those system calls with specific arguments.
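The allowlist behaviour described above can be sketched in a few lines of Python. This is a simplified model, not the runtime's or the kernel's implementation: the `resolve_action` helper and the tiny sample profile are hypothetical, and rules that compare syscall arguments are not modelled.

```python
import json

# A tiny, hypothetical profile: deny by default, then allow a few
# named system calls. Real profiles list far more entries.
SAMPLE_PROFILE = json.loads("""
{
  "defaultAction": "SCMP_ACT_ERRNO",
  "syscalls": [
    {"names": ["read", "write", "exit_group"], "action": "SCMP_ACT_ALLOW"}
  ]
}
""")


def resolve_action(profile, syscall_name):
    """Return the action the profile assigns to syscall_name.

    Simplified: the first rule naming the syscall wins, and
    argument-specific rules are ignored.
    """
    for rule in profile.get("syscalls", []):
        if syscall_name in rule.get("names", []):
            return rule["action"]
    # No rule overrides the action, so the default applies.
    return profile["defaultAction"]


if __name__ == "__main__":
    for name in ("read", "mount"):
        print(name, "->", resolve_action(SAMPLE_PROFILE, name))
```

In this sketch `read` resolves to the overriding SCMP_ACT_ALLOW, while `mount`, which no rule names, falls through to the default SCMP_ACT_ERRNO.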
Seccomp stands for secure computing mode and has been a feature of the Linux kernel since version 2.6.12. It can be used to sandbox the privileges of a process, restricting the calls it is able to make from userspace into the kernel. Kubernetes lets you automatically apply seccomp profiles loaded onto a node to your Pods and containers.
Identifying the privileges required for your workloads can be difficult. In this tutorial, you will go through how to load seccomp profiles into a local Kubernetes cluster, how to apply them to a Pod, and how you can begin to craft profiles that give only the necessary privileges to your container processes.
pods/security/seccomp/profiles/audit.json

    {
        "defaultAction": "SCMP_ACT_LOG"
    }

pods/security/seccomp/profiles/violation.json

    {
        "defaultAction": "SCMP_ACT_ERRNO"
    }

pods/security/seccomp/profiles/fine-grained.json

    {
        "defaultAction": "SCMP_ACT_ERRNO",
        "architectures": [
            "SCMP_ARCH_X86_64",
            "SCMP_ARCH_X86",
            "SCMP_ARCH_X32"
        ],
        "syscalls": [
            {
                "names": [
                    "accept4", "epoll_wait", "pselect6", "futex", "madvise",
                    "epoll_ctl", "getsockname", "setsockopt", "vfork", "mmap",
                    "read", "write", "close", "arch_prctl", "sched_getaffinity",
                    "munmap", "brk", "rt_sigaction", "rt_sigprocmask", "sigaltstack",
                    "gettid", "clone", "bind", "socket", "openat",
                    "readlinkat", "exit_group", "epoll_create1", "listen", "rt_sigreturn",
                    "sched_yield", "clock_gettime", "connect", "dup2", "epoll_pwait",
                    "execve", "exit", "fcntl", "getpid", "getuid",
                    "ioctl", "mprotect", "nanosleep", "open", "poll",
                    "recvfrom", "sendto", "set_tid_address", "setitimer", "writev"
                ],
                "action": "SCMP_ACT_ALLOW"
            }
        ]
    }

Run these commands:
For simplicity, kind can be used to create a single-node cluster with the seccomp profiles loaded. Kind runs Kubernetes in Docker, so each node of the cluster is a container. This allows files to be mounted in the filesystem of each container, similar to loading files onto a node.
As a beta feature, you can configure Kubernetes to use the profile that the container runtime prefers by default, rather than falling back to Unconfined. If you want to try that, see "enable the use of RuntimeDefault as the default seccomp profile for all workloads" before you continue.
If you observe the filesystem of that container, you should see that the profiles/ directory has been successfully loaded into the default seccomp path of the kubelet. Use docker exec to run a command in the node container:
To use seccomp profile defaulting, you must run the kubelet with the SeccompDefault feature gate enabled (this is the default). You must also explicitly enable the defaulting behavior for each node where you want to use this with the corresponding --seccomp-default command line flag. Both have to be enabled simultaneously to use the feature.
If enabled, the kubelet will use the RuntimeDefault seccomp profile by default, which is defined by the container runtime, instead of using the Unconfined (seccomp disabled) mode. The default profiles aim to provide a strong set of security defaults while preserving the functionality of the workload. It is possible that the default profiles differ between container runtimes and their release versions, for example when comparing those from CRI-O and containerd.
Kubernetes 1.26 lets you configure the seccomp profile that applies when the spec for a Pod doesn't define a specific seccomp profile. This is a beta feature and the corresponding SeccompDefault feature gate is enabled by default. However, you still need to enable this defaulting for each node where you would like to use it.
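The defaulting rule described above can be summarised as a small decision function. The sketch below is only a model of the behaviour as stated in the text, not kubelet code: `effective_seccomp_profile` and its parameter names are hypothetical.

```python
from typing import Optional


def effective_seccomp_profile(pod_profile: Optional[str],
                              feature_gate_enabled: bool,
                              seccomp_default_flag: bool) -> str:
    """Model which seccomp profile applies to a workload on one node.

    pod_profile: the profile named in the Pod spec, or None if unset.
    feature_gate_enabled: whether the SeccompDefault feature gate is on.
    seccomp_default_flag: whether the node's kubelet was started with
        --seccomp-default (or the equivalent configuration file setting).
    """
    if pod_profile is not None:
        # An explicit profile in the spec is not replaced by defaulting.
        return pod_profile
    if feature_gate_enabled and seccomp_default_flag:
        # Both switches are needed for the defaulting behaviour.
        return "RuntimeDefault"
    return "Unconfined"


if __name__ == "__main__":
    print(effective_seccomp_profile(None, True, True))
    print(effective_seccomp_profile(None, True, False))
```

Under these assumptions, a Pod without a profile gets RuntimeDefault only on nodes where both the feature gate and the flag are enabled, and stays Unconfined otherwise.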
If you are running a Kubernetes 1.26 cluster and want to enable the feature, either run the kubelet with the --seccomp-default command line flag, or enable it through the kubelet configuration file. To enable the feature gate in kind, ensure that kind provides the minimum required Kubernetes version and enables the SeccompDefault feature in the kind configuration:
Since Kubernetes v1.25, kubelets no longer support the seccomp annotations, use of the annotations in static pods is no longer supported, and the seccomp annotations are no longer auto-populated when pods with seccomp fields are created. Auto-population of the seccomp fields from the annotations is planned to be removed in a future release.
You can begin to understand the syscalls required by the http-echo process by looking at the syscall= entry on each line. While these are unlikely to encompass all the syscalls it uses, they can serve as a basis for a seccomp profile for this container.
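Collecting the syscall= entries by hand gets tedious, so a short script can help. The following is a minimal sketch: the `syscall_numbers` helper is hypothetical, and the sample lines are made up to resemble audit records rather than copied from a real log. Note that the numbers are architecture-specific and still have to be mapped to syscall names separately.

```python
import re

# Matches the numeric syscall= field of an audit record.
SYSCALL_RE = re.compile(r"\bsyscall=(\d+)\b")


def syscall_numbers(lines):
    """Return the sorted, de-duplicated syscall numbers found in lines."""
    numbers = set()
    for line in lines:
        match = SYSCALL_RE.search(line)
        if match:
            numbers.add(int(match.group(1)))
    return sorted(numbers)


# Illustrative input only; real records carry more fields.
SAMPLE_LINES = [
    'audit: type=1326 comm="http-echo" arch=c000003e syscall=51 compat=0',
    'audit: type=1326 comm="http-echo" arch=c000003e syscall=3 compat=0',
    'audit: type=1326 comm="http-echo" arch=c000003e syscall=51 compat=0',
    'an unrelated log line without the field',
]

if __name__ == "__main__":
    print(syscall_numbers(SAMPLE_LINES))
```

To use it on a real log, pass an open file object instead of the sample list, for example `syscall_numbers(open(path))` with a path of your choosing.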
You should see no output in the syslog. This is because the profile allowed all necessary syscalls and specified that an error should occur if one outside of the list is invoked. This is an ideal situation from a security perspective, but required some effort in analyzing the program. It would be nice if there were a simple way to get closer to this security without requiring as much effort.
Sounds like seccomp rules should be a runtime config instead of a compile-time thing (aka program reads in the seccomp rules from a file and then loads them, instead of being program data). Or we just ditch all precompiled 32-bit programs with builtin seccomp in 2038.

vDSO, 32-bit time, and seccomp
Posted Aug 2, 2019 21:12 UTC (Fri) by arnd (subscriber, #8866)
For example, different libcs use different syscalls, which is the first thing to be compatible with. Shared library loading can lead to very unexpected behaviour as well. LD_PRELOAD is one example. Another one is that when resolving hostnames, libnss in glibc loads shared modules for resolution behavior, and it's very difficult to predict what these will do. (OpenBSD's pledge has a special case for DNS as well, I believe so that they can distinguish between DNS and other UDP.)

In the end, with seccomp you need very good control of how a program is built, which libc it uses, and in the case of glibc+DNS even how the system is configured. That seems unrealistic.

vDSO, 32-bit time, and seccomp
Posted Aug 6, 2019 7:42 UTC (Tue) by mm7323 (subscriber, #87386)
Relocation processing and such may make this fiddly to implement, but given most things would be dynamically linked against glibc where the system calls commonly come from, it might be possible to reduce overhead to just when loading that shared library with minimal loss for most other programs.

vDSO, 32-bit time, and seccomp
Posted Aug 6, 2019 23:37 UTC (Tue) by roc (subscriber, #30627)
That's why I suggest verifying the return addresses as well as call sites - to make chaining ROP gadgets harder. Combined with something like Pointer Authentication Codes in user space, this could button up call flows nicely to ensure code executes as designed when compiled.

That said, I'm not sure if it is possible to 'fake' the return address of a supervisor call or exception on any architectures.

> (You'd have to check that the stacks return to loci where there are actually function calls, and that's going to be much more expensive.)

All security has an overhead. The question is whether such a system could be made efficient enough to be worth the benefit. The idea here is to leverage the compiler to produce the needed records and fix them up when loading/dynamic linking so that execution overhead could be as simple as some table lookups in the kernel around system calls. It will never be for free, and even hardware assisted things like PAC add instructions.

vDSO, 32-bit time, and seccomp
Posted Aug 8, 2019 17:25 UTC (Thu) by flussence (subscriber, #85566)
The early version of seccomp allowed only a specific set of system calls to be used by a process in secure mode: exit(), sigreturn(), and write() and read() on already opened file descriptors. Using any other system call resulted in the kernel terminating the process with a SIGKILL signal.
seccomp-bpf is an extension of seccomp, based on the Berkeley Packet Filter, created to allow more flexible usage. It provides a more advanced way of controlling access to the system. With it we can not only deny specific system calls outright (which is all the first version of seccomp was capable of), but also specify rules for any of them and even require additional argument comparisons. A seccomp-bpf configuration consists of two parts: the default behaviour for system calls that have no rules specified, and a set of rules for specific system calls, each with its own behaviour (different from the default one) and optional argument comparisons.
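The two-part structure described above (a default behaviour plus per-syscall rules with optional argument comparisons) can be modelled in plain Python. This is an illustrative sketch only: it does not build or load a real BPF filter, and the `Rule` and `Filter` classes, as well as the action strings, are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

# One argument comparison: (argument index, predicate on that argument).
ArgCheck = Tuple[int, Callable[[int], bool]]


@dataclass
class Rule:
    syscall: str
    action: str
    arg_checks: List[ArgCheck] = field(default_factory=list)

    def matches(self, syscall: str, args: Sequence[int]) -> bool:
        """True if the name matches and every argument comparison holds."""
        if syscall != self.syscall:
            return False
        return all(index < len(args) and predicate(args[index])
                   for index, predicate in self.arg_checks)


@dataclass
class Filter:
    default_action: str
    rules: List[Rule] = field(default_factory=list)

    def decide(self, syscall: str, args: Sequence[int] = ()) -> str:
        """Return the first matching rule's action, else the default."""
        for rule in self.rules:
            if rule.matches(syscall, args):
                return rule.action
        return self.default_action


# Example policy: allow write() only to file descriptor 1, allow
# exit_group() unconditionally, and apply the default to the rest.
EXAMPLE = Filter(
    default_action="KILL",
    rules=[
        Rule("write", "ALLOW", [(0, lambda fd: fd == 1)]),
        Rule("exit_group", "ALLOW"),
    ],
)

if __name__ == "__main__":
    print(EXAMPLE.decide("write", (1, 0, 16)))
    print(EXAMPLE.decide("write", (5, 0, 16)))
    print(EXAMPLE.decide("open", (0,)))
```

Here a write() to descriptor 1 is allowed, while a write() to descriptor 5 fails the argument comparison and, like the unlisted open(), gets the default action.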
To check how enabling and loading seccomp filters for a process impacts its performance, we performed a few tests with different seccomp settings on two different architectures. The time was measured using the getrusage() system call, so that only the time spent by the process in the kernel was counted.
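The measurement approach can be illustrated with Python's standard `resource` module (Unix only), which wraps getrusage(). This is a minimal sketch of the idea, not the test harness used for the results here: `kernel_time_of` and `many_getppid_calls` are hypothetical helpers, and no seccomp filter is installed by this code.

```python
import os
import resource


def kernel_time_of(workload):
    """Return the system (kernel) CPU time, in seconds, spent in workload()."""
    before = resource.getrusage(resource.RUSAGE_SELF).ru_stime
    workload()
    after = resource.getrusage(resource.RUSAGE_SELF).ru_stime
    return after - before


def many_getppid_calls(count=100_000):
    """Issue a cheap system call repeatedly so kernel time accumulates."""
    for _ in range(count):
        os.getppid()


if __name__ == "__main__":
    elapsed = kernel_time_of(many_getppid_calls)
    print(f"kernel time: {elapsed:.6f} s")
```

Running the same workload with and without a filter loaded, and comparing the two figures, would give a rough view of the filter's overhead; the resolution of ru_stime depends on the platform.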
(Almost) all of the system calls were split into three groups: system calls used by functions that offer no sane possibility of argument comparison (like nanosleep, which takes two pointers as arguments); system calls used by functions whose arguments can be easily compared (e.g. "read() only from fd