| Age | Commit message (Collapse) | Author | Files | Lines |
|
git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl
Pull sysctl updates from Joel Granados:
"sysctl ctl_table constification:
- Constifying ctl_table structs prevents the modification of
proc_handler function pointers. All ctl_table struct arguments are
const qualified in the sysctl API in such a way that the ctl_table
arrays being defined elsewhere and passed through sysctl can be
constified one-by-one.
We kick the constification off by qualifying user_table in
kernel/ucount.c and expect all the ctl_tables to be constified in
the coming releases.
Misc fixes:
- Adjust comments in two places to better reflect the code
- Remove superfluous dput calls
- Remove Luis from sysctl maintainership
- Replace comments about holding a lock with calls to
lockdep_assert_held"
* tag 'sysctl-6.13-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/sysctl/sysctl:
sysctl: Reduce dput(child) calls in proc_sys_fill_cache()
sysctl: Reorganize kerneldoc parameter names
ucounts: constify sysctl table user_table
sysctl: update comments to new registration APIs
MAINTAINERS: remove me from sysctl
sysctl: Convert locking comments to lockdep assertions
const_structs.checkpatch: add ctl_table
sysctl: make internal ctl_tables const
sysctl: allow registration of const struct ctl_table
sysctl: move internal interfaces to const struct ctl_table
bpf: Constify ctl_table argument of filter function
|
|
The sysctl core is moving to allow "struct ctl_table" in read-only memory.
As a preparation for that all functions handling "struct ctl_table" need
to be able to work with "const struct ctl_table".
As __cgroup_bpf_run_filter_sysctl() does not modify its table, it can be
adapted trivially.
Signed-off-by: Thomas Weißschuh <linux@weissschuh.net>
Signed-off-by: Joel Granados <joel.granados@kernel.org>
|
|
A hung_task problem shown below was found:
INFO: task kworker/0:0:8 blocked for more than 327 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Workqueue: events cgroup_bpf_release
Call Trace:
<TASK>
__schedule+0x5a2/0x2050
? find_held_lock+0x33/0x100
? wq_worker_sleeping+0x9e/0xe0
schedule+0x9f/0x180
schedule_preempt_disabled+0x25/0x50
__mutex_lock+0x512/0x740
? cgroup_bpf_release+0x1e/0x4d0
? cgroup_bpf_release+0xcf/0x4d0
? process_scheduled_works+0x161/0x8a0
? cgroup_bpf_release+0x1e/0x4d0
? mutex_lock_nested+0x2b/0x40
? __pfx_delay_tsc+0x10/0x10
mutex_lock_nested+0x2b/0x40
cgroup_bpf_release+0xcf/0x4d0
? process_scheduled_works+0x161/0x8a0
? trace_event_raw_event_workqueue_execute_start+0x64/0xd0
? process_scheduled_works+0x161/0x8a0
process_scheduled_works+0x23a/0x8a0
worker_thread+0x231/0x5b0
? __pfx_worker_thread+0x10/0x10
kthread+0x14d/0x1c0
? __pfx_kthread+0x10/0x10
ret_from_fork+0x59/0x70
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
</TASK>
This issue can be reproduced by the following pressuse test:
1. A large number of cpuset cgroups are deleted.
2. Set cpu on and off repeatly.
3. Set watchdog_thresh repeatly.
The scripts can be obtained at LINK mentioned above the signature.
The reason for this issue is cgroup_mutex and cpu_hotplug_lock are
acquired in different tasks, which may lead to deadlock.
It can lead to a deadlock through the following steps:
1. A large number of cpusets are deleted asynchronously, which puts a
large number of cgroup_bpf_release works into system_wq. The max_active
of system_wq is WQ_DFL_ACTIVE(256). Consequently, all active works are
cgroup_bpf_release works, and many cgroup_bpf_release works will be put
into inactive queue. As illustrated in the diagram, there are 256 (in
the acvtive queue) + n (in the inactive queue) works.
2. Setting watchdog_thresh will hold cpu_hotplug_lock.read and put
smp_call_on_cpu work into system_wq. However step 1 has already filled
system_wq, 'sscs.work' is put into inactive queue. 'sscs.work' has
to wait until the works that were put into the inacvtive queue earlier
have executed (n cgroup_bpf_release), so it will be blocked for a while.
3. Cpu offline requires cpu_hotplug_lock.write, which is blocked by step 2.
4. Cpusets that were deleted at step 1 put cgroup_release works into
cgroup_destroy_wq. They are competing to get cgroup_mutex all the time.
When cgroup_metux is acqured by work at css_killed_work_fn, it will
call cpuset_css_offline, which needs to acqure cpu_hotplug_lock.read.
However, cpuset_css_offline will be blocked for step 3.
5. At this moment, there are 256 works in active queue that are
cgroup_bpf_release, they are attempting to acquire cgroup_mutex, and as
a result, all of them are blocked. Consequently, sscs.work can not be
executed. Ultimately, this situation leads to four processes being
blocked, forming a deadlock.
system_wq(step1) WatchDog(step2) cpu offline(step3) cgroup_destroy_wq(step4)
...
2000+ cgroups deleted asyn
256 actives + n inactives
__lockup_detector_reconfigure
P(cpu_hotplug_lock.read)
put sscs.work into system_wq
256 + n + 1(sscs.work)
sscs.work wait to be executed
warting sscs.work finish
percpu_down_write
P(cpu_hotplug_lock.write)
...blocking...
css_killed_work_fn
P(cgroup_mutex)
cpuset_css_offline
P(cpu_hotplug_lock.read)
...blocking...
256 cgroup_bpf_release
mutex_lock(&cgroup_mutex);
..blocking...
To fix the problem, place cgroup_bpf_release works on a dedicated
workqueue which can break the loop and solve the problem. System wqs are
for misc things which shouldn't create a large number of concurrent work
items. If something is going to generate >WQ_DFL_ACTIVE(256) concurrent
work items, it should use its own dedicated workqueue.
Fixes: 4bfc0bb2c60e ("bpf: decouple the lifetime of cgroup_bpf from cgroup itself")
Cc: stable@vger.kernel.org # v5.3+
Link: https://lore.kernel.org/cgroups/e90c32d2-2a85-4f28-9154-09c7d320cb60@huawei.com/T/#t
Tested-by: Vishal Chourasia <vishalc@linux.ibm.com>
Signed-off-by: Chen Ridong <chenridong@huawei.com>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
The helper bpf_current_task_under_cgroup() currently is only allowed for
tracing programs, allow its usage also in the BPF_CGROUP_* program types.
Move the code from kernel/trace/bpf_trace.c to kernel/bpf/helpers.c,
so it compiles also without CONFIG_BPF_EVENTS.
This will be used in systemd-networkd to monitor the sysctl writes,
and filter it's own writes from others:
https://github.com/systemd/systemd/pull/32212
Signed-off-by: Matteo Croce <teknoraver@meta.com>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20240819162805.78235-3-technoboy85@gmail.com
|
|
Currently bpf_get_current_pid_tgid() is allowed in tracing, cgroup
and sk_msg progs while bpf_get_ns_current_pid_tgid() is only allowed
in tracing progs.
We have an internal use case where for an application running
in a container (with pid namespace), user wants to get
the pid associated with the pid namespace in a cgroup bpf
program. Currently, cgroup bpf progs already allow
bpf_get_current_pid_tgid(). Let us allow bpf_get_ns_current_pid_tgid()
as well.
With auditing the code, bpf_get_current_pid_tgid() is also used
by sk_msg prog. But there are no side effect to expose these two
helpers to all prog types since they do not reveal any kernel specific
data. The detailed discussion is in [1].
So with this patch, both bpf_get_current_pid_tgid() and bpf_get_ns_current_pid_tgid()
are put in bpf_base_func_proto(), making them available to all
program types.
[1] https://lore.kernel.org/bpf/20240307232659.1115872-1-yonghong.song@linux.dev/
Signed-off-by: Yonghong Song <yonghong.song@linux.dev>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Acked-by: Jiri Olsa <jolsa@kernel.org>
Link: https://lore.kernel.org/bpf/20240315184854.2975190-1-yonghong.song@linux.dev
|
|
This is a cleanup patch, making code a bit more concise.
1) Use skb_network_offset(skb) in place of
(skb_network_header(skb) - skb->data)
2) Use -skb_network_offset(skb) in place of
(skb->data - skb_network_header(skb))
3) Use skb_transport_offset(skb) in place of
(skb_transport_header(skb) - skb->data)
4) Use skb_inner_transport_offset(skb) in place of
(skb_inner_transport_header(skb) - skb->data)
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Edward Cree <ecree.xilinx@gmail.com> # for sfc
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Originally, this patch removed a redundant check in
BPF_CGROUP_RUN_PROG_INET_EGRESS, as the check was already being done in
the function it called, __cgroup_bpf_run_filter_skb. For v2, it was
reccomended that I remove the check from __cgroup_bpf_run_filter_skb,
and add the checks to the other macro that calls that function,
BPF_CGROUP_RUN_PROG_INET_INGRESS.
To sum it up, checking that the socket exists and that it is a full
socket is now part of both macros BPF_CGROUP_RUN_PROG_INET_EGRESS and
BPF_CGROUP_RUN_PROG_INET_INGRESS, and it is no longer part of the
function they call, __cgroup_bpf_run_filter_skb.
v3->v4: Fixed weird merge conflict.
v2->v3: Sent to bpf-next instead of generic patch
v1->v2: Addressed feedback about where check should be removed.
Signed-off-by: Oliver Crumrine <ozlinuxc@gmail.com>
Acked-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/7lv62yiyvmj5a7eozv2iznglpkydkdfancgmbhiptrgvgan5sy@3fl3onchgdz3
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Instead of performing unconditional system-wide bpf_capable() and
perfmon_capable() calls inside bpf_base_func_proto() function (and other
similar ones) to determine eligibility of a given BPF helper for a given
program, use previously recorded BPF token during BPF_PROG_LOAD command
handling to inform the decision.
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20240124022127.2379740-8-andrii@kernel.org
|
|
Pull io_uring {get,set}sockopt support from Jens Axboe:
"This adds support for using getsockopt and setsockopt via io_uring.
The main use cases for this is to enable use of direct descriptors,
rather than first instantiating a normal file descriptor, doing the
option tweaking needed, then turning it into a direct descriptor. With
this support, we can avoid needing a regular file descriptor
completely.
The net and bpf bits have been signed off on their side"
* tag 'for-6.7/io_uring-sockopt-2023-10-30' of git://git.kernel.dk/linux:
selftests/bpf/sockopt: Add io_uring support
io_uring/cmd: Introduce SOCKET_URING_OP_SETSOCKOPT
io_uring/cmd: Introduce SOCKET_URING_OP_GETSOCKOPT
io_uring/cmd: return -EOPNOTSUPP if net is disabled
selftests/net: Extract uring helpers to be reusable
tools headers: Grab copy of io_uring.h
io_uring/cmd: Pass compat mode in issue_flags
net/socket: Break down __sys_getsockopt
net/socket: Break down __sys_setsockopt
bpf: Add sockptr support for setsockopt
bpf: Add sockptr support for getsockopt
|
|
The whole network stack uses sockptr, and while it doesn't move to
something more modern, let's use sockptr in setsockptr BPF hooks, so, it
could be used by other callers.
The main motivation for this change is to use it in the io_uring
{g,s}etsockopt(), which will use a userspace pointer for *optval, but, a
kernel value for optlen.
Link: https://lore.kernel.org/all/ZSArfLaaGcfd8LH8@gmail.com/
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20231016134750.1381153-3-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
The whole network stack uses sockptr, and while it doesn't move to
something more modern, let's use sockptr in getsockptr BPF hooks, so, it
could be used by other callers.
The main motivation for this change is to use it in the io_uring
{g,s}etsockopt(), which will use a userspace pointer for *optval, but, a
kernel value for optlen.
Link: https://lore.kernel.org/all/ZSArfLaaGcfd8LH8@gmail.com/
Signed-off-by: Breno Leitao <leitao@debian.org>
Acked-by: Martin KaFai Lau <martin.lau@kernel.org>
Link: https://lore.kernel.org/r/20231016134750.1381153-2-leitao@debian.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
These hooks allows intercepting connect(), getsockname(),
getpeername(), sendmsg() and recvmsg() for unix sockets. The unix
socket hooks get write access to the address length because the
address length is not fixed when dealing with unix sockets and
needs to be modified when a unix socket address is modified by
the hook. Because abstract socket unix addresses start with a
NUL byte, we cannot recalculate the socket address in kernelspace
after running the hook by calculating the length of the unix socket
path using strlen().
These hooks can be used when users want to multiplex syscall to a
single unix socket to multiple different processes behind the scenes
by redirecting the connect() and other syscalls to process specific
sockets.
We do not implement support for intercepting bind() because when
using bind() with unix sockets with a pathname address, this creates
an inode in the filesystem which must be cleaned up. If we rewrite
the address, the user might try to clean up the wrong file, leaking
the socket in the filesystem where it is never cleaned up. Until we
figure out a solution for this (and a use case for intercepting bind()),
we opt to not allow rewriting the sockaddr in bind() calls.
We also implement recvmsg() support for connected streams so that
after a connect() that is modified by a sockaddr hook, any corresponding
recmvsg() on the connected socket can also be modified to make the
connected program think it is connected to the "intended" remote.
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Link: https://lore.kernel.org/r/20231011185113.140426-5-daan.j.demeyer@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
As prep for adding unix socket support to the cgroup sockaddr hooks,
let's propagate the sockaddr length back to the caller after running
a bpf cgroup sockaddr hook program. While not important for AF_INET or
AF_INET6, the sockaddr length is important when working with AF_UNIX
sockaddrs as the size of the sockaddr cannot be determined just from the
address family or the sockaddr's contents.
__cgroup_bpf_run_filter_sock_addr() is modified to take the uaddrlen as
an input/output argument. After running the program, the modified sockaddr
length is stored in the uaddrlen pointer.
Signed-off-by: Daan De Meyer <daan.j.demeyer@gmail.com>
Link: https://lore.kernel.org/r/20231011185113.140426-3-daan.j.demeyer@gmail.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Fix missing or extra function parameter kernel-doc warnings
in cgroup.c:
kernel/bpf/cgroup.c:1359: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_skb'
kernel/bpf/cgroup.c:1359: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_skb'
kernel/bpf/cgroup.c:1439: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sk'
kernel/bpf/cgroup.c:1439: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sk'
kernel/bpf/cgroup.c:1467: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sock_addr'
kernel/bpf/cgroup.c:1467: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sock_addr'
kernel/bpf/cgroup.c:1512: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sock_ops'
kernel/bpf/cgroup.c:1512: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sock_ops'
kernel/bpf/cgroup.c:1685: warning: Excess function parameter 'type' description in '__cgroup_bpf_run_filter_sysctl'
kernel/bpf/cgroup.c:1685: warning: Function parameter or member 'atype' not described in '__cgroup_bpf_run_filter_sysctl'
kernel/bpf/cgroup.c:795: warning: Excess function parameter 'type' description in '__cgroup_bpf_replace'
kernel/bpf/cgroup.c:795: warning: Function parameter or member 'new_prog' not described in '__cgroup_bpf_replace'
Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: bpf@vger.kernel.org
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/20230912060812.1715-1-rdunlap@infradead.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:
====================
pull-request: bpf-next 2023-05-16
We've added 57 non-merge commits during the last 19 day(s) which contain
a total of 63 files changed, 3293 insertions(+), 690 deletions(-).
The main changes are:
1) Add precision propagation to verifier for subprogs and callbacks,
from Andrii Nakryiko.
2) Improve BPF's {g,s}setsockopt() handling with wrong option lengths,
from Stanislav Fomichev.
3) Utilize pahole v1.25 for the kernel's BTF generation to filter out
inconsistent function prototypes, from Alan Maguire.
4) Various dyn-pointer verifier improvements to relax restrictions,
from Daniel Rosenberg.
5) Add a new bpf_task_under_cgroup() kfunc for designated task,
from Feng Zhou.
6) Unblock tests for arm64 BPF CI after ftrace supporting direct call,
from Florent Revest.
7) Add XDP hint kfunc metadata for RX hash/timestamp for igc,
from Jesper Dangaard Brouer.
8) Add several new dyn-pointer kfuncs to ease their usability,
from Joanne Koong.
9) Add in-depth LRU internals description and dot function graph,
from Joe Stringer.
10) Fix KCSAN report on bpf_lru_list when accessing node->ref,
from Martin KaFai Lau.
11) Only dump unprivileged_bpf_disabled log warning upon write,
from Kui-Feng Lee.
12) Extend test_progs to directly passing allow/denylist file,
from Stephen Veiss.
13) Fix BPF trampoline memleak upon failure attaching to fentry,
from Yafang Shao.
14) Fix emitting struct bpf_tcp_sock type in vmlinux BTF,
from Yonghong Song.
* tag 'for-netdev' of https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (57 commits)
bpf: Fix memleak due to fentry attach failure
bpf: Remove bpf trampoline selector
bpf, arm64: Support struct arguments in the BPF trampoline
bpftool: JIT limited misreported as negative value on aarch64
bpf: fix calculation of subseq_idx during precision backtracking
bpf: Remove anonymous union in bpf_kfunc_call_arg_meta
bpf: Document EFAULT changes for sockopt
selftests/bpf: Correctly handle optlen > 4096
selftests/bpf: Update EFAULT {g,s}etsockopt selftests
bpf: Don't EFAULT for {g,s}setsockopt with wrong optlen
libbpf: fix offsetof() and container_of() to work with CO-RE
bpf: Address KCSAN report on bpf_lru_list
bpf: Add --skip_encoding_btf_inconsistent_proto, --btf_gen_optimized to pahole flags for v1.25
selftests/bpf: Accept mem from dynptr in helper funcs
bpf: verifier: Accept dynptr mem as mem in helpers
selftests/bpf: Check overflow in optional buffer
selftests/bpf: Test allowing NULL buffer in dynptr slice
bpf: Allow NULL buffers in bpf_dynptr_slice(_rw)
selftests/bpf: Add testcase for bpf_task_under_cgroup
bpf: Add bpf_task_under_cgroup() kfunc
...
====================
Link: https://lore.kernel.org/r/20230515225603.27027-1-daniel@iogearbox.net
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
With the way the hooks implemented right now, we have a special
condition: optval larger than PAGE_SIZE will expose only first 4k into
BPF; any modifications to the optval are ignored. If the BPF program
doesn't handle this condition by resetting optlen to 0,
the userspace will get EFAULT.
The intention of the EFAULT was to make it apparent to the
developers that the program is doing something wrong.
However, this inadvertently might affect production workloads
with the BPF programs that are not too careful (i.e., returning EFAULT
for perfectly valid setsockopt/getsockopt calls).
Let's try to minimize the chance of BPF program screwing up userspace
by ignoring the output of those BPF programs (instead of returning
EFAULT to the userspace). pr_info_once those cases to
the dmesg to help with figuring out what's going wrong.
Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks")
Suggested-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20230511170456.1759459-2-sdf@google.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
- cpuset changes including the fix for an incorrect interaction with
CPU hotplug and an optimization
- Other doc and cosmetic changes
* tag 'cgroup-for-6.4' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
docs: cgroup-v1/cpusets: update libcgroup project link
cgroup/cpuset: Minor updates to test_cpuset_prs.sh
cgroup/cpuset: Include offline CPUs when tasks' cpumasks in top_cpuset are updated
cgroup/cpuset: Skip task update if hotplug doesn't affect current cpuset
cpuset: Clean up cpuset_node_allowed
cgroup: bpf: use cgroup_lock()/cgroup_unlock() wrappers
|
|
Some socket options do getsockopt with optval=NULL to estimate the size
of the final buffer (which is returned via optlen). This breaks BPF
getsockopt assumptions about permitted optval buffer size. Let's enforce
these assumptions only when non-NULL optval is provided.
Fixes: 0d01da6afc54 ("bpf: implement getsockopt and setsockopt hooks")
Reported-by: Martin KaFai Lau <martin.lau@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/ZD7Js4fj5YyI2oLd@google.com/T/#mb68daf700f87a9244a15d01d00c3f0e5b08f49f7
Link: https://lore.kernel.org/bpf/20230418225343.553806-2-sdf@google.com
|
|
Replace mutex_[un]lock() with cgroup_[un]lock() wrappers to stay
consistent across cgroup core and other subsystem code, while
operating on the cgroup_mutex.
Signed-off-by: Kamalesh Babulal <kamalesh.babulal@oracle.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Christian Brauner <brauner@kernel.org>
Signed-off-by: Tejun Heo <tj@kernel.org>
|
|
Lift verifier restriction to use BPF_ST_MEM instructions to write to
context data structures. This requires the following changes:
- verifier.c:do_check() for BPF_ST updated to:
- no longer forbid writes to registers of type PTR_TO_CTX;
- track dst_reg type in the env->insn_aux_data[...].ptr_type field
(same way it is done for BPF_STX and BPF_LDX instructions).
- verifier.c:convert_ctx_access() and various callbacks invoked by
it are updated to handled BPF_ST instruction alongside BPF_STX.
Signed-off-by: Eduard Zingerman <eddyz87@gmail.com>
Link: https://lore.kernel.org/r/20230304011247.566040-2-eddyz87@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
These helpers are safe to call from any context and there's no reason to
restrict access to them. Remove them from bpf_trace and filter lists and add
to bpf_base_func_proto() under perfmon_capable().
v2: After consulting with Andrii, relocated in bpf_base_func_proto() so that
they require bpf_capable() but not perfomon_capable() as it doesn't read
from or affect others on the system.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/r/ZAD8QyoszMZiTzBY@slm.duckdns.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Merge in the left-over fixes before the net-next pull-request.
Conflicts:
drivers/net/ethernet/mediatek/mtk_ppe.c
ae3ed15da588 ("net: ethernet: mtk_eth_soc: fix state in __mtk_foe_entry_clear")
9d8cb4c096ab ("net: ethernet: mtk_eth_soc: add foe_entry_size to mtk_eth_soc")
https://lore.kernel.org/all/6cb6893b-4921-a068-4c30-1109795110bb@tessares.net/
kernel/bpf/helpers.c
8addbfc7b308 ("bpf: Gate dynptr API behind CAP_BPF")
5679ff2f138f ("bpf: Move bpf_loop and bpf_for_each_map_elem under CAP_BPF")
8a67f2de9b1d ("bpf: expose bpf_strtol and bpf_strtoul to all program types")
https://lore.kernel.org/all/20221003201957.13149-1-daniel@iogearbox.net/
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
Attach flags is only valid for attached progs of this layer cgroup,
but not for effective progs. For querying with EFFECTIVE flags,
exporting attach flags does not make sense. So when effective query,
we reject prog_attach_flags array and don't need to populate it.
Also we limit attach_flags to output 0 during effective query.
Fixes: b79c9fc9551b ("bpf: implement BPF_PROG_QUERY for BPF_LSM_CGROUP")
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Link: https://lore.kernel.org/r/20220921104604.2340580-2-pulehui@huaweicloud.com
Signed-off-by: Martin KaFai Lau <martin.lau@kernel.org>
|
|
Daniel Borkmann says:
====================
pull-request: bpf-next 2022-09-05
The following pull-request contains BPF updates for your *net-next* tree.
We've added 106 non-merge commits during the last 18 day(s) which contain
a total of 159 files changed, 5225 insertions(+), 1358 deletions(-).
There are two small merge conflicts, resolve them as follows:
1) tools/testing/selftests/bpf/DENYLIST.s390x
Commit 27e23836ce22 ("selftests/bpf: Add lru_bug to s390x deny list") in
bpf tree was needed to get BPF CI green on s390x, but it conflicted with
newly added tests on bpf-next. Resolve by adding both hunks, result:
[...]
lru_bug # prog 'printk': failed to auto-attach: -524
setget_sockopt # attach unexpected error: -524 (trampoline)
cb_refs # expected error message unexpected error: -524 (trampoline)
cgroup_hierarchical_stats # JIT does not support calling kernel function (kfunc)
htab_update # failed to attach: ERROR: strerror_r(-524)=22 (trampoline)
[...]
2) net/core/filter.c
Commit 1227c1771dd2 ("net: Fix data-races around sysctl_[rw]mem_(max|default).")
from net tree conflicts with commit 29003875bd5b ("bpf: Change bpf_setsockopt(SOL_SOCKET)
to reuse sk_setsockopt()") from bpf-next tree. Take the code as it is from
bpf-next tree, result:
[...]
if (getopt) {
if (optname == SO_BINDTODEVICE)
return -EINVAL;
return sk_getsockopt(sk, SOL_SOCKET, optname,
KERNEL_SOCKPTR(optval),
KERNEL_SOCKPTR(optlen));
}
return sk_setsockopt(sk, SOL_SOCKET, optname,
KERNEL_SOCKPTR(optval), *optlen);
[...]
The main changes are:
1) Add any-context BPF specific memory allocator which is useful in particular for BPF
tracing with bonus of performance equal to full prealloc, from Alexei Starovoitov.
2) Big batch to remove duplicated code from bpf_{get,set}sockopt() helpers as an effort
to reuse the existing core socket code as much as possible, from Martin KaFai Lau.
3) Extend BPF flow dissector for BPF programs to just augment the in-kernel dissector
with custom logic. In other words, allow for partial replacement, from Shmulik Ladkani.
4) Add a new cgroup iterator to BPF with different traversal options, from Hao Luo.
5) Support for BPF to collect hierarchical cgroup statistics efficiently through BPF
integration with the rstat framework, from Yosry Ahmed.
6) Support bpf_{g,s}et_retval() under more BPF cgroup hooks, from Stanislav Fomichev.
7) BPF hash table and local storages fixes under fully preemptible kernel, from Hou Tao.
8) Add various improvements to BPF selftests and libbpf for compilation with gcc BPF
backend, from James Hilliard.
9) Fix verifier helper permissions and reference state management for synchronous
callbacks, from Kumar Kartikeya Dwivedi.
10) Add support for BPF selftest's xskxceiver to also be used against real devices that
support MAC loopback, from Maciej Fijalkowski.
11) Various fixes to the bpf-helpers(7) man page generation script, from Quentin Monnet.
12) Document BPF verifier's tnum_in(tnum_range(), ...) gotchas, from Shung-Hsi Yu.
13) Various minor misc improvements all over the place.
* https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (106 commits)
bpf: Optimize rcu_barrier usage between hash map and bpf_mem_alloc.
bpf: Remove usage of kmem_cache from bpf_mem_cache.
bpf: Remove prealloc-only restriction for sleepable bpf programs.
bpf: Prepare bpf_mem_alloc to be used by sleepable bpf programs.
bpf: Remove tracing program restriction on map types
bpf: Convert percpu hash map to per-cpu bpf_mem_alloc.
bpf: Add percpu allocation support to bpf_mem_alloc.
bpf: Batch call_rcu callbacks instead of SLAB_TYPESAFE_BY_RCU.
bpf: Adjust low/high watermarks in bpf_mem_cache
bpf: Optimize call_rcu in non-preallocated hash map.
bpf: Optimize element count in non-preallocated hash map.
bpf: Relax the requirement to use preallocated hash maps in tracing progs.
samples/bpf: Reduce syscall overhead in map_perf_test.
selftests/bpf: Improve test coverage of test_maps
bpf: Convert hash map to bpf_mem_alloc.
bpf: Introduce any context BPF specific memory allocator.
selftest/bpf: Add test for bpf_getsockopt()
bpf: Change bpf_getsockopt(SOL_IPV6) to reuse do_ipv6_getsockopt()
bpf: Change bpf_getsockopt(SOL_IP) to reuse do_ip_getsockopt()
bpf: Change bpf_getsockopt(SOL_TCP) to reuse do_tcp_getsockopt()
...
====================
Link: https://lore.kernel.org/r/20220905161136.9150-1-daniel@iogearbox.net
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
bpf_strncmp is already exposed everywhere. The motivation is to keep
those helpers in kernel/bpf/helpers.c. Otherwise it's tempting to move
them under kernel/bpf/cgroup.c because they are currently only used
by sysctl prog types.
Suggested-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220823222555.523590-4-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
The following hooks are per-cgroup hooks but they are not
using cgroup_{common,current}_func_proto, fix it:
* BPF_PROG_TYPE_CGROUP_SKB (cg_skb)
* BPF_PROG_TYPE_CGROUP_SOCK_ADDR (cg_sock_addr)
* BPF_PROG_TYPE_CGROUP_SOCK (cg_sock)
* BPF_PROG_TYPE_LSM+BPF_LSM_CGROUP
Also:
* move common func_proto's into cgroup func_proto handlers
* make sure bpf_{g,s}et_retval are not accessible from recvmsg,
getpeername and getsockname (return/errno is ignored in these
places)
* as a side effect, expose get_current_pid_tgid, get_current_comm_proto,
get_current_ancestor_cgroup_id, get_cgroup_classid to more cgroup
hooks
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220823222555.523590-3-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Split cgroup_base_func_proto into the following:
* cgroup_common_func_proto - common helpers for all cgroup hooks
* cgroup_current_func_proto - common helpers for all cgroup hooks
running in the process context (== have meaningful 'current').
Move bpf_{g,s}et_retval and other cgroup-related helpers into
kernel/bpf/cgroup.c so they closer to where they are being used.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/r/20220823222555.523590-2-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Syzkaller reported a triggered kernel BUG as follows:
------------[ cut here ]------------
kernel BUG at kernel/bpf/cgroup.c:925!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 1 PID: 194 Comm: detach Not tainted 5.19.0-14184-g69dac8e431af #8
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:__cgroup_bpf_detach+0x1f2/0x2a0
Code: 00 e8 92 60 30 00 84 c0 75 d8 4c 89 e0 31 f6 85 f6 74 19 42 f6 84
28 48 05 00 00 02 75 0e 48 8b 80 c0 00 00 00 48 85 c0 75 e5 <0f> 0b 48
8b 0c5
RSP: 0018:ffffc9000055bdb0 EFLAGS: 00000246
RAX: 0000000000000000 RBX: ffff888100ec0800 RCX: ffffc900000f1000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff888100ec4578
RBP: 0000000000000000 R08: ffff888100ec0800 R09: 0000000000000040
R10: 0000000000000000 R11: 0000000000000000 R12: ffff888100ec4000
R13: 000000000000000d R14: ffffc90000199000 R15: ffff888100effb00
FS: 00007f68213d2b80(0000) GS:ffff88813bc80000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055f74a0e5850 CR3: 0000000102836000 CR4: 00000000000006e0
Call Trace:
<TASK>
cgroup_bpf_prog_detach+0xcc/0x100
__sys_bpf+0x2273/0x2a00
__x64_sys_bpf+0x17/0x20
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7f68214dbcb9
Code: 08 44 89 e0 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 48 89 f8 48 89
f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
f0 ff8
RSP: 002b:00007ffeb487db68 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
RAX: ffffffffffffffda RBX: 000000000000000b RCX: 00007f68214dbcb9
RDX: 0000000000000090 RSI: 00007ffeb487db70 RDI: 0000000000000009
RBP: 0000000000000003 R08: 0000000000000012 R09: 0000000b00000003
R10: 00007ffeb487db70 R11: 0000000000000246 R12: 00007ffeb487dc20
R13: 0000000000000004 R14: 0000000000000001 R15: 000055f74a1011b0
</TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
Repetition steps:
For the following cgroup tree,
root
|
cg1
|
cg2
1. attach prog2 to cg2, and then attach prog1 to cg1, both bpf progs
attach type is NONE or OVERRIDE.
2. write 1 to /proc/thread-self/fail-nth for failslab.
3. detach prog1 for cg1, and then kernel BUG occur.
Failslab injection will cause kmalloc fail and fall back to
purge_effective_progs. The problem is that cg2 have attached another prog,
so when go through cg2 layer, iteration will add pos to 1, and subsequent
operations will be skipped by the following condition, and cg will meet
NULL in the end.
`if (pos && !(cg->bpf.flags[atype] & BPF_F_ALLOW_MULTI))`
The NULL cg means no link or prog match, this is as expected, and it's not
a bug. So here just skip the no match situation.
Fixes: 4c46091ee985 ("bpf: Fix KASAN use-after-free Read in compute_effective_progs")
Signed-off-by: Pu Lehui <pulehui@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andrii@kernel.org>
Link: https://lore.kernel.org/bpf/20220813134030.1972696-1-pulehui@huawei.com
|
|
We have two options:
1. Treat all BPF_LSM_CGROUP the same, regardless of attach_btf_id
2. Treat BPF_LSM_CGROUP+attach_btf_id as a separate hook point
I was doing (2) in the original patch, but switching to (1) here:
* bpf_prog_query returns all attached BPF_LSM_CGROUP programs
regardless of attach_btf_id
* attach_btf_id is exported via bpf_prog_info
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-6-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Previous patch adds 1:1 mapping between all 211 LSM hooks
and bpf_cgroup program array. Instead of reserving a slot per
possible hook, reserve 10 slots per cgroup for lsm programs.
Those slots are dynamically allocated on demand and reclaimed.
struct cgroup_bpf {
struct bpf_prog_array * effective[33]; /* 0 264 */
/* --- cacheline 4 boundary (256 bytes) was 8 bytes ago --- */
struct hlist_head progs[33]; /* 264 264 */
/* --- cacheline 8 boundary (512 bytes) was 16 bytes ago --- */
u8 flags[33]; /* 528 33 */
/* XXX 7 bytes hole, try to pack */
struct list_head storages; /* 568 16 */
/* --- cacheline 9 boundary (576 bytes) was 8 bytes ago --- */
struct bpf_prog_array * inactive; /* 584 8 */
struct percpu_ref refcnt; /* 592 16 */
struct work_struct release_work; /* 608 72 */
/* size: 680, cachelines: 11, members: 7 */
/* sum members: 673, holes: 1, sum holes: 7 */
/* last cacheline: 40 bytes */
};
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-5-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Allow attaching to lsm hooks in the cgroup context.
Attaching to per-cgroup LSM works exactly like attaching
to other per-cgroup hooks. New BPF_LSM_CGROUP is added
to trigger new mode; the actual lsm hook we attach to is
signaled via existing attach_btf_id.
For the hooks that have 'struct socket' or 'struct sock' as its first
argument, we use the cgroup associated with that socket. For the rest,
we use 'current' cgroup (this is all on default hierarchy == v2 only).
Note that for some hooks that work on 'struct sock' we still
take the cgroup from 'current' because some of them work on the socket
that hasn't been properly initialized yet.
Behind the scenes, we allocate a shim program that is attached
to the trampoline and runs cgroup effective BPF programs array.
This shim has some rudimentary ref counting and can be shared
between several programs attaching to the same lsm hook from
different cgroups.
Note that this patch bloats cgroup size because we add 211
cgroup_bpf_attach_type(s) for simplicity sake. This will be
addressed in the subsequent patch.
Also note that we only add non-sleepable flavor for now. To enable
sleepable use-cases, bpf_prog_run_array_cg has to grab trace rcu,
shim programs have to be freed via trace rcu, cgroup_bpf.effective
should be also trace-rcu-managed + maybe some other changes that
I'm not aware of.
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-4-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
This lets us reclaim some space to be used by new cgroup lsm slots.
Before:
struct cgroup_bpf {
struct bpf_prog_array * effective[23]; /* 0 184 */
/* --- cacheline 2 boundary (128 bytes) was 56 bytes ago --- */
struct list_head progs[23]; /* 184 368 */
/* --- cacheline 8 boundary (512 bytes) was 40 bytes ago --- */
u32 flags[23]; /* 552 92 */
/* XXX 4 bytes hole, try to pack */
/* --- cacheline 10 boundary (640 bytes) was 8 bytes ago --- */
struct list_head storages; /* 648 16 */
struct bpf_prog_array * inactive; /* 664 8 */
struct percpu_ref refcnt; /* 672 16 */
struct work_struct release_work; /* 688 32 */
/* size: 720, cachelines: 12, members: 7 */
/* sum members: 716, holes: 1, sum holes: 4 */
/* last cacheline: 16 bytes */
};
After:
struct cgroup_bpf {
struct bpf_prog_array * effective[23]; /* 0 184 */
/* --- cacheline 2 boundary (128 bytes) was 56 bytes ago --- */
struct hlist_head progs[23]; /* 184 184 */
/* --- cacheline 5 boundary (320 bytes) was 48 bytes ago --- */
u8 flags[23]; /* 368 23 */
/* XXX 1 byte hole, try to pack */
/* --- cacheline 6 boundary (384 bytes) was 8 bytes ago --- */
struct list_head storages; /* 392 16 */
struct bpf_prog_array * inactive; /* 408 8 */
struct percpu_ref refcnt; /* 416 16 */
struct work_struct release_work; /* 432 72 */
/* size: 504, cachelines: 8, members: 7 */
/* sum members: 503, holes: 1, sum holes: 1 */
/* last cacheline: 56 bytes */
};
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/r/20220628174314.1216643-3-sdf@google.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
|
|
Syzbot found a Use After Free bug in compute_effective_progs().
The reproducer creates a number of BPF links, and causes a fault
injected alloc to fail, while calling bpf_link_detach on them.
Link detach triggers the link to be freed by bpf_link_free(),
which calls __cgroup_bpf_detach() and update_effective_progs().
If the memory allocation in this function fails, the function restores
the pointer to the bpf_cgroup_link on the cgroup list, but the memory
gets freed just after it returns. After this, every subsequent call to
update_effective_progs() causes this already deallocated pointer to be
dereferenced in prog_list_length(), and triggers KASAN UAF error.
To fix this issue don't preserve the pointer to the prog or link in the
list, but remove it and replace it with a dummy prog without shrinking
the table. The subsequent call to __cgroup_bpf_detach() or
__cgroup_bpf_detach() will correct it.
Fixes: af6eea57437a ("bpf: Implement bpf_link-based cgroup BPF program attachment")
Reported-by: <syzbot+f264bffdfbd5614f3bb2@syzkaller.appspotmail.com>
Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Signed-off-by: Andrii Nakryiko <andrii@kernel.org>
Cc: <stable@vger.kernel.org>
Link: https://syzkaller.appspot.com/bug?id=8ebf179a95c2a2670f7cf1ba62429ec044369db4
Link: https://lore.kernel.org/bpf/20220517180420.87954-1-tadeusz.struk@linaro.org
|
|
Rename bpf_prog_run_array_cg_flags to bpf_prog_run_array_cg and
use it everywhere. check_return_code already enforces sane
return ranges for all cgroup types. (only egress and bind hooks have
uncanonical return ranges, the rest is using [0, 1])
No functional changes.
v2:
- 'func_ret & 1' under explicit test (Andrii & Martin)
Suggested-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20220425220448.3669032-1-sdf@google.com
|
|
Commit 7d08c2c91171 ("bpf: Refactor BPF_PROG_RUN_ARRAY family of macros
into functions") switched a bunch of BPF_PROG_RUN macros to inline
routines. This changed the semantic a bit. Due to arguments expansion
of macros, it used to be:
rcu_read_lock();
array = rcu_dereference(cgrp->bpf.effective[atype]);
...
Now, with with inline routines, we have:
array_rcu = rcu_dereference(cgrp->bpf.effective[atype]);
/* array_rcu can be kfree'd here */
rcu_read_lock();
array = rcu_dereference(array_rcu);
I'm assuming in practice rcu subsystem isn't fast enough to trigger
this but let's use rcu API properly.
Also, rename to lower caps to not confuse with macros. Additionally,
drop and expand BPF_PROG_CGROUP_INET_EGRESS_RUN_ARRAY.
See [1] for more context.
[1] https://lore.kernel.org/bpf/CAKH8qBs60fOinFdxiiQikK_q0EcVxGvNTQoWvHLEUGbgcj1UYg@mail.gmail.com/T/#u
v2
- keep rcu locks inside by passing cgroup_bpf
Fixes: 7d08c2c91171 ("bpf: Refactor BPF_PROG_RUN_ARRAY family of macros into functions")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore. |