path: root/net/core
Age | Commit message | Author | Files | Lines
6 days | Merge tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf | Linus Torvalds | 1 | -2/+2
Pull bpf fixes from Alexei Starovoitov:
 - Fix effective prog array index with BPF_F_PREORDER (Amery Hung)
 - Zero-initialize the fib lookup flow struct (Avinash Duduskar)
 - Disable xfrm_decode_session hook attachment (Bradley Morgan)
 - Allow type tag BTF records to succeed other modifier records (Emil Tsalapatis)
 - Fix build_id caching in stack_map_get_build_id_offset() (Ihor Solodrai)
 - Add missing access_ok call to copy_user_syms (Jiri Olsa)
 - Fix stack slot index in nospec checks (Nuoqi Gui)
 - Preserve pointer spill metadata during half-slot cleanup (Nuoqi Gui)
 - Fix partial copy of non-linear test_run output (Sun Jian)
 - Fix BPF_PROG_ASSOC_STRUCT_OPS last field check (Thiébaud Weksteen)
 - Reset register bounds before narrowing retval range (Tristan Madani)
 - Fix vmlinux BTF leak in bpftool cgroup commands (Yichong Chen)
 - Guard error writes in conntrack kfuncs (Yiyang Chen)

* tag 'bpf-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf:
  bpf: Disable xfrm_decode_session hook attachment
  selftests/bpf: Add test for stale bounds on LSM retval context load
  bpf: Reset register bounds before narrowing retval range in check_mem_access()
  selftests/bpf: Cover small conntrack opts error writes
  bpf: Guard conntrack opts error writes
  selftests/bpf: Cover half-slot cleanup of pointer spills
  bpf: Preserve pointer spill metadata during half-slot cleanup
  selftests/bpf: Test cgroup link replace with BPF_F_PREORDER
  bpf: Fix effective prog array index with BPF_F_PREORDER
  bpf: Fix BPF_PROG_ASSOC_STRUCT_OPS last field check
  bpf: zero-initialize the fib lookup flow struct
  bpftool: Fix vmlinux BTF leak in cgroup commands
  bpf: Add missing access_ok call to copy_user_syms
  bpf: Allow type tag BTF records to succeed other modifier records
  bpf: Emit verbose message when prog-specific btf_struct_access rejects a write
  bpf: Fix build_id caching in stack_map_get_build_id_offset()
  bpf: Fix partial copy of non-linear test_run output
  selftests/bpf: Cover stack nospec slot indexing
  bpf: Fix stack slot index in nospec checks
6 days | Merge tag 'net-7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net | Linus Torvalds | 10 | -80/+206
Pull networking fixes from Jakub Kicinski:
 "Including fixes from netfilter and IPsec.

  Current release - regressions:
   - do not acquire dev->tx_global_lock in netdev_watchdog_up()
   - ethtool: keep rtnl_lock for ops using ethtool_op_get_link()
   - fix deadlock in nested UP notifier events

  Current release - new code bugs:
   - eth:
      - cn20k: fix subbank free list indexing for search order
      - airoha: fix BQL underflow in shared QDMA TX ring

  Previous releases - regressions:
   - netfilter:
      - flowtable: fix offloaded ct timeout never being extended
      - nf_conncount: prevent connlimit drops for early confirmed ct

  Previous releases - always broken:
   - require CAP_NET_ADMIN in the originating netns when modifying cross-netns devices
   - report NAPI thread PID in the caller's pid namespace
   - mac802154: fix dirty frag in in-place crypto for IOT radios
   - sctp: hold socket lock when dumping endpoints in sctp_diag, avoid an overflow
   - eth: gve: fix header buffer corruption with header-split and HW-GRO
   - af_key: initialize alg_key_len for IPComp states, prevent OOB read"

* tag 'net-7.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (213 commits)
  selftests: bonding: add a test for VLAN propagation over a bonded real device
  vlan: defer real device state propagation to netdev_work
  net: add the driver-facing netdev_work scheduling API
  net: turn the rx_mode work into a generic netdev_work facility
  net: ethtool: keep rtnl_lock for ops using ethtool_op_get_link()
  rxrpc: Fix rxrpc_rotate_tx_rotate() to check there's something to rotate
  rxrpc: Fix leak of released call in recvmsg(MSG_PEEK)
  rxrpc: Fix socket notification race
  rxrpc: Fix potential infinite loop in rxrpc_recvmsg()
  rxrpc: Fix oob challenge leak in cleanup after notification failure
  rxrpc: Fix the reception of a reply packet before data transmission
  afs: Fix uncancelled rxrpc OOB message handler
  afs: Fix further netns teardown to cancel the preallocation charger
  rxrpc: Fix double unlock in rxrpc_recvmsg()
  rxrpc: Fix leak of connection from OOB challenge
  rxrpc: Fix ACKALL packet handling
  net: hns3: differentiate autoneg default values between copper and fiber
  net: hns3: fix permanent link down deadlock after reset
  net: hns3: refactor MAC autoneg and speed configuration
  net: hns3: unify copper port ksettings configuration path
  ...
6 days | vlan: defer real device state propagation to netdev_work | Jakub Kicinski | 1 | -0/+1
vlan_device_event() generates nested UP/DOWN, MTU and feature change events. It executes an event for the VLAN device directly from the notifier - while the locks of the lower device are held. This causes deadlocks, for example:

  bond                   (3) bond_update_speed_duplex(vlan)
   |      ^
   v
  vlan   (2) UP(vlan)    (4) vlan_ethtool_get_link_ksettings()
   |      ^
   v
  dummy  (1) UP(dummy)   (5) __ethtool_get_link_ksettings()

The dummy device is ops locked, vlan creates a nested event (2), then bond wants to ask vlan for link state (3). bond uses the "I'm already holding the instance lock" flavor of API. But in this case the lock held refers to vlan itself. We hit vlan's link settings trampoline (4) and call __ethtool_get_link_ksettings() which tries to lock dummy. Deadlock.

There's no clean way for us to tell the vlan_ethtool_get_link_ksettings() that the caller is already in lower device's critical section. Defer the propagation to the per-netdev work facility instead: the notifier only schedules netdev_work_sched(vlandev, VLAN_WORK_*), and ndo_work (vlan_dev_work) applies the change later.

Hopefully nobody expects the VLAN state changes to be instantaneous. If someone does expect the changes to be instantaneous we will have to do the same thing Stan did for rx_mode and "strategically" place sync calls, to make sure such delayed works are executed after we drop the ops lock but before we drop rtnl_lock. Stan suggests that if we need that down the line we may consider reshaping the mechanism into "async notifications". AFAICT only vlan does this sort of netdev open chaining, so as a first try I think that sticking the complexity into the vlan code makes sense.

One corner case is that we need to cancel the event if user explicitly changes the state before work could run. Consider the following operations with vlan0 on top of dummy0:

  ip link set dev dummy0 up    # queues work to up vlan0
  ip link set dev vlan0 down   # user explicitly downs the vlan
  ndo_work                     # acts on the stale event

Reported-by: syzbot+09da62a8b78959ceb8bb@syzkaller.appspotmail.com
Reported-by: syzbot+cb67c392b0b8f0fd0fc1@syzkaller.appspotmail.com
Reported-by: syzbot+9bb8bd77f3966641f298@syzkaller.appspotmail.com
Fixes: 9f275c2e9020 ("net: ethtool: make sure __ethtool_get_link_ksettings() is ops-locked")
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Reviewed-by: Nicolai Buchwitz <nb@tipi-net.de>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260624182018.2445732-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
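The stale-event corner case described above can be illustrated with a small user-space model. This is only a sketch under assumptions: the `toy_*` names and the two event bits are hypothetical (the commit names the real bits only as VLAN_WORK_*), and real locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical event bits; the commit only refers to them as VLAN_WORK_*. */
enum { TOY_WORK_UP = 1u << 0, TOY_WORK_DOWN = 1u << 1 };

struct toy_vlan {
    bool up;               /* current administrative state */
    unsigned int pending;  /* events queued by the notifier */
};

/* Notifier side: only record the event, never change state directly. */
void toy_notifier_lower_up(struct toy_vlan *v)
{
    v->pending |= TOY_WORK_UP;
}

/* Explicit user request: apply it and cancel any queued propagation. */
void toy_user_set_state(struct toy_vlan *v, bool up)
{
    v->up = up;
    v->pending = 0;
}

/* Deferred work: runs later, outside the lower device's critical section. */
void toy_ndo_work(struct toy_vlan *v)
{
    unsigned int events = v->pending;

    v->pending = 0;
    if (events & TOY_WORK_UP)
        v->up = true;
    if (events & TOY_WORK_DOWN)
        v->up = false;
}
```

With the cancel in toy_user_set_state(), the sequence "lower device up, user downs the vlan, work runs" leaves the vlan down; without it the queued UP event would override the user's request.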
6 days | net: add the driver-facing netdev_work scheduling API | Jakub Kicinski | 1 | -18/+63
With an extra event mask we can easily extend the netdev work to also service driver-defined events. For advanced drivers this is probably not a perfect match, but it makes running deferred work easier in simple cases.

Expose the netdev_work facility to drivers. Add helpers to schedule work and a dedicated ndo to perform the driver-scheduled actions.

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260624182018.2445732-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
6 days | net: turn the rx_mode work into a generic netdev_work facility | Jakub Kicinski | 5 | -76/+132
The rx_mode update runs from a workqueue: drivers have their ndo_set_rx_mode_async() callback executed by a single global work item under RTNL and ops lock. This is a useful pattern. Support multiple "events" that need to be serviced and make RX_MODE sync the first one. Call the events "core" because later on we will let drivers define and schedule their own.

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Acked-by: Stanislav Fomichev <sdf@fomichev.me>
Link: https://patch.msgid.link/20260624182018.2445732-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
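The "multiple events serviced by one work item" pattern can be sketched as a pending bitmask per device. The following is a toy model, not the kernel's implementation: `toy_work_sched()`, `toy_work_run()` and the second event bit are made-up names, and the RTNL and ops locks are not modelled.

```c
#include <assert.h>

/* Hypothetical core event bits; only RX_MODE is named in the commit. */
enum { TOY_EV_RX_MODE = 1u << 0, TOY_EV_OTHER = 1u << 1 };

struct toy_netdev {
    unsigned int pending;  /* events waiting for the work item */
    int rx_mode_syncs;     /* how often the RX_MODE handler ran */
    int other_runs;        /* how often the other handler ran */
};

/* Returns 1 if the caller must kick the work item, 0 if one is already due. */
int toy_work_sched(struct toy_netdev *dev, unsigned int events)
{
    int need_kick = dev->pending == 0;

    dev->pending |= events;
    return need_kick;
}

/* Work item body: snapshot and clear the mask, then service each event once. */
void toy_work_run(struct toy_netdev *dev)
{
    unsigned int events = dev->pending;

    dev->pending = 0;
    if (events & TOY_EV_RX_MODE)
        dev->rx_mode_syncs++;
    if (events & TOY_EV_OTHER)
        dev->other_runs++;
}
```

Scheduling the same event twice before the work runs coalesces into a single handler invocation, which is the property that makes a bitmask a natural fit here.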
8 days | net: lwtunnel: Drop skb metadata before LWT encapsulation | Jakub Sitnicki | 1 | -0/+6
skb metadata is meant for passing information between XDP and TC. It lives in the skb headroom, immediately before skb->data. LWT programs cannot access the __sk_buff->data_meta pseudo-pointer to metadata.

However, LWT encapsulation prepends outer headers, moving skb->data back over the headroom where the metadata sits. On an RX-originated (forwarded) packet that still carries XDP metadata this goes wrong in two different ways, depending on the encap type:

1. Non-BPF LWT encaps (mpls, seg6, ioam6 ...) call skb_push()/skb_pull() and silently overwrite the metadata that sits in the headroom.

2. BPF LWT xmit calls bpf_skb_change_head(), which uses skb_data_move(). That helper expects metadata immediately before skb->data. But since the IP output path runs LWT xmit before neighbour output has built the outgoing L2 header, for forwarded packets skb->data points at the L3 header while skb_mac_header() still points at the old L2 header. skb_data_move() sees metadata ending at skb_mac_header(), not before skb->data, warns and clears metadata:

  WARNING: CPU: 21 PID: 454557 at include/linux/skbuff.h:4609 skb_data_move+0x47/0x90
  CPU: 21 UID: 0 PID: 454557 Comm: napi/iconduit-g Tainted: G O 6.18.21 #1
  RIP: 0010:skb_data_move+0x47/0x90
  Call Trace:
   <IRQ>
   bpf_skb_change_head+0xe6/0x1a0
   bpf_prog_...+0x213/0x2e3
   run_lwt_bpf.isra.0+0x1d3/0x360
   bpf_xmit+0x46/0xe0
   lwtunnel_xmit+0xa1/0xf0
   ip_finish_output2+0x1e7/0x5e0
   ip_output+0x63/0x100
   __netif_receive_skb_one_core+0x85/0xa0
   process_backlog+0x9c/0x150
   __napi_poll+0x2b/0x190
   net_rx_action+0x40b/0x7f0
   handle_softirqs+0xd2/0x270
   do_softirq+0x3f/0x60
   </IRQ>

That is what happens. As for how to fix it: a received packet that carries metadata can reach an encap through any of the three LWT redirect modes:

  LWTUNNEL_STATE_INPUT_REDIRECT
    ip6_rcv_finish
    dst_input
    lwtunnel_input

  LWTUNNEL_STATE_OUTPUT_REDIRECT
    ip6_rcv_finish
    dst_input
    ip6_forward
    ip6_forward_finish
    dst_output
    lwtunnel_output

  LWTUNNEL_STATE_XMIT_REDIRECT
    ip6_rcv_finish
    dst_input
    ip6_forward
    ip6_forward_finish
    dst_output
    ip6_output
    ip6_finish_output
    ip6_finish_output2
    lwtunnel_xmit

Every encap funnels through the three LWT dispatch helpers, so drop the metadata there, right before handing the skb to the encap op. This single chokepoint covers all encap types and all three redirect modes:

 - lwtunnel_input(): seg6, rpl, ila, seg6_local
 - lwtunnel_output(): ioam6
 - lwtunnel_xmit(): mpls, LWT BPF xmit

Alternatively, we could clear the metadata right after TC ingress hook. That would require a compromise, however. Metadata would become inaccessible from TC egress (in setups where it actually reaches the hook intact, that is without any L2 tunnels on path).

Fixes: 8989d328dfe7 ("net: Helper to move packet data and metadata after skb_push/pull")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://patch.msgid.link/20260619-bpf-lwt-drop-skb-metadata-v3-1-71d6a33ab76b@cloudflare.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
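To make the headroom overlap concrete, here is a toy buffer model of the first failure mode and of the "drop metadata at the dispatch helper" fix. It is a sketch only: `struct toy_skb` and the `toy_*` helpers are hypothetical stand-ins and do not reproduce the real sk_buff layout or the actual patch.

```c
#include <assert.h>
#include <string.h>

struct toy_skb {
    unsigned char buf[64];
    unsigned int data;      /* offset of packet data inside buf */
    unsigned int meta_len;  /* metadata occupies buf[data - meta_len .. data) */
};

/* Prepend an outer header of len bytes, as an encap would via skb_push(). */
void toy_push_header(struct toy_skb *skb, const void *hdr, unsigned int len)
{
    skb->data -= len;
    memcpy(&skb->buf[skb->data], hdr, len);
}

/* Stand-in for dropping the metadata: nothing claims the headroom anymore. */
void toy_metadata_clear(struct toy_skb *skb)
{
    skb->meta_len = 0;
}

/* Dispatch helper: drop metadata first, then hand the skb to the encap. */
void toy_lwt_xmit(struct toy_skb *skb, const void *hdr, unsigned int len)
{
    toy_metadata_clear(skb);
    toy_push_header(skb, hdr, len);
}
```

If toy_push_header() were called directly, meta_len would still claim four bytes that now hold outer-header data; clearing it at the single dispatch point avoids that stale claim for every encap type.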
8 days | net, bpf: check master for NULL in xdp_master_redirect() | Xiang Mei | 1 | -1/+1
xdp_master_redirect() dereferences the result of netdev_master_upper_dev_get_rcu() without a NULL check, but that helper returns NULL when the receiving device has no upper-master adjacency. The reach guard only checks netif_is_bond_slave(). On bond slave release bond_upper_dev_unlink() drops the upper-master adjacency before clearing IFF_SLAVE, so an XDP_TX reaching xdp_master_redirect() in that window still passes netif_is_bond_slave() while master is already NULL, and faults on master->flags at offset 0xb0:

  BUG: kernel NULL pointer dereference, address: 00000000000000b0
  RIP: 0010:xdp_master_redirect (net/core/filter.c:4432)
  Call Trace:
   xdp_master_redirect (net/core/filter.c:4432)
   bpf_prog_run_generic_xdp (include/net/xdp.h:700)
   do_xdp_generic (net/core/dev.c:5608)
   __netif_receive_skb_one_core (net/core/dev.c:6204)
   process_backlog (net/core/dev.c:6319)
   __napi_poll (net/core/dev.c:7729)
   net_rx_action (net/core/dev.c:7792)
   handle_softirqs (kernel/softirq.c:622)
   __dev_queue_xmit (include/linux/bottom_half.h:33)
   packet_sendmsg (net/packet/af_packet.c:3082)
   __sys_sendto (net/socket.c:2252)
  Kernel panic - not syncing: Fatal exception in interrupt

The missing check dates back to the original code; commit 1921f91298d1 ("net, bpf: fix null-ptr-deref in xdp_master_redirect() for down master") later added the master->flags read where the fault now lands but kept the unconditional deref.

Check master for NULL before use; a NULL master is treated the same as one that is not up.

Fixes: 879af96ffd72 ("net, core: Add support for XDP redirection to slave device")
Reported-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Xiang Mei <xmei5@asu.edu>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://patch.msgid.link/20260620201531.180123-1-xmei5@asu.edu
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
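The shape of the fix, "a NULL master is treated the same as one that is not up", can be shown in isolation. The struct and function below are hypothetical stand-ins for illustration, not the code in net/core/filter.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_IFF_UP 0x1u

struct toy_dev {
    unsigned int flags;
};

/* A NULL master is handled exactly like a master that is not up. */
bool toy_master_usable(const struct toy_dev *master)
{
    return master != NULL && (master->flags & TOY_IFF_UP) != 0;
}
```

Because the NULL test short-circuits, the flags read can never run on a master that was already unlinked.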
10 days | bpf: zero-initialize the fib lookup flow struct | Avinash Duduskar | 1 | -2/+2
bpf_ipv4_fib_lookup() and bpf_ipv6_fib_lookup() build the flow key on the stack with a bare "struct flowi4 fl4;" / "struct flowi6 fl6;" and fill it field by field, but never set flowi4_l3mdev / flowi6_l3mdev.

On the non-DIRECT path the lookup goes through the fib rules whenever the netns has custom rules, which a VRF installs:

  bpf_ipv4_fib_lookup()
  -> fib_lookup()
  -> __fib_lookup()
  -> l3mdev_update_flow()      reads !fl->flowi_l3mdev
  -> fib_rules_lookup()
  -> fib_rule_match()
  -> l3mdev_fib_rule_match()   uses fl->flowi_l3mdev

l3mdev_update_flow() resolves the l3mdev master from the ingress device only while the field is still zero. Left at a nonzero stack value the resolution is skipped, and l3mdev_fib_rule_match() then tests that value as an ifindex, so the VRF master is not resolved and the rule fails to match: an ingress enslaved to a VRF can fail to select its table. FIB rules matching on an L3 master device (l3mdev_fib_rule_iif_match()/_oif_match()) read the same value, so an "ip rule iif/oif <vrf>" mismatches the same way.

Zero-initialize the whole flow struct rather than adding one more field assignment, so any flowi field added later is covered too. ip_route_input_slow() likewise zeroes the field before its input lookup.

CONFIG_INIT_STACK_ALL_ZERO masks this by default, but it depends on compiler support (CC_HAS_AUTO_VAR_INIT_ZERO), so INIT_STACK_NONE builds, including older toolchains that fall back to it, are exposed. Built with INIT_STACK_ALL_PATTERN, a plain bpf_fib_lookup (no VLAN, no DIRECT) over a VRF slave whose destination is routed only in the VRF table returns BPF_FIB_LKUP_RET_NOT_FWDED, and resolves with this patch. On the default config the lookup succeeds either way, so ordinary testing does not catch the bug.

Fixes: 40867d74c374 ("net: Add l3mdev index to flow struct and avoid oif reset for port devices")
Signed-off-by: Avinash Duduskar <avinash.duduskar@gmail.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20260617224719.1428599-1-avinash.duduskar@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
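The "resolve only while the field is zero" behaviour is what turns leftover stack bytes into a wrong lookup. The sketch below simulates that with an explicit fill byte, since reading a truly uninitialized variable is undefined behaviour in C. All names (`struct toy_flow`, `toy_*`) and values are hypothetical.

```c
#include <assert.h>
#include <string.h>

struct toy_flow {
    int iif;
    int l3mdev;
    int daddr;
};

/* Resolve the L3 master only while the field is still zero. */
void toy_l3mdev_update_flow(struct toy_flow *fl, int master_of_iif)
{
    if (!fl->l3mdev)
        fl->l3mdev = master_of_iif;
}

/*
 * Build a flow key the way the commit describes: field by field, never
 * touching l3mdev. stack_garbage simulates whatever bytes the stack held.
 * Returns the l3mdev value the rule match would end up seeing.
 */
int toy_lookup_l3mdev(int zero_init, unsigned char stack_garbage)
{
    struct toy_flow fl;

    memset(&fl, stack_garbage, sizeof(fl));  /* simulated stale stack */
    if (zero_init)
        memset(&fl, 0, sizeof(fl));          /* the fix: zero the whole key */
    fl.iif = 3;
    fl.daddr = 42;
    toy_l3mdev_update_flow(&fl, 10);         /* 10: made-up VRF master ifindex */
    return fl.l3mdev;
}
```

Zeroing the whole struct, rather than assigning the one missing field, also covers fields that may be added to the key later, which is the design choice the commit message argues for.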
13 days | flow_dissector: check device type before reading ETH_ADDRS | Yun Zhou | 1 | -2/+10
__skb_flow_dissect() unconditionally reads 12 bytes from eth_hdr(skb) when FLOW_DISSECTOR_KEY_ETH_ADDRS is requested. This assumes the skb has a valid Ethernet header at mac_header, which is not always the case.

The problem can be triggered by:

 1. Creating a TUN device in L3 mode (IFF_TUN, hard_header_len=0)
 2. Attaching a multiq qdisc with a flower filter matching on eth_src
 3. Sending a packet through AF_PACKET

Since TUN in L3 mode has no link-layer header, mac_header points to the L3 data area. The flow dissector reads 12 bytes of uninitialized skb memory, which then propagates through fl_set_masked_key() and is used as a rhashtable lookup key in __fl_lookup(), as reported by KMSAN.

Rejecting the filter in the control path (at tc filter add time) is not feasible because TC filter blocks can be shared between arbitrary devices -- a filter installed on an Ethernet device may later classify packets on a headerless device through a shared block. The device association is not fixed at filter creation time.

Fix this by gating the memcpy on dev->type == ARPHRD_ETHER, which ensures only true Ethernet-framed packets have their addresses read. This is more precise than the previous hard_header_len >= 12 check, which would incorrectly pass for non-Ethernet link types like IPoIB (ARPHRD_INFINIBAND, hard_header_len=24) and FDDI (hard_header_len=21) whose L2 headers are not in Ethernet format.

Additionally check skb_mac_header_was_set() to guard against the pathological case where mac_header is the unset sentinel (~0U), which would cause eth_hdr() to return a wild pointer.

For the act_mirred redirect case (Ethernet packet redirected to a non-Ethernet device sharing a TC block), zeroing the key is the correct behavior: the packet is now being classified on the target device, where Ethernet address matching is not semantically meaningful.

Note: on non-Ethernet devices, the zeroed key will match a filter configured with all-zero MAC addresses. This is an improvement over the previous behavior where uninitialized memory could randomly match any filter.

Reported-by: syzbot+fa2f5b1fb06147be5e16@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=fa2f5b1fb06147be5e16
Fixes: 67a900cc0436 ("flow_dissector: introduce support for Ethernet addresses")
Signed-off-by: Yun Zhou <yun.zhou@windriver.com>
Link: https://patch.msgid.link/20260616123057.482154-1-yun.zhou@windriver.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
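The gating logic, "copy the addresses only for Ethernet devices with a set MAC header, otherwise zero the key", can be sketched as follows. The `toy_*` names are hypothetical; the two device type constants mirror the values of ARPHRD_ETHER and ARPHRD_NONE but are redefined here so the example stays self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_ARPHRD_ETHER 1       /* same value as ARPHRD_ETHER */
#define TOY_ARPHRD_NONE  0xFFFE  /* same value as ARPHRD_NONE (headerless) */

struct toy_pkt {
    int dev_type;           /* link type of the classifying device */
    bool mac_header_set;    /* stand-in for skb_mac_header_was_set() */
    unsigned char mac[12];  /* destination + source address */
};

/* Fill the 12-byte ETH_ADDRS key, or zero it when no Ethernet header exists. */
void toy_dissect_eth_addrs(const struct toy_pkt *p, unsigned char key[12])
{
    if (p->dev_type == TOY_ARPHRD_ETHER && p->mac_header_set)
        memcpy(key, p->mac, 12);
    else
        memset(key, 0, 12);
}
```

The zeroed key is deterministic, so a headerless device can only match a filter that was explicitly configured with all-zero addresses, as the note above points out.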
2026-06-17 | netdev-genl: report NAPI thread PID in the caller's pid namespace | Maoyi Xie | 1 | -1/+3
netdev_nl_napi_fill_one() reports the NAPI kthread PID in NETDEV_A_NAPI_PID using task_pid_nr(), which returns the PID in the initial pid namespace. NETDEV_CMD_NAPI_GET does not have GENL_ADMIN_PERM and the netdev genl family is netnsok, so a caller in a child pid namespace can issue it. That caller then sees the kthread's global PID, even though the kthread is not visible in its pid namespace, where the value should be 0.

Translate the PID through the caller's pid namespace, the same way commit 3799c2570982 ("io_uring/fdinfo: translate SqThread PID through caller's pid_ns") did for the io_uring SQPOLL thread. The doit and dumpit paths both run synchronously in the caller's context, so task_active_pid_ns(current) is the caller's pid namespace.

Fixes: db4704f4e4df ("netdev-genl: Add PID for the NAPI thread")
Cc: stable@vger.kernel.org
Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com>
Reviewed-by: Joe Damato <joe@dama.to>
Reviewed-by: Samiullah Khawaja <skhawaja@google.com>
Link: https://patch.msgid.link/20260615171736.1709318-1-maoyixie.tju@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
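Why the translated value is 0 for a caller in a child pid namespace can be seen in a toy model of per-level PID numbers. This is an illustrative sketch: the structs and `toy_pid_nr_ns()` are hypothetical and only loosely modelled on the idea that a task has one PID number per namespace level it is visible in.

```c
#include <assert.h>

#define TOY_MAX_LEVEL 3

struct toy_pid_ns {
    int level;  /* 0 for the initial namespace */
};

struct toy_upid {
    int nr;                       /* PID number at this level */
    const struct toy_pid_ns *ns;  /* namespace the number belongs to */
};

struct toy_task {
    int level;                               /* deepest namespace level */
    struct toy_upid numbers[TOY_MAX_LEVEL];  /* entries for levels 0..level */
};

/* PID of the task as seen from ns, or 0 if the task is not visible there. */
int toy_pid_nr_ns(const struct toy_task *t, const struct toy_pid_ns *ns)
{
    if (ns->level <= t->level && t->numbers[ns->level].ns == ns)
        return t->numbers[ns->level].nr;
    return 0;
}
```

A kthread lives only in the initial namespace, so a lookup from any child namespace falls through to the 0 return, while a lookup from the initial namespace still yields the global PID.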
2026-06-17 | net: ip_gre: require CAP_NET_ADMIN in the device netns for changelink | Maoyi Xie | 1 | -0/+8
A tunnel changelink() operates on at most two netns, dev_net(dev) and the tunnel link netns t->net. They differ once the device is created in or moved to a netns other than the one the request runs in. The rtnl changelink path checks CAP_NET_ADMIN only against dev_net(dev), so a caller privileged there but not in t->net can rewrite a tunnel that lives in t->net.

Add rtnl_dev_link_net_capable() next to rtnl_get_net_ns_capable() in net/core/rtnetlink.c. It requires CAP_NET_ADMIN in the link netns and is skipped when the link netns is dev_net(dev), where the rtnl path already checked it. The other patches in this series use the same helper.

Gate ipgre_changelink() and erspan_changelink() with it, at the top of the op before any attribute is parsed, because the parsers update live tunnel fields first. ipgre_netlink_parms() sets t->collect_md before ip_tunnel_changelink() runs.

Commit 8b484efd5cb4 ("ip6: vti: Use ip6_tnl.net in vti6_siocdevprivate().") added the same check on the ioctl path. This adds it on RTM_NEWLINK.

Reported-by: Xiao Liang <shaw.leon@gmail.com>
Closes: https://lore.kernel.org/netdev/CABAhCOSzP1vaThGV35_VnsRCb=87_CPjPVsTHbq905k8A+BuUg@mail.gmail.com/
Fixes: b57708add314 ("gre: add x-netns support")
Cc: stable@vger.kernel.org
Signed-off-by: Maoyi Xie <maoyixie.tju@gmail.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260612085941.3158249-2-maoyixie.tju@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
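The two points made above, checking the capability in the link netns and doing so before any parser mutates live state, can be modelled in a few lines. This is a sketch under assumptions: the `toy_*` types and functions are hypothetical, a single boolean stands in for the CAP_NET_ADMIN check, and -1 stands in for a permission error.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_netns {
    bool caller_has_net_admin;  /* stand-in for a CAP_NET_ADMIN check */
};

struct toy_tunnel {
    const struct toy_netns *dev_net;   /* netns the device lives in */
    const struct toy_netns *link_net;  /* netns the tunnel transmits in */
    bool collect_md;                   /* live field a parser would update */
};

/* Assumes the generic path already verified the capability in dev_net. */
bool toy_dev_link_net_capable(const struct toy_tunnel *t)
{
    if (t->link_net == t->dev_net)
        return true;
    return t->link_net->caller_has_net_admin;
}

/* Gate first, mutate second: a rejected request leaves the tunnel untouched. */
int toy_changelink(struct toy_tunnel *t, bool new_collect_md)
{
    if (!toy_dev_link_net_capable(t))
        return -1;
    t->collect_md = new_collect_md;
    return 0;
}
```

Placing the gate at the top matters because, in this model as in the commit's description, the update of collect_md would otherwise happen before any later permission check could reject the request.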
2026-06-17 | Merge tag 'bpf-next-7.2' of ↵ | Linus Torvalds | 2 | -31/+98
git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next Pull bpf updates from Alexei Starovoitov: "Major changes: - Recover from BPF arena page faults using a scratch page and add ptep_try_set() for lockless empty-slot installs on x86 and arm64. This allows BPF kfuncs to access arena pointers directly. The 'arena_direct_access' stable branch was created for this work and was pulled into sched-ext and bpf-next trees (Tejun Heo, Kumar Kartikeya Dwivedi) - Lift old restriction and support 6+ arguments in BPF programs and kfuncs on x86 and arm64 (Yonghong Song, Puranjay Mohan) Other features and fixes: - Add 24-bit BTF vlen and reclaim unused bits in the BTF UAPI to ease addition of new BTF kinds (Alan Maguire) - Raise the maximum BPF call chain depth from 8 to 16 frames (Alexei Starovoitov) - Refactor object relationship tracking in the verifier and fix a dynptr use-after-free bug (Amery Hung) - Harden the signed program loader and reject exclusive maps as inner maps (Daniel Borkmann) - Replace the verifier min/max bounds fields with a circular number (cnum) representation and improve 32->64 bit range refinements (Eduard Zingerman) - Introduce the arena library and runtime (libarena) with a buddy allocator, rbtree and SPMC queue data structures, ASAN support and a parallel test harness. Allow subprograms to return arena pointers and switch to a BTF type-tag based __arena annotation (Emil Tsalapatis) - Cache build IDs in the sleepable stackmap path and avoid faultable build ID reads under mm locks (Ihor Solodrai) - Introduce the tracing_multi link to attach a single BPF program to many kernel functions at once. 
Allow specifying the uprobe_multi target via FD (Jiri Olsa) - Extend the bpf_list family of kfuncs with bpf_list_add/del(), and bpf_list_is_first/is_last/empty() (Kaitao Cheng) - Extend the BPF syscall with common attributes support for prog_load, btf_load and map_create (Leon Hwang) - Wrap rhashtable as BPF map (Mykyta Yatsenko, Herbert Xu) - Add sleepable support for tracepoint programs and fix deadlocks in LRU map due to NMI reentry (Mykyta Yatsenko) - Fix OOB access in bpf_flow_keys, fix nullness analysis of inner arrays, enforce write checks for global subprograms (Nuoqi Gui) - Report the maximum combined stack depth and print a breakdown of instructions processed per subprogram (Paul Chaignon) - Add an XDP load-balancer benchmark and arm64 JIT support for stack arguments (Puranjay Mohan) - Add kfuncs to traverse over wakeup_sources (Samuel Wu) - Allow sleepable BPF programs to use LPM trie maps directly (Vlad Poenaru) - Many more fixes and cleanups across the verifier, BTF, sockmap, devmap, bpffs, security hooks, s390/riscv/loongarch JITs, rqspinlock, libbpf, bpftool, selftests" * tag 'bpf-next-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (336 commits) selftests/bpf: Work around llvm stack overflow in crypto progs selftests/bpf: add test for bpf_msg_pop_data() overflow bpf, sockmap: fix integer overflow in bpf_msg_pop_data() bounds check sockmap: Fix use-after-free in udp_bpf_recvmsg() bpf, sockmap: keep sk_msg copy state in sync bpf, sockmap: Fix wrong rsge offset in bpf_msg_push_data() bpf, sockmap: reject overflowing copy + len in bpf_msg_push_data() selftsets/bpf: Retry map update on helper_fill_hashmap() selftests/bpf: Add test for sleepable lsm_cgroup rejection selftests/bpf: Add test to verify the fix for bpf_setsockopt() helper bpf: Fix bpf_get/setsockopt to tos for ipv4-mapped ipv6 socket selftests/bpf: Avoid static LLVM linking for cross builds selftests/bpf: Use common CFLAGS for urandom_read selftests/bpf: Initialize 
operation name before use tools/bpf: build: Append extra cflags libbpf: Initialize CFLAGS before including Makefile.include bpftool: Append extra host flags bpftool: Avoid adding EXTRA_CFLAGS to HOST_CFLAGS bpftool: Pass host flags to bootstrap libbpf selftests/bpf: correct CONFIG_PPC64 macro name in comment ...
2026-06-17 | Merge tag 'net-next-7.2' of ↵ | Linus Torvalds | 27 | -453/+532
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core & protocols: - Work on removing rtnl_lock protection throughout the stack continues. In this chapter: - don't use rtnl_lock for IPv6 multicast routing configuration - don't take rtnl_lock in ethtool for modern drivers - prepare Qdisc dump callbacks for rtnl_lock removal - Support dumping just ifindex + name of all interfaces, under RCU. It's a common operation for Netlink CLI tools (when translating names to ifindexes) and previously required full rtnl_lock. - Support dumping qdiscs and page pools for a specific netdev. Even tho user space wants a dump of all netdevs, most of the time, the OOO programming model results in repeating the dump for each netdev. Which, in absence of a cache, leads to a O(n^2) behavior. - Flush nexthops once on multi-nexthop removal (e.g. when device goes down), another O(n^2) -> O(n) improvement. - Rehash locally generated traffic to a different nexthop on retransmit timeout. - Honor oif when choosing nexthop for locally generated IPv6 traffic. - Convert TCP Auth Option to crypto library, and drop non-RFC algos. - Increase subflow limits in MPTCP to 64 and endpoint limit to 256. - Support MPTCP signaling of IPv6 address + port (ADD_ADDR). We need to selectively skip reporting of the standard TCP Timestamp option, because they won't fit into the header space together (12 + 30 > 40). - Support using bridge neighbor suppression, Duplicate Address Detection, Gratuitous ARP and unsolicited NA forwarding - in EVPN deployments, e.g. VXLAN fabrics (IPv4 and IPv6). - Improve link state reporting for upper netdevs (e.g. macvlan) over tunnel devices (again, mostly for EVPN deployments). - Support binding GENEVE tunnels to a local address. - Speed up UDP tunnel destruction (remove one synchronize_rcu()). - Support exponential field encoding in multicast (IGMPv3 and MLDv2). 
- Support attaching PSP crypto offload to containers (veth, netkit). - Add a new IPSec Netlink message XFRM_MSG_MIGRATE_STATE that allows migrating individual IPsec SAs independently of their policies. The existing XFRM_MSG_MIGRATE is tightly coupled to policy+SA migration, lacks SPI for unique SA identification, and cannot express reqid changes or migrate Transport mode selectors. The new interface identifies the SA via SPI and mark, supports reqid changes, address family changes, encap removal, and uses an atomic create+install flow under x->lock to prevent SN/IV reuse during AEAD SA migration. - Implement GRO/GSO support for PPPoE. - Convert sockopt callbacks in a number of protocols to iov_iter. Cross-tree stuff: - Remove support for Crypto TFM cloning (unblocked after the TCP Auth Option rework). This feature regressed performance for all crypto API users, since it changed crypto transformation objects into reference-counted objects. - Add FCrypt-PCBC implementation to rxrpc and remove it from the global crypto API as obsolete and insecure. Wireless: - Major rework of station bandwidth handling, fixing issues with lower capability than AP. - Cleanups for EMLSR spec issues (drafts differed). - More Neighbor Awareness Networking (Wi-Fi Aware) work (multicast, schedule improvements, multi-station etc.) - Some Ultra High Reliability (UHR) / IEEE 802.11bn (D1.4) work (e.g. non-primary channel access, UHR DBE support). - Fine Timing Measurement ranging (i.e. distance measurement) APIs. Netfilter: - Use per-rule hash initval in nf_conncount. This avoids unnecessary lock contention with short keys (e.g. conntrack zones) in different namespaces. - Various safety improvements, both in packet parsing and object lifetimes. Notably add refcounts to conntrack timeout policy. Deletions: - Remove TLS + sockmap integration. TLS wants to pin user pages to avoid a copy, and sockmap wants to write to the input stream. 
More work on this integration is clearly needed, and we can't find any users (original author admitted that they never deployed it). - Remove support for TLS offload with TCP Offload Engine (the far more common opportunistic offload is retained). The locking looks unfixable (driver sleeps under TCP spin locks) and people from the vendor that added this are AWOL. - Remove more ATM code, trying to leave behind only what PPPoATM needs, AAL5 and br2684 with permanent circuits. - Remove AppleTalk. Let it join hamradio in our out of tree protocol graveyard, I mean, repository. - Disable 32-bit x_tables compatibility (32bit binaries on 64bit kernel) interface in user namespaces. To be deleted completely, soon. - Remove 5/10 MHz support from cfg80211/mac80211. Drivers: - Software: - Support DEVMEM/DMABUF Tx over NETMEM_TX_NO_DMA devices (netkit) - bonding: add knob to strictly follow 802.3ad for link state - New drivers: - Alibaba Elastic Ethernet Adaptor (cloud vNIC). - NXP NETC switch within i.MX94. - DPLL: - Add operational state to pins (implement in zl3073x). - Add generic DPLL type, for daisy-chaining DPLLs (implement in ice). 
 - Ethernet high-speed NICs:
   - Huawei (hinic3):
     - enhance tc flow offload support with queue selection, tunnels
   - nVidia/Mellanox:
     - avoid over-copying payload to the skb's linear part (up to 60% win for LRO on slow CPUs like ARM64 V2)
     - expose more per-queue stats over the standard API
     - support additional, unprivileged PFs in the DPU configuration
     - support Socket Direct (multi-PF) with switchdev offloads
     - add a pool / frag allocator for DMA mapped buffers for control objects, save memory on systems with 64kB page size
     - take advantage of the ability to dynamically change RSS table size, even when table is configured by the user
     - increase the max RSS table size for even traffic distribution

 - Ethernet NICs:
   - Marvell/Aquantia:
     - AQC113 PTP support
   - Realtek USB (r8152):
     - support 10Gbit Link Speeds and Energy-Efficient Ethernet (EEE)
     - support firmware loading (for RTL8157/RTL8159)
     - support for the RTL8159
   - Intel (ixgbe):
     - support Energy-Efficient Ethernet (EEE) on E610 devices

 - Ethernet switches:
   - Airoha:
     - support multiple netdevs on a single GDM block / port
   - Marvell (mv88e6xxx):
     - support SERDES of mv88e6321
   - Microchip (ksz8/9):
     - rework the driver callbacks to remove one indirection layer
   - Motorcomm (yt921x):
     - support port rate policing
     - support TBF qdisc offload
     - support ACL/flower offload
   - nVidia/Mellanox:
     - expose per-PG rx_discards
   - Realtek:
     - rtl8365mb: bridge offloading and VLAN support

 - Ethernet PHYs:
   - Airoha:
     - support Airoha AN8801R Gigabit PHYs.
   - Micrel:
     - implement 3 low-loss cable tunables
   - Realtek:
     - support MDI swapping for RTL8226-CG
     - support MDIO for RTL931x
   - Qualcomm:
     - at803x: Rx and Tx clock management for IPQ5018 PHY
   - Motorcomm:
     - support YT8522 100M RMII PHY
     - set drive strength in YT8531s RGMII
   - TI:
     - dp83822: add optional external PHY clock

 - Bluetooth:
   - hci_sync: add support for HCI_LE_Set_Host_Feature [v2]
   - SMP: use AES-CMAC library API
   - Intel:
     - support Product level reset
     - support smart trigger dump
   - Mediatek:
     - add event filter to filter specific event
   - Realtek:
     - fix RTL8761B/BU broken LE extended scan

 - WiFi:
   - Broadcom (b43):
     - new support for a 11n device
   - MediaTek (mt76):
     - support mt7927
     - mt792x: broken usb transport detection
     - mt7921: regulatory improvements
   - Qualcomm (ath9k):
     - GPIO interface improvements
   - Qualcomm (ath12k):
     - WDS support
     - replace dynamic memory allocation in WMI Rx path
     - thermal throttling/cooling device support
     - 6 GHz incumbent interference detection
     - channel 177 in 5 GHz
   - Realtek (rtw89):
     - RTL8922AU support
     - USB 3 mode switch for performance
     - better monitor radiotap support
     - RTL8922DE preparations"

* tag 'net-next-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1778 commits)
  ipv4: fib_rule: Move fib4_rules_exit() to ->exit().
  net: serialize netif_running() check in enqueue_to_backlog()
  net: skmsg: preserve sg.copy across SG transforms
  appletalk: move the protocol out of tree
  appletalk: stop storing per-interface state in struct net_device
  selftests/bpf: test that TLS crypto is rejected on a sockmap socket
  selftests/bpf: drop the unused kTLS program from test_sockmap
  selftests/bpf: remove sockmap + ktls tests
  tls: remove dead sockmap (psock) handling from the SW path
  tls: reject the combination of TLS and sockmap
  atm: remove orphaned uAPI for deleted drivers, protocols and SVCs
  atm: remove unused ATM PHY operations
  atm: remove the unused pre_send and send_bh device operations
  atm: remove the unused change_qos device operation
  atm: remove SVC socket support and the signaling daemon interface
  atm: remove the local ATM (NSAP) address registry
  atm: remove dead SONET PHY ioctls
  atm: remove the unused send_oam / push_oam callbacks
  atm: remove AAL3/4 transport support
  net: dsa: sja1105: fix lastused timestamp in flower stats
  ...
2026-06-16  net: serialize netif_running() check in enqueue_to_backlog()  Eric Dumazet  1  -2/+4

Syzbot reported a KASAN slab-use-after-free in fib_rules_lookup(). The root cause is a race condition where packets can escape the backlog flushing during device unregistration (e.g., during netns exit).

Commit e9e4dd3267d0 ("net: do not process device backlog during unregistration") introduced a lockless netif_running() check in enqueue_to_backlog() to prevent queuing packets to an unregistering device. However, this creates a TOCTOU race window.

A lockless transmitter (like veth_xmit) can pass the check before dev_close() clears IFF_UP. If the transmitter is then delayed, flush_all_backlogs() can run and finish before the transmitter grabs the backlog lock and queues the packet. The packet then escapes the flush and triggers UAF later when processed.

Fix this by moving the netif_running() check inside the backlog lock. This serializes the check with the flush work (which also grabs the lock). We then either queue the packet before the flush runs (so it gets flushed), or check netif_running() after the flush/close completes (so it gets dropped).

Fixes: e9e4dd3267d0 ("net: do not process device backlog during unregistration")
Reported-by: syzbot+965506b59a2de0b6905c@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/netdev/6a315824.b0403584.28d0ff.0000.GAE@google.com/T/#u
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Link: https://patch.msgid.link/20260616141317.407791-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
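The serialization argument in this commit can be illustrated with a small user-space sketch. Everything in it is an assumption made for illustration: `struct backlog`, `struct device`, `backlog_enqueue()` and `close_and_flush()` are hypothetical stand-ins, a pthread mutex plays the role of the backlog lock, and an atomic flag plays the role of IFF_UP. It is not the kernel code and does not claim to match the actual patch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define BACKLOG_MAX 16

/* Hypothetical stand-in for a backlog queue protected by one lock. */
struct backlog {
	pthread_mutex_t lock;
	int pkts[BACKLOG_MAX];
	size_t len;
};

/* Hypothetical stand-in for a device; "running" models IFF_UP. */
struct device {
	atomic_bool running;
};

/*
 * Queue a packet only if the device is still running. The check is made
 * while holding the queue lock, so it cannot interleave with a flush.
 * Returns true if the packet was queued, false if it was dropped.
 */
static bool backlog_enqueue(struct backlog *q, struct device *dev, int pkt)
{
	bool queued = false;

	assert(q != NULL && dev != NULL);
	pthread_mutex_lock(&q->lock);
	if (atomic_load(&dev->running) && q->len < BACKLOG_MAX) {
		q->pkts[q->len++] = pkt;
		queued = true;
	}
	pthread_mutex_unlock(&q->lock);
	return queued;
}

/*
 * Mark the device as stopped first, then flush under the same lock.
 * Returns the number of packets that were flushed.
 */
static size_t close_and_flush(struct backlog *q, struct device *dev)
{
	size_t flushed;

	assert(q != NULL && dev != NULL);
	atomic_store(&dev->running, false);
	pthread_mutex_lock(&q->lock);
	flushed = q->len;
	q->len = 0;
	pthread_mutex_unlock(&q->lock);
	return flushed;
}
```

In this model a sender that takes the lock before the flush has its packet flushed, and a sender that takes the lock afterwards observes the cleared flag and drops the packet, so nothing can be queued behind the flush.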
2026-06-16  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net  Jakub Kicinski  5  -6/+40

Merge in late fixes in preparation for the net-next PR.

Conflicts:

net/tls/tls_sw.c
  406e8a651a7b ("net: skmsg: preserve sg.copy across SG transforms")
  79511603a65b ("tls: remove dead sockmap (psock) handling from the SW path")

drivers/net/ethernet/microsoft/mana/mana_en.c
  f8fd56977eeea ("net: mana: guard TX wq object destroy with INVALID_MANA_HANDLE check")
  d07efe5a6e641 ("net: mana: Use per-queue allocation for tx_qp to reduce allocation size")
https://lore.kernel.org/ajAPXu-C_PuTgV-a@sirena.org.uk

No adjacent changes.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-16  net: skmsg: preserve sg.copy across SG transforms  Yiming Qian  2  -0/+29

The sk_msg sg.copy bitmap is part of the scatterlist entry ownership state. A set bit tells sk_msg_compute_data_pointers() not to expose the entry through writable BPF ctx->data. This protects entries backed by pages that are not private to the sk_msg, such as splice-backed file page-cache pages.

Several sk_msg transform paths move, copy, split, or compact msg->sg.data[] entries without moving the matching sg.copy bit. This can make an externally backed entry arrive at a new slot with a clear copy bit. A later SK_MSG verdict can then expose sg_virt(sge) as writable ctx->data and BPF stores can modify the original page cache.

Keep sg.copy synchronized with sg.data[] whenever entries are transferred, shifted, split, or copied into a new sk_msg. Clear the bit when an entry is replaced by a newly allocated private page or freed.

This covers the BPF pull/push/pop helpers, sk_msg_shift_left/right(), sk_msg_xfer(), and tls_split_open_record(), including the partial tail entry created during TLS open-record splitting.

Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
Cc: stable@vger.kernel.org
Reported-by: Yiming Qian <yimingqian591@gmail.com>
Reported-by: Keenan Dong <keenanat2000@gmail.com>
Signed-off-by: Yiming Qian <yimingqian591@gmail.com>
Link: https://patch.msgid.link/20260610062137.49075-1-yimingqian591@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-16  tls: remove dead sockmap (psock) handling from the SW path  Jakub Kicinski  1  -49/+3

TLS and sockmap are now mutually exclusive. Try to delete the code from sendmsg and recvmsg path which is now obviously dead. The main goal is to delete enough code for AI security scanners to no longer bother us with sockmap related bugs. At the same time retain the code in case someone has the cycles to fix all of this and make the integration work, again. If the integration does not get restored we can wipe the rest of the skmsg code from TLS in two or three releases.

The changes on the Tx side are deeper since that's where most of the bugs are, Rx side simply takes the data from sockmap and gives it to the user. On Tx split record handling and rolling back the iterator were the two problem areas.

Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Link: https://patch.msgid.link/20260614014102.461064-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-06-16  Merge tag 'slab-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab  Linus Torvalds  1  -11/+13

Pull slab updates from Vlastimil Babka:

 - Support for "allocation tokens" (currently available in Clang 22+) for smarter partitioning of kmalloc caches based on the allocated object type, which can be enabled instead of the "random" per-caller-address-hash partitioning. It should be able to deterministically separate types containing a pointer from those that do not (Marco Elver)

 - Improvements and simplification of the kmem_cache_alloc_bulk() and mempool_alloc_bulk() API. This includes adaptation of callers (Christoph Hellwig)

 - Performance improvements and cleanups related mostly to sheaves refill (Hao Li, Shengming Hu, Vlastimil Babka)

 - Several fixups for the slabinfo tool (Xuewen Wang)

* tag 'slab-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vbabka/slab:
  mm/slab: do not limit zeroing to orig_size when only red zoning is enabled
  mm/slub: preserve original size in _kmalloc_nolock_noprof retry path
  mm: simplify the mempool_alloc_bulk API
  mm/slab: improve kmem_cache_alloc_bulk
  mm/slub: detach and reattach partial slabs in batch
  mm/slub: introduce helpers for node partial slab state
  mm/slub: use empty sheaf helpers for oversized sheaves
  tools/mm/slabinfo: remove redundant slab->partial assignment
  tools/mm/slabinfo: remove dead assignment in get_obj_and_str()
  tools/mm/slabinfo: Fix trace disable logic inversion
  MAINTAINERS: add slab-related scripts and tools to SLAB ALLOCATOR
  mm/slub: fix typo in sheaves comment
  mm, slab: simplify returning slab in __refill_objects_node()
  mm, slab: add an optimistic __slab_try_return_freelist()
  slab: fix kernel-docs for mm-api
  slab: improve KMALLOC_PARTITION_RANDOM randomness
  slab: support for compiler-assisted type-based slab cache partitioning
  mm/slub: defer freelist construction until after bulk allocation from a new slab
2026-06-14  bpf, sockmap: fix integer overflow in bpf_msg_pop_data() bounds check  Sechang Lim  1  -2/+2

start and len are u32, so

  u64 last = start + len;

evaluates start + len in 32-bit and wraps before storing it in last. The bounds check

  if (start >= offset + l || last > msg->sg.size)
          return -EINVAL;

can then be passed with an out-of-range start/len, after which the pop loop runs off the end of the scatterlist and sk_msg_shift_left() calls put_page() on the empty msg->sg.end slot:

  Oops: general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] SMP KASAN PTI
  KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
  RIP: 0010:sk_msg_shift_left net/core/filter.c:2957 [inline]
  RIP: 0010:____bpf_msg_pop_data net/core/filter.c:3103 [inline]
  RIP: 0010:bpf_msg_pop_data+0x753/0x1a10 net/core/filter.c:2984
  Call Trace:
   <TASK>
   bpf_prog_4cc92c278f4d5d56+0x1b1/0x1e8
   bpf_prog_run_pin_on_cpu+0x107/0x320 include/linux/filter.h:746
   sk_psock_msg_verdict+0x357/0x7f0 net/core/skmsg.c:934
   tcp_bpf_send_verdict net/ipv4/tcp_bpf.c:420 [inline]
   tcp_bpf_sendmsg+0x766/0x1ae0 net/ipv4/tcp_bpf.c:583
   __sock_sendmsg+0x153/0x1c0 net/socket.c:802
   __sys_sendto+0x326/0x430 net/socket.c:2265
   __x64_sys_sendto+0xe3/0x100 net/socket.c:2268
   do_syscall_64+0x14c/0x480
   entry_SYSCALL_64_after_hwframe+0x77/0x7f
   </TASK>

Widen the addition with a (u64) cast so the bound is evaluated in 64-bit and a len near U32_MAX no longer wraps below msg->sg.size.

While here, change pop from int to u32. It counts bytes against the unsigned scatterlist lengths and can never be negative, so the signed type only invites sign-confusion in the pop loop.

Fixes: 7246d8ed4dcc ("bpf: helper to pop data from messages")
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Sechang Lim <rhkrqnwk98@gmail.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260615021959.140010-6-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
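The wrap described in this commit can be reproduced outside the kernel. The sketch below is a stand-alone illustration under stated assumptions: `range_ok_wrapping()`, `range_ok_widened()` and `range_check_demo()` are hypothetical names, `uint32_t`/`uint64_t` stand in for the kernel's u32/u64, and only the upper-bound half of the check is modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Buggy form: both operands are 32-bit, so the sum is computed modulo
 * 2^32 and only the already-wrapped result is widened to 64 bits.
 */
static bool range_ok_wrapping(uint32_t start, uint32_t len, uint32_t size)
{
	uint64_t last = start + len;

	return last <= size;
}

/*
 * Fixed form: casting one operand first makes the addition itself
 * 64-bit, so the sum of two 32-bit values cannot wrap.
 */
static bool range_ok_widened(uint32_t start, uint32_t len, uint32_t size)
{
	uint64_t last = (uint64_t)start + len;

	return last <= size;
}

/* Illustrative values: 16 + 0xFFFFFFF8 wraps to 8 in 32-bit arithmetic. */
static void range_check_demo(void)
{
	uint32_t start = 16, len = UINT32_MAX - 7, size = 64;

	assert(range_ok_wrapping(start, len, size));   /* wrongly accepted */
	assert(!range_ok_widened(start, len, size));   /* correctly rejected */
	assert(range_ok_widened(16, 32, size));        /* in-range request */
}
```

The cast has to be applied to an operand, not to the finished sum; `(uint64_t)(start + len)` would still wrap.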
2026-06-14  bpf, sockmap: keep sk_msg copy state in sync  Zhang Cen  1  -5/+83

SK_MSG uses msg->sg.copy as per-scatterlist-entry provenance. Entries with this bit set are copied before data/data_end are exposed to SK_MSG BPF programs for direct packet access.

bpf_msg_pull_data(), bpf_msg_push_data(), and bpf_msg_pop_data() rewrite the sk_msg scatterlist ring by collapsing, splitting, and shifting entries. These operations move msg->sg.data[] entries, but the parallel copy bitmap can be left behind on the old slot. A copied entry can then return to msg->sg.start with its copy bit clear and be exposed as directly writable packet data.

This corruption path requires an attached SK_MSG BPF program that calls the mutating helpers; ordinary sockmap/TLS traffic that never runs push/pop/pull helper sequences is not affected.

Keep msg->sg.copy synchronized with scatterlist entry moves, preserve the copy bit when an entry is split, clear it when a helper replaces an entry with a private page, and clear slots vacated by pull-data compaction.

Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Fixes: 6fff607e2f14 ("bpf: sk_msg program helper bpf_msg_push_data")
Fixes: 7246d8ed4dcc ("bpf: helper to pop data from messages")
Cc: stable@vger.kernel.org
Co-developed-by: Han Guidong <2045gemini@gmail.com>
Reviewed-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Han Guidong <2045gemini@gmail.com>
Signed-off-by: Zhang Cen <rollkingzzc@gmail.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260615021959.140010-4-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
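The invariant this commit restores, a per-slot flag that must travel with its entry, can be shown with a toy ring. This is only a sketch under assumptions: `struct toy_ring`, `toy_move_entry()` and `toy_replace_private()` are hypothetical, entries are plain integers, and a 32-bit word stands in for the copy bitmap. The real sk_msg layout and helpers differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SLOTS 8

/* Hypothetical ring: data[] holds entries, bit i of copy describes data[i]. */
struct toy_ring {
	int data[TOY_SLOTS];
	uint32_t copy;
};

static bool toy_copy_bit(const struct toy_ring *r, unsigned int slot)
{
	return (r->copy >> slot) & 1u;
}

/* Move an entry and its copy bit together, then clear the vacated slot. */
static void toy_move_entry(struct toy_ring *r, unsigned int dst, unsigned int src)
{
	bool bit;

	assert(dst < TOY_SLOTS && src < TOY_SLOTS);
	if (dst == src)
		return;

	bit = toy_copy_bit(r, src);
	r->data[dst] = r->data[src];
	if (bit)
		r->copy |= 1u << dst;
	else
		r->copy &= ~(1u << dst);

	r->data[src] = 0;
	r->copy &= ~(1u << src);
}

/* Replacing an entry with private data drops the copy bit for that slot. */
static void toy_replace_private(struct toy_ring *r, unsigned int slot, int value)
{
	assert(slot < TOY_SLOTS);
	r->data[slot] = value;
	r->copy &= ~(1u << slot);
}
```

If `toy_move_entry()` copied only `data[]`, the entry would land in `dst` with whatever bit happened to be there, which is the shape of the bug the commit message describes.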
2026-06-14  bpf, sockmap: Fix wrong rsge offset in bpf_msg_push_data()  Weiming Shi  1  -1/+1

When bpf_msg_push_data() splits a scatterlist element into head and tail, the tail's page offset is advanced by `start` (absolute message byte offset) instead of `start - offset` (byte position within the element). This makes rsge.offset overshoot by `offset` bytes, pointing to the wrong location within the page or beyond its boundary. Consumers of the corrupted entry either silently read wrong data or trigger an out-of-bounds access.

  BUG: KASAN: slab-use-after-free in bpf_msg_pull_data (net/core/filter.c:2728)
  Read of size 32752 at addr ffff8881042f0010 by task poc/130
  Call Trace:
   __asan_memcpy (mm/kasan/shadow.c:105)
   bpf_msg_pull_data (net/core/filter.c:2728)
   bpf_prog_run_pin_on_cpu (include/linux/bpf.h:1402)
   sk_psock_msg_verdict (net/core/skmsg.c:934)
   tcp_bpf_send_verdict (net/ipv4/tcp_bpf.c:421)
   sock_sendmsg_nosec (net/socket.c:727)

Fixes: 6fff607e2f14 ("bpf: sk_msg program helper bpf_msg_push_data")
Reported-by: Xiang Mei <xmei5@asu.edu>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260615021959.140010-3-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
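The offset arithmetic in this fix can be checked with a small model. The sketch assumes a hypothetical `struct toy_sge` with only a page offset and a length, and a hypothetical `toy_split()` helper; `offset` is the message byte position where the element begins and `start` is the absolute split point, as in the commit message.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced scatterlist element: where it starts in its page
 * and how many bytes it covers. */
struct toy_sge {
	uint32_t page_off;
	uint32_t length;
};

/*
 * Split one element at absolute message position `start`. The element
 * covers message bytes [offset, offset + length), so the split point
 * inside the element is start - offset, not start.
 */
static void toy_split(const struct toy_sge *sge, uint32_t offset, uint32_t start,
		      struct toy_sge *head, struct toy_sge *tail)
{
	uint32_t rel;

	assert(start >= offset);
	rel = start - offset;
	assert(rel <= sge->length);

	head->page_off = sge->page_off;
	head->length = rel;

	/* Advancing by `start` here would overshoot by `offset` bytes. */
	tail->page_off = sge->page_off + rel;
	tail->length = sge->length - rel;
}
```

For an element at page offset 100 covering message bytes 4096 to 4145, a split at `start = 4106` gives a tail at page offset 110; adding `start` directly would have produced 4206.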
2026-06-14  bpf, sockmap: reject overflowing copy + len in bpf_msg_push_data()  Weiming Shi  1  -0/+3

When the scatterlist ring is full or nearly full, bpf_msg_push_data() enters a copy fallback path and computes copy + len for the page allocation size. Since len comes from BPF with arg3_type = ARG_ANYTHING and both are u32, a crafted len can wrap the sum to a small value, causing an undersized allocation followed by an out-of-bounds memcpy.

  BUG: unable to handle page fault for address: ffffed104089a402
  Oops: Oops: 0000 [#1] SMP KASAN NOPTI
  Call Trace:
   __asan_memcpy (mm/kasan/shadow.c:105)
   bpf_msg_push_data (net/core/filter.c:2852 net/core/filter.c:2788)
   bpf_prog_9ed8b5711920a7d7+0x2e/0x36
   sk_psock_msg_verdict (net/core/skmsg.c:934)
   tcp_bpf_sendmsg (net/ipv4/tcp_bpf.c:421 net/ipv4/tcp_bpf.c:584)
   __sys_sendto (net/socket.c:2206)
   do_syscall_64 (arch/x86/entry/syscall_64.c:94)
   entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)

Add an overflow check before the allocation.

Link: https://lore.kernel.org/all/20260424155913.A19FDC19425@smtp.kernel.org
Fixes: 6fff607e2f14 ("bpf: sk_msg program helper bpf_msg_push_data")
Tested-by: Xiang Mei <xmei5@asu.edu>
Tested-by: Xinyu Ma <mmmxny@gmail.com>
Reviewed-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Reviewed-by: Emil Tsalapatis <emil@etsalapatis.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Signed-off-by: Weiming Shi <bestswngs@gmail.com>
Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev>
Link: https://lore.kernel.org/r/20260615021959.140010-2-jiayuan.chen@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
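One way to express an overflow check of this kind in portable C is sketched below. The helper name `toy_alloc_size()` and its shape are assumptions for illustration; the commit message does not say which form of check the actual patch uses.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Compute copy + len for an allocation size, refusing sums that do not
 * fit in 32 bits. Returns false (and leaves *out untouched) on overflow.
 */
static bool toy_alloc_size(uint32_t copy, uint32_t len, uint32_t *out)
{
	assert(out != NULL);

	/* len > UINT32_MAX - copy is equivalent to copy + len overflowing,
	 * and the subtraction itself can never wrap. */
	if (len > UINT32_MAX - copy)
		return false;

	*out = copy + len;
	return true;
}
```

With `copy = 4096` and `len = UINT32_MAX - 100`, an unchecked 32-bit sum would be 3995, far smaller than either the requested length or the bytes later copied; the check rejects the request before any size is produced.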
2026-06-14  bpf: Fix bpf_get/setsockopt to tos for ipv4-mapped ipv6 socket  Leon Hwang  1  -1/+14

When TCP runs over IPv4 through the INET6 API, bpf_get/setsockopt at the IPv4 level fails, because sk->sk_family is AF_INET6. At the IPv6 level the call succeeds but does not take effect, because inet_csk(sk)->icsk_af_ops is ipv6_mapped, which uses ip_queue_xmit and inet_sk(sk)->tos.

To relax this restriction, allow getting/setting tos for those possible ipv4-mapped ipv6 sockets.

Fixes: ee7f1e1302f5 ("bpf: Change bpf_setsockopt(SOL_IP) to reuse do_ip_setsockopt()")
Signed-off-by: Feng Zhou <zhoufeng.zf@bytedance.com>
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
Link: https://lore.kernel.org/r/20260613162443.60515-2-leon.hwang@linux.dev
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-06-12  net: watchdog: fix refcount tracking races  Eric Dumazet  1  -1/+2

Blamed commit converted the untracked dev_hold()/dev_put() calls in the watchdog code to use the tracked dev_hold_track()/dev_put_track() (which were later renamed/interfaced to netdev_hold() and netdev_put()).

By introducing dev->watchdog_dev_tracker to store the reference tracking information without adding synchronization between netdev_watchdog_up() and dev_watchdog(), it enabled the race condition where this pointer could be overwritten or freed concurrently, leading to the list corruption crash syzbot reported:

  list_del corruption, ffff888114a18c00->next is NULL
  kernel BUG at lib/list_debug.c:52 !
  Oops: invalid opcode: 0000 [#1] SMP KASAN PTI
  CPU: 1 UID: 0 PID: 91 Comm: kworker/u8:5 Not tainted syzkaller #0 PREEMPT(lazy)
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 05/09/2026
  Workqueue: events_unbound linkwatch_event
  RIP: 0010:__list_del_entry_valid_or_report.cold+0x22/0x2a lib/list_debug.c:52
  Call Trace:
   <TASK>
   __list_del_entry_valid include/linux/list.h:132 [inline]
   __list_del_entry include/linux/list.h:246 [inline]
   list_move_tail include/linux/list.h:341 [inline]
   ref_tracker