aboutsummaryrefslogtreecommitdiff
path: root/include/net
AgeCommit message (Collapse)AuthorFilesLines
4 daysnet/sched: fix pedit partial COW leading to page cache corruptionRajat Gupta1-1/+0
tcf_pedit_act() computes the COW range for skb_ensure_writable() once before the key loop using tcfp_off_max_hint, but the hint does not account for the runtime header offset added by typed keys. This can leave part of the write region un-COW'd. Fix by moving skb_ensure_writable() inside the per-key loop where the actual write offset is known, and add overflow checking on the offset arithmetic. For negative offsets (e.g. Ethernet header edits at ingress), use skb_cow() to COW the headroom instead. Guard offset_valid() against INT_MIN, where negation is undefined. Fixes: 8b796475fd78 ("net/sched: act_pedit: really ensure the skb is writable") Reported-by: Yiming Qian <yimingqian591@gmail.com> Reported-by: Keenan Dong <keenanat2000@gmail.com> Reported-by: Han Guidong <2045gemini@gmail.com> Reported-by: Zhang Cen <rollkingzzc@gmail.com> Reviewed-by: Han Guidong <2045gemini@gmail.com> Tested-by: Han Guidong <2045gemini@gmail.com> Reviewed-by: Davide Caratti <dcaratti@redhat.com> Tested-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Tested-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Acked-by: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Rajat Gupta <rajat.gupta@oss.qualcomm.com> Link: https://patch.msgid.link/20260531123221.48732-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 daysMerge tag 'for-net-2026-06-03' of ↵Jakub Kicinski1-0/+1
git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth Luiz Augusto von Dentz says: ==================== bluetooth pull request for net: - hci_core: fix memory leak in error path of hci_alloc_dev() - hci_sync: reject oversized Broadcast Announcement prepend - MGMT: Fix backward compatibility with userspace - MGMT: validate advertising TLV before type checks - L2CAP: reject BR/EDR signaling packets over MTUsig - RFCOMM: validate skb length in MCC handlers - RFCOMM: hold listener socket in rfcomm_connect_ind() - ISO: Fix not releasing hdev reference on iso_conn_big_sync - ISO: Fix a use-after-free of the hci_conn pointer - ISO: Fix data-race on iso_pi fields in hci_get_route calls - SCO: Fix data-race on sco_pi fields in sco_connect - BNEP: reject short frames before parsing * tag 'for-net-2026-06-03' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth: Bluetooth: MGMT: Fix backward compatibility with userspace Bluetooth: SCO: Fix data-race on sco_pi fields in sco_connect Bluetooth: ISO: Fix data-race on iso_pi fields in hci_get_route calls Bluetooth: ISO: Fix a use-after-free of the hci_conn pointer Bluetooth: ISO: Fix not releasing hdev reference on iso_conn_big_sync Bluetooth: fix memory leak in error path of hci_alloc_dev() Bluetooth: bnep: reject short frames before parsing Bluetooth: hci_sync: reject oversized Broadcast Announcement prepend Bluetooth: L2CAP: reject BR/EDR signaling packets over MTUsig Bluetooth: RFCOMM: validate skb length in MCC handlers Bluetooth: MGMT: validate advertising TLV before type checks Bluetooth: RFCOMM: hold listener socket in rfcomm_connect_ind() ==================== Link: https://patch.msgid.link/20260603162714.342496-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 daysmptcp: fix uninit-value in mptcp_established_optionsPaolo Abeni1-2/+5
syzbot reported the following uninit splat: BUG: KMSAN: uninit-value in mptcp_write_data_fin net/mptcp/options.c:542 [inline] BUG: KMSAN: uninit-value in mptcp_established_options_dss net/mptcp/options.c:590 [inline] BUG: KMSAN: uninit-value in mptcp_established_options+0x112f/0x3530 net/mptcp/options.c:874 mptcp_write_data_fin net/mptcp/options.c:542 [inline] mptcp_established_options_dss net/mptcp/options.c:590 [inline] mptcp_established_options+0x112f/0x3530 net/mptcp/options.c:874 tcp_established_options+0x312/0xcc0 net/ipv4/tcp_output.c:1192 __tcp_transmit_skb+0x5dc/0x5fe0 net/ipv4/tcp_output.c:1575 __tcp_send_ack+0x967/0xad0 net/ipv4/tcp_output.c:4499 tcp_send_ack+0x3d/0x60 net/ipv4/tcp_output.c:4505 mptcp_subflow_shutdown+0x164/0x690 net/mptcp/protocol.c:3137 mptcp_check_send_data_fin+0x31b/0x3d0 net/mptcp/protocol.c:3218 __mptcp_wr_shutdown net/mptcp/protocol.c:3234 [inline] __mptcp_close+0x860/0x1360 net/mptcp/protocol.c:3313 mptcp_close+0x42/0x260 net/mptcp/protocol.c:3367 inet_release+0x1ee/0x2a0 net/ipv4/af_inet.c:442 __sock_release net/socket.c:722 [inline] sock_close+0xd6/0x2f0 net/socket.c:1514 __fput+0x60e/0x1010 fs/file_table.c:510 ____fput+0x25/0x30 fs/file_table.c:538 task_work_run+0x208/0x2b0 kernel/task_work.c:233 resume_user_mode_work include/linux/resume_user_mode.h:50 [inline] __exit_to_user_mode_loop kernel/entry/common.c:67 [inline] exit_to_user_mode_loop+0x306/0x1b60 kernel/entry/common.c:98 __exit_to_user_mode_prepare include/linux/irq-entry-common.h:207 [inline] syscall_exit_to_user_mode_prepare include/linux/irq-entry-common.h:238 [inline] syscall_exit_to_user_mode include/linux/entry-common.h:318 [inline] __do_fast_syscall_32+0x2c7/0x460 arch/x86/entry/syscall_32.c:310 do_fast_syscall_32+0x37/0x80 arch/x86/entry/syscall_32.c:332 do_SYSENTER_32+0x1f/0x30 arch/x86/entry/syscall_32.c:370 entry_SYSENTER_compat_after_hwframe+0x84/0x8e Local variable opts created at: __tcp_transmit_skb+0x4d/0x5fe0 net/ipv4/tcp_output.c:1536 __tcp_send_ack+0x967/0xad0 net/ipv4/tcp_output.c:4499 The output path currently omits initializing the mptcp extension `use_map` flag in a few corner cases. Address the issue always zeroing all the extensions flags before eventually initializing the individual bits. To that extent, introduce and use a struct_group to avoid multiple bitwise operations. Fixes: cfcceb7a39fc ("tcp: shrink per-packet memset in __tcp_transmit_skb()") Cc: stable@vger.kernel.org Reported-by: syzbot+ff020673c5e3d94d9478@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=ff020673c5e3d94d9478 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20260602-net-mptcp-misc-fixes-7-1-rc7-v2-10-856831229976@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
5 daysBluetooth: L2CAP: reject BR/EDR signaling packets over MTUsigMichael Bommarito1-0/+1
net/bluetooth/l2cap_core.c:l2cap_sig_channel() accepts BR/EDR signaling packets up to the channel MTU and dispatches each command without enforcing the signaling MTU (MTUsig). A Bluetooth BR/EDR peer within radio range can send a fixed-channel CID 0x0001 packet that is larger than MTUsig and contains many L2CAP_ECHO_REQ commands before pairing. In a real-radio stock-kernel run, one 681-byte signaling packet containing 168 zero-length ECHO_REQ commands made the target transmit 168 ECHO_RSP frames over about 220 ms. Impact: a Bluetooth BR/EDR peer within radio range, before pairing, can force 168 ECHO_RSP frames from one 681-byte fixed-channel signaling packet containing packed ECHO_REQ commands. Define Linux's BR/EDR signaling MTU as the spec minimum of 48 bytes and reject any larger signaling packet with one L2CAP_COMMAND_REJECT_RSP carrying L2CAP_REJ_MTU_EXCEEDED before any command is dispatched. The Bluetooth Core spec wording for MTUExceeded says the reject identifier shall match the first request command in the packet, and that packets containing only responses shall be silently discarded. Linux intentionally deviates from that prescription: silently discarding desynchronizes the peer because the remote stack never learns its responses were dropped, and locating the first request command requires walking command headers past MTUsig, i.e. processing bytes from a packet we have already decided is too large to process. We therefore always emit one reject and use the identifier from the first command header, a single fixed-offset byte read. The unrestricted BR/EDR signaling parser and ECHO_REQ response path both trace to the initial git import; no later introducing commit is available for a Fixes tag. Cc: stable@vger.kernel.org Suggested-by: Luiz Augusto von Dentz <luiz.dentz@gmail.com> Link: https://lore.kernel.org/r/20260518002800.1361430-1-michael.bommarito@gmail.com Link: https://lore.kernel.org/r/20260520135034.1060859-1-michael.bommarito@gmail.com Link: https://lore.kernel.org/r/20260521000555.3712030-1-michael.bommarito@gmail.com Assisted-by: Claude:claude-opus-4-7 Assisted-by: Codex:gpt-5-5-xhigh Signed-off-by: Michael Bommarito <michael.bommarito@gmail.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
6 daysMerge tag 'nf-26-06-01' of ↵Jakub Kicinski1-2/+1
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for net: 1) Fix splat with PREEMPT_RCU because smp_processor_id() in nfqueue, from Fernando Fernandez Mancera. 2) Fix possible use of pointer to old IPVS scheduler after RCU grace period when editing service, from Julian Anastasov. 3) Fix possible forever RCU walk over rt->fib6_siblings in nft_fib6, if rt is unlinked mid-iteration, apparently same issue happens in the fib6 core. From Jiayuan Chen. 4) Add mutex to guard refcount in synproxy infrastructure, since concurrent hook {un}registration can happen. From Fernando Fernandez Mancera. 5) Bail out if IRC conntrack helper fails to parse a command, do not try parsing using other command handlers, from Florian Westphal. This fixes a possible out-of-bound read. 6) Possible use-after-free in nft_tunnel by releasing template dst after all references has been dropped, from Tristan Madani. 7) Ignore conntrack template in nft_ct, from Jiayuan Chen. 8) Missing skb_ensure_writable() in ebt_snat, Yiming Qian. 9) Remove multi-register byteorder support, this allows for kernel stack info leak, from Florian Westphal. * tag 'nf-26-06-01' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nft_byteorder: remove multi-register support netfilter: bridge: make ebt_snat ARP rewrite writable netfilter: nft_ct: bail out on template ct in get eval netfilter: nft_tunnel: fix use-after-free on object destroy netfilter: conntrack_irc: fix possible out-of-bounds read netfilter: synproxy: add mutex to guard hook reference counting netfilter: nft_fib_ipv6: bail out of sibling walk if rt got unlinked ipvs: clear the svc scheduler ptr early on edit netfilter: xt_NFQUEUE: prefer raw_smp_processor_id ==================== Link: https://patch.msgid.link/20260601115923.433946-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 daysnet/sched: act_api: use RCU with deferred freeing for action lifecycleJamal Hadi Salim1-0/+1
When NEWTFILTER and DELFILTER are run concurrently it is possible to create a race with an associated action. Let's illustrate with CPU0 running NEWTFILTER and CPU1 running DELFILTER: 0: mutex_lock() <-- holds the idr lock 0: rcu_read_lock() 0: p = idr_find(idr, index) <-- action p is valid (RCU protects IDR) 0: mutex_unlock() <-- releases the idr lock 1: refcount_dec_and_mutex_lock() <-- refcnt 1->0, mutex held 1: idr_remove(idr, index) <-- Action removed from IDR 1: mutex_unlock() <-- mutex released allowing us to delete the action 1: tcf_action_cleanup(p); kfree(p) <-- Kfrees p immediately, no deferral 0: refcount_inc_not_zero(&p->tcfa_refcnt) <-- ouch, UAF p points to freed memory This patch fixes the race condition between NEWTFILTER and DELFILTER by adding struct rcu_head to tc_action used in the deferral and introducing a call_rcu() in the delete path to defer the final kfree(). Note: this is a revert of commit d7fb60b9cafb ("net_sched: get rid of tcfa_rcu") but also modernization/simplification to directly use kfree_rcu(). Let's illustrate the new restored code path: 0: rcu_read_lock() 1: refcount_dec_and_mutex_lock() <-- refcnt 1->0, mutex held 1: idr_remove(idr, index) 1: mutex_unlock() 1: call_rcu(&p->tcfa_rcu, tcf_action_rcu_free) <-- defer kfree after grace period 0: p = idr_find(idr, index) 0: refcount_inc_not_zero(&p->tcfa_refcnt) <-- fails, refcnt already 0 1: rcu_read_unlock() <-- release so freeing can run after grace period After CPU1 calls idr_remove(), the object is no longer reachable through the IDR. CPU0's subsequent idr_find() will return NULL, and even if it still held a stale pointer, the immediate kfree() is now deferred until after the RCU grace period, so no UAF can occur. Fixes: d7fb60b9cafb ("net_sched: get rid of tcfa_rcu") Suggested-by: Jakub Kicinski <kuba@kernel.org> Reported-by: Kyle Zeng <kylebot@openai.com> Tested-by: Victor Nogueira <victor@mojatatu.com> Tested-by: syzbot@syzkaller.appspotmail.com Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Tested-by: Kyle Zeng <kylebot@openai.com> Reviewed-by: Pedro Tammela <pctammela@mojatatu.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Victor Nogueira <victor@mojatatu.com> Link: https://patch.msgid.link/20260531160812.68020-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
7 daysipvs: clear the svc scheduler ptr early on editJulian Anastasov1-2/+1
ip_vs_edit_service() while unbinding the old scheduler clears the svc->scheduler ptr after the scheduler module initiates RCU callbacks. This can cause packets to use the old scheduler at the time when svc->sched_data is already freed after RCU grace period. Fix it by clearing the ptr early in ip_vs_unbind_scheduler(), before the done_service method schedules any RCU callbacks. Also, if the new scheduler fails to initialize when replacing the old scheduler, try to restore the old scheduler while still returning the error code. Link: https://sashiko.dev/#/patchset/20260519015506.634185-1-rosenp%40gmail.com Fixes: 05f00505a89a ("ipvs: fix crash if scheduler is changed") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
10 daysMerge tag 'ipsec-2026-05-29' of ↵Jakub Kicinski1-1/+2
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2026-05-29 1) xfrm: route MIGRATE notifications to caller's netns Thread the caller's netns through km_migrate() so that MIGRATE notifications go to the issuing netns, fixing both the init_net listener leak and MOBIKE notifications inside non-init netns. From Maoyi Xie. 2) xfrm: ipcomp: Free destination pages on acomp errors Move the out_free_req label up so that allocated destination pages are released on decompression errors, not only on success. From Herbert Xu. 3) xfrm: Check for underflow in xfrm_state_mtu Reject configurations that cause xfrm_state_mtu() to underflow, preventing a negative TFCPAD value from becoming a memset size that triggers an out-of-bounds write of several terabytes. From David Ahern. 4) xfrm: ah: use skb_to_full_sk in async output callbacks Convert the possibly-incomplete skb->sk to a full socket pointer in async AH callbacks so that a request_sock or timewait_sock never reaches xfrm_output_resume() downstream consumers. From Michael Bommarito. 5) Add and revert: esp: fix page frag reference leak on skb_to_sgvec failure The patch does not fix te issue completely. 6) xfrm: esp: restore combined single-frag length gate Check the aligned post-trailer combined length against a page limit in the fast path, preventing skb_page_frag_refill() from falling back to a page too small for the destination scatterlist. From Jingguo Tan. 7) xfrm: iptfs: reset runtime state when cloning SAs Reinitialise the clone's mode_data runtime objects before publishing it, preventing queued skbs from being freed with list state copied from the original SA when migration fails. From Shaomin Chen. 8) xfrm: move policy_bydst RCU sync from per-netns .exit to .pre_exit Flush policy tables and drain the workqueue in a .pre_exit handler so that cleanup_net() pays one RCU grace period per batch instead of one per namespace, fixing stalls at high CLONE_NEWNET rates. From Usama Arif. 9) xfrm: input: hold netns during deferred transport reinjection Take a netns reference when queueing deferred transport reinjection work and drop it after the callback completes, keeping the skb->cb net pointer valid until the deferred work runs. From Zhengchuan Liang. * tag 'ipsec-2026-05-29' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: Revert "esp: fix page frag reference leak on skb_to_sgvec failure" xfrm: input: hold netns during deferred transport reinjection xfrm: move policy_bydst RCU sync from per-netns .exit to .pre_exit xfrm: iptfs: reset runtime state when cloning SAs xfrm: esp: restore combined single-frag length gate esp: fix page frag reference leak on skb_to_sgvec failure xfrm: ah: use skb_to_full_sk in async output callbacks xfrm: Check for underflow in xfrm_state_mtu xfrm: ipcomp: Free destination pages on acomp errors xfrm: route MIGRATE notifications to caller's netns ==================== Link: https://patch.msgid.link/20260529092648.3878973-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
11 daysMerge tag 'net-7.1-rc6' of ↵Linus Torvalds1-0/+7
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Paolo Abeni: "This is again significantly bigger than the same point into the previous cycle, but at least smaller than last week. I'm not aware of any pending regression for the current cycle. Including fixes from netfilter. Current release - regressions: - netfilter: walk fib6_siblings under RCU Previous releases - regressions: - netlink: fix sending unassigned nsid after assigned one - bridge: fix sleep in atomic context in netlink path - sched: fix ethx:ingress -> ethy:egress -> ethx:ingress mirred loop - ipv4: fix net->ipv4.sysctl_local_reserved_ports UaF - eth: tun: free page on short-frame rejection in tun_xdp_one() Previous releases - always broken: - skbuff: fix missing zerocopy reference in pskb_carve helpers - handshake: drain pending requests at net namespace exit - ethtool: - rss: avoid modifying the RSS context response - module: avoid leaking a netdev ref on module flash errors - coalesce: cap profile updates at NET_DIM_PARAMS_NUM_PROFILES - netfilter: fix dst corruption in same register operation - nfc: hci: fix out-of-bounds read in HCP header parsing - ipv6: exthdrs: refresh nh pointer after ipv6_hop_jumbo() - eth: - vti: use ip6_tnl.net in vti6_changelink(). - vxlan: do not reuse cached ip_hdr() value after skb_tunnel_check_pmtu()" * tag 'net-7.1-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (94 commits) dpll: zl3073x: make frequency monitor a per-device attribute dpll: zl3073x: use __dpll_device_change_ntf() and remove change_work dpll: export __dpll_device_change_ntf() for use under dpll_lock net/handshake: Drain pending requests at net namespace exit net/handshake: Verify file-reference balance in submit paths net/handshake: Close the submit-side sock_hold race net/handshake: hand off the pinned file reference to accept_doit net/handshake: Take a long-lived file reference at submit net/handshake: Pass negative errno through handshake_complete() nvme-tcp: store negative errno in queue->tls_err net/handshake: Use spin_lock_bh for hn_lock net: skbuff: fix missing zerocopy reference in pskb_carve helpers net: hibmcge: move dma_rmb() after dma_sync_single_for_cpu() in RX path net: hibmcge: disable Relaxed Ordering to fix RX packet corruption selftests/tc-testing: Add netem test case exercising loops selftests/tc-testing: Add mirred test cases exercising loops net/sched: act_mirred: Fix return code in early mirred redirect error paths net/sched: act_mirred: Fix blockcast recursion bypass leading to stack overflow net/sched: Fix ethx:ingress -> ethy:egress -> ethx:ingress mirred loop net/sched: fix packet loop on netem when duplicate is on ...
13 daysMerge tag 'mm-hotfixes-stable-2026-05-25-16-22' of ↵Linus Torvalds1-17/+2
git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "13 hotfixes. 9 are for MM. 9 are cc:stable and the remaining 4 address post-7.1 issues or aren't considered suitable for backporting. All patches are singletons - please see the individual changelogs for details" * tag 'mm-hotfixes-stable-2026-05-25-16-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: Revert "mm: introduce a new page type for page pool in page type" mm/vmalloc: do not trigger BUG() on BH disabled context MAINTAINERS, mailmap: change email for Eugen Hristev mm/migrate_device: fix pgtable leak in migrate_vma_insert_huge_pmd_page kernel/fork: validate exit_signal in kernel_clone() mm: memcontrol: propagate NMI slab stats to memcg vmstats mm/damon/sysfs-schemes: delete tried region in regions_rmdirs() mm/rmap: initialize nr_pages to 1 at loop start in try_to_unmap_one zram: fix use-after-free in zram_writeback_endio memfd: deny writeable mappings when implying SEAL_WRITE ipc: limit next_id allocation to the valid ID range Revert "mm/hugetlbfs: update hugetlbfs to use mmap_prepare" MAINTAINERS: .mailmap: update after GEHC spin-off
2026-05-22netfilter: nf_tables: fix dst corruption in same register operationFernando Fernandez Mancera1-0/+7
For lshift and rshift, the shift operations are performed in a loop over 32-bit words. The loop calculates the shifted value and write it to dst, and then immediately reads from src to calculate the carry for the next iteration. Because src and dst could point to the same memory location, the carry is incorrectly calculated using the newly modified dst value instead of the original src value. Adding a temporary local variable to cache the original value before writing to dst and using it for the carry calculation solves the problem. In addition, partial overlap is rejected from control plane for all kind of operations including byteorder. This was tested with the following bytecode: table test_table ip flags 0 use 1 handle 1 ip test_table test_chain use 3 type filter hook input prio 0 policy accept packets 0 bytes 0 flags 1 ip test_table test_chain 2 [ immediate reg 1 0x44332211 0x88776655 ] [ bitwise reg 1 = ( reg 1 << 0x08000000 ) ] [ cmp eq reg 1 0x66443322 0x00887766 ] [ counter pkts 0 bytes 0 ] ip test_table test_chain 4 3 [ immediate reg 1 0x44332211 0x88776655 ] [ bitwise reg 1 = ( reg 1 << 0x08000000 ) ] [ cmp eq reg 1 0x55443322 0x00887766 ] [ counter pkts 21794 bytes 1917798 ] Fixes: 567d746b55bc ("netfilter: bitwise: add support for shifts.") Acked-by: Jeremy Sowden <jeremy@azazel.net> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Signed-off-by: Florian Westphal <fw@strlen.de>
2026-05-21Revert "mm: introduce a new page type for page pool in page type"Byungchul Park1-17/+2
This reverts commit db359fccf212 ("mm: introduce a new page type for page pool in page type") and a part of 735a309b4bfb9e ("net: add net_iov_init() and use it to initialize ->page_type"). Netpp page_type'ed pages might be used in mapping so as to use @_mapcount. However, since @page_type and @_mapcount are union'ed in struct page, these two can't be used at the same time. Revert the commit introducing page_type for Netpp for now. The patch will be retried once @page_type and @_mapcount get allowed to be used at the same time. The revert also includes removal of @page_type initialization part introduced by commit 735a309b4bfb9e ("net: add net_iov_init() and use it to initialize ->page_type"), which will be restored on the retry. Link: https://lore.kernel.org/20260515034701.17027-1-byungchul@sk.com Fixes: db359fccf212 ("mm: introduce a new page type for page pool in page type") Signed-off-by: Byungchul Park <byungchul@sk.com> Reported-by: Dragos Tatulea <dtatulea@nvidia.com> Closes: https://lore.kernel.org/all/982b9bc1-0a0a-4fc5-8e3a-3672db2b29a1@nvidia.com Acked-by: Jakub Kicinski <kuba@kernel.org> Acked-by: David Hildenbrand (Arm) <david@kernel.org> Acked-by: Harry Yoo (Oracle) <harry@kernel.org> Reviewed-by: Lorenzo Stoakes <ljs@kernel.org> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Brendan Jackman <jackmanb@google.com> Cc: David S. Miller <davem@davemloft.net> Cc: Eric Dumazet <edumazet@google.com> Cc: Ilias Apalodimas <ilias.apalodimas@linaro.org> Cc: Jesper Dangaard Brouer <hawk@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: John Fastabend <john.fastabend@gmail.com> Cc: Leon Romanovsky <leon@kernel.org> Cc: Liam R. Howlett <liam@infradead.org> Cc: Mark Bloch <mbloch@nvidia.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Paolo Abeni <pabeni@redhat.com> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: Saeed Mahameed <saeedm@nvidia.com> Cc: Simon Horman <horms@kernel.org> Cc: Stanislav Fomichev <sdf@fomichev.me> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Tariq Toukan <tariqt@nvidia.com> Cc: Toke Hoiland-Jorgensen <toke@redhat.com> Cc: Vlastimil Babka <vbabka@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-05-20tcp: fix stale per-CPU tcp_tw_isn leak enabling ISN predictionEric Dumazet1-3/+4
Blamed commit moved the TIME_WAIT-derived ISN from the skb control block to a per-CPU variable, assuming the value would always be consumed by tcp_conn_request() for the same packet that wrote it. That assumption is violated by multiple drop paths between the producer (__this_cpu_write(tcp_tw_isn, isn) in tcp_v{4,6}_rcv()) and the consumer (tcp_conn_request()): - min_ttl / min_hopcount check - xfrm policy check - tcp_inbound_hash() MD5/AO mismatch - tcp_filter() eBPF/SO_ATTACH_FILTER drop - th->syn && th->fin discard in tcp_rcv_state_process() TCP_LISTEN - psp_sk_rx_policy_check() in tcp_v{4,6}_do_rcv() - tcp_checksum_complete() in tcp_v{4,6}_do_rcv() - tcp_v{4,6}_cookie_check() returning NULL When a packet is dropped on any of these paths, tcp_tw_isn is left set. The next SYN processed on the same CPU then consumes the non zero value in tcp_conn_request(), receiving a potentially predictable ISN. This patch moves back tcp_tw_isn to skb->cb[], getting rid of the per-cpu variable. Note that tcp_v{4,6}_fill_cb() do not set it. Very litle impact on overall code size/complexity: $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 2/1 up/down: 8/-15 (-7) Function old new delta tcp_v6_rcv 3038 3042 +4 tcp_v4_rcv 3035 3039 +4 tcp_conn_request 2938 2923 -15 Total: Before=24436060, After=24436053, chg -0.00% Fixes: 41eecbd712b7 ("tcp: replace TCP_SKB_CB(skb)->tcp_tw_isn with a per-cpu field") Reported-by: Chris Mason <clm@meta.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260519084611.2485277-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-20net: shaper: rework the VALID marking (again)Jakub Kicinski1-0/+1
Recent commit changed the semantics from NOT_VALID to VALID. I didn't realize that the flags are not stored atomically with the entry in XArray. There's still a race of reader observing a VALID mark for a slot, getting interrupted, writer replacing the entry with a different one, reader continuing, fetching the entry which is now a different pointer than the pointer for which VALID was meant. The biggest consequence of this is that we may see a UAF since net_shaper_rollback() assumed that entries without VALID can be freed without observing RCU. Looks like the XArray marks are buying us nothing at this point. Let's convert the code to an explicit valid field. The smp_load_acquire() / smp_store_release() barriers are marginally cleaner. Reported-by: Sashiko <sashiko-bot@kernel.org> Fixes: 93954b40f6a4 ("net-shapers: implement NL set and delete operations") Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20260515221325.1685455-3-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-18Merge tag 'nf-26-05-16' of ↵Jakub Kicinski3-3/+9
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter/IPVS fixes for net The following patchset contains Netfilter/IPVS fixes for net: 1) Fix small race windows in nf_ct_helper_log() when accessing helper, from Florian Westphal. 2) Fix potential infinite loop and race conditions in IPVS caused by frequent user-triggered service table changes, from Julia Anastasov. 3) Fix a race condition when dumping ipsets for restore, from Jozsef Kadlecsik. 4) Fix inner transport offset in IPv6 in nft_inner when extension headers come before the layer 4 transport header, from Yizhou Zhao. 5) Fix incorrect iteration over IPv4 ranges in several hash set types, from Nan Li. 6) Fix incorrect order when restoring BH in nft_inner_restore_tun_ctx(), from Florian Westphal. 7) Validate option array from ip6t_hbh checkpath() to fix an off-by-one access, from Zhengchuan Liang. 8) Fix race condition between ipset list -terse and concurrent updates, from Jozsef Kadlecisk. 9) Fix race condition when inserting elements into a hash bucket, also from Jozsef. 10) Annotate access to first free slot in hashtable, from Jozsef Kadlecsik. 11) Ensure sufficient headroom in br_netfilter neigh transmission, from Lorenzo Bianconi. 12) Hold reference on skb->dev in nfqueue exit path, bridge local input is speciall since skb->dev != state->indev, allowing for net_device to go away while packet is sitting in nfqueue. From Haoze Xie. * tag 'nf-26-05-16' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nf_queue: hold bridge skb->dev while queued netfilter: br_netfilter: Reallocate headroom if necessary in neigh_hh_bridge() netfilter: ipset: annotate "pos" for concurrent readers/writers netfilter: ipset: Fix data race between add and dump in all hash types netfilter: ipset: Fix data race between add and list header in all hash types netfilter: ip6t_hbh: reject oversized option lists netfilter: nft_inner: release local_lock before re-enabling softirqs netfilter: ipset: stop hash:* range iteration at end netfilter: nft_inner: Fix IPv6 inner_thoff desync netfilter: ipset: fix a potential dump-destroy race ipvs: avoid possible loop in ip_vs_dst_event on resizing netfilter: nf_conntrack_helper: fix possible null deref during error log ==================== Link: https://patch.msgid.link/20260516115627.967773-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-16netfilter: nf_queue: hold bridge skb->dev while queuedHaoze Xie1-0/+1
br_pass_frame_up() rewrites skb->dev from the ingress port to the bridge master before queueing bridge LOCAL_IN packets. NFQUEUE only holds references on state.in/out and bridge physdevs, so a queued bridge packet can retain a freed bridge master in skb->dev until reinjection. When the verdict is reinjected later, br_netif_receive_skb() re-enters the receive path with skb->dev still pointing at the freed bridge master, triggering a use-after-free. Store skb->dev in the queue entry, hold a reference on it for the queue lifetime, and use the saved device when dropping queued packets during NETDEV_DOWN handling. Fixes: ac2863445686 ("netfilter: bridge: add nf_afinfo to enable queuing to userspace") Cc: stable@kernel.org Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Haoze Xie <royenheart@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-05-16netfilter: br_netfilter: Reallocate headroom if necessary in neigh_hh_bridge()Lorenzo Bianconi1-2/+6
neigh_hh_bridge() assumes the skb always has sufficient headroom to copy the aligned L2 header. This assumption can trigger the crash reported below using the following netfilter setup: $modprobe br_netfilter $sysctl -w net.bridge.bridge-nf-call-iptables=1 $root@OpenWrt:~# nft list ruleset table ip nat { chain prerouting { type nat hook prerouting priority dstnat; policy accept; ip daddr 192.168.83.123 dnat to 192.168.83.120 } } - iperf3 client (192.168.83.119) --> bridge (192.168.83.118) --> iperf3 server (192.168.83.120) the iperf3 client is sending packet for 192.168.83.123 to the bridge device. [ 1579.036575] Unable to handle kernel write to read-only memory at virtual address ffffff8004d76ffe [ 1579.045482] Mem abort info: [ 1579.048273] ESR = 0x000000009600004f [ 1579.052024] EC = 0x25: DABT (current EL), IL = 32 bits [ 1579.057363] SET = 0, FnV = 0 [ 1579.060417] EA = 0, S1PTW = 0 [ 1579.063550] FSC = 0x0f: level 3 permission fault [ 1579.068345] Data abort info: [ 1579.071224] ISV = 0, ISS = 0x0000004f, ISS2 = 0x00000000 [ 1579.076720] CM = 0, WnR = 1, TnD = 0, TagAccess = 0 [ 1579.081770] GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0 [ 1579.087092] swapper pgtable: 4k pages, 39-bit VAs, pgdp=0000000080dc4000 [ 1579.093794] [ffffff8004d76ffe] pgd=180000009ffff003, p4d=180000009ffff003, pud=180000009ffff003, pmd=180000009ffe3003, pte=0060000084d76787 [ 1579.106343] Internal error: Oops: 000000009600004f [#1] SMP [ 1579.193824] CPU: 0 UID: 0 PID: 235 Comm: napi/qdma_eth-3 Tainted: G O 6.12.57 #0 [ 1579.202614] Tainted: [O]=OOT_MODULE [ 1579.206102] Hardware name: Airoha AN7581 Evaluation Board (DT) [ 1579.211929] pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 1579.218889] pc : br_nf_pre_routing_finish_bridge+0x1ac/0xcc8 [br_netfilter] [ 1579.225859] lr : br_nf_pre_routing_finish_bridge+0x18c/0xcc8 [br_netfilter] [ 1579.232822] sp : ffffffc0817cba20 [ 1579.236128] x29: ffffffc0817cba20 x28: 0000000000000000 x27: ffffff8002b89000 [ 1579.243273] x26: ffffff8004d7700e x25: 0000000000000008 x24: 0000000000000000 [ 1579.250416] x23: ffffffc08179d4c0 x22: 0000000000000000 x21: ffffffc08179d4c0 [ 1579.257561] x20: ffffff8004d9b800 x19: ffffff8015010000 x18: 0000000000000014 [ 1579.264704] x17: ffffffbf9e930000 x16: ffffffc0817c8000 x15: 0000000000000070 [ 1579.271848] x14: 0000000000000080 x13: 0000000000000001 x12: 0000000000000000 [ 1579.278993] x11: ffffffc0798caae0 x10: ffffff8014db6fd8 x9 : 0000000000000000 [ 1579.286136] x8 : 0000000000000003 x7 : ffffffc08171f628 x6 : 000000001a3b83d3 [ 1579.293281] x5 : 0000000000000000 x4 : 1beb76f22fee0000 x3 : ffffff8004d7700e [ 1579.300425] x2 : 0000000000000000 x1 : ffffff8004d9b8bc x0 : ffffff80026ed000 [ 1579.307570] Call trace: [ 1579.310018] br_nf_pre_routing_finish_bridge+0x1ac/0xcc8 [br_netfilter] [ 1579.316632] br_nf_hook_thresh+0xd4/0x14bc [br_netfilter] [ 1579.322032] br_nf_hook_thresh+0x250/0x14bc [br_netfilter] [ 1579.327517] br_nf_hook_thresh+0x76c/0x14bc [br_netfilter] [ 1579.333003] br_handle_frame+0x180/0x480 [ 1579.336935] __netif_receive_skb_core.constprop.0+0x540/0xf40 [ 1579.342682] __netif_receive_skb_one_core+0x28/0x50 [ 1579.347561] process_backlog+0x98/0x1e0 [ 1579.351398] __napi_poll+0x34/0x1c4 [ 1579.354887] net_rx_action+0x178/0x330 [ 1579.358638] handle_softirqs+0x108/0x2d4 [ 1579.362560] __do_softirq+0x10/0x18 [ 1579.366051] ____do_softirq+0xc/0x20 [ 1579.369627] call_on_irq_stack+0x30/0x4c [ 1579.373550] do_softirq_own_stack+0x18/0x20 [ 1579.377734] do_softirq+0x4c/0x60 [ 1579.381050] __local_bh_enable_ip+0x88/0x98 [ 1579.385234] napi_threaded_poll_loop+0x188/0x21c [ 1579.389853] napi_threaded_poll+0x70/0x80 [ 1579.393863] kthread+0xd8/0xdc [ 1579.396918] ret_from_fork+0x10/0x20 [ 1579.400499] Code: 88dffc22 3707ffc2 f9406663 f9406684 (f81f0064) [ 1579.406589] ---[ end trace 0000000000000000 ]--- [ 1579.411209] Kernel panic - not syncing: Oops: Fatal exception in interrupt [ 1579.418083] SMP: stopping secondary CPUs [ 1579.422012] Kernel Offset: disabled Fix the issue reallocating the skb headroom if necessary in neigh_hh_bridge routine. Fixes: e179e6322ac33 ("netfilter: bridge-netfilter: Fix MAC header handling with IP DNAT") Reviewed-by: Ido Schimmel <idosch@nvidia.com> Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-05-16ipvs: avoid possible loop in ip_vs_dst_event on resizingJulian Anastasov1-1/+2
Sashiko points out that unprivileged user can frequently call ip_vs_flush() or ip_vs_del_service() to trigger svc_table_changes updates that can lead to infinite loop in ip_vs_dst_event(). This can also happen if the user triggers frequent table resizing without deleting all services. We should also consider the possible effects if the user triggers many NETDEV_DOWN events. One way to solve it is to hold svc_resize_sem in ip_vs_dst_event() but this can block the dev notifier during the whole resizing process. Instead, use new rw_semaphore svc_replace_sem to protect just the svc_table replacement which is a short code section. Then hold svc_replace_sem in ip_vs_dst_event() to serialize with replacing the svc_table. As result, loop is avoided as there is no need to repeat the table walking from the start. By this way changes in svc_table_changes can happen only when all services are removed and all dev references dropped which allows us to abort the table walking. As IP_VS_WORK_SVC_NORESIZE is the flag used to stop the svc_resize_work under service_mutex, we should check only this flag often but not while under service_mutex. To remove the mutex_trylock() for service_mutex in the second phase where the resizer installs the new table after rehashing, we will avoid holding the service_mutex there. As result, the code in configuration context which is under service_mutex should access ipvs->svc_table under RCU because it can be replaced at anytime and released after a RCU grace period. As for ip_vs_zero_all(), it needs different solution as a table walker which can escape single RCU read-side critical section: to hold the svc_replace_sem to prevent table to be replaced. In ip_vs_status_show() prefer to hold svc_replace_sem to avoid many loops, just detect if the svc_table is removed. Prefer the newly attached table for the u_thresh/l_thresh checks to know when to grow/shrink while adding or deleting services because the new table size is based on the latest parameters. Link: https://sashiko.dev/#/patchset/20260505001648.360569-1-pablo%40netfilter.org Fixes: 840aac3d900d ("ipvs: use resizable hash table for services") Signed-off-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2026-05-14Bluetooth: serialize accept_q accessJiexun Wang1-0/+1
bt_sock_poll() walks the accept queue without synchronization, while child teardown can unlink the same socket and drop its last reference. The unsynchronized accept queue walk has existed since the initial Bluetooth import. Protect accept_q with a dedicated lock for queue updates and polling. Also rework bt_accept_dequeue() to take temporary child references under the queue lock before dropping it and locking the child socket. Fixes: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 ("Linux-2.6.12-rc2") Cc: stable@vger.kernel.org Reported-by: Jann Horn <jannh@google.com> Reported-by: Yuan Tan <yuantan098@gmail.com> Reported-by: Yifan Wu <yifanwucs@gmail.com> Reported-by: Juefei Pu <tomapufckgml@gmail.com> Reported-by: Xin Liu <bird@lzu.edu.cn> Signed-off-by: Jiexun Wang <wangjiexun2025@gmail.com> Signed-off-by: Ren Wei <n05ec@lzu.edu.cn> Signed-off-by: Jiexun Wang <wangjiexun2025@gmail.com> Reviewed-by: Jann Horn <jannh@google.com> Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
2026-05-13macsec: use rcu_work to defer TX SA crypto cleanup out of softirqJinliang Zheng1-1/+2
free_txsa() is an RCU callback running in softirq context, but calls crypto_free_aead() which can invoke vunmap() internally on hardware crypto drivers (e.g. hisi_sec2), triggering a kernel crash. Use rcu_work to defer the cleanup to a workqueue, for the same reasons as the analogous fix to free_rxsa() in the previous patch. Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver") Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20260511153102.2640368-4-alexjlzheng@tencent.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-13macsec: use rcu_work to defer RX SA crypto cleanup out of softirqJinliang Zheng1-1/+3
crypto_free_aead() can internally invoke vunmap() (e.g. via dma_free_attrs() in hardware crypto drivers such as hisi_sec2). vunmap() must not be called from softirq context, but free_rxsa() is an RCU callback that runs in softirq, leading to a kernel crash: vunmap+0x4c/0x70 __iommu_dma_free+0xd0/0x138 dma_free_attrs+0xf4/0x100 sec_aead_exit+0x64/0xb8 [hisi_sec2] crypto_destroy_tfm+0x98/0x110 free_rxsa+0x28/0x50 [macsec] rcu_do_batch+0x184/0x460 rcu_core+0xf4/0x1f8 handle_softirqs+0x118/0x330 Use rcu_work to defer the cleanup to a workqueue. rcu_work dispatches the worker asynchronously after the RCU grace period, so no thread blocks waiting, and concurrent releases of multiple SAs naturally share the same grace period. Fixes: c09440f7dcb3 ("macsec: introduce IEEE 802.1AE driver") Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20260511153102.2640368-3-alexjlzheng@tencent.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-08Merge tag 'nf-26-05-08' of ↵Jakub Kicinski1-1/+4
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following batch contains Netfilter fixes for net: 1) Allow initial x_tables table replacement without emitting an audit log message. Delay the register message until after hooks are wired up to avoid unnecessary unregister logs during error unwinding. 2) Fix a NULL dereference by allocating hook ops before adding the table to the per-netns list. Use `synchronize_rcu()` during error unwinding to ensure the table stops processing packets before teardown. Defer audit log register message until all operations succeed. 3) Refactor xtables to use a single `xt_unregister_table_pre_exit` function. Eliminate code duplication by centralizing table unregistration logic within the xtables core. ebtables cannot be changed due to incompatibility. 4) Unregister xtables templates before module removal. This prevents a race condition where userspace instantiates a new table after the pernet unreg removed the current table. 5) Add `xtables_unregister_table_exit` to fully unregister netfilter tables during module removal. Unlink the table from dying lists, then free hook operations. 6) Implement a two-stage removal scheme for ebtables following the x_tables pattern. Assign table->ops while holding the ebt mutex to prevent exposing partially-filled structures. 7) Fix ebtables module initialization race. Register the template last in table initialization functions. Prevent table instantiation before pernet operations are available. 8) Fix a race condition in x_tables module initialization. Ensure pernet ops are fully set up before exposing the table to userspace. 9) Fix a race condition in ebtables module initialization, similar to previous patch. 10) Restore propagation of helper to expected connection, this is a fix-for-recent-fix. 11) Validate that the expectation tuple and mask netlink attributes are present when adding expectation via nfqueue, this fixes a possible null-ptr-deref. 12) Fix possible rare memleak in the SIP helper in case helper has been detached from conntrack entry, from Li Xiasong. 13) Fix refcount leak in nft_ct when creating custom expectation, also from Li Xiason. Patches 1-9 from Florian Westphal. 10) Restore propagation of helper to expected connection, this is a fix-for-recent-fix. 11) Check that tuple and mask netlink attributes are set when creating an expectation via nfqueue. * tag 'nf-26-05-08' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf: netfilter: nft_ct: fix missing expect put in obj eval netfilter: nf_conntrack_sip: get helper before allocating expectation netfilter: ctnetlink: check tuple and mask in expectations created via nfqueue netfilter: nf_conntrack_expect: restore helper propagation via expectation netfilter: bridge: eb_tables: close module init race netfilter: x_tables: close dangling table module init race netfilter: ebtables: close dangling table module init race netfilter: ebtables: move to two-stage removal scheme netfilter: x_tables: add and use xtables_unregister_table_exit netfilter: x_tables: unregister the templates first netfilter: x_tables: add and use xt_unregister_table_pre_exit netfilter: x_tables: allocate hook ops while under mutex netfilter: x_tables: allow initial table replace without emitting audit log message ==================== Link: https://patch.msgid.link/20260507234509.603182-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-08genetlink: free the skb on 'group >= family->n_mcgrps'Alice Ryhl1-1/+3
These methods generally consume ownership of the provided skb, so even if an error path is encountered, the skb is freed. This is because the very first thing they do after some initial setup is to unconditionally consume the skb via consume_skb(skb). Any subsequent errors lead to the core netlink layer freeing the skb. However, there is one check that occurs before ownership is passed, which is the check for the group index. So if this error condition is encountered, then the skb is leaked. This error condition is generally considered a violation of the netlink API, so it's not expected to occur under normal circumstances. For the same reason, no callers check for this error condition, and no callers need to be adjusted. However, we should still follow the same ownership semantics of the rest of the function. Thus, free the skb in this codepath. Suggested-by: Andrew Lunn <andrew@lunn.ch> Suggested-by: Matthew Maurer <mmaurer@google.com> Fixes: 2a94fe48f32c ("genetlink: make multicast groups const, prevent abuse") Link: https://lore.kernel.org/r/845b36ba-7b3a-41f2-acb2-b284f253e2ca@lunn.ch Signed-off-by: Alice Ryhl <aliceryhl@google.com> Link: https://patch.msgid.link/20260506-genlmsg-return-v2-1-a63ee2a055d6@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-08net: nsh: fix incorrect header length macrosIlya Maximets1-2/+2
NSH header length is a 6-bit field that encodes the total length of the header in 4-byte words. So the maximum length is 0b111111 * 4, which is 252 and not 256. The maximum context length is the same number minus the length of the base header (8), so 244. These macros are used to validate push_nsh() action in openvswitch. Miscalculation here doesn't cause any real issues. In the worst case the oversized context is truncated while building the header, so we'll construct and send a broken packet, which is not a big problem, as any receiver should validate the fields. No invalid memory accesses will happen during the header push. But we should fix the macros to reject the incorrect actions in the first place. Using previously defined values and calculating the length instead of defining numbers directly, so it's easier to understand where they come from and harder to make a mistake. Fixes: 1f0b7744c505 ("net: add NSH header structures and helpers") Signed-off-by: Ilya Maximets <i.maximets@ovn.org> Reviewed-by: Aaron Conole <aconole@redhat.com> Link: https://patch.msgid.link/20260507120434.2962505-1-i.maximets@ovn.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-08ipv6: flowlabel: enforce per-netns limit for unprivileged callersMaoyi Xie1-0/+1
fl_size, fl_ht and ip6_fl_lock in net/ipv6/ip6_flowlabel.c are file scope and shared across netns. mem_check() reads fl_size to decide whether to deny non-CAP_NET_ADMIN callers. capable() runs against init_user_ns, so an unprivileged user in any non-init userns can push fl_size past FL_MAX_SIZE - FL_MAX_SIZE / 4 and starve every other unprivileged userns on the host. Add struct netns_ipv6::flowlabel_count, bumped and decremented next to fl_size in fl_intern, ip6_fl_gc and ip6_fl_purge. The new field fills the existing 4-byte hole after ipmr_seq, so struct netns_ipv6 stays the same size on 64-bit builds. Bump FL_MAX_SIZE from 4096 to 8192. It has been 4096 since the file was added. Machines and connection counts have grown. mem_check() folds an extra per-netns ceiling into the existing non-CAP_NET_ADMIN conditional. The ceiling is half of the total budget that unprivileged callers have ever been able to use, i.e. (FL_MAX_SIZE - FL_MAX_SIZE / 4) / 2 = 3072 entries. With FL_MAX_SIZE doubled, this preserves the original per-user reach of 3K (what an unprivileged caller could already obtain before this change), while forcing an attacker to spread allocations across at least two netns to exhaust the global non-CAP_NET_ADMIN budget. CAP_NET_ADMIN against init_user_ns still bypasses both caps. The previous patch took ip6_fl_lock across mem_check and fl_intern, so the new flowlabel_count read in mem_check and the new flowlabel_count++ in fl_intern run under the same critical section. flowlabel_count is therefore plain int, like fl_size. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Suggested-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Cc: stable@vger.kernel.org # v5.15+ Signed-off-by: Maoyi Xie <maoyi.xie@ntu.edu.sg> Link: https://patch.msgid.link/20260506082416.2259567-3-maoyixie.tju@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-05-08netfilter: nf_conntrack_expect: restore helper propagation via expectationPablo Neira Ayuso