aboutsummaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2025-11-07net: allow skb_release_head_state() to be called multiple timesEric Dumazet1-4/+3
Currently, only skb dst is cleared (thanks to skb_dst_drop()) Make sure skb->destructor, conntrack and extensions are cleared. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://patch.msgid.link/20251106202935.1776179-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07net: add prefetch() in skb_defer_free_flush()Eric Dumazet1-0/+1
skb_defer_free_flush() is becoming more important these days. Add a prefetch operation to reduce latency a bit on some platforms like AMD EPYC 7B12. On more recent cpus, a stall happens when reading skb_shinfo(). Avoiding it will require a more elaborate strategy. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Paolo Abeni <pabeni@redhat.com> Link: https://patch.msgid.link/20251106085500.2438951-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07psp: add stats from psp spec to driver facing apiJakub Kicinski2-2/+22
Provide a driver api for reporting device statistics required by the "Implementation Requirements" section of the PSP Architecture Specification. Use a warning to ensure drivers report stats required by the spec. Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20251106002608.1578518-4-daniel.zahka@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07psp: report basic stats from the coreJakub Kicinski4-1/+98
Track and report stats common to all psp devices from the core. A 'stale-event' is when the core marks the rx state of an active psp_assoc as incapable of authenticating psp encapsulated data. Signed-off-by: Daniel Zahka <daniel.zahka@gmail.com> Link: https://patch.msgid.link/20251106002608.1578518-2-daniel.zahka@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07tcp: add net.ipv4.tcp_comp_sack_rtt_percentEric Dumazet3-8/+28
TCP SACK compression has been added in 2018 in commit 5d9f4262b7ea ("tcp: add SACK compression"). It is working great for WAN flows (with large RTT). Wifi in particular gets a significant boost _when_ ACK are suppressed. Add a new sysctl so that we can tune the very conservative 5 % value that has been used so far in this formula, so that small RTT flows can benefit from this feature. delay = min ( 5 % of RTT, 1 ms) This patch adds new tcp_comp_sack_rtt_percent sysctl to ease experiments and tuning. Given that we cap the delay to 1ms (tcp_comp_sack_delay_ns sysctl), set the default value to 33 %. Quoting Neal Cardwell ( https://lore.kernel.org/netdev/CADVnQymZ1tFnEA1Q=vtECs0=Db7zHQ8=+WCQtnhHFVbEOzjVnQ@mail.gmail.com/ ) The rationale for 33% is basically to try to facilitate pipelining, where there are always at least 3 ACKs and 3 GSO/TSO skbs per SRTT, so that the path can maintain a budget for 3 full-sized GSO/TSO skbs "in flight" at all times: + 1 skb in the qdisc waiting to be sent by the NIC next + 1 skb being sent by the NIC (being serialized by the NIC out onto the wire) + 1 skb being received and aggregated by the receiver machine's aggregation mechanism (some combination of LRO, GRO, and sack compression) Note that this is basically the same magic number (3) and the same rationales as: (a) tcp_tso_should_defer() ensuring that we defer sending data for no longer than cwnd/tcp_tso_win_divisor (where tcp_tso_win_divisor = 3), and (b) bbr_quantization_budget() ensuring that cwnd is at least 3 GSO/TSO skbs to maintain pipelining and full throughput at low RTTs Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20251106115236.3450026-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07strparser: Fix signed/unsigned mismatch bugNate Karstens1-1/+1
The `len` member of the sk_buff is an unsigned int. This is cast to `ssize_t` (a signed type) for the first sk_buff in the comparison, but not the second sk_buff. On 32-bit systems, this can result in an integer underflow for certain values because unsigned arithmetic is being used. This appears to be an oversight: if the intention was to use unsigned arithmetic, then the first cast would have been omitted. The change ensures both len values are cast to `ssize_t`. The underflow causes an issue with ktls when multiple TLS PDUs are included in a single TCP segment. The mainline kernel does not use strparser for ktls anymore, but this is still useful for other features that still use strparser, and for backporting. Signed-off-by: Nate Karstens <nate.karstens@garmin.com> Cc: stable@vger.kernel.org Fixes: 43a0c6751a32 ("strparser: Stream parser for messages") Reviewed-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Sabrina Dubroca <sd@queasysnail.net> Link: https://patch.msgid.link/20251106222835.1871628-1-nate.karstens@garmin.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07tcp: Remove timeout arg from reqsk_timeout().Kuniyuki Iwashima2-3/+4
reqsk_timeout() is always called with @timeout being TCP_RTO_MAX. Let's remove the arg. As a prep for the next patch, reqsk_timeout() is moved to tcp.h and renamed to tcp_reqsk_timeout(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251106003357.273403-5-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07tcp: Remove redundant init for req->num_timeout.Kuniyuki Iwashima1-2/+0
Commit 5903123f662e ("tcp: Use BPF timeout setting for SYN ACK RTO") introduced req->timeout and initialised it in 3 places: 1. reqsk_alloc() sets 0 2. inet_reqsk_alloc() sets TCP_TIMEOUT_INIT 3. tcp_conn_request() sets tcp_timeout_init() 1. has been always redundant as 2. overwrites it immediately. 2. was necessary for TFO SYN+ACK but is no longer needed after commit 8ea731d4c2ce ("tcp: Make SYN ACK RTO tunable by BPF programs with TFO"). 3. was moved to reqsk_queue_hash_req() in the previous patch. Now, we always initialise req->timeout just before scheduling the SYN+ACK timer: * For non-TFO SYN+ACK : reqsk_queue_hash_req() * For TFO SYN+ACK : tcp_fastopen_create_child() Let's remove the redundant initialisation of req->timeout in reqsk_alloc() and inet_reqsk_alloc(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251106003357.273403-4-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07tcp: Remove timeout arg from reqsk_queue_hash_req().Kuniyuki Iwashima2-15/+10
inet_csk_reqsk_queue_hash_add() is no longer shared by DCCP. We do not need to pass req->timeout down to reqsk_queue_hash_req(). Let's move tcp_timeout_init() from tcp_conn_request() to reqsk_queue_hash_req(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251106003357.273403-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-07tcp: Call tcp_syn_ack_timeout() directly.Kuniyuki Iwashima4-5/+4
Since DCCP has been removed, we do not need to use request_sock_ops.syn_ack_timeout(). Let's call tcp_syn_ack_timeout() directly. Now other function pointers of request_sock_ops are protocol-dependent. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251106003357.273403-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-06xsk: Move NETDEV_XDP_ACT_ZC into generic headerDaniel Borkmann1-5/+1
Move NETDEV_XDP_ACT_ZC into xdp_sock_drv.h header such that external code can reuse it, and rename it into more generic NETDEV_XDP_ACT_XSK. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Co-developed-by: David Wei <dw@davidwei.uk> Signed-off-by: David Wei <dw@davidwei.uk> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://patch.msgid.link/20251031212103.310683-7-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-06net: dsa: add tagging driver for MaxLinear GSW1xx switch familyDaniel Golle3-0/+125
Add support for a new DSA tagging protocol driver for the MaxLinear GSW1xx switch family. The GSW1xx switches use a proprietary 8-byte special tag inserted between the source MAC address and the EtherType field to indicate the source and destination ports for frames traversing the CPU port. Implement the tag handling logic to insert the special tag on transmit and parse it on receive. Signed-off-by: Daniel Golle <daniel@makrotopia.org> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@siemens.com> Tested-by: Alexander Sverdlin <alexander.sverdlin@siemens.com> Link: https://patch.msgid.link/0e973ebfd9433c30c96f50670da9e9449a0d98f2.1762170107.git.daniel@makrotopia.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-06devlink: Add new "max_mac_per_vf" generic device paramMohammad Heib1-0/+5
Add a new device generic parameter to controls the maximum number of MAC filters allowed per VF. For example, to limit a VF to 3 MAC addresses: $ devlink dev param set pci/0000:3b:00.0 name max_mac_per_vf \ value 3 \ cmode runtime Signed-off-by: Mohammad Heib <mheib@redhat.com> Reviewed-by: Simon Horman <horms@kernel.org> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
2025-11-06wifi: mac80211: reject address change while connectingJohannes Berg1-3/+11
While connecting, the MAC address can already no longer be changed. The change is already rejected if netif_carrier_ok(), but of course that's not true yet while connecting. Check for auth_data or assoc_data, so the MAC address cannot be changed. Also more comprehensively check that there are no stations on the interface being changed - if any peer station is added it will know about our address already, so we cannot change it. Cc: stable@vger.kernel.org Fixes: 3c06e91b40db ("wifi: mac80211: Support POWERED_ADDR_CHANGE feature") Link: https://patch.msgid.link/20251105154119.f9f6c1df81bb.I9bb3760ede650fb96588be0d09a5a7bdec21b217@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-11-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski20-74/+180
Cross-merge networking fixes after downstream PR (net-6.18-rc5). Conflicts: drivers/net/wireless/ath/ath12k/mac.c 9222582ec524 ("Revert "wifi: ath12k: Fix missing station power save configuration"") 6917e268c433 ("wifi: ath12k: Defer vdev bring-up until CSA finalize to avoid stale beacon") https://lore.kernel.org/11cece9f7e36c12efd732baa5718239b1bf8c950.camel@sipsolutions.net Adjacent changes: drivers/net/ethernet/intel/Kconfig b1d16f7c0063 ("libie: depend on DEBUG_FS when building LIBIE_FWLOG") 93f53db9f9dc ("ice: switch to Page Pool") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-06Merge tag 'net-6.18-rc5' of ↵Linus Torvalds19-73/+178
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net Pull networking fixes from Jakub Kicinski: Including fixes from bluetooth and wireless. Current release - new code bugs: - ptp: expose raw cycles only for clocks with free-running counter - bonding: fix null-deref in actor_port_prio setting - mdio: ERR_PTR-check regmap pointer returned by device_node_to_regmap() - eth: libie: depend on DEBUG_FS when building LIBIE_FWLOG Previous releases - regressions: - virtio_net: fix perf regression due to bad alignment of virtio_net_hdr_v1_hash - Revert "wifi: ath10k: avoid unnecessary wait for service ready message" caused regressions for QCA988x and QCA9984 - Revert "wifi: ath12k: Fix missing station power save configuration" caused regressions for WCN7850 - eth: bnxt_en: shutdown FW DMA in bnxt_shutdown(), fix memory corruptions after kexec Previous releases - always broken: - virtio-net: fix received packet length check for big packets - sctp: fix races in socket diag handling - wifi: add an hrtimer-based delayed work item to avoid low granularity of timers set relatively far in the future, and use it where it matters (e.g. when performing AP-scheduled channel switch) - eth: mlx5e: - correctly propagate error in case of module EEPROM read failure - fix HW-GRO on systems with PAGE_SIZE == 64kB - dsa: b53: fixes for tagging, link configuration / RMII, FDB, multicast - phy: lan8842: implement latest errata" * tag 'net-6.18-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (63 commits) selftests/vsock: avoid false-positives when checking dmesg net: bridge: fix MST static key usage net: bridge: fix use-after-free due to MST port state bypass lan966x: Fix sleeping in atomic context bonding: fix NULL pointer dereference in actor_port_prio setting net: dsa: microchip: Fix reserved multicast address table programming net: wan: framer: pef2256: Switch to devm_mfd_add_devices() net: libwx: fix device bus LAN ID net/mlx5e: SHAMPO, Fix header formulas for higher MTUs and 64K pages net/mlx5e: SHAMPO, Fix skb size check for 64K pages net/mlx5e: SHAMPO, Fix header mapping for 64K pages net: ti: icssg-prueth: Fix fdb hash size configuration net/mlx5e: Fix return value in case of module EEPROM read error net: gro_cells: Reduce lock scope in gro_cell_poll libie: depend on DEBUG_FS when building LIBIE_FWLOG wifi: mac80211_hwsim: Limit destroy_on_close radio removal to netgroup netpoll: Fix deadlock in memory allocation under spinlock net: ethernet: ti: netcp: Standardize knav_dma_open_channel to return NULL on error virtio-net: fix received length check in big packets bnxt_en: Fix warning in bnxt_dl_reload_down() ...
2025-11-06net: bridge: fix MST static key usageNikolay Aleksandrov3-2/+14
As Ido pointed out, the static key usage in MST is buggy and should use inc/dec instead of enable/disable because we can have multiple bridges with MST enabled which means a single bridge can disable MST for all. Use static_branch_inc/dec to avoid that. When destroying a bridge decrement the key if MST was enabled. Fixes: ec7328b59176 ("net: bridge: mst: Multiple Spanning Tree (MST) mode") Reported-by: Ido Schimmel <idosch@nvidia.com> Closes: https://lore.kernel.org/netdev/20251104120313.1306566-1-razor@blackwall.org/T/#m6888d87658f94ed1725433940f4f4ebb00b5a68b Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20251105111919.1499702-3-razor@blackwall.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-06net: bridge: fix use-after-free due to MST port state bypassNikolay Aleksandrov3-6/+8
syzbot reported[1] a use-after-free when deleting an expired fdb. It is due to a race condition between learning still happening and a port being deleted, after all its fdbs have been flushed. The port's state has been toggled to disabled so no learning should happen at that time, but if we have MST enabled, it will bypass the port's state, that together with VLAN filtering disabled can lead to fdb learning at a time when it shouldn't happen while the port is being deleted. VLAN filtering must be disabled because we flush the port VLANs when it's being deleted which will stop learning. This fix adds a check for the port's vlan group which is initialized to NULL when the port is getting deleted, that avoids the port state bypass. When MST is enabled there would be a minimal new overhead in the fast-path because the port's vlan group pointer is cache-hot. [1] https://syzkaller.appspot.com/bug?extid=dd280197f0f7ab3917be Fixes: ec7328b59176 ("net: bridge: mst: Multiple Spanning Tree (MST) mode") Reported-by: syzbot+dd280197f0f7ab3917be@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/69088ffa.050a0220.29fc44.003d.GAE@google.com/ Signed-off-by: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20251105111919.1499702-2-razor@blackwall.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-06net: selftests: export packet creation helpers for driver useRaju Rangoju1-41/+7
Export the network selftest packet creation infrastructure to allow network drivers to reuse the existing selftest framework instead of duplicating packet creation code. Signed-off-by: Raju Rangoju <Raju.Rangoju@amd.com> Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com> Link: https://patch.msgid.link/20251031111811.775434-1-Raju.Rangoju@amd.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-05net: gro_cells: Reduce lock scope in gro_cell_pollSebastian Andrzej Siewior1-2/+2
One GRO-cell device's NAPI callback can nest into the GRO-cell of another device if the underlying device is also using GRO-cell. This is the case for IPsec over vxlan. These two GRO-cells are separate devices. From lockdep's point of view it is the same because each device is sharing the same lock class and so it reports a possible deadlock assuming one device is nesting into itself. Hold the bh_lock only while accessing gro_cell::napi_skbs in gro_cell_poll(). This reduces the locking scope and avoids acquiring the same lock class multiple times. Fixes: 25718fdcbdd2 ("net: gro_cells: Use nested-BH locking for gro_cell") Reported-by: Gal Pressman <gal@nvidia.com> Closes: https://lore.kernel.org/all/66664116-edb8-48dc-ad72-d5223696dd19@nvidia.com/ Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Link: https://patch.msgid.link/20251104153435.ty88xDQt@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04netpoll: Fix deadlock in memory allocation under spinlockBreno Leitao1-5/+2
Fix a AA deadlock in refill_skbs() where memory allocation while holding skb_pool->lock can trigger a recursive lock acquisition attempt. The deadlock scenario occurs when the system is under severe memory pressure: 1. refill_skbs() acquires skb_pool->lock (spinlock) 2. alloc_skb() is called while holding the lock 3. Memory allocator fails and calls slab_out_of_memory() 4. This triggers printk() for the OOM warning 5. The console output path calls netpoll_send_udp() 6. netpoll_send_udp() attempts to acquire the same skb_pool->lock 7. Deadlock: the lock is already held by the same CPU Call stack: refill_skbs() spin_lock_irqsave(&skb_pool->lock) <- lock acquired __alloc_skb() kmem_cache_alloc_node_noprof() slab_out_of_memory() printk() console_flush_all() netpoll_send_udp() skb_dequeue() spin_lock_irqsave(&skb_pool->lock) <- deadlock attempt This bug was exposed by commit 248f6571fd4c51 ("netpoll: Optimize skb refilling on critical path") which removed refill_skbs() from the critical path (where nested printk was being deferred), letting nested printk being called from inside refill_skbs() Refactor refill_skbs() to never allocate memory while holding the spinlock. Another possible solution to fix this problem is protecting the refill_skbs() from nested printks, basically calling printk_deferred_{enter,exit}() in refill_skbs(), then, any nested pr_warn() would be deferred. I prefer this approach, given I _think_ it might be a good idea to move the alloc_skb() from GFP_ATOMIC to GFP_KERNEL in the future, so, having the alloc_skb() outside of the lock will be necessary step. There is a possible TOCTOU issue when checking for the pool length, and queueing the new allocated skb, but, this is not an issue, given that an extra SKB in the pool is harmless and it will be eventually used. Signed-off-by: Breno Leitao <leitao@debian.org> Fixes: 248f6571fd4c51 ("netpoll: Optimize skb refilling on critical path") Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251103-fix_netpoll_aa-v4-1-4cfecdf6da7c@debian.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04net: Convert struct sockaddr to fixed-size "sa_data[14]"Kees Cook4-8/+8
Revert struct sockaddr from flexible array to fixed 14-byte "sa_data", to solve over 36,000 -Wflex-array-member-not-at-end warnings, since struct sockaddr is embedded within many network structs. With socket/proto sockaddr-based internal APIs switched to use struct sockaddr_unsized, there should be no more uses of struct sockaddr that depend on reading beyond the end of struct sockaddr::sa_data that might trigger bounds checking. Comparing an x86_64 "allyesconfig" vmlinux build before and after this patch showed no new "ud1" instructions from CONFIG_UBSAN_BOUNDS nor any new "field-spanning" memcpy CONFIG_FORTIFY_SOURCE instrumentations. Cc: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-8-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04bpf: Convert cgroup sockaddr filters to use sockaddr_unsized consistentlyKees Cook1-2/+2
Update BPF cgroup sockaddr filtering infrastructure to use sockaddr_unsized consistently throughout the call chain, removing redundant explicit casts from callers. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-6-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04net: Convert proto callbacks from sockaddr to sockaddr_unsizedKees Cook22-57/+69
Convert struct proto pre_connect(), connect(), bind(), and bind_add() callback function prototypes from struct sockaddr to struct sockaddr_unsized. This does not change per-implementation use of sockaddr for passing around an arbitrarily sized sockaddr struct. Those will be addressed in future patches. Additionally removes the no longer referenced struct sockaddr from include/net/inet_common.h. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-5-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04net: Convert proto_ops connect() callbacks to use sockaddr_unsizedKees Cook49-76/+80
Update all struct proto_ops connect() callback function prototypes from "struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the compiler about object sizes. Calls into struct proto handlers gain casts that will be removed in the struct proto conversion patch. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-3-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04net: Convert proto_ops bind() callbacks to use sockaddr_unsizedKees Cook53-78/+79
Update all struct proto_ops bind() callback function prototypes from "struct sockaddr *" to "struct sockaddr_unsized *" to avoid lying to the compiler about object sizes. Calls into struct proto handlers gain casts that will be removed in the struct proto conversion patch. No binary changes expected. Signed-off-by: Kees Cook <kees@kernel.org> Link: https://patch.msgid.link/20251104002617.2752303-2-kees@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04net: devmem: Remove unused declaration net_devmem_bind_tx_release()Yue Haibing1-1/+0
Commit bd61848900bf ("net: devmem: Implement TX path") declared this but never implemented it. Signed-off-by: Yue Haibing <yuehaibing@huawei.com> Acked-by: Stanislav Fomichev <sdf@fomichev.me> Reviewed-by: Mina Almasry <almasrymina@google.com> Link: https://patch.msgid.link/20251103072046.1670574-1-yuehaibing@huawei.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04mptcp: pm: in kernel: only use fullmesh endp if anyMatthieu Baerts (NGI0)1-7/+3
Our documentation is saying that the in-kernel PM is only using fullmesh endpoints to establish subflows to announced addresses when at least one endpoint has a fullmesh flag. But this was not totally correct: only fullmesh endpoints were used if at least one endpoint *from the same address family as the received ADD_ADDR* has the fullmesh flag. This is confusing, and it seems clearer not to have differences depending on the address family. So, now, when at least one MPTCP endpoint has a fullmesh flag, the local addresses are picked from all fullmesh endpoints, which might be 0 if there are no endpoints for the correct address family. One selftest needs to be adapted for this behaviour change. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251101-net-next-mptcp-fm-endp-nb-bind-v1-2-b4166772d6bb@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04mptcp: pm: in-kernel: record fullmesh endp nbMatthieu Baerts (NGI0)3-3/+38
Instead of iterating over all endpoints, under RCU read lock, just to check if one of them as the fullmesh flag, we can keep a counter of fullmesh endpoint, similar to what is done with the other flags. This counter is now checked, before iterating over all endpoints. Similar to the other counters, this new one is also exposed. A userspace app can then know when it is being used in a fullmesh mode, with potentially (too) many subflows. Reviewed-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251101-net-next-mptcp-fm-endp-nb-bind-v1-1-b4166772d6bb@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04net: mark deliver_skb() as unlikely and not inlinedEric Dumazet1-11/+11
deliver_skb() should not be inlined as is it not called in the fast path. Add unlikely() clauses giving hints to the compiler about this fact. Before this patch: size net/core/dev.o text data bss dec hex filename 121794 13330 176 135300 21084 net/core/dev.o __netif_receive_skb_core() size on x86_64 : 4080 bytes. After: size net/core/dev.o text data bss dec hex filenamee 120330 13338 176 133844 20ad4 net/core/dev.o __netif_receive_skb_core() size on x86_64 : 2781 bytes. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251103165256.1712169-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04rtnetlink: honor RTEXT_FILTER_SKIP_STATS in IFLA_STATSAdrian Moreno1-4/+11
Gathering interface statistics can be a relatively expensive operation on certain systems as it requires iterating over all the cpus. RTEXT_FILTER_SKIP_STATS was first introduced [1] to skip AF_INET6 statistics from interface dumps and it was then extended [2] to also exclude IFLA_VF_INFO. The semantics of the flag does not seem to be limited to AF_INET or VF statistics and having a way to query the interface status (e.g: carrier, address) without retrieving its statistics seems reasonable. So this patch extends the use RTEXT_FILTER_SKIP_STATS to also affect IFLA_STATS. [1] https://lore.kernel.org/all/20150911204848.GC9687@oracle.com/ [2] https://lore.kernel.org/all/20230611105108.122586-1-gal@nvidia.com/ Signed-off-by: Adrian Moreno <amorenoz@redhat.com> Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Link: https://patch.msgid.link/20251103154006.1189707-1-amorenoz@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-04xsk: use a smaller new lock for shared pool caseJason Xing2-10/+8
- Split cq_lock into two smaller locks: cq_prod_lock and cq_cached_prod_lock - Avoid disabling/enabling interrupts in the hot xmit path In either xsk_cq_cancel_locked() or xsk_cq_reserve_locked() function, the race condition is only between multiple xsks sharing the same pool. They are all in the process context rather than interrupt context, so now the small lock named cq_cached_prod_lock can be used without handling interrupts. While cq_cached_prod_lock ensures the exclusive modification of @cached_prod, cq_prod_lock in xsk_cq_submit_addr_locked() only cares about @producer and corresponding @desc. Both of them don't necessarily be consistent with @cached_prod protected by cq_cached_prod_lock. That's the reason why the previous big lock can be split into two smaller ones. Please note that SPSC rule is all about the global state of producer and consumer that can affect both layers instead of local or cached ones. Frequently disabling and enabling interrupt are very time consuming in some cases, especially in a per-descriptor granularity, which now can be avoided after this optimization, even when the pool is shared by multiple xsks. With this patch, the performance number[1] could go from 1,872,565 pps to 1,961,009 pps. It's a minor rise of around 5%. [1]: taskset -c 1 ./xdpsock -i enp2s0f1 -q 0 -t -S -s 64 Signed-off-by: Jason Xing <kernelxing@tencent.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20251030000646.18859-3-kerneljasonxing@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-04xsk: do not enable/disable irq when grabbing/releasing xsk_tx_list_lockJason Xing1-8/+4
The commit ac98d8aab61b ("xsk: wire upp Tx zero-copy functions") originally introducing this lock put the deletion process in the sk_destruct which can run in irq context obviously, so the xxx_irqsave()/xxx_irqrestore() pair was used. But later another commit 541d7fdd7694 ("xsk: proper AF_XDP socket teardown ordering") moved the deletion into xsk_release() that only happens in process context. It means that since this commit, it doesn't necessarily need that pair. Now, there are two places that use this xsk_tx_list_lock and only run in the process context. So avoid manipulating the irq then. Signed-off-by: Jason Xing <kernelxing@tencent.com> Acked-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Link: https://patch.msgid.link/20251030000646.18859-2-kerneljasonxing@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-04net/dns_resolver: use credential guards in dns_query()Christian Brauner1-4/+2
Use credential guards for scoped credential override with automatic restoration on scope exit. Link: https://patch.msgid.link/20251103-work-creds-guards-simple-v1-16-a3e156839e7f@kernel.org Reviewed-by: Amir Goldstein <amir73il@gmail.com> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-04unix: don't copy credsChristian Brauner1-13/+4
No need to copy kernel credentials. Link: https://patch.msgid.link/20251103-work-creds-init_cred-v1-8-cb3ec8711a6a@kernel.org Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
2025-11-03ethtool: netlink: add ETHTOOL_MSG_MSE_GET and wire up PHY MSE accessOleksij Rempel4-1/+342
Introduce the userspace entry point for PHY MSE diagnostics via ethtool netlink. This exposes the core API added previously and returns both capability information and one or more snapshots. Userspace sends ETHTOOL_MSG_MSE_GET. The reply carries: - ETHTOOL_A_MSE_CAPABILITIES: scale limits and timing information - ETHTOOL_A_MSE_CHANNEL_* nests: one or more snapshots (per-channel if available, otherwise WORST, otherwise LINK) Link down returns -ENETDOWN. Changes: - YAML: add attribute sets (mse, mse-capabilities, mse-snapshot) and the mse-get operation - UAPI (generated): add ETHTOOL_A_MSE_* enums and message IDs, ETHTOOL_MSG_MSE_GET/REPLY - ethtool core: add net/ethtool/mse.c implementing the request, register genl op, and hook into ethnl dispatch - docs: document MSE_GET in ethtool-netlink.rst The include/uapi/linux/ethtool_netlink_generated.h is generated from Documentation/netlink/specs/ethtool.yaml. Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://patch.msgid.link/20251027122801.982364-3-o.rempel@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03net: Extend NAPI threaded polling to allow kthread based busy pollingSamiullah Khawaja3-11/+52
Add a new state NAPI_STATE_THREADED_BUSY_POLL to the NAPI state enum to enable and disable threaded busy polling. When threaded busy polling is enabled for a NAPI, enable NAPI_STATE_THREADED also. When the threaded NAPI is scheduled, set NAPI_STATE_IN_BUSY_POLL to signal napi_complete_done not to rearm interrupts. Whenever NAPI_STATE_THREADED_BUSY_POLL is unset, the NAPI_STATE_IN_BUSY_POLL will be unset, napi_complete_done unsets the NAPI_STATE_SCHED_THREADED bit also, which in turn will make the kthread go to sleep. Signed-off-by: Samiullah Khawaja <skhawaja@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Acked-by: Martin Karsten <mkarsten@uwaterloo.ca> Tested-by: Martin Karsten <mkarsten@uwaterloo.ca> Link: https://patch.msgid.link/20251028203007.575686-2-skhawaja@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03mpls: Drop RTNL for RTM_NEWROUTE, RTM_DELROUTE, and RTM_GETROUTE.Kuniyuki Iwashima1-3/+5
RTM_NEWROUTE looks up dev under RCU (ip_route_output(), ipv6_stub->ipv6_dst_lookup_flow(), netdev_get_by_index()), and each neighbour holds the refcnt of its dev. Also, net->mpls.platform_label is protected by a dedicated per-netns mutex. Now, no MPLS code depends on RTNL. Let's drop RTNL for RTM_NEWROUTE, RTM_DELROUTE, and RTM_GETROUTE. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20251029173344.2934622-14-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03mpls: Protect net->mpls.platform_label with a per-netns mutex.Kuniyuki Iwashima2-20/+42
MPLS (re)uses RTNL to protect net->mpls.platform_label, but the lock does not need to be RTNL at all. Let's protect net->mpls.platform_label with a dedicated per-netns mutex. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20251029173344.2934622-13-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03mpls: Convert RTM_GETNETCONF to RCU.Kuniyuki Iwashima1-14/+30
mpls_netconf_get_devconf() calls __dev_get_by_index(), and this only depends on RTNL. Let's convert mpls_netconf_get_devconf() to RCU and use dev_get_by_index_rcu(). Note that nlmsg_new() is moved ahead to use GFP_KERNEL. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20251029173344.2934622-12-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03mpls: Convert mpls_dump_routes() to RCU.Kuniyuki Iwashima1-10/+16
mpls_dump_routes() sets fib_dump_filter.rtnl_held to true and calls __dev_get_by_index() in mpls_valid_fib_dump_req(). This is the only RTNL dependant in mpls_dump_routes(). Also, synchronize_rcu() in resize_platform_label_table() guarantees that net->mpls.platform_label is alive under RCU. Let's convert mpls_dump_routes() to RCU and use dev_get_by_index_rcu(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20251029173344.2934622-11-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03mpls: Use mpls_route_input() where appropriate.Kuniyuki Iwashima1-19/+13
In many places, we uses rtnl_dereference() twice for net->mpls.platform_label and net->mpls.platform_label[index]. Let's replace the code with mpls_route_input(). We do not use mpls_route_input() in mpls_dump_routes() since we will rely on RCU there. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20251029173344.2934622-10-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03mpls: Add mpls_route_input().Kuniyuki Iwashima1-10/+18
mpls_route_input_rcu() is called from mpls_forward() and mpls_getroute(). The former is under RCU, and the latter is under RTNL, so mpls_route_input_rcu() uses rcu_dereference_rtnl(). Let's use rcu_dereference() in mpls_route_input_rcu() and add an RTNL variant for mpls_getroute(). Later, we will remove rtnl_dereference() there. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20251029173344.2934622-9-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03mpls: Pass net to mpls_dev_get().Kuniyuki Iwashima2-6/+8
We will replace RTNL with a per-netns mutex to protect dev->mpls_ptr. Then, we will use rcu_dereference_protected() with the lockdep_is_held() annotation, which requires net to access the per-netns mutex. However, dev_net(dev) is not safe without RTNL. Let's pass net to mpls_dev_get(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20251029173344.2934622-8-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03mpls: Add mpls_dev_rcu().Kuniyuki Iwashima3-7/+12
mpls_dev_get() uses rcu_dereference_rtnl() to fetch dev->mpls_ptr. We will replace RTNL with a dedicated mutex to protect the field. Then, we will use rcu_dereference_protected() for clarity. Let's add mpls_dev_rcu() for the RCU reader. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20251029173344.2934622-7-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03mpls: Use in6_dev_rcu() and dev_net_rcu() in mpls_forward() and mpls_xmit().Kuniyuki Iwashima3-10/+12
mpls_forward() and mpls_xmit() are called under RCU. Let's use in6_dev_rcu() and dev_net_rcu() there to annotate as such. Now we pass net to mpls_stats_inc_outucastpkts() not to read dev_net_rcu() twice. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20251029173344.2934622-6-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03mpls: Unify return paths in mpls_dev_notify().Kuniyuki Iwashima1-8/+16
We will protect net->mpls.platform_label by a dedicated mutex. Then, we need to wrap functions called from mpls_dev_notify() with the mutex. As a prep, let's unify the return paths. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20251029173344.2934622-4-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03mpls: Hold dev refcnt for mpls_nh.Kuniyuki Iwashima2-19/+45
MPLS uses RTNL 1) to guarantee the lifetime of struct mpls_nh.nh_dev 2) to protect net->mpls.platform_label , but neither actually requires RTNL. If we do not call dev_put() in find_outdev() and call it just before freeing struct mpls_route, we can drop RTNL for 1). Let's hold the refcnt of mpls_nh.nh_dev and track it with netdevice_tracker. Two notable changes: If mpls_nh_build_multi() fails to set up a neighbour, we need to call netdev_put() for successfully created neighbours in mpls_rt_free_rcu(), so the number of neighbours (rt->rt_nhn) is now updated in each iteration. When a dev is unregistered, mpls_ifdown() clones mpls_route and replaces it with the clone, so the clone requires extra netdev_hold(). Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20251029173344.2934622-3-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03mpls: Return early in mpls_label_ok().Kuniyuki Iwashima1-6/+5
When mpls_label_ok() returns false, it does not need to update *index. Let's remove is_ok and return early. Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Guillaume Nault <gnault@redhat.com> Link: https://patch.msgid.link/20251029173344.2934622-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-03sctp: make sctp_transport_init() voidHuiwen He1-15/+6
sctp_transport_init() is static and never returns NULL. It is only called by sctp_transport_new(), so change it to void and remove the redundant return value check. Signed-off-by: Huiwen He <hehuiwen@kylinos.cn> Acked-by: Xin Long <lucien.xin@gmail.com> Link: https://patch.msgid.link/20251103023619.1025622-1-hehuiwen@kylinos.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>