aboutsummaryrefslogtreecommitdiff
path: root/net
AgeCommit message (Collapse)AuthorFilesLines
2025-11-26Merge tag 'linux-can-fixes-for-6.18-20251126' of ↵Jakub Kicinski1-0/+3
git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== pull-request: can 2025-11-26 this is a pull request of 8 patches for net/main. Seungjin Bae provides a patch for the kvaser_usb driver to fix a potential infinite loop in the USB data stream command parser. Thomas Mühlbacher's patch for the sja1000 driver IRQ handler's max loop handling, that might lead to unhandled interrupts. 3 patches by me for the gs_usb driver fix handling of failed transmit URBs and add checking of the actual length of received URBs before accessing the data. The next patch is by me and is a port of Thomas Mühlbacher's patch (fix IRQ handler's max loop handling, that might lead to unhandled interrupts.) to the sun4i_can driver. Biju Das provides a patch for the rcar_canfd driver to fix the CAN-FD mode setting. The last patch is by Shaurya Rane for the em_canid filter to ensure that the complete CAN frame is present in the linear data buffer before accessing it. * tag 'linux-can-fixes-for-6.18-20251126' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can: net/sched: em_canid: fix uninit-value in em_canid_match can: rcar_canfd: Fix CAN-FD mode as default can: sun4i_can: sun4i_can_interrupt(): fix max irq loop handling can: gs_usb: gs_usb_receive_bulk_callback(): check actual_length before accessing data can: gs_usb: gs_usb_receive_bulk_callback(): check actual_length before accessing header can: gs_usb: gs_usb_xmit_callback(): fix handling of failed transmitted URBs can: sja1000: fix max irq loop handling can: kvaser_usb: leaf: Fix potential infinite loop in command parsers ==================== Link: https://patch.msgid.link/20251126155713.217105-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-26mptcp: clear scheduled subflows on retransmitPaolo Abeni1-2/+11
When __mptcp_retrans() kicks-in, it schedules one or more subflows for retransmission, but such subflows could be actually left alone if there is no more data to retransmit and/or in case of concurrent fallback. Scheduled subflows could be processed much later in time, i.e. when new data will be transmitted, leading to bad subflow selection. Explicitly clear all scheduled subflows before leaving the retransmission function. Fixes: ee2708aedad0 ("mptcp: use get_retrans wrapper") Cc: stable@vger.kernel.org Reported-by: Filip Pokryvka <fpokryvk@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251125-net-mptcp-clear-sched-rtx-v1-1-1cea4ad2165f@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-26phy: add hwtstamp_get callback to phy driversVadim Fedorenko1-4/+5
PHY devices had lack of hwtstamp_get callback even though most of them are tracking configuration info. Introduce new call back to mii_timestamper. Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Kory Maincent <kory.maincent@bootlin.com> Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Link: https://patch.msgid.link/20251124181151.277256-3-vadim.fedorenko@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-26libceph: drop started parameter of __ceph_open_session()Ilya Dryomov1-3/+2
With the previous commit revamping the timeout handling, started isn't used anymore. It could be taken into account by adjusting the initial value of the timeout, but there is little point as both callers capture the timestamp shortly before calling __ceph_open_session() -- the only thing of note that happens in the interim is taking client->mount_mutex and that isn't expected to take multiple seconds. Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
2025-11-26libceph: fix potential use-after-free in have_mon_and_osd_map()Ilya Dryomov2-25/+42
The wait loop in __ceph_open_session() can race with the client receiving a new monmap or osdmap shortly after the initial map is received. Both ceph_monc_handle_map() and handle_one_map() install a new map immediately after freeing the old one kfree(monc->monmap); monc->monmap = monmap; ceph_osdmap_destroy(osdc->osdmap); osdc->osdmap = newmap; under client->monc.mutex and client->osdc.lock respectively, but because neither is taken in have_mon_and_osd_map() it's possible for client->monc.monmap->epoch and client->osdc.osdmap->epoch arms in client->monc.monmap && client->monc.monmap->epoch && client->osdc.osdmap && client->osdc.osdmap->epoch; condition to dereference an already freed map. This happens to be reproducible with generic/395 and generic/397 with KASAN enabled: BUG: KASAN: slab-use-after-free in have_mon_and_osd_map+0x56/0x70 Read of size 4 at addr ffff88811012d810 by task mount.ceph/13305 CPU: 2 UID: 0 PID: 13305 Comm: mount.ceph Not tainted 6.14.0-rc2-build2+ #1266 ... Call Trace: <TASK> have_mon_and_osd_map+0x56/0x70 ceph_open_session+0x182/0x290 ceph_get_tree+0x333/0x680 vfs_get_tree+0x49/0x180 do_new_mount+0x1a3/0x2d0 path_mount+0x6dd/0x730 do_mount+0x99/0xe0 __do_sys_mount+0x141/0x180 do_syscall_64+0x9f/0x100 entry_SYSCALL_64_after_hwframe+0x76/0x7e </TASK> Allocated by task 13305: ceph_osdmap_alloc+0x16/0x130 ceph_osdc_init+0x27a/0x4c0 ceph_create_client+0x153/0x190 create_fs_client+0x50/0x2a0 ceph_get_tree+0xff/0x680 vfs_get_tree+0x49/0x180 do_new_mount+0x1a3/0x2d0 path_mount+0x6dd/0x730 do_mount+0x99/0xe0 __do_sys_mount+0x141/0x180 do_syscall_64+0x9f/0x100 entry_SYSCALL_64_after_hwframe+0x76/0x7e Freed by task 9475: kfree+0x212/0x290 handle_one_map+0x23c/0x3b0 ceph_osdc_handle_map+0x3c9/0x590 mon_dispatch+0x655/0x6f0 ceph_con_process_message+0xc3/0xe0 ceph_con_v1_try_read+0x614/0x760 ceph_con_workfn+0x2de/0x650 process_one_work+0x486/0x7c0 process_scheduled_works+0x73/0x90 worker_thread+0x1c8/0x2a0 kthread+0x2ec/0x300 ret_from_fork+0x24/0x40 ret_from_fork_asm+0x1a/0x30 Rewrite the wait loop to check the above condition directly with client->monc.mutex and client->osdc.lock taken as appropriate. While at it, improve the timeout handling (previously mount_timeout could be exceeded in case wait_event_interruptible_timeout() slept more than once) and access client->auth_err under client->monc.mutex to match how it's set in finish_auth(). monmap_show() and osdmap_show() now take the respective lock before accessing the map as well. Cc: stable@vger.kernel.org Reported-by: David Howells <dhowells@redhat.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com> Reviewed-by: Viacheslav Dubeyko <Slava.Dubeyko@ibm.com>
2025-11-26socket: Split out a getsockname helper for io_uringGabriel Krisman Bertazi1-16/+20
Similar to getsockopt, split out a helper to check security and issue the operation from the main handler that can be used by io_uring. Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-26socket: Unify getsockname and getpeername implementationGabriel Krisman Bertazi2-44/+15
They are already implemented by the same get_name hook in the protocol level. Bring the unification one level up to reduce code duplication in preparation to supporting these as io_uring operations. Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Signed-off-by: Gabriel Krisman Bertazi <krisman@suse.de> Signed-off-by: Jens Axboe <axboe@kernel.dk>
2025-11-26net/sched: em_canid: fix uninit-value in em_canid_matchShaurya Rane1-0/+3
Use pskb_may_pull() to ensure a complete CAN frame is present in the linear data buffer before reading the CAN ID. A simple skb->len check is insufficient because it only verifies the total data length but does not guarantee the data is present in skb->data (it could be in fragments). pskb_may_pull() both validates the length and pulls fragmented data into the linear buffer if necessary, making it safe to directly access skb->data. Reported-by: syzbot+5d8269a1e099279152bc@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=5d8269a1e099279152bc Fixes: f057bbb6f9ed ("net: em_canid: Ematch rule to match CAN frames according to their identifiers") Signed-off-by: Shaurya Rane <ssrane_b23@ee.vjti.ac.in> Link: https://patch.msgid.link/20251126085718.50808-1-ssranevjti@gmail.com Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-11-26can: raw: instantly reject unsupported CAN framesOliver Hartkopp1-8/+46
For real CAN interfaces the CAN_CTRLMODE_FD and CAN_CTRLMODE_XL control modes indicate whether an interface can handle those CAN FD/XL frames. In the case a CAN XL interface is configured in CANXL-only mode with disabled error-signalling neither CAN CC nor CAN FD frames can be sent. The checks are performed on CAN_RAW sockets to give an instant feedback to the user when writing unsupported CAN frames to the interface. Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://patch.msgid.link/20251126-canxl-v8-16-e7e3eb74f889@pengutronix.de Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2025-11-26wifi: mac80211: allow sharing identical chanctx for S1G interfacesLachlan Hodges2-3/+15
Introduce support for sharing identical channel contexts for S1G interfaces. Additionally, do not downgrade channel requests for S1G interfaces. Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com> Link: https://patch.msgid.link/20251126015758.149034-1-lachlan.hodges@morsemicro.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-11-25xsk: avoid data corruption on cq descriptor numberFernando Fernandez Mancera1-55/+88
Since commit 30f241fcf52a ("xsk: Fix immature cq descriptor production"), the descriptor number is stored in skb control block and xsk_cq_submit_addr_locked() relies on it to put the umem addrs onto pool's completion queue. skb control block shouldn't be used for this purpose as after transmit xsk doesn't have control over it and other subsystems could use it. This leads to the following kernel panic due to a NULL pointer dereference. BUG: kernel NULL pointer dereference, address: 0000000000000000 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] SMP NOPTI CPU: 2 UID: 1 PID: 927 Comm: p4xsk.bin Not tainted 6.16.12+deb14-cloud-amd64 #1 PREEMPT(lazy) Debian 6.16.12-1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.17.0-debian-1.17.0-1 04/01/2014 RIP: 0010:xsk_destruct_skb+0xd0/0x180 [...] Call Trace: <IRQ> ? napi_complete_done+0x7a/0x1a0 ip_rcv_core+0x1bb/0x340 ip_rcv+0x30/0x1f0 __netif_receive_skb_one_core+0x85/0xa0 process_backlog+0x87/0x130 __napi_poll+0x28/0x180 net_rx_action+0x339/0x420 handle_softirqs+0xdc/0x320 ? handle_edge_irq+0x90/0x1e0 do_softirq.part.0+0x3b/0x60 </IRQ> <TASK> __local_bh_enable_ip+0x60/0x70 __dev_direct_xmit+0x14e/0x1f0 __xsk_generic_xmit+0x482/0xb70 ? __remove_hrtimer+0x41/0xa0 ? __xsk_generic_xmit+0x51/0xb70 ? _raw_spin_unlock_irqrestore+0xe/0x40 xsk_sendmsg+0xda/0x1c0 __sys_sendto+0x1ee/0x200 __x64_sys_sendto+0x24/0x30 do_syscall_64+0x84/0x2f0 ? __pfx_pollwake+0x10/0x10 ? __rseq_handle_notify_resume+0xad/0x4c0 ? restore_fpregs_from_fpstate+0x3c/0x90 ? switch_fpu_return+0x5b/0xe0 ? do_syscall_64+0x204/0x2f0 ? do_syscall_64+0x204/0x2f0 ? do_syscall_64+0x204/0x2f0 entry_SYSCALL_64_after_hwframe+0x76/0x7e </TASK> [...] Kernel panic - not syncing: Fatal exception in interrupt Kernel Offset: 0x1c000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff) Instead use the skb destructor_arg pointer along with pointer tagging. As pointers are always aligned to 8B, use the bottom bit to indicate whether this a single address or an allocated struct containing several addresses. Fixes: 30f241fcf52a ("xsk: Fix immature cq descriptor production") Closes: https://lore.kernel.org/netdev/0435b904-f44f-48f8-afb0-68868474bf1c@nop.hu/ Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de> Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com> Reviewed-by: Jason Xing <kerneljasonxing@gmail.com> Link: https://patch.msgid.link/20251124171409.3845-1-fmancera@suse.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25tcp: remove icsk->icsk_retransmit_timerEric Dumazet3-18/+11
Now sk->sk_timer is no longer used by TCP keepalive, we can use its storage for TCP and MPTCP retransmit timers for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-5-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25tcp: introduce icsk->icsk_keepalive_timerEric Dumazet7-18/+21
sk->sk_timer has been used for TCP keepalives. Keepalive timers are not in fast path, we want to use sk->sk_timer storage for retransmit timers, for better cache locality. Create icsk->icsk_keepalive_timer and change keepalive code to no longer use sk->sk_timer. Added space is reclaimed in the following patch. This includes changes to MPTCP, which was also using sk_timer. Alias icsk->mptcp_tout_timer and icsk->icsk_keepalive_timer for inet_sk_diag_fill() sake. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-4-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25net: move sk_dst_pending_confirm and sk_pacing_status to sock_read_tx groupEric Dumazet1-2/+2
These two fields are mostly read in TCP tx path, move them in an more appropriate group for better cache locality. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25tcp: rename icsk_timeout() to tcp_timeout_expires()Eric Dumazet5-10/+10
In preparation of sk->tcp_timeout_timer introduction, rename icsk_timeout() helper and change its argument to plain 'const struct sock *sk'. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20251124175013.1473655-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25tools: ynl-gen: add regeneration commentAsbjørn Sloth Tønnesen14-0/+14
Add a comment on regeneration to the generated files. The comment is placed after the YNL-GEN line[1], as to not interfere with ynl-regen.sh's detection logic. [1] and after the optional YNL-ARG line. Link: https://lore.kernel.org/r/aR5m174O7pklKrMR@zx2c4.com/ Suggested-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: Asbjørn Sloth Tønnesen <ast@fiberby.net> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251120174429.390574-3-ast@fiberby.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25net: dsa: append ethtool counters of all hidden ports to conduitVladimir Oltean1-33/+93
Currently there is no way to see packet counters on cascade ports, and no clarity on how the API for that would look like. Because it's something that is currently needed, just extend the hack where ethtool -S on the conduit interface dumps CPU port counters, and also use it to dump counters of cascade ports. Note that the "pXX_" naming convention changes to "sXX_pYY", to distinguish between ports having the same index but belonging to different switches. This has a slight chance of causing regressions to existing tooling: - grepping for "p04_counter_name" still works, but might return more than one string now - grepping for " p04_counter_name" no longer works Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20251122112311.138784-4-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25net: dsa: use kernel data types for ethtool ops on conduitVladimir Oltean1-6/+5
Suppress some checkpatch 'CHECK' messages about u8 being preferable over uint8_t, etc. No functional change. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20251122112311.138784-3-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25net: dsa: cpu_dp->orig_ethtool_ops might be NULLVladimir Oltean1-7/+7
In theory this would have been seen by now, but it seems that all drivers used as DSA conduit interfaces thus far have had ethtool_ops set, and it's hard to even find modern Ethernet drivers (and not VF ones) which don't use ethtool. Here is the unfiltered list of drivers which register any sort of net_device but don't set its ethtool_ops pointer. I don't think any of them 'risks' being used as a DSA conduit, maybe except for moxart, rnpbge and icssm, I'm not sure. - drivers/net/can/dev/dev.c - drivers/net/wwan/qcom_bam_dmux.c - drivers/net/wwan/t7xx/t7xx_netdev.c - drivers/net/arcnet/arcnet.c - drivers/net/hamradio/ - drivers/net/slip/slip.c - drivers/net/ethernet/ezchip/nps_enet.c - drivers/net/ethernet/moxa/moxart_ether.c - drivers/net/ethernet/wangxun/txgbevf/txgbevf_main.c - drivers/net/ethernet/wangxun/ngbevf/ngbevf_main.c - drivers/net/ethernet/huawei/hinic3/hinic3_main.c - drivers/net/ethernet/i825xx/ - drivers/net/ethernet/ti/icssm/icssm_prueth.c - drivers/net/ethernet/seeq/ - drivers/net/ethernet/litex/litex_liteeth.c - drivers/net/ethernet/sunplus/spl2sw_driver.c - drivers/net/ethernet/mucse/rnpgbe/rnpgbe_main.c - drivers/net/ipa/ - drivers/net/wireless/microchip/wilc1000/ - drivers/net/wireless/mediatek/mt76/dma.c - drivers/net/wireless/ath/ath12k/ - drivers/net/wireless/ath/ath11k/ - drivers/net/wireless/ath/ath6kl/ - drivers/net/wireless/ath/ath10k/ - drivers/net/wireless/intel/iwlwifi/pcie/gen1_2/trans.c - drivers/net/wireless/virtual/mac80211_hwsim.c - drivers/net/wireless/quantenna/qtnfmac/pcie/pcie.c - drivers/net/wireless/realtek/rtw89/core.c - drivers/net/wireless/realtek/rtw88/pci.c - drivers/net/caif/ - drivers/net/plip/ - drivers/net/wan/ - drivers/net/mctp/ - drivers/net/ppp/ - drivers/net/thunderbolt/ Nonetheless, it's good for the framework not to make such assumptions, and not panic when coming across such kind of host device in the future. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Link: https://patch.msgid.link/20251122112311.138784-2-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-25net_sched: use qdisc_dequeue_drop() in cake, codel, fq_codelEric Dumazet3-3/+10
cake, codel and fq_codel can drop many packets from dequeue(). Use qdisc_dequeue_drop() so that the freeing can happen outside of the qdisc spinlock scope. Add TCQ_F_DEQUEUE_DROPS to sch->flags. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-15-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net_sched: add qdisc_dequeue_drop() helperEric Dumazet1-9/+13
Some qdisc like cake, codel, fq_codel might drop packets in their dequeue() method. This is currently problematic because dequeue() runs with the qdisc spinlock held. Freeing skbs can be extremely expensive. Add qdisc_dequeue_drop() method and a new TCQ_F_DEQUEUE_DROPS so that these qdiscs can opt-in to defer the skb frees after the socket spinlock is released. TCQ_F_DEQUEUE_DROPS is an attempt to not penalize other qdiscs with an extra cache line miss. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-14-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net_sched: add tcf_kfree_skb_list() helperEric Dumazet1-10/+5
Using kfree_skb_list_reason() to free list of skbs from qdisc operations seems wrong as each skb might have a different drop reason. Cleanup __dev_xmit_skb() to call tcf_kfree_skb_list() once in preparation of the following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-13-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net: annotate a data-race in __dev_xmit_skb()Eric Dumazet1-1/+1
q->limit is read locklessly, add a READ_ONCE(). Fixes: 100dfa74cad9 ("net: dev_queue_xmit() llist adoption") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-12-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net: prefech skb->priority in __dev_xmit_skb()Eric Dumazet1-0/+1
Most qdiscs need to read skb->priority at enqueue time(). In commit 100dfa74cad9 ("net: dev_queue_xmit() llist adoption") I added a prefetch(next), lets add another one for the second half of skb. Note that skb->priority and skb->hash share a common cache line, so this patch helps qdiscs needing both fields. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-11-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net_sched: sch_fq: prefetch one skb ahead in dequeue()Eric Dumazet1-2/+5
prefetch the skb that we are likely to dequeue at the next dequeue(). Also call fq_dequeue_skb() a bit sooner in fq_dequeue(). This reduces the window between read of q.qlen and changes of fields in the cache line that could be dirtied by another cpu trying to queue a packet. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-10-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net_sched: sch_fq: move qdisc_bstats_update() to fq_dequeue_skb()Eric Dumazet1-1/+1
Group together changes to qdisc fields to reduce chances of false sharing if another cpu attempts to acquire the qdisc spinlock. qdisc_qstats_backlog_dec(sch, skb); sch->q.qlen--; qdisc_bstats_update(sch, skb); Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-9-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net_sched: cake: use qdisc_pkt_segs()Eric Dumazet1-9/+3
Use new qdisc_pkt_segs() to avoid a cache line miss in cake_enqueue() for non GSO packets. cake_overhead() does not have to recompute it. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-7-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net_sched: use qdisc_skb_cb(skb)->pkt_segs in bstats_update()Eric Dumazet6-1/+6
Avoid up to two cache line misses in qdisc dequeue() to fetch skb_shinfo(skb)->gso_segs/gso_size while qdisc spinlock is held. This gives a 5 % improvement in a TX intensive workload. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-6-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net: use qdisc_pkt_len_segs_init() in sch_handle_ingress()Eric Dumazet1-1/+1
sch_handle_ingress() sets qdisc_skb_cb(skb)->pkt_len. We also need to initialize qdisc_skb_cb(skb)->pkt_segs. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-5-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net_sched: initialize qdisc_skb_cb(skb)->pkt_segs in qdisc_pkt_len_init()Eric Dumazet2-5/+12
qdisc_pkt_len_init() is currently initalizing qdisc_skb_cb(skb)->pkt_len. Add qdisc_skb_cb(skb)->pkt_segs initialization and rename this function to qdisc_pkt_len_segs_init(). Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-4-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net: init shinfo->gso_segs from qdisc_pkt_len_init()Eric Dumazet1-1/+2
Qdisc use shinfo->gso_segs for their pkts stats in bstats_update(), but this field needs to be initialized for SKB_GSO_DODGY users. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-3-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25net_sched: make room for (struct qdisc_skb_cb)->pkt_segsEric Dumazet4-9/+9
Add a new u16 field, next to pkt_len : pkt_segs This will cache shinfo->gso_segs to speed up qdisc deqeue(). Move slave_dev_queue_mapping at the end of qdisc_skb_cb, and move three bits from tc_skb_cb : - post_ct - post_ct_snat - post_ct_dnat Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251121083256.674562-2-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-11-25wifi: cfg80211: include s1g_primary_2mhz when sending chandefLachlan Hodges1-0/+3
The chandef now includes a flag denoting the use of a 2MHz primary channel for S1G interfaces, include this when sending the chandef. Signed-off-by: Lachlan Hodges <lachlan.hodges@morsemicro.com> Link: https://patch.msgid.link/20251125025927.245280-2-lachlan.hodges@morsemicro.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2025-11-24mptcp: leverage the backlog for RX packet processingPaolo Abeni2-58/+129
When the msk socket is owned or the msk receive buffer is full, move the incoming skbs in a msk level backlog list. This avoid traversing the joined subflows and acquiring the subflow level socket lock at reception time, improving the RX performances. When processing the backlog, use the fwd alloc memory borrowed from the incoming subflow. skbs exceeding the msk receive space are not dropped; instead they are kept into the backlog until the receive buffer is freed. Dropping packets already acked at the TCP level is explicitly discouraged by the RFC and would corrupt the data stream for fallback sockets. Special care is needed to avoid adding skbs to the backlog of a closed msk and to avoid leaving dangling references into the backlog at subflow closing time. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-14-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: introduce mptcp-level backlogPaolo Abeni3-9/+97
We are soon using it for incoming data processing. MPTCP can't leverage the sk_backlog, as the latter is processed before the release callback, and such callback for MPTCP releases and re-acquire the socket spinlock, breaking the sk_backlog processing assumption. Add a skb backlog list inside the mptcp sock struct, and implement basic helper to transfer packet to and purge such list. Packets in the backlog are memory accounted and still use the incoming subflow receive memory, to allow back-pressure. The backlog size is implicitly bounded to the sum of subflows rcvbuf. When a subflow is closed, references from the backlog to such sock are removed. No packet is currently added to the backlog, so no functional changes intended here. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-13-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: borrow forward memory from subflowPaolo Abeni5-11/+46
In the MPTCP receive path, we release the subflow allocated fwd memory just to allocate it again shortly after for the msk. That could increases the failures chances, especially when we will add backlog processing, with other actions could consume the just released memory before the msk socket has a chance to do the rcv allocation. Replace the skb_orphan() call with an open-coded variant that explicitly borrows, the fwd memory from the subflow socket instead of releasing it. The borrowed memory does not have PAGE_SIZE granularity; rounding to the page size will make the fwd allocated memory higher than what is strictly required and could make the incoming subflow fwd mem consistently negative. Instead, keep track of the accumulated frag and borrow the full page at subflow close time. This allow removing the last drop in the TCP to MPTCP transition and the associated, now unused, MIB. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-12-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: handle first subflow closing consistentlyPaolo Abeni2-6/+11
Currently, as soon as the PM closes a subflow, the msk stops accepting data from it, even if the TCP socket could be still formally open in the incoming direction, with the notable exception of the first subflow. The root cause of such behavior is that code currently piggy back two separate semantic on the subflow->disposable bit: the subflow context must be released and that the subflow must stop accepting incoming data. The first subflow is never disposed, so it also never stop accepting incoming data. Use a separate bit to mark the latter status and set such bit in __mptcp_close_ssk() for all subflows. Beyond making per subflow behaviour more consistent this will also simplify the next patch. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-11-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: drop the __mptcp_data_ready() helperPaolo Abeni1-12/+7
It adds little clarity and there is a single user of such helper, just inline it in the caller. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-10-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: make mptcp_destroy_common() staticPaolo Abeni2-23/+21
Such function is only used inside protocol.c, there is no need to expose it to the whole stack. Note that the function definition most be moved earlier to avoid forward declaration. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-9-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: do not miss early first subflow close event notificationPaolo Abeni1-2/+2
The MPTCP protocol is not currently emitting the NL event when the first subflow is closed before msk accept() time. By replacing the in use close helper is such scenario, implicitly introduce the missing notification. Note that in such scenario we want to be sure that mptcp_close_ssk() will not trigger any PM work, move the msk state change update earlier, so that the previous patch will offer such guarantee. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-8-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: ensure the kernel PM does not take action too latePaolo Abeni2-1/+5
The PM hooks can currently take place when the msk is already shutting down. Subflow creation will fail, thanks to the existing check at join time, but we can entirely avoid starting the to be failed operations. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-7-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: cleanup fallback dummy mapping generationPaolo Abeni2-1/+10
MPTCP currently access ack_seq outside the msk socket log scope to generate the dummy mapping for fallback socket. Soon we are going to introduce backlog usage and even for fallback socket the ack_seq value will be significantly off outside of the msk socket lock scope. Avoid relying on ack_seq for dummy mapping generation, using instead the subflow sequence number. Note that in case of disconnect() and (re)connect() we must ensure that any previous state is re-set. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-6-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: cleanup fallback data fin receptionPaolo Abeni1-1/+3
MPTCP currently generate a dummy data_fin for fallback socket when the fallback subflow has completed data reception using the current ack_seq. We are going to introduce backlog usage for the msk soon, even for fallback sockets: the ack_seq value will not match the most recent sequence number seen by the fallback subflow socket, as it will ignore data_seq sitting in the backlog. Instead use the last map sequence number to set the data_fin, as fallback (dummy) map sequences are always in sequence. Reviewed-by: Geliang Tang <geliang@kernel.org> Tested-by: Geliang Tang <geliang@kernel.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <martineau@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-5-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: fix memcg accounting for passive socketsPaolo Abeni3-11/+38
The passive sockets never got proper memcg accounting: the msk socket is associated with the memcg at accept time, but the passive subflows never got it right. At accept time, traverse the subflows list and associate each of them with the msk memcg, and try to do the same at join completion time, if the msk has been already accepted. Fixes: cf7da0d66cc1 ("mptcp: Create SUBFLOW socket for incoming connections") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/298 Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/597 Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-4-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: grafting MPJ subflow earlierPaolo Abeni1-7/+23
Later patches need to ensure that all MPJ subflows are grafted to the msk socket before accept() completion. Currently the grafting happens under the msk socket lock: potentially at msk release_cb time which make satisfying the above condition a bit tricky. Move the MPJ subflow grafting earlier, under the msk data lock, so that we can use such lock as a synchronization point. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-3-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24mptcp: factor-out cgroup data inherit helperPaolo Abeni2-8/+14
MPTCP will soon need the same functionality for passive sockets, factor them out in a common helper. No functional change intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Geliang Tang <geliang@kernel.org> Reviewed-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-2-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24net: factor-out _sk_charge() helperPaolo Abeni2-16/+19
Move out of __inet_accept() the code dealing charging newly accepted socket to memcg. MPTCP will soon use it to on a per subflow basis, in different contexts. No functional changes intended. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Geliang Tang <geliang@kernel.org> Acked-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org> Link: https://patch.msgid.link/20251121-net-next-mptcp-memcg-backlog-imp-v1-1-1f34b6c1e0b1@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24net: optimize eth_type_trans() vs CONFIG_STACKPROTECTOR_STRONG=yEric Dumazet1-8/+8
Some platforms exhibit very high costs with CONFIG_STACKPROTECTOR_STRONG=y when a function needs to pass the address of a local variable to external functions. eth_type_trans() (and its callers) is showing this anomaly on AMD EPYC 7B12 platforms (and maybe others). We could : 1) inline eth_type_trans() This would help if its callers also has the same issue, and the canary cost would be paid by the callers already. This is a bit cumbersome because netdev_uses_dsa() is pulling whole <net/dsa.h> definitions. 2) Compile net/ethernet/eth.c with -fno-stack-protector This would weaken security. 3) Hack eth_type_trans() to temporarily use skb->dev as a place holder if skb_header_pointer() needs to pull 2 bytes not present in skb->head. This patch implements 3), and brings a 5% improvement on TX/RX intensive workload (tcp_rr 10,000 flows) on AMD EPYC 7B12. Removing CONFIG_STACKPROTECTOR_STRONG on this platform can improve performance by 25 %. This means eth_type_trans() issue is not an isolated artifact. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20251121061725.206675-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24net: sched: fix TCF_LAYER_TRANSPORT handling in tcf_get_base_ptr()Eric Dumazet3-3/+15
syzbot reported that tcf_get_base_ptr() can be called while transport header is not set [1]. Instead of returning a dangling pointer, return NULL. Fix tcf_get_base_ptr() callers to handle this NULL value. [1] WARNING: CPU: 1 PID: 6019 at ./include/linux/skbuff.h:3071 skb_transport_header include/linux/skbuff.h:3071 [inline] WARNING: CPU: 1 PID: 6019 at ./include/linux/skbuff.h:3071 tcf_get_base_ptr include/net/pkt_cls.h:539 [inline] WARNING: CPU: 1 PID: 6019 at ./include/linux/skbuff.h:3071 em_nbyte_match+0x2d8/0x3f0 net/sched/em_nbyte.c:43 Modules linked in: CPU: 1 UID: 0 PID: 6019 Comm: syz.0.17 Not tainted syzkaller #0 PREEMPT(full) Call Trace: <TASK> tcf_em_match net/sched/ematch.c:494 [inline] __tcf_em_tree_match+0x1ac/0x770 net/sched/ematch.c:520 tcf_em_tree_match include/net/pkt_cls.h:512 [inline] basic_classify+0x115/0x2d0 net/sched/cls_basic.c:50 tc_classify include/net/tc_wrapper.h:197 [inline] __tcf_classify net/sched/cls_api.c:1764 [inline] tcf_classify+0x4cf/0x1140 net/sched/cls_api.c:1860 multiq_classify net/sched/sch_multiq.c:39 [inline] multiq_enqueue+0xfd/0x4c0 net/sched/sch_multiq.c:66 dev_qdisc_enqueue+0x4e/0x260 net/core/dev.c:4118 __dev_xmit_skb net/core/dev.c:4214 [inline] __dev_queue_xmit+0xe83/0x3b50 net/core/dev.c:4729 packet_snd net/packet/af_packet.c:3076 [inline] packet_sendmsg+0x3e33/0x5080 net/packet/af_packet.c:3108 sock_sendmsg_nosec net/socket.c:727 [inline] __sock_sendmsg+0x21c/0x270 net/socket.c:742 ____sys_sendmsg+0x505/0x830 net/socket.c:2630 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: syzbot+f3a497f02c389d86ef16@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6920855a.a70a0220.2ea503.0058.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com> Link: https://patch.msgid.link/20251121154100.1616228-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-11-24wifi: cfg80211: stop radar detection in cfg80211_leave()Johannes Berg3-0/+21
If an interface is set down or, per the previous patch, changes type, radar detection for it should be cancelled. This is done for AP mode in mac80211 (somewhat needlessly, since cfg80211 can do it, but didn't until now), but wasn't handled for mesh, so if radar detection was started and then the interface set down or its type switched (the latter sometimes happning in the hwsim test 'mesh_peer_connected_dfs'), radar detection would be around with the interface unknown to the driver, later leading to some warnings around chanctx usage. Link: https://patch.msgid.link/20251121174021.290120e419e3.I2a5650c9062e29c988992dd8ce0d8eb570d23267@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>