aboutsummaryrefslogtreecommitdiff
path: root/net/ipv4
AgeCommit message (Collapse)AuthorFilesLines
2026-01-27bpf, sockmap: Fix incorrect copied_seq calculationJiayuan Chen1-2/+3
A socket using sockmap has its own independent receive queue: ingress_msg. This queue may contain data from its own protocol stack or from other sockets. The issue is that when reading from ingress_msg, we update tp->copied_seq by default. However, if the data is not from its own protocol stack, tcp->rcv_nxt is not increased. Later, if we convert this socket to a native socket, reading from this socket may fail because copied_seq might be significantly larger than rcv_nxt. This fix also addresses the syzkaller-reported bug referenced in the Closes tag. This patch marks the skmsg objects in ingress_msg. When reading, we update copied_seq only if the data is from its own protocol stack. FD1:read() -- FD1->copied_seq++ | [read data] | [enqueue data] v [sockmap] -> ingress to self -> ingress_msg queue FD1 native stack ------> ^ -- FD1->rcv_nxt++ -> redirect to other | [enqueue data] | | | ingress to FD1 v ^ ... | [sockmap] FD2 native stack Closes: https://syzkaller.appspot.com/bug?extid=06dbd397158ec0ea4983 Fixes: 04919bed948dc ("tcp: Introduce tcp_read_skb()") Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com> Reviewed-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Jiayuan Chen <jiayuan.chen@linux.dev> Link: https://lore.kernel.org/r/20260124113314.113584-2-jiayuan.chen@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2026-01-27tcp: move sk_forced_mem_schedule() to tcp.cEric Dumazet2-27/+27
TCP fast path can (auto)inline this helper, instead of (auto)inling it from tcp_send_fin(). No change of overall code size, but tcp_sendmsg() is faster. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 1/1 up/down: 141/-140 (1) Function old new delta tcp_stream_alloc_skb 216 357 +141 tcp_send_fin 688 548 -140 Total: Before=22236729, After=22236730, chg +0.00% BTW, we might change tcp_send_fin() to use tcp_stream_alloc_skb(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20260123111605.4089200-1-edumazet@google.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-25ipv4: igmp: annotate data-races around idev->mr_maxdelayEric Dumazet1-2/+2
idev->mr_maxdelay is read and written locklessly, add READ_ONCE()/WRITE_ONCE() annotations. While we are at it, make this field an u32. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260122172247.2429403-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23tcp: move tcp_stream_memory_free() to tcp.cEric Dumazet2-14/+13
Moving tcp_stream_memory_free() to tcp.c allows the compiler to (auto)inline it from tcp_poll() and tcp_sendmsg_locked() for better performance. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 2/0 up/down: 118/0 (118) Function old new delta tcp_poll 840 923 +83 tcp_sendmsg_locked 4217 4252 +35 Total: Before=22573095, After=22573213, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20260122090228.1678207-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2-1/+4
Cross-merge networking fixes after downstream PR (net-6.19-rc7). Conflicts: drivers/net/ethernet/huawei/hinic3/hinic3_irq.c b35a6fd37a00 ("hinic3: Add adaptive IRQ coalescing with DIM") fb2bb2a1ebf7 ("hinic3: Fix netif_queue_set_napi queue_index input parameter error") https://lore.kernel.org/fc0a7fdf08789a52653e8ad05281a0a849e79206.1768915707.git.zhuyikai1@h-partners.com drivers/net/wireless/ath/ath12k/mac.c drivers/net/wireless/ath/ath12k/wifi7/hw.c 31707572108d ("wifi: ath12k: Fix wrong P2P device link id issue") c26f294fef2a ("wifi: ath12k: Move ieee80211_ops callback to the arch specific module") https://lore.kernel.org/20260114123751.6a208818@canb.auug.org.au Adjacent changes: drivers/net/wireless/ath/ath12k/mac.c 8b8d6ee53dfd ("wifi: ath12k: Fix scan state stuck in ABORTING after cancel_remain_on_channel") 914c890d3b90 ("wifi: ath12k: Add framework for hardware specific ieee80211_ops registration") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22tcp: move tcp_rate_check_app_limited() to tcp.cEric Dumazet3-21/+19
tcp_rate_check_app_limited() is used from tcp_sendmsg_locked() fast path and from other callers. Move it to tcp.c so that it can be inlined in tcp_sendmsg_locked(). Small increase of code, for better TCP performance. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/0 grow/shrink: 1/0 up/down: 87/0 (87) Function old new delta tcp_sendmsg_locked 4217 4304 +87 Total: Before=22566462, After=22566549, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20260121095923.3134639-3-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22tcp: move tcp_rate_gen to tcp_input.cEric Dumazet2-110/+110
This function is called from one caller only, in TCP fast path. Move it to tcp_input.c so that compiler can inline it. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/2 grow/shrink: 1/0 up/down: 226/-300 (-74) Function old new delta tcp_ack 5405 5631 +226 __pfx_tcp_rate_gen 16 - -16 tcp_rate_gen 284 - -284 Total: Before=22566536, After=22566462, chg -0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20260121095923.3134639-2-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-22cipso: harden use of skb_cow() in cipso_v4_skbuff_setattr()Will Rosenberg1-1/+2
If skb_cow() is passed a headroom <= -NET_SKB_PAD, it will trigger a BUG. As a result, use cases should avoid calling with a headroom that is negative to prevent triggering this issue. This is the same code pattern fixed in Commit 58fc7342b529 ("ipv6: BUG() in pskb_expand_head() as part of calipso_skbuff_setattr()"). In cipso_v4_skbuff_setattr(), len_delta can become negative, leading to a negative headroom passed to skb_cow(). However, the BUG is not triggerable because the condition headroom <= -NET_SKB_PAD cannot be satisfied due to limits on the IPv4 options header size. Avoid potential problems in the future by only using skb_cow() to grow the skb headroom. Signed-off-by: Will Rosenberg <whrosenb@asu.edu> Acked-by: Paul Moore <paul@paul-moore.com> Link: https://patch.msgid.link/20260120155738.982771-1-whrosenb@asu.edu Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-21tcp: preserve const qualifier in tcp_rsk() and inet_rsk()Eric Dumazet1-1/+1
We can change tcp_rsk() and inet_rsk() to propagate their argument const qualifier thanks to container_of_const(). Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260120125353.1470456-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-20kernel.h: drop hex.h and update all hex.h usersRandy Dunlap1-0/+1
Remove <linux/hex.h> from <linux/kernel.h> and update all users/callers of hex.h interfaces to directly #include <linux/hex.h> as part of the process of putting kernel.h on a diet. Removing hex.h from kernel.h means that 36K C source files don't have to pay the price of parsing hex.h for the roughly 120 C source files that need it. This change has been build-tested with allmodconfig on most ARCHes. Also, all users/callers of <linux/hex.h> in the entire source tree have been updated if needed (if not already #included). Link: https://lkml.kernel.org/r/20251215005206.2362276-1-rdunlap@infradead.org Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@intel.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Yury Norov (NVIDIA) <yury.norov@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2026-01-20tcp: move tcp_rate_skb_delivered() to tcp_input.cEric Dumazet2-44/+44
tcp_rate_skb_delivered() is only called from tcp_input.c. Move it there and make it static. Both gcc and clang are (auto)inlining it, TCP performance is increased at a small space cost. $ scripts/bloat-o-meter -t vmlinux.old vmlinux.new add/remove: 0/2 grow/shrink: 3/0 up/down: 509/-187 (322) Function old new delta tcp_sacktag_walk 1682 1867 +185 tcp_ack 5230 5405 +175 tcp_shifted_skb 437 586 +149 __pfx_tcp_rate_skb_delivered 16 - -16 tcp_rate_skb_delivered 171 - -171 Total: Before=22566192, After=22566514, chg +0.00% Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20260118123204.2315993-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-17fou: Don't allow 0 for FOU_ATTR_IPPROTO.Kuniyuki Iwashima1-1/+1
fou_udp_recv() has the same problem mentioned in the previous patch. If FOU_ATTR_IPPROTO is set to 0, skb is not freed by fou_udp_recv() nor "resubmit"-ted in ip_protocol_deliver_rcu(). Let's forbid 0 for FOU_ATTR_IPPROTO. Fixes: 23461551c0062 ("fou: Support for foo-over-udp RX path") Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260115172533.693652-4-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-17gue: Fix skb memleak with inner IP protocol 0.Kuniyuki Iwashima1-0/+3
syzbot reported skb memleak below. [0] The repro generated a GUE packet with its inner protocol 0. gue_udp_recv() returns -guehdr->proto_ctype for "resubmit" in ip_protocol_deliver_rcu(), but this only works with non-zero protocol number. Let's drop such packets. Note that 0 is a valid number (IPv6 Hop-by-Hop Option). I think it is not practical to encap HOPOPT in GUE, so once someone starts to complain, we could pass down a resubmit flag pointer to distinguish two zeros from the upper layer: * no error * resubmit HOPOPT [0] BUG: memory leak unreferenced object 0xffff888109695a00 (size 240): comm "syz.0.17", pid 6088, jiffies 4294943096 hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00 40 c2 10 81 88 ff ff 00 00 00 00 00 00 00 00 .@.............. backtrace (crc a84b336f): kmemleak_alloc_recursive include/linux/kmemleak.h:44 [inline] slab_post_alloc_hook mm/slub.c:4958 [inline] slab_alloc_node mm/slub.c:5263 [inline] kmem_cache_alloc_noprof+0x3b4/0x590 mm/slub.c:5270 __build_skb+0x23/0x60 net/core/skbuff.c:474 build_skb+0x20/0x190 net/core/skbuff.c:490 __tun_build_skb drivers/net/tun.c:1541 [inline] tun_build_skb+0x4a1/0xa40 drivers/net/tun.c:1636 tun_get_user+0xc12/0x2030 drivers/net/tun.c:1770 tun_chr_write_iter+0x71/0x120 drivers/net/tun.c:1999 new_sync_write fs/read_write.c:593 [inline] vfs_write+0x45d/0x710 fs/read_write.c:686 ksys_write+0xa7/0x170 fs/read_write.c:738 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xa4/0xf80 arch/x86/entry/syscall_64.c:94 entry_SYSCALL_64_after_hwframe+0x77/0x7f Fixes: 37dd0247797b1 ("gue: Receive side for Generic UDP Encapsulation") Reported-by: syzbot+4d8c7d16b0e95c0d0f0d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6965534b.050a0220.38aacd.0001.GAE@google.com/ Signed-off-by: Kuniyuki Iwashima <kuniyu@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260115172533.693652-2-kuniyu@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-17tcp: move tcp_rate_skb_sent() to tcp_output.cEric Dumazet2-35/+35
It is only called from __tcp_transmit_skb() and __tcp_retransmit_skb(). Move it in tcp_output.c and make it static. clang compiler is now able to inline it from __tcp_transmit_skb(). gcc compiler inlines it in the two callers, which is also fine. Signed-off-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Link: https://patch.msgid.link/20260114165109.1747722-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-15Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski4-9/+15
Cross-merge networking fixes after downstream PR (net-6.19-rc6). No conflicts, or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-15Merge tag 'ipsec-2026-01-14' of ↵Paolo Abeni1-2/+2
git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2026-01-14 1) Fix inner mode lookup in tunnel mode GSO segmentation. The protocol was taken from the wrong field. 2) Set ipv4 no_pmtu_disc flag only on output SAs. The insertation of input SAs can fail if no_pmtu_disc is set. Please pull or let me know if there are problems. ipsec-2026-01-14 * tag 'ipsec-2026-01-14' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec: xfrm: set ipv4 no_pmtu_disc flag only on output sa when direction is set xfrm: Fix inner mode lookup in tunnel mode GSO segmentation ==================== Link: https://patch.msgid.link/20260114121817.1106134-1-steffen.klassert@secunet.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2026-01-13dst: fix races in rt6_uncached_list_del() and rt_del_uncached_list()Eric Dumazet1-2/+2
syzbot was able to crash the kernel in rt6_uncached_list_flush_dev() in an interesting way [1] Crash happens in list_del_init()/INIT_LIST_HEAD() while writing list->prev, while the prior write on list->next went well. static inline void INIT_LIST_HEAD(struct list_head *list) { WRITE_ONCE(list->next, list); // This went well WRITE_ONCE(list->prev, list); // Crash, @list has been freed. } Issue here is that rt6_uncached_list_del() did not attempt to lock ul->lock, as list_empty(&rt->dst.rt_uncached) returned true because the WRITE_ONCE(list->next, list) happened on the other CPU. We might use list_del_init_careful() and list_empty_careful(), or make sure rt6_uncached_list_del() always grabs the spinlock whenever rt->dst.rt_uncached_list has been set. A similar fix is neeed for IPv4. [1] BUG: KASAN: slab-use-after-free in INIT_LIST_HEAD include/linux/list.h:46 [inline] BUG: KASAN: slab-use-after-free in list_del_init include/linux/list.h:296 [inline] BUG: KASAN: slab-use-after-free in rt6_uncached_list_flush_dev net/ipv6/route.c:191 [inline] BUG: KASAN: slab-use-after-free in rt6_disable_ip+0x633/0x730 net/ipv6/route.c:5020 Write of size 8 at addr ffff8880294cfa78 by task kworker/u8:14/3450 CPU: 0 UID: 0 PID: 3450 Comm: kworker/u8:14 Tainted: G L syzkaller #0 PREEMPT_{RT,(full)} Tainted: [L]=SOFTLOCKUP Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025 Workqueue: netns cleanup_net Call Trace: <TASK> dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120 print_address_description mm/kasan/report.c:378 [inline] print_report+0xca/0x240 mm/kasan/report.c:482 kasan_report+0x118/0x150 mm/kasan/report.c:595 INIT_LIST_HEAD include/linux/list.h:46 [inline] list_del_init include/linux/list.h:296 [inline] rt6_uncached_list_flush_dev net/ipv6/route.c:191 [inline] rt6_disable_ip+0x633/0x730 net/ipv6/route.c:5020 addrconf_ifdown+0x143/0x18a0 net/ipv6/addrconf.c:3853 addrconf_notify+0x1bc/0x1050 net/ipv6/addrconf.c:-1 notifier_call_chain+0x19d/0x3a0 kernel/notifier.c:85 call_netdevice_notifiers_extack net/core/dev.c:2268 [inline] call_netdevice_notifiers net/core/dev.c:2282 [inline] netif_close_many+0x29c/0x410 net/core/dev.c:1785 unregister_netdevice_many_notify+0xb50/0x2330 net/core/dev.c:12353 ops_exit_rtnl_list net/core/net_namespace.c:187 [inline] ops_undo_list+0x3dc/0x990 net/core/net_namespace.c:248 cleanup_net+0x4de/0x7b0 net/core/net_namespace.c:696 process_one_work kernel/workqueue.c:3257 [inline] process_scheduled_works+0xad1/0x1770 kernel/workqueue.c:3340 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3421 kthread+0x711/0x8a0 kernel/kthread.c:463 ret_from_fork+0x510/0xa50 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 </TASK> Allocated by task 803: kasan_save_stack mm/kasan/common.c:57 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:78 unpoison_slab_object mm/kasan/common.c:340 [inline] __kasan_slab_alloc+0x6c/0x80 mm/kasan/common.c:366 kasan_slab_alloc include/linux/kasan.h:253 [inline] slab_post_alloc_hook mm/slub.c:4953 [inline] slab_alloc_node mm/slub.c:5263 [inline] kmem_cache_alloc_noprof+0x18d/0x6c0 mm/slub.c:5270 dst_alloc+0x105/0x170 net/core/dst.c:89 ip6_dst_alloc net/ipv6/route.c:342 [inline] icmp6_dst_alloc+0x75/0x460 net/ipv6/route.c:3333 mld_sendpack+0x683/0xe60 net/ipv6/mcast.c:1844 mld_send_cr net/ipv6/mcast.c:2154 [inline] mld_ifc_work+0x83e/0xd60 net/ipv6/mcast.c:2693 process_one_work kernel/workqueue.c:3257 [inline] process_scheduled_works+0xad1/0x1770 kernel/workqueue.c:3340 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3421 kthread+0x711/0x8a0 kernel/kthread.c:463 ret_from_fork+0x510/0xa50 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 Freed by task 20: kasan_save_stack mm/kasan/common.c:57 [inline] kasan_save_track+0x3e/0x80 mm/kasan/common.c:78 kasan_save_free_info+0x46/0x50 mm/kasan/generic.c:584 poison_slab_object mm/kasan/common.c:253 [inline] __kasan_slab_free+0x5c/0x80 mm/kasan/common.c:285 kasan_slab_free include/linux/kasan.h:235 [inline] slab_free_hook mm/slub.c:2540 [inline] slab_free mm/slub.c:6670 [inline] kmem_cache_free+0x18f/0x8d0 mm/slub.c:6781 dst_destroy+0x235/0x350 net/core/dst.c:121 rcu_do_batch kernel/rcu/tree.c:2605 [inline] rcu_core kernel/rcu/tree.c:2857 [inline] rcu_cpu_kthread+0xba5/0x1af0 kernel/rcu/tree.c:2945 smpboot_thread_fn+0x542/0xa60 kernel/smpboot.c:160 kthread+0x711/0x8a0 kernel/kthread.c:463 ret_from_fork+0x510/0xa50 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 Last potentially related work creation: kasan_save_stack+0x3e/0x60 mm/kasan/common.c:57 kasan_record_aux_stack+0xbd/0xd0 mm/kasan/generic.c:556 __call_rcu_common kernel/rcu/tree.c:3119 [inline] call_rcu+0xee/0x890 kernel/rcu/tree.c:3239 refdst_drop include/net/dst.h:266 [inline] skb_dst_drop include/net/dst.h:278 [inline] skb_release_head_state+0x71/0x360 net/core/skbuff.c:1156 skb_release_all net/core/skbuff.c:1180 [inline] __kfree_skb net/core/skbuff.c:1196 [inline] sk_skb_reason_drop+0xe9/0x170 net/core/skbuff.c:1234 kfree_skb_reason include/linux/skbuff.h:1322 [inline] tcf_kfree_skb_list include/net/sch_generic.h:1127 [inline] __dev_xmit_skb net/core/dev.c:4260 [inline] __dev_queue_xmit+0x26aa/0x3210 net/core/dev.c:4785 NF_HOOK_COND include/linux/netfilter.h:307 [inline] ip6_output+0x340/0x550 net/ipv6/ip6_output.c:247 NF_HOOK+0x9e/0x380 include/linux/netfilter.h:318 mld_sendpack+0x8d4/0xe60 net/ipv6/mcast.c:1855 mld_send_cr net/ipv6/mcast.c:2154 [inline] mld_ifc_work+0x83e/0xd60 net/ipv6/mcast.c:2693 process_one_work kernel/workqueue.c:3257 [inline] process_scheduled_works+0xad1/0x1770 kernel/workqueue.c:3340 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3421 kthread+0x711/0x8a0 kernel/kthread.c:463 ret_from_fork+0x510/0xa50 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 The buggy address belongs to the object at ffff8880294cfa00 which belongs to the cache ip6_dst_cache of size 232 The buggy address is located 120 bytes inside of freed 232-byte region [ffff8880294cfa00, ffff8880294cfae8) The buggy address belongs to the physical page: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x294cf memcg:ffff88803536b781 flags: 0x80000000000000(node=0|zone=1) page_type: f5(slab) raw: 0080000000000000 ffff88802ff1c8c0 ffffea0000bf2bc0 dead000000000006 raw: 0000000000000000 00000000800c000c 00000000f5000000 ffff88803536b781 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 0, migratetype Unmovable, gfp_mask 0x52820(GFP_ATOMIC|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP), pid 9, tgid 9 (kworker/0:0), ts 91119585830, free_ts 91088628818 set_page_owner include/linux/page_owner.h:32 [inline] post_alloc_hook+0x234/0x290 mm/page_alloc.c:1857 prep_new_page mm/page_alloc.c:1865 [inline] get_page_from_freelist+0x28c0/0x2960 mm/page_alloc.c:3915 __alloc_frozen_pages_noprof+0x181/0x370 mm/page_alloc.c:5210 alloc_pages_mpol+0xd1/0x380 mm/mempolicy.c:2486 alloc_slab_page mm/slub.c:3075 [inline] allocate_slab+0x86/0x3b0 mm/slub.c:3248 new_slab mm/slub.c:3302 [inline] ___slab_alloc+0xb10/0x13e0 mm/slub.c:4656 __slab_alloc+0xc6/0x1f0 mm/slub.c:4779 __slab_alloc_node mm/slub.c:4855 [inline] slab_alloc_node mm/slub.c:5251 [inline] kmem_cache_alloc_noprof+0x101/0x6c0 mm/slub.c:5270 dst_alloc+0x105/0x170 net/core/dst.c:89 ip6_dst_alloc net/ipv6/route.c:342 [inline] icmp6_dst_alloc+0x75/0x460 net/ipv6/route.c:3333 mld_sendpack+0x683/0xe60 net/ipv6/mcast.c:1844 mld_send_cr net/ipv6/mcast.c:2154 [inline] mld_ifc_work+0x83e/0xd60 net/ipv6/mcast.c:2693 process_one_work kernel/workqueue.c:3257 [inline] process_scheduled_works+0xad1/0x1770 kernel/workqueue.c:3340 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3421 kthread+0x711/0x8a0 kernel/kthread.c:463 ret_from_fork+0x510/0xa50 arch/x86/kernel/process.c:158 page last free pid 5859 tgid 5859 stack trace: reset_page_owner include/linux/page_owner.h:25 [inline] free_pages_prepare mm/page_alloc.c:1406 [inline] __free_frozen_pages+0xfe1/0x1170 mm/page_alloc.c:2943 discard_slab mm/slub.c:3346 [inline] __put_partials+0x149/0x170 mm/slub.c:3886 __slab_free+0x2af/0x330 mm/slub.c:5952 qlink_free mm/kasan/quarantine.c:163 [inline] qlist_free_all+0x97/0x100 mm/kasan/quarantine.c:179 kasan_quarantine_reduce+0x148/0x160 mm/kasan/quarantine.c:286 __kasan_slab_alloc+0x22/0x80 mm/kasan/common.c:350 kasan_slab_alloc include/linux/kasan.h:253 [inline] slab_post_alloc_hook mm/slub.c:4953 [inline] slab_alloc_node mm/slub.c:5263 [inline] kmem_cache_alloc_noprof+0x18d/0x6c0 mm/slub.c:5270 getname_flags+0xb8/0x540 fs/namei.c:146 getname include/linux/fs.h:2498 [inline] do_sys_openat2+0xbc/0x200 fs/open.c:1426 do_sys_open fs/open.c:1436 [inline] __do_sys_openat fs/open.c:1452 [inline] __se_sys_openat fs/open.c:1447 [inline] __x64_sys_openat+0x138/0x170 fs/open.c:1447 do_syscall_x64 arch/x86/entry/syscall_64.c:63 [inline] do_syscall_64+0xec/0xf80 arch/x86/entry/syscall_64.c:94 Fixes: 8d0b94afdca8 ("ipv6: Keep track of DST_NOCACHE routes in case of iface down/unregister") Fixes: 78df76a065ae ("ipv4: take rt_uncached_lock only if needed") Reported-by: syzbot+179fc225724092b8b2b2@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/6964cdf2.050a0220.eaf7.009d.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Martin KaFai Lau <martin.lau@kernel.org> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260112103825.3810713-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-12net: ipconfig: Remove outdated comment and indent code blockThorsten Blum1-47/+42
The comment has been around ever since commit 1da177e4c3f4 ("Linux-2.6.12-rc2") and can be removed. Remove it and indent the code block accordingly. Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev> Link: https://patch.msgid.link/20260109121128.170020-2-thorsten.blum@linux.dev Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-10ipv4: ip_gre: make ipgre_header() robustEric Dumazet1-2/+9
Analog to commit db5b4e39c4e6 ("ip6_gre: make ip6gre_header() robust") Over the years, syzbot found many ways to crash the kernel in ipgre_header() [1]. This involves team or bonding drivers ability to dynamically change their dev->needed_headroom and/or dev->hard_header_len In this particular crash mld_newpack() allocated an skb with a too small reserve/headroom, and by the time mld_sendpack() was called, syzbot managed to attach an ipgre device. [1] skbuff: skb_under_panic: text:ffffffff89ea3cb7 len:2030915468 put:2030915372 head:ffff888058b43000 data:ffff887fdfa6e194 tail:0x120 end:0x6c0 dev:team0 kernel BUG at net/core/skbuff.c:213 ! Oops: invalid opcode: 0000 [#1] SMP KASAN PTI CPU: 1 UID: 0 PID: 1322 Comm: kworker/1:9 Not tainted syzkaller #0 PREEMPT(full) Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/25/2025 Workqueue: mld mld_ifc_work RIP: 0010:skb_panic+0x157/0x160 net/core/skbuff.c:213 Call Trace: <TASK> skb_under_panic net/core/skbuff.c:223 [inline] skb_push+0xc3/0xe0 net/core/skbuff.c:2641 ipgre_header+0x67/0x290 net/ipv4/ip_gre.c:897 dev_hard_header include/linux/netdevice.h:3436 [inline] neigh_connected_output+0x286/0x460 net/core/neighbour.c:1618 NF_HOOK_COND include/linux/netfilter.h:307 [inline] ip6_output+0x340/0x550 net/ipv6/ip6_output.c:247 NF_HOOK+0x9e/0x380 include/linux/netfilter.h:318 mld_sendpack+0x8d4/0xe60 net/ipv6/mcast.c:1855 mld_send_cr net/ipv6/mcast.c:2154 [inline] mld_ifc_work+0x83e/0xd60 net/ipv6/mcast.c:2693 process_one_work kernel/workqueue.c:3257 [inline] process_scheduled_works+0xad1/0x1770 kernel/workqueue.c:3340 worker_thread+0x8a0/0xda0 kernel/workqueue.c:3421 kthread+0x711/0x8a0 kernel/kthread.c:463 ret_from_fork+0x510/0xa50 arch/x86/kernel/process.c:158 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:246 Fixes: c54419321455 ("GRE: Refactor GRE tunneling code.") Reported-by: syzbot+7c134e1c3aa3283790b9@syzkaller.appspotmail.com Closes: https://www.spinics.net/lists/netdev/msg1147302.html Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260108190214.1667040-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-08ipv4: ip_tunnel: spread netdev_lockdep_set_classes()Eric Dumazet1-3/+2
Inspired by yet another syzbot report. IPv6 tunnels call netdev_lockdep_set_classes() for each tunnel type, while IPv4 currently centralizes netdev_lockdep_set_classes() call from ip_tunnel_init(). Make ip_tunnel_init() a macro, so that we have different lockdep classes per tunnel type. Fixes: 0bef512012b1 ("net: add netdev_lockdep_set_classes() to virtual drivers") Reported-by: syzbot+1240b33467289f5ab50b@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/695d439f.050a0220.1c677c.0347.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260106172426.1760721-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-08Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski5-11/+11
Cross-merge networking fixes after downstream PR (net-6.19-rc5). No conflicts, or adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-08arp: do not assume dev_hard_header() does not change skb->headEric Dumazet1-3/+4
arp_create() is the only dev_hard_header() caller making assumption about skb->head being unchanged. A recent commit broke this assumption. Initialize @arp pointer after dev_hard_header() call. Fixes: db5b4e39c4e6 ("ip6_gre: make ip6gre_header() robust") Reported-by: syzbot+58b44a770a1585795351@syzkaller.appspotmail.com Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260107212250.384552-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-08net: do not write to msg_get_inq in calleeWillem de Bruijn1-5/+3
NULL pointer dereference fix. msg_get_inq is an input field from caller to callee. Don't set it in the callee, as the caller may not clear it on struct reuse. This is a kernel-internal variant of msghdr only, and the only user does reinitialize the field. So this is not critical for that reason. But it is more robust to avoid the write, and slightly simpler code. And it fixes a bug, see below. Callers set msg_get_inq to request the input queue length to be returned in msg_inq. This is equivalent to but independent from the SO_INQ request to return that same info as a cmsg (tp->recvmsg_inq). To reduce branching in the hot path the second also sets the msg_inq. That is WAI. This is a fix to commit 4d1442979e4a ("af_unix: don't post cmsg for SO_INQ unless explicitly asked for"), which fixed the inverse. Also avoid NULL pointer dereference in unix_stream_read_generic if state->msg is NULL and msg->msg_get_inq is written. A NULL state->msg can happen when splicing as of commit 2b514574f7e8 ("net: af_unix: implement splice for stream af_unix sockets"). Also collapse two branches using a bitwise or. Cc: stable@vger.kernel.org Fixes: 4d1442979e4a ("af_unix: don't post cmsg for SO_INQ unless explicitly asked for") Link: https://lore.kernel.org/netdev/willemdebruijn.kernel.24d8030f7a3de@gmail.com/ Signed-off-by: Willem de Bruijn <willemb@google.com> Reviewed-by: Jens Axboe <axboe@kernel.dk> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com> Link: https://patch.msgid.link/20260106150626.3944363-1-willemdebruijn.kernel@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-06udp: udplite is unlikelyEric Dumazet1-2/+3
Add some unlikely() annotations to speed up the fast path, at least with clang compiler. Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260105101719.2378881-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-06udp: call skb_orphan() before skb_attempt_defer_free()Eric Dumazet1-0/+1
Standard UDP receive path does not use skb->destructor. But skmsg layer does use it, since it calls skb_set_owner_sk_safe() from udp_read_skb(). This then triggers this warning in skb_attempt_defer_free(): DEBUG_NET_WARN_ON_ONCE(skb->destructor); We must call skb_orphan() to fix this issue. Fixes: 6471658dc66c ("udp: use skb_attempt_defer_free()") Reported-by: syzbot+3e68572cf2286ce5ebe9@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/695b83bd.050a0220.1c9965.002b.GAE@google.com/T/#u Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260105093630.1976085-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-06ipv4/inet_sock.h: Avoid thousands of -Wflex-array-member-not-at-end warningsGustavo A. R. Silva5-65/+77
Use DEFINE_RAW_FLEX() to avoid thousands of -Wflex-array-member-not-at-end warnings. Remove struct ip_options_data, and adjust the rest of the code so that flexible-array member struct ip_options_rcu::opt.__data[] ends last in struct icmp_bxm. Compensate for this by using the DEFINE_RAW_FLEX() helper to define each on-stack struct instance that contained struct ip_options_data as a member, and to define struct ip_options_rcu with a fixed on-stack size for its nested flexible-array member opt.__data[]. Also, add a couple of code comments to prevent people from adding members to a struct after another member that contains a flexible array. With these changes, fix 2600 warnings of the following type: include/net/inet_sock.h:65:33: warning: structure containing a flexible array member is not at the end of another structure [-Wflex-array-member-not-at-end] Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Link: https://patch.msgid.link/aVteBadWA6AbTp7X@kspp Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-05ipv4: Improve martian logsClara Engler1-4/+4
At the current moment, the logs for martian packets are as follows: ``` martian source {DST} from {SRC}, on dev {DEV} martian destination {DST} from {SRC}, dev {DEV} ``` These messages feel rather hard to understand in production, especially the "martian source" one, mostly because it is grammatically ambitious to parse which part is now the source address and which part is the destination address. For example, "{DST}" may there be interpreted as the actual source address due to following the word "source", thereby implying the actual source address to be the destination one. Personally, I discovered this bug while toying around with TUN interfaces and using them as a tunnel (receiving packets via a TUN interface and sending them over a TCP stream; receiving packets from a TCP stream and writing them to a TUN).[^1] When these IP addresses contained local IPs (i.e. 10.0.0.0/8 in source and destination), everything worked fine. However, sending them to a real routable IP address on the internet led to them being treated as a martian packet, obviously. Using a few sysctl(8) and iptables(8) settings[^2] fixed it, but while debugging I found the log message starting with "martian source" rather confusing, as I was unsure on whether the packet that gets dropped was the packet originating from me or the response from the endpoint, as "martian source <ROUTABLE IP>" could also be falsely interpreted as the response packet being martian, due to the word "source" followed by the routable IP address, implying the source address of that packet is set to this IP, as explained above. In the end, I had to look into the source code of the kernel on where this error message gets generated, which is usually an indicator of there being room for improvement with regard to this error message. In terms of improvement, this commit changes the error messages for martian source and martian destination packets as follows: ``` martian source (src={SRC}, dst={DST}, dev={DEV}) martian destination (src={SRC}, dst={DST}, dev={DEV}) ``` These new wordings leave pretty much no room for ambiguity as all parameters are prefixed with a respective key explaining their semantic meaning. See also the following thread on LKML.[^3] [^1]: <https://backreference.org/2010/03/26/tuntap-interface-tutorial> [^2]: sysctl net.ipv4.ip_forward=1 && \ iptables -A INPUT -i tun0 -j ACCEPT && \ iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE [^3]: <https://lore.kernel.org/all/aSd4Xj8rHrh-krjy@4944566b5c925f79/> Signed-off-by: Clara Engler <cve@cve.cx> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20260101125114.2608-1-cve@cve.cx Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-05compiler-context-analysis: Change __cond_acquires to take return valueMarco Elver1-1/+1
While Sparse is oblivious to the return value of conditional acquire functions, Clang's context analysis needs to know the return value which indicates successful acquisition. Add the additional argument, and convert existing uses. Notably, Clang's interpretation of the value merely relates to the use in a later conditional branch, i.e. 1 ==> context lock acquired in branch taken if condition non-zero, and 0 ==> context lock acquired in branch taken if condition is zero. Given the precise value does not matter, introduce symbolic variants to use instead of either 0 or 1, which should be more intuitive. No functional change intended. Signed-off-by: Marco Elver <elver@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://patch.msgid.link/20251219154418.3592607-10-elver@google.com
2026-01-04inet: frags: drop fraglist conntrack referencesFlorian Westphal1-0/+2
Jakub added a warning in nf_conntrack_cleanup_net_list() to make debugging leaked skbs/conntrack references more obvious. syzbot reports this as triggering, and I can also reproduce this via ip_defrag.sh selftest: conntrack cleanup blocked for 60s WARNING: net/netfilter/nf_conntrack_core.c:2512 [..] conntrack clenups gets stuck because there are skbs with still hold nf_conn references via their frag_list. net.core.skb_defer_max=0 makes the hang disappear. Eric Dumazet points out that skb_release_head_state() doesn't follow the fraglist. ip_defrag.sh can only reproduce this problem since commit 6471658dc66c ("udp: use skb_attempt_defer_free()"), but AFAICS this problem could happen with TCP as well if pmtu discovery is off. The relevant problem path for udp is: 1. netns emits fragmented packets 2. nf_defrag_v6_hook reassembles them (in output hook) 3. reassembled skb is tracked (skb owns nf_conn reference) 4. ip6_output refragments 5. refragmented packets also own nf_conn reference (ip6_fragment calls ip6_copy_metadata()) 6. on input path, nf_defrag_v6_hook skips defragmentation: the fragments already have skb->nf_conn attached 7. skbs are reassembled via ipv6_frag_rcv() 8. skb_consume_udp -> skb_attempt_defer_free() -> skb ends up in pcpu freelist, but still has nf_conn reference. Possible solutions: 1 let defrag engine drop nf_conn entry, OR 2 export kick_defer_list_purge() and call it from the conntrack netns exit callback, OR 3 add skb_has_frag_list() check to skb_attempt_defer_free() 2 & 3 also solve ip_defrag.sh hang but share same drawback: Such reassembled skbs, queued to socket, can prevent conntrack module removal until userspace has consumed the packet. While both tcp and udp stack do call nf_reset_ct() before placing skb on socket queue, that function doesn't iterate frag_list skbs. Therefore drop nf_conn entries when they are placed in defrag queue. Keep the nf_conn entry of the first (offset 0) skb so that reassembled skb retains nf_conn entry for sake of TX path. Note that fixes tag is incorrect; it points to the commit introducing the 'ip_defrag.sh reproducible problem': no need to backport this patch to every stable kernel. Reported-by: syzbot+4393c47753b7808dac7d@syzkaller.appspotmail.com Closes: https://lore.kernel.org/netdev/693b0fa7.050a0220.4004e.040d.GAE@google.com/ Fixes: 6471658dc66c ("udp: use skb_attempt_defer_free()") Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260102140030.32367-1-fw@strlen.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-04inet: ping: Fix icmp out countingyuan.gao1-3/+1
When the ping program uses an IPPROTO_ICMP socket to send ICMP_ECHO messages, ICMP_MIB_OUTMSGS is counted twice. ping_v4_sendmsg ping_v4_push_pending_frames ip_push_pending_frames ip_finish_skb __ip_make_skb icmp_out_count(net, icmp_type); // first count icmp_out_count(sock_net(sk), user_icmph.type); // second count However, when the ping program uses an IPPROTO_RAW socket, ICMP_MIB_OUTMSGS is counted correctly only once. Therefore, the first count should be removed. Fixes: c319b4d76b9e ("net: ipv4: add IPPROTO_ICMP socket kind") Signed-off-by: yuan.gao <yuan.gao@ucloud.cn> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Tested-by: Ido Schimmel <idosch@nvidia.com> Link: https://patch.msgid.link/20251224063145.3615282-1-yuan.gao@ucloud.cn Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-30net: fib: restore ECMP balance from loopbackVadim Fedorenko1-16/+10
Preference of nexthop with source address broke ECMP for packets with source addresses which are not in the broadcast domain, but rather added to loopback/dummy interfaces. Original behaviour was to balance over nexthops while now it uses the latest nexthop from the group. To fix the issue introduce next hop scoring system where next hops with source address equal to requested will always have higher priority. For the case with 198.51.100.1/32 assigned to dummy0 and routed using 192.0.2.0/24 and 203.0.113.0/24 networks: 2: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000 link/ether d6:54:8a:ff:78:f5 brd ff:ff:ff:ff:ff:ff inet 198.51.100.1/32 scope global dummy0 valid_lft forever preferred_lft forever 7: veth1@if6: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether 06:ed:98:87:6d:8a brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 192.0.2.2/24 scope global veth1 valid_lft forever preferred_lft forever inet6 fe80::4ed:98ff:fe87:6d8a/64 scope link proto kernel_ll valid_lft forever preferred_lft forever 9: veth3@if8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000 link/ether ae:75:23:38:a0:d2 brd ff:ff:ff:ff:ff:ff link-netnsid 0 inet 203.0.113.2/24 scope global veth3 valid_lft forever preferred_lft forever inet6 fe80::ac75:23ff:fe38:a0d2/64 scope link proto kernel_ll valid_lft forever preferred_lft forever ~ ip ro list: default nexthop via 192.0.2.1 dev veth1 weight 1 nexthop via 203.0.113.1 dev veth3 weight 1 192.0.2.0/24 dev veth1 proto kernel scope link src 192.0.2.2 203.0.113.0/24 dev veth3 proto kernel scope link src 203.0.113.2 before: for i in {1..255} ; do ip ro get 10.0.0.$i; done | grep veth | awk ' {print $(NF-2)}' | sort | uniq -c: 255 veth3 after: for i in {1..255} ; do ip ro get 10.0.0.$i; done | grep veth | awk ' {print $(NF-2)}' | sort | uniq -c: 122 veth1 133 veth3 Fixes: 32607a332cfe ("ipv4: prefer multipath nexthop that matches source address") Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20251221192639.3911901-1-vadim.fedorenko@linux.dev Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-30ipv4: Fix reference count leak when using error routes with nexthop objectsIdo Schimmel1-3/+4
When a nexthop object is deleted, it is marked as dead and then fib_table_flush() is called to flush all the routes that are using the dead nexthop. The current logic in fib_table_flush() is to only flush error routes (e.g., blackhole) when it is called as part of network namespace dismantle (i.e., with flush_all=true). Therefore, error routes are not flushed when their nexthop object is deleted: # ip link add name dummy1 up type dummy # ip nexthop add id 1 dev dummy1 # ip route add 198.51.100.1/32 nhid 1 # ip route add blackhole 198.51.100.2/32 nhid 1 # ip nexthop del id 1 # ip route show blackhole 198.51.100.2 nhid 1 dev dummy1 As such, they keep holding a reference on the nexthop object which in turn holds a reference on the nexthop device, resulting in a reference count leak: # ip link del dev dummy1 [ 70.516258] unregister_netdevice: waiting for dummy1 to become free. Usage count = 2 Fix by flushing error routes when their nexthop is marked as dead. IPv6 does not suffer from this problem. Fixes: 493ced1ac47c ("ipv4: Allow routes to use nexthop objects") Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Closes: https://lore.kernel.org/netdev/d943f806-4da6-4970-ac28-b9373b0e63ac@I-love.SAKURA.ne.jp/ Reported-by: syzbot+881d65229ca4f9ae8c84@syzkaller.appspotmail.com Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: David Ahern <dsahern@kernel.org> Link: https://patch.msgid.link/20251221144829.197694-1-idosch@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-23erspan: Initialize options_len before referencing options.Frode Nordahl1-2/+4
The struct ip_tunnel_info has a flexible array member named options that is protected by a counted_by(options_len) attribute. The compiler will use this information to enforce runtime bounds checking deployed by FORTIFY_SOURCE string helpers. As laid out in the GCC documentation, the counter must be initialized before the first reference to the flexible array member. After scanning through the files that use struct ip_tunnel_info and also refer to options or options_len, it appears the normal case is to use the ip_tunnel_info_opts_set() helper. Said helper would initialize options_len properly before copying data into options, however in the GRE ERSPAN code a partial update is done, preventing the use of the helper function. Before this change the handling of ERSPAN traffic in GRE tunnels would cause a kernel panic when the kernel is compiled with GCC 15+ and having FORTIFY_SOURCE configured: memcpy: detected buffer overflow: 4 byte write of buffer size 0 Call Trace: <IRQ> __fortify_panic+0xd/0xf erspan_rcv.cold+0x68/0x83 ? ip_route_input_slow+0x816/0x9d0 gre_rcv+0x1b2/0x1c0 gre_rcv+0x8e/0x100 ? raw_v4_input+0x2a0/0x2b0 ip_protocol_deliver_rcu+0x1ea/0x210 ip_local_deliver_finish+0x86/0x110 ip_local_deliver+0x65/0x110 ? ip_rcv_finish_core+0xd6/0x360 ip_rcv+0x186/0x1a0 Cc: stable@vger.kernel.org Link: https://gcc.gnu.org/onlinedocs/gcc/Common-Variable-Attributes.html#index-counted_005fby-variable-attribute Reported-at: https://launchpad.net/bugs/2129580 Fixes: bb5e62f2d547 ("net: Add options as a flexible array to struct ip_tunnel_info") Signed-off-by: Frode Nordahl <fnordahl@ubuntu.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20251213101338.4693-1-fnordahl@ubuntu.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2025-12-10inet: frags: flush pending skbs in fqdir_pre_exit()Jakub Kicinski2-5/+43
We have been seeing occasional deadlocks on pernet_ops_rwsem since September in NIPA. The stuck task was usually modprobe (often loading a driver like ipvlan), trying to take the lock as a Writer. lockdep does not track readers for rwsems so the read wasn't obvious from the reports. On closer inspection the Reader holding the lock was conntrack looping forever in nf_conntrack_cleanup_net_list(). Based on past experience with occasional NIPA crashes I looked thru the tests which run before the crash and noticed that the crash follows ip_defrag.sh. An immediate red flag. Scouring thru (de)fragmentation queues reveals skbs sitting around, holding conntrack references. The problem is that since conntrack depends on nf_defrag_ipv6, nf_defrag_ipv6 will load first. Since nf_defrag_ipv6 loads first its netns exit hooks run _after_ conntrack's netns exit hook. Flush all fragment queue SKBs during fqdir_pre_exit() to release conntrack references before conntrack cleanup runs. Also flush the queues in timer expiry handlers when they discover fqdir->dead is set, in case packet sneaks in while we're running the pre_exit flush. The commit under Fixes is not exactly the culprit, but I think previously the timer firing would eventually unblock the spinning conntrack. Fixes: d5dd88794a13 ("inet: fix various use-after-free in defrags units") Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20251207010942.1672972-4-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2025-12-10inet: frags: add inet_frag_queue_