The Linux Kernel Network Scheduler Vulnerabilities and Exploits - Privilege Escalation
The u32 filter Overview
Where the module is:
net/sched/cls_u32.c
Ugly (or Universal) 32bit key Packet Classifier.
Linux TC (traffic control) Flow Control Introduction
Linux TC imposes different throughput and delay limits on multiple specific ip
Netlink and TC
TC is implemented based on the Netlink protocol.
Multi-queue default Qdisc
A custom qdisc setup
One Example
Transmission quality control, transmission bandwidth and delay
Using some SHELL commands can realize the use of TC. It can also be realized through Netlink programming.
Vulnerability Mining
For the 2021 Tianfu Cup competition, I sorted out the loopholes that syzkaller played locally before. Found a UAF vulnerability on the exclusive SLAB, because this kind of vulnerability has not been used before, but reported the mentality of giving it a try.
[ 203.112091] ==================================================================
[ 203.112113] BUG: KASAN: use-after-free in sock_prot_inuse_add+0x80/0x90
[ 203.112121] Read of size 8 at addr ffff888106660188 by task poc/6597
[ 203.112134] CPU: 0 PID: 6597 Comm: poc Tainted: G ---------r- - 4.18.0+ #32
[ 203.112138] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020
[ 203.112140] Call Trace:
[ 203.112148] dump_stack+0xa4/0xea
[ 203.112164] print_address_description.constprop.5+0x1e/0x230
[ 203.112197] __kasan_report.cold.7+0x37/0x82
[ 203.112210] kasan_report+0x3b/0x50
[ 203.112217] sock_prot_inuse_add+0x80/0x90
[ 203.112224] netlink_release+0x97f/0x1190
[ 203.112257] __sock_release+0xd3/0x2b0
[ 203.112262] sock_close+0x1e/0x30
[ 203.112267] __fput+0x2d4/0x840
[ 203.112275] task_work_run+0x16e/0x1d0
[ 203.112284] exit_to_usermode_loop+0x207/0x230
[ 203.112290] do_syscall_64+0x3f5/0x470
[ 203.112302] entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 203.112308] RIP: 0033:0x7fee34abd1a8
[ 203.112315] Code: 07 02 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 b5 44 2d 00 8b 00 85 c0 75 17 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 40 c3 0f 1f 80 00 00 00 00 53 89 fb 48 83 ec
[ 203.112318] RSP: 002b:00007ffdb62366c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[ 203.112323] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007fee34abd1a8
[ 203.112327] RDX: 0000000000000000 RSI: 00000000200001c0 RDI: 0000000000000004
[ 203.112330] RBP: 00007ffdb62366e0 R08: 00007ffdb62366e0 R09: 00007ffdb62366e0
[ 203.112333] R10: 00007ffdb62366e0 R11: 0000000000000246 R12: 0000000000400f50
[ 203.112337] R13: 00007ffdb6236820 R14: 0000000000000000 R15: 0000000000000000
[ 203.112345] Allocated by task 6247:
[ 203.112353] kasan_save_stack+0x1d/0x80
[ 203.112359] __kasan_kmalloc.constprop.10+0xc1/0xd0
[ 203.112367] slab_post_alloc_hook+0x43/0x280
[ 203.112377] kmem_cache_alloc+0x131/0x280
[ 203.112386] copy_net_ns+0xec/0x330
[ 203.112395] create_new_namespaces+0x583/0x9a0
[ 203.112404] unshare_nsproxy_namespaces+0xcb/0x200
[ 203.112414] ksys_unshare+0x468/0x8d0
[ 203.112423] __x64_sys_unshare+0x36/0x50
[ 203.112432] do_syscall_64+0xe4/0x470
[ 203.112443] entry_SYSCALL_64_after_hwframe+0x65/0xca
[ 203.112453] Freed by task 59:
[ 203.112487] kasan_save_stack+0x1d/0x80
[ 203.112510] kasan_set_track+0x20/0x30
[ 203.112535] kasan_set_free_info+0x1f/0x30
[ 203.112557] __kasan_slab_free+0x108/0x150
[ 203.112578] kmem_cache_free+0x83/0x430
[ 203.112593] net_drop_ns+0x7d/0x90
[ 203.112604] cleanup_net+0x6ee/0x960
[ 203.112619] process_one_work+0x742/0x1030
[ 203.112632] worker_thread+0x95/0xce0
[ 203.112643] kthread+0x32c/0x3f0
[ 203.112654] ret_from_fork+0x35/0x40
[ 203.112686] The buggy address belongs to the object at ffff888106660000
which belongs to the cache net_namespace of size 8000
[ 203.112698] The buggy address is located 392 bytes inside of
8000-byte region [ffff888106660000, ffff888106661f40)
[ 203.112704] The buggy address belongs to the page:
[ 203.112739] page:ffffea0004199800 refcount:1 mapcount:0 mapping:00000000306a7880 index:0xffff888106664080 head:ffffea0004199800 order:3 compound_mapcount:0 compound_pincount:0
[ 203.112752] flags: 0x17ffffc0008100(slab|head)
[ 203.112774] raw: 0017ffffc0008100 dead000000000100 dead000000000200 ffff88810b6ff600
[ 203.112792] raw: ffff888106664080 0000000080030002 00000001ffffffff ffff888101f819c1
[ 203.112798] page dumped because: kasan: bad access detected
[ 203.112803] pages's memcg:ffff888101f819c1
[ 203.112814] Memory state around the buggy address:
[ 203.112831] ffff888106660080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 203.112857] ffff888106660100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 203.112868] >ffff888106660180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 203.112873] ^
[ 203.112884] ffff888106660200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 203.112894] ffff888106660280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 203.112900] =================================================================
Vulnerability Principle
Original PoC
The PoC automatically converted by syzkaller can trigger the vulnerability stably.
Distribute
unshare
|-> __x64_sys_unshare
|-> ksys_unshare
|-> unshare_nsproxy_namespaces
|-> copy_net_ns
|-> kmem_cache_alloc
freed
exit_process
|-> ret_from_fork
|-> kthread
|-> worker_thread
|-> process_one_work
|-> cleanup_net
|-> net_drop_ns
|-> kmem_cache_free
UAFsock_close
|-> exit_to_usermode_loop
|-> task_work_run
|-> __fput
|-> sock_close
|-> __sock_release
|-> sock_prot_inuse_add
Distributing the source code of netnet/core/net_namespace.c
445 struct net *copy_net_ns(unsigned long flags,
446 struct user_namespace *user_ns, struct net *old_net)
447 {
448 struct ucounts *ucounts;
449 struct net *net;
450 int rv;
451
452 if (!(flags & CLONE_NEWNET))
453 return get_net(old_net);
454
455 ucounts = inc_net_namespaces(user_ns);
456 if (!ucounts)
457 return ERR_PTR(-ENOSPC);
458
459 net = net_alloc(); <---
460 if (!net) {
461 rv = -ENOMEM;
462 goto dec_ucounts;
463 }
464 refcount_set(&net->passive, 1);
465 net->ucounts = ucounts;
466 get_user_ns(user_ns);
....
487 return net;
488 }
395 static struct net *net_alloc(void)
396 {
397 struct net *net = NULL;
398 struct net_generic *ng;
399
400 ng = net_alloc_generic();
401 if (!ng)
402 goto out;
403
404 net = kmem_cache_zalloc(net_cachep, GFP_KERNEL); <---
405 if (!net)
406 goto out_free;
407
....
427 }
$ sudo cat /sys/kernel/slab/net_namespace/object_size
4928
$ sudo cat /sys/kernel/slab/net_namespace/order
3
release function437 void net_drop_ns(void *p)
438 {
439 struct net *net = (struct net *)p;
440
441 if (net)
442 net_free(net);
443 }
444
The structure of UAF (hereinafter, net_namespace is collectively referred to as the net structure) 56 struct net {
57 /* First cache line can be often dirtied.
58 |* Do not place here read-mostly fields.
59 |*/
60 refcount_t passive; /* To decide when the network
61 |* namespace should be freed.
62 |*/
63 spinlock_t rules_mod_lock;
64
65 unsigned int dev_unreg_count;
66
67 unsigned int dev_base_seq; /* protected by rtnl_mutex */
68 int ifindex;
69
70 spinlock_t nsid_lock;
71 atomic_t fnhe_genid;
72
73 struct list_head list; /* list of network namespaces */
74 struct list_head exit_list; /* To linked to call pernet exit
75 |* methods on dead net (
76 |* pernet_ops_rwsem read locked),
77 |* or to unregister pernet ops
78 |* (pernet_ops_rwsem write locked).
79 |*/
80 struct llist_node cleanup_list; /* namespaces on death row */
81
82 #ifdef CONFIG_KEYS
83 struct key_tag *key_domain; /* Key domain of operation tag */
84 #endif
85 struct user_namespace *user_ns; /* Owning user namespace */
86 struct ucounts *ucounts;
87 struct idr netns_ids;
88
89 struct ns_common ns; <---/*现实任意地址读*/
90
91 struct list_head dev_base_head;
92 struct proc_dir_entry *proc_net;
93 struct proc_dir_entry *proc_net_stat;
94
95 #ifdef CONFIG_SYSCTL
96 struct ctl_table_set sysctls;
97 #endif
98
99 struct sock *rtnl; /* rtnetlink socket */
100 struct sock *genl_sock;
101
102 struct uevent_sock *uevent_sock; /* uevent socket */
103
104 struct hlist_head *dev_name_head;
105 struct hlist_head *dev_index_head;
106 struct raw_notifier_head netdev_chain;
107
108 /* Note that @hash_mix can be read millions times per second,
109 |* it is critical that it is on a read_mostly cache line.
110 |*/
111 u32 hash_mix;
112
113 struct net_device *loopback_dev; /* The loopback */
114
115 /* core fib_rules */
116 struct list_head rules_ops;
117
118 struct netns_core core;
119 struct netns_mib mib;
120 struct netns_packet packet;
121 struct netns_unix unx;
122 struct netns_nexthop nexthop;
123 struct netns_ipv4 ipv4;
124 #if IS_ENABLED(CONFIG_IPV6)
125 struct netns_ipv6 ipv6;
126 #endif
127 #if IS_ENABLED(CONFIG_IEEE802154_6LOWPAN)
128 struct netns_ieee802154_lowpan ieee802154_lowpan;
129 #endif
130 #if defined(CONFIG_IP_SCTP) || defined(CONFIG_IP_SCTP_MODULE)
131 struct netns_sctp sctp;
132 #endif
133 #ifdef CONFIG_NETFILTER
134 struct netns_nf nf;
135 #if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
136 struct netns_ct ct;
137 #endif
138 #if defined(CONFIG_NF_TABLES) || defined(CONFIG_NF_TABLES_MODULE)
139 struct netns_nftables nft;
140 #endif
141 #endif
142 #ifdef CONFIG_WEXT_CORE
143 struct sk_buff_head wext_nlevents;
144 #endif
145 struct net_generic __rcu *gen;
146
147 /* Used to store attached BPF programs */
148 struct netns_bpf bpf;
149
150 /* Note : following structs are cache line aligned */
151 #ifdef CONFIG_XFRM
152 struct netns_xfrm xfrm;
153 #endif
154
155 u64 net_cookie; /* written once */
156
157 #if IS_ENABLED(CONFIG_IP_VS)
158 struct netns_ipvs *ipvs;
159 #endif
160 #if IS_ENABLED(CONFIG_MPLS)
161 struct netns_mpls mpls;
162 #endif
163 #if IS_ENABLED(CONFIG_CAN)
164 struct netns_can can;
165 #endif
166 #ifdef CONFIG_XDP_SOCKETS
167 struct netns_xdp xdp;
168 #endif
169 #if IS_ENABLED(CONFIG_MCTP)
170 struct netns_mctp mctp;
171 #endif
172 #if IS_ENABLED(CONFIG_CRYPTO_USER)
173 struct sock *crypto_nlsk;
174 #endif
175 struct sock *diag_nlsk;
176 #if IS_ENABLED(CONFIG_SMC)
177 struct netns_smc smc;
178 #endif
179 } __randomize_layout;
PoC Rewriting
After further analysis, it is because the u32_change function will incorrectly reduce the reference count of nets, which leads to the logic problem of UAF. Starting from this, the trigger path of PoC is optimized.u32_change()
|--> u32_destroy_key()
|--> tcf_exts_put_net()
|--> put_net()
At the same time, a logical primitive that subtracts 1 from the reference count on the net is constructed.The optimized trigger process is as follows:
[ 253.623920] ------------[ cut here ]------------
[ 253.623929] refcount_t: underflow; use-after-free.
[ 253.623984] WARNING: CPU: 0 PID: 4009 at lib/refcount.c:28 refcount_warn_saturate+0x10c/0x1f0
[ 253.624026] Modules linked in: act_police cls_u32 ip6_gre gre ip6_tunnel tunnel6 uas usb_storage binfmt_misc snd_seq_dummy snd_hrtimer vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock snd_ens1371 snd_ac97_codec gameport ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi intel_rapl_msr intel_rapl_common nls_iso8859_1 snd_seq crct10dif_pclmul ghash_clmulni_intel sch_fq_codel aesni_intel snd_seq_device crypto_simd snd_timer cryptd snd vmw_balloon joydev rapl input_leds soundcore vmw_vmci serio_raw vmwgfx ttm drm_kms_helper mac_hid cec rc_core fb_sys_fops syscopyarea sysfillrect sysimgblt ipmi_devintf ipmi_msghandler msr parport_pc ppdev lp drm parport ip_tables x_tables autofs4 hid_generic crc32_pclmul psmouse usbhid ahci mptspi hid libahci mptscsih e1000 mptbase scsi_transport_spi i2c_piix4 pata_acpi floppy
[ 253.624306] CPU: 0 PID: 4009 Comm: apparmor_parser Tainted: G B 5.15.30+ #2
[ 253.624330] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020
[ 253.624338] RIP: 0010:refcount_warn_saturate+0x10c/0x1f0
[ 253.624351] Code: 1d 6d 3a 1d 03 31 ff 89 de e8 90 f1 18 ff 84 db 75 a0 e8 47 f6 18 ff 48 c7 c7 e0 f0 65 85 c6 05 4d 3a 1d 03 01 e8 f2 76 57 01 <0f> 0b eb 84 e8 2b f6 18 ff 0f b6 1d 36 3a 1d 03 31 ff 89 de e8 5b
[ 253.624361] RSP: 0000:ffff888137fafc90 EFLAGS: 00010282
[ 253.624369] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 253.624376] RDX: ffff88810caf0000 RSI: 0000000000000100 RDI: ffffed1026ff5f84
[ 253.624383] RBP: ffff888137fafca0 R08: 0000000000000100 R09: ffff8881e183098b
[ 253.624390] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888120ec008c
[ 253.624397] R13: ffff888105f42000 R14: ffff888120ec0000 R15: ffff888120ec008c
[ 253.624404] FS: 00007fc64fc8d740(0000) GS:ffff8881e1800000(0000) knlGS:0000000000000000
[ 253.624414] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 253.624421] CR2: 000055893f3fadf9 CR3: 0000000135002001 CR4: 00000000003706f0
[ 253.624445] Call Trace:
[ 253.624451] <TASK>
[ 253.624458] __sk_destruct+0x693/0x790
[ 253.624478] sk_destruct+0xd3/0x100
[ 253.624494] __sk_free+0xfe/0x400
[ 253.624509] sk_free+0x88/0xc0
[ 253.624524] deferred_put_nlk_sk+0x170/0x320
[ 253.624544] rcu_core+0x51a/0x1250
[ 253.624607] rcu_core_si+0xe/0x10
[ 253.624618] __do_softirq+0x189/0x536
[ 253.624631] irq_exit_rcu+0xec/0x130
[ 253.624641] sysvec_apic_timer_interrupt+0x40/0x90
[ 253.624664] asm_sysvec_apic_timer_interrupt+0x12/0x20
[ 253.624675] RIP: 0033:0x55893f2e92d2
[ 253.624685] Code: c3 0f 1f 80 00 00 00 00 48 39 cb 74 3b 48 8b 7d 10 49 89 d8 4c 89 ee 48 8b 07 48 89 54 24 68 44 89 f2 48 89 4c 24 60 4c 89 e1 <48> 8b 40 38 48 83 c4 28 5b 5d 41 5c 41 5d 41 5e 41 5f ff e0 66 2e
[ 253.624694] RSP: 002b:00007ffc26b6c960 EFLAGS: 00000202
[ 253.624703] RAX: 000055893f3ec3a0 RBX: 0000558940c048d0 RCX: 000055893f3eb588
[ 253.624710] RDX: 0000000000000006 RSI: 0000000000000000 RDI: 000055893f3eb510
[ 253.624717] RBP: 000055893f3eb528 R08: 0000558940c048d0 R09: 000055893f3eb4a0
[ 253.624723] R10: 0000558940e14270 R11: 00007fc64fea9ce0 R12: 000055893f3eb588
[ 253.624730] R13: 0000000000000000 R14: 0000000000000006 R15: 000055893f3a48e8
[ 253.624740] </TASK>
[ 253.624743] ---[ end trace ddbeecae4d8b2b8c ]---
[ 253.626421] ------------[ cut here ]------------
[ 253.626431] refcount_t: saturated; leaking memory.
[ 253.626489] WARNING: CPU: 3 PID: 309 at lib/refcount.c:19 refcount_warn_saturate+0x1bd/0x1f0
[ 253.626513] Modules linked in: act_police cls_u32 ip6_gre gre ip6_tunnel tunnel6 uas usb_storage binfmt_misc snd_seq_dummy snd_hrtimer vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock snd_ens1371 snd_ac97_codec gameport ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi intel_rapl_msr intel_rapl_common nls_iso8859_1 snd_seq crct10dif_pclmul ghash_clmulni_intel sch_fq_codel aesni_intel snd_seq_device crypto_simd snd_timer cryptd snd vmw_balloon joydev rapl input_leds soundcore vmw_vmci serio_raw vmwgfx ttm drm_kms_helper mac_hid cec rc_core fb_sys_fops syscopyarea sysfillrect sysimgblt ipmi_devintf ipmi_msghandler msr parport_pc ppdev lp drm parport ip_tables x_tables autofs4 hid_generic crc32_pclmul psmouse usbhid ahci mptspi hid libahci mptscsih e1000 mptbase scsi_transport_spi i2c_piix4 pata_acpi floppy
[ 253.626837] CPU: 3 PID: 309 Comm: kworker/u256:28 Tainted: G B W 5.15.30+ #2
[ 253.626851] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020
[ 253.626859] Workqueue: netns cleanup_net
[ 253.626874] RIP: 0010:refcount_warn_saturate+0x1bd/0x1f0
[ 253.626888] Code: 03 31 ff 89 de e8 e3 f0 18 ff 84 db 0f 85 ef fe ff ff e8 96 f5 18 ff 48 c7 c7 e0 ef 65 85 c6 05 9f 39 1d 03 01 e8 41 76 57 01 <0f> 0b e9 d0 fe ff ff e8 77 f5 18 ff 48 c7 c7 40 f1 65 85 c6 05 7c
[ 253.626899] RSP: 0000:ffff8881032ff688 EFLAGS: 00010282
[ 253.626908] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[ 253.626915] RDX: ffff888103093380 RSI: 0000000000000000 RDI: ffffed102065fec3
[ 253.626922] RBP: ffff8881032ff698 R08: 0000000000000000 R09: ffff8881e19b098b
[ 253.626930] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888120ec008c
[ 253.626936] R13: ffff88812dc76500 R14: dffffc0000000000 R15: 00000000c0000000
[ 253.626944] FS: 0000000000000000(0000) GS:ffff8881e1980000(0000) knlGS:0000000000000000
[ 253.626954] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 253.626961] CR2: 00007f2ede8e1024 CR3: 00000001736a6006 CR4: 00000000003706e0
[ 253.626993] Call Trace:
[ 253.626997] <TASK>
[ 253.627006] u32_clear_hnode+0x4c7/0x680 [cls_u32]
[ 253.627058] u32_destroy_hnode.isra.0+0xa4/0x240 [cls_u32]
[ 253.627069] u32_destroy+0x2da/0x390 [cls_u32]
[ 253.627080] tcf_proto_destroy+0x85/0x300
[ 253.627091] tcf_proto_put+0x9c/0xd0
[ 253.627101] tcf_chain_flush+0x1c0/0x310
[ 253.627112] __tcf_block_put+0x158/0x2e0
[ 253.627123] tcf_block_put+0xe3/0x130
[ 253.627178] fq_codel_destroy+0x3c/0xb0 [sch_fq_codel]
[ 253.627189] qdisc_destroy+0xb1/0x2a0
[ 253.627200] qdisc_put+0xe0/0x100
[ 253.627211] dev_shutdown+0x253/0x390
[ 253.627224] unregister_netdevice_many+0x7e0/0x1720
[ 253.627282] ip6gre_exit_batch_net+0x36b/0x450 [ip6_gre]
[ 253.627367] ops_exit_list+0x115/0x160
[ 253.627378] cleanup_net+0x475/0xb40
[ 253.627403] process_one_work+0x8bf/0x11d0
[ 253.627416] worker_thread+0x60b/0x1340
[ 253.627441] kthread+0x388/0x470
[ 253.627461] ret_from_fork+0x22/0x30
[ 253.627476] </TASK>
[ 253.627480] ---[ end trace ddbeecae4d8b2b8d ]---
Vulnerability Patch
In the u32_change function, the tcf_exts_put_net function (decreases the reference count on nets by 1) should not be executed.author Eric Dumazet <edumazet@google.com> 2022-04-13 10:35:41 -0700
committer Jakub Kicinski <kuba@kernel.org> 2022-04-15 14:26:11 -0700
commit 3db09e762dc79584a69c10d74a6b98f89a9979f8 (patch)
tree 1a269d290124f61d42c2cb059de92a0661f818a5
parent f3226eed54318e7bdc186f8f7ed27bcd3cb8b681 (diff)
download linux-3db09e762dc79584a69c10d74a6b98f89a9979f8.tar.gz
net/sched: cls_u32: fix netns refcount changes in u32_change()
We are now able to detect extra put_net() at the moment
they happen, instead of much later in correct code paths.
u32_init_knode() / tcf_exts_init() populates the ->exts.net
pointer, but as mentioned in tcf_exts_init(),
the refcount on netns has not been elevated yet.
The refcount is taken only once tcf_exts_get_net()
is called.
So the two u32_destroy_key() calls from u32_change()
are attempting to release an invalid reference on the netns.
syzbot report:
refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 0 PID: 21708 at lib/refcount.c:31 refcount_warn_saturate+0xbf/0x1e0 lib/refcount.c:31
Modules linked in:
CPU: 0 PID: 21708 Comm: syz-executor.5 Not tainted 5.18.0-rc2-next-20220412-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:refcount_warn_saturate+0xbf/0x1e0 lib/refcount.c:31
Code: 1d 14 b6 b2 09 31 ff 89 de e8 6d e9 89 fd 84 db 75 e0 e8 84 e5 89 fd 48 c7 c7 40 aa 26 8a c6 05 f4 b5 b2 09 01 e8 e5 81 2e 05 <0f> 0b eb c4 e8 68 e5 89 fd 0f b6 1d e3 b5 b2 09 31 ff 89 de e8 38
RSP: 0018:ffffc900051af1b0 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000040000 RSI: ffffffff8160a0c8 RDI: fffff52000a35e28
RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff81604a9e R11: 0000000000000000 R12: 1ffff92000a35e3b
R13: 00000000ffffffef R14: ffff8880211a0194 R15: ffff8880577d0a00
FS: 00007f25d183e700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f19c859c028 CR3: 0000000051009000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
__refcount_dec include/linux/refcount.h:344 [inline]
refcount_dec include/linux/refcount.h:359 [inline]
ref_tracker_free+0x535/0x6b0 lib/ref_tracker.c:118
netns_tracker_free include/net/net_namespace.h:327 [inline]
put_net_track include/net/net_namespace.h:341 [inline]
tcf_exts_put_net include/net/pkt_cls.h:255 [inline]
u32_destroy_key.isra.0+0xa7/0x2b0 net/sched/cls_u32.c:394
u32_change+0xe01/0x3140 net/sched/cls_u32.c:909
tc_new_tfilter+0x98d/0x2200 net/sched/cls_api.c:2148
rtnetlink_rcv_msg+0x80d/0xb80 net/core/rtnetlink.c:6016
netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2495
netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
netlink_unicast+0x543/0x7f0 net/netlink/af_netlink.c:1345
netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1921
sock_sendmsg_nosec net/socket.c:705 [inline]
sock_sendmsg+0xcf/0x120 net/socket.c:725
____sys_sendmsg+0x6e2/0x800 net/socket.c:2413
___sys_sendmsg+0xf3/0x170 net/socket.c:2467
__sys_sendmsg+0xe5/0x1b0 net/socket.c:2496
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f25d0689049
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f25d183e168 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f25d079c030 RCX: 00007f25d0689049
RDX: 0000000000000000 RSI: 0000000020000340 RDI: 0000000000000005
RBP: 00007f25d06e308d R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffd0b752e3f R14: 00007f25d183e300 R15: 0000000000022000
</TASK>
Fixes: 35c55fc156d8 ("cls_u32: use tcf_exts_get_net() before call_rcu()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Diffstat
-rw-r--r-- net/sched/cls_u32.c 16
1 files changed, 10 insertions, 6 deletions
diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index cf5649292ee00..fcba6c43ba509 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -386,14 +386,19 @@ static int u32_init(struct tcf_proto *tp)
return 0;
}
-static int u32_destroy_key(struct tc_u_knode *n, bool free_pf)
+static void __u32_destroy_key(struct tc_u_knode *n)
{
struct tc_u_hnode *ht = rtnl_dereference(n->ht_down);
tcf_exts_destroy(&n->exts);
- tcf_exts_put_net(&n->exts);
if (ht && --ht->refcnt == 0)
kfree(ht);
+ kfree(n);
+}
+
+static void u32_destroy_key(struct tc_u_knode *n, bool free_pf)
+{
+ tcf_exts_put_net(&n->exts);
#ifdef CONFIG_CLS_U32_PERF
if (free_pf)
free_percpu(n->pf);
@@ -402,8 +407,7 @@ static int u32_destroy_key(struct tc_u_knode *n, bool free_pf)
if (free_pf)
free_percpu(n->pcpu_success);
#endif
- kfree(n);
- return 0;
+ __u32_destroy_key(n);
}
/* u32_delete_key_rcu should be called when free'ing a copied
@@ -900,13 +904,13 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
extack);
if (err) {
- u32_destroy_key(new, false);
+ __u32_destroy_key(new);
return err;
}
err = u32_replace_hw_knode(tp, new, flags, extack);
if (err) {
- u32_destroy_key(new, false);
+ __u32_destroy_key(new);
return err;
}
Problem Introduction
commit 35c55fc156d85a396a975fc17636f560fc02fd65
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date: Mon Nov 6 13:47:30 2017 -0800
cls_u32: use tcf_exts_get_net() before call_rcu()
Hold netns refcnt before call_rcu() and release it after
the tcf_exts_destroy() is done.
Note, on ->destroy() path we have to respect the return value
of tcf_exts_get_net(), on other paths it should always return
true, so we don't need to care.
Cc: Lucas Bates <lucasb@mojatatu.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index dadd1b344497..b58eccb21f03 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -399,6 +399,7 @@ static int u32_destroy_key(struct tcf_proto *tp, struct tc_u_knode *n,
bool free_pf)
{
tcf_exts_destroy(&n->exts);
+ tcf_exts_put_net(&n->exts);
if (n->ht_down)
n->ht_down->refcnt--;
#ifdef CONFIG_CLS_U32_PERF
@@ -476,6 +477,7 @@ static int u32_delete_key(struct tcf_proto *tp, struct tc_u_knode *key)
RCU_INIT_POINTER(*kp, key->next);
tcf_unbind_filter(tp, &key->res);
+ tcf_exts_get_net(&key->exts);
call_rcu(&key->rcu, u32_delete_key_freepf_rcu);
return 0;
}
Therefore, the time range affected by the vulnerability is from November 6, 2017 to April 13, 2022, which lasts for 4 and a half years.Timeline
TimeLine | |
---|---|
July 27, 2021 | Confirm Vulnerability |
October 2021 | Complete the exploit |
April 12, 2022 | syzbot hits a similar vulnerability |
April 13, 2022 | community tinkering |
August 2022 | Participate in domestic competitions |
Exploit
The steps of vulnerability exploitation are divided into:- Through information leakage and address randomization;
- Elevate permissions through run_cmd.
Information Leakage
Step 1: Heap Layout- Fill the idle net in SLAB
- Eat all the pages of the net-specific SLAB in the cache, so that the newly allocated net can use the pages newly allocated by the system. The yellow area in the figure represents the net objects of heap spraying, such as SLAB 1 and SLAB 2 in the figure.
- Create a victim net from the newly allocated slab
- Indicated by the red area in the figure.
- Then eat all the slab where the victim is located.
- As shown in the figure, slab A and slab B both use net objects to fill up the 8 page-sized slabs.
In order to access the victim's reference through this file later.
mount("/proc/self/ns/net", "./mynetns", "nsfs", MS_BIND, NULL)
Step 3: Return the page where the victim is located to the partner systemDecrease Victim's reference count by 1 via u32_destroy_key
Step 4: User-mode mmap heap sprays the physical page where the victim is located
Return the physical page to the system in the third step just now, and allocate it through mmap.
Step 5: Construct an arbitrary address read
On the file obtained through mount, call ioctl(NS_GET_NSTYPE), and the user mode can get the value of ns->ops->type. Because the value of ops is controllable, it can realize arbitrary address reading.
Step 6: Read cpu_area_entry, bypass Kaslr
Because the virtual address (0xfffffe0000000000) of cpu_area_entry in the system is fixed, and this address contains a kernel code segment address after Kaslr. So the offset can be calculated to bypass Kaslr.
fs/nsfs.c
88 static long ns_ioctl(struct file *filp, unsigned int ioctl,
189 unsigned long arg)
190 {
191 struct user_namespace *user_ns;
192 struct ns_common *ns = get_proc_ns(file_inode(filp));
193 uid_t __user *argp;
194 uid_t uid;
195
196 switch (ioctl) {
197 case NS_GET_USERNS:
198 return open_related_ns(ns, ns_get_owner);
199 case NS_GET_PARENT:
200 if (!ns->ops->get_parent)
201 return -EINVAL;
202 return open_related_ns(ns, ns->ops->get_parent);
203 case NS_GET_NSTYPE:
204 return ns->ops->type; <---/*现实任意地址读*/
205 case NS_GET_OWNER_UID:
206 if (ns->ops->type != CLONE_NEWUSER)
207 return -EINVAL;
208 user_ns = container_of(ns, struct user_namespace, ns);
209 argp = (uid_t __user *) arg;
210 uid = from_kuid_munged(current_user_ns(), user_ns->owner);
211 return put_user(uid, argp);
212 default:
213 return -ENOTTY;
214 }
215 }
include/linux/ns_common.h 9 struct ns_common {
10 atomic_long_t stashed;
11 const struct proc_ns_operations *ops; <---
12 unsigned int inum;
13 refcount_t count;
14 };
Elevate Privileges Through run_cmd
After bypassing address randomization, the next step of privilege escalation can be performed.- Read the address of victim net
- Construct fake ops in user mode
- Hijack the PC
147 int open_related_ns(struct ns_common *ns,
148 | struct ns_common *(*get_ns)(struct ns_common *ns))
149 {
150 struct path path = {};
151 struct file *f;
152 int err;
153 int fd;
154
155 fd = get_unused_fd_flags(O_CLOEXEC);
156 if (fd < 0)
157 return fd;
158
159 do {
160 struct ns_common *relative;
161
162 relative = get_ns(ns);
163 if (IS_ERR(relative)) {
164 put_unused_fd(fd);
165 return PTR_ERR(relative);
166 }
167
168 err = __ns_get_path(&path, relative);
169 } while (err == -EAGAIN);
170
171 if (err) {
172 put_unused_fd(fd);
173 return err;
174 }
175
176 f = dentry_open(&path, O_RDONLY, current_cred());
177 path_put(&path);
178 if (IS_ERR(f)) {
179 put_unused_fd(fd);
180 fd = PTR_ERR(f);
181 } else
182 fd_install(fd, f);
183
184 return fd;
185 }
The owner is the last PC hijacked, and the data of ns can also be controlled, so run_cmd can be executed to complete privilege escalation.1371 struct ns_common *ns_get_owner(struct ns_common *ns)
1372 {
1373 struct user_namespace *my_user_ns = current_user_ns();
1374 struct user_namespace *owner, *p;
1375
1376 /* See if the owner is in the current user namespace */
1377 owner = p = ns->ops->owner(ns); <---/*劫持PC*/
1378 for (;;) {
1379 if (!p)
1380 return ERR_PTR(-EPERM);
1381 if (p == my_user_ns)
1382 break;
1383 p = p->parent;
1384 }
1385
1386 return &get_user_ns(owner)->ns;
1387 }
16 struct proc_ns_operations {
17 const char *name;
18 const char *real_ns_name;
19 int type;
20 struct ns_common *(*get)(struct task_struct *task);
21 void (*put)(struct ns_common *ns);
22 int (*install)(struct nsset *nsset, struct ns_common *ns);
23 struct user_namespace *(*owner)(struct ns_common *ns); <---
24 struct ns_common *(*get_parent)(struct ns_common *ns);
25 } __randomize_layout;
Reference link
Source :- https://paper.seebug.org/2036/
Post a Comment