SUM简介

SUM in mstatus

SUM是mstatus的18位, sstatus是mstatus的子集，对于SUM而言是一样的

mstatus

SUM的作用

The SUM (permit Supervisor User Memory access) bit modifies the privilege with which S-mode loads and stores access virtual memory. When SUM=0, S-mode memory accesses to pages that are accessible by U-mode (U=1 in Figure 4.18) will fault. When SUM=1, these accesses are permitted. SUM has no effect when page-based virtual memory is not in effect. Note that, while SUM is ordinarily ignored when not executing in S-mode, it is in effect when MPRV=1 and MPP=S. SUM is read-only 0 if S-mode is not supported or if satp.MODE is read-only 0.

内核对SUM的使用

SYM_FUNC_START(__asm_copy_to_user)
SYM_FUNC_START(fallback_scalar_usercopy)

        /* Enable access to user memory */
        li t6, SR_SUM
        csrs CSR_STATUS, t6

性能测试

Qemu命令如下

$ qemu-system-riscv64 -m 8G -M virt -cpu rv64 -smp 4 ...

Qemu里面运行unixbench里面的pipe

$ ./pgms/pipe 10
COUNT|246574|1|lps

和x86 host上的性能相差甚远，即使使用的是qemu tcg，但是性能还是不符合预期。

性能分析

做个profiling

$sudo perf top -p 3962

7.81% qemu-system-riscv64   [.] tlb_set_page_full
7.67% qemu-system-riscv64   [.] get_physical_address
...
2.68% qemu-system-riscv64   [.] riscv_cpu_tlb_fill

比较容易发现跟tlb相关并最终追踪到write_mstatus()

(gdb) p &((CPURISCVState *)0)->mstatus
$1 = (uint64_t *)0x13d0

看一下都往mstatus里面写了什么

uprobe:$1:write_mstatus {@[*(arg0 + 0x13d0) ^ arg2] = count(); }

@[24576]: 11
@[0]: 6286
@[16384]: 63955
@[262144]: 63978
@[2]: 231262

其中262144就是0x40000，也就是SUM位，符合linux使用SUM的预期，问题是怎么解决写SUM带来的性能问题.

解决方法

Qemu softMMU

关于qemu softmmu的介绍可以参考 A deep dive into QEMU: TCG memory accesses

这里稍微引用一点

Thanks to the soft-mmu and support of virtual TLBs, QEMU offers a slow path and a fast path when accessing guest memory. The slow path can be seen as a TLB-miss and implies a subsequent call to the PowerPC software MMU implemented inside QEMU to translate a guest virtual address into a guest physical address.

If there is a TLB-hit, QEMU already holds the guest physical address in its vCPU maintained TLBs and is able to directly generate the final memory access into guest RAM with an Intel x86 instruction. Have a look at tcg_out_qemu_st_direct.

显然tlb miss对性能会产生关键影响。

最终解法

优化的办法有多个，最终我们选择了目前看来最好的一种，也就是给 S+SUM 单独的TLB，关于正确性可以参考: https://lists.nongnu.org/archive/html/qemu-riscv/2023-03/msg00558.html

My question is:
> * If we ensure this separate index (S+SUM) has no overlapping tlb
> entries with S-mode (ignore M-mode so far), during SUM=1, we have to
> look into both (S+SUM) and S index for kernel address translation, that
> should be not desired.

This is an incorrect assumption.  S+SUM may very well have overlapping tlb
entries with S.  With SUM=1, you *only* look in S+SUM index; with SUM=0, you
*only* look in S index.

The only difference is a check in get_physical_address is no longer against
MSTATUS_SUM directly, but against the mmu_index.

> * If all the tlb operations are against (S+SUM) during SUM=1, then
> (S+SUM) could contain some duplicated tlb entries of kernel address in S
> index, the duplication means extra tlb lookup and fill.

Yes, if the same address is probed via S and S+SUM, there is a duplicated
lookup. But this is harmless.

> Also if we want
> to flush tlb entry of specific addr0, we have to flush both index.

Yes, this is also true. But so far target/riscv is making no use of per-mmuidx
flushing. At the moment you're *only* using tlb_flush(cpu), which flushes every
mmuidx. Nor are you making use of per-page flushing.

So, really, no change required at all there.

commit c8f8a9957ea20ac4ba0588ddd00130e8dcf41d93
Author: Fei Wu <fei2.wu@intel.com>
Date:   Wed Apr 12 13:43:15 2023 +0200

    target/riscv: Reduce overhead of MSTATUS_SUM change
    
    Kernel needs to access user mode memory e.g. during syscalls, the window
    is usually opened up for a very limited time through MSTATUS.SUM, the
    overhead is too much if tlb_flush() gets called for every SUM change.
    
    This patch creates a separate MMU index for S+SUM, so that it's not
    necessary to flush tlb anymore when SUM changes. This is similar to how
    ARM handles Privileged Access Never (PAN).
    
    Result of 'pipe 10' from unixbench boosts from 223656 to 1705006. Many
    other syscalls benefit a lot from this too.

优化效果

见上面commit，pipe 10从223656提高到了1705006，将近8倍
对于其他涉及到SUM修改，一般是有较多syscall的地方，这个优化的效果都会比较明显

MSTATUS_SUM对QEMU性能的影响