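Background
I came across an email about the Svvptc proposal, "Feedback Request on Svvptc Extension Proposal", which points out that the sfence.vma instruction is executed far too often and that there is room for optimization.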
+--------------------------+------+-----------------+
| Test | Gain | # of SFENCE.VMA |
+--------------------------+------+-----------------+
| Kernel boot | 6% | 50535 -> 8768 |
| ltp - mmapstress01 | 8% | 44978 -> 6300 |
| lmbench - lat_pagefault | 20% | 665254 -> 832 |
| lmbench - lat_mmap | 5% | 546401 -> 718 |
+--------------------------+------+-----------------+
1. The gains represent performance improvements for
the benchmark metrics.
2. The second column lists the reduction in
issued single-address SFENCE.VMA instructions.
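Some thoughts
Beyond the Svvptc extension itself, I am actually more interested in how this optimization opportunity was found, and whether there is a more general way to discover a whole class of such opportunities. If we abstract the problem as "an instruction with a high execution cost is executed many times", then it is a good fit for a binary translation tool such as qemu. After all, every guest instruction has to be translated by qemu, which is often more precise than measuring on real hardware, and real hardware does not necessarily have a corresponding PMU anyway.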
Validating the idea
With qemu's insn plugin, we can count how often sfence.vma is executed in a given scenario, such as bootup:
qemu-system-riscv64 -machine virt -cpu rv64 -m 4G -smp 4 \
-plugin $PD/libinsn.so,match=sfence -d plugin -D log \
After bootup the statistics look roughly like this. The count is considerably higher than the figures reported in the email:
0x80003262, 'sfence.vma a4,s3', 61053 hits , cpu 0, 235498 match hits, Δ+1807 since last match, 46420 avg insns/match
0x80003262, 'sfence.vma a4,s3', 61054 hits , cpu 1, 284713 match hits, Δ+1806 since last match, 35344 avg insns/match
0x80003262, 'sfence.vma a4,s3', 61055 hits , cpu 2, 198382 match hits, Δ+1807 since last match, 37572 avg insns/match
0x80003262, 'sfence.vma a4,s3', 61056 hits , cpu 3, 207546 match hits, Δ+12931 since last match, 38918 avg insns/match
0x80003262, 'sfence.vma a4,s3', 61057 hits , cpu 0, 235499 match hits, Δ+1806 since last match, 46420 avg insns/match
0x80003262, 'sfence.vma a4,s3', 61058 hits , cpu 2, 198383 match hits, Δ+1806 since last match, 37572 avg insns/match
0x80003262, 'sfence.vma a4,s3', 61059 hits , cpu 1, 284714 match hits, Δ+1806 since last match, 35344 avg insns/match
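Reading these records by eye does not scale, so a small script can reduce the log to totals. The following is a minimal Python sketch, not part of the original workflow. It assumes the log keeps exactly the line format shown above, and that both the "hits" and "match hits" counters are cumulative, so the largest value seen is the running total. The function name `summarize` is made up for illustration.

```python
import re
import sys

# One record of the plugin log, in the format shown above, e.g.
# 0x80003262, 'sfence.vma a4,s3', 61053 hits , cpu 0, 235498 match hits, ...
LINE_RE = re.compile(
    r"^(0x[0-9a-fA-F]+), '([^']*)', (\d+) hits\s*, cpu (\d+), "
    r"(\d+) match hits, \S*\+(\d+) since last match, (\d+) avg insns/match\s*$"
)


def summarize(lines):
    """Return (per_addr, per_cpu) from an iterable of log lines.

    per_addr maps address   -> (disassembly, highest "hits" value seen)
    per_cpu  maps cpu index -> highest "match hits" value seen
    """
    per_addr = {}
    per_cpu = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # skip anything that is not a match record
        addr = int(m.group(1), 16)
        disas = m.group(2)
        hits = int(m.group(3))
        cpu = int(m.group(4))
        match_hits = int(m.group(5))
        # Assumption: counters are cumulative, so the maximum is the total.
        prev_hits = per_addr.get(addr, (disas, 0))[1]
        per_addr[addr] = (disas, max(prev_hits, hits))
        per_cpu[cpu] = max(per_cpu.get(cpu, 0), match_hits)
    return per_addr, per_cpu


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 summarize.py log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        by_addr, by_cpu = summarize(f)
    # Hottest addresses first.
    for addr, (disas, hits) in sorted(by_addr.items(), key=lambda kv: -kv[1][1]):
        print(f"{addr:#x}  {hits:>10}  {disas}")
    for cpu, match_hits in sorted(by_cpu.items()):
        print(f"cpu {cpu}: {match_hits} match hits")
```

Run against the file passed to `-D` (here `log`), this prints one line per hot address and one per CPU, which is easier to compare with the table from the email than the raw records.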
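Generalizing
- The same method can be used to check fence, CMO, or other instructions, and to report candidate optimization points automatically, without extra manual analysis and reporting
- More scenarios can be added as test cases
- Hot spots should be tied back to the code; the current way of collecting statistics still has room for improvement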
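
One possible way to tie a hot address back to code is to look it up in a symbol table. The sketch below is only an illustration under stated assumptions: it expects symbol lines in the `address type name` layout that `nm -n` prints, and the symbol names in the usage example are hypothetical. Which binary a given address belongs to (firmware or kernel) is not established here, so choosing the right symbol table is left to the reader.

```python
import bisect


def load_symbols(lines):
    """Parse 'address type name' lines into a sorted list of (addr, name)."""
    syms = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # undefined symbols and blank lines have no address
        try:
            addr = int(parts[0], 16)
        except ValueError:
            continue
        syms.append((addr, parts[2]))
    syms.sort()
    return syms


def symbolize(syms, addr):
    """Return (name, offset) of the last symbol starting at or before addr.

    Returns None when addr lies below every known symbol. Symbol sizes are
    not checked, so an address far past the last symbol is still attributed
    to it.
    """
    starts = [a for a, _ in syms]
    i = bisect.bisect_right(starts, addr) - 1
    if i < 0:
        return None
    base, name = syms[i]
    return name, addr - base
```

For example, with a hypothetical table containing `0000000080003200 T hypothetical_fence_handler`, calling `symbolize(syms, 0x80003262)` attributes the hot sfence.vma to `hypothetical_fence_handler` at offset 0x62.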