RISC-V Interrupt Handling

Posted by Fei Wu on March 3, 2024

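Types of Traps

Depending on whether they are synchronous or asynchronous, traps can be roughly divided into exceptions (synchronous) and interrupts (asynchronous). Here we focus mainly on interrupts.

  • The RISC-V spec currently defines three kinds of interrupts: software, timer, and external. Later extensions may add new ones, such as LCOFI (pending bit LCOFIP) for profiling.
  • Interrupts also differ by target mode: the supervisor-mode and machine-mode versions of each are distinct interrupts.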

[Figure: RISC-V interrupts]

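Conditions for Taking an Interrupt

Registers

  • mstatus - for now, focus on the MIE/SIE bits. [Figure: mstatus]

  • mie/mip - the bits correspond to the interrupt types (EIP/TIP/SIP). [Figure: mie] [Figure: mip]

  • mideleg - its bits also correspond to the interrupt types. Presumably, a sane M-mode implementation will not delegate MEI/MTI/MSI, the interrupts targeted at M-mode itself.

  • sstatus - can be regarded as the same register as mstatus, except that sstatus exposes only the fields visible to S-mode. See 4.1.1 Supervisor Status Register (sstatus):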

The sstatus register is a subset of the mstatus register. In a straightforward implementation, reading or writing any field in sstatus is equivalent to reading or writing the homonymous field in mstatus.

  • sie/sip - can be regarded as the same registers as mie/mip. See 4.1.3 Supervisor Interrupt Registers (sip and sie):

The sip and sie registers are subsets of the mip and mie registers. Reading any implemented field, or writing any writable field, of sip/sie effects a read or write of the homonymous field of mip/mie.

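  • sideleg - would exist if delegating traps to U-mode were supported; not present for now.

Behavior

This section refers to the RISC-V privileged architecture spec; unless noted otherwise, the quotations come from it.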

3.1.8 Machine Trap Delegation Registers (medeleg and mideleg)

By default, all traps at any privilege level are handled in machine mode, though a machine-mode handler can redirect traps back to the appropriate level with the MRET instruction (Section 3.3.2). To increase performance, implementations can provide individual read/write bits within medeleg and mideleg to indicate that certain exceptions and interrupts should be processed directly by a lower privilege level.

3.1.9 Machine Interrupt Registers (mip and mie)

An interrupt i will trap to M-mode (causing the privilege mode to change to M-mode) if all of the following are true: (a) either the current privilege mode is M and the MIE bit in the mstatus register is set, or the current privilege mode has less privilege than M-mode; (b) bit i is set in both mip and mie; and (c) if register mideleg exists, bit i is not set in mideleg.

4.1.3 Supervisor Interrupt Registers (sip and sie)

An interrupt i will trap to S-mode if both of the following are true: (a) either the current privilege mode is S and the SIE bit in the sstatus register is set, or the current privilege mode has less privilege than S-mode; and (b) bit i is set in both sip and sie.

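In summary, the spec describes two things:

  • There are two levels of interrupt enables. One is the global enable in the [m|s]status register; the other is the set of individual enables for the software/timer/external interrupts in [m|s]ie.
  • There is a delegation mechanism. By default all interrupts are handled in M-mode; through delegation, M-mode can have selected interrupts handled directly in S-mode, bypassing M-mode entirely.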
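
The two quoted conditions and the delegation bit can be condensed into one routing function. This is a minimal sketch rather than spec text: `interrupt_target`, `enum priv`, and `enum target` are names made up for illustration, and the sketch relies on sip/sie being views of mip/mie, so a single mip/mie pair is enough.

```c
#include <stdbool.h>
#include <stdint.h>

/* Privilege encodings as in the privileged spec: U=0, S=1, M=3. */
enum priv { PRIV_U = 0, PRIV_S = 1, PRIV_M = 3 };

/* Hypothetical result type: where (if anywhere) interrupt i is taken. */
enum target { TAKE_NONE, TAKE_S, TAKE_M };

#define MSTATUS_SIE (1ull << 1) /* global S-mode interrupt enable */
#define MSTATUS_MIE (1ull << 3) /* global M-mode interrupt enable */

static enum target interrupt_target(enum priv cur, uint64_t mstatus,
                                    uint64_t mip, uint64_t mie,
                                    uint64_t mideleg, unsigned i)
{
    uint64_t bit = 1ull << i;

    /* Second-level switch: bit i must be pending and individually enabled. */
    if (!(mip & bit) || !(mie & bit))
        return TAKE_NONE;

    if (!(mideleg & bit)) {
        /* Not delegated: M-mode target. MIE only matters while in M-mode. */
        if (cur == PRIV_M && !(mstatus & MSTATUS_MIE))
            return TAKE_NONE;
        return TAKE_M;
    }

    /* Delegated: S-mode target, never taken while running in M-mode. */
    if (cur == PRIV_M)
        return TAKE_NONE;
    /* SIE only matters while in S-mode; U-mode always lets it through. */
    if (cur == PRIV_S && !(mstatus & MSTATUS_SIE))
        return TAKE_NONE;
    return TAKE_S;
}
```

For a delegated supervisor timer interrupt (bit 5), the function returns TAKE_NONE while running in M-mode and TAKE_S while running in U-mode regardless of SIE; without delegation the same interrupt returns TAKE_M from S-mode even when MIE is clear.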

Another point is how a pending interrupt is treated once delegation is involved, described in 3.1.8 Machine Trap Delegation Registers (medeleg and mideleg):

Traps never transition from a more-privileged mode to a less-privileged mode. For example, if M-mode has delegated illegal instruction exceptions to S-mode, and M-mode software later executes an illegal instruction, the trap is taken in M-mode, rather than being delegated to S-mode. By contrast, traps may be taken horizontally. Using the same example, if M-mode has delegated illegal instruction exceptions to S-mode, and S-mode software later executes an illegal instruction, the trap is taken in S-mode.

Delegated interrupts result in the interrupt being masked at the delegator privilege level. For example, if the supervisor timer interrupt (STI) is delegated to S-mode by setting mideleg[5], STIs will not be taken when executing in M-mode. By contrast, if mideleg[5] is clear, STIs can be taken in any mode and regardless of current mode will transfer control to M-mode.

Simply put, while running in a higher-privilege mode, interrupts targeted at lower-privilege modes are disabled; for example, S-mode interrupts are automatically masked while in M-mode. 3.1.6.1 Privilege and Global Interrupt-Enable Stack in mstatus register states this more directly:

When a hart is executing in privilege mode x, interrupts are globally enabled when xIE=1 and globally disabled when xIE=0. Interrupts for lower-privilege modes, w<x, are always globally disabled regardless of the setting of any global wIE bit for the lower-privilege mode. Interrupts for higher-privilege modes, y>x, are always globally enabled regardless of the setting of the global yIE bit for the higher-privilege mode. Higher-privilege-level code can use separate per-interrupt enable bits to disable selected higher-privilege-mode interrupts before ceding control to a lower-privilege mode.

A higher-privilege mode y could disable all of its interrupts before ceding control to a lower-privilege mode but this would be unusual as it would leave only a synchronous trap, non-maskable interrupt, or reset as means to regain control of the hart.

Note that interrupts for a higher-privilege mode (higher than the current one) are globally enabled, yet they can still be disabled individually. In other words, with deliberate configuration it is possible to turn off interrupts for a higher-privilege mode.

Hardware Actions on an Interrupt

Registers

  • mtvec - direct vs. vectored mode makes a difference for interrupts, but not for synchronous exceptions

When MODE=Direct, all traps into machine mode cause the pc to be set to the address in the BASE field. When MODE=Vectored, all synchronous exceptions into machine mode cause the pc to be set to the address in the BASE field, whereas interrupts cause the pc to be set to the address in the BASE field plus four times the interrupt cause number.

  • mepc - the address of the interrupted instruction. It is a virtual address; note that when address translation is off, the virtual address is the physical address

When a trap is taken into M-mode, mepc is written with the virtual address of the instruction that was interrupted or that encountered the exception. Otherwise, mepc is never written by the implementation, though it may be explicitly written by software.

  • mcause - the cause of the trap, i.e. the interrupt type mentioned at the beginning

When a trap is taken into M-mode, mcause is written with a code indicating the event that caused the trap. Otherwise, mcause is never written by the implementation, though it may be explicitly written by software.

  • mtval - used to further pinpoint the cause of the trap

When a trap is taken into M-mode, mtval is either set to zero or written with exception-specific information to assist software in handling the trap. Otherwise, mtval is never written by the implementation, though it may be explicitly written by software.

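As the descriptions above show, mepc/mcause/mtval are all writable by software, which opens up possibilities for further manipulation and use.

  • mscratch - an extra register available to software.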

Typically, it is used to hold a pointer to a machine-mode hart-local context space and swapped with a user register upon entry to an M-mode trap handler.

All of the above are M-mode registers. S-mode has its own set; unlike sstatus/sip/sie, these are registers completely separate from their M-mode counterparts.

  • stvec, sepc, scause, stval, sscratch
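
The effect of the mtvec MODE field quoted above amounts to a small address computation. A sketch under stated assumptions: `trap_target_pc` is a hypothetical helper, MODE is taken from the low two bits of mtvec (0 = Direct, 1 = Vectored) with BASE in the remaining bits, and stvec is assumed to behave the same way for traps into S-mode.

```c
#include <stdbool.h>
#include <stdint.h>

#define TVEC_MODE_MASK     3ull
#define TVEC_MODE_VECTORED 1ull

/* Hypothetical helper: pc the hart jumps to when a trap is taken. */
static uint64_t trap_target_pc(uint64_t mtvec, bool is_interrupt, uint64_t cause)
{
    uint64_t base = mtvec & ~TVEC_MODE_MASK;
    uint64_t mode = mtvec & TVEC_MODE_MASK;

    /* Only interrupts are vectored; synchronous exceptions always use BASE. */
    if (mode == TVEC_MODE_VECTORED && is_interrupt)
        return base + 4 * cause;
    return base;
}
```

With mtvec = 0x80000001 (Vectored), a machine timer interrupt (cause 7) lands at 0x8000001c, while any synchronous exception lands at 0x80000000.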

Entering an Interrupt

3.1.6.1 Privilege and Global Interrupt-Enable Stack in mstatus register

To support nested traps, each privilege mode x that can respond to interrupts has a two-level stack of interrupt-enable bits and privilege modes. xPIE holds the value of the interrupt-enable bit active prior to the trap, and xPP holds the previous privilege mode. The xPP fields can only hold privilege modes up to x, so MPP is two bits wide and SPP is one bit wide. When a trap is taken from privilege mode y into privilege mode x, xPIE is set to the value of xIE; xIE is set to 0; and xPP is set to y.

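When an interrupt occurs, the hardware does the following, with S-mode and M-mode each using their own registers. Assume the trap goes from mode y into mode x:

  • Update status. What has to be saved is determined by what gets modified:
    • status.xPP=y: the current mode may change, so this records how to get back to the original mode y.
    • status.xPIE=xIE: xIE is overwritten (set to 0) on trap entry, so its old value must be saved in xPIE. Note that it is not yIE that is saved into xPIE. (Incidentally, under what circumstances would xIE be 0 while running in mode y?)
    • status.xIE=0: disables interrupts for mode x.
    • The spec wording "xPIE holds the value of the interrupt-enable bit active prior to the trap" is ambiguous; read literally, it sounds more like yIE is saved into xPIE.
    • yIE does not need to be saved because mode x normally does not modify yIE; if it does need to, the mode x interrupt handler can save yIE first in software.
  • Set epc/cause/tval, which is straightforward.
  • Jump to the appropriate location given by tvec and start executing.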
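
The status update in the first step can be written out for the x = M case. This is a small model, not hardware or QEMU code: `mstatus_on_trap_to_m` is a hypothetical name, and only the MIE (bit 3), MPIE (bit 7), and MPP (bits 12:11) fields of mstatus are modeled.

```c
#include <stdint.h>

#define MSTATUS_MIE       (1ull << 3)
#define MSTATUS_MPIE      (1ull << 7)
#define MSTATUS_MPP_SHIFT 11
#define MSTATUS_MPP_MASK  (3ull << MSTATUS_MPP_SHIFT)

/* Hypothetical helper: new mstatus after a trap from mode y into M-mode. */
static uint64_t mstatus_on_trap_to_m(uint64_t mstatus, unsigned y)
{
    uint64_t mie = (mstatus & MSTATUS_MIE) ? 1 : 0;

    mstatus &= ~(MSTATUS_MPIE | MSTATUS_MPP_MASK | MSTATUS_MIE);
    mstatus |= mie << 7;                               /* MPIE = old MIE */
    mstatus |= ((uint64_t)y & 3) << MSTATUS_MPP_SHIFT; /* MPP = y */
    /* MIE stays 0: M-mode interrupts are now globally disabled. */
    return mstatus;
}
```

Note that a trap from S-mode with SIE=1 and MIE=0 leaves MPIE at 0: it is xIE (here MIE), not yIE (here SIE), that lands in xPIE, and SIE itself is left untouched.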

Leaving an Interrupt

3.1.6.1 Privilege and Global Interrupt-Enable Stack in mstatus register

An MRET or SRET instruction is used to return from a trap in M-mode or S-mode respectively. When executing an xRET instruction, supposing xPP holds the value y, xIE is set to xPIE; the privilege mode is changed to y; xPIE is set to 1; and xPP is set to the least-privileged supported mode (U if U-mode is implemented, else M). If xPP≠M, xRET also sets MPRV=0.

Suppose xRET is used to return to mode y. Then xPP must hold y, and y may of course equal x. (In what situation would SRET be executed in M-mode?)

  • Restore the CPU mode to xPP, i.e. y
  • status.xIE = xPIE
  • if (xPP != M) status.MPRV = 0; see the discussion in https://github.com/riscv/riscv-isa-manual/issues/427
  • status.xPIE = 1
  • status.xPP = U
  • Jump to the address in EPC
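
The same steps for x = M, as a small model. It is a sketch under stated assumptions: `mstatus_on_mret` is a hypothetical name, U-mode is assumed to be implemented (so MPP is reset to U, encoded as 0), and only MIE, MPIE, MPP, and MPRV (bit 17) are modeled.

```c
#include <stdint.h>

#define MSTATUS_MIE       (1ull << 3)
#define MSTATUS_MPIE      (1ull << 7)
#define MSTATUS_MPP_SHIFT 11
#define MSTATUS_MPP_MASK  (3ull << MSTATUS_MPP_SHIFT)
#define MSTATUS_MPRV      (1ull << 17)
#define PRIV_M            3u

/* Hypothetical helper: returns the new mstatus, stores the new mode in *new_priv. */
static uint64_t mstatus_on_mret(uint64_t mstatus, unsigned *new_priv)
{
    unsigned y = (unsigned)((mstatus & MSTATUS_MPP_MASK) >> MSTATUS_MPP_SHIFT);
    uint64_t mpie = (mstatus & MSTATUS_MPIE) ? 1 : 0;

    *new_priv = y;                                /* mode = MPP */
    mstatus &= ~(MSTATUS_MIE | MSTATUS_MPP_MASK); /* MPP = U (encoded as 0) */
    mstatus |= mpie << 3;                         /* MIE = MPIE */
    mstatus |= MSTATUS_MPIE;                      /* MPIE = 1 */
    if (y != PRIV_M)
        mstatus &= ~MSTATUS_MPRV;                 /* leaving M-mode clears MPRV */
    return mstatus;
}
```

Returning to S-mode with MPIE=1 and MPRV=1 therefore ends with MIE=1, MPIE=1, MPP=U, and MPRV=0.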

The spec states explicitly that setting xPP=U is there to help debugging:

Setting xPP to the least-privileged supported mode on an xRET helps identify software bugs in the management of the two-level privilege-mode stack.

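What about xPIE=1? Normally, if mode x is later entered again through an interrupt, xPIE gets overwritten anyway, so this setting too would only serve debugging. But what if mode x is later entered through [m|s]RET instead?

For the definition of MPRV, see 3.1.6.3 Memory Privilege in mstatus Register: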

The MPRV (Modify PRiVilege) bit modifies the effective privilege mode, i.e., the privilege level at which loads and stores execute. When MPRV=0, loads and stores behave as normal, using the translation and protection mechanisms of the current privilege mode. When MPRV=1, load and store memory addresses are translated and protected, and endianness is applied, as though the current privilege mode were set to MPP. Instruction address-translation and protection are unaffected by the setting of MPRV. MPRV is read-only 0 if U-mode is not supported.

Interrupt Handling in the Linux Kernel

The code below is based on upstream commit 45ec2f5f6, with some parts trimmed.

Entering an Interrupt

SYM_CODE_START(handle_exception)
        /*
         * If coming from userspace, preserve the user thread pointer and load
         * the kernel thread pointer.  If we came from the kernel, the scratch
         * register will contain 0, and we should continue on the current TP.
         */
        csrrw tp, CSR_SCRATCH, tp
        bnez tp, .Lsave_context

.Lrestore_kernel_tpsp:
        csrr tp, CSR_SCRATCH
        REG_S sp, TASK_TI_KERNEL_SP(tp)

.Lsave_context:
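
  • handle_exception is the entry point for trap handling
  • Before saving context, the kernel has only one register available, scratch, and scratch itself holds useful information. Hence the swap instruction csrrw tp, CSR_SCRATCH, tp, which loses no information
  • While running in U-mode, scratch points to the kernel's current task_struct
  • While running in S-mode, scratch is set to 0 (see how it is set on trap entry and exit)

For the clever use of this swap instruction, see 3.1.13 Machine Scratch Register (mscratch):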

The MIPS ISA allocated two user registers (k0/k1) for use by the operating system. Although the MIPS scheme provides a fast and simple implementation, it also reduces available user registers, and does not scale to further privilege levels, or nested traps. It can also require both registers are cleared before returning to user level to avoid a potential security hole and to provide deterministic debugging behavior.

The RISC-V user ISA was designed to support many possible privileged system environments and so we did not want to infect the user-level ISA with any OS-dependent features. The RISC-V CSR swap instructions can quickly save/restore values to the mscratch register. Unlike the MIPS design, the OS can rely on holding a value in the mscratch register while the user context is running.

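Because no other temporary register is available, the code cannot check SPP at this point to determine whether U-mode or S-mode was interrupted.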
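
The first instructions of handle_exception can be modeled in C to show that both entry paths end with current in tp and nothing lost. This is an illustrative model only: `struct hart`, `csrrw_swap`, and `entry_find_current` are hypothetical names, and the saving of sp in .Lrestore_kernel_tpsp is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Just the two registers involved in the swap. */
struct hart {
    uint64_t tp;
    uint64_t sscratch;
};

/* Models "csrrw tp, CSR_SCRATCH, tp": exchange tp and sscratch. */
static void csrrw_swap(struct hart *h)
{
    uint64_t old_tp = h->tp;

    h->tp = h->sscratch;
    h->sscratch = old_tp;
}

/* Returns true if the trap came from U-mode; tp holds current afterwards. */
static bool entry_find_current(struct hart *h)
{
    csrrw_swap(h);
    if (h->tp != 0)       /* bnez tp, .Lsave_context */
        return true;      /* from U: tp = current, sscratch = user tp */
    h->tp = h->sscratch;  /* csrr tp, CSR_SCRATCH */
    return false;         /* from S: tp = current again */
}
```

On the U-mode path sscratch ends up holding the user tp, which is what the later csrr s5, CSR_SCRATCH saves into pt_regs.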

register struct task_struct *riscv_current_is_tp __asm__("tp");

/*
 * This only works because "struct thread_info" is at offset 0 from "struct
 * task_struct".  This constraint seems to be necessary on other architectures
 * as well, but __switch_to enforces it.  We can't check TASK_TI here because
 * <asm/asm-offsets.h> includes this, and I can't get the definition of "struct
 * task_struct" here due to some header ordering problems.
 */
static __always_inline struct task_struct *get_current(void)
{
        return riscv_current_is_tp;
}

  • Inside the kernel, tp is what get_current() returns

.Lsave_context:
        REG_S sp, TASK_TI_USER_SP(tp)
        REG_L sp, TASK_TI_KERNEL_SP(tp)
        addi sp, sp, -(PT_SIZE_ON_STACK)
        REG_S x1,  PT_RA(sp)
        REG_S x3,  PT_GP(sp)
        REG_S x5,  PT_T0(sp)
        save_from_x6_to_x31

        /*
         * Disable user-mode memory access as it should only be set in the
         * actual user copy routines.
         *
         * Disable the FPU/Vector to detect illegal usage of floating point
         * or vector in kernel space.
         */
        li t0, SR_SUM | SR_FS_VS

        REG_L s0, TASK_TI_USER_SP(tp)
        csrrc s1, CSR_STATUS, t0
        csrr s2, CSR_EPC
        csrr s3, CSR_TVAL
        csrr s4, CSR_CAUSE
        csrr s5, CSR_SCRATCH
        REG_S s0, PT_SP(sp)
        REG_S s1, PT_STATUS(sp)
        REG_S s2, PT_EPC(sp)
        REG_S s3, PT_BADADDR(sp)
        REG_S s4, PT_CAUSE(sp)
        REG_S s5, PT_TP(sp)
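
  • x2 (sp) is stashed in thread_info at TASK_TI_USER_SP, purely as a coding convenience
  • The kernel stack pointer is kept in TASK_TI_KERNEL_SP
  • If the trap came from the kernel, the kernel stack is reused; the sp loaded here is the one written in .Lrestore_kernel_tpsp
  • x1 (ra), x3 (gp), x5 (t0), and x6..x31 are saved on the kernel stack (forming pt_regs)
  • x2 (sp) is also saved on the kernel stack (via s0)
  • At this point scratch holds the pre-trap tp, so x4 (tp) is saved on the kernel stack as well (via s5)
  • All registers have now been saved
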
        /*
         * Set the scratch register to 0, so that if a recursive exception
         * occurs, the exception vector knows it came from the kernel
         */
        csrw CSR_SCRATCH, x0

        /* Load the global pointer */
        load_global_pointer

        /* Load the kernel shadow call stack pointer if coming from userspace */
        scs_load_current_if_task_changed s5

#ifdef CONFIG_RISCV_ISA_V_PREEMPTIVE
        move a0, sp
        call riscv_v_context_nesting_start
#endif
        move a0, sp /* pt_regs */
        la ra, ret_from_exception

        /*
         * MSB of cause differentiates between
         * interrupts and exceptions
         */
        bge s4, zero, 1f

        /* Handle interrupts */
        tail do_irq
1:
        /* Handle other exceptions */

  • Fairly straightforward: ra is set to ret_from_exception, and then control jumps to do_irq
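
The bge s4, zero, 1f test works because the interrupt flag is the most significant bit of the cause register, so an interrupt cause is negative when read as a signed value. A small sketch, assuming RV64 (a 64-bit scause); the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define CAUSE_IRQ_FLAG (1ull << 63) /* MSB of scause on RV64 */

/* True for interrupts; same outcome as the signed "scause < 0" test. */
static bool cause_is_interrupt(uint64_t scause)
{
    return (scause & CAUSE_IRQ_FLAG) != 0;
}

/* The cause number with the interrupt flag stripped. */
static uint64_t cause_code(uint64_t scause)
{
    return scause & ~CAUSE_IRQ_FLAG;
}
```

A supervisor timer interrupt shows up as the flag plus code 5, while an environment call from U-mode is the plain exception code 8.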

Leaving an Interrupt

/*
 * The ret_from_exception must be called with interrupt disabled. Here is the
 * caller list:
 *  - handle_exception
 *  - ret_from_fork
 */
SYM_CODE_START_NOALIGN(ret_from_exception)
        REG_L s0, PT_STATUS(sp)
        andi s0, s0, SR_SPP
        bnez s0, 1f

        /* Save unwound kernel stack pointer in thread_info */
        addi s0, sp, PT_SIZE_ON_STACK
        REG_S s0, TASK_TI_KERNEL_SP(tp)

        /*
         * Save TP into the scratch register , so we can find the kernel data
         * structures again.
         */
        csrw CSR_SCRATCH, tp
1:

  • When returning to U-mode, the kernel sp is saved to TASK_TI_KERNEL_SP(tp), and scratch is used to hold tp (current)

1:
        REG_L a0, PT_STATUS(sp)

        REG_L  a2, PT_EPC(sp)
        REG_SC x0, a2, PT_EPC(sp)

        csrw CSR_STATUS, a0
        csrw CSR_EPC, a2

        REG_L x1,  PT_RA(sp)
        REG_L x3,  PT_GP(sp)
        REG_L x4,  PT_TP(sp)
        REG_L x5,  PT_T0(sp)
        restore_from_x6_to_x31

        REG_L x2,  PT_SP(sp)

        sret
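
  • Normal flow: restore x1..x31 of the interrupted context (the user-mode registers when returning to U-mode)
  • Restore epc; after sret, execution continues at epc

Implementing NMI

When profiling the kernel, one problem to solve is capturing call stacks while interrupts are disabled. In other words, some kind of interrupt must still be able to fire while the kernel has interrupts off. There are several ways to do this; here we call them all NMI, and use the overflow interrupt (LCOFI) from Sscofpmf as the example.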

Pseudo NMI

riscv: Introduce Pseudo NMI

The existing RISC-V kernel disables interrupts via per cpu control register CSR_STATUS, the SIE bit of which controls the enablement of all interrupts of whole cpu. When SIE bit is clear, no interrupt is enabled. This patch series implements NMI by switching interrupt disable way to another per cpu control register CSR_IE. This register controls the enablement of each separate interrupt. Each bit of CSR_IE corresponds to a single major interrupt and a clear bit means disablement of corresponding interrupt.

To implement pseudo NMI, we switch to CSR_IE masking when disabling irqs. When interrupts are disabled, all bits of CSR_IE corresponding to normal interrupts are cleared while bits corresponding to NMIs are still kept as ones. The SIE bit of CSR_STATUS is now untouched and always kept as one.
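
The masking scheme described in the cover letter can be modeled on plain variables. This is a sketch, not the patch series' code: `pnmi_irq_save`, `pnmi_irq_restore`, and `irq_deliverable` are hypothetical names, and treating LCOFI (bit 13) as the only NMI source is an assumption made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define SR_SIE    (1ull << 1)  /* sstatus.SIE, left set at all times */
#define IE_SSIE   (1ull << 1)  /* supervisor software interrupt */
#define IE_STIE   (1ull << 5)  /* supervisor timer interrupt */
#define IE_SEIE   (1ull << 9)  /* supervisor external interrupt */
#define IE_LCOFIE (1ull << 13) /* counter overflow interrupt, the NMI here */

#define NORMAL_IRQS (IE_SSIE | IE_STIE | IE_SEIE)

/* Stand-ins for the CSRs, so the logic can run anywhere. */
struct csrs {
    uint64_t sstatus;
    uint64_t sie;
    uint64_t sip;
};

/* "Disable interrupts": clear only the normal bits in sie, keep NMI bits. */
static uint64_t pnmi_irq_save(struct csrs *c)
{
    uint64_t flags = c->sie;

    c->sie &= ~NORMAL_IRQS;
    return flags;
}

/* "Restore interrupts": put back the saved sie value. */
static void pnmi_irq_restore(struct csrs *c, uint64_t flags)
{
    c->sie = flags;
}

/* An interrupt can be taken in S-mode if SIE is set and a pending bit is enabled. */
static bool irq_deliverable(const struct csrs *c)
{
    return (c->sstatus & SR_SIE) != 0 && (c->sip & c->sie) != 0;
}
```

After pnmi_irq_save a pending timer interrupt is no longer deliverable, but a pending LCOFI still is, which is exactly the NMI-like behavior the series is after.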

SBI SSE

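Interrupts are handled in M-mode by default, so as long as LCOFI is not delegated to S-mode, disabling interrupts in S-mode (the Linux kernel) cannot block LCOFI. SBI SSE takes this approach. SSE is of course not only for profiling; it is also used for RAS and more.

This is the latest patchset: riscv: add support for SBI Supervisor Software Events

  • The trap entry point is still handle_exception
  • The SSE entry point is handle_sse: an SSE event first causes an interrupt into M-mode, which then returns to S-mode via mret to run handle_sse

Key Points

  • If an SSE event only needed to save/restore the interrupted context, the problem would be fairly simple
  • The problem is that an SSE event also needs to access the interrupted context. For example, when LCOFI fires we want the interrupted call stack, which requires knowing the current task_struct in order to find the corresponding stack

In the normal trap handler handle_exception, current can come from only two places:

  • If the interrupted context is U-mode, current is stored in scratch
  • If the interrupted context is S-mode, current is stored in tp, and scratch is 0

What problem does NMI introduce? Take the start of handle_exception as an example: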

        csrrw tp, CSR_SCRATCH, tp 
        bnez tp, .Lsave_context

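Suppose an SSE event arrives right after the first instruction (csrrw) has swapped tp and scratch. How does the SSE event handler find current for the interrupted context? The situation is now more complicated:

  • Whether scratch is 0 can no longer be used to tell which register holds current
  • scratch may hold 0, current, or the userspace tp

The Patchset's Current Approach

SBI SSE uses epc to identify which part of handle_exception was interrupted, in an attempt to obtain the correct current. Take this sequence as an example: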

handle_exception (interrupted context)
   --> sse event handler

The problem is addressed by annotating the handle_exception code with whether current is in tp or in scratch at each point.

#define __SSE_TASK_LOC(s_loc, u_loc)			\
	.pushsection __task_loc,"a";	\
	RISCV_PTR 99f;				\
	.byte TASK_LOC(s_loc, u_loc);		\
	.popsection;				\
	99:
#else
#define __SSE_TASK_LOC(s_loc, u_loc)
#endif

SYM_CODE_START(handle_exception)
	/*
	 * If coming from userspace, preserve the user thread pointer and load
	 * the kernel thread pointer.  If we came from the kernel, the scratch
	 * register will contain 0, and we should continue on the current TP.
	 */
__SSE_TASK_LOC(IN_TP, IN_SSCRATCH)
	csrrw tp, CSR_SCRATCH, tp
__SSE_TASK_LOC(IN_SSCRATCH, IN_TP)
	bnez tp, .Lsave_context
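
One way such annotations could be consumed is a lookup from the interrupted epc to the location of current. The sketch below is not the patchset's lookup code: `struct task_loc_entry` and `find_current_loc` are hypothetical, the table is assumed to be sorted by address, and reading s_loc/u_loc as "handle_exception was entered from S-mode / from U-mode" is inferred from the macro arguments.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum task_loc { IN_TP, IN_SSCRATCH };

/* Hypothetical decoded form of one __task_loc entry. */
struct task_loc_entry {
    uintptr_t addr;      /* address of the 99: label */
    enum task_loc s_loc; /* where current is if we came from S-mode */
    enum task_loc u_loc; /* where current is if we came from U-mode */
};

/*
 * Entries are sorted by ascending addr; each entry describes the code from
 * its own addr up to the next entry. Returns false if epc lies before the
 * first annotation.
 */
static bool find_current_loc(const struct task_loc_entry *tbl, size_t n,
                             uintptr_t epc, bool from_user,
                             enum task_loc *out)
{
    const struct task_loc_entry *hit = NULL;

    for (size_t i = 0; i < n && tbl[i].addr <= epc; i++)
        hit = &tbl[i];
    if (hit == NULL)
        return false;
    *out = from_user ? hit->u_loc : hit->s_loc;
    return true;
}
```

With one entry before the csrrw (IN_TP, IN_SSCRATCH) and one after it (IN_SSCRATCH, IN_TP), an event that interrupts the bnez on a trap that came from U-mode resolves to IN_TP, matching the swap that has just happened.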

Note, however, that SSE events have priorities, and a higher-priority event can preempt a lower-priority one. Consider this case:

handle_exception (interrupted context)
   -> sse event1 handler
      -> sse event2 handler

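Incidentally, there is another question worth discussing here: should event2 access the state of event1, or the state of the originally interrupted context? What we are discussing now is how to obtain the state of the interrupted context, such as its call stack.

The current approach does not handle this case well; see the mailing-list discussion: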

>> +    /*
>> +     * If interrupting the kernel during exception handling
>> +     * (see handle_exception), then, we might have tp either in
>> SSCRATCH or
>> +     * in tp, this part is non regular and requires some more work to
>> +     * determine were is located the current task.
>> +     */
>> +    la t1, handle_exception
>> +    la t2, .Lret_from_exception_end
> 
> Since SSE events are nested, is below possible ?
> 
> Event 1 happened when `handle_exception` < PC < `ret_from_exception` and
> `tp = 0`
> `handle_see` is called and this code hasn't reached a point of obtaining
> correct `tp`
> Firmware is triggered again on this or another hart.
> Firmware decides to inject another event on this hart (either global or
> local) i.e. Event 2
> 
> Now `handle_sse` for Event 2 will think I am good because PC is outside
> the range of `handle_exception`
> and `ret_from_exception`. And go ahead and call `do_sse` which will lead
> to crash eventually.

And you are right :D That part of the exception is a pain to handle,
I(ll try to find a way to handle nested event at the beginning of SSE
handler.

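If SSE did not have to support priorities, event nesting would not need to be supported and this problem would not exist. We can of course also consider other solutions.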

M1 - More Info from M mode

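The patchset's approach cannot handle nested SSE events because the second handler (for event2) only has information about event1 and knows nothing about the originally interrupted context. One possible solution: when M-mode uses mret to run an event handler, it also passes in the information about the originally interrupted context, including its epc, so that the event2 handler can still use the annotated locations to obtain the correct current of the interrupted context.

M-mode has the full information about SSE event nesting, whereas an individual S-mode event handler does not; it only has what M-mode passes to it (through mret or ecall).

  • The SSE handler (handle_sse) receives the sse_registered_event prepared by M-mode in register a0. From this data structure it can get the stack it runs on and so forth, without relying on the tp register
  • handle_sse uses tp only because it wants to access the current of the interrupted context
  • Whether event1's handler modifying tp/scratch affects event2's ability to obtain the current of the interrupted context is something this approach still needs to settle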

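M2 - Distinguish by scratch

Compared with handle_exception, handle_sse has to deal with more possible values of scratch:

  • 0
  • current task_struct
  • userspace tp, which must be guaranteed to point into the U-mode address space

Can we guarantee the following?

  • scratch >= kernel start addr: current is in scratch
  • scratch < kernel start addr: current is in tp

From a security standpoint, userspace might set tp to an address in the kernel address space, and we need a way to prevent that. If an extra temporary register were available at that point, handle_exception could do this fairly easily by checking before it swaps tp and scratch.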
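
The proposed rule reduces to a single address comparison. A minimal sketch, assuming the guarantee above actually holds (a userspace tp never falls in the kernel range): `classify_by_scratch` is a hypothetical name, and `KERNEL_START` is only an example boundary (the start of the upper half of an Sv39 address space), not a value taken from the kernel.

```c
#include <stdint.h>

enum task_loc { IN_TP, IN_SSCRATCH };

/* Hypothetical boundary: example value for the lowest kernel address. */
#define KERNEL_START 0xffffffc000000000ull

/* Decide where current lives from the value found in scratch. */
static enum task_loc classify_by_scratch(uint64_t scratch)
{
    /* A kernel pointer in scratch can only be current. */
    if (scratch >= KERNEL_START)
        return IN_SSCRATCH;
    /* 0 or a userspace tp: current must be in tp. */
    return IN_TP;
}
```

Both scratch == 0 and a userspace tp fall below the boundary and map to IN_TP, so the three cases listed above collapse into two.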

Resumable NMI (RNMI)

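Even with this approach, the same problem as with SBI SSE arises.

Summary

  • Traps are handled in M-mode by default, but can be delegated to S-mode or even U-mode
  • The use of the swap instruction together with scratch
  • M-mode can execute sret; in what situations is this used?
  • Is setting xPIE=1 on xRET done for debugging, or for some other reason?
  • Introducing NMI adds a great deal of complexity to obtaining the interrupted context. The posted SBI SSE patch cannot handle nested SSE events. Two possible solutions are proposed here, but their correctness has not been verified; NMI implementations on other architectures are worth consulting
  • Some of the questions above can be answered by studying the QEMU implementation and by modifying it to experiment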