Preface

I have recently been planning to study CPU virtualization, and one of its important topics is interrupt virtualization.

Before studying interrupt virtualization, we first need a solid basic understanding of interrupts. Although I touched on them before while analyzing Linux kernel hard IRQs, that view was rather narrow, so this post looks at interrupts in more depth.

Interrupt hardware

Essentially, the kernel is the operating system's interface for controlling hardware, and its logic closely follows the hardware specifications. Because of this tight coupling, studying the CPU's interrupt hardware is the key to understanding the kernel's interrupt subsystem.

The following sections walk through how the CPU's interrupt hardware has evolved.

pic(Programmable Interrupt Controller)

The PIC, i.e. the 8259A, was the earliest interrupt management chip for the CPU. Its photo and block diagram are shown below.
8259A photo
8259A block diagram

Interrupt setup

According to the 8259A datasheet, the ICW (Initialization Command Words) and OCW (Operation Command Words) registers are normally read and written via PIO, which is how the 8259A's working state, such as the trigger mode, the interrupt vector base, and the interrupt status, is queried and configured. The details are not covered here; refer to the 8259A datasheet if interested.
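
As a hedged illustration (my own sketch, not from the 8259A text), the ICW sequence could be programmed over the two I/O ports of the master 8259A roughly like this; the vector base 0x20 and the mask value are arbitrary example choices:

#include <stdint.h>

/* port-I/O helper with the Linux outb(value, port) argument order */
static inline void outb(uint8_t value, uint16_t port)
{
    __asm__ volatile ("outb %0, %1" : : "a"(value), "Nd"(port));
}

#define PIC1_CMD   0x20    /* master 8259A command port */
#define PIC1_DATA  0x21    /* master 8259A data port    */

static void pic_init_sketch(void)
{
    outb(0x11, PIC1_CMD);   /* ICW1: edge triggered, cascade mode, ICW4 needed    */
    outb(0x20, PIC1_DATA);  /* ICW2: vector base 0x20, so IR0 maps to vector 0x20 */
    outb(0x04, PIC1_DATA);  /* ICW3: a slave 8259A is attached on IR2             */
    outb(0x01, PIC1_DATA);  /* ICW4: 8086 mode, normal (non-AEOI) EOI             */

    outb(0xfe, PIC1_DATA);  /* OCW1: mask all IRs except IR0                      */
}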

Interrupt handling

According to the 8259A datasheet, a typical interrupt handling flow looks like this:

  1. One or more IR pins raise an interrupt signal (the trigger mode is configured via ICW1), and the corresponding bits in the IRR (Interrupt Request Register) latch the request state (until they are cleared later)
  2. The PR (Priority Resolver) evaluates all pending requests according to the interrupt priorities (configured via OCW2) and the IMR (Interrupt Mask Register). If a valid request exists, the 8259A asserts the INT pin to send an interrupt request signal to the CPU
  3. After detecting the INT signal, the CPU finishes the current instruction and then issues the first INTA pulse (an electrical signal defined by the 8259A datasheet)
  4. During the first INTA cycle, the 8259A sets the bit in the ISR (In-Service Register) corresponding to the highest-priority pending IR in the IRR, and clears the corresponding IRR bit
  5. During the second INTA cycle, the 8259A sends the 8-bit interrupt type code (the base configured by ICW2 plus the IRQ number) to the CPU over the data bus
  6. Finally, the interrupt is completed according to the mode configured by ICW4 (in AEOI mode the corresponding ISR bit is cleared automatically; otherwise it is cleared only after the CPU sends an EOI command, see the sketch after this list)
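
As a hedged follow-up to step 6 (again my own sketch), in non-AEOI mode the handler ends the interrupt by writing a non-specific EOI command (OCW2) to the command port, reusing the outb() helper from the previous sketch:

#define PIC_EOI  0x20    /* OCW2: non-specific EOI command */

static void pic_send_eoi_sketch(void)
{
    outb(PIC_EOI, PIC1_CMD);  /* clears the highest-priority ISR bit */
}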

apic(Advanced Programmable Interrupt Controller)

The PIC described above is only suitable for interrupt handling on a single processor. Today's multi-CPU systems require a new interrupt controller, the APIC.

The APIC consists of the lapic (local APIC) and the ioapic (I/O APIC); its overall structure is shown below.
APIC overall block diagram

Roughly speaking, each CPU's lapic plays the role of a per-CPU 8259A and handles the interrupts destined for that CPU, while the ioapic collects the interrupt requests from external devices and redistributes them to the lapics of the different CPUs.

lapic

The lapic's main job is to accept interrupt messages, as well as interrupts generated by itself or by other lapics, and then notify its CPU to handle them. Its structure is shown below.
lapic block diagram

As you can see, it is quite similar to the 8259A, including the ISR, IRR, and so on, but it adds many new structures to support new features. As a result, its setup and interrupt flow are broadly similar to the 8259A's, with some differences.

Interrupt setup

According to the APIC manual, the lapic registers are by default mapped into a contiguous 4KB region of physical memory starting at 0xfee00000; the base address can be changed via the IA32_APIC_BASE MSR (Model Specific Register). The registers are read and written via MMIO. The register layout is shown below.
lapic register layout
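
As a hedged sketch of this MMIO access (my own illustration, assuming the 0xfee00000 region has already been mapped into the kernel's address space), reading a lapic register is just a 32-bit load at the documented offset, e.g. the ID register at 0x20 and the version register at 0x30:

#include <stdint.h>

#define LAPIC_ID_REG       0x020
#define LAPIC_VERSION_REG  0x030

static inline uint32_t lapic_read(volatile uint8_t *lapic_base, uint32_t reg)
{
    return *(volatile uint32_t *)(lapic_base + reg);
}

static void lapic_dump_sketch(volatile uint8_t *lapic_base)
{
    uint32_t id  = lapic_read(lapic_base, LAPIC_ID_REG) >> 24;      /* xAPIC ID sits in bits 31:24 */
    uint32_t ver = lapic_read(lapic_base, LAPIC_VERSION_REG) & 0xff;
    (void)id;   /* the caller would log or use these values */
    (void)ver;
}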

Interrupt handling

The lapic can handle three kinds of interrupts: local interrupts (such as APIC timer generated interrupts), IPIs (Inter-Processor Interrupts), and interrupts sent by the ioapic. The lapic handles them with the following flow:

  1. Filter for interrupts whose destination is this lapic (local interrupts always target the local CPU; for IPIs and ioapic-delivered interrupts the destination field of the interrupt decides)
  2. If the interrupt's delivery mode is NMI (Non-Maskable Interrupt), SMI (System Management Interrupt), INIT, ExtINT, or SIPI (Start-up IPI), the interrupt is forwarded to the CPU directly over the relevant pins and handling ends here
  3. For other delivery modes, the IRR bit corresponding to the interrupt vector is set (until it is cleared later)
  4. The lapic evaluates all pending requests in the IRR against the PPR (Processor Priority Register). If a valid request exists, it clears the corresponding IRR bit, sets the corresponding ISR bit, and delivers the interrupt to the CPU over the relevant pins
  5. After handling the interrupt, the CPU writes to the EOI register (End-Of-Interrupt Register, see the sketch after this list)
  6. The lapic clears the corresponding ISR bit and, depending on the contents of the SVR (Spurious-Interrupt Vector Register), may broadcast the EOI to all ioapics over the bus
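
A minimal sketch of the EOI write in step 5, under the same MMIO assumptions as above: software writes 0 to the EOI register at offset 0xB0 once the handler finishes.

#define LAPIC_EOI_REG  0x0b0

static inline void lapic_eoi_sketch(volatile uint8_t *lapic_base)
{
    *(volatile uint32_t *)(lapic_base + LAPIC_EOI_REG) = 0;   /* writing 0 signals end of interrupt */
}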

ioapic

The ioapic is responsible for receiving hardware interrupts from external I/O devices, converting them into messages of a defined format, and sending them over the bus to one or more lapics, as illustrated below.

ioapic pinout diagram
ioapic block diagram

Interrupt setup

Like the lapic, the ioapic has many registers for configuring and querying its working state. Unlike the lapic, however, the ioapic does not map all of its registers directly into physical memory; it maps only two: IOREGSEL (I/O Register Select) at 0xfec0xy00 (x and y can be configured via the APICBASE register), which selects the number of the ioapic register to access, and IOWIN (I/O Window Register) at 0xfec0xy10, which reads or writes the contents of the selected register.

The most important registers are the IOREDTBL (I/O Redirection Table Registers): a 24-entry interrupt redirection table in which each entry is a 64-bit register holding the redirection configuration of one ioapic interrupt pin, including the interrupt vector, the destination, the delivery mode, and so on.
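
As a hedged sketch of this indirect access scheme (my own illustration), reading a 64-bit IOREDTBL entry means selecting its two 32-bit halves through IOREGSEL and reading them through IOWIN; entry n starts at register index 0x10 + 2*n:

#include <stdint.h>

#define IOAPIC_IOREGSEL  0x00   /* index register */
#define IOAPIC_IOWIN     0x10   /* data window    */

static uint32_t ioapic_read(volatile uint8_t *ioapic_base, uint8_t reg)
{
    *(volatile uint32_t *)(ioapic_base + IOAPIC_IOREGSEL) = reg;
    return *(volatile uint32_t *)(ioapic_base + IOAPIC_IOWIN);
}

static uint64_t ioapic_read_redtbl(volatile uint8_t *ioapic_base, unsigned int pin)
{
    uint32_t lo = ioapic_read(ioapic_base, 0x10 + 2 * pin);       /* vector, delivery mode, ... */
    uint32_t hi = ioapic_read(ioapic_base, 0x10 + 2 * pin + 1);   /* destination field          */
    return ((uint64_t)hi << 32) | lo;
}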

Interrupt handling

The ioapic's interrupt handling is simple: whenever it receives an interrupt signal on one of its pins from an external I/O device, it formats a message according to that pin's redirection entry in the IOREDTBL and sends it over the bus to the lapic(s) specified by the destination field.

msi(Message Signaled Interrupt)

In early computer architectures, peripheral hardware interrupts were mainly forwarded to each CPU's lapic through the ioapic. But as PCI/PCIe devices multiplied rapidly, this traditional interrupt mechanism exposed many limitations.

To address this, version 2.2 of the PCI specification introduced MSI (Message Signaled Interrupt). With MSI, a PCI device uses a PCI bus memory-write transaction to write the configured message data to the configured message address, which is memory-mapped to the lapic, thereby bypassing the ioapic and sending the interrupt notification directly to the lapic. The overall architecture is shown below (the exact wiring details may not be accurate; the point is that the ioapic is bypassed).

MSI overall block diagram

Interrupt setup

MSI for PCI is configured through the PCI configuration space, which was briefly introduced in the earlier post on QEMU PCI devices. The configuration space is shown below.
PCI configuration space

MSI is configured through the MSI capability or the MSI-X capability.

msi capability

The structure of the MSI capability is shown below.

msi capability structure

It has several fields, but the key ones are Message Address (plus Message Upper Address) and Message Data.

The Message Address layout is shown below; it defines the destination address of the memory-write TLP that the PCI device issues when it raises an interrupt.
Message Address layout

Message Data defines the value that gets written and describes the interrupt itself; its layout is shown below.
Message Data layout
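
As a hedged sketch of how these two fields are typically encoded on x86 (my own illustration based on the lapic addressing described earlier): the Message Address carries the 0xfee00000 base plus the destination APIC ID, and the Message Data carries the vector and delivery mode.

#include <stdint.h>

static uint32_t msi_address_sketch(uint8_t dest_apic_id)
{
    /* 0xfee00000 base with the destination APIC ID in bits 19:12 */
    return 0xfee00000u | ((uint32_t)dest_apic_id << 12);
}

static uint16_t msi_data_sketch(uint8_t vector)
{
    /* vector in bits 7:0; delivery mode "fixed" (000) in bits 10:8 */
    return vector;
}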

msix capability

The structure of the MSI-X capability is shown below.

msix capability structure

Unlike the MSI capability, MSI-X stores the Message Address and Message Data fields in a table (an array); its layout is shown below.

msix table

Interrupt handling

PCI MSI interrupt handling follows this flow:

  1. When a PCI device needs to raise an interrupt, the PCI hardware writes the Message Data value to the Message Address configured in the capability; this write is sent to the RC (Root Complex) as a memory-write TLP
  2. On receiving it, the RC converts it into an interrupt message bus transaction and broadcasts it, similar to what the ioapic does

Interrupt subsystem

The interrupt subsystem is responsible for unified management of hardware interrupt resources: it receives and routes hardware interrupt signals, translates them into the corresponding virtual interrupt numbers, and invokes the registered interrupt service routines.

Hardware interrupts

The CPU hardware behavior on interrupt delivery differs across architectures, so each architecture has its own hardware interrupt management mechanism. The analysis here uses the x86 platform as an example.

As described in the interrupt hardware section, the kernel configures the interrupt controllers' vector information during system initialization. When an interrupt fires, the processor receives a signal carrying the interrupt's vector and uses that vector as an index into the IDT to find the corresponding gate descriptor and dispatch the handler, as shown below.

Interrupt handling flow

The interrupt subsystem initializes this IDT in idt_setup_apic_and_irq_gates(), as shown below.

//#0  idt_setup_apic_and_irq_gates () at arch/x86/kernel/idt.c:282
//#1 0xffffffff832a7262 in native_init_IRQ () at arch/x86/kernel/irqinit.c:103
//#2 0xffffffff8329ab9b in start_kernel () at init/main.c:977
//#3 0xffffffff832a59a8 in x86_64_start_reservations (real_mode_data=real_mode_data@entry=0x14770 <entry_stack_storage+1904> <error: Cannot access memory at address 0x14770>) at arch/x86/kernel/head64.c:507
//#4 0xffffffff832a5ae6 in x86_64_start_kernel (real_mode_data=0x14770 <entry_stack_storage+1904> <error: Cannot access memory at address 0x14770>) at arch/x86/kernel/head64.c:488
//#5 0xffffffff810a96f6 in secondary_startup_64 () at arch/x86/kernel/head_64.S:420
//#6 0x0000000000000000 in ?? ()
void __init idt_setup_apic_and_irq_gates(void)
{
int i = FIRST_EXTERNAL_VECTOR;
void *entry;
...
for_each_clear_bit_from(i, system_vectors, FIRST_SYSTEM_VECTOR) {
entry = irq_entries_start + IDT_ALIGN * (i - FIRST_EXTERNAL_VECTOR);
set_intr_gate(i, entry);
}
...
}

As you can see, for every vector between FIRST_EXTERNAL_VECTOR and FIRST_SYSTEM_VECTOR it sets the interrupt entry point to the corresponding stub in the code section starting at irq_entries_start.

/*
* ASM code to emit the common vector entry stubs where each stub is
* packed into IDT_ALIGN bytes.
*
* Note, that the 'pushq imm8' is emitted via '.byte 0x6a, vector' because
* GCC treats the local vector variable as unsigned int and would expand
* all vectors above 0x7F to a 5 byte push. The original code did an
* adjustment of the vector number to be in the signed byte range to avoid
* this. While clever it's mindboggling counterintuitive and requires the
* odd conversion back to a real vector number in the C entry points. Using
* .byte achieves the same thing and the only fixup needed in the C entry
* point is to mask off the bits above bit 7 because the push is sign
* extending.
*/
.align IDT_ALIGN
SYM_CODE_START(irq_entries_start)
vector=FIRST_EXTERNAL_VECTOR
.rept NR_EXTERNAL_VECTORS
UNWIND_HINT_IRET_REGS
0 :
ENDBR
.byte 0x6a, vector
jmp asm_common_interrupt
/* Ensure that the above is IDT_ALIGN bytes max */
.fill 0b + IDT_ALIGN - ., 1, 0xcc
vector = vector+1
.endr
SYM_CODE_END(irq_entries_start)

Each stub in irq_entries_start pushes its interrupt vector and then jumps to asm_common_interrupt.

Here asm_common_interrupt is generated by the DECLARE_IDTENTRY_IRQ macro through the idtentry macros in arch/x86/entry/entry_64.S, as shown below.

DECLARE_IDTENTRY_IRQ(X86_TRAP_OTHER,	common_interrupt);

#ifndef __ASSEMBLY__
...
#else /* !__ASSEMBLY__ */

/* Entries for common/spurious (device) interrupts */
#define DECLARE_IDTENTRY_IRQ(vector, func) \
idtentry_irq vector func
#endif

/*
* Interrupt entry/exit.
*
* The interrupt stubs push (vector) onto the stack, which is the error_code
* position of idtentry exceptions, and jump to one of the two idtentry points
* (common/spurious).
*
* common_interrupt is a hotpath, align it to a cache line
*/
.macro idtentry_irq vector cfunc
.p2align CONFIG_X86_L1_CACHE_SHIFT
idtentry \vector asm_\cfunc \cfunc has_error_code=1
.endm

/**
* idtentry - Macro to generate entry stubs for simple IDT entries
* @vector: Vector number
* @asmsym: ASM symbol for the entry point
* @cfunc: C function to be called
* @has_error_code: Hardware pushed error code on stack
*
* The macro emits code to set up the kernel context for straight forward
* and simple IDT entries. No IST stack, no paranoid entry checks.
*/
.macro idtentry vector asmsym cfunc has_error_code:req
SYM_CODE_START(\asmsym)

.if \vector == X86_TRAP_BP
/* #BP advances %rip to the next instruction */
UNWIND_HINT_IRET_ENTRY offset=\has_error_code*8 signal=0
.else
UNWIND_HINT_IRET_ENTRY offset=\has_error_code*8
.endif

ENDBR
ASM_CLAC
cld

.if \has_error_code == 0
pushq $-1 /* ORIG_RAX: no syscall to restart */
.endif

.if \vector == X86_TRAP_BP
/*
* If coming from kernel space, create a 6-word gap to allow the
* int3 handler to emulate a call instruction.
*/
testb $3, CS-ORIG_RAX(%rsp)
jnz .Lfrom_usermode_no_gap_\@
.rept 6
pushq 5*8(%rsp)
.endr
UNWIND_HINT_IRET_REGS offset=8
.Lfrom_usermode_no_gap_\@:
.endif

idtentry_body \cfunc \has_error_code

_ASM_NOKPROBE(\asmsym)
SYM_CODE_END(\asmsym)
.endm

/**
* idtentry_body - Macro to emit code calling the C function
* @cfunc: C function to be called
* @has_error_code: Hardware pushed error code on stack
*/
.macro idtentry_body cfunc has_error_code:req

/*
* Call error_entry() and switch to the task stack if from userspace.
*
* When in XENPV, it is already in the task stack, and it can't fault
* for native_iret() nor native_load_gs_index() since XENPV uses its
* own pvops for IRET and load_gs_index(). And it doesn't need to
* switch the CR3. So it can skip invoking error_entry().
*/
ALTERNATIVE "call error_entry; movq %rax, %rsp", \
"call xen_error_entry", X86_FEATURE_XENPV

ENCODE_FRAME_POINTER
UNWIND_HINT_REGS

movq %rsp, %rdi /* pt_regs pointer into 1st argument*/

.if \has_error_code == 1
movq ORIG_RAX(%rsp), %rsi /* get error code into 2nd argument*/
movq $-1, ORIG_RAX(%rsp) /* no syscall to restart */
.endif

call \cfunc

/* For some configurations \cfunc ends up being a noreturn. */
REACHABLE

jmp error_return
.endm
...

In the DECLARE_IDTENTRY_IRQ macro, idtentry is used to define and implement asm_common_interrupt, which ultimately calls common_interrupt for the actual handling.

common_interrupt itself is implemented by the DEFINE_IDTENTRY_IRQ macro in arch/x86/kernel/irq.c, as shown below.

#ifndef __ASSEMBLY__
/**
* DEFINE_IDTENTRY_IRQ - Emit code for device interrupt IDT entry points
* @func: Function name of the entry point
*
* The vector number is pushed by the low level entry stub and handed
* to the function as error_code argument which needs to be truncated
* to an u8 because the push is sign extending.
*
* irq_enter/exit_rcu() are invoked before the function body and the
* KVM L1D flush request is set. Stack switching to the interrupt stack
* has to be done in the function body if necessary.
*/
#define DEFINE_IDTENTRY_IRQ(func) \
static void __##func(struct pt_regs *regs, u32 vector); \
\
__visible noinstr void func(struct pt_regs *regs, \
unsigned long error_code) \
{ \
irqentry_state_t state = irqentry_enter(regs); \
u32 vector = (u32)(u8)error_code; \
\
instrumentation_begin(); \
kvm_set_cpu_l1tf_flush_l1d(); \
run_irq_on_irqstack_cond(__##func, regs, vector); \
instrumentation_end(); \
irqentry_exit(regs, state); \
} \
\
static noinline void __##func(struct pt_regs *regs, u32 vector)

#else /* !__ASSEMBLY__ */
...
#endif

/*
* common_interrupt() handles all normal device IRQ's (the special SMP
* cross-CPU interrupts have their own entry points).
*/
DEFINE_IDTENTRY_IRQ(common_interrupt)
{
struct pt_regs *old_regs = set_irq_regs(regs);
struct irq_desc *desc;

/* entry code tells RCU that we're not quiescent. Check it. */
RCU_LOCKDEP_WARN(!rcu_is_watching(), "IRQ failed to wake up RCU");

desc = __this_cpu_read(vector_irq[vector]);
if (likely(!IS_ERR_OR_NULL(desc))) {
handle_irq(desc, regs);
} else {
apic_eoi();

if (desc == VECTOR_UNUSED) {
pr_emerg_ratelimited("%s: %d.%u No irq handler for vector\n",
__func__, smp_processor_id(),
vector);
} else {
__this_cpu_write(vector_irq[vector], VECTOR_UNUSED);
}
}

set_irq_regs(old_regs);
}

Virtual interrupts

The above only covers how hardware interrupt resources are managed on x86; other architectures handle this differently. To hide these differences, the Linux kernel provides services to other subsystems and drivers in the form of virtual interrupts: each hardware interrupt is assigned a globally unique virtual interrupt, described by struct irq_desc and stored centrally in the sparse_irqs variable.

static struct maple_tree sparse_irqs = MTREE_INIT_EXT(sparse_irqs,
MT_FLAGS_ALLOC_RANGE |
MT_FLAGS_LOCK_EXTERN |
MT_FLAGS_USE_RCU,
sparse_irq_lock);

/**
* struct irq_desc - interrupt descriptor
* @irq_common_data: per irq and chip data passed down to chip functions
* @kstat_irqs: irq stats per cpu
* @handle_irq: highlevel irq-events handler
* @action: the irq action chain
* @status_use_accessors: status information
* @core_internal_state__do_not_mess_with_it: core internal status information
* @depth: disable-depth, for nested irq_disable() calls
* @wake_depth: enable depth, for multiple irq_set_irq_wake() callers
* @tot_count: stats field for non-percpu irqs
* @irq_count: stats field to detect stalled irqs
* @last_unhandled: aging timer for unhandled count
* @irqs_unhandled: stats field for spurious unhandled interrupts
* @threads_handled: stats field for deferred spurious detection of threaded handlers
* @threads_handled_last: comparator field for deferred spurious detection of threaded handlers
* @lock: locking for SMP
* @affinity_hint: hint to user space for preferred irq affinity
* @affinity_notify: context for notification of affinity changes
* @pending_mask: pending rebalanced interrupts
* @threads_oneshot: bitfield to handle shared oneshot threads
* @threads_active: number of irqaction threads currently running
* @wait_for_threads: wait queue for sync_irq to wait for threaded handlers
* @nr_actions: number of installed actions on this descriptor
* @no_suspend_depth: number of irqactions on a irq descriptor with
* IRQF_NO_SUSPEND set
* @force_resume_depth: number of irqactions on a irq descriptor with
* IRQF_FORCE_RESUME set
* @rcu: rcu head for delayed free
* @kobj: kobject used to represent this struct in sysfs
* @request_mutex: mutex to protect request/free before locking desc->lock
* @dir: /proc/irq/ procfs entry
* @debugfs_file: dentry for the debugfs file
* @name: flow handler name for /proc/interrupts output
*/
struct irq_desc {
struct irq_common_data irq_common_data;
struct irq_data irq_data;
unsigned int __percpu *kstat_irqs;
irq_flow_handler_t handle_irq;
struct irqaction *action; /* IRQ action list */
unsigned int status_use_accessors;
unsigned int core_internal_state__do_not_mess_with_it;
unsigned int depth; /* nested irq disables */
unsigned int wake_depth; /* nested wake enables */
unsigned int tot_count;
unsigned int irq_count; /* For detecting broken IRQs */
unsigned long last_unhandled; /* Aging timer for unhandled count */
unsigned int irqs_unhandled;
atomic_t threads_handled;
int threads_handled_last;
raw_spinlock_t lock;
struct cpumask *percpu_enabled;
const struct cpumask *percpu_affinity;
#ifdef CONFIG_SMP
const struct cpumask *affinity_hint;
struct irq_affinity_notify *affinity_notify;
#ifdef CONFIG_GENERIC_PENDING_IRQ
cpumask_var_t pending_mask;
#endif
#endif
unsigned long threads_oneshot;
atomic_t threads_active;
wait_queue_head_t wait_for_threads;
#ifdef CONFIG_PM_SLEEP
unsigned int nr_actions;
unsigned int no_suspend_depth;
unsigned int cond_suspend_depth;
unsigned int force_resume_depth;
#endif
#ifdef CONFIG_PROC_FS
struct proc_dir_entry *dir;
#endif
#ifdef CONFIG_GENERIC_IRQ_DEBUGFS
struct dentry *debugfs_file;
const char *dev_name;
#endif
#ifdef CONFIG_SPARSE_IRQ
struct rcu_head rcu;
struct kobject kobj;
#endif
struct mutex request_mutex;
int parent_irq;
struct module *owner;
const char *name;
#ifdef CONFIG_HARDIRQS_SW_RESEND
struct hlist_node resend_node;
#endif
} ____cacheline_internodealigned_in_smp;

handle_irq and action are the two most important fields of struct irq_desc: handle_irq performs the interrupt's hardware housekeeping and then triggers the handlers registered by drivers, while action records all the handlers that drivers have registered for this interrupt.

The Linux kernel uses alloc_desc() to allocate virtual interrupt resources, __irq_set_handler() to set a virtual interrupt's handle_irq field, and request_irq() to register a driver's handler for the interrupt. The analysis below uses the virtual interrupts of a virtio-net-pci device as an example, as shown below.
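
For reference, here is a minimal, hypothetical driver-side sketch of the request_irq() API discussed below; my_irq_handler, my_dev_setup_irq, and the "my-dev" name are invented for illustration only.

#include <linux/interrupt.h>

static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
    /* top half: acknowledge the hardware, defer heavy work elsewhere */
    return IRQ_HANDLED;
}

static int my_dev_setup_irq(unsigned int irq, void *my_dev)
{
    /* binds my_irq_handler to the virtual interrupt number 'irq' */
    return request_irq(irq, my_irq_handler, 0, "my-dev", my_dev);
}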

//#0  irq_insert_desc (desc=0xffff888100844200, irq=25) at kernel/irq/irqdesc.c:170
//#1 alloc_descs (owner=0x0 <fixed_percpu_data>, affinity=0x0 <fixed_percpu_data>, node=-1, cnt=1, start=25) at kernel/irq/irqdesc.c:538
//#2 __irq_alloc_descs (irq=irq@entry=-1, from=<optimized out>, from@entry=1, cnt=cnt@entry=1, node=node@entry=-1, owner=owner@entry=0x0 <fixed_percpu_data>, affinity=affinity@entry=0x0 <fixed_percpu_data>) at kernel/irq/irqdesc.c:859
//#3 0xffffffff81187850 in irq_domain_alloc_descs (cnt=cnt@entry=1, hwirq=hwirq@entry=0, node=-1, node@entry=1, affinity=affinity@entry=0x0 <fixed_percpu_data>, virq=-1) at kernel/irq/irqdomain.c:1100
//#4 0xffffffff81188fd1 in irq_domain_alloc_descs (affinity=<optimized out>, node=<optimized out>, hwirq=<optimized out>, cnt=<optimized out>, virq=<optimized out>) at kernel/irq/irqdomain.c:1485
//#5 irq_domain_alloc_irqs_locked (domain=domain@entry=0xffff88811016d480, irq_base=irq_base@entry=-1, nr_irqs=nr_irqs@entry=1, node=node@entry=-1, arg=arg@entry=0xffffc900000139f8, realloc=realloc@entry=false, affinity=0x0 <fixed_percpu_data>) at kernel/irq/irqdomain.c:1483
//#6 0xffffffff8118953c in __irq_domain_alloc_irqs (domain=domain@entry=0xffff88811016d480, irq_base=irq_base@entry=-1, nr_irqs=1, node=-1, arg=arg@entry=0xffffc900000139f8, realloc=realloc@entry=false, affinity=0x0 <fixed_percpu_data>) at kernel/irq/irqdomain.c:1555
//#7 0xffffffff8118bfdd in __msi_domain_alloc_irqs (dev=0xffff8881001360c0, domain=0xffff88811016d480, ctrl=0xffffc90000013a80) at ./include/linux/device.h:882
//#8 0xffffffff8118d569 in msi_domain_alloc_locked (ctrl=0xffffc90000013a80, dev=0xffff8881001360c0) at kernel/irq/msi.c:1383
//#9 msi_domain_alloc_irqs_all_locked (dev=dev@entry=0xffff8881001360c0, domid=domid@entry=0, nirqs=nirqs@entry=3) at kernel/irq/msi.c:1461
//#10 0xffffffff816121b6 in pci_msi_setup_msi_irqs (dev=dev@entry=0xffff888100136000, nvec=nvec@entry=3, type=type@entry=17) at drivers/pci/msi/irqdomain.c:17
//#11 0xffffffff81611843 in msix_setup_interrupts (affd=0x0 <fixed_percpu_data>, nvec=3, entries=0x0 <fixed_percpu_data>, dev=0xffff888100136000) at drivers/pci/msi/msi.c:670
//#12 msix_capability_init (affd=0x0 <fixed_percpu_data>, nvec=3, entries=<optimized out>, dev=0xffff888100136000) at drivers/pci/msi/msi.c:727
//#13 __pci_enable_msix_range (dev=dev@entry=0xffff888100136000, entries=entries@entry=0x0 <fixed_percpu_data>, minvec=minvec@entry=3, maxvec=maxvec@entry=3, affd=affd@entry=0x0 <fixed_percpu_data>, flags=flags@entry=4) at drivers/pci/msi/msi.c:833
//#14 0xffffffff8161001a in pci_alloc_irq_vectors_affinity (dev=0xffff888100136000, min_vecs=min_vecs@entry=3, max_vecs=max_vecs@entry=3, flags=4, affd=affd@entry=0x0 <fixed_percpu_data>) at drivers/pci/msi/api.c:270
//#15 0xffffffff8169a4ce in vp_request_msix_vectors (desc=0x0 <fixed_percpu_data>, per_vq_vectors=<optimized out>, nvectors=<optimized out>, vdev=0xffff8881009cb800) at drivers/virtio/virtio_pci_common.c:133
//#16 vp_find_vqs_msix (vdev=vdev@entry=0xffff8881009cb800, nvqs=nvqs@entry=3, vqs=vqs@entry=0xffff8881003f9aa0, callbacks=callbacks@entry=0xffff8881003f9ac0, names=names@entry=0xffff8881003f9ae0, per_vq_vectors=per_vq_vectors@entry=true, ctx=0xffff8881003f78e0, desc=0x0 <fixed_percpu_data>) at drivers/virtio/virtio_pci_common.c:312
//#17 0xffffffff8169a8e4 in vp_find_vqs (vdev=vdev@entry=0xffff8881009cb800, nvqs=3, vqs=0xffff8881003f9aa0, callbacks=0xffff8881003f9ac0, names=0xffff8881003f9ae0, ctx=0xffff8881003f78e0, desc=0x0 <fixed_percpu_data>) at drivers/virtio/virtio_pci_common.c:408
//#18 0xffffffff81698df6 in vp_modern_find_vqs (vdev=0xffff8881009cb800, nvqs=<optimized out>, vqs=<optimized out>, callbacks=<optimized out>, names=<optimized out>, ctx=<optimized out>, desc=0x0 <fixed_percpu_data>) at drivers/virtio/virtio_pci_modern.c:604
//#19 0xffffffff819f4d48 in virtio_find_vqs_ctx (desc=0x0 <fixed_percpu_data>, ctx=0xffff8881003f78e0, names=0xffff8881003f9ae0, callbacks=0xffff8881003f9ac0, vqs=0xffff8881003f9aa0, nvqs=3, vdev=<optimized out>) at ./include/linux/virtio_config.h:242
//#20 virtnet_find_vqs (vi=0xffff888114ad0920) at drivers/net/virtio_net.c:4389
//#21 init_vqs (vi=0xffff888114ad0920) at drivers/net/virtio_net.c:4478
//#22 0xffffffff819f650e in virtnet_probe (vdev=0xffff8881009cb800) at drivers/net/virtio_net.c:4799
//#23 0xffffffff81693a1e in virtio_dev_probe (_d=0xffff8881009cb810) at drivers/virtio/virtio.c:311
//#24 0xffffffff8193e7dc in call_driver_probe (drv=0xffffffff82c10240 <virtio_net_driver>, dev=0xffff8881009cb810) at drivers/base/dd.c:578
//#25 really_probe (dev=dev@entry=0xffff8881009cb810, drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>) at drivers/base/dd.c:656
//#26 0xffffffff8193ea4e in __driver_probe_device (drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>, dev=dev@entry=0xffff8881009cb810) at drivers/base/dd.c:798
//#27 0xffffffff8193eb29 in driver_probe_device (drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>, dev=dev@entry=0xffff8881009cb810) at drivers/base/dd.c:828
//#28 0xffffffff8193eda5 in __driver_attach (data=0xffffffff82c10240 <virtio_net_driver>, dev=0xffff8881009cb810) at drivers/base/dd.c:1214
//#29 __driver_attach (dev=0xffff8881009cb810, data=0xffffffff82c10240 <virtio_net_driver>) at drivers/base/dd.c:1154
//#30 0xffffffff8193c577 in bus_for_each_dev (bus=<optimized out>, start=start@entry=0x0 <fixed_percpu_data>, data=data@entry=0xffffffff82c10240 <virtio_net_driver>, fn=fn@entry=0xffffffff8193ed20 <__driver_attach>) at drivers/base/bus.c:368
//#31 0xffffffff8193e1b9 in driver_attach (drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>) at drivers/base/dd.c:1231
//#32 0xffffffff8193d957 in bus_add_driver (drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>) at drivers/base/bus.c:673
//#33 0xffffffff8193ff4b in driver_register (drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>) at drivers/base/driver.c:246
//#34 0xffffffff8169317b in register_virtio_driver (driver=driver@entry=0xffffffff82c10240 <virtio_net_driver>) at drivers/virtio/virtio.c:370
//#35 0xffffffff832f8819 in virtio_net_driver_init () at drivers/net/virtio_net.c:5050
//#36 0xffffffff81001a63 in do_one_initcall (fn=0xffffffff832f8790 <virtio_net_driver_init>) at init/main.c:1238
//#37 0xffffffff8329b1d7 in do_initcall_level (command_line=0xffff8881002c1240 "rdinit", level=6) at init/main.c:1300
//#38 do_initcalls () at init/main.c:1316
//#39 do_basic_setup () at init/main.c:1335
//#40 kernel_init_freeable () at init/main.c:1548
//#41 0xffffffff81f8f7a5 in kernel_init (unused=<optimized out>) at init/main.c:1437
//#42 0xffffffff810baddf in ret_from_fork (prev=<optimized out>, regs=0xffffc90000013f58, fn=0xffffffff81f8f790 <kernel_init>, fn_arg=0x0 <fixed_percpu_data>) at arch/x86/kernel/process.c:147
//#43 0xffffffff8100244a in ret_from_fork_asm () at arch/x86/entry/entry_64.S:243
//#44 0x0000000000000000 in ?? ()
static int alloc_descs(unsigned int start, unsigned int cnt, int node,
const struct irq_affinity_desc *affinity,
struct module *owner)
{
struct irq_desc *desc;
int i;

/* Validate affinity mask(s) */
if (affinity) {
for (i = 0; i < cnt; i++) {
if (cpumask_empty(&affinity[i].mask))
return -EINVAL;
}
}

for (i = 0; i < cnt; i++) {
const struct cpumask *mask = NULL;
unsigned int flags = 0;

if (affinity) {
if (affinity->is_managed) {
flags = IRQD_AFFINITY_MANAGED |
IRQD_MANAGED_SHUTDOWN;
}
mask = &affinity->mask;
node = cpu_to_node(cpumask_first(mask));
affinity++;
}

desc = alloc_desc(start + i, node, flags, mask, owner);
if (!desc)
goto err;
irq_insert_desc(start + i, desc);
irq_sysfs_add(start + i, desc);
irq_add_debugfs_entry(start + i, desc);
}
return start;

err:
for (i--; i >= 0; i--)
free_desc(start + i);
return -ENOMEM;
}

As the backtrace shows, during the virtio device driver's interrupt initialization, __irq_domain_alloc_irqs() is called to interact with the corresponding irq domain, which eventually calls irq_insert_desc() to finish allocating the virtual interrupt resources and insert them into sparse_irqs.

An irq domain represents an interrupt controller, here the MSI irq domain. Because delivering a single interrupt to the CPU may involve several interrupt controllers, irq domains usually form a hierarchy; in this case it is MSI irq domain (the device's interrupt controller) -> CPU vector irq domain (the APIC interrupt controller).

//#0  __irq_do_set_handler (desc=desc@entry=0xffff888100844200, handle=handle@entry=0xffffffff811856e0 <handle_edge_irq>, is_chained=is_chained@entry=0, name=name@entry=0xffffffff82707bf8 "edge") at kernel/irq/chip.c:988
//#1 0xffffffff811866b0 in __irq_set_handler (irq=irq@entry=25, handle=0xffffffff811856e0 <handle_edge_irq>, is_chained=is_chained@entry=0, name=0xffffffff82707bf8 "edge") at kernel/irq/chip.c:1067
//#2 0xffffffff8118b868 in msi_domain_ops_init (domain=<optimized out>, info=0xffff888100843f88, virq=25, hwirq=<optimized out>, arg=<optimized out>) at kernel/irq/msi.c:778
//#3 0xffffffff8118b799 in msi_domain_alloc (domain=0xffff88811016d480, virq=25, nr_irqs=1, arg=0xffffc900000139f8) at kernel/irq/msi.c:702
//#4 0xffffffff81188ecc in irq_domain_alloc_irqs_hierarchy (arg=0xffffc900000139f8, nr_irqs=1, irq_base=25, domain=0xffff88811016d480) at kernel/irq/irqdomain.c:1471
//#5 irq_domain_alloc_irqs_locked (domain=domain@entry=0xffff88811016d480, irq_base=irq_base@entry=-1, nr_irqs=nr_irqs@entry=1, node=node@entry=-1, arg=arg@entry=0xffffc900000139f8, realloc=realloc@entry=false, affinity=0x0 <fixed_percpu_data>) at kernel/irq/irqdomain.c:1498
//#6 0xffffffff8118953c in __irq_domain_alloc_irqs (domain=domain@entry=0xffff88811016d480, irq_base=irq_base@entry=-1, nr_irqs=1, node=-1, arg=arg@entry=0xffffc900000139f8, realloc=realloc@entry=false, affinity=0x0 <fixed_percpu_data>) at kernel/irq/irqdomain.c:1555
//#7 0xffffffff8118bfdd in __msi_domain_alloc_irqs (dev=0xffff8881001360c0, domain=0xffff88811016d480, ctrl=0xffffc90000013a80) at ./include/linux/device.h:882
//#8 0xffffffff8118d569 in msi_domain_alloc_locked (ctrl=0xffffc90000013a80, dev=0xffff8881001360c0) at kernel/irq/msi.c:1383
//#9 msi_domain_alloc_irqs_all_locked (dev=dev@entry=0xffff8881001360c0, domid=domid@entry=0, nirqs=nirqs@entry=3) at kernel/irq/msi.c:1461
//#10 0xffffffff816121b6 in pci_msi_setup_msi_irqs (dev=dev@entry=0xffff888100136000, nvec=nvec@entry=3, type=type@entry=17) at drivers/pci/msi/irqdomain.c:17
//#11 0xffffffff81611843 in msix_setup_interrupts (affd=0x0 <fixed_percpu_data>, nvec=3, entries=0x0 <fixed_percpu_data>, dev=0xffff888100136000) at drivers/pci/msi/msi.c:670
//#12 msix_capability_init (affd=0x0 <fixed_percpu_data>, nvec=3, entries=<optimized out>, dev=0xffff888100136000) at drivers/pci/msi/msi.c:727
//#13 __pci_enable_msix_range (dev=dev@entry=0xffff888100136000, entries=entries@entry=0x0 <fixed_percpu_data>, minvec=minvec@entry=3, maxvec=maxvec@entry=3, affd=affd@entry=0x0 <fixed_percpu_data>, flags=flags@entry=4) at drivers/pci/msi/msi.c:833
//#14 0xffffffff8161001a in pci_alloc_irq_vectors_affinity (dev=0xffff888100136000, min_vecs=min_vecs@entry=3, max_vecs=max_vecs@entry=3, flags=4, affd=affd@entry=0x0 <fixed_percpu_data>) at drivers/pci/msi/api.c:270
//#15 0xffffffff8169a4ce in vp_request_msix_vectors (desc=0x0 <fixed_percpu_data>, per_vq_vectors=<optimized out>, nvectors=<optimized out>, vdev=0xffff8881009cb800) at drivers/virtio/virtio_pci_common.c:133
//#16 vp_find_vqs_msix (vdev=vdev@entry=0xffff8881009cb800, nvqs=nvqs@entry=3, vqs=vqs@entry=0xffff8881003f9aa0, callbacks=callbacks@entry=0xffff8881003f9ac0, names=names@entry=0xffff8881003f9ae0, per_vq_vectors=per_vq_vectors@entry=true, ctx=0xffff8881003f78e0, desc=0x0 <fixed_percpu_data>) at drivers/virtio/virtio_pci_common.c:312
//#17 0xffffffff8169a8e4 in vp_find_vqs (vdev=vdev@entry=0xffff8881009cb800, nvqs=3, vqs=0xffff8881003f9aa0, callbacks=0xffff8881003f9ac0, names=0xffff8881003f9ae0, ctx=0xffff8881003f78e0, desc=0x0 <fixed_percpu_data>) at drivers/virtio/virtio_pci_common.c:408
//#18 0xffffffff81698df6 in vp_modern_find_vqs (vdev=0xffff8881009cb800, nvqs=<optimized out>, vqs=<optimized out>, callbacks=<optimized out>, names=<optimized out>, ctx=<optimized out>, desc=0x0 <fixed_percpu_data>) at drivers/virtio/virtio_pci_modern.c:604
//#19 0xffffffff819f4d48 in virtio_find_vqs_ctx (desc=0x0 <fixed_percpu_data>, ctx=0xffff8881003f78e0, names=0xffff8881003f9ae0, callbacks=0xffff8881003f9ac0, vqs=0xffff8881003f9aa0, nvqs=3, vdev=<optimized out>) at ./include/linux/virtio_config.h:242
//#20 virtnet_find_vqs (vi=0xffff888114ad0920) at drivers/net/virtio_net.c:4389
//#21 init_vqs (vi=0xffff888114ad0920) at drivers/net/virtio_net.c:4478
//#22 0xffffffff819f650e in virtnet_probe (vdev=0xffff8881009cb800) at drivers/net/virtio_net.c:4799
//#23 0xffffffff81693a1e in virtio_dev_probe (_d=0xffff8881009cb810) at drivers/virtio/virtio.c:311
//#24 0xffffffff8193e7dc in call_driver_probe (drv=0xffffffff82c10240 <virtio_net_driver>, dev=0xffff8881009cb810) at drivers/base/dd.c:578
//#25 really_probe (dev=dev@entry=0xffff8881009cb810, drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>) at drivers/base/dd.c:656
//#26 0xffffffff8193ea4e in __driver_probe_device (drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>, dev=dev@entry=0xffff8881009cb810) at drivers/base/dd.c:798
//#27 0xffffffff8193eb29 in driver_probe_device (drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>, dev=dev@entry=0xffff8881009cb810) at drivers/base/dd.c:828
//#28 0xffffffff8193eda5 in __driver_attach (data=0xffffffff82c10240 <virtio_net_driver>, dev=0xffff8881009cb810) at drivers/base/dd.c:1214
//#29 __driver_attach (dev=0xffff8881009cb810, data=0xffffffff82c10240 <virtio_net_driver>) at drivers/base/dd.c:1154
//#30 0xffffffff8193c577 in bus_for_each_dev (bus=<optimized out>, start=start@entry=0x0 <fixed_percpu_data>, data=data@entry=0xffffffff82c10240 <virtio_net_driver>, fn=fn@entry=0xffffffff8193ed20 <__driver_attach>) at drivers/base/bus.c:368
//#31 0xffffffff8193e1b9 in driver_attach (drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>) at drivers/base/dd.c:1231
//#32 0xffffffff8193d957 in bus_add_driver (drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>) at drivers/base/bus.c:673
//#33 0xffffffff8193ff4b in driver_register (drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>) at drivers/base/driver.c:246
//#34 0xffffffff8169317b in register_virtio_driver (driver=driver@entry=0xffffffff82c10240 <virtio_net_driver>) at drivers/virtio/virtio.c:370
//#35 0xffffffff832f8819 in virtio_net_driver_init () at drivers/net/virtio_net.c:5050
//#36 0xffffffff81001a63 in do_one_initcall (fn=0xffffffff832f8790 <virtio_net_driver_init>) at init/main.c:1238
//#37 0xffffffff8329b1d7 in do_initcall_level (command_line=0xffff8881002c1240 "rdinit", level=6) at init/main.c:1300
//#38 do_initcalls () at init/main.c:1316
//#39 do_basic_setup () at init/main.c:1335
//#40 kernel_init_freeable () at init/main.c:1548
//#41 0xffffffff81f8f7a5 in kernel_init (unused=<optimized out>) at init/main.c:1437
//#42 0xffffffff810baddf in ret_from_fork (prev=<optimized out>, regs=0xffffc90000013f58, fn=0xffffffff81f8f790 <kernel_init>, fn_arg=0x0 <fixed_percpu_data>) at arch/x86/kernel/process.c:147
//#43 0xffffffff8100244a in ret_from_fork_asm () at arch/x86/entry/entry_64.S:243
//#44 0x0000000000000000 in ?? ()
void
__irq_set_handler(unsigned int irq, irq_flow_handler_t handle, int is_chained,
const char *name)
{
unsigned long flags;
struct irq_desc *desc = irq_get_desc_buslock(irq, &flags, 0);

if (!desc)
return;

__irq_do_set_handler(desc, handle, is_chained, name);
irq_put_desc_busunlock(desc, flags);
}

While interacting with the irq domain to allocate the virtual interrupt resources, the irq domain also sets the virtual interrupt's handle_irq field, i.e. the hardware housekeeping required by the interrupt controller. Here it is set to handle_edge_irq(), which performs the hardware housekeeping and then triggers the handlers registered by drivers.

//#0  request_threaded_irq (irq=25, handler=0xffffffff81694d10 <vring_interrupt>, thread_fn=thread_fn@entry=0x0 <fixed_percpu_data>, irqflags=irqflags@entry=0, devname=devname@entry=0xffff888113134900 "virtio0-input.0", dev_id=dev_id@entry=0xffff888110171300) at kernel/irq/manage.c:2150
//#1 0xffffffff8169a67c in request_irq (dev=0xffff888110171300, name=0xffff888113134900 "virtio0-input.0", flags=0, handler=<optimized out>, irq=<optimized out>) at ./include/linux/interrupt.h:171
//#2 vp_find_vqs_msix (vdev=vdev@entry=0xffff8881009cb800, nvqs=nvqs@entry=3, vqs=vqs@entry=0xffff8881003f9aa0, callbacks=callbacks@entry=0xffff8881003f9ac0, names=names@entry=0xffff8881003f9ae0, per_vq_vectors=per_vq_vectors@entry=true, ctx=0xffff8881003f78e0, desc=0x0 <fixed_percpu_data>) at drivers/virtio/virtio_pci_common.c:347
//#3 0xffffffff8169a8e4 in vp_find_vqs (vdev=vdev@entry=0xffff8881009cb800, nvqs=3, vqs=0xffff8881003f9aa0, callbacks=0xffff8881003f9ac0, names=0xffff8881003f9ae0, ctx=0xffff8881003f78e0, desc=0x0 <fixed_percpu_data>) at drivers/virtio/virtio_pci_common.c:408
//#4 0xffffffff81698df6 in vp_modern_find_vqs (vdev=0xffff8881009cb800, nvqs=<optimized out>, vqs=<optimized out>, callbacks=<optimized out>, names=<optimized out>, ctx=<optimized out>, desc=0x0 <fixed_percpu_data>) at drivers/virtio/virtio_pci_modern.c:604
//#5 0xffffffff819f4d48 in virtio_find_vqs_ctx (desc=0x0 <fixed_percpu_data>, ctx=0xffff8881003f78e0, names=0xffff8881003f9ae0, callbacks=0xffff8881003f9ac0, vqs=0xffff8881003f9aa0, nvqs=3, vdev=<optimized out>) at ./include/linux/virtio_config.h:242
//#6 virtnet_find_vqs (vi=0xffff888114ad0920) at drivers/net/virtio_net.c:4389
//#7 init_vqs (vi=0xffff888114ad0920) at drivers/net/virtio_net.c:4478
//#8 0xffffffff819f650e in virtnet_probe (vdev=0xffff8881009cb800) at drivers/net/virtio_net.c:4799
//#9 0xffffffff81693a1e in virtio_dev_probe (_d=0xffff8881009cb810) at drivers/virtio/virtio.c:311
//#10 0xffffffff8193e7dc in call_driver_probe (drv=0xffffffff82c10240 <virtio_net_driver>, dev=0xffff8881009cb810) at drivers/base/dd.c:578
//#11 really_probe (dev=dev@entry=0xffff8881009cb810, drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>) at drivers/base/dd.c:656
//#12 0xffffffff8193ea4e in __driver_probe_device (drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>, dev=dev@entry=0xffff8881009cb810) at drivers/base/dd.c:798
//#13 0xffffffff8193eb29 in driver_probe_device (drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>, dev=dev@entry=0xffff8881009cb810) at drivers/base/dd.c:828
//#14 0xffffffff8193eda5 in __driver_attach (data=0xffffffff82c10240 <virtio_net_driver>, dev=0xffff8881009cb810) at drivers/base/dd.c:1214
//#15 __driver_attach (dev=0xffff8881009cb810, data=0xffffffff82c10240 <virtio_net_driver>) at drivers/base/dd.c:1154
//#16 0xffffffff8193c577 in bus_for_each_dev (bus=<optimized out>, start=start@entry=0x0 <fixed_percpu_data>, data=data@entry=0xffffffff82c10240 <virtio_net_driver>, fn=fn@entry=0xffffffff8193ed20 <__driver_attach>) at drivers/base/bus.c:368
//#17 0xffffffff8193e1b9 in driver_attach (drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>) at drivers/base/dd.c:1231
//#18 0xffffffff8193d957 in bus_add_driver (drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>) at drivers/base/bus.c:673
//#19 0xffffffff8193ff4b in driver_register (drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>) at drivers/base/driver.c:246
//#20 0xffffffff8169317b in register_virtio_driver (driver=driver@entry=0xffffffff82c10240 <virtio_net_driver>) at drivers/virtio/virtio.c:370
//#21 0xffffffff832f8819 in virtio_net_driver_init () at drivers/net/virtio_net.c:5050
//#22 0xffffffff81001a63 in do_one_initcall (fn=0xffffffff832f8790 <virtio_net_driver_init>) at init/main.c:1238
//#23 0xffffffff8329b1d7 in do_initcall_level (command_line=0xffff8881002c1240 "rdinit", level=6) at init/main.c:1300
//#24 do_initcalls () at init/main.c:1316
//#25 do_basic_setup () at init/main.c:1335
//#26 kernel_init_freeable () at init/main.c:1548
//#27 0xffffffff81f8f7a5 in kernel_init (unused=<optimized out>) at init/main.c:1437
//#28 0xffffffff810baddf in ret_from_fork (prev=<optimized out>, regs=0xffffc90000013f58, fn=0xffffffff81f8f790 <kernel_init>, fn_arg=0x0 <fixed_percpu_data>) at arch/x86/kernel/process.c:147
//#29 0xffffffff8100244a in ret_from_fork_asm () at arch/x86/entry/entry_64.S:243
//#30 0x0000000000000000 in ?? ()
static inline int __must_check
request_irq(unsigned int irq, irq_handler_t handler, unsigned long flags,
const char *name, void *dev)
{
return request_threaded_irq(irq, handler, NULL, flags, name, dev);
}

int request_threaded_irq(unsigned int irq, irq_handler_t handler,
irq_handler_t thread_fn, unsigned long irqflags,
const char *devname, void *dev_id)
{
struct irqaction *action;
struct irq_desc *desc;
int retval;

if (irq == IRQ_NOTCONNECTED)
return -ENOTCONN;

/*
* Sanity-check: shared interrupts must pass in a real dev-ID,
* otherwise we'll have trouble later trying to figure out
* which interrupt is which (messes up the interrupt freeing
* logic etc).
*
* Also shared interrupts do not go well with disabling auto enable.
* The sharing interrupt might request it while it's still disabled
* and then wait for interrupts forever.
*
* Also IRQF_COND_SUSPEND only makes sense for shared interrupts and
* it cannot be set along with IRQF_NO_SUSPEND.
*/
if (((irqflags & IRQF_SHARED) && !dev_id) ||
((irqflags & IRQF_SHARED) && (irqflags & IRQF_NO_AUTOEN)) ||
(!(irqflags & IRQF_SHARED) && (irqflags & IRQF_COND_SUSPEND)) ||
((irqflags & IRQF_NO_SUSPEND) && (irqflags & IRQF_COND_SUSPEND)))
return -EINVAL;

desc = irq_to_desc(irq);
if (!desc)
return -EINVAL;

if (!irq_settings_can_request(desc) ||
WARN_ON(irq_settings_is_per_cpu_devid(desc)))
return -EINVAL;

if (!handler) {
if (!thread_fn)
return -EINVAL;
handler = irq_default_primary_handler;
}

action = kzalloc(sizeof(struct irqaction), GFP_KERNEL);
if (!action)
return -ENOMEM;

action->handler = handler;
action->thread_fn = thread_fn;
action->flags = irqflags;
action->name = devname;
action->dev_id = dev_id;

retval = irq_chip_pm_get(&desc->irq_data);
if (retval < 0) {
kfree(action);
return retval;
}

retval = __setup_irq(irq, desc, action);

if (retval) {
irq_chip_pm_put(&desc->irq_data);
kfree(action->secondary);
kfree(action);
}
...
return retval;
}

After the virtual interrupt resources have been allocated and configured, the virtio driver calls request_irq() to register the interrupt's business-logic callback, here vring_interrupt(), which is then invoked from the handle_irq function pointer set earlier, as shown below.

//#0  vring_interrupt (irq=25, _vq=0xffff888110171300) at drivers/virtio/virtio_ring.c:2571
//#1 0xffffffff811805a5 in __handle_irq_event_percpu (desc=desc@entry=0xffff888100844200) at kernel/irq/handle.c:158
//#2 0xffffffff81180793 in handle_irq_event_percpu (desc=0xffff888100844200) at kernel/irq/handle.c:193
//#3 handle_irq_event (desc=desc@entry=0xffff888100844200) at kernel/irq/handle.c:210
//#4 0xffffffff81185766 in handle_edge_irq (desc=0xffff888100844200) at kernel/irq/chip.c:831
//#5 0xffffffff810adfbc in generic_handle_irq_desc (desc=<optimized out>) at ./include/linux/irqdesc.h:161
//#6 handle_irq (regs=0x0 <fixed_percpu_data>, desc=<optimized out>) at arch/x86/kernel/irq.c:238
//#7 __common_interrupt (regs=regs@entry=0xffffc900000e0ef8, vector=vector@entry=33) at arch/x86/kernel/irq.c:257
//#8 0xffffffff81f8aa7b in common_interrupt (regs=0xffffc900000e0ef8, error_code=33) at arch/x86/kernel/irq.c:247
//#9 0xffffffff820013e6 in asm_common_interrupt () at ./arch/x86/include/asm/idtentry.h:693
//#10 0x0000000000000000 in ?? ()
irqreturn_t __handle_irq_event_percpu(struct irq_desc *desc)
{
irqreturn_t retval = IRQ_NONE;
unsigned int irq = desc->irq_data.irq;
struct irqaction *action;

record_irq_time(desc);

for_each_action_of_desc(desc, action) {
irqreturn_t res;

/*
* If this IRQ would be threaded under force_irqthreads, mark it so.
*/
if (irq_settings_can_thread(desc) &&
!(action->flags & (IRQF_NO_THREAD | IRQF_PERCPU | IRQF_ONESHOT)))
lockdep_hardirq_threaded();

trace_irq_handler_entry(irq, action);
res = action->handler(irq, action->dev_id);
trace_irq_handler_exit(irq, action, res);

if (WARN_ONCE(!irqs_disabled(),"irq %u handler %pS enabled interrupts\n",
irq, action->handler))
local_irq_disable();

switch (res) {
case IRQ_WAKE_THREAD:
/*
* Catch drivers which return WAKE_THREAD but
* did not set up a thread function
*/
if (unlikely(!action->thread_fn)) {
warn_no_thread(irq, action);
break;
}

__irq_wake_thread(desc, action);
break;

default:
break;
}

retval |= res;
}

return retval;
}

Interrupt routing

The core of interrupt routing is to establish the mapping from hardware interrupts (APIC interrupt vectors) to virtual interrupts (Linux IRQs).

As the hardware interrupt subsection showed, on x86 the per-CPU vector_irq array is the core mapping table, and it is updated via apic_update_vector(), as shown below.

//#0  apic_update_vector (irqd=irqd@entry=0xffff888107167a80, newvec=newvec@entry=33, newcpu=1) at arch/x86/kernel/apic/vector.c:91
//#1 0xffffffff810f315d in assign_vector_locked (irqd=0xffff888107167a80, dest=dest@entry=0xffffffff8353f0b0 <vector_searchmask>) at arch/x86/kernel/apic/vector.c:263
//#2 0xffffffff810f32e0 in assign_irq_vector_any_locked (irqd=irqd@entry=0xffff888107167a80) at arch/x86/kernel/apic/vector.c:296
//#3 0xffffffff810f343f in activate_reserved (irqd=0xffff888107167a80) at arch/x86/kernel/apic/vector.c:404
//#4 x86_vector_activate (dom=<optimized out>, irqd=0xffff888107167a80, reserve=<optimized out>) at arch/x86/kernel/apic/vector.c:473
//#5 0xffffffff8119b30b in __irq_domain_activate_irq (irqd=0xffff888107167a80, reserve=reserve@entry=false) at kernel/irq/irqdomain.c:1830
//#6 0xffffffff8119b2ed in __irq_domain_activate_irq (irqd=irqd@entry=0xffff888107011e28, reserve=reserve@entry=false) at kernel/irq/irqdomain.c:1827
//#7 0xffffffff8119d638 in irq_domain_activate_irq (irq_data=irq_data@entry=0xffff888107011e28, reserve=reserve@entry=false) at kernel/irq/irqdomain.c:1853
//#8 0xffffffff81199963 in irq_activate (desc=desc@entry=0xffff888107011e00) at kernel/irq/chip.c:293
//#9 0xffffffff81196712 in __setup_irq (irq=irq@entry=25, desc=desc@entry=0xffff888107011e00, new=new@entry=0xffff8881071b2300) at kernel/irq/manage.c:1754
//#10 0xffffffff81196ceb in request_threaded_irq (irq=25, handler=0xffffffff816a8250 <vring_interrupt>, thread_fn=thread_fn@entry=0x0 <fixed_percpu_data>, irqflags=irqflags@entry=0, devname=devname@entry=0xffff88810706fd00 "virtio0-input.0", dev_id=dev_id@entry=0xffff8881071aad00) at kernel/irq/manage.c:2207
//#11 0xffffffff816adbbc in request_irq (dev=0xffff8881071aad00, name=0xffff88810706fd00 "virtio0-input.0", flags=0, handler=<optimized out>, irq=<optimized out>) at ./include/linux/interrupt.h:171
//#12 vp_find_vqs_msix (vdev=vdev@entry=0xffff88810708b000, nvqs=nvqs@entry=3, vqs=vqs@entry=0xffff888100ed3c20, callbacks=callbacks@entry=0xffff888100ed3c40, names=names@entry=0xffff888100ed3c60, per_vq_vectors=per_vq_vectors@entry=true, ctx=0xffff888100f11978, desc=0x0 <fixed_percpu_data>) at drivers/virtio/virtio_pci_common.c:347
//#13 0xffffffff816ade24 in vp_find_vqs (vdev=vdev@entry=0xffff88810708b000, nvqs=3, vqs=0xffff888100ed3c20, callbacks=0xffff888100ed3c40, names=0xffff888100ed3c60, ctx=0xffff888100f11978, desc=0x0 <fixed_percpu_data>) at drivers/virtio/virtio_pci_common.c:408
//#14 0xffffffff816ac336 in vp_modern_find_vqs (vdev=0xffff88810708b000, nvqs=<optimized out>, vqs=<optimized out>, callbacks=<optimized out>, names=<optimized out>, ctx=<optimized out>, desc=0x0 <fixed_percpu_data>) at drivers/virtio/virtio_pci_modern.c:604
//#15 0xffffffff81a08285 in virtio_find_vqs_ctx (desc=0x0 <fixed_percpu_data>, ctx=0xffff888100f11978, names=0xffff888100ed3c60, callbacks=0xffff888100ed3c40, vqs=0xffff888100ed3c20, nvqs=3, vdev=0xffff888107167a80) at ./include/linux/virtio_config.h:242
//#16 virtnet_find_vqs (vi=0xffff8881070a8920) at drivers/net/virtio_net.c:4389
//#17 init_vqs (vi=0xffff8881070a8920) at drivers/net/virtio_net.c:4478
//#18 0xffffffff81a09a4e in virtnet_probe (vdev=0xffff88810708b000) at drivers/net/virtio_net.c:4799
//#19 0xffffffff816a6f5b in virtio_dev_probe (_d=0xffff88810708b010) at drivers/virtio/virtio.c:311
//#20 0xffffffff81951d19 in call_driver_probe (drv=0xffffffff82c10240 <virtio_net_driver>, dev=0xffff88810708b010) at drivers/base/dd.c:578
//#21 really_probe (dev=dev@entry=0xffff88810708b010, drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>) at drivers/base/dd.c:656
//#22 0xffffffff81951f8e in __driver_probe_device (drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>, dev=dev@entry=0xffff88810708b010) at drivers/base/dd.c:798
//#23 0xffffffff81952069 in driver_probe_device (drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>, dev=dev@entry=0xffff88810708b010) at drivers/base/dd.c:828
//#24 0xffffffff819522e5 in __driver_attach (data=0xffffffff82c10240 <virtio_net_driver>, dev=0xffff88810708b010) at drivers/base/dd.c:1214
//#25 __driver_attach (dev=0xffff88810708b010, data=0xffffffff82c10240 <virtio_net_driver>) at drivers/base/dd.c:1154
//#26 0xffffffff8194fab4 in bus_for_each_dev (bus=<optimized out>, start=start@entry=0x0 <fixed_percpu_data>, data=data@entry=0xffffffff82c10240 <virtio_net_driver>, fn=fn@entry=0xffffffff81952260 <__driver_attach>) at drivers/base/bus.c:368
//#27 0xffffffff819516f9 in driver_attach (drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>) at drivers/base/dd.c:1231
//#28 0xffffffff81950e97 in bus_add_driver (drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>) at drivers/base/bus.c:673
//#29 0xffffffff8195348b in driver_register (drv=drv@entry=0xffffffff82c10240 <virtio_net_driver>) at drivers/base/driver.c:246
//#30 0xffffffff816a66bb in register_virtio_driver (driver=driver@entry=0xffffffff82c10240 <virtio_net_driver>) at drivers/virtio/virtio.c:370
//#31 0xffffffff832fbbf9 in virtio_net_driver_init () at drivers/net/virtio_net.c:5050
//#32 0xffffffff81001a60 in do_one_initcall (fn=0xffffffff832fbb70 <virtio_net_driver_init>) at init/main.c:1238
//#33 0xffffffff8329e1d7 in do_initcall_level (command_line=0xffff8881002c2280 "rdinit", level=6) at init/main.c:1300
//#34 do_initcalls () at init/main.c:1316
//#35 do_basic_setup () at init/main.c:1335
//#36 kernel_init_freeable () at init/main.c:1548
//#37 0xffffffff81fa3a85 in kernel_init (unused=<optimized out>) at init/main.c:1437
//#38 0xffffffff810cea7c in ret_from_fork (prev=<optimized out>, regs=0xffffc90000013f58, fn=0xffffffff81fa3a70 <kernel_init>, fn_arg=0x0 <fixed_percpu_data>) at arch/x86/kernel/process.c:147
//#39 0xffffffff8100244a in ret_from_fork_asm () at arch/x86/entry/entry_64.S:243
//#40 0x0000000000000000 in ?? ()
static void apic_update_vector(struct irq_data *irqd, unsigned int newvec,
unsigned int newcpu)
{
struct apic_chip_data *apicd = apic_chip_data(irqd);
struct irq_desc *desc = irq_data_to_desc(irqd);
bool managed = irqd_affinity_is_managed(irqd);

lockdep_assert_held(&vector_lock);

trace_vector_update(irqd->irq, newvec, newcpu, apicd->vector,
apicd->cpu);

/*
* If there is no vector associated or if the associated vector is
* the shutdown vector, which is associated to make PCI/MSI
* shutdown mode work, then there is nothing to release. Clear out
* prev_vector for this and the offlined target case.
*/
apicd->prev_vector = 0;
if (!apicd->vector || apicd->vector == MANAGED_IRQ_SHUTDOWN_VECTOR)
goto setnew;
/*
* If the target CPU of the previous vector is online, then mark
* the vector as move in progress and store it for cleanup when the
* first interrupt on the new vector arrives. If the target CPU is
* offline then the regular release mechanism via the cleanup
* vector is not possible and the vector can be immediately freed
* in the underlying matrix allocator.
*/
if (cpu_online(apicd->cpu)) {
apicd->move_in_progress = true;
apicd->prev_vector = apicd->vector;
apicd->prev_cpu = apicd->cpu;
WARN_ON_ONCE(apicd->cpu == newcpu);
} else {
irq_matrix_free(vector_matrix, apicd->cpu, apicd->vector,
managed);
}

setnew:
apicd->vector = newvec;
apicd->cpu = newcpu;
BUG_ON(!IS_ERR_OR_NULL(per_cpu(vector_irq, newcpu)[newvec]));
per_cpu(vector_irq, newcpu)[newvec] = desc;
}

The logic for building x86 interrupt routing is therefore fairly clear: the Linux kernel first allocates the virtual interrupt resources, then interacts with the irq domains to allocate and configure the hardware interrupt resources, and finally updates the vector_irq array to complete the mapping from hardware interrupts to virtual interrupts.

Interrupt bottom halves

As described in the interrupt hardware section, while the CPU handles an interrupt, related interrupts are masked until it finishes and sends the EOI command. If an interrupt handler runs too long, other interrupts cannot be serviced in time and hardware interrupts may be lost. To avoid this, the Linux kernel splits interrupt handling into a top half and a bottom half: the top half runs while interrupts are masked, responding to the hardware quickly and scheduling the bottom half, while the bottom half handles the time-consuming work asynchronously.

The main bottom-half mechanisms today are softirq, tasklet, and workqueue. Tasklets are built on top of softirqs, so both run in softirq context, whereas workqueues run in kernel process context.
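
As a hedged illustration of the top-half/bottom-half split (my own sketch, not taken from the kernel), a driver might register a quick top half that defers the heavy work to a tasklet; my_tasklet_fn and my_irq_handler are invented names.

#include <linux/interrupt.h>

static void my_tasklet_fn(struct tasklet_struct *t)
{
    /* bottom half: the time-consuming work runs here, in softirq context */
}

static DECLARE_TASKLET(my_tasklet, my_tasklet_fn);

static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
    /* top half: respond to the hardware quickly ... */
    tasklet_schedule(&my_tasklet);   /* ... and defer the rest to the bottom half */
    return IRQ_HANDLED;
}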

softirq

softirq events

Softirq events are fixed when the Linux kernel is compiled; each softirq number corresponds to one event.

enum
{
HI_SOFTIRQ=0,
TIMER_SOFTIRQ,
NET_TX_SOFTIRQ,
NET_RX_SOFTIRQ,
BLOCK_SOFTIRQ,
IRQ_POLL_SOFTIRQ,
TASKLET_SOFTIRQ,
SCHED_SOFTIRQ,
HRTIMER_SOFTIRQ,
RCU_SOFTIRQ, /* Preferable RCU should always be the last softirq */

NR_SOFTIRQS
};

Internally, the softirq_vec array manages each event's handler, and open_softirq() is used to register handlers.

static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;

void open_softirq(int nr, void (*action)(struct softirq_action *))
{
softirq_vec[nr].action = action;
}

Typically, the top half calls raise_softirq() to mark a softirq event as pending, and later the asynchronously executed bottom half calls __do_softirq() to process the pending softirq events.

#define or_softirq_pending(x)	(__this_cpu_or(local_softirq_pending_ref, (x)))
void raise_softirq(unsigned int nr)
{
unsigned long flags;

local_irq_save(flags);
raise_softirq_irqoff(nr);
local_irq_restore(flags);
}

void __raise_softirq_irqoff(unsigned int nr)
{
lockdep_assert_irqs_disabled();
trace_softirq_raise(nr);
or_softirq_pending(1UL << nr);
}

#define local_softirq_pending() (__this_cpu_read(local_softirq_pending_ref))
asmlinkage __visible void __softirq_entry __do_softirq(void)
{
unsigned long end = jiffies + MAX_SOFTIRQ_TIME;
unsigned long old_flags = current->flags;
int max_restart = MAX_SOFTIRQ_RESTART;
struct softirq_action *h;
bool in_hardirq;
__u32 pending;
int softirq_bit;

/*
* Mask out PF_MEMALLOC as the current task context is borrowed for the
* softirq. A softirq handled, such as network RX, might set PF_MEMALLOC
* again if the socket is related to swapping.
*/
current->flags &= ~PF_MEMALLOC;

pending = local_softirq_pending();

softirq_handle_begin();
in_hardirq = lockdep_softirq_start();
account_softirq_enter(current);

restart:
/* Reset the pending bitmask before enabling irqs */
set_softirq_pending(0);

local_irq_enable();

h = softirq_vec;

while ((softirq_bit = ffs(pending))) {
unsigned int vec_nr;
int prev_count;

h += softirq_bit - 1;

vec_nr = h - softirq_vec;
prev_count = preempt_count();

kstat_incr_softirqs_this_cpu(vec_nr);

trace_softirq_entry(vec_nr);
h->action(h);
trace_softirq_exit(vec_nr);
if (unlikely(prev_count != preempt_count())) {
pr_err("huh, entered softirq %u %s %p with preempt_count %08x, exited with %08x?\n",
vec_nr, softirq_to_name[vec_nr], h->action,
prev_count, preempt_count());
preempt_count_set(prev_count);
}
h++;
pending >>= softirq_bit;
}

if (!IS_ENABLED(CONFIG_PREEMPT_RT) &&
__this_cpu_read(ksoftirqd) == current)
rcu_softirq_qs();

local_irq_disable();

pending = local_softirq_pending();
if (pending) {
if (time_before(jiffies, end) && !need_resched() &&
--max_restart)
goto restart;

wakeup_softirqd();
}

account_softirq_exit(current);
lockdep_softirq_end(in_hardirq);
softirq_handle_end();
current_restore_flags(old_flags, PF_MEMALLOC);
}

The softirq handling logic is quite clear: when the top half wants to raise a softirq event for asynchronous processing, it sets the bit corresponding to that softirq number; when the bottom half runs, it executes the handlers of all softirq events whose bits are set.
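
As a rough usage illustration (reconstructed from memory rather than quoted from the kernel tree), the networking stack registers its receive handler for NET_RX_SOFTIRQ at init time and raises that softirq from interrupt context, roughly like this:

#include <linux/interrupt.h>

static void net_rx_action(struct softirq_action *h)
{
    /* bottom half: drain the per-CPU receive queues (heavily simplified) */
}

static int __init net_dev_init_sketch(void)
{
    open_softirq(NET_RX_SOFTIRQ, net_rx_action);   /* register the handler */
    return 0;
}

static void napi_schedule_sketch(void)
{
    /* called from the device's IRQ top half with interrupts disabled */
    __raise_softirq_irqoff(NET_RX_SOFTIRQ);        /* mark the event pending */
}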

Trigger points

As mentioned above, marked softirq work is executed asynchronously in the bottom half. Its execution is mainly triggered at the following two points:

  1. when the CPU schedules the ksoftirqd kernel thread
  2. when the top half (the hardware interrupt handler) exits (irq_exit_rcu)
ksoftirqd

The Linux kernel initializes one ksoftirqd kernel thread per CPU in spawn_ksoftirqd(), dedicated to processing softirq events.

//#0  spawn_ksoftirqd () at kernel/softirq.c:972
//#1 0xffffffff81001a60 in do_one_initcall (fn=0xffffffff832c8940 <spawn_ksoftirqd>) at init/main.c:1238
//#2 0xffffffff8329e100 in do_pre_smp_initcalls () at init/main.c:1344
//#3 kernel_init_freeable () at init/main.c:1537
//#4 0xffffffff81fa3a85 in kernel_init (unused=<optimized out>) at init/main.c:1437
//#5 0xffffffff810cea7c in ret_from_fork (prev=<optimized out>, regs=0xffffc90000013f58, fn=0xffffffff81fa3a70 <kernel_init>, fn_arg=0x0 <fixed_percpu_data>) at arch/x86/kernel/process.c:147
//#6 0xffffffff8100244a in ret_from_fork_asm () at arch/x86/entry/entry_64.S:243
//#7 0x0000000000000000 in ?? ()
static __init int spawn_ksoftirqd(void)
{
cpuhp_setup_state_nocalls(CPUHP_SOFTIRQ_DEAD, "softirq:dead", NULL,
takeover_tasklets);
BUG_ON(smpboot_register_percpu_thread(&softirq_threads));

return 0;
}
early_initcall(spawn_ksoftirqd);

During kernel initialization, it calls smpboot_register_percpu_thread() with the softirq_threads descriptor to create a kernel thread for each CPU.

static struct smp_hotplug_thread softirq_threads = {
.store = &ksoftirqd,
.thread_should_run = ksoftirqd_should_run,
.thread_fn = run_ksoftirqd,
.thread_comm = "ksoftirqd/%u",
};

/**
* smpboot_register_percpu_thread - Register a per_cpu thread related
* to hotplug
* @plug_thread: Hotplug thread descriptor
*
* Creates and starts the threads on all online cpus.
*/
int smpboot_register_percpu_thread(struct smp_hotplug_thread *plug_thread)
{
unsigned int cpu;
int ret = 0;

cpus_read_lock();
mutex_lock(&smpboot_threads_lock);
for_each_online_cpu(cpu) {
ret = __smpboot_create_thread(plug_thread, cpu);
if (ret) {
smpboot_destroy_threads(plug_thread);
goto out;
}
smpboot_unpark_thread(plug_thread, cpu);
}
list_add(&plug_thread->list, &hotplug_threads);
out:
mutex_unlock(&smpboot_threads_lock);
cpus_read_unlock();
return ret;
}

static int
__smpboot_create_thread(struct smp_hotplug_thread *ht, unsigned int cpu)
{
struct task_struct *tsk = *per_cpu_ptr(ht->store, cpu);
struct smpboot_thread_data *td;

if (tsk)
return 0;

td = kzalloc_node(sizeof(*td), GFP_KERNEL, cpu_to_node(cpu));
if (!td)
return -ENOMEM;
td->cpu = cpu;
td->ht = ht;

tsk = kthread_create_on_cpu(smpboot_thread_fn, td, cpu,
ht->thread_comm);
...
return 0;
}

/**
* smpboot_thread_fn - percpu hotplug thread loop function
* @data: thread data pointer
*
* Checks for thread stop and park conditions. Calls the necessary
* setup, cleanup, park and unpark functions for the registered
* thread.
*
* Returns 1 when the thread should exit, 0 otherwise.
*/
static int smpboot_thread_fn(void *data)
{
struct smpboot_thread_data *td = data;
struct smp_hotplug_thread *ht = td->ht;

while (1) {
set_current_state(TASK_INTERRUPTIBLE);
preempt_disable();
...
if (!ht->thread_should_run(td->cpu)) {
preempt_enable_no_resched();
schedule();
} else {
__set_current_state(TASK_RUNNING);
preempt_enable();
ht->thread_fn(td->cpu);
}
}
}

The overall logic is also clear: smpboot_register_percpu_thread() creates, for each CPU, a kernel thread that runs the smpboot_thread_fn() loop. Inside the loop the run_ksoftirqd() callback is invoked, which ultimately calls __do_softirq() to process softirq events.

irq_exit_rcu

As the hardware interrupt subsection showed, the top-half handler is defined by the DEFINE_IDTENTRY_IRQ macro; at the end of the handler it calls irq_exit_rcu(), which triggers invoke_softirq().

#define DEFINE_IDTENTRY_IRQ(func)					\
static void __##func(struct pt_regs *regs, u32 vector); \
\
__visible noinstr void func(struct pt_regs *regs, \
unsigned long error_code) \
{ \
irqentry_state_t state = irqentry_enter(regs); \
u32 vector = (u32)(u8)error_code; \
\
instrumentation_begin(); \
kvm_set_cpu_l1tf_flush_l1d(); \
run_irq_on_irqstack_cond(__##func, regs, vector); \
instrumentation_end(); \
irqentry_exit(regs, state); \
} \
\
static noinline void __##func(struct pt_regs *regs, u32 vector)


#define run_irq_on_irqstack_cond(func, regs, vector) \
{ \
...
call_on_irqstack_cond(func, regs, ASM_CALL_IRQ, \
IRQ_CONSTRAINTS, regs, vector); \
}

/*
* Macro to invoke system vector and device interrupt C handlers.
*/
#define call_on_irqstack_cond(func, regs, asm_call, constr, c_args...) \
{ \
...
irq_enter_rcu(); \
func(c_args); \
irq_exit_rcu(); \
...
}

void irq_exit_rcu(void)
{
__irq_exit_rcu();
/* must be last! */
lockdep_hardirq_exit();
}

static inline void __irq_exit_rcu(void)
{
#ifndef __ARCH_IRQ_EXIT_IRQS_DISABLED
local_irq_disable();
#else
lockdep_assert_irqs_disabled();
#endif
account_hardirq_exit(current);
preempt_count_sub(HARDIRQ_OFFSET);
if (!in_interrupt() && local_softirq_pending())
invoke_softirq();

tick_irq_exit();
}

//#0 invoke_softirq () at kernel/softirq.c:421
//#1 __irq_exit_rcu () at kernel/softirq.c:633
//#2 irq_exit_rcu () at kernel/softirq.c:645
//#3 0xffffffff81f8aabe in common_interrupt (regs=0xffffc9000009be38, error_code=<optimized out>) at arch/x86/kernel/irq.c:247
static inline void invoke_softirq(void)
{
if (!force_irqthreads() || !__this_cpu_read(ksoftirqd)) {
#ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
/*
* We can safely execute softirq on the current stack if
* it is the irq stack, because it should be near empty
* at this stage.
*/
__do_softirq();
#else
/*
* Otherwise, irq_exit() is called on the task stack that can
* be potentially deep already. So call softirq in its own stack
* to prevent from any overrun.
*/
do_softirq_own_stack();
#endif
} else {
wakeup_softirqd();
}
}

Inside invoke_softirq(), the kernel either calls __do_softirq() directly (or do_softirq_own_stack() when a separate irq stack is needed) to process the pending softirq events, or wakes up the ksoftirqd kernel thread to handle them, depending on force_irqthreads() and whether ksoftirqd is available

tasklet

Because softirq events are fixed at kernel compile time, adding a new one requires modifying and rebuilding the kernel, which is unacceptable. The kernel therefore implements a more commonly used bottom half on top of softirq: the tasklet. Its implementation is based on two softirq events, TASKLET_SOFTIRQ and HI_SOFTIRQ

tasklet events

Each dynamically created tasklet event is described by struct tasklet_struct

struct tasklet_struct
{
struct tasklet_struct *next;
unsigned long state;
atomic_t count;
bool use_callback;
union {
void (*func)(unsigned long data);
void (*callback)(struct tasklet_struct *t);
};
unsigned long data;
};

Internally, tasklets are managed on the tasklet_hi_vec/tasklet_vec linked lists; the list heads are per_cpu variables, i.e. there is one list per CPU

/*
* Tasklets
*/
struct tasklet_head {
struct tasklet_struct *head;
struct tasklet_struct **tail;
};

static DEFINE_PER_CPU(struct tasklet_head, tasklet_vec);
static DEFINE_PER_CPU(struct tasklet_head, tasklet_hi_vec);

Other kernel subsystems and driver modules can initialize a dynamically created tasklet with tasklet_setup()/tasklet_init(), or define one statically with DECLARE_TASKLET()/DECLARE_TASKLET_OLD()

void tasklet_setup(struct tasklet_struct *t,
void (*callback)(struct tasklet_struct *))
{
t->next = NULL;
t->state = 0;
atomic_set(&t->count, 0);
t->callback = callback;
t->use_callback = true;
t->data = 0;
}

void tasklet_init(struct tasklet_struct *t,
void (*func)(unsigned long), unsigned long data)
{
t->next = NULL;
t->state = 0;
atomic_set(&t->count, 0);
t->func = func;
t->use_callback = false;
t->data = data;
}

#define DECLARE_TASKLET(name, _callback) \
struct tasklet_struct name = { \
.count = ATOMIC_INIT(0), \
.callback = _callback, \
.use_callback = true, \
}

#define DECLARE_TASKLET_OLD(name, _func) \
struct tasklet_struct name = { \
.count = ATOMIC_INIT(0), \
.func = _func, \
}

Typically, the interrupt top half calls tasklet_schedule()/tasklet_hi_schedule() to append the pending tasklet to the tail of the tasklet_vec/tasklet_hi_vec queue and raise the corresponding softirq; the tasklet is then processed by tasklet_action()/tasklet_hi_action() when the TASKLET_SOFTIRQ/HI_SOFTIRQ softirq runs

static inline void tasklet_schedule(struct tasklet_struct *t)
{
if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
__tasklet_schedule(t);
}
void __tasklet_schedule(struct tasklet_struct *t)
{
__tasklet_schedule_common(t, &tasklet_vec,
TASKLET_SOFTIRQ);
}

static inline void tasklet_hi_schedule(struct tasklet_struct *t)
{
if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
__tasklet_hi_schedule(t);
}
void __tasklet_hi_schedule(struct tasklet_struct *t)
{
__tasklet_schedule_common(t, &tasklet_hi_vec,
HI_SOFTIRQ);
}

static void __tasklet_schedule_common(struct tasklet_struct *t,
struct tasklet_head __percpu *headp,
unsigned int softirq_nr)
{
struct tasklet_head *head;
unsigned long flags;

local_irq_save(flags);
head = this_cpu_ptr(headp);
t->next = NULL;
*head->tail = t;
head->tail = &(t->next);
raise_softirq_irqoff(softirq_nr);
local_irq_restore(flags);
}

Processing logic

As described in the softirq events section earlier, softirq_init() registers tasklet_action() and tasklet_hi_action() as the handlers of the TASKLET_SOFTIRQ and HI_SOFTIRQ events respectively, as shown below

static __latent_entropy void tasklet_action(struct softirq_action *a)
{
workqueue_softirq_action(false);
tasklet_action_common(a, this_cpu_ptr(&tasklet_vec), TASKLET_SOFTIRQ);
}
static __latent_entropy void tasklet_hi_action(struct softirq_action *a)
{
workqueue_softirq_action(true);
tasklet_action_common(a, this_cpu_ptr(&tasklet_hi_vec), HI_SOFTIRQ);
}

static void tasklet_action_common(struct softirq_action *a,
struct tasklet_head *tl_head,
unsigned int softirq_nr)
{
struct tasklet_struct *list;

local_irq_disable();
list = tl_head->head;
tl_head->head = NULL;
tl_head->tail = &tl_head->head;
local_irq_enable();

while (list) {
struct tasklet_struct *t = list;

list = list->next;

if (tasklet_trylock(t)) {
if (!atomic_read(&t->count)) {
if (tasklet_clear_sched(t)) {
if (t->use_callback) {
trace_tasklet_entry(t, t->callback);
t->callback(t);
trace_tasklet_exit(t, t->callback);
} else {
trace_tasklet_entry(t, t->func);
t->func(t->data);
trace_tasklet_exit(t, t->func);
}
}
tasklet_unlock(t);
continue;
}
tasklet_unlock(t);
}

local_irq_disable();
t->next = NULL;
*tl_head->tail = t;
tl_head->tail = &t->next;
__raise_softirq_irqoff(softirq_nr);
local_irq_enable();
}
}

workqueue_softirq_action() lets the workers of the POOL_BH worker_pools process work items for the workqueue subsystem; it is covered in detail in the workqueue section below

The logic of tasklet_action_common() is clear: it walks the per_cpu tasklet_hi_vec/tasklet_vec list and executes every pending tasklet. Since tasklet_trylock() must be taken before a tasklet runs, the same tasklet can only execute on one CPU at a time; a tasklet whose lock cannot be acquired is re-queued onto the list and the softirq is raised again
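
Putting the interfaces above together, a hypothetical driver would use a tasklet as its bottom half roughly as sketched below; my_tasklet, my_tasklet_fn and my_irq_handler are made-up names for illustration only:

#include <linux/interrupt.h>

/* Bottom half: runs later in softirq context, must not sleep. */
static void my_tasklet_fn(struct tasklet_struct *t)
{
	pr_info("tasklet bottom half running\n");
}

static DECLARE_TASKLET(my_tasklet, my_tasklet_fn);

/* Top half: acknowledge the (hypothetical) device, then defer the rest. */
static irqreturn_t my_irq_handler(int irq, void *dev_id)
{
	tasklet_schedule(&my_tasklet);
	return IRQ_HANDLED;
}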

workqueue

The handlers of softirq and tasklet both run in softirq (atomic) context and therefore must not sleep. For work that needs a schedulable context, the kernel implements another bottom-half mechanism that also supports dynamically created events, the workqueue: workqueue users produce work items, and workers backed by kernel threads consume them asynchronously in process context, so the handler of a work item is allowed to sleep

The kernel has gone through several workqueue implementations; the current one is CMWQ (Concurrency Managed Workqueue). Its overall design boils down to two points:

  1. workers are managed by worker_pools, which elastically grow and shrink the number of workers according to load
  2. the kernel creates a limited number of worker_pools, statically or dynamically, and all workqueues share these worker_pools

worker_pool

The most central concept in CMWQ is the worker_pool, which manages the worker threads and the work items they need to process; it is described by struct worker_pool

struct worker_pool {
raw_spinlock_t lock; /* the pool lock */
int cpu; /* I: the associated cpu */
int node; /* I: the associated node ID */
int id; /* I: pool ID */
unsigned int flags; /* L: flags */

unsigned long watchdog_ts; /* L: watchdog timestamp */
bool cpu_stall; /* WD: stalled cpu bound pool */

/*
* The counter is incremented in a process context on the associated CPU
* w/ preemption disabled, and decremented or reset in the same context
* but w/ pool->lock held. The readers grab pool->lock and are
* guaranteed to see if the counter reached zero.
*/
int nr_running;

struct list_head worklist; /* L: list of pending works */

int nr_workers; /* L: total number of workers */
int nr_idle; /* L: currently idle workers */

struct list_head idle_list; /* L: list of idle workers */
struct timer_list idle_timer; /* L: worker idle timeout */
struct work_struct idle_cull_work; /* L: worker idle cleanup */

struct timer_list mayday_timer; /* L: SOS timer for workers */

/* a workers is either on busy_hash or idle_list, or the manager */
DECLARE_HASHTABLE(busy_hash, BUSY_WORKER_HASH_ORDER);
/* L: hash of busy workers */

struct worker *manager; /* L: purely informational */
struct list_head workers; /* A: attached workers */

struct ida worker_ida; /* worker IDs for task name */

struct workqueue_attrs *attrs; /* I: worker attributes */
struct hlist_node hash_node; /* PL: unbound_pool_hash node */
int refcnt; /* PL: refcnt for unbound pools */

/*
* Destruction of pool is RCU protected to allow dereferences
* from get_work_pool().
*/
struct rcu_head rcu;
};

The worklist field holds all pending work items of the pool, which are consumed jointly by all worker threads of that pool; the workers field tracks all worker threads managed by the pool, which together consume the pool's work item queue. The details are covered in the following sections

The kernel has two kinds of worker_pool: bound worker_pools and unbound worker_pools

All worker threads of a bound worker_pool can only run on the designated CPU. This kind of worker_pool is statically defined by the kernel with DEFINE_PER_CPU_SHARED_ALIGNED(), as shown below

enum wq_internal_consts {
NR_STD_WORKER_POOLS = 2, /* # standard pools per cpu */
...
};

/* to raise softirq for the BH worker pools on other CPUs */
static DEFINE_PER_CPU_SHARED_ALIGNED(struct irq_work [NR_STD_WORKER_POOLS], bh_pool_irq_works);

/* the BH worker pools */
static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS], bh_worker_pools);

/* the per-cpu worker pools */
static DEFINE_PER_CPU_SHARED_ALIGNED(struct worker_pool [NR_STD_WORKER_POOLS], cpu_worker_pools);

As can be seen, the kernel statically defines two groups of per-cpu bound worker_pools, bh_worker_pools and cpu_worker_pools (plus the accompanying bh_pool_irq_works array of irq_work). Each group has two worker_pools on every CPU, and the second one (index 1) has a higher scheduling priority
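
Given the definitions above, picking the bound worker_pool for a given CPU and priority is just an index into the per-cpu array; the snippet below is an illustrative fragment of what alloc_and_link_pwqs() (shown later) effectively does:

/* highpri is 0 for the normal pool, 1 for the high-priority pool */
struct worker_pool *pool = &per_cpu(cpu_worker_pools, cpu)[highpri];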

An unbound worker_pool, on the other hand, manages workers that may run on multiple CPUs; the kernel obtains or creates one with get_unbound_pool()

/**
* get_unbound_pool - get a worker_pool with the specified attributes
* @attrs: the attributes of the worker_pool to get
*
* Obtain a worker_pool which has the same attributes as @attrs, bump the
* reference count and return it. If there already is a matching
* worker_pool, it will be used; otherwise, this function attempts to
* create a new one.
*
* Should be called with wq_pool_mutex held.
*
* Return: On success, a worker_pool with the same attributes as @attrs.
* On failure, %NULL.
*/
static struct worker_pool *get_unbound_pool(const struct workqueue_attrs *attrs)
{
struct wq_pod_type *pt = &wq_pod_types[WQ_AFFN_NUMA];
u32 hash = wqattrs_hash(attrs);
struct worker_pool *pool;
int pod, node = NUMA_NO_NODE;

lockdep_assert_held(&wq_pool_mutex);

/* do we already have a matching pool? */
hash_for_each_possible(unbound_pool_hash, pool, hash_node, hash) {
if (wqattrs_equal(pool->attrs, attrs)) {
pool->refcnt++;
return pool;
}
}

/* If __pod_cpumask is contained inside a NUMA pod, that's our node */
for (pod = 0; pod < pt->nr_pods; pod++) {
if (cpumask_subset(attrs->__pod_cpumask, pt->pod_cpus[pod])) {
node = pt->pod_node[pod];
break;
}
}

/* nope, create a new one */
pool = kzalloc_node(sizeof(*pool), GFP_KERNEL, node);
if (!pool || init_worker_pool(pool) < 0)
goto fail;

pool->node = node;
copy_workqueue_attrs(pool->attrs, attrs);
wqattrs_clear_for_pool(pool->attrs);

if (worker_pool_assign_id(pool) < 0)
goto fail;

/* create and start the initial worker */
if (wq_online && !create_worker(pool))
goto fail;

/* install */
hash_add(unbound_pool_hash, &pool->hash_node, hash);

return pool;
fail:
if (pool)
put_unbound_pool(pool);
return NULL;
}

As shown, the kernel tracks all unbound worker_pools in unbound_pool_hash; a new worker_pool is created only when a workqueue user asks for attributes that have no matching entry in unbound_pool_hash

work items

The work items managed by the worker_pools above are described by struct work_struct

struct work_struct {
atomic_long_t data;
struct list_head entry;
work_func_t func;
#ifdef CONFIG_LOCKDEP
struct lockdep_map lockdep_map;
#endif
};

Workqueue users can initialize a dynamically created work item with INIT_WORK(), or define one statically with DECLARE_WORK()

static inline void __init_work(struct work_struct *work, int onstack) { }
#define WORK_DATA_INIT() ATOMIC_LONG_INIT((unsigned long)WORK_STRUCT_NO_POOL)

#define __INIT_WORK_KEY(_work, _func, _onstack, _key) \
do { \
__init_work((_work), _onstack); \
(_work)->data = (atomic_long_t) WORK_DATA_INIT(); \
INIT_LIST_HEAD(&(_work)->entry); \
(_work)->func = (_func); \
} while (0)

#define __INIT_WORK(_work, _func, _onstack) \
do { \
static __maybe_unused struct lock_class_key __key; \
\
__INIT_WORK_KEY(_work, _func, _onstack, &__key); \
} while (0)

#define INIT_WORK(_work, _func) \
__INIT_WORK((_work), (_func), 0)



#define WORK_DATA_STATIC_INIT() \
ATOMIC_LONG_INIT((unsigned long)(WORK_STRUCT_NO_POOL | WORK_STRUCT_STATIC))

#define __WORK_INITIALIZER(n, f) { \
.data = WORK_DATA_STATIC_INIT(), \
.entry = { &(n).entry, &(n).entry }, \
.func = (f), \
__WORK_INIT_LOCKDEP_MAP(#n, &(n)) \
}

#define DECLARE_WORK(n, f) \
struct work_struct n = __WORK_INITIALIZER(n, f)

All work items are managed in queues (for example, all pending items of a worker_pool hang off the worklist field of struct worker_pool), and the entry field of struct work_struct links a work item into whichever queue it currently belongs to. The kernel uses insert_work() to insert a work item into such a queue

/**
* insert_work - insert a work into a pool
* @pwq: pwq @work belongs to
* @work: work to insert
* @head: insertion point
* @extra_flags: extra WORK_STRUCT_* flags to set
*
* Insert @work which belongs to @pwq after @head. @extra_flags is or'd to
* work_struct flags.
*
* CONTEXT:
* raw_spin_lock_irq(pool->lock).
*/
static void insert_work(struct pool_workqueue *pwq, struct work_struct *work,
struct list_head *head, unsigned int extra_flags)
{
debug_work_activate(work);

/* record the work call stack in order to print it in KASAN reports */
kasan_record_aux_stack_noalloc(work);

/* we own @work, set data and link */
set_work_pwq(work, pwq, extra_flags);
list_add_tail(&work->entry, head);
get_pwq(pwq);
}

worker threads

The worker threads introduced above are described by struct worker

/*
* The poor guys doing the actual heavy lifting. All on-duty workers are
* either serving the manager role, on idle list or on busy hash. For
* details on the locking annotation (L, I, X...), refer to workqueue.c.
*
* Only to be used in workqueue and async.
*/
struct worker {
/* on idle list while idle, on busy hash table while busy */
union {
struct list_head entry; /* L: while idle */
struct hlist_node hentry; /* L: while busy */
};

struct work_struct *current_work; /* K: work being processed and its */
work_func_t current_func; /* K: function */
struct pool_workqueue *current_pwq; /* K: pwq */
u64 current_at; /* K: runtime at start or last wakeup */
unsigned int current_color; /* K: color */

int sleeping; /* S: is worker sleeping? */

/* used by the scheduler to determine a worker's last known identity */
work_func_t last_func; /* K: last work's fn */

struct list_head scheduled; /* L: scheduled works */

struct task_struct *task; /* I: worker task */
struct worker_pool *pool; /* A: the associated pool */
/* L: for rescuers */
struct list_head node; /* A: anchored at pool->workers */
/* A: runs through worker->node */

unsigned long last_active; /* K: last active timestamp */
unsigned int flags; /* L: flags */
int id; /* I: worker id */

/*
* Opaque string set with work_set_desc(). Printed out with task
* dump for debugging - WARN, BUG, panic or sysrq.
*/
char desc[WORKER_DESC_LEN];

/* used only by rescuers to point to the target workqueue */
struct workqueue_struct *rescue_wq; /* I: the workqueue to rescue */
};

The node field links the worker into the workers list of its owning worker_pool.
The kernel calls create_worker() to create a worker and its backing kernel thread for a worker_pool

/**
* create_worker - create a new workqueue worker
* @pool: pool the new worker will belong to
*
* Create and start a new worker which is attached to @pool.
*
* CONTEXT:
* Might sleep. Does GFP_KERNEL allocations.
*
* Return:
* Pointer to the newly created worker.
*/
static struct worker *create_worker(struct worker_pool *pool)
{
struct worker *worker;
int id;
char id_buf[23];

/* ID is needed to determine kthread name */
id = ida_alloc(&pool->worker_ida, GFP_KERNEL);
if (id < 0) {
pr_err_once("workqueue: Failed to allocate a worker ID: %pe\n",
ERR_PTR(id));
return NULL;
}

worker = alloc_worker(pool->node);
if (!worker) {
pr_err_once("workqueue: Failed to allocate a worker\n");
goto fail;
}

worker->id = id;

if (!(pool->flags & POOL_BH)) {
if (pool->cpu >= 0)
snprintf(id_buf, sizeof(id_buf), "%d:%d%s", pool->cpu, id,
pool->attrs->nice < 0 ? "H" : "");
else
snprintf(id_buf, sizeof(id_buf), "u%d:%d", pool->id, id);

worker->task = kthread_create_on_node(worker_thread, worker,
pool->node, "kworker/%s", id_buf);
if (IS_ERR(worker->task)) {
if (PTR_ERR(worker->task) == -EINTR) {
pr_err("workqueue: Interrupted when creating a worker thread \"kworker/%s\"\n",
id_buf);
} else {
pr_err_once("workqueue: Failed to create a worker thread: %pe",
worker->task);
}
goto fail;
}

set_user_nice(worker->task, pool->attrs->nice);
kthread_bind_mask(worker->task, pool_allowed_cpus(pool));
}

/* successful, attach the worker to the pool */
worker_attach_to_pool(worker, pool);

/* start the newly created worker */
raw_spin_lock_irq(&pool->lock);

worker->pool->nr_workers++;
worker_enter_idle(worker);

/*
* @worker is waiting on a completion in kthread() and will trigger hung
* check if not woken up soon. As kick_pool() is noop if @pool is empty,
* wake it up explicitly.
*/
if (worker->task)
wake_up_process(worker->task);

raw_spin_unlock_irq(&pool->lock);

return worker;

fail:
ida_free(&pool->worker_ida, id);
kfree(worker);
return NULL;
}

As shown, for non-POOL_BH pools the kernel calls kthread_create_on_node() to create a kernel thread named kworker/<cpu>:<id> (with an extra "H" suffix for the high-priority pool) or kworker/u<poolid>:<id> (for unbound pools), and this thread runs worker_thread() to process work items

/**
* worker_thread - the worker thread function
* @__worker: self
*
* The worker thread function. All workers belong to a worker_pool -
* either a per-cpu one or dynamic unbound one. These workers process all
* work items regardless of their specific target workqueue. The only
* exception is work items which belong to workqueues with a rescuer which
* will be explained in rescuer_thread().
*
* Return: 0
*/
static int worker_thread(void *__worker)
{
struct worker *worker = __worker;
struct worker_pool *pool = worker->pool;

/* tell the scheduler that this is a workqueue worker */
set_pf_worker(true);
woke_up:
raw_spin_lock_irq(&pool->lock);

/* am I supposed to die? */
if (unlikely(worker->flags & WORKER_DIE)) {
raw_spin_unlock_irq(&pool->lock);
set_pf_worker(false);

set_task_comm(worker->task, "kworker/dying");
ida_free(&pool->worker_ida, worker->id);
worker_detach_from_pool(worker);
WARN_ON_ONCE(!list_empty(&worker->entry));
kfree(worker);
return 0;
}

worker_leave_idle(worker);
recheck:
/* no more worker necessary? */
if (!need_more_worker(pool))
goto sleep;

/* do we need to manage? */
if (unlikely(!may_start_working(pool)) && manage_workers(worker))
goto recheck;

/*
* ->scheduled list can only be filled while a worker is
* preparing to process a work or actually processing it.
* Make sure nobody diddled with it while I was sleeping.
*/
WARN_ON_ONCE(!list_empty(&worker->scheduled));

/*
* Finish PREP stage. We're guaranteed to have at least one idle
* worker or that someone else has already assumed the manager
* role. This is where @worker starts participating in concurrency
* management if applicable and concurrency management is restored
* after being rebound. See rebind_workers() for details.
*/
worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);

do {
struct work_struct *work =
list_first_entry(&pool->worklist,
struct work_struct, entry);

if (assign_work(work, worker, NULL))
process_scheduled_works(worker);
} while (keep_working(pool));

worker_set_flags(worker, WORKER_PREP);
sleep:
/*
* pool->lock is held and there's no work to process and no need to
* manage, sleep. Workers are woken up only while holding
* pool->lock or from local cpu, so setting the current state
* before releasing pool->lock is enough to prevent losing any
* event.
*/
worker_enter_idle(worker);
__set_current_state(TASK_IDLE);
raw_spin_unlock_irq(&pool->lock);
schedule();
goto woke_up;
}

/**
* assign_work - assign a work item and its linked work items to a worker
* @work: work to assign
* @worker: worker to assign to
* @nextp: out parameter for nested worklist walking
*
* Assign @work and its linked work items to @worker. If @work is already being
* executed by another worker in the same pool, it'll be punted there.
*
* If @nextp is not NULL, it's updated to point to the next work of the last
* scheduled work. This allows assign_work() to be nested inside
* list_for_each_entry_safe().
*
* Returns %true if @work was successfully assigned to @worker. %false if @work
* was punted to another worker already executing it.
*/
static bool assign_work(struct work_struct *work, struct worker *worker,
struct work_struct **nextp)
{
struct worker_pool *pool = worker->pool;
struct worker *collision;

lockdep_assert_held(&pool->lock);

/*
* A single work shouldn't be executed concurrently by multiple workers.
* __queue_work() ensures that @work doesn't jump to a different pool
* while still running in the previous pool. Here, we should ensure that
* @work is not executed concurrently by multiple workers from the same
* pool. Check whether anyone is already processing the work. If so,
* defer the work to the currently executing one.
*/
collision = find_worker_executing_work(pool, work);
if (unlikely(collision)) {
move_linked_works(work, &collision->scheduled, nextp);
return false;
}

move_linked_works(work, &worker->scheduled, nextp);
return true;
}

/**
* process_one_work - process single work
* @worker: self
* @work: work to process
*
* Process @work. This function contains all the logics necessary to
* process a single work including synchronization against and
* interaction with other workers on the same cpu, queueing and
* flushing. As long as context requirement is met, any worker can
* call this function to process a work.
*
* CONTEXT:
* raw_spin_lock_irq(pool->lock) which is released and regrabbed.
*/
static void process_one_work(struct worker *worker, struct work_struct *work)
__releases(&pool->lock)
__acquires(&pool->lock)
{
struct pool_workqueue *pwq = get_work_pwq(work);
struct worker_pool *pool = worker->pool;
unsigned long work_data;
int lockdep_start_depth, rcu_start_depth;
bool bh_draining = pool->flags & POOL_BH_DRAINING;
#ifdef CONFIG_LOCKDEP
/*
* It is permissible to free the struct work_struct from
* inside the function that is called from it, this we need to
* take into account for lockdep too. To avoid bogus "held
* lock freed" warnings as well as problems when looking into
* work->lockdep_map, make a copy and use that here.
*/
struct lockdep_map lockdep_map;

lockdep_copy_map(&lockdep_map, &work->lockdep_map);
#endif
/* ensure we're on the correct CPU */
WARN_ON_ONCE(!(pool->flags & POOL_DISASSOCIATED) &&
raw_smp_processor_id() != pool->cpu);

/* claim and dequeue */
debug_work_deactivate(work);
hash_add(pool->busy_hash, &worker->hentry, (unsigned long)work);
worker->current_work = work;
worker->current_func = work->func;
worker->current_pwq = pwq;
if (worker->task)
worker->current_at = worker->task->se.sum_exec_runtime;
work_data = *work_data_bits(work);
worker->current_color = get_work_color(work_data);

/*
* Record wq name for cmdline and debug reporting, may get
* overridden through set_worker_desc().
*/
strscpy(worker->desc, pwq->wq->name, WORKER_DESC_LEN);

list_del_init(&work->entry);

/*
* CPU intensive works don't participate in concurrency management.
* They're the scheduler's responsibility. This takes @worker out
* of concurrency management and the next code block will chain
* execution of the pending work items.
*/
if (unlikely(pwq->wq->flags & WQ_CPU_INTENSIVE))
worker_set_flags(worker, WORKER_CPU_INTENSIVE);

/*
* Kick @pool if necessary. It's always noop for per-cpu worker pools
* since nr_running would always be >= 1 at this point. This is used to
* chain execution of the pending work items for WORKER_NOT_RUNNING
* workers such as the UNBOUND and CPU_INTENSIVE ones.
*/
kick_pool(pool);

/*
* Record the last pool and clear PENDING which should be the last
* update to @work. Also, do this inside @pool->lock so that
* PENDING and queued state changes happen together while IRQ is
* disabled.
*/
set_work_pool_and_clear_pending(work, pool->id, 0);

pwq->stats[PWQ_STAT_STARTED]++;
raw_spin_unlock_irq(&pool->lock);

rcu_start_depth = rcu_preempt_depth();
lockdep_start_depth = lockdep_depth(current);
/* see drain_dead_softirq_workfn() */
if (!bh_draining)
lock_map_acquire(&pwq->wq->lockdep_map);
lock_map_acquire(&lockdep_map);
/*
* Strictly speaking we should mark the invariant state without holding
* any locks, that is, before these two lock_map_acquire()'s.
*
* However, that would result in:
*
* A(W1)
* WFC(C)
* A(W1)
* C(C)
*
* Which would create W1->C->W1 dependencies, even though there is no
* actual deadlock possible. There are two solutions, using a
* read-recursive acquire on the work(queue) 'locks', but this will then
* hit the lockdep limitation on recursive locks, or simply discard
* these locks.
*
* AFAICT there is no possible deadlock scenario between the
* flush_work() and complete() primitives (except for single-threaded
* workqueues), so hiding them isn't a problem.
*/
lockdep_invariant_state(true);
trace_workqueue_execute_start(work);
worker->current_func(work);
/*
* While we must be careful to not use "work" after this, the trace
* point will only record its address.
*/
trace_workqueue_execute_end(work, worker->current_func);
pwq->stats[PWQ_STAT_COMPLETED]++;
lock_map_release(&lockdep_map);
if (!bh_draining)
lock_map_release(&pwq->wq->lockdep_map);

if (unlikely((worker->task && in_atomic()) ||
lockdep_depth(current) != lockdep_start_depth ||
rcu_preempt_depth() != rcu_start_depth)) {
pr_err("BUG: workqueue leaked atomic, lock or RCU: %s[%d]\n"
" preempt=0x%08x lock=%d->%d RCU=%d->%d workfn=%ps\n",
current->comm, task_pid_nr(current), preempt_count(),
lockdep_start_depth, lockdep_depth(current),
rcu_start_depth, rcu_preempt_depth(),
worker->current_func);
debug_show_held_locks(current);
dump_stack();
}

/*
* The following prevents a kworker from hogging CPU on !PREEMPTION
* kernels, where a requeueing work item waiting for something to
* happen could deadlock with stop_machine as such work item could
* indefinitely requeue itself while all other CPUs are trapped in
* stop_machine. At the same time, report a quiescent RCU state so
* the same condition doesn't freeze RCU.
*/
if (worker->task)
cond_resched();

raw_spin_lock_irq(&pool->lock);

/*
* In addition to %WQ_CPU_INTENSIVE, @worker may also have been marked
* CPU intensive by wq_worker_tick() if @work hogged CPU longer than
* wq_cpu_intensive_thresh_us. Clear it.
*/
worker_clr_flags(worker, WORKER_CPU_INTENSIVE);

/* tag the worker for identification in schedule() */
worker->last_func = worker->current_func;

/* we're done with it, release */
hash_del(&worker->hentry);
worker->current_work = NULL;
worker->current_func = NULL;
worker->current_pwq = NULL;
worker->current_color = INT_MAX;

/* must be the last step, see the function comment */
pwq_dec_nr_in_flight(pwq, work_data);
}

/**
* process_scheduled_works - process scheduled works
* @worker: self
*
* Process all scheduled works. Please note that the scheduled list
* may change while processing a work, so this function repeatedly
* fetches a work from the top and executes it.
*
* CONTEXT:
* raw_spin_lock_irq(pool->lock) which may be released and regrabbed
* multiple times.
*/
static void process_scheduled_works(struct worker *worker)
{
struct work_struct *work;
bool first = true;

while ((work = list_first_entry_or_null(&worker->scheduled,
struct work_struct, entry))) {
if (first) {
worker->pool->watchdog_ts = jiffies;
first = false;
}
process_one_work(worker, work);
}
}

The logic is clear: worker_thread() loops fetching and processing work items until it is marked WORKER_DIE. In each iteration it uses assign_work() to take an item from the worklist of struct worker_pool and move it to the scheduled list of either the current worker or the worker that is already executing that item; process_scheduled_works() then repeatedly calls process_one_work() to drain the scheduled list

In addition, manage_workers() and worker_enter_idle(), both called from worker_thread(), implement the elastic growing and shrinking of the number of worker threads in a worker_pool

/**
* manage_workers - manage worker pool
* @worker: self
*
* Assume the manager role and manage the worker pool @worker belongs
* to. At any given time, there can be only zero or one manager per
* pool. The exclusion is handled automatically by this function.
*
* The caller can safely start processing works on false return. On
* true return, it's guaranteed that need_to_create_worker() is false
* and may_start_working() is true.
*
* CONTEXT:
* raw_spin_lock_irq(pool->lock) which may be released and regrabbed
* multiple times. Does GFP_KERNEL allocations.
*
* Return:
* %false if the pool doesn't need management and the caller can safely
* start processing works, %true if management function was performed and
* the conditions that the caller verified before calling the function may
* no longer be true.
*/
static bool manage_workers(struct worker *worker)
{
struct worker_pool *pool = worker->pool;

if (pool->flags & POOL_MANAGER_ACTIVE)
return false;

pool->flags |= POOL_MANAGER_ACTIVE;
pool->manager = worker;

maybe_create_worker(pool);

pool->manager = NULL;
pool->flags &= ~POOL_MANAGER_ACTIVE;
rcuwait_wake_up(&manager_wait);
return true;
}

/**
* maybe_create_worker - create a new worker if necessary
* @pool: pool to create a new worker for
*
* Create a new worker for @pool if necessary. @pool is guaranteed to
* have at least one idle worker on return from this function. If
* creating a new worker takes longer than MAYDAY_INTERVAL, mayday is
* sent to all rescuers with works scheduled on @pool to resolve
* possible allocation deadlock.
*
* On return, need_to_create_worker() is guaranteed to be %false and
* may_start_working() %true.
*
* LOCKING:
* raw_spin_lock_irq(pool->lock) which may be released and regrabbed
* multiple times. Does GFP_KERNEL allocations. Called only from
* manager.
*/
static void maybe_create_worker(struct worker_pool *pool)
__releases(&pool->lock)
__acquires(&pool->lock)
{
restart:
raw_spin_unlock_irq(&pool->lock);

/* if we don't make progress in MAYDAY_INITIAL_TIMEOUT, call for help */
mod_timer(&pool->mayday_timer, jiffies + MAYDAY_INITIAL_TIMEOUT);

while (true) {
if (create_worker(pool) || !need_to_create_worker(pool))
break;

schedule_timeout_interruptible(CREATE_COOLDOWN);

if (!need_to_create_worker(pool))
break;
}

del_timer_sync(&pool->mayday_timer);
raw_spin_lock_irq(&pool->lock);
/*
* This is necessary even after a new worker was just successfully
* created as @pool->lock was dropped and the new worker might have
* already become busy.
*/
if (need_to_create_worker(pool))
goto restart;
}

As shown, before starting to process work items a worker calls manage_workers() when the pool has no idle worker left (may_start_working() is false), creating workers until at least one idle worker exists; this is the growing side

static int init_worker_pool(struct worker_pool *pool)
{
...
timer_setup(&pool->idle_timer, idle_worker_timeout, TIMER_DEFERRABLE);
INIT_WORK(&pool->idle_cull_work, idle_cull_fn);
...
}

/**
* worker_enter_idle - enter idle state
* @worker: worker which is entering idle state
*
* @worker is entering idle state. Update stats and idle timer if
* necessary.
*
* LOCKING:
* raw_spin_lock_irq(pool->lock).
*/
static void worker_enter_idle(struct worker *worker)
{
struct worker_pool *pool = worker->pool;

if (WARN_ON_ONCE(worker->flags & WORKER_IDLE) ||
WARN_ON_ONCE(!list_empty(&worker->entry) &&
(worker->hentry.next || worker->hentry.pprev)))
return;

/* can't use worker_set_flags(), also called from create_worker() */
worker->flags |= WORKER_IDLE;
pool->nr_idle++;
worker->last_active = jiffies;

/* idle_list is LIFO */
list_add(&worker->entry, &pool->idle_list);

if (too_many_workers(pool) && !timer_pending(&pool->idle_timer))
mod_timer(&pool->idle_timer, jiffies + IDLE_WORKER_TIMEOUT);

/* Sanity check nr_running. */
WARN_ON_ONCE(pool->nr_workers == pool->nr_idle && pool->nr_running);
}

/**
* idle_worker_timeout - check if some idle workers can now be deleted.
* @t: The pool's idle_timer that just expired
*
* The timer is armed in worker_enter_idle(). Note that it isn't disarmed in
* worker_leave_idle(), as a worker flicking between idle and active while its
* pool is at the too_many_workers() tipping point would cause too much timer
* housekeeping overhead. Since IDLE_WORKER_TIMEOUT is long enough, we just let
* it expire and re-evaluate things from there.
*/
static void idle_worker_timeout(struct timer_list *t)
{
struct worker_pool *pool = from_timer(pool, t, idle_timer);
bool do_cull = false;

if (work_pending(&pool->idle_cull_work))
return;

raw_spin_lock_irq(&pool->lock);

if (too_many_workers(pool)) {
struct worker *worker;
unsigned long expires;

/* idle_list is kept in LIFO order, check the last one */
worker = list_last_entry(&pool->idle_list, struct worker, entry);
expires = worker->last_active + IDLE_WORKER_TIMEOUT;
do_cull = !time_before(jiffies, expires);

if (!do_cull)
mod_timer(&pool->idle_timer, expires);
}
raw_spin_unlock_irq(&pool->lock);

if (do_cull)
queue_work(system_unbound_wq, &pool->idle_cull_work);
}

/**
* idle_cull_fn - cull workers that have been idle for too long.
* @work: the pool's work for handling these idle workers
*
* This goes through a pool's idle workers and gets rid of those that have been
* idle for at least IDLE_WORKER_TIMEOUT seconds.
*
* We don't want to disturb isolated CPUs because of a pcpu kworker being
* culled, so this also resets worker affinity. This requires a sleepable
* context, hence the split between timer callback and work item.
*/
static void idle_cull_fn(struct work_struct *work)
{
struct worker_pool *pool = container_of(work, struct worker_pool, idle_cull_work);
LIST_HEAD(cull_list);

/*
* Grabbing wq_pool_attach_mutex here ensures an already-running worker
* cannot proceed beyong set_pf_worker() in its self-destruct path.
* This is required as a previously-preempted worker could run after
* set_worker_dying() has happened but before detach_dying_workers() did.
*/
mutex_lock(&wq_pool_attach_mutex);
raw_spin_lock_irq(&pool->lock);

while (too_many_workers(pool)) {
struct worker *worker;
unsigned long expires;

worker = list_last_entry(&pool->idle_list, struct worker, entry);
expires = worker->last_active + IDLE_WORKER_TIMEOUT;

if (time_before(jiffies, expires)) {
mod_timer(&pool->idle_timer, expires);
break;
}

set_worker_dying(worker, &cull_list);
}

raw_spin_unlock_irq(&pool->lock);
detach_dying_workers(&cull_list);
mutex_unlock(&wq_pool_attach_mutex);

reap_dying_workers(&cull_list);
}

As shown, when each struct worker_pool is initialized, a timer (idle_timer) and an idle_cull_work work item are set up. When a worker has no pending work and is about to go idle, worker_enter_idle() arms idle_timer; once the timer expires, its handler idle_worker_timeout() queues idle_cull_work, whose handler idle_cull_fn() finally removes the workers that have been idle for too long; this is the shrinking side

Interfaces

The front-end interface that workqueue exposes to its users is struct workqueue_struct, shown below

/*
* The externally visible workqueue. It relays the issued work items to
* the appropriate worker_pool through its pool_workqueues.
*/
struct workqueue_struct {
struct list_head pwqs; /* WR: all pwqs of this wq */
struct list_head list; /* PR: list of all workqueues */

struct mutex mutex; /* protects this wq */
int work_color; /* WQ: current work color */
int flush_color; /* WQ: current flush color */
atomic_t nr_pwqs_to_flush; /* flush in progress */
struct wq_flusher *first_flusher; /* WQ: first flusher */
struct list_head flusher_queue; /* WQ: flush waiters */
struct list_head flusher_overflow; /* WQ: flush overflow list */

struct list_head maydays; /* MD: pwqs requesting rescue */
struct worker *rescuer; /* MD: rescue worker */

int nr_drainers; /* WQ: drain in progress */

/* See alloc_workqueue() function comment for info on min/max_active */
int max_active; /* WO: max active works */
int min_active; /* WO: min active works */
int saved_max_active; /* WQ: saved max_active */
int saved_min_active; /* WQ: saved min_active */

struct workqueue_attrs *unbound_attrs; /* PW: only for unbound wqs */
struct pool_workqueue __rcu *dfl_pwq; /* PW: only for unbound wqs */

#ifdef CONFIG_SYSFS
struct wq_device *wq_dev; /* I: for sysfs interface */
#endif
#ifdef CONFIG_LOCKDEP
char *lock_name;
struct lock_class_key key;
struct lockdep_map lockdep_map;
#endif
char name[WQ_NAME_LEN]; /* I: workqueue name */

/*
* Destruction of workqueue_struct is RCU protected to allow walking
* the workqueues list without grabbing wq_pool_mutex.
* This is used to dump all workqueues from sysrq.
*/
struct rcu_head rcu;

/* hot fields used during command issue, aligned to cacheline */
unsigned int flags ____cacheline_aligned; /* WQ: WQ_* flags */
struct pool_workqueue __percpu __rcu **cpu_pwq; /* I: per-cpu pwqs */
struct wq_node_nr_active *node_nr_active[]; /* I: per-node nr_active */
};

As mentioned above, workqueues share the worker_pools; the kernel uses struct pool_workqueue to maintain the mapping between a workqueue_struct and a worker_pool

/*
* The per-pool workqueue. While queued, bits below WORK_PWQ_SHIFT
* of work_struct->data are used for flags and the remaining high bits
* point to the pwq; thus, pwqs need to be aligned at two's power of the
* number of flag bits.
*/
struct pool_workqueue {
struct worker_pool *pool; /* I: the associated pool */
struct workqueue_struct *wq; /* I: the owning workqueue */
int work_color; /* L: current color */
int flush_color; /* L: flushing color */
int refcnt; /* L: reference count */
int nr_in_flight[WORK_NR_COLORS];
/* L: nr of in_flight works */
bool plugged; /* L: execution suspended */

/*
* nr_active management and WORK_STRUCT_INACTIVE:
*
* When pwq->nr_active >= max_active, new work item is queued to
* pwq->inactive_works instead of pool->worklist and marked with
* WORK_STRUCT_INACTIVE.
*
* All work items marked with WORK_STRUCT_INACTIVE do not participate in
* nr_active and all work items in pwq->inactive_works are marked with
* WORK_STRUCT_INACTIVE. But not all WORK_STRUCT_INACTIVE work items are
* in pwq->inactive_works. Some of them are ready to run in
* pool->worklist or worker->scheduled. Those work itmes are only struct
* wq_barrier which is used for flush_work() and should not participate
* in nr_active. For non-barrier work item, it is marked with
* WORK_STRUCT_INACTIVE iff it is in pwq->inactive_works.
*/
int nr_active; /* L: nr of active works */
struct list_head inactive_works; /* L: inactive works */
struct list_head pending_node; /* LN: node on wq_node_nr_active->pending_pwqs */
struct list_head pwqs_node; /* WR: node on wq->pwqs */
struct list_head mayday_node; /* MD: node on wq->maydays */

u64 stats[PWQ_NR_STATS];

/*
* Release of unbound pwq is punted to a kthread_worker. See put_pwq()
* and pwq_release_workfn() for details. pool_workqueue itself is also
* RCU protected so that the first pwq can be determined without
* grabbing wq->mutex.
*/
struct kthread_work release_work;
struct rcu_head rcu;
} __aligned(1 << WORK_STRUCT_PWQ_SHIFT);

As shown, the wq and pool fields of pool_workqueue point to the owning struct workqueue_struct and the associated struct worker_pool respectively, establishing the one-to-one link between them

Creation

The kernel creates a workqueue with alloc_workqueue(), as shown below

__printf(1, 4)
struct workqueue_struct *alloc_workqueue(const char *fmt,
unsigned int flags,
int max_active, ...)
{
va_list args;
struct workqueue_struct *wq;
size_t wq_size;
int name_len;

if (flags & WQ_BH) {
if (WARN_ON_ONCE(flags & ~__WQ_BH_ALLOWS))
return NULL;
if (WARN_ON_ONCE(max_active))
return NULL;
}

/* see the comment above the definition of WQ_POWER_EFFICIENT */
if ((flags & WQ_POWER_EFFICIENT) && wq_power_efficient)
flags |= WQ_UNBOUND;

/* allocate wq and format name */
if (flags & WQ_UNBOUND)
wq_size = struct_size(wq, node_nr_active, nr_node_ids + 1);
else
wq_size = sizeof(*wq);

wq = kzalloc(wq_size, GFP_KERNEL);
if (!wq)
return NULL;

if (flags & WQ_UNBOUND) {
wq->unbound_attrs = alloc_workqueue_attrs();
if (!wq->unbound_attrs)
goto err_free_wq;
}

va_start(args, max_active);
name_len = vsnprintf(wq->name, sizeof(wq->name), fmt, args);
va_end(args);

if (name_len >= WQ_NAME_LEN)
pr_warn_once("workqueue: name exceeds WQ_NAME_LEN. Truncating to: %s\n",
wq->name);

if (flags & WQ_BH) {
/*
* BH workqueues always share a single execution context per CPU
* and don't impose any max_active limit.
*/
max_active = INT_MAX;
} else {
max_active = max_active ?: WQ_DFL_ACTIVE;
max_active = wq_clamp_max_active(max_active, flags, wq->name);
}

/* init wq */
wq->flags = flags;
wq->max_active = max_active;
wq->min_active = min(max_active, WQ_DFL_MIN_ACTIVE);
wq->saved_max_active = wq->max_active;
wq->saved_min_active = wq->min_active;
mutex_init(&wq->mutex);
atomic_set(&wq->nr_pwqs_to_flush, 0);
INIT_LIST_HEAD(&wq->pwqs);
INIT_LIST_HEAD(&wq->flusher_queue);
INIT_LIST_HEAD(&wq->flusher_overflow);
INIT_LIST_HEAD(&wq->maydays);

wq_init_lockdep(wq);
INIT_LIST_HEAD(&wq->list);

if (flags & WQ_UNBOUND) {
if (alloc_node_nr_active(wq->node_nr_active) < 0)
goto err_unreg_lockdep;
}

if (alloc_and_link_pwqs(wq) < 0)
goto err_free_node_nr_active;

if (wq_online && init_rescuer(wq) < 0)
goto err_destroy;

if ((wq->flags & WQ_SYSFS) && workqueue_sysfs_register(wq))
goto err_destroy;

/*
* wq_pool_mutex protects global freeze state and workqueues list.
* Grab it, adjust max_active and add the new @wq to workqueues
* list.
*/
mutex_lock(&wq_pool_mutex);

mutex_lock(&wq->mutex);
wq_adjust_max_active(wq);
mutex_unlock(&wq->mutex);

list_add_tail_rcu(&wq->list, &workqueues);

mutex_unlock(&wq_pool_mutex);

return wq;

err_free_node_nr_active:
if (wq->flags & WQ_UNBOUND)
free_node_nr_active(wq->node_nr_active);
err_unreg_lockdep:
wq_unregister_lockdep(wq);
wq_free_lockdep(wq);
err_free_wq:
free_workqueue_attrs(wq->unbound_attrs);
kfree(wq);
return NULL;
err_destroy:
destroy_workqueue(wq);
return NULL;
}

The function's logic is straightforward: after allocating and initializing the relevant data structures, it calls alloc_and_link_pwqs() to select and map the corresponding worker_pools according to the flags argument, as shown below

static int alloc_and_link_pwqs(struct workqueue_struct *wq)
{
bool highpri = wq->flags & WQ_HIGHPRI;
int cpu, ret;

wq->cpu_pwq = alloc_percpu(struct pool_workqueue *);
if (!wq->cpu_pwq)
goto enomem;

if (!(wq->flags & WQ_UNBOUND)) {
for_each_possible_cpu(cpu) {
struct pool_workqueue **pwq_p;
struct worker_pool __percpu *pools;
struct worker_pool *pool;

if (wq->flags & WQ_BH)
pools = bh_worker_pools;
else
pools = cpu_worker_pools;

pool = &(per_cpu_ptr(pools, cpu)[highpri]);
pwq_p = per_cpu_ptr(wq->cpu_pwq, cpu);

*pwq_p = kmem_cache_alloc_node(pwq_cache, GFP_KERNEL,
pool->node);
if (!*pwq_p)
goto enomem;

init_pwq(*pwq_p, wq, pool);

mutex_lock(&wq->mutex);
link_pwq(*pwq_p);
mutex_unlock(&wq->mutex);
}
return 0;
}

cpus_read_lock();
if (wq->flags & __WQ_ORDERED) {
struct pool_workqueue *dfl_pwq;

ret = apply_workqueue_attrs(wq, ordered_wq_attrs[highpri]);
/* there should only be single pwq for ordering guarantee */
dfl_pwq = rcu_access_pointer(wq->dfl_pwq);
WARN(!ret && (wq->pwqs.next != &dfl_pwq->pwqs_node ||
wq->pwqs.prev != &dfl_pwq->pwqs_node),
"ordering guarantee broken for workqueue %s\n", wq->name);
} else {
ret = apply_workqueue_attrs(wq, unbound_std_wq_attrs[highpri]);
}
cpus_read_unlock();

/* for unbound pwq, flush the pwq_release_worker ensures that the
* pwq_release_workfn() completes before calling kfree(wq).
*/
if (ret)
kthread_flush_worker(pwq_release_worker);

return ret;

enomem:
if (wq->cpu_pwq) {
for_each_possible_cpu(cpu) {
struct pool_workqueue *pwq = *per_cpu_ptr(wq->cpu_pwq, cpu);

if (pwq)
kmem_cache_free(pwq_cache, pwq);
}
free_percpu(wq->cpu_pwq);
wq->cpu_pwq = NULL;
}
return -ENOMEM;
}

As shown, for the bound case each per-cpu slot of the cpu_pwq field is initialized with its own struct pool_workqueue instance, which maps directly to the statically defined per-cpu worker_pools (bh_worker_pools/cpu_worker_pools), thereby establishing the mapping from the workqueue to the designated per-cpu worker_pool.

For the unbound case, apply_workqueue_attrs() performs the worker_pool selection and mapping

int apply_workqueue_attrs(struct workqueue_struct *wq,
const struct workqueue_attrs *attrs)
{
...
mutex_lock(&wq_pool_mutex);
ret = apply_workqueue_attrs_locked(wq, attrs);
mutex_unlock(&wq_pool_mutex);

return ret;
}

static int apply_workqueue_attrs_locked(struct workqueue_struct *wq,
const struct workqueue_attrs *attrs)
{
struct apply_wqattrs_ctx *ctx;

/* only unbound workqueues can change attributes */
if (WARN_ON(!(wq->flags & WQ_UNBOUND)))
return -EINVAL;

ctx = apply_wqattrs_prepare(wq, attrs, wq_unbound_cpumask);
if (IS_ERR(ctx))
return PTR_ERR(ctx);

/* the ctx has been prepared successfully, let's commit it */
apply_wqattrs_commit(ctx);
apply_wqattrs_cleanup(ctx);

return 0;
}

/* allocate the attrs and pwqs for later installation */
static struct apply_wqattrs_ctx *
apply_wqattrs_prepare(struct workqueue_struct *wq,
const struct workqueue_attrs *attrs,
const cpumask_var_t unbound_cpumask)
{
struct apply_wqattrs_ctx *ctx;
struct workqueue_attrs *new_attrs;
int cpu;

lockdep_assert_held(&wq_pool_mutex);

if (WARN_ON(attrs->affn_scope < 0 ||
attrs->affn_scope >= WQ_AFFN_NR_TYPES))
return ERR_PTR(-EINVAL);

ctx = kzalloc(struct_size(ctx, pwq_tbl, nr_cpu_ids), GFP_KERNEL);

new_attrs = alloc_workqueue_attrs();
if (!ctx || !new_attrs)
goto out_free;

/*
* If something goes wrong during CPU up/down, we'll fall back to
* the default pwq covering whole @attrs->cpumask. Always create
* it even if we don't use it immediately.
*/
copy_workqueue_attrs(new_attrs, attrs);
wqattrs_actualize_cpumask(new_attrs, unbound_cpumask);
cpumask_copy(new_attrs->__pod_cpumask, new_attrs->cpumask);
ctx->dfl_pwq = alloc_unbound_pwq(wq, new_attrs);
if (!ctx->dfl_pwq)
goto out_free;

for_each_possible_cpu(cpu) {
if (new_attrs->ordered) {
ctx->dfl_pwq->refcnt++;
ctx->pwq_tbl[cpu] = ctx->dfl_pwq;
} else {
wq_calc_pod_cpumask(new_attrs, cpu, -1);
ctx->pwq_tbl[cpu] = alloc_unbound_pwq(wq, new_attrs);
if (!ctx->pwq_tbl[cpu])
goto out_free;
}
}

/* save the user configured attrs and sanitize it. */
copy_workqueue_attrs(new_attrs, attrs);
cpumask_and(new_attrs->cpumask, new_attrs->cpumask, cpu_possible_mask);
cpumask_copy(new_attrs->__pod_cpumask, new_attrs->cpumask);
ctx->attrs = new_attrs;

/*
* For initialized ordered workqueues, there should only be one pwq
* (dfl_pwq). Set the plugged flag of ctx->dfl_pwq to suspend execution
* of newly queued work items until execution of older work items in
* the old pwq's have completed.
*/
if ((wq->flags & __WQ_ORDERED) && !list_empty(&wq->pwqs))
ctx->dfl_pwq->plugged = true;

ctx->wq = wq;
return ctx;

out_free:
free_workqueue_attrs(new_attrs);
apply_wqattrs_cleanup(ctx);
return ERR_PTR(-ENOMEM);
}

static void apply_wqattrs_commit(struct apply_wqattrs_ctx *ctx)
{
int cpu;

/* all pwqs have been created successfully, let's install'em */
mutex_lock(&ctx->wq->mutex);

copy_workqueue_attrs(ctx->wq->unbound_attrs, ctx->attrs);

/* save the previous pwqs and install the new ones */
for_each_possible_cpu(cpu)
ctx->pwq_tbl[cpu] = install_unbound_pwq(ctx->wq, cpu,
ctx->pwq_tbl[cpu]);
ctx->dfl_pwq = install_unbound_pwq(ctx->wq, -1, ctx->dfl_pwq);

/* update node_nr_active->max */
wq_update_node_max_active(ctx->wq, -1);

/* rescuer needs to respect wq cpumask changes */
if (ctx->wq->rescuer)
set_cpus_allowed_ptr(ctx->wq->rescuer->task,
unbound_effective_cpumask(ctx->wq));

mutex_unlock(&ctx->wq->mutex);
}

For unbound workqueues the kernel initializes a struct pool_workqueue instance for the dfl_pwq field and for every per-cpu slot of cpu_pwq. An ordered workqueue shares dfl_pwq across all CPUs; otherwise each CPU gets a pwq whose worker_pool is obtained or created via get_unbound_pool() according to its pod cpumask, and pwqs with identical attributes end up sharing the same worker_pool instance, so work execution is not tied to a particular CPU

Work scheduling

Workqueue users submit work items to a workqueue with queue_work()/queue_work_on(); the logic is shown below

static inline bool queue_work(struct workqueue_struct *wq,
struct work_struct *work)
{
return queue_work_on(WORK_CPU_UNBOUND, wq, work);
}

bool queue_work_on(int cpu, struct workqueue_struct *wq,
struct work_struct *work)
{
bool ret = false;
unsigned long irq_flags;

local_irq_save(irq_flags);

if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT, work_data_bits(work))) {
__queue_work(cpu, wq, work);
ret = true;
}

local_irq_restore(irq_flags);
return ret;
}

static void __queue_work(int cpu, struct workqueue_struct *wq,
struct work_struct *work)
{
struct pool_workqueue *pwq;
struct worker_pool *last_pool, *pool;
unsigned int work_flags;
unsigned int req_cpu = cpu;

/*
* While a work item is PENDING && off queue, a task trying to
* steal the PENDING will busy-loop waiting for it to either get
* queued or lose PENDING. Grabbing PENDING and queueing should
* happen with IRQ disabled.
*/
lockdep_assert_irqs_disabled();

/*
* For a draining wq, only works from the same workqueue are
* allowed. The __WQ_DESTROYING helps to spot the issue that
* queues a new work item to a wq after destroy_workqueue(wq).
*/
if (unlikely(wq->flags & (__WQ_DESTROYING | __WQ_DRAINING) &&
WARN_ON_ONCE(!is_chained_work(wq))))
return;
rcu_read_lock();
retry:
/* pwq which will be used unless @work is executing elsewhere */
if (req_cpu == WORK_CPU_UNBOUND) {
if (wq->flags & WQ_UNBOUND)
cpu = wq_select_unbound_cpu(raw_smp_processor_id());
else
cpu = raw_smp_processor_id();
}

pwq = rcu_dereference(*per_cpu_ptr(wq->cpu_pwq, cpu));
pool = pwq->pool;

/*
* If @work was previously on a different pool, it might still be
* running there, in which case the work needs to be queued on that
* pool to guarantee non-reentrancy.
*/
last_pool = get_work_pool(work);
if (last_pool && last_pool != pool) {
struct worker *worker;

raw_spin_lock(&last_pool->lock);

worker = find_worker_executing_work(last_pool, work);

if (worker && worker->current_pwq->wq == wq) {
pwq = worker->current_pwq;
pool = pwq->pool;
WARN_ON_ONCE(pool != last_pool);
} else {
/* meh... not running there, queue here */
raw_spin_unlock(&last_pool->lock);
raw_spin_lock(&pool->lock);
}
} else {
raw_spin_lock(&pool->lock);
}

/*
* pwq is determined and locked. For unbound pools, we could have raced
* with pwq release and it could already be dead. If its refcnt is zero,
* repeat pwq selection. Note that unbound pwqs never die without
* another pwq replacing it in cpu_pwq or while work items are executing
* on it, so the retrying is guaranteed to make forward-progress.
*/
if (unlikely(!pwq->refcnt)) {
if (wq->flags & WQ_UNBOUND) {
raw_spin_unlock(&pool->lock);
cpu_relax();
goto retry;
}
/* oops */
WARN_ONCE(true, "workqueue: per-cpu pwq for %s on cpu%d has 0 refcnt",
wq->name, cpu);
}

/* pwq determined, queue */
trace_workqueue_queue_work(req_cpu, pwq, work);

if (WARN_ON(!list_empty(&work->entry)))
goto out;

pwq->nr_in_flight[pwq->work_color]++;
work_flags = work_color_to_flags(pwq->work_color);

/*
* Limit the number of concurrently active work items to max_active.
* @work must also queue behind existing inactive work items to maintain
* ordering when max_active changes. See wq_adjust_max_active().
*/
if (list_empty(&pwq->inactive_works) && pwq_tryinc_nr_active(pwq, false)) {
if (list_empty(&pool->worklist))
pool->watchdog_ts = jiffies;

trace_workqueue_activate_work(work);
insert_work(pwq, work, &pool->worklist, work_flags);
kick_pool(pool);
} else {
work_flags |= WORK_STRUCT_INACTIVE;
insert_work(pwq, work, &pwq->inactive_works, work_flags);
}

out:
raw_spin_unlock(&pool->lock);
rcu_read_unlock();
}

The overall logic is clear: for a bound workqueue the pwq corresponding to the current CPU in cpu_pwq is used, while for an unbound workqueue the pwq of the CPU chosen by wq_select_unbound_cpu() is used. As long as the max_active limit has not been reached, insert_work() puts the work item on the worklist of the corresponding worker_pool; otherwise the item is parked on the pwq's inactive_works list first
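
To connect this with the user side, a hypothetical module would typically create a workqueue and submit work items as sketched below; my_wq, my_work and my_work_fn are made-up names, and the work handler runs in a kworker's process context, so it may sleep:

#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *my_wq;

/* Executed later by a worker thread via process_one_work(). */
static void my_work_fn(struct work_struct *work)
{
	pr_info("work item running\n");
}

static DECLARE_WORK(my_work, my_work_fn);

static int __init my_module_init(void)
{
	my_wq = alloc_workqueue("my_wq", WQ_UNBOUND, 0);
	if (!my_wq)
		return -ENOMEM;

	queue_work(my_wq, &my_work);	/* ends up in __queue_work() above */
	return 0;
}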

After the work item has been queued, kick_pool() is called to wake up a worker thread of the worker_pool, as shown below

/**
* kick_pool - wake up an idle worker if necessary
* @pool: pool to kick
*
* @pool may have pending work items. Wake up worker if necessary. Returns
* whether a worker was woken up.
*/
static bool kick_pool(struct worker_pool *pool)
{
struct worker *worker = first_idle_worker(pool);
struct task_struct *p;

lockdep_assert_held(&pool->lock);

if (!need_more_worker(pool) || !worker)
return false;

if (pool->flags & POOL_BH) {
kick_bh_pool(pool);
return true;
}

p = worker->task;

#ifdef CONFIG_SMP
/*
* Idle @worker is about to execute @work and waking up provides an
* opportunity to migrate @worker at a lower cost by setting the task's
* wake_cpu field. Let's see if we want to move @worker to improve
* execution locality.
*
* We're waking the worker that went idle the latest and there's some
* chance that @worker is marked idle but hasn't gone off CPU yet. If
* so, setting the wake_cpu won't do anything. As this is a best-effort
* optimization and the race window is narrow, let's leave as-is for
* now. If this becomes pronounced, we can skip over workers which are
* still on cpu when picking an idle worker.
*
* If @pool has non-strict affinity, @worker might have ended up outside
* its affinity scope. Repatriate.
*/
if (!pool->attrs->affn_strict &&
!cpumask_test_cpu(p->wake_cpu, pool->attrs->__pod_cpumask)) {
struct work_struct *work = list_first_entry(&pool->worklist,
struct work_struct, entry);
p->wake_cpu = cpumask_any_distribute(pool->attrs->__pod_cpumask);
get_work_pwq(work)->stats[PWQ_STAT_REPATRIATED]++;
}
#endif
wake_up_process(p);
return true;
}

As can be seen, for non-POOL_BH pools this simply wakes up an idle worker thread of the worker_pool (possibly adjusting its wake_cpu first to improve execution locality).

For POOL_BH pools, as mentioned in the worker threads section above, no worker kernel thread is created; instead the work is handled by calling bh_worker() directly from workqueue_softirq_action() in the tasklet bottom half, as shown below

void workqueue_softirq_action(bool highpri)
{
struct worker_pool *pool =
&per_cpu(bh_worker_pools, smp_processor_id())[highpri];
if (need_more_worker(pool))
bh_worker(list_first_entry(&pool->workers, struct worker, node));
}

static void bh_worker(struct worker *worker)
{
struct worker_pool *pool = worker->pool;
int nr_restarts = BH_WORKER_RESTARTS;
unsigned long end = jiffies + BH_WORKER_JIFFIES;

raw_spin_lock_irq(&pool->lock);
worker_leave_idle(worker);

/*
* This function follows the structure of worker_thread(). See there for
* explanations on each step.
*/
if (!need_more_worker(pool))
goto done;

WARN_ON_ONCE(!list_empty(&worker->scheduled));
worker_clr_flags(worker, WORKER_PREP | WORKER_REBOUND);

do {
struct work_struct *work =
list_first_entry(&pool->worklist,
struct work_struct, entry);

if (assign_work(work, worker, NULL))
process_scheduled_works(worker);
} while (keep_working(pool) &&
--nr_restarts && time_before(jiffies, end));

worker_set_flags(worker, WORKER_PREP);
done:
worker_enter_idle(worker);
kick_pool(pool);
raw_spin_unlock_irq(&pool->lock);
}

As shown, this largely mirrors the logic of worker_thread(), except that there is no sleep/wake-up part: it returns once the pending work items have been processed (or the time/restart budget is exhausted), so it does not stall the tasklet bottom half. The workqueue therefore only needs kick_bh_pool() to raise HI_SOFTIRQ/TASKLET_SOFTIRQ on the target CPU; the work items are then processed in the tasklet bottom half as described

static void kick_bh_pool(struct worker_pool *pool)
{
#ifdef CONFIG_SMP
/* see drain_dead_softirq_workfn() for BH_DRAINING */
if (unlikely(pool->cpu != smp_processor_id() &&
!(pool->flags & POOL_BH_DRAINING))) {
irq_work_queue_on(bh_pool_irq_work(pool), pool->cpu);
return;
}
#endif
if (pool->attrs->nice == HIGHPRI_NICE_LEVEL)
raise_softirq_irqoff(HI_SOFTIRQ);
else
raise_softirq_irqoff(TASKLET_SOFTIRQ);
}

void __init workqueue_init_early(void)
{
...
void (*irq_work_fns[2])(struct irq_work *) = { bh_pool_kick_normal,
bh_pool_kick_highpri };
...
/* initialize BH and CPU pools */
for_each_possible_cpu(cpu) {
...
for_each_bh_worker_pool(pool, cpu) {
...
init_irq_work(bh_pool_irq_work(pool), irq_work_fns[i]);
i++;
}
...
}
...
}

static void bh_pool_kick_normal(struct irq_work *irq_work)
{
raise_softirq_irqoff(TASKLET_SOFTIRQ);
}

static void bh_pool_kick_highpri(struct irq_work *irq_work)
{
raise_softirq_irqoff(HI_SOFTIRQ);
}
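
To close this subsection, a hypothetical user of a BH workqueue would look roughly like the sketch below (my_bh_wq and friends are made-up names). As enforced in alloc_workqueue() above, WQ_BH requires max_active to be 0, and the work handler is executed by bh_worker() in softirq context, so it must not sleep:

#include <linux/workqueue.h>

static struct workqueue_struct *my_bh_wq;

/* Runs in softirq context via bh_worker(): no sleeping here. */
static void my_bh_work_fn(struct work_struct *work)
{
	pr_info("BH work item running\n");
}

static DECLARE_WORK(my_bh_work, my_bh_work_fn);

static int my_bh_setup(void)
{
	my_bh_wq = alloc_workqueue("my_bh_wq", WQ_BH, 0);
	if (!my_bh_wq)
		return -ENOMEM;

	queue_work(my_bh_wq, &my_bh_work);
	return 0;
}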

References

  1. 一文了解 OS-中断
  2. 计算机中断体系一:历史和原理
  3. 计算机中断体系二:中断处理
  4. 计算机中断体系三:中断路由
  5. 8259A PIC手册
  6. Intel® 64 and IA-32 Architectures Software Developer’s Manua-Volume 3A: System Programming Guide, Part 1-CHAPTER 11 ADVANCED PROGRAMMABLE INTERRUPT CONTROLLER (APIC)
  7. 再谈中断(APIC)
  8. 82093AA I/O ADVANCED PROGRAMMABLE INTERRUPT CONTROLLER (IOAPIC
  9. PCI/PCIe 总线概述(6)—MSI和MSI-X中断机制
  10. Linux kernel的中断子系统之(二):IRQ Domain介绍
  11. 【原创】Linux中断子系统(二)-通用框架处理
  12. 一文完全读懂 | Linux中断处理
  13. IRQ域层级结构
  14. Linux内核 | 中断机制
  15. 深度剖析Linux 网络中断下半部处理(看完秒懂)
  16. Linux 中断(IRQ/softirq)基础:原理及内核实现(2022)
  17. tasklet(linux kernel 中断下半部的实现机制)
  18. 扒开 Linux 中断的底裤之 workqueue
  19. Linux 的 workqueue 机制浅析