
[Linux Kernel Development] Chapter 3: Process Management 01

news 2025/10/16 12:18:59

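Contents

1. Introduction
2. Processes & Threads: Concepts
3. Process Control Block / Process Descriptor (PCB)
4. Process Kernel Stack
- 4.1. Definition of the process kernel stack
- 4.2. thread_info: the architecture-specific process description
- 4.3. Locating the process descriptor (task_struct), the kernel stack, and the kernel stack pointer
5. Process ID (PID)
6. Process States
- 6.1. The process states
- 6.2. Setting the process state
  - Setting the current state
  - Setting a special state
- 6.3. Checking the process state
- 6.4. Core functions for state transitions
- 6.5. Classic process state transitions
#Appendix
#01. Computing the size of task_struct
- #01.1. Writing the task_struct_size module
- #01.2. Writing the Makefile
- #01.3. Building and inserting the module
- #01.4. Reading the kernel log
#02. Verifying the kernel stack size
#03. Verifying the kernel stack structure
#04. Verifying the stack pointer position
#Next article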

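1. Introduction

The process is one of the most fundamental abstractions in the Linux operating system. This article first walks through how the Linux kernel thinks about processes and threads, and then studies how the kernel manages each process. It covers only the life cycle of processes and threads (how a process is created and how it dies). An operating system exists to run user programs well, so process management is central to it. Process scheduling is left for the next article. Scheduling also belongs to process management, but because I follow Linux Kernel Development, 3rd edition, alongside the Linux 6.15.0-rc2 source code, the order of discussion here matches the book. Once the detailed analysis of kernel code was added, this chapter turned out to be very long, so it is published in several parts to make it easier to read.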

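2. Processes & Threads: Concepts

In computer science, the textbook definition of a process is "an instance of a program in execution". Unlike a program file stored statically on disk, a process has a dynamic execution context: a program counter (PC), a register set, a stack, and the resource-allocation information recorded in its process control block (PCB). The derived concept, the thread, is usually defined as "the smallest independently schedulable unit within an executing instruction stream". A thread has its own program counter, register state, stack, and thread control block (TCB). Threads are also called lightweight processes, because a thread shares the address space, open file descriptors, and other resources with the other threads of the same process.

[Figure: process and thread diagram, borrowed from Wikipedia]
[Note]: I borrowed this figure from Wikipedia to give readers who have not studied operating systems before a more concrete picture.

In effect, a process is the live result of running program code. The kernel schedules threads, not processes. The Linux kernel does not sharply distinguish processes from threads: to Linux, a thread is just a special kind of process. Another name for a process is task, and the Linux kernel usually calls processes tasks; in this article, "task" and "process" both refer to a process.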

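3. Process Control Block / Process Descriptor (PCB)

The structure an operating system uses to record and manage all the information about each process is called the process control block (PCB). On Linux this structure is also called the process descriptor. The Linux kernel keeps the list of processes in a circular doubly linked list called the task list. Each element of the list is of type struct task_struct, which is the process descriptor in the Linux kernel; it holds all the generic information about one process.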
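To make the circular doubly linked task list more concrete, here is a minimal userspace sketch. It is not kernel code: struct list_node, struct mini_task, and the helper functions are hypothetical stand-ins for the kernel's struct list_head, struct task_struct, and list helpers, and container_of is re-implemented locally for illustration.

```c
#include <stddef.h>

/* Userspace stand-in for the kernel's struct list_head (assumption). */
struct list_node {
    struct list_node *next, *prev;
};

/* Hypothetical miniature descriptor; the real task_struct is far larger. */
struct mini_task {
    int pid;
    struct list_node tasks;   /* link into the circular task list */
};

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* An empty circular list is a head that points at itself. */
static void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert node just before head, i.e. at the tail of the circular list. */
static void list_add_tail(struct list_node *node, struct list_node *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Walk the list exactly once and sum the PIDs of every task on it. */
static int sum_pids(struct list_node *head)
{
    int sum = 0;
    for (struct list_node *p = head->next; p != head; p = p->next)
        sum += container_of(p, struct mini_task, tasks)->pid;
    return sum;
}
```

The link is embedded inside the descriptor rather than the other way round, which is why the walk needs container_of to get from a link back to the task that owns it.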

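The task_struct structure is relatively large: on my current 64-bit machine it occupies roughly 13.43 KB.^#01 (See Appendix #01 at the end of the article for how this figure was obtained.) Modern computers face ever higher performance demands, so modern operating systems keep growing, the data structures used to manage processes carry more and more information, and the process descriptor has grown with them.

task_struct has a great many members. For now a quick skim is enough; we will come back to analyze the meaning and role of each member when we actually run into it.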

// linux 6.15.0-rc2
// PATH: include/linux/sched.h
struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK/** For reasons of header soup (see current_thread_info()), this* must be the first element of task_struct.*/struct thread_info		thread_info;
#endifunsigned int			__state;/* saved state for "spinlock sleepers" */unsigned int			saved_state;/** This begins the randomizable portion of task_struct. Only* scheduling-critical items should be added above here.*/randomized_struct_fields_startvoid				*stack;refcount_t			usage;/* Per task flags (PF_*), defined further below: */unsigned int			flags;unsigned int			ptrace;#ifdef CONFIG_MEM_ALLOC_PROFILINGstruct alloc_tag		*alloc_tag;
#endif#ifdef CONFIG_SMPint				on_cpu;struct __call_single_node	wake_entry;unsigned int			wakee_flips;unsigned long			wakee_flip_decay_ts;struct task_struct		*last_wakee;/** recent_used_cpu is initially set as the last CPU used by a task* that wakes affine another task. Waker/wakee relationships can* push tasks around a CPU where each wakeup moves to the next one.* Tracking a recently used CPU allows a quick search for a recently* used CPU that may be idle.*/int				recent_used_cpu;int				wake_cpu;
#endifint				on_rq;int				prio;int				static_prio;int				normal_prio;unsigned int			rt_priority;struct sched_entity		se;struct sched_rt_entity		rt;struct sched_dl_entity		dl;struct sched_dl_entity		*dl_server;
#ifdef CONFIG_SCHED_CLASS_EXTstruct sched_ext_entity		scx;
#endifconst struct sched_class	*sched_class;#ifdef CONFIG_SCHED_COREstruct rb_node			core_node;unsigned long			core_cookie;unsigned int			core_occupation;
#endif#ifdef CONFIG_CGROUP_SCHEDstruct task_group		*sched_task_group;
#endif#ifdef CONFIG_UCLAMP_TASK/** Clamp values requested for a scheduling entity.* Must be updated with task_rq_lock() held.*/struct uclamp_se		uclamp_req[UCLAMP_CNT];/** Effective clamp values used for a scheduling entity.* Must be updated with task_rq_lock() held.*/struct uclamp_se		uclamp[UCLAMP_CNT];
#endifstruct sched_statistics         stats;#ifdef CONFIG_PREEMPT_NOTIFIERS/* List of struct preempt_notifier: */struct hlist_head		preempt_notifiers;
#endif#ifdef CONFIG_BLK_DEV_IO_TRACEunsigned int			btrace_seq;
#endifunsigned int			policy;unsigned long			max_allowed_capacity;int				nr_cpus_allowed;const cpumask_t			*cpus_ptr;cpumask_t			*user_cpus_ptr;cpumask_t			cpus_mask;void				*migration_pending;
#ifdef CONFIG_SMPunsigned short			migration_disabled;
#endifunsigned short			migration_flags;#ifdef CONFIG_PREEMPT_RCUint				rcu_read_lock_nesting;union rcu_special		rcu_read_unlock_special;struct list_head		rcu_node_entry;struct rcu_node			*rcu_blocked_node;
#endif /* #ifdef CONFIG_PREEMPT_RCU */#ifdef CONFIG_TASKS_RCUunsigned long			rcu_tasks_nvcsw;u8				rcu_tasks_holdout;u8				rcu_tasks_idx;int				rcu_tasks_idle_cpu;struct list_head		rcu_tasks_holdout_list;int				rcu_tasks_exit_cpu;struct list_head		rcu_tasks_exit_list;
#endif /* #ifdef CONFIG_TASKS_RCU */#ifdef CONFIG_TASKS_TRACE_RCUint				trc_reader_nesting;int				trc_ipi_to_cpu;union rcu_special		trc_reader_special;struct list_head		trc_holdout_list;struct list_head		trc_blkd_node;int				trc_blkd_cpu;
#endif /* #ifdef CONFIG_TASKS_TRACE_RCU */struct sched_info		sched_info;struct list_head		tasks;
#ifdef CONFIG_SMPstruct plist_node		pushable_tasks;struct rb_node			pushable_dl_tasks;
#endifstruct mm_struct		*mm;struct mm_struct		*active_mm;struct address_space		*faults_disabled_mapping;int				exit_state;int				exit_code;int				exit_signal;/* The signal sent when the parent dies: */int				pdeath_signal;/* JOBCTL_*, siglock protected: */unsigned long			jobctl;/* Used for emulating ABI behavior of previous Linux versions: */unsigned int			personality;/* Scheduler bits, serialized by scheduler locks: */unsigned			sched_reset_on_fork:1;unsigned			sched_contributes_to_load:1;unsigned			sched_migrated:1;unsigned			sched_task_hot:1;/* Force alignment to the next boundary: */unsigned			:0;/* Unserialized, strictly 'current' *//** This field must not be in the scheduler word above due to wakelist* queueing no longer being serialized by p->on_cpu. However:** p->XXX = X;			ttwu()* schedule()			  if (p->on_rq && ..) // false*   smp_mb__after_spinlock();	  if (smp_load_acquire(&p->on_cpu) && //true*   deactivate_task()		      ttwu_queue_wakelist())*     p->on_rq = 0;			p->sched_remote_wakeup = Y;** guarantees all stores of 'current' are visible before* ->sched_remote_wakeup gets used, so it can be in this word.*/unsigned			sched_remote_wakeup:1;
#ifdef CONFIG_RT_MUTEXESunsigned			sched_rt_mutex:1;
#endif/* Bit to tell TOMOYO we're in execve(): */unsigned			in_execve:1;unsigned			in_iowait:1;
#ifndef TIF_RESTORE_SIGMASKunsigned			restore_sigmask:1;
#endif
#ifdef CONFIG_MEMCG_V1unsigned			in_user_fault:1;
#endif
#ifdef CONFIG_LRU_GEN/* whether the LRU algorithm may apply to this access */unsigned			in_lru_fault:1;
#endif
#ifdef CONFIG_COMPAT_BRKunsigned			brk_randomized:1;
#endif
#ifdef CONFIG_CGROUPS/* disallow userland-initiated cgroup migration */unsigned			no_cgroup_migration:1;/* task is frozen/stopped (used by the cgroup freezer) */unsigned			frozen:1;
#endif
#ifdef CONFIG_BLK_CGROUPunsigned			use_memdelay:1;
#endif
#ifdef CONFIG_PSI/* Stalled due to lack of memory */unsigned			in_memstall:1;
#endif
#ifdef CONFIG_PAGE_OWNER/* Used by page_owner=on to detect recursion in page tracking. */unsigned			in_page_owner:1;
#endif
#ifdef CONFIG_EVENTFD/* Recursion prevention for eventfd_signal() */unsigned			in_eventfd:1;
#endif
#ifdef CONFIG_ARCH_HAS_CPU_PASIDunsigned			pasid_activated:1;
#endif
#ifdef CONFIG_X86_BUS_LOCK_DETECTunsigned			reported_split_lock:1;
#endif
#ifdef CONFIG_TASK_DELAY_ACCT/* delay due to memory thrashing */unsigned                        in_thrashing:1;
#endif
#ifdef CONFIG_PREEMPT_RTstruct netdev_xmit		net_xmit;
#endifunsigned long			atomic_flags; /* Flags requiring atomic access. */struct restart_block		restart_block;pid_t				pid;pid_t				tgid;#ifdef CONFIG_STACKPROTECTOR/* Canary value for the -fstack-protector GCC feature: */unsigned long			stack_canary;
#endif/** Pointers to the (original) parent process, youngest child, younger sibling,* older sibling, respectively.  (p->father can be replaced with* p->real_parent->pid)*//* Real parent process: */struct task_struct __rcu	*real_parent;/* Recipient of SIGCHLD, wait4() reports: */struct task_struct __rcu	*parent;/** Children/sibling form the list of natural children:*/struct list_head		children;struct list_head		sibling;struct task_struct		*group_leader;/** 'ptraced' is the list of tasks this task is using ptrace() on.** This includes both natural children and PTRACE_ATTACH targets.* 'ptrace_entry' is this task's link on the p->parent->ptraced list.*/struct list_head		ptraced;struct list_head		ptrace_entry;/* PID/PID hash table linkage. */struct pid			*thread_pid;struct hlist_node		pid_links[PIDTYPE_MAX];struct list_head		thread_node;struct completion		*vfork_done;/* CLONE_CHILD_SETTID: */int __user			*set_child_tid;/* CLONE_CHILD_CLEARTID: */int __user			*clear_child_tid;/* PF_KTHREAD | PF_IO_WORKER */void				*worker_private;u64				utime;u64				stime;
#ifdef CONFIG_ARCH_HAS_SCALED_CPUTIMEu64				utimescaled;u64				stimescaled;
#endifu64				gtime;struct prev_cputime		prev_cputime;
#ifdef CONFIG_VIRT_CPU_ACCOUNTING_GENstruct vtime			vtime;
#endif#ifdef CONFIG_NO_HZ_FULLatomic_t			tick_dep_mask;
#endif/* Context switch counts: */unsigned long			nvcsw;unsigned long			nivcsw;/* Monotonic time in nsecs: */u64				start_time;/* Boot based time in nsecs: */u64				start_boottime;/* MM fault and swap info: this can arguably be seen as either mm-specific or thread-specific: */unsigned long			min_flt;unsigned long			maj_flt;/* Empty if CONFIG_POSIX_CPUTIMERS=n */struct posix_cputimers		posix_cputimers;#ifdef CONFIG_POSIX_CPU_TIMERS_TASK_WORKstruct posix_cputimers_work	posix_cputimers_work;
#endif/* Process credentials: *//* Tracer's credentials at attach: */const struct cred __rcu		*ptracer_cred;/* Objective and real subjective task credentials (COW): */const struct cred __rcu		*real_cred;/* Effective (overridable) subjective task credentials (COW): */const struct cred __rcu		*cred;#ifdef CONFIG_KEYS/* Cached requested key. */struct key			*cached_requested_key;
#endif/** executable name, excluding path.** - normally initialized begin_new_exec()* - set it with set_task_comm()*   - strscpy_pad() to ensure it is always NUL-terminated and*     zero-padded*   - task_lock() to ensure the operation is atomic and the name is*     fully updated.*/char				comm[TASK_COMM_LEN];struct nameidata		*nameidata;#ifdef CONFIG_SYSVIPCstruct sysv_sem			sysvsem;struct sysv_shm			sysvshm;
#endif
#ifdef CONFIG_DETECT_HUNG_TASKunsigned long			last_switch_count;unsigned long			last_switch_time;
#endif/* Filesystem information: */struct fs_struct		*fs;/* Open file information: */struct files_struct		*files;#ifdef CONFIG_IO_URINGstruct io_uring_task		*io_uring;
#endif/* Namespaces: */struct nsproxy			*nsproxy;/* Signal handlers: */struct signal_struct		*signal;struct sighand_struct __rcu		*sighand;sigset_t			blocked;sigset_t			real_blocked;/* Restored if set_restore_sigmask() was used: */sigset_t			saved_sigmask;struct sigpending		pending;unsigned long			sas_ss_sp;size_t				sas_ss_size;unsigned int			sas_ss_flags;struct callback_head		*task_works;#ifdef CONFIG_AUDIT
#ifdef CONFIG_AUDITSYSCALLstruct audit_context		*audit_context;
#endifkuid_t				loginuid;unsigned int			sessionid;
#endifstruct seccomp			seccomp;struct syscall_user_dispatch	syscall_dispatch;/* Thread group tracking: */u64				parent_exec_id;u64				self_exec_id;/* Protection against (de-)allocation: mm, files, fs, tty, keyrings, mems_allowed, mempolicy: */spinlock_t			alloc_lock;/* Protection of the PI data structures: */raw_spinlock_t			pi_lock;struct wake_q_node		wake_q;#ifdef CONFIG_RT_MUTEXES/* PI waiters blocked on a rt_mutex held by this task: */struct rb_root_cached		pi_waiters;/* Updated under owner's pi_lock and rq lock */struct task_struct		*pi_top_task;/* Deadlock detection and priority inheritance handling: */struct rt_mutex_waiter		*pi_blocked_on;
#endif#ifdef CONFIG_DEBUG_MUTEXES/* Mutex deadlock detection: */struct mutex_waiter		*blocked_on;
#endif#ifdef CONFIG_DETECT_HUNG_TASK_BLOCKERstruct mutex			*blocker_mutex;
#endif#ifdef CONFIG_DEBUG_ATOMIC_SLEEPint				non_block_count;
#endif#ifdef CONFIG_TRACE_IRQFLAGSstruct irqtrace_events		irqtrace;unsigned int			hardirq_threaded;u64				hardirq_chain_key;int				softirqs_enabled;int				softirq_context;int				irq_config;
#endif
#ifdef CONFIG_PREEMPT_RTint				softirq_disable_cnt;
#endif#ifdef CONFIG_LOCKDEP
# define MAX_LOCK_DEPTH			48ULu64				curr_chain_key;int				lockdep_depth;unsigned int			lockdep_recursion;struct held_lock		held_locks[MAX_LOCK_DEPTH];
#endif#if defined(CONFIG_UBSAN) && !defined(CONFIG_UBSAN_TRAP)unsigned int			in_ubsan;
#endif/* Journalling filesystem info: */void				*journal_info;/* Stacked block device info: */struct bio_list			*bio_list;/* Stack plugging: */struct blk_plug			*plug;/* VM state: */struct reclaim_state		*reclaim_state;struct io_context		*io_context;#ifdef CONFIG_COMPACTIONstruct capture_control		*capture_control;
#endif/* Ptrace state: */unsigned long			ptrace_message;kernel_siginfo_t		*last_siginfo;struct task_io_accounting	ioac;
#ifdef CONFIG_PSI/* Pressure stall state */unsigned int			psi_flags;
#endif
#ifdef CONFIG_TASK_XACCT/* Accumulated RSS usage: */u64				acct_rss_mem1;/* Accumulated virtual memory usage: */u64				acct_vm_mem1;/* stime + utime since last update: */u64				acct_timexpd;
#endif
#ifdef CONFIG_CPUSETS/* Protected by ->alloc_lock: */nodemask_t			mems_allowed;/* Sequence number to catch updates: */seqcount_spinlock_t		mems_allowed_seq;int				cpuset_mem_spread_rotor;
#endif
#ifdef CONFIG_CGROUPS/* Control Group info protected by css_set_lock: */struct css_set __rcu		*cgroups;/* cg_list protected by css_set_lock and tsk->alloc_lock: */struct list_head		cg_list;
#endif
#ifdef CONFIG_X86_CPU_RESCTRLu32				closid;u32				rmid;
#endif
#ifdef CONFIG_FUTEXstruct robust_list_head __user	*robust_list;
#ifdef CONFIG_COMPATstruct compat_robust_list_head __user *compat_robust_list;
#endifstruct list_head		pi_state_list;struct futex_pi_state		*pi_state_cache;struct mutex			futex_exit_mutex;unsigned int			futex_state;
#endif
#ifdef CONFIG_PERF_EVENTSu8				perf_recursion[PERF_NR_CONTEXTS];struct perf_event_context	*perf_event_ctxp;struct mutex			perf_event_mutex;struct list_head		perf_event_list;struct perf_ctx_data __rcu	*perf_ctx_data;
#endif
#ifdef CONFIG_DEBUG_PREEMPTunsigned long			preempt_disable_ip;
#endif
#ifdef CONFIG_NUMA/* Protected by alloc_lock: */struct mempolicy		*mempolicy;short				il_prev;u8				il_weight;short				pref_node_fork;
#endif
#ifdef CONFIG_NUMA_BALANCINGint				numa_scan_seq;unsigned int			numa_scan_period;unsigned int			numa_scan_period_max;int				numa_preferred_nid;unsigned long			numa_migrate_retry;/* Migration stamp: */u64				node_stamp;u64				last_task_numa_placement;u64				last_sum_exec_runtime;struct callback_head		numa_work;/** This pointer is only modified for current in syscall and* pagefault context (and for tasks being destroyed), so it can be read* from any of the following contexts:*  - RCU read-side critical section*  - current->numa_group from everywhere*  - task's runqueue locked, task not running*/struct numa_group __rcu		*numa_group;/** numa_faults is an array split into four regions:* faults_memory, faults_cpu, faults_memory_buffer, faults_cpu_buffer* in this precise order.** faults_memory: Exponential decaying average of faults on a per-node* basis. Scheduling placement decisions are made based on these* counts. The values remain static for the duration of a PTE scan.* faults_cpu: Track the nodes the process was running on when a NUMA* hinting fault was incurred.* faults_memory_buffer and faults_cpu_buffer: Record faults per node* during the current scan window. When the scan completes, the counts* in faults_memory and faults_cpu decay and these values are copied.*/unsigned long			*numa_faults;unsigned long			total_numa_faults;/** numa_faults_locality tracks if faults recorded during the last* scan window were remote/local or failed to migrate. The task scan* period is adapted based on the locality of the faults with different* weights depending on whether they were shared or private faults*/unsigned long			numa_faults_locality[3];unsigned long			numa_pages_migrated;
#endif /* CONFIG_NUMA_BALANCING */#ifdef CONFIG_RSEQstruct rseq __user *rseq;u32 rseq_len;u32 rseq_sig;/** RmW on rseq_event_mask must be performed atomically* with respect to preemption.*/unsigned long rseq_event_mask;
# ifdef CONFIG_DEBUG_RSEQ/** This is a place holder to save a copy of the rseq fields for* validation of read-only fields. The struct rseq has a* variable-length array at the end, so it cannot be used* directly. Reserve a size large enough for the known fields.*/char				rseq_fields[sizeof(struct rseq)];
# endif
#endif#ifdef CONFIG_SCHED_MM_CIDint				mm_cid;		/* Current cid in mm */int				last_mm_cid;	/* Most recent cid in mm */int				migrate_from_cpu;int				mm_cid_active;	/* Whether cid bitmap is active */struct callback_head		cid_work;
#endifstruct tlbflush_unmap_batch	tlb_ubc;/* Cache last used pipe for splice(): */struct pipe_inode_info		*splice_pipe;struct page_frag		task_frag;#ifdef CONFIG_TASK_DELAY_ACCTstruct task_delay_info		*delays;
#endif#ifdef CONFIG_FAULT_INJECTIONint				make_it_fail;unsigned int			fail_nth;
#endif/** When (nr_dirtied >= nr_dirtied_pause), it's time to call* balance_dirty_pages() for a dirty throttling pause:*/int				nr_dirtied;int				nr_dirtied_pause;/* Start of a write-and-pause period: */unsigned long			dirty_paused_when;#ifdef CONFIG_LATENCYTOPint				latency_record_count;struct latency_record		latency_record[LT_SAVECOUNT];
#endif/** Time slack values; these are used to round up poll() and* select() etc timeout values. These are in nanoseconds.*/u64				timer_slack_ns;u64				default_timer_slack_ns;#if defined(CONFIG_KASAN_GENERIC) || defined(CONFIG_KASAN_SW_TAGS)unsigned int			kasan_depth;
#endif#ifdef CONFIG_KCSANstruct kcsan_ctx		kcsan_ctx;
#ifdef CONFIG_TRACE_IRQFLAGSstruct irqtrace_events		kcsan_save_irqtrace;
#endif
#ifdef CONFIG_KCSAN_WEAK_MEMORYint				kcsan_stack_depth;
#endif
#endif#ifdef CONFIG_KMSANstruct kmsan_ctx		kmsan_ctx;
#endif#if IS_ENABLED(CONFIG_KUNIT)struct kunit			*kunit_test;
#endif#ifdef CONFIG_FUNCTION_GRAPH_TRACER/* Index of current stored address in ret_stack: */int				curr_ret_stack;int				curr_ret_depth;/* Stack of return addresses for return function tracing: */unsigned long			*ret_stack;/* Timestamp for last schedule: */unsigned long long		ftrace_timestamp;unsigned long long		ftrace_sleeptime;/** Number of functions that haven't been traced* because of depth overrun:*/atomic_t			trace_overrun;/* Pause tracing: */atomic_t			tracing_graph_pause;
#endif#ifdef CONFIG_TRACING/* Bitmask and counter of trace recursion: */unsigned long			trace_recursion;
#endif /* CONFIG_TRACING */#ifdef CONFIG_KCOV/* See kernel/kcov.c for more details. *//* Coverage collection mode enabled for this task (0 if disabled): */unsigned int			kcov_mode;/* Size of the kcov_area: */unsigned int			kcov_size;/* Buffer for coverage collection: */void				*kcov_area;/* KCOV descriptor wired with this task or NULL: */struct kcov			*kcov;/* KCOV common handle for remote coverage collection: */u64				kcov_handle;/* KCOV sequence number: */int				kcov_sequence;/* Collect coverage from softirq context: */unsigned int			kcov_softirq;
#endif#ifdef CONFIG_MEMCG_V1struct mem_cgroup		*memcg_in_oom;
#endif#ifdef CONFIG_MEMCG/* Number of pages to reclaim on returning to userland: */unsigned int			memcg_nr_pages_over_high;/* Used by memcontrol for targeted memcg charge: */struct mem_cgroup		*active_memcg;/* Cache for current->cgroups->memcg->objcg lookups: */struct obj_cgroup		*objcg;
#endif#ifdef CONFIG_BLK_CGROUPstruct gendisk			*throttle_disk;
#endif#ifdef CONFIG_UPROBESstruct uprobe_task		*utask;
#endif
#if defined(CONFIG_BCACHE) || defined(CONFIG_BCACHE_MODULE)unsigned int			sequential_io;unsigned int			sequential_io_avg;
#endifstruct kmap_ctrl		kmap_ctrl;
#ifdef CONFIG_DEBUG_ATOMIC_SLEEPunsigned long			task_state_change;
# ifdef CONFIG_PREEMPT_RTunsigned long			saved_state_change;
# endif
#endifstruct rcu_head			rcu;refcount_t			rcu_users;int				pagefault_disabled;
#ifdef CONFIG_MMUstruct task_struct		*oom_reaper_list;struct timer_list		oom_reaper_timer;
#endif
#ifdef CONFIG_VMAP_STACKstruct vm_struct		*stack_vm_area;
#endif
#ifdef CONFIG_THREAD_INFO_IN_TASK/* A live task holds one reference: */refcount_t			stack_refcount;
#endif
#ifdef CONFIG_LIVEPATCHint patch_state;
#endif
#ifdef CONFIG_SECURITY/* Used by LSM modules for access restriction: */void				*security;
#endif
#ifdef CONFIG_BPF_SYSCALL/* Used by BPF task local storage */struct bpf_local_storage __rcu	*bpf_storage;/* Used for BPF run context */struct bpf_run_ctx		*bpf_ctx;
#endif/* Used by BPF for per-TASK xdp storage */struct bpf_net_context		*bpf_net_context;#ifdef CONFIG_GCC_PLUGIN_STACKLEAKunsigned long			lowest_stack;unsigned long			prev_lowest_stack;
#endif#ifdef CONFIG_X86_MCEvoid __user			*mce_vaddr;__u64				mce_kflags;u64				mce_addr;__u64				mce_ripv : 1,mce_whole_page : 1,__mce_reserved : 62;struct callback_head		mce_kill_me;int				mce_count;
#endif#ifdef CONFIG_KRETPROBESstruct llist_head               kretprobe_instances;
#endif
#ifdef CONFIG_RETHOOKstruct llist_head               rethooks;
#endif#ifdef CONFIG_ARCH_HAS_PARANOID_L1D_FLUSH/** If L1D flush is supported on mm context switch* then we use this callback head to queue kill work* to kill tasks that are not running on SMT disabled* cores*/struct callback_head		l1d_flush_kill;
#endif#ifdef CONFIG_RV/** Per-task RV monitor. Nowadays fixed in RV_PER_TASK_MONITORS.* If we find justification for more monitors, we can think* about adding more or developing a dynamic method. So far,* none of these are justified.*/union rv_task_monitor		rv[RV_PER_TASK_MONITORS];
#endif#ifdef CONFIG_USER_EVENTSstruct user_event_mm		*user_event_mm;
#endif/** New fields for task_struct should be added above here, so that* they are included in the randomized portion of task_struct.*/randomized_struct_fields_end/* CPU-specific state of this task: */struct thread_struct		thread;/** WARNING: on x86, 'thread_struct' contains a variable-sized* structure.  It *MUST* be at the end of 'task_struct'.** Do not put anything below here!*/
};

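4. Process Kernel Stack

Every process eventually traps into the kernel through a system call, and the process kernel stack holds context while a system call, exception, or interrupt is being handled. A process running in kernel mode uses a stack different from its user-space stack: a separate stack in kernel space, called the kernel stack. The kernel stack is allocated when the process is created, by alloc_thread_stack_node; depending on the configuration, the memory ultimately comes from vmalloc, the page allocator, or a slab cache, and its size is THREAD_SIZE. Historically the kernel stack has usually been two pages, which generally means 8 KB on 32-bit machines and 16 KB on 64-bit machines, and the size is fixed.

[Note]: Kernel stack allocation and management will be covered in a later article; this article does not go into it.

4.1. Definition of the process kernel stack

The kernel stack is defined as shown below. In theory it is defined as a union, a type with the property that "the size of a union equals the size of its largest member, and all members share the same memory". From this property, the size of a kernel stack thread_union is sizeof(unsigned long) * (THREAD_SIZE / sizeof(long)) = THREAD_SIZE.

// Linux Kernel 2.6.34
// PATH: include/linux/sched.h
union thread_union {
	struct thread_info thread_info;
	unsigned long stack[THREAD_SIZE/sizeof(long)];
};

//--------------------------------------------------------------------
// Linux Kernel 6.15.0-rc2
// PATH: include/linux/sched.h
union thread_union {
	struct task_struct task;
#ifndef CONFIG_THREAD_INFO_IN_TASK
	struct thread_info thread_info;
#endif
	unsigned long stack[THREAD_SIZE/sizeof(long)];
};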
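The size argument can be checked in userspace with a small sketch. The names below (DEMO_THREAD_SIZE, demo_thread_info, demo_thread_union) are made up for illustration, and the 16 KB value is an assumption matching x86_64 without KASAN; the point is only that the union is as large as its largest member and that every member starts at offset 0.

```c
#include <stddef.h>

/* Assumed stack size for the demo: 16 KB. */
#define DEMO_THREAD_SIZE 16384UL

/* Hypothetical small header, standing in for thread_info. */
struct demo_thread_info {
    unsigned long flags;
    unsigned int  cpu;
};

/* Same shape as thread_union: a small header overlaid on the stack array. */
union demo_thread_union {
    struct demo_thread_info info;
    unsigned long stack[DEMO_THREAD_SIZE / sizeof(long)];
};

/* Size of the whole union: dictated by the largest member (the array). */
static size_t demo_union_size(void)
{
    return sizeof(union demo_thread_union);
}

/* Every union member starts at offset 0, so they share the same memory. */
static size_t demo_info_offset(void)
{
    return offsetof(union demo_thread_union, info);
}

static size_t demo_stack_offset(void)
{
    return offsetof(union demo_thread_union, stack);
}
```

Because both members begin at offset 0, the header occupies the lowest addresses of the stack area, which is the layout the older kernels relied on.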

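Why do I stress "in theory"? Because older kernels really did work this way, and many blog posts and articles on the web describe the kernel stack using older kernel code, so what they say is also correct. In older kernels, thread_info and the actual kernel stack were bound together and allocated as two pages of memory through alloc_thread_info. The address of thread_info was therefore the start address of the kernel stack memory (in old kernels). thread_info in turn held a pointer to the process descriptor task_struct, which is how the process descriptor was found.

// Linux Kernel 2.6.34
// PATH: arch/x86/include/asm/thread_info.h
#define alloc_thread_info(tsk)						\
	((struct thread_info *)__get_free_pages(THREAD_FLAGS, THREAD_ORDER))

__get_free_pages is a kernel function that allocates physically contiguous pages.
THREAD_ORDER determines the size of the kernel stack (usually 2 pages).
The allocated memory holds both thread_info and the kernel stack, with thread_info at the lowest addresses of the kernel stack area.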
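Since the stack area is aligned to its own size and thread_info sits at its lowest address, an address anywhere inside the stack can be turned into the thread_info address by masking off the low bits. As far as I know, older 32-bit x86 kernels used this trick with the stack pointer; the sketch below only reproduces the arithmetic in userspace, and OLD_THREAD_SIZE and stack_base_from_sp are illustrative names, not kernel identifiers.

```c
#include <stdint.h>

/* Assumed for the demo: two 4 KB pages, aligned on an 8 KB boundary. */
#define OLD_THREAD_SIZE 8192UL

/*
 * Given any address inside a THREAD_SIZE-aligned stack area, clear the low
 * bits to obtain the lowest address of that area, which is where old
 * kernels placed thread_info.
 */
static uintptr_t stack_base_from_sp(uintptr_t sp)
{
    return sp & ~(uintptr_t)(OLD_THREAD_SIZE - 1);
}
```

For example, stack pointers 0xc1234f80 and 0xc1235ffc both lie in the same 8 KB area and both map to the base 0xc1234000.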

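So the kernel stack of older kernels looks like the figure below; both of the two depictions are correct.

[Figure: two equivalent depictions of the old kernel stack layout]

[Note]: The x86 stack grows downward, that is, from high addresses toward low addresses.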

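In recent kernel code the definition of thread_union still exists, but it is rarely used directly. The kernel design has gradually moved from depending on thread_union toward more flexible ways of managing a task's kernel stack and related data structures, namely task_struct plus an independent stack allocation: dup_task_struct uses alloc_task_struct_node to allocate memory for task_struct and alloc_thread_stack_node to allocate memory for the kernel stack. The kernel stack of a recent kernel therefore looks like the figure below. It can also be thought of simply as one logically contiguous region of memory.

Whether the kernel stack is physically contiguous depends on CONFIG_VMAP_STACK. When it is enabled, the kernel stack is ultimately allocated by __vmalloc_node from virtually mapped memory: the virtual addresses are contiguous, but the physical pages may be scattered. Otherwise it is ultimately allocated by alloc_pages_node, and the memory is physically contiguous. (This option is usually enabled by default.)

[Figure: kernel stack layout in recent kernels]

[Note]: At this point it no longer has much to do with thread_union.

I am writing this article on an x86_64 machine, so the discussion focuses on the x86_64 architecture. The code under the arch directory differs between architectures; I may compare them in a later article when time allows. Now let us see what THREAD_SIZE actually is.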

The x86 definitions of THREAD_SIZE are shown below. THREAD_SIZE is computed from PAGE_SIZE. On 32-bit it is simply PAGE_SIZE * 2; on 64-bit the result also depends on CONFIG_KASAN.

// Linux Kernel 6.15.0-rc2
// PATH: arch/x86/include/asm/page_32_types.h
#define THREAD_SIZE_ORDER	1
#define THREAD_SIZE		(PAGE_SIZE << THREAD_SIZE_ORDER)

//--------------------------------------------------------------
// Linux Kernel 6.15.0-rc2
// PATH: arch/x86/include/asm/page_64_types.h
#ifdef CONFIG_KASAN
#define KASAN_STACK_ORDER 1
#else
#define KASAN_STACK_ORDER 0
#endif

#define THREAD_SIZE_ORDER	(2 + KASAN_STACK_ORDER)
#define THREAD_SIZE  (PAGE_SIZE << THREAD_SIZE_ORDER)

Next, the definition of PAGE_SIZE: the page size is simply 1 << CONFIG_PAGE_SHIFT.

// Linux Kernel 6.15.0-rc2
// PATH: include/vdso/page.h
#define PAGE_SHIFT      CONFIG_PAGE_SHIFT

#define PAGE_SIZE	(_AC(1,UL) << CONFIG_PAGE_SHIFT)

The .config file used to build the kernel settles the two values that were still open. PAGE_SIZE is 1 << 12 = 4096 = 4 KB in both cases, so on a 32-bit machine THREAD_SIZE = 2 * 4 KB = 8 KB, and on a 64-bit machine THREAD_SIZE = 2^2 * 4 KB = 16 KB.^#02
[Note]: The steps used to verify this result are in Appendix #02 at the end of the article.
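The arithmetic above can be replayed with a few lines of C. This is only a userspace sketch of the macro arithmetic: the function names are hypothetical, and the page shift of 12 and the KASAN switch are passed in as parameters instead of coming from a kernel .config.

```c
/* PAGE_SIZE = 1 << PAGE_SHIFT */
static unsigned long demo_page_size(unsigned int page_shift)
{
    return 1UL << page_shift;
}

/* THREAD_SIZE = PAGE_SIZE << THREAD_SIZE_ORDER */
static unsigned long demo_thread_size(unsigned int page_shift,
                                      unsigned int order)
{
    return demo_page_size(page_shift) << order;
}

/* x86_64: THREAD_SIZE_ORDER = 2 + KASAN_STACK_ORDER (1 with KASAN, else 0) */
static unsigned int demo_x86_64_order(int kasan_enabled)
{
    return 2u + (kasan_enabled ? 1u : 0u);
}
```

With a page shift of 12 this gives 8192 bytes for the 32-bit order of 1, 16384 bytes for x86_64 without KASAN, and 32768 bytes for x86_64 with KASAN.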

4.2. thread_info: the architecture-specific process description

In the kernel stack definitions of the previous subsection, besides task_struct there is another type that cannot be ignored: thread_info. The code shows that thread_info may be contained in either thread_union or task_struct.

[Note]: I have already said that thread_union is now of almost no practical use, but for the sake of thoroughness let us still look into it.

// Linux Kernel 2.6.34
// PATH: arch/x86/include/asm/thread_info.h
struct thread_info {
	struct task_struct	*task;		/* main task structure */
	struct exec_domain	*exec_domain;	/* execution domain */
	__u32			flags;		/* low level flags */
	__u32			status;		/* thread synchronous flags */
	__u32			cpu;		/* current CPU */
	int			preempt_count;	/* 0 => preemptable, <0 => BUG */
	mm_segment_t		addr_limit;
	struct restart_block    restart_block;
	void __user		*sysenter_return;
#ifdef CONFIG_X86_32
	unsigned long           previous_esp;   /* ESP of the previous stack in
						   case of nested (IRQ) stacks */
	__u8			supervisor_stack[0];
#endif
	int			uaccess_err;
};

//--------------------------------------------------------------------
// Linux Kernel 6.15.0-rc2
// PATH: arch/x86/include/asm/thread_info.h
struct thread_info {
	unsigned long		flags;		/* low level flags */
	unsigned long		syscall_work;	/* SYSCALL_WORK_ flags */
	u32			status;		/* thread synchronous flags */
#ifdef CONFIG_SMP
	u32			cpu;		/* current CPU */
#endif
};

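In the Linux kernel, both task_struct and thread_info hold process information, that is, PCB (Process Control Block) information. Different architectures need to store different things for a process, so Linux splits the information in two: task_struct stores the generic part, and thread_info stores the architecture-specific part. This is why struct task_struct is defined in include/linux/sched.h while thread_info is defined in architecture-specific headers under arch/. Let us now see which type actually contains thread_info.

From the code and the .config file we learn that the macro CONFIG_THREAD_INFO_IN_TASK is defined (in fact it is enabled by default on x86_64 in recent kernels). The current kernel therefore puts thread_info directly inside task_struct, and thread_union no longer contains it; when the macro is not defined, the opposite holds. In other words, the start address of the process kernel stack thread_union is the start address of task_struct, which is also the address of thread_info, the first member of task_struct. The whole kernel stack space would then look like the figure below.

[Note]: The address analysis in the previous paragraph assumes that we do not yet know whether the kernel has stopped using thread_union. If thread_union is what the kernel uses, the analysis holds.

[Figure: kernel stack layout under the thread_union assumption]

[Note]: The x86 stack grows downward, that is, from high addresses toward low addresses.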
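The claim that a structure and its first member share one address is a plain C guarantee, and a userspace sketch can show it. The types demo_thread_info and demo_task below are hypothetical miniatures, not the kernel structures; the cast in demo_thread_info_of is valid only because thread_info is the first member.

```c
#include <stddef.h>

/* Hypothetical miniature of thread_info. */
struct demo_thread_info {
    unsigned long flags;
};

/* Hypothetical miniature of task_struct. */
struct demo_task {
    struct demo_thread_info thread_info;   /* must remain the first member */
    int pid;
};

/*
 * A pointer to a struct, suitably converted, points to its first member,
 * so no offset arithmetic is needed here.
 */
static struct demo_thread_info *demo_thread_info_of(struct demo_task *task)
{
    return (struct demo_thread_info *)task;
}
```

If another member were placed before thread_info, the cast would silently read the wrong bytes, which is why the kernel comment in task_struct insists that thread_info "must be the first element".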

4.3. Locating the process descriptor (task_struct), the kernel stack, and the kernel stack pointer

We now know what a kernel stack looks like. Next, let us see how to locate a kernel stack and a process descriptor (task_struct, the PCB): in other words, how code finds a process's kernel stack and its process descriptor.

As noted above, the configuration macro CONFIG_THREAD_INFO_IN_TASK is enabled by default. The following code is also related to it.

// Linux Kernel 6.15.0-rc2
// PATH: include/linux/thread_info.h
#ifdef CONFIG_THREAD_INFO_IN_TASK
/*
 * For CONFIG_THREAD_INFO_IN_TASK kernels we need <asm/current.h> for the
 * definition of current, but for !CONFIG_THREAD_INFO_IN_TASK kernels,
 * including <asm/current.h> can cause a circular dependency on some platforms.
 */
#include <asm/current.h>
#define current_thread_info() ((struct thread_info *)current)
#endif

The name already hints at the purpose: it obtains thread_info, whose start address equals the start address of task_struct. Both the included header and the comment tell us that the current used to define this macro comes from the architecture-specific header current.h.

// Linux Kernel 6.15.0-rc2
// PATH: arch/x86/include/asm/current.h
struct task_struct;

DECLARE_PER_CPU_CACHE_HOT(struct task_struct *, current_task);
/* const-qualified alias provided by the linker. */
DECLARE_PER_CPU_CACHE_HOT(struct task_struct * const __percpu_seg_override,
			  const_current_task);

static __always_inline struct task_struct *get_current(void)
{
	if (IS_ENABLED(CONFIG_USE_X86_SEG_SUPPORT))
		return this_cpu_read_const(const_current_task);

	return this_cpu_read_stable(current_task);
}

#define current get_current()

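This brings us to current.h for the x86 architecture. current is a macro that ends up calling get_current() to obtain the task_struct pointer of the current process.

get_current() touches on quite a lot, so here is only a brief description of how it works.
It uses the kernel's per-CPU mechanism to obtain the task_struct of the current process. A per-CPU variable is a special kind of variable for which every CPU has its own independent instance in its own storage area, so each CPU can read and modify its copy independently.

The code first uses DECLARE_PER_CPU_CACHE_HOT to declare a per-CPU variable of type struct task_struct *;
this_cpu_read_stable is a kernel macro that reads the value of the per-CPU variable current_task on the current CPU and returns it as a struct task_struct *.

That completes reading the task_struct of the current process.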
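The idea of "one copy per CPU" can be modeled in userspace with an array indexed by CPU number. This is only a conceptual sketch under that simplification: the real kernel reaches the per-CPU area through a segment base rather than an array index, and every name below (DEMO_NR_CPUS, demo_task, demo_switch_to, demo_get_current) is hypothetical.

```c
#include <stddef.h>

#define DEMO_NR_CPUS 4   /* assumed number of CPUs for the demo */

/* Hypothetical miniature of task_struct. */
struct demo_task {
    int pid;
};

/* One slot per CPU: each CPU only ever reads and writes its own entry. */
static struct demo_task *demo_current_task[DEMO_NR_CPUS];

/* What a context switch on `cpu` would do: publish the new current task. */
static void demo_switch_to(int cpu, struct demo_task *next)
{
    demo_current_task[cpu] = next;
}

/* Analogue of get_current(), with the CPU number made explicit. */
static struct demo_task *demo_get_current(int cpu)
{
    return demo_current_task[cpu];
}
```

Two CPUs running different tasks each see their own value of "current" without any locking, because neither touches the other's slot.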

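After per-CPU variables were introduced, the Linux kernel gradually moved to implementing the current macro with them. On x86, thread_info was simplified step by step starting with Linux Kernel 4.1, and Linux Kernel 4.9 removed the task member from thread_info entirely. From then on the task_struct pointer is no longer obtained through thread_info; it is stored in the per-CPU variable current_task instead. For details see the patch "[PATCH] x86: Move thread_info into task_struct".

So we already have a way to obtain the task_struct of the current process. According to the address analysis in section 4.2, the first member of the process descriptor task_struct is thread_info, and the first member of thread_union is task_struct, so in the current kernel we would expect these three addresses to be identical. Let us verify this assumption with code.^#03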

[72607.977772] task_struct address: ffff88fd9520a940
[72607.977773] task_struct->stack(stack base): ffffca988fdb0000
[72607.977774] task_struct->thread_info address: ffff88fd9520a940

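[Note]: To keep the article readable, only the result is shown here; the full procedure is in Appendix #03 at the end of the article.

The three addresses turn out not to be identical. Only task_struct and thread_info coincide, because thread_info is the first member of task_struct. This shows that the kernel stack is not laid out the way thread_union describes, and makes us more certain that the kernel does not use the thread_union type to describe the kernel stack. task_struct and the kernel stack are in fact allocated independently, so the relationship between the kernel stack memory and task_struct should be as in the figure below. The kernel stack can also be thought of simply as one logically contiguous region of memory.

[Figure: kernel stack and task_struct allocated separately, with sp marked]

The figure above also marks the position of sp, the familiar stack pointer. The stack member of task_struct (void *stack) does not point to the top or the bottom of the stack in the program sense; it points to the lowest address of the kernel stack memory. Because the x86 stack grows downward, the bottom of the stack is at task_struct->stack + THREAD_SIZE, and the top of the stack is held in the sp register, which on x86_64 is rsp. The verification result follows.^#04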

[73502.218940] task_struct->stack(stack base): ffffca9889dc4000
[73502.218942] task_struct->stack(stack top): ffffca9889de4000
[73502.218944] Current stack pointer (sp): ffffca9889dc77d0

[Note]: Again only the result is shown here; the verification code is in Appendix #04.
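The relationship between the stack's lowest address, its high end, and sp can be restated as a small userspace calculation. The sketch assumes THREAD_SIZE = 0x4000 (16 KB, x86_64 without KASAN) and treats addresses as plain 64-bit integers; the function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed kernel stack size: 16 KB. */
#define DEMO_THREAD_SIZE 0x4000ULL

/* High end of the stack area: lowest address plus THREAD_SIZE. */
static uint64_t demo_stack_high_end(uint64_t stack_base)
{
    return stack_base + DEMO_THREAD_SIZE;
}

/* A valid stack pointer lies inside [stack_base, stack_base + THREAD_SIZE). */
static bool demo_sp_in_stack(uint64_t stack_base, uint64_t sp)
{
    return sp >= stack_base && sp < demo_stack_high_end(stack_base);
}

/* The stack grows downward, so bytes in use are counted from the high end. */
static uint64_t demo_stack_bytes_used(uint64_t stack_base, uint64_t sp)
{
    return demo_stack_high_end(stack_base) - sp;
}
```

Plugging in the values from the log, a base of ffffca9889dc4000 gives a high end of ffffca9889dc8000 under the 16 KB assumption, and the logged sp ffffca9889dc77d0 lies inside that range with 0x830 bytes in use. The logged "stack top" value ffffca9889de4000 is 0x20000 above the base, which is 8 times 0x4000; one possible explanation is pointer arithmetic on an unsigned long *, but since the verification code is in Appendix #04, that is only a guess.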

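5. Process ID (PID)

The kernel identifies each process by a unique process ID, the PID (process identifier). The PID of each process is stored in the pid field of its process descriptor (task_struct).

For compatibility with older versions of Unix and Linux, the default maximum PID is 32768 (0x8000), one more than the largest value of a signed short int (32767). The limit is defined in <linux/threads.h>. Large systems that really need more processes can give up this compatibility and raise the limit by writing to /proc/sys/kernel/pid_max.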

// Linux Kernel 2.6.34
// Linux Kernel 6.15.0-rc2
// PATH: linux/include/linux/threads.h
/*
 * This controls the default maximum pid allocated to a process
 */
#define PID_MAX_DEFAULT (IS_ENABLED(CONFIG_BASE_SMALL) ? 0x1000 : 0x8000)
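The default can be restated, and the runtime limit read back, with a short userspace sketch. demo_pid_max_default mirrors the macro above with the CONFIG_BASE_SMALL switch turned into a parameter, and demo_read_pid_max is a hypothetical helper that reads a single integer from a file such as /proc/sys/kernel/pid_max; both names are made up for illustration.

```c
#include <stdio.h>

/* Mirrors PID_MAX_DEFAULT: 0x1000 on "small" configurations, else 0x8000. */
static int demo_pid_max_default(int base_small)
{
    return base_small ? 0x1000 : 0x8000;
}

/*
 * Read one integer from `path` (for example /proc/sys/kernel/pid_max).
 * Returns -1 if the file cannot be opened or does not contain a number.
 */
static long demo_read_pid_max(const char *path)
{
    FILE *f = fopen(path, "r");
    long value = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%ld", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```

On a running Linux system, demo_read_pid_max("/proc/sys/kernel/pid_max") returns the current limit, which may be larger than the compile-time default.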

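6. Process States

6.1. The process states

Process states are managed and switched mainly through the __state field of task_struct, together with the related state macros and functions.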

// Linux Kernel 6.15.0-rc2
// PATH: include/linux/sched.h
/* Used in tsk->__state: */
#define TASK_RUNNING            0x00000000
#define TASK_INTERRUPTIBLE      0x00000001
#define TASK_UNINTERRUPTIBLE    0x00000002
#define __TASK_STOPPED          0x00000004
#define __TASK_TRACED           0x00000008
#define TASK_DEAD               0x00000080
#define TASK_WAKEKILL           0x00000100
#define TASK_WAKING             0x00000200
#define TASK_NOLOAD             0x00000400
#define TASK_NEW                0x00000800
#define TASK_RTLOCK_WAIT        0x00001000

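TASK_RUNNING: the process is runnable; it is either executing right now or sitting on a run queue waiting to be executed;
TASK_INTERRUPTIBLE: the process is blocked, waiting for some condition, and the sleep can be interrupted by a signal. Once the condition is met, the kernel sets the process back to runnable. A signal can also wake it early. A process that is about to wait calls set_current_state(TASK_INTERRUPTIBLE) itself;
TASK_UNINTERRUPTIBLE: the process is blocked and a signal will not wake it early; it keeps waiting until the condition is met. This suits cases where an operation must complete undisturbed (for example critical operations in device drivers). Because a task in this state does not respond to signals, the state is used less often (this is the state that ps shows as D, and such a process cannot be killed directly with kill, since it does not respond to signals);
__TASK_STOPPED: the process has stopped executing; it is not running and cannot be scheduled. This usually happens on receipt of SIGSTOP, SIGTSTP, SIGTTIN or SIGTTOU. In addition, any signal received while the process is being debugged puts it into this state (SIGCONT resumes the process and switches its state back to TASK_RUNNING);
__TASK_TRACED: the process is being traced by another process (for example a debugger such as gdb using ptrace);
TASK_DEAD: the process has exited and is waiting to be reclaimed. A process enters this state after calling do_exit();
TASK_WAKEKILL: the process can be woken by a fatal signal even though it is otherwise in an uninterruptible wait. Combined with TASK_UNINTERRUPTIBLE it forms TASK_KILLABLE;
TASK_WAKING: the process is in transition from waiting to running. This is a transient state used internally by the scheduler;
TASK_NOLOAD: the process does not contribute to the system load average. Typically used by kernel threads and other special tasks;
TASK_NEW: a newly created task that has not been scheduled yet. Used during task initialization;
TASK_RTLOCK_WAIT: the process is waiting on an RT lock. This is a special state used on PREEMPT_RT (real-time) kernels;
TASK_FREEZABLE: the process may be frozen (for example during system suspend). A frozen process stops running until the system resumes;
TASK_FROZEN: the process has been frozen. Typically used for system suspend or hibernation;
EXIT_DEAD: the process has fully exited and is being removed from the system. Similar to TASK_DEAD, but stored in the exit_state field;
EXIT_ZOMBIE: the process has exited, but its parent has not yet called wait() to collect its status. A process in this state is called a "zombie";
TASK_IDLE: a combination of TASK_UNINTERRUPTIBLE and TASK_NOLOAD, for tasks (typically kernel threads) that sleep uninterruptibly without contributing to the load average.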
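Because each state is a separate bit, states can be combined with | and tested with &. The sketch below is a user-space illustration: the flag values are copied from the listing above, the three composite macros follow the definitions in include/linux/sched.h, and the two helper functions are hypothetical simplifications written for this example rather than kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the listing above. */
#define TASK_RUNNING         0x00000000
#define TASK_INTERRUPTIBLE   0x00000001
#define TASK_UNINTERRUPTIBLE 0x00000002
#define TASK_WAKEKILL        0x00000100
#define TASK_NOLOAD          0x00000400

/* Composite states built from the single-bit flags. */
#define TASK_KILLABLE (TASK_WAKEKILL | TASK_UNINTERRUPTIBLE)
#define TASK_IDLE     (TASK_UNINTERRUPTIBLE | TASK_NOLOAD)
#define TASK_NORMAL   (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)

/* Hypothetical helper: would a wakeup aimed at `mask` wake a task in `state`? */
static bool wake_mask_matches(unsigned int state, unsigned int mask)
{
    assert(mask != 0);  /* an empty mask can never match */
    return (state & mask) != 0;
}

/* Hypothetical helper: does a sleeper in `state` count toward the load average? */
static bool counts_toward_load(unsigned int state)
{
    return (state & TASK_UNINTERRUPTIBLE) && !(state & TASK_NOLOAD);
}
```

Note that TASK_RUNNING is 0, so it can never match a bit mask; it has to be tested by equality instead. The load-average helper leaves out the frozen-task case that the real scheduler also considers.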

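6.2. Setting the process state

The kernel provides several macros and functions for setting or changing the process state:

Setting the current state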

// Linux Kernel 6.15.0-rc2
// PATH: include/linux/sched.h
#define __set_current_state(state_value)                        \
        do {                                                    \
                debug_normal_state_change((state_value));       \
                trace_set_current_state(state_value);           \
                WRITE_ONCE(current->__state, (state_value));    \
        } while (0)

#define set_current_state(state_value)                          \
        do {                                                    \
                debug_normal_state_change((state_value));       \
                trace_set_current_state(state_value);           \
                smp_store_mb(current->__state, (state_value));  \
        } while (0)

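__set_current_state: sets the current task's state directly, with no memory barrier.
set_current_state: sets the state and adds a memory barrier (smp_store_mb()), so that the state change is ordered before the subsequent check of the wait condition.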
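The difference between the two macros matters in the usual wait loop: the task first announces that it is about to sleep and only then tests the condition, so a waker that sets the condition and then wakes the task cannot be missed. The following user-space model uses C11 atomics to mimic that ordering. It is only a sketch under stated assumptions: struct fake_task and all model_* names are hypothetical, and a seq_cst fence stands in for the barrier in smp_store_mb().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define TASK_RUNNING       0x00000000
#define TASK_INTERRUPTIBLE 0x00000001

/* Hypothetical stand-in for the current task; only the state word is modelled. */
struct fake_task {
    atomic_uint state;
};

/* Like __set_current_state(): a plain store with no ordering guarantee. */
static void model_set_state_relaxed(struct fake_task *t, unsigned int state)
{
    atomic_store_explicit(&t->state, state, memory_order_relaxed);
}

/*
 * Like set_current_state(): the store is followed by a full barrier, so a
 * later load of the wait condition cannot be reordered before it.
 */
static void model_set_state(struct fake_task *t, unsigned int state)
{
    atomic_store_explicit(&t->state, state, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
}

/*
 * One pass of the wait loop: announce the sleep, then test the condition.
 * Returns true if the task would now have to call schedule().
 */
static bool prepare_to_sleep(struct fake_task *t, atomic_bool *condition)
{
    assert(t && condition);
    model_set_state(t, TASK_INTERRUPTIBLE);
    if (atomic_load_explicit(condition, memory_order_relaxed)) {
        /* Condition already true: no sleep needed, so no barrier needed. */
        model_set_state_relaxed(t, TASK_RUNNING);
        return false;
    }
    return true;
}
```

In the kernel the same shape appears as a loop around set_current_state(), the condition test and schedule(), with __set_current_state(TASK_RUNNING) once the loop exits.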

Setting a special state

// Linux Kernel 6.15.0-rc2
// PATH: include/linux/sched.h
#define set_special_state(state_value)                                  \
        do {                                                            \
                unsigned long flags;                                    \
                raw_spin_lock_irqsave(&current->pi_lock, flags);        \
                debug_special_state_change((state_value));              \
                trace_set_current_state(state_value);                   \
                WRITE_ONCE(current->__state, (state_value));            \
                raw_spin_unlock_irqrestore(&current->pi_lock, flags);   \
        } while (0)

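set_special_state: sets a special state (such as TASK_DEAD) while holding pi_lock, which keeps the change synchronized with concurrent wakeups.

6.3. Checking the process state

The following macros check a task's state: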

// Linux Kernel 6.15.0-rc2
// PATH: include/linux/sched.h
#define task_is_running(task)       (READ_ONCE((task)->__state) == TASK_RUNNING)
#define task_is_traced(task)        ((READ_ONCE(task->jobctl) & JOBCTL_TRACED) != 0)
#define task_is_stopped(task)       ((READ_ONCE(task->jobctl) & JOBCTL_STOPPED) != 0)
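Two details of these macros are easy to overlook: task_is_running() compares by equality because TASK_RUNNING is 0, and the stopped/traced checks read the jobctl field rather than __state. The user-space model below illustrates both. The struct is a hypothetical cut-down task, and the two JOBCTL bit positions are assumptions made for illustration; the authoritative values are in include/linux/sched/jobctl.h.

```c
#include <assert.h>
#include <stdbool.h>

#define TASK_RUNNING   0x00000000
#define __TASK_STOPPED 0x00000004

/* Assumed bit positions, for illustration only. */
#define JOBCTL_STOPPED (1UL << 26)
#define JOBCTL_TRACED  (1UL << 27)

/* Hypothetical cut-down task: only the two fields the macros read. */
struct fake_task {
    unsigned int  state;   /* models task_struct->__state */
    unsigned long jobctl;  /* models task_struct->jobctl  */
};

/* TASK_RUNNING is 0, so it must be tested by equality, not with a bit mask. */
static bool model_task_is_running(const struct fake_task *t)
{
    assert(t);
    return t->state == TASK_RUNNING;
}

/* Stopped and traced are read from jobctl, independently of the state word. */
static bool model_task_is_stopped(const struct fake_task *t)
{
    assert(t);
    return (t->jobctl & JOBCTL_STOPPED) != 0;
}

static bool model_task_is_traced(const struct fake_task *t)
{
    assert(t);
    return (t->jobctl & JOBCTL_TRACED) != 0;
}
```

A task with state __TASK_STOPPED and JOBCTL_STOPPED set reports stopped and not running, while clearing only the state word would make it look running without changing what model_task_is_stopped() returns.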

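6.4. Core functions for state transitions

schedule()
schedule() is the kernel's core scheduling function. Based on task state and priority, it picks the next task to run.

PATH: kernel/sched/core.c

try_to_wake_up() / wake_up_process()
try_to_wake_up() wakes a task that is not running and sets its state to TASK_RUNNING. (wake_up_process() calls try_to_wake_up() internally.)

PATH: kernel/sched/core.c

schedule_timeout()
schedule_timeout() puts the current task to sleep until the given timeout expires or the task is woken earlier; the caller sets the task state before calling it.

PATH: kernel/time/sleep_timeout.c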
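The interplay between sleeping and waking can be modelled in a few lines of user-space C. This is a deliberately simplified sketch: struct fake_task, the on_runqueue flag and the model_* functions are hypothetical, and the real try_to_wake_up() does far more (locking, CPU selection, memory ordering). What the model keeps is the state-mask check and the fact that wake_up_process() targets TASK_NORMAL.

```c
#include <assert.h>
#include <stdbool.h>

#define TASK_RUNNING         0x00000000
#define TASK_INTERRUPTIBLE   0x00000001
#define TASK_UNINTERRUPTIBLE 0x00000002
#define __TASK_STOPPED       0x00000004
#define TASK_NORMAL          (TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE)

/* Hypothetical cut-down task. */
struct fake_task {
    unsigned int state;
    bool on_runqueue;
};

/* Going to sleep: set a non-running state, then "schedule()" dequeues the task. */
static void model_sleep(struct fake_task *t, unsigned int state)
{
    assert(t && state != TASK_RUNNING);
    t->state = state;
    t->on_runqueue = false;
}

/* Wake the task only if its state matches the mask; report whether it was woken. */
static bool model_try_to_wake_up(struct fake_task *t, unsigned int mask)
{
    assert(t);
    if (!(t->state & mask))
        return false;
    t->state = TASK_RUNNING;
    t->on_runqueue = true;
    return true;
}

/* wake_up_process() wakes tasks sleeping in either of the TASK_NORMAL states. */
static bool model_wake_up_process(struct fake_task *t)
{
    return model_try_to_wake_up(t, TASK_NORMAL);
}
```

In this model a stopped task is left alone by model_wake_up_process(), and waking a task that is already running returns false, which mirrors why a second wakeup of the same task is harmless.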

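6.5. Classic process state transitions

With the mechanisms above, the kernel can manage and switch process states efficiently.

[Figure: classic process state transitions]
The figure mentions the notion of process context. A context is simply an environment; for a process, it is the environment in which the process executes: its variables and data, including all register values, the files the process has open, its memory information, and so on. The "process context" can be viewed as the arguments a user process passes to the kernel, together with the full set of variables, register values and environment that the kernel has to save at that moment. When the scheduler switches from one process to another, that switch is a context switch. The operating system must switch all of the process information mentioned above before the newly scheduled process can run.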
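The save-and-restore idea behind a context switch can be shown with a toy model. Everything here is hypothetical and greatly reduced: a "CPU" with two registers, and tasks that each keep a saved copy. A real switch also changes the address space, kernel stack, floating-point state and more.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, heavily reduced register set. */
struct cpu_regs {
    uint64_t pc;
    uint64_t sp;
};

/* Each task keeps the registers it had when it last left the CPU. */
struct fake_task {
    int pid;
    struct cpu_regs saved;
};

/* The CPU holds the live registers of whichever task is current. */
struct fake_cpu {
    struct cpu_regs regs;
    struct fake_task *current;
};

/* Save the outgoing task's registers, then load the incoming task's. */
static void model_context_switch(struct fake_cpu *cpu, struct fake_task *next)
{
    assert(cpu && cpu->current && next);
    cpu->current->saved = cpu->regs;  /* save outgoing context   */
    cpu->regs = next->saved;          /* restore incoming context */
    cpu->current = next;
}
```

Switching from task A to task B and back again leaves A's registers exactly as they were when it was switched out, which is what lets a process resume as if nothing had happened.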

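#Appendix

#01. Computing the size of task_struct

Method: write a simple kernel module that uses sizeof() to compute the size of struct task_struct and prints it.

#01.1. Writing the task_struct_size module

Create the kernel module source file task_struct_size.c.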

$ vim task_struct_size.c

#include <linux/init.h>
#include <linux/module.h>
#include <linux/sched.h>

static int __init task_struct_size_init(void)
{
    pr_info("Size of task_struct: %zu bytes\n", sizeof(struct task_struct));
    return 0;
}

static void __exit task_struct_size_exit(void)
{
    pr_info("Exiting task_struct size module\n");
}

module_init(task_struct_size_init);
module_exit(task_struct_size_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Imagine Miracle");
MODULE_DESCRIPTION("Module to calculate task_struct size");

#01.2. Writing the Makefile

Create the Makefile (recipe lines must be indented with a tab):

obj-m += task_struct_size.o

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

#01.3. Building and inserting the module

Run make to build the module and produce the .ko file.

$ make
# The build produces the following files
$ ls
Makefile       Module.symvers      task_struct_size.ko   task_struct_size.mod.c  task_struct_size.o
modules.order  task_struct_size.c  task_struct_size.mod  task_struct_size.mod.o

Insert the module into the kernel.

$ sudo insmod task_struct_size.ko
# lsmod shows whether the module was inserted
$ lsmod | grep task_struct_size
task_struct_size       12288  0

#01.4. Checking the kernel log

Run dmesg to see the kernel output:

$ sudo dmesg
......
[113194.864193] Size of task_struct: 13760 bytes

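In most cases the last line of output is the message printed by the module just inserted. Here sizeof reports a task_struct size of 13760 bytes, about 13.43 KB (13760 / 1024 = 13.4375).

Remove the module: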

$ sudo rmmod task_struct_size

#02. Verifying the kernel stack size

Reuse the module from #01 and add a few more print statements:

$ vim task_struct_size.c

#include <linux/init.h>
#include <linux/module.h>
#include <linux/sched.h>

static int __init task_struct_size_init(void)
{
    pr_info("Size of task_struct: %zu bytes\n", sizeof(struct task_struct));
    pr_info("Size of thread_union: %zu bytes\n", sizeof(union thread_union));
    /* THREAD_SIZE and PAGE_SIZE are unsigned long, so print them with %lu */
    pr_info("Size of THREAD_SIZE: %lu bytes\n", THREAD_SIZE);
    pr_info("Size of PAGE_SIZE: %lu bytes\n", PAGE_SIZE);
    /* KASAN_STACK_ORDER is an int page order, not a byte count */
    pr_info("Size of KASAN_STACK_ORDER: %d bytes\n", KASAN_STACK_ORDER);
    return 0;
}

static void __exit task_struct_size_exit(void)
{
    pr_info("Exiting task_struct size module\n");
}

module_init(task_struct_size_init);
module_exit(task_struct_size_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Imagine Miracle");
MODULE_DESCRIPTION("Module to calculate task_struct size");

After inserting the module, sudo dmesg shows the following:

[52373.661330] Size of task_struct: 13760 bytes
[52373.661341] Size of thread_union: 16384 bytes
[52373.661344] Size of THREAD_SIZE: 16384 bytes
[52373.661346] Size of PAGE_SIZE: 4096 bytes
[52373.661349] Size of KASAN_STACK_ORDER: 0 bytes

thread_union is the same size as THREAD_SIZE: 16384 bytes = 16 KB.
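The three printed values are related: on x86_64, THREAD_SIZE is PAGE_SIZE shifted left by THREAD_SIZE_ORDER, and THREAD_SIZE_ORDER is 2 plus KASAN_STACK_ORDER. The user-space sketch below recomputes the figure under that assumption (as I read arch/x86/include/asm/page_64_types.h); the function names are hypothetical and PAGE_SHIFT is fixed at 12 to match the 4096-byte page size in the log.

```c
#include <assert.h>

/* Assumed: 4 KB pages, matching the PAGE_SIZE value printed above. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* x86_64: the stack order is 2, plus one more when KASAN enlarges the stack. */
static unsigned int thread_size_order(unsigned int kasan_stack_order)
{
    assert(kasan_stack_order < 8);  /* keep the shift well within range */
    return 2 + kasan_stack_order;
}

/* Kernel stack size in bytes: PAGE_SIZE << order. */
static unsigned long thread_size(unsigned int kasan_stack_order)
{
    return PAGE_SIZE << thread_size_order(kasan_stack_order);
}
```

With KASAN_STACK_ORDER equal to 0, as in the log, thread_size(0) is four pages, which matches the 16384 bytes reported for both THREAD_SIZE and thread_union.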

Run sudo rmmod task_struct_size to remove the module.

$ sudo rmmod task_struct_size

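#03. Verifying the kernel stack layout

Once again the method is a simple kernel module that tests the hypothesis, here whether the addresses of task_struct, thread_info and thread_union (the kernel stack) are the same.

Create the kernel module source file task_struct_addr.c.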

$ vim task_struct_addr.c

#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/sched.h>

static int __init print_stack_address_init(void)
{
    /* %lx expects an unsigned long, so cast the pointers to an integer type */
    pr_info("task_struct address: %lx\n", (unsigned long)current);
    pr_info("task_struct->stack(stack base): %lx\n", (unsigned long)current->stack);
    pr_info("task_struct->thread_info address: %lx\n", (unsigned long)&current->thread_info);
    return 0;
}

static void __exit print_stack_address_exit(void)
{
    pr_info("Exiting module.\n");
}

module_init(print_stack_address_init);
module_exit(print_stack_address_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Imagine Miracle");
MODULE_DESCRIPTION("Print kernel stack address.");

Create the Makefile (recipe lines must be indented with a tab):

obj-m += task_struct_addr.o

all:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	make -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

Run make to build the module, insert it with sudo insmod task_struct_addr.ko, and check the kernel log with sudo dmesg.

[72607.977772] task_struct address: ffff88fd9520a940
[72607.977773] task_struct->stack(stack base): ffffca988fdb0000
[72607.977774] task_struct->thread_info address: ffff88fd9520a940

The three addresses are not all the same. Only task_struct and thread_info coincide, because thread_info is the first member of task_struct. This shows that the kernel stack is not laid out the way thread_union describes it.
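That a structure and its first member share an address is guaranteed by the C language, which is why the first and third log lines match. A user-space sketch with hypothetical stand-in types (fake_thread_info, fake_task) shows the same effect, along with the reverse lookup from the member back to the enclosing structure.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for thread_info and task_struct. */
struct fake_thread_info {
    unsigned long flags;
};

struct fake_task {
    struct fake_thread_info thread_info;  /* must stay the first member */
    unsigned int state;
    void *stack;                          /* points to separately allocated memory */
};

/* A struct and its first member always share the same address. */
static bool thread_info_shares_address(const struct fake_task *t)
{
    assert(t);
    return (const void *)t == (const void *)&t->thread_info;
}

/* Recover the enclosing task from its thread_info, container_of style. */
static struct fake_task *task_of(struct fake_thread_info *ti)
{
    assert(ti);
    return (struct fake_task *)((char *)ti - offsetof(struct fake_task, thread_info));
}
```

The stack pointer member, by contrast, holds the address of a different allocation, so it has no reason to match the address of the task itself, just as in the log.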

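#04. Verifying the stack pointer position

The procedure is the same as in the previous sections, so only the code and its output are shown here.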

#include <linux/init.h>
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/sched.h>

static int __init print_stack_address_init(void)
{
    unsigned long sp = 0;
    unsigned long bp = 0;
    /*
     * Cast to an integer before adding: adding THREAD_SIZE to an
     * unsigned long ** would scale the offset by sizeof(pointer).
     */
    unsigned long base = (unsigned long)current->stack;

    /* Read the current stack pointer and frame pointer */
#ifdef CONFIG_X86_64
    asm volatile("mov %%rsp, %0" : "=r" (sp)); /* x86_64 uses rsp */
    asm volatile("mov %%rbp, %0" : "=r" (bp)); /* x86_64 uses rbp */
#else
    asm volatile("mov %%esp, %0" : "=r" (sp)); /* x86_32 uses esp */
    asm volatile("mov %%ebp, %0" : "=r" (bp)); /* x86_32 uses ebp */
#endif

    pr_info("task_struct->stack(stack base): %lx\n", base);
    pr_info("task_struct->stack(stack top): %lx\n", base + THREAD_SIZE);
    pr_info("Current stack pointer (sp): %lx\n", sp);
    pr_info("Current frame pointer (bp): %lx\n", bp);
    return 0;
}

static void __exit print_stack_address_exit(void)
{
    pr_info("Exiting module.\n");
}

module_init(print_stack_address_init);
module_exit(print_stack_address_exit);

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Imagine Miracle");
MODULE_DESCRIPTION("Print kernel stack address.");

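Output (captured from a run that added THREAD_SIZE to an unsigned long ** pointer, so the "stack top" line is base + 0x20000 instead of base + THREAD_SIZE = base + 0x4000, which would be ffffca9889e88000):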

[74029.947307] task_struct->stack(stack base): ffffca9889e84000
[74029.947309] task_struct->stack(stack top): ffffca9889ea4000
[74029.947311] Current stack pointer (sp): ffffca9889e87928
[74029.947313] Current frame pointer (bp): ffffca9889e87940