IMPLEMENTATION OF FORK() IN OPENBSD 4.1 --------------------------------------- A study Deepak Nagaraj http://frodo.syminet.com/~deep/ * The fork() system call is implemented in OpenBSD in the file sys/kern/kern_fork.c. The entry function is sys_fork(), as defined in the system call table sysent[] in sys/kern/init_sysent.c. sys_fork() has the following prototype: int sys_fork(struct proc *, void *, register_t *); The first is the proc struct for the parent, the last is a pointer to a 32-bit integer array for x86 systems. The second argument is not used. sys_fork() almost immediately calls fork1() in the same file, which has the guts of the fork-semantics. The call is as follows: return (fork1(p, SIGCHLD, flags, NULL, 0, fork_return, NULL, retval, NULL)); Compare with fork1()'s prototype: int fork1(struct proc *p1, int exitsig, int flags, void *stack, size_t stacksize, void (*func)(void *), void *arg, register_t *retval, struct proc **rnewprocp); * Let us look at the body of fork1(). First we make a check on whether we are beyond the maximum number of processes allowed in the system, maxprocs. For a non-root user, this is maxprocs-5, because the last 5 processes are reserved for root. If this condition is true, a syslog message saying "proc table is full" is printed, and an error EAGAIN is returned. If we are still within the maxproc limits, the total current number of processes (nprocs) and the number of processes for the current user are both incremented. If this is greater than the ulimit set for the current user,the changes are reverted and an error EAGAIN is returned. If this is okay, we attempt to allocate memory for u-area of the process in the kernel: uaddr = uvm_km_alloc1(kernel_map, USPACE, USPACE_ALIGN, 1); If this fails, we revert the changes so far and return with error code ENOMEM. At this point, we start to "fork". * First, we allocate a new proc structure, and set its status to a special value, SIDL - "being created by fork". (Later upon completion, we set it to RUN, meaning runnable.) We set the signal to be delivered to parent upon exit as SIGCHLD. We set the previous and next pointers to NULL; these are pointers used to look at the previous and next processes in the "runqueue". These pointers are updated later when the process is added to the runqueue. We reset scheduling information, timers, pending signals etc. for the child. We copy signal mask, priority, process group, etc. from the parent. This is achieved by bzero() and bcopy() operations on the parent's proc struct, because the fields are contiguous: bzero(&p2->p_startzero, (unsigned) ((caddr_t)&p2->p_endzero - (caddr_t)&p2->p_startzero)); bcopy(&p1->p_startcopy, &p2->p_startcopy, (unsigned) ((caddr_t)&p2->p_endcopy - (caddr_t)&p2->p_startcopy)); The start and end fields above are merely macros to point to the starting and ending fields in the process structure. The following lines detail most of the fields that are cleared or copied in the above calls. sys/sys/proc.h: struct proc { .... /* The following fields are all zeroed upon creation in fork. */ #define p_startzero p_oppid pid_t p_oppid; .... /* scheduling */ u_int p_estcpu; /* Time averaged value of p_cpticks. */ int p_cpticks; /* Ticks of cpu time. */ fixpt_t p_pctcpu; /* %cpu for this process during p_swtime */ void *p_wchan; /* Sleep address. */ struct timeout p_sleep_to;/* timeout for tsleep() */ const char *p_wmesg; /* Reason for sleep. */ u_int p_swtime; /* Time swapped in or out. */ u_int p_slptime; /* Time since last blocked. */ .... int p_schedflags; /* PSCHED_* flags */ struct itimerval p_realtimer; /* Alarm timer. */ struct timeout p_realit_to; /* Alarm timeout. */ struct timeval p_rtime; /* Real time. */ u_quad_t p_uticks; /* Statclock hits in user mode. */ u_quad_t p_sticks; /* Statclock hits in system mode. */ u_quad_t p_iticks; /* Statclock hits processing intr. */ int p_traceflag; /* Kernel trace points. */ struct vnode *p_tracep; /* Trace to vnode. */ void *p_systrace; /* Back pointer to systrace */ int p_ptmask; /* Ptrace event mask */ struct ptrace_state *p_ptstat; /* Ptrace state */ int p_siglist; /* Signals arrived but not delivered. */ struct vnode *p_textvp; /* Vnode of executable. */ struct emul *p_emul; /* Emulation information */ void *p_emuldata; /* Per-process emulation data, or */ .... /* End area that is zeroed on creation. */ #define p_endzero p_startcopy /* The following fields are all copied upon creation in fork. */ #define p_startcopy p_sigmask sigset_t p_sigmask; /* Current signal mask. */ sigset_t p_sigignore; /* Signals being ignored. */ sigset_t p_sigcatch; /* Signals being caught by user. */ u_char p_priority; /* Process priority. */ u_char p_usrpri; /* User-priority based on p_cpu and p_nice. */ char p_nice; /* Process "nice" value. */ char p_comm[MAXCOMLEN+1]; struct pgrp *p_pgrp; /* Pointer to process group. */ vaddr_t p_sigcode; /* user pointer to the signal code. */ /* End area that is copied on creation. */ #define p_endcopy p_addr struct user *p_addr; /* Kernel virtual addr of u-area (PROC ONLY). */ .... }; We then initialize sleep- and alarm-timers. We reset all process flags by setting p_flag field of the proc struct to 0. OpenBSD can emulate executables of various other operating systems. If the parent is such an emulated executable, some extra information is stored in the proc struct. We point the child's emulation struct to the parent's. If process profiling is on, we start the profiler clock. We copy setuid flags from parent to child. If the parent is being ptrace'd and it likes to inherit this on fork, we set it on child too. (XXX: I did not see any place where PTRACE_FORK is set on a process, so maybe this is dead code?) We copy owner credentials from parent to child, and set refcount to 1. We set text-segment to parent's, incrementing its refcount. The child's file descriptors are copied from parent by default. However, this can be overridden to build afresh or to share with parent. We point the child's process resource limit structure to the parent's and increment its refcount. (An exception is if PL_SHAREMOD flag is set, but again, I did not see any place where this is set - dead code?) If parent has a controlling terminal, we mark the child as having controlling terminal too. We then make two special checks. One is for vfork(): if it is a vfork() call, we set a special flag P_PPWAIT on the child. If it is an rfork() call with "no wait", then we set a special flag P_NOZOMBIE on the child. (P_NOZOMBIE implies that the parent does not wish to collect data about child after its exit, so that data can be free'd immediately. Use of P_PPWAIT is explained later.) We set the child's parent proc struct to the parent proc struct. We initialize "children of this process" list to NULL. After this is some rthread- and trace-related code that we may ignore for our purpose. We then inherit parent's scheduler history to child (estcpu, i.e. average cputicks). We copy sigacts from parent to child, unless a flag is set for it to be shared -- as in the case of clone() call under Linux compatibility. At this point, if we are emulating another executable and that has a "hook" to be called at fork(), we call this function. For example, for Linux ELF emulation, the function is linux_e_proc_fork(). We now update the child's p_addr pointer to the u-area we allocated much earlier. * We then call uvm_fork(), which copies the process's memory area and also arranges for CPU-dependent parts of the fork-semantics so that the returns can occur: (Note that func is fork_return, as passed to fork1() from sys_fork(). However, arg is NULL, so the child proc struct p2 is passed.) uvm_fork(p1, p2, ((flags & FORK_SHAREVM) ? TRUE : FALSE), stack, stacksize, func ? func : child_return, arg ? arg : p2); We will describe uvm_fork() later. We then set virtual timer and profiler alarm timeouts. We update some statistical counters. * We find an unused pid, usually 1+lastpid, and assign it to the child. do { lastpid = 1 + (randompid ? arc4random() : lastpid) % PID_MAX; } while (pidtaken(lastpid)); p2->p_pid = lastpid; We insert the child in front of the allproc list. We insert a hash of its process id into a pid hash list. We insert child into the children list of parent. We also insert it into the process-group list. If ptrace is set on the child, we reparent it to parent's parent. We set fork as the event to be reported. * We lock the scheduler. We get the current system time and set it as the start time for the process. We set a flag in the child to denote it has been "forked but not exec'ed". We change the state to SRUN (runnable) and insert it into the run-queue. We then unlock the scheduler. We update some statistical counters. We put parent to sleep if P_PPWAIT flag is set (as in vfork()). We send a signal SIGTRAP to parent if it is being ptrace'd. We set return value to be the child's pid, and return as parent. (The child will return to userspace in uvm_fork() call above, in its very first timeslice of execution.) This value will be returned to user through syscall() interface (see sys/arch/i386/i386/trap.c). retval[0] goes to EAX and retval[1] to EDX. * * * uvm_fork() is a function in sys/uvm/uvm_glue.c. It calls uvmspace_fork() to copy memory from parent to child. First we copy the vmspace struct. vm2 = uvmspace_alloc(old_map->min_offset, old_map->max_offset, (old_map->flags & VM_MAP_PAGEABLE) ? TRUE : FALSE); memcpy(&vm2->vm_startcopy, &vm1->vm_startcopy, (caddr_t) (vm1 + 1) - (caddr_t) &vm1->vm_startcopy); Refer: struct vmspace { struct vm_map vm_map; /* VM address map */ int vm_refcnt; /* number of references */ caddr_t vm_shm; /* SYS5 shared memory private data XXX */ /* we copy from vm_startcopy to the end of the structure on fork */ #define vm_startcopy vm_rssize segsz_t vm_rssize; /* current resident set size in pages */ segsz_t vm_swrss; /* resident set size before last swap */ segsz_t vm_tsize; /* text size (pages) XXX */ segsz_t vm_dsize; /* data size (pages) XXX */ segsz_t vm_dused; /* data segment length (pages) XXX */ segsz_t vm_ssize; /* stack size (pages) */ caddr_t vm_taddr; /* user virtual address of text XXX */ caddr_t vm_daddr; /* user virtual address of data XXX */ caddr_t vm_maxsaddr; /* user VA at max stack growth */ caddr_t vm_minsaddr; /* user VA at top of stack */ }; Then we walk through the list of vm pages. For each entry, we take one of three actions: * we ignore it * we share it * we copy it This is based on the "inheritance" field of the vm page. In the last case, we do not actually "copy" but set the page as "copy-on-write". * We then zero out or inherit some statistics pertaining to the child: p2->p_stats = &up->u_stats; memset(&up->u_stats.pstat_startzero, 0, ((caddr_t)&up->u_stats.pstat_endzero - (caddr_t)&up->u_stats.pstat_startzero)); memcpy(&up->u_stats.pstat_startcopy, &p1->p_stats->pstat_startcopy, ((caddr_t)&up->u_stats.pstat_endcopy - (caddr_t)&up->u_stats.pstat_startcopy)); struct pstats { #define pstat_startzero p_ru struct rusage p_ru; /* stats for this proc */ struct rusage p_cru; /* sum of stats for reaped children */ struct itimerval p_timer[3]; /* virtual-time timers */ #define pstat_endzero pstat_startcopy #define pstat_startcopy p_prof struct uprof { /* profile arguments */ caddr_t pr_base; /* buffer base */ size_t pr_size; /* buffer size */ u_long pr_off; /* pc offset */ u_int pr_scale; /* pc scaling */ u_long pr_addr; /* temp storage for addr until AST */ u_long pr_ticks; /* temp storage for ticks until AST */ } p_prof; #define pstat_endcopy p_start struct timeval p_start; /* starting time */ Finally, we call a function cpu_fork(), which has machine-dependent code to execute some parts of fork(). * We will describe cpu_fork() for the x86 architecture. sys/arch/i386/i386/vm_machdep.c: void cpu_fork(p1, p2, stack, stacksize, func, arg) struct proc *p1, *p2; void *stack; size_t stacksize; void (*func)(void *); void *arg; We first copy machine-dependent flags from parent to child. We then copy the process control block (PCB) from parent to child. The PCB is an architecture-dependent structure that is defined as follows for x86: sys/i386/include/pcb.h: struct pcb { struct i386tss pcb_tss; #define pcb_cr3 pcb_tss.tss_cr3 #define pcb_esp pcb_tss.tss_esp #define pcb_ebp pcb_tss.tss_ebp #define pcb_cs pcb_tss.tss_cs #define pcb_ldt_sel pcb_tss.tss_ldt int pcb_tss_sel; union descriptor *pcb_ldt; /* per process (user) LDT */ int pcb_ldt_len; /* number of LDT entries */ int pcb_cr0; /* saved image of CR0 */ int pcb_pad[2]; /* savefpu on 16-byte boundary */ union savefpu pcb_savefpu; /* floating point state for FPU */ struct emcsts pcb_saveemc; /* Cyrix EMC state */ /* * Software pcb (extension) */ caddr_t pcb_onfault; /* copyin/out fault recovery */ int vm86_eflags; /* virtual eflags for vm86 mode */ int vm86_flagmask; /* flag mask for vm86 mode */ void *vm86_userp; /* XXX performance hack */ struct pmap *pcb_pmap; /* back pointer to our pmap */ struct cpu_info *pcb_fpcpu; /* cpu holding our fpu state */ u_long pcb_iomap[NIOPORTS/32]; /* I/O bitmap */ u_char pcb_iomap_pad; /* required; must be 0xff, says intel */ }; We then copy the trapframe. We can also alter the stack pointer if the child must be given a different stack. /* * Copy the trapframe, and arrange for the child to return directly * through rei(). */ p2->p_md.md_regs = tf = (struct trapframe *)pcb->pcb_tss.tss_esp0 - 1; *tf = *p1->p_md.md_regs; /* * If specified, give the child a different stack. */ if (stack != NULL) tf->tf_esp = (u_int)stack + stacksize; We then execute func(), which is fork_return() in our case, with the child (p2) proc struct as argument. sf = (struct switchframe *)tf - 1; sf->sf_ppl = 0; sf->sf_esi = (int)func; sf->sf_ebx = (int)arg; sf->sf_eip = (int)proc_trampoline; pcb->pcb_esp = (int)sf; proc_trampoline is an asm function that calls whatever address is in ESI register, in our case fork_return(): sys/arch/i386/i386/locore.s movl $IPL_NONE,CPL pushl %ebx call *%esi addl $4,%esp fork_return() sends a SIGTRAP signal to the process if is ptrace'd, and then calls machine-dependent child_return() (sys/arch/i386/i386/trap.c). This function sets the return value to be 0: tf->tf_eax = 0; tf->tf_eflags &= ~PSL_C; and calls an inline function userret(), that posts pending signals and changes CPU priority to userland: p->p_cpu->ci_schedstate.spc_curpriority = p->p_priority = p->p_usrpri; * This completes our discussion of fork() in OpenBSD. The reader can consult the following manual pages for some more information: fork(2), vfork2(2), rfork(2), fork1(9), syscall(2), ptrace(2)