LiPPGen

The Literate-Programming-based Presentation Generator

Version 1.0, (c) 2013 Hans-Georg Eßer

Title:
Author:
Organisation:

TEXTCODE

Forking a Process

\ignore{

Things to think of when forking:

  • Copy process' address space
  • Create new process entry in process list
  • After having duplicated the process, there are two processes which are BOTH just inside the OS' implementation of fork(). So we need to be prepared that both processes will (at some time) continue in kernel mode and finish the execution of fork(). see http://www.jamesmolloy.co.uk/tutorial_html/9.-Multitasking.html for a fork() implementation

} We're getting closer to having a multitasking operating system. We only use the function start_program_from_disk for loading the first (initial) process---for everything else we want to implement the standard Unix way of creating new processes: the fork. Basically, when a Unix process calls fork() (and thus enters the fork system call), the operating system creates a duplicate of the currently running process. After a successful fork operation we have two processes which are almost identical. That means:

  • Both processes execute the same program (i.\,e., the same binary is loaded in their lower memory areas),
  • variables and dynamic memory have identical contents, but the memory is duplicated since both processes may make different changes to that memory once they continue running after the fork.
  • They also have their own copies of the user mode and kernel mode stack.
  • Most process metadata (the contents of the thread control block) are identical as well, with two important exceptions: the new process has its own process ID (or thread ID), and the new process stores the old process' thread ID in its parent process ID field (ppid).
  • After the fork, both processes return from the fork system call and continue execution in the instruction immediately following the system call---so they need a way to find out whether they are the original process (called \marginnote{Parent Process} parent) or the newly forked process (called \marginnote{Child Process}child). The user mode fork function will return 0 in the child process and the newly created process' ID in the parent process.
While we create a new process we will set its state to TSTATE_FORK to show that its creation is still in progress---now is a good time to present all the possible states that a process can be in:
< constants > ≡
  // Thread states
  #define TSTATE_READY     1   // process is ready
  #define TSTATE_FORK      3   // fork() has not completed
  #define TSTATE_EXIT      4   // process has called exit()
  #define TSTATE_WAITFOR   5   // process has called waitpid()
  #define TSTATE_ZOMBIE    6   // wait for parent to retrieve exit value
  #define TSTATE_WAITKEY   7   // wait for key press event
  #define TSTATE_WAITFLP   8   // wait for floppy
  #define TSTATE_LOCKED    9   // wait for lock
  #define TSTATE_STOPPED  10   // stopped by SIGSTOP signal
  #define TSTATE_WAITHD   11   // wait for hard disk
We also define a list of state names which can be used when displaying the process list:
< global variables > ≡
  char *state_names[12] = {
    "---",   "READY", "---",   "FORK", "EXIT", "WAIT4", "ZOMBY", "W_KEY",   // 0.. 7
    "W_FLP", "W_LCK", "STOPD", "W_IDE"                                      // 8..11
  };
Our goal for this section is to implement the function
< function prototypes > ≡
  int u_fork (context_t *r, boolean create_thread, void *start_address);
which will later be called from the fork system call handler (see page \pageref{syscall:fork}). Since we will need a lot of memory copying operations, we declare two macros phys_memcpy which lets us copy physical memory areas and copy_frame which copies page frames:
< macro definitions > ≡
  #define phys_memcpy(target, source, size) \
    (unsigned int)memcpy ( (void*)PHYSICAL(target), (void*)PHYSICAL(source), size)
  #define copy_frame(out, in)  phys_memcpy (out << 12, in << 12, PAGE_SIZE)
So, phys_memcpy does the same as memcpy but expects its first two arguments to be physical addresses (instead of virtual ones), and copy_frame provides a shortcut for copying physical frames since for that task the number of bytes to copy is always PAGE_SIZE. Next comes the definition of u_fork. This function will be called when the fork system call is executed.
< function implementations > ≡
  int u_fork (context_t *r, boolean create_thread, void *start_address) {
    TCB           *t_old, *t_new;       // pointers to old/new TCB
    int           old_tid, new_tid;     // thread IDs (old/new)
    int           i, j;                 // counters
    unsigned int  eip, esp, ebp;        // temp variables for register values
    addr_space_id old_as, new_as;       // old/new address spaces

    
fork implementation >
  }
Now, here's the actual implementation. We will present it in several steps and discuss what's happening. We start by cloning the current TCB into a free TCB which we first have to search for:
< fork implementation > ≡
  disable interrupts >
  LOCK (thread_list_lock);
  old_as = current_as;  old_tid = current_task;  int ppid = old_tid;
    
  if (!create_thread) {
    // regular fork: clone kernel part of PD; reserve user part of memory
    new_as = create_new_address_space (
      address_spaces[old_as].memend - address_spaces[old_as].memstart,
      address_spaces[old_as].stacksize );  
  } else {
    new_as = old_as;                  // creating a new thread!
    address_spaces[old_as].refcount++;
  }                                             // TO DO: ERROR CHECK!
    
  new_tid = register_new_tcb (new_as);          // TO DO: ERROR CHECK!
  t_old = &thread_table[old_tid]; t_new = &thread_table[new_tid];
  *t_new = *t_old;               // copy the TCB
  
fork: fill new TCB >
  if (!create_thread) {
    // copy the memory
    
fork: create new kernel stack and copy the old one >
    
fork: copy user mode memory >
  }
    
  UNLOCK (thread_list_lock);

  eip = get_eip ();              // get current EIP
  if (current_task == ppid) { 
fork: parent-only tasks > } 
  else                      { 
fork: child-only tasks > }
with
< fork: fill new TCB > ≡
  // get rid of the copying; some data were already setup in register_new_tcb()
  t_new->state      = TSTATE_FORK;
  t_new->tid        = new_tid;
  t_new->ppid       = old_tid;   // set parent process ID
  t_new->addr_space = new_as;
  // t_new->new        = true;      // mark new process as new

  // copy current registers to new thread, except EBX (= return value)
  t_new->regs       = *r;
  t_new->regs.ebx   = 0;         // in the child fork() returns 0

  // copy current ESP, EBP
  asm volatile("mov %%esp, %0" : "=r"(esp));  // get current ESP
  asm volatile("mov %%ebp, %0" : "=r"(ebp));  // get current EBP
  t_new->ebp  = ebp;
  t_new->esp0 = esp;
  // t_new->esp0 = 0xbffffee8;  // ????
< fork: parent-only tasks > ≡
  t_new->eip  = eip;
  add_to_ready_queue (new_tid);
  debug_printf ("fork going to return %d\n", new_tid);
  
enable interrupts >            // must be done in parent
  return new_tid;                // in parent, fork() return child's PID
< fork: child-only tasks > ≡
  debug_printf ("fork going to return 0 \n");
  
enable interrupts >

  if (create_thread) {           
    printf ("THREAD\n");
    // set correct start address. on stack, not via jmp!
    // asm ("jmp *%0" : : "r"((unsigned int)start_address));
  }

  return 0;                      // in child, fork() returns 0
< fork: create new kernel stack and copy the old one > ≡
    // create new kernel stack and copy the old one
    page_table* stackpgtable = (page_table*)request_new_page();
      // will be removed in destroy_address_space()
    address_spaces[new_as].kstack_pt = (unsigned int)stackpgtable;
    memset (stackpgtable, 0, sizeof(page_table));
    page_directory *tmp_pd;
    tmp_pd = address_spaces[new_as].pd;
    KMAPD ( &tmp_pd->ptds[767], mmu (0, (uint)stackpgtable) );
    
    uint framenos[KERNEL_STACK_PAGES];   // frame numbers of kernel stack pages

    for (i = 0;  i < KERNEL_STACK_PAGES;  i++)
      framenos[i] = request_new_frame();
      //will be removed in destroy_address_space()

    for (i=0; i       as_map_page_to_frame (new_as, 0xbffff - i, framenos[i]);

    // copy each page separately: they need not be physically connected or in order
    unsigned int base = 0xc0000000-KERNEL_STACK_SIZE;
    for (i = 0;  i < KERNEL_STACK_PAGES;  i++)
      phys_memcpy ( mmu(new_as, base + i*PAGE_SIZE),
                    mmu(old_as, base + i*PAGE_SIZE), PAGE_SIZE );
    debug_printf ("u_fork: memcpy done\n");
< function prototypes > +≡
  extern unsigned int get_eip ();
< start.asm > ≡
  global get_eip
  get_eip:
    pop eax
    push eax
    ret
Copying the user mode memory means copying the first 3 GByte except the kernel stack ... (EXPLAIN!)
< fork: copy user mode memory > ≡
    // clone first 3 GB (minus last directory entry) of address space
    page_directory *old_pd, *new_pd;
    page_table     *old_pt, *new_pt;
    old_pd = address_spaces[old_as].pd;
    new_pd = address_spaces[new_as].pd;

    for (i = 0;  i<767;  i++) {          // only 0..766, not 767 (= kstack)
      if (old_pd->ptds[i].present) {
        // walk through the entries of the page table
        old_pt = (page_table*)PHYSICAL(old_pd->ptds[i].frame_addr << 12);
        new_pt = (page_table*)PHYSICAL(new_pd->ptds[i].frame_addr << 12);
        for (j = 0;  j < 1024;  j++)
          if (old_pt->pds[j].present)
            copy_frame ( new_pt->pds[j].frame_addr, old_pt->pds[j].frame_addr );
      };
    };
\ignore{

Memory Usage

How many frames does it take to fork a process? As an example, we look at the shell process. When it forks, 24 new frames are needed:
  • 1 frame for the new kernel stack page table (category 1)
  • 4 frames for the new kernel stack (defined in KERNEL_STACK_PAGES) (category 2)
  • 1 frame for the user memory page tables (last 3 GB less the last directory entry) (category 3)
  • 16 frames for 64 KB user memory pages (category 4)
  • 1 frame for the new page directory (category 5)
  • 1 frame for first page user memory table in create_new_address_space (category 92) -- this is redundant, 23 frames (1 less) should be enough. AND THAT IS JUST WHAT WE LOSE WHEN wE FORK / EXIT.
} We can now add the fork system call:
< syscall prototypes > ≡
  void syscall_fork (context_t *r);
< syscall functions > ≡
  void syscall_fork (context_t *r) {
    r->ebx = (unsigned int) u_fork(r, false, NULL);   // false: no thread, starts at 0
    return;
  };
< initialize syscalls > ≡
  install_syscall_handler (__NR_fork, syscall_fork);

Task switches in ULIXI{

} Thanks to the forking code above, we can now have more than one process---but that means we also have to write code which lets ULIX-i386 switch between several tasks. This section is about the task switch. Understanding the switch basically means looking at the functions and stacks. When switching from process $A$ to process $B$, we expect the following to happen:
  1. Process $A$ is executing, it runs in user mode, using its user mode stack.
  2. A timer interrupt (IRQ 0) occurs. The CPU switches to kernel mode; this also switches the stack to the kernel stack. (Its address is in the TSS.) The CPU then jumps to the interrupt handler registered for interrupt 0 which does the following: \begin{Verbatim} irq0: cli push byte 0 push byte 32 jmp irq_common_stub irq_common_stub: pusha push ds push es push fs push gs push esp call irq_handler pop esp pop gs pop fs pop es pop ds popa add esp, 8 iret \end{Verbatim} So after pushing 0 (an empty error code) and 32 (that is 32+0, where 0 is the IRQ number) onto the stack, it saves all relevant registers on the stack and then calls irq_handler which is a C function: \begin{Verbatim} void irq_handler (context_t *r) { ... handler = interrupt_handlers[r->int_no - 32]; if (handler != NULL) handler(r); } \end{Verbatim} The generic irq_handler looks up the correct interrupt service routine for the timer (it calculates r->int_no-32 which in this case is $32-32=0$, finds the entry in interrupt_handlers (that is timer_handler) and then calls it.
  3. Next, the timer_handler checks whether it is time to call the scheduler and (if so) calls it: \begin{Verbatim} if (system_ticks ... scheduler (r, SCHED_SRC_TIMER); ... } \end{Verbatim}
  4. So if it decides to call the scheduler, it enters \begin{Verbatim} void scheduler (context_t *r) { ... } \end{Verbatim} which is the function that we have to implement.

Stack Usage

We need to keep track of which stacks are in use and what contents are stored on these stacks. Every process has a private user mode stack and a private kernel mode stack.
  1. When the current process runs (in user mode) and a timer interrupt occurs, the CPU checks the Task State Segment (TSS) to find the current top of the stack for kernel mode: it is stored in the ESP0 entry. (It also retrieves the new value for the SS register.) It switches to the new stack (by changing the ESP and SS registers) and pushes the old values of SS and ESP as well as the contents of EFLAGS, CS, and EIP onto the new stack. Then it starts executing the interrupt handler code. The kernel stack now looks like this: \begin{Verbatim} SS ESP EFLAGS CS EIP <- Stack (t=0) \end{Verbatim}
  2. The interrupt handler entry irq0 (for IRQ 0) pushes 0 and 32 onto the stack and jumps to irq_common_stub which pushes EAX, ECX, EDX, EBX, ESP (t=0), EBP, ESI, EDI, DS, ES, FS, and GS, resulting in \begin{Verbatim} SS ESP EFLAGS CS EIP <- Stack (t=0) 0 (err_code) 32 (int_no) (t=1) EAX ECX EDX EBX ESP (t=1) EBP ESI EDI DS ES FS GS <- Stack (t=2) \end{Verbatim} Then it pushes the current ESP (t=2) which is now properly setup as the address of the context_t structure holding all the registers. again and calls the C function irq_handler (which pushes the return address and jumps to the entry address of the C function's code).
  3. When irq_handler starts, it expects to find two values on the stack: the return address and one argument. Since it takes a context_t * as argument and we have just prepared the stack so that it exactly fits this structure: \begin{Verbatim} typedef struct { uint gs, fs, es, ds; uint edi, esi, ebp, esp, ebx, edx, ecx, eax; uint int_no, err_code; uint eip, cs, eflags, useresp, ss; } context_t; \end{Verbatim} the function can then access the elements of the context_t structure.
  4. irq_handler calls the C function timer_handler and passed the pointer to our context_t structure as its single argument.
  5. Lastly, timer_handler calls scheduler (passing that pointer once more).
  6. Now, scheduler is executing and it can access the register values which we've set up on the stack early in the assembler code and passed down all the way to the scheduler. Since we've only passed a pointer all the time (call by reference), the scheduler can modify the register values and when the whole function stack unwinds later, the pop instructions will write the modified values into the appropriate registers. Note however that somewhere inside scheduler an address space change will occur---for that reason scheduler must not make modifications before the context switch.
Let's first assume that we do not enter the scheduler (because [[system_ticks and do not modify anything relevant to scheduling---we expect the current process to continue running, as it does after other interrupt treatments. If we've just entered the scheduler, what does the stack look like right now? We don't really have to care because all the important information is available via the pointer r to the context_t. Note that at the beginning of the interrupt handling we stored the contents of all registers on the stack, in just the order which conveniently fits the context_t structure definition. We also pass the pointer to this structure to all further functions which get called (when calling handler(r) and then scheduler(r)). So within the scheduler we can look at r to see the state as it was before the timer interrupt occurred. We can also modify the register values in r, and when we later return from the scheduler and switch back to user mode, the changes we make will be written back to the registers (the pop commands at the end of irq_common_stub do this for us). Whether we've come from user mode or from kernel mode (e.\,g. from a process that was executing a system call when the timer interrupt fired) does not matter since all relevant registers have been saved and will be restored. When we schedule, we select the new process and then store all registers (data that r points to) in the old TCB. Then we switch the address space (the CPU's pointer to the page directory), and then we load the new TCB contents in the registers. After that we can return. \ignore{ If fork() copies the kernel stack then all (kernel) addresses on the new kernel stack are wrong since they point to the old stack. See http://newlife.googlecode.com/svn-history/r90/kernel/schedule/Scheduler.cpp, section with copy kernel stack. Can we live with a single kernel stack, see [Warton05singlekernel], file: Warton-Single-Kernel-Stack.pdf. See also [Draves:1994:Control-Transfer] (PhD thesis: Control Transfer). See also [Mukherjee93asurvey] (A Survey of Multiprocessor Operating System Kernels) See FreeBSD scheduler [Vidstrom:2004:FreeBSD-ia32]. IDEE: put kernel stack in the TCB, then we need not memorize the starting address of the stack (e.g. stack = TCB+4096) TODO: COMPLETE REWRITE OF FORK() AND EXEC(). }

The Implementation

Our scheduler has the protoype
< function prototypes > +≡
  void scheduler (context_t *r, int source);
which---as expected---takes a context_t pointer as its first argument. We introduce a second argument source because we will later call the scheduler from within other kernel functions. (So far we only expect to call it from the timer handler.)
< enable scheduler > ≡
  scheduler_is_active = true;
< disable scheduler > ≡
  scheduler_is_active = false;
\ignore{ We add a new entry new to the thread control block structure TCB---it will be set to true during creation of a new process via fork: < < more TCB entries > >= boolean new; // was this process freshly forked? @ } We declare two global variables in the kernel address space which will later come in handy when we have to remember information about the current and next process:
< global variables > +≡
  TCB *t_old; static TCB *t_new;
< function implementations > +≡
  void scheduler (context_t *r, int source) {
    debug_printf ("*");
    
scheduler implementation >
  }
Our implementation starts with resetting a global variable inside_yield whose purpose we will explain later on; it is necessary for the implementation of the waitpid system call. Next it checks whether there are any zombie processes and tries to get rid of them (see further down). Then it looks at the global variable
< global variables > +≡
  int scheduler_is_active = false;
to determine whether it shall try to actually attempt a context switch. Another reason for inactivity is the existence of only one process:
< scheduler implementation > ≡
  inside_yield = false;                // reset inside_yield value
                                       // it may have been set from syscall_yield
    
  
scheduler: check for zombies >     // deal with zombies if we have any
    
  // check if we want to run the scheduler
  if (!scheduler_is_active)  return;
  if (!thread_table[2].used) return;   // are there already two threads?
With all obstacles removed, the real scheduling process can begin. The scheduler lets t_old point to the current process and then finds out which process it should switch to, storing the result in t_new
< scheduler implementation > +≡
  t_old = &thread_table[current_task];
  
scheduler: find next process and set [[t_new]] >
  if (t_new != t_old) { 
scheduler: context switch > }
  
scheduler: check pending signals > // see chapter on signals
  
scheduler: free old kernel stacks >  // if there are any 
  return;
We will implement the code chunk < scheduler: check pending signals > in chapter where we introduce signals.

A Simple Round-Robin Strategy

In this early stage we will not attempt any sophisticated scheduling strategy; we will just use a simple Round-Robin system. The search for the next process goes like this:
< scheduler: find next process and set [[t_new]] > ≡
  int tid = current_task;  // start search at current task
  search:
  if (source == SCHED_SRC_WAITFOR) {
    // we cannot use the ->next pointer!
    debug_printf ("scheduler called from syscall_waitpid(). tid(old) = %d, ", tid);    
    tid = thread_table[1].next;   // ignore idle process
    debug_printf ("tid(new) = %d\n", tid);
  } else {
    tid = t_old->next;
  }
  if (tid == 0)  // end of queue reached
    tid = thread_table[1].next;   // ignore idle process
  if (tid == 0)  // still 0? run idle task
    tid = 1; // idle
  t_new = &thread_table[tid];
  if (t_new->addr_space == 0 || t_new->state != TSTATE_READY) 
    goto search; // continue searching
  // found it!
Before implementing the actual context switch, let's first observe the following facts:
  • We only enter the scheduler (and thus also the context switcher) via timer interrupts.
  • Once we're running inside the scheduler, we know that the kernel stack has been set up in a way that will allow the system to continue operation of the interrupted process---whether it was running in user mode or kernel mode before the interrupt occurred.
  • When we switch the address space, we also switch the kernel stack. However, the stack pointer register \register{ESP} will still point to the top of the old process' kernel stack. We need to remedy that and have it point to the top of the new process' kernel stack. \ignore{
  • We also need to check whether the process we're switching to is new (i.\,e., it has just been created by fork)), because in that case it does not have a proper kernel stack. Our new process has the contents of the kernel stack as they were when the fork system call handler was executed. We must modify that stack so that on returning from the scheduler, the fork handler can finish its work and return to the new process' user mode. }
To make the code more readable, we provide some functions to copy values between variables and the \register{ESP}, \register{EBP}, and \register{CR3} registers:
< macro definitions > +≡
  #define COPY_VAR_TO_ESP(x)  asm volatile ("mov %0, %%esp" :         : "r"(x) )
  #define COPY_VAR_TO_EBP(x)  asm volatile ("mov %0, %%ebp" :         : "r"(x) )
  #define COPY_ESP_TO_VAR(x)  asm volatile ("mov %%esp, %0" : "=r"(x)          )
  #define COPY_EBP_TO_VAR(x)  asm volatile ("mov %%ebp, %0" : "=r"(x)          )
  #define WRITE_CR3(x)        asm volatile ("mov %0, %%cr3" :         : "r"(x) )
Now we can present the code that handles the actual context switch. It is only executed if we truly switch, i.\,e., if t_new $\neq$ t_old:
< scheduler: context switch > ≡
  t_old->regs = *r;               // store old:   registers
  COPY_ESP_TO_VAR (t_old->esp0);  //              esp (kernel)
  COPY_EBP_TO_VAR (t_old->ebp);   //              ebp

  current_task = t_new->tid;
  current_as   = t_new->addr_space;
  current_pd   = address_spaces[t_new->addr_space].pd;
  WRITE_CR3 ( mmu (0, (unsigned int)current_pd) );   // activate address space

  COPY_VAR_TO_ESP (t_new->esp0);  // restore new: esp
  COPY_VAR_TO_EBP (t_new->ebp);   //              ebp
  *r = t_new->regs;               //              registers
\ignore{ This is what needs to be done with the kernel stack of a freshly forked process: Remember that when we forked the process we copied the current \register{EIP} value (while executing in u_fork) to the eip entry of the new thread table entry and we made a copy of the (then current) kernel stack. Now we're switching to that process, and we've already switched the address space. < < fix kernel stack for a freshly forked process > > = debug_printf ("This process is new!\n"); debug_printf ("current_task = t_new->new = false; // asm ("add asm ("push //asm ("ret"); @ } Note: {\it
  • If the handler procedure is going to be executed at a numerically lower privilege level, a stack switch occurs. When the stack switch occurs:
    • [a.] The segment selector and stack pointer for the stack to be used by the handler are obtained from the TSS for the currently executing task. On this new stack, the processor pushes the stack segment selector and stack pointer of the interrupted procedure.
    • [b.] The processor then saves the current state of the EFLAGS, CS, and EIP registers on the new stack (see Figures 5-4).
    • [c.] If an exception causes an error code to be saved, it is pushed on the new stack after the EIP value.
  • If the handler procedure is going to be executed at the same privilege level as the interrupted procedure:
    • [a.] The processor saves the current state of the EFLAGS, CS, and EIP registers on the current stack (see Figures 5-4).
    • [b.] If an exception causes an error code to be saved, it is pushed on the current stack after the EIP value.
\raggedleft (Intel® 64 and IA-32 Architectures Software Developer´s Manual\\ \raggedleft Volume 3A: System Programming Guide, Part 1, p. 199)\\ } See http://www.jamesmolloy.co.uk/tutorial_html/9.-Multitasking.html

Treating Zombie Processes

\comment{ MOVE THIS CODE TO THE SECTION WHERE WE IMPLEMENT syscall_yield (presenting it here is not helpful) } We still need to check for zombies: A zombie is a terminated process whose parent process has not yet been able to retrieve the return value (supplied via exit).
< scheduler: check for zombies > ≡
  for (int pid = 0;  pid < MAX_THREADS;  pid++) {
    if (thread_table[pid].state == TSTATE_ZOMBIE) {
      int ppid = thread_table[pid].ppid;
      // case 1: parent is waiting
      if ( (thread_table[ppid].state == TSTATE_WAITFOR) &&
           (thread_table[ppid].waitfor == pid) ) {
        debug_printf ("exit: remove_from_blocked_queue (%d,%x)\n", 
                      ppid, &waitpid_queue);
        LOCK (thread_list_lock);
        deblock (ppid, &waitpid_queue);
        UNLOCK (thread_list_lock);
        thread_table[pid].state = TSTATE_EXIT;
        // thread_table[pid].used = false;  // -> set this in syscall_waitpid()
      }
      
      // strange case
      if ( (thread_table[ppid].state == TSTATE_WAITFOR) &&
           (thread_table[ppid].waitfor != pid) ) {
        debug_printf ("Strange: process %d has parent %d, but parent waits for "
                      "process %d\n", pid, ppid, thread_table[ppid].waitfor);
      }
        
      // case 2: there is no (more) parent
      if ( ppid == 0 ) {
        thread_table[pid].state = TSTATE_EXIT;
        thread_table[pid].used = false;
      }
    }
  }