Implementation of Processes
We have now written most of the code that we need to introduce the
most important concept: the process. In this chapter we take a first
look at the data structures and kernel functions which will let us
create and schedule processes. We cannot go all the way immediately,
so there's a second process chapter (starting at page
\pageref{chap:processes-part2}) where we'll continue with the code.
- In Chapter we present
the desired memory layout of a ULIX-i386 process and describe our
implementation of address spaces.
- Chapter introduces the central data
structure for processes and threads, the thread control block (TCB),
as well as queues for handling ready and blocked threads.
- Chapter shows what we need
to do in order to start the very first process; all further processes
will be created via the fork mechanism.
- Since forking will require the existence of a system call
interface, it is time to introduce one: we present our
implementation in Chapter .
- With system calls available we present the fork system call
in Chapter .
We will also need to discuss how to switch between threads: once
we have more than one thread, a scheduler must take care of this.
We delay this until Chapter .
Address Spaces for Processes
We will store information about memory usage in a data structure
that we call address space descriptor. The idea is that every process uses its
own address space while several threads (of the same process) share
a common address space.
\green
Address space descriptors are stored in one large
address space table. This table must be finite,
i.e., there must exist a maximum number of address spaces for
the system. This must correspond to the maximum number
of threads MAX_THREADS \black that we'll soon define:
While threads may share an address space, it is impossible for
one thread to use more than one address space. Thus
MAX_ADDR_SPACES has to be $\le$ MAX_THREADS---we give
both constants the same value:
< constants > ≡ #define MAX_ADDR_SPACES 1024
\green
As we will see later, every thread may have its own virtual address
space and needs to own a reference to an address space descriptor.
Even the kernel will have to do that. Since there can be so many
address spaces, we need a shorthand to identify virtual address
spaces. We introduce the type addr_space_id to do this. It is
declared as unsigned int. Basically, an addr_space_id can
be thought of as an index into the address space table. So rather
than storing a complete address space descriptor per thread, we
will store just an address space identifier.
< elementary type definitions > ≡ typedef unsigned int addr_space_id;
We add an addr_space_id element to the TCB:
< more TCB entries > ≡ addr_space_id addr_space; // memory usage
\black
Memory Layout of a Process
Every process needs three (virtual) memory areas.
- Code and Data: We will later compile user mode binaries
which expect to be loaded to the virtual addresses \hex{0} and above.
This area is used for the code (the machine instructions in the binary)
as well as variables defined statically in the program. The
\marginnote{Heap} heap will exist just behind this memory area;
processes can later dynamically expand this memory area using the
sbrk function.
< constants > +≡ #define BINARY_LOAD_ADDRESS 0x0
#define TOP_OF_USER_MODE_STACK 0xb0000000
#define TOP_OF_KERNEL_MODE_STACK 0xc0000000
- User Mode Stack: Every process needs its own stack: That is
where the CPU will store return addresses and arguments whenever the process
makes a function call. We'll use the virtual addresses below
\hex{b0000000} which will leave a lot of space between the code and data
and the stack: We want the stack to grow automatically, so we'll start
with just one single page of memory for the stack and increase it as
needed: When you think of recursive functions where the depth of the
recursion depends on some calculation inside the program, it is clear
that we cannot fix a maximum stack size in advance. Expanding the stack is
a task for the page fault handler which we've already mentioned. You will
see its implementation on page \pageref{chap:pagefault} ff.
- Kernel Stack: For several reasons we need a second stack
when a process switches to kernel mode (using a system call, see Section
). While it would be possible to share one kernel
stack between all the processes, that would also limit us: With a
single kernel stack we would run into problems when two or more
processes need to enter kernel mode at the same time.
There's also a security aspect: The kernel stack may contain kernel data
that the process should not have access to.
We'll put the kernel stack just under the kernel space of memory,
at addresses below \hex{c0000000}.
Thus, the memory layout of a process is as shown in
Fig.~. The double line below
\hex{c0000000} marks the barrier between process-specific
and generic memory ranges: everything above \hex{c0000000} is
globally visible and identical in every address space, whereas
the lower addresses differ for each process, and they do not
exist at all before the first process has been created.
\begin{centering}\begin{tabular}{|r@{\hspace{1mm}}c@{\hspace{1mm}}l|l|c|}
\hline
\multicolumn{3}{|l|}{\textsf{Address Range}} & \textsf{Usage} & \textsf{Access} \\
\hline
\hline
\hex{D4000000} & -- & \hex{FFFFFFFF} & unused & --\\
\hline
\hex{D0000000} & -- & \hex{D3FFFFFF} & 64 MByte Physical RAM (mapped) & K\\
\hline
\hex{C0000000} & -- & \hex{CFFFFFFF} & Kernel Code and Data & K\\
\hline
\hline
\hex{BFFFF000} & -- & \hex{BFFFFFFF} & Kernel Stack (4 KByte = one page) & K\\
\hline
\hex{B0000000} & -- & \hex{BFFFEFFF} & unused & --\\
\hline
\dots & -- & \hex{AFFFFFFF} & User Mode Stack & U\\
\hline
&&&&\\
& & & Heap (can be grown with sbrk) & (U)\\
&&&&\\
\hline
\hex{00000000} & -- & \dots & Process Code and Data & U\\
\hline
\end{tabular}
\caption{This is the memory layout of a process. Address ranges marked `K' in
the last column need kernel privileges to be accessed. Heap areas will only become
available after they are manually requested.}
\end{centering}
Creating a New Address Space
Essentially, an address space is just a fresh page directory. Its
kernel memory part (addresses \hex{c0000000} and higher) will be
identical to the kernel's page directory which we've already
set up earlier.
It is helpful to reconsider how the CPU (or the MMU) accesses
the paging information: A register holds the address of the page
directory which has 1024 entries, each of which is either null or
points to a page table.
The upper $1024-768 = 256$ entries are responsible for the kernel
memory (\hexrange{C0000000}{FFFFFFFF}), and the lower 768 entries are available
for process memory (\hexrange{00000000}{BFFFFFFF}) with the upper
part of each process' private memory (\hexrange{BFFFF000}{BFFFFFFF})
being the kernel stack which is only available in kernel mode.
We want to allow for three different situations, as far as access
to process memory and kernel memory is concerned:
- [pure kernel mode:] The kernel is actively dealing with specific kernel tasks, such as memory management or interrupt service. The kernel's view on memory in this state is basically the same as it was after enabling paging: It sees its own memory (\hexrange{C0000000}{CFFFFFFF}) and the physical memory which is mapped to addresses starting at D0000000. In this mode we will use the original kernel's page directory and we'll use the address space ID 0 to refer to this memory view.
- [process in user mode:] A process is active and running in user mode. It only sees its own memory (\hexrange{00000000}{AFFFFFFF}: code, data, heap, and user mode stack). The paging information will map the kernel stack and the kernel's memory as well, but since it will be marked non-user-mode-accessible, that will be the same as not having it at all when running in user mode. In this mode any attempt to access kernel memory (either its data or its code) will generate a page fault---even if the address is valid.
- [process in kernel mode:] A process has entered kernel mode via a system call or an interrupt has occurred. In this situation the page tables must give access to both the current process' memory and the kernel memory. All 4 GByte of virtual memory are visible. The paging information can be the same as for the process in user mode: the current level of execution (kernel mode instead of user mode) grants the access to all of the virtual memory. (For handling an interrupt it is not necessary to see the current process' user mode memory, so we could switch to pure kernel mode in order to prevent interrupt handlers from looking at process memory. But since we intend to trust our interrupt handlers, we will not do that.)
We reserve memory for an address space list. This list does not hold
the page directory or the page tables it points to, but just the
address of the page directory and some status information:
< type definitions > ≡ typedef struct {
void *pd; // pointer to the page directory
int pid; // process ID (if used by a process; -1 if not)
short status; // are we using this address space?
unsigned int memstart; // first address below 0xc000.0000
unsigned int memend; // last address below 0xc000.0000
unsigned int stacksize; // size of user mode stack
unsigned int kstack_pt; // stack page table (for kernel stack)
unsigned int refcount; // how many threads use this address space?
} address_space;
pd holds the (virtual) address of the page directory.
memstart and memend contain the first and last user mode
address (for code, data, and heap), and stacksize tells the size
of the user mode stack. We also want to keep the address of the
kernel mode stack's page table handy, thus we will store it in kstack_pt.
refcount lets us count how often the address space is used---for
non-threaded processes this value will always be 1.
We define three possible values for the status field of an address
space:
< constants > +≡ #define AS_FREE 0
#define AS_USED 1
#define AS_DELETE 2
The address space list is an array of address space descriptors:
< global variables > ≡ address_space address_spaces[MAX_ADDR_SPACES] = { 0 };
We will need to search for a free address space: The function
< function prototypes > ≡ int get_free_address_space ();
will return an integer value that serves as an index into the table:
< function implementations > ≡ int get_free_address_space () {
addr_space_id id = 0;
while ((id < MAX_ADDR_SPACES) && (address_spaces[id].status != AS_FREE)) id++;
if (id == MAX_ADDR_SPACES) id = -1;
return id;
}
We will use the first entry of the array address_spaces for the kernel
and let it point to the kernel page directory.
We add the code for initializing this entry just after enabling paging:
< enable paging for the kernel > ≡ address_spaces[0].status = AS_USED;
address_spaces[0].pd = &kernel_pd;
address_spaces[0].pid = -1; // not a process
Here is what we need to do in order to create a fresh address space:
We first retrieve a new address space ID and mark its entry in
the table as used. Then
we reserve memory for a new page directory and copy the kernel's page
directory into it. Finally we set up some user space memory and add it to
the page directory:
< function prototypes > +≡ int create_new_address_space (int initial_ram, int initial_stack);
The argument initial_ram defines the amount of process-private
memory that should be allocated at once; similarly initial_stack
is the initial size of the user mode stack, which will initially be
just 4 KByte (though it can grow later).
We expect the initial_ram and initial_stack values to be
multiples of the page size (4 KByte)---if not, we will make them so,
using this macro:
< macro definitions > ≡ #define MAKE_MULTIPLE_OF_PAGESIZE(x) x = ((x+PAGE_SIZE-1)/PAGE_SIZE)*PAGE_SIZE
If the function call is successful, create_new_address_space
returns the ID of the newly created address space, otherwise $-1$:
< function implementations > +≡ int create_new_address_space (int initial_ram, int initial_stack) {
MAKE_MULTIPLE_OF_PAGESIZE (initial_ram);
MAKE_MULTIPLE_OF_PAGESIZE (initial_stack);
// reserve address space table entry
addr_space_id id;
if ( (id = get_free_address_space()) == -1 ) return -1; // fail
address_spaces[id].status = AS_USED;
address_spaces[id].memstart = BINARY_LOAD_ADDRESS;
address_spaces[id].memend = BINARY_LOAD_ADDRESS + initial_ram;
address_spaces[id].stacksize = initial_stack;
address_spaces[id].refcount = 1; // default: used by one process
< reserve memory for new page directory > // sets new_pd
address_spaces[id].pd = new_pd;
< copy master page directory to new directory >
int frameno, pageno; // used in the following two code chunks
if (initial_ram > 0) { < create initial user mode memory > }
if (initial_stack > 0) { < create initial user mode stack > }
return id;
};
As usual we use request_new_page to get a fresh page of virtual
memory which will store the new page directory: that function will also
update the page directories of all already existing address spaces if
it has to create a new page table (for addresses in the kernel memory).
< reserve memory for new page directory > ≡ page_directory *new_pd = (void*)request_new_page ();
if (new_pd == NULL) { // Error
printf ("\nERROR: no free page, aborting create_new_address_space\n");
return -1;
};
memset (new_pd, 0, sizeof(page_directory));
For copying the kernel page directory to the new directory, we
simply use an assignment; this copies all references to page tables
which exist in the original (kernel) page directory.
< copy master page directory to new directory > ≡ *new_pd = kernel_pd;
memset ((char*)new_pd, 0, 4); // clear first entry (kernel pd contains
// old reference to some page table)
We modify the new page directory so that it contains information about
the user mode memory, stack, and kernel stack. For that purpose we will
use a function
< function prototypes > +≡ int as_map_page_to_frame (int as, unsigned int pageno, unsigned int frameno);
which can create mappings of
page numbers to frame numbers in a specific address space. We will
define it just afterwards; it finds out what page table entry to modify
and, if needed, also creates a new page table and updates the page
directory to point to it.
< create initial user mode memory > ≡ pageno = 0;
while (initial_ram > 0) {
if ((frameno = request_new_frame ()) < 0) {
printf ("\nERROR: no free frame, aborting create_new_address_space\n");
return -1;
};
as_map_page_to_frame (id, pageno, frameno);
pageno++;
initial_ram -= PAGE_SIZE;
};
Reserving memory for the user mode stack looks almost the same, but
the stack grows downwards whereas the data area above grew upwards:
so we need to modify the page number with pageno--
instead of pageno++:
< create initial user mode stack > ≡ pageno = TOP_OF_USER_MODE_STACK / PAGE_SIZE;
while (initial_stack > 0) {
if ((frameno = request_new_frame ()) < 0) {
printf ("\nERROR: no free frame, aborting create_new_address_space\n");
return -1;
};
pageno--;
as_map_page_to_frame (id, pageno, frameno);
initial_stack -= PAGE_SIZE;
}
We will now describe how to enter the page-to-frame mapping in the
new address space's page tables.
Getting new physical memory is not a problem since we already have
defined the function request_new_frame() which reserves a new frame.
The function as_map_page_to_frame creates such a mapping in a given address space. It will basically be a rewrite of parts of < enter frames in page table > .
< function implementations > +≡ int as_map_page_to_frame (int as, unsigned int pageno, unsigned int frameno) {
// for address space as, map page #pageno to frame #frameno
page_table* pt;
page_directory* pd;
pd = address_spaces[as].pd; // use the right address space
unsigned int pdindex = pageno/1024; // calculate pd entry
unsigned int ptindex = pageno%1024; // ... and pt entry
if ( ! pd->ptds[pdindex].present ) {
// page table is not present
< create new page table for this address space > // sets pt
} else {
// get the page table
pt = (page_table*) PHYSICAL(pd->ptds[pdindex].frame_addr << 12);
};
if (pdindex < 704) // address below 0xb0000000 -> user access
UMAP ( &(pt->pds[ptindex]), frameno << 12 );
else
KMAP ( &(pt->pds[ptindex]), frameno << 12 );
return 0;
};
In the last lines of this function we differentiate between user mode
and kernel mode memory and use the appropriate macro (UMAP or
KMAP) to create entries which allow or forbid access for
processes in user mode: The address range \hexrange{00000000}{afffffff}
allows the process access in user mode, whereas
\hexrange{b0000000}{ffffffff} shall only be accessible in kernel mode.
Remember that every page directory entry lets us address one page
table which holds the addresses of up to 1024 pages, or
$1024 \times 4096 = 4194304$ bytes of memory.
\[
\frac{\textrm{\hex{b0000000}}}{1024 \times 4096} = 704
\]
so we must use UMAP if pdindex $ < 704$.
Now we need to explain how to create a new page table: We start with
fetching a free frame and point to it from the page directory.
< create new page table for this address space > ≡ int new_frame_id = request_new_frame ();
unsigned int address = PHYSICAL (new_frame_id << 12);
pt = (page_table *) address;
memset (pt, 0, sizeof(page_table));
UMAPD ( &(pd->ptds[pdindex]), new_frame_id << 12);
Destroying an Address Space
When we exit a process, we must also destroy its address
space and release all pages used by it. For that purpose we write
a function destroy_address_space():
< function prototypes > +≡ void destroy_address_space (addr_space_id id);
< function implementations > +≡ void destroy_address_space (addr_space_id id) {
// called only from syscall_exit(), holding thread_list_lock, interrupts off
if (--address_spaces[id].refcount > 0) return;
addr_space_id as = current_as; // remember current address space
current_as = id; // set current_as: we call release_page()
< destroy AS: release user mode pages > // all pages used by the process
< destroy AS: release user mode stack > // all its user mode stack pages
< destroy AS: release page tables > // the page tables (0..703)
current_as = as; // restore current_as
address_spaces[id].status = AS_DELETE; // change AS status
// remove kernel stack (cannot do this here, this stack is in use right now)
add_to_kstack_delete_list (id);
return;
}
with
< destroy AS: release user mode pages > ≡ for ( int i = address_spaces[id].memstart / PAGE_SIZE;
i < address_spaces[id].memend / PAGE_SIZE;
i++ ) {
release_page (i);
};
< destroy AS: release user mode stack > ≡ for ( int i = TOP_OF_USER_MODE_STACK / PAGE_SIZE - 1;
i > (TOP_OF_USER_MODE_STACK-address_spaces[id].stacksize) / PAGE_SIZE - 1;
i-- ) {
release_page (i);
};
< destroy AS: release page tables > ≡ page_directory *tmp_pd = address_spaces[id].pd;
for ( int i = 0; i < 704; i++ ) {
if ( tmp_pd->ptds[i].present )
release_frame ( tmp_pd->ptds[i].frame_addr );
}
In the last code chunk the loop goes from 0 to 703 since that is the
last page directory entry which points to a page table that is used
in user mode (cf.{} the discussion of UMAP vs.{} KMAP usage in
the implementation of as_map_page_to_frame on
page \pageref{discussion:umap vs kmap, 0-703}).
We will remove the kernel stack later when we're not using it any
more---doing this right now would crash the system.
For that purpose we use a global list whose entries contain either
0 or the ID of an address space whose kernel stack needs removal.
That is why we called the function
< function prototypes > +≡ void add_to_kstack_delete_list (addr_space_id id);
in the above code. We allow up to 1024 entries in the kernel
stack delete list:
< constants > +≡ #define KSTACK_DELETE_LIST_SIZE 1024
The delete list is just an array of address space IDs that we
initialize with null values, and we also define a lock to protect
access to the list. (The implementation of locks follows in
Chapter~.)
< global variables > +≡ addr_space_id kstack_delete_list[KSTACK_DELETE_LIST_SIZE] = { 0 };
lock kstack_delete_list_lock;
We need to initialize the lock:
< initialize kernel global variables > ≡ kstack_delete_list_lock = get_new_lock ("kstack");
Entering an address space ID in the delete list is simple:
< function implementations > +≡ void add_to_kstack_delete_list (addr_space_id id) {
int i;
LOCK (kstack_delete_list_lock);
for (i = 0; i < KSTACK_DELETE_LIST_SIZE; i++) {
// try to enter it here
if (kstack_delete_list[i] == 0) {
// found a free entry
kstack_delete_list[i] = id;
break;
}
}
UNLOCK (kstack_delete_list_lock);
if (i == KSTACK_DELETE_LIST_SIZE)
printf ("ERROR ADDING ADDRESS SPACE TO KSTACK DELETE LIST!\n");
}
We have not shown the code for the scheduler yet; it is responsible
for switching between the processes and is called regularly by the
timer interrupt handler. Whenever the system activates the scheduler
it will execute the following code chunk < scheduler: free old kernel stacks >
which frees those old kernel stacks that we've put into the list:
< scheduler: free old kernel stacks > ≡ // check all entries in the to-be-freed list
int i, entry, frameno;
page_directory *tmp_pd;
page_table *tmp_pt;
LOCK (kstack_delete_list_lock);
for (entry = 0; entry < KSTACK_DELETE_LIST_SIZE; entry++) {
if (kstack_delete_list[entry] != 0 && kstack_delete_list[entry] != current_as) {
// remove it
addr_space_id id = kstack_delete_list[entry];
tmp_pd = address_spaces[id].pd;
tmp_pt = (page_table *) address_spaces[id].kstack_pt;
// this is the page table which maps the last 4 MB below 0xC0000000
for (i = 0; i < KERNEL_STACK_PAGES; i++) {
frameno = tmp_pt->pds[1023-i].frame_addr;
release_frame (frameno);
}
kstack_delete_list[entry] = 0; // remove entry from kstack delete list
release_page (((unsigned int)tmp_pt) >> 12); // free memory for page table
release_page (((unsigned int)tmp_pd) >> 12); // ... and page directory
address_spaces[id].status = AS_FREE; // mark address space as free
}
}
UNLOCK (kstack_delete_list_lock);
We haven't defined the constant KERNEL_STACK_PAGES yet: it tells the
system how many pages it shall reserve for the kernel stack.
< constants > +≡ // kernel stack (per process): KERNEL_STACK_PAGES pages of 4 KByte
#define KERNEL_STACK_PAGES 4
#define KERNEL_STACK_SIZE (PAGE_SIZE * KERNEL_STACK_PAGES)
We may sometimes also use the size of the kernel stack (KERNEL_STACK_SIZE).
Switching between Address Spaces
In order to switch between two address spaces it is sufficient
to load the new address space's page directory address into
the \register{CR3} register.
Note that using the function activate_address_space should be avoided because it has
the side effect of switching the kernel stack. Even though it is
implemented as an inline function, it is still not safe to call
it: parameter passing creates local variables (on the kernel
stack) which are lost after the context switch. We will only use
it when we start the very first process.
In earlier versions of the code, < context switch > used to
make a function call to activate_address_space() and it
caused many problems (the operating system crashed). After moving
the \register{CR3} loading code directly into the context switch, the
problems disappeared.
< function prototypes > +≡ inline void activate_address_space (addr_space_id id) __attribute__((always_inline));
< global variables > +≡ addr_space_id current_as = 0; // global variable: current address space
addr_space_id tmp_as; // temp. address space variable, for context switch
< function implementations > +≡ inline void activate_address_space (addr_space_id id) {
// NOTE: Do not call this from the scheduler; where needed, replicate the code
unsigned int virt = (unsigned int)address_spaces[id].pd; // get PD address
unsigned int phys = mmu (0, virt); // and find its physical address
asm volatile ("mov %0, %%cr3" : : "r"(phys)); // write CR3 register
current_as = id; // set current address space
current_pd = address_spaces[id].pd; // set current page directory
return;
};
Here we use another function
called mmu which emulates the behavior of the memory management unit (MMU)
and calculates the physical address belonging to a virtual address with respect
to an address space. We will define it soon.
We provide a helper function
< function prototypes > +≡ void list_address_spaces ();
which shows the list of used address spaces; it is only needed for
debugging.
< function implementations > +≡ void list_address_space (addr_space_id id) {
int mem = (unsigned int) address_spaces[id].pd;
int phys = mmu (id, (unsigned int) address_spaces[id].pd); // emulate MMU
int memstart = address_spaces[id].memstart;
int memend = address_spaces[id].memend;
printf ("ID: %d, MEM: %08x, PHYS: %08x - USER: [%08x..%08x[\n",
id, mem, phys, memstart, memend);
};
void list_address_spaces () {
addr_space_id id;
for (id = 0; id < MAX_ADDR_SPACES; id++) {
if (address_spaces[id].status != AS_FREE) {
list_address_space (id);
}
}
};
list_address_space also uses the mmu function---it is time to
provide its implementation. We start with a function mmu_p which,
given an address space ID and a page number, finds out whether the
page is mapped in that address space and returns the frame number of
the mapped frame.
< function prototypes > +≡ unsigned int mmu_p (addr_space_id id, unsigned int pageno); // pageno -> frameno
unsigned int mmu (addr_space_id id, unsigned int vaddress); // virt. -> phys. addr.
mmu_p looks up the page directory and then the right page table
which holds the mapping for the virtual address. Note that this
function can only work if the page table is in memory---if it
was paged out, it will return -1 (or actually \hex{FFFFFFFF}, since
its return type is unsigned int).
< function implementations > +≡ unsigned int mmu_p (addr_space_id id, unsigned int pageno) {
unsigned int pdindex, ptindex;
page_directory *pd;
page_table *pt;
pdindex = pageno/1024;
ptindex = pageno%1024;
pd = address_spaces[id].pd;
if ( ! pd->ptds[pdindex].present ) {
return -1;
} else {
pt = (page_table*) PHYSICAL(pd->ptds[pdindex].frame_addr << 12);
if ( pt->pds[ptindex].present ) {
return pt->pds[ptindex].frame_addr;
} else {
return -1;
};
}
};
and with mmu_p we can easily implement mmu because we just
have to split a virtual address into page number and offset, then
call mmu_p to find the frame number and reassemble that and
the offset to form a physical address:
< function implementations > +≡ unsigned int mmu (addr_space_id id, unsigned int vaddress) {
unsigned int tmp = mmu_p (id, (vaddress >> 12));
if (tmp == -1)
return -1; // fail
else
return (tmp << 12) + (vaddress % PAGE_SIZE);
}
Note that both functions return -1 if the page or virtual address
does not exist, but only when calling mmu_p we can be sure
that a return value of -1 indicates a non-existing page---after
all some virtual address might be mapped to physical
address \hex{FFFFFFFF} (which is the same as -1).
Enlarging an Address Space
We want to allow processes to increase their standard memory usage (which
is 64 KByte). Most Unix systems provide an implementation of malloc as
part of their standard library.
Since we have already described an implementation of malloc and
free for the ULIX-i386 kernel (kmalloc and kfree), we will
introduce the more historical method of using brk.
The brk system call (and corresponding library function) is still
available on modern Unix systems, but its use is discouraged.
brk adds one or more pages to the calling process' data ``segment''.
The function sbrk does the same but is more user-friendly: It takes
an increment as argument, so if the process needs 16 KByte of extra memory,
it can call sbrk(16*1024). sbrk returns the lowest address of
the new memory: After executing void *mem = sbrk(incr), the
address range $[ mem\, ; mem+incr-1 ]$ is available to the process.
How can we do this in ULIX-i386? Remember that each process uses an
address_space which has elements named memstart and memend
(the last of which is the first address that is not available)
and a pointer to the address space's page directory (pd). Thus,
sbrk just needs to
- acquire the needed number of frames,
- modify the page directory so that the new frames are mapped just
after the last old pages, and
- update the memend element.
It then returns the (virtual) address of the first new page.
We start with the kernel-internal function u_sbrk; we expect that its
argument is always a multiple of PAGE_SIZE:
< function prototypes > +≡ void *u_sbrk (int incr);
< function implementations > +≡ void *u_sbrk (int incr) {
int pages = incr / PAGE_SIZE;
int i, frame;
address_space *aspace = &address_spaces[current_as];
unsigned int oldbrk = aspace->memend;
for (i = 0; i < pages; i++) {
frame = request_new_frame ();
if (frame == -1) { return (void*)(-1); } // error!
as_map_page_to_frame (current_as, aspace->memend/PAGE_SIZE, frame);
aspace->memend += PAGE_SIZE;
}
return (void*) oldbrk;
}
Next we need to provide a system call for the u_sbrk function. The
constant __NR_brk is defined as 45. There is no sbrk system call
since normally the sbrk function is implemented by calling brk.
But we will only implement u_sbrk and reuse the brk system call
number.
< syscall prototypes > ≡ void syscall_sbrk (context_t *r);
< syscall functions > ≡ void syscall_sbrk (context_t *r) {
// ebx: increment
r->eax = (unsigned int)u_sbrk (r->ebx);
return;
}
< initialize syscalls > ≡ install_syscall_handler (__NR_brk, syscall_sbrk);
\black
Thread Control Blocks and Thread Queues
\green
The thread control block (TCB)\pindex{TCB (thread
control block)} is the central place in the kernel where information
about a thread is held. In the times when people used to speak about
processes instead of threads, the TCB was called the PCB\pindex{PCB
(process control block)} which stands for \emph{\vindex{process
control block}}.
One main purpose of the TCB is to store the processor state of
a thread (sometimes also called context) during the times when
it is not assigned to a physical processor.
Note that the processor state is not the
same as the thread state. As seen above, the state of a thread can be
running, blocked etc. The processor state is all information
that is necessary to pretend that the processor has never executed any
other thread than the one to which the TCB belongs.
The TCB contains (among others) the following information:
- A unique identifier of the thread. This is the so-called
thread identifier (TID)\pindex{TID (thread
identifier)}. In previous times, the TID was often called
PID for \emph{\vindex{process
identifier}}.
- Storage space to save the processor context, i.\,e., the
registers, the stack pointer(s), etc.
- Depending on the thread state, the TCB contains an indication
on what event the thread is waiting for if it is in state blocked.
- Information about the memory that this thread is using---we'll
define address spaces in the next section.
- Any other information which may be useful to keep the system
running efficiently. For example, statistical information could be
stored here on how often the thread has been running in the past.
This can help the scheduler make efficient scheduling decisions.
Note that the information about the address space must be handled
differently in PCBs and in TCBs. In a system where multiple threads
can run within one address space, there is an $n:1$ mapping between
threads and address spaces. In a classical Unix system with processes
(one address space with exactly one thread), the mapping is $1:1$ and
each thread can store the full address space information in the PCB
itself. With an $n:1$ mapping, an extra data structure is necessary
to avoid having redundant information in the TCBs.
So here is the declaration of the TCB structure. We have an entry for
the thread identifier tid, the processor context consisting of the
general purpose registers and the special purpose registers (such
as instruction pointer and stack pointer), plus a reference to the
address space in which the thread runs. More entries will follow later.
\black
| |
|
< type definitions > +≡ typedef struct {
thread_id tid; // thread id
thread_id ppid; // parent process
int state; // state of the process
context_t regs; // context
unsigned int esp0; // kernel stack pointer
unsigned int eip; // program counter
unsigned int ebp; // base pointer
< more TCB entries >
} TCB;
|
\green
The TCBs of all threads are collected in a giant table within the
kernel. This table is called the thread table. Again,
in ancient times when the table was a collection of PCBs instead of
TCBs, it was called process table.
The thread table is simply an array of TCBs. The size of the
table must be finite, so there exists a maximum number of threads
which can coexist at any point in time.
| |
|
< constants > +≡ #define MAX_THREADS 1024
|
| |
|
< global variables > +≡ TCB thread_table[MAX_THREADS];
|
\green
Implementing Lists of Threads
Queues are a standard data structure offered by almost all modern
programming languages. As an example, Java offers the generic class
ArrayList in which objects of any type E can be stored and
manipulated with standard operations like add(), size() and
get(). Unfortunately, ``plain'' C does not offer this convenience,
so we have to implement queues ourselves.
As explained in Section~, we have to maintain a
couple of thread queues within the kernel. In ULIX we maintain
only two: the ready queue and the blocked queue.
In fact, the blocked queue is not a single queue but there can
be multiple blocked queues, one for every event upon which a thread
can wait.
When implementing such queues, we could think about using the standard
implementation of a (doubly) linked list found in any introductory
textbook on programming. However, these implementations usually are
examples of programming with dynamic memory allocation, e.g., in C
using the malloc library call to allocate fresh memory on the
heap. This would be a problem in ULIX since we have neither a
heap nor a library to call into.
So how can we program linked lists without allocating memory? The
first option is to do it like Knuth did in
TeX [Knuth:1986:TTP] and provide both a
large memory area and functions for memory allocation and
deallocation ourselves. Since this would be overkill, we choose
the second option, which is also the option taken in many operating
systems: We use the thread table itself to implement lists. The idea is
to declare two additional entries in the thread control block:
one entry called next and one called prev. Both point to other
entries in the thread table. Consider a thread control block t:
the entry t.next points to the ``next'' thread in the queue
that t is in. Similarly, t.prev points to the ``previous''
thread in t's queue.
We define the range of these two pointers to be thread_id.
| |
|
< more TCB entries > +≡ thread_id next; // id of the ``next'' thread
thread_id prev; // id of the ``previous'' thread
|
An example of the semantics of the prev and next entries in
the thread table is shown in
Figure~. It shows that the thread
identifier 0 is used as an ``end marker'' for the lists. It also shows
that the prev entry of the first entry in the queue points to the
last element in the queue. In this way, it is easily possible to
navigate through the queues in any way which is convenient.
\centering
\includegraphics[width=0.8\textwidth]{figures/ready-queue-implementation.jpg}
\caption{Implementation of ready queue and blocked queues. The
beginning of the ready queue is implicitly defined by entry 0 in
the thread table. The beginning of a blocked queue is a pair of
thread identifiers pointing into the thread table from ``outside''.}
Figure~ also shows a small
implementation trick. The thread identifier of the
thread itself is always equal to the index of the thread in the
thread table. Given a TID of
t, then thread_table[t] is the thread control block of that
thread. This also means that the entry tid in the \vindex{thread
control block} is more or less superfluous.
Now since we are using the value 0 to mark the end of a list, the
entry 0 in the thread table has become more or less useless to store
thread information. We use it instead as the ``anchor'' of the
ready queue. So to access the first element in the ready queue,
we just need to look into:
\begin{center}
[[thread_table[0].next]]
\end{center}
The last entry in the ready queue can similarly be accessed using the
following expression:
\begin{center}
[[thread_table[0].prev]]
\end{center}
The ready queue in the figure contains threads 1, 4, and 7 (in this
order).
Recalling the simple state model of threads in
Section~, every thread is in exactly one
state at any time. This means that a thread is either running, ready
or blocked. This also means that a thread can be in at most one queue
at a time. In case the thread is blocked instead of ready, we can
re-use the prev and next entries in the thread table
to implement the blocked list. We only need to have an anchor
for this blocked list. This anchor will be a structure similar to
the 0-th entry in the thread table.
| |
|
< declaration of blocked queue > ≡ typedef struct {
thread_id next; // id of the ``next'' thread
thread_id prev; // id of the ``previous'' thread
} blocked_queue;
|
So assume b is a variable of type blocked_queue representing
a blocked list. If both entries in b are 0, then the list is
empty. If not, then using the thread table we can now find the first,
second etc.~element by following the next pointers. Following the
next pointers in this way, we can traverse the entire list until
we reach an entry in which next==0. That's the \vindex{end of the
list}. Looking again at
Figure~, the blocked queue
contains threads 2, 5 and 6 (in this order).
Finally, here's a useful function to initialize a blocked queue.
This is just to encapsulate the semantics of ``emptiness''.
| |
|
< function implementations > +≡ void initialize_blocked_queue (blocked_queue* q) {
q->prev = 0;
q->next = 0;
}
|
Implementing the Ready Queue
We now provide some convenient functions to add and remove
threads from the queues. We start with the ready queue. The
function add_to_ready_queue(t) adds the thread with
identifier t to the end of the ready queue. It assumes
that the TCB of thread t has been set up and initialized
already.
The function remove_from_ready_queue(t) removes the thread
with identifier t from the ready queue. It assumes that
t is contained in the ready queue.
| |
|
< function prototypes > +≡ void add_to_ready_queue (thread_id t);
void remove_from_ready_queue (thread_id t);
|
Adding to the end of the ready queue is easy since
we have a doubly linked list.
| |
|
< function implementations > +≡ void add_to_ready_queue (thread_id t) {
thread_id last = thread_table[0].prev;
thread_table[0].prev = t;
thread_table[t].next = 0;
thread_table[t].prev = last;
thread_table[last].next = t;
thread_table[t].state = TSTATE_READY; // set its state to ready
}
|
Removing is similarly easy.
| |
|
< function implementations > +≡ void remove_from_ready_queue (thread_id t) {
thread_id prev_thread = thread_table[t].prev;
thread_id next_thread = thread_table[t].next;
thread_table[prev_thread].next = next_thread;
thread_table[next_thread].prev = prev_thread;
}
|
We initialize the ready queue to be empty.
| |
|
< initialize kernel global variables > +≡ thread_table[0].prev = 0;
thread_table[0].next = 0;
|
Implementing a Blocked Queue
Blocked queues are implemented similar to the ready queue, except that
the functions are parametrized with a blocked list anchor defined
above. We again provide two functions to add and remove a thread from
a blocked list. Adding to the queue happens at the ``end'' of the
queue. An additional function allows us to inspect the ``front''
element of the queue.
| |
|
< function prototypes > +≡ void add_to_blocked_queue (thread_id t, blocked_queue* bq);
void remove_from_blocked_queue (thread_id t, blocked_queue* bq);
thread_id front_of_blocked_queue (blocked_queue bq);
|
We implement the easy inspector function to retrieve the front of a
blocked queue first. It is so easy that we could have avoided writing
this function altogether (using a noweb macro), but we spell it
out for students who have learnt the concept of information hiding.
| |
|
< function implementations > +≡ thread_id front_of_blocked_queue (blocked_queue bq) {
return bq.next;
}
|
We now implement add_to_blocked_queue. Adding happens
at the end of the queue. The following code is an adaptation
of the code for the ready queue. The conditional statement at
the end is necessary since thread_table[0] is not the
anchor of a blocked queue.
| |
|
< function implementations > +≡ void add_to_blocked_queue (thread_id t, blocked_queue* bq) {
thread_id last = bq->prev;
bq->prev = t;
thread_table[t].next = 0; // [[t]] is ``last'' thread
thread_table[t].prev = last;
if (last == 0) {
bq->next = t;
} else {
thread_table[last].next = t;
}
}
|
Removal is similar to the function of the ready queue, except
for again the special cases at the end.
| |
|
< function implementations > +≡ void remove_from_blocked_queue (thread_id t, blocked_queue* bq) {
thread_id prev_thread = thread_table[t].prev;
thread_id next_thread = thread_table[t].next;
if (prev_thread == 0) {
bq->next = next_thread;
} else {
thread_table[prev_thread].next = next_thread;
}
if (next_thread == 0) {
bq->prev = prev_thread;
} else {
thread_table[next_thread].prev = prev_thread;
}
}
|
Simple Dispatcher Operations
We now look at the implementations of the three simplest dispatcher
operations. These are add, retire and deblock. They
are simple because they basically only move threads from one
queue to the other.
The functions add and retire take as parameter the identifier
of the thread which is newly born or about to die. The function
deblock needs another argument: The blocked queue from which the
thread is to be removed.
| |
|
< function implementations > +≡ void add (thread_id t) {
add_to_ready_queue (t);
}
void retire (thread_id t) {
remove_from_ready_queue (t);
}
void deblock (thread_id t, blocked_queue* q) {
remove_from_blocked_queue (t, q);
add_to_ready_queue (t);
}
|
Note that with the above functions we can easily write code
that deblocks the ``front'' element from a blocked queue (if it
exists) as follows:
\begin{center}
deblock (front_of_blocked_queue (bq), &bq);
\end{center}
@ Looking back, we could have implemented the ready queue using the
functions of the blocked queue and just passing thread_table[0]
instead of a blocked queue anchor. The code for the ready queue
is however a little simpler, and therefore we decided to leave
it as it is.
\black
Allocating and Initializing a New TCB
Whenever we create a new thread or process, we will need a fresh TCB
entry and initialize it. We add a used entry to the thread control
block structure TCB
| |
|
< more TCB entries > +≡ boolean used;
|
which lets us declare an entry as used.
(Since we initialize the TCB structures with null bytes, we use
used and not free: remember that false=0.)
This will allow us to quickly find a free TCB when we create a new
thread. Instead of adding such a field, we could have used a bitmap,
but since we restrict ourselves to 1024 TCBs, not much space is
wasted this way, and searching for a free TCB will be quick.
We will memorize in a global variable
| |
|
< global variables > +≡ int next_pid = 1;
|
at which thread number we will start our search (instead of always
searching from 1): this will later lead to a continuous sequence of
process/thread numbers, since even if we terminate a thread, its TCB
will not be recycled immediately.
| |
|
< find free TCB entry > ≡ boolean tcbfound = false;
int tcbid;
for ( tcbid = next_pid; tcbid < MAX_THREADS; tcbid++ ) {
  if (thread_table[tcbid].used == false) {
    tcbfound = true;
    break;
  }
};
|
However, once we've reached the
maximum number (1023), the search for a free TCB will start over,
and from that point on thread numbers will no longer indicate the
order of creation of the threads.
| |
|
< find free TCB entry > +≡ if (!tcbfound) { // continue searching at 1
  for ( tcbid = 1; tcbid < next_pid; tcbid++ ) {
    if (thread_table[tcbid].used == false) {
      tcbfound = true;
      break;
    }
  };
};
if (tcbfound) next_pid = tcbid+1; // update next_pid:
// either tcbfound == false or tcbid == index of first free TCB
|
Once we have a free address space (or reuse one) and also have
a free TCB, we can connect them:
| |
|
< function prototypes > +≡ int register_new_tcb (addr_space_id as_id);
|
This function searches for a free TCB, marks it as used and enters
the address space ID which we provide as an argument. Thus, whenever
we create a new thread, we always call create_new_address_space
first and register_new_tcb afterwards:
| |
|
< function implementations > +≡ int register_new_tcb (addr_space_id as_id) {
// called by ulix_fork() which acquires LOCK (thread_list_lock)
< find free TCB entry >
if (!tcbfound) {
return -1; // no free TCB!
};
thread_table[tcbid].used = true; // mark as used
thread_table[tcbid].addr_space = as_id; // enter address space ID
return tcbid;
}
|
Note that so far we have not entered the new TCB in the ready or
one of the blocked queues. This will happen later when the thread
has been fully initialized.
\black
Starting The First Process
Starting the first (init) process is different from how all further
processes are created: We have to manually set up the memory regions
and data structures and load a first program from the disk. Of course,
this requires filesystem support which we will implement later---for
now assume that the kernel can use filesystem functions similar to
the standard Unix functions open, read, and close.
Here we will create the first (init) process; all further processes
will be created using fork. This is what we need to do:
- Setup the TCB (thread control block) list.
- Mark the first TCB as used.
- Create a new address space
- Reserve memory for user mode (low addresses, with user access)
- Load process binary (this assumes we have a working filesystem
implementation or some other way to access the binary)
- Reserve memory for the process' kernel stack (low addresses, without
user access)
- Enter all the information in the TCB
- update a data structure called TSS (see Section~)
- enter user mode and jump to start of process
| |
|
< global variables > +≡ volatile int current_task;
|
| |
|
< constants > +≡ #define PROGSIZE 32768
|
We haven't talked about locking yet, and we will further postpone this to
chapter . For now, assume that we have a datatype
called lock which we can use for locking.
| |
|
< global variables > +≡ lock thread_list_lock = NULL; // initialize this when the first process starts
|
| |
|
< function implementations > +≡ void start_program_from_disk (char *progname) {
if (thread_list_lock == NULL) // initialize lock for thread list
thread_list_lock = get_new_lock ("thread list");
< start program from disk: prepare address space and TCB entry >
< start program from disk: load binary >
< start program from disk: create kernel stack >
current_task = tid; // make this the current task
add_to_ready_queue (tid); // add process to ready queue
ENABLE_SCHEDULER;
cpu_usermode (BINARY_LOAD_ADDRESS,
TOP_OF_USER_MODE_STACK); // jump to user mode
};
|
with
| |
|
< start program from disk: prepare address space and TCB entry > ≡ addr_space_id as;
thread_id tid;
as = create_new_address_space(64*1024, 4096); // 64 KB + 4 KB stack
tid = register_new_tcb (as); // get a fresh TCB
thread_table[tid].tid = tid;
thread_table[tid].ppid = 0; // parent: 0 (none)
thread_table[tid].new = false; // not freshly created via fork()
thread_table[tid].terminal = 0; // default terminal: 0
memcpy (thread_table[tid].cwd, "/", 2); // set current directory
memcpy (thread_table[tid].cmdline, "new", 4); // set temporary command line
activate_address_space (as); // activate the new address space
|
| |
|
< start program from disk: load binary > ≡ // read binary
int mfd = mx_open (DEV_HDA, progname, O_RDONLY);
mx_read (DEV_HDA, mfd, (char*)BINARY_LOAD_ADDRESS, PROGSIZE);
// load to virtual address 0
mx_close (DEV_HDA, mfd);
|
| |
|
< start program from disk: create kernel stack > ≡ unsigned int framenos[KERNEL_STACK_PAGES]; // frame numbers of kernel stack pages
int i;
for (i = 0; i < KERNEL_STACK_PAGES; i++)
  framenos[i] = request_new_frame();
page_table* stackpgtable = (page_table*)request_new_page();
memset (stackpgtable, 0, sizeof(page_table));
KMAPD ( &current_pd->ptds[767], mmu (0, (unsigned int)stackpgtable) );
for (i = 0; i < KERNEL_STACK_PAGES; i++)
  as_map_page_to_frame (current_as, 0xbffff - i, framenos[i]);
char* kstack = (char*) (TOP_OF_KERNEL_MODE_STACK-KERNEL_STACK_SIZE);
unsigned int adr = (uint)kstack; // one page for kernel stack
tss_entry.esp0 = adr+KERNEL_STACK_SIZE;
thread_table[tid].esp0 = (uint)kstack + KERNEL_STACK_SIZE;
thread_table[tid].ebp = (uint)kstack + KERNEL_STACK_SIZE;
|
There's one more function in start_program_from_disk which we have
not described so far: cpu_usermode starts the newly loaded process.
The following subsection describes how that works.
Entering User Mode and the TSS Structure
The Intel processor provides no instruction that would let us switch
to user mode explicitly, but there is a way of \emph{returning
to user mode} which requires that the stack is set up properly
when executing an iret instruction. That is what normally
happens when, for example, a process already runs in user mode
and an interrupt forces a jump to the interrupt handler---the
CPU modifies the stack so that when the handler executes iret,
execution will continue in the process. To make the system switch
back to ring 3, the stack contains (besides other values) values
which will be loaded into the CS and SS segment registers
(which tell the CPU what segments to use for code and stack).
The segment registers always contain a value which is a multiple
of 8, making them an offset for the GDT whose entries are eight
bytes long. Thus, the three lowest bits of a segment register
value are always 0. What we have not mentioned yet is that the
CPU modifies the lowest two bits when it pushes the register
value on the stack (on interrupt entry), and it reads them when
it pops the registers back from the stack (during iret).
These two bits are then interpreted as the privilege level to
which the CPU shall switch (see Fig.~).
\centering
\includegraphics[width=0.5\textwidth]{pics/segment-selector.png}
\caption{Segment Selector. Source: http://www.cs.cmu.edu/~410/doc/segments/segments.html,
REDRAW!}
We can now force the system into user mode by generating a stack
which looks just like the one that the CPU automatically generates
when an interrupt occurs. Where the segment register contents
are expected we push a value that we can calculate with
val = seg | 3; (which will set the lowest two bits).
However, we cannot use the segment descriptors which we have
created during the system initialization: both the code and the
data segment descriptors have the entry flags (which contains a
two-bit value descriptor privilege level (DPL), describing
the necessary level for accessing this segment) set to 0---the
system would halt, because it would switch to user mode but would
not be allowed to use the memory. So we need two new segment
descriptors which are designed specifically for user mode. They
are identical to the old descriptors except for the flags
entry where they have the DPL value set to 3 instead of 0.
So here's how we fill the descriptors:
| |
|
< install GDTs for User Mode > ≡ fill_gdt_entry (3, 0, 0xFFFFFFFF, 0xFA, 0xCF);
fill_gdt_entry (4, 0, 0xFFFFFFFF, 0xF2, 0xCF);
|
The numbering continues with 3 since we've already filled the null
descriptor (0) and the kernel mode code (1) and data (2) segment
descriptors (see page \pageref{fill kernel segment descriptors}).
In order to enter user mode we also have to create a structure
called \marginnote{TSS} TSS (task state segment) which is another (and
final) entry in the GDT; we must load its GDT offset in a special
task register (\register{TR}) using the ltr instruction.
The TSS is a 104-byte data structure:
\centering
\includegraphics[width=10cm]{pics/intel-tss.png}
\caption{The TSS structure (source: Intel \cite[p. 303]{intel-part3}).}
| |
|
< type definitions > +≡ typedef struct {
unsigned int prev_tss : 32; // unused: previous TSS
unsigned int esp0, ss0 : 32; // ESP and SS to load when we switch to ring 0
long long u1, u2 : 64; // unused: esp1, ss1, esp2, ss2 for rings 1 and 2
unsigned int cr3 : 32; // unused: page directory
unsigned int eip, eflags : 32;
unsigned int eax, ecx, edx, ebx, esp, ebp, esi, edi, es, cs, ss, ds, fs, gs : 32;
// unused (dynamic, filled by CPU)
long long u3 : 64; // unused: ldt, trap, iomap
} __attribute__((packed)) tss_entry_struct;
|
Most of its fields can be used by the CPU to do task switching: if
we wanted to use this feature we would have to provide each process
(or thread) with its own TSS structure. Then we could ask the CPU
to switch tasks and it would store large parts of the process context
in the old process' TSS structure before loading data from the new
process' TSS structure. However, in ULIX-i386 we do the task switch
manually and use the thread control block entry to store and retrieve
the process context. That is why we need only one TSS.
| |
|
< global variables > +≡
tss_entry_struct tss_entry;
|
We add this to the GDT definition (see page ....):
| |
|
< install GDTs for User Mode > +≡ write_tss (5, 0x10, 0x0);
|
Here's the definition of write_tss:
| |
|
< function prototypes > +≡ static void write_tss (int num, unsigned short ss0, unsigned int esp0);
|
The function write_tss needs a more detailed explanation!
| |
|
< function implementations > +≡ static void write_tss (int num, unsigned short ss0, unsigned int esp0) {
unsigned int base = (unsigned int) &tss_entry;
unsigned int limit = sizeof (tss_entry) - 1;
fill_gdt_entry (num, base, limit, 0xE9, 0x00); // enter it in GDT
memset (&tss_entry, 0, sizeof(tss_entry)); // fill with zeros
tss_entry.ss0 = ss0; // Set the kernel stack segment.
tss_entry.esp0 = esp0; // Set the kernel stack pointer.
}
|
Finally, we add the code for loading the task register to the
assembler file: the index of the TSS in the GDT is 5, so the proper
value to load is $5 \times 8 = 40 = $ \hex{28}. We have to write the
requested privilege level (RPL) into the two lowest bits:
\[
\textrm{\hex{28}}\, |\, \textrm{\hex{03}} = \textrm{\hex{2b}}
\]
| |
|
< start.asm > ≡ [section .text]
global tss_flush
tss_flush: mov ax, 0x28 | 0x03
ltr ax ; load the task register
ret
|
| |
|
< function prototypes > +≡ extern void tss_flush();
|
The cpu_usermode routine
| |
|
< function prototypes > +≡ extern void cpu_usermode (unsigned int address, unsigned int stack); // assembler
|
prepares a stack that will create
the first user mode process when executing iret. Since iret
performs a change of privilege level (from ring 0 to ring 3), it will
pop the following:
- instruction pointer EIP
- code segment selector CS
- EFLAGS register
- stack pointer ESP
- stack segment selector SS
The other segment selectors (DS, ES, FS, and GS) can be
set via mov instructions.
When we enter the assembler function, there are three values on the
stack:
- the return address in [esp] (in Assembler syntax, but note: we
will never return to that address),
- the first argument in [esp + 4],
- the second argument in [esp + 8].
But each time we push data on the stack, the offsets will change.
To make the following code more readable we will start with saving
the current \register{ESP} value in \register{EBP}.
\red
| |
|
< start.asm > +≡ ; code from http://f.osdev.org/viewtopic.php?t=23890&p=194213 (Jezze)
; modified and comments added
global cpu_usermode
cpu_usermode: cli ; disable interrupts
mov ebp, esp ; remember current stack address
mov ax, 0x20 | 0x03 ; data selector 0x20 | RPL3: 0x03
; RPL = requested privilege level
mov ds, ax
mov es, ax
mov fs, ax
mov gs, ax
mov eax, esp
push 0x20 | 0x03 ; stack segment selector 0x20 | RPL3: 0x03
mov eax, [ebp + 8] ; stack address is 2nd argument
push eax ; stack pointer
pushf ; EFLAGS
pop eax ; trick: reenable interrupts when doing iret
or eax, 0x200
push eax
push 0x18 | 0x03 ; code selector 0x18 | RPL3: 0x03
mov eax, [ebp + 4] ; return address (1st argument) for iret
push eax
iret
|
\black
We're using a trick to have interrupts automatically enabled when we
execute iret: One of the values on the stack is the \register{EFLAGS}
register which contains the interrupt enable flag (\register{IF}) in bit 9.
We cannot directly set that bit in \register{EFLAGS}, but we can modify
the stack.
The sequence pop eax, or eax, 0x200, push eax pops the
\register{EFLAGS} (which was just pushed in the previous instruction)
from the stack, sets bit 9 ($2^9 = 512 = \textrm{\hex{200}}$) and
pushes the modified value onto the stack.
|