Started Oct 1999 by Kanoj Sarcar <kanoj@sgi.com>
The intent of this file is to have an up-to-date, running commentary
from different people about how locking and synchronization are done
in the Linux vm code.
vmlist_access_lock/vmlist_modify_lock
--------------------------------------
Page stealers pick processes out of the process pool and scan for
the best process to steal pages from. To guarantee the existence
of the victim mm, a mm_count increment and a corresponding mmdrop
are done in swap_out(). Page stealers hold kernel_lock to protect
against a number of races. The vma list of the victim mm is also
scanned by the stealer, and the vmlist_lock is used to preserve
list sanity against the process adding to or deleting from the
list. This also guarantees the existence of the vma. The vma
existence guarantee while invoking the driver swapout() method in
try_to_swap_out() also relies on the fact that do_munmap()
temporarily takes lock_kernel before tearing down the vma; thus
the swapout() method must snapshot all the vma fields it needs
before going to sleep (which releases the lock_kernel held by the
page stealer). Currently, filemap_swapout is the only method that
depends on this shaky interlocking.
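
The pinning pattern described above looks roughly like this (a
minimal sketch, not the literal swap_out() code; steal_from_mm is
an illustrative name for the scan itself):

	/* Pin the victim mm so it cannot be freed under us. */
	atomic_inc(&mm->mm_count);
	ret = steal_from_mm(mm);	/* illustrative: scan vmas, unmap pages */
	mmdrop(mm);			/* drop the reference; may free the mm */
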
Any code that modifies the vmlist, or the vm_start/vm_end/
vm_flags:VM_LOCKED/vm_next of any vma *in the list*, must prevent
kswapd from looking at the chain. This does not include driver mmap()
methods, for example, since at that point the vma is not yet in the
list.
The rules are (a code sketch of rules 1-3 follows the list):
1. To modify the vmlist (add/delete or change fields in an element),
you must hold mmap_sem to guard against clones doing mmap/munmap/faults
(i.e. all vm system calls and faults), and against ptrace, swapin due
to swap deletion, etc.
2. To modify the vmlist (add/delete or change fields in an element),
you must also hold vmlist_modify_lock, to guard against page stealers
scanning the list.
3. To scan the vmlist (find_vma()), you must either
a. grab mmap_sem, which is done by everybody except the page
stealer, or
b. grab vmlist_access_lock, which is done only by the page stealer.
4. While holding the vmlist_modify_lock, you must be able to
guarantee that no code path will lead to page stealing. A stronger
guarantee is non-sleepability, which ensures that you are not
sleeping on a lock whose holder might in turn be doing page
stealing.
5. You must be able to guarantee that while holding vmlist_modify_lock
or vmlist_access_lock of mm A, you will not try to get either lock
for mm B.
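
A sketch of rules 1-3 in code (vmlist_modify_lock() and friends are
the names used in this file; insert_vm_struct() is the existing
interface for adding a vma):

	/* Rules 1 and 2: modifying the vmlist. */
	down(&mm->mmap_sem);	/* rule 1: vm syscalls, faults, ptrace */
	vmlist_modify_lock(mm);	/* rule 2: keep the page stealer out */
	insert_vm_struct(mm, vma);
	vmlist_modify_unlock(mm);
	up(&mm->mmap_sem);

	/* Rule 3b: the page stealer scanning the list. */
	vmlist_access_lock(mm);
	for (vma = mm->mmap; vma; vma = vma->vm_next)
		;	/* examine the vma; must not sleep (rule 4) */
	vmlist_access_unlock(mm);
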
The caveats are:
1. find_vma() makes use of, and updates, the mmap_cache pointer hint.
The update of mmap_cache is racy (the page stealer can race with other
code that invokes find_vma with mmap_sem held), but that is okay,
since it is only a hint. This could be fixed, if desired, by having
find_vma grab the vmlist lock. A simplified sketch of the hint
handling follows.
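
	/*
	 * Simplified find_vma() (linear walk shown instead of the
	 * AVL tree; the mmap_cache update is the racy part).
	 */
	struct vm_area_struct *find_vma_sketch(struct mm_struct *mm,
					       unsigned long addr)
	{
		struct vm_area_struct *vma = mm->mmap_cache;

		if (vma && vma->vm_end > addr && vma->vm_start <= addr)
			return vma;		/* hint hit */
		for (vma = mm->mmap; vma; vma = vma->vm_next)
			if (vma->vm_end > addr) {
				mm->mmap_cache = vma;	/* racy, but a hint */
				return vma;
			}
		return NULL;
	}
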
Code paths that add/delete elements from the vmlist chain are:
1. callers of insert_vm_struct
2. callers of merge_segments
3. callers of avl_remove
Code paths that change vm_start/vm_end/vm_flags:VM_LOCKED of vmas
on the list are:
1. expand_stack
2. mprotect
3. mlock
4. mremap
It is advisable that changes to vm_start/vm_end be protected,
although in some cases this is not strictly needed. For example,
vm_start is modified by expand_stack(), and it is hard to come up
with a destructive scenario in this case even without the vmlist
protection. Still, a sketch of expand_stack() following rule 2 is
shown below.
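
	/*
	 * expand_stack() core, simplified (rlimit and overcommit
	 * checks omitted): the field update is bracketed by the
	 * modify lock so the page stealer sees a consistent vma.
	 */
	address &= PAGE_MASK;
	vmlist_modify_lock(vma->vm_mm);
	vma->vm_start = address;	/* field read by the page stealer */
	vmlist_modify_unlock(vma->vm_mm);
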
The vmlist lock nests with the inode i_shared_lock and the kmem cache
c_spinlock spinlocks. This is okay, since code that holds i_shared_lock
never asks for memory, and the kmem code asks for pages after dropping
c_spinlock.
The vmlist lock can be a sleeping or spin lock. In either case, care
must be taken that it is not held on entry to the driver methods, since
those methods might sleep or ask for memory, causing deadlocks.
The current implementation of the vmlist lock uses the page_table_lock,
which is also the spinlock that page stealers use to protect changes to
the victim process' ptes. Thus we have a reduction in the total number
of locks.
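
In code terms the mapping is simply (modulo the exact spelling in
the tree):

	#define vmlist_access_lock(mm)		spin_lock(&(mm)->page_table_lock)
	#define vmlist_access_unlock(mm)	spin_unlock(&(mm)->page_table_lock)
	#define vmlist_modify_lock(mm)		vmlist_access_lock(mm)
	#define vmlist_modify_unlock(mm)	vmlist_access_unlock(mm)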