author    Ralf Baechle <ralf@linux-mips.org>  2000-02-04 07:40:19 +0000
committer Ralf Baechle <ralf@linux-mips.org>  2000-02-04 07:40:19 +0000
commit    33263fc5f9ac8e8cb2b22d06af3ce5ac1dd815e4 (patch)
tree      2d1b86a40bef0958a68cf1a2eafbeb0667a70543 /Documentation/vm
parent    216f5f51aa02f8b113aa620ebc14a9631a217a00 (diff)
Merge with Linux 2.3.32.
Diffstat (limited to 'Documentation/vm')
-rw-r--r--  Documentation/vm/locking  53
-rw-r--r--  Documentation/vm/numa     41
2 files changed, 93 insertions, 1 deletion
diff --git a/Documentation/vm/locking b/Documentation/vm/locking
index 20921f6e4..54c8a6ce0 100644
--- a/Documentation/vm/locking
+++ b/Documentation/vm/locking
@@ -75,7 +75,11 @@ having the vmlist protection in this case.
The vmlist lock nests with the inode i_shared_lock and the kmem cache
c_spinlock spinlocks. This is okay, since code that holds i_shared_lock
never asks for memory, and the kmem code asks for pages after dropping
-c_spinlock.
+c_spinlock. The vmlist lock also nests with pagecache_lock and
+pagemap_lru_lock spinlocks, and no code asks for memory with these locks
+held.
+
+The vmlist lock is grabbed while holding the kernel_lock spinning monitor.
The vmlist lock can be a sleeping or spin lock. In either case, care
must be taken that it is not held on entry to the driver methods, since
@@ -85,3 +89,50 @@ The current implementation of the vmlist lock uses the page_table_lock,
which is also the spinlock that page stealers use to protect changes to
the victim process' ptes. Thus we have a reduction in the total number
of locks.
+
+swap_list_lock/swap_device_lock
+-------------------------------
+The swap devices are chained in priority order from the "swap_list" header.
+The "swap_list" is used for the round-robin swaphandle allocation strategy.
+The number of free swaphandles is maintained in "nr_swap_pages".  These two together
+are protected by the swap_list_lock.
+
+The swap_device_lock, which is per swap device, protects the reference
+counts on the corresponding swaphandles, maintained in the "swap_map"
+array, and the "highest_bit" and "lowest_bit" fields.
+
+Both of these are spinlocks, and are never acquired from interrupt level.
+The locking hierarchy is swap_list_lock -> swap_device_lock.
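+
+A minimal sketch of that ordering in kernel-style C follows; the loop
+shape and the "sdev_lock" field name are illustrative here, not
+necessarily the exact code in this tree:
+
+	/* Sketch only: always take swap_list_lock before any
+	 * per-device lock, and release in the reverse order.  */
+	spin_lock(&swap_list_lock);
+	for (type = swap_list.head; type >= 0; type = swap_info[type].next) {
+		struct swap_info_struct *p = &swap_info[type];
+
+		spin_lock(&p->sdev_lock);	/* per swap device lock */
+		/* scan p->swap_map between p->lowest_bit and p->highest_bit */
+		spin_unlock(&p->sdev_lock);
+	}
+	spin_unlock(&swap_list_lock);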
+
+To prevent races between swap space deletion (or async readahead swapins)
+deciding whether a swap handle is in use, i.e. worth being read in from
+disk, and an unmap -> swap_free making the handle unused, the swap delete
+and readahead code grabs a temporary reference on the swaphandle.  This
+prevents warning messages from swap_duplicate <- read_swap_cache_async.
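+
+As a sketch, the caller side of that convention might look like the
+fragment below.  swap_duplicate()/swap_free() are named above, but the
+return convention and the type of "entry" depend on the kernel version,
+so treat the details as assumptions:
+
+	/* Sketch: pin the handle first so that a concurrent
+	 * unmap -> swap_free cannot make it "unused" under us. */
+	if (!swap_duplicate(entry))	/* grab a temp reference   */
+		return;			/* handle already unused   */
+	/* ... examine the handle / start the readahead I/O ...  */
+	swap_free(entry);		/* drop the temp reference */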
+
+Swap cache locking
+------------------
+Pages are added into the swap cache with kernel_lock held, to make sure
+that multiple pages do not get associated with the same swaphandle (and
+hence lost).
+
+Pages are guaranteed not to be removed from the scache if the page is
+"shared": i.e., other processes hold a reference on the page or on the
+associated swap handle.  The only code that does not follow this rule is
+shrink_mmap, which deletes pages from the swap cache if no process has a
+reference on the page (multiple processes might still have references on
+the corresponding swap handle, though).  lookup_swap_cache() races with
+shrink_mmap when establishing a reference on a scache page, so it must
+check whether the page it located is still in the swapcache or whether
+shrink_mmap deleted it.  (This race exists because shrink_mmap looks at
+the page ref count while holding pagecache_lock, but then drops
+pagecache_lock before deleting the page from the scache.)
+
+do_wp_page and do_swap_page have MP races in them while trying to figure
+out whether a page is "shared" by looking at page_count + swap_count.
+To preserve the sum of the counts, the page lock _must_ be acquired before
+calling is_page_shared (else processes might move their swap_count
+references over to page_count references after page_count has been
+snapshotted).
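+
+A sketch of that rule, with is_page_shared as named above; the page lock
+helpers below are illustrative stand-ins for however the page lock is
+taken in this tree:
+
+	/* Hold the page lock so references cannot migrate between
+	 * swap_count and page_count while we sum the two counts.  */
+	lock_page(page);		/* helper name assumed       */
+	shared = is_page_shared(page);	/* page_count + swap_count   */
+	UnlockPage(page);		/* helper name assumed       */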
+
+Swap device deletion code currently breaks all the scache assumptions,
+since it grabs neither mmap_sem nor page_table_lock.
diff --git a/Documentation/vm/numa b/Documentation/vm/numa
new file mode 100644
index 000000000..21a3442b7
--- /dev/null
+++ b/Documentation/vm/numa
@@ -0,0 +1,41 @@
+Started Nov 1999 by Kanoj Sarcar <kanoj@sgi.com>
+
+The intent of this file is to have an up-to-date, running commentary
+from different people about NUMA-specific code in the Linux vm.
+
+What is NUMA?  It is an architecture where the memory access times
+for different regions of memory from a given processor vary according
+to the "distance" of the memory region from the processor.  Each region
+of memory to which access times are the same from any cpu is called a
+node.  On such architectures, it is beneficial if the kernel tries to
+minimize inter-node communication.  Schemes for this range from
+replicating kernel text and read-only data across nodes, to trying to
+house the data structures that key components of the kernel need in
+memory on the node that uses them.
+
+Currently, all the NUMA support exists to provide efficient handling
+of widely discontiguous physical memory, so architectures which
+are not NUMA but can have huge holes in the physical address space
+can use the same code.  All this code is bracketed by CONFIG_DISCONTIGMEM.
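+
+The bracketing itself is plain conditional compilation, roughly:
+
+	#ifdef CONFIG_DISCONTIGMEM
+		/* per-node pg_data_t / mem_map handling */
+	#else
+		/* a single contiguous mem_map for the whole machine */
+	#endif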
+
+The initial port includes NUMAizing the bootmem allocator code by
+encapsulating all the pieces of information into a bootmem_data_t
+structure. Node specific calls have been added to the allocator.
+In theory, any platform which uses the bootmem allocator should
+be able to put the bootmem and mem_map data structures anywhere
+it deems best.
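+
+As a hedged example, the node-aware entry points look roughly like the
+fragment below in later trees (init_bootmem_node / alloc_bootmem_node);
+the exact names and argument lists in this snapshot may differ:
+
+	/* Each node describes its boot-time memory in its own
+	 * bootmem_data_t, hung off that node's pg_data_t.      */
+	init_bootmem_node(pgdat, mapstart_pfn, start_pfn, end_pfn);
+	/* ... reserve regions, then hand out node-local boot memory: */
+	ptr = alloc_bootmem_node(pgdat, size);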
+
+Each node's page allocation data structures have also been encapsulated
+into a pg_data_t. The bootmem_data_t is just one part of this. To
+make the code look uniform between NUMA and regular UMA platforms,
+UMA platforms have a statically allocated pg_data_t too (contig_page_data).
+For the sake of uniformity, the variable "numnodes" is also defined
+for all platforms. As we run benchmarks, we might decide to NUMAize
+more variables, like low_on_memory, nr_free_pages etc., into the pg_data_t.
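+
+A simplified sketch of the shape being described follows; the struct tag
+and field names are illustrative, not the exact layout in this tree:
+
+	typedef struct pglist_data {
+		struct bootmem_data *bdata;	/* boot-time allocator state */
+		struct page *node_mem_map;	/* this node's mem_map       */
+		/* ... per-node page allocation state ... */
+	} pg_data_t;
+
+	extern pg_data_t contig_page_data;	/* the single "node" on UMA  */
+	extern int numnodes;			/* 1 on UMA platforms        */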
+
+The NUMA aware page allocation code currently tries to allocate pages
+from different nodes in a round robin manner.  This will be changed to
+do a concentric circle search, starting from the current node, once the
+NUMA port achieves more maturity.  The call alloc_pages_node has been
+added, so that drivers can make the call and not worry about whether
+it is running on a NUMA or UMA platform.
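+
+A hedged usage sketch; the (node id, gfp mask, order) argument order
+below follows later kernels and is an assumption for this snapshot:
+
+	/* Ask for one page (order 0) on a specific node; on an UMA
+	 * platform the same call simply works with node id 0.      */
+	struct page *page = alloc_pages_node(nid, GFP_KERNEL, 0);
+	if (!page)
+		return -ENOMEM;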