Documentation for /proc/sys/vm/*	version 0.1
	(c) 1998, Rik van Riel <H.H.vanRiel@phys.uu.nl>

For general info and legal blurb, please look in README.

==============================================================

This file contains the documentation for the sysctl files in
/proc/sys/vm and is valid for Linux kernel version 2.1.

The files in this directory can be used to tune the operation
of the virtual memory (VM) subsystem of the Linux kernel, and
one of the files (bdflush) also has a little influence on disk
activity.

Currently, these files are in /proc/sys/vm:
- bdflush
- buffermem
- freepages
- kswapd
- overcommit_memory
- pagecache
- swapctl
- swapout_interval

==============================================================

bdflush:

This file controls the operation of the bdflush kernel
daemon. The struct behind it is defined in
linux/fs/buffer.c. It currently contains 9 integer values,
of which 6 are actually used by the kernel.

From linux/fs/buffer.c:
--------------------------------------------------------------
union bdflush_param{
    struct {
        int nfract;      /* Percentage of buffer cache dirty to
                            activate bdflush */
        int ndirty;      /* Maximum number of dirty blocks to
                            write out per wake-cycle */
        int nrefill;     /* Number of clean buffers to try to
                            obtain each time we call refill */
        int nref_dirt;   /* Dirty buffer threshold for activating
                            bdflush when trying to refill buffers. */
        int dummy1;      /* unused */
        int age_buffer;  /* Time for normal buffer to age before
                            we flush it */
        int age_super;   /* Time for superblock to age before we
                            flush it */
        int dummy2;      /* unused */
        int dummy3;      /* unused */
    } b_un;
    unsigned int data[N_PARAM];
} bdf_prm = {{40, 500, 64, 256, 15, 30*HZ, 5*HZ, 1884, 2}};
--------------------------------------------------------------

The first parameter (nfract) governs the maximum percentage
of dirty buffers in the buffer cache. Dirty means that the
contents of the buffer still have to be written to disk (as
opposed to a clean buffer, which can just be forgotten about).
Setting this to a high value means that Linux can delay disk
writes for a long time, but it also means that it will have
to do a lot of I/O at once when memory becomes short. A low
value will spread out disk I/O more evenly.
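
To make the percentage threshold concrete, here is a minimal
sketch of such a check. It is an illustration only, not the
kernel's code: the function name and its arguments are
hypothetical, and the exact comparison the kernel performs
may differ.

```c
/* Hypothetical helper: returns 1 when the dirty share of the
 * buffer cache exceeds nfract percent, 0 otherwise.
 * nr_dirty and nr_buffers are buffer counts. */
int dirty_exceeds_nfract(long nr_dirty, long nr_buffers, int nfract)
{
    if (nr_buffers <= 0)
        return 0;               /* empty cache: nothing to flush */

    /* Compare nr_dirty / nr_buffers against nfract / 100 without
     * losing precision to integer division. */
    return nr_dirty * 100 > nr_buffers * (long)nfract;
}
```

With nfract at 40, a cache of 1000 buffers would cross the
threshold at the 401st dirty buffer in this sketch.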

The second parameter (ndirty) gives the maximum number of
dirty buffers that bdflush can write to the disk at one time.
A high value will mean delayed, bursty I/O, while a small
value can lead to memory shortage when bdflush isn't woken
up often enough...

The third parameter (nrefill) is the number of buffers that
bdflush will add to the list of free buffers when
refill_freelist() is called. It is necessary to allocate free
buffers beforehand, since the buffers often are of a different
size than memory pages and some bookkeeping needs to be done
beforehand. The higher the number, the more memory will be
wasted and the less often refill_freelist() will need to run.

When refill_freelist() comes across more than nref_dirt dirty
buffers, it will wake up bdflush.

Finally, the age_buffer and age_super parameters govern the
maximum time Linux waits before writing out a dirty buffer
to disk. The value is expressed in jiffies (clock ticks); the
number of jiffies per second is 100, except on Alpha machines,
where it is 1024. Age_buffer is the maximum age for data
blocks, while age_super is for filesystem metadata.
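
Since the ages are given in jiffies, a value that means 30
seconds on one architecture means something else on another.
The sketch below converts seconds to jiffies for the two tick
rates mentioned above. The names HZ_DEFAULT, HZ_ALPHA and
seconds_to_jiffies() are made up for this example.

```c
/* Tick rates quoted in the text (illustrative names). */
enum { HZ_DEFAULT = 100, HZ_ALPHA = 1024 };

/* Number of jiffies that elapse in the given number of seconds. */
long seconds_to_jiffies(long seconds, long hz)
{
    return seconds * hz;
}
```

The defaults in the initializer above, 30*HZ for age_buffer
and 5*HZ for age_super, come out as 3000 and 500 jiffies at
100 ticks per second.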

==============================================================

buffermem:

The three values in this file correspond to the values in
the struct buffer_mem. They control how much memory should
be used for buffer memory. Each percentage is calculated
as a percentage of total system memory.

The values are:
min_percent	-- this is the minimum percentage of memory
		   that should be spent on buffer memory
borrow_percent	-- when Linux is short on memory and the
		   buffer cache uses more memory than this
		   percentage, free pages are stolen from it
max_percent	-- this is the maximum percentage of memory
		   that can be used for buffer memory

==============================================================

freepages:

This file contains the values in the struct freepages. That
struct contains three members: min, low and high.

Although the goal of the Linux memory management subsystem
is to avoid fragmentation and make large chunks of free
memory (so that we can hand out DMA buffers and such), there
still are some page-based limits in the system, mainly to
make sure we don't waste too much memory trying to get large
free areas.

The meaning of the numbers is:

freepages.min	When the number of free pages in the system
		reaches this number, only the kernel can
		allocate more memory.
freepages.low	If memory is too fragmented, the swapout
		daemon is started, except when the number
		of free pages is larger than freepages.low.
freepages.high	The swapping daemon exits when memory is
		sufficiently defragmented, when the number
		of free pages reaches freepages.high or when
		it has tried the maximum number of times. 
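
The three thresholds split the range of free page counts into
bands. The sketch below only classifies a free page count
against min, low and high as described above; the enum and
function names are invented for the example, and the real
decisions in the kernel also depend on fragmentation.

```c
enum free_state {
    FREE_KERNEL_ONLY,   /* at or below min: only the kernel allocates */
    FREE_MAY_SWAP,      /* at or below low: daemon may be started     */
    FREE_BETWEEN,       /* above low but below high                   */
    FREE_ENOUGH         /* at or above high: daemon may stop          */
};

/* Classify nr_free against the three freepages limits. */
enum free_state classify_free_pages(long nr_free,
                                    long min, long low, long high)
{
    if (nr_free <= min)
        return FREE_KERNEL_ONLY;
    if (nr_free <= low)
        return FREE_MAY_SWAP;
    if (nr_free < high)
        return FREE_BETWEEN;
    return FREE_ENOUGH;
}
```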

==============================================================

kswapd:

Kswapd is the kernel swapout daemon. That is, kswapd is that
piece of the kernel that frees memory when it gets fragmented
or full. Since every system is different, you'll probably want
some control over this piece of the system.

The numbers in this file correspond to the members of the
struct pager_daemon {tries_base, tries_min, swap_cluster};
tries_base and swap_cluster probably have the largest
influence on system performance.

tries_base	The maximum number of pages kswapd tries to
		free in one round is calculated from this
		number. Usually this number will be divided
		by 4 or 8 (see mm/vmscan.c), so it isn't as
		big as it looks.
		When you need to increase the bandwidth to/from
		swap, you'll want to increase this number.
tries_min	This is the minimum number of times kswapd
		tries to free a page each time it is called.
		Basically it's just there to make sure that
		kswapd frees some pages even when it's being
		called with minimum priority.
swap_cluster	This is the number of pages kswapd writes in
		one turn. You want this large so that kswapd
		does its I/O in large chunks and the disk
		doesn't have to seek often, but you don't want
		it to be too large since that would flood the
		request queue.
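
A tuning script or monitoring tool may want to read these
three numbers. The following sketch parses one line of text
into a struct, assuming the file holds three decimal integers
separated by whitespace, in the order listed above. The
struct and function names are hypothetical.

```c
#include <stdio.h>

/* Mirrors the three members named in the text. */
struct kswapd_params {
    int tries_base;
    int tries_min;
    int swap_cluster;
};

/* Parse a line such as the contents of /proc/sys/vm/kswapd.
 * Returns 0 on success, -1 if three integers were not found. */
int parse_kswapd_line(const char *line, struct kswapd_params *out)
{
    int n = sscanf(line, "%d %d %d",
                   &out->tries_base, &out->tries_min,
                   &out->swap_cluster);
    return n == 3 ? 0 : -1;
}
```

A caller would typically fgets() one line from the file and
pass it to parse_kswapd_line().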

==============================================================

overcommit_memory:

This file contains only one value. The algorithm shown below
is used to decide if there's enough memory. If the value
of overcommit_memory is nonzero, then there's always enough
memory :-). This is a useful feature, since programs often
malloc() huge amounts of memory 'just in case', while they
only use a small part of it. Leaving this value at 0 will
lead to the failure of such a huge malloc(), when in fact
the system has enough memory for the program to run...
On the other hand, enabling this feature can cause you to
run out of memory and thrash the system to death, so large
and/or important servers will want to set this value to 0.

From linux/mm/mmap.c:
--------------------------------------------------------------
static inline int vm_enough_memory(long pages)
{
    /* This stupid algorithm decides whether we have enough memory:
     * while simple, it should work in most obvious cases.  It's
     * easily fooled, but this should catch most mistakes.
     */
    long freepages;

    /* Sometimes we want to use more memory than we have. */
    if (sysctl_overcommit_memory)
        return 1;

    freepages = buffermem >> PAGE_SHIFT;
    freepages += page_cache_size;
    freepages >>= 1;
    freepages += nr_free_pages;
    freepages += nr_swap_pages;
    freepages -= num_physpages >> 4;
    return freepages > pages;
}
--------------------------------------------------------------
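
To see what the heuristic does with real numbers, the same
arithmetic can be written as a plain function that takes its
inputs as arguments instead of reading kernel globals. This
is a userspace sketch: the function names are invented, and
all arguments are page counts (the kernel keeps buffermem in
bytes and shifts it by PAGE_SHIFT first).

```c
#include <stdbool.h>

/* Pages the heuristic considers available: half of buffer and
 * page cache, plus free pages and swap, minus 1/16 of RAM. */
long overcommit_available_pages(long buffer_pages,
                                long page_cache_pages,
                                long free_pages, long swap_pages,
                                long phys_pages)
{
    long avail = buffer_pages + page_cache_pages;

    avail >>= 1;
    avail += free_pages;
    avail += swap_pages;
    avail -= phys_pages >> 4;
    return avail;
}

/* A nonzero overcommit setting always says yes; otherwise the
 * request must be strictly smaller than the estimate. */
bool enough_memory(long request_pages, int overcommit,
                   long avail_pages)
{
    if (overcommit)
        return true;
    return avail_pages > request_pages;
}
```

With 1000 buffer pages, 3000 page cache pages, 500 free
pages, 8000 swap pages and 16384 pages of RAM, the estimate
is (1000 + 3000) / 2 + 500 + 8000 - 1024 = 9476 pages.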

==============================================================

pagecache:

This file does exactly the same as buffermem, only this
file controls the struct page_cache, and thus controls
the amount of memory allowed for memory mapping of files.

You don't want the minimum level to be too low, otherwise
your system might thrash when memory is tight or fragmentation
is high...

==============================================================

swapctl:

This file contains no fewer than 8 variables.
All of these values are used by kswapd, and the usage can be
found in linux/mm/vmscan.c.

From linux/include/linux/swapctl.h:
--------------------------------------------------------------
typedef struct swap_control_v5
{
    unsigned int    sc_max_page_age;
    unsigned int    sc_page_advance;
    unsigned int    sc_page_decline;
    unsigned int    sc_page_initial_age;
    unsigned int    sc_age_cluster_fract;
    unsigned int    sc_age_cluster_min;
    unsigned int    sc_pageout_weight;
    unsigned int    sc_bufferout_weight;
} swap_control_v5;
--------------------------------------------------------------

The first four variables are used to keep track of Linux's
page aging. Page aging is a bookkeeping method to keep track
of which pages of memory are used often, and which pages can
be swapped out without consequences.

When a page is swapped in, it starts at sc_page_initial_age
(default 3) and when the page is scanned by kswapd, its age
is adjusted according to the following scheme:
- if the page was used since the last time we scanned, its
  age is increased by sc_page_advance (default 3) up to a maximum
  of sc_max_page_age (default 20)
- else (it wasn't used) its age is decreased by sc_page_decline
  (default 1)
And when a page reaches age 0, it's ready to be swapped out.
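
The aging scheme above can be sketched as a single step
function. This is an illustration built from the description
and the defaults quoted in the text; the struct and function
names are made up and this is not the code in mm/vmscan.c.

```c
struct aging_params {
    unsigned int max_page_age;
    unsigned int page_advance;
    unsigned int page_decline;
    unsigned int page_initial_age;
};

/* Defaults as quoted in the text: max 20, advance 3,
 * decline 1, initial age 3. */
static const struct aging_params default_aging = { 20, 3, 1, 3 };

/* Age of a page after one scan. 'referenced' is nonzero if the
 * page was used since the previous scan. */
unsigned int next_page_age(unsigned int age, int referenced,
                           const struct aging_params *p)
{
    if (referenced) {
        age += p->page_advance;
        if (age > p->max_page_age)
            age = p->max_page_age;      /* clamp at the maximum */
    } else {
        /* Never go below zero; age 0 means "may be swapped out". */
        age = (age > p->page_decline) ? age - p->page_decline : 0;
    }
    return age;
}
```

With the defaults, a freshly swapped-in page that is never
touched again reaches age 0 after three scans.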

The next four variables can be used to control kswapd's
aggressiveness in swapping out pages.

sc_age_cluster_fract is used to calculate how many pages from
a process are to be scanned by kswapd. The formula used is
sc_age_cluster_fract/1024 * RSS, so if you want kswapd to scan
the whole process, sc_age_cluster_fract needs to have a value
of 1024. The minimum number of pages kswapd will scan is
given by sc_age_cluster_min; this is done so that kswapd will
also scan small processes.
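
The formula and its lower bound can be written out as follows.
This is a sketch of the calculation as described here, with a
hypothetical function name; rss is the resident set size of
the process in pages.

```c
/* Number of pages to scan for a process:
 * fract/1024 of its RSS, but never fewer than min. */
unsigned long age_cluster_size(unsigned long rss,
                               unsigned long fract,
                               unsigned long min)
{
    /* Multiply before dividing so small fractions are not
     * truncated to zero too early. */
    unsigned long n = rss * fract / 1024;

    return n < min ? min : n;
}
```

For a process of 1000 pages, a fraction of 256 gives 250
pages; for a process of 100 pages and a fraction of 16, the
formula gives 1 page, so a minimum of 32 takes over.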

The values of sc_pageout_weight and sc_bufferout_weight are
used to control how many tries kswapd will make in order
to swap out one page or buffer. These values can be used to
fine-tune the ratio between user pages and buffer/cache memory.
When you find that your Linux system is swapping out too many
process pages in order to satisfy buffer memory demands, you
might want to either increase sc_bufferout_weight, or decrease
the value of sc_pageout_weight.

==============================================================

swapout_interval:

The single value in this file controls the amount of time
between successive wakeups of kswapd when nr_free_pages is
between freepages.low and freepages.high. The default value
of HZ/4 is usually right, but when kswapd can't keep up with
the number of allocations in your system, you might want to
decrease this number.
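
To get a feeling for the default, the interval can be turned
into milliseconds. The sketch assumes the value is a count of
jiffies, as the HZ/4 default suggests; jiffies_to_ms() is a
name chosen for this example.

```c
/* Convert a jiffies count to milliseconds for a given tick
 * rate (hz ticks per second). The result is rounded down. */
long jiffies_to_ms(long jiffies, long hz)
{
    return jiffies * 1000 / hz;
}
```

HZ/4 is 25 jiffies at 100 ticks per second and 256 jiffies at
1024 ticks per second; both come out as 250 milliseconds.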