Documentation for /proc/sys/vm/*        version 0.1
        (c) 1998, Rik van Riel

For general info and legal blurb, please look in README.

==============================================================

This file contains the documentation for the sysctl files in
/proc/sys/vm and is valid for Linux kernel version 2.1.

The files in this directory can be used to tune the operation
of the virtual memory (VM) subsystem of the Linux kernel, and
one of the files (bdflush) also has a little influence on disk
usage.

Currently, these files are in /proc/sys/vm:
- bdflush
- buffermem
- freepages
- kswapd
- overcommit_memory
- pagecache
- swapctl
- swapout_interval

==============================================================

bdflush:

This file controls the operation of the bdflush kernel
daemon. The source code to this struct can be found in
linux/fs/buffer.c. It currently contains 9 integer values,
of which 6 are actually used by the kernel.

From linux/fs/buffer.c:
--------------------------------------------------------------
union bdflush_param{
    struct {
        int nfract;      /* Percentage of buffer cache dirty to
                            activate bdflush */
        int ndirty;      /* Maximum number of dirty blocks to
                            write out per wake-cycle */
        int nrefill;     /* Number of clean buffers to try to
                            obtain each time we call refill */
        int nref_dirt;   /* Dirty buffer threshold for activating
                            bdflush when trying to refill buffers. */
        int dummy1;      /* unused */
        int age_buffer;  /* Time for normal buffer to age before
                            we flush it */
        int age_super;   /* Time for superblock to age before we
                            flush it */
        int dummy2;      /* unused */
        int dummy3;      /* unused */
    } b_un;
    unsigned int data[N_PARAM];
} bdf_prm = {{40, 500, 64, 256, 15, 30*HZ, 5*HZ, 1884, 2}};
--------------------------------------------------------------

The first parameter (nfract) governs the maximum percentage
of dirty buffers in the buffer cache. Dirty means that the
contents of the buffer still have to be written to disk (as
opposed to a clean buffer, which can just be forgotten about).
Setting this to a high value means that Linux can delay disk
writes for a long time, but it also means that it will have
to do a lot of I/O at once when memory becomes short. A low
value will spread out disk I/O more evenly.

The second parameter (ndirty) gives the maximum number of
dirty buffers that bdflush can write to the disk at one time.
A high value will mean delayed, bursty I/O, while a small
value can lead to memory shortage when bdflush isn't woken
up often enough.

The third parameter (nrefill) is the number of buffers that
bdflush will add to the list of free buffers when
refill_freelist() is called. It is necessary to allocate free
buffers beforehand, since the buffers often are of a different
size than memory pages and some bookkeeping needs to be done
beforehand. The higher the number, the more memory will be
wasted and the less often refill_freelist() will need to run.

When refill_freelist() comes across more than nref_dirt dirty
buffers, it will wake up bdflush.

Finally, the age_buffer and age_super parameters govern the
maximum time Linux waits before writing out a dirty buffer
to disk. The value is expressed in jiffies (clockticks); the
number of jiffies per second is 100, except on Alpha machines
(1024). Age_buffer is the maximum age for data blocks, while
age_super is for filesystem metadata.

==============================================================

buffermem:

The three values in this file correspond to the values in
the struct buffer_mem. They control how much memory should be
used for buffer memory. Each percentage is calculated as a
percentage of total system memory.

The values are:
min_percent     -- this is the minimum percentage of memory
                   that should be spent on buffer memory
borrow_percent  -- when Linux is short on memory, and the
                   buffer cache uses more memory than this
                   percentage, free pages are stolen from it
max_percent     -- this is the maximum percentage of memory
                   that can be used for buffer memory

==============================================================

freepages:

This file contains the values in the struct freepages. That
struct contains three members: min, low and high.

Although the goal of the Linux memory management subsystem
is to avoid fragmentation and make large chunks of free
memory (so that we can hand out DMA buffers and such), there
still are some page-based limits in the system, mainly to
make sure we don't waste too much memory trying to get large
free areas.

The meaning of the numbers is:

freepages.min   When the number of free pages in the system
                reaches this number, only the kernel can
                allocate more memory.
freepages.low   If memory is too fragmented, the swapout
                daemon is started, except when the number
                of free pages is larger than freepages.low.
freepages.high  The swapping daemon exits when memory is
                sufficiently defragmented, when the number
                of free pages reaches freepages.high or when
                it has tried the maximum number of times.

==============================================================

kswapd:

Kswapd is the kernel swapout daemon. That is, kswapd is the
piece of the kernel that frees memory when it gets fragmented
or full. Since every system is different, you'll probably
want some control over this piece of the system.

The numbers in this file correspond to the numbers in the
struct pager_daemon {tries_base, tries_min, swap_cluster
}; The tries_base and swap_cluster probably have the
largest influence on system performance.

tries_base      The maximum number of pages kswapd tries to
                free in one round is calculated from this
                number. Usually this number will be divided
                by 4 or 8 (see mm/vmscan.c), so it isn't as
                big as it looks.
                When you need to increase the bandwidth to/from
                swap, you'll want to increase this number.
tries_min       This is the minimum number of times kswapd
                tries to free a page each time it is called.
                Basically it's just there to make sure that
                kswapd frees some pages even when it's being
                called with minimum priority.
swap_cluster    This is the number of pages kswapd writes in
                one turn. You want this large so that kswapd
                does its I/O in large chunks and the disk
                doesn't have to seek often, but you don't want
                it to be too large since that would flood the
                request queue.

==============================================================

overcommit_memory:

This file contains only one value. The following algorithm
is used to decide if there's enough memory. If the value
of overcommit_memory is nonzero, then there's always
enough memory :-). This is a useful feature, since programs
often malloc() huge amounts of memory 'just in case', while
they only use a small part of it. Leaving this value at 0
will lead to the failure of such a huge malloc(), when in
fact the system has enough memory for the program to run.
On the other hand, enabling this feature can cause you to
run out of memory and thrash the system to death, so large
and/or important servers will want to set this value to 0.

From linux/mm/mmap.c:
--------------------------------------------------------------
static inline int vm_enough_memory(long pages)
{
        /* This stupid algorithm decides whether we have enough memory:
         * while simple, it should work in most obvious cases.  It's
         * easily fooled, but this should catch most mistakes.
         */
        long freepages;

        /* Sometimes we want to use more memory than we have. */
        if (sysctl_overcommit_memory)
            return 1;

        freepages = buffermem >> PAGE_SHIFT;
        freepages += page_cache_size;
        freepages >>= 1;
        freepages += nr_free_pages;
        freepages += nr_swap_pages;
        freepages -= num_physpages >> 4;
        return freepages > pages;
}

==============================================================

pagecache:

This file does exactly the same as buffermem, only this
file controls the struct page_cache, and thus controls
the amount of memory allowed for memory mapping of files.

You don't want the minimum level to be too low, otherwise
your system might thrash when memory is tight or
fragmentation is high.

==============================================================

swapctl:

This file contains no less than 8 variables.
All of these values are used by kswapd, and the usage can be
found in linux/mm/vmscan.c.

From linux/include/linux/swapctl.h:
--------------------------------------------------------------
typedef struct swap_control_v5
{
    unsigned int    sc_max_page_age;
    unsigned int    sc_page_advance;
    unsigned int    sc_page_decline;
    unsigned int    sc_page_initial_age;
    unsigned int    sc_age_cluster_fract;
    unsigned int    sc_age_cluster_min;
    unsigned int    sc_pageout_weight;
    unsigned int    sc_bufferout_weight;
} swap_control_v5;
--------------------------------------------------------------

The first four variables are used to keep track of Linux's
page aging. Page aging is a bookkeeping method to keep track
of which pages of memory are used often, and which pages can
be swapped out without consequences.

When a page is swapped in, it starts at sc_page_initial_age
(default 3) and when the page is scanned by kswapd, its age
is adjusted according to the following scheme:
- if the page was used since the last time we scanned, its
  age is increased by sc_page_advance (default 3) up to a
  maximum of sc_max_page_age (default 20)
- else (it wasn't used) its age is decreased by
  sc_page_decline (default 1)
And when a page reaches age 0, it's ready to be swapped out.
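
The aging scheme above can be sketched in a few lines of C. This
is only an illustration of the rules as described here, not the
code from mm/vmscan.c: the struct aging_params and the functions
age_page() and scans_until_swappable() are hypothetical names,
and clamping the age at 0 is an assumption made to keep the
unsigned arithmetic safe.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical holder for the four aging tunables. The fields
 * mirror the first four members of swap_control_v5, but this
 * struct is illustrative only. */
struct aging_params {
    unsigned int max_page_age;      /* sc_max_page_age, default 20 */
    unsigned int page_advance;      /* sc_page_advance, default 3 */
    unsigned int page_decline;      /* sc_page_decline, default 1 */
    unsigned int page_initial_age;  /* sc_page_initial_age, default 3 */
};

/* Return the new age of a page after one scan.
 * referenced != 0 means the page was used since the last scan. */
unsigned int age_page(const struct aging_params *p,
                      unsigned int age, int referenced)
{
    assert(p != NULL);
    if (referenced) {
        /* Used page: advance the age, capped at the maximum. */
        age += p->page_advance;
        if (age > p->max_page_age)
            age = p->max_page_age;
    } else {
        /* Unused page: decline the age, never below 0 (assumption). */
        age = (age > p->page_decline) ? age - p->page_decline : 0;
    }
    return age;
}

/* Count how many consecutive scans without a reference it takes
 * for a page of the given age to reach 0, i.e. to become ready
 * to be swapped out. */
unsigned int scans_until_swappable(const struct aging_params *p,
                                   unsigned int age)
{
    unsigned int scans = 0;

    assert(p != NULL && p->page_decline > 0);
    while (age > 0) {
        age = age_page(p, age, 0);
        scans++;
    }
    return scans;
}
```

With the defaults quoted above, a freshly swapped-in page that is
never touched again survives three scans before it reaches age 0,
while a page that is used on every scan climbs to the maximum age
and stays there.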

The next four variables can be used to control kswapd's
aggressiveness in swapping out pages.

sc_age_cluster_fract is used to calculate how many pages from
a process are to be scanned by kswapd. The formula used is
sc_age_cluster_fract/1024 * RSS, so if you want kswapd to scan
the whole process, sc_age_cluster_fract needs to have a value
of 1024. The minimum number of pages kswapd will scan is
represented by sc_age_cluster_min; this is done so that kswapd
will also scan small processes.

The values of sc_pageout_weight and sc_bufferout_weight are
used to control how many tries kswapd will make in order
to swap out one page / buffer. These values can be used to
fine-tune the ratio between user pages and buffer/cache memory.
When you find that your Linux system is swapping out too many
process pages in order to satisfy buffer memory demands, you
might want to either increase sc_bufferout_weight, or decrease
the value of sc_pageout_weight.

==============================================================

swapout_interval:

The single value in this file controls the amount of time
between successive wakeups of kswapd when nr_free_pages is
between free_pages_low and free_pages_high. The default value
of HZ/4 is usually right, but when kswapd can't keep up with
the number of allocations in your system, you might want to
decrease this number.
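
To get a feel for what HZ/4 means in wall-clock time, the small
sketch below converts a jiffies value to milliseconds. It assumes
the HZ values mentioned in the bdflush section (100, or 1024 on
Alpha); jiffies_to_ms() and default_swapout_interval() are
hypothetical user-space helpers written for this example, not
kernel functions.

```c
#include <assert.h>

/* Convert a number of jiffies (clock ticks) to milliseconds,
 * given the number of jiffies per second (hz). */
unsigned long jiffies_to_ms(unsigned long jiffies, unsigned long hz)
{
    assert(hz > 0);
    return jiffies * 1000UL / hz;
}

/* The default swapout_interval described above: HZ/4 jiffies. */
unsigned long default_swapout_interval(unsigned long hz)
{
    return hz / 4;
}
```

For HZ = 100 the default is 25 jiffies, and for HZ = 1024 it is
256 jiffies; both correspond to a wakeup every 250 milliseconds.
Halving the value in the file would thus wake kswapd roughly
every 125 milliseconds.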