Documentation for /proc/sys/vm/* kernel version 2.2.10
(c) 1998, 1999, Rik van Riel <riel@nl.linux.org>
For general info and legal blurb, please look in README.
==============================================================
This file contains the documentation for the sysctl files in
/proc/sys/vm and is valid for Linux kernel version 2.2.
The files in this directory can be used to tune the operation
of the virtual memory (VM) subsystem of the Linux kernel, and
one of the files (bdflush) also has a little influence on disk
usage.
Default values and initialization routines for most of these
files can be found in mm/swap.c.
Currently, these files are in /proc/sys/vm:
- bdflush
- buffermem
- freepages
- kswapd
- overcommit_memory
- page-cluster
- pagecache
- pagetable_cache
==============================================================
bdflush:
This file controls the operation of the bdflush kernel
daemon. The struct it controls is defined in
linux/fs/buffer.c. The file currently contains 9 integer values,
of which 6 are actually used by the kernel.
From linux/fs/buffer.c:
--------------------------------------------------------------
union bdflush_param{
struct {
int nfract; /* Percentage of buffer cache dirty to
activate bdflush */
int ndirty; /* Maximum number of dirty blocks to
write out per wake-cycle */
int nrefill; /* Number of clean buffers to try to
obtain each time we call refill */
int nref_dirt; /* Dirty buffer threshold for activating
bdflush when trying to refill buffers. */
int dummy1; /* unused */
int age_buffer; /* Time for normal buffer to age before
we flush it */
int age_super; /* Time for superblock to age before we
flush it */
int dummy2; /* unused */
int dummy3; /* unused */
} b_un;
unsigned int data[N_PARAM];
} bdf_prm = {{40, 500, 64, 256, 15, 30*HZ, 5*HZ, 1884, 2}};
--------------------------------------------------------------
The first parameter (nfract) governs the maximum percentage of
dirty buffers in the buffer cache. Dirty means that the contents
of the buffer still have to be written to disk (as opposed
to a clean buffer, which can just be forgotten about).
Setting this to a high value means that Linux can delay disk
writes for a long time, but it also means that it will have
to do a lot of I/O at once when memory becomes short. A low
value will spread out disk I/O more evenly.
The second parameter (ndirty) gives the maximum number of
dirty buffers that bdflush can write to the disk at one time.
A high value will mean delayed, bursty I/O, while a small
value can lead to memory shortage when bdflush isn't woken
up often enough.
The third parameter (nrefill) is the number of buffers that
bdflush will add to the list of free buffers when
refill_freelist() is called. It is necessary to allocate free
buffers beforehand, since the buffers often are of a different
size than memory pages and some bookkeeping needs to be done
ahead of time. The higher the number, the more memory will be
wasted and the less often refill_freelist() will need to run.
When refill_freelist() comes across more than nref_dirt dirty
buffers, it will wake up bdflush.
Finally, the age_buffer and age_super parameters govern the
maximum time Linux waits before writing out a dirty buffer
to disk. The value is expressed in jiffies (clockticks), the
number of jiffies per second is 100, except on Alpha machines
(1024). Age_buffer is the maximum age for data blocks, while
age_super is for filesystem metadata.
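As an illustration (the 60 below is an arbitrary example value, not a
recommended setting), the nine parameters can be read and written as a
single whitespace-separated line through /proc/sys/vm/bdflush. A small
sketch in C, to be run as root:
--------------------------------------------------------------
/* Read the nine bdflush values, raise nfract, write them all back. */
#include <stdio.h>

int main(void)
{
	int v[9], i;
	FILE *f = fopen("/proc/sys/vm/bdflush", "r");

	if (!f)
		return 1;
	for (i = 0; i < 9; i++)
		if (fscanf(f, "%d", &v[i]) != 1)
			return 1;
	fclose(f);

	v[0] = 60;	/* nfract: let 60% of the buffer cache go dirty */

	f = fopen("/proc/sys/vm/bdflush", "w");
	if (!f)
		return 1;
	for (i = 0; i < 9; i++)
		fprintf(f, "%d ", v[i]);
	fprintf(f, "\n");
	fclose(f);
	return 0;
}
--------------------------------------------------------------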
==============================================================
buffermem:
The three values in this file correspond to the values in
the struct buffer_mem. It controls how much memory should
be used for buffer memory, expressed as a percentage of
total system memory.
The values are:
min_percent -- this is the minimum percentage of memory
that should be spent on buffer memory
borrow_percent -- UNUSED
max_percent -- UNUSED
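For example (illustrative numbers only), a sketch that reserves at
least 10% of memory for buffer memory by rewriting the file; the two
unused fields are written as well, simply to keep the three-value
layout intact:
--------------------------------------------------------------
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/buffermem", "w");

	if (!f)
		return 1;
	/* min_percent borrow_percent max_percent */
	fprintf(f, "10 10 60\n");
	fclose(f);
	return 0;
}
--------------------------------------------------------------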
==============================================================
freepages:
This file contains the values in the struct freepages. That
struct contains three members: min, low and high.
The meaning of the numbers is:
freepages.min When the number of free pages in the system
reaches this number, only the kernel can
allocate more memory.
freepages.low If the number of free pages gets below this
point, the kernel starts swapping aggressively.
freepages.high The kernel tries to keep up to this amount of
memory free; if free memory falls below this point,
the kernel gently starts swapping in the hope
that it never has to do really aggressive swapping.
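A little sketch that prints the current thresholds; the kB figures
assume the 4 kB page size of i386:
--------------------------------------------------------------
#include <stdio.h>

int main(void)
{
	int min, low, high;
	FILE *f = fopen("/proc/sys/vm/freepages", "r");

	if (!f || fscanf(f, "%d %d %d", &min, &low, &high) != 3)
		return 1;
	fclose(f);
	printf("min  %5d pages (%d kB)\n", min, min * 4);
	printf("low  %5d pages (%d kB)\n", low, low * 4);
	printf("high %5d pages (%d kB)\n", high, high * 4);
	return 0;
}
--------------------------------------------------------------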
==============================================================
kswapd:
Kswapd is the kernel swapout daemon. That is, kswapd is that
piece of the kernel that frees memory when it gets fragmented
or full. Since every system is different, you'll probably want
some control over this piece of the system.
The numbers in this file correspond to the numbers in the
struct pager_daemon {tries_base, tries_min, swap_cluster};
tries_base and swap_cluster probably have the
largest influence on system performance.
tries_base The maximum number of pages kswapd tries to
free in one round is calculated from this
number. Usually this number will be divided
by 4 or 8 (see mm/vmscan.c), so it isn't as
big as it looks.
When you need to increase the bandwidth to/from
swap, you'll want to increase this number.
tries_min This is the minimum number of times kswapd
tries to free a page each time it is called.
Basically it's just there to make sure that
kswapd frees some pages even when it's being
called with minimum priority.
swap_cluster This is the number of pages kswapd writes in
one turn. You want this large so that kswapd
does its I/O in large chunks and the disk
doesn't have to seek often, but you don't want
it to be too large since that would flood the
request queue.
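To get a feeling for these numbers, the sketch below reads the file
and shows the divide-by-8 case mentioned above (mm/vmscan.c has the
exact, priority-dependent divisor):
--------------------------------------------------------------
#include <stdio.h>

int main(void)
{
	int tries_base, tries_min, swap_cluster;
	FILE *f = fopen("/proc/sys/vm/kswapd", "r");

	if (!f || fscanf(f, "%d %d %d",
			 &tries_base, &tries_min, &swap_cluster) != 3)
		return 1;
	fclose(f);
	printf("tries_base %d: at most about %d pages freed per round\n",
	       tries_base, tries_base / 8);
	printf("swap_cluster: %d pages written per turn\n", swap_cluster);
	return 0;
}
--------------------------------------------------------------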
==============================================================
overcommit_memory:
This value contains a flag that enables memory overcommitment.
When this flag is 0, the kernel checks before each malloc()
to see if there's enough memory left. If the flag is nonzero,
the system pretends there's always enough memory.
This feature can be very useful because there are a lot of
programs that malloc() huge amounts of memory "just-in-case"
and don't use much of it.
Look at: mm/mmap.c::vm_enough_memory() for more information.
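A quick way to see the difference (how far this gets depends on your
RAM, swap and C library, so treat it as a sketch): with the flag at 0
the huge allocation below is likely to fail on a small machine, while
with the flag nonzero it will usually succeed, because the check in
vm_enough_memory() is skipped. Nothing is ever written to the pages.
--------------------------------------------------------------
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	size_t huge = 1024UL * 1024 * 1024;	/* 1 GB "just-in-case" */
	void *p = malloc(huge);

	printf("malloc of %lu bytes %s\n", (unsigned long)huge,
	       p ? "succeeded" : "failed");
	free(p);
	return 0;
}
--------------------------------------------------------------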
==============================================================
page-cluster:
The Linux VM subsystem avoids excessive disk seeks by reading
multiple pages on a page fault. The number of pages it reads
is dependent on the amount of memory in your machine.
The number of pages the kernel reads in at once is equal to
2 ^ page-cluster. Values above 2 ^ 5 don't make much sense
for swap because we only cluster swap data in 32-page groups.
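For example, with page-cluster set to 4 the kernel reads in 2^4 = 16
pages at once (64 kB with the 4 kB pages of i386), and a value of 5
already covers a full 32-page swap cluster.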
==============================================================
pagecache:
This file does exactly the same as buffermem, only this
file controls the struct page_cache, and thus controls
the amount of memory used for the page cache.
In 2.2, the page cache is used for 3 main purposes:
- caching read() data from files
- caching mmap()ed data and executable files
- swap cache
When your system is both deep in swap and high on cache,
it probably means that a lot of the swapped data is being
cached, making for more efficient swapping than possible
with the 2.0 kernel.
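The file has the same three-percentage layout as buffermem; a sketch
that simply prints the current settings:
--------------------------------------------------------------
#include <stdio.h>

int main(void)
{
	int min, borrow, max;
	FILE *f = fopen("/proc/sys/vm/pagecache", "r");

	if (!f || fscanf(f, "%d %d %d", &min, &borrow, &max) != 3)
		return 1;
	fclose(f);
	printf("min %d%%  borrow %d%%  max %d%%\n", min, borrow, max);
	return 0;
}
--------------------------------------------------------------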
==============================================================
pagetable_cache:
The kernel keeps a number of page tables in a per-processor
cache (this helps a lot on SMP systems). The cache size for
each processor will be between the low and the high value.
On a low-memory, single CPU system you can safely set these
values to 0 so you don't waste the memory. On SMP systems it
is used so that the system can do fast pagetable allocations
without having to acquire the kernel memory lock.
For large systems, the settings are probably OK. For normal
systems they won't hurt a bit. For small systems (<16MB ram)
it might be advantageous to set both values to 0.
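On such a small, single-CPU machine that could be done with a sketch
like this (run as root):
--------------------------------------------------------------
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/vm/pagetable_cache", "w");

	if (!f)
		return 1;
	fprintf(f, "0 0\n");	/* low and high watermarks both zero */
	fclose(f);
	return 0;
}
--------------------------------------------------------------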