1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
|
This is an NFS client for Linux that supports async RPC calls for
read-ahead (and hopefully soon, write-back) on regular files.
The implementation uses a straightforward nfsiod scheme. After
trying out a number of different concepts, I finally got back to
this concept, because everything else either didn't work or gave me
headaches. It's not flashy, but it works without hacking into any
other regions of the kernel.
HOW TO USE
This stuff compiles as a loadable module (I developed it on 1.3.77).
Simply type mkmodule, and insmod nfs.o. This will start four nfsiod's
at the same time (which will show up under the pseudonym of insmod in
ps-style listings).
Alternatively, you can put it right into the kernel: remove everything
from fs/nfs, move the Makefile and all *.c to this directory, and
copy all *.h files to include/linux.
After mounting, you should be able to watch (with tcpdump) several
RPC READ calls being placed simultaneously.
HOW IT WORKS
When a process reads from a file on an NFS volume, the following
happens:
* nfs_file_read sets file->f_reada if more than 1K is
read at once. It then calls generic_file_read.
* generic_file_read requests one ore more pages via
nfs_readpage.
* nfs_readpage allocates a request slot with an nfsiod
daemon, fills in the READ request, sends out the
RPC call, kicks the daemon, and returns.
If there's no free biod, nfs_readpage places the
call directly, waiting for the reply (sync readpage).
* nfsiod calls nfs_rpc_doio to collect the reply. If the
call was successful, it sets page->uptodate and
wakes up all processes waiting on page->wait;
This is the rough outline only. There are a few things to note:
* Async RPC will not be tried when server->rsize < PAGE_SIZE.
* When an error occurs, nfsiod has no way of returning
the error code to the user process. Therefore, it flags
page->error and wakes up all processes waiting on that
page (they usually do so from within generic_readpage).
generic_readpage finds that the page is still not
uptodate, and calls nfs_readpage again. This time around,
nfs_readpage notices that page->error is set and
unconditionally does a synchronous RPC call.
This area needs a lot of improvement, since read errors
are not that uncommon (e.g. we have to retransmit calls
if the fsuid is different from the ruid in order to
cope with root squashing and stuff like this).
Retransmits with fsuid/ruid change should be handled by
nfsiod, but this doesn't come easily (a more general nfs_call
routine that does all this may be useful...)
* To save some time on readaheads, we save one data copy
by frobbing the page into the iovec passed to the
RPC code so that the networking layer copies the
data into the page directly.
This needs to be adjustable (different authentication
flavors; AUTH_NULL versus AUTH_SHORT verifiers).
* Currently, a fixed number of nfsiod's is spawned from
within init_nfs_fs. This is problematic when running
as a loadable module, because this will keep insmod's
memory allocated. As a side-effect, you will see the
nfsiod processes listed as several insmod's when doing
a `ps.'
* This NFS client implements server congestion control via
Van Jacobson slow start as implemented in 44BSD. I haven't
checked how well this behaves, but since Rick Macklem did
it this way, it should be okay :-)
WISH LIST
After giving this thing some testing, I'd like to add some more
features:
* Some sort of async write handling. True write-back doesn't
work with the current kernel (I think), because invalidate_pages
kills all pages, regardless of whether they're dirty or not.
Besides, this may require special bdflush treatment because
write caching on clients is really hairy.
Alternatively, a write-through scheme might be useful where
the client enqueues the request, but leaves collecting the
results to nfsiod. Again, we need a way to pass RPC errors
back to the application.
* Support for different authentication flavors.
* /proc/net/nfsclnt (for nfsstat, etc.).
March 29, 1996
Olaf Kirch <okir@monad.swb.de>
|