summaryrefslogtreecommitdiffstats
path: root/README.distribution
diff options
context:
space:
mode:
authorosdl.net!shemminger <osdl.net!shemminger>2004-08-23 20:21:21 +0000
committerosdl.net!shemminger <osdl.net!shemminger>2004-08-23 20:21:21 +0000
commite7f02a522fb0910b0e3f1db485ec059661438b5d (patch)
tree6b9c1fa62399f8553fb9b0f17dbf17043a78dbd0 /README.distribution
parent98a81335a814f2584eb16db1302dd6b922e06e2a (diff)
Rename: tc/README.distribution -> README.distribution
(Logical change 1.71)
Diffstat (limited to 'README.distribution')
-rw-r--r--README.distribution95
1 files changed, 95 insertions, 0 deletions
diff --git a/README.distribution b/README.distribution
index e69de29b..fe78fb4a 100644
--- a/README.distribution
+++ b/README.distribution
@@ -0,0 +1,95 @@
+I. About the distribution tables
+
+The table used for "synthesizing" the distribution is essentially a scaled,
+translated, inverse to the cumulative distribution function.
+
+Here's how to think about it: Let F() be the cumulative distribution
+function for a probability distribution X. We'll assume we've scaled
+things so that X has mean 0 and standard deviation 1, though that's not
+so important here. Then:
+
+ F(x) = P(X <= x) = \int_{-inf}^x f
+
+where f is the probability density function.
+
+F is monotonically increasing, so has an inverse function G, with range
+0 to 1. Here, G(t) = the x such that P(X <= x) = t. (In general, G may
+have singularities if X has point masses, i.e., points x such that
+P(X = x) > 0.)
+
+Now we create a tabular representation of G as follows: Choose some table
+size N, and for the ith entry, put in G(i/N). Let's call this table T.
+
+The claim now is, I can create a (discrete) random variable Y whose
+distribution has the same approximate "shape" as X, simply by letting
+Y = T(U), where U is a discrete uniform random variable with range 1 to N.
+To see this, it's enough to show that Y's cumulative distribution function,
+(let's call it H), is a discrete approximation to F. But
+
+ H(x) = P(Y <= x)
+ = (# of entries in T <= x) / N -- as Y chosen uniformly from T
+ = i/N, where i is the largest integer such that G(i/N) <= x
+ = i/N, where i is the largest integer such that i/N <= F(x)
+ -- since G and F are inverse functions (and F is
+ increasing)
+ = floor(N*F(x))/N
+
+as desired.
+
+II. How to create distribution tables (in theory)
+
+How can we create this table in practice? In some cases, F may have a
+simple expression which allows evaluating its inverse directly. The
+pareto distribution is one example of this. In other cases, and
+especially for matching an experimentally observed distribution, it's
+easiest simply to create a table for F and "invert" it. Here, we give
+a concrete example, namely how the new "experimental" distribution was
+created.
+
+1. Collect enough data points to characterize the distribution. Here, I
+collected 25,000 "ping" roundtrip times to a "distant" point (time.nist.gov).
+That's far more data than is really necessary, but it was fairly painless to
+collect it, so...
+
+2. Normalize the data so that it has mean 0 and standard deviation 1.
+
+3. Determine the cumulative distribution. The code I wrote creates a table
+covering the range -10 to +10, with granularity .00005. Obviously, this
+is absurdly over-precise, but since it's a one-time only computation, I
+figured it hardly mattered.
+
+4. Invert the table: for each table entry F(x) = y, make the y*TABLESIZE
+(here, 4096) entry be x*TABLEFACTOR (here, 8192). This creates a table
+for the ("normalized") inverse of size TABLESIZE, covering its domain 0
+to 1 with granularity 1/TABLESIZE. Note that even with the granularity
+used in creating the table for F, it's possible not all the entries in
+the table for G will be filled in. So, make a pass through the
+inverse's table, filling in any missing entries by linear interpolation.
+
+III. How to create distribution tables (in practice)
+
+If you want to do all this yourself, I've provided several tools to help:
+
+1. maketable does the steps 2-4 above, and then generates the appropriate
+header file. So if you have your own time distribution, you can generate
+the header simply by:
+
+ maketable < time.values > header.h
+
+2. As explained in the other README file, the somewhat sleazy way I have
+of generating correlated values needs correction. You can generate your
+own correction tables by compiling makesigtable and makemutable with
+your header file. Check the Makefile to see how this is done.
+
+3. Warning: maketable, makesigtable and especially makemutable do
+enormous amounts of floating point arithmetic. Don't try running
+these on an old 486. (NIST Net itself will run fine on such a
+system, since in operation, it just needs to do a few simple integral
+calculations. But getting there takes some work.)
+
+4. The tables produced are all normalized for mean 0 and standard
+deviation 1. How do you know what values to use for real? Here, I've
+provided a simple "stats" utility. Give it a series of floating point
+values, and it will return their mean (mu), standard deviation (sigma),
+and correlation coefficient (rho). You can then plug these values
+directly into NIST Net.