Intervals & Tests
The All Point method of curve fitting provides a best fit based on all of the data points using each of the four transformations (Bounded, Unbounded, Log Normal, Normal). In the case of the Normal and Log Normal forms, the regression of the transformed standardized normal variable on the cumulative Normal distribution involves only two parameters, resulting in simple linear regression. In the case of the more general Bounded and Unbounded forms, the estimation of the four parameters requires the use of the Marquardt nonlinear regression algorithm. This algorithm, well documented in mathematical literature, iteratively varies each of the parameters in the direction of best fit until an optimal solution results (Marquardt). The Kolmogorov-Smirnov (K-S) test is then applied to the transformed (normalized) values derived for each of the Johnson forms to test the significance of the fit. The form having the smallest maximum absolute deviation is selected as the final fitted form.
An all-points Johnson fit to data may generate one of two messages even for a fit with a satisfactory value for the K-S test. Although the fit may be acceptable, tests on the fit can indicate that there may be an opportunity to improve it. The tests are based on the deviations (residuals) between the data and the fit to the data. There must be more than 6 data points for the outliers test and 9 for the deviations test, with at least 4 plus and 4 minus deviations. If there are not enough data points, the tests are not done and no message is printed.
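The selection step can be illustrated with a minimal sketch. It assumes the standard one-sample K-S statistic (the largest absolute gap between the empirical distribution of the transformed values and the standard Normal distribution); the source does not spell out the exact computation. The function names and the dictionary of transformed values are hypothetical, and the fitting of each form is not shown.

```python
from statistics import NormalDist


def ks_statistic(z_values):
    """Maximum absolute deviation between the empirical CDF of z_values
    and the standard Normal CDF (assumed one-sample K-S statistic)."""
    z = sorted(z_values)
    n = len(z)
    std_normal = NormalDist()
    d = 0.0
    for i, value in enumerate(z):
        cdf = std_normal.cdf(value)
        # Check the gap just after and just before each step of the empirical CDF.
        d = max(d, (i + 1) / n - cdf, cdf - i / n)
    return d


def select_form(transformed):
    """Pick the form whose transformed values deviate least from Normal.

    `transformed` is a hypothetical mapping of form name -> list of
    transformed (normalized) values produced by that form's fit.
    """
    return min(transformed, key=lambda name: ks_statistic(transformed[name]))
```

Here each list in `transformed` would hold the normalized values for one of the four forms, and `select_form` returns the name with the smallest K-S statistic.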
Message 1: There appear to be systematic deviations between the data and the fitted curve, which may indicate the effect of multiple processes.
This message may be triggered by either one or both of two tests on the randomness of the deviations between the data and the fit:
1. The first test calculates a neighborhood correlation coefficient, rho, for the residuals between the data and the fitted curve. The neighborhood correlation coefficient compares each residual with the previous and next residuals in sequence. The message is triggered when rho is significant at the 0.05 level. The one-sided test at the 0.05 significance level, against the hypothesis that rho = 0 (Sachs), is:
abs(rho) * sqrt(m-1) < 1.7 for lack of significance, where m is the number of data points.
2. The second test looks for too many or too few runs (a run is a series of consecutive residuals all of the same sign). This two-sided test at the 0.05 level, against the hypothesis that the number of runs equals the expected number, is (Swed):
number of runs < E - (1.645 * sqrt(V)) - 0.5 (for significantly too few runs)
number of runs > E + (1.645 * sqrt(V)) + 0.5 (for significantly too many runs)
where 0.5 is a continuity correction, m1 is the number of positive deviations, m is the number of data points, and the expected number of runs and its variance are (Gibbons):
E(number of runs) = 1 + 2*m1*(m - m1) / m
V(number of runs) = (E - 1)*(E - 2) / (m - 1)
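The two tests behind Message 1 can be sketched as follows. This is an illustration under stated assumptions, not the software's actual implementation: the source gives no formula for rho, so the lag-1 serial correlation of the residuals is used as a stand-in; residuals that are exactly zero are dropped before counting runs; and the minimum sample-size checks are omitted. All function names are hypothetical.

```python
import math


def lag1_correlation(residuals):
    """Lag-1 serial correlation; an assumed stand-in for the source's rho."""
    m = len(residuals)
    mean = sum(residuals) / m
    denom = sum((r - mean) ** 2 for r in residuals)
    if denom == 0:
        return 0.0  # constant residuals carry no correlation information
    num = sum((a - mean) * (b - mean) for a, b in zip(residuals, residuals[1:]))
    return num / denom


def correlation_test(residuals, critical=1.7):
    """True when abs(rho) * sqrt(m - 1) reaches the critical value."""
    m = len(residuals)
    return abs(lag1_correlation(residuals)) * math.sqrt(m - 1) >= critical


def runs_test(residuals, z=1.645):
    """Return (runs, E, V, too_few, too_many) using the formulas above."""
    signs = [r > 0 for r in residuals if r != 0]  # assumption: drop exact zeros
    m = len(signs)
    if m < 2:
        raise ValueError("need at least two nonzero residuals")
    m1 = sum(signs)  # number of positive deviations
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    e = 1 + 2 * m1 * (m - m1) / m
    v = max((e - 1) * (e - 2) / (m - 1), 0.0)  # guard against -0.0
    margin = z * math.sqrt(v)
    too_few = runs < e - margin - 0.5
    too_many = runs > e + margin + 0.5
    return runs, e, v, too_few, too_many


def message_1_triggered(residuals):
    """Message 1 fires when either test flags the residuals."""
    _, _, _, too_few, too_many = runs_test(residuals)
    return correlation_test(residuals) or too_few or too_many
```

For example, twelve residuals made of six positive values followed by six negative values give 2 runs against an expected 7, which falls below the lower limit and is flagged as too few runs.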
Message 2: There may be outliers in the data.
It is expected that there will be as many positive residuals as negative ones.
With m1 the number of positive deviations from the fit, the expected count is E(m1) = m / 2. For this two-sided test at the 0.10 level, a rule of thumb (Duckworth) is:
abs(2*m1 - m)/ sqrt(m) < 1.645 for lack of significance.
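The rule of thumb above can be sketched in a few lines. If each residual is equally likely to be positive or negative, m1 has mean m / 2 and standard deviation sqrt(m) / 2, so the statistic is the usual Normal approximation to the sign count. The function names are hypothetical, zero residuals are simply counted as non-positive (an assumption), and the minimum sample-size check is omitted.

```python
import math


def sign_balance_statistic(residuals):
    """abs(2*m1 - m) / sqrt(m), where m1 counts the positive residuals."""
    m = len(residuals)
    m1 = sum(1 for r in residuals if r > 0)
    return abs(2 * m1 - m) / math.sqrt(m)


def message_2_triggered(residuals, critical=1.645):
    """True when the positive/negative split is significantly unbalanced."""
    return sign_balance_statistic(residuals) >= critical
```

With 16 residuals of which 13 are positive, the statistic is abs(26 - 16) / 4 = 2.5, which exceeds 1.645, so the message would be issued.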
Duckworth and Wyatt. "Rapid Statistical Techniques for OR Workers." Operational Research Quarterly, 9 (1958), pp. 218+.
Gibbons. Non-Parametric Statistical Inference, 2nd Ed. New York: Marcel Dekker, Inc., 1985.
Sachs. Applied Statistics. Springer, 1982.
Swed and Eisenhart. "Testing Randomness of Grouping in a Sequence." Annals of Mathematical Statistics, 14 (1943), pp. 83+.