Statistics in Engineering
With examples in MATLAB® and R

Andrew Metcalfe, David Green, Tony Greenfield, Mahayaudin Mansor, Andrew Smith and Jonathan Tuke.


Glossary
2-factor interaction:
The effect on the response of one factor depends on the level of the other factor.

3-factor interaction:
The interaction effect of two factors depends on the level of a third factor.



absorbing state:
A state that once entered cannot be left.

acceptance sampling:
A random sampling scheme for goods delivered to a company. If the sample passes some agreed criterion, the whole consignment is accepted.

accuracy:
An estimator of a population parameter is accurate if it is, on average, close to that parameter.

addition rule:
The probability of one or both of two events occurring is: the sum of their individual probabilities of occurrence less the probability that they both occur.
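The rule can be checked numerically. The book's examples are in MATLAB and R; the sketch below uses Python, with hypothetical probabilities:

```python
# Addition rule: P(A or B) = P(A) + P(B) - P(A and B).
# The probabilities below are hypothetical, for illustration only.
p_a = 0.5      # P(A)
p_b = 0.4      # P(B)
p_both = 0.2   # P(A and B)

p_either = p_a + p_b - p_both
print(round(p_either, 10))  # 0.7
```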

aliases:
In a designed experiment, sets of a factor and interactions between factors that are indistinguishable in terms of their effect on the response. In time series analysis, frequencies that are indistinguishable because of the sampling interval.

analysis of variance (ANOVA):
The total variability in a response is attributed to factors and a residual sum of squares which accounts for the unexplained variation attributed to random errors.

AOQL:
The average proportion of defective material leaving an acceptance sampling procedure, average outgoing quality (AOQ), depends on the proportion of defectives in the incoming material. The maximum value it could take is the AOQ limit (AOQL).

aperiodic:
In the context of states of a Markov chain, not limited to recurring only at a fixed time interval (period).

asset management plan (AMP):
A business plan for corporations, like utilities such as water, gas, and electricity, which own a large number of physical assets.

asymmetric:
Without symmetry, in particular a pdf is described as asymmetric if it is not symmetric about a vertical line through its mean (or median).

asymptote:
A line that is a tangent to a curve as the distance from the origin tends to infinity.

asymptotic:
In statistics, an asymptotic result is a theoretical result that is proved in the limiting case of the sample size approaching infinity.

auto-correlation:
The correlation between observations, spaced by a lag k, in a time series. It is a function of k.

auto-covariance:
The covariance between observations, spaced by a lag k, in a time series. It is a function of k.

auto-regressive model:
The current observation is modeled as the sum of a linear combination of past values and random error.



balanced:
The same number of observations for each treatment or factor combination.

Bayes' theorem:
This theorem enables us to update our knowledge, expressed in probabilistic terms, as we obtain new data.
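A short Python sketch of the updating, with hypothetical numbers for a test that flags faulty components:

```python
# Bayes' theorem: P(H | data) = P(data | H) * P(H) / P(data).
# All numbers below are hypothetical, for illustration only.
prior = 0.01               # P(fault) before testing
p_pos_given_fault = 0.95   # test sensitivity
p_pos_given_ok = 0.05      # false positive rate

# Total probability of a positive result (law of total probability).
p_pos = p_pos_given_fault * prior + p_pos_given_ok * (1 - prior)

# Updated probability of a fault, given a positive result.
posterior = p_pos_given_fault * prior / p_pos
print(round(posterior, 3))  # 0.161
```

Despite the high sensitivity, the posterior probability of a fault is only about 0.16, because faults are rare to begin with.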

between samples estimator of the variance of the errors:
An estimator of population variance calculated from the variance of means of random samples from that population.

bias:
A systematic difference between the estimator and the parameter being estimated, one that persists when averaged over imaginary replicates of the sampling procedure. Formally, the difference between the mean of the sampling distribution and the parameter being estimated. If the bias is small by comparison with the standard deviation of the sampling distribution, the estimator may still be useful.

bin:
A bin is an alternative name for a class interval when grouping data.

binomial distribution:
The distribution of the number of successes in a fixed number of trials with a constant probability of success.
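The probability function can be written directly from the definition; a minimal Python sketch:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for k successes in n trials, each with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Probability of exactly 2 successes in 10 trials with p = 0.3.
print(binom_pmf(2, 10, 0.3))  # ≈ 0.2335
```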

bivariate:
Each element of the population has values for two variables.

block:
Relatively homogeneous experimental material that is divided into plots that then have different treatments applied.

bootstrap:
Re-sampling, with replacement, from the sample to estimate the sampling distribution of a statistic.
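A minimal Python sketch, estimating the standard error of a sample mean by resampling with replacement (the data are hypothetical):

```python
import random
random.seed(1)

data = [4.2, 5.1, 3.8, 6.0, 5.5, 4.9, 5.2, 4.4]  # hypothetical sample

def bootstrap_se_mean(sample, n_boot=2000):
    """Estimate the standard error of the mean by resampling with replacement."""
    means = []
    for _ in range(n_boot):
        resample = [random.choice(sample) for _ in sample]
        means.append(sum(resample) / len(resample))
    m = sum(means) / n_boot
    # Standard deviation of the bootstrap means.
    return (sum((x - m) ** 2 for x in means) / (n_boot - 1)) ** 0.5

print(bootstrap_se_mean(data))
```

The bootstrap standard error should be close to the usual formula s/sqrt(n), here about 0.26.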

box plot:
A simple graphic for a set of data. A rectangle represents the central half of the data. Lines extend to the furthest data that are not shown as separate points.



categorical variable:
A variable that takes values which represent different categories.

causation:
A change in one variable leads to a consequential change in another variable.

censored:
The value taken by a variable lies above or below or within some range of values, but the precise value is not known.

centered:
A set of numbers that has been transformed by subtraction of a constant that is typically the mid-range.

central composite design:
A 2^k factorial design augmented by runs in which each factor in turn is set at a very high and a very low value with all other factors at 0, and by runs with all factors set at 0.

Central Limit Theorem:
The distribution of the mean of a sample of independently drawn variables from a probability distribution with finite variance tends to normality as the sample size increases.
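The theorem can be illustrated by simulation; a Python sketch with samples from a uniform distribution:

```python
import random
random.seed(0)

# Means of samples from a uniform(0, 1) distribution (mean 0.5, variance 1/12).
# The CLT says these means are approximately normal with mean 0.5 and
# standard deviation sqrt((1/12)/n) for sample size n.
n, reps = 30, 5000
means = [sum(random.random() for _ in range(n)) / n for _ in range(reps)]

grand_mean = sum(means) / reps
sd = (sum((m - grand_mean) ** 2 for m in means) / reps) ** 0.5
print(grand_mean, sd)  # close to 0.5 and ((1/12)/30)**0.5 ≈ 0.0527
```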

Chapman-Kolmogorov equation:
The probability of moving from state i to state j in two steps is equal to the sum, over all states k, of the probabilities of going from i to k in one step and then k to j in the next step.
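In matrix terms, the two-step transition matrix is the one-step matrix multiplied by itself. A Python sketch for a hypothetical two-state chain:

```python
# One-step transition matrix for a hypothetical two-state Markov chain.
P = [[0.9, 0.1],
     [0.3, 0.7]]

def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

# P2[i][j] = sum over k of P[i][k] * P[k][j],
# exactly as the Chapman-Kolmogorov equation states.
P2 = matmul(P, P)
print(P2)
```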

chi-squared distribution:
A sum of m independent squared normal random variables has a chi-squared distribution with m degrees of freedom.

chi-squared test:
A test of the goodness of fit of some theoretical distribution to observed data. It is based on a comparison of observed frequencies and expected frequencies equal to discrete values or within specific ranges of values.

class intervals (bins):
Before drawing a histogram the data are grouped into classes which correspond to convenient divisions of the variable range. Each division is defined by its lower and upper limits, and the difference between them is the length of the class interval. Also known as bins.

cluster:
A group of items in a population.

coefficient:
A constant multiplier of some variable.

coefficient of determination:
The proportion of the variability in a response which is attributed to values taken by predictor variables.

coefficient of variation:
The ratio of the standard deviation to the mean of a variable which is restricted to non-negative values.

cold standby:
When an item fails it can be replaced with a spare item. In contrast, hot standby is when an item is continuously backed up with a potential replacement.

common cause variation:
Variation that is accepted as an intrinsic feature of a process when it is running under current conditions.

concomitant variable:
A variable that can be monitored but cannot be set to specific values by the experimenter.

conditional distribution:
A probability distribution of some variable(s) given the value of another associated variable(s).

conditional probability:
The probability of an event conditional on other events having occurred or an assumption they will occur. (All probabilities are conditional on the general context of the problem.)

confidence interval:
A 95% confidence interval for some parameter is an interval constructed in such a way that on average, if you imagine millions of random samples of the same size, 95% of them will include the specific value of that parameter.

consistent estimator:
An estimator is consistent if its bias and standard error tend to 0 as the sample size tends to infinity.

continuity correction:
The probability that a discrete variable takes an integer value is approximated by the probability that a continuous variable is within plus or minus 0.5 of that integer.
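For example, a binomial probability can be approximated by a normal one; a Python sketch (the normal cdf is built from the error function in the standard library):

```python
from math import erf, sqrt, comb

def norm_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal variable."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

# Approximate P(X = 12) for X ~ Binomial(25, 0.5) by the probability that
# a normal variable with the same mean and variance lies in (11.5, 12.5).
n, p = 25, 0.5
mu, sigma = n * p, sqrt(n * p * (1 - p))
approx = norm_cdf(12.5, mu, sigma) - norm_cdf(11.5, mu, sigma)
exact = comb(25, 12) * 0.5**25
print(approx, exact)  # approximation ≈ 0.1554, exact ≈ 0.1550
```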

continuous:
A variable is continuous if it can take values on a continuous scale.

control variable:
A predictor variable that can be set to a particular value by a process operator.

correlation coefficient:
A dimensionless measure of linear association between two variables that lies between -1 and 1.
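The sample version can be computed from mean-adjusted sums of products; a short Python sketch:

```python
def corr(x, y):
    """Sample correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# A perfectly linear relationship gives r = 1 (or -1 for a negative slope).
x = [1, 2, 3, 4, 5]
print(corr(x, [2 * v + 1 for v in x]))  # 1.0
print(corr(x, [-v for v in x]))         # -1.0
```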

correlogram:
Auto-correlation as a function of lag.

covariance:
A measure of linear association between two variables, equal to the average value of the products of the mean-adjusted variables.

covariate:
A general term for a variable that is associated with some response variable and that is therefore a potential predictor variable for that response.

cumulative distribution function:
A function which gives the probability that a continuous random variable is less than any particular value. It is the population analogue of the cumulative frequency polygon. Its derivative is the pdf.

cumulative frequency polygon:
Plotted for continuous data sorted into bins. A plot of the proportion, often expressed as a percentage, of data less than or equal to right hand ends of bins. The points are joined by line segments.



data, datum:
Information on items, on one item, from the population.

degrees of freedom:
The number of data values that could be arbitrarily assigned given the value of some statistic and the values of implicit constraints.

deseasonalized:
A time series is deseasonalized (seasonally adjusted) if seasonal effects are removed.

design generator:
A product of columns representing factor values that is set equal to a column of 1s.

design matrix:
An experiment is set up to investigate the effect of certain factors on the response. The design matrix specifies values of the factors in the multiple regression model used for the analysis.

detrended:
A time series is detrended if the estimated trend is removed.

deviance:
A generalization of the sum of squared errors when a model is fitted to data.

deviate:
The value taken by a random variable, typically used to denote a random number from some probability distribution.

discrete event simulation:
A computer simulation that proceeds when an event occurs, rather than proceeding with a fixed time interval.



empirical distribution function (edf):
The proportion of data less than or equal to each order statistic. A precise version of a cumulative frequency polygon.

endogenous, exogenous:
Internal, or external, to a system.

ensemble:
The hypothetical infinite population of all possible time series.

error:
A deviation from the deterministic part of a model.

estimator, estimate:
A statistic that is used to estimate some parameter is an estimator of that parameter when considered as a random variable. The value it takes in a particular case is an estimate.

equilibrium:
The probabilistic structure of a model for a stochastic process does not depend on time.

evolutionary operation:
An experiment that is confined to small changes in factors that can be accommodated during routine production. The idea is that optimum operating conditions will be found.

expected value:
A mean value in the population.

explanatory variable:
In a multiple regression the dependent variable, usually denoted by Y, is expressed as a linear combination of the explanatory variables, which are also commonly referred to as predictor variables. In designed experiments, the explanatory variables are subdivided into control variables, whose values are chosen by the experimenter, and concomitant variables, which can be monitored but not preset.

exponential distribution:
A continuous distribution of the times until events in a Poisson process, and, as events are random and independent, the distribution of the times between events.



factorial experiment:
An experiment designed to examine the effects of two or more factors. Each factor is applied at two or more levels and all combinations of these factor levels are tried in a full factorial design.

finite, infinite:
Finite: limited to some fixed number. Infinite: without any upper bound on the number.

fixed effect, random effect:
A fixed effect is a factor that has its effects on the response, corresponding to its different levels, defined by a set of parameters that change the mean value of the response. A random effect is a source of random variation.

frequency:
In statistical usage, the number of times an event occurs. In physics usage, cycles per second (hertz) or radians per second.

F-distribution:
The distribution of the ratio of two independent chi-squared variables divided by their degrees of freedom.



gamma distribution:
The distribution of the time until the kth event in a Poisson process. It is therefore the sum of k independent exponential variables.

gamma function:
A generalization of the factorial function to values other than positive integers.

Gaussian distribution:
An alternative name for the normal distribution.

Gauss-Markov theorem:
In the context of a multiple regression model with errors that are independently distributed with mean 0 and constant variance: the ordinary least squares estimator of the coefficients is the minimum variance unbiased estimator among all estimators that are linear functions of the observations.

generalized linear model:
A generalization of the multiple regression model (linear model) in which the response has a distribution other than the normal distribution.

geometric distribution:
The distribution of the number of trials until the first success in a sequence of Bernoulli trials.

goodness of fit test:
A statistical test of a hypothesis that data has been generated by some specific model.

Gumbel distribution:
The asymptotic distribution of the maximum in samples of some fixed size from a distribution with unbounded tails and finite variance.



hidden states:
Hypothetical states that are part of a system but cannot be directly observed.

highly accelerated lifetime testing (HALT):
Testing under extreme conditions that are designed to cause failures within the testing period.

histogram:
A chart consisting of rectangles drawn above class intervals with areas equal to the proportion of data in each interval. It follows that the heights of the rectangles equal the relative frequency density, and the total area equals 1.

hot standby:
A potential replacement provides continuous back-up; see cold standby.

hypothesis (null and alternative):
The null hypothesis is a specific hypothesis that, if true, precisely determines the probability distribution of a statistic. The null hypothesis is set up as the basis for an argument or for a decision, and the objective of an experiment is typically to provide evidence against the null hypothesis. The alternative hypothesis is generally an imprecise statement and is commonly taken as the statement that the null hypothesis is false.



ill-conditioned:
A matrix is ill-conditioned if its determinant is close to 0 and so its inverse will be subject to rounding error.

imaginary infinite population:
The population sampled from is often imaginary and arbitrarily large. A sample from a production line is thought of as a sample from the population of all items that will be produced if the process continues on its present settings. An estimator is considered to be drawn from an imaginary distribution of all possible estimates, so that we can quantify its precision.

independent:
Two events are independent if the probability that one occurs does not depend on whether or not the other occurs.

indicator variable:
A means of incorporating categorical variables into a regression. The variable corresponding to a given category takes the value 1 if the item is in that category and 0 otherwise.

inherent variability:
Variability that is a natural characteristic of the response.

intrinsically linear model:
A relationship between two variables that can be transformed to a linear relationship between functions of those variables.

IQR:
The difference between the upper and lower quartiles.

interaction:
Two explanatory variables interact if the effect of one depends on the value of the other. Their product is then included as an explanatory variable in the regression. If their interaction effect depends on the value of some third variable a third order interaction exists, and so on.

interval estimate:
A range of values for some parameter rather than a single value.



kurtosis:
The fourth central moment divided by the square of the variance; a measure of weight in the tails of a distribution. The kurtosis of a normal distribution is 3.

lag:
A time difference.

Laplace distribution:
Back-to-back exponential distributions.

least significant difference:
The least significant difference at the 5% level, for example, is the product of the standard error of the difference in two means with the upper 0.025 quantile of a t-distribution with the appropriate degrees of freedom.

least squares estimate:
An estimate made by finding values of model parameters that minimize the sum of squared deviations between model predictions and observations.
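For a straight line fitted to data pairs, the minimizing values have a closed form; a Python sketch with noise-free hypothetical data:

```python
def least_squares_line(x, y):
    """Slope and intercept minimizing the sum of squared deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return slope, my - slope * mx

# Noise-free data on the line y = 3x + 2 is recovered exactly.
x = [0, 1, 2, 3, 4]
y = [3 * v + 2 for v in x]
print(least_squares_line(x, y))  # (3.0, 2.0)
```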

level of significance:
The probability of rejecting the null hypothesis is set to some chosen value known as the level of significance.

linear model:
The response is a linear function of predictor variables. The coefficients of the predictor variables are estimated.

linear regression:
The response is a linear function of a single predictor variable.

linear transformation:
The transformed variable is obtained from the original variable by the addition of some constant number and multiplication by another constant number.

linear trend:
A model in which the mean of a variable is a linear function of time (or distance along a line).

logit:
The natural logarithm of the odds (ratio of a probability to its complement).
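The transform and its inverse in a short Python sketch:

```python
from math import log, exp

def logit(p):
    """Natural log of the odds p / (1 - p)."""
    return log(p / (1 - p))

def inv_logit(z):
    """Inverse transform, mapping any real z back to a probability."""
    return 1 / (1 + exp(-z))

print(logit(0.5))             # 0.0 (even odds)
print(inv_logit(logit(0.9)))  # ≈ 0.9
```

The logit maps probabilities in (0, 1) onto the whole real line, which is why it is used as the link function in logistic regression.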

lower confidence bound:
A value that, with stated confidence, the mean of some variable exceeds.



main effect:
The effect of changing a factor when other factors are at their notional mid-values.

main-plot factor:
In a split-plot experiment each block is divided into plots. The different levels of the main plot factor are randomly assigned to the plots within each block (as in a randomized block design).

marginal distribution:
The marginal distribution of a variable is the distribution of that variable. The term marginal indicates that the variable is being considered in a multivariate context.

Markov chain:
A process can be in any one of a set of states. Changes of state occur at discrete time intervals with probabilities that depend on the current state, but not on the history of the process.

Markov process:
A process can be in any one of a set of states. Changes of state occur over continuous time with rates that depend on the current state, but not on the history of the process.

matched pairs:
A pairing of experimental material so that the two items in the pair are relatively similar.

maximum likelihood:
The likelihood function is the probability of observing the data, treated as a function of the population parameters. Maximum likelihood finds the values of the parameters that maximize this probability.

meal:
A mixture of materials, that have been ground to a powder, used as raw material for a chemical process.

mean:
The sum of a set of numbers divided by their number. Also known as the average.

mean-adjusted:
A set of numbers that has been transformed by subtraction of their mean. The transformed set has a mean of 0.

mean-corrected:
An alternative term for mean-adjusted.

mean-square error:
The mean of squared errors.

measurement error:
A difference between a physical value and a measurement of it.

median:
The middle value if data are put into ascending order.

method of moments:
Estimates made by equating population moments with sample moments.

mode:
For discrete data, the most commonly occurring value. For continuous data, the value of the variable at which the pdf has its maximum.

monotone:
Continually increasing or continually decreasing.

Monte-Carlo simulation:
A computer simulation that relies on the generation of random numbers.
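A classic small example is estimating pi from random points; a Python sketch:

```python
import random
random.seed(42)

# Estimate pi by the proportion of random points in the unit square
# that fall inside the quarter circle of radius 1.
n = 100_000
inside = sum(1 for _ in range(n)
             if random.random() ** 2 + random.random() ** 2 <= 1)
pi_hat = 4 * inside / n
print(pi_hat)  # close to 3.14159
```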

multiple regression:
The response is expressed as a linear combination of predictor variables (also known as explanatory variables) plus random error. The coefficients of the variables in this combination are the unknown parameters of the model and are estimated from the data.

multiplicative rule:
The probability of two events both occurring is the product of the probability that one occurs with the probability that the other occurs conditional on the first occurring.

multivariate normal distribution:
A bivariate normal distribution has a 3D bell-shaped pdf. The marginal distributions are normal, each with its own mean and variance; the fifth parameter is the correlation. The concept generalizes to a multivariate normal distribution, which is defined by its means, variances, and pair-wise correlations.

mutually exclusive:
Two events are mutually exclusive if they cannot occur together.

m-step transition matrix:
The matrix of probabilities of moving between states in m-steps.



non-linear least squares:
Fitting a model which is non-linear in the unknown coefficients using the principle of least squares.

normal distribution:
A bell-shaped pdf which is a plausible model for random variation if it can be thought of as the sum of a large number of smaller components.

normalizing factor:
A factor that makes the area under a curve equal 1.

or:
In probability, "A or B" is conventionally taken to include the case in which both occur.

order statistics:
The sample values when sorted into ascending order.

orthogonal:
In a designed experiment the values of the control variables are usually chosen to be uncorrelated, when possible, or nearly so. If the values of the control variables are uncorrelated they are said to be orthogonal.

orthogonal design:
The product of the transpose of the design matrix with the design matrix is a diagonal matrix.

over-dispersed:
Variance of the residuals is greater than a value that is consistent with the model that is being fitted.



parameter:
A constant which is a characteristic of a population.

parametric bootstrap:
The sampling distribution of a statistic is investigated by taking random samples from a probability distribution chosen to represent the population from which the sample has been taken. The parameters of the distribution are estimated from the sample.

parent distribution:
The distribution from which the sample has been taken.

paver:
A paving block. Modern ones are made from either concrete or clay in a variety of shapes and colors.

percentage point:
The upper 100α% point of a pdf is the value beyond which a proportion α of the area under the pdf lies. The lower point is defined in an analogous fashion.

periodic:
Occurring, or only able to occur, at fixed time intervals.

point estimate:
A single number used as an estimate of a population parameter (rather than an interval).

Poisson distribution:
The number of events in some length of continuum if events occur randomly, independently, and singly.

Poisson process:
Events in some continuum often form a Poisson process if they are random, independent, and occur singly.

population:
A collection of items from which a sample is taken.

power (of test):
The probability of rejecting the null hypothesis if some specific alternative hypothesis is true. The power depends on the specific alternative hypothesis.

precision:
The precision of an estimator is a measure of how close replicate estimates are to each other. Formally, it is the reciprocal of the variance of the sampling distribution.

prediction interval:
An interval within which a random variable will fall with some specified probability.

predictor variable:
A variable in the regression equation used to predict a response; also known as an explanatory variable.

priority controlled junction:
A road junction which is controlled by Give Way signs and road markings, rather than by lights.

probability:
A measure of how likely some event is to occur on a scale ranging from 0 to 1.

probability density function:
A curve such that the area under it between any two values represents the probability that a continuous variable will be between them. The population analogue of a histogram.

probability function:
A formula that gives the probability that a discrete variable takes any of its possible values.

process capability index:
The ratio of the difference between the upper and lower specification limits to six process standard deviations (C_p).

process performance index:
The ratio of the smaller of the differences between the upper/lower specification limits and the mean to three process standard deviations (C_pk).

pseudo-random numbers:
A sequence of numbers generated by a deterministic algorithm which appear to be random. Computer generated random numbers are actually pseudo-random.

pseudo-3D plot:
A scatter plot in which the plotting symbol indicates the range within which some third variable lies.

p-value:
The probability of a result as extreme, or more extreme, as that observed, if the null hypothesis is true.



quadrants:
In a scatter plot, the x-axis and y-axis divide the plane into four quadrants.

quantiles:
The upper/lower α quantile is the value of the variable above/below which a proportion α of the data lie.

quantile-quantile plot:
A plot of the order statistics against the expected value of the order statistic in a random sample from the hypothetical population.

quartiles:
The upper (lower) quartile, UQ (LQ), is the datum above (below) which one-quarter of the data lie.

quota sample:
A non-random sample taken to satisfy specific identifying criteria.



random digits:
A sequence in which each one of the digits 0,1,...,9 is equally likely to occur next in the sequence.

random effect:
A component of the error structure.

random numbers:
A sequence of numbers drawn from a specified probability distribution so that the proportion of random numbers in any range matches the corresponding probability calculated from the distribution, and such that the next number drawn is independent of the existing sequence.

random sample:
A sample which has been selected so that every member of the population has a known, non-zero, probability of appearing.

range:
Difference between the largest datum and the smallest datum when the data are sorted into ascending order.

rate matrix:
A matrix of the rates of moving between states in a Markov process.

realization:
A sequence of data that have been drawn at random from some probability distribution or stochastic process.

regression:
A model for the value taken by a response as an unknown linear combination of values taken by predictor variables. The unknown coefficients are estimated from data.

regression line:
A plot of the expected value of the response against a single predictor variable under an assumed linear relationship.

regression sum of squares:
The mean-adjusted sum of squares of the response is split into the sum of squared residuals and the regression sum of squares.

regression towards the mean:
If one variable is far from its mean, then the mean value of a correlated variable will be closer, in terms of multiples of standard deviations, to its marginal mean. In the case of a single variable, if one draw is far from the mean the next draw is likely to be closer to the mean.

relative frequency:
The ratio of the frequency of occurrence of some event to the number of scenarios in which it could potentially have occurred. That is, the proportion of occasions on which it occurred.

relative frequency density:
Relative frequency divided by the length of the bin (class interval).

reliability function:
The complement of the cumulative distribution function of component lifetime.

repeatability:
The ability to get similar results when you test under the same conditions.

replication:
The use of two or more experimental units for each experimental treatment, or the execution of an entire experiment more than once, so as to increase precision and to obtain a better estimate of the sampling error.

reproducibility:
The ability to get similar results when others test under conditions that satisfy given criteria designed to maintain comparability.

resampling:
Taking a random sample, with replacement, from the sample.

residuals:
Differences between observed and fitted values.

residual sum of squares:
The sum of squared residuals.

response surface:
The response is modeled as a quadratic function of covariates. The predictor variables are the covariates, squared covariates, and cross products between two covariates.

response variable:
The variable that is being predicted as a function of predictor variables.

robust:
A statistical technique that is relatively insensitive to assumptions made about the parent distribution.

run:
A performance of a process at some specified set of values for the process control variables.

run-out:
A measurement of deviation of a disc from its plane.



sample:
A collection of items taken from a population.

sample path:
A sequence of sample values from a stochastic process.

sample space:
A list of all possible outcomes of some operation which involves chance.

sampling distribution:
An estimate is thought of as a single value from the imaginary distribution of all possible estimates, known as the sampling distribution.

saturated model:
A model in which the number of parameters to be estimated equals the number of data.

scatterplot:
A graph showing data pairs as points.

seasonal term/effect/component:
A component of a time series that changes in a deterministic fashion with a fixed period.

Simpson's paradox:
An apparent relationship between variables that is a consequence of combining data from disparate sub-groups.

simple random sample:
A sample chosen so that every possible choice of n items from the N in the population has the same chance of occurring.

simulation:
A computer model for some process.

skewness:
A measure of asymmetry of a distribution. Positive values correspond to a tail to the right.

spurious correlation:
A correlation that can be attributed to known relationships to a common third variable (often time).

standard deviation:
The positive square root of the variance.

standard error:
The standard deviation of some estimator.

standard normal distribution:
The normal distribution scaled to have a mean of 0 and a standard deviation of 1.

standard order:
A systematic list of runs for a process.

state:
A set of values for the variables that define a process.

state space:
The set of all possible states.

stationarity:
Constant over time.

statistic:
A number calculated from the sample.

statistically significant:
A result that is unlikely to be equalled or exceeded if the null hypothesis is true.

strata:
Sub-populations; the singular is stratum.

stratification:
Division of a population into relatively homogeneous sub-populations.

Student's t-distribution:
The sampling distribution of many statistics is normal and can therefore be scaled to standard normal. If the mean of the sampling distribution is the parameter of interest, and the unknown standard deviation is replaced by its sample estimate with v degrees of freedom, the normal distribution becomes a t-distribution with v degrees of freedom. If v exceeds about 30 there is little practical difference between the two distributions.

stochastic process:
A random process, sometimes referred to as a time series model.

survey population:
The population that is to be surveyed, when it does not match the target population precisely.

sub-plot factor:
A factor which has its different levels applied over each of the main plots in a split-plot design.

symmetric distribution:
A probability distribution with a pdf that is symmetric about a vertical line through its mean.

synchronous:
Moving together over time.

systematic sample:
A sample drawn as every kth item from a list.



tail (heavy):
A probability distribution with tails that tend towards 0 more slowly than those of a normal distribution.

target population:
The population about which we require information.

test statistic:
A statistic designed to distinguish between a null hypothesis and the alternative hypothesis.

time homogeneous:
Parameters of the process do not change over time.

tolerance interval:
A statistical tolerance interval is an interval that includes a given proportion of the population with some given level of confidence.

training data:
A sub-set of the available data used to fit a model.

transition:
A change of state.

transition matrix:
A matrix of transition probabilities in a Markov chain.

transition probability:
The probability of changing between two given states in one step of a Markov chain.

trend:
A deterministic model for change over time.

t-ratio:
The ratio of an estimate to an estimate of its standard deviation.



unbiased estimator (estimate):
An estimator is unbiased for some parameter, if the mean of its sampling distribution is equal to that parameter.

uniform distribution:
A variable has a uniform distribution between two limits if the probability that it lies within some interval between those limits is proportional to the length of that interval.

upper confidence bound:
A value that, with stated confidence, will not be exceeded.

variable:
A quantity that varies from one member of the population to the next. It can be measured on some continuous scale, be restricted to integer values (discrete), or be restricted to descriptive categories (categorical).

variance:
Average of the squared deviations from the mean. For a sample, the averaging is performed by dividing by the degrees of freedom.

variance-covariance matrix:
A matrix of covariances between all possible pairs of variables when analyzing multi-variate data. The variances lie along the leading diagonal.

Weibull distribution:
A versatile model for the lifetimes of components.

weighted mean:
An average in which the data are multiplied by numbers called weights, summed, and then divided by the sum of the weights.
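A minimal Python sketch, with hypothetical course marks weighted by credit points:

```python
def weighted_mean(values, weights):
    """Sum of value * weight, divided by the sum of the weights."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical example: course marks weighted by credit points.
marks = [70, 80, 60]
credits = [3, 1, 2]
print(weighted_mean(marks, credits))  # 410/6 ≈ 68.33
```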

within samples estimator of the variance of the errors:
A variance is calculated for each sample, and these variances are then averaged.