Abstract: Clustering and classification are the major subdivisions
of pattern recognition techniques. Using these techniques,
samples can be classified according to a specific property by
measurements indirectly related to the property of interest
(such as the type of fuel responsible for an underground
spill). An empirical relationship or classification rule
can be developed from a set of samples for which the
property of interest and the measurements are known. The
classification rule can then be used to predict the property
in samples that are not part of the original training set.
The set of samples for which the property of interest and
the measurements are known is called the training set. The set
of measurements that describe each sample in the data set
is called a pattern. The determination of the property of
interest by assigning a sample to its respective category is
called recognition, hence the term pattern recognition.
For pattern recognition analysis, each sample is represented
as a data vector $\mathbf{x} = (x_1, x_2, x_3, \ldots, x_j, \ldots, x_n)$, where
component $x_j$ is a measurement, e.g. the area $a$ of the $j$th peak in a
chromatogram. Thus, each sample is considered
as a point in an n-dimensional measurement space. The
dimensionality of the space corresponds to the number of
measurements that are available for each sample. A basic
assumption is that the distance between pairs of points in
this measurement space is inversely related to the degree
of similarity between the corresponding samples. Points
representing samples from one class will cluster in a limited
region of the measurement space distant from the
points corresponding to the other class. Pattern recognition
(i.e. clustering and classification) is a set of methods
for investigating data represented in this manner, in order
to assess its overall structure, which is defined as the overall
relationship of each sample to every other in the data set.
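
The ideas above (patterns as points in a measurement space, distance as an inverse measure of similarity, and a classification rule derived from a training set) can be illustrated with a minimal sketch. It assumes Euclidean distance and a one-nearest-neighbor rule, neither of which is prescribed by the text. The function names, the peak-area values, and the labels "fuel A" and "fuel B" are invented for illustration only.

```python
import math


def euclidean_distance(x, y):
    """Distance between two patterns in the n-dimensional measurement space."""
    if len(x) != len(y):
        raise ValueError("patterns must have the same number of measurements")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def classify_nearest_neighbor(pattern, training_set):
    """Assign a pattern to the class of its closest training sample.

    training_set is a list of (pattern, label) pairs, i.e. samples for
    which both the measurements and the property of interest are known.
    """
    _, nearest_label = min(
        training_set,
        key=lambda pair: euclidean_distance(pattern, pair[0]),
    )
    return nearest_label


# Hypothetical training set: three peak areas per sample (made-up values).
training_set = [
    ((1.0, 0.2, 0.1), "fuel A"),
    ((0.9, 0.3, 0.2), "fuel A"),
    ((0.1, 0.8, 1.1), "fuel B"),
    ((0.2, 0.9, 1.0), "fuel B"),
]

# A sample that is not part of the training set.
unknown = (0.8, 0.25, 0.15)
print(classify_nearest_neighbor(unknown, training_set))
```

Here the unknown sample lies closest to the second "fuel A" pattern, so the rule assigns it to that class. Any other distance measure or classification rule could be substituted; the point is only that small distances are treated as high similarity.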