Ofer Melnik Jordan Pollack 



(C) Copyright Ofer Melnik 1999  
The Decision Intersection Boundary Algorithm (DIBA) disambiguates the "distributed" representation of multiple-hidden-layer, multiple-output threshold feedforward neural networks. DIBA generates a direct mapping between a network's inputs and outputs in the form of polytopic decision regions; that is, it finds which regions of the network's input space correspond to different outputs. These regions can be viewed directly in lower dimensions, or analyzed using bounding rectangles and slicing in higher dimensions. We can even use the algorithm to watch networks during learning, seeing their decision regions and weights change simultaneously.
The algorithm works well for moderate-sized networks (all the networks here took less than 10 seconds to analyze on a PII). For larger networks we expect to run into exponential space requirements, an inherent aspect of network complexity.
You can now download the code, and try DIBA for yourself.
NOTE: If you are looking for an applied method to analyze decision regions, check out this new paper. It describes Decision Region Connectivity Analysis (DRCA), a method to extract decision region structure completely independently of the model type or dimensionality. Thus it can be used to efficiently analyze even very high-dimensional models with tens of thousands of parameters. 
Despite being prolific models, feedforward neural networks have been widely criticized for their low degree of comprehensibility. A neural network could be performing the task it was trained for perfectly, yet to its users it remains a black box. What has it learned about the data it was trained on? Is it generalizing in a consistent manner? For what inputs is the network untrustworthy? These are questions we would like answered. Different approaches have been applied to this problem, such as Rule Extraction, Contribution Analysis, and Network Inversion. In our approach we try to generate exact representations of a network's function by analyzing the principles of its underlying computation. Only by analyzing networks exactly can we learn their real strengths and pitfalls: when they will generalize well, the relationship between size and complexity, the relationship between dimensionality and complexity, when learning will succeed and fail, and so on. On this page we present an algorithm that extracts exact representations from feedforward threshold neural networks in the form of polytopic decision regions. We demonstrate its use on different networks, and show what this extracted representation can tell us about a network's function. We then use the underlying principles the algorithm is based on to calculate network and algorithmic complexity bounds, as well as to raise questions about learning in neural networks. 

The algorithm extracts polytopic decision regions from multilayer, multi-output feedforward networks with threshold activation functions. It is based on a few observations about how these networks compute.
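As a minimal sketch of that computation (with illustrative weights, not taken from any network on this page): each hidden unit fires when its input lies on the positive side of a hyperplane, and the output unit thresholds a weighted sum of those binary firings, so the network's output is constant within every cell of the hidden-layer hyperplane arrangement.

```python
import numpy as np

def threshold_net(x, W_hidden, b_hidden, w_out, b_out):
    """Forward pass of a one-hidden-layer threshold network.
    Each hidden unit fires iff its hyperplane w.x + b > 0, so the
    output is constant inside every cell of the arrangement."""
    h = (W_hidden @ x + b_hidden > 0).astype(float)   # hidden threshold units
    return float(w_out @ h + b_out > 0)               # output threshold unit

# Toy 2D example: two hyperplanes x1 > 0 and x2 > 0; the output unit
# fires only when both hidden units do (an AND of half-planes).
W = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([0.0, 0.0])
w_out = np.array([1.0, 1.0])
b_out = -1.5

print(threshold_net(np.array([0.5, 0.5]), W, b, w_out, b_out))   # inside: 1.0
print(threshold_net(np.array([-0.5, 0.5]), W, b, w_out, b_out))  # outside: 0.0
```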


The Decision Intersection Boundary Algorithm (DIBA) receives as input the weights of a three-layer feedforward network and the bounds of the input space. Its output consists of the vertex pairs, or line-segment edges, that describe the decision regions in input space. The algorithm consists of two parts: one that generates all possible vertices and connecting line segments, and one that evaluates whether these basic elements form the boundaries of decision regions.
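A simplified 2D sketch of the first part, under the assumption that candidate vertices arise as pairwise intersections of the hidden-unit lines inside the input bounds (the second, boundary-evaluation part of DIBA is omitted here):

```python
import itertools
import numpy as np

def candidate_vertices(W, b, bounds):
    """Candidate decision-region vertices in 2D: intersection points of
    pairs of hidden-unit lines w.x + b = 0 that fall inside the input
    bounds.  DIBA would then keep only those where the network output
    actually changes across the point; that test is omitted here."""
    lo, hi = bounds
    verts = []
    for i, j in itertools.combinations(range(len(W)), 2):
        A = np.array([W[i], W[j]])
        if abs(np.linalg.det(A)) < 1e-12:       # parallel lines: no vertex
            continue
        v = np.linalg.solve(A, -np.array([b[i], b[j]]))
        if np.all(v >= lo) and np.all(v <= hi):
            verts.append(v)
    return verts

# Three hidden-unit lines forming a triangle inside [-10, 10]^2:
# x = 0, y = 0, and x + y = 2.
W = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([0.0, 0.0, -2.0])
print(candidate_vertices(W, b, (-10.0, 10.0)))  # (0,0), (0,2), (2,0)
```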


Using backpropagation, an 80-hidden-unit sigmoidal network was trained to classify points inside and outside a circle of radius 1 around the origin. Treating the activation functions as thresholds, the DIBA algorithm was used to generate the figure showing the network's representation of the learnt circle: a decision region composed of the hidden-unit line boundaries. The decision region fully describes what the network is doing; that is, all points inside the decision region generate a network output of 1, and all points outside it generate a network output of 0. We can see that the network did a pretty good job of encapsulating a circle-like decision region, but we can also see where the learned circle deviates from an actual circle and how some of the training points were misclassified. 
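A rough stand-in for this experiment (not the authors' setup: a smaller network, plain batch gradient descent, and illustrative hyperparameters) trains a sigmoid network on the circle task and then reads its trained units as thresholds, the interpretation the analysis uses:

```python
import numpy as np

rng = np.random.default_rng(0)

# Circle task: label 1 inside the unit circle, 0 outside (inputs in [-2,2]^2)
X = rng.uniform(-2, 2, size=(500, 2))
y = (np.sum(X**2, axis=1) < 1.0).astype(float)

H = 20                                    # far fewer units than the 80 above
W1 = rng.normal(0, 1, (H, 2)); b1 = np.zeros(H)
w2 = rng.normal(0, 1, H);      b2 = 0.0
sig = lambda z: 1.0 / (1.0 + np.exp(-z))
loss = lambda p: -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

loss_before = loss(sig(sig(X @ W1.T + b1) @ w2 + b2))
lr = 0.2
for _ in range(2000):                     # plain batch gradient descent
    h = sig(X @ W1.T + b1)
    p = sig(h @ w2 + b2)
    d_out = (p - y) / len(X)              # gradient of mean cross-entropy
    d_hid = np.outer(d_out, w2) * h * (1 - h)
    w2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum()
    W1 -= lr * d_hid.T @ X;  b1 -= lr * d_hid.sum(axis=0)
loss_after = loss(sig(sig(X @ W1.T + b1) @ w2 + b2))

# Read the trained units as thresholds, as the analysis above does
pred = ((((X @ W1.T + b1) > 0).astype(float) @ w2 + b2) > 0).astype(float)
print(loss_before, loss_after, "threshold accuracy:", np.mean(pred == y))
```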

By zooming out from the area in the immediate vicinity of the origin we can see the network's performance, or generalization ability, in areas of the input space that it was not explicitly trained for. In the figure we see some artifacts: decision regions at a distance of at least 50 units from the decision region at the origin. This would imply network misclassification if that part of the input space were of interest. 

A 100-hidden-unit network was trained to differentiate between points inside and outside a sphere centered at the origin. The figure below illustrates the network's sphere-like decision region from two angles. It also shows the same artifact phenomenon as in the circle network above: the miniature sphere appears amid a backdrop of an ominous cliff-face decision region. 

A 100-hidden-unit network was trained to recognize points inside and outside a four-dimensional hypersphere centered at the origin. Since its dimensionality is greater than three, it is not possible to visualize the decision region directly. One way to gather information from our high-dimensional polytopic decision regions is to describe them in terms of rules: we bound each polytope inside a hyperrectangle and examine the rectangle's coordinates. The rectangles can later be refined to enclose parts of polytopes, thereby giving higher-resolution rules. In this case our decision region is covered by the following hyperrectangle: min: (−4.45, −5.79, −3.90, −7.93), max: (6.47, 6.01, 7.39, 6.07). Another way to examine the high-dimensional polytope is to examine projections onto lower-dimensional spaces. In this figure we see two projections of the 4D sphere decision region. Needless to say, the 4D sphere looks less and less like an actual sphere. It is fair to speculate that this is due to the difficulties of learning in higher-dimensional spaces: the higher dimensionality naturally promotes more complex decision regions, with potential for greater artifact phenomena. 
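The rule-extraction step described above reduces to taking componentwise minima and maxima over a region's vertices; a sketch with hypothetical vertices standing in for a DIBA-extracted polytope:

```python
import numpy as np

def bounding_rule(vertices):
    """Axis-aligned bounding hyperrectangle of a polytopic decision
    region, given its vertex list -- the 'rule' form described above."""
    V = np.asarray(vertices)
    return V.min(axis=0), V.max(axis=0)

# Hypothetical 4D vertices (illustrative, not from the trained network):
verts = [(-1, -2, 0, 1), (3, 0, 2, -1), (0, 1, -1, 4)]
lo, hi = bounding_rule(verts)
print(lo, hi)  # [-1 -2 -1 -1] [3 1 2 4]
```

Refining a rule amounts to splitting the vertex list by a hyperplane and re-running the same computation on each part.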

A 40 hidden unit network was trained to predict whether a thrown ball will hit a target. As input, it received the throwing angle, initial velocity and the target distance. It achieved an 87% success rate on its training data. 

In the following figure, the network's decision region is contrasted with the analytical solution to the problem. The network managed to encapsulate the gist of the analytical model, but it may miss a few jump shots. 
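The analytical model for this task is ordinary projectile motion: a ball launched at angle θ with speed v lands at range v²·sin(2θ)/g. A sketch of the hit test, with an assumed tolerance and units:

```python
import math

G = 9.8  # gravitational acceleration, m/s^2

def hits_target(angle, velocity, distance, tol=0.5):
    """Analytical hit test for the ball-throwing task: the projectile
    lands at range v^2 * sin(2*theta) / g, and 'hits' if it lands
    within `tol` of the target.  Units and tolerance are illustrative
    assumptions, not the paper's exact setup."""
    landing = velocity**2 * math.sin(2 * angle) / G
    return abs(landing - distance) < tol

print(hits_target(math.pi / 4, 9.9, 10.0))  # 45 deg, ~10 m range: True
print(hits_target(math.pi / 6, 5.0, 10.0))  # falls well short: False
```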

A neural network with 40 hidden units was trained to predict the average direction of movement (up or down) of the following month's S&P 500 Index. It was trained on macroeconomic data from 1953 to 1994. The network's inputs are: the percentage change in the S&P 500 from the past month, the differential between the 10-year and 3-month Treasury interest rates, and the corporate bond differential between AAA and BAA rates. After training, the network achieved a greater than 80% success rate on the data. In the figure we can see the network's decision regions. Immediately we can surmise that the network would probably not generalize very well to out-of-sample data. Rather than extracting some underlying regularity, some nice localized geometrical description of the data points, the network tended to highly partition the space with small and irregular decision regions to try to enclose all the sporadic data points. In the region of input space where the training data resides, the network has 28 different decision regions. Of these, all but five have a bounding rectangle with a volume of less than one. One decision region's rectangle encompasses the whole space. If we slice this polytope using three hyperplanes, each bisecting the input space across a different dimension, then if the polytope were completely convex we would expect to get at most eight subpolytopes. However, for this decision region the slicing generates 23 subpolytopes, suggesting a high degree of concavity, possibly an irregular branching structure that highly partitions the input space. 

Two neural networks were trained on vowel data available at the CMU repository. The data consist of 10 log-area parameters of the reflection coefficients of 11 different steady-state vowel sounds. Our interest in this example was to gauge the effect of using input spaces of different dimensionality: whether adding dimensions will improve the generalization of the network. Both networks were trained to recognize the vowel sound in "had", contrasted with the other 10 vowel sounds. The first network received as input the first two coefficients. After training, it achieved a better than 86% success rate on the training data. The following figure illustrates its decision regions. In the input region of [−3.2, −2.3] × [0.7, 1.2] we see that many disparate decision regions are used to secure a perfect classification. 

The second network received the first four coefficients as inputs. It achieved a success rate of over 91% on the training data. It also achieved perfect classification in the same input region. However, it does not seem to have partitioned the space: there is only one decision region in this area. If we slice this polytope using four hyperplanes bisecting each of its dimensions, we only get 10 subpolytopes, suggesting a small degree of concavity (no more than the 16 we would expect for a perfectly convex shape). In addition, these sub-decision regions are mostly delimited in the third and fourth dimensions, with the first two dimensions ignored (they span the whole space). Therefore, it appears that the network makes use of the added dimensions to form a more regular decision region in that difficult portion of the input space. 

The Decision Intersection Boundary Algorithm's complexity stems from its traversal of the hyperplane arrangement in the first layer of hidden units. As such, that part of its complexity is equivalent to similar algorithms, such as arrangement construction [Edelsbrunner 87], which are exponential in the input dimension. Do we really need to examine every vertex, though? Perhaps the network can only use a small number of decision regions, or is limited with respect to the complexity of its decision regions. The following figure demonstrates that a network is capable of exponential complexity, via an inductive construction of a network with an exponential number of decision regions, where each decision region has an exponential number of vertices. The figure shows the inductive step from one dimension to two dimensions; the zeros represent decision regions (lines in 1D, squares in 2D). 
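The 1D base of such a construction can be written down directly (an illustrative version, not the paper's exact construction): hidden units thresholded at successive integers, combined with alternating output weights, produce a comb of alternating decision regions, and each added input dimension multiplies the region count.

```python
import numpy as np

def comb_net(x, k):
    """1D threshold network whose output alternates over k unit
    intervals: hidden unit i fires when x > i, and alternating output
    weights +1/-1 turn the count of firing units into a parity-like
    comb.  With n hidden units this yields on the order of n regions
    in 1D; the inductive step to higher dimensions multiplies them."""
    h = np.array([float(x > i) for i in range(k)])    # thresholds at 0..k-1
    w = np.array([(-1.0) ** i for i in range(k)])     # alternating signs
    return float(w @ h - 0.5 > 0)                     # output threshold

# Regions alternate: (0,1) -> 1, (1,2) -> 0, (2,3) -> 1, ...
print([comb_net(x, 6) for x in (0.5, 1.5, 2.5, 3.5)])  # [1.0, 0.0, 1.0, 0.0]
```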

Generalization is the ability of the network to correctly classify points in the input space that it was not explicitly trained for. In a semiparametric model like a neural network, generalization is the ability to describe the correct output for groups of points without explicitly accounting for each point individually. Thus, the model must employ some underlying mechanisms to classify a large number of points from a smaller set of parameters. In our feedforward neural network model we can characterize the possible forms of generalization with two mechanisms. The first is proximity: nearby points in the same decision region are classified the same way. The second is face sharing: the same hidden-unit hyperplane is used as a border face in multiple decision regions, or multiple times in one decision region. Given these two mechanisms, how well do learning algorithms exploit them? It is intuitive that learning algorithms which pull borders (hyperplanes in this case) towards similar output points and away from different output points should be geared for proximity generalization by forming decision regions. However, face sharing is more combinatorial in nature, and might not be as amenable to the continuous deformation of parameters found in many algorithms. To illustrate this point, a group of 8-hidden-unit neural networks were trained on the decision regions illustrated in the 2D complexity figure. One thousand data points were taken as a training sample. Two hundred different networks were tested. Each was initialized with random weights between −2 and 2, and trained for 300 online epochs of backpropagation with momentum. Of the 200 networks, none managed to learn the desired 9 decision regions. The network with the best performance generated 6 decision regions (left figure) out of an initial two. However, in examining its output unit's weights, we see that only one weight changed sign (the weight initially closest to zero), indicating a predisposition to this configuration in the initial conditions. The other figures are the decision regions of the more successful networks. 

If we can visualize the decision regions of a neural network, as is the case for networks with two or three inputs, then we can visualize these networks as they learn, watching how their decision regions are modified with iterations of the learning algorithm. The following three movies are of networks trying to learn the decision region indicated by their respective picture: a network trying to learn a sphere, a cylinder, and three separate spherical decision regions. Each frame consists of the network's decision regions juxtaposed with a Hinton diagram of the network's weights. The blue column has the output unit's weights, while the gray rows are the hidden-unit weights. These movies are examples of relatively successful learning using backprop with momentum. Their successes are not the norm; more often than not, learning will result in inappropriate or nonexistent decision regions. Note that even though we are interpreting the activation functions as thresholds for the algorithm, backprop is treating them as sigmoids. The implications of this are outlined in the techreport. 
Now the basic DIBA algorithm is available as C++ source code under the GNU General Public License. It has been compiled under Linux, but should compile under any standard C++ compiler and libraries. The three-dimensional output formats available are Matlab, Povray, and MHD. I will not be providing support for this program, but if you generate something interesting using it, or find a novel application, let me know at melnik@cs.brandeis.edu and I will add a link to your page here. The package can be downloaded from here. 