Jupyter notebook Supplementary Materials 3 - Processing.ipynb
Supplementary Materials 3: Processing as optimization
In this Part, we focus on the processing dynamics of a GSC model. In the first half, we present three kinds of constraints that a GSC parser tries to optimally satisfy, and discuss the underlying motivations for these constraints. For each constraint, a harmony function is defined as a measure of constraint satisfaction. A GSC model tries to maximize total harmony by stochastic gradient ascent. For readers who want to understand every detail of the model, we also present the gradient of each harmony function with respect to the activation state. But it suffices to understand that (1) harmony functions were constructed to measure the goodness of an activation state with regard to different kinds of constraints and (2) the activation-state change follows the gradient of these harmony functions.
In the second half, with a simple grammar example, we first show the contribution of each kind of constraint, and then with much simpler low-dimensional systems, we explain schematically how a GSC model can handle the local ambiguity naturally arising in incremental processing.
Notations
$\mathbf{a}$ : an activation pattern in neural coordinates
$\mathbf{c}$ : an activation pattern in conceptual coordinates
$T$ : a conceptual-to-neural change-of-basis matrix; $\mathbf{a} = T\mathbf{c}$
$T^{-1}$ : a neural-to-conceptual change-of-basis matrix; $\mathbf{c} = T^{-1}\mathbf{a}$
$W$ : a (symmetric) weight matrix in neural coordinates
$W_C$ : a (symmetric) weight matrix in conceptual coordinates
$\mathbf{b}$ : a bias vector in neural coordinates
$\mathbf{b}_C$ : a bias vector in conceptual coordinates
$\mathbf{e}$ : an external input vector in neural coordinates
$\mathbf{e}_C$ : an external input vector in conceptual coordinates
$\mathbf{z}$ : a baseline (resting) activation state in neural coordinates
$\mathbf{z}_C$ : a baseline (resting) activation state in conceptual coordinates
$\varphi\rho$ : an index for each component of an activation vector in neural coordinates; the $\varphi\rho$-th component of the activation vector can be viewed as a product of the $\varphi$-th component of a filler representation vector and the $\rho$-th component of a role activation vector.
$fr$ : an index for each component of an activation vector in conceptual coordinates; the $fr$-th component of the activation vector (i.e., the activation of a binding $f/r$) can be viewed as a product of the $f$-th component of a filler representation vector and the $r$-th component of a role activation vector.
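To make the tensor-product notation concrete, here is a minimal numpy sketch (an illustration, not the package's implementation) of how binding vectors and the conceptual-to-neural change-of-basis matrix can be constructed; all dimensions and vectors here are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

n_fillers, n_roles, dim_f, dim_r = 2, 2, 2, 2
F = np.linalg.qr(rng.standard_normal((dim_f, n_fillers)))[0]  # orthonormal filler vectors (columns)
R = np.linalg.qr(rng.standard_normal((dim_r, n_roles)))[0]    # orthonormal role vectors (columns)

# Conceptual-to-neural change-of-basis matrix: the column for binding f/r is
# the flattened outer product of filler f and role r (what the software keeps
# in the attribute TP; its inverse corresponds to TPinv).
TP = np.column_stack([np.outer(F[:, f], R[:, r]).ravel()
                      for f in range(n_fillers) for r in range(n_roles)])
TPinv = np.linalg.inv(TP)

c = np.array([1.0, 0.0, 0.0, 1.0])  # conceptual coordinates: two active bindings
a = TP @ c                          # the same state in neural coordinates
assert np.allclose(TPinv @ a, c)    # round trip recovers conceptual coordinates
```

Because the filler and role vectors are orthonormal, the binding vectors (columns of TP) are orthonormal too, so the change of basis is exactly invertible.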
Constraint I: Grammatical constraints
Grammar harmony $H_G$, evaluated at network activation state $\mathbf{a}$, with external input $\mathbf{e}$, is defined in neural coordinates as follows:

$$H_G(\mathbf{a}; \mathbf{e}) = \frac{1}{2}\,\mathbf{a}^{\mathsf{T}} W \mathbf{a} + \mathbf{b}^{\mathsf{T}} \mathbf{a} + \mathbf{e}^{\mathsf{T}} \mathbf{a}$$

It follows that

$$\nabla_{\mathbf{a}} H_G = W \mathbf{a} + \mathbf{b} + \mathbf{e}.$$

Grammar harmony measures how much an activation state satisfies the grammatical constraints specified in a harmonic grammar. Because $\mathbf{a} = T\mathbf{c}$ and $\mathbf{c} = T^{-1}\mathbf{a}$,

$$H_G = \frac{1}{2}\,\mathbf{c}^{\mathsf{T}} W_C \mathbf{c} + \mathbf{b}_C^{\mathsf{T}} \mathbf{c} + \mathbf{e}^{\mathsf{T}} T \mathbf{c}$$

where $W_C = T^{\mathsf{T}} W T$ and $\mathbf{b}_C = T^{\mathsf{T}} \mathbf{b}$. Now $W_C$ and $\mathbf{b}_C$ are given directly by a harmonic grammar; we can then easily compute $W = T^{-\mathsf{T}} W_C T^{-1}$ and $\mathbf{b} = T^{-\mathsf{T}} \mathbf{b}_C$. The last term in the above equation measures how much the current state satisfies the faithfulness constraint that the present state must be the same as the external input. If the neural encodings of bindings are orthogonal, $T^{\mathsf{T}} T = I$, so the last term simply becomes $\mathbf{e}_C^{\mathsf{T}} \mathbf{c}$, where $\mathbf{e}_C = T^{-1}\mathbf{e}$. (The software stores $T$ in the attribute TP and its inverse $T^{-1}$ in the attribute TPinv.)
Constraint II: Baseline constraint
The second kind of constraint requires that the activation state stay at a baseline (or resting) state $\mathbf{z}$. We define baseline harmony as follows:

$$H_B(\mathbf{a}) = -\frac{\beta}{2}\,\lVert \mathbf{a} - \mathbf{z} \rVert^2$$

If the neural encodings of bindings are orthogonal ($T^{\mathsf{T}} T = I$, as above), we get

$$H_B = -\frac{\beta}{2}\,\lVert \mathbf{c} - \mathbf{z}_C \rVert^2$$

Because $H_B$ has the shape of an inverted bowl, we also call it the bowl harmony.

We set $\mathbf{z}_C$ to $0.5 \cdot \mathbf{1}$, the center of a unit hypercube in the conceptual space. The role of this constraint should be clear from its derivative:

$$\frac{\partial H_B}{\partial \mathbf{a}} = -\beta\,(\mathbf{a} - \mathbf{z})$$
If a unit's activation value is greater than its resting level, it will decrease; if it is lower than its resting level, it will increase. Thus, the constraint creates a force that pulls the state toward its resting state. The parameter $\beta$ determines the relative strength of the baseline constraint. The minimal requirement on the parameter is that its value be greater than the maximum eigenvalue of $W$. Given a harmonic grammar, the software will set its value automatically, but users can also set it manually. When the discreteness pressure is minimal (see below), this constraint prevents the model from exploding; in other words, if $\beta$ is set appropriately, the system is stable and has a single, globally optimal solution, which is the neural state that the system reaches in the long run.
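A small numpy sketch of the bowl term and of the stability condition just described (random $W$ and illustrative values; not the software's defaults):

```python
import numpy as np

# Bowl (baseline) harmony H_B(a) = -(beta/2)||a - z||^2. With beta above the
# maximum eigenvalue of W, the quadratic part of total harmony, H_G + H_B,
# is concave and therefore has a single maximum.
def bowl_harmony(a, z, beta):
    d = a - z
    return -0.5 * beta * d @ d

def bowl_grad(a, z, beta):
    return -beta * (a - z)   # a force pulling the state back toward z

rng = np.random.default_rng(2)
n = 4
M = rng.standard_normal((n, n))
W = 0.5 * (M + M.T)                          # a symmetric weight matrix
z = np.full(n, 0.5)                          # baseline at the hypercube center
beta = np.max(np.linalg.eigvalsh(W)) + 1.0   # strictly above the max eigenvalue

# Curvature of H_G + H_B is W - beta*I: negative definite iff beta > max eig(W).
assert np.all(np.linalg.eigvalsh(W - beta * np.eye(n)) < 0)
assert np.allclose(bowl_grad(z, z, beta), 0.0)   # no restoring force at z
```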
Constraint III: Quantization constraint
Because the end product of incremental parsing must be a discrete symbolic structure, we introduce a discreteness constraint that requires that every role must be bound to a single filler (with activity level 1). (Note: as we discussed in Part 1, the current version of the software treats an empty role as a binding of a role with a null filler, which makes the equations simpler.)
The constraint is implemented as a combination of two different constraints: (1) each binding's activation must be either 0 or 1; (2) the sum of the squared activations of bindings in each group of competing bindings must be 1. We quantify the degree of constraint satisfaction as follows:
$$H_{Q1}(\mathbf{a}) = -\sum_{fr} c_{fr}^2\,(1 - c_{fr})^2$$

$$H_{Q2}(\mathbf{a}) = -\sum_{B \in \mathcal{B}} \Bigl(\sum_{fr \in B} c_{fr}^2 - 1\Bigr)^2$$

where $c_{fr}$ is the $fr$-th component of $\mathbf{c}$; $\mathcal{B}$ is an index set of groups of competing bindings; $B$ is a set of bindings that compete with each other. Each set of f/r bindings with a given span role r constitutes such a subset because only one filler can be bound to a role. Here $c_{fr} = \sum_{\varphi\rho} (T^{-1})_{fr,\varphi\rho}\,a_{\varphi\rho}$, where $(T^{-1})_{fr,\varphi\rho}$ is the $(fr, \varphi\rho)$-th element of $T^{-1}$ and $a_{\varphi\rho}$ is the $\varphi\rho$-th neural coordinate of the activation vector. We will simplify the notation using the Einstein summation convention such that $c_{fr} = (T^{-1})^{fr}{}_{\varphi\rho}\,a^{\varphi\rho}$ (sum over up-down index pairs).

Because we are interested in the activation-state change in neural coordinates, we compute the partial derivative of the two harmony equations with respect to the $\varphi\rho$-th component of $\mathbf{a}$ as follows:

$$\frac{\partial H_{Q1}}{\partial a^{\varphi\rho}} = -\sum_{fr} 2\,c_{fr}(1 - c_{fr})(1 - 2c_{fr})\,(T^{-1})^{fr}{}_{\varphi\rho}$$

$$\frac{\partial H_{Q2}}{\partial a^{\varphi\rho}} = -\sum_{B \in \mathcal{B}} 4\Bigl(\sum_{fr \in B} c_{fr}^2 - 1\Bigr)\sum_{fr \in B} c_{fr}\,(T^{-1})^{fr}{}_{\varphi\rho}$$

Then, the harmony value of an activation state given the discreteness constraint is defined as a weighted sum of the above two terms:

$$H_Q(\mathbf{a}) = \zeta_1\,H_{Q1}(\mathbf{a}) + \zeta_2\,H_{Q2}(\mathbf{a}).$$
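The two quantization terms can be sketched in conceptual coordinates. The matrix layout (fillers by roles) and the relative weight of the two terms (zeta here) are illustrative assumptions, not the software's parameterization.

```python
import numpy as np

def quantization_harmony(C, zeta=1.0):
    # C: (n_fillers, n_roles) matrix of binding activations c_{f/r}
    h1 = -np.sum(C**2 * (1.0 - C)**2)              # penalize values away from {0, 1}
    h2 = -np.sum((np.sum(C**2, axis=0) - 1.0)**2)  # per-role sum of squares must be 1
    return h1 + zeta * h2

one_hot = np.array([[1.0, 0.0],
                    [0.0, 1.0]])       # each role bound to exactly one filler
blend = np.full((2, 2), 0.5)           # every role a 50/50 blend of fillers

assert quantization_harmony(one_hot) == 0.0   # fully discrete: no penalty
assert quantization_harmony(blend) < 0.0      # blend states are penalized
```

A fully discrete state reaches the maximum value 0; blends are pulled apart as the weight on this term grows.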
Constraints combined
The total harmony of a neural state $\mathbf{a}$, given external input $\mathbf{e}$ and discreteness pressure $q$, is defined as follows:

$$H(\mathbf{a}; \mathbf{e}, q) = H_G(\mathbf{a}; \mathbf{e}) + H_B(\mathbf{a}) + q\,H_Q(\mathbf{a})$$

where $q$ is a dynamic parameter weighting the discreteness constraint. Because the goal of GSC is to build a discrete symbolic structure, we assume that $q$, the relative strength of the discreteness constraint or simply the discreteness pressure, gradually increases in time. In the current version of the software, $q$ changes linearly as a function of time up to an upper bound (200, in the default setting) that users can change.
Processing as harmony maximization
A GSC model is a stochastic process that tries to maximize total harmony by following, on average, the harmony gradient: it performs noisy gradient ascent. The state change is as follows:

$$d\mathbf{a} = \nabla_{\mathbf{a}} H(\mathbf{a})\,dt + \sqrt{2T}\,d\mathbf{B}$$

where $\mathbf{B}$ is the standard Wiener process and $T$ is the computational temperature, determining the noise magnitude; $T$ decreases to a small value during computation.
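The stochastic update can be sketched with an Euler-Maruyama discretization. The 1D harmony used here is an illustrative stand-in whose coefficients are chosen to match the behavior of the 1D example described later in this notebook, not taken from the software; the linear $q$ ramp and the parameter values are likewise assumptions.

```python
import numpy as np

# Gradient of a toy 1D total harmony H(x) = e*x - 2*x**2 - q*(x**2 - 1)**2.
def grad_H(x, e, q):
    return e - 4.0*x - 4.0*q*x*(x**2 - 1.0)

def run(x0, e, T, q_max, steps=5000, dt=1e-3, seed=0):
    """Euler-Maruyama noisy gradient ascent with a linear discreteness ramp."""
    rng = np.random.default_rng(seed)
    x = x0
    for i in range(steps):
        q = q_max * i / steps                               # q grows linearly in time
        x += grad_H(x, e, q) * dt                           # deterministic drift
        x += np.sqrt(2.0 * T * dt) * rng.standard_normal()  # Wiener increment
    return x

x_end = run(x0=-1.0, e=1.0, T=0.01, q_max=5.0)
assert abs(x_end - 1.0) < 0.3   # input e = +1 drives the state to the +1 grid state
```

Even from an initial state on the wrong side, the external input biases the early (single-optimum) surface so that the state ends near the grid state consistent with the input.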
Demo
In this section, we build a simple GSC model and investigate the contribution of each kind of constraint. To make the dynamics clearer, we will set the computational temperature to 0.
Consider a simple grammar G = {S → A B | X Y}. Let T1 and T2 be the parse trees of the two grammatical sentences: T1=[ [ [A B]]] and T2=[ [ [X Y]]].
Baseline constraint only
The GSC model uses all three kinds of constraints to build a discrete symbolic structure, but users can turn off the contribution of each constraint by setting certain parameters, as follows.
As you can see, the system reaches an equilibrium state, which is the baseline state (every binding's activation at 0.5). The baseline constraint contributes to the stability of the system.
Grammatical constraints only
When both baseline and quantization constraints are turned off, the system loses its stability and explodes.
Grammatical constraints + baseline constraint
First, each binding with the null filler has the value of 0.5; due to numerical precision issues, the actual values are not exactly 0.5. Those bindings do not interact with others, so the state change in those bindings is governed only by the baseline constraint.
Second, the model prefers the constituents of a grammatical structure to other constituents. For example, is preferred to . This is because the former receives additional supporting signal from its grammatical mother while the latter does not receive such top-down support. Due to the negative bias, any binding with no top-down support will have an activation value smaller than 0.5.
Third, has a more negative bias (see Part 1) so it is less activated than other ungrammatical bindings (e.g., ) in the equilibrium state.
The blend state does not tell us how the grammatical bindings must be combined. As we discussed in Part 2, the blend state may be viewed as a blend of two grammatical structures T1 and T2 or a blend of two ungrammatical structures T3=[ [ [A Y]]] and T4=[ [ [X B]]]. We will briefly show that, from the apparently ambiguous blend state, the GSC model can build either T1 or T2 but neither T3 nor T4 by increasing the quantization constraint gradually.
NOTE: In fact, it is misleading to say that the blend state is a blend of multiple discrete structures (e.g., T1 and T2 in the present example) because the model does not monitor global coherence and does not even represent global structures as such. We must say that a blend state is a state in which each role is occupied by a blend of fillers. The filler/role bindings that are present in grammatical structures are however more active than those that are not.
Quantization constraint only
The second kind of constraint is applied to each role (a group of bindings bound to the same role).
Each role is occupied by a single filler. Because we did not consider the grammatical constraints at this time, the model randomly chose a filler in each role. Thus, the chosen structure can be ungrammatical.
All constraints combined
Before we run the model with all three constraints considered, we reset T to a small value (= 0.01). Small noise is required to introduce symmetry breaking in the present example.
Note: If you run the following code blocks with T = 0, you will see the model holds the blend state even when the discreteness pressure is large. This is because a subset of local constraints for T1 is perfectly balanced with another subset of local constraints for T2. External input as well as noise will break the symmetry in this case.
The model builds a grammatical structure, either [ [ [A B]]] or [ [ [X Y]]]. In Part 2, we argued that the superposition of bindings in a single structure is not catastrophic. The demo above suggests that the superposition of complete structures is not catastrophic, either. The model has the ability to consider multiple grammatical structures together and choose one over the others. The weights that implement binary harmonic grammar rules contain the information of which bindings must go with which bindings, and the processing dynamics uses this information to unblend the ambiguous blend state into a grammatical symbolic structure.
In this demo, we did not provide any external input to the model. Only by changing the discreteness pressure gradually, the model could build a grammatical structure. This case is special because there is no local ambiguity between input symbols and target structures. In general, even with simple grammars, the language processing system will encounter local ambiguity, the case in which linguistic input is consistent with multiple interpretations. The goal of incremental processing is to use the bottom-up information (linguistic input) as much as possible to exclude impossible interpretations while maintaining all possible interpretations. In the next section, we show in a schematic way how the GSC model can achieve this computational goal.
Incremental processing with external input handles local ambiguity
An intuitive way to understand the processing dynamics of a GSC model is to visualize the harmony surface $\{(\mathbf{a}, H(\mathbf{a})) \mid \mathbf{a}\ \text{a possible state}\}$ and investigate how the surface changes when external input is provided and the discreteness pressure increases. Typically, the GSC model has a high-dimensional state space, so it is very hard to visualize the harmony surface directly. (Note: One solution is to focus on the topological structure of the harmony surface and visualize it as a disconnectivity graph [Becker & Karplus, 1997; Wales, Miller, & Walsh, 1998]. We used this method to show how the harmony surface changes while a GSC model is parsing a two-word sentence 'a b'. For details, see Cho and Smolensky (2016).) Here we will build simple 1D and 2D dynamical systems that share the same kinds of constraints as the model presented above and investigate harmony-surface change in those simplified systems. Even these minimal systems will reveal how the GSC model can solve two computational problems in incremental processing.
1D example
Let us begin with the simplest case, a 1D dynamical system. Suppose a single, continuous state variable $x$ represents one of two discrete symbols, one at $x = 1$ and the other at $x = -1$. We want the system to start at an arbitrary state and reach a grid state (one encoding a discrete structure): $x = 1$ or $x = -1$. If an external input is given to bias the system toward one symbol over the other, the end state must be consistent with the external input.
Given the computational goal, we define Total Harmony at state $x$, $H(x; e, q)$ (see harmony1D in the following code block), as we did above. $e$ indicates an external input, which biases the system toward $x = 1$ if $e > 0$ or toward $x = -1$ if $e < 0$. $q$ represents the quantization strength. The two parameters may change their values during incremental processing; as we described before, $q$ is assumed to increase gradually in time. We want to understand the effect of the two parameters on the harmony surface structure. The set of parameters that do not change during processing (e.g., weight values) is collected in params.
With the grammar and baseline terms together, the model prefers a blend state that is slightly biased toward the discrete state consistent with the external input. The quantization term makes the model prefer a grid state. (In this 1D system and the next 2D system, we can ignore the first quantization term, the 0/1 constraint, because there is only one unit per role.)
The function harmonyGrad1D returns the derivative of Total Harmony evaluated at state $x$ for a given parameter setting ($e$, $q$, and params). As in any GSC model, $dx = \frac{\partial H}{\partial x}\,dt$ + noise. run1D updates the state over a given interval of time tspan, during which $e$ is fixed and $q$ increases linearly. animate_run1D is used to create an animation.
Run the code block below and investigate how the shape of the harmony surface $H(x; e, q)$ changes as a function of $e$ and $q$. The external input $e$ (variable e) can be set to 0 (no external input), 1 (external input biasing the system toward $x = 1$), or -1 (external input biasing the system toward $x = -1$).
Note: the interactive widgets in this notebook will not be displayed in the HTML file. We recommend readers check the Jupyter/IPython notebook file.
First, set $e$ and $q$ to 0. You will see a single maximum at $x = 0$, which is the optimal state given the constraints. If the system performs gradient ascent to maximize $H$, it will eventually reach this local maximum (at $x = 0$; in this case, it is also the global maximum) regardless of the initial state. Then, increase $q$ gradually from 0 to 5 and see how the harmony surface changes. You will see the shape of the harmony surface change qualitatively as $q$ moves from 1 to 1.2. When $q$ is smaller than 1, there is a single (hence global) maximum. However, when it is greater than 1.2, the surface has two local maxima with the same Total Harmony. In dynamical systems theory, each local optimum is called a (point) attractor (i.e., a fixed point that is asymptotically stable), and the qualitative change in the harmony surface is called a bifurcation. When a small amount of random noise is assumed (non-zero computational temperature in the GSC model), the system will reach the state $x = 1$, corresponding to the representation of one of the discrete symbols, in about 50% of trials.
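The bifurcation described above can be checked numerically. We assume the illustrative 1D harmony $H(x; e, q) = e\,x - 2x^2 - q(x^2 - 1)^2$, whose coefficients are chosen to reproduce the behavior reported in the text (a maximum at $x = 0.25$ when $e = 1$ and $q = 0$, and a bifurcation near $q = 1$ when $e = 0$); this is a reconstruction, not the notebook's harmony1D.

```python
import numpy as np

def harmony1D(x, e, q):
    return e*x - 2.0*x**2 - q*(x**2 - 1.0)**2

def count_local_maxima(e, q):
    # Count strict interior local maxima of the harmony surface on a fine grid.
    x = np.linspace(-2.0, 2.0, 4001)
    h = harmony1D(x, e, q)
    interior = (h[1:-1] > h[:-2]) & (h[1:-1] > h[2:])
    return int(np.sum(interior))

assert count_local_maxima(e=0.0, q=0.5) == 1   # single optimum below the bifurcation
assert count_local_maxima(e=0.0, q=1.2) == 2   # two attractors above it
```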
Second, set $q$ to 0 again but this time, set $e$ to 1. In other words, we consider a case in which the model receives external input suggesting the symbol at $x = 1$. The external input makes that state and its neighbors have higher Total Harmony values than the state representing the other symbol and its neighbors; note that $H$ contains the term $e\,x$, measuring how much the faithfulness constraint is satisfied. You will see that Total Harmony is maximal at $x = 0.25$. It is a blend state, an intermediate state between multiple discrete states, containing some information that one discrete state should be preferred to the other. Now increase $q$ from 0 to 5 slowly. The shape of the harmony surface changes qualitatively around $q = 2$. If the system has enough time before $q$ grows over 2, the state is likely to be near the global maximum before the bifurcation occurs. As $q$ increases further, the local/global optimum moves toward the ideal discrete state $x = 1$, so the system will eventually reach $x = 1$. Note that the other locally optimal, but globally non-optimal, state (which will evolve toward the other discrete state at $x = -1$) is separated by a harmony valley. Given that the system performs stochastic gradient ascent, it is hard to cross the harmony valley unless the computational noise is huge.
Simulation
Let us investigate how the state changes when the model receives the external input $e = 1$. $q$ will increase linearly from 0 to 5 in the time interval [0, 5]. To emphasize the insensitivity of the model to its initial state, we set the initial state to $x = -1$, the grid state for the other symbol.
The above animation shows how the harmony surface changes as a function of $q$ when the external input is fixed at 1. The moving red dot represents the current state. As $q$ passes a critical value $q_c$, a harmony valley emerges, separating the state space into two basins of attraction, the regions covered by the two local humps. For the system to reach the target state ($x = 1$), it must be somewhere in the basin containing $x = 1$ when $q$ passes $q_c$; for simplicity, we are ignoring noise here. This is possible because when $q$ is low, the harmony surface has a single optimum, which is biased toward the target state due to the external input, so the system will quickly move to the global optimum.
Simulated annealing
Before moving to a 2D example, we briefly discuss an alternative algorithm by which the model can reach the global optimum and then explain why we do not use it in incremental processing.
Deterministic gradient ascent is a local optimization algorithm and may not converge to a global optimum. If the harmony surface is fixed, then where the system ends up is entirely determined by where it starts. For example, consider the 1D example again, but this time let us fix $q$ to 3.
The figure above shows a sample state trajectory on the static harmony surface when the system starts at $x = -1.1$ with $T = 0$ (no noise). It converges to a local optimum indicated by the red dot. When $T$ is 0, the system cannot go downhill, so there is no way to escape from the local optimum.
In many situations, however, we may want a system to be able to reach the global optimum regardless of where it starts. One solution is simulated annealing (Geman & Geman, 1984; see also Chiang, Hwang, & Sheu, 1987). The idea is simple. We start the system at a high temperature and gradually reduce the temperature---recall that the temperature determines the magnitude of the noise added when updating the state. Roughly speaking, when the temperature is high, the state jumps between the two humps. The temperature gradually decreases to a medium level, at which the state can jump from the left hump to the right hump but cannot jump from the right to the left; to jump from the right to the left hump, the state must go downhill farther, which requires a greater amount of noise. Thus, if the system spends enough time at the medium temperature level, it will eventually move to and stay on the right hump. Then, the temperature decreases further and the state will eventually reach the global optimum. The animation below shows how the state changes when simulated annealing is used.
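Simulated annealing on the fixed surface can be sketched as follows ($q$ held at 3, as in the text). The harmony form, starting temperature, and geometric cooling schedule are illustrative assumptions, not the notebook's actual settings.

```python
import numpy as np

def grad1D(x, e, q):
    # Gradient of the illustrative harmony H(x) = e*x - 2*x**2 - q*(x**2 - 1)**2.
    return e - 4.0*x - 4.0*q*x*(x**2 - 1.0)

def anneal(x0, e=1.0, q=3.0, T0=2.0, cool=0.9995, steps=16000, dt=1e-3, seed=0):
    """Noisy gradient ascent while the temperature decays geometrically."""
    rng = np.random.default_rng(seed)
    x, T = x0, T0
    for _ in range(steps):
        x += grad1D(x, e, q) * dt + np.sqrt(2.0 * T * dt) * rng.standard_normal()
        T *= cool            # gradually reduce the computational temperature
    return x

# Start in the basin of the poorer (left) optimum; high early temperature lets
# the state cross the harmony valley, so most runs end on the right hump.
ends = [anneal(x0=-1.1, seed=s) for s in range(6)]
assert sum(x > 0.5 for x in ends) >= 3
```

Because annealing is stochastic, the sketch checks that a majority of seeds, not every seed, reaches the right hump.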
In the sample run, the model has reached the global optimum when it started at x = -1.
We do not use simulated annealing in incremental processing for the following reasons. First, in incremental processing, the model must encode the present linguistic input such that it can integrate it with future input. In the GSC model, the present state and the present $q$ value carry this information. The large noise required in simulated annealing would make the present state unreliable; in some sense, the model would develop a very noisy memory of the past. Second, in the GSC model, the harmony surface changes dynamically as $q$ increases. The 1D example above and the following 2D example suggest that when $q$ is updated carefully, the system can reach the target state without getting stuck in an undesired local optimum.
2D example
Let us investigate a 2D dynamical system. The system has four discrete symbolic states in a 2D state space: S1 = $(1, 1)$, S2 = $(1, -1)$, S3 = $(-1, 1)$, S4 = $(-1, -1)$, representing the tuples (A, B), (A, C), (D, B), (D, C). These four tuples correspond to the symbolic interpretations of the strings 'A B', 'A C', 'D B', 'D C', respectively. The first state variable $x$ represents the two discrete symbols A ($x = 1$) and D ($x = -1$) at the first letter position, and the second state variable $y$ represents the two other discrete symbols B ($y = 1$) and C ($y = -1$) at the second letter position. No connection between the two state variables is assumed, which means the model does not care what symbol follows the first symbol. Although this model does not represent hierarchical constituent structure, it is still interesting because it shows how the GSC model can handle a discrete combinatorial system, which creates local ambiguity in incremental processing. In this example, we will show how the model can exclude interpretations that are impossible given the first letter even after the first letter has been replaced by the second in the incremental input.
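Because the two positions are unconnected, a 2D total harmony can be sketched as a sum of two independent 1D harmonies (the same illustrative form as in the 1D example; all coefficients are assumptions):

```python
import numpy as np

def harmony2D(x, y, ex, ey, q):
    # Sum of two independent 1D harmonies, one per letter position.
    h = lambda v, e: e*v - 2.0*v**2 - q*(v**2 - 1.0)**2
    return h(x, ex) + h(y, ey)

xs = np.linspace(-1.5, 1.5, 301)
X, Y = np.meshgrid(xs, xs)
H = harmony2D(X, Y, ex=0.0, ey=0.0, q=10.0)   # no input, strong quantization

# At large q the local maxima sit near the grid states (+-1, +-1).
i, j = np.unravel_index(np.argmax(H), H.shape)
assert abs(abs(X[i, j]) - 1.0) < 0.1 and abs(abs(Y[i, j]) - 1.0) < 0.1
```

With no external input the surface is fully symmetric, so all four grid-state humps have the same height; external input tilts the surface toward the humps consistent with the input.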
Run the following code block to generate the interactive harmony surface plots. As in the 1D case, you can change $e_x$, $e_y$, and $q$ to see how the harmony surface changes.
The left panel shows the harmony surface in a 3D surface plot and the right panel shows the same harmony surface in a 2D contour plot. The red and blue colors represent higher and lower harmony, respectively. A small number of sampled state changes (a.k.a. streamlines) is overlayed on the contour plot. The four red dots correspond to four discrete symbolic states ("the grid").
First, set all three free parameters to 0. The system has a global optimum at (0, 0). Increase $q$ gradually and check when the shape of the harmony surface changes qualitatively. When $q$ increases from 1 to 1.2, the previously global optimum breaks into four local maxima; to see the change, pay attention to the streamlines in the right panel. As $q$ increases further, the four local optima become more pronounced. With small random noise, the system will eventually reach one of the four discrete states.
Second, set $q$ back to 0 and $e_x$ to 1. Keep the value of $e_y$ at 0. This case corresponds to a model that receives 'A' as the first symbol of a string. The system has its highest harmony at a state on the positive $x$ axis. Increase $q$ gradually from 0 to 2. You will see that the global maximum moves toward the positive $x$ axis and that the shape of the harmony surface changes qualitatively when $q$ increases from 0.8 to 1. Now the model has two local optima, so even with small noise, the system will move to one of them. In this two-letter string case, it is not good to increase $q$ too much. To see the reason, increase $q$ to 5. If $q$ has increased slowly and gradually, the state when $q$ = 5 will be near S1 or S2, corresponding to the tuples (A, B) or (A, C). Suppose the system is near S1 when $q$ = 5 and receives the second input 'B'. To see the change in the harmony surface, set $e_x$ to 0 and $e_y$ to 1. We set $e_x$ to 0 because we assume the system does not explicitly keep every input in its past history. It is highly likely that the human language processing system can keep track of a small number of past input symbols, but the number cannot be very large. Our goal is to understand a mechanism that stores the information from past history implicitly, without explicitly keeping all the previous input symbols. (We can add an additional memory mechanism with which the model can keep previous input symbols---in this example, 'A'.) Note that the harmony surface does not change much. S1 is still a local maximum, so the system will stay near that discrete state. In this case, the actual input was 'A B' and the system has reached S1, representing (A, B).
In the discussion below, we consider three different scenarios with regard to the $q$ update.
Case 1: $q$ increases appropriately
First, we show that the model can parse a sentence accurately if $q$ increases neither too slowly nor too fast. In the example below, $q$ increases linearly from 0 to 1.2 while the model processes the first word 'A' in the time interval [0, 1.2], and from 1.2 to 3.0 while the model processes the second word 'B' in the time interval [1.2, 2.4].
The blue and red lines indicate the state change while the model was processing the first and the second words, respectively. The red dot indicates the end state.
At the beginning, the harmony surface has a single global optimum (because $q$ is low), which is biased toward A at the first position due to the faithfulness constraint. Both interpretations consistent with the first-word input, S1 = (A, B) and S2 = (A, C), are equally good---the states representing the two interpretations have the same Total Harmony values. $q$ increases gradually, so the harmony surface changes gradually as well. The state climbs up the harmony hill to reach the highest point. Then, the harmony surface changes abruptly. This is due to the discrete external-input update from A at the first position to B at the second position ($e_x$ was set to 0 and $e_y$ was set to 1). Now the states consistent with the second-word input have the same harmony value; in other words, given $q$ and the external input, both interpretations S1 = (A, B) and S3 = (D, B) are equally good. However, $q$ continues to grow and at some point the topological structure of the harmony surface changes qualitatively (i.e., a bifurcation occurs!) and multiple local optima emerge. Each local hump is separated from every other hump. At that time, the system's state is on the hump containing the state representing the target interpretation S1 = (A, B). On average, the state does not roll down a hill to jump to other hills. Thus, the system reaches the target state.
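Case 1 can be sketched with the same illustrative per-position harmony as before, switching the external input between the two words and ramping $q$ on the schedule given above (a reconstruction under assumed coefficients, not the notebook's code):

```python
import numpy as np

def grad1D(v, e, q):
    # Gradient of the illustrative per-position harmony e*v - 2*v**2 - q*(v**2-1)**2.
    return e - 4.0*v - 4.0*q*v*(v**2 - 1.0)

def run_incremental(T=0.01, dt=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    x = y = 0.0
    for word in (0, 1):                        # word 0 = 'A', word 1 = 'B'
        ex, ey = (1.0, 0.0) if word == 0 else (0.0, 1.0)
        q0, q1 = (0.0, 1.2) if word == 0 else (1.2, 3.0)
        steps = 1200                           # 1.2 time units per word
        for i in range(steps):
            q = q0 + (q1 - q0) * i / steps     # linear q ramp within each word
            x += grad1D(x, ex, q) * dt + np.sqrt(2*T*dt) * rng.standard_normal()
            y += grad1D(y, ey, q) * dt + np.sqrt(2*T*dt) * rng.standard_normal()
    return x, y

x_end, y_end = run_incremental()
assert x_end > 0.5 and y_end > 0.5   # the state ends near S1, the grid state for 'A B'
```

During the first word, $x$ commits toward A while $y$ stays near its blend value; during the second word, the input pulls $y$ toward B while the deepening valleys keep $x$ on the A side.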
Both the 1D and 2D examples show that (1) external input increases grammar harmony (so total harmony as well) of the states representing discrete, symbolic structures that are locally coherent with the input; (2) increasing quantization force reshapes the harmony surface and separates every discrete state from every other discrete state. By combining (1) and (2), the GSC model can solve the computational problems arising in incremental processing.
Case 2: $q$ increases too slowly
In this example, we show that if $q$ increases too slowly while the model is processing the first word 'A', the model may fail to parse the sentence. $q$ increases linearly from 0 to 0.3 while the model processes the first word 'A' in the time interval [0, 1.2], and from 0.3 to 3.0 while the model processes the second word 'B' in the time interval [1.2, 2.4].
In this scenario, the model increased $q$ very slowly while it was processing the first word 'A'. The global optimum is only weakly biased toward S1 and S2 (consistent with input 'A'). When the second word 'B' was presented, the harmony surface changed, but due to the small value of $q$, the surface still had a single optimum, biased toward S1 and S3 (consistent with input 'B'). The system moved close to the new optimum before harmony valleys emerged. While the emerging harmony valleys were still shallow, the system was near the bottom of a valley, so the model could jump between the two local humps (containing S1 and S3) with a small amount of noise. In this example, the model chose the wrong hump (containing S3) over the target hump (containing S1). This is a kind of local-coherence error: the model chose a structure that is locally coherent (i.e., coherent with the bottom-up input) but globally incoherent (i.e., incoherent with the context, the first word 'A').
Case 3: $q$ increases too quickly
If $q$ increases too quickly while the model is processing the first word 'A', the model will fail to maintain multiple possible interpretations (both S1 and S2 are consistent with the first word 'A') and will be forced to choose one over the other. In this example, $q$ increases linearly from 0 to 2.4 while the model processes the first word 'A' in the time interval [0, 1.2], and from 2.4 to 3.0 while the model processes the second word 'B' in the time interval [1.2, 2.4].
In this scenario, the model increased $q$ very quickly while it was processing the first word 'A'. Before the second word was presented, harmony valleys had already emerged, so the model was forced to make a decision between S1 and S2, which are both consistent with input 'A'. In this sample run, the model chose S2 over S1. Then the model was given the second word 'B', which is consistent with S1 and S3. While the second word was being processed, the local humps containing S1 and S3 had higher harmony values than the other two local humps containing S2 and S4. But every hump was already separated from every other hump by deep harmony valleys. Thus, the system could not escape from the S2 hump. This is a garden-path error.
A GSC model
In this section, we investigate a GSC model implementing a similar but slightly more complex grammar G = {S → A B | A C | D B | D C}. This time, we assume that the interpretations of the sentences are not simple tuples but tree structures.
Note: Although the model regenerates random distributed representations each time it is created, re-running the model in the examples below will simply reproduce the same results because the seed for the random number generator is fixed across runs. (Changing the seed argument can generate new results.)
Case 1: $q$ increases appropriately
To compare the result with the 2D examples investigated above, we project the state trajectories in the 40-dimensional vector space onto a 2D plane as follows:
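The projection step can be sketched as follows; the two direction vectors here are random stand-ins for whatever pair of axes the notebook actually uses (e.g., difference vectors between competing bindings' neural patterns):

```python
import numpy as np

rng = np.random.default_rng(4)
traj = rng.standard_normal((100, 40))    # a fake 100-step trajectory in 40 dimensions

# Orthonormalize two chosen direction vectors to define the projection plane.
d1, d2 = rng.standard_normal((2, 40))
basis, _ = np.linalg.qr(np.column_stack([d1, d2]))   # shape (40, 2), orthonormal columns

coords2d = traj @ basis                  # (100, 2) coordinates in the plane
assert coords2d.shape == (100, 2)

# Projection is exact for states that already lie in the plane.
p = 0.3 * basis[:, 0] - 1.2 * basis[:, 1]
assert np.allclose(np.array([p]) @ basis, [[0.3, -1.2]])
```

Orthonormalizing the spanning vectors first (via QR) makes the 2D coordinates a faithful orthogonal projection rather than an oblique one.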
The figure below shows the state trajectory. The blue and red lines indicate the state changes while the model was processing the first and the second word, respectively. The red dot indicates the end state.
Case 2: $q$ increases too slowly
Case 3: $q$ increases too quickly
References
Chiang, T., Hwang, C., & Sheu, S. (1987). Diffusion for global optimization in $\mathbb{R}^n$. SIAM Journal on Control and Optimization, 25(3), 737–753. https://doi.org/10.1137/0325042
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(6), 721–741. https://doi.org/10.1109/TPAMI.1984.4767596