LEARNING PROBABILISTIC MODELS
OF WORD SENSE DISAMBIGUATION
Approved by:
Dr. Dan Moldovan
Dr. Rebecca Bruce
Dr. Weidong Chen
Dr. Frank Coyle
Dr. Margaret Dunham
Dr. Mandyam Srinath
LEARNING PROBABILISTIC MODELS
OF WORD SENSE DISAMBIGUATION
A Dissertation Presented to the Graduate Faculty of the
School of Engineering and Applied Science
Southern Methodist University
in
Partial Fulfillment of the Requirements
for the degree of
Doctor of Philosophy
with a
Major in Computer Science
by
Ted Pedersen
(B.A., Drake University)
(M.S., University of Arkansas)
May 16, 1998
ACKNOWLEDGMENTS
I am indebted to Dr. Rebecca Bruce for sharing freely of her time, knowledge,
and insights throughout this research. Certainly none of this would have been possible
without her.
Dr. Weidong Chen, Dr. Frank Coyle, Dr. Maggie Dunham, Dr. Dan Moldovan,
and Dr. Mandyam Srinath have all made important contributions to this dissertation.
They are also among the main reasons why my time at SMU has been both happy
and productive.
I am also grateful to Dr. Janyce Wiebe, Lei Duan, Mehmet Kayaalp, Ken McK-
eever, and Tom O’Hara for many valuable comments and suggestions that influenced
the direction of this research.
This work was supported by the Office of Naval Research under grant number
N00014-95-1-0776.
Pedersen, Ted B.A., Drake University
M.S., University of Arkansas
Learning Probabilistic Models
of Word Sense Disambiguation
Advisor: Professor Dan Moldovan
Doctor of Philosophy degree conferred May 16, 1998
Dissertation completed May 16, 1998
Selecting the most appropriate sense for an ambiguous word is a common
problem in natural language processing. This dissertation pursues corpus–based ap-
proaches that learn probabilistic models of word sense disambiguation from large
amounts of text. These models consist of a parametric form and parameter esti-
mates. The parametric form characterizes the interactions among the contextual
features and the sense of the ambiguous word. Parameter estimates describe the
probability of observing different combinations of feature values. These models dis-
ambiguate by determining the most probable sense of an ambiguous word given the
context in which it occurs.
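The disambiguation rule just described can be sketched as a tiny Naive Bayes style classifier that picks the most probable sense given the context. The senses, features, and probability values below are invented for illustration; they are not parameter estimates from this dissertation.

```python
# Minimal sketch: disambiguation as choosing the most probable sense given
# the context. The prior and likelihood tables here are hypothetical.
import math

# p(sense) and p(feature | sense) for an ambiguous word such as "interest"
prior = {"money": 0.6, "attention": 0.4}
likelihood = {
    "money":     {"bank": 0.30, "rate": 0.40, "story": 0.05},
    "attention": {"bank": 0.05, "rate": 0.05, "story": 0.40},
}

def disambiguate(context):
    """Return the sense maximizing log p(s) + sum of log p(f | s)."""
    best, best_score = None, float("-inf")
    for sense, p_s in prior.items():
        score = math.log(p_s)
        for feature in context:
            # tiny floor for unseen features, standing in for smoothing
            score += math.log(likelihood[sense].get(feature, 1e-6))
        if score > best_score:
            best, best_score = sense, score
    return best

print(disambiguate(["bank", "rate"]))   # -> money
```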
This dissertation presents several enhancements to existing supervised methods
of learning probabilistic models of disambiguation from sense–tagged text. A new
search strategy, forward sequential, guides the selection process through the space
of possible models. Each model considered for selection is judged by a new class of
evaluation metric, the information criteria. The combination of forward sequential
search and Akaike’s Information Criteria is shown to consistently select highly ac-
curate models of disambiguation. The same search strategy and evaluation criterion
also serve as the basis of the Naive Mix, a new supervised learning algorithm that
is shown to be competitive with leading machine learning methodologies. In these
comparisons the Naive Bayesian classifier also fares well, which seems surprising since
it is based on a model where the parametric form is simply assumed. However, an
explanation for this success is presented in terms of learning rates and bias–variance
decompositions of classification error.
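The combination of forward sequential search and an information criterion can be sketched as a greedy loop that adds whichever candidate term most improves AIC (2k − 2 ln L) and stops when no addition helps. This is only an outline of the strategy: the scoring function below is a stand-in for the log-likelihood of a candidate decomposable model, and counting one parameter per selected term is a simplification.

```python
# Illustrative sketch of forward sequential search guided by AIC.
# aic = 2 * (number of parameters) - 2 * (log-likelihood).
def aic(log_likelihood, n_params):
    return 2 * n_params - 2 * log_likelihood

def forward_sequential_search(candidates, log_likelihood):
    """Greedily grow a model, at each step adding the candidate term
    that most lowers AIC; stop when no addition lowers it further."""
    model = []
    best_aic = aic(log_likelihood(model), len(model))
    while True:
        best_term = None
        for term in candidates:
            if term in model:
                continue
            trial = model + [term]
            trial_aic = aic(log_likelihood(trial), len(trial))
            if trial_aic < best_aic:
                best_aic, best_term = trial_aic, term
        if best_term is None:
            return model, best_aic
        model.append(best_term)

# Toy scoring function: each term contributes a fixed log-likelihood gain.
gains = {"a": 5.0, "b": 3.0, "c": 0.5}
model, score = forward_sequential_search(
    ["a", "b", "c"], lambda m: sum(gains[t] for t in m))
print(model)   # -> ['a', 'b']  ('c' does not pay for its extra parameter)
```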
Unfortunately, sense–tagged text only exists in small quantities and is expensive
to create. This substantially limits the portability of supervised learning approaches
to word sense disambiguation. This bottleneck is addressed by developing unsuper-
vised methods that learn probabilistic models from raw untagged text. However,
such text does not contain enough information to automatically select a parametric
form. Instead, one must simply be assumed. Given a form, the senses of ambiguous
words are treated as missing data and their values are imputed via the Expecta-
tion Maximization algorithm and Gibbs Sampling. Here the parametric form of the
Naive Bayesian classifier is employed. However, this methodology is appropriate for
any parametric form in the class of decomposable models. Several local–context,
frequency–based feature sets are also developed and shown to be appropriate for
unsupervised learning of word senses from raw untagged text.
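The EM side of this methodology can be sketched as follows for the Naive Bayes parametric form: the sense of each instance is treated as missing data, and EM alternates between imputing sense posteriors (E-step) and re-estimating p(S) and p(F|S) from expected counts (M-step). The contexts, smoothing constant, and random initialization below are invented for illustration; this is not the dissertation's implementation, and the Gibbs Sampling variant is not shown.

```python
# Minimal EM sketch for Naive Bayes with missing sense labels.
import math
import random

def em_naive_bayes(instances, num_senses, iters=20, seed=0):
    """instances: list of feature lists. Returns per-instance sense posteriors."""
    rng = random.Random(seed)
    features = sorted({f for inst in instances for f in inst})
    n = len(instances)
    # Random soft assignments to start; real systems use several restarts.
    post = []
    for _ in range(n):
        row = [rng.random() + 0.1 for _ in range(num_senses)]
        z = sum(row)
        post.append([x / z for x in row])
    for _ in range(iters):
        # M-step: expected-count estimates of p(S) and p(F | S), lightly smoothed
        prior = [sum(post[i][j] for i in range(n)) / n for j in range(num_senses)]
        cond = []
        for j in range(num_senses):
            counts = {f: 0.01 for f in features}
            for i, inst in enumerate(instances):
                for f in inst:
                    counts[f] += post[i][j]
            total = sum(counts.values())
            cond.append({f: c / total for f, c in counts.items()})
        # E-step: impute the missing senses as posteriors under current parameters
        for i, inst in enumerate(instances):
            logs = [math.log(prior[j]) + sum(math.log(cond[j][f]) for f in inst)
                    for j in range(num_senses)]
            m = max(logs)
            w = [math.exp(v - m) for v in logs]
            z = sum(w)
            post[i] = [x / z for x in w]
    return post

contexts = [["bank", "rate"], ["bank", "rate"], ["story", "film"], ["story", "film"]]
posteriors = em_naive_bayes(contexts, num_senses=2)
```

Since the E-step depends only on the current parameters and an instance's features, instances with identical contexts always receive identical posteriors.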
TABLE OF CONTENTS
ACKNOWLEDGMENTS .................................................... iii
LIST OF FIGURES.......................................................... x
LIST OF TABLES ........................................................... xiii
CHAPTER
1. INTRODUCTION ..................................................... 1
1.1. Word Sense Disambiguation ....................................... 2
1.2. Learning from Text ............................................... 3
1.2.1. Supervised Learning........................................ 5
1.2.2. Unsupervised Learning ..................................... 6
1.3. Basic Assumptions ................................................ 7
1.4. Chapter Summaries ............................................... 7
2. PROBABILISTIC MODELS ........................................... 10
2.1. Inferential Statistics ............................................... 10
2.1.1. Maximum Likelihood Estimation ........................... 11
2.1.2. Bayesian Estimation ....................................... 14
2.2. Decomposable Models ............................................. 15
2.2.1. Examples .................................................. 17
2.2.2. Decomposable Models as Classifiers......................... 22
3. SUPERVISED LEARNING FROM SENSE–TAGGED TEXT ........... 24
3.1. Sequential Model Selection ........................................ 25
3.1.1. Search Strategy ............................................ 26
3.1.2. Evaluation Criteria......................................... 29
3.1.2.1. Significance Testing ............................... 30
3.1.2.2. Information Criteria .............................. 33
3.1.3. Examples .................................................. 35
3.1.3.1. FSS AIC ......................................... 35
3.1.3.2. BSS AIC ......................................... 37
3.2. Naive Mix ........................................................ 39
3.3. Naive Bayes....................................................... 43
4. UNSUPERVISED LEARNING FROM RAW TEXT .................... 45
4.1. Probabilistic Models............................................... 46
4.1.1. EM Algorithm ............................................. 47
4.1.1.1. General Description............................... 47
4.1.1.2. Naive Bayes description ........................... 49
4.1.1.3. Naive Bayes example.............................. 51
4.1.2. Gibbs Sampling ............................................ 57
4.1.2.1. General Description............................... 58
4.1.2.2. Naive Bayes description ........................... 60
4.1.2.3. Naive Bayes example.............................. 63
4.2. Agglomerative Clustering.......................................... 70
4.2.1. Ward’s minimum–variance method ......................... 71
4.2.2. McQuitty’s similarity analysis .............................. 72
5. EXPERIMENTAL DATA .............................................. 74
5.1. Words ............................................................ 74
5.2. Feature Sets ...................................................... 75
5.2.1. Supervised Learning Feature Set............................ 75
5.2.2. Unsupervised Learning Feature Sets ........................ 80
5.2.3. Feature Sets and Event Distributions ....................... 83
6. SUPERVISED LEARNING EXPERIMENTAL RESULTS .............. 92
6.1. Experiment 1: Sequential Model Selection ......................... 92
6.1.1. Overall Accuracy........................................... 93
6.1.2. Model Complexity ......................................... 96
6.1.3. Model Selection as a Robust Process........................ 96
6.1.4. Model selection for Noun interest .......................... 99
6.2. Experiment 2: Naive Mix.......................................... 104
6.3. Experiment 3: Learning Rate...................................... 109
6.4. Experiment 4: Bias Variance Decomposition ....................... 113
7. UNSUPERVISED LEARNING EXPERIMENTAL RESULTS ........... 119
7.1. Assessing Accuracy in Unsupervised Learning...................... 120
7.2. Analysis 1: Probabilistic Models................................... 124
7.2.1. Methodological Comparison ................................ 127
7.2.2. Feature Set Comparison.................................... 130
7.3. Analysis 2: Agglomerative Clustering .............................. 135
7.3.1. Methodological Comparison ................................ 138
7.3.2. Feature Set Comparison.................................... 143
7.4. Analysis 3: Gibbs Sampling and McQuitty’s Similarity Analysis ... 145
8. RELATED WORK..................................................... 151
8.1. Semantic Networks ................................................ 152
8.2. Machine Readable Dictionaries .................................... 154
8.3. Parallel Translations .............................................. 155
8.4. Sense–Tagged Corpora ............................................ 157
8.5. Raw Untagged Corpora ........................................... 160
9. CONCLUSIONS ....................................................... 163
9.1. Supervised Learning............................................... 163
9.1.1. Contributions .............................................. 163
9.1.2. Future Work ............................................... 165
9.2. Unsupervised Learning ............................................ 168
9.2.1. Contributions .............................................. 169
9.2.2. Future Work ............................................... 170
REFERENCES .............................................................. 174
LIST OF FIGURES
Figure Page
2.1. Saturated Model (CVRTS) ........................................... 18
2.2. Decomposable Model (CSV)(RST) ................................... 19
2.3. Model of Independence (C)(V)(R)(T)(S) .............................. 21
2.4. Naive Bayes Model (CS)(RS)(TS)(VS) ............................... 22
4.1. E–Step Iteration 1 .................................................... 52
4.2. M–Step Iteration 1: pˆ(S), pˆ(F₁|S), pˆ(F₂|S) ............................ 53
4.3. E–Step Iteration 2 .................................................... 54
4.4. E–Step Iteration 2 .................................................... 55
4.5. M–Step Iteration 2: pˆ(S), pˆ(F₁|S), pˆ(F₂|S) ............................ 55
4.6. E–Step Iteration 3 .................................................... 56
4.7. E–Step Iteration 3 .................................................... 57
4.8. Stochastic E–Step Iteration 1.......................................... 64
4.9. Stochastic M–step Iteration 1: pˆ(S), pˆ(F₁|S), pˆ(F₂|S) .................. 65
4.10. E–Step Iteration 2 .................................................... 66
4.11. Stochastic E–Step Iteration 2.......................................... 67
4.12. Stochastic M–step Iteration 2: pˆ(S), pˆ(F₁|S), pˆ(F₂|S) .................. 68
4.13. Stochastic E–Step Iteration 3.......................................... 69
4.14. Stochastic E–Step Iteration 3.......................................... 69
4.15. Matrix of Feature Values, Dissimilarity Matrix......................... 71