A crucial problem for an information extraction system is disambiguating the named entities it recognizes. Suppose you’ve got a vocabulary of phrases and key words for the entity type in question, and a grammar of structural templates, each corresponding to a different syntactic pattern of the entity and relying on the words and phrases from the vocabulary. This model roughly corresponds to the NLP system we use at Ontos. The problem that immediately arises is that the templates are applied independently of each other, so entities generated by different templates may overlap.
Take string (1) as input to such a model.
(1) President of Kimberly Clark Corporation Steven R. Kalmanson
When we apply the whole variety of templates from the grammar, we end up with a pile of annotations, some nested inside others and some partially overlapping.
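To make the overlap concrete, here is a minimal Python sketch of what such a pile might look like for string (1). Everything in it is an assumption made for illustration: the `Annotation` class, the template names, and the particular spans are hypothetical and are not taken from the actual Ontos grammar. Spans are token offsets with an exclusive end.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Annotation:
    start: int      # index of the first token covered
    end: int        # index one past the last token covered
    label: str      # entity type proposed by the template
    template: str   # hypothetical name of the template that fired


def overlaps(a: Annotation, b: Annotation) -> bool:
    """True if the two spans share at least one token (nesting included)."""
    return a.start < b.end and b.start < a.end


def conflicting_pairs(anns):
    """Return every pair of annotations whose spans overlap."""
    return [(a, b) for a, b in combinations(anns, 2) if overlaps(a, b)]


TOKENS = "President of Kimberly Clark Corporation Steven R. Kalmanson".split()

# Hypothetical output of four independently applied templates.
annotations = [
    Annotation(2, 5, "Organization", "name_plus_corp_keyword"),
    Annotation(2, 4, "Person", "first_name_last_name"),
    Annotation(5, 8, "Person", "first_name_initial_last_name"),
    Annotation(6, 8, "Person", "initial_last_name"),
]

for a, b in conflicting_pairs(annotations):
    print(" ".join(TOKENS[a.start:a.end]), "<->", " ".join(TOKENS[b.start:b.end]))
```

In this toy setup two pairs are in conflict: "Kimberly Clark" read as a person sits inside the organization "Kimberly Clark Corporation", and "R. Kalmanson" sits inside "Steven R. Kalmanson".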
This is certainly not the picture we’d like to get as output. Rather, we’d like to see a clean, flat structure of annotations without any overlaps. Moreover, some of the annotations above should remain in the new structure, while all the others are ones we’d rather never see again in our lives. So how do we rule out the incorrect annotations?
The solution developed by our team is the following. Recall that each annotation arises from a specific template, and every template is motivated by some heuristic. Some heuristics are reliable in a wide range of cases, as they usually give correct output, and some are not: we can pin our hopes on them only if we don’t have a stronger one at hand. Consequently, we can rank our heuristics by assigning different weights to the templates used in our grammar. The more trustworthy a template is, the higher the rank it gets.
In addition, for each type of conflict we use a resolution rule formulated in terms of the weights of the conflicting annotations. These rules make up a special module called Minimization. After Minimization is applied, we get the desired output.
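The idea of weight-based conflict resolution can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual Minimization module: the post does not spell out the individual rules, so the sketch collapses them into a single hypothetical rule (the annotation from the higher-weighted template wins, with ties going to the longer span). The template names and weights are invented, and spans are token offsets in string (1) with an exclusive end.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int      # index of the first token covered
    end: int        # index one past the last token covered
    label: str      # entity type proposed by the template
    template: str   # hypothetical name of the template that fired


# Hypothetical weights: the more trustworthy the template, the higher the value.
TEMPLATE_WEIGHTS = {
    "name_plus_corp_keyword": 0.9,
    "first_name_initial_last_name": 0.8,
    "first_name_last_name": 0.5,
    "initial_last_name": 0.4,
}


def overlaps(a: Annotation, b: Annotation) -> bool:
    """True if the two spans share at least one token."""
    return a.start < b.end and b.start < a.end


def minimize(annotations, weights):
    """Greedy resolution: visit annotations from the most to the least trusted
    and keep one only if it does not overlap anything already kept."""
    ranked = sorted(
        annotations,
        key=lambda a: (-weights[a.template], -(a.end - a.start), a.start),
    )
    kept = []
    for candidate in ranked:
        if not any(overlaps(candidate, k) for k in kept):
            kept.append(candidate)
    return sorted(kept, key=lambda a: a.start)


# Tokens: President(0) of(1) Kimberly(2) Clark(3) Corporation(4) Steven(5) R.(6) Kalmanson(7)
annotations = [
    Annotation(2, 5, "Organization", "name_plus_corp_keyword"),
    Annotation(2, 4, "Person", "first_name_last_name"),
    Annotation(5, 8, "Person", "first_name_initial_last_name"),
    Annotation(6, 8, "Person", "initial_last_name"),
]

result = minimize(annotations, TEMPLATE_WEIGHTS)
for a in result:
    print(a.label, a.start, a.end)
```

With these made-up weights, only the organization over tokens 2 to 5 and the person over tokens 5 to 8 survive. Because an annotation is kept only when it is disjoint from everything already kept, the output cannot contain overlaps, whatever the weights are; bad weights can only lead to the wrong annotation being kept.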
So that’s how it works. We do sometimes get an incorrect analysis (when the ranking of specific templates goes wrong, or when the correct hypothesis is not among the candidates at all), but we never get overlapping annotations.
