Oct 30

The single spot in our NLP where compositionality is fully abandoned is the Semantic Tagger. Because of the syntactic and semantic complexity of the structures we have to analyse, this module can’t extract the interpretation of a given construction (say, a finite sentence about the relation “be employee of”) step by step from the interpretation of its syntactic parts. This is not just because we don’t have a sufficiently powerful syntactic analyser: we simply don’t need one.

To extract a relation like “be employee of” we use a number of patterns relying on entities recognized previously. Look at (1).

(1) In the mid-1990 the former German citizen Heinz Schimmelbuch becomes CEO of Weissel company.

If the processor already knows that mid-1990 is a Date/Period, Heinz Schimmelbuch is a Person, CEO is a Job Title, Weissel company is an Organization, become is a verb of coming into being or something like that, and that there is no punctuation between these items, it has enough information to conclude that this sentence might be about an employment relation, as long as several additional conditions are satisfied. The first condition is the order of the items in the sentence. Sentence (2) might also be an employment-related sentence, but (3) is not.

(2) The former German citizen Heinz Schimmelbuch becomes CEO of Weissel company in the mid-1990.
(3) In the mid-1990 Heinz Schimmelbuch becomes American citizen and meets the CEO of Weissel company.

The second condition concerns some specific constraints on the key verb. E.g., the verb become has to be finite: with a non-finite verb, the probability that the sentence is irrelevant increases, cf. (4).

(4) In the mid-1990 the former German citizen Heinz Schimmelbuch dreamed about becoming CEO of Weissel company.

Then, taking these observations into account, we can safely add a pattern that uses just the recognized items like Person, Job Title, etc. and some punctuation markup (commas, full stops, etc.), “hiding” all the irrelevant parts of the sentence.

(5) … the mid-1990 … Heinz Schimmelbuch … become … CEO … Weissel company.

Now we can compose such a pattern.

({Date} | {StartPoint})? {Person}
{becomeVG.VOICE = “act”, becomeVG.MOOD = “ind”}
{JobTitle} {Organization}

Oh yep, this pattern is indeed used in our system.
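To make the idea concrete, here is a minimal Python sketch of how such a pattern could be matched against a sentence that has already been reduced to its recognized items, as in (5). It is not our actual rule engine: the `Item` class, the `PATTERN` encoding, the function names, the `dreamVG` and `meetVG` item types and the `"ger"` mood value are all assumptions made for illustration. The matcher requires the items to be adjacent in the reduced sequence, which is one simple way to model the ordering condition.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    """A recognized item: entity or verb group (hypothetical representation)."""
    kind: str
    text: str
    feats: dict = field(default_factory=dict)


# Each step: (accepted kinds, required features, is the step optional?)
PATTERN = [
    ({"Date", "StartPoint"}, {}, True),
    ({"Person"}, {}, False),
    ({"becomeVG"}, {"VOICE": "act", "MOOD": "ind"}, False),
    ({"JobTitle"}, {}, False),
    ({"Organization"}, {}, False),
]


def step_matches(item, kinds, feats):
    """True if the item has an accepted kind and all required feature values."""
    return item.kind in kinds and all(item.feats.get(k) == v for k, v in feats.items())


def match_at(items, start):
    """Try to match PATTERN on adjacent items beginning at index `start`."""
    pos, matched = start, []
    for kinds, feats, optional in PATTERN:
        if pos < len(items) and step_matches(items[pos], kinds, feats):
            matched.append(items[pos])
            pos += 1
        elif not optional:
            return None
    return matched


def find_employment(items):
    """Return the first match of PATTERN anywhere in the sequence, or None."""
    for start in range(len(items)):
        matched = match_at(items, start)
        if matched:
            return matched
    return None


finite = {"VOICE": "act", "MOOD": "ind"}
date = Item("Date", "the mid-1990")
person = Item("Person", "Heinz Schimmelbuch")
job = Item("JobTitle", "CEO")
org = Item("Organization", "Weissel company")

s1 = [date, person, Item("becomeVG", "becomes", finite), job, org]
s2 = [person, Item("becomeVG", "becomes", finite), job, org, date]
s3 = [date, person, Item("becomeVG", "becomes", finite),
      Item("meetVG", "meets", finite), job, org]
s4 = [date, person, Item("dreamVG", "dreamed", finite),
      Item("becomeVG", "becoming", {"VOICE": "act", "MOOD": "ger"}), job, org]

for name, sentence in [("(1)", s1), ("(2)", s2), ("(3)", s3), ("(4)", s4)]:
    print(name, "matches" if find_employment(sentence) else "no match")
```

Under these assumptions the reduced forms of (1) and (2) match, while (3) and (4) do not: in (3) a second verb group separates become from the Job Title, and in (4) the become verb group is not finite.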

Oct 24

I came across an exciting paper by Karin Verspoor, George Papcun and Kari Sentz from LANL. The authors suggest a motivation for the so-called shallow approach to Information Extraction.

Indeed, an ongoing discussion between ‘deepers’ (who argue for a deep and presumably full semantic and syntactic analysis) and ‘shallowers’ (who argue for a shallow analysis) has revealed that “[s]hallower approaches are more robust to the linguistic variance of free text” and “they are much faster”, while “[d]eeper approaches … are in principle more domain-neutral because they embody general linguistic principles” but are obviously much more expensive in both throughput and development time. I have a particular interest in this issue because at Ontos we use a kind of shallow approach and do not attempt a full syntactic and semantic analysis. Our rules and patterns primarily focus on recognizing the objects and relations represented in the domain ontology.

The motivation given by Verspoor et al. (2003) relies on the hypothesis of Construction Grammar (CG) that constructions are the essential linguistic units stored in the human mind. The basic notion of a construction is defined in the following way:

C is a construction iff C is a form-meaning pair <Fi, Sj>, such that some aspect of Fi (form) or some aspect of Sj (semantics) is not strictly predicted from C’s component parts or from other previously established constructions.

The main argument of constructionists is that natural language contains a wide range of non-compositional expressions (e.g. idioms), which have to be stored in the lexicon. As far as I understand, the radical constructionist view is that any linguistic structure (w.r.t. its syntax and semantics) is a construction. Although the authors note that CG does not entirely reject the principle of compositionality adopted in formal semantics, this principle does not in fact play an important role in CG. Furthermore, the authors suggest that gazetteer entries and the expressions captured by syntactic patterns for named entities in a shallow NLP system are constructions in the sense of CG.

I would not be so enthusiastic about the issue of (non)compositionality in the field of IE. Presumably, the use of gazetteers has next to nothing to do with the (non)compositionality of their entries. In fact, most entries of key-gazetteers (those providing context information) are splendidly compositional (cf. international company, regional hospital), and patterns that use only the output of the Morphology Component (i.e. that specify just the grammatical features and the order of their units) also produce compositional phrases.

But still, there is a special subfield in our NLP where patterns (not gazetteers) do involve a kind of non-compositional phenomenon. I’ll write about it in the next post.

Oct 17

A crucial problem for an information extraction system is the disambiguation of the recognized named entities. Suppose you’ve got a vocabulary of phrases and key words for the entity in question, and a grammar containing various structural templates that correspond to different syntactic patterns of the entity and rely on the words and phrases from the vocabulary. This model roughly corresponds to the NLP system used at Ontos. The problem that immediately arises is that these templates are applied independently, so entities generated by different templates may intersect.
Take the string (1) as an input for such a model.

(1) President of Kimberly Clark Corporation Steven R. Kalmanson

When we apply the whole variety of templates from the grammar, we get a pile of annotations that nest inside and intersect with each other.

This is certainly not what we’d like to get as output. Rather, we’d like to see a clean, flat structure of annotations without any intersections. Moreover, some of these annotations should remain in the new structure, while all the others are ones we wouldn’t like to see ever again. So how do we rule out the incorrect annotations?

The solution developed in our team is the following. Recall that each of these annotations arises from a specific template, and every template is motivated by some heuristic. Some heuristics are reliable in a wide range of cases, as they usually give a correct output, and some are not: we can pin our hopes on them only if we don’t have a stronger one at hand. Consequently we can rank our heuristics by assigning different weights to the templates used in our grammar. The more trustworthy a template is, the higher the rank it gets.

Besides, for each type of conflict we will use a resolving rule formulated with respect to the weights of the annotations in conflict. These rules will constitute a special module called Minimization. After Minimization is applied, we get the desired output.

So that’s how it works. Although sometimes we get an incorrect analysis (when the ranking of specific templates occasionally goes wrong or when the correct hypothesis is absent), we never get intersections.
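As a rough illustration of the idea, the following Python sketch resolves conflicts with a single greedy rule: the annotation with the higher weight wins, and any annotation that overlaps an already accepted one is dropped. This is a simplification and not our Minimization module, which uses a separate rule for each type of conflict. The `Annotation` type, the token offsets and all the weights are invented for the example on string (1).

```python
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int      # index of the first token
    end: int        # index after the last token (exclusive)
    kind: str
    weight: float   # rank of the template that produced the annotation


def overlaps(a, b):
    """True if the two token spans share at least one token."""
    return a.start < b.end and b.start < a.end


def minimize(annotations):
    """Keep the highest-weighted annotations that do not overlap each other."""
    # Higher weight first; on ties prefer the longer span, then the earlier one.
    ranked = sorted(annotations, key=lambda a: (-a.weight, -(a.end - a.start), a.start))
    kept = []
    for candidate in ranked:
        if not any(overlaps(candidate, k) for k in kept):
            kept.append(candidate)
    return sorted(kept, key=lambda a: a.start)


# Tokens: President(0) of(1) Kimberly(2) Clark(3) Corporation(4) Steven(5) R.(6) Kalmanson(7)
candidates = [
    Annotation(0, 1, "JobTitle", 0.8),
    Annotation(2, 4, "Person", 0.4),        # "Kimberly Clark" read as a person name
    Annotation(2, 5, "Organization", 0.9),  # "Kimberly Clark Corporation"
    Annotation(5, 8, "Person", 0.9),        # "Steven R. Kalmanson"
    Annotation(7, 8, "Person", 0.3),        # surname only
]

for annotation in minimize(candidates):
    print(annotation)
```

With these made-up weights the output keeps the Job Title, the Organization and the full Person name, and by construction no two of the remaining annotations intersect.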

Oct 11

Today we are all flooded with news that we receive by email or RSS feeds. We all have the same problem: filtering the important items from the less important ones and identifying which of them are linked. The internet community has introduced several methods for aggregating news, and it is difficult to say which of them is best. One method that doesn’t seem to be mentioned is automatic aggregation using semantic technologies, especially NLP.

Automatic aggregation means subscribing to different sources and then analyzing the content with natural language processing (NLP) technologies. This way the system can extract objects of interest (e.g. persons, organizations, locations) and the relations between them. After the information extraction step the system has to merge mentions of the same object: Hillary Clinton is the same object as she, Hillary or Clinton, assuming the NLP can work out from the text what each mention refers to. This approach is driven and controlled by a clear knowledge model (an ontology), which allows us to grow a semantic network of information. Over time, the more object – relation – object information we collect, the better we can rank information by its semantic meaning, which gives a much better view.

This is different from the approaches introduced by Mixx or Digg, where people vote and thereby influence the ranking of the news. Another factor that seems to be important for a successful start is up-to-date information, as seen when Tailrank started.
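The merging step can be pictured with a deliberately naive Python sketch. It groups name mentions by token containment only, so Hillary and Clinton join Hillary Clinton. It does not resolve pronouns such as she, and it would wrongly attach a bare Clinton to whichever Clinton it has seen first. The function name and the heuristic are assumptions for illustration, not a description of our system.

```python
def merge_mentions(mentions):
    """Group name mentions into objects.

    A mention joins the first object whose longest name contains all of
    its tokens; otherwise it starts a new object.
    """
    objects = []  # each object is a list of mentions, longest name first
    # Process longer names first so that short mentions can attach to them.
    for mention in sorted(mentions, key=lambda m: -len(m.split())):
        tokens = set(mention.lower().split())
        for obj in objects:
            if tokens <= set(obj[0].lower().split()):
                obj.append(mention)
                break
        else:
            objects.append([mention])
    return objects


print(merge_mentions(["Hillary", "Hillary Clinton", "Clinton", "Ontos"]))
```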
We’ll keep working on our approach and watch the other methods of aggregation and ranking. It would be interesting to hear about others’ experiences in trying to understand how information is linked together. Can the problem be solved by filters, by votes or by bookmarks?

Oct 04

Welcome to the Ontos blog. We will use it to inform you about new trends and features concerning our technologies, especially in the area of NLP (Natural Language Processing) and the aggregation and merging of news and information sources. We will share ideas and concepts on how to use information and enrich it semantically, for example in our new semantic portal.