Dec 20

On the natural language processing blog, Hal Daume III recently wrote an exciting post in which he argues that data mark-up is not natural. To capture the fuzzy notion of naturalness, he compares parallel French-English data with part-of-speech-tagged data. Parallel French-English data are natural, he says, because they “naturally” exist: one needs no special knowledge to understand what these data are or how to create and use them. POS-tagged data, by contrast, are unnatural and useless to anyone who is not a linguist or a developer.

The crux of the argument is that if something is not a task that anyone performs naturally, then it’s not a task worth computationalizing.

The argument goes on to say that there is no external evidence that a human translating a text performs special operations similar to, say, part-of-speech tagging. Mapping this idea onto the Ontos NLP, one could say that there is no evidence that a human piles up objects such as annotations attached to various fragments of the text, using a huge number of patterns similar to the JAPE patterns we use. Moreover, it is hard to believe that a human reads the same text scores of times, performing one specific operation on each pass: first reading the whole text to recognize morphological information, then to recognize the named entities, then to learn the relations between the recognized entities, and so on. Rather, a human somehow interprets the string, applying in turn whichever subsystems of the Language Faculty are required at the moment. Most likely, for example, the encyclopedia is not consulted constantly, but only when it is needed.
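To make the contrast concrete, here is a minimal Python sketch of the kind of multi-pass, annotation-piling processing described above. It is an illustration only: the names `Annotation`, `tokenize_pass`, `entity_pass`, `relation_pass` and `run_pipeline` are hypothetical, the rules are toy rules, and nothing here reflects the actual Ontos NLP code or JAPE syntax.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    """A typed span over the text, in the spirit of stand-off mark-up."""
    start: int
    end: int
    kind: str
    features: dict = field(default_factory=dict)


def tokenize_pass(text, annotations):
    # Pass 1: scan the whole text and record crude "morphological" facts.
    for m in re.finditer(r"\w+", text):
        capitalized = m.group()[0].isupper()
        annotations.append(
            Annotation(m.start(), m.end(), "Token", {"capitalized": capitalized})
        )


def entity_pass(text, annotations):
    # Pass 2: toy rule, assumed for illustration only: a capitalized token
    # that does not open the text is treated as a name.
    tokens = [a for a in annotations if a.kind == "Token"]
    for tok in tokens:
        if tok.features["capitalized"] and tok.start > 0:
            annotations.append(Annotation(tok.start, tok.end, "Name"))


def relation_pass(text, annotations):
    # Pass 3: link each pair of neighbouring names by the text between them.
    names = [a for a in annotations if a.kind == "Name"]
    for left, right in zip(names, names[1:]):
        between = text[left.end:right.start].strip()
        annotations.append(
            Annotation(left.start, right.end, "Relation", {"via": between})
        )


def run_pipeline(text):
    # Every stage revisits the same text and piles more annotations on top.
    annotations = []
    for stage in (tokenize_pass, entity_pass, relation_pass):
        stage(text, annotations)
    return annotations
```

Calling `run_pipeline("Yesterday Alice met Bob")` goes over the sentence three times and leaves behind token, name and relation annotations. That repeated re-reading is exactly the procedure that seems so implausible for a human reader.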

Still, I am not in a position to maintain that a natural language processor must be “natural” in all its aspects. Mark-up is not natural, but it is not such a bad idea. It was noticed long ago that computers are very dissimilar to the human cognitive system. Suppose someone implements, for instance, all the complex syntactic algorithms proposed in Generative grammar; if such a system is possible at all, it might be too complex to be usable. The peculiarities of computers sometimes force us to avoid “naturalness” in NLP, although this is not to say that NLPers should not strive for it where it is appropriate and possible.