Natural Language Processing, or NLP for short, is broadly defined as the automatic manipulation of natural language, like speech and text, by software. The study of natural language processing has been around for more than 50 years and grew out of the field of linguistics with the rise of computers.
Natural Language
Natural language refers to the way we, as humans, communicate with each other: namely, through speech and text.
We are surrounded by text.
Think about how much text you see each day:
- Signs
- Menus
- SMS
- Web Pages
- and so much more…
The list is endless. Now think about speech.
We may speak to each other, as a species, more than we write. It may even be easier to learn to speak than to write. Voice and text are how we communicate with each other. Given the importance of this type of data, we must have methods to understand and reason about natural language, just like we do for other types of data.
Challenge of Natural Language
Working with natural language data is not a solved problem. It has been studied for half a century, and it remains genuinely hard, primarily because natural language is messy: it is ambiguous, full of exceptions, and heavily dependent on context.
From Linguistics to Natural Language Processing
Linguistics
Linguistics is the scientific study of language, including its grammar, semantics, and phonetics. Classical linguistics involved devising and evaluating rules of language. Great progress was made on formal methods for syntax and semantics, but for the most part, the interesting problems in natural language understanding resist clean mathematical formalisms. Broadly, a linguist is anyone who studies language, but perhaps more colloquially, a self-described linguist may be more focused on being out in the field. Mathematics is the tool of science, and mathematicians working on natural language may refer to their study as mathematical linguistics, focusing exclusively on the use of discrete mathematical formalisms and theory for natural language (e.g., formal languages and automata theory).
Computational Linguistics
Computational linguistics is the modern study of linguistics using the tools of computer science. Yesterday's linguist may be today's computational linguist, as the use of computational tools and thinking has overtaken most fields of study. Large amounts of data and fast computers mean that new and different things can be discovered from large datasets of text by writing and running software.
Statistical Natural Language Processing
The statistical dominance of the field also often leads to NLP being described as Statistical Natural Language Processing, perhaps to distance it from classical computational linguistics methods. Linguistics is a large topic of study, and, although the statistical approach to NLP has shown great success in some areas, there is still room for, and great benefit from, the classical top-down methods.
Natural Language Processing
As machine learning practitioners interested in working with text data, we are concerned with the tools and methods from the field of Natural Language Processing.
Natural Language Processing (NLP) began in the 1950s as the intersection of artificial intelligence and linguistics. NLP was originally distinct from text information retrieval (IR), which employs highly scalable statistics-based techniques to index and search large volumes of text efficiently. With time, however, NLP and IR have converged somewhat. Currently, NLP borrows from several very diverse fields, requiring today's NLP researchers and developers to broaden their knowledge base significantly.
The Prolog language was originally invented in the early 1970s for NLP applications; its syntax is especially well suited to writing grammars.
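To give a feel for what such grammar rules look like, here is a minimal sketch in Python using NLTK's context-free grammar API (an assumed, illustrative toolkit choice; Prolog's definite clause grammars express the same kind of declarative rules natively, and the toy grammar below is not from the original text):

```python
import nltk

# A toy context-free grammar: the same kind of declarative rewrite rules
# that Prolog's definite clause grammars were designed to express.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'dog' | 'cat'
V -> 'chased' | 'saw'
""")

# Parse a sentence with a chart parser and print each parse tree found.
parser = nltk.ChartParser(grammar)
for tree in parser.parse("the dog chased a cat".split()):
    print(tree)
```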
Low-level NLP tasks include (the sketch after this list illustrates the first three):
- Sentence boundary detection
- Tokenization
- Part-of-speech assignment to individual words
- Morphological decomposition
- Shallow parsing
- Problem-specific segmentation
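To make the first few of these tasks concrete, here is a minimal sketch using NLTK in Python; the choice of NLTK is an assumption (any comparable toolkit would do), and the punkt and averaged_perceptron_tagger models must be downloaded first, as shown:

```python
import nltk

# One-time model downloads: the sentence splitter and the POS tagger.
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

text = "Natural language is messy. Tokenizers split it into sentences and words."

# Sentence boundary detection
sentences = nltk.sent_tokenize(text)

# Tokenization of each sentence into words and punctuation
tokens = [nltk.word_tokenize(s) for s in sentences]

# Part-of-speech assignment to individual words
tagged = [nltk.pos_tag(t) for t in tokens]

for sentence in tagged:
    print(sentence)  # e.g. [('Natural', 'JJ'), ('language', 'NN'), ...]
```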
Higher-level tasks build on low-level tasks and are usually problem-specific. They include:
- Spelling/grammatical error identification and recovery
- Named entity recognition (see the sketch after this list)
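Named entity recognition, for example, can be sketched in a few lines with spaCy; this tooling choice is an assumption, and the small English pipeline en_core_web_sm has to be installed separately:

```python
import spacy

# Load a small pretrained English pipeline (assumes:
#   pip install spacy && python -m spacy download en_core_web_sm).
nlp = spacy.load("en_core_web_sm")

doc = nlp("Google was founded by Larry Page and Sergey Brin in California.")

# Each recognized entity carries its text span and a type label.
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Google ORG", "Larry Page PERSON"
```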
Recent advances in artificial intelligence (e.g., computer chess) have shown that effective approaches exploit the strengths of electronic circuitry (high speed, large memory and disk capacity, problem-specific data-compression techniques and evaluation functions, and highly efficient search) rather than trying to mimic human neural function. Similarly, statistical NLP methods correspond only minimally to human thought processes.
By comparison with IR, we now consider what it may take for multi-purpose NLP technology to become mainstream. While always important to library science, IR achieved major prominence with the web, notably after Google’s scientific and financial success: the limelight also caused a corresponding IR research and toolset boom. The question is whether NLP has a similar breakthrough application in the wings.
Will NLP software become a commodity?
The post-Google interest in IR has led to IR commoditization: a proliferation of IR tools and incorporation of IR technology into relational database engines. Earlier, statistical packages and, subsequently, data mining tools also became commoditized. Commodity analytical software is characterized by:
- Availability of several tools within a package: the user can often set up a pipeline without programming, using a graphical metaphor.
- High user-friendliness and ease of learning: online documentation/tutorials are highly approachable for the non-specialist, focusing on when and how to use a particular tool rather than its underlying mathematical principles.
- High value in relation to price: some offerings may even be freeware.
By contrast, NLP toolkits are still oriented toward the advanced programmer, and commercial offerings are expensive. General-purpose NLP is possibly overdue for commoditization: if this happens, best-of-breed solutions are more likely to rise to the top. Again, analytics vendors are likely to lead the way, following in the footsteps of biomedical informatics researchers in devising innovative solutions to the challenge of processing complex biomedical language in the diverse settings where it is employed.