Inclusiveness

We believe that every person in the world should benefit from the same technological advances, irrespective of her location and, especially, of the language she speaks. Unfortunately, this belief is put to the test when it comes to natural language processing technologies. These technologies rely on language resources (lexicons, grammars, analyzers, etc.), and it is well known that the resources available for, say, English are far larger than those available for French, not to mention languages such as Swahili, for which almost nothing is available. With the advent of the supervised machine learning paradigm, the gap widened further: by its nature, this paradigm requires huge amounts of annotated data for each task, i.e. text in a given language that has been manually annotated for that specific task. For instance, if you want to build a system for classifying movie reviews in English, annotated resources are readily available on the web. The same task for French entails the expensive step of manually annotating thousands of reviews with respect to, for instance, their polarity.
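To make the annotation requirement concrete, here is a minimal, purely illustrative sketch (not our actual pipeline) of how such a supervised polarity classifier is typically trained; the tiny inline dataset and the scikit-learn components are assumptions on our part, standing in for the thousands of manually annotated reviews the paradigm actually needs.

```python
# Minimal, illustrative sketch (not our pipeline): a supervised polarity
# classifier needs a manually annotated corpus. The tiny inline dataset
# below is a stand-in for the thousands of labelled reviews actually needed.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = [
    ("A wonderful, moving film with superb acting.", "pos"),
    ("Dull plot and terrible pacing, a waste of time.", "neg"),
    ("I loved every minute of it.", "pos"),
    ("Boring and predictable from start to finish.", "neg"),
]
texts, labels = zip(*reviews)

# Standard supervised pipeline: bag-of-words features + linear classifier.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["An absolutely delightful movie."]))
```

For a language without such annotated resources, the `reviews` list above simply does not exist, and building it by hand is where the cost lies.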

In business terms, this means economic barriers and reduced competitiveness for European companies, which typically operate in a multilingual market.

While we cannot create resources for every language we would like to work with, our research is intrinsically multilingual and not focused only on better-resourced languages. Moreover, our research track on Silver Standards concerns methodologies and algorithms for making the automatic creation of annotated datasets increasingly accurate and labor-efficient (a silver standard is a dataset whose annotation is derived from external information rather than produced manually; manual annotation may still be applied to validate part of the dataset), as sketched below.
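As a purely illustrative sketch of the silver-standard idea, the snippet below derives polarity labels from external information (here, a star rating attached to each review; the field name and thresholds are our own assumptions, not a fixed recipe) instead of manual annotation, and flags a small subset for manual validation.

```python
# Illustrative sketch of silver-standard creation (the star-rating field and
# the thresholds below are assumptions, not a fixed recipe): labels are
# derived from external information instead of manual annotation.
import random
from typing import Optional

def silver_label(stars: int) -> Optional[str]:
    """Map an externally available star rating to a polarity label;
    ambiguous ratings yield no label at all."""
    if stars >= 4:
        return "pos"
    if stars <= 2:
        return "neg"
    return None  # 3-star reviews are too ambiguous to label automatically

# Toy French reviews with the kind of metadata a review site already provides.
raw = [
    {"text": "Superbe film, à voir absolument.", "stars": 5},
    {"text": "Très décevant, je me suis ennuyé.", "stars": 1},
    {"text": "Pas mal, sans plus.", "stars": 3},
]

silver = [
    {"text": r["text"], "label": silver_label(r["stars"])}
    for r in raw
    if silver_label(r["stars"]) is not None
]

# Optionally flag a small random subset of the silver data for manual validation.
to_validate = random.sample(silver, k=max(1, len(silver) // 10))
print(silver)
print(to_validate)
```

The derived labels are inevitably noisier than manual ones, which is precisely why part of the resulting dataset may be manually validated, as noted above.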