WorldCat Identities

Allauzen, Alexandre

Overview
Works: 29 works in 29 publications in 2 languages and 53 library holdings
Roles: Degree supervisor, Opponent, Other, Author, Thesis advisor, Contributor
Publication Timeline
Most widely held works by Alexandre Allauzen
Generative Training and Smoothing of Hierarchical Phrase-Based Translation Models by Stephan Peitz( )

1 edition published in 2017 in English and held by 15 WorldCat member libraries worldwide

ListNet-based MT Rescoring by Jan Niehues( )

1 edition published in 2015 in English and held by 2 WorldCat member libraries worldwide

LIMSI@WMT'16: Machine Translation of News by Alexandre Allauzen( )

1 edition published in 2016 in English and held by 2 WorldCat member libraries worldwide

The QT21/HimL Combined Machine Translation System by Jan-Thorsten Peter( )

1 edition published in 2016 in English and held by 2 WorldCat member libraries worldwide

A comparison of discriminative training criteria for continuous space translation models by Alexandre Allauzen( )

1 edition published in 2017 in English and held by 2 WorldCat member libraries worldwide

The KIT-LIMSI Translation System for WMT 2015 by Thanh-Le Ha( )

1 edition published in 2015 in English and held by 2 WorldCat member libraries worldwide

The Karlsruhe Institute of Technology Systems for the News Translation Task in WMT 2016 by Thanh-Le Ha( )

1 edition published in 2016 in English and held by 2 WorldCat member libraries worldwide

From lexical towards contextualized meaning representation by Diana-Nicoleta Popa( )

1 edition published in 2019 in English and held by 2 WorldCat member libraries worldwide

Word representations underlie most modern natural language processing systems and yield competitive results. However, important questions arise about the challenges they face when confronted with complex natural language phenomena, and about their ability to capture the variability of natural language. To better handle complex linguistic phenomena, much work has been devoted to refining generic word representations or to building specialized ones. While this can help distinguish semantic similarity from other kinds of semantic relations, it may not be enough to model certain types of relations, such as the logical relations of entailment and contradiction. The first part of the thesis studies how to encode the notion of textual entailment in a vector space by enforcing information inclusion. Entailment operators are then developed, and the proposed framework can be used to reinterpret an existing model of distributional semantics. Evaluations are provided on hyponymy detection as an instance of lexical entailment.

Another challenge concerns the variability of natural language and the need to disambiguate lexical units according to the context in which they appear. Generic word representations do not succeed on their own; different architectures are typically used to help with disambiguation. Since word representations are built from, and reflect, co-occurrence statistics over large corpora, they provide a single representation for a given word despite its multiple meanings. Even for monosemous words, this does not distinguish the different uses of a word according to its context. In this respect, one may ask whether it is possible to directly exploit the linguistic information provided by a word's context to adjust its representation. Would this information be useful to build an enriched representation of the word in its context? And if so, can syntactic information help in the process, or is the local context sufficient? One can thus examine whether generic word representations, and the way they are combined, suffice to build more precise representations. In the second part of the thesis, we study a way to incorporate contextual knowledge into the word representations themselves, exploiting information from the dependency parse of the sentence as well as from the local neighbourhood. We propose syntax-aware contextualized token embeddings (SATokE) that capture specific linguistic information and encode the structure of the sentence in their representations. This makes it possible to move from generic (context-invariant) representations to specific (context-aware) ones. While syntax had previously been considered for word representations, its benefits may not have been fully assessed beyond models that exploit such information from large corpora. The resulting representations are evaluated on natural language understanding tasks: sentiment classification, paraphrase detection, textual entailment and discourse analysis. We empirically demonstrate the superiority of these representations over existing generic and contextualized word representations. The work proposed in this thesis contributes to research on the modelling of complex phenomena such as textual entailment, as well as language variability, through the proposal of contextualized representations.
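
The first part of the abstract describes encoding entailment in a vector space by enforcing information inclusion, evaluated on hyponymy detection. As a rough illustration of the inclusion idea only (not the entailment operators actually developed in the thesis), the sketch below scores a hyponym/hypernym pair by how much of the hyponym's feature mass is covered by the hypernym's; the vectors and the `inclusion_score` function are made up for the example.

```python
# Illustrative sketch only: a simple feature-inclusion score for lexical
# entailment (hyponymy detection), in the spirit of distributional inclusion.
# These are NOT the entailment operators developed in the thesis.
import numpy as np

def inclusion_score(hypo: np.ndarray, hyper: np.ndarray) -> float:
    """Fraction of the hyponym's (non-negative) feature mass that is
    covered by the hypernym's features; 1.0 means full inclusion."""
    hypo = np.maximum(hypo, 0.0)
    hyper = np.maximum(hyper, 0.0)
    covered = np.minimum(hypo, hyper).sum()
    total = hypo.sum()
    return float(covered / total) if total > 0 else 0.0

# Toy count-like vectors over 5 hypothetical context features.
cat = np.array([3.0, 2.0, 0.0, 1.0, 0.0])      # "cat"
animal = np.array([4.0, 3.0, 2.0, 2.0, 1.0])   # "animal" (broader contexts)

print(inclusion_score(cat, animal))   # high score: "cat" entails "animal"
print(inclusion_score(animal, cat))   # lower score: reverse direction
```
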
A Discriminative Training Procedure for Continuous Translation Models by Quoc-Khanh Do( )

1 edition published in 2015 in English and held by 2 WorldCat member libraries worldwide

LIMSI@WMT'15: Translation Task by Benjamin Marie( )

1 edition published in 2015 in English and held by 2 WorldCat member libraries worldwide

Reordering space design in statistical machine translation by Nicolas Pécheux( )

1 edition published in 2016 in English and held by 2 WorldCat member libraries worldwide

Neural language models : Dealing with large vocabularies by Matthieu Labeau( )

1 edition published in 2018 in English and held by 1 WorldCat member library worldwide

The work presented in this thesis explores practical methods for easing the training and improving the performance of language models with very large vocabularies. The main limitation of neural language models is their computational cost, which grows linearly with the size of the vocabulary. The simplest way to reduce the computation time of these models is to limit the vocabulary size, which is far from satisfactory for many tasks. Most existing methods for training large-vocabulary models avoid computing the partition function, which is used to force the output distribution of the model to be normalized into a probability distribution. Here, we focus on sampling-based methods, including importance sampling and noise contrastive estimation, which make it easy to compute an approximation of this partition function. An examination of the mechanisms of noise contrastive estimation leads us to propose solutions that considerably ease training, which we show experimentally. We then use the generalization of a family of sampling-based objectives as Bregman divergences to experiment with new objective functions. Finally, we exploit the information carried by sub-word units to enrich the output representations of the model. We experiment with different architectures on Czech and show that character-based representations improve the results, all the more so when the use of word representations is jointly reduced.
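
As a rough illustration of the sampling idea described above (not the thesis implementation), the sketch below estimates the softmax partition function of a toy neural language model by importance sampling with a uniform proposal, so that only a few hundred output rows are touched instead of the full vocabulary; all sizes and parameters are hypothetical.

```python
# Minimal sketch: estimating the softmax partition function by importance
# sampling instead of summing over the full vocabulary. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 50_000, 64, 200          # vocabulary size, hidden size, sample count

W = rng.normal(scale=0.01, size=(V, d))   # output word embeddings
h = rng.normal(size=d)                    # hidden state for the current context
target = 1234                             # index of the next word

# Exact partition function: O(V) dot products (the expensive part).
logits = W @ h
Z_exact = np.exp(logits).sum()

# Importance-sampling estimate with a uniform proposal q(w) = 1/V:
# Z ~= (1/k) * sum_i exp(s(w_i)) / q(w_i),  w_i drawn from q.
samples = rng.integers(0, V, size=k)
Z_est = np.mean(np.exp(W[samples] @ h) * V)

nll_exact = -(logits[target] - np.log(Z_exact))
nll_est = -(W[target] @ h - np.log(Z_est))
print(nll_exact, nll_est)   # the estimate only touches k + 1 output rows
```
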
Détection de nouveauté au plus tôt dans des flux de données textuelles by Clément Christophe( )

1 edition published in 2021 in French and held by 1 WorldCat member library worldwide

The work presented in this thesis, carried out in collaboration with Électricité de France (EDF), aims to develop novelty detection models for textual data streams. For EDF, this is part of an approach to anticipate customer needs. We present the different novelty detection approaches that exist in the literature, which allows us to precisely define the tasks we want to solve. These definitions allow us to set up evaluation methods based either on simulated data or on real data; modifying real data allows us to simulate novelty arrival scenarios and therefore to measure the performance of existing methods. We present two models for detecting new elements: the first uses probabilistic topic models, and the second is CEND, an algorithm based on the movements of words in high-dimensional representation spaces. This type of model allows us to distinguish words linked to abrupt events from those linked to slowly emerging themes. We also present a model for monitoring the dynamics of a classification scheme: by combining time series forecasting and sequential analysis methods, we estimate when the dynamics of a signal change. We test these methods on public press data and on an EDF industrial dataset.
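
The description of CEND above is high level; as a heavily hedged illustration of the general idea of tracking word movements in an embedding space (not the CEND algorithm itself), the sketch below flags the words whose vectors move most between two time slices, assuming the slices already live in a comparable space.

```python
# Hedged illustration of tracking word movements between embedding snapshots.
# NOT the actual CEND algorithm; alignment of the two spaces is assumed.
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def top_movers(emb_t0: dict, emb_t1: dict, n: int = 5):
    """Rank shared words by how far their vectors moved between slices."""
    shared = emb_t0.keys() & emb_t1.keys()
    moves = {w: cosine_distance(emb_t0[w], emb_t1[w]) for w in shared}
    return sorted(moves.items(), key=lambda kv: kv[1], reverse=True)[:n]

# Toy example with hypothetical 3-dimensional embeddings.
rng = np.random.default_rng(1)
vocab = ["compteur", "facture", "panne", "tarif"]
emb_t0 = {w: rng.normal(size=3) for w in vocab}
emb_t1 = {w: v + rng.normal(scale=0.05, size=3) for w, v in emb_t0.items()}
emb_t1["panne"] = rng.normal(size=3)   # simulate a word whose usage shifted

print(top_movers(emb_t0, emb_t1))      # "panne" should rank near the top
```
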
Modélisation linguistique pour l'indexation automatique de documents audiovisuels by Alexandre Allauzen( Book )

1 edition published in 2003 in French and held by 1 WorldCat member library worldwide

Most of today's methods for indexing broadcast audio data are manual. In France, the National Audiovisual Institute (INA) is in charge of more than 50,000 hours of newly broadcast audiovisual data per year and over one million hours of archive data. The introduction of automatic tools into the indexing process must be designed to fit the specificities of these real needs. The state of the art in automatic indexing of audiovisual documents combines an automatic speech recognition (ASR) system with an information retrieval engine. The automatic transcription of the audio track is therefore the first point of access to the audiovisual content, and transcription errors determine its relevance. One source of errors is the gap between the models used by the ASR system and the variability of the audiovisual data. More precisely, the lexical and linguistic content of automatic transcriptions is conditioned by the vocabulary and the language model (LM) of the system. The purpose of this thesis is to investigate methods for vocabulary and LM adaptation of an ASR system for indexing purposes. Two kinds of audiovisual documents are considered: archive material and daily broadcasts. Finding a sufficient amount of appropriate electronic text contemporary with the task is one of the biggest challenges. The first solution proposed in this thesis is to build an open-vocabulary LM using lexical back-off. Interactive and automatic experiments are performed on a corpus of broadcast news shows. The second solution uses the web to create corpora that are contemporary with the document. Two experiments are performed. The first uses the ECHO corpus, which contains archive documents dating from the forties to the nineties, and highlights the discrepancy between the epochs of the training data and of the documents. In the second experiment, algorithms are investigated to adapt the standard vocabulary and LM on a daily basis. Different corpus configurations show the impact of selecting adaptation data.
Modèle joint pour le traitement automatique de la langue : perspectives au travers des réseaux de neurones by Jérémie Tafforeau( )

1 edition published in 2017 in French and held by 1 WorldCat member library worldwide

NLP researchers have identified different levels of linguistic analysis. This leads to a hierarchical division of the various tasks performed in order to analyze a text. The traditional approach relies on task-specific models that are subsequently arranged in cascade within processing chains (pipelines). This approach has a number of limitations: the empirical selection of model features, the accumulation of errors along the pipeline, and the lack of robustness to domain changes. These limitations lead to particularly high performance losses in the case of non-canonical language with limited available data, such as transcriptions of telephone conversations. Disfluencies and speech-specific syntactic patterns, as well as transcription errors from automatic speech recognition systems, cause a significant drop in performance. It is therefore necessary to develop robust and flexible systems. We intend to perform syntactic and semantic analysis with a deep multitask neural network model while taking into account the variations of domain and/or language register within the data.
Modèles exponentiels et contraintes sur les espaces de recherche en traduction automatique et pour le transfert cross-lingue by Nicolas Pécheux( )

1 edition published in 2016 in French and held by 1 WorldCat member library worldwide

Most natural language processing tasks are modeled as prediction problems where one aims at finding the best-scoring hypothesis from a very large pool of possible outputs. Even if algorithms are designed to leverage some kind of structure, the output space is often too large to be searched exhaustively. This work aims at understanding the importance of the search space and the possible use of constraints to reduce its size and complexity. We report in this thesis three case studies which highlight the risks and benefits of manipulating the search space in learning and inference. When information about the possible outputs of a sequence labeling task is available, it may seem appropriate to incorporate this knowledge into the system, so as to facilitate and speed up learning and inference. A case study on type constraints for CRFs, however, shows that using such constraints at training time is likely to drastically reduce performance, even when these constraints are both correct and useful at decoding time. On the other hand, we also consider possible relaxations of the supervision space, as in the case of learning with latent variables, or when only partial supervision is available, which we cast as ambiguous learning. Such weakly supervised methods, together with cross-lingual transfer and dictionary crawling techniques, allow us to develop natural language processing tools for under-resourced languages. Word order differences between languages pose several combinatorial challenges to machine translation, and constraints on word reorderings have a great impact on the set of potential translations explored during search. We study reordering constraints that restrict the factorial space of permutations and explore the impact of the design of the reordering search space on machine translation performance. However, we show that even though it might be desirable to design better reordering spaces, model and search errors still seem to be the most important issues.
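
To make the notion of type constraints concrete, here is a minimal sketch (not the thesis implementation) of applying a per-word tag dictionary at decoding time by masking disallowed labels. For brevity it uses independent per-token decisions rather than full CRF/Viterbi inference; the tag set, dictionary, and scores are hypothetical.

```python
# Minimal sketch: per-word "type" constraints applied at decoding time by
# masking disallowed labels. Illustrative only; a real system would run
# constrained Viterbi inference over a CRF rather than per-token argmax.
import numpy as np

TAGS = ["NOUN", "VERB", "ADJ", "DET"]

# Hypothetical tag dictionary: which tags each word may receive.
ALLOWED = {
    "the":  {"DET"},
    "old":  {"ADJ", "NOUN"},
    "man":  {"NOUN", "VERB"},
    "boat": {"NOUN"},
}

def constrained_decode(words, scores):
    """scores: array of shape (len(words), len(TAGS)) of model scores."""
    output = []
    for i, w in enumerate(words):
        allowed = ALLOWED.get(w, set(TAGS))   # unknown word: no constraint
        masked = np.where([t in allowed for t in TAGS], scores[i], -np.inf)
        output.append(TAGS[int(np.argmax(masked))])
    return output

words = ["the", "old", "man", "the", "boat"]
scores = np.random.default_rng(2).normal(size=(len(words), len(TAGS)))
print(constrained_decode(words, scores))
```
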
Continuous space models with neural networks in natural language processing by Hai Son Le( )

1 edition published in 2012 in English and held by 1 WorldCat member library worldwide

The purpose of language models is, in general, to capture and model the regularities of language, thereby capturing the morphological, syntactic and distributional properties of word sequences in a given language. They play an important role in many successful applications of Natural Language Processing, such as Automatic Speech Recognition, Machine Translation and Information Extraction. The most successful approaches to date are based on the n-gram assumption and on adjusting statistics from the training data by applying smoothing and back-off techniques, notably the Kneser-Ney technique introduced twenty years ago. In this way, language models predict a word based on its n-1 previous words. In spite of their prevalence, conventional n-gram based language models still suffer from several limitations that could intuitively be overcome by consulting human expert knowledge. One critical limitation is that, ignoring all linguistic properties, they treat each word as a discrete symbol with no relation to the others. Another is that, even with a huge amount of data, data sparsity always has an important impact, so the optimal value of n in the n-gram assumption is often 4 or 5, which is insufficient in practice. This kind of model is built from the counts of n-grams in the training data, so the relevance of these models is conditioned only on the characteristics of the training text (its quantity, how well it covers the content in terms of theme and date). Recently, one of the most successful attempts to directly learn word similarities has been to use distributed word representations in language modeling, where distributionally similar words, i.e. words with semantic and syntactic similarities, are expected to be represented as neighbors in a continuous space. These representations and the associated objective function (the likelihood of the training data) are jointly learned using a multi-layer neural network architecture. In this way, word similarities are learned automatically. This approach has shown significant and consistent improvements when applied to automatic speech recognition and statistical machine translation tasks. A major difficulty with the continuous-space neural network approach remains the computational burden, which does not scale well to the massive corpora that are nowadays available. For this reason, the first contribution of this dissertation is the definition of a neural architecture based on a tree representation of the output vocabulary, namely the Structured OUtput Layer (SOUL), which makes such models well suited to large-scale frameworks. The SOUL model combines the neural network approach with the class-based approach. It achieves significant improvements on both state-of-the-art large-scale automatic speech recognition and statistical machine translation tasks. The second contribution is a set of insightful analyses of its performance, its pros and cons, and the word space representation it induces. Finally, the third contribution is the successful adoption of the continuous-space neural network into a machine translation framework. New translation models are proposed and shown to achieve significant improvements over state-of-the-art baseline systems.
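
The SOUL idea of structuring the output vocabulary can be illustrated with a one-level class factorization, P(w | h) = P(class(w) | h) · P(w | class(w), h). The sketch below only illustrates this principle; the actual SOUL model uses a deeper tree over the vocabulary, and all sizes and parameters here are hypothetical.

```python
# Illustrative sketch of a one-level class-based output factorization,
# the principle behind structured output layers such as SOUL.
# NOT the actual model: SOUL uses a deeper tree over the vocabulary.
import numpy as np

rng = np.random.default_rng(3)
V, C, d = 10_000, 100, 64                      # vocabulary, classes, hidden size
word2class = rng.integers(0, C, size=V)        # fixed word-to-class assignment

W_class = rng.normal(scale=0.01, size=(C, d))  # class prediction layer
W_word = rng.normal(scale=0.01, size=(V, d))   # in-class word prediction layer

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def word_prob(word: int, h: np.ndarray) -> float:
    c = word2class[word]
    p_class = softmax(W_class @ h)[c]                # over C classes
    members = np.flatnonzero(word2class == c)        # words in class c
    p_in_class = softmax(W_word[members] @ h)
    p_word = p_in_class[int(np.where(members == word)[0][0])]
    return float(p_class * p_word)                   # O(C + |class|) instead of O(V)

h = rng.normal(size=d)
print(word_prob(42, h))
```
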
Discontinuous constituency parsing of morphologically rich languages by Maximin Coavoux( )

1 edition published in 2017 in English and held by 1 WorldCat member library worldwide

Syntactic parsing consists in assigning syntactic trees to sentences in natural language. Syntactic parsing of non-configurational languages, or of languages with a rich inflectional morphology, raises specific problems. These languages suffer more from lexical data sparsity and exhibit word order variation phenomena more frequently. For these languages, exploiting information about the internal structure of word forms is crucial for accurate parsing. This dissertation investigates transition-based methods for robust discontinuous constituency parsing. First of all, we propose a multitask learning neural architecture that performs joint parsing and morphological analysis. Then, we introduce a new transition system that is able to predict discontinuous constituency trees, i.e. syntactic structures that can be seen as derivations of mildly context-sensitive grammars such as LCFRS. Finally, we investigate the question of lexicalization in syntactic parsing. Some syntactic parsers are based on the hypothesis that constituents are organized around a lexical head and that modelling bilexical dependencies is essential to resolve ambiguities. We introduce an unlexicalized transition system for discontinuous constituency parsing and a scoring model based on constituent boundaries. The resulting parser is simpler than lexicalized parsers and achieves better results in both discontinuous and projective constituency parsing.
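
As a loose illustration of scoring constituents from their boundaries rather than from a lexical head (not the parser described in the thesis), the sketch below scores a candidate span using only the embeddings of the words at and around its boundaries; the embeddings, dimensions and scoring layer are all hypothetical.

```python
# Hedged sketch: scoring a candidate constituent from its boundary words only,
# i.e. without choosing a lexical head. NOT the thesis parser; all parameters
# below are invented for illustration.
import numpy as np

rng = np.random.default_rng(4)
d = 8
sentence = ["die", "Katze", "hat", "die", "Maus", "gefangen"]
emb = {w: rng.normal(size=d) for w in set(sentence)}   # toy embeddings

w_score = rng.normal(scale=0.1, size=4 * d)            # linear scoring layer

def span_score(i: int, j: int) -> float:
    """Score span [i, j) from the embeddings at and around its boundaries."""
    feats = np.concatenate([
        emb[sentence[i]],                                         # first word inside
        emb[sentence[j - 1]],                                     # last word inside
        emb[sentence[i - 1]] if i > 0 else np.zeros(d),           # left context
        emb[sentence[j]] if j < len(sentence) else np.zeros(d),   # right context
    ])
    return float(w_score @ feats)

print(span_score(3, 5))   # score for the candidate constituent "die Maus"
```
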
 
Audience level: 0.95 (from 0.90 for Détection ... to 0.99 for From lexic ...)

Languages
English (16)

French (4)