WorldCat Identities

Estève, Yannick

Overview
Works: 32 works in 44 publications in 2 languages and 312 library holdings
Genres: Conference papers and proceedings 
Roles: Editor, Other, Opponent, Thesis advisor, Author
Classifications: P98, 006.454
Most widely held works by Yannick Estève
Statistical language and speech processing : 5th International Conference, SLSP 2017, Le Mans, France, October 23-25, 2017, proceedings by SLSP (Conference)( )

10 editions published in 2017 in English and held by 274 WorldCat member libraries worldwide

This book constitutes the refereed proceedings of the 5th International Conference on Statistical Language and Speech Processing, SLSP 2017, held in Le Mans, France, in October 2017. The 21 full papers presented were carefully reviewed and selected from 39 submissions. The papers cover topics such as anaphora and coreference resolution; authorship identification, plagiarism and spam filtering; computer-aided translation; corpora and language resources; data mining and semantic web; information extraction; information retrieval; knowledge representation and ontologies; lexicons and dictionaries; machine translation; multimodal technologies; natural language understanding; neural representation of speech and language; opinion mining and sentiment analysis; parsing; part-of-speech tagging; question answering systems; semantic role labeling; speaker identification and verification; speech and language generation; speech recognition; speech synthesis; speech transcription; speech correction; spoken dialogue systems; term extraction; text categorization; text summarization; and user modeling. They are organized in the following sections: language and information extraction; post-processing and applications of automatic transcriptions; speech paralinguistics and synthesis; and speech recognition: modeling and resources.
Intégration de sources de connaissances pour la modélisation stochastique du langage appliquée à la parole continue dans un contexte de dialogue oral homme-machine by Yannick Estève( Book )

2 editions published in 2002 in French and held by 5 WorldCat member libraries worldwide

Language models are used in a speech recognition system to guide acoustic decoding. The n-gram language models that serve as the reference language models in speech recognition model constraints over n words from events observed in a training corpus. These models give satisfactory results because they exploit a characteristic common to many languages, which impose strong local constraints on word order. Unfortunately, the use of these probabilistic models faces several difficulties. A small amount of training data is common when developing new speech recognition applications and leads to the estimation of probabilistic models that are not robust. Another difficulty comes from the length of the modeled constraints: some linguistic constraints span distances beyond the modeling capabilities of n-gram models. To overcome the difficulties of n-gram models, we propose to use several sources of a priori knowledge. We propose a hybrid model that combines an n-gram language model with local regular grammars. A priori knowledge is also exploited to build specialized n-gram language models and to use them during a spoken human-machine dialogue. Likewise, the analysis of the characteristics of the hypotheses produced by different recognition systems draws on various knowledge sources. This analysis makes it possible to choose the most relevant recognition hypothesis or to reject all the proposed hypotheses. Finally, a priori knowledge is taken into account to design linguistic consistency criteria. These criteria detect certain types of errors that can be corrected with very specific language models, called strategic models.
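
To make the hybrid approach concrete, here is a minimal Python sketch, under stated assumptions, of an n-gram language model interpolated with a local regular grammar. The corpus, the grammar pairs and the interpolation weights (lam, mu) are invented for illustration and are not taken from the thesis.

```python
# Hedged sketch: a bigram language model whose probability is interpolated
# with a toy "local grammar" (a set of licensed word pairs). All data and
# weights below are illustrative assumptions.
from collections import Counter

corpus = ["je veux un billet pour paris".split(),
          "je veux deux billets pour lyon".split()]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))
total = sum(unigrams.values())

def p_bigram(w_prev, w, lam=0.7):
    """Bigram probability, linearly interpolated with the unigram estimate."""
    p_bi = bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
    return lam * p_bi + (1 - lam) * unigrams[w] / total

# Word pairs licensed by a hand-written local rule, e.g. number + ticket word.
local_grammar = {("un", "billet"), ("deux", "billets")}

def p_hybrid(w_prev, w, mu=0.5):
    """Interpolate the n-gram estimate with the local-grammar constraint."""
    p_gram = 1.0 if (w_prev, w) in local_grammar else 0.0
    return mu * p_gram + (1 - mu) * p_bigram(w_prev, w)

print(p_hybrid("un", "billet"), p_hybrid("un", "pour"))
```
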
Automatic speech recognition system for Tunisian dialect by Abir Masmoudi( )

1 edition published in 2017 in English and held by 2 WorldCat member libraries worldwide

Construction et stratégie d'exploitation des réseaux de confusion en lien avec le contexte applicatif de la compréhension de la parole by Bogdan Minescu( )

2 editions published in 2008 in French and held by 2 WorldCat member libraries worldwide

The work presented in this PhD deals with confusion networks as a compact and structured representation of multiple aligned recognition hypotheses produced by a speech recognition system and used by different applications. Confusion networks (CNs) are constructed from word graphs and structure the information as a sequence of classes, each containing several competing word hypotheses. In this work we focus on the problem of robust understanding from spontaneous speech input in a dialogue application, using CNs as the structured representation of recognition hypotheses for the spoken language understanding module. We use the France Telecom spoken dialogue system for customer care. Two issues inherent to this context are tackled. A dialogue system does not only have to recognize what a user says but also to understand the meaning of the request and to act upon it. From the user's point of view, system performance is more accurately represented by the performance of the understanding process than by speech recognition performance alone. Our work aims at improving the performance of the understanding process. Using a real application implies being able to process real heterogeneous data. An utterance can be more or less noisy, in or out of the domain of the application, covered or not by the semantic model of the application, etc. A question raised by the variability of the data is whether applying the same processes to the entire data set, as done in classical approaches, is a suitable solution. This work follows a double perspective: to improve the CN construction algorithm with the intention of optimizing the understanding process, and to propose an adequate strategy for the use of CNs in a real application. Following a detailed analysis of two CN construction algorithms on a test set collected using the France Telecom customer care service, we decided to use the "pivot" algorithm for our work. We present a modified and adapted version of this algorithm. The new algorithm introduces different processing techniques for the words which are important for the understanding process. As for the variability of the real data the application has to process, we present a new multiple-level decision strategy aiming at applying different processing techniques to different utterance categories. We show that it is preferable to process multiple recognition hypotheses only on utterances having a valid interpretation. This strategy optimizes computation time and yields better global performance.
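
As an illustration of the data structure discussed above, the following sketch represents a confusion network as a sequence of classes of competing word hypotheses with posterior probabilities, and derives the consensus hypothesis by keeping the best word per class. The network below is invented; it is not output from the France Telecom system.

```python
# Hedged sketch: a confusion network as a sequence of classes, each mapping
# competing word hypotheses to posterior probabilities. "<eps>" stands for
# the empty (deletion) hypothesis. Values are invented.
confusion_network = [
    {"i": 0.9, "a": 0.1},
    {"want": 0.6, "won't": 0.3, "<eps>": 0.1},
    {"two": 0.5, "to": 0.4, "too": 0.1},
    {"tickets": 0.95, "ticket": 0.05},
]

def consensus(cn):
    """Consensus decoding: keep the highest-posterior word in each class."""
    best = [max(cls, key=cls.get) for cls in cn]
    return [w for w in best if w != "<eps>"]

print(" ".join(consensus(confusion_network)))  # -> "i want two tickets"
```
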
Advanced Quality Measures for Speech Translation by Ngoc Tien Le( )

1 edition published in 2018 in English and held by 2 WorldCat member libraries worldwide

The main aim of this thesis is to investigate the automatic quality assessment of spoken language translation (SLT), called Confidence Estimation (CE) for SLT. Due to several factors, SLT output of unsatisfactory quality might cause various issues for the target users, so it is useful to know how confident we can be in the tokens of a hypothesis. The first contribution of this thesis is LIG-WCE, a customizable, flexible framework and portable platform for Word-level Confidence Estimation (WCE) of SLT. WCE for SLT is a relatively new task, defined and formalized as a sequence labelling problem where each word in the SLT hypothesis is tagged as good or bad according to a large feature set. We propose several word confidence estimators based on our automatic evaluation of transcription (ASR) quality, translation (MT) quality, or both (combined/joint ASR+MT). This research work was made possible by building a specific corpus of 6.7k utterances, each paired with a quintuplet containing the ASR output, a verbatim transcript, a text translation, a speech translation and a post-edition of the translation. The conclusion of our multiple experiments using joint ASR and MT features for WCE is that MT features remain the most influential, while ASR features can bring interesting complementary information. As another contribution, we propose two methods to disentangle ASR errors and MT errors, where each word in the SLT hypothesis is tagged as good, asr_error or mt_error. We thus explore the contribution of WCE for SLT to finding the source of SLT errors. Furthermore, we propose a simple extension of the WER metric in order to penalize substitution errors differently according to their context, using word embeddings. For instance, the proposed metric should catch near matches (mainly morphological variants) and penalize less this kind of error, which has a more limited impact on translation performance. Our experiments show that the correlation of the newly proposed metric with SLT performance is better than that of WER. Oracle experiments are also conducted and show the ability of our metric to find better hypotheses (to be translated) in the ASR N-best list. Finally, a preliminary experiment in which ASR tuning is based on our new metric shows encouraging results. To conclude, we have proposed several prominent strategies for CE of SLT that could have a positive impact on several SLT applications. Robust quality estimators for SLT can be used for re-scoring speech translation graphs or for providing feedback to the user in interactive speech translation or computer-assisted speech-to-text scenarios. Keywords: Quality estimation, Word confidence estimation (WCE), Spoken Language Translation (SLT), Joint Features, Feature Selection
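
The WER extension described above can be illustrated with a small sketch: substitution costs are scaled by the cosine distance between word embeddings, so near matches cost less than a full error. The toy embeddings and the exact cost function are assumptions, not the metric's published definition.

```python
# Hedged sketch of an embedding-aware WER: substitutions are penalized by
# the cosine distance between reference and hypothesis word embeddings.
# The 2-d embeddings below are fabricated for illustration.
import numpy as np

emb = {"eat": np.array([1.0, 0.1]),
       "eats": np.array([0.95, 0.15]),   # near match: cheap substitution
       "sleep": np.array([0.0, 1.0])}    # unrelated word: near-full cost

def sub_cost(r, h):
    if r == h:
        return 0.0
    u, v = emb[r], emb[h]
    cos = float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return 1.0 - max(cos, 0.0)

def soft_wer(ref, hyp):
    """Levenshtein alignment with graded substitution costs."""
    d = np.zeros((len(ref) + 1, len(hyp) + 1))
    d[:, 0], d[0, :] = np.arange(len(ref) + 1), np.arange(len(hyp) + 1)
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i, j] = min(d[i - 1, j] + 1,                    # deletion
                          d[i, j - 1] + 1,                    # insertion
                          d[i - 1, j - 1] + sub_cost(ref[i - 1], hyp[j - 1]))
    return d[len(ref), len(hyp)] / len(ref)

print(soft_wer(["eat"], ["eats"]), soft_wer(["eat"], ["sleep"]))
```
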
Confidence Measures for Alignment and for Machine Translation by Yong Xu( )

1 edition published in 2016 in English and held by 1 WorldCat member library worldwide

In computational linguistics, the relation between different languages is often studied through automatic alignment techniques. Such alignments can be established at various structural levels. In particular, sentential and sub-sentential bitext alignments constitute an important source of information in various modern Natural Language Processing (NLP) applications, a prominent one being Machine Translation (MT). Effectively computing bitext alignments, however, can be a challenging task. Discrepancies between languages appear in various ways, from discourse structures to morphological constructions. Automatic alignments would, at least in most cases, contain noise harmful to the performance of application systems which use the alignments. To deal with this situation, two research directions emerge: the first is to keep improving alignment techniques; the second is to develop reliable confidence measures which enable application systems to selectively employ the alignments according to their needs. Both alignment techniques and confidence estimation can benefit from manual alignments. Manual alignments can be used both as supervision examples to train scoring models and as evaluation materials. The creation of such data is, however, an important question in itself, particularly at sub-sentential levels, where cross-lingual correspondences can be only implicit and difficult to capture. This thesis focuses on means to acquire useful sentential and sub-sentential bitext alignments. Chapter 1 provides a non-technical description of the research motivation, scope and organization, and introduces terminology and notation. State-of-the-art alignment techniques are reviewed in Part I. Chapters 2 and 3 describe state-of-the-art methods for sentence and word alignment, respectively. Chapter 4 summarizes existing manual alignments, and discusses issues related to the creation of gold alignment data. The remainder of this thesis, Part II, presents our contributions to bitext alignment, which are concentrated on three sub-tasks. Chapter 5 presents our contribution to gold alignment data collection. For sentence-level alignment, we collect manual annotations for an interesting text genre: literary bitexts, which are very useful for evaluating sentence aligners. We also propose a scheme for sentence alignment confidence annotation. For sub-sentential alignment, we annotate one-to-one word links with a novel 4-way labelling scheme, and design a new approach for facilitating the collection of many-to-many links. All the collected data is released on-line. Improving alignment methods remains an important research subject. We pay special attention to sentence alignment, which often lies at the beginning of the bitext alignment pipeline. Chapter 6 presents our contributions to this task. Starting by evaluating state-of-the-art aligners and analyzing their models and results, we propose two new sentence alignment methods, which achieve state-of-the-art performance on a difficult dataset. The other important subject that we study is confidence estimation. In Chapter 7, we propose confidence measures for sentential and sub-sentential alignments. Experiments show that confidence estimation of alignment links is a challenging problem, and more work on enhancing the confidence measures will be useful. Finally, note that these contributions have been employed in a real-world application: the development of a bilingual reading tool aimed at facilitating reading in a foreign language.
Synthèse audiovisuelle de la parole expressive : modélisation des émotions par apprentissage profond by Sara Dahmani( )

1 edition published in 2020 in French and held by 1 WorldCat member library worldwide

This thesis deals with modeling emotions for expressive audiovisual text-to-speech synthesis. Today, the output of text-to-speech systems is of good quality, but audiovisual synthesis remains an open problem and expressive synthesis even more so. In this thesis we propose a malleable and flexible method for modeling emotions, which makes it possible to mix emotions as one mixes shades on a color palette. In a first part, we present and study two expressive corpora that we built. The acquisition strategy and the expressive content of these corpora are analyzed to validate their use for audiovisual speech synthesis. In a second part, we propose two neural architectures for speech synthesis. We used these two architectures to model three aspects of speech: 1) sound durations, 2) the acoustic modality and 3) the visual modality. We first adopted a fully connected architecture, which allowed us to study the behavior of neural networks with respect to different contextual and linguistic descriptors, and to analyze, through objective measures, the network's ability to model emotions. The second proposed neural architecture is a variational autoencoder, which is able to learn a latent representation of emotions without using emotion labels. After analyzing the latent emotion space, we proposed a procedure to structure it so as to move from a categorical representation of emotions to a continuous one. Through perceptual experiments, we validated the ability of our system to generate emotions, nuances of emotions and mixtures of emotions for expressive audiovisual text-to-speech synthesis.
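
The "color palette" idea described above can be sketched as follows: once a latent emotion space has been learned (here faked with fixed centroids), nuances and blends are obtained as convex combinations of emotion centroids before decoding. The centroid values are invented; the real system learns such a space with a variational autoencoder.

```python
# Hedged sketch: blending emotions as convex combinations of latent
# centroids. The 3-d centroids are toy values standing in for a learned
# VAE latent space.
import numpy as np

centroids = {
    "joy": np.array([1.0, 0.2, 0.0]),
    "sadness": np.array([-0.8, 0.1, 0.4]),
    "anger": np.array([0.3, -0.9, 0.2]),
}

def blend(weights):
    """Convex combination of emotion centroids, e.g. 70% joy / 30% anger."""
    z = sum(w * centroids[e] for e, w in weights.items())
    return z / sum(weights.values())

print(blend({"joy": 0.7, "anger": 0.3}))  # latent code for a nuanced emotion
```
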
Les collections volumineuses de documents audiovisuels : segmentation et regroupement en locuteurs by Grégor Dupuy( )

1 edition published in 2015 in French and held by 1 WorldCat member library worldwide

The task of speaker diarization, as defined by NIST, considers the recordings of a corpus as independent processes. The recordings are processed separately, and the overall error rate is a weighted average. In this context, detected speakers are identified by anonymous labels specific to each recording; a speaker appearing in several recordings will therefore be identified by a different label in each of them. Yet this situation is very common in broadcast news data: hosts, journalists and other guests may appear recurrently. Consequently, speaker diarization has recently been considered in a broader context, where recurring speakers must be uniquely identified across all the recordings that compose a corpus. This generalization of the speaker partitioning problem goes hand in hand with the emergence of the concept of collections, which refers, in the context of speaker diarization, to a set of recordings sharing one or more common characteristics. The work proposed in this thesis concerns speaker clustering of large audiovisual collections (several tens of hours of recordings). The main objective is to propose (or adapt) clustering approaches in order to efficiently process large volumes of data while detecting recurrent speakers. The effectiveness of the proposed approaches is discussed from two points of view: first, the quality of the produced clustering (in terms of error rate), and second, the time required to perform the process. For this purpose, we propose two architectures designed to perform cross-show speaker diarization on collections of recordings. We propose a simplifying approach that decomposes a large clustering problem into several independent sub-problems. These sub-problems are solved with either of two clustering approaches which take advantage of recent advances in speaker modeling.
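
In the spirit of the cross-show diarization described above, the sketch below greedily links within-show speaker clusters (reduced to one embedding each) across recordings when their cosine similarity exceeds a threshold. The embeddings and the threshold are illustrative assumptions, not the thesis architectures.

```python
# Hedged sketch: greedy cross-show speaker linking on toy embeddings.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def link_across_shows(show_clusters, threshold=0.8):
    """Assign a global speaker label to each (show, local cluster) embedding."""
    known = []                         # list of (global label, embedding)
    labels = {}
    for key, emb in show_clusters.items():
        best = max(known, key=lambda g: cosine(g[1], emb), default=None)
        if best is not None and cosine(best[1], emb) >= threshold:
            labels[key] = best[0]      # recurring speaker, reuse the label
        else:
            labels[key] = len(known)   # new speaker, new global label
            known.append((len(known), emb))
    return labels

clusters = {("show1", "spk0"): np.array([1.0, 0.1]),
            ("show2", "spkA"): np.array([0.95, 0.12]),   # same person as spk0
            ("show2", "spkB"): np.array([0.0, 1.0])}
print(link_across_shows(clusters))
```
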
Approche hybride pour la reconnaissance automatique de la parole en langue arabe by Abir Masmoudi Dammak( )

1 edition published in 2016 in French and held by 1 WorldCat member library worldwide

Developing a speech recognition system requires a large amount of resources, namely large text and speech corpora and a pronunciation dictionary. However, these resources are not directly available for Arabic dialects. The development of an ASR system for Arabic dialects therefore faces multiple difficulties, namely the lack of large quantities of resources and the absence of a standard orthography, since these dialects are spoken rather than written. In this perspective, this thesis contributes to the development of an ASR system for the Tunisian dialect. A first part of the contributions consists in developing a variant of CODA (Conventional Orthography for Dialectal Arabic) for the Tunisian dialect; this convention is designed to provide a detailed description of the guidelines applied to Tunisian. Following the CODA guidelines, we built our corpus, named TARIC: a Tunisian Arabic railway interaction corpus collected in the domain of the Tunisian national railway company (SNCFT). Besides these resources, a pronunciation dictionary is indispensable for developing an ASR system. In the second part of the contributions, we therefore built a grapheme-to-phoneme (G2P) conversion system that automatically generates this phonetic dictionary. All the resources described above were used to adapt the LIUM laboratory's ASR system for modern standard Arabic to the Tunisian dialect in the SNCFT domain. The evaluation of our system yielded a word error rate of 22.6% on the test set.
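
The grapheme-to-phoneme (G2P) conversion mentioned above can be sketched as a longest-match rewrite over a rule table. The rules below are invented French-like examples, not the Tunisian-dialect rules developed in the thesis.

```python
# Hedged sketch: greedy longest-match grapheme-to-phoneme conversion over
# a toy rule table (grapheme -> phoneme). Rules are illustrative only.
rules = {"ch": "S", "ou": "u", "c": "k", "a": "a", "t": "t"}

def g2p(word):
    """Try longer graphemes first; skip graphemes not covered by a rule."""
    phones, i = [], 0
    while i < len(word):
        for size in (2, 1):
            chunk = word[i:i + size]
            if chunk in rules:
                phones.append(rules[chunk])
                i += size
                break
        else:
            i += 1          # no rule matched: skip this character
    return phones

print(g2p("chat"), g2p("couscous"))
```
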
Large-scale acoustic and prosodic investigations of french by Rena Nemoto( )

1 edition published in 2011 in English and held by 1 WorldCat member library worldwide

This thesis focuses on acoustic and prosodic (fundamental frequency (F0), duration, intensity) analyses of French from large-scale audio corpora portraying different speaking styles: prepared and spontaneous speech. We are interested in particularities of segmental phonetics and prosody that may characterize pronunciation. In French, many errors made by automatic speech recognition (ASR) systems arise from frequent homophone words, for which ASR systems depend on language model weights. Automatic classification (AC) was conducted to discriminate homophones by acoustic and prosodic properties alone, depending on their part-of-speech function or their position within prosodic words. Results from the AC of two homophone pairs, et/est (and/is) and à/a (to/has), revealed that the et/est pair was more discriminable. A selection of 15 prosodic and inter-phoneme attributes performed as well as the full set of 62 attributes. Corresponding perceptual tests were then conducted to verify whether humans also use acoustico-prosodic parameters for the discrimination. Results suggested that acoustic and prosodic information might help in making the correct choice in similarly ambiguous syntactic structures. From the hypothesis that pronunciation variants are due to varying prosodic constraints, we examined overall prosodic properties of French at the lexical and phrase level. The comparison between lexical and grammatical words revealed an F0 rise and lengthening at the end of the final syllable of lexical words, while these phenomena were not observed for grammatical words. Analyses also revealed that the mean profile of a noun phrase of length n can differ from that of a noun of length n, with a low F0 at the beginning of a noun phrase. These prosodic profiles can be helpful for locating word boundaries. The findings of this thesis can lead to localizing focus and named entities using discriminative classifiers, and to improving word boundary locations in an ASR post-processing step.
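
The automatic classification setup described above can be sketched as follows: a classifier decides between the homophones et and est from prosodic attributes alone. The feature values below are fabricated; a real experiment would extract duration, F0 and intensity from aligned speech.

```python
# Hedged sketch: discriminating the homophones "et"/"est" from toy
# prosodic features. Columns: [vowel duration (s), mean F0 (Hz),
# mean intensity (dB)]. All values are invented for illustration.
from sklearn.linear_model import LogisticRegression

X = [[0.08, 190, 62], [0.07, 185, 60], [0.09, 200, 63],   # "et" tokens
     [0.14, 150, 68], [0.15, 145, 70], [0.13, 155, 67]]   # "est" tokens
y = ["et", "et", "et", "est", "est", "est"]

clf = LogisticRegression().fit(X, y)
print(clf.predict([[0.12, 152, 66]]))  # likely "est" on these toy features
```
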
Évaluation adaptative des systèmes de transcription en contexte applicatif by Mohamed Amer Ben Jannet( )

1 edition published in 2015 in French and held by 1 WorldCat member library worldwide

It is important to regularly assess the products of technological innovation in order to estimate the level of maturity reached by the technology and to study the application frameworks in which they can be used. Natural language processing (NLP) aims at developing modules and applications that automatically process human language, which makes the field relevant to both research and technological innovation. For years, the different technological modules of NLP were developed separately, so the existing evaluation methods are mostly modular: they allow only one module to be evaluated at a time, while today many applications need to combine several NLP modules to solve complex tasks. The new challenge in terms of evaluation is then to evaluate the different modules while taking the applicative context into account. Our work addresses the evaluation of Automatic Speech Recognition (ASR) systems according to the applicative context. We focus on the case of Named Entity Recognition (NER) from spoken documents transcribed automatically. In the first part, we address the issue of evaluating ASR systems according to the application context through a study of the state of the art. We describe the ASR and NER tasks proposed during several evaluation campaigns and discuss the protocols established for their evaluation. We also point out the limitations of modular evaluation approaches and present the alternative measures proposed in the literature. In the second part we describe the studied task of named entity detection, classification and decomposition, and we propose a new metric, ETER (Entity Tree Error Rate), which takes into account the specificity of the task and the applicative context during evaluation. ETER also eliminates the biases observed with existing metrics. In the third part, we define a new measure, ATENE (Automatic Transcriptions Evaluation for Named Entities), that evaluates the quality of ASR systems and the impact of their errors on NER systems applied downstream. Rather than directly comparing reference and hypothesis transcriptions, ATENE measures how much harder it becomes to identify entities given the differences between hypothesis and reference, by comparing an estimated likelihood of the presence of entities. It is composed of two elementary measurements: the first assesses the risk of entity deletions and substitutions, and the second assesses the risk of entity insertions caused by ASR errors. Our validation experiments show that the measurements given by ATENE correlate better with the performance of NER systems than other measures from the state of the art.
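
To give the intuition behind an entity-aware error rate such as ETER, the sketch below counts deleted and inserted entity slots between reference and hypothesis. It is a deliberately simplified flat version, whereas ETER itself operates on entity trees (types and components).

```python
# Hedged sketch: a flat entity error rate. Entities are (type, surface)
# pairs; order is ignored. ETER proper is tree-based, this is only the
# underlying intuition.
def entity_error_rate(ref_entities, hyp_entities):
    ref, hyp = set(ref_entities), set(hyp_entities)
    deletions = len(ref - hyp)       # reference entities missed
    insertions = len(hyp - ref)      # spurious hypothesis entities
    return (deletions + insertions) / max(len(ref), 1)

ref = [("pers", "barack obama"), ("loc", "paris")]
hyp = [("pers", "barack obama"), ("loc", "paris hilton")]
print(entity_error_rate(ref, hyp))  # 1.0: one deletion + one insertion over 2
```
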
Modèle joint pour le traitement automatique de la langue : perspectives au travers des réseaux de neurones by Jérémie Tafforeau( )

1 edition published in 2017 in French and held by 1 WorldCat member library worldwide

NLP researchers have identified different levels of linguistic analysis. This has led to a hierarchical division of the various tasks performed in order to analyze a text statement. The traditional approach considers task-specific models which are subsequently arranged in cascade within processing chains (pipelines). This approach has a number of limitations: the empirical selection of model features, the accumulation of errors along the pipeline, and the lack of robustness to domain changes. These limitations lead to particularly high performance losses in the case of non-canonical language with limited available data, such as transcriptions of telephone conversations. Disfluencies and speech-specific syntactic schemes, as well as transcription errors made by automatic speech recognition systems, lead to a significant drop in performance. It is therefore necessary to develop robust and flexible systems. We intend to perform syntactic and semantic analysis using a multitask deep neural network model while taking into account variations of domain and/or language register within the data.
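
A minimal PyTorch sketch of the multitask idea described above: one shared sentence encoder feeding several task-specific tagging heads (here syntax and semantics), trained jointly. The layer sizes and tag inventories are illustrative assumptions.

```python
# Hedged sketch: a shared BiLSTM encoder with two tagging heads; joint
# training would sum the per-task cross-entropy losses. Sizes are toy.
import torch
import torch.nn as nn

class JointTagger(nn.Module):
    def __init__(self, vocab=1000, emb=64, hidden=128, n_pos=20, n_sem=30):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True,
                               bidirectional=True)       # shared layers
        self.pos_head = nn.Linear(2 * hidden, n_pos)     # syntactic tags
        self.sem_head = nn.Linear(2 * hidden, n_sem)     # semantic tags

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))
        return self.pos_head(h), self.sem_head(h)

model = JointTagger()
tokens = torch.randint(0, 1000, (2, 7))     # batch of 2 seven-word sentences
pos_logits, sem_logits = model(tokens)
print(pos_logits.shape, sem_logits.shape)   # per-token logits for each task
```
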
Reconnaissance et traduction automatique de la parole de vidéos arabes et dialectales by Mohamed Amine Menacer( )

1 edition published in 2020 in French and held by 1 WorldCat member library worldwide

This research was developed in the framework of the project AMIS (Access to Multilingual Information and opinionS), a European project which aims to help people understand the main idea of a video in a foreign language by generating an automatic summary of it. In this thesis, we focus on the automatic recognition and translation of the speech of Arabic and dialectal videos. The statistical approaches proposed in the literature for automatic speech recognition are language-independent and applicable to modern standard Arabic. However, this language presents some characteristics that need to be taken into consideration in order to boost the performance of the speech recognition system. Among these characteristics is the absence of short vowels in written text, which makes their training by the acoustic model difficult. We proposed several approaches to acoustic and/or language modeling in order to better recognize Arabic speech. In the Arab world, modern standard Arabic is not the mother tongue, which is why daily conversations are carried out in dialect, a variety of Arabic inspired by modern standard Arabic, but not only by it. We worked on adapting the speech recognition system developed for modern standard Arabic to the Algerian dialect, one of the variants of Arabic that is most difficult for automatic speech recognition systems to recognize. This is mainly due to words borrowed from other languages, code-switching and the lack of resources. Our approach to overcoming all these problems is to take advantage of the oral and textual data of other languages that have an impact on the dialect, in order to train the models required for dialect speech recognition. The text resulting from the Arabic speech recognition system was then used for machine translation. As a starting point, we conducted a comparative study between the phrase-based approach and the neural approach used in machine translation. Then, we adapted these two approaches to translate code-switched text. Our study focused on the mix of Arabic and English in a parallel corpus extracted from official documents of the United Nations. In order to prevent error propagation in the pipeline system, we worked on adapting the vocabulary of the automatic speech recognition system and proposed a new model that directly transforms a speech signal in language A into a sequence of words in another language B.
Structuration automatique de documents audio by Abdesselam Bouchekif( )

1 edition published in 2016 in French and held by 1 WorldCat member library worldwide

Topic structuring is an area that has attracted much attention in the Natural Language Processing community. Indeed, topic structuring is considered the starting point of several applications such as information retrieval, summarization and topic modeling. In this thesis, we propose a generic topic structuring system, i.e. one able to deal with any TV broadcast news. Our system comprises two steps: topic segmentation and title assignment. Topic segmentation consists in splitting the document into thematically homogeneous fragments. These fragments are generally identified by anonymous labels, and the last step has to assign a title to each segment. Several original contributions are proposed, such as the joint exploitation of the distribution of speakers and words (speech cohesion) and the use of diachronic semantic relations. After the topic segmentation step, the generated segments are each assigned a title corresponding to an article collected from Google News on the same day. Finally, we propose two new evaluation metrics, the first dedicated to topic segmentation and the second to title assignment. The experiments were carried out on three corpora consisting of 168 automatically transcribed TV broadcast news shows from 10 French channels. Our corpus is characterized by its richness and diversity.
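
The lexical-cohesion part of the segmentation described above can be sketched in a TextTiling-like fashion: adjacent windows of words are compared with cosine similarity, and a boundary is placed where similarity drops. The window size and threshold are assumptions; the real system also exploits speaker distribution and diachronic semantic relations.

```python
# Hedged sketch: place topic boundaries where the cosine similarity
# between adjacent word windows drops below a threshold. Toy data.
from collections import Counter
import math

def cosine(a, b):
    num = sum(a[w] * b[w] for w in a)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def boundaries(words, window=20, threshold=0.1):
    """Return word indices where lexical cohesion drops (topic shifts)."""
    cuts = []
    for i in range(window, len(words) - window, window):
        left = Counter(words[i - window:i])
        right = Counter(words[i:i + window])
        if cosine(left, right) < threshold:
            cuts.append(i)
    return cuts

words = ("the cat sat on the mat " * 10 + "stocks fell sharply today " * 10).split()
print(boundaries(words))  # one boundary at the topic change, near index 60
```
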
Analyse du discours conversationnel dans le cadre de communications médiées par ordinateur by Jérémy Auguste( )

1 edition published in 2020 in French and held by 1 WorldCat member library worldwide

Dialogue plays an important role in society, and that role grows as technology progresses. More and more tools make it possible to converse remotely, allowing the collection of large amounts of data that can be used for various analyses and automatic systems. Conversational discourse analysis is a partial answer for understanding certain aspects of language production in dialogue. Such an analysis characterizes the interactions between the messages of a dialogue, bringing out what is at stake and identifying the exchanges needed to move the dialogue forward. Producing these analyses is a complex task. The large number of discourse analysis theories illustrates how hard it is for a human to define discourse structures that model all the interactions. This makes producing a large annotated corpus very costly, and the scarcity of annotated data makes supervised learning algorithms difficult to use. In this thesis, I propose to produce representations of conversational discourse while relying on little discourse-annotated data. The thesis was carried out within the ANR DATCHA project, which gave me access to a large corpus of chat conversations from the Orange company. This corpus allowed me to explore several strategies for producing discourse representations: relying on an end-to-end model that predicts customer satisfaction; building on dialogue-act annotations to produce sentence embeddings; and using supervised algorithms on an automatically enriched corpus.
Du signal au concept : réseaux de neurones profonds appliqués à la compréhension de la parole by Antoine Caubriere( )

1 edition published in 2021 in French and held by 1 WorldCat member library worldwide

This thesis falls within the area of deep learning applied to spoken language understanding. Until now, this task was performed through a pipeline of components implementing, for example, a speech recognition system, then different natural language processing modules, before involving a language understanding system operating on the enriched automatic transcriptions. Recently, work in the field of speech recognition has shown that it is possible to produce a sequence of words directly from the acoustic signal. Within the framework of this thesis, the aim is to exploit these advances and extend them to design a system composed of a single neural model fully optimized for the spoken language understanding task, from signal to concept. First, we present a state of the art describing the principles of deep learning, speech recognition, and speech understanding. Then, we describe the contributions made along three main axes. We propose a first system addressing the problem posed and apply it to a named entity recognition task. Then, we propose a transfer learning strategy guided by a curriculum learning approach. This strategy builds on generically learned knowledge to improve the performance of a neural system on a semantic concept extraction task. We then analyze the errors produced by our approach, while studying the behavior of the proposed neural architecture. Finally, we set up a confidence measure to evaluate the reliability of a hypothesis produced by our system.
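
The confidence measure mentioned at the end can be sketched under the assumption that reliability is scored from the per-symbol posterior probabilities of the hypothesis; the thesis may use a different estimator.

```python
# Hedged sketch: hypothesis confidence as the geometric mean of the
# per-symbol posterior probabilities produced by the network.
import numpy as np

def hypothesis_confidence(posteriors):
    p = np.asarray(posteriors, dtype=float)
    return float(np.exp(np.mean(np.log(p))))

print(hypothesis_confidence([0.9, 0.8, 0.95]))  # confident hypothesis
print(hypothesis_confidence([0.9, 0.2, 0.95]))  # one doubtful symbol lowers it
```
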
Traitement de l'incertitude pour la reconnaissance de la parole robuste au bruit by Dung Tien Tran( )

1 edition published in 2015 in English and held by 1 WorldCat member library worldwide

This thesis focuses on noise-robust automatic speech recognition (ASR). It includes two parts. First, we focus on better handling of uncertainty to improve the performance of ASR in a noisy environment. Second, we present a method to accelerate the training process of a neural network using an auxiliary function technique. In the first part, multichannel speech enhancement is applied to the input noisy speech. The posterior distribution of the underlying clean speech is then estimated, as represented by its mean and its covariance matrix or uncertainty. We show how to propagate the diagonal uncertainty covariance matrix in the spectral domain through the feature computation stage to obtain the full uncertainty covariance matrix in the feature domain. Uncertainty decoding exploits this posterior distribution to dynamically modify the acoustic model parameters in the decoding rule. The uncertainty decoding rule simply consists of adding the uncertainty covariance matrix of the enhanced features to the variance of each Gaussian component. We then propose two uncertainty estimators, based on fusion and on nonparametric estimation, respectively. To build a new estimator, we consider a linear combination of existing uncertainty estimators or kernel functions. The combination weights are generatively estimated by minimizing some divergence with respect to the oracle uncertainty. The divergence measures used are weighted versions of the Kullback-Leibler (KL), Itakura-Saito (IS), and Euclidean (EU) divergences. Due to the inherent nonnegativity of uncertainty, this estimation problem can be seen as an instance of weighted nonnegative matrix factorization (NMF). In addition, we propose two discriminative uncertainty estimators based on linear or nonlinear mapping of the generatively estimated uncertainty. This mapping is trained so as to maximize the boosted maximum mutual information (bMMI) criterion. We compute the derivative of this criterion using the chain rule and optimize it using stochastic gradient descent. In the second part, we introduce a new learning rule for neural networks that is based on an auxiliary function technique without parameter tuning. Instead of minimizing the objective function, this technique consists of minimizing a quadratic auxiliary function which is recursively introduced layer by layer and which has a closed-form optimum. Based on the properties of this auxiliary function, the monotonic decrease of the new learning rule is guaranteed.
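
The uncertainty decoding rule stated above (adding the uncertainty covariance of the enhanced features to the variance of each Gaussian component) can be written out directly; the numbers below are toy values for a diagonal-covariance Gaussian.

```python
# Hedged sketch of uncertainty decoding with a diagonal Gaussian: the
# feature uncertainty is added to the component variance before scoring.
import numpy as np

def log_gaussian(x, mean, var):
    """Log-likelihood of x under a diagonal-covariance Gaussian."""
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var))

x = np.array([1.0, 0.5])      # enhanced feature vector
unc = np.array([0.2, 0.1])    # its (diagonal) uncertainty covariance
mean = np.array([0.8, 0.6])   # Gaussian component mean
var = np.array([0.5, 0.4])    # Gaussian component variance

print(log_gaussian(x, mean, var))        # standard decoding
print(log_gaussian(x, mean, var + unc))  # uncertainty decoding
```
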
Construction rapide, performante et mutualisée de systèmes de reconnaissance et de synthèse de la parole pour de nouvelles langues by Kévin Vythelingum( )

1 edition published in 2019 in French and held by 1 WorldCat member library worldwide

In this thesis we study the joint construction of speech recognition and synthesis systems for new languages, with the goals of accuracy and quick development. The rapid development of voice technologies for new languages is driving scientific ambition and is now considered strategic by industrial players. Yet language development research is led by a few research centers, each working on a limited number of languages, even though these technologies share many common points. Our study focuses on building and sharing tools between systems for creating lexicons, learning phonetic rules and taking advantage of imperfect data. Our contributions focus on the selection of relevant data for learning acoustic models, the joint development of phonetizers and pronunciation lexicons for speech recognition and synthesis, and the use of neural models for phonetic transcription from text and from the speech signal. In addition, we present an approach for the automatic detection of phonetic transcription errors in annotated speech databases. This study has shown that it is possible to significantly reduce the amount of annotated data needed for the development of new text-to-speech systems, which naturally helps to reduce data collection time when creating new systems. Finally, we study an application case by jointly building a speech recognition and a speech synthesis system for a new language.
Approches jointes texte/image pour la compréhension multimodale de documents by Sébastien Delecraz( )

1 edition published in 2018 in French and held by 1 WorldCat member library worldwide

The human faculties of understanding are essentially multimodal. To understand the world around them, human beings fuse the information coming from all of their sensory receptors. Most of the documents used in automatic information processing contain multimodal information, for example text and image in textual documents or image and sound in video documents; however, the processing applied to them is most often monomodal. The aim of this thesis is to propose joint processes applying mainly to text and image for the processing of multimodal documents, through two studies: one on multimodal fusion for speaker role recognition in television broadcasts, the other on the complementarity of modalities for a linguistic analysis task on corpora of images with captions. In the first part of this study, we are interested in the analysis of audiovisual documents from news television channels. We propose an approach that relies in particular on deep neural networks for the representation and fusion of modalities. In the second part of this thesis, we are interested in approaches that use several sources of multimodal information for a monomodal natural language processing task, in order to study their complementarity. We propose a complete system for correcting prepositional attachments using visual information, trained on a multimodal corpus of images with captions.
Speaker adaptation of deep neural network acoustic models using Gaussian mixture model framework in automatic speech recognition systems by Natalia Tomashenko( )

1 edition published in 2017 in English and held by 1 WorldCat member library worldwide

Differences between training and testing conditions can considerably degrade the quality of the transcriptions produced by an automatic speech recognition (ASR) system. Adaptation is an efficient way to reduce the mismatch between the system's models and the data related to a particular speaker or acoustic channel. There are two dominant types of acoustic models used in ASR: Gaussian mixture models (GMM) and deep neural networks (DNN). The hidden Markov model (HMM) approach combined with GMMs (GMM-HMM) was one of the most widely used techniques in ASR systems for many decades, and several adaptation techniques have been developed for this type of model. Acoustic models combining HMMs and DNNs (DNN-HMM) have recently brought great advances and outperformed GMM-HMM models on various ASR tasks, but speaker adaptation remains very challenging for DNN-HMM models. The main purpose of this thesis is to develop an efficient method for transferring adaptation algorithms from GMMs to DNNs. A novel approach for speaker adaptation of DNN acoustic models is proposed and investigated: it relies on using GMM-derived features as the input of a DNN. The proposed technique provides a general framework for transferring adaptation algorithms developed for GMMs to DNN adaptation. It is investigated for various state-of-the-art ASR systems and shown to be effective in comparison with other speaker adaptation techniques, as well as complementary to them.
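
The GMM-derived features described above can be sketched with scikit-learn: for each frame, a vector derived from a GMM (here, log component posteriors) replaces the raw acoustic features as the DNN input, so GMM-style adaptation transforms carry over to the DNN. The dimensions and data are illustrative.

```python
# Hedged sketch: per-frame GMM-derived features (log component posteriors)
# to be fed to a DNN acoustic model. Random data stands in for MFCC frames.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
frames = rng.normal(size=(500, 13))                 # e.g. 13-dim MFCC frames

gmm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(frames)

posteriors = gmm.predict_proba(frames)              # (500, 8), one row per frame
gmm_features = np.log(posteriors + 1e-10)           # DNN input features
print(gmm_features.shape)
```
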
 
Audience level: 0.60 (from 0.57 for Statistica ... to 0.99 for Statistica ...)

Languages
English (16)
French (15)