Using comparable corpora for under-resourced areas of machine translation

Inguna Skadina (Editor), Robert Gaizauskas (Editor), Bogdan Babych (Editor), Nikola Ljubešić (Editor), Dan Tufiș (Editor), Andrejs Vasiļjevs (Editor)
This book provides an overview of how comparable corpora can be used to overcome the lack of parallel resources when building machine translation systems for under-resourced languages and domains. It presents a wealth of methods and open tools for building comparable corpora from the Web, evaluating comparability and extracting parallel data that can be used for the machine translation task. It is divided into several sections, each covering a specific task such as building, processing, and using comparable corpora, focusing particularly on under-resourced language pairs and domains. The book is intended for anyone interested in data-driven machine translation for under-resourced languages and domains, especially for developers of machine translation systems, computational linguists and language workers. It offers a valuable resource for specialists and students in natural language processing, machine translation, corpus linguistics and computer-assisted translation, and promotes the broader use of comparable corpora in natural language processing and computational linguistics.
eBook, English, 2019
Springer, Cham, Switzerland, 2019
1 online resource (vi, 323 pages) : illustrations (some color)
9783319990040, 9783319990033, 9783319990057, 3319990047, 3319990039, 3319990055
1085564009
Intro; Contents; Chapter 1: Introduction; 1.1 Parallel Data; 1.2 Comparable Corpora and Comparability; 1.3 Acquisition of Parallel Data from Comparable Corpora; 1.4 Comparable Corpora in Machine Translation; 1.5 Summary of the Book; 1.6 The ACCURAT Project; References; Chapter 2: Cross-Language Comparability and Its Applications for MT; 2.1 Introduction: Definition and Use of the Concept of Comparability; 2.2 Development and Calibration of Comparability Metrics on Parallel Corpora; 2.2.1 Application of Corpus Comparability: Selecting Coherent Parallel Corpora for Domain-Specific MT Training; 2.2.2 Methodology; 2.2.2.1 Description of Calculation Method; 2.2.2.2 Symmetric vs. Asymmetric Calculation of Distance; 2.2.2.3 Calibrating the Distance Metric; 2.2.3 Validation of the Scores: Cross-Language Agreement for Source vs. Target Sides of TMX Files; 2.2.4 Discussion; 2.3 Exploration of Comparability Features in Document-Aligned Comparable Corpora: Wikipedia; 2.3.1 Overview: Wikipedia as a Source of Comparable Corpora; 2.3.2 Previous Work on Using Wikipedia as a Linguistic Resource; 2.3.3 Methodology; 2.3.3.1 Document Pre-processing; 2.3.3.2 Similarity Measures; 2.3.3.3 Eliciting Human Judgements; 2.3.4 Results and Analysis; 2.3.4.1 Responses to the Questionnaire; 2.3.4.2 Inter-assessor Agreement; 2.3.4.3 Correlation of Similarity Measures to Human Judgements; 2.3.4.4 Classification Task; 2.3.5 Discussion; 2.3.5.1 Features of 'Similar' Articles; 2.3.5.2 Measuring Cross-Language Similarity; 2.3.6 Section Conclusions; 2.4 Metrics for Identifying Comparability Levels in Non-aligned Documents; 2.4.1 Using Parallel and Comparable Corpora for MT; 2.4.2 Related Work; 2.4.3 Comparability Metrics; 2.4.3.1 Lexical Mapping Based Metric; 2.4.3.2 Keyword-Based Metric; 2.4.3.3 Machine Translation (MT)-Based Metrics; 2.4.4 Experiments and Evaluation; 2.4.4.1 Data Sources; 2.4.4.2 Experimental Results; 2.4.5 Metric Application to Equivalent Extraction; 2.4.6 Discussion; 2.4.6.1 Advantages and Disadvantages of the Metrics; 2.4.6.2 Using Semi-parallel Equivalents in MT Systems; 2.4.7 Conclusion; References; Chapter 3: Collecting Comparable Corpora; 3.1 Introduction; 3.2 Previous Work in Collecting Comparable Corpora; 3.2.1 Web Crawling; 3.2.2 Identifying Comparable Text; 3.3 ACCURAT Techniques to Collect Comparable Documents; 3.3.1 Comparable Corpora Collection from Wikipedia; 3.3.1.1 Extracting Comparable Articles; 3.3.1.2 Measuring Similarity in Inter-language Linked Documents; 3.3.2 Comparable Corpora Collection from News Articles; 3.3.3 Comparable Corpora Collection from Narrow Domains; 3.3.3.1 Acquiring Comparable Documents; 3.3.3.2 Aligning Comparable Document Pairs; References; Chapter 4: Extracting Data from Comparable Corpora; 4.1 Introduction; 4.2 Term Extraction, Tagging and Mapping for Under-Resourced Languages; 4.2.1 Related Work; 4.2.2 Term Extraction, Tagging and Mapping with the ACCURAT Toolkit