Collapse of Self-trained Language Models

Abstract: In various fields of knowledge creation, including science, new ideas often build on pre-existing information. In this work, we explore this concept within the context of language models. Specifically, we explore the potential of self-training models on their own outputs, akin to how humans learn and build on their previous thoughts and actions. While this approach is intuitively appealing, our research reveals its practical limitations. We find that extended self-training of the GPT-2 model leads to a significant degradation in performance, resulting in repetitive and collapsed token output.

Submitted 2 April, 2024; originally announced April 2024.

Comments: ICLR 2024

arXiv:2402.06196 [pdf, other]

Large Language Models: A Survey

Authors: Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, Jianfeng Gao

Abstract: Large Language Models (LLMs) have drawn a lot of attention due to their strong performance on a wide range of natural language tasks since the release of ChatGPT in November 2022. LLMs' ability of general-purpose language understanding and generation is acquired by training billions of model parameters on massive amounts of text data, as predicted by scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022). The research area of LLMs, while very recent, is evolving rapidly in many different ways. In this paper, we review some of the most prominent LLMs, including three popular LLM families (GPT, LLaMA, PaLM), and discuss their characteristics, contributions and limitations. We also give an overview of techniques developed to build and augment LLMs. We then survey popular datasets prepared for LLM training, fine-tuning, and evaluation, review widely used LLM evaluation metrics, and compare the performance of several popular LLMs on a set of representative benchmarks. Finally, we conclude the paper by discussing open challenges and future research directions.

Submitted 20 February, 2024; v1 submitted 9 February, 2024; originally announced February 2024.

Comments: arXiv admin note: substantial text overlap with arXiv:2401.14423

arXiv:2312.03735 [pdf, other]

Advancing State of the Art in Language Modeling

Authors: David Herel, Tomas Mikolov

Abstract: Generalization is arguably the most important goal of statistical language modeling research. Publicly available benchmarks and papers published with open-source code have been critical to advancing the field. However, it is often very difficult, and sometimes even impossible, to reproduce the results fully as reported in publications. In this paper, we propose a simple framework that should help advance the state of the art in language modeling in terms of generalization. We propose to publish not just the code, but also probabilities on dev and test sets with future publications so that one can easily add the new model into an ensemble. This has crucial advantages: it is much easier to determine whether a newly proposed model is actually complementary to the current baseline. Therefore, instead of inventing new names for the old tricks, the scientific community can advance faster. Finally, this approach promotes diversity of ideas: one does not need to create an individual model that is the new state of the art to attract attention; it will be sufficient to develop a new model that learns patterns which other models do not. Thus, even a suboptimal model can be found to have value. Remarkably, our approach has yielded new state-of-the-art results across various language modeling benchmarks up to 10%.

Submitted 28 November, 2023; originally announced December 2023.

arXiv:2211.04205 [pdf, other]

doi 10.3233/FAIA230376

Preserving Semantics in Textual Adversarial Attacks

Authors: David Herel, Hugo Cisneros, Tomas Mikolov

Abstract: The growth of hateful online content, or hate speech, has been associated with a global increase in violent crimes against minorities [23]. Harmful online content can be produced easily, automatically and anonymously. Even though some form of auto-detection is already achieved through text classifiers in NLP, they can be fooled by adversarial attacks. To strengthen existing systems and stay ahead of attackers, we need better adversarial attacks. In this paper, we show that up to 70% of adversarial examples generated by adversarial attacks should be discarded because they do not preserve semantics. We address this core weakness and propose a new, fully supervised sentence embedding technique called Semantics-Preserving-Encoder (SPE). Our method outperforms existing sentence encoders used in adversarial attacks by achieving a 1.2x to 5.1x better real attack success rate. We release our code as a plugin that can be used in any existing adversarial attack to improve its quality and speed up its execution.

Submitted 5 October, 2023; v1 submitted 8 November, 2022; originally announced November 2022.

Comments: 8 pages, 4 figures

Journal ref: ECAI 2023

arXiv:2210.02549 [pdf, other]

Benchmarking Learning Efficiency in Deep Reservoir Computing

Authors: Hugo Cisneros, Josef Sivic, Tomas Mikolov

Abstract: It is common to evaluate the performance of a machine learning model by measuring its predictive power on a test dataset. This approach favors complicated models that can smoothly fit complex functions and generalize well from training data points. Although essential components of intelligence, speed and data efficiency of this learning process are rarely reported or compared between different candidate models. In this paper, we introduce a benchmark of increasingly difficult tasks together with a data efficiency metric to measure how quickly machine learning models learn from training data. We compare the learning speed of some established sequential supervised models, such as RNNs, LSTMs, or Transformers, with relatively less known alternative models based on reservoir computing. The proposed tasks require a wide range of computational primitives, such as memory or the ability to compute Boolean functions, to be effectively solved. Surprisingly, we observe that reservoir computing systems that rely on dynamically evolving feature maps learn faster than fully supervised methods trained with stochastic gradient optimization while achieving comparable accuracy scores. The code, benchmark, trained models, and results to reproduce our experiments are available at https://github.com/hugcis/benchmark_learning_efficiency/ .

Submitted 29 September, 2022; originally announced October 2022.

Comments: Conference on Lifelong Learning Agents, Aug 2022, Montreal, Canada

arXiv:2207.04857 [pdf, other]

doi 10.1162/isal_a_00501

Emergence of Novelty in Evolutionary Algorithms

Authors: David Herel, Dominika Zogatova, Matej Kripner, Tomas Mikolov

Abstract: One of the main problems of evolutionary algorithms is the convergence of the population to local minima. In this paper, we explore techniques that can avoid this problem by encouraging diverse behavior of the agents through a shared reward system. The rewards are randomly distributed in the environment, and the agents are only rewarded for collecting them first. This leads to the emergence of novel behavior in the agents. We introduce our approach to the maze problem and compare it to the previously proposed solution, denoted as Novelty Search (Lehman and Stanley, 2011a). We find that our solution leads to improved performance while being significantly simpler. Building on that, we generalize the problem and apply our approach to a more advanced set of tasks, Atari Games, where we observe a similar performance quality with much less computational power needed.

Submitted 3 August, 2022; v1 submitted 27 June, 2022; originally announced July 2022.

Comments: ALIFE 2022

Journal ref: Artificial Life Conference Proceedings 2022. MIT Press

arXiv:2111.15588 [pdf, other]

SimpleTRON: Simple Transformer with O(N) Complexity

Authors: Uladzislau Yorsh, Alexander Kovalenko, Vojtěch Vančura, Daniel Vašata, Pavel Kordík, Tomáš Mikolov

Abstract: In this paper, we propose that the dot product pairwise matching attention layer, which is widely used in Transformer-based models, is redundant for the model performance. Attention, in its original formulation, has to be seen rather as a human-level tool to explore and/or visualize relevancy scores in sequential data. However, the way it is constructed leads to significant computational complexity. Instead, we present SimpleTRON: Simple Transformer with O(N) Complexity, a simple and fast alternative without any approximation that, unlike other approximation models, does not have any architecture-related overhead and therefore can be seen as a purely linear Transformer-like model. This architecture, to the best of our knowledge, outperforms existing sub-quadratic attention approximation models on several tasks from the Long-Range Arena benchmark. Moreover, we show that SimpleTRON can benefit from weight transfer from pretrained large language models, as its parameters can be fully transferable.

Submitted 28 June, 2022; v1 submitted 23 November, 2021; originally announced November 2021.

arXiv:2108.01573 [pdf, other]

Classification of Discrete Dynamical Systems Based on Transients

Authors: Barbora Hudcová, Tomáš Mikolov

Abstract: In order to develop systems capable of artificial evolution, we need to identify which systems can produce complex behavior. We present a novel classification method applicable to any class of deterministic discrete space and time dynamical systems. The method is based on classifying the asymptotic behavior of the average computation time in a given system before entering a loop. We were able to identify a critical region of behavior that corresponds to a phase transition from ordered behavior to chaos across various classes of dynamical systems. To show that our approach can be applied to many different computational systems, we demonstrate the results of classifying cellular automata, Turing machines, and random Boolean networks. Further, we use this method to classify 2D cellular automata to automatically find those with interesting, complex dynamics. We believe that our work can be used to design systems in which complex structures emerge. Also, it can be used to compare various versions of existing attempts to model open-ended evolution (Ray (1991), Ofria et al. (2004), Channon (2006)).

Submitted 3 August, 2021; originally announced August 2021.

Comments: 15 pages. arXiv admin note: substantial text overlap with arXiv:2008.13503

arXiv:2108.00415 [pdf, other]

doi 10.1162/isal_a_00478

Computational Hierarchy of Elementary Cellular Automata

Authors: Barbora Hudcová, Tomáš Mikolov

Abstract: The complexity of cellular automata is traditionally measured by their computational capacity. However, it is difficult to choose a challenging set of computational tasks suitable for the parallel nature of such systems. We study the ability of automata to emulate one another, and we use this notion to define such a set of naturally emerging tasks. We present the results for elementary cellular automata, although the core ideas can be extended to other computational systems. We compute a graph showing which elementary cellular automata can be emulated by which and show that certain chaotic automata are the only ones that cannot emulate any automata non-trivially. Finally, we use the emulation notion to suggest a novel definition of chaos that we believe is suitable for discrete computational systems. We believe our work can help design parallel computational systems that are Turing-complete and also computationally efficient.

Submitted 1 August, 2021; originally announced August 2021.

Comments: 8 pages

Journal ref: The 2021 Conference on Artificial Life Proceedings, 2021, 353--360

arXiv:2104.01008 [pdf, other]

doi 10.1162/isal_a_00277

Visualizing computation in large-scale cellular automata

Authors: Hugo Cisneros, Josef Sivic, Tomas Mikolov

Abstract: Emergent processes in complex systems such as cellular automata can perform computations of increasing complexity, and could possibly lead to artificial evolution. Such a feat would require scaling up current simulation sizes to allow for enough computational capacity. Understanding complex computations happening in cellular automata and other systems capable of emergence poses many challenges, especially in large-scale systems. We propose methods for coarse-graining cellular automata based on frequency analysis of cell states, clustering and autoencoders. These innovative techniques facilitate the discovery of large-scale structure formation and complexity analysis in those systems. They emphasize interesting behaviors in elementary cellular automata while filtering out background patterns. Moreover, our methods reduce large 2D automata to smaller sizes and enable identifying systems that behave interestingly at multiple scales.

Submitted 1 April, 2021; originally announced April 2021.

Journal ref: Artificial Life Conference Proceedings 2020 (pp. 239-247). MIT Press

arXiv:2103.08245 [pdf, other]

Emergence of Self-Reproducing Metabolisms as Recursive Algorithms in an Artificial Chemistry

Authors: Germán Kruszewski, Tomas Mikolov

Abstract: One of the main goals of Artificial Life is to research the conditions for the emergence of life, not necessarily as it is, but as it could be. Artificial Chemistries are one of the most important tools for this purpose because they provide us with a basic framework to investigate under which conditions metabolisms capable of reproducing themselves, and ultimately, of evolving, can emerge. While there have been successful attempts at producing examples of emergent self-reproducing metabolisms, the set of rules involved remains too complex to shed much light on the underlying principles at work. In this paper, we hypothesize that the key property needed for self-reproducing metabolisms to emerge is the existence of an auto-catalyzed subset of Turing-complete reactions. We validate this hypothesis with a minimalistic Artificial Chemistry with conservation laws, which is based on a Turing-complete rewriting system called Combinatory Logic. Our experiments show that a single run of this chemistry, starting from a tabula rasa state, discovers -- with no external intervention -- a wide range of emergent structures including ones that self-reproduce in each cycle. All of these structures take the form of recursive algorithms that acquire basic constituents from the environment and decompose them in a process that is remarkably similar to biological metabolisms.

Submitted 7 December, 2021; v1 submitted 15 March, 2021; originally announced March 2021.

Comments: arXiv admin note: text overlap with arXiv:2003.07916

arXiv:2008.13503 [pdf, other]

doi 10.1162/isal_a_00260

Classification of Complex Systems Based on Transients

Authors: Barbora Hudcova, Tomas Mikolov

Abstract: In order to develop systems capable of modeling artificial life, we need to identify which systems can produce complex behavior. We present a novel classification method applicable to any class of deterministic discrete space and time dynamical systems. The method distinguishes between different asymptotic behaviors of a system's average computation time before entering a loop. When applied to elementary cellular automata, we obtain classification results that correlate very well with Wolfram's manual classification. Further, we use it to classify 2D cellular automata to show that our technique can easily be applied to more complex models of computation. We believe this classification method can help to develop systems in which complex structures emerge.

Submitted 31 August, 2020; originally announced August 2020.

Comments: 9 pages

Journal ref: Artificial Life Conference Proceedings 32 (2020), 367-375

arXiv:2004.03340 [pdf, other]

Evaluating Online Continual Learning with CALM

Authors: Germán Kruszewski, Ionut-Teodor Sorodoc, Tomas Mikolov

Abstract: Online Continual Learning (OCL) studies learning over a continuous data stream without observing any single example more than once, a setting that is closer to the experience of humans and systems that must learn "on-the-wild". Yet, commonly available benchmarks are far from these real-world conditions, because they explicitly signal different tasks, lack latent similarity structure or assume temporal independence between different examples. Here, we propose a new benchmark for OCL based on language modelling in which input alternates between different languages and domains without any explicit delimitation. Additionally, we propose new metrics to study catastrophic forgetting in this setting and evaluate multiple baseline models based on compositions of experts. Finally, we introduce a simple gating technique that learns the latent similarities between different inputs, improving the performance of a Products of Experts model.

Submitted 1 February, 2021; v1 submitted 7 April, 2020; originally announced April 2020.

arXiv:2003.07916 [pdf, other]

Combinatory Chemistry: Towards a Simple Model of Emergent Evolution

Authors: Germán Kruszewski, Tomas Mikolov

Abstract: An explanatory model for the emergence of evolvable units must display emerging structures that (1) preserve themselves in time, (2) self-reproduce, and (3) tolerate a certain amount of variation when reproducing. To tackle this challenge, here we introduce Combinatory Chemistry, an Algorithmic Artificial Chemistry based on a minimalistic computational paradigm named Combinatory Logic. The dynamics of this system comprise very few rules; it is initialised with an elementary tabula rasa state and features conservation laws replicating natural resource constraints. Our experiments show that a single run of this dynamical system with no external intervention discovers a wide range of emergent patterns. All these structures rely on acquiring basic constituents from the environment and decomposing them in a process that is remarkably similar to biological metabolisms. These patterns include autopoietic structures that maintain their organisation, recursive ones that grow in linear chains or binary-branching trees, and most notably, patterns able to reproduce themselves, duplicating their number at each generation.

Submitted 19 June, 2020; v1 submitted 17 March, 2020; originally announced March 2020.

arXiv:1911.01086 [pdf, other]

doi 10.1109/SSCI44817.2019.9002840

Evolving Structures in Complex Systems

Authors: Hugo Cisneros, Josef Sivic, Tomas Mikolov

Abstract: In this paper we propose an approach for measuring growth of complexity of emerging patterns in complex systems such as cellular automata. We discuss several ways in which a metric for measuring the complexity growth can be defined. This includes approaches based on compression algorithms and artificial neural networks. We believe such a metric can be useful for designing systems that could exhibit open-ended evolution, which itself might be a prerequisite for development of general artificial intelligence. We conduct experiments on 1D and 2D grid worlds and demonstrate that using the proposed metric we can automatically construct computational models with emerging properties similar to those found in Conway's Game of Life, as well as many other emergent phenomena. Interestingly, some of the patterns we observe resemble forms of artificial life. Our metric of structural complexity growth can be applied to a wide range of complex systems, as it is not limited to cellular automata.

Submitted 18 March, 2020; v1 submitted 4 November, 2019; originally announced November 2019.

Comments: IEEE Symposium Series on Computational Intelligence 2019 (IEEE SSCI 2019)

Journal ref: Proceedings of the 2019 IEEE Symposium Series on Computational Intelligence

arXiv:1910.06241 [pdf, ps, other]

Updating Pre-trained Word Vectors and Text Classifiers using Monolingual Alignment

Authors: Piotr Bojanowski, Onur Celebi, Tomas Mikolov, Edouard Grave, Armand Joulin

Abstract: In this paper, we focus on the problem of adapting word vector-based models to new textual data. Given a model pre-trained on large reference data, how can we adapt it to a smaller piece of data with a slightly different language distribution? We frame the adaptation problem as a monolingual word vector alignment problem, and simply average models after alignment. We align vectors using the RCSLS criterion. Our formulation results in a simple and efficient algorithm that allows adapting general-purpose models to changing word distributions. In our evaluation, we consider applications to word embedding and text classification models. We show that the proposed approach yields good performance in all setups and outperforms a baseline consisting of fine-tuning the model on new data.

Submitted 15 October, 2019; v1 submitted 14 October, 2019; originally announced October 2019.

arXiv:1910.04861 [pdf, other]

Place Deduplication with Embeddings

Authors: Carl Yang, Do Huy Hoang, Tomas Mikolov, Jiawei Han

Abstract: Thanks to advancing mobile location services, people nowadays can post about places to share their visiting experiences on the go. A large place graph not only helps users explore interesting destinations, but also provides opportunities for understanding and modeling the real world. To improve coverage and flexibility of the place graph, many platforms import place data from multiple sources, which unfortunately leads to the emergence of numerous duplicated places that severely hinder subsequent location-related services. In this work, we take the anonymous place graph from Facebook as an example to systematically study the problem of place deduplication: We carefully formulate the problem, study its connections to various related tasks that lead to several promising basic models, and arrive at a systematic two-step data-driven pipeline based on place embedding with multiple novel techniques that works significantly better than the state-of-the-art.

Submitted 28 September, 2019; originally announced October 2019.

Comments: Published at WWW 2019

arXiv:1804.07745 [pdf, other]

Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion

Authors: Armand Joulin, Piotr Bojanowski, Tomas Mikolov, Herve Jegou, Edouard Grave

Abstract: Continuous word representations learned separately on distinct languages can be aligned so that their words become comparable in a common space. Existing works typically solve a least-squares regression problem to learn a rotation aligning a small bilingual lexicon, and use a retrieval criterion for inference. In this paper, we propose a unified formulation that directly optimizes a retrieval criterion in an end-to-end fashion. Our experiments on standard benchmarks show that our approach outperforms the state of the art on word translation, with the biggest improvements observed for distant language pairs such as English-Chinese.

Submitted 5 September, 2018; v1 submitted 20 April, 2018; originally announced April 2018.

arXiv:1802.06893 [pdf, ps, other]

Learning Word Vectors for 157 Languages

Authors: Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, Tomas Mikolov

Abstract: Distributed word representations, or word vectors, have recently been applied to many tasks in natural language processing, leading to state-of-the-art performance. A key ingredient to the successful application of these representations is to train them on very large corpora, and use these pre-trained models in downstream tasks. In this paper, we describe how we trained such high-quality word representations for 157 languages. We used two sources of data to train these models: the free online encyclopedia Wikipedia and data from the Common Crawl project. We also introduce three new word analogy datasets to evaluate these word vectors, for French, Hindi and Polish. Finally, we evaluate our pre-trained word vectors on 10 languages for which evaluation datasets exist, showing very strong performance compared to previous models.

Submitted 28 March, 2018; v1 submitted 19 February, 2018; originally announced February 2018.

Comments: Accepted to LREC

arXiv:1802.02892 [pdf, ps, other]

Efficient Large-Scale Multi-Modal Classification

Authors: D. Kiela, E. Grave, A. Joulin, T. Mikolov

Abstract: While the incipient internet was largely text-based, the modern digital world is becoming increasingly multi-modal. Here, we examine multi-modal classification where one modality is discrete, e.g. text, and the other is continuous, e.g. visual representations transferred from a convolutional neural network. In particular, we focus on scenarios where we have to be able to classify large quantities of data quickly. We investigate various methods for performing multi-modal fusion and analyze their trade-offs in terms of classification accuracy and computational efficiency. Our findings indicate that the inclusion of continuous information improves performance over text-only on a range of multi-modal classification tasks, even with simple fusion methods. In addition, we experiment with discretizing the continuous features in order to speed up and simplify the fusion process even further. Our results show that fusion with discretized features outperforms text-only classification, at a fraction of the computational cost of full multi-modal fusion, with the additional benefit of improved interpretability.

Submitted 6 February, 2018; originally announced February 2018.

Comments: Published at AAAI-18, 7 pages

arXiv:1712.09405 [pdf, ps, other]

Advances in Pre-Training Distributed Word Representations

Authors: Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, Armand Joulin

Abstract: Many Natural Language Processing applications nowadays rely on pre-trained word representations estimated from large text corpora such as news collections, Wikipedia and Web Crawl. In this paper, we show how to train high-quality word vector representations by using a combination of known tricks that are however rarely used together. The main result of our work is the new set of publicly available pre-trained models that outperform the current state of the art by a large margin on a number of tasks.

Submitted 26 December, 2017; originally announced December 2017.

arXiv:1710.10881 [pdf, ps, other]

Fast Linear Model for Knowledge Graph Embeddings

Authors: Armand Joulin, Edouard Grave, Piotr Bojanowski, Maximilian Nickel, Tomas Mikolov

Abstract: This paper shows that a simple baseline based on a Bag-of-Words (BoW) representation learns surprisingly good knowledge graph embeddings. By casting knowledge base completion and question answering as supervised classification problems, we observe that modeling co-occurrences of entities and relations leads to state-of-the-art performance with a training time of a few minutes using the open sourced library fastText.

Submitted 30 October, 2017; originally announced October 2017.

Comments: Submitted AKBC 2017

arXiv:1703.08864 [pdf, ps, other]

Learning Simpler Language Models with the Differential State Framework

Authors: Alexander G. Ororbia II, Tomas Mikolov, David Reitter

Abstract: Learning useful information across long time lags is a critical and difficult problem for temporal neural models in tasks such as language modeling. Existing architectures that address the issue are often complex and costly to train. The Differential State Framework (DSF) is a simple and high-performing design that unifies previously introduced gated neural models. DSF models maintain longer-term memory by learning to interpolate between a fast-changing data-driven representation and a slowly changing, implicitly stable state. This requires hardly any more parameters than a classical, simple recurrent network. Within the DSF framework, a new architecture is presented, the Delta-RNN. In language modeling at the word and character levels, the Delta-RNN outperforms popular complex architectures, such as the Long Short Term Memory (LSTM) and the Gated Recurrent Unit (GRU), and, when regularized, performs comparably to several state-of-the-art baselines. At the subword level, the Delta-RNN's performance is comparable to that of complex gated architectures.

Submitted 16 July, 2017; v1 submitted 26 March, 2017; originally announced March 2017.

Comments: Edits/revisions applied throughout document

arXiv:1701.08954 [pdf, ps, other]

CommAI: Evaluating the first steps towards a useful general AI

Authors: Marco Baroni, Armand Joulin, Allan Jabri, Germàn Kruszewski, Angeliki Lazaridou, Klemen Simonic, Tomas Mikolov

Abstract: With machine learning successfully applied to new daunting problems almost every day, general AI starts looking like an attainable goal. However, most current research focuses instead on important but narrow applications, such as image classification or machine translation. We believe this to be largely due to the lack of objective ways to measure progress towards broad machine intelligence. In order to fill this gap, we propose here a set of concrete desiderata for general AI, together with a platform to test machines on how well they satisfy such desiderata, while keeping all further complexities to a minimum.

Submitted 27 March, 2017; v1 submitted 31 January, 2017; originally announced January 2017.

Comments: Published in ICLR 2017 Workshop Track

arXiv:1612.03651 [pdf, other]

FastText.zip: Compressing text classification models

Authors: Armand Joulin, Edouard Grave, Piotr Bojanowski, Matthijs Douze, Hérve Jégou, Tomas Mikolov

Abstract: We consider the problem of producing compact architectures for text classification, such that the full model fits in a limited amount of memory. After considering different solutions inspired by the hashing literature, we propose a method built upon product quantization to store word embeddings. While the original technique leads to a loss in accuracy, we adapt this method to circumvent quantization artefacts. Our experiments carried out on several benchmarks show that our approach typically requires two orders of magnitude less memory than fastText while being only slightly inferior with respect to accuracy. As a result, it outperforms the state of the art by a good margin in terms of the compromise between memory usage and accuracy.

Submitted 12 December, 2016; originally announced December 2016.

Comments: Submitted to ICLR 2017

arXiv:1611.06188 [pdf, other]

Variable Computation in Recurrent Neural Networks

Authors: Yacine Jernite, Edouard Grave, Armand Joulin, Tomas Mikolov

Abstract: Recurrent neural networks (RNNs) have been used extensively and with increasing success to model various types of sequential data. Much of this progress has been achieved through devising recurrent units and architectures with the flexibility to capture complex statistics in the data, such as long range dependency or localized attention phenomena. However, while many sequential data (such as video, speech or language) can have highly variable information flow, most recurrent models still consume input features at a constant rate and perform a constant number of computations per time step, which can be detrimental to both speed and model capacity. In this paper, we explore a modification to existing recurrent units which allows them to learn to vary the amount of computation they perform at each step, without prior knowledge of the sequence's time structure. We show experimentally that not only do our models require fewer operations, they also lead to better performance overall on evaluation tasks.

Submitted 2 March, 2017; v1 submitted 18 November, 2016; originally announced November 2016.

arXiv:1607.04606 [pdf, other]

Enriching Word Vectors with Subword Information

Authors: Piotr Bojanowski, Edouard Grave, Armand Joulin, Tomas Mikolov

Abstract: Continuous word representations, trained on large unlabeled corpora are useful for many natural language processing tasks. Popular models that learn such representations ignore the morphology of words, by assigning a distinct vector to each word. This is a limitation, especially for languages with large vocabularies and many rare words. In this paper, we propose a new approach based on the skipgram model, where each word is represented as a bag of character $n$-grams. A vector representation is associated to each character $n$-gram; words being represented as the sum of these representations. Our method is fast, allowing to train models on large corpora quickly and allows us to compute word representations for words that did not appear in the training data. We evaluate our word representations on nine different languages, both on word similarity and analogy tasks. By comparing to recently proposed morphological word representations, we show that our vectors achieve state-of-the-art performance on these tasks.

Submitted 19 June, 2017; v1 submitted 15 July, 2016; originally announced July 2016.

Comments: Accepted to TACL. The two first authors contributed equally
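
The subword idea in the abstract above (a word represented as a bag of character n-grams, with the word vector being the sum of the n-gram vectors) can be sketched as follows. This is only an illustration, not the authors' implementation: the boundary markers "<" and ">", the default n-gram range of 3 to 6, and the names `char_ngrams`, `word_vector` and `table` are all assumptions introduced here.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of a word.

    The "<" and ">" boundary markers and the 3..6 default range are
    assumptions made for this sketch; the abstract does not specify them.
    """
    token = f"<{word}>"
    return [
        token[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(token) - n + 1)
    ]


def word_vector(word, table, dim, n_min=3, n_max=6):
    """Sum the vectors of all n-grams of `word` found in `table`.

    `table` is a hypothetical dict mapping n-gram strings to lists of
    floats of length `dim`. N-grams missing from the table are skipped,
    which is how a vector can still be built for an unseen word.
    """
    vec = [0.0] * dim
    for gram in char_ngrams(word, n_min, n_max):
        if gram in table:
            vec = [a + b for a, b in zip(vec, table[gram])]
    return vec
```

Because the vector depends only on n-grams, a word that never appeared in training still receives a representation as long as some of its n-grams are in the table.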

arXiv:1607.01759 [pdf, ps, other]

Bag of Tricks for Efficient Text Classification

Authors: Armand Joulin, Edouard Grave, Piotr Bojanowski, Tomas Mikolov

Abstract: This paper explores a simple and efficient baseline for text classification. Our experiments show that our fast text classifier fastText is often on par with deep learning classifiers in terms of accuracy, and many orders of magnitude faster for training and evaluation. We can train fastText on more than one billion words in less than ten minutes using a standard multicore CPU, and classify half a million sentences among 312K classes in less than a minute.

Submitted 9 August, 2016; v1 submitted 6 July, 2016; originally announced July 2016.

arXiv:1511.08130 [pdf, other]

A Roadmap towards Machine Intelligence

Authors: Tomas Mikolov, Armand Joulin, Marco Baroni

Abstract: The development of intelligent machines is one of the biggest unsolved challenges in computer science. In this paper, we propose some fundamental properties these machines should have, focusing in particular on communication and learning. We discuss a simple environment that could be used to incrementally teach a machine the basics of natural-language-based communication, as a prerequisite to more complex interaction with human users. We also present some conjectures on the sort of algorithms the machine should support in order to profitably learn from the environment.

Submitted 26 February, 2016; v1 submitted 25 November, 2015; originally announced November 2015.

arXiv:1511.07275 [pdf, other]

Learning Simple Algorithms from Examples

Authors: Wojciech Zaremba, Tomas Mikolov, Armand Joulin, Rob Fergus

Abstract: We present an approach for learning simple algorithms such as copying, multi-digit addition and single digit multiplication directly from examples. Our framework consists of a set of interfaces, accessed by a controller. Typical interfaces are 1-D tapes or 2-D grids that hold the input and output data. For the controller, we explore a range of neural network-based models which vary in their ability to abstract the underlying algorithm from training instances and generalize to test examples with many thousands of digits. The controller is trained using $Q$-learning with several enhancements and we show that the bottleneck is in the capabilities of the controller rather than in the search incurred by $Q$-learning.

Submitted 23 November, 2015; v1 submitted 23 November, 2015; originally announced November 2015.

arXiv:1511.06303 [pdf, ps, other]

Alternative structures for character-level RNNs

Authors: Piotr Bojanowski, Armand Joulin, Tomas Mikolov

Abstract: Recurrent neural networks are convenient and efficient models for language modeling. However, when applied on the level of characters instead of words, they suffer from several problems. In order to successfully model long-term dependencies, the hidden representation needs to be large. This in turn implies higher computational costs, which can become prohibitive in practice. We propose two alternative structural modifications to the classical RNN model. The first one consists of conditioning the character level representation on the previous word representation. The other one uses the character history to condition the output probability. We evaluate the performance of the two proposed modifications on challenging, multi-lingual real world data.

Submitted 24 November, 2015; v1 submitted 19 November, 2015; originally announced November 2015.

Comments: First revision. Updated Table 3, extended Sec. 5.3 and added a paragraph to the conclusion.

arXiv:1503.01007 [pdf, other]

Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets

Authors: Armand Joulin, Tomas Mikolov

Abstract: Despite the recent achievements in machine learning, we are still very far from achieving real artificial intelligence. In this paper, we discuss the limitations of standard deep learning approaches and show that some of these limitations can be overcome by learning how to grow the complexity of a model in a structured way. Specifically, we study the simplest sequence prediction problems that are beyond the scope of what is learnable with standard recurrent networks, algorithmically generated sequences which can only be learned by models which have the capacity to count and to memorize sequences. We show that some basic algorithms can be learned from sequential data using a recurrent network associated with a trainable memory.

Submitted 1 June, 2015; v1 submitted 3 March, 2015; originally announced March 2015.

arXiv:1502.05698 [pdf, ps, other]

Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks

Authors: Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M. Rush, Bart van Merriënboer, Armand Joulin, Tomas Mikolov

Abstract: One long-term goal of machine learning research is to produce methods that are applicable to reasoning and natural language, in particular building an intelligent dialogue agent. To measure progress towards that goal, we argue for the usefulness of a set of proxy tasks that evaluate reading comprehension via question answering. Our tasks measure understanding in several ways: whether a system is able to answer questions via chaining facts, simple induction, deduction and many more. The tasks are designed to be prerequisites for any system that aims to be capable of conversing with a human. We believe many existing learning systems can currently not solve them, and hence our aim is to classify these tasks into skill sets, so that researchers can identify (and then rectify) the failings of their systems. We also extend and improve the recently introduced Memory Networks model, and show it is able to solve some, but not all, of the tasks.

Submitted 31 December, 2015; v1 submitted 19 February, 2015; originally announced February 2015.

arXiv:1412.7753 [pdf, other]

Learning Longer Memory in Recurrent Neural Networks

Authors: Tomas Mikolov, Armand Joulin, Sumit Chopra, Michael Mathieu, Marc'Aurelio Ranzato

Abstract: Recurrent neural network is a powerful model that learns temporal patterns in sequential data. For a long time, it was believed that recurrent networks are difficult to train using simple optimizers, such as stochastic gradient descent, due to the so-called vanishing gradient problem. In this paper, we show that learning longer term patterns in real data, such as in natural language, is perfectly possible using gradient descent. This is achieved by using a slight structural modification of the simple recurrent neural network architecture. We encourage some of the hidden units to change their state slowly by making part of the recurrent weight matrix close to identity, thus forming kind of a longer term memory. We evaluate our model in language modeling experiments, where we obtain similar performance to the much more complex Long Short Term Memory (LSTM) networks (Hochreiter & Schmidhuber, 1997).

Submitted 16 April, 2015; v1 submitted 24 December, 2014; originally announced December 2014.

arXiv:1412.5335 [pdf, ps, other]

Ensemble of Generative and Discriminative Techniques for Sentiment Analysis of Movie Reviews

Authors: Grégoire Mesnil, Tomas Mikolov, Marc'Aurelio Ranzato, Yoshua Bengio

Abstract: Sentiment analysis is a common task in natural language processing that aims to detect polarity of a text document (typically a consumer review). In the simplest settings, we discriminate only between positive and negative sentiment, turning the task into a standard binary classification problem. We compare several machine learning approaches to this problem, and combine them to achieve the best possible results. We show how to use for this task the standard generative language models, which are slightly complementary to the state of the art techniques. We achieve strong results on a well-known dataset of IMDB movie reviews. Our results are easily reproducible, as we publish also the code needed to repeat the experiments. This should simplify further advance of the state of the art, as other researchers can combine their techniques with ours with little effort.

Submitted 27 May, 2015; v1 submitted 17 December, 2014; originally announced December 2014.

arXiv:1405.4053 [pdf, ps, other]

Distributed Representations of Sentences and Documents

Authors: Quoc V. Le, Tomas Mikolov

Abstract: Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, "powerful," "strong" and "Paris" are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.

Submitted 22 May, 2014; v1 submitted 16 May, 2014; originally announced May 2014.

arXiv:1312.5650 [pdf, other]

Zero-Shot Learning by Convex Combination of Semantic Embeddings

Authors: Mohammad Norouzi, Tomas Mikolov, Samy Bengio, Yoram Singer, Jonathon Shlens, Andrea Frome, Greg S. Corrado, Jeffrey Dean

Abstract: Several recent publications have proposed methods for mapping images into continuous semantic embedding spaces. In some cases the embedding space is trained jointly with the image transformation. In other cases the semantic embedding space is established by an independent natural language processing task, and then the image transformation into that space is learned in a second stage. Proponents of these image embedding systems have stressed their advantages over the traditional $n$-way classification framing of image understanding, particularly in terms of the promise for zero-shot learning -- the ability to correctly annotate images of previously unseen object categories. In this paper, we propose a simple method for constructing an image embedding system from any existing $n$-way image classifier and a semantic word embedding model, which contains the $n$ class labels in its vocabulary. Our method maps images into the semantic embedding space via convex combination of the class label embedding vectors, and requires no additional training. We show that this simple and direct method confers many of the advantages associated with more complex image embedding schemes, and indeed outperforms state of the art methods on the ImageNet zero-shot learning task.

Submitted 21 March, 2014; v1 submitted 19 December, 2013; originally announced December 2013.
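
As a rough illustration of the mapping described in the abstract above, the sketch below places an image in the embedding space as a convex combination of class label embedding vectors. Using the classifier's predicted probabilities as the combination weights is an assumption made here, since the abstract does not state how the weights are chosen; `convex_combination`, `probs` and `label_vectors` are hypothetical names.

```python
def convex_combination(probs, label_vectors):
    """Combine class label embeddings with non-negative weights summing to 1.

    `probs` are hypothetical classifier scores for the known classes
    (assumed non-negative with a positive sum); `label_vectors` are the
    word embeddings of the same classes, in the same order.
    """
    total = sum(probs)
    weights = [p / total for p in probs]  # renormalise so weights sum to 1
    dim = len(label_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, label_vectors))
        for d in range(dim)
    ]
```

No parameters are fitted in this step, which matches the abstract's statement that the method requires no additional training.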

arXiv:1312.3005 [pdf, ps, other]

One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling

Authors: Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, Tony Robinson

Abstract: We propose a new benchmark corpus to be used for measuring progress in statistical language modeling. With almost one billion words of training data, we hope this benchmark will be useful to quickly evaluate novel language modeling techniques, and to compare their contribution when combined with other advanced techniques. We show performance of several well-known types of language models, with the best results achieved with a recurrent neural network based language model. The baseline unpruned Kneser-Ney 5-gram model achieves perplexity 67.6; a combination of techniques leads to 35% reduction in perplexity, or 10% reduction in cross-entropy (bits), over that baseline. The benchmark is available as a code.google.com project; besides the scripts needed to rebuild the training/held-out data, it also makes available log-probability values for each word in each of ten held-out data sets, for each of the baseline n-gram models.

Submitted 4 March, 2014; v1 submitted 10 December, 2013; originally announced December 2013.

Comments: Accompanied by a code.google.com project allowing anyone to generate the benchmark data, and use it to compare their language model against the ones described in the paper
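
The two figures quoted in the abstract above (a 35% reduction in perplexity and a 10% reduction in cross-entropy) describe the same improvement, because cross-entropy in bits is the base-2 logarithm of perplexity. The short check below assumes the reduction is exactly 35% from the stated baseline of 67.6; the function and variable names are illustrative only.

```python
import math


def cross_entropy_bits(perplexity):
    """Cross-entropy in bits per word corresponding to a given perplexity."""
    return math.log2(perplexity)


baseline_ppl = 67.6                       # baseline perplexity from the abstract
improved_ppl = baseline_ppl * (1 - 0.35)  # assumes exactly a 35% reduction

baseline_bits = cross_entropy_bits(baseline_ppl)
improved_bits = cross_entropy_bits(improved_ppl)

# Relative reduction in cross-entropy implied by the perplexity reduction.
relative_reduction = 1 - improved_bits / baseline_bits
print(f"{baseline_bits:.3f} bits -> {improved_bits:.3f} bits "
      f"({relative_reduction:.1%} reduction)")
```

Under these assumptions the relative reduction comes out close to the 10% stated in the abstract.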

arXiv:1310.4546 [pdf, ps, other]

Distributed Representations of Words and Phrases and their Compositionality

Authors: Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean

Abstract: The recently introduced continuous Skip-gram model is an efficient method for learning high-quality distributed vector representations that capture a large number of precise syntactic and semantic word relationships. In this paper we present several extensions that improve both the quality of the vectors and the training speed. By subsampling of the frequent words we obtain significant speedup and also learn more regular word representations. We also describe a simple alternative to the hierarchical softmax called negative sampling. An inherent limitation of word representations is their indifference to word order and their inability to represent idiomatic phrases. For example, the meanings of "Canada" and "Air" cannot be easily combined to obtain "Air Canada". Motivated by this example, we present a simple method for finding phrases in text, and show that learning good vector representations for millions of phrases is possible.

Submitted 16 October, 2013; originally announced October 2013.

arXiv:1309.4168 [pdf, other]

Exploiting Similarities among Languages for Machine Translation

Authors: Tomas Mikolov, Quoc V. Le, Ilya Sutskever

Abstract: Dictionaries and phrase tables are the basis of modern statistical machine translation systems. This paper develops a method that can automate the process of generating and extending dictionaries and phrase tables. Our method can translate missing word and phrase entries by learning language structures based on large monolingual data and mapping between languages from small bilingual data. It uses distributed representation of words and learns a linear mapping between vector spaces of languages. Despite its simplicity, our method is surprisingly effective: we can achieve almost 90% precision@5 for translation of words between English and Spanish. This method makes little assumption about the languages, so it can be used to extend and refine dictionaries and translation tables for any language pairs.

Submitted 16 September, 2013; originally announced September 2013.

arXiv:1301.3781 [pdf, other]

Efficient Estimation of Word Representations in Vector Space

Authors: Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean

Abstract: We propose two novel model architectures for computing continuous vector representations of words from very large data sets. The quality of these representations is measured in a word similarity task, and the results are compared to the previously best performing techniques based on different types of neural networks. We observe large improvements in accuracy at much lower computational cost, i.e. it takes less than a day to learn high quality word vectors from a 1.6 billion words data set. Furthermore, we show that these vectors provide state-of-the-art performance on our test set for measuring syntactic and semantic word similarities.

Submitted 6 September, 2013; v1 submitted 16 January, 2013; originally announced January 2013.

arXiv:1211.5063 [pdf, other]

On the difficulty of training Recurrent Neural Networks

Authors: Razvan Pascanu, Tomas Mikolov, Yoshua Bengio

Abstract: There are two widely known issues with properly training Recurrent Neural Networks, the vanishing and the exploding gradient problems detailed in Bengio et al. (1994). In this paper we attempt to improve the understanding of the underlying issues by exploring these problems from an analytical, a geometric and a dynamical systems perspective. Our analysis is used to justify a simple yet effective solution. We propose a gradient norm clipping strategy to deal with exploding gradients and a soft constraint for the vanishing gradients problem. We validate empirically our hypothesis and proposed solutions in the experimental section.

Submitted 15 February, 2013; v1 submitted 21 November, 2012; originally announced November 2012.

Comments: Improved description of the exploding gradient problem and description and analysis of the vanishing gradient problem

Showing 1–42 of 42 results for author: Mikolov, T