GPT-2

Generative Pre-trained Transformer 2 (GPT-2) is an open-source artificial intelligence created by OpenAI in February 2019.[1][2][3][4][5][6][7][8] A transformer machine learning model, GPT-2 uses deep learning to translate text, answer questions, summarize passages,[9] and generate text output on a level that, while sometimes indistinguishable from that of humans, can become repetitive or nonsensical when generating long passages.[1] It is a general-purpose learner; it was not specifically trained to do any of these tasks, and its ability to perform them is an extension of its general ability to accurately synthesize the next item in an arbitrary sequence.[7][9]

Generative Pre-trained Transformer 2 (GPT-2)
Original author(s): OpenAI
Initial release: 14 February 2019
Repository: https://github.com/openai/gpt-2
Type: Transformer language model
Website: www.openai.com/blog/gpt-2-1-5b-release/

The model was a successor to OpenAI's 2018 GPT model,[10] a "direct scale-up" of the earlier model with a 10-fold increase in both its parameter count and the size of its training dataset.[8] OpenAI released the complete version of the GPT-2 language model (with 1.5 billion parameters) in November 2019.[11] GPT-2 was followed by the 175-billion-parameter GPT-3, revealed to the public in 2020[12] (whose source code has never been made available; access to GPT-3 is provided through an API offered by OpenAI, while only Microsoft has access to the underlying code).[13]

Background

Since the origins of computing, artificial intelligence has been an object of study; the "imitation game", postulated by Alan Turing in 1950 (and often called the "Turing test") proposed to establish an electronic or mechanical system's capacity for intelligent action by an evaluator's ability to distinguish its behavior from that of a human.[14] The term "machine learning" was first used to describe a possible approach to artificial intelligence as early as 1959 by IBM researcher Arthur Samuel;[15] current use of the term encompasses a broad variety of statistical learning, data science and neural network approaches to computational problems (often falling under the aegis of artificial intelligence).

Computational linguistics

Natural language processing using computers, a task originally conceived as a subfield of computational linguistics, was attempted as soon as computing hardware had the capacity; the first application of a dictionary look-up table was developed at Birkbeck College in London in 1948.[16] The 1954 Georgetown experiment was a demonstration of fully automated machine translation, in which sixty Russian sentences were translated into English (mostly by replacement of words with their English synonyms).[17][18] The translations were often crude; the system had only 6 grammar rules and a 250-word vocabulary,[19] and no attempt was made to analyze or translate syntactic structure.[20] However, the experiment proved to the public that computers could interpret and process natural language,[21] and secured CIA funding for further research.[17] Direct substitution remains a benchmark against which machine translation programs are evaluated.

Systems for using natural language in human-computer interaction also began to emerge in the mid-20th century. SHRDLU, a program developed at MIT in 1968–1970, consisted of a virtual environment of several objects which a user interacted with through commands in natural language (e.g. "Find a block which is taller than the one you are holding and put it into the box").[22][23] ELIZA, a chatterbot written in 1966, analyzed a human interlocutor's text for keywords and provided conversationally appropriate responses.[24] While many subjects claimed an inability to distinguish ELIZA's conversation from that of a human, the question of whether this constituted intelligence proved contentious (the most famous script parodied a psychotherapist by, largely, repeating what the user had said back to them).[25]

While initial attempts at machine translation had been purely computational, by the 1950s the dominant approach to computational linguistics had come to emphasize Noam Chomsky's concept of universal grammar;[16] NLP research in that era, accordingly, consisted largely of attempts to reduce statements in arbitrary languages to putative underlying language-agnostic logical structures. In the 1970s, semantic NLP systems would begin to eschew syntactic encodings in favor of more general semantic encodings.[26] However, until the advent of neural networks, most systems continued to rely on large (and increasingly unwieldy) sets of manually programmed rules, which failed to scale up as initially predicted.[16]

The field of artificial intelligence continued to develop in the late 20th century, but occasional periods of stagnation known as "AI winters" occurred. Various sources posit AI winters as having occurred at different times; in 1994, Howe described one as having started in 1973 and lasting a decade,[27] while Russell & Norvig in 2003 described another as starting soon after 1988.[28]

Neural networks

An early concept in artificial intelligence, connectionism, sought to produce intelligent behavior through artificial neural networks designed to simulate the behavior of neurons in biological brains. The first example of an artificial neural network was the SNARC, built in 1951. The perceptron (a type of binary classifier) was introduced in 1957 by psychologist Frank Rosenblatt;[29] his machine was designed for image recognition using 400 photocells connected to "neurons", with weightings determined by potentiometers (and adjusted with electric motors during its learning process).[30] Perceptron systems became the subject of great interest; a New York Times article described the perceptron as "the embryo of an electronic computer that [the Navy] expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence".[31] Perceptron systems, however, fell out of favor for decades following a 1969 book by Marvin Minsky and Seymour Papert (Perceptrons: an introduction to computational geometry),[32] which pointed out several shortcomings of the then-present state of the art (single-layer perceptrons), including an inability to encode the exclusive or (XOR) function. The book was considered, at the time, to discredit the perceptron approach (as well as neural networks in general) as a promising area of research.[31]
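The XOR limitation of single-layer perceptrons can be illustrated with a minimal sketch. The training rule below is the textbook error-correction rule, and the function names, learning rate and epoch count are illustrative assumptions rather than a reconstruction of Rosenblatt's hardware. The same procedure that learns AND perfectly never classifies all four XOR cases correctly, because no single straight line separates them.

```python
def predict(w, b, x1, x2):
    """Single-layer perceptron: fire (1) if the weighted sum exceeds zero."""
    return 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0

def train_perceptron(samples, epochs=50, lr=1.0):
    """Train with the classic error-correction rule (illustrative settings)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            err = target - predict(w, b, x1, x2)
            # Nudge weights and bias toward the correct answer.
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

def accuracy(samples, w, b):
    hits = sum(predict(w, b, x1, x2) == t for (x1, x2), t in samples)
    return hits / len(samples)

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_acc = accuracy(AND, *train_perceptron(AND))
xor_acc = accuracy(XOR, *train_perceptron(XOR))
print(and_acc, xor_acc)
```

AND is linearly separable, so training converges; XOR is not, so at least one of the four cases is always misclassified however long training runs.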

Neural networks become capable of classifying different inputs (i.e. sorting them into distinct categories) through a process known as "learning". This begins with the network's weights (the amount by which each neuron's "activation" influences the activation of each specific neuron in the subsequent layer) being initialized to random quantities; in this state, the output of the network is similarly random. An objective function, like a loss function, is defined, which is capable of quantitatively measuring how close the output of the network is to its desired performance (for example, how often an input consisting of a handwritten number results in the sole activation of the output neuron corresponding to that number).[33] From this, and from the performance of the network, the weights can be adjusted in order to improve its performance.[34]

Backpropagation, a supervised algorithm first applied to machine learning systems in Paul Werbos' 1974 dissertation,[35] efficiently calculates "gradients", which are vector fields describing the optimal adjustment of all weights in the entire network for a given input/output example.[34][33] The use of these gradients to train neural networks, a practice known as gradient descent, enabled the creation of much more complex systems, and wide-scale application of neural networks to natural language processing would occur in the 1980s.[36][28] In 1985, D.B. Parker would rediscover Werbos' method;[37] in 1986, Rumelhart, Hinton and Williams would apply it to generate internal representations of incoming data in neural networks with hidden layers,[38] referred to as "deep learning" networks; this research would later form the basis for recurrent neural networks.
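The learning loop described above (random initial weights, a loss function, and repeated adjustment along the gradient) can be sketched for a toy model with a single weight. This is an illustrative assumption, not a full backpropagation implementation: with only one weight the gradient is written out by hand, whereas backpropagation applies the chain rule to obtain the same kind of gradient for every weight in a multi-layer network. The data, learning rate and step count are arbitrary choices.

```python
import random

def train(data, lr=0.05, steps=200, seed=0):
    """Fit y = w * x by gradient descent on mean squared error."""
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)  # random initial weight
    for _ in range(steps):
        # d/dw of mean((w*x - y)^2) is mean(2 * (w*x - y) * x)
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad  # step against the gradient to reduce the loss
    return w

# Toy data generated by the rule y = 3x, so the ideal weight is 3.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w = train(data)
print(w)
```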

Traditional feed-forward neural networks are so named because each layer takes in output from the previous layer, and feeds it into the next; a FFNN's structure contains no "cycles" where information flows backwards. In contrast, a recurrent neural network (RNN) has at least one cycle of activation flow.[33] RNNs are often used for processing sequences of data (and predicting future sequence items), since the network can process each item using both the item itself and its own output from processing the previous item.[33]
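A minimal sketch of the recurrence, assuming a single scalar hidden state and hand-picked weights (w_x, w_h and the tanh activation are illustrative choices, not taken from the cited sources): because each step feeds the previous hidden state back in, the same final input produces a different output depending on what preceded it.

```python
import math

def rnn_run(inputs, w_x=0.5, w_h=0.8, h0=0.0):
    """Run a one-unit recurrent network over a sequence of numbers."""
    h = h0
    states = []
    for x in inputs:
        # The new state depends on the current item AND the previous state.
        h = math.tanh(w_x * x + w_h * h)
        states.append(h)
    return states

# Both sequences end in the same item (1.0) but have different histories.
a = rnn_run([1.0, 1.0])
b = rnn_run([-1.0, 1.0])
print(a[-1], b[-1])
```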

The neocognitron, proposed by Kunihiko Fukushima in 1979[39] based on models of neural architecture in the mammalian visual cortex, provided the basis for convolutional neural networks (CNNs),[40] often used in image processing. By "sliding" a small layer over a larger input, a CNN can perform deeper processing with less computation. For example, a 100×100 image has 10,000 pixels, so each neuron in a fully connected layer would require 10,000 weights to process it; a convolutional layer consisting of a 5×5 "window" sliding over the image can perform edge detection using only 25 learnable parameters. Convolutional layers are combined by "pooling layers", and processed by "fully connected" layers (which are typically multilayer perceptrons).
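The 5×5 window example can be sketched in plain Python. The kernel values and the synthetic image below are illustrative assumptions (a hand-written vertical-edge detector rather than learned weights); the point is that the same 25 numbers are reused at every position of the 100×100 input.

```python
def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
    return out

# 100x100 image: dark left half, bright right half (edge at column 50).
image = [[0.0 if c < 50 else 1.0 for c in range(100)] for _ in range(100)]
# 5x5 kernel that responds to dark-to-bright transitions from left to right.
kernel = [[-1.0, -1.0, 0.0, 1.0, 1.0] for _ in range(5)]

feature_map = conv2d_valid(image, kernel)
n_params = sum(len(row) for row in kernel)
print(n_params, len(feature_map), len(feature_map[0]))
```

The response is zero over the flat regions and peaks where the window straddles the edge.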

Machine learning for natural language processing

Due to their ability to process sequential information, recurrent neural networks have seen use in many NLP applications; unlike FFNNs, they are capable of encoding different weights (and giving different output) for identical items based on their surroundings in a sequence — that is to say, a RNN system that parsed one word at a time could still associate a "black dog" with fuzzy paws, a "corn dog" with ketchup, and a "sun dog" with refraction. Moreover, since the retention of information from previous sequence items can be performed recursively, RNN systems can be designed that recall items arbitrarily far back in a sequence: for example, being able to continue the sequences "Tom looked at the black dog", "Tom looked at the corn dog", and "Tom looked at the sun dog" with "fondly", "hungrily", and "indirectly", respectively.[41][42]

While capable of impressive solutions, many-layered FFNNs and RNNs both proved vulnerable to the vanishing gradient problem: since gradients (encoded as finite-precision numbers) are required to backpropagate across all layers of a model, they can "vanish" to zero (or "explode" to infinity) over a sufficiently large number of layers. The long short-term memory network (LSTM), first proposed by Sepp Hochreiter and Jürgen Schmidhuber in 1995–1997,[43][44][45] sought to resolve this issue by introducing a novel architecture consisting of multiple distinct "cells" with "input", "output" and "forget" gates. In 2009, an LSTM-based model submitted by Alex Graves' team won the ICDAR competition for handwriting recognition;[46] another was the most accurate model in the competition and a third was the fastest.[47]
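The effect of finite precision can be seen in a deliberately simplified sketch. It assumes that each layer scales the backpropagated gradient by one constant factor; real networks multiply by layer-specific Jacobian matrices, so the function name and the factors 0.5 and 2.0 are illustrative only.

```python
def backpropagated_scale(per_layer_factor, n_layers):
    """Multiply a gradient of 1.0 by the same factor once per layer."""
    scale = 1.0
    for _ in range(n_layers):
        scale *= per_layer_factor
    return scale

# Factors below 1 shrink the gradient; factors above 1 grow it.
vanishing = backpropagated_scale(0.5, 2000)  # underflows in 64-bit floats
exploding = backpropagated_scale(2.0, 2000)  # overflows in 64-bit floats
print(vanishing, exploding)
```

With 64-bit floating-point numbers the first product underflows to exactly zero and the second overflows to infinity, so no useful weight update survives in either case.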

Another issue RNNs and LSTMs encounter is that they can only take into account the context of previous sequence items.[41][48] This can create issues when parsing sentences like "Tom rode his bike to the store, put out the kickstand, and turned off the engine", in which the necessary context of the "bike" being a motorcycle is revealed only at the end. One method of solving problems like this is the bidirectional LSTM, which proceeds in both directions simultaneously, giving access to both "past" and "future" input features.[41] Conditional random fields connect inputs directly to outputs.[41] There exist combinations of the above approaches, like the LSTM-CRF network and the BI-LSTM-CRF network.[41] Other improvements on the RNN model include neural Turing machines, adaptive computation time, neural programmers, and attention mechanisms, the latter of which form the basis for GPT-2 and related technologies.[42]

Selective focusing

By the early 2010s, the state of the art in neural machine translation was the encoder–decoder model, in which a RNN or LSTM "encoder network" encoded source sentences into vectors, and a "decoder network" of similar architecture processed these vectors into translated output.[49] 2014 saw the introduction of significantly more complex "attention" mechanisms, which vastly augmented these models' performance. Attention mechanisms gave these models the ability to adaptively focus their decoder networks' "attention" on specific aspects of the source text, rather than forcing them to parse the entire text as one vector.[49][50]

2017 then saw the introduction of "transformer" models, which went a step further by using attention mechanisms to replace the RNN/LSTM architecture entirely.[51][42]

Attention mechanisms

One constraint of encoder–decoder models was the difficulty of compressing the encodings of larger sentences into fixed-length vectors; performance often deteriorated on larger inputs. In 2014, Bahdanau et al.[49] introduced an extension to the encoder–decoder model that could "align and translate jointly".[50] For each word of the source sentence that was translated, the Bahdanau model's encoder (a bidirectional RNN with 1000 hidden units in each direction) searched the entire rest of that sentence for the positions of relevant information. Rather than giving the decoder a fixed-length vector encoding of the entire input sequence (like previous models), it produced "context vectors", associated with those positions as well as previously generated target words.[49] The decoder (which also had 1000 hidden units) then used these context vectors to decide where to focus its "attention".[49][50][42]
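The idea of a context vector can be sketched as a weighted average of encoder states. One loud assumption: the relevance scores below are plain dot products, swapped in for brevity, whereas the Bahdanau model scored relevance with a small learned alignment network. The vectors and function names are illustrative.

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(decoder_state, encoder_states):
    """Weight each encoder state by its relevance to the decoder state."""
    # Assumption: dot-product scoring instead of a learned alignment model.
    scores = [
        sum(d * e for d, e in zip(decoder_state, enc))
        for enc in encoder_states
    ]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * enc[k] for w, enc in zip(weights, encoder_states))
        for k in range(dim)
    ]
    return weights, context

# One vector per source word (toy values).
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = context_vector([2.0, 0.0], encoder_states)
print(weights, context)
```

Source positions that match the decoder state receive most of the weight, so the context vector changes for every target word instead of being a single fixed-length summary.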

Research into "attention" mechanisms was continued by Luong et al. in a 2015 paper.[50] A "global" approach based on the Bahdanau paper was attempted, as well as a "local" approach wherein only a subset of source words was "considered" at a time; the local approach, while more architecturally complicated, was less computationally expensive and easier to train.[50] It took 7–10 days to fully train an English–German translation model, which was specifically designed to be capable of translating 1,000 target words per second; its accuracy was tested against the 2014 ACL Workshop on Machine Translation (WMT'14) benchmark for English–German sentence pairs, and achieved a result of 23.0 BLEU, a 2.1 BLEU improvement on the previous state of the art, a phrase-based language model from Buck et al. 2014.[52][50]

Transformers

While attention mechanisms were effective in improving performance when used to augment existing convolutional and recurrent neural network architectures, it was soon discovered that performant models could be built using attention mechanisms on their own, without anything else underlying them.[51]

In June 2017, the transformer architecture was first introduced, in a paper released by researchers at Google Brain.[51] Transformers are a type of model based solely on attention mechanisms, discarding convolution and recurrence altogether. Unlike previous RNN-based models, transformers can process sequential input without needing to perform computation on each item in sequence; this means they can be massively parallelized.[51] On the WMT'14 English-to-French benchmark, a specifically trained translation model using the transformer architecture was able to establish a new single-model state of the art score of 41.8 BLEU.[51] Since their introduction, transformers have seen use in many NLP applications.[53]
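A minimal sketch of self-attention shows why no step-by-step recurrence is needed. It assumes the simplest possible setting: the token vectors serve directly as queries, keys and values (real transformers apply learned projections and use several attention heads), and scores are scaled by the square root of the vector dimension. Each output row depends only on the inputs, never on another position's output, so the rows could be computed in parallel.

```python
import math

def self_attention(x):
    """Scaled dot-product self-attention without learned projections."""
    d = len(x[0])
    outputs = []
    for q in x:  # each position is computed independently of the others
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)]
        )
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = self_attention(tokens)
print(out)
```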

Architecture

On June 11, 2018, OpenAI released a paper entitled "Improving Language Understanding by Generative Pre-Training", in which they introduced the Generative Pre-trained Transformer (GPT).[10] At this point, state-of-the-art neural NLP models primarily employed supervised learning from large amounts of manually labeled data. This reliance on supervised learning limited their use on datasets that were not well-annotated, in addition to making it prohibitively expensive and time-consuming to train extremely large models;[10][54] many languages (such as Swahili or Haitian Creole) are difficult to translate and interpret using such models due to a lack of available text for corpus-building.[54] In contrast, GPT's "semi-supervised" approach involved two stages: an unsupervised generative "pre-training" stage in which a language modeling objective was used to set initial parameters, and a supervised discriminative "fine-tuning" stage in which these parameters were adapted to a target task.[10]

The use of a transformer architecture, as opposed to previous techniques involving attention-augmented RNNs, provided GPT with a more structured memory than could be achieved through recurrent mechanisms; this resulted in "robust transfer performance across diverse tasks".[10]

During transfer, we utilize task-specific input adaptations derived from traversal-style approaches, which process structured text input as a single contiguous sequence of tokens.[10]

The unsupervised pre-training was performed using BooksCorpus, a dataset of over 7,000 unpublished fiction books from various genres; this dataset was chosen in part because its long passages of continuous text conditioned the model to handle long-range information. Other available datasets, while larger, were rejected on the basis that they lacked this long-range structure (being "shuffled" at a sentence level).[10]

We trained a 12-layer decoder-only transformer with masked self-attention heads (768 dimensional states and 12 attention heads). For the position-wise feed-forward networks, we used 3072 dimensional inner states. We used the Adam optimization scheme with a max learning rate of 2.5e-4. The learning rate was increased linearly from zero over the first 2000 updates and annealed to 0 using a cosine schedule. We train for 100 epochs on minibatches of 64 randomly sampled, contiguous sequences of 512 tokens. Since layernorm is used extensively throughout the model, a simple weight initialization of N(0,0.02) was sufficient. We used a bytepair encoding (BPE) vocabulary with 40,000 merges [53] and residual, embedding, and attention dropouts with a rate of 0.1 for regularization. We also employed a modified version of L2 regularization proposed in Loshchilov et al. 2017, with w = 0.01 on all non bias or gain weights. For the activation function, we used the Gaussian Error Linear Unit (GELU). We used learned position embeddings instead of the sinusoidal version proposed in the original work. We use the ftfy library to clean the raw text in BooksCorpus, standardize some punctuation and whitespace, and use the spaCy tokenizer. [...] Unless specified, we reuse the hyperparameter settings from unsupervised pre-training. We add dropout to the classifier with a rate of 0.1. For most tasks, we use a learning rate of 6.25e-5 and a batchsize of 32. Our model finetunes quickly and 3 epochs of training was sufficient for most cases. We use a linear learning rate decay schedule with warmup over 0.2% of training. λ was set to 0.5.
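The pre-training learning-rate schedule quoted above (linear warmup from zero over the first 2000 updates to a maximum of 2.5e-4, then cosine annealing to 0) can be sketched as follows. The total number of updates is not given in the quoted passage, so total_steps is a hypothetical parameter, and the half-cosine formula is the common textbook form of cosine annealing rather than code from the paper.

```python
import math

def learning_rate(step, total_steps, max_lr=2.5e-4, warmup_steps=2000):
    """Linear warmup followed by cosine annealing to zero."""
    if step < warmup_steps:
        # Warmup: rise linearly from 0 to max_lr.
        return max_lr * step / warmup_steps
    # Annealing: follow half a cosine wave from max_lr down to 0.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))

TOTAL = 100_000  # hypothetical number of updates, for illustration only
for step in (0, 1000, 2000, 51_000, TOTAL):
    print(step, learning_rate(step, TOTAL))
```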

While GPT's fine-tuning was adapted to specific tasks, its pre-training was not; to perform the various tasks, minimal changes were made to its underlying task-agnostic model architecture.[10] Despite this, GPT still improved on the state of the art in several language processing tasks, outperforming discriminatively-trained models with task-oriented architectures on a number of diverse benchmarks.[10]

On natural language inference (also known as textual entailment) tasks, models are evaluated on their ability to interpret pairs of sentences from various datasets and classify the relationship between them as "entailment", "contradiction" or "neutral".[10] Examples of such datasets include QNLI (Wikipedia articles) and MultiNLI (transcribed speech, popular fiction and government reports, among other sources);[55] on these GPT achieved, respectively, a 5.8% and 1.5% improvement over previous best results.[10] It similarly outperformed previous models on two tasks related to question answering and commonsense reasoning — by 5.7% on RACE,[56] a dataset of written question–answer pairs from middle and high school exams, and by 8.9% on the Story Cloze Test.[57] Another task, semantic similarity (or paraphrase detection), assesses whether a model can predict whether two sentences are paraphrases of one another; on the Quora Question Pairs (QQP) dataset, GPT improved on the previous state of the art by 4.2%.[10] In a text classification task using the Corpus of Linguistic Acceptability (CoLA), GPT achieved a score of 45.4, versus a previous best of 35.0. Finally, on GLUE, a multi-task benchmark,[58] GPT achieved an overall score of 72.8 (compared to a previous record of 68.9).[10]

GPT-2 was created as a direct scale-up of GPT, with both its parameter count and dataset size increased by a factor of 10.[7][10][8] Both are unsupervised transformer models trained to generate text by predicting the next word in a sequence of tokens. The GPT-2 model has 1.5 billion parameters, and was trained on a dataset of 8 million web pages.[7] While GPT-2 was trained on a very simple objective (interpreting a sequence of words in a text sample and predicting the most likely next word), it produces full sentences and paragraphs by continuing to predict additional words, generating fully comprehensible (and semantically meaningful) statements in natural language.[7]
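The generation loop (predict the most likely next word, append it, repeat) can be sketched with a toy stand-in for the model. The bigram counter below is an illustrative assumption and has nothing in common with GPT-2's transformer beyond the interface: it maps the text so far to a guess for the next token. The corpus and function names are hypothetical.

```python
from collections import Counter, defaultdict

def fit_bigrams(tokens):
    """Count which token follows which (a toy next-token predictor)."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def generate(counts, prompt, max_new_tokens):
    """Greedy autoregressive generation: append the likeliest next token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        followers = counts.get(out[-1])
        if not followers:
            break  # nothing ever followed this token in the corpus
        out.append(followers.most_common(1)[0][0])
    return out

corpus = "the cat sat on the mat and the cat sat on the rug".split()
model = fit_bigrams(corpus)
text = generate(model, ["the"], 4)
print(" ".join(text))
```

Because each step sees only its own previous output, a greedy loop like this one soon starts cycling, a small-scale analogue of the repetitiveness reported for long generated passages.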

Due to the breadth of its dataset, and the breadth of its approach, GPT-2 became capable of performing a diverse range of tasks beyond simple text generation: it is capable of answering questions, summarizing, and even translating between languages in a variety of specific domains, without being instructed in anything beyond how to predict the next word in a sequence.[2][3]

One example of generalized learning is GPT-2's ability to perform machine translation between French and English, for which task GPT-2's performance was assessed using WMT-14 translation benchmarks. GPT-2's training corpus included virtually no French text; non-English text was deliberately removed while cleaning the dataset prior to training, and as a consequence, only 10MB of French of the remaining 40,000MB was available for the model to learn from (mostly from foreign-language quotations in English posts and articles).[7] Despite this, GPT-2 achieved 5 BLEU on the WMT-14 English-to-French test set (slightly below the score of a translation via word-for-word substitution). It was also able to outperform several contemporary (2017) unsupervised machine translation baselines on the French-to-English test set, where GPT-2 achieved 11.5 BLEU. This remained below the highest-performing contemporary unsupervised approach (2019), which had achieved 33.5 BLEU.[7] However, other models used large amounts of French text to achieve these results; GPT-2 was estimated to have used a monolingual French corpus approximately 1/500 the size of comparable approaches.[7]

Performance and reception

GPT-2 was first announced on 14 February 2019. While the source code of previous OpenAI models had been made immediately available to the public, the company initially refused to make a public release of GPT-2, citing the risk of malicious use;[1] limited access to the model (i.e. an interface that allowed input and provided output, not the source code itself) was allowed for selected press outlets on announcement.[1]

The Guardian described this output as "plausible newspaper prose";[1] Kelsey Piper of Vox said "one of the coolest AI systems I’ve ever seen may also be the one that will kick me out of my job".[3] GPT-2's flexibility was described as "impressive" by The Verge; specifically, its ability to translate text between languages, summarize long articles, and answer trivia questions was noted.[2]

A study by the University of Amsterdam employing a modified Turing test found that at least in some scenarios, participants were unable to distinguish poems generated by GPT-2 from those written by humans.[59] However, while GPT-2's ability to generate plausible passages of natural language text was generally remarked on positively, its shortcomings were noted as well, especially when generating texts longer than a couple of paragraphs; Vox said "the prose is pretty rough, there’s the occasional non-sequitur, and the articles get less coherent the longer they get".[3] The Verge similarly noted that longer samples of GPT-2 writing were "usually easily identifiable as non-human", tended to "stray off topic" and lack overall coherence.[2]

Possible applications of GPT-2 described by journalists included aiding humans in writing text like news articles.[1]

The Allen Institute for Artificial Intelligence responded to GPT-2 with a tool to detect "neural fake news".[60] Researchers such as Jeremy Howard warned of "the technology to totally fill Twitter, email, and the web up with reasonable-sounding, context-appropriate prose, which would drown out all other speech and be impossible to filter".[2]

References

  1. Hern, Alex (14 February 2019). "New AI fake text generator may be too dangerous to release, say creators". The Guardian. Retrieved 19 December 2020.
  2. Vincent, James (14 February 2019). "OpenAI's new multitalented AI writes, translates, and slanders". The Verge. Retrieved 19 December 2020.
  3. Piper, Kelsey (14 February 2019). "An AI helped us write this article". Vox. Retrieved 19 December 2020.
  4. Piper, Kelsey (15 May 2019). "A poetry-writing AI has just been unveiled. It's ... pretty good". Vox. Retrieved 19 December 2020.
  5. Johnson, Khari (20 August 2019). "OpenAI releases curtailed version of GPT-2 language model". VentureBeat. Retrieved 19 December 2020.
  6. Vincent, James (7 November 2019). "OpenAI has published the text-generating AI it said was too dangerous to share". The Verge. Retrieved 19 December 2020.
  7. Radford, Alec; Wu, Jeffrey; Child, Rewon; Luan, David; Amodei, Dario; Sutskever, Ilya (14 February 2019). "Language models are unsupervised multitask learners" (PDF). 1 (8). Retrieved 19 December 2020.
  8. "Better Language Models and Their Implications". OpenAI. 14 February 2019. Retrieved 19 December 2020.
  9. Hegde, Chaitra; Patil, Shrikumar (9 June 2020). "Unsupervised Paraphrase Generation using Pre-trained Language Models". arXiv:2006.05477 [cs.CL].
  10. Radford, Alec; Narasimhan, Karthik; Salimans, Tim; Sutskever, Ilya (11 June 2018). "Improving Language Understanding by Generative Pre-Training" (PDF). OpenAI. p. 12. Retrieved 23 January 2021.
  11. "GPT-2: 1.5B Release". OpenAI. 2019-11-05. Retrieved 2019-11-14.
  12. Arram (July 9, 2020). "GPT-3: An AI that's eerily good at writing almost anything". Arram Sabeti. Retrieved July 31, 2020.
  13. Hao, Karen (September 23, 2020). "OpenAI is giving Microsoft exclusive access to its GPT-3 language model". MIT Technology Review. Retrieved 2020-09-25. The companies say OpenAI will continue to offer its public-facing API, which allows chosen users to send text to GPT-3 or OpenAI’s other models and receive its output. Only Microsoft, however, will have access to GPT-3’s underlying code, allowing it to embed, repurpose, and modify the model as it pleases.
  14. Turing, Alan (October 1950), "Computing Machinery and Intelligence", Mind, LIX (236): 433–460, doi:10.1093/mind/LIX.236.433, ISSN 0026-4423
  15. Samuel, Arthur (1959). "Some Studies in Machine Learning Using the Game of Checkers". IBM Journal of Research and Development. 3 (3): 210–229. CiteSeerX 10.1.1.368.2254. doi:10.1147/rd.33.0210.
  16. Hancox, P.J. (26 January 1996). "SEM1A5 - Part 1 - A brief history of NLP". University of Birmingham. Retrieved 12 January 2021.
  17. Nye, Mary Jo (2016). "Speaking in Tongues: Science's centuries-long hunt for a common language". Distillations. 2 (1): 40–43. Retrieved 22 March 2018.
  18. Gordin, Michael D. (2015). Scientific Babel: How Science Was Done Before and After Global English. Chicago, Illinois: University of Chicago Press. ISBN 9780226000299.
  19. John Hutchins. "The first public demonstration of machine translation: the Georgetown-IBM system, 7th January 1954". S2CID 132677.
  20. Reifler, Erwin (February 2–5, 1960). "The solution of MT linguistic problems through lexicography". Proceedings of the National Symposium on Machine Translation.
  21. Hutchins, John (1997). "From first conception to first demonstration: the nascent years of machine translation, 1947-1954. A chronology". Machine Translation. 12 (3): 195–252. doi:10.1023/A:1007969630568. S2CID 197591.
  22. Winograd, Terry (1971-01-01). "Procedures as a Representation for Data in a Computer Program for Understanding Natural Language". hdl:1721.1/7095.
  23. "SHRDLU". Stanford Human-Computer Interaction (HCI) Group.
  24. Weizenbaum, Joseph (January 1966), "ELIZA – A Computer Program For the Study of Natural Language Communication Between Man And Machine", Communications of the ACM, 9 (1): 36–45, doi:10.1145/365153.365168, S2CID 1896290
  25. Bassett, Caroline (2019). "The computational therapeutic: exploring Weizenbaum's ELIZA as a history of the present". AI & Society. 34 (4): 803–812. doi:10.1007/s00146-018-0825-9.
  26. Hancox, P.J. (26 January 1996). "SEM1A5 - Part 1 - The state-of-the-art". University of Birmingham. Retrieved 12 January 2021.
  27. Howe, J. (November 1994). "Artificial Intelligence at Edinburgh University : a Perspective". Archived from the original on 17 August 2007. Retrieved 30 August 2007. Lighthill's [1973] report provoked a massive loss of confidence in AI by the academic establishment in the UK (and to a lesser extent in the US). It persisted for a decade ― the so-called 'AI Winter'
  28. Russell, Stuart J.; Norvig, Peter (2003), Artificial Intelligence: A Modern Approach (2nd ed.), Upper Saddle River, New Jersey: Prentice Hall, p. 24, ISBN 0-13-790395-2, Overall, the AI industry boomed from a few million dollars in 1980 to billions of dollars in 1988. Soon after that came a period called the 'AI Winter'
  29. Rosenblatt, Frank (1957). "The Perceptron—a perceiving and recognizing automaton". Report 85-460-1. Cornell Aeronautical Laboratory.
  30. Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer. ISBN 0-387-31073-8.
  31. Olazaran, Mikel (1996). "A Sociological Study of the Official History of the Perceptrons Controversy". Social Studies of Science. 26 (3): 611–659. doi:10.1177/030631296026003005. JSTOR 285702. S2CID 16786738.
  32. Minsky, Marvin; Papert, Seymour (1969), Perceptrons: An Introduction to Computational Geometry, MIT Press, ISBN 0-262-63022-2
  33. Wilson, Bill (24 June 2012). "The Machine Learning Dictionary". www.cse.unsw.edu.au. Archived from the original on 26 August 2018. Retrieved 19 January 2021.
  34. Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). "6.5 Back-Propagation and Other Differentiation Algorithms". Deep Learning. MIT Press. pp. 200–220. ISBN 9780262035613.
  35. Werbos, Paul J. (1994). The Roots of Backpropagation : From Ordered Derivatives to Neural Networks and Political Forecasting. New York: John Wiley & Sons. ISBN 0-471-59897-6.
  36. Crevier, Daniel (1993), AI: The Tumultuous Search for Artificial Intelligence, New York, NY: BasicBooks, ISBN 0-465-02997-3
  37. Parker, D.B. (1985). "Learning Logic". Center for Computational Research in Economics and Management Science. Cambridge MA: Massachusetts Institute of Technology.
  38. Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (1986a). "Learning representations by back-propagating errors". Nature. 323 (6088): 533–536. Bibcode:1986Natur.323..533R. doi:10.1038/323533a0. S2CID 205001834.
  39. Fukushima, Kunihiko (October 1979). "位置ずれに影響されないパターン認識機構の神経回路のモデル --- ネオコグニトロン ---" [Neural network model for a mechanism of pattern recognition unaffected by shift in position — Neocognitron —]. Trans. IECE (in Japanese). J62-A (10): 658–665.
  40. LeCun, Yann; Bengio, Yoshua; Hinton, Geoffrey (2015). "Deep learning". Nature. 521 (7553): 436–444. Bibcode:2015Natur.521..436L. doi:10.1038/nature14539. PMID 26017442. S2CID 3074096.
  41. Bajpai, Akash (23 February 2019). "Recurrent Neural Networks: Deep Learning for NLP". Towards Data Science. Retrieved 19 January 2021.
  42. Olah, Chris; Carter, Shan (8 September 2016). "Attention and Augmented Recurrent Neural Networks". Distill. Retrieved 22 January 2021.
  43. Hochreiter, Sepp; Schmidhuber, Jürgen (21 August 1995), Long Short Term Memory, Wikidata Q98967430
  44. Hochreiter, Sepp; Schmidhuber, Jürgen (1997). "LSTM can Solve Hard Long Time Lag Problems" (PDF). Advances in Neural Information Processing Systems 9. Advances in Neural Information Processing Systems. Wikidata Q77698282.
  45. Hochreiter, Sepp; Schmidhuber, Jürgen (1997). "Long short-term memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. PMID 9377276. S2CID 1915014.
  46. Graves, A.; Liwicki, M.; Fernández, S.; Bertolami, R.; Bunke, H.; Schmidhuber, J. (May 2009). "A Novel Connectionist System for Unconstrained Handwriting Recognition". IEEE Transactions on Pattern Analysis and Machine Intelligence. 31 (5): 855–868. CiteSeerX 10.1.1.139.4502. doi:10.1109/tpami.2008.137. ISSN 0162-8828. PMID 19299860. S2CID 14635907.
  47. Märgner, Volker; Abed, Haikal El (July 2009). "ICDAR 2009 Arabic Handwriting Recognition Competition". 2009 10th International Conference on Document Analysis and Recognition: 1383–1387. doi:10.1109/ICDAR.2009.256. ISBN 978-1-4244-4500-4. S2CID 52851337.
  48. Olah, Chris (27 August 2015). "Understanding LSTM Networks". Retrieved 22 January 2021.
  49. Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (1 September 2014). "Neural Machine Translation by Jointly Learning to Align and Translate". arXiv:1409.0473 [cs.CL].
  50. Luong, Minh-Thang; Pham, Hieu; Manning, Christopher D. (17 August 2015). "Effective Approaches to Attention-based Neural Machine Translation". arXiv:1508.04025 [cs.CL].
  51. Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N.; Kaiser, Lukasz; Polosukhin, Illia (12 June 2017). "Attention Is All You Need". arXiv:1706.03762 [cs.CL].
  52. Buck, Christian; Heafield, Kenneth; van Ooyen, Bas. "N-gram Counts and Language Models from the Common Crawl". Retrieved 22 January 2021.
  53. Wolf, Thomas; Debut, Lysandre; Sanh, Victor; Chaumond, Julien; Delangue, Clement; Moi, Anthony; Cistac, Pierric; Rault, Tim; Louf, Remi; Funtowicz, Morgan; Davison, Joe; Shleifer, Sam; von Platen, Patrick; Ma, Clara; Jernite, Yacine; Plu, Julien; Xu, Canwen; Le Scao, Teven; Gugger, Sylvain; Drame, Mariama; Lhoest, Quentin; Rush, Alexander (2020). "Transformers: State-of-the-Art Natural Language Processing". Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. pp. 38–45. doi:10.18653/v1/2020.emnlp-demos.6. S2CID 208117506.
  54. Tsvetkov, Yulia (22 June 2017). "Opportunities and Challenges in Working with Low-Resource Languages" (PDF). Carnegie Mellon University. Retrieved 23 January 2021.
  55. Williams, Adina; Nangia, Nikita; Bowman, Samuel (1 June 2018). "A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference" (PDF). Association for Computational Linguistics. Retrieved 23 January 2021. "At 433k examples, this resource is one of the largest corpora available for natural language inference (a.k.a. recognizing textual entailment), [...] offering data from ten distinct genres of written and spoken English [...] while supplying an explicit setting for evaluating cross-genre domain adaptation."
  56. Lai, Guokun; Xie, Qizhe; Liu, Hanxiao; Yang, Yiming; Hovy, Eduard (15 April 2017). "RACE: Large-scale ReAding Comprehension Dataset From Examinations". arXiv:1704.04683 [cs.CL].
  57. Mostafazadeh, Nasrin; Roth, Michael; Louis, Annie; Chambers, Nathanael; Allen, James F. (3 April 2017). "LSDSem 2017 Shared Task: The Story Cloze Test" (PDF). Association for Computational Linguistics. Retrieved 23 January 2021. "The LSDSem’17 shared task is the Story Cloze Test, a new evaluation for story understanding and script learning. This test provides a system with a four-sentence story and two possible endings, and the system must choose the correct ending to the story. Successful narrative understanding (getting closer to human performance of 100%) requires systems to link various levels of semantics to commonsense knowledge."
  58. Wang, Alex; Singh, Amanpreet; Michael, Julian; Hill, Felix; Levy, Omar; Bowman, Samuel R. (20 April 2018). "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding". arXiv:1804.07461 [cs.CL].
  59. "Artificial intelligence versus Maya Angelou: Experimental evidence that people cannot differentiate AI-generated from human-written poetry". Computers in Human Behavior. 114: 106553. 1 January 2021. doi:10.1016/j.chb.2020.106553.
  60. Schwartz, Oscar (4 July 2019). "Could 'fake text' be the next global political threat?". The Guardian. Retrieved 16 July 2019.
This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.