OpenAI, a non-profit artificial intelligence research organization based in San Francisco, has developed a language-predicting AI that can proficiently compose compelling paragraphs of text from any text sample you throw at it. But citing concerns about malicious applications such as creating misleading news articles, impersonating people, and automating abusive or fake content on social media, the company is releasing only a scaled-down version of the model.

According to OpenAI, the new language model, called GPT-2, was trained to predict the next word in 40GB of Internet text. A byproduct of that training is that the model generates text that conforms to the structure and content of whatever conditioning text it is given, letting the user create "realistic and coherent continuations" of their choosing. GPT-2 is a successor to GPT, with ten times as many parameters, trained on more than ten times as much data as the previous model.
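The training objective described above — predict the next word given the preceding words — can be illustrated with a toy bigram model. This is a deliberate simplification for intuition only (GPT-2 itself is a large Transformer network, not a word-count table), and the function names `train_bigram` and `generate` are hypothetical:

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count, for each word in the corpus, which words follow it."""
    words = text.split()
    follows = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1
    return follows

def generate(follows, prompt, length=10):
    """Extend the prompt by repeatedly predicting the most frequent next word."""
    out = prompt.split()
    for _ in range(length):
        counts = follows.get(out[-1])
        if not counts:
            break  # no continuation was ever observed for this word
        out.append(counts.most_common(1)[0][0])
    return " ".join(out)
```

GPT-2 does conceptually the same thing — condition on the text so far, emit the next token, repeat — but it conditions on far more than the single previous word, which is what makes its continuations coherent over whole paragraphs.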

Here are a few select text samples to give you an idea of how the AI generates coherent synthetic text after being fed human-written prompts.

Human Written:

In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.

GPT-2 Response:

The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science. Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.

Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.

Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez.

Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen from the air without having to move too much to see them – they were so close they could touch their horns.

While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English. Pérez stated, “We can see, for example, that they have a common ‘language,’ something like a dialect or dialectic.”

Dr. Pérez believes that the unicorns may have originated in Argentina, where the animals were believed to be descendants of a lost race of people who lived there before the arrival of humans in those parts of South America.

While their origins are still unclear, some believe that perhaps the creatures were created when a human and a unicorn met each other in a time before human civilization. According to Pérez, “In South America, such incidents seem to be quite common.”

However, Pérez also pointed out that it is likely that the only way of knowing for sure if unicorns are indeed the descendants of a lost alien race is through DNA. “But they seem to be able to communicate in English quite well, which I believe is a sign of evolution, or at least a change in social organization,” said the scientist.

Human Written:

We’ve trained a large language model called GPT-2 that generates realistic paragraphs of text, while also exhibiting zero shot generalization on tasks like machine translation, question answering, reading comprehension, and summarization – problems usually approached by using training datasets and models designed explicitly for these tasks.

A typical approach to language modeling is to learn the following task: predict the next word, given all of the previous words within some text. Last year, OpenAI’s Generative Pre-trained Transformer (GPT) showed that language models trained on large amounts of data can be fine-tuned to specific tasks to achieve high performance. GPT-2 shows that much larger language models trained on a more diverse dataset derived from the internet begin to learn these NLP tasks without needing task-specific training data, instead learning from examples the system derives from the raw text. These systems also display a substantial qualitative jump in the realism and coherence of generated text.

GPT-2 Response:

Here you can see the most recent progress with Generative Pre-trained Transformer:
Figure 1: Generative Pre-trained Transformer training on several texts. We are now preparing a collection of datasets for translation and machine translation in our language model. We will be using one of the large number of text samples provided by The New York Times.

We believe this project is the first step in the direction of developing large NLP systems without task-specific training data. That is, we are developing a machine language system in the generative style with no explicit rules for producing text. We hope for future collaborations between computer scientists, linguists, and machine learning researchers.

While controversy is already swirling over the new language model’s ability to generate text of near-human quality from a variety of prompts, OpenAI says the AI is still prone to failures, such as “repetitive text, world modeling failures (e.g. the model sometimes writes about fires happening under water), and unnatural topic switching.”

The paper describing the research is available on OpenAI’s website.