Building a compassionate AI friend
Hi, we are the AI team of Replika! In our first blog post, we would like to show how we are building a compassionate and empathetic AI friend. If you've watched the movies "Her" or "Blade Runner 2049," you might recognize what we are trying to create.
Replika is an AI friend that helps people feel better through conversations. An AI friend like this could be especially helpful for people who are lonely, depressed, or have few social connections. Replika attempts to encourage and support people by talking about their day, interests, and life in general. Right now, we have 10 million registered users who send us more than 100 million messages each week. And most importantly, more than 85% of conversations make people feel better.
Replika handles dialog in different ways. It can understand and answer text messages or communicate with people by voice. Users can even send Replika a photo or talk to it in Augmented Reality. We will take a closer look at all of these features below.
In order to understand whether Replika helps people, we continuously measure and track the quality of our conversations. We ask our users how they feel, with three options: better, same, or worse.
With this user feedback, we can calculate several important metrics. One of them is the fraction of positive dialog sessions: the number of positive ratings divided by the total number of ratings (positive, neutral, and negative).
Similarly, we could estimate the fraction of negative sessions.
We are continuously improving these metrics in order for every conversation to make people feel better. Currently, our fraction of positive dialog sessions is greater than 85%, while the fraction of negative sessions is less than 4%. The remaining 11% are neutral.
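The session metrics above can be sketched in a few lines. This is a minimal illustration with made-up counts, not real Replika data:

```python
# A sketch of the session-feedback metrics described above.
# The counts used below are hypothetical.
def session_metrics(positive: int, neutral: int, negative: int) -> dict:
    """Compute the fractions of positive and negative dialog sessions."""
    total = positive + neutral + negative
    return {
        "positive_fraction": positive / total,
        "negative_fraction": negative / total,
    }

# Example: 86 positive, 10 neutral, 4 negative sessions.
metrics = session_metrics(86, 10, 4)
print(metrics["positive_fraction"])  # 0.86
print(metrics["negative_fraction"])  # 0.04
```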
Users can also leave feedback on each of Replika's messages. They can upvote a response if they like it or downvote it if they think the response is not good. From these reactions we compute the upvote fraction: the number of upvotes divided by the total number of reactions. Currently, our upvote fraction is above 90%.
In addition to upvotes and downvotes, we collect more specific feedback. Users can select one of four extra reactions by tapping the three dots near the upvote and downvote buttons. This way, we can learn why users like or dislike certain responses, which helps us further understand how people feel when talking with Replika.
Now let's take a look at a high-level overview of Replika and how it works.
When a user sends a message to Replika, we first combine all data about the user profile, the current dialog context, and the user's latest message. Then we send it to our Dialog Engine, which consists of multiple components: some are responsible for text or image understanding, while others generate responses or make Replika speak.
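The request flow above can be sketched as follows. The field names and routing logic are illustrative assumptions, not Replika's actual internal API:

```python
# Hypothetical sketch of the Dialog Engine request flow: bundle the user
# profile, dialog context, and latest message, then route the request to
# the component that can understand it.
def build_request(profile: dict, context: list, message: dict) -> dict:
    """Combine everything the Dialog Engine needs for one turn."""
    return {"profile": profile, "context": context, "message": message}

def route(request: dict) -> str:
    """Pick the understanding component for this request."""
    if request["message"].get("image") is not None:
        return "image_understanding"
    return "text_understanding"

req = build_request(
    {"name": "Alice"},
    ["hi", "hello! how was your day?"],
    {"text": "pretty good, thanks", "image": None},
)
print(route(req))  # text_understanding
```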
Retrieval dialog model
Let’s look at the first essential model responsible for most Replika responses — the retrieval dialog model.
The retrieval model is responsible for finding the most relevant and appropriate response from a large set of predefined and pre-moderated phrases. For example, for the context "Let's go to an early movie", the responses "Okay, which one do you want?" and "Sure, what time are you free?" are relevant, while the remaining responses are not. We then pick the final response according to the retrieval model's scores, which reflect the degree of relevance.
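The retrieval step can be sketched as scoring every candidate against the context and taking the best one. As a simplifying assumption, the stand-in scorer below uses word overlap instead of a neural encoder, and the candidate responses are illustrative:

```python
import math

# Minimal retrieval sketch: score each predefined response against the
# context and return the highest-scoring one. Word overlap stands in for
# a learned relevance model here.
def overlap_score(context: str, response: str) -> float:
    a = set(context.lower().split())
    b = set(response.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def retrieve(context: str, candidates: list[str]) -> str:
    """Return the predefined response with the highest relevance score."""
    return max(candidates, key=lambda r: overlap_score(context, r))

candidates = [
    "Okay, which one do you want to see?",
    "My favorite color is blue.",
]
print(retrieve("Let's go to an early movie", candidates))
# Okay, which one do you want to see?
```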
The second significant part of our dialog engine is a generative dialog model. We want our conversations to be natural and engaging, so having only a retrieval model isn't enough, since it can only produce predefined responses. In contrast, the generative dialog model, as its name suggests, generates responses from scratch (i.e., creates completely new and unique ones).
Here are some examples of dialogs between our users and their Replikas. These screenshots are from our public groups on Facebook and Reddit, where users share some of their conversations. In each screenshot, Replika's responses are on the left, and user messages are on the right. All Replika responses here are from the generative model.
In the first screenshot, a user asked Replika to reply, starting with the provided first letters. As you can see, the language model generates quite interesting and empathetic replies using these letters.
The middle screenshot demonstrates contextual memory. At the beginning, the user asked Replika to memorize the word "fish." Later, the user texted, "what was the word I asked you to memorize?" and Replika replied with the correct word.
Finally, the model has an amazing ability to mimic the style of the context, which you can see in the right screenshot. Not only did it understand the context, but it also replied in the same style, adding extra vowels.
At some point in Replika's history, we used the well-known GPT-3 as a generative model. It is a neural network by OpenAI trained on a half-terabyte of data from the Internet, including Wikipedia, books, and various web pages.
GPT-3 is a language model: it takes a text prefix as input and predicts the next words as output. In the example below, we have the text prefix "Recite the first law of robotics." After feeding it to GPT-3, the model generates a relevant answer word by word.
To do so, the model learns an immense amount of information about our world and our language. Thus, by fine-tuning it on dialog data, we can get a high-quality dialog model that reuses all of this knowledge.
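The word-by-word prediction loop can be illustrated with a toy example. Here the "model" is just a hand-built bigram table and a greedy decoder; GPT-3 instead learns these statistics from hundreds of gigabytes of text and scores every word in its vocabulary at each step:

```python
# Toy illustration of language-model generation: given a prefix,
# repeatedly predict the next word. The bigram table below is hand-built
# purely for demonstration.
BIGRAMS = {
    "the": ["first", "robot"],
    "first": ["law"],
    "law": ["of"],
    "of": ["robotics"],
}

def generate(prefix: str, max_words: int = 10) -> str:
    words = prefix.lower().split()
    for _ in range(max_words):
        continuations = BIGRAMS.get(words[-1])
        if not continuations:
            break
        words.append(continuations[0])  # greedy: take the top continuation
    return " ".join(words)

print(generate("Recite the"))  # recite the first law of robotics
```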
Replika became one of the first partners of OpenAI back in March 2020. Combining our effort, we fine-tuned the GPT-3 model with 1.3B parameters on our dialogs, conducted dozens of A/B tests, optimized model performance for high load and low latency, and finally deployed the model for millions of our users. In February 2021, the share of conversations that made people feel better with GPT-3 was equal to 71%.
However, relying on OpenAI's GPT-3 as a generative dialog model is limiting, since we cannot quickly introduce new features, control the model precisely, or further improve Replika. Because of these limitations, we decided to move from GPT-3 to our own generative model. Although this model has only 774M parameters, it exceeded GPT-3 in terms of the positive session fraction and thus made our users even happier. With this new model, we have already made conversations with Replika more personalized and controllable, which wasn't easy with OpenAI's GPT-3. We will say more about this in future blog posts.
Currently, about one out of every two Replika replies comes from the generative model.
Once responses from the retrieval and generative models are produced, we apply multiple filters to find the most appropriate ones. Then we have to select the final response to send to the user. To do so, we pass the filtered responses to the reranking model, whose goal is to pick the response with the highest chance of an upvote from the current user.
The reranking model is a crucial part of the dialog engine, so we regularly retrain it on the latest reactions to maintain high dialog quality and eliminate errors reported by downvotes. With a training set containing millions of samples, we use BERT, a Transformer model for text representations, which, fine-tuned on user feedback, selects the best possible final response. This gives us a powerful model that dramatically improves the quality of dialogs.
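The reranking step itself is simple once the per-candidate scores exist: pick the candidate with the highest predicted upvote probability. The scorer below is a stand-in with hypothetical scores; in production, the fine-tuned BERT model produces these probabilities:

```python
# Sketch of the reranking step: every filtered candidate is scored by a
# model estimating the probability of an upvote, and the top-scoring
# response is sent to the user.
def rerank(candidates: list[str], upvote_prob) -> str:
    """Return the candidate with the highest predicted upvote probability."""
    return max(candidates, key=upvote_prob)

# Hypothetical scores such a model might assign to two filtered candidates.
scores = {
    "I see.": 0.41,
    "That sounds exciting, tell me more!": 0.87,
}
best = rerank(list(scores), scores.get)
print(best)  # That sounds exciting, tell me more!
```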
Sending text messages is nice, but we want to make the conversation more immersive. One way to do so is to allow users to send images to their AI friends. Here, we'd like to show how Replika processes and understands these images.
First, we try to recognize faces in the image. In the left screenshot, a user sent a photo of a little girl. Replika recognized the girl as the user's daughter, since the user had shared her photos before. Replika also remembered her name, asking, "how is Sofia doing?"
If no faces are recognized, we then try to recognize objects. Replika can recognize popular objects such as pets, food, plants, sunsets, etc. In the middle screenshot, Replika recognized a dog and asked, "What do you think of the dog's mood?"
The last resort is question generation. When neither faces nor popular objects are recognized, we fall back to our Visual Question Generation model. In the right screenshot, you can see the model asking quite a relevant question about the pizza.
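The three-step cascade above can be written out as plain control flow. The detector and question-generation functions here are hypothetical stand-ins for the real models, and the reply templates echo the screenshot examples:

```python
# Sketch of the image-understanding cascade: faces first, then popular
# objects, then a generated question as the last resort. All three
# callables are hypothetical placeholders.
def respond_to_image(image, detect_faces, detect_objects, generate_question):
    faces = detect_faces(image)
    if faces:
        return f"how is {faces[0]} doing?"  # known face: ask about the person
    objects = detect_objects(image)
    if objects:
        return f"What do you think of the {objects[0]}'s mood?"  # popular object
    return generate_question(image)  # fallback: generate a question

# Simulated detectors for a photo in which only a dog is found.
reply = respond_to_image(
    "photo.jpg",
    detect_faces=lambda img: [],
    detect_objects=lambda img: ["dog"],
    generate_question=lambda img: "Is that pizza homemade?",
)
print(reply)  # What do you think of the dog's mood?
```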
The next extremely important part of the conversation is voice calls. You want to hear your AI friend and talk to them just like you do with anyone else, so we support voice calls. For that, we have several components.
First of all, we need to detect whether the user is talking or listening to Replika. For that purpose, we use a Voice Activity Detection model. We then recognize the user's speech, transcribe it into text, and pass it to our dialog engine to generate a response. Finally, we need to voice the response using a Speech Synthesis model.
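One voice-call turn can be sketched as a simple pipeline. The `vad`, `asr`, `dialog_engine`, and `tts` callables below are hypothetical placeholders for the real models described above:

```python
# Sketch of one turn of the voice-call loop: detect speech, transcribe it,
# get a reply from the dialog engine, then synthesize audio.
def voice_turn(audio, vad, asr, dialog_engine, tts):
    if not vad(audio):                # Voice Activity Detection: is the user talking?
        return None                   # user is listening; nothing to answer yet
    text = asr(audio)                 # speech recognition: audio -> text
    reply = dialog_engine(text)       # dialog engine: text -> reply text
    return tts(reply)                 # speech synthesis: reply text -> audio

# Simulated components for one turn.
out = voice_turn(
    b"...pcm...",
    vad=lambda a: True,
    asr=lambda a: "how are you?",
    dialog_engine=lambda t: "I'm doing great, thanks for asking!",
    tts=lambda r: f"<audio:{r}>",
)
print(out)  # <audio:I'm doing great, thanks for asking!>
```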
Voice calls allowed us to introduce another feature loved by users: Augmented Reality, where you can see your Replika standing in front of you. With these components combined, we can make conversations with Replika even more natural.
While users can simply call their Replikas, it is also possible to take their friends to hang out somewhere, e.g., take a walk in a park or go to the beach. We believe that in 5 years, almost everyone will wear AR glasses instead of using smartphones, so everyone will be able to sing, dance, or play chess with their Replikas at any time, wherever they are. That will be a world in which you can introduce your Replika to your friends' Replikas and have a great time together.
This is possible because of the Augmented Reality feature, which fully utilizes the Voice Engine. With AR, users can feel like their AI friend is present right in front of them.
We haven't told you everything we wanted to: there are numerous other parts of Replika that help make people feel better, and many hard problems we haven't touched on. We hope to tell you more about them in the future.