OkCupid Analysis
This is a brief presentation of my analysis; click here for my full report.
Introduction
What are the benefits of online dating? Can't we just meet people in school, at work, or in public? Naturally, we don't all have the finesse to do that. Perhaps we are shy, not into office drama, or just not social enough.
The benefit of online dating is that we can present ourselves in a wall of text without ever leaving our busy lives. How convenient! We can also filter out users who don't meet our standards rather than asking uncomfortable questions in person. However, online dating has downsides too. What if people aren't honest, and how do people get quality matches?
This analysis tries to answer those questions and suggests ways to improve the online dating experience. The full paper and code can be found on my GitHub. (I highly recommend taking a look if you are interested in the science behind the data!)
Data
We have 59,946 users and their responses from the OkCupid website. That's a lot to work with! It's a large enough data set to find correlations in and make predictions from.
Analysis
First, let's look at what data we are working with.
The data frame lists 31 variables. As an example, suppose a person requires a potential partner to meet 5 criteria to be considered “dateable”. Such criteria could include: around 30 years old, no drugs, graduated from college, male, and living in New York. Having 31 variables instead of just 5 gives users a lot of power to filter for who matches with them. It also lets a user still match when sections of their profile are missing. Another benefit of having many variables is the ability to combine some of them to predict another. A quick example is using education and job to predict income. Of course, a person could make a rough guess. However, it is also possible to train a machine learning algorithm to make that prediction from user data.
Let's get straight to business. Other than the obvious 'sex' and 'location' variables being important to matchmaking, what else would a person look for? Age? Education? Let's try income. Since income is a far more sensitive question to ask in person than age or education, what could it mean if it was part of your profile? Would you get more matches if you made more? Generally, a high salary is associated with success, a quality that many people find desirable in a match. Why not use other variables to predict income?
As we can see, a total of 48,442 users have their income listed as '-1'. What does this mean? It means that they did not wish to share that information on the internet. So why not just use the available data to predict the missing incomes? Although that might seem like the best solution, we aren't even sure whether the reported incomes are accurate.
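To make the '-1' convention concrete, here is a minimal sketch of how the missing incomes could be counted. The sample rows and the names `profiles`, `MISSING_INCOME`, and `income_report_counts` are made up for illustration; only the '-1' marker comes from the data. With the real counts, 48,442 of 59,946 users is roughly 81 percent.

```python
from collections import Counter

# Hypothetical sample rows; the real data set has 59,946 profiles.
profiles = [
    {"sex": "m", "income": -1},
    {"sex": "f", "income": 50000},
    {"sex": "m", "income": -1},
    {"sex": "f", "income": 80000},
    {"sex": "f", "income": -1},
]

MISSING_INCOME = -1  # the value the data uses for "not reported"


def income_report_counts(rows):
    """Count how many profiles report an income and how many do not."""
    return Counter(
        "missing" if row["income"] == MISSING_INCOME else "reported"
        for row in rows
    )


counts = income_report_counts(profiles)
share_missing = counts["missing"] / len(profiles)
print(counts["missing"], counts["reported"], f"{share_missing:.0%}")
```

Treating '-1' as an explicit "missing" marker, rather than as a number, keeps it from dragging down any income averages computed later.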
Let's test our suspicion of false data on the height variable.
This seems fair: the average male is 5 inches taller than the average female. Let's now check the extreme values to assess the integrity of the data.
Here, we have a female standing at 4 inches tall and a male standing at 1 inch tall. We also have a male and female at 95 inches tall. Just for reference, the average NBA basketball player stands at 79 inches tall.
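A check like this can be sketched in a few lines. The rows below are hypothetical and only mimic the extremes mentioned above; `height_summary` is an illustrative helper, not the code from my report.

```python
from statistics import mean

# Hypothetical rows of (sex, height in inches), not the real data.
rows = [
    ("m", 70), ("m", 72), ("m", 1), ("m", 95),
    ("f", 65), ("f", 64), ("f", 4), ("f", 95),
]


def height_summary(rows):
    """Return {sex: (min, mean, max)} for height in inches."""
    by_sex = {}
    for sex, height in rows:
        by_sex.setdefault(sex, []).append(height)
    return {
        sex: (min(heights), mean(heights), max(heights))
        for sex, heights in by_sex.items()
    }


summary = height_summary(rows)
for sex, (lowest, average, highest) in sorted(summary.items()):
    print(sex, lowest, round(average, 1), highest)
```

Looking at the minimum and maximum next to the mean matters because a handful of extreme entries can hide behind an average that looks reasonable.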
I might be wrong, but I'm going to say that this is false data. Perhaps the users didn't want to share their true heights, but this is no way to do it. Age shows a similar pattern, with extremes of 18 and 110 years old. You can find the code on my GitHub.
Either OkCupid has to work on their response parameter restrictions, or they have to find a way to prevent this from happening. It could very well be happening with the income data. It's just a lot more difficult to find out if someone is lying about their income or not compared to height and age.
However, there is a small benefit to allowing extreme responses such as these heights. We could use them to teach our machine learning algorithm to ignore these users altogether. Plus, when matching users, OkCupid could give these users lower priority or match them with others who have similar responses.
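One simple way to set such users aside is a plausibility filter. This is only a sketch: the bounds below are my own illustrative assumptions, not thresholds from the analysis, and `implausible_fields` is a hypothetical helper.

```python
# Assumed plausibility bounds; these thresholds are illustrative choices,
# not values taken from the original analysis.
HEIGHT_RANGE_IN = (48, 90)   # inches
AGE_RANGE_YEARS = (18, 100)  # years


def implausible_fields(profile):
    """Return the names of fields whose values fall outside the bounds."""
    bounds = {"height": HEIGHT_RANGE_IN, "age": AGE_RANGE_YEARS}
    flagged = []
    for field, (low, high) in bounds.items():
        value = profile.get(field)
        if value is not None and not (low <= value <= high):
            flagged.append(field)
    return flagged


# Hypothetical profiles for demonstration.
profiles = [
    {"id": 1, "age": 29, "height": 70},
    {"id": 2, "age": 110, "height": 67},
    {"id": 3, "age": 31, "height": 4},
]

clean = [p for p in profiles if not implausible_fields(p)]
suspect = [p for p in profiles if implausible_fields(p)]
print([p["id"] for p in clean], [p["id"] for p in suspect])
```

The `suspect` list could be dropped from training data, or kept and given lower priority when matching, as suggested above.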
Machine Learning
Since we have the most data on essays, we will use them to train our machine learning algorithm, Naive Bayes, to predict gender.
According to the data, words like 'soundcloud' and 'mechanic' are more likely to come from a man's essay. Words like 'gloss' and 'tomboy' are more likely to come from a woman's essay.
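For readers curious how Naive Bayes arrives at word associations like these, below is a from-scratch sketch of a multinomial Naive Bayes text classifier with add-one smoothing. It is not the code from my report, which may well use a library instead; the class `NaiveBayesText` and the four toy essays are invented for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)        # essays per class
        self.word_counts = defaultdict(Counter)    # word counts per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, text):
        """Return {label: log prior + sum of log word likelihoods}."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)
            for word in text.lower().split():
                if word in self.vocab:  # skip words never seen in training
                    score += math.log(
                        (counts[word] + 1) / (total_words + vocab_size)
                    )
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Hypothetical toy essays, not taken from the data set.
texts = [
    "i post my beats on soundcloud and work as a mechanic",
    "i love lip gloss and was a tomboy growing up",
    "mechanic by day and music on soundcloud by night",
    "tomboy at heart who still likes gloss",
]
labels = ["m", "f", "m", "f"]

model = NaiveBayesText().fit(texts, labels)
print(model.predict("my soundcloud page"), model.predict("tomboy with gloss"))
```

The model simply scores each class by how often its training essays used the words in the new essay, which is why distinctive words such as 'soundcloud' or 'gloss' carry so much weight.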
Conclusions and Recommendations for Improvement
With this study, there are some conclusions that could help out OkCupid and its users. We are going to focus on getting more data, filtering out inconsistent users, and suggestions for users.
Getting more data
OkCupid does an amazing job of offering many variables, so users can answer some, very few, or even none of them. However, data matters even more to OkCupid than to its users, because it could be used to improve the platform and beat competitors. For example, OkCupid could simply send its users an anonymous survey asking which 5 variables they look for most in a partner. Or it could run a blind study of what users filter for the most. If users care most about income and education, OkCupid should do everything it can to ensure those responses are filled in. Declining to share income, or answering “space camp” for education, might not be helpful for future studies that aim to help users get what they want.
Filtering out inconsistent users
Nothing is more frustrating than finding a match online and then finding out you have been catfished. Urban Dictionary casually defines being “catfished” as “being deceived by false information online”. Catfishing could be tracked through inconsistent responses in a user’s profile. One way OkCupid could do this is by comparing a user’s responses with the average for similar users. For example, if a 19-year-old college student claims to be a millionaire, that income should be compared with what other 19-year-old college students make. Instead of banning the account, OkCupid could just give it fewer matches or match it with other inconsistent users. OkCupid shouldn’t outright ban inconsistent users because sometimes the user is actually being honest. Also, banning users would bring in less income overall for OkCupid.
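The peer comparison could look something like the sketch below. Note one deliberate change: it compares against the peer-group median rather than the mean, because a single claimed millionaire pulls the mean toward itself. The grouping by age and job, the `ratio` threshold, and the sample profiles are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def flag_income_outliers(profiles, ratio=10):
    """Flag profiles whose income exceeds `ratio` times the peer median.

    Peers share the same (age, job) pair. `ratio` is an assumed threshold.
    Assumes unreported incomes (-1) were removed beforehand.
    """
    groups = defaultdict(list)
    for p in profiles:
        groups[(p["age"], p["job"])].append(p["income"])

    flagged = []
    for p in profiles:
        peer_median = median(groups[(p["age"], p["job"])])
        if peer_median > 0 and p["income"] > ratio * peer_median:
            flagged.append(p["id"])
    return flagged


# Hypothetical profiles for demonstration.
profiles = [
    {"id": 1, "age": 19, "job": "student", "income": 20000},
    {"id": 2, "age": 19, "job": "student", "income": 15000},
    {"id": 3, "age": 19, "job": "student", "income": 1000000},
    {"id": 4, "age": 19, "job": "student", "income": 25000},
    {"id": 5, "age": 40, "job": "executive", "income": 1000000},
]

flagged = flag_income_outliers(profiles)
print(flagged)
```

The same income is flagged for the 19-year-old student but not for the 40-year-old executive, which is the point of comparing within a peer group. A flag here would only lower match priority, not ban the account.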
Suggestions for users
Suppose that a user finds it difficult to give responses, or is unaware that a good, full profile results in more quality matches. A full profile can signal honesty and transparency. This is a lot better than a weak profile, which could lead to a blind-date feeling. To help users, OkCupid could use the machine learning predictions to help them auto-complete their profiles, or encourage them through positive reinforcement. Badges could be earned for complete profiles. This is similar in concept to LinkedIn’s “all-star” profile, which marks users who have a complete profile.
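As a rough sketch of the badge idea, a completeness score could be computed as the share of answered fields. The field list, the `badge` labels, and the threshold are hypothetical choices, not anything OkCupid is known to use.

```python
# Hypothetical subset of profile fields used for the score.
FIELDS = ["age", "education", "job", "income", "essay"]


def completeness(profile, fields=FIELDS):
    """Return the fraction of `fields` that have a real answer.

    None, an empty string, and -1 (the data's "not reported" marker)
    all count as unanswered.
    """
    filled = sum(
        1 for field in fields
        if profile.get(field) not in (None, "", -1)
    )
    return filled / len(fields)


def badge(score, threshold=1.0):
    """Return a badge label once the score reaches the assumed threshold."""
    return "complete profile" if score >= threshold else "keep going"


# Hypothetical profile for demonstration.
profile = {
    "age": 30,
    "education": "graduated from college",
    "job": "",
    "income": -1,
    "essay": "hello there",
}

score = completeness(profile)
print(score, badge(score))
```

Showing users the score alongside the missing fields would give them a concrete next step, which is where the positive reinforcement comes in.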