knn_classify_skenderkoll_recommend_engine

We are all Internet users. Huge amounts of data are now available, after exponential growth over the last decades, and we spend hundreds of hours every month browsing and searching. To me it often feels like a battle to reach the page I am interested in without being interrupted by advertisements, pop-ups, banners, dialogue windows, feedback requests, forms to fill in, irrelevant pages and obsolete information. Yet we accept it every day, resigned to the fact that this is the norm in user interfaces (and then we read articles about excellence in UX). It is a feeling similar to being abandoned, while knowing that nobody could help us.

In fact, not a person but an algorithm could. As we read in the news, everything on the Internet is a mix of the human and the non-human, and automated algorithms play a very big role in some services. One of these services is the recommender system, or recommendation engine. I will come to it in a moment.

First, one question: what do you find most annoying when surfing pages and retrieving information? For me it is the countless suggestions about irrelevant stuff (links, pages or topics) that you may have come across just once in your last years on the web. This is what is called information overload, the “problem” of being overwhelmed by choices. A useless invasion of digital data! It raises many legitimate questions: how do these suggestions learn my preferences? Should I be concerned about privacy breaches by the service providers? What is the value of being assisted in such activities?

“An Information filtering system is a system that removes redundant or unwanted information from an information stream using (semi)automated or computerized methods prior to presentation to a human user.” (Wikipedia)

However, how do you know what is unwanted information? An item (a web page, a product, an event) considered important by one user may be completely irrelevant to another. Mathematical and statistical methods have been developed to correlate information.

Let me bring in an interesting case from my past academic career: seminars at the Università degli Studi di Milano-Bicocca (Milan, Italy) dealing with the importance of the concept of “subjectivity”, which is difficult to predict in terms of exact evaluation.

What we can do is analyze an index of perception, the perception of quality of a service.

In 2012 I tried to gain insights into these indexes of perception by giving the campus students a survey. They were asked to evaluate the perceived quality of many University services (usability of the homepage, retrieving information about courses, new initiatives, accessibility for non-Italian students) by assigning each a numerical value within a range bounded by minimum and maximum thresholds.

In this specific case, I was not gathering explicit data (the method used in the recommender systems that Amazon.com uses) with questions such as “Would you recommend your University to your friends? – Answer YES or Answer NO”, but creating a perception framework to identify the reputation of the University in its students’ eyes.

In other words, I was measuring the recommendation of the institution in a broader (global) context. You achieve this goal by including indirect ratings for detailed items.

Recommendation engines, or recommender systems, make the user’s life easier when selecting and filtering small portions of data from the enormous quantities available online. Of course this is good for the user, but it is also good for the service provider (think of e-commerce and entertainment companies such as YouTube, Netflix and Amazon). Dozens of other categories of business apply recommendation engines to deliver ever more efficient services (think of searching catalogues in a library, files in a public administration office, or news categories in an IPTV broadcasting service).

Advantages for the provider are:

–         More effective service (targeting “already known” users with ever more interesting items) by improving the predictive model

–         Increase in the number of subscribers to its services (if the user is satisfied with the content and supported with accurate suggestions).

The process involved here is the dynamic generation of information based on the user’s preferences, past experiences, subscription forms or settings, keywords and analytics indexes. The techniques in these examples are active: they happen in real time or in an automated fashion. But it can also be passive, like keeping track of observed material (you click on an item to buy online, but you do not actually buy it) and logging the user’s activity over a time window. In both modes, recommendation engines filter content for you before recommending.

Basically, there are three categories of filtering:

–         Collaborative filtering (identifying other users with similar taste; it uses their opinion to recommend items to the active user).

Training item sets are built, and users with the same similarity index are then recommended almost the same content.

–         Content-based filtering (match content resources to user characteristics).

A complete user profile is fundamental. These profiles are stored in a dedicated data structure.

Let us imagine an e-learning platform: the system has to know as much information about the user as possible in order to recommend reasonable courses, material and learning paths.

In another specific context, such as movie streaming, the characteristics may be search keywords, movie tags, favorite actor or director, year of the movie, frequency of search terms and so on. Recommender systems also correlate more complex experiences, grouping them into events such as “French movies in the Thirties” or “Love stories in black and white”.

–         Hybrid filtering (a combination of features from the previous two).

A typical solution is to associate weights with the features coming from one technique or the other.
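The weighting idea behind hybrid filtering can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function name `hybrid_score`, the `collab_weight` parameter, the movie titles and all the scores are hypothetical, and a real engine would have to compute the two input scores with its own collaborative and content-based models.

```python
def hybrid_score(collab_score, content_score, collab_weight=0.7):
    """Blend two scores in [0, 1] with a linear weight.

    collab_weight is a hypothetical tuning knob: 1.0 trusts only the
    collaborative score, 0.0 trusts only the content-based score.
    """
    if not 0.0 <= collab_weight <= 1.0:
        raise ValueError("collab_weight must be between 0 and 1")
    return collab_weight * collab_score + (1.0 - collab_weight) * content_score


# Hypothetical (collaborative, content-based) scores for three candidates.
candidates = {
    "Movie A": (0.9, 0.2),
    "Movie B": (0.5, 0.8),
    "Movie C": (0.1, 0.9),
}

# Rank candidates by their blended score, best first.
ranked = sorted(
    candidates,
    key=lambda movie: hybrid_score(*candidates[movie]),
    reverse=True,
)
print(ranked)
```

Changing `collab_weight` shifts the ranking toward one technique or the other, which is exactly the kind of tuning that has to be found by experimentation.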

Plenty of experimentation has been performed to optimize the first two approaches (and the combination of both), and various statistical methods have been applied. Of course, perfection is not achievable (and disasters often occur; have you heard about fake news spreading rapidly through social networks?). Pitfalls and problems remain, and a prediction is never 100% trustworthy because, as you may expect, we are talking about subjectivity in terms of tastes and preferences. Too many factors may alter choices.

Known problems are the so-called cold start (what would you suggest to a brand-new subscriber? The most popular items? The best items according to all-time users?), lack of data to analyze (it is easier to make suggestions when you have exabytes of data and difficult when you have very little), overspecialization (very accurate recommendations in some contexts, but poor ones in many others) and scalability (prediction becomes more complex when analyzing data from millions of users).
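One common way to handle the brand-new-subscriber problem described above is to fall back on popular items until the user has a history. The sketch below is an assumption for illustration only, not how any particular service does it: the helper names `most_popular` and `recommend`, the `min_votes` threshold and the sample ratings are all hypothetical.

```python
from collections import defaultdict


def most_popular(all_ratings, top_n=2, min_votes=2):
    """Rank items by average rating, ignoring items with too few votes."""
    scores_by_item = defaultdict(list)
    for user_ratings in all_ratings.values():
        for item, score in user_ratings.items():
            scores_by_item[item].append(score)
    averages = {
        item: sum(scores) / len(scores)
        for item, scores in scores_by_item.items()
        if len(scores) >= min_votes
    }
    # Highest average first; ties broken alphabetically for stable output.
    return sorted(averages, key=lambda item: (-averages[item], item))[:top_n]


def recommend(user, all_ratings, personalised):
    """Use the personalised recommender only when the user has a history."""
    if not all_ratings.get(user):
        return most_popular(all_ratings)
    return personalised(user)


# Hypothetical data: "new_user" has not rated anything yet.
all_ratings = {
    "u1": {"A": 9, "B": 6},
    "u2": {"A": 8, "B": 7, "C": 10},
    "new_user": {},
}
print(recommend("new_user", all_ratings, personalised=lambda user: []))
```

The `min_votes` threshold keeps an item rated once with a 10 from outranking items that many users agree on.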

That is why a recommender system seems similar to a human being: at the beginning it has to learn about you (with an appropriate learning curve), and progressively it can suggest more relevant stuff. It is like a data science challenge: the more data you gather, the more sophisticated the model will be.

Since I have always been a blogger, I have certainly used the “Related posts” or “You might also like” plugins of platforms such as Blogger or WordPress: these simple recommendation scripts use the labels/tags of each post to suggest suitable material while you read the blog. The success of the mechanism depends on whether you have added labels/tags every time before publishing a post online!
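A tag-based “Related posts” script of this kind could look roughly like the following. It is a sketch under assumptions, not the code of any actual Blogger or WordPress plugin: it scores posts by Jaccard similarity of their tag sets (shared tags divided by all distinct tags), and the post titles and tags are made up.

```python
def jaccard(tags_a, tags_b):
    """Share of tags two posts have in common (0.0 when nothing is shared)."""
    a, b = set(tags_a), set(tags_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def related_posts(current, posts, top_n=2):
    """Return up to top_n other posts that share at least one tag."""
    scored = [
        (jaccard(posts[current], tags), title)
        for title, tags in posts.items()
        if title != current
    ]
    scored = [(score, title) for score, title in scored if score > 0]
    # Best match first; ties broken alphabetically for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [title for _, title in scored[:top_n]]


# Hypothetical blog: one post was published without any tags.
posts = {
    "Intro to kNN": ["machine-learning", "knn", "python"],
    "Recommender basics": ["machine-learning", "recommender"],
    "Holiday photos": ["travel"],
    "Untagged draft": [],
}
print(related_posts("Intro to kNN", posts))
```

Note that the untagged post can never be suggested and never receives suggestions, which is the dependency on consistent tagging mentioned above.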

Now let’s dive a little deeper into a statistical model used by such an algorithm. I will try to introduce one common deterministic algorithm, known as k-Nearest Neighbor classification. It is a memory-based algorithm, so all we have to do is save all the examples in a training set (a data structure). Theoretically, the challenge is:

–         Find the k training examples, pairs of values (x1,y1),…,(xk,yk), that are nearest to the test example x
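The search for the k nearest training examples can be sketched as below. The assumptions are mine: points are numeric tuples, “nearest” means smallest Euclidean distance, each y is a class label, and the prediction is the majority label among the k neighbors. The function names and the sample data are hypothetical.

```python
import math
from collections import Counter


def k_nearest(training, x, k):
    """Return the k (point, label) pairs closest to x by Euclidean distance."""
    return sorted(training, key=lambda pair: math.dist(pair[0], x))[:k]


def knn_classify(training, x, k=3):
    """Predict the label of x by majority vote among its k nearest neighbors."""
    neighbours = k_nearest(training, x, k)
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Memory-based: the "model" is simply the stored training set.
training = [
    ((1.0, 1.0), "drama"),
    ((1.5, 2.0), "drama"),
    ((2.0, 1.0), "drama"),
    ((8.0, 8.0), "action"),
    ((9.0, 7.5), "action"),
]
print(knn_classify(training, (1.2, 1.4), k=3))
```

Nothing is learned in advance here; all the work happens at query time, which is why the approach gets slower as the stored training set grows.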

To illustrate the above algorithm, I have brought the following example.

I am a cinema enthusiast and I very often read movie ratings. On some websites where you can review movies, subscribers often give ratings, let us say from 1 for the worst impression to 10 for the best. We collect these ratings and put them in a user-rating matrix (memory-based algorithm!).

As we can see in Fig. 1, the first row holds the names of the movies and the first column holds the users. To find similarities in users’ tastes, we can apply the algorithm above, which is well known in graph theory and in collaborative filtering. To measure similarity, suppose the users (X) and the movies (Y) are the vertices and each edge (x,y) has a weight, the rating that user x gave to movie y. Starting from a home vertex, we visit each vertex only once (each user gives only one rating to a movie). We take the minimum distance between sets of vertices as the decision criterion.

Let us consider the movie “Hacksaw Ridge”. Skender and Luke gave it the same rating, so with respect to this movie their distance is the smallest possible (specifically 0, exactly the same).

Fig.1

Illustration of the k-Nearest Neighbor algorithm: based on this specific set of examples, we conclude that Skender and Luke may have the same preferences in terms of movie genre. Therefore, it sounds reasonable to propose to Luke movies that Skender has liked in the past!
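The reasoning above can be turned into a small user-to-user sketch. The actual values of Fig. 1 are not reproduced here, so every rating below is hypothetical; the only detail taken from the example is that Skender and Luke gave “Hacksaw Ridge” the same rating. The user “Anna”, the placeholder titles, the mean-absolute-difference distance and the `min_rating` threshold are assumptions for illustration.

```python
# Hypothetical user-rating matrix on a 1-10 scale.
# A missing movie means the user has not rated it yet.
ratings = {
    "Skender": {"Hacksaw Ridge": 9, "Movie B": 8, "Movie C": 3, "Movie D": 9},
    "Luke": {"Hacksaw Ridge": 9, "Movie B": 7, "Movie C": 4},
    "Anna": {"Hacksaw Ridge": 3, "Movie B": 2, "Movie C": 9, "Movie E": 8},
}


def distance(a, b):
    """Mean absolute rating difference over the movies both users rated."""
    shared = ratings[a].keys() & ratings[b].keys()
    if not shared:
        return float("inf")
    return sum(abs(ratings[a][m] - ratings[b][m]) for m in shared) / len(shared)


def recommend(user, k=1, min_rating=7):
    """Suggest movies the k nearest users liked and `user` has not rated."""
    others = sorted(
        (u for u in ratings if u != user),
        key=lambda u: distance(user, u),
    )
    picks = []
    for neighbour in others[:k]:
        for movie, score in ratings[neighbour].items():
            unseen = movie not in ratings[user]
            if unseen and score >= min_rating and movie not in picks:
                picks.append(movie)
    return picks


print(distance("Skender", "Luke"))
print(recommend("Luke"))
```

With these made-up numbers Skender is Luke’s nearest neighbor, so Luke is offered the movie Skender rated highly and Luke has not rated yet.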

Netflix relies on the same basic idea, and it sought to improve its prediction accuracy through the Netflix Prize competition. The difference is scale: it works with large datasets of user ratings, in the range of billions!

Similar techniques are implemented in Artificial Intelligence challenges and in Machine Learning solutions (predict values accurately on new inputs).

Another real-world example? Let’s take a website that provides online video streaming. During a vacation trip to Cuba I would probably like to see videos about the music and culture of Cuba, the top ten places you absolutely have to visit, and a few funny clips. On the same trip taken for business purposes, my hours of video streaming on the plane would be filled with business interviews and inspiring or action movies.

Different purpose, different mood. 

I have experienced this myself with TED talk videos. Since I usually watch TED talks after I finish work and before dinner time (at 18:00), I am notified about similar videos around that specific time and at no other time of day! TED has memorized not only my past preferences, but also the time range in which I use its service.

The same thing happens when people from faraway countries, let’s take Italy and Argentina, share the same preferences about movies and soap opera plots. You would not think so, but cultural aspects, familiarity of languages, and the origin and history of populations determine matching features. It is like someone who knows you, or who has had the same life story, and understands your tastes well enough that in most cases he/she can suggest content you will appreciate. In practice, a recommender needs time and should evolve into an intelligent entity.

So, as I wrote, we are all Internet users. And we all need a trustworthy friend to make suggestions for us.
