Machine Learning

It’s becoming increasingly important for companies to collect and harness the potential of their data and use it to make more efficient decisions. As a consequence, machine learning algorithms are now being used in more creative ways than ever.

Machine learning has a broad range of applications

Machine learning has a broad range of applications. Well-known uses include identifying and filtering email spam, blacklisting and penalizing spam blogs so that users get good search results, recommending relevant products and fighting fraud. It can also help businesses build smarter, better and faster products in more surprising ways: recognizing users by voice, image or even accelerometer data (the way a user moves or holds a mobile device), predicting auction sale prices or stock and bond prices, automating employee access control, predicting emergency room waiting times, identifying heart failure risk, or predicting strokes and seizures.

To outline and exemplify some of the problem domains that machine learning can address, we’ve put together the following set of case studies, most of which are the result of training and experimentation sessions meant to showcase concepts:

This has been our only commercial machine learning project to date. The client is a startup that offers software platforms to consulates, aiming to enable citizens to apply for a visa remotely, thus simplifying the process significantly.

The machine learning approach came from the need to scan the passport with the camera of a mobile device (an iPad), read the data automatically and enter the recognized information (such as first and last name, birth date, country and passport number) into the appropriate fields of the visa application forms.

The client initially requested a proof of concept (PoC) of the iPad application in order to present the idea to potential customer consulates. Satisfied with the PoC, the client decided to continue collaborating with Roweb on implementing the end product as well as several related ones.

The main technologies involved in implementing the mobile application for scanning passports were:

  • iOS
  • Objective-C, C++
  • Tesseract – one of the most popular open source optical character recognition (OCR) libraries
  • OpenCV – one of the most popular libraries for computer vision tasks

Some of the most challenging aspects that we managed to overcome during implementation were:

  • performing OCR in various light conditions, from under-exposed to over-exposed passport details
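
One common way to reduce the effect of uneven exposure before OCR is to stretch the image contrast so that dark and bright pixels use the full intensity range. The following is a minimal sketch of that idea only, not the project's implementation: it works on a grayscale image stored as nested lists, and the function name `stretch_contrast` and its percentile defaults are assumptions made for illustration. In practice this step would typically run on image arrays through a library such as OpenCV, which also offers related operations like histogram equalization.

```python
def stretch_contrast(gray, low_pct=2.0, high_pct=98.0):
    """Linearly rescale a grayscale image (rows of 0-255 values) so that the
    chosen low/high percentiles map to 0 and 255. Illustrative sketch only."""
    pixels = sorted(v for row in gray for v in row)
    n = len(pixels)
    lo = pixels[int((n - 1) * low_pct / 100.0)]
    hi = pixels[int((n - 1) * high_pct / 100.0)]
    if hi == lo:
        # Flat image: there is no contrast to stretch, return a copy.
        return [row[:] for row in gray]
    scale = 255.0 / (hi - lo)
    # Values outside the percentile range are clipped to 0 or 255.
    return [[min(255, max(0, round((v - lo) * scale))) for v in row]
            for row in gray]


# A hypothetical under-exposed patch: all values sit at the dark end.
dark = [[10, 12, 14], [20, 30, 40], [12, 11, 40]]
stretched = stretch_contrast(dark, low_pct=0, high_pct=100)
```

After the call, the darkest pixel of the patch maps to 0 and the brightest to 255, which gives a subsequent thresholding or OCR step more separation between ink and background.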

We designed and implemented a sentiment analysis system able to automatically detect the general feelings expressed in movie reviews. We were interested in the sentiment polarity of user comments, designing the system to classify movie reviews as either positive or negative.

The core component of the sentiment analysis software was a machine learning algorithm that learns a probabilistic model for detecting the sentiment polarity of a movie review.

The learning process involved gathering a set of 50000 labeled reviews, each with a polarity expressed by the number of stars awarded by the user. The following technologies were used:

  • C#, C++
  • LibSVM, LibLinear – C++ libraries implementing the Support Vector Machines algorithm

The most notable challenges that we overcame during the development were:

  • gathering the training data, which consisted of 50000 labeled movie reviews
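
To show how star-rated reviews can become training data for a Support Vector Machine, here is a minimal sketch that turns each review into a polarity label plus word counts and writes it in the sparse text format read by LIBSVM and LIBLINEAR. The 10-star scale, the cut-offs, the sample reviews and all function names are assumptions made for illustration; the project's actual feature extraction is not described above.

```python
import re
from collections import Counter


def polarity_from_stars(stars, pos_min=7, neg_max=4):
    """Map a star rating to +1 (positive) or -1 (negative).

    The 10-star scale and both cut-offs are assumptions; ratings that fall
    between them return None and are left out of training.
    """
    if stars >= pos_min:
        return 1
    if stars <= neg_max:
        return -1
    return None


def bag_of_words(text):
    """Count lowercase word occurrences in a review."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def to_libsvm_line(label, counts, vocab):
    """Format one example as '<label> <index>:<value> ...' with 1-based,
    ascending feature indices (the LIBSVM/LIBLINEAR text format)."""
    feats = sorted((vocab[w], c) for w, c in counts.items() if w in vocab)
    return " ".join([f"{label:+d}"] + [f"{i}:{c}" for i, c in feats])


# Hypothetical reviews with star ratings.
reviews = [("A great, great film", 9), ("Dull and boring", 2), ("It was fine", 5)]
labeled = [(bag_of_words(t), polarity_from_stars(s)) for t, s in reviews]
labeled = [(bow, y) for bow, y in labeled if y is not None]

# Vocabulary: every word seen in the labeled reviews, numbered from 1.
words = sorted({w for bow, _ in labeled for w in bow})
vocab = {w: i for i, w in enumerate(words, start=1)}
lines = [to_libsvm_line(y, bow, vocab) for bow, y in labeled]
```

Here `lines` holds `+1 1:1 5:1 6:2` and `-1 2:1 3:1 4:1`; the middling 5-star review is dropped because it carries no clear polarity.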

The goal of this application was to detect whether microchips from a fabrication plant met quality assurance standards.

During quality assurance tests, several measurements were performed on each microchip. The tests that verified whether a microchip was functioning correctly relied on the relationships between the measurement values.

The core component of the project was the classification algorithm for detecting malfunctioning microchips based on the set of measurements made on the devices.

The main technologies involved were:

  • MATLAB – the programming environment on which the software was implemented
  • Logistic Regression – the classification algorithm used for learning the probabilistic model associated with malfunction detection
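
The idea behind a logistic regression classifier of this kind can be sketched in a few lines. The example below is written in Python rather than MATLAB and is only an illustration: the two measurements per chip, the toy data, the learning rate and the function names are all assumptions. If the pass/fail boundary depended on non-linear relationships between measurements, extra features such as products or squares of the measurements could be added in the same way.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(w, x):
    """Probability that a chip with measurements x is defective (bias is w[0])."""
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))


def train_logistic(X, y, lr=0.5, epochs=5000):
    """Fit the weights by batch gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * (d + 1)
    for _ in range(epochs):
        grad = [0.0] * (d + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


# Hypothetical, already normalized measurements; label 1 marks a defective chip.
X = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.9, 0.8), (0.8, 0.9), (0.7, 0.7)]
y = [0, 0, 0, 1, 1, 1]
w = train_logistic(X, y)
flags = [predict_proba(w, x) > 0.5 for x in X]
```

A chip is flagged as defective when its predicted probability exceeds 0.5; on this toy data the fitted model flags exactly the three chips labeled 1.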

The goal of this application was to detect and recognize handwritten text. The capability to automatically recognize handwriting has recently been in increasing demand, since it can be used for a wide range of purposes, from reading zip codes on mail envelopes to recognizing the amounts written on bank checks.

The core component of this project was the classification algorithm for understanding handwritten information and translating it into machine-readable representations. A corpus of 10000 labeled letters and digits was used in training the classification model.

The main technologies involved in this solution were:

  • MATLAB – the programming environment on which the software was implemented
  • Neural Networks – the classification formalism used to learn a probabilistic model for recognizing handwritten digits and letters
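
To illustrate how a neural network turns pixel values into class probabilities, here is a minimal forward pass for a network with one hidden layer, written in Python rather than MATLAB. Everything specific in it is an assumption made for illustration: the 4-pixel input, the hand-picked weights, the three classes and the softmax output layer. A real model would learn its weights from the labeled corpus and use far more inputs and hidden units.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, W1, b1, W2, b2):
    """One hidden layer with sigmoid units, followed by a softmax output."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    scores = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(W2, b2)]
    return softmax(scores)


# Hypothetical hand-picked weights: 4 pixels -> 2 hidden units -> 3 classes.
W1 = [[2.0, -2.0, -2.0, 2.0], [-2.0, 2.0, 2.0, -2.0]]
b1 = [0.0, 0.0]
W2 = [[3.0, -3.0], [-3.0, 3.0], [0.0, 0.0]]
b2 = [0.0, 0.0, 0.0]

probs = forward([1.0, 0.0, 0.0, 1.0], W1, b1, W2, b2)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

The predicted class is the index with the highest probability; for this input and these weights that is class 0.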

The purpose of this project was to develop a spam filter solution that could accurately classify emails into spam or non-spam.

The core component of the project was the spam classification algorithm. The solution involved a corpus of 4000 labeled emails used in the learning process and the following main technologies:

  • Support Vector Machines – the algorithm used to train a probabilistic model that classifies the emails

The training and classification phases required a series of cleaning and normalization procedures for email messages: removing non-word content, stripping HTML, normalizing URLs and email addresses, and applying stemming and lemmatization.
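
The cleaning steps listed above can be sketched as a short pipeline of regular expressions. This is an illustration under assumptions, not the project's code: the placeholder tokens `httpaddr` and `emailaddr`, the function names and the very crude suffix-stripping stemmer are all hypothetical, and a real system would use a proper stemming or lemmatization library.

```python
import re


def crude_stem(token):
    """Strip one common suffix. A stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize_email(text):
    """Return the list of normalized tokens for one email body."""
    text = text.lower()
    text = re.sub(r"<[^<>]+>", " ", text)                        # strip HTML tags
    text = re.sub(r"(https?://|www\.)\S+", " httpaddr ", text)   # normalize URLs
    text = re.sub(r"\S+@\S+", " emailaddr ", text)               # normalize addresses
    text = re.sub(r"[^a-z\s]", " ", text)                        # drop non-word content
    return [crude_stem(tok) for tok in text.split()]


sample = ("<p>Visit http://example.com/deals NOW or email "
          "Sales@Example.com for 50% savings!</p>")
tokens = normalize_email(sample)
```

For the sample message, `tokens` is `['visit', 'httpaddr', 'now', 'or', 'email', 'emailaddr', 'for', 'saving']`. Replacing every URL and address with a single placeholder lets the classifier learn from the presence of a link rather than from its exact text.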

The purpose of this solution was to implement an algorithm able to detect anomalous behavior in servers, based on a series of measurements of how they operate (e.g. throughput and latency).

The core component of this project was the algorithm for detecting the malfunctioning nodes in a cluster of servers. The main technologies involved were:

  • Learning the parameters of a multivariate Gaussian distribution – the formalism behind the anomaly detection system
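
A minimal sketch of Gaussian anomaly detection follows. To stay short it handles exactly two features (throughput and latency) so that the covariance matrix can be inverted in closed form; the sample readings, the threshold `epsilon` and the function names are assumptions made for illustration. The mean and covariance are estimated from readings of healthy servers, and a new reading is flagged when its density under the fitted distribution falls below the threshold.

```python
import math


def fit_gaussian_2d(points):
    """Estimate the mean and covariance (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return (mx, my), (sxx, sxy, syy)


def density(point, mean, cov):
    """Density of a bivariate Gaussian at `point`."""
    (mx, my), (sxx, sxy, syy) = mean, cov
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form d^T Sigma^-1 d, using the closed-form 2x2 inverse.
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * q) / (2 * math.pi * math.sqrt(det))


# Hypothetical (throughput, latency) readings from healthy servers.
healthy = [(100, 10), (102, 11), (98, 9), (101, 10), (99, 10), (103, 12), (97, 9)]
mean, cov = fit_gaussian_2d(healthy)

epsilon = 1e-3  # assumed threshold; normally tuned on labeled examples
is_anomaly = lambda reading: density(reading, mean, cov) < epsilon
```

With these readings, `is_anomaly((100, 10))` is false, while a server reporting low throughput and high latency, such as `(60, 40)`, is flagged.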

This solution focused on implementing a movie recommender system based on a large dataset of users, movies and user ratings for these movies. User ratings were expressed on a scale of 1 to 5, with 5 stars being the highest rating.

The core component of the project was the recommender system algorithm, able to generate recommendations tailored to each user’s tastes based on their past movie ratings and the ratings provided by the entire community. The main technologies used in implementing this solution were:

  • MATLAB – the programming environment on which the software was implemented
  • Collaborative Filtering – the algorithm capable of making automatic predictions concerning a user’s interests by collecting preference information from a group of users
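
The sketch below illustrates the collaborative filtering idea with a simple user-based variant: a missing rating is predicted as the similarity-weighted average of what other users gave the same movie. It is written in Python rather than MATLAB and is not necessarily the variant used in the project; the sample ratings, user names and function names are assumptions made for illustration.

```python
import math


def similarity(a, b):
    """Cosine similarity over the movies that both users rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    na = math.sqrt(sum(a[m] ** 2 for m in shared))
    nb = math.sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (na * nb)


def predict_rating(ratings, user, movie):
    """Similarity-weighted average of other users' ratings for `movie`."""
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or movie not in theirs:
            continue
        s = similarity(ratings[user], theirs)
        num += s * theirs[movie]
        den += s
    return num / den if den else None


# Hypothetical ratings on the 1-to-5 scale.
ratings = {
    "ana": {"Alien": 5, "Heat": 4, "Up": 1},
    "bob": {"Alien": 5, "Heat": 5, "Up": 1, "Jaws": 5},
    "cy": {"Alien": 1, "Heat": 2, "Up": 5, "Jaws": 1},
}
guess = predict_rating(ratings, "ana", "Jaws")
```

Because ana's past ratings resemble bob's much more than cy's, the predicted rating for "Jaws" lands above the midpoint of the scale, closer to bob's 5 than to cy's 1.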