Data mining is the process of discovering patterns in large data sets, and it is an interdisciplinary subfield of computer science. It relies on specialized algorithms to extract those patterns from the data.
What are data mining algorithms?
An algorithm is a set of instructions that a computer can run to achieve a specific purpose. Algorithms are not tied to any one programming language and can even be written out in plain English. In data mining, algorithms are used to find patterns in a set of data, and those patterns then help classify new information. The applications are varied: for example, such patterns could help predict whether a patient has cancer by identifying complex genetic patterns.
Here are some of the top data mining algorithms:
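- C4.5
C4.5 is the first algorithm on the list. It is a classifier: it analyses the data and assigns each item to a class based on its attributes. It does this by building a decision tree from a set of training data, which makes it a supervised learning algorithm, meaning the training data must already be labelled with the correct classes.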
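To make the idea of "putting data in a class based on some criteria" more concrete, here is a minimal sketch of how C4.5 picks the attribute to split on when it grows its decision tree. C4.5 scores each attribute by its gain ratio (information gain divided by split information). The toy weather data and the function names are assumptions made for illustration, and the sketch covers only the attribute-selection step, not the full tree-building algorithm.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of values."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def gain_ratio(rows, labels, attribute):
    """Information gain of splitting on one attribute, divided by the split information."""
    total = len(rows)
    # Group the class labels by the value each row has for this attribute.
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # Entropy that remains after the split, weighted by group size.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    # Split information penalises attributes with many distinct values.
    split_info = entropy([row[attribute] for row in rows])
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical toy data: (outlook, temperature) and whether to play outside.
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "hot"), ("rainy", "mild")]
labels = ["no", "no", "yes", "yes"]

ratios = {name: gain_ratio(rows, labels, i)
          for i, name in enumerate(["outlook", "temperature"])}
best_attribute = max(ratios, key=ratios.get)
print(ratios, best_attribute)
```

In this toy set the outlook attribute separates the two classes perfectly while temperature tells us nothing, so outlook would become the first test in the tree.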
- K-means
K-means works very differently from C4.5. It is an unsupervised learning algorithm, which means it does not require labelled training data. It simply groups data points into k clusters based on their similarity. It is a popular data mining algorithm because of its simplicity.
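The grouping by similarity can be sketched in a few lines of plain Python. This is only an illustrative sketch: the sample points are made up, and starting from the first k points is a simplification chosen here to keep the example deterministic (real implementations usually pick the starting centroids at random or with a smarter scheme). It needs Python 3.8 or later for `math.dist`.

```python
import math

def k_means(points, k, iterations=20):
    """Cluster points into k groups; returns (centroids, labels)."""
    # Simplification: start from the first k points instead of random ones.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # nothing moved, so the clustering has converged
        centroids = new_centroids
    return centroids, labels

# Hypothetical sample: two visibly separate groups of 2-D points.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9), (8, 9.5)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

Note that no class labels are passed in: the algorithm only sees the coordinates and discovers the two groups on its own.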
- Support Vector Machines (SVM)
A Support Vector Machine is a supervised classifier like C4.5, but in its basic form it only separates data into two classes. It can be pictured as a line drawn on a sheet of graph paper to separate the data points marked on the sheet, where the SVM looks for the line that leaves the widest margin between the two groups.
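The "line on a graph sheet" picture can be sketched as a small linear SVM trained by sub-gradient descent on the hinge loss. This is one simple way to fit such a line, not the only one and not how production libraries do it; the learning rate, regularization strength, epoch count, and the toy points are all assumptions chosen for illustration. Labels must be +1 or -1.

```python
def train_linear_svm(points, labels, learning_rate=0.1, regularization=0.01, epochs=50):
    """Fit w and b for the rule sign(w . x + b) by sub-gradient descent on the hinge loss."""
    weights = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(w * xi for w, xi in zip(weights, x)) + bias)
            # The regularization term always shrinks w, which favours a wide margin.
            weights = [w - learning_rate * regularization * w for w in weights]
            if margin < 1:
                # Point is misclassified or inside the margin: move the line away from it.
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias

def predict(weights, bias, x):
    """Return +1 or -1 depending on which side of the line x falls."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias >= 0 else -1

# Hypothetical toy data: two groups on opposite sides of the origin.
points = [(-2, -1), (-1, -2), (-2, -2), (2, 1), (1, 2), (2, 2)]
labels = [-1, -1, -1, 1, 1, 1]

weights, bias = train_linear_svm(points, labels)
print(weights, bias, [predict(weights, bias, p) for p in points])
```

The pair `weights` and `bias` describes the separating line; a new point is classified simply by checking which side of that line it lies on.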
- Apriori
Apriori is an algorithm that looks for items that frequently occur together in a data set and derives association rules from them. It is useful when trying to find a link between two or more items, for example products that are often bought together.
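A rough sketch of the frequent-itemset part of Apriori is shown below. It works level by level: itemsets of size k are only considered if all of their smaller subsets were already frequent. The shopping baskets and the `min_count` threshold are hypothetical, and the step that turns frequent itemsets into association rules is left out.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return a dict mapping each frequent itemset to the number of transactions containing it."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)
        k += 1
        # Join step: merge surviving itemsets into candidates one item larger.
        candidates = {a | b for a in survivors for b in survivors if len(a | b) == k}
        # Prune step: drop candidates that have any infrequent subset.
        current = {c for c in candidates
                   if all(frozenset(s) in survivors for s in combinations(c, k - 1))}
    return frequent

# Hypothetical shopping baskets.
baskets = [{"bread", "milk"}, {"bread", "butter"}, {"bread", "milk", "butter"}, {"milk"}]
frequent = apriori(baskets, min_count=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The prune step is what keeps the search manageable: milk and butter appear together only once, so the three-item set containing them is never even counted.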
- Expectation-Maximization (EM)
Expectation-Maximization, or EM, is an algorithm that clusters data using statistical models. For example, we are all familiar with the bell curve: the scores of a test usually form a bell shape, where most scores fall somewhere in the middle and only a few are very low or very high. EM tries to estimate the curve that best fits a sample of the data and then uses that curve to describe the whole set.
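As a hedged illustration, the sketch below uses EM to fit two bell curves (a two-component Gaussian mixture) to one-dimensional scores. The data, the choice of exactly two curves, the deterministic starting values, and the variance floor are all assumptions made to keep the example short and repeatable.

```python
import math

def gaussian_pdf(x, mean, variance):
    """Height of the bell curve with the given mean and variance at x."""
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

def em_two_gaussians(data, iterations=50):
    """Fit a mixture of two 1-D bell curves; returns (weights, means, variances)."""
    overall_mean = sum(data) / len(data)
    overall_var = sum((x - overall_mean) ** 2 for x in data) / len(data)
    # Simplified deterministic start: one curve at each end of the data.
    means = [min(data), max(data)]
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: how responsible is each curve for each data point?
        resp = []
        for x in data:
            likelihoods = [w * gaussian_pdf(x, m, v)
                           for w, m, v in zip(weights, means, variances)]
            total = sum(likelihoods)
            resp.append([l / total for l in likelihoods])
        # M-step: re-estimate each curve from the points it is responsible for.
        for j in range(2):
            r_sum = sum(r[j] for r in resp)
            weights[j] = r_sum / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / r_sum
            variances[j] = max(
                sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / r_sum,
                1e-6,  # floor so a curve never collapses to zero width
            )
    return weights, means, variances

# Hypothetical scores drawn from two groups, one around 1 and one around 5.
scores = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances = em_two_gaussians(scores)
print(weights, means, variances)
```

Each point ends up assigned, softly, to the curve that explains it best, which is how EM doubles as a clustering method.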
- PageRank
PageRank is an algorithm that affects our lives every day. We probably know it best as the link-analysis algorithm that Google's search engine was built on. PageRank looks at how many other pages link to a web page and at how important those linking pages are themselves. Pages with more links, and with links from more important pages, are considered more important and carry more weight.
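The link-based scoring can be sketched with the usual iterative formulation, in which every page repeatedly passes a share of its own rank to the pages it links to. The four-page link graph is made up, and the damping factor of 0.85 and the fixed iteration count are conventional choices assumed here rather than values taken from the article.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it links to; returns a rank per page."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of its links.
        new_ranks = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * ranks[page] / len(targets)
                for t in targets:
                    new_ranks[t] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for t in pages:
                    new_ranks[t] += damping * ranks[page] / n
        ranks = new_ranks
    return ranks

# Hypothetical link graph: A links to B and C, B and D link to C, C links back to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Page C comes out on top because three pages link to it, and page A ranks well above D even though each has at most one inbound link, because A's single link comes from the highly ranked C.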
- AdaBoost
AdaBoost is a supervised algorithm that builds a good classifier out of a group of weak classifiers. A weak learner in machine learning is a classifier that performs only slightly better than random chance. Several of them can nevertheless be combined into a classifier that performs much better, and AdaBoost does exactly that.
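Here is a minimal sketch of that combination step, using decision stumps (a single threshold test) as the weak classifiers. The ten labelled numbers are a toy data set assumed for illustration; no single stump can classify them all correctly, but a weighted vote of a few stumps can. Labels must be +1 or -1.

```python
import math

def stump_predict(x, threshold, polarity):
    """A decision stump: one threshold test on a single number."""
    return polarity if x < threshold else -polarity

def best_stump(xs, ys, weights):
    """Find the stump with the lowest weighted error; returns (error, threshold, polarity)."""
    values = sorted(set(xs))
    best = None
    for threshold in [(a + b) / 2 for a, b in zip(values, values[1:])]:
        for polarity in (1, -1):
            error = sum(w for x, y, w in zip(xs, ys, weights)
                        if stump_predict(x, threshold, polarity) != y)
            if best is None or error < best[0]:
                best = (error, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Train a weighted committee of stumps; returns a list of (alpha, threshold, polarity)."""
    weights = [1.0 / len(xs)] * len(xs)
    model = []
    for _ in range(rounds):
        error, threshold, polarity = best_stump(xs, ys, weights)
        error = min(max(error, 1e-10), 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * math.log((1 - error) / error)  # this stump's say in the vote
        model.append((alpha, threshold, polarity))
        # Raise the weight of misclassified points so the next stump focuses on them.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict(model, x):
    """Weighted vote of all stumps in the model."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in model)
    return 1 if score >= 0 else -1

# Hypothetical toy data that no single threshold can separate.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

model = adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])
```

The best single stump still gets three of the ten points wrong, while the weighted vote of three stumps classifies all ten training points correctly.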
- K-Nearest Neighbours (KNN)
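K-nearest-neighbours is a simple data mining algorithm and, like most of the others, it is a classifier, but it works differently from the classifiers described above. Instead of building a model from the training data in advance, it stores the training data and classifies a new item by looking at the k most similar stored items and taking the most common class among them.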
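A short sketch of the idea follows: a new point receives the most common label among its k closest training points. The training points, the labels "a" and "b", the choice of k = 3, and the use of straight-line (Euclidean) distance are all assumptions made for illustration. It needs Python 3.8 or later for `math.dist`.

```python
import math
from collections import Counter

def knn_predict(training, query, k=3):
    """training is a list of (point, label) pairs; returns the majority label of the k nearest."""
    # Sort the stored examples by distance to the query and keep the k closest.
    nearest = sorted(training, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labelled points forming two groups.
training = [((1.0, 1.0), "a"), ((1.5, 2.0), "a"), ((2.0, 1.0), "a"),
            ((8.0, 8.0), "b"), ((9.0, 9.0), "b"), ((8.0, 9.5), "b")]

print(knn_predict(training, (1.2, 1.1)), knn_predict(training, (8.5, 8.5)))
```

There is no separate training step here: all the work happens at prediction time, when the stored examples are compared with the new point.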