Hadoop is widely used in industrial big data applications, e.g., spam filtering, web search, clickstream analysis, and social recommendation. In addition, considerable academic research is now based on Hadoop. Some representative cases are given below.
As of November 2012, Facebook announced that its Hadoop cluster could process 100 PB of data, a volume growing by 0.5 PB per day. In addition, many companies provide commercial Hadoop implementations and/or support, including Cloudera, IBM, MapR, EMC, and Oracle. In modern industrial machinery and systems, sensors are widely deployed to collect information for purposes such as environment monitoring and failure forecasting.
Gunarathne et al. used cloud computing infrastructures (Amazon AWS and Microsoft Azure) and the data processing frameworks Hadoop (based on MapReduce) and Microsoft DryadLINQ to run two parallel biomedical applications: (i) assembly of genome segments; (ii) dimension reduction in the analysis of chemical structure. In the latter application, the 166-dimensional dataset contained 26,000,000 data points. The authors compared the performance of all the frameworks in terms of efficiency, cost, and availability. They concluded that loose coupling will be increasingly applied to research on electron cloud, and that parallel programming frameworks such as MapReduce may provide users with a more convenient service interface and reduce unnecessary costs.
With the unstoppable growth of the internet, the overall volume of data (online and offline) is increasing in all domains. There was a time when standalone data mining tools were sufficient for data mining tasks, but increased data volumes demand a distributed approach in order to save time, energy, and money. Hadoop, the most popular distributed processing framework, is a major innovation in the IT world.
As more data centers support the Hadoop framework, it becomes very important to migrate existing data mining approaches and algorithms onto the Hadoop platform for increased parallel processing efficiency. In an IT world where organizations are rich in data, the true value lies in the ability to collect this data, sort it, and analyze it so that it yields actionable business intelligence. To analyze the data, traditional data mining algorithms such as clustering and classification form the basis of the machine learning processes in business intelligence support tools.
As the IT industry began using larger volumes of data, migrating that data over the network for transformation or analysis became unrealistic. Moving terabytes of data from one system to another every day is cumbersome for network administrators and programmers alike; it makes more sense to push the processing to the data. Moving all the data to one storage area network (SAN) or ETL server is not feasible at this scale. Even if the data can be moved, processing is slow, limited by SAN bandwidth, and often fails to meet batch processing windows. Hadoop is a Java-based programming framework that supports the processing of large data sets in a distributed computing environment; it is part of the Apache project sponsored by the Apache Software Foundation.
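To make the data-movement argument concrete, the following back-of-the-envelope sketch estimates an ideal transfer time. The data volume (10 TB) and link speed (1 Gbit/s) are illustrative assumptions rather than figures from the text, and the function name `transfer_hours` is hypothetical. The calculation ignores protocol overhead and contention, so real transfers would be slower.

```python
def transfer_hours(terabytes: float, gigabits_per_second: float) -> float:
    """Ideal time in hours to move `terabytes` over a link of the given speed.

    Assumes decimal units (1 TB = 10**12 bytes, 1 Gbit/s = 10**9 bits/s)
    and a fully saturated link with no protocol overhead.
    """
    bits = terabytes * 10**12 * 8
    seconds = bits / (gigabits_per_second * 10**9)
    return seconds / 3600


if __name__ == "__main__":
    # Assumed example: 10 TB per day over a dedicated 1 Gbit/s link.
    print(f"{transfer_hours(10, 1):.1f} hours")
```

Under these assumptions the daily transfer alone takes about 22 hours, which leaves almost none of a 24-hour batch window for the processing itself.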
Hadoop was originally conceived on the basis of Google’s MapReduce, in which an application is broken down into numerous small parts. The Apache Hadoop software library can detect and handle failures at the application layer, and can therefore deliver a highly available service on top of a cluster of computers, each of which may be prone to failure. Because it provides inexpensive and reliable storage, Hadoop can give a distributed system much-needed robustness and scalability.
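Handling failures in software, rather than relying on reliable hardware, can be sketched as re-running a failed task. This is a toy, single-process illustration and not Hadoop's actual scheduler: the names `FlakyTask` and `run_with_retries`, and the limit of four attempts, are assumptions made for the example.

```python
class FlakyTask:
    """Hypothetical task that fails a fixed number of times, then succeeds."""

    def __init__(self, failures_before_success: int):
        self.remaining_failures = failures_before_success
        self.attempts = 0

    def __call__(self, chunk):
        self.attempts += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise RuntimeError("simulated node failure")
        return sum(chunk)


def run_with_retries(task, chunk, max_attempts: int = 4):
    """Re-run `task` on `chunk` until it succeeds or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task(chunk)
        except RuntimeError as err:
            # A real framework would reschedule the task on another node.
            last_error = err
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error
```

Because each task works on its own chunk of the input, a failed attempt can be repeated without affecting the other tasks, which is what makes this kind of recovery cheap on unreliable machines.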
The Hadoop framework is a software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. It is built on the MapReduce programming model, a generic execution engine that parallelizes computation over a large cluster of machines. MapReduce is a distributed programming model intended for large clusters of systems that work in parallel on a large dataset.
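The programming model can be illustrated with the classic word-count example. The sketch below runs in a single Python process and only mimics the structure of a MapReduce job; it does not use the Hadoop API, and the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key and sort the keys, as the framework would
    between the map and reduce phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_counts(word, counts):
    """Reduce step: combine all counts emitted for one word."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(w, c) for w, c in shuffle(pairs).items())
```

Each call to `map_words` depends only on its own line, and each call to `reduce_counts` depends only on one key, so on a cluster both steps could run on different machines at the same time.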
The JobTracker is responsible for handling the map and reduce processes. The tasks into which the main application is divided are first processed by the map tasks in a completely parallel manner. The MapReduce framework sorts the outputs of the maps, which are then given as input to the reduce tasks. Both the input and the output of the job are stored in the file system. Because of the parallel computing nature of MapReduce, parallelizing data mining algorithms with the MapReduce model has become popular and has received significant attention from the research community.
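As a rough illustration of how a data mining algorithm maps onto this model, the sketch below expresses one iteration of k-means clustering as a map step and a reduce step. k-means is chosen here only as a familiar example; the text does not name a specific algorithm. The code runs in a single process, and all function names are hypothetical.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def kmeans_map(point, centroids):
    """Map: emit (cluster index, (point, 1)) for one data point."""
    return nearest(point, centroids), (point, 1)


def kmeans_reduce(values):
    """Reduce: average all points assigned to one cluster."""
    total = [0.0] * len(values[0][0])
    count = 0
    for point, n in values:
        total = [t + p for t, p in zip(total, point)]
        count += n
    return tuple(t / count for t in total)


def kmeans_iteration(points, centroids):
    """One k-means step: map every point, group by cluster, reduce each group."""
    groups = defaultdict(list)
    for point in points:
        key, value = kmeans_map(point, centroids)
        groups[key].append(value)
    # Clusters that received no points keep their previous centroid.
    return [
        kmeans_reduce(groups[i]) if i in groups else tuple(centroids[i])
        for i in range(len(centroids))
    ]
```

The map step needs only the current centroids and a single point, so the points can be partitioned across many map tasks; a full clustering run would repeat `kmeans_iteration` until the centroids stop changing.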