Aditya Desai.

7th semester,
Dept. of Computer Science and Engineering SGI, Atigre, India.


E-mail: [email protected]


Sushant Gaikwad.

7th semester,
Dept. of Computer Science and Engineering SGI, Atigre, India.

E-mail: [email protected]



7th semester,
Dept. of Computer Science and Engineering SGI, Atigre, India.

E-mail: [email protected]


Ritu Vachhani.

7th semester,
Dept. of Computer Science and Engineering SGI, Atigre, India.

E-mail: [email protected]






In the big data age, extracting
knowledge from massive data has become an increasingly important concern.
Hadoop MapReduce provides two functions, Map and Reduce, which help us
implement machine-learning algorithms in a feasible framework. However,
this framework has a weakness: it does not support iteration. Therefore,
algorithms that require iteration do not operate at full efficiency in the
MapReduce framework. Hence, in this paper we propose to apply advanced
learning processes, namely meta-learning algorithms, which permit MapReduce
to parallelize machine-learning algorithms. This also improves scaling
to big data. Our algorithm reduces the computational cost by reducing the
complexity; this is achieved by increasing the number of computers, which
also decreases the error.




Keywords: MapReduce, Big Data, Hadoop.




Nowadays, it is becoming more and more
important to organize and utilize massive amounts of data, as we move from
the age of terabytes to that of petabytes. For such massive data, building
classification models with the data mining algorithms currently available
is difficult; these algorithms are not sufficient to achieve efficient and
effective models. Big Data is data that cannot easily be processed using
traditional computing techniques; YouTube, for example, works with and
manages such data on a daily basis. Big Data involves different aspects
such as velocity, variety, volume and complexity. Big Data challenges
include searching, sharing, transferring, querying, data analysis
and information policy.

MapReduce is a programming paradigm that
runs in the background of Hadoop to provide scalability and easy
data-processing solutions. In short, it is a programming model for writing
applications that can process big data in parallel on multiple nodes. It
provides analytical capabilities for analyzing huge volumes of complex data.
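One way to picture this programming model is Hadoop Streaming, which lets any executable act as the mapper or reducer by reading and writing tab-separated key-value lines. The following is a minimal word-count sketch in that style, not a complete Streaming job (a real job would run each function as a separate script over stdin):

```python
from itertools import groupby

def mapper(lines):
    """Emit one 'word<TAB>1' line per word, in Hadoop Streaming's
    tab-separated text form."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(pairs):
    """Sum the counts for each word. Hadoop delivers pairs to the
    reducer sorted by key, which groupby relies on here."""
    split = (p.split("\t") for p in pairs)
    for word, group in groupby(split, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(v) for _, v in group)}"

# In a real Streaming job each function would be its own script reading
# sys.stdin; here we chain them in-process for illustration, with the
# sort standing in for Hadoop's shuffle.
mapped = sorted(mapper(["the quick fox", "the lazy dog"]))
counts = dict(line.split("\t") for line in reducer(mapped))
# counts["the"] == "2"
```

In an actual cluster run, the framework, not the script, performs the sort between the two phases and distributes the work across nodes.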



1) Due to the popularity of non-negative matrix factorization and the
availability of massive data sets, researchers face the problem of
factorizing large-scale matrices with dimensions on the order of millions.
It is feasible to factorize a million-by-million matrix with billions of
nonzero elements on a MapReduce cluster.

2) C4.5 is a successor of the ID3 algorithm, used to generate a decision
tree with high accuracy in decision making.

3) OLS (Ordinary Least Squares) is used to minimize the sum of squared
differences between the observed and predicted values.

4) AdaBoost (Adaptive Boosting) is used in conjunction with other learning
algorithms to improve performance; the outputs of the learning algorithms
are combined into a weighted sum that represents the final output of the
boosted classifier.

5) VSM (Vector Space Model) is used in information filtering, information
retrieval, indexing and relevancy rankings.
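To make item 3 concrete: for a single predictor, the OLS minimization has a closed-form solution, slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and intercept = ȳ − slope · x̄. A standard-library-only sketch (the function name is ours, not from any of the cited papers):

```python
def ols_fit(xs, ys):
    """Ordinary Least Squares for one predictor: the slope and
    intercept that minimize the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form solution of the least-squares minimization.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Points lying exactly on y = 2x + 1 recover slope 2, intercept 1.
slope, intercept = ols_fit([0, 1, 2, 3], [1, 3, 5, 7])
```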




Hadoop is an open-source implementation of a large-scale batch processing
system. Although the Hadoop framework is written in Java, it allows developers
to deploy custom-written programs coded in Java. [1] The Hadoop
ecosystem contains the Hadoop kernel, MapReduce and the Hadoop Distributed
File System (HDFS).


Hadoop is useful for: complex information
processing, turning unstructured data into structured data, heavily
recursive algorithms, machine learning, and situations where fault
tolerance is critical.


Installation of Hadoop on Windows:

To install Hadoop, go to the Apache site and
download the tar file from its FTP server.

Download the Cygwin application; this helps
to unzip the tar file by providing a Linux-style terminal.

Set the paths for the Java JDK and JRE
respectively, for the Java support environment.

Download the Eclipse Java IDE for MapReduce
manipulation on Hadoop.

Download Apache Ant 1.9.6, a Java-based
build tool for automating the build and deployment process.


Installation of Hadoop on Linux:

Update $HOME/.bashrc. Excursus: Hadoop Distributed File System
(HDFS). Configure conf/*-site.xml.

Format the HDFS via the NameNode.

Start your single-node cluster.


MapReduce: MapReduce is a distributed
data-processing algorithm. It is used to process huge amounts of data in a
parallel, reliable and efficient way. It divides input data into smaller,
manageable sub-tasks and executes them in parallel.

The MapReduce algorithm uses the
following functions. Map function: the Map function is the first step in the
MapReduce algorithm. It takes input tasks, divides them into smaller
sub-tasks, and then performs the required computation on each sub-task in
parallel.

The Map function takes place in two steps:
1) splitting, 2) mapping.

Shuffle function: the next step in the
MapReduce algorithm, also known as the "combine function". The
Shuffle function takes place in two steps: 1) merging, 2) sorting.

Reduce function: the last stage in the
MapReduce algorithm. It takes the list of sorted <key, list(values)> pairs
from the Shuffle function and performs the reduce operation on them.
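The three stages above, map, shuffle (merge and sort), and reduce, can be simulated in a single process. This is a toy sketch of the data flow, not Hadoop's actual API:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-process MapReduce: map each input record to (key,
    value) pairs, shuffle (group values by key, keys sorted), then
    reduce each group to a final value."""
    # Map phase: each record yields zero or more (key, value) pairs.
    mapped = [pair for rec in records for pair in map_fn(rec)]
    # Shuffle phase: merge values under their key, then sort the keys.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce phase: one reduce call per key, over that key's values.
    return {k: reduce_fn(k, groups[k]) for k in sorted(groups)}

# Word count expressed in this model.
counts = run_mapreduce(
    ["map reduce map", "reduce"],
    map_fn=lambda line: [(w, 1) for w in line.split()],
    reduce_fn=lambda key, values: sum(values),
)
# counts == {"map": 2, "reduce": 2}
```

On a real cluster the map and reduce calls run on different nodes and the shuffle moves data over the network; the logical flow is the same.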

No abstraction: Hadoop does not provide any
type of abstraction, so a MapReduce developer needs to hand-code each and
every operation, which makes it difficult to work with.

Hadoop MapReduce is a framework for
processing large data sets in parallel across a Hadoop cluster. MapReduce is
divided mainly into two parts, viz. [5] Map and Reduce.

Map: this takes a set of data as input,
works on it and converts it into another set of data, in which individual
elements are broken down into tuples (key-value pairs). Using the MarkLogic
Connector, the input data is fetched.

Reduce: the Reduce task takes the output of
the Map above as its input and combines those data tuples into a smaller set
of tuples. Using the same MarkLogic Connector, the data is stored. By default,
the MapReduce framework gets its input data from HDFS. The data then goes
through the MapReduce algorithm, in which the two tasks above are performed.
MapReduce implements various mathematical algorithms to divide a task into
small parts and assign them to multiple systems. These algorithms help in
sending the Map and Reduce tasks to the appropriate servers in a cluster.
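The routing of Map output to the appropriate Reduce server is the job of the partition function; Hadoop's default behaviour is to hash the key modulo the number of reducers. A sketch of that idea, with Python's built-in hash standing in for Hadoop's hash partitioner:

```python
def partition(key, num_reducers):
    """Default-style partitioner: hash the key modulo the reducer
    count, so equal keys always land on the same reducer."""
    return hash(key) % num_reducers

def route(pairs, num_reducers):
    """Assign each (key, value) pair to its reducer's bucket."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets

buckets = route([("a", 1), ("b", 2), ("a", 3)], num_reducers=2)
# Both ("a", 1) and ("a", 3) end up in the same bucket, so one
# reducer sees every value for key "a".
```

Keeping all values for a key on one reducer is what makes the per-key reduce correct; load balance then depends on how evenly the hash spreads the keys.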

Mathematical algorithms:

All types of structured and unstructured data
need to be translated into this basic unit, the key-value pair, before the
data is fed to MapReduce.



Hadoop MapReduce is a
large-scale open-source software framework dedicated to scalable, distributed,
data-intensive computing. [1] The framework breaks large data up into smaller,
parallelizable chunks and handles scheduling: it maps each piece to an
intermediate value, reduces the intermediate values to a solution, and offers
user-specified partition and combiner options. [5] If you can rewrite your
algorithms as maps and reduces, and your problems can be broken up into small
pieces solvable in parallel, then Hadoop's MapReduce is the way to go for
distributed problem-solving approaches to large databases.

It is usually observed
that the MapReduce framework generates a large amount of intermediate data.
This abundant information is thrown away once the tasks finish, because
MapReduce is unable to reuse it. Therefore, we propose a data-aware cache
framework for big-data applications, in which tasks submit their intermediate
results to the cache manager. A task queries the cache manager before
executing its actual computing work.
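The proposed cache manager can be pictured as a memoization layer between tasks and their computation. The sketch below (class and function names are ours, chosen for illustration) shows the protocol described above: a task queries the cache before computing and submits its result on a miss:

```python
class CacheManager:
    """Keeps intermediate results keyed by (operation, input), so a
    later task doing the same work can skip the computation."""
    def __init__(self):
        self.store = {}
        self.hits = 0

    def query(self, key):
        """Return a cached result, or None on a cache miss."""
        if key in self.store:
            self.hits += 1
            return self.store[key]
        return None

    def submit(self, key, result):
        """Record an intermediate result for future tasks."""
        self.store[key] = result

def run_task(cache, op_name, data, compute):
    """Query the cache first; only compute (and submit) on a miss."""
    key = (op_name, tuple(data))
    cached = cache.query(key)
    if cached is not None:
        return cached
    result = compute(data)
    cache.submit(key, result)
    return result

cache = CacheManager()
run_task(cache, "sum", [1, 2, 3], sum)   # miss: computes and caches
run_task(cache, "sum", [1, 2, 3], sum)   # hit: served from the cache
# cache.hits == 1
```

A production design would also need eviction, cache-key normalization, and distribution across nodes; the point here is only the query-before-compute protocol.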




1) Dhole Poonam B., Gunjal Baisa L., "Survey Paper on Traditional Hadoop and
Pipelined Map Reduce", International Journal of Computational Engineering
Research, Vol. 03, Issue 12.


2) Xuan Liu, Xiaoguang Wang, Stan Matwin and Nathalie Japkowicz,
"Meta-MapReduce for scalable data mining", Journal of Big Data (2015) 2:14,
DOI 10.1186/s40537-015-0021-4.


3) Nilam Kadale, U. A. Mande, "Survey of Task Scheduling Method for MapReduce
Framework in Hadoop", International Journal of Applied Information Systems
(IJAIS), ISSN 2249-0868, Foundation of Computer Science FCS, New York, USA,
2nd National Conference on Innovative Paradigms in Engineering & Technology
(NCIPET 2013).


4) Suman Arora, Dr. Madhu Goel, "Survey Paper on Scheduling in Hadoop",
International Journal of Advanced Research in Computer Science and Software
Engineering, Volume 4, Issue 5, May 2014.


5) Wang, F. et al., "Hadoop High Availability through Metadata Replication",
ACM (2009); B. Thirumala Rao, Dr. L. S. S. Reddy, "Survey on Improved
Scheduling in Hadoop MapReduce in Cloud Environments", International Journal
of Computer Applications (0975-8887), Volume 34, No. 9, November 2011.


6) Vishal S. Patil, Pravin D. Soni, "HADOOP", International Journal of
Application or Innovation in Engineering & Management (IJAIEM), Volume 2,
Issue 2, February 2013, ISSN 2319-4847.


7) Sanjay Rathe, "Big Data and Hadoop with components like Flume, Pig, Hive
and Jaql", International Conference on Cloud, Big Data and Trust 2013,
Nov 13-15, RGPV.


8) Yaxiong Zhao, Jie Wu and Cong Liu, "Dache: A Data Aware Caching for
Big-Data Applications Using the MapReduce Framework", Tsinghua Science and
Technology, ISSN 1007-0214, pp. 39-50, Volume 19, Number 1, February 2014.

