The concept of Hadoop:
Hadoop is an open-source, reliable, and scalable distributed computing framework.
It allows large data sets distributed across clusters of computers to be processed using simple programming models.
Scalable: it scales from a single server to thousands of machines, each offering local computation and storage.
Reliable: rather than relying on hardware to deliver high availability, it detects and handles failures at the application layer, providing a highly available service on top of a cluster of computers.
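To make "simple programming models" concrete, below is a sketch of the canonical WordCount job written against Hadoop's Java MapReduce API (essentially the example from the official Hadoop tutorial): the map phase emits a (word, 1) pair for every word, and the reduce phase sums the counts. Input and output HDFS paths are taken from the command line.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The same mapper is reused as a combiner so that counts are partially aggregated on each node before being shuffled across the network, which is the main optimization even this minimal job benefits from.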
What Hadoop can do:
Building large data warehouses
PB-scale data storage, processing, analysis, statistics, and other services
Search engines
Log analysis
Data mining
Business intelligence
History of Hadoop:
1. From 2003 to 2004, Google published three papers:
GFS: The Google File System
MapReduce: Simplified Data Processing on Large Clusters
Bigtable: A Distributed Storage System for Structured Data
2. In February 2006, Hadoop became an independent Apache open-source project (Doug Cutting and others had implemented the GFS and MapReduce mechanisms).
3. April 2006 - The standard sort benchmark (10 GB per node) ran on 188 nodes in 47.9 hours.
4. April 2008 - Hadoop set a world record as the fastest system to sort 1 TB of data, taking 209 seconds on 900 nodes.
5. In 2008, Taobao began investing in Yunti ("Cloud Ladder"), a Hadoop-based system. The cluster's total capacity was about 9.3 PB across 1,100 machines, processing 18,000 jobs and scanning 500 TB of data every day.
6. In March 2009, Cloudera launched CDH (Cloudera's Distribution Including Apache Hadoop).
7. May 2009 - Yahoo's team used Hadoop to sort 1 TB of data in 62 seconds.
8. July 2009 - The Hadoop Core project was renamed Hadoop Common.
9. In July 2009, MapReduce and the Hadoop Distributed File System (HDFS) became independent subprojects of the Hadoop project.
10. December 2011 - Apache Hadoop 1.0 became available.
11. April 2018 - Apache Hadoop 3.1 became available.
12. Search engine era
Search engines needed to store huge numbers of web pages (driving the move from single machines to clusters)
Word counting and PageRank computation
13. Data warehouse era
Facebook launched Hive.
Previously, data analysis and statistics were confined to databases; limits on data volume and computing power meant that only the most important data (decision-making and finance-related data) could be analyzed.
Hive makes it possible to run SQL on Hadoop, so that server logs, application-collected data, and database data can all be analyzed together (see the JDBC sketch after this list).
14. Data mining era
The classic "beer and diapers" story
Association analysis (a toy support/confidence computation follows this list)
User profiles / item profiles
15. Machine learning era: big data in the broad sense
Big data expands data storage and processing capacity, providing the fuel for machine learning.
AlphaGo
Voice assistants such as Siri, Xiaomi's Xiao AI, and Tmall Genie
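Item 13's point, that Hive exposes Hadoop through SQL, can be sketched with the HiveServer2 JDBC driver. This is a minimal example, assuming a HiveServer2 instance listening on localhost:10000 and a hypothetical web_logs table; the host, table, columns, and credentials are illustrative, not part of any real deployment.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveLogAnalysis {
  public static void main(String[] args) throws Exception {
    // Register the HiveServer2 JDBC driver.
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://localhost:10000/default", "hive", "");
         Statement stmt = conn.createStatement();
         // Hive compiles this SQL into jobs that scan the
         // log files stored in HDFS; web_logs is hypothetical.
         ResultSet rs = stmt.executeQuery(
             "SELECT url, COUNT(*) AS hits "
                 + "FROM web_logs GROUP BY url "
                 + "ORDER BY hits DESC LIMIT 10")) {
      while (rs.next()) {
        System.out.println(rs.getString("url") + "\t" + rs.getLong("hits"));
      }
    }
  }
}
```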
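As a toy illustration of the association analysis mentioned in item 14, the following computes support and confidence for the rule {diapers} -> {beer} over a handful of made-up transactions. Real systems mine such rules at scale with algorithms like Apriori or FP-Growth; the data here is purely illustrative.

```java
import java.util.List;
import java.util.Set;

public class BeerDiapers {
  public static void main(String[] args) {
    // Five made-up shopping baskets.
    List<Set<String>> transactions = List.of(
        Set.of("diapers", "beer", "milk"),
        Set.of("diapers", "beer"),
        Set.of("diapers", "bread"),
        Set.of("beer", "chips"),
        Set.of("diapers", "beer", "chips"));

    long diapers = transactions.stream()
        .filter(t -> t.contains("diapers")).count();
    long both = transactions.stream()
        .filter(t -> t.contains("diapers") && t.contains("beer")).count();

    // support({diapers, beer}): fraction of all baskets containing both items.
    double support = (double) both / transactions.size();
    // confidence(diapers -> beer): of the baskets with diapers,
    // the fraction that also contain beer.
    double confidence = (double) both / diapers;

    System.out.printf("support=%.2f confidence=%.2f%n", support, confidence);
  }
}
```

On this data the rule has support 0.60 and confidence 0.75: beer appears in three of the four baskets that contain diapers, which is the kind of co-purchase pattern the "beer and diapers" story describes.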