zhiqingstudy

Be a young person with knowledge and content

The concept of Hadoop:

Hadoop is an open-source, reliable, and scalable framework for distributed computing.

  1. It allows large data sets distributed across clusters of computers to be processed using simple programming models;

  2. Scalable: it scales from a single server to thousands of machines, each offering local computation and storage;

  3. Reliable: rather than relying on hardware for high availability, it detects and handles failures at the application layer, and can therefore deliver a highly available service on top of a cluster of computers.
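The "simple programming model" at the heart of Hadoop is MapReduce. The following is a minimal single-process sketch of that model in plain Python (not Hadoop's actual Java API): a map phase emits key–value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. The word-count example and data are illustrative only.

```python
from collections import defaultdict

def map_phase(document):
    # Mapper: emit a (word, 1) pair for every word in the input split
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all emitted values by key,
    # as the framework does between the map and reduce phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big clusters", "data on clusters"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
print(counts)  # {'big': 2, 'data': 2, 'clusters': 2, 'on': 1}
```

In real Hadoop, the mappers and reducers run in parallel on many nodes and the shuffle moves data across the network; the programmer still only writes the two small functions above.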

What Hadoop can do:

  • Build a large data warehouse

  • PB-level data storage, processing, analysis, and statistics

  • Search engines

  • Log analysis

  • Data mining

  • Business intelligence

History of Hadoop:

1. From 2003 to 2004, Google published three papers

GFS: Google File System

MapReduce: Simplified Data Processing on Large Clusters

Bigtable: A Distributed Storage System for Structured Data

2. In February 2006, Hadoop became an independent open-source Apache project (Doug Cutting and others had implemented open-source versions of the GFS and MapReduce mechanisms)

3. April 2006 - The standard sort benchmark (10 GB per node) ran in 47.9 hours on 188 nodes.

4. April 2008 - Hadoop set the world record for the fastest 1 TB sort: 209 seconds on 900 nodes.

5. In 2008, Taobao began investing in Yunti ("Cloud Ladder"), a Hadoop-based system. It had a total capacity of about 9.3 PB across 1,100 machines, processing 18,000 jobs and scanning 500 TB of data every day.

6. In March 2009, Cloudera launched CDH (Cloudera's Distribution Including Apache Hadoop).

7. May 2009 - Yahoo's team used Hadoop to sort 1 TB of data in 62 seconds.

8. July 2009 - Hadoop Core project was renamed Hadoop Common;

9. In July 2009, MapReduce and Hadoop Distributed File System (HDFS) became independent subprojects of Hadoop project.

10. November 1, 2012 Apache Hadoop 1.0 Available.

11. April 1, 2018 Apache Hadoop 3.1 Available.

12. The search-engine era

Search engines needed to store huge numbers of web pages, moving from single machines to clusters.

Typical workloads: word count and PageRank.
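PageRank, the algorithm behind early web search ranking, is itself a classic MapReduce-style iterative computation. Below is a small illustrative sketch of the power-iteration form in plain Python (the three-page link graph is invented for the example; real deployments run this over billions of pages on a cluster):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with a uniform distribution
    for _ in range(iterations):
        # Every page keeps a baseline (1 - damping) / n ...
        new_rank = {p: (1 - damping) / n for p in pages}
        # ... plus a damped share of the rank of every page linking to it
        for page, outgoing in links.items():
            if outgoing:
                share = rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += damping * share
        rank = new_rank
    return rank

# Tiny illustrative web: A links to B and C, B links to C, C links to A
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

Here C ends up ranked highest, since it is linked from both A and B; the ranks form a probability distribution summing to 1.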

13. The data-warehouse era

Facebook launched Hive.

Previously, data analysis and statistics were confined to databases. Limited data volume and computing power meant that only the most important data (decision-making and finance-related data) could be analyzed.

Hive can run SQL-style queries on Hadoop, so server logs, application-collected data, and database data can all be analyzed together.

14. The data-mining era

The classic "beer and diapers" story

Association analysis

User profiles / item profiles
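The "beer and diapers" story is the textbook example of association analysis: measuring how often items co-occur in shopping baskets. A minimal sketch of the two core metrics, support and confidence, in plain Python (the baskets below are hypothetical data invented for illustration):

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support and confidence for the association rule antecedent -> consequent.

    support    = fraction of all baskets containing both item sets
    confidence = fraction of baskets with the antecedent that also
                 contain the consequent
    """
    n = len(transactions)
    both = sum(1 for t in transactions if antecedent <= t and consequent <= t)
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / n
    confidence = both / ante if ante else 0.0
    return support, confidence

# Hypothetical shopping baskets (illustrative only)
baskets = [
    {"beer", "diapers", "milk"},
    {"beer", "diapers"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"milk", "bread"},
]
support, confidence = rule_metrics(baskets, {"diapers"}, {"beer"})
# support = 0.4 (2 of 5 baskets), confidence ≈ 0.667 (2 of the 3 diaper baskets)
```

Algorithms such as Apriori automate this search for high-support, high-confidence rules over millions of transactions, which is exactly the scale where Hadoop comes in.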

15. Big data broadly, in the machine-learning era

Big data improves data storage capacity and provides the fuel for machine learning.

AlphaGo

Siri, Xiao Ai, Tmall Genie




