How Hadoop Revolutionized Data Warehousing at Yahoo and Facebook
Amr Awadallah, CTO, Cloudera, Inc.
August 5, 2009
Outline
Our Older Systems Limited Raw Data Access
[Diagram labels: Storage Farm for Unstructured Data (20TB/day); Instrumentation; Collection; RDBMS (200GB/day); BI / Reports; Mostly Append; Ad hoc Queries & Data Mining; ETL Grid; Non-Consumption. Callout: Filer heads are a bottleneck.]
We Needed To Be More Agile (part 1)
We Needed To Be More Agile (part 2)
The Solution: A Store-Compute Grid
[Diagram labels: Storage + Computation; Instrumentation; Collection; RDBMS; Interactive Apps; “Batch” Apps; Mostly Append; ETL and Aggregations; Ad hoc Queries & Data Mining.]
What is Hadoop?
Hadoop History
Hadoop Design Axioms
HDFS: Hadoop Distributed File System
Block Size = 64MB
Replication Factor = 3
Cost/GB is a few ¢/month vs $/month
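As a rough illustration of the two settings on this slide, the sketch below counts how many blocks a file occupies and how much raw disk its replicas consume. The helper name `hdfs_footprint` and the 1 GB example file are assumptions made for illustration; only the 64MB block size and the replication factor of 3 come from the slide.

```python
import math

BLOCK_SIZE_MB = 64        # block size from the slide
REPLICATION_FACTOR = 3    # replication factor from the slide


def hdfs_footprint(file_size_mb, block_size_mb=BLOCK_SIZE_MB,
                   replication=REPLICATION_FACTOR):
    """Return (blocks, block_replicas, raw_mb) for one file.

    Hypothetical helper. The last block only occupies the bytes it
    actually holds, so raw storage is file size times replication.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    block_replicas = blocks * replication
    raw_mb = file_size_mb * replication
    return blocks, block_replicas, raw_mb


if __name__ == "__main__":
    blocks, replicas, raw_mb = hdfs_footprint(1024)  # assumed 1 GB file
    print(blocks, replicas, raw_mb)
```

Under these assumptions a 1 GB file splits into 16 blocks, stored as 48 block replicas that take 3 GB of raw disk across the cluster.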
MapReduce: Distributed Processing
MapReduce Example for Word Count
cat *.txt | mapper.pl | sort | reducer.pl > out.txt
Map (in_key, in_value) => list of (out_key, intermediate_value)
Reduce (out_key, list of intermediate_values) => out_value(s)
[Diagram labels: Split 1, Split i, Split N; Map 1, Map i, Map M, each taking (docid, text) and emitting (words, counts); Shuffle, delivering (sorted words, counts); Reduce 1, Reduce i, Reduce R; Output File 1, Output File i, Output File R, each holding (sorted words, sum of counts). Example: “To Be Or Not To Be?”; Be, 5; Be, 12; Be, 7; Be, 6; Be, 30.]
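The Map, Shuffle, and Reduce signatures on this slide can be sketched in a single process as follows. This is a minimal in-memory illustration, not Hadoop code: the function names, the tokenizing regular expression, and the one-document input are all assumptions, and a real job would run many mappers and reducers in parallel across the splits.

```python
from collections import defaultdict
import re


def map_words(doc_id, text):
    """Map: (doc_id, text) -> list of (word, 1) pairs."""
    return [(word, 1) for word in re.findall(r"[a-z']+", text.lower())]


def shuffle(pairs):
    """Shuffle: group intermediate values by key, with keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())


def reduce_counts(word, counts):
    """Reduce: (word, list of counts) -> (word, sum of counts)."""
    return word, sum(counts)


def word_count(documents):
    """Run map, shuffle, and reduce over a dict of doc_id -> text."""
    intermediate = []
    for doc_id, text in documents.items():
        intermediate.extend(map_words(doc_id, text))
    return [reduce_counts(word, counts) for word, counts in shuffle(intermediate)]


if __name__ == "__main__":
    docs = {"doc1": "To Be Or Not To Be?"}  # hypothetical one-document input
    print(word_count(docs))
```

The reducer is the same step that turns the slide's partial counts for "Be" (5, 12, 7, 6) into 30: `reduce_counts("be", [5, 12, 7, 6])` sums whatever list of counts the shuffle hands it.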
Hadoop Is More Than Just Analytics/BI
Apache Hadoop Ecosystem
HDFS (Hadoop Distributed File System)
HBase (Key-Value store)
MapReduce (Job Scheduling/Execution System)
Pig (Data Flow)
Hive (SQL)
BI Reporting
ETL Tools
Avro (Serialization)
ZooKeeper (Coordination)
Sqoop
RDBMS
Hadoop Development Languages
Hive Features
Hadoop vs. Relational Databases
Use The Right Tool For The Right Job
Hadoop Criticisms (part 1)
Hadoop Criticisms (part 2)
Conclusion
Contact Information
APPENDIX
Hadoop High-Level Architecture
Name Node: Maintains mapping of file blocks to data node slaves
Job Tracker: Schedules jobs across task tracker slaves
Data Node: Stores and serves blocks of data
Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs
Task Tracker: Runs tasks (work units) within a job
Data Node and Task Tracker: Share Physical Node
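To make the division of labour on this slide concrete, here is a toy sketch of a name node that maps blocks to data nodes and a job tracker that prefers a task tracker on a node already holding the block. The class and method names, node names, and slot counts are hypothetical and do not reflect Hadoop's actual classes or scheduling policy; the sketch only shows why sharing a physical node helps data locality.

```python
class NameNode:
    """Toy name node: maps each block of a file to the data nodes holding it."""

    def __init__(self):
        self.block_locations = {}  # (path, block_index) -> list of node names

    def add_block(self, path, block_index, nodes):
        self.block_locations[(path, block_index)] = list(nodes)

    def locate(self, path, block_index):
        return self.block_locations[(path, block_index)]


class JobTracker:
    """Toy job tracker: prefers a task tracker that already stores the block."""

    def __init__(self, name_node, free_slots):
        self.name_node = name_node
        self.free_slots = dict(free_slots)  # node name -> free task slots

    def assign(self, path, block_index):
        """Return (node, is_data_local) for the task reading this block."""
        replicas = self.name_node.locate(path, block_index)
        # Data-local choice first, then any node with a free slot.
        candidates = [n for n in replicas if self.free_slots.get(n, 0) > 0]
        if not candidates:
            candidates = [n for n, slots in sorted(self.free_slots.items())
                          if slots > 0]
        if not candidates:
            raise RuntimeError("no free task slots")
        node = candidates[0]
        self.free_slots[node] -= 1
        return node, node in replicas


if __name__ == "__main__":
    name_node = NameNode()
    name_node.add_block("/logs/day1", 0, ["node-a", "node-b", "node-c"])
    tracker = JobTracker(name_node, {"node-a": 0, "node-b": 1, "node-d": 2})
    print(tracker.assign("/logs/day1", 0))  # a replica holder has a free slot
    print(tracker.assign("/logs/day1", 0))  # falls back to a remote node
```

In this example the first task lands on a node that stores a replica, while the second has to read the block over the network because no replica holder has a free slot left.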


Editor's Notes

  1. Data-As-Product is also referred to as Active DW, Operational BI, Online BI, etc.
  2. The solution is to *augment* the current RDBMSes with a “smart” storage/processing system. The original event level data is kept in this smart storage layer and can be mined as needed. The aggregate data is kept in the RDBMSes for interactive reporting and analytics.
  3. The system is self-healing in the sense that it automatically routes around failure: if a node fails, its workload and data are transparently shifted somewhere else. The system is intelligent in the sense that the MapReduce scheduler optimizes for the processing to happen on the same node that stores the associated data (or on a node co-located on the same leaf Ethernet switch); it also speculatively executes redundant tasks if certain nodes are detected to be slow. One of the key benefits of Hadoop is the ability to upload any unstructured files to it without having to “schematize” them first. You can dump any type of data into Hadoop, and the input record readers will then abstract it out as if it were structured (i.e. schema on read vs. schema on write). Open source software allows for innovation by partners and customers. It also enables third-party inspection of source code, which provides assurances on security and product quality. 1 HDD = 75 MB/sec and 1000 HDDs = 75 GB/sec, so the “head of fileserver” bottleneck is eliminated.
  4. http://developer.yahoo.net/blogs/hadoop/2009/05/hadoop_sorts_a_petabyte_in_162.html
  5. Speculative Execution, Data rebalancing, Background Checksumming, etc.
  6. Pool commodity servers in a single hierarchical namespace. Designed for large files that are written once and read many times. The example here shows what happens with a replication factor of 3: each data block is present in at least 3 separate data nodes. A typical Hadoop node has eight cores, 16GB of RAM, and four 1TB SATA disks. The default block size is 64MB, though most folks now set it to 128MB.
  7. Differentiate between MapReduce the platform and MapReduce the programming model. The analogy is to the RDBMS, which executes the queries, and SQL, which is the language for the queries. MapReduce can run on top of HDFS or a selection of other storage systems. Intelligent scheduling algorithms for locality, sharing, and resource optimization.
  8. Think: SELECT word, count(*) FROM documents GROUP BY word. Also check out ParBASH: http://cloud-dev.blogspot.com/2009/06/introduction-to-parbash.html
  9. Other uses include face recognition, document discovery, OCR, gene sequence alignment, etc. Data Mining: Search and Text Analytics; Clustering/Categorization; Modeling/Machine Learning; Optimization/Operations Research; Response Prediction/Forecasting; Simulation (Monte Carlo-like); Random Walks of Connectivity Graphs.
  10. HBase: Low Latency Random-Access with per-row consistency for updates/inserts/deletes
  11. First bullet is like assembly, then it gets higher level from there.
  12. Query: SELECT, FROM, WHERE, JOIN, GROUP BY, SORT BY, LIMIT, DISTINCT, UNION ALL
      Join: LEFT, RIGHT, FULL, OUTER, INNER
      DDL: CREATE TABLE, ALTER TABLE, DROP TABLE, DROP PARTITION, SHOW TABLES, SHOW PARTITIONS
      DML: LOAD DATA INTO, FROM INSERT
      Types: TINYINT, INT, BIGINT, BOOLEAN, DOUBLE, STRING, ARRAY, MAP, STRUCT, JSON OBJECT
      Query: Subqueries in FROM, User Defined Functions, User Defined Aggregates, Sampling (TABLESAMPLE)
      Relational: IS NULL, IS NOT NULL, LIKE, REGEXP
      Built in aggregates: COUNT, MAX, MIN, AVG, SUM
      Built in functions: CAST, IF, REGEXP_REPLACE, …
      Other: EXPLAIN, MAP, REDUCE, DISTRIBUTE BY
      List and Map operators: array[i], map[k], struct.field
  13. Hadoop is good for storing and processing large amounts of unstructured or structured data in batch form (i.e. full table scans). Hadoop with HBase (or Hypertable) can do inserts/updates/deletes with reasonable interactive response times (also see Cassandra).
  14. The sports car is refined, accelerates very fast, and has a lot of add-ons/features, but it is pricey on a per-bit basis and expensive to maintain. The cargo train is rough, missing a lot of functionality, and slow to start, but once it gets going it can carry a lot of stuff very economically.
  15. Hadoop is efficient on a cost basis. Security: Need better integration with systems like LDAP or Kerberos. Also need better isolation against malicious users, though auditing can potentially catch those.
  16. The Data Node slave and the Task Tracker slave can, and should, share the same server instance to leverage data locality whenever possible.
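
The schema-on-read idea from note 3 can be illustrated with a small sketch: the raw bytes are stored untouched, and a reader applies field names only when the data is consumed. The sample log content, the field names, and the `read_records` helper are all assumptions made for illustration; Hadoop's own input record readers are Java classes and are not shown here.

```python
import csv
import io

# Hypothetical raw file content, stored as-is with no schema attached.
RAW_LOG = "2009-08-05,alice,search\n2009-08-05,bob,mail\n"


def read_records(raw_text, field_names):
    """Apply a schema at read time: raw lines in, dict records out."""
    reader = csv.reader(io.StringIO(raw_text))
    for row in reader:
        # zip stops at the shorter sequence, so a narrower schema
        # simply ignores the trailing columns.
        yield dict(zip(field_names, row))


if __name__ == "__main__":
    # Two different schemas over the same stored bytes.
    as_events = list(read_records(RAW_LOG, ["date", "user", "property"]))
    as_dates = list(read_records(RAW_LOG, ["date"]))
    print(as_events[0]["user"], as_dates[1]["date"])
```

Because the schema lives in the reader rather than in the stored file, a new interpretation of old data only needs a new reader, not a reload.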