Ch 9 Big Data - Exam IV

Card Set Information

Ch 9 Big Data - Exam IV
2013-11-09 07:08:21
Exploring the World of Hadoop


  1. Hadoop
    the distributed computing framework & file system environment that MapReduce runs in - works on data of any structure - self-manages - self-heals - detects changes & adjusts to continue without interruption
  2. node
    a single machine (server) within a Hadoop cluster
  3. name node
    the "smart" controller of the cluster
  4. blocks
    smaller pieces that HDFS breaks large files into that are stored on the data nodes
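The block idea can be sketched in a few lines of plain Python (a toy model, not the HDFS API; the function name and the tiny block size are made up for illustration - real HDFS blocks are typically 64-128 MB):

```python
# Toy illustration of HDFS-style block splitting: a large file is cut into
# fixed-size blocks that can then be stored on different data nodes.
# split_into_blocks and the 16-byte block size are invented for this example.

def split_into_blocks(data: bytes, block_size: int) -> list:
    """Split a byte string into fixed-size blocks (last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

file_bytes = b"0123456789" * 5          # a 50-byte "file"
blocks = split_into_blocks(file_bytes, block_size=16)
print(len(blocks))                      # 4 blocks: 16 + 16 + 16 + 2 bytes
```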
  5. Hadoop Distributed File System
    MapReduce engine
    two primary components of Hadoop
  6. Hadoop Distributed File System (HDFS)
    a reliable, high-bandwidth, low-cost data storage cluster that facilitates the management of related files across machines
  8. MapReduce engine
    a high-performance parallel/distributed data-processing implementation of the MapReduce algorithm
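The map/shuffle/reduce flow the engine implements can be modeled in pure Python (a conceptual sketch only - not the Hadoop API; the word-count example and function names are invented):

```python
# Toy model of the MapReduce algorithm: map emits (key, value) pairs,
# the shuffle groups values by key, and reduce aggregates each group.
from collections import defaultdict

def map_phase(line):
    # emit (word, 1) for every word in one line of input
    return [(word, 1) for word in line.split()]

def shuffle(mapped):
    # group values by key, as the engine does between map and reduce
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # sum the grouped counts for each word
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big hadoop", "hadoop big"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(mapped))
print(counts)   # {'big': 3, 'data': 1, 'hadoop': 2}
```

In a real cluster, the map and reduce phases run in parallel on many nodes; this sketch only shows the data movement between them.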
  9. HDFS - Hadoop Distributed File System
    an approach to data/file management that facilitates the process of managing data for easy access (write once read many) allowing for greater coherency & increased throughput - NOT a storage facility - portable across platforms - a collection of clusters
  10. name node
    a master server that manages files & regulates access (opening, closing, renaming files & directories) - and should be replicated (in case of failure)
  11. data node (acts only when directed by the name node)
    an element that manages its local behavior - read/write request mgmt, block creation, deletion, & replication when directed
  12. transaction logs
    checksum validations
    capabilities used by the HDFS to support data integrity
  13. checksum validations
    an error detection method in which the sender computes a numeric value from the contents of a message and the receiver recomputes that value to verify the data arrived intact (ie: credit card checking)
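The sender/receiver exchange can be sketched with a trivial checksum (an illustration of the idea only - HDFS's actual checksum algorithm is different, and the function names here are made up):

```python
# Toy checksum validation: sender derives a small number from the data,
# receiver recomputes it and compares. A mismatch signals corruption.

def checksum(data: bytes) -> int:
    # sender: derive a numeric value from the message contents
    return sum(data) % 256

def verify(data: bytes, expected: int) -> bool:
    # receiver: recompute and compare
    return checksum(data) == expected

msg = b"hello"
tag = checksum(msg)
print(verify(msg, tag))            # True
print(verify(b"hellp", tag))       # False - one corrupted byte is detected
```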
  14. transaction logs
    keep track of every operation and are effective in auditing or rebuilding of the file system if needed
  15. data nodes
    provide "heartbeat" messages to detect and ensure conectivity between the NameNode and the data nodes
  16. NameNode
    node where HDFS metadata is stored - should have lots of physical memory 
  17. pipeline
    a connection between multiple data nodes that exists to support the movement of data across the servers
  18. OutputCollector
    collects the output from the independent mappers and passes it to the reducers
  19. Reporter function
    function that provides information gathered from map tasks so that you know when or if the map tasks are complete
  20. InputFormat
    decides how the file is going to be broken into smaller pieces for processing
  21. Input Split
    one of the smaller pieces a file is broken into, handed to an individual map task for processing
  22. RecordReader
    transforms the raw data of an input split into the key/value pairs consumed by the map function
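A record reader's job can be sketched in Python (a simplified model, not the Hadoop API; it mimics the (byte offset, line) records that Hadoop's default text reader produces, but the function name is invented):

```python
# Toy record reader: turn the raw text of one input split into
# (key, value) records of the form (byte offset, line).

def record_reader(split: str):
    offset = 0
    for line in split.splitlines(keepends=True):
        # key = byte offset of the line, value = the line without its newline
        yield offset, line.rstrip("\n")
        offset += len(line)

split = "first line\nsecond line\n"
print(list(record_reader(split)))
# [(0, 'first line'), (11, 'second line')]
```

Each (key, value) pair is then fed to a map task, which is why the reader sits between the input split and the map in the workflow.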
  23. input format
    input split
    output format
    workflow and data movement steps in a Hadoop cluster