
Big Data and Hadoop



HDFS: Hadoop Distributed File System.

YARN: Yet Another Resource Negotiator. Sits on top of HDFS, with MapReduce running on top of it.

  • Manages the resources on the Hadoop cluster.

MapReduce: A programming model that lets you distribute data processing across a cluster. It consists of mappers and reducers.

Note:

  • Mappers transform the data in parallel across the cluster, and reducers aggregate the results together (see the word-count sketch after this note).

  • In earlier versions of Hadoop, MapReduce and YARN were a single module.
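
To make the mapper/reducer split concrete, here is a minimal word-count sketch in the Hadoop Streaming style, where the mapper and reducer are plain Python scripts that read stdin and write tab-separated key/value pairs. The script and file names are illustrative, not from the original notes.

```python
#!/usr/bin/env python3
# mapper.py: emit (word, 1) for every word seen on stdin.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py: sum the counts per word. Hadoop Streaming delivers
# the mapper output to the reducer sorted by key.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

You can simulate the whole pipeline locally with `cat input.txt | python3 mapper.py | sort | python3 reducer.py` before submitting it through the hadoop-streaming JAR.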

Pig: A SQL-style scripting language that can be used in place of Java or Python. It sits on top of MapReduce.

Hive: Just like Pig, it sits on top of MapReduce and looks very much like a SQL database.
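
Because Hive speaks a SQL dialect (HiveQL), querying it from code looks like ordinary database access. Here is a minimal sketch using the third-party PyHive client; the host, port, user, and the `movies` table are assumptions for illustration:

```python
from pyhive import hive  # third-party: pip install pyhive

# Connect to a HiveServer2 instance (host/port/user are assumptions).
conn = hive.connect(host="hadoop-master", port=10000, username="maria_dev")
cursor = conn.cursor()

# HiveQL reads like SQL; 'movies' is a hypothetical table.
cursor.execute("SELECT title, COUNT(*) AS cnt FROM movies GROUP BY title LIMIT 10")
for title, cnt in cursor.fetchall():
    print(title, cnt)
```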

Apache Ambari: Web view of the Hadoop cluster.

Ambari alternatives: Cloudera, MapR, etc.

Mesos: A YARN alternative.

Spark: Sits at the same level as MapReduce; a fast, general-purpose processing engine (see the sketch below).
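
As a point of comparison with MapReduce, here is a minimal PySpark word count; Spark plans the work as a DAG of in-memory transformations. The HDFS input path is hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()

# Read a text file from HDFS (the path is hypothetical).
lines = spark.read.text("hdfs:///user/maria_dev/input.txt")

# flatMap/map/reduceByKey play the roles of mapper and reducer.
counts = (lines.rdd
          .flatMap(lambda row: row.value.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```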

Tez: Used in conjunction with Hive to speed it up. Like Spark, it works on directed acyclic graphs (DAGs).

HBase: Exposes the data on the cluster to transactional platforms. HBase is a NoSQL database, a columnar data store (see the sketch below).
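
In a columnar store like HBase, each row is addressed by a row key and columns are grouped into column families. A minimal sketch using the third-party happybase client over HBase's Thrift gateway; the host, table, and column names are assumptions:

```python
import happybase  # third-party: pip install happybase

# Connect to the HBase Thrift server (host is an assumption).
connection = happybase.Connection("hadoop-master")
table = connection.table("users")  # hypothetical table with family 'info'

# Write one row: row key plus column-family:qualifier values.
table.put(b"user:1001", {b"info:name": b"Ada", b"info:city": b"London"})

# Read it back; the result is a dict of column -> bytes.
row = table.row(b"user:1001")
print(row[b"info:name"].decode())
```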

Apache Storm: A way of processing streaming data, e.g., from sensors or weblogs; it is similar to Spark Streaming.

Oozie: A way to schedule jobs on the cluster.

Zookeeper: A technology for coordinating everything on your cluster, e.g., tracking which nodes are up or down and which node is the master (see the sketch below).
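
The up/down tracking works through ephemeral znodes, which vanish automatically when the client that created them dies. A minimal sketch with the third-party kazoo client; the host and paths are assumptions:

```python
from kazoo.client import KazooClient  # third-party: pip install kazoo

zk = KazooClient(hosts="hadoop-master:2181")  # host is an assumption
zk.start()

# An ephemeral znode disappears if this client dies, which is how
# Zookeeper knows a node has gone down.
zk.create("/cluster/worker-1", b"alive", ephemeral=True, makepath=True)
print(zk.get_children("/cluster"))

zk.stop()
```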

Data ingestion:

  • Sqoop: Ties the cluster to relational databases.

  • Flume: Transfers weblogs to the cluster.

  • Kafka: General-purpose data ingestion for the Hadoop cluster (see the producer sketch below).
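
A minimal sketch of what a Kafka producer looks like, using the third-party kafka-python package; the broker address and the `weblogs` topic are assumptions:

```python
from kafka import KafkaProducer  # third-party: pip install kafka-python
import json

# Connect to a Kafka broker (the address is an assumption).
producer = KafkaProducer(
    bootstrap_servers="hadoop-master:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a weblog-style event to a hypothetical 'weblogs' topic;
# consumers (e.g., Spark Streaming) pull from the same topic.
producer.send("weblogs", {"path": "/index.html", "status": 200})
producer.flush()
```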

External data store:

  • Cassandra
  • Mongodb
  • MySQL

Note: Cassandra/MongoDB sit between the real-time apps and the Hadoop cluster.

Query Engines (Hive is built into Hadoop)

  • Apache Drill: SQL-like queries
  • Hue: Similar to Ambari
  • Apache Phoenix: Similar to Drill
  • Presto
  • Apache Zeppelin: Notebook style interface.

HDFS

  • Used for handling very large files across a cluster.
  • Breaks large files into smaller blocks (e.g., 128 MB) stored on different computers/nodes.
  • Each node can process the data in parallel.
  • More than one copy of each block is stored across nodes for better resilience (see the worked example below).
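
A quick worked example of the block math, assuming the common defaults of 128 MB blocks and a replication factor of 3:

```python
import math

file_size_mb = 1024   # a 1 GB file
block_size_mb = 128   # default HDFS block size
replication = 3       # default replication factor

blocks = math.ceil(file_size_mb / block_size_mb)
print(blocks)                # 8 blocks
print(blocks * replication)  # 24 block copies spread across the nodes
```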

HDFS Architecture

  • Name node and Data nodes.
  • The name node keeps track of where all the blocks of data are stored. It also keeps an edit log.
  • Data nodes: store the actual blocks and talk to client applications.
  • Data nodes communicate with each other to keep data blocks in sync.

Reading a file

  • The client app first interacts with the name node to ask for data.
  • The name node then tells the client app which data nodes contain the requested data.

Writing a file

  • If a client app wants to create a new file, it contacts the name node.
  • The name node responds with a list of data nodes to be used.
  • The app sends the data to a single data node.
  • That data node talks to the other data nodes to replicate the data.
  • The data node then reports the status back to the client app, which passes the info on to the name node (see the client sketch below).
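
From the client's point of view, all of this is hidden behind a single call. A minimal sketch using the third-party `hdfs` Python package, which goes through the name node's WebHDFS endpoint; the URL, user, and paths are assumptions:

```python
from hdfs import InsecureClient  # third-party: pip install hdfs

# The client only addresses the name node; block placement and
# replication happen between the data nodes behind the scenes.
client = InsecureClient("http://hadoop-master:9870", user="maria_dev")

# Write a file: the name node picks the data nodes to use.
client.write("/user/maria_dev/hello.txt", data=b"hello hdfs", overwrite=True)

# Read it back: the name node says which data nodes hold the blocks,
# and the data streams from them.
with client.read("/user/maria_dev/hello.txt") as reader:
    print(reader.read())
```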

Handling failure of Name node

  • Back up the data and metadata (edit log) to local storage or NFS.
  • Secondary name node: maintains a merged copy of the edit log that you can restore from.

HDFS Federation

  • Scales beyond a single name node.
  • Each name node manages a specific namespace volume.
  • Useful when there is a very large number of small files.
  • Also helps with resilience.

HDFS high availability

  • Prevents downtime on the cluster in case of failure.
  • Previous approaches led to a small amount of downtime.
  • A hot standby name node uses a shared edit log stored outside the cluster (e.g., on NFS or a quorum of journal nodes).
  • The hot standby name node can take over right after the active name node fails.
  • There is no downtime, unlike with the earlier methods.
  • Zookeeper tracks the active name node.
  • Extreme measures ("fencing") ensure only one name node is active at a time.

Using HDFS

  • Ambari Web UI.
  • CLI (see the sketch below).
  • HTTP/HDFS proxies.
  • Java interface.
  • NFS gateway.
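
For the CLI route, here are the same basic operations wrapped in Python's subprocess module; this assumes a configured Hadoop client on the machine, and the paths are illustrative:

```python
import subprocess

# Standard 'hdfs dfs' commands; each mirrors a POSIX file operation.
subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/user/maria_dev"], check=True)
subprocess.run(["hdfs", "dfs", "-put", "local.txt", "/user/maria_dev/"], check=True)
subprocess.run(["hdfs", "dfs", "-ls", "/user/maria_dev"], check=True)
```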