AI and Big Data

Hadoop

HBase

HBase MapReduce Sorting: Secondary Sort

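MapReduce is Hadoop's approach to processing big data: a simple algorithm and programming paradigm for large datasets. The idea is simple, but using it in practice raises plenty of problems. Not every problem is as typical and intuitive as WordCount, and many call for tricks. The core idea of MapReduce is divide and conquer: the data should be loosely coupled so that it can be split into small datasets and processed in parallel. If the data has strong computational dependencies, don't force it into MapReduce.
In MapReduce programming, the most important thing is to pin down the input and output of Map and Reduce; well-chosen inputs and outputs reduce the complexity of the implementation. Recently I have written many MapReduce jobs, including inverted indexes, statistics, and sorting. Sorting in particular took some effort. WordCount in MapReduce is easy to understand:
Map input: [offset, text], output: [word, 1],

Reduce input: [word, [1, 1, ...]] (the 1s emitted for that word, grouped by key), output: [word, totalcount]. A Combiner can also be set as an optimization.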
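The input and output shapes above can be traced in a small, Hadoop-free simulation. This is only a sketch in plain Python, not the Hadoop API: the function names (`map_phase`, `combine`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and each input line is treated as if it were its own map task so that the Combiner has something to pre-aggregate.

```python
from collections import defaultdict
from itertools import groupby


def map_phase(offset, text):
    """Map: [offset, text] -> list of [word, 1]. The offset is unused here."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Combiner: pre-sum the counts of one map task's output before the shuffle."""
    partial = defaultdict(int)
    for word, count in pairs:
        partial[word] += count
    return list(partial.items())


def shuffle(pairs):
    """Sort intermediate pairs by key and group the values for each key."""
    ordered = sorted(pairs, key=lambda kv: kv[0])
    return [
        (word, [count for _, count in group])
        for word, group in groupby(ordered, key=lambda kv: kv[0])
    ]


def reduce_phase(word, counts):
    """Reduce: [word, [counts...]] -> [word, totalcount]."""
    return (word, sum(counts))


def word_count(lines, use_combiner=True):
    """Run map -> (optional combine) -> shuffle -> reduce over the lines."""
    intermediate = []
    offset = 0
    for line in lines:
        pairs = map_phase(offset, line)
        if use_combiner:
            pairs = combine(pairs)  # simplification: one line = one map task
        intermediate.extend(pairs)
        offset += len(line) + 1  # position of the next line's first character
    return [reduce_phase(word, counts) for word, counts in shuffle(intermediate)]
```

Calling `word_count(["a b a", "b c"])` yields the same totals with or without the Combiner; the Combiner only reduces how many pairs reach the shuffle step.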



HBase Architecture Analysis Part 3 Pros and Cons

5. HBase Physical Architecture

Figure 5.1 shows the deployment view of an HBase cluster.
HBase runs as a master-slave cluster on top of HDFS. The classic deployment is as follows:
➢ **Master node:** one HMaster and one NameNode run on the same machine, which acts as the master node.
➢ **Slave node:** each slave node runs one HRegionServer and one DataNode, and reports its status to the master node and ZooKeeper.
➢ **ZooKeeper:** HBase ships with a bundled ZooKeeper ensemble, but for large clusters it is better to use an existing ZooKeeper ensemble. ZooKeeper is crucial: the HMaster and the HRegionServers register with it.
➢ **Client:** many kinds of clients can access the HRegionServers, such as the Java client, shell client, Thrift client, and Avro client.



HBase Architecture Analysis Part 1 (Logical Architecture)

1. Overview

Apache HBase is an open-source, column-oriented database. It is often described as a sparse, consistent, distributed, multi-dimensional sorted map. HBase is modeled after Google's Bigtable, as described in the paper “Bigtable: A Distributed Storage System for Structured Data”, and can host very large tables with billions of rows and millions of columns.
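One way to picture the "sparse, multi-dimensional sorted map" description is as nested maps: row key, then column, then timestamp, then value, with row keys kept in sorted order. The toy class below is a sketch of that idea only, not the HBase client API. The class name `SortedMapTable`, its methods, and the `"family:qualifier"` string used for columns are illustrative assumptions, and row keys are plain strings here rather than byte arrays.

```python
import bisect


class SortedMapTable:
    """Toy model: row key -> column ("family:qualifier") -> timestamp -> value."""

    def __init__(self):
        self._row_keys = []  # row keys, kept in sorted order
        self._rows = {}      # row key -> {column -> {timestamp -> value}}

    def put(self, row, column, timestamp, value):
        """Store one versioned cell; only cells that are written take up space."""
        if row not in self._rows:
            bisect.insort(self._row_keys, row)
            self._rows[row] = {}
        self._rows[row].setdefault(column, {})[timestamp] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if the cell is absent."""
        versions = self._rows.get(row, {}).get(column)
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self, start_row, stop_row):
        """Yield (row, columns) for start_row <= row < stop_row, in key order."""
        lo = bisect.bisect_left(self._row_keys, start_row)
        hi = bisect.bisect_left(self._row_keys, stop_row)
        for row in self._row_keys[lo:hi]:
            yield row, self._rows[row]
```

Because the row keys are sorted, a range scan is a contiguous slice of keys, and because absent cells are simply not stored, a row with two populated columns costs nothing for the columns it lacks.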



Node.js, HBase 0.96, Hadoop 2.2.0, Thrift2: Configuration and Usage

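If a project isn't written in Java, does that mean it can't use HBase? Programmers won't give up that easily: wherever there is a language, there will be an API for it. Besides, I find the assorted missing-JAR errors that come with configuring things in Java annoying. I recall that the book 《数学之美》 (The Beauty of Mathematics) says: "Doing technical work has two levels, shu (technique) and dao (principle)." Knowing HBase's architecture and some of its low-level details is the "dao", while using the various configurations and APIs to build applications is the "shu". So let's try connecting to HBase without Java.
HBase's third-party interfaces include Shell, Java, REST, and Thrift; see chapter 6 of 《HBase in Action》. The REST interface is relatively slow and is not as pleasant to use as Thrift. You may be wondering what Thrift is: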
