Hadoop Pseudo-Cluster Test

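Hadoop has three execution modes: standalone, pseudo-cluster (pseudo-distributed), and cluster (fully distributed).

In the earlier post "Hadoop Standalone Test" we got pure standalone mode working. This post covers the pseudo-cluster.

In a pseudo-cluster, each Hadoop daemon runs in its own JVM on the same machine, and data is stored in HDFS.

Update 2012.06.21: updated the Hadoop version to 1.0.3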

1. Configure the pseudo-cluster

conf/core-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>

conf/hdfs-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
</configuration>

conf/mapred-site.xml

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>mapred.job.tracker</name>
        <value>localhost:9001</value>
    </property>
</configuration>
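
Before starting anything, it can help to read the three values back and confirm that each file says what you expect. The following is only a minimal sketch: the helper name get_prop is hypothetical, and it assumes each <name> line is followed by its <value> line, as in the files above. It is not a real XML parser.

```shell
#!/bin/sh
# get_prop FILE NAME: print the <value> that follows <name>NAME</name>.
# Assumption: the <value> element sits on a line after its <name> line,
# as in the config files shown above.
get_prop() {
    awk -v name="$2" '
        index($0, "<name>" name "</name>") { found = 1; next }
        found && /<value>/ {
            sub(/.*<value>/, "")      # drop everything up to the opening tag
            sub(/<\/value>.*/, "")    # drop the closing tag and what follows
            print
            exit
        }
    ' "$1"
}
```

For example, get_prop conf/core-site.xml fs.default.name should print the HDFS URI configured above, and an unknown property name prints nothing.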

2. Set up passwordless ssh (from localhost)
Many Hadoop cluster deployment operations rely on passwordless login.
Set up key-based login:

ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

Besides that, the host key check against known_hosts usually interrupts the login (the yes/no prompt). To turn it off:

echo "StrictHostKeyChecking no" >> ~/.ssh/config

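That takes care of the prompt. The risk is that it opens the door to man-in-the-middle attacks, which is generally considered negligible for a local test setup.

Finally, ssh localhost should log in directly without asking for a password.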
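
If switching off host key checking for every host feels too broad, the option can be limited to localhost. This is a sketch under assumptions: the helper name add_localhost_rule is hypothetical, and it simply appends a Host localhost block to the given ssh config file unless one is already present.

```shell
#!/bin/sh
# add_localhost_rule FILE: append a localhost-only StrictHostKeyChecking rule.
# Does nothing if FILE already contains a "Host localhost" line.
add_localhost_rule() {
    cfg="$1"
    touch "$cfg"
    if ! grep -q '^Host localhost$' "$cfg"; then
        printf 'Host localhost\n    StrictHostKeyChecking no\n' >> "$cfg"
    fi
    chmod 600 "$cfg"   # ssh refuses config files that others can write to
}
```

After running add_localhost_rule ~/.ssh/config, the command ssh -o BatchMode=yes localhost true is a quick check: with BatchMode ssh never prompts, so it fails instead of asking for a password if key login is not working.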

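3. Format the NameNode and start Hadoop

First make sure the Hadoop scripts can see JAVA_HOME. I edited bin/start-all.sh and added the variable there; conf/hadoop-env.sh, which ships with a commented-out JAVA_HOME line, is the usual place for it. I had only set JAVA_HOME in ~/.bashrc, and every start kept reporting "localhost: Error: JAVA_HOME is not set.".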

# Format the NameNode
./bin/hadoop namenode -format
# Start all daemons
./bin/start-all.sh

Then check the web monitoring pages:
HDFS Namenode: http://localhost:50070
Job Tracker: http://localhost:50030
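
Besides the web pages, the JDK's jps tool lists the running Java processes, which gives a quick check that all five Hadoop 1.x daemons came up. A minimal sketch, assuming a hypothetical helper named missing_daemons that reads jps output on standard input:

```shell
#!/bin/sh
# missing_daemons: read `jps` output on stdin and print each expected
# Hadoop 1.x daemon that is not listed. Prints nothing if all five run.
missing_daemons() {
    out=$(cat)
    for d in NameNode DataNode SecondaryNameNode JobTracker TaskTracker; do
        # jps prints "<pid> <ClassName>"; compare the class name exactly
        printf '%s\n' "$out" \
            | awk -v d="$d" '$2 == d { ok = 1 } END { exit !ok }' \
            || echo "$d"
    done
}
```

Run it as jps | missing_daemons. Any name it prints is a daemon that did not start, and the matching file under the logs directory is the place to look next.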

4. Prepare the input data and run the job

Unlike standalone mode, this time everything has to live on HDFS.

# Copy the local conf directory to HDFS as the input directory
./bin/hadoop fs -put ./conf/ input

# Run the job
bin/hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'

On my machine the run took about 1 minute 12 seconds.
View the result:

# View the result
bin/hadoop fs -cat output/*
# Output
1	dfs.replication
1	dfs.server.namenode.
1	dfsadmin
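
To see what the job computes, the same count can be reproduced locally on the conf directory with plain shell tools. This is a sketch only: the helper name local_grep_count is hypothetical, and it mirrors the output format of the example job (count, tab, match) rather than how Hadoop implements it.

```shell
#!/bin/sh
# local_grep_count DIR REGEX: count every match of REGEX in the files
# directly under DIR and print "count<TAB>match", highest count first.
local_grep_count() {
    grep -ohE "$2" "$1"/* 2>/dev/null \
        | sort | uniq -c | sort -k1,1nr -k2 \
        | awk '{ print $1 "\t" $2 }'
}
```

Calling local_grep_count ./conf 'dfs[a-z.]+' from the Hadoop directory gives a list that can be compared with the HDFS output above.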
