This guide sets up a Hadoop pseudo-distributed environment, with all services running on a single node.
The JDK is installed from the pre-built binary package (no compilation required).
$ cd /opt/local/src/
$ curl -o jdk-8u171-linux-x64.tar.gz http://download.oracle.com/otn-pub/java/jdk/8u171-b11/512cd62ec5174c3487ac17c61aaa89e8/jdk-8u171-linux-x64.tar.gz?AuthParam=1529719173_f230ce3269ab2fccf20e190d77622fe1
### Extract to the target location
$ tar -zxf jdk-8u171-linux-x64.tar.gz -C /opt/local
### Create a symlink
$ cd /opt/local/
$ ln -s jdk1.8.0_171 jdk
### Configure environment variables: add the following to the current user's ~/.bashrc
$ tail ~/.bashrc
# Java
export JAVA_HOME=/opt/local/jdk
export JRE_HOME=$JAVA_HOME/jre
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib:$JRE_HOME/lib
export PATH=$PATH:$JAVA_HOME/bin:$JRE_HOME/bin
$ source ~/.bashrc
### Verify it took effect; the Java version output below means it works
$ java -version
java version "1.8.0_171"
Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.171-b11, mixed mode)
### Configure /etc/hosts to map the hostname to its IP address
$ head -n 3 /etc/hosts
# ip --> hostname or domain
192.168.20.10 node
### Verify
$ ping node -c 2
PING node (192.168.20.10) 56(84) bytes of data.
64 bytes from node (192.168.20.10): icmp_seq=1 ttl=64 time=0.063 ms
64 bytes from node (192.168.20.10): icmp_seq=2 ttl=64 time=0.040 ms
### Generate an SSH key pair
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
### Copy the key to the node (the password is required this one time)
$ ssh-copy-id node
### Verify the login; a passwordless login means it worked
$ ssh node
### Download Hadoop 2.7.6
$ cd /opt/local/src/
$ wget -c http://mirrors.hust.edu.cn/apache/hadoop/common/hadoop-2.7.6/hadoop-2.7.6.tar.gz
$ mkdir -p /opt/local/hdfs/{namenode,datanode,tmp}
$ tree /opt/local/hdfs/
/opt/local/hdfs/
├── datanode
├── namenode
└── tmp
### Extract to the target location
$ cd /opt/local/src/
$ tar -zxf hadoop-2.7.6.tar.gz -C /opt/local/
### Create a symlink
$ cd /opt/local/
$ ln -s hadoop-2.7.6 hadoop
$ vim /opt/local/hadoop/etc/hadoop/core-site.xml
<configuration>
    <property>
        <name>hadoop.tmp.dir</name>
        <value>file:/opt/local/hdfs/tmp/</value>
    </property>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://node:9000</value>
    </property>
    <property>
        <name>io.file.buffer.size</name>
        <value>131072</value>
    </property>
</configuration>
$ vim /opt/local/hadoop/etc/hadoop/hdfs-site.xml
<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/local/hdfs/namenode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/local/hdfs/datanode</value>
    </property>
    <property>
        <name>dfs.webhdfs.enabled</name>
        <value>true</value>
    </property>
</configuration>
### mapred-site.xml has to be copied from the template and then edited
$ cp /opt/local/hadoop/etc/hadoop/mapred-site.xml.template /opt/local/hadoop/etc/hadoop/mapred-site.xml
$ vim /opt/local/hadoop/etc/hadoop/mapred-site.xml
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>node:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>node:19888</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.done-dir</name>
        <value>/history/done</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.intermediate-done-dir</name>
        <value>/history/done_intermediate</value>
    </property>
</configuration>
$ vim /opt/local/hadoop/etc/hadoop/yarn-site.xml
<configuration>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>node</value>
    </property>
    <property>
        <name>yarn.resourcemanager.address</name>
        <value>node:8032</value>
    </property>
    <property>
        <name>yarn.resourcemanager.scheduler.address</name>
        <value>node:8030</value>
    </property>
    <property>
        <name>yarn.resourcemanager.resource-tracker.address</name>
        <value>node:8031</value>
    </property>
    <property>
        <name>yarn.resourcemanager.admin.address</name>
        <value>node:8033</value>
    </property>
    <property>
        <name>yarn.resourcemanager.webapp.address</name>
        <value>node:8088</value>
    </property>
    <property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
    </property>
    <property>
        <name>yarn.log-aggregation.retain-seconds</name>
        <value>604800</value>
    </property>
    <property>
        <name>yarn.nodemanager.pmem-check-enabled</name>
        <value>false</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
</configuration>
$ cat /opt/local/hadoop/etc/hadoop/slaves
node
$ cat /opt/local/hadoop/etc/hadoop/master
node
$ vim /opt/local/hadoop/etc/hadoop/hadoop-env.sh
### Set JAVA_HOME
export JAVA_HOME=/opt/local/jdk
$ vim /opt/local/hadoop/etc/hadoop/yarn-env.sh
### Set JAVA_HOME
export JAVA_HOME=/opt/local/jdk
$ vim /opt/local/hadoop/etc/hadoop/mapred-env.sh
### Set JAVA_HOME
export JAVA_HOME=/opt/local/jdk
Add the Hadoop environment variables to ~/.bashrc, as follows:
# hadoop
export HADOOP_HOME=/opt/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native:$JAVA_LIBRARY_PATH
$ source ~/.bashrc
### Verify
$ hadoop version
Hadoop 2.7.6
Subversion https://shv@git-wip-us.apache.org/repos/asf/hadoop.git -r 085099c66cf28be31604560c376fa282e69282b8
Compiled by kshvachk on 2018-04-18T01:33Z
Compiled with protoc 2.5.0
From source with checksum 71e2695531cb3360ab74598755d036
This command was run using /opt/local/hadoop-2.7.6/share/hadoop/common/hadoop-common-2.7.6.jar
### Format HDFS. Use with caution if data already exists: this deletes it
$ hadoop namenode -format
### The namenode storage directory now contains data
$ ls /opt/local/hdfs/namenode/current
Starting Hadoop mainly means starting HDFS (NameNode, DataNode) and YARN (ResourceManager, NodeManager). Everything can be started with start-all.sh and stopped with stop-all.sh, or each service can be started individually.
Starting DFS covers the NameNode and DataNode services and can be done with start-dfs.sh; below, the daemons are started one at a time.
### Start the namenode
$ hadoop-daemon.sh start namenode
starting namenode, logging to /opt/local/hadoop-2.7.6/logs/hadoop-hadoop-namenode-node.out
### Check the process
$ jps
7547 Jps
7500 NameNode
### Start the SecondaryNameNode
$ hadoop-daemon.sh start secondarynamenode
starting secondarynamenode, logging to /opt/local/hadoop-2.7.6/logs/hadoop-hadoop-secondarynamenode-node.out
### Check the process
$ jps
10001 SecondaryNameNode
10041 Jps
9194 NameNode
### Start the datanode
$ hadoop-daemon.sh start datanode
starting datanode, logging to /opt/local/hadoop-2.7.6/logs/hadoop-hadoop-datanode-node.out
### Check the process
$ jps
7607 DataNode
7660 Jps
7500 NameNode
10001 SecondaryNameNode
Starting YARN covers the ResourceManager and NodeManager services and can be done with start-yarn.sh; below, the daemons are started one at a time.
### Start the resourcemanager
$ yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /opt/local/hadoop-2.7.6/logs/yarn-hadoop-resourcemanager-node.out
### Check the process
$ jps
7607 DataNode
7993 Jps
7500 NameNode
7774 ResourceManager
10001 SecondaryNameNode
### Start the nodemanager
$ yarn-daemon.sh start nodemanager
starting nodemanager, logging to /opt/local/hadoop-2.7.6/logs/yarn-hadoop-nodemanager-node.out
### Check the process
$ jps
7607 DataNode
8041 NodeManager
8106 Jps
7500 NameNode
7774 ResourceManager
10001 SecondaryNameNode
### Start the historyserver
$ mr-jobhistory-daemon.sh start historyserver
starting historyserver, logging to /opt/local/hadoop/logs/mapred-hadoop-historyserver-node.out
### Check the process
$ jps
8278 JobHistoryServer
7607 DataNode
8041 NodeManager
7500 NameNode
8317 Jps
7774 ResourceManager
10001 SecondaryNameNode
Once Hadoop is running, the main file-system operations are:
Command | Description |
---|---|
hadoop fs -mkdir | Create an HDFS directory |
hadoop fs -ls | List an HDFS directory |
hadoop fs -copyFromLocal | Copy a local file to HDFS |
hadoop fs -put | Copy a local file to HDFS; put can also read from stdin (standard input) |
hadoop fs -cat | Print the contents of an HDFS file |
hadoop fs -copyToLocal | Copy an HDFS file to the local filesystem |
hadoop fs -get | Copy an HDFS file to the local filesystem |
hadoop fs -cp | Copy HDFS files |
hadoop fs -rm | Delete an HDFS file or directory (with the -R flag) |
$ hadoop fs -mkdir /user/hadoop
$ hadoop fs -mkdir -p /user/hadoop/{input,output}
$ hadoop fs -ls /
Found 2 items
drwxrwx--- - hadoop supergroup 0 2018-06-23 12:20 /history
drwxr-xr-x - hadoop supergroup 0 2018-06-23 13:20 /user
$ hadoop fs -ls /user
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2018-06-23 13:20 /user/hadoop
$ hadoop fs -ls -R /
drwxrwx--- - hadoop supergroup 0 2018-06-23 12:20 /history
drwxrwx--- - hadoop supergroup 0 2018-06-23 12:20 /history/done
drwxrwxrwt - hadoop supergroup 0 2018-06-23 12:20 /history/done_intermediate
drwxr-xr-x - hadoop supergroup 0 2018-06-23 13:20 /user
drwxr-xr-x - hadoop supergroup 0 2018-06-23 13:24 /user/hadoop
drwxr-xr-x - hadoop supergroup 0 2018-06-23 13:24 /user/hadoop/input
drwxr-xr-x - hadoop supergroup 0 2018-06-23 13:24 /user/hadoop/output
$ hadoop fs -copyFromLocal /opt/local/hadoop/README.txt /user/hadoop/input
$ hadoop fs -cat /user/hadoop/input/README.txt
$ hadoop fs -get /user/hadoop/input/README.txt ./
### Deleting a file prints a confirmation
$ hadoop fs -rm /user/hadoop/input/examples.desktop
18/06/23 13:47:06 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/hadoop/input/examples.desktop
### Delete a directory
$ hadoop fs -rm -R /user/hadoop
18/06/23 13:48:17 INFO fs.TrashPolicyDefault: Namenode trash configuration: Deletion interval = 0 minutes, Emptier interval = 0 minutes.
Deleted /user/hadoop
Use Hadoop's built-in wordcount program to count words:
$ hadoop fs -put /opt/local/hadoop/README.txt /user/input
$ cd /opt/local/hadoop/share/hadoop/mapreduce
#### Usage: hadoop jar <jar file> <class> <input path> <output dir>
$ hadoop jar hadoop-mapreduce-examples-2.7.6.jar wordcount /user/input/ /user/output/wordcount
#### Check the currently running applications; they can also be viewed at http://node:8088
$ yarn application -list
18/06/23 13:55:34 INFO client.RMProxy: Connecting to ResourceManager at node/192.168.20.10:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id                   Application-Name   Application-Type   User     Queue     State     Final-State   Progress   Tracking-URL
application_1529732240998_0001   word count         MAPREDUCE          hadoop   default   RUNNING   UNDEFINED     5%         http://node:41713
#### _SUCCESS marks a successful run; the files starting with "part" hold the results
$ hadoop fs -ls /user/output/wordcount
Found 2 items
-rw-r--r-- 1 hadoop supergroup 0 2018-06-23 13:55 /user/output/wordcount/_SUCCESS
-rw-r--r-- 1 hadoop supergroup 1306 2018-06-23 13:55 /user/output/wordcount/part-r-00000
#### View the contents
$ hadoop fs -cat /user/output/wordcount/part-r-00000 | tail
uses 1
using 2
visit 1
website 1
which 2
wiki, 1
with 1
written 1
you 1
your 1
HDFS NameNode web UI: http://node:50070
YARN ResourceManager web UI: http://node:8088
Spark is written in Scala, so Scala needs to be installed first. Scala can be downloaded from the scala-lang.org archive (the URL used below).
Starting with Spark 2.0, Spark is built with Scala 2.11 by default, so download a scala-2.11 release.
$ cd /opt/local/src/
$ wget -c https://www.scala-lang.org/files/archive/scala-2.11.11.tgz
#### Extract to the target location and create a symlink
$ tar -zxf scala-2.11.11.tgz -C /opt/local/
$ cd /opt/local/
$ ln -s scala-2.11.11 scala
#### Configure ~/.bashrc by adding the following
$ tail -n 5 ~/.bashrc
# scala
export SCALA_HOME=/opt/local/scala
export PATH=$PATH:$SCALA_HOME/bin
#### Apply the configuration
$ source ~/.bashrc
#### Verify
$ scala -version
Scala code runner version 2.11.11 -- Copyright 2002-2017, LAMP/EPFL
From the Spark download page, choose the package built for Hadoop 2.7 and later.
$ cd /opt/local/src/
$ wget -c http://mirror.bit.edu.cn/apache/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
$ tar zxf spark-2.3.1-bin-hadoop2.7.tgz -C /opt/local/
$ cd /opt/local/
$ ln -s spark-2.3.1-bin-hadoop2.7 spark
$ tail -n 5 ~/.bashrc
# spark
export SPARK_HOME=/opt/local/spark
export PATH=$PATH:$SPARK_HOME/bin
$ source ~/.bashrc
Typing pyspark in a terminal starts Spark's Python shell; on startup it prints the Python and Spark versions in use. pyspark --master local[4] runs Spark in local mode; local[N] means N worker threads, and local[*] uses as many threads as there are CPU cores.
$ pyspark
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
2018-06-23 19:25:00 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 2.7.12 (default, Dec 4 2017 14:50:18)
SparkSession available as 'spark'.
>>> sc.master
u'local[*]'
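To see how many threads local[*] actually resolved to, the shell's SparkContext can be asked for its default parallelism. A quick check in the same session (the printed number depends on your machine):

```python
# sc is the SparkContext that the pyspark shell creates automatically.
# With --master local[*] the default parallelism normally equals the number
# of visible CPU cores; with --master local[4] it would be 4.
print(sc.defaultParallelism)
```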
>>> textFile=sc.textFile("file:/opt/local/spark/README.md")
>>> textFile.count()
103
>>> textFile=sc.textFile("hdfs://node:9000/user/input/README.md")
>>> textFile.count()
103
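For comparison with the Hadoop wordcount job run earlier, the same counting can be done directly in the pyspark shell. A minimal sketch, assuming the README.md path used above; the output directory name is only an example and must not already exist:

```python
# Word count with RDD transformations; sc is provided by the pyspark shell.
lines = sc.textFile("hdfs://node:9000/user/input/README.md")
counts = (lines.flatMap(lambda line: line.split())   # split each line into words
               .map(lambda word: (word, 1))          # pair each word with a count of 1
               .reduceByKey(lambda a, b: a + b))     # sum the counts per word
print(counts.take(5))                                # peek at a few (word, count) pairs
counts.saveAsTextFile("hdfs://node:9000/user/output/spark-wordcount")  # write results to HDFS
```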
Spark can run on Hadoop YARN, letting YARN handle resource management; start it with:
HADOOP_CONF_DIR=/opt/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
$ HADOOP_CONF_DIR=/opt/local/hadoop/etc/hadoop pyspark --master yarn --deploy-mode client
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
2018-06-23 20:27:48 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
2018-06-23 20:27:52 WARN Client:66 - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 2.7.12 (default, Dec 4 2017 14:50:18)
SparkSession available as 'spark'.
>>> sc.master
u'yarn'
>>> textFile=sc.textFile("hdfs://node:9000/user/input/README.md")
>>> textFile.count()
103
#### The application can also be seen in the web UI at http://node:8088
$ yarn application -list
18/06/23 20:34:40 INFO client.RMProxy: Connecting to ResourceManager at node/192.168.20.10:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):1
Application-Id                   Application-Name   Application-Type   User     Queue     State     Final-State   Progress   Tracking-URL
application_1529756801315_0001   PySparkShell       SPARK              hadoop   default   RUNNING   UNDEFINED     10%        http://node:4040
Configure a Spark Standalone Cluster in pseudo-distributed form, with all services running on a single node.
$ cp /opt/local/spark/conf/spark-env.sh.template /opt/local/spark/conf/spark-env.sh
$ tail -n 6 /opt/local/spark/conf/spark-env.sh
#### Spark Standalone Cluster
export JAVA_HOME=/opt/local/jdk
export SPARK_MASTER_HOST=node
export SPARK_WORKER_CORES=1
export SPARK_WORKER_MEMORY=512m
export SPARK_WORKER_INSTANCES=1
#### Add the entry (the slaves template file can also be copied as a starting point)
$ tail /opt/local/spark/conf/slaves
node
The Spark Standalone Cluster can be started with a single script, ${SPARK_HOME}/sbin/start-all.sh, or the master and slaves can be started separately.
$ /opt/local/spark/sbin/start-master.sh
starting org.apache.spark.deploy.master.Master, logging to /opt/local/spark/logs/spark-hadoop-org.apache.spark.deploy.master.Master-1-node.out
$ jps
4185 Master
$ /opt/local/spark/sbin/start-slaves.sh
node: starting org.apache.spark.deploy.worker.Worker, logging to /opt/local/spark/logs/spark-hadoop-org.apache.spark.deploy.worker.Worker-1-node.out
$ jps
4185 Master
4313 Worker
$ w3m http://node:8080/
[spark-logo] 2.3.1 Spark Master at spark://node:7077
 • URL: spark://node:7077
 • REST URL: spark://node:6066 (cluster mode)
 • Alive Workers: 1
 • Cores in use: 1 Total, 0 Used
 • Memory in use: 256.0 MB Total, 0.0 B Used
 • Applications: 0 Running, 0 Completed
 • Drivers: 0 Running, 0 Completed
 • Status: ALIVE
Workers (1)
Worker Id                                   Address              State  Cores      Memory
worker-20180624102100-192.168.20.10-42469   192.168.20.10:42469  ALIVE  1 (0 Used) 512.0 MB (0.0 B Used)
Running Applications (0)
Application ID  Name  Cores  Memory per Executor  Submitted Time  User  State  Duration
Completed Applications (0)
Application ID  Name  Cores  Memory per Executor  Submitted Time  User  State  Duration
$ pyspark --master spark://node:7077 --num-executors 1 --total-executor-cores 1 --executor-memory 512m
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
2018-06-24 10:39:09 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.3.1
      /_/

Using Python version 2.7.12 (default, Dec 4 2017 14:50:18)
SparkSession available as 'spark'.
>>> sc.master
u'spark://node:7077'
>>> textFile=sc.textFile("file:/opt/local/spark/README.md")
>>> textFile.count()
103
>>> textFile=sc.textFile("hdfs://node:9000/user/input/README.md")
>>> textFile.count()
103
Spark can run in several modes, mainly a standalone cluster, a YARN cluster, a Mesos cluster, and local mode.
master value | Description |
---|---|
spark://host:port | Spark standalone cluster; the default port is 7077 |
yarn | YARN cluster; when running on YARN, set the HADOOP_CONF_DIR environment variable to the Hadoop configuration directory so Spark can find the cluster |
mesos://host:port | Mesos cluster; the default port is 5050 |
local | Local mode with 1 core |
local[n] | Local mode with n cores |
local[*] | Local mode using as many cores as possible |
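The same choice can also be made in code instead of on the command line. A minimal sketch, assuming the standalone master started above is running; swap the URL for any value from the table:

```python
# Build a SparkSession with an explicit master URL (Spark 2.x API).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://node:7077")   # or "yarn", "local[4]", "mesos://host:5050"
         .appName("master-selection-demo")
         .getOrCreate())
print(spark.sparkContext.master)        # confirm which master was used
spark.stop()
```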
Reposted from: https://blog.51cto.com/balich/2132160