Experimental Principle
Computing an average is a common MapReduce task, and the algorithm is simple. One approach: the Map side reads the data and emits key/value pairs; before the data reaches Reduce, the shuffle phase groups all values that share the same map-output key into a single value-list; the Reduce side then sums the values in each list, counts the records, and divides the sum by the count to obtain the average.
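The group-then-average idea above can be sketched in plain Java, outside Hadoop. The keys and click counts here are made-up sample data for illustration only:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class AverageSketch {
    // Sum one key's value-list and return the integer average,
    // mirroring what the Reduce side does after the shuffle
    static int average(List<Integer> values) {
        int sum = 0, count = 0;
        for (int v : values) {
            sum += v;
            count++;
        }
        return sum / count;
    }

    public static void main(String[] args) {
        // After the shuffle, all values sharing a key arrive as one value-list
        Map<String, List<Integer>> shuffled = new LinkedHashMap<>();
        shuffled.put("1001", Arrays.asList(10, 20, 30)); // made-up click counts
        shuffled.put("1002", Arrays.asList(7, 9));
        for (Map.Entry<String, List<Integer>> e : shuffled.entrySet()) {
            System.out.println(e.getKey() + "\t" + average(e.getValue()));
        }
    }
}
```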
Experimental Steps
1. Start Hadoop in Linux
start-all.sh
2. Create the /data/mapreduce4 directory on the local Linux filesystem.
mkdir -p /data/mapreduce4
3. Download hadoop2lib and unzip it into the mapreduce folder
unzip hadoop2lib.zip
4. Create the /mymapreduce4/in directory on HDFS, then upload the goods_click file from the local Linux directory /data/mapreduce4 into /mymapreduce4/in on HDFS.
hadoop fs -mkdir -p /mymapreduce4/in
hadoop fs -put /data/mapreduce4/goods_click /mymapreduce4/in
Note: pay attention to the format of the goods_click file. Hidden trailing spaces after the data will cause reads in the API to fail, so spaces at the end of each line must be removed; fields within a line are separated by commas.
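The trailing-space pitfall can also be handled defensively in code. A minimal sketch, assuming the two-field "goodsId,clickCount" line format (the sample line is made up):

```java
public class LineParse {
    // Parse the click count from one "goodsId,clickCount" line; trimming the
    // raw line keeps hidden trailing spaces from breaking Integer.parseInt
    static int parseClick(String rawLine) {
        String[] arr = rawLine.trim().split(",");
        return Integer.parseInt(arr[1].trim());
    }

    public static void main(String[] args) {
        System.out.println(parseClick("1001,300  ")); // trailing spaces are tolerated
    }
}
```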
5. Write the code in IDEA
package mapreduce;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
public class MyAverage{
public static class Map extends Mapper<Object , Text , Text , IntWritable>{
private static Text newKey=new Text();
public void map(Object key,Text value,Context context) throws IOException, InterruptedException{
// Each line has the form "goodsId,clickCount"; trim() guards against the
// hidden trailing spaces mentioned in the note on goods_click above
String line=value.toString().trim();
String[] arr=line.split(",");
newKey.set(arr[0]);// goods id becomes the map-output key
int click=Integer.parseInt(arr[1]);// click count becomes the map-output value
context.write(newKey, new IntWritable(click));
}
}
public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable>{
public void reduce(Text key,Iterable<IntWritable> values,Context context) throws IOException, InterruptedException{
int sum=0;// total clicks for this key
int count=0;// number of records for this key
for(IntWritable val:values){
sum+=val.get();
count++;
}
int avg=sum/count;// integer average of the clicks
context.write(key,new IntWritable(avg));
}
}
public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException{
Configuration conf=new Configuration();
// Job.getInstance replaces the deprecated new Job(conf, name) constructor
Job job=Job.getInstance(conf,"MyAverage");
job.setJarByClass(MyAverage.class);
job.setMapperClass(Map.class);
job.setReducerClass(Reduce.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
Path in=new Path("hdfs://192.168.149.10:9000/mymapreduce4/in/goods_click");
Path out=new Path("hdfs://192.168.149.10:9000/mymapreduce4/out");
FileInputFormat.addInputPath(job,in);
FileOutputFormat.setOutputPath(job,out);
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
6. Create a resources folder and create a log4j.properties file in it
hadoop.root.logger=DEBUG, console
log4j.rootLogger = DEBUG, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.out
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{2}: %m%n
7. Import the hadoop2lib jars
8. Run the program and check the results
If a permission error is reported at runtime, modify the following, replacing root with your Linux username.
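One common way to deal with such an HDFS permission error (an assumption about the setup, not part of the original lab) is to set the HADOOP_USER_NAME property before the Configuration is created, so the HDFS client acts as that user:

```java
public class SetHadoopUser {
    public static void main(String[] args) {
        // "myuser" is a placeholder; replace it with your Linux username.
        // This must run before the Hadoop Configuration/Job objects are created.
        System.setProperty("HADOOP_USER_NAME", "myuser");
        System.out.println(System.getProperty("HADOOP_USER_NAME"));
    }
}
```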