
Hive: the headache-inducing multi-distinct

A production query, simplified, looks like this:

select dt, count(distinct c1), count(distinct case when c2>0 and c1=0 then c1 end), count(distinct case when c2>0 and c1>0 then c1 end) from t where dt between '20131108' and '20131110' group by dt;

This is the classic headache-inducing multi-distinct problem. Why is it such a headache? A look at the execution plan makes it clear:


ABSTRACT SYNTAX TREE:
  (TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME t))) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (TOK_TABLE_OR_COL dt)) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_TABLE_OR_COL c1))) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION when (and (> (TOK_TABLE_OR_COL c2) 0) (= (TOK_TABLE_OR_COL c1) 0)) (TOK_TABLE_OR_COL c1)))) (TOK_SELEXPR (TOK_FUNCTIONDI count (TOK_FUNCTION when (and (> (TOK_TABLE_OR_COL c2) 0) (> (TOK_TABLE_OR_COL c1) 0)) (TOK_TABLE_OR_COL c1))))) (TOK_WHERE (TOK_FUNCTION between KW_FALSE (TOK_TABLE_OR_COL dt) '20131108' '20131110')) (TOK_GROUPBY (TOK_TABLE_OR_COL dt))))

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        t
          TableScan
            alias: t
            Filter Operator
              predicate:
                  expr: dt BETWEEN '20131108' AND '20131110'
                  type: boolean
            // The Select Operator projects only the columns the query needs (dt, c1, c2)
              Select Operator
                expressions:
                      expr: dt
                      type: string
                      expr: c1
                      type: int
                      expr: c2
                outputColumnNames: dt, c1, c2
            // Simple map-side aggregation. Pitfall 1: with N distinct expressions and M input rows on a mapper, this step emits roughly N*M rows, because the map aggregates on (dt, keys[i]); so filter out as much data as possible on the map side.
                Group By Operator
                  aggregations:
                        expr: count(DISTINCT c1)
                        expr: count(DISTINCT CASE WHEN (((c2 > 0) and (c1 = 0))) THEN (c1) END)
                        expr: count(DISTINCT CASE WHEN (((c2 > 0) and (c1 > 0))) THEN (c1) END)
                  bucketGroup: false
                  keys:
                        expr: dt
                        type: string
                        expr: c1
                        type: int
                        expr: CASE WHEN (((c2 > 0) and (c1 = 0))) THEN (c1) END
                        expr: CASE WHEN (((c2 > 0) and (c1 > 0))) THEN (c1) END
                  mode: hash
                  outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6
                // Pitfall 2: the Reduce Sink hashes rows on the partition columns, so for a table partitioned by date all of one day's data, amplified N times, is shipped to a single reducer, which leads to long-tail performance or an OOM error.
                  Reduce Output Operator
                    key expressions:
                          expr: _col0
                          type: string
                          expr: _col1
                          type: int
                          expr: _col2
                          expr: _col3
                    sort order: ++++
                    Map-reduce partition columns:
                    tag: -1
                    value expressions:
                          expr: _col4
                          type: bigint
                          expr: _col5
                          expr: _col6
      Reduce Operator Tree:
        Group By Operator
          aggregations:
                expr: count(DISTINCT KEY._col1:0._col0)
                expr: count(DISTINCT KEY._col1:1._col0)
                expr: count(DISTINCT KEY._col1:2._col0)
          bucketGroup: false
          keys:
                expr: KEY._col0
                type: string
          mode: mergepartial
          outputColumnNames: _col0, _col1, _col2, _col3
          Select Operator
            expressions:
                  expr: _col0
                  type: string
                  expr: _col1
                  type: bigint
                  expr: _col2
                  expr: _col3
            outputColumnNames: _col0, _col1, _col2, _col3
            File Output Operator
              compressed: true
              GlobalTableId: 0
              table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
  Stage: Stage-0
    Fetch Operator
      limit: -1
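The rewritten query itself did not survive in this copy of the post. A plausible rewrite consistent with the stage graph shown below (three root map-reduce stages feeding a join stage) computes each distinct in its own job and joins the per-day results; the subquery aliases a/b/c and the column names uv, uv1, uv2 are illustrative, not the author's exact SQL. Each job then carries only a single distinct key set, so the map output is no longer amplified N times as in pitfall 1.

-- Sketch only: one distinct per job, joined back on dt
select a.dt, a.uv, b.uv1, c.uv2
from
  (select dt, count(distinct c1) as uv
     from t
     where dt between '20131108' and '20131110'
     group by dt) a
join
  (select dt, count(distinct case when c2>0 and c1=0 then c1 end) as uv1
     from t
     where dt between '20131108' and '20131110'
     group by dt) b
on a.dt = b.dt
join
  (select dt, count(distinct case when c2>0 and c1>0 then c1 end) as uv2
     from t
     where dt between '20131108' and '20131110'
     group by dt) c
on a.dt = c.dt;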

Looking at the execution plan of the rewritten query (non-essential parts omitted):

STAGE DEPENDENCIES:
  Stage-2 depends on stages: Stage-1, Stage-3, Stage-4
  Stage-3 is a root stage
  Stage-4 is a root stage

There is also a solution based on union all + sum; interested readers can try that as well.
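A sketch of that union all + sum idea against the same table t; the tag and uv column names are illustrative. Each branch computes one distinct per day, and the outer query pivots the three rows per day back into three columns:

-- Sketch only: one distinct per union-all branch, then pivot with sum()
select dt,
       sum(case when tag = 0 then uv else 0 end) as uv,
       sum(case when tag = 1 then uv else 0 end) as uv1,
       sum(case when tag = 2 then uv else 0 end) as uv2
from (
  select dt, 0 as tag, count(distinct c1) as uv
    from t where dt between '20131108' and '20131110' group by dt
  union all
  select dt, 1 as tag, count(distinct case when c2>0 and c1=0 then c1 end) as uv
    from t where dt between '20131108' and '20131110' group by dt
  union all
  select dt, 2 as tag, count(distinct case when c2>0 and c1>0 then c1 end) as uv
    from t where dt between '20131108' and '20131110' group by dt
) x
group by dt;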

And if a 100% exact answer is not required, a bucket sample can get a result even faster :)
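A minimal sketch of the sampling idea, assuming the data can be bucket-sampled on c1 into 32 buckets; the bucket count, the scaling factor, and the column choice are illustrative, and the result is an estimate rather than an exact count:

-- Sketch only: sample one of 32 buckets on c1 and scale the distinct count back up
select dt,
       count(distinct c1) * 32 as approx_uv
from t tablesample (bucket 1 out of 32 on c1) s
where dt between '20131108' and '20131110'
group by dt;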

This article is reposted from the 51CTO blog of MIKE老毕. Original link: http://blog.51cto.com/boylook/1322536. For reprinting, please contact the original author.