
Hive for Multi-dimensional Statistical Analysis: Applications & Tips

Original article: https://my.oschina.net/leejun2005/blog/121945

Multi-dimensional statistics generally come in two flavors; let's look at how to solve each in Hive:

1. Multi-dimensional combination statistics over the same attributes

(1) Problem:

We have the following data, whose fields are: url, catePath0, catePath1, catePath2, unitparams

https://cwiki.apache.org/confluence 0 1 8 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} 

http://my.oschina.net/leejun2005/blog/83058 0 1 23 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} 

http://www.hao123.com/indexnt.html?sto 0 1 25 {"store":{"fruit":[{"weight":1,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} 

https://cwiki.apache.org/confluence 0 5 18 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} 

http://my.oschina.net/leejun2005/blog/83058 0 5 118 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} 

http://www.hao123.com/indexnt.html?sto 0 3 98 {"store":{"fruit":[{"weight":3,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} 

http://www.hao123.com/indexnt.html?sto 0 3 8 {"store":{"fruit":[{"weight":3,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} 

http://my.oschina.net/leejun2005/blog/83058 0 5 81 {"store":{"fruit":[{"weight":5,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} 

http://www.hao123.com/indexnt.html?sto 0 9 8 {"store":{"fruit":[{"weight":9,"type":"apple"},{"weight":9,"type":"pear"}],"bicycle":{"price":19.951,"color":"red1"}},"email":"amy@only_for_json_udf_test.net","owner":"amy1"} 

(2) Requirement:

For each combination of the three dimensions catePath0, catePath1, catePath2, compute the PV and UV over the urls, e.g.:

0 1 23 1 1 

0 1 25 1 1 

0 1 8 1 1 

0 1 ALL 3 3 

0 3 8 1 1 

0 3 98 1 1 

0 3 ALL 2 1 

0 5 118 1 1 

0 5 18 1 1 

0 5 81 1 1 

0 5 ALL 3 2 

0 ALL ALL 8 3 

ALL ALL ALL 8 3 

(3) Approach:

In Hive, same-attribute multi-dimensional statistics are usually solved by composing every dimension combination with UNION ALL and then aggregating with GROUP BY:

<a href="http://blog.51cto.com/2226894115/1916895#">?</a>

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

create EXTERNAL table IF NOT EXISTS t_log (
    url string, c0 string, c1 string, c2 string, unitparams string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' location '/tmp/decli/1';

-- Each UNION ALL branch rolls one more dimension up to 'ALL'. Every branch
-- needs its own LATERAL VIEW to extract the host, and the same JSON filter
-- (weight != 9), so the PV/UV counts stay consistent across branches.
select c0, c1, c2, count(host) PV, count(distinct host) UV
from (
    select host, c0, c1, c2 from t_log t0
        LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
        where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
    union all
    select host, c0, c1, 'ALL' c2 from t_log t0
        LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
        where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
    union all
    select host, c0, 'ALL' c1, 'ALL' c2 from t_log t0
        LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
        where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
    union all
    select host, 'ALL' c0, 'ALL' c1, 'ALL' c2 from t_log t0
        LATERAL VIEW parse_url_tuple(url, 'HOST') t1 as host
        where get_json_object(t0.unitparams, '$.store.fruit[0].weight') != 9
) test
group by c0, c1, c2;

2. Multi-dimensional combination statistics over different attributes

For this scenario we generally use Multi Table/File Inserts. The following is excerpted from Programming Hive, p. 124:

Making Multiple Passes over the Same Data

Hive has a special syntax for producing multiple aggregations from a single pass through a source of data, rather than rescanning it for each aggregation. This change can save considerable processing time for large input data sets. We discussed the details previously in Chapter 5.

For example, each of the following two queries creates a table from the same source table, history:

hive> INSERT OVERWRITE TABLE sales
    > SELECT * FROM history WHERE action='purchased';

hive> INSERT OVERWRITE TABLE credits
    > SELECT * FROM history WHERE action='returned';

This syntax is correct, but inefficient. The following rewrite achieves the same thing, but using a single pass through the source history table:

hive> FROM history
    > INSERT OVERWRITE TABLE sales   SELECT * WHERE action='purchased'
    > INSERT OVERWRITE TABLE credits SELECT * WHERE action='returned';

Another multi-insert example of the same pattern, adapted from the Hive Tutorial (the tutorial's count_distinct is written here as the real aggregate count(DISTINCT ...)):

FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
    SELECT pv_users.gender, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
    SELECT pv_users.age, count(DISTINCT pv_users.userid)
    GROUP BY pv_users.age;

<a href="https://cwiki.apache.org/confluence/display/Hive/Tutorial" target="_blank">https://cwiki.apache.org/confluence/display/Hive/Tutorial</a>

Caveats and a few small tips:

1. Hive UNION ALL usage: it is not supported at the top level of a query (wrap the branches in a subquery), and the column names and types of every SELECT must match exactly; see the sketch below.
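
A minimal sketch of the wrapping (using the t_log table defined above): put the UNION ALL inside a subquery and alias the columns so every branch agrees:

select c0, count(1) cnt
from (
    select c0 from t_log
    union all
    select c1 as c0 from t_log   -- names and types must match the first branch exactly
) t
group by c0;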

2. For the ordering of the result rows, you can add sort-key characters of your own to control the sort order, as sketched below.
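
For example (an illustrative sketch over a hypothetical result table t_result(c0, c1, c2, pv, uv)): to push the 'ALL' rollup rows behind every concrete value, build a sort key that substitutes a character sorting last:

select c0, c1, c2, pv, uv,
       concat(case when c0 = 'ALL' then '~' else c0 end,
              case when c1 = 'ALL' then '~' else c1 end,
              case when c2 = 'ALL' then '~' else c2 end) sort_key
from t_result
order by sort_key;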

3. Multi-insert, like UNION ALL, scans the source only once, but because it inserts into multiple partitions it does a lot of extra work and can take very long; it spawns multiple jobs, whereas UNION ALL by itself runs as a single job.

On running the multiple jobs produced by insert overwrite in parallel:

set hive.exec.parallel=true;   -- enable parallel job execution

set hive.exec.parallel.thread.number=16; -- maximum parallelism allowed for one SQL statement; the default is 8

<a href="http://superlxw1234.iteye.com/blog/1703713" target="_blank">http://superlxw1234.iteye.com/blog/1703713</a>

4. Hive currently does not support subqueries inside NOT IN, so HQL like the following is not accepted:

To find the rows whose key is in table a but not in table b:

select a.key from a where key not in (select key from b)  -- not supported in Hive

The query can be rewritten with a left outer join (assuming table b contains another field key1 that is never null):

select a.key from a left outer join b on a.key = b.key where b.key1 is null

5. You cannot chain three or more left outer joins directly; group them two at a time and wrap each pair in a subquery, as in the query below.

select p.ssi, p.pv, p.uv, p.nuv, p.visits, '2012-06-19 17:00:00'
from (
    select * from (
        select * from (
            select ssi, count(1) pv, sum(visits) visits from FactClickAnalysis
            where logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00'
            group by ssi
        ) p1
        left outer join (
            select ssi, count(1) uv from (
                select ssi, cookieid from FactClickAnalysis
                where logTime <= '2012-06-19 18:00:00' and logTime >= '2012-06-19 17:00:00'
                group by ssi, cookieid
            ) t1 group by ssi
        ) p2 on p1.ssi = p2.ssi
    ) p3
    left outer join (
        select ssi, count(1) nuv from FactClickAnalysis
        where logTime = insertTime and logTime <= '2012-06-19 18:00:00'
          and logTime >= '2012-06-19 17:00:00'
        group by ssi
    ) p4 on p3.ssi = p4.ssi
) p

6. Running MapReduce jobs locally in Hive

<a href="http://superlxw1234.iteye.com/blog/1703546" target="_blank">http://superlxw1234.iteye.com/blog/1703546</a>

7. An error encountered when too many dynamic partitions are created in Hive

<a href="http://superlxw1234.iteye.com/blog/1677938" target="_blank">http://superlxw1234.iteye.com/blog/1677938</a>

8. Making clever use of greedy regex matching in Hive

<a href="http://superlxw1234.iteye.com/blog/1751216" target="_blank">http://superlxw1234.iteye.com/blog/1751216</a>

9. Matching an all-Chinese field in Hive

Just use Java's regex range for Chinese characters:

name rlike '^[\\u4e00-\\u9fa5]+$'

To check whether a field is all digits:

select mobile from woa_login_log_his where pt = '2012-01-10' and mobile rlike '^\\d+$' limit 50;

10. Using the SQL window functions LAG/LEAD/FIRST/LAST in Hive

<a href="http://superlxw1234.iteye.com/blog/1600323" target="_blank">http://superlxw1234.iteye.com/blog/1600323</a>

<a href="http://www.shaoqun.com/a/18839.aspx" target="_blank">http://www.shaoqun.com/a/18839.aspx</a>

11. Hive tuning: controlling the number of map and reduce tasks in a Hive job

<a href="http://superlxw1234.iteye.com/blog/1582880" target="_blank">http://superlxw1234.iteye.com/blog/1582880</a>

12. Escaping $ and other special characters in Hive

<a href="http://superlxw1234.iteye.com/blog/1568739" target="_blank">http://superlxw1234.iteye.com/blog/1568739</a>

13. Date handling:

Get the date N days ago:

select from_unixtime(unix_timestamp('20111102','yyyyMMdd') - N*86400,'yyyyMMdd') from t_lxw_test1 limit 1;

Get the number of days/seconds/minutes etc. between two dates:

select ( unix_timestamp('2011-11-02','yyyy-MM-dd')-unix_timestamp('2011-11-01','yyyy-MM-dd') ) / 86400 from t_lxw_test limit 1;
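
For whole days specifically, datediff gives the same answer more directly:

select datediff('2011-11-02', '2011-11-01') from t_lxw_test limit 1;  -- 1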

14. Deleting Hive temporary files under hive.exec.scratchdir

<a href="http://hi.baidu.com/youziguo/item/1dd7e6315dcc0f28b2c0c576" target="_blank">http://hi.baidu.com/youziguo/item/1dd7e6315dcc0f28b2c0c576</a>

REF:

http://superlxw1234.iteye.com/blog/1536440

http://liubingwwww.blog.163.com/blog/static/3048510720125201749323/

http://blog.csdn.net/azhao_dn/article/details/6921429

<a href="http://superlxw1234.iteye.com/category/228899" target="_blank">http://superlxw1234.iteye.com/category/228899</a>