Hive SQL去重a,b和b,a類型

昨天開發找到我們DBA，要我們寫一條Hive SQL。

需求：

有一個t表，主要有機場名稱airport，機場的經緯度distance這兩個列組成，想得到所有距離小于100的兩個機場名。

其實寫這個SQL的邏輯并不是很困難，難點是如何去重複值，

我用MySQL模拟的一個表，其實Hive文法和SQL差不多，插入了三條資料，a, b, c 分别代表三個機場名稱，結構如下：

mysql> show create table t\G
*************************** 1. row ***************************
       Table: t
Create Table: CREATE TABLE `t` (
  `airport` varchar(10) DEFAULT NULL,
  `distant` int(11) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

mysql> select * from t;
+---------+---------+
| airport | distant |
+---------+---------+
| a       |     130 |
| b       |     140 |
| c       |     150 |
+---------+---------+
3 rows in set (0.00 sec)

通過!=篩選掉本機場自己之間的比較，用abs函數取絕對值得到位置小于100的兩個機場

mysql> select t1.airport, t2.airport from t t1,t t2 where t1.airport != t2.airport and abs(t1.distant-t2.distant) < 100;
+---------+---------+
| airport | airport |
+---------+---------+
| b       | a       |
| c       | a       |
| a       | b       |
| c       | b       |
| a       | c       |
| b       | c       |
+---------+---------+
6 rows in set (0.00 sec)

但是問題來了，(b,a) 與(a,b)，(c,a)與(a,c)，(c,b)與(b,c)這裡被我們視為重複值，我們隻需要得到其中某一行的資料，就知道是哪兩個機場名了，那麼，如何去掉這個重複值呢？

貌似distinct，group by都派不上用場了，最後咨詢了一位資深的SQL高手，找到了這麼一個函數hex(),可以把一個字元轉化成十六進制，Hive也有對應的函數，效果如下：

mysql> select t1.airport,hex(t1.airport), t2.airport,hex(t2.airport) from t t1,t t2 where t1.airport != t2.airport and abs(t1.distant-t2.distant) < 100;
+---------+-----------------+---------+-----------------+
| airport | hex(t1.airport) | airport | hex(t2.airport) |
+---------+-----------------+---------+-----------------+
| b       | 62              | a       | 61              |
| c       | 63              | a       | 61              |
| a       | 61              | b       | 62              |
| c       | 63              | b       | 62              |
| a       | 61              | c       | 63              |
| b       | 62              | c       | 63              |
+---------+-----------------+---------+-----------------+
6 rows in set (0.00 sec)

這樣我們就可以通過比較機場1和機場2的大小，來去掉重複值了

mysql> select t1.airport, t2.airport from t t1,t t2 where t1.airport != t2.airport and hex(t1.airport) < hex(t2.airport) and abs(t1.distant-t2.distant) < 100;
+---------+---------+
| airport | airport |
+---------+---------+
| a       | b       |
| a       | c       |
| b       | c       |
+---------+---------+
3 rows in set (0.00 sec)

最後再優化一下，結果如下：

mysql> select t1.airport, t2.airport from t t1,t t2 where hex(t1.airport) < hex(t2.airport) and abs(t1.distant-t2.distant) < 100;
+---------+---------+
| airport | airport |
+---------+---------+
| a       | b       |
| a       | c       |
| b       | c       |
+---------+---------+
3 rows in set (0.00 sec)

SQL并不複雜，沒有太多表的join和子查詢，但是之前遇到去掉重複值用distinct或者group by就可以解決了，這次貌似不太适用，是以記錄一下，歡迎拍磚。

參考連結

https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_hex https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Built-inFunctions

Hive SQL去重a,b和b,a類型

繼續閱讀

MyBatis-Plus 之AR模式

ASP程式設計經典例子

ASP程式設計中20個非常有用的例子

《Hive權威指南》第八章：HiveQL索引8 HiveQL：索引

龍珠訓練營task04

阿裡雲天池龍珠計劃SQL訓練營打卡

阿裡雲天池龍珠計劃SQL訓練營day1

實驗樓sql進階之成績管理系統的資料操作(window)

HiveQl語句應用執行個體：WordCount具體步驟如下：

Oracle的基本操作

SQL優化SQL語句優化的目的

JAVA高效程式設計指南

關于SQL語言

SQL語言基礎：常用的資料查詢語句

neo4j之cypher使用文檔

sqlServer根據經緯查距離