Deduplicating a List in Java, and deduplicating with Stream

Problem

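Internet technology has matured, and systems increasingly lean toward decentralization, distributed architectures, and stream computing. As a result, much of the work that used to be done on the database side has moved to the Java side. Today someone asked: if a database column has no index, how should we deduplicate by that column? Everyone agreed it should be done in Java, but how exactly?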

Solution

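I suddenly remembered an old article of mine about deduplicating a list and dug it out. The approach was to override hashCode and equals on the objects in the list, throw them into a HashSet, and take them out again. That was the answer I wrote down when I first learned Java, memorized like a dictionary entry. It is the same in interviews: I have interviewed people claiming three years of Java experience who could recite the differences between Set and HashMap but had no idea how either is implemented. In other words, beginners only memorize features. When you actually use something in a project, though, you need to verify that it really behaves that way. Recitation is useless; only results can be trusted. You need to know how HashSet does the deduplication for you. Put differently: can you deduplicate without HashSet? The simplest and most direct way is to compare each element against everything kept so far and append it to the end if it matches none of them. HashSet merely speeds that process up.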

First, here is the User object we want to deduplicate:

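import lombok.AllArgsConstructor;
import lombok.Builder;
import lombok.Data;

// Lombok generates the getters, a builder, and the (id, name) constructor used below.
// @Data also generates equals/hashCode over all fields unless you define your own.
@Data
@Builder
@AllArgsConstructor
public class User {

  private Integer id;
  private String name;
}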


List<User> users = Lists.newArrayList(
        new User(1, "a"),
        new User(1, "b"),
        new User(2, "b"),
        new User(1, "a"));

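The goal is to pick out users with distinct ids. To avoid arguments, here is the rule: any result in which each id appears once is acceptable, and it does not matter which record is kept when several share the same id.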

The most intuitive approach

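This approach starts with an empty list and, while iterating, adds each element whose id has not been seen yet. Every element is compared against everything already kept, so the cost grows quadratically with the size of the input.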

@Test
public void dis1() {
    List<User> result = new LinkedList<>();
    for (User user : users) {
      boolean b = result.stream().anyMatch(u -> u.getId().equals(user.getId()));
      if (!b) {
        result.add(user);
      }
    }

    System.out.println(result);
}

Using HashSet

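Anyone who has memorized the features knows that HashSet can deduplicate, but how does it do it? Those who memorized a bit more will say it relies on hashCode and equals. And how does it use those two methods? People who have not read the source cannot go any further, and that is where the interview ends.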

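In fact, HashSet is implemented on top of HashMap. (Before reading the source I had always assumed that HashMap's keys were implemented with a HashSet; it is exactly the other way around.) I will not go into detail here; looking at HashSet's constructor and add method is enough to understand it.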

public HashSet() {
    map = new HashMap<>();
}

/**
* Returns false if the element already exists, true if it does not.
*/
public boolean add(E e) {
    return map.put(e, PRESENT)==null;
}

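From this we can see that HashSet's deduplication is built on HashMap, and HashMap in turn relies entirely on hashCode and equals. Now everything connects: if you want to use HashSet, you have to get those two methods right.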
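To see the return value of add in action, here is a minimal sketch. The class name HashSetAddDemo and the helper addResults are made up for illustration; only HashSet.add itself comes from the JDK. The input ids mirror the ids of the four users above.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HashSetAddDemo {

    /** Adds each value to a fresh HashSet and records what add() returned. */
    static List<Boolean> addResults(int... values) {
        Set<Integer> set = new HashSet<>();
        List<Boolean> results = new ArrayList<>();
        for (int value : values) {
            // false means an equal element was already in the set
            results.add(set.add(value));
        }
        return results;
    }

    public static void main(String[] args) {
        // Same ids as the sample users: 1, 1, 2, 1
        System.out.println(addResults(1, 1, 2, 1)); // [true, false, true, false]
    }
}
```

The second and fourth calls return false because an equal key is already stored in the backing map, which is the whole deduplication mechanism.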

In this problem we deduplicate by id, so id is what the comparison must be based on. Modify User as follows:

@Override
public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    if (o == null || getClass() != o.getClass()) {
      return false;
    }
    User user = (User) o;
    return Objects.equals(id, user.id);
}

@Override
public int hashCode() {
    return Objects.hash(id);
}


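// Core accumulation step inside Arrays.hashCode
result = 31 * result + (element == null ? 0 : element.hashCode());

Here, Objects.hash delegates to Arrays.hashCode, whose core step is shown above. Multiplying by 31 is equivalent to (x << 5) - x; the parentheses matter, because in Java the shift operator binds less tightly than subtraction.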
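The formula can be checked with a short sketch. The class HashFormulaDemo and its two helpers are hypothetical names; the only JDK call is Objects.hash. The starting value of 1 is how Arrays.hashCode initializes its accumulator.

```java
import java.util.Objects;

public class HashFormulaDemo {

    /** Re-implements the accumulation step for a single element. */
    static int hashOfSingle(Object element) {
        int result = 1; // Arrays.hashCode starts its accumulator at 1
        result = 31 * result + (element == null ? 0 : element.hashCode());
        return result;
    }

    /** 31 * x written as shift and subtract; the parentheses are required. */
    static int times31(int x) {
        return (x << 5) - x;
    }

    public static void main(String[] args) {
        Integer id = 7;
        System.out.println(Objects.hash(id) == hashOfSingle(id)); // true
        System.out.println(times31(7) == 31 * 7);                 // true
    }
}
```

For a single Integer id, Objects.hash(id) is therefore 31 + id, so users with equal ids always land in the same bucket.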

The final implementation looks like this:

@Test
public void dis2() {
    Set<User> result = new HashSet<>(users);
    System.out.println(result);
}

Deduplicating with Java Stream

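Back to the original question. It came up because we wanted to move deduplication from the database side to the Java side, so the data volume may be fairly large, say 100,000 records. For data of that size, the Stream functions are the simplest option, and Stream happens to provide a distinct function. So how is it used?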

users.parallelStream().distinct().forEach(System.out::println);

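It takes no lambda as a parameter, so there is no way to supply a custom criterion. Fortunately, the Javadoc states what the deduplication is based on: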

Returns a stream consisting of the distinct elements
(according to {@link Object#equals(Object)}) of this stream.

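We all know, and must have memorized, this rule: when equals returns true, hashCode must return the same value for both objects. The rule feels a little tangled when you only memorize it, but once you understand how HashMap is implemented it stops sounding awkward: HashMap first locates the bucket using hashCode and then compares candidates with equals.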

So, to deduplicate with distinct, you must override hashCode and equals, unless the default identity-based versions happen to be what you want.
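As a self-contained illustration of that requirement, here is a sketch that uses a plain-Java stand-in for User (no Lombok, an assumption made only so the example compiles on its own) with equality defined by id. The names DistinctDemo and distinctUsers are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

public class DistinctDemo {

    /** Plain-Java stand-in for the article's User, equal by id only. */
    static final class User {
        final Integer id;
        final String name;

        User(Integer id, String name) {
            this.id = id;
            this.name = name;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof User)) return false;
            return Objects.equals(id, ((User) o).id);
        }

        @Override
        public int hashCode() {
            return Objects.hash(id);
        }

        @Override
        public String toString() {
            return "User(" + id + ", " + name + ")";
        }
    }

    static List<User> distinctUsers(List<User> users) {
        // On a sequential, ordered stream distinct() keeps the first
        // element of each group of equal elements.
        return users.stream().distinct().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<User> users = Arrays.asList(
                new User(1, "a"), new User(1, "b"),
                new User(2, "b"), new User(1, "a"));
        System.out.println(distinctUsers(users)); // [User(1, a), User(2, b)]
    }
}
```

Without the two overrides, all four objects would be distinct by identity and distinct() would return the list unchanged.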

So why exactly does it work this way? Let's step into the implementation and take a look.

<P_IN> Node<T> reduce(PipelineHelper<T> helper, Spliterator<P_IN> spliterator) {
    // If the stream is SORTED then it should also be ORDERED so the following will also
    // preserve the sort order
    TerminalOp<T, LinkedHashSet<T>> reduceOp
            = ReduceOps.<T, LinkedHashSet<T>>makeRef(LinkedHashSet::new, LinkedHashSet::add,
                                                     LinkedHashSet::addAll);
    return Nodes.node(reduceOp.evaluateParallel(helper, spliterator));
}

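So internally it is implemented with a reduce into a LinkedHashSet. Seeing that, I immediately thought of a way to write my own distinctByKey: take each element of the Stream, check its key against a set that I maintain myself, skip the element if the key is already there, and record the key if it is not. The code below does this with filter and a concurrent set rather than a literal reduce, but the idea is still the same straightforward method from the beginning.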

@Test
public void dis3() {
    users.parallelStream().filter(distinctByKey(User::getId))
        .forEach(System.out::println);
}


public static <T> Predicate<T> distinctByKey(Function<? super T, ?> keyExtractor) {
    Set<Object> seen = ConcurrentHashMap.newKeySet();
    return t -> seen.add(keyExtractor.apply(t));
}

Of course, with a parallel stream the element that survives is not necessarily the first one; which one is kept is arbitrary.
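For comparison, here is a self-contained sketch of the same predicate on a sequential stream, where in practice the first element per key is the one that passes the filter. DistinctByKeyDemo, firstPerLength, and the word list are made up for illustration; distinctByKey is copied from above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class DistinctByKeyDemo {

    static <T> Predicate<T> distinctByKey(Function<? super T, ?> keyExtractor) {
        // The set is captured by the lambda, so each predicate carries its own state.
        Set<Object> seen = ConcurrentHashMap.newKeySet();
        return t -> seen.add(keyExtractor.apply(t));
    }

    /** Keeps one word per distinct length, using a sequential stream. */
    static List<String> firstPerLength(List<String> words) {
        return words.stream()
                .filter(distinctByKey(String::length))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "bb", "c", "dd", "eee");
        System.out.println(firstPerLength(words)); // [a, bb, eee]
    }
}
```

Two caveats follow from the implementation. The predicate is stateful, so a fresh one should be created for every stream rather than reused. And because ConcurrentHashMap does not accept null keys, a keyExtractor that returns null will cause a NullPointerException.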

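This is the best method I have found so far, and it is non-invasive: the User class does not have to change. If you insist on using distinct, however, the only option is to override hashCode and equals as in the HashSet approach.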
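Another non-invasive option, not used in this article and offered only as an alternative sketch, is to collect into a map keyed by the field with Collectors.toMap and a merge function that keeps the earlier value. The names DistinctByToMapDemo and distinctBy are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class DistinctByToMapDemo {

    /** Keeps the first item seen for each key, in first-seen key order. */
    static <T, K> List<T> distinctBy(List<T> items,
                                     Function<? super T, ? extends K> keyOf) {
        Map<K, T> firstByKey = items.stream().collect(Collectors.toMap(
                keyOf,
                Function.identity(),
                (first, later) -> first,   // on a key collision keep the earlier item
                LinkedHashMap::new));      // remember the order keys were first seen
        return new ArrayList<>(firstByKey.values());
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "bb", "c", "dd", "eee");
        System.out.println(distinctBy(words, String::length)); // [a, bb, eee]
    }
}
```

The trade-off is that the whole result is materialized in a map before anything is returned, whereas the filter-based predicate lets elements flow through the pipeline one at a time.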

Summary

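Whether you can actually use these tools depends on having practiced with them yourself. Otherwise, when you really need them, you will either struggle to produce a solution on the spot or end up using them at your own risk. And if you want to use them with confidence, you also need to understand the rules and how they are implemented. For example: how do the implementations of LinkedHashSet and HashSet differ?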
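The visible difference between the two can be shown with a small sketch: LinkedHashSet iterates in insertion order, while HashSet makes no ordering promise. The names SetOrderDemo and dedupKeepingOrder are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;

public class SetOrderDemo {

    /** Removes duplicates while keeping the order of first appearance. */
    static List<String> dedupKeepingOrder(List<String> items) {
        return new ArrayList<>(new LinkedHashSet<>(items));
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("c", "a", "b", "a", "c");

        System.out.println(dedupKeepingOrder(input)); // [c, a, b]

        // Both sets hold the same elements; only the iteration order may differ.
        System.out.println(new HashSet<>(input).equals(new LinkedHashSet<>(input))); // true
    }
}
```

This ordering guarantee is why the distinct implementation quoted earlier collects into a LinkedHashSet rather than a plain HashSet.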

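Here is the remarkably simple LinkedHashSet source. The extra boolean passed to super selects a package-private HashSet constructor that backs the set with a LinkedHashMap: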

public class LinkedHashSet<E>
    extends HashSet<E>
    implements Set<E>, Cloneable, java.io.Serializable {

    private static final long serialVersionUID = -2851667679971038690L;

    public LinkedHashSet(int initialCapacity, float loadFactor) {
        super(initialCapacity, loadFactor, true);
    }

    public LinkedHashSet(int initialCapacity) {
        super(initialCapacity, .75f, true);
    }

    public LinkedHashSet() {
        super(16, .75f, true);
    }

    public LinkedHashSet(Collection<? extends E> c) {
        super(Math.max(2*c.size(), 11), .75f, true);
        addAll(c);
    }

    @Override
    public Spliterator<E> spliterator() {
        return Spliterators.spliterator(this, Spliterator.DISTINCT | Spliterator.ORDERED);
    }
}

Ryan Miao
