Deduplicating a List<JSON> with Java Streams

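You often end up with a pile of JSON data that is full of duplicate junk to clean out. Calling remove() on the List directly to drop the duplicates tends to blow up (for example with a ConcurrentModificationException when you remove inside a for-each loop). Stream plus Collectors handles it in a single expression, which is very satisfying.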

The problem

Say an API returns a list of coupons, and entries with the same user_create_time count as duplicates:

[
  {"id": "C001", "user_create_time": "2026-01-01 10:00:00", "discount": "10%"},
  {"id": "C002", "user_create_time": "2026-01-01 10:00:00", "discount": "10%"},
  {"id": "C003", "user_create_time": "2026-01-02 11:00:00", "discount": "20%"},
  {"id": "C004", "user_create_time": "2026-01-01 10:00:00", "discount": "10%"}
]

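C001, C002, and C004 all share the same user_create_time, so C002 and C004 should be dropped.

How the solution evolved

Version 1: a full Comparator

The most explicit, but also the most verbose, way to write it: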

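import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;
import java.util.stream.Collectors;
// JSONObject comes from whichever JSON library you use; its import is omitted here

List<JSONObject> mergeCoupons = new ArrayList<>();
// ... assume mergeCoupons has already been populated

List<JSONObject> deduplicated = mergeCoupons.stream()
    .collect(Collectors.collectingAndThen(
        Collectors.toCollection(() -> new TreeSet<>(new Comparator<JSONObject>() {
            @Override
            public int compare(JSONObject o1, JSONObject o2) {
                // The key must match the JSON exactly: lowercase "user_create_time"
                String time1 = o1.getString("user_create_time");
                String time2 = o2.getString("user_create_time");

                // 0 means duplicate (dropped); any other value orders the two elements.
                // Returning only 0 or 1 would break the Comparator contract and
                // could let duplicates slip through.
                return time1.compareTo(time2);
            }
        })),
        ArrayList::new
    ));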

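TreeSet is the key piece here:

- TreeSet drops duplicates automatically, as decided by the Comparator
- When compare() returns 0 the two elements count as equal, so TreeSet does not add the new one and the element seen first is kept
- ArrayList::new then copies the set into an ArrayList, in the TreeSet's iteration order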
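To see why the value returned by compare() matters, here is a minimal sketch that uses plain timestamp strings in place of JSONObject. The class name ComparatorContractDemo and its members are made up for illustration. It runs the same input through a comparator based on natural ordering and through one that only ever returns 0 or 1. The 0/1 version breaks the Comparator contract, because compare(a, b) and compare(b, a) can both be 1. On OpenJDK's red-black tree a repeated value can then be compared only against different values and get inserted a second time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

final class ComparatorContractDemo {

    // Comparator that only answers "same" (0) or "different" (1).
    // It is not antisymmetric, so it violates the Comparator contract.
    static final Comparator<String> ZERO_OR_ONE = (a, b) -> a.equals(b) ? 0 : 1;

    // Adds the values one by one to a TreeSet built on the given comparator,
    // then copies the survivors into a list in the set's iteration order.
    static List<String> dedupe(List<String> values, Comparator<String> comparator) {
        TreeSet<String> set = new TreeSet<>(comparator);
        set.addAll(values);
        return new ArrayList<>(set);
    }

    public static void main(String[] args) {
        List<String> times = Arrays.asList(
            "2026-01-01", "2026-01-02", "2026-01-03", "2026-01-01");

        // Consistent ordering: the repeated first value is dropped.
        System.out.println(dedupe(times, Comparator.<String>naturalOrder()));

        // 0/1 comparator: the repeated value can survive.
        System.out.println(dedupe(times, ZERO_OR_ONE));
    }
}
```

Whether a given duplicate slips through depends on insertion order and on how the tree rebalances, which is exactly why this kind of bug is easy to miss in small tests.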

Version 2: shortened with a lambda

Use a lambda to simplify the Comparator:

List<JSONObject> deduplicated = mergeCoupons.stream()
    .collect(Collectors.collectingAndThen(
        // compareTo() returns 0 for equal times and a consistent order otherwise
        Collectors.toCollection(() -> new TreeSet<>((o1, o2) ->
            o1.getString("user_create_time").compareTo(o2.getString("user_create_time"))
        )),
        ArrayList::new
    ));

Much shorter, same logic.

Version 3: the compact form

The same thing squeezed down further:

List<JSONObject> deduplicated = mergeCoupons.stream()
    .collect(Collectors.collectingAndThen(
        Collectors.toCollection(() -> new TreeSet<>(
            Comparator.comparing((JSONObject o) -> o.getString("user_create_time")))),
        ArrayList::new));

How the pattern works

mergeCoupons.stream()
  ↓
.collect(Collectors.collectingAndThen(
    Collectors.toCollection(() -> new TreeSet<>(comparator)),
    ArrayList::new
))
  ↓

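There are three steps:

1. Collectors.toCollection() - collects the stream into a TreeSet
- The TreeSet uses the comparator to drop duplicates (any element for which compare returns 0 against one already in the set)
- Result: TreeSet<JSONObject>
2. collectingAndThen() - applies one more transformation to the result of step 1
- First argument: the downstream collector (toCollection)
- Second argument: the finisher function (ArrayList::new, which copies the Set into a List)
- Result: List<JSONObject>
3. Done - you get the final deduplicated list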
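The same steps can be spelled out without collectingAndThen(), which may make the data flow easier to follow. This is only a sketch: it uses Map<String, String> as a stand-in for JSONObject so that it runs without a JSON library, and the names TwoStepDedup, dedupeByTime, and coupon are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.stream.Collectors;

final class TwoStepDedup {

    static List<Map<String, String>> dedupeByTime(List<Map<String, String>> coupons) {
        Comparator<Map<String, String>> byTime =
            Comparator.comparing(c -> c.get("user_create_time"));

        // Step 1: toCollection() gathers the stream into a TreeSet, which
        // skips every element that compares as 0 to one already present.
        TreeSet<Map<String, String>> unique = coupons.stream()
            .collect(Collectors.toCollection(() -> new TreeSet<>(byTime)));

        // Step 2: the finisher that collectingAndThen() would apply,
        // copying the set into a list.
        return new ArrayList<>(unique);
    }

    // Builds a coupon with just the two fields this demo needs.
    static Map<String, String> coupon(String id, String time) {
        Map<String, String> c = new HashMap<>();
        c.put("id", id);
        c.put("user_create_time", time);
        return c;
    }

    public static void main(String[] args) {
        List<Map<String, String>> coupons = new ArrayList<>();
        coupons.add(coupon("C001", "2026-01-01 10:00:00"));
        coupons.add(coupon("C002", "2026-01-01 10:00:00"));
        coupons.add(coupon("C003", "2026-01-02 11:00:00"));
        coupons.add(coupon("C004", "2026-01-01 10:00:00"));

        // Prints C001 and C003, one per line.
        for (Map<String, String> c : dedupeByTime(coupons)) {
            System.out.println(c.get("id"));
        }
    }
}
```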

A practical example

Suppose you have a method that deduplicates coupons:

public List<JSONObject> deduplicateCoupons(List<JSONObject> coupons) {
    return coupons.stream()
        .collect(Collectors.collectingAndThen(
            // Key is lowercase to match the JSON above
            Collectors.toCollection(() -> new TreeSet<>((c1, c2) ->
                c1.getString("user_create_time")
                  .compareTo(c2.getString("user_create_time"))
            )),
            ArrayList::new
        ));
}

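Note that the comparator is compareTo(), not equals() mapped to 0 or 1. A comparator that only ever returns 0 or 1 breaks the Comparator contract (compare(a, b) and compare(b, a) must have opposite signs), and TreeSet can then let duplicates through. compareTo() gives a consistent ordering, so the result also comes out sorted by time in ascending order while the duplicates are dropped. String order matches chronological order here only because the timestamps are fixed-width yyyy-MM-dd HH:mm:ss strings.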
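The sorting side effect is easy to check with input that is not in time order. The sketch below again swaps JSONObject for Map<String, String> so it runs on its own; SortedDedupDemo, idsAfterDedup, and coupon are illustrative names, not part of any library.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.stream.Collectors;

final class SortedDedupDemo {

    // Deduplicates by user_create_time and returns the surviving ids.
    static List<String> idsAfterDedup(List<Map<String, String>> coupons) {
        Comparator<Map<String, String>> byTime =
            Comparator.comparing(c -> c.get("user_create_time"));

        List<Map<String, String>> deduplicated = coupons.stream()
            .collect(Collectors.collectingAndThen(
                Collectors.toCollection(() -> new TreeSet<>(byTime)),
                ArrayList::new));

        return deduplicated.stream()
            .map(c -> c.get("id"))
            .collect(Collectors.toList());
    }

    // Builds a coupon with just the two fields this demo needs.
    static Map<String, String> coupon(String id, String time) {
        Map<String, String> c = new HashMap<>();
        c.put("id", id);
        c.put("user_create_time", time);
        return c;
    }

    public static void main(String[] args) {
        List<Map<String, String>> coupons = new ArrayList<>();
        coupons.add(coupon("C003", "2026-01-02 11:00:00")); // latest time comes first
        coupons.add(coupon("C001", "2026-01-01 10:00:00"));
        coupons.add(coupon("C002", "2026-01-01 10:00:00")); // same time as C001

        // Prints [C001, C003]: sorted by time, first occurrence kept.
        System.out.println(idsAfterDedup(coupons));
    }
}
```

If the original order of the list matters, this reordering is something to keep in mind.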

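Pitfalls to watch for

1. The comparison logic has to be right

// Bad: compares id when duplicates are defined by time
(o1, o2) -> o1.getString("id").compareTo(o2.getString("id"))

Every id in the sample is unique, so this comparator never returns 0 and nothing gets removed. It is looking at the wrong field.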

2. Be precise about which field defines a duplicate

// If "duplicate" really means the same id, rather than the same time
(o1, o2) -> o1.getString("id").compareTo(o2.getString("id"))

// If you want one entry per distinct time, keeping every different time
(o1, o2) -> o1.getString("user_create_time").compareTo(o2.getString("user_create_time"))
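One way to make the choice of key explicit is to pass it in. The helper below is a hypothetical sketch (distinctBy is not a JDK method, and Map<String, String> again stands in for JSONObject). It wraps the TreeSet pattern and takes the key extractor as a parameter, so the same list can be deduplicated by id or by time.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.function.Function;
import java.util.stream.Collectors;

final class DistinctByKey {

    // Hypothetical helper: keeps the first element seen for each key and
    // returns the survivors in ascending key order.
    static <T, K extends Comparable<? super K>> List<T> distinctBy(
            List<T> items, Function<T, K> keyOf) {
        Comparator<T> byKey = Comparator.comparing(keyOf);
        return items.stream()
            .collect(Collectors.collectingAndThen(
                Collectors.toCollection(() -> new TreeSet<>(byKey)),
                ArrayList::new));
    }

    // Builds a coupon with just the two fields this demo needs.
    static Map<String, String> coupon(String id, String time) {
        Map<String, String> c = new HashMap<>();
        c.put("id", id);
        c.put("user_create_time", time);
        return c;
    }

    public static void main(String[] args) {
        List<Map<String, String>> coupons = new ArrayList<>();
        coupons.add(coupon("C001", "2026-01-01 10:00:00"));
        coupons.add(coupon("C002", "2026-01-01 10:00:00"));
        coupons.add(coupon("C003", "2026-01-02 11:00:00"));
        coupons.add(coupon("C004", "2026-01-01 10:00:00"));

        // By id: all four ids differ, so nothing is removed.
        System.out.println(distinctBy(coupons, c -> c.get("id")).size());

        // By time: only two distinct times, so two coupons remain.
        System.out.println(distinctBy(coupons, c -> c.get("user_create_time")).size());
    }
}
```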

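3. Performance

TreeSet is a red-black tree internally, so each insertion costs O(log n) comparisons. With 1,000,000 elements the total time complexity is O(n log n).

If the list is very large, you can track the keys you have already seen in a HashSet and filter on that instead: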

Set<String> seenTimes = new HashSet<>();
List<JSONObject> deduplicated = coupons.stream()
    // Set.add() returns false when the time was already seen, so repeats are filtered out
    .filter(coupon -> seenTimes.add(coupon.getString("user_create_time")))
    .collect(Collectors.toList());

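This is O(n) on average, so it is faster, and on a sequential stream it also keeps the original order of the list. The downside is that it relies on a side effect (seenTimes.add), which is not very functional and is not safe on a parallel stream because HashSet is not thread-safe.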
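If the side effect in filter() is a concern, one alternative is Collectors.toMap() with a LinkedHashMap. This is a different technique swapped in for comparison, not the TreeSet pattern. The merge function keeps the first value for each key and the LinkedHashMap preserves encounter order. The sketch uses Map<String, String> as a stand-in for JSONObject, and LinkedMapDedup, dedupeByTime, and coupon are made-up names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

final class LinkedMapDedup {

    static List<Map<String, String>> dedupeByTime(List<Map<String, String>> coupons) {
        Map<String, Map<String, String>> firstPerTime = coupons.stream()
            .collect(Collectors.toMap(
                c -> c.get("user_create_time"),  // key that defines a duplicate
                Function.identity(),             // value is the coupon itself
                (first, second) -> first,        // on a key clash, keep the earlier coupon
                LinkedHashMap::new));            // remember the order keys first appeared in
        return new ArrayList<>(firstPerTime.values());
    }

    // Builds a coupon with just the two fields this demo needs.
    static Map<String, String> coupon(String id, String time) {
        Map<String, String> c = new HashMap<>();
        c.put("id", id);
        c.put("user_create_time", time);
        return c;
    }

    public static void main(String[] args) {
        List<Map<String, String>> coupons = new ArrayList<>();
        coupons.add(coupon("C003", "2026-01-02 11:00:00"));
        coupons.add(coupon("C001", "2026-01-01 10:00:00"));
        coupons.add(coupon("C002", "2026-01-01 10:00:00"));

        // Prints C003 then C001: original order is kept, C002 is dropped.
        for (Map<String, String> c : dedupeByTime(coupons)) {
            System.out.println(c.get("id"));
        }
    }
}
```

Unlike the TreeSet version, this does not sort the result, which may or may not be what you want.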

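More advanced usage

To deduplicate on several fields, compare them one after another inside the Comparator:

List<JSONObject> deduplicated = coupons.stream()
    .collect(Collectors.collectingAndThen(
        Collectors.toCollection(() -> new TreeSet<>((o1, o2) -> {
            // Compare the time first
            int timeCompare = o1.getString("user_create_time")
                .compareTo(o2.getString("user_create_time"));
            if (timeCompare != 0) {
                return timeCompare;
            }
            // Same time: fall back to the user id
            return o1.getString("user_id").compareTo(o2.getString("user_id"));
        })),
        ArrayList::new
    ));

The time is compared first, then user_id when the times are equal, so only coupons with the same time and the same user count as duplicates. (user_id is not part of the sample data above; it is assumed to exist on each record.)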
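The same two-field comparison can also be written as a chain with Comparator.comparing() and thenComparing(). The sketch below uses Map<String, String> in place of JSONObject so it is runnable on its own. MultiFieldDedup, BY_TIME_THEN_USER, and coupon are illustrative names, and the user_id field is an assumption carried over from the example above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;
import java.util.stream.Collectors;

final class MultiFieldDedup {

    // Time first, then user id. The explicit parameter type on the first
    // lambda is needed so the compiler can type the rest of the chain.
    static final Comparator<Map<String, String>> BY_TIME_THEN_USER =
        Comparator.comparing((Map<String, String> c) -> c.get("user_create_time"))
                  .thenComparing(c -> c.get("user_id"));

    static List<Map<String, String>> dedupe(List<Map<String, String>> coupons) {
        return coupons.stream()
            .collect(Collectors.collectingAndThen(
                Collectors.toCollection(() -> new TreeSet<>(BY_TIME_THEN_USER)),
                ArrayList::new));
    }

    // Builds a coupon with the three fields this demo needs.
    static Map<String, String> coupon(String id, String time, String userId) {
        Map<String, String> c = new HashMap<>();
        c.put("id", id);
        c.put("user_create_time", time);
        c.put("user_id", userId);
        return c;
    }

    public static void main(String[] args) {
        List<Map<String, String>> coupons = new ArrayList<>();
        coupons.add(coupon("C001", "2026-01-01 10:00:00", "U1"));
        coupons.add(coupon("C002", "2026-01-01 10:00:00", "U2")); // same time, other user: kept
        coupons.add(coupon("C003", "2026-01-01 10:00:00", "U1")); // same time and user: dropped
        coupons.add(coupon("C004", "2026-01-02 11:00:00", "U1"));

        // Prints C001, C002, C004, one per line.
        for (Map<String, String> c : dedupe(coupons)) {
            System.out.println(c.get("id"));
        }
    }
}
```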

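Summary

One Stream expression clears out the duplicates, which is great. The core pattern to remember, with a comparator that returns a real ordering such as compareTo() rather than just 0 or 1: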

list.stream()
  .collect(Collectors.collectingAndThen(
    Collectors.toCollection(() -> new TreeSet<>(your_comparator)),
    ArrayList::new
  ))

Much cleaner than hand-writing a for loop to deduplicate, and more functional too.