Elasticsearch Advanced Search, Aggregations, and Performance Profiling: Search, Aggregation, Profiler

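You walk into a supermarket and head for the drinks aisle, where 200 bottles are staring back at you. Your brain does two things at once. One is filtering: I only want something cold, sugar-free, and 600 ml or smaller. The other is ranking: among the ones that qualify, show me first the ones I am most likely to enjoy.

Filtering takes no thought; it is a yes/no question. Ranking takes judgment; it is an essay question.

Query context and filter context in Elasticsearch are exactly these two jobs. Mix them up and your searches can run 2 to 5 times slower. Get them straight and you will see how an e-commerce search page can pull what you want out of a million products in about 50 milliseconds.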

Query vs Filter: when to score and when not to

Start by creating an e-commerce test index. Every example below uses it:

PUT /products
{
  "mappings": {
    "properties": {
      "name":        { "type": "text", "analyzer": "standard" },
      "brand":       { "type": "keyword" },
      "category":    { "type": "keyword" },
      "price":       { "type": "float" },
      "rating":      { "type": "float" },
      "sold":        { "type": "integer" },
      "description": { "type": "text" },
      "tags":        { "type": "keyword" },
      "created_at":  { "type": "date", "format": "yyyy-MM-dd" },
      "in_stock":    { "type": "boolean" }
    }
  }
}

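Then index a handful of products: iPhone, Galaxy, AirPods, Sony headphones, MacBook, iPad.

Now take a typical search scenario: the user searches for "降噪耳機" (noise-cancelling headphones) while filtering on brand Apple, price 5000 to 10000, and in stock.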

GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "降噪耳機" } }
      ],
      "filter": [
        { "term":  { "brand": "Apple" } },
        { "range": { "price": { "gte": 5000, "lte": 10000 } } },
        { "term":  { "in_stock": true } }
      ]
    }
  }
}

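"降噪耳機" goes in must because it expresses the user's search intent: the better a product's description matches the search terms, the higher its relevance and the higher it ranks. That requires computing _score.

Brand, price, and stock status go in filter because they only need to answer "matches" or "does not match". Skipping the score buys two things: the clause is cheaper (no BM25 computation), and its result is eligible for caching. When a filter is reused often enough, ES can answer it from the node query cache instead of evaluating it again. Elasticsearch decides on its own which filters are worth caching, so not every filter ends up there.

This is one of the highest-ROI moves in search performance tuning. Moving exact-match conditions from must to filter can make a query 2 to 5 times faster, depending on the data and the query mix.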
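The must/filter split can be captured in a small helper. This is a minimal Python sketch that only builds the request body and never contacts a cluster; the function name build_bool_query and its parameters are made up for illustration.

```python
import json


def build_bool_query(text_field, text, exact_filters, range_filters=None):
    """Put the scored full-text clause in must and yes/no conditions in filter."""
    filter_clauses = [{"term": {field: value}} for field, value in exact_filters.items()]
    for field, bounds in (range_filters or {}).items():
        filter_clauses.append({"range": {field: bounds}})
    return {
        "query": {
            "bool": {
                "must": [{"match": {text_field: text}}],  # contributes to _score
                "filter": filter_clauses,                 # yes/no only, no score
            }
        }
    }


body = build_bool_query(
    "description",
    "降噪耳機",
    exact_filters={"brand": "Apple", "in_stock": True},
    range_filters={"price": {"gte": 5000, "lte": 10000}},
)
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Keeping the two kinds of condition in separate arguments makes it harder to drop an exact-match condition into must by accident.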

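should: a bonus, not a plain "or"

The role of should inside a bool query is often misunderstood. When the query also has a must or filter clause, should is not a required "or" condition; it is a bonus. (In a bool query that contains only should clauses, at least one of them has to match by default, which is where the "or" intuition comes from.)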

GET /products/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "description": "耳機" } }
      ],
      "should": [
        { "term": { "tags": { "value": "旗艦", "boost": 2 } } },
        { "range": { "rating": { "gte": 4.7, "boost": 1.5 } } }
      ],
      "filter": [
        { "term": { "in_stock": true } }
      ]
    }
  }
}

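Products tagged "旗艦" (flagship) get a higher _score, and so do products rated 4.7 or above. Products that meet neither condition are not excluded; they just rank lower.

boost: 2 does not mean the document's final score is multiplied by 2. It is a relative weight on that one clause, and the final score still comes out of the scoring formula across all matching clauses. Think of it as "this vote counts twice", not "the total doubles".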
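A toy calculation makes the "counts twice" idea concrete. The sketch below assumes only that a bool query adds up the scores of its matching must and should clauses, and that boost scales one clause's contribution. The numbers are invented for illustration; real clause scores come from BM25 and depend on your index.

```python
def bool_score(must_scores, should_clauses):
    """Toy model of a bool query score.

    must_scores: scores of the must clauses (all assumed to match).
    should_clauses: list of (matched, base_score, boost) tuples.
    """
    total = sum(must_scores)
    for matched, base_score, boost in should_clauses:
        if matched:
            total += base_score * boost  # boost weights this clause only
    return total


# Hypothetical numbers, chosen only to keep the arithmetic readable
must = [1.0]                                       # match on description
plain = bool_score(must, [(True, 0.5, 1.0)])       # tag clause, default boost
boosted = bool_score(must, [(True, 0.5, 2.0)])     # same clause with boost: 2
untagged = bool_score(must, [(False, 0.5, 2.0)])   # product without the tag

print(plain, boosted, untagged)  # 1.5 2.0 1.0
```

The boosted product moves from 1.5 to 2.0, not to 3.0, and the untagged product still gets its must score of 1.0 instead of being dropped.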

Highlighting, Sorting, and Pagination: a three-hit combo

Highlighting: show users why a result matched

GET /products/_search
{
  "query": { "match": { "description": "降噪晶片" } },
  "highlight": {
    "pre_tags": ["<em>"],
    "post_tags": ["</em>"],
    "fields": {
      "description": { "fragment_size": 100, "number_of_fragments": 3 }
    }
  }
}

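The returned highlight field looks roughly like this: "Apple <em>降噪</em>耳機，H2 <em>晶片</em>". Users see at a glance why the product came back. Exactly where the tags fall depends on how the field's analyzer splits the text into tokens.

fragment_size controls the length of each fragment in characters, and number_of_fragments caps how many fragments come back. For short text such as a product name, set number_of_fragments to 0 to return the whole field with highlights; fragment_size is then ignored.

With large volumes of data you can switch to "type": "fvh" (Fast Vector Highlighter), but the field's mapping must first set term_vector to with_positions_offsets.

Sorting: once you specify sort, _score disappears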

GET /products/_search
{
  "query": { "match_all": {} },
  "sort": [
    { "sold": "desc" },
    { "price": "asc" },
    "_score"
  ]
}

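Sort by units sold, then by price when sales tie. There is a trap here: once you specify a sort, ES skips computing _score by default to save work. If you also need relevance in the ordering, add "_score" to the sort array yourself, as in the request above. (With match_all every document gets the same score, so here it only demonstrates the syntax.)

Pagination: three approaches, and the wrong one will blow up

from + size is the most intuitive, but it has a hard ceiling: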

GET /products/_search
{
  "from": 20, "size": 10,
  "query": { "match_all": {} },
  "sort": [{ "price": "asc" }]
}

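By default, from + size cannot exceed 10,000 (the index.max_result_window setting). Page 1000 at 10 hits per page (from: 9990) already sits right at that limit, because ES has to collect the top 9990 + 10 = 10,000 hits on every shard and then merge and sort them on the coordinating node. The deeper you go, the slower it gets.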
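The arithmetic behind that cost is easy to check. This is a rough Python sketch, assuming the default window of 10,000 and a hypothetical index with 5 shards; deep_page_cost is an illustrative name, not an Elasticsearch API.

```python
MAX_RESULT_WINDOW = 10_000  # default value of index.max_result_window


def deep_page_cost(page, size, shards):
    """Return (from_, hits_per_shard, hits_merged) for a 1-based page number."""
    from_ = (page - 1) * size
    window = from_ + size          # each shard must collect this many top hits
    if window > MAX_RESULT_WINDOW:
        raise ValueError(f"from + size = {window} exceeds {MAX_RESULT_WINDOW}")
    return from_, window, window * shards  # coordinating node merges the last value


print(deep_page_cost(1, 10, shards=5))     # (0, 10, 50)
print(deep_page_cost(1000, 10, shards=5))  # (9990, 10000, 50000)
```

Under these assumptions, returning 10 hits on page 1000 means merging 50,000 candidates, and page 1001 is rejected outright.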

search_after is the right answer for deep pagination:

GET /products/_search
{
  "size": 2,
  "query": { "match_all": {} },
  "sort": [{ "price": "asc" }, { "_id": "asc" }],
  "search_after": [7490, "3"]
}

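It uses the sort values of the last hit on the previous page as a cursor, so the cost per page stays roughly constant no matter how deep you go. The price is that you can only move forward; you cannot jump to an arbitrary page. The sort must include a unique field as a tiebreaker. _id is the usual example, but recent Elasticsearch versions restrict sorting on _id by default, so in practice you may need a keyword copy of the ID, or a point in time (PIT) with its built-in _shard_doc tiebreaker.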
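The cursor handling can be sketched in a few lines of Python. The helper below assumes you already have the hits array of the previous response, where each hit carries a "sort" array because the request specified a sort. next_page_body is an illustrative name and the sample hits are invented.

```python
import copy


def next_page_body(body, hits):
    """Build the request for the following page, or None when hits is empty."""
    if not hits:
        return None
    nxt = copy.deepcopy(body)
    nxt.pop("from", None)                   # search_after replaces offset paging
    nxt["search_after"] = hits[-1]["sort"]  # sort values of the last hit
    return nxt


first = {
    "size": 2,
    "query": {"match_all": {}},
    "sort": [{"price": "asc"}, {"_id": "asc"}],
}

# Hypothetical hits, shaped like the "hits.hits" array of a search response
page_hits = [
    {"_id": "5", "sort": [5990, "5"]},
    {"_id": "3", "sort": [7490, "3"]},
]

second = next_page_body(first, page_hits)
print(second["search_after"])  # [7490, '3']
```

A client would call this in a loop, sending each returned body until the helper returns None.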

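The scroll API is only for one-off bulk exports. It holds a snapshot of the index as of the search time, which ties up resources. Never use it for real-time pagination.

Aggregation: real-time statistics without building a data warehouse

Back to the supermarket. You are standing at the checkout and suddenly wonder: "How many cold drinks sold today? What was the average price? Which brand sold best?" The supermarket's POS system would have to run a pile of SQL to answer. ES aggregations play the role of that POS system, only much faster.

Metric: numeric statistics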

GET /products/_search
{
  "size": 0,
  "aggs": {
    "avg_price":    { "avg":   { "field": "price" } },
    "max_price":    { "max":   { "field": "price" } },
    "total_sold":   { "sum":   { "field": "sold"  } },
    "price_stats":  { "stats": { "field": "price" } }
  }
}

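size: 0 means no search hits are returned, only the aggregation results. That saves the cost of fetching and transferring hits.

stats gives you count, min, max, avg, and sum in one go, which beats writing each one separately.

Bucket: grouped statistics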

GET /products/_search
{
  "size": 0,
  "aggs": {
    "by_category": {
      "terms": { "field": "category" },
      "aggs": {
        "avg_price":  { "avg": { "field": "price" } },
        "total_sold": { "sum": { "field": "sold" } }
      }
    }
  }
}

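Group by category first, then compute the average price and total units sold for each category. This is similar to SQL's GROUP BY category with AVG(price) and SUM(sold). Note that terms returns only the top 10 buckets by default; raise its size if you have more categories.

Nested aggregations: what an e-commerce dashboard really needs

The boss asks: "What are the average price and total units sold for each brand in each category?"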

GET /products/_search
{
  "size": 0,
  "aggs": {
    "by_brand": {
      "terms": { "field": "brand" },
      "aggs": {
        "by_category": {
          "terms": { "field": "category" },
          "aggs": {
            "avg_price":  { "avg": { "field": "price" } },
            "total_sold": { "sum": { "field": "sold" } }
          }
        },
        "brand_total_sold": { "sum": { "field": "sold" } }
      }
    }
  }
}

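Three levels of nesting: brand → category → metrics, plus total units sold at the brand level. In SQL this takes a subquery or a CTE; ES handles it in one request with nested aggs.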
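The response mirrors the request's nesting, which is awkward to feed straight into a dashboard table. Here is a minimal Python sketch that flattens it into one row per brand and category, assuming the aggregation names used in the request above. The sample response fragment is invented and shows only the "aggregations" part.

```python
def flatten_brand_category(aggregations):
    """Turn nested brand -> category buckets into flat rows."""
    rows = []
    for brand in aggregations["by_brand"]["buckets"]:
        for cat in brand["by_category"]["buckets"]:
            rows.append({
                "brand": brand["key"],
                "category": cat["key"],
                "products": cat["doc_count"],
                "avg_price": cat["avg_price"]["value"],
                "total_sold": cat["total_sold"]["value"],
                "brand_total_sold": brand["brand_total_sold"]["value"],
            })
    return rows


# Hypothetical response fragment with made-up numbers
sample = {
    "by_brand": {"buckets": [
        {
            "key": "Apple",
            "doc_count": 3,
            "brand_total_sold": {"value": 600.0},
            "by_category": {"buckets": [
                {"key": "phone", "doc_count": 1,
                 "avg_price": {"value": 30000.0}, "total_sold": {"value": 400.0}},
                {"key": "audio", "doc_count": 2,
                 "avg_price": {"value": 7000.0}, "total_sold": {"value": 200.0}},
            ]},
        },
    ]}
}

rows = flatten_brand_category(sample)
for row in rows:
    print(row)
```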

Range 和 Date Histogram

Bucket by price range:

GET /products/_search
{
  "size": 0,
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 10000, "key": "萬元以下" },
          { "from": 10000, "to": 30000, "key": "1-3萬" },
          { "from": 30000, "key": "3萬以上" }
        ]
      }
    }
  }
}
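One detail of the range aggregation is easy to miss: each range includes its from value and excludes its to value, so a product priced at exactly 10000 lands in "1-3萬", not "萬元以下". The Python sketch below replays that rule locally; bucket_for is an illustrative helper, not part of Elasticsearch.

```python
RANGES = [
    {"key": "萬元以下", "to": 10000},
    {"key": "1-3萬", "from": 10000, "to": 30000},
    {"key": "3萬以上", "from": 30000},
]


def bucket_for(price, ranges=RANGES):
    """Return the keys of all ranges containing price (from inclusive, to exclusive)."""
    keys = []
    for r in ranges:
        low = r.get("from", float("-inf"))   # a missing bound means open-ended
        high = r.get("to", float("inf"))
        if low <= price < high:
            keys.append(r["key"])
    return keys


for price in (9999.99, 10000, 29999, 30000):
    print(price, bucket_for(price))
```

The function returns a list because ranges are allowed to overlap, in which case one document counts in several buckets.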

Count new listings per month:

GET /products/_search
{
  "size": 0,
  "aggs": {
    "monthly": {
      "date_histogram": {
        "field": "created_at",
        "calendar_interval": "month",
        "format": "yyyy-MM"
      }
    }
  }
}

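Field Data vs Doc Values: the underlying cost of aggregations

Aggregations on keyword and numeric fields use doc_values, a columnar on-disk format that is built along with each segment and does not live on the JVM heap.

text fields have no doc_values. If you try to aggregate on a text field (for example, a terms agg on description), ES rejects the request by default; only if you enable fielddata: true on the field does it load field data into the JVM heap. On a text field with millions of documents, field data can easily eat several GB of heap.

So the iron rule of aggregations: aggregate only on keyword and numeric fields, never on text. If a field needs both full-text search and aggregation, define it as a multi-field in the mapping: text with a keyword sub-field.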
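As a sketch of the multi-field approach, the Python snippet below builds the mapping fragment for name plus the two request bodies that use it. The sub-field name "raw" is an arbitrary choice for this example, and the helper text_with_keyword is hypothetical.

```python
import json


def text_with_keyword(analyzer="standard", subfield="raw"):
    """Mapping for a text field that also exposes a keyword sub-field."""
    return {
        "type": "text",
        "analyzer": analyzer,
        "fields": {subfield: {"type": "keyword"}},  # backed by doc_values
    }


mapping = {"mappings": {"properties": {"name": text_with_keyword()}}}

# Full-text search goes to the analyzed text field
search_body = {"query": {"match": {"name": "iPhone"}}}

# Aggregation goes to the keyword sub-field, addressed as <field>.<subfield>
agg_body = {"size": 0, "aggs": {"by_name": {"terms": {"field": "name.raw"}}}}

print(json.dumps(mapping, indent=2))
```

The same document value is indexed twice: once analyzed for matching, once verbatim for bucketing and sorting.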

Search Profiler: when searches get slow

Add "profile": true to the top level of the request body:

GET /products/_search
{
  "profile": true,
  "query": {
    "bool": {
      "must": [{ "match": { "description": "降噪" } }],
      "filter": [{ "term": { "brand": "Apple" } }]
    }
  }
}
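The profile section of the response is a tree per shard, which gets long quickly. The following Python sketch walks that tree and lists the most expensive query components. It assumes the documented layout of profile.shards[].searches[].query[] with type, description, time_in_nanos, and optional children. The sample data is invented, and slowest_components is an illustrative name.

```python
def slowest_components(profile, top=3):
    """Return the top (time_in_nanos, shard_id, type, description) tuples."""
    found = []

    def walk(node, shard_id):
        found.append((node["time_in_nanos"], shard_id, node["type"], node["description"]))
        for child in node.get("children", []):
            walk(child, shard_id)

    for shard in profile["shards"]:
        for search in shard["searches"]:
            for node in search["query"]:
                walk(node, shard["id"])
    return sorted(found, key=lambda item: item[0], reverse=True)[:top]


# Hypothetical profile fragment with made-up timings
sample_profile = {
    "shards": [{
        "id": "[node-1][products][0]",
        "searches": [{
            "query": [{
                "type": "BooleanQuery",
                "description": "+description:降 #brand:Apple",
                "time_in_nanos": 900000,
                "children": [
                    {"type": "TermQuery", "description": "description:降",
                     "time_in_nanos": 600000},
                    {"type": "TermQuery", "description": "brand:Apple",
                     "time_in_nanos": 200000},
                ],
            }],
        }],
    }]
}

for nanos, shard_id, kind, desc in slowest_components(sample_profile, top=2):
    print(f"{nanos / 1e6:.2f} ms  {kind}  {desc}  {shard_id}")
```

A parent's time includes the time of its children, so the top entry is usually the outer bool query; the interesting rows are the leaf clauses right below it.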