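A good Elasticsearch engineer needs to know the official Elastic documentation and its example templates well, because the Elasticsearch API is complex and far less regular than SQL, which makes it harder to use.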
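Aggregations
A search engine retrieves matching documents; aggregations compute new, summarized results on top of the search results. Keep the two apart.
Purpose:
- A search engine answers questions such as:
  - Which orders have an address in Shanghai?
  - Which orders were created in the last day but have not been paid?
- Aggregations answer questions such as:
  - How many orders were completed on each day of the last week?
  - What was the average order amount on each day of the last month?
  - Which 5 products sold best over the last six months?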
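Definition:
- Aggregations are the real-time statistical analysis feature that es provides alongside search; they can replace some OLAP software
- Feature-rich: Bucket, Metric, Pipeline, and other aggregation types cover most analysis needs
- Real-time: results are computed at request time and returned in the search response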
Usage:
Aggregations are part of the search API, so they are sent to the same _search endpoint as an ordinary search. The general syntax is:
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "<aggregation_name>": {
          "<aggregation_type>": {
            <aggregation_body>
          }
          [,"aggs": { [<sub_aggregation>]+ } ]?
        }
        [,"<aggregation_name_2>": { ... } ]*
      }
    }
Example (count people per job; "size": 0 omits the search hits so that only the aggregation results are returned):
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "people_per_job": {
          "terms": { "field": "job.keyword" }
        }
      }
    }
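Categories
- es aggregations mainly fall into the following types:
  - Bucket: groups documents into buckets, similar to GROUP BY in SQL
  - Metric: computes numeric metrics such as the maximum, minimum, or average
  - Pipeline: runs a further analysis on the output of other aggregations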
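Metric aggregations
- There are two main kinds:
  - Single-value analysis, which outputs one result
    - min, max, avg, sum
    - cardinality
  - Multi-value analysis, which outputs several results
    - stats, extended stats
    - percentiles, percentile ranks
    - top hits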
min, max, avg, sum
Return the minimum, maximum, average, and sum of a numeric field. The first request computes only the minimum; the second computes all four in a single request.
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "min_age": { "min": { "field": "age" } }
      }
    }

    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "min_age": { "min": { "field": "age" } },
        "max_age": { "max": { "field": "age" } },
        "avg_age": { "avg": { "field": "age" } },
        "sum_age": { "sum": { "field": "age" } }
      }
    }
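cardinality
Similar to COUNT(DISTINCT ...) in SQL: it returns the number of distinct values of a field. The count is approximate.
For example, count how many distinct job values exist:
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "count_of_job": {
          "cardinality": { "field": "job.keyword" }
        }
      }
    }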
Stats
Returns a set of statistics for a numeric field: min, max, avg, sum, and count.
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "stats_age": {
          "stats": { "field": "age" }
        }
      }
    }
Extended Stats
An extension of stats that adds further statistics such as the variance and the standard deviation.
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "stats_age": {
          "extended_stats": { "field": "age" }
        }
      }
    }
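To make the relationship between stats and extended stats concrete, here is a minimal pure-Python sketch that computes the same kinds of fields on a small list. The sample values and the function names `stats` and `extended_stats` are hypothetical, and the sketch assumes the reported variance is the population variance; it illustrates the arithmetic only and does not call Elasticsearch.

```python
import math

# Hypothetical sample values standing in for the "age" field.
ages = [20, 30, 40, 30]

def stats(values):
    """Compute the five fields a stats aggregation reports."""
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }

def extended_stats(values):
    """Add variance and standard deviation on top of stats.

    Assumption: population variance (divide by n, not n - 1).
    """
    result = stats(values)
    mean = result["avg"]
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    result["variance"] = variance
    result["std_deviation"] = math.sqrt(variance)
    return result

summary = extended_stats(ages)
print(summary)
```

Because extended stats is a superset of stats, one request for extended stats can stand in for both when the extra fields are needed.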
Percentiles
Returns percentile values for a numeric field. When no percents are specified, a default set of percentiles is returned.
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "per_salary": {
          "percentiles": { "field": "salary" }
        }
      }
    }
Percentile Ranks
The inverse of percentiles: for each given value, returns the percentage of observed values that fall at or below it.
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "per_salary": {
          "percentile_ranks": {
            "field": "salary",
            "values": [ 11000, 30000 ]
          }
        }
      }
    }
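The idea behind percentile ranks can be shown with an exact calculation on a small list. This is only a sketch: the salaries and the helper `percentile_rank` are hypothetical, and Elasticsearch computes these values with an approximation algorithm, so real responses on large data sets may differ slightly from an exact count like this one.

```python
# Hypothetical sample values standing in for the "salary" field.
salaries = [5000, 9000, 12000, 20000, 35000]

def percentile_rank(values, threshold):
    """Percentage of values that are at or below the threshold."""
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)

# Same thresholds as the "values" array in the request above.
ranks = {t: percentile_rank(salaries, t) for t in (11000, 30000)}
print(ranks)
```

Here two of the five salaries are at or below 11000 and four are at or below 30000, which is the question a percentile rank answers.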
Top Hits
Typically used after bucketing to fetch the top matching documents inside each bucket, that is, the detail records. The example returns up to 10 employees per job, oldest first.
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "jobs": {
          "terms": { "field": "job.keyword", "size": 10 },
          "aggs": {
            "top_employee": {
              "top_hits": {
                "size": 10,
                "sort": [
                  { "age": { "order": "desc" } }
                ]
              }
            }
          }
        }
      }
    }
Bucket aggregations
- Bucket aggregations differ in their bucketing strategy. Common ones are:
  - Terms
  - Range
  - Date Range
  - Histogram
  - Date Histogram
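Terms
The simplest strategy: one bucket per term. For a text field, the buckets are built from the analyzed tokens (this requires fielddata to be enabled on the field), which is why the example uses the keyword sub-field.
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "jobs": {
          "terms": { "field": "job.keyword", "size": 5 }
        }
      }
    }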
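Conceptually, a terms aggregation is a group-and-count over one field, truncated to `size` buckets. The sketch below mimics that in plain Python; the documents and the function `terms_agg` are hypothetical, and the ordering (document count descending, ties broken by key ascending) is an assumption about the default behaviour rather than something stated in the text above.

```python
from collections import Counter

# Hypothetical documents; only the bucketed field is shown.
docs = [
    {"job": "java engineer"},
    {"job": "ui designer"},
    {"job": "java engineer"},
    {"job": "dba"},
    {"job": "ui designer"},
    {"job": "java engineer"},
]

def terms_agg(docs, field, size):
    """Group documents by the field value and keep the `size` largest buckets."""
    counts = Counter(d[field] for d in docs if field in d)
    # Assumed default order: doc_count descending, then key ascending.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered[:size]]

buckets = terms_agg(docs, "job", size=2)
print(buckets)
```

With `size=2` the "dba" bucket is dropped, which mirrors how the `size` parameter limits the number of buckets in the response.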
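Range / Date Range
Defines the buckets by explicit ranges of values; each range includes its from value and excludes its to value. The example below buckets a date field, so it uses the date_range type, which accepts a format for the boundary values.
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "date_range": {
          "date_range": {
            "field": "birth",
            "format": "yyyy",
            "ranges": [
              { "from": "1980", "to": "1990" },
              { "from": "1990", "to": "2000" }
            ]
          }
        }
      }
    }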
Histogram
Splits the data into buckets of a fixed interval. extended_bounds forces the bucket range to cover min through max even where no documents fall.
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "salary_hist": {
          "histogram": {
            "field": "salary",
            "interval": 5000,
            "extended_bounds": { "min": 0, "max": 40000 }
          }
        }
      }
    }
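The fixed-interval bucketing can be written down in a few lines: each value is rounded down to the nearest multiple of the interval, and that multiple becomes the bucket key. The following sketch assumes an offset of 0 and emulates `extended_bounds` by creating empty buckets up front; the salaries and the function `histogram` are hypothetical.

```python
import math

def histogram(values, interval, min_bound, max_bound):
    """Count values per fixed-width bucket, keyed by the bucket's lower edge."""
    counts = {}
    # Emulate extended_bounds: create every bucket between the bounds, even if empty.
    key = math.floor(min_bound / interval) * interval
    last = math.floor(max_bound / interval) * interval
    while key <= last:
        counts[key] = 0
        key += interval
    for v in values:
        # Round down to the nearest multiple of the interval.
        bucket_key = math.floor(v / interval) * interval
        counts[bucket_key] = counts.get(bucket_key, 0) + 1
    return dict(sorted(counts.items()))

# Hypothetical sample values standing in for the "salary" field.
salaries = [5000, 9000, 12000, 20000, 35000]
hist = histogram(salaries, interval=5000, min_bound=0, max_bound=40000)
print(hist)
```

Both 5000 and 9000 land in the bucket keyed 5000, while buckets such as 0 and 40000 exist only because of the bounds.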
Date Histogram
A histogram over dates. It is one of the most frequently used aggregations in time-series analysis.
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "by_year": {
          "date_histogram": {
            "field": "birth",
            "calendar_interval": "year",
            "format": "yyyy"
          }
        }
      }
    }
Combining Bucket and Metric
A bucket aggregation can carry sub-aggregations that analyze each bucket further. A sub-aggregation can be either a Bucket or a Metric aggregation, which is what makes es aggregations so powerful.
Bucketing inside buckets
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "jobs": {
          "terms": { "field": "job.keyword", "size": 10 },
          "aggs": {
            "age_range": {
              "range": {
                "field": "age",
                "ranges": [
                  { "to": 20 },
                  { "from": 20, "to": 30 }
                ]
              }
            }
          }
        }
      }
    }
Metrics inside buckets
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "jobs": {
          "terms": { "field": "job.keyword", "size": 10 },
          "aggs": {
            "salary": {
              "stats": { "field": "salary" }
            }
          }
        }
      }
    }
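A metric sub-aggregation is evaluated once per bucket of its parent. The sketch below shows that shape in plain Python, grouping by job and then computing stats of salary inside each group. The documents and the function `stats_per_bucket` are hypothetical and only illustrate the nesting.

```python
from collections import defaultdict

# Hypothetical documents with the two fields used by the request above.
docs = [
    {"job": "java engineer", "salary": 12000},
    {"job": "java engineer", "salary": 18000},
    {"job": "ui designer", "salary": 9000},
]

def stats_per_bucket(docs, bucket_field, metric_field):
    """Bucket documents by one field, then compute stats of another per bucket."""
    groups = defaultdict(list)
    for d in docs:
        groups[d[bucket_field]].append(d[metric_field])
    return {
        key: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
            "sum": sum(values),
        }
        for key, values in groups.items()
    }

result = stats_per_bucket(docs, "job", "salary")
print(result)
```

Replacing the inner stats with another grouping step would give the "bucketing inside buckets" variant instead.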
Pipeline
- Pipeline aggregations run on the output of other aggregations and can be chained. They answer questions such as "what is the average of the monthly sales totals?", as in the request below:
    POST order/_search
    {
      "size": 0,
      "aggs": {
        "sales_per_month": {
          "date_histogram": { "field": "date", "calendar_interval": "month" },
          "aggs": {
            "sales": { "sum": { "field": "price" } }
          }
        },
        "avg_monthly_sales": {
          "avg_bucket": { "buckets_path": "sales_per_month>sales" }
        }
      }
    }
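- The result of a pipeline aggregation is written into the existing aggregation output. Depending on where it is placed, there are two kinds:
  - Parent: the result is embedded inside the buckets of an existing aggregation
  - Sibling: the result sits at the same level as the existing aggregation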
Sibling
Min Bucket
Finds the bucket with the smallest value and returns its key and value.
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "jobs": {
          "terms": { "field": "job.keyword", "size": 10 },
          "aggs": {
            "avg_salary": { "avg": { "field": "salary" } }
          }
        },
        "min_salary_by_job": {
          "min_bucket": { "buckets_path": "jobs>avg_salary" }
        }
      }
    }
Max Bucket
Finds the bucket with the largest value and returns its key and value.
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "jobs": {
          "terms": { "field": "job.keyword", "size": 10 },
          "aggs": {
            "avg_salary": { "avg": { "field": "salary" } }
          }
        },
        "max_salary_by_job": {
          "max_bucket": { "buckets_path": "jobs>avg_salary" }
        }
      }
    }
Avg Bucket
Computes the average of the values of all buckets.
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "jobs": {
          "terms": { "field": "job.keyword", "size": 10 },
          "aggs": {
            "avg_salary": { "avg": { "field": "salary" } }
          }
        },
        "avg_salary_by_job": {
          "avg_bucket": { "buckets_path": "jobs>avg_salary" }
        }
      }
    }
Sum Bucket
Computes the sum of the values of all buckets.
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "jobs": {
          "terms": { "field": "job.keyword", "size": 10 },
          "aggs": {
            "avg_salary": { "avg": { "field": "salary" } }
          }
        },
        "sum_salary_by_job": {
          "sum_bucket": { "buckets_path": "jobs>avg_salary" }
        }
      }
    }
Stats Bucket
Computes the stats (min, max, avg, sum, count) over the values of all buckets.
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "jobs": {
          "terms": { "field": "job.keyword", "size": 10 },
          "aggs": {
            "avg_salary": { "avg": { "field": "salary" } }
          }
        },
        "stats_salary_by_job": {
          "stats_bucket": { "buckets_path": "jobs>avg_salary" }
        }
      }
    }
Percentiles Bucket
Computes percentiles over the values of all buckets.
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "jobs": {
          "terms": { "field": "job.keyword", "size": 10 },
          "aggs": {
            "avg_salary": { "avg": { "field": "salary" } }
          }
        },
        "percentiles_salary_by_job": {
          "percentiles_bucket": { "buckets_path": "jobs>avg_salary" }
        }
      }
    }
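All of the sibling pipelines above take the per-bucket values addressed by `buckets_path` and reduce them to a single result that sits next to the bucket aggregation. A rough pure-Python sketch of that reduction follows. The per-job averages are hypothetical inputs, and the helper names simply echo the aggregation types; none of this calls Elasticsearch.

```python
# Hypothetical output of the "jobs>avg_salary" path: one value per bucket key.
avg_salary_by_job = {
    "java engineer": 15000.0,
    "ui designer": 9000.0,
    "dba": 12000.0,
}

def min_bucket(buckets):
    """Smallest bucket value plus the key(s) of the bucket(s) holding it."""
    value = min(buckets.values())
    return {"value": value, "keys": sorted(k for k, v in buckets.items() if v == value)}

def max_bucket(buckets):
    """Largest bucket value plus the key(s) of the bucket(s) holding it."""
    value = max(buckets.values())
    return {"value": value, "keys": sorted(k for k, v in buckets.items() if v == value)}

def avg_bucket(buckets):
    """Average of the bucket values (an average of averages here)."""
    return {"value": sum(buckets.values()) / len(buckets)}

def sum_bucket(buckets):
    """Sum of the bucket values."""
    return {"value": sum(buckets.values())}

sibling_results = {
    "min_salary_by_job": min_bucket(avg_salary_by_job),
    "max_salary_by_job": max_bucket(avg_salary_by_job),
    "avg_salary_by_job": avg_bucket(avg_salary_by_job),
    "sum_salary_by_job": sum_bucket(avg_salary_by_job),
}
print(sibling_results)
```

Note that `avg_bucket` over per-job averages weights every job equally, so it generally differs from the average salary over all documents.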
Parent
Derivative
Computes the derivative of the bucket values, that is, the change from one bucket to the next.
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "birth": {
          "date_histogram": {
            "field": "birth",
            "calendar_interval": "year",
            "min_doc_count": 0
          },
          "aggs": {
            "avg_salary": { "avg": { "field": "salary" } },
            "derivative_avg_salary": {
              "derivative": { "buckets_path": "avg_salary" }
            }
          }
        }
      }
    }
Moving Average
Computes the moving average of the bucket values. Note that moving_avg has been deprecated in favor of moving_fn and is no longer available in Elasticsearch 8.x.
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "birth": {
          "date_histogram": {
            "field": "birth",
            "calendar_interval": "year",
            "min_doc_count": 0
          },
          "aggs": {
            "avg_salary": { "avg": { "field": "salary" } },
            "mavg_salary": {
              "moving_avg": { "buckets_path": "avg_salary" }
            }
          }
        }
      }
    }
Cumulative Sum
Computes the running total of the bucket values.
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "birth": {
          "date_histogram": {
            "field": "birth",
            "calendar_interval": "year",
            "min_doc_count": 0
          },
          "aggs": {
            "avg_salary": { "avg": { "field": "salary" } },
            "cumulative_salary": {
              "cumulative_sum": { "buckets_path": "avg_salary" }
            }
          }
        }
      }
    }
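Parent pipelines walk the ordered buckets of their parent histogram and attach one extra value to each bucket. The sketch below shows the arithmetic for derivative and cumulative sum on a short series. The yearly averages are hypothetical, and the sketch assumes every bucket has a value; how buckets with missing values are handled is left out.

```python
from itertools import accumulate

# Hypothetical (year, avg_salary) pairs, already in histogram order.
buckets = [("1990", 10000.0), ("1991", 13000.0), ("1992", 12000.0)]
values = [value for _, value in buckets]

def derivative(values):
    """Difference to the previous bucket; the first bucket has none."""
    return [None] + [cur - prev for prev, cur in zip(values, values[1:])]

def cumulative_sum(values):
    """Running total up to and including each bucket."""
    return list(accumulate(values))

derivatives = derivative(values)
running_totals = cumulative_sum(values)

for (year, value), d, total in zip(buckets, derivatives, running_totals):
    print(year, value, d, total)
```

Each output row corresponds to one bucket, which is why these results are embedded inside the parent aggregation's buckets rather than placed beside it.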
Scope
- By default an es aggregation runs over the result set of the query. In the request below, the jobs aggregation only sees documents that match the query. The scope can be changed with:
  - filter
  - post_filter
  - global
    GET test_search_index/_search
    {
      "size": 0,
      "query": {
        "match": { "username": "alfred" }
      },
      "aggs": {
        "jobs": {
          "terms": { "field": "job.keyword", "size": 10 }
        }
      }
    }
filter
Sets a filter for a single aggregation, which changes that aggregation's scope without touching the overall query.
Note that the filter is applied to the documents already matched by the query; it is not part of the query conditions.
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "jobs_salary_small": {
          "filter": {
            "range": { "salary": { "to": 10000 } }
          },
          "aggs": {
            "jobs": { "terms": { "field": "job.keyword" } }
          }
        },
        "jobs": { "terms": { "field": "job.keyword" } }
      }
    }
post_filter
Filters the returned documents, but takes effect only after the aggregations have been computed, so the aggregation results are not affected.
    GET test_search_index/_search
    {
      "aggs": {
        "jobs": { "terms": { "field": "job.keyword" } }
      },
      "post_filter": {
        "match": { "job.keyword": "java engineer" }
      }
    }
global
Ignores the query and runs the aggregation over all documents. In the example, java_avg_salary covers only the documents matched by the query, while all>avg_salary covers every document.
    GET test_search_index/_search
    {
      "query": {
        "match": { "job.keyword": "java engineer" }
      },
      "aggs": {
        "java_avg_salary": { "avg": { "field": "salary" } },
        "all": {
          "global": {},
          "aggs": {
            "avg_salary": { "avg": { "field": "salary" } }
          }
        }
      }
    }
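The four scopes are easier to compare side by side. The following pure-Python sketch models a search as a query step, an aggregation step, and a post_filter step, so that it is visible which documents each aggregation sees. The documents, the `search` function, and the result names are all hypothetical; this is a mental model, not how Elasticsearch is implemented.

```python
# Hypothetical in-memory documents; field names follow the requests above.
docs = [
    {"username": "alfred", "job": "java engineer", "salary": 12000},
    {"username": "alfred", "job": "ui designer", "salary": 8000},
    {"username": "bob", "job": "java engineer", "salary": 20000},
]

def avg(values):
    return sum(values) / len(values)

def search(docs, query, agg_filter, post_filter):
    matched = [d for d in docs if query(d)]
    aggs = {
        # Default scope: every document matched by the query.
        "query_avg_salary": avg([d["salary"] for d in matched]),
        # filter aggregation: query matches narrowed for this aggregation only.
        "filtered_doc_count": sum(1 for d in matched if agg_filter(d)),
        # global aggregation: all documents, ignoring the query.
        "global_avg_salary": avg([d["salary"] for d in docs]),
    }
    # post_filter trims the returned hits after the aggregations are computed.
    hits = [d for d in matched if post_filter(d)]
    return hits, aggs

hits, aggs = search(
    docs,
    query=lambda d: d["username"] == "alfred",
    agg_filter=lambda d: d["salary"] <= 10000,
    post_filter=lambda d: d["job"] == "java engineer",
)
print(hits)
print(aggs)
```

The post_filter removes one hit from the response, yet `query_avg_salary` still reflects both of alfred's documents, which is the distinction between post_filter and a filter placed in the query.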
Sorting
- Buckets can be ordered by built-in keys, for example:
  - _count: the number of documents in the bucket
  - _key: the bucket key
- Buckets can also be ordered by the value of a sub-aggregation, referenced by name and metric (avg_salary.sum) or by path (age>avg_age), as in the second and third requests.
    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "jobs": {
          "terms": {
            "field": "job.keyword",
            "size": 10,
            "order": [
              { "_count": "asc" },
              { "_key": "desc" }
            ]
          }
        }
      }
    }

    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "jobs": {
          "terms": {
            "field": "job.keyword",
            "size": 10,
            "order": [
              { "avg_salary.sum": "desc" }
            ]
          },
          "aggs": {
            "avg_salary": { "stats": { "field": "salary" } }
          }
        }
      }
    }

    GET test_search_index/_search
    {
      "size": 0,
      "aggs": {
        "salary_hist": {
          "histogram": {
            "field": "salary",
            "interval": 5000,
            "order": { "age>avg_age": "desc" }
          },
          "aggs": {
            "age": {
              "filter": {
                "range": { "age": { "gte": 10 } }
              },
              "aggs": {
                "avg_age": { "avg": { "field": "age" } }
              }
            }
          }
        }
      }
    }
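An `order` array with several entries behaves like a multi-key sort: the first entry is the primary criterion and later entries break ties. The sketch below reproduces the `_count` ascending, `_key` descending order from the first request on a hypothetical list of buckets, using two stable sorts.

```python
# Hypothetical buckets as a terms aggregation might return them.
buckets = [
    {"key": "dba", "doc_count": 1},
    {"key": "java engineer", "doc_count": 3},
    {"key": "ui designer", "doc_count": 1},
]

# Sort by the secondary criterion first (_key descending) ...
ordered = sorted(buckets, key=lambda b: b["key"], reverse=True)
# ... then by the primary criterion (_count ascending); Python's sort is
# stable, so buckets with equal counts keep their _key-descending order.
ordered = sorted(ordered, key=lambda b: b["doc_count"])

ordered_keys = [b["key"] for b in ordered]
print(ordered_keys)
```

The two single-document buckets come first, with "ui designer" ahead of "dba" because ties are resolved by key in descending order.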