所有文章是按github的markdown格式书写的,如此页面显示不正常,请跳转到github版

PostgreSQL求解最短路径

有些业务中需要求解最短路径,PostgreSQL中有个pgrouting插件内置了和计算最短路径相关的算法。 下面看下示例

表定义

postgres=# \d testpath
              Table "public.testpath"
 Column |  Type   | Collation | Nullable | Default 
--------+---------+-----------+----------+---------
 id     | integer |           |          | 
 source | integer |           |          | 
 target | integer |           |          | 
 cost   | integer |           |          | 

这是张业务表,每一行代表一条边及其代价,总共1000多条记录(实际对应的是按业务条件筛选后的结果集大小)。其余业务相关的属性全部隐去。

求解2点间的最短路径

postgres=# SELECT * FROM pgr_dijkstra(
  'SELECT id,source,target,cost FROM testpath',
  10524, 10379, directed:=true);
 seq | path_seq | node  |  edge   | cost | agg_cost 
-----+----------+-------+---------+------+----------
   1 |        1 | 10524 | 1971852 |    1 |        0
   2 |        2 |  7952 |   32256 |    1 |        1
   3 |        3 |  7622 |   76615 |    2 |        2
   4 |        4 | 44964 |   76616 |    1 |        4
   5 |        5 |  7861 |   19582 |    1 |        5
   6 |        6 |  7629 |   14948 |    2 |        6
   7 |        7 | 17135 |   14949 |    1 |        8
   8 |        8 | 10379 |      -1 |    0 |        9
(8 rows)

Time: 22.979 ms

求解2点间最短的N条路径

postgres=# SELECT * FROM pgr_ksp(
  'SELECT id,source,target,cost FROM testpath',
  10524, 10379, 1000,directed:=true);
 seq | path_id | path_seq | node  |  edge   | cost | agg_cost 
-----+---------+----------+-------+---------+------+----------
   1 |       1 |        1 | 10524 | 1971852 |    1 |        0
   2 |       1 |        2 |  7952 |   32256 |    1 |        1
   3 |       1 |        3 |  7622 |   54740 |    2 |        2
   4 |       1 |        4 | 35389 |   54741 |    1 |        4
   5 |       1 |        5 |  7861 |   19582 |    1 |        5
   6 |       1 |        6 |  7629 |   14948 |    2 |        6
   7 |       1 |        7 | 17135 |   14949 |    1 |        8
   8 |       1 |        8 | 10379 |      -1 |    0 |        9
...(略)
 100 |      12 |        4 | 53179 |   95137 |    1 |        4
 101 |      12 |        5 |  7625 |   90682 |    2 |        5
 102 |      12 |        6 | 51211 |   90683 |    1 |        7
 103 |      12 |        7 |  7861 |   19582 |    1 |        8
 104 |      12 |        8 |  7629 | 1173911 |    2 |        9
 105 |      12 |        9 | 59579 | 1173917 |    1 |       11
 106 |      12 |       10 | 10379 |      -1 |    0 |       12
(106 rows)

Time: 201.223 ms

纯SQL求解最短路径

前面的最短路径是通过pgrouting插件计算的,能不能单纯利用PG自身的SQL完成最短路径的计算呢?

真实的业务场景下是可以限制路径的长度的,比如,如果我们舍弃所有边数大于7的路径。 那么完全可以用简单的深度遍历计算最短路径。计算速度还提高了5倍。

postgres=# WITH RECURSIVE line AS(
SELECT source,target,cost from testpath
),
path(fullpath,pathseq,node,total_cost) AS (
    select ARRAY[10524],1,10524,0
  UNION ALL
    select array_append(fullpath,target),pathseq+1,target,total_cost+cost from path join line on(source=node) where node!=10379 and pathseq<=8
)
SELECT * FROM path where fullpath @> ARRAY[10379] order by total_cost limit 1;
                   fullpath                    | pathseq | node  | total_cost 
-----------------------------------------------+---------+-------+------------
 {10524,7952,7622,80465,7861,7629,17135,10379} |       8 | 10379 |          9
(1 row)

Time: 4.334 ms

如果每条边的cost相同,可以去掉上面的order by total_cost,在大数据集上性能会有很大的提升。

纯SQL求解最短的N条路径

沿用前面的SQL,只是修改了一下limit值,相比pgrouting的pgr_ksp函数性能提升的更多。性能提升了50倍。

postgres=# WITH RECURSIVE line AS(
SELECT source,target,cost from testpath
),
path(fullpath,pathseq,node,total_cost) AS (
    select ARRAY[10524],1,10524,0
  UNION ALL
    select array_append(fullpath,target),pathseq+1,target,total_cost+cost from path join line on(source=node) where node!=10379 and pathseq<=8
)
SELECT * FROM path where fullpath @> ARRAY[10379] order by total_cost limit 1000;
                      fullpath                      | pathseq | node  | total_cost 
----------------------------------------------------+---------+-------+------------
 {10524,7952,7622,80465,7861,7629,17135,10379}      |       8 | 10379 |          9
 {10524,7952,7622,35389,7861,7629,17135,10379}      |       8 | 10379 |          9
 {10524,7952,7622,44964,7861,7629,17135,10379}      |       8 | 10379 |          9
 {10524,7952,7622,80465,7861,7629,59579,10379}      |       8 | 10379 |          9
 {10524,7952,7622,35389,7861,7629,59579,10379}      |       8 | 10379 |          9
 {10524,7952,7622,44964,7861,7629,59579,10379}      |       8 | 10379 |          9
 {10524,7952,7622,53179,7625,7861,7629,17135,10379} |       9 | 10379 |         10
 {10524,7952,7622,53179,7625,7861,7629,59579,10379} |       9 | 10379 |         10
(8 rows)

Time: 4.425 ms

下面看下执行计划

postgres=# explain analyze
WITH RECURSIVE line AS(
SELECT source,target,cost from testpath
),
path(fullpath,pathseq,node,total_cost) AS (
    select ARRAY[10524],1,10524,0
  UNION ALL
    select array_append(fullpath,target),pathseq+1,target,total_cost+cost from path join line on(source=node) where node!=10379 and pathseq<=8
)
SELECT * FROM path where fullpath @> ARRAY[10379] order by total_cost limit 1000;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=277.18..277.18 rows=1 width=44) (actual time=9.992..10.001 rows=8 loops=1)
   CTE line
     ->  Seq Scan on testpath  (cost=0.00..16.45 rows=1045 width=12) (actual time=0.017..0.624 rows=1045 loops=1)
   CTE path
     ->  Recursive Union  (cost=0.00..257.09 rows=161 width=44) (actual time=0.003..9.889 rows=42 loops=1)
           ->  Result  (cost=0.00..0.01 rows=1 width=44) (actual time=0.001..0.002 rows=1 loops=1)
           ->  Hash Join  (cost=0.29..25.39 rows=16 width=44) (actual time=0.451..1.090 rows=5 loops=9)
                 Hash Cond: (line.source = path_1.node)
                 ->  CTE Scan on line  (cost=0.00..20.90 rows=1045 width=12) (actual time=0.003..0.678 rows=1045 loops=8)
                 ->  Hash  (cost=0.25..0.25 rows=3 width=44) (actual time=0.007..0.007 rows=3 loops=9)
                       Buckets: 1024  Batches: 1  Memory Usage: 8kB
                       ->  WorkTable Scan on path path_1  (cost=0.00..0.25 rows=3 width=44) (actual time=0.001..0.004 rows=3 loops=9)
                             Filter: ((node <> 10379) AND (pathseq <= 8))
                             Rows Removed by Filter: 1
   ->  Sort  (cost=3.63..3.64 rows=1 width=44) (actual time=9.991..9.994 rows=8 loops=1)
         Sort Key: path.total_cost
         Sort Method: quicksort  Memory: 26kB
         ->  CTE Scan on path  (cost=0.00..3.62 rows=1 width=44) (actual time=7.851..9.979 rows=8 loops=1)
               Filter: (fullpath @> '{10379}'::integer[])
               Rows Removed by Filter: 34
 Planning time: 0.234 ms
 Execution time: 10.111 ms
(22 rows)

Time: 10.973 ms

大数据集的对比

以上测试的数据集比较小,只有1000多个边,如果在100w的数据集下,结果如何呢?

计算最短路径 时间(秒)
pgr_dijkstra() 52秒
递归CTE(最大深度2边) 2秒
递归CTE(最大深度3边) 5秒
递归CTE(最大深度4边) 105秒
递归CTE(最大深度5边) 算不出来,放弃
递归CTE(最大深度7边,假设每个边cost相等,不排序,结果最短路径为3个边) 1.6秒

小结

简单的深度遍历求解可以适用于小数据集或深度比较小的场景。 在满足这些条件的场景下,效果还是不错的。

April 26, 2020