PostgreSQL - aggregate rows with similar values (+/- 1) in a dynamic way

2 min read 06-10-2024
PostgreSQL - aggregate rows with similar values (+/- 1) in a dynamic way


Grouping Similar Values in PostgreSQL: A Dynamic Approach

Problem: You have a table with data where you need to group rows with similar values, but the range of similarity is flexible. For example, you might want to group values that are within +/- 1 of each other.

Rephrased: Imagine you have a table filled with product prices. You want to group prices that are very similar, but the "very similar" threshold can change based on your needs. This article will show you how to achieve this dynamic grouping in PostgreSQL.

Scenario:

Let's say you have a table named products with the following structure:

CREATE TABLE products (
  id SERIAL PRIMARY KEY,
  name VARCHAR(255),
  price DECIMAL(10,2)
);

And some sample data:

INSERT INTO products (name, price) VALUES 
('Product A', 10.99),
('Product B', 11.00),
('Product C', 11.99),
('Product D', 12.00),
('Product E', 13.00),
('Product F', 14.00),
('Product G', 15.00);

Original Code (Inefficient):

A basic approach would be to manually define thresholds and use CASE statements:

SELECT 
  CASE 
    WHEN price BETWEEN 10 AND 11 THEN 'Group 1'
    WHEN price BETWEEN 12 AND 13 THEN 'Group 2'
    ELSE 'Other'
  END AS price_group,
  COUNT(*) AS count
FROM products
GROUP BY price_group;

This solution is inflexible and requires manual adjustments for every new range.

Dynamic Grouping with WITH RECURSIVE:

Here's how you can achieve dynamic grouping using a WITH RECURSIVE CTE:

WITH RECURSIVE price_groups AS (
    SELECT 
        id, name, price, price AS group_start, price AS group_end
    FROM products
    UNION ALL
    SELECT 
        p.id, p.name, p.price, pg.group_start, pg.group_end
    FROM products p
    JOIN price_groups pg ON p.price BETWEEN pg.group_start - 1 AND pg.group_end + 1
    WHERE p.id > pg.id 
)
SELECT 
    group_start, 
    group_end, 
    COUNT(*) AS count
FROM price_groups
GROUP BY group_start, group_end
ORDER BY group_start;

Explanation:

  1. WITH RECURSIVE price_groups: This creates a recursive common table expression (CTE) named price_groups.
  2. Initial Case: The first part of the UNION ALL statement selects all rows from the products table and sets group_start and group_end equal to the price. This initializes the grouping process.
  3. Recursive Case: The second part of the UNION ALL statement recursively joins the price_groups CTE with the products table. It finds rows from products that have prices within +/- 1 of the current group_start and group_end values, while ensuring that the id of the products row is greater than the id of the corresponding price_groups row to prevent infinite recursion.
  4. Final Select: This query selects group_start, group_end, and the count of products within each group.

Key Benefits:

  • Dynamic Threshold: You can modify the +/- 1 value in the JOIN condition to adjust the similarity range dynamically.
  • No Hardcoded Groups: This approach eliminates the need for predefined groups, making it highly flexible.
  • Performance Considerations: While efficient for smaller datasets, using a recursive CTE can become computationally expensive with very large datasets.

Additional Insights:

  • Alternative Approaches: For very large datasets, you might consider using window functions like LAG and LEAD to determine similar values.
  • Custom Grouping Logic: This approach can be easily modified to accommodate more complex grouping logic, like grouping based on other columns or using different similarity thresholds.

Remember: The code provided is a basic example. You may need to adjust it based on your specific data and requirements. For larger datasets, consider exploring alternative techniques or optimizations to improve performance.