Database : Finding the current prices for n Fuelstations at a certain point in time

dimanche 22 février 2015

Finding the current prices for n Fuelstations at a certain point in time

I have a Table where price information is stored in with approx 13 million rows stored in a PostgreSQL 9.5 database.


CREATE TABLE public.de_tt_priceinfo (
  id integer NOT NULL DEFAULT nextval('priceinfo_id_seq'::regclass),
  station_id character varying(60),
  recieved timestamp with time zone NOT NULL DEFAULT now(),
  e5 numeric(4,3),
  e10 numeric(4,3),
  diesel numeric(4,3),
  CONSTRAINT de_tt_priceinfo_id_pkey PRIMARY KEY (id)
);

CREATE INDEX de_tt_priceinfo_recieved_station_id_idx
  ON public.de_tt_priceinfo (recieved, station_id COLLATE pg_catalog."default");

CREATE INDEX index_station_id
  ON public.de_tt_priceinfo (station_id COLLATE pg_catalog."default");

From this table I need to extract the latest prices at a certain point in time with maximum performance, since I have to simulate 32 million commuters which query this table (not at once, but still).

I have a working query!


SELECT station_id, e5, e10, diesel, recieved FROM de_tt_priceinfo a 
WHERE a.recieved = (SELECT MAX(recieved) FROM de_tt_priceinfo b 
  WHERE a.station_id = b.station_id
  AND  recieved <= '2014-09-25 08:45:12'::TIMESTAMPTZ)
AND station_id IN('0C91A93A-a-b-c-d', '578C44BB-a-b-c-d', '6F2F48A8-a-b-c-d'
                , '9982BE74-a-b-c-d', 'A24C612B-a-b-c-d', 'BEC3EF55-a-b-c-d'
                , 'F5137488-a-b-c-d')

The performance of this Query is not usable. Execution time is varying around 900ms. The result looks like this


0C91A93A-a-b-c-d, 1.xxx, 1.xxx, 1.xxx, "2014-09-25 08:17:50.000000"
578C44BB-a-b-c-d, 1.xxx, 1.xxx, 1.xxx, "2014-09-25 08:00:09.000000"
6F2F48A8-a-b-c-d, 1.xxx, 1.xxx, 1.xxx, "2014-09-25 07:08:57.000000"
9982BE74-a-b-c-d, 1.xxx, 1.xxx, 1.xxx, "2014-09-25 08:29:55.000000"
A24C612B-a-b-c-d, 1.xxx, 1.xxx, 1.xxx, "2014-09-25 08:00:09.000000"
BEC3EF55-a-b-c-d, 1.xxx, 1.xxx, 1.xxx, "2014-09-25 06:53:49.000000"
F5137488-a-b-c-d, 1.xxx, 1.xxx, 1.xxx, "2014-09-25 07:44:55.000000"

So I searched around a little bit and found buzz words like recursive CTE, loose indexscan and some answers on DBA which seemed very close, but I could not modify them to my needs.

How do I efficiently get "the most recent corresponding row"?

GROUP BY one column, while sorting by another in PostgreSQL

If I understood right, a recursive CTE would be the quickest way to query the data I want.

What I got so far is this:


WITH RECURSIVE cte AS (
   (
   SELECT station_id, e5, e10, diesel, recieved
   FROM   de_tt_priceinfo
   WHERE  recieved <= '2014-09-25 08:45:00'::TIMESTAMPTZ
     AND station_id IN('0C91A93A-a-b-c-d', '578C44BB-a-b-c-d', '6F2F48A8-a-b-c-d'
                     , '9982BE74-a-b-c-d', 'A24C612B-a-b-c-d', 'BEC3EF55-a-b-c-d'
                     , 'F5137488-a-b-c-d')
   ORDER  BY station_id, recieved DESC NULLS LAST
   LIMIT 1  
   )
   UNION ALL
   (
   SELECT u.station_id, u.e5, u.e10, u.diesel, u.recieved
   FROM   cte c
   JOIN de_tt_priceinfo u ON u.recieved > c.recieved   
   WHERE  u.recieved <= '2014-09-25 08:45:00'::TIMESTAMPTZ  -- repeat condition!
     AND u.station_id IN('0C91A93A-a-b-c-d', '578C44BB-a-b-c-d', '6F2F48A8-a-b-c-d'
                       , '9982BE74-a-b-c-d', 'A24C612B-a-b-c-d', 'BEC3EF55-a-b-c-d'
                       , 'F5137488-a-b-c-d')
   ORDER BY u.station_id, u.recieved DESC NULLS LAST LIMIT 1
   )
   )
SELECT * FROM cte;

But this just returns the following two lines:


0C91A93A-a-b-c-d, 1.xxx, 1.xxx, 1.xxx, "2014-09-25 08:17:50.000000"
9982BE74-a-b-c-d, 1.xxx, 1.xxx, 1.xxx, "2014-09-25 08:29:55.000000"

Update:

SELECT Version(); PostgreSQL 9.5devel on x86_64-pc-linux-gnu, compiled by x86_64-pc-linux-gnu-gcc (Gentoo 4.8.3 p1.1, pie-0.5.9) 4.8.3, 64-bit

EXPLAIN ANALYSE: http://ift.tt/1B9KE1p

XEON 1231v3, 16 GB Ram, Samsung 840 PRO SSD

Changes to the default postgresql.conf


# Connection
listen_addresses = '*'
max_connections = 16

# Logging
log_destination = 'csvlog'
log_directory = 'pg_log'
logging_collector = on
log_filename = 'postgres-%Y-%m-%d_%H%M%S.log'
log_rotation_age = 1d
log_rotation_size = 1GB
log_min_duration_statement = 500ms
#log_checkpoints = on
#log_connections = on
#log_disconnections = on
log_lock_waits = on
#log_temp_files = 0

# Memory
shared_buffers = 1GB
temp_buffers = 32MB
work_mem = 256MB
maintenance_work_mem = 1GB
effective_cache_size = 8GB

# Checkpoint ( When to write to disk )
wal_buffers = 16MB
checkpoint_completion_target = 0.9
checkpoint_timeout = 30min
checkpoint_segments = 32

random_page_cost = 1.1

# Import only!
#autovacuum = off
fsync = off
synchronous_commit = off
full_page_writes = off

Database

dimanche 22 février 2015

Finding the current prices for n Fuelstations at a certain point in time

Aucun commentaire:

Enregistrer un commentaire