I have one large writable CTE (wCTE) that, before any rows are inserted, has to determine whether each new row conforms to the relational structure of the database as well as to other constraints that cannot be implemented as database constraints but can still be verified at the database level.
I determine conformity with EXISTS subqueries. Every column referenced in each EXISTS's WHERE clause is covered by an index, single- or multi-column, and the conditions are written in the same order as the columns of the multi-column indexes when those are available.
Each conditional column is either an integer type or a bytea no larger than 64 bytes.
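For context, the overall statement has roughly the following shape. This is only a minimal sketch under assumed names: incoming_rows, the column list, and the single NOT EXISTS predicate are placeholders for the real, much larger wCTE.

WITH source_data AS (
    SELECT * FROM incoming_rows          -- hypothetical staging source for the new rows
),
inserted AS (
    INSERT INTO main_data (norm1_id, data1, data2, data3)
    SELECT sd.norm1_id, sd.data1, sd.data2, sd.data3
    FROM source_data sd
    WHERE NOT EXISTS (                   -- one of many conformity checks
        SELECT 1 FROM main_data md
        WHERE md.norm1_id = sd.norm1_id
          AND md.data1 = sd.data1
          AND md.data2 = sd.data2
          AND md.data3 = sd.data3
    )
    RETURNING norm1_id
)
SELECT count(*) FROM inserted;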
First, I tried to create as few EXISTS subqueries as possible. In that version, multiple conditions on the incoming-data CTEs, source_data and source_auxiliary_data, were tested in a single subquery such as:
EXISTS (
    SELECT 1
    FROM source_data sd, source_auxiliary_data sad, source_auxiliary_data tad
    WHERE sd.norm1_id = source_data.norm1_id
      AND (
            (
                sd.norm2_id > source_data.norm2_id
                AND (
                    sd.data1 < source_data.data1
                    OR (sd.data2 = source_data.data2 AND sd.data3 = source_data.data3)
                    OR (sd.data4 = source_data.data4 AND sd.data5 = source_data.data5)
                )
            )
            OR (
                sd.norm2_id > source_data.norm2_id
                AND sad.data6 = source_data.data6
                AND tad.data6 = sd.data6
                AND (
                    tad.data7 = sad.data7
                    OR (
                        tad.data2 = sad.data2
                        AND (
                            tad.tstp = sad.tstp
                            OR tad.data3 = sad.data3
                            OR tad.data5 = sad.data5
                        )
                    )
                )
            )
      )
)
The performance of that approach was horrible: seconds per new source_data row.
I split up this query, and the many others like it, via CASE conditions, and performance improved to hundreds of ms per new source_data row.
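Roughly, the split turned the OR branches into separate EXISTS subqueries selected by a CASE. The sketch below is illustrative only: the WHEN condition is invented, and the real branch predicates depend on the row.

CASE
    WHEN source_data.data6 IS NULL THEN          -- purely illustrative branch condition
        EXISTS (
            SELECT 1 FROM source_data sd
            WHERE sd.norm1_id = source_data.norm1_id
              AND sd.norm2_id > source_data.norm2_id
              AND (sd.data1 < source_data.data1
                   OR (sd.data2 = source_data.data2 AND sd.data3 = source_data.data3)
                   OR (sd.data4 = source_data.data4 AND sd.data5 = source_data.data5))
        )
    ELSE
        EXISTS (
            SELECT 1
            FROM source_data sd
            JOIN source_auxiliary_data sad ON sad.data6 = source_data.data6
            JOIN source_auxiliary_data tad ON tad.data6 = sd.data6
            WHERE sd.norm1_id = source_data.norm1_id
              AND sd.norm2_id > source_data.norm2_id
              AND (tad.data7 = sad.data7
                   OR (tad.data2 = sad.data2
                        AND (tad.tstp = sad.tstp
                             OR tad.data3 = sad.data3
                             OR tad.data5 = sad.data5)))
        )
END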
I used EXPLAIN ANALYZE to find the high-cost, high-row-count portions and noticed that they all surrounded sequential scans, so I reordered all of my columns to match the index definitions and further split the EXISTS queries into their most basic forms.
Some of them started using indexes, and performance improved to only tens of ms per new source_data row.
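One way to inspect a single check in isolation is to run it stand-alone under EXPLAIN; a sketch with hypothetical literal values (and guessed types) standing in for one incoming row:

EXPLAIN (ANALYZE, BUFFERS)
SELECT EXISTS (
    SELECT 1
    FROM main_data
    WHERE norm1_id = 42                   -- placeholder values; the real ones come from source_data
      AND data1 = 7
      AND data2 = '\xdeadbeef'::bytea     -- guessed as the bytea column; the others guessed as integers
      AND data3 = 3
);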
For simple EXISTS subqueries such as
EXISTS (
    SELECT 1
    FROM main_data
    WHERE norm1_id = source_data.norm1_id
      AND data1 = source_data.data1
      AND data2 = source_data.data2
      AND data3 = source_data.data3
)
where main_data has a unique constraint on (norm1_id, data1, data2, data3) in that exact order, a sequential scan was still used.
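For reference, a unique constraint like that is backed by a unique btree index in the same column order, so the probe above should be able to use it. A sketch, with a hypothetical constraint name:

ALTER TABLE main_data
    ADD CONSTRAINT main_data_natural_key              -- hypothetical name
    UNIQUE (norm1_id, data1, data2, data3);           -- PostgreSQL creates a matching unique btree index

-- listing the indexes confirms which ones actually exist
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'main_data';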
Yesterday, I turned off sequential scans for this transaction. After that only indexes were used, but performance didn't seem to improve at first; after many, many executions, though, it skyrocketed again, and now only 1-2 ms per new source_data row is consumed.
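Turning off sequential scans for just one transaction can be done with SET LOCAL, which resets at COMMIT or ROLLBACK; roughly:

BEGIN;
SET LOCAL enable_seqscan = off;   -- strongly discourages seq scans for this transaction only
-- the large wCTE INSERT runs here
COMMIT;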
The EXISTS checks consume roughly the same amount of time regardless of how big main_data, the table holding the most rows, is.
Still, if I comment out the existence checks, the writes consume only about as much time as the drive needs to write the data, plus a little extra.
Why? Naively, it seems as if the database should load the source rows into memory, probe the indexes for conformity, and stop as soon as a match is found. Increasing the transaction's work_mem had no impact.
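The work_mem change was likewise scoped to the transaction, e.g. (the value is illustrative; work_mem mainly helps sorts and hashes, not single index probes):

SET LOCAL work_mem = '256MB';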
I turned on only the portion that checks the new incoming data against itself, and that portion alone still consumes hundreds of µs. I wrote a dumb scan in C++ to run the same check, and it consumes "0" µs.
Is there any way to make PostgreSQL's EXISTS checks approach the speed of these dumb C++ scans? If so, how?