I have one large writable CTE (wCTE) that, before any rows are inserted, has to determine whether each new row conforms to the relational structure of the database as well as to other constraints that cannot be implemented as database constraints but can still be verified at the database level.
I determine conformity with EXISTS subqueries. Every column referenced in each EXISTS's WHERE clause is covered by an index, single- or multi-column, and the conditions are written in the same order as the columns of the multi-column indexes when those are available.
Each conditional column is either an integer type or a bytea no larger than 64 bytes.
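For context, the overall statement has roughly the following shape. This is only a minimal sketch under assumed names: incoming_rows, the column list, and the single NOT EXISTS predicate are placeholders for the real, much larger wCTE.

WITH source_data AS (
    SELECT * FROM incoming_rows          -- hypothetical staging source for the new rows
),
inserted AS (
    INSERT INTO main_data (norm1_id, data1, data2, data3)
    SELECT sd.norm1_id, sd.data1, sd.data2, sd.data3
    FROM source_data sd
    WHERE NOT EXISTS (                   -- one of many conformity checks
        SELECT 1 FROM main_data md
        WHERE md.norm1_id = sd.norm1_id
          AND md.data1 = sd.data1
          AND md.data2 = sd.data2
          AND md.data3 = sd.data3
    )
    RETURNING norm1_id
)
SELECT count(*) FROM inserted;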
First, I tried to create as few EXISTS subqueries as possible. In that version, multiple conditions on the incoming-data CTEs, source_data and source_auxiliary_data, were tested in a single subquery such as:
EXISTS (
    SELECT 1
    FROM source_data sd, source_auxiliary_data sad, source_auxiliary_data tad
    WHERE sd.norm1_id = source_data.norm1_id
      AND (
            (
                sd.norm2_id > source_data.norm2_id
                AND (
                    sd.data1 < source_data.data1
                    OR (sd.data2 = source_data.data2 AND sd.data3 = source_data.data3)
                    OR (sd.data4 = source_data.data4 AND sd.data5 = source_data.data5)
                )
            )
            OR (
                sd.norm2_id > source_data.norm2_id
                AND sad.data6 = source_data.data6
                AND tad.data6 = sd.data6
                AND (
                    tad.data7 = sad.data7
                    OR (
                        tad.data2 = sad.data2
                        AND (
                            tad.tstp = sad.tstp
                            OR tad.data3 = sad.data3
                            OR tad.data5 = sad.data5
                        )
                    )
                )
            )
      )
)
The performance of that approach was horrible: seconds per new source_data row.
I split up this query, and the many others like it, via CASE conditions, and performance improved to hundreds of ms per new source_data row.
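Roughly, the split turned the OR branches into separate EXISTS subqueries selected by a CASE. The sketch below is illustrative only: the WHEN condition is invented, and the real branch predicates depend on the row.

CASE
    WHEN source_data.data6 IS NULL THEN          -- purely illustrative branch condition
        EXISTS (
            SELECT 1 FROM source_data sd
            WHERE sd.norm1_id = source_data.norm1_id
              AND sd.norm2_id > source_data.norm2_id
              AND (sd.data1 < source_data.data1
                   OR (sd.data2 = source_data.data2 AND sd.data3 = source_data.data3)
                   OR (sd.data4 = source_data.data4 AND sd.data5 = source_data.data5))
        )
    ELSE
        EXISTS (
            SELECT 1
            FROM source_data sd
            JOIN source_auxiliary_data sad ON sad.data6 = source_data.data6
            JOIN source_auxiliary_data tad ON tad.data6 = sd.data6
            WHERE sd.norm1_id = source_data.norm1_id
              AND sd.norm2_id > source_data.norm2_id
              AND (tad.data7 = sad.data7
                   OR (tad.data2 = sad.data2
                        AND (tad.tstp = sad.tstp
                             OR tad.data3 = sad.data3
                             OR tad.data5 = sad.data5)))
        )
END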
I used EXPLAIN ANALYZE to find the high-cost, high-row-count portions and noticed that they all surrounded sequential scans, so I reordered all of my columns to match the index definitions and further split the EXISTS queries into their most basic forms.
Some of them started using indexes, and performance improved to only tens of ms per new source_data row.
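One way to inspect a single check in isolation is to run it stand-alone under EXPLAIN; a sketch with hypothetical literal values (and guessed types) standing in for one incoming row:

EXPLAIN (ANALYZE, BUFFERS)
SELECT EXISTS (
    SELECT 1
    FROM main_data
    WHERE norm1_id = 42                   -- placeholder values; the real ones come from source_data
      AND data1 = 7
      AND data2 = '\xdeadbeef'::bytea     -- guessed as the bytea column; the others guessed as integers
      AND data3 = 3
);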
For simple EXISTS subqueries such as
EXISTS (
    SELECT 1
    FROM main_data
    WHERE norm1_id = source_data.norm1_id
      AND data1 = source_data.data1
      AND data2 = source_data.data2
      AND data3 = source_data.data3
)
where main_data has a unique constraint on (norm1_id, data1, data2, data3) in that exact order, a sequential scan was still used.
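For reference, a unique constraint like that is backed by a unique btree index in the same column order, so the probe above should be able to use it. A sketch, with a hypothetical constraint name:

ALTER TABLE main_data
    ADD CONSTRAINT main_data_natural_key              -- hypothetical name
    UNIQUE (norm1_id, data1, data2, data3);           -- PostgreSQL creates a matching unique btree index

-- listing the indexes confirms which ones actually exist
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'main_data';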
Yesterday, I turned off sequential scans for this transaction. After that only indexes were used, but performance didn't seem to improve at first; after many, many executions, though, it skyrocketed again, and now only 1-2 ms per new source_data row is consumed.
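Turning off sequential scans for just one transaction can be done with SET LOCAL, which resets at COMMIT or ROLLBACK; roughly:

BEGIN;
SET LOCAL enable_seqscan = off;   -- strongly discourages seq scans for this transaction only
-- the large wCTE INSERT runs here
COMMIT;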
The EXISTS checks consume roughly the same amount of time regardless of how big main_data, the table holding the most rows, is.
Still, if I comment out the existence checks, the writes consume only about as much time as the drive needs to write the data, plus a little extra.
Why? Naively, it seems as if the database should load the source rows into memory, probe the indexes for conformity, and stop as soon as a match is found. Increasing the transaction's work_mem had no impact.
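The work_mem change was likewise scoped to the transaction, e.g. (the value is illustrative; work_mem mainly helps sorts and hashes, not single index probes):

SET LOCAL work_mem = '256MB';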
I turned on only the portion that checks the new incoming data against itself, and that portion alone still consumes hundreds of µs. I wrote a dumb scan in C++ to run the same check, and it consumes "0" µs.
Is there any way to make PostgreSQL's EXISTS checks approach the speed of these dumb C++ scans? If so, how?