Wednesday, December 24, 2014

Aggregation Operations on a View Ignore the Index


The Scenario


Once upon a time, there was a Staging database at a small company, participating in an ETL process as the receiving catalog for the various file formats sent by a number of third-party sources. The E was handled through DTS packages with few structures for auditing or control, but it was deemed "Good Enough," and for all intents and purposes, it was.


The data provided by the E portion was intended for consumption by a singular application, developed and managed by a handful of young and capable programmers. Though lacking experience with or knowledge of the data warehousing techniques of the time, they set forth and created their own T and L processes from application code. Blazing forth, these fledgling software engineers invented what outsiders might call a "less-than-ideal wheel," but with "Good Enough" as an ever-present service level, they were able to provide an operational framework.


For a time, all was good in the tightly coupled realm, with the Staging catalog feasting on the data of a dozen third parties, in turn being fed upon by the application. As the application grew, so too did its appetites, but with the skillful white knight developers standing watch over the system, these appetites were addressed quickly and in many cases, even well.


But the golden age could not last forever, of course. With the prosperity granted by the successful application, the business grew and grew. As it grew, the Staging environment and application were forced to grow with it. For all their vigilance, the mere handful of hero developers could not keep up with maintaining the now expansive system, and the consumers had become entitled to their data. No longer was it a matter of what they needed or even wanted, but the populace felt that they simply deserved it, demanding even more.


Armed with little more than coffers full of swag, the business reached out into the market, hiring developers and administrators to help support the ever-growing system. Mercenaries of every ethos flocked to the company, but with this growth spurt came little in the way of available expert guidance. New developers and administrators struggled to understand the intricacies of the home-brewed suite, until the frustrations resulted in all-out war. Each department began to attempt to solve every problem alone, doing more to work against each other than with each other. A single project or initiative would be implemented in several different ways, each slightly different from the next. The strain of it all proved to be too much for some of the white knights, and as they fell, the empire crumbled. Soon the system was in shambles: inline application code accessing base tables directly in the Staging environment, undocumented secondary T and L processes scattered about, and any attempt at even rudimentary normalization, or even standardization, bypassed entirely throughout both the Staging catalog and the application.


Despite the transformation of these fields of promise into gory spaghetti code, the company endured. It was, after all, "Good Enough."


The Challenge


A few more regime changes and hiring sprees later, I find myself in the employment of the company. It has been many years since the great wars, but the damage done is still very visible. I've managed to address some of the weaknesses in the E portion of the system and add some control tables while under the guise of upgrading the DTS packages to SSIS, which are now being used by some actual data warehousing professionals as they create a normal and documented T and L replacement.


The first hurdle was to import the data from the third-party files in a way that wouldn't truncate values or change the native data types, while also including some control keys for reloads and purges. This was all well and good, but the applications needed to be able to access these new tables in a seamless, transparent manner: a DTS package may populate a table, which is then read directly by the application. The SSIS upgrades need to run in parallel for QA reasons, but the new packages include various control keys and leverage a partitioning scheme, and the metadata changes alone are significant enough to warrant a new table altogether, so a new table was used for the new SSIS packages.


With reliable data imports now working and being used by the warehousing team, the real challenge is serving the new data to the applications that access the Staging environment directly, with minimal ( aka "No" ) impact on the application code. For this, I have elected to use views: renaming a table such as dbo.DailyTransaction to dbo.DailyTransaction_LEGACY and reusing the dbo.DailyTransaction object name for a view, which effectively just selects everything from the now-LEGACY-designated table. Since reloading the years of data contained in these tables is not an option from the business' perspective, as the new SSIS-populated and partitioned tables make their way into production the old DTS imports are turned off, and the applications need to be able to access the new data in the new tables as well. At that point, the views are updated to select data from the new tables ( say, dbo.DailyTransactionComplete, for instance ) when it is available, and from the legacy tables when it is not.
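
The swap itself is just a rename followed by recreating the original name as a view; a minimal sketch of the rename step, assuming the default dbo schema:

-- Free up the original object name so it can be reused for the view ( sketch only ).
EXEC sp_rename 'dbo.DailyTransaction', 'DailyTransaction_LEGACY';
GO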


In effect, something like the following is being done:



CREATE VIEW dbo.DailyTransaction
AS
SELECT DailyTransaction_PK, FileDate, Foo
FROM dbo.DailyTransactionComplete
UNION ALL
SELECT DailyTransaction_PK, FileDate, Foo
FROM dbo.DailyTransaction_LEGACY l
WHERE NOT EXISTS ( SELECT 1
                   FROM dbo.DailyTransactionComplete t
                   WHERE t.FileDate = l.FileDate );


While logically sound, this does not perform well at all in a number of aggregation cases, generally resulting in an execution plan that performs a full index scan against the data in the legacy table. This is probably fine for a few dozen million records, but not so much for a few dozen hundred million records. Since the latter is in fact the case, I've had to resort to being... "creative," which led me to create an indexed view.


Here's the little test case I've set up, including the FileDate control key ported to the data-warehouse-compatible DateCode_FK column, which illustrates how utterly little I care, for the time being, about queries against the new table being sargable:



USE tempdb;
GO

SET NOCOUNT ON;
GO

IF NOT EXISTS ( SELECT 1
                FROM sys.objects
                WHERE name = 'DailyTransaction_LEGACY'
                  AND type = 'U' )
BEGIN
    --DROP TABLE dbo.DailyTransaction_LEGACY;
    CREATE TABLE dbo.DailyTransaction_LEGACY
    (
        DailyTransaction_PK BIGINT IDENTITY( 1, 1 ) NOT NULL,
        FileDate DATETIME NOT NULL,
        Foo INT NOT NULL
    );

    INSERT INTO dbo.DailyTransaction_LEGACY ( FileDate, Foo )
    SELECT DATEADD( DAY, ( 1 - ROW_NUMBER()
                   OVER( ORDER BY so1.object_id ) - 800 ) % 1000,
                   CONVERT( DATE, GETDATE() ) ),
           so1.object_id % 1000 + so2.object_id % 1000
    FROM sys.all_objects so1
    CROSS JOIN sys.all_objects so2;

    ALTER TABLE dbo.DailyTransaction_LEGACY
        ADD CONSTRAINT PK__DailyTransaction_LEGACY
        PRIMARY KEY CLUSTERED ( DailyTransaction_PK )
        WITH ( DATA_COMPRESSION = PAGE, FILLFACTOR = 100 );
END;
GO

IF NOT EXISTS ( SELECT 1
                FROM sys.objects
                WHERE name = 'DailyTransactionComplete'
                  AND type = 'U' )
BEGIN
    --DROP TABLE dbo.DailyTransactionComplete;
    CREATE TABLE dbo.DailyTransactionComplete
    (
        DailyTransaction_PK BIGINT IDENTITY( 1, 1 ) NOT NULL,
        DateCode_FK INTEGER NOT NULL,
        Foo INTEGER NOT NULL
    );

    INSERT INTO dbo.DailyTransactionComplete ( DateCode_FK, Foo )
    SELECT TOP 100000
           CONVERT( INTEGER, CONVERT( VARCHAR( 8 ), DATEADD( DAY,
               ( 1 - ROW_NUMBER() OVER( ORDER BY so1.object_id ) ) % 100,
               GETDATE() ), 112 ) ),
           so1.object_id % 1000
    FROM sys.all_objects so1
    CROSS JOIN sys.all_objects so2;

    ALTER TABLE dbo.DailyTransactionComplete
        ADD CONSTRAINT PK__DailyTransaction
        PRIMARY KEY CLUSTERED ( DateCode_FK, DailyTransaction_PK )
        WITH ( DATA_COMPRESSION = PAGE, FILLFACTOR = 100 );
END;
GO


On my local sandbox, the above gets me a legacy table with about 4.4 million rows and a new table containing 0.1 million rows, with some overlap of the DateCode_FK / FileDate values.
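
A quick way to sanity-check those row counts and the date overlap, using only the tables created above ( the column aliases here are just for illustration ):

-- Row counts in each table, plus how many distinct legacy FileDate values also
-- exist in the new table once DateCode_FK is converted back to a DATETIME.
SELECT LegacyRows = ( SELECT COUNT_BIG( * ) FROM dbo.DailyTransaction_LEGACY ),
       CompleteRows = ( SELECT COUNT_BIG( * ) FROM dbo.DailyTransactionComplete ),
       OverlappingDates = ( SELECT COUNT( DISTINCT l.FileDate )
                            FROM dbo.DailyTransaction_LEGACY l
                            WHERE EXISTS ( SELECT 1
                                           FROM dbo.DailyTransactionComplete t
                                           WHERE CONVERT( DATETIME, CONVERT( VARCHAR( 8 ),
                                                 t.DateCode_FK ), 112 ) = l.FileDate ) );
GO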


A MAX( FileDate ) against the legacy table with no additional indexes runs in about the time I would expect.



SET STATISTICS IO, TIME ON;

DECLARE @ConsumeOutput DATETIME;
SELECT @ConsumeOutput = MAX( FileDate )
FROM dbo.DailyTransaction_LEGACY;

SET STATISTICS IO, TIME OFF;
GO



Table 'DailyTransaction_LEGACY'. Scan count 1, logical reads 9228, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


SQL Server Execution Times: CPU time = 889 ms, elapsed time = 886 ms.



[Execution plan screenshot: Clustered Index, Legacy]


Tossing a simple index on the table makes things much better. Still a scan, but now it's scanning one record instead of 4.4 million. I'm cool with that.



CREATE NONCLUSTERED INDEX IX__DailyTransaction__FileDate
ON dbo.DailyTransaction_LEGACY ( FileDate );

SET STATISTICS IO, TIME ON;

DECLARE @ConsumeOutput DATETIME;
SELECT @ConsumeOutput = MAX( FileDate )
FROM dbo.DailyTransaction_LEGACY;

SET STATISTICS IO, TIME OFF;
GO



SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 1 ms. Table 'DailyTransaction_LEGACY'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.



[Execution plan screenshot: Non-clustered Index, Legacy]


And now, creating the view so the developers don't have to change any code because that would apparently be the end of the world as we know it. A cataclysm of sorts.



IF NOT EXISTS ( SELECT 1
                FROM sys.objects
                WHERE name = 'DailyTransaction'
                  AND type = 'V' )
BEGIN
    EXEC( 'CREATE VIEW dbo.DailyTransaction AS SELECT x = 1;' );
END;
GO

ALTER VIEW dbo.DailyTransaction
AS
SELECT DailyTransaction_PK,
       FileDate = CONVERT( DATETIME, CONVERT( VARCHAR( 8 ), DateCode_FK ), 112 ),
       Foo
FROM dbo.DailyTransactionComplete
UNION ALL
SELECT DailyTransaction_PK, FileDate, Foo
FROM dbo.DailyTransaction_LEGACY l
WHERE NOT EXISTS ( SELECT 1
                   FROM dbo.DailyTransactionComplete t
                   WHERE CONVERT( DATETIME, CONVERT( VARCHAR( 8 ),
                         t.DateCode_FK ), 112 ) = l.FileDate );
GO


Yes, the subquery is abysmal, but that is not the problem here; I'll probably just create a persisted computed column and throw an index on it for that purpose once the real problem is solved, along the lines of the sketch below.
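
A minimal version of that idea, assuming nothing else depends on the new table's shape ( the column and index names here are placeholders ):

-- Persist the DateCode_FK-to-DATETIME conversion so the correlated probe can work
-- against a real, indexable column instead of converting every row ( sketch only ).
ALTER TABLE dbo.DailyTransactionComplete
    ADD FileDate AS CONVERT( DATETIME, CONVERT( VARCHAR( 8 ), DateCode_FK ), 112 ) PERSISTED;
GO

CREATE NONCLUSTERED INDEX IX__DailyTransactionComplete__FileDate
    ON dbo.DailyTransactionComplete ( FileDate );
GO

So, without further ado: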


The Problem



SET STATISTICS IO, TIME ON;

DECLARE @ConsumeOutput1 DATETIME;
SELECT @ConsumeOutput1 = MAX( FileDate )
FROM dbo.DailyTransaction;

SET STATISTICS IO, TIME OFF;
GO



SQL Server parse and compile time: CPU time = 0 ms, elapsed time = 4 ms. Table 'DailyTransaction_LEGACY'. Scan count 1, logical reads 11972, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'DailyTransactionComplete'. Scan count 2, logical reads 620, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


SQL Server Execution Times: CPU time = 983 ms, elapsed time = 983 ms.



[Execution plan screenshot: View Plan]


Oh, I see: SQL Server is trying to tell me that what I am doing is idiotic. While I largely agree, that does not change my predicament. The view actually works brilliantly for queries where the FileDate of dbo.DailyTransaction is included in the predicate, but while the MAX plan is bad enough, the TOP plan sends the whole thing running south. Real south.
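
For reference, this is the shape of predicated query that behaves well here ( just an illustration against the test tables above; the date range is arbitrary ):

-- A predicated aggregate against the view; the FileDate filter is pushed into both
-- branches of the UNION ALL, which typically lets the legacy side seek on
-- IX__DailyTransaction__FileDate.
SET STATISTICS IO, TIME ON;

DECLARE @ConsumeOutput BIGINT;
SELECT @ConsumeOutput = COUNT_BIG( * )
FROM dbo.DailyTransaction
WHERE FileDate >= DATEADD( DAY, -7, CONVERT( DATE, GETDATE() ) );

SET STATISTICS IO, TIME OFF;
GO

Here is the TOP query in question: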



SET STATISTICS IO, TIME ON;

SELECT TOP 10 FileDate
FROM dbo.DailyTransaction
GROUP BY FileDate
ORDER BY FileDate DESC

SET STATISTICS IO, TIME OFF;
GO



Table 'DailyTransactionComplete'. Scan count 2, logical reads 1800110, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'DailyTransaction_LEGACY'. Scan count 1, logical reads 1254, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


SQL Server Execution Times: CPU time = 109559 ms, elapsed time = 109664 ms.



[Execution plan screenshot: TOP]


I mentioned getting "creative" earlier, which was probably misleading. What I meant to say was "more stupid": my attempt to make this view work during aggregation operations has been to create views on the dbo.DailyTransactionComplete and dbo.DailyTransaction_LEGACY tables, schema-bind and index the latter, and then use those views in another view with a NOEXPAND hint on the legacy view. While it is more or less working for what it needs to do for now, I find the whole "solution" quite upsetting, culminating with the following:



IF NOT EXISTS ( SELECT 1
                FROM sys.objects
                WHERE name = 'v_DailyTransactionComplete'
                  AND type = 'V' )
BEGIN
    EXEC( 'CREATE VIEW dbo.v_DailyTransactionComplete AS SELECT x = 1;' );
END;
GO

ALTER VIEW dbo.v_DailyTransactionComplete
AS
SELECT DailyTransaction_PK,
       FileDate = CONVERT( DATETIME, CONVERT( VARCHAR( 8 ), DateCode_FK ), 112 ),
       Foo
FROM dbo.DailyTransactionComplete;
GO

IF NOT EXISTS ( SELECT 1
                FROM sys.objects
                WHERE name = 'v_DailyTransaction_LEGACY'
                  AND type = 'V' )
BEGIN
    EXEC( 'CREATE VIEW dbo.v_DailyTransaction_LEGACY AS SELECT x = 1;' );
END;
GO

ALTER VIEW dbo.v_DailyTransaction_LEGACY
WITH SCHEMABINDING
AS
SELECT l.DailyTransaction_PK,
       l.FileDate,
       l.Foo,
       CountBig = COUNT_BIG( * )
FROM dbo.DailyTransaction_LEGACY l
INNER JOIN dbo.DailyTransactionComplete n
    ON l.FileDate <> CONVERT( DATETIME, CONVERT( VARCHAR( 8 ), n.DateCode_FK ), 112 )
GROUP BY l.DailyTransaction_PK,
         l.FileDate,
         l.Foo;
GO

CREATE UNIQUE CLUSTERED INDEX CI__v_DailyTransaction_LEGACY
    ON dbo.v_DailyTransaction_LEGACY ( FileDate, DailyTransaction_PK )
    WITH ( DATA_COMPRESSION = PAGE, FILLFACTOR = 80 );
GO

IF NOT EXISTS ( SELECT 1
                FROM sys.objects
                WHERE name = 'DailyTransaction'
                  AND type = 'V' )
BEGIN
    EXEC( 'CREATE VIEW dbo.DailyTransaction AS SELECT x = 1;' );
END;
GO

ALTER VIEW dbo.DailyTransaction
AS
SELECT DailyTransaction_PK, FileDate, Foo
FROM dbo.v_DailyTransactionComplete
UNION ALL
SELECT DailyTransaction_PK, FileDate, Foo
FROM dbo.v_DailyTransaction_LEGACY WITH ( NOEXPAND );
GO


Forcing the optimizer to use the index provided by the indexed view makes the MAX and TOP issues go away, but there's got to be a better way to achieve what I'm trying to do here. Absolutely any suggestions / scolding would be very much appreciated!!



SET STATISTICS IO, TIME ON;

DECLARE @ConsumeOutput1 DATETIME;
SELECT @ConsumeOutput1 = MAX( FileDate )
FROM dbo.DailyTransaction;

SET STATISTICS IO, TIME OFF;
GO



Table 'v_DailyTransaction_LEGACY'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'DailyTransactionComplete'. Scan count 1, logical reads 310, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


SQL Server Execution Times: CPU time = 31 ms, elapsed time = 36 ms.




SET STATISTICS IO, TIME ON;

DECLARE @ConsumeOutput1 DATETIME;
SELECT TOP 10 @ConsumeOutput1 = FileDate
FROM dbo.DailyTransaction
GROUP BY FileDate
ORDER BY FileDate DESC

SET STATISTICS IO, TIME OFF;
GO



Table 'v_DailyTransaction_LEGACY'. Scan count 1, logical reads 101, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table 'DailyTransactionComplete'. Scan count 1, logical reads 310, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


SQL Server Execution Times: CPU time = 63 ms, elapsed time = 66 ms.



TL;DR:


Help me understand what I need to do to make aggregation queries on the first view I mentioned run in reasonable amounts of time with reasonable I/O resource utilization.




