mercredi 28 janvier 2015

How to query against legacy data that needs to be cleaned up?


I have an existing SQL Server DB with 3 tables (note: this is a made up example, but modelled after a real problem).



  • Artists : contains music artists

  • MusicRecords : contains information on music records produced by an artists

  • LabelAndArtists : lists all the labels that have published the artist


Music Database


For the whole life of the application MusicRecords were gotten by Artist.



SELECT * FROM [MusicRecords] WHERE [ArtistID] = 10


Now we want to start querying MusicRecords based off of which Label published the MusicRecord. However we were lax in the past when putting Label information in the system. Some Labels were prefixed by 'Label-', other times the Label would just be NULL in the MusicRecord.


If there is exactly 1 LabelAndArtists record for an artist the business is comfortable assuming the LabelAndArtist Label is the Label for the MusicRecord.


We came up with a few ways we could query off of Label.



  1. Create a VIEW that can normalize Label name so we can query off of it.

  2. Create a FUNCTION that returns a TABLE of ArtistIDs with that a normalized Label that we can join to Artists.

  3. Create a Mapping table between Artists and normalized Labels.

  4. Fix our Labels.


The obvious answer would be to fix our Labels. However due to application constraints I am not sure how viable that would be.


Are any of our ways the better than others to query MusicRecords by Label? Is there another way to get MusicRecords by Label?


SQL to recreate:



---------- Artists ----------
CREATE TABLE [Artists]
(
[ArtistID] [int] IDENTITY (1,1) NOT NULL,
[Name] [varchar](512) NOT NULL
)
GO

ALTER TABLE [Artists] WITH NOCHECK ADD
CONSTRAINT [PK_Artists] PRIMARY KEY NONCLUSTERED
(
[ArtistID]
)
ON [PRIMARY]
GO

---------- MusicRecords ----------
CREATE TABLE [MusicRecords]
(
[MusicRecordID] [int] IDENTITY (1,1) NOT NULL,
[ArtistID] [int] NOT NULL,
[Title] [varchar](512) NOT NULL,
[Label] [varchar](512) NULL
)
GO

ALTER TABLE [MusicRecords] WITH NOCHECK ADD
CONSTRAINT [PK_MusicRecords] PRIMARY KEY NONCLUSTERED
(
[MusicRecordID]
)
ON [PRIMARY]
GO

ALTER TABLE [MusicRecords] ADD
CONSTRAINT [FK_MusicRecords_Artist] FOREIGN KEY
(
[ArtistID]
) REFERENCES [Artists] (
[ArtistID]
) ON DELETE CASCADE
GO

CREATE CLUSTERED INDEX [IX_MusicRecords_ArtistID] ON [MusicRecords]
(
[ArtistID],
[MusicRecordID]
)
GO

---------- LabelAndArtists ----------
CREATE TABLE [LabelAndArtists]
(
[LabelAndArtistsID] [int] IDENTITY (1,1) NOT NULL,
[ArtistID] [int] NOT NULL,
[LabelName] [varchar](512) NULL
)
GO

ALTER TABLE [LabelAndArtists] WITH NOCHECK ADD
CONSTRAINT [PK_LabelAndArtists] PRIMARY KEY NONCLUSTERED
(
[LabelAndArtistsID]
)
ON [PRIMARY]
GO

CREATE CLUSTERED INDEX [IX_LabelAndArtists_ArtistID] ON [LabelAndArtists]
(
[LabelAndArtistsID],
[ArtistID]
)
GO

ALTER TABLE [LabelAndArtists] ADD
CONSTRAINT [FK_LabelAndArtists_Artist] FOREIGN KEY
(
[ArtistID]
) REFERENCES [Artists] (
[ArtistID]
) ON DELETE CASCADE
GO

-- Possible VIEW Solution
CREATE VIEW [VW_MusicRecordsWithNormalizedLabel]
AS
SELECT
Label = REPLACE(COALESCE
(
m.Label,
CASE WHEN
(
SELECT COUNT(DISTINCT m2.[LabelName])
FROM [LabelAndArtists] m2 with(nolock)
WHERE m2.[ArtistID] = m.[ArtistID]
AND m2.[LabelName] IS NOT NULL
) = 1
THEN
(
SELECT TOP 1 m2.[LabelName]
FROM [LabelAndArtists] m2 with(nolock)
WHERE m2.[ArtistID] = m.[ArtistID]
ORDER BY m2.[LabelName] DESC
)
ELSE NULL
END
), 'Label-', ''),
m.[ArtistID],
m.[Title]
FROM [MusicRecords] m
GO
DECLARE @Label varchar(512) = '15'
SELECT * FROM [VW_MusicRecordsWithNormalizedLabel] WHERE Label = @Label

-- Possible FUNCTION Solution
CREATE FUNCTION [Artists_GetByLabel] (@LabelName [varchar](512))
RETURNS @ArtistsIDs TABLE (ID int NOT NULL)
AS
BEGIN
INSERT INTO @ArtistsIDs
SELECT ArtistID
FROM LabelAndArtists a with (nolock)
WHERE a.LabelName = REPLACE(@LabelName, 'Label-', '')
UNION
SELECT a.ArtistID
FROM Artists a with (nolock)
JOIN MusicRecords mr with (nolock) on (a.ArtistID = mr.ArtistID)
WHERE mr.Label = REPLACE(@LabelName, 'Label-', '')
RETURN
END
GO
DECLARE @Label varchar(512) = '15'
SELECT *
FROM [MusicRecords] m
JOIN [Artists_GetByLabel](@Label) abl on (abl.ID = m.ArtistID)

---------- Test Data ----------
DECLARE @RowCount [int] = 1
WHILE @RowCount < 150000
BEGIN
INSERT INTO [Artists] ([Name]) VALUES ('ArtistName' + CONVERT(varchar(128), NewID()))
DECLARE @ArtistID [int] = @@IDENTITY
DECLARE @LabelTypeA [varchar] (512) = 'Label-' + Convert(varchar(512), @RowCount)
DECLARE @LabelTypeB [varchar] (512) = Convert(varchar(512), @RowCount)

INSERT INTO [MusicRecords] ([ArtistID], [Title], [Label]) VALUES (@ArtistID, 'Title' + CONVERT(varchar(128), NewID()), @LabelTypeA)
INSERT INTO [MusicRecords] ([ArtistID], [Title], [Label]) VALUES (@ArtistID, 'Title' + CONVERT(varchar(128), NewID()), @LabelTypeA)
INSERT INTO [MusicRecords] ([ArtistID], [Title], [Label]) VALUES (@ArtistID, 'Title' + CONVERT(varchar(128), NewID()), @LabelTypeB)
INSERT INTO [MusicRecords] ([ArtistID], [Title], [Label]) VALUES (@ArtistID, 'Title' + CONVERT(varchar(128), NewID()), NULL)
INSERT INTO [MusicRecords] ([ArtistID], [Title], [Label]) VALUES (@ArtistID, 'Title' + CONVERT(varchar(128), NewID()), NULL)

INSERT INTO [LabelAndArtists] ([ArtistID], [LabelName]) VALUES (@ArtistID, @LabelTypeA)
INSERT INTO [LabelAndArtists] ([ArtistID], [LabelName]) VALUES (@ArtistID, NULL)

SET @RowCount = @RowCount + 1
END




Aucun commentaire:

Enregistrer un commentaire