Database : SQL aggregate function performance in a table with many records that is well-indexed

mardi 30 décembre 2014

SQL aggregate function performance in a table with many records that is well-indexed

My question is about 3 tables that stores data of a single entity, and there is a report generation feature that refers these tables and calculate COUNT and AVG information.

There are three tables: review, rating, and rating_values.

Review table


CREATE TABLE IF NOT EXISTS `review` (
  `review_id` int(10) unsigned NOT NULL AUTO_INCREMENT COMMENT 'Unique review ID of this review.',
  `nid` int(10) unsigned NOT NULL,
  `sub_target` int(10) unsigned NOT NULL,
  `timestamp` int(10) unsigned NOT NULL DEFAULT '0' COMMENT,
  `score` float DEFAULT NULL COMMENT,
  PRIMARY KEY (`review_id`),
  KEY `nid` (`nid`),
  KEY `uid` (`uid`),
  KEY `nid_uid` (`nid`,`uid`),
  KEY `score` (`score`),
  KEY `timestamp` (`timestamp`),
);

Review table acts as the the primary table. Each review can have 0 or more (we have a limit of 18 at the moment in application level) rating values. A rating value is something like "Food", "Location", etc for a restaurant review.

Rating table


'rating' (
  `review_id` int(10) unsigned NOT NULL, //foreign key to {review}.review_id
  `type` varchar(255) NOT NULL,
  `score` float DEFAULT NULL,
  KEY `review_id` (`review_id`),
  KEY `score` (`score`)
);

In this table, type is a machine name of the "food", "location", etc I mentioned above. There is a foreign key to the base table that stores these rating type information. In our application, each rating can have multiple rating values that count towards the main rating. Say, there is "food", and there can multiple sub questions such as "food - appearance", "food - taste". These values can be 0 or more. If there are more than 1 sub question, this score field contains the average of the sub questions (please see the table below).

Rating value table


'rating_value' (
  `review_id` int(10) unsigned NOT NULL,// foreign key to the {review}.review_id
  `type` varchar(255) NOT NULL,
  `sid` int(10) unsigned DEFAULT NULL,
  `score` int(10) unsigned DEFAULT NULL,
  KEY `review_id` (`review_id`),
  KEY `type` (`type`)
)

This is where I expect most of the data would go into. Each sub question has an sid that is referred here.

A typical review would add rows like this:


Review:

review_id | nid | sub_target | timestamp | score
1         | 583 |         786| 1388167200| 4.32

Rating (please note how the two floats add up to 4.32 above):

review_id|type     |score
1        |food     |2.10
1        |location |2.22

Rating value:

review_id|type     |sid|score
1        |food     |1  |2.00
1        |food     |1  |2.30
1        |location |2  |2.12
1        |location |3  |2.22
1        |location |3  |2.22

We will be generating reports with COUNT and AVG on rating_value tables. However, there will be always a JOIN with review table, and the filter would be review.nid, which is indexed.

When accessing reviews, application will almost every time use the review_id, which are either primary or foreign keys in all tables.

Sometimes we need to CAST the score field to nearest integer.

Admittedly I'm not an expert in databases, but I spent a lot of time normalizing the above, and I ended up with the above database after all three steps. I'm eager to learn, so if you have suggestions to above table structure, please don't hesitate to mention them. Note that all review and rating tables will contain some extra data that I didn't mention here. the rating table has a text field to allow users to write a text review. That's why rating values and rating are separated.

The reports we generate are never going to run on all rows. Just a small set of rows filtered by the nid (plus additional filters).

My questions are:

Is it common to spread data of a single user operation across multiple rows? I expect the rating_value table to grow almost 10 times (in record count) than the review table.

While (inner) JOINing tables, and calculating averages, can database engines filter down the eligible rows using the indexes first?

If you have any thoughts about the above schema, please don't hesitate to comment on.

Database

mardi 30 décembre 2014

SQL aggregate function performance in a table with many records that is well-indexed

Aucun commentaire:

Enregistrer un commentaire