dimanche 1 mars 2015

Postgresql Multiple Processes and Queries or Nested Query?


I have a series of tables and queries I run a process against and at times this process can take hours - if not multiple days - depending on the date range I select. I am running PostgreSQL 9.4.


In reviewing the below queries, does it make sense to 1) have a temp table and/or 2) should I have a nested query. 3) Is there a better design to improve the structure.


I realize there is some back-end tuning I can optimize, but in just reviewing the below queries is there a glaring improvement I am missing or should be addressing better?


The first query:




-- I first insert the below results into a temp table -- Running the below query can take hours -- Then the insert can sometimes take an hour or more depending on the amount of data -- The average amount of rows the can be insert are around two million or more. -- There is no way to avoid this given the data set and requested results





INSERT INTO tempresults(
lmp_date, approved_lmp_name_a, approved_lmp_name_b,
path, approved_a_pnode,
approved_b_pnode, approved_a_sink, approved_b_sink, approved_a_source,
approved_b_source,
lmp_da_a, lmp_da_b, da_spread_diff, lmp_da_spread_pl,
submitted_bid, filled, filled_percentage, priced_filled_at, filled_vs_settled,
profit_flag, filled_profit, loss_flag, filled_loss, lmp_rt_a,
lmp_rt_b, rt_spread_diff, dart_a, dart_b, dart_spread_diff, win_trigger, loss_trigger)

SELECT
"Selected_Period".lmp_date as "Date",
"Selected_Period".lmp_name_a as "LMP Name A",
"Selected_Period".lmp_name_b as "LMP Name B",
CONCAT("Selected_Period".lmp_name_a,' - ',"Selected_Period".lmp_name_b) as "Path",
"Selected_Period".approved_a_pnode, "Selected_Period".approved_b_pnode,
"Selected_Period".approved_a_sink, "Selected_Period".approved_b_sink, "Selected_Period".approved_a_source,
"Selected_Period".approved_b_source,
AVG("Selected_Period".lmp_da_a) as "LMP DA A Avg",
AVG("Selected_Period".lmp_da_b) as "LMP DA B Avg",
(("Selected_Period".da_spread_diff)*-1) as "LMP DA A-B Spread",
(AVG(("Selected_Period".da_spread_diff)*-1)) as "LMP DA A-B Spread PL",
0.50 AS "Submitted Bid",
CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN TRUE ELSE FALSE END AS "Filled",
CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN 1 ELSE 0 END AS "Filled %",
CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN "Selected_Period".da_spread_diff*-1 END AS "Price Filled At",

CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN SUM(("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)) END AS "Filled vs Settled",

CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN CASE WHEN SUM(("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)*-1) >0.50 THEN TRUE ELSE FALSE END END AS "Profit Flag",

CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN CASE WHEN SUM(("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)*-1) >0.50 THEN SUM((("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)))+("Selected_Period".da_spread_diff) ELSE 0 END END AS "Filled Profit",

CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN CASE WHEN SUM(("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)*-1) >0.50 THEN FALSE ELSE TRUE END END AS "Loss Flag",

CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN CASE WHEN SUM(("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)*-1) >0.50 THEN 0 ELSE SUM((("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)))-("Selected_Period".da_spread_diff)*-1 END END AS "Filled Loss",

AVG("Selected_Period".lmp_rt_a) as "LMP RT A Avg",
AVG("Selected_Period".lmp_rt_b) as "LMP RT B Avg",
AVG(("Selected_Period".rt_spread_diff)*-1) as "LMP RT A-B Spread Avg",
AVG("Selected_Period".dart_a) as "LMP DART A Avg",
AVG("Selected_Period".dart_b) as "LMP DART B Avg",
AVG("Selected_Period".dart_spread_diff) as "LMP DART A-B Spread Avg",
CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN CASE WHEN SUM(("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)*-1) >0.50 THEN 1 ELSE 0 END END AS "Win Trigger",
CASE WHEN ("Selected_Period".da_spread_diff*-1)<=0.50 THEN CASE WHEN SUM(("Selected_Period".dart_spread_diff)-("Selected_Period".da_spread_diff)*-1) >0.50 THEN 0 ELSE 1 END END AS "Loss Trigger"
FROM
public.darts_calculated_partitioned "Selected_Period"
WHERE
"Selected_Period".lmp_date >= 'XYXYXY' AND -- starting date. user normally inputs this value
"Selected_Period".lmp_date <= 'YZYZYZ' AND -- ending date. user normally inputs this value
"Selected_Period".he = '5' --this is hour. user normally inputs this value
GROUP BY
"Selected_Period".lmp_date, "Selected_Period".lmp_name_a, "Selected_Period".lmp_name_b, "Selected_Period".approved_lmp_name_a,
"Selected_Period".approved_lmp_name_b, "Selected_Period".approved_a_pnode, "Selected_Period".approved_b_pnode,
"Selected_Period".approved_a_sink, "Selected_Period".approved_b_sink, "Selected_Period".approved_a_source,
"Selected_Period".approved_b_source, "Selected_Period".he, "Selected_Period".lmp_da_a, "Selected_Period".lmp_da_b,
"Selected_Period".da_spread_diff, "Selected_Period".lmp_rt_a, "Selected_Period".lmp_rt_b,
"Selected_Period".rt_spread_diff, "Selected_Period".dart_a, "Selected_Period".dart_b, "Selected_Period".dart_spread_diff;


The tables above have lmp_date and he indexed.




The second part of the query then queries from the tempresults table. This can often be very slow.



-- stage 2
-- once the above data has been inserted into the temp table I then query the results.
-- this can take anywhere from twenty minutes to hours

SELECT
approved_lmp_name_a as "LMP A",
approved_lmp_name_b as "LMP B",
path as "Path",
approved_a_pnode as "pNode A",
approved_b_pnode as "pNode B",

sum(tempresults.da_spread_diff) as "Total DA Spread",
max(tempresults.da_spread_diff) as "Max DA Spread",
min(tempresults.da_spread_diff) as "Min DA Spread",

sum(tempresults.rt_spread_diff) as "Total RT Spread",
max(tempresults.rt_spread_diff) as "Max RT Spread",
min(tempresults.rt_spread_diff) as "Min RT Spread",

sum(tempresults.dart_spread_diff) as "Total DART Spread",
max(tempresults.dart_spread_diff) as "Max DART Spread",
min(tempresults.dart_spread_diff) as "Min DART Spread",

round(sum(filled_percentage)/count(*),2) as "Filled %",
sum(tempresults.filled_profit) as "Total Gain",
round(avg(tempresults.filled_profit),2) as "Average Gain",
Max(tempresults.filled_profit) as "Max Gain",
sum(tempresults.filled_loss) as "Total Loss",
round(avg(tempresults.filled_loss),2) as "Average Loss",
min(tempresults.filled_loss) as "Max Loss",
count(*) as "Population",
sum(tempresults.win_trigger) as "Total Winners",
sum(tempresults.loss_trigger) as "Total Losers",
round(sum(tempresults.win_trigger)/count(*),2) as "Total % Winners",
round(sum(tempresults.loss_trigger)/count(*),2) as "Total % Lossers"
FROM
public.tempresults
group by approved_lmp_name_a,
approved_lmp_name_b,
path,
approved_a_pnode,
approved_b_pnode;


Any advice would be much appreciated!





Aucun commentaire:

Enregistrer un commentaire