vendredi 27 mars 2015

Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL


It's months that I'm trying to solve a performance issue with PostgreSQL.


SYSTEM CONFIGURATION


Our deployment machine is a Dell PowerEdge T420 with a Perc H710 RAID controller configured in this way:



  • VD0: two 15k SAS disks (ext4, OS partition, WAL partition, RAID1)

  • VD1: ten 10k SAS disks (XFS, Postgres data partition, RAID5)


This system has the following configuration:



  • Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-48-generic x86_64)

  • 128GB RAM (DDR3, 8x16GB @1600Mhz)

  • two Intel Xeon E5-2640 v2 @2Ghz

  • Dell Perc H710 with 512MB cache (Write cache: "WriteBack", Read cache: "ReadAhead", Disk cache: "disabled"):

    • VD0 (OS and WAL partition): two 15k SAS disks (ext4, RAID1)

    • VD1 (Postgres data partition): ten 10k SAS disks (XFS, RAID5)



  • PostgreSQL 9.4 (updated to the latest available version)

  • moved pg_stat_tmp to RAM disk


My personal low cost and low profile development machine is a MacMini configured in this way:



  • OS X Server 10.7.5

  • 8GB RAM (DDR3, 2x4GB @1333Mhz)

  • one Intel i7 @2.2Ghz

  • two Internal 500GB 7.2k SAS HDD (non RAID) for OS partition

  • external Promise Pegasus R1 connected with Thunderbolt v1 (four 1TB 7.2k SAS HDD 32MB cache, RAID5, Write cache: "WriteBack", Read cache: "ReadAhead", Disk cache: "enabled", NCQ: "enabled")

  • PostgreSQL 9.0.13 (the original built-in shipped with OS X Server)

  • moved pg_stat_tmp to RAM disk


So far I've made a lot of tuning adjustments to both machines, including kernel reccomended ones on the official Postgres doc site.


APPLICATION


The deployment machine runs a web platform which instructs Postgres to make big transactions over billion of records. It's a platform designed for one user because system resources have to be dedicated as much as possible to one single job due to data size (I don't like to call it big data because big data are in the order ob ten of billion).


ISSUEs


I've found the deployment machine to be a lot slower than the development machine. This is paradoxal because the two machine really differs in many aspects. I've run many queries to investigate this strange behaviour and have done a lot of tuning adjustments.


During the last two months I've prepared and executed two type of query sets:



  • A: these sets make use of SELECT ... INTO, CREATE INDEX, CLUSTER and VACUUM ANALYZE.

  • B: these sets are from our application generated transactions and make use of SELECT over the tables created with set A.


A and B were always slower on T420. The only type of operation that was faster is the VACUUM ANALYZE.


RESULTS


A type set:



  • T420: went from 311seconds (default postgresql.conf) to 195seconds doing tuning adjustments over RAID, kernel and postgresql.conf;

  • MacMini: 40seconds.


B type set:



  • T420: 141seconds;

  • MacMini: 101seconds.


I've to mention that we have also adjusted the BIOS on T420 setting all possible parameters to "performance" and disabling low energy profiles. This lowered time execution over a type A set from 240seconds to 211seconds.


We have also upgrade all firmware and BIOS to the latest available versions.


Here are two benchmarks generated using pg_test_fsync:


T420 pg_test_fsync



60 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 23358.758 ops/sec 43 usecs/op
fdatasync 21417.018 ops/sec 47 usecs/op
fsync 21112.662 ops/sec 47 usecs/op
fsync_writethrough n/a
open_sync 23082.764 ops/sec 43 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 11737.746 ops/sec 85 usecs/op
fdatasync 19222.074 ops/sec 52 usecs/op
fsync 18608.405 ops/sec 54 usecs/op
fsync_writethrough n/a
open_sync 11510.074 ops/sec 87 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write 21484.546 ops/sec 47 usecs/op
2 * 8kB open_sync writes 11478.119 ops/sec 87 usecs/op
4 * 4kB open_sync writes 5885.149 ops/sec 170 usecs/op
8 * 2kB open_sync writes 3027.676 ops/sec 330 usecs/op
16 * 1kB open_sync writes 1512.922 ops/sec 661 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 17946.690 ops/sec 56 usecs/op
write, close, fsync 17976.202 ops/sec 56 usecs/op

Non-Sync'ed 8kB writes:
write 343202.937 ops/sec 3 usecs/op


MacMini pg_test_fsync



60 seconds per test
Direct I/O is not supported on this platform.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 3780.341 ops/sec 265 usecs/op
fdatasync 3117.094 ops/sec 321 usecs/op
fsync 3156.298 ops/sec 317 usecs/op
fsync_writethrough 110.300 ops/sec 9066 usecs/op
open_sync 3077.932 ops/sec 325 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)
open_datasync 1522.400 ops/sec 657 usecs/op
fdatasync 2700.055 ops/sec 370 usecs/op
fsync 2670.652 ops/sec 374 usecs/op
fsync_writethrough 98.462 ops/sec 10156 usecs/op
open_sync 1532.235 ops/sec 653 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)
1 * 16kB open_sync write 2634.754 ops/sec 380 usecs/op
2 * 8kB open_sync writes 1547.801 ops/sec 646 usecs/op
4 * 4kB open_sync writes 801.542 ops/sec 1248 usecs/op
8 * 2kB open_sync writes 405.515 ops/sec 2466 usecs/op
16 * 1kB open_sync writes 204.095 ops/sec 4900 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)
write, fsync, close 2747.345 ops/sec 364 usecs/op
write, close, fsync 3070.877 ops/sec 326 usecs/op

Non-Sync'ed 8kB writes:
write 3275.716 ops/sec 305 usecs/op


This conferms the hardware IO capabilities of T420 but doesn't explain why MacMini is MUCH MORE FAST.


Any ideas?





Aucun commentaire:

Enregistrer un commentaire