The Old Switcheroo
8 January 2008
Oracle versions 9i-10g
How It Began
I worked on a database management team that was part of an AMR (automated meter reading) application development project. The staff came partly from consulting organizations and partly from the permanent staff of the AMR application provider. The target customers were Fortune 500 utility companies.
A test version of the application had recently been released to one of the target customers for them to play around with. The DB management team was still studying the rates of growth expected for the database, so we had provided only a preliminary estimate of database size. Suddenly the testing customer reported a fatal defect, the team switched into crisis mode, and life became exciting for a few weeks.
The problem was hard to diagnose and involved late nights of work and considerable anxiety on the part of the utility company and of the AMR application provider, but fundamentally it was simple. You can make it into a rule: The hardware performance has to scale linearly, or nothing else will. I mean, duh.
The Old Switcheroo
The utility company had purchased a SAN from a major vendor to dedicate to the AMR application. The network administrator had configured the SAN with what he thought was sufficient storage, on the basis of our preliminary estimate, and had chosen one of the less expensive configurations, because of the price. In this configuration, the enclosure contained some faster drives and some slower, cheaper drives. The SAN used the faster drives first. When disk utilization reached a threshold, the SAN switched to using the slower drives. This seemed reasonable to the network administrator, so he did not mention it to the DB management team.
However, Oracle’s cost-based optimizer (CBO) in versions above 8 periodically executes disk reads to gather system statistics about LUN performance. The performance statistics influence the optimizer’s choice of query execution plans. This works well in a storage environment that performs consistently, but if some of the LUNs have different performance characteristics from the rest, then the CBO has rubbish statistics, and keeps choosing the wrong query execution plans.
In the system we were supporting, as soon as DB utilization reached 50 percent of the available storage, the performance of many DB queries suddenly changed, and not usually for the better. Several queries became performance nightmares, changing from subsecond response time to 20-minute response time, and back, unpredictably. There was a particular Web page that seemed to have gone berserk, which was why everyone thought we were looking at a bad bug.
Diagnosis and Solution
Diagnosis was difficult. We had to ignore the appearance of an application bug and discover six things:
From a data team manager’s perspective, this issue could have been avoided by better communication of storage requirements to the utility company, and by better communication of the storage hardware configuration from the utility company to the DB management team.
From a DB architect’s perspective, this issue emphasizes the need to estimate DB storage early and to allocate it generously, on consistently performing storage hardware, from the beginning of elaboration through the end of construction.
From a DBA’s perspective, the issues are,
All in all, I guess most of us felt that we were pretty dumb to let this problem develop to the point where we had to be very smart to solve it.