The Old Switcheroo

David Clement

8 January 2008

Oracle versions 9i-10g


How It Began

I worked on a database management team that was part of an AMR (automated meter reading) application development project. The team was drawn partly from consulting organizations and partly from the permanent staff of the AMR application provider. The target customers were Fortune 500 utility companies.

A test version of the application had recently been released to one of the target customers for them to play around with. The DB management team was still studying the rates of growth expected for the database, so we had provided only a preliminary estimate of database size. Suddenly the testing customer reported a fatal defect, the team switched into crisis mode, and life became exciting for a few weeks.

The problem was hard to diagnose and cost late nights of work and considerable anxiety for both the utility company and the AMR application provider, but fundamentally it was simple. You can make it into a rule: the hardware performance has to scale linearly, or nothing else will. I mean, duh.

The Old Switcheroo

The utility company had purchased a SAN from a major vendor to dedicate to the AMR application. The network administrator had configured the SAN with what he thought was sufficient storage, on the basis of our preliminary estimate, and, to keep the price down, had chosen one of the less expensive configurations. In this configuration, the enclosure contained some faster drives and some slower, cheaper drives. The SAN used the faster drives first; when disk utilization reached a threshold, it switched to using the slower drives. This seemed reasonable to the network administrator, so he did not mention it to the DB management team.

However, from version 9i onward, Oracle’s cost-based optimizer (CBO) periodically executes disk reads to gather system statistics about how the LUNs perform. Those statistics influence the optimizer’s choice of query execution plans. This works well in a storage environment that performs consistently, but if some of the LUNs have different performance characteristics from the rest, then the CBO has rubbish statistics and keeps choosing the wrong query execution plans.
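
As a concrete sketch of the mechanism (assuming SQL*Plus and DBA-level privileges; the exact statistic names vary slightly by release), this is roughly how workload system statistics are gathered and where the optimizer’s view of disk read times ends up:

    -- Start and stop a workload system-statistics collection (9i/10g).
    EXEC DBMS_STATS.GATHER_SYSTEM_STATS('START');
    -- ... let a representative workload run for a while ...
    EXEC DBMS_STATS.GATHER_SYSTEM_STATS('STOP');

    -- Inspect the result: SREADTIM and MREADTIM are the single-block and
    -- multiblock read times (in milliseconds) the CBO uses to cost plans.
    SELECT pname, pval1
    FROM   sys.aux_stats$
    WHERE  sname = 'SYSSTATS_MAIN';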

In the system we were supporting, as soon as the database filled 50 percent of the available storage (which turned out to be the threshold at which the SAN began using the slower drives), the performance of many DB queries suddenly changed, and not usually for the better. Several queries became performance nightmares, changing from subsecond response time to 20-minute response time, and back, unpredictably. One Web page in particular seemed to have gone berserk, which was why everyone thought we were looking at a bad bug.
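
In 10g, one rough way to spot that kind of instability is to compare the plans cached for a single statement in V$SQL; the sketch below uses a placeholder LIKE pattern rather than any real query from the application:

    -- Group cached SQL by plan and compare the average elapsed time per
    -- execution, to see whether the same statement flips between a fast
    -- plan and a slow one. The LIKE pattern is only a placeholder.
    SELECT sql_id, plan_hash_value, executions,
           ROUND(elapsed_time / GREATEST(executions, 1) / 1000) AS avg_elapsed_ms
    FROM   v$sql
    WHERE  sql_text LIKE '%suspect query text%'
    ORDER  BY sql_id, plan_hash_value;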

Diagnosis and Solution

Diagnosis was difficult. We had to ignore the appearance of an application bug and discover five things:

  • that query execution plans kept changing (you would expect them to stabilize),
  • that a given plan (such as a fast full index scan) for a given query sometimes performed well and sometimes did not,
  • that performance had gone haywire when utilization reached 50 percent of the available storage,
  • that the vendor offered a configuration where performance characteristics changed as storage space utilization crossed a threshold,
  • and finally, that the network administrator had chosen such a configuration.

When the DB management team figured out what had happened, we manually reorganized the DB into a tighter storage allocation so that the slower LUNs were not used. This was a somewhat old-fashioned solution, but it solved the performance problem, for a while at least.
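
The reorganization itself was plain DDL of the kind every DBA knows; a simplified, hypothetical sketch (the table, index, and tablespace names here are invented) looks like this:

    -- Move a segment into a tablespace whose datafiles sit on the fast LUNs;
    -- the move invalidates the table's indexes, so rebuild them afterward.
    ALTER TABLE meter_read MOVE TABLESPACE fast_data;
    ALTER INDEX meter_read_pk REBUILD TABLESPACE fast_idx;
    -- Repeat for the other large segments, then shrink or drop the datafiles
    -- on the slow LUNs so the database cannot grow onto them again.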

Take-Aways

From a data team manager’s perspective, this issue could have been avoided by better communication of storage requirements to the utility company, and by better communication of the storage hardware configuration from the utility company to the DB management team.

From a DB architect’s perspective, this issue emphasizes the need to estimate DB storage early and to allocate it generously, on consistently performing storage hardware, from the beginning of elaboration through the end of construction.

From a DBA’s perspective, the issues are:

  • first, you need to be sure that you know what choice the customer is making when they purchase storage hardware, and
  • second, even after years of development of OFA (Optimal Flexible Architecture) and other standards, an RDBMS is still sensitive to the performance of the storage devices (at least, it is if you mess it up), so sometimes a manual re-org is still the way to correct a performance problem.

All in all, I guess most of us felt that we were pretty dumb to let this problem develop to the point where we had to be very smart to solve it.