The Oracle Disk I/O Mechanism

David Clement

September 2004

Oracle versions 7-8


Oracle's mechanism for reading and writing current data is highly optimized and complex. It is not as simple as reading data from disk when it is needed and writing it to disk when it is saved. Instead, the objective is to maintain the highest possible performance by keeping as much of the most currently needed data in memory as possible.

At startup, an Oracle instance allocates space in the SGA to hold data from the database. This space is called the "database block buffer" or "database buffer cache." Since the smallest unit that the file system can address for reading or writing is a block, space in the database buffer cache is allocated in units derived from the file system blocks. If the file system block size is 2K, the Oracle data block size may be 2K, 4K, 8K, or something similar. The Oracle data block size is configured when the database is created and cannot be changed afterward. Recent versions of Oracle have some interesting new ways to manage block size, but we'll skip those for the purpose of this explanation.

In the database buffer cache, one buffer holds one Oracle data block, so the words "buffer" and "block" are used interchangeably in this context.

When a user process connects to the database, a server process (SP) is started to handle its requests. After a SQL statement is parsed, at the beginning of the fetch phase, the SP receives a service request and tries to serve it by finding the necessary data blocks in the database buffer cache. The SP can do this because each Oracle data block has a header that includes a directory of tables and a directory of rows. The SP consults these directories, and where it can find a block in the buffer cache that contains a row or some rows that the SP needs, this is called a "cache hit." If the SP does not find a block that it needs, this is called a "cache miss."

Cache Hit Processing

In a cache hit, the SP acquires the latch on the buffer header, then updates the status register in the block header to mark the buffer as "pinned" (meaning it's in use) and compares the block version number (BVN) to the transaction SCN. If the buffer data is current, then as described in "The Oracle Read-Consistency Model," the SP returns the data from the buffer cache to the user process, in a "current read," and if not, the SP has to get data from the rollback segments, in a "consistent read."

After the current or consistent read, the SP updates the LRU list. This is one of two lists that manage the buffer cache; the other one is called the "write list" and will be described shortly. The lists contain hash values that refer to the buffer headers, and both lists are managed by a third list, the cache buffer chain. The LRU list refers to the buffers in order from least recently used (LRU) to most recently used (MRU).

For performance reasons, the SGA contains a number of each of these lists, but to reduce complexity, this explanation will talk as if there was one of each.

The way the SP updates the LRU is that, having just read a block from the cache, the SP now swaps the pointers in the cache buffer chain so that this buffer becomes the MRU buffer. The buffer that was previously MRU is now MRU-1, the previous MRU-1 is now MRU-2, and so forth.

The buffer status is no longer "pinned" now that the data has been provided to the user process. If data in the buffer has been modified, then the SP makes the status "dirty," and if not, the status becomes "free." Finally, the SP releases the latch on the buffer header. This completes the cache hit processing.

For an interesting exception to the usual LRU-MRU algorithm, see "Where is FTS Data Cached?"

Cache Miss Processing

In a cache miss, the SP must read data blocks from the file system into the database buffer cache. This is done in a loop, in which the SP reads the status register in each buffer header, working from the LRU end of the cache toward the MRU end.

If the SP finds a "free" buffer, then the SP can use that buffer, and will read data blocks from the file system into it.

If the SP sees that the status is "pinned," it knows that some other process is working with the buffer, as described above, and it goes to the next buffer header.

If the SP sees that the status is "dirty," then the SP tries to reorder the cache buffer chain so as to move the pointer for the dirty buffer into the other buffer-management list, the write list (also called the LRUW and the dirty list). If it succeeds, then it goes to the next buffer header.

If, however, the write list is full, then the SP signals the Database Writer (DBWR) background process to write the buffers on the write list to disk. As DBWR writes those buffers out, it takes their pointers off the write list and makes their statuses "free." When DBWR signals completion, the SP returns to the top of the loop and starts reading status registers from the LRU end of the cache.

If the SP does not find any free buffers before it hits the "foreground scan depth" (a limit on the number of database buffer reads allowed, internally set but influenced by an initialization parameter) then it signals DBWR, just as it does when the write list is full. Again, on DBWR's signal of completion, the SP returns to the top of its loop.

This algorithm is how the SP acquires a free buffer as soon as possible, while ensuring that modified data is written safely to disk.