Tricks with Analytic Functions

David Clement

4 March 2008

Oracle versions 8-10g

1. Introduction

Oracle introduced SQL analytic functions in 8.1.6. As I recall, the documentation stated that their purpose was to support data mining operations in data warehouses. However, the use of these functions is by no means limited to data warehousing. They are intellectually interesting and an amazing performance enhancement at the same time. The enhancement is that, as a rule, a SQL query using analytic functions can generate results from one single pass for complex operations that otherwise would require self-joins or multiple scans of the same data.

This tip page provides a couple of examples of using SQL analytic functions in real-world situations. It does not introduce SQL analytic functions to the puzzled junior user or provide examples that are already widely available. For introductions to SQL analytic functions, you can see Martin Zahn's helpful articles "Analytic Functions in Oracle 8i and 9i" and "The Power of Analytic Functions." I choose to cite Zahn rather than other sites because Zahn covers the syntax in full, gives examples very concisely, and provides a really clever application of analytic functions to deduplication. There are lots of other good authors, such as Rene Nyffenegger, and Tom Kyte has been discussing the uses of analytic functions for a long time, if you follow "Ask Tom." Finally, there is a book on the subject that I have just found and have not read yet, by R. W. Earp and S. S. Bagui, Advanced SQL Functions in Oracle 10g. It looks solid. All of these people explain the syntax and the most common applications, such as Top-N querying and data pivoting.

This tip is not meant to duplicate their work. Instead of that, I want to collect a few examples of cases where analytic functions were used

  • to solve a real-world problem
  • in a manner beyond the commonly cited, textbook applications.

2. A Data Window

In this example, data is acquired by polling hardware devices, and the results are inserted into an Oracle table. Every now and then, on demand, the application has to show the user the current trend. Is the hardware running hotter? Is it cooling down?

One way to do that is to use the lag function. Here is an example.

     SELECT last_time_read, prev_time_read, last_value_read, prev_value_read,
            prev_value_read - last_value_read AS trend
       FROM (SELECT time_read last_time_read, value_read last_value_read,
                    LAG (time_read, 1) OVER (ORDER BY time_read) prev_time_read,
                    LAG (value_read, 1) OVER (ORDER BY time_read) prev_value_read,
                    DENSE_RANK () OVER (PARTITION BY device_id ORDER BY time_read DESC)
                 AS rnk
               FROM realtime_device_reads
              WHERE device_id = &1 AND valid_flg = TRUE)
      WHERE rnk = 1;
The nested select statement generates a result set ordered by the time read for every device identifier; that's what the dense_rank... partition by... order by... part specifies. The rank number is returned in this result set as a column called rnk. The nested query also adds to every row two columns from the row ranking just below it: that's what the lag (time_read, 1) and lag (value_read, 1) parts do.

In the outer query, the where rnk = 1 condition limits what is returned to the front-end application: only the four values from the most recent and the last previous row, and the calculated difference between them.

3. Only the Latest Row

It is a common problem that a table contains a mixture of current and historical data and there is a frequent need to report the current data. There are several ways to model your way out of this problem. For instance, you could update rows instead of inserting, so that the table always contains only current data. If the historical data is critical for some other need, you might consider using a before-insert trigger to save the out-of-date row elsewhere before updating it. Another possibility is to use a flag column for the current rows. These data modeling solutions really seem better to me than code solutions.

However, I often see tables that mingle historical data and current data, for which you have to write queries that look like this.

     SELECT device_id
       FROM realtime_device_reads rdr
      WHERE time_read = (SELECT MAX (time_read)
                           FROM realtime_device_reads
                          WHERE device_id = rdr.device_id);

The correlated subquery is ANSI-standard SQL, but it is not very fast, and it scales badly: the larger the table, the worse the query will perform. Another possibility is to rewrite the query so that it uses an in-line view, like this.

     SELECT device_id
       FROM realtime_device_reads rdr,
            (SELECT   MAX (time_read)
                 FROM realtime_device_reads
             GROUP BY device_id) latest
      WHERE rdr.time_read = latest.time_read 
        AND rdr.device_id = latest.device_id

The in-line view may not always perform well. It will depend on reading the same table twice and hash-joining the results. A more elegant solution would be to read the table only once. This can be done with a SQL analytic function as follows.

     SELECT latest.device_id
       FROM  (SELECT device_id,
                     ROW_NUMBER () OVER (PARTITION BY device_id 
            ORDER BY time_read DESC NULLS LAST) rn
                FROM realtime_device_reads) latest
      WHERE rn = 1;

This is simpler than the previous example. The nested select adds a number to every row in the table, such that the latest row is always number 1. Then the outer query, with its condition where rn = 1, returns only the rows numbered 1.

Performance improvements from this kind of query rewriting are sometimes as large as reducing a 20-hour job to 20 minutes.

4. More Sources

Tom Kyte covers SQL analytic functions in Chapter 12 of his book, Expert One-on-One Oracle (see the sidebars to this page) and I've just recently read a nice article by Gary Goodman on the Hotsos Web site, "The Performance Dangers of Database Independent Applications." Both of these sources, as well as those cited onat the beginning of this article, have influenced this tip page.