I currently work as a SQL Server DBA in the power industry for a company that provides reliability services for the Western Interconnection (https://bit.ly/2QCgFc9). One service this company provides is Real-Time Contingency Analysis (RTCA): the existing and potential operating conditions of the power grid are evaluated based on real-time data about monitored equipment. For example, if the voltage on a transmission line exceeds a high or low limit, reliability operators will want to take action to mitigate or circumvent the adverse consequences of that event. These operators are sitting in control rooms right now, watching displays that pull data from data servers every few minutes.

After deploying a recent upgrade of a critical application that displays (among other things) RTCA data, we noticed a substantial performance degradation in some associated jobs running on the data server: SSIS jobs that used to run in about 10-15 seconds were taking 48-50 seconds.

We used SQL Profiler to identify the specific underlying queries (the events were Stored Procedures | RPC:Completed, TSQL | SQL:BatchCompleted, and TSQL | SQL:BatchStarting; we selected only the specific database in question and filtered for execution times > 100 ms) and then ran the queries manually.
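(As an aside, for anyone who prefers a server-side capture to the Profiler GUI, a rough Extended Events equivalent of that trace would look something like the following; the session name, database name, and target file path are placeholders, and the duration predicate is in microseconds.)

CREATE EVENT SESSION [slow_app_queries] ON SERVER
ADD EVENT sqlserver.rpc_completed (
    WHERE sqlserver.database_name = N'app_database'
      AND duration > 100000),   -- microseconds, i.e., > 100 ms
ADD EVENT sqlserver.sql_batch_completed (
    WHERE sqlserver.database_name = N'app_database'
      AND duration > 100000)
ADD TARGET package0.event_file (SET filename = N'slow_app_queries.xel');

ALTER EVENT SESSION [slow_app_queries] ON SERVER STATE = START;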
The culprits were SELECT statements on views with a half-dozen or more joins (inner and left outer) over underlying user-defined tables and views, which can be very confusing to troubleshoot (development wrote these views and queries, by the way). We isolated the problem to a WHERE clause that uses the LIKE keyword (I've changed/redacted names and strings so I'm not revealing anything sensitive):
SELECT
    [column_1], ..., [column_30+]
FROM
    [schema_name].[big_complicated_view]
WHERE
    [column_n] NOT LIKE 'XYZ%'
ORDER BY
    [column_5] DESC, ....
Here, column_n is a varchar(m) column, where m < 25. It made sense to me that this WHERE clause was the problem, because I knew that a negated predicate like NOT LIKE in a WHERE clause can render the query non-SARGable (i.e., the query engine isn't able to use an index seek to speed up the search). Looking at the execution plan verified that the query was scanning the index, so I replaced NOT LIKE 'XYZ%' as follows:
SELECT
    [column_1], ..., [column_30+]
FROM
    [schema_name].[big_complicated_view]
WHERE
    LEFT([column_n], 3) <> 'XYZ'
ORDER BY
    [column_5] DESC, ....
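(To verify what the optimizer is doing before and after a change like this, you can pull the cached plans and average timings from the plan-cache DMVs. This is a generic sketch, not our exact diagnostic query; the LIKE filter stands in for the redacted statement text, and total_elapsed_time is in microseconds.)

SELECT TOP (10)
    qs.execution_count,
    qs.total_elapsed_time / qs.execution_count / 1000.0 AS avg_elapsed_ms,
    st.text       AS query_text,
    qp.query_plan -- click through in SSMS to see the graphical plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE st.text LIKE N'%big_complicated_view%'
ORDER BY qs.total_elapsed_time DESC;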
This search argument is still not very SARGable, but the query time dropped back down to around 0.7 seconds. We asked QA to test it in development, and they sent us a client patch to deploy (some of the queries originate from the client application on the application-server tier).

About two days later, performance degraded again, improving briefly after we updated statistics on five tables underlying the views in question. For the short term, I set up a maintenance job to update statistics daily. We have two physical sites (> 1,000 miles apart) and fail over critical applications roughly every month, so we updated statistics daily on the primary/active side but left the secondary/inactive side alone, the idea being to compare the two execution plans after about a week. That comparison revealed that the plan on the secondary/inactive side was using a hash join and an index scan, building an in-memory hash table of more than 800,000 rows.

We also became aware of changes to the cardinality estimator in SQL Server 2016 that can result in this kind of behavior with hash joins. We set the database compatibility level to 2012, and performance improved to the previous level (queries completed in about 0.7-1 seconds). However, this caused poor performance for another application that accesses the same database to generate reports at daily and 15-minute intervals, so we couldn't leave the database at the 2012 compatibility level.

Using the Query Store, we determined that at the 2016 compatibility level, the queries were regressing to sub-optimal execution plans shortly after we updated statistics. We also noticed that the plan used at the 2012 compatibility level was essentially the same as the optimal plan at the 2016 compatibility level. We ultimately switched the compatibility level back to 2016 and used the Query Store to force the optimal plan. We haven't seen a recurrence of the performance issue for these queries, which we're monitoring closely.
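(For readers who want the moving parts in one place, here's a sketch of the kinds of commands involved. The database, schema, and table names are placeholders, and the @query_id/@plan_id values for sp_query_store_force_plan come from the Query Store catalog views, or from the Regressed Queries report in SSMS.)

-- Update statistics on one of the underlying tables (placeholder name).
UPDATE STATISTICS [schema_name].[underlying_table] WITH FULLSCAN;

-- Compatibility level: 110 = SQL Server 2012 (legacy CE), 130 = SQL Server 2016.
ALTER DATABASE [app_database] SET COMPATIBILITY_LEVEL = 130;

-- Find the regressed query and its candidate plans in the Query Store...
SELECT qsq.query_id, qsp.plan_id, qsrs.avg_duration
FROM sys.query_store_query AS qsq
JOIN sys.query_store_plan AS qsp
    ON qsp.query_id = qsq.query_id
JOIN sys.query_store_runtime_stats AS qsrs
    ON qsrs.plan_id = qsp.plan_id;

-- ...and pin the plan that performs well (IDs here are placeholders).
EXEC sp_query_store_force_plan @query_id = 42, @plan_id = 7;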
So why would
WHERE LEFT([column_n], 3) <> 'XYZ'
... perform better than
WHERE [column_n] NOT LIKE 'XYZ%'
... when it's also not SARGable? Because [big_complicated_view] is built on at least two more layers of views, the execution plans are not exactly a quick read, so I'll have to research this more when I get time. For now, my best guess is that SQL Server is able to partially leverage the primary clustered index on column_n in the table underlying these views, and consequently does not need to execute the LEFT() function on every row in the table; the string on the right side of the inequality operator gives SQL Server a range to focus on. If that's the case, it's an example of how SARGability is a continuous rather than a discrete metric: if a query isn't SARGable, the underlying statistics and indexes will be less helpful, but that doesn't mean they won't be of any help at all.
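(If you want to poke at this yourself, here's a minimal, self-contained repro sketch; all object names are hypothetical and the data is synthetic, so don't expect it to reproduce our exact plans. Run each SELECT with the actual execution plan enabled and compare the operators and row estimates.)

-- Toy table approximating the real one: a short varchar code as the clustered PK.
CREATE TABLE dbo.demo_equipment (
    equipment_code varchar(24) NOT NULL,
    CONSTRAINT PK_demo_equipment PRIMARY KEY CLUSTERED (equipment_code)
);

-- Seed a few thousand rows: a mix of 'XYZ' and other prefixes.
INSERT dbo.demo_equipment (equipment_code)
SELECT CASE WHEN v.n % 10 = 0 THEN 'XYZ' ELSE 'ABC' END
       + RIGHT('000000' + CAST(v.n AS varchar(6)), 6)
FROM (SELECT TOP (5000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
      FROM sys.all_columns AS c1 CROSS JOIN sys.all_columns AS c2) AS v;

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Predicate 1: negated LIKE.
SELECT equipment_code
FROM dbo.demo_equipment
WHERE equipment_code NOT LIKE 'XYZ%';

-- Predicate 2: function wrapped around the column.
SELECT equipment_code
FROM dbo.demo_equipment
WHERE LEFT(equipment_code, 3) <> 'XYZ';

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;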