The complaint “… saving is taking for ever”. Translation — commits are taking longer than usual.
Now I understand what the users are experiencing time to determine root cause by reviewing the system statistics.
First stop what are the top 5 events? I can check in EM12c performance grid and looking at the graph I can see whether there is a spike or increase in the commit wait.
Run an AWR Report
< Insert Copy of Report for this section >
Run a query. There are many blogs were you can find a query if you don’t have one of your own.
Or you could query the gv$system view but remember this cumulative since start-up
select * from (
select event, total_waits, total_timeouts, time_waited
where event not like 'SQL*Net%'
and event not in ('pmon timer','rdbms ipc message','dispatcher timer','smon timer')
order by time_waited desc )
where rownum < 6;
It’s been said time and again that if the wait isn’t a significant wait event then ignore because the gains will be insignificant. However, if the users are complaining the system slows down when they hit the save button, then it needs to be addressed. In my experience the commit wait will show up in the top wait events during when the complaints are occurring, however, the wait may still not be significant. A recent example the commit wait was only 7% of the wait, while I/O comprised the majority of the time. However, the experience of the end users is important. Why are they experiencing a slow response when they commit needs to be addressed. User experience trumpets the overall standing of the wait from the system perspective.
So now that I can see that yes there is a spike in the commit wait. I need to determine the root cause. This is one of those pesky waits that can be due to increase redo generation due to increase workload, disk subsystem latency, CPU or Memory starvation, excessive commits, as well as other contentions within the database. The bottom line is LGWR isn’t posting changes to disk fast enough causing the end-user to wait on their saves or commits.
If I check the number of commits compared to an other time periods I can determine whether there is an increase in the number of commits. In my case I have EM12c running and since I utilize physical standby I like to alert when redo begins to increase to a level where the physical standbys are in jeopardy of falling behind (generally OK on the physical standby, but we make use of logical for reporting and when the logical begins to fall behind complaints will soon follow).
Here is an example of the redo generation graph from EM12c (it can be found under Data Guard Performance). The yellow line marks the warning level and the red critical.
This example is the same graph but for the last 7 days. Can also break out into 2 weeks and 31 days.
Just by looking at the graph I can see that there is differently an uptick of redo during the last few days. To review the details I do use a few queries to retrieve the number of commits, user calls, executes and other system statistics that could determine whether we are actually performing more work.
The additional stats can also be reviewed in the AWR report. Keep in mind the snap interval this report is aggregated over the period of the snap interval. Be careful using the average because it doesn’t provide an accurate view the high and low points.
The DBA_HIST views provide a wealth of information in particular
dba_hist_snapshot –> the list of snapshots for a given period
dba_hist_osstat –> OS stats
dba_hist_sys_time_model –> system stats at each snap
dba_hist_sysstat –> system stats at each snap
dba_hist_sysmetric_history –> Metric’s being captured
To capture shorter period of time for particular stats I will run
select name, value from v$sysstat where name = 'user commits';
Save the value off and capture it again. At this point I’m only after an idea based on previous performance. This is where knowing your application and your database performance is useful.
We already know that we are generating peak redo and over a longer period of time. The challenge now becomes finding out where that redo is being generated.
Tanel Poder snapper is the perfect tool for handling this type of investigation. His blog contains all the information to using the tool in this scenario. If you haven’t downloaded the tool I highly recommend it not just for this situation. The tool is extremely useful in investigating production issues.
What I discovered was a peculiar SQL statement running from a session as SYS. Inserting into the stats$user_log table. The information being logged was for username PUBLIC. The stats$user_log table is used for logon/off triggers. I found the trigger in question:
insert into stats$user_log
where user not in
('SYSMAN', 'DBSNMP', 'SYS',);
The trigger only had three exclusions. Of course, now I’m wondering about the PUBLIC user, what’s it’s use, and why is it logging on an off so frequently.
Turns out the PUBLIC user is internal and used by data guard during gap resolution. The user can not be used outside of Oracle, in other words no one can log in as PUBLIC.
Great if the user PUBLIC is only used internally then I can modify the trigger and this user will no longer be tracked. The redo generation should decrease. Of course, that brings up another question why is my standby following behind? But I’ll save that for another day.
There’s also a great post on log file sync wait along with LGWR work on Pythain website by Riyaj Shamsudeen.