Monthly Archives: January 2013

Execution Plans Are For SQL Statements

Developers are well aware of the need to understand the execution plan for a given SQL statement. Their training has them always looking for the lowest cost, but not really analyzing the access path. However, when a given statement includes a function or is a PL/SQL anonymous block, some still try to find the cost through an execution plan. That's not the right tool. It's important to understand the difference and use the right tool, or you just spin your wheels.

An execution plan is only for SQL statements. It provides valuable information on all the access paths chosen to arrive at a given answer for a given statement. There's a wealth of information available that covers how to analyze execution plans to determine whether the optimizer has all the right statistics and the access paths are the most efficient. But execution plans are only for SQL statements.

So if execution plans are only for SQL statements, how can PL/SQL code be reviewed and analyzed to determine if it's efficient, or to find the bottlenecks in the code? The answer is DBMS_PROFILER.

What's captured? The total number of times each line has been executed, the total amount of time spent executing that line, and the minimum and maximum times spent on a particular execution of that line.

DBMS_PROFILER must first be installed. Your DBA can check and, if necessary, install it as SYSDBA:

$ORACLE_HOME/rdbms/admin/profload.sql

The tables will then need to be created in the schema that is executing the procedures. If you have more than one user then, of course, you will need to create them under each of the users. Alternatively, you could create a test account that owns the DBMS_PROFILER tables.

$ORACLE_HOME/rdbms/admin/proftab.sql

To utilize DBMS_PROFILER, it's as simple as modifying the interesting anonymous block, procedure, package or function like the following:

create or replace procedure test_dbms_profile
as
begin
  dbms_profiler.start_profiler('test_dbms_profile');
  -- series of statements
  dbms_profiler.stop_profiler;
end;
/

Alternatively, you can start and stop the profiler outside the procedure, like the following:

execute dbms_profiler.start_profiler('test_dbms_profile');
execute test_dbms_profile;
execute dbms_profiler.stop_profiler;

The above method is great when you are working in production or another environment where modifying the code may not be easy or desirable, as well as when you need to profile a series of PL/SQL calls.

Execute the procedure.

Now, to review the data from DBMS_PROFILER you can use a simple query such as the following:

BREAK ON runid ON run_owner ON run_comment ON run_secs

select a.runid, a.run_owner, a.run_comment,
       a.run_total_time/1000000000 run_secs, c.total_occur,
       c.total_time/1000000000 line_total_secs, c.line#, u.text
from   plsql_profiler_runs  a,
       plsql_profiler_units b,
       plsql_profiler_data  c,
       user_source          u
where  a.runid       = b.runid
and    a.runid       = c.runid
and    b.unit_number = c.unit_number
and    b.unit_name   = u.name
and    c.line#       = u.line;
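To zero in on the hot spots, a variation that sorts by the time spent per line can be handy. A sketch for a single run; the &runid substitution variable is just a placeholder for the run you want to review:

select u.unit_name, d.line#, s.text, d.total_occur,
       d.total_time/1000000000 line_total_secs
from   plsql_profiler_units u,
       plsql_profiler_data  d,
       user_source          s
where  u.runid       = d.runid
and    u.unit_number = d.unit_number
and    u.unit_name   = s.name
and    d.line#       = s.line
and    u.runid       = &runid
order  by d.total_time desc;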

Alternatively, there is a script available that formats the report in HTML.

In a future post I will delve into using DBMS_PROFILER to assist in tuning PL/SQL within production environments.

Insert Statements And Duplicates

The data is partially incorrect in a table in development and they want the data brought over from QA. The caveat, as there always is: only copy the data that is unique. Unique by the primary key, all columns, or some columns? In any of those cases a special query needs to be written to select the distinct values for the columns they wish to use for uniqueness. There should be a better way. MySQL has the INSERT IGNORE statement that will ignore duplicates; why can't Oracle? It does! It's implemented as a hint.

IGNORE_ROW_ON_DUPKEY_INDEX(table_name, index_name)

Here’s an example:


declare
  sSQL varchar2(400);
begin
  for tab in
    (select a.table_name, b.constraint_name
     from (select table_name from dba_constraints
           where r_constraint_name = 'TAB_PK') a,
          (select constraint_name, table_name from dba_constraints
           where constraint_type = 'P') b
     where a.table_name = b.table_name)
  loop
    sSQL := 'insert /*+ ignore_row_on_dupkey_index('||tab.table_name||','
            ||tab.constraint_name||')*/ into claims.'||tab.table_name
            ||' (select * from claims.'||tab.table_name
            ||'@hmqa01 minus select * from claims.'||tab.table_name||')';
    dbms_output.put_line(sSQL);
    execute immediate sSQL;
    dbms_output.put_line(SQL%ROWCOUNT);
  end loop;
end;
/

I prefer the dynamic SQL; it allows me to utilize dbms_output to capture the statements. This is handy when you need to provide the developer with the statements that were run, as well as capturing the number of rows. Yes, I know I didn't include exception handling.
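If exception handling is wanted, a minimal sketch is to wrap the EXECUTE IMMEDIATE in its own block so one failing table doesn't stop the whole run. The table, constraint and database link names are the same hypothetical ones used above:

declare
  sSQL varchar2(400);
begin
  for tab in
    (select a.table_name, b.constraint_name
     from (select table_name from dba_constraints
           where r_constraint_name = 'TAB_PK') a,
          (select constraint_name, table_name from dba_constraints
           where constraint_type = 'P') b
     where a.table_name = b.table_name)
  loop
    sSQL := 'insert /*+ ignore_row_on_dupkey_index('||tab.table_name||','
            ||tab.constraint_name||')*/ into claims.'||tab.table_name
            ||' (select * from claims.'||tab.table_name
            ||'@hmqa01 minus select * from claims.'||tab.table_name||')';
    begin
      dbms_output.put_line(sSQL);
      execute immediate sSQL;
      dbms_output.put_line(SQL%ROWCOUNT||' rows inserted');
    exception
      when others then
        -- record the failure and move on to the next table
        dbms_output.put_line('FAILED: '||sqlerrm);
    end;
  end loop;
end;
/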

What a time-saving hint.

Devops — Agile Moves to Operations

In the world of constantly changing technology a debate is raging. On one side you have the movement for flexibility and faster, more frequent deployments at the cost of unexpected downtime and issues. On the other side, the desire and need for stability. The debate? How to bridge the gap between the two sides — developers and operations. The answer: DEVOPS.

DEVOPS seeks to align developers and operations with the business need for continuous delivery, not just of code but of infrastructure. This naturally scares operations, and for good reasons. DEVOPS.COM provides additional information as well as resources on the concept.

A short synopsis of the problem: on one side of IT you have a silo of developers who want to code, create, and build. Their end game is deployment. The more features and changes the developers deploy, the more business objectives they meet.

On the other side you have a silo of operations. Operations is charged with keeping production up, as well as the development, test, and integration environments. Over the years operations has learned that the less you change, the more stable the environment, and the less they are called in the middle of the night or during their child's birthday party on the weekend. Those of us who have spent our careers working in operations are way too familiar with problematic changes. Deployments contribute to the biggest pain points. Faster and more frequent deployments introduce more potential issues. It's easy to see why operations desires a stable environment. The stable environment meets business objectives by achieving the SLA (service level agreement).

Two completely separate teams sit within each of the silos. On one hand continuous change, and on the other stability, which equates to less frequent change. Neither approach is wrong, but together they create friction. An us-against-them mentality tends to prevail. When a deployment occurs and problems ensue, fingers begin pointing. Frustration looms over operations as they try to fight the fire, while developers are quick to point out that it worked when they tested it in their environment. Operations doesn't understand how items are deployed with issues; obviously all the necessary testing wasn't completed. Developers can't understand why the configurations are different.

How did we build these silos? Over the years operations and development have been viewed as separate. Operations developed their own set of processes (think ITIL, the IT Infrastructure Library), often teetering on the verge of death by process. The name of the game: stability. Building, configuring and testing infrastructure takes time. Operations is charged with system uptime and maintaining SLAs that can be as high as 99%.

The business wants continuous uptime while increasing the number of features and bug fixes rolled out quickly to meet customer desires (demands). Developers answer with their own process, Agile. They code, test and deploy. The name of the game is small changes in a small window continuously flowing into production.

The reality of fast continuous delivery from an operations view is instability. With every deployment, problems occur: some right away, others weeks or months down the road. Sometimes the problems don't manifest themselves until another release is deployed. Developers, in turn, view continuous delivery as impeded by archaic processes that delay releases for weeks.

DEVOPS seeks to break down the silos, with operations and developers working as one team with the same objectives and a common terminology. The important part is that this is not a technology but an attempt to fix a business process and align IT with the business strategy.

What's important from an operations standpoint is that decreasing delivery time doesn't necessarily mean increasing issues. Software-defined data centers promise to allow quick, continuous deployments while protecting the stability of the environment; vendors such as Amazon and Puppet Labs, to name two, are working in this space. In my opinion we have turned the corner from a time when uptime was crucial and faults were barely tolerated to one of greater tolerance as we increase our pace; just how far that will continue remains to be seen.

There's an ongoing conversation about anti-fragility. The theory is proposed by Nassim Nicholas Taleb in his book Antifragile: Things That Gain from Disorder. The premise is simple: the antifragile ".. is that category of things that not only gain from chaos but need it in order to survive and flourish." The question is how this applies to complex systems within the realm of IT.

Before we can change our behavior within operations, we have to change our culture. And not just in operations but within development as well. Instead of two separate entities existing in our silos, throwing work over the wall to other groups, we need to become one team aligned with the business strategy. The starting point is conversations and rebuilding the culture.

Redo Log Re-Sizing

The only way to re-size redo logs is to create new ones and drop the old ones. This method has been the standard... the only way to re-size redo logs since at least 7.1.4 (that's when I started working with Oracle; hopefully that doesn't age me).

Let’s start with reviewing the current redo logs:
select (a.bytes/1024)/1024 mb, a.status, b.member, b.type, b.group#
from   v$log a, v$logfile b
where  a.group# = b.group#;

50 CURRENT /u01/app/oracle/oradata/redo/redo01a.log ONLINE 1
50 CURRENT /u01/app/oracle/oradata/redo/redo01b.log ONLINE 1
50 INACTIVE /u02/app/oracle/oradata/redo/redo02a.log ONLINE 2
50 INACTIVE /u02/app/oracle/oradata/redo/redo02b.log ONLINE 2
50 INACTIVE /u03/app/oracle/oradata/redo/redo03a.log ONLINE 3
50 INACTIVE /u03/app/oracle/oradata/redo/redo03b.log ONLINE 3

To create additional redo log files:

alter database add logfile group 4 ('/u01/app/oracle/oradata/redo/redo4a.log',
'/u01/app/oracle/oradata/redo/redo4b.log') size 1024m;
alter database add logfile group 5 ('/u01/app/oracle/oradata/redo/redo5a.log',
'/u01/app/oracle/oradata/redo/redo5b.log') size 1024m;
alter database add logfile group 6 ('/u01/app/oracle/oradata/redo/redo6a.log',
'/u01/app/oracle/oradata/redo/redo6b.log') size 1024m;

Check which ones are active:

select (a.bytes/1024)/1024 mb, a.status, b.member, b.type, b.group#
from   v$log a, v$logfile b
where  a.group# = b.group#;

50 CURRENT /u01/app/oracle/oradata/redo/redo01a.log ONLINE 1
50 CURRENT /u01/app/oracle/oradata/redo/redo01b.log ONLINE 1
50 INACTIVE /u02/app/oracle/oradata/redo/redo02a.log ONLINE 2
50 INACTIVE /u02/app/oracle/oradata/redo/redo02b.log ONLINE 2
50 INACTIVE /u03/app/oracle/oradata/redo/redo03a.log ONLINE 3
50 INACTIVE /u03/app/oracle/oradata/redo/redo03b.log ONLINE 3

Perform log switches until the new log file is current:

alter system switch logfile;

Checkpoint to flush the data. This allows the release of the redo logs; if you attempt to drop a redo log that is still required for crash recovery, it will error:

alter system checkpoint global;

Now drop the logfile groups that contain the files that are being replaced.

alter database drop logfile group 1;
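Repeat for the remaining original groups once they show INACTIVE (run another checkpoint if needed). Using the groups from this example:

alter database drop logfile group 2;
alter database drop logfile group 3;
-- dropping a group does not delete its files from disk; the old members
-- (the redo01*, redo02* and redo03* files above) can now be removed at the OS level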

You might be wondering how one determines the redo size to start with. Or maybe: how will I know I need to re-size my redo?

Let's start with the latter. How frequently are log switches happening? Just looking at the alert.log gives an indication. But how pervasive are the switches?

Thread 1 advanced to log sequence 4696 (LGWR switch)
Current log# 1 seq# 4696 mem# 0: /u01/app/oracle/oradata/redo01a.log
Current log# 1 seq# 4696 mem# 1: /u01/app/oracle/oradata/redo01b.log
Thu Jan 24 09:11:16 2013
Archived Log entry 4404 added for thread 1 sequence 4695 ID 0x5104d81c dest 1:
Thread 1 cannot allocate new log, sequence 4697
Checkpoint not complete
Current log# 1 seq# 4696 mem# 0: /u01/app/oracle/oradata/redo01a.log
Current log# 1 seq# 4696 mem# 1: /u01/app/oracle/oradata/redo01b.log
Thu Jan 24 09:11:25 2013
Thread 1 advanced to log sequence 4697 (LGWR switch)
Current log# 2 seq# 4697 mem# 0: /u02/app/data/oradata/redo02a.log
Current log# 2 seq# 4697 mem# 1: /u02/app/data/oradata/redo02b.log
Thu Jan 24 09:11:25 2013
Archived Log entry 4405 added for thread 1 sequence 4696 ID 0x5104d81c dest 1:
Thread 1 cannot allocate new log, sequence 4698
Checkpoint not complete
Current log# 2 seq# 4697 mem# 0: /u02/app/data/oradata/redo02a.log
Current log# 2 seq# 4697 mem# 1: /u02/app/data/oradata/redo02b.log
Thread 1 advanced to log sequence 4698 (LGWR switch)
Current log# 3 seq# 4698 mem# 0: /u03/app/data/oradata/redo03a.log
Current log# 3 seq# 4698 mem# 1: /u03/app/data/oradata/redo03b.log
Thu Jan 24 09:11:34 2013
Archived Log entry 4406 added for thread 1 sequence 4697 ID 0x5104d81c dest 1:
Thu Jan 24 09:12:01 2013
Thread 1 cannot allocate new log, sequence 4699
Checkpoint not complete
Current log# 3 seq# 4698 mem# 0: /u03/app/data/oradata/redo03a.log
Current log# 3 seq# 4698 mem# 1: /u03/app/data/oradata/redo03b.log

To actually understand how frequently the log switches are occurring we need to take a look at the history across multiple days.

select inst_id instance,
to_char(first_time,'DD-MON-YYYY') day,
to_char(sum(decode(to_char(first_time,'HH24'),'00',1,0)),'999') "00",
to_char(sum(decode(to_char(first_time,'HH24'),'01',1,0)),'999') "01",
to_char(sum(decode(to_char(first_time,'HH24'),'02',1,0)),'999') "02",
to_char(sum(decode(to_char(first_time,'HH24'),'03',1,0)),'999') "03",
to_char(sum(decode(to_char(first_time,'HH24'),'04',1,0)),'999') "04",
to_char(sum(decode(to_char(first_time,'HH24'),'05',1,0)),'999') "05",
to_char(sum(decode(to_char(first_time,'HH24'),'06',1,0)),'999') "06",
to_char(sum(decode(to_char(first_time,'HH24'),'07',1,0)),'999') "07",
to_char(sum(decode(to_char(first_time,'HH24'),'08',1,0)),'999') "08",
to_char(sum(decode(to_char(first_time,'HH24'),'09',1,0)),'999') "09",
to_char(sum(decode(to_char(first_time,'HH24'),'10',1,0)),'999') "10",
to_char(sum(decode(to_char(first_time,'HH24'),'11',1,0)),'999') "11",
to_char(sum(decode(to_char(first_time,'HH24'),'12',1,0)),'999') "12",
to_char(sum(decode(to_char(first_time,'HH24'),'13',1,0)),'999') "13",
to_char(sum(decode(to_char(first_time,'HH24'),'14',1,0)),'999') "14",
to_char(sum(decode(to_char(first_time,'HH24'),'15',1,0)),'999') "15",
to_char(sum(decode(to_char(first_time,'HH24'),'16',1,0)),'999') "16",
to_char(sum(decode(to_char(first_time,'HH24'),'17',1,0)),'999') "17",
to_char(sum(decode(to_char(first_time,'HH24'),'18',1,0)),'999') "18",
to_char(sum(decode(to_char(first_time,'HH24'),'19',1,0)),'999') "19",
to_char(sum(decode(to_char(first_time,'HH24'),'20',1,0)),'999') "20",
to_char(sum(decode(to_char(first_time,'HH24'),'21',1,0)),'999') "21",
to_char(sum(decode(to_char(first_time,'HH24'),'22',1,0)),'999') "22",
to_char(sum(decode(to_char(first_time,'HH24'),'23',1,0)),'999') "23"
from gv$log_history
group by inst_id,to_char(first_time,'DD-MON-YYYY')
order by inst_id, to_char(first_time,'DD-MON-YYYY');

Several factors come into play when determining the size of the redo files: the amount of redo being generated and the tolerance for data loss. The above query will provide you a grid showing the number of log switches within each hour over the course of several days. You may have heard at one point that your log switches should be approximately 10 minutes apart. The goal is to archive a previously used redo log file before LGWR switches back to that file. Remember, redo logs are used in a round-robin fashion. If the database is in archive log mode, a file will not be written to until it has been archived.

Redo log files should be large enough to ensure that there are no excessive checkpoints, which decrease performance. Batch and data warehouse databases require larger redo, while OLTP will generally have smaller redo. Usually the switch interval is somewhere in the neighborhood of 10 – 30 minutes, but keep in mind that every system is different.
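One way to put a number on the switch interval is to average the time between switches. A sketch against v$log_history for the last 7 days (adjust the window to taste):

select round(avg(gap_minutes), 1) avg_minutes_between_switches
from  (select (first_time
               - lag(first_time) over (order by first_time)) * 24 * 60 gap_minutes
       from   v$log_history
       where  first_time > sysdate - 7);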

You can set ARCHIVE_LAG_TARGET in the init parameter file to force a log switch at a set interval; just ensure that the redo logs are large enough to handle the redo being generated.
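For example, to force a switch at least every 30 minutes (the parameter is in seconds; 1800 is just an illustrative value):

alter system set archive_lag_target = 1800 scope=both;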

Within 11g we can simply run this query to assist in determining the size.

select optimal_logfile_size from v$instance_recovery;

I personally will estimate the amount of redo when building a new database, and then watch it over the course of the first months it's in production. Usually I'm able to understand the data and its activity level while testing.

Lack of Exception Handling Yet Again

What's a good method of getting those deeply layered exceptions to propagate for effective troubleshooting? My personal favorite is dbms_utility.format_error_backtrace. Even if the code doesn't have exception handling, I will at times add this to the code in order to troubleshoot effectively. It's easier, of course, in non-production systems, but the same can be done in production. If someone is monitoring recompiled code or the creation of objects, this might get flagged. I don't recommend it for frequently executed code in production without, of course, going through your change control process. It could end up being nasty if you're not careful.

Simply add dbms_output.put_line(dbms_utility.format_error_backtrace) to your exception handling:


CREATE OR REPLACE PROCEDURE third
IS
BEGIN
  DBMS_OUTPUT.put_line('I''m the first');
  first;
EXCEPTION
  WHEN OTHERS
  THEN
    DBMS_OUTPUT.put_line(DBMS_UTILITY.FORMAT_ERROR_STACK);
    DBMS_OUTPUT.put_line(DBMS_UTILITY.FORMAT_ERROR_BACKTRACE);
END;
/

I used both FORMAT_ERROR_STACK and FORMAT_ERROR_BACKTRACE because the backtrace doesn't include the error message; alternatively, SQLERRM could be used as well. Depending on how deep into the call stack the error is being raised, it may serve you well to SUBSTR the error message. Maybe you only need the ORA error number and line number.

Remember, if using multiple levels of exception handlers, you will need to re-raise the exception for it to reach the outermost handler.
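A minimal sketch of re-raising from an inner handler so the error propagates out to the outermost handler, where the backtrace and stack can be logged:

begin
  begin
    raise no_data_found;   -- something fails in the inner block
  exception
    when others then
      dbms_output.put_line('inner handler saw: '||sqlerrm);
      raise;               -- re-raise so the outer handler also fires
  end;
exception
  when others then
    dbms_output.put_line(dbms_utility.format_error_stack);
    dbms_output.put_line(dbms_utility.format_error_backtrace);
end;
/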

A developer can make friends with the DBA staff by including well-defined exception handling within their code. And if they don't, we can guide them to the light. When all else fails we can add in the needed lines for troubleshooting. I find this so much easier than trying to debug in production. Just because you can doesn't mean that you should.

America’s Most Promising Company — Forbes.com

America's Most Promising Company, an article on Forbes.com. They are hitting the niche in the market not only with mobile but with the use of unstructured data. The possibilities are very intriguing, especially the unstructured comments within social media. How do you find the relevant comments that could make a difference or be acted upon? In my experience most people only comment on the bad, never the good. The old adage: one bad experience will be repeated to at least 10 people, while one great experience may never be repeated to anyone! Makes me wonder whether others are like me and tend to be skeptical of the good comments and focus on the negatives. I've somehow drawn the conclusion that the awesome comments are fake, posted by interested parties. But couldn't the same be said for the bad reviews and comments? Just a different interested party? The market that the company has landed in looks very promising, along with the leadership and management style. If only they were closer to Tampa.

Pesky Log File Sync Waits

The complaint: "… saving is taking forever". Translation — commits are taking longer than usual.

Now that I understand what the users are experiencing, it's time to determine the root cause by reviewing the system statistics.

First stop: what are the top 5 wait events? I can check the EM12c performance page and, by looking at the graph, see whether there is a spike or increase in the commit wait.

Run an AWR Report
< Insert Copy of Report for this section >

Run a query. There are many blogs where you can find a query if you don't have one of your own.

Here's a query from Aalap Sharma that makes use of the dba_hist tables.

Or you could query the v$system_event view, but remember this is cumulative since startup:
select * from (
select event, total_waits, total_timeouts, time_waited
from v$system_event
where event not like 'SQL*Net%'
and event not in ('pmon timer','rdbms ipc message','dispatcher timer','smon timer')
order by time_waited desc )
where rownum < 6;

It's been said time and again that if a wait isn't a significant wait event then ignore it, because the gains will be insignificant. However, if the users are complaining that the system slows down when they hit the save button, then it needs to be addressed. In my experience the commit wait will show up in the top wait events while the complaints are occurring, even though the wait may still not be significant overall. In a recent example the commit wait was only 7% of the wait time, while I/O comprised the majority. However, the experience of the end users is important. Why they are experiencing a slow response when they commit needs to be addressed. User experience trumps the overall standing of the wait from the system perspective.

So now I can see that, yes, there is a spike in the commit wait, and I need to determine the root cause. This is one of those pesky waits that can be due to increased redo generation from an increased workload, disk subsystem latency, CPU or memory starvation, excessive commits, as well as other contention within the database. The bottom line is that LGWR isn't posting changes to disk fast enough, causing the end users to wait on their saves or commits.
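Before digging further, it helps to know whether the waits are slow or merely frequent. A quick sketch against v$event_histogram (cumulative since instance startup) shows how the log file sync waits are distributed by latency:

select wait_time_milli, wait_count
from   v$event_histogram
where  event = 'log file sync'
order  by wait_time_milli;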

If I check the number of commits compared to other time periods, I can determine whether there is an increase in the number of commits. In my case I have EM12c running, and since I utilize physical standbys I like to alert when redo begins to increase to a level where the standbys are in jeopardy of falling behind (the physical standby is generally OK, but we make use of a logical standby for reporting, and when the logical begins to fall behind complaints will soon follow).

Here is an example of the redo generation graph from EM12c (it can be found under Data Guard Performance). The yellow line marks the warning level and the red the critical level.

This example is the same graph but for the last 7 days. It can also be broken out into 2 weeks and 31 days.

Just by looking at the graph I can see that there is definitely an uptick in redo during the last few days. To review the details I use a few queries to retrieve the number of commits, user calls, executions and other system statistics that could indicate whether we are actually performing more work.

The additional stats can also be reviewed in the AWR report. Keep in mind the snap interval: the report is aggregated over the period of the snap interval. Be careful using averages, because they don't provide an accurate view of the high and low points.

The DBA_HIST views provide a wealth of information, in particular the following (a sample query follows the list):

dba_hist_snapshot –> the list of snapshots for a given period

dba_hist_osstat –> OS stats

dba_hist_sys_time_model –> time model statistics at each snap

dba_hist_sysstat –> system stats at each snap

dba_hist_sysmetric_history –> metrics being captured
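For example, a sketch that trends 'user commits' per AWR snapshot over the last week; the statistic is cumulative, so the delta between snapshots is what matters (assumes a single instance, and deltas can go negative across an instance restart):

select sn.snap_id,
       sn.end_interval_time,
       st.value - lag(st.value) over (order by sn.snap_id) commits_in_interval
from   dba_hist_snapshot sn,
       dba_hist_sysstat  st
where  sn.snap_id         = st.snap_id
and    sn.dbid            = st.dbid
and    sn.instance_number = st.instance_number
and    st.stat_name       = 'user commits'
and    sn.begin_interval_time > sysdate - 7
order  by sn.snap_id;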

To capture a shorter period of time for particular stats I will run:

select name, value from v$sysstat where name = 'user commits';

Save the value off and capture it again. At this point I'm only after an idea based on previous performance. This is where knowing your application and your database's typical performance is useful.
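A quick sketch for measuring the rate over a fixed window instead of eyeballing two captures (it assumes a direct grant on v_$sysstat for the PL/SQL block and execute on dbms_lock):

-- delta of 'user commits' over a 60-second window
declare
  v_start number;
  v_end   number;
begin
  select value into v_start from v$sysstat where name = 'user commits';
  dbms_lock.sleep(60);
  select value into v_end from v$sysstat where name = 'user commits';
  dbms_output.put_line('user commits in 60s: '||(v_end - v_start));
end;
/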

We already know that we are generating peak redo, and over a longer period of time than usual. The challenge now becomes finding out where that redo is being generated.

Tanel Poder's Snapper is the perfect tool for handling this type of investigation. His blog contains all the information on using the tool in this scenario. If you haven't downloaded the tool, I highly recommend it, and not just for this situation. The tool is extremely useful in investigating production issues.

What I discovered was a peculiar SQL statement running from a session as SYS, inserting into the stats$user_log table. The information being logged was for username PUBLIC. The stats$user_log table is used by logon/logoff triggers. I found the trigger in question:

insert into stats$user_log
select user,
       sys_context('USERENV', 'SESSIONID'),
       sys_context('USERENV', 'HOST'),
       null,
       null,
       null,
       sysdate,
       to_char(sysdate, 'hh24:mi:ss'),
       null,
       null,
       null
from   sys.dual
where  user not in ('SYSMAN', 'DBSNMP', 'SYS');

The trigger only had three exclusions. Of course, now I'm wondering about the PUBLIC user: what is its use, and why is it logging on and off so frequently?

It turns out the PUBLIC user is internal and used by Data Guard during gap resolution. The user cannot be used outside of Oracle; in other words, no one can log in as PUBLIC.

Great: if the user PUBLIC is only used internally, then I can modify the trigger and this user will no longer be tracked. The redo generation should decrease. Of course, that brings up another question: why is my standby falling behind? But I'll save that for another day.
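The change itself is small: add PUBLIC to the exclusion list in the trigger body. A sketch of the modified insert (everything else stays as it was):

insert into stats$user_log
select user,
       sys_context('USERENV', 'SESSIONID'),
       sys_context('USERENV', 'HOST'),
       null, null, null,
       sysdate,
       to_char(sysdate, 'hh24:mi:ss'),
       null, null, null
from   sys.dual
where  user not in ('SYSMAN', 'DBSNMP', 'SYS', 'PUBLIC');  -- PUBLIC added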

There's also a great post on the log file sync wait, along with how LGWR works, on the Pythian website by Riyaj Shamsudeen.