Monthly Archives: April 2013

Row Lock Contention: Scalability Killer

We experience major issues with TX Row Lock Contention. The design of the application has two tables that are central and updated from not only the UI but also batch processing and even some ad hoc queries. We don’t queue UI connections outside the database so it’s typical that we can witness 150 plus sessions blocked by a batch process. Prior to my arrival the solution was intermittent commits which as we all are aware can create performance issues.

I decided to drill further into not only the sessions that were during the blocking or being blocked but find out if in fact all the sessions were actually performing updates. I wanted to be able to show the developers what issue needed to be re-mediated with the details. The query I captured from the gv$sessions was very clear:

update table_a set col_a = :B1;

Yes every single session waiting on TX row lock contention was trying to execute the above sql. The question was were these all different rows or the same row.

From the gv$session we are provided with the object, file, block and row number of the row the blocked session is waiting on. We can then utilize dbms_rowid.rowid_create to find the rowid.

select dbms_rowid.rowid_create(1,19175,26,2781508,39) from daul;

19175 object_id
26 file number
2781508 block number
39 row number

The output is a rowid for the row that is waiting to be updated. We can then determine how many are duplicates.

obj file block row no rowid
19175 30 219580 32 AAAErnAAeAAA1m8AAg
19175 30 218792 31 AAAErnAAeAAA1aoAAf
19175 13 1622245 36 AAAErnAANAAGMDlAAk
19175 30 218903 2 AAAErnAAeAAA1cXAAC
19175 13 3776009 25 AAAErnAANAAOZ4JAAZ
19175 26 608808 33 AAAErnAAaAACUooAAh
19175 30 218855 16 AAAErnAAeAAA1bnAAQ
19175 30 219027 12 AAAErnAAeAAA1eTAAM
19175 30 219027 12 AAAErnAAeAAA1eTAAM
19175 13 1622245 36 AAAErnAANAAGMDlAAk
19175 26 608808 33 AAAErnAAaAACUooAAh
19175 28 489128 16 AAAErnAAcAAB3aoAAQ
19175 13 1622245 36 AAAErnAANAAGMDlAAk
19175 30 218903 2 AAAErnAAeAAA1cXAAC
19175 30 218903 2 AAAErnAAeAAA1cXAAC
19175 13 3776009 25 AAAErnAANAAOZ4JAAZ
19175 13 3775121 8 AAAErnAANAAOZqRAAI
19175 30 218903 2 AAAErnAAeAAA1cXAAC
19175 16 2953416 22 AAAErnAAQAALRDIAAW
19175 28 489458 29 AAAErnAAcAAB3fyAAd
19175 26 609544 28 AAAErnAAaAACU0IAAc
19175 26 2781612 2 AAAErnAAaAAKnGsAAC
19175 13 3775121 8 AAAErnAANAAOZqRAAI
19175 16 2953121 29 AAAErnAAQAALQ+hAAd
19175 30 219580 32 AAAErnAAeAAA1m8AAg
19175 30 218903 2 AAAErnAAeAAA1cXAAC

Based on the above output, there are few duplicates but for the most part the sessions are waiting on different rows. What rows are being waited on and is it a legitimate wait or are we updating a row that isn’t actually being updated. The data isn’t in fact being changed.

I can use the output rowid to find out what row in the table is being updated.

select * from table_a where rowid in (select dbms_rowid.rowid_create(1,19175,26,2781508,39) from dual;

At this point I’m not 100% sure what data I want to key in on within the table so i just request it all. Where is the query being generated from a procedure or UI? In my case it is actually a procedure that is executed when table_b is updated. Fortunately for us we have the rows populated with last_changed_by, last_changed_on and last_changed_thru so I have some key columns that I can review to see if they are in fact updating the data.

select b.id b_id, a.id a_id, a.last_changed_on, a.last_changed_by, a.last_changed_thru,
b.last_changed_on, b.last_changed_by, b.last_changed_thru from table_a a, table_b b
where a.id = b.a_id and a.id in (select id from table_a where rowid in
select dbms_rowid.rowid_create(1,19175,26,2781508,39) from dual)) order by b.last_changed_on desc;

When I reviewed the data I found that although table_b was in fact being updated, table_a was not necessarily being updated. I pulled up the code and found that the procedure that updates table_a was being called whenever table_b was updated whether or table_a needed to be updated or not. Oracle wasn’t in fact updating the data even though an update was performed. Which is expected behavior — Oracle is smart enough to not update even when it’s told to update a record if the column data matches. The bad news even though the data isn’t in fact updated, the DML is still executed and a row lock is still obtained.

Once I had all the details behind the problem I was able to show the developers the issue for proper remediation.

This made me think about past occurring blocking locks. Was this same information available to me within the DBA_HIST_ACTIVE_SESS_HISTORY table — provided of course there was no row movement. It would appear so.

select o.object_name, s.sample_time,s.session_id,s.event,s.blocking_session,s.sql_id,current_obj#, current_file#, current_block#, current_row#, dbms_rowid.rowid_create ( 1, o.DATA_OBJECT_ID, current_FILE#, current_BLOCK#, current_ROW# ) from dba_hist_active_sess_history s, dba_objects o where sample_time > sysdate - 1 and
object_name = 'TABLE_A' and blocking_session is not null and event='enq: TX - row lock contention' and s.current_OBJ# = o.OBJECT_ID

I bumped the output of the above query against the output from my earlier search, as well as all the data. The DBA_HIST_ACTIVE_SESS_HISTORY does provide the details for historical research. This will come in handy in very near future.

Logical Standby Out of Sync — again!

I'm not a big fan of data guard's logical standby.  In my experience we have way too many issues that crop up and I spend way too much time having to manage the standby.  In my opinion this valuable time that could be spent on other areas that are more pressing.  The logical standby is used for reporting so when it's out of sync my attention is diverted to it and that happens way too often.  But it is what it is at this time.  Hopefully the future will bring some change that will eliminate this requirement.

In the meantime, one of the most problematic areas is when the logical standby is missing archive logs.  We have had many different scenarios where archive logs end up being deleted before being applied.  Why that happens is another conversation although it's mostly human error.

The problem is compounded because the primary utilize ASM and RAC, and the logical is filesystem single instance.  When we need to move archivelogs from primary to logical we follow these general steps:

1.  On the primary host. Copy the archivelog from ASM to filesystem.  This can be done via a RMAN, ASMCMD or DBMS_TRANSFER (DBMS_TRANSFER is the only way for 10g).  I prefer the ASMCMD cp simply because it's an easy shell script. Ensure you are logged in as a user with the correct credentials to execute asmcmd:


#!/bin/ksh
for i in $(asmcmd ls +FRA/ORCL/ARCHIVELOG/2013_04_19/*);
do
asmcmd cp +FRA/ORCL/ARCHIVELOG/2013_04_19/$i /u01/temp
done

Substitute the * for the characters that you are interested in copying. Usually I'm looking for specific sequence numbers and possible the thread number.

2. Move the files from the primary host to the logical standby host. I prefer scp but there are other methods that could be used as well.

scp thread_1_seq_* user@host:/path

substitute the correct filename with wildcard, as well as the user, host and path for the remote destination.

Now that the files are on the logical standby host there is a good chance that the names need to be changed. Since the primary archive format is thread ... but the logical just starts with the thread number. Not a big thing if only dealing with a few files, but with a hundred or more a shell script can ease the pain.

3. On the logical standby host. Execute a shell script that performs a mass rename of files. Yes there are many ways that this could probably be written, but I believe in keeping it simple.


#!/bin/ksh
for file in $(ls /path/filename*)
do
substr=`echo $file | cut -d '_' -f 4 | cut -d '.' -f 1`
mv ${file} 1_${substr}_69837890223.arc
done

In the above code the substitute the correct path for the path. And utilize wildcards and the actual filename to grab all the files that you would like to rename. This will be specific to your needs.

The substr utilizes the cut to remove everything from the last underscore (_) and the second one removes everything from the first period (.). So my file format for achive logs is thread_1_seq_1111111.22222.dbid. This means the sequence number that I must maintain begins right after the second underscore (_) or is the 4th group and before the first period. The first cut removes the first three groups plus the underscores (_) leaving the sequence number as the first group. The second cut then removes everything after the first group. I'm left with just the sequence number which I need in the filename for the logical standby.

The mv takes the $file and renames it to the correct format, thread_seq_dbid.arc.

A review of the alert.log on the standby shows that the apply process picks up the correctly named archive logs and begins to apply them.

Avoiding the whole rebuild process. Depending on how far behind will depend on how long it takes. For us we avoid impacting production by avoiding the replication process and are able to catchup 24 hours in about 2 hours without any gotchas.

Sequences Don’t Provide Order

I inherited our production RAC, and like everything else surprises appear from time to time. It’s just the nature of the job.

The application the database supports utilize sequence numbers for primary key. Which is not idea since there are no natural primary keys defined. The use of the sequence isn’t the real issue, the fact that they are used for ordering is. On the surface sequence defined with order are selected in an incrementing fashion. So the latest one should be the highest number, but not necessarily. And it’s the not necessarily that is the gotcha.

In a single instance database the sequence appear to have a nice order. Sequence request are occurring from only one instance. Unless the process (or multiple processes) are firing off request for sequences chances are the higher the sequence number the newer the record. So if you have an application that is receiving say receipts from end users and assigning a sequence number it may be safe to assume the higher number is the latest if there was a re-submission. Notice the use of “safe to assume” — it’s not a guarantee. However, in a RAC environment the chances that the sequences are not nicely ordered with the highest being the newest record receive even if ordered is set on the sequence How is this possible?

There are at least one instance, but more than likely more than one instance performing the same function. Now imagine the request for a sequence occurring from more than one instance. Each of these instances cache the sequence (avoid performance implications do not set the nocache option). So now if there is a re-submission the possibility that it occurs from a different instance that has a lower sequence number cached is possible.

A sequence is nothing more than a number generated by Oracle. We can provide some order but we can’t guarantee order. A sequence has no meaning it’s a number that is given meaning within the application by being assigned to a record within a table. It tells us nothing about the record. It’s just a number. So why would use this number to order the records by? To order records we need an attribute that has meaning. An attribute that definitely identifies a record as being the newest, or oldest. An attribute that tells us we received this record before we received the other record or vice versa. So what type of attribute would provide us order?

TIMESTAMP. A timestamp has the meaning that is needed when ordering a group of records based on when they were received. Now when we have two records with all the same identification from an end-user, and we need to ensure we use the latest one submitted we can guarantee this by sorting on a timestamp that identifies when it was received. An entity that provides for critical data based on submission, say refunds we need to ensure that we have the latest receipt. We can only provide this guarantee by using a timestamp.

Sequences have provided developers with a false sense of security that they can order by the sequence and guarantee the order is correct. In a single instance database this may have provided for the illusion that sequences guaranteed order. However, with a RAC implementation this illusion meets reality.

It’s not a RAC issue, it’s a design issue. It doesn’t just occur within RAC. A misunderstanding of sequences leads to the wrong attribute used for ordering whether RAC or single instance.