Tuning ‘log file sync’ Event Waits
In this blog entry, I will discuss strategies and techniques to resolve ‘log file sync’ waits. This entry is intended to show an approach based upon scientific principles, not necessarily a step-by-step guide. First, let’s understand how LGWR is integral to the implementation of the commit mechanism.
Commit Mechanism and LGWR Internals
At commit time, a process creates a redo record (containing commit opcodes) and copies that redo record into the log buffer. Then, that process signals LGWR to write the contents of the log buffer. LGWR writes from the log buffer to the log file and then signals the user process back, completing the commit. A commit is considered successful only after the LGWR write completes.
Of course, there are minor deviations from this general concept, such as latching, commits from a PL/SQL block, or IMU-based commit generation, but the general philosophy remains the same.
Signals, Semaphores and LGWR
The following section introduces the internal workings of the commit and LGWR interaction on Unix platforms. There are minor implementation differences between Unix flavours, and platforms such as NT/XP use post-wait drivers instead of semaphores, for example. This section will introduce, but not necessarily dive deep into, these internals. I used truss to trace LGWR and the user process. The command is:
truss -rall -wall -fall -vall -d -o /tmp/truss.log -p 22459
(A word of caution: don’t truss LGWR or any background process unless it is absolutely necessary. Doing so can cause performance issues, or worse, shut down the database.)
Initially, LGWR is sleeping on a semaphore, using the semtimedop or semop call.
22459/1: semtimedop(9, 0xFFFFFD7FFFDFE648, 1, 0xFFFFFD7FFFDFE488) EAGAIN
22459/1:         semnum=15  semop=-1  semflg=0
22459/1:         timeout: 2.060000000 sec
In the above call, 9 is the semaphore set id, visible through the ipcs command, and semnum=15 identifies the semaphore for the LGWR process in that set. The next argument is a sembuf structure:
{
    unsigned short sem_num;  /* semaphore number */
    short          sem_op;   /* semaphore operation */
    short          sem_flg;  /* operation flags */
}
The third argument is the number of semaphores.
When a session commits, a redo record is created and copied into the log buffer. Then that process posts the LGWR semaphore using a semctl call, if LGWR is not already active. The process then goes to sleep with a semtimedop call on its own semaphore. The semaphore set id is still 9, but sem_num is 118, which belongs to the user process I was tracing. The first semctl call below posts LGWR; the process then sleeps on the semtimedop call.
27396/1: semctl(9, 15, SETVAL, 1) = 0
...
27396/1: semtimedop(9, 0xFFFFFD7FFFDFC128, 1, 0xFFFFFD7FFFDFBF68) = 0
27396/1:         semnum=118  semop=-1  semflg=0
27396/1:         timeout: 1.000000000 sec
The waiting log writer gets a 0 return code from semtimedop and writes the redo records to the current redo log file. The kaio calls are kernelized asynchronous I/O calls on the Solaris platform.
22459/7: pwrite(262, "01 "9E0E i ?".., 1024, 1915904) = 1024
22459/9: pwrite(263, "01 "9E0E i ?".., 1024, 1915904) = 1024
22459/1: kaio(AIOWAIT, 0xFFFFFD7FFFDFE310) = 1
22459/1:         timeout: 600.000000 sec
22459/9: kaio(AIONOTIFY, 0) = 0
22459/7: kaio(AIONOTIFY, 0) = 0
After successful completion of the write(s), LGWR posts the semaphore of the waiting process using the semctl call.
22459/1: semctl(9, 15, SETVAL, 1) = 0
The user process/session continues after receiving a 0 return code from the semtimedop call, reprinted below.
27396/1: semtimedop(9, 0xFFFFFD7FFFDFC128, 1, 0xFFFFFD7FFFDFBF68) = 0
So, What Exactly is a ‘log file sync’ Wait?
A commit is not complete until LGWR writes the log buffers, including the commit redo records, to the log files. In a nutshell, after posting LGWR to write, user or background processes wait for LGWR to signal back, with a 1-second timeout. The user process charges this wait time as a ‘log file sync’ event. In the sequence traced above, the ‘log file sync’ wait starts right after the semctl call that posts LGWR, and completes when LGWR posts the user process back after the write.
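While the problem is occurring, it is easy to see who is stuck in this state. The following is a minimal sketch (v$session_wait and these columns are standard in recent releases) that lists sessions currently waiting on the event; sample it a few times:

select sid, state, seconds_in_wait
from   v$session_wait
where  event = 'log file sync';

A consistently large number of rows returned here is a strong hint that the rest of this analysis is worth the effort.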
The Root Causes of ‘log file sync’ Waits
The root causes of ‘log file sync’ waits essentially boil down to a few scenarios. The following is not an exhaustive list, by any means.
- LGWR is unable to complete writes fast enough for one of the following reasons:
  - Disk I/O performance to the log files is not good enough. Even though LGWR can use asynchronous I/O, redo log files are opened with the DSYNC flag, and buffers must be flushed to the disk (or at least, written to the disk array cache in the case of a SAN) before LGWR can mark a commit as complete.
  - LGWR is starving for CPU resources. If the server is very busy, LGWR can starve for CPU too. This will lead to slower response from LGWR, increasing ‘log file sync’ waits. After all, these system calls and I/O calls must use CPU. In this case, ‘log file sync’ is a secondary symptom, and resolving the root cause of the high CPU usage will reduce ‘log file sync’ waits.
  - Due to memory starvation issues, LGWR can be paged out. This also can lead to slower response from LGWR.
  - LGWR is unable to complete writes fast enough due to file system or Unix buffer cache limitations.
- LGWR is unable to post the processes fast enough, due to excessive commits. It is quite possible that there is no starvation for CPU or memory, and that I/O performance is decent enough. Still, if there are excessive commits, LGWR has to perform many writes and semctl calls, and this can increase ‘log file sync’ waits. This can also result in a sharp increase in the redo wastage statistics.
- IMU undo/redo threads. With private strands, a process can generate a few megabytes of redo before committing. LGWR must write all of the redo generated so far, and processes must wait on ‘log file sync’ waits, even if the redo generated by other processes is small.
- LGWR is suffering from other database contention, such as enqueue waits or latch contention. For example, I have seen LGWR freeze due to CF enqueue contention. This is a possible scenario, albeit an unlikely one.
- Various bugs. Oh yes, there are bugs introducing unnecessary ‘log file sync’ waits.
It is worthwhile to understand and identify the root cause, and resolve it. Here are the steps and considerations.
1. First, make sure the ‘log file sync’ event is indeed a major wait event. For example, in the 60-minute statspack report below, ‘log file sync’ is indeed an issue. Why? Statspack ran for 3,600 seconds and there are 8 CPUs in the server, so roughly 28,800 CPU seconds were available. There is just one database in this server, so the approximate CPU usage is 7,034/28,800, or about 24%.
But 27,021 seconds were spent waiting. On average, 27,021/3,600 = 7.5 processes were waiting for the ‘log file sync’ event. So, this is a major bottleneck for application scalability.
Top 5 Timed Events
~~~~~~~~~~~~~~~~~~                                              % Total
Event                                    Waits     Time (s)    Ela Time
---------------------------------- ------------ ------------ ----------
log file sync                         1,350,499       27,021      50.04
db file sequential read               1,299,154       13,633      25.25
CPU time                                                7,034      13.03
io done                               3,487,217        3,225       5.97
latch free                              115,471        1,325       2.45
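Cumulative system-level counters give a quick sanity check of the average ‘log file sync’ latency. A minimal sketch (time_waited_micro assumes 10g or later; on older releases use time_waited, which is in centiseconds):

select event, total_waits,
       round(time_waited_micro / nullif(total_waits, 0)) avg_wait_us
from   v$system_event
where  event = 'log file sync';

In the statspack report above, 27,021 seconds over 1,350,499 waits works out to roughly 20 ms per commit, far too high for healthy redo I/O.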
2. Next, identify and break down the LGWR wait events by querying v$session_event for the LGWR session. In this instance, the LGWR sid is 3 (and it usually is).
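Rather than assuming the sid, it can be derived. A minimal sketch using v$bgprocess, which should work on most releases:

select s.sid
from   v$session s, v$bgprocess b
where  s.paddr = b.paddr
and    b.name = 'LGWR';

With the sid confirmed, the event breakdown follows: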
select sid, event, time_waited, time_waited_micro
from v$session_event where sid=3 order by 3
SQL> /
   SID EVENT                          TIME_WAITED TIME_WAITED_MICRO
------ ------------------------------ ----------- -----------------
..
     3 control file sequential read        237848        2378480750
     3 enqueue                             417032        4170323279
     3 control file parallel write         706539        7065393146
     3 log file parallel write             768628        7686282956
     3 io done                           40822748        4.0823E+11
     3 rdbms ipc message                208478598        2.0848E+12
When LGWR is waiting (using the semtimedop call) for posts from the user sessions, that wait time is accounted as the ‘rdbms ipc message’ event. Normally, this event can be ignored. The next-highest wait is the ‘io done’ event. After submitting async I/O requests, LGWR waits until the I/O calls complete, since LGWR writes are done as synchronous writes. (“Asynchronous” and “synchronous” are not contradictory terms when it comes to I/O. Google it; there is an enormous amount of information about this already.)
It is worth noting that v$session_event holds cumulative counters from instance startup, and hence can be misleading. The difference between two snapshots of this view, for the same session, is far more useful.
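If a snapshot tool is not available, a manual delta is easy to take. A rough sketch (the table name lgwr_evt_snap is hypothetical, and sid 3 is assumed from above):

create table lgwr_evt_snap as
select event, time_waited_micro
from   v$session_event
where  sid = 3;

-- ...wait for the interval of interest, then diff...

select e.event,
       e.time_waited_micro - s.time_waited_micro delta_micro
from   v$session_event e, lgwr_evt_snap s
where  e.sid = 3
and    e.event = s.event
order  by 2 desc;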
Tanel Poder has written an excellent tool for this. Using that tool, we can print out a 1-second snapshot for LGWR session 3. From the table below, 813 milliseconds were spent waiting for the ‘io done’ event in a 1-second interval. That’s 81%. (Some parts of the output have been removed to improve readability.)
---------------------------------------------------------------------
 SID, SNAPSHOT START    , SEC, TYPE, STATISTIC                ,  DELTA,
---------------------------------------------------------------------
   3, 20080513 11:44:32,    1, STAT, messages sent            ,      9,
   3, 20080513 11:44:32,    1, STAT, messages received        ,    153,
   3, 20080513 11:44:32,    1, STAT, redo wastage             ,  39648,
   3, 20080513 11:44:32,    1, STAT, redo writes              ,    152,
   3, 20080513 11:44:32,    1, STAT, redo blocks written      ,   1892,
   3, 20080513 11:44:32,    1, STAT, redo write time          ,     82,
   3, 20080513 11:44:32,    1, WAIT, rdbms ipc message        , 169504,
   3, 20080513 11:44:32,    1, WAIT, io done                  , 813238,
   3, 20080513 11:44:32,    1, WAIT, log file parallel write  ,   5421,
   3, 20080513 11:44:32,    1, WAIT, LGWR wait for redo copy  ,      1,
3. Next, confirm that LGWR is waiting for that event using SQL*Trace as well. Tracing LGWR can deteriorate performance further, so careful consideration must be given before turning on sqltrace for LGWR. Packages such as dbms_system/dbms_support, or oradebug, can be used to turn on a 10046 event at level 12.
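For example, a minimal oradebug sketch from a SYSDBA session (22459 is the LGWR OS pid from the truss example above; substitute your own, found via ps or v$process):

SQL> oradebug setospid 22459
SQL> oradebug event 10046 trace name context forever, level 12

After collecting for a short, controlled interval, turn tracing off and locate the trace file:

SQL> oradebug event 10046 trace name context off
SQL> oradebug tracefile_name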
A few trace lines are shown below. In this specific case, LGWR is waiting for I/O events. The output below shows that LGWR submitted two write calls with 16 redo blocks. Waits for the I/O completion events are counted as ‘io done’ events. Between the two calls, LGWR waited a total of about 1,600 microseconds, or 1.6 ms. The performance of the writes themselves is not entirely bad.
WAIT #0: nam='rdbms ipc message' ela= 7604 p1=223 p2=0 p3=0
WAIT #0: nam='log file parallel write' ela= 35 p1=2 p2=16 p3=2
WAIT #0: nam='io done' ela= 0 p1=0 p2=0 p3=0
WAIT #0: nam='io done' ela= 639 p1=0 p2=0 p3=0
WAIT #0: nam='io done' ela= 0 p1=0 p2=0 p3=0
WAIT #0: nam='io done' ela= 605 p1=0 p2=0 p3=0
WAIT #0: nam='io done' ela= 1 p1=0 p2=0 p3=0
WAIT #0: nam='io done' ela= 366 p1=0 p2=0 p3=0
4. Let’s look at a few other statistics, too. From the statspack report (one column has been removed to improve readability):
Statistic                                 Total     per Second
------------------------------- --------------- --------------
redo blocks written                     230,881        2,998.5
redo buffer allocation retries                0            0.0
redo entries                            285,803        3,711.7
redo log space requests                       0            0.0
redo log space wait time                      0            0.0
redo ordering marks                           0            0.0
redo size                           109,737,304    1,425,159.8
redo synch time                          40,744          529.1
redo synch writes                        38,141          495.3
redo wastage                          5,159,124       67,001.6
redo write time                           6,226           80.9
redo writer latching time                     4            0.1
user calls                              433,717        5,632.7
user commits                             38,135          495.3
user rollbacks                                1            0.0
workarea executions - multipass               0            0.0
From the above statistics, we can see that:
- 5,632 user calls per second were made, and 495 commits were executed per second, on average.
- Roughly 3,000 redo blocks (of size 512 bytes in Solaris) were written per second by LGWR, approximately 1.5 MB/sec.
- 1.4 MB/sec of redo was generated, which is approximately 11 Mbps.
- For roughly 3,000 redo blocks written per second, there were about 495 commits per second.
Essentially, the redo size is high, but not abnormal. But 500 commits per second is on the higher side.
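A quick derived metric helps put these numbers in context. A sketch computing average redo bytes per commit from cumulative system counters (a ratio of cumulative values, so treat the result as indicative only):

select round((select value from v$sysstat where name = 'redo size') /
             nullif((select value from v$sysstat where name = 'user commits'), 0)
            ) redo_bytes_per_commit
from   dual;

From the statspack numbers above, 109,737,304 bytes over 38,135 commits is roughly 2.9 KB of redo per commit: many small transactions, which matches the high commit rate.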
5. Knowledge about your application will be useful here. If the commit frequency is high, trace a few sessions and understand why there are so many commits. For example, in the following trace file, there is a commit after every SQL statement, and that can be the root cause of these waits. An XCTEND call with rlbk=0 and rd_only=0 is a commit.
WAIT #120: nam='SQL*Net message from client' ela= 291 p1=1952673792 p2=1 p3=0
XCTEND rlbk=0, rd_only=0
WAIT #0: nam='log file sync' ela= 1331 p1=31455 p2=0 p3=0
WAIT #0: nam='SQL*Net message to client' ela= 1 p1=1952673792 p2=1 p3=0
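To decide which sessions to trace, the session-level counters can point at the top committers. A minimal sketch (the views and statistic name are standard; filter further as needed):

select s.sid, s.username, st.value user_commits
from   v$sesstat st, v$statname sn, v$session s
where  st.statistic# = sn.statistic#
and    sn.name = 'user commits'
and    st.sid = s.sid
order  by st.value desc;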
6. Next, examine the commit frequency from LGWR’s side. If it is high, LGWR could be spending its time signalling user processes. Just as with any other process, the Unix scheduler can kick LGWR off the CPU, and this can cause foreground processes to wait on the ‘log file sync’ event. In Solaris, prstat microstate accounting (-mL) is quite useful in breaking down how much time LGWR spends waiting for CPU.
7. Although uncommon, there are a few bugs causing unnecessary ‘log file sync’ waits, by signaling incorrect processes, etc.
8. In a few scenarios, ‘log file sync’ waits are intermittent and hard to catch. Statistics might need to be captured with more granularity to understand the issue. In the example below, the instance froze intermittently, and I wrote a small script to gather statistics from v$sysstat every 20 seconds and spool them to a file. We were also collecting iostat output, using the Oracle-supplied tool LTOM.
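A minimal sketch of such a polling script (not the exact one used here; dbms_lock.sleep needs an explicit grant, and dbms_output buffers until the block completes, so the output appears at the end):

set serveroutput on size 1000000
declare
  l_prev number;
  l_curr number;
begin
  for i in 1 .. 90 loop  -- 90 samples, 20 seconds apart
    select value into l_curr
    from   v$sysstat
    where  name = 'redo blocks written';
    dbms_output.put_line(to_char(sysdate, 'DD-MON-YYYY-HH24:MI:SS')
      || ':Redo blocks written:' || l_curr
      || ':Delta:' || nvl(to_char(l_curr - l_prev), 'N/A'));
    l_prev := l_curr;
    dbms_lock.sleep(20);
  end loop;
end;
/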
23-MAR-2007-04:29:43:Redo blocks written:1564176614:Delta: 8253
23-MAR-2007-04:30:14:Redo blocks written:1564176614:Delta:    0
23-MAR-2007-04:30:44:Redo blocks written:1564205771:Delta:29157
Between 4:29:43 and 4:30:14, the redo-blocks-written statistic did not change, since the delta is 0; meaning, LGWR did not write any redo blocks. But the ‘redo size’ statistic grew by ~7 MB during the same interval, and the number of processes waiting on the ‘log file sync’ event increased to 100. So, we know that the freeze occurred because LGWR wasn’t able to write any blocks during this time-frame.
In a RAC cluster, this problem is magnified, since both CR- and CUR-mode buffers need the log flush to complete before those blocks can be transferred to another instance’s cache.
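In RAC, the cost of those flushes is visible in the related wait events. A hedged sketch (event names vary by version; ‘gcs log flush sync’ is typical):

select event, total_waits, time_waited
from   v$system_event
where  event like '%log flush%';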
Looking closely at the I/O statistics, we can see that the average service time was quite high for a few devices during that time period (we mapped these devices back to the log file systems later). This led us to a more granular time-frame, and finally, the issue proved to be a hardware switch that was intermittently freezing.
                         extended device statistics
 r/s  w/s  kr/s  kw/s  wait  actv  wsvc_t  asvc_t  %w  %b  device
 0.0  0.0   0.0   0.0   0.0   1.0     0.0     0.0   0 100  d6
 0.0  0.0   0.0   0.0   0.0   1.0     0.0     0.0   0 100  d2
 0.0  0.0   0.0   0.0   0.0   1.0     0.0     0.0   0 100  d1
 0.0  0.0   0.0   0.0   0.0   9.0     0.0     0.0   0 100  d0
 0.0  0.0   0.0   0.0   0.0   3.0     0.0     0.0   0 100  d13
 0.0  2.0   0.0  24.0   0.0   2.0     0.0  1000.7   0 100  d12
 0.0  0.0   0.0   0.0   0.0   1.0     0.0     0.0   0 100  d11
9. Unix tools such as truss, tusc, and strace can be used to debug these scenarios if the above techniques are not sufficient. But tools such as truss and strace should be used as a last resort. In Solaris 10, dtrace can be used to debug I/O or processor-specific issues; dtrace is safer by design. Brendan Gregg has a DTrace toolkit and a wealth of information here.
Guidelines for Resolving a Few Root Causes
Finding and understanding the root cause is essential to resolving a performance issue. Some final observations:
- If I/O bandwidth is an issue, then doing anything other than improving I/O bandwidth is not useful. Switching to file systems providing better write throughput is one option. RAW devices are another option. Reducing the number of log file members in a group is another, as it reduces the number of write calls; but this option comes with a cost (a quick check of current membership is shown after this list).
- If CPU starvation is an issue, then reducing CPU starvation is the correct step to resolve it. Increasing the priority of LGWR is a workaround.
- If the commit rate is high, then decreasing commits is the correct step. But in a few cases, if that is not possible, increasing the priority of LGWR (using nice) or increasing the priority class of LGWR to RT might provide some relief.
- Solid State Disk devices also can be used if the redo size is extreme. In that case, it is also preferable to decrease the redo size.
- If excessive redo size is the root cause, it can be reduced using various techniques. More information can be found here (PDF).
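As mentioned in the first item above, log group membership is easy to check before weighing the risk of dropping members. A minimal sketch joining v$log and v$logfile:

select l.group#, l.members, l.bytes/1048576 size_mb, f.member
from   v$log l, v$logfile f
where  l.group# = f.group#
order  by l.group#, f.member;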
In summary, finding and resolving the root cause for a ‘log file sync’ event will improve application response time.