Link explaining more:
http://mwidlake.wordpress.com/?s=table+partitions&searchbutton=go%21
The client I was working with at the time was about to go live with a new application and database. For various reasons I was a little anxious about how the Java application (the user interface) would actually call the stored PL/SQL code I had helped develop. Initial workloads would be low, so I asked for the STATISTICS_LEVEL to be set to ALL, so that bind variables (amongst other things) would be gathered. This was on version 10.2.0.4, by the way – Enterprise Edition, 4-node RAC.
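{For reference, checking and making that change is straightforward – a sketch, run as a suitably privileged user:}

show parameter statistics_level

-- instance-wide; SCOPE=BOTH changes the running instance and the spfile
alter system set statistics_level = ALL scope = both;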
We went live, issues were encountered and resolved – the usual way these things work. Then, a few weeks in and when everything was still very “exciting” from a problem-resolution perspective, I got an odd email from the DBA team: would we like them to add another datafile to the SYSAUX tablespace? Huh? I checked. I’d been watching the size of our application’s tablespaces but not the others {well, I was not supposed to be a DBA and I was watching an awful lot of other things}. Our SYSAUX tablespace was around 160GB in size, having pretty much filled its fifth datafile. Why? I checked to see what was taking up the space in the tablespace:
select * from
 (select owner
        ,segment_name||'~'||partition_name segment_name
        ,bytes/(1024*1024) size_m
  from   dba_segments
  where  tablespace_name = 'SYSAUX'
  order  by blocks desc)
where rownum < 40

OWNER  SEGMENT_NAME                                             SIZE_M
------ ------------------------------------------------------- -------
SYS    WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_3911            27,648
SYS    WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_3911         26,491
SYS    WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_3537            23,798
SYS    WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_3537         22,122
SYS    WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_4296            17,378
SYS    WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_4296         16,818
SYS    WRH$_ACTIVE_SESSION_HISTORY~WRH$_ACTIVE_14459270_3911       136
SYS    WRH$_SQLSTAT~WRH$_SQLSTA_14459270_3911                        96
SYS    WRH$_SQLSTAT~WRH$_SQLSTA_14459270_3537                        72
SYS    WRH$_SQLSTAT~WRH$_SQLSTA_14459270_4296                        47
SYS    WRH$_LATCH_MISSES_SUMMARY_PK~WRH$_LATCH__14459270_3537       45
SYS    I_WRI$_OPTSTAT_H_OBJ#_ICOL#_ST~                               41
SYS    WRH$_SYSMETRIC_SUMMARY~                                       40
SYS    WRH$_LATCH_MISSES_SUMMARY_PK~WRH$_LATCH__14459270_           37
As you can see, almost all the space is being taken up by WRH$_LATCH_CHILDREN and WRH$_LATCH_CHILDREN_PK partitions – they are massive compared to the other objects. A quick google did not come up with much, other than many hits simply listing what is in SYSAUX and the odd person also seeing SYSAUX filled up by these objects and asking for ways to clear down the space – nothing about the cause.
I had a chat with the DBAs and we quickly decided that this was going to be something to do with AWR, given the names of the objects – “WRH$_” objects are the things underlying AWR. The DBA suggested my settings of a 15-minute snapshot interval and 35-day retention were too aggressive. I knew this was not the case; I’ve had more aggressive snapshot intervals and longer retention periods on far busier systems than this. I did not have access to Metalink at that point, so I asked the DBAs to raise a ticket, which they duly did.
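{For reference, those AWR settings can be checked and altered like this – a sketch using the documented DBMS_WORKLOAD_REPOSITORY package; both arguments are in minutes:}

-- current interval and retention
select snap_interval, retention
from   dba_hist_wr_control;

-- 15-minute snapshots, 35-day retention
begin
  dbms_workload_repository.modify_snapshot_settings(
      interval  => 15
     ,retention => 35*24*60);
end;
/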
Oracle support cogitated for a couple of days and came back with the advice to reduce the retention period. Hmmmm. Via the DBA I asked Oracle support to explain why those objects were so large when I had not seen this issue on several other systems. Was it a bug? I had by now checked with a friend from a previous site, which had 5-minute snapshot intervals and a two-month retention period, and their SYSAUX tablespace was about 10GB all in. I did not want to go changing things unless we knew it would fix the issue – we really wanted to stop the growth of SYSAUX as soon as possible, not just try a hunch.
As you probably realise from the title of this blog, the issue was not the snapshot interval or retention period but STATISTICS_LEVEL=ALL. The one and only hit you get in Metalink if you search on WRH$_LATCH_CHILDREN is note 874518.1. From V10.1.0.2 to V11.1.0.7, setting this parameter to ALL is known to create a lot of data about latch children and not clear it down when the AWR data is purged (Bug 8289729). The advice was to change the STATISTICS_LEVEL and make the snapshot interval larger. I’d suggest you just need to alter the STATISTICS_LEVEL, unless you really, really need that extra information gathered. It seemed to take Oracle Support an extra day or two to find that note for us {I’ve since checked Metalink directly to confirm all this}.
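{If you are in the same position, the change itself is a one-liner – TYPICAL is the default level:}

alter system set statistics_level = TYPICAL scope = both;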
So, with a known issue identified, we felt confident that altering the initialisation parameter would solve the problem. It took a while for us to change the STATISTICS_LEVEL on the production system – Change Control for that site is rather robust. This delay allowed us to see some other impacts of the issue.
The mmon process, which looks after AWR data, was becoming a top session in our OEM performance screens. In particular, a statement with SQL ID 2prbzh4qfms7u that inserted into the WRH$_LATCH_CHILDREN table was taking several seconds to run each time and was running quite often {I include the SQL ID as it may be the same on many Oracle V10 systems, as it is internal code}.
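{If you want to check whether the same statement is weighing on your own system, something like this against V$SQLSTATS should show it – a sketch; on RAC, GV$SQLSTATS shows all nodes:}

select sql_id
      ,executions
      ,disk_reads
      ,buffer_gets
      ,elapsed_time/1000000 elapsed_secs
from   v$sqlstats
where  sql_id = '2prbzh4qfms7u';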
That insert was doing a lot of IO – by far the majority of the IO on our system at the time, and it was a new system on which we had been able to tune out a lot of the physical IO.
We also now started to have issues with mmon running out of undo space when it ran at the same time as our daily load. This was particularly unfortunate, as it coincided with a period of “intense management interest” in the daily load…
What was happening to the size of the SYSAUX tablespace?
Enter the tablespace ( or leave null )> sys

TS_NAME              ORD      SUM_BLKS        SUM_K  MAX_CHNK_K NUM_CHNK
-------------------- ----- ----------- ------------ ----------- --------
SYSAUX               alloc  58,187,904  465,503,232  33,553,408       14
                     free       10,728       85,824      21,504       20
SYSTEM               alloc     128,000    1,024,000   1,024,000        1
                     free       68,360      546,880     546,752        3

4 rows selected.

select * from
 (select owner
        ,segment_name||'~'||partition_name segment_name
        ,bytes/(1024*1024) size_m
  from   dba_segments
  where  tablespace_name = 'SYSAUX'
  order  by blocks desc)
where rownum < 40

OWNER    SEGMENT_NAME                                               SIZE_M
-------- --------------------------------------------------------- -------
SYS      WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_6201                30262
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_5817                29948
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_5435                28597
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_4675                28198
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_3911                27648
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_5817             27144
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_6585                26965
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_6201             26832
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_4675             26741
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_3911             26491
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_4296                26307
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_5435             26248
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_4296             25430
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_6585             25064
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_5058                24611
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_5058             23161
         WRH$_LATCH_CHILDREN~WRH$_LATCH__14459270_6966                 9209
         WRH$_LATCH_CHILDREN_PK~WRH$_LATCH__14459270_6966              8462
         WRH$_SYSMETRIC_SUMMARY~                                        152
         WRH$_ACTIVE_SESSION_HISTORY~WRH$_ACTIVE_14459270_3911          136
         WRH$_SQLSTAT~WRH$_SQLSTA_14459270_3911                          96

@sysaux_conts

OWNER                          OBJ_PART_NAME                             SIZE_M
------------------------------ ------------------------------------ ----------
SYS                            WRH$_LATCH_CHILDREN-WRH               231745.063
SYS                            WRH$_LATCH_CHILDREN_PK-WRH            215573.063
SYS                            WRH$_SQLSTAT-WRH                        711.0625
SYS                            WRH$_LATCH_MISSES_SUMMARY_PK-WRH        439.0625
SYS                            WRH$_ACTIVE_SESSION_HISTORY-WRH         437.0625
SYS                            WRH$_LATCH_PARENT-WRH                   292.0625
SYS                            WRH$_LATCH-WRH                          276.0625
SYS                            WRH$_LATCH_MISSES_SUMMARY-WRH           273.0625
SYS                            WRH$_SEG_STAT-WRH                       268.0625
SYS                            WRH$_LATCH_PARENT_PK-WRH                239.0625
SYS                            WRH$_SYSSTAT_PK-WRH                     237.0625
Yes, that is close to half a terabyte of SYSAUX, and it is all used. More partitions have appeared, and the total sizes of the largest segments in SYSAUX show that WRH$_LATCH_CHILDREN and WRH$_LATCH_CHILDREN_PK make up the vast majority of the space used.
Shortly after, we finally got permission to change the live system. The impact was immediate: mmon dropped from being the most demanding session, that SQL statement dropped down the rankings, and the issues with running out of undo ceased.
I was anxious to see if the old data would get purged, as the Metalink note had suggested it would not be. Thankfully, that was not the case. The space was slowly released as normal purging of data outside the retention period took place, and after just over a month the SYSAUX tablespace contained a lot less information and was mostly free space:
OWNER      OBJ_PART_NAME                                SIZE_M
---------- ---------------------------------------- ----------
SYS        WRH$_LATCH_MISSES_SUMMARY_PK-WRH           512.0625
SYS        WRH$_LATCH_MISSES_SUMMARY-WRH              350.0625
SYS        WRH$_LATCH-WRH                             304.0625
SYS        WRH$_SQLSTAT-WRH                           280.0625
SYS        WRH$_LATCH_PK-WRH                          259.0625
SYS        WRH$_SYSSTAT_PK-WRH                        247.0625
SYS        WRH$_SERVICE_STAT_PK-WRH                   228.0625
SYS        WRH$_PARAMETER_PK-WRH                      201.0625
SYS        WRH$_PARAMETER-WRH                         169.0625
SYS        WRH$_SYSSTAT-WRH                           169.0625
SYS        WRH$_SEG_STAT-WRH                          161.0625
SYS        WRH$_SYSTEM_EVENT_PK-WRH                   156.0625
SYS        WRH$_SYSMETRIC_SUMMARY-                         152
SYS        WRH$_SYSTEM_EVENT-WRH                      133.0625
SYS        WRH$_SERVICE_STAT-WRH                      123.0625
SYS        WRH$_ACTIVE_SESSION_HISTORY-WRH            115.0625

TS_NAME              ORD      SUM_BLKS        SUM_K  MAX_CHNK_K NUM_CHNK
-------------------- ----- ----------- ------------ ----------- --------
SYSAUX               alloc  58,251,904  466,015,232  33,553,408       15
                     free   57,479,400  459,835,200   4,063,232    1,208
SYSTEM               alloc     128,000    1,024,000   1,024,000        1
                     free       68,048      544,384     544,320        2
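{In case it is useful, this is roughly how you can keep an eye on the purging – track the oldest snapshot still held and the space used by the offending segments. A sketch:}

select min(begin_interval_time) oldest_snap
from   dba_hist_snapshot;

select sum(bytes)/(1024*1024) size_m
from   dba_segments
where  tablespace_name = 'SYSAUX'
and    segment_name like 'WRH$_LATCH_CHILDREN%';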
Now, how do we get that space back? I left that with the DBA team to resolve.
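{For what it is worth, the usual approach would be to resize the SYSAUX datafiles back down once the free space sits at the end of the files – something like the below. The file name is made up, and a datafile will only shrink back as far as its highest used block, so some juggling may be needed:}

select file_name, bytes/(1024*1024) size_m
from   dba_data_files
where  tablespace_name = 'SYSAUX';

-- hypothetical file name; the resize only succeeds if no extents sit beyond 10G
alter database datafile '/u01/oradata/PROD/sysaux05.dbf' resize 10g;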
Oh, one last thing. I mentioned the above to a couple of the OakTable lot in the pub a few weeks back. Their instant response was to say “You set STATISTICS_LEVEL to ALL on a live system?!? You are mad!”