
Monday, February 21, 2011

Trust ... but verify!

Today's topic comes from a real-life story ... a story that took three days of work and investigation, involved even an Oracle SR, and ended with a result that was far from expected. Why? Because I abandoned my own motto:
Trust but verify!
So let me start ... from the beginning, as it should be ...

Relocate one RAC node to another rack

A customer, whose 4-node Oracle 10gR2 RAC on RHEL 5 Linux x64 had been working perfectly for a year, came up with the more than welcome idea of relocating two nodes (blade servers) from one rack to another, to reduce the risk of a power failure. The idea was accepted immediately, and the meeting with the other involved parties was pretty short.

One outsourcing partner would reconfigure the hardware rack (console, network, etc.); the second, representing the OS admins, would reconfigure the NIC interfaces in the new rack; and the third, the Oracle DBA, should be there only to verify that everything was working as it should.

At my suggestion, we agreed to relocate just one node first and, once everything went OK, to repeat the same procedure with the second node. To my question "Will there be any HW/OS changes?" I got a clear answer: "No, there will be no changes you might notice. We will only change the MAC addresses on all NICs; everything else will remain the same!" To me this sounded like the situation where all of a node's NICs are dead and have to be replaced. Sounds easy!

Accordingly, let me show how the hardware was configured before and how it should have been configured after the relocation, based on the OS and hardware partners' work. Here is a simple picture of the RAC and the switch where the VIP connections are placed:

The problem

The relocation went pretty quickly, within the planned time (around 22:30 in the evening). When everything was ready, the node was started in "normal" (non-admin) mode. But Oracle RAC was not able to start on that node! And here the problem arose.
oifcfg, Oracle's network configuration tool for RAC, mostly hung, showing no result after several minutes, which I took as a very bad sign and as pointing to network-related problems.
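For context, on a healthy node oifcfg answers immediately; the kind of check that was hanging here looks roughly like this (a minimal sketch; the interface names and subnets in the comments are purely illustrative):

# List the network interfaces the OS presents to Clusterware
oifcfg iflist

# Show how interfaces are registered in OCR (public vs. cluster_interconnect)
oifcfg getif
# Expected output on a healthy node would look something like:
#   eth0  10.0.0.0     global  public
#   eth1  192.168.1.0  global  cluster_interconnect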

After half an hour of ad hoc analysis of the RAC logs, and two unsuccessful node reboots (during which the OS admins had some "ideas"), I suggested leaving everything as it was until morning, when we might be wiser. Frankly, whenever I find myself in a troubleshooting situation where people start having "ideas", I suggest stepping away from the problem immediately (if possible), knowing that things go from bad to worse very quickly! The suggestion was accepted and we went home, leaving the problematic node up and running.

The next morning, while I was driving to work, the same client called me with panic in his voice: nobody could connect to any of the nodes, and the RAC was unusable. Within a minute I suggested shutting down the problematic node to see if anything improved. Shutting the node down had to be done with the halt command, because shutdown was not enough. As soon as the node was down, the other three nodes in the RAC immediately started to work normally.

This information really scared me, because Linux does not normally behave that way, and it could have been a sign of bigger problems waiting for us later in the day. Fortunately that was all, and nothing else went wrong.

The analysis part

When I got to work, I checked once again whether the Oracle documentation or Google mentioned any problem with a "MAC address change" in a RAC environment, but except for one case on a specific platform, there was nothing! So I left the case to those who had initially caused the problem: the HW and OS admins.

After a week, the customer came to us saying we should meet with the OS admins to analyze the problem and find a way to repair the problematic node. The customer's reaction was normal: he wanted all nodes up and running as before. But I was a little surprised that a DBA had to help when nothing had changed in the Oracle configuration and we had not been involved in this failure in any way! I felt they did not trust us, so creating any further friction would have been neither acceptable nor professional. Even so, I opened an SR on Metalink, just to prove to myself that Oracle was not the guilty party.

Because the server was down, I asked for the node to be started in "no network" mode so I could collect all the logs that might help us find the cause. While I was copying the log files to a USB stick, I asked the OS admin several questions:
  • Have you checked SSH equivalence on all nodes? The answer was: Yes, all OK.
  • Have you checked the jumbo frame settings (implemented as MTU=9000)? The answer was: Yes, all OK.
  • To my question whether all the cables were connected as before, the answer was nothing but a strange look in his eyes. This was the point where I should have reacted and insisted on the question ... but I was not 100% sure what the real problem was, so I let it pass.
  • You haven't changed anything in the Linux startup files? The answer was: No, they are untouched.
In short, the answer to all my questions was pretty much the same: nothing was touched or changed, everything is the same as before. I do not know whether anyone has ever been in such a situation, but I was so confused that some of my last questions were rather stupid, leaving the impression that I did not understand anything. So I shut up and left the customer site to study the logs in peace at my desk. In parallel, I sent the logs to Oracle Support, who reacted very quickly, proving that every SR depends on the person who takes it.
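In hindsight, each of those answers could have been verified from the problematic node in a couple of minutes. A rough sketch of such checks (the node names and the eth1 interconnect interface are illustrative assumptions; MTU=9000 is the value from the story):

# SSH user equivalence: each command should return the remote hostname
# immediately, with no password prompt
ssh node2 hostname
ssh node3 hostname
ssh node4 hostname

# MTU actually configured on the interconnect NIC
ip link show eth1 | grep mtu

# MTU persisted in the RHEL startup files the admin claimed were untouched
grep -i mtu /etc/sysconfig/network-scripts/ifcfg-eth1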

/var/tmp/.oracle solution?

Initially I found some crsd log entries that pointed me to a solution that says to clear the content of the /var/tmp/.oracle directory, where Linux creates the Clusterware socket files. At the same time, a recommendation like this came from Metalink:
I think the issue is with OCFS: it cannot communicate with the other nodes. Is this used for OCR or voting files?

If this is used, please execute RDA on a working node and on the failing node according to the instructions from this article: Remote Diagnostic Agent (RDA) 4 - RAC Cluster Guide (Doc ID 359395.1)

If OCFS is not used for OCR or voting files, please execute the following steps:
1. enable the public and interconnect networks if they are disabled
2. crsctl stop crs
3. clean up the old sockets in "/var/tmp/.oracle"
4. crsctl start crs
Our RAC configuration does include OCFS, but it has nothing to do with the OCR or voting disks (the RAC was based on raw devices and ASM); this OCFS was used only for some exp dumps and file-based activities ... in short, nothing required for any essential RAC operation. Another small problem was starting the node at all, because of the possibility that it would again hang RAC operations on all the other nodes (as it did the first night). So the second piece of SR advice seemed logical and promising to us. This was the plan (commands sketched below the list):
  • Start the node in admin mode
  • Delete all content of the /var/tmp/.oracle directory
  • Start the node up normally
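Translated into commands, the plan looked roughly like this (a sketch only: the CRS home path is an assumption, and the cleanup path follows the SR advice above):

# Stop Clusterware on the problematic node (run as root; CRS home path is an assumption)
/u01/app/crs/bin/crsctl stop crs

# Remove the stale Clusterware socket files, as advised in the SR
rm -rf /var/tmp/.oracle/*

# Start Clusterware again and check whether the core daemons come up
/u01/app/crs/bin/crsctl start crs
/u01/app/crs/bin/crsctl check crs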
In short, I was really worried, knowing that if this plan failed we would be in real trouble, because we had no idea what was wrong, and Metalink's suggestion to run RDA was a very clear sign that they did not really know what the problem was either.

Solution

I was absent from this second intervention at the customer site, so my colleague took on the ungrateful role of representing the DBA side. After performing the previously described plan, the RAC was still unable to start the CRS core services. They were stuck, just as we had been on the first night.
But there was one difference: my colleague didn't believe anybody, and he wanted proof for every question and idea he raised. When they came to the question of the configuration and how it looked now, he got a picture:
After he realized that two new switches had been added and connected completely differently from the initial setup (compare this picture with the one at the beginning; does this look like the same configuration to you!!??), he focused on errors that could arise from a bad network connection. Soon he found errors like:
2011-02-19 17:06:37.378: [  OCRMSG][233026112]prom_rpc: CLSC send failure..ret code 6
2011-02-19 17:06:37.378: [  OCRMSG][233026112]prom_rpc: possible OCR retry scenario
2011-02-19 17:06:37.378: [  OCRSRV][233026112]proas_forward_request: PROM_TIME_OUT or Master Fail
2011-02-19 17:06:37.429: [ COMMCRS][522433088]clscsendx: (0x600000000065bcd0) Connection not active
Google gave him just one result, which described a problem with jumbo frames. When he asked for proof that all nodes were pingable with the correct MTU, all packets to the relocated node were rejected, because the MTU was not set on the middle switch!! After setting the correct MTU values on that switch, the RAC started to work like a charm! The customer was happy, and this sad story ended happily.
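The check that finally exposed the problem deserves to be written down. With MTU 9000, the largest ICMP payload that fits in a single unfragmented packet is 8972 bytes (9000 minus 20 bytes of IP header and 8 bytes of ICMP header), so a "don't fragment" ping of that size must succeed across every switch in the path (the node name below is illustrative):

# -M do sets "don't fragment", so this fails on any hop whose MTU is below 9000
ping -M do -s 8972 -c 3 node1-priv

# A plain ping that works while the jumbo-sized ping fails (or reports
# "Frag needed and DF set") means some device in the path drops large frames;
# here it was the new middle switch without jumbo frames enabled.
ping -c 3 node1-priv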

But lessons should be learned!

The End

What I wanted to point out is that a DBA (or any admin staff) should never trust anything that cannot be proved or tested. My error was exactly that: I trusted but hadn't verified, and it cost me a lot.

So regardless of how big the "faces" working with you are, if you are asked to help them, you are their equal in every respect! You might ask something that sounds stupid, but if the HW and OS people (in this case) were not afraid to claim that the configuration had not been changed, in a situation where two new switches had been added and the network was completely different, then you are entitled to an answer to every one of your questions, no excuses. Without any embarrassment!

That was my small and sincere story that I wanted to share with you.

Cheers!

Zagreb u srcu!


