On their earnings call today, MGM Mirage's Jim Murren discussed the recent problems with their guest reservations system, OPERA, that runs 7 properties for the company, mostly in Las Vegas.
Murren stated that the OPERA issue was due to a 'memory leak in the operating system'. He also stated that OPERA runs on an 'HP platform' and an 'Oracle database'. By 'HP platform', it's not clear whether he means they are running HP-UX (HP's UNIX operating system for its PA-RISC hardware) or HP hardware with something like Windows Server.
Murren claims the issue is resolved and they are fully operational but that they intend to have a backup system put into place for the future. He didn't break out the total cost of the problem for the company.
Comments
Maybe one of you more technical guys can explain it to me, but why would a large corporation like MGM Mirage NOT have redundancies built in or a backup system? It's not like hotel reservations aren't critical to their business.
Well, I'm sure that many of their systems do have backup systems - and who knows, this system may have had one also, though lower capacity - it's not like the hotel was closed, just severely impacted.
Also, it's important to remember that MGM Mirage was built through mergers - that means a lot of different systems that need to be integrated.
As someone who works in IT for a living I have been very interested in learning the details behind MGM Mirage's OPERA reservations system meltdown. Some speculation was made last week that they had added NY, NY to the system just before it became unstable. I'm sure that MGM's system is likely very customized and finding the source of the problem would have taken quite some time with multiple vendors pointing the finger at each other or back at MGM's IT guys. I would think that MGM would definitely try to have as much redundancy as possible built into their system. I'm not a database guy so I might be wrong but I suspect that if an application were causing a memory leak simply having the system mirrored or clustered on redundant hardware would do little to prevent this kind of problem. I would hope MGM is currently reviewing their change management process as well as their disaster recovery process. Certainly taking almost a week to get the system back to normal should be considered completely unacceptable. I feel bad for the front line MGM Mirage employees having to face so many frustrated angry guests during the system crash.
Hopefully Bellagio's AAA rating will not be jeopardized by the fiasco as Steve Friess has speculated on his blog http://thestrippodcast.blogspot.com/2007/10/mgm-mirage-hell-jeopardizing-aaa-rating.html . I certainly can't think of a worse time for a AAA reviewer to have possibly paid Bellagio a visit.
If it was truly a memory leak, something like clustering wouldn't really help in the big picture, though it might take longer for the problem to surface in an environment with more resources.
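To put rough numbers on that point, here is a minimal sketch. The memory sizes and leak rate are made up for illustration and are not figures from MGM's system; the function name is hypothetical too. It only assumes a leak that grows at a steady rate.

```python
def hours_until_exhaustion(total_mb: float, baseline_mb: float,
                           leak_mb_per_hour: float) -> float:
    """Hours before a steady leak eats all free memory on one node.

    All inputs are illustrative assumptions, not measured values.
    """
    if leak_mb_per_hour <= 0:
        raise ValueError("leak rate must be positive")
    free_mb = max(total_mb - baseline_mb, 0.0)
    return free_mb / leak_mb_per_hour


# Same leak, same baseline usage, two hypothetical node sizes.
small_node = hours_until_exhaustion(16_384, 4_096, 256)
large_node = hours_until_exhaustion(65_536, 4_096, 256)

print(f"16 GB node runs out after {small_node:.0f} hours")
print(f"64 GB node runs out after {large_node:.0f} hours")
```

Under these assumptions the bigger node lasts longer but still runs out, and a mirrored or clustered node running the same leaking code would hit the same wall on roughly the same schedule.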
I too find this fascinating, no doubt.
I'm sure you're right that their install is highly customized and Murren said on the call that they were probably the largest install of that software on earth, indicating they are somewhat on the bleeding (or 'bruising' as he said) edge.
I was talking to some MGM-Mirage IT guys and they said that it was similar to the issues Wynn had after opening but on a much larger scale.
They also told me that OPERA doesn't have what they called "real redundancy".
I'll ask them about the "HP PLATFORM" too.
Steve Friess has a slightly different explanation for the problem over at The Strip blog:
We determined the source of the dormant "bug" to be within the Microsoft Windows 2003 operating system and a patch has been applied.
http://thestrippodcast.blogspot.com/2007/10/ok-geeks-does-this-make-any-sense.html
At the very least, it might be the answer to the HP Platform question.
I would be very interested to know which MS patches (KB #) were needed on MGM's Windows Server 2003 servers. I maintain Windows 2003 Servers as part of my job and have so far fortunately not seen a memory leak severe enough to bring a system down as quickly as what MGM was experiencing. It would be nice to learn from MGM's experience so others may avoid similar problems in the future.
Yeah, that's an excellent question. I too have several Win2k3 servers in my stable and while MS is easy to pick on and there are of course some problems in Windows, MGM Mirage may have just scapegoated them... Of course it's possible this was a legit MS bug, who knows...
"We determined the source of the dormant 'bug' to be within the Microsoft Windows 2003 operating system and a patch has been applied."
There's no excuse for an unpatched system. Especially when it concerns machines with things like names and addresses and perhaps credit card numbers on them.
Here is more information about the outage. This is an OPERA communication, approved by MGM, that was released about it.
TO OUR MANY OPERA CUSTOMERS:
As you may have recently read in the press:
1. Certain MGM MIRAGE properties in Las Vegas, Detroit, and Biloxi, Mississippi experienced instability and some down-time starting on Friday, October 19, 2007 (all of the affected MGM properties are operated in a centralized OPERA schema with a centralized Oracle database).
2. Personnel from MICROS (supplier of OPERA), Microsoft (supplier of the operating system), Oracle (supplier of the database), HP (supplier of the hardware), Intel (Itanium chips) and MGM MIRAGE worked closely together to identify and correct the issue.
3. Simply stated, the problem was subsequently identified as a dormant operating system bug in the form of a memory leak which results in the Oracle RAC to hang and render the system unusable.
4. After various diagnostic procedures were implemented, the Microsoft Windows Server 2003 Service Pack 2 was deployed on October 26, 2007. The system has not failed since the implementation of the patch.
5. It has now been definitively confirmed that the issue which triggered the sporadic downtime was NOT an OPERA application problem, despite one erroneous published news report. We also know the issue was not the result of system overload, despite one erroneous published news report attributing the downtime to the implementation of the New York New York property on September 18, 2007 (some 5 weeks before the problem first surfaced).
6. Thank you and we will continue to keep you posted.
Well, it sounds like too much hotel for too little of a system. I hear they have many outages. Many big hotels run on a single AS/400 system with a backup system and run 24/7/365 without issues. Maybe MGM should start playing with the big boys and get something reliable. I have also heard that Wynn has had issues.