My First Almost Corrupt Database
Well, I thought the day had finally come. My first legit corrupt database.
One production system started firing a ton of alerts for databases suspending and disconnecting. So I log into try and gather whats happening, and all my databases are missing…oh crap. Then I get an error “Database msdb cannot be opened. It has been marked SUSPECT”…DOUBLE CRAP.
Well the error says to check the SQL Logs for information, so I’ll check out what’s going on. Get the error again. Hmm, ok. Well I’ll run a CHECKDB and see if it will give me something. Still can’t interact with the db. Think Kyle, think. I start to research how to fix a SUSPECT msdb and learn that there is template version that I can restore from. After a good hour or so of googling, I take a step back and think.
Ok, so step 1 to any IT investigation I have ever done: What has changed since the issue started occurring? I asked them to remove a USB drive attached to that Hyp a couple days ago and it I got my tracking number today. Are my data files still all present? Did they remove the right drive?
Data files still existed. They removed the wrong USB but it didn’t really matter. Maybe the SAN had a hissy fit and we just need to reboot the server. *cue 8 year waiting period for this unhappy machine to reboot*.
Oh boy. System came back up, but I can’t view 2 of the drives anymore. Both are on the same SAN. Now I’m getting really confused, what do I look at now? WINDOWS EVENT LOG! There I find a different error than I expected, “Operating System error 19(Media is write protected) encountered”. Time to call the datacenter.
So I recap everything I have checked to our contact there and asked him to give things a look and tell me what he’s seeing. He tells me I’m out of space, and then I tell him that I am not. I have 400 GB between the 2 drives. It’s not great, but I am not out. What else do you think is wrong? Then I learned something VERY fun. They put a threshold in to cap my SAN usage at 95%. So instead of running out of space when I expect to run out of space, I “run out of space” faster with no warning since my disk reports aren’t configured to let me know 400GB is actually 0GB.
Around and around we go trying to come up with solutions on how to get us functioning again. They were able to add a nice 10TB NAS device to hold some of our data. With a wave of the ole dbWonderKid wand, we moved some databases, combined those 2 drives into 1, and tadaaaa a functioning mirror once again existed.
- Not all SANs are created equal. Learn your limits and alert on them properly.
- There are a few ways to fix a corrupt msdb (if it’s actually corrupt and you aren’t just out of space).
- BIG shout-out to John Grover @ MSSQLTips for the lesson if that day ever comes.
- Inspect all the logs you have available to you when an issue is encountered. I lost so much time going down rabbit holes that didn’t matter. If I would’ve jumped to Windows Logs immediately after I realized I couldn’t do anything in SSMS, I would have saved myself a good hour or so of spinning my tires.