DMI News

Server Problems

December 20, 2006 13:53

As I'm certain everyone has noticed over the last 24 hours, the server has... well... not been there. This has been most unfortunate, mostly for me. I will explain the process I went through to get it working again.

About 4:50 pm I suddenly notice the server is not responding. Assuming the server has crashed, I order an automatic reboot. This goes off without a hitch, except for the slight problem that it didn't actually reboot the machine. But it thought it did. So I order a manual reboot, which usually takes a little longer to process. In the meantime I do some poking around and discover that the server is still responding to ssh, but won't let me log in, saying the passwords are bad. Confusing, but ok. Anyways, they eventually come back and say the server has been rebooted (again) and is now responding. Only problem here is that it never went down during that timeframe. However, I'm busy pursuing the ssh issue. On a lark, I decide to run a portscan on the server to see what IS open. Lots of stuff, but nothing much that should be. And one of the open ports was 31337, a number with a long history as a backdoor port. That didn't look promising.
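For anyone curious what that kind of portscan boils down to, here is a minimal sketch in Python of a plain TCP connect scan, followed by a comparison against the ports you expect to be open. It is not the tool I actually used; the function names and the example port numbers in the note below are made up for illustration. Only point it at machines you own.

```python
import socket


def open_ports(host, ports, timeout=0.5):
    """Return the ports on `host` that accept a TCP connection."""
    found = []
    for port in ports:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(timeout)
            # connect_ex returns 0 on success and an error code otherwise,
            # so closed or filtered ports do not raise an exception.
            if sock.connect_ex((host, port)) == 0:
                found.append(port)
    return found


def unexpected_ports(found, expected):
    """Return the open ports that are not in the expected set (hypothetical helper)."""
    return sorted(p for p in found if p not in expected)
```

Calling `unexpected_ports(open_ports(host, range(1, 1025)), {22, 80})` would list anything listening besides ssh and http, assuming those two are the only services the box is meant to expose. And as this whole story shows, an odd result only tells you about whichever machine is actually answering at that address.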

So now I figure I've been hacked. Priority now is to get the machine shut down and off the network before whoever broke in is able to do any significant damage. Despite the obvious simplicity of pulling a plug, this request took about 2 hours to accomplish, although I later figured out why. I order a system restore for the next morning with the old drive slaved, and go to work.

The next morning, the next afternoon and into the next evening, they're busy trying to restore the system. This includes at first formatting a new drive and putting it in the computer, booting it and putting it on the network. That was easy and took less than 30 minutes. Only problem was, it didn't work. They didn't seem to be able to actually GET IT ON THE NETWORK. So... in an effort to solve this problem, they replaced and reloaded just about everything.

Eventually, they discovered another problem. There was an IP conflict on the network. They mentioned this at the time they discovered it, but in my stupefied state, I somehow missed it right then. It would appear that this was the reason my server had gone offline. I had been attempting to get into someone else's box and not my own, and for that reason assumed I had been hacked. My server in fact was purring along just happily oblivious to any problems, and as it would appear, traffic.

They fix the IP conflict issue, map the IP address to the MAC address and proceed along happily. Only, it still doesn't work. So, the obvious solution to this problem is to reload everything again. Still nothing. Finally, the poor guy who had been working on this problem for the entirety of his shift forwards it to a supervisor on the next shift. It's about this time that I discover the IP conflict message and realize what the problem REALLY was, and I start mentioning this in the trouble ticket. However, by this time, they're between shifts and nobody's paying any attention. I put in requests to call me, and start bugging people until the next shift supervisor calls me as requested. I give him the quick rundown of what happened and tell him what I figure was the REAL problem we should be chasing. Armed with new information, he attempts to figure out what went wrong.

Less than an hour later, he comes back and states the problem has been solved and the server is now live on the network. What happened was this: when they discovered the IP conflict and mapped my IP address to my MAC address, that should have solved the problem right there. However, the MAC address they used was that of the original server. By this time, the server itself had been replaced, along with the NIC, so there was a new MAC address in use. The binding therefore pointed at hardware that was no longer there, and the end result was the same.
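The stale binding is easy to picture as a lookup table that nobody updated after the hardware swap. Here is a rough Python sketch of the check that would have caught it: compare each bound MAC address against the one actually seen for that IP. I have no idea how the data center stores its bindings, so the dictionaries, the function names and the addresses in the note below are all hypothetical.

```python
def normalize_mac(mac):
    """Lower-case a MAC address and use ':' separators so formats compare equal."""
    digits = mac.lower().replace("-", "").replace(":", "").replace(".", "")
    if len(digits) != 12 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError(f"not a MAC address: {mac!r}")
    return ":".join(digits[i:i + 2] for i in range(0, 12, 2))


def stale_bindings(bindings, observed):
    """Return the IPs whose bound MAC differs from the MAC observed for that IP.

    `bindings` maps IP -> MAC as recorded in the binding table;
    `observed` maps IP -> MAC as reported by the NIC currently in the server.
    IPs with no observation are skipped rather than flagged.
    """
    stale = []
    for ip, bound_mac in bindings.items():
        seen = observed.get(ip)
        if seen is not None and normalize_mac(seen) != normalize_mac(bound_mac):
            stale.append(ip)
    return sorted(stale)
```

With a made-up binding of `192.0.2.10` to `00:11:22:33:44:55` and a replacement NIC reporting `00-11-22-AA-BB-CC`, `stale_bindings` returns `["192.0.2.10"]`, which is exactly the situation after the server and NIC were swapped.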

Once the server was online, I took a look through the old HD to see if there were any signs of tampering, but I could find none. I took a quick backup of the important stuff and told them to put the old HD back in as master and reboot. That took another 30 minutes, and they fired it up, and sure enough, everything came up just perfectly, no problems at all. DMI was back without skipping a beat.

ANYWAYS... in light of this recent scare, I've implemented a three-tier backup system. I'll have stuff I back up every day, every week, and every month. Source code will be backed up daily, most of the website stuff weekly, and some of the noncritical website components monthly, along with various system configuration files and some home directories. I'll keep a couple of past copies of each here locally, so in the worst case I'll be able to almost fully recover. I won't be saving logs or cam archives, as these take up a LOT of space and are the things I delete first whenever I need more space anyway.
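The three tiers can be sketched as a small schedule in Python. This is an illustration of the idea, not the actual scripts: running the weekly tier on Sundays and the monthly tier on the 1st are assumptions, and so is filing the configuration files and home directories under the monthly tier.

```python
from datetime import date

# What each tier covers. Grouping config files and home directories
# under "monthly" is one reading of the plan, not a confirmed detail.
TIERS = {
    "daily": ["source code"],
    "weekly": ["most website content"],
    "monthly": [
        "noncritical website components",
        "system configuration files",
        "some home directories",
    ],
}
EXCLUDED = ["logs", "cam archives"]  # never backed up: too large, deleted first

WEEKLY_WEEKDAY = 6  # assumption: Sunday (date.weekday() counts Monday as 0)
MONTHLY_DAY = 1     # assumption: first day of the month


def tiers_due(day):
    """Return the names of the backup tiers that should run on `day`."""
    due = ["daily"]
    if day.weekday() == WEEKLY_WEEKDAY:
        due.append("weekly")
    if day.day == MONTHLY_DAY:
        due.append("monthly")
    return due


def items_due(day):
    """Return everything that gets backed up on `day`, tier by tier."""
    return [item for tier in tiers_due(day) for item in TIERS[tier]]
```

Under those assumptions, an ordinary weekday only picks up the source code, while a Sunday that falls on the 1st runs all three tiers at once.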

So there you have it. An explanation of what went wrong AND a news update. More news updates will be forthcoming in the next few days. I've got a lot to talk about, I just haven't had the time to sit down and type it all out yet.