When Bad Things Happen to Good Servers
How many of you have had this happen? It is late Friday afternoon, you are getting ready to pack it up, and the phone rings. You answer against your better instincts and BAM… you find yourself in the middle of a conference call with a former client from a few years back, and they are asking you to help them this very moment.
I don’t know about you, but I have to be very selective because I am a one-man show here at Aesir Computing, Inc. I would love to help everyone who calls me, and I can often spot and fix problems over the phone without charge, but I could tell quickly that this was not going to be one of those cases. As the call went on, I jotted down the list of things they were saying.
- Servers had been hacked, most likely through a SQL injection attack
- The vendor managing their servers claimed it was the client’s problem
- An independent security firm had done an audit and said they found rootkits and a SQL injection attack
- Servers were CPU bound and unresponsive
- They had two front-end web servers querying a back-end MySQL database server
- There were approximately 50-60 domains hosted in Joomla environments
- They wanted to move toward a more sustainable plan going forward
- They had no Tripwire/AIDE installed
- They had moved the domains back to their office and away from the data center
- Could I help them fix the CPU problem to hold them over while they worked on management for a longer-term fix?
I explained that I had two hours before other commitments came into play and said, “Sure, I’ll take a look, but I only have two hours to spend on this.”
I waited impatiently for the email to arrive with the login information. The clock was ticking, and it arrived 20 minutes later. I attempted to connect, but a firewall was blocking my access, so I fired off another email and waited for that to be resolved. About 20 minutes later a reply arrived, and I was off to the races with access.
I logged in and ran uptime because the machine was clearly under serious load. Here is what I got back:
# uptime
10:47:14 up 1 days, 23:22, 4 users, load average: 106.71, 99.36, 99.76
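A load average only means something relative to the number of CPU cores behind it. As a rough sketch (the helper name `load_per_core` and the core count of 8 are my own assumptions, since the core count of this server is not given), dividing the 1-minute figure by the core count gives a quick sense of how oversubscribed a box is:

```shell
# Sketch: divide a load average by the number of CPU cores.
# A result well above 1.0 roughly means more work is queued than the cores can service.
# load_per_core is a hypothetical helper, not a standard tool.
load_per_core() {
    load="$1"    # load average, e.g. the first of the three numbers uptime reports
    cores="$2"   # core count, e.g. from nproc
    awk -v l="$load" -v c="$cores" 'BEGIN { printf "%.2f\n", l / c }'
}

# Example with the 1-minute load above and an assumed 8 cores:
#   load_per_core 106.71 8
# On a live Linux box (assumes /proc/loadavg and nproc are available):
#   load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

Even on a generously sized machine, a 1-minute load above 100 works out to many runnable tasks per core, which matches how sluggish the shell felt.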
So I have just over an hour to go, and I begin my process. Anyone who has ever worked on a machine like this knows that you plan your commands very carefully, because any keys you type will not be echoed back for what can seem like minutes. For the sake of space, I will not tell you what those commands were, but here is the top output.
# top
11:57:29 up 1 days, 22 min, 3 users, load average: 50.17, 99.37, 99.76
I know nothing of this machine at this point. I log in to the other machine and it’s worse. I decide to work on the faster of the two boxes. I am working on a production server and I need to get this thing under control fairly quickly. I begin the analysis.
- netstat shows that there are hundreds of unique IP connections happening
- vmstat shows that there is no I/O
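The exact commands used here are not shown, so the following is only a sketch of one common way to get at the first observation: tally connections per remote address from `netstat -ant` output. The function name `count_remote_ips` is hypothetical, and the field positions assume the Linux net-tools layout (foreign address in column 5, state in column 6).

```shell
# Sketch: count TCP connections per remote IP from netstat-style text on stdin.
# Assumes Linux net-tools "netstat -ant" columns: $5 = foreign address, $6 = state.
# count_remote_ips is a hypothetical helper name.
count_remote_ips() {
    awk '$1 ~ /^tcp/ && $6 != "LISTEN" {
             sub(/:[0-9]+$/, "", $5)   # strip the trailing :port
             print $5
         }' \
        | sort | uniq -c | sort -rn    # busiest remote addresses first
}

# Usage on a live box:
#   netstat -ant | count_remote_ips | head
```

For the second observation, `vmstat 1 5` prints a sample every second for five seconds. Consistently low `bi`/`bo` and `wa` columns alongside a long run queue (`r`) would point at CPU rather than disk as the bottleneck.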
To be continued