Apache Servers are CPU Bound

When Bad Things Happen to Good Servers

How many times has this happened to you? It is late Friday afternoon, you are getting ready to pack it in, and the phone rings. You answer against your better instincts and BAM… you find yourself in the middle of a conference call with a former client from a few years back, and they are asking you to help them this very moment.

I don’t know about you, but I have to be very selective because I am a one-man show here at Aesir Computing, Inc. While I would love to help everyone who calls, and I can often spot and fix problems on the phone without charge, I could tell quickly this was not going to be one of those calls. The call began and I started to jot down the list of things they were saying.

  1. Servers had been hacked, most likely via a SQL injection attack
  2. The vendor managing their servers claimed it was the client’s problem
  3. An independent security firm had done an audit and found rootkits and a SQL injection attack
  4. Servers were CPU bound and not responsive
  5. They had two front-end web servers querying a backend MySQL database server
  6. There were approximately 50-60 domains being hosted in Joomla environments
  7. They wanted to move toward a more sustainable plan going forward
  8. They had no Tripwire/AIDE installed
  9. They had moved the domains back to their office and away from the data center
  10. Could I help them fix the CPU problem to hold them over while they worked with management on a longer-term fix?

I explained that I had two hours before other commitments came into play and said, “Sure, I’ll take a look, but I only have two hours to spend on this.”

I waited impatiently for the email to arrive with the login information. The clock was ticking, and it showed up 20 minutes later. I attempted to connect, but a firewall was in place blocking my access, so I fired off another email and waited for that to be resolved. About 20 minutes later, another email arrived and I was off to the races with access.

I logged in and ran uptime first, because the machine was under serious congestion. Here is what I got back:

# uptime
10:47:14 up 1 days, 23:22,  4 users,  load average: 106.71, 99.36, 99.76
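For readers who have not stared down a number like that: the load average is the 1-, 5-, and 15-minute average of the run queue, and it only means something relative to the core count. A quick sketch of how to judge it on a Linux box:

```shell
# Load average relative to CPU count: a sustained load of ~100 on a
# small box means dozens of tasks queued per core. Both numbers come
# straight from the system:
nproc               # number of CPUs available
cat /proc/loadavg   # 1-, 5-, 15-minute load, run queue, last PID
```

Here all three figures were near 100 and the 15-minute value was just as bad as the 1-minute value, so the congestion was sustained, not a momentary spike.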

So I have just over an hour to go, and I begin my process. Anyone who has ever worked on a machine like this knows that you plan your commands very carefully, because any keys you type will not be echoed back for what can seem like minutes. For the sake of space, I will not tell you what those commands were, but here is the top output.
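As a general pattern (not necessarily the exact commands used on this box), one-shot, batch-mode invocations are the way to triage a machine this loaded, since interactive tools can hang for what feels like forever:

```shell
# One-shot triage on a heavily loaded box: each command runs once and
# exits, so nothing sits waiting on a sluggish terminal.
uptime                        # how bad is it, and which way is it trending?
top -b -n 1 | head -25        # batch mode: a single snapshot, no refresh loop
ps aux --sort=-%cpu | head    # the processes currently owning the CPU
```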

# top

11:57:29 up 1 days, 22 min,  3 users,  load average: 50.17, 99.37, 99.76
Tasks:  96 total,   6 running,  90 sleeping,   0 stopped,   0 zombie
Cpu(s): 96.8%us,  2.6%sy,  0.0%ni,  0.4%id,  0.0%wa,  0.0%hi,  0.2%si,  0.0%st
Mem:   3369668k total,  3175196k used,   194472k free,   252792k buffers
Swap:  2031608k total,       80k used,  2031528k free,  2657444k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
21677 apache    16   0  408m  10m 6580 R   87  0.3   0:03.25 httpd
21672 apache    16   0  408m  22m  17m R   75  0.7   0:05.56 httpd
21663 apache    19   0  408m  39m  34m R   71  1.2   0:12.42 httpd
21616 apache    16   0  408m  51m  46m R   62  1.6   0:39.48 httpd
21669 apache    15   0  408m  15m  10m S   44  0.5   0:04.14 httpd
21665 apache    16   0  408m  19m  15m R   34  0.6   0:11.22 httpd
21673 apache    15   0  408m  21m  16m S   13  0.7   0:02.56 httpd
21678 apache    15   0  408m  11m 6712 S    9  0.3   0:00.26 httpd
21675 apache    15   0  408m  11m 6748 S    1  0.3   0:00.10 httpd
21680 apache    15   0  407m 8996 4840 S    1  0.3   0:00.02 httpd

I know nothing of this machine at this point. I log in to the other machine and it’s worse. I decide to work on the faster of the two boxes. I am working on a production server and I need to get this thing under control fairly quickly. I begin the analysis.

  1. netstat shows that there are hundreds of unique IP connections happening
  2. vmstat shows that there is no I/O
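The netstat check in step 1 boils down to tallying remote addresses. A sketch of that pipeline (the function name is mine, and on newer systems `ss -tn` replaces `netstat -tn`):

```shell
#!/bin/sh
# top_talkers: tally established TCP connections per remote IP.
# Pipe `netstat -tn` (or `ss -tn`) output into it; the foreign
# address:port pair sits in column 5 and the state in column 6.
top_talkers() {
    awk '$6 == "ESTABLISHED" { split($5, a, ":"); print a[1] }' \
        | sort | uniq -c | sort -rn
}

# On the live box:
#   netstat -tn | top_talkers | head -20
# A long tail of single-connection IPs points at a distributed crawl or
# attack; a few IPs with huge counts point at a misbehaving client.
```

Pairing that with `vmstat 1 5` closes the loop: a near-zero `wa` (I/O wait) column alongside a pegged `us` column says the load is CPU, not disk.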

To be continued