Apache Servers are CPU Bound – The Silent Admin — Musings from a lifetime of computers

When Bad Things Happen to Good Servers

How many of you have had this happen? It is late Friday afternoon, you are getting ready to pack it up, and the phone rings. You answer against your better instincts and BAM… you find yourself in the middle of a conference call with a former client from a few years back, and they are asking you to help them this very moment.

I don’t know about you, but I have to be very selective because I am a one-man show here at Aesir Computing, Inc. I would love to help everyone who calls me, and I can often spot and fix problems over the phone without charge, but I could tell quickly that this was not going to be one of those cases. The call began and I started to jot down the list of things they were saying.

Servers had been hacked, most likely through a SQL injection attack
The vendor managing their servers claimed it was the client’s problem
An independent security firm had done an audit and reported finding rootkits and a SQL injection attack
Servers were CPU bound and unresponsive
They had two front-end web servers querying a back-end MySQL database server
Approximately 50-60 domains were hosted in Joomla environments
They wanted to move toward a more sustainable plan going forward
They had no Tripwire/AIDE installed
They had moved the domains back to their office and away from the data center
Could I help them fix the CPU problem to hold them over while they worked on management for a longer-term fix

I explained that I had two hours before other commitments came into play and said, “Sure, I’ll take a look, but I only have two hours to spend on this.”

I waited impatiently for the email with the login information. The clock was ticking, and it arrived 20 minutes later. When I attempted to connect, a firewall blocked my access, so I fired off another email and waited for that to be resolved. About 20 minutes later, a reply arrived and I was off to the races with access.

I logged in and ran uptime because the machine was clearly under serious load. Here is what I got back:

# uptime
10:47:14 up 1 days, 23:22, 4 users, load average: 106.71, 99.36, 99.76
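A load average only means something relative to the number of CPU cores, so a quick way to read the figures above is to divide them by the core count. The sketch below does that with Python's standard library. The function name `load_per_core` and the 4-core figure in the usage note are illustrative assumptions; the post does not say how many cores this server had.

```python
import os


def load_per_core(load_average, cores):
    """Return a load average divided by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_average / cores


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
    one_minute, five_minute, fifteen_minute = os.getloadavg()
    # os.cpu_count() can return None, so fall back to 1 core in that case.
    cores = os.cpu_count() or 1
    print(f"1-minute load per core: {load_per_core(one_minute, cores):.2f}")
```

On a hypothetical 4-core box, a load of 106.71 works out to roughly 26.7 runnable tasks per core, which is why the shell felt frozen. As a rough rule of thumb, a sustained value well above 1.0 per core means work is queuing.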

So I have just over an hour to go and I begin my process. Anyone who has ever worked on a machine like this knows that you plan your commands very carefully, because the keys you type may not be echoed back for what can seem like minutes. For the sake of space, I will not list those commands, but here is the top output.

# top

11:57:29 up 1 days, 22 min, 3 users, load average: 50.17, 99.37, 99.76
Tasks: 96 total,   6 running, 90 sleeping,   0 stopped,   0 zombie
Cpu(s): 96.8%us, 2.6%sy, 0.0%ni, 0.4%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st
Mem:   3369668k total, 3175196k used,   194472k free,   252792k buffers
Swap: 2031608k total,       80k used, 2031528k free, 2657444k cached
PID USER      PR NI VIRT RES SHR S %CPU %MEM    TIME+ COMMAND
21677 apache    16   0 408m 10m 6580 R   87 0.3   0:03.25 httpd
21672 apache    16   0 408m 22m 17m R   75 0.7   0:05.56 httpd
21663 apache    19   0 408m 39m 34m R   71 1.2   0:12.42 httpd
21616 apache    16   0 408m 51m 46m R   62 1.6   0:39.48 httpd
21669 apache    15   0 408m 15m 10m S   44 0.5   0:04.14 httpd
21665 apache    16   0 408m 19m 15m R   34 0.6   0:11.22 httpd
21673 apache    15   0 408m 21m 16m S   13 0.7   0:02.56 httpd
21678 apache    15   0 408m 11m 6712 S    9 0.3   0:00.26 httpd
21675 apache    15   0 408m 11m 6748 S    1 0.3   0:00.10 httpd
21680 apache    15   0 407m 8996 4840 S    1 0.3   0:00.02 httpd
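The summary line in that output already answers the "CPU or disk?" question: nearly all the time is in user space (us) and none is in I/O wait (wa). As a rough sketch, the same reading can be automated by parsing the `Cpu(s):` line. The function names and the 90% and 10% thresholds here are my own illustrative assumptions, and the regular expression only matches the older `96.8%us` layout shown in this post, not newer top versions that print `96.8 us`.

```python
import re


def parse_cpu_line(line):
    """Parse a top 'Cpu(s):' summary line into {field: percent}."""
    pairs = re.findall(r"(\d+(?:\.\d+)?)%\s*(\w+)", line)
    return {name: float(value) for value, name in pairs}


def classify(cpu, busy_threshold=90.0, iowait_threshold=10.0):
    """Label a CPU breakdown; the thresholds are illustrative assumptions."""
    if cpu.get("wa", 0.0) >= iowait_threshold:
        return "io-bound"
    if cpu.get("us", 0.0) + cpu.get("sy", 0.0) >= busy_threshold:
        return "cpu-bound"
    return "not saturated"


# The summary line copied from the top output in this post.
line = "Cpu(s): 96.8%us, 2.6%sy, 0.0%ni, 0.4%id, 0.0%wa, 0.0%hi, 0.2%si, 0.0%st"
cpu = parse_cpu_line(line)
print(classify(cpu))  # prints "cpu-bound"
```

With user plus system time at 99.4% and I/O wait at zero, the processes are burning CPU rather than waiting on disk, which matches the httpd entries at the top of the process list.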

I know nothing about this machine at this point. I log in to the other machine and it’s worse. I decide to work on the faster of the two boxes. I am working on a production server and I need to get this thing under control fairly quickly. I begin the analysis.

netstat shows that there are hundreds of unique IP connections happening
vmstat shows that there is no I/O
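
Counting unique remote addresses by eye is painful on a sluggish box, so here is a minimal sketch of how the netstat check could be tallied. It assumes the Linux `netstat -ant` column layout (the post does not say which flags were used), and the sample text is made up using addresses from the reserved documentation ranges, not data from this server.

```python
from collections import Counter


def connections_by_remote_ip(netstat_output):
    """Count TCP connections per remote IP from `netstat -ant` style text."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows look like: proto recv-q send-q local foreign state
        if len(fields) < 6 or not fields[0].startswith("tcp"):
            continue
        # Listening sockets have no real remote peer, so skip them.
        if fields[5] == "LISTEN":
            continue
        # Strip the port; rsplit also copes with IPv6-style addresses.
        remote_ip = fields[4].rsplit(":", 1)[0]
        counts[remote_ip] += 1
    return counts


# Made-up sample output, for illustration only.
sample = """\
Proto Recv-Q Send-Q Local Address   Foreign Address      State
tcp        0      0 0.0.0.0:80      0.0.0.0:*            LISTEN
tcp        0      0 192.0.2.10:80   203.0.113.7:51234    ESTABLISHED
tcp        0      0 192.0.2.10:80   203.0.113.7:51240    TIME_WAIT
tcp        0      0 192.0.2.10:80   198.51.100.23:40112  ESTABLISHED
"""
counts = connections_by_remote_ip(sample)
print(len(counts), counts.most_common(1))
```

The number of keys gives the unique-IP count, and `most_common` shows whether a handful of addresses account for most of the connections or the traffic is spread widely.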

To be continued