qmail-smtpd CLOSE_WAIT

Started by shealey, December 05, 2006, 08:14:31 AM


shealey


This issue is not specifically Toaster related - I just want to run it by forum members who are familiar with qmail and its related tools.

We are running a Gentoo Linux based implementation of the Mail::Toaster (4.0x) on a Sun Netra, and ever since we built it, it has from time to time become severely CPU-bound and handled nothing like the throughput one would expect from a qmail-based mail server.

The cause of the excessive CPU usage seems to be the instances of qmail-smtpd which tcpserver spawns to deal with each incoming connection on port 25.

Instances of qmail-smtpd are being spawned up to the configured maximum (25). Each initially deals with an inbound message but eventually sits there, apparently doing nothing except consuming an identical share of CPU time, with the box running at 98 - 99% user CPU. The elapsed CPU time of several of these processes gradually accumulates, with some clocking up 20 - 30 *minutes*.

The effect is reproducible by restarting SMTP: you then see a flurry of activity in the mail logs as a batch of new qmail-smtpd processes is spawned, each handling a connection, but this activity soon dries up to a trickle - even though a whole bunch of unhandled SMTP connections is waiting in SYN_RECV state - and a pile of old connections builds up in CLOSE_WAIT state.

I know an earlier message touches on a similar sounding subject ...

http://www.tnpi.net/support/forums/index.php/topic,557.0.html

... But we have the relevant 512 and 1024 bit key files sitting in /var/qmail/control - so unless it's a permission issue on these files, I don't think the problem is related to dynamic key generation. (We also don't see OpenSSL using much CPU.)
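
A quick way to sanity-check that, in case it really is a permission problem (rsa512.pem, dh512.pem and dh1024.pem are the usual file names with the TLS patch - adjust if yours differ):

    # check ownership and permissions on the temporary SSL key files
    cd /var/qmail/control && ls -l rsa512.pem dh512.pem dh1024.pem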

A few days ago I realised that the PIDs of these long-running processes all correspond to SMTP connections in CLOSE_WAIT state. It looks to me like qmail-smtpd is finishing an SMTP transaction and simply not hanging up the connection properly.

I wrote a one-line AWK script which parses the output of 'netstat -anp' and gives me back the PIDs of the offending processes, embedded it in a simple shell loop which kills each process in the list, and cron'd it to run every 3 minutes.
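
Roughly what the loop looks like is something like this (not my exact one-liner; the netstat column layout and the script path are illustrative and will vary between systems):

    #!/bin/sh
    # kill-closewait.sh (name is just for illustration): kill any qmail-smtpd
    # process whose SMTP connection is stuck in CLOSE_WAIT, per 'netstat -anp'.
    # On this netstat, field 6 is the TCP state and field 7 is "PID/program".
    for pid in $(netstat -anp 2>/dev/null |
        awk '$6 == "CLOSE_WAIT" && $7 ~ /qmail-smtpd/ { split($7, a, "/"); print a[1] }')
    do
        kill "$pid"
    done

and the crontab entry is simply:

    # run the cleanup every 3 minutes (script path is illustrative)
    */3 * * * * /usr/local/sbin/kill-closewait.sh >/dev/null 2>&1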

Since I implemented this rather brutal hack, we have processed more mail in the last 24 hours than in all of the previous week!

My feeling is that this behaviour is likely something to do with the interaction between tcpserver and qmail-smtpd ... I can't really switch on fully verbose SMTP logging to check right now, because we are still being bombarded by joe-job type rubbish.

Comments? Opinions?

Sean.

matt

You say you have the temp key files sitting in place, but are you keeping them updated? You must either run the /var/qmail/bin/update_tmprsadh script periodically, or run toaster-watcher.pl and it will do that for you. Letting them go stale will cause exactly the problems you describe.
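
If you would rather not run toaster-watcher.pl, a root crontab entry along these lines is all it takes (the schedule here is just an example):

    # regenerate the temporary RSA/DH key files once a day
    0 4 * * * /var/qmail/bin/update_tmprsadh >/dev/null 2>&1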

Of course, having mismatched versions of OpenSSL and your qmail-smtpd daemon would cause this too. Say, for example, you built your toaster a long time ago, then recompiled OpenSSL to resolve security vulnerabilities and did NOT rebuild qmail afterwards.
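
An easy way to rule that out, assuming qmail-smtpd is dynamically linked (path per the standard toaster layout), is to compare what the binary actually loads against the OpenSSL you have installed:

    # the OpenSSL shared libraries qmail-smtpd was built against
    ldd /var/qmail/bin/qmail-smtpd | grep -Ei 'ssl|crypto'
    # the OpenSSL version currently installed
    openssl version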

But my guess is the former.

shealey

Matt, thanks for the advice - you've pointed me back to something I thought I had checked properly, and as a result I've found and fixed the problem.

It was a rogue cron job!

The Gentoo Linux port of qmail installs a script, 'qmail-genrsacert.sh', into the vixie-cron /etc/cron.daily directory ... and Mail::Toaster installs 'update_tmprsadh' under /var/qmail/bin. The port's version was creating key files owned by qmaild:qmail (wrong!!!) and the toaster's version was creating them owned by vpopmail:vchkpw (right!!).

What seems to have been happening is that 'update_tmprsadh' would create new keys, and a short while later 'qmail-genrsacert.sh' would come along and stomp over them with files under the wrong ownership.
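
For anyone hitting the same thing, the fix amounts to removing the distro cron job and regenerating the keys with the toaster script (roughly - paths as described above):

    # remove the Gentoo-installed daily job that was clobbering the key files
    rm /etc/cron.daily/qmail-genrsacert.sh

    # recreate the temp keys with the ownership the toaster expects
    /var/qmail/bin/update_tmprsadh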

It says a lot about my system security that an unknown cron job could go unnoticed all this time :-(

Sean.