Weird server reboots.

Started by TW_Adam, September 05, 2007, 02:12:44 PM

TW_Adam

Good Afternoon all,
I have a weird problem where my mail servers reboot at odd times, with no entries in the logs, no core dumps, and no other indication of why they are rebooting. I have read in the forums and online that faulty hardware is often what causes spontaneous reboots, so I had Dell come out and replace all of the hardware in the server that reboots most frequently. Within 3 days, that server had rebooted again.

Some Background.

In my setup, I have a total of 4 mail servers. 2 are running, 2 are backups with no domains on them.
Only the 2 with domains on them are having this rebooting issue. (Mail1 and Mail3 are active with domains, Mail2 and Mail4 are spares used to transition if necessary or disaster recovery)

It should be noted that day to day, when the servers are not rebooting, they work extremely well. All of the services run, spam detection is outstanding, and mail is processed very efficiently. Mail1 takes 2000 connections per hour and Mail3 takes about 4000 connections per hour. Load averages almost never go above 0.35, 0.72, 0.42. It also appears that these reboots only occur at off-peak times, such as after we would normally go home from work, or sometimes at 3-4 am, and not usually during working hours.
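One way to check the "only at off-peak times" impression is to tally the reboot times by hour. The following is a minimal sketch, assuming the reboot times have already been collected by hand as `YYYY-MM-DD HH:MM` strings; the working-hours window, the function name `tally_reboots`, and the sample timestamps are all hypothetical and not taken from the real servers.

```python
from collections import Counter
from datetime import datetime

# Assumed working hours (08:00 to 17:59); adjust to the site's real schedule.
WORK_START, WORK_END = 8, 18


def tally_reboots(timestamps):
    """Count reboots inside and outside the assumed working hours."""
    counts = Counter()
    for ts in timestamps:
        hour = datetime.strptime(ts, "%Y-%m-%d %H:%M").hour
        bucket = "working" if WORK_START <= hour < WORK_END else "off_peak"
        counts[bucket] += 1
    return counts


# Hypothetical reboot times, for illustration only.
sample = [
    "2007-08-20 03:41",
    "2007-08-27 04:05",
    "2007-09-01 19:30",
    "2007-09-03 11:15",
]
print(tally_reboots(sample))
```

If nearly every reboot lands in the off-peak bucket, that would point toward something scheduled (cron jobs, backups, nightly scans) rather than load.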

All of the servers are Dell 1U rackmount i386 boxes. They have 80 GB SATA drives, which are at 8% capacity. They are running FreeBSD 6.1 on dual 2.8 GHz processors, with toaster version 4.10. Mail1 has 1.5 GB of RAM, Mail3 has 1 GB of RAM. Both have dual gigabit NICs, and each is plugged into a separate UPS (APC 1200s).


So far, these are the steps I've taken to stop the reboots.

On Mail1, replaced all of the hardware and kept only the empty chassis. (gotta love Dell warranty)
Replaced all power cables with beefy Dell server cables.
Added fsck_y_enable="YES" and background_fsck="NO" to rc.conf. This doesn't stop the reboots, but it means fewer calls to the data center to have them run fsck manually.
Read up on ClamAV issues with file attachments, added .cab and .chm to simconf, then rebuilt simscan.cdb.
Limited databytes to 18 MB max (just in case it was a 300 MB mail causing this).

Does anyone have any other suggestions? This is really driving me nuts, as I know people have a lot more domains hosted on a lot less box with far fewer issues. Back in the early years, we had many more domains on a single box without it rebooting all the time. I have considered upgrading FreeBSD to 6.2, but what I've read so far doesn't indicate that there was ever a weird rebooting issue in 6.1 that would be fixed by this upgrade.

Thanks for your time in reading this.

matt

Spontaneous reboots on a *BSD system are rarely anything other than:

  a. bad RAM
  b. overheating CPU

Does your Dell motherboard support any type of hardware monitoring, so that you can monitor the CPU temp? Often, something as simple as a fan failing will cause the CPU to overheat. But that doesn't sound like your problem. If I had to put money on anything, it would be faulty RAM.

Is it ECC RAM?  Have you run memtest on it?  Overnight?

TW_Adam

Thanks for reading Matt,

I thought bad RAM as well, except I had Dell replace it when they gutted the box. I also had them replace all of the fans, etc. It was rebooting most often under very little load, at weird times, etc. During peak times the boxes run well at 5000+ POP connections per hour. The RAM is matched DDR2-400 ECC. (I have not run memtest on it overnight; however, each box has processed over a million messages with zero errors.)

Since blocking the .cab and .chm files, neither of the working servers has rebooted. I was going to wait another week before appending to this post, but since you replied I figured I'd update with the latest information. So that's 7 and 11 days respectively. (It'll be 8 and 12 tomorrow at 8 am, which is the longest in about 6 months or so.)

Now I don't profess to have anywhere near the experience with the diverse variety of hardware you do, so memtest will no doubt be one of the next steps.

Thanks Adam

TW_Adam

And here is a little more info....

I installed and ran memtest on 2 of the 4 machines. 1 that had domains, and 1 that did not. On both, memtest ran for many hours without any errors in the loops. On the box with domains, about 3 days later it again rebooted without any notice.

Earlier today the one that reboots most frequently rebooted again; however, uptime was just a couple of hours under 14 days. I checked the maillog this morning. For the time just before it rebooted, this error appeared only twice, for users that are on the box, and from their correct, previously connected IPs.

vchkpw-pop3: vpopmail user not found xxx@domain.com:xxx.xxx.xxx.xxx

After checking Google and previous postings, this kind of error seems to point to a MySQL issue with Linux threads. I don't believe that Linux threads was built into this MySQL version, but I may be incorrect about that.

So I guess my question would be can this be the cause of my grief? Would this error force a reboot on the server?

thanks for your time.

FreeBSD port: mysql-server-4.1.20 ....checking runtime info and system variables shows no mention of linux threads.
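The maillog check described above can be sketched as a small script that counts the "vpopmail user not found" messages per user and IP, which makes it easier to compare the minutes before several different reboots. This is only a sketch: the message pattern follows the line quoted in the post, while the syslog prefix in the sample lines, the addresses, and the function name `count_not_found` are assumptions for illustration.

```python
import re
from collections import Counter

# Pattern follows the message format quoted in the post:
#   vchkpw-pop3: vpopmail user not found user@domain:ip
NOT_FOUND = re.compile(r"vchkpw-pop3: vpopmail user not found (\S+?):(\S+)")


def count_not_found(lines):
    """Return a Counter keyed by (user, ip) for each matching log line."""
    counts = Counter()
    for line in lines:
        match = NOT_FOUND.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Hypothetical log lines; the timestamp and host prefix are assumed.
sample_log = [
    "Sep  5 03:10:01 mail1 vpopmail[1234]: vchkpw-pop3: vpopmail user not found alice@example.com:192.0.2.10",
    "Sep  5 03:10:07 mail1 vpopmail[1240]: vchkpw-pop3: vpopmail user not found alice@example.com:192.0.2.10",
    "Sep  5 03:10:09 mail1 vpopmail[1241]: vchkpw-pop3: (PLAIN) login success bob@example.com:192.0.2.11",
]

for (user, ip), n in count_not_found(sample_log).items():
    print(user, ip, n)
```

In practice the lines would come from the real maillog (for example `open("/var/log/maillog")`), filtered to the window just before each reboot.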

matt

Strange stuff. I'd start by upgrading to FreeBSD 6.2. You want to be on 6.2 for security reasons anyway. But that won't likely solve your reboot issue....

You may want to consider upgrading to 6.2-stable. There have been a lot of changes made in -stable since 6.2-RELEASE and quite a few of them have to do with MySQL performance. Check out some tests I ran: http://dnscache01.layeredtech.com/MySQL_Benchmarks.html

Once upgraded to 6.2-stable, rebuild MySQL. I'd upgrade to 5.0.latest while I was at it. Then make sure to add this to /etc/libmap.conf on your MySQL server:

  [mysqld]
  libpthread.so.2 libthr.so.2
  libpthread.so libthr.so

I don't know if that'll do anything for your reboot problem but it certainly will give MySQL a huge performance boost.

TW_Adam

Thanks for responding again.
That is some major MySQL performance boost you have going there. As an update to the previous information with regard to the servers rebooting:
After running memtest and seeing the servers reboot anyway, I tried a simple test. I put a command into the crontab to restart MySQL every 23 hours, and I have let it stay there for the past month. Wm1 up 34+07, Wm3 up 39+12. So for the first time in a very long time, the servers are not rebooting. After some digging around, the suggestion is that it may be something connecting to MySQL, and not MySQL itself.
I'll post further info as I have it. (good call on the upgrade though)
Thanks for reading
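The post does not show the crontab entry that was used. Standard cron fields cannot express a true 23-hour interval (a step like */23 in the hour field fires at hours 0 and 23 of each day), so one possible approach is to run a small script hourly and let it decide whether 23 hours have passed. The sketch below assumes that approach; the stamp file path, the rc script path, and all function names are hypothetical and would need checking against the actual install.

```python
import os
import subprocess
import time

RESTART_INTERVAL = 23 * 3600  # 23 hours, the interval mentioned in the post
STAMP_FILE = "/var/run/mysql_restart.stamp"  # hypothetical path
# Assumed rc script location for the MySQL port; verify on the real system.
RESTART_CMD = ["/usr/local/etc/rc.d/mysql-server", "restart"]


def restart_due(last_restart, now, interval=RESTART_INTERVAL):
    """True when no restart is recorded or `interval` seconds have passed."""
    return last_restart is None or now - last_restart >= interval


def read_stamp(path=STAMP_FILE):
    """Return the stamp file's modification time, or None if it is missing."""
    try:
        return os.path.getmtime(path)
    except OSError:
        return None


def main():
    """Restart MySQL if due, then touch the stamp file to record it."""
    if restart_due(read_stamp(), time.time()):
        subprocess.run(RESTART_CMD, check=True)
        with open(STAMP_FILE, "w"):
            pass


# main() is intentionally not called here, so the sketch has no side effects;
# a cron-invoked script would call it once per hour.
```

A plain once-a-day cron entry would be simpler and is probably close enough if the exact 23-hour spacing does not matter.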