Sometimes a kernel panic, Oops, machine check exception (MCE), or another fatal crash is caused by faulty hardware. This article describes how to test your hardware to check that it is in good shape.
Please note that most of the tests described below could do harm to your machine if something is wrong with it (e.g., it is overclocked, undercooled, etc.). In general, overclocking is not recommended for production server boxes.
Random Access Memory (RAM) is sometimes faulty, which leads to some very strange system crashes. It is highly recommended to test system RAM before putting the node into production. Several approaches and tools can be used.
Memtest86 and Memtest86+
Memtest86 is a standalone RAM tester that can be booted either from a CD or floppy disk, or from the normal Linux bootloader (LILO or GRUB).
Memtest86+ is a forked version of Memtest86 with some features added.
You can download and install one of these programs from either of these sites: http://memtest86.com/ or http://memtest.org/. They may be a part of your Linux distribution already.
To test the server for faulty RAM, install either program and reboot into it. Run it for at least a few hours (at least 2 to 3 full passes); running the tests for 1 to 2 days is better. If even a single error is reported, you have to replace the faulty RAM modules (or, if your system is overclocked, return it to its normal speed and test again).
Memtester is a userspace utility for testing the memory subsystem for faults. Its advantage is that you can test memory without rebooting the server, and other programs can keep running alongside it. Its disadvantage is that not all of the memory is tested: the memory occupied by the kernel and by other running processes is not available to the test.
Memtester is available at http://pyropus.ca/software/memtester/. To build it, download and unpack the source, then type "make".
Invoke memtester as root, giving the amount of memory to test as an argument, e.g.:
# /usr/sbin/memtester 512M
The more memory you specify, the better.
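Since the amount passed to memtester has to fit into memory that is currently free, it can help to derive it from /proc/meminfo rather than guess. The following is only a sketch: the helper names pick_test_size_mb and available_kb and the idea of a fixed reserve are assumptions made for illustration, and the MemAvailable field is assumed to be present (it is missing on older kernels).

```shell
#!/bin/sh
# pick_test_size_mb AVAILABLE_KB RESERVE_MB   (hypothetical helper)
# Print how many megabytes to hand to memtester: the available memory
# minus a reserve left for the rest of the system, never less than 0.
pick_test_size_mb() {
    avail_mb=$(( $1 / 1024 ))
    size_mb=$(( avail_mb - $2 ))
    if [ "$size_mb" -lt 0 ]; then
        size_mb=0
    fi
    echo "$size_mb"
}

# available_kb   (hypothetical helper)
# Read the MemAvailable value, in kB, from /proc/meminfo.
available_kb() {
    awk '/^MemAvailable:/ { print $2 }' /proc/meminfo
}
```

With these helpers loaded, a root shell could run one pass over everything except a 256 MB reserve like this (the reserve size is an arbitrary example value):
# /usr/sbin/memtester "$(pick_test_size_mb "$(available_kb)" 256)M" 1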
CPU cooling tests
Such tests check that your CPU will work fine under the highest possible load and temperature.
Cpuburn (http://pages.sbcglobal.net/redelm/) is a utility that loads your CPU as heavily as possible. It tests system stability by checking how the CPU and the whole system behave at high temperatures.
Download the tarball from http://pages.sbcglobal.net/redelm/, untar it, and run it.
It is recommended to switch the server to single-user mode and remount all the partitions to read-only, just in case of a system hang.
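As a rough aid for the remount step, the sketch below reads a mount table in /proc/mounts format and prints, without executing anything, a remount command for every device-backed filesystem that is currently mounted read-write. The function name print_remount_cmds is made up for this example, and mount points containing spaces are not handled.

```shell
#!/bin/sh
# print_remount_cmds   (hypothetical helper)
# Read /proc/mounts-style lines on stdin and print a
# "mount -o remount,ro" command for each /dev-backed filesystem that is
# mounted read-write. Nothing is executed; review the output first.
print_remount_cmds() {
    while read -r dev mnt fstype opts rest; do
        case $dev in
            /dev/*) ;;            # keep only entries backed by a device node
            *) continue ;;        # skip proc, sysfs, tmpfs, and similar
        esac
        case ",$opts," in
            *,rw,*) echo "mount -o remount,ro $mnt" ;;
        esac
    done
}
```

It would typically be fed the live table, e.g. print_remount_cmds < /proc/mounts, after the server has been switched to single-user mode.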
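Run this command, which starts the test in the background and prints the exit code if the program terminates with an error:
# burnBX || echo $? &
To stop the test, terminate the process:
# killall -TERM burnBX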
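The start-and-stop sequence above can also be wrapped so that the burn runs for a fixed time and is then stopped automatically. This is a minimal sketch, not part of cpuburn: run_burn is a hypothetical helper, and it assumes that a burn program exiting on its own before the deadline means it detected a problem.

```shell
#!/bin/sh
# run_burn SECONDS CMD [ARGS...]   (hypothetical helper)
# Start CMD in the background, let it run for SECONDS, then stop it.
# Returns 0 if CMD was still running at the deadline,
# 1 if CMD had already exited on its own.
run_burn() {
    duration=$1
    shift
    "$@" &
    pid=$!
    sleep "$duration"
    if kill -0 "$pid" 2>/dev/null; then
        # Still alive: stop it and collect it quietly.
        kill -TERM "$pid"
        wait "$pid" 2>/dev/null || true
        echo "still running after ${duration}s, stopped: no error detected"
        return 0
    fi
    # Already gone: fetch and report its exit status.
    status=0
    wait "$pid" || status=$?
    echo "exited early with status $status"
    return 1
}
```

For example, run_burn 600 burnBX would keep burnBX running for ten minutes; the duration is an arbitrary value chosen for illustration.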
You can also use the burnMMX utility:
# burnMMX J || echo $? &
The cpuburn author says that burnMMX is not optimal for AMD processors; use burnBX if you have an AMD CPU.
It is also a good idea to run cpuburn and memtester in parallel, as this increases the likelihood that more errors will be detected.
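One possible way to combine the two tools is sketched below: the CPU burner runs in the background while the memory test runs in the foreground, and the burner is stopped once the memory test finishes. The function parallel_stress and the variables BURN_CMD and MEM_CMD are assumptions introduced for this example; the defaults reuse the commands shown earlier, with a loop count of 1 added to memtester so that it terminates.

```shell
#!/bin/sh
# BURN_CMD and MEM_CMD are hypothetical variables for this sketch;
# override them to use a different burner or memory size.
BURN_CMD=${BURN_CMD:-burnBX}
MEM_CMD=${MEM_CMD:-/usr/sbin/memtester 512M 1}

# parallel_stress   (hypothetical helper)
# Run the burner in the background during one foreground memory test,
# then stop the burner and report both results.
parallel_stress() {
    $BURN_CMD &
    burn_pid=$!

    mem_status=0
    $MEM_CMD || mem_status=$?

    if kill -TERM "$burn_pid" 2>/dev/null; then
        wait "$burn_pid" 2>/dev/null || true
        burn_state="running until stopped"
    else
        # The burner was already gone, so it stopped on its own.
        burn_status=0
        wait "$burn_pid" || burn_status=$?
        burn_state="exited early with status $burn_status"
    fi

    echo "memory test status: $mem_status; burner: $burn_state"
    [ "$mem_status" -eq 0 ] && [ "$burn_state" = "running until stopped" ]
}
```

Calling parallel_stress as root then gives a single summary line, and a non-zero return value signals that at least one of the two tools reported a problem.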