Very large disk array using IDE drives and Linux
Reference herein to any specific commercial product, process, or service by
trade name, trademark, manufacturer, or otherwise, does not constitute
or imply its endorsement by the United States Government, Jet
Propulsion Laboratory, or California Institute of Technology.
A Terabyte RAID5 storage box.
This page documents our first attempt to use Linux and
commodity hardware for very large data storage systems. It includes some
details about the hardware and software configuration, as well as
further details about possible hardware and software options. It should
provide some guidelines for building a very inexpensive yet very large
storage system. Two identical 1TB storage systems were built in August 2002, for a total cost of $5000.
RASCHAL is an updated and much bigger storage cluster built along the same lines.
Hardware components and configuration:
-PII 300MHz, 128MB RAM, mid tower case with 4 x 5 1/4" external drive bays and 3 x 3.5" internal bays.
-SCSI adapter with 4GB SCSI drive for system disk.
-AGP video board.
-8 Maxtor DiamondMax D540X 160GB IDE drives.
-3Ware Escalade 7500-8 IDE RAID Controller.
-Leadman (POWMAX) 500W ATX Power Supply.
-3COM 3996SX Fiber GigE PCI interface.
-3COM 3996Tx Copper GigE PCI Interface.
-NONAME 3-Bay IDE ATA/133 Backplane with Tray Locks.
This system was built to solve a critical storage need, starting
with a decommissioned PC. The case, motherboard with CPU and memory, the
Adaptec SCSI board with the 4GB drive and the AGP video card were
part of the original configuration. To keep the operating system from
interfering with the large storage, the original SCSI subsystem was left
untouched. It provided 4GB of disk space for the operating system, and
also allowed the connection of an external SCSI CD drive, which was used for
software installation.
Installing eight new drives in the case is the main challenge. Six of
the 160GB drives were mounted in the 3-Bay IDE cage trays. This
type of IDE cage takes 3 separate 1" IDE133 drives, and allows them to
be installed in a full height (twice as high as a normal CD drive) 5 1/4"
external drive bay. Since the tower case only had four half height (HH)
5 1/4" accessible drive bays, only two cages, holding six drives, could be
installed. The other two IDE drives were mounted in the
internal drive bays, next to the already existing SCSI 4GB system drive.
The consumer package 3Ware IDE RAID controller comes with 8 single
connector IDE133 cables with strain relief connectors at the end,
making it relatively simple to connect all 8 drives. The original
power supply was a 250W model, and gave up the ghost after a few
minutes of heavy disk usage. To be on the safe side, a 500W power supply
was used to replace it.
Since each disk needs to have a separate power connector, at least 9
power connectors are needed for this configuration. A five connector
split power cable was used to power some of the drives, the rest being
supplied directly from the power supply.
After installing the IDE RAID card with the associated cables, the
video card, the SCSI card and the GigE cards need to be installed.
Once the machine is powered up, at the 3Ware prompt, press Alt-3. The 8
drives should be detected and show up as uninitialized. A RAID5
configuration provides protection against a single drive failure, at the
expense of the storage capacity of one drive. With 160GB drives, the
board spends about two hours initializing the new volume.
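The capacity trade-off of RAID5 is simple arithmetic. A minimal shell sketch, using the nominal 160GB drive size from the parts list (the variable names are illustrative only, and real drives hold slightly more than their nominal size):

```shell
#!/bin/sh
# RAID5 keeps one drive's worth of parity, so usable space is (N - 1) drives.
drives=8          # drives attached to the controller
drive_gb=160      # nominal (decimal) capacity per drive, from the parts list

raw_gb=$((drives * drive_gb))
usable_gb=$(((drives - 1) * drive_gb))
parity_gb=$((raw_gb - usable_gb))

echo "raw: ${raw_gb}GB  usable: ${usable_gb}GB  parity: ${parity_gb}GB"
```

With eight drives this gives 1280GB raw and 1120GB usable, so the parity overhead is one eighth of the raw capacity.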
The existing motherboard did not provide networking ports, so two cards
were added, matching the fiber and copper connections of the
existing host systems.
Software configuration:
The Linux installation and configuration was a bit tricky, but for
somebody with some Linux experience it should not be a problem.
The first step was the installation of Red Hat Linux 7.2,
configured as stand-alone, with most of the development options, on the
4GB SCSI disk. While the Linux drivers for the IDE RAID board
are available for Linux kernel 2.2, an upgrade to at least kernel 2.4.18
is required to support disks larger than 1TB. The Linux 2.4.18 or newer
kernel source is available from The Linux Kernel Archives. The latest
patches for the 3Ware IDE RAID and for the GigE board were also
downloaded from the respective manufacturers' web sites and applied to
the kernel source. The new drivers then become configurable options in
the Linux kernel configuration, "make menuconfig", and need to be
enabled, at least as modules. After the normal
"make dep ; make bzImage ; make modules ; make modules_install"
succeeds, copy the newly generated bzImage from arch/i386/boot to
the /boot directory.
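The build steps described above can be sketched as a small script. This is an illustration under assumptions, not the exact procedure used for this machine: the source directory /usr/src/linux-2.4.18 and the DRY_RUN helper are hypothetical, and the "make dep" and "make modules" steps are the usual ones for 2.4-series kernels. By default the script only prints the commands.

```shell
#!/bin/sh
# Sketch of a 2.4-series kernel build. Dry-run by default: each step is
# printed, not executed. Set DRY_RUN=0 to really run the commands.
DRY_RUN=${DRY_RUN:-1}
KSRC=${KSRC:-/usr/src/linux-2.4.18}   # ASSUMED location of the patched source

# Print a command, and run it only when dry-run mode is switched off.
step() {
    echo "+ $*"
    if [ "$DRY_RUN" != 1 ]; then
        "$@"
    fi
}

# menuconfig is where the 3Ware and GigE drivers get enabled;
# the final copy matches the /boot/bzImage path used by the boot loader.
build_kernel() {
    step cd "$KSRC" &&
    step make menuconfig &&
    step make dep &&
    step make bzImage &&
    step make modules &&
    step make modules_install &&
    step cp arch/i386/boot/bzImage /boot/bzImage
}

build_kernel
```

Chaining the steps with && stops the build at the first failure, so a broken compile never overwrites the kernel image in /boot.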
The file /etc/modules.conf needs to be changed to contain the
following lines:
alias scsi_hostadapter aic7xxx
alias eth0 bcm5700
alias eth1 bcm5700
alias scsi_hostadapter 3w-xxxx
Grub is used as the boot loader, and the content of
/boot/grub/grub.conf was changed to support booting the new
kernel:
# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE:  You do not have a /boot partition.  This means that
#          all kernel and initrd paths are relative to /, eg.
#          root (hd0,1)
#          kernel /boot/vmlinuz-version ro root=/dev/sdb2
#          initrd /boot/initrd-version.img
#boot=/dev/sda
default=1
timeout=10
splashimage=(hd0,1)/boot/grub/splash.xpm.gz
title Red Hat Linux (2.4.7-10)
        root (hd0,1)
        kernel /boot/vmlinuz-2.4.7-10 ro root=/dev/sdb2
        initrd /boot/initrd-2.4.7-10.img
title Big Disk Kernel (2.4.18)
        root (hd0,1)
        kernel /boot/bzImage ro root=/dev/sdb2
        initrd /boot/initrd-2.4.18.img
The last step before rebooting is to generate the
/boot/initrd-2.4.18.img file. Once all the other changes are done, use
the command:
mkinitrd /boot/initrd-2.4.18.img 2.4.18
Due to the position of the IDE RAID card in relation to the
Adaptec SCSI card, and the fact that the 3ware driver emulates a SCSI
disk, when the machine boots with the IDE RAID module
enabled, the IDE RAID becomes /dev/sda and the SCSI disk is now
/dev/sdb. If the 3Ware driver is not loaded (or not configured), the
single SCSI drive is /dev/sda. It is easy to recover from errors, since
GRUB allows for editing of the boot command entries before loading the
kernel.
If all this was successful, at the next reboot with the "Big Disk
Kernel", the RAID drive should be available as /dev/sda.
This new drive needs to be partitioned, and a file system needs to be
created on it. We created one large partition of type Linux (Id 83).
Using fdisk, the disk looks like this:
[root@disk_ boot]# fdisk /dev/sda
The number of cylinders for this disk is set to 139508.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
(e.g., DOS FDISK, OS/2 FDISK)
Command (m for help): p
Disk /dev/sda: 255 heads, 63 sectors, 139508 cylinders
Units = cylinders of 16065 * 512 bytes
   Device Boot    Start       End      Blocks   Id  System
/dev/sda1             1    139508  1120597978+  83  Linux
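The geometry that fdisk reports can be cross-checked against the block count of the partition. A short sketch, assuming a shell with 64-bit integer arithmetic (true of current bash and dash on 64-bit systems); the variable names are illustrative:

```shell
#!/bin/sh
# Geometry as printed by fdisk for /dev/sda above.
heads=255
sectors=63
cylinders=139508
sector_bytes=512

cyl_bytes=$((heads * sectors * sector_bytes))      # 16065 * 512
disk_bytes=$((cylinders * cyl_bytes))
# The partition starts one track (63 sectors) into the disk.
part_kb=$(((disk_bytes - sectors * sector_bytes) / 1024))
disk_gib=$((disk_bytes / 1024 / 1024 / 1024))

echo "bytes per cylinder: $cyl_bytes"
echo "disk size: $disk_bytes bytes (about ${disk_gib} GiB)"
echo "partition: $part_kb 1K-blocks"
```

The partition works out to 1120597978 1K-blocks, matching the fdisk listing (the trailing "+" marks the extra half block). The whole disk is about 1068 GiB, or roughly 1.04 binary terabytes, which is why df rounds it to 1.0T.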
As far as the file system goes, reiserfs seemed like a reasonable
choice, since it is a journaling file system with good performance.
There are not many useful options for making a reiserfs file system, so
just use the defaults:
mkreiserfs /dev/sda1
The new machine is used as a dedicated disk server, so we used a dedicated
point to point network link and a private NFS connection.
After about a week of heavy use, there have been no errors, and the
drive status looks like this:
[root@disk_ sbin]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             1.0T  317G  752G  30% /raid
Using the disk PC via a point to point FastEthernet link, performance
figures are very good, with sustained read/write rates from the client
machine of 7MBytes/sec being frequently observed.
Using a GigE connection, read rates only go up a limited amount, since
the latencies from various sources are rather large. Sustained read
rates reach 12MBytes/sec using large read block sizes and after some
tuning of NFS on the Linux server and the SGI client. Write
performance is spectacular, around 50MBytes/sec for short bursts. These
are cached writes, with the system memory on the storage Linux PC
serving as disk cache.
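To put these rates in context, they can be compared with the raw capacity of each link: 100Mbit/sec for FastEthernet and 1000Mbit/sec for GigE. The sketch below ignores protocol overhead and uses integer arithmetic, so the percentages are truncated; the function name is illustrative.

```shell
#!/bin/sh
# Percentage of raw link capacity used by an observed transfer rate.
# usage: link_use_pct <rate in MBytes/sec> <link speed in Mbit/sec>
link_use_pct() {
    echo $(( $1 * 8 * 100 / $2 ))
}

fast_pct=$(link_use_pct 7 100)           # 7MBytes/sec over FastEthernet
gige_read_pct=$(link_use_pct 12 1000)    # 12MBytes/sec sustained reads over GigE
gige_write_pct=$(link_use_pct 50 1000)   # 50MBytes/sec cached write bursts

echo "FastEthernet read/write: ${fast_pct}% of link capacity"
echo "GigE sustained read:     ${gige_read_pct}% of link capacity"
echo "GigE burst write:        ${gige_write_pct}% of link capacity"
```

The FastEthernet link runs at a little over half of its raw capacity, while GigE reads use under a tenth of theirs, which is consistent with latency rather than bandwidth being the limit.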
Factors to be considered:
Hardware:
Very large EIDE drives.
The most important factor is, of course, storage capacity, followed
by power consumption. Since the disks are going to be closely packed in
a small case, the heat produced by the disks while in use can build up
quickly. Most of the suitable current products use IDE100 or IDE133
(ATA/ATAPI 6) as the interface, but the new Serial ATA (SATA) is
becoming available. The 3Ware Escalade 8500 series supports SATA drives.
Maxtor DiamondMax D540X or DiamondMax 16
Up to 160GB per disk, 3.5" form factor, 1" height. They are 5400 RPM
drives, have 2MB of cache RAM and use 5W of power while in operation.
The DiamondMax 16 can be purchased with an 8MB cache
configuration. The retail version of the D540X also includes an
IDE133 PCI controller. The current warranty is three years, but it will
be dropped to one year on October 1st 2002.
News:
MaXLine II, 250GB and 320GB disks
Western Digital CaviarLine
Has capacities up to 200GB per disk and up to 8MB cache per
disk. They are also 3.5" form factor, with 1" height. The 180GB and
200GB versions are only available with 2MB cache and 7200RPM, using
7.75W while operating. The 2MB cache/5400 RPM drives top out at 120GB, but
they only use 6W while operating. These drives use an IDE100 interface.
Seagate Barracuda ATA
Currently disks up to 120GB, at 7200RPM, with IDE100 and Serial ATA
interfaces. Serial ATA provides a 150MB/sec transfer rate, and uses
thin cables. The specification sheet for the 120GB drives claims they
use 12W during read/write operations.
IBM Deskstar XP
Uses IDE100, with capacities up to 120GB at 7200RPM.
IDE interfaces:
The RAID5 functionality can be done in software on the host CPU, by
the Linux md package, dispensing with the hardware IDE RAID card. Extra
IDE interfaces can be installed, and each IDE channel can support
up to two drives. The IDE133 board that is included with the end-user
Maxtor 160GB drives works with a driver from the Linux IDE
Project, but I was only able to get one disk per channel working.
Since the number of PCI cards that can be installed in a system is
limited, a RAID card needs to be used for larger disk capacities.
3Ware makes a 12 port IDE133 card, Escalade 7500-12. I am not sure that
the 12 port card supports RAID5.
3Ware also just announced the Escalade series 8500, with the same
features but using Serial ATA.
For larger disk capacities, a nice solution is to use two sets of 8
drives and two 8 port IDE RAID cards, and use Linux md to stripe
across the two resulting arrays.
This configuration can create a single RAID5 virtual volume with 2.2TB
usable using the 160GB drives, or 2.8TB using 200GB drives. Of
course, a large case capable of hosting at least 16 IDE drives is
required, together with an appropriately sized power supply.
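The capacities quoted for the two-card configuration follow from the same RAID5 arithmetic, applied once per card. The sketch below also prints one possible command for building the stripe. The page itself only refers to Linux md, so the choice of mdadm and the device names /dev/sda1 and /dev/sdb1 are assumptions made for illustration.

```shell
#!/bin/sh
# Usable capacity of several RAID5 sets striped together (RAID0 on top).
# usage: striped_raid5_gb <sets> <drives per set> <GB per drive>
striped_raid5_gb() {
    echo $(( $1 * ($2 - 1) * $3 ))
}

cap_160=$(striped_raid5_gb 2 8 160)
cap_200=$(striped_raid5_gb 2 8 200)
echo "160GB drives: ${cap_160}GB usable"
echo "200GB drives: ${cap_200}GB usable"

# One possible way to build the stripe with mdadm; printed, not executed.
# /dev/sda1 and /dev/sdb1 are ASSUMED names for the two hardware arrays.
echo "mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sda1 /dev/sdb1"
```

The results, 2240GB and 2800GB, are the 2.2TB and 2.8TB figures given above. Each card still protects against one drive failure within its own set of eight.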
Drive cages:
Drive cages seem to be a good investment. In regular tower cases, they
increase the number of drives that can be mounted, provide additional
cooling, and some of them even support hot-swap. Since the IDE RAID
card also supports hot-swap, it might be possible to replace a failed
drive with a new one and rebuild the array while the array is in use.
Another interesting benefit is being able to change the drives without
disturbing any of the hardware configuration inside the PC case.
Such a system could be used as a very nice backup device. Since eight 160GB
drives can be bought for about $1800 at market prices, the backup cost is
about $1.75 per gigabyte, very competitive with tapes, and much simpler
to use.
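The per-gigabyte figure depends on which capacity is used as the divisor. The sketch below works in cents with integer arithmetic, so results are truncated. The 1024GB divisor is only a guess at how the quoted figure was derived (usable space counted as one binary terabyte); the page does not say.

```shell
#!/bin/sh
# Cost per gigabyte in cents, using integer arithmetic (results truncate).
price_cents=180000        # about $1800 for eight 160GB drives

raw_gb=$((8 * 160))       # 1280GB with no redundancy
raid5_gb=$((7 * 160))     # 1120GB nominal after RAID5 parity
binary_gb=1024            # ASSUMPTION: usable space counted as 1TB = 1024GB

cents_raw=$((price_cents / raw_gb))
cents_raid5=$((price_cents / raid5_gb))
cents_binary=$((price_cents / binary_gb))

echo "raw capacity:     ${cents_raw} cents/GB"
echo "RAID5 capacity:   ${cents_raid5} cents/GB"
echo "1TB as 1024GB:    ${cents_binary} cents/GB"
```

This gives about $1.40 per gigabyte for raw capacity, $1.60 for nominal RAID5 capacity, and $1.75 when the usable space is counted as 1024GB. The figures cover the drives only, not the controller or the host.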
Software:
SGI's XFS is an alternative file system. XFS is being used on very large
disks with good performance, and a Linux port is available.
Of course, the Linux system can be configured using other network file
systems, such as Samba for MS Windows clients.
Can Linux use direct IO as part of the NFS server? Doing so should
reduce the read latencies.
If this information is of interest, you might want to look at this link
RASCHAL, a 40TB storage cluster