Parallel Applications Technologies Group

Very large disk array using IDE drives and Linux

Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not constitute or imply its endorsement by the United States Government, Jet Propulsion Laboratory, or California Institute of Technology.

A Terabyte RAID5 storage box.

This page documents our first attempt to use Linux and commodity hardware for very large data storage systems. It includes some details about the hardware and software configuration, as well as further details about possible hardware and software options. It should provide some guidelines for building a very inexpensive yet very large storage system. Two identical 1TB storage systems were built in August 2002, for a total cost of 5000$.

RASCHAL is an updated and much bigger storage cluster built along the same lines.
 
Hardware components and configuration:

-PII 300MHz, 128MB RAM, mid tower case with 4 x 5.25" external drive bays and 3 x 3.5" internal bays.
-SCSI adapter with 4GB SCSI drive for system disk.
-AGP video board.
-8 Maxtor DiamondMax D540X 160GB IDE drives.
-3Ware Escalade 7500-8 IDE RAID Controller
-Leadman (POWMAX) 500W ATX Power Supply
-3COM 3996SX Fiber GigE PCI interface.
-3COM 3996Tx Copper GigE PCI Interface.
-NONAME 3-Bay IDE ATA/133 Backplane with Tray Locks

This system was built to solve a critical storage need, starting with a decommissioned PC. The case, the motherboard with CPU and memory, the Adaptec SCSI board with the 4GB drive and the AGP video card were part of the original configuration. To keep the operating system from interfering with the large storage, the original SCSI subsystem was left untouched. It provided 4GB of disk space for the operating system, and also allowed the connection of an external SCSI CD drive, which was used for software installation.
Installing eight new drives in the case was the main challenge. Six of the 160GB drives were mounted in the trays of two 3-Bay IDE cages. This type of cage takes three separate 1" high IDE133 drives and fits in a full height 5 1/4" external drive bay, which is twice as high as a normal CD drive. Since the tower case only had four half height (HH) 5 1/4" accessible drive bays, only two cages, and therefore six drives, could be installed this way. The other two IDE drives were mounted in the internal drive bays, next to the already existing 4GB SCSI system drive.
The consumer package of the 3Ware IDE RAID controller comes with 8 single connector IDE133 cables with tension relief connectors at the end, making it relatively simple to connect all 8 drives. The original power supply was a 250W model, and it gave up the ghost after a few minutes of heavy disk usage. To be on the safe side, a 500W power supply was used to replace it.
Since each disk needs a separate power connector, at least 9 power connectors are needed for this configuration: eight for the IDE drives and one for the SCSI system drive. A five connector split power cable was used to power some of the drives, the rest being supplied directly from the power supply.
After installing the IDE RAID card with the associated cables, the video card, the SCSI card and the GigE cards need to be installed.
Once the machine is powered up, at the 3Ware prompt, press Alt-3. The 8 drives should be detected and show up as uninitialized. A RAID5 configuration provides protection against a single drive failure, at the expense of the storage capacity of one drive. With 160GB drives, the board spends about two hours initializing the new volume.
Since the existing motherboard did not provide networking ports, two cards were added, matching the fiber and copper connections of the existing host systems.
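
As a rough check of the RAID5 trade-off described above, the usable capacity can be sketched in a few lines of Python. The function name and the use of nominal gigabytes are assumptions made for illustration; the controller reports the real figure once the volume is initialized.

```python
def raid5_usable_gb(num_drives: int, drive_gb: int) -> int:
    """Usable capacity of a RAID5 set: one drive's worth of space holds parity."""
    if num_drives < 3:
        raise ValueError("RAID5 needs at least three drives")
    return (num_drives - 1) * drive_gb


# Eight nominal 160GB drives, as used in this box.
print(raid5_usable_gb(8, 160), "GB usable out of", 8 * 160, "GB raw")
```

The same function shows why small arrays pay proportionally more for the protection: with three drives a third of the raw space goes to parity, with eight drives only an eighth.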

Software configuration:

The Linux installation and configuration was a bit tricky, but for somebody with some Linux experience it should not be a problem. The first step was the installation of Red Hat Linux 7.2, configured as stand-alone, with most of the development options, on the 4GB SCSI disk. While Linux drivers for the IDE RAID board are available for kernel 2.2, an upgrade to at least kernel 2.4.18 is required to support disks larger than 1TB. The 2.4.18 or newer kernel source is available from The Linux Kernel Archives.
The latest patches for the 3Ware IDE RAID board and for the GigE boards were also downloaded from the respective manufacturer web sites and applied to the kernel source. Once the patches are applied, the new drivers show up as options in the Linux kernel configuration ("make menuconfig") and need to be enabled, at least as modules. After the normal "make dep ; make bzImage ; make modules ; make modules_install" succeeds, copy the newly generated bzImage from arch/i386/boot to the /boot directory.

The file /etc/modules.conf needs to be changed to contain the following lines:

alias scsi_hostadapter aic7xxx
alias eth0 bcm5700
alias eth1 bcm5700
alias scsi_hostadapter 3w-xxxx

Grub is used as the boot loader, and the content of /boot/grub/grub.conf was changed to support booting the new kernel:

# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE:  You do not have a /boot partition.  This means that
#          all kernel and initrd paths are relative to /, eg.
#          root (hd0,1)
#          kernel /boot/vmlinuz-version ro root=/dev/sdb2
#          initrd /boot/initrd-version.img
#boot=/dev/sda
default=1
timeout=10
splashimage=(hd0,1)/boot/grub/splash.xpm.gz
title Red Hat Linux (2.4.7-10)
        root (hd0,1)
        kernel /boot/vmlinuz-2.4.7-10 ro root=/dev/sdb2
        initrd /boot/initrd-2.4.7-10.img

title Big Disk Kernel (2.4.18)
        root (hd0,1)
        kernel /boot/bzImage ro root=/dev/sdb2
        initrd /boot/initrd-2.4.18.img
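
GRUB numbers its menu entries starting from zero, so "default=1" in the file above selects the second title. The following Python sketch illustrates that rule on a trimmed copy of the configuration; the parsing code and the name default_title are illustrative assumptions, not part of GRUB.

```python
import re

# Trimmed copy of the grub.conf shown above (comments and initrd lines omitted).
GRUB_CONF = """\
default=1
timeout=10
title Red Hat Linux (2.4.7-10)
        root (hd0,1)
        kernel /boot/vmlinuz-2.4.7-10 ro root=/dev/sdb2
title Big Disk Kernel (2.4.18)
        root (hd0,1)
        kernel /boot/bzImage ro root=/dev/sdb2
"""


def default_title(conf: str) -> str:
    """Return the title of the menu entry selected by the default= line."""
    titles = re.findall(r"^title\s+(.*)$", conf, flags=re.MULTILINE)
    match = re.search(r"^default=(\d+)\s*$", conf, flags=re.MULTILINE)
    index = int(match.group(1)) if match else 0  # entry 0 when default= is absent
    return titles[index]


print(default_title(GRUB_CONF))
```

Keeping the original Red Hat entry as entry 0 means the stock kernel stays one menu selection away if the new kernel fails to boot.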

The last step before rebooting is to generate the /boot/initrd-2.4.18.img file. Once all the other changes are done, use the command:
mkinitrd /boot/initrd-2.4.18.img 2.4.18

Because of the position of the IDE RAID card relative to the Adaptec SCSI card, and because the 3Ware driver emulates a SCSI disk, the IDE RAID becomes /dev/sda and the SCSI disk becomes /dev/sdb when the machine boots with the IDE RAID module enabled. If the 3Ware driver is not loaded (or not configured), the single SCSI drive is /dev/sda. It is easy to recover from errors, since GRUB allows the boot command entries to be edited before loading the kernel.
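
The renaming can be pictured with a toy model in which SCSI-style disks simply receive the names sda, sdb and so on in the order they are detected. This is a simplified illustration, not the kernel's actual code, and the disk labels are made up for the example.

```python
import string


def assign_sd_names(disks_in_detection_order):
    """Map each detected SCSI-style disk to /dev/sda, /dev/sdb, ... in order."""
    return {
        disk: "/dev/sd" + letter
        for disk, letter in zip(disks_in_detection_order, string.ascii_lowercase)
    }


# With the 3Ware module loaded, its array is seen before the Adaptec disk.
with_raid = assign_sd_names(["3Ware RAID5 array", "Adaptec 4GB disk"])
# Without the 3Ware driver, only the Adaptec disk exists.
without_raid = assign_sd_names(["Adaptec 4GB disk"])

print(with_raid)
print(without_raid)
```

This is why the "root=" argument has to name /dev/sdb2 when the RAID driver is loaded, even though the system disk itself has not moved.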

If all this was successful, at the next reboot with the "Big Disk Kernel", the RAID drive should be available as /dev/sda.
This new drive needs to be partitioned, and a file system needs to be created on it. We created one large partition of type Linux.
Using fdisk, the disk looks like this:

[root@disk_ boot]# fdisk /dev/sda

The number of cylinders for this disk is set to 139508.
There is nothing wrong with that, but this is larger than 1024,
and could in certain setups cause problems with:
1) software that runs at boot time (e.g., old versions of LILO)
2) booting and partitioning software from other OSs
   (e.g., DOS FDISK, OS/2 FDISK)

Command (m for help): p

Disk /dev/sda: 255 heads, 63 sectors, 139508 cylinders
Units = cylinders of 16065 * 512 bytes

   Device Boot    Start       End    Blocks   Id  System
/dev/sda1             1    139508 1120597978+  83  Linux
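
The numbers in the fdisk listing can be cross-checked from the reported geometry. The sketch below is only an illustration: it assumes 512-byte sectors (as the "Units" line states) and assumes that the first partition starts after the first 63-sector track, which is a common partitioning convention rather than something shown in the listing.

```python
SECTOR_BYTES = 512
FIRST_PARTITION_START = 63  # assumption: partition begins after the first track


def disk_bytes(cylinders: int, heads: int, sectors_per_track: int) -> int:
    """Total size of the disk implied by the reported geometry."""
    return cylinders * heads * sectors_per_track * SECTOR_BYTES


def partition_1k_blocks(cylinders: int, heads: int, sectors_per_track: int) -> int:
    """Whole 1K blocks in a partition spanning the disk after the first track."""
    total_sectors = cylinders * heads * sectors_per_track
    return (total_sectors - FIRST_PARTITION_START) * SECTOR_BYTES // 1024


total = disk_bytes(139508, 255, 63)
print("bytes:", total)
print("decimal GB:", round(total / 10**9, 1))
print("binary TB:", round(total / 2**40, 2))
print("1K blocks in sda1:", partition_1k_blocks(139508, 255, 63))
```

Under these assumptions the block count agrees with the "Blocks" column above, where the trailing "+" marks a leftover half block. The difference between decimal gigabytes and binary terabytes also explains why a volume built from 160GB drives is reported as roughly 1.0T by tools that use binary units.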

As far as the file system goes, reiserfs seemed like a reasonable choice, since it is a journaling file system with good performance.
There are not many useful options for making a reiserfs file system, so just use the defaults:

mkreiserfs /dev/sda1

The new machine is used as a dedicated disk server, so we used a dedicated point to point network link and a private NFS connection.
After about a week of heavy use, there have been no errors, and the drive status looks like this:
[root@disk_ sbin]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             1.0T  317G  752G  30% /raid

Using the disk PC via a point to point FastEthernet link, performance figures are very good, with sustained read/write rates of 7MBytes/sec frequently observed from the client machine.
Using a GigE connection, read rates only go up a limited amount, since the latencies from various sources are rather large. Sustained read rates reach 12MBytes/sec using large read blocks and after some tuning of NFS on the Linux server and the SGI client. Write performance is spectacular, around 50MBytes/sec for short bursts. These are cached writes, with the system memory on the storage Linux PC serving as disk cache.
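
To put these rates in perspective, they can be compared with the raw signalling rate of each link. The sketch below is a back-of-the-envelope calculation; it takes FastEthernet as 100Mbit/sec and GigE as 1000Mbit/sec and ignores protocol overhead, so the fractions are only indicative.

```python
def link_utilization(rate_mbytes_per_s: float, link_mbits_per_s: float) -> float:
    """Fraction of the raw link rate used by a given payload rate."""
    return rate_mbytes_per_s * 8 / link_mbits_per_s


measurements = [
    ("FastEthernet read/write", 7, 100),
    ("GigE sustained reads", 12, 1000),
    ("GigE burst writes", 50, 1000),
]

for label, rate, link in measurements:
    print(f"{label}: {link_utilization(rate, link):.1%} of the raw link rate")
```

Seen this way, the FastEthernet link is reasonably well used, while the sustained GigE reads leave most of the link idle, which is consistent with latency rather than bandwidth being the limit.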

Factors to be considered:

Hardware:

Very large EIDE drives.

The most important factor is, of course, storage capacity, followed by power consumption. Since the disks are going to be closely packed in a small case, the heat produced by the disks while in use can accumulate real fast. Most of the suitable current products use IDE100 or IDE133 (ATA/ATAPI 6) as the interface, but the new Serial ATA (SATA) interface is becoming available. The 3Ware Escalade 8500 series supports SATA drives.

Maxtor DiamondMax D540X or DiamondMax 16
Up to 160GB per disk, 3.5" form factor, 1" height. They are 5400 RPM drives, have 2MB of cache RAM and use 5W of power while in operation. The DiamondMax 16 can be purchased with an 8MB cache configuration. The retail version of the D540X also includes an IDE133 PCI controller. The current warranty is three years, but it will be dropped to one year on October 1st 2002.
News: MaXLine II, 250GB and 320GB disks

Western Digital CaviarLine
Capacities up to 200GB per disk and up to 8MB of cache per disk. They are also 3.5" form factor, with 1" height. The 180GB and 200GB versions are only available with 2MB cache and 7200RPM, using 7.75W while operating. The 2MB cache/5400 RPM drives top out at 120GB, but they only use 6W while operating. These drives use an IDE100 interface.

Seagate Barracuda ATA
Currently disks up to 120GB, at 7200RPM, with IDE100 and Serial ATA interfaces. Serial ATA provides a 150MB/sec transfer rate and uses thin cables. The specification sheet for the 120GB drives claims they use 12W during read/write operations.

IBM Deskstar XP
Uses IDE100, with capacities up to 120GB at 7200RPM.
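
Since heat in a closely packed case is the concern, it can help to multiply the per-drive operating figures quoted above by the number of drives. The sketch below does just that for an eight drive array; the dictionary labels are shorthand for the drives listed above, and the totals cover operating power only, not the load at start-up.

```python
# Operating power per drive in watts, as quoted in the descriptions above.
OPERATING_WATTS = {
    "Maxtor DiamondMax D540X, 5400 RPM": 5.0,
    "Western Digital, 5400 RPM": 6.0,
    "Western Digital, 7200 RPM": 7.75,
    "Seagate Barracuda ATA 120GB, 7200 RPM": 12.0,
}


def array_watts(watts_per_drive: float, num_drives: int = 8) -> float:
    """Total operating power of an array of identical drives."""
    return watts_per_drive * num_drives


for model, watts in OPERATING_WATTS.items():
    print(f"{model}: {array_watts(watts):.0f}W for 8 drives")
```

The spread between the lowest and highest figures is more than a factor of two, which matters when all of that heat has to leave a mid tower case.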

IDE interfaces:

The RAID5 functionality can be done in software on the host CPU, by the Linux md driver, dispensing with the hardware IDE RAID card. Extra IDE interfaces can be installed, and each IDE channel can support up to two drives. The IDE133 board that is included with the end-user Maxtor 160GB drives works with a driver from the Linux IDE Project, but I was only able to get one disk per channel working. Since the number of PCI cards that can be installed in a system is limited, a RAID card needs to be used for larger disk capacities.
3Ware makes a 12 port IDE133 card, the Escalade 7500-12. I am not sure that the 12 port card supports RAID5.
3Ware also just announced the Escalade 8500 series, with the same features but using Serial ATA.
For larger disk capacities, a nice solution is to use two sets of 8 drives and two 8 port IDE RAID cards, and use Linux md to stripe across the two arrays.
This configuration can create a single RAID5 protected virtual volume with 2.2TB usable using the 160GB drives, or 2.8TB using 200GB drives. Of course, a large case capable of hosting at least 16 IDE drives is required, together with the appropriately sized power supply.
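
The capacity figures for the two-card layout follow from simple arithmetic, sketched below in Python. The function name is an illustrative assumption, and nominal drive sizes are used throughout.

```python
def striped_raid5_gb(num_sets: int, drives_per_set: int, drive_gb: int) -> int:
    """Capacity of several RAID5 sets striped together into one volume."""
    per_set = (drives_per_set - 1) * drive_gb  # one drive per set holds parity
    return num_sets * per_set  # striping adds the capacities of the sets


print(striped_raid5_gb(2, 8, 160), "GB with 160GB drives")
print(striped_raid5_gb(2, 8, 200), "GB with 200GB drives")
```

Each set still tolerates only a single drive failure, so the striped volume survives one failed drive per set but not two in the same set.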

Drive cages:

Drive cages seem to be a good investment. In regular tower cases, they increase the number of drives that can be mounted, provide additional cooling, and some of them even support hot-swap. Since the IDE RAID card also supports hot-swap, it might be possible to replace a failed drive with a new one and rebuild the array while the disk is in use. Another interesting benefit is being able to change the drives without disturbing any of the hardware configuration inside the PC case. Such a system could be used as a real nice backup device. Since eight 160GB drives can be bought for about 1800$ at market prices, the backup cost is about 1.75$ per GigaByte, very competitive with tapes, and much simpler to use.
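
The cost per gigabyte depends on which capacity is used as the denominator. The sketch below works it out for the raw capacity of the eight drives and for the RAID5 usable capacity, using the 1800$ price quoted above and nominal drive sizes; the variable names are assumptions made for the example.

```python
def cost_per_gb(total_cost: float, capacity_gb: float) -> float:
    """Cost of one gigabyte for a given total price and capacity."""
    return total_cost / capacity_gb


DRIVES, DRIVE_GB, COST = 8, 160, 1800.0

raw = cost_per_gb(COST, DRIVES * DRIVE_GB)  # all eight drives
usable = cost_per_gb(COST, (DRIVES - 1) * DRIVE_GB)  # RAID5, one drive of parity

print(f"raw: {raw:.2f}$ per GB, RAID5 usable: {usable:.2f}$ per GB")
```

Neither result reproduces the 1.75$ figure exactly; that figure is closer to what dividing by 1024GB gives, but the page does not say which capacity was used, so this is only a guess.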
 

Software:

SGI's XFS is worth considering as the file system. XFS is being used on very large disks with good performance, and a Linux port is available.
Of course, the Linux system can also be configured to serve other network file systems, for example using SAMBA for MS Windows clients.
Can Linux use direct IO as part of the NFS server? Doing so should reduce the read latencies.

If this information is of interest, you might want to look at RASCHAL, a 40TB storage cluster.
This page, http://pat.jpl.nasa.gov/public/lucian/disk.html, is maintained by Lucian Plesea and was last modified Saturday, 19-Jul-2003 21:40:10 PDT