HPCC FY99 Milestone Report
Peggy Li
November 11, 1999
The major milestone for the ESS Distributed Visualization task is
a prototype for 4D datasets on TeraFLOPS scalable systems. We conducted
a demo run on Tuesday, 11/9/99, to fulfill the milestone. The
following is a detailed report on the preparation, the execution, and
the results of the demo.
- Dataset preparation: I was hoping to get all 11 years of
simulation data from LANL for their global ocean modeling Run #11. It
turned out that they did not save the snapshots for the first five
years; they only saved the annual average results. Therefore, I got 6
years of data. A snapshot was taken every 10 days, and I got 181 time
steps in total. Each data file has four variables: velocity (x and y
components), temperature, and salinity. The volume size is 1280x896x20
voxels, or 370 Mbytes. The total size of the 181 time steps is about
66.9 Gbytes. I processed the data, put them into NetCDF format, and
added the necessary meta information. For the velocity data, I
calculated the magnitude of the velocity. I also added the ocean floor
depth data (2D) to the velocity data to give me the proper
orientation. Each resulting file represents one time step and is 275
Mbytes in size. I have 181 files.
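The file sizes quoted above can be reproduced with a short calculation.
This is only a sketch: it assumes every value is stored as a 4-byte
float and that the processed file holds three 3D variables (velocity
magnitude, temperature, salinity). Neither is stated explicitly in this
report, and the function names are illustrative.

```python
import math

# Grid dimensions and step count come from the report.
NX, NY, NZ = 1280, 896, 20
TIME_STEPS = 181
# Assumption: single-precision (4-byte) values; the report does not say.
BYTES_PER_VALUE = 4


def volume_mbytes(num_variables):
    """Size in Mbytes (10**6 bytes) of one time step holding num_variables 3D fields."""
    return NX * NY * NZ * num_variables * BYTES_PER_VALUE / 1e6


def speed(u, v):
    """Velocity magnitude from the x and y components."""
    return math.hypot(u, v)


raw = volume_mbytes(4)        # x velocity, y velocity, temperature, salinity
processed = volume_mbytes(3)  # velocity magnitude, temperature, salinity

print(f"raw file:       {raw:.0f} Mbytes")
print(f"processed file: {processed:.0f} Mbytes (plus the small 2D depth field)")
print(f"raw total:      {raw * TIME_STEPS / 1e3:.1f} Gbytes for {TIME_STEPS} steps")
```

Under these assumptions the raw file comes to about 367 Mbytes (quoted
as 370 above) and the processed file to about 275 Mbytes.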
- Data staging: There was not enough space on Neptune's temp disks
to store all my datasets, so I had to keep them on Caltech's
Datacache machine (Johnny). Data can be accessed remotely from Neptune
using SDSC's SRB (Storage Resource Broker) at a rate of 14 Mbytes/sec
in theory. Unfortunately, the SRB command (Sget) takes 60 seconds to
retrieve one data file from Johnny to Neptune (corresponding to 4.5
Mbytes/sec) due to a very low CPU utilization rate. Therefore, the
strategy for accessing data during the demo had two stages: first
prefetching the data from Johnny to Neptune's tmp disk, then reading
the data from Neptune's disk. Thanks to Cris Windoffer and many
generous Neptune users, I was able to grab most of the space on
Neptune's fastest disk (/tmp8) and store up to 78 time steps on that
disk before the demo started. While the demo ran, I deleted the files
that had already been loaded from the disk into the renderer and
retrieved more files from Johnny as soon as space became available.
The staging process is done by forking a process from the rendering
software. Reading one file from /tmp8 into the renderer takes 16
seconds (18.4 Mbytes/sec). By staging enough data ahead of time and
prefetching the rest from Johnny, I was able to read the data at full
speed without waiting for data to move from Johnny to Neptune.
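The two-stage strategy above can be sketched as a small
producer/consumer loop. This is not the code used in the demo: it uses
a Python thread where the real staging forked a process, and `fetch`
and `load` are hypothetical callbacks standing in for the Sget transfer
and the renderer's file reader. The capacity of 98 files in the usage
line is just the number of pre-staged time steps mentioned later in
this report.

```python
import queue
import threading


def stage_and_load(num_files, prestaged, capacity, fetch, load):
    """Load files 0..num_files-1 in order while a worker refills freed disk slots.

    fetch(i) and load(i) are hypothetical callbacks: fetch stands in for the
    remote copy to local disk, load for reading the file and then deleting it.
    """
    assert capacity >= 1
    assert 0 <= prestaged <= capacity and prestaged <= num_files

    free_slots = threading.Semaphore(capacity - prestaged)
    ready = queue.Queue()
    for i in range(prestaged):       # files already on the local disk
        ready.put(i)

    def prefetcher():
        for i in range(prestaged, num_files):
            free_slots.acquire()     # wait until a loaded file has been deleted
            fetch(i)
            ready.put(i)

    worker = threading.Thread(target=prefetcher)
    worker.start()

    loaded = []
    for _ in range(num_files):
        i = ready.get()              # blocks only if the prefetcher falls behind
        load(i)
        loaded.append(i)
        free_slots.release()         # the deleted file frees one disk slot
    worker.join()
    return loaded


if __name__ == "__main__":
    order = stage_and_load(200, 98, 98, fetch=lambda i: None, load=lambda i: None)
    print(order == list(range(200)))
```

The semaphore models the free disk space: the prefetcher can only start
a transfer after the loader has deleted a file, which mirrors the
"retrieve as soon as space becomes available" rule described above.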
- Start the demo: The demo was scheduled to run from 9AM to
11AM with myself as the only dedicated user. I was
hoping that I could extend my demo run to noon because, based on my
calculation, I would need at least three hours just to move 180 files
from Johnny to Neptune! Due to some configuration problem, the machine
had to be rebooted twice to get it set up properly, and I was not able
to access it until almost 10AM. The original plan was to run
the renderer on 192 nodes, the input module on 16 nodes (it has to be
on Node 8 to get the best disk throughput), and the output module on
32 nodes. My MPI program did not run in this configuration
because the shared memory I requested exceeded the system limit.
(I was using the MPI 2.0 one-sided communication API. The HP
implementation restricts memory allocation
for one-sided communication to a fixed memory segment, which is
configured to be a small portion of the total memory space. When
running on more nodes, the total shared memory I requested becomes
bigger, thus exceeding the system limit.) I had to play with the MPI
environment variable MPI_SHMEMCNTL for almost 20 minutes to find a
setting that allowed my program to run. Finally, I was able to run my
program using 128 nodes for the renderer, 16 nodes for input, and 32
nodes for output.
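The three-hour estimate above follows directly from the 60 seconds
that Sget needs per file. A minimal sketch of that arithmetic, with
illustrative function names:

```python
# Measured Sget time per file and file count, both taken from the report.
SECONDS_PER_FILE = 60
FILES_TO_MOVE = 180


def transfer_hours(num_files, seconds_per_file=SECONDS_PER_FILE):
    """Hours needed to move num_files one after another."""
    return num_files * seconds_per_file / 3600


def files_in_window(hours, seconds_per_file=SECONDS_PER_FILE):
    """Number of files that can be moved sequentially within the given time."""
    return int(hours * 3600 // seconds_per_file)


print(transfer_hours(FILES_TO_MOVE))  # hours for all 180 files
print(files_in_window(2))             # files movable in the 9AM-11AM slot
```

This assumes the files are moved strictly one at a time at the measured
rate, which is why the scheduled two-hour slot alone could not cover
all the transfers.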
- The Process: I was planning to load 300 to 360 time steps in the
demo if time permitted, but because I started late, I decided to first
load 200 time steps. I had to fake the datasets above time step #181 by
looping back from #0. I first prefetched 98 time steps into the
renderer (the first 98 time steps had already been staged on Neptune's
tmp disks, with 20 frames on /tmp9 and 78 frames on /tmp8), then used
out-of-core rendering to get the remaining 102 time steps during
animation. I was able to animate the velocity, temperature, and
salinity variables in the GUI running on pollux in ADL, as well as
regenerate all the animation frames at 1280x1024 and save them to
disk. By the time I was done with the first 200 time steps, it was
already 12:05pm. Cris Windoffer extended my dedicated time to 12:30pm,
so I decided to load in another 45 time steps, i.e. #201 to
#244. I was able to load in all the datasets and, again,
animate the velocity, temperature, and salinity datasets. These 45
time steps were loaded in as a different dataset from the previous 200
time steps. Since time step #201 has a different value range from time
step #1, the color and the opacity look inconsistent, because the data
values are normalized using different ranges (sorry, my fault!).
Therefore, the animation generated from the last 45 time steps doesn't
look right.
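Two details of this run can be illustrated in a few lines: looping
back to #0 once the real time steps run out, and why normalizing two
batches with different ranges gives inconsistent colors. This is a
sketch under assumptions, not the renderer's code: the sample values
are hypothetical, and the function names are made up for illustration.

```python
AVAILABLE_STEPS = 181  # time steps #0..#180 exist on disk


def source_step(step):
    """Map a requested time step to an existing one by looping back to #0."""
    return step % AVAILABLE_STEPS


def value_range(datasets):
    """Global (min, max) over every dataset that will share a color map."""
    values = [v for data in datasets for v in data]
    return min(values), max(values)


def normalize(value, lo, hi):
    """Scale value into [0, 1] for color and opacity lookup."""
    return (value - lo) / (hi - lo)


# Hypothetical sample values, not taken from the ocean model output.
first_batch = [2.0, 10.0, 30.0]
second_batch = [5.0, 10.0, 25.0]

# Separate ranges map the same physical value to different colors.
per_batch = (
    normalize(10.0, min(first_batch), max(first_batch)),
    normalize(10.0, min(second_batch), max(second_batch)),
)

# One shared range keeps the mapping consistent across both batches.
lo, hi = value_range([first_batch, second_batch])
shared = (normalize(10.0, lo, hi), normalize(10.0, lo, hi))

print(per_batch, shared)
```

Computing one range over all the time steps that belong to the same
animation, before loading any of them, would avoid the mismatch
described above.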
- The Results: In 2 hours and 10 minutes, I was able to load 245
time steps of LANL's global ocean model datasets (67 Gbytes of
processed data, corresponding to 90 Gbytes of original data) and
generate three animations of 200 time steps. The animations were later
assembled into a 3-screen movie, which will be shown at SC99. Most of
the 2.16 hours, i.e., 1.23 hours, was spent loading the data. The
rendering and animation were fast: about 1.6 frames per second for the
velocity data and 1 frame per second for the temperature and salinity
data. In the first 200 time steps, the memory usage on hypernode 0
reached its limit, which caused occasional page faults. As a
consequence, the CPU utilization during animation dropped from the
normal 80% to only 40%. This was due to an imbalance in the data
distribution: the first 16 nodes got three blocks of data each, while
the remaining 112 nodes got only 2 blocks of data each. I should have
calculated the block size in advance to make the data distribution
more even. With a better distribution, the CPU utilization and the
frame rate should improve further.
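The imbalance described above can be reproduced with a simple
block-count calculation. This is a sketch under assumptions: the total
of 272 blocks is inferred from 16 x 3 + 112 x 2, and the rule that the
lowest-numbered nodes receive the leftover blocks is a guess that
happens to match the reported counts, not the renderer's documented
behavior.

```python
def blocks_per_node(num_blocks, num_nodes):
    """Assign blocks as evenly as possible; the first nodes take the remainder."""
    base, extra = divmod(num_blocks, num_nodes)
    return [base + 1 if rank < extra else base for rank in range(num_nodes)]


def imbalance(counts):
    """Ratio of the heaviest node's load to the average load (1.0 is perfect)."""
    return max(counts) / (sum(counts) / len(counts))


# 272 blocks on 128 render nodes: inferred from the counts in the report.
demo = blocks_per_node(272, 128)
print(demo.count(3), demo.count(2), round(imbalance(demo), 2))

# A block count that divides evenly by the node count removes the imbalance.
even = blocks_per_node(256, 128)
print(even.count(2), imbalance(even))
```

Choosing the block size so that the number of blocks is a multiple of
the number of render nodes is one way to get the more even distribution
mentioned above.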
- Conclusion: This demo is still some distance away from the HPCC
milestone -- visualization of terabyte datasets on teraflop machines.
I believe the technology is there, but there are still many factors
that have to be pulled together to make it happen. In the past two
months of preparing the demo, I spent most of my time
finding the data, getting the data ready, finding space to store
the data, and finding the fastest way to read the data. It looks
like the biggest obstacle to reaching this
milestone is still disk storage and I/O bandwidth.
- Acknowledgement: During this exercise, I got help from many
people. Without their help, I would never have been able to complete
the demo. First, Robert Malone from Los Alamos National Laboratory
provided me with all the datasets I used in the demo. He kindly
responded to all my requests in a very efficient way! He retrieved 67
Gbytes of data from their archive and ftp'ed them to Caltech for me; he
also helped me to read and interpret the data. Secondly, I would like
to thank the Caltech/JPL Neptune support team for their help in using
Neptune, the disks, and the datacache machine. Cris Windoffer, a
long-time friend, has always been kind and helpful. He helped me make
the disk space available, set up the machine and queue to run my demo,
and patiently responded to every tiny thing that made me panic :-)
Sharon Bernett oversaw the entire exercise and pulled in help when
necessary. Chris Lee got me set up on SRB and taught me how to use it.
Heidi Lorenz-Wirzba and Jimi Patel helped me solve many
system-related problems. I would also like to thank three Neptune
users, Rick Janet, Ichiro Fukumori, and Victor Zlotnicki, who removed
their files from /tmp8 at my request. Last, but not least, I would like
to thank my partner, James Tsiao, who worked side by side with me over
the last two months to make the demo happen. Together we had to shuffle
hundreds of gigabytes of data back and forth between Neptune and
Johnny. We had to watch the disk usage like a wolf waiting for its prey
and grab space whenever more became available :-) We had to code,
debug, and test old code and new code over and over again ....