HPCC FY99 Milestone Report

Peggy Li November 11, 1999

The major milestone for the ESS Distributed Visualization task is a prototype for 4D datasets on TeraFLOPS scalable systems. We conducted a demo run on Tuesday, 11/9/99, to fulfill the milestone. The following is a detailed report on the preparation, execution, and results of the demo.

  1. Dataset preparation: I had hoped to get all 11 years of simulation data from LANL for their global ocean modeling Run #11. It turned out that they did not save the snapshots for the first five years; they saved only the annual average results. I therefore got 6 years of data. A snapshot is taken every 10 days, and I got 181 time steps in total. Each data file has four variables: the x and y components of velocity, temperature, and salinity. The volume size is 1280x896x20 voxels, or 370 Mbytes per time step. The total size of the 181 time steps is about 66.9 Gbytes. I processed the data, converted them to NetCDF format, and added the necessary metadata. For the velocity data, I calculated the magnitude of the velocity. I also added the ocean floor depth data (2D) to the velocity data to provide proper orientation. Each resulting file represents one time step and is 275 Mbytes in size, for 181 files in total.
  2. Data staging: There is not enough space on Neptune's temp disks to store all my datasets, so I had to keep them on Caltech's Datacache machine (Johnny). Data can be accessed remotely from Neptune using SDSC's SRB (Storage Resource Broker) at a theoretical rate of 14 Mbytes/sec. Unfortunately, the SRB command (Sget) takes 60 seconds to retrieve one data file from Johnny to Neptune (corresponding to 4.5 Mbytes/sec) due to a very low CPU utilization rate. The strategy for accessing data during the demo therefore had two stages: first prefetch the data from Johnny to Neptune's tmp disk, then read the data from Neptune's disk. Thanks to Cris Windoffer and many generous Neptune users, I was able to grab most of the space on Neptune's fastest disk (/tmp8) and store up to 78 time steps on that disk before the demo started. While the demo ran, I deleted the files that had already been loaded from the disk and retrieved more files from Johnny as soon as space became available. The staging is done by forking a process from the rendering software. Reading one file from /tmp8 into the renderer takes 16 seconds (18.4 Mbytes/sec). By staging enough data ahead of time and prefetching data from Johnny, I was able to input the data at full speed without waiting for data to move from Johnny to Neptune.
  3. Start the demo: The demo was scheduled to run from 9AM to 11AM with me as the only dedicated user. I was hoping to extend the demo run to noon because, by my calculation, I would need at least three hours just to move 180 files from Johnny to Neptune! Due to a configuration problem, the machine had to be rebooted twice to get it set up properly, and I was not able to access it until almost 10AM. The original plan was to run the renderer on 192 nodes, the input module on 16 nodes (it has to be on Node 8 to get the best disk throughput), and the output module on 32 nodes. My MPI program did not run in that configuration because the shared memory I requested exceeded the system limit. (I was using the MPI 2.0 one-sided communication API. The HP implementation restricts memory allocation for one-sided communication to a fixed memory segment, which is configured to be a small portion of the total memory space. Running on more nodes increases the total shared memory I request, which then exceeds the system limit.) I had to experiment with the MPI environment variable MPI_SHMEMCNTL for almost 20 minutes to find a setting that allowed my program to run. Finally, I was able to run my program using 128 nodes for the renderer, 16 nodes for input, and 32 nodes for output.
  4. The Process: I had planned to load 300 to 360 time steps in the demo if time permitted. Because I started late, I decided to load 200 time steps first. I had to fake the datasets above time step #181 by looping back from #0. I first prefetched 98 time steps into the renderer (the first 98 time steps had already been staged on Neptune's tmp disks, with 20 frames on /tmp9 and 78 frames on /tmp8), then used out-of-core rendering to get the remaining 102 time steps during animation. I was able to animate the velocity, temperature, and salinity variables to the GUI running on pollux in ADL, as well as regenerate all the animation frames at 1280x1024 and save them to disk. By the time I was done with the first 200 time steps, it was already 12:05pm. Cris Windoffer extended my dedicated time to 12:30pm, so I decided to load another 45 time steps, i.e. #201 to #244. I was able to load all the datasets and, again, animate the velocity, temperature, and salinity datasets. These 45 time steps were loaded as a different dataset from the previous 200 time steps. Because time step #201 has a different value range from time step #1, the data values were normalized using a different range, so the color and opacity look inconsistent (sorry, my fault!). As a result, the animation generated from the last 45 time steps doesn't look right.
  5. The Results: In 2 hours and 10 minutes, I was able to load 245 time steps of LANL's global ocean model datasets (67 Gbytes of processed data, corresponding to 90 Gbytes of original data) and generate three animations of 200 time steps. The animations were later assembled into a 3-screen movie, which will be shown at SC99. Most of the 2.16 hours, i.e. 1.23 hours, was spent loading the data. The rendering and animation were fast: about 1.6 frames per second for the velocity data and 1 frame per second for the temperature and salinity data. During the first 200 time steps, the memory usage on hypernode 0 reached its limit, causing occasional page faults. As a consequence, the CPU utilization during animation dropped from the normal 80% to only 40%. This was due to an imbalance in the data distribution: the first 16 nodes got three blocks of data each, while the remaining 112 nodes got only two blocks each. I should have calculated the block size in advance to make the data distribution more even. With better distribution, the CPU utilization and the frame rate should improve further.
  6. Conclusion: This demo is still some distance from the HPCC milestone: visualization of terabyte datasets on teraflop machines. I believe the technology is there, but many factors still have to be pulled together to make it happen. In the two months I spent preparing the demo, most of my time went to finding the data, getting the data ready, finding space to store the data, and finding the fastest way to read the data. It looks like the biggest obstacle to reaching this milestone is still disk storage and I/O bandwidth.
  7. Acknowledgement: During this exercise, I got help from many people. Without their help, I would never have been able to complete the demo. First, Robert Malone from Los Alamos National Laboratory provided me with all the datasets I used in the demo. He kindly responded to all my requests very efficiently! He retrieved 67 Gbytes of data from their archive and ftp'ed them to Caltech for me, and he also helped me read and interpret the data. Second, I would like to thank the Caltech/JPL Neptune support team for their help in using Neptune, the disks, and the Datacache machine. Cris Windoffer, a long-time friend, has always been kind and helpful. He helped me make the disk space available, set up the machine and queue to run my demo, and patiently responded to every tiny thing that made me panic :-) Sharon Bernett oversaw the entire exercise and pulled in help when necessary. Chris Lee got me set up on SRB and taught me how to use it. Heidi Lorenz-Wirzba and Jimi Patel helped me solve many system-related problems. I would also like to thank three Neptune users, Rick Janet, Ichiro Fukumori, and Victor Zlotnicki, who removed their files from /tmp8 at my request. Last but not least, I would like to thank my partner, James Tsiao, who worked side by side with me over the last two months to make the demo happen. Together we had to shuffle hundreds of gigabytes of data back and forth between Neptune and Johnny. We had to watch the disk usage like a wolf waiting for its prey and grab space whenever more became available :-) We had to code, debug, and test old code and new code over and over again ....
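
The file sizes and transfer times quoted in items 1 to 3 can be cross-checked with a few lines of Python. This is only a sketch: the 4-byte value size, the use of decimal Mbytes, and the count of three 3D variables per processed file (velocity magnitude, temperature, salinity) are assumptions, not stated in the report, and the small 2D depth field is ignored.

```python
# Grid dimensions and timings quoted in the report.
NX, NY, NZ = 1280, 896, 20
SGET_SECONDS_PER_FILE = 60

# Assumptions (not stated in the report): 4-byte values, 1 Mbyte = 10**6 bytes.
BYTES_PER_VALUE = 4
MBYTE = 1e6


def file_mbytes(n_variables):
    """Size in Mbytes of one time step holding n_variables 3D fields."""
    return NX * NY * NZ * n_variables * BYTES_PER_VALUE / MBYTE


def transfer_hours(n_files, seconds_per_file):
    """Wall-clock hours needed to move n_files one after another."""
    return n_files * seconds_per_file / 3600.0


raw_mb = file_mbytes(4)        # vx, vy, temperature, salinity
processed_mb = file_mbytes(3)  # assumed: |v|, temperature, salinity
sget_rate = processed_mb / SGET_SECONDS_PER_FILE  # Mbytes/sec over SRB

print(f"raw file:       {raw_mb:.1f} Mbytes")
print(f"processed file: {processed_mb:.1f} Mbytes")
print(f"Sget rate:      {sget_rate:.2f} Mbytes/sec")
print(f"180 files:      {transfer_hours(180, SGET_SECONDS_PER_FILE):.1f} hours")
```

Under these assumptions the raw file comes to about 367 Mbytes and the processed file to about 275 Mbytes, close to the 370 and 275 Mbytes quoted above, and 180 Sget transfers at 60 seconds each take three hours.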
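
The inconsistent colors described in item 4 come from normalizing each dataset with its own value range. A minimal sketch of the difference between per-dataset and shared-range normalization follows. The function names and sample values are hypothetical and are not taken from the rendering software.

```python
def value_range(datasets):
    """Return (lowest, highest) value found across all given datasets."""
    lo = min(min(d) for d in datasets)
    hi = max(max(d) for d in datasets)
    return lo, hi


def normalize(values, lo, hi):
    """Map values linearly so that lo -> 0.0 and hi -> 1.0."""
    span = hi - lo
    return [(v - lo) / span if span else 0.0 for v in values]


# Hypothetical sample values standing in for two separately loaded datasets.
first = [0.0, 10.0, 20.0]
second = [10.0, 30.0, 50.0]

# Per-dataset ranges: the same physical value (10.0) maps to different
# normalized values, so it gets a different color in each animation.
per_first = normalize(first, *value_range([first]))
per_second = normalize(second, *value_range([second]))

# Shared range: 10.0 maps to the same normalized value in both datasets.
lo, hi = value_range([first, second])
shared_first = normalize(first, lo, hi)
shared_second = normalize(second, lo, hi)

print(per_first[1], per_second[0])
print(shared_first[1], shared_second[0])
```

Computing one range over every time step before loading would keep the color and opacity mapping consistent across separately loaded datasets, at the cost of an extra pass over the data.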
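
The load imbalance described in item 5 can be illustrated with a small sketch. Only the 128 renderer nodes and the 272-block total (16 x 3 + 112 x 2) come from the report. The function names, the assignment rule, and the 256-block alternative are illustrative assumptions, not the renderer's actual code.

```python
def blocks_per_node(n_blocks, n_nodes):
    """Spread n_blocks over n_nodes as evenly as possible.

    The first (n_blocks % n_nodes) nodes receive one extra block.
    """
    base, extra = divmod(n_blocks, n_nodes)
    return [base + 1 if node < extra else base for node in range(n_nodes)]


def imbalance(counts):
    """Ratio of the most loaded node to the average load (1.0 is ideal)."""
    return max(counts) / (sum(counts) / len(counts))


N_NODES = 128

# Block count implied by the report: 16 nodes x 3 blocks + 112 nodes x 2 blocks.
reported = blocks_per_node(16 * 3 + 112 * 2, N_NODES)

# Hypothetical alternative: a block size chosen so the count divides evenly.
even = blocks_per_node(256, N_NODES)

print(f"reported: max {max(reported)} blocks, imbalance {imbalance(reported):.2f}")
print(f"even:     max {max(even)} blocks, imbalance {imbalance(even):.2f}")
```

With 272 blocks, the most heavily loaded nodes carry about 1.41 times the average load. Choosing a block size that yields a multiple of 128 blocks would remove the imbalance in this simplified model.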