I’ll be moving my server over the next couple of days. I’m working on an email setup to make sure there’s no interruption there. The website and the blahg will, however, be down until Wednesday evening. Sorry for any inconvenience this may cause.
You may have noticed that about a week ago my webserver went down. It started off as having to deal with two failed disks in a RAID6, but quickly turned into a system reinstall. There was no data loss because of hardware failures, but I am sad to say that I accidentally nuked all the blahg post titles and publication times. Additionally, all the comment times and author names are gone. The content for both posts and comments is safe.
I’ll try to restore as many posts as possible as soon as possible. It might involve going through archive.org and copying the metadata over.
How did the data loss happen? I forgot to back up extended attributes. My blahging software uses them to store the post metadata. Oops.
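For the curious, here is a minimal sketch of how extended attributes could be dumped alongside a regular backup. It is not what my blahging software or my backup scripts actually do. The function names and the JSON layout are made up for illustration, and `os.listxattr`, `os.getxattr` and `os.setxattr` are only available in Python on Linux.

```python
import json
import os


def dump_xattrs(paths, listxattr=None, getxattr=None):
    """Return {path: {attr_name: hex_encoded_value}} for the given paths.

    The listxattr/getxattr parameters are hypothetical hooks that default to
    the Linux-only os.listxattr/os.getxattr; they exist so the logic can be
    exercised without a filesystem that supports extended attributes.
    """
    listxattr = listxattr or os.listxattr
    getxattr = getxattr or os.getxattr
    dump = {}
    for path in paths:
        attrs = {}
        for name in listxattr(path):
            # Values are raw bytes, so hex-encode them to keep the dump JSON-safe.
            attrs[name] = getxattr(path, name).hex()
        if attrs:
            # Files without any extended attributes are left out of the dump.
            dump[path] = attrs
    return dump


def save_dump(dump, out_file):
    """Write the dump as JSON next to the rest of the backup."""
    with open(out_file, "w") as f:
        json.dump(dump, f, indent=2, sort_keys=True)


def restore_xattrs(dump, setxattr=None):
    """Re-apply a dump produced by dump_xattrs to files at the same paths."""
    setxattr = setxattr or os.setxattr
    for path, attrs in dump.items():
        for name, hex_value in attrs.items():
            setxattr(path, name, bytes.fromhex(hex_value))
```

Something like `save_dump(dump_xattrs(list_of_post_files), "xattrs.json")` before the reinstall would have saved the metadata. Existing tools can also do this: rsync with `-X` or GNU tar with `--xattrs` preserve extended attributes.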
Sorry for the inconvenience.
The server is back up. I think all the services have been restarted. If you find something wrong, let me know.
As some of you may have noticed, my blahg has been down for about 2 days. The reason is, I upgraded my system, and php4 packages broke. I just installed the php5 packages, and things seem to work again.
As many of you may have noticed, I’ve been really lazy when it comes to updating this blahg of mine…so here’s a short summary of what happened over the past week at SC07. I’m sure I forgot to talk about a ton of things…feel free to leave a comment.
Friday, November 9
Pretty uneventful day…flying from JFK to Reno via LAX and checking into the hotel were the two highlights.
Saturday, November 10
We misread the bus schedule and ended up taking the 6:30 shuttle to the convention center. Waking up that early was quite painful. When we got to the center, we started unpacking the nodes, rack and the TV. Compared to the other teams, we were unfortunate enough to have twelve 8-core nodes, and two 4-core nodes. Yeah, 14 nodes, an InfiniBand switch, a gigE switch, a TV, and the full-sized rack. That’s 18 things to unpack. Other teams had around 8 nodes and similar interconnect. Either way, we had more to set up.
The organizers of the Cluster Challenge (this is the whole thing about universities, and teams, Stony Brook being one of them - read the link for more info) were nice enough to organize a cruise on Lake Tahoe for us…but the only problem with it was that it was in the evening. So, we got to see a whole lot of big black nothing.
Majority of the Indiana University team, featuring Pikachu:
I must admit, the Pikachu hat was a great way to draw attention. I therefore propose that next year, the team looks more like this (photo courtesy of Peter Honeyman):
Sunday, November 11
While most of Saturday was spent setting up hardware, at least half of Sunday was spent setting up software. Somehow, magically, NFS decided to stop working (I’ve been told by the NFS folks that it’s generally not NFS that breaks but something else, but I maintain that NFS is broken :) ). In our case, NFS was a major component - we went the netboot way, and had only 1 disk for the entire cluster. We exported the node root directory image, as well as the home directories over NFS over ethernet, and created a tmpfs (kind of like a ramdisk, but it grows as needed) over NFS over IP over IB. There’s probably a way to take IP out of the equation, but we just didn’t have enough time to try everything we wanted to - like doing PXE boot over IB, removing the need for ethernet altogether. (One of the visitors who stopped by our cluster told me that he does do netboot over IB.)
Monday, November 12
The Cluster Challenge started at 20:00. Things got really hectic really quickly, but overall it was all fun. Once everything calmed down, we decided to start the 6-hour shifts. I went back to the hotel. At 4:13 in the morning, I got woken up by a call from the team leader asking me when I’d be back. 4:13 is waaaay too early. I decided to take a pillow and the blanket with me to the conference center.
About 19.5 hours later, still at the conference center, I decided to go to sleep. I didn’t feel like going back to the hotel, so I crashed on one of the couches right by our team’s rack. I hear there is a photo of me sleeping on the couch. Moral of the story: when at a conference, take a pillow and a blanket with you, it might come in handy when you decide to sleep at the conference center.
Tuesday, November 13
Shortly after noon, the entire conference center lost power for a couple of seconds (see The Register). None of the teams were using UPSes (UPSes eat up power, which was quite precious - only 26 Amps per team), so all the clusters rebooted. I’ve heard that the team from Taiwan lost more than 10 hours of computation because of that.
We lost only about 15 minutes wall time of computation (on 96 cores) because we had just started a new job.
Taken right after the power outage (notice that the lights are still off):
Wednesday, November 14
The competition ended at 16:00. That’s 44 hours after starting. Everyone was quite tired, but not tired enough to skip what the conference organizers had prepared for us. They rented out an entire arcade in one of the nearby hotels. The arcade included a whole lot of games, including laser tag. I wish I had a photo of one of the signs at the laser tag place, because it had quite a number of grammatical mistakes.
Thursday, November 15
The conference ended at 16:00. Everything got promptly torn down, and packed up in boxes. And then…*drumroll* everyone headed to a Blue Man Group show done specifically for SC07 tech badge holders (which included the folks doing the Cluster Challenge - read: us). The show was fantastic, but far too short. Next time I have a pile of PVC pipes, I’m going to have a ton of fun :)
Friday, November 16
After an hour-long meeting at the center to figure out what could be done better next year, everyone dispersed, and went their own ways. We went to the airport, and headed back to NY - this time via Phoenix. We got to JFK around midnight.