Philip Newton (pne) wrote,


botanix faw down go boom

Yesterday, I noticed that /tmp on our machine botanix was full. I deleted a couple of files I still had lying around there and had a look to see what was wrong.

Apparently, the database had been writing two error messages every hour for the past several days. The error log file itself wasn't that big, and the info dumps even smaller, but the fact that there were two of them every hour caused them to add up.
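Two files an hour is 48 a day, so even small files pile up over several days. Something like the following would have shown me at a glance which group of files was eating the space. It is only a rough sketch: the function name `usage_by_prefix` and the idea of grouping files by the first few characters of their name are my own assumptions, not anything the database or botanix provides.

```python
import os
import sys
from collections import defaultdict


def usage_by_prefix(directory, prefix_len=8):
    """Return {prefix: (file_count, total_bytes)} for regular files directly in `directory`.

    Files are grouped by the first `prefix_len` characters of their name
    (a hypothetical grouping; pick whatever matches the real file names).
    """
    totals = defaultdict(lambda: [0, 0])
    for entry in os.scandir(directory):
        # Skip subdirectories, symlinks, sockets and the like.
        if not entry.is_file(follow_symlinks=False):
            continue
        key = entry.name[:prefix_len]
        totals[key][0] += 1
        totals[key][1] += entry.stat(follow_symlinks=False).st_size
    return {key: (count, size) for key, (count, size) in totals.items()}


if __name__ == "__main__":
    directory = sys.argv[1] if len(sys.argv) > 1 else "/tmp"
    report = usage_by_prefix(directory)
    # Show the ten groups that take up the most space, largest first.
    biggest = sorted(report.items(), key=lambda kv: kv[1][1], reverse=True)[:10]
    for prefix, (count, size) in biggest:
        print(f"{size:>12} bytes  {count:>5} files  {prefix}*")
```

Run with a directory as its only argument (it defaults to /tmp); a group with many files but a modest size per file is the pattern described above.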

I called Ralf and he said this might be due to something with cron -- he had compressed cron's log file to free up space on /var/tmp after that filesystem had filled up some days previously, and this might stem from the same problem. (I'm not sure that's it, though -- user bfssys has one cron job running every five minutes, another every hour, and another once daily; why should only the hourly one produce problems and not the five-minute-ly one? Anyway.)

Ralf suggested either restarting cron if that's possible or rebooting the machine. I asked Rolf whether cron can be restarted and he told me to RTFM and to try SIGHUP. `man cron` didn't say anything about restarting, however, so I went around asking whether it'd be OK to reboot botanix. They said "give us ten minutes", so I did, and scheduled a reboot for 16:35.

With fewer than 60 seconds to go, I remembered that Ralf had said a couple of processes needed to be shut down manually before a reboot, so I kill()ed the reboot process, shut down those processes, and then called reboot without a time argument. I saw the message "Reboot time reached" (or similar) in my terminal window, then another couple of lines, and then my connection was gone.

Oh good, I thought, and waited a while. Then I went to the server room, but when I switched on the console, it was blank. Oops. After twenty minutes, Swetlana came by and asked whether botanix was back up. I had another look but couldn't connect to it.

I called Olaf, who isn't a sysadmin per se but knows a fair bit about Unix boxes. I'm glad that while he was his sarcastic self, he didn't reproach me but tried to get it fixed with me.

The front panel showed "FLT 3000", which he took to mean "fault"; I initially thought it had something to do with the tape, but that's probably DLT rather than FLT.

I had a look to see whether I could find a manual, but all I found was an old one for a model that wasn't the one we had (specifically, it didn't have a front panel LCD and hence the manual had no list of error messages and their meanings).

Finally, Olaf decided to try switching the machine off physically. He flipped the switch, waited a bit, and switched it on again. A ± sign appeared on the console, indicating it was at least connected in some way, but not very helpful. After a little bit, the LCD showed "TEST 3000" followed by "FLT 3000", then nothing. We also tried switching off the external disks and switching them on again, but nothing helped.

I had a look for online docs but the places I looked on HP's site require a support agreement. Olaf had found some PDFs for a similar model, though -- must have looked smarter than I had. However, I found out the model number (9000/861) by searching for an old bug report I had made for a perl I had built on the machine, which included the uname. I thought that might come in handy.

We also looked around for manuals but couldn't find any. Maybe they were in a locked cabinet, but one of the "pseudo-sysadmins" (they have the title but not necessarily the experience; projects are kind of responsible for adminning their own machines) had left already and the other had no keys. He looked around but didn't find any in or on the desk of the trainee in the computer room, and the front desk, which should have keys, was locked. Well, at least our emergency plan was tested (and failed).

So in the end Olaf said that the next morning (today), someone would call a technician since it looked like a hardware fault. Hopefully he'll be able to fix the machine quickly; a bunch of people are not able to work properly if botanix is down.

Edit 09:00: An HP chap might be coming over today... probably not this morning, though. I'll have to think of something else to do. Argh.