Troubleshooting

  1. Telescope
  2. Cryogenics
  3. Electronics
  4. Rotator
  5. Computer Crashes
    1. andante crashes
    2. allegro crashes
    3. allegro and andante crash
    4. kilauea crashes
    5. The telescope computer (hau) crashes
    6. Everybody crashes!
  6. Data Streams
    1. Data Stream Programs or Windows Die
    2. dirsync Crashes or Hangs
    3. header_copy Crashes or Hangs
    4. write_log Crashes or Hangs
    5. Inspecting the Encoder Logs
    6. Merging Dies
  7. Data Processing
    1. For all cases
    2. Xterms die
    3. Xterm(s) remain alive but IDL quits
    4. Xterm(s) remain alive but an IDL routine crashes
    5. IDL cleaning code does not clean observations of a particular source
  8. General Computer Troubleshooting
    1. andante (the DAS/fridge computer)
    2. allegro
    3. kilauea
    4. Problems Accessing Data Disks or Dying Data Disks
    5. Dying System Disks
  9. Revision History


Telescope

For non-Bolocam-related problems (dome, dish, antenna computer, etc.), go to the main CSO Hawaii webpage (http://www.cso.caltech.edu), scroll down and click on "Local Information".  You will find generic troubleshooting information there.


Cryogenics

About the only cryogenics problem that the typical observer can deal with is running out of LHe.  If you know how to do LHe fills, you can go ahead and refill; see the cryogen fill instructions.  If you caught the problem quickly enough and the fridge did not die, you may be able to continue observing.  If the UC Fridge GRT reading on the fridge monitoring page returns to its previous value, you're fine and you can keep observing.

If the fridge did run out (IC and/or UC Fridge GRT readings high and not recovering), then you can at least speed the recovery along by following the recovery instructions.

If you are not experienced with doing cryogen fills, your night is done.  Leave a note for the day crew and shut down for the day.  They will refill, recover, and set the fridge to cycle and be cold for the next night.

If you have more serious problems -- cryogen hold time sharply decreased, fridge cycle failing, etc. -- let the day crew and the Bolocam support person know.


Electronics

There are two kinds of electronics problems one typically runs into:

Rotator

Typical problems that might occur are:
The sequence of troubleshooting is as follows:
  1. You should stop your current observing macro but otherwise leave all programs running.

  2. First, assume that it is a simple problem that just requires resetting the rotator.  Use the home program; instructions are given elsewhere.  This essentially resets the entire system and will likely get rid of any mild problems.  If rehoming is successful, test the rotator by rotating to some small angles (between -30 and +30 deg) using the interactive program, then read back the angles and see if they make sense.  If interactive works, you can restart observing.  The observation during which the problem occurred should probably be discarded, but in principle later observations should be fine.  If either the rotator or write_log program died, you will have to restart them as explained below prior to restarting observing.

  3. If you can't rehome, or rehoming does not help, maybe the DIP switches on the fiber-optic isolators have gotten screwed up.  Check that they are set correctly by comparing to the instructions on the Setting up for Observing page.

  4. If rehoming fails, you should determine whether it is an obvious broken connection problem.  Go outside and check all the rotator cabling, which is described on the Setting up for Observing page.  The most likely failure mode is the fiber-optic cables; spares can be found in a box near allegro.  Make sure you hook the replacement up properly, paying attention to the connector colors and where they connect to.  If the other parts of the cabling fail, you may be able to find replacements by digging around the AOS lab.  After replacing the damaged cabling, try rehoming again.  If that works, try testing using interactive as above.  If that works, you can probably start observing again, though again you may have to restart rotator or write_log as above.

  5. If rehoming continues to fail, there may be communication or control problems.  If you suddenly get errors in your rotator or write_log window such as modprobe: can't locate module, then somehow one of the run-time kernel modules has been unloaded.  Log in to allegro.submm.caltech.edu as observer (password in white Bolocam Manual binder) and type

    > insmod rocket
    > insmod seaio

    You should receive messages like

    Using /lib/modules/2.4.13-0.6/kernel/drivers/char/rocket.o
    Using /lib/modules/misc/seaio.o

    possibly with warnings or the messages

    insmod: a module named rocket already exists
    insmod: a module named seaio already exists

    There may be other warnings.  As long as none of them say a module could not be loaded, things should work.  Try rehoming and running interactive as above; if successful, you can restart observing, restarting rotator and/or write_log as above if necessary.

  6. If you are still having problems, then the best thing to do is just lock the rotator to its home position and disable the rotator for the night.  By turning the motor power off (see the rotator instructions elsewhere), you can rotate the dewar to its home position (where the homing sensor tab occludes the right half of the homing sensor).  Turn the power back on to have it hold there. 

    You will have to restart the rotator program with rotation disabled (R = 0). 

    If you had problems communicating with the encoder, then you will have to restart write_log with the encoder readout disabled (do not include the -e flag).  Clearly note when this occurred in your observing logs, as it will be necessary to recalibrate the rotator angle from that point onward.  The data will be entirely analyzable; they will just have to be treated differently than the preceding data.
Regardless of the problem, inform the Bolocam support person, providing details, so the problem can be rectified.


Computer Crashes

Ah, the bane of every system, the reason we should just go back to using chart recorders and slide rules!

Remarkably, our critical computers, andante and allegro, are quite stable.  This is because we do not run much on them: andante only runs the DAS and the fridge control, and allegro only runs the data copying programs and gbolostrip.  kilauea tends to be less stable due to strange goings-on with its video card.  We provide instructions here for recovering from crashes of each of these machines.

For an explanation of the data streams, see the Data Acquisition, Rotator Control, and Data Handling page.

andante crashes

This is not too tragic.  Do the following:

allegro crashes

This is a pain because the encoder log files are completely lost for this period.  Do the following:

allegro and andante crash

You should only be so lucky.  The main thing here is to bring up both computers and get everything cross-mounted as explained separately for each machine above before starting any programs.  Then you can restart the data-copying programs, then the DAS, then merge.

kilauea crashes

This is not so bad because no data are lost.  Don't be fooled by the fact that your data copying programs were running in windows on kilauea; they weren't really, only the log files were being displayed in these windows.  Do the following:

The telescope computer (hau) crashes

This happens very infrequently.  Your observation is terminated.  write_log will continue running without too much problem, but, obviously, it gets no information from the telescope and so will write invalid values. 

To recover, do the following:

Everybody crashes!

Again, get all the computers up and the disks cross-mounted first, then start up the various programs.


Data Streams

For an explanation of the data streams, see the Data Acquisition, Rotator Control, and Data Handling page.

Data Stream Programs or Windows Die

The more likely occurrence is that the X connection to the machine displaying the monitoring windows for the data copying programs goes down (for example, if kilauea crashes).  This is not a major problem!  The data copying programs are running autonomously on allegro; all that has happened is that the windows that display the log files written by these programs have died.  You have not lost any data; all you need to do is restart the monitor windows.  Once you have your X server back up and running, log into allegro (set X forwarding as necessary) and type

> start_tel_util YYYYMMDD R E 0

YYYYMMDD is the UT date, R indicates whether you want to use the dewar rotator or not (R = 1 means "use the rotator"), and E indicates whether you want to read the rotator encoder (E = 1 indicates that the encoder should be read; if you don't read the encoder, the rotator angle will be taken to be 0 and you will have to deal with this later in the analysis).  The last argument, 0, tells the program that all the processes are already running and you just want to create the monitoring windows.  start_tel_util will check to see whether all 4 programs are indeed running; it will advise you if there is a problem.

If, on the other hand, allegro has itself died, then the file copying programs have died.  Once you have allegro back up and are ready to start taking data again, you can start them back up using the command

> start_tel_util YYYYMMDD R E 1 nlast

where YYYYMMDD, R, and E are as above.  The 4th argument is set to 1 to advise start_tel_util that it needs to start up the programs again, not just start up the log monitoring windows.  nlast is very important; it is the number of the last rotator log that was written (in /data00/encdir/YYYYMMDD).  Remember, it is just the number, not the entire filename.  You will have lost the encoder log files between nlast and the minute you restart the programs; you will have to force merging to work around them as indicated below.  However, as long as you use the nlast argument, the observation number should pick up where it left off.  If you forget the nlast argument, the observation number will start again from 0 and you will have a mess on your hands (it can be cleaned up, but you will have to consult an expert).
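
The two forms of the start_tel_util command are easy to mix up in the middle of the night.  As a sanity check, here is a minimal Python sketch that assembles the command line from its arguments and refuses to build a restart command without nlast.  The helper name build_start_tel_util_command and its keyword arguments are made up for illustration; only the argument order comes from the text above, and the sketch assumes you are restarting partway through a night, so that an nlast value exists.

```python
import re


def build_start_tel_util_command(ut_date, use_rotator, read_encoder,
                                 programs_died, nlast=None):
    """Assemble a start_tel_util command line (hypothetical helper).

    ut_date       -- UT date as a YYYYMMDD string
    use_rotator   -- R: 1 to use the dewar rotator, 0 otherwise
    read_encoder  -- E: 1 to read the rotator encoder, 0 otherwise
    programs_died -- True to restart the programs (4th argument 1),
                     False to recreate only the monitor windows (0)
    nlast         -- number of the last rotator log written; required
                     whenever programs_died is True
    """
    if not re.fullmatch(r"[0-9]{8}", ut_date):
        raise ValueError("ut_date must be eight digits, YYYYMMDD")
    for name, flag in (("use_rotator", use_rotator),
                       ("read_encoder", read_encoder)):
        if flag not in (0, 1):
            raise ValueError(name + " must be 0 or 1")

    args = ["start_tel_util", ut_date,
            str(int(use_rotator)), str(int(read_encoder))]
    if not programs_died:
        # Programs are still running: only recreate the monitor windows.
        args.append("0")
    else:
        if nlast is None:
            # Without nlast the observation number would restart from 0.
            raise ValueError("nlast is required when restarting the programs")
        if int(nlast) < 0:
            raise ValueError("nlast must not be negative")
        args.extend(["1", str(int(nlast))])
    return " ".join(args)
```

For example, build_start_tel_util_command("20040225", 1, 1, True, nlast=57) returns the string "start_tel_util 20040225 1 1 1 57", while leaving out nlast in that call raises an error instead of quietly resetting the observation number.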

If allegro has not died but you suspect that one or more of the file copying programs has died, you can check by logging into allegro and typing

> check_tel_util

You will get messages indicating which processes are still running.  Proceed as follows:
  1. If all of the processes have died, you can restart in the same way as you would if allegro had died.

  2. If write_log has died, the easiest thing to do is to kill the other programs and restart everything as if allegro had died.  Just issue the command

    > kill_tel_util

    You will see messages indicating which programs were killed and which were not running.  Then issue the

    > start_tel_util YYYYMMDD R E 1 nlast

    command as you would have above.  Make sure to type the above line correctly.

  3. If write_log has not died, you are better off restarting the processes by hand so the encoder logs remain continuous.  Issue whichever of the following commands are necessary (corresponding to the processes that need to be restarted):

    > /home/observer/src/rotator/rotator R \
         >>& /data00/encdir/rotator_YYYYMMDD.log &

    > /home/observer/src/dirsync/dirsync.py \
         /smb/andante/YYYYMMDD \
         /data00/rawdir/YYYYMMDD \
         >>& /data00/rawdir/dirsync_YYYYMMDD.log &

    > /home/observer/src/dirsync/header_copy.py \
         /data/plog/YYYYMMDD \
         /data00/headerdir/YYYYMMDD \
         >>& /data00/rawdir/dirsync_YYYYMMDD.log &

    You can then restart the log monitoring windows using the

    > start_tel_util YYYYMMDD R E 0

    command as you would have if only the X connection had died.  You may end up with duplicate log monitoring windows; just kill the duplicates.  Killing the duplicate log monitoring windows does not affect the operation of the running programs.  Make sure to type the above line correctly, otherwise you may get unexpected behavior.


dirsync Crashes or Hangs

Check: /smb/andante/YYYYMMDD is visible on allegro but not readable by observer.
Remedy: Check the Windows sharing setup for \\andante\d\das_data and \\andante\d\das_data\YYYYMMDD.

Check: /smb/andante/YYYYMMDD is not visible on allegro, but a directory listing of /smb/andante returns something.
Remedy: \\andante\d\das_data\YYYYMMDD probably has not been created.  Do so from andante's desktop.

Check: /smb/andante/YYYYMMDD is not visible on allegro and a directory listing of /smb/andante returns nothing.
Remedy:
  1. Check that andante is powered on and Windows has not crashed.  Reboot if necessary.
  2. If andante is on, then \\andante\d\das_data probably has not been cross-mounted.  cd to /smb on allegro and follow the instructions in /smb/AAAREADME.  You may need the root password; it is on allegro's monitor.  If this fails, then it is likely that \\andante\d\das_data is not being shared properly.  Check the sharing setup for this directory on andante directly.  A reboot of andante may be necessary.  It is very unlikely that the problem is with allegro, as this cross-mounting has operated without problems on allegro's side since 2000.

Check: /data00/rawdir/YYYYMMDD does not exist.
Remedy: Should not happen -- start_tel_util should not have started dirsync.  Check that /data00/rawdir exists and that observer has write permissions.  If the permissions are wrong, change them by becoming root.

Check: /data00/rawdir/YYYYMMDD is not writeable by observer.
Remedy: Should not happen -- start_tel_util should not have started dirsync.  Change permissions by becoming root.

Check: Is there free disk space on /data00?  Check using df -k.
Remedy: Move or delete some data.
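
If you would rather not walk through the directory checks above by hand, they can be roughly automated.  The following is a minimal Python sketch, not part of the Bolocam software: the function names directory_status and dirsync_report are invented here, and the paths are simply the ones named in the checklist.  Run it as observer, since root passes every permission test and would hide the problems you are looking for.

```python
import os


def directory_status(path, need_write=False):
    """Classify one directory roughly the way the checklist does."""
    if not os.path.isdir(path):
        return "missing"
    if not os.access(path, os.R_OK | os.X_OK):
        return "not readable"
    if need_write and not os.access(path, os.W_OK):
        return "not writable"
    return "ok"


def dirsync_report(ut_date):
    """Return (path, status) pairs for the directories dirsync depends on."""
    checks = [
        ("/smb/andante", False),              # cross-mounted share from andante
        ("/smb/andante/" + ut_date, False),   # tonight's DAS output
        ("/data00/rawdir", True),             # destination parent directory
        ("/data00/rawdir/" + ut_date, True),  # tonight's destination
    ]
    return [(path, directory_status(path, need_write))
            for path, need_write in checks]
```

Printing the result of dirsync_report("YYYYMMDD") with the real UT date gives one line per directory; anything other than "ok" points you at the matching row above.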


header_copy Crashes or Hangs

Check: /data/plog is visible on allegro but not readable by observer.
Remedy: Check permissions for /data/plog; become root and change them if necessary.

Check: /data/plog is visible on allegro but is empty.
Remedy: hau:/var/plog probably has not been cross-mounted.  Check using df -k.  If hau:/var/plog is not mounted at /data/plog, then become root and mount it by typing mount /data/plog.  If this fails, then it's likely that hau is either not exporting /var/plog or not considering allegro to be a valid mount client.  Contact a CSO staff member in the following order: Hiro, Ruisheng, Richard, Martin, anyone else.  Of course, hau may just be dead, but presumably you would have been told that by now.

Check: /data/plog is visible on allegro and contains files, but nothing is being copied.
Remedy: Is there a .lck file in /data/plog?  Check by doing ls /data/plog/*.lck.  If not, then you are probably suffering from the antenna computer "no more free inodes" problem.  You have to reboot the antenna computer; see this link.  Once the antenna computer display shows something, in UIP type ANTENNA/RESTART/NOSYNC; you should see the antenna display come back up.  If not, consult the CSO Troubleshooting page.  If you still can't get it to come up, contact someone (try the pager first, then Hiro).

Check: /data00/headerdir/YYYYMMDD does not exist.
Remedy: Should not happen -- start_tel_util should not have started header_copy.  Check that /data00/headerdir exists and that observer has write permissions.  If the permissions are wrong, change them by becoming root.

Check: /data00/headerdir/YYYYMMDD is not writeable by observer.
Remedy: Should not happen -- start_tel_util should not have started header_copy.  Change permissions by becoming root.

Check: Is there free disk space on /data00?  Check using df -k.
Remedy: Move or delete some data.
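
The "is hau:/var/plog mounted at /data/plog" check above means reading df -k output by eye.  Here is a minimal Python sketch of the same test that works on the text df prints.  The function name is_mounted is invented for illustration, and the sketch assumes the usual df layout (device in the first column, mount point in the last, and long device names possibly wrapped onto their own line); mount points containing spaces would confuse it.

```python
def is_mounted(df_output, device, mount_point):
    """Return True if df output shows device mounted at mount_point."""
    lines = df_output.strip().splitlines()[1:]  # skip the header line
    pending_device = None
    for line in lines:
        fields = line.split()
        if not fields:
            continue
        if len(fields) == 1:
            # A long device name wrapped onto a line of its own;
            # the numbers and mount point follow on the next line.
            pending_device = fields[0]
            continue
        if pending_device is not None:
            fields = [pending_device] + fields
            pending_device = None
        if fields[0] == device and fields[-1] == mount_point:
            return True
    return False
```

To use it on a live system you could feed it the output of subprocess.run(["df", "-k"], capture_output=True, text=True).stdout and ask for device "hau:/var/plog" and mount point "/data/plog".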


write_log Crashes or Hangs

Check: Gives RPC timeout error.
Remedy: hau's RPC server is not up, is failing, or the network connection to hau is not good.  Not much you can do; try calling Hiro.  Check whether you are also having access problems with /data/plog.

Check: /data00/encdir/YYYYMMDD does not exist.
Remedy: Should not happen -- start_tel_util should not have started write_log.  Check that /data00/encdir exists and that observer has write permissions.  If the permissions are wrong, change them by becoming root.

Check: /data00/encdir/YYYYMMDD is not writeable by observer.
Remedy: Should not happen -- start_tel_util should not have started write_log.  Change permissions by becoming root.

Check: Is there free disk space on /data00?  Check using df -k.
Remedy: Move or delete some data.
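
The free-space question that closes each of the checklists above can also be answered from Python.  This is only a sketch: the function name free_space_report and the 5% warning threshold are assumptions made for illustration, not values taken from the Bolocam software.

```python
import shutil


def free_space_report(path, min_free_fraction=0.05):
    """Return (free_bytes, free_fraction, is_low) for the disk holding path."""
    usage = shutil.disk_usage(path)
    # Guard against pseudo-filesystems that report a total size of zero.
    fraction = usage.free / usage.total if usage.total else 0.0
    return usage.free, fraction, fraction < min_free_fraction
```

Calling free_space_report("/data00") on allegro would tell you whether the data disk is close to full before a copying program hangs on it.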

Inspecting the Encoder Logs

Sometimes you may not be sure what has happened with the encoder logs and you want to inspect them directly to see which observation numbers are present and whether they match up with the source names as you expect.  There is a simple utility for doing this, sum_encdir.  To use it, type

> sum_encdir /data00/encdir/YYYYMMDD

A list of observation numbers and source names will be printed out.

Merging Dies

Merging can die if any of the necessary files (raw bolometer data, pointing files from telescope, encoder log files from rotator) are missing or if the raw data files are short.  Typical error messages are:

Error opening das directory
Error opening header directory
Error opening encoder directory
These imply that the given directory could not be found.  Since start_merge ensures the directories exist when it begins, this means that a directory has "vanished" in midstream.  This is usually because a cross-mounted disk from another computer has gone offline, most often because that computer has crashed.  For example, if you are merging on kilauea and allegro crashes, you will get these errors.  Consult the instructions above for dealing with a crashed computer.

Cannot open file XXXX, reached max number of tries
This means that for a given raw data file, no pointing log or encoder log file was found after waiting for some number of 30-second intervals.

Previous number, this number
This means that the raw data file minute number incremented by more than 1, which implies files were lost.
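
To see how many files were lost, it can help to list the jumps yourself.  Here is a minimal Python sketch of the check behind the "Previous number, this number" message, assuming you already have the minute numbers as a sorted list of integers.  How those numbers are encoded in the raw file names is not covered here, and the function names are invented for illustration.

```python
def find_minute_gaps(minute_numbers):
    """Return (previous, current) pairs where the number jumped by more than 1."""
    gaps = []
    for previous, current in zip(minute_numbers, minute_numbers[1:]):
        if current - previous > 1:
            gaps.append((previous, current))
    return gaps


def count_missing_minutes(minute_numbers):
    """Return how many minute files are missing across all gaps."""
    return sum(current - previous - 1
               for previous, current in find_minute_gaps(minute_numbers))
```

For the sequence 10, 11, 12, 15, 16, 20 this reports jumps from 12 to 15 and from 16 to 20, for a total of five missing minutes.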

Now about to crash!
File size is XXXX
File pointer position is YYYY
feof reports ZZZZ
Now crashing, satisfied...
Happily aborting with error
This error occurs when a raw data file is the wrong size.  Raw data files have an almost perfectly fixed length set by the number of sampled channels and the number of samples per minute.  This error will usually happen on the last file of the night because the DAS is usually stopped mid-minute.  That's fine.  You should worry when it occurs partway through the night.
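
Because the raw files are nearly all the same length, a short file stands out when you compare sizes.  The following Python sketch flags files whose size differs from the most common size by more than a tolerance.  It is an illustration only: the function name odd_sized_files is invented, and the right tolerance for "almost perfectly fixed" is something you would have to judge from real data.

```python
from collections import Counter


def odd_sized_files(sizes, tolerance=0):
    """Return names whose size is not within tolerance of the most common size.

    sizes is a dict mapping file name to size in bytes.
    """
    if not sizes:
        return []
    typical_size, _count = Counter(sizes.values()).most_common(1)[0]
    return sorted(name for name, size in sizes.items()
                  if abs(size - typical_size) > tolerance)
```

If the only name returned is the last file of the night, that matches the harmless case described above; a name from the middle of the night is the one to worry about.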

You should also worry if merging remains stuck in the wait loop for the next file.  New raw data files should appear every minute, so if merging stalls for much longer than that, it indicates the raw data files are not being generated or are not being copied to allegro.
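
One way to tell a stall from ordinary waiting is to look at the age of the newest file in the raw data directory.  Here is a minimal Python sketch of that idea.  The function names and the three-minute limit are assumptions chosen for illustration ("much longer than a minute" is all the text above commits to).

```python
import os
import time


def seconds_since_newest_file(directory, now=None):
    """Return the age in seconds of the newest entry, or None if empty."""
    mtimes = [os.path.getmtime(os.path.join(directory, name))
              for name in os.listdir(directory)]
    if not mtimes:
        return None
    if now is None:
        now = time.time()
    return now - max(mtimes)


def looks_stalled(directory, limit_seconds=180.0, now=None):
    """Return True if nothing in directory has changed for limit_seconds."""
    age = seconds_since_newest_file(directory, now)
    return age is not None and age > limit_seconds
```

Running looks_stalled on both the andante share and /data00/rawdir/YYYYMMDD would hint at whether the DAS has stopped writing or dirsync has stopped copying.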

For the various cases, do the following:

Data Processing

This section describes how to restart the auto-analysis programs.  NOTE: For any instance where you are asked to delete files, be careful to always use the -i option so that you can confirm any deletes.  This should be the default on allegro and kilauea, but be sure about it before you delete anything.

For all cases

Processes that die involuntarily can leave partially written output files, especially cleaning.  Look for .lck files in your data directories (see the Analysis Software page for details on where these would be).  For any .lck files that exist, delete the .lck file and the associated data file.  For example, if the data directories start at  ~/data, then the command

> find ~/data -path '*.lck' -follow

will find all the .lck files.  Don't forget to include the single quotes.  Only delete the .lck files on kilauea; do not delete .lck files in the cross-mounted directories rawdir/, headerdir/, or encdir/.
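
If you prefer a listing that already leaves out the cross-mounted directories, the search can be sketched in Python.  This is an illustration rather than part of the analysis software: the function name find_lock_files is invented, the directories are skipped purely by name, and the sketch only lists files.  Deleting them (and the associated data files) is still up to you.

```python
import os

# Cross-mounted directories whose .lck files must be left alone.
SKIP_DIRS = {"rawdir", "headerdir", "encdir"}


def find_lock_files(root):
    """Return sorted paths of .lck files under root, skipping SKIP_DIRS."""
    found = []
    # followlinks=True mirrors find's -follow; beware of symlink loops.
    for dirpath, dirnames, filenames in os.walk(root, followlinks=True):
        # Prune in place so os.walk does not descend into skipped directories.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            if name.endswith(".lck"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Calling find_lock_files(os.path.expanduser("~/data")) on kilauea gives the same candidates as the find command, minus anything under rawdir/, headerdir/, or encdir/.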

Xterms die

This can happen because you accidentally killed them, because kilauea's X server died, because kilauea crashed, etc.  You can restart the xterm(s) and the routine(s) running in them as follows.  If all your xterms died, you should still use this by-hand method because start_autos does not supply the necessary obsnum_start argument to run_auto_slice_files.

Xterm(s) remain alive but IDL quits

(very unlikely) Even though the xterm hasn't died, you will need to kill the offending xterm(s) and follow the above instructions for starting new xterms.  The reasons are technical; you can ask the Bolocam support person if you really care.

Xterm(s) remain alive but an IDL routine crashes

You will need to restart the IDL code by hand.  Various scenarios are described below.  Be sure to ALWAYS type retall at the IDL prompt before attempting to restart the IDL code; this brings you back to the main IDL program level and prevents unpredictable behavior that may arise from restarting the code from inside a routine that has crashed.  No ill effects arise from typing retall when it isn't necessary, so go crazy!  Do not use the .full_reset_session or .reset_session executive commands; these will erase assorted variables that were initialized at startup and are needed for some of the code to run properly. 

IDL cleaning code does not clean observations of a particular source

You probably have forgotten to add your source to the appropriate source list files.  See the Analysis Software page for instructions on making the cleaning pipeline aware of new sources.  The pipeline won't be aware of these changes, though, until you restart it.  You can do this in one of two ways:
  1. If you still have all the pipeline windows up, just hit q in all of them except the slice_files window to stop the ongoing processes.  If q does not work, try Ctrl-c then type retall to get back to the MAIN level in IDL.  Then, in each window except slice_files, type the appropriate one of the following (refer to the xterm title)

    Don't forget the @ sign!  This procedure is similar to what is done above for when all the Xterms die.

  2. If you need to restart everything because the windows are gone, do the usual start_autos from the shell command line, but then hit Ctrl-c in the slice_files window as soon as you can so you don't reslice all the data for that day (which will then cause it all to be reprocessed).
The reason the above works is that, as long as you don't reslice the files, the analysis routines realize that only the unprocessed observations need to be analyzed -- the revision dates on the processed observations' files tell the pipeline they are done.  If you reslice the files, though, then the sliced files get new revision dates and the pipeline thinks all the downstream files are out of date and need to be regenerated.
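
The revision-date logic is the same idea that make uses, and a small example may make it concrete.  The Python sketch below is only an illustration of that rule; the real pipeline is IDL code, and its exact test is not reproduced here.  The function name needs_processing is invented.

```python
import os


def needs_processing(input_path, output_path):
    """Return True if the output is missing or older than its input."""
    if not os.path.exists(output_path):
        return True
    return os.path.getmtime(output_path) < os.path.getmtime(input_path)
```

Under this rule, a cleaned file written after its sliced file is left alone, while reslicing gives the sliced file a newer date and makes every downstream file look out of date.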


General Computer Troubleshooting

Computers are built to fail, one might say.  Here are some problems you might run into and how to deal with them, working from the front-end to the back-end.  If you run into a problem that prevents you from taking data and can't solve it with the following information, call the CSO pager.  The on-call staff member will either be able to help you or to get the necessary person in touch with you.

andante (the DAS/fridge computer)

andante has had a troubled history that seems to dog it no matter what computer we call andante.  We have had to reinstall the system more times than we would like.  Hence, we have become quite expert at it.  Here's how to deal if andante starts acting up.

If you start to see crashing of either LabView, disk cross-mounting to allegro, or the entire system itself, and the problems are not obviously attributable to a specific cause, the likely problem is that something bad has happened to Windows.  Don't fight it!  Your first course of action is to switch over to the image disk.  When we have andante in a happily working state, we make a byte-by-byte image of it onto a second, identical disk.  That disk is then powered down and left sitting in andante.  To switch to the image disk, do the following:
  1. First, find the target of the desktop shortcuts for BCAM_DAS and fridge_cycle.  Copy these programs off andante, as updates may have been done since the last time the image disk was made.  If you have network access, you can use SSH (shortcut on the desktop) to copy the programs to any other computer; allegro or kilauea are good choices since they are on the local network.  If not, you can probably copy the programs off to a floppy disk.  Make sure you copy the programs themselves, not just the shortcuts!  You can find the targets of the shortcuts by right-clicking on the shortcut and selecting Properties.

  2. Second, shut down andante and open it up.  It is nontrivial to open the computer up due to the way the cover locks.  See the instructions.  Find the current system drive and the image drive (both are IDE drives) -- they are probably sitting right next to each other.  The image drive will likely have no power and IDE cables connected.  Simply switch the power and IDE cables over to the image drive and try booting.  Close up the computer if you are able to boot properly.

  3. Use SSH to recopy the DAS and fridge cycle programs down to the image drive.  Make sure to put them in the right folders (find the targets of the desktop shortcuts) and to redefine the shortcuts as necessary.  Do this even if the originals have the same names (e.g., BCAM_DAS_20040225) -- there might be minor updates that did not warrant a new name but need to be propagated.

  4. If you have had to switch to the image disk, inform the Bolocam support person so that we can recover the original system disk at the next chance, turning it into the image disk.
You may have gotten into the much worse situation where you actually need to reinstall Windows from scratch and you can't just image the working drive.  This will overwrite much of the configuration information, so it takes some work to get back to a properly working state.  If you have to do this, follow these instructions.  You will be frequently prompted for reboots; go ahead and reboot as necessary.  Log in as bolocam whenever possible.
  1. The CDs you will need are in the Bolocam file cabinet in the computer room.  You will find Windows XP Professional Service Pack 2, Partition Magic 8.0, and LabView 7.1.

  2. Some software must be downloaded from Caltech's site-licensed software site.  You need a Caltech ITS account for this.  If you don't have one, contact the Bolocam support person.

  3. First, power down the computer and remove all the National Instruments cards.  See the instructions for opening the computer.  Remove the PCI-6031E, PCI-6034E, and PCI-GPIB+ cards from the computer.  Note which slots they were initially in so you can return them to the right places, and be careful about static electricity.

  4. Make sure the computer is connected to the web.

  5. Install a fresh version of Windows XP PRO - SP2.  (In the options, choose to format the hard drive and install Windows XP).

  6. Log on as administrator and make sure to create a password if you weren't prompted to do so during the installation of Windows (use same password as noted in the white Bolocam binder).

  7. Run Windows Update until the Windows installation is fully up to date, with all security patches.

  8. Create a new user account bolocam with full administrator rights with the same password as written in the Bolocam white binder.

  9. Log out of the administrator account and log in as bolocam.

  10. If you are using the Dell Precision 420 as andante, get the video card driver.  The video card is a Matrox G400 (http://www.matrox.com).  After rebooting, you can run the resolution up to something sensible (1200 x 768).  You might need to change the frequency to 75 Hz.  These latter settings are accessible by right-clicking in an open space on the Windows Desktop, which will open the display settings.  Click on the Settings tab.  To find the frequency setting, click on Advanced and then the Monitor tab.

  11. Download and install VPN-3000 Virtual Private Network client software from the Caltech ITS site:
  12. Run VPN-3000 to obtain a Caltech virtual IP address and install Caltech site-licensed software from http://software.caltech.edu:
  13. You may disconnect VPN-3000 at this point.

  14. Install Partition Magic 8.0 from CD.

  15. Install the FULL version of LabView 7.1 from CD.  Note that the FULL installation includes the very useful MAX (Measurement & Automation Explorer).

  16. Shut down and install all the National Instruments cards, being sure to put them back in the same slots you removed them from.  Again, take precautions against static electricity.

  17. Plug the GPIB connector and the two ADC cables in (one ADC cable comes from the SCXI chassis and connects to the upper PCI card, the other comes from the white thermometry breakout box and connects to the middle PCI card.  The connectors have different form factors so there should be no confusion).  Restart the computer and log in as bolocam.

  18. Fire up MAX (there should be a shortcut on the desktop labeled Measurement and Automation Explorer).  You should see:

    My System

         Devices & Interfaces
              Traditional NI-DAQ Devices
              GPIB


    To see the fridge power supplies (the Tektronix PS2520G modules), right-click on GPIB and click Scan for Instruments.  Two GPIB devices should come up.  (You may need to left-click on GPIB to open the tree up further.) 

    NEED TO UPDATE THE FOLLOWING WHILE HAVING ACCESS TO PC.
    To see the MUX chassis, right click on Traditional NI-DAQ Devices and choose Add SCXI Chassis and pick SCXI-1001.  Right-click on the SCXI-1001 entry, select Properties, and make sure Chassis ID is set to 1 and Chassis Address to 0. 

    Click on SCXI-1001 and you should see 12 SCXI-1100 modules appear in the right window.  Right-click on the first one and select Properties.  Under the General tab, go to Connected to: and select the PCI-6034E card.  Also click the This device will control the chassis checkbox.  Leave the defaults in the other tabs.  For the other SCXI-1100 modules, open their Properties windows and make sure that the Connected to: box says None.  The This device will control the chassis checkbox will be grayed out.

  19. Correct the device numbers in the BCAM_DAS and fridge_cycle LabView programs.  Open fridge_cycle and look for the PCI-6031E Device Number control to the right on the front panel (probably off the screen).  If the default value does not point to the PCI-6031E card (to the device number indicated in MAX), then change the value.  To save the new value as the default, change to edit mode (Operate -> Change to Edit Mode), right-click on the control and select Data Operations -> Make Current Value Default.  Then switch back to run mode (Operate -> Change to Run Mode) and save the program.  Similarly, open up the BCAM_DAS program and look for the PCI-6034E Device Number control at the top of its front panel and repeat the above for this program.  It may be that one or both of these programs complains of missing vi's on startup; they are probably in one of the llb files in Program Files/National Instruments, just dig until you find them, they will be there.

  20. Set the IP address to be andante's static address.  Click on Start -> Connect to -> Show all Connections and then select Local Area Connection and click on Properties.  In the Components window, select TCP/IP or possibly Internet Protocol (TCP/IP) and then click on Properties.  Check the Use the following IP address: radio button and type in the following:

    IP address:  128.171.86.211
    subnet mask: 255.255.255.0
    gateway:     128.171.86.2

    Also check the Use the following DNS server addresses: and type in the following:

    Preferred DNS Server: 128.171.3.13

    Have it change the IP address immediately (i.e., don't wait to reboot).

  21. Set up the computer to do network time synchronization.  Double-click on the clock in the lower-right corner of the desktop.  The Date and Time Properties dialog box will come up.  Click on the Time Zone tab and make sure the clock is set to the GMT time zone.  Click on the Internet Time tab.  Enable automatic time synchronization with hau.submm.caltech.edu.  Click the Update Now button.

  22. Set up disks properly:
    1. Using Partition Magic, split the master drive into two partitions, C:\ (~21 GB) and D:\ (~17 GB, call it DATA).  Follow the instructions.  You'll be prompted to reboot the system at the end.
    2. Create the following directories in D:\
           D:\das_data
           D:\fridge_data
           D:\lab_tests
    3. Make D:\das_data remotely accessible so data can be transferred to allegro:
      • Use Windows Explorer to browse to D:\
      • Right-click on the das_data directory and select Properties.
      • Click on the Sharing tab.
      • In the Network Sharing and Security box, enable Share this folder on the network and give it the name das_data.  Make sure Allow network users to change my files remains disabled.

  23. Turn on the network firewall:

  24. Turn on the Remote Desktop server to allow remote users to use this computer:

  25. Enable automatic Windows Updates:
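
As a quick sanity check on the static address settings in step 20, the short Python sketch below confirms that the gateway lies on the subnet implied by the IP address and subnet mask.  It uses only the standard ipaddress module.  The function name check_static_config and the constant names are illustrative and not part of any Bolocam software, and the sketch assumes a Python 3 interpreter is available on whatever machine you run it from.

```python
import ipaddress

# Values copied from step 20; the constant names are illustrative only.
ANDANTE_IP = "128.171.86.211"
SUBNET_MASK = "255.255.255.0"
GATEWAY = "128.171.86.2"
DNS_SERVER = "128.171.3.13"


def check_static_config(ip, mask, gateway):
    """Return the network implied by ip/mask.

    Raises ValueError if any address is malformed, if the gateway is not
    on that network, or if the gateway equals the host address.
    """
    iface = ipaddress.ip_interface("{}/{}".format(ip, mask))
    gw = ipaddress.ip_address(gateway)
    if gw not in iface.network:
        raise ValueError("gateway {} is not on network {}".format(gw, iface.network))
    if gw == iface.ip:
        raise ValueError("gateway and host address are identical")
    return iface.network


if __name__ == "__main__":
    network = check_static_config(ANDANTE_IP, SUBNET_MASK, GATEWAY)
    # Parsing the DNS address only checks that it is well formed.
    ipaddress.ip_address(DNS_SERVER)
    print("network:", network)
```

With the values from step 20 this prints the network 128.171.86.0/24.  A typo in any field either raises an error or shows up as a different network.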

allegro

allegro has been remarkably stable.  If it crashes, instructions for bringing it back up and cross-mounting disks are given above.  If the system itself seems to be going belly-up -- e.g., frequent crashing, unexpected behavior -- you can switch to the image disk.  As with andante, this disk is basically a byte-by-byte image of the system and /home disk.  Your data will be unaffected by this switch.  To do this:
  1. If possible, unmount /data00 from kilauea by logging in to kilauea as root (password in the white Bolocam Manual binder) and typing umount /data00.  If you get a device busy error, you may have to ask users to log out if their sessions happen to be sitting in one of the /data00 directories (unlikely).  If you can't get the disk to unmount, skip to the next step.

  2. If possible, copy the directory /home/observer/src to another computer (e.g., kilauea) so that the code on the image disk can be updated if necessary.  To do this:


  3. Shut down allegro: log in as root (password in the white Bolocam Manual binder), then type shutdown -h now.  The computer will shut down and power off.  Flip the power switch on the back of the computer to the off position also.

  4. Open up allegro.  You will see a 20 GB drive connected to the primary IDE port on the motherboard -- this is the current system disk.  (Follow the cables back to the motherboard and you will see the connectors labeled on the board.)  Somewhere else inside the computer there will be another 20 GB disk without a power cable attached -- this is the image disk.  Switch the IDE and power cables from the original system disk to the image disk.  You may have to move disks around in order to be able to connect the IDE cable to the image disk.

  5. Turn the rear power switch back on and then press the front panel power button to boot the computer from the image disk.

  6. Update the src directory as follows:

  7. Follow the remaining directions above for bringing allegro back up (cross-mounting disks).
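
For step 1 above, it can help to confirm whether /data00 is actually still mounted on kilauea before and after running umount.  The following Python sketch checks a mount table for a given mount point.  It assumes kilauea exposes the standard Linux /proc/mounts file and has a Python 3 interpreter; is_mounted is a hypothetical helper, not an existing Bolocam script.

```python
def is_mounted(mount_point, mounts_text):
    """Return True if mount_point is a mount target in /proc/mounts-style text."""
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each line reads: device, mount point, filesystem type, options, dump, pass.
        if len(fields) >= 2 and fields[1] == mount_point:
            return True
    return False


if __name__ == "__main__":
    # Assumption: running on a Linux machine that provides /proc/mounts.
    with open("/proc/mounts") as mounts_file:
        table = mounts_file.read()
    state = "still mounted" if is_mounted("/data00", table) else "not mounted"
    print("/data00 is " + state)
```

The check compares the mount point exactly, so /data00 will not be confused with a similarly named directory mounted elsewhere.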

kilauea

kilauea is managed by Ruisheng Peng.  If it starts having problems, let him and the Bolocam support person know.  If kilauea just dies completely and won't reboot, you don't have much recourse until you get in touch with Ruisheng.  However, kilauea is not critical to taking data.  All the normal data-taking processes will continue to run if kilauea dies.  You can log in to puuoo as bolocam (same password as on kilauea) and from there restart the xterms that monitor rotator, dirsync.py, header_copy.py, and write_log, and also restart gbolostrip following the directions given above as if you were doing it on kilauea.  You can also do this from any other computer with an X server; feel free to use your laptop, or use Reflection X on pika, the PC in the main computer room.

Problems Accessing Data Disks or Dying Data Disks

The summit is not a friendly place for hard drives, especially data drives that get heavily exercised.  We keep spare 120 GB data drives ready for when a data drive on one of the Linux machines dies.  Symptoms are I/O errors from the processes that read or write data on the affected disk, or simply the directories on a drive not appearing.
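
One way to look for both symptoms at once is to try listing each data directory and report any error or an unexpectedly empty listing.  The Python sketch below does this.  It is only an illustration: probe_directory is a hypothetical helper, /data00 is the only data path named in this manual (add others as appropriate), and a Python 3 interpreter is assumed.

```python
import os


def probe_directory(path):
    """Try to list path; return (ok, detail).

    ok is False when the listing raises an OSError, such as an I/O error
    or a missing directory.
    """
    try:
        entries = os.listdir(path)
    except OSError as err:
        return False, "{}: {}".format(path, err.strerror)
    return True, "{}: {} entries".format(path, len(entries))


if __name__ == "__main__":
    # Assumption: extend this list with the other data directories on the machine.
    for data_path in ["/data00"]:
        ok, detail = probe_directory(data_path)
        print(("OK   " if ok else "FAIL ") + detail)
```

A FAIL line points to an error reading the disk, while an OK line reporting 0 entries for a directory that should hold data suggests the drive is not mounted or its directories have disappeared.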

allegro: For allegro, we have a spare data drive sitting in the computer ready to go.  Just switch to this drive.  Even if it turns out that the original drive had not fully failed, switching drives will minimize downtime.  Further investigation can be done during the daytime.  To switch drives, do as follows:

kilauea: This machine uses a SCSI RAID for almost every disk, so they should be pretty robust.  Responding to various disk failure modes:

Dying System Disks

System disks can also die.  To make it possible to recover quickly from such a problem, we have created image drives for the andante and allegro system drives.  Each image drive is left powered off and disconnected inside its computer.

Note that, on andante, the system and data drives are different partitions of the same disk, and so if one begins to fail, so does the other.

The image drives nominally have the same version of the Bolocam-specific software as the original drive, but they could be slightly out of date.  To be sure to get the most recent software, do the following first:

To switch over to the image drive, do the following (same for both computers):

Once you have switched to the image drive, you can update the software as follows:

Revision History


Questions or comments?  Contact the Bolocam support person.