Beeper Duty Overview


Basics

Who is number 1?

The users. [5/21/95 dl2n :]

The practical implication is that you need to consider the impact of the things you do (and the things you don't do) on the people who use the system. Whenever you don't promptly answer a call, or fail to promptly restore a service, someone's work is being impacted. There are trade-offs involved. There will be times when you'll have the opportunity to track a problem or restore service, but not both simultaneously. If you need guidance, please ask, but you should be able to use your better judgement on this sort of thing.

When carrying a beeper, you will be expected to be within range of getting paged (it's a ground-based system, making range about 50 miles around Pittsburgh) and situated such that you can service problems that occur. Occasionally a problem will require a trip to campus, but it's comparatively rare. Sometimes, hardware fails, and you may need to have a disk replaced and restore its contents from amanda or stage (the unix and AFS backup systems). Typically this will not be the case.

You will also be expected to read org.acs.asg.coverage and, when you fix something (whether or not you have a beeper), to post about it there, including what you did to resolve it. Use a subject line which is likely to be meaningful later to someone looking back for the problem.


AFS database service

Using BOS

When running a "bos" command you must either be authenticated as an admin user, or logged into a fileserver or database server. If you are logged into a server you must add the -localauth option to the end of the command line in order not to need to authenticate as an admin user.

AFS database services, which include the ptserver and the vlserver, run on vice2, vice7, vice11, vice12, and vice28. The ptserver maps Kerberos principals (usernames) to AFS ID numbers, and also maintains groups of AFS IDs. The vlserver keeps track of where all copies of all volumes in a cell live. For example, user shadow's home directory is in a volume named user.shadow, and using the command vos examine user.shadow, I might find out that the volume is on the fileserver on vice1, on partition vicepe.

    % vos examine user.shadow
    user.shadow                      1970660716 RW      17291 K  On-line
        VICE1.FS.ANDREW.CMU.EDU /vicepe
        RWrite 1970660716 ROnly          0 Backup 1970660718
        MaxQuota      20000 K
        Creation    Wed Aug  7 16:15:46 1991
        Last Update Mon Sep  9 13:36:41 1996
        3231 accesses in the past day (i.e., vnode references)

        RWrite: 1970660716    Backup: 1970660718
        number of sites -> 1
           server VICE1.FS.ANDREW.CMU.EDU partition /vicepe RW Site

The problems you will see on these machines are crashes of one or more of these processes, crashes of the machine, malfunctions of one of the servers (generally the kaserver), or meltdowns of one of the servers (generally the ptserver). When a server process crashes, check the logs in /usr/afs/logs on the machine on which it is running to see if you can get any clues as to what caused it. You can use the command bos status (machinename) (server process name) -long to check whether a server process has crashed and left a core. The core will be in /usr/afs/logs, named core.(processname), for instance, core.snmpd.

    % bos status vice28.fs kadmind -long
    Bosserver reports inappropriate access on server directories
    Instance kadmind, (type is simple) has core file, currently running normally.
        Process last started at Sun Sep 20 04:00:09 1998 (1 proc starts)
        Command 1 is '/usr/afs/bin/kadmind -r ANDREW.CMU.EDU -smitv4'

The core file is useful for the AFS maintainer. If you are knowledgeable or feeling ambitious you may wish to copy the core and the server binary to another machine of the same system type and attempt to debug it; if you do, please encrypt the core, as it may include sensitive data such as our AFS cell key. You can use /usr/local/bin/des on the fileserver and ~shadow/bin/des on Andrew Solaris machines for encrypting the core. Generally you should watch to see that the server process comes back up and begins answering requests again. If it does not, and you do not know how to proceed, you should contact the System Manager or the Beeper Coordinator.
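The core-file check described above can be scripted when several instances need checking. The following is a minimal sketch, not an established tool: the function name instances_with_core is made up for illustration, and it assumes the "Instance NAME, (type is ...) has core file" wording shown in the sample bos output.

```shell
#!/bin/sh
# Sketch: list bos instances that report a core file.
# Reads the text of `bos status <machine> -long` on standard input.
# ASSUMPTION: the "Instance NAME, ... has core file" wording matches
# what your bosserver prints; instances_with_core is a hypothetical name.
instances_with_core() {
    awk '/^[ \t]*Instance / && /has core file/ {
        name = $2
        sub(/,$/, "", name)   # strip the comma that follows the instance name
        print name
    }'
}

# Example (needs admin tokens, or -localauth when run on a server):
#   bos status vice28.fs -long | instances_with_core
```

Any instance it prints has a core.(processname) file waiting in /usr/afs/logs on that machine.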

If the whole machine crashes, you should see if there are any messages on the console. If the machine is at an "ok" prompt, type reset, then if you get another "ok", type boot. In any case, watch the machine come up, then log in and check the system logs, in /var/log/syslog and /var/adm/messages, to see if you can ascertain the cause of the crash. Watch the logs from the servers in /usr/afs/logs to make sure everything comes back up nicely.

Occasionally we have been known to experience slowdowns in authentication service. While none of the servers are down, the sync site, the site which coordinates changes to the database, does not recognize one of the other servers. You can use the command: udebug (host name) 7004 to check for this.

    % udebug vice2.fs 7004
    Host's 128.2.10.2 time is Thu Feb 27 16:23:30 1997
    Local time is Thu Feb 27 16:23:29 1997 (time differential -1 secs)
    Last yes vote for 128.2.10.2 was 3 secs ago (sync site);
    Last vote started 3 secs ago (at Thu Feb 27 16:23:26 1997)
    Local db version is 856688536.400
    I am sync site until 57 secs from now (at Thu Feb 27 16:24:26 1997) (5 servers)
    Recovery state 1f
    Sync site's db version is 856688536.400
    0 locked pages, 0 of them for write
    Last time a new db version was labelled was:
             390073 secs ago (at Sun Feb 23 04:02:16 1997)

    Server 128.2.10.28: (db 856688536.400)
        last vote rcvd 3 secs ago (at Thu Feb 27 16:23:26 1997),
        last beacon sent 3 secs ago (at Thu Feb 27 16:23:26 1997), last vote was yes
        dbcurrent=1, up=1 beaconSince=1

    Server 128.2.10.12: (db 856688536.400)
        last vote rcvd 3 secs ago (at Thu Feb 27 16:23:26 1997),
        last beacon sent 3 secs ago (at Thu Feb 27 16:23:26 1997), last vote was yes
        dbcurrent=1, up=1 beaconSince=1

    Server 128.2.10.11: (db 856688536.400)
        last vote rcvd 3 secs ago (at Thu Feb 27 16:23:26 1997),
        last beacon sent 3 secs ago (at Thu Feb 27 16:23:26 1997), last vote was yes
        dbcurrent=1, up=1 beaconSince=1

    Server 128.2.10.7: (db 856688536.400)
        last vote rcvd 3 secs ago (at Thu Feb 27 16:23:26 1997),
        last beacon sent 3 secs ago (at Thu Feb 27 16:23:26 1997), last vote was yes
        dbcurrent=1, up=1 beaconSince=1

    % udebug vice7.fs 7004
    Host's 128.2.10.7 time is Thu Feb 27 16:24:27 1997
    Local time is Thu Feb 27 16:24:26 1997 (time differential -1 secs)
    Last yes vote for 128.2.10.2 was -1 secs ago (sync site);
    Last vote started -1 secs ago (at Thu Feb 27 16:24:27 1997)
    Local db version is 856688536.400
    I am not sync site
    Lowest host 128.2.10.2 was set -1 secs ago
    Sync host 128.2.10.2 was set -1 secs ago
    Sync site's db version is 856688536.400
    0 locked pages, 0 of them for write

Try one of the five machines. It will either tell you which machine is the sync site, or will give you information which includes a string like "I am sync site until 60 seconds from now". In the output from the sync site, look for the dbcurrent flag on one of the servers to be 0, and/or a "last vote was no". These are clues of a problem. In the event of a "last vote was no", wait 90 seconds to see if it happens again. In the event this seems to be the problem, if you believe it to be critical to restore service, you may proceed by running the command bos restart <host name of sync site> kaserver. You do of course need to know which host is currently the sync site, as explained above.
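Scanning the sync site's udebug output by eye is error-prone when you are in a hurry. As a rough sketch, the two warning signs named above (dbcurrent=0 and "last vote was no") can be picked out with awk. The function name ubik_problems is hypothetical, and the sketch assumes each peer is introduced by a line starting with "Server", as in the sample output.

```shell
#!/bin/sh
# Sketch: report peers that the sync site considers out of date or that
# voted "no". Reads `udebug <sync site> 7004` output on standard input.
# ASSUMPTION: each peer block starts with a "Server <address>:" line;
# ubik_problems is a hypothetical helper name.
ubik_problems() {
    awk '
        /^[ \t]*Server / {
            server = $2
            gsub(/[():]/, "", server)   # tolerate "(addr):" as well as "addr:"
        }
        server != "" && /last vote was no/ { print server ": last vote was no" }
        server != "" && /dbcurrent=0/      { print server ": dbcurrent=0" }
    '
}

# Example: udebug vice2.fs 7004 | ubik_problems
```

No output means neither symptom was seen. A single "last vote was no" should still be rechecked after 90 seconds before restarting anything.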

The other condition which has been known to exist is called a meltdown. It generally happens to the ptserver, although this has not been a problem as of late. The ptserver has a set number of "threads" in it taking care of incoming requests. If for some reason it gets bogged down and is unable to finish transactions it is possible for all the threads to get tied up and the number of waiting requests to skyrocket. You can use afsmon to check for a high number of wait procs on the ptservers by running afsmon, then click the Wait_Proc button, then start.

Any servers which are bogged down can be determined as above and restarted using the command bos restart <host name> ptserver. Note though that again unless you are confident it is probably a good idea to contact the System Manager or the Beeper Coordinator before doing this. As above, watch for things to come up ok, and also watch for the waiting processes to stabilize, as the problem is likely to recur.

Only the meltdown problem is truly time-critical. A failure of any single server process on any one machine, or of any one machine, should not cause loss of service, both because of the manner in which AFS deals with databases and because we have 5 servers (or at any rate more than 1).

The AFS database servers also run the kerberos key distribution server (kdc) and admin servers. The kdc is run by the bosserver, just like the native afs database servers. Unlike the afs database services, kerberos uses a dedicated-master model, where one of the servers is assigned a role that allows it to update the database (add users, change passwords, etc). The master distributes database changes to the slave servers.

The master server (vice28) runs the following additional services:

The slave servers run the following additional service:

Some ipropd-slave failures require restarting ipropd-master as well. If ipropd-slave has died on a db server and won't start, try restarting ipropd-master on vice28.

AFS fileservers

We have a number of fileservers. These are all of the vice machines not included above. Actually, the database servers run file servers, but user data should never be stored on those machines. Instead, the space exists for testing and for the Operators to restore AFS volumes onto. A current list of fileservers and the specific functions they fill can always be found in /afs/andrew.cmu.edu/acs/asg/fileservers. Most file servers are now equipped with RAID arrays for the data partitions. On these machines the "a" partition is typically empty or filled with test data. See the "fileservers" file.

Fileservers perform one of three functions: user servers, development (dev) servers, and replication (rep) servers.

User servers contain user home directories and, more generally, any data which is not system software and is not replicated. When one of these machines or the fileserver process on it crashes, the data on it becomes entirely unavailable; bear that in mind when you are deciding how, and how quickly, to respond.

Dev servers hold the writable copies of system software. When one of these machines or the fileserver process on it crashes, only data which is not replicated becomes unavailable. This will generally, but not always, affect only "beta" linked machines, whose owners have volunteered to run software which has not been extensively tested. Dev server crashes are one of the pitfalls of beta linked machines. Little or no development can be done while one of these machines is down, but it is not nearly as critical as a user server.

Rep servers provide read-only copies of various system software. Generally there is at least one read-only copy of any package; for some system types there are more. It is not safe to assume there is more than one replica, so a rep server crash is more critical than a dev server crash: if only one read-only copy of an AFS volume exists, AFS cannot fail over transparently as it does with multiple read-only copies.

You can use the command bos status (host name) fs -long to check if a fileserver is still running or has crashed.

    % bos status vice20.fs fs -long
    Bosserver reports inappropriate access on server directories
    Instance fs, (type is fs) has core file, currently running normally.
        Auxiliary status is: file server running.
        Process last started at Sun Feb 23 04:01:27 1997 (2 proc starts)
        Command 1 is '/usr/afs/bin/fileserver -L'
        Command 2 is '/usr/afs/bin/volserver -p 16 -log'
        Command 3 is '/usr/afs/bin/salvager -parallel 5'

There may be cases where the server process is functioning normally, and hence shows "up", but not all partitions are online. You should check the FileLog on the server to make sure all the partitions were attached successfully. If not, make sure the partition(s) in question are functioning (powered up, SCSI cable attached, visible to the system, fsckable, mountable and mounted) and then try a fileserver restart: bos restart viceN.fs.andrew.cmu.edu fs. If that fails, you can reboot the server. If that fails, contact the System Manager or the Beeper Coordinator.

The same caveat as with the database servers applies if there is a core file. As with database servers, if either the file server process or the machine crashes, please look at the system logs to attempt to find out why. Also watch the FileLog and SalvageLog in /usr/afs/logs. Specifically, watch the FileLog for volumes which could not be attached, and the SalvageLog for files which were deleted from volumes. Files named .gopherrc and .ircmotd often get deleted; this is a function of the applications which use them, and is nothing to worry about. If other files are deleted you should notify the System Manager or the Beeper Coordinator. To find out about unattached volumes you may use the command vos listvol (host name) (partition letter) > /tmp/somefile. Then look through the output for an unattached volume, and attempt to bring it back online as below.

To attempt to bring a volume back online, do:

    bos salvage <server> <partition> <volume>

Be sure to include all the arguments in the correct order. If you get the arguments wrong, bos will think you want to salvage either the entire partition or the entire machine, both of which require bringing the fileserver process down. Sometimes the salvage fails to bring the volume back online. In that case, do: vos dump (volume) -t 0 > /dev/null (if you forget the "> /dev/null" and ^C the process, see below).
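Looking through a saved vos listvol listing for unattached volumes, as described above, can be done with a small filter. This is only a sketch: the marker strings it searches for ("Could not attach volume" and "Off-line") are an assumption about what your version of vos prints, and problem_volumes is a made-up name. Check a real listing before relying on it.

```shell
#!/bin/sh
# Sketch: flag volumes that are unattached or off-line in saved
# `vos listvol <host> <partition>` output read from standard input.
# ASSUMPTION: unattached volumes appear as
#   "**** Could not attach volume <id> ****"
# and off-line volumes as an ordinary row ending in "Off-line".
# problem_volumes is a hypothetical helper name.
problem_volumes() {
    awk '
        /Could not attach volume/ { print "unattached " $6; next }
        /Off-line/                { print "offline " $1 }
    '
}

# Example: problem_volumes < /tmp/somefile
```

Each volume it reports is a candidate for the bos salvage command above, with all three arguments supplied.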

If a volume is busy, isn't being backed up or moved, and won't unlock, you can either wait for the timeout, or kill -15 the volserver process on the fileserver with that volume by logging in and sending a signal to the process. If you don't know how to do this, it's probably better to wait.

Sometimes problems will be caused by disk errors. You can check in the system log files, as detailed for database servers, or check the output of the command "dmesg". This is an excellent thing to check for if you see a crash but don't have a core file or anything in the logs. Fileservers very rarely go "poof". If you have a disk error, please contact the System Manager or the Beeper Coordinator before proceeding, as a disk will likely need to be copied before it gets any worse, and user data will need to be restored. You can find a summary of log messages on the bboard org.acs.asg.log.syslog; emergency reports are mailed when received, otherwise there is a weekly summary.

Occasionally a fileserver will appear to hang, when in reality a client, or a few clients, are sending a heavy stream of requests which the file server cannot keep up with. (XXX How do you do detection for this Walter? My way is evil) It may be necessary to reboot a machine, or to contact DataComm and have a machine disconnected from the network.

A file server may also fill up. If this happens, run vos listvol <host name> <partition name> > /tmp/somefile to see which volumes are on the full partition. Try doing a vos backup <volume> on large AMS volumes, as this will often bring the usage down a lot. If that's not enough, check the file "/afs/andrew/acs/asg/fileservers" to see what type of server it is. Do vos partinfo <server> on servers of the same type to find the one with the most free space; if there's a partition on the same server with enough spare room, use that, since it's more efficient. Look for a volume of about the right size to reduce the percent full below the threshold. Do a vos move <volume> <fromserver> <frompart> <toserver> <topart> -verbose. You may also check ~shadow/scripts/Scout for a quick terminal display of the servers. sentinel also provides a summary of disk usage so you don't have to do a partinfo on every server. Generally, the volume balancer, which runs Friday nights on the machine flying.andrew.cmu.edu, keeps partitions at roughly the same usage, so this is not often necessary.
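Comparing vos partinfo output across partitions by hand is tedious. The sketch below ranks partitions by free space and shows the percentage used. It assumes partinfo lines of the form "Free space on partition /vicepa: N K blocks out of total M"; the function name rank_partitions is hypothetical.

```shell
#!/bin/sh
# Sketch: rank partitions by free space, most free first.
# Reads `vos partinfo <server>` output on standard input and prints
# "<partition> <free K> <percent used>" per line.
# ASSUMPTION: lines look like
#   "Free space on partition /vicepa: 3412288 K blocks out of total 8813312"
# rank_partitions is a hypothetical helper name.
rank_partitions() {
    awk '/^Free space on partition/ {
        part = $5
        sub(/:$/, "", part)
        free = $6
        total = $12
        used = (total > 0) ? 100 * (total - free) / total : 0
        printf "%s %d %.0f\n", part, free, used
    }' | sort -k2,2nr
}

# Example: vos partinfo vice1 | rank_partitions
```

The first line of output is the best destination candidate for a vos move, provided the server is of the same type.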

TODO LIST for this section


Amanda workstation backup system

Amanda is the Advanced Maryland Automatic Network Disk Archiver, which runs on the machines amanda, amanda2 and amanda3, and backs up machines listed in /usr/sbin/amanda/normal/amanda.disklist. It runs a special client which runs dump on the remote machine and pipes the output over the network to the amanda server.

You may be beeped because the amanda spool disk is full, because amanda couldn't read its configuration files, or because amanda is confused about its tapes. Amanda's files live in /usr/sbin/amanda/normal. Binaries live in /usr/amanda on clients. Helper programs can be found in /usr/local/bin. The disklist is copied from /afs/andrew/data/db/amanda/normal/amanda.disklist when that file changes. If you need to add to it, the format is:

    hostname disk-device-without-preceding-/dev/ method

The method is generally krb-comp-encrypt, krb-comp-encrypt-root, krb-encrypt, or krb-encrypt-root. Root filesystem dumps generally have a lower priority, as less data changes there. comp and encrypt enable compression and encryption on the network, respectively. We only back up division machines with amanda, BTW.

You can find instructions on how to do a restore from Amanda via http://asg.web.cmu.edu/isam/howto/amanda.html. Man pages exist for all the amanda helper commands.

Generally you will not need to worry about amanda.

STAGE backup system

We use the stage backup system to back up AFS volumes. It predates the newer AFS backup system, but provides features still not found in that system.

Stage runs on the machines backup1 and backup2. Each machine backs up a different subset of volumes: backup1 backs up project volumes, system software, and other data; backup2 backs up user volumes.

Things you will generally be required to deal with include a full spool disk (you don't do anything; it will handle itself unless the condition persists over several days, in which case you should contact the System Manager or the Beeper Coordinator) and an errant tape changer (power-cycling the changer and/or manually hitting the eject button may help; if not, contact the System Manager, the Beeper Coordinator, or someone from CMG).

Occasionally, stage's database server (which runs on backup2) will experience problems. Assuming the tdbserver process is not running, and there are no clear errors in /usr/stage/log/tdbserver.log, the server may be restarted by running /usr/stage/bin/tdb_initrec, followed by /usr/stage/bin/tdbserver up.

Also, occasionally an operator will start to restore a volume and either fill a partition on a fileserver or abort the restore. This can leave part of a restored volume around, and subsequent restore attempts then fail. Once you have made sure that what you're deleting is the correct bad volume, you may use the command vos zap <server> <partition> <numeric volume id> or vos delentry <volume name> to remove a failed restore. If you have doubts, please contact the System Manager or the Beeper Coordinator.

At this point in time, stage again requires restores to be run on the machine that hosts the dumps (backup2 for recently dumped user volumes, backup1 for everything else). Use the "search", "restore" and "spoolout" commands to find the correct set of dumps, queue them to be read from tape (or disk), and write them out to a fileserver respectively. Typically, operators handle loading tapes for restores for you. If you are in a hurry and want to do it yourself, what you do depends on what tapes you end up needing.

If the data is on 8mm tapes (those named afs.<foo> or {sum,win,spr}.yynn for yy less than 99), the restore procedure will be started automatically by cron, and messages will appear on the server's console (and in zephyr messages to class backup, instance <first hostname component of backup server>) indicating what tape should be inserted in the drive. After each tape is used, it will be ejected, and a new console message will appear.

If the data is on DLT tapes, and a non-changer DLT drive is being used by stage, the same procedure should be used. (As of 11/20/02 this is always the case; there are no changers.)

If you are using a DLT changer, you must manually start the extraction process. Run /usr/stage/script/extractmgr. After you confirm which drive the restores are being run on, the tapes that are needed to complete the pending restores will be listed, and you will be prompted to load the changer with as many as possible. Once this is done, and you have pressed return, the system will run through all the loaded tapes and process them. If more tapes are needed after this, further prompting will occur.

If the dumps are on a RAID disk ("drum") instead of a tape, you need to run "drum fetch rest" after all the restores have been queued, and before you attempt to spool them out to a fileserver. (As of 11/20/02, most dumps are done this way; the exception is dumps in the "archive" and "purged" groups.)

[cg2v 11/20/02]


Cyrus Mail System

The Cyrus mail system is our replacement for AMS. It currently runs on the machines mail[1-4].andrew.cmu.edu, mail-fe[2-6].andrew.cmu.edu, mupdate1.andrew.cmu.edu, and imsp1.andrew.cmu.edu.

Users with access can't read restricted bboards.
For restricted bboard failures, find all of the process IDs of ptloader, as in /usr/ucb/ps auxww | grep ptloader. The second column is the process ID. Kill them using kill -15 <id> (and then again in a few seconds using kill -9 <id>), then run sh /etc/rc.local.ptloader.
Messages in /var/log/cyrus.err complaining about a mailbox with bad format (or other mailbox corruption problems)
If a mailbox is inaccessible because there is some corruption in its metadata, the 'reconstruct' program may be able to save it. Copy the files from the `mbpath mailboxname` directory to some holding location (so that the bug can be investigated), remove the cyrus.* files from the original directory, and then do /usr/cyrus/bin/reconstruct <mailbox>. [rjs3 29-apr-2003]
Users report they can't read cmu.misc.market or another bboard (happens more often to high volume ones).
This is usually caused by a process locking one of the cyrus.* files inside the mailbox. To locate the mailbox, run /usr/cyrus/bin/mbpath cmu.misc.market and cd there. Run lsof cyrus.*; if one process holds a lock through repeated runs (a lock is shown by a "W" in the 4th column), it is probably deadlocked.

It's also possible that the process locking the mailbox is waiting for a lock somewhere else; run truss to determine if it's waiting for another resource. Try to find the problem process, not the symptom process.

In general, processes are supposed to time out and release their resources, so this is a bug. If it's crucial for things to start moving again, you can try killing the process; otherwise, you might want to tell the Cyrus wizard (if any). [leg 16-nov-00]

Users unable to login with plaintext passwords.
Plaintext passwords are verified by the sasl authentication daemon, which spawns several copies of itself at startup. If for some reason these die, Kerberos logins will continue as normal but plaintext logins will fail (including TLS/SSL protected connections). Check to see if saslauthd is running, and if not, restart it (see /etc/rc.local.saslauthd).
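For the locked-bboard case above, the "same process holds a W lock through repeated runs" test can be made less error-prone by saving two lsof runs and comparing them. This is a sketch under the assumption stated in the text, namely that the lock shows as a trailing "W" in the 4th (FD) column and the PID is in the 2nd column. The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: find processes that hold a write lock in two separate lsof runs.
# ASSUMPTION: lsof output has the PID in column 2 and an FD column (4)
# such as "7uW", where the trailing W marks a write lock.
# lock_holders and persistent_lock_holders are hypothetical helper names.

# Print the sorted, unique PIDs holding a write lock (lsof output on stdin).
lock_holders() {
    awk '$4 ~ /W$/ { print $2 }' | sort -u
}

# Print PIDs that hold a write lock in both saved lsof outputs ($1 and $2).
persistent_lock_holders() {
    first=$(mktemp) && second=$(mktemp) || return 1
    lock_holders < "$1" > "$first"
    lock_holders < "$2" > "$second"
    comm -12 "$first" "$second"   # keep only lines common to both runs
    rm -f "$first" "$second"
}

# Example, run from the mailbox directory:
#   lsof cyrus.* > /tmp/run1; sleep 30; lsof cyrus.* > /tmp/run2
#   persistent_lock_holders /tmp/run1 /tmp/run2
```

A PID that appears in the result is only a suspect. As noted above, it may itself be waiting on another resource, so check it with truss before killing anything.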

The Cyrus IMAP Aggregator (Murder)

The Cyrus IMAP Aggregator is a system that allows a single IMAP namespace to exist (and be accessed by any IMAP client) across multiple machines. To do this, there are 3 types of machines involved in mail storage: frontends (mail-fe*), backends (mail*), and a MUPDATE server (mupdate1.andrew.cmu.edu).

The backends are normal IMAP servers with a bit of added intelligence whenever a mailbox operation is performed. In other words, clients can connect directly to a backend and have a typical IMAP session, except they will not be able to see all of the mailboxes that exist on the murder. In fact, this is how we ran the beta test: mail1 continued to be the primary point of access for most users, but was actually a full member of the murder. We currently have 5 backends (mail1, mail2, mail3, mail4, and mail5), with mail5 being used primarily for testing purposes.

The frontend servers are fancy IMAP proxies. They are fancy because they can switch servers mid-session and perform some operations (such as LIST) locally. Otherwise, they proxy requests to the appropriate backend server, or (client-willing) refer the client to the backend directly. Frontends are, for all intents and purposes, identical, and exist as a loadbalanced pool. If one goes down and is removed from the pool, the only clients who lose are the ones who were connected to it at the time. Otherwise, no one should notice or care what frontend they get. We currently have 6 frontends (mail-fe[1-6]), with mail-fe1 being used primarily for testing purposes.

The Aggregator requires an authoritative server to manage the namespace. The MUPDATE server performs this task. There is also a slave MUPDATE server running on each of the frontends, to allow the frontends a local copy of the entire mailbox namespace. When updates happen on the master they are pushed to the frontends in a relatively short period of time.

Mail is delivered to the aggregator by a process called lmtpproxyd, which runs on the mx and smtp servers. This process receives a message, queries the mupdate server for the location of its destination mailbox, and forwards the message along to that server. Note that we currently run sieve on the backend machines only, so all of a user's mailboxes must live on the same backend. (For example, user.rjs3 and user.rjs3.spam couldn't live on different backend servers.)

When things go wrong

The murder is designed so that restarting master on a machine (except for the mupdate server, which is, for the most part, considered authoritative) should bring it in sync with the state of the world as the murder sees it. That is, when a frontend restarts, it gets a fresh copy of the entire mailbox list. When a backend restarts, it performs a series of checks between its local mailbox list and what the MUPDATE server thinks it has, and either updates the MUPDATE server (if it has a mailbox that is not in MUPDATE, or the MUPDATE data is stale), or deletes the local mailbox (if MUPDATE claims the mailbox is hosted on another server). Note that if mailbox transfers fail a certain way, it may be necessary to restart both the source and target servers.

A common problem we are currently experiencing is that mon will report a "timeout" for the mupdate server on a frontend. This typically means that fdsync() calls have started to take a very long time (due to a Solaris bug). Generally, running /etc/remount-root will fix this problem. There is a cron job that runs this roughly daily. Presumably this will be fixed if we upgrade to Solaris 10 or manage to wrangle a patch for the kernel out of Sun (unlikely). [rjs3 5/19/03]

Locating a mailbox

Mailboxes have two parts to their location, just like AFS: a server and a partition on that server. To get this information for a given mailbox, use the info command in cyradm. If you need to find the physical files on a given server, look in /imap/<partitionname> on the correct backend (the command mbpath on the backend will help).

Sendmail mail routing

Sendmail is the world-infamous MTA (mail transfer agent) that we use for moving electronic mail through a maze of twisty-little passages. It runs on every machine, but most machines run very stupid configurations of it.

When things go very wrong, complain to Larry.

Webmail

Webmail service is provided on the loadbalanced pool of webmail machines, using the host sqmail collection, which contains our local version of Squirrelmail, an open-source PHP-based IMAP client. Each of the webmail servers runs an SSL-capable apache, which uses webiso to authenticate users.

Because webmail must maintain session information, as well as preferences information that is shared among all the servers, the /afs/andrew/data/db/squirrelmail tree exists. While the sess/ tree should basically maintain itself, the prefs/ tree (which contains a large number of volumes, which are used based on hashes of the username) may occasionally have a volume fill. When that happens, some users will not be able to write out preferences files or address books, or upload attachments. The most likely cause is attachments that haven't been properly garbage collected. The attachment temporary files have fairly obvious names (they look like hashes), as opposed to the preferences and address book files (which are username.pref and username.abook). Deleting any attachments that are older than a half hour or so should be safe. If this doesn't clear enough space, just increase the quota on the volume.
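Before deleting anything, it helps to list what would go. The sketch below prints files older than 30 minutes that are not .pref or .abook files. The directory you pass in is your own choice (nothing here knows the layout under prefs/), the function name stale_attachments is hypothetical, and -mmin is a GNU/BSD find extension that may be missing from older system find binaries.

```shell
#!/bin/sh
# Sketch: list likely stale attachment temp files in one directory.
# $1 is the directory to scan. Prints regular files last modified more
# than 30 minutes ago whose names do not end in .pref or .abook.
# ASSUMPTIONS: find supports -mmin (GNU/BSD extension); everything that
# is not a .pref or .abook file is an attachment temp file.
# stale_attachments is a hypothetical helper name. It only lists files.
stale_attachments() {
    find "$1" -type f ! -name '*.pref' ! -name '*.abook' -mmin +30 -print
}

# Example: stale_attachments /path/to/one/prefs/volume
```

Review the list by eye and delete only once the names look like hashes rather than user files.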

There is also a webmail bugzilla project which may be enlightening for known issues. If something goes wrong that you can't fix, complain to Rob or Larry.

To bring up a new webmail machine, just copy the package.proto, pubcookie key, and SSL keys and certs from an existing machine. Do NOT run keyclient, since it will invalidate the keys on the other machines. (alternatively, you can run keyclient -d to just download the current key)


Printing


Unix Servers


LDAP

This section has a lot of Q&A about the LDAP service running on campus. The person carrying beeper would not be asked to fix many of the horrors listed herein unless most of the LDAP expertise in the group were suddenly wiped out in one swift stroke. But the care and feeding of this system should be written down and preserved, and this is as good a place to start doing that as any.

All questions are preceded by a "Q: " string, and the question is written in upper case letters. You may search for keywords in the questions by doing upper case only searches.

Q: WHAT THE HECK IS LDAP?

Good place to start. LDAP (Lightweight Directory Access Protocol) is a database that holds records for all people and accounts on campus. The database has a tree structure similar to a filesystem so that entries can be placed in a hierarchical system. Each entry is identified by a "distinguished name" or "DN", which reads like the path to a file in a filesystem, except the divider between components is a comma instead of a slash. The order of the components in the DN is from specific to general. Example:

    uid=adamson,ou=Account,dc=andrew,dc=cmu,dc=edu
This shows an account in the Andrew system at Carnegie Mellon. There are trees for several different groupings:
    ou=Person,dc=cmu,dc=edu    Humans past, present, and future on campus
    ou=Account,dc=cmu,dc=edu             cmu.edu accounts
    ou=Account,dc=andrew,dc=cmu,dc=edu   Andrew accounts
    ou=Account,dc=cs,dc=cmu,dc=edu       Computer Science accounts
    ou=AuthEntity,dc=cmu,dc=edu          Special authentication identities 
If you were to retrieve an entry from LDAP you would find that it consists of attribute=value pairs, with attributes like "common name" (cn), "home telephone number" (homePhone), and "computer accounts" (cmuAccount).

The LDAP system is used by the Email system to route Email in the Andrew and cmu.edu domains. It is used by "finger" to find and fetch all information about people. It is used by web pages on www.cmu.edu to find people. Mail clients can use LDAP to find and resolve names on the "To:" line of composed messages.
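To make the specific-to-general ordering of a DN concrete, here is a small sketch that splits a DN into its components and rebuilds the DNS-style domain from the dc= parts. The function names are made up, and the naive split on commas ignores escaped commas inside values, which real DNs may contain.

```shell
#!/bin/sh
# Sketch: take a DN apart. Both function names are hypothetical.
# ASSUMPTION: no component value contains an escaped comma.

# Print the DN components one per line, most specific first.
dn_components() {
    printf '%s\n' "$1" | tr ',' '\n'
}

# Join the dc= components with dots, e.g. dc=andrew,dc=cmu,dc=edu.
dn_domain() {
    dn_components "$1" | awk -F= '
        $1 == "dc" { domain = domain (domain == "" ? "" : ".") $2 }
        END        { print domain }
    '
}

# Example:
#   dn_domain "uid=adamson,ou=Account,dc=andrew,dc=cmu,dc=edu"
```

For the example DN above, the components come out as uid, then ou, then the three dc parts, which is the reverse of how a filesystem path would be written.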

Q: WHERE IS LDAP RUNNING?

The main machine is metadir.andrew.cmu.edu. There are slave servers running on these machines:

Q: WHAT'S WITH ALL THE SERVER MACHINES?

The main machine, metadir, takes updates and is the only one that can actually write new data to the others. People can only query the others, not make changes to them. The other servers are intended for:

	mail-ldap*	Andrew and cmu.edu Email routing. VERY IMPORTANT
	ldap*		Andrew mail client To: line name resolution

Since we add and delete servers, you might want to look at CNAMEs to figure out what server is doing what. Currently defined CNAMEs [leg 8/22/01]:

Q: HOW DO I START OR STOP THE SERVER?

Solaris:   (currently metadir,ldap1,mail-ldap[12])

/etc/init.d/openldap start

/etc/init.d/openldap stop

/etc/init.d/openldap restart

Linux:   (currently ldap[234])

/etc/rc.d/init.d/openldap start

/etc/rc.d/init.d/openldap stop

/etc/rc.d/init.d/openldap restart

Use restart when possible. There are some other services that may need to be started and stopped alongside the LDAP server. If the server stopped unexpectedly and you want to start it up again, the restart command will prevent the situation where you get multiple copies of the other services running.

There is a wrapper program called slapd.wrapper that is run by the init script's start and stop commands. The wrapper will attempt to restart the slapd daemon if it stops unexpectedly (aka "crashes"). The wrapper waits about 10 seconds after the child dies and then restarts it. If slapd crashes several times in quick succession, the wrapper sends a zephyr to the people listed in /usr/openldap/etc/admin and slows down the restart process. So if you try to kill slapd by sending the slapd process a signal, it will be restarted. Use the init scripts. The stop command will signal the wrapper to exit and it will signal the slapd daemon to exit too. The restart command will only signal the slapd daemon to exit, and then the wrapper will restart it.

Q: HOW DO I TELL IF LDAP IS RUNNING ON A MACHINE?

A good way is to see if some program is listening to the LDAP port:

	% netstat -an  | grep 389
If there is a line that looks like
	*.389      *    *   0 LISTEN
then something is listening to the LDAP port. Another way is to check the process table for a program called "slapd"
	% ps -e | grep slapd
You might see 2 on metadir since it has a backup slave server running too.
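
One thing to watch with the netstat check above: a plain "grep 389" also matches port 3890, which the backup server on the master listens on. The sketch below matches the port exactly. The function name listening_on is made up, and the netstat layouts it expects (local address ending in ".389" or ":389") are assumptions about your platform's output.

```shell
# Reads `netstat -an` output on stdin.
# Succeeds if some LISTEN line has a local address ending in exactly the given port.
listening_on() {
    awk -v port="$1" '
        /LISTEN/ {
            for (i = 1; i <= NF; i++) {
                # Split "*.389" or "0.0.0.0:389" on dots and colons; the port is the last piece.
                n = split($i, parts, "[.:]")
                if (n > 1 && parts[n] == port) found = 1
            }
        }
        END { exit (found ? 0 : 1) }
    '
}

# Typical use on a server:
#   netstat -an | listening_on 389 && echo "something is listening on the LDAP port"
```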

Q: WHERE ARE ALL THE FILES THAT LDAP USES?

Look in /usr/openldap/ on any LDAP server:

	bin/			client programs
	libexec/		daemon binaries
	db/			BDB database files
	etc/			update scripts
	etc/openldap/		config files for slapd
	etc/openldap/schema/	LDAP schema files
	share/			odd config files, e.g. nicknames
	feeds/			on the master server, the update feed files
	logs/			output from slapd and slurpd
	var/			LDIF files, core files, PID files
	var/openldap/replica/	bookkeeping on replication
	lib/			LDAP libraries (not installed by default)
	lib/perl/		modules for performing updates on master 
	lib/perl/Feed/		modules for each feed type

Q: WHERE DOES THE SOFTWARE COME FROM?

The base LDAP service is in the source collection "local/openldap", or,

	/afs/andrew/system/src/local/openldap/
It came originally from the OpenLDAP group, available through HTTP, FTP, and CVS. See http://www.openldap.org. The CMU specific files, like the nightly updating and web pages, are in the "host/cmuldap" collection
	/afs/andrew/system/src/host/cmuldap
The "package" files that install the software are controlled by the /etc/package.proto file using these macros:
	%define doesopenldap
	%define doescmuldap
which cause these two files to be included:
	/afs/andrew/wsadmin/services/lib/openldap.generic
	/afs/andrew/wsadmin/services/lib/cmuldap.generic
They will bring in the programs as well as the config files
	/afs/andrew/wsadmin/services/etc/openldap.*
as needed.

Q: HOW DOES THE MASTER MACHINE UPDATE THE SLAVE MACHINES?

A process called slurpd runs on the master machine (metadir). When updates occur in the LDAP database, slapd writes to a log file called /usr/openldap/var/rep.ldif. There may be several processes that want to read that file, so a program called "repd" will take rep.ldif and make multiple copies of it to other filenames, according to its config file /usr/openldap/etc/openldap/repd.conf. One of the copies it makes is called /usr/openldap/var/rep.slurpd.ldif. The slurpd process runs in the background, polling that file for updates. When updates appear, it uses normal LDAP calls to tell all of the slave servers.

Q: HOW DOES SLURPD START AND STOP?

/etc/init.d/openldap.rep start

/etc/init.d/openldap.rep stop

/etc/init.d/openldap.rep restart

Just like slapd.

Q: HOW DOES THE LDAP DATABASE GET BACKED UP?

The LDAP server uses Sleepycat BDB database files to keep data. These files are always open, so writing them to tape is no good. To properly write them to tape or get a flat file copy of them, the LDAP server must be stopped, which is an interruption in service.

For this reason, there is a second LDAP server on the master server, listening on port 3890. The main server sends updates to this second server through the normal slurpd replication model. The backup server is not intended to service any queries, so it does no indexing or optimizations, and its ACL's are all shut off.

Each evening, CRON runs /usr/openldap/etc/ldapbackup.sh at 7:00pm. This time was chosen because it is about an hour before the nightly backup system, Amanda, runs. The ldapbackup.sh script shuts down the backup LDAP server and then runs /usr/openldap/sbin/slapcat to make a flat file copy of the LDAP database held by the backup server, then restarts the server. There is no interruption in service, since the backup server does not serve users.

When Amanda runs, it will make copies of the BDB files, but these are of little use. However, it will make a copy of the flat file database, which could be used in a disaster recovery situation.
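
The real ldapbackup.sh is not reproduced in this document, so the following is only a sketch of the sequence described above: stop the backup server, dump it with slapcat, restart it. The output path matches the backup.ldif location used in the restore steps; the helper names run and backup_ldap, and the DRY_RUN switch, are made up for illustration.

```shell
# Sketch only; the real ldapbackup.sh may differ.
LDAP_HOME=${LDAP_HOME:-/usr/openldap}

run() {
    # With DRY_RUN set, print the command instead of executing it.
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

backup_ldap() {
    run /etc/init.d/openldap.bak stop &&
    run "$LDAP_HOME/sbin/slapcat" \
        -f "$LDAP_HOME/backup/etc/openldap/slapd.conf" \
        -l "$LDAP_HOME/backup/db/backup.ldif"
    status=$?
    # Restart the backup server even if the dump failed.
    run /etc/init.d/openldap.bak start
    return "$status"
}
```

Setting DRY_RUN=1 before calling backup_ldap prints the three commands without touching the servers, which is a safe way to check the paths first.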

This backup server gets started and stopped like the master LDAP server, except its script is

	/etc/init.d/openldap.bak  {start|stop|restart}

Q: HOW IS IT THERE ARE TWO LDAP SERVERS ON THE MAIN MACHINES?

The main server uses files in the default directory

	/usr/openldap
There is an extra subdirectory there called "backup/" that contains the files for the backup LDAP server. It uses a special config file "backup/etc/openldap/slapd.conf" that tells where the non-standard directories are. When the process table is examined, the backup database slapd process is distinguished by having a "-f" switch that points to this alternate config file.
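
As a sketch of how to tell the two slapd processes apart, the filter below labels each slapd line from a full process listing according to whether its "-f" switch points into the backup/ area. The function name label_slapd is made up, and it assumes your ps can show full arguments (for example "ps -ef").

```shell
# Reads full-argument process listing lines on stdin.
# Labels each slapd process "backup" if its -f config lives under backup/, else "main".
label_slapd() {
    awk '
        /slapd/ && !/slapd\.wrapper/ {
            role = ($0 ~ /-f [^ ]*backup\//) ? "backup" : "main"
            print role ": " $0
        }
    '
}

# Typical use on the master:
#   ps -ef | label_slapd
```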

Q: HOW DO I RESTORE THE LDAP DATABASE?

First you have to determine how widespread the problem is that makes you want to restore the database. There are 3 possible situations, listed in increasing order of unfriendliness:

  1. The master server crashed and its database is corrupt
  2. Bad data got into LDAP today and sent to the slave servers too
  3. Bad data has been in the system for a while, and we have to go back a ways

Solution to scenario 1:

You can copy the database files directly from one of the slave servers, say mail-ldap2. Note that $LOGNAME should be set to your username. If it is not, just replace each occurrence of the variable below with your username.
	# on metadir:
	su $LOGNAME.root
	echo ${LOGNAME}@ANDREW.CMU.EDU >> /.klogin
	cd /usr/openldap
	/etc/init.d/openldap stop
	mv db db.corrupt           (optional)

	# on mail-ldap2:
	su $LOGNAME.root
	cd /usr/openldap
	/etc/init.d/openldap stop
	tar -cf db.tar db
	klog $LOGNAME
	scp db.tar root@metadir:/usr/openldap/.
	/etc/init.d/openldap start

	# when that is done, on metadir:
	tar -xf db.tar
	/etc/init.d/openldap start
Note that if a slave server is corrupted, the same operation above can be performed, replacing "metadir" with the slave server hostname.
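
Before untarring over a database directory, it can be worth checking that the tarball really holds the three core files (dn2id.dbb, id2entry.dbb, nextid.dbb) under db/. This is an optional sanity check, not part of the original procedure; the function name check_db_tar is made up.

```shell
# Succeeds only if the tarball lists all three core database files under db/.
check_db_tar() {
    listing=$(tar -tf "$1") || return 1
    for f in dn2id.dbb id2entry.dbb nextid.dbb; do
        printf '%s\n' "$listing" | grep -q "^db/$f\$" || {
            echo "missing db/$f in $1" >&2
            return 1
        }
    done
}

# Typical use on the receiving machine, in /usr/openldap:
#   check_db_tar db.tar && tar -xf db.tar
```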

Solution to scenario 2:

This is a bad situation. It will take time to restore, so if the slave servers can be left running while you work on other machines, it will help prevent a long downtime in service.
	# on metadir
	su $LOGNAME.root
	echo ${LOGNAME}@ANDREW.CMU.EDU >> /.klogin
	cd /usr/openldap
	/etc/init.d/openldap stop
	/etc/init.d/openldap.rep stop
	echo > var/rep.slurpd.ldif
	mv db db.corrupt
	mkdir db
	sbin/slapadd -l  backup/db/backup.ldif
This last command will take a while, even 30-90 minutes.
	tar -cf db.tar db

	/etc/init.d/openldap.bak stop
	/bin/cp db/nextid.dbb    backup/db
	/bin/cp db/dn2id.dbb     backup/db
	/bin/cp db/id2entry.dbb  backup/db
	/etc/init.d/openldap.bak start

	/etc/init.d/openldap start
	klog $LOGNAME
Now, one by one, perform steps similar to scenario 1 above to scp the db.tar file from metadir to one slave server at a time:
	# on the slave:
	su $LOGNAME.root
	echo ${LOGNAME}@ANDREW.CMU.EDU >> /.klogin
	cd /usr/openldap
	/etc/init.d/openldap stop
	mv db db.corrupt           (optional)

	# on metadir:
	scp db.tar root@<SLAVE SERVER NAME>:/usr/openldap/.
	/etc/init.d/openldap start

	# when that is done, on the slave server:
	tar -xf db.tar
	/etc/init.d/openldap start
Loop through those commands on each slave. If the database is really screwed up such that the slaves cannot run, go ahead and run the scp commands concurrently from metadir. You may want to add the -q switch between "scp" and "db.tar" to shut off the running progress display.

When all slaves are updated, restart the replication server:

	/etc/init.d/openldap.rep start

Solution to scenario 3:

Wow, you really screwed up. You're going to have to go into the backup system, Amanda, and get an old copy of the LDIF file
	/usr/openldap/backup/db/backup.ldif
Talk to a backup kind of guy, like Chaskiel Grundmann, to get into Amanda. Once you have the backup file, you can follow the steps for scenario 2 above. The "slapadd" command will need the name of the restored LDIF file after the "-l" switch.

You can try getting the LDAP database more up to date if you think the feed files are uncorrupted. You'll need to look in the directory

	/usr/openldap/feeds/
to see how far back the feed files go. If you restored the LDIF file from further back, you're going to have a gap in the updates. But suppose today is March 20, 2001 and you took the LDIF file from, say, March 5, 2001. You can run the update system between those two dates and you should be in good shape
	# on metadir
	cd /usr/openldap
	etc/ldapfeeds.pl -old 010305 -new 010320
This will take the differences between the feed files from the old date to the new, and make updates to the LDAP database accordingly. The output of the command will be in
	/usr/openldap/feeds/ldapfeeds.output.010320
The date on the output file is today's date; if the -new date was two days ago (010318) the output file would still be today's date.
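
The -old and -new arguments in the example are dates written as YYMMDD. A small helper can avoid typos when converting; feed_stamp is a made-up name, and the ISO-style input format is just a convenience chosen for this sketch.

```shell
# Convert a date written as YYYY-MM-DD into the YYMMDD form used in the example above.
feed_stamp() {
    printf '%s\n' "$1" | awk -F- '{ printf "%s%s%s\n", substr($1, 3, 2), $2, $3 }'
}

feed_stamp 2001-03-05    # prints 010305

# Typical use on metadir, with today's date as the -new value:
#   etc/ldapfeeds.pl -old "$(feed_stamp 2001-03-05)" -new "$(date +%y%m%d)"
```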

Q: HOW DO I SEARCH FOR AN ENTRY IN THE LDAP DATABASE ?

(Also see the next question to MODIFY an entry)

The "ldapsearch" command should be all you need. It's in /usr/local/bin/ on most machines, and can be run with --help for a list of all command line switches. The command tends to look like this:

      % ldapsearch -h ldap2 -b dc=cmu,dc=edu  cmuAndrewID=adamson 
This will connect to the (h)ost named ldap2, find the user whose Andrew ID is "adamson", and print out everything in the entry that is available to you. It will use your Kerberos ticket to authenticate you, so you may gain additional privileges, such as seeing the user's phone number. Several people's ADMIN instances are in the cn=Administrators group, which can see anything.

If you get an error about command line options, make sure you are using the OpenLDAP ldapsearch and not one from the OS vendor or Netscape or other vendors, as the command line switches can vary. Specify /usr/local/bin as the path to the program.

There are more general searches you can conduct, changing the filter at the end of the command line to something like

      'cmuDepartment=Services Development Group (Comp Services)'
to get all programmers in Computing Services. You will get back multiple entries. If there is more than one filter you want to use to refine the search, you can put them together in the LDAP filter format:
      (&(attr1=value1)(attr2=value2)(!(attr3=value3)))
The initial ampersand (AND) can be changed to a pipe (OR), and the exclamation point shows how to do a logical NOT.
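
Getting the parentheses right by hand is error-prone, so here is a sketch of helpers that assemble filters in the format shown above. The function names are made up, and values are used as-is: special characters in values (parentheses, asterisks, backslashes) are not escaped.

```shell
# Join terms under an operator; terms already wrapped in parentheses are kept as-is.
ldap_join() {
    op=$1; shift
    out="($op"
    for term in "$@"; do
        case $term in
            "("*) out="$out$term" ;;
            *)    out="$out($term)" ;;
        esac
    done
    printf '%s)\n' "$out"
}
ldap_and() { ldap_join '&' "$@"; }
ldap_or()  { ldap_join '|' "$@"; }
ldap_not() { printf '(!(%s))\n' "$1"; }

ldap_and 'attr1=value1' 'attr2=value2' "$(ldap_not 'attr3=value3')"
# prints: (&(attr1=value1)(attr2=value2)(!(attr3=value3)))
```

Remember to quote the result on the ldapsearch command line so the shell does not interpret the ampersand, pipe, or parentheses.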

If there are only 1 or 2 attributes you are interested in instead of the entire entry, you can append them to the commandline. For the ldapsearch command if anything appears after the filter, it is taken as a list of attributes you want to see.

      % ldapsearch -h ldap2 -b dc=cmu,dc=edu  cmuAndrewID=adamson homePhone cn
There are also "operational" attributes such as last modify time (modifyTimestamp) and last modifier's name (modifiersName) which don't show up unless you specifically request them. To see them, add a plus sign "+" to the end of the command line as an attribute to request. This is a flag in the protocol to see operational attributes.

Q: HOW DO I CHANGE AN ENTRY IN THE LDAP DATABASE ?

(Also see the next question to ADD an entry)

You need to write an LDIF file and feed it to ldapmodify. A quick crash course on LDIF... It is flat ASCII, with lines separated by newlines, and entries separated by blank lines. Example:

dn: uid=blef,ou=Account,dc=andrew,dc=cmu,dc=edu
mail: blef@andrew.cmu.edu

dn: uid=fnork,ou=Account,dc=andrew,dc=cmu,dc=edu
cn: John Fnork
cn: John Q Fnork
This LDIF file would change two entries. Each line consists of an attribute name, a colon, a space, and a value. The first attribute for an entry has to be the "dn" attribute. After that, any attribute available for the entry can be listed. If an attribute is to have more than one value, like the common name for John Fnork above, the values are listed one per line with the attr name repeated for each one. You don't have to list the entire entry as the way you want it to appear when you're done, you just have to list the attributes you want changed.
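
If you are generating LDIF from a script rather than typing it, a sketch like the following produces entries in the layout just described. The function name ldif_entry is made up, and it assumes plain ASCII values with no leading spaces (anything else would need encoding that this sketch does not do).

```shell
# Print one LDIF entry: the DN first, then one "attr: value" line per attr=value argument,
# then the blank line that separates entries.
ldif_entry() {
    printf 'dn: %s\n' "$1"; shift
    for pair in "$@"; do
        # Split on the first "=" only, so values may themselves contain "=".
        printf '%s: %s\n' "${pair%%=*}" "${pair#*=}"
    done
    printf '\n'
}

ldif_entry 'uid=fnork,ou=Account,dc=andrew,dc=cmu,dc=edu' 'cn=John Fnork' 'cn=John Q Fnork'
```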

The ldapmodify program lives in either /usr/openldap/bin or /usr/local/bin on most machines. You have to give it a few commandline args, and you will need a Kerberos ticket. A typical line looks like this:

    % ldapmodify -h metadir -f ./ldif
The "-h metadir" says to make the changes on the (h)ost named metadir, which at the time of this writing, is the read/write server. The "-f ./ldif" args give the path to the (f)ile that contains your LDIF.

IMPORTANT NOTE: If your LDIF is all new values to APPEND to entries, you are fine. If you want the old values that are in the entry removed and REPLACED with the values you have in your LDIF file, add a "-r" switch to the command. Again, you don't have to list the entire entry in your LDIF -- if you add -r to the command, only attributes explicitly included in your LDIF file will be overwritten. If one entry lists "cn" and a second lists "mail", and you use the -r switch, the "mail" attr of the first entry WILL NOT BE CHANGED, nor will the "cn" of the second entry. "Inclusion" of attributes is on an entry by entry basis, not the entire file.

In order to REMOVE a SINGLE value from an attribute, list all of the other values in the LDIF file and use the -r switch. To remove ALL of the values, say all of the "labeledURI" attribute values, add "labeledURI: " to the LDIF. BE SURE to have the colon and the space.

What could go wrong?

If you get a "not authorized" or "inappropriate auth" error, there is something wrong with your authentication. Make sure you have your Admin ticket. Maybe your admin instance doesn't have authorization to change the database. At the time of this writing, the people whose admin instances are able to make changes are Mark Adamson, Larry Greenfield, and Derrick "Moonbeam" Brashear.

If there is a schema violation, you are trying to add/change attributes in an LDAP entry that can't have those attributes. For example, if you try to append a "cmuSIScat" attribute to an Account entry, it will fail -- you can only add that attribute to a Person entry (with the guid=<hex> in the DN).

If an error occurs with one entry, the ldapmodify command will stop. You'll have to take the tail of the LDIF file and run it through ldapmodify again. You can add the "-c" switch to tell ldapmodify to (c)ontinue after errors.
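
Taking "the tail of the LDIF file" can be done with awk's paragraph mode, since entries are separated by blank lines. This is a sketch; ldif_from is a made-up name, and you still have to work out for yourself which entry number ldapmodify stopped on.

```shell
# Print entry number $1 and everything after it from the LDIF on stdin.
# Entries are blank-line separated; numbering starts at 1.
ldif_from() {
    awk -v first="$1" 'BEGIN { RS = ""; ORS = "\n\n" } NR >= first'
}

# Typical use, if the 3rd entry was the one that failed and has since been fixed:
#   ldif_from 3 < ./ldif > ./ldif.rest
#   ldapmodify -h metadir -f ./ldif.rest
```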

Q: HOW DO I ADD AN ENTRY TO THE LDAP DATABASE?

(See also the previous question for MODIFYING an entry, and the next question for creating a GUID)

You will need a file containing a complete LDIF copy of the entry you want to add to the database. See the previous question for a crash course in LDIF. The file will contain the COMPLETE entry, with all values filled in. One way to start is to fetch a similar entry from LDAP and plagiarize.

With the LDIF file ready, feed it into ldapadd:

    % ldapadd -h metadir -f ./ldif
just like the "ldapmodify" command in the previous question. There is no "-r" switch for ldapadd, since there are no old values to replace.

You shouldn't have much need to create entries by hand.

Q: HOW DO I CREATE A NEW GUID?

This will come up if you're manually adding a new entry for a Person for some reason. (This situation could arise if you find a Person entry that is the combination of two humans into one entry, and you need to splice them apart into two Person entries.) All Person entries have a DCE GUID in their DN to uniquify them across the campus forever.

There is a program on the main LDAP server

	/usr/openldap/etc/guid
Just run it, it takes no commandline args, and makes no changes to LDAP. It simply generates a DCE GUID and writes it to the screen. Run it again, and you'll see the next one it generated is different.

The program produces slightly better GUIDs if run as root, since on SUN machines only root can get to the MAC address of the ethernet card, which is used as an input into GUID production.

Q: WHO IS AUTHORIZED TO MAKE CHANGES IN THE DATABASE?

You can read the slapd config file to get the list of LDAP DN's that can make writes to the database. The config file is

	/usr/openldap/etc/openldap/slapd.conf
Look for "access to" directives, and within them look for "write" access.
  access to attrs=telephoneNumber,homePhone
    by group="cn=Administrators,ou=Group,dc=cmu,dc=edu" write
    by * read
This entry controls access to 2 phone attributes. It says that the given group named cn=Administrators can write changes to these attributes, and anyone else can read them. If you were to search for that group, it would have one or more values in the "member" attribute, and it is those people who can write to those two phone attributes. Access control directives are read from top to bottom, so a previous directive may have superseded this one.
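
For a quick overview of who can write what, a sketch like this pulls the "write" lines out of a config laid out like the example above. The function name list_writers is made up. It assumes each "access to" and "by" clause sits on its own line and that the "by" target contains no spaces, and it does not account for directive ordering.

```shell
# Reads slapd.conf text on stdin.
# For every "by ... write" line, print the enclosing "access to" target and who may write.
list_writers() {
    awk '
        $1 == "access" && $2 == "to" { target = $3 }
        $1 == "by" && $NF == "write"  { print target " <- " $2 }
    '
}

# Typical use:
#   list_writers < /usr/openldap/etc/openldap/slapd.conf
```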

In general, you will find that to make changes you must be in the cn=Administrators group, or bind as the cn=Manager identity listed in the "rootdn" directive in the slapd.conf file.

The cn=Administrators group lists several people's admin instance. These people, with their Admin ticket, can change anything and add new entries.

Q: WE'RE ALL LOCKED OUT! HOW DO WE CHANGE THE DATABASE TO ADD SOMEONE BACK IN?

Some idiot deleted everyone from the administrators list, eh? These things happen. You will need to log into the master server (currently metadir.andrew) and change the slapd config file.

	/usr/openldap/etc/openldap/slapd.conf
Find the line that says "rootpw ". The value after it is the password that you can use to bind as the manager DN listed in the "rootdn " directive (usually right above). The rootdn has fearsome powers to do anything. Be careful.

By default, the rootpw line has a bogus value, because we don't want people using the rootdn directly, but this is an emergency, right? You will need to DES encrypt the new password. You did think of a new password, right? Something that people aren't going to guess, right? You can DES encrypt it using the crypt() function, available from perl quickly with

    % perl -e '$a=<STDIN>; chop $a; print crypt($a, "SALT") . "\n";'
The <STDIN> will wait for you to type in the password, so it's not stored by your shell as part of the commandline. It WILL appear on the screen as you type, so look over your shoulder first. When you hit enter, a 13 character string will appear, which you should add to the "rootpw" line of slapd.conf:
    rootpw {CRYPT}SAYoATkkQuOb7
Now restart the slapd server (see the question far above on starting/stopping the server). You only need to do these steps on the master server.

Now add someone to the cn=Administrators group. Write an LDIF like:

dn: cn=Administrators,ou=Group,dc=cmu,dc=edu
member: uid=adamson.admin,ou=Account,dc=andrew,dc=cmu,dc=edu
member: uid=shadow.admin,ou=Account,dc=andrew,dc=cmu,dc=edu
and run "ldapmodify" with passwords instead of SASL.
   % ldapmodify -h metadir -D cn=Manager,ou=AuthEntity,dc=cmu,dc=edu \
      -x -W -f ./ldif
   Password:
NOTE IN BIG LETTERS: Run this while logged into the master server, NOT across the network. The password you type is going to the server unencrypted. Doing it on the same machine prevents the communication from using the campus net.

The "-x" switch says to use simple authentication, and "-W" says to prompt for the password. DO NOT USE "-w <passwd>" since that leaves the password sitting in your shell, and possibly in your ~/.history file. If the ldapmodify command doesn't like "-W", then you are using the Netscape version -- go find the OpenLDAP version (/usr/local/bin, /usr/openldap/bin).

WHEN YOU ARE DONE reset the "rootpw" directive in slapd.conf to remove the password, like

    rootpw  {CRYPT}NO_PASSWORD
and restart the server AGAIN. This will disable the use of simple password to become the manager.
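
It is easy to forget the cleanup step, so here is a sketch of a check that the rootpw line is back to the placeholder. The function name rootpw_disabled is made up, and it assumes the placeholder is exactly the {CRYPT}NO_PASSWORD value shown above; adjust if your config uses a different bogus value.

```shell
# Reads slapd.conf text on stdin.
# Succeeds only if a rootpw line exists and every rootpw line holds the placeholder value.
rootpw_disabled() {
    awk '
        $1 == "rootpw" { seen = 1; if ($2 != "{CRYPT}NO_PASSWORD") bad = 1 }
        END { exit ((seen && !bad) ? 0 : 1) }
    '
}

# Typical use on the master:
#   rootpw_disabled < /usr/openldap/etc/openldap/slapd.conf || echo "rootpw is still live!"
```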

Q: HOW DO I REINDEX THE DATABASE?

The index files in LDAP speed up lookups for indexed items. As values are changed over time the hash tables in the index files become less and less efficient. This can be cleared up by removing the index files and running the OpenLDAP utility 'slapindex'. Of course, this can't be done while the server is running, and reindexing takes about an hour, so the indexing should be done on a separate offline copy of the database. Therefore, we use the space in the backup database server on the main LDAP server, in

     /usr/openldap/backup/
That server already has a copy of the 3 main files that contain all of the raw data for the database:
	dn2id.dbb
	id2entry.dbb
	nextid.dbb
From those files, all indices may be created.

Another problem to be faced is that a snapshot of the database is going to be used to create the indices. While the indices are being created, which takes about an hour (by June 2001 standards), the live database can undergo updates. If the newly indexed database is swapped in as the live database, those changes will be lost. For this reason, some games must be played with the replication system to recover any changes. Nevertheless, this process should be done during non-peak hours because there is a window of about 60 seconds during which any changes made have to be re-entered by hand. Just make sure not to attempt this process while the nightly feed update is going on!

  1. Log into the main LDAP server, which also has the nightly backup server.
    	telnet metadir
    	  login
    	su root
    	cd /usr/openldap/
    
  2. Modify the slapd.conf file for the backup server to include all indices. Normally, the backup server doesn't do any indexing.
    	cp backup/etc/openldap/slapd.conf backup/etc/openldap/slapd.conf.orig
    	grep ^index etc/openldap/slapd.conf >> backup/etc/openldap/slapd.conf
    
  3. Stop the backup server
    	/etc/init.d/openldap.backup stop
    
  4. Reindex in the backup area. Takes an hour or so
    	sbin/slapindex -c -f backup/etc/openldap/slapd.conf
    
  5. Restart the backup server (using the indexing config file)
    	/etc/init.d/openldap.backup start
    
  6. Let slurpd update the backup server with any changes that occurred while it was reindexing. Because the backup server is using indexing, the new index files you created with slapindex will be updated. Watch the slurpd status file to see when the most recent update timestamp for the backup server matches the last entry in the slurpd replog
    	tail var/openldap-slurp/replica/slurpd.replog
    	grep metadir var/openldap-slurp/replica/slurpd.status
    
    You'll have to do some date conversions.
  7. This next multistep should be done as swiftly as possible, during a period when updates to the database are not expected to happen. Make a tarball of the reindexed database then swap the new index into the place of the live database.
    	cd backup/db
    	/etc/init.d/openldap.rep stop
    	/etc/init.d/openldap.backup stop
    	tar -cf db.tar *
    	cd /usr/openldap
    	/etc/init.d/openldap stop
    	mv db db.old
    	mv backup/db .
    	mv var/rep.slurpd.ldif var/redo.ldif
    	/etc/init.d/openldap start
    
  8. Between the time the backup server was stopped and the main server was stopped, any changes made to the database are now sitting in the LDIF file that was moved to var/redo.ldif. These changes must be fed back into the main database, or they will be lost. Since the time window was small, hopefully there won't be any changes.
    	edit the redo.ldif file
    		remove each block of "replica" lines
    		remove any lines that change system attributes (modify* creat*)
    		remove the "-" lines after them, too.
    	feed the redo.ldif file into ldapmodify
    
  9. The main database is now happily running and taking updates. The slurpd server is NOT running, so any changes made after the backup server was stopped are queuing up in the slurpd LDIF file.

  10. Recreate the environment for the backup server, and restart it
    	mv backup/etc/openldap/slapd.conf.orig backup/etc/openldap/slapd.conf
    	mkdir backup/db
    	cd backup/db
    	tar -xf ../../db/db.tar dn2id.dbb id2entry.dbb nextid.dbb
    	cd ../..
    	/etc/init.d/openldap.backup start
    
  11. Now start copying the reindexed database to the slave servers. Note that the scp will take as much as a few minutes. The untar can take a minute or so, which is why it is done into a separate directory and then moved into place with the near-instant "mv" commands. Note that this requires 3x the disk space of the directory (the running db, the tar file, and the new db), so watch your df.
    	foreach host ( mail-ldap1 ...... )
    	 scp db/db.tar root@${host}:/usr/openldap/db.tar
    	 ssh -l root $host '\
    	   cd /usr/openldap; \
    	   mkdir db.new; \
    	   cd db.new; \
    	   tar -xf ../db.tar; \
    	   cd ..; \
    	   /etc/init.d/openldap stop; \
    	   mv db db.old;\
    	   mv db.new db; \
    	   /etc/init.d/openldap start;  \
    	   rm -r db.tar db.old'
    	end
    	rm db/db.tar
    
  12. Restart the replication server. It will have an LDIF file to work on that contains all the changes since the database was copied from the backup area into the live main database. You can clear out the logs that slurpd keeps, since you know all of the databases are the same now.
    	rm var/openldap-slurp/replica/*
    	/etc/init.d/openldap.rep start
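
The hand edit of redo.ldif in step 8 can be roughed out with awk. This is only a sketch: the exact layout of the replication log is not shown in this document, so the filter assumes "replica:" lines, change blocks that open with an "add:", "replace:" or "delete:" line naming the attribute and close with a "-" line, and the usual OpenLDAP system attribute names. The function name clean_redo is made up. Read the result before feeding it to ldapmodify.

```shell
# Reads a replication-log style LDIF on stdin and drops:
#   - "replica:" lines
#   - change blocks for modify*/creat* attributes, including their closing "-" line
#   - bare modify*/creat* attribute lines (as they would appear in added entries)
clean_redo() {
    awk '
        /^replica:/                            { next }
        skip && $0 == "-"                      { skip = 0; next }
        skip && $0 != ""                       { next }
        $0 == ""                               { skip = 0 }
        /^(add|replace|delete): (modif|creat)/ { skip = 1; next }
        /^(modifiersName|modifyTimestamp|creatorsName|createTimestamp):/ { next }
        { print }
    '
}

# Typical use in /usr/openldap:
#   clean_redo < var/redo.ldif > var/redo.clean.ldif
```

An entry whose only changes were to system attributes comes out with just its dn and changetype lines, so delete those leftovers by hand.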
    
If, in the future, the various database servers do not keep identical copies of the LDAP database, the reindexing will need to be done on each machine. In that case, you will need to make a separate database area on each machine similar to the backup area on the main server. It will need its own db.new/ directory and its own slapd.conf file that has the "directory" directive pointing to db.new/ area. Then you will need to stop the slave slapd, copy the 3 main database files to db.new, and restart the slapd. Run slapindex with the -f flag giving the separate slapd.conf. When it's done, stop the slapd, move the db.new into place, and restart the slapd.

The downside is this took a while, and the database is missing all changes that happened since the 3 main files were copied over. If you stop slurpd on the main server before you copy the 3 files, you won't miss any updates, but ALL updates will not be replicated while that is going on. You have the option of leaving slurpd running, but tell "repd" on the main server to make another copy of the LDIF file, and when the slave is done reindexing you can feed that new LDIF file into a "one shot mode" slurpd (use the -o and -r).


WebISO (pubcookie)


General Oracle Information.

Computing Services maintains several Oracle databases such as the Help Center DB, the BlackBoard DB, the DAMS Trigger Server, and multiple development tablespaces on one or more of these DBs. So far, the databases have been well behaved. Occasionally they can act up and cause some trouble but this is rare.

If a DB acts up, the first place to look is in one of the alert_{SID}.log files. This file contains information about log switches, major DB alterations, and errors. All errors are preceded by "ORA-XXXXX", where XXXXX is a very useful error number. These numbers can be used to access further information, and possible solutions, for that error.
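
Alert logs get long, so a quick tally of the ORA- numbers is a handy first look. The sketch below counts each distinct error code in whatever is fed to it; ora_codes is a made-up name, and it simply matches the "ORA-" prefix followed by digits anywhere on a line.

```shell
# Reads an alert log on stdin and prints each distinct ORA-NNNNN code with its count.
ora_codes() {
    awk '{
        line = $0
        while (match(line, /ORA-[0-9]+/)) {
            count[substr(line, RSTART, RLENGTH)]++
            line = substr(line, RSTART + RLENGTH)
        }
    }
    END { for (code in count) print code, count[code] }' | sort
}

# Typical use, substituting the alert log for the instance in question:
#   ora_codes < /oracle/m01/app/oracle/admin/dams/bdump/alert_dams.log
```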

Locations of Oracle Instances

Below is a table displaying the machine and directory structure for the Oracle instances currently administered by Computing Services.

Primary BlackBoard Server
	MACHINE:		courseinfo4.andrew.cmu.edu
	ORACLE HOME:		/oracle/app/oracle/product/8.1.7
	ORACLE SID:		blkboard
	ALERT LOG:		/oracle/app/oracle/admin/blkboard/bdump/alert_blkboard.log

Primary Help Center and Generic Developer Server
	MACHINE:		ora1.andrew.cmu.edu
	ORACLE HOME:		/oracle/m01/app/oracle/product/8.1.7
	ORACLE SID:		thebigdb
	ALERT LOG:		/oracle/m01/app/oracle/admin/thebigdb/bdump/alert_thebigdb.log

Primary DAMS Trigger Server
	MACHINE:		metadir.andrew.cmu.edu
	ORACLE HOME:		/oracle/m01/app/oracle/product/8.1.6
	ORACLE SID:		dams
	ALERT LOG:		/oracle/m01/app/oracle/admin/dams/bdump/alert_dams.log

What to do when Oracle misbehaves.

Under normal conditions, the Oracle instance is automatically shut down and restarted during a server reboot. This is controlled by the rc links in /etc, which point to the scripts init.d/oracle8i and init.d/listener8i. These scripts su to the oracle user and then call wrapper scripts in the $ORACLE_HOME/bin directory: dbstart, dbshut, startlsnr, and stoplsnr. The wrapper scripts set several environment variables and then execute the binaries to stop or start the DB or the listener.

At this point in time, there are two failure scenarios from which we can recover: an instance failure, where the DB crashes for some reason, and a media failure, where we lose one or more of the drives of the DB server. The first scenario is fairly easy to recover from: restart the DB, and it will automatically attempt to recover itself by using the redo logs. The following are the commands to restart the DB after an instance failure.

% su root
Password: 
skydiver.andrew.cmu.edu# su oracle
skydiver.andrew.cmu.edu# source /oracle/oracle.env
skydiver.andrew.cmu.edu% stoplsnr
skydiver.andrew.cmu.edu% dbshut
skydiver.andrew.cmu.edu% startlsnr
skydiver.andrew.cmu.edu% dbstart

If the drive containing the binaries fails, it will have to be restored from amanda. Since the DB control files live on the RAID unit, after you restore the DB binaries and bring the DB back on line, it should pick up where it left off as in an instance failure. If the entire RAID goes bad, the nightly amanda dump of /oracle will need to be restored, and again the DB should recover, although only up to the state of the DB as of the nightly dump.

Where to find additional information

The absolute best place to go for additional information on an Oracle error, or for any query about Oracle, is http://metalink.oracle.com. You need an account to access this web site. The only person on campus who can grant an account is Thang Vu.

You can get pretty reasonable results just by typing the ORA-XXXXX error into any search engine. Finally, most of the Oracle documentation can be found at Oracle 8I Documentation.


Monitoring Infrastructure

Event monitoring

The monitoring systems previously documented here (nadine, tsvmon) have been superseded by "mon", a system that supports active notification in addition to polling by user consoles, and is easier to extend to test new services as well. Mon is currently used to test services provided by the Systems group (including AFS, kerberos, mail, ldap, and various web servers) and the network group (dns, radius) as well as network devices maintained by datacomm.

The primary interface to mon is a cgi script running on monitor.andrew.cmu.edu (aka opermon1.andrew.cmu.edu). This interface allows you to see what, if any, services are currently failing, and also allows you to adjust some of mon's settings.

Mon is significantly different from earlier monitoring systems used at CMU in that it does not rely on the operators to contact the primary in the event of a problem.

In most cases, two consecutive failures of a service will result in the primary being sent a text page with a brief description of the problem (or at the very least, identification of the service and machine that have failed). The primary will be re-paged at varying intervals until either the service starts functioning again, or the failure is acknowledged.
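
The "two consecutive failures" rule can be illustrated with a short sketch. This is not mon's code or configuration; the function name should_page, the threshold argument, and the ok/fail result words are all assumptions made for illustration.

```shell
# Sketch only: decide whether a run of poll results would trigger a page.
# Usage: should_page THRESHOLD RESULT...   (each RESULT is "ok" or "fail")
# Prints "page" once THRESHOLD consecutive failures are seen, else "quiet".
should_page() {
    threshold=$1; shift
    streak=0
    for result in "$@"; do
        if [ "$result" = "fail" ]; then
            streak=$((streak + 1))
        else
            streak=0            # any success resets the count
        fi
        if [ "$streak" -ge "$threshold" ]; then
            echo page
            return 0
        fi
    done
    echo quiet
}
```

For example, "should_page 2 ok fail ok fail" stays quiet because the failures are not back to back, while "should_page 2 ok fail fail" would page.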

The acknowledgement can either be done from the web console or directly from a 2-way pager. The messages mon sends to the pagers include information that allows the pager user to reply to the page. Once an acknowledgement has been processed, mon will not send further alerts about this failure, and the web console will display the text of the acknowledgement.

Alternatively, if the circumstances warrant it (the outage will be extended, the machine will need to be rebooted), monitoring of a host or service may be disabled instead. This can also be done from either the web console or a pager.

There are 3 kinds of disable actions that can be performed:

  1. disabling a host: none of the services on the specific host will be monitored.
  2. disabling a service: prevents that specific service from being monitored on all the hosts in the hostgroup.
  3. disabling a group: prevents all the services from being monitored on all the hosts in the group.

Because mon automatically notifies the primary of an outage, people who maintain services should make sure not to trigger it while doing maintenance. The relevant services or hosts should be disabled before any interruption in service, and not be re-enabled until there is a reasonable expectation that the service is stable.

[I'll presumably add more detail to this next section when my brain is feeling less frazzled]

Quick navigation hints for using the web console:

Acknowledgement messages are set by clicking on the service name in the second column, and using the text box on the following page.

Services, hosts, and groups can be disabled or enabled by clicking on the hostgroup name in the first column and using the radio buttons on the following page. Single services can also be disabled/enabled by clicking on the service name in the second column and using the "disable/enable service xxx in group yyy" link on the following page.

[cg2v 11/21/02]

Historical data

Historical data on machines is currently maintained on graphs.andrew.cmu.edu.

Historical data on the network is available at stats.net.cmu.edu. People trying to track down denial-of-service problems might want to look at the Top 10 usage link at the bottom. "cyh-a100.sw.cmu.edu" is the machine room switch.

Netdev takes care of stats.net; Larry is probably good to complain to about graphs.andrew. [leg 8/22/01]


Local Machine Accounts

Root Account

While a generic root password is distributed in the global /etc/passwd, found in /afs/andrew.cmu.edu/common/etc/passwd, some machines override it using the file /etc/passwd.change. Most notable among these are the PO/BB machines, which use the Postman root password. Fileservers and database servers use a separate, special root password, and do not use the global password file. Cluster machines use a root password belonging to Clusters. Certain passwords are made available to members of the coverage pool, but locally set and departmental passwords are seldom if ever available.

To get around this, you may create a simple setuid program in AFS. If you do so, make very certain that only you may access it. Following is source for a simple program to give you root privileges. Arguably you should make something more complex, which actually logs that you became root, but this is for when you need something quick and dirty.

#include <unistd.h>
int main(void) { setuid(0); execl("/bin/csh", "csh", (char *)0); return 1; }

Then become your admin self, and on certain systems, you may need to become root as well.

chown root (your binary)
chmod u+s (your binary)

As before, *make* *sure* only you can read it.

Root Instances

Another way to get root access is through the use of a root instance. A root instance looks like "userid.root". To get root access using a root instance, either log in with your root instance at a login prompt, or use su to switch to your root instance. Once the system verifies that you are in a group that is allowed root access on the local machine, it gives you tickets/tokens for your instance, and a root shell on the machine.

Which root instances can be used on a particular machine is set by the /etc/root.permits file. This file lists individual root instances and pts groups which may become root on that particular machine.

If you need a root instance and don't have one, ask the System Manager.

Service Account

Another useful back door is the service local account, which allows Computing Services people to log in on certain machines where they otherwise can't.

In general, the service local account password is unavailable to general use. The password can be obtained in special cases when absolutely necessary.

Solaris problems

One typical problem is fsck dying with signal 10 when / is full. The typical culprit is fsck trying to link more into /lost+found. This case can be dealt with by moving the disk elsewhere or booting with "boot net - noinstall", and then making / not full. In at least one case the /etc/passwd.*.idx files were corrupt; these can be removed the same way.

If a Solaris machine is failing to boot and drops to a shell, you will often see text like the following above the # prompt:

*** Unable to retrieve `root' entry in shadow password file ***

This does not indicate a problem in itself. All it means is that Andrew workstations do not use shadow password files, and that the stock Solaris tools don't deal with that. You need to look further back in the output to find relevant error messages.

License server problems

Should you happen to get beeped for something regarding a license server, /afs/andrew/acs/software/licenservers will show which machine each license server is running on. For the most part, all of our license servers work by grabbing some binaries/config files out of /afs/andrew/data/db/<package>, and placing them in /usr/<package> and /etc/rc.local.<package>. The three basic problems are

  1. the machine crashed,
  2. the process(es) died, or
  3. I messed something up.
Hence, the three basic solutions are
  1. reboot the machine,
  2. restart the process (/etc/rc.local.<package> usually works well), or
  3. complain to me.
[jf6b 9/23/98]

Calendar Server Problems

Should you get beeped about the calendar server being down, here's some quick info to get you started:

The Web Publishing System (AWPS)

There are two parts of the publishing system, the staging server (web2.andrew.cmu.edu, AKA publishing.andrew) and the production server (web3.andrew.cmu.edu, AKA www.cmu.edu). People FTP stuff onto the staging server, and either immediately or later, as they configure their individual web collection, it is published on the production server.

These notes assume that the problems are not specific to an individual user, i.e. we assume the Help Center has verified that it's broken for lots of people, not just a particular user.

Staging Server Problems

The staging server has the following unusual processes going on:

  1. authenticated second inetd/ftpd
  2. authenticated httpd
  3. event handler
  4. OpenLDAP server

Authenticated second inetd/ftpd

/etc/init.d/wpftpd starts this part up, starting /usr/webpub/bin/auth.ftpd. That keeps its authentication (as 'service.webpublish') in a subshell, and runs a second inetd (which is therefore authenticated) and spawns ftpds. The config file for this inetd is /etc/wpftpd.conf.

Therefore, the staging server should have two inetds running at all times.
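
A quick way to check the "two inetds" expectation is to count them in a process listing. The following is a minimal sketch; count_inetd is a hypothetical helper name, and it assumes the input is one command name (optionally with a path) per line, such as the output of "ps -e -o comm=".

```shell
# Sketch: count inetd processes in a list of command names read from stdin.
# count_inetd is a hypothetical helper, not part of the webpub tools.
count_inetd() {
    # Compare the basename of the first field so that both "inetd"
    # and "/usr/sbin/inetd" are counted.
    awk '{ n = split($1, parts, "/"); if (parts[n] == "inetd") c++ }
         END { print c + 0 }'
}
```

On the staging server, "ps -e -o comm= | count_inetd" would be expected to print 2; anything less suggests the authenticated inetd started by /etc/init.d/wpftpd is not running.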

The ftpd is really "ftpd.checkp", which is a modified ftpd which gets permission from the program /etc/check_ftpuser.

check_ftpuser is a binary which queries the LDAP server to find out if it's OK for a given user to see/manipulate a given file.

Authenticated httpd

If you find that people can get to the admin pages on the staging server, but all "view staged content" links return 403 errors, the daemon might not be running authenticated, so restart it.

The server uses rjs3's "apacheath" collection, which uses a modified apachectl which calls reauth. If you need to restart the daemon, apachectl stop and apachectl startssl will do the trick as with any other SSL server.

Event handler

The event handler is a perl process which checks for publishing events that have appeared in the Oracle table which holds said events. When it finds an event of status "pending", it does the publishing stuff (which is a series of commands ssh'd to the production server) and updates the event to "complete" or "failed" as appropriate.

The handler requires three processes. If you grep for 'handler' in the ps list you should find 'handler.wrapper', which is a perl nanny process; kill -TERM it and you take out the whole handler complex.

The wrapper spawns (and respawns if necessary) 'auth.handler', which is yet another pagsh script which maintains authentication for the real workhorse, which is 'handler', a perl script.

The only reason we've seen for the handler to die (which causes auth.handler to exit, which causes handler.wrapper to launch it again) is if the local LDAP server goes down. Of course, that itself has a nanny, so we hope it comes back up, at which point the handler's nanny will be able to successfully relaunch the handler. So if the handler keeps dying, there might be a problem with the LDAP server.
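
When checking the handler complex, a plain grep for 'handler' matches all three processes at once, so it is easy to miss that one is absent. The sketch below reports which of the three are missing from a process listing; missing_handlers is a hypothetical helper name, and it assumes the listing (for example from "ps -ef") shows each script name as a separate word, with or without a leading path.

```shell
# Sketch: report which handler processes are absent from a ps listing on stdin.
# missing_handlers is a hypothetical helper; the three script names are the
# ones described above. Prints nothing if all three are present.
missing_handlers() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                n = split($i, parts, "/")
                seen[parts[n]] = 1      # record the basename of every word
            }
        }
        END {
            if (!("handler.wrapper" in seen)) print "handler.wrapper"
            if (!("auth.handler" in seen))    print "auth.handler"
            if (!("handler" in seen))         print "handler"
        }'
}
```

If "ps -ef | missing_handlers" repeatedly prints handler or auth.handler while handler.wrapper is present, the wrapper is respawning them and they keep dying, which points back at the LDAP server as described above.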

OpenLDAP server

Not much to this, really; there's stuff in /usr/openldap/etc/openldap (cmuweb.conf, schema/cmuweb.schema) which configures things for us, but that's about it. If it dies, you'll need to figure out why, and possibly restart the server (/etc/init.d/openldap restart; see the general LDAP section of this FAQ for details).

Production Server Problems

We haven't really found problems that could crop up on the production server yet that are different from normal web server issues: since the production server's httpd doesn't run with kerberos, it's not difficult. Just make sure you use '/usr/www/bin/apachectl startssl' instead of just 'start'.

The only thing interesting is the script /usr/webpub/bin/webmount, which should be run hourly by cron, so you don't need to watch for any dead processes. This updates the cmualiases.conf and andrewaliases.conf files in the web server's conf directory when new URLs are added for collections.

My Andrew

The content and driver CGI for MyAndrew are in a Web Publishing collection called myandrew, which mounts on web3's disk as /collections/myandrew. Related collections such as myandrew/software are in there as well: grep 'myandrew' in /usr/www/conf/cmualiases.conf for particulars.

MyAndrew access is forced to the secure port: see the httpsdirect.conf file. The front door (/myandrew/index.html) is open for access without authentication.

Authentication-requiring services pass through the CGI at the URL /myandrew/auth (/collections/myandrew/auth/myandrew.pl). This script accepts a few variables that tell it which service is being requested, such as the Web Publishing system, and redirects accordingly.

Doing machine downloads

During your time with the beeper, you may be called upon to redownload a machine for one reason or another. Typically machines are built using a download service running on a host named (hosttype)build, for instance, HPs from hpbuild, DECstations from decbuild, SunOS suns from sunbuild, Solaris suns from sunbuild2, and Linux machines from linuxbuild.
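
The naming pattern can be written down as a small lookup. This is only a sketch of the list above: build_host and the type keywords (hp, dec, sunos, solaris, linux) are hypothetical names chosen for the example, not part of the download service.

```shell
# Sketch: map a machine type to its download (build) host, per the list above.
# build_host and the type keywords are hypothetical names for illustration.
build_host() {
    case $1 in
        hp)      echo hpbuild ;;
        dec)     echo decbuild ;;
        sunos)   echo sunbuild ;;
        solaris) echo sunbuild2 ;;   # note: not "solarisbuild"
        linux)   echo linuxbuild ;;
        *)       echo "unknown machine type: $1" >&2; return 1 ;;
    esac
}
```

The Solaris entry is the one exception to the (hosttype)build pattern, which is the main reason to spell the mapping out.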

[cg2v 11/20/02]



last updated $Id: FAQ.coverage.html,v 1.27 2003/07/16 21:28:29 rjs3 Exp $