The users. [5/21/95 dl2n :]
The practical implication of this is that you need to consider the impact of things you do (and things you don't do) on the people who use the system. Whenever you don't promptly answer a call, whenever you fail to promptly restore a service, someone's work is being impacted. There are trade-offs involved. There will be times when you'll have the opportunity to track a problem or restore service, but not both simultaneously. If you need guidance, please ask, but you should be able to use your best judgement on this sort of thing.
When carrying a beeper, you will be expected to be within range of getting paged (it's a ground-based system, making range about 50 miles around Pittsburgh) and situated such that you can service problems that occur. Occasionally a problem will require a trip to campus, but it's comparatively rare. Sometimes, hardware fails, and you may need to have a disk replaced and restore its contents from amanda or stage (the unix and AFS backup systems). Typically this will not be the case.
You will also be expected to read org.acs.asg.coverage, and when you fix something (if you have a beeper or not) to post about it there, including what you did to resolve it. Use a subject line which is likely to be meaningful later if someone is looking back to find the problem.
When running a "bos" command you must either be
authenticated as an admin user, or logged into a
fileserver or database server. If you are logged into a server you
must add the -localauth option to the end of the command line in
order not to need to authenticate as an admin user.
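As a rough illustration of the rule above, here is a small Python sketch that assembles a bos command line and appends -localauth only when you are working on a server. The function name and the on_server parameter are hypothetical; the bos subcommands and flags are the ones this document mentions.

```python
def bos_command(subcommand, host, *args, on_server=False):
    """Build the argument list for a bos invocation.

    on_server=True means we are logged into a fileserver or database
    server and want -localauth instead of admin authentication.
    """
    argv = ["bos", subcommand, host, *args]
    if on_server:
        # -localauth goes at the end of the command line.
        argv.append("-localauth")
    return argv


print(" ".join(bos_command("status", "vice2", "ptserver", "-long",
                           on_server=True)))
```

The returned list can be handed to subprocess.run unchanged, which avoids shell quoting problems.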
AFS database services, which include the ptserver, and
vlserver, run on vice2,7,11,12,28. The ptserver maps Kerberos
principals (usernames) to AFS ID numbers, and also provides groups
for AFS ids. The vlserver keeps track of where all copies of all
volumes in a cell live. For
example, user shadow's home directory is in a volume named
user.shadow, and using the command vos examine user.shadow, I might
find out that volume is on the fileserver on vice1, on partition
vicepe.
Things which will happen to these machines are crashes of one or
more of these processes, crashes of the machine, malfunctions of
one of the servers, generally the kaserver, or meltdowns of one of
the servers, generally the ptserver. When a server process
crashes, you should check the logs in /usr/afs/logs on the machine
on which it is running to see if you can get any clues as to what
caused it. You can use the command:
bos status (machinename) (server process name) -long
to check on whether a server process has crashed and left a core.
The core will be in /usr/afs/logs, named core.(processname),
for instance, core.snmpd.
This is useful for the AFS maintainer. If you are knowledgeable or
feeling ambitious you may wish to copy the core and the server
binary to another machine of the same system type and attempt to
debug it, but if you do, please encrypt the core, as it may include
sensitive data, like our AFS cell key. Generally you should watch
to see that the server process comes back up and begins answering
requests again. If it does not, and you do not know how to
proceed, you should contact the System Manager or the Beeper
Coordinator. You can use /usr/local/bin/des on the fileserver
and ~shadow/bin/des on Andrew Solaris machines for encrypting the
core.
If the whole machine crashes, you should see if there are any
messages on the console. If the machine is at an "ok" prompt, type
reset, then if you get another "ok", type boot. In any case, watch
the machine come up, then log in and check the system logs, in
/var/log/syslog and /var/adm/messages, to see if you can ascertain
the cause of the crash. Watch the logs from the servers in
/usr/afs/logs to make sure everything comes back up nicely.
Occasionally we have been known to experience slowdowns in
authentication service. While none of the servers are down, the
sync site, the site which coordinates changes to the database, does
not recognize one of the other servers. You can use the command:
udebug (host name) 7004 to check for this.
Try one of the five machines. It will either tell you which
machine is the sync site, or will give you information which
includes a string like "I am sync site until 60 seconds from now".
In the output from the sync site, look for the dbcurrent flag on
one of the servers to be 0, and/or a "last vote was no". These are
clues of a problem. In the event of a "last vote was no", wait 90
seconds to see if it happens again. In the event this seems to be
the problem, if you believe it to be critical to restore service,
you may proceed by running the command:
bos restart <host name of sync site> kaserver
You do of course need to know which host is currently the sync
site, as explained above.
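The checks described above can be sketched in Python as follows. This is a minimal sketch that looks only for the strings this section names ("I am sync site", a dbcurrent flag of 0, "last vote was no"). It assumes each remote server's stanza starts with a line of the form "Server (address):"; the sample text and addresses are made up, and real udebug output has more fields.

```python
import re


def summarize_udebug(text):
    """Return (is_sync_site, suspects) for one host's udebug output.

    suspects maps a server address to the warning strings found in
    that server's stanza.
    """
    is_sync_site = False
    suspects = {}
    current = None
    for line in text.splitlines():
        if "I am sync site" in line:
            is_sync_site = True
        stanza = re.match(r"Server \(([^)]+)\)", line)
        if stanza:
            current = stanza.group(1)
            continue
        if current is None:
            continue
        for clue in ("dbcurrent=0", "last vote was no"):
            if clue in line:
                suspects.setdefault(current, []).append(clue)
    return is_sync_site, suspects


# Illustrative sample only; addresses and layout are assumptions.
SAMPLE = """\
I am sync site until 60 seconds from now
Server (10.0.0.7): (db 100.1)
    last beacon sent 12 secs ago, last vote was yes
    dbcurrent=1, up=1 beaconSince=1
Server (10.0.0.11): (db 100.0)
    last beacon sent 12 secs ago, last vote was no
    dbcurrent=0, up=1 beaconSince=1
"""

print(summarize_udebug(SAMPLE))
```

A non-empty suspects result from the sync site is only a hint; as noted above, wait and re-check before restarting anything.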
The other condition which has been known to exist is called a
meltdown. It generally happens to the ptserver, although this has
not been a problem as of late. The ptserver has
a set number of "threads" in it taking care of incoming requests.
If for some reason it gets bogged down and is unable to finish
transactions it is possible for all the threads to get tied up and
the number of waiting requests to skyrocket. You can use afsmon to
check for a high number of wait procs on the ptservers by running
afsmon, then click the Wait_Proc button, then start.
Any servers which are bogged down can be
determined as above and restarted using the command bos restart
<host name> ptserver. Note though that again unless you
are confident it is probably a good idea to contact the System
Manager or the Beeper Coordinator before doing this. As above,
watch for things to come up ok, and also watch for the waiting
processes to stabilize, as the problem is likely to recur.
Only the meltdown problem is truly time-critical. A failure of any
single server process on any one machine, or of any one machine,
should not cause loss of service because of the manner in which AFS
deals with databases, and because we run five database servers
(any number greater than one would do).
The AFS database servers also run the kerberos key distribution server
(kdc) and admin servers. The kdc is run by the bosserver, just like
the native afs database servers. Unlike the afs database services,
kerberos
uses a dedicated-master model, where one of the servers is assigned a
role that allows it to update the database (add users, change
passwords, etc). The master distributes database changes to the
slave servers.
The master server (vice28) runs the following additional services:
The slave servers run the following additional service:
Some ipropd-slave failures require restarting ipropd-master as well.
If ipropd-slave has died on a db server and won't start, try restarting
ipropd-master on vice28.
We have a number of fileservers. These are all of the vice
machines not included above. Actually, the database servers run
file servers, but user data should never be stored on those
machines. Instead, the space exists for testing and for the
Operators to restore AFS volumes onto. A current list of
fileservers and the specific functions they fill can always be
found in /afs/andrew.cmu.edu/acs/asg/fileservers.
Most file servers are now equipped with RAID arrays for the data
partitions. On these machines the "a" partition is typically
empty or filled with test data. See the "fileservers" file.
Functions which fileservers perform are user servers, development
(or dev) servers, and replication (or rep) servers. User servers
contain data which includes user home directories, but more generally
any data which is not system software, and which is not replicated.
When one of these machines or the fileserver process on it crashes,
the data on it becomes entirely unavailable, and that should be
borne in mind when you are deciding how and how fast to deal. Dev
servers include writable copies of system software. When one of
these machines or the file server process on it crashes, only data
which is not replicated becomes unavailable. This will generally,
but not always, affect only machines which are "beta" linked
machines. The owners of these machines have volunteered to run
software which has not been extensively tested. Dev server crashes
are one of the pitfalls of beta linked machines. No development
will be able to be done (or actually very little) when one of these
machines goes down, but it is not nearly as critical as a user
server. Rep servers provide readonly copies of various system
software. Generally there is at least one readonly copy of any
package; for some system types, there are more. It is not safe to
assume there is more than one replica, so this is more critical
than a dev server crash: if only one readonly copy of an AFS
volume exists, AFS cannot fail over transparently as it does with
multiple readonly copies.
You can use the command bos status (host name) fs -long
to check if a fileserver is still running or has crashed.
AFS database service
Using BOS
AFS fileservers
There may be cases where the server process is functioning normally, and hence shows "up", but not all partitions are online. You should check the FileLog on the server to make sure all the partitions were attached successfully. If not, make sure the partition(s) in question are functioning (powered up, SCSI cable attached, visible to the system, fsckable, mountable and mounted) and then try a fileserver restart with bos restart viceN.fs.andrew.cmu.edu fs. If that fails, you can reboot the server. If that fails too, contact the System Manager or the Beeper Coordinator.
The same caveat as with the database servers applies if there is a core file. As with database servers, if either the file server process or the machine crashes, please look at the system logs to attempt to find out why. Also watch the FileLog and SalvageLog in /usr/afs/logs. Specifically, watch for volumes which were not able to be attached in the former, and files which were deleted from volumes in the latter. Files named .gopherrc and .ircmotd often get deleted; this is a function of the applications which use them, and is nothing to worry about. If other files are deleted you should notify the System Manager or the Beeper Coordinator. To find out about unattached volumes you may use the command vos listvol (host name) (partition letter) > /tmp/somefile, then look through the output for an unattached volume, and attempt to bring it back online as below.
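Scanning a saved vos listvol listing for unattached volumes can be sketched in Python. This is a minimal sketch that assumes the usual layout, where each volume line ends in its state ("On-line" or "Off-line") and volumes that could not be attached are reported on lines wrapped in asterisks. The sample listing is invented, and the exact wording may differ between AFS versions.

```python
def find_unattached(listvol_text):
    """Return human-readable notes for volumes that are not online."""
    problems = []
    for raw in listvol_text.splitlines():
        line = raw.strip()
        if line.startswith("****"):
            # e.g. "**** Could not attach volume 536870921 ****"
            problems.append(line.strip("* "))
        elif line.endswith("Off-line"):
            problems.append(line.split()[0] + " is Off-line")
    return problems


# Illustrative sample only; volume names and IDs are made up.
SAMPLE = """\
Total number of volumes on server vice1 partition /vicepe: 3
user.shadow                       536870915 RW       1234 K On-line
user.example                      536870918 RW       5678 K Off-line
**** Could not attach volume 536870921 ****

Total volumes onLine 1 ; Total volumes offLine 1 ; Total busy 0
"""

for note in find_unattached(SAMPLE):
    print(note)
```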
To attempt to bring a volume back online, do:
Be sure to include all the arguments in the correct order to prevent the fileserver from going offline. Sometimes this fails to bring the volume back online. In that case, do: vos dump (volume) -t 0 > /dev/null (If you forget the "> /dev/null", and ^C the process, see below). Getting the arguments wrong will mean that bos thought you wanted to salvage either the entire partition or entire machine, both of which require bringing the fileserver process down.
If a volume is busy, isn't being backed up or moved, and won't unlock, you can either wait for the timeout, or kill -15 the volserver process on the fileserver with that volume by logging in and sending a signal to the process. If you don't know how to do this, it's probably better to wait.
Sometimes problems will be caused by disk errors. You can check in the system log files, as detailed for database servers, or check the output of the command "dmesg". This is an excellent thing to check for if you see a crash but don't have a core file or anything in the logs. Fileservers very rarely go "poof". If you have a disk error, please contact the System Manager or the Beeper Coordinator before proceeding, as a disk will likely need to be copied before it gets any worse, and user data will need to be restored. You can find a summary of log messages on the bboard org.acs.asg.log.syslog; emergency reports are mailed when received, otherwise there is a weekly summary.
Occasionally a fileserver will appear to hang, when in reality a client, or a few clients, are sending a heavy stream of requests which the file server cannot keep up with. (XXX How do you do detection for this Walter? My way is evil) It may be necessary to reboot a machine, or to contact DataComm and have a machine disconnected from the network.
A file server may also fill up. If this happens, run vos listvol <host name> <partition name> > /tmp/somefile to get a list of the volumes on the full partition. Try doing a vos backup <volume> on large AMS volumes, as this will often bring the usage down a lot. If that's not enough, check the file "/afs/andrew/acs/asg/fileservers" to see what type of server it is. Do vos partinfo <server> on servers of the same type to find the one with the most free space; if there's a partition on the same server with enough spare room, use that, since it's more efficient. Look for a volume of about the right size to reduce the percent full below the threshold, then do a vos move <volume> <fromserver> <frompart> <toserver> <topart> -verbose. You may also check ~shadow/scripts/Scout for a quick terminal display of the servers. sentinel also provides a summary of disk usage so you don't have to do a partinfo on every server. Generally, the volume balancer, which runs Friday nights on the machine flying.andrew.cmu.edu, keeps partitions at roughly the same usage, so this is not often necessary.
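Picking a destination from several vos partinfo outputs can be sketched in Python. This is a minimal sketch that assumes each partinfo line reads "Free space on partition /vicepX: N K blocks out of total M". The server names and numbers below are made up, and the function name is hypothetical.

```python
import re

PARTINFO_LINE = re.compile(
    r"Free space on partition (\S+): (\d+) K blocks out of total (\d+)")


def roomiest_partition(partinfo_by_server):
    """partinfo_by_server maps a server name to its vos partinfo text.

    Returns (free_kblocks, server, partition) for the partition with
    the most free space, or None if nothing could be parsed.
    """
    best = None
    for server, text in partinfo_by_server.items():
        for partition, free, _total in PARTINFO_LINE.findall(text):
            candidate = (int(free), server, partition)
            if best is None or candidate > best:
                best = candidate
    return best


# Illustrative sample only.
SAMPLE = {
    "vice3": "Free space on partition /vicepa: 120000 K blocks out of total 4000000\n"
             "Free space on partition /vicepb: 900000 K blocks out of total 4000000\n",
    "vice4": "Free space on partition /vicepa: 450000 K blocks out of total 4000000\n",
}

print(roomiest_partition(SAMPLE))
```

This only ranks by free space. It does not apply the preference stated above for a partition on the same server, which is still a judgement call.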
TODO LIST for this section
Amanda is the Advanced Maryland Automatic Network Disk Archiver,
which runs on the machines amanda, amanda2 and amanda3, and backs up
machines listed in /usr/sbin/amanda/normal/amanda.disklist. It runs
a special client which runs dump on the remote machine and pipes
the output over the network to the amanda server.
You may be beeped because the amanda spool disk is full, because
amanda couldn't read its configuration files, or because amanda is
confused about its tapes. Amanda's files live in
/usr/sbin/amanda/normal. Binaries live in /usr/amanda on
clients. Helper programs can be found in /usr/local/bin. The
disklist is copied from
/afs/andrew/data/db/amanda/normal/amanda.disklist when that file
changes. If you need to add to it, the format is:
hostname disk-device-without-preceding-/dev/ method
The method is generally krb-comp-encrypt, krb-comp-encrypt-root,
krb-encrypt, or krb-encrypt-root. Root filesystem dumps have a
lower priority generally as less data changes there. comp and
encrypt enable compression and encryption on the network
respectively. We only back up division machines with amanda, BTW.
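Before adding a line to the disklist, a quick sanity check along these lines may help. This is a minimal Python sketch based only on the three-field format and the four methods listed above; other methods may exist, so an unknown method is reported as unusual rather than wrong. The hostname and device in the example are hypothetical.

```python
KNOWN_METHODS = {
    "krb-comp-encrypt",
    "krb-comp-encrypt-root",
    "krb-encrypt",
    "krb-encrypt-root",
}


def check_disklist_line(line):
    """Return None if the line looks sane, else a short complaint."""
    fields = line.split()
    if len(fields) != 3:
        return "expected 3 fields: hostname disk method"
    _host, disk, method = fields
    if disk.startswith("/dev/"):
        return "drop the leading /dev/ from the disk device"
    if method not in KNOWN_METHODS:
        return "unusual method: " + method
    return None


print(check_disklist_line("host.example.com sd0a krb-comp-encrypt"))
print(check_disklist_line("host.example.com /dev/sd0a krb-comp-encrypt"))
```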
You can find instructions on how to do a restore from Amanda via
http://asg.web.cmu.edu/isam/howto/amanda.html.
Man pages exist for all the amanda
helper commands.
Generally you will not need to worry about amanda.
We use the stage backup system to back up AFS volumes. It predates
the newer AFS backup system, but provides features still not found
in that system.
Stage runs on the machines backup1 and backup2.
Each machine backs up a different subset of volumes. backup1 backs
up project volumes, system software, and other data.
backup2 backs up user volumes. Things you will
generally be required to do include dealing with a full
spool disk (you don't; it will handle itself, unless the condition is
ongoing over several days, in which case you should contact the
System Manager or the Beeper Coordinator) and dealing with an errant
tape changer (powercycling the changer and/or manually hitting the
eject button may help; if not, contact the System Manager, the Beeper
Coordinator or someone from CMG). Occasionally, stage's database
server (which runs on backup2) will experience problems. Assuming
the tdbserver process is not running, and there are no clear errors
in /usr/stage/log/tdbserver.log, the server may be restarted by running
/usr/stage/bin/tdb_initrec, followed by /usr/stage/bin/tdbserver up.
Also, occasionally an operator will start to restore
a volume and fill a partition on a fileserver or abort the
restore. This can occasionally leave part of a restored volume
around, and subsequent restore attempts fail. If you first make
sure what you're deleting is the correct bad volume, you may use
the command vos zap <server> <partition> <numeric volume id>
or vos delentry <volume name> to remove a failed restore. If you
have doubts, please contact the System Manager or the Beeper
Coordinator.
At this point in time, stage again requires restores to be run
on the machine that hosts the dumps (backup2 for recently
dumped user volumes, backup1 for everything else). Use the
"search", "restore" and "spoolout" commands to find the
correct set of dumps, queue them to be read from tape (or
disk), and write them out to a fileserver respectively.
Typically, operators handle loading tapes for restores for
you. If you are in a hurry and want to do it yourself, what
you do depends on what tapes you end up needing.
If the data is on 8mm tapes (those named afs.<foo> or
{sum,win,spr}.yynn for yy less than 99), then the restore
procedure will be started automatically by cron, and messages
will appear on the server's console (and in zephyr messages to
class backup, instance <first hostname component of backup
server>) indicating what tape should be inserted in the
drive. After each tape is used, it will be ejected, and a new
console message will appear.
If the data is on DLT tapes, and a non-changer DLT drive is
being used by stage, the same procedure should be used (as of
11/20/02 this is always the case; there are no changers).
If you are using a DLT changer, you must manually start the
extraction process. Run /usr/stage/script/extractmgr. After
you confirm which drive the restores are being run on, the
tapes that are needed to complete the pending restores will be
listed, and you will be prompted to load the changer with as
many as possible. Once this is done, and you have pressed
return, the system will run through all the loaded tapes and
process them. If more tapes are needed after this further
prompting will occur.
If the dumps are on a raid disk ("drum") instead of a tape (as
of 11/20/02, most dumps are done this way; the exception is
dumps in the "archive" and "purged" groups), you need to run
"drum fetch rest" after all the restores have been queued, and
before you attempt to spool them out to a fileserver.
The Cyrus mail system is our replacement for
AMS. It currently runs on the machines mail[1-4].andrew.cmu.edu,
mail-fe[2-6].andrew.cmu.edu,
mupdate1.andrew.cmu.edu, and imsp1.andrew.cmu.edu.
It's also possible
that the process locking the mailbox is waiting for a lock
somewhere else; run truss to determine if it's
waiting for another resource. Try to find the problem
process, not the symptom process.
In general, processes are supposed to time out and
release their resources, so this is a bug. If it's
crucial for things to start moving again, you can try
killing the process; otherwise, you might want to tell the
Cyrus wizard (if any). [leg 16-nov-00]
The Cyrus IMAP Aggregator is a system that allows a single IMAP
namespace to exist (and be accessed by any IMAP client) across
multiple machines. To do this, there are 3 types of machines involved
in mail storage: frontends (mail-fe*), backends
(mail*), and a MUPDATE server
(mupdate1.andrew.cmu.edu).
The backends are normal IMAP servers with a bit of added
intelligence whenever a mailbox operation is performed. In other
words, clients can connect directly to a backend and have a typical
IMAP session, except they will not be able to see all of the mailboxes
that exist on the murder. In fact, this is how we ran the beta test:
mail1 continued to be the primary point of access for most users, but
was actually a full member of the murder. We currently have 5
backends (mail1, mail2, mail3, mail4, and mail5), with mail5 being used
primarily for testing purposes.
The frontend servers are fancy IMAP proxies. They are fancy because
they can switch servers mid-session and perform some operations
(such as LIST) locally. Otherwise, they proxy requests to the
appropriate backend server, or (client-willing) refer the client to
the backend directly. Frontends are, for all intents and purposes,
identical, and exist as a loadbalanced pool. If one goes down and
is removed from the pool, the only clients who lose are the ones who
were connected to it at the time. Otherwise, no one should notice or
care what frontend they get. We currently have 6 frontends
(mail-fe[1-6]), with mail-fe1 being used
primarily for testing purposes.
The Aggregator requires an authoritative server to manage the
namespace. The MUPDATE server performs this task. There is also
a slave MUPDATE server running on each of the frontends, to allow
the frontends a local copy of the entire mailbox namespace. When
updates happen on the master they are pushed to the frontends in
a relatively short period of time.
Mail delivery happens to the aggregator via a process that runs
on the mx and smtp servers called lmtpproxyd. This process
receives a message, queries the mupdate server for the location of its
destination mailbox, and forwards the message along to that server.
Note that we currently run sieve on the backend machines only, so all
of a user's mailboxes must live there (user.rjs3 and user.rjs3.spam
couldn't live on different backend servers, for example).
The murder is designed so that restarting master on a machine
(except for the mupdate server, which is, for the most part, considered
authoritative) should bring it in sync with the state of the world
as the murder sees it. That is, when a frontend restarts, it gets a
fresh copy of the entire mailbox list. When a backend restarts, it
performs a series of checks between its local mailbox list and what
the MUPDATE server thinks it has, and either updates the MUPDATE
server (if it has a mailbox that is not in MUPDATE, or the MUPDATE
data is stale), or deletes the local mailbox (if MUPDATE claims the
mailbox is hosted on another server). Note that if mailbox transfers
fail a certain way, it may be necessary to restart both the source
and target servers.
A common problem we are currently experiencing is that mon will
report a "timeout" for the mupdate server on a frontend. If this is
the case, typically this means that fdsync() calls have started
to take a very long time (due to a Solaris bug). Generally, running
/etc/remount-root will fix this problem. There is a cron job that
runs this on a somewhat daily basis. Presumably this will be fixed
if we upgrade to Solaris 10 or manage to wrangle a patch for the
kernel out of Sun (unlikely). [rjs3 5/19/03]
Mailboxes have two parts to their location, just like AFS:
a server and a partition on that server. To get this
information for a given mailbox, use the info command
in cyradm. If you need to find the physical files on
a given server, look in /imap/<partitionname> on the
correct backend (the command mbpath on the backend
will help).
Amanda workstation backup system
STAGE backup system
Cyrus Mail System
The Cyrus IMAP Aggregator (Murder)
When things go wrong
Locating a mailbox
Sendmail mail routing
Sendmail is the world-infamous MTA (mail transfer agent) that we use for moving electronic mail through a maze of twisty-little passages. It runs on every machine, but most machines run very stupid configurations of it.
When things go very wrong, complain to Larry.
# /usr/sbin/sendmail -bs
220 web3.andrew.cmu.edu ESMTP Sendmail 8.12.0.Beta7/8.12.0.Beta3; Wed, 22 Aug 2001 15:00:28 -0400 (EDT)
The binary version is the first number (8.12.0.Beta7) and the configuration file version is the second number (8.12.0.Beta3). It's very bad when the config file is more recent than the binary. Regardless, make sure the machine is running sane versions of the sendmail collection (in local) and the smailcf collection (in host). They should both either be the current beta release or the current gamma release.
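The banner check can be sketched in Python. This is a minimal sketch that assumes the banner carries the "binary/config" pair right after the word Sendmail, as in the example, and that version strings look like dotted numbers with an optional ".BetaN" suffix. Treating a beta as older than the matching final release is also an assumption, and other suffixes will raise an error rather than be guessed at.

```python
import re


def banner_versions(banner):
    """Return (binary_version, config_version) from an SMTP greeting."""
    match = re.search(r"Sendmail (\S+?)/(\S+?);", banner)
    if match is None:
        raise ValueError("no Sendmail version pair found")
    return match.group(1), match.group(2)


def version_key(version):
    """Sort key: numeric parts first, betas before the final release."""
    match = re.fullmatch(r"(\d+(?:\.\d+)*)(?:\.Beta(\d+))?", version)
    if match is None:
        raise ValueError("unrecognized version: " + version)
    numbers = tuple(int(part) for part in match.group(1).split("."))
    if match.group(2) is None:
        return (numbers, 1, 0)
    return (numbers, 0, int(match.group(2)))


def config_newer_than_binary(banner):
    binary, config = banner_versions(banner)
    return version_key(config) > version_key(binary)


BANNER = ("220 web3.andrew.cmu.edu ESMTP Sendmail "
          "8.12.0.Beta7/8.12.0.Beta3; Wed, 22 Aug 2001 15:00:28 -0400 (EDT)")
print(banner_versions(BANNER), config_newer_than_binary(BANNER))
```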
Log in and try to determine what's wrong. Generally, rebooting these systems is safe. If it keeps happening, it might be due to a denial-of-service attack (intentional or otherwise) from some client machine. See below.
mail1 (the IMAP server) is generally the biggest culprit of this, so start by looking at it if there are a large number of messages queued for it. However, the smtp machines will queue mail for people who are over quota for up to 5 days, so generally mail queued because of over quota reasons isn't very interesting.
# /usr/local/bin/kauth -n smtp.smtp6 -f /etc/srvtab -- /usr/sbin/sendmail -q -v
where smtp6 is the first part of the hostname. You can use -qRcmu.edu to run only *cmu.edu* mail. There is also a script to run andrew.cmu.edu mail (with proper authentication) in /afs/andrew/acs/asg/coverage/forceq. This is useful if delivery has stalled for a reason that has been resolved and you want to restart mail delivery immediately.
# cat >> /etc/mail/access
badhost.example.com REJECT
# /usr/local/sbin/makemap hash /etc/mail/access < /etc/mail/access
More permanently, edit /afs/andrew/data/db/sendmail/etc/mail/access. This will be integrated into the Sendmail servers every 5 minutes.
# mkdir /var/spool/mqueue/hold-mar-05-2002
# cd /var/spool/mqueue
# /afs/andrew/acs/asg/coverage/qtool.pl -e '$msg{sender} =~ /zx22/' hold-mar-05-2002 q1
where hold-mar-05-2002 is an arbitrary hold directory
and /var/spool/mqueue/q1 is the mail queue with all the
junk in it. If there are multiple queues with junk in them,
you can specify multiple source directories at the end of the
qtool.pl command line.
You can now use mailq to make sure you got everything or mailq -oQ/var/spool/mqueue/hold-mar-05-2002 to see that everything in the hold directory is junk. If it is junk, feel free to remove the directory. [leg 03/05/02]
Webmail service is provided on the loadbalanced pool of webmail machines, using the host sqmail collection, which contains our local version of Squirrelmail, an open-source PHP based IMAP client. Each of the webmail servers runs an SSL-capable apache, which uses webiso to authenticate users.
Because webmail must maintain session information, as well as preferences information that is shared among all the servers, the /afs/andrew/data/db/squirrelmail tree exists. While the sess/ tree should basically maintain itself, the prefs/ tree (which contains a large number of volumes, which are used based on hashes of the username) may occasionally have a volume fill. When that happens, some users will not be able to write out preferences files, address books, or upload attachments. The most likely cause is attachments that haven't been properly garbage collected. The attachment temporary files have fairly obvious names (they look like hashes), as opposed to the preferences and addressbook files (which are username.pref and username.abook). Deleting any attachments that are older than a half hour or so should be safe. If this doesn't clear enough space, just increase the quota on the volume.
There is also a webmail bugzilla project which may be enlightening for known issues. If something goes wrong that you can't fix, complain to Rob or Larry.
To bring up a new webmail machine, just copy the package.proto, pubcookie key, and SSL keys and certs from an existing machine. Do NOT run keyclient, since it will invalidate the keys on the other machines. (alternatively, you can run keyclient -d to just download the current key)
If operations gets called about processes on unix servers, the basic policy is that the coverage staff should *not* kill the processes if possible. But they can feel free to lower the priority (nice +20) on the processes.
The exception to this rule is if the jobs appear to be runaway or if the processes are "seriously" hurting performance for everyone else. That is, use your best judgement, or try to contact John Lerchey if you don't want to make the decision.
At any time that you have to do something, please save the ps output and send that, along with the unix server and the time to John Lerchey. [wcw 15-feb-1996]
Every so often someone will accidentally or intentionally set off a fork bomb on a Unix server. This is when a process gets into a loop where all it does is call fork repeatedly, causing lots of processes to come up, each of which is also calling fork until it can't fork anymore.
When this happens, you can su root, su user, and kill -9 -- -1, which blows away all the user's processes (including the shell you're running). "-1" is a special argument for kill that says "kill all my processes".
Anything else is ineffective, since the processes are spawning as quickly as they can be killed.
Because we (Derrick) put in user limits after the last round of fork bombs, this probably won't be necessary, but it's here if it is.
There's an Operating Systems project where students write shells, and as a result this seems to happen a few times at the beginning of the semester. [tjs 12-oct-1998]
LDAP
This section has a lot of Q&A about the LDAP service running on campus. The
person carrying beeper would not be asked to fix many of the horrors listed
herein unless most of the LDAP expertise in the group were suddenly wiped out
in one swift stroke. But the care and feeding of this system should be written
down and preserved, and this is as good a place to start doing that as any.
All questions are preceded by a "Q: " string, and the question is written in upper case letters. You may search for keywords in the questions by doing upper case only searches.
Q: WHAT THE HECK IS LDAP?
Good place to start. LDAP (Lightweight Directory Access Protocol) is a database that holds records for all people and accounts on campus. The database has a tree structure similar to a filesystem so that entries can be placed in a hierarchical system. Each entry is identified by a "distinguished name" or "DN", which reads like the path to a file in a filesystem, except the divider between components is a comma instead of a slash. The order of the components in the DN is from specific to general. Example:
uid=adamson,ou=Account,dc=andrew,dc=cmu,dc=edu
This shows an account in the Andrew system at Carnegie Mellon. There are trees
for several different groupings:
ou=Person,dc=cmu,dc=edu               Humans past, present, and future on campus
ou=Account,dc=cmu,dc=edu              cmu.edu accounts
ou=Account,dc=andrew,dc=cmu,dc=edu    Andrew accounts
ou=Account,dc=cs,dc=cmu,dc=edu        Computer Science accounts
ou=AuthEntity,dc=cmu,dc=edu           Special authentication identities
If you were to retrieve an entry from LDAP you would find that it consists
of attribute=value pairs, with attributes like "common name" (cn), "home
telephone number" (homePhone), and "computer accounts" (cmuAccount).
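To make the filesystem analogy concrete, here is a minimal Python sketch that turns a DN into a path-like string (general to specific) and tests whether an entry sits under one of the trees listed above. It splits naively on commas, so it assumes no component contains an escaped comma. The function names are hypothetical.

```python
def dn_components(dn):
    """Split a DN into its components, most specific first."""
    return [part.strip() for part in dn.split(",")]


def dn_to_path(dn):
    """Render a DN the way a filesystem path would read."""
    return "/" + "/".join(reversed(dn_components(dn)))


def is_under(dn, base):
    """True if the entry named by dn lives in the tree rooted at base."""
    entry = [part.lower() for part in dn_components(dn)]
    root = [part.lower() for part in dn_components(base)]
    return len(entry) >= len(root) and entry[-len(root):] == root


DN = "uid=adamson,ou=Account,dc=andrew,dc=cmu,dc=edu"
print(dn_to_path(DN))
print(is_under(DN, "ou=Account,dc=andrew,dc=cmu,dc=edu"))
print(is_under(DN, "ou=Account,dc=cmu,dc=edu"))
```

The last call returns False: the Andrew accounts tree and the cmu.edu accounts tree are siblings, not parent and child.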
The LDAP system is used by the Email system to route Email in the Andrew and
cmu.edu domains. It is used by "finger" to find and fetch all information about
people. It is used by web pages on www.cmu.edu to find people. Mail clients
can use LDAP to find and resolve names on the "To:" line of composed messages.
Q: WHERE IS LDAP RUNNING?
Q: WHAT'S WITH ALL THE SERVER MACHINES?
The main machine, metadir, takes updates and is the only one that can actually write new data to the others. People can only query the others, not make changes to them. The other servers are intended for:
mail-ldap*   Andrew and cmu.edu Email routing. VERY IMPORTANT
ldap*        Andrew mail client To: line name resolution
Since we add and delete servers, you might want to look at CNAMEs to figure out what server is doing what. Currently defined CNAMEs [leg 8/22/01]:
Q: HOW DO I START OR STOP THE SERVER?
Solaris: (currently metadir,ldap1,mail-ldap[12])
Linux: (currently ldap[234])
/etc/rc.d/init.d/openldap start
/etc/rc.d/init.d/openldap stop
/etc/rc.d/init.d/openldap restart
Use restart when possible. There are some other services that may need to be started and stopped alongside the LDAP server. If the server stopped unexpectedly and you want to start it up again, the restart command will prevent the situation where you get multiple copies of the other services running.
There is a wrapper program called slapd.wrapper that is run by the init script's start and stop commands. The wrapper will attempt to restart the slapd daemon if it were to stop unexpectedly (aka "crash"). The wrapper will wait about 10 seconds after the child dies and then restart it. If it crashes fast several times, it will send a zephyr to the people listed in /usr/openldap/etc/admin and slow down the restart process. So if you try to kill slapd by sending the slapd process a signal, it will be restarted. Use the init scripts. The stop command will signal the wrapper to exit and it will signal the slapd daemon to exit too. The restart command will only signal the slapd daemon to exit, and then the wrapper will restart it.
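The wrapper's restart policy, as described above, can be sketched in Python. This is not the real slapd.wrapper logic. Only the roughly 10 second delay comes from the text; the thresholds for "fast" and "several" and the slowed-down delay are hypothetical values chosen for illustration.

```python
NORMAL_DELAY = 10   # seconds; "about 10 seconds" per the text
SLOW_DELAY = 300    # hypothetical slowed-down restart delay
FAST_CRASH = 60     # hypothetical: a run shorter than this is "fast"
FAST_LIMIT = 3      # hypothetical value for "several"


def restart_plan(run_lengths):
    """Decide what to do after slapd exits unexpectedly.

    run_lengths lists how many seconds each slapd run lasted, oldest
    first, ending with the run that just died. Returns a tuple
    (seconds_to_wait_before_restart, notify_admins).
    """
    recent = run_lengths[-FAST_LIMIT:]
    crashing_fast = (len(recent) == FAST_LIMIT
                     and all(length < FAST_CRASH for length in recent))
    if crashing_fast:
        # Back off and tell the people listed in the admin file.
        return SLOW_DELAY, True
    return NORMAL_DELAY, False


print(restart_plan([86400]))      # one crash after a long run
print(restart_plan([5, 2, 1]))    # three quick crashes in a row
```

Either way the daemon comes back, which is why killing slapd by hand does not stop the service.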
Q: HOW DO I TELL IF LDAP IS RUNNING ON A MACHINE?
A good way is to see if some program is listening to the LDAP port:
% netstat -an | grep 389
If there is a line that looks like
*.389 * * 0 LISTEN
then something is listening to the LDAP port. Another way is to check the process table for a program called "slapd"
% ps -e | grep slapd
You might see 2 on metadir since it has a backup slave server running too.
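One caveat with the grep above: a plain "grep 389" also matches port 3890, which the backup server on metadir listens on. Here is a minimal sketch of a stricter check. The function name ldap_listening is made up for illustration, and it assumes the local address on a LISTEN line ends in ".389" (Solaris style) or ":389" (Linux style).

```shell
# Reads `netstat -an` output on stdin. Succeeds if some LISTEN line has an
# address field ending in exactly ".<port>" or ":<port>" (default port 389).
# ldap_listening is a hypothetical helper, not part of the LDAP install.
ldap_listening() {
    awk -v port="${1:-389}" '
        /LISTEN/ {
            for (i = 1; i <= NF; i++)
                if ($i ~ ("[.:]" port "$")) found = 1
        }
        END { exit found ? 0 : 1 }
    '
}
```

Typical use would be "netstat -an | ldap_listening && echo LDAP port is open". Pass 3890 as an argument to check for the backup server instead.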
Q: WHERE ARE ALL THE FILES THAT LDAP USES?
Look in /usr/openldap/ on any LDAP server:
bin/                    client programs
libexec/                daemon binaries
db/                     BDB database files
etc/                    update scripts
etc/openldap/           config files for slapd
etc/openldap/schema/    LDAP schema files
share/                  odd config files, e.g. nicknames
feeds/                  on the master server, the update feed files
logs/                   output from slapd and slurpd
var/                    LDIF files, core files, PID files
var/openldap/replica/   bookkeeping on replication
lib/                    LDAP libraries (not installed by default)
lib/perl/               modules for performing updates on master
lib/perl/Feed/          modules for each feed type
Q: WHERE DOES THE SOFTWARE COME FROM?
The base LDAP service is in the source collection "local/openldap", or,
/afs/andrew/system/src/local/openldap/
It came originally from the OpenLDAP group, available through HTTP, FTP, and CVS. See http://www.openldap.org
The CMU specific files, like the nightly updating and web pages, are in the "host/cmuldap" collection
/afs/andrew/system/src/host/cmuldap
The "package" files that install the software are controlled by the /etc/package.proto file using these macros:
%define doesopenldap
%define doescmuldap
which cause these two files to be included:
/afs/andrew/wsadmin/services/lib/openldap.generic
/afs/andrew/wsadmin/services/lib/cmuldap.generic
They will bring in the programs as well as the config files
/afs/andrew/wsadmin/services/etc/openldap.*
as needed.
Q: HOW DOES THE MASTER MACHINE UPDATE THE SLAVE MACHINES?
A process called slurpd runs on the master machine (metadir). When updates occur in the LDAP database, slapd writes to a log file called /usr/openldap/var/rep.ldif. There may be several processes that want to read that file, so a program called "repd" will take rep.ldif and make multiple copies of it to other filenames, according to its config file /usr/openldap/etc/openldap/repd.conf. One of the copies it makes is called /usr/openldap/var/rep.slurpd.ldif. The slurpd process runs in the background, polling that file for updates. When updates appear, it uses normal LDAP calls to tell all of the slave servers.
Q: HOW DOES SLURPD START AND STOP?
/etc/init.d/openldap.rep start
/etc/init.d/openldap.rep stop
/etc/init.d/openldap.rep restart
Just like slapd.
Q: HOW DOES THE LDAP DATABASE GET BACKED UP?
The LDAP server uses Sleepycat BDB database files to keep data. These files are always open, so writing them to tape is no good. To properly write them to tape or get a flat file copy of them, the LDAP server must be stopped, which is an interruption in service.
For this reason, there is a second LDAP server on the master server, listening on port 3890. The main server sends updates to this second server through the normal slurpd replication model. The backup server is not intended to service any queries, so it does no indexing or optimizations, and its ACL's are all shut off.
Each evening, CRON runs /usr/openldap/etc/ldapbackup.sh at 7:00pm. This time was chosen because it is about an hour before the nightly backup system, Amanda, runs. The ldapbackup.sh script shuts down the backup LDAP server and then runs /usr/openldap/sbin/slapcat to make a flat file copy of the LDAP database held by the backup server, then restarts the server. There is no interruption in service, since the backup server does not serve users.
When Amanda runs, it will make copies of the BDB files, but these are of little use. However, it will make a copy of the flat file database, which could be used in a disaster recovery situation.
This backup server gets started and stopped like the master LDAP server, except its script is
/etc/init.d/openldap.bak {start|stop|restart}
Q: HOW IS IT THERE ARE TWO LDAP SERVERS ON THE MAIN MACHINES?
The main server uses files in the default directory
/usr/openldap
There is an extra subdirectory there called "backup/" that contains the files for the backup LDAP server. It uses a special config file "backup/etc/openldap/slapd.conf" that tells where the non-standard directories are. When the process table is examined, the backup database slapd process is distinguished by having a "-f" switch that points to this alternate config file.
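To tell the two slapd processes apart in practice, something like the following sketch can pick out the backup one. The function name backup_slapd_pids is made up, and it assumes "ps -ef" style output with the PID in the second column and the full command line visible.

```shell
# Reads `ps -ef` output on stdin and prints the PID (column 2) of every
# slapd process started with -f pointing at the backup slapd.conf.
# backup_slapd_pids is a hypothetical helper name.
backup_slapd_pids() {
    awk '/slapd/ && /-f[ \t]+[^ \t]*backup\/etc\/openldap\/slapd\.conf/ { print $2 }'
}
```

Run it as "ps -ef | backup_slapd_pids". Any slapd PID it does not print should be the main server.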
Q: HOW DO I RESTORE THE LDAP DATABASE?
First you have to determine how widespread the problem is that makes you want to restore the database. There are 3 situations, listed in increasing order of unfriendliness:
# on metadir:
su $LOGNAME.root
echo ${LOGNAME}@ANDREW.CMU.EDU >> /.klogin
cd /usr/openldap
/etc/init.d/openldap stop
mv db db.corrupt (optional)
# on mail-ldap2:
su $LOGNAME.root
cd /usr/openldap
/etc/init.d/openldap stop
tar -cf db.tar db
klog $LOGNAME
scp db.tar root@metadir:/usr/openldap/.
/etc/init.d/openldap start
# when that is done, on metadir:
tar -xf db.tar
/etc/init.d/openldap start
Note that if a slave server is corrupted, the same operation above can be
performed, replacing "metadir" with the slave server hostname.
# on metadir
su $LOGNAME.root
echo ${LOGNAME}@ANDREW.CMU.EDU >> /.klogin
cd /usr/openldap
/etc/init.d/openldap stop
/etc/init.d/openldap.rep stop
echo > var/rep.slurpd.ldif
mv db db.corrupt
mkdir db
sbin/slapadd -l backup/db/backup.ldif
This last command will take a while, even 30-90 minutes.
tar -cf db.tar db
/etc/init.d/openldap.bak stop
/bin/cp db/nextid.dbb backup/db
/bin/cp db/dn2id.dbb backup/db
/bin/cp db/id2entry.dbb backup/db
/etc/init.d/openldap.bak start
/etc/init.d/openldap start
klog $LOGNAME
Now, one by one, perform steps similar to section 1 above to scp the db.tar file from metadir to one slave server
# on the slave:
su $LOGNAME.root
echo ${LOGNAME}@ANDREW.CMU.EDU >> /.klogin
cd /usr/openldap
/etc/init.d/openldap stop
mv db db.corrupt (optional)
# on metadir:
scp db.tar root@<SLAVE SERVER NAME>:/usr/openldap/.
/etc/init.d/openldap start
# when that is done, on the slave server:
tar -xf db.tar
/etc/init.d/openldap start
Loop through those commands on each slave. If the database is really screwed
up such that the slaves cannot run, go ahead and run the scp commands
concurrently from metadir. You may want to add the -q switch between "scp" and
"db.tar" to shut off the running updates.
When all slaves are updated, restart the replication server:
/etc/init.d/openldap.rep start
/usr/openldap/backup/db/backup.ldif
Talk to a backup kind of guy, like Chaskiel Grundmann, to get into Amanda. Once you have the backup file, you can follow the steps for section 2) above. The "slapadd" command will need the name of the restored LDIF file after the "-l" switch.
You can try getting the LDAP database more up to date if you think the feed files are uncorrupted. You'll need to look in the directory
/usr/openldap/feeds/
to see how far back the feed files go. If you restored the LDIF file from further back, you're going to have a gap in the updates. But suppose today is March 20, 2001 and you took the LDIF file from, say, March 5, 2001. You can run the update system between those two dates and you should be in good shape
# on metadir
cd /usr/openldap
etc/ldapfeeds.pl -old 010305 -new 010320
This will take the differences between the feed files from the old date to the new, and make updates to the LDAP database accordingly. The output of the command will be in
/usr/openldap/feeds/ldapfeeds.output.010320
The date on the output file is today's date; if the -new date was two days ago (010318) the output file would still be today's date.
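The -old and -new arguments in the example look like dates written as YYMMDD (010305 for March 5, 2001). That reading is inferred from the example, not from ldapfeeds.pl itself. Under that assumption, a small helper (the name feed_date is made up) can turn a full date into that form so you don't fat-finger it:

```shell
# Converts YYYY-MM-DD into the YYMMDD form used in the example above.
# feed_date is a hypothetical helper; the YYMMDD format is an assumption
# read off the 010305 / 010320 example.
feed_date() {
    case "$1" in
        [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9])
            printf '%s\n' "$1" | sed 's/^..\(..\)-\(..\)-\(..\)$/\1\2\3/' ;;
        *)
            echo "usage: feed_date YYYY-MM-DD" >&2
            return 1 ;;
    esac
}
```

It could then be used as: etc/ldapfeeds.pl -old "$(feed_date 2001-03-05)" -new "$(feed_date 2001-03-20)"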
Q: HOW DO I SEARCH FOR AN ENTRY IN THE LDAP DATABASE ?
(Also see the next question to MODIFY an entry)
The "ldapsearch" command should be all you need. It's in /usr/local/bin/ on most machines, and can be run with --help for a list of all command line switches. The command tends to look like this:
% ldapsearch -h ldap2 -b dc=cmu,dc=edu cmuAndrewID=adamson
This will connect to the (h)ost named ldap2 and find the user whose Andrew ID
is "adamson" and print out everything in the entry that is available to
you. It will use your Kerberos ticket to authenticate you, so you may gain
additional privileges, such as seeing the user's phone number. Several people's
ADMIN instances are in the cn=Administrators group, which can see anything.
If you get an error about command line options, make sure you are using the OpenLDAP ldapsearch and not one from the OS vendor or Netscape or other vendors, as the command line switches can vary. Specify /usr/local/bin as the path to the program.
There are more general searches you can conduct, changing the filter at the end of the command line to something like
'cmuDepartment=Services Development Group (Comp Services)'
to get all programmers in Computing Services. You will get back multiple
entries. If there is more than one filter you want to use to refine the
search, you can put them together in the LDAP filter format:
(&(attr1=value1)(attr2=value2)(!(attr3=value3)))
The initial ampersand (AND) can be changed to a pipe (OR), and the
exclamation point shows how to do a logical NOT.
If there are only 1 or 2 attributes you are interested in instead of the entire entry, you can append them to the commandline. For the ldapsearch command, anything that appears after the filter is taken as a list of attributes you want to see.
% ldapsearch -h ldap2 -b dc=cmu,dc=edu cmuAndrewID=adamson homePhone cn
There are also "operational" attributes such as last modify time
(modifyTimestamp) and last modifier's name (modifiersName) which don't
show up unless you specifically request them. To see them, add a plus sign "+"
to the end of the command line as an attribute to request. This is a flag in
the protocol to see operational attributes.
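Getting the parentheses right in a compound filter is fiddly by hand. Here is a minimal sketch that builds one from its pieces. The function name ldap_join is made up, and it does no escaping of special characters inside values.

```shell
# Builds a compound LDAP filter: the first argument is the operator
# ("&" for AND, "|" for OR), the rest are terms, each wrapped in parentheses.
# ldap_join is a hypothetical helper name.
ldap_join() {
    op=$1; shift
    filter="($op"
    for term in "$@"; do
        filter="$filter($term)"
    done
    printf '%s)\n' "$filter"
}
```

For example, ldap_join '&' 'attr1=value1' 'attr2=value2' '!(attr3=value3)' prints the same filter shown above, and the result can be passed straight to ldapsearch in double quotes: ldapsearch -h ldap2 -b dc=cmu,dc=edu "$(ldap_join '&' 'cmuAndrewID=adamson' 'cn=*')" cn modifyTimestamp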
Q: HOW DO I CHANGE AN ENTRY IN THE LDAP DATABASE ?
(Also see the next question to ADD an entry)
You need to write an LDIF file and feed it to ldapmodify. A quick crash course on LDIF... It is flat ASCII, with lines separated by carriage returns, and entries separated by blank lines. Example:
dn: uid=blef,ou=Account,dc=andrew,dc=cmu,dc=edu
mail: blef@andrew.cmu.edu

dn: uid=fnork,ou=Account,dc=andrew,dc=cmu,dc=edu
cn: John Fnork
cn: John Q Fnork

This LDIF file would change two entries. Each line consists of an attribute name, a colon, a space, and a value. The first attribute for an entry has to be the "dn" attribute. After that, any attribute available for the entry can be listed. If an attribute is to have more than one value, like the common name for John Fnork above, the values are listed one per line with the attr name repeated for each one. You don't have to list the entire entry the way you want it to appear when you're done; you just have to list the attributes you want changed.
The ldapmodify program lives in either /usr/openldap/bin or /usr/local/bin on most machines. You have to give it a few commandline args, and you will need a Kerberos ticket. A typical line looks like this:
% ldapmodify -h metadir -f ./ldif
The "-h metadir" says to make the changes on the (h)ost named metadir, which
at the time of this writing, is the read/write server. The "-f ./ldif" args
give the path to the (f)ile that contains your LDIF.
IMPORTANT NOTE: If your LDIF is all new values to APPEND to entries, you are fine. If you want the old values that are in the entry removed and REPLACED with the values you have in your LDIF file, add a "-r" switch to the command. Again, you don't have to list the entire entry in your LDIF -- if you add -r to the command, only attributes explicitly included in your LDIF file will be overwritten. If one entry lists "cn" and a second lists "mail", and you use the -r switch, the "mail" attr of the first entry WILL NOT BE CHANGED, nor will the "cn" of the second entry. "Inclusion" of attributes is on an entry by entry basis, not the entire file.
In order to REMOVE a SINGLE value from an attribute, list all of the other values in the LDIF file and use the -r switch. To remove ALL of the values, say all of the "labeledURI" attribute values, add "labeledURI: " to the LDIF. BE SURE to have the colon and the space.
If there is a schema violation, you are trying to add/change attributes in an LDAP entry that can't have those attributes. For example, if you try to append a "cmuSIScat" attribute to an Account entry, it will fail -- you can only add that attribute to a Person entry (with the guid=<hex> in the DN).
If an error occurs with one entry, the ldapmodify command will stop. You'll have to take the tail of the LDIF file and run it through ldapmodify again. You can add the "-c" switch to tell ldapmodify to (c)ontinue after errors.
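LDIF also has an explicit change-record form ("changetype: modify" followed by "replace:", "add:" or "delete:" blocks) that spells out the intent in the file itself instead of relying on the -r switch. The sketch below generates such a record. The function name ldif_replace is made up, and how the ldapmodify installed here handles these records has not been checked against this installation, so try it on a test entry first.

```shell
# Prints an LDIF change record that replaces all values of one attribute.
# Usage: ldif_replace DN ATTR VALUE [VALUE...]
# ldif_replace is a hypothetical helper name.
ldif_replace() {
    dn=$1; attr=$2; shift 2
    printf 'dn: %s\nchangetype: modify\nreplace: %s\n' "$dn" "$attr"
    for value in "$@"; do
        printf '%s: %s\n' "$attr" "$value"
    done
    # A lone "-" closes the replace block; the blank line ends the entry.
    printf '%s\n\n' '-'
}
```

For the John Fnork example that would be: ldif_replace 'uid=fnork,ou=Account,dc=andrew,dc=cmu,dc=edu' cn 'John Fnork' 'John Q Fnork' > ./ldif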
Q: HOW DO I ADD AN ENTRY TO THE LDAP DATABASE?
(See also the previous question for MODIFYING an entry, and the next question for creating a GUID)
You will need a file containing a complete LDIF copy of the entry you want to add to the database. See the previous question for a crash course in LDIF. The file will contain the COMPLETE entry, with all values filled in. One way to start is to fetch a similar entry from LDAP and plagiarize.
With the LDIF file ready, feed it into ldapadd:
% ldapadd -h metadir -f ./ldif
just like the "ldapmodify" command in the previous question. There is no "-r"
switch for ldapadd, since there are no old values to replace.
You shouldn't have much need to create entries by hand.
Q: HOW DO I CREATE A NEW GUID?
This will come up if you're manually adding a new entry for a Person for some reason. (This situation could arise if you find a Person entry that is the combination of two humans into one entry, and you need to splice them apart into two Person entries.) All Person entries have a DCE GUID in their DN to uniquify them across the campus forever.
There is a program on the main LDAP server
/usr/openldap/etc/guid
Just run it; it takes no commandline args and makes no changes to LDAP. It simply generates a DCE GUID and writes it to the screen. Run it again, and you'll see the next one it generated is different.
The program produces slightly better GUIDs if run as root, since on SUN machines only root can get to the MAC address of the ethernet card, which is used as an input into GUID production.
Q: WHO IS AUTHORIZED TO MAKE CHANGES IN THE DATABASE?
You can read the slapd config file to get the list of LDAP DN's that can make writes to the database. The config file is
/usr/openldap/etc/openldap/slapd.conf
Look for "access to" directives, and within them look for "write" access.
access to attrs=telephoneNumber,homePhone
by group="cn=Administrators,ou=Group,dc=cmu,dc=edu" write
by * read
This entry controls access to 2 phone attributes. It says that the given group
named cn=Administrators can write changes to this attribute, and anyone else
can read it. If you were to search for that group, it would have one or more
values in the "member" attribute, and it is those people who can write to those
two phone attributes. Access control directives are read from top to bottom, so a
previous directive may have superseded this one.
In general, you will find that to make changes you must be in the cn=Administrators group, or bind as the cn=Manager identity listed in the "rootdn" directive in the slapd.conf file.
The cn=Administrators group lists several people's admin instance. These people, with their Admin ticket, can change anything and add new entries.
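To pull the "write" clauses out of slapd.conf without reading the whole file, here is a minimal sketch. The function name list_writers is made up. It assumes each "by" clause sits on its own line, as in the example above, and that the "who" part contains no spaces.

```shell
# Reads slapd.conf on stdin. For each "by ... write" clause, prints the
# target of the enclosing "access to" directive, a tab, and who may write.
# list_writers is a hypothetical helper name.
list_writers() {
    awk '
        $1 == "access" && $2 == "to" { target = $3 }
        $1 == "by" && $NF == "write" { print target "\t" $2 }
    '
}
```

Run it as "list_writers < /usr/openldap/etc/openldap/slapd.conf". Keep in mind the top-to-bottom ordering caveat: this lists every write clause, not which one wins.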
Q: WE'RE ALL LOCKED OUT! HOW DO WE CHANGE THE DATABASE TO ADD SOMEONE BACK IN?
Some idiot deleted everyone from the administrators list, eh? These things happen. You will need to log into the master server (currently metadir.andrew) and change the slapd config file.
/usr/openldap/etc/openldap/slapd.conf
Find the line that says "rootpw ". The value after it is the password that you can use to bind as the manager DN listed in the "rootdn " directive (usually right above). The rootdn has fearsome powers to do anything. Be careful.
By default, the rootpw line has a bogus value, because we don't want people using the rootdn directly, but this is an emergency, right? You will need to DES encrypt the new password. You did think of a new password, right? Something that people aren't going to guess, right? You can DES encrypt it using the crypt() function, available from perl quickly with
% perl -e '$a=<STDIN>; chop $a; print crypt($a, "SALT") . "\n";'
The <STDIN> will wait for you to type in the password, so it's not stored by
your shell as part of the commandline. It WILL appear on the screen as you
type, so look over your shoulder first. When you hit enter, a 13 character
string will appear, which you should add to the "rootpw" line of slapd.conf:
rootpw {CRYPT}SAYoATkkQuOb7
Now restart the slapd server (see the question far above on starting/stopping
the server). You only need to do these steps on the master server.
Now add someone to the cn=Administrators group. Write an LDIF like:
dn: cn=Administrators,ou=Group,dc=cmu,dc=edu
member: uid=adamson.admin,ou=Account,dc=andrew,dc=cmu,dc=edu
member: uid=shadow.admin,ou=Account,dc=andrew,dc=cmu,dc=edu
and run "ldapmodify" with passwords instead of SASL.
% ldapmodify -h metadir -D cn=Manager,ou=AuthEntity,dc=cmu,dc=edu \
-x -W -f ./ldif
Password:
NOTE IN BIG LETTERS: Run this while logged into the master server, NOT across
the network. The password you type is going to the server unencrypted. Doing
it on the same machine prevents the communication from using the campus net.
The "-x" switch says to use simple authentication, and "-W" says to prompt for the password. DO NOT USE "-w <passwd>" since that leaves the password sitting in your shell, and possibly in your ~/.history file. If the ldapmodify command doesn't like "-W", then you are using the Netscape version -- go find the OpenLDAP version (/usr/local/bin, /usr/openldap/bin).
WHEN YOU ARE DONE reset the "rootpw" directive in slapd.conf to remove the password, like
rootpw {CRYPT}NO_PASSWORD
and restart the server AGAIN. This will disable the use of simple password to
become the manager.
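It is easy to forget the reset step in the heat of the moment. Here is a minimal sketch of a check that the rootpw line is back to the bogus value. The function name rootpw_disabled is made up, and it treats {CRYPT}NO_PASSWORD (the value shown above) as the only acceptable setting.

```shell
# Reads slapd.conf on stdin. Succeeds only if there is at least one rootpw
# line and every rootpw line is set to {CRYPT}NO_PASSWORD.
# rootpw_disabled is a hypothetical helper name.
rootpw_disabled() {
    awk '
        $1 == "rootpw" { seen = 1; if ($2 != "{CRYPT}NO_PASSWORD") live = 1 }
        END { exit (seen && !live) ? 0 : 1 }
    '
}
```

Usage would be along the lines of: rootpw_disabled < /usr/openldap/etc/openldap/slapd.conf || echo "rootpw is still live!"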
Q: HOW DO I REINDEX THE DATABASE?
The index files in LDAP speed up lookups for indexed items. As values are changed over time the hash tables in the index files become less and less efficient. This can be cleared up by removing the index files and running the OpenLDAP utility 'slapindex'. Of course, this can't be done while the server is running, and reindexing takes about an hour, so the indexing should be done on a separate offline copy of the database. Therefore, we use the space in the backup database server on the main LDAP server, in
/usr/openldap/backup/
That server already has a copy of the 3 main files that contain all
of the raw data for the database:
dn2id.dbb id2entry.dbb nextid.dbb
From those files, all indices may be created.
Another problem to be faced is that a snapshot of the database is going to be used to create the indices. While the indices are being created, which takes about an hour (by June 2001 standards), the live database can undergo updates. If the newly indexed database is swapped in as the live database, those changes will be lost. For this reason, some games must be played with the replication system to recover any changes. Nevertheless, this process should be done during non-peak hours because there is a window of about 60 seconds during which any changes made have to be re-entered by hand. Just make sure not to attempt this process while the nightly feed update is going on!
telnet metadir
login
su root
cd /usr/openldap/
cp backup/etc/openldap/slapd.conf backup/etc/openldap/slapd.conf.orig
grep ^index etc/openldap/slapd.conf >> backup/etc/openldap/slapd.conf
/etc/init.d/openldap.backup stop
sbin/slapindex -c -f backup/etc/openldap/slapd.conf
/etc/init.d/openldap.backup start
tail var/openldap-slurp/replica/slurpd.replog
grep metadir var/openldap-slurp/replica/slurpd.status
You'll have to do some date conversions.
cd backup/db
/etc/init.d/openldap.rep stop
/etc/init.d/openldap.backup stop
tar -cf db.tar *
cd /usr/openldap
/etc/init.d/openldap stop
mv db db.old
mv backup/db .
mv var/rep.slurpd.ldif var/redo.ldif
/etc/init.d/openldap start
edit the redo.ldif file
    remove each block of "replica" lines
    remove any lines that change system attributes (modify* creat*)
    remove the "-" lines after them, too.
feed the redo.ldif file into ldapmodify
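The hand edit of redo.ldif can be roughed out with a filter first and then checked by eye. This is only a sketch: the function name strip_system_attrs is made up, and it assumes each change looks like "replace: attr" (or add/delete), then value lines, then a lone "-". It does exactly the three removals listed above and nothing else, so review the result before feeding it to ldapmodify.

```shell
# Filters a replication log on stdin:
#   - drops "replica:" lines
#   - drops change blocks for attributes starting with modif* or creat*,
#     including their value lines and the closing "-"
#   - drops bare modif*/creat* attribute lines outside such blocks
# strip_system_attrs is a hypothetical helper name.
strip_system_attrs() {
    awk '
        /^replica:/ { next }
        /^(add|replace|delete): (modif|creat)/ { skipping = 1; next }
        skipping {
            if ($0 == "-") { skipping = 0; next }
            if ($0 == "")  { skipping = 0; print; next }
            next
        }
        /^(modif|creat)[A-Za-z]*:/ { next }
        { print }
    '
}
```

For example: strip_system_attrs < var/redo.ldif > var/redo.clean.ldif, then diff the two files to see what was removed. An entry whose only changes were system attributes will be left with just its header lines and needs to be deleted by hand.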
mv backup/etc/openldap/slapd.conf.orig backup/etc/openldap/slapd.conf
mkdir backup/db
cd backup/db
tar -xf ../../db/db.tar dn2id.dbb id2entry.dbb nextid.dbb
cd ../..
/etc/init.d/openldap.backup start
foreach host ( mail-ldap1 ...... )
scp db/db.tar root@${host}:/usr/openldap/db.tar
ssh -l root $host '\
cd /usr/openldap; \
mkdir db.new; \
cd db.new; \
tar -xf ../db.tar; \
cd ..; \
/etc/init.d/openldap stop; \
mv db db.old;\
mv db.new db; \
/etc/init.d/openldap start; \
rm -r db.tar db.old'
end
rm db/db.tar
rm var/openldap-slurp/replica/*
/etc/init.d/openldap.rep start
The downside is this took a while, and the database is missing all changes that happened since the 3 main files were copied over. If you stop slurpd on the main server before you copy the 3 files, you won't miss any updates, but ALL updates will not be replicated while that is going on. You have the option of leaving slurpd running, but tell "repd" on the main server to make another copy of the LDIF file, and when the slave is done reindexing you can feed that new LDIF file into a "one shot mode" slurpd (use the -o and -r).
WebISO provides centralized authentication to web servers. The webiso login servers take care of authenticating users and issue cookies that can be used by other webservers to verify the user's identity.
The login servers run on the webiso.andrew pool of machines (currently webiso1 and webiso2). The authentication is handled by a CGI program that lives in /usr/www/htdocs/login.cgi. The login CGI gets its configuration information from /usr/www/pubcookie/config (a normal ascii file, one option per line, formatted as option:value).
The login servers log data to two different places: /usr/www/logs/error_log and /var/adm/messages. (Work to make this logging better is in progress.) At the default debug:1 and logging_level:1, the error_log logs when a user successfully authenticates and when a user gets redirected back to an application server. The system messages logs when users visit the login server (including what app server sent them there), information about the existing login cookie (if any), information when it issues login cookies, and failed authentications.
There aren't many problems with the webiso login servers, but other problems come up on occasion.
WebISO verifies usernames/passwords by trying to get a Kerberos 5 tgt, and then using that TGT to get a krb5 host/webiso.andrew.cmu.edu ticket. If the user doesn't have a krb5 passwd entry, this will fail. Have the user change their password and try again.
This happens when the app server can't validate the cookies issued by the login server. The app server sends the user back to the login server to get new cookies, the login server issues more cookies that the app server still can't validate, and the cycle repeats. Often, this is caused by a misconfigured app server (missing the pubcookie_granting.cert or encryption key file, or bad permissions on them), or mismatched encryption keys between the app server and login server. The login server keeps the encryption key for each app server in /usr/www/pubcookie/keys/, one per hostname. The hostname used is the name listed in the app server's SSL certificate, not the real hostname. Verify that the keys are the same on all login servers and the application server(s). If they're not, either create a new one from the app server by running keyclient, or just copy one of the key files around to make them all the same.
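One way to do the "verify that the keys are the same" step is to take a checksum of the key file on each machine (cksum or similar) and compare them. Here is a minimal sketch of the comparison. The function name keys_consistent and the "hostname checksum" input format are made up for illustration; you assemble those lines yourself.

```shell
# Reads lines of the form "hostname checksum" on stdin. Prints every host
# whose checksum differs from the first line, and fails if any differ.
# keys_consistent is a hypothetical helper name.
keys_consistent() {
    awk '
        NR == 1 { want = $2 }
        $2 != want { print $1; bad = 1 }
        END { exit bad ? 1 : 0 }
    '
}
```

Put a login server you trust on the first line, since everything else is compared against it.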
If you come across another problem not documented here, get in touch with jeaton@andrew.cmu.edu and let him know so he can update this FAQ.
General Oracle Information.
Computing Services maintains several Oracle databases such as the Help Center
DB, the BlackBoard DB, the DAMS Trigger Server, and multiple development
tablespaces on one or more of these DBs. So far, the databases have been well
behaved. Occasionally they can act up and cause some trouble but this is
rare.
If a DB acts up, the first place to look is in one of the alert_{SID}.log files. This file contains information about log switches, major DB alterations, and errors. All errors are preceded by "ORA-XXXXX", where XXXXX is a very useful error number. These numbers can be used to access further information, and possible solutions, for that error.
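To get a quick summary of which error numbers are in an alert log, here is a minimal sketch. The function name ora_errors is made up, and it assumes a grep that supports -o (GNU grep does; the stock Solaris grep may not).

```shell
# Reads an alert log on stdin and prints each distinct ORA-NNNNN code with
# the number of times it appears, most frequent first.
# ora_errors is a hypothetical helper name.
ora_errors() {
    grep -o 'ORA-[0-9][0-9]*' | sort | uniq -c | sort -rn
}
```

For example, on ora1: ora_errors < /oracle/m01/app/oracle/admin/thebigdb/bdump/alert_thebigdb.log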
Locations of Oracle Instances
Below is a table displaying the machine and directory structure for the Oracle
instances currently administered by Computing Services.
| CONTAINS | MACHINE | ORACLE HOME | ORACLE SID | LOCATION OF ALERT LOG |
|---|---|---|---|---|
| primary BlackBoard Server | courseinfo4.andrew.cmu.edu | /oracle/app/oracle/product/8.1.7 | blkboard | /oracle/app/oracle/admin/blkboard/bdump/alert_blkboard.log |
| Primary Help Center and Generic Developer Server | ora1.andrew.cmu.edu | /oracle/m01/app/oracle/product/8.1.7 | thebigdb | /oracle/m01/app/oracle/admin/thebigdb/bdump/alert_thebigdb.log |
| Primary DAMS Trigger Server | metadir.andrew.cmu.edu | /oracle/m01/app/oracle/product/8.1.6 | dams | /oracle/m01/app/oracle/admin/dams/bdump/alert_dams.log |
% su root
Password:
skydiver.andrew.cmu.edu# su oracle
skydiver.andrew.cmu.edu# source /oracle/oracle.env
skydiver.andrew.cmu.edu% stoplsnr
skydiver.andrew.cmu.edu% dbshut
skydiver.andrew.cmu.edu% startlsnr
skydiver.andrew.cmu.edu% dbstart
If the drive containing the binaries fails, it will have to be restored from amanda. Since the DB control files live on the RAID unit, after you restore the DB binaries and bring the DB back on line, it should pick up where it left off as in an instance failure. If the entire RAID goes bad, the nightly amanda dump of /oracle will need to be restored, and again the DB should recover, although only up to the state of the DB as of the nightly dump.
Where to find additional information
The absolute best place to go for additional information on an Oracle
error, or for any query about Oracle is http://metalink.oracle.com You
need an account to access this web site. The only person on campus
who can grant an account is
You can get pretty reasonable results just by typing the Ora-XXXXX error into any search engine. Finally, most of the Oracle documentation can be found at Oracle 8I Documentation
Monitoring Infrastructure
Event monitoring
The monitoring systems previously documented here (nadine, tsvmon) have been superseded by "mon", a system that supports active notification in addition to polling by user consoles, and is easier to extend to test new services as well. Mon is currently used to test services provided by the Systems group (including AFS, kerberos, mail, ldap, and various web servers) and the network group (dns, radius) as well as network devices maintained by datacomm.
The primary interface to mon is a cgi script running on monitor.andrew.cmu.edu (aka opermon1.andrew.cmu.edu). This interface allows you to see what, if any, services are currently failing, and also allows you to adjust some of mon's settings.
Mon is significantly different from earlier monitoring systems used at cmu in that it does not rely on the operators to contact the primary in the event of a problem.
In most cases, two consecutive failures of a service will result in the primary being sent a text page with a brief description of the problem (or at the very least, identification of the service and machine that have failed). The primary will be re-paged at varying intervals until either the service starts functioning again, or the failure is acknowledged.
The acknowledgement can either be done from the web console or directly from a 2-way pager. The messages mon sends to the pagers include information that allows the pager user to reply to the page. Once an acknowledgement has been processed, mon will not send further alerts about this failure, and the web console will display the text of the acknowledgement.
Alternatively, if the circumstances warrant it (the outage will be extended, the machine will need to be rebooted), monitoring of a host or service may be disabled instead. This can also be done from either the web console or a pager.
There are 3 kinds of disable actions that can be performed:
disabling a host
    none of the services on the specific host will be monitored.
disabling a service
    prevents that specific service from being monitored on all the hosts in the hostgroup.
disabling a group
    prevents all the services from being monitored on all the hosts in the group.
Because mon automatically notifies the primary of an outage, people who maintain services should make sure not to trigger it while doing maintenance. The relevant services or hosts should be disabled before any interruption in service, and not be re-enabled until there is a reasonable expectation that the service is stable.
[I'll presumably add more detail to this next section when my brain is feeling less frazzled]
quick navigation hints for using the web console:
Acknowledgement messages are set by clicking on the service name in the second column, and using the text box on the following page.
Services, hosts, and groups can be disabled or enabled by clicking on the hostgroup name in the first column and using the radio buttons on the following page. Single services can also be disabled/enabled by clicking on the service name in the second column and using the "disable/enable service xxx in group yyy" link on the following page.
[cg2v 11/21/02]
Historical data on machines is currently maintained on graphs.andrew.cmu.edu.
Historical data on the network is available at stats.net.cmu.edu. People trying to track down denial-of-service problems might want to look at the Top 10 usage link at the bottom. "cyh-a100.sw.cmu.edu" is the machine room switch.
Netdev takes care of stats.net; Larry is probably good to complain to about graphs.andrew. [leg 8/22/01]
Local Machine Accounts

Root Account

While a generic root password is distributed in the global
/etc/passwd, found in /afs/andrew.cmu.edu/common/etc/passwd, some
machines override it using the file /etc/passwd.change. Most
notable among these are the PO/BB machines, which use the Postman
root password. Fileservers and database servers use a separate,
special root password, and do not use the global password file.
Cluster machines use a root password belonging to Clusters.

Certain passwords are made available to members of the coverage
pool, but locally set and departmental passwords are seldom if ever
available. To get around this, you may create a simple setuid
program in AFS. If you do so, make very certain that only you may
access it. Following is source for a simple program to give you
root privileges. Arguably you should make something more complex,
which actually logs that you became root, but this is for when you
need something quick and dirty.

Then become your admin self, and on certain systems, you may need to
become root as well.

As before, *make* *sure* only you can read it.

Root Instances

Another way to get root access is through the use of a root instance.
A root instance looks like "userid.root". To get root access using
a root instance, either log in with your root instance at a login
prompt, or use su to switch to your root instance.

Once it verifies that you are in a group that is allowed root access on
the local machine, it gives you tickets/tokens for your instance, and
a root shell on the machine.

Which root instances can be used on a particular machine is set by the
/etc/root.permits file. This file lists individual root instances and
pts groups which may become root on that particular machine.

If you need a root instance and don't have one, ask the System Manager.

Service Account

Another useful back door is the service local account, which allows
Computing Services people to log in on certain machines where they
otherwise can't.

In general, the service local account password is unavailable for
general use. The password can be obtained in special cases when
absolutely necessary.

Solaris problems

One typical problem is fsck dying with signal 10 when / is
full. The typical culprit is fsck trying to link more
into /lost+found; this case can be dealt with by moving the
disk elsewhere or using "boot net - noinstall", and then making / not
full. In at least one case the /etc/passwd.*.idx files
were corrupt; these can be removed the same way.

If a Solaris machine is failing to boot and drops to a shell,
you will often see text like the following above the # prompt:

This does not indicate a problem in itself. All it
means is that Andrew workstations do not use shadow password
files, and that the stock Solaris tools don't deal with that.
You need to look further back in the output to find relevant
error messages.

License server problems

Should you happen to get beeped for something regarding a
license server, /afs/andrew/acs/software/licenservers will
show which machine each license server is running on. For
the most part, all of our license servers work by grabbing
some binaries/config files out of
/afs/andrew/data/db/<package>, and placing them in
/usr/<package> and /etc/rc.local.<package>.

The three basic problems are

Hence, the three basic solutions are

[jf6b 9/23/98]

Calendar Server Problems

Should you get beeped about the calendar server being
down, here's some quick info to get you started:

The Web Publishing System (AWPS)

There are two parts of the publishing system, the staging server
(web2.andrew.cmu.edu, AKA publishing.andrew) and the production server
(web3.andrew.cmu.edu, AKA www.cmu.edu). People FTP stuff onto the
staging server, and either immediately or later, as they configure
their individual web collection, it is published on the production
server. These notes assume that the problems are not specific to an
individual user, i.e. we assume the Help Center has verified that it's
broken for lots of people, not just a particular user.

Staging Server Problems

The staging server has the following unusual processes going on:

Authenticated second inetd/ftpd

/etc/init.d/wpftpd starts this part up, starting
/usr/webpub/bin/auth.ftpd. That keeps its authentication (as
'service.webpublish') in a subshell, and runs a second inetd (which is
therefore authenticated) and spawns ftpds. The config file for this
inetd is /etc/wpftpd.conf.

Therefore, the staging server should have two inetds running at all
times.

The ftpd is really "ftpd.checkp", a modified ftpd that gets
permission from the program /etc/check_ftpuser.

check_ftpuser is a binary which queries the LDAP server to find out
if it's OK for a given user to see/manipulate a given file.

Authenticated httpd

If you find that people can get to the admin pages on the staging
server, but all "view staged content" links return 403 errors, the
daemon might not be running authenticated, so restart it.

The server uses rjs3's "apacheath" collection, which uses a modified
apachectl which calls reauth. If you need to restart the daemon,
apachectl stop and apachectl startssl will do the trick as with any
other SSL server.

Event handler

The event handler is a perl process which checks for publishing
events that have appeared in the Oracle table which holds said
events. When it finds an event of status "pending", it does the
publishing stuff (which is a series of commands ssh'd to the
production server) and updates the event to "complete" or "failed" as
appropriate.

The handler requires three processes. If you grep for 'handler' in
the ps list you should find 'handler.wrapper', which is a perl nanny
process; kill -TERM it and you take out the whole handler complex.

The wrapper spawns (and respawns if necessary) 'auth.handler', which
is yet another pagsh script which maintains authentication for the
real workhorse, which is 'handler', a perl script.

The only reason we've seen for the handler to die (which causes
auth.handler to exit, which causes handler.wrapper to launch it again)
is if the local LDAP server goes down. Of course, that itself has a
nanny, so we hope it comes back up, at which point the handler's nanny
will be able to successfully relaunch the handler. So if the handler
keeps dying, there might be a problem with the LDAP server.

OpenLDAP server

Not much to this, really; there's stuff in /usr/openldap/etc/openldap
(cmuweb.conf, schema/cmuweb.schema) which configures things for us,
but that's about it. If it dies, you'll need to figure out why, and
possibly restart the server (/etc/init.d/openldap restart; see the
general LDAP section of this FAQ for details).

Production Server Problems

We haven't really found problems that could crop up on the
production server yet that are different from normal web server
issues: since the production server's httpd doesn't run with kerberos,
it's not difficult. Just make sure you use '/usr/www/bin/apachectl
startssl' instead of just 'start'.

The only thing interesting is the script /usr/webpub/bin/webmount,
which should be run hourly by cron, so you don't need to watch for any
dead processes. This updates the cmualiases.conf and
andrewaliases.conf files in the web server's conf directory when new
URLs are added for collections.

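The process checks and the restart described in the Web Publishing sections above can be sketched as a few small shell functions. This is only an illustration: the function names and the APACHECTL override variable are made up here, and the process names and the apachectl path are the ones given in the text. The listing functions read a process list on stdin rather than calling ps themselves, so the ps options that suit the local system can be chosen by the caller.

```shell
#!/bin/sh
# Hypothetical helpers; names are not part of the publishing system itself.

# Count inetd processes in a listing of command names read from stdin
# (one per line, with or without a leading path).
count_inetd() {
    grep -c 'inetd$' || true
}

# Succeed if the handler nanny shows up in a process listing on stdin.
# The listing needs to include arguments, since the nanny is a perl script.
handler_wrapper_running() {
    grep -q 'handler\.wrapper'
}

# Restart an SSL httpd: stop, then startssl (not plain start).
# APACHECTL is an assumed override; the default is the path from the text.
restart_httpd_ssl() {
    ctl="${APACHECTL:-/usr/www/bin/apachectl}"
    "$ctl" stop
    "$ctl" startssl
}
```

On the staging server, something like "ps -e -o comm= | count_inetd" would be expected to print 2, and "ps -e -o args= | handler_wrapper_running" would be expected to succeed while the event handler is up; the exact ps options are an assumption and may need adjusting per platform.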
My Andrew
The content and driver CGI for MyAndrew are in a Web Publishing collection called myandrew, which mounts on web3's disk as /collections/myandrew. Related collections such as myandrew/software are in there as well: grep 'myandrew' in /usr/www/conf/cmualiases.conf for particulars.
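As a small convenience, the grep mentioned above can be wrapped in a function. This is a sketch only: the function name is invented here, and the second argument exists just so the same search can be pointed at a different conf file.

```shell
# Print the alias entries that mention a collection (default: myandrew).
# The default conf path is the one given in the text.
collection_aliases() {
    name="${1:-myandrew}"
    conf="${2:-/usr/www/conf/cmualiases.conf}"
    grep -- "$name" "$conf"
}
```

Run with no arguments on web3, it lists every line that mentions myandrew, including related collections such as myandrew/software.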
MyAndrew access is forced to the secure port: see the httpsdirect.conf file. The front door (/myandrew/index.html) is open for access without authentication.
Authentication-requiring services pass through the CGI at the URL /myandrew/auth (/collections/myandrew/auth/myandrew.pl). This script accepts a few variables that tell it which service is being requested, such as the Web Publishing system, and redirects accordingly.
During your time with the beeper, you may be called upon to redownload a machine for one reason or another. Machines are typically built using a download service running on a host named (hosttype)build: for instance, HPs from hpbuild, DECstations from decbuild, SunOS Suns from sunbuild, Solaris Suns from sunbuild2, and Linux machines from linuxbuild.
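The naming convention can be captured in a small lookup. This is a minimal sketch: the function name and the short platform labels (hp, dec, sunos, solaris, linux) are made up for illustration, and only the build host names come from the text.

```shell
# Map a platform label to its build host, following (hosttype)build.
# Labels are illustrative; host names are the ones listed above.
build_host() {
    case "$1" in
        hp)      echo hpbuild ;;
        dec)     echo decbuild ;;
        sunos)   echo sunbuild ;;
        solaris) echo sunbuild2 ;;
        linux)   echo linuxbuild ;;
        *)       echo "unknown platform: $1" >&2; return 1 ;;
    esac
}
```

Note that the two Sun platforms do not follow the pattern exactly (SunOS uses sunbuild and Solaris uses sunbuild2), which is why an explicit table is safer than building the name by string concatenation.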