/** @page wprocs Worker Processes Everything related to worker processes. [TOC] @section philosophy Philosophy The idea behind separate worker processes is to achieve protected parallelization. Protected because a worker being naughty shouldn't affect the core process, and parallel because we use multiple workers. Ideally between 1.5 and 3 per CPU core available to us. Workers are free-standing processes, kept small, and with no knowledge about Nagios' object structure or logic. The reason for this is that small processes can achieve a lot more fork()s per second than large processes (800/sec for a 300MB process against 13900/sec for a 1MB process). While workers can (and do) grow a little bit in memory usage when it's running many checks in parallel, they will still be a lot smaller than the primar Nagios daemon, and the memory they occupy should be released once the checks they're running are done. @section protocol Protocol Workers use a text-based protocol to communicate with workers. It's fairly simple and very easy to debug. The breakdown goes as follows: @li A request consists of a sequence of key/value pairs. @li A key is separated from its value with an equal sign ('='). @li A key/value pair is separated from the next key/value pair with a nul byte ('\0'). @li Each request is separated from the next with a message delimiter sequence made up by a one-byte followed by three nul bytes: "\1\0\0\0". @li Keys cannot contain equal signs. underscores and numbers. @li Values cannot contain nul bytes. @li Neither keys nor values can contain the message delimiter. @li A zero-length value is considered to be the empty string. @note Even though it's technically legal to put almost anything in the key field, you should stick to mnemonic names when extending the protocol and just use lower case letters and underscores. @note Keys are case sensitive. JOB_ID is *not* the same as job_id. @subsection apis API's Worker processes communicate with Nagios using libnagios API's exclusively. Since you're looking at a subpage of the documentation for that documentation right now, I'll just assume you've found it. Although using the libnagios api's when writing a worker is completely optional, it's highly recommended. The key API's to use are: @li nsock - for connecting to and communicating through the qh socket @li kvvec - for parsing requests and building responses @li worker - for utils and stuff nifty to have if you're a worker @li runcmd - for spawning and reaping commands @li squeue - for maintaining a queue of the running job's timeouts @li iocache - for bulk-reading requests and parsing them @li iobroker - for multiplexing between running tasks and the master nagios process. @note In particular, have a look at the "parse_command_kvvec()" and "finish_job()" functions in lib/worker.c. They will do a large part of the request/response handling for you. @section registering Registering a worker - The handshake Workers register with Nagios through the queryhandler, using a query sent to the wproc handler. Since the query handler reserves the nul byte as a magic delimiter for its messages, this one time we use the semicolon instead, as is almost-standard in the internal-only queryhandlers. Typically, the default worker process registers with a query such as this: @verbatim @wproc register name=Core Worker $pid;pid=$pid\0 @endverbatim Nagios will then respond with @verbatim OK\0 @endverbatim followed by a stream of commands. Nagios currently understands the following (short) list of special keys: @li pid - The pid of the worker process. Sometimes used to check if a worker is online @li name - Used to set the name of the worker @li max_jobs - Used to tell Nagios how many concurrent jobs this worker can handle @li plugin - basename() or absolute path of specific plugins that this worker wants to handle checks for. @note plugin can be given multiple times. It is valid for a single single worker to say "plugin=check_snmp;plugin=check_iferrors", for example. @note Many workers can register for the same plugin(s). They will share the load in round-robin fashion. Complete C-code for registering a generic worker with Nagios follows: @code static int nagios_core_worker(const char *path) { int sd, ret; char response[128]; is_worker = 1; set_loadctl_defaults(); sd = nsock_unix(path, NSOCK_TCP | NSOCK_CONNECT); if (sd < 0) { printf("Failed to connect to query socket '%s': %s: %s\n", path, nsock_strerror(sd), strerror(errno)); return 1; } ret = nsock_printf_nul(sd, "@wproc register name=Core Worker %d;pid=%d", getpid(), getpid()); if (ret < 0) { printf("Failed to register as worker.\n"); return 1; } ret = read(sd, response, 3); if (ret != 3) { printf("Failed to read response from wproc manager\n"); return 1; } if (memcmp(response, "OK", 3)) { read(sd, response + 3, sizeof(response) - 4); response[sizeof(response) - 2] = 0; printf("Failed to register with wproc manager: %s\n", response); return 1; } enter_worker(sd, start_cmd); return 0; } @endcode The "enter_worker()" part actually refers to a libnagios function that lives in worker.c. The set_loadctl_defaults() call can be ignored. It's primarily intended to give sane defaults about how many jobs we can run, so we (in theory) can tell Nagios that we're swamped in case we run out of filedescriptors or child processes. @subsection request Requests A complete request looks like this (with C-style format codes replaced with actual values, ofcourse): @verbatim job_id=%d\0type=%d\0command=%s\0timeout=%u\0\1\0\0\0 @endverbatim Note that values can contain equal signs, but cannot contain nul bytes, and cannot contain the message delimiter sequence. By including nagios/lib/worker.h and using worker_ioc2msg() followed by worker_kvvec2buf_prealloc(), you will get a parsed key/value vector handed to you. Have a look in base/workers.c to see how it's done for the core workers. @subsection responses Responses Once the worker is done running a task, it hands over the result to the master Nagios process and forgets it ever ran the job. The workers take no further action, regardless of how the task went. The exception is if the job timed out, or if the worker failed to even start the job, in which case it should report the error to Nagios and only *then* forget it ever got the job. The response is identical to the request in formatting but differs in the understood keys. The request sent from Nagios to the worker must precede the other result variables. In particular, the job_id must be the first variable Nagios sees for it to parse the result as a job result rather than as something else. The variables required for the response to a successfully executed job on a registered worker process are as follows: @li job_id - The job id (as received by Nagios) @li type - The job type (as Nagios sent it) @li start - Timeval struct for start value in $sec.$usec format @li stop - Timeval struct for stop time in $sec.$usec format @li runtime - Floating point value of runtime, in seconds @li outstd - Output caught on stdout @li outerr - Output caught on stderr @li exited_ok - Boolean flag to denote if the job exited ok. A non-zero return code can still be achieved @li wait_status - Integer, as set by the wait() family of system calls The following should only be present when the worker is unable to execute the check due to an error, or when it cannot provide all the variables required for a successfully executed job due to arbitrary system errors: @li error_msg - An error message generated by the worker process @li error_code - The error code generated by the worker process error_code 62 (ETIME - Timer expired) is reserved and means that the job timed out. @note *never* invent error codes in the range 0-10000, since we'll want to reserve that for special cases. The following are completely optional (for now): @li command - The command we executed @li timeout - The timeout Nagios requested for this job @li ru_nsignals - The ru_nsignals field from the rusage struct @li ru_nswap - The ru_nswap field from the rusage struct @li ru_minflt - The ru_minflt field from the rusage struct @li ru_majflt - The ru_majflt field from the rusage struct @li ru_stime - The ru_stime field from the rusage struct @li ru_utime - The ru_utime field from the rusage struct @li ru_inblock - The ru_inblock field from the rusage struct @li ru_oublock - The ru_oublock field from the rusage struct The meaning of the fields of the rusage struct can be viewed in the section 2 man-page for the getrusage() system call. Normally, you would access it by executing the following command: @verbatim man 2 getrusage @endverbatim Note that most systems do not support all the fields of the rusage struct and may leave them empty if so. @section logging Logging Worker processes can send events to the main Nagios process that will end up in the nagios.log file. The format is the same as that in requests and responses, but a log-message consists of a single key/value pair, where the key is always 'log'. Consequently, a request from a worker to the main process to log something looks like this: @verbatim log=A random message that will get logged to nagios.log\0 @endverbatim It's worth noting that Nagios will prefix the message with the worker process name, so as to make grep'ing easy when debugging experimental workers. @section xchgexample Protocol Exchange Example A register + execution of one job on a worker process will, with the standard Nagios core worker look like this, after the worker process has connected to the query handler socket but before it has sent anything. Note that the nul-bytes separating key/value pairs have been replaced with newline to enhance readability. Also note that this depicts only the required steps, which go as follows: @verbatim Step 1, Worker: @wproc register name=Worker Hoopla;max_jobs=100;pid=6196\0 Step 2, Nagios: OK\0 Step 3, Nagios: job_id=0 type=2 timeout=60 command=/opt/plugins/check_ping -H localhost -w 40%,100.0 -c 60%,200.0 \1\0\0\0 Step 4, Worker: job_id=0 type=2 timeout=60 start=1355231532.000123 stop=1355231532.994343 runtime=0.994120 exited_ok=1 outstd=OK: RTA: 12.6ms; PL: 0%|rta=12.6ms;100.0;200.0;0;; pl=0%;40;60 wait_status=0 outerr= \1\0\0\0 @endverbatim Steps 3 and 4 in this chain repeat indefinitely. */