Thread Startup and Shutdown

Overview

ZMap runs with a main thread that drives the display and user interaction and this work mainly off callbacks from GLib/GTK and also the X-remote interface (that uses X-atoms). Interfaces to external systems (eg ACEDB, pipeServers) are handled by separate threads and these all implement a very similar interface consisting of a series of up to 8 (9? 10?) requests.

In zmapView.c/createConnection() a new thread is created, giving up to three handler functions to zMapThreadCreate().

In zmapThreads.c/zmapCreateThread() a new thread is created and given zmapNewThread() as its main loop. This function waits on requests sent by the main thread and sends replies. There is no normal exit from this function but failed requests can drop out of the main loop and terminate the thread.

The server protocol defines a terminate request which has not been implemented to date (May 2010). There is the possibility of making this over complicated: requests are expected to return success fo failure (or not implemented); the main thread has to detect and cope with spontaneous evaporation of threads and infinite loops, and call kill threads if they do not respond. resources have to be freed tidily.

Thread termination

Each thread has its own data structure to hold its state. The main thread also has data per thread. The thread model assumes there is some external server (eg ACEDB) whose state can be reported; for pipeServers this is not spectacularly relevant and may just cause confusion.

The slave thread's data includes a pointer to the main thread's corresponding data structure, which allows data to be passed to the main thread.

Timing out

Each slave waits for a request state not equal to waiting in its main loop, and this will time out after 5 seconds and continue round the main loop. This appears to be necessary to call pthread_testcancel(), which allows the thread to be killed by another one. The main thread does not appear to operate any timeouts on requests.

zmapSlave.c/zmapNewThread() handles a ZMAPTHREAD_RETURNCODE_TIMEDOUT but this is not referenced elsewhere in the source.

Slave killed by the main thread

This happens if a request fails and is regarded as critical, or on ZMap shutdown. The main thread will issue a pthread_cancel() which will eventually cause the thread to exit. The main thread's slave data resources must not be released at this point. The slave thread will call zmapSlave.c/cleanUpThread() as it exits (via the posix handler stack) and as the thread is not reporting a specific error if sets a return code of 'ZMAP_REPLY_CANCELLED'.

zmapView.c/checkStateConnections() polls all threads during idle moments and handles the actual termination of the thread. If no requests are active the thread's state will be ZMAP_REPLY_WAIT (this is set by the main thread after replies have been processed).

Slave exits spontaneously

The cleanUpThread() function will be called by the posix code as there is a stack of functions to be called set up on thread creation. There appears to be no way to tell the difference between this and thread killed by ZMap's main thread.

Slave has internal error and exits by design

On the assumption that the slave thread interfaces to an external server there are response codes such as 'BAD REQ' and 'SERVER DIED', and these generate a reply code of 'THREAD DIED'. The main loop terminates via a break and the thread calls its cleanup function. ZMap will get this 'died' response and clear up it's own data.

Slave executes terminate request

Here we face a possible semantic problem. It's OK for a slave to reply to a terminate request as the response is held in data owned by the main thread, but susbsequent requests are not logical. Therefore it make sense for a terminate request to respond with success and for the thread to exit cleanly. On receipt of the response (which cannot logically be locked by the slave) the main thread should clean up its data structures, in which case we should not have a 'thread died' status.

It's not clear from the source code whether or not we can tell if a thread has exited or not - they are set to 'detached' on creation. The pthread interface only includes one function that allows a signal to be sent to another thread - pthread_kill() and a signal of 0 allows testing of the thread's existance without sending the signal. Given that the existing method for detecting a dead thread involves failing to lock memory it seems unlikely that it does actually detect whether the thread is there or not.

Proposed implementation

Post implementation remarks

pthread_kill() on the Mac

It appears that on the Mac the pthread_kill() function is not implemented properly, and always reports a dead thread. Google reveals that some implementations return a wrong error code, and it may be possible to fix this by testing for success rather than failure. However as we are working round a bug in an external library it's worth keeping this in mind.

Status codes sent to Otterlace

Some problems were found with handling Terminate requests and it was necessary the catch ZMAPTHREAD_REPLY_QUIT in repsonse to a terminate request and set a thread statos of good. (If sent in reply to any other request it is considered a failure). After loading data a delayed server is requested to terminate, and at the end of processing the entire step list of requests status is reported to Otterlace. This includes a status code of 1 or 0 (1 means OK, 0 means failure) and an explanatory message, which is normally OK on success (if the server is not delayed) or 'server terminated' (if the server is delayed), and any other messages are errors.

Layers of software

Each server module implements a CloseConnection() and a FreeConnection() function. CloseConnection() releases external interfaces and data after which the thread is in a zombie state but still has its own internal data structures allocated. FreeConnection releases the last of these memory items.

Both of these functions are called from cleanUpThread() via the thread data structure allocated by zMapThreadCreate() and are known as thread->terminate_func() and thread->destroy_func(). The actual functions are defined in zmapServerProtocolHandler.c as zMapServerTerminateHandler() and zMapServerDestroyHandler() which call further internal (to zmapServerProtocolHandler.c) functions called terminateServer() and destroyServer().

NB: originally these functions were commented as being called on 'abnormal' thread exit. As 'normal' thread exit was not programmed this was arguably accurate but really quite confusing. There is no close request defined in the server protocol and terminate was not implemented. They were called if the thread was cancelled but not if the 'server died'.

So the call stack for a thread that has been killed by the main thread looks like this:
(thread_local is the thread's allocated data and thread_main is the main thread's allocated data)

(main thread)
zmapView.c/createConnection()
      zmapThreads.c/zMapThreadCreate()
            zmapThreads.c/createThread()  stores functions from zmapServerProtocolHandler.c

zmapView.c/checkStateConnections()
      zmapThreads.c/ZMapThreadKill()
            pthread_cancel()
(slave thread)
zmapSlave.c/zmapNewThread()       (main loop)

zmapSlave.c/cleanUpThread()       (on pthread clean up stack)
      zmapServerProtocolHandler.c/zMapServerTerminateHandler()  as (thread_local)->(thread_main)->terminate_func()
            zmapServerProtocolHandler.c/terminateServer()
                  zmapServer.c/zMapServerCloseConnection()
                        pipeServer.c/closeConnection()  as server->funcs->close()
(main thread)
zmapView.c/checkStateConnections()
      zmapView.c/destroyConnection()      when it works out that the thread has died