We have 999 problems, and a websocket is every single one of them

2016-06-22

What’s wrong?

We recently had a major medical journal (BMJ) add us to their pages. This has resulted in a fairly substantial increase in the number of persistent connections to our servers, from ~500/800 (mean/peak) to ~5000/6500.

This has identified a number of issues in our streamer code – that is, the code that propagates creates/updates/deletes of annotations to connected clients in “real time.”

The basic problem is that the code that determines whether to send a given annotation update to a client is incredibly slow. Each update is checked against every single connected websocket to determine if it should be sent to that client, and each of these checks executes complicated filter code based on the client’s permissions and the “filter” they’ve sent down the websocket.
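
The shape of the problem can be sketched as follows. This is not our actual streamer code — the `filter_matches` and `can_read` methods are hypothetical stand-ins for the real filter and permission checks — but it shows why cost grows linearly with the connection count:

```python
class MockSocket:
    """Stand-in for a connected websocket client (illustrative only)."""

    def __init__(self, name, tag_filter, allowed_groups):
        self.name = name
        self.tag_filter = tag_filter
        self.allowed_groups = allowed_groups

    def filter_matches(self, annotation):
        # Stand-in for evaluating the filter the client sent down the socket.
        return self.tag_filter in annotation["tags"]

    def can_read(self, annotation):
        # Stand-in for the permission check against the client's groups.
        return annotation["group"] in self.allowed_groups


def clients_to_notify(sockets, annotation):
    """Naive fan-out: every event is checked against every connected
    socket, so each create/update/delete costs O(connections) filter
    evaluations."""
    return [
        sock
        for sock in sockets
        if sock.filter_matches(annotation) and sock.can_read(annotation)
    ]
```

With a few hundred sockets this loop is invisible; at several thousand, with a non-trivial filter language, it is where the 30 seconds go.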

How can we fix it?

There are a number of approaches we can take here:

  1. Reduce the number of open websocket connections.

    This might sound silly, but a large number of the open websocket connections — more than 95% of them at last count — are from logged-out users. It seems likely that most of these, in turn, are just Hypothesis embeds on a page which nobody has interacted with.

    We should probably not initiate websocket connections until someone interacts with a Hypothesis embed. Otherwise we’ll end up with a number of websocket connections roughly equivalent to the number of online users of all of the websites that embed us. That sounds like a fun scaling problem but probably one we should avoid if we can.

  2. Handle annotation creates/updates/deletes more quickly.

    This seems like a pressing need regardless of item 1. It should be possible for us to handle many thousands of websocket connections per worker without taking 30 seconds to handle each event.

    One way of doing this would be to store percolation queries for each client in Elasticsearch and then do a single query to retrieve the IDs of all connected clients which should receive a document. We can even perform the query by reference to an annotation already in the index rather than by POSTing the annotation data.
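
    As a sketch of what that lookup might look like: with each client’s filter stored as a query in a percolator field, a single percolate-by-reference search would return every matching client at once. The field names, index names, and mapping below are assumptions for illustration, not a worked-out schema:

```python
def percolate_by_id(annotation_index, annotation_id):
    """Build an Elasticsearch percolate query body that matches stored
    client queries against a document already in the index, so we never
    re-POST the annotation data.

    Assumes a percolator index whose "query" field holds each client's
    registered filter -- a hypothetical mapping, not our current one.
    """
    return {
        "query": {
            "percolate": {
                "field": "query",           # percolator field holding client queries
                "index": annotation_index,  # index containing the annotation
                "id": annotation_id,        # match by reference, not by body
            }
        }
    }
```

    The key property is that the cost of the event scales with the number of *matching* clients returned by one search, not with the total number of open connections.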

Next steps

In my opinion we should look into item 1 soon, as it’s relatively easy and will buy us time.

Item 2 is more complicated, and in particular will probably be achieved most easily by providing a query parser that can map the queries clients send to the streamer into Elasticsearch queries. This will probably require changes to both the client and the server, and will require some careful design for rollout on account of old(er) clients. I have some thoughts on this but this document is plenty long enough already.

We should start doing some of the groundwork for item 2 now, and I’ll be breaking some of that out into cards. But item 1 should buy us enough time that we’re not looking to replace the entire streamer implementation in a matter of days.