Performance: Making ZMap and Otterlace run faster

Ideas for speeding things up

Profiling ZMap

See the profiling page for some ideas on profiling.

Network bandwidth: some fantasy figures

Currently, depending on the amount of data requested, it can take as much as 10 minutes (anecdotally) for Otterlace/ZMap to load. Because of this it is common for people to save their ACEDB sessions locally to avoid this load time the following day. By routing data directly to ZMap we lose this option, and as a result we expect a large increase in network traffic from current levels at the start of the day.

We also expect that users will request more data if it takes less time to load, and that new high-volume fieldsets will become available and therefore required.

We also aim to speed up startup by loading data in parallel - currently each featureset is loaded sequentially, and where data is staged between source databases and the target there is an additional sequential step. It would of course be worthwhile to gather some real statistics, but let's invent some unsupported figures assuming 100 annotators and see how the peak network traffic might compare to current levels. The point here is not that these figures cannot be discredited, but that it is clear we need to consider these issues.
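As a sketch of the parallel-loading idea (purely illustrative - the real clients are not Python, and `fetch_featureset` and the featureset names are hypothetical stand-ins), requests can be issued concurrently so total load time approaches the slowest single request rather than the sum of all of them:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_featureset(name):
    # Hypothetical stand-in for a pipe-server request; the real call
    # would stream feature data from a source database.
    return (name, "data for " + name)

# Example featureset names, invented for illustration.
featuresets = ["est_human", "vertebrate_mrna", "repeats"]

# Sequentially, total time is the sum of all request times; with a
# pool, it approaches the time of the slowest single request.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(fetch_featureset, featuresets))
```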

                                     Current   Estimates   Worst case   Factor
No of network startups               30        80          100          3
Mean no of active fieldsets          1         30          50           50
Stages: DB > Otterlace > ACE/ZMap    serial    parallel    parallel     2
Compression                          none      6:1         4:1          0.25
No of clones                         4         6           10           2

All of which implies an increase in peak network traffic by a factor of 90 - 150x.
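The upper end of that range is simply the product of the (rounded) Factor column in the table above; a quick back-of-envelope check:

```python
# Rounded worst-case multipliers from the table:
# startups, active fieldsets, serial->parallel stages, compression, clones
factors = [3, 50, 2, 0.25, 2]

worst_case_multiplier = 1.0
for f in factors:
    worst_case_multiplier *= f
# -> 150.0, i.e. the top of the 90 - 150x range
```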

If we assume 512k per fieldset per clone then we have the following estimates of network traffic per second:
                           Current   Estimates   Worst case
No of network startups     30        80          100
Mean no of fieldsets       30        30          50
No of clones               4         6           10
Data per user (MB)         60        90          250
Data total (MB)            1800      7200        25000
Ave time to load           2 min     10 sec      10 sec
Traffic per second (MB)    15        720         2500
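The derived rows follow mechanically from the 512k-per-fieldset-per-clone assumption; a small model (the function names are mine, not anything in the codebase) reproduces the table:

```python
MB_PER_FIELDSET_PER_CLONE = 0.5  # the 512k assumption

def data_per_user_mb(fieldsets, clones):
    # Each user pulls every active fieldset for every clone.
    return fieldsets * clones * MB_PER_FIELDSET_PER_CLONE

def traffic_mb_per_sec(startups, fieldsets, clones, load_secs):
    # Total morning load spread over the average load time.
    total_mb = startups * data_per_user_mb(fieldsets, clones)
    return total_mb / load_secs

# Worst case: 100 startups, 50 fieldsets, 10 clones, 10 s load
# -> 250 MB per user, 25000 MB total, 2500 MB/s peak
```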

One obvious remark is that not everyone will press the start button at the same instant, and not all users will request huge amounts of data, so we can divide the worst case by (e.g.) a factor of 10. However, given that meeting a performance target means designing in spare capacity, that still leaves us facing, in rough terms, a network requirement not far removed from a gigabyte per second if we just program it blindly.

Feedback from anacode

Experience with pipe servers is that moderately parallel use of server scripts soon reaches a performance bottleneck on the web servers. Distant sources (eg DAS in Washington) can time out. Data is cached by the server scripts, and this is expected to resolve this kind of problem after an initial loading of data.
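A minimal sketch of the caching idea (purely illustrative; the real server scripts will have their own scheme, and `fetch` here is hypothetical): key the cached response by request, so only the first fetch for a given source and region pays the cost of a slow or distant server:

```python
import functools

@functools.lru_cache(maxsize=None)
def fetch(source, region):
    # Hypothetical slow call to a distant source (eg a DAS server);
    # repeat requests for the same (source, region) are served from
    # the cache without touching the network.
    return "features for %s from %s" % (region, source)
```

A time-to-live or explicit invalidation would be needed on top of this if the upstream data can change during a session.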