iSpot / Megalab server moves
February 11th, 2010 by Richard Lovelock
The recently configured back-up architecture and plan were tested over the past couple of weeks with a live swap-over for iSpot and the Evolution Megalab, brought about by a disk failure on the live web server. The switch-over involved moving from the live web server and database/file server to two completely separate back-up servers (one web, one database/file), which then assumed the role of live servers whilst maintenance was performed on the original live web server.
The switch to the back-up servers went fairly smoothly, with only a minimal amount of downtime for the live websites. With the exception of the quiz on Megalab (which needs to be looked at), both sites continued to function as normal on the back-up architecture.
The disk failure on the live web server was rectified after about two weeks, which allowed the sites to revert to the original live web server and database/file server set-up. A procedure for these switches had been documented and was followed. We had originally believed that the swap back to the live set-up had gone very smoothly, and in terms of the technical tasks performed it had. However, there were a couple of issues which resulted in some significant downtime of the live sites:
- When the live site files were copied back to the live web servers from the ‘backup live’ servers, the websites were put into maintenance mode so that the public could not access them and no data was written to the databases. A flag is set in the database to put the sites into maintenance mode. After the files and databases had been copied back on to the live servers, the DNS records were changed so that requests to iSpot and the Megalab were directed to the original live servers. The flag in the databases on the live servers was set to online so that, as the DNS changes propagated to the DNS servers of various ISPs (this can generally take between 24 and 72 hours or so and is out of our hands), users trying to access the sites would be able to reach them on the live servers. The oversight on our part was that there is a scheduled task (part of the original back-up routine) that backs up the live database and overwrites the copy on the back-up server. This task performed its duty and overwrote the database on the back-up server with the live copy, which now had the offline/online flag set to online. This happened within an hour of switching back to the live set-up and before the DNS propagation had taken effect, so requests for iSpot were generally still going to the back-up server, which had now (inadvertently) been set to online. Users began posting observations, and we believed that we were seeing the site being served from the live servers. Then, at hourly intervals, the database on the back-up server (which was being updated with user observations etc.) was overwritten with the database from the live site, which was effectively old data. The issue was picked up when a new news feature item disappeared from the home page. The scheduled task was then stopped and the flag was set to make the sites offline again on the back-up servers.
Once the propagation of the DNS records started to take effect, users were successfully directed to using the sites on the live servers again.
- The second problem was that we had requested internally that the external hosting company be notified to update the DNS settings of iSpot and the Megalab, and our request was not fulfilled. When we chased the request we were told that it was the responsibility of a different person, so we then had to chase it with that person to get it actioned. We weren’t given any notice that it wasn’t the responsibility of the person we had logged the request with.
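The first problem above came down to the hourly copy carrying the live database's "online" flag over to the back-up server, and then repeatedly overwriting a database that users were writing to. A minimal sketch of a guard for such a task is shown below. It is not our actual back-up script: it uses in-memory SQLite databases purely so that it is self-contained, and the table names and the `site_offline` flag name are hypothetical stand-ins for whatever the real sites use.

```python
import sqlite3

FLAG_KEY = "site_offline"  # hypothetical name for the maintenance-mode flag


def make_db(offline, observations):
    """Build a toy site database with a maintenance flag and some observations."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE settings (name TEXT PRIMARY KEY, value TEXT)")
    db.execute("CREATE TABLE observations (title TEXT)")
    db.execute("INSERT INTO settings VALUES (?, ?)",
               (FLAG_KEY, "1" if offline else "0"))
    db.executemany("INSERT INTO observations VALUES (?)",
                   [(title,) for title in observations])
    db.commit()
    return db


def is_offline(db):
    """Return True if the database has its maintenance flag set."""
    row = db.execute("SELECT value FROM settings WHERE name = ?",
                     (FLAG_KEY,)).fetchone()
    return row is not None and row[0] == "1"


def scheduled_backup(live, standby):
    """Copy the live database over the standby copy, with two safeguards.

    1. If the standby is online it may be receiving user data, so the copy
       is skipped rather than overwriting that data.
    2. After the copy, the standby's flag is forced back to offline, because
       the copy would otherwise inherit the live server's online flag.
    """
    if not is_offline(standby):
        return False
    live.backup(standby)  # replaces the whole standby database with live's
    standby.execute("UPDATE settings SET value = '1' WHERE name = ?",
                    (FLAG_KEY,))
    standby.commit()
    return True
```

With a guard along these lines, the first hourly run after the switch back would have left the back-up server in maintenance mode, and a back-up server that had somehow gone online would not have had user observations overwritten.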
Lessons learnt
- Apart from the obvious frustration of a period of a couple of hours where data was lost, we were generally very pleased with how the back-up architecture performed and how smoothly the transition between the systems went.
- We have updated our documentation to reflect any changes to the procedure to make sure that data isn’t lost in the future.
- During the switch back to live it came to light that there may be an alternative and more efficient way of switching between servers that will result in less downtime of the sites for users. Instead of asking the external hosting company to switch the destination of requests to iSpot and the Megalab on their DNS servers and waiting for these changes to propagate, there is a tool that we can use to control, internally to our network, the routing of given requests to various servers/IP addresses. This would effectively mimic a change to an external DNS record, but as no external propagation would be required the switch between servers would be almost instantaneous, which should result in only a minimal amount of downtime for the live sites.
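The idea in the last point can be pictured as a lookup table that we control ourselves. The sketch below is only a toy model of that idea, not the tool mentioned above: the class, the site keys and the addresses are all hypothetical, and the addresses are placeholders from the range reserved for documentation.

```python
from dataclasses import dataclass, field

LIVE = "192.0.2.10"    # placeholder address for the live web server
BACKUP = "192.0.2.20"  # placeholder address for the back-up web server


@dataclass
class InternalRouter:
    """Toy model of an internally controlled table: site name -> server address."""
    routes: dict = field(default_factory=dict)

    def point(self, site, address):
        """Send all requests for a site to the given server address."""
        self.routes[site] = address

    def resolve(self, site):
        """Return the server address currently handling a site."""
        return self.routes[site]


def switch_all(router, address):
    """Repoint every known site at one server in a single pass."""
    for site in list(router.routes):
        router.point(site, address)


router = InternalRouter()
router.point("ispot", LIVE)
router.point("megalab", LIVE)
```

Because the table sits inside our own network, a call such as `switch_all(router, BACKUP)` would take effect on the next request, with no external DNS change to wait for.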
February 14th, 2010 at 12:58 pm
Glad to hear that lessons have been learned and that it is all documented.