The rush to Wikipedia's 500,000th article (see related story) was interrupted Wednesday when a disk ran out of space and forced the developers to put the site in read-only mode for about 15 hours. The incident highlighted the need for more people to be involved in Wikipedia development and server administration (see archived story).
The problem arose when the master database server ran out of space, not for the database itself, but for the binlog, the file that stores updates waiting to be sent to the slave database servers. As developer Kate Turner explained, when the disk allocated to the binlog runs out of space, MySQL bypasses the binlog and writes updates directly to the database. Because those updates are never logged, the slave databases cannot be resynchronized with the master without halting the process.
Under normal circumstances, the potential problem can be avoided through regular maintenance of the binlog. Old binlogs that have already been processed by the slave databases can be deleted, freeing up more space for new updates. However, this requires that someone be actively monitoring the situation at the time when the binlog is filling up.
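The maintenance rule described above can be sketched in a few lines of Python. This is only an illustration, not the developers' actual procedure: the function names, server names and file names are hypothetical, and it assumes binlog files carry a numeric suffix (as in `mysql-bin.000012`) and that the file each slave is currently reading has already been collected, for example from `SHOW SLAVE STATUS`. The generated statement uses MySQL's `PURGE BINARY LOGS TO`, which removes every log older than the named one.

```python
def binlog_index(name):
    """Return the numeric suffix of a binlog file name such as 'mysql-bin.000012'."""
    return int(name.rsplit(".", 1)[1])


def purgeable_binlogs(master_logs, slave_positions):
    """List the master binlog files that every slave has finished reading.

    master_logs: binlog file names present on the master.
    slave_positions: hypothetical mapping of slave name -> binlog file
        that slave is currently reading.
    """
    if not slave_positions:
        # Without position data, be conservative and purge nothing.
        return []
    oldest_needed = min(binlog_index(f) for f in slave_positions.values())
    return [f for f in master_logs if binlog_index(f) < oldest_needed]


def purge_statement(master_logs, slave_positions):
    """Build the SQL that would purge the safe files, or None if there are none."""
    if not purgeable_binlogs(master_logs, slave_positions):
        return None
    # The slowest slave still needs this file, so purge only what precedes it.
    keep_from = min(slave_positions.values(), key=binlog_index)
    return f"PURGE BINARY LOGS TO '{keep_from}';"


if __name__ == "__main__":
    logs = ["mysql-bin.000010", "mysql-bin.000011",
            "mysql-bin.000012", "mysql-bin.000013"]
    slaves = {"slave-a": "mysql-bin.000012", "slave-b": "mysql-bin.000011"}
    print(purgeable_binlogs(logs, slaves))
    print(purge_statement(logs, slaves))
```

The slowest slave sets the limit: a file is only safe to delete once no slave still needs it, which is why the sketch takes the minimum position across all slaves.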
Unfortunately, the binlog disk filled up on Wednesday, and the developers did not notice until it was too late. As a result, Wikipedia briefly went down around 16:00 (UTC) and was brought back in a locked state. Turner apologised on behalf of the developers for the lack of monitoring. In response to a few complaints, Silsor issued a reminder that "our developers are all volunteers who have lives of their own and often get sucked into annoying Wikimedia issues."
Unlike in previous instances of downtime caused by power outages (see archived stories), Wikipedia remained available to readers this time, so the only people seriously affected were those trying to edit. Readers might not have seen some of the latest changes while the site was in read-only mode. Editing was restored at around 07:00 (UTC) on Thursday.
The problem had already come up once this year and is a known issue with MySQL. The MediaWiki developers have previously been in contact with the MySQL developers about this bug, although it isn't known whether any progress has been made. In the meantime, Turner reported that she was writing additional code for servmon (a tool used to monitor the status of the servers) that would monitor disk space and hopefully prevent similar incidents in the future.
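A disk-space check of the kind described could look roughly like the Python sketch below. It is not servmon's code, whose implementation is not described here; the function names, the 10% threshold and the example path are all assumptions made for illustration. The only library call used is `shutil.disk_usage` from the Python standard library.

```python
import shutil


def disk_alert(path, total_bytes, free_bytes, min_free_fraction=0.10):
    """Return a warning string if free space is below the threshold, else None.

    The 10% default threshold is an arbitrary illustrative value.
    """
    if total_bytes <= 0:
        return f"WARNING: {path} reports no capacity"
    free_fraction = free_bytes / total_bytes
    if free_fraction < min_free_fraction:
        return (f"WARNING: {path} has only {free_fraction:.1%} free "
                f"(threshold {min_free_fraction:.1%})")
    return None


def check_path(path, min_free_fraction=0.10):
    """Measure the filesystem holding `path` and apply disk_alert to it."""
    usage = shutil.disk_usage(path)
    return disk_alert(path, usage.total, usage.free, min_free_fraction)


if __name__ == "__main__":
    # "." stands in for the hypothetical binlog partition.
    message = check_path(".")
    print(message or "disk space OK")
```

Run on a schedule, a check like this would raise a warning while there is still time to delete old binlogs, rather than after the disk is already full.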