User:WebCiteBOT

Update (12/23/14) - I am working on getting the bot up and running again. WebCite seems to have some technical problems at the moment (requests return timeout messages but actually complete). I am working on a workaround in my code. --ThaddeusB (talk) 18:58, 23 December 2014 (UTC)

WebCiteBOT's purpose is to combat link rot by automatically WebCiting newly added URLs. It is written in Perl and runs automatically with only occasional supervision.

A complete log of the bot's activity, organized by date, can be found under User:WebCiteBOT/Logs/. Some interesting statistics related to its operation can be found at User:WebCiteBOT/Stats.

Operation: WebCiteBOT monitors the URL addition feed at IRC channel #wikipedia-en-spam and notes the time of each addition, but takes no immediate action. After 48 hours (or more) have passed, it goes back and checks the article to see whether the new link is still in place and whether it is used as a reference (i.e. not as an external link). These precautions help prevent the archiving of spam/unneeded URLs.
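The waiting period and the reference check described above can be illustrated with a short Python sketch. This is not the bot's actual Perl code: the names PendingLinks, note_addition, due, and used_as_reference are hypothetical, and the sketch assumes that "used as a reference" means the URL appears inside a <ref>...</ref> pair in the wikitext.

```python
import heapq
import re
from datetime import datetime, timedelta

HOLD_PERIOD = timedelta(hours=48)  # minimum wait stated on this page


class PendingLinks:
    """Hypothetical queue of reported link additions, oldest first."""

    def __init__(self):
        self._heap = []  # entries are (time_added, article, url)

    def note_addition(self, article, url, when):
        # Record the addition; no other action is taken yet.
        heapq.heappush(self._heap, (when, article, url))

    def due(self, now):
        """Pop and return every (article, url) at least HOLD_PERIOD old."""
        ready = []
        while self._heap and now - self._heap[0][0] >= HOLD_PERIOD:
            _, article, url = heapq.heappop(self._heap)
            ready.append((article, url))
        return ready


# Matches <ref> or <ref name="..."> bodies; self-closing <ref ... /> tags
# and <references> are not matched. A simplification for illustration.
REF_RE = re.compile(r"<ref(?:\s[^>/]*)?>(.*?)</ref>", re.DOTALL | re.IGNORECASE)


def used_as_reference(wikitext, url):
    """True if url appears inside at least one <ref>...</ref> body."""
    return any(url in body for body in REF_RE.findall(wikitext))
```

A link that is still present after the hold period but appears only outside <ref> tags (for example in an "External links" section) would be skipped under this assumption.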

Articles that have been tagged/nominated for deletion are skipped until the issue is resolved.

For each valid reference it finds, WebCiteBOT first checks its database to see whether a recent archive has already been made. If not, it checks whether the link works. Working links are submitted for archiving at WebCitation.org, while dead links are tagged with {{dead link}}. After the archival attempt has had time to complete, the bot checks the archive's status and updates the corresponding Wikipedia page if the archive was completed successfully. It will also attempt to add title, author, and other metadata that wasn't supplied by the editor who added the link.
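The per-reference decision described above can be summarized in a small Python sketch. The names Action and decide are hypothetical, and the sketch assumes that an existing recent archive is simply reused (the page does not say what happens in that case). How "recent" is defined and how a link is tested are left to the caller.

```python
from enum import Enum


class Action(Enum):
    REUSE_ARCHIVE = "reuse existing archive"   # assumption, see lead-in
    SUBMIT = "submit to WebCitation.org"
    TAG_DEAD = "tag with {{dead link}}"


def decide(url, recent_archives, is_live):
    """Choose what to do with one reference URL.

    recent_archives: set of URLs that already have a recent archive
                     (stands in for the bot's database lookup).
    is_live:         callable returning True if the link currently works
                     (stands in for the bot's link check).
    """
    if url in recent_archives:
        return Action.REUSE_ARCHIVE
    if is_live(url):
        return Action.SUBMIT
    return Action.TAG_DEAD
```

Passing the database lookup and the link check in as arguments keeps the decision logic testable without any network access.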

Features not yet implemented:

  • Ability to archive all links on a specific page on demand
  • Build database of "problem" sites to save time
  • Tag invalid links with {{dead link}} (Implemented June 6, 2009)
  • More robust capture of metadata; build a database of human-supplied metadata to assist the bot in determining certain items (update: the bot is now capturing human-entered data for each page it loads in order to build this database)
  • Attempt to locate archive for older links when updating a page (maybe)

Known Issues/Limitations:

  • Some link additions are not reported to #wikipedia-en-spam (likely because there are too many edits for the reporting bot to examine every one) and thus are not caught by WebCiteBOT.
  • The link reporting bot will "un-encode" characters that are URL encoded (e.g. "%80%99"), which makes my bot unable to find the link in the wikitext, so it reports the link as "removed". (A workaround was added to the code February 26, 2012 to "save" a few of these.)
  • WebCiteBOT is not able to distinguish between true new additions and additions caused by reverts and such. Thus, sometimes a "new" link is actually fairly old and the archived version may not match the version the original editor saw.
  • WebCitation.org does not archive some pages due to robot restrictions. A small number of additional pages are archived incorrectly. (WebCiteBOT normally catches these and doesn't link to them.)
  • WebCiteBOT does not follow redirects. This means that if a page is moved after a link is added, but before the bot looks at it, the link will be reported as "(link) has been removed". It is not clear to me whether following redirects would be desirable.
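The URL-encoding limitation listed above can be illustrated with a Python sketch of one possible matching strategy: compare the reported URL and each URL in the wikitext after percent-decoding both. This is only an illustration and is not necessarily the workaround the bot uses. The name find_reported_link and the simplified URL pattern are assumptions.

```python
import re
from urllib.parse import unquote

# Simplified pattern for bare URLs in wikitext; stops at whitespace,
# a closing bracket, a pipe, angle brackets, or a double quote.
URL_RE = re.compile(r'https?://[^\s\]|<>"]+')


def find_reported_link(wikitext, reported_url):
    """Return the URL exactly as written in wikitext that corresponds to
    reported_url, comparing percent-decoded forms; None if not found."""
    target = unquote(reported_url)
    for candidate in URL_RE.findall(wikitext):
        if candidate == reported_url or unquote(candidate) == target:
            return candidate
    return None
```

Returning the URL as it is written in the wikitext matters because any later edit to the page has to match that exact string.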

Feel free to make a suggestion to improve the bot.

This user keeps citations to online sources working with the help of WebCite!