On 1 September, the Arbitrators voted to suspend the Media Viewer case for 60 days. After the suspension period is up, the case is to be closed unless the committee votes otherwise. The case suspension comes in response to several new initiatives and policies announced by the Wikimedia Foundation that may make the case moot. In the same motion, the committee declared that Eloquence's resignation of the administrator right was "under a cloud" and that he can only regain the right through another RfA.
The Arbitrators voted to appoint Callanecc (talk · contribs), Joe Decker (talk · contribs) and MBisanz (talk · contribs), with DeltaQuad (talk · contribs) as the alternate, to the 2014 Audit Subcommittee.
Two featured articles were promoted this week.
One featured list was promoted this week.
Ten featured pictures were promoted this week.
One of the problems Wikipedia faces is users who add content copied and pasted verbatim from sources. When we follow up on a person's work, we often don't check for this, and a few editors have managed to make thousands of edits over many years before concerns are detected. In the past year, I've picked up three or four editors who have made many thousands of edits to medical topics in which their additions contain text copied word for word from elsewhere. Most of those who make only a few edits of this nature are never detected.
After a user detects this kind of editing, clean-up involves going through all their edits and occasionally reverting dozens of articles. Unfortunately, sometimes it means going back to how an article was years back, resulting in the loss of the efforts of the many editors who came after them. Contingency reverts can end up harming overall article quality and frustrate the core editing community. What is the point of contributing to Wikipedia if it's simply a collection of copyright-infringed text cobbled together, and even your own original contributions disappear in the cleanup? Worse, the fallout can cause editors to retire. If we could have caught them early and explained the issues to them, we'd not only save a huge amount of work later on, but might retain editors who are willing to put in a great deal of time.
So what is the solution? In my opinion, we need near real-time automated analysis and detection of copyright concerns. I'd been trying to find someone to develop such a tool for more than two years; then, at Wikimania in London, I managed to corner a pywikibot programmer, ValHallASW, and convinced him to do a little work. This was followed by meeting a wonderful Israeli instructor from the Sackler School of Medicine, Shani Evenstein, who knew two incredibly able programmers, User:Eran and User:Ravid ziv. By the end of Wikimania our impromptu team had produced a basic bot – User:EranBot – that does what I'd envisioned. It works by taking all edits over a certain size and running them through Turnitin / iThenticate. Edits that come back positive are listed for human follow-up. Work on this idea was begun back in March 2012 by User:Ocaasi and can be seen here.
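The workflow described above (take the text an edit adds, skip reverts and small edits, hand the rest to a plagiarism check, list the hits for humans) can be sketched roughly as follows. This is only an illustration, not EranBot's actual code: the `Edit` type, the 500-character threshold and the `looks_copied` callable standing in for the Turnitin / iThenticate check are all assumptions made for the example.

```python
import difflib
from typing import Callable, Iterable, List, NamedTuple


class Edit(NamedTuple):
    """Hypothetical, simplified view of one article edit."""
    rev_id: int
    old_text: str
    new_text: str
    is_revert: bool


def added_text(old: str, new: str) -> str:
    """Return the words that a word-level diff marks as newly added."""
    old_words, new_words = old.split(), new.split()
    matcher = difflib.SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    added: List[str] = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            added.extend(new_words[j1:j2])
    return " ".join(added)


def edits_to_review(
    edits: Iterable[Edit],
    looks_copied: Callable[[str], bool],  # stand-in for the external check
    min_chars: int = 500,  # assumed value; the text only says "a certain size"
) -> List[int]:
    """Return revision IDs that should be listed for human follow-up."""
    flagged: List[int] = []
    for edit in edits:
        if edit.is_revert:
            continue  # reverts restore old text and are ignored
        text = added_text(edit.old_text, edit.new_text)
        if len(text) < min_chars:
            continue  # too small to be worth checking
        if looks_copied(text):
            flagged.append(edit.rev_id)
    return flagged
```

Passing the check in as a callable keeps the sketch independent of any particular service; a real bot would also need rate limiting and error handling, which are left out here.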
Determining copy-and-paste issues becomes more difficult the longer one waits between the initial edit and the checking, as one then has to deal with mirroring of Wikipedia content across the Internet. As well, many reliable sources – including peer-reviewed journals and textbooks – have begun borrowing liberally from Wikipedia without attribution. So if we're looking at copyright issues six months or a year down the road, we need to look at publication dates and go back in the article history to determine who is copying from whom.
In short, it's far more difficult for both humans and machines.
Turnitin is an Internet-based plagiarism-prevention service created by iParadigms, LLC, first launched in 1997; it is one of the strategies used by some universities and schools to minimise plagiarism in student writing. The company that developed and owns the product has agreed to give us free access to their tools and API. Even though it's a for-profit company, there won't be obtrusive links from Wikipedia to their site, and no advertising for them will ever appear on Wikipedia.
Why would they want to be involved with us? Letting us use their tools doesn't cost them anything and is no disadvantage to shareholders. Some companies are willing to help us just because they like what we do. We've had a number of publishers donate large numbers of accounts to Wikipedians for similar reasons. They have extra capacity just sitting there, so why not give it away? They also know we're volunteers and are not going to buy their capacity anyway. Other options could include Google, but they don't allow their services to be used in this way, and it appears that Yahoo is currently charging for use by User:CorenSearchBot, which checks new articles for issues.
How many edits are we looking at? Currently the bot is running only on the English Wikipedia's medical articles. In 2013, there were 400,000 edits to medical content – around 1,100 edits per day. Of these only about 10% are of significant size and not a revert, so we're looking at an average of around 100 edits per day. If we assume a 10% rate of copyright concerns (about 10 edits) and three times as many false positives as true positives (about 30 more), we're looking at 40 flagged edits per day at most. Who would follow up? With the number of concerning edits in the range of 40 per day, members of WikiProject Medicine will be able to handle the load. This is much easier than catching 30,000 edits of copyright infringement after the fact, with clean-up taking many of us away from writing content for many days.
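The back-of-the-envelope estimate above can be reproduced in a few lines. The figures are the ones quoted in the text; the function name and its parameters are just illustrative.

```python
def daily_review_load(
    significant_edits_per_day: float,
    concern_rate: float,
    false_per_true: float,
) -> float:
    """Estimate how many edits per day get flagged for human review."""
    true_positives = significant_edits_per_day * concern_rate
    false_positives = true_positives * false_per_true
    return true_positives + false_positives


# 400,000 medical edits in 2013 is roughly 1,100 per day.
edits_per_day = 400_000 / 365

# About 10% are of significant size and not a revert; the text rounds
# the result (about 110) down to roughly 100 per day.
significant_per_day = 100

# 10% real concerns plus three false positives per true positive.
flagged = daily_review_load(significant_per_day, 0.10, 3)
print(f"{edits_per_day:.0f} edits/day, about {flagged:.0f} flagged/day")
```

If the false-positive ratio falls to one false positive per true positive, the same function gives about 20 flagged edits per day.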
The Wiki Education Foundation has expressed interest in the development of this tool, since edits by students have previously contained significant amounts of plagiarism, kindling much discontent with Wiki Education's predecessor. The Hebrew Wikipedia is also currently working with this bot, and we'd be happy to see other topic areas and WMF language sites use it.
There are still a few rough aspects to iron out. The parsing out of the new text added by an edit is not as good as it could be. Reverts should be ignored. These issues are fairly minor to address, and a number have already been dealt with. While there were initially about three false positives for every true positive, we should have this down to a more even 50–50 split by the end of the week. Already in its early stages, this has turned out to be an exceedingly useful tool.
This week we saw three of the top ten articles remain in place, with the Ice Bucket Challenge at #1, Amyotrophic lateral sclerosis at #2, and Islamic State of Iraq and the Levant at #5, all for a second straight week. The death of English actor Richard Attenborough was apparently the most notable of the week, as that article entered the list at #3. Top news subjects of recent weeks, including Ebola virus disease (#7) and Robin Williams (#9), also remained popular.
For the full top 25 list, see WP:TOP25. See this section for an explanation for any exclusions.
For the week of 24 to 30 August 2014, the 10 most popular articles on Wikipedia, as determined from the report of the 5,000 most viewed pages, were:
Rank | Article | Views | Notes |
---|---|---|---|
1 | Ice Bucket Challenge | 1,773,522 | Number 1 for the second week in a row. This global viral phenomenon to raise awareness and funding for research on ALS was not launched by any particular charity, but seems to have grown on its own. While it certainly has achieved its goals, some have criticized the whole movement as feeling more like an act of slacktivism by many participants. But most viral phenomena have absolutely no redeeming social value (has Grumpy Cat raised millions for disease research?), so things could be much worse. Wikipedia did its part to keep things focused on substance by deleting the celebrity-fest page "List of Ice Bucket Challenge participants" on 29 August, after a lengthy deletion debate. |
2 | Amyotrophic lateral sclerosis | 880,652 | Like #1, it's #2 for the second week in a row. |
3 | Richard Attenborough | 794,061 | This popular English actor died on August 24, at age 90. Attenborough won two Academy Awards as director and producer for Gandhi in 1983. He also won four BAFTA Awards and four Golden Globe Awards during his career. As an actor, memorable appearances included roles in Brighton Rock (1947), The Great Escape (1963), 10 Rillington Place (1971), and Jurassic Park (1993). He is survived by his wife of almost 70 years, former actress Sheila Sim. |
4 | Ariana Grande | 589,596 | Up from #19 last week, the popular singer released her second album, My Everything, on August 25. |
5 | Islamic State of Iraq and the Levant | 448,261 | Holding steady at #5 for a second week. This almost absurdly brutal jihadist group proudly posts mass executions it carries out on Twitter, and has been disowned even by al-Qaeda. The recent execution of journalist James Foley is among the reasons for the continued popularity of this article. |
6 | Deaths in 2014 | 361,006 | The list of deaths in the current year is always a popular article. Deaths this week included Leonid Stadnyk (August 24), a Ukrainian formerly listed by Guinness as the tallest man in the world; Swedish comic strip artist Lars Mortimer (August 25); Nigerian pastor Samuel Sadela, unverified claimant to being the oldest male alive (August 26); American particle physicist Victor J. Stenger (August 27); former Soviet spy John Anthony Walker (August 28); Singaporean comedian David Bala (August 29); and 18-year-old Belgian cyclist Igor Decraene, who died in a train accident (August 30). |
7 | Ebola virus disease | 356,594 | The 2014 West Africa Ebola outbreak continues to draw attention to this horrific disease. |
8 | Pseudoscorpion | 334,956 | Reddit noted this week that "tiny pseudoscorpions (about 4mm) live inside old books, effectively protecting them by eating booklice and dustmites", a hook exciting enough to make Reddit put this in the top 10 this week. |
9 | Robin Williams | 332,653 | Down from #3 last week. The unexpected death by suicide of this iconic comic on August 11 led to one of the highest spikes in views since this project began. |
10 | | 328,386 | Usually a fairly popular article; a slower news week allowed it to percolate back up into the Top 10 for the first time in a while. |
This week, the Signpost went out to meet WikiProject Anatomy, dedicated to improving the articles about all our bones, brains, bladders and biceps, and getting them to the high standard expected of a comprehensive encyclopaedia. Begun back in 2005 by Phyzome, this project has its own Manual of Style, a huge to-do list, and yet only 30 active members helping to achieve anatomical greatness. So, we asked CFCF, Flyer22 and LT910001 for their opinions on this vital corner of the wiki documenting our own bodies.
What motivated you to join WikiProject Anatomy? Do you have a background in medicine or biology, or are you simply interested in the topic?
Have you contributed to any of the project's four Featured or thirteen Good articles, and are these sort of articles generally easier or harder to promote than other subjects?
Can you explain your scope: what sort of articles qualify to be tagged under this project and what kind of things you don't cover?
What is your most popular topic or article, measured by reader page views? Should it be a project aim to improve your highest visibility articles?
What are the primary resources used for writing an anatomy article? Do you solely rely on medical experts or are more mainstream references also fine?
How close are your links with WikiProject Medicine, a related project? Do many members participate in both WikiProjects?
What is the reason you exclusively cover human anatomy and not the body parts of other animals? No project seems to be looking after articles such as Thorax.
How can a new member help today?
Anything else you'd like to add?
Better get your syntax all fixed in time for next week, when we'll be venturing out of content to spend some time with a project that never misses an error. Until then, why not look for some mistakes in the archive?
A monthly overview of recent academic research about Wikipedia and other Wikimedia projects, also published as the Wikimedia Research Newsletter.
A research group at MIT led by Cesar A. Hidalgo published[1] a global "Pantheon" (probably the same project already mentioned in our December 2012 issue), where Wikipedia biographies are used to identify and "score" thousands of global historical figures of all time, together with a previous compilation of persons having written sources about them. The work was also covered in several news outlets. We won't summarise here all the details, strengths and limits of their method, which can already be found in the well-written document above.
Many if not most of the headaches encountered by the research group lie in the work needed to aggregate said scores by geographical areas. It's easy to get the city of birth of a person from Wikipedia, but it's hard to tell to what ancient or modern country that city corresponds, for any definition of "country". (Compare our recent review of a related project by a different group of researchers that encountered the same difficulties: "Interactions of cultures and top people of Wikipedia from ranking of 24 language editions".) The MIT research group has to manually curate a local database; in an ideal world, they'd just fetch from Wikidata via an API. Aggregation by geographical area, for this and other reasons, seems of lesser interest than the place-agnostic person rank.
The most interesting point is that a person is considered historically relevant when they are the subject of an article in 25 or more editions of Wikipedia. This method of assessing an article's importance is often used by editors, but only as an unscientific approximation. It's a useful finding that it proved valuable for research as well, though with acknowledged issues. The study is also one of the rare times researchers bother to investigate Wikipedia in all languages at the same time and we hope there will be follow-ups. For instance, it could be interesting to know which people with an otherwise high "score" were not included due to the 25+ languages filter, which could then be further tweaked based on the findings. As an example of possible distortions, Wikipedia has a dozen subdomains for local languages of Italy, but having an article in 10 italic languages is no more an achievement of "global" coverage than having an article in one.
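To make the 25-editions filter and the distortion mentioned above concrete, here is a minimal sketch. It is not the MIT group's code: the data structure, the function names and the idea of collapsing related editions into one group before counting are assumptions introduced for illustration.

```python
from typing import Dict, List, Set


def globally_covered(
    editions_by_person: Dict[str, Set[str]],
    min_editions: int = 25,  # threshold quoted in the text
) -> List[str]:
    """Return people with articles in at least `min_editions` editions."""
    return sorted(
        name
        for name, editions in editions_by_person.items()
        if len(editions) >= min_editions
    )


def count_language_groups(editions: Set[str], group_of: Dict[str, str]) -> int:
    """Count editions after collapsing related ones into a single group.

    `group_of` is a hypothetical mapping from an edition code to the
    group it should be counted under; unmapped codes count as themselves.
    """
    return len({group_of.get(code, code) for code in editions})
```

With a mapping such as `{"nap": "it", "scn": "it"}`, a person covered in the Italian, Neapolitan, Sicilian and English editions would count as two groups rather than four, which is one possible way to tweak the filter.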
The group then proceeded to calculate a "historical cultural production index" for those persons, based on pageviews of the respective biographies (PV). This reviewer would rather call it a "historical figures modern popularity index". While the recentism bias of the Internet (which Wikipedia acknowledges and tries to counter) is acknowledged for selection, most of the recentism in this work is in ranking, because of the usage of pageviews. As WikiStats shows, 20% of requests come from a country (the US) with only 5% of the world population, or some 0.3% of the total population in history (assumed as ~108 billion). Therefore there is an error/bias of probably two orders of magnitude in the "score" for "USA" figures; perhaps three, if we add that five years of pageviews are used as a sample for the whole current generation. L* is an interesting attempt to correct the "languages count" for a person (L) in the cases where visits are amassed in single languages/countries; but a similar correction would be needed for PV as well.
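The "two orders of magnitude" estimate can be checked with a short calculation. The three shares are the ones quoted above; the function name is illustrative.

```python
import math


def overrepresentation(share_of_views: float, share_of_population: float) -> float:
    """How many times more views a group gets than its population share."""
    return share_of_views / share_of_population


us_request_share = 0.20        # share of requests coming from the US
us_share_of_living = 0.05      # share of today's world population
us_share_of_history = 0.003    # share of all people in history (~108 billion)

vs_living = overrepresentation(us_request_share, us_share_of_living)
vs_history = overrepresentation(us_request_share, us_share_of_history)

print(f"vs. living population: about {vs_living:.0f}x")
print(f"vs. all of history: about {vs_history:.0f}x "
      f"(10^{math.log10(vs_history):.1f})")
```

Relative to everyone who has ever lived, the ratio comes out at roughly 67, which is a little under two orders of magnitude.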
From the perspective of Wikipedia editors, it's a pity that Wikipedia is the main source for such a rank, because this means that Wikipedians can't use it to fill gaps: the distribution of topic coverage across languages is complex and far from perfect; while content translation tools will hopefully help make it more even, prioritisation is needed. It would be wonderful to have a rank of notably missing biographies per language edition of Wikipedia, especially for under-represented groups, which could then be forwarded to the local editors and featured prominently to attract contributions. This is a problem often worked on, from ancient times to recent tools, but we really lack something based on third party sources. We have good tools to identify languages where a given article is missing, but we first need a list (of lists) of persons with any identifier, be it authority record or Wikidata entry or English name or anything else that we can then map ourselves.
The customary complaint about inconsistent inclusion criteria can also be found: «being a player in a second division team in Chile is more likely to pass the notoriety criteria required by Wikipedia Editors than being a faculty at MIT», observe the MIT researchers. However, the fact that nobody has bothered to write an article on a subject doesn't mean that the project as a whole is not interested in having that article; articles about sports people are just easier to write, and the project needs and wants more volunteers for everything. Hidalgo replied that he had some examples of deletions in mind; we have not reviewed them, but it's also possible that the articles were deleted for their state rather than for the subject itself, a difference to which "victims" of deletion often fail to pay attention.
– by Maximilianklein
When analyzing any Wikipedia version, getting the underlying data can be a hard engineering task, beyond the difficulty of the research itself. Being developed by researchers from Macalester College and the University of Minnesota, WikiBrain aims to "run a single program that downloads, parses, and saves Wikipedia data on commodity hardware." [2] Wikipedia dump-downloaders and parsers have long existed, but WikiBrain is more ambitious in that it tries to be even friendlier by introducing three main primitives: a multilingual concept network, semantic relatedness algorithms, and geospatial data integration. With those elements, the authors are hoping that Wikipedia research will become a mix-and-match affair.
The first primitive is the multilingual concept network. Since the release of Wikidata, the Universal Concepts that all language versions of Wikipedia represent have mostly come to be defined by the Wikidata item that each language mostly links to. "Mostly" is a key word here, because there are still some edge cases, like the English Wikipedia's distinguishing between the concepts of "high school" and "secondary school", while others do not. WikiBrain will give you the Wikidata graph of multilingual concepts by default, and the power to tweak this as you wish.
The next primitive is semantic relatedness (SR), which is the process of quantifying how close two articles are by their meaning. There have been literally hundreds of SR algorithms proposed over the last two decades. Some rely on Wikipedia's links and categories directly. Others require a text corpus, for which Wikipedia can be used. Most modern SR algorithms can be built one way or another with Wikipedia. WikiBrain supplies the ability to use five state-of-the-art SR algorithms, or their ensemble method – a combination of all five.
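For readers unfamiliar with the idea, here is a toy illustration of a link-based relatedness score and of combining several scores into one. It is not one of WikiBrain's five algorithms, nor its ensemble method: Jaccard overlap of link sets and a plain average are simple stand-ins chosen for the example.

```python
from typing import Sequence, Set


def link_jaccard(links_a: Set[str], links_b: Set[str]) -> float:
    """Relatedness as the overlap between two articles' link sets (0 to 1)."""
    union = links_a | links_b
    if not union:
        return 0.0  # two articles without links share nothing measurable
    return len(links_a & links_b) / len(union)


def ensemble(scores: Sequence[float]) -> float:
    """Combine scores from several relatedness measures by averaging."""
    if not scores:
        raise ValueError("at least one score is required")
    return sum(scores) / len(scores)
```

Two articles whose link sets share two of four distinct targets would score 0.5 under this measure; real SR algorithms weight links and text far more carefully.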
Already at this point an example was given of how to mix our primitives. In just a few lines of code, one could easily find which articles in all languages were closest to the English article on "jazz", and which were also tagged as a film in Wikidata.
The last primitive is a suite of tools that are useful for spatial computation, so that extracting location data out of Wikipedia and Wikidata can become a standardized process. Incorporated are some classic solutions to the "geoweb scale problem" – that regardless of an entity's footprint in space, it is represented by a point. That is a problem one shouldn't have to think about, and indeed, WikiBrain will solve it for you under the covers.
To demonstrate the power of WikiBrain the authors then provide a case study wherein they replicate previous research that took "thousands of lines of code", and do it in "just a few" using WikiBrain's high-level syntax. The case study is cherry-picked, as it is previous research by one of the paper's listed authors – of course it's easy to reconstruct one's own previous research in a framework you custom-built. The case study is an empirical test of Tobler's first law of geography using Wikipedia articles. Essentially one compares the SR of articles versus their geographic closeness – and it's verified they are positively linked.
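The comparison at the heart of the case study can be sketched without WikiBrain. The code below is an assumption-laden illustration, not the authors' method: it uses great-circle (haversine) distance between point coordinates and a Pearson correlation between relatedness scores and negated distance, so a positive result means closer pairs are more related.

```python
import math
from typing import List, Sequence, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees
EARTH_RADIUS_KM = 6371.0


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance between two points, in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def closeness_correlation(pairs: List[Tuple[float, Coord, Coord]]) -> float:
    """Correlate relatedness with geographic closeness (negated distance).

    Each item is (relatedness score, location of article A, location of B).
    """
    relatedness = [sr for sr, _a, _b in pairs]
    closeness = [-haversine_km(a, b) for _sr, a, b in pairs]
    return pearson(relatedness, closeness)
```

The function expects at least two pairs with differing distances and scores; otherwise the correlation is undefined and the division fails.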
Does the world need an easier, simpler, more off-the-shelf Wikipedia research tool? Yes, of course. Is WikiBrain it? Maybe or maybe not, depending on who you are. The software described in the paper is still version 0.3. There are notes explaining the upcoming features of edit history parsing, article quality ranking, and user data parsing. The project and its examples are written in Java, which is a language choice that targets a specific demographic of researchers, and alienates others. That makes WikiBrain a good tool for Java programmers who do not know how to parse off-line dumps, and have an interest in multilingual concept alignment, semantic relatedness, or spatial relatedness. Everyone else will have to make do with one of the other 20+ alternative parsers and write their own glue code. That's OK though; frankly the idea to make one research tool to "rule them all" is too audacious and commandeering for the open-source ecosystem. Still, that doesn't mean that WikiBrain can't find its userbase and supporters.
It's time for another interesting paper on newcomer retention[3] from authors with a proven track record of tackling this issue. This time they focus on the Articles for Creation mechanism. The authors conclude that instead of improving the success of newcomers, AfC in fact further decreases their productivity. The authors note that once AfC was fully rolled out around mid-2011, it began to be widely used – the percentage of newcomers using it went up from <5% to ~25%. At the same time, the percentage of newbie articles surviving on Wikipedia went down from ~25% to ~15%. The authors hypothesize that the AfC process is unfriendly to newcomers due to the following issues: 1) it's too slow, and 2) it hides drafts from potential collaborators.
The authors find that the AfC review process is not subject to insurmountable delays; they conclude that "most drafts will be submitted for review quickly and that reviews will happen in a timely manner". In fact, two-thirds of reviews take place within a day of submission (a figure that positively surprised this reviewer, though a current AfC status report suggests the situation has worsened since: "Severe backlog: 2599 pending submissions"). In either case, the authors find that about a third of newcomers using the AfC system fail to understand that they need to finalize the process by submitting their drafts for review at all – a likely indication that the AfC instructions need revising, and that the AfC regulars may want to implement a system of identifying stalled drafts, which in some cases may be ready for mainspace despite having never been officially "submitted" (due to their newbie creator not knowing about this step or not carrying it out properly).
However, the authors do stand by their second hypothesis: they conclude that the AfC articles suffer from not receiving collaborative help that they would get if they were mainspaced. They discuss a specific AfC, for the article Dwight K. Shellman, Jr/Dwight Shellman. This article has been tagged as potentially rescuable, and has been languishing in that state for years, hidden in the AfC namespace, together with many other similarly backlogged articles, all stuck in low-visibility limbo and prevented from receiving proper Wikipedia-style collaboration-driven improvements (or deletion discussions) as an article in the mainspace would receive.
The researchers identify a number of other factors that reduce the functionality of the AfC process. As in many other aspects of Wikipedia, negative feedback dominates. Reviewers are rarely thanked for anything, but are more likely to be criticized for passing an article deemed problematic by another editor, thus leading to the mentality that "rejecting articles is safest" (as newbies are less likely to complain about their article's rejection than experienced editors about passing one). AfC also suffers from the same "one reviewer" problem as GA – the reviewer may not always be qualified to carry out the review, yet the newbies have little knowledge of how to ask for a second opinion. The authors specifically discuss a case of reviewers not familiar with the specific notability criteria: "[despite being notable] an article about an Emmy-award winning TV show from the 1980's was twice declined at AfC, before finally being published 15 months after the draft was started". Presumably, had this article not been submitted for review, it would never have been deleted from the mainspace.
The authors are critical of the interface of the AfC process, concluding that it is too unfriendly to newbies, instruction-wise: "Newcomers do not understand the review process, including how to submit articles for review and the expected timeframe for reviews" and "Newcomers cannot always find the articles they created. They may recreate drafts, so that the same content is created and reviewed multiple times. This is worsened by having multiple article creation spaces (Main, userspace, Wikipedia talk, and the recently-created Draft namespace)".
The researchers conclude that AfC works well as a filtering process for the encyclopedia; however, "for helping and training newcomers [it] seems inadequate". AfC succeeds in protecting content under the (recently established) speedy deletion criterion G13, in theory allowing newbies to keep fixing it – but many do not take this opportunity. Nor can the community deal with this, and thus the authors call for the creation of "a mechanism for editors to find interesting drafts". That said, this reviewer wants to point out that the G13 backlog, while quite interesting (thousands of articles almost ready for main space ...), is not the only backlog Wikipedia has to deal with – something the writers overlook. The G13 backlog is likely partially a result of imperfect AfC design that could be improved, but all such backlogs are also an artifact of the lack of active editors affecting Wikipedia projects on many levels.
In either case, AfC regulars should carefully examine the authors' suggestions. This reviewer finds the following ideas in particular worth pursuing. 1) Determine which drafts need collaboration and make them more visible to potential editors. Here the authors suggest use of a recent academic model that should help automatically identify valuable articles, and then feeding those articles to SuggestBot. 2) Support newcomers’ first contributions – almost a dead horse at this point, but we know we are not doing enough to be friendly to newcomers. In particular, the authors note that we need to create better mechanisms for newcomers to get help on their draft, and to improve the article creation advice – especially the Article Wizard. (As a teacher who has introduced hundreds of newcomers to Wikipedia, this reviewer can attest that the current outreach to newbies on those levels is grossly inadequate.)
A final comment to the community in general: was AfC intended to help newcomers, or was it intended from the start to reduce the strain on new page patrollers by sandboxing the drafts in the first place? One of the roles of AfC is to prevent problematic articles from appearing in the mainspace, and it does seem that in this role it is succeeding quite well. The English Wikipedia community has rejected the flagged revisions-like tool, but allowed implementation of it on a voluntary basis for newcomers, who in turn may not often realize that by choosing the AfC process, friendly on the surface, they are in fact slow-tracking themselves, and inviting extraordinary scrutiny. This leads to a larger question that is worth considering: we, the Wikipedia community of active editors, have declined to have our edits classified as second-tier and hidden from the public until they are reviewed, but we are fine pushing this on to the newbies. To what degree is this contributing to the general trend of Wikipedia being less and less friendly to newcomers? Is the resulting quality control worth turning away potential newbies? Would we be here if years ago our first experience with Wikipedia was through AfC?