Wikipedia:Wikipediology/library/essays/R.fiend-1

How Many Articles Does Wikipedia Really Have?

While the official count is somewhere around 750,000, having looked through quite a few of these articles, I've noticed that a number of them would be hard to call "articles". Some are a sentence long, some are disambiguation pages, and some are more of a chart or a list than an article (which is not to say they aren't useful, but the term "article" might be a bit of a stretch). So I decided to explore what these 750,000 articles contain. To aid me in this I have used the random article option. Now, it's been a while since I've taken a class in probability and the like, but it is my understanding that, if truly random, a fairly small sample can yield fairly accurate results. This assumes the random article button is truly random (and I think it is). I have sorted the results into several categories:

  • Full articles: Although I cannot judge their overall quality very well, these are your basic decent or very good articles in Wikipedia, with a few exceptions that are covered elsewhere.
  • Stubs: Sort of tricky, as I disregarded whether they were labelled as such, but I basically went with anything that was a brief or medium-sized paragraph. Anything of two sentences or fewer falls under:
  • Sub-stubs: Very brief articles (they really cannot be considered articles at all)
  • Disambiguation pages: We all know what these are
  • Articles completely from other sources: These are basically the same as "full articles", but I thought they deserved special categorization, as they are not the creation of Wikipedia or Wikipedians. In its own category are
  • Rambot articles: We all know these; quite a vexed part of Wikipedia. Those that have substantial human written sections (not sure what qualifies as "substantial" yet) will be included in full articles.
  • Articles that are mostly a chart: Articles that have little text that is not a template or chart. This includes album stubs that are a sentence (artist and year) plus a template and a tracklist, and often a member lineup as well. Those with at least a few sentences beyond this can be considered full articles. A similar but separate group is:
  • Lists: We all know these; another highly debated facet.
  • Articles needing cleanup: For this, I included only those needing pretty substantial cleanup. It's a bit of a judgment call, I know, but these are ones that can generally be called "pretty bad" and hence do not take their place (yet) with the full articles, but are not
  • Deletables: Candidates for either speedy or AfD. Again a bit of a judgment call here, particularly since I have a slightly higher bar than many. But this project is partially for the purpose of people who are skeptical of Wikipedia, those who are used to and prefer more standard encyclopedias, and their standards may well be considerably above mine.
  • Dubious articles: Ones that may be disguised adverts, contain potentially unverifiable claims, or be hoaxes, copyvios, and the like, but ones I'm not going to AfD.
  • Redirects: Articles that should have been redirects to more complete articles, and which I have since redirected. This is sort of like a delete, as it removes one article from the total number. Existing redirects, I believe, are counted neither in the article count nor by the random article feature, so they are disregarded.

Note this is more about types of articles than article quality. In my notes I originally had "full articles" down as "good articles", but decided to change it. Since I won't be reading anything but the stubs and substubs in their entirety, I cannot judge the accuracy or quality of the rest (even if I read them completely, I still couldn't do so without substantial research, which would obviously slow this project down immensely). There are also articles that, while full and complete, I wouldn't necessarily call "good". I didn't want to have to make a separate category for fancruft or anything, as that would obviously be severely subjective. At some later point, I think I may undertake another such project in which I categorize random full articles by subject, paying particular attention to the amount of fiction. One criticism (not entirely unfounded) that Wikipedia often garners is its level of detail on TV shows, sci-fi, anime, etc., potentially at the expense of other subjects (though whether this is really at any expense is clearly debatable).

The number of random articles I explored was 500. This should give me a pretty accurate reading. I'll have to look into calculating the margin of error (if any math folks want to help, I'd be grateful).
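For what it's worth, here is a rough sketch of how the margin of error for a sample of 500 might be estimated. It is not a calculation I have actually applied to my results. It assumes the random article button gives a simple random sample, uses the usual normal approximation for a proportion at 95% confidence (z = 1.96), and takes 750,000 as the total. The function name `margin_of_error` and its parameters are my own inventions for illustration.

```python
import math

def margin_of_error(p, n, z=1.96, population=None):
    """Approximate margin of error for a sampled proportion.

    p          -- observed share of the sample in one category (0 to 1)
    n          -- sample size
    z          -- z-score for the confidence level (1.96 is roughly 95%)
    population -- optional total size, for a finite population correction
    """
    # Standard error of a proportion under the normal approximation
    se = math.sqrt(p * (1 - p) / n)
    if population is not None:
        # The correction barely matters when the sample is tiny
        # compared with the population, as it is here
        se *= math.sqrt((population - n) / (population - 1))
    return z * se

# Worst case is p = 0.5, which gives the widest margin for any category
worst_case = margin_of_error(0.5, 500)
corrected = margin_of_error(0.5, 500, population=750_000)
print(f"worst case: +/- {worst_case * 100:.1f} percentage points")
print(f"with finite population correction: +/- {corrected * 100:.1f}")
```

Under these assumptions, any category's share from 500 random articles should be within about 4.4 percentage points of the true share, 95% of the time. Smaller categories have narrower margins in absolute terms, though the normal approximation gets less reliable for categories with only a handful of hits.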

If anyone knows of a similar project by another Wikipedian, I'd be very curious to see it. I'd love to hear any feedback on this; leave it on my talk page.