Wikipedia talk:Wikipedia Signpost/2015-12-02/Op-ed

Discuss this story

@ Andreas Kolbe, Thanks for an excellent article highlighting the pitfalls in Wikidata policies. Hope to see the foundation act on the issues. --Arjunaraoc (talk) 03:05, 7 December 2015 (UTC)[reply]

"Strong and explicit disclaimers" is, in practice, a joke, since I'm pretty sure if you took a random sample of visitors to Wikimedia projects 99% would be unaware that the disclaimers exist. From what I've read anecdotally, a sizable number of people think there's a paid staff that writes Wikipedia, or that the Foundation has editorial control over the projects. --71.119.131.184 (talk) 03:46, 7 December 2015 (UTC)[reply]

You're right: there are indeed many, many people still operating under these mistaken assumptions. On the other hand, Wikipedia's Wikipedia:General_disclaimer does average well over 3,000 views a day (currently ranking #1664 in traffic on en.wikipedia.org), and there has been substantial public discussion of the fact that an openly editable crowdsourced encyclopedia cannot be relied upon to present correct information at any given point in time. As long as the Knowledge Graph or Snapshot still contains the word "Wikipedia", at least some people will bear that in mind. The moment the attribution disappears, however, the chances of people doing that diminish. Andreas JN466 06:51, 7 December 2015 (UTC)[reply]

@ Andreas Kolbe Thank you for an enlightening article. Especially significant: the loss of provenance (verifiability) due to clear violations of Wikipedia's generous but restrictive licensing terms, e.g., importing Wikipedia's CC BY-SA 3.0 licensed content (not facts, but claims of fact) without required attribution directly into Wikidata under the permissive CC0 public domain dedication. A great investment for Google and Microsoft, which have the financial means and technical infrastructure to continually analyze, refine, and commercially exploit Wikidata's now totally free crowd-sourced claims of fact without any community responsibilities whatsoever, other than those due to their shareholders. -- Paulscrawl (talk) 05:15, 7 December 2015 (UTC)[reply]

  • There are two different issues. One is data quality; for instance, unreferenced data. The other one is how people admit data as valid.
    As for quality, I'm worried about lack of references as mentioned above. But I'm also worried about corporate bias -Google and Microsoft are mentioned- as I do not bite the hand that feeds me (so I never ever edit about the company I work for, my personal policy).
    But as for data validation by users, that's a different case. I do not trust any statement based on a single unknown source. An extreme case: my mother (a female) was studying Medicine in 1960. The local census for her home town shows that the number of female university students in that place in 1960 was zero. The census is obviously wrong. Does it mean that censuses are always wrong? No, in fact they are mostly right. But they can be wrong. And a Spanish census is a quite well done official source of information. So if I-don't-know-who says that an avocado is a kind of Nepalese oceangoing vessel... well, I should double check. In wiki and out of wiki, pre-wiki, post-wiki, inter-wiki.
    Is information neutral? Are data? Not really, based on our own experience in life. We are just used to living in this kind of context. I know that saying Myanmar or Burma, Alboraya or Alboraia, football or soccer, are non-neutral decisions; we know what to expect from texts making such word choices and we evaluate them accordingly. It is experience and prudence, the same things that keep us from being run over by a car when we cross the street. B25es (talk) 07:03, 7 December 2015 (UTC)[reply]
  • it is an opinion. It is severely flawed and, you know what happened to the ring that ruled them all. For want of a better world it was destroyed. This opinion demonstrates a total lack of understanding of what a wiki is and the quality that Wikidata brings. It deserves a rebuttal and I would love to write one. Thanks, GerardM (talk) 06:44, 7 December 2015 (UTC)[reply]
DarTar, Jayen466: It doesn't look like this discussion is taken seriously by Wikidata devs like User:Markus Krötzsch, who prefers to say so in the Wikidata mailing-list echo chamber. On the other hand, nobody cares to reject ludicrous arguments by GerardM that poisoning Wikidata with bad data is no problem, just like carelessly poisoning the Rhine is apparently no problem downstream in the Netherlands because "shit happens and we can deal with that." Surreal. As a Wikipedian, I can tell you that I don't want to be forced to deal with the shit that happened at Wikidata, thank you very much. --Atlasowa (talk) 12:40, 7 December 2015 (UTC)[reply]
Thanks for the pointer, Atlasowa. I hadn't actually seen that response from Markus; it didn't come through to the Wikimedia-l mailing list. Andreas JN466 13:47, 7 December 2015 (UTC)[reply]
I have now replied to Markus. [1] --Andreas JN466 23:21, 7 December 2015 (UTC)[reply]
  • The actual and major point of Wikidata, as far as I can see (I have been working on it for a year and a few months) is that it is versatile. It is not for just one thing; it began as an interwiki index, but has moved quite a distance from that position. So I think we can pretty much forget about considerations based on the business interests of the original sponsors, for example. The scope is broad rather than narrow, and many people and institutions are going to find it useful. (I was in a GLAM meeting on Wednesday and the institution in question seemed to find it an eye-opener how much has already happened.) Another point is that Wikidata after three years is much like Wikipedia after three years, i.e. 2004 here. Which I remember quite well: it has the same feeling of a huge amount to do wherever you look. So, naturally, if you are picky you can find things to be picky about. Put another way, guidelines are not yet well developed, systems not in place. The Wikidata community seems to function quite reasonably, and that is a reason to be hopeful that issues will find solutions. The third point I'd like to make is that areas like "authority control" seem to be crying out for something like Wikidata - I have become familiar with VIAF through Wikidata work, and what Wikidata adds to that major system is already substantial, though in need of some checking because the early bot work was a bit careless about disambiguation. In fact I came up just recently with a thought (Wikidata is a database that "can do outreach") which made me conclude that the "linked structured data" model in use is a big advance. I have come in from the merging encyclopedias direction, and (via Magnus Manske) I have come to see that the old way of thinking in the "missing article" area is obsolescent, with Wikidata able to provide a much better environment for what can only be called digital scholarship.
And also, for example, able to support editathons by supplying "redlink lists" of missing articles to work on. WP:WPDNB and its talk page archives show the emergence of some of the new thinking. It would be silly to ignore the real problems with data integrity on Wikidata; but the standards of referencing are going up, and one shouldn't use metrics that are somewhat naive to argue about that issue. Charles Matthews (talk) 07:22, 7 December 2015 (UTC)[reply]
    • Wikidata is the designated successor to Freebase, used as a source for SERP infoboxes by both Google and Microsoft. So I wouldn't say that there are no business interests involved: the impact of infobox features on users' interaction with search engine results pages is profound. 2004: One point people raised in the Wikimedia-l discussion was that Wikidata should take the lessons learned by Wikipedia in its early years on board, rather than replicating these errors. I find that argument fairly compelling. Referencing: Standards of referencing do seem to be going up – in June of this year, only 17% of Wikidata statements referenced what in Wikipedia would be considered a reliable source, and now it is 21% – but there is still a long way to go. Andreas JN466 08:26, 7 December 2015 (UTC)[reply]
      • What I meant was not that Wikidata is "decoupled" from business, which it isn't, but that I don't see the argument that it should be decoupled as particularly interesting (to me). Yes, I agree that the lessons of history are important, and my positive verdict on the Wikidata community factors in the way discussions are actually conducted, which seems much more helpful in practice (people generally less stubborn, for example). On referencing, looking at biographies which are about 20% of items, referencing vital dates is much more important than referencing occupations (say). It is interesting to see the efforts of the Library of Congress and Union List of Artist Names to reference dates, for example: these are major authoritative database sources, but they don't have as transparent a system as Wikidata now proposes. With 50% references on statements, we are in classic "glass half full/empty" territory anyway. What Wikidata has going for it is the ability, for example, to search for unreferenced death dates. The status quo, before Wikidata, was that such major databases could disagree, and no one pointed a finger at anybody. Charles Matthews (talk) 08:57, 7 December 2015 (UTC)[reply]
  • What I would like to see is any practical discussion of how Wikidata is useful, now. Because I am pretty sure that the promised "population of birth/death" dates feature, for example, has not yet happened. As a Wikipedia article writer, the only use I see we get out of Wikidata is a centralized repository for interlanguage links. Not the best investment for the mentioned 1+ million euros. --Piotr Konieczny aka Prokonsul Piotrus| reply here 07:39, 7 December 2015 (UTC)[reply]
    • Just a quick point regarding development costs: My understanding is that the mentioned 1.3 million Euros from the three sponsors funded initial development work begun in 2012, and that a substantial part of the movement's funds granted to Wikimedia Deutschland annually since then has supported further development of Wikidata (see related comments by the Funds Dissemination Committee quoted in last week's News and notes). Andreas JN466 08:05, 7 December 2015 (UTC)[reply]
@Piotrus: "Not the best investment for the mentioned 1+ million euros." Making it much, much less work to maintain a small Wikipedia sounds to me like a very good investment, for which millions of euros is small compared to the long-run benefits. There are other ways Wikidata benefits Wikipedia, including automatic generation of lists with Listeria (yes, some of these will be incomplete or incorrect, just like manually-generated Wikipedia lists). I find it useful to look up on one page how a term is represented in lots of different languages. Then stepping away from the "As a Wikipedia article writer..." to Wikisource, it's great that I can add metadata to a Wikisource author profile just by linking from Wikidata, without having to paste in and maintain an image or authority file links. Stepping away from the other Wikimedia projects entirely, Wikidata is already an awesome free knowledge project in its own right: the people I'm training at the University of Oxford are really impressed with Histropedia timelines, the Reasonator, the map interfaces, the ongoing integration of scholarly authority files. And that's just what's happening at what we all agree is a very early stage of Wikidata's evolution. Yes, millions of euros is a lot of money, but it needs to be seen in perspective of the value created, and in this context it's frankly not much. A case can be made that Wikidata will ultimately be more important than Wikipedia to the web as a whole. MartinPoulter (talk) 17:39, 8 December 2015 (UTC)[reply]

On two occasions I have been asked, "Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?" ... I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.

— Charles Babbage, Passages from the Life of a Philosopher (1864)
  • WikiProject Medicine participants and others are discussing an application of Wikidata in Wikipedia infoboxes at "Another reform proposal - split infobox into 'human readable' and 'non human readable' and call from Wikidata", or see this discussion archived. The article above is great. The "Poof it works" article on gene information from Wikidata in Wikipedia by Benjamin Good and team is still the best profile I have seen of this application. Blue Rasberry (talk) 13:09, 7 December 2015 (UTC)[reply]
  • A pitfall of that reasoning: references come from users, just as references on Wikipedia do. If you don't have any (serious) users, then you don't have people to correctly reference the facts. Then you don't have trustworthy data. And ... if you don't have data, then you don't have users. The most important thing to understand about Wikidata is that its quality will improve the more it is used. It will be used if there is data to use. Once Wikidata has had its kickstart, more and more Wikipedias will use the data, so more and more users will require sources for the data they have. But data won't come by itself, and we have to start somewhere to get that kickstart. TomT0m (talk) 14:40, 7 December 2015 (UTC)[reply]
    A corollary: there is an opposite, non-virtuous circle: Wikidata doesn't have any other data. The big Wikipedias will continue to ignore the project because ... they have more and better data? Why would they bother? Then if we stay like this ... the small Wikipedias won't benefit from the data for their projects, and it's as if there were no Wikidata at all, and no community. TomT0m (talk) 14:44, 7 December 2015 (UTC)[reply]
  • @ Andreas Kolbe: thanks for the article, I do not agree with many of your opinions, but it's gold. So thank you. I think, with many people, that there are many topics involved here: CC0, the role of "over-the-top" companies like Google and Bing, data quality. I want to address just one bit, though: "the authority control to rule them all". I still think that Wikidata can be a "super authority control", because it is perfect as an aggregator of identifiers, and for things that are unique (like persons) has already proven its worth. VIAF can check if one of its authors is the same as one in other authority controls via Wikidata, using bots and a bit of AI. This is already useful and helpful, just because Wikidata is a place where you can import many authority controls and reconcile them with each other. I don't really see a problem here. Aubrey (talk) 15:07, 7 December 2015 (UTC)[reply]
  • Agree, of course: it is already happening. And enWP will benefit when those about to write a biographical article here routinely check for a Wikidata item, not only for existing language versions, but database links. And images, naturally. Could take a few years. Charles Matthews (talk) 15:31, 7 December 2015 (UTC)[reply]
  • One point to extend what you wrote about verification in Wikidata. Having dabbled over there earlier this year, unless its interface has radically changed in the last few months, I found adding references there to Wikipedias difficult & references to sources beyond Wikipedias practically impossible. And I say this as someone who is computer savvy. Documentation would help, but I suspect the ability to verify statements was added more as an afterthought than as part of the original design. (For one thing, it's easy to add links to other Wikipedia nodes, which represent notable items; however most sources, either primary or secondary, are not & will not be notable per Wikipedia consensus.) -- llywrch (talk) 16:33, 7 December 2015 (UTC)[reply]
    On the technical point, adding references to a Wikidata statement is straightforward once you know the drill. There are "reference URL" (diff) and "stated in" options, and I use these all the time. Also "imported from" for a Wikipedia import. Charles Matthews (talk) 06:37, 8 December 2015 (UTC)[reply]
    No, they are notable on Wikidata, as they "fulfil a structural need" per d:WD:N. Something does not necessarily have to be notable on a Wikipedia to be notable on Wikidata, although anything that has an article on any Wikipedia is notable on Wikidata. Sourcing was taken into account from the beginning, although it takes time to be well implemented, like anything else on Wikidata.
    There are also projects to ease sourcing on Wikidata; for example, I just received a mail through the Wikidata mailing list about Strep Hit, which just got accepted as an IEG Grant. TomT0m (talk) 18:08, 7 December 2015 (UTC)[reply]
  • Well written paper, but useless. We could write the same paper about Wikipedia. But the community will reply: it's a wiki, so fix it; it's an encyclopedia project, a work in progress, etc. Pyb (talk) 20:16, 7 December 2015 (UTC)[reply]
    An important point here is that Wikidata (in particular from the central storage perspective) is a different project and has different requirements. In that sense it is not just a wiki and should not be treated as such; it is a central store whose errors multiply throughout the system, and hence it needs even stricter requirements for its data than Wikipedia.--Kmhkmh (talk) 00:14, 8 December 2015 (UTC)[reply]
    That's just focusing on the wrong side of the coin. An error on enwiki is broadcast to all enwiki readers and to DBpedia, so ... it's not really much different. We can also count on the fact that a more visible error will be corrected faster than one buried in a rarely read article, so an error in Wikidata propagated to several wikis will be corrected faster overall than an isolated error in one Wikipedia. Lastly, Wikidata as a central repository actually has its own error detection mechanisms, helped by the effort to structure the data, which add to the sum of all error detection mechanisms in the local wikis. So overall, this fast propagation of errors might be more than compensated for by the advantages of centralizing data, which pools all the efforts to improve quality in the short and long run. TomT0m (talk) 08:46, 8 December 2015 (UTC)[reply]
  • Regarding the graph at the beginning, many statements on Wikidata are self-evident and don't really need sources (e.g. authority control, instance of, sex or gender, etc.). Of course, lots of statements on Wikidata do need sources and it's a lot harder to get people to provide them than it is on Wikipedia, but I feel this graph misrepresents the project. FallingGravity (talk) 04:54, 8 December 2015 (UTC)[reply]
    • It's official WMF data. References are clearly more important in some cases than in others. I'm not losing sleep over the fact that there is no reference in Wikidata supporting the assertion that the mother of Jesus was "Mary" (or that Mary was female). But in fact Wikidata has four references for Barack Obama being male, for example. Andreas JN466 14:05, 8 December 2015 (UTC)[reply]
  • Very interesting article. Thank you! -- œ 12:20, 8 December 2015 (UTC)[reply]
  • If one were to count the sentences contained in Wikipedia, add to that all the infobox elements, and divide the sum by the references on Wikipedia, the result would be far worse than for Wikidata in my opinion. - And just to offer a different view of the data: The number of references within the Wikipedia-Wikidata ecosystem is increasing in absolute numbers. - The numbers in the graph are a fact, but it is odd that the author assumes that his reading of the graph is an intrinsic truth of the numbers, which science knows does not exist. --Tobias1984 (talk) 16:51, 8 December 2015 (UTC)[reply]
  • As a Wikidatian I agree with some of the article's criticisms. The problem for me is that the initial objective of WD has been forgotten: to be the reference database of Wikipedia. People in WD are playing their own game without any consideration of the final data users and their requirements. I have the impression that people are playing with data import because they can do it and not because they have an objective. They only want to fill memory without any thought about the use of that data. The license is a problem too and I think we missed an important step when the choice of the license was made. CC0 just means you can't access most of the reference data because the minimal license is CC BY-SA in most of the official databases. Snipre (talk) 01:25, 9 December 2015 (UTC)[reply]
    • Snipre, I find the following interesting: Google said, "When we publicly launched Freebase back in 2007, we thought of it as a "Wikipedia for structured data." So it shouldn't be surprising that we've been closely watching the Wikimedia Foundation's project Wikidata[1] since it launched about two years ago. We believe strongly in a robust community-driven effort to collect and curate structured knowledge about the world, but we now think we can serve that goal best by supporting Wikidata -- they’re growing fast, have an active community, and are better-suited to lead an open collaborative knowledge base. So we've decided to help transfer the data in Freebase to Wikidata, and in mid-2015 we’ll wind down the Freebase service as a standalone project. Freebase has also supported developer access to the data, so before we retire it, we’ll launch a new API for entity search powered by Google's Knowledge Graph. Loading Freebase into Wikidata as-is wouldn't meet the Wikidata community's guidelines for citation and sourcing of facts -- while a significant portion of the facts in Freebase came from Wikipedia itself, those facts were attributed to Wikipedia and not the actual original non-Wikipedia sources. So we’ll be launching a tool for Wikidata community members to match Freebase assertions to potential citations from either Google Search or our Knowledge Vault[2], so these individual facts can then be properly loaded to Wikidata. We believe this is the best first step we can take toward becoming a constructive participant in the Wikidata community, but we’ll look to continually evolve our role to support the goal of a comprehensive open database of common knowledge that anyone can use."
    • Wikidata would seem to me to be doing exactly what Freebase did, i.e. cite Wikipedia and not the external sources, and it is interesting that Google thought this disqualified Freebase from being imported directly. Andreas JN466 04:01, 9 December 2015 (UTC)[reply]
  • The CC0 license is a non-issue. Data is not copyrightable in the US (or most of the world for that matter), so there is no way to require attribution regardless of what license you want to stick on the site. Kaldari (talk) 03:48, 9 December 2015 (UTC)[reply]