Thursday, August 13, 2015

#Wikidata - #Quality, #probability and #set theory

The problem with any source is that it contains errors. It cannot be helped; there is always a certain percentage that is wrong. When you take all the Wikidata items that have statements, the process that added those statements gives an indication of what percentage of errors came in with them.

I made thousands of mistakes. In a way I am entitled to have made those mistakes because I made over 2 million edits. Amir made even more edits with his bot, and because of the process involved, the percentage of his errors will be lower. When you only look at Wikidata and its items, you can be confident that these errors exist and you can be confident about what percentage is likely, but there is no way to make an educated guess about what is right and what is wrong. The only way to improve the data is by sourcing one statement at a time. That is a process that will introduce its own errors; it is something we know from experience elsewhere.

To add value to Wikidata, we need both quality and quantity. Let us consider the use of external sources that are known to have been created with the best of intentions. Consider one type of information, the place of birth for example. It is highly likely that Wikidata and that external source have many items in common. Once they are defined as being about the same person, we can use the logic of set theory. We can establish the number of records where both have a value for the place of birth. From those we can determine the number of matching items, the number where one has a value and the other does not, and the number of items where there is a mismatch.
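
To make this concrete, here is a minimal sketch of that comparison, assuming both datasets have already been reduced to a mapping of person identifier to place of birth. The function name and the data shapes are illustrative assumptions, not an actual Wikidata API.

```python
def compare_place_of_birth(wikidata: dict, external: dict):
    """Partition the records both datasets share into matches, mismatches and gaps."""
    shared = wikidata.keys() & external.keys()  # items defined as the same person
    both_valued = {p for p in shared
                   if wikidata[p] is not None and external[p] is not None}

    matches = {p for p in both_valued if wikidata[p] == external[p]}
    mismatches = both_valued - matches          # both have a value, they disagree
    gaps = shared - both_valued                 # one side lacks a value

    return matches, mismatches, gaps

# Toy data for illustration:
wd = {"Q1": "Berlin", "Q2": "Paris", "Q3": None}
ext = {"Q1": "Berlin", "Q2": "Lyon", "Q3": "Rome", "Q4": "Oslo"}
m, x, g = compare_place_of_birth(wd, ext)
print(len(m), len(x), len(g))  # 1 match, 1 mismatch, 1 gap
```

The mismatch set is the worklist: it is where human attention pays off most.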

It is probable that most errors will be found where Wikidata and the source do not match. It is certain that even where the two match there will still be factual errors, as both can be wrong. If, say, each source is independently wrong two percent of the time, the chance that both are wrong on the same item is at most 0.02 × 0.02, or 0.04 percent, and agreeing on the same wrong value is rarer still; errors copied from one source into the other are the main exception.

Quality and confidence have much in common. Wikipedia has quality but we know it has issues. Wikidata has quality but we know it has issues. The easiest and most economical way to improve the quality of Wikidata is by comparing sources, many sources, and concentrating on the differences. It is easy and obvious, and when we ask someone to add a source to a statement we are confident that the result matters. It matters for both Wikidata and the external source.
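
The same idea extends to many sources: a statement that several independent sources agree on needs less attention than one where the sources disagree. A sketch, again with assumed data shapes:

```python
from collections import Counter

def agreement(person: str, sources: list[dict]) -> Counter:
    """Count how many sources report each value for one person."""
    values = [s[person] for s in sources if s.get(person) is not None]
    return Counter(values)

sources = [
    {"Q2": "Paris"},   # e.g. Wikidata
    {"Q2": "Paris"},   # e.g. a national library file
    {"Q2": "Lyon"},    # e.g. a genealogy database
]
print(agreement("Q2", sources))
# Counter({'Paris': 2, 'Lyon': 1}) -> the disagreement flags Q2 for review
```

Sorting statements by how divided the sources are gives a cheap, automatic priority list for the people who do the checking.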

This approach is not available to Wikipedia. It cannot easily be compared with other sources, and therefore there is no option but to source everything. Given that many statements find their origin in Wikipedia, new insights in Wikidata may prove a point and show a need to adapt articles.

Consequently, applying set theory and probability will enhance the quality of Wikidata. It will help drive fact-checking in Wikipedia, and it is therefore the best approach to improve quality. Accepting new data from external sources and iterating this process of comparison will ensure that Wikidata becomes much more reliable. Reliable because you can expect the data to be there, and reliable because you know that quality has been a priority in what we do.
Thanks,
      GerardM
