Wikidata:WikiProject Reference Verification

This is a research and development project aimed at helping Wikidata editors check the quality of external references using various types of AI/ML models. The method is based on ProVe[1], an approach described in an academic paper.

We are developing a backend that deploys the AI/ML models and serves as an inference server. In addition, we aim to launch supporting tools such as a Wikidata gadget, a dashboard, and a bot-based worklist update page.

 Info You can use the AutoEdit tool to quickly add labels and descriptions for WikiProject Reference Verification in many languages.

Introduction


Motivation


Wikidata is a repository of information that gathers data from many different sources and topics. It stores this data as semantic triples, which are used in various important applications on the modern web, including Wikipedia infoboxes and search engines.

Wikidata mainly serves as a secondary source of information. To be trustworthy and useful, it needs well-documented and verifiable sources for its data. However, checking the quality of these sources, especially whether they actually support the information in Wikidata, has mostly been done manually, and this manual process does not scale as Wikidata continues to grow.

ProVe aims to solve this problem. It's an automated system that checks whether a piece of information (a triple) in Wikidata is supported by the text from its listed source. This approach can help ensure the quality of Wikidata's content more efficiently as it continues to expand.

Challenges


Implementing ProVe to support Wikidata editors raises several challenges:

  • How to best support Wikidata editors' workflow based on ProVe results
  • How to design system architecture for ProVe to support AI/ML inference and integrate with Wikidata tools using Toolforge and gadgets
  • What is the most effective method to present ProVe results for reusability
  • How to handle claims or triples that lack references


ProVe Reasoning


Main algorithm

  • The ProVe reasoning process is conducted sequentially for each Wikidata claim belonging to an item. ProVe processes an item given its Q-id and returns 'No external URLs' if the item has no available external URLs. ProVe is designed to verify only external URLs, not internal links such as connections to other Wikimedia and Wikidata resources. A minimal code sketch of the workflow stages is shown after this list.
 
ProVe Workflow
  • Wikidata Parser: First, ProVe collects and filters the Wikidata claims of the item with the requested Q-id. Because the focus is on checking the quality of external references, ProVe filters out claims that have no URL or only internal URLs. URLs encoded in a format different from common URLs are transformed into URLs a browser can access. ProVe then collects the HTML documents behind these accessible URLs.
  • Verbalization: ProVe transforms the set of Wikidata claims with URLs from RDF triples into natural-language sentences using a verbalization model based on a pre-trained T5. These verbalized sentences are later compared with sentences from the HTML documents of the URLs attached to the claims. This step runs for all Wikidata claims obtained in the Wikidata parsing step.
  • Sentence Segmentation: This step strips HTML tags and scripts from the documents obtained during Wikidata parsing, leaving only clean text. It then splits the cleaned text into sentences using the Python library pysbd. The segmented sentences are subsequently compared against the corresponding verbalized claims.
  • Top-N Sentences Selection: For each claim, ProVe pairs the verbalized claim sentence with the set of segmented sentences from the HTML document retrieved from the claim's URL. This step selects the top-N sentences most semantically similar to the verbalized claim, based on a fine-tuned BERT model.
  • Text Entailment: This step determines whether the Wikidata claim is supported by the N selected sentences, using a BERT model fine-tuned on the FEVER dataset. The result for each selected sentence is 'SUPPORTS', 'REFUTES', or 'NOT ENOUGH INFORMATION'. A verdict for the claim is then produced by aggregating the entailment results over the N selected sentences.
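  • The sketch below illustrates, in Python, how the sentence segmentation, top-N selection, and text entailment stages fit together. It is only a sketch: the public models 'all-MiniLM-L6-v2' and 'roberta-large-mnli' are stand-ins for ProVe's own fine-tuned models, the verbalized claim and cleaned page text are assumed to be given, and the final aggregation rule is simplified.

 # Illustrative sketch of the segmentation, selection, and entailment stages.
 # Model identifiers are public stand-ins, not ProVe's fine-tuned models.
 import pysbd
 import torch
 from sentence_transformers import SentenceTransformer, util
 from transformers import AutoTokenizer, AutoModelForSequenceClassification

 segmenter = pysbd.Segmenter(language="en", clean=True)
 encoder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in sentence encoder
 nli_tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
 nli_model = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")

 def top_n_sentences(verbalized_claim, page_text, n=5):
     """Segment the reference page and rank sentences by similarity to the claim."""
     sentences = segmenter.segment(page_text)
     claim_emb = encoder.encode(verbalized_claim, convert_to_tensor=True)
     sent_embs = encoder.encode(sentences, convert_to_tensor=True)
     scores = util.cos_sim(claim_emb, sent_embs)[0]
     ranked = sorted(zip(sentences, scores.tolist()), key=lambda p: p[1], reverse=True)
     return [sentence for sentence, _ in ranked[:n]]

 def entailment_label(verbalized_claim, evidence_sentence):
     """Classify one (evidence, claim) pair and map it to ProVe's label scheme."""
     inputs = nli_tok(evidence_sentence, verbalized_claim,
                      return_tensors="pt", truncation=True)
     with torch.no_grad():
         logits = nli_model(**inputs).logits
     label = nli_model.config.id2label[int(logits.argmax())].upper()
     return {"ENTAILMENT": "SUPPORTS",
             "CONTRADICTION": "REFUTES",
             "NEUTRAL": "NOT ENOUGH INFORMATION"}[label]

 def verify_claim(verbalized_claim, page_text, n=5):
     """Simplified aggregation over the N selected sentences."""
     labels = [entailment_label(verbalized_claim, s)
               for s in top_n_sentences(verbalized_claim, page_text, n)]
     if "SUPPORTS" in labels:
         return "SUPPORTS"
     if "REFUTES" in labels:
         return "REFUTES"
     return "NOT ENOUGH INFORMATION"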

System architecture

  • ProVe is currently running on a VM hosted at King's College London to utilize GPU resources. The system architecture of ProVe's backend consists of three main parts:
 
  • DBs and Resources: The second part relates to databases and resources used by ProVe or used to store data produced by ProVe. ProVe uses the Wikidata API to collect Wikidata claims and uses an SQLite database to store these collected claims. The ML model DB contains pre-trained machine learning models with datasets and configuration values. The HTML DB (SQLite) is designed to store raw HTML documents obtained during the ProVe process, serving as a kind of cache for HTML scraping. The Reasoning Data DB (SQLite) stores the entire intermediary and final result data during the ProVe process.
  • API Server: The third part is an API server built with Flask. It serves data requested by external clients by interacting with ProVe's databases and functions (a small illustrative sketch follows this list). Details of the APIs are given at the bottom of this page.
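  • To illustrate how the Flask API server and the SQLite stores interact, here is a minimal hypothetical sketch. The route path matches the documented getSimpleResult endpoint, but the database file, table name, and response fields are assumptions rather than the actual backend code.

 # Hypothetical sketch of a Flask route backed by an SQLite store.
 # Database path, table name, and response fields are illustrative assumptions.
 import sqlite3
 from flask import Flask, jsonify, request

 app = Flask(__name__)
 DB_PATH = "reasoning_data.db"  # hypothetical file for the Reasoning Data DB

 @app.route("/api/items/getSimpleResult")
 def get_simple_result():
     qid = request.args.get("qid")
     with sqlite3.connect(DB_PATH) as conn:
         row = conn.execute(
             "SELECT result FROM results WHERE qid = ?", (qid,)  # hypothetical table
         ).fetchone()
     if row is None:
         return jsonify({"qid": qid, "status": "not processed yet"})
     return jsonify({"qid": qid, "result": row[0]})

 if __name__ == "__main__":
     app.run(port=5000)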

ProVe Tools


ProVe Gadget

  • This is a widget that presents the results of ProVe at the top of the Wikidata item view page.
 
  • On the same line as the Wikidata item title, the ProVe Gadget displays a button for requesting processing of the item by the ProVe backend server. If the item has already been processed in the backend, it shows a 'Reference score' and its icon.
  • The ProVe Gadget displays three boxes:
    • A red box lists potentially bad references
    • A yellow box lists references for which there is not enough information to make a determination
    • A green box lists potentially good references
  • You can find additional information and a tutorial in the Google Slides presentation: Empowering Wikidata editors and content with the Wikidata Quality Toolkit, Wikimania 2024 Hackathon
  • To install and activate the ProVe Gadget, please visit https://www.wikidata.org/wiki/Wikidata:ProVe
  • We are working to publish the ProVe Gadget on the official Wikidata gadget list.

ProVe Dashboard (Under Development)

  • It will display statistics related to ProVe's work, such as (tentative):
    • The number of processed items
    • The number of processed triples
    • The number of processed references
    • User requests for Wikidata items
    • A reference score line graph over time

ProVe Worklists (Under Development)

  • This will present worklists organized around references.
  • We plan to publish a set of RDF triples derived from ProVe results.
  • This will allow Wikidata editors and users to access, via SPARQL queries, the subset of Wikidata items whose ProVe results need to be addressed.

ProVe API


GET /api/task

  • This API set is used to check the status of tasks in the ProVe backend.
  • The current domain of ProVe API:
 kclwqt.sites.er.kcl.ac.uk
  • Check queue: Returns the current queue of the ProVe backend.
 /api/task/checkQueue
 (e.g., https://kclwqt.sites.er.kcl.ac.uk/api/task/checkQueue)
  • Check completed items: Returns the completed items of the ProVe process.
 /api/task/checkCompleted
 (e.g., https://kclwqt.sites.er.kcl.ac.uk/api/task/checkCompleted)
  • Check errors: Returns the list of items that weren't processed successfully in the ProVe backend.
 /api/task/checkErrors
 (e.g., https://kclwqt.sites.er.kcl.ac.uk/api/task/checkErrors)
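  • Example (Python): these endpoints can be called with any HTTP client. The structure of the returned JSON is not documented here, so the sketch simply prints the responses.

 # Query the ProVe task-monitoring endpoints with the requests library.
 import requests

 BASE = "https://kclwqt.sites.er.kcl.ac.uk"

 queue = requests.get(f"{BASE}/api/task/checkQueue", timeout=60).json()
 completed = requests.get(f"{BASE}/api/task/checkCompleted", timeout=60).json()
 errors = requests.get(f"{BASE}/api/task/checkErrors", timeout=60).json()

 print(queue)
 print(completed)
 print(errors)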

GET /api/items

  • This API set is related to getting the results of ProVe for the item with a given Q-id.
  • Get simple result for the item: Returns the representative results of ProVe for the item with the given Q-id.
 /api/items/getSimpleResult?qid=<qid>
 (e.g., https://kclwqt.sites.er.kcl.ac.uk/api/items/getSimpleResult?qid=Q42)
  • Get all results for the item: Returns all results of ProVe for the item with the given Q-id.
 /api/items/getCompResult?qid=<qid>
 (e.g., https://kclwqt.sites.er.kcl.ac.uk/api/items/getCompResult?qid=Q42)
  • Check item status in the ProVe backend: Returns item status such as 'not processed yet' or 'in queue'.
 /api/items/checkItemStatus?qid=<qid>
 (e.g., https://kclwqt.sites.er.kcl.ac.uk/api/items/checkItemStatus?qid=Q42)
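  • Example (Python): fetching results for one item (Q42). The shape of the returned JSON is not specified on this page, so the responses are simply printed.

 # Fetch ProVe results for a single item by Q-id.
 import requests

 BASE = "https://kclwqt.sites.er.kcl.ac.uk"
 qid = "Q42"

 status = requests.get(f"{BASE}/api/items/checkItemStatus", params={"qid": qid}, timeout=60).json()
 simple = requests.get(f"{BASE}/api/items/getSimpleResult", params={"qid": qid}, timeout=60).json()
 full = requests.get(f"{BASE}/api/items/getCompResult", params={"qid": qid}, timeout=60).json()

 print(status)
 print(simple)
 print(full)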

GET /api/requests

  • This API sends a processing request to the ProVe backend.
  • Request ProVe processing for a specific item: Updates the current queue to include the given Q-id from the user.
 /api/requests/requestItem?qid=<qid>
 (e.g., https://kclwqt.sites.er.kcl.ac.uk/api/requests/requestItem?qid=Q42)
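  • Example (Python): a typical client workflow is to request processing and then poll the item status until the item leaves the queue, after which results can be fetched. The check for the phrase 'in queue' is an assumption based on the status values described above; the actual response fields may differ.

 # Request processing for an item, then poll until it leaves the queue.
 # The polling condition is an assumption; adapt it to the real response format.
 import time
 import requests

 BASE = "https://kclwqt.sites.er.kcl.ac.uk"
 qid = "Q42"

 requests.get(f"{BASE}/api/requests/requestItem", params={"qid": qid}, timeout=60)

 while True:
     status = requests.get(f"{BASE}/api/items/checkItemStatus", params={"qid": qid}, timeout=60).json()
     print("current status:", status)
     if "in queue" not in str(status):
         break
     time.sleep(30)

 result = requests.get(f"{BASE}/api/items/getSimpleResult", params={"qid": qid}, timeout=60).json()
 print(result)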

Subpages



Participants


References

  1. Amaral, G., Rodrigues, O., & Simperl, E. (2022). ProVe: A pipeline for automated provenance verification of knowledge graphs against textual sources. Semantic Web, (Preprint), 1-34.