I’ve created an open source project which aspires to add “Contextual Citations” to Wikipedia.
I’m looking for a volunteer with machine learning expertise to help me convert all of Wikipedia’s quotations to use Contextual Citations.
(Note: This is not yet an official Wikipedia project, but I do have a connection to someone who has worked for Wikipedia.)
What are Contextual Citations?
I’ve developed a web app that enables web authors to demonstrate the context of their citations.
You can see a 3 minute demo of the app below:
1) You can view the Thomas Jefferson example from the video below.
Nothing is more certainly written in the book of fate than that these people are to be free.
2) A second type of citation uses contextual popups. As an example, click on the link below:
In Chapter 5 of Pride and Prejudice, Elizabeth confesses how Darcy offended her, saying:
I could easily forgive his pride, if he had not mortified mine.
Adding Contextual Citations to Wikipedia
CiteIt’s Contextual Citations are a natural addition to Wikipedia.
Wikipedia already has popup windows that show more information about linked articles.
- These popups appear whenever a reader mouses over an article name (see Los Angeles Lakers example to the right)
- Contextual Popup Citations would be similar to these article popups: they would appear in order to provide more context about a quotation, but only when a reader clicks on the quotation, not on mouseover.
- The data for these contextual citations would be pulled from the original source when the Wikipedia author publishes the article or when a program extracts information from the footnoted source that follows the quote.
Below is a screenshot of the Ruth Bader Ginsburg article mockup, which shows the Contextual Popup that appears when a reader clicks on the light grey-blue text of a contextual quotation, in this case: “Ginsburg precedent”.
The Contextual Popup displays the 500 characters before and after the quote. In this instance, the quote is from a 2005 New York Times article.
Below are mockups of a few sample articles that demonstrate what Wikipedia would look like if it used CiteIt.net’s proposed Contextual Citations.
What Work Needs to be Done:
- The goal is to automate the process of converting as many existing Wikipedia articles to Contextual Citations as possible.
- If the automated process is uncertain whether a particular citation can be accurately scripted, it should save that citation to a database that Wikipedia volunteers can handle manually.
- The system will be trained using manually mocked up articles and be refined using manual corrections to its output.
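The "save uncertain citations for volunteers" step could look something like the sketch below. It assumes the automated process produces a match-confidence score between 0 and 1. The threshold value, the `review_queue` table, and the `route_citation` function are all hypothetical names chosen for illustration, not part of the existing CiteIt code.

```python
import sqlite3

REVIEW_THRESHOLD = 0.9  # assumed cut-off; would need tuning against marked-up articles


def open_review_db(path=":memory:"):
    """Create (if needed) a table holding citations that need manual review."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS review_queue (
               article TEXT NOT NULL,
               quote TEXT NOT NULL,
               candidate_url TEXT,
               score REAL NOT NULL
           )"""
    )
    return conn


def route_citation(conn, article, quote, candidate_url, score):
    """Return True if the citation can be converted automatically;
    otherwise queue it for a volunteer and return False."""
    if score >= REVIEW_THRESHOLD:
        return True
    conn.execute(
        "INSERT INTO review_queue VALUES (?, ?, ?, ?)",
        (article, quote, candidate_url, score),
    )
    conn.commit()
    return False


# Made-up sample data: one confident match and one doubtful one.
conn = open_review_db()
auto = route_citation(conn, "Example article", "an exact quote", "https://example.com/a", 0.98)
manual = route_citation(conn, "Example article", "a doubtful quote", None, 0.41)
queued = conn.execute("SELECT quote, score FROM review_queue").fetchall()
print(queued)
```

Manual corrections made to the queued rows could then be fed back in as additional training examples.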
I’ve already manually marked up a few sample Wikipedia articles to specify the URL that corresponds to each cited quote.
.. Senators invoked the phrase the <q cite="https://www.nytimes.com/2005/09/04/weekinreview/roberts-rx-speak-up-but-shut-up.html">Ginsburg precedent</q> to defend his demurrers ..
Once the quote is mocked up in this way, the CiteIt webservice can:
- look up the source,
- find the context, and
- create the JSON files
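The three steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual CiteIt webservice: `QuoteParser`, `build_context`, and the JSON field names are hypothetical, and the source text is passed in as a made-up string rather than fetched from the cited URL.

```python
import json
from html.parser import HTMLParser


class QuoteParser(HTMLParser):
    """Collect (quote text, cite URL) pairs from <q cite="..."> tags."""

    def __init__(self):
        super().__init__()
        self.quotes = []
        self._cite = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "q":
            self._cite = dict(attrs).get("cite")
            self._text = []

    def handle_data(self, data):
        if self._cite is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "q" and self._cite is not None:
            self.quotes.append(("".join(self._text), self._cite))
            self._cite = None


def build_context(quote, source_text, url, width=500):
    """Return the quote plus up to `width` characters on either side, or None."""
    start = source_text.find(quote)
    if start == -1:
        return None
    end = start + len(quote)
    return {
        "citing_quote": quote,  # hypothetical field names
        "cited_url": url,
        "cited_context_before": source_text[max(0, start - width):start],
        "cited_context_after": source_text[end:end + width],
    }


html = (
    '.. Senators invoked the phrase the <q cite="https://www.nytimes.com/2005/09/04/'
    'weekinreview/roberts-rx-speak-up-but-shut-up.html">Ginsburg precedent</q> '
    "to defend his demurrers .."
)
parser = QuoteParser()
parser.feed(html)
quote, url = parser.quotes[0]

# Placeholder standing in for the text of the downloaded source.
source_text = "Text before the quote: Ginsburg precedent, then text after."
record = build_context(quote, source_text, url)
print(json.dumps(record, indent=2))
```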
The tasks that involve machine learning automate the process of marking up each quote with its source URL using the <q> tag.
Here’s a description of how a procedural algorithm would process the articles:
- Download and read each Wikipedia article’s HTML.
- Find each passage of quoted text in the article that is delimited with quotation marks.
- Find the footnote(s) that immediately follow each quote.
- Find the URL for the footnoted source whose target HTML contains a match for the quoted text.
- This process involves starting with the first footnote and searching for a text match in the cited source. If no match is found, the process should continue searching through subsequent footnotes until the last footnote of that immediate cluster of footnotes is processed.
- After finding a text match using a fuzzy match process, the program should update the matched Wikipedia quote with the exact text of the cited source. It is important that the Wikipedia quote match the source exactly. If the quote does not match the source text exactly but can be matched using a “fuzzy match”, the program must add annotations to
- note [c]hanges to capitalization,
- [add words] using brackets, and
- use ellipses to note skipped words.
- In this way, the program will validate the accuracy of the original quotes and potentially add annotations to note any discrepancies between the quote and the original.
- Wrap the Wikipedia quote with the <q> tag and the source’s URL so that the existing Python webservice can be run to pull in the 500 characters of context and save the results as a JSON file.
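A rough sketch of the procedure above is shown below. It works on plain text with numeric footnote markers such as `[1][2]`, and it uses Python's standard `difflib` for the fuzzy match, which is a stand-in rather than the machine learning approach this project is looking for. The `best_match` and `mark_up` functions, the 0.85 threshold, and the sample data are assumptions. The sketch replaces the quote with the source's exact wording but does not yet generate the bracket and ellipsis annotations.

```python
import re
from difflib import SequenceMatcher
from html import escape

# A quote in straight or curly quotation marks, followed by one or more footnotes.
QUOTE_RE = re.compile(r'[“"]([^”"]+)[”"]((?:\[\d+\])+)')


def best_match(quote, source_text, threshold=0.85):
    """Return the span of source_text most similar to quote, or None."""
    if quote in source_text:
        return quote
    words = source_text.split()
    n = len(quote.split())
    best, best_ratio = None, threshold
    # Slide a window of the quote's word count across the source.
    for i in range(max(1, len(words) - n + 1)):
        candidate = " ".join(words[i:i + n])
        ratio = SequenceMatcher(None, quote.lower(), candidate.lower()).ratio()
        if ratio >= best_ratio:
            best, best_ratio = candidate, ratio
    return best


def mark_up(article_text, sources):
    """Wrap each footnoted quote in a <q cite="..."> tag when a source matches.

    `sources` maps a footnote number to a (url, source_text) pair.
    """
    def replace(match):
        quote, markers = match.group(1), match.group(2)
        # Try the footnotes of this cluster in order until one source matches.
        for number in re.findall(r"\[(\d+)\]", markers):
            url, text = sources.get(int(number), (None, ""))
            found = best_match(quote, text) if url else None
            if found:
                return f'<q cite="{escape(url)}">{escape(found, quote=False)}</q>{markers}'
        return match.group(0)  # no source matched: leave the quote untouched

    return QUOTE_RE.sub(replace, article_text)


# Made-up sample: the article quote drops the comma and period found in source 2.
sources = {
    1: ("https://example.com/one", "An unrelated passage about something else entirely."),
    2: ("https://example.com/two",
        "She said: I could easily forgive his pride, if he had not mortified mine. Then more text."),
}
article = 'Elizabeth says "I could easily forgive his pride if he had not mortified mine"[1][2]'
marked = mark_up(article, sources)
unmatched = mark_up('He said "totally different words here"[1]', sources)
print(marked)
```

A fixed-size word window is a simplification: it will miss quotes that skip many words of the source, which is one of the cases where a trained model or a manual review queue would be needed.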
What’s been learned so far:
As I’ve manually mocked up these articles, I’ve noted irregularities that could trip up an automated conversion.
I’ve created error codes which could be used to construct a test-suite that could assess the output of the process.
Here are a few examples of known irregularities:
|Category|Error code|Description|Example|
|---|---|---|---|
|Wiki|wiki-legend|The quote is found in the legend of a Wikipedia image, and the footnote may be before the quote|“Fact-checkers from The Washington Post, the Toronto Star, and CNN compiled data on “false or misleading claims” (orange background), and “false claims” (violet foreground), respectively.”|
|Wiki|wiki-note|Internal Wikipedia note|Clinton into “imaginary discussions” with the also-politically active Eleanor Roosevelt.[f]|
|Wiki|wiki-multiple-source|Wikipedia source citation record contains multiple sources|Calabresi, Massimo (November 7, 2011). “Hillary Clinton and the Rise of Smart Power”. Time. pp. 26–31. See also “TIME magazine editor explains Hillary Clinton’s ‘smart power’”. CNN. October 28, 2011. (Wikipedia article: Hillary Clinton)|
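As a sketch of how these error codes could drive a test suite, the snippet below pairs sample inputs with the code the process is expected to report. The detection rules are guesses based only on the examples above (a lettered footnote marker such as `[f]` for an internal note, the phrase "See also" for a multi-source record). The `classify` function and the footnote numbers in the test cases are hypothetical, and `wiki-legend` is left undetected because it needs page structure rather than text.

```python
import re

ERROR_CODES = {
    "wiki-legend": "Quote is in an image legend; the footnote may precede the quote",
    "wiki-note": "Footnote is an internal Wikipedia note",
    "wiki-multiple-source": "Citation record contains multiple sources",
}


def classify(footnote_marker, citation_record):
    """Return an error code for a citation, or None if no irregularity is detected."""
    # Assumed rule: internal notes use a single lowercase letter, e.g. [f].
    if re.fullmatch(r"\[[a-z]\]", footnote_marker):
        return "wiki-note"
    # Assumed rule: a second source is introduced with "See also".
    if "See also" in citation_record:
        return "wiki-multiple-source"
    return None


MULTI_SOURCE_RECORD = (
    "Calabresi, Massimo (November 7, 2011). “Hillary Clinton and the Rise of "
    "Smart Power”. Time. pp. 26–31. See also “TIME magazine editor explains "
    "Hillary Clinton’s ‘smart power’”. CNN. October 28, 2011."
)

# (footnote marker, citation record, expected error code)
TEST_CASES = [
    ("[f]", "", "wiki-note"),
    ("[12]", MULTI_SOURCE_RECORD, "wiki-multiple-source"),
    ("[3]", "A single ordinary source.", None),
]

results = [classify(marker, record) == expected for marker, record, expected in TEST_CASES]
print(results)
```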
- How much work will need to be done to design and train the machine learning program?
- How many citations need to be mocked up to ensure a given level of accuracy in converting the quotes to Contextual Citations?
- How much computing power needs to be acquired to run the machine learning program?
- If this computing power is to be run in the Cloud, what cloud should it be run on?
Open Source Code:
If you’d like to learn more about the way the front end code works, check out the Developer Sample Code.