Introduction
The ION demonstrator is a tool to explore newspaper content. The tool is built on top of the ION dataset, a collection of news articles published on the websites of five newspapers between August 2014 and August 2015. With this demonstrator we aim to support an exploration of (1) how frequently a certain topic is mentioned in a newspaper, (2) which other topics it co-occurs with and (3) what kind of images are shown with articles about this topic.
The ION Demo was presented at ICT.OPEN2016 and was awarded the 3rd prize in the meet-the-demo contest there. The underlying dataset of images and text in online news will be presented at LREC2016:
Laura Hollink, Adriatik Bedjeti, Martin van Harmelen and Desmond. A Corpus of Images and Text in Online News. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference, 23-28 May 2016, Portorož, Slovenia.
The ION demo is the result of an Amsterdam Data Science grant. In a follow-up grant, we will further explore the detection of topic drift. (If you know any qualified MSc students for a student-assistant position, let us know. We are hiring!)
For an example of how the ION demonstrator may be used, we have created a slide deck with a step-by-step demo scenario that explores the difference in how two newspapers cover Bernie Sanders.
The ION Dataset
The Images in Online News (ION) dataset is a collection of news articles from five news publishers: the US-based The New York Times, The Washington Post and The Huffington Post, and the UK-based The Daily Mail and The Independent. The articles were collected over a period of one year, from 13 August 2014 to 13 August 2015.
The ION dataset (approx. 323,000 articles) can be downloaded from the CWI Repository.
UPDATE: due to security problems, the CWI repository has temporarily been shut down. For now, please download the datasets from here.
The ION dataset is published as JSON-LD, a specification for representing Linked Data in the popular JSON format. For each news publisher an archive of JSON-LD files and a .mat file are available for download. The JSON-LD files contain the article data whereas the .mat file contains the image features in h5 format.
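Because JSON-LD is ordinary JSON, the article files can be read without a dedicated Linked Data library. The following is a minimal sketch, assuming each file holds a single JSON object and is encoded as UTF-8; the function names `load_json_ld` and `top_level_keys` are hypothetical and not part of the dataset.

```python
import json
from pathlib import Path


def load_json_ld(path):
    """Parse one JSON-LD file and return the resulting Python object.

    Assumption: the file is UTF-8 encoded. No JSON-LD processing
    (expansion, compaction, framing) is performed here.
    """
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


def top_level_keys(document):
    """Return the sorted top-level keys of a parsed JSON-LD object.

    Assumption: the top level of the file is a JSON object (a dict).
    """
    return sorted(document.keys())
```

Listing the top-level keys of one file is a quick way to check the structure before writing any further processing. The .mat files are a separate matter: since they are stored in h5 format, an HDF5 reader such as the third-party h5py package may be able to open them, but that is not verified here.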
For each news publisher, the following data is available in the JSON-LD files:
- : based on schema.org
- : URL of the news publisher
- : "Newspaper"
- : the name of the news publisher
- : the list of articles of that news publisher
- : the unique ID for the article as a hashcode of the article URL (SHA-1)
- : the publishing date of the article
- : the headline of the article
- : the URL of the article
- : metadata about the image
- : the ID of the article (see above) + ".jpg"
- : the URL of the image as it appears on the article page
- : the caption that usually goes with the image, if such a caption exists
- : the coarse topics of the article indicated by their Wikipedia URLs, as well as topics (see below) which are Wikipedia categories. A Wikipedia category is a Wikipedia page identified by a URL containing "Category" followed by a colon (":"), e.g. "http://en.wikipedia.org/wiki/Category:Technology".
- : a list of topics and entities indicated by their Wikipedia URLs which are not Wikipedia categories.
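Three of the conventions in the list above can be sketched in a few lines: the SHA-1 article ID, the image ID derived from it, and the split between Wikipedia categories and other topics. This is an illustration under assumptions, not the code used to build the dataset. It assumes the URL is hashed as UTF-8 bytes without any normalization and that the ID is the hexadecimal digest; all function names are hypothetical.

```python
import hashlib

# A Wikipedia category URL contains "Category" followed by a colon.
CATEGORY_MARKER = "Category:"


def article_id(article_url):
    """Return the SHA-1 digest of the article URL.

    Assumptions: UTF-8 encoding, no URL normalization, hex output.
    """
    return hashlib.sha1(article_url.encode("utf-8")).hexdigest()


def image_id(article_url):
    """Return the image ID: the article ID followed by ".jpg"."""
    return article_id(article_url) + ".jpg"


def is_category(wikipedia_url):
    """Return True if the URL points to a Wikipedia category page."""
    return CATEGORY_MARKER in wikipedia_url


def split_topics(wikipedia_urls):
    """Split a list of Wikipedia URLs into (categories, other_topics)."""
    categories = [url for url in wikipedia_urls if is_category(url)]
    others = [url for url in wikipedia_urls if not is_category(url)]
    return categories, others
```

If IDs computed this way do not match those in the downloaded files, the most likely cause is a difference in how the URL was encoded or normalized before hashing.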
An example of what the data looks like in JSON-LD is given below: