Introduction
The ION demonstrator is a tool to explore newspaper content. The tool is built on top of the ION dataset, a collection of news articles published on the websites of five newspapers between August 2014 and August 2015. With this demonstrator we aim to support an exploration of (1) how frequently a certain topic is mentioned in a newspaper, (2) with which other topics it co-occurs and (3) what kind of images are shown with articles about this topic.
The ION Demo was presented at ICT.OPEN2016 and was awarded the 3rd prize in the meet-the-demo contest there. The underlying dataset of images and text in online news will be presented at LREC2016: Laura Hollink, Adriatik Bedjeti, Martin van Harmelen and Desmond Elliot. A Corpus of Images and Text in Online News. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference, 23-28 May 2016, Portorož, Slovenia. [Abstract]
The ION demo is the result of an Amsterdam Data Science grant. In a follow-up grant, we will further explore the detection of topic drift. (If you know any qualified MSc students for a student-assistant position, let us know. We are hiring!)
Demo Scenario
As an example of how the ION demonstrator may be used, we have created a slide deck with a step-by-step demo scenario that explores the difference in how two newspapers cover Bernie Sanders.
The ION Dataset
The Images in Online News (ION) dataset is a collection of news articles from five news publishers: the US-based The New York Times, The Washington Post and The Huffington Post, and the UK-based The Daily Mail and The Independent. The articles were collected over a period of one year, from 13 August 2014 to 13 August 2015.
The ION dataset (approx. 323,000 articles) can be downloaded from the CWI Repository.
UPDATE: due to security problems, the CWI repository has temporarily been shut down. For now, please download the datasets from here.
The ION dataset is published as JSON-LD, a specification for representing Linked Data in the popular JSON format. For each news publisher an archive of JSON-LD files and a .mat file are available for download. The JSON-LD files contain the article data whereas the .mat file contains the image features in h5 format.
For the news publisher the following data is available in the JSON-LD files:
- @context : based on schema.org
- @id : URL of the news publisher
- @type : "Newspaper"
- name : the name of the news publisher
- @reverse, publisher : the list of articles of that news publisher
For each article the following data is available:
- @id : the unique ID for the article as a hashcode of the article URL (SHA-1)
- datePublished : the publishing date of the article
- headline : the headline of the article
- url : the URL of the article
- image : metadata about the image
  - @id : the ID of the article (see above) + ".jpg"
  - url : the URL of the image as it appears on the article page
  - caption : the caption that usually goes with the image, if such a caption exists
- category : the coarse topics of the article, indicated by their Wikipedia URLs, as well as those topics (see below) that are Wikipedia categories. A Wikipedia category is a Wikipedia page identified by a URL containing "Category" followed by a colon (":"), e.g. "http://en.wikipedia.org/wiki/Category:Technology".
- about : a list of topics and entities, indicated by their Wikipedia URLs, that are not Wikipedia categories.
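The ID scheme and the category rule described above can be sketched in Python. This is a minimal sketch, not the code used to build the dataset: it assumes the article ID is the hexadecimal SHA-1 digest of the UTF-8 encoded article URL, and since the exact URL normalization is not specified here, a recomputed hash may not match the published IDs. The function names are hypothetical.

```python
import hashlib

# Assumption: a Wikipedia category URL contains "Category" followed by a colon.
CATEGORY_MARKER = "/wiki/Category:"


def article_id(url: str) -> str:
    """Return the hex SHA-1 digest of the URL (assumed encoding: UTF-8)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


def image_id(url: str) -> str:
    """Return the image ID: the article ID followed by '.jpg'."""
    return article_id(url) + ".jpg"


def is_wikipedia_category(url: str) -> bool:
    """True for Wikipedia category URLs, False for ordinary topic pages."""
    return CATEGORY_MARKER in url
```

Under these assumptions, a URL listed under category should make is_wikipedia_category return True, while a URL listed under about should make it return False.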
An example of what the data looks like in JSON-LD is given below:
{
  "@context": {
    "@vocab": "http://schema.org/",
    "datePublished": {
      "@type": "http://www.w3.org/2001/XMLSchema#dateTime"
    }
  },
  "@id": "http://www.nytimes.com/",
  "@type": "Newspaper",
  "name": "New York Times",
  "@reverse": {
    "publisher": [
      {
        "@id": "1cf5e45097bb2e1f036824790e955cd52e84751d",
        "datePublished": "2015-08-12",
        "headline": "Hillary Clinton Directs Aides to Give Email Server and Thumb Drive to the Justice Department",
        "url": "http://www.nytimes.com/2015/08/12/us/politics/hillary-clinton-directs-aides-to-give-email-server-and-thumb-drive-to-the-justice-department.html",
        "image": {
          "@id": "1cf5e45097bb2e1f036824790e955cd52e84751d.jpg",
          "caption": "Hillary Rodham Clinton's emails while secretary of state remain a subject of intense scrutiny",
          "url": "http://static01.nyt.com/images/2015/08/12/us/12EMAILS/12EMAILS-master675.jpg"
        },
        "category": [
          "http://en.wikipedia.org/wiki/Category:Technology",
          "http://en.wikipedia.org/wiki/Category:Politics",
          "http://en.wikipedia.org/wiki/Category:International_relations",
          "http://en.wikipedia.org/wiki/Category:Security",
          "http://en.wikipedia.org/wiki/Category:Government_information",
          "http://en.wikipedia.org/wiki/Category:United_States_federal_policy",
          "http://en.wikipedia.org/wiki/Category:Espionage",
          "http://en.wikipedia.org/wiki/Category:Privacy",
          "http://en.wikipedia.org/wiki/Category:Intelligence_(information_gathering)",
          "http://en.wikipedia.org/wiki/Category:Secrecy",
          "http://en.wikipedia.org/wiki/Category:Information_sensitivity",
          "http://en.wikipedia.org/wiki/Category:Politics_of_the_United_States",
          "http://en.wikipedia.org/wiki/Category:National_security",
          "http://en.wikipedia.org/wiki/Category:American_politicians"
        ],
        "about": [
          "http://en.wikipedia.org/wiki/United_States_Department_of_State",
          "http://en.wikipedia.org/wiki/Hillary_Clinton",
          "http://en.wikipedia.org/wiki/Classified_information_in_the_United_States",
          "http://en.wikipedia.org/wiki/Email",
          "http://en.wikipedia.org/wiki/Federal_Bureau_of_Investigation",
          "http://en.wikipedia.org/wiki/Server_(computing)",
          "http://en.wikipedia.org/wiki/Classified_information"
        ]
      }
    ]
  }
}
Image Feature Detection
The image concepts were generated by Robert-Jan Bruintjes and Thomas Mensink with a visual classifier trained on ImageNet [1], a fully annotated dataset containing images for nouns from the WordNet hierarchy. For further information about the specifics of the classifier, please refer to [2] and [3].
Contact
The ION corpus project is supported by Amsterdam Data Science and has been accepted for publication at LREC 2016.

- Adriatik Bedjeti
- Desmond Elliot
- Martin van Harmelen
- Laura Hollink