Introduction

The ION demonstrator is a tool to explore newspaper content. The tool is built on top of the ION dataset, a collection of news articles published on the websites of five newspapers between August 2014 and August 2015. With this demonstrator we aim to support an exploration of (1) how frequently a certain topic is mentioned in a newspaper, (2) which other topics it co-occurs with and (3) what kind of images are shown with articles about this topic.

The ION Demo was presented at ICT.OPEN2016 and was awarded the 3rd prize in the meet-the-demo contest there. The underlying dataset of images and text in online news will be presented at LREC2016: Laura Hollink, Adriatik Bedjeti, Martin van Harmelen and Desmond. A Corpus of Images and Text in Online News. In Proceedings of the 10th edition of the Language Resources and Evaluation Conference, 23-28 May 2016, Portorož, Slovenia.

The ION demo is the result of an Amsterdam Data Science grant. In a follow-up grant, we will further explore the detection of topic drift. If you know any qualified MSc students for a student-assistant position, let us know. We are hiring!

Demo Scenario

For an example of how the ION demonstrator may be used, we have created a slide deck with a step-by-step demo scenario, exploring the difference in how two newspapers cover Bernie Sanders.


The ION Dataset

The Images in Online News (ION) dataset is a collection of news articles from five news publishers: the US-based The New York Times, The Washington Post and The Huffington Post, and the UK-based The Daily Mail and The Independent. The articles were collected over a period of one year, from 13 August 2014 to 13 August 2015.

The ION dataset (approx. 323,000 articles) can be downloaded from the CWI Repository.
UPDATE: due to security problems, the CWI repository has temporarily been shut down. For now, please download the datasets from here.

The ION dataset is published as JSON-LD, a specification for representing Linked Data in the popular JSON format. For each news publisher an archive of JSON-LD files and a .mat file are available for download. The JSON-LD files contain the article data whereas the .mat file contains the image features in h5 format.

For each news publisher, the JSON-LD files contain publisher-level data and a list of articles. In the example below, the publisher is described by an identifier (@id), a type (@type) and a name.

Each article in the list of articles contains metadata; in the example below these are an identifier (@id), a publication date (datePublished), a headline and a URL (url). Additionally, articles can contain further attributes, depending on whether these exist in the original article and whether they could be analyzed; in the example below these are image, category and about.

Besides the JSON-LD, which contains the metadata about the articles and their textual content, the features of the images of the articles are stored in h5 format in a .mat file. A Python script is also provided to extract the feature vector of each image individually. An article is identified by a unique ID in both the JSON-LD and the .mat files.

An example of what the data looks like in JSON-LD is given below:
{
  "@context": {
    "@vocab": "http://schema.org/",
    "datePublished": {
      "@type": "http://www.w3.org/2001/XMLSchema#dateTime"
    }
  },
  "@id": "http://www.nytimes.com/",
  "@type": "Newspaper",
  "name": "New York Times",
  "@reverse": {
    "publisher": [
      {
        "@id": "1cf5e45097bb2e1f036824790e955cd52e84751d",
        "datePublished": "2015-08-12",
        "headline": "Hillary Clinton Directs Aides to Give Email Server and Thumb Drive to the Justice Department",
        "url": "http://www.nytimes.com/2015/08/12/us/politics/hillary-clinton-directs-aides-to-give-email-server-and-thumb-drive-to-the-justice-department.html",
        "image": {
          "@id": "1cf5e45097bb2e1f036824790e955cd52e84751d.jpg",
          "caption": "Hillary Rodham Clinton's emails while secretary of state remain a subject of intense scrutiny",
          "url": "http://static01.nyt.com/images/2015/08/12/us/12EMAILS/12EMAILS-master675.jpg"
        },
        "category": [
          "http://en.wikipedia.org/wiki/Category:Technology",
          "http://en.wikipedia.org/wiki/Category:Politics",
          "http://en.wikipedia.org/wiki/Category:International_relations",
          "http://en.wikipedia.org/wiki/Category:Security",
          "http://en.wikipedia.org/wiki/Category:Government_information",
          "http://en.wikipedia.org/wiki/Category:United_States_federal_policy",
          "http://en.wikipedia.org/wiki/Category:Espionage",
          "http://en.wikipedia.org/wiki/Category:Privacy",
          "http://en.wikipedia.org/wiki/Category:Intelligence_(information_gathering)",
          "http://en.wikipedia.org/wiki/Category:Secrecy",
          "http://en.wikipedia.org/wiki/Category:Information_sensitivity",
          "http://en.wikipedia.org/wiki/Category:Politics_of_the_United_States",
          "http://en.wikipedia.org/wiki/Category:National_security",
          "http://en.wikipedia.org/wiki/Category:American_politicians"
        ],
        "about": [
          "http://en.wikipedia.org/wiki/United_States_Department_of_State",
          "http://en.wikipedia.org/wiki/Hillary_Clinton",
          "http://en.wikipedia.org/wiki/Classified_information_in_the_United_States",
          "http://en.wikipedia.org/wiki/Email",
          "http://en.wikipedia.org/wiki/Federal_Bureau_of_Investigation",
          "http://en.wikipedia.org/wiki/Server_(computing)",
          "http://en.wikipedia.org/wiki/Classified_information"
        ]
      }
    ]
  }
}
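
To illustrate how a file with the structure above might be read, the following Python sketch walks over the articles listed under "@reverse" / "publisher" and counts how often each "about" topic occurs and which topics occur together in the same article. It is only a minimal sketch, not the code behind the ION demonstrator: the helper names iter_articles and topic_counts, the inline SAMPLE document and its article IDs are made up for illustration, and it assumes that "about" is either a list of URIs, a single URI, or absent.

```python
import json
from collections import Counter
from itertools import combinations

# Hypothetical stand-in for one publisher file. The field names follow the
# example above; the publisher, article IDs and headlines are invented.
SAMPLE = """
{
  "@context": {"@vocab": "http://schema.org/"},
  "@id": "http://www.example.org/",
  "@type": "Newspaper",
  "name": "Example Times",
  "@reverse": {
    "publisher": [
      {"@id": "a1", "datePublished": "2015-08-12", "headline": "First",
       "about": ["http://en.wikipedia.org/wiki/Email",
                 "http://en.wikipedia.org/wiki/Hillary_Clinton"]},
      {"@id": "a2", "datePublished": "2015-08-13", "headline": "Second",
       "about": "http://en.wikipedia.org/wiki/Hillary_Clinton"},
      {"@id": "a3", "datePublished": "2015-08-13", "headline": "Third"}
    ]
  }
}
"""


def iter_articles(publisher_doc):
    """Yield the article objects listed under @reverse / publisher."""
    return iter(publisher_doc.get("@reverse", {}).get("publisher", []))


def topics_of(article):
    """Return the sorted, de-duplicated 'about' topics of one article."""
    about = article.get("about", [])
    if isinstance(about, str):  # assumption: a single topic may be a bare string
        about = [about]
    return sorted(set(about))


def topic_counts(articles):
    """Count articles per topic and per unordered pair of topics."""
    frequency = Counter()
    cooccurrence = Counter()
    for article in articles:
        topics = topics_of(article)
        frequency.update(topics)
        # Pairs are emitted in sorted order, so (a, b) and (b, a) never both occur.
        cooccurrence.update(combinations(topics, 2))
    return frequency, cooccurrence


doc = json.loads(SAMPLE)
frequency, cooccurrence = topic_counts(iter_articles(doc))

for topic, count in frequency.most_common():
    print(count, topic)
```

For a downloaded file, json.loads(SAMPLE) would be replaced by json.load on the opened file; the same counters can then be compared across publishers.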

Image Feature Detection

The image concepts were generated by Robert-Jan Bruintjes and Thomas Mensink with a visual classifier trained on ImageNet [1], a fully annotated dataset containing images for nouns from the WordNet hierarchy. For further information about the specifics of the classifier, please refer to [2] and [3].

Contact

The ION corpus project is supported by Amsterdam Data Science and has been accepted for publication at LREC 2016.
The ION corpus is an effort by:
Contact: l.hollink@cwi.nl

[1] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[2] Mihir Jain, Jan van Gemert, Thomas Mensink, Cees Snoek. Objects2action: Classifying and localizing actions without any video example. In ICCV, 2015.
[3] Spencer Cappallo, Thomas Mensink, Cees Snoek. Image2Emoji: Zero-shot Emoji Prediction for Visual Media. In ACMMM, 2015.