NewsHack2017

Multilingual Media Monitoring with the SUMMA Platform

SUMMA Hack: Technical Documentation

SUMMA Database

The SUMMA database is exposed through the REST API of a single Elasticsearch instance. For the purposes of the event, this database is static and will not be updated with live data for the duration of the hack.

The database is running Elasticsearch version 5.6.3.

Database location

The full SUMMA (production) database is available at http://summadb.summa-project.eu.

The development SUMMA database is available at http://summadb2.summa-project.eu. This development database has not been extensively tested.

Either of these can be substituted for [base url] in the rest of this document.

Getting started

To test your connection to the database, perform an HTTP GET to [base url]. You should receive some JSON as a response to indicate the database is running, like this:

{
  "name" : "LpuEwye",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "zoRkkbaZSZK4IpoPA6ZGuA",
  "version" : {
    "number" : "5.6.3",
    "build_hash" : "1a2f265",
    "build_date" : "2017-10-06T20:33:39.012Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.1"
  },
  "tagline" : "You Know, for Search"
}
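The same check can be scripted. The following is a minimal sketch in Python using only the standard library. The helper names parse_root_response and check_connection are illustrative and not part of SUMMA, and the network call will only succeed from an address that is allowed to reach the database.

```python
import json
from urllib.request import urlopen


def parse_root_response(body):
    """Return the Elasticsearch version number from the root JSON response."""
    info = json.loads(body)
    return info["version"]["number"]


def check_connection(base_url, timeout=10):
    """GET the base URL and return the Elasticsearch version it reports.

    Raises an exception (e.g. URLError) if the database cannot be reached.
    """
    with urlopen(base_url, timeout=timeout) as response:
        return parse_root_response(response.read().decode("utf-8"))


# Offline demonstration using a shortened copy of the response shown above.
sample = '{"name": "LpuEwye", "version": {"number": "5.6.3"}}'
print(parse_root_response(sample))
```

From the venue network, calling check_connection("http://summadb.summa-project.eu") should return the version string if the database is reachable.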

If you want to investigate further before reading about the structure of the database, a list of example queries can be found towards the bottom of this document.

Notes on Security and Restrictions

Database access is controlled by an IP whitelist. Access from the Trampery’s internal wireless networks will be available; access from outside the venue will not.

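For security, requests are limited to HTTP GET only. All other requests are blocked. Be advised that this means a query cannot be sent in the body of a POST request.

There are a number of simple ways to work with this restriction. The PHP script below shows one of them: it URL-encodes the query and passes it in the source parameter of a GET request.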

<?php

# your base url
$url="http://summadb.summa-project.eu/_search?pretty";

# the "query" string
$query='{
 "query" : {
     "range" : {
         "timeAdded" : { "from" : "2017-10-01", "to" : "2017-10-31" }
     }
 }
}';

# construct a new url
$callme=$url."&source=".urlencode($query);

# show how to use in a shell/bash script (etc)
echo "You can run in your shell as:\n\ncurl -XGET \"$callme\"\n\n";
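For Python users, the same trick can be sketched as follows. This is only an illustration of the approach used in the PHP script above: build_search_url is a hypothetical helper name, and the optional index argument is an addition for convenience.

```python
import json
from urllib.parse import urlencode


def build_search_url(base_url, query, index=None):
    """Return a GET-only search URL that carries the query in `source`."""
    # Search one index if given, otherwise search across the whole database.
    path = "/{}/_search".format(index) if index else "/_search"
    # urlencode percent-encodes the JSON so it is safe inside a URL.
    params = urlencode({"pretty": "true", "source": json.dumps(query)})
    return base_url.rstrip("/") + path + "?" + params


# The same range query as in the PHP example.
query = {
    "query": {
        "range": {
            "timeAdded": {"from": "2017-10-01", "to": "2017-10-31"}
        }
    }
}

print(build_search_url("http://summadb.summa-project.eu", query))
```

The printed URL can be pasted into a browser or passed to curl -XGET, exactly like the output of the PHP script.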

Database Structure

The database contains three tables (Elasticsearch indices):

Media Items

This table contains a collection of individual news stories.

[base url]/media-items

Example queries:

| Field | Description |
| --- | --- |
| _id | a DB id of the media item |
| timeAdded | datetime when the media item was added |
| feedURL | the URL of the feed this item came from |
| sourceItemOriginFeedName | the name of the feed this item came from |
| sourceItemType | type of media item; one of “Article”, “Video”, “livefeed-logical-chunk” |
| sourceItemLangeCodeGuess | the most likely language of the media item content as an ISO 639-1 language code, see https://en.wikipedia.org/wiki/ISO_639-1 |
| sourceItemDate | the publishing date of the media item |
| sourceItemIdAtOrigin | the id of the source item in the origin feed |
| sourceItemTitle | title of the media item in the original language (may be missing or auto-generated for live stream chunks) |
| sourceItemMainText | the main textual content of the media item in the original language (may be missing) |
| sourceItemTeaser | a teaser of the media item in the original language (only for DW content) |
| sourceItemKeywords | an array of string keywords in the original language (only for DW content) |
| sourceItemVideoURL | a URL of the video segment in the original language (may be missing) |
| contentDetectedLangCode | ISO 639-1 language code of the detected language |
| contentTranscribedMainText | a transcript of the video in the original language without punctuation marks but with word-level timestamps and confidence scores |
| contentTranscribedPunctuatedMainText | a transcript in the original language with punctuation marks |
| engTitle | English translation of the title |
| engMainText | English translation of the main text |
| engTeaser | English translation of the teaser |
| engKeywords | English translations of the keywords |
| engTranscript | English translation of the transcript |
| engTeaserEntities | list of entities in engTeaser including positions in text |
| engMainTextEntities | list of entities in engMainText including positions in text |
| engTranscriptEntities | list of entities in engTranscript including positions in text |
| engStorylineId | a DB id of the story cluster with similar media items. See “Stories” below |
| highlightItems | a list of sentences that summarize the media item |
| engTeaserRelationships | extracted relationships in engTeaser |
| engMainTextRelationships | extracted relationships in engMainText |
| engTranscriptRelationships | extracted relationships in engTranscript |

Note that this list is not exhaustive. The data structure as returned from the database will contain additional fields which may give extra metadata about the important fields above.

Named Entities

This table contains a collection of named entities - people, places, things, etc.

Elements in this table are joined to a media-item using the keys in “engMainTextEntities”.

[base url]/named-entities

Example queries:

| Field | Description |
| --- | --- |
| _id | a DB id of the named entity |
| baseForm | a base form of the named entity |
| type | a named entity type, e.g. “person”, “url”, “place” |

Note that this list is not exhaustive. The data structure as returned from the database will contain additional fields which may give extra metadata about the important fields above.

Stories

This table contains a collection of stories, which are clusters of individual media-items.

Media-items have a field engStorylineId which (if present) contains the id of the story cluster in which the item is placed. Each media-item belongs to at most one story cluster.

[base url]/stories

Example queries:

| Field | Description |
| --- | --- |
| _id | a DB id of the story |
| summary | a string of sentences that summarize all news items in the story cluster |

Note that this list is not exhaustive. The data structure as returned from the database will contain additional fields which may give extra metadata about the important fields above.
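To illustrate the link between media-items and stories, here is a small Python sketch that groups media-items by their engStorylineId. It assumes each item is a plain dict (for example the _source of a search hit) and that engStorylineId is a single top-level value; the exact shape of the field has not been checked against the database. The sample items are made up for the demonstration.

```python
from collections import defaultdict


def group_by_story(media_items):
    """Map each story id to its media items.

    Items without an engStorylineId are collected under the key None,
    since a media-item may belong to no story cluster at all.
    """
    groups = defaultdict(list)
    for item in media_items:
        groups[item.get("engStorylineId")].append(item)
    return dict(groups)


# Made-up items, only to show the grouping.
items = [
    {"sourceItemTitle": "first", "engStorylineId": "story-1"},
    {"sourceItemTitle": "second"},
    {"sourceItemTitle": "third", "engStorylineId": "story-1"},
]

for story_id, members in group_by_story(items).items():
    print(story_id, [m["sourceItemTitle"] for m in members])
```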

Example Queries

Here are some example queries, just to get you started.
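As a starting point, the sketch below builds a few GET URLs in Python using the Elasticsearch URI search syntax (the q parameter), which fits the GET-only restriction. The helper uri_search is a hypothetical name, the search terms "election" and "climate" are arbitrary, and the queries assume the field names documented above; they have not been verified against the SUMMA database.

```python
from urllib.parse import urlencode


def uri_search(base_url, index, q, size=10, fields=None):
    """Build an Elasticsearch URI search URL for one index."""
    params = {"q": q, "size": size}
    if fields:
        # `_source` limits which fields are returned for each hit.
        params["_source"] = ",".join(fields)
    return "{}/{}/_search?{}".format(base_url.rstrip("/"), index, urlencode(params))


base = "http://summadb.summa-project.eu"

# Five media-items that are text articles.
print(uri_search(base, "media-items", "sourceItemType:Article", size=5))

# Media-items whose English title mentions a term, returning two fields only.
print(uri_search(base, "media-items", "engTitle:election",
                 fields=["engTitle", "sourceItemDate"]))

# Named entities of type "person".
print(uri_search(base, "named-entities", "type:person", size=20))

# Stories whose summary mentions a term.
print(uri_search(base, "stories", "summary:climate", size=3))
```

Each printed URL can be opened in a browser or fetched with curl -XGET from the venue network.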

Notes on Data Quality

Many media-items contain the results of an automatic speech-to-text (ASR) process. In the case of non-English items, the text is then subjected to an automatic translation (MT) process. The SUMMA ASR and MT processes are state of the art, but they also represent areas of ongoing research and the quality of the output of both of these processes can be highly variable.

As the database contains a mix of text articles and transcripts from audio and video, media-items with the attribute value sourceItemType=Article will be of the highest quality, as no ASR process was used.

Developer Resources

Python

You need the following modules for Python:

Python Elasticsearch is available for manual download at https://elasticsearch-py.readthedocs.io. Clients should use version 5.x.x of the library for compatibility with the Elasticsearch v5.6.3 database.

For example:

$ pip install "elasticsearch>=5.0.0,<6.0.0"
$ python
>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch([{'host': 'toydb.summa-project.eu', 'port':80}])
>>> es.count(index='media-items')
{u'count': 10, u'_shards': {u'successful': 5, u'failed': 0, u'skipped': 0, u'total': 5}}

Media Playback

Media is available as a .mp4 file or a .m3u8 playlist file.

Live video is available as chunks of roughly 5 minutes. A chunk may contain more than one news story, and a news story may be truncated at the end of a chunk. Full stories can be reconstituted by following the values in customMetadata.prev_chunk_relative_url or customMetadata.next_chunk_relative_url.
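Following the chunk links can be sketched in Python as below. This is an illustration under assumptions: customMetadata is taken to be a dict on the media-item, and fetch_chunk is a hypothetical callable you supply that returns the media-item for a given relative URL (or None), because how a relative URL is resolved to a media-item is not specified here. The same loop works backwards with prev_chunk_relative_url.

```python
def collect_following_chunks(start_chunk, fetch_chunk, max_chunks=20):
    """Return start_chunk and the chunks that follow it, in order.

    fetch_chunk(relative_url) must return the next media-item dict,
    or None if it cannot be found. max_chunks bounds the walk so a
    long live stream (or a bad link) cannot loop indefinitely.
    """
    chunks = []
    chunk = start_chunk
    while chunk is not None and len(chunks) < max_chunks:
        chunks.append(chunk)
        metadata = chunk.get("customMetadata") or {}
        next_url = metadata.get("next_chunk_relative_url")
        if not next_url:
            break
        chunk = fetch_chunk(next_url)
    return chunks


# Made-up chunks held in memory, standing in for database lookups.
store = {
    "/chunk-2": {"n": 2, "customMetadata": {"next_chunk_relative_url": "/chunk-3"}},
    "/chunk-3": {"n": 3, "customMetadata": {}},
}
first = {"n": 1, "customMetadata": {"next_chunk_relative_url": "/chunk-2"}}

print([c["n"] for c in collect_following_chunks(first, store.get)])
```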

The JavaScript HLS client hls.js (http://video-dev.github.io/hls.js), which uses Media Source Extensions, makes it easy to embed the content from an .m3u8 file in any modern browser. Alternatively, a media player such as VLC may be used.

Visualisation Resources

Collections

Benjamin: https://github.com/benjbach/vishub/wiki/Visualization-Tools

Medialab: http://tools.medialab.sciences-po.fr

Voyant (many tools on text): http://docs.voyant-tools.org/tools

General purpose

Rawgraphs.io: http://rawgraphs.io

Tableau: https://public.tableau.com/en-us/s

Multivariate data

PoleStar: https://vega.github.io/polestar

iVisDesigner: https://donghaoren.org/ivisdesigner

Text visualization

Textexture (text networks): http://textexture.com

Networks

Gephi: http://gephi.org

iVisDesigner: https://donghaoren.org/ivisdesigner


Feedback

Problems, suggestions, missing information? Contact me before the day at andrew.secker@bbc.co.uk or find me at the venue.