NewsHack2017

Multilingual Media Monitoring with the SUMMA Platform

SUMMA Hack: Technical Documentation

SUMMA Database

The SUMMA database is exposed through the REST API of a single Elasticsearch instance. For the purposes of the event, this database is static and will not be updated with live data during the hack.

The database is running Elasticsearch version 5.6.3.

Database location

The full SUMMA (production) database is available at http://summadb.summa-project.eu.

The development SUMMA database is available at http://summadb2.summa-project.eu. This development database has not been extensively tested.

The development database has improved transcripts for video and live video, but has no story clustering or topic detection at this time.
Media-item IDs match between the production and development databases.

Either URL can be substituted for [base url] throughout this document.

Getting started

To test your connection to the database, perform an HTTP GET to [base url]. You should receive some JSON as a response to indicate the database is running, like this:

{
  "name" : "LpuEwye",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "zoRkkbaZSZK4IpoPA6ZGuA",
  "version" : {
    "number" : "5.6.3",
    "build_hash" : "1a2f265",
    "build_date" : "2017-10-06T20:33:39.012Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.1"
  },
  "tagline" : "You Know, for Search"
}
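
As a minimal sketch, the same check can be scripted in Python using only the standard library. The function names `parse_cluster_info` and `check_connection` are illustrative and not part of any SUMMA tooling; the network call is shown as a comment because it only works from a whitelisted network.

```python
import json
from urllib.request import urlopen

EXPECTED_VERSION = "5.6.3"  # the version stated in this document


def parse_cluster_info(raw_json):
    """Return (cluster_name, version_number) from the root-endpoint response."""
    info = json.loads(raw_json)
    return info["cluster_name"], info["version"]["number"]


def check_connection(base_url, timeout=10):
    """GET the base URL and report whether the expected version answered."""
    with urlopen(base_url, timeout=timeout) as response:
        _, version = parse_cluster_info(response.read().decode("utf-8"))
    return version == EXPECTED_VERSION


# Example (requires access from a whitelisted network):
# check_connection("http://summadb.summa-project.eu")
```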

If you want to investigate further before reading about the structure of the database, a list of example queries can be found towards the bottom of this document.

Notes on Security and Restrictions

Database access is controlled by an IP whitelist. Access is available from the Trampery’s internal wireless networks; access from outside the venue is not.

For security, requests are limited to HTTP GET only. All other requests are blocked. Be advised that:

Some clients (e.g. Postman) are unable to perform a GET with a body.
Some HTTP proxies are unable to pass a GET with a body.

There are a number of simple ways to work within this restriction.

Python: The Python binding accounts for this restriction as described in the Python binding docs.
If a proxy is blocking the requests, the URL-encoded query can be passed in the source URL parameter instead, as in this PHP example:

<?php

# your base url
$url="http://summadb.summa-project.eu/_search?pretty";

# the "query" string
$query='{
 "query" : {
     "range" : {
         "timeAdded" : { "from" : "2017-10-01", "to" : "2017-10-31" }
     }
 }
}';

# construct a new url
$callme=$url."&source=".urlencode($query);

# show how to use in a shell/bash script (etc)
echo "You can run in your shell as:\n\ncurl -XGET \"$callme\"\n\n";
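
The same URL construction can be sketched in Python with the standard library. The helper name `build_get_url` is hypothetical; the `source` parameter and the query are taken from the PHP example above.

```python
import json
from urllib.parse import urlencode


def build_get_url(base_url, path, query, pretty=True):
    """Encode an Elasticsearch query body into the `source` URL parameter."""
    params = {"source": json.dumps(query)}
    if pretty:
        params["pretty"] = "true"
    return "{}/{}?{}".format(
        base_url.rstrip("/"), path.lstrip("/"), urlencode(params)
    )


# The same date-range query as in the PHP example.
october = {
    "query": {
        "range": {"timeAdded": {"from": "2017-10-01", "to": "2017-10-31"}}
    }
}

url = build_get_url("http://summadb.summa-project.eu", "_search", october)
print(url)  # paste into a browser or pass to curl -XGET
```

Because the whole query travels in the URL, no request body is needed, so clients and proxies that drop GET bodies are not affected.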

Database Structure

The database contains 3 tables:

media-items
named-entities
stories
List the available tables: [base url]/_aliases?pretty=1

Media Items

This table contains a collection of individual news stories.

[base url]/media-items

Example queries:

Index structure: [base url]/media-items/?pretty
Item count: [base url]/media-items/_count?pretty
Sample items: [base url]/media-items/_search?pretty

Field	Description
_id	a DB id of the media item
timeAdded	datetime when media item was added
feedURL	the URL of the feed this item came from
sourceItemOriginFeedName	the name of the feed this item came from
sourceItemType	type of media item; one of “Article”, “Video”, “livefeed-logical-chunk”
sourceItemLangeCodeGuess	the most likely language of the media item content as ISO 639-1 lang code, see https://en.wikipedia.org/wiki/ISO_639-1
sourceItemDate	the publishing date of the media item
sourceItemIdAtOrigin	the id of the source item in the origin feed
sourceItemTitle	title of the media item in the original language (may be missing or auto-generated for live stream chunks)
sourceItemMainText	the main textual content of the media item in the original language (may be missing)
sourceItemTeaser	a teaser of the media item in the original language (only for DW content)
sourceItemKeywords	an array of string keywords in the original language (only for DW content)
sourceItemVideoURL	a URL of the video segment in the original language (may be missing)
contentDetectedLangCode	ISO 639-1 language code of the detected language
contentTranscribedMainText	a transcript of the video in the original language without punctuation marks but with word-level timestamps and confidence scores
contentTranscribedPunctuatedMainText	a transcript in the original language with punctuation marks
engTitle	English translation of the title
engMainText	English translation of the main text
engTeaser	English translation of the teaser
engKeywords	English translations of the keywords
engTranscript	English translation of the transcript
engTeaserEntities	list of entities in engTeaser including positions in text
engMainTextEntities	list of entities in engMainText including positions in text
engTranscriptEntities	list of entities in engTranscript including positions in text
engStorylineId	a DB id of the story cluster with similar media items. See “Stories” below
highlightItems	a list of sentences that summarize the media item
engTeaserRelationships	extracted relationships in engTeaser
engMainTextRelationships	extracted relationships in engMainText
engTranscriptRelationships	extracted relationships in engTranscript

Note that this list is not exhaustive. The data structure as returned from the database will contain additional fields which may give extra metadata about the important fields above.
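
Because each item carries many fields, it can help to request only the ones needed. The sketch below assumes the standard Elasticsearch `_source` filter in the request body and the usual `hits.hits` response layout. The helper names are illustrative, and any field that is missing from an item comes back as `None`.

```python
def build_field_query(fields, size=10):
    """Build a search body that returns only the named fields of each item."""
    return {"size": size, "_source": list(fields), "query": {"match_all": {}}}


def hits_to_rows(response, fields):
    """Flatten a search response into one dict per media item."""
    rows = []
    for hit in response["hits"]["hits"]:
        source = hit.get("_source", {})
        row = {"_id": hit["_id"]}
        for field in fields:
            # Many fields are optional (e.g. sourceItemTeaser), so default to None.
            row[field] = source.get(field)
        rows.append(row)
    return rows
```

For example, `build_field_query(["engTitle", "sourceItemType"], size=25)` produces a body that can be sent to [base url]/media-items/_search.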

Named Entities

This table contains a collection of named entities - people, places, things, etc.

Elements in this table are joined to a media-item using the keys in “engMainTextEntities”.

[base url]/named-entities

Example queries:

index structure: [base url]/named-entities/?pretty
item count: [base url]/named-entities/_count?pretty
sample items: [base url]/named-entities/_search?pretty

Field	Description
_id	a DB id of the named entity
baseForm	a base form of the named entity
type	a named entity type, e.g. “person”, “url”, “place”

Note that this list is not exhaustive. The data structure as returned from the database will contain additional fields which may give extra metadata about the important fields above.
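
The join between a media-item and its named entities might be sketched as follows. This rests on an assumption that should be checked against real data: that `engMainTextEntities` is a mapping whose keys are the `_id` values of documents in named-entities. The helper names are hypothetical; the `ids` query itself is a standard Elasticsearch query.

```python
def entity_ids_for_item(media_item_source):
    """Collect entity ids from a media item's _source.

    ASSUMPTION: engMainTextEntities is a dict keyed by named-entity _id.
    """
    entities = media_item_source.get("engMainTextEntities") or {}
    return sorted(entities.keys())


def build_entity_lookup(entity_ids):
    """Build a search body that fetches the given named entities by _id."""
    ids = list(entity_ids)
    return {"query": {"ids": {"values": ids}}, "size": len(ids)}
```

The resulting body would be sent to [base url]/named-entities/_search; `size` is set explicitly because only 10 results are returned by default.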

Stories

This table contains a collection of stories, which are clusters of individual media-items.

Media-items have a field engStorylineId which (if present) contains the id of the story cluster in which it is placed. Each media-item belongs to at most one story cluster.

[base url]/stories

Example queries:

index structure: [base url]/stories/?pretty
sample items: [base url]/stories/_search?pretty
item count: [base url]/stories/_count?pretty

Field	Description
_id	a DB id of the story
summary	a string of sentences that summarize all news items in the story cluster

Note that this list is not exhaustive. The data structure as returned from the database will contain additional fields which may give extra metadata about the important fields above.
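
As an illustration of the relationship between media-items and stories, the sketch below groups a list of search hits by `engStorylineId` on the client side. It assumes the usual Elasticsearch hit layout (`_id` plus `_source`); the function name is illustrative.

```python
from collections import defaultdict


def group_by_story(hits):
    """Split media-item hits into story clusters and unclustered items.

    Returns (clusters, unclustered): clusters maps a story id to the list of
    media-item ids in it; unclustered lists items with no engStorylineId.
    """
    clusters = defaultdict(list)
    unclustered = []
    for hit in hits:
        story_id = hit.get("_source", {}).get("engStorylineId")
        if story_id is None:
            unclustered.append(hit["_id"])
        else:
            clusters[story_id].append(hit["_id"])
    return dict(clusters), unclustered
```

Each key of the returned `clusters` dict can then be looked up in the stories table to retrieve the cluster summary.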

Example Queries

Here are some example queries, just to get you started.

By default, the database returns only 10 results. Use the size URL parameter to adjust this:
```
curl -XGET 'http://summadb.summa-project.eu/_search?pretty&size=25'
```

Determine which languages are available, and how many media items exist in each:

curl -XGET 'http://summadb.summa-project.eu/media-items/_search?pretty' -d '{
"size": 0,
"aggs": {
  "group_by_lang": {
    "terms": {
      "field": "contentDetectedLangCode.keyword"
    }
  }
}
}' 
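
A sketch of how the aggregation response could be turned into a simple language-to-count mapping in Python. The helper names are illustrative. Note that a terms aggregation returns only the 10 largest buckets unless a `size` is set inside it, so the builder below adds one; the default of 50 is an arbitrary choice.

```python
def language_agg_query(max_languages=50):
    """Build the aggregation body above, with an explicit bucket limit."""
    return {
        "size": 0,  # no individual hits, only the aggregation
        "aggs": {
            "group_by_lang": {
                "terms": {
                    "field": "contentDetectedLangCode.keyword",
                    "size": max_languages,
                }
            }
        },
    }


def language_counts(response, agg_name="group_by_lang"):
    """Map each language code to its document count."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}
```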

Count all items added during October 2017 (the query filters on timeAdded):

curl -XGET 'http://summadb.summa-project.eu/_count?pretty' -d '
{
  "query" : {
      "range" : {
          "timeAdded" : { "from" : "2017-10-01", "to" : "2017-10-31" }
      }
  }
}'

Retrieve the items added during October 2017 (only the first 10 are returned by default):

curl -XGET 'http://summadb.summa-project.eu/_search?pretty' -d '
{
  "query" : {
      "range" : {
          "timeAdded" : { "from" : "2017-10-01", "to" : "2017-10-31" }
      }
  }
}'

Count the media items whose English main text contains the word “brexit”:

curl -XGET 'http://summadb.summa-project.eu/media-items/_count?pretty' -d '{
"query": {
  "term" : { "engMainText" : "brexit" }
}
}'

Count the media items whose English main text contains the word “brexit” or “trump”:

curl -XGET 'http://summadb.summa-project.eu/media-items/_count?pretty' -d '{
"query": {
  "bool": {
    "should": [{"term" : { "engMainText" : "brexit" }},{"term" : { "engMainText" : "trump" }}]
   }
}
}'
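
The "any of these words" pattern generalises to a list of arbitrary length. The sketch below builds the same bool/should body in Python; the function name is illustrative. Lowercasing the words is an assumption: a term query is not analysed, so it only matches if the indexed tokens were lowercased at index time, as Elasticsearch's default analyser does.

```python
def any_word_query(field, words):
    """Build a query matching items whose field contains any of the words."""
    return {
        "query": {
            "bool": {
                # With only `should` clauses, at least one must match.
                "should": [{"term": {field: word.lower()}} for word in words]
            }
        }
    }
```

Calling `any_word_query("engMainText", ["brexit", "trump"])` reproduces the request body shown above.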

Notes on Data Quality

Many media-items contain the results of an automatic speech-to-text (ASR) process. In the case of non-English items, the text is then subjected to an automatic translation process (MT). The SUMMA ASR and MT processes are state of the art, but both are areas of ongoing research and the quality of their output can be highly variable.

As the database contains a mix of text articles and transcripts from audio and video, media-items with sourceItemType=Article will generally be of the highest quality, as no ASR process was used.
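
To restrict any query to text articles only, one option is to wrap it in a bool query with a filter on `sourceItemType`. This is a sketch with a hypothetical helper name. It uses a `match` clause rather than `term` because the field's mapping is not documented here, and `match` works whether the field is indexed as analysed text or as a keyword.

```python
def articles_only(inner_query):
    """Wrap a query clause so that it only matches items of type Article."""
    return {
        "query": {
            "bool": {
                "must": [inner_query],
                # Filter clauses restrict the results without affecting scoring.
                "filter": [{"match": {"sourceItemType": "Article"}}],
            }
        }
    }
```

For example, `articles_only({"term": {"engMainText": "brexit"}})` counts or searches only articles mentioning “brexit”.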

Developer Resources

Python

You need the following modules for Python:

Python 2.x: pip install elasticsearch
Python 3.x: pip3 install elasticsearch

Python Elasticsearch is available for manual download at https://elasticsearch-py.readthedocs.io. Clients should use version 5.x.x of the library for compatibility with the Elasticsearch v5.6.3 database.

For example:

$ pip install elasticsearch
$ python
>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch([{'host': 'toydb.summa-project.eu', 'port':80}])
>>> es.count(index='media-items')
{u'count': 10, u'_shards': {u'successful': 5, u'failed': 0, u'skipped': 0, u'total': 5}}
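
Building on the session above, a query body can be passed to the client's `count` method through the `body` argument. The sketch below wraps this in a hypothetical helper, `count_mentions`, that accepts any client object; the lines that create a real client are shown as comments, and the host in them is an assumption to replace with the database you are using.

```python
def count_mentions(es, word, index="media-items", field="engMainText"):
    """Count items in `index` whose `field` contains `word`."""
    body = {"query": {"term": {field: word}}}
    return es.count(index=index, body=body)["count"]


# With a real client (requires `pip install elasticsearch` and a
# whitelisted network; the host below is an assumption):
# from elasticsearch import Elasticsearch
# es = Elasticsearch([{"host": "summadb.summa-project.eu", "port": 80}])
# print(count_mentions(es, "brexit"))
```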

Media Playback

Media is available as a .mp4 file or a .m3u8 playlist file.

Live video is available as chunks of roughly 5 minutes. A chunk may contain more than one news story, and a story may be truncated at the end of a chunk. Full stories can be reconstituted by following the values in customMetadata.prev_chunk_relative_url or customMetadata.next_chunk_relative_url.
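
Following the chunk links could be sketched as below. This is only an outline under stated assumptions: `customMetadata` is taken to be a field of the media item, and `lookup` is a hypothetical callable, supplied by you, that maps a relative URL to the media item for that chunk (or returns `None`). How to resolve a relative URL against the database is not covered by this document.

```python
def collect_chunk_urls(start_item, lookup, direction="next", limit=10):
    """Walk prev/next chunk links from a media item and return the URLs found.

    direction is "next" or "prev"; limit caps the number of chunks followed.
    """
    key = "{}_chunk_relative_url".format(direction)
    urls = []
    item = start_item
    while item is not None and len(urls) < limit:
        url = (item.get("customMetadata") or {}).get(key)
        if not url or url in urls:
            break  # end of the chain, or a loop
        urls.append(url)
        item = lookup(url)  # hypothetical resolver: URL -> media item or None
    return urls
```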

The JavaScript HLS client hls.js (http://video-dev.github.io/hls.js), which uses Media Source Extensions, makes it easy to embed the content from an .m3u8 file in any modern browser. Alternatively, a media player such as VLC may be used.

Visualisation Resources

Collections

Benjamin: https://github.com/benjbach/vishub/wiki/Visualization-Tools

Medialab: http://tools.medialab.sciences-po.fr

Voyant (many tools on text): http://docs.voyant-tools.org/tools

General purpose

Rawgraphs.io: http://rawgraphs.io

Tableau: https://public.tableau.com/en-us/s

Multivariate data

PoleStar: https://vega.github.io/polestar

iVisDesigner: https://donghaoren.org/ivisdesigner

Text visualization

Textexture (text networks): http://textexture.com

Networks

Gephi: http://gephi.org

iVisDesigner: https://donghaoren.org/ivisdesigner

Feedback

Problems, suggestions, missing information? Contact me before the day at andrew.secker@bbc.co.uk or find me at the venue.