SUMMA Hack: Technical Documentation
SUMMA Database
The SUMMA database is exposed through the REST API of a single Elasticsearch instance. For the purposes of the event, this database is static and will not be updated with live data during the hack.
The database is running Elasticsearch version 5.6.3.
Database location
The full SUMMA (production) database is available at http://summadb.summa-project.eu.
The development SUMMA database is available at http://summadb2.summa-project.eu. This development database has not been extensively tested.
- The development database has improved transcripts for video and live video, but has no story clustering or topic detection at this time.
- Media-item IDs match between the production and development databases.
Either URL can be substituted for [base url] in the rest of this document.
Getting started
To test your connection to the database, perform an HTTP GET to [base url]. You should receive some JSON as a response to indicate the database is running, like this:
{
  "name" : "LpuEwye",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "zoRkkbaZSZK4IpoPA6ZGuA",
  "version" : {
    "number" : "5.6.3",
    "build_hash" : "1a2f265",
    "build_date" : "2017-10-06T20:33:39.012Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.1"
  },
  "tagline" : "You Know, for Search"
}
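The connection test can also be scripted. The following is a minimal Python 3 sketch, not an official client: the helper names `check_root_response` and `fetch_root` are illustrative, and the offline demonstration uses a trimmed copy of the sample response above so that it runs without network access.

```python
import json
from urllib.request import urlopen

EXPECTED_VERSION = "5.6.3"  # the version stated in this document


def check_root_response(body):
    """Return the Elasticsearch version reported in the root-endpoint JSON.

    Raises ValueError if the body does not look like the banner shown above.
    """
    info = json.loads(body)
    try:
        return info["version"]["number"]
    except (KeyError, TypeError):
        raise ValueError("response is not an Elasticsearch banner")


def fetch_root(base_url):
    """HTTP GET the base URL and return the response body as text."""
    with urlopen(base_url, timeout=10) as response:
        return response.read().decode("utf-8")


# Offline demonstration with a trimmed copy of the sample response.
sample = '{"name": "LpuEwye", "version": {"number": "5.6.3"}, "tagline": "You Know, for Search"}'
version = check_root_response(sample)
print(version == EXPECTED_VERSION)
```

From inside the venue network, `check_root_response(fetch_root("http://summadb.summa-project.eu"))` would be the live equivalent of the offline check.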
If you want to investigate more before reading about the structure of the database, a list of example queries can be found towards the bottom of this document.
Notes on Security and Restrictions
Database access is controlled by IP whitelist. Access is available from the Trampery’s internal wireless networks; access from outside the venue is not.
For security, requests are limited to HTTP GET only. All other requests are blocked. Be advised that:
- Some clients (e.g. Postman) are unable to perform a GET with a body.
- Some HTTP proxies are unable to pass GET with a body.
There are a number of simple ways to work with this restriction.
- Python: The Python binding accounts for this restriction as described in the Python binding docs.
- If a proxy is blocking the requests, the URL-encoded query can be placed in the request URL using the source parameter:
<?php
# your base url
$url="http://summadb.summa-project.eu/_search?pretty";
# the "query" string
$query='{
"query" : {
"range" : {
"timeAdded" : { "from" : "2017-10-01", "to" : "2017-10-31" }
}
}
}';
# construct a new url
$callme=$url."&source=".urlencode($query);
# show how to use in a shell/bash script (etc)
echo "You can run in your shell as:\n\ncurl -XGET \"$callme\"\n\n";
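The same URL construction as the PHP snippet above can be sketched in Python 3. The helper name `build_get_url` is illustrative; `quote_plus` is used because it encodes spaces as `+`, which mirrors PHP's `urlencode`.

```python
import json
from urllib.parse import quote_plus


def build_get_url(base_url, path, query):
    """Return a GET URL that carries `query` in the `source` parameter."""
    encoded = quote_plus(json.dumps(query))
    return "{}/{}?pretty&source={}".format(
        base_url.rstrip("/"), path.lstrip("/"), encoded
    )


# The same range query as in the PHP example.
october_2017 = {
    "query": {
        "range": {
            "timeAdded": {"from": "2017-10-01", "to": "2017-10-31"}
        }
    }
}

url = build_get_url("http://summadb.summa-project.eu", "_search", october_2017)
print(url)
```

The printed URL can be pasted into a browser or passed to `curl -XGET`, since the query no longer travels in the request body.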
Database Structure
The database contains 3 tables:
- media-items
- named-entities
- stories
To list the available tables:
[base url]/_aliases?pretty=1
Media Items
This table contains a collection of individual news stories.
[base url]/media-items
Example queries:
- Index structure:
[base url]/media-items/?pretty
- Item count:
[base url]/media-items/_count?pretty
- Sample items:
[base url]/media-items/_search?pretty
Field | Description |
---|---|
_id | a DB id of the media item |
timeAdded | datetime when media item was added |
feedURL | the URL of the feed this item came from |
sourceItemOriginFeedName | the name of the feed this item came from |
sourceItemType | type of media item; one of “Article”, “Video”, “livefeed-logical-chunk” |
sourceItemLangeCodeGuess | the most likely language of the media item content as ISO 639-1 lang code, see https://en.wikipedia.org/wiki/ISO_639-1 |
sourceItemDate | the publishing date of the media item |
sourceItemIdAtOrigin | the id of the source item in the origin feed |
sourceItemTitle | title of the media item in the original language (may be missing or auto-generated for live stream chunks) |
sourceItemMainText | the main textual content of the media item in the original language (may be missing) |
sourceItemTeaser | a teaser of the media item in the original language (only for DW content) |
sourceItemKeywords | an array of string keywords in the original language (only for DW content) |
sourceItemVideoURL | a URL of the video segment in the original language (may be missing) |
contentDetectedLangCode | ISO 639-1 language code of the detected language |
contentTranscribedMainText | a transcript of the video in the original language without punctuation marks but with word-level timestamps and confidence scores |
contentTranscribedPunctuatedMainText | a transcript in the original language with punctuation marks |
engTitle | English translation of the title |
engMainText | English translation of the main text |
engTeaser | English translation of the teaser |
engKeywords | English translations of the keywords |
engTranscript | English translation of the transcript |
engTeaserEntities | list of entities in engTeaser including positions in text |
engMainTextEntities | list of entities in engMainText including positions in text |
engTranscriptEntities | list of entities in engTranscript including positions in text |
engStorylineId | a DB id of the story cluster with similar media items. See “Stories” below |
highlightItems | a list of sentences that summarize the media item |
engTeaserRelationships | extracted relationships in engTeaser |
engMainTextRelationships | extracted relationships in engMainText |
engTranscriptRelationships | extracted relationships in engTranscript |
Note that this list is not exhaustive. The data structure as returned from the database will contain additional fields which may give extra metadata about the important fields above.
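Because each media item carries many fields, it can help to request only the ones you need. The sketch below builds a request body that uses Elasticsearch's `_source` filtering and flattens the standard `hits.hits` response into simple rows. The helper names and the example response are hypothetical placeholders, not real SUMMA data.

```python
# Fields taken from the table above; choose whichever you need.
FIELDS = ["engTitle", "sourceItemType", "sourceItemDate"]


def fields_query(fields, size=10):
    """Build a search body that returns only the listed fields."""
    return {"size": size, "_source": list(fields), "query": {"match_all": {}}}


def rows_from_response(response, fields):
    """Flatten a search response into one dict per hit."""
    rows = []
    for hit in response["hits"]["hits"]:
        source = hit.get("_source", {})
        row = {"_id": hit["_id"]}
        for name in fields:
            # Many fields "may be missing", so default to None.
            row[name] = source.get(name)
        rows.append(row)
    return rows


# Hypothetical response with placeholder values, for illustration only.
example_response = {
    "hits": {
        "total": 1,
        "hits": [
            {
                "_id": "example-id",
                "_source": {"engTitle": "Example title", "sourceItemType": "Article"},
            }
        ],
    }
}

rows = rows_from_response(example_response, FIELDS)
print(rows)
```

The body from `fields_query` would be sent to [base url]/media-items/_search in the same way as the other queries in this document.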
Named Entities
This table contains a collection of named entities - people, places, things, etc.
Elements in this table are joined to a media-item using the keys in “engMainTextEntities”.
[base url]/named-entities
Example queries:
- index structure:
[base url]/named-entities/?pretty
- item count:
[base url]/named-entities/_count?pretty
- sample items:
[base url]/named-entities/_search?pretty
Field | Description |
---|---|
_id | a DB id of the named entity |
baseForm | a base form of the named entity |
type | a named entity type, e.g. “person”, “url”, “place” |
Note that this list is not exhaustive. The data structure as returned from the database will contain additional fields which may give extra metadata about the important fields above.
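The join described above could be sketched as follows. This rests on an assumption that should be checked against a real item first: that engMainTextEntities is a JSON object whose keys are the _id values of named entities. The helper names and the sample item are hypothetical. The `ids` query itself is a standard Elasticsearch query type.

```python
def entity_ids(media_item_source, field="engMainTextEntities"):
    """Collect entity ids from a media item's _source.

    ASSUMPTION: the field is a mapping keyed by named-entity _id.
    """
    entities = media_item_source.get(field) or {}
    if not isinstance(entities, dict):
        raise TypeError("expected a mapping keyed by entity id; inspect the field first")
    return sorted(entities)


def named_entities_query(ids):
    """Build a body for [base url]/named-entities/_search that fetches the given ids."""
    ids = list(ids)
    return {"size": len(ids), "query": {"ids": {"values": ids}}}


# Hypothetical media item with placeholder entity ids.
item = {"engMainTextEntities": {"entity-b": {}, "entity-a": {}}}
query = named_entities_query(entity_ids(item))
print(query)
```

If the field turns out to have a different shape, only `entity_ids` needs to change; the `ids` query stays the same.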
Stories
This table contains a collection of stories, which are clusters of individual media-items.
Media-items have a field engStorylineId which (if present) contains the id of the story cluster in which the item is placed. Each media-item belongs to at most one story cluster.
[base url]/stories
Example queries:
- index structure:
[base url]/stories/?pretty
- sample items:
[base url]/stories/_search?pretty
- item count:
[base url]/stories/_count?pretty
Field | Description |
---|---|
_id | a DB id of the story |
summary | a string of sentences that summarize all news items in the story cluster |
Note that this list is not exhaustive. The data structure as returned from the database will contain additional fields which may give extra metadata about the important fields above.
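The relationship between media-items and stories can be sketched in two small helpers: one builds a query for the members of a story, the other groups already-fetched items by story. The names are illustrative and the sample ids are placeholders. Depending on how engStorylineId is mapped, the `term` query may need the `.keyword` sub-field instead, which is why the field name is a parameter.

```python
from collections import defaultdict


def story_members_query(story_id, size=10, field="engStorylineId"):
    """Build a body for [base url]/media-items/_search listing one story's items."""
    return {"size": size, "query": {"term": {field: story_id}}}


def group_by_story(items):
    """Group (item_id, _source) pairs by story id.

    Items without engStorylineId are collected under the key None.
    """
    groups = defaultdict(list)
    for item_id, source in items:
        groups[source.get("engStorylineId")].append(item_id)
    return dict(groups)


# Placeholder ids, for illustration only.
items = [
    ("m1", {"engStorylineId": "s1"}),
    ("m2", {}),
    ("m3", {"engStorylineId": "s1"}),
]
groups = group_by_story(items)
print(groups)
```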
Example Queries
Here are some example queries, just to get you started.
- By default, the database will only return 10 results. Use the size parameter to adjust this:
curl -XGET 'http://summadb.summa-project.eu/_search?pretty&size=25'
- Determine what languages are available, and how many media items in each of those languages:
curl -XGET 'http://summadb.summa-project.eu/media-items/_search?pretty' -d '{ "size": 0, "aggs": { "group_by_lang": { "terms": { "field": "contentDetectedLangCode.keyword" } } } }'
- Count all items added to the database during October 2017 (the query filters on timeAdded):
curl -XGET 'http://summadb.summa-project.eu/_count?pretty' -d ' { "query" : { "range" : { "timeAdded" : { "from" : "2017-10-01", "to" : "2017-10-31" } } } }'
- Retrieve all items added to the database during October 2017 (the first 10 will be returned by default):
curl -XGET 'http://summadb.summa-project.eu/_search?pretty' -d ' { "query" : { "range" : { "timeAdded" : { "from" : "2017-10-01", "to" : "2017-10-31" } } } }'
- Count the number of media items whose English main text contains the word “brexit”:
curl -XGET 'http://summadb.summa-project.eu/media-items/_count?pretty' -d '{ "query": { "term" : { "engMainText" : "brexit" } } }'
- Count the number of media items whose English main text contains the word “brexit” or “trump”:
curl -XGET 'http://summadb.summa-project.eu/media-items/_count?pretty' -d '{ "query": { "bool": { "should": [{"term" : { "engMainText" : "brexit" }},{"term" : { "engMainText" : "trump" }}] } } }'
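The last query generalises to any list of words. The sketch below builds the same `bool`/`should` body in Python; the helper name `any_word_query` is illustrative. Words are lowercased because `term` queries do not analyze their input, so "Brexit" would not match text that was lowercased at index time (this assumes the field uses Elasticsearch's default analysis).

```python
def any_word_query(field, words):
    """Build a query matching documents that contain at least one of the words."""
    clauses = [{"term": {field: word.lower()}} for word in words]
    # With only "should" clauses, at least one of them must match.
    return {"query": {"bool": {"should": clauses}}}


query = any_word_query("engMainText", ["Brexit", "Trump"])
print(query)
```

Sent to [base url]/media-items/_count, this body is equivalent to the final curl example above.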
Notes on Data Quality
Many media-items contain the results of an automatic speech-to-text (ASR) process. In the case of non-English items, the text is then subjected to an automatic translation process (MT). The SUMMA ASR and MT processes are state of the art, but they also represent areas of ongoing research and the quality of the output of both of these processes can be highly variable.
As the database contains a mix of text articles and transcripts from audio-video, media-items with sourceItemType=Article will be of the highest quality, as no ASR process was used.
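To work only with the higher-quality text articles, an existing query can be wrapped in a filter on sourceItemType. This is a sketch with an illustrative helper name; a `match` query is used for the filter as an assumption that it will work whether the field is mapped as text or as keyword.

```python
def restrict_to_articles(search_body):
    """Return a copy of a search body that only matches sourceItemType=Article."""
    inner = search_body.get("query", {"match_all": {}})
    restricted = dict(search_body)  # shallow copy; the input is left unchanged
    restricted["query"] = {
        "bool": {
            "must": [inner],
            "filter": [{"match": {"sourceItemType": "Article"}}],
        }
    }
    return restricted


brexit = {"query": {"term": {"engMainText": "brexit"}}}
articles_about_brexit = restrict_to_articles(brexit)
print(articles_about_brexit)
```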
Developer Resources
Python
You need the following modules for python:
- Python 2.x:
pip install elasticsearch
- Python 3.x:
pip3 install elasticsearch
Python Elasticsearch is available for manual download at https://elasticsearch-py.readthedocs.io. Clients should use version 5.x.x of the library for compatibility with the Elasticsearch v5.6.3 database; with pip the version can be pinned, for example pip install "elasticsearch>=5.0.0,<6.0.0".
For example:
$ pip install elasticsearch
$ python
>>> from elasticsearch import Elasticsearch
>>> es = Elasticsearch([{'host': 'toydb.summa-project.eu', 'port':80}])
>>> es.count(index='media-items')
{u'count': 10, u'_shards': {u'successful': 5, u'failed': 0, u'skipped': 0, u'total': 5}}
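Beyond counting, the client's `search` method accepts the same JSON bodies as the curl examples. The sketch below wraps the language aggregation from the example queries in a function that takes any client object, so it can be tried offline with a stub. `languages` and `StubClient` are illustrative names, and the stub's counts are placeholders rather than real data.

```python
def languages(es, index="media-items"):
    """Return {language code: number of media items} using a terms aggregation."""
    body = {
        "size": 0,
        "aggs": {
            "group_by_lang": {
                "terms": {"field": "contentDetectedLangCode.keyword"}
            }
        },
    }
    response = es.search(index=index, body=body)
    buckets = response["aggregations"]["group_by_lang"]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


class StubClient:
    """Stand-in for an Elasticsearch client so the sketch runs offline."""

    def search(self, index=None, body=None):
        self.last_call = (index, body)
        # Placeholder numbers, not real SUMMA counts.
        return {"aggregations": {"group_by_lang": {"buckets": [{"key": "en", "doc_count": 2}]}}}


stub = StubClient()
print(languages(stub))

# With the real client from the example above: languages(es)
# If a proxy strips GET bodies, elasticsearch-py can send the body in the
# `source` URL parameter instead:
#   Elasticsearch([{'host': ..., 'port': 80}], send_get_body_as='source')
```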
Media Playback
Media is available as a .mp4 file or a .m3u8 playlist file.
Live video is available as chunks of roughly 5 minutes. A chunk may contain more than one news story, and a story may be truncated at the end of a chunk. Full stories can be reconstituted by following the values in customMetadata.prev_chunk_relative_url or customMetadata.next_chunk_relative_url.
The JavaScript HLS client hls.js (http://video-dev.github.io/hls.js), which uses Media Source Extensions, makes it easy to embed the content from an .m3u8 file in any modern browser. Alternatively, a media player such as VLC may be used.
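Following the chunk links amounts to walking a linked list. The sketch below shows only that walk. How a relative URL maps back to a media item is not specified here, so `resolve` is a hypothetical caller-supplied function, and the chunk records (including their `name` key) are placeholders.

```python
def walk_chunks(start, resolve, key="next_chunk_relative_url", limit=20):
    """Follow chunk links from `start` and return the chunks visited, in order.

    `start` is a media item's _source dict.
    `resolve` maps a relative URL to the neighbouring item's _source, or None
    (hypothetical: how to look this up is an assumption left to the caller).
    """
    chain = [start]
    seen = set()
    current = start
    while len(chain) < limit:
        url = (current.get("customMetadata") or {}).get(key)
        if not url or url in seen:  # no further link, or a link that repeats
            break
        seen.add(url)
        neighbour = resolve(url)
        if neighbour is None:
            break
        chain.append(neighbour)
        current = neighbour
    return chain


# Placeholder chunks keyed by a made-up relative URL.
chunks = {
    "chunk-1": {"name": "chunk-1", "customMetadata": {"next_chunk_relative_url": "chunk-2"}},
    "chunk-2": {"name": "chunk-2", "customMetadata": {"prev_chunk_relative_url": "chunk-1",
                                                      "next_chunk_relative_url": "chunk-3"}},
    "chunk-3": {"name": "chunk-3", "customMetadata": {"prev_chunk_relative_url": "chunk-2"}},
}

forward = walk_chunks(chunks["chunk-1"], chunks.get)
print([chunk["name"] for chunk in forward])
```

Passing `key="prev_chunk_relative_url"` walks backwards instead, and `limit` caps how many chunks are collected.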
Visualisation Resources
Collections
Benjamin: https://github.com/benjbach/vishub/wiki/Visualization-Tools
Medialab: http://tools.medialab.sciences-po.fr
Voyant (many tools on text): http://docs.voyant-tools.org/tools
General purpose
Rawgraphs.io: http://rawgraphs.io
Tableau: https://public.tableau.com/en-us/s
Multivariate data
PoleStar: https://vega.github.io/polestar
iVisDesigner: https://donghaoren.org/ivisdesigner
Text visualization
Textexture (text networks): http://textexture.com
Networks
Gephi: http://gephi.org
iVisDesigner: https://donghaoren.org/ivisdesigner
Feedback
Problems, suggestions, missing information? Contact me before the day at andrew.secker@bbc.co.uk or find me at the venue.