Discovering the world of open data

Haoliang Yu

FOSS4G 2017

About me

  • Haoliang Yu
  • Background in geography and GIS
  • Full-stack GIS developer at NBT Solutions and VETRO FiberMap
    • map interactivity and editing
    • network analysis
    • geoprocessing
  • Open-source software developer

The world of open data

a "simple" question

How many open datasets do we have on earth?

a map of 2,600+ open data portals

made by OpenDataSoft

It is the surface of the world.

Portals

Datasets

Portals

Socrata

GeoNode

OpenDataSoft

CKAN

DKAN

Junar

ArcGIS Open Data

Platforms used to build portals

Dataset metadata API

National Agricultural Library (built with DKAN)

GET https://data.nal.usda.gov/data.json

Result

{
  "@context": "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld",
  "@id": "http://demo.getdkan.com/data.json",
  "@type": "dcat:Catalog",
  "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
  "describedBy": "https://project-open-data.cio.gov/v1.1/schema/catalog.json",
  "dataset": [
    {
      "@type": "dcat:Dataset",
      "accessLevel": "public",
      "contactPoint": {
        "fn": "Lewis, Kristin",
        "hasEmail": "mailto:Kristin.Lewis@dot.gov"
      },
      "describedBy": "https://data.nal.usda.gov/dataset/feedstock-readiness-level-evaluations-summary-table-v30/resource/1ff103ed-dd64-4215-8f1a-6847be574abc",
      "description": "<p>Feedstock readiness level evaluations are performed for a specific feedstock-conversion process combination and for a particular region. FSRL evaluations complement evaluations of Fuel Readiness Level (FRL) and environmental progress.  The table in this dataset collates the results of the FSRL evaluations listed under the Farm2Fly Ag Data Commons datasets to enable users to quickly identify, review, and compare available evaluations.  Evaluation scores are explained in the FSRL Checklist and Template available on the NAL Ag Data Commons - scores range from 1 to 9, with higher values indicating greater maturity of the feedstock in each area of assessment (production, market development, policy evaluation and compliance, and linkage to conversion efficiency).  The overall score reflects the lowest maturity area within the four assessment areas.</p>\n<p>Summary data files of the compiled evaluations will be added to the repository on a quarterly basis, and are cumulative (the last quarter will contain the compiled evaluation results from the entire year).  To access the newest evaluations that are not yet included in the most recent compilation, visit the <a href=\"https://data.nal.usda.gov/farm-2-fly\">Farm 2 Fly program page</a> to view all datasets. The date of update/submission is indicated in the title of the file.</p>\n",
      "distribution": [
        {
          "@type": "dcat:Distribution",
          "downloadURL": "https://data.nal.usda.gov/system/files/FSRL%20Evaluations%20Summary%20Table_Q2_2017.xlsx",
          "mediaType": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
          "format": "xlsx"
        },
        {
          "@type": "dcat:Distribution",
          "downloadURL": "https://data.nal.usda.gov/system/files/FSRL%20Evaluations%20Summary%20Table_Q2_2017_0.csv",
          "mediaType": "text/csv",
          "format": "csv"
        },
        {
          "@type": "dcat:Distribution",
          "downloadURL": "https://data.nal.usda.gov/system/files/Data%20Dictionary%20Summary%20Table%20FSRL%202_0.csv",
          "mediaType": "text/csv",
          "format": "csv"
        }
      ],
      "identifier": "9625a0fb-00ef-4651-85ef-1960aa8e1180",
      "keyword": [
        "alternative fuels",
        "Andropogon gerardii",
        "aviation",
        "Beta vulgaris",
        "big bluestem",
        "Brassica carinata",
        "Brassica napus"
      ],
      "language": [
        "en"
      ],
      "modified": "2017-08-04",
      "publisher": {
        "@type": "org:Organization",
        "name": "USDA NAL"
      },
      "references": [
        "https://data.nal.usda.gov/dataset/feedstock-readiness-level-evaluations-summary-table-v20_4939",
        "https://data.nal.usda.gov/dataset/feedstock-readiness-level-evaluations-summary-table-v10_3400"
      ],
      "title": "Feedstock Readiness Level Evaluations Summary Table v3.0",
      "bureauCode": [
        "000:00"
      ],
      "programCode": [
        "000:000"
      ]
    }
  ]
}
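
A catalog in this shape can be reduced to a few fields per dataset. The sketch below is illustrative only, not the project's harvester: it parses a trimmed copy of the result above (most fields omitted) and pulls out the title, modification date, tags, and file links. The function name `summarize_catalog` is hypothetical.

```python
import json

# A trimmed copy of the catalog shown above; most fields are omitted for brevity.
CATALOG_JSON = """
{
  "@type": "dcat:Catalog",
  "dataset": [
    {
      "title": "Feedstock Readiness Level Evaluations Summary Table v3.0",
      "modified": "2017-08-04",
      "keyword": ["alternative fuels", "aviation"],
      "distribution": [
        {"downloadURL": "https://data.nal.usda.gov/system/files/FSRL%20Evaluations%20Summary%20Table_Q2_2017.xlsx",
         "format": "xlsx"},
        {"downloadURL": "https://data.nal.usda.gov/system/files/FSRL%20Evaluations%20Summary%20Table_Q2_2017_0.csv",
         "format": "csv"}
      ]
    }
  ]
}
"""


def summarize_catalog(catalog):
    """Return one small summary dict per dataset in a data.json catalog."""
    summaries = []
    for dataset in catalog.get("dataset", []):
        # Each distribution describes one downloadable file or web URL.
        files = [
            (dist.get("format"), dist.get("downloadURL"))
            for dist in dataset.get("distribution", [])
        ]
        summaries.append({
            "title": dataset.get("title"),
            "modified": dataset.get("modified"),
            "tags": dataset.get("keyword", []),
            "files": files,
        })
    return summaries


summaries = summarize_catalog(json.loads(CATALOG_JSON))
for s in summaries:
    print(s["title"], s["modified"], len(s["files"]), "files")
```

Using `.get()` with defaults keeps the loop from failing on datasets that leave out optional fields.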

Platform-based Portals

Platform            Portals
CKAN                     90
Socrata                 177
ArcGIS Open Data        595
DKAN                     55
GeoNode                  10
Junar                    11
OpenDataSoft             79
Total                  1017

Harvesting dataset metadata

How much did I collect?

Item                        Count
Dataset                 1,061,096
File & Web URL          3,698,501
Tag                     2,394,680
Category                    2,350
Publisher                   6,702
Dataset Geo-Coverage       10,296

778 portals in 313 regions

[Map legend: datasets per portal, in classes < 40, 40 - 100, 100 - 250, 250 - 800, 800+. Range: 1 - 278,112. Breaks: percentile.]
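
Percentile breaks split the portals into classes of roughly equal size, which suits counts as skewed as 1 to 278,112. The sketch below shows one way to compute such breaks with Python's standard library. The counts are hypothetical (only the minimum and maximum come from the range above), and the five-class quintile scheme is an assumption based on the five legend classes.

```python
import statistics
from bisect import bisect_right

# Hypothetical datasets-per-portal counts, for illustration only.
# Only the minimum (1) and maximum (278,112) come from the slide.
COUNTS = [1, 12, 35, 40, 78, 95, 120, 240, 260, 500, 790, 900, 5000, 278112]


def percentile_breaks(values, classes=5):
    """Return the (classes - 1) cut points that split values into equal-size groups."""
    return statistics.quantiles(values, n=classes, method="inclusive")


def classify(value, breaks):
    """Return the 0-based class index of value given sorted break points."""
    return bisect_right(breaks, value)


breaks = percentile_breaks(COUNTS)
for count in (1, 100, 278112):
    print(count, "-> class", classify(count, breaks))
```

With equal-interval breaks, a single very large portal would push almost every other portal into the lowest class; percentile breaks avoid that.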

Responding Portals by Region

What are they about?

Where are they?

A map of 1,000 dataset coverage samples

How do they grow?

What did I learn?

  • Open data is growing quickly and is highly diverse!
  • But incomplete or fuzzy dataset metadata is common.
  • Many portals follow their own practices for organizing and publishing dataset metadata.
  • Location information is often missing from the metadata.
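
One simple way to quantify incomplete metadata is to measure how often a field is absent or empty. The sketch below is not the analysis behind these slides. The records are hypothetical, and `spatial` is assumed to be the field holding location information, as in the Project Open Data schema used by data.json.

```python
def missing_rate(datasets, field):
    """Return the fraction of datasets whose field is absent or empty."""
    if not datasets:
        return 0.0
    missing = sum(1 for d in datasets if not d.get(field))
    return missing / len(datasets)


# Hypothetical records for illustration only, not harvested data.
SAMPLE = [
    {"title": "Road centerlines", "description": "City streets",
     "spatial": "-70.3,43.6,-70.2,43.7"},
    {"title": "Budget 2017", "description": ""},
    {"title": "Bus stops", "description": "Transit stops", "spatial": ""},
    {"title": "Tree inventory"},
]

for field in ("title", "description", "spatial"):
    print(f"{field}: {missing_rate(SAMPLE, field):.0%} missing")
```

An empty string counts as missing here, since a field that is present but blank is no more useful for search than one that is absent.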

Discovering the world

Moving to the Cloud

Status of the project

  • Side project
  • SingularData.net is live as an alpha version
  • Weekly data harvesting to keep track of the growth
  • Research and to-do
    • robustness and reliability
    • extract location information
    • make sense of both metadata and datasets
    • support more sources

From open-source, to open-source

  • Source code of the project

      https://github.com/SingularData

  • Portal statistics

      downloadable at http://singulardata.net/portals

  • Raw dataset metadata

      not downloadable yet

Thank you

GitHub @haoliangyu

Twitter @haoliang_yu

Email haoliang.yu@outlook.com

Goals

  • To better understand and keep track of the development of open data
  • To provide one search box for multiple portals
    • it is becoming harder for people to follow the growth of open data
  • To preserve open data metadata and their history
    • What data is updated, where, and by whom

Collect data from a portal

Portal URL:    https://data.nal.usda.gov/
Platform API:  https://{portal_url}/data.json
API Request:   GET https://data.nal.usda.gov/data.json

Dataset metadata
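
The flow above (portal URL, platform API template, request, metadata) can be sketched in a few lines. This is a minimal illustration, not the project's collector. Only the data.json template comes from the slides; the `API_TEMPLATES` table, the `dkan` key, and the function names are hypothetical, and other platforms would need their own entries.

```python
import json
from urllib.parse import urlparse
from urllib.request import urlopen

# Only the data.json template is taken from the slides.
# Keying it by platform name is an assumption made for this sketch.
API_TEMPLATES = {"dkan": "https://{portal_url}/data.json"}


def build_request_url(portal_url, platform):
    """Fill the platform's API template with the portal's host name."""
    parsed = urlparse(portal_url)
    # Tolerate portal URLs given with or without a scheme.
    host = (parsed.netloc or parsed.path).strip("/")
    return API_TEMPLATES[platform].format(portal_url=host)


def harvest(portal_url, platform, timeout=30):
    """Fetch the catalog and return its dataset list (makes a network call)."""
    with urlopen(build_request_url(portal_url, platform), timeout=timeout) as resp:
        catalog = json.load(resp)
    return catalog.get("dataset", [])


request_url = build_request_url("https://data.nal.usda.gov/", "dkan")
print("GET", request_url)
```

Keeping URL construction separate from the network call means one template per platform is the only thing that changes when a new platform is supported.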

Timeline

2017.03

2017.05

2017.06

2017.07

2017.Q4

2018

  • Collect portal info
  • First line of code
  • Finish 1st data collector
  • Start working on website
  • Deploy at AWS
  • Start collecting data weekly
  • SingularData.net is alive!
  • Modular dataset metadata schema
  • Implement user system
  • Add geospatial search
  • Study on data metadata via NLP
  • Open data via distributed network