Provision Data
Implementation
Data Connect requires the table operations to be implemented to specification for basic discovery and browsing.
Optionally, the query operations may also be implemented to support querying with SQL.
The Data Connect API is backend-agnostic: any solution that implements the API specification is valid. You can use your favorite backend web application framework to implement the Data Connect endpoints, or any HTTPS file server (a cloud blob store, for example) for a tables-in-a-bucket implementation requiring no code.
Check out the following examples for some inspiration.
Tables-in-a-bucket example
The specification allows for a no-code implementation as a collection of files served statically. This is the easiest way to start experimenting with Data Connect. As long as your storage bucket conforms to the correct file structure and has the correct sharing permissions, it is a valid Data Connect implementation.
A concrete example implementation is available here; you can try browsing it with these commands.
Here’s how you’ll need to organize your folders (a runnable sketch of this layout follows the list):
tables
: served in response to GET /tables
table/{table_name}/info
: served in response to GET /table/{table_name}/info. e.g. a table with the name mytable should have a corresponding file table/mytable/info
table/{table_name}/data
: served in response to GET /table/{table_name}/data. e.g. a table with the name mytable should have a corresponding file table/mytable/data
table/{table_name}/data_{pageNumber}
: holds additional pages of data, linked from the next_page_url of the table’s first data page (e.g. mytable’s first page links table/mytable/data_1)
table/{table_name}/data_models/{schemaFile}
: Though not required, data models may be linked via $ref. Data models can also be stored as static JSON documents and referred to by relative or absolute URLs.
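To make the layout concrete, here is a minimal, hypothetical sketch in Python (standard library only) that writes such a static tree into a local bucket/ folder ready for upload. The base_url, the table name mytable, its single id column, and the two rows are all invented for illustration; the document shapes (data, data_model, pagination.next_page_url) follow the table responses described above.
import json
from pathlib import Path

# Hypothetical values: base_url is wherever the bucket will be served from;
# the table "mytable" and its rows are invented for illustration.
base_url = "https://storage.example.com/my-data-connect-bucket"
root = Path("bucket")

def write(relative_path, doc):
    """Write one JSON document into the local tree destined for the bucket."""
    out = root / relative_path
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(doc, indent=2))

data_model_url = f"{base_url}/table/mytable/info"

# tables: served in response to GET /tables
write("tables", {"tables": [{
    "name": "mytable",
    "description": "An invented example table",
    "data_model": {"$ref": data_model_url},
}]})

# table/mytable/info: served in response to GET /table/mytable/info
write("table/mytable/info", {
    "name": "mytable",
    "data_model": {
        "$id": data_model_url,
        "description": "An invented example table",
        "properties": {"id": {"type": "string"}},
    },
})

# table/mytable/data: the first page of rows, linking the second page
write("table/mytable/data", {
    "data_model": {"$ref": data_model_url},
    "data": [{"id": "row-1"}],
    "pagination": {"next_page_url": f"{base_url}/table/mytable/data_1"},
})

# table/mytable/data_1: the final page; an absent next_page_url ends paging
write("table/mytable/data_1", {
    "data_model": {"$ref": data_model_url},
    "data": [{"id": "row-2"}],
    "pagination": {},
})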
Try a Reference Implementation
Use the following instructions to run a reference Data Connect implementation backed by a publicly accessible Trino instance hosted by DNAstack as the data source.
You’ll need Docker set up on your system to run the Spring app and the PostgreSQL database where it stores information about running queries.
The two command sets below are identical except for how the app container reaches the PostgreSQL container. The first uses host.docker.internal, which resolves to the host machine under Docker Desktop (macOS and Windows):
docker pull postgres:latest
docker run -d --rm --name dnastack-data-connect-db -e POSTGRES_USER=dataconnecttrino -e POSTGRES_PASSWORD=dataconnecttrino -p 15432:5432 postgres
docker pull dnastack/data-connect-trino:latest
docker run --rm --name dnastack-data-connect -p 8089:8089 -e TRINO_DATASOURCE_URL=https://trino-public.prod.dnastack.com -e SPRING_DATASOURCE_URL=jdbc:postgresql://host.docker.internal:15432/dataconnecttrino -e SPRING_PROFILES_ACTIVE=no-auth dnastack/data-connect-trino
If host.docker.internal is not available in your Docker setup, use this variant, which points SPRING_DATASOURCE_URL at localhost instead:
docker pull postgres:latest
docker run -d --rm --name dnastack-data-connect-db -e POSTGRES_USER=dataconnecttrino -e POSTGRES_PASSWORD=dataconnecttrino -p 15432:5432 postgres
docker pull dnastack/data-connect-trino:latest
docker run --rm --name dnastack-data-connect -p 8089:8089 -e TRINO_DATASOURCE_URL=https://trino-public.prod.dnastack.com -e SPRING_DATASOURCE_URL=jdbc:postgresql://localhost:15432/dataconnecttrino -e SPRING_PROFILES_ACTIVE=no-auth dnastack/data-connect-trino
Once you have the Data Connect implementation running, the Data Connect API will be accessible at http://localhost:8089. Here are a few things to try:
- Open http://localhost:8089/tables in your web browser to see the table list API response. It helps if you have a browser plugin that pretty-prints JSON.
- Try the Python, R, and CLI examples below. These examples access Data Connect at http://localhost:8089. See the Installing Clients section if you haven’t set up the clients yet.
- Set up your own Trino instance, then re-run the dnastack-data-connect container with TRINO_DATASOURCE_URL pointed to your own Trino instance.
Further information about this example can be found here.
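Before reaching for the clients, you can also spot-check the API with plain HTTP. A minimal sketch using the third-party requests package (an assumption; it is not part of the reference implementation) lists the tables and pages through one table by following next_page_url; urljoin is used because next_page_url may be relative:
import requests  # third-party; pip install requests
from urllib.parse import urljoin

base_url = "http://localhost:8089"

# List the tables, as in the browser check above.
tables = requests.get(urljoin(base_url, "/tables")).json()
for table in tables["tables"]:
    print(table["name"])

# Page through one table's rows by following next_page_url until it is absent.
url = urljoin(base_url, "/table/sample_phenopackets.ga4gh_tables.gecco_phenopackets/data")
while url:
    page = requests.get(url).json()
    for row in page.get("data", []):
        print(row)
    next_url = page.get("pagination", {}).get("next_page_url")
    url = urljoin(url, next_url) if next_url else None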
# init search client
from search_python_client.search import SearchClient
base_url = 'http://localhost:8089/'
search_client = SearchClient(base_url=base_url)
# get tables
tables_iterator = search_client.get_table_list()
tables = [next(tables_iterator, None) for i in range(10)]
tables = list(filter(None, tables))
print(tables)
# get table info
table_name = "sample_phenopackets.ga4gh_tables.gecco_phenopackets"
table_info = search_client.get_table_info(table_name)
print(table_info)
# get table data
table_name = "sample_phenopackets.ga4gh_tables.gecco_phenopackets"
table_data_iterator = search_client.get_table_data(table_name)
table_data = [next(table_data_iterator, None) for i in range(10)]
table_data = list(filter(None, table_data))
print(table_data)
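The client calls above exercise the required table operations. The reference implementation also supports the optional query operations, which the Data Connect specification exposes as POST /search with the SQL carried in a JSON body. A minimal sketch, again assuming the third-party requests package; pages may arrive empty while an asynchronous backend like this Trino setup is still executing the query, so the sketch keeps following next_page_url until it disappears:
import requests  # third-party; pip install requests
from urllib.parse import urljoin

base_url = "http://localhost:8089"

# The optional query operation: POST /search with the SQL in a JSON body.
query = {"query": "SELECT * FROM sample_phenopackets.ga4gh_tables.gecco_phenopackets LIMIT 5"}
url = urljoin(base_url, "/search")
page = requests.post(url, json=query).json()

# Results use the same paginated table shape as GET .../data; intermediate
# pages may carry no rows while the query is still running.
rows = []
while True:
    rows.extend(page.get("data", []))
    next_url = page.get("pagination", {}).get("next_page_url")
    if not next_url:
        break
    url = urljoin(url, next_url)
    page = requests.get(url).json()
print(rows)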
# Fetch table list
library(httr)
tables <- ga4gh.search::ga4gh_list_tables("http://localhost:8089")
print(tables)
# Try a query
search_result <- ga4gh.search::ga4gh_search("http://localhost:8089", "SELECT * FROM sample_phenopackets.ga4gh_tables.gecco_phenopackets LIMIT 5")
print(search_result)
List tables
search-cli list --api-url http://localhost:8089
Get table info
search-cli info dbgap_demo.scr_gecco_susceptibility.sample_multi --api-url http://localhost:8089
Get table data
search-cli data dbgap_demo.scr_gecco_susceptibility.sample_multi --api-url http://localhost:8089