Consume Data
Browsing
At minimum, Data Connect implementations support browsing by table. This means these operations from the API specs are supported for table by table browsing.
On the right is example code to browse the tables-in-a-bucket implementation of Data Connect.
# init search client
from search_python_client.search import DrsClient, SearchClient
base_url_tiab = 'https://storage.googleapis.com/ga4gh-tables-example/'
search_client_tiab = SearchClient(base_url=base_url_tiab)
# get tables
tables_iterator = search_client_tiab.get_table_list()
tables = [next(tables_iterator, None) for i in range(10)]
tables = list(filter(None, tables))
print(tables)
# get table info
table_name = tables[0]['name']
table_info = search_client_tiab.get_table_info(table_name)
print(table_info)
# get table data
table_name = tables[0]['name']
table_data_iterator = search_client_tiab.get_table_data(table_name)
table_data = [next(table_data_iterator, None) for i in range(10)]
table_data = list(filter(None, table_data))
print(table_data)
Under construction
https://colab.research.google.com/drive/1VOP2IcPjsX4U-DfuiTs7Tr0SVlAD0IMh?usp=sharing <= doesn't work right now.
Get list of tables
search-cli list --api-url https://storage.googleapis.com/ga4gh-tables-example
search-cli info subjects --api-url https://storage.googleapis.com/ga4gh-tables-example
search-cli data subjects --api-url https://storage.googleapis.com/ga4gh-tables-example
Queries
Data Connect supports query operation through SQL statements.
Data Connect’s SQL dialect has a familiar interface inspired by current major open source database platforms, including Trino, PostgreSQL, MySQL, and BigQuery. If you have prior experience with these database platforms, you’ll feel right at home with only minor adjustments.
This query returns all female patients from the patient
table.
/* you can scroll on this tab */
SELECT *
FROM kidsfirst.ga4gh_tables.patient
WHERE Json_extract_scalar(patient, '$.gender') = 'female'
LIMIT 5;
This query returns all conditions observed in female patients from the patient
table.
/* you can scroll on this tab */
SELECT Json_extract_scalar(ncpi_disease, '$.code.text') AS disease,
Json_extract_scalar(ncpi_disease, '$.identifier[0].value') AS identifier
FROM kidsfirst.ga4gh_tables.ncpi_disease disease
INNER JOIN kidsfirst.ga4gh_tables.patient patient
ON patient.id = REPLACE(Json_extract_scalar(ncpi_disease,
'$.subject.reference'),
'Patient/')
WHERE Json_extract_scalar(patient, '$.gender') = 'female'
LIMIT 5;
Issuing Queries Using Data Connect
Data Connect can be accessed through the straightforward HTTP calls described in its OpenAPI specification.
While Data Connect API can be navigated using programs like cURL or Postman, it is best accessed programmatically. The results could be split into multiple pages, which is easier to navigate with programmatic access.
Fetch each page only once. Data Connect servers are allowed to “forget” page URLs after you fetch them. This allows the server implementations to be more efficient.
On the right, we provide examples to consume data from Data Connect using the GA4GH Commandline Interface, the R client, Python, and cURL.
# Installing the client library form PyPi
pip install search-python-client
# Installing from Github
pip install git+https://github.com/DNAstack/search-python-client --no-cache-dir
# Building the query
from search_python_client.search import DrsClient, SearchClient
base_url = 'https://search-presto-public.staging.dnastack.com'
search_client = SearchClient(base_url=base_url)
query = """
SELECT Json_extract_scalar(ncpi_disease, '$.code.text') AS disease,
Json_extract_scalar(ncpi_disease, '$.identifier[0].value') AS identifier
FROM kidsfirst.ga4gh_tables.ncpi_disease disease
INNER JOIN kidsfirst.ga4gh_tables.patient patient
ON patient.id = REPLACE(Json_extract_scalar(ncpi_disease,
'$.subject.reference'),
'Patient/')
WHERE Json_extract_scalar(patient, '$.gender') = 'female'
LIMIT 5
"""
# Executing the query
table_data_iterator = search_client.search_table(query)
for item in table_data_iterator:
print(item)
# Results
{'disease': 'Aortic atresia', 'identifier': 'Condition|SD_PREASA7S|272|Aortic atresia|None'}
{'disease': 'Mitral atresia', 'identifier': 'Condition|SD_PREASA7S|272|Mitral atresia|None'}
{'disease': 'Hypoplasia ascending aorta', 'identifier': 'Condition|SD_PREASA7S|272|Hypoplasia ascending aorta|None'}
{'disease': 'Hypoplastic left heart syndrome', 'identifier': 'Condition|SD_PREASA7S|272|Hypoplastic left heart syndrome|None'}
{'disease': 'Hypoplastic left ventricle (subnormal cavity volume)', 'identifier': 'Condition|SD_PREASA7S|272|Hypoplastic left ventricle (subnormal cavity volume)|None'}
# installing devtools
dir.create(path = Sys.getenv("R_LIBS_USER"), showWarnings = FALSE, recursive = TRUE)
install.packages("devtools", lib = Sys.getenv("R_LIBS_USER"), repos = "https://cran.rstudio.com/")
# installing the R client
devtools::install_github("DNAstack/ga4gh-search-client-r")
# Making the request
library(httr)
conditionsInFemalePatients <- ga4gh.search::ga4gh_search("https://search-presto-public.staging.dnastack.com", "select json_extract_scalar(ncpi_disease, '$.code.text') as disease, json_extract_scalar(ncpi_disease, '$.identifier[0].value') as identifier from kidsfirst.ga4gh_tables.ncpi_disease disease INNER JOIN kidsfirst.ga4gh_tables.patient patient ON patient.id=replace(json_extract_scalar(ncpi_disease, '$.subject.reference'), 'Patient/') WHERE json_extract_scalar(patient, '$.gender')='female' limit 5")
# View the results
print(conditionsInFemalePatients)
Output:
disease
1 Aortic atresia
2 Mitral atresia
3 Hypoplasia ascending aorta
4 Hypoplastic left heart syndrome
5 Hypoplastic left ventricle (subnormal cavity volume)
identifier
1 Condition|SD_PREASA7S|272|Aortic atresia|None
2 Condition|SD_PREASA7S|272|Mitral atresia|None
3 Condition|SD_PREASA7S|272|Hypoplasia ascending aorta|None
4 Condition|SD_PREASA7S|272|Hypoplastic left heart syndrome|None
5 Condition|SD_PREASA7S|272|Hypoplastic left ventricle (subnormal cavity volume)|None
search-cli query -q "select json_extract_scalar(ncpi_disease, '$.code.text') as disease, json_extract_scalar(ncpi_disease, '$.identifier[0].value') as identifier from kidsfirst.ga4gh_tables.ncpi_disease disease INNER JOIN kidsfirst.ga4gh_tables.patient patient ON patient.id=replace(json_extract_scalar(ncpi_disease, '$.subject.reference'), 'Patient/') WHERE json_extract_scalar(patient, '$.gender')='female' limit 5" --api-url https://search-presto-public.staging.dnastack.com
These requests
This query returns all female patients from the patient
table.
curl --request POST \
--url https://search-presto-public.staging.dnastack.com/search \
--header 'content-type: application/json' \
--data '{ "query": "select * from kidsfirst.ga4gh_tables.patient WHERE json_extract_scalar(patient, '\''$.gender'\'')='\''female'\'' limit 5"}'
This query returns all conditions observed in female patients from the patient
table.
curl --request POST \
--url https://search-presto-public.staging.dnastack.com/search \
--header 'content-type: application/json' \
--data '{ "query": "select json_extract_scalar(ncpi_disease, '\''$.code.text'\'') as disease, json_extract_scalar(ncpi_disease, '\''$.identifier[0].value'\'') as identifier from kidsfirst.ga4gh_tables.ncpi_disease disease INNER JOIN kidsfirst.ga4gh_tables.patient patient ON patient.id=replace(json_extract_scalar(ncpi_disease, '\''$.subject.reference'\''), '\''Patient/'\'') WHERE json_extract_scalar(patient, '\''$.gender'\'')='\''female'\'' limit 5"}'
More Examples
dbGaP GECCO Example
This is a public implementation of Data Connect. Feel free to follow along with the examples and explore this endpoint with your own script.
# init search client
from search_python_client.search import DrsClient, SearchClient
base_url = 'https://search-presto-public.prod.dnastack.com/'
search_client = SearchClient(base_url=base_url)
# Find available tables
tables_iterator = search_client.get_table_list()
tables = list(tables_iterator)
import pprint
pprint.pprint(tables)
#Get more information about a table returned
table_info = search_client.get_table_info("dbgap_demo.scr_gecco_susceptibility.subject_phenotypes_multi")
pprint.pprint(table_info)
# Dig into the table a little further
table_data_iterator = search_client.get_table_data("dbgap_demo.scr_gecco_susceptibility.subject_phenotypes_multi")
# Limit to first 10 items
tables = [next(table_data_iterator, None) for i in range(10)]
tables = list(filter(None, tables))
pprint.pprint(tables)
# Select all items from the CPS-II study
query = """
SELECT *
FROM dbgap_demo.scr_gecco_susceptibility.subject_phenotypes_multi
WHERE study = 'CPS-II'
LIMIT 5
"""
# Executing the query
table_data_iterator = search_client.search_table(query)
for item in table_data_iterator:
print(item)
# installing devtools
dir.create(path = Sys.getenv("R_LIBS_USER"), showWarnings = FALSE, recursive = TRUE)
install.packages("devtools", lib = Sys.getenv("R_LIBS_USER"), repos = "https://cran.rstudio.com/")
# installing the R client
devtools::install_github("DNAstack/ga4gh-search-client-r")
# Making the request
library(httr)
ga4gh.search::ga4gh_list_tables("https://search-presto-public.prod.dnastack.com")
# Select all items from the CPS-II study
query <- "SELECT * FROM dbgap_demo.scr_gecco_susceptibility.subject_phenotypes_multi WHERE study = 'CPS-II' LIMIT 5"
# Executing the query
ga4gh.search::ga4gh_search("https://search-presto-public.prod.dnastack.com", query)
List tables
search-cli list --api-url "https://search-presto-public.prod.dnastack.com"
Get table info
search-cli info dbgap_demo.scr_gecco_susceptibility.subject_phenotypes_multi --api-url "https://search-presto-public.prod.dnastack.com"
Now run a query and pipe the results to a file called results.txt
search-cli query -q "SELECT * FROM dbgap_demo.scr_gecco_susceptibility.subject_phenotypes_multi WHERE study = 'CPS-II' LIMIT 5" \
--api-url "https://search-presto-public.prod.dnastack.com" > results.txt
COVID Cloud Example
This is a public implementation of Data Connect for COVID Cloud. Find more about COVID Cloud here.
from search_python_client.search import DrsClient, SearchClient
base_url = 'https://ga4gh-search-adapter-presto-covid19-public.prod.dnastack.com/'
search_client = SearchClient(base_url=base_url)
# Find available tables
tables_iterator = search_client.get_table_list()
tables = list(tables_iterator)
import pprint
pprint.pprint(tables)
# Get more information about a table returned
table_info = search_client.get_table_info("covid.cloud.sequences")
pprint.pprint(table_info)
# Dig into the table a little further
table_data_iterator = search_client.get_table_data("covid.cloud.sequences")
# Limit to first 10 items
tables = [next(table_data_iterator, None) for i in range(1)]
tables = list(filter(None, tables))
pprint.pprint(tables)
# Select all sequences from GenBank
query = """
SELECT *
FROM covid.cloud.sequences
WHERE sequence_type='GenBank'
LIMIT 25
"""
table_data_iterator = search_client.search_table(query)
for item in table_data_iterator:
print(item)
# installing devtools
dir.create(path = Sys.getenv("R_LIBS_USER"), showWarnings = FALSE, recursive = TRUE)
install.packages("devtools", lib = Sys.getenv("R_LIBS_USER"), repos = "https://cran.rstudio.com/")
# installing the R client
devtools::install_github("DNAstack/ga4gh-search-client-r")
# Making the request
library(httr)
ga4gh.search::ga4gh_list_tables("https://ga4gh-search-adapter-presto-covid19-public.prod.dnastack.com")
# Select all data from Genbank.
query <- "SELECT * FROM covid.cloud.sequences WHERE sequence_type='GenBank' LIMIT 25"
# Executing the query
ga4gh.search::ga4gh_search("https://ga4gh-search-adapter-presto-covid19-public.prod.dnastack.com", query)
List tables
search-cli list --api-url "https://ga4gh-search-adapter-presto-covid19-public.prod.dnastack.com"
Get table info
search-cli info covid.cloud.sequences --api-url "https://ga4gh-search-adapter-presto-covid19-public.prod.dnastack.com"
Now run a query and pipe the results to a file called results.txt
search-cli query -q "SELECT * FROM covid.cloud.sequences WHERE sequence_type='GenBank' LIMIT 25" \
--api-url "https://ga4gh-search-adapter-presto-covid19-public.prod.dnastack.com" > results.txt