Introduction

Just under a year ago Keshav Murthy, Vice President at CouchBase R&D, published this article detailing how to derive insights from a JSON dataset from CDC.gov.

11 months is a long time in the data world and a lot has changed between then and now. Snowflake has grown tremendously and Precog was released, setting a new standard for scalable self-service JSON, MongoDB, data lake and Web API connectivity.

This article follows in Keshav's footsteps, showing how to derive insights from JSON healthcare data using Snowflake and Precog.

Snowflake is an incredibly powerful and cost effective data warehouse. However, working with non-tabular data using Snowflake alone is an involved task that is time consuming, prone to error and not suited for self service.

Precog empowers even non-technical users to easily browse and curate tables from tabular and non-tabular data. They can then load these tables directly into software like Power BI and Tableau, databases such as Postgres and warehouses such as Snowflake.

Precog is a universal self-service connector for non-tabular data such as MongoDB, JSON data in S3, data lakes, Azure Blob Storage, SaaS applications and web APIs such as Airship, Benchling, WooCommerce and HubSpot. Precog reduces development, integration, testing and compute times whilst improving accuracy. It empowers analysts with self-service access to tabular and non-tabular data to fulfil their own data requests.

Data

JSON data comes in many forms. Some of these are essentially tables and others are non-tabular data. Keshav Murthy's article uses the Causes of death dataset which is small and tabular. It is also available from CDC.gov as CSV.

In lieu of this, dataset we'll use the similarly grim adverse events related to drugs indicated for osteoarthritis dataset from the FDA. This data is non-tabular and more representative of data from web apis, internet of things devices, data lakes and applications. It is paginated and over a gigabyte in size. This dataset is made up of adverse reaction reports which exhibit differences in structure. Each report includes a list of medicines and a list of the adverse reactions.

Important disclaimer

This article uses data from openFDA which issues the following disclaimer.

Do not rely on openFDA to make decisions regarding medical care. Always speak to your health provider about the risks and benefits of FDA-regulated products.

As such you should not rely on this document to make decisions regarding medical care. Always speak to your health provider about the risks and benefits of products. Neither myself nor Precog are health providers or researchers.

Methodology

The first step to getting insight from non-tabular data, such as this dataset, is to tabulate it. This allows us to make charts, train and predict using machine learning models and visualise the data using software such as Tableau, Looker, ThoughtSpot and Power BI.

In this article, we'll explore three approaches to tabulating this JSON healthcare data; using Precog alone, using Precog with Snowflake and using Snowflake alone. These approaches can be applied to any non-tabular data such as apps backed by MongoDB, JSON data in S3, Azure Blob Storage, SaaS applications and web APIs such as Airship, Benchling and HubSpot.

Once the data is tabulated it can be easily:

  • Analysed with applications such as Tableau, Power BI, Looker and ThoughtSpot
  • Used to train and predict with machine learning models using software such as TensorFlow and DataRobot
  • Manipulated using standard SQL and loaded into databases or warehouses such as Postgres, Snowflake and Redshift

How to follow along

Precog Desktop and Snowflake are available as a free trials. PowerBI is available free as well.

If you get Precog Desktop and Snowflake you can follow along with the "Tabulating non-tabular data with Precog", "Tabulating non-tabular data without Precog" and "Analysing the tabulated data with SQL" sections.

If you get Precog Desktop and Power BI you can follow along with the "Tabulating non-tabular data with Precog" and "Analysing the tabulated data with Power BI" sections.

If you prefer a server or private cloud based solution, Docker, AMI, CloudFormation, JAR and Google Cloud based trials of Precog are also available. Please contact [email protected].

Next page: Tabulating non-tabular data with Precog