Data as a service: where we are and where we're headed

February 1, 2021

I recently saw a stat that over 300 BILLION PDFs were created in 2020, a significant increase over the previous year. And you know what? I wasn't surprised to see that. We work with some of the largest organizations in the world. Orgs in healthcare, pharma. Orgs in Europe, Africa, North America. And you know the most popular document type they use to send me data? A PDF. That's right, the trusty, old, and ubiquitous PDF.

One thing we do for many of our customers and partners is help convert PDF content (and Excel files and CSVs and Access databases, and I could go on and on about the variety of formats we've seen) into something more database friendly. The value here is that these rich datasets can now be analyzed. And searched. And integrated into their standard workflow.

Then it hit us a few weeks ago. What we're really providing is data-as-a-service (DaaS). Much like its older sibling software-as-a-service (SaaS), DaaS is the idea that people need data that can be easily consumed. Or integrated into their internal systems. Or visualized. But they don't want to have to worry about managing the datasets. Or keeping them up to date. "Just give me the data" is what they say. There is an immense amount of quality data just sitting inside of PDFs, locked away and hard to find.

Case Study: Georgia County Vaccination Orders List

Every 7 to 14 days, the state of Georgia releases the Vaccine Orders List (https://dph.georgia.gov/document/document/vaccine-orders-list/download). This is a breakdown of which providers have requested which vaccines and how many they were allocated. In the age of COVID and data, this is a *really* rich dataset.

But how is the state releasing the data? As a GitHub repo? HA. An Excel file? Nope. A CSV? I wish. They're releasing it as a PDF.

So we did for this file what we do for our customers every day: we created a routine that automatically checks for new files, extracts the content, and loads it into our data platform. This may sound straightforward enough, but the value from this is immense. We're now serving this data out in three primary formats:

API: you can now integrate this content into your own system. Want to compare counties asking for vaccines with their current disease burden? You can!  

Excel: automatically receive the data in an Excel file in your inbox. Want to pivot this data? Now you can!

Dashboard: maybe you're a boss person, too busy to pivot or do integrations. And maybe you just need a dashboard with the data? Now you can!  
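
For the curious, here is a rough idea of what a "check for new files, extract, load" routine can look like. This is an illustrative sketch only, not the actual routine: it assumes the PDF has already been downloaded and its text pulled out by a PDF library, and the function names and the four-column row layout (county, provider, vaccine, doses, separated by two or more spaces) are hypothetical. The real file's layout may differ.

```python
import hashlib
import re


def fingerprint(pdf_bytes: bytes) -> str:
    """Return a SHA-256 hex digest that identifies one exact version of the file."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def is_new_release(pdf_bytes: bytes, seen: set) -> bool:
    """Report whether this file is unseen, recording its fingerprint if so."""
    digest = fingerprint(pdf_bytes)
    if digest in seen:
        return False
    seen.add(digest)
    return True


# Hypothetical row layout: columns separated by two or more spaces,
# ending in a whole-number dose count (commas allowed).
ROW = re.compile(
    r"^(?P<county>.+?)\s{2,}(?P<provider>.+?)\s{2,}"
    r"(?P<vaccine>.+?)\s{2,}(?P<doses>[\d,]+)$"
)


def parse_rows(text: str) -> list:
    """Turn extracted PDF text into a list of dict records, one per data row."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match is None:
            continue  # skip headers, blank lines, and anything that isn't a data row
        row = match.groupdict()
        row["doses"] = int(row["doses"].replace(",", ""))
        rows.append(row)
    return rows
```

Hashing the file contents rather than trusting the filename matters here because the download URL stays the same from release to release. Once the rows are plain records, loading them into a database or serving them from an API is the easy part.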

With this one single PDF, we're able to build maps and perform fuzzy logic searches and track trends over time. PDFs hold an immense amount of data. It's just time to unlock that potential.
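
To make the search part concrete: assuming "fuzzy search" here means approximate string matching (so a typo in a provider name still finds the right row), a minimal sketch using Python's standard-library difflib might look like this. The function name and the default cutoff are illustrative assumptions, not a description of our platform.

```python
import difflib


def fuzzy_find(query: str, names: list, limit: int = 3, cutoff: float = 0.6) -> list:
    """Return up to `limit` names that approximately match `query`, best match first."""
    # Compare in lowercase so capitalization doesn't count as a difference,
    # then map each hit back to its original spelling.
    lowered = {name.lower(): name for name in names}
    hits = difflib.get_close_matches(
        query.lower(), list(lowered), n=limit, cutoff=cutoff
    )
    return [lowered[hit] for hit in hits]
```

Called with a misspelled provider name and the list of names pulled from the PDF, this returns the closest real names. Raising the cutoff makes matching stricter; lowering it is more forgiving but noisier.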