Monday, June 21, 2021

Using REST API to Get JSON Data Into SQL Server

JSON via REST API (aka RESTful API) is becoming a common offering from cloud providers for clients to access their data. The JSON format has the advantage of being self-defining: arrays, objects, and attributes all carry their own names. However, it also has the disadvantage of being extremely verbose, with every field in every record carrying its own definition. It's not uncommon for each HTTP call to take 100-200ms, which means sequential calls max out at 5-10 per second. That is problematic for systems (retail, for example) that generate hundreds or thousands of transactions per second. A single REST API call will often return multiple records, but the format is still very verbose. Compare this to a compressed file or, even faster, a direct database connection, where data is kept in native formats such as binary and decimal. Direct database connections carry very little metadata, which keeps the data compact and means little transformation is required on import.

Having said that, there is nothing to discuss if the cloud vendor only supports a REST API. Fortunately, this is relatively simple in SSIS, though it does require an SSIS Script task. The Script task can act as a data source in a Data Flow, or it can be a standalone task that also does the inserts itself. I like using a Data Flow, although it's difficult to output columns of varchar(max); a workaround is to break the output record into multiple 8000-byte columns. That isn't a very elegant solution, so you may instead want to write the JSON object directly to a staging database from the Script task. You could unpack the JSON objects in the Script task itself, but SQL Server has good JSON functions, and I prefer that Script tasks do as little as possible, as they are "hidden" within the SSIS package. So all the Script task needs to do is get the data and write the outer JSON object (possibly containing several records) to a staging database, looping through the pages if the HTTP response is paged. A subsequent SQL proc can then unpack and load the data into the consolidated SQL database.
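As a sketch of the staging side (the schema, table, and column names here are hypothetical, not from any particular system), all the Script task has to target is something like:

    -- Hypothetical staging table: one row per page of the HTTP response.
    -- The Script task inserts the raw JSON body; a later proc unpacks it.
    CREATE TABLE staging.RestApiJson
    (
        RestApiJsonId  int           IDENTITY(1,1) PRIMARY KEY,
        SourceName     varchar(100)  NOT NULL,  -- which endpoint/feed the page came from
        PageNumber     int           NOT NULL,  -- page within a paged response
        JsonBody       nvarchar(max) NOT NULL,  -- the outer JSON object, stored as-is
        LoadedAtUtc    datetime2(3)  NOT NULL DEFAULT SYSUTCDATETIME()
    );

Keeping the whole body in one nvarchar(max) column also sidesteps the 8000-byte workaround, because the value never has to pass through Data Flow output columns at all.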

Unpacking the JSON data is relatively easy with SQL Server's JSON functions, OPENJSON, JSON_VALUE, and JSON_QUERY, combined with CROSS APPLY. Essentially, every array (defined by square brackets) becomes an SQL table, so an array within an array becomes a dependent (child) table. Each JSON attribute becomes an SQL column, and each nested JSON object (defined by curly brackets) flattens into columns whose names inherit the name of the object.
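For example, here is a minimal sketch against a hypothetical order payload (all names invented), which needs SQL Server 2016 or later. The outer array becomes parent rows, the nested array becomes child rows via CROSS APPLY, and the nested customer object flattens into columns:

    -- In practice @json would come from staging (e.g. staging.RestApiJson.JsonBody).
    DECLARE @json nvarchar(max) = N'{
      "orders": [
        { "orderId": 1,
          "customer": { "id": 7, "name": "Acme" },
          "lines": [ { "sku": "A1", "qty": 2 }, { "sku": "B2", "qty": 1 } ] }
      ]
    }';

    SELECT o.orderId,
           o.customerId,   -- flattened from the "customer" object
           o.customerName,
           l.sku,          -- one row per element of the "lines" array
           l.qty
    FROM OPENJSON(@json, '$.orders')
         WITH (
             orderId      int           '$.orderId',
             customerId   int           '$.customer.id',
             customerName nvarchar(100) '$.customer.name',
             lines        nvarchar(max) '$.lines' AS JSON  -- keep the child array as JSON
         ) AS o
    CROSS APPLY OPENJSON(o.lines)
         WITH (
             sku nvarchar(20) '$.sku',
             qty int          '$.qty'
         ) AS l;

In a real load, one pass of this query would populate the parent orders table and another the child lines table, each keyed by whatever primary keys the provider documents.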

A good REST API cloud provider will:

  • Make it clear what the primary keys are for all arrays (since they become tables).
  • Make it clear what the data types are for all attributes. Data types are not part of the JSON definition, and if you guess wrong, it will bite you when a value arrives that is out of bounds.
  • Give you the option of filtering records on last updated/inserted datetimes (or high water marks of some other kind); otherwise you have to download and unpack all records just to find out what has changed or been added. A sketch of this incremental pattern follows this list.
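As a sketch of that last point (the control table, feed name, and "updatedSince" parameter are all hypothetical; the actual filter name varies by provider):

    -- Hypothetical control table tracking the last loaded "updated" timestamp per feed.
    DECLARE @highWaterMark datetime2(3);

    SELECT @highWaterMark = LastUpdatedUtc
    FROM etl.FeedHighWaterMark
    WHERE FeedName = 'orders';

    -- The Script task appends it to the request, e.g.
    -- https://api.example.com/orders?updatedSince=2021-06-20T00:00:00Z

    -- After a successful load, advance the mark from the staged JSON.
    UPDATE hwm
    SET LastUpdatedUtc =
        (SELECT MAX(CONVERT(datetime2(3), JSON_VALUE(j.value, '$.lastUpdated'), 127))
         FROM staging.RestApiJson AS s
         CROSS APPLY OPENJSON(s.JsonBody, '$.orders') AS j
         WHERE s.SourceName = 'orders')
    FROM etl.FeedHighWaterMark AS hwm
    WHERE hwm.FeedName = 'orders';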

There are several vendors offering tools that assist SSIS packages in making REST API calls. They can help, but in my experience they add another layer of abstraction, plus complications in upgrading and in setting up test servers. So, I prefer to avoid them where I can and simply write a few tailored lines of script code.
