
Monday, June 21, 2021

Using REST API to Get JSON Data Into SQL Server

JSON via REST API (aka RESTful API) is becoming a common offering from cloud providers for their clients to access data. The JSON format has the advantage of self-defining arrays, sets, and attributes. However, it also has the disadvantage of being extremely verbose, with every field in every record carrying its own definition. It's not uncommon for each http call to take 100-200ms, which means you will max out at 5-10 calls per second. This is problematic for systems (retail, for example) where there are hundreds or thousands of transactions per second. Note, it's not unusual to have multiple records per REST API call, but the format is still very verbose. Compare this to a compressed file or, even faster, a direct database connection, where data is kept in native data types, such as binary and decimal. Direct database connections carry very little metadata, which keeps the data compact and means little transformation is required on import.

Having said that, there is no choice in the matter if the cloud vendor only supports a REST API. Fortunately, this is relatively simple in SSIS, though it does require an SSIS Script task. The Script task could be a data source in a Data Flow, or it could be a Script task that also performs the inserts itself. I like using a Data Flow, although it's difficult to output columns of Varchar(max); a workaround is to break the output record into multiple 8000-byte columns. This isn't a very elegant solution, so you may prefer to write the JSON object directly to a staging database from the Script task. You could unpack the JSON objects in the Script task itself, but SQL has excellent JSON functions, and I prefer Script tasks to do as little as possible, as they are "hidden" within the SSIS package. So, all the Script task needs to do is get the data and write the outer JSON object (which may contain several records) to a staging database, looping through the pages if the http response is paged. A subsequent SQL proc can then unpack and load the data into the consolidated SQL database.

Unpacking the JSON data is relatively easy with SQL's JSON functions, JSON_VALUE and JSON_QUERY, together with CROSS APPLY. Essentially, every array (defined by square brackets) becomes an SQL table, so each array within an array becomes a dependent (child) table. Each JSON attribute becomes an SQL column, and each JSON set (defined by curly brackets) becomes a group of columns whose names inherit the name of the set.
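
As a sketch of that mapping (the staging table dbo.StageJSON, the column names, and the JSON shape are all hypothetical, and OPENJSON requires SQL Server 2016+), an outer object with an inner array unpacks like this: the outer object becomes one parent row, the inner array becomes child rows, and the nested "customer" set flattens into prefixed columns.

```sql
-- Hypothetical JSON in dbo.StageJSON.JSONRecord:
--   {"id":1,"customer":{"name":"Acme"},"lines":[{"sku":"A1","qty":2},{"sku":"B7","qty":1}]}
SELECT
    JSON_VALUE(s.JSONRecord, '$.id')            AS OrderId,
    JSON_VALUE(s.JSONRecord, '$.customer.name') AS CustomerName,  -- set -> column, name inherited
    JSON_VALUE(l.value, '$.sku')                AS Sku,           -- array element -> child row
    JSON_VALUE(l.value, '$.qty')                AS Qty
FROM dbo.StageJSON AS s
CROSS APPLY OPENJSON(s.JSONRecord, '$.lines') AS l;               -- array -> child table
```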

A good REST API cloud provider will 

  • Make it clear what the primary keys are for all arrays (as they become tables). 
  • Make it clear what the data types are for all attributes. That's not part of the JSON definition, and if you get it wrong, it will bite you when an attribute arrives that is out of bounds.
  • Give you the option of filtering records based on last updated/inserted datetimes (or high water marks of some other kind). Otherwise, you will have to download and unpack all records to find out what's changed or added.
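
For the last point, a common pattern (sketched here with hypothetical table and column names) is to keep a high water mark in SQL and pass it into the parameterised REST call as a filter, so a rerun automatically continues rather than re-downloading everything:

```sql
-- Hypothetical: derive the high water mark from what has already been staged.
DECLARE @HighWaterMark DATETIME2 =
    (SELECT ISNULL(MAX(LastUpdated), '1900-01-01') FROM dbo.StagedOrders);

-- The value is passed to the SSIS Script task as a variable and appended
-- to the url, e.g. ?updatedSince=2021-06-01T00:00:00
SELECT @HighWaterMark AS UpdatedSince;
```
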
There are several vendors that offer tools to assist SSIS packages in making REST API calls. They can assist, but, in my experience, they add another layer of abstraction, plus complications when upgrading and when setting up test servers. So I prefer to avoid them where I can and simply write a few tailored lines of script code.

Saturday, December 28, 2019

Getting data from the cloud using SSIS

So much of our source data is now in the cloud. We need to continually extract it for our Business Intelligence solutions. Unfortunately, these cloud providers tend not to offer database connections for extracting data. However, it's not too difficult, using SSIS, to script an object as a source for an SSIS Data Flow.
There are some challenges extracting cloud data:

  • Every cloud provider has its own way of authenticating (certificates, userid/password, token in https header etc). 
  • Each provider has its own way of paging through or filtering data.
  • Data types are not guaranteed.
  • Column names (attributes) are self-defining.
However, using the WebClient object, we can write a little code and get the JSON records passed back to the SSIS data flow. JSON is the most popular interchange format, and it can be handled natively within SQL Server with its JSON functions (ISJSON, JSON_VALUE, JSON_QUERY).
Data Flow

What I have found fruitful is to break the http response into data records that go straight into a staging database, with no data typing at this stage. The staging database can then be queried with ISJSON() and JSON_VALUE() to populate the data warehouse. Only a small amount of code is required to convert the JSON into datatyped columns ready for loading, and there is no need for the added complexity and fragility of extracting to flat files.
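
A sketch of that conversion step, with hypothetical table and column names; ISJSON() guards against any partial or malformed record that slipped into staging, and the CAST is where data typing finally happens:

```sql
INSERT INTO dw.Customer (CustomerId, CustomerName, Email)
SELECT
    CAST(JSON_VALUE(JSONRecord, '$.id') AS INT),  -- datatyping happens here
    JSON_VALUE(JSONRecord, '$.name'),
    JSON_VALUE(JSONRecord, '$.email')
FROM stg.CustomerJSON
WHERE ISJSON(JSONRecord) = 1;                     -- skip malformed records
```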

Here is some very simple VB code to download records, using the https header to authenticate. "{""id"":" identifies the start of a new record for this provider. Note that the https url will typically be parameterised to include filters etc. from the SSIS control flow. MyOutputBuffer is the name of the output defined on the script source object (see below); the actual name is "MyOutput", with "Buffer" added by SSIS. In this case, I have defined one column in the output, called JSONRecord, as a string with a length of 8000.

So, while it is not as simple as a database connection, it is not complicated and in my experience it is reliable. As always, I encourage defensive programming, so that if a connection or task is broken/unavailable, at any time, the next time the job starts it will automatically continue from where it left off.

    Public Overrides Sub CreateNewOutputRows()
        ' Declarations must precede first use.
        Dim sInputBuffer As String
        Dim sJSONRecord As String
        Dim iEnd As Int32
        Dim iStart As Int32

        ' Download the full https response, authenticating via headers.
        Using client2 As New Net.WebClient
            client2.Headers.Add("Authorization: bla bla bla")
            client2.Headers.Add("Content-Type: application/json")
            sInputBuffer = client2.DownloadString("https://myCloudProvider.com")
        End Using

        ' "{""id"":" marks the start of each record for this provider.
        iStart = InStr(sInputBuffer, "{""id"":")

        While iStart > 0
            ' Look for the start of the next record, searching past the current one.
            iEnd = InStr(iStart + 1, sInputBuffer, "{""id"":") - 1
            If iEnd > 0 Then
                sJSONRecord = Mid(sInputBuffer, iStart, iEnd - iStart + 1)
                sInputBuffer = Mid(sInputBuffer, iEnd + 1)
            Else
                ' Last record in the buffer.
                sJSONRecord = Mid(sInputBuffer, iStart)
                sInputBuffer = ""
            End If
            ' Pass the record to the data flow via the script output.
            With MyOutputBuffer
                .AddRow()
                .JSONRecord = sJSONRecord
            End With
            iStart = InStr(sInputBuffer, "{""id"":")
        End While
    End Sub

The example below has four entities (tables) extracted from the cloud. The fourth one is iterative, as it must call for a set of records for each row of another entity.
Control Flow
Output Properties