On data processing – AWS Data Pipeline and .NET
Round 1 of reasoning by both the packs was interesting. It made us think deeply about the choices and introspect what each of them has to offer. Moving swiftly to Round 2, we reconvened on a wet Thursday morning with our own tea cups and coffee mugs.
Let us replay the goal here to keep our perspective intact – offer a solution for how to handle the transformation processing described below.
Round 2
We had our pack of experts with an ETL background demonstrate how they would build the transformation activities using AWS Data Pipeline. To set the stage, they started with a background on the important entities of AWS Data Pipeline.
Pipeline Definition – a JSON file that describes the activities involved in the pipeline. This was the concept the team pitched heavily in Round 1, since it makes IaC easy and gives a comfortable way to define the sequence of activities.
Activities – the finite set of allowed activity types that can be included in a Data Pipeline. These include CopyActivity, EmrActivity, SqlActivity, ShellCommandActivity and many more.
Resources – the compute instances on which the pipeline runs. The pipeline definition is interpreted and translated into actions on these resources. This is limited to EC2 instances and EMR clusters.
Actions – fire-and-forget steps that let external observers know if something has gone wrong with the pipeline execution. The Terminate and SnsAlarm actions convey information about such events.
I am sure there are a few more, but for the activities we intend to perform, this much conceptual ground is good enough. There is also the concept of a Task Runner on a resource, but that is for more specialized scenarios where the resources are located on-premises or we intend to customize the flow logic.
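To make these entities concrete, here is a small pipeline-definition fragment of our own sketching (not the team's definition) that ties the three together: a ShellCommandActivity that runs on an Ec2Resource and raises an SnsAlarm action if it fails. The object names, the command and the topic ARN are placeholders we made up for illustration.

{
  "objects": [
    {
      "id": "ComputeEc2",
      "name": "ComputeEc2",
      "type": "Ec2Resource",
      "instanceType": "t2.medium",
      "terminateAfter": "120 Minutes"
    },
    {
      "id": "TransformRecords",
      "name": "TransformRecords",
      "type": "ShellCommandActivity",
      "command": "echo 'placeholder for the transformation script'",
      "runsOn": { "ref": "ComputeEc2" },
      "onFail": { "ref": "FailureAlarm" }
    },
    {
      "id": "FailureAlarm",
      "name": "FailureAlarm",
      "type": "SnsAlarm",
      "topicArn": "arn:aws:sns:us-east-1:111111111111:pipeline-alerts",
      "subject": "Pipeline activity failed",
      "message": "The transformation activity reported a failure.",
      "role": "DataPipelineDefaultRole"
    }
  ]
}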
Let us begin with a few assumptions. An incoming record looks like this –
{
  "record": "bbd948b5-a097-4415-9992-05849c76eac6",
  "lat": 12.9972222,
  "long": 80.2569444,
  "duration_stop": 300000,
  "time_of_day": 1611379865000,
  "day": 20210105,
  "week_of_day": "Tuesday",
  "day_time_print": "05-January-2021 11:01:05 AM",
  "temperature": 23,
  "temp_unit": "Celsius"
}
Though temperature is displayed as part of the record, it is filled in by looking up a service with the lat and long parameters. Similarly, the property "day_time_print" will be computed from "day" and "time_of_day".
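As a quick aside, deriving "day_time_print" is mostly a matter of converting the epoch-millisecond "time_of_day" into a readable timestamp. A minimal sketch of that conversion, assuming GNU date is available on the compute resource (the variable name is ours):

# time_of_day arrives as epoch milliseconds; drop the milliseconds to get epoch seconds
TIME_OF_DAY=1611379865000
date -u -d "@$((TIME_OF_DAY / 1000))" "+%d-%B-%Y %I:%M:%S %p"

The temperature lookup by lat and long stays with whichever external service is chosen, so we leave it out of the sketch.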
{"record": …, "lat": …, …}
{"record": …, "lat": …, …}
{"record": …, "lat": …, …}
{"record": …, "lat": …, …}
{"record": …, "lat": …, …}
With these assumptions in place, the transformations will take the following shape –
The pack of ETL experts then started defining the pipeline. The following is the pipeline definition skeleton they created to start with.
{
  "objects": […],
  "parameters": […],
  "values": {…}
}
Here, we did not fill in the dots; that is how the team left it. They then immediately tested their client configuration for connecting to AWS by running the following command in the CLI –
aws datapipeline create-pipeline --name "drive-stop-location-processing" --unique-id "dp-dslp-d86e9710"
This yielded a response like this –
{
  "pipelineId": "df-08602529RTPD0169MTB"
}
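As a quick sanity check of our own (not part of the team's walkthrough), the freshly created container should also show up when listing the pipelines in the account:

aws datapipeline list-pipelines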
Here the pipeline container (not the Docker kind of container; the word is used in its plain English sense) is created to hold the definition, and the following command pushes the pipeline definition to it –
aws datapipeline put-pipeline-definition --pipeline-id "df-08602529RTPD0169MTB" --pipeline-definition "file://C:/Users/Document/drive-stop-location-pipeline.json"
This one did not run that well; after all, the JSON is not well formed, is it? We noticed the error –
Expecting value: line 2 column 15 (char 16)
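A cheap way to catch this class of error before talking to AWS at all is to push the file through a local JSON parser; once it parses cleanly, the definition can also be checked against Data Pipeline's own rules. Both commands below are our suggestion rather than something the team ran –

python -m json.tool "C:/Users/Document/drive-stop-location-pipeline.json"
aws datapipeline validate-pipeline-definition --pipeline-id "df-08602529RTPD0169MTB" --pipeline-definition "file://C:/Users/Document/drive-stop-location-pipeline.json"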
Next, we have to start defining the compute environment we will use to perform the operations. Since the record count is sizeable, but not so large that we would need superior horsepower, we will use a medium-sized EC2 instance as the resource.
As we do that, we want to highlight that this is where we could specialize if a particular kind of EC2 instance is required, since we can also specify the AMI ID for the instance. The AMI could be one from the marketplace, or one that a business has created with specialized tools installed; a sketch of such an override follows the definition below. We will stick to the default and define the resource as follows –
{
  "objects": [
    {
      "resourceRole": "DataPipelineDefaultResourceRole",
      "role": "DataPipelineDefaultRole",
      "instanceType": "t2.medium",
      "name": "ComputeEc2",
      "id": "ComputeEc2",
      "type": "Ec2Resource",
      "terminateAfter": "120 Minutes"
    }
  ],
  "parameters": […],
  "values": {…}
}
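For the AMI specialization mentioned above, the only change would be pinning the resource to a specific image via the imageId field of the Ec2Resource; the AMI ID below is a made-up placeholder, and a custom AMI has to meet Data Pipeline's prerequisites for running tasks.

{
  "resourceRole": "DataPipelineDefaultResourceRole",
  "role": "DataPipelineDefaultRole",
  "instanceType": "t2.medium",
  "imageId": "ami-0abcdef1234567890",
  "name": "ComputeEc2",
  "id": "ComputeEc2",
  "type": "Ec2Resource",
  "terminateAfter": "120 Minutes"
}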
There are a few weeny bits of detail to be highlighted. Let us get those out of the way.
While the pack was narrating this, two things popped into our heads –
The team steadily progressed, highlighting the next two important elements in the definition, namely the roles.
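If you want to peek at what those two roles actually grant, the IAM CLI can show them; the commands below are our own suggestion and assume the default roles already exist in the account.

aws iam get-role --role-name DataPipelineDefaultRole
aws iam get-role --role-name DataPipelineDefaultResourceRole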
We will part now, and return to talk further about how the definition shaped up, the corresponding shell code, and the next curve ball that the ETL pack threw at their audience. Till then, happy coding.
Picture courtesy- Px Here