Assortment of Thoughts

On data processing – AWS Data Pipeline and .NET

Image courtesy: Vector Stock

Today, we are going to share an experiment we conducted. It was an interesting one, with parallels we had drawn along the way. Our experiment was to perform data processing on a sizeable amount of data. The processing involved a few typical operations (sketched right after the list) –

1. Looking up a field in an in-memory dictionary
2. Looking up a field over the internet
3. Concatenating a few strings to form a compounded column
4. Replacing a value with a right-sized bin value (i.e., replacing, say, a value of 104.87 with a label like 90-110)
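
To make that last operation concrete, here is a minimal C# sketch of one way such a bin replacement could be written; the bin width and lower bound are our own illustrative choices, not values from the actual delivery.

using System;

// Map a numeric value to a coarse bin label, e.g. 104.87 -> "90-110".
// Illustrative only: fixed-width bins of 20, starting at a lower bound of 10.
static string BinValue(double value)
{
    double lower = Math.Floor((value - 10) / 20) * 20 + 10;
    return $"{lower:0}-{lower + 20:0}";
}

Console.WriteLine(BinValue(104.87));   // prints "90-110"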

You would agree these are straightforward transformations. Nothing complicated about them, right? Yes, we agree too. Let us take a moment to give you the backstory of how we ended up running this experiment.

We had an amazing team of ETL experts who had planned a much-deserved break after a few stressful deliveries to round off the dreadful year of 2020. We had a few people who were new to the team stay available, just in case any queries happened to come up on the delivery. We also had another team wrapping up a lead with a customer before the Christmas holidays, so that next year's pipeline could be sized appropriately. This team had amazing experts with decades of experience in C# as their programming language.

One fine evening before Christmas Eve, the year-end watchmen (striking a chord with the night-watchman concept in Test cricket) from the pack of ETL experts started organizing an internal challenge of sorts. It was inspired by one of the deliveries made by the team a few quarters back: that delivery was the set of transformations we listed in the beginning. How this started is a floor mystery. However, it did us good and, in particular, kindled us to think in new dimensions. We will touch upon those dimensions a little later.

Now, this frenzied chatter on Slack had an eavesdropper from the pack of experts with C# as their mother tongue. They started advising the ETL pack to spend their time fruitfully by taking a break and participating in the Secret Santa organized by HR, where these folks were to think of good and obviously economical gifts (very much a demand from the client) for their ChrisMa and ChrisChild.

Now, this was the point when the year-end watchmen started to feel inferior and, along with fulfilling the wishes of these bullying C# experts, also started talking up the art of coding.

This was a new one for us. Typically we have heard friendly exchanges on stack superiority on the grounds of speed and size, i.e., how fast something could be performed and how small the code is; that David vs. Goliath kind of glorification where a small piece of code, like David, had hunted down a problem the size of Goliath. But this pack of ETL experts was different. They were talking of art, as in medieval times when intellectual superiority was expressed on canvases, walls and rocks!

Thus, we capitalized on the moment and let the packs present compelling arguments for an approach that would accomplish the goals, judged on admittedly very subjective terms. After a tough 2020, and the many number-crunching exercises we had engaged in for next year's pipeline projection, we decided to let ourselves unwind in a nerdy way.

Let the reasoning begin!

Round 1

First was the choice of stack, i.e., language and tools. Many of us believed this was a no-brainer: the ETL folks would pick either Informatica or SSIS, with Java as their programming language. However, to our surprise they opted for AWS Data Pipeline!

This was the curve ball that no one in the room expected. Why would one pass over time-tested tools for something that many were not even aware of? It was rumored that the ETL team was probably trying to learn Data Pipeline as an AWS service, and that was why they kept talking about it. But on reflection, they were not pitching for a project; rather, they were defending it as a candidate technology for the kind of transformation and loading that had been set as the goal.

Their argument for the choice was –

1. The pipeline definition is expressive for both machine and human
2. The pipeline's metadata is actually managed via HTTP verbs

How cool is that? We can manage the pipeline as if it were a REST document. Isn't that a neat application of the HTTP protocol? To put it in perspective, a pipeline definition will look like this in AWS Data Pipeline (shown here as a CloudFormation resource) –

 

{
  "Type": "AWS::DataPipeline::Pipeline",
  "Properties": {
    "Activate": …,
    "Description": "",
    "Name": "",
    "ParameterObjects": [ … ],
    "ParameterValues": [ … ],
    "PipelineObjects": [ … ],
    "PipelineTags": [ … ]
  }
}

 

One could easily challenge this: what is so special about this JSON? You could choose any other tool, one with even better visuals, that helps a person looking at the pipeline easily interpret the operations.

But here are the defense lines we loved. Infrastructure as Code is a concept no organization involved in software development can ignore. Add to it the flexibility of managing plain-text files in a DevOps pipeline: you could diff your pipelines as they evolve.

The next point presented by the team was the segregation of definition and compute infrastructure. Using AWS Data Pipeline, we have complete freedom in segregating these two very finely. The definition lives with the pipeline service itself, which exposes API interfaces for PUT, POST, DELETE and so on. For the platform that performs the operations, you have a choice between an EMR cluster and EC2, and these two resources by themselves impose very few constraints in terms of prerequisites.
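
To see what that API-driven management could look like from code, here is a rough sketch using the AWS SDK for .NET; the AWSSDK.DataPipeline package, the pipeline name and the empty object list are our assumptions for illustration, not the team's actual delivery.

using System.Collections.Generic;
using Amazon.DataPipeline;
using Amazon.DataPipeline.Model;

var client = new AmazonDataPipelineClient();

// Create the pipeline shell; UniqueId keeps retries from creating duplicates.
var created = await client.CreatePipelineAsync(new CreatePipelineRequest
{
    Name = "year-end-transformations",        // placeholder name
    UniqueId = "year-end-transformations-v1"
});

// Push the definition (the objects sketched in the JSON above) into the pipeline.
await client.PutPipelineDefinitionAsync(new PutPipelineDefinitionRequest
{
    PipelineId = created.PipelineId,
    PipelineObjects = new List<PipelineObject>()   // placeholder: real activities and resources go here
});

// Switch it on.
await client.ActivatePipelineAsync(new ActivatePipelineRequest { PipelineId = created.PipelineId });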

They led this point into the language question: choosing EC2 opens up virtually any programming language for programming the transformation. Here too, they preferred grounded shell scripting.

The folks from the C# camp leaned towards a simple in-memory processing approach right off the .NET framework. Had you predicted they would select Azure, let us break the stereotype: AWS does not equate to Java, nor Azure to .NET; the cloud in general allows you to use any programming language.

That said, the choice they exercised is the Task Parallel Library's (TPL) Dataflow package. This is a curious one, as it is not a platform, nor a service of sorts like AWS Data Pipeline. They leveraged the point that the goal is about transformation, not about source or, for that matter, destination integration. Their claim was that this approach could easily be deployed in any data pipeline processing setup as a console application. They went on to assert that AWS Data Pipeline itself could be used as the host for the transformation logic they were going to create.

For those who are not yet exposed to the package and library, don't hesitate to peek into the docs and guides hosted on the MS Docs website.

Given they selected the TPL Dataflow package, it is natural that they would go with C# as the programming language. But since it is .NET, other languages could be used as well.

On a point-to-point comparison, the benefit this approach puts on the table is flexibility to run on any platform (thus deeming Infrastructure as Code a non-relevant parameter for this goal) and the expressiveness of the C# language to give insight into the transformation. Needless to say, the parallelism can be manifested very easily within a compute unit, i.e., on the same processor.
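
Before looking at how the blocks get linked, it helps to see what the individual blocks might look like. The sketch below is ours, assuming the System.Threading.Tasks.Dataflow NuGet package; the Row type, the lookup table, the URL and the block bodies are placeholders, and only the block names match the snippet the team showed next.

using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks.Dataflow;

var customerNames = new Dictionary<string, string> { ["C042"] = "Acme Corp" };  // placeholder lookup table
var http = new HttpClient();

// Parallelism within a single compute unit: up to four rows in flight per block.
var parallel = new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 };

// Intake: parse a raw comma-separated line into a Row.
var dataIntake = new TransformBlock<string, Row>(line =>
{
    var parts = line.Split(',');
    return new Row { CustomerCode = parts[0], CountryCode = parts[1], Weight = double.Parse(parts[2]) };
}, parallel);

// 1. Look up a field in an in-memory dictionary.
var inMemoryLookup = new TransformBlock<Row, Row>(r =>
{
    r.CustomerName = customerNames[r.CustomerCode];
    return r;
}, parallel);

// 2. Look up a field over the internet; the async lambda keeps the block from blocking a thread.
var networkLookup = new TransformBlock<Row, Row>(async r =>
{
    r.Country = await http.GetStringAsync($"https://example.org/countries/{r.CountryCode}");  // placeholder URL
    return r;
}, parallel);

// 3. Concatenate a few strings into a compounded column.
var computedColumn = new TransformBlock<Row, Row>(r => { r.Label = $"{r.CustomerName} ({r.Country})"; return r; }, parallel);

// 4. Replace a value with a bin label (a crude two-bin stand-in for the helper sketched near the top of the post).
var binValues = new TransformBlock<Row, Row>(r => { r.WeightBin = r.Weight < 90 ? "70-90" : "90-110"; return r; }, parallel);

class Row
{
    public string CustomerCode = "", CountryCode = "";
    public string? CustomerName, Country, Label, WeightBin;
    public double Weight;
}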

To convince the audience, they showed a peek of a pipeline definition in the TPL Dataflow approach, linking the blocks that will process these transformations –

 

// Link the blocks into a straight line; PropagateCompletion lets Complete() flow downstream.
var linkOptions = new DataflowLinkOptions { PropagateCompletion = true };

dataIntake.LinkTo(inMemoryLookup, linkOptions);
inMemoryLookup.LinkTo(networkLookup, linkOptions);
networkLookup.LinkTo(computedColumn, linkOptions);
computedColumn.LinkTo(binValues, linkOptions);
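
Hosting this in a console application, as the pack claimed was possible, then boils down to feeding the head block, signalling completion and awaiting the tail. Continuing our placeholder sketch from above (with using System and System.IO added), it could look roughly like this; the sink block and input file are our own additions.

// Terminal block so the pipeline drains somewhere; here we simply print each transformed row.
var sink = new ActionBlock<Row>(r => Console.WriteLine($"{r.Label},{r.WeightBin}"));
binValues.LinkTo(sink, new DataflowLinkOptions { PropagateCompletion = true });

foreach (var line in File.ReadLines("input.csv"))   // placeholder input file
    dataIntake.Post(line);

dataIntake.Complete();    // no more input
await sink.Completion;    // wait until every row has flowed through all four transformations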

 

This, to us, looked more linear in nature, and since it is code exposed directly, we started to wonder what shape it would take if it had to branch; as you might already see, it could get unwieldy.
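
To illustrate that worry: conditional routing in TPL Dataflow typically leans on the predicate overload of LinkTo, and every extra branch is another link plus a predicate to keep consistent. A hedged sketch, reusing our placeholder blocks and the linkOptions from the snippet above:

// Route out-of-range rows down a separate path; everything else continues as before
// (this replaces the straight networkLookup -> computedColumn link from the earlier snippet).
var outliers = new ActionBlock<Row>(r => Console.WriteLine($"outlier: {r.CustomerCode}"));
networkLookup.LinkTo(outliers, linkOptions, r => r.Weight > 1000);
networkLookup.LinkTo(computedColumn, linkOptions, r => r.Weight <= 1000);

// Pitfall: a row matching neither predicate sits in the source's output buffer forever,
// which is exactly the kind of bookkeeping that makes branching in code feel unwieldy.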

At the end of Round 1 of the reasoning, the choice of AWS Data Pipeline seemed to have more appeal, though it is not right-sized for this petty transformation activity. However, never underestimate raw coding capability: there is always room to improve beyond imagination and make it shine, particularly when we had seasoned practitioners of C# in that pack.

We shall now take a break before we engage in our live coding session, challengers first, with live reasoning to follow. We are excited to share our thoughts from that session in our next dispatch.