Dataframe transformation

 

Pandas in Python
Dataframe is the foundation stone of data wrangling. It is the humble two-dimensional representation of
data that interests you. Oftentimes the aesthetics that appeal to humans come on the way to data
representation. This causes the many lines of code that deal with transforming data from one way of
representation to another. Since such transformations deal with aesthetics often data engineers overlook
them. It often proves fatal for budding data engineers. Today, we will look at one such ask without the data
filled in the dataframe to keep it simple.
We were piloting an IoT project where the field gateway will collect data and send us a de-normalised
record. Let us remind ourselves that IoT gateway will always send data in a de-normalised form or in better
words the form that consumes the least energy available. Transformations or looking etc. are the activities
that low-energy devices which connect to the IoT gateway shy away from. They prefer to use the energy to
get the reading and shipping that over waves to the gateway. Without wearing you down on energy let us
just say that the IoT project was to monitor the water level in catchment areas. The data sent over by the
gateway would look like this –
Metric SampleNature Measure
Rainfall Simple1 X
Inflow Simple1 Y
SWT Simple1 Z
Evaporation Simple1 A
Rainfall StormWaterFlow B
Inflow StormWaterFlow C
Loss StormWaterFlow D
RelatedSample Secondary Simple1
Loss Secondary E
Inflow Secondary F
Outflow Secondary G
As you can see values are repeating themselves in the column called SampleNature. This data was
received in a comma-separated values format. Processing them to do the analytics was straightforward.
One of the frequent things we needed was to filter the records and transform the Metric to column headers.
It would look like this –
hydrologyData = pd.read_csv(‘raw_data_20221215.csv’)
hydrologyData[hydrologyData[‘SampleNature’] == ‘Simple1’].pivot(index = ‘SampleNature’, columns =
‘Metric’)[[‘Value’]]
hydrologyData

We first load the data from the CSV file and then filter the records that have SampleNature as Simple1 by
doing
hydrologyData[hydrologyData[‘SampleNature’] == ‘Simple1’]
We then pipe the outcome of this operation to do pivoting. It is the most common operation one would do in
Excel. With the pandas as the framework for data wrangling, we do that using the statement –
.pivot(index = ‘SampleNature’, columns = ‘Metric’)[[‘Value’]]

The first parameter to the call indicates what should be the row index and the second parameter which is
called columns indicate what should be the column headers as we want. The last square bracketed term
indicates the values that should be copied over from the source data frame. Thus, we would get a data
frame that looks like this –

Value
Metric Evaporation Inflow Rainfall SWT
SampleNature
Simple1 A Y X Z
This is great as far as the count is just 1 record. But in the real world, this would flow over and if we stick to
the above transformation, we would end up getting an error stating the index is duplicated. It is natural for
the framework to throw that error since it does not know how the records are constituted. i.e., how many
rows in the source translate to a row? If we skip stating the index we will get a rather unpleasant
representation –
Value
Metric Evaporation Inflow Rainfall SWT
0 A NaN NaN NaN
1 NaN Y NaN NaN
2 NaN NaN X NaN
3 NaN NaN NaN Z
We hid a fact and it is time we reveal it to keep things simple and not go into heuristics for determining the
row index. We had a timestamp column which will be the same only for a few metrics on the source
dataset. We used that column as the index –
hydrologyData[hydrologyData[‘SampleNature’] == ‘Simple1’].pivot(index = ‘timestamp’, columns =
‘Metric’)[[‘Value’]]
Now we did not get an error instead we have a beautiful-looking dataset that could be printed directly to a
report –

Value
Metric Evaporation Inflow Rainfall SWT
Timestamp
2022-12-
14T00:05:00

A Y X Z

2022-12-
14T00:10:00

A1 Y1 X1 Z1

Needless to say, we masked the values here but you got the drift, isn’t it? One of the hallmarks of a data
engineer is to be able to do such transformation as if it is second nature. Some other tools that could be
used in place of the pivot are the functions like mask, fillna, MultiIndex and many more. Data engineer must
always be creative and find the path that costs them the least and documents the intent without having to

write comments near the code. Much of the creativity that a data engineer emanates will be from the
context in which such transformations are needed. We leave you with a puzzle on how the aforesaid
transformation could be achieved by the combination of mask, fillna, MultiIndex.