A book is more than what it is (Part IV)

Exploratory analysis with Python

Recap

So far we have read the text out of PDF files and run PoS tagging on it. Along the way we also experimented with whether it makes a difference to generate PoS tags from the entire corpus versus a section of the text. The outcome: there is a difference, and it is more reliable to generate PoS tags using the entire corpus. In this dispatch we tread the journey of loading this data into Cosmos DB.

Principle

So far we have read the data and held it in the memory of the process. A common risk with this approach is that we must re-run the entire process even when we only want to inspect a small aspect of it. This quickly grows painful in the early stages of experimentation: there will be many iterations, and running the whole process after every few changes becomes counterproductive. Thus, it is prudent to create an exit ramp in experimental processes. The exit ramp is essentially saving the intermediate data to disk.

Hey, but wait, are we not doing that with Cosmos DB? Yes, certainly. But you will agree the cost of writing code to save the data to CSV is far cheaper than persisting the data to Cosmos DB. Cosmos DB is a great destination for analysing the data, but it is not a good low-cost destination for intermediate storage.

So, the principle we want to highlight here is: create enough exit ramps in your operations to aid analysis and to let you resume the experimentation process from an intermediate point, as many times as needed.

Creating the exit ramp

Let us start with the simplest, most basic approach to writing CSV, namely –

print("column1 value, column2 value, column3 value\n")

 

Yes, it is that simple. But there are so many small things that will look like rocks if not managed well, e.g. the comma itself being present in the text; to stay CSV-safe we would have to escape such a field by wrapping it in double quotes. But we can take an alternate route: use a library that is known for such processing and is commonly used, namely pandas. We do that with

   

pip install pandas

 

By using pandas, we first store the data in a DataFrame and then call pandas' to_csv function, which handles the quoting and escaping for us, to save the data to CSV.
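As a quick check that pandas really does handle the comma problem for us, here is a minimal sketch (the sentence and tag are made up purely for illustration) –

import pandas as pd

# a field containing a comma would break the naive print-based approach above
df = pd.DataFrame({"word": ["It was the best of times, it was the worst of times"],
                   "tag": ["sentence"]})
# pandas wraps the comma-bearing field in double quotes, keeping the CSV valid
print(df.to_csv(index=False))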

Next, we decide where to place the exit ramp. It is prudent to place it after the PoS tags are generated. Based on what we need to store in Cosmos DB, we will need these two columns in the CSV –

Word, as it appears in the text

POS Tag

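So the file we are aiming for would look something like this (the words and tags here are only illustrative) –

word,tag
It,PRP
was,VBD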

To reach there we need to create a pandas DataFrame with these two columns. Let us say we save this script in the same location as the client for storing data in Cosmos DB, and call the script storage\export_data_csv.py

import pandas as pd

def export_to_csv(tagged_data, csv_file_path):
    # tagged_data is a dict of column name -> list of values
    data_df = pd.DataFrame(tagged_data)
    # index=False keeps pandas' row index out of the file, so we get just our two columns
    data_df.to_csv(csv_file_path, index=False)
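A quick way to try this out from the console, with a tiny hand-made sample (assuming the script sits in the storage folder as described above) –

>>> from storage.export_data_csv import export_to_csv
>>> export_to_csv({"word": ["It", "was"], "tag": ["PRP", "VBD"]}, "sample.csv")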

 

At this moment it is worth taking a peek into Orchestrator.py. This is the file that runs the entire experimentation process, and it is in that pipeline that we invoke our script to mount the exit ramp. So, let us peek into Orchestrator.py –

from .basic_file_read import read_entire_document
from .nlp import tag_words
from .storage import export_to_csv

# raw strings so that \t in these Windows paths is not read as a tab character
filepath = r"data\tale_of_two_cities.pdf"

print("We read the pdf file as text")
text_content = read_entire_document(filepath)

print("We next tag the words using nltk library")
tagged_words = tag_words(" ".join(text_content))

print("We wrangle the tagged words to a format based on which pandas data frame can be created")
tagged_data = {"word": [word for word, tag in tagged_words], "tag": [tag for word, tag in tagged_words]}

print("Now that we have wrangled the format we next save the outcome to disk as csv file")
export_to_csv(tagged_data, r".\output\tagged_words_data.csv")
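One caveat worth noting: because Orchestrator.py uses relative imports, it must be run as a module from the folder above the package rather than as a bare script. Assuming the package folder is named pyexp (a hypothetical name for illustration) –

python -m pyexp.Orchestrator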

That is it. Once we have run this process once, we have the data on disk, and we can import a single step in the console interpreter and run it explicitly instead of running the entire process every time. That too is simple: assuming we have run the process and have the CSV file, and given that our script for storing data in Cosmos DB starts from a CSV file, we can use a statement like this in the console to run that one step instead of the entire process.

>>> from storage import save_to_cosmos as sc

>>> sc.persist(r".\output\tagged_words_data.csv")

You might know about processing pipelines in sklearn. The approach we have taken through this multi-part series unfolds what the world looked like before such libraries existed. It is important that a budding data scientist deals with this raw plumbing. Until such an exercise is done, learning the libraries alone might leave you a master of syntactic sugar at worst and a master of the library at best.

Connecting to Cosmos DB

Having that out of our way, we now need to connect to Cosmos DB. There are a few important settings that one must remember never to put in code. If you have done cloud-based development, you will know that secrets are not persisted in code. The approach adopted in such situations is to store sensitive information in the compute ecosystem's environment variables. The compute ecosystem can range from a developer workstation to a VM in the cloud, a container, or simply a serverless runtime like Azure Functions or AWS Lambda.
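On a developer workstation that can be as simple as setting the variables in the shell before launching Python; for example, in a Windows command prompt (all values are placeholders, and the endpoint and username formats follow Azure's Gremlin documentation) –

set PYEXP_COSMOS_ENDPOINT=wss://<your-account>.gremlin.cosmos.azure.com:443/
set PYEXP_COSMOS_DATABASE=/dbs/<your-database>/colls/<your-graph>
set PYEXP_COSMOS_KEY=<your-primary-key>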

There is plenty of documentation in Azure on how to connect to Cosmos DB, and there are a few prerequisites, such as installing the Gremlin driver (pip install gremlinpython). Let us fast-forward through those straightforward steps and assume you have installed it. Then we start with something like this –

from gremlin_python.driver import client, serializer

from gremlin_python.driver.protocol import GremlinServerError

Most of the time we find that articles skip over the imports, leaving them as unnecessary detail or as too trivial. We believe those details are very important for anyone following along; we need to know where each component comes from, don't we? Anyway, resuming our journey to connect to Cosmos DB, we continue like this –

from gremlin_python.driver import client, serializer
from gremlin_python.driver.protocol import GremlinServerError
import os
import pandas as pd

# the secrets live in environment variables, never in code
_Endpoint = os.environ.get("PYEXP_COSMOS_ENDPOINT")
_Database = os.environ.get("PYEXP_COSMOS_DATABASE")
_Key = os.environ.get("PYEXP_COSMOS_KEY")

def test_connect():
    gremlin_client = client.Client(_Endpoint, "g", username=_Database, password=_Key, message_serializer=serializer.GraphSONSerializersV2d0())
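If we want the test to actually exercise the connection, we can extend test_connect to submit a trivial traversal and close the client afterwards; a minimal sketch, assuming the endpoint and credentials above are valid –

def test_connect():
    gremlin_client = client.Client(_Endpoint, "g", username=_Database, password=_Key, message_serializer=serializer.GraphSONSerializersV2d0())
    try:
        # fetch at most one vertex; an empty list still proves the round trip works
        result = gremlin_client.submit("g.V().limit(1)").all().result()
        print("Connected, sample result:", result)
    except GremlinServerError as e:
        print("Connection failed:", e)
    finally:
        gremlin_client.close()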

When you execute this function and it does not result in an error, we should be good. We next have to load the data and push it to Cosmos DB. Let us say we create a new function in the same script where we did the testing.

def save_data_to_cosmos(csv_file_path):
    # read the exit-ramp CSV back into a DataFrame
    tagged_data = pd.read_csv(csv_file_path)
    for row in tagged_data.itertuples():
        # TODO: Add gremlin code to insert nodes
        # TODO: Add gremlin code to insert edges
        pass

Now, the Gremlin code requires a bit of attention and is vastly different from the Python code. We will pause here and tackle the Gremlin code in the next dispatch.