A book is more than what it is! – Part II

Exploratory analysis with Python

 

Recap

If you have landed here directly from a search engine or a forum, let us set the context quickly. We want to see whether an author’s writing pattern can be expressed using Part of Speech (PoS) tags. Our underlying goal in this exercise is also to explore how data science experiments can be structured, in a way that could scale to real-world projects.

On that journey, we set up the initial folder structure and added the code to read a page from the PDF file.

Setting up a feedback mechanism

Next, we print this text to the console. Two options came to mind. The simplest is to add a print statement to the function itself. The other is to have a separate script that performs the printing. In languages like C# or Java, a Main method would do this for us. For now, let us do it inside the function itself.

 

   # `pdf` and `path` are expected from the Part I setup
   def read_first_page():
       print("Reading first page =>")
       file = pdf.open(path)
       first_page = file.pages[0]
       print(first_page.extract_text())

 

 

Now let us take this a step further and extract the text of the entire document.

 

   def read_entire_document():
       print("Reading entire document =>")
       file = pdf.open(path)
       text = []
       # Collect the text of every page, in order
       for page in file.pages:
           text.append(page.extract_text())
       print(text)

 

 

You must have noticed the challenge. There are two functions, and we need only one entry point for the application. That sways us towards the other approach we mentioned earlier: a separate script. So, let us refactor the code. Offhand, many would call this fixing the code. In data science, and we would urge everywhere else, this is refactoring. Come to think of it, we developers refactor code all day long.

The refactoring here is simple: we change the end of both functions so that each returns its text instead of printing it, like this:

 

   def read_first_page():
       …
       return text

   def read_entire_document():
       …
       return text

 

 

Small things, like keeping the same name for the value each function returns, make this kind of exploratory coding a lot easier.

Now we introduce the conductor that chooses the mode of exploration we want. Let us call the conductor Experiments.py. Isn’t it apt?
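
To see the return-instead-of-print idea run end to end without a PDF at hand, here is a minimal sketch. It rests on assumptions that are not in the original code: FakePage is a hypothetical stand-in that only mimics extract_text(), and both readers take a pages argument purely so the sample is self-contained.

```python
class FakePage:
    """Hypothetical stand-in for a PDF page; it only mimics extract_text()."""

    def __init__(self, text):
        self._text = text

    def extract_text(self):
        return self._text


def read_first_page(pages):
    # Return the text instead of printing it, so the caller decides what to do.
    text = pages[0].extract_text()
    return text


def read_entire_document(pages):
    # Same variable name as above, so both readers look alike to a reader.
    text = [page.extract_text() for page in pages]
    return text


pages = [FakePage("Once upon a time"), FakePage("The end")]
print(read_first_page(pages))
print(read_entire_document(pages))
```

Because neither function prints any more, the caller is free to print the result, count its words, or pass it on to the next step.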

 

   # Experiments.py

   import basic_file_read as document_reader

   print("Host for the experiments that you want to be done here…")

   content = document_reader.read_first_page()
   print(content)

 

 

This gives us a place to lay out the path of experiments. We can call as many as we want. The next challenge comes when we want to change it to a different set of experiments: we then lose the current experiment. Keeping all of them together in this file also clutters it. We propose using Git and tagged commits to manage such check-ins. At present this may feel like being Inspector Gadget without the manual, but it becomes easier with practice.

We could instead opt for a best-in-class IDE with a workbench. That works as well. But starting from the basics always helps the roots go deeper.

Coming back to the task at hand: we have a conductor for the experiments and a module that reads the PDF file as raw text.

Next, we take up inferring the Part of Speech from this text. For that we will need a library: nltk. It is a basic library, but it will suffice for our needs for now. If you have worked with it before, you will know that it gives PoS tags as abbreviations drawn straight from the Penn Treebank. Let us first build a summary of them so that we need not remember them. We made it a dictionary, as the set is finite.
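
One possible way to keep several experiments within reach of a single conductor is sketched below. This is an illustration, not the layout used in this series: the EXPERIMENTS table and the run helper are hypothetical names, and the two readers are stand-ins that return canned text so the sample runs without basic_file_read or a PDF.

```python
# Experiments.py (sketch): stand-ins replace the real readers so this runs alone.


def read_first_page():
    # Stand-in for basic_file_read.read_first_page (canned return value).
    return "text of page one"


def read_entire_document():
    # Stand-in for basic_file_read.read_entire_document (canned return value).
    return ["text of page one", "text of page two"]


# Map a short experiment name to the function that runs it.
EXPERIMENTS = {
    "first_page": read_first_page,
    "entire_document": read_entire_document,
}


def run(name):
    """Run one named experiment and return whatever it produces."""
    if name not in EXPERIMENTS:
        raise ValueError(f"Unknown experiment: {name!r}")
    return EXPERIMENTS[name]()


if __name__ == "__main__":
    # Python's counterpart to Main: runs only when the file is executed directly.
    print(run("first_page"))
```

The guard at the bottom gives the script a single entry point, and switching experiments becomes a one-word change rather than an edit to the body of the file.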

 

   # english_pos_tags.py

   def tag_descriptions():
       # Penn Treebank PoS tag abbreviations and what they stand for
       return {
           'CC': 'Coordinating conjunction',
           'CD': 'Cardinal number',
           'DT': 'Determiner',
           'EX': 'Existential there',
           'FW': 'Foreign word',
           'IN': 'Preposition or subordinating conjunction',
           'JJ': 'Adjective',
           'JJR': 'Adjective comparative',
           'JJS': 'Adjective superlative',
           'LS': 'List item marker',
           'MD': 'Modal',
           'NN': 'Noun singular or mass',
           'NNS': 'Noun plural',
           'NNP': 'Proper noun singular',
           'NNPS': 'Proper noun plural',
           'PDT': 'Predeterminer',
           'POS': 'Possessive ending',
           'PRP': 'Personal pronoun',
           'PRP$': 'Possessive pronoun',
           'RB': 'Adverb',
           'RBR': 'Adverb comparative',
           'RBS': 'Adverb superlative',
           'RP': 'Particle',
           'SYM': 'Symbol',
           'TO': 'to',
           'UH': 'Interjection',
           'VB': 'Verb base form',
           'VBD': 'Verb past tense',
           'VBG': 'Verb gerund or present participle',
           'VBN': 'Verb past participle',
           'VBP': 'Verb non-3rd person singular present',
           'VBZ': 'Verb 3rd person singular present',
           'WDT': 'Wh-determiner',
           'WP': 'Wh-pronoun',
           'WP$': 'Possessive wh-pronoun',
           'WRB': 'Wh-adverb'
       }

 

 

We will consume this function after generating the PoS tags in our next dispatch.
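
Ahead of that dispatch, here is a minimal sketch of how such a lookup might be consumed. It assumes the tagger hands back (word, tag) pairs, which is the shape nltk.pos_tag returns. The pairs below are written by hand for illustration and are not the output of a real tagger run, and the dictionary is a trimmed copy so that the sample stands alone.

```python
from collections import Counter


def tag_descriptions():
    # Trimmed copy of the lookup above, just enough for this sample.
    return {
        'DT': 'Determiner',
        'JJ': 'Adjective',
        'NN': 'Noun singular or mass',
        'VBZ': 'Verb 3rd person singular present',
    }


# Hand-written (word, tag) pairs; not the output of a real tagger run.
tagged = [
    ('The', 'DT'), ('book', 'NN'), ('is', 'VBZ'),
    ('a', 'DT'), ('quiet', 'JJ'), ('teacher', 'NN'),
]

descriptions = tag_descriptions()

# Count how often each tag occurs in the sample.
counts = Counter(tag for _, tag in tagged)

for tag, count in counts.most_common():
    # Fall back to the raw tag if the lookup has no entry for it.
    print(f"{descriptions.get(tag, tag)}: {count}")
```

Tag counts like these are one simple candidate for the writing pattern we are after. Whether they actually tell authors apart is something the experiments will have to show.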