A book is more than what it is! (Part III)

 

Exploratory analysis with Python

Recap

We have so far read a page from the book we are interested in. We have also set up a utility function to map each POS tag to a suitable description. If you have landed here directly from the conduits of the internet: we aspire to see whether there is a writing pattern in literary works that can be expressed using Part of Speech (PoS) tags.

We now move towards tokenizing the page. We talked about representing each word as a node, its PoS tag as a label, and the relationship between nodes as positional occurrence in a sentence.

Reading text

We have so far built a scaffolding function that looks like this –

   import pdfplumber as pdf

   def read_entire_document(path):
       print("Reading entire document =>")
       file = pdf.open(path)
       text = []
       for page in file.pages:
           text.append(page.extract_text())
       print(text)

 

To tokenize the document we can write a tokenizer ourselves or use an NLP library to do it; we choose the latter. For this we need to read the content of the document line by line, which we do by introducing a split –

   pageContent = page.extract_text()
   lines = pageContent.split("\n")

 

The whole function will then look like this –

   def read_entire_document(path):
       print("Reading the entire document =>")
       file = pdf.open(path)
       text = []
       for page in file.pages:
           pageContent = page.extract_text()
           content = pageContent.split("\n")
           # keep only non-empty lines, stripped of surrounding whitespace
           text += [line.strip() for line in content if line.strip() != ""]
       return text
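
As a quick sanity check of this function (assuming Sample.pdf, the document we use in the experiment further below, sits next to the script), we can print the first few cleaned lines:

   lines = read_entire_document("Sample.pdf")
   for line in lines[:5]:
       print(line)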

 

There are a few variations here that need to be highlighted. We are taking the appearance of a newline to indicate a line. This makes sense when you are looking at the document; it does not, however, make sense when interpreting the text in a literary sense. To make sense from that point of view we would have to split on the period character, which separates sentences in English. That is not going to be straightforward: we would have to handle edge cases like "i.e." appearing in the text, since such fragments are not exactly sentences or lines.
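
If we do go down that path, one option is to lean on NLTK's sentence tokenizer rather than writing our own splitter; its pre-trained Punkt model already knows about many such abbreviations. A minimal sketch (the sample sentence is our own, and the model needs a one-off nltk.download("punkt")):

   from nltk import sent_tokenize

   text = "It is a truth, i.e. a plain fact. We verify the splitter here."
   # Punkt treats "i.e." as an abbreviation rather than a sentence
   # boundary, so this should yield two sentences, not three
   print(sent_tokenize(text))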

Line vs Page and PoS

The other point worth considering is whether the PoS tagging should be done per sentence or for the whole content across pages. Let us see whether the answer to this question impacts the performance of the data parsing or affects the theory we set out to validate. The quickest way to find out is to conduct a quick experiment.

 

We could initially do this in the command-line interpreter; on second thought, this can be an experiment in itself. We name the experiment after the theory it validates, namely PoS_Difference_LineVsFile –

   

from nltk import pos_tag, word_tokenize
import pdfplumber as pdf
from basic_file_read import read_entire_document

print("QUESTION - Does PoS differ when derived from a statement or the entire text")

pdf_text = [page.extract_text() for page in pdf.open("Sample.pdf").pages]

print("We join the text of individual pages with a space and get the PoS tags in the same line")

full_text_pos = pos_tag(word_tokenize(" ".join(pdf_text)))

print("We now bring in the PoS for individual sentences")

lines = read_entire_document("Sample.pdf")
tagged_lines = [pos_tag(word_tokenize(line)) for line in lines]

# flatten the per-line lists of (word, tag) pairs into a single list
flattened_pos = [item for p1 in tagged_lines for item in p1]

print("Let us now compare the two PoS sets.")

# the symmetric difference holds (word, tag) pairs found in one result but not the other
print(set(full_text_pos) ^ set(flattened_pos))

 

If the above code prints anything on the console, then there are differences. For any literary work that you pick, we are fairly confident that the experiment will always print something.

That is because Part of Speech tagging assigns a tag based on the context in which the word is used. When we supply just a line, the context differs from when we supply the entire text in a single go.
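
A small illustration of that context sensitivity, assuming NLTK's default English tagger (exact tags may vary with the tagger model):

   from nltk import pos_tag, word_tokenize

   # the same word can play different grammatical roles
   print(pos_tag(word_tokenize("I will book a flight")))
   # 'book' typically comes back as a verb (VB) here
   print(pos_tag(word_tokenize("The book is on the table")))
   # and as a noun (NN) here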

The outcome of the experiment, for us, is to use the entire text, and we will reuse the code we already have for that purpose.

Re-organising code

We will move that into a script of its own. Let us call it nlp/tag_text.py –

   from nltk import pos_tag, word_tokenize

 

   def tag_words(corpus):

       return pos_tag(word_tokenize(corpus))

Needless to say, the __init__.py will export tag_words for use in the main set of experiments.
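
For completeness, nlp/__init__.py can be as small as a single re-export:

   from .tag_text import tag_words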

Storing in graph storage

We have come this far without storing data anywhere. As in our introductory article, we intend to capture the data, with its PoS tags, in graph storage. We will use CosmosDB for this. Before we create a new folder or script, let us take a look at our folder structure –

 

(-)-- experiments
|------ __init__.py
|------ basic_file_read.py
|------ PoS_Difference_LineVsFile.py
(-)-- data
|------ tale_of_two_cities.pdf
|------ count_of_monte_cristo.pdf
(-)-- nlp
|------ __init__.py
|------ tag_text.py
|
|-- Orchestrator.py

Now let us create a new folder that deals with storage and call it storage. Rather than an experiment, this one puts the data into graph storage in an agreed-upon structure, so it would not be appropriate to place it under experiments. Let us take a look again at our folder structure –

(-)-- experiments
|------ __init__.py
|------ basic_file_read.py
|------ PoS_Difference_LineVsFile.py
(-)-- data
|------ tale_of_two_cities.pdf
|------ count_of_monte_cristo.pdf
(-)-- nlp
|------ __init__.py
|------ tag_text.py
(-)-- storage
|------ __init__.py
|------ tagged_words_to_cosmosdb.py
|
|-- Orchestrator.py

Here is an outline of what we must do next (a plain-Python sketch of the target shape follows the list):

1. We must create a node for each word.
   a. The obvious question is what happens when we encounter a repeated word.
2. Each PoS tag must be stored as a label in the graph storage.
3. Nodes/vertices must be connected by edges carrying a number property that indicates the connectedness of two consecutive words; the number must reflect the position of the pair in the text.
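
Before we touch CosmosDB itself, a plain-Python sketch helps pin down the shape we are aiming for. The names below (to_graph_shape, word_nodes, edges) are placeholders of our own, not a storage API:

   from nlp import tag_words

   def to_graph_shape(corpus):
       tagged = tag_words(corpus)
       # one node per distinct word, its PoS tag acting as the label;
       # a repeated word reuses its node, which is exactly question 1a -
       # naively keeping the last tag seen papers over words whose tag
       # changes with context
       word_nodes = {word: tag for word, tag in tagged}
       # one edge per consecutive pair of words, with a number property
       # recording the position of the pair in the text (point 3)
       edges = [
           {"from": tagged[i][0], "to": tagged[i + 1][0], "position": i}
           for i in range(len(tagged) - 1)
       ]
       return word_nodes, edges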

With these considerations in mind, we will focus our entire effort on storing the data in Cosmos DB in the next dispatch. Till then, happy analysing.