Exploratory analysis with Python
Recap
We have so far read a page from the book we are interested in. We have also set up a utility function that maps each POS tag to a suitable description. If you have landed here directly from the conduits of the internet: we aspire to see whether there is a writing pattern in literary works that can be expressed using Part of Speech (PoS) tags.
We now move towards tokenizing the page. We talked about representing each word as a node, its PoS tag as the node's label, and the relationship between nodes as positional occurrence within a sentence.
Reading text
We have so far built a scaffolding function that looks like this:
def read_entire_document(path):
    print("Reading entire document =>")
    file = pdf.open(path)
    text = []
    for page in file.pages:
        text.append(page.extract_text())
    print(text)
To tokenize the document we can endeavour to do it ourselves or use an NLP library. We choose the latter. For that we need to read the content of the document line by line, which we do by introducing a split:
pageContent = page.extract_text()
lines = pageContent.split("\n")
The whole function will then look like this –
def read_entire_document(path):
    print("Reading the entire document =>")
    file = pdf.open(path)
    text = []
    for page in file.pages:
        pageContent = page.extract_text()
        content = pageContent.split("\n")
        text += [line.strip() for line in content if line != "\n" and line.strip() != ""]
    return text
A few variations here need to be highlighted. We are taking the appearance of a newline to indicate a line. This makes sense when you are looking at the document; it does not, however, make sense when you are interpreting the text in a literary sense. From that point of view we would have to split on the period character, which terminates sentences in English. That is not going to be straightforward: we would have to handle edge cases like "i.e." appearing in the text, since such fragments are not really sentence boundaries.
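To illustrate the edge case, here is a minimal, stdlib-only sketch of a sentence splitter that protects a few abbreviations before splitting on periods. The abbreviation list and the helper name are ours for illustration, not part of the project code:

```python
import re

# Abbreviations whose periods should not end a sentence; the list is illustrative.
ABBREVIATIONS = r"\b(i\.e|e\.g|etc)\."

def naive_sentences(text):
    # Temporarily mask the periods inside known abbreviations.
    protected = re.sub(ABBREVIATIONS,
                       lambda m: m.group(1).replace(".", "<DOT>") + "<DOT>",
                       text)
    # Split on sentence-final punctuation followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", protected)
    # Restore the masked periods and drop empty fragments.
    return [p.replace("<DOT>", ".").strip() for p in parts if p.strip()]

print(naive_sentences("It was the best of times, i.e. a golden age. It was also the worst."))
# ['It was the best of times, i.e. a golden age.', 'It was also the worst.']
```

In practice, NLTK's sent_tokenize handles many such abbreviations out of the box, which is one more reason we lean on the library.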
Line vs Page and PoS
The other point worth considering is whether PoS tagging should be done per sentence or over the whole content across pages. Let us see whether the answer to this question impacts the performance of the data parsing, or whether it affects the theory we set out to validate. The quickest way to find out is to conduct a quick experiment.
We initially do this in the command-line interpreter. On second thought, this can be an experiment in itself. We name the experiment after the theory it validates: PoS_Difference_LineVsFile
from nltk import pos_tag, word_tokenize
import pdfplumber as pdf
from basic_file_read import read_entire_document

print("QUESTION – Does PoS differ when derived from a statement or the entire text")

pdf_text = [page.extract_text() for page in pdf.open("Sample.pdf").pages]

print("We join the text of individual pages with a space and get the PoS tags in the same line")

full_text_pos = pos_tag(word_tokenize(" ".join(pdf_text)))

print("We now bring in the PoS for individual sentences")

lines = read_entire_document("Sample.pdf")

tagged_lines = [pos_tag(word_tokenize(line)) for line in lines]

flattened_pos = [item for p1 in tagged_lines for item in p1]

print("Let us now compare the two PoS sets.")

print(set(full_text_pos) ^ set(flattened_pos))
If the above code prints anything on the console, then there are differences. For any literary work you pick, we are fairly confident that something will always be printed by the above experiment. That is because Part of Speech tagging assigns a tag based on the context in which the word is used: the context a tagger sees for a single line differs from the context it sees when we supply the entire text in one go. The outcome of the experiment, then, is to use the entire text, and we will reuse the code we already have for that purpose.
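For readers unfamiliar with the `^` operator on sets: it computes the symmetric difference, i.e. the (word, tag) pairs present in one tagging but not the other. A small plain-Python illustration, with invented tags rather than the output of a real tagger run:

```python
# Hypothetical (word, tag) pairs; the tags here are invented for illustration.
full_text_pos = {("Book", "VB"), ("a", "DT"), ("flight", "NN")}
line_pos = {("Book", "NN"), ("a", "DT"), ("flight", "NN")}

# Symmetric difference: pairs that the two taggings do not agree on.
diff = full_text_pos ^ line_pos
print(sorted(diff))  # [('Book', 'NN'), ('Book', 'VB')]
```

An empty set would mean the two taggings agree on every pair, which is exactly what the experiment checks.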
Re-organising code
We will move that into a script of its own. Let us call it nlp/tag_text.py
from nltk import pos_tag, word_tokenize

def tag_words(corpus):
    return pos_tag(word_tokenize(corpus))
Needless to say, the __init__.py will export tag_words for use in the main set of experiments.
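A minimal sketch of that export, assuming the nlp package layout we use here:

# nlp/__init__.py
from .tag_text import tag_words

__all__ = ["tag_words"]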
Storing in graph storage
We have come this far without storing data anywhere. As in our introductory article, we intend to capture the data with PoS tags in graph storage. We will use Cosmos DB as our graph storage. Before we create a new folder or script, let us take a look at our folder structure:
(-)-- experiments
|------ __init__.py
|------ basic_file_read.py
|------ PoS_Difference_LineVsFile
(-)-- data
|------ tale_of_two_cities.pdf
|------ count_of_monte_cristo.pdf
(-)-- nlp
|------ __init__.py
|------ tag_text.py
|
|-- Orchestrator.py
Now let us create a new folder that deals with storage and call it storage. Rather than an experiment, this one puts the data into graph storage in an agreed-upon structure, so it would not be appropriate to place it under experiments. Let us take another look at our folder structure:
(-)-- experiments
|------ __init__.py
|------ basic_file_read.py
|------ PoS_Difference_LineVsFile
(-)-- data
|------ tale_of_two_cities.pdf
|------ count_of_monte_cristo.pdf
(-)-- nlp
|------ __init__.py
|------ tag_text.py
(-)-- storage
|------ __init__.py
|------ tagged_words_to_cosmosdb.py
|
|-- Orchestrator.py
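As a placeholder for the new script, a hypothetical stub is sketched below; the function name and signature are ours, and the real Cosmos DB wiring comes later:

```python
# storage/tagged_words_to_cosmosdb.py -- a hypothetical stub; the actual
# Cosmos DB client and vertex/edge creation are implemented in the next dispatch.

def store_tagged_words(tagged_words):
    """Persist a list of (word, PoS-tag) tuples as graph vertices.

    tagged_words: the output of nlp.tag_text.tag_words,
    e.g. [("It", "PRP"), ("was", "VBD"), ...]
    """
    raise NotImplementedError("Storage arrives in the next dispatch")
```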
What we must do next is push the tagged words into Cosmos DB in the agreed structure. With these considerations in mind, we will focus our entire effort on storing data in Cosmos DB in the next dispatch. Till then, happy analysing.