A book is more than what it is!

Using Python to analyze a book


The big picture

An avid reader reads a book cover to cover and soaks in the experience. By book we mean, obviously, fiction for now. How would a data scientist look at a book? Well, a bag of words, isn't it? You must have seen that coming from miles away, and it is true. We took off with that kind of imagination and started wondering: is there a correlation between an author and their style of writing? It was a fun exercise, but it also gave us a solid, hard look at how to chase exploratory work in data science. We share our journey with you here. It is a joy ride with thoughts, some code, and some flaws. One thing we can assert: by the final paragraph of this dispatch you, our reader, will have a feel for how exploratory work is pursued under real uncertainty. You will also be exposed to points of view different from your own, or ones that resonate with what you were already pondering.

Let us dive in.

Writing style

Think about it; what is a writing style? Limiting ourselves to the language this article is written in, English, it is the usage of words in a particular manner. So what is that manner? Drop the word "particular" and quiz ourselves: in what manner does an author write? Isn't it the choice of sequencing words across parts of speech, like noun, verb, adjective and so on? English has a set of rules, called grammar, that defines how these parts of speech must be arranged. Still, creative freedom lets authors arrange them in peculiar styles. We took flight with that definition of manner. We wanted to establish whether there is a strong correlation between an author and such arrangements of wording across publications and books.
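To make that concrete, here is one way to surface the part-of-speech sequence hiding behind a sentence. This is purely illustrative and uses NLTK, which is not necessarily the tooling the rest of this series settles on:

import nltk

# one-time downloads of the tokeniser and tagger models
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

sentence = "It was the best of times, it was the worst of times."
tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
print([tag for _, tag in tagged])
# something like ['PRP', 'VBD', 'DT', 'JJS', 'IN', 'NNS', ...]

Two authors can use the same vocabulary yet leave very different trails of tags like these; that trail is the "manner" we are after.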

Facets

There are multiple facets to this exploration. We first want to collect publications by an author, then we need to prepare a model for the book's content, and then we need to find these patterns in parts of speech. One thing at a time; we picked the data model first. Isn't it evident that whatever we do in the other facets will become easier or harder based on the data model? We are also aware that we might not make the right decision early on, but we adapt in flight: whatever we lay down now must not be something we are so in love with that we are reluctant to let go as we discover new facets or challenges en route.

Data model

The best data model for this case, we thought, would be a graph. We also believed we were better off using a graph database to convert the textual data into an analytical structure. In our model, words are vertices and sentences form the edges between them. The sequence of occurrence of the words in a sentence, we decided, is better placed as a property on the edge.

Next, from that initial thought, we also needed the part of speech to be represented. We put that in the vertex label, because we anticipate querying the label more often than the individual word itself. So we got ourselves something like this:

[Figure: the graph data model, with word vertices labelled by part of speech and sentence edges carrying a sequence property]
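We have not pinned ourselves to a specific graph database yet, so purely as a minimal sketch of the model's shape we use the networkx library here; the vertex and edge attributes below are our reading of the description above, not a prescribed schema:

import networkx as nx

g = nx.MultiDiGraph()

def add_sentence(graph, tagged_words, sentence_id):
    # tagged_words: list of (word, part_of_speech) pairs, in reading order
    for seq, (word, pos) in enumerate(tagged_words):
        # vertices are words; the part of speech acts as the label
        graph.add_node(word, label=pos)
        if seq > 0:
            prev_word = tagged_words[seq - 1][0]
            # an edge links consecutive words of a sentence; the word's
            # position in the sentence rides along as an edge property
            graph.add_edge(prev_word, word, sentence=sentence_id, sequence=seq)

add_sentence(g, [("it", "PRON"), ("was", "VERB"), ("the", "DET"),
                 ("best", "ADJ"), ("of", "ADP"), ("times", "NOUN")], sentence_id=1)

A MultiDiGraph is deliberate here: the same pair of words can sit next to each other in many sentences, and each occurrence deserves its own edge.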

Populate the data model

The data model is great, but now the rubber needs to meet the road: we need to load data into this model. Our data is a PDF file, the entire book. Before we get to the specifics of loading, we realised we needed a sustainable folder structure, so that we never puzzle over why a file lives where it does and navigating it becomes second nature. So we devised this:


(-)— experiments
|
|—— __init__.py
|—— basic_file_read.py
|—— […]
(-)— data
|
|—— tale_of_two_cities.pdf
|—— count_of_monte_cristo.pdf
|—— […]

The folder structure is self-evident, but it is not all we will need. The pattern, though, is this: every experiment is persisted as an individual script file, and data is collected in one place. These folders are treated as Python modules, so we have an __init__.py that re-exports the code. That file looks like this:

from .basic_file_read import read_document
from .basic_file_read import read_first_page
from .basic_file_read import read_file_per_line
from .basic_file_read import save_textas_csv
from .exploratory_analysis import statistical_summary
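With those re-exports in place, an experiment can be driven from outside the package. As a hypothetical illustration (this driver file is our invention for this write-up, not part of the repository):

# run.py -- a hypothetical driver that stitches the experiment scripts together
from experiments import read_first_page

if __name__ == "__main__":
    text = read_first_page()
    print(text[:200])  # peek at the opening of the book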

Fairly quickly we needed another folder called analysis, where we could store any intermediate data file that we intend to introspect. We will be doing that quite often, won't we?
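We have not spelled out the shape of those dumps here, so purely as a sketch of the idea (the function name and columns below are our own, not the save_textas_csv from the package above), writing an intermediate file into that folder might look like:

import csv
from pathlib import Path

def save_lines_as_csv(lines, name):
    # dump intermediate lines of text into the analysis folder for later inspection
    out = Path("analysis") / f"{name}.csv"
    out.parent.mkdir(exist_ok=True)
    with out.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["line_no", "text"])
        for number, line in enumerate(lines, start=1):
            writer.writerow([number, line])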

In case you are wondering why we did not just launch a Jupyter notebook server and run one notebook: that is very much possible. So is the alternative, i.e. cooking up individual script files and stitching them together via a main runtime file. We happened to pick the latter approach. Before we wrap up, let us flesh out the PDF reading, as it is the more trivial part:

import pdfplumber as pdf

path = "data/tale_of_two_cities.pdf"  # any file from the data folder above

def read_first_page():
    print("ReadFirstPage =>")
    file = pdf.open(path)             # pdfplumber handles the PDF parsing
    first_page = file.pages[0]        # pages are zero-indexed
    return first_page.extract_text()


The above code relies on the pdfplumber library. We will continue from here; in the next dispatch we will also take a look at reorganising the code structure a bit for convenience.