Discovering topics of interest in a document


Using NLTK

The world as we know it today is, as always, in transition. The emphasis is on being digital first. Digital what? Digital content. The internet, if you recall, was invented to exchange content between remote devices, and that content was called a document to indicate structure. For ages, people who talked about metadata and discoverability were looked at as hippies with no worry about the real world. But through all the changes we fancied in between, the GUIs, the fancy interfaces, virtual reality, voice input and what not, we now find ourselves in an age of machine learning that relies heavily on text content, because that is where the value and information lie. Content that is not marked up and does not carry much structure. This has given birth to many challenges in getting at the data that is relevant.

Before we can get something useful from a piece of content using modern machine learning techniques, we need to process that content. This post tackles one such scenario by taking the narrow road marred with many more challenges: parsing topics of interest from a PDF file. The assumption we put down is that the PDF contains actual text and is not a scanned image of a printed document. We take that route only to emphasize the approach. We can certainly handle the case where the document is a scanned image; it would just take some extra processing before joining up with what we are going to do in this post.

Getting started

Start by knowing what you are dealing with. That entails opening the document yourself and seeing what it contains. We used a document that talks about a public policy; you can download it and see for yourself. This document is digital content with multiple tables but no images, just like the target we set for ourselves in the beginning.

First things first: how do we get the text out of this document? It must be fairly simple, mustn't it? We use a third-party library to get that job done – pdfplumber. We recommend creating a Python virtual environment for this project to avoid muddling your global Python installation. In your project's Python environment, run the following command –

 

pip install pdfplumber

 

Once all that scrolling text stops, your installation is complete. You can either start a new Python file or work with the interpreter in the project's folder, where we test that the installation succeeded –

 

import pdfplumber as pdf

 

file = pdf.open('CBULB17.pdf')

first_page = file.pages[0]

first_page.extract_text()

 

These lines get just the text from the very first page of the document. You must have noticed by now that the text contains a lot of newline characters. We will not need those for the work we are doing, so let us get rid of them –

 

pageContent = first_page.extract_text()

 

contentWithoutEmptyLines = [line.strip() for line in pageContent.split('\n') if line != '\n' and line.strip() != '']

 

This gets rid of all the newlines on the page, but the text ends up segmented in arbitrary ways; a sentence could span several entries in the array. For now, we are not bothered by that. Notice the strip() we call in two places: it removes any unwanted spaces at the beginning and the end of a line, which is worth keeping in mind whenever we use it. Having seen the result, let us take this journey further and do the same for all the pages. It should be a simple loop, shouldn't it? While we are at it, let us make it a function we can call from somewhere; we will worry about the where when we get to it.

 

import pdfplumber as pdf

def get_pdf_as_text(path):
    file = pdf.open(path)
    text = []

    # Collect the cleaned-up, non-empty lines from every page
    for page in file.pages:
        pageContent = page.extract_text()
        newLineSeparated = pageContent.split('\n')
        text += [line.strip() for line in newLineSeparated if line != '\n' and line.strip() != '']
    return text

 

The list returned by get_pdf_as_text now contains all the text lines from the PDF file. Let us take this to the next step. Remember we said we were not bothered that sentences could be split by "." across different entries of that list? That is because we plan to use another interesting approach to analyse it.
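As a quick sanity check, here is a minimal usage sketch, assuming the same CBULB17.pdf sits in the working directory –

lines = get_pdf_as_text('CBULB17.pdf')
print(len(lines))   # how many non-empty lines we extracted
print(lines[:5])    # peek at the first few entries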

Getting ourselves a data store

We are going to use a graph database to do just that. We had a couple of earlier entries on graph databases, their concepts, their query language and so on. We are going to put all of that to use here and eventually arrive at the approach for extracting the information.

We could have installed a graph database (neo4j or something like that) locally. But instead of going through the installation process and initial configuration, we decided to use an online version, and we chose Azure CosmosDB for this purpose. Even if you are a starter, you can get a no-cost setup done, and with the minimal usage in this blog post we do not have to pay anything because it sits well within the free usage tier.

CosmosDB uses the Gremlin syntax, and the best part is that when we access the database programmatically, we use the same syntax we would otherwise use in the console. That is, we do not have to memorize many classes like client, connection, node, relationship and so on. Let us get there quickly; we will use the gremlinpython library. We get that into our project's virtual Python environment like this –

 

pip install gremlinpython

 

Once all the downloading and copying after this command is over, we can get started with a simple client like this –

 

from gremlin_python.driver import client, serializer
import os

# Sensitive connection details come from environment variables
endpoint = os.environ.get('PY_COSMOS_POLICY_SRV')
database = os.environ.get('PY_COSMOS_POLICY_DATA')
key = os.environ.get('PY_COSMOS_PSWD')

gremlinClient = client.Client(endpoint, 'g', username=database, password=key, message_serializer=serializer.GraphSONSerializersV2d0())

 

There is one very important convention used here. To secure our code and avoid accidentally giving away sensitive data, we push sensitive information into the operating system's environment variables. This habit is worth inculcating even when you are just experimenting, because quick shortcuts taken to see if code works could leak very sensitive data from a company's cloud assets, or land you with an unrealistic credit card bill. The CosmosDB endpoint, database name and key are all stored in the environment. Setting them up is typically done offline and is not part of the program's code; if you use DevOps, that information will be set up by the pipeline. But since we were working on a developer's workstation analysing the data, we did it manually. The following script sets one such variable in the environment –

 

[System.Environment]::SetEnvironmentVariable('PY_COSMOS_POLICY_SRV', '<The value>', 'User')

 

We did this in the PowerShell console. You can do it in a command prompt without much challenge, but the command will be different; we took this route as it might be savvier than cmd 😊. You will have to repeat this for the other variables, 'PY_COSMOS_POLICY_DATA' and 'PY_COSMOS_PSWD'.
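For reference, the rough cmd equivalent uses setx, which also writes to the user environment; note the new value only becomes visible to consoles started afterwards –

setx PY_COSMOS_POLICY_SRV "<The value>"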

We did not touch upon how to set up a CosmosDB account, but that is a cakewalk for anyone. Just remember to pick the values for the endpoint, database name and primary key from the CosmosDB property blades. Remember that in CosmosDB the database name has to be put in a format like this – "/dbs/[Database name]/colls/[Collection name]". If you put in just the database name, you might run into trouble.
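Before we sign off, a tiny smoke test can confirm the client is wired up correctly. This is only a sketch under assumptions of our own: the vertex label 'line', the property 'content' and the partition key property 'pk' are names we made up for illustration, and your graph's partition key path may differ –

# Submit the same Gremlin string we would otherwise type in the console
query = "g.addV('line').property('content', 'hello from python').property('pk', 'policy')"
callback = gremlinClient.submit(query)
print(callback.all().result())   # block until the server responds and print the new vertex

If that call returns without an authentication or partition key error, the connection details stored in the environment are good.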

Well, we are not done yet. We will use the next dispatch to complete what we started here.