Discovering topics of interest in a document – II

 

Using NLTK

We have come thus far: we got the text out of a pseudo-digital document. We say pseudo because the PDF format does not carry enough metadata for a program to work with it without trouble. We wrapped up the previous dispatch with the activity of connecting to the CosmosDB. We did not talk much about the library call that establishes the connection –

 

gremlinClient = client.Client(endpoint, 'g', username=database, password=key, message_serializer=serializer.GraphSONSerializersV2d0())

 

 

The first curious thing about this is the parameter 'g'. It is the traversal source, the character with which we typically start a query or operation; for CosmosDB, by convention, this is always 'g'. If you are new to cloud-based programming, note that the password is the primary key we copied from the Keys blade of the CosmosDB account in the Azure portal. One of the most important of these parameters is the message_serializer. Owing to the way CosmosDB works, this serializer has to be exactly GraphSONSerializersV2d0. A newer serializer or a different version will give you a cryptic error message when you run a query or perform any graph operation, even though the client connection itself will succeed. If you think about it, that is just as well, since serializers in the client configuration are used only when you execute a query or request a graph operation.
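For reference, here is the general shape these three values take; the account, database, and graph names below are hypothetical placeholders. In the sample that follows we will read them from environment variables rather than hard-coding them –

# Hypothetical placeholders; substitute values from your own Azure portal.
endpoint = 'wss://<your-account>.gremlin.cosmos.azure.com:443/'  # Gremlin endpoint of the account
database = '/dbs/<your-database>/colls/<your-graph>'             # the username takes this path form
key = '<primary-key>'                                            # the password is the primary key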

Let us now put all this together in a sample query as follows –

 

from gremlin_python.driver import client, serializer, protocol
import os

# Connection settings are read from environment variables
endpoint = os.environ.get('PY_COSMOS_POLICY_SRV')
database = os.environ.get('PY_COSMOS_POLICY_DATA')
key = os.environ.get('PY_COSMOS_PSWD')

gremlinClient = client.Client(endpoint, 'g', username=database, password=key, message_serializer=serializer.GraphSONSerializersV2d0())

asyncResult = gremlinClient.submitAsync("g.addV('Sample').property('say','Hello World')")

 

That is all we need to create a new vertex on the remote server. As you may recall from our previous posts, the code – g.addV('Sample').property('say','Hello World') – is exactly what we would use to create a vertex from the gremlin console.
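For comparison, the same statement at the gremlin console would look roughly like this; the generated vertex id will of course differ –

gremlin> g.addV('Sample').property('say','Hello World')
==>v[<generated-id>]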

One important thing we must know before we move on: the result of the execution, asyncResult, has to be read for the command to complete its work on the server. Notice that the call is asynchronous, so if we ran the script above just by itself, the Python interpreter would exit before the operation completes. Thus, we must wait on the result. If we had some other activity to carry out, we could have done that while waiting for the operation to complete. In this example we do not, so we simply do this –

 

print("Result of the operation - {}".format(asyncResult.result().all().result()))

 

You would see some JSON text with the id of the vertex created.
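To make the waiting pattern concrete, here is a minimal sketch: submitAsync hands back a future, so any other work could happen between submitting the query and blocking on its result. The count query here is just an illustrative example –

# submitAsync returns a future; the query runs on the server
# while this script is free to do other work.
future = gremlinClient.submitAsync("g.V().hasLabel('Sample').count()")

# ... any other activity could be carried out here ...

# Block until the server finishes, then read the result.
print(future.result().all().result())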

Have you asked yourself what we are doing with this CosmosDB? Well, here is the plan: we will load the document's text as a graph in the graph database. There is no value in loading the document as-is, so we enrich the text with Part of Speech (POS) tags, the standard Penn Treebank tags. These tags enrich the text, and we can use them to perform simple n-gram analysis, just as a starter.

Getting from plain text to POS tags requires a standard NLP library called nltk. We will install it in our project's virtual environment like this –

 

pip install nltk

 

Once that is done, to get the standard corpora and tokenizers we will have to do one more step –

 

import nltk

nltk.download()

 

A GUI window will prompt you to choose the packages that you want to download. You can pick and choose if you have a good command of the NLP landscape, or you can simply select all and download.
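If you would rather skip the GUI, the download can also be scripted; here is a minimal sketch using the standard NLTK package names for the tokenizer models and the POS tagger, which are the only pieces this exercise needs –

import nltk

# Fetch just the tokenizer models and the POS tagger instead of
# the full collection offered by the GUI downloader.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

Once all the downloading is complete, you can get to the actual activity, i.e., getting the POS tags –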

 

from nltk import pos_tag as pt, word_tokenize as wt

 

pt(wt("Hello world"))

 

The names pt and wt are aliases we assigned in the import statement. From our previous dispatch we know that we collected all of the page text in a list named text. Let us now get the POS tags for each word in that list –

 

from nltk import pos_tag as pt, word_tokenize as wt
import pdfplumber as pdf

def get_pdf_as_text(path):
    text = []
    with pdf.open(path) as file:
        for page in file.pages:
            # Extract the page text and keep only the non-empty lines
            pageContent = page.extract_text()
            newLineSeparated = pageContent.split('\n')
            text += [line.strip() for line in newLineSeparated if line != '\n' and line.strip() != '']
    return text

text = get_pdf_as_text('./CBULB17.pdf')
taggedText = []
for pageText in text:
    taggedText += pt(wt(pageText))

 

 

The outcome of the above work is a list of tuples, with the word as one entry and its POS tag as the other. It will look something like this –

 

[
('CAPACITY', 'NNP'),
('BUILDING', 'NN'),
('SCHEME', 'NNP'),
...
]

 

Now that we have all the POS tags, we can load them into the graph database. Remember, the order of the words is important: it is the sequence of words that builds up the statements which communicate the ideas. We proceed as follows –

 

addVertex = "g.addV('{}').property('id','{}').property('PosTag','{}').property('word','{}')"
addEdges = "g.V('{}').addE('{}').to(g.V('{}')).property('sequence','{}')"

sequenceIndex = 1
for entry in taggedText:
    # Remember the id of the first vertex of the pair; the edge can
    # only be added once both of its vertices exist
    if sequenceIndex % 2 != 0:
        prevVertexId = str(entry[1]) + "_" + str(entry[0])

    formattedVertex = addVertex.format('CBULB17', str(entry[1]) + "_" + str(entry[0]), entry[1], entry[0])
    callback = gremlinClient.submitAsync(formattedVertex)
    if callback.result() is not None:
        print('{}\n'.format(callback.result().all().result()))
    else:
        print('Something went wrong with the vertex \n{}'.format(formattedVertex))

    if sequenceIndex % 2 == 0:
        formattedEdge = addEdges.format(prevVertexId, 'CBULB17', str(entry[1]) + "_" + str(entry[0]), sequenceIndex - 1)
        callback = gremlinClient.submitAsync(formattedEdge)
        if callback.result() is not None:
            print('{}\n'.format(callback.result().all().result()))
        else:
            print('Something went wrong with the edge \n{}'.format(formattedEdge))

    sequenceIndex = sequenceIndex + 1

 

 

In the code above we use templatized gremlin operation strings to add the vertices and edges. The part which introduces some complexity here is the fact that an edge can only be established between two existing vertices. Thus, we wait inside the iteration for both vertices to be created, and only then create the edge.

This way we get a graph of the entire document's text. Now we can use gremlin queries to find n-grams, or use advanced techniques like Opinosis to determine the topics of interest in the document. Such applications are great examples of using NLP and graph technology together to derive meaning from text.
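As a taste of such a query, here is a minimal sketch that walks the sequence edges and projects out the word pairs, i.e. bigrams. It assumes the 'CBULB17' edge label and the 'word' property used in the loading code above –

# A minimal sketch: list the word pairs (bigrams) recorded by the edges.
# Note: 'sequence' was stored through a string template above, so this
# ordering is lexicographic rather than numeric.
bigramQuery = ("g.E().hasLabel('CBULB17').order().by('sequence')"
               ".project('first','second')"
               ".by(outV().values('word'))"
               ".by(inV().values('word'))")

callback = gremlinClient.submitAsync(bigramQuery)
print(callback.result().all().result())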