Master your language

 

We cannot emphasise this enough: master your language. We remember sprinkling our
earlier articles with the same piece of advice. If we remember well, it was positioned in the context of
emerging trends in language-based ways to quickly solve a problem. Let us usher in the proverbial keyword:
ChatGPT. The word ChatGPT is a specific reference to the project. Since its release, it has had many
adjectives associated with it: Intelligent, Artificial, Generative, Job gobbler and so on. Enough has been said
in praise of its capabilities, and enough has been said to scare us about how it can make many job titles
obsolete. We are not here to echo any of those. Please chat with ChatGPT and figure that out for yourself.
We are here to ponder the foundational pillars of ChatGPT. For the rest of this article, we will replace the
brand with the technology, i.e., Generative AI.
The foundational pillars of Generative AI are –
1. Information representation
2. Searching through the piles of that representation
3. Constructing a response in natural language.
These topics have individually been areas of research in Computer Science for decades. What changed, and
why is there so much attention now? The way we see it, there are parallels to Moore's law: today we can do
much more with a CPU than we could have done a decade ago. Add to that the reach of social media and
influencers, and the plush investments endowed on bleeding-edge tech companies, and you have the present
moment.
As developers, it is on us to stay grounded and to understand and build on the fundamentals. The
application of the technology can only be as good as our understanding of it. By application of the
technology, we mean our solutions to customers. Many customers want to pilot or do something with
Generative AI, and an exciting start often slides into a rather dull phase of stagnation if not steered well
by architects and developers.
Information representation
Generative AI is not something that manifested out of thin air. It is built on the language that is written. Let
us spin you off-axis a bit: have you encountered Generative AI solutions in languages like Tamil, Hindi,
German, Sanskrit, Arabic or Portuguese? If not, why so? We will leave that as a question and hope you give
it enough of your cognitive bandwidth at some point. Words, no matter which language they are written in,
are symbols whose meaning is well understood when spoken between two humans. To accomplish
something similar for computers, however, we need to transform these symbols. Such a transformation of
symbols is called encoding. For any form of intelligence, an encoding alone is not enough; it must also
capture the context along with the symbol. For all practical purposes, the context is the sequence of the
symbols. Like this article! It is a sequence of words (symbols), and together they convey something to you.
One can argue that context is more than sequence; take a moment's pause and think about it. Words
(symbols) mean something; that explains the topic of this article. The sequence of words conveys the idea
that we as authors want to communicate to you. This idea, for practical purposes, constitutes the context.
To capture the context, we need to capture this sequence along with the encoding. The challenge is that
there are infinitely many sequences, so how do we store all of that? That is where we introduce probability.
In a language that has structure, certain words have a higher degree of co-appearance, governed by the
grammar of that language and by the way people actually use it. In the field of Computer Science, this act
of capturing the likelihood of co-occurrence is called word embedding.
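To make co-occurrence concrete, here is a toy sketch in Python. It is our own illustration with a made-up
two-sentence corpus; real embeddings are learnt from vastly larger bodies of text, but the counting idea is
the same.

from collections import Counter

# Made-up corpus to illustrate counting co-occurrence of neighbouring words
corpus = ["the cat sat on the mat", "the dog sat on the rug"]

co_occurrence = Counter()
for sentence in corpus:
    words = sentence.split()
    # Count pairs of words that appear next to each other
    for left, right in zip(words, words[1:]):
        co_occurrence[(left, right)] += 1

print(co_occurrence[("sat", "on")])   # 2 - "sat" and "on" co-appear in both sentences
print(co_occurrence[("cat", "rug")])  # 0 - never adjacent in this tiny corpus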

A developer soaked in writing code, in whatever programming language, would by now have assigned a
data structure to that. Does an array come to mind? If you ask, well, an array of what? Needless to say, the
CPU is conversant with numbers, so how about an array of floats?
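As a purely hypothetical illustration (the numbers below are invented by us), such a representation could be
as simple as a dictionary that maps each word to an array of floats –

# Hypothetical, hand-assigned dense vectors; a real embedding learns such numbers
word_vectors = {
    "cat": [0.21, -0.43, 0.07, 0.88],
    "dog": [0.19, -0.40, 0.11, 0.85],   # a similar word gets similar floats
    "mat": [-0.63, 0.12, 0.54, -0.02],
}

The hard part, of course, is arriving at such numbers automatically from a corpus rather than by hand.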
If you want to jump in right away and fire up your laptop to write that encoding and embedding yourself,
you might soon hit a wall of challenges. Frameworks often help address such challenges. We will use the
TensorFlow framework in this article.
The one line
More often than not, the use of a framework makes the code rather straightforward to read. It takes away
the complexities involved in getting things done. Something as complex as word embedding becomes, with
the framework, this –
sentencesDenseVector = tf.keras.layers.Embedding(800, 8, input_length=20)
This line by itself is not sufficient. Come on! A framework can only do so much to help you. This is the
booting up of the embedding layer, but you need a few more things around this line to get a result. We can
take it for a spin right away. Let us supply a 1D vector of word indices to this layer –
sentencesDenseVector(tf.constant([3, 6, 9])).numpy()
The outcome is a dense matrix of floats that does not resemble the input supplied; the exact values on your
machine will differ, because the embedding weights are randomly initialised.
array([[ 0.123456789,  0.987654321,  0.456789012,  0.789012345,
         0.012345678,  0.345678901,  0.678901234,  0.901234567],
       [-0.456789012,  0.789012345, -0.012345678,  0.345678901,
        -0.678901234,  0.901234567, -0.123456789,  0.987654321],
       [ 0.123456789, -0.987654321, -0.456789012,  0.789012345,
         0.012345678, -0.345678901,  0.678901234, -0.901234567]], dtype=float32)
You will notice that this multidimensional array is not sparse, i.e., it does not repeat the same value over and
over, which is what happens when only an encoding such as one-hot encoding is used. The array is dense,
and its width is the output dimension used while creating the embedding layer, i.e., 8.
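Putting the fragments together, a minimal self-contained sketch looks like this. We assume a TensorFlow 2.x
release in which Embedding still accepts the input_length argument (newer Keras versions drop it, in which
case simply omit it); the weights are randomly initialised, so your numbers will differ from ours.

import tensorflow as tf

# Vocabulary of up to 800 words, 8-dimensional vectors, sequences of length 20
sentencesDenseVector = tf.keras.layers.Embedding(800, 8, input_length=20)

# Look up the dense vectors for the word indices 3, 6 and 9
result = sentencesDenseVector(tf.constant([3, 6, 9])).numpy()
print(result.shape)   # (3, 8): one 8-dimensional float vector per input index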
The first parameter is the input dimension, where 800 represents the size of the vocabulary supplied to the
layer. In other words, we expect the input data we will use to train the word embedding to contain up to 800
different words. This is something you can determine during exploratory analysis. Remember not to use the
exact number, as things can grow; use a number larger than the count of unique words in your vocabulary.
There are finer decision points, such as whether you include stop words in this count or skip them. It all
depends on the application.
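A rough sketch of such an exploratory count could look like this; documents is a placeholder for your own
list of raw text strings, and the 1.3 head-room factor is just an example.

# documents is assumed to be your list of raw text strings
unique_words = set()
for document in documents:
    unique_words.update(document.lower().split())

# Pad the count generously because the vocabulary can grow over time
vocabulary_size = int(len(unique_words) * 1.3)
print(len(unique_words), vocabulary_size)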
The input_length parameter is the length of the input sequences supplied to the layer. We are technically
inconsistent in configuring 20 and then supplying only 3 values. You can avoid this conundrum by leaving
out the parameter altogether, so that the layer adapts to the size of the input. When you are working with
sentences, it makes sense to tune this number to the average number of words in a sentence.
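For instance, leaving the parameter out lets the very same layer accept sequences of any length, and the
average sentence length is easy to estimate. This sketch continues from the earlier snippet (tensorflow
imported as tf, documents being the placeholder corpus from above).

# No input_length configured: the layer adapts to whatever length we supply
flexibleEmbedding = tf.keras.layers.Embedding(800, 8)
print(flexibleEmbedding(tf.constant([3, 6, 9])).numpy().shape)          # (3, 8)
print(flexibleEmbedding(tf.constant([3, 6, 9, 12, 15])).numpy().shape)  # (5, 8)

# A reasonable value, if you do set it, is the average number of words per sentence
average_length = sum(len(d.split()) for d in documents) / len(documents)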
Applying this to real text involves some more work than declaring a constant. Text is prone to punctuation,
inconsistent spacing and other quirks that must be cleaned up before it is fit for processing by application
code. Since we are using TensorFlow, we can get that done quickly using a simple library call.
vectorizedText = tf.keras.layers.TextVectorization(max_tokens=800, output_mode='int', output_sequence_length=20)
In your neural network architecture, the vectorization happens before the embedding. In the layer we just
built you can additionally pass the standardize parameter, which takes care of fixing the spacing and
punctuation and any other sanitisation you deem relevant. The max_tokens parameter corresponds to our
guesstimate of the vocabulary size, and output_sequence_length corresponds to the context which we
introduced a while ago. In your experiments, it is important that you tune these hyperparameters; it
certainly will not be one size fits all.
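Before the layer can turn sentences into integer sequences it must learn the vocabulary, which adapt() does.
A sketch with the standardize parameter spelled out and a couple of made-up sample sentences:

vectorizedText = tf.keras.layers.TextVectorization(
    max_tokens=800,
    standardize='lower_and_strip_punctuation',  # the default sanitisation
    output_mode='int',
    output_sequence_length=20)

# Build the vocabulary from (made-up) sample sentences
vectorizedText.adapt(["Master your language!",
                      "Words are symbols, sequences are context."])

print(vectorizedText(["master your LANGUAGE"]).numpy())  # one padded row of 20 word ids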

These layers are then configured in the neural network as –
model = tf.keras.Sequential([vectorizedText, sentencesDenseVector,
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(8, activation='relu'), tf.keras.layers.Dense(1)])
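To train it you would compile and fit as usual. The sketch below is hedged: raw_sentences and labels are
placeholders for your own training texts and numeric targets, and the loss is only an example; pick one that
suits your task.

vectorizedText.adapt(raw_sentences)  # build the vocabulary from your real corpus first

model.compile(optimizer='adam', loss=tf.keras.losses.MeanSquaredError())
model.fit(tf.constant(raw_sentences), tf.constant(labels), epochs=10)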
One key point you must remember: whatever embedding model you use to capture your data, use the same
model to transform your input after commissioning the solution, and use the same model to do a search
later (we close with a small sketch of this below). As you can see, there is not much involved for you to use
word embeddings. However, below the surface of the exposed API there is a lot going on. Many a time,
word embeddings are something that researchers or open-source enthusiasts share with each other;
Word2Vec and GloVe come to mind when we talk about sharing embeddings. True to the title of the article,
you must now understand that you need to master the language you choose to write and communicate your
thoughts in. Incorrect sentences and poor spelling will lead to wrong embeddings, causing errors to creep
into your models. Embedding is one way to make Generative AI work in languages other than English.
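As the promised sketch of that key point: reuse the very same layers, up to the pooling step, both to index
your documents and to embed an incoming query before a similarity search. Here documents is again a
placeholder corpus, the query string is made up, and cosine similarity is just one reasonable choice of
distance.

import numpy as np

# Reuse the trained layers as the embedder for both indexing and search
embedder = tf.keras.Sequential([vectorizedText, sentencesDenseVector,
                                tf.keras.layers.GlobalAveragePooling1D()])

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

document_vectors = embedder(tf.constant(documents)).numpy()             # index once
query_vector = embedder(tf.constant(["your search phrase"])).numpy()[0]

# Rank documents by similarity to the query, embedded with the very same model
scores = [cosine_similarity(query_vector, d) for d in document_vectors]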