Transformer Language Models

A fairly basic implementation of a transformer language model.

device = get_device()
print(device)
[12:05:37] INFO - Using device: mps
mps

Data formatting

  • https://buomsoo-kim.github.io/attention/2020/04/21/Attention-mechanism-19.md/
# dataset = datasets.load_dataset('wikitext', 'wikitext-2-raw-v1')
data = Path('../data/text/tiny_shakespeare.txt').read_text()
tokenizer = CharTokenizer.from_text(data)
tokenized = tokenizer.encode(data)
ds = SimpleCharDataset(tokenized, context_length=8)
x,y = ds[0]
print(tokenizer.decode(x), tokenizer.decode(y))
First Ci irst Cit
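
CharTokenizer and SimpleCharDataset are small helpers from this repo. As a rough sketch of what they presumably do (the actual implementations may differ), the tokenizer maps each unique character to an integer id and back, and the dataset yields windows of ids whose target is the input shifted one character to the right:

import torch
from torch.utils.data import Dataset

class CharTokenizerSketch:
    "Character-level tokenizer built from the unique characters in a text (sketch)."
    def __init__(self, chars):
        self.stoi = {c: i for i, c in enumerate(chars)}
        self.itos = {i: c for i, c in enumerate(chars)}

    @classmethod
    def from_text(cls, text):
        return cls(sorted(set(text)))

    def encode(self, text):
        return [self.stoi[c] for c in text]

    def decode(self, ids):
        return ''.join(self.itos[int(i)] for i in ids)

class SimpleCharDatasetSketch(Dataset):
    "Yields (x, y) pairs where y is x shifted one position to the right (sketch)."
    def __init__(self, ids, context_length):
        self.ids = torch.tensor(ids, dtype=torch.long)
        self.context_length = context_length

    def __len__(self):
        return len(self.ids) - self.context_length

    def __getitem__(self, i):
        x = self.ids[i : i + self.context_length]
        y = self.ids[i + 1 : i + 1 + self.context_length]
        return x, y

The printed pair above ("First Ci" / "irst Cit") shows exactly this one-character shift: each position's target is the next character.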

Attention

Imagine you’re at position “it” in:

“The cat sat on the mat because it was tired.”

  • Q vector for “it” says “I need an antecedent”.
  • It matches strongest with K vector from “cat”.
  • So V vector from “cat” is heavily weighted in the output for “it” (see the sketch below).
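
Concretely, each query is compared against every key, the scores are scaled by the square root of the head size and softmaxed into weights, and those weights mix the value vectors. A minimal (single-head, non-causal) sketch of that computation:

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention_sketch(q, k, v):
    # q, k, v: (B, T, head_size)
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))  # (B, T, T): how well each query matches each key
    weights = F.softmax(scores, dim=-1)                       # each row sums to 1
    return weights @ v                                        # weighted mix of values, (B, T, head_size)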

source

AttentionHead

 AttentionHead (embed_dim, head_size, block_size, dropout)

a single self-attention head

Details

  • embed_dim: dimension of the embedding
  • head_size: size of the attention head
  • block_size: context (block) size
  • dropout: dropout rate
vocab_size = 10
batch_size = 5
embed_dim = 20
context_size = 8
dropout = 0.2
head_size = 16
# embedded input (float)
x = torch.randn(batch_size, context_size, embed_dim) #(B,T,C)
print(x.shape)
att = AttentionHead(embed_dim, head_size, context_size, dropout)
xx = att(x)
print(xx.shape) # (B, T, Head_size)
torch.Size([5, 8, 20])
torch.Size([5, 8, 16])
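
The head maps (B, T, embed_dim) to (B, T, head_size). A plausible implementation (a sketch; the repo's AttentionHead may differ in details) projects the input to queries, keys and values, applies a causal mask so each position can only attend to earlier positions, and returns the attention-weighted values:

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionHeadSketch(nn.Module):
    "One head of causal self-attention (sketch)."
    def __init__(self, embed_dim, head_size, block_size, dropout):
        super().__init__()
        self.key = nn.Linear(embed_dim, head_size, bias=False)
        self.query = nn.Linear(embed_dim, head_size, bias=False)
        self.value = nn.Linear(embed_dim, head_size, bias=False)
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                                      # x: (B, T, embed_dim)
        B, T, C = x.shape
        k, q, v = self.key(x), self.query(x), self.value(x)    # each (B, T, head_size)
        scores = q @ k.transpose(-2, -1) * k.size(-1) ** -0.5  # (B, T, T)
        scores = scores.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # causal mask
        weights = self.dropout(F.softmax(scores, dim=-1))
        return weights @ v                                      # (B, T, head_size)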

source

MultiHeadAttention

 MultiHeadAttention (num_heads, head_size, embed_dim, block_size, dropout)

multiple heads of self-attention in parallel

num_heads = 5
multi_att = MultiHeadAttention(num_heads, head_size, embed_dim, context_size, dropout)
xxx = multi_att(x)
print(xxx.shape) #B,T,C/embed_dim
torch.Size([5, 8, 20])
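
A likely structure for this module (sketch, reusing AttentionHeadSketch from above): run num_heads independent heads, concatenate their outputs along the channel dimension, and project back to embed_dim, which is why the output shape matches the input shape:

import torch
import torch.nn as nn

class MultiHeadAttentionSketch(nn.Module):
    "Several attention heads in parallel, concatenated and projected back to embed_dim (sketch)."
    def __init__(self, num_heads, head_size, embed_dim, block_size, dropout):
        super().__init__()
        self.heads = nn.ModuleList(
            [AttentionHeadSketch(embed_dim, head_size, block_size, dropout) for _ in range(num_heads)]
        )
        self.proj = nn.Linear(num_heads * head_size, embed_dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):                                    # x: (B, T, embed_dim)
        out = torch.cat([h(x) for h in self.heads], dim=-1)  # (B, T, num_heads * head_size)
        return self.dropout(self.proj(out))                  # (B, T, embed_dim)

For comparison, PyTorch's built-in nn.MultiheadAttention produces an output of the same shape when queries, keys and values all come from x: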
mha = nn.MultiheadAttention(
    embed_dim=embed_dim,
    num_heads=num_heads,
    dropout=dropout,
    batch_first=True
    )
attn_out, attn_weight = mha(x, x, x)
print(attn_out.shape)
torch.Size([5, 8, 20])
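Note that this call passes no attn_mask, so nn.MultiheadAttention attends to all positions rather than applying a causal mask, and it uses fused in-projections for Q, K and V; the output shape matches the custom module above, but the values will not.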

Feed forward


source

FeedFoward

 FeedFoward (embed_dim, dropout)

a simple linear layer followed by a non-linearity

ff = FeedFoward(embed_dim, dropout)
ff_x = ff(x)
print(ff_x.shape)
torch.Size([5, 8, 20])
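
A typical implementation of this block (sketch, following the common nanoGPT-style version, which the repo's FeedFoward presumably resembles) expands to a wider hidden layer, applies a non-linearity, projects back down and applies dropout:

import torch.nn as nn

class FeedFowardSketch(nn.Module):
    "Position-wise feed-forward network: expand, apply a non-linearity, project back (sketch)."
    def __init__(self, embed_dim, dropout):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 4 * embed_dim),  # the 4x expansion is the conventional choice
            nn.ReLU(),
            nn.Linear(4 * embed_dim, embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x):                         # (B, T, embed_dim) -> (B, T, embed_dim)
        return self.net(x)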

Block


source

TransformerBlock

 TransformerBlock (embed_dim, n_head, block_size, dropout)

Transformer block: communication followed by computation

b = TransformerBlock(embed_dim, num_heads, context_size, dropout)
bb = b(x)
print(bb.shape)
torch.Size([5, 8, 20])
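
"Communication followed by computation" means multi-head self-attention (tokens exchange information) followed by the feed-forward network (each position processed independently), each wrapped in a residual connection. A sketch, reusing the sketches above, assuming head_size = embed_dim // n_head and pre-norm layer norms as in nanoGPT (the repo's block may differ):

import torch.nn as nn

class TransformerBlockSketch(nn.Module):
    "Self-attention ('communication') then feed-forward ('computation'), with residuals and pre-norm (sketch)."
    def __init__(self, embed_dim, n_head, block_size, dropout):
        super().__init__()
        head_size = embed_dim // n_head
        self.sa = MultiHeadAttentionSketch(n_head, head_size, embed_dim, block_size, dropout)
        self.ffwd = FeedFowardSketch(embed_dim, dropout)
        self.ln1 = nn.LayerNorm(embed_dim)
        self.ln2 = nn.LayerNorm(embed_dim)

    def forward(self, x):
        x = x + self.sa(self.ln1(x))    # residual around attention
        x = x + self.ffwd(self.ln2(x))  # residual around feed-forward
        return x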

GPT model


source

GPTLanguageModel

 GPTLanguageModel (vocab_size, embed_dim, block_size, n_head, n_layer,
                   dropout)

A GPT-style, decoder-only language model built from the components above: an embedding layer, a stack of n_layer TransformerBlocks, and a language-modelling head that maps each position to vocab_size logits.

Details

  • vocab_size: number of tokens in the vocabulary
  • embed_dim: dimension of the embedding
  • block_size: maximum context length
  • n_head: number of attention heads per block
  • n_layer: number of transformer blocks
  • dropout: dropout rate

batch_size = 64 # how many independent sequences will we process in parallel?
block_size = 256 # what is the maximum context length for predictions?
eval_interval = 500
learning_rate = 3e-4
device = get_device()
eval_iters = 200
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.2
vocab_size = 46 #len(v)
m = GPTLanguageModel(vocab_size, n_embd, block_size, n_head, n_layer, dropout)
[11:25:24] INFO - Using device: mps
print(device)
mps
device = 'cpu'
m = m.to(device)
x = torch.randint(vocab_size, (batch_size, block_size)).to(device)
logits, loss = m(x)
print(logits.shape)
torch.Size([64, 256, 46])
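
Putting the pieces together, here is a sketch of how GPTLanguageModel might be implemented (assumed; the actual class may differ): token plus positional embeddings, a stack of transformer blocks, a final layer norm and a linear language-modelling head, with an optional cross-entropy loss when targets are provided and a simple autoregressive generate method:

import torch
import torch.nn as nn
import torch.nn.functional as F

class GPTLanguageModelSketch(nn.Module):
    "Decoder-only GPT-style language model built from the blocks above (sketch)."
    def __init__(self, vocab_size, embed_dim, block_size, n_head, n_layer, dropout):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, embed_dim)
        self.pos_emb = nn.Embedding(block_size, embed_dim)
        self.blocks = nn.Sequential(
            *[TransformerBlockSketch(embed_dim, n_head, block_size, dropout) for _ in range(n_layer)]
        )
        self.ln_f = nn.LayerNorm(embed_dim)
        self.lm_head = nn.Linear(embed_dim, vocab_size)

    def forward(self, idx, targets=None):                        # idx: (B, T) token ids
        B, T = idx.shape
        tok = self.token_emb(idx)                                # (B, T, embed_dim)
        pos = self.pos_emb(torch.arange(T, device=idx.device))   # (T, embed_dim)
        x = self.blocks(tok + pos)
        logits = self.lm_head(self.ln_f(x))                      # (B, T, vocab_size)
        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.view(B * T, -1), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        # sample autoregressively, feeding back at most block_size tokens of context
        for _ in range(max_new_tokens):
            idx_cond = idx[:, -self.pos_emb.num_embeddings:]
            logits, _ = self(idx_cond)
            probs = F.softmax(logits[:, -1, :], dim=-1)          # distribution over the next token
            idx = torch.cat([idx, torch.multinomial(probs, num_samples=1)], dim=1)
        return idx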
# @torch.no_grad()
# def estimate_loss():
#     out = {}
#     model.eval()
#     for split in ['train', 'val']:
#         losses = torch.zeros(eval_iters)
#         for k in range(eval_iters):
#             X, Y = get_batch(split)
#             logits, loss = model(X, Y)
#             losses[k] = loss.item()
#         out[split] = losses.mean()
#     model.train()
#     return out
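The training loop below uses get_random_batch from this repo; a sketch of what it presumably does (sample batch_size random windows of length block_size, with targets shifted one token to the right):

import torch

def get_random_batch_sketch(data, block_size, batch_size, device='cpu'):
    # data: 1-D LongTensor of token ids
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i : i + block_size] for i in ix])
    y = torch.stack([data[i + 1 : i + 1 + block_size] for i in ix])
    return x.to(device), y.to(device)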
# create a PyTorch optimizer
optimizer = torch.optim.AdamW(m.parameters(), lr=learning_rate)
acc_loss = []
max_iters = 1  # a single step just to check the loop runs; raise for real training
for iter in range(max_iters):
    # sample a batch of data (`ids` is assumed to be the full encoded text, e.g. `tokenized` above)
    xb, yb = get_random_batch(torch.LongTensor(ids), block_size, batch_size, device=device)
    # evaluate the loss
    logits, loss = m(xb.to(device), yb.to(device))
    print(loss.item())
    acc_loss.append(loss.item())
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
plt.plot(acc_loss)  # assumes matplotlib.pyplot is imported as plt
# generate from the model
context = torch.zeros((1, 1), dtype=torch.long, device=device)
# `v.itos` is assumed to map token ids back to characters; tokenizer.decode(...) would do the same job
print(''.join(v.itos(m.generate(context, max_new_tokens=50)[0].tolist())))