How to Use GPT-2 for Custom Data Generation

In this article, we take a step-by-step look at using the GPT-2 model to generate custom data, with chess as the example.

The GPT-2 is a text-generating AI system that has the impressive ability to generate human-like text from minimal prompts. The model generates synthetic text samples to continue an arbitrary text input. It is chameleon-like — it adapts to the style and content of the conditioning text. There are plenty of applications where it has shown success:

  • Text generation - GPT-2 is the almighty king of text generation.
  • Chatbots. Even the unmodified model understands that text is a dialog, like "Human: ... Bot: ..." and answers after "Bot:".
  • Machine translation. The GPT-2 model can learn translations in format "english sentence = french sentence".
  • Summarization. The model understands the "TL;DR:" tag at the end of the text as a signal to write a summary.
  • Question answering. GPT-2 can answer questions reasonably well out of the box, but for accurate results it should be fine-tuned on a QA dataset such as SQuAD.
  • Generating poetry. GPT-2 models work well for poetry. Result quality is limited because often only the smaller models are accessible, and the larger ones are hard to run at all.
  • Music generation. "Music modeling" is just like language modeling: let the model learn music in an unsupervised way, then have it sample outputs. OpenAI also has an impressive composer, MuseNet.
  • Image generation. Just as a large transformer model trained in language can generate coherent text, the same exact model trained on pixel sequences can generate coherent image completions and samples.
  • Misuse. GPT-2 models can be used to spread fake news, fake reviews, and so on. For now only people's ethics prevent this, but who knows for how long.
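
Several of the applications listed above depend only on how the conditioning text is laid out. The helpers below are a minimal sketch of those layouts. The function names are hypothetical, and the exact separators are assumptions taken from the descriptions in the list, not a fixed API.

```python
def dialog_prompt(history, user_message):
    """Lay out a chat as 'Human: ... / Bot: ...' lines and leave 'Bot:' open."""
    lines = []
    for human, bot in history:
        lines.append("Human: " + human)
        lines.append("Bot: " + bot)
    lines.append("Human: " + user_message)
    lines.append("Bot:")  # the model is expected to continue from here
    return "\n".join(lines)


def translation_prompt(examples, sentence):
    """Few-shot 'english sentence = french sentence' pairs, last one left open."""
    lines = [en + " = " + fr for en, fr in examples]
    lines.append(sentence + " =")
    return "\n".join(lines)


def summary_prompt(text):
    """Append the 'TL;DR:' tag that signals a summary should follow."""
    return text.rstrip() + "\nTL;DR:"
```

Each helper returns a plain string that would be passed to the model as the prompt; the generated continuation is whatever follows the open tag.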

Although larger NLP models have since been released, such as T5 from Google and GPT-3 from OpenAI, GPT-2 remains one of the most capable models that can still run on average hardware.

Some ML researchers tried to evaluate GPT-2 in a very unusual application.

To confirm that GPT-2 is a general pattern-recognition program, ML researcher Shawn Presser (@theshawwn) trained GPT-2 to play chess using solely PGN files. His progress is documented online. The model showed an ability to recognize known chess patterns, which makes it a great example of GPT-2 in action. Today, we are going to find out how to use GPT-2 by following this example.

We liked the idea of evaluating GPT-2 not only on natural language generation but also on other tasks, such as chess. We thought that training the model on the current board state was more promising than training on PGN move sequences. In our view, when the current board state is used, the game history is largely unnecessary for predicting the next move.

With PGN files, by contrast, the whole history matters. Inspired by Shawn's "Cryochess – GPT-2 1.5B chess engine" trained on PGN notation, and by "Programming a Chess Player" by Professor Blank, we wrote the following code to train GPT-2 to play chess using the current board state, or FEN.

Here are some useful resources if you are new to Chess:

1. Preparation

1.1. Dependencies

You should have an ML-ready system with a good GPU, CUDA 10.1, TensorFlow 2, and PyTorch installed.

It is recommended to use a virtual environment; read here how to set it up. Also, your JupyterLab should have an extension installed, as specified here.

We will work in Python, so you'll need to install the python-chess [4] and aitextgen [5] modules:

!pip install python-chess
!pip install aitextgen
!pip install tqdm

1.2. PGN files download

import os
if not os.path.exists("pgn"):
    os.mkdir("pgn")

Download PGN game files to the pgn folder. Some resources with PGN files are:

You can also use SCID to convert SCID databases (*.sg4) to PGN format.

We've used PGN archives with 100,000 games inside for this training. Please note that importing numerous games (like a million) requires a lot of RAM to start training.
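
Before importing, it can help to know how many games a PGN file holds without loading it into memory. The sketch below streams the file and counts "[Result" header lines, the same heuristic the import function uses for its progress bar. The helper name `count_games` and its `limit` parameter are hypothetical.

```python
def count_games(path, limit=None):
    """Count games by counting '[Result ' header lines, reading line by line."""
    total = 0
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.startswith("[Result "):
                total += 1
                # stop early once the optional cap is reached
                if limit is not None and total >= limit:
                    break
    return total
```

Because the file is read one line at a time, memory use stays flat regardless of archive size; only the import itself needs the RAM mentioned above.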

2. Training data generation

Our GPT-2 model is trained on a file with the current board state and the next move on each line, in the following format:

[Result] FEN-position-and-side-only - next_move

Example:

[1-0] r1bq3k/ppp2rpp/5b2/3n4/3P4/P4p2/BP1B1PPP/R2QR1K1 w - a2d5
[1-0] 2bQ4/p4kb1/6n1/q1p1p3/1rn1P3/N3BP2/1PP5/2KR2R1 w - a3c4
[0-1] 1r3rk1/p4nbp/1qppb1p1/4p3/PP2P3/4NN1P/2QB1PP1/2R1R1K1 b - f8c8
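
To make the line format concrete, here is a minimal sketch that splits one training line back into its fields. The `Sample` type and `parse_line` helper are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    result: str      # "1-0" or "0-1"
    placement: str   # piece placement field of the FEN
    side: str        # "w" or "b", the side to move
    move: str        # next move in UCI notation


def parse_line(line):
    """Split '[Result] placement side - move' into its four fields."""
    head, move = line.strip().rsplit(" - ", 1)
    result, placement, side = head.split(" ")
    return Sample(result.strip("[]"), placement, side, move)


sample = parse_line("[1-0] r1bq3k/ppp2rpp/5b2/3n4/3P4/P4p2/BP1B1PPP/R2QR1K1 w - a2d5")
print(sample.side, sample.move)
```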

We used only games that ended in a win and skipped draws; from each such game, only the winner's moves are written. The function to generate training text is:


import os
from tqdm.auto import tqdm
import glob
import chess.pgn

MAX_IMPORT = 100000

def importPgn(filename, s, max_import):
    counter = 0
    total = 0
    with open(filename) as f:
        for line in f:
            if "[Result" in line:
                total += 1
    if total > max_import:
        total = max_import
    pbar = tqdm(total=total, desc="read " + filename, unit=" games", mininterval=1)

    pgn = open(filename)
    while counter < max_import:
        game = chess.pgn.read_game(pgn)
        if not game:
            break
        board = game.board()
        moves = game.mainline_moves()
        count = sum(1 for _ in moves)

        # skip unfinished games
        if count <= 5:
            continue

        result = game.headers["Result"]
        # import only decisive games (skip draws and unknown results)
        if result != "1-0" and result != "0-1":
            continue

        for move in moves:
            if board.turn == chess.WHITE and result == "1-0":
                line = (
                    "[1-0] "
                    + " ".join(board.fen().split(" ", 2)[:2])
                    + " - "
                    + move.uci()
                ).strip()
                s.add(line)
            elif board.turn == chess.BLACK and result == "0-1":
                line = (
                    "[0-1] "
                    + " ".join(board.fen().split(" ", 2)[:2])
                    + " - "
                    + move.uci()
                ).strip()
                s.add(line)

            board.push(move)

        counter += 1
        pbar.update(1)
    pgn.close()  # release the PGN file handle
    pbar.close()
    return counter


def convert():
    games = 0
    moves = 0
    max_import = MAX_IMPORT
    s = set()

    # load previous state
    if os.path.exists("fen.txt"):
        with open("fen.txt") as f:
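            for line in tqdm(f, desc="read fen.txt", unit=" moves", mininterval=1):
                # drop the trailing newline so stored lines match newly generated ones
                line = line.strip()
                if line:
                    s.add(line)
                    # note: each stored line counts against MAX_IMPORT,
                    # which otherwise counts games
                    max_import -= 1
                    if max_import <= 0:
                        break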

    for file in glob.glob("pgn/*.pgn"):
        count = importPgn(file, s, max_import)
        games += count
        max_import -= count
        if max_import <= 0:
            break

    with open("fen.txt", "w") as f:
        for line in tqdm(s, desc="write fen.txt", unit=" moves", mininterval=1):
            f.write(line + "\n")
            moves += 1
    print("imported " + str(games) + " games, " + str(moves) + " moves")


convert()

It takes about 15 minutes to import 100K games.
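
The key step inside the import loop is cutting a full FEN string down to its first two fields. The sketch below isolates that step with the standard library only; `training_line` is a hypothetical helper name, and the starting-position FEN is used purely as sample input.

```python
START_FEN = "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1"


def training_line(result, fen, uci_move):
    """Build one '[Result] placement side - move' line from a full FEN."""
    # keep only piece placement and side to move; castling rights,
    # en passant square and move counters are dropped
    placement_and_side = " ".join(fen.split(" ", 2)[:2])
    return "[" + result + "] " + placement_and_side + " - " + uci_move


print(training_line("1-0", START_FEN, "e2e4"))
```

One consequence of this design is that two positions differing only in castling rights or en passant availability map to the same training line prefix.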

3. Training

As we need only chess moves in the model's memory, we train the small GPT-2 model from scratch, as described in the aitextgen docs. The small model was selected because it can be trained on average hardware in a shorter time than larger models. A larger model probably has its own benefits, but it is overly complex for this demonstration.

You can run the training function multiple times to continue training until the loss is acceptable, since model checkpoints are saved periodically. I stopped at a loss value near 0.8 to save time, but even at that level the model can predict moves.

Tune batch_size and num_workers to fit your GPU and avoid out-of-memory (OOM) errors.

from aitextgen import aitextgen
from aitextgen.utils import build_gpt2_config
from aitextgen.TokenDataset import TokenDataset
from aitextgen.tokenizers import train_tokenizer
import os


file_name = "fen.txt"
model_dir = "trained_model"
config_file = os.path.join(model_dir, "config.json")
pytorch_model_file = os.path.join(model_dir, "pytorch_model.bin")
vocab_file = os.path.join(model_dir, "aitextgen-vocab.json")
merges_file = os.path.join(model_dir, "aitextgen-merges.txt")
dataset_cache_file = os.path.join(model_dir, "dataset_cache.tar.gz")
max_length = 100
vocab_size = 10000

def train():
    if not os.path.exists(model_dir):
        os.mkdir(model_dir)

    # train tokenizer if necessary
    if not os.path.exists(vocab_file):
        print("training tokenizer, please wait...")
        train_tokenizer(file_name, save_path=model_dir, vocab_size=vocab_size)
    
    if os.path.exists(dataset_cache_file): # use cache
        data = TokenDataset(
            dataset_cache_file,
            vocab_file=vocab_file,
            merges_file=merges_file,
            block_size=max_length,
            from_cache=True,
        )
    else: # or create token cache if necessary
        data = TokenDataset(
            file_name,
            vocab_file=vocab_file,
            merges_file=merges_file,
            block_size=max_length,
            line_by_line=True,
            save_cache=True,
            cache_destination=dataset_cache_file
        )

    if not os.path.exists(pytorch_model_file):
        config = build_gpt2_config(
            vocab_size=vocab_size,
            max_length=max_length,
            dropout=0.0,
            n_embd=512,
            n_head=16,
            n_layer=16,
        )

        ai = aitextgen(
            config=config, vocab_file=vocab_file, merges_file=merges_file, to_gpu=True
        )
    else:
        ai = aitextgen(
            model=pytorch_model_file,
            config=config_file,
            vocab_file=vocab_file,
            merges_file=merges_file,
            to_gpu=True
        )

    ai.train(
        data,
        num_steps=150000,
        generate_every=1000,
        save_every=1000,
        learning_rate=1e-4,
        batch_size=16,
        num_workers=4,
    )

train()

This training run takes about 8 hours. To get a well-trained model, you'll need a few days.
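
To get a feel for how small this model is, the parameter count can be estimated from the configuration values used above. This is a rough sketch under stated assumptions: a standard GPT-2 layout with a 4x-wide MLP, tied input and output embeddings, and a position table whose size equals max_length. The helper name `gpt2_param_count` is hypothetical.

```python
def gpt2_param_count(vocab_size, n_positions, n_embd, n_layer):
    """Estimate trainable parameters of a standard GPT-2 with tied embeddings."""
    # token embeddings plus position embeddings
    embeddings = vocab_size * n_embd + n_positions * n_embd
    # fused q/k/v projection plus attention output projection (weights + biases)
    attention = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # two-layer MLP with hidden width 4 * n_embd (assumed default)
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    # two layer norms per block, each with a scale and a shift vector
    layer_norms = 2 * 2 * n_embd
    per_layer = attention + mlp + layer_norms
    final_norm = 2 * n_embd
    return embeddings + n_layer * per_layer + final_norm


params = gpt2_param_count(vocab_size=10000, n_positions=100, n_embd=512, n_layer=16)
print(f"{params:,} parameters")
```

Under these assumptions the configuration comes to roughly 56 million parameters, less than half the size of the smallest published GPT-2 checkpoint (about 124 million).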

4. Evaluation

4.1. Random player

This is the simplest possible player. The function takes a list of valid moves and randomly makes a choice. It plays chess badly.

import random

def random_player(board):
    # returns (uci move, isPredicted, isKnown), the tuple play_game expects
    move = random.choice(list(board.legal_moves))
    return move.uci(), False, False

4.2. GPT-2 player

This player uses the GPT-2 model to predict the next move. The prompt for the model is constructed from the expected result (we want to win, so it is "1-0" for white and "0-1" for black), the current board state, and the side to move. The model then appends the generated next move to the prompt.

A few notes about this player:

  • It is trained on a small amount of data by ML standards, so it cannot act as a chess master.
  • You can see from the results that the model predicts moves for unknown board states that were not presented to it during training.
  • The model sometimes generates an invalid move; in that case, a random valid move is used instead.

import os
import random
from aitextgen import aitextgen
import chess
from tqdm.auto import tqdm

model_dir = "trained_model"
config_file = os.path.join(model_dir, "config.json")
pytorch_model_file = os.path.join(model_dir, "pytorch_model.bin")
vocab_file = os.path.join(model_dir, "aitextgen-vocab.json")
merges_file = os.path.join(model_dir, "aitextgen-merges.txt")
max_length = 100

ai = aitextgen(
    model=pytorch_model_file,
    config=config_file,
    vocab_file=vocab_file,
    merges_file=merges_file,
    from_cache=True,
    to_gpu=True,
    # to_fp16=True
)

# a set to find known states
db = set()
with open("fen.txt") as f:
    for line in tqdm(f, desc="read fen.txt", unit=" moves"):
        if line:
            db.add(" ".join(line.split(" ", 3)[:3]))

def gpt2_player(board):
    if board.turn == chess.WHITE:
        prompt = "[1-0] " + " ".join(board.fen().split(" ", 2)[:2])
    else:
        prompt = "[0-1] " + " ".join(board.fen().split(" ", 2)[:2])
    isKnown = prompt in db

    prediction = ai.generate_one(
        prompt=prompt,
        max_length=max_length,
        temperature=0.9,
        top_k=0,
    )
    isPredicted = False
    try:
        uci = prediction.split(' - ')[1].strip()
        move = chess.Move.from_uci(uci)
        isPredicted = True
    except Exception as e:
        # print(str(e))
        move = None
    if not move or move not in board.legal_moves:
        # give up and do random move
        move = random.choice(list(board.legal_moves))
        isPredicted = False
    return move.uci(), isPredicted, isKnown
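
The move-extraction step in gpt2_player can also be isolated and checked without a model. The sketch below is one possible stricter variant, not the code used above: it takes the first token after the " - " separator and validates its shape with a regular expression before any legality check. The names `UCI_RE` and `extract_uci` are hypothetical.

```python
import re

# shape of a UCI move: from-square, to-square, optional promotion piece
UCI_RE = re.compile(r"[a-h][1-8][a-h][1-8][qrbn]?")


def extract_uci(generated, prompt):
    """Return the move that follows 'prompt - ' in the generated text, or None."""
    if not generated.startswith(prompt):
        return None
    tail = generated[len(prompt):]
    if not tail.startswith(" - "):
        return None
    tokens = tail[3:].split()
    if not tokens:
        return None
    candidate = tokens[0]  # ignore anything generated after the move
    return candidate if UCI_RE.fullmatch(candidate) else None
```

A well-formed result still has to be checked against board.legal_moves, since a syntactically valid move may be illegal in the current position.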

4.3. Playing a Game

This function takes two players and plays a game between them.

import time
from IPython.display import display, HTML, clear_output
import chess

def who(player):
    return "White" if player == chess.WHITE else "Black"


def display_board(board, use_svg):
    if use_svg:
        return board._repr_svg_()
    else:
        return "<pre>" + str(board) + "</pre>"

    
def play_game(player1, player2, visual="svg", pause=0.1):
    """
    player1, player2: functions that take a board and return (uci move, isPredicted, isKnown)
    visual: "simple" | "svg" | None
    """
    use_svg = (visual == "svg")
    board = chess.Board()
    known1 = 0
    predicted1 = 0
    total1 = 0
    known2 = 0
    predicted2 = 0
    total2 = 0
    if visual is not None:
        # wrap in HTML so the SVG or <pre> markup is rendered, not shown as a raw string
        display(HTML(display_board(board, use_svg)))
    try:
        while not board.is_game_over(claim_draw=True):
            if board.turn == chess.WHITE:
                uci, isPredicted, isKnown = player1(board)
                total1 += 1
                if isKnown:
                    known1 += 1
                if isPredicted:
                    predicted1 += 1
            else:
                uci, isPredicted, isKnown = player2(board)
                total2 += 1
                if isKnown:
                    known2 += 1
                if isPredicted:
                    predicted2 += 1
            name = who(board.turn)
            board.push_uci(uci)
            board_stop = display_board(board, use_svg)

            html = "<b>Move %s %s, Play '%s':</b><br/>%s<br/>Known/Predicted/Total moves: %s/%s/%s %s%% - %s/%s/%s %s%%" % (
                       len(board.move_stack), name, uci, board_stop,
                           known1, predicted1, total1, round(predicted1 / (total1 or 1) * 100),
                           known2, predicted2, total2, round(predicted2 / (total2 or 1) * 100))
            if visual is not None:
                if visual == "svg":
                    clear_output(wait=True)
                display(HTML(html))
                if visual == "svg":
                    time.sleep(pause)
    except KeyboardInterrupt:
        msg = "Game interrupted!"
        return (None, msg, board)
    result = "1/2-1/2"
    if board.is_checkmate():
        msg = "checkmate: " + who(not board.turn) + " wins!"
        result = "1-0" if who(not board.turn) == "White" else "0-1"
    elif board.is_stalemate():
        msg = "draw: stalemate"
    elif board.is_fivefold_repetition():
        msg = "draw: 5-fold repetition"
    elif board.is_insufficient_material():
        msg = "draw: insufficient material"
    elif board.can_claim_draw():
        msg = "draw: claim"
    if visual is not None:
        print(msg)
    return (result, msg, board)

Let's pit gpt2_player against random_player:


play_game(gpt2_player, random_player)
pass


Move 61 White, Play 'd2d7':


Known/Predicted/Total moves: 2/29/31 94% - 0/0/30 0%


checkmate: White wins!

Interestingly, the game often ends in a stalemate. Most probably this is because the model does not look ahead: it picks the move that looks best for the current position, without analyzing what follows.

Now let's play 100 games (gpt2_player plays white):

from tqdm.auto import tqdm

plays = 100
white_wins = 0
black_wins = 0
pbar1 = None
pbar2 = None

for i in tqdm(range(plays), desc="Plays"):
    if not pbar1:
        pbar1 = tqdm(total=plays, desc="White wins")
    if not pbar2:
        pbar2 = tqdm(total=plays, desc="Black wins")
    result, _, _ = play_game(gpt2_player, random_player, visual=None)
    if result is None:
        break
    elif result == "1-0":
        white_wins += 1
        pbar1.update(1)
    elif result == "0-1":
        black_wins += 1
        pbar2.update(1)
pbar1.close()
pbar2.close()
print("Final score: %s-%s" % (white_wins, black_wins))



Final score: 52-0

Most games end in a draw or a win for gpt2_player. About half of the games ended in a checkmate by the white player controlled by GPT-2, and the overall score is decisively in its favor. An interesting observation is that the board state is almost always new to the model, yet the model produces valid moves far more often than it fails. So we can conclude that the model learned some basic patterns from the training data and uses them to predict the next move.
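
The final score can be turned into a conventional chess score, where a win counts as 1 point and a draw as half a point. This is a small worked example, assuming all 100 games finished so that every game without a winner is a draw; `summarize` is a hypothetical helper.

```python
def summarize(white_wins, black_wins, games):
    """Derive draws and White's score from win counts (win = 1, draw = 0.5)."""
    draws = games - white_wins - black_wins
    white_points = white_wins + 0.5 * draws
    return {
        "draws": draws,
        "white_points": white_points,
        "white_score_pct": 100.0 * white_points / games,
    }


print(summarize(white_wins=52, black_wins=0, games=100))
```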

4.4. A human player

This function handles human input to play:

def human_player(board):
    uci = get_move("%s's move [q to quit]> " % who(board.turn))
    legal_uci_moves = [move.uci() for move in board.legal_moves]
    while uci not in legal_uci_moves:
        print("Legal moves: " + (",".join(sorted(legal_uci_moves))))
        uci = get_move("%s's move [q to quit]> " % who(board.turn))
    return uci, True, False

def get_move(prompt):
    uci = input(prompt)
    if uci and uci[0] == "q":
        raise KeyboardInterrupt()
    try:
        chess.Move.from_uci(uci)
    except ValueError:  # not a syntactically valid UCI move
        uci = None
    return uci

Try your hand at playing chess against gpt2_player. Note that you must enter your move in UCI notation, such as "a2a4", which moves the piece on a2 to a4.

play_game(human_player, gpt2_player)
pass

Move 10 Black, Play 'b7b6':

Known/Predicted/Total moves: 0/5/5 100% - 2/5/5 100%
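
For readers unfamiliar with the UCI move notation used for input, the sketch below splits a move string into its parts using only the standard library. The helper name `split_uci` is hypothetical; python-chess does the real parsing with chess.Move.from_uci.

```python
FILES = "abcdefgh"
RANKS = "12345678"
PROMOTIONS = "qrbn"


def split_uci(uci):
    """Split a UCI move into (from_square, to_square, promotion or None)."""
    if len(uci) not in (4, 5):
        raise ValueError("UCI moves have 4 or 5 characters: " + repr(uci))
    src, dst, promo = uci[0:2], uci[2:4], uci[4:5] or None
    for square in (src, dst):
        if square[0] not in FILES or square[1] not in RANKS:
            raise ValueError("not a board square: " + repr(square))
    if promo is not None and promo not in PROMOTIONS:
        raise ValueError("not a promotion piece: " + repr(promo))
    return src, dst, promo


print(split_uci("a2a4"))
```

The optional fifth character names the piece a pawn promotes to, as in "e7e8q".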

5. Results

We applied GPT-2, a natural-language generation model, to an unusual field: generating chess moves. Although it is still far from master level, it showed an ability to learn the basics of chess. Feeding it more training data and increasing the model size should, in theory, bring the model to a higher level, but our goal was to confirm GPT-2's ability to learn and generate abstract patterns.

Interestingly, when the model plays against the random player, which moves "stupidly", the model does not play very well either, but against a human it moves more "thoughtfully". We think this is because the better you play, the more similar the board state is to states from the training data, and the more confident the model is in generating the next move.

These results suggest that, besides natural text, GPT-2 can generate other kinds of textual patterns. It is not just repeating the training data: the model finds similarities that let it deal with unseen input, as if it builds an algorithm internally. Last but not least, it can be trained on an average personal computer. Hopefully, you now have a better idea of how to use GPT-2.

The subject is open for further experiments not covered in this article:

  • Continue training the model until a lower loss is reached.
  • Use a larger model.
  • Use a larger PGN dataset with a billion or more games.
  • Predict the next two, three, or more moves.
  • Add board state and move analysis to make it more like a chess program.

The notebook is available on Google Colab. Feel free to run your own experiments.

AI Developer
An experienced full-stack software developer with engineer mentality. Dmytro codes mostly with Node.js, Python and Rust, explores and experiments with the latest technologies, such as AI and Deep Learning.