A Standard Dataset

When doing machine learning, it helps to use a standardized dataset so that methods can be compared objectively. In machine vision, one of the earliest standard datasets is MNIST, a set of handwritten digits that was also used in what is arguably the first deep learning paper.

We should define such a dataset for the world of chess programming as we try to improve training algorithms for our new neural-network-based chess engines. This blog post introduces one: the CCRL Dataset (the name being a huge hint as to where it comes from).

Introducing the CCRL Dataset

This dataset was constructed from the combined CCRL 40/40 and 40/4 data. It consists of 2'500'000 games, 20% of which form the test set and 80% the training set. You can download the dataset in PGN format (539M) and v3 format (11G).
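An 80/20 split like this can be made reproducible by hashing a stable game identifier instead of drawing random numbers, so the same game always lands in the same set. The sketch below is only an illustration of that idea (the `is_test_game` helper is hypothetical, not the script actually used to build the dataset):

```python
import hashlib

def is_test_game(game_id: str, test_fraction: float = 0.20) -> bool:
    """Deterministically assign a game to the test set by hashing its
    identifier, so the split is reproducible across runs."""
    digest = hashlib.sha256(game_id.encode()).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < test_fraction

# Example: split a list of game identifiers roughly 80/20.
games = [f"game-{i}" for i in range(10000)]
test = [g for g in games if is_test_game(g)]
train = [g for g in games if not is_test_game(g)]
```

Because the assignment depends only on the identifier, adding new games later never moves an old game between the two sets.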

This figure shows the distribution of game-over types within the test set. Games with more than 500 plies have been excluded from the figure to keep it readable, so the game count shown is slightly below 500'000 (0.02% ignored). The double bands for black/white wins distinguish wins by checkmate from wins by resignation. White won 38% of the games, black won 30%, and 32% were drawn. Finally, ~86% of the positions in the test set are unique (including history planes).
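Those win/loss/draw percentages can be recomputed from the PGN download by tallying the Result tags. A minimal stdlib-only sketch (the regex-based parsing here is just an illustration, not the tooling used to produce the figure):

```python
import re
from collections import Counter

def result_distribution(pgn_text: str) -> Counter:
    """Tally game outcomes from the [Result "..."] tags in PGN text."""
    return Counter(re.findall(r'\[Result "(1-0|0-1|1/2-1/2)"\]', pgn_text))

# Tiny inline example standing in for the real PGN file.
sample = "\n".join([
    '[Result "1-0"]',
    '[Result "1/2-1/2"]',
    '[Result "0-1"]',
    '[Result "1-0"]',
])
dist = result_distribution(sample)
```

Dividing each count by the total number of games gives the percentages quoted above.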

Baseline results

For training a simple baseline network, the following YAML scheme was used as input to train.py:

%YAML 1.2
---
name: '128x10-base'
gpu: 0

dataset:
  num_chunks: 2500000
  train_ratio: 0.80
  input_train: '/home/fhuizing/Downloads/chess/computer/cclr/v3/train/'
  input_test: '/home/fhuizing/Downloads/chess/computer/cclr/v3/test/'

training:
  batch_size: 1024
  num_batch_splits: 1
  test_steps: 1000
  train_avg_report_steps: 500
  total_steps: 200000
  checkpoint_steps: 10000
  shuffle_size: 250000
  lr_values:
    - 0.1
    - 0.01
    - 0.001
    - 0.0001
  lr_boundaries:
    - 80000
    - 140000
    - 180000
  policy_loss_weight: 1.0
  value_loss_weight: 0.01
  path: '/mnt/storage/home/fhuizing/chess/networks'

model:
  filters: 128
  residual_blocks: 10
...
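The lr_values/lr_boundaries pair describes a piecewise-constant learning-rate schedule: 0.1 until step 80000, then 0.01 until 140000, and so on. Assuming those common step-schedule semantics, it can be sketched as:

```python
import bisect

def learning_rate(step, lr_values, lr_boundaries):
    """Piecewise-constant schedule: use lr_values[i] until the step
    reaches lr_boundaries[i], then drop to the next value."""
    return lr_values[bisect.bisect_right(lr_boundaries, step)]

# Values from the config above.
values = [0.1, 0.01, 0.001, 0.0001]
boundaries = [80000, 140000, 180000]
```

With these numbers the final drop to 0.0001 happens at step 180000, which matches the lr=0.0001 seen in the training logs near step 200000.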

This run resulted in an accuracy of 47.0583%, a policy loss of 1.591 and an MSE loss of 0.10882. The network can be downloaded as ccrl-baseline.pb.gz, and the TensorBoard graphs as leelalogs-base.tgz.
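The policy and MSE losses above are combined according to the policy_loss_weight and value_loss_weight fields in the config. As a rough sketch of that weighting (a hypothetical helper; the exact scaling inside the training code, e.g. of the regularization term, may differ):

```python
def total_loss(policy_loss, mse_loss, reg_loss=0.0,
               policy_weight=1.0, value_weight=0.01):
    """Combine loss components with the weights from the training config.
    Hypothetical helper for illustration; the real training code may
    scale terms differently."""
    return policy_weight * policy_loss + value_weight * mse_loss + reg_loss

# Baseline numbers reported above: policy 1.591, mse 0.10882.
combined = total_loss(1.591, 0.10882)
```

The small value_loss_weight of 0.01 means the policy head dominates the gradient in this supervised setting.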

Potential ideas

For inspiration, here's a list of ideas where using such a dataset may be useful:
  • Testing a multigpu training algorithm
  • Different neural network architectures
  • Different input encoding (e.g. removing history planes)
  • Different move encoding on the policy head
  • Finishing resigned or adjudicated games with n-man tablebases to get more endgame data
And much more...

Final notes

This dataset is very different from our self-play runs, which improve over time through reinforcement learning by sliding a training window across the vast number of games produced by clients as new networks are trained. As such, only a subset of ideas/parameters can be tested using this data. Still, it's probably safe to say one should never regress in performance on this dataset.

Have fun experimenting and please share your results (good or bad, as both can be very useful)!


  1. Look forward to testing it. Thanks

  2. Nice idea! I think having some sort of baseline data should be very useful.

  3. Looks interesting, but I don't really understand this text, in spite of the fact that I have followed this project for months. Could you please write it such that more people are able to understand what this means?
    Especially please write explicitly that this is about supervised learning if that is correct.
    And if the network weights produced can be loaded into lc0 to play games then please write that. After all this is a new kind of network made of games played between chess engines.

    1. This is indeed a dataset that can be used for supervised learning in order to improve our training algorithms. The weights can be loaded directly into the lc0 binary. You can download the baseline weights from http://data.lczero.org/files/ccrl-baseline.pb.gz.

  4. Some basic information on supervised learning: https://github.com/dkappe/leela-chess-weights/wiki/Supervised-Learning

I’ll say that the CCRL dataset doesn’t have enough “dirt” in it. It gives you nets that perform poorly with material imbalance, especially in the endgame (also, you can get to 60%+ accuracy with 5 or so CLR two-peak taper runs).

I’d suggest adding some prepared Kingbase games with resignations played out by SF.

    1. The question is whether having "dirt" is required for good method comparison. Having a dataset that produces an excellent network right out of the gate might not be desirable. Or it might, it's unclear to me at this point.

This dataset is not meant to produce a brilliant neural network for playing.

  5. So this dataset is not relevant for the self-training pipeline, which is the main effort of this project?

Training data needs policy targets. In normal self-play, policy comes from the actual number of visits in search. For games played by other engines there was no search by Leela, so we have nowhere to take policy values from. So what do we have for policy values in the v3 files? Do we have junk policy values in the training data for this case?

  7. Finished training run with SKIP=4 per Discord suggestion to shorten running time on my PC.
    Currently running match between base and skip4 nets.
    Training results:
    step 199500, lr=0.0001 policy=1.60005 mse=0.110815 reg=0.126045
    total=1.73053 (257.39 pos/s)
    step 200000, lr=0.0001 policy=1.60208 mse=0.110331 reg=0.126035
    total=1.73253 (2741.92 pos/s)
    step 200000, policy=1.595 training accuracy=46.9636%, mse=0.108899
    Model saved in file: v3/networks/128x10-base/128x10-base-200000
    saved as 'v3/networks/128x10-base/128x10-base-200000.pb.gz' 11.29M
    Weights saved in file: v3/networks/128x10-base/128x10-base-200000
    saved as 'v3/networks/ccrl128x10.txt' 11.29M

  8. After 200 games skip4 is clearly weaker.

    Rank Name    Elo    +    -  games  score  oppo.  draws
       1 Base     43   42   42    200    62%    -43    23%
       2 Skip4   -43   42   42    200    38%     43    23%

            Ba  Sk
    Base        99
    Skip4    0

    Next is running a "sanity check" with base vs base, which should be even.

  9. Worst project ever: I was not able to train the network and could not figure out what the problems were. After 80 hrs of work, it was all wasted as nothing came out of it. I am stuck here. The number of evaluation batches depends on chunk size / batch_size.
    Instructions for updating:

    Future major versions of TensorFlow will allow gradients to flow
    into the labels input on backprop by default.

    See `tf.nn.softmax_cross_entropy_with_logits_v2`.

    Using 115 evaluation batches

  10. Can you clarify the results for me? What percentage of the moves in the test set were predicted? What is the average MSE of the value head?

