2018-08-20

Rule50 encoding bug found

We have had numerous issues with network encoding in the past, and now, after a pretty long pause, we have found yet another one! :)

It turns out that the 50-move-rule counter (the number of moves since the last capture or pawn move) was stored in the wrong place in the training data, so networks were trained without that information.

The bug has existed since the first version of lc0.exe, but was not present in lczero.exe (v0.10). That may explain the slight Elo drop when we fully switched to lc0.exe (v0.16).

This bug will be fixed in the upcoming v0.17.0.
The fix may, however, cause a slight Elo drop in networks trained after that, as they need time to adapt.


And for the curious, here is what the bug was.

In the code, the struct was:
struct V3TrainingData {
  uint32_t version;
  float probabilities[1858];
  uint64_t planes[104];
  uint8_t castling_us_ooo;
  uint8_t castling_us_oo;
  uint8_t castling_them_ooo;
  uint8_t castling_them_oo;
  uint8_t side_to_move;
  uint8_t move_count;  // Not used, always 0.
  uint8_t rule50_count;
  int8_t result;
};

Should be:
struct V3TrainingData {
  uint32_t version;
  float probabilities[1858];
  uint64_t planes[104];
  uint8_t castling_us_ooo;
  uint8_t castling_us_oo;
  uint8_t castling_them_ooo;
  uint8_t castling_them_oo;
  uint8_t side_to_move;
  uint8_t rule50_count;
  uint8_t move_count;  // Not used, always 0.
  int8_t result;
};

Spot the difference!
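
Why the field order matters: the record is dumped as raw bytes by the engine and parsed by position on the training side, so a field name means nothing once the data is on disk. The sketch below is only an illustration of that effect, not the actual lc0 or training code. The format string, the helper names `write_record` and `read_tail`, the sample values, and the assumption that the struct is packed with no padding are all mine.

```python
import struct

# Assumed packed, little-endian layout of one V3 record (no padding):
# version, 1858 floats, 104 uint64 planes, 7 uint8 fields, 1 int8 result.
RECORD_FMT = "<I1858f104Q7Bb"

COMMON = ("castling_us_ooo", "castling_us_oo",
          "castling_them_ooo", "castling_them_oo", "side_to_move")
# Order in which the buggy engine laid out the last fields.
ENGINE_TAIL = COMMON + ("move_count", "rule50_count", "result")
# Order the training code expected.
TRAINER_TAIL = COMMON + ("rule50_count", "move_count", "result")


def write_record(tail_order, **fields):
    """Pack one record, laying out the tail fields in the given order."""
    probabilities = [0.0] * 1858
    planes = [0] * 104
    tail = [fields[name] for name in tail_order]
    return struct.pack(RECORD_FMT, 3, *probabilities, *planes, *tail)


def read_tail(blob, tail_order):
    """Unpack one record and label its last values purely by position."""
    values = struct.unpack(RECORD_FMT, blob)
    return dict(zip(tail_order, values[-len(tail_order):]))


position = dict(castling_us_ooo=1, castling_us_oo=1,
                castling_them_ooo=0, castling_them_oo=0,
                side_to_move=0,
                move_count=0,      # engine zeroes the unused field
                rule50_count=37,   # the value the network should see
                result=1)

blob = write_record(ENGINE_TAIL, **position)

seen = read_tail(blob, TRAINER_TAIL)
seen["move_count"] = 0  # trainer also zeroes what it believes is unused

print(len(blob))             # size of one packed record in bytes
print(seen["rule50_count"])  # 0: the trainer reads the engine's zeroed byte
```

Reading the same bytes with `ENGINE_TAIL` instead of `TRAINER_TAIL` recovers the 37, which is all the fix amounts to: both sides have to agree on the order.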

14 comments:

  1. Is it just the order? Does that really matter?

    Replies
    1. I think the order is different in self-play data generation vs. the data reader in training. So the trainer read move_count in place of rule50_count and vice versa.

    2. If it were just the order, it indeed wouldn't be a problem.
      But sometime in May, when we decided not to use the move count plane, both the engine and the training code started to zero it explicitly.

      So what happens now is that the engine zeroes what it thinks is move_count (but in reality is rule50_count), and then the training code zeroes the correct move_count (where rule50_count is actually stored).

  2. But aren't variables in structures referenced by name, not by index?

    Replies
    1. It's written as a binary memory dump in the C++ engine code.

      The training code is Python, which loads that binary blob and parses it using struct.unpack(). And there the fields were in a different order.

  3. When do you expect v0.17.0 to drop?

    Replies
    1. We aim to release it before the CCCC binary submission deadline, which is August 27th.

  4. Replies
    1. This is a bug in training data generation. It doesn't affect game play.

  5. Replies
    1. Not sure what you mean, but the answer is probably no.

  6. Any hunch as to how this may have affected playing style?

    Replies
    1. No, nothing other than random speculation.
