#Help on getting MLAgent to understand the board on a board game.

heady latch
#

Hey guys! I'm quite new to AI and training. I've been working on a digital adaptation of a board game and need to train an AI for players to play against. The game has 2 phases, a draft phase and an action phase. The AI can pick only one card, and then perform an action and/or a power. I believe I did almost everything correctly:

  • Making the AI pick a card based on game status and rewarding it for picking a good card.
  • Making the AI select an action and rewarding it when it picks a valid action.
  • Letting the AI choose whether to end its turn or change action mid-play, and rewarding or penalizing it based on the remaining possible moves.
  • Letting the AI perform an action on the board, rewarding it if it targets a valid tile and penalizing it if it tries an invalid tile.
  • Ending the episode after the AI finishes its turn in the action phase, so it can understand that an episode consists of these 2 phases.

With all this in mind, when I put the AI to train it runs quite well and can finish a match in under 2 seconds. The problem is that it doesn't seem to learn from its mistakes on the board, and it doesn't seem to understand my board correctly.

The main board is a 9x11 grid, which we'll use as the default, but the size will vary in later updates, so I need this value to be mostly dynamic. The AI needs to understand the board and the pieces placed on it, but I can't seem to get it to. My board class returns a Dictionary<Vector2, Tile> that maps each position to its tile information.

I tried adding each tile in the board as an observation like this:

  foreach (var tile in tiles)
  {
    sensor.AddObservation(new Vector3(tile.posX, tile.posY, (int) tile.pieceType));
  }

For decisions, in OnActionReceived() the AI returns 2 discrete actions for the board position:

  int boardX = actions.DiscreteActions[2];
  int boardY = actions.DiscreteActions[3];
  Vector2 boardPos = new Vector2(boardX, boardY);
  // After: Try all tiles until he tries a possible tile and reward it.

What am I doing wrong, and/or what should I do so the AI can understand the board?

#

This is my config.yaml:

default_settings:
    trainer_type: ppo
    hyperparameters:
      batch_size: 32
      buffer_size: 256
      learning_rate: 0.0001
      beta: 0.005
      epsilon: 0.2
      lambd: 0.95
      num_epoch: 3
      learning_rate_schedule: constant
    network_settings:
      normalize: true
      hidden_units: 256
      num_layers: 4
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.9
        strength: 1.0
    keep_checkpoints: 5
    checkpoint_interval: 2000
    max_steps: 500000000
    time_horizon: 32
    summary_freq: 2000
    threaded: true

behaviors:
  EveEasy:
    trainer_type: ppo
  EveMedium:
    trainer_type: ppo
  EveHard:
    trainer_type: ppo

pliant rain
#

that's quite a range of observations and actions. it's tough to say much without more info on what is "wrong" with the AI learning, but are you masking anything during play to try to minimize the possible actions etc?
also normalizing the observations might be a good idea
if you want the board observations array to change in size you can use a buffer sensor
is the agent learning via self play?

heady latch
#

The agent trains alone.

#

I'll try the buffer sensor, but I don't know if I quite understand it. Should the observable size and max num observables be the board's width multiplied by its height?
If I have a 15x15 board should I just put 225?

#

I also tried making a custom sensor called board sensor, but I'm too much of a noob to understand any of it, so I tried replicating part of the Match3 sensor, which somehow got the board working as an observation.

pliant rain
#

observable size is the size of each bit of data sent in, i.e. if you send a bunch of single observations it'll be observable size 1 with max num obs being however many times you want to send that.
if you want to send it all as 1 observation it's the other way around etc
so in your case it could make sense to send each grid space as a single observation of size 3 (because it's a Vector3), then max num of obs would be however many grid spaces your biggest board size has.
you assign the sensor obs in the code just like you have now except using the buffer sensor component (you need to reference this manually)
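
as a rough sketch of the above, assuming a hypothetical `BoardEncoding` helper (not part of ML-Agents) and made-up field names `boardSensor`, `boardWidth`, `boardHeight` and `pieceTypeCount`: each tile becomes one 3-float entry, scaled to 0..1 so that different board sizes give comparable values. with the Buffer Sensor component set to Observable Size 3 and Max Num Observables set to the biggest board's tile count (225 for 15x15), each entry would go through `BufferSensorComponent.AppendObservation` instead of `sensor.AddObservation`.

```csharp
using System;

// Hypothetical helper, not part of ML-Agents: turns one tile into the
// 3-float entry a buffer sensor with Observable Size = 3 would accept.
public static class BoardEncoding
{
    // Scales x, y and the piece type into the 0..1 range.
    public static float[] EncodeTile(int x, int y, int pieceType,
                                     int width, int height, int pieceTypeCount)
    {
        return new[]
        {
            Scale(x, width),
            Scale(y, height),
            Scale(pieceType, pieceTypeCount),
        };
    }

    // Maps an index in [0, count - 1] to [0, 1]; a count of 1 maps to 0.
    private static float Scale(int index, int count)
    {
        return count > 1 ? (float)index / (count - 1) : 0f;
    }
}

// Assumed usage inside the Agent (needs Unity and the ML-Agents package;
// field names are illustrative):
//
//   public BufferSensorComponent boardSensor; // assigned in the Inspector
//
//   public override void CollectObservations(VectorSensor sensor)
//   {
//       foreach (var tile in tiles)
//       {
//           boardSensor.AppendObservation(BoardEncoding.EncodeTile(
//               tile.posX, tile.posY, (int)tile.pieceType,
//               boardWidth, boardHeight, pieceTypeCount));
//       }
//   }
```

piece type is a category, so a one-hot encoding (with a larger Observable Size) may work better. the single scaled float here just keeps the size at 3.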

#

for a board game i would recommend looking into self-play and letting the agent train against itself. don't try to micromanage the rewards so much, and limit the valid actions by masking etc
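
one way the masking could look, as a sketch only: the `TileActions` helper and the `tileBranch`, `boardWidth`, `boardHeight` and `board.IsValidTile` names are assumptions, not from the original code. separate X and Y branches can't be masked per tile, because validity depends on the pair, so this flattens the tile choice into a single branch of width * height indices. the flags would then be applied in the agent's `WriteDiscreteActionMask` override via `SetActionEnabled` (the ML-Agents 2.x API as far as i know, older releases mask differently).

```csharp
using System;

// Hypothetical helper, not part of ML-Agents: maps tiles to a single
// discrete branch and computes which of its actions should stay enabled.
public static class TileActions
{
    // Flattens (x, y) into one action index, row by row.
    public static int ToIndex(int x, int y, int width) => y * width + x;

    // Recovers (x, y) from an action index produced by ToIndex.
    public static (int x, int y) FromIndex(int index, int width) =>
        (index % width, index / width);

    // Returns one flag per action index; true means the tile may be chosen.
    public static bool[] BuildMask(int width, int height,
                                   Func<int, int, bool> isValidTile)
    {
        var mask = new bool[width * height];
        for (int y = 0; y < height; y++)
        {
            for (int x = 0; x < width; x++)
            {
                mask[ToIndex(x, y, width)] = isValidTile(x, y);
            }
        }
        return mask;
    }
}

// Assumed usage inside the Agent (needs Unity and the ML-Agents package;
// names are illustrative):
//
//   public override void WriteDiscreteActionMask(IDiscreteActionMask actionMask)
//   {
//       bool[] mask = TileActions.BuildMask(boardWidth, boardHeight, board.IsValidTile);
//       for (int i = 0; i < mask.Length; i++)
//       {
//           actionMask.SetActionEnabled(tileBranch, i, mask[i]);
//       }
//   }
```

the branch size would have to match the biggest board, with indices outside the current board kept disabled, and at least one action in the branch has to stay enabled. with invalid tiles masked out, the reward/penalty for picking a valid tile is no longer needed.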