Each LLM is given the same 1000 chess puzzles to solve. See puzzles.csv. Benchmarked on Mar 25, 2024.

| Model | Solved | Solved % | Illegal Moves | Illegal Moves % | Adjusted Elo |
|---|---|---|---|---|---|
| gpt-4-turbo-preview | 229 | 22.9% | 163 | 16.3% | 1144 |
| gpt-4 | 195 | 19.5% | 183 | 18.3% | 1047 |
| claude-3-opus-20240229 | 72 | 7.2% | 464 | 46.4% | 521 |
| claude-3-haiku-20240307 | 38 | 3.8% | 590 | 59.0% | 363 |
| claude-3-sonnet-20240229 | 23 | 2.3% | 663 | 66.3% | 286 |
| gpt-3.5-turbo | 23 | 2.3% | 683 | 68.3% | 269 |
| claude-instant-1.2 | 10 | 1.0% | 707 | 70.7% | 245 |
| mistral-large-latest | 4 | 0.4% | 813 | 81.3% | 149 |
| mixtral-8x7b | 9 | 0.9% | 832 | 83.2% | 136 |
| gemini-1.5-pro-latest* | FAIL | - | - | - | - |
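
The percentage columns appear to be the raw counts divided by the 1000 puzzles. A minimal sketch of that arithmetic for the top three rows follows; the names `TOTAL_PUZZLES`, `results`, and `percent` are illustrative and do not come from the benchmark's own code.

```python
# Assumption: each percentage is simply count / 1000 puzzles, as the post states
# that every model received the same 1000 puzzles.
TOTAL_PUZZLES = 1000

# (model, solved, illegal_moves) copied from the first three table rows.
results = [
    ("gpt-4-turbo-preview", 229, 163),
    ("gpt-4", 195, 183),
    ("claude-3-opus-20240229", 72, 464),
]


def percent(count, total=TOTAL_PUZZLES):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Build (model, solved %, illegal moves %) tuples and print them.
rows = [(model, percent(solved), percent(illegal)) for model, solved, illegal in results]
for model, solved_pct, illegal_pct in rows:
    print(f"{model}: solved {solved_pct}%, illegal moves {illegal_pct}%")
```

The Adjusted Elo column is not reproduced here because the post does not say how it is calculated.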

Published by the CEO of Kagi!

  • conciselyverbose@sh.itjust.works
    2 months ago

    I wonder how many of the ones they “solved” were solved only because the model had seen them discussed somewhere in its training data, considering the puzzles are apparently from a public resource.

    • Blóðbók@slrpnk.net
      2 months ago

      Yeah, I don’t know why anyone knowledgeable would expect them to be good at chess. LLMs don’t generalise, reason or spot patterns, so unless they read a chess book where the problems came from…

    • Carrolade@lemmy.world
      2 months ago

      Likely close to 100%. If you read the (rather good) article, a little further down they test whether the LLM can play an extremely simple “Connect 4” game they devised, as a way of isolating reasoning ability specifically.

      It cannot.

      Chess puzzles, in particular, are frequently shared and discussed in online chess spaces, so the LLM will have a significant amount of material to work with when it tries to predict the best response to give to the prompt.