• earthworm@sh.itjust.works · 5 points · 6 hours ago (edited)

    This seems like a dumb benchmark.

    ClockBench evaluates whether models can read analog clocks - a task that is trivial for humans, but current frontier models struggle with.

    What do you mean trivial? Most humans I know can’t read the most basic white-background-big-black-numbers clocks.

    Someone rigged the jury to get 90% on this:

    • MCasq_qsaCJ_234@lemmy.zip · 1 point · 3 hours ago

      Rather, ClockBench will end up improving AI in this regard over the next few years. That's the point of any AI benchmark: identifying a model's strengths and weaknesses so they can be improved in future versions.

    • panda_abyss@lemmy.ca · 23 points · 19 hours ago

      Some of those don’t have tick marks. I hate clocks like that; they’re difficult to read.

      I’m surprised it’s near 90%; a whole generation has grown up with digital clocks everywhere.

      • MHLoppy@fedia.io · 6 points · 18 hours ago

        I really wish they’d published the whole dataset. They don’t specify on the page or in the paper what the full set looks like, and the GitHub repo only has one of the easy-to-read faces. If >=10% of the set consists of clock faces designed not to be readable, then fair enough.