• brucethemoose@lemmy.world
    5 months ago

    I wonder what the errors were?

    This seems like a much better task for a finetune with constrained decoding (which is not possible with OpenAI), though I’m not sure how much ‘thinking’ about the function needs to be done here.
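    To illustrate the idea of constrained decoding mentioned above: at each step the decoder masks out every token the output grammar disallows before choosing among the rest, so the model can only emit well-formed annotations. The sketch below is a toy stand-in (the grammar, vocabulary, and scores are all illustrative, not from the paper or any real SPARK tool; real implementations such as GBNF grammars apply the mask to the model's logits):

    ```python
    # Toy constrained decoding: a fake "model" scores tokens, but we only
    # ever pick tokens that keep the output inside a tiny grammar.

    # Illustrative grammar: the set of well-formed outputs.
    GRAMMAR = {
        ("pragma", "Assert", "(", "X", ">", "0", ")", ";"),
        ("pragma", "Loop_Invariant", "(", "X", ">", "0", ")", ";"),
    }

    def allowed_next(prefix):
        """Tokens that extend `prefix` toward at least one grammatical output."""
        n = len(prefix)
        return {seq[n] for seq in GRAMMAR
                if len(seq) > n and seq[:n] == tuple(prefix)}

    def fake_model_scores(prefix):
        """Stand-in for LLM logits; note it prefers 'banana', which the
        grammar forbids, so the constraint visibly does its job."""
        return {"banana": 5.0, "pragma": 1.0, "Assert": 2.0,
                "Loop_Invariant": 1.5, "(": 1.0, ")": 1.0,
                ";": 1.0, "X": 1.0, ">": 1.0, "0": 1.0}

    def constrained_decode():
        out = []
        while tuple(out) not in GRAMMAR:
            scores = fake_model_scores(out)
            candidates = allowed_next(out)
            # Mask step: only grammatical continuations are eligible;
            # among those, take the highest-scoring token.
            out.append(max(candidates, key=lambda t: scores.get(t, float("-inf"))))
        return " ".join(out)

    print(constrained_decode())  # → pragma Assert ( X > 0 ) ;
    ```

    Even though the unconstrained model would happily emit "banana", the mask guarantees every output parses, which is exactly why a finetune plus grammar constraints fits annotation generation well.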

  • gnufuu@infosec.pub
    5 months ago

    The performance of Marmaragan with GPT-4o on the benchmark is promising, with correct annotations having been generated for 50.7% of the benchmark cases. The results establish a foundation for future work on combining the power of LLMs with the reliability of formal software verification.

    That’s basically a coin flip. If it were any less successful, a user would have better odds just guessing whether an annotation is true or false.