nikisweeting 1 day ago [-]
We can definitely make harder evals; the problem is that a good eval set is indistinguishable from good training data / market edge, so no one is incentivized to share their best eval sets publicly.
WarmWash 1 day ago [-]
Start front-loading the models with 5k, 10k, 50k, or 100k tokens of messy, quasi-related context, and then run the benchmarks.
These models are ridiculously powerful with a blank slate. It's when they get loaded down with all the necessary (and inevitably unnecessary) context to complete the task that they really start to crumble and fold.
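A sketch of what that sweep could look like, in Python; the filler corpus, token counter, model call, and grading function here are all placeholders, not any real eval harness:

    import random

    # Hypothetical helper: prepend quasi-related filler documents to a
    # benchmark question until the prompt reaches roughly `target_tokens`.
    # `count_tokens` is a stand-in for whatever tokenizer you use.
    def pad_with_distractors(question, filler_docs, target_tokens, count_tokens):
        docs = list(filler_docs)
        random.shuffle(docs)
        padding, used = [], 0
        for doc in docs:
            if used >= target_tokens:
                break
            padding.append(doc)
            used += count_tokens(doc)
        return "\n\n".join(padding + [question])

    # Run the same task at several context budgets and compare scores:
    # for budget in (0, 5_000, 10_000, 50_000, 100_000):
    #     prompt = pad_with_distractors(task.question, corpus, budget, count_tokens)
    #     score = grade(model(prompt), task.answer)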
jballanc 1 day ago [-]
We need benchmarks that can distinguish between continuous learning and long-context extrapolation.
vrighter 4 hours ago [-]
Oh, that's easy: continuous learning is not something current architectures can do, so the benchmark for that can be done mentally.
UltraSane 1 day ago [-]
This is the least true thing ever. All LLMs are terrible at ARC-AGI-3. Every video game can be used as a benchmark. You could rank LLMs on how long they can keep a game of Dwarf Fortress running or how fast they can beat GTA5.
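A back-of-the-envelope harness for that kind of ranking might look like the sketch below; the game object with reset()/step() and the model with choose_action() are made up for illustration and don't map to any real Dwarf Fortress or LLM interface:

    import time

    # Score a model by how long it keeps a game run alive.
    def survival_time(model, game, max_steps=10_000):
        state = game.reset()
        start = time.monotonic()
        steps = 0
        for steps in range(1, max_steps + 1):
            action = model.choose_action(state)  # e.g. an LLM picking a legal in-game action
            state, alive = game.step(action)
            if not alive:
                break
        return steps, time.monotonic() - start

    # Rank models by median survival over several seeded runs.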
ttoinou 1 day ago [-]
We already have specialized AI to play video games.
UltraSane 1 day ago [-]
We are talking about LLMs. A true AGI would be able to beat every video game.
conception 1 day ago [-]
Until Arc-Battletoads is passed I’m not buying it.