OpenAI o1 Results on ARC-AGI-Pub

187 points by z7 4 days ago | 125 comments

In my opinion this blog post is a little bit misleading about the difference between o1 and earlier models. When I first heard about ARC-AGI (a few months ago, I think) I took a few of the ARC tasks and spent a few hours testing all the most powerful models. I was kind of surprised by how completely the models fell on their faces, even with heavy-handed feedback and various prompting techniques. None of the models came close to solving even the easiest puzzles. So today I tried again with o1-preview, and the model solved (probably the easiest) puzzle without any kind of fancy prompting:

https://chatgpt.com/share/66e4b209-8d98-8011-a0c7-b354a68fab...

Anyways, I'm not trying to make any grand claims about AGI in general, or about ARC-AGI as a benchmark, but I do think that o1 is a leap towards LLM-based solutions to ARC.

kobe_bryant 4 days ago | root | parent | next |

So it gives you the wrong answer and then you keep telling it how to fix it until it does? What does fancy prompting look like then, just feeding it the solution piece by piece?

killthebuddha 4 days ago | root | parent |

Basically yes, but there's a very wide range of how explicit the feedback could be. Here's an example where I tell gpt-4 exactly what the rule is and it still fails:

https://chatgpt.com/share/66e514d3-ca0c-8011-8d1e-43234391a0...

and an example using gpt-4o:

https://chatgpt.com/share/66e515da-a848-8011-987f-71dab56446...

I'd share similar examples using claude-3.5-sonnet but I can't figure out how to do it from the claud.ai ui.

To be clear, my point is not at all that o1 is so incredibly smart. IMO the ARC-AGI puzzles show very clearly how dumb even the most advanced models are. My point is just that o1 does seem to be noticeably better at solving these problems than previous models.

rahimnathwani 4 days ago | root | parent | next |

The easiest way I know of to share Claude chats is by using this Chrome extension to create a GitHub gist:

https://chromewebstore.google.com/detail/claudesave/bmdnfhji...

It's not perfect, but works fine for chats that don't have tables.

> where I tell gpt-4 exactly what the rule is and it still fails

It figured out the rule itself. It has problems applying the rule.

In this example btw, asking it to write a program will solve the problem.

seaal 4 days ago | root | parent | prev |

All examples are 404'd for me.

killthebuddha 4 days ago | root | parent | next |

Hmm. My first thought was that I shared non-public links, but I double-checked I can access them from another machine.

stingraycharles 4 days ago | root | parent |

FYI They load fine for me.

seaal 2 days ago | root | parent |

Yeah seems to just be an issue with my Firefox configuration -- works fine on Edge.

Zr01 4 days ago | root | parent | prev |

The pages fail to load on old web browsers.

Author here. Which aspects are misleading? How can it be improved?

killthebuddha 4 days ago | root | parent | next |

I think the post is great, clear and fair and all that. And I definitely agree with the general point that o1 shows some amount of improvement on generality but with a massive tradeoff on cost.

I'm going to think through what I find "misleading" as I write this...

Ok so I guess it's that I wouldn't be surprised at all if we learn that models can improve a ton w.r.t. human-in-the-loop prompt engineering (e.g. ChatGPT) without a commensurate improvement in programmatic prompt engineering.

It's very difficult to get a Python-driven claude-3.5-sonnet agent to solve ARC tasks and it's also very difficult to get claude-3.5-sonnet to solve ARC tasks via the claude.ai UI. The blog post shows that it's also very difficult to get a Python-driven o1-preview agent to solve ARC tasks. From a cursory exploration of o1-preview's capabilities in the ChatGPT UI my intuition is that it's significantly smarter than claude-3.5-sonnet based on how much better it responds to my human-in-the-loop feedback.

So I guess my point is that many people will probably come away from the blog post thinking "there's nothing to see here", o1-preview is more of the same thing, but it seems to me that it's very clearly qualitatively different than previous models.

Aside: This isn't a problem with the blog post at all IMO, we don't need to litter every benchmark post with a million caveats/exceptions/disclaimers/etc.

wokwokwok 4 days ago | root | parent | prev |

I think the parent post is complaining that insufficient acknowledgement is given to how good o1 is, because in their contrived testing, it seems better than previous models.

I don’t think that’s true though, it’s hard to be more fair and explicit than:

> OpenAI o1-preview and o1-mini both outperform GPT-4o on the ARC-AGI public evaluation dataset. o1-preview is about on par with Anthropic's Claude 3.5 Sonnet in terms of accuracy but takes about 10X longer to achieve similar results to Sonnet.

Ie. it’s just not that great, and it’s enormously slow.

That probably wasn’t what people wanted to hear, even if it is literally what the results show.

You cant run away from the numbers:

> It took 70 hours on the 400 public tasks compared to only 30 minutes for GPT-4o and Claude 3.5 Sonnet.

(Side note: readers may be getting confused about what “test-time scaling” is, and why that’s important. TLDR: more compute is getting better results at inference time. That’s a big deal, because previously, pouring more compute at inference time didn’t seem to make much real difference; but overall I don’t see how anything you’ve said is either inaccurate or misleading)

mikeknoop 4 days ago | root | parent | next |

I personally am slightly surprised at o1's modest performance on ARC-AGI given the large leaps in performance on other objectively hard benchmarks like IOI and AIME.

Curiosity is the first step towards new ideas.

ARC Prize's whole goal is to inspire curiosity like this and to encourage more AI researchers to explore and openly share new approaches towards AGI.

What does minutes and hours even mean? Software comparison using absolute time duration is meaningless without a description of the system it was executed on; e.g. SHA256 hashes per second on a Win10 OS and i7-14100 processor. For a product as complex as multiuser TB-sized LLMs, compute time is dependent on everything from the VM software stack to the physical networking and memory caching architecture.

CPU/GPU cycles, FLOPs, IOPs, or even joules would be superior measurements.

achierius 4 days ago | root | parent | next |

These are API calls to a remote server. We don't have the option of scaling up or even measuring the compute they use to run them, so for better or worse the server cluster has to be measured as part of their model service offering.

HeatrayEnjoyer 2 days ago | root | parent |

I understand that, but that's only useful if you're only looking at it from a shallow business perspective.

tsimionescu 4 days ago | root | parent | prev |

You're right about local software comparisons, but this is different. If I'm comparing two SaaS platforms, wall clock time to achieve a similar task is a fair metric to use. The only caveat is if the service offers some kind of tiered performance pricing, like if we were comapring a task performed on an AWS EC2 instance vs Azure VM instance, but that is not the case with these LLMs.

So yes, it may be that the wall clock time is not reflective of the performance of the model, but it is reflective of the performance of the SaaS offerings.

I mean, “scaling compute at inference” actually means “using an LLM agent system.” Haven’t we known that chains of agents can be useful for a while?

killthebuddha 4 days ago | root | parent | prev |

I agree with basically everything you said but I think you've misunderstood my point. I'll reply to the other comment with more.

Both Chat GPT 4o and Claude 3.5 can trivially solve this puzzle if you direct them to do program synthesis to solve it. (that is write a program that solves it - e.g. https://pastebin.com/wDTWYcSx).

Without program synthesis (the way you are doing it), the LLM inevitably fails to change the correct position (bad counting and what not)

riku_iki 4 days ago | root | parent |

and what prompt you gave them to generate program? Did you tell explicitly that they need to fill cornered cells? If yes, it is not what benchmark is about. Benchmark is to ask LLM to figure out what is the pattern.

I entered task to Claude and asked to write py code, and it failed to recognize pattern:

To solve this puzzle, we need to implement a program that follows the pattern observed in the given examples. It appears that the rule is to replace 'O' with 'X' when it's adjacent (horizontally, vertically, or diagonally) to exactly two '@' symbols. Let's write a Python program to solve this:

usaar333 4 days ago | root | parent |

arc reasoning challenge. I'm going to give you 2 example input/output pairs and then a third bare input. Please produce the correct third output.

It used its COT to understand cornering -- then I got it to write a program.

But as I try again, it's not reliable.

exe34 4 days ago | root | parent | next |

> But as I try again, it's not reliable.

this is why I will never try anything like this on a remote server I don't control. all my toy experiments are with local llms that I can make sure are the same ones day after day.

4 days ago | root | parent | prev |

[deleted]

riku_iki 4 days ago | root | parent | prev |

Interesting part if you check CoT output, the way it solved: it said the pattern is to make number of filled cells even in each row with neat layout, which is interesting side effect, but not what task was about.

It is also referring on some "assistant", looks like they have some mysterious component in addition to LLM, or another LLM.

Stevvo 4 days ago | prev | next |

"Greenblatt" shown with 42% in the bar chart is GPT-4o with a strategy: https://substack.com/@ryangreenblatt/p-145731248

So, how well might o1 do with Greenblatt's strategy?

mikeknoop 4 days ago | root | parent |

I bet pretty well! Someone should try this. It's likely expensive but sampling could give you confidence to keep going. Ryan's approach costs about $10k to run the full 400 public eval set at current 4o prices -- which is the arbitrary limit we set for the public leaderboard.

w4 4 days ago | prev | next |

> o1's performance increase did come with a time cost. It took 70 hours on the 400 public tasks compared to only 30 minutes for GPT-4o and Claude 3.5 Sonnet.

Sheesh. We're going to need more compute.

typon 4 days ago | root | parent | next |

Polar icecaps shuddering at the thought

asimpleusecase 4 days ago | root | parent |

That is the next major challenge. Ok you can solve a logic puzzle with a gilzillon watts now go power that same level of compute with a cheese burger, or if you are vegan a nice salad.

Davidzheng 4 days ago | root | parent | prev |

Intelligence is something that gets monotone easier as compute increases and trivial at the large compute limit (for instance can brute force simulate a human at large enough compute). So increasing compute is the most sure way to ensure success at reaching above human level intelligence (agi)

etrautmann 4 days ago | root | parent | next |

This is…highly speculative and fairly ridiculous to anyone who’s attempted to do so

Davidzheng 4 days ago | root | parent |

I'm giving a proof of a theoretical fact not saying it's feasible

komali2 4 days ago | root | parent |

proof + fact, and theoretical, are very different words, I'm really confused by your meaning here

>Intelligence is something that gets monotone easier as compute increases and trivial at the large compute limit (for instance can brute force simulate a human at large enough compute)

It gets monotone easier but the increase can be so slow that even using all the energy in the observable universe wouldn't make a meaningful difference, e.g. for problems in the exponential complexity class.

How does one "brute force simulate a human"? If compute is the limiting factor, then isn't it currently possible to brute force simulate a human, just extremely slowly?

tomohelix 4 days ago | root | parent | next |

I guess technically, one can try to simulate every single atoms and their interactions with each others to get this result.

However, considering how many atoms there are in a cubic foot of meat, this isn't very possible even with current compute. Even trying to solve a PDE with, I don't know, 1e7 factors, is already a hard to crack issue although technically, it is computable.

Now take that to the number of atoms in a meatbag and you quickly see why it is pointless to put any effort into this "extremely slowly" way.

black_knight 4 days ago | root | parent | next |

We have no way of knowing the initial conditions for this (position etc of each fundamental particle in any brain), even if we assume that we have a good enough grasp on fundamental physics to know the rules.

trehalose 4 days ago | root | parent | prev |

But if we had enough compute, it'd be trivial, right? I mean, I didn't think so, but the guy I replied to seems to know so. No, in all seriousness, I realize that "extremely slowly" is an understatement.

In davidzheng's defense, I assume he likely meant a higher-level simulation of a human, one designed to act indistinguishably from an atom-level simulation.

I just think calling that "trivial with enough compute" is mistaking merely having the materials for having mastered them.

Something something monkey at a typewriter writing Shakespeare

Davidzheng 4 days ago | root | parent | next |

This is a more water tight proof of the same fact (so we don't have to argue about physics)

azan_ 4 days ago | root | parent |

It's not a proof at all.

rrrix1 4 days ago | root | parent | prev |

Get out of my head!

bufferoverflow 4 days ago | root | parent | prev |

Human brain has 1000 trillion synapses between 68 billion neurons. What are you going to simulate them on?

And it's not like you can copy brain's connectivity exactly. Such technologies don't exist.

zxexz 4 days ago | root | parent |

I have a computer like that, embedded in my head even! It's good for real-time simulation, but has trouble simulating the same human from even a couple weeks before.

In all seriousness, it's simultaneously wondrous and terrifying to imagine the hypothetical tooling needed for such a simulation.

Now is a good time to spend with families and do work that feels satisfying. Much change is coming.

4 days ago | root | parent | prev |

[deleted]

fsndz 4 days ago | prev | next |

As expected, I've always believed that with the right data allowing the LLM to be trained to imitate reasoning, it's possible to improve its performance. However, this is still pattern matching, and I suspect that this approach may not be very effective for creating true generalization. As a result, once o1 becomes generally available, we will likely notice the persistent hallucinations and faulty reasoning, especially when the problem is sufficiently new or complex, beyond the "reasoning programs" or "reasoning patterns" the model learned during the reinforcement learning phase. https://www.lycee.ai/blog/openai-o1-release-agi-reasoning

skepticATX 4 days ago | root | parent | next |

My feeling is that this is one reason they decided to hide the reasoning tokens.

fsndz 4 days ago | root | parent |

yes indeed

So basically it's a kind of overfitting with pattern matching features? This doesn't undermine the power of LLMs but it is great to study their limitations.

poopiokaka 4 days ago | root | parent | prev |

“As expected I’m right”

fsndz 3 days ago | root | parent |

shouldn't I expect to be right when I have a thesis ? doesn't mean I can't see when I am wrong.

GaggiX 4 days ago | prev | next |

It really shows how far ahead Anthropic is/was when they released Claude 3.5 Sonnet.

That being said, the ARC-agi test is mostly a visual test that would be much easier to beat when these models will truly be multimodal (not just appending a separate vision encoder after training) in my opinion.

I wonder what the graph will look like in a year from now, the models have improved a lot in the last one.

threeseed 4 days ago | root | parent |

> I wonder what the graph will look like in a year from now, the models have improved a lot in the last one.

Potentially not great.

If you look at the AIME accuracy graph on the OpenAI page [1] you will notice that the x-axis is logarithmic. Which is a problem because (a) compute in general has never scaled that well and (b) semiconductor fabrication will inevitably get harder as we approach smaller sizes.

So it looks like unless there is some ground-breaking research in the pipeline the current transformer architecture will likely start to stall out.

[1] https://openai.com/index/learning-to-reason-with-llms/

accountnum 4 days ago | root | parent | next |

It's not a problem, because the point at which we are in the logarithmic curve is the only thing that matters. No one in their right mind ever expected anything linear, because that would imply that creating a perfect oracle is possible.

More compute hasn't been the driving factor of the last developments, the driving factor has been distillation and synthetic data. Since we've seen massive success with that, I really struggle to understand why people continue to doomsay the transformer. I hear these same arguments year after year and people never learn.

GaggiX 4 days ago | root | parent | prev |

I'm very optimistic about it because native multimodal LLMs have hardly been explored.

Also in general, I have yet to see these models plateau, Claude 3.5 Sonnet is a day and night different compared to previous models.

alphabetting 4 days ago | prev | next |

This is best AGI benchmark out there in my opinion. Surprising results that underscore how good Sonnet is.

krackers 4 days ago | root | parent | next |

If ARC-AGI were a good benchmark for "AGI", then MindsAI should effectively be blowing away current frontier models by order of magnitude. I don't know what MindsAI is, but the post implies they're basically fine-tuning or using a very specific strategy for ARC-AGI that isn't really generalizable to other tasks.

I think it's a nice benchmark of a certain type of spatial/visual intelligence, but if you have a model or technique specifically fine-tuned for ARC-AGI then it's no longer A"G"I

drdeca 4 days ago | root | parent | next |

Perhaps a benchmark could be a good approximate upper bound for something without being a good approximate lower bound for that thing?

[deleted]

I clarified in a another post I mean for benchmarking standalone models, not ones fine-tuned for solving ARC

nightski 4 days ago | root | parent | prev |

I mean, there are a lot of tasks that frontier models excel at which many humans wouldn't be able to complete.

zone411 4 days ago | root | parent | prev |

Disagree. My opinion is that solving ARC-AGI won't get us any closer to AGI and it's mostly a distraction.

typon 4 days ago | root | parent | next |

I think solving ARC-AGI will be necessary but not sufficient. My bet is that the converse will not be true - a model that will be considered "AGI" but does poorly on ARC-AGI. So in that sense, I think this is an important benchmark.

ithkuil 4 days ago | root | parent |

One of the key aspects of ARC is that its testing dataset is secret.

The usefulness of the ARC challenge is to figure out how much of the "intelligence" that current models trained on the entire internet is an emergent property and true generalization or how much it is just due to the fact that the training set truly contains an unfathomable amount of examples and thus the models may surprise us with what appears to be genuine insight but it's actually just lookup + interpolation.

I mostly agree, but I think it's fair to say that ARC-AGI is a necessary but definitely not sufficient milestone when it comes to the evaluation of a purported AGI.

How so? I think if a team is fine-tuning specifically to beat ARC that could be true but when you look at Sonnet and o1 getting 20%, I think a standalone frontier model beating it would mean we are close or already at AGI.

authorfly 4 days ago | root | parent |

The creation and iteration of ARC has been designed in part to avoid this.

Francis talks in his "mid-career" work (2015-2019) about priors for general intelligence and avoiding allowing them. While he admits ARC allows for some priors, it was at the time his best reasonable human effort in 2019 to put together and extremely prior-less training set, as he explained on podcasts around that time (e.g. Lex Fridman). The point of this is that humans, with our priors, are able to reliably get the majority of the puzzles correct, and with time, we can even correct mistakes or recognise mistakes in submissions without feedback (I am expanding on his point a little here based on conference conversations so don't take this as his position or at least his position today).

100 different humans will even get very different items correct/incorrect.

The problem with AI getting 21% correct is that, if it always gets the same 21% correct, it means for 79% of prior-less problems, it has no hope as an intelligent system.

Humans on the other hand, a group of 10000 could obviously get 99% or 100% correct despite none of them having priors for all of them in all liklihood given humans don't tend to get them all right (and well - because Francis created 100% of them!).

The goal of ARC as I understood it in 2019, is not to create a single model that gets a majority correct, to show AGI, it has to be an intelligent system, which can handle prior or priorless situations, as good as a group of humans, on diverse and unseen test sets, ideally without any finetuning or training specifically on this task, at all.

From 2019 (I read his paper when it came out believe it or not!), he held a secret set that he alone has that I believe is still unpublished, and at the time the low number of items (hundreds) was designed to prevent effective finetuning(then 'training') but nowadays few shot training shows that it is clearly possible to do on-the-spot training, which is why in talks Francis gave, I remember him positing that any advanced in short term learning via examples should be ignored e.g. each example should be zero shot, which I believe is how most benchmarks are currently done. The puzzles are all "different in different ways" besides the common element of dynamic grids and providing multiple grids as input.

It's also key to know Francis was quite avant-garde in 2019: his work was ofcourse respected, but he became more prominent recently. He took a very bullish/optimistic position on AI advances at the time (no doubt based on keras and seeing transformers trained using it), but he has been proven right.

glial 4 days ago | root | parent | prev |

Is that mainly because AGI is one of those "I'll know it when I see it" things?

mrcwinn 4 days ago | prev | next |

How is Anthropic accomplishing this despite (seemingly) arriving later?What advantage do they have?

Satam 4 days ago | root | parent | next |

I think it's because OpenAI's leadership lacks good taste and talent. Realistically, they haven't shifted the needle with anything really interesting in 2 years now. They're using the inertia well but that's about it. Their model is not the best, the UI is not the best, and their pace of improvement is not great either.

falcor84 4 days ago | root | parent |

I find the chatgpt-4o advanced mode to absolutely be "really interesting". And the video input they showed in the demos (and hope would same day release) could be a real game changer. One thing I would like to try, once that's out, is to put a computer with it amongst a group of students listening to a short lecture about something outside its training set and then check how the AI does on a comprehension quiz following the lecture - my feeling is that it would do significantly better than the average human student on most subjects.

Anthropic currently does much less hype stuff comparing to openai. It's remarkable that openai was like this until the GPT-4 release, and completly changed since Sam Altman started touring countries.

changoplatanero 4 days ago | root | parent | prev |

One theory I heard is that Dario was always interested in RL whereas Ilya was interested in other stuff until more recently. So Anthropic could have had an earlier start on some of this latest RL stuff.

fancyfredbot 4 days ago | prev | next |

I found the level headed explanation of why log linear improvements in test score with increased compute aren't revolutionary the best part of this article. That's not to say the rest wasn't good too! One of the best articles on o1 I've read.

benreesman 4 days ago | prev | next |

The test you really want is the apples-to-apples comparison between GPT-4o faced with the same CoT and other context annealing that presumably, uh, Q* sorry Strawberry now feeds it (on your dime). This would of course require seeing the tokens you are paying for instead of being threatened with bans for asking about them.

Compared to the difficulty in assembling the data and compute and other resources needed to train something like GPT-4-1106 (which are staggering), training an auxiliary model with a relatively straightforward, differentiable, well-behaved loss on a task like "which CoT framing is better according to human click proxy" is just not at that same scale.

Terretta 4 days ago | prev | next |

TL;DR (direct quote):

“In summary, o1 represents a paradigm shift from "memorize the answers" to "memorize the reasoning" but is not a departure from the broader paradigm of fitting a curve to a distribution in order to boost performance by making everything in-distribution.”

“We still need new ideas for AGI.”

sashank_1509 4 days ago | root | parent |

This sounds very fair, but I think fundamentally humans memorize reasoning a lot more than you’d expect. A spark of inspiration is not memorized reasoning, but not many people can claim to enjoy that capability.

ec109685 4 days ago | prev | next |

Why is this considered such a great AGI test? It seems possible to extensively train a model on the algorithms used to solve these cases, and some cases feel beyond what a human could straightforwardly figure out.

isotypic 4 days ago | root | parent | next |

Do you have some examples of ones you found beyond what a human could straightforwardly figure out? I tried a bunch and they all seemed reasonable, so I would be interested in seeing - I didn't try all 400, for obvious reasons, so I don't doubt there are difficult ones.

I think regardless one of the reasons people are interested in it is that is a fairly simple logic puzzle - given some examples, extrapolate a pattern, execute the pattern - that humans achieve high accuracy on (a study linked on the website has ~84% accuracy for humans, some more recent study seems to put it closer to 75%). Yet ML approaches have yet to reach that level, in contrast to other problems ML has been applied to.

Given there is a large prize pool for the challenge, I would imagine actually training a model in the way you describe would already have been tried and is more difficult that it seems.

ec109685 4 days ago | root | parent | next |

I realize I didn’t scroll to other examples for one I found very hard.

I guess the question is whether someone who solves this will have cracked AGI as a necessary precondition or like other Turing tests that have been solved, someone will find a technique that isn’t broadly applicable to general intelligence.

4 days ago | root | parent | prev |

[deleted]

I think huge advantage is that they keep eval tests private, so corps can't finetune them to model and claim breakthrough, which possibly happened with many other benchmarks.

visarga 4 days ago | root | parent | prev |

There is a hidden test set with new puzzle types not seen in the open part. It's designed so that humans do well and AI models have a hard time.

YeGoblynQueenne 4 days ago | root | parent |

"Designed" is not right. What gives "AI models" (i.e. deep neural nets) a hard time is that there are very few examples in the public training and evaluation set: each task has three examples. So basically it's not a test of intelligence but a test of sample efficiency.

Besides which, it is unfair because it excludes an entire category of systems, not to mention a dominant one. If F. Chollet really believes ARC is a test of intelligence, then why not provide enough examples for deep nets or some other big data approach to be trained effectively? The answer is: because a big data approach would then easily beat the test. But if the test can be beaten without intelligence, just with data, then it's not a test of intelligence.

My guess for a long time has been that ARC will fall just like the Winograd Schema challenge (WSC) [1] fell: someone will do the work to generate enough (tens of thousands) examples of ARC-like tasks, then train a deep neural net and go to town. That's what happened with the WSC. A large dataset of Winograd schema sentences was crowd-sourced and a big BERT-era Transformer got around 90% accuracy on the WSC [2]. Bye bye WSC, and any wishful thinking about Winograd schemas requiring human intuition and other undefined stuff.

Or, ARC might go the way of the Bongard Problems [3]: the original 100 problems by Bongard still stand unsolved, but the machine learning community has effectively sidestepped them. Someone made a generator of Bongard-like problems [4], and while this was not enough to solve the original problems, everyone simply switched to training CNNs and reporting results on the new dataset [5].

We basically have no idea how to create a test for intelligence that computers cannot beat by brute force or big data approaches so we have no effective way to test computers for (artificial) intelligence. The only thing we know humans can do that computers can't is identify undecidable problems (like Barber Paradoxes i.e. statements of the form "this sentence is false", as in Gödel's second incompleteness theorem). Unfortunately we already know there is no computer that can ever do that, and even if we observe say ChatGPT returning the right answer we can be sure it has only memorised, not calculated it, so we're a bit stuck. ARC won't get us unstuck in any way shape or form and so it's just a distraction.

_____________________

[1] https://en.wikipedia.org/wiki/Winograd_schema_challenge

[2] WinoGrande: An Adversarial Winograd Schema Challenge at Scale

https://arxiv.org/abs/1907.10641

Although note the results are interpreted to mean LLMs are more or less memorising answers, which is right of course.

[3] Index of Bongard Problems

https://www.foundalis.com/res/bps/bpidx.htm

[4] Comparing machines and humans on a visual categorization test

https://www.pnas.org/doi/abs/10.1073/pnas.1109168108

[5] 25 years of CNNs: Can we compare to human abstraction capabilities?

https://arxiv.org/abs/1607.08366

visarga 4 days ago | root | parent | next |

> "Designed" is not right. What gives "AI models" (i.e. deep neural nets) a hard time is that there are very few examples in the public training and evaluation set

No, he actually made a list of cognitive skills humans have and is targeting them in the benchmark. The list of "Core Knowledge Priors" contains Object cohesion, Object persistence, Object influence via contact, Goal-directedness, Numbers and counting, Basic geometry and topology. The dataset is fit for human ease of solving, but targets areas hard for AI.

> "A typical human can solve most of the ARC evaluation set without any practice or verbal explanations. Crucially, to the best of our knowledge, ARC does not appear to be approachable by any existing machine learning technique (including Deep Learning), due to its focus on broad generalization and few-shot learning, as well as the fact that the evaluation set only features tasks that do not appear in the training set."

YeGoblynQueenne 3 days ago | root | parent |

Thanks, I know about the core knowledge priors, and François Chollet's claims about them (I've read his white paper, although it was long, and long-winded and I don't remember most of it). The empirical observation however is that none of the systems that have positive performance on ARC, on Kaggle or the new leaderboard, have anything to do with core knowledge priors. Which means core knowledge priors are not needed to solve any of the so-far solved ARC tasks.

I think Chollet is making a syllogistic error:

  a) Humans have core knowledge priors and can solve ARC tasks
  b) Some machine X can solve ARC tasks
  c) Therefore machine X has core knowledge priors

That doesn't follow; and like I say it is refuted by empirical observations, to boot. This is particularly so for his claim that ARC "does not appear approachable" (what) by deep learning. Plenty of neural-net based systems on the ARC-AGI leaderboard.

There's also no reason to assume that core knowledge priors present any particular difficulty to computers (i.e. that they're "hard for AI"). The problem seems to be more with the ability of humans to formalise them precisely enough to be programmed into a computer. That's not a computer problem, it's a human problem. But that's common in AI. For example, we don't know how to hand-code an image classifier; but we can train very accurate ones with deep neural nets. That doesn't mean computers aren't good at image classification: they are; CNNs to the proof. It's humans who suck at coding it. Except nobody's insisting on image classification datasets with only three or four training examples for each class, so it was possible to develop those powerful deep neural net classifiers. Chollet's choice to only allow very few training examples is creating an artificial data bottleneck that does not restrict anyone in the real world so it tells us nothing about the true capabilities of deep neural nets.

Cthulhu. I never imagined I'd end up defending deep neural nets...

I have to say this: Chollet annoys me mightily. Every time I hear him speak, he makes gigantic statements about what intelligence is, and how to create it artificially, as if he knows what dozens of thousands of researchers in biology, cognitive science, psychology, neuroscience, AI, and who knows what other field, don't. That is despite the fact that he has created just as many intelligent machines as everyone else so far, which is to say: zero. Where that self-confidence comes from, I have no idea, but the results on his "AIQ test" indicate he, just like everyone else, has no clue what intelligence is, yet he persists with the absurd self-assurance. Insufferable arrogance.

Apologies for the rant.

> My guess for a long time has been that ARC will fall just like the Winograd Schema challenge (WSC) fell: someone will do the work to generate enough (tens of thousands) examples of ARC-like tasks, then train a deep neural net and go to town.

I think that this would be the real AGI (or even superintelligence hurdle) - having essentially a metacognitive AI understand that something given to it is a novel problem, for which it would use the given examples to automatically generate synthetic data sets and then train itself (or a subordinate model) based on these examples to gain the skill of solving this general type of problem.

> The only thing we know humans can do that computers can't is identify undecidable problems (like Barber Paradoxes i.e. statements of the form "this sentence is false", as in Gödel's second incompleteness theorem). Unfortunately we already know there is no computer that can ever do that

Where did the "ever" come from? Why wouldn't future computers be able to do this at (least at) a human level?

YeGoblynQueenne 4 days ago | root | parent |

The "ever" comes from the Church-Turing thesis. Maybe in the future computers will not be Turing machines, but that we can't know yet.

falcor84 3 days ago | root | parent |

That refers to the unsolvability of the general case, which is of course also unsolvable by humans.

YeGoblynQueenne 2 days ago | root | parent | next |

I'm sorry, I'm not sure what "unsolvability" means. What I'm saying above is that humans can identify undecidable statements, i.e. we can recognise them as undecidable. If we couldn't, Gödel, Church, and Turing would not have a proof. But we can. We just don't do it algorithmically, obviously- because there is no algorithm that can do that, and so no computer that can, either.

falcor84 2 days ago | root | parent |

But that's the thing, humans can't do it either, except only in some very specific simple cases. We're not magical; if we had a good way of doing it, we could implement it as an algorithm, but we don't.

There's a nice discussion of it in this CS Stack Exchange thread: https://cs.stackexchange.com/questions/47712/why-can-humans-...

YeGoblynQueenne 2 days ago | root | parent |

I'm confused by the discussion in your link. It starts out about decidability and it soon veers off into complexity, e.g. a discussion about "efficiently" (really, cheaply) solving NP-complete instances with heurstics etc.

In any case, I'm not claiming that humans can decide the truth or falsehood of undecidable statements, either in their special or general cases. I'm arguing that humans can identify that such a statement is undecidable. In other words, we can recognise them as undecidable, without having to decide their truth values.

For example, "this statement is false" is obviously undecidable and we don't have to come up with an algorithm to try and decide its truth value before we can say so. So it's an identification problem that we solve, not a decision problem. But a Turing machine can't do that, either: it has to basically execute the statement before it can decide it's undecidable. The only alternative is to rely on patter recognition, but that is not a general solution.

Another thing to note is that statements of the form "this sentence is false" are undecidable even given infinite resources (it'd be better to refer to Turing's Halting Problem examples here but I need a refresher on that). In the thread you link to, someone says that the problem in the original question (basically higher-order unification) can be decided in a finite number of steps. I think that's wrong but in any case there is no finite way in which "this sentence if false" can be shown to be true or false algorithmically.

I think you're arguing that we can't solve the identification problem in the general case. I think we can, because we can do what I describe above: we can solve, _non-algorithmically_ and with finite resources, problems that _algorithmically_ cannot be solved with infinite resources. Turing gives more examples, as noted. I don't know how it gets more general than that.

falcor84 12 hours ago | root | parent |

Sorry, but I really don't understand your claim. You say

> The only alternative is to rely on patter recognition, but that is not a general solution.

but then you say

> we can solve, _non-algorithmically_ and with finite resources, problems that _algorithmically_ cannot be solved with infinite resources

How do you propose that we humans solve these problem in a way that both isn't algorithmic and isn't reducible to pattern recognition? Because I'm pretty sure it's always one of these.

YeGoblynQueenne 11 hours ago | root | parent |

The first sentence you quote refers to machines, not humans, and machines must be somehow programmed with, or learn patterns from data, that's why I say that it's not a general solution - because it's restricted by the data available.

I don't know how humans do it. Whatever we do it's something that is not covered by our current understanding of computation and maybe even mathematics. I suspect that in order to answer your question we need a new science of computation and mathematics.

I think that may sound a bit kooky but it's a bit late and I'm a bit tired to explain so I guess you'll have to suspect I'm just a crank until I next find the energy to explain myself.

3 days ago | root | parent | prev |

[deleted]

og_kalu 4 days ago | root | parent | prev |

>We basically have no idea how to create a test for intelligence that computers cannot beat by brute force or big data approaches.

Agreed

>so we have no effective way to test computers for (artificial) intelligence.

I never quite understand stances like this considering evolutionary human intelligence is exactly the consequence of incredible brute force and scale. Why is the introduction of brute force suddenly something that means we cannot 'truly' test for intelligence in machines ?

YeGoblynQueenne 3 days ago | root | parent |

This is my entire quote:

>> We basically have no idea how to create a test for intelligence that computers cannot beat by brute force or big data approaches so we have no effective way to test computers for (artificial) intelligence.

When I say "brute force" I mean an exhaustive search of some large search space, in real time, not in evolutionary time. For example, searching a very large database for an answer, rather than computing the answer. But, as usual, I don't understand the point you're trying to make and where the bit about evolution came from. Can you clarify?

Btw, three requests, so we can have a productive conversation with as little time wasted in misunderstandings as possible:

a) Don't Fisk me (https://www.urbandictionary.com/define.php?term=Fisking).

b) Don't quote my words out of context.

c) If you don't understand why I say something, just ask.

og_kalu 2 days ago | root | parent |

Ok. I guess i misunderstood you then. I didn't mean to quote you out of context.

I just meant the human brain is the result of brute force. Evolution is a dumb biological optimizer whose objective function is to procreate. It's not search exactly but well then neither is brute force of Modern NNs.

YeGoblynQueenne 2 days ago | root | parent |

OK, I see what you mean and thank you for the clarification.

So I think here you're mainly talking about the process by which artificial intelligence can be achieved. I don't disagree that, in principle, it should be possible to do this by some kind of brute-force, big-data optimisation programme. There is such a thing as evolutionary computation and genetic algorithms, after all. I think it's probably unrealistic to do that in practice, at least in other than evolutionary time scales, but that's just a hunch and not something I can really support with data, like.

But what I'm talking about is testing the intelligence of such a system, once we have it. By "testing" I mean to things: a) detecting that such a system is intelligent in the first place, and, b) measuring its intelligence. Now ARC-AGI muddles the waters a bit because it doesn't make it clear what kind of test of intelligence it is, a detecting kind of test or a measuring kind of test; and Chollet's white paper that introduced ARC is titled "On the measure of intelligence" which further confuses the issue: does he assume that there already exist artificially intelligent systems, so we don't have to bother with a detection kind of test? Having read his paper, I retain an impression that the answer is: no. So it's a bit of a muddle, like I say.

In any case, to come back to the brute force issue: I assume that, by brute force approaches we can solve any problem that humans solve, presumably using our intelligence, but without requiring intelligence. And that makes it very hard to know whether a system is intelligent or not just by looking at how well it does in a test, e.g. an IQ test for humans, or ARC, etc.

Seen another way: the ability to solve problems by brute force is a big confounding factor when trying to detect the presence of intelligence in an artificial system. My point above is that we have no good way to control for this confounder.

The question that remains is, I think, what counts as "brute force". As you say, I also don't think of neural net inference as brute force. I think of neural net training as brute force, so I'm muddling the issue a bit myself, since I said I'm talking about testing the already-trained system. Let's say that by "brute force" I mean a search of a large combinatorial space carried out at inference time, with or without heuristics to guide it. For example, minimax (as in Deep Blue) is brute force, minimax with a neural-net learned evaluation function (as in AlphaGo and friends) is brute force, AlphaCode, AlphaProof and similar approaches (generating millions of candiates and filtering/ ranking) is brute force, SAT-Solving is brute force, searching for optimal plans is brute force. What is not brute force? Well, for example, SLD-Resolution is not brute force because it's a proof procedure, arithmetic is not brute force because there's algorithms, boolean logic is not brute force, etc. I think I'm arguing that anything for which we have an algorithm that does not require a huge amount of computational power is not brute force, and I think that may even be an intuitive definition. Or not?

og_kalu 2 days ago | root | parent |

Thanks. I understand your position much better now. Your examples of non-brute force are fair enough. But it begs the question. Is it even possible to build/create a self-learning system (that starts from scratch) without brute force? Like we can get a calculator to perform the algorithms for addition but how would we get one to learn how to add from scratch without brute force ? This isn't even about NNs vs GOFAI vs Biology. I mean, how do you control for a variable that is always part of the equation ?

YeGoblynQueenne 12 hours ago | root | parent |

I don't know the answer to that. The problem with brute force approaches to learning is that it might take a very large amount of search, over a very large amount of time, to get to a system that can come up with arithmetic on its own.

Honestly I have no answer. I can see the problems with, essentially, scaling up, but I don't have a solution.

a_wild_dandan 4 days ago | prev | next |

This tests vision, not intelligence. A reasoning test dependent on noisy information is borderline useless.

falcor84 4 days ago | root | parent |

What's noisy about it? The input matrix is discrete and converting it into any sort of structured input is trivial.

lossolo 4 days ago | prev | next |

It seems like o1 is a lot worse than Claude on coding tasks https://livebench.ai

perching_aix 4 days ago | prev | next |

Is it possible for me, a human, to undertake these benchmarks?

terhechte 4 days ago | root | parent |

There's examples on the homepage, and there's a link to the Kaggle notebook in the article.

https://arcprize.org

Alifatisk 4 days ago | prev | next |

This is a great marketing for Anthropic

meowface 4 days ago | prev | next |

Takeaway:

>o1-preview is about on par with Anthropic's Claude 3.5 Sonnet in terms of accuracy but takes about 10X longer to achieve similar results to Sonnet.

Scores:

>GPT-4o: 9%

>o1-preview: 21%

>Claude 3.5 Sonnet: 21%

>MindsAI: 46% (current highest score)

GaggiX 4 days ago | root | parent | next |

The takeaway is also that o1-preview is a major improvement compare to GPT-4o.

Anthropic is just ahead.

disgruntledphd2 4 days ago | root | parent | next |

It's a little embarrassing for OpenAI though?

dr_dshiv 4 days ago | root | parent |

I think they’d expect as much from Dario. He designed GPT3…

True. I've updated my post to include some of the scores.

Alifatisk 4 days ago | root | parent | prev |

How the hell is Anthropic this far ahead? I am yet impressed

There were rumors that 3.5 Sonnet heavily used synthetic data for training, in the same way that OpenAI plans to use o1 to train Orion. Maybe this confirm it?

who knows what MindsAI is?

4 days ago | root | parent | prev |

[deleted]

4 days ago | prev | next |

[deleted]

bulbosaur123 4 days ago | prev | next |

Ok, I have a practical question. How do I use this o1 thing to view codebase for my game app and then simply add new features based on my prompts? Is it possible rn? How?

devit 4 days ago | prev |

Am I missing something or this "ARC-AGI" thing is so ludicrously terrible that it seems to be completely irrelevant?

It seems that the tasks consists of giving the model examples of a transformation of an input colored grid into an output colored grid, and then asking it to provide the output for a given input.

The problem is of course that the transformation is not specified, so any answer is actually acceptable since one can always come up with a justification for it, and thus there is no reasonable way to evaluate the model (other than only accepting the arbitrary answer that the authors pulled out of who knows where).

It's like those stupid tests that tell you "1 2 3 ..." and you are supposed to complete with 4, but obviously that's absurd since any continuation is valid given that e.g. you can find a polynomial that passes for any four numbers, and the test maker didn't provide any objective criteria to determine which algorithm among multiple candidates is to be preferred.

Basically, something like this is about guessing how the test maker thinks, which is completely unrelated to the concept of AGI (i.e. the ability to provide correct answers to questions based on objectively verifiable criteria).

And if instead of AGI one is just trying to evaluate how the model predicts how the average human thinks, then it makes no sense at all to evaluate language model performance by performance on predicting colored grid transformations.

For instance, since normal LLMs are not trained on colored grids, it means that any model specifically trained on colored grid transformations as performed by humans of similar "intelligence" as the ARC-"AGI" test maker is going to outperform normal LLMs at ARC-"AGI", despite the fact that it is not really a better model in general.

YeGoblynQueenne 4 days ago | root | parent |

No no, that's not right. They're not asking for specific solutions. Any transformation of one grid to another will do.

devit 4 days ago | root | parent |

What?

They say: "ARC-AGI tasks are a series of three to five input and output tasks followed by a final task with only the input listed. Each task tests the utilization of a specific learned skill based on a minimal number of cognitive priors.

Tasks are represented as JSON lists of integers. These JSON objects can also be represented visually as a grid of colors using an ARC-AGI task viewer.

A successful submission is a pixel-perfect description (color and position) of the final task's output."

As far as I can tell, they are asking to reproduce exactly the final task's output.

YeGoblynQueenne 3 days ago | root | parent |

What they mean by "specific learned skill" is that each task illustrates the use of certain "core knowledge priors" that François Chollet has claimed are necessary to solve said tasks. You can find this claim in Chollet's white paper that introduced ARC, linked below:

On the Measure of Intelligence

https://arxiv.org/abs/1911.01547

"Core knowledge priors" are a concept from psychology and cognitive science as far as I can tell.

To be clear, other than Chollet's claim that the "core knowledge priors" are necessary to solve ARC tasks, as far as I can tell, there is no other reason to assume so and every single system that has posted any above-0% results so far does not make any attempt to use that concept, so at the very least we can know that the tasks solved so far do not need any core knowledge priors to be solved.

But, just to be perfectly clear: when results are posted, they are measured by simple comparison of the target output grids with the output grids generated by a system. Not by comparing the method used to solve a task.

Also, if I may be critical: you can find this information all over the place online. It takes a bit of reading I suppose, but it's public information.