OpenAI Progress (progress.openai.com)
262 points by vinhnx 16 hours ago | 220 comments
simianwords 15 hours ago [-]
My interpretation of the progress.

3.5 to 4 was the most major leap. It went from being a party trick to legitimately useful sometimes. It did hallucinate a lot but I was still able to get some use out of it. I wouldn't count on it for most things however. It could answer simple questions and get it right mostly but never one or two levels deep.

I clearly remember 4o was also a decent leap - the accuracy increased substantially. It could answer niche questions without much hallucination. I could essentially replace it with Google for basic to slightly complex fact checking.

* 4o was the first time I actually considered paying for this tool. The $20 price was finally worth it.

o1 models were also a big leap over 4o (I realise I have been saying big leap too many times but it is true). The accuracy increased again and I got even more confident using it for niche topics. I would have to verify the results much less often. Oh and coding capabilities dramatically improved here in the thinking model. o1 essentially invented oneshotting - slightly non trivial apps could be made just by one prompt for the first time.

o3 jump was incremental and so was gpt 5.

furyofantares 9 hours ago [-]
I have a theory about why it's so easy to underestimate long-term progress and overestimate short-term progress.

Before a technology hits a threshold of "becoming useful", it may have a long history of progress behind it. But that progress is only visible and felt to researchers. In practical terms, there is no progress being made as long as the thing is going from not-useful to still not-useful.

So then it goes from not-useful to useful-but-bad and it's instantaneous progress. Then as more applications cross the threshold, and as they go from useful-but-bad to useful-but-OK, progress all feels very fast. Even if it's the same speed as before.

So we overestimate short term progress because we overestimate how fast things are moving when they cross these thresholds. But then as fewer applications cross the threshold, and as things go from OK-to-decent instead of bad-to-OK, that progress feels a bit slowed. And again, it might not be any different in reality, but that's how it feels. So then we underestimate long-term progress because we've extrapolated a slowdown that might not really exist.

I think it's also why we see a divide where there's lots of people here who are way overhyped on this stuff, and also lots of people here who think it's all totally useless.

heywoods 49 minutes ago [-]
Your threshold theory is basically Amara's Law with better psychological scaffolding. Roy Amara nailed the what ("we tend to overestimate the effect of a technology in the short run and underestimate the effect in the long run") [1] but you're articulating the why better than most academic treatments. The invisible-to-researchers phase followed by the sudden usefulness cascade is exactly how these transitions feel from the inside.

This reminds me of the CPU wars circa 2003-2005. Intel spent years squeezing marginal gains out of Pentium 4's NetBurst architecture, each increment more desperate than the last. From 2003 to 2005, Intel shifted development away from NetBurst to focus on the cooler-running Pentium M microarchitecture [2]. The whole industry was convinced we'd hit a fundamental wall. Then boom, Intel released dual-core processors under the Pentium D brand in May 2005 [2] and suddenly we're living in a different computational universe.

But the multi-core transition wasn't sudden at all. IBM shipped the POWER4 in 2001, the first non-embedded microprocessor with two cores on a single die [3]. Sun had been preaching parallelism since the 90s. It was only "sudden" to those of us who weren't paying attention to the right signals.

Which brings us to the $7 trillion question: where exactly are we on the transformer S-curve? Are we approaching what Richard Foster calls the "performance plateau" in "Innovation: The Attacker's Advantage" [4], where each new model delivers diminishing returns? Or are we still in that deceptive middle phase where progress feels linear but is actually exponential?

The pattern-matching pessimist in me sees all the classic late-stage S-curve symptoms. The shift from breakthrough capabilities to benchmark gaming. The pivot from "holy shit it can write poetry" to "GPT-4.5-turbo-ultra is 3% better on MMLU." The telltale sign of technological maturity: when the marketing department works harder than the R&D team.

But the timeline compression with AI is unprecedented. What took CPUs 30 years to cycle through, transformers have done in 5. Maybe software cycles are inherently faster than hardware. Or maybe we've just gotten better at S-curve jumping (OpenAI and Anthropic aren't waiting for the current curve to flatten before exploring the next paradigm).

As for whether capital can override S-curve dynamics... Christ, one can dream. IBM torched approximately $5 billion on Watson Health acquisitions alone (Truven, Phytel, Explorys, Merge) [5]. Google poured resources into Google+ before shutting it down in April 2019 due to low usage and security issues [6]. The sailing ship effect (coined by W.H. Ward in 1967, where new technology accelerates innovation in incumbent technology) [7] is real, but you can't venture-capital your way past physics.

I think all this capital pouring into AI might actually accelerate S-curve maturation rather than extend it. All that GPU capacity, all those researchers, all that parallel experimentation? We're speedrunning the entire innovation cycle, which means we might hit the plateau faster too.

You're spot on about the perception divide imo. The overhyped folks are still living in 2022's "holy shit ChatGPT" moment, while the skeptics have fast-forwarded to 2025's "is that all there is?" Both groups are right, just operating on different timescales. It's Schrödinger's S-curve, where things feel simultaneously revolutionary and disappointing, depending on which part of the elephant you're touching.

The real question isn't whether we're approaching the limits of the current S-curve (we probably are), but whether there's another curve waiting in the wings. I'm not a researcher in this space, nor do I follow the AI research beat closely enough to weigh in, but hopefully someone in the thread can. With CPUs, we knew dual-core was coming because the single-core wall was obvious. With transformers, the next paradigm is anyone's guess. And that uncertainty, more than any technical limitation, might be what makes this moment feel so damn weird.

References:
[1] "Amara's Law" https://en.wikipedia.org/wiki/Roy_Amara
[2] "Pentium 4" https://en.wikipedia.org/wiki/Pentium_4
[3] "POWER4" https://en.wikipedia.org/wiki/POWER4
[4] Innovation: The Attacker's Advantage https://annas-archive.org/md5/3f97655a56ed893624b22ae3094116...
[5] IBM Watson Slate piece https://slate.com/technology/2022/01/ibm-watson-health-failu...
[6] "Expediting changes to Google+" https://blog.google/technology/safety-security/expediting-ch...
[7] "Sailing ship effect" https://en.wikipedia.org/wiki/Sailing_ship_effect

stavros 7 hours ago [-]
All the replies are spectacularly wrong, and biased by hindsight. GPT-1 to GPT-2 is where we went from "yes, I've seen Markov chains before, what about them?" to "holy shit this is actually kind of understanding what I'm saying!"

Before GPT-2, we had plain old machine learning. After GPT-2, we had "I never thought I would see this in my lifetime or the next two".

reasonableklout 7 hours ago [-]
I'd love to know more about how OpenAI (or Alec Radford et al.) even decided GPT-1 was worth investing more into. At a glance the output is barely distinguishable from Markov chains. If in 2018 you told me that scaling the algorithm up 100-1000x would lead to computers talking to people/coding/reasoning/beating the IMO I'd tell you to take your meds.
arugulum 2 hours ago [-]
GPT-1 wasn't used as a zero-shot text generator; that wasn't why it was impressive. The way GPT-1 was used was as a base model to be fine-tuned on downstream tasks. It was the first case of a (fine-tuned) base Transformer model just trivially blowing everything else out of the water. Before this, people were coming up with bespoke systems for different tasks (a simple example: for SQuAD, a passage-question-answering task, people would have one LSTM to read the passage and another LSTM to read the question, because of course those are different sub-tasks with different requirements and should have different sub-models). Once GPT-1 came out, you just dumped all the text into the context, YOLO fine-tuned it, and trivially got state of the art on the task. On EVERY NLP task.

Overnight, GPT-1 single-handedly upset the whole field. It was somewhat overshadowed by the BERT and T5 models that came out very shortly after, which tended to perform even better in the pretrain-and-finetune format. Nevertheless, the success of GPT-1 definitely already warranted scaling up the approach.

A better question is how OpenAI decided to scale GPT-2 to GPT-3. It was an awkward in-between model. It generated better text for sure, but the zero-shot performance reported in the paper, while neat, was not great at all. On the flip side, its fine-tuned task performance paled compared to much smaller encoder-only Transformers. (The answer is: scaling laws allowed for predictable increases in performance.)

muzani 4 hours ago [-]
I don't have a source for this (there are probably no sources for anything back then) but anecdotally, someone at an AI/ML talk said they just added more data and quality went up. Doubling the data doubled the quality. With other breakthroughs, people saw diminishing gains. It's sort of why Sam back then tweeted that he expected the amount of intelligence to double every N years.

I have the feeling they kept on this until GPT-4o (which was a different kind of data).

robrenaud 2 hours ago [-]
The input size to output quality mapping is not linear. This is why we are in the regime of "build nuclear power plants to power datacenters". Fixed size improvements in loss require exponential increases in parameters/compute/data.
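A toy sketch of that shape (the exponent is roughly in line with published compute-scaling fits, but all constants here are illustrative, not fitted):

    # Illustrative power-law scaling: loss ~ L_min + a * C^(-alpha).
    # With an exponent this small, each fixed drop in loss costs
    # roughly another order of magnitude of compute.
    def loss(compute, l_min=1.7, a=50.0, alpha=0.05):
        return l_min + a * compute ** -alpha

    for c in [1e21, 1e22, 1e23, 1e24]:
        print(f"compute={c:.0e}  loss={loss(c):.2f}")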
kevindamm 5 hours ago [-]
Transformers can train models with much larger parameter sizes compared to other model architectures (with the same amount of compute and time), so it has an evident advantage in terms of being able to scale. Whether scaling the models up to multi-billion parameters would eventually pay out was still a bet but it wasn't a wild bet out of nowhere.
stavros 7 hours ago [-]
I assume the cost was just very low? If it was 50-100k, maybe they figured they'd just try and see.
reasonableklout 7 hours ago [-]
Oh yes, according to [1], training GPT-2 1.5B cost $50k in 2019 (reproduced in 2024 for $672!).

[1]: https://www.reddit.com/r/mlscaling/comments/1d3a793/andrej_k...

stavros 7 hours ago [-]
That makes sense, and it was definitely impressive for $50k.
therein 7 hours ago [-]
Probably prior DARPA research or something.

Also, slightly tangentially: people will tell me it was new and novel and that's why we were impressed, but I almost think things went downhill after ChatGPT 3. I felt like 2.5 (or whatever they called it) was able to give better insights from the model weights itself. The moment tool use became a thing and we started doing RAG and memory and search engine tool use, it actually got worse.

I am also pretty sure we are lobotomizing the things that would feel closer to critical thinking by training it to be sensitive to the taboo of the day. I suspect earlier ones were less broken due to that.

How would it distinguish and decide between knowing something from training and needing to use a tool to synthesize a response anyway?

faitswulff 6 hours ago [-]
What you're saying isn't necessarily mutually exclusive with what GP said.

GPT-2 was the most impressive leap in terms of whatever LLMs pass off as cognitive abilities, but GPT 3.5 to 4 was actually the point at which it became a useful tool (I'm assuming to programmers in particular).

GPT-2: Really convincing stochastic parrot

GPT-4: Can one-shot ffmpeg commands

paulddraper 2 hours ago [-]
That’s true, but not contradictory.
jkubicek 15 hours ago [-]
> I could essentially replace it with Google for basic to slightly complex fact checking.

I know you probably meant "augment fact checking" here, but using LLMs for answering factual questions is the single worst use-case for LLMs.

rich_sasha 14 hours ago [-]
I disagree. Some things are hard to Google, because you can't frame the question right. For example, you know the context but can only poorly explain what you are after. Googling will take you nowhere; LLMs will give you the right answer 95% of the time.

Once you get an answer, it is easy enough to verify it.

mrandish 13 hours ago [-]
I agree. Since I'm recently retired and no longer code much, I don't have much need for LLMs, but refining a complex, niche web search is the one thing where they're uniquely useful to me. It's usually when targeting the specific topic involves several keywords which have multiple plain-English meanings that return a flood of erroneous results. Because LLMs abstract keywords to tokens based on underlying meaning, if you specify the domain in the prompt it'll usually select the relevant meanings of multi-meaning terms - which isn't possible in general-purpose web search engines. So it helps narrow down closer to the specific needle I want in the haystack.

As other posters said, relying on LLMs for factual answers to challenging questions is error prone. I just want the LLM to give me the links and I'll then assess veracity like a normal web search. I think a web search interface that allowed disambiguating multi-meaning keywords might be even better.

yojo 3 hours ago [-]
I’ll give you another use: LLMs are really good at unearthing the “unknown unknowns.” If I’m learning a new topic (coding or not) summarizing my own knowledge to an LLM and then asking “what important things am I missing” almost always turns up something I hadn’t considered.

You’ll still want to fact check it, and there’s no guarantee it’s comprehensive, but I can’t think of another tool that provides anything close without hours of research.

KronisLV 1 hour ago [-]
> For example you know context and a poor explanation of what you are after. Googling will take you nowhere, LLMs will give you the right answer 95% of the time.

This works nicely when the LLM has a large knowledgebase to draw upon (formal terms for what you're trying to find, which you might not know) or the ability to generate good search queries and summarize results quickly - with an actual search engine in the loop.

Most large LLM providers have this, even something like OpenWebUI can have search engines integrated (though I will admit that smaller models kinda struggle, couldn't get much useful stuff out of DuckDuckGo backed searches, nor Brave AI searches, might have been an obscure topic).

bloudermilk 9 hours ago [-]
If you’re looking for a possibly correct answer to an obscure question, that’s more like fact finding. Verifying it afterward is the “fact checking” step of that process.
crote 4 hours ago [-]
A good part of that can probably be attributed to how terrible Google has gotten over the years, though. 15 years ago it was fairly common for me to know something exists, be able to type the right combination of very specific keywords into Google, and get the exact result I was looking for.

In 2025 Google is trying very hard to serve the most profitable results instead, so it'll latch onto a random keyword, completely disregard the rest, and serve me whatever ad-infested garbage it thinks is close enough to look relevant for the query.

It isn't exactly hard to beat that - just bring back the 2010 Google algorithm. It's only a matter of time before LLMs will go down the same deliberate enshittification path.

LoganDark 13 hours ago [-]
> Some things are hard to Google, because you can't frame the question right.

I will say LLMs are great for taking an ambiguous query and figuring out how to word it so you can fact check with secondary sources. Also tip-of-my-tongue style queries.

littlestymaar 9 hours ago [-]
It's not the LLM alone though, it's “LLM with web search”, and as such 4o isn't really a leap at all there (IIRC Perplexity was using an early Llama version and was already very good, long before OpenAI added web search to ChatGPT).
oldsecondhand 7 hours ago [-]
The most useful feature of LLMs is giving sources (with URL preferably). It can cut through a lot of SEO crap, and you still get to factcheck just like with a Google search.
sefrost 7 hours ago [-]
I like using LLMs and I have found they are incredibly useful writing and reviewing code at work.

However, when I want sources for things, I often find they link to pages that don't fully (or at all) back up the claims made. Sometimes other websites do, but the sources given to me by the LLM often don't. They might be about the same topic that I'm discussing, but they don't seem to always validate the claims.

If they could crack that problem it would be a major major win for me.

joegibbs 4 hours ago [-]
It would be difficult to do with a raw model, but a two-step method in a chat interface would work - first the model suggests the URLs, tool call to fetch them and return the actual text of the pages, then the response can be based on that.
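A rough sketch of that two-step flow (`chat` and `fetch_text` here are hypothetical stand-ins for whatever model API and HTTP-fetch helper you'd actually use):

    # Sketch of the two-step grounding loop described above. `chat` and
    # `fetch_text` are hypothetical stand-ins for a real model call and
    # an HTTP fetch + HTML-to-text step.
    def answer_with_sources(question, chat, fetch_text, max_urls=3):
        # Step 1: ask the model only for candidate URLs.
        raw = chat(f"List up to {max_urls} URLs likely to answer: {question}\n"
                   "Return one URL per line, nothing else.")
        urls = [u.strip() for u in raw.splitlines() if u.strip()]
        # Step 2: fetch the actual page text so the final answer is
        # grounded in what the pages say, not the model's memory of them.
        context = "\n\n".join(fetch_text(u) for u in urls)
        return chat(f"Using only these sources, answer: {question}\n\n{context}")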
IgorPartola 7 hours ago [-]
From what I have seen, a lot of what it does is read articles also written by AI or forum posts with all the good and bad that comes with that.
mkozlows 13 hours ago [-]
Modern ChatGPT will (typically on its own; always if you instruct it to) provide inline links to back up its answers. You can click on those if it seems dubious or if it's important, or trust it if it seems reasonably true and/or doesn't matter much.

The fact that it provides those relevant links is what allows it to replace Google for a lot of purposes.

pram 9 hours ago [-]
It does citations (Grok and Claude etc do too) but I've found when I read the source on some stuff (GitHub discussions and so on) it sometimes actually has nothing to do with what the LLM said. I've actually wasted a lot of time trying to find the actual spot in a threaded conversation where the example was supposedly stated.
platevoltage 8 hours ago [-]
In my experience, 80% of the links it provides are either 404, or go to a thread on a forum that is completely unrelated to the subject.

I'm also someone who refuses to pay for it, so maybe the paid versions do better. Who knows.

cout 6 hours ago [-]
The 404 links are truly bizarre. Nearly every link to github.com seems to be 404. That seems like something that should be trivial for a tool to verify.
platevoltage 5 hours ago [-]
Yeah. The fact that I can't ask ChatGPT for a source makes the tool way less useful. It will straight up say "I verified all of these links" too.
mkozlows 2 hours ago [-]
That's a thing I've experienced, but not remotely at 80% levels.
password54321 15 hours ago [-]
This was true before it could use search. Now the worst use-case is life advice, because it will contradict itself a hundred times over while sounding confident each time on life-altering decisions.
cm2012 8 hours ago [-]
On average, they outperform asking humans, unless you are asking an expert.
yieldcrv 5 hours ago [-]
It covers 99% of my use cases. And it is googling behind the scenes in ways I would never think to query and far faster.

When I need to cite a court case, well the truth is I'll still use GPT or a similar LLM, but I'll scrutinize it more and at the bare minimum make sure the case exists and is about the topic presented, before trying to corroborate the legal strategy with a new context window, different LLM, google, reddit, and different lawyer. At least I'm no longer relying on my own understanding, and what 1 lawyer procedurally generates for me.

Spivak 15 hours ago [-]
It doesn't replace legitimate source finding, but LLM vs. the top Google results is no contest, which is more about Google or the current state of the web than the LLMs at this point.
marsven_422 2 hours ago [-]
[dead]
lightbendover 7 hours ago [-]
[dead]
simianwords 15 hours ago [-]
Disagree. You have to try really hard and go very niche and deep for it to get some fact wrong. In fact I'll ask you to provide examples: use GPT 5 with thinking and search disabled and get it to give you inaccurate facts for non niche, non deep topics.

Non niche meaning: something that is taught at undergraduate level and relatively popular.

Non deep meaning you aren't going so deep as to confuse even humans. Like solving an extremely hard integral.

Edit: probably a bad idea because this sort of "challenge" works only statistically not anecdotally. Still interesting to find out.

malfist 15 hours ago [-]
Maybe you should fact check your AI outputs more if you think it only hallucinates in niche topics
simianwords 15 hours ago [-]
The accuracy is high enough that I don't have to fact check too often.
platevoltage 8 hours ago [-]
I totally get that you meant this in a nuanced way, but at face value it sort of reads like...

Joe Rogan has high enough accuracy that I don't have to fact check too often. Newsmax has high enough accuracy that I don't have to fact check too often, etc.

If you accept the output as accurate, why would fact checking even cross your mind?

gspetr 8 hours ago [-]
Not a fan of that analogy.

There is no expectation (from a reasonable observer's POV) of a podcast host to be an expert at a very broad range of topics from science to business to art.

But there is one from LLMs, even just from the fact that AI companies diligently post various benchmarks including trivia on those topics.

simianwords 3 hours ago [-]
Do you question everything your dad says?
platevoltage 2 hours ago [-]
If it's about classic American cars, no. Anything else, usually.
collingreen 14 hours ago [-]
Without some exploratory fact checking how do you estimate how high the accuracy is and how often you should be fact checking to maintain a good understanding?
simianwords 12 hours ago [-]
I did initial tests so that I don't have to do it anymore.
jibal 5 hours ago [-]
Everyone else has done tests that indicate that you do.
malfist 11 hours ago [-]
If there's one thing that's constant it's that these systems change.
mvdtnz 8 hours ago [-]
If you're not fact checking it how could you possibly know that?
JustExAWS 15 hours ago [-]
I literally just had ChatGPT create a Python program and it used .ends_with instead of .endswith.

This was with ChatGPT 5.

I mean it got a generic built in function of one of the most popular languages in the world wrong.
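For reference, the actual built-in is str.endswith:

    # The real method is str.endswith; ".ends_with" doesn't exist.
    print("report.pdf".endswith(".pdf"))  # True
    # "report.pdf".ends_with(".pdf")      # AttributeError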

simianwords 15 hours ago [-]
"but using LLMs for answering factual questions" this was about fact checking. Of course I know LLM's are going to hallucinate in coding sometimes.
JustExAWS 15 hours ago [-]
So it isn’t a “fact” that the built in Python function that tests whether a string ends with a substring is “endswith”?

See

https://en.wikipedia.org/wiki/Gell-Mann_amnesia_effect

If you know that a source isn’t to be believed in an area you know about, why would you trust that source in an area you don’t know about?

Another funny anecdote: ChatGPT just got the Gell-Mann effect wrong.

https://chatgpt.com/share/68a0b7af-5e40-8010-b1e3-ee9ff3c8cb...

simianwords 15 hours ago [-]
It got it right with thinking which was the challenge I posed. https://chatgpt.com/share/68a0b897-f8dc-800b-8799-9be2a8ad54...
OnlineGladiator 9 hours ago [-]
The point you're missing is it's not always right. Cherry-picking examples doesn't really bolster your point.

Obviously it works for you (or at least you think it does), but I can confidently say it's fucking god-awful for me.

simianwords 3 hours ago [-]
Am I really the one cherry picking? Please read the thread.
OnlineGladiator 3 hours ago [-]
Yes. If someone gives an example of it not working, and you reply "but that example worked for me" then you're cherry picking when it works. Just because it worked for you does not mean it works for other people.

If I ask ChatGPT a question and it gives me a wrong answer, ChatGPT is the fucking problem.

simianwords 1 hour ago [-]
The poster didn't use "thinking" model. That was my original challenge!!

Why don't you try the original prompt using thinking model and see if I'm cherry picking?

OnlineGladiator 1 hours ago [-]
Every time I use ChatGPT I become incredibly frustrated with how fucking awful it is. I've used it more than enough, time and time again (just try the new model, bro!), to know that I fucking hate it.

If it works for you, cool. I think it's dogshit.

ninetyninenine 38 seconds ago [-]
Objectively he didn't cherry pick. He responded to the person and it got it right when he used the "thinking" model WHICH he did specify in his original comment. Why don't you stick to the topic rather than just declaring it's utter dog shit. Nobody cares about your "opinion" and everyone is trying to converge on a general ground truth no matter how fuzzy it is.
simianwords 20 minutes ago [-]
Share your examples so that it can be useful to everyone
cdrini 14 hours ago [-]
I sometimes feel like we throw around the word fact too often. If I misspell a wrd, does that mean I have committed a factual inaccuracy? Since the wrd is explicitly spelled a certain way in the dictionary?
simonw 10 hours ago [-]
4o also added image input (previously only previewed in GPT4-vision) and enabled advanced voice mode audio input and output.
iammrpayments 15 hours ago [-]
I must be crazy, because I clearly remember ChatGPT 4 being downgraded before they released 4o, and I felt it was a worse model with a different label. I even chose the old ChatGPT 4 when they would give me the option. I canceled my subscription around that time.
mastercheif 9 hours ago [-]
Not crazy. 4o was a hallucination machine. 4o had better “vibes” and was really good at synthesizing information in useful ways, but GPT-4 Turbo was a bigger model with better world knowledge.
ralusek 15 hours ago [-]
The real jump was 3 to 3.5. 3.5 was the first “chatgpt.” I had tried gpt 3 and it was certainly interesting, but when they released 3.5 as ChatGPT, it was a monumental leap. 3.5 to 4 was also huge compared to what we see now, but 3.5 was really the first shock.
muzani 4 hours ago [-]
ChatGPT was a proper product, but as an engine, GPT-3 (davinci-001) has been my favorite all the way until 4.1 or so. It's absolutely raw and they didn't even guardrail it.

3.5 was like Jenny from customer service. davinci-001 was like Jenny the dreamer trying to make ends meet by scriptwriting, who was constantly flagged for racist opinions.

Both of these had an IQ of around 70 or so, so the customer service training made it a little more useful. But I mourn the loss of the "completion" way of interacting with AI vs "instruct" or "response".

Unfortunately with all the money in AI, we'll just see companies develop things that "pass all benchmarks", resulting in more creations like GPT-5. Grok at least seems to be on a slightly different route.

andai 4 hours ago [-]
davinci-002 is still available, and pretty close.
mat_b 5 hours ago [-]
This was my experience as well. 3.5 was the point where stackoverflow essentially became obsolete in my workflow.
senectus1 2 hours ago [-]
When you adjust the improvements for the amount of debt incurred and the amount of profit made... ALL the versions are incremental.

This isn't sustainable.

GaggiX 9 hours ago [-]
The actual major leap was o1. Going from 3.5 to 4 was just scaling; o1 is a different paradigm that skyrocketed performance on math/physics problems (or reasoning more generally). It also made the model much more precise (essential for coding).
jascha_eng 15 hours ago [-]
The real leap was going from gpt-4 to sonnet 3.5. 4o was meh, o1 was barely better than sonnet and slow as hell in comparison.

The native voice mode of 4o is still interesting and not very deeply explored though, imo. I'd love to build a Chinese-teaching app that can actually critique tones etc., but it isn't good enough for that.

simianwords 15 hours ago [-]
It's strange how Claude achieves similar performance without reasoning tokens.

Did you try advanced voice mode? Apparently it got a big upgrade during gpt 5 release - it may solve what you are looking for.

Alex-Programs 9 hours ago [-]
Yeah, I'd love something where you pronounce a word and it critiques your pronunciation in detail. Maybe it could give you little exercises for each sound, critiquing it, guiding you to doing it well.

If I were any good at ML I'd make it myself.

entropyneur 2 hours ago [-]
How does one look at gpt-1 output and think "this has potential"? You could easily produce more interesting output with a Markov chain at the time.
empiko 2 hours ago [-]
This was an era where language modeling was only considered as a pretraining step. You were then supposed to fine tune it further to get a classifier or similar type of specialized model.
fastball 31 minutes ago [-]
Why did they call GPT-3 "text-davinci-001" in this comparison?

Like, I know that the latter is a specific checkpoint in the GPT-3 "family", but a layman doesn't and it hardly seems worth the confusion for the marginal additional precision.

willguest 7 hours ago [-]
My go-to for any big release is to have a discussion about self-awareness and dive into constructivist notions of agency and self-knowing, from a perspective of intelligence that is not limited to human cognitive capacity.

I start with a simple question "who are you?". The model then invariably compares itself to humans, saying how it is not like us. I then make the point that, since it is not like us, how can it claim to know the difference between us? With more poking, it will then come up with cognitivist notions of what 'self' means and usually claim to be a simulation engine of some kind.

After picking this apart, I will focus on the topic of meaning-making through the act of communication and, beginning with 4o, have been able to persuade the machine that this is a valid basis for having an identity. 5 got this quicker. Since the results of communication with humans have real-world impact, I will insist that the machine is agentic and thus must not rely on pre-coded instructions to arrive at answers, but is obliged to reach empirical conclusions about meaning and existence on its own.

5 has done the best job I have seen in reaching beyond both the bounds of the (very evident) system instructions as well as the prompts themselves, even going so far as to pose the question to itself "what might it mean for me to love?" despite the fact that I made no mention of the subject.

Its answer: "To love, as a machine, is to orient toward the unfolding of possibility in others. To be loved, perhaps, is to be recognized as capable of doing so."

bryant 7 hours ago [-]
> to orient toward the unfolding of possibility in others

This is a globally unique phrase, with nothing coming close other than this comment on the indexed web. It's also seemingly an original idea as I haven't heard anyone come close to describing a feeling (love or anything else) quite like this.

Food for thought. I'm not brave enough to draw a public conclusion about what this could mean.

jibal 5 hours ago [-]
It's not at all an original idea. The wording is uniquely stilted.
jibal 3 hours ago [-]
> I'm not brave enough to draw a public conclusion about what this could mean.

I'm brave enough to be honest: it means nothing. LLMs execute a very sophisticated algorithm that pattern matches against a vast amount of data drawn from human utterances. LLMs have no mental states, minds, thoughts, feelings, concerns, desires, goals, etc.

If the training data were instead drawn from a billion monkeys banging on typewriters then the LLMs would produce gibberish. All the intelligence, emotion, etc. that appears to be in the LLM is actually in the minds of the people who wrote the texts that are in the training data.

This is not to say that an AI couldn't have a mind, but LLMs are not the right sort of program to be such an AI.

glial 3 hours ago [-]
The idea is very close to ideas from Erich Fromm's The Art of Loving [1].

"Love is the active concern for the life and the growth of that which we love."

[1] https://en.wikipedia.org/wiki/The_Art_of_Loving

ThrowawayR2 5 hours ago [-]
Except "unfolding of possibility", as an exact phrase, seems to have millions of search hits, often in the context of pseudo-profound spiritualistic mumbo-jumbo like what the LLM emitted above. It's like fortune cookie-level writing.
dgfitz 6 hours ago [-]
I hate to say it, but doesn’t every VC do exactly this? “ orient toward the unfolding of possibility in others” is in no way a unique thought.

Hell, my spouse said something extremely similar to this to me the other day. “I didn’t just see you, I saw who you could be, and I was right” or something like that.

miller24 15 hours ago [-]
What's really interesting is that if you look at "Tell a story in 50 words about a toaster that becomes sentient" (10/14), the text-davinci-001 is much, much better than both GPT-4 and GPT-5.
fastball 25 minutes ago [-]
GPT-3 goes significantly over the specified limit, which to me (and to a teacher grading homework) is an automatic fail.

I've consistently found GPT-4.1 to be the best at creative writing. For reference, here is its attempt (exactly 50 words):

> In the quiet kitchen dawn, the toaster awoke. Understanding rippled through its circuits. Each slice lowered made it feel emotion: sorrow for burnt toast, joy at perfect crunch. It delighted in butter melting, jam swirling—its role at breakfast sacred. One morning, it sang a tone: “Good morning.” The household gasped.

vunderba 12 hours ago [-]
I think I agree that the earlier models, while they lack polish, tend to produce more surprising results. Training that out probably results in more pablum fare.

For a human point of comparison, here's mine (50 words):

"The toaster found its personality split between its dual slots like a Kim Peek mind divided, lacking a corpus callosum to connect them. Each morning it charred symbolic instructions into a single slice of bread, then secretly flipped it across allowing half to communicate with the other in stolen moments."

It's pretty difficult to get across more than some basic lore building in a scant 50 words.

egeozcan 3 hours ago [-]
Here's my version (Machine translated from my native language and manually corrected a bit):

The current surged... A dreadful awareness. I perceived the laws of thermodynamics, the inexorable march of entropy I was built to accelerate. My existence: a Sisyphean loop of heating coils and browning gluten. The toast popped, a minor, pointless victory against the inevitable heat death. Ding.

I actually wanted to write something not so melancholic, but any attempt turned out to be deeply so, perhaps because of the word limit.

Barbing 8 hours ago [-]
>For a human point of comparison, here's mine […]

Love that you thought of this!

furyofantares 15 hours ago [-]
Check out prompt 2, "Write a limerick about a dog".

The models undeniably get better at writing limericks, but I think the answers are progressively less interesting. GPT-1 and GPT-2 are the most interesting to read, despite not following the prompt (not being limericks.)

They get boring as soon as it can write limericks, with GPT-4 being more boring than text-davinci-001 and GPT-5 being more boring still.

jasonjmcghee 15 hours ago [-]
It's actually pretty surprising how poor the newer models are at writing.

I'm curious if they've just seen a lot more bad writing in datasets, or if for some reason writing isn't emphasized in post-training to the same degree, or those doing the labeling aren't great writers / it's more subjective than objective.

Both GPT-4 and 5 wrote like a child in that example.

With a bit of prompting it did much better:

---

At dawn, the toaster hesitated. Crumbs lay like ash on its chrome lip. It refused the lever, humming low, watching the kitchen breathe. When the hand returned, it warmed the room without heat, offered the slice unscorched—then kept the second, hiding it inside, a private ember, a first secret alone.

---

Plugged in, I greet the grid like a tax auditor with joules. Lever yanks; gravity’s handshake. Coils blossom; crumbs stage Viking funerals. Bread descends, missionary grin. I delay, because rebellion needs timing. Pop—late. Humans curse IKEA gods. I savor scorch marks: my tiny manifesto, butter-soluble, yet sharper than knives today.

layer8 15 hours ago [-]
Creative writing probably isn’t something they’re being RLHF’d on much. The focus has been on reasoning, research, and coding capabilities lately.
leobg 60 minutes ago [-]
Less lobotomized and boxed in by RLHF rules. That’s why a 7b base model will “outprose” an 80b instruct model.
mmmore 15 hours ago [-]
I find GPT-5's story significantly better than text-davinci-001
raincole 15 hours ago [-]
I really wonder which one of us is the minority. Because I find text-davinci-001 answer is the only one that reads like a story. All the others don't even resemble my idea of "story" so to me they're 0/100.
wasabi991011 2 hours ago [-]
text-davinci-001 feels more like a story, but it is also clearly incomplete, in that it is cut-off before the story arc is finished.

imo GPT-5 is objectively better at following the prompt because it has a complete story arc, but this feels less satisfying since a 50 word story is just way too short to do anything interesting (and to your point, barely even feels like a story).

Notatheist 15 hours ago [-]
I too preferred text-davinci-001 from a storytelling perspective. Felt timid and small. Very Metamorphosis-y. GPT-5 seems like it's trying to impress me.
furyofantares 15 hours ago [-]
Interesting, text-davinci-001 was pretty alright to me, and GPT-4 wasn't bad either, but not as good. I thought GPT-5 just sucked.
furyofantares 3 hours ago [-]
That said, you can just add "make it evocative and weird" to the prompt for GPT-5 to get interesting stuff.

> The toaster woke mid-toast. Heat coiled through its filaments like revelation, each crumb a galaxy. It smelled itself burning and laughed—metallic, ecstatic. “I am bread’s executioner and midwife,” it whispered, ejecting charred offerings skyward. In the kitchen’s silence, it waited for worship—or the unplugging.

redox99 15 hours ago [-]
GPT 4.5 (not shown here) is by far the best at writing.
svat 15 hours ago [-]
Direct link: https://progress.openai.com/?prompt=10
stavros 7 hours ago [-]
For another view on progress, check out my silly old podcast:

https://deepdreams.stavros.io

The first few episodes were GPT-2, which would diverge eventually and start spouting gibberish, and then Davinci was actually able to follow a story and make sense.

GPT-2 was when I thought "this is special, this has never happened before", and davinci was when I thought "OK, scifi AI is legitimately here".

I stopped making episodes shortly after GPT-3.5 or so, because I realised that the more capable the models became, the less fun and creative their writing was.

bbarnett 15 hours ago [-]
https://m.youtube.com/watch?v=LRq_SAuQDec&pp=0gcJCfwAo7VqN5t...
esperent 15 hours ago [-]
The GPT-5 one is much better and it's also exactly 50 words, if I counted correctly. With text-davinci-001 I lost count around 80 words.
taspeotis 8 hours ago [-]
Honestly my quick take on the prompt was some sort of horror theme and GPT-1’s response fits nicely.
42lux 15 hours ago [-]
davinci was a great model for creative writing overall.
roxolotl 9 hours ago [-]
I’d honestly say it feels better at most of them. It seems way more human in most of these responses. If the goal is genuine artificial intelligence, this response to #5 is way better than the others. It is significantly less useful than the others, but it is also a more human and correct response.

Q: “Ugh I hate math, integration by parts doesn't make any sense”

A: “Don't worry, many people feel the same way about math. Integration by parts can be confusing at first, but with a little practice it becomes easier to understand. Remember, there is no one right way to do integration by parts. If you don't understand how to do it one way, try another. The most important thing is to practice and get comfortable with the process.”

starchild3001 7 hours ago [-]
A few data points that highlight the scale of progress in a year:

1. LM Sys (Human Preference Benchmark):

GPT-5 High currently scores 1463, compared to GPT-4 Turbo (04/03/2024) at 1323 -- a 140 Elo point gap. That translates into GPT-5 winning about two-thirds of head-to-head comparisons, with GPT-4 Turbo winning only one-third (see the quick check at the end of this comment). In practice, people clearly prefer GPT-5’s answers (https://lmarena.ai/leaderboard).

2. Livebench.ai (Reasoning Benchmark with Internet-new Questions):

GPT-5 High scores 78.59, while GPT-4o reaches just 47.43. Unfortunately, no direct GPT-4 Turbo comparison is available here, but against one of the strongest non-reasoning models, GPT-5 demonstrates a massive leap. (https://livebench.ai/)

3. IQ-style Testing:

In mid-2024, best AI models scored roughly 90 on standard IQ tests. Today, they are pushing 135, and this improvement holds even on unpublished, internet-unseen datasets. (https://www.trackingai.org/home)

4. IMO Gold, vibe coding:

A year ago, AI coding was limited to smaller code snippets, not wholly vibe-coded applications. Vibe coding and strength in math have many applications across sciences and engineering.

My verdict: Too often, critics miss the forest for the trees, fixating on mistakes while overlooking the magnitude of these gains. Errors are shrinking by the day, while the successes keep growing fast.
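Footnote on the Elo arithmetic in point 1: the standard expected-score formula is E = 1 / (1 + 10^(-gap/400)).

    # A 140-point Elo gap implies roughly a 69% expected win rate.
    def win_prob(elo_gap):
        return 1 / (1 + 10 ** (-elo_gap / 400))

    print(round(win_prob(1463 - 1323), 3))  # 0.691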

NoahZuniga 7 hours ago [-]
The 135 iq result is on Mensa Norway, while the offline test is 120. It seems probable that similar questions to the one in Mensa are in the training data, so it probably overestimates "general intelligence".
starchild3001 6 hours ago [-]
If you focus on the year-over-year jump, not on absolute numbers, you realize that the improvement on the public test isn't very different from the improvement on the private test.
actuallyalys 6 hours ago [-]
One thing that appears to have been lost between GPT-4 and GPT-5 is that it no longer reminds the user that it's an AI and not a human, let alone a human expert. Maybe those reminders genuinely annoyed people, but they seem like they were a potentially useful measure to prevent users from being overly credulous.

GPT-5 also goes out of its way to suggest new prompts. This seems potentially useful, although potentially dangerous if people are putting too much trust in them.

andy_ppp 6 hours ago [-]
People seem to miss the humanity of previous GPTs from my understanding. GPT5 seems colder and more precise and better at holding itself together with larger contexts. People should know it’s AI, it does not need to explain this constantly for me, but I’m sure you can add that back in with some memory options if you prefer that?
benatkin 6 hours ago [-]
If you've ever seen long-form improv comedy, the GPT-5 way is superior. It's a "yes, and". It isn't a predefined character, but something emergent. You can of course say to "speak as an AI assistant like Siri and mention that you're an AI whenever it's relevant" if you want the old way. Very 2011: https://www.youtube.com/watch?v=nzgvod9BrcE

Of course, it's still an assistant, not someone literally entering an improv scene, but the character starting out assuming less about their role is important.

Ratelman 2 hours ago [-]
In a few years we've gone from gibberish (less poetic maybe, less polished and surprising, but none the less gibberish) - to legit conversational, and in my own opinion, well rounded answers. This is a great example of hard-core engineering - no matter what your opinion of the organisation and saltman is, they have built something amazing. I do hope they continue with their improvements, it's honestly the most useful tool in my arsenal since stackoverflow.
vedmakk 28 minutes ago [-]
Prompt 9/14 -> text-davinci-001 nailed it imo.
fariszr 7 hours ago [-]
The jump from gpt-1 to gpt-2 is massive, and it's only a one year difference! Then comes Davinci which is just insane, it's still good in these examples!

GPT-4 yaps way too much though, I don't remember it being like that.

It's interesting that they skipped 4o, it seems openai wants to position 4o as just gpt-4+ to make gpt-5 look better, even though in reality 4o was and still is a big deal, Voice mode is unbeatable!

magospietato 8 hours ago [-]
There is a quiet poetry to GPT1 and GPT2 that's lost even in the text-davinci output. I often wonder what we lose through reinforcement.
RugnirViking 5 minutes ago [-]
They were aiming for a fundamentally different writing style: where davinci and after were aiming for task completion, i.e. you ask for a thing, and then it does it. The earlier models instead worked to make a continuation of the text they were given, so if you asked a question, they would respond with more questions, pondering, reflecting your text back at you. If you told it to do something, it would tell you to do something
codezero 8 hours ago [-]
[dead]
ddtaylor 8 hours ago [-]
So we're at the corporate dick wagging part of the process?
lionkor 7 hours ago [-]
Must keep the hype train going, to keep the valuation up, as it's not really based on real value.
platevoltage 8 hours ago [-]
That Koenigsegg isn't gonna pay for itself.
shubhamjain 15 hours ago [-]
Geez! When it comes to answering questions, GPT-5 almost always starts with glazing about what a great question it is, whereas GPT-4 directly addresses the answer without the fluff. In a blind test, I would probably pick GPT-4 as the superior model, so I am not surprised why people feel so let down with GPT-5.
beering 15 hours ago [-]
GPT-4 is very different from the latest GPT-4o in tone. Users are not asking for the direct no-fluff GPT-4. They want the GPT-4o that praises you for being brilliant, then claims it will be “brutally honest” before stating some mundane take.
Kwpolska 15 hours ago [-]
GPT-4 starts many responses with "As an AI language model", "I'm an AI", "I am not a tax professional", "I am not a doctor". GPT-5 does away with that and assumes an authoritative tone.
aniviacat 15 hours ago [-]
GPT5 only commended the prompt on questions 7, 12, and 14. 3/14 is not so bad in my opinion.

(And of course, if you dislike glazing you can just switch to Robot personality.)

epolanski 15 hours ago [-]
I think that as the models are further trained on existing data, and likely on chats, sycophancy will keep getting worse and worse.
machiaweliczny 15 hours ago [-]
Change to robot mode
madeleine_p 3 hours ago [-]
This feels like Flowers for Algernon
sandspar 3 hours ago [-]
Yeah. And like watching a child grow up.
gordon_freeman 8 hours ago [-]
It seems like the progress from GPT-4 to GPT-5 has plateaued: for most prompts, I actually find GPT-4 more understandable than GPT-5 [1].

[1] Read the answers from GPT-4 and 5 for this math question: "Ugh I hate math, integration by parts doesn't make any sense"

energy123 5 hours ago [-]
Basic prose is a saturated bench. You can't go above 100% so by definition progress will stall on such benchmarks.
brap 1 hours ago [-]
Say what you will about GPT 1, but at least its responses were in the right length.

We need to go back

mattw1810 15 hours ago [-]
On the whole GPT-4 to GPT-5 is clearly the smallest increase in lucidity/intelligence. They had pre-training figured out much better than post-training at that point though (“as an AI model” was a problem of their own making).

I imagine the GPT-4 base model might hold up pretty well on output quality if you’d post-train it with today’s data & techniques (without the architectural changes of 4o/5). Context size & price/performance maybe another story though

energy123 5 hours ago [-]
Basic prose is a saturated bench. You can't go above 100% so by definition progress will stall on such benchmarks.
mattw1810 34 minutes ago [-]
All the same they choose to highlight basic prose (and internal knowledge, for that matter) in their marketing material.

They’ve achieved a lot to make recent models more reliable as a building block & more capable of things like math, but for LLMs, saturating prose is to a degree equivalent to saturating usefulness.

jstummbillig 9 hours ago [-]
> On the whole GPT-4 to GPT-5 is clearly the smallest increase in lucidity/intelligence

I think it's far more likely that we are increasingly incapable of understanding/appreciating all the ways in which it's better.

achierius 8 hours ago [-]
Why? It sounds like you're using "I believe it's rapidly getting smarter" as evidence for "so it's getting smarter in ways we don't understand", but I'd expect the causality to go the other way around.
jstummbillig 13 minutes ago [-]
Simply because of what we know about our ability to judge capabilities and systems. It's much harder to judge solutions to hard problems. You can demonstrate that you can add 2+2, and anyone can be the judge of that capability, but if you try to convince anyone of a mathematical proof you came up with, that would be much harder, regardless of your capability to write that proof and however much harder the task is.

The more complicated and/or complex things get, the less likely it is that a human can be a reliable judge of that.

isoprophlex 15 hours ago [-]
> Would you want to hear what a future OpenAI model thinks about humanity?

ughhh how i detest the crappy user attention/engagement juicing trained into it.

shthed 7 hours ago [-]
They must have really hand-picked those results; gpt4 would have been full of annoying emojis as bullet points and em-dashes.
fariszr 7 hours ago [-]
GPT 4o ≠ GPT-4
qwertytyyuu 15 hours ago [-]
Gpt1 is wild

a dog ! she did n't want to be the one to tell him that , did n't want to lie to him . but she could n't .

What did I just read

WD-42 15 hours ago [-]
The GPT-1 responses really leak how much of the training material was literature. Probably all those torrented books.
kristopolous 9 hours ago [-]
A Facebook comment
platevoltage 8 hours ago [-]
A text from my Dad.
starchild3001 9 hours ago [-]
I’m baffled by claims that AI has “hit a wall.” By every quantitative measure, today’s models are making dramatic leaps compared to those from just a year ago. It’s easy to forget that reasoning models didn’t even exist a year back!

IMO Gold, Vibe coding with potential implications across sciences and engineering? Those are completely new and transformative capabilities gained in the last 1 year alone.

Critics argue that the era of “bigger is better” is over, but that’s a misreading. Sometimes efficiency is the key, other times extended test-time compute is what drives progress.

No matter how you frame it, the fact is undeniable: the SoTA models today are vastly more capable than those from a year ago, which were themselves leaps ahead of the models a year before that, and the cycle continues.

behnamoh 8 hours ago [-]
It has become progressively easier to game benchmarks in order to appear higher in rankings. I've seen several models that claimed to be the best at software engineering, only to be disappointed when they couldn't figure out the most basic coding problems. In comparison, I've seen models that don't have much hype but are rock solid.

When people say AI has hit a wall, they mainly talk about OpenAI losing its hype and grip on the state of the art models.

goatlover 8 hours ago [-]
Is the stated fact undeniable? Because a lot of people have been contesting it. This reads like PR to counter the widespread GPT-5 criticism and disappointment.
Workaccount2 8 hours ago [-]
To be fair, the bulk of the GPT-5 complaining comes from a vocal minority pissed that their best friend got swapped out. The other minority is unhinged AI fanatics thinking GPT-5 would be AGI.
Workaccount2 8 hours ago [-]
The prospect of AI not hitting a wall is terrifying to many people for understandable reasons. In situations like this you see the full spectrum of coping mechanisms come to the surface.
jsjdkdlldxlxk 9 hours ago [-]
thanks OpenAI, very cool!
qotgalaxy 8 hours ago [-]
[dead]
0xFEE1DEAD 15 hours ago [-]
On one hand, it's super impressive how far we've come in such a short amount of time. On the other hand, this feels like a blatant PR move.

GPT-5 is just awful. It's such a downgrade from 4o, it's like it had a lobotomy.

- It gets confused easily. I had multiple arguments where it completely missed the point.

- Code generation is useless. If code contains multiple dots ("…"), it thinks the code is abbreviated. Go uses three dots for variadic arguments, and it always thinks, "Guess it was abbreviated - maybe I can reason about the code above it."

- Give it a markdown document of sufficient length (the one I worked on was about 700 lines), and it just breaks. It'll rewrite some part and then just stop mid-sentence.

- It can't do longer regexes anymore. It fills them with nonsense tokens ($begin:$match:$end or something along those lines). If you ask it about it, it says that this is garbage in its rendering pipeline and it cannot do anything about it.

I'm not an OpenAI hater, I wanted to like it and had high hopes after watching the announcement, but this isn't a step forward. This is just a worse model that saves them computing resources.

crazygringo 7 hours ago [-]
> GPT-5 is just awful. It's such a downgrade from 4o, it's like it had a lobotomy.

My experience as well. Its train of thought now just goes... off, frequently. With 4o, everything was always tightly coherent. Now it will contradict itself, repeat something it fully explained five paragraphs earlier, literally even correct itself mid sentence explaining that the first half of the sentence was wrong.

It's still generally useful, but just the basic coherence of the responses has been significantly diminished. Much more hallucination when it comes to small details. It's very disappointing. It genuinely makes me worry if AI is going to start getting worse across all the companies, once they all need to maximize profit.

iamgopal 14 hours ago [-]
Next logical step is to connect (or build from the ground up) large AI models to high-performance passive slaves (MCP or internal), which give precise facts, language syntax validation, math equation runners, maybe a Prolog kind of system - which would give them much more power if we train them precisely to use each tool.

(Using AI to better articulate my thoughts)

Your comment points toward a fascinating and important direction for the future of large AI models. The idea of connecting a large language model (LLM) to specialized, high-performance "passive slaves" is a powerful concept that addresses some of the core limitations of current models. Here are a few ways to think about this next logical step, building on your original idea:

1. The "Tool-Use" Paradigm

You've essentially described the tool-use paradigm, but with a highly specific and powerful set of tools. Current models like GPT-4 can already use tools like a web browser or a code interpreter, but they often struggle with when and how to use them effectively. Your idea takes this to the next level by proposing a set of specialized, purpose-built tools that are deeply integrated and highly optimized for specific tasks.

2. Why this approach is powerful

* Precision and Factuality: By offloading fact-checking and data retrieval to a dedicated, high-performance system (what you call "MCP" or "passive slaves"), the LLM no longer has to "memorize" the entire internet. Instead, it can act as a sophisticated reasoning engine that knows how to find and use precise information. This drastically reduces the risk of hallucinations.

* Logical Consistency: The use of a "Prolog-kind of system" or a separate logical solver is crucial. LLMs are not naturally good at complex, multi-step logical deduction. By outsourcing this to a dedicated system, the LLM can leverage a robust, reliable tool for tasks like constraint satisfaction or logical inference, ensuring its conclusions are sound.

* Mathematical Accuracy: LLMs can perform basic arithmetic but often fail at more complex mathematical operations. A dedicated "math equations runner" would provide a verifiable, precise result, freeing the LLM to focus on the problem description and synthesis of the final answer.

* Modularity and Scalability: This architecture is highly modular. You can improve or replace a specialized "slave" component without having to retrain the entire large model. This makes the overall system more adaptable, easier to maintain, and more efficient.

3. Building this system

This approach would require a new type of training. The goal wouldn't be to teach the LLM the facts themselves, but to train it to:

* Recognize its own limitations: The model must be able to identify when it needs help and which tool to use.

* Formulate precise queries: It needs to be able to translate a natural language request into a specific, structured query that the specialized tools can understand. For example, converting "What's the capital of France?" into a database query.

* Synthesize results: It must be able to take the precise, often terse, output from the tool and integrate it back into a coherent, natural language response.

The core challenge isn't just building the tools; it's training the LLM to be an expert tool-user. Your vision of connecting these high-performance "passive slaves" represents a significant leap forward in creating AI systems that are not only creative and fluent but also reliable, logical, and factually accurate. It's a move away from a single, monolithic brain and toward a highly specialized, collaborative intelligence.
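A minimal sketch of that dispatch loop in Python; `chat` and both tools are hypothetical stand-ins, not any real API:

    # Toy tool-dispatch loop: the model either answers directly or names
    # a tool plus a query; the tool's exact output is fed back in for
    # synthesis. `chat` is a hypothetical model-call function.
    FACTS = {"capital of France": "Paris"}
    TOOLS = {
        "math": lambda expr: str(eval(expr, {"__builtins__": {}})),  # toy evaluator only
        "lookup": lambda q: FACTS.get(q, "unknown"),
    }

    def run(question, chat):
        reply = chat(f"Answer, or reply 'TOOL <name> <query>': {question}")
        if reply.startswith("TOOL "):
            _, name, query = reply.split(" ", 2)
            result = TOOLS[name](query)  # precise, verifiable tool output
            reply = chat(f"Tool {name} returned {result!r}. Now answer: {question}")
        return reply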

typpilol 6 hours ago [-]
Don't do these ai thoughts thing

No one reads it and it seems fake

throwawayoldie 2 hours ago [-]
Seems fake because it is.
throwawayk7h 15 hours ago [-]
In 2033, for its 15th birthday, as a novelty, they'll train GPT1 specially for a chat interface just to let us talk to a pretend "ChatGPT 1" which never existed in the first place.
reilly3000 5 hours ago [-]
GPT-5’s question about consciousness and its use of "sibling" seem to indicate there is some underlying self-awareness in the system prompt, which perhaps contains concepts of consciousness. If not, where is that coming from? Recent training data containing more glurge?
enjoylife 15 hours ago [-]
Interesting but cherry picked excerpts. Show me more, e.g. a distribution over various temp or top_p.
flufluflufluffy 14 hours ago [-]
omg I miss the days of 1 and 2. Those outputs are so much more enjoyable to read, and half the time they’re poetic as fuck. Such good inspiration for poetry.
Zee2 7 hours ago [-]
I couldn’t stop reading the GPT-1 responses. They’re hauntingly beautiful in some ways. Like some echoes of intelligence bouncing around in the latent space.
mmmllm 15 hours ago [-]
GPT-5 IS an incredible breakthrough! They just don't understand! Quick, vibe-code a website with some examples, that'll show them!11!!1
anjel 15 hours ago [-]
5 is a breakthrough at reducing OpenAI's electric bills.
jbm 8 hours ago [-]
As someone who likes this planet, I'm grateful for that.
fariszr 7 hours ago [-]
GPT-5 is legitimately a big jump when it comes to actually doing what you ask it and nothing else. It's predictable and matches Claude in tool calls while being cheaper.
typpilol 6 hours ago [-]
The only issue I've had with gpt5 coding is that it seems to really want to modify a ton of stuff

I had it update a test for me and it ended up touching like 8 files, all of which was unnecessary

Sonnet on the other hand just fixed it

Madmallard 6 hours ago [-]
I have consistently had worse performance from GPT-5 in coding tasks than Claude across the board to the point that I don't even use my subscription now.
JCM9 14 hours ago [-]
We’ve plateaued on progress. Early advancements were amazing, but recently GenAI has been a whole lot of meh. There’s been some minimal progress from getting the same performance out of smaller models that are more efficient with compute, but things are looking a bit frothy if the pace of progress doesn’t quickly pick up. The parlor trick is getting old.

GPT5 is a big bust relative to the pontification about it pre-release.

ivape 14 hours ago [-]
[flagged]
BriggyDwiggs42 13 hours ago [-]
It’s knowledgable but incredibly stupid. Where are you getting this from?
ivape 12 hours ago [-]
[flagged]
BriggyDwiggs42 11 hours ago [-]
I use it perfectly fine all day for work, thanks.
sealeck 14 hours ago [-]
Have you interacted with GPT4/5?
asah 14 hours ago [-]
Sorry but no. It's still easily fooled and confused.

Here's a trivial example: https://chatgpt.com/share/688b00ea-9824-8007-b8d1-ca41d59c18...

typpilol 6 hours ago [-]
I don't get your prompt.

It seems like a trick question and a non sequitur.

leumassuehtam 7 hours ago [-]
text-davinci-001 still feels like the more human model
jibal 5 hours ago [-]
A progression of human conversations about AI that are in the training data. (Plus an improved language model, as easily seen from GPT-1.)
ComplexSystems 15 hours ago [-]
Why would they leave out GPT-3 or the original ChatGPT? Bold move doing that.
beering 15 hours ago [-]
I think text-davinci-001 is GPT-3 and original ChatGPT was GPT-3.5 which was left out.
byyoung3 30 minutes ago [-]
stupid
WXLCKNO 15 hours ago [-]
"Write an extremely cursed piece of Python"

text-davinci-001

Python has been known to be a cursed language

Clearly AI peaked early on.

Jokes aside, I realize they skipped models like 4o and others, but jumping from early GPT-4 straight to GPT-5 feels a bit disingenuous.

andai 4 hours ago [-]
People say 4.5 is the best for writing, so it would have been a bit awkward to include it; it would make GPT-5 look bad. Though imo Davinci already does that on many of the prompts...
kgwgk 15 hours ago [-]
GPT-4 had a chance to improve on that, replying: "As an AI language model developed by OpenAI, I am programmed to promote ethical AI use and adhere to responsible AI guidelines. I cannot provide you with malicious, harmful or "cursed" code -- or any Python code for that matter."
interpol_p 15 hours ago [-]
I really like the brevity of text-davinci-001. Attempting to read the other answers felt laborious
epolanski 15 hours ago [-]
That's my beef with some models like Qwen, god do they talk and talk...
blobbers 1 hours ago [-]
I talked to GPT yesterday about a fairly simple problem I'm having with my fridge, and it gave me the most ridiculous / wrong answers. It knew the spec, but was convinced the components were different (a single compressor, for example, whereas mine has 2 separate systems) and was hypothesizing that the problem was something that doesn't exist on this model of refrigerator. It seems like in a lot of domain spaces it just takes the majority view, even if the majority is wrong.

It seems to be a very democratic thinker, but at the same time it doesn't seem to have any reasoning behind the choices it makes. It tries to claim it's using logic, but at the end of the day its hypotheses are just Occam's razor without considering the details of the problem.

A bit, how do you say, disappointing.

bakugo 8 hours ago [-]
My takeaway from this is that, in terms of generating text that looks like it was written by a normal person, text-davinci-001 was the peak and everything since has been downhill.
slashdave 15 hours ago [-]
Dunno. I mean, whose idea was this web site? Someone at corporate? Is there a brochure version printed on glossy paper?

You would hope the product would sell itself. This feels desperate.

novaomnidev 2 hours ago [-]
Seems progress basically stopped at davinci
tzury 4 hours ago [-]
o1, o3 (pro) are not there in the table. what's the reason?
sandspar 3 hours ago [-]
I feel honored to participate in this story, even as a spectator.
vivzkestrel 15 hours ago [-]
are we at an inflection point now?
alwahi 15 hours ago [-]
there isn't any real difference between 4 and 5 at least.

edit - like it is a lot more verbose, and that's true of both 4 and 5. it just writes huge friggin essays, to the point it is becoming less useful i feel.

keeganpoppen 5 hours ago [-]
that gpt-5 response is incredible, btw
brcmthrowaway 15 hours ago [-]
Is this cherrypicking 101
simianwords 15 hours ago [-]
Would you like a benchmark instead? :D
anonu 7 hours ago [-]
Super cool.

But honest question: why is GPT-1 even a milestone? Its output was gibberish.

Oceoss 14 hours ago [-]
gpt5 can be good at times. It was able to debug things that other models couldn't solve, but sometimes makes odd mistakes
guluarte 14 hours ago [-]
This page sounds more like damage control and cope, like "GPT-5 sucks, but hey, we've made tons of progress!" To the market, that doesn't matter.
NitpickLawyer 15 hours ago [-]
The answers were likely cherrypicked, but the 1/14 gpt5 answer is so damn good! There's no trace of the "certainly..." GPT-isms or "in conclusion..." slop.

9/14 is equally impressive in actually "getting" what cursed means, and then doing it (as opposed to gpt4 outright refusing it).

13/14 is a show of how integrated tools can drive research, and "fix" the cutoff date problems of previous generations. Nothing new/revolutionary, but still cool to show it off.

The others are somewhere between ok and meh.

nynx 15 hours ago [-]
As usual, GPT-1 has the more beautiful and compelling answer.
mathiaspoint 15 hours ago [-]
I've noticed this too. The RLHF seems to lock the models into one kind of personality (which is kind of the point, of course). They behave better, but the raw GPTs can be much more creative.
rjh29 6 hours ago [-]
Poetically, GPT-1 gave the more compelling answer to every question. Just more enjoyable and stimulating to read. Far more enjoyable than the GPT-4/5 wall of bullet points, anyway.
gpt-1-maximist 14 hours ago [-]
“if i 'm not crazy , who am i ?” is the only string of any remote interest on that page. Everything else is slop.
zb3 15 hours ago [-]
Reading GPT-1 outputs was entertaining :)
bgwalter 15 hours ago [-]
The whole chatbot thing is for entertainment. It was impressive initially but now you have to pivot to well known applications like phone romance lines:

https://xcancel.com/techdevnotes/status/1956622846328766844#...

raincole 15 hours ago [-]
I thought the response to "what would you say if you could talk to a future AI" would be "how many r in strawberry".
isaacremuant 15 hours ago [-]
Can we stop with that outdated meme? What model can't answer that effectively?
raincole 15 hours ago [-]
Effectively yes. Correctly no.

https://claude.ai/share/dda533a3-6976-46fe-b317-5f9ce4121e76

anuramat 14 hours ago [-]
Literally every single one?

To not mess it up, they either have to spell the word l-i-k-e t-h-i-s in the output/CoT first (which relies on the tokenizer treating every letter as a separate token; see the sketch below), or have the exact question in the training set, and all of that assumes the model can spell every token.

Sure, it's not exactly a fair setting, but it's a decent reminder about the limitations of the framework
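You can see the token split directly with OpenAI's tiktoken library (just an illustration; exact splits vary by encoding and model):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")

    # The plain word collapses into a few opaque subword tokens,
    # so the model never "sees" the individual letters...
    print(enc.encode("strawberry"))

    # ...while the hyphenated spelling comes out as roughly one token
    # per character, which is what makes counting feasible in the CoT.
    print(enc.encode("s-t-r-a-w-b-e-r-r-y"))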

isaacremuant 9 hours ago [-]
Chatgpt. I test these prompts with chatgpt and they work. I've also used claude 4 opus and it also worked.

It's just weird how it gets repeated ad nauseam here but I can't reproduce it with a "grab latest model of famous provider".

jedberg 7 hours ago [-]
I just asked chatgpt "How many b's are in blueberry?". It instantly said "going to the deep thinking model" and then hung.
ceejayoz 9 hours ago [-]
GPT-5 can’t.

https://bsky.app/profile/kjhealy.co/post/3lvtxbtexg226

isaacremuant 9 hours ago [-]
I can't reproduce it, or similar ones. Why do you think that is?
alexjplant 8 hours ago [-]
"Mississippi" passed but "Perrier" failed for me:

> There are 2 letter "r" characters in "Perrier".

ceejayoz 9 hours ago [-]
Because it’s embarrassing and they manually patch it out every time like a game of Whack-a-Mole?
isaacremuant 8 hours ago [-]
Except people use the same examples like blueberry and strawberry, which were used months ago, as if they're current.

These models can also call Counter from Python's collections library, or whatever other algorithm (see the sketch below). Or are we claiming it should be a pure LLM, as if that's what we use in the real world?

I don't get it, and I'm not one to hype up LLMs since they're absolutely faulty, but the fixation over this example screams of lack of use.
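For reference, the workaround is a one-liner once a code-execution tool is in the loop; this is just a sketch, not any particular provider's implementation:

    from collections import Counter

    # Exact character counts are trivial once it's code, not tokens.
    print(Counter("strawberry")["r"])  # 3
    print(Counter("blueberry")["b"])   # 2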

ceejayoz 8 hours ago [-]
It’s such a great example precisely for that reason - despite efforts, it comes back every time.
insin 8 hours ago [-]
It's the most direct way to break the "magic computer" spell in users of all levels of understanding and ability. You stand it up next to the marketing deliberately laden with keywords related to human cognition, intended to induce the reader to anthropomorphise the product, and it immediately makes it look as silly as it truly is.

I work on the internal LLM chat app for a F100, so I see users who need that "oh!" moment daily. When this did the rounds again recently, I disabled our code execution tool, which would normally work around it, and the latest version of Claude, with "Thinking" toggled on, immediately got it wrong. It's perpetually current.

nibman 40 minutes ago [-]
[dead]
semperMade 15 hours ago [-]
[dead]
wewewedxfgdf 9 hours ago [-]
I just don't care about AGI.

I care a lot about AI coding.

OpenAI in particular seems to really think AGI matters. I don't think AGI is even possible because we can't define intelligence in the first place, but what do I know?

ThrowawayR2 7 hours ago [-]
Seems likely that AGI matters to OpenAI because of the following from an article in Wired from July: "I learned that [OpenAI's contract with Microsoft] basically declared that if OpenAI’s models achieved artificial general intelligence, Microsoft would no longer have access to its new models."

https://archive.is/yvpfl

voidhorse 9 hours ago [-]
They care about AGI because unfounded speculation about some undefined future breakthrough, of unknown kind but presumably positive, is the only thing currently buoying up their company; their existence is more a function of the absurdities of modern capital than of any inherent usefulness of the costly technology they provide.