> It isn’t “wrong.” Wolfram defines Binomial[n,m] at negative integers by a symmetric limiting rule that enforces Binomial[n,m] = Binomial[n,n−m]. With n = −1, m = −1 this forces Binomial[−1,−1] = Binomial[−1,0] = 1. The gamma-formula has poles at nonpositive integers, so values there depend on which limit you adopt. Wolfram chooses the symmetry-preserving limit; it breaks Pascal’s identity at a few points but keeps symmetry. If you want the convention that preserves Pascal’s rule and makes all cases with both arguments negative zero, use PascalBinomial[−1,−1] = 0. Wolfram added this explicitly to support that alternative definition.
Of course this particular question might have been in the training set.
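For the curious, a minimal Python sketch of the Pascal-rule convention described above (`pascal_binomial` is just an illustrative name, not a real library function):

```python
from math import factorial

def pascal_binomial(n, k):
    # Generalized binomial under the Pascal-rule convention:
    # zero whenever k < 0, falling factorial over k! otherwise.
    if k < 0:
        return 0
    num = 1
    for i in range(k):
        num *= (n - i)
    return num // factorial(k)  # exact: the falling factorial is divisible by k!

print(pascal_binomial(-1, -1))  # 0, matching PascalBinomial[-1, -1]
print(pascal_binomial(-1, 0))   # 1
# Mathematica's Binomial[-1, -1] returns 1 instead, via the symmetric limit.
```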
Honestly, 2.5 years feels like infinity when it comes to AI development. I'm using ChatGPT very regularly, and while it's far from perfect, it has recently given obviously wrong answers very rarely. I can't say anything about ChatGPT 5; I feel like I've reached my own limit in my conversations with AI, so I'd hardly notice the AI getting smarter, because it's already smart enough for my questions.
On Wolfram specifically, GPT-5 is a huge step up from GPT-4. One of the first things I asked it was to write me a Mathematica program to test the basic properties (injectivity, surjectivity, bijectivity) of various functions. The notebook it produced was:
1) 100% correct
2) Really useful (i.e., it includes various things I didn’t ask for but that are really great, like a little manipulator to walk through the function at various points and visualize what the mapping is doing)
3) Built in a general way so I can easily change the mapping to explore different types of functions and how they work.
It seems very clear (both from what they said in the launch demos, etc., and from my experience trying it out) that performance on coding tasks has been an area of massive focus, and the results are pretty clear to me.
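For a sense of what that kind of property testing involves, here's a rough Python analogue over a finite domain (not the actual notebook, which was in Mathematica; the function names are purely illustrative):

```python
def injective(f, domain):
    """No two domain elements map to the same value."""
    seen = set()
    for x in domain:
        y = f(x)
        if y in seen:
            return False
        seen.add(y)
    return True

def surjective(f, domain, codomain):
    """Every codomain element is hit by some domain element."""
    return set(codomain) <= {f(x) for x in domain}

def bijective(f, domain, codomain):
    return injective(f, domain) and surjective(f, domain, codomain)

dom = range(-5, 6)
print(injective(lambda x: x * x, dom))        # False: (-2)^2 == 2^2
print(surjective(lambda x: x + 1, dom, dom))  # False: nothing maps to -5
print(bijective(lambda x: -x, dom, dom))      # True
```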
Sooner or later, I end up with a coworker or two who takes it for granted that my code is usually very solid (a bit too solid, according to some). Then I end up with a bug in production that was way too dumb not to have been caught in review, and the reviewer confesses to not looking very closely at the code.
I think of that every time people talk about trusting generated code. Or the obfuscated code competition. It’s going to get you into the dumbest trouble some day.
> recently it gave obviously wrong answers very rarely
Are you concerned it may be giving you subtly wrong answers that you're not noticing? If you have to double-check everything, is it really saving time?
Sometimes I can write the code, but I just think it's too tedious and AI can do it faster than me. For example, I can write `jq` incantations, but it takes some time and digging through man pages, so for non-trivial cases I can easily spend an hour tinkering with it. ChatGPT often does it in a minute and does it correctly. So in this case I can perfectly evaluate the result. Although I've recently started to avoid this kind of usage, because I feel it makes me dumber and my salary is not increased for the time saved, so I don't really have an incentive to work slightly faster at the expense of my mental abilities.
Sometimes I'm using it as a Google replacement. Now, this is controversial usage and I certainly can't quickly evaluate whether it did a good job. I usually inspect the list of "Sources" and double-check the most important facts. If I feel it missed important sources, I go and search myself.
Sometimes I just need an opinion. I work mostly alone and I don't have anyone to talk to. I also have some mental issues where I can get stuck on a very simple issue, like identifier naming, and procrastinate over it for a prolonged time. AI has been a saviour for me, because it can present another opinion which I can just follow and move on (because it doesn't really matter which identifier name to use). So in this case there's no "right" or "wrong" answer, I just need "some" answer. I could even use an RNG, but ChatGPT actually makes me feel better by following seemingly reasonable suggestions.
Often I write code and submit it to ChatGPT for review. It spots a lot of irrelevant issues, or issues I don't really care about. However, sometimes it finds a bug, so it helps me a lot by uncovering bugs early. In this case, verification is obvious: I know when a bug is a bug.
If I'm asking for my own enlightenment but don't care about a correct answer, then why bother?
If my manager is fine with a half-assed response, then eventually he'll cut me out altogether and go straight to the AI.
It's a shame that software engineering hasn't progressed to the point where we can reliably build bug-free software. It's really sad if AI gives shitty results but iterates fast enough that it's still perceived as better than real humans.
Try using it to do stuff with dnf (Fedora). It makes hallucination after hallucination for one of the most commonly used package managers. I would say for Linux stuff I've asked 30 or so questions with a 0% success rate.
I don't think this is a reason I'd trust it; actually, it's a reason I don't trust it.
There's a big difference between "obviously wrong" and "wrong". It is not objective but entirely depends on the reader/user.
The problem is that it optimizes deception alongside accuracy. It's a useful tool, but good design says we should want to make errors loud and apparent. That's because we want tools to complement us, to make us better. But if errors are subtle, nuanced, or just difficult to notice, then the tool actually carries a lot of danger (true for any tool).
I'm reminded of the Murray Gell-Mann Amnesia effect: you read something in the newspaper about a topic you're an expert in and lambast it for its inaccuracies, but then turn the page to something you don't have domain knowledge in and trust it.
The reason I bring up MGA is because we don't often ask GPT things we know about or have deep knowledge in. But this is a good way to learn about how much we should trust it. Pretend to know nothing about a topic you are an expert in. Are its answers good enough? If not, then be careful when asking questions you can't verify.
Or, I guess... just ask it to solve "5.9 = x + 5.11"
Something a lot of people don’t get is that there are ways you can trust a consistently mediocre developer that you can’t trust a volatile but brilliant coworker. You almost always know how the former’s work will be broken. The latter will blow up spectacularly on you at some terribly inopportune moment.
That’s not to say fire all your brilliant devs and hire mediocrity, but the reverse case is often made by loudmouths trying to fluff their own egos. Getting rid of the average devs is ignoring the vocational aspects of the job.
> You almost always know how the former’s work will be broken.
The thing is that there are enough people who blindly trust ChatGPT's answers, and they don't know in which ways they could be broken, and they wouldn't have the knowledge to verify the answers because they are asking about things they themselves know very little about.
I think the analogy makes sense, but I'm not sure I've met too many brilliant but volatile people. At least the volatility I see in them is more like getting distracted by some rabbit hole or doing some other task, which I think is easier to handle but highly depends on management... (For the mediocre dev, their screwups are usually categorically consistent too, so you know what to give extra scrutiny. You can also teach them.)
FWIW, I still use LLMs, but I just don't trust them. I think people get these confused. The thing is, you use tools you don't trust very differently than other ones. A simple but common example: I'll use them to aid Google searches. Ask about a topic and it'll drop a bunch of info, but you treat it as a bullshitter and take a "trust but verify" approach. Since it sounds accurate, it usually drops a bunch of vernacular specific to that topic. Getting these keywords can really help with searches, especially when terms are overloaded.
I think in general it is good to treat them as bullshitters. There's still utility in that, but it is less straightforward.
Side note:
LLMs have really changed the way I think about people. I never understood how used-car salesmen were so successful despite the widespread stereotype, but I guess a lot more people like sycophants than I would have expected. Honestly, I can't stand how much these things praise me. It's really belittling. Praise me for actual accomplishments, not for every little thing.
Don't give me a "yes man"; give me someone who can actually do things. Never trust "yes men": they don't ever have your best interests in mind. Trust people who will say no, people who will tell you you're wrong when you are wrong. You're not the main character; no one is. The whole reason we use these tools and work with others is because we couldn't possibly know everything and do everything ourselves. Yes men just try to convince you that you're the main character. They know the easiest person to fool is yourself.
Pretty interesting - some contamination, some better answers, and it failed to write a sentence with all 5-letter words. I’d have expected it to pass this one!
It's not the end of the world. Both are equally "impressive" at basic Q&A skills, and GPT-4 is noticeably more sterile when writing prose.
Even if GPT-3.5 was noticeably worse on any of these questions, it's honestly more interesting for someone's first experience to be with the exaggerated shortcomings of AI. The slightly screwy answers are still typical of what you see today, so it all ended well enough, I think. It would've been a terribly boring exchange if Knuth's reply had just been "looks great, thanks for asking ChatGPT" with no challenging commentary.
The Haj novel by Leon Uris, by the way, is a disturbing Zionist book that depicts Arabs as backward, violent, and incapable of progress without outside control, and that justifies the taking of Palestinian land by Jewish settlers.
> I myself shall certainly continue to leave such research to others, and to devote my time to developing concepts that are authentic and trustworthy. And I hope you do the same.
In a way taming these stochastic beasts into reliable and trustworthy software components is more like (quantitative) social science than computer science.
2023 was a crazy and exciting year for AI research. LLMs have come a long way, but clearly still have a long way to go. They should do much better on most of these questions.
The discussion at the end also reminded me of how a lot of us took Gary Marcus' prose more seriously at the time before many of his short-term predictions started failing spectacularly.
Tokenization. Before text is fed into an LLM, it's processed into tokens typically consisting of multiple characters. GPT-4o would see the word "jokes" as the tokens [73, 17349]. That's much more efficient than processing individual characters, but it means that LLMs can't count letters or words without some additional trickery. LLMs have also struggled with arithmetic for the same reason - they can't "see" the numbers in the text, just tokens representing groups of characters.
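You can see the splitting directly with OpenAI's tiktoken tokenizer; a small Python sketch (the exact IDs depend on which encoding you pick):

```python
# Requires the `tiktoken` package; "o200k_base" is the encoding used by
# the GPT-4o family.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
for text in ["jokes", "running", "5.9 + 5.11"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(text, ids, pieces)
# The model only ever sees the integer IDs, not the individual characters,
# which is why letter counting and digit-level arithmetic are awkward for it.
```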
I was reading yesterday about a Buddhist concept (albeit quite popular in the West) called Beginner's Mind. I think this post represents it perfectly.
We are presented with a first reaction to ChatGPT; we must never forget how incredible this technology is, and not become accustomed to it.
Donald Knuth approached several of the questions from an absence of knowledge, asking questions as basic as "12. Write a sentence that contains only 5-letter words.", and being amazed not only by correct answers, but also by incorrect answers that were parsed effectively and with semantic understanding.
It's sad that we've made the internet so disorganized and crammed with advertising and crap that we now need tools to find actual information and summarize it for us.
I've just watched our new lord and saviour, GPT-5 in agent mode, enter a death loop of frustration. I asked it to find an image online. Ostensibly an easy task...
First it spent three minutes getting fucked by cookie banners, then it DDoSed Wikipedia by guessing article names, then it started searching for stock photo sites offering an API, then it hallucinated a Python script to search stock photography vaguely related to what I wanted. This failed as well, so it called its image generator and finally served me some made-up AI slop.
Ten minutes, kilowatts of GPU power, and jack shit in return. So not even the shiny new tools are up to the task.
The internet was made to happen. And that is what happened.
I would say I have seen 3 completely different internets. And I started keeping track in the late '90s, after the dotcom boom made it truly global and everywhere.
Well nobody sane claims that if carbon appears on a planet then conscious minds will suffer through ads and cookie banners.
There is the internet; there was a decision point to subsidize a bunch of stuff with ads, and here we are. It could’ve looked much different if, for example, Google had decided to launch a subscription, the market had figured out it was a viable path forward, and people had refused to interact with the ads crap.
>I would say I have seen 3 completely different internets. and I started keeping track in the late 90s after the dotcom boom made it truly global and everywhere
I meant "we" in the collective sense. No single person or group is directly responsible, but overall humanity hasn't done a very good job building the internet, in retrospect.
It's almost a no-op for humanity. In the past we had disconnected, offline knowledge sources, like libraries and newspaper archives, and now we have them online, but we've packed it so full of spam and misinformation that it's hard to find anything and if you do, it's hard to trust it.
Now we're making tools to summarize it all, and they're no doubt going to be just as exploitative as the stuff we've put into it, and also untrustworthy. It's easy enough to find people telling you what you want to hear online, and it's even easier with ChatGPT.
Maybe people can't handle having access to all of the world's information.
It could easily tip the other way. We could collectively decide to use sites that only use web 1.0, possibly with a carefully curated set of extensions. This could be enforced by, say, browser extensions that refuse to load "bad" pages, or that load copies of sanitized pages (like archive.ph does).
It suggested two new attributes which it did not add, despite claiming that this had been done; and after they were added, the attributes were not used.
The problem is that doing this enough will make you forget how to come up with proofs in the first place.
What if it gets you most of the way there and you don't care so much about the quality (and thus don't feel any urge to double-check), though?
Which, I expect, is the most common case: people using it to churn out half-arsed work faster and with no effort from them.
If we keep retraining them on the currently available datasets, then the questions that stumped ChatGPT 3 are in the training set for ChatGPT 5.
I don’t have the background to understand the functional changes between ChatGPT 3 and 5. It can’t be just the training data can it?
https://chatgpt.com/share/6897a21b-25c0-8011-a10a-85850870da...
Simple example: "Every night, dreams swirl swiftly."
For example, something like "running" might get tokenized as "runn"+"ing", being only two tokens for ChatGPT.
It'll learn to infer some of these things over the course of training, but only to a limited extent.
Same reason it's not great at math.
d r e a m s (6 letters)
s w i f t l y (7 letters)
Every quiet mouse crept under thick brown boxes.
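A trivial Python check of the two sentences quoted above makes the failure explicit:

```python
import re

def word_lengths(sentence):
    # Strip punctuation and report the length of each word.
    words = re.findall(r"[A-Za-z]+", sentence)
    return {w: len(w) for w in words}

print(word_lengths("Every night, dreams swirl swiftly."))
# {'Every': 5, 'night': 5, 'dreams': 6, 'swirl': 5, 'swiftly': 7}
print(word_lengths("Every quiet mouse crept under thick brown boxes."))
# {'Every': 5, 'quiet': 5, 'mouse': 5, 'crept': 5, 'under': 5,
#  'thick': 5, 'brown': 5, 'boxes': 5}
```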
Anyone have an idea how this happened? It was supposed to be a sentence of only 5-letter words.
> I would say I have seen 3 completely different internets.
What 3?