AI has all the answers. Even the wrong ones - FT中文網 (FT Chinese)


ChatGPT has the appearance of a brilliant logician and that’s a problem
An inquiry into how accurately and reliably large language models solve logic puzzles.

Can large language models solve logic puzzles? There’s one way to find out, which is to ask. That’s what Fernando Perez-Cruz and Hyun Song Shin recently did. (Perez-Cruz is an engineer; Shin is the head of research at the Bank for International Settlements as well as the man who, in the early 1990s, taught me some of the more mathematical pieces of economic theory.)

The puzzle in question is commonly known as the “Cheryl’s birthday puzzle”. Cheryl challenges her friends Albert and Bernard to guess her birthday, and for puzzle-reasons they know it’s one of 10 dates: May 15, 16 or 19; June 17 or 18; July 14 or 16; or August 14, 15 or 17. To speed up the guessing, Cheryl tells Albert her birth month, and tells Bernard the day of the month, but not the month itself.

Albert and Bernard think for a while. Then Albert announces, “I don’t know your birthday, and I know that Bernard doesn’t either.” Bernard replies, “In that case, I now know your birthday.” Albert responds, “Now I know your birthday too.” What is Cheryl’s birthday?* More to the point, what do we learn by asking GPT-4?

The puzzle is a challenging one. Solving it requires eliminating possibilities step by step while pondering questions such as “what is it that Albert must know, given what he knows that Bernard does not know?” It is, therefore, hugely impressive that when Perez-Cruz and Shin repeatedly asked GPT-4 to solve the puzzle, the large language model got the answer right every time, fluently elaborating varied and accurate explanations of the logic of the problem. Yet this bravura performance of logical mastery was nothing more than a clever illusion. The illusion fell apart when Perez-Cruz and Shin asked the computer a trivially modified version of the puzzle, changing the names of the characters and of the months.

GPT-4 continued to produce fluent, plausible explanations of the logic, so fluent, in fact, that it takes real concentration to spot the moments when those explanations dissolve into nonsense. Both the original problem and its answer are available online, so presumably the computer had learnt to rephrase this text in a sophisticated way, giving the appearance of a brilliant logician.

When I tried the same thing, preserving the formal structure of the puzzle but changing the names to Juliet, Bill and Ted, and the months to January, February, March and April, I got the same disastrous result. GPT-4 and the new GPT-4o both authoritatively worked through the structure of the argument but reached false conclusions at several steps, including the final one. (I also realised that in my first attempt I introduced a fatal typo into the puzzle, making it unsolvable. GPT-4 didn’t bat an eyelid and “solved” it anyway.)


Curious, I tried another famous puzzle. A game show contestant is trying to find a prize behind one of three doors. The quizmaster, Monty Hall, allows a provisional pick, opens another door to reveal no grand prize, and then offers the contestant the chance to switch doors. Should they switch?

The Monty Hall problem is actually much simpler than Cheryl’s Birthday, but bewilderingly counterintuitive. I made things harder for GPT-4o by adding some complications. I introduced a fourth door and asked not whether the contestant should switch (they should), but whether it was worth paying $3,500 to switch if two doors were open and the grand prize were $10,000.**

GPT-4’s response was remarkable. It avoided the cognitive trap in this puzzle, clearly articulating the logic of every step. Then it fumbled at the finishing line, adding a nonsensical assumption and deriving the wrong answer as a result.

What should we make of all this? In some ways, Perez-Cruz and Shin have merely found a twist on the familiar problem that large language models sometimes insert believable fiction into their answers. Instead of plausible errors of fact, here the computer served up plausible errors of logic.

Defenders of large language models might respond that with a cleverly designed prompt, the computer may do better (which is true, although the word “may” is doing a lot of work). It is also almost certain that future models will do better. But as Perez-Cruz and Shin argue, that may be beside the point. A computer that is capable of seeming so right yet being so wrong is a risky tool to use. It’s as though we were relying on a spreadsheet for our analysis (hazardous enough already) and the spreadsheet would occasionally and sporadically forget how multiplication worked.

Not for the first time, we learn that large language models can be phenomenal bullshit engines. The difficulty here is that the bullshit is so terribly plausible. We have seen falsehoods before, and errors, and goodness knows we have seen fluent bluffers. But this? This is something new.

*If Bernard was told the 18th (or the 19th) he would know the birthday was June 18 (or that it was May 19). So when Albert says that he knows that Bernard doesn’t know the answer, that rules out these possibilities: Albert must have been told July or August instead of May or June. Bernard’s response that he now knows the answer for certain reveals that it can’t be the 14th (which would have left him guessing between July and August). The remaining dates are August 15 or 17, or July 16. Albert knows which month, and the statement that he now knows the answer reveals the month must be July and that Cheryl’s birthday is July 16.
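
The elimination in this footnote can be checked mechanically. What follows is a minimal Python sketch, not anything Perez-Cruz and Shin or the column used; the names `DATES`, `unique_by` and `solve` are invented for illustration. It assumes that "knowing" means exactly one candidate date remains consistent with what a person was told.

```python
from collections import Counter

# The ten candidate dates from the puzzle, as (month, day) pairs.
DATES = [
    ("May", 15), ("May", 16), ("May", 19),
    ("June", 17), ("June", 18),
    ("July", 14), ("July", 16),
    ("August", 14), ("August", 15), ("August", 17),
]


def unique_by(dates, index):
    """Keep dates whose month (index 0) or day (index 1) occurs exactly once."""
    counts = Counter(d[index] for d in dates)
    return [d for d in dates if counts[d[index]] == 1]


def solve(dates):
    """Apply the three statements in order; return the survivors after each."""
    # Statement 1: Albert (told the month) does not know the date, and is
    # sure that Bernard (told the day) does not know it either. So his month
    # holds more than one date and none of its days is unique overall.
    month_counts = Counter(month for month, _ in dates)
    months_with_unique_day = {month for month, _ in unique_by(dates, 1)}
    step1 = [
        d for d in dates
        if month_counts[d[0]] > 1 and d[0] not in months_with_unique_day
    ]

    # Statement 2: Bernard now knows, so his day is unique among the survivors.
    step2 = unique_by(step1, 1)

    # Statement 3: Albert now knows, so his month is unique among the survivors.
    step3 = unique_by(step2, 0)
    return step1, step2, step3


if __name__ == "__main__":
    for number, survivors in enumerate(solve(DATES), start=1):
        print(f"after statement {number}: {survivors}")
```

Nothing in the code depends on what the months or people are called, which is why a renamed puzzle with the same structure should be no harder for anything that is actually reasoning.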

**The chance of initially picking the correct door is 25 per cent, and that is not changed when Monty Hall opens two empty doors. Therefore the chance of winning $10,000 is 75 per cent if you switch to the remaining door, and 25 per cent if you stick with your initial choice. For a sufficiently steely risk-taker, it is worth paying up to $5,000 to switch.
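
The arithmetic here is short enough to verify in a few lines. This is a hedged sketch under two assumptions that the footnote leaves implicit: the host knows where the prize is and only ever opens empty doors, and the contestant is risk-neutral, caring only about expected winnings. The function names are illustrative.

```python
import random
from fractions import Fraction

DOORS = 4        # doors in the modified puzzle
PRIZE = 10_000   # grand prize in dollars


def exact_probabilities(doors=DOORS):
    """Win chances when the host opens every door except your pick and one other."""
    p_stick = Fraction(1, doors)   # the first pick was right all along
    p_switch = 1 - p_stick         # otherwise the prize is behind the other closed door
    return p_stick, p_switch


def break_even_fee(doors=DOORS, prize=PRIZE):
    """Largest fee a risk-neutral contestant should pay for the right to switch."""
    p_stick, p_switch = exact_probabilities(doors)
    return (p_switch - p_stick) * prize


def simulate(trials=100_000, doors=DOORS, seed=0):
    """Estimate the switching win rate by playing the game many times."""
    rng = random.Random(seed)
    switch_wins = 0
    for _ in range(trials):
        prize_door = rng.randrange(doors)
        pick = rng.randrange(doors)
        # The host leaves closed only the pick and one other door, and never
        # reveals the prize. Switching therefore wins whenever the first
        # pick was wrong.
        if pick != prize_door:
            switch_wins += 1
    return switch_wins / trials


if __name__ == "__main__":
    print("stick, switch:", exact_probabilities())
    print("break-even fee:", break_even_fee())
    print("simulated switch win rate:", simulate())
```

Under those assumptions the break-even fee comes out at $5,000, so the $3,500 asked in the puzzle is worth paying on expected value alone.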


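Copyright notice: This article is the copyright of FT中文網. No organisation or individual may reproduce, copy or otherwise use all or part of it without permission; infringement will be pursued.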
