LLMs doing a really bad job at things outside the scope of their language model (spatial relationships, the alphabet, math) isn’t new. That said, I think Google letting an LLM into their search more than they should is important news.
This is ChatGPT 3.5:
If I face north, then turn left, and left again, then 180 degrees to the right, then left and left and left, which way am I facing?
If you start facing north and turn left, you’ll be facing west. If you turn left again, you’ll be facing south. Turning 180 degrees to the right from south will also make you face south. If you then turn left three more times, you’ll still be facing south. So, after all these turns, you would be facing south.
GPT-4 is actually able to solve this, but most other modern language models can’t. It’s just an easy way to show the cracks in the “because it can speak means it understands” facade.
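(For reference, the correct answer is east. Here’s a minimal sketch in Python that just simulates the turns, to check the arithmetic independently of any model:)

```python
# Simulate the compass turns to verify the answer.
DIRECTIONS = ["north", "east", "south", "west"]  # clockwise order

def turn(facing: str, degrees: int) -> str:
    """Rotate by `degrees`: positive is clockwise (right), negative is counterclockwise (left)."""
    steps = degrees // 90
    return DIRECTIONS[(DIRECTIONS.index(facing) + steps) % 4]

facing = "north"
# left, left, 180 to the right, left, left, left
for degrees in (-90, -90, 180, -90, -90, -90):
    facing = turn(facing, degrees)
    print(facing)
# Prints, in order: west, south, north, west, south, east
```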
But, it does have internal buffer space. You can give it a massive chunk of text and it can parrot it right back to you perfectly. It just can’t make use of that internal buffer space for tracking and reasoning about the problem, which I would attribute to the fact that it can’t reason in the first place.
I mean, you are correct that saying that leads to it succeeding:
If I face north, then turn left, and left again, then 180 degrees to the right, then left and left and left, which way am I facing? Make a list of each step and print the current direction.
Sure, let’s break down each step and track the direction you’re facing:
Start facing north.
Turn left (90 degrees counterclockwise) - Now facing west.
Turn left again (90 degrees counterclockwise) - Now facing south.
Turn 180 degrees to the right (180 degrees clockwise) - Now facing north again.
Turn left (90 degrees counterclockwise) - Now facing west.
Turn left (90 degrees counterclockwise) - Now facing south.
Turn left (90 degrees counterclockwise) - Now facing east.
So, after following these steps, you would be facing east.
But I would say that if it had understanding, it would do that for itself, because it would know that’s what’s required to answer the question correctly. Or it would say “I’m sorry, I can’t quite make sense of that situation without talking through it one step at a time,” or something like that. Here, take a look at these: do these seem like an issue of buffer space rather than an issue of understanding?
What are the countries in Africa that start with a “K”?
There are three countries in Africa whose names start with the letter “K”:
Kenya
Kiribati
Comoros
Please note that Kiribati is not entirely located in Africa; it is a Pacific island nation. It seems there might be some confusion, as there are no other African countries that start with the letter “K” as of my last knowledge update in September 2021.
Or:
What are the countries in Africa that start with a “K”? Make a list of each step and print the current direction.
Sure, here’s a list of African countries that start with the letter “K,” along with each step and the current direction:
Start
Algeria
Angola
Benin
Botswana
Burkina Faso
Burundi
Cape Verde
Cameroon
Central African Republic
Chad
Comoros
Congo, Democratic Republic of the
Congo, Republic of the
Cote d’Ivoire
Djibouti
Egypt
Equatorial Guinea
Eritrea
Eswatini
Ethiopia
Gabon
Gambia
Ghana
Guinea
Guinea-Bissau
Kenya
Lesotho
Liberia
Libya
Madagascar
Malawi
Mali
Mauritania
Mauritius
Morocco
Mozambique
Namibia
Niger
Nigeria
Rwanda
Sao Tome and Principe
Senegal
Seychelles
Sierra Leone
Somalia
South Africa
South Sudan
Sudan
Tanzania
Togo
Tunisia
Uganda
Zambia
Zimbabwe
End
Please note that the information provided is based on my knowledge cutoff date in September 2021, and there might have been changes since then.
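The underlying task in both of those, by the way, is a trivial string filter. A minimal sketch (the country list here is hand-typed and possibly incomplete; it’s an assumption for illustration only):

```python
# A hand-typed subset of African country names; not authoritative, just
# enough to show that the task itself is a one-line string filter.
AFRICAN_COUNTRIES = [
    "Algeria", "Angola", "Benin", "Botswana", "Burkina Faso", "Burundi",
    "Cameroon", "Cape Verde", "Chad", "Comoros", "Djibouti", "Egypt",
    "Eritrea", "Eswatini", "Ethiopia", "Gabon", "Gambia", "Ghana",
    "Kenya", "Lesotho", "Liberia", "Libya", "Madagascar", "Malawi",
    "Mali", "Morocco", "Mozambique", "Namibia", "Niger", "Nigeria",
    "Rwanda", "Senegal", "Somalia", "South Africa", "Sudan", "Tanzania",
    "Togo", "Tunisia", "Uganda", "Zambia", "Zimbabwe",
]

print([name for name in AFRICAN_COUNTRIES if name.startswith("K")])
# ['Kenya']
```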
I do know the basics of how LLMs work, yes. My point was that the process it uses for next-word prediction is inherently capable of, in effect, iterating over a sequence of tokens and processing it step by step. For example, it’s got the capability to look back in its context and see that:
QUERY_TOKEN “And so I told him the story.” Repeat that quotation back to me. ANSWER_TOKEN “And so I told him
Needs to complete with “the”. That’s trivial for good LLMs and they can get it perfect every time. There’s nothing that would prevent that same logic from completing:
QUERY_TOKEN If I face north, then turn left, and left again, then 180 degrees to the right, then left and left and left, which way am I facing? ANSWER_TOKEN If you start facing north and turn left, you’ll be facing west. If you turn left again, you’ll be facing south. Turning 180 degrees to the right from south will make you face
… as “north.” The problem is not the need for some internal buffer separate from the context. The problem is that it can’t handle directions and spatial orientations with the same thoroughness that it handles language; if it could, it’d solve the second problem just as readily as the first.
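To make the copy-from-context behavior I’m describing concrete, here’s a toy sketch. It assumes naive whitespace tokenization and exact matching, so it’s nothing like an actual transformer, but it shows the kind of “find where I am in the sequence and copy what comes next” step I mean:

```python
# Toy illustration (not a real language model): complete the answer by finding
# where the answer-so-far already appears in the context and copying the token
# that follows it there.
def copy_next_token(context_tokens, answer_tokens):
    """Return the token that followed the answer-so-far where it appears in the context."""
    n = len(answer_tokens)
    for start in range(len(context_tokens) - n):
        if context_tokens[start:start + n] == answer_tokens:
            return context_tokens[start + n]
    return None

context = '“And so I told him the story.” Repeat that quotation back to me.'.split()
answer_so_far = '“And so I told him'.split()
print(copy_next_token(context, answer_so_far))  # the
```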
I’m not sure how to distill the point I’m trying to make down any further. The basics of what you’re saying are 100% accurate, yes, but look back at the two specific examples I gave. Are you asserting that an LLM inherently can’t process the second example, because it would all have to be done in one step, but at the same time it can process the first (in one step)? Can you see what I’m saying, that the two examples are identical in the respect that the LLM needs to identify which part of the input sequence applies to wherever it currently is in the output sequence?
Edit: Actually, a second counterpoint: if you’re saying this is just an inherent limitation of LLMs, how can GPT-4 do it?
Hm… yeah, I see what you’re saying. It’s not capable of maintaining “hidden” state as it goes step by step through the output, but if you have it talk its way through the hidden part of the state, it can do it. I can agree with that.