Topic: The Debate Over “Understanding” in AI’s Large Language Models
    Abstract: Prof. Mitchell will survey a current, heated debate in the AI research community on whether large pre-trained language models can be said to “understand” language—and the physical and social situations language encodes—in any important sense. She will describe arguments that have been made for and against such understanding, and, more generally, will discuss what methods can be used to fairly evaluate understanding and intelligence in AI systems. Mitchell will conclude with key questions for the broader sciences of intelligence that have arisen in light of these discussions.

    Bio: Melanie Mitchell is a Professor at the Santa Fe Institute. Her current research focuses on conceptual abstraction, analogy-making, and visual recognition in artificial intelligence systems. Melanie is the author or editor of six books and numerous scholarly papers in the fields of artificial intelligence, cognitive science, and complex systems. Her book Complexity: A Guided Tour (Oxford University Press) won the 2010 Phi Beta Kappa Science Book Award and was named by Amazon as one of the ten best science books of 2009. Her latest book is Artificial Intelligence: A Guide for Thinking Humans (Farrar, Straus, and Giroux).

So it looks like we are already live. Hello everybody, and I'm glad to greet you at the first event of the Ukrainian Scientific and Educational IT Society. We are going to invite distinguished computer science academics from all over the world. Before we start and before I invite today's speaker, a few technical points: please turn off your microphones; you can use the chat for comments, but if you have a question, please use the Q&A section. For those watching us live on Facebook, you can also ask questions in the comments. And now I'm glad to give the floor to Melanie Mitchell, Professor at the Santa Fe Institute. Professor Mitchell has worked on artificial intelligence from fascinating perspectives, including complex adaptive systems, genetic algorithms, and cognitive architectures, and she is the author of six books and numerous scholarly papers in the field of artificial intelligence. Professor Mitchell, the floor is yours.

Great, thank you so much for inviting me to speak at this event. Can you see my slides okay? (Yes, perfectly.) Perfect. I'm going to talk about a very prominent debate going on in the AI community, which I wrote about with my collaborator David Krakauer in a paper published recently in the Proceedings of the National Academy of Sciences: whether large language models can understand in any interesting and meaningful sense. I'll talk about what that means.

We know that many AI systems have been shown to have trouble understanding the data they process, at least in a human-like way, and that matters because these systems can make errors that are hard to predict and that might have big impacts on the people who use them. Here are a few examples. Back in 2018, when people were using deep neural networks to do very well on the ImageNet object-recognition challenge, we had networks that could recognize objects like a school bus with essentially 100% confidence. But one group showed that if those objects were put into unusual poses with photo-editing software, the network was 99% confident the object was a garbage truck, 100% confident it was a punching bag, or 92% confident it was a snowplow. These networks were seeing the images very differently from humans: they did well on images similar to their training data, but they had a lot of problems when the images were changed in ways humans would not find puzzling.

We have also seen more recent examples, such as self-driving-car vision systems that do very well at recognizing objects on the road, like cars, but have trouble with, say, pictures on the back of a van advertising e-bikes, treating those pictures as real bicycles and people in the road. Taking the context of an image into account seems to be a challenge for many of these vision systems. One very striking example was when a person tweeted that his Tesla, running the self-driving software, kept slamming on the brakes in an area with no stop sign. After a few drives he noticed a billboard which, if you zoom in, is an ad featuring a police officer holding a stop sign, and the car was interpreting that as a real stop sign and braking. This kind of basic recognition of context and common-sense knowledge has been very challenging for AI systems.
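For readers who want to see this kind of brittleness for themselves, pose sensitivity is easy to probe crudely with off-the-shelf tools. The sketch below is illustrative only, not the original study's setup: it assumes torchvision 0.13 or later and a local photo named school_bus.jpg, and it uses simple in-plane rotation, which is only a rough stand-in for the 3D pose changes used in the study described above.

    # Crude probe of a pretrained classifier's sensitivity to object pose,
    # in the spirit of the rotated-object failures described above.
    import torch
    from PIL import Image
    from torchvision import models

    weights = models.ResNet50_Weights.IMAGENET1K_V2
    model = models.resnet50(weights=weights).eval()
    preprocess = weights.transforms()
    labels = weights.meta["categories"]

    img = Image.open("school_bus.jpg").convert("RGB")  # assumed local photo

    for angle in (0, 45, 90, 135, 180):
        rotated = img.rotate(angle, expand=True)
        with torch.no_grad():
            probs = model(preprocess(rotated).unsqueeze(0)).softmax(dim=1)
        p, idx = probs.max(dim=1)
        print(f"{angle:3d} deg -> {labels[idx.item()]} ({p.item():.2f})")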
Another example of a lack of understanding is when machine-learning systems learn what are called shortcuts. In one paper, researchers trained a deep neural network to recognize malignant skin cancer. The system did very well on a particular benchmark of images, but when the researchers looked at what cues it was using, they found that images containing rulers were more common among malignant lesions than benign ones, and the system had been using rulers as a predictor for skin cancer because it had found that statistical association. This is called a shortcut: the machine-learning system learns something that makes it do well on the training data, but that thing is not what we intended it to use, and it won't generalize well.

Here's an example that Dmytro gave me. In an earlier talk I had an example in French, but he gave me this example in Ukrainian, of Google Translate translating an English idiom: "We wanted to play football, but now it's raining cats and dogs, so I guess we'll stay inside." Google Translate did not understand the context of the idiom and translated it quite literally into Ukrainian. I actually don't know what the result says, but you can all see the example. This kind of translation error has had real-world effects: for example, people processing refugees' applications in US asylum cases have found that the translation apps they use can make very big errors that can actually endanger some cases.

Now we're really in a new era of AI, with large language models like ChatGPT, and some people have proposed that these systems are a revolution in AI that have achieved a richer, more human-like understanding than the previous AI systems, which, as I showed you, can often be quite brittle. Here's an example Dmytro gave me from ChatGPT: he asks it to translate the same sentence from English to Ukrainian, and its translation is much better than what Google Translate gave us. You can even ask it to explain the translation, and it explains that it made sure to convey the same ideas while adjusting the idiomatic expression appropriately for the target language, and it describes how it translated the English idiom into a Ukrainian idiom that expresses the same thing. So it does seem to understand this language much better than previous AI systems ever have.

Terrence Sejnowski, a pioneer in neural networks going back to the 1980s, wrote a paper saying that something is happening now that we didn't expect: these systems can communicate with us in an eerily human way. They're not human, but they're superhuman in some aspects, and some aspects of their behavior appear to be intelligent. But if it's not human intelligence, what exactly is the nature of their intelligence? This is something everybody in the AI community is grappling with: how do we characterize the intelligence of these systems?
Some people have proposed that these systems do, in a very real sense, understand a wide range of concepts. Blaise Agüera y Arcas, an AI researcher and a vice president at Google, very much believes that these systems do understand language. But people like Jacob Browning and Yann LeCun, Meta's chief AI scientist, take the opposite view: they say that systems trained on language alone will never approximate human intelligence and understanding, even if trained from now until the heat death of the universe. So this is really a big debate in the AI community. One group did a survey of natural-language-processing researchers, asking people who attended big NLP conferences whether they agree or disagree with this statement: some generative models trained only on text, given enough data and resources, could understand natural language in some non-trivial sense. The responses, ranging from disagree to agree, were split right down the middle, about half and half, which shows there is no agreement in the community on this question of understanding.

So how can we evaluate whether LLMs understand? One way is to just look at their behavior: a kind of Turing test, where you talk to them and ask whether they seem human, whether they could fool you into thinking they're human. One problem with that is called the Eliza effect, named after the original chatbot ELIZA from the 1960s, which mimicked a psychotherapist. ELIZA was very primitive, but people talked to it and were often convinced that it understood them and that they were talking to an entity that actually had intelligence and understanding. Humans are very prone to project understanding onto systems that communicate in natural language, and this is something we have to be aware of: we ourselves can be fooled.

A more objective approach is to test these systems on benchmarks designed to assess natural-language understanding, and there are quite a few of these in the NLP community. The problem is that such benchmarks often allow the kinds of shortcuts I showed before, where subtle statistical associations in the text allow a system to do well on the benchmark without actually understanding in the way people do. For example, there is a dataset called GLUE, the General Language Understanding Evaluation, a collection of natural-language-understanding tests, and its successor SuperGLUE, which is even more extensive. On the leaderboard showing different systems' scores on this benchmark, the top seven entries are all large language models; number eight is humans. Does this mean these large language models outperform humans at language understanding? On these benchmarks, yes, but it's unclear that the benchmarks truly test language understanding, because of issues like shortcuts. People have written about this in many papers, and often it's very subtle to tease out exactly what the shortcuts are.

Another approach is to give large language models standardized tests designed for humans. We've seen headlines like "ChatGPT does better than many students on MBA exams," or law exams, or medical licensing exams. The problem is that it's hard to know whether parts of those exams, or something similar, were actually in the system's pre-training data; that's called data contamination.
don’t know what the pre-training data is for systems like chat GPT so it’s hard to know if they were actually trained on something similar to the things they’re being tested on but also it’s not clear if performance on these kinds of tests for a system that sort of memorized the entire internet in a way um will actually correlate with performance in the real world on new kinds of uh questions that haven’t aren’t like ones they’ve seen before and I wrote about this in my substack blog and um other people have re-evaluated the um performance on different standardized tests and shown that it’s less impressive than it originally seemed and others have argued that we shouldn’t really the professional benchmarks designed for humans are not appropriate for testing these large language models so now another approach is to evaluate these systems on tasks that require what we feel like are the the sort of Hallmarks of understanding the ability to abstract and to reason but even when these systems do well on reasoning tasks there’s a question of how real or robust are the abstractions they create so as one example this paper did a test um on a whole bunch of different tasks that required some kind of abstraction and reasoning and showed that when given um questions in this each domain that were similar to what’s likely to be in the training data gp4 did very well that’s these blue bars but when they changed the task to be what they called counterfactual task um different from what’s likely to be in the training data but using the same abstract reasoning uh processes this the performance of gp4 went uh down fairly dramatically in many cases so just as one example on code execution so uh gp4 is quite good you can see on this blue bar of the task of taking a little snippet of python code and printing out what it produce would produce when run so it interprets the code it in some sense understands the code and this is you know the the uh prom the kind of prompt that’s given but these uh researchers said well what if we make a counterfactual task where we have this uh pretend hypothetical programming language they called it thonp instead of python it’s identical except uh structures like lists like this one are indexed by one instead of zero so rather than this being the zeroth element it’s uh the first element so you know this is something like mat lab and R the indices start from one can you then print out what this these Snippets of code do and there this uh sort of Red Bar shows that gp4 was much worse on that version was not able to adapt what it understood in the original case to this new very similar case whereas I think any human programmer who um understands code would be able to adapt it so that’s a kind of abstraction that uh gp4 doesn’t seem to be able to make another paper there’s been several papers on this same theme this paper called Embers of Auto regression showed that uh when they tested on in various uh uh reasoning problems so uh gp4s performance on things that are likely to be common in the training data the performance was much better than when they tested the same reasoning task but with content that was likely to be rare in the training data so the take-home message here is that large language models tend to be better often dramatically better on solving reasoning tasks that are similar to those seen in the training data and that reflects a failure of abstract understanding so how is it that we could get machines to learn and use humanlike Concepts and abstractions this is really 
So how could we get machines to learn and use human-like concepts and abstractions? This is really the topic of my own research. Human concepts are mental models of categories, situations, events, and even one's own self and internal state. So how can we get machines to form the kinds of mental models that we humans use to understand?

Here's a really simple concept: "on top of," something being on top of something else, a basic spatial concept. We know that human concepts are compositional: if you understand something on top of something else, you can understand something else on top of that, and something else on top of that; we can apply the notion of "on top of" recursively or iteratively. If we understand something like a cat on top of a television, we can also understand a television on top of a cat. I got a text-to-image program to generate that, though it had a lot of trouble with the television on top of a cat, because that is much less likely. Humans, though, are able to take these compositional concepts and apply them in different ways.

Concepts also have causal structure, and they enable us to predict, to reason, and to have basic common sense. Even babies learn how things that are on top of other things will behave under certain kinds of interventions; children can reason about how to get on top of something they want to get on top of; and they learn a kind of topological reasoning, for instance that shoes go on top of socks, so you have to put your socks on before your shoes. Humans are able to understand these very basic ideas.

Moreover, concepts like spatial concepts can be abstracted to new situations via analogy and metaphor. In English we have all these metaphors that use the concept "on top of": on top of the world, on top of one's game, at the top of one's voice, on top of a social hierarchy, translating the spatial idea to other modalities and to more abstract situations. This ability to take basic concepts and abstract them in this way is, I think, absolutely central to human understanding. In fact, the cognitive psychologist Lawrence Barsalou defined a concept as a competence or disposition for generating infinite conceptualizations of a category; you saw how I was trying to get at that with "on top of," but really you can do this with any concept. It says that concepts are mental models that are generative, and I think that is something that is really needed in AI.

Lakoff and colleagues proposed that abstract concepts are learned via metaphors involving core concepts. For example, we conceptualize social interactions in terms of physical temperature, saying things like "she gave me a warm greeting," and status is conceptualized as physical location: "she's two rungs above me on the corporate ladder." Lakoff and colleagues have shown that this understanding of abstract concepts via metaphor is pervasive in our language.

Other people have proposed that concepts are grounded in mental simulations of physical situations. Josh Tenenbaum and his collaborators, for example, have proposed that our physical understanding, and metaphorically our abstract understanding, is grounded in mental simulations, kind of like game engines; they talk about a "game engine in the brain," the idea that we actually do some kind of simulation of physical properties in our understanding.
Finally, Elizabeth Spelke, a developmental psychologist, proposes that concept learning builds on innate systems of what she calls core knowledge. Core knowledge, she proposes, is essentially innate in humans: we are born with either the knowledge itself or a strong propensity to learn it. It includes knowledge about objects, that the world is divided into objects and that objects have certain kinds of interactions; numerosity, so we understand concepts like something being larger or smaller than something else, and we grasp very small numbers like one and two; the basic geometry and topology of space, so we understand relationships like things being inside or outside other things, or containing other things; and agents, the understanding that agents have goals and that certain kinds of behavior can be goal-directed. This is a proposal for how concepts arise and how they are built up out of these basic core-knowledge systems.

So how do we get machines to understand, in all the ways I've been talking about, these kinds of human-like concepts? François Chollet, a researcher at Google, proposed what he called the Abstraction and Reasoning Corpus. He introduced it in a paper that I recommend if you're interested in AI and cognitive science; it's a really interesting argument about how to measure intelligence in either humans or machines. The benchmark is essentially a domain for concept induction and abstract reasoning, and it was inspired by Spelke's core knowledge systems.

Let me give you an example of what is called a task in this Abstraction and Reasoning Corpus, or ARC. Here the world is defined in terms of grids of colored pixels, and one grid is transformed into another. A task gives a few demonstrations: this grid transforms into this one, a second grid transforms into this one, a third into this one. All of these transformations have something in common, and that common element is the abstract concept the task is demonstrating. Your job is to do the same thing to a test example. In this case you can probably see the pattern: you take the color that is different in the little pyramid and extend the pointy part of the pyramid to the boundary in that color, so here we would do that with the light blue, up to the top. This involves concepts of objects, of shapes, of differing colors, of extending something to a boundary, and of directionality, which way something is pointing: all kinds of core-knowledge concepts.

Here's another one. We have three demonstrations of transformations, and your task is to do the same thing to a new test input. If this were an in-person talk I'd ask for volunteers, but I'll just tell you: this one is about shape invariance. We have different kinds of shapes, for instance a rotated U shape, and those shapes are colored pink no matter what size they are, no matter what direction they're facing, no matter where they are in the grid.
Similarly, the shapes that look like a rotated L are colored blue, and the ones that look like backward H's are colored red. This notion of shape invariance is an important idea, having to do with how shapes can be transformed in various ways.

Chollet created a thousand different tasks that tried to get at understanding core concepts and being able to abstract a concept from just a few examples. You can think of each task, in machine-learning terms, as three training examples and a test example, so it's what's called a few-shot learning problem: there are only a few examples. Humans are very good at that, but machines are not. Chollet published 800 of the tasks and held out 200 as a hidden test set, and he put the whole challenge on Kaggle, a platform for machine-learning competitions, offering some prize money. In the end, almost a thousand teams submitted programs that attempted the challenge, and they were tested on the hidden test set. Each program was allowed three guesses per task, and if any guess was right, the task was considered solved correctly. The winning program, which took a program-synthesis approach (given the demonstrations, it tried to synthesize a program that would generate the transformations and then applied it to the test input), got only about 20% accuracy; the ensemble of the top two programs got about 31%, which stood as the world record. There was another challenge last year, run in Switzerland, where you could earn a thousand Swiss francs, about $1,000, for every percentage point above the world record, and I think the new world record is about 34% accuracy; there is a new ARCathon challenge in 2024. So this is still much lower than human performance on these problems. People have tested large language models on ARC as well, and I'll talk a little bit about that.
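For concreteness, an ARC-style task and the three-guesses-per-task scoring rule just described can be sketched in a few lines of Python. The field names and the toy grids below are illustrative only; they are not Chollet's exact file format or evaluation code.

    # An ARC-style task: a few demonstration input/output grid pairs plus a
    # test pair. Grids are small 2-D arrays of integers, each integer a color.
    task = {
        "train": [
            {"input": [[0, 3], [0, 0]], "output": [[3, 3], [0, 0]]},
            {"input": [[0, 0], [5, 0]], "output": [[0, 0], [5, 5]]},
        ],
        "test": [
            {"input": [[7, 0], [0, 0]], "output": [[7, 7], [0, 0]]},
        ],
    }

    def solved(guesses, target):
        """A task counts as solved if any of up to three guesses matches exactly."""
        return any(g == target for g in guesses[:3])

    # A solver that just echoes the test input would fail this task:
    pair = task["test"][0]
    print(solved([pair["input"]], pair["output"]))  # False

    # Overall accuracy is the fraction of tasks solved under this rule.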
Our group, my collaborators and I, felt this was a fantastic challenge; we really liked it, but it had a few problems. Many of the problems in the original ARC set are actually pretty hard for humans, and they don't systematically test understanding of concepts. Just as an example, if a program can solve this task correctly, does that mean it understands, in a general sense, the notion of object invariance? Not necessarily, because it might be using some kind of shortcut. I don't know whether that's true here, but maybe, say, the things on the left tend to be red and the things on the right tend to be blue; that isn't the actual shortcut here, but there might be other shortcuts that enable a program to solve a particular problem. What you really need instead is a whole set of variations on each concept.

So this is work I did with some collaborators here at the Santa Fe Institute. We created a new benchmark called ConceptARC, with variations on ARC tasks for each of 16 concepts. We tried to make the tasks as easy as possible for humans, but within each concept they varied in complexity and degree of abstraction. The kinds of concepts we looked at were spatial and semantic concepts like identifying something at the center, inside versus outside, same versus different, and top versus bottom.

Let me give you an example. We had a total of 480 tasks; each concept group, like "center," had 30 different tasks. Here's a pretty simple example from our benchmark: you can probably see easily that the task is to delete the bottom object, whatever that object is, to get from the left grid to the right one, and then to do the same thing on the test input. We tested humans on this; we also tested the winning program from the original Kaggle competition, and we tested GPT-4. We actually tested GPT-4 in two ways: one was a text version of these tasks, where you give each pixel as a number, just the pixel values of each grid; we also tested it through its vision system, where we give the actual grid images to the vision system. I can tell you later exactly how we tested it, but interestingly we found that the text-only version worked a lot better than the vision system. Anyway, all of these systems got this task correct. Does that mean that all of them, humans, the Kaggle program, and GPT-4, understand top versus bottom and objects? Well, we have to test further. Here's another variation on the same concept, where the rule is to color the top row of the object red: everybody got that right, and it's looking good. But with yet another variation, remove the top and bottom objects, 100% of humans got it right while these programs got it wrong, showing that they lack some robustness in this concept of top and bottom. Here's another example, extract the center pixel's color: most humans and the programs got that one right. But on another, extract the center object, 100% of humans got it right while the winning program and GPT-4 got it wrong. And on yet another variation, move the colored pixel to the center and extend it, humans got it right and these programs got it wrong. When we compared humans and these systems across all the different concepts, 30 tasks per concept, human performance was very high overall; the first-place Kaggle program scored higher than the 20% it got on the original ARC benchmark, because our problems are intentionally much easier; and here is GPT-4's best performance. These programs are still lagging behind humans on these relatively simple conceptual-abstraction problems.
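As a methodological aside, the text version of the ConceptARC tasks mentioned above, with each pixel given as a number, could be produced by something like the following sketch. This encoding is an assumption for illustration; the exact prompt format used in the experiments is not specified in this talk.

    # Serialize an ARC-style grid into plain text for a language-model prompt,
    # one row per line of numeric pixel values. The prompt wording here is
    # illustrative only.
    def grid_to_text(grid):
        return "\n".join(" ".join(str(v) for v in row) for row in grid)

    def task_to_prompt(train_pairs, test_input):
        parts = ["Each example shows an input grid and an output grid. Infer the rule."]
        for i, pair in enumerate(train_pairs, 1):
            parts.append(f"Example {i} input:\n{grid_to_text(pair['input'])}")
            parts.append(f"Example {i} output:\n{grid_to_text(pair['output'])}")
        parts.append(f"Test input:\n{grid_to_text(test_input)}")
        parts.append("Test output:")
        return "\n\n".join(parts)

    print(task_to_prompt(
        [{"input": [[0, 3], [0, 0]], "output": [[3, 3], [0, 0]]}],
        [[7, 0], [0, 0]],
    ))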
So, just to end: my question was about the debate over understanding in language, and I think also, more generally, understanding the world that language describes. In principle, I think AI could understand the world; we humans can understand it to some extent, and I don't see any reason why we couldn't build AI systems that do. But current AI still has many failures of understanding, which makes it untrustworthy in many cases. I believe that to understand the world in a human-like way, these systems need human-like core knowledge systems and concepts; I talked extensively about how we humans build up these concepts, and these abilities, composition, analogy, metaphor, causality, are ones current AI systems lack. Current AI systems have, in a sense, gone backwards: they start with language and supposedly get emergent concepts from that, whereas humans, as babies, start with learning basic concepts and then build language on top of them. So AI is learning in the opposite order from humans, and that might not be the best way to build robust systems that understand the world. Some people have proposed that AI systems will need something like embodiment, or active interaction in the real world, as opposed to just learning from language or passively from images and videos; that's a big question that I think we can explore empirically. I'll stop there. This was our paper in PNAS that discussed many of these issues, and I'm happy to answer questions and have a discussion.

Thank you. So it's time for questions. We have some questions in the chat and in the Q&A section, but if you want to ask your question out loud, you can just raise your hand and I will give you a turn. The first question from the chat: how is a simulation of understanding distinguished from genuine understanding, and does the distinction matter for practical applications?

Yeah, I think it's in some sense a philosophical question whether a simulation of understanding is different from actual understanding. That's what the Turing test was meant to get at: it says there is no difference, that if a system behaves as though it understands, then it does understand; that was Turing's point. But as I mentioned, that runs into problems with the so-called Eliza effect, the fact that we humans can be easily fooled. So I don't know the answer to that question. I think that right now it's still not that hard to distinguish machines from humans if we are really trying to. Machines can fool us if we're not really trying to distinguish them, but if we apply a more rigorous test, we can tell whether something is a machine or not. In the future, maybe that will change, and then it's going to be harder to say whether or not we believe they understand without better defining what we mean by "understand," and without looking internally at whether they have the kinds of mechanisms we believe underlie understanding.

Thank you. Art, do you want to ask your question by voice?

Yes, I would like to ask my question by voice, because it seems to me that we don't have a good definition of understanding. We talk about whether these huge large language models understand, but in general, in natural language processing, we don't have a good definition of understanding. I want to offer an argument that an AI system cannot understand because it cannot have experiences like us. Imagine I say "I like ice cream": I have the experience of tasting ice cream, but these artificial intelligence systems cannot have that experience; they don't have qualia, and therefore they cannot understand. So you might try first to define what understanding is, because we are not sure what we are looking for; I would like your take on this. I think it's a philosophical problem; it goes back to John Searle's Chinese Room argument, but there is now a resurgence of interest in experience in neuroscience, with Christof Koch, for example, talking a lot about this human experience and the claim that a computational system cannot encompass it. Thank you.
Yeah, as you say, that's a philosophical question about sentience and consciousness and embodiment that has been argued about probably for centuries, even before AI. So I agree with you that the question of understanding is not well enough defined for us to answer it. Maybe one of the things AI is forcing us to do is to think about different kinds of understanding, to make the definition more dependent on what type of understanding we're talking about. I think AI is also forcing us to think about words like intelligence and consciousness and all of these mentalist terms that don't have rigorous definitions; perhaps we are going to have to redefine them, at least intuitively, to make room for machines doing what they do, machines that perhaps understand in some ways but not in others.

Thank you. We also have some questions in the Q&A section. The first one: can we ask your opinion about intelligent systems on the Android platform, for example expert systems that can correctly justify their decisions about a problem, compared to ChatGPT? According to your talk, ChatGPT doesn't justify decisions quite correctly. Is it possible in the future to use not only GPT-4 but also other LLMs for decision-making?

Yeah, I think there has always been a bit of a trade-off in machine learning between performance and interpretability. We have things like decision trees, which are much more interpretable, more symbolic kinds of systems, versus neural networks, which are not very interpretable but in general perform much better. Many people have proposed that we really need to merge these kinds of approaches, to build what are called neurosymbolic systems that retain the performance of deep neural networks while allowing the interpretability of the more symbolic approaches. I think it's a great idea; we haven't really seen it work very well yet, but perhaps in the future that kind of hybrid system will be what we rely on. It's an area of very active research; that's all I can say for now.

Great, thank you. The next question from the chat is about the experiments: is the IQ of the people who participate in the tests assessed? Is the speed of finding correct solutions assessed? How many attempts are given to people, and are people selected based on their IQ, age, or any other criterion?

In the experiments we did, we ran them on a site called Prolific, which is a crowdsourcing site for psychological experiments and other human tasks. All we required was that people be over 18 and that English be their native language, for these particular tests, because part of the test was asking them to write explanations of how they solved the problem. We did not collect information about the speed with which they solved the problems. We gave each person three attempts, just like the programs, and we didn't give them any feedback except that they hadn't gotten it correct on the first or second attempt; that was the same feedback we gave the programs. I think that covers all of the questions.

Okay, thank you. We have two more philosophical questions in the Q&A. First: how do the different interpretations of understanding among AI researchers reflect broader philosophical disagreements about the nature of intelligence and cognition?
And I think we can also pull in the second one: how might LLMs influence our understanding of human language processing and cognition?

Okay, those are two really great questions. I think there are a lot of differences in philosophical approaches to understanding, which is probably what creates this division, this debate. As was brought up earlier, maybe the notion of qualia, what it feels like to eat ice cream or to see the color red, means that without a body and the right kind of sensors you cannot understand those things; that's one proposal. But other people feel that at least a good enough kind of understanding can be gleaned from language alone, and I think that is really a difference of opinion in the field.

The second question was how these LLMs might affect our view of natural language. I think they have really challenged a lot of theories of how natural language works and how it can be learned. The most influential linguist, of course, is Noam Chomsky, who saw language as requiring some kind of innate structure, what he called universal grammar. It does seem that large language models, which do not have that kind of innate structure, have captured the syntax and often some of the semantics of language, so that hypothesis has been strongly challenged, if not disproven; there is still some debate about that too. One of the things AI has done throughout its history is to challenge our views of how intelligence works, and large language models are perhaps among the biggest of those challenges.

Okay, thank you. And I believe I have a last question: considering the current trajectory of AI development, how should we prepare for the integration of AI into society, especially in roles that traditionally require deep understanding and empathy?

Yeah, that's a hard question. We've already seen a lot of issues in integrating AI into society, things having to do with disinformation, deep fakes, and very convincing generation of text, images, video, and audio, which is having a big effect on how trustworthy media is, and I think it's just going to get worse. It's hard to know how to solve this problem. There has been a lot of regulation of AI, in the EU, some in the US, some in other countries, and it's hard for regulators to keep up with the speed of progress, or of change, in these technologies, so I think it's a really difficult question. In terms of roles that need deep understanding and empathy, I don't think humans are going to be replaced as soon as some people say. There have been many predictions about kinds of jobs that would be replaced soon, and all of those predictions ended up failing. I think we tend to underestimate how complex human intelligence is and how much unconscious intelligence is needed to do certain kinds of tasks. We certainly saw that with expert systems, where people tried to extract knowledge from experts and put it into a set of rules, and it turned out that the experts weren't able to communicate some of the most important things about how they were solving problems and making decisions, because that knowledge was unconscious and based on very basic, perhaps non-linguistic, understanding.
So it's hard to predict what's going to happen, but I think AI is somewhat hyped right now and is not going to have the kind of effect of killing off everybody's jobs that some people are proposing.

Right, thank you. Actually, I can recommend that our participants watch the Munk Debate on AI in which you participated, and, without giving too much of a spoiler, in a month we will host one of the speakers who opposed you in that debate; I won't say yet who it will be. And we have one more question: when will AI capabilities exceed humans', in your opinion?

I think it will be longer than many people have predicted. Human capabilities, I mean, it's hard even to say what those are, just to make a list. Maybe ChatGPT can write a paper for you, but it's not going to go fix your plumbing anytime soon, and it's not going to do a lot of the things that humans can do, some of which aren't necessarily considered part of "intelligence" but actually are deeply important for all of our intelligence-related abilities. So I would say: not very soon.

Okay, thank you, and we have one more question; we actually have two more minutes. The question continues the previous one about your experiments: did 100% of people solve the tasks within three attempts? Were the tasks that easy? What happened if somebody failed after three attempts: were those results not used in the analysis? The core of the question is whether we are comparing the work of AI with the best humans or with the average level of concept understanding. And maybe I can continue that: we have also seen that ChatGPT successfully passed law school tests or MBA entrance exams, so maybe we should also look critically at the tests designed for humans; maybe they are not the right measure either.

Yeah, in our experiments we were really, in some sense, comparing with average humans over our sample; we weren't comparing with the best humans, and it's always a question how you compare. We were trying to see whether, on average, humans were able to solve these kinds of problems fairly easily, and generally they were. On these crowdsourcing platforms the signal is kind of noisy, because some people aren't even trying, but overall we got pretty high performance from humans. And standardized tests like the bar exam and medical exams do have potential shortcuts; humans can often solve them via shortcuts too. If you've ever taken a multiple-choice exam, you're probably using not just your understanding of the content but also your understanding of how multiple-choice exams work, and that is a kind of shortcut. But machines are much better at finding very subtle shortcuts, so I'll just say that.

Great, thanks a lot; it was really interesting for our participants. I want to mention that we will have a recording of this lecture, because we understand the problems with electricity in Ukraine. Thanks a lot; it was Professor Melanie Mitchell. Thank you for your talk, and please subscribe to our social media, where we will announce our next talks.
Thank you, thank you so much for coming. Bye-bye.
