My oral presentation at NAACL 2024

Hi everyone, I'm Michael Saxon, a PhD student at UC Santa Barbara, and this is my work with collaborators from ASU and Johns Hopkins called "Lost in Translation? Translation Errors and Challenges for Fair Assessment of Text-to-Image Models on Multilingual Concepts."

So what do I mean by multilingual assessment in text-to-image models? This is a follow-on to previous research of ours in which we inspected the multilingual capabilities, in particular the unexpected multilingual capabilities, of text-to-image models. In that work we produced parallel lists of prompts describing common tangible objects, in this case "a photo of an airplane," slotted in different text-to-image models such as Stable Diffusion and DALL-E, and prompted them with these parallel translated prompts. This gave us populations of images for each concept, which we could then compare using image-space features to check whether the model truly knows those concepts in the different languages.

To produce this set, we sampled sets of tangible nouns in English and then applied a rather convoluted pipeline to translate them into these parallel lists. We first used an ensemble of commercial translators (Google Translate and others) to produce candidate sets of translations, and then we used BabelNet, a knowledge graph built from Wiktionary and other sources, which ties words together into "synsets": cross-lingual sets of words aligned to the same meaning. The idea in our implementation was that we could use these parallel synsets to assert that the same sense of the English word was preserved across the translations. So: here's the initial list of concepts, there's our translation pipeline, and we end up with these parallel translated lists.

This then let us feed each translated concept into the different models. From top to bottom we have DALL-E Mini, DALL-E 2, CogView2 (a Chinese-language model), Stable Diffusion 1.4 and 2, and AltDiffusion, a model explicitly trained with a multilingual text encoder rather than CLIP. These histograms represent each of the concepts in the inventory we collected with that translation pipeline: rightward mass corresponds to higher concept-level cosine similarity in the image feature space between the English-language images and the other-language images. In other words, for each language, more rightward mass means more concepts are known by the model, whereas leftward mass means the model is less performant in that language.

Here are some tangible examples from that initial work. Here's DALL-E 2 given the concept "airplane" in each language. As we can see, it does appear to know the airplane in Chinese, although there are some nuances between languages in what the prototypical image of an airplane looks like. However, when we go to Stable Diffusion 2, while it still knows what an airplane is in Spanish and German, it does not know it in Chinese. We verified that this analysis was sound by sampling from the rightward side of these histograms and checking the generated images against the concept itself: that x-axis correctness feature, which we called XC, does indeed correspond to knowledge of the concepts by the model. From top to bottom we have Spanish "rabbit," Japanese "snow," and Indonesian "guitar," and by sampling from these histograms we confirmed that this feature is a reasonable proxy for concept knowledge.
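To make the image-space comparison described above concrete, here is a minimal sketch of the concept-level similarity idea, assuming CLIP ViT-B/32 image features via Hugging Face transformers and mean pairwise cosine similarity between the two image populations. It illustrates the idea rather than reproducing the benchmark's exact implementation.

```python
# A minimal sketch of concept-level cross-lingual similarity in image space:
# embed the images generated from the English prompt and from the translated
# prompt with CLIP, then take the mean pairwise cosine similarity. Model choice
# and aggregation are illustrative, not the benchmark's exact implementation.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def image_features(paths):
    """Embed one population of generated images (e.g. all 'airplane' samples)."""
    images = [Image.open(p) for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)  # unit-normalize rows

def concept_similarity(english_image_paths, target_image_paths):
    """Higher values mean the target-language images resemble the English ones,
    i.e. the model plausibly 'knows' the concept in that language."""
    en = image_features(english_image_paths)
    tg = image_features(target_image_paths)
    return (en @ tg.T).mean().item()
```

Aggregating this score over every concept in the inventory is what produces the per-language histograms described above.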
However, when we presented this work to my future collaborators at ASU, we found that there were actually quite a few errors in the translation set we had produced. In fact, we found a candidate error rate of 4.7% in Spanish, 8.8% in Chinese, and 12.9% in Japanese, and we were as picky as possible about calling something an error. For example, on the top row we have an error in the translation of the concept "bike" into Japanese: the original translation produced by the pipeline was baiku, which carries more of a sense of a motorbike than a bicycle. This is a difference in the prototypical sense that English speakers and Japanese speakers have for "bike." Under the candidate correction, jitensha, we can verify that the images are now the same between English and Japanese, but we wouldn't necessarily characterize the original as a particularly severe translation error. On the bottom row, by contrast, the concept "suit" was first translated to a word meaning "suitable" or "fitting"; after we applied the correction, xīzhuāng, it means a Western-style suit, and the images are correctly generated.

So the question is: which of these translation errors really matter? Which ones impact the construct validity of our dataset, and which don't? We introduced a text-domain error severity feature which we call Δsem. We embed the concepts before and after correction in a multilingual embedding space and look at the difference in distances between the English word and the candidate-language words in that space. Here's the same bicycle example: baiku and jitensha are relatively close together in the embedding space, since they share many senses, so there is a small difference and therefore a small Δsem. Another translation error we had in Chinese was that the word "table" was translated to biǎo, which means something more like a table as in a spreadsheet, as far as I understand (I don't actually know Chinese; I just made sure to note how to pronounce these before the talk). When we apply the correction to zhuōzi, that is much closer to the prototypical sense of a tangible table that we have in English. Of course, both are correct translations of the term, but it doesn't really make sense to ask a text-to-image system for an intangible table in the sense of a tabular form, so we characterize this as a large difference and a severe error.
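One plausible reading of the Δsem feature described above is the change in embedding-space distance to the English concept, before versus after correction. The sketch below assumes the LaBSE multilingual encoder via sentence-transformers and cosine distance; the exact encoder and formula used in the paper may differ.

```python
# A hedged sketch of the Δsem idea: embed the English concept, the original
# translation, and the corrected translation in a shared multilingual space,
# then compare their distances to the English word. LaBSE and cosine distance
# are illustrative choices, not necessarily those used in the paper.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("sentence-transformers/LaBSE")

def cosine_distance(a, b):
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def delta_sem(english_word, original_translation, corrected_translation):
    """Larger values suggest the correction moved the word much closer to the
    English sense, i.e. the original translation was a more severe error."""
    en, orig, corr = encoder.encode(
        [english_word, original_translation, corrected_translation]
    )
    return cosine_distance(en, orig) - cosine_distance(en, corr)

# Intuition from the talk's examples:
#   delta_sem("bicycle", "バイク", "自転車")  -> small (near-synonymous senses)
#   delta_sem("table", "表", "桌子")          -> larger (spreadsheet vs. tangible table)
```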
Using this Δsem feature, we looked at each of the translation errors and identified a few different error types. The first we call formality errors; an example would be translating the word "father" into a sense that means something more like "daddy" or "papa" rather than a sense that matches the level of formality "father" has in English. We also have commonality errors, where the word was translated to a less common but still technically correct version of the term. As you can see, there is a gradient starting from errors where it's questionable whether they really count as errors at all. Another example of that would be transliteration errors: when we translated the word "flame" into Spanish, a common translation is llama (l-l-a-m-a). That is a correct translation of "flame," but it collides in English with the word "llama," so that is what we call a transliteration error. Finally, we have sense errors, which are the most severe. As you can see on these plots, the sense errors, shown in the darkest colors, sit much further rightward on the Δsem axis than the formality and commonality "errors."

So the question is: does this Δsem feature predict which translation corrections will actually improve the performance of this benchmark in the image domain? To assess that, we test the change in XC (ΔXC), the aforementioned x-axis correctness feature characterizing a model's knowledge of each individual concept, against Δsem. Here are those correlations for Stable Diffusion 2.1 for Japanese, Chinese, and Spanish, going from left to right. From inspecting these plots, it doesn't really look like Δsem is predictive, except in the case of Spanish. However, we can tie this back to the analysis done in the previous paper. These are the histograms of concept correctness again, and I've highlighted the rightward side of the concept mass in green. As you can see, English, Spanish, and German have considerable rightward mass, so it would be reasonable to say this model has more knowledge of Spanish and German than of the other languages. By contrast, AltDiffusion has considerably more knowledge of Chinese and Japanese. And indeed, when we make these same scatter plots for AltDiffusion, we now find a very considerable correlation. In other words, this basically confirms what we had expected: there is a strong correlation between these two similarities, the text domain on the x-axis and the image domain on the y-axis.
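The predictiveness check described above can be sketched as a simple rank correlation between the two quantities, one pair per corrected concept. Spearman correlation is an illustrative choice here; the statistic actually used in the paper may differ.

```python
# A minimal sketch of the predictiveness check: for one model and one language,
# correlate the text-domain severity of each correction (Δsem) with the
# image-domain change it produced (ΔXC). Spearman rank correlation is an
# illustrative choice, not necessarily the paper's exact statistic.
from scipy.stats import spearmanr

def correction_predictiveness(corrections):
    """`corrections` is a list of (delta_sem, delta_xc) pairs, one per corrected concept."""
    delta_sem, delta_xc = zip(*corrections)
    rho, p_value = spearmanr(delta_sem, delta_xc)
    return rho, p_value

# Hypothetical usage: a strong positive rho would mean that the more severe a
# translation error was in text space, the more its correction helped in image space.
# rho, p = correction_predictiveness([(0.02, 0.01), (0.31, 0.24), (0.54, 0.40)])
```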
Here are some subjective examples. The case of "tent" is a strong example of a commonality error. In Spanish you can technically translate "tent" as tienda, but tienda also collides with the word for "shop," because, I guess, in older times shops were often set up under tents. And indeed, tienda actually gives us a shop rather than a tent on Stable Diffusion 1 and 2. Similarly, for the more severe error on the right, when "table" is translated to biǎo we don't get meaningful images, just nonsense, and even after we correct it, it still doesn't work. That's because we're looking at models that do not have much Chinese knowledge. By contrast, when we look at models with more Spanish and Chinese knowledge, applying the corrections does help, as you can tell from the bottom row: tienda de campaña, which literally means a tent for camping, does indeed give a tent aligned with the English sense, and the corrected Chinese now produces a table that is quite visually similar to the table the same model generates in English.

So what are the implications of this work? First, there's the pretty obvious one: we should apply these corrections to our benchmark, and we've gone ahead and done that, applying all of the changes with a high Δsem to the new version of our CoCo-CroLa benchmark, which stands for Conceptual Coverage Across Languages. For future versions of the benchmark, though, we don't really want a manually produced benchmark; we want a dynamic benchmark-generating process, so we need to do better with our translation pipeline. The issue really boils down to the fact that translating a single word into a specific sense is quite out of domain for translation models: they're meant to translate full sentences, and the ambiguity inherent in a single word, when we're looking for one specific sense, is very difficult for them to overcome. From preliminary experiments, we've found that using LLMs for this is quite promising, and we plan to do so in future versions.
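As a rough illustration of what LLM-based, sense-constrained single-word translation could look like, here is a hedged sketch using the OpenAI chat API. The prompt wording, the gpt-4o model name, and the helper function are illustrative assumptions, not the pipeline actually used for the benchmark.

```python
# A hedged sketch of sense-constrained single-word translation with an LLM.
# Prompt wording, model name, and helper are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def translate_concept(word, sense_hint, target_language, model="gpt-4o"):
    """Translate one English noun into a target language, constrained to one sense."""
    prompt = (
        f"Translate the English noun '{word}' into {target_language}. "
        f"Use only this sense: {sense_hint}. "
        "Reply with the single most common word for that sense and nothing else."
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Hypothetical usage:
# translate_concept("rock", "a stone; the tangible object", "Japanese")
```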
Then there are higher-level questions that are really present in any work inspecting multilinguality in any kind of model, not just text-to-image models. The first is that not all concepts are even translatable. In our previous work, because we focused on common tangible concepts, this wasn't much of a challenge, but if we want to push this further and really test the parity of multilingual knowledge in models, we are going to have to overcome it. The next question is what the limits of multilinguality assessment are in general, because there's more to multilinguality than parallel alignment between English and the test languages. What we really need is a way to verify that the performance of the model in a language meets the speakers of that language on their own terms, and while we are able to test for parity, finding a well-grounded way to test that second question is much more difficult. If we want to scale this assessment further, how do we find the prototypical concepts that do make sense to translate? And finally, are there ways to assess this knowledge that don't require the strict sense- and translatability-agreement we relied on in this work? All of these are really important questions that I hope to work on, and that I hope to see some of you work on as well. So thank you, and if you have any questions you can contact me at that email.

Q: Hi, I really enjoyed the talk, thanks a lot, and I have to say I'm impressed by your Chinese pronunciation.

A: I'm much better at pronunciation than comprehension.

Q: I have a question about more abstract concepts such as "happy" or "sad," or even events. Do you think it's possible to apply this framework to them, using not only images but perhaps a very short video to describe what an event looks like?

A: That's a great question. As for the video issue, I think that boils down to a general assessment capacity that's much harder than tangible concepts, which is looking for verbal knowledge. Let me back up a little bit: the way this system really works is that we're using CLIP features to compare the generated images. Now, CLIP has a lot of issues with performing these tasks beyond tangible concepts, because the objective CLIP was trained on is contrastive learning between specific images and their captions. The problem is that those mini-batches don't contain hard examples beyond the nouns: maybe there's one image of a panda in a tree with that caption, but all the other images in that mini-batch are completely unrelated, so all CLIP had to learn is to attend to the tangible objects, which is why they are the easiest thing to assess. We've been experimenting with trying to measure things like verbs, attributes, and relations, for example whether we can test the relation "on fire" across the different languages. It turns out that is quite difficult, because CLIP is really just attending to the presence of a fire: if there's a fire in the image but the object itself isn't actually on fire, these features are blind to it. Additionally, and this is something I elided here but we cover in the paper with the bike example, even though the motorbike and bicycle senses are very close together, CLIP is actually blind to that distinction: because there are two wheels, a seat, handlebars, and a person sitting on it, CLIP doesn't really care about the nuanced difference between a motorcycle and a bicycle. In general, those are really big challenges, and I think they require modeling advances to automate.

Q: William Wang from UCSB. I have a question about your suggestions for future datasets: can you give us some hints about how we should actually take this into account when building future datasets?

A: Sure, that's a good question; thanks for setting me up, William. Here was the diagram showing how we generate the benchmark. As I said before, the knowledge graph we had hoped would help us overcome these word-sense errors by aligning senses across languages actually didn't protect us, and in some ways it made the errors even more severe. An example I didn't cover here, but which is in our paper, is that across all the languages "rock" was translated to mean rock music rather than rock as in a stone, and this error was actually propagated by our use of a knowledge graph for agreement. What we've been working on now to improve this is using an LLM: GPT-4o is actually quite good at performing this tangible-domain translation task, and even in cases where it isn't, we can once again use the ensemble of commercial translators and then use an LLM as a verification step afterward. As for scaling beyond tangible concepts, like I was saying before, using the LLM and not having to rely on the knowledge graph allows us to scale beyond high-resource languages into lower-resource ones, because the LLM, particularly GPT-4o, has picked up sufficient capability in many languages to translate them, so we're hoping to scale in that way. And finally, as I alluded to before, we're actively working on better metrics, mainly using VLMs, to compare the more abstract concepts, relations, and verbs, so look forward to seeing that, hopefully by the end of this year.
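To illustrate the verification step mentioned in that last answer, here is a hedged sketch in which an LLM is asked to pick, from an ensemble of commercial-MT candidates, the one matching the intended tangible sense. Again, the prompt and API usage are illustrative assumptions rather than the actual pipeline.

```python
# A hedged sketch of LLM verification over candidate translations from several
# commercial MT systems: the LLM picks the candidate matching the intended
# tangible sense, or rejects all of them. Prompt and API usage are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def verify_candidates(word, sense_hint, target_language, candidates, model="gpt-4o"):
    """Return the candidate translation that best matches `sense_hint`, or None."""
    listed = ", ".join(candidates)
    prompt = (
        f"The English noun '{word}' (sense: {sense_hint}) was translated into "
        f"{target_language} by several systems as: {listed}. "
        "Reply with the single candidate that best matches that sense, "
        "or 'NONE' if none of them do."
    )
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content.strip()
    return None if reply.upper() == "NONE" else reply
```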
