The first scientific day dedicated to the ethics of NLP, the first ATALA Ethics and NLP day, took place in 2014. Ten years later, the world of NLP has undergone several revolutions, and ethics has never been so present: in the media, in research, in the calls of the biggest conferences, as well as in the governance of international NLP (ethics committees of conferences and of the ACL).
Organised at Loria in April 2024 by Karën Fort, the Ethics and NLP Day takes stock of the advances made and the challenges ahead.
In this video: “NLP Experiments: Challenges of reproducibility”, by Clément Morand.
+ More information at: https://www.atala.org/
Proceedings of the Ethics and NLP Day 2024: https://members.loria.fr/KFort/files/fichiers_cours/ActesJEEthiqueEtTAL2024.pdf
Hello everyone, I’m Clément Morand, and as has been introduced, we will present a talk on the reproducibility of NLP experiments in terms of their carbon footprints.

First of all, information and communication technologies cause significant environmental damage, and this is expected to increase. In 2020 they were estimated to account for around 4% of global electricity consumption and between 2 and 4% of global greenhouse gas emissions, and there has been at least a 5% increase in emissions since 2015. As we all know, we need to reduce global emissions, in the ICT sector as elsewhere. In the case of machine learning methods, we have also seen exponential growth in the energy and compute required to train state-of-the-art models, and machine learning activities consume significant amounts of energy: at Google, for instance, they amount to 10 to 15% of the energy consumption of the whole company. It has been shown that training large language models induces important greenhouse gas emissions through the energy consumed, on the order of magnitude of 500 tonnes of CO2 equivalent to train GPT-3, for instance. Since the seminal work by Strubell and colleagues showing the importance of the environmental impacts of deep learning, in NLP in particular, there has been a call for Green AI: research that evaluates a model not only in terms of its accuracy, the quality of the model in some sense, but also in terms of its efficiency, the amount of energy and environmental damage required to produce it. Green AI is now a research field, and a number of tools have been developed to evaluate models’ carbon footprints. There are two main families of tools, which essentially implement the same formula but gather information from different sources: measurement tools, which pull real-time information from hardware registers on the actual consumption, and estimation tools, which you can run whenever you want: you
just plug in information about your hardware, the time your experiment ran, etc., and you get a number; you can use them before or after the experiment, or even while it is running. If we want to reuse and compare the environmental assessments we make, they need to be reproducible, and different levels of reproducibility are possible. We could hope to reproduce to the level of the same value; for measurements this will not be possible, since measurements are not deterministic, especially the energy consumption of a process: you never have exactly the same scheduling in your computer. We can reproduce to the level of the results, that is, obtain close results. And when reproducibility at the level of the results cannot be achieved, a lower level of reproducibility is to reproduce the same conclusions, but not the same results. Here we are interested in how we can reproduce the results of NLP environmental footprint assessments.

In this work we studied five NLP studies: one on the carbon footprint of named entity recognition in French, one on French and English spoken language understanding, one on French spoken language understanding, one on the carbon footprint of models in general, and one on the carbon footprint of the BLOOM model. We chose these studies because they include an assessment of the carbon footprint of the experiments, and because they either provide enough information that we could try to replicate their results, or we were able to contact the authors when we needed supplementary information. One of the challenges is that the studies all use different tools, some measurement tools and some estimation tools, and they also have different scopes: while all the evaluations account for the energy consumption, the evaluation of the BLOOM model, the study by Luccioni and colleagues, also accounts for the production of the hardware.
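Both tool families essentially compute an energy consumption and multiply it by a carbon intensity factor. The following minimal Python sketch is a generic formula with made-up values; the parameter names are mine, and it is not the exact implementation of any particular tool:

```python
def estimate_footprint(tdp_watts, usage_rate, runtime_hours, pue,
                       carbon_intensity_g_per_kwh, n_devices=1):
    """Rough estimate of energy (kWh) and emissions (g CO2eq) of a run.

    usage_rate is the fraction of full power actually drawn (often
    unreported, so 1.0 is a safe overestimate).
    pue accounts for data-centre overhead (cooling, scheduling nodes...).
    """
    energy_kwh = n_devices * tdp_watts * usage_rate * runtime_hours / 1000 * pue
    emissions_g = energy_kwh * carbon_intensity_g_per_kwh
    return energy_kwh, emissions_g

# Hypothetical run: one 400 W GPU at full power for 24 h, in a data centre
# with a PUE of 1.2, on a grid emitting 100 gCO2eq/kWh.
energy, co2 = estimate_footprint(400, 1.0, 24, 1.2, 100)
```

Note that the result scales linearly with `usage_rate` and `carbon_intensity_g_per_kwh`, which is why an unreported usage rate or a mis-detected location directly compromises reproducibility.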
We wanted to be able to use the same tool to reproduce the results from all of these studies. For that we used a tool that we developed, called MLCA, which evaluates the impacts over the production and use phases of the hardware. Its estimations for hardware production are based on the data and methodology of the Boavizta association, and on the only study that we know of that accounts for graphics card production in a life cycle assessment; it assesses the energy consumption of the use phase with the same methodology as the Green Algorithms tool, so it is an estimation tool.

What we can say first is that reproduction was much more complex than we expected, because for all the studies we needed to gather additional information beyond what was available in the manuscript. For instance, we might know that an experiment was conducted in one specific data centre, but not which specific hardware that data centre contains, so we needed to cross-reference public information on the data centre; or the information was missing altogether: for three of the five studies we had to contact the authors to get details on the hardware configuration they used. Another issue is that the usage rate of the processing units, whether the graphics card was used at full power all of the time, or at 50%, etc., was generally not reported, so we had to choose a value, and we chose the maximum possible. This process also revealed some inaccuracies and inconsistencies in the manuscripts. For instance, in one of the papers we spotted assessments that were multiple orders of magnitude higher than what we could expect; after we pointed this out to the authors, they conducted new experiments that we were then able to reproduce. We also found inconsistencies: for instance, there was a problem in one of the manuscripts with the carbon intensity, the factor used to convert the
energy consumption into a carbon footprint, that is, the amount of CO2 emitted per kilowatt-hour. It was not constant: sometimes it was the carbon intensity of the French electricity grid, sometimes that of another country, even though all the experiments were conducted in the same place, so it should have been constant and should have matched the carbon intensity of the French grid. One possible explanation is that measurement tools need to gather information on your location, and if you use a secure connection, for instance, they might not be able to retrieve your location properly; they will then use an incorrect factor, and you might not notice that they did not use the factor you wanted.

In the end, we were able to reproduce the results of three of the five studies. For the two studies where we had flagged problems, we were not able to reproduce the results but only the conclusions, that is, the fact that training a model for longer emits more carbon; and our estimates are a little higher than the assessments in the literature. This is first because we used an estimation tool, and estimation tools tend to give higher results than measurement tools, and also because we used an overestimate of the usage rate, which was not reported. If we take a closer look at the assessment of the BLOOM language model, the study by Luccioni and colleagues, we see that for the embodied impacts, the impacts of the hardware production that can be attributed to this specific training, our results are close but a little higher; this is because we have a methodology for assessing the impacts of producing graphics cards, and since there is no publicly available data, it is complicated to obtain precise information. Our estimate of the dynamic energy consumption, the energy used by the hardware running the experiment, was very close, with a slight underestimation.
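The embodied impacts mentioned above are commonly attributed to a single experiment by amortising the manufacturing footprint of the hardware over its service life, in proportion to the time the experiment occupies it. A minimal sketch with entirely made-up numbers (not those of the BLOOM study):

```python
def amortised_embodied(manufacturing_kgco2eq, occupied_hours, lifetime_hours):
    """Share of the hardware manufacturing footprint attributed to one run."""
    return manufacturing_kgco2eq * occupied_hours / lifetime_hours

# Hypothetical GPU: 150 kgCO2eq to manufacture, ~6-year service life,
# occupied for 1000 h by the experiment.
share_kg = amortised_embodied(150, 1000, 6 * 365 * 24)
```

The result is highly sensitive to the assumed lifetime and to the manufacturing figure itself, which, as noted, is hard to obtain without public data from the manufacturers.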
The slight underestimation of the dynamic energy consumption is mostly due to rounding errors. For the infrastructure consumption, the consumption of the data centre and of the nodes required to do the scheduling, to cool the hardware, etc., our results are a little higher than those presented in the manuscript; this is because the results in the manuscript did not account for the time when the nodes are on but not running your job. Why do we account for that? For your job to be able to run, the data centre needs to be on and loaded so that the scheduling can be done, so we account for that too, and this explains the difference in our results.

What we can say is that reproducibility is a difficult endeavour. This is not specific to environmental assessment, but it holds for environmental assessment too: if you do not plan your studies to be reproducible, it will be really hard to do it a posteriori; you will not have all the information you need, and you may have a hard time recovering it. It requires special effort to be effective, and even when you are trying to be reproducible, problems may arise. For instance, in a study that we did not discuss here because it was not an NLP study, it was really easy to reproduce the results thanks to the supplementary material and all the efforts the authors dedicated to it; still, there was a copying error in the supplementary material, leading to the erasure of some parameters of some experiments, which we then had to approximate.

There are also many challenges specific to the environmental assessment of natural language processing. There is a lack of data availability: does the tool have data on my specific hardware? If I use an A100, does the estimation tool have information on the energy consumption of this
specific GPU? There are many uncertainties over the whole chain: over the production, we do not have reliable information, especially since no public data is made available by the manufacturers; and even in the use phase, as we have seen with the measurement tools, they might not use the factors we expect them to. There are variations in scope: for instance, do we include the production of the hardware or not? And there is a huge temporal and geographical variability: the carbon intensity of the electricity mix changes a lot between countries; in France, for instance, it is around 100 g of CO2 per kilowatt-hour, and in the US it is five times more. There is also a huge temporal variability, because with more wind you can produce more renewable energy, likewise with solar panels when there is sun, and consumption also varies with the time of day. So if we want to be able to reproduce to the finest level, we need to account for this temporal and geographical variability, and report when and where the experiments were conducted.

Different sets of rules have been devised to improve the reproducibility of NLP experiments, and we can draw from them to improve the reproducibility of NLP carbon footprint assessments. For instance, if you do not follow rules on reporting the system metadata or the parameters of the tools you use, crucial information for reproducibility may go unreported; and if you have manual steps, for instance copying the results from the estimation tool into a table, you may end up with inconsistencies, with the manuscript not presenting the latest version of the results. What we could do is define a standard for reporting these assessments: for instance, we could ensure that all the parameters necessary to estimate with one specific tool, say the Green Algorithms tool, are properly reported, along with the tool version.
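One way to follow such rules without manual steps would be to have the experiment itself emit a machine-readable record of every parameter the estimation needs. The fields below are an illustrative guess at a minimal schema, not an established standard:

```python
import json

# Illustrative (hypothetical) reporting record: the parameters an estimation
# tool would need to re-run the assessment, plus the tool name and version.
report = {
    "tool": "green-algorithms",        # which tool produced the numbers
    "tool_version": "x.y.z",           # data changes across versions
    "hardware": {"gpu": "A100", "n_gpus": 8},
    "usage_rate": 1.0,                  # assumed when not measured
    "runtime_hours": 24.0,
    "pue": 1.2,
    "location": "FR",                   # where the experiment ran
    "start_time_utc": "2024-04-01T09:00:00Z",  # when (carbon intensity varies)
}
print(json.dumps(report, indent=2))
```

Emitting such a record automatically, alongside the results, would avoid copying errors and make the when-and-where of the experiment part of the published output.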
The version matters because the data also changes with the version of the tool.

In conclusion, environmental assessments are not easily reproducible and require special attention. In this work we were able to reproduce the results of three out of the five studies using our tool MLCA, a publicly available tool covering the production and usage of the hardware. This required a non-trivial information collection process, showing that a lot of effort is still needed if we want all papers to be more reproducible, and it confirms the significance of hardware manufacturing impacts in large NLP experiments: in the BLOOM assessment they account for around a third of the impacts, showing that we need to discuss what we account for when we assess environmental footprints. I will end on this: it is really important to clearly define why we pursue an assessment before pursuing it, because we will not choose the same tools, and we will not need the same level of reproducibility, or even of understanding of the results, depending on what we do. If we are assessing only to reflect on what we can improve, the assessment does not need to be easily reproducible or very precise; but if we want to compare different models and do proper accounting, it needs to be as reproducible and as precise as possible. So clearly defining the objectives of the measurements before carrying them out will also guide us on how to measure and what to evaluate. Thanks for your attention.