There were more than 170 submissions for the 2023 Collaborator Showcase, and 10 posters were selected to present lightning talks during Friday’s portion of the symposium.

    1) Mapping of Critical Care EHR Flowsheet data to the OMOP CDM via SSSOM • Polina Talapova, SciForce
    2) Paving the way to estimate daily dose in OMOP CDM for Drug Utilisation Studies in DARWIN EU® • Theresa Burkard, University of Oxford
    3) Generating Synthetic Electronic Health Records in OMOP using GPT • Chao Pang, Columbia University
    4) Comparing concepts extracted from clinical Dutch text to conditions in the structured data • Tom Seinen, Erasmus MC
    5) Finding a constrained number of predictor phenotypes for multiple outcome prediction • Jenna Reps, Johnson & Johnson

    Moderator: Davera Gabriel, Johns Hopkins University

    To find materials from OHDSI2023, including presentation videos, slides, posters and more, please visit the global symposium homepage: https://ohdsi.org/ohdsi2023/.

    To learn more about the OHDSI community, please visit our website: www.ohdsi.org.

    #JoinTheJourney

    Well, hello, everyone. I’m Davera Gabriel, and I am going to be moderating this next session, so help me move us forward here. I’m thrilled to be the moderator, and we have so many excellent submissions that we don’t really have much time for them.

    So to make time to get this much good content in, we’re going to be moving very fast today. And because time is so short, we won’t be taking any questions, but you will have the ability to see these fine folks at their posters after this session.

    So get ready and fasten your seatbelts, because there’s a lot of good content coming your way super fast. The first person coming up is Polina Talapova, who will be presenting Mapping of Critical Care EHR Flowsheet Data to the OMOP Common Data Model via SSSOM. Hello, everyone.

    My name is Polina Talapova, and it is a great honor for me to open the first lightning talk session. Before we start, I’d like to express my sincere gratitude to all the people who support my native country, Ukraine, during the full-scale war with Russia.

    I would also like to thank the citizens and the government of the United States of America for the invaluable help, especially the military help. Despite the enemy’s propaganda narrative, this support has been instrumental and clearly visible for us Ukrainians, and it’s because of you that we continue our battle against the evil

    and the global terrorism embodied by Russia. Thank you so much; we do appreciate your help. Now let’s come back to the topic, which is mapping of critical care EHR flowsheet data to the OMOP CDM via SSSOM, the Simple Standard for Sharing Ontology Mappings. Starting with the background.

    This slide shows the multifaceted OHDSI journey. You can see the real-world data drawn as an ouroboros, the creature that symbolizes the endless cycle of data collection, harmonization, analysis, and evidence generation. As new data emerges, it feeds the cycle, promoting continuous research and discovery.

    The standardized vocabularies are the key tool that you see above the real-world data; they enable us to utilize the data and interpret it. With direction from the OHDSI ETL leads, we translate data using the vocabularies, and the community, including software developers, utilizes these resources, conducting studies and producing evidence that lead

    to more effective health care. So, as you understand, the vocabularies are the core and the heart of the OHDSI ecosystem, and that is why they are so important. While various EHR flowsheet data require mapping, our ultimate challenge is to define the best strategy for generating, preserving, and disseminating mappings across

    the OHDSI ecosystem. We also know that health data mapping is costly, use-case specific, and requires training and subject matter experts in health care; it is also essential for algorithm development and analytics. In turn, open-source mappings usually don’t have enough documentation and can lead to unexpected and sometimes hidden data inconsistencies.

    We also have adoption challenges, because people use various sharing approaches for mappings, for instance source-to-concept-map tables, staging tables, lookup tables, and what have you. That’s why we have to take care of the OMOP vocabularies first: we need to support them and maintain

    their individual categories, and we also need enhancement in the areas of provenance, precision, and justification. Our solution is prioritization of vocabulary usage and generation of mapping precision metadata. SSSOM is an elegant solution for this, and it allows us to store this information

    and use it in our research and especially in our ETL. Here you can see different metadata types, like mapping origin and intent, mapping creation, mapping precision, and so on, which give us more details about a mapping. The next thing that we propose is the usage of a mapping

    metadata table, which can be added to the OMOP CDM. This table contains information about the level of confidence of the mapping, shows the direction of the mapping, and whether it is an exact match, broad match, narrow match, or related match. We can also see the mapping justification,

    whether it was manually created or automatically created. We usually also want to know who actually did this mapping, and this table can help us store that information. You may also see a reviewer_id label, which will help us see who checked and validated the mapping.

    There is also information about mapping tools and their versions; for instance, Usagi or Jackalope can be recorded there if you use them. The content of the mapping metadata table should still be discussed, and some fields can be added or excluded after that discussion.
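
    To make the mapping metadata concrete, here is a minimal sketch (not taken from the talk) of what a few SSSOM-style records could look like when handled in Python with pandas. The column names follow common SSSOM slots mentioned above (predicate, justification, confidence, author, reviewer, tool); the flowsheet item, concept IDs, and ORCIDs are made up.

```python
import pandas as pd

# Illustrative SSSOM-style mapping rows; identifiers are invented for the example.
mappings = pd.DataFrame([
    {
        "subject_id": "FLOWSHEET:HeartRate",             # source flowsheet item (hypothetical)
        "predicate_id": "skos:exactMatch",               # exact / broad / narrow / related match
        "object_id": "OMOP:3027018",                     # target standard concept (hypothetical)
        "mapping_justification": "semapv:ManualMappingCuration",
        "confidence": 0.95,
        "author_id": "orcid:0000-0000-0000-0000",
        "reviewer_id": "orcid:0000-0000-0000-0001",
        "mapping_tool": "Jackalope",
    },
])

# SSSOM mapping sets are commonly exchanged as TSV files.
mappings.to_csv("flowsheet_mappings.sssom.tsv", sep="\t", index=False)
```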

    Next is the automation. We are working on the development of machinery which can help us integrate SSSOM mapping tables into our ETL and automation pipelines. We populate staging tables and mapping metadata tables, and this allows us to store information in the base vocabulary tables

    like concept, concept_relationship, concept_ancestor, and concept_synonym, and we also populate the event tables using this machinery. Another important thing is that we need integration of SSSOM into the OHDSI tools which create mappings or use mappings.
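
    As an illustration of that staging step (a sketch only, not the actual machinery described in the talk), exact matches from an SSSOM-style table could be reshaped into “Maps to” rows for the CONCEPT_RELATIONSHIP table roughly like this, assuming the subject and object identifiers have already been resolved to OMOP concept IDs in hypothetical columns named concept_id_1 and concept_id_2.

```python
import pandas as pd

def sssom_to_maps_to(mappings: pd.DataFrame) -> pd.DataFrame:
    """Turn exact-match SSSOM rows into OMOP 'Maps to' relationship rows.

    Assumes the mapping table already carries resolved OMOP ids in
    `concept_id_1` (source) and `concept_id_2` (standard target).
    """
    exact = mappings[mappings["predicate_id"] == "skos:exactMatch"]
    return pd.DataFrame({
        "concept_id_1": exact["concept_id_1"],
        "concept_id_2": exact["concept_id_2"],
        "relationship_id": "Maps to",
        "valid_start_date": pd.Timestamp("1970-01-01"),
        "valid_end_date": pd.Timestamp("2099-12-31"),
    })
```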

    We begin with Jackalope, an AI-enhanced mapping tool which allows you to manage mappings, create them, and visualize them. If you’re interested, please visit the related poster, number 24. Last but not least, and the most essential for us, is community engagement. This is why we encourage you to use mapping metadata

    in your ETL and in your current and future projects. So if you do mapping or use mappings, please visit our poster, 501. Thank you very much for your attention. Next up, we have Theresa Burkard, who will be presenting Paving the Way to Estimate Daily Dose in the OMOP Common

    Data Model for Drug Utilisation Studies in DARWIN EU. Good afternoon, and welcome to my presentation on dose estimates. I would really like to thank the organizers for giving me the opportunity to present this work today, and I can assure you we’ve worked very hard these last couple of weeks and months.

    With that, I will not only show you the dose-finding approach, but also some dose estimates as well as the validation. We think that drug dosing is very valuable for pharmacoepidemiologic studies because it has a lot of applications; however, it’s quite a challenging task.

    Some of you who’ve tried will agree with me, especially since some databases don’t have the dosing that the doctor may have prescribed available. However, there is quite an excellent reference, which is the WHO ATC defined daily dose index, and here you see the example of paracetamol,

    which has a kind of standard daily dose of three grams, and there are different administration routes: oral, parenteral, or rectal. It’s actually those different dose forms that give us the headache when trying to estimate dose and find dose formulas.

    So we set out on a uniform approach to develop dose formulas as well as validate those suggested dose formulas. Our approach revolved around routes, and the link with the OHDSI vocabulary is the drug strength pattern from the DRUG_STRENGTH table. In detail, we identified 31 patterns with clinically relevant units,

    and we categorized them into three groups: the fixed amount formulation patterns, the time-based formulation patterns, and the concentration formulation patterns. It’s the last example which I want to use to show you the concept of how we derived the dose formulas. So here you see an example

    of two different patterns, and it’s the numerator and denominator that make up these concentration formulation patterns. In the first row you see that the denominator is present, and in the second row it is missing, which means it’s a quantified drug. In these concentration formulation patterns,

    we had 22 patterns with clinically relevant units. And to see what was really going on, we performed an extensive clinical review in four different databases: CPRD GOLD, CPRD AURUM, the Dutch primary care database, and PharMetrics Plus for Academics. The suggested dose formula is depicted here.

    The clinical review really wanted to know whether the numerator value made sense in relation to the quantity. So for the patterns depicted here, stratified by route, we checked whether the suggested formula makes sense or not.
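
    As a rough illustration of the shape such a formula can take (a simplified sketch, not the exact study formula, which is pattern- and route-specific): for a concentration-type record, the total dispensed amount divided by the exposure duration gives a daily dose estimate.

```python
def estimated_daily_dose(numerator_value: float, quantity: float,
                         exposure_days: float) -> float:
    """Sketch of a concentration-pattern daily dose estimate.

    Total dispensed amount (strength numerator x quantity) averaged over the
    days of the drug exposure; real formulas differ per pattern and route.
    """
    if exposure_days <= 0:
        raise ValueError("exposure_days must be positive")
    return numerator_value * quantity / exposure_days
```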

    If you want to know how we derived the route from the dose form, which is a separate project, you can find me at poster 13. It’s not only important to find or suggest dose formulas, but also to validate them, and we did this using five different ingredients in five different databases, comparing our estimates to the WHO defined daily dose.

    For time constraints, I can only show you a subset of the results. Let’s start with furosemide, which is 40 milligrams in the oral and injectable routes. This is a drug that’s given regularly, so it makes sense to estimate dose there. Okay, let’s see what’s going on here.

    We have three databases here: IQVIA Disease Analyzer Germany, the Dutch primary care database from the Netherlands, and PharMetrics Plus for Academics. In the first column, you see the most simple analysis, where we only looked at what the unit was.

    You see that the majority of drug records had a milligram unit, and their dose estimates are exactly like the defined daily dose. In the second column, we took into account the route. You see that in the primary care database in the Netherlands

    our dose formula did not have good results for the injectables, but the number of records was also very small. It is still relevant if, for whatever reason, you want to use the injectables. In the last column, we also have the dose stratified by pattern,

    and here you can see that the fixed amount formulation in milligrams in the oral route is 99.6% in this database. So it would make sense to perform a study in oral tablets, for example, and not in injectables. If I have enough time, yes, I would also show the second example.

    It’s quite an interesting one, because here the WHO defined daily dose depends on the dose form: it’s either an inhaled powder or an inhaled solution. And you can see that our approach via the drug strength pattern really picked up the right WHO

    daily dose, depending on whether it was the powder or the inhaled solution. Also, there were some patterns here which were off; data is never perfect. As strengths, we would say that we demonstrated a uniform approach towards dose finding and validated the dose formulas. However, the dose

    finding process is quite slow and extensive due to the clinical review, and the major obstacle we found was the quantity field, which varied a lot depending on the database. This made it really hard to suggest a uniform dose formula.

    The point is not to have a dose formula per database; we wanted to suggest a generic approach. To conclude, depending on the setting of the data, the dose estimation worked better or worse for different formulations and routes, and we would always suggest performing

    thorough diagnostic investigations before estimating dose. This can certainly be done with the drug utilisation package that the great team in Oxford wrote, developed under DARWIN EU. With that, I would like to thank the funders, the collaborators, the community, the whole team in Oxford, and the data

    partners who have run the study. If you want to know more, you can find me at poster 502 in the break. Thank you. All right, next up we have Chao Pang, who will be presenting Generating Synthetic Electronic Health Records in OMOP Using GPT. Hello.

    Thank you so much for the nice introduction, Davera. It’s super exciting to talk to you about this work, generating synthetic electronic health records in OMOP using GPT. I’m actually going to go through quite a lot of slides, so I’m going to be talking at lightning speed.

    So why is synthetic data important? Well, the fundamental issue is that we cannot share the primary patient data with external collaborators. Therefore, having good synthetic data would enable or speed up certain research areas, such as external validation of machine learning models, phenotype algorithm development, and so on.

    The common approach for generating such data is as follows: the longitudinal data is first converted to a data matrix using the bag-of-words approach, essentially counting the number of features in a time window, and then a GAN model is trained to generate a similar data matrix.

    Although this approach works pretty well, the underlying data structure doesn’t capture any temporal dependencies, therefore limiting its use. Due to recent advancements in deep learning, people are starting to operate on the longitudinal data directly using a combination of sequence models.

    One of the good examples is this paper, in which the authors created a hierarchical model where they first predict the timestamp of the next visit given what came before, and then predict the diagnosis codes based on that timestamp and the previous

    hidden states. Although such methods seem to have captured temporal information, they are still not sufficient, because visits are assumed to end on the same day as the visit start date, which is not true for inpatient visits; there’s no visit type and no discharge time in the sequence.

    In addition, the synthetic data is not ready for dissemination or use. So we think this problem should be tackled from a different angle: instead of focusing on models, we created a patient representation that not only retains the patient timeline but also captures essential visit information.

    This patient representation is designed to have a demographic prompt at the beginning, followed by a bunch of visit blocks. The key innovation here is the use of so-called artificial time tokens, or ATTs, to represent the time intervals in days between visits

    or within inpatient visit spans, to preserve durations. That way, we can seamlessly convert between this sequence and OMOP, back and forth. Based on this idea, we proposed a framework where we first convert OMOP to a bunch of sequences using this patient representation.
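
    For intuition, a patient sequence of this kind might look roughly like the sketch below; the token spellings are illustrative only, not the exact vocabulary used in this work, but they show the demographic prompt, the visit blocks of OMOP concept IDs, and the artificial time tokens that encode the day gaps.

```python
# Illustrative token layout of one patient sequence (token names are made up).
patient_sequence = [
    "year:2015", "age:63", "gender:F", "race:white",   # demographic prompt
    "[VS]", "201826", "320128", "[VE]",                # visit 1: OMOP concept ids
    "ATT:30",                                          # 30-day gap to the next visit
    "[VS]", "4329847", "[VE]",                         # visit 2
]
```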

    Then we train a generative model to learn the sequence distribution and generate new data. Next, we convert the synthetic sequences back to OMOP using an OMOP converter. And finally, we evaluate the quality of the synthetic OMOP data using OHDSI tools and our evaluation procedure.

    For the generative model, we use GPT due to its recent success. The key idea is to treat the patients’ histories as if they were text documents. At the input layer we use concept embeddings plus trainable positional embeddings, and in the model we use 16 transformer decoders.

    We train the model using next-token prediction. To generate a new sequence, we randomly draw a patient from the population, turn it into a demographic prompt, and feed it into the model to autoregressively generate a sequence. When sampling from the output distribution, there are top-k and top-p

    sampling strategies, which are essentially hyperparameters that control the shape of the output distribution. We tested a few options, such as top-k equal to 100 or top-p equal to 95%, and for each strategy we generated half a million patient sequences.
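
    Since top-k and top-p (nucleus) sampling are standard techniques, here is a compact, generic sketch of both in NumPy; this is not the authors’ implementation, just the usual algorithm for filtering the output distribution before drawing a token.

```python
import numpy as np

def sample_token(probs: np.ndarray, top_k: int | None = None,
                 top_p: float | None = None) -> int:
    """Sample a token index after top-k and/or top-p filtering of `probs`."""
    p = probs.astype(float).copy()
    if top_k is not None:
        k = min(top_k, len(p))
        cutoff = np.sort(p)[-k]              # k-th largest probability
        p[p < cutoff] = 0.0                  # keep only the k most likely tokens
    if top_p is not None:
        order = np.argsort(p)[::-1]
        keep = np.cumsum(p[order]) <= top_p  # smallest set with cumulative prob <= top_p
        keep[0] = True                       # always keep the most likely token
        mask = np.zeros_like(p, dtype=bool)
        mask[order[keep]] = True
        p[~mask] = 0.0
    p /= p.sum()
    return int(np.random.choice(len(p), p=p))
```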

    Because we know the pattern in the representation is always the demographic prompt followed by a bunch of visits, we can easily convert this back to OMOP records. The most important thing here is that, because we have a starting point, the year token in the demographic prompt, and all the time intervals embedded in the sequence,

    we can calculate the timestamp accurately for any token in the sequence, which is how we reconstruct the patient timeline in OMOP. Okay, now we have the synthetic data; how do we know it’s good? Or in other words, how do we actually measure the similarity of two OMOP instances?

    We came up with an evaluation where the synthetic data is evaluated on three levels. On level one, we look at the concept distributions in different groups, such as the full population or the female population. On level two, we look at the co-occurrence relationships within the synthetic data,

    and on level three, we train logistic regression models on synthetic cohorts and compare them to the real data. What we are looking at here is the comparison of the concept prevalence, stratified by population group in rows and by domain in columns, where the x-axis represents the source data

    and the y-axis represents the synthetic data generated using the top-p 95% strategy, and each dot represents a concept. What this shows is that the majority of the concepts actually landed around the diagonal line, suggesting the synthetic data agrees with the source, especially in

    the high-frequency regions. On level two, we compare the co-occurrence matrices of the source and the synthetic OMOP data. We first construct a lifetime co-occurrence matrix for each and then calculate the divergence between them. We also included a lower bound and an upper bound.

    The lower bound was obtained by applying the same procedure shown on the left to two randomly drawn samples from the source, and for the upper bound, we first generated a hypothetical co-occurrence matrix by assuming independence in the source data and then applied the same procedure to obtain the upper bound.
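
    A sketch of that level-two comparison as described (the talk does not name the exact divergence, so KL divergence is assumed here purely for illustration): build a lifetime co-occurrence distribution over concept pairs for each dataset, then compare the two distributions.

```python
import numpy as np
from collections import Counter
from itertools import combinations

def cooccurrence_distribution(patients, vocab):
    """Lifetime co-occurrence distribution over unordered concept pairs."""
    counts = Counter()
    for concepts in patients:                              # each patient: iterable of concept ids
        for pair in combinations(sorted(set(concepts)), 2):
            counts[pair] += 1
    pairs = list(combinations(sorted(vocab), 2))
    vec = np.array([counts[p] for p in pairs], dtype=float) + 1e-12  # smoothing
    return vec / vec.sum()

def kl_divergence(p, q):
    return float(np.sum(p * np.log(p / q)))
```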

    These are the results of the divergence associated with the different sampling strategies, including the lower bound and upper bound. Basically, the closer it is to the lower bound in the bottom-left corner, the better the synthetic data, and in this case top-p 95% seems to be the best strategy. On level three,

    we test whether the synthetic data carries the same machine learning capability, meaning that if we generate a synthetic cohort and run logistic regression on it, it should have a similar performance as the real data. We included five different cohorts that we previously used in our earlier work, CEHR-BERT.

    I’m not going to go through the cohort definitions in the interest of time, but the hope is that we’re going to include more cohorts in the future; the more the better. So here are the results. In each cell we report three numbers: the prevalence of the positive cases, the ROC AUC, and the PR AUC.

    The second column of the table, in green, shows the metrics generated from the real data, and the other columns show the metrics corresponding to the different synthetic datasets. Unfortunately, we don’t have time to go through the results,

    but the take-home message is that we are able to replicate the model performance using synthetic data for every single cohort. In conclusion, this is the first deep learning framework that generates longitudinal synthetic data in OMOP. Secondly, we created a patient representation that allows us

    to reconstruct the patient timeline without any loss of temporal information. And finally, the evaluation procedures show that the synthetic data preserves the underlying characteristics of the real population. I’d like to thank my team and the collaborators for your inspiration and your support. I’d be happy to take any questions during the poster session at 503.

    Thank you very much; everybody staying in their time frame is really, really impressive. Now I’d like to introduce Tom Seinen, who will be presenting Comparing Concepts Extracted from Clinical Dutch Text to Conditions in the Structured Data. Thank you, Davera, for the introduction.

    So we stay in the realm of data similarity, because I will talk about comparing concepts extracted from text to the structured conditions in the database. At the Erasmus Medical Center we maintain a large Dutch general practitioner database of 2.5 million patients,

    which is around 8% of the Dutch population of active patients. While the structured data in the CDM is of course used for all kinds of studies and analyses, the unstructured data, the free text, sits in the NOTE table of the CDM,

    which is actually fairly large, 35% of the physical size of the database, and is unused, so a lot of potential information goes unused. To change that, we thought: can we extract this information, can we extract all the clinical concepts from the text and then use them in our analyses?

    There are many tools to do this for English; however, there are not many for Dutch. So we thought, okay, let’s create a framework ourselves for extracting these concepts, and that’s what we did. However, while Dutch is similar to English in certain ways, we still wanted

    to verify whether our concept extraction was actually any good. Could we evaluate it? That requires an annotated dataset. There are multiple English datasets that are annotated with concepts, but such a dataset does not exist for Dutch, unfortunately, and creating one is a very laborious task.

    So I didn’t want to do that. We thought about this challenge and asked: what information do we actually have for these notes? Because these notes do not exist by themselves in the database; they are often linked to, or appear together with, a structured code, a condition for example.

    So we thought: can we maybe use this structured data for evaluating our concept extraction, using the structured data basically as surrogate annotations in the evaluation? And to do this, we thought we could just compare

    all the concepts that we extracted from the text with the structured code, because if we can find similar or related concepts in the text, then the extraction works. Of course, a concept can be the same: if you find COVID-19 in the text for the condition

    COVID-19, that’s the same condition. You can also find SARS-CoV-2, which is a bit less the same; you can find a vaccine, which is related; and if you find the word car, or the concept car, well, I don’t see how you can connect those two. So we made our

    extraction framework using an existing open-source framework, medspaCy, and we basically changed the structure of the algorithm by using a Dutch tokenizer from spaCy, a Dutch SNOMED CT dictionary for the concept extraction, and Dutch context rules, to detect negations for example.

    Of course, we stored all the information in the NOTE_NLP table. Now, for the experimental setup, we looked at the 300 most frequent conditions in the database, and for all these conditions we took all the condition appearances

    for all the patients, around 30 million appearances. Then we took the notes within a three-day window around these appearances, around 110 million notes, and then we extracted all the clinical concepts from these notes, which ended up being around 430 million extracted concepts. Now, I said the assumption

    was that the text is related to the condition, but that might not always be right. There are always cases where the condition doesn’t need any explanation, so there is no text, and of course we’re not only looking at conditions; the text can also be about procedures, medications, or other structured data, so it

    might not be relevant. So we still needed some ground truth, and that’s why we annotated a set of 2,000 code observations, or code appearances, covering 200 different conditions. While you would normally annotate the text itself, saying: here is this concept, and there is that concept

    in the text, we now just asked: does this text describe a similar concept, or a related concept? This was very fast, because these are just two yes-or-no questions per code occurrence, so the annotation was done pretty quickly.

    Now, for the similarity between concepts, we decided to look at pre-trained concept embeddings, SNOMED concept embeddings to be exact. These embeddings are created using a neural network and are a numerical representation of a specific concept, and therefore hold a certain semantics of that concept.

    Then, to calculate the semantic similarity between two concepts, we can use the cosine similarity between their embeddings. For example, COVID-19 and SARS-CoV-2 have a similarity of 0.87, while COVID-19 and car have only 0.01.
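
    That similarity computation is just the cosine of the angle between the two embedding vectors; a minimal sketch (with made-up toy vectors, since real concept embeddings are learned and high-dimensional):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two concept embeddings (1.0 = same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

covid19 = np.array([0.9, 0.1, 0.3])             # toy vectors only
sars_cov_2 = np.array([0.8, 0.2, 0.35])
print(cosine_similarity(covid19, sars_cov_2))   # high similarity expected
```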

    Now, to quickly decide whether the concepts we find in the text are related to the condition, we just look at the extracted concept with the largest similarity to the condition, to keep it quick; so we only focus on the concept with the maximum similarity. However, you then still only have a number.

    How do you say whether this concept is the same, similar, or related? Therefore, we looked at the distribution of the similarities of all the roughly 430 million extracted concepts we found, and then started looking at both the

    distribution and examples to see what kind of thresholds we could use. For the related concepts, we felt that everything above one standard deviation from the median looked related, and everything within one standard deviation of the maximum looked actually quite similar to the original condition.
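
    A sketch of how such cut-offs could be derived from the observed similarity distribution, under my reading of the description (one standard deviation above the median for “related”, within one standard deviation of the maximum for “similar”):

```python
import numpy as np

def similarity_thresholds(similarities: np.ndarray) -> tuple[float, float]:
    """Derive 'related' and 'similar' cut-offs from all observed similarities."""
    sd = float(similarities.std())
    related_threshold = float(np.median(similarities)) + sd   # above this: related
    similar_threshold = float(similarities.max()) - sd        # above this: similar
    return related_threshold, similar_threshold
```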

    Now, if we use these same thresholds on only the maximum-similarity concepts, we can say in how many cases the text describes the same condition as the structured data. Here we see that, of the 30 million condition appearances, in 20%

    we find a similar concept, in 47% we find a related concept, and in 27% we only find concepts unrelated to the condition. We also evaluated this on the annotated dataset. On the left you can see the observed,

    that is, the annotated dataset, and on the right the predicted dataset, with the unrelated, related, and similar concepts. Here we find that in these 2,000 appearances we found fewer similar concepts than we expected, but we did find more related concepts than we expected.

    We also found a few more unrelated concepts than we expected. However, if we did not find a similar concept, then usually a related concept was found, so that was pretty nice. And if we look at this as a classification problem, we can see that detecting related concepts works quite well,

    better than detecting a similar concept, but combining both gives a pretty good performance. In conclusion, we created an extraction framework that we evaluated using the structured data. Of course, there are some limitations, but as it is simple and language agnostic, our framework performs relatively well, and we found

    that our assumption was largely true: related and similar concepts are found in the texts surrounding the clinical conditions. If you have any more questions or want more information, come to my poster at 504. All right, to round out our lightning talks,

    we have Jenna Reps, who will be presenting Finding a Constrained Number of Predictor Phenotypes for Multiple Outcome Prediction. Thank you for the introduction. I’m going to be discussing a project that we’ve been working on, where we tried to see whether we could identify a smallish set, say around 50 predictors,

    that could be used for any outcome. Before I actually dive into the project, I just want to highlight that this is very much a team effort, and a shout-out to Chung Su, who actually managed to take all of our individual photos and put us into a hypothetical

    “if we worked in the same office” photo. So why are we interested in coming up with this constrained set of predictors? Well, some of you may have come across websites like MDCalc, where you can go on,

    pick a model you’re interested in, fill out things like your age and sex and a few questions about yourself, press enter, and it will tell you your risk of a certain outcome. But what you find with these is that this gives you the risk of one outcome. If you want

    to see your risk of another outcome, you have to find another form, fill out a few more questions, press the button, and you get that risk. But this doesn’t scale up if you want to know your risk of 100 different outcomes, or 1,000 different outcomes; in OHDSI, you

    know, we like to do things at scale, maybe even a million outcomes, and you would have to fill out a lot of different questions across a lot of forms. So our idea is: can we find some predictors that are just generally good for any prediction task?

    That way, we could have one form that you fill out, press enter, and it could tell you your risk for many different predictions. There is quite a lot to the methodology, and if you want to hear more detail, I welcome you to come find me at poster

    505. Just to give you a brief overview, we ended up doing a very, very large-scale characterization. We looked at lots of different prediction tasks, and within each prediction task we looked at what drugs and conditions were recorded in the year prior to index.

    Then we looked at how each of these drugs and conditions was associated with developing the outcome in the future, because for prediction, the good predictors are generally the things that are strongly associated with having the outcome in the future. So we set out to find the drugs and conditions

    that are associated and therefore should be good predictors for an outcome, but rather than doing this for one task, we did it for 65,000 prediction tasks. We looked at many, many different at-risk populations and many, many different outcomes across lots of claims datasets, and we counted, out of these 65,000

    plus tasks, how often each drug and each condition was associated with developing the outcome in the future. We then ordered these drugs and conditions in descending order based on how often they were found to be associated.
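
    Schematically, the counting and ranking step could look like the sketch below (an illustration only, assuming a long results table with one row per prediction task and candidate predictor plus an “associated” flag):

```python
import pandas as pd

def rank_candidate_predictors(results: pd.DataFrame, top_n: int = 1500) -> pd.DataFrame:
    """Rank drugs/conditions by how many tasks they were associated with the outcome in.

    Expects columns: task_id, predictor_concept_id, associated (bool).
    """
    counts = (results.loc[results["associated"]]
              .groupby("predictor_concept_id")
              .size()
              .sort_values(ascending=False)
              .rename("n_tasks_associated"))
    return counts.head(top_n).reset_index()
```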

    The conditions and drugs in the top 1,500 are the things that, over lots of different prediction tasks, are generally pretty good predictors. We reviewed them and found 67 different topics, things like hypertension and heart failure, that we then created phenotypes for using the standard OHDSI process. So this is our list.

    I’m sure some of you may not be able to read it if you’re at the back, but there’s a list online, and I’m also happy to explain more at the poster. Effectively, there are three takeaways about the 67 candidate predictors that we found. One thing I want to highlight

    is that they are available in the OHDSI Phenotype Library, so you can actually get them from an R package that Gowtham is maintaining. Secondly, these predictors are actually quite diverse, so we found things that cover lots of different disorders, which is good.

    And thirdly, these are things that are commonly recorded in our data, and you’ll often find them in existing prediction models. So really, we didn’t find anything new; we just found a set of 67 predictors that you would expect to be good predictors in models. The next thing

    we wanted to do was see how well these do. The previous approach to prediction within OHDSI is the kitchen sink approach: you use any drug and any condition recorded in the prior year as candidate

    predictors, plus age and sex, and then you try to develop a model with those. This is what we’ve called our best-case logistic regression: throw everything in there, no constraint, and see how well we do. Then we’ve got our models developed using the constrained

    set of predictors, the 67 predictors we just identified, plus age and sex; these are the constrained models. And then we also decided to look at some prediction tasks where we have existing models, so we can see how the existing models do as well.

    The example I’m showing you here is the one-year risk of mortality, and for that prediction task the Charlson comorbidity index is a common model, so we put that in. There’s a plot here; this is just one example of one prediction task, but if you want to see more, come to my poster.

    The first row here is the best case; this is the unconstrained model, where we’ve got tens of thousands of predictors that we could have in the model to predict mortality in the next year. The second row is a model with age and sex as the only predictors,

    and the third and fourth rows are the constrained models, where we’re using just our 67 predictors plus age and sex. The bottom row is the Charlson comorbidity model, the existing model for this task, and each column here is a different database.

    The number you find here is the AUC, a measure of how well the model does in terms of discrimination: the closer it is to 100, the better, so the further the dots are to the right, the better.

    What we see in this prediction task is that our constrained models, the models that use only the 67 predictors plus age and sex, were actually doing as well as or better than the model that was using tens of thousands of candidate predictors, which is very promising.

    We saw this result across all these different databases, so for this prediction task our constrained set of predictors actually did very, very well, and it does seem that you can come up with a constrained set of predictors that should do well. But we didn’t just do it for this prediction task;

    we actually have this for many different prediction tasks, and as I said, come to my poster if you want to see the full results. And if this has made you interested, in true OHDSI style we didn’t stop there: we actually created a website

    using these 67 predictors that you can go to. You can fill out one form, put in your age, your sex, and the 67 predictors, whether you have had them or not in the prior year, and it will tell you your risk of not one, not two, but over 100 different outcomes.

    So I hope you enjoy the website, and thanks for listening. Thank you to all of our panelists for their time, thank you so much. Our next session is our first poster and demo session. The methodological and data standards lightning talks and posters can be found on my left,

    which is your right, and the open-source and clinical lightning talks and posters will be on my right, your left, over on that side. Also, be sure to look for the four poster walkers who are carrying signs indicating their poster themes. So where are our poster walkers?

    Let’s see, we’ve got Mui Van Zandt, who’s doing data standards; we have Christophe Lambert, who’s doing methods research; and Paul Nagy, where is Paul? I don’t see him. Oh, hello, there’s Paul Nagy, doing open-source development. And who’s doing clinical applications? Oh, there you are. Hello, it’s Christian.

    He’s doing clinical applications. So have a great time in the poster and demo session, and be sure to come back to your seats by about 3:30, right on time for the next lightning talk series. Thank you so much.
