So I've jumped ahead a little bit. Apparently something weird is going on with my slides, some sort of Star Wars scroll happening behind the scenes; I don't know what's going on there, but it's kind of cool. We're going to focus on IRT. This is part of a larger presentation about practical applications and then a critical perspective on IRT; again, this is my effort to demystify it, highlight its uses, and also highlight its limitations.

I think it's useful to contrast IRT with classical test theory. The object modeled statistically in classical test theory is the actual item score: a 0 or 1 for an educational test, or a 1-to-5 Likert response. In IRT, it's the probability of getting a particular item correct, or of reaching a particular threshold in a polytomous item like a Likert item, and that probability depends on a series of parameters I'll explain shortly. The conception of what's measured also differs. What is the score? That's the question I keep asking. In CTT it's the examinee's true score: you can think of it as an average, you can think of it as a sum, equivalently defined as the expected value of observed scores across replications. In IRT, we create this somewhat mystical theta, which is a particularly useful scale, as I'll describe, for comparing across different populations. So it's a very useful scale, if the model fits the data.

Again, just to highlight CTT and why I spent the first half of this presentation on it: it is still by far the most widely used psychometric toolkit. It can do a ton; do not sell it short. Of course, sometimes you have to publish papers and you need a fancier acronym, and IRT will come in handy then. But CTT should be a knee-jerk first analysis, just as descriptive statistics are the first step for most statistical work.

We think IRT is useful because of how it conceives of items. CTT treats an urn full of items as just that: on average they're like this, and their variances are like this. In IRT you take each little marble out of that urn and appreciate it: this marble is special, it has these properties. Then you can take all the marbles out, lay them on the table, and say, here's what I want to do with them. That tends to be much more useful as a way to design tests and maintain scales over time, and it's the standard approach for large-scale testing.

Which is to say that for many of you, IRT is a bit of a sledgehammer to a nail. If you're developing your own scale for your own purposes, and it's going to stay static and serve a particular population, go ahead and do IRT, but realize it's kind of an indulgence, again a sledgehammer to a nail. For large-scale testing programs, where you are substituting out items, maintaining scales, and giving the test to different populations over time, IRT is incredibly powerful, and it is the standard. Perhaps the only major exception is the Iowa Tests; I used to teach at Iowa, and they are mostly holdouts for classical test theory, which is part of why I've been so classically focused here. They pull off amazing things without IRT,
and they do it quite well and quite rigorously. You do not always need IRT; CTT is simpler and more elegant to use, which is why it's still such common practice today.

So IRT asks: what if there were an alternative scale, an alternative theta scale, on which item characteristics would not depend on each other the way they do in classical test theory? On the Google Doc, I think Yen and Fitzpatrick, in their chapter on IRT in the handbook, describe IRT, if your assumptions hold, as person-free item measurement and item-free person measurement. Which is to say that the statistics you come up with, the difficulty of an item, the discrimination of an item, and the proficiency of you, the examinee, do not depend on the items you happen to have, and similarly the item features do not depend on the population you happen to have. If the assumptions hold, that is pretty darn powerful, because that marble you pick out of the urn has those properties and will always have those properties, so when you use it to construct a test it will continue to have those properties. Again, if the model holds.

So just to define things, and we're going to get into logits here: the simplest IRT model is known as the Rasch model, for Georg Rasch's 1960 monograph. It's also known as the one-parameter logistic or 1PL model. I like to write it like this: the log of the odds of a correct response to the item, P over Q, the probability of a correct response over one minus the probability of a correct response, that's the odds, and we take the natural log of it. The log odds is just a simple linear function: a common slope parameter a, a person term theta_p, and then the sign is important. In logistic regression you're used to seeing a plus; we define it with a minus, so that the item parameter is a difficulty instead of an easiness, a difficulty b_i for each item. And then theta_p is a random effect: theta_p is distributed normal(0, 1). Note that we are not estimating the variance of the person distribution; we're standardizing it to normal(0, 1).

Most IRT models aren't written out like this, and I think that has the effect of mystifying them; you get some confusing logistic function. I prefer just saying, hey, we're linear in the log odds here. This is not fancy: if you can do logistic regression and you understand what a random effect is, this is just familiar modeling. So again, the log of the odds is simply this common a. There is no subscript i on a: the discrimination does not depend on the item; it's a single parameter estimated across all items. That's going to change in the next model, but for now the discrimination is common across items. Then there's a difficulty parameter b_i for each item, and every person gets a theta.

OK. This is the more intimidating way of writing the same model, where we model the probability itself, but of course it's equivalent. This is the scary way of writing logistic regression, and that is the maybe less intimidating way of looking at it: as long as you don't look over here, it's just the log of the odds, and logistic regression, when you work in the logit, is just a generalized linear model. Don't forget that.
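For reference, here is a minimal written reconstruction of the 1PL as described above (my notation; the actual slide is not reproduced here), in both the log-odds and probability forms:

```latex
% 1PL / Rasch model, log-odds form: linear in theta with a common slope a
\log\frac{P(X_{pi}=1)}{1-P(X_{pi}=1)} \;=\; a\,(\theta_p - b_i),
\qquad \theta_p \sim \mathcal{N}(0,\,1)

% Equivalent probability form (the "more intimidating" way of writing it)
P(X_{pi}=1 \mid \theta_p) \;=\; \frac{\exp\{a(\theta_p - b_i)\}}{1+\exp\{a(\theta_p - b_i)\}}
```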
So these are the curves that we estimate, the logistic curves, and they say: for a given theta, here are the probabilities of getting each item correct. So at a theta of 0, this item has about a 50 percent chance of being answered correctly, maybe 55, another is up around 70 or 75 percent, and so on. So which items are easier? The ones on top, or, equivalently, the ones shifted to the left; the more difficult items are the ones shifted to the right, or shifted down, depending on how you think about it. In logistic regression we usually think of intercepts on y; what we've done here is flip that to think about position on x. The higher curves have higher y-intercepts and the lower curves have lower y-intercepts, but we've shifted to thinking of greater difficulty as sliding that S-curve over to the right, Walk Like an Egyptian style.

And just to give you a little of the punch line: what I've done here is show you a scatterplot of the classical test theory difficulty, which is to say the percent correct (classical test theory is super annoying and calls the percent correct the "difficulty"), against the IRT difficulty. There's a negative relationship, but this is just to say: if you ask how percent correct corresponds with IRT difficulty, and whether IRT is giving you something magical and mystical above and beyond percent correct, the answer is, not really. It's pretty much the same information. That's not surprising, but again, IRT is going to be useful for some more advanced applications.

I want to demystify this further; Matt and I had a conversation about this earlier today. IRT is a latent variable measurement model. It is a factor analytic model. It is a structural equation model. Do not think of these as separate things. They are separate practices in the way that ANOVA and regression are separate practices but are the same under the hood. I think of the act of doing ANOVA as a way of thinking about a statistical analysis, even if I could do the same thing with regression; similarly, structural equation models and factor analysis I think of as different practices asking different questions with the same statistical machinery. I'm happy to elaborate on that, but I don't want to treat these as completely separate models; they are more like separate literatures, separate fields, used for separate reasons, in the same way that ANOVA and regression are really the same under the hood.

So what I'm setting up for you here is a way of doing IRT using the gsem command, the generalized structural equation modeling command, in Stata. You can see here all it is. When I say IRT is factor analysis with categorical variables, that's all it is. And it's not all it is, in the sense that what we do with it is different, but under the hood, that's all it is.
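To make the curves and the percent-correct comparison concrete, here is a small illustrative sketch (not the speaker's code; the difficulties are made up): a 1PL item characteristic curve, showing that a larger b shifts the curve to the right and drives down the classical percent correct.

```python
import numpy as np

def icc(theta, b, a=1.0):
    """1PL item characteristic curve: P(correct | theta)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

pop = np.random.default_rng(0).normal(size=100_000)  # stand-in N(0,1) population
for b in (-1.0, 0.0, 1.0):                           # hypothetical difficulties
    print(f"b = {b:+.1f}:  P(correct | theta=0) = {icc(0.0, b):.2f},  "
          f"population percent correct = {icc(pop, b).mean():.2f}")
# Larger b shifts the S-curve right (lower P at every theta), and the classical
# percent-correct "difficulty" falls with it, mirroring the scatterplot above.
```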
The gsem formulation is that the probability depends on theta and b: it's the logistic of theta minus b. That's a slightly different parameterization than the one I showed you with the a term, because there the a was outside the parentheses, but it's the same general approach: the slope is constrained to be common across items, and you'd fit this in Stata with gsem, a logistic regression, the same thing. Actually, before Stata 14 came out, and Stata 14 was just released last year, before they had an irt package, guess what: I did it in gsem. Which is to say, why was I teaching an IRT course using structural equation models? Because they're the same thing. I had all this really convoluted code to get everything I needed out of gsem, and then of course Stata, thankfully, with irt in Stata 14, made all of that obsolete and I had to recode everything. But it just goes to show it's the same thing under the hood.

So this is the two-parameter logistic... yes, sure, actually, Chris? [Audience question about whether gsem can also fit the 2PL and 3PL.] Absolutely. gsem has a hard time with the 3PL, but gsem can absolutely do the 2PL. All you do for the two-parameter logistic model is free this right there: instead of forcing the slope to be the same across items, you let it vary, and that gives you the two-parameter logistic model. The three-parameter logistic model I don't think you can do in gsem; you can do it in gllamm, Sophia's package, and in the irt package, but again, it's the same under the hood. Good question.

So this is the two-parameter logistic model. It allows items to vary in their discrimination. Again, I like writing it in log-odds terms, and all I've done is add a subscript i: I've let the slope parameter vary across items. And then again we have difficulty: these are the more difficult items, these are the less difficult items. If I wanted to be less fancy about it, what would I do? I would plot this in the log odds, and it would just look like a bunch of straight lines. So again, this is a somewhat mystifying way of describing IRT; if I wanted to make it simpler, I'd just show you all the different straight lines in log-odds space.

Yep? [Audience question, pointing at the legend, about how many parameters are being estimated.] There are 20 items, so there are 20 slope parameters here. We're not estimating the theta_p's; that's a random effect, and we're standardizing its distribution. Since we're not estimating those, we can get them in a Bayesian way after the fact, the same way we get other random-effects estimates after the fact. And then we have, in this case, 20 parameters for difficulty.

So here I can actually show you, in the output... where was the output, I don't think I have it here. [Audience question, partially inaudible, about degrees of freedom and what data are being fed to the model.] Good question about what we're feeding in here. You can set the data up long or wide, it doesn't really matter; irt lets you do it wide just as easily. The data, and I should have shown this before, look like a person-by-item matrix: persons as rows, items as columns, zeros and ones in the cells. You can extend that to 0/1/2 for polytomous items. And you're modeling the probability of a correct response to each item. So what do the data look like? I think I have this, if I can show it to you.
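For concreteness, a tiny made-up example of the wide, person-by-item layout just described (the item and person names are placeholders, not from the talk):

```python
import pandas as pd

# hypothetical wide-format item responses: one row per person, one 0/1 column per item
data = pd.DataFrame(
    {"item1": [1, 1, 0, 1],
     "item2": [0, 1, 0, 1],
     "item3": [0, 1, 0, 0]},
    index=["person1", "person2", "person3", "person4"],
)
print(data)
print("sum scores:", data.sum(axis=1).tolist())  # the CTT total for each person
```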
And... there you go. This is sort of what the data look like behind the scenes. What I've done here is show two separate item characteristic curves for two items, and I've mapped the sum score associated with each theta onto the theta scale, and plotted the weights, how many observations happen to sit at each point, as dots. So you can see that what we're trying to do is fit the probability of a correct response given that sort of overall score. Does that help a little bit? OK. Theta is weird and annoying because, I mean, where does it come from? It's this latent scale, and you can see it's set up in the same way a random effect is set up in Stata: we just say it has a mean of zero, and instead of estimating the variance we put that back on the slope.

So let me give you a better sense of what the curves look like. This is the item characteristic curve demo that I like to do; here is my visualizing-IRT site. This is a three-parameter logistic model. There's a blue item hiding behind this, a blue ICC, a blue curve hiding behind this red curve, and what I'm going to do is increase the discrimination of the blue item. What we're going to see is that we increase the slope here in the probability space, and this blue item is now what we describe as more discriminating, in the sense that people just below that midpoint versus just above it have a pretty massive swing in their probability of a correct response.

So my trick question to you is: which item is more discriminating, blue or red? The knee-jerk reaction is to answer that blue is more discriminating, but if you think about it more carefully, and some of you did a good job of working through this on the Google Doc: where is the slope higher? Which item has a higher slope? Is there a general answer to that? And in fact, when might the red item be better?

[Audience response about the tails.] Yeah, at the tail ends of the distribution you can see that for people who are very high achieving on this scale, or very low, and this goes back to Sue's question, who are we trying to discriminate among? We'll get to information shortly, but the idea is that IRT allows you to ask: difficulty for whom, discrimination for whom? Even though you have a's and b's, you wouldn't want to just call one item more discriminating or another less difficult, because it all depends on for whom. And so you can use this to construct tests in very strategic ways, to provide information for high-achieving or low-achieving students if you're so inclined.

So similarly, what I'm going to do now is increase the difficulty of the blue item. Which way do you think the blue curve is going to go? The blue curve is going to shift to the right; for more and more people across the theta scale, the probability of a correct response is going to be low. So now your blue item is more difficult, it seems. It has a higher b
parameter estimate, and you say: that item is more difficult. But is it really more difficult, or is it sometimes easier? If you look all the way up at the top, you actually see a region where the blue item is easier than the red item. When the discrimination parameters are not the same, it's like an interaction effect: you can't really say across the board which item is more difficult and which is easier; it depends on where you are on the scale. Now, if all the a parameters are the same, as they are in the one-parameter logistic model, then there's never any overlap: a more difficult item is always more difficult, an easier item is always easier. But once you allow discrimination to vary, that lets you be very targeted about for whom an item is difficult and for whom it is informative.

[Audience question, roughly: if an item only discriminates among very high-achieving people and you never observe anyone up there, wouldn't you have no information?] That's right, you'd be forced to extrapolate, in the same way that fitting a linear model, and this is a linear model in the log odds, and then asking what happens to people way down there is extrapolating that linear-in-the-log-odds assumption. So when we say person-free item measurement and item-free person measurement, what we're really saying is: yes, if my model holds, which is what we always say. This is just a regression assumption, nothing magical, but it is still useful, and what we find in a lot of cases is that the linear-in-the-log-odds assumption is pretty reasonable.

Just a quick note: the slope at the midpoint in the probability metric is a over 4, and in the log-odds space it's just a itself. And again, be careful when discrimination varies: be careful about assuming that "discrimination is discrimination." Do not select items based on parameters; select items based on curves. You should think in an item-characteristic-curve way, and always visualize the items themselves if you can.

So I want to show you what happens with the c parameter, which I haven't really talked about given how fast I've been rushing through this. When I increase it, you can see it sort of lifts the floor. Some of you might already know the answer to this, but why would this be useful? Why would we want to say that, in certain cases in educational testing, people with extremely low proficiency still have a 25 percent chance of getting an item right?

[Audience suggestion, partially inaudible, about using it to control for something.] That might not be quite where I was going, and this is a data-fitting exercise, so you wouldn't really want to control for something in that particular way, but I do like the sentiment; it's a cool idea.
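Returning to the crossing-curves point for a moment, here is a quick numeric illustration (made-up parameters, not the demo's actual values): when discriminations differ, two 2PL ICCs cross at theta* = (a1*b1 - a2*b2) / (a1 - a2), and the item with the higher b is only harder on one side of that point.

```python
import numpy as np

def icc(theta, a, b):
    """2PL item characteristic curve."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

# hypothetical "red" (flat) and "blue" (steep, nominally harder) items
a_red, b_red = 0.8, 0.0
a_blue, b_blue = 2.0, 0.5

# theta where the two curves cross: a_red*(t - b_red) = a_blue*(t - b_blue)
theta_star = (a_red * b_red - a_blue * b_blue) / (a_red - a_blue)
for t in (theta_star - 2, theta_star, theta_star + 2):
    print(f"theta = {t:+.2f}   P(red) = {icc(t, a_red, b_red):.2f}   "
          f"P(blue) = {icc(t, a_blue, b_blue):.2f}")
# Below theta* the higher-b blue item is harder; above theta* it is actually easier,
# so "more difficult" only makes sense relative to where you are on the scale.
```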
Now, coming back to that lower asymptote: this is very tuned to educational testing, where you have multiple-choice tests, and the idea is that when you have a very, very low-scoring examinee, forcing the lower asymptote to be 0 is kind of silly. That said, my general recommendation is basically never to use the three-parameter logistic model, and I'm going to show you why by setting blue to... point 3 and point 2; maybe I got that a little off, I know what it is, let me just fix the values. So what I've done here is create a situation where we have fairly dramatically different parameter estimates, but the curves overlap through much of the upper end of the distribution. You see how those curves are sitting right on top of each other over there? The question is whether you have enough information at the bottom end of the distribution to actually estimate those lower asymptotes. C parameters are notoriously noisy, and so Stata, in all its wisdom, and I'm very grateful for this, has actually not given you the option to fit a fully unconstrained three-parameter logistic model: when you fit a 3PL, Stata says all your c parameters have to be the same across items and estimates a common lower asymptote. That's a really wise thing, because otherwise there's no information down there, you get a whole bunch of noise, and it throws all of your other parameter estimates off. So in general I don't recommend the three-parameter logistic model. In practice it is used a lot, and I don't really understand why; I keep pushing back on states against using it, because it just adds noise. Do not overfit your data, as a general rule. Luckily Stata nudges you away from that by giving you a common c parameter to estimate, which is fine if you're so inclined.

So here is some of the actual output. This is irt in Stata, and now that I don't have to use gsem anymore, my ridiculously long do-files are completely obsolete, because all you have to do is type irt 1pl and your item list and you're all set. It has some good IRT plotting functions for you, and you get output that looks like this.

[Audience exchange, partially inaudible, about whether the data were in long format and about getting the slides and code.] I actually deleted that slide here, but I have a melogit version as well, which is exactly the same thing: the difference, in your mind, is that theta is a random effect, you're grabbing a random intercept for each person and recovering it afterwards. So I usually take a three-step approach: first, especially for economists and multilevel modelers, it's useful to show it as a random-effects logistic model; then I show it to the people who have taken structural equation modeling or factor analysis; and I just try to demystify it as all the same thing under the hood. Don't freak out; we psychometricians have simply developed this kind of mystical language for talking about it.
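For reference, a reconstruction of the three-parameter logistic curve manipulated in the demo above, written with the common lower asymptote c discussed here (again my notation, not the original slide):

```latex
% 3PL: a lower asymptote ("guessing") parameter c lifts the floor of the ICC
P(X_{pi}=1 \mid \theta_p) \;=\; c \;+\; (1 - c)\,
\frac{\exp\{a_i(\theta_p - b_i)\}}{1+\exp\{a_i(\theta_p - b_i)\}}
```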
So now just a quick note: again, this is linear in the log odds. People often debate whether IRT really gives you an equal-interval scale. It is equal interval in the sense that it sets up this linearity assumption, but it treats the log of the odds of a correct response as the target of interest and assumes linearity between theta and all of those log-odds functions. So just remember that this is the assumption. It's a simple model when you show it like this; maybe it's not as pretty, but that's really what's going on.

This, again, is the three-parameter logistic model estimating a common c parameter, which I think is a good thing. You can show that it fits better in some cases. I don't really like the likelihood ratio test for these purposes, because in practice you usually have massive data sets and everything always shows up as fitting significantly better when you give the model more parameters. It's not that interesting; sometimes simpler is better.

[Audience question, roughly: if my probability at my theta is 70 percent, does that mean that if I took 100 similar questions I'd get 70 of them right? Is it about me personally?] That's an interesting question; you'd think it should be deterministic for you in some way. I think the way to think about it is: don't think about you, think about people like you who also sit at that theta. That's probably the easiest way to think about it. There are 100 people at that theta, and 30 of them are getting the item wrong, so it's nothing against you personally; there's just something we haven't modeled in you that would let us tell who gets it right, and we don't have a specific model for you. It's the same sort of thing in any regression scatterplot: you have an x, you have a y, but you're not talking about you, you're talking about the best guess for y among people at that x, on average.

This is just a note on parameterization: do you estimate the variance of the random effect, or do you let the slopes vary? For those of you who have taken factor analysis or structural equation modeling, you know you have to anchor the scale in one of two ways: you set the variance of the latent variable, or you set one of the loadings. I just want to show that there is an equivalence there; this is an aside, and it's here as a reference.

So, some practical guidance when it comes to sample size. You get the same kind of guidance for factor analysis: be careful, this is not a small-sample endeavor. For the one-parameter logistic model you can get away with smaller samples, so when you have small samples, stick with Rasch; that's a good way to get what you need. You get varying advice from different authors for the two-parameter logistic model. For the three-parameter logistic model, don't use it, unless it's the constrained way Stata fits it; it's an absolute mess, and lots of low-scoring examinees are needed for the 3PL, so don't even bother. And this goes for polytomous items too: you may have heard of the graded response model, which is for polytomous items. This is why I was saying get your discrete histograms and see whether people are actually responding at, say, the 4 and 5 score points, because you need responses in each category to estimate those curves.
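On the anchoring point above, a tiny numeric check under made-up values: fixing Var(theta) = 1 and letting the slope be a gives the same probabilities as absorbing a into the latent variable (theta* = a*theta, so Var(theta*) = a^2) and fixing the loading at 1.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = 1.7, 0.4                    # hypothetical slope and difficulty
theta = rng.normal(0, 1, 5)        # anchoring 1: Var(theta) fixed at 1

p1 = 1 / (1 + np.exp(-a * (theta - b)))        # slope estimated, variance fixed

theta_star = a * theta                         # anchoring 2: loading fixed at 1,
p2 = 1 / (1 + np.exp(-(theta_star - a * b)))   # variance of theta* is a**2
print(np.allclose(p1, p2))                     # True: same model, different anchoring
```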
So I want to talk a little about the practical differences between item response theory and classical test theory. What I've shown here is the sum score against the logit of the percent correct, adjusted slightly to keep away from 0 and 100 percent, and you can see that the logit is just a nonlinear transformation of the sum score. And that logit looks a lot like the one-parameter logistic estimates of theta. Which is just to say: don't think IRT is going to create dramatically different scores in your case. The one-parameter logistic model gives you thetas that are a slight nonlinear transformation of the sum score; that's the relationship between the 1PL and the sum score. Once you get to the two-parameter logistic model, you start to get some extra information from the items that discriminate more or less. And between the two-parameter and three-parameter logistic models you've basically got the same thing; that lower asymptote is not making much of a difference. So if you want to talk about the practical impact of IRT on your scoring, that's not where you'll see the difference. Again, I think the value of IRT is really for scale maintenance over time, for linkages, for the fancy things where you're subbing in new items and estimating for new populations. Within any given static item-response data panel, IRT over and above classical test theory is kind of a sledgehammer to a nail. That doesn't mean it's not a cool thing to do, and it's useful for diagnosis, but really what you want IRT for is to say: OK, now I'm going to pick up these particular marbles from this particular urn and use them to target a measurement instrument for a particular purpose. It's for that kind of design that IRT becomes particularly handy.

So let's see, what should I do next. I want to talk a little bit, I talked with Matt about this, about one of the cool things about IRT: if you look at the IRT equation, it puts theta, which is a person-level ability estimate, and b, which is an item feature, on the same scale. It subtracts them; it says, here is your theta relative to the item's difficulty. And what I like about IRT is that it gives you a way of mapping items onto the scale that imbues the scale with, you could almost argue, a qualitative property. You say, OK, let's pick a response probability, the probability at which I'm comfortable saying someone is likely to get an item correct, think of it as 70 percent, and use it as a cutoff. Then we can say: if I have a theta of 2.2, that's where I'm likely to get this kind of item correct, and if I have a theta of 1.2, I'm likely to get that other item correct, and different thetas will have different mappings. So why is this useful? Because oftentimes you're going to get people asking: I got a score of 30, what does that mean? What does an ACT score of 30 mean?
What does an SAT score of 600 mean? What does a NAEP scale score mean? By putting examinee proficiency and item difficulty on the same scale, IRT allows me to create what we call item maps. Here's some of the work we've done for NAEP. It's not very elegant, I have to say, but it says: OK, what is "explain properties of sums of odd numbers"? You can click on that item and see what it means, and it's something a student at that scale location can do, with a specified probability. I really like this, because educational scales can be extremely abstract. You're always wondering what a 10 or a 20 or a 30 is, and I've asked my students in many cases, whether it's a psychological scale, you get a grit score of 3, what is that, or a theta scale, or a scale score of 600: this gives you qualitative descriptions of what that score actually means. I think it's a very powerful, underused method. Increasingly, statistics is moving toward descriptions of magnitudes in addition to statistical tests. For example, how much is an effect size of 0.5? That's something we really struggle with, and being able to say, here's what 0.5 means, you used to be able to do two-digit subtraction and now you can do three-digit subtraction, or whatever it is, being able to accurately describe what you could do then and what you can do now, can be really powerful.

[Audience question, roughly: what happens if the orderings don't hold, if higher-scoring people miss items that lower-scoring people get?] That would be an example of the model not fitting the data. Ideally, every time you move up the scale you only get more and more items correct; obviously that doesn't happen deterministically in practice, but it has to happen on average, and if it doesn't, the IRT model won't fit, and you'll get really bad alphas, because even at the classical test theory stage your items will show you that the scale is not cohering. So if you have a high alpha, if your scree plot supports unidimensionality, if your IRT model fits, which are all different ways of saying you have a roughly unidimensional scale, then what you're describing doesn't happen that often, and by picking a response probability, with these curves being correct, you can order items along the theta scale in this successive way. Sometimes it crosses, as you can see here, so the two-parameter logistic model gets a little dicey for interpretation, because the item orderings differ for different response probabilities, but on the whole I think this is a reasonable way to say: here's what performance at this level means.

[Audience question about how the test was administered.] So previously, everyone got a sort of spiraled set of randomly equivalent forms, yes.
We're moving, in math, to multi-stage testing, which is to say adaptive testing of a sort, kind of like what was done in some of the National Center for Education Statistics tests: a two-stage exam where, based on whether you performed high or low in the first stage, you get harder or easier items. But even for the items an examinee never saw, you'd still, in a model-based way, be able to predict whether they'd respond correctly, if the model holds. That's the whole idea of IRT: even if you didn't observe that item, you can still predict the probability of a correct response to it. So you would hope that these item maps, if the model fits, which is what we always condition on, would hold. I really like this; it's one of my pet things about IRT, so I hope you remember it as something you can do when your aunt or uncle says, my daughter got a 600 on the MCAS, great, what's her percentile rank? You can say: well, here's what she can do. The percentile is fine, but this is a good way of anchoring the scale and talking about, and this is really part of what I think measurement is for, what the score means they can do.

Someone derived this on the Google Doc, who was that? That was good. This is a slightly different, algebraically equivalent version of the same thing; it's just inverting the IRT equation, the ICC, to find the theta at which an item is answered correctly with a given response probability.

OK, I'm going to skip estimation, even though it would be really fun to talk about, sorry Brian. This is a little illustration of maximum likelihood and how things work, but I want to talk a little about how tests... go ahead.

[Audience question, largely inaudible: roughly, what is the goal of these item maps, and do they actually lead people to interpret scores appropriately?] I think the general goal of item maps is to understand what a score implies about what a student knows and is able to do, in the case of educational testing, or what a respondent reports or has, in the case of psychological testing. For example, if you have a grit score of 4, that means you went from neutral to affirmative on this particular item; that's a way of saying what a 4 means. And I do think that generally increases the likelihood of appropriate interpretation of scores. I love that you're asking me this; it's a question I usually ask other people.
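The "inverting the ICC" step just mentioned has a simple closed form under the 2PL. A hedged sketch (the function name, parameters, and item labels are mine, purely for illustration): the theta at which an item is answered correctly with response probability rp is b_i + logit(rp) / a_i.

```python
import math

def theta_at_rp(a_i, b_i, rp=0.70):
    """Invert the 2PL ICC: theta where P(correct) = rp."""
    return b_i + math.log(rp / (1 - rp)) / a_i

# hypothetical items for an item map, keyed by a short content description
items = {"two-digit subtraction": (1.2, -0.8),
         "explain sums of odd numbers": (0.9, 1.4)}
for label, (a_i, b_i) in items.items():
    print(f"{label}: mapped at theta = {theta_at_rp(a_i, b_i):.2f}")
# With a 1PL (common a), the mapped locations keep the same order for any rp;
# with a 2PL, changing rp can reorder items, the interpretive wrinkle noted above.
```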
For example, when there were the big NAEP declines from 2013 to 2015, how big were they? Not that big, if you look at the differences in the kinds of skills students were, on average, able to do this year versus last year. It helps to give people a sense of magnitude. Mark Lipsey has a great piece on translating the effects of interventions into interpretable forms; I think that's the job, and he does it in a bunch of useful ways, cost-benefit analysis, numbers of months of learning. But this is a way, in a criterion-referenced sense, to say, literally: here is what you're able to do now, and here is what you were able to do then. That will facilitate any number of interpretations downstream, because it's really about what we predict you're able to do. So whenever you're thinking about a score and helping people interpret scores, let item maps be one possible way to describe them.

Let me be very specific about another way they're used: item maps are also used to set standards. I haven't put standard setting in here because I have opinions about it. Standard setting is the process by which we say: this much is good enough. NAEP has set standards, a proficient cut score. It is a judgmental cut score; we just had a massive evaluation from the National Academy of Sciences of whether that process was justifiable, and for the most part it was, but it's a judgment. The process uses this mapping: if you are a panelist coming in to set standards, you get a booklet of all of these items in a row, ordered this way, and you flip through the booklet and place a bookmark where you think the "just proficient" designation should be. So that's another very practical way this is used, to help people set a judgmental cut point on what they think is good enough, based on what people at that level can actually do. Does that help?

[Audience question, roughly: what about the classic Rasch people?] This is a great point, Chris. There's a camp of very thoughtful, well-reasoned, but also sometimes, not to offend anybody, am I on tape?, cultish people, many of whom are very close friends of mine,
who are in this Rasch camp, where they think the model is so useful that it's sometimes worthwhile to throw away data to get the model to fit. That sounds a little crazy to those of us who grew up in a more statistical camp, but the idea is: look, we're trying to design a good measure; this item is discriminating differently, it's going to lead to these weird ordering effects where my item maps aren't in the same order if I pick different response probabilities; I don't like that, so I'm not going to use that item. Which means you're defining, in a very strong, very statistical way, what you think the construct is, and it can become a subset of the things you might want to measure, because you're throwing away everything that doesn't fit the model. What you end up with, arguably, is a very clean scale where everything is ordered without conditions, there's no crossing of these lines, no interactions, this item is always more difficult for everybody than that other item. What you might have lost in the process is content, and as I've said, content is king. You can see my bias here: I think you should fit the model to the data, have a theory, and not throw out data to fit your model. At the same time, they have a framework in place that makes them comfortable doing that for particular uses, which tend to be diagnostic, these targeted scales for particular purposes, and they don't tend to claim it's good for all purposes; I don't think they'd say to do that for a state assessment. But this camp exists, and they're good people, but they really like their model.

[Audience question, roughly: what if you think you're measuring more than one thing with your 20 items? Would they split them up?] My friends in that camp, if they think they're measuring multiple things with those 20 items, might do that: treat each one separately and try to create separate scales. At the level of the full 20 items, it's an exploratory or confirmatory factor analytic approach, where you take a data-based route to asking whether this item loads more on this factor or that one. That's something you can do as well, and I see the confirmatory factor analytic camp as not so different from the Rasch camp: they're trying to make the picture fit, and I don't think that's bad, I think it serves particular purposes. But I tend to stay unidimensional, because I'm cynical about the ways people use multiple scores; they're just going to add them together in the end, so you might as well analyze it that way. For theoretical reasons, though, I see why SEM and factor analysis are useful for that purpose.

So, some useful facts for you. For the one- and two-parameter logistic models, there is a sufficient statistic for estimating theta. What is a sufficient statistic? It holds all the information you need to estimate theta. It is not theta, but it holds all the information you need to estimate it. That sufficient statistic is the sum of the discrimination parameters for the items you answered correctly. That makes sense at least operationally, if not necessarily intuitively. In a 1PL model, all the discriminations are the same, which is to say that the number correct, for the Rasch model, holds all the information you need
to estimate your eventual theta, which is to say that everyone who gets the same sum score will have the same theta. OK. Now, when discriminations differ, some items effectively hold more information than others, and you get credit for the discrimination parameters of the items you answer correctly. So if you get 20 correct and I get 20 correct, if you get 80 percent and I get 80 percent, we might not have the same theta. Why would it be different?

[Audience answer: because one of us got the harder items right.] This totally tricked me too, I'm so sorry, but that is exactly what I said when my advisor asked me this, like, 12 years ago. So yes, say you got the 20 hard ones right and I got the 20 easy ones right. But don't forget that if you got the 20 hard ones right, then you must have gotten some of the easy ones wrong. And I said: that's weird. So it's actually not the difficulty of the items that matters, it's the discrimination. The idea is that the 20 you got right were the ones that held the information, and the 20 that I got right were the ones that were coin flips.

[Audience follow-up, partially inaudible, pushing on the intuition about getting hard items right and easy items wrong.] But again, remember that for someone to get 80 percent of the difficult items correct while missing easy items, that's basically a statement of misfit. That's weird, and it doesn't happen that often. If it happened a lot, the scale wouldn't be unidimensional; the model would say, I have no idea what you're doing, these items aren't correlating with each other. So it doesn't happen very often, and for the most part the scale will be unidimensional, which is to say that, if the 1PL fits, the higher you are, the higher your probability of getting these items correct, and an even higher probability for the easier items below them. So the unidimensionality assumption, and model fit, kind of bake in the rarity of that happening. But that's absolutely right, that was my intuition too; you just have to remember to flip it and say: but don't forget, you got all those easy ones wrong, which is weird.

Good. So I think this is helpful intuition for you.
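A small sketch of the "credit for discrimination" point (hypothetical discriminations and response patterns): under the 2PL the weighted sum of correct responses, the sum over items of a_i times x_i, is the sufficient statistic, so two examinees with the same number correct can carry different evidence about theta.

```python
import numpy as np

a = np.array([0.5, 0.6, 1.8, 2.0])   # made-up discriminations
# two examinees, each with 2 of 4 items correct
x_you = np.array([0, 0, 1, 1])       # got the two high-a (informative) items right
x_me  = np.array([1, 1, 0, 0])       # got the two low-a (coin-flip-ish) items right

for name, x in [("you", x_you), ("me", x_me)]:
    print(f"{name}: number correct = {x.sum()}, "
          f"weighted (sufficient) score = {float(a @ x):.1f}")
# Same number correct, different sufficient statistics, hence different 2PL thetas.
```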
And just to note: when you get your scores from state testing programs, where do they come from? You might think that, if the state is using IRT, they would estimate theta for everybody and report all those different thetas. That is not what happens, and there's a reason it's not what happens, and it's purely to do with feasibility and transparency. The feasibility point is that you can't rerun these giant models every single time. The transparency point is: hey, that thing we just talked about? Try explaining that to someone in the public. You got 20 correct and I got 20 correct, and you're telling me we got different scores? We psychometricians can't explain that well, so we're giving up on the fact that the theta-hat from a true two-parameter logistic model is a better estimate of theta, and that if you answer more informative items correctly we should use that information. We generally don't, for the sake of transparency. What a lot of states publish, and you'll see these in their technical reports, are raw-score-to-scale-score conversion tables: take the sum score, find your row, and there's a one-to-one mapping from raw scores to scale scores. We wouldn't be able to do that if we had this weird thing where, if your 20 correct were one particular set of items, you get this theta, and someone else with a different 20 gets this other theta. That's the difference between pattern scoring and number-correct scoring. So you might, in your own analyses, have thetas from a 2PL with a fairly continuous distribution, but what you get from a state will look much more discrete, even if they fit a 2PL or a 3PL.

[Audience question, largely inaudible, about whether this loses much in practice, say for individual schools.] As I showed with those scatterplots earlier, the correlations are like 0.98, 0.99, so it is not making too much of a difference. But yes, we're basically conceding that we'll punt on that for feasibility and transparency reasons. And don't forget the real value, which I haven't had sufficient time to demonstrate here, is scale maintenance: we can't use the same items this year that we used last year, because everyone saw them last year, so now we have to use different items. But because we know the features of the marbles in our urn, we can build a test that measures the same way, across the same range, as the one we had before.

So this is to give you an example. This is the 1PL, this is the sum score, and this is the distribution of the theta scores: it's the same thing, the same thing, the same thing, because the sufficient statistic is the sum score, so all we did was a one-to-one mapping. And this is what I described before: what does IRT do, for practical purposes, for a static set of item responses? It squishes the middle and it stretches the ends, and that's it; you can just barely see that here. All we're doing is a nonlinear transformation.
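A sketch of how a raw-score-to-theta table like the ones in those technical reports could be built under a 1PL (the item difficulties and the crude grid search are made up for illustration; operational programs use more careful estimation and then a further transformation to the reporting scale):

```python
import numpy as np

b = np.array([-1.5, -0.8, -0.3, 0.0, 0.4, 0.9, 1.6])  # hypothetical 1PL difficulties
a = 1.0
grid = np.linspace(-4, 4, 801)

def loglik(theta, raw):
    # 1PL log-likelihood given the raw score (up to a constant in theta);
    # under the 1PL the sum score is sufficient, so the response pattern doesn't matter
    return a * theta * raw - np.log1p(np.exp(a * (theta - b))).sum()

print("raw score -> ML theta")
for raw in range(1, len(b)):          # skip 0 and perfect scores (ML theta is +/- infinity)
    theta_hat = grid[np.argmax([loglik(t, raw) for t in grid])]
    print(f"{raw:>2} -> {theta_hat:+.2f}")
```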
So this is the one-parameter versus the two-parameter logistic model; I'm showing you these scatterplots here. This is the one-parameter logistic model, so everyone who got a 3 gets the same score, but you can see that within any given score point, the people who score high on the 2PL are the ones who got the discriminating items right, and the people who score low got the less discriminating items right.

So, how should I close? I'm trying to think about how to close here, with five minutes left. Let me just go back to basics and open it up for questions; I think that's what I'll do. There's a lot more here: I have slides on linking, showing how you can get to the comparisons I showed you today through common items, and so on. But let me close here and open it up for questions.

What do I want you to believe? I think there's so much to be said for diligent exploratory data analysis, and I hope you don't think that's too boring, because I swear it will save you so much time later, when you're trying to fit your IRT models and they're not converging. It is well worth it. Today, in selling IRT, I showed you how it works, but there's a really powerful aspect I didn't get to animate sufficiently: those marbles from those urns really do have these properties. Each item has an information function associated with it, and you can pick it up and say, I want to measure here, and maybe also here, and you can build a nearly ideal test that way, one that discriminates at particular points in the theta distribution. That's really powerful. For example, if you wanted to evaluate people right at a cut score, if you were designing a diagnostic test for pass/fail purposes, you could stack up all the items from your urn that have maximal information at that particular point and target a test for precisely that purpose. IRT gives you, through these item-level estimates, that information. And I can actually show you; you can see it in this app if I zoom out a little. Under here I have these item information functions. Here's what I'm going to do: I'm going to increase the discrimination on the blue item, let's make it 2, and you see, right there, that item now has a lot of information at exactly that point. If its difficulty were negative 1, its information would sit out at this other point instead. So each item has this information function, and you can stack them up and figure out where you're going to minimize your standard errors. For these people over here, you'll have low standard errors, and these other people you can sacrifice, because you're not making decisions about them.
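A sketch of the information functions being stacked in that demo (standard 2PL formulas with made-up parameters): item information is a_i^2 * P * (1 - P), it peaks at theta = b_i, test information is the sum over items, and the standard error of theta is 1 over the square root of test information.

```python
import numpy as np

def p2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_info(theta, a, b):
    """2PL item information: a^2 * P * (1 - P), peaked at theta = b."""
    p = p2pl(theta, a, b)
    return a**2 * p * (1 - p)

# hypothetical item bank stacked near a pass/fail cut score at theta = 0.5
bank = [(2.0, 0.4), (1.8, 0.5), (2.2, 0.6), (0.7, -2.0)]
theta = np.array([-2.0, 0.5, 2.0])
test_info = sum(item_info(theta, a, b) for a, b in bank)
print("test information:", np.round(test_info, 2))
print("SE(theta):       ", np.round(1 / np.sqrt(test_info), 2))
# Precision is concentrated near theta = 0.5, where the informative items sit;
# examinees far from the cut get larger standard errors.
```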
So again: don't forget content, don't forget classical test theory. I've just begun to scratch the surface of the usefulness of IRT, and we've all got a lot more to learn in this field. So let me open it up for questions. One of my students just told me the other day: don't ask "do you have any questions," because the answer could be no; say "what questions do you have."

[Audience question, largely inaudible, about how much of this really requires IRT.] The upshot is that 70 or 80 percent of what you need to do can be done without it. Again, when is IRT helpful? When you're changing items and changing populations, when things are changing over time. If you just have this one little form, an eight-item scale, don't worry about IRT. But if you want to sub out those items because people are starting to remember them, if we started using grit for high-stakes testing and people said, hey, I remember that item, then you want to start switching items out, and that's when IRT becomes super useful. So keep it in your back pocket for when you need to swap out items, or, say, you want to take the test somewhere else. We can talk about differential item functioning, but what if you want to pick this test up and give it in Japan, something like that? Then IRT helps you work out measurement invariance. There are all these use cases where you should feel like you've got IRT as the sledgehammer in your basement, ready to come out and tackle a particularly thorny problem. But again, classical test theory is your basic IKEA toolkit; it gets you pretty far.

[Audience question, largely inaudible, mentioning licensure or certification exams that use IRT and asking how to deal with items that seem to behave differently for different groups, and where differential item functioning came from.]

So, very strategically, back in the day, when biased tests were a big concern, not that they're not a concern anymore, scholars at ETS said: let's call it something more neutral. They were asking good questions about whether measures differ for different people, but "bias" is such a loaded term, so Paul Holland and others coined the term differential item functioning to make the study of bias sound scientific, and it kind of does, I guess. The basic idea is that you have two different item characteristic curves for the same item, corresponding to different groups. That's bad: theta does not contain all the information about how you're responding to a particular item, and if you estimate, for a different population, an item characteristic curve that doesn't align, you've got evidence of differential item functioning for that group. There's a whole set of tools for this alongside the irt suite, the difmh command for example; you can type help difmh. You can also do a logistic regression of the item score on the total score with an indicator for the group, and that in and of itself gives you a test of whether the item functions differently for one group or the other. So there are a bunch of different ways to detect it, and it's a violation of the model, and a concern.
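A sketch of the logistic-regression DIF check described above, on simulated data (in practice you'd substitute your own item, total-score, and group variables; the uniform-DIF effect here is injected on purpose): regress the item score on the total score, a group indicator, and their interaction, and examine the group terms.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, n)                        # 0/1 group indicator
theta = rng.normal(0, 1, n)                          # same proficiency distribution in both groups
total = rng.binomial(20, 1 / (1 + np.exp(-theta)))   # stand-in total score on the rest of the test

# one studied item with uniform DIF: same theta, but harder for group 1
p_item = 1 / (1 + np.exp(-(theta - 0.6 * group)))
item = rng.binomial(1, p_item)

df = pd.DataFrame({"item": item, "total": total, "group": group})
fit = smf.logit("item ~ total + group + total:group", data=df).fit(disp=False)
print(fit.summary().tables[1])   # 'group' flags uniform DIF; 'total:group' flags non-uniform DIF
```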
[Audience question, partially inaudible, roughly: how does DIF square with real group differences, like achievement gaps or gender gaps? Aren't you assuming those away?] So, DIF is conditional on theta: for two people with the same theta, are there different probabilities of a correct response? That still allows different groups to have different distributions of theta. You can have two groups with two different distributions of theta, and that can be a true gap; but if you estimate two different item characteristic curves for the same item in those groups and they don't align, that's what's problematic. So for people who score very low in both groups: are they equally likely to get this item right?

[Audience follow-up, partially inaudible, about whether this reasoning is circular.] So there are two things we do. First, we assume, through the content development process, content is king, that you are measuring something that's right; that part is theory, that we're not just asking yacht questions or country-club questions, or color-dependent questions for people who are color blind. It has to go back to content in that regard. And then, once you have that, you're looking at relative DIF, because internally it's always going to sum to zero; it's circular in exactly the way you're describing. Or you use some external referent that you assume is unbiased. One or the other: if you do it the internal way it's circular, and if you do it the external way you have to question the bias in the external referent. Those are the two approaches, and the way we get out of that circularity jam is coming all the way back from models to content, and to some theory that what you're measuring is right. What we usually do in the test development process is flag items for DIF; they go to a content review committee; the committee tries to come up with hypotheses for why the DIF could have happened; usually they can't, and so they leave the item in, and that's that. Paul Holland wrote a famous paper, in like 2003 or something, about how the DIF you find usually doesn't make much of a difference, and that's really what happens in practice: tests are already designed, through the content development process, and this is Diane Ravitch's "language police" point from way back, to squeeze out everything interesting, everything that could possibly function differently across groups, so you get something so sterile in the end that there's no basis on which to throw anything out. It's kind of a sad statement, but there you go.

[Audience comment, largely inaudible, about whether the answer depends on the intended use.] It always depends on the use. And, first of all, I forgot, what time do we end? I thought we ended at 5, but I realize now it's 5:30. Wow. OK, well, we can talk about all sorts of stuff; keep the questions coming. I am kind of exhausted, but we've got half an hour, so let's talk scale pliability. You'd better ask questions, otherwise I'm going to keep going.
[Audience question about model fit, largely inaudible.]

Let's get to that. So, to address the fit question: there are different schools of thought. People trained in psychological measurement, more than in educational measurement, tend to be more interested in model fit, and people in structural equation modeling and factor analysis generally care about a whole array of fit statistics that make me dizzy sometimes. Back in the day, twenty years ago, you could get tenure by inventing the next new fit statistic, and now there are sixty of them and I can't keep track. I don't mean to be glib, and you can probably tell that I'm skeptical of fetishizing fit. I think you can start with something like an alpha statistic, and once it's at a sufficient level, you're using IRT to accomplish something: if it helps, use it; if not, don't. I'll put a quick alpha sketch below. So I think the dimensionality questions are often a bit overwrought. That said, as a matter of operationalizing your measurement objectives, I do think alphas, scree plots, and overall fit, the CFI and RMSEA and the whole suite of fit statistics, are helpful. The only problem is you run the risk of someone saying your fit statistic is .02 below the cutoff, and you're left asking where these cutoffs even came from and what they mean. So I'm a little cynical about fit statistics, but I do think you should support the claim that the model fits the data; I just don't obsess over it.

So how do IRT and SEM or factor analysis differ in practice? In about the same way that regression and ANOVA differ in practice. When we use IRT we tend to be very interested in the marbles: we're trying to build or maintain a test, so we care about the specific parameter estimates for those items and we use them very carefully. In SEM and factor analysis you're more interested in a global question, does the model fit, and if it fits, it helps support your theory. Sometimes in structural equation modeling you care about particular structural parameters the way you care about regression coefficients, but in general the interest is in that global notion of fit. So I guess that's the difference: IRT is more like, I don't care so much about global fit; my standard error on this discrimination parameter is pretty decent, the scale is more or less unidimensional, and that will do.
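A minimal sketch of that "start with an alpha statistic" step, coded from the standard coefficient-alpha formula; the simulated one-factor data are just for illustration.

```python
import numpy as np

def cronbach_alpha(item_scores):
    """Coefficient alpha for an (n_persons x n_items) matrix of item scores."""
    x = np.asarray(item_scores, dtype=float)
    k = x.shape[1]
    item_variances = x.var(axis=0, ddof=1)
    total_variance = x.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_variances.sum() / total_variance)

# Toy check: eight noisy indicators of one common factor.
rng = np.random.default_rng(0)
theta = rng.normal(size=(500, 1))
items = theta + rng.normal(scale=1.0, size=(500, 8))
print(f"alpha = {cronbach_alpha(items):.2f}")
```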
So what we usually see in practice are scree plots and these general fit statistics; someone describes fit, and then you move on. If you look at Duckworth and Quinn, they do a sort of token confirmatory factor analysis: OK, it fits, now let's go see whether it predicts future outcomes, enough of that, let's go do something else. I think that's a reasonable standard practice, and that article is a good model, where the author does the internal-consistency examination on his scale, confirms it works, and people go on using it.

[Audience question, largely inaudible: something about putting priors on the theta distribution and feeding that information back into the estimates.]

You mean if it's not normally distributed? That's a cool idea, and in general I think it fits under more Bayesian ways of going about this. There are a lot of people who take a Markov chain Monte Carlo approach and estimate everything simultaneously: priors on the b parameters, priors on the a parameters, maybe strong priors on the c parameters, and the data feed back into that information rather than a two-step approach. That's probably where this comes in, in a more fully Bayesian framework. I haven't done that in a long time, so I'm not sure where the current state of the art is, but it's a cool idea.

So let me, it's probably beer o'clock, but let's nonetheless do a little bit of scale pliability, or usefulness, or whatever we're about to get into. Is this an equal-interval scale? That's the big debate going on. I'm not sure it's a debate, it seems pretty obvious to me, but there are those in our field, less utilitarian and instrumentalist than I am, who are really struggling to give psychological and educational measurement the cachet of physical measurement. They want to say, this is my unbreakable scale, don't bend it, and I think that's sort of silly. So, interval scale: again, we set this up as linear in the log odds of correct responses to items, so there is a sense in which it is already equal interval; you always have to be equal interval with respect to something. And there's a good literature right now, Bond and Lang, and Nielsen as well, whom you cited in your paper, and I appreciate their good work on this, trying to tie achievement scores to external referents and bending the scale toward these other scales that achievement tests typically get subjected to, in sometimes very useful ways.
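Circling back to that Markov chain Monte Carlo point for a moment: here is a rough sketch of what "priors on the a and b parameters, estimated simultaneously with theta" can look like. It assumes the PyMC library and simulated Rasch-like data; the priors and sampler settings are illustrative choices, not the speaker's.

```python
import numpy as np
import pymc as pm

# Simulate toy binary responses from a Rasch-like model.
rng = np.random.default_rng(0)
n_persons, n_items = 200, 10
theta_true = rng.normal(size=n_persons)
b_true = rng.normal(size=n_items)
y = rng.binomial(1, 1 / (1 + np.exp(-(theta_true[:, None] - b_true[None, :]))))

person_idx, item_idx = np.indices(y.shape)
person_idx, item_idx, y_flat = person_idx.ravel(), item_idx.ravel(), y.ravel()

with pm.Model() as two_pl:
    theta = pm.Normal("theta", 0.0, 1.0, shape=n_persons)    # person proficiency, standardized
    a = pm.LogNormal("a", 0.0, 0.5, shape=n_items)            # discrimination, kept positive by the prior
    b = pm.Normal("b", 0.0, 1.0, shape=n_items)               # difficulty
    logit_p = a[item_idx] * (theta[person_idx] - b[item_idx])
    pm.Bernoulli("y", logit_p=logit_p, observed=y_flat)
    idata = pm.sample(1000, tune=1000, chains=2, target_accept=0.9)

print(idata.posterior["b"].mean(dim=("chain", "draw")).values)  # posterior mean difficulties
```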
So: the theta scale is equal interval with respect to the log odds of correct responses to items, but there's nothing magical about that; you can bend everything. Everything will still fit as long as the transformation is monotonic. The model is no longer linear in the log odds, but it will still fit the data, because it just chases the data in whatever arbitrary metric you choose. Lord essentially shows that the data can't tell the difference: as long as you monotonically transform both the item response function and theta together, the model chases the data and will do whatever you do.

So what do you make of scale indeterminacy? The logistic item response function is mathematically convenient and has a loose rational basis under normality assumptions, but the data can't tell you which of the plausible monotone transformations is desirable, and there is no one correct or natural scale for measuring traits or abilities in education. So I come down very close to what Brian and Jesse articulated so well in their paper: it's probably useful to think of a class of what I like to call plausible monotone transformations that you should subject your scales to, re-estimate after those transformations, and make sure that whatever you're concluding is robust to them. Interpretations should be robust to plausible monotone transformations of scales. This is what I described before, where we have these distances from 1 to 2 to 3; I think we need a way to talk about how pliable these scales are. Think about the item maps: who's to say how the distance between two-digit arithmetic items compares with the distance between items on derivatives? How are you going to say objectively what that difference is? So yes, I would say the scale is pliable, neither strictly ordinal nor strictly interval. Ordinal-versus-interval feels like an antiquated dichotomy, and we should think about something in between; the equal-interval argument is weak but not baseless.

This is just to illustrate what happens if we operationalize a transformation of an underlying scale. I've already said, OK, I see normal distributions, but what I really care about are differences down there, say from negative 3 to negative 1.
So that's where I want to prioritize growth, either from an incentive standpoint, or because from a measurement standpoint I truly believe those distances down there are, say, ten times as large. Or you can say: these are actually the distributions I've got. And if you then compute a straight standardized mean difference, the transformation changes the actual effect size, the number of standard deviation units; you can look at differences in percentiles too. The idea is that whatever judgment you're making should be robust to these transformations.

Similarly, what Sean and I did addressed a separate problem but still produced a neat technique, I think: we define a class of transformations that is mean- and variance-preserving. That's just to keep your head on straight, so you're not jumping to a completely different scale; you keep the mean and variance approximately the same and warp things in various directions, and then you can compare distributions. It's kind of fun. So, subject to those constraints, we get a class of exponential transformations, and this formulation here is the transformation from x to x-star. What we're doing is saying: this red transformation accentuates the higher scores, and the blue transformation accentuates the lower scores. You can also imagine kurtosis-type transformations, where you stretch the tails but keep everything symmetrical; these go in one direction or the other. And this is what happens under various c parameters as we've defined them, starting from a normal distribution: this is c of negative 0.5, negative skew, for the blue distribution, and this is c of positive 0.5, positive skew, over there.
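Here is one way such a family could be implemented: an exponential, strictly monotone transformation that is re-standardized to preserve the original mean and variance, with c below zero producing negative skew and c above zero producing positive skew. This is an illustrative formulation consistent with the description above, not necessarily the exact one from the paper.

```python
import numpy as np

def skewing_transform(x, c):
    """Monotone exponential transformation of scores x, re-standardized so the
    transformed scores keep the original mean and variance. c < 0 accentuates
    low scores (negative skew); c > 0 accentuates high scores (positive skew);
    values of c near 0 approach the identity."""
    x = np.asarray(x, dtype=float)
    if abs(c) < 1e-8:
        return x.copy()
    z = (np.exp(c * x) - 1.0) / c        # strictly increasing for any c != 0
    z = (z - z.mean()) / z.std()         # re-standardize
    return x.mean() + x.std() * z        # restore the original mean and variance

rng = np.random.default_rng(0)
scores = rng.normal(size=100_000)
for c in (-0.5, 0.5):
    t = skewing_transform(scores, c)
    skew = ((t - t.mean()) ** 3).mean() / t.std() ** 3
    print(f"c = {c:+.1f}: mean = {t.mean():+.2f}, sd = {t.std():.2f}, skew = {skew:+.2f}")
```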
Yes? [Audience comment, largely inaudible: roughly, a difference counts for more down here than up here, so aren't these transformations different questions rather than noise?]

I think that's absolutely right: any sensitivity study isn't random, it's asking a different question, and that's exactly the right way to frame it. This is where item maps can help, because what an item map does is travel along with this function and say: look, what you've now said is that derivatives are close to integrals and the jump from division to abstraction is huge. That's not random; that's a statement of belief about these different magnitudes. So don't treat it as random error. Say: under this condition you get these results, and under that condition you get those results. And by the way, I think that's a general way to think about sensitivity studies, and a lot of people have said this: they're not a bunch of random things you do; each one asks a different question.

The way we've set this up, the values of c are chosen so that the slope of the transformation at the 5th percentile ranges from one fifth to five times the slope at the 95th percentile. That's one way to think about it: the relative rate down here is up to five times the relative rate at the top of the distribution. There are various ways to think about how much to stretch and squish the scale. So if you want a what-to-do recipe: take the scores; apply a family of plausible transformations, and, taking Sue's feedback seriously, be very clear about what each transformation implies for a difference down here versus a difference up here, using item mapping or some other description; calculate the metrics of interest from each transformed dataset; and assess how robust your interpretations of those metrics are across the plausible transformations. A small sketch of that loop follows below.

[Audience question about the reference, inaudible.]

Actually, that reference was about measurement more broadly; what we were trying to do there was make sure our reliability estimates would not change too much whether the approach was parametric or non-parametric. So we were really solving a different problem and just saying, hey, here's a cool transformation that works for this purpose. I cite it because Sean and I hit three fun things in that paper that had almost nothing to do with the abstract: the first was, what are the reliabilities across state testing programs in the United States, which we just threw in as a figure; another was this little thing here, just trying to solve a practical problem with our estimation procedure. So it's really ancillary, but that's where we started writing it up, and we should really write it up more formally. All the things we don't have time for. But yes: is this the right family? Can we think of kurtosis-type transformations? Is c bounded appropriately? I'd love feedback on that. Don't think of this as random.
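A minimal, self-contained sketch of that recipe, using a simple exponential transformation family like the one sketched earlier; the groups, the c values, and the choice of a standardized mean difference as the metric are all illustrative assumptions.

```python
import numpy as np

def skewing_transform(x, c):
    """Monotone, mean- and variance-preserving exponential transformation (illustrative)."""
    x = np.asarray(x, dtype=float)
    if abs(c) < 1e-8:
        return x.copy()
    z = (np.exp(c * x) - 1.0) / c
    z = (z - z.mean()) / z.std()
    return x.mean() + x.std() * z

def standardized_mean_difference(a, b):
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2.0)
    return (b.mean() - a.mean()) / pooled_sd

rng = np.random.default_rng(1)
group_a = rng.normal(0.0, 1.0, 5_000)   # hypothetical score distributions
group_b = rng.normal(0.4, 1.0, 5_000)

# 1) take the scores, 2) apply a family of plausible transformations,
# 3) recompute the metric of interest, 4) inspect its range across the family.
scores = np.concatenate([group_a, group_b])
for c in (-0.5, -0.25, 0.0, 0.25, 0.5):
    t = skewing_transform(scores, c)
    ta, tb = t[: len(group_a)], t[len(group_a):]
    print(f"c = {c:+.2f}  standardized mean difference = {standardized_mean_difference(ta, tb):.3f}")
```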
So, for example: does the reliability coefficient change if we use the nonparametric, ordinal reliability formulation? What this shows is that our correlations are actually pretty stable across all of these transformations, so we don't have to worry too much about reliability depending on the scale transformation. Here's what I'd say is left to do: we can create a hierarchy of statistical procedures based on how sensitive they are to scale transformations. Differences in means are pretty darn robust; correlations, as we've shown here, are pretty darn robust; differences in differences, that's where it gets problematic. Whenever you have these interaction-type effects, the result depends heavily on the scale, because all I have to do is squish this side to make the lines parallel and stretch that side, and I get a different interaction effect. So there are different classes of procedures that we can lay out in a more-sensitive versus less-sensitive framework, and I think that would be useful. Nielsen does some of this in his papers; it wasn't a shock to us that changes in gaps are scale-sensitive, that's pretty straightforward, but generalizing it, saying these kinds of methods and these kinds of questions are in general sensitive to the scale, is really useful. This is also a little example of how value-added models are not robust, but we don't have much time, so I'll skip it.

Another good reference on the changes-in-gaps question is the Ho and Haertel work from around 2006, where we showed that for the most part gaps are stochastically ordered: there is essentially nothing you can do to reverse the sign of a gap, because high-achieving and low-achieving groups are far enough apart that no transformation could plausibly flip them. But we also sketched a proof around what we call second-order stochastic ordering, which is a mouthful, and the idea is that for changes in gaps it is very easy, as long as certain conditions hold, for a transformation to reverse the sign of the change in the gap. Right, exactly, which is the same thing as an interaction effect. Exactly right.
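A small simulation of that second-order point: with made-up group means, a monotone transformation (here just the exponential) reverses the sign of the change in the gap, even though each cross-sectional gap keeps its sign.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
a_t1, a_t2 = rng.normal(0.0, 1.0, n), rng.normal(1.0, 1.0, n)   # group A at times 1 and 2
b_t1, b_t2 = rng.normal(0.3, 1.0, n), rng.normal(1.2, 1.0, n)   # group B at times 1 and 2

def gaps_and_change(f):
    """Gap (B minus A) at each time, and the change in the gap, on the f-transformed scale."""
    gap1 = f(b_t1).mean() - f(a_t1).mean()
    gap2 = f(b_t2).mean() - f(a_t2).mean()
    return gap1, gap2, gap2 - gap1

for label, f in (("identity", lambda x: x), ("exp transform", np.exp)):
    gap1, gap2, change = gaps_and_change(f)
    print(f"{label:>13}: gap t1 = {gap1:+.3f}, gap t2 = {gap2:+.3f}, change = {change:+.3f}")
# On the identity scale the gap narrows (change < 0); after the monotone exp
# transform the gap widens (change > 0), so the sign of the change flips.
```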
[Audience comment, partly inaudible.]

Your response is exactly right, maybe not the leave-the-room part, though maybe that's the right thing to do, but your response is exactly right, which is to ask: what intervals is this assuming? That's where I think item maps and scale anchoring can be really helpful, because you're saying: look, if you want to disagree with me about the ordering and the spacing, here is what I'm claiming about the scale, this point means this, that point means that, so have a content-based argument with me about it. Go ahead. That's where you can set your stake in the ground. Because what I don't want to do is slide into a kind of nihilism, taking Bond and Lang a little too far, and say, let's solve for the craziest possible transformations that could reverse this gap; I think that's too extreme. So what I also tried to do in the paper with Carol Yu is ask: what distributions do we actually see in practice, and how far would we have to stretch things for the stretch to remain plausible? So we should have a debate, in exactly the way I think you and Jesse were describing, about what's plausible in which situations, a decision that can be informed by a survey of the shapes of the distributions we see in practice. That's fun. Thanks, I'm glad we had the extra time.