Welcome. As in many courses here, we have students from two classes. Today we are very pleased to have joining us a professor at the Harvard Graduate School of Education, who is going to be talking with us to give you a flavor for some of the really important issues in education measurement, not only for measurement folks in education research but for policy analysts as well. The goal here is not to cover everything that one would learn in a full sequence in psychometrics, but to give you a flavor of what's out there and help give you some grounding. He is an accomplished academic and a member of the National Assessment Governing Board. We're really very pleased to have you. Thank you, thank you. We had a game against Michigan. I shouldn't have started with that; the crowd turned against me. So thank you for this opportunity. These are weird things, right? Three hours on measurement: what can I accomplish? I'm actually not sure. I think the presentation is kind of a mess, but maybe deliberately so. What I wanted to leave you with is a couple of provocations, a couple of references, and the nagging idea that you need to learn more and that there are places for you to do so. I should start by saying that one of the places you can do so is right here. You have a stealth psychometrician in Brian, not to mention a psychometrician in Matt, not to mention people who know statistics and measurement broadly, like Sue and Chris. So you do have a bunch of people who I would love to have sitting in the audience, and I would love to watch them teach this in their own way; I would learn a ton from it. We all would teach this in different ways. Brian and I actually recently had an exchange about measurement. He was talking to economists, and I was reflecting on how he presented it, and it was a ton of fun. Actually, I was talking to Matt about this earlier
too. I wish we could talk more about how we teach this. I think it's a great opportunity for me to engage with you, but also to get a little bit from people about the different ways that they approach this, because we could learn a lot from them. So here are my provocations. I kind of want to do this like, you know, listicle writing; this is how you get people's attention: seven things you need to know about measurement. I'm caving; I said that, but I'm going to do it anyway. So here are those provocations. I'm being deliberately kind of extreme here, just to provoke a little bit. One: validate score uses. The question I find myself asking my students most often when they are starting to conceive their measurement projects is "what for?", and not just "what for?" but "what?" What is a score? Ultimately, whether it's a scale score or an average of scores or a regression coefficient, those scores are what's used, and we don't validate tests; we validate those ultimate uses of scores. Two: content is king. Not models, content. If you ever start analyzing data without knowing what the items were, you should slap yourself on the wrist, or feel your advisor slapping you on the wrist. What are you measuring? You need to have an embodied experience of what that is. You need to put yourself in the shoes of your participants and feel what it's like to be a part of that measurement procedure. Three: we have a tendency in measurement to jump at the neatest, hottest-sounding models with the longest acronyms. Stop it. Start with simple descriptive statistics. I think of classical test theory as the descriptive statistics of measurement. You should always start there, in the same way that those of us who use Stata start off with what command? Summarize. So alpha is your summarize of measurement. OK.
This is like "these are not the droids you seek," or whatever it is. Four: the reliability that is most often calculated is probably not the reliability that you're interested in. I'd like you to leave here today being able to answer your aunt or uncle when they ask you, "What is reliability?" I want you to be able to answer that question. I remember when I was getting my master's in statistics, my uncle asked me what a standard deviation is. Can you think of how to answer that question for your uncle? I could give you an equation, but how do you actually describe that in a meaningful way? I think of reliability as the standard deviation of measurement. What does a given reliability value mean? You should be able to answer that by the end of today. Five: item response theory is just a model. We always hear, "We should really have someone who knows item response theory," "I wish I had someone who knows item response theory," "I wish someone could teach item response theory." Item response theory is just a model, and it is also a very, very useful model. So I'm going to both demystify it and focus you on what it does particularly well. Six: your scale is pliable. Bend it, don't break it. This is something that Brian wrote about recently as well. The numbers: you shouldn't think of them as solid ground. Think of the distances between them as kind of like a bridge with springs between the boards. There's a sense that the scale is pliable, shiftable, but not breakable, and there are actually empirical ways we can address this. The judgments you make based on your scale information should be robust to that springiness.
And then seven, and again this is a reiteration of other things: know the process that generated your scores and use them accordingly. Do not go beyond what the data suggest. That's just a general recommendation. So these are my provocations. They sound like floating talking points now, but they're deeply embodied for me, and I'm going to do my best to make them feel meaningful to you over the next couple of hours. But I want to start by stepping back. When you think about measurement, what kind of resources can you use moving forward? In the e-mail that I think some of you received from me, you've got two to three citations, right? Two references, and the first is the Standards for Educational and Psychological Testing. What we're doing has decades, centuries of history, and a field has built up around it that has some guidance for you. Three major bodies, the American Educational Research Association, the American Psychological Association, and the National Council on Measurement in Education, came together and actually agreed on something. These are the standards of the field, and this is a very powerful tome, not just for you but for the people you're developing tests for. You can say, "I did this, and these are the standards of the field." You should have this book, and there's a discount for members of any of these organizations. If I were to recommend one thing, it's this: these are the authoritative standards of the field. They're not perfect; I've got some quibbles with them. But they're powerful, and it actually reads as a pretty reasonable intro text.
This one is 160 bucks, and I remember being a student, and it's also very much a reference text, but this is sort of the bible of the field. It has all the heavy hitters in measurement, who contributed chapters to it. It is the authority, what we would cite: if you needed a generic cite for reliability, you would go to Haertel 2006; for a generic cite for validation, you would go to Kane 2006; and those are the first two chapters in this book. Do not buy it unless you are really into this stuff, but have it available for easy reference at your library. So these would be my go-to books for educational measurement, and you should look to them, if something has provoked you, as places to go for citations. So how do we learn measurement? I think this is important to visit too. This very much reflects the philosophy of teaching here too, I think, but I always want to point to my syllabi, and I'm eager to look at other syllabi too, from other folks who teach measurement. If you want other references, they're up on my website: what I think of as where to go for IRT, what I think people should read when it comes to differential item functioning, what I think people should read when it comes to standard setting. So don't forget to look at people's syllabi when you're reference hunting and you want to say, "Someone set a cut score; who should I cite for the whole cut score setting thing?" A syllabus can be very useful for that, in addition to this tome, where it's Hambleton and Pitoniak who did the chapter on standard setting. So my syllabi, and the syllabi of others like Matt, are good places to go for references on these things.
And then again: learn it, use it, learn it again, and use it again. Practice makes perfect. In all of your methods classes, I think you're using data, getting your hands dirty, struggling with those Stata error codes, looking at those manuals, right? Typing help, help, help, then clicking on the PDF. That's what you're going to be doing, and I just want you to recognize that measurement, much like the other methods you're learning, requires that struggle and that patience. So this is a little trick, and I want to give a shout-out to a few people who contributed to my Google Docs. This is something I do to incentivize and encourage out-of-class reading and out-of-class discussion. There are a bunch of tools for this. There's a new thing called Perusall that some of my colleagues at Harvard have developed, or you can just open Google Docs. It looks like this; this is what I asked some of you to contribute to. I said, "Hi, students, this is the typical pre-class discussion that we run." As you see, I ask questions, I have you respond to them by 10 pm last night, and then I reply. I am kind of a night owl, so in between 10 pm and dawn before a class, I give people answers, I have little conversations, and sometimes, for the people who contribute late, like around 9 pm, we get into these discussions online. Actually, I know who it was; I was online with one of you, just going a little bit back and forth about how to write out an equation. So thanks to Josh G. There you go, Josh. I had a little crosstalk with you. I'm not sure if you checked back in and saw what I wrote. But this is a part of the read, write, do; this is just how to stay engaged. And here's an actual doc. So thanks to Josh G., thanks to Fernando (you can see I'm replying in italics here), thanks to Stephanie H.
Stephanie, I missed your comment. OK, I'll reply later, I promise. Karin. So there are a lot of really good ones. Some derivations we did here, Stacey. And then I always leave a space. Cassandra, I didn't get a chance to reply to you. But Stacey and Josh, here there's a general space for general questions and discussion, and you asked good general questions that we can have time to engage with towards the end of this class today. I should say, too: interrupt. But I think it's important sometimes to just talk straight pedagogy: how to learn this stuff and how to stay involved. So read, write, do. So again, there are the seven principles. I'm going to start with the beginning and start straight from validation: we don't validate tests, we validate score uses. I'll talk a little bit about validity theory, and this might seem a little bit detached, and I'm going to get more technical later on. I feel like I'm talking to a bimodal, or trimodal, audience potentially. So I think I might interest some of you sometimes and others of you other times, but all of this is important and belongs to the body of measurement. So I hope you'll remember that even the things that may sound theoretical, and even the things that may sound too technical, are all on a continuum, and think of it all as what we do.
So, validation. This is the more recent depiction of what I think of as the standard for validation in educational measurement in particular. I contrast that with Matt, who teaches more from the psychological measurement paradigm, which I think has a slightly different perspective on validity. But Michael Kane, and the field of educational measurement in particular, is very utilitarian, very instrumentalist. We care about the ultimate use. It's almost atheoretical if you take it to a certain extent: we just don't even care about the numbers as long as the interpretation or use is correct. That's extreme, but it shows you what they're emphasizing here, what we're emphasizing here. To validate an interpretation or use of test scores is to evaluate the plausibility of the claims based on the scores. It is an argument-based approach. You are building an argument with evidence over time. There is never a point where something is valid; it is part of an ongoing evidence-building process. And that is deeply unsatisfying, right? Wouldn't it be great if there were a correlation coefficient, and once it exceeded 0.7 you said, "Check"? This is a super frustrating call to you to never do that. I think it's particularly frustrating for introductory students, even for those who might not have careers in measurement, to hear that they really have to do all this. I guess I'd say: actually, no, you don't really have to do this, but at least you have to know that these are the standards of the field, even if you selectively ignore them.
And again, this is just for measurement. There is my colleague Derek Briggs, who kind of disagrees with this utilitarian, instrumentalist current status quo in measurement, and there are actual debates about this in the field. What is validity? You can write a paper on that and contribute to the discussion of what it means for the use and interpretation of test scores to be valid and appropriate. We broached this this morning when I was talking about the use of this new data set that my colleagues and I have created that allows you to compare school districts across states. We don't just ask, "Is that valid or not?" We ask, "Are the uses of that, are the interpretations of that, valid or not?" And there was some really good feedback from the faculty members and students in the room about which research designs and which research inferences would or would not be appropriate in those situations. It's very similar: what are you trying to do with the scores, and is that appropriate, is that supported by the evidence? So these are different definitions of validity, or schools of thought about validity. I'm not reading all this text, because I'm leaving these slides as a reference, but we are in a very instrumentalist, even utilitarian, moment in educational measurement, where we care ultimately about how you're using those scores. It's not about the test or even the construct; it's about the score use. So again, modern test validation theory is dominated by instrumentalists concerned with test uses and interpretations, and I'm acknowledging that this can be frustrating, because it takes the control away from your special little instrument and its ultimate scores and places it in this fuzzy domain where people pick them up and use them, and you might kind of be responsible for that.
So I think of validity, and as I tweeted before, I'm not ashamed to use mnemonics, in terms of five sources of validity evidence, and I call them the five Cs. The first is content. First, take the test: what is it measuring? There was a good overview of the alignment of four big testing enterprises to the Common Core recently. Morgan Polikoff and Nancy Doorey published this piece with Fordham earlier this year, which is basically a content study. Do PARCC and Smarter Balanced, these big testing consortia, as well as MCAS and ACT Aspire, the Massachusetts state test and ACT, align to the Common Core State Standards? This is a content study, and I think there are too few people, frankly, delving into this arena, which is currently dominated, I think, by more model-based, statistically based approaches. So I'm just reiterating that content is seriously important. Cognition is another source of evidence. When you take that scale, are you thinking what the designer intended you to be thinking, as I'm thinking through this math test, as I'm thinking about whether or not I'm gritty? Think about the studies of grit recently that have been concerned about reference bias. That is to say: do I feel gritty in a way you can compare across schools, or am I referencing my grit to the people who happen to be in this school or in this classroom? So how are people thinking about it cognitively? The evidence we can get often comes from think-aloud protocols and related analyses. Coherence is where the field seems sort of stuck with validity, and a lot of what I'm going to talk about subsequently is going to be in this third C. This is where reliability analyses come up: EFA, CFA, IRT;
this is what Matt teaches, as well as me. This is, I think, what people assume measurement is from a technical standpoint, and what I'm highlighting here is that it's only one C. You've got to think about content, you've got to think about cognition, and sure, you can do your reliability analyses, but that's only a piece of the puzzle. Another piece of the puzzle is correlation. This comes up a lot in structural equation modeling, and it comes up a lot in economics too, where you're trying to predict future outcomes. Does this predict college attendance, graduation, college entry, freshman GPA, or other future outcomes? Or, more concurrently, does this correlate or not correlate with things that should be similar and things that should be different? You sometimes hear this called convergent or discriminant validity. But this is, again, only a piece of the puzzle. And the fifth C is consequences: evidence based on the consequences of testing. You could think about this as a counterfactual: had I not undertaken this measurement enterprise at all, what would have been the difference? So it doesn't concern the scores as much as the use of the scores, and whether the act of testing and measuring itself has had some consequence. This is a fairly controversial, relatively recent addition to the validity framework, but these five sources of validity evidence are clearly articulated in the Standards, and they are what you should think of, when you're designing a measure and when you're using a measure, as the kinds of evidence you can leverage. So this is in contrast with what I think people are commonly thinking of as validation: I developed a scale with good theory, I fit a CFA
and got a good fit index, my reliability is greater than some cutoff, and my scores predict desirable outcomes, so I have a valid, reliable measure. That's the common articulation of a good baseline study, and I'm setting that up deliberately. So that's content, that's coherence, that's coherence too, this is correlation, and that's incomplete: you're missing cognition, you're missing consequence, you're missing this argument for use. What are your scores? How are we using them? What would have happened had you not measured? These are other questions you could ask to complete this validity framework. So it's more than just a good fit index and good item parameter estimates. So again, seven key principles. We don't validate tests, we validate score uses: that's what I was covering. I want to emphasize content a little bit and then dig into a little bit of classical test theory, and I think that'll probably take us to the break or thereabouts. And then we'll get into reliability and IRT shortly after. So, yes, sure.
Yeah, talking about consequences, how should that be in the context of, what do you mean, like if a student had never been tested, then what other measure would look at underlying ability? I think what we're getting at is that the score ultimately is used for something, right? So once we test, what's the theory of that making a difference in some way? It could be publishing an article and having that feed back into the system; it can be very abstract in that way. It could also be that the teacher is going to use it to give you feedback, and is that feedback going to have a positive or negative impact on you? Or it's going to lead to a value-added estimate for a teacher, and they're going to respond differently and teach differently. So it's like: had that not happened, the whole process, not just the score but the use of the score in this theory of action, what would be the difference? I think that's a pretty gold-standard level. I mean, we're talking about a major evaluation at that point, which is why this is a controversial source of validity evidence, because, good luck, and how long do you wait for long-term outcomes? But this. I mean from an economic You can be because of the kind of catch all the time you know you have people. Like how can you think of them.
So again, in all the ways that I think you're trained to, right, as economists. I wasn't being glib when I was saying this is why we're glad we have people like you, because I think you are asking what the counterfactual is. If we didn't have high-stakes, test-based accountability, we'd have some sort of paper by some guy named Brian Jacob and Condi or something, thinking about what would have happened had there not been this rise in accountability at this particular time. So these are the kinds of evaluations I mean. I'm not solely putting this in economics, but that said, my encouragement to you is to never just think of a test as something that's validated up in the air, but as something that results in a score that is used for a purpose. If that purpose is for you to publish and get some correlation coefficient and get in a journal, then great, that's part of your theory of action, and that's pretty light. But ultimately I say, "Why are you doing this?" That's where I'm pushing people to go: ultimately your scores are used by people for something. Can you describe that to me, please? That's what I find myself asking most students; that's what's missing. When they say, "I want to create a measure of X," I say, "Sure, but why? Those scores, what are we going to do with them? What's going to happen?" That's often what I find missing in their thought process. Thank you. I'm trying to figure out correlation. You said evidence based on relations to other variables, and so I'm wondering if by that you mean I would validate one standardized test by its relationship to student scores on a similar kind of test with a similar kind of content, or are you meaning things much broader than that, like
its chance to, and how it predicts, things like high school graduation or going to college. And so how would I know those kinds of things beforehand? If I'm using these as a foundation for a measure I'm developing, and I haven't given it yet, how do I have evidence on this? So this is why Cronbach and all the people who have developed validity theory over time have been very clear that it is an ongoing process. Again, this is where psychometricians struggle with dealing with the outside world, because the outside world is like, "Show me your valid measure," and you're like, "But this is a process that takes a while," and they say, "Show me your valid measure." So it can be frustrating, but this is how the field thinks about it. I think you have to wear different hats when you're talking to people who have their own definition of validity, and just say, "This checks all the technical boxes." And you do want at least some correlations with concurrent variables in some way. But look at the MCAS technical report, or look at the technical report here for your state test, what is it now, and you'll see that all of these are laid out in there in varying degrees of depth. Usually coherence is a massive section, with classical test theory, IRT, differential item functioning, and the like. Correlation is small. Consequences is a paragraph. Cognition is like "we did a lab." And content is very, very fleshed out, with content frameworks and the like. This is why I explicitly walk through a technical manual. When you finish my class, you should be able to read a technical manual for a state testing program whose data you're going to use and figure out what implications it has for your own analysis. Yes. That's a good model to check. That's. What are some of. The valid for the test but for. What are some of the kind of.
Thing and I'm wondering when you were talking about federalism focus you seem the one to see complex necessary if it was really going to meet that. Goal or. Cause. Geared up on care I don't care what I don't hear the reliability of its core how well he learned in college now that. You're going to be anything other than for. This is a good question. This is where the economists come in, and we probably should, over dinner or over drinks at some other point, have a detailed argument or debate about why these things should matter. From a very utilitarian standpoint, in the near term, before you get those long-term outcomes, if you're developing your own measure, you need to stand on something before you've got those long-term outcomes. And also, I don't know, you might happen to find some spurious correlation with something. You are interpreting, when you interpret an ACT score, that there is some sort of college readiness. And when you say it correlates 0.3, well, socioeconomic status correlates 0.3, and you don't say someone is college ready based on socioeconomic status, right? So the interpretations we use matter; that is the psychometric argument. You have to be specific about that interpretation and about what the warrant is for that interpretation, and if it's only based on socioeconomic status, the warrant seems detached from the human. So I think this is a deeper philosophical argument you're raising that I don't think should be. But I think it's a good one, and certainly one that my students have advocated for, and it's certainly econ-leaning. But what I often fight with is: why do we care about freshman GPA? I mean, that's a horrible measure. I kind of want freshman GPA
to predict my high school test, because that's a better measure because of the content. The directionality, I mean. So it does arise, I think, from the items and the content; that is the psychometric perspective. But so, on to a little bit of classical test theory and the tools that we use to evaluate, in particular, coherence, and content. This is my checklist for how to get into a secondary analysis of test score data. You get a Stata .dta file, and it's got people in rows, and there are all these columns that correspond to items. I'm going to skip around; it's going to go 1, 2, 3, 7, 8 or something like that, but this is part of a larger checklist, and again, this is from John Willett's presentation as well. Know your items, right? Read each one. Take the test. Get a sense of what it's trying to measure. So this is an example from a measure of self-perception of teaching success: you have high standards of teacher performance; you're continually learning on the job; you're successful in educating your students; it's a waste of time to do your best as a teacher (this one has negative polarity); you look forward to working at your school; how much of the time are you satisfied with your job? My advice to you is never go into an analysis without actually looking at the items and scoring the test, thinking of yourself as a subject. Then you have all these scale items on a one-to-six scale, and you see here that someone snuck in a one-to-four item. This happens from time to time, so do not get caught unawares. Do not type in alpha without recognizing that some of your variables have different item scales than others, because it will give you incorrect answers. So take control of your scale and know it backwards and forwards. And again, in the interest of time, I'm
going to jump through this. Always know the scale of your items. Know how to score your test: how is it actually being scored? Is it a sum score? Is it an average? Are you reversing the polarity of some of your items? Are you stretching the scales of some of them so they all go from 0 to 100? How are you actually scoring it? If you look here, what I recommend that you do when you're actually going through this is reverse it yourself. Take control in Stata and reverse code it so that the items are all pointing in the same direction, because otherwise I have found myself making mistakes. This is some very practical advice for you to not slip up in the early stages of an analysis. So again, look at your data, get a sense of the missingness, label your items, and make absolutely sure your item scales are oriented in the same direction or that you're using code that recognizes when they're not. Positive should mean something similar across items; if not, fix it. Here's more exploring. I mandate that people always give me discrete histograms for item scales. I want to know how many ones there are, how many twos, how many threes, fours, fives, and sixes. If you've got a seven-point Likert scale and no one is ever picking six or seven, I expect you to know that from the very beginning. Don't start running IRT until you have a sense of what your data actually look like. This is important as well: does a one mean one at all times? Is it always "strongly disagree" when you have a scale that goes one to four? If I have "strongly disagree" to "strongly agree," and then I have "not successful" to "very successful," and this is one to six and this is one to four, and I throw that into alpha, into a reliability analysis, what is it going to do? It's going to assume that "very successful" means "slightly agree." Does that make sense?
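As a rough illustration of the housekeeping described above, here is a minimal Python sketch (not the Stata workflow from the talk) that reverse codes a negatively worded item, stretches a one-to-four item onto a one-to-six range, and counts how often each scale point was chosen. The function names and the response data are hypothetical, made up only for this example.

```python
from collections import Counter


def reverse_code(value, low, high):
    """Flip a response on a low..high scale, so 1 becomes 6 on a 1-6 scale."""
    return low + high - value


def stretch(value, old_low, old_high, new_low, new_high):
    """Linearly map a response from one scale range onto another,
    keeping the scale points equally spaced."""
    fraction = (value - old_low) / (old_high - old_low)
    return new_low + fraction * (new_high - new_low)


def frequency_table(responses, low, high):
    """Count how often each scale point was chosen, including unused points."""
    counts = Counter(responses)
    return {point: counts.get(point, 0) for point in range(low, high + 1)}


# Hypothetical responses from eight teachers (invented for illustration).
waste_of_time = [1, 2, 1, 1, 3, 2, 1, 2]  # negatively worded item, 1-6 scale
successful = [4, 3, 4, 4, 2, 3, 4, 3]     # item recorded on a 1-4 scale

# Discrete "histogram": shows that nobody chose 4, 5, or 6 on this item.
print(frequency_table(waste_of_time, 1, 6))

# Point the negatively worded item in the same direction as the others.
waste_reversed = [reverse_code(v, 1, 6) for v in waste_of_time]
print(waste_reversed)

# Put the 1-4 item on the same 1-6 range as the rest of the scale.
successful_1to6 = [stretch(v, 1, 4, 1, 6) for v in successful]
print([round(v, 2) for v in successful_1to6])
```

Whether a linear stretch is the right choice depends on what the response labels mean, which is the judgment call the speaker is asking for; the sketch only makes that choice explicit instead of leaving it to the software.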
It could make sense. You'd better think about it and make a decision. The idea here is that in a classical analysis, all of these item scales are taken at face value: it thinks of ones as ones and sixes as sixes, so you'd better take control of that and make sure that that's right. Often what that entails is one of two things. One is stretching this one-to-four onto a one-to-six: forcing this to be one, forcing this to be six, and equally spacing the points in between, so that you're saying "not successful" is like "strongly disagree" and "very successful" is like "strongly agree." One of the big mistakes I see people making when they get the scale as a secondary data analyst is assuming that all items are interchangeable and that the polarity doesn't matter. You should take control over that. Another way to approach it is to standardize within each item. What you're doing is dividing by the standard deviation of each item, and in that case you're saying that "strongly disagree" here and "strongly disagree" there might not mean the same thing, depending on the variance of each of those item distributions. And that's weird too. When your Likert scale items are all "strongly disagree" to "strongly agree," do not standardize, because "strongly disagree" means the same thing across those items, and if you standardize you lose that information. Similarly, if you have an educational test that has correct or not correct, should you standardize? Absolutely not. Correct is correct and means the same thing, so do not standardize in those cases either. These are the little things that seem trivial, and I find in my own work and in my own students' analyses that if you're not running through them, you're coming up with absolutely incorrect alpha values, even in just the baseline descriptive statistics, let alone getting to IRT
or structural equation modeling or the like. So you've got to take control of your data from the very beginning and be very, very careful and intentional about every single step. That's general advice for statistics, period, but I'm saying it still applies to measurement. OK. So this is a baseline reliability analysis. Check this out: alpha x1-x6, asis. That should be your template. The item option gives you all these item-level statistics, and then there's asis. I have this sneaking suspicion that leaving asis off is leading to inflation of reliability coefficients among Stata users, and perhaps users of other programs as well. What asis does is take the direction of the item scale as given: if you coded it as positive, it treats it as positive. If you don't include asis, there could be a really bad item in your scale that correlates negatively with all the other items, and Stata will flip it for you without telling you. Well, it will show up here in the output, but you might not notice it. Which is to say, you've got such a bad item that Stata says it can't possibly be that bad and reverses it for you, and that's crazy to me that they do that. So my suspicion is that a lot of beginning analysts are dramatically overinterpreting their simple alpha, their simple reliability value, because they're assuming the software knows best. But anyway, this should be my default code, to make sure that you're controlling it appropriately. Be intentional at every step of your analysis, and know what the direction is and know what the scale points are. OK. So I'm just going to hand-wave through this, but these are
various discrimination statistics. They basically ask: does this item correspond to the sum of the other items on the scale? Does this item correlate with the other items? This is the coherence question. This is an internal correlation: does this item correlate with other items on the scale, which is really what is at the heart of classical test theory, IRT, structural equation modeling, factor analysis, and the like. This is an example of a little bit more pseudocode from Stata for you. How many people don't use Stata? And you're using Mplus? This is why we include a whole bunch of do-files. I've sent Brian a couple, and I can send more; I'm happy to give you templates for this too. We'll talk more about that. the simplest of the good cos it will test every kind of descriptive stats. To the you know like OK you know. Anyway, those programs presume that you've done all that already, so do all that first, as I'm recommending, and make sure you have control over your scale. So again, coming in with content is king, in the sense of: know your items, know your scale, get a sense of what it's trying to measure, and don't just validate it based on whether or not it predicts lifetime earnings next. But if it were the debate. What exactly were they. Looking at like that. In the sense not in the sense of like I mean you want to read a book on the question because I want to get. More. With. Like I mentioned. Some of that question but maybe. I can see Mollenhauer. All.
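To make the baseline analysis concrete, here is a small Python sketch of Cronbach's alpha and the item-rest correlations discussed above, computed with items taken as coded (the spirit of Stata's asis option). It is not the lecturer's do-file; the data are hypothetical, with the fourth item deliberately left negatively keyed.

```python
import statistics

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def cronbach_alpha(items):
    """items: one list of scores per item, respondents in the same order."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    item_variance = sum(statistics.variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance / statistics.variance(totals))

def item_rest_correlations(items):
    """Correlate each item with the sum of all the other items."""
    rows = list(zip(*items))
    return [pearson(col, [sum(row) - row[j] for row in rows])
            for j, col in enumerate(items)]

# Hypothetical 1-6 responses from six people; x4 is worded in the opposite direction.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 2, 4, 4, 6, 5]
x3 = [1, 3, 3, 5, 5, 6]
x4 = [6, 5, 4, 2, 2, 1]

as_coded = [x1, x2, x3, x4]
reversed_by_hand = [x1, x2, x3, [7 - v for v in x4]]

print(cronbach_alpha(as_coded), item_rest_correlations(as_coded))
print(cronbach_alpha(reversed_by_hand))
```

Taken as coded, the mis-keyed item drags alpha down and shows a negative item-rest correlation, which is exactly the signal an automatic flip would hide. Reversing it deliberately, after checking the wording, is the analyst's decision to make.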
Right, so this is a subscale question. This comes up all the time. Alpha is a property of a scale, right? If you want to create subscales, get information about each of your subscales; that's what alpha should be for. If you throw an alpha across all of the items across subscales, it's asking how coherent this is across subscales. So the question I always ask people who are using subscales is: what's the question? How are you using your scores? If you take a cynical approach, at heart it's always like this: if you give policymakers 2 numbers, they'll add them together. So take the Grit Scale, the Angela Duckworth and Patrick Quinn 8-item Grit Scale. There are 2 subscores. What do we think people are doing with them? Adding them together. So my question, and what your question should be, is: what is the property of the score that is being used? This is the utilitarian, instrumentalist view. If you are creating a scale where people are using the subscales, then evaluate each of them accordingly: take alphas for each of those subscales and report alphas for each of the subscales. I'll show you how Angela and Patrick do this shortly, in their actual paper. Which is just to say: it's good to have subscales, but then what I would do is run alpha and CTT analyses on the subscales, and later we'll talk about confirmatory factor analysis and all that jazz, or actually, that's what his class is good at in particular. So let's go. This is a paper that I have everyone in my class dig deeply into. This is Angela Duckworth and Patrick Quinn's Journal of Personality Assessment paper in 2009 that
I was talking with Matt about. It is a very common practice to develop a scale that has, not way too many items, but a lot of items, and you might want to think about how to administer them feasibly in a flexible situation. You can use classical test theory and item response theory; they're both very, very good at figuring out how to shorten a scale, how to preserve information while reducing the number of items. I just gave you advice and I'm trying to follow it, so this is a brief description of the Grit Scale. I actually have my students take this so we can analyze their data. New ideas and projects sometimes distract me; setbacks don't discourage me; I've been obsessed with a certain idea but later lost interest; I am a hard worker; I often set a goal but later choose to pursue a different one. I'm shortening them a little bit. This is to give you a sense of how grit is operationalized. So this is their 8-item scale. In this paper they're sort of saying: we had a 12-item scale, we're going to 8, it will all be fine, don't worry about it. So pardon my screenshots here. See Table 1 for item-level correlations. After excluding two items from each subscale (they're talking subscales here, right), the resulting 8-item Grit Scale displayed acceptable internal consistency, that's code for alpha, with alphas ranging from 0.73 into the 0.8s. Look at their Table 2. Again, we spend a lot of time digging into these articles in class. So this is West Point, famously, her National Spelling Bee sample, Ivy League undergraduates, and these are Cronbach's alpha values. These are the values I was describing. This one is the total scale; that's the reliability coefficient
for the overall scale. Then she breaks it down into perseverance of effort and consistency of interest. The question I would ask in this case is, again, what's being used? If you're treating these separately, you can see what their alpha values are, and if you're treating them as a whole, that's there too. So you can cover your use cases here and say, for those purposes, here is your level of internal consistency. Does that make sense? Absolutely. And so this is why classical test theory is your descriptive statistics, your knee-jerk first reaction, and after that we're going to get to a more powerful framework that allows you to answer questions like the ones you're asking. This is what I consider level one, this is like summarize, and I really do mean that it's the very first step. After that you get to more sophisticated questions. OK, so by the way, one of my Google Doc questions is always kind of an annoying guess-what-I'm-thinking question, but it's: does anything look off to you about this? This is a tough question, so I'm just going to pause and have you take a look at this table, in particular these alphas compared to these alphas. So this is four items, four items, and 8 items. I just want you to take a look at that and get your gut reactions as to what I find a little surprising. There's a bit of it that. I have a plant in the audience. Try and. There can be a couple of answers here, so don't be shy. Yeah. For example. For example the man. Who does 0.73 for an 8-item scale? That's wacko.
Right, and so I'm not sure if they're correcting for that and didn't mention it, or if there's something weird going on in the subscale relationships, but that is not what you expect. What you expect when you have many more items is higher reliability; in fact, we're going to show you a prophecy formula that predicts this. When you have more items, it's the same way that when you average over more things you have the standard deviation over root n as your precision: the more you average over, the more precision you have. So that is a little surprising, and that's a kind of disciplined perception that you'll develop with measurement. Cause that a lot but Joining me to go from this. Which is that much that. You were to be purely So this is one way, right? We're going to develop even better ways with IRT. But this is just a ranking of how each item correlates. This is the item-rest correlation: it is literally the Pearson correlation, a simple vanilla correlation, between an item in one column and the sum score of all the other items in the other column. So this is a very descriptive statistic, again a knee-jerk, summarize-level descriptive statistic, of how well this item coheres with the rest of the scale. We're going to see a better version of this when we get to IRT, but they very rarely tell dramatically different stories. This is why, again, we start with our feet on the ground with a basic analysis and then get advanced with IRT. So, to answer your question: if you were to be purely cynical about it and didn't care at all about content, you'd drop maybe one and 3, rerun the item-rest
correlations, maybe drop a couple more if you felt like it, and then calculate alpha for whatever's remaining. I don't recommend you do that, because content is king. Be careful of throwing away a subscale you care about. Imagine an educational test where suddenly you're not measuring math or something, right? You can imagine that would be dangerous. But from a purely statistical standpoint, that's what you could pull off. So, good. I'm going to jump. Do I have this? Yes. I'm going to skip two slides on classical test theory; they're there for you. There's a bunch of equations there; I'm putting that in as stuff for future reference. What I want to do is talk a little bit about why classical test theory is a theory, what it predicts, and why it's useful. So what does classical test theory actually predict, and what can we infer from it? First: increased variation increases reliability. This is akin to the logic you were using; I might flip it on its head. If you ask what the reliability of Grade 3, Grade 4, and Grade 5 scores is together, you're going to get something like 0.85, right? As you increase the variance, in the same way that we know from correlations, period (reliabilities are just correlations; I forgot to ask what reliability is, we'll get to that again), just like any other correlation, as you increase the variance you stretch out the scatterplot, right? You increase the correlation. Is there a whiteboard in here? Later is that. So I know that. Because I don't work wow OK that the mind OK I'll get the I'll get that shortly. Go see Thank you. So this is a rederivation of reliability. If you square both sides, put error, put X
under error: that's the proportion of error variance, and then one minus that is the proportion of true score variance. So that's reliability. If you do a little bit of algebra here you get this expression, so you can derive the standard error of measurement in terms of the observed standard deviation and the reliability. And as you increase that standard deviation, you're going to increase your reliability, in the same way that I'm going to draw right now. So thank you. So let's think about this. This is Grade 3 X and Grade 3 X prime, or something like that. Let's imagine these are replications of a measurement procedure for Grade 3. In any case, you have some correlation here, and then that's like Grade 4, and then you increase the scale and have Grade 5 here and Grade 6 here as you keep going up the scale, with Grade 3 here. If you look at this scatterplot within a grade, that correlates around 0.6 or so. But as you can see, as you keep caterpillaring this out (it's like a caterpillar, I know it's not the greatest picture), the idea is that now, hey, this correlation looks more like 0.8. So the greater the variation you have, the more reliability you'll have. So I'll say this: one of my students is doing a pretty neat project with Dana McCoy. She's using Google Street View to rate schools, the perception of school quality from what you can tell in Google Street View. And she made a mistake, upon reflection, when thinking about this prediction, of taking schools that were too similar to each other. It was like, let's take a bunch of schools that are too similar in quality and then look at inter-rater reliability and item reliability across those schools. Upon reflection, what she should have done in scale
development is to make sure that the variation, very deliberately, was reflective of the variation in the population, so that she can get a reliability that corresponds to that. That said, classical test theory does give you a tool for correcting for the variance in the sample you have versus the variance in the population you ultimately care about. So this is your general expression for how changes in variation will change the ultimate reliability, and I'm just, again, putting this here as a reference. So that's, again, a classic thing you should know about correlation: as the true variance increases in the population, the correlation will increase. Yeah? We're going to think this is. A distraction the root of this is kind of a random subset of the population. Were battle weary but it's been a gamble and I think I've seen some educated. By. Looking at it said only that the size and the program. The college we're looking at the. Trial Court but the. College or inappropriate is. The whole the. Whole array of the giving some of the work to lay. There really but the. Selection. Of the so I should have, had this been a 12-week course in measurement, I would have made sure to hammer that home repeatedly. The classic example is the correlation between, say, ACT scores and freshman GPA
at the University of Michigan, right? That tells you what it tells you. But if you're interested in what the correlation would have been had everyone been admitted, it would have been larger, but you can't tell for the reasons that Brian suggested. I should add here my general advice if you ever were to undertake this, because if I were a reviewer I would ding you if you didn't follow it: report both. Report both the initial correlation and the corrected one, the disattenuated correlation, and state your assumptions clearly. But never just say "and here's my disattenuated correlation." I actually reported disattenuated correlations in my presentation today, but in the paper we report both, so I'm trying to follow my own advice. Similarly, that advice is going to hold here as well. If we're ever going to talk about standard deviations: observed standard deviations are inflated due to measurement error, right? You can think of this as my miming of the normal distribution again. As you decrease your reliability, your distribution sort of blurs out until it just becomes this blob, and as you increase your reliability your standard deviation gets tighter and tighter. So we know that observed standard deviations are inflated due to measurement error, because reliability is, again, the proportion of observed score variance accounted for by true score variance. And so correlations between 2 observed variables, X and Y, will be attenuated by measurement error in both variables. That's just a side note. And so there is a general formula for the correction of correlations due to measurement error. What we do is divide by the square root of the reliabilities: if there's error in X
and error in Y, we divide by the square root of the reliability of one and the square root of the reliability of the other, and this inflates the correlation. I hate this correction, and I use it all the time. What you're trying to say is: had these variables been measured without measurement error, here is what their correlation would have been. This is what structural equation models, like the ones Matt is doing, do behind the scenes for you. They're actually estimating the measurement error in each of the variables and reporting that disattenuated correlation for you. So this is a way of doing that mechanically, in the classical test theory framework. My advice here holds too: if you're going to do this, report the initial correlation and then report the disattenuated correlation, because what you're doing here, in a very not-so-subtle way, is taking advantage of measurement error. The more imprecision I have, the greater I inflate my test scores, I mean, the greater I inflate my correlation coefficients. Sometimes you get corrected correlations that are greater than one. This happens, and then you know you've done something wrong. I mean, that just reveals how silly the whole process is, right? You're giving yourself a lot of credit for imprecision, for measurement error.
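The correction described above can be written out in a few lines. This is a minimal Python sketch of the classical disattenuation formula (observed correlation divided by the square root of the product of the two reliabilities); the function names and the input values are hypothetical, and the helper returns both numbers so that neither is reported alone.

```python
import math

def disattenuate(r_xy, reliability_x, reliability_y):
    """Classical correction for attenuation due to measurement error."""
    return r_xy / math.sqrt(reliability_x * reliability_y)

def report_both(r_xy, reliability_x, reliability_y):
    """Return the observed and corrected correlations together, with a sanity flag."""
    corrected = disattenuate(r_xy, reliability_x, reliability_y)
    return {
        "observed": r_xy,
        "disattenuated": corrected,
        "exceeds_one": corrected > 1.0,  # a warning sign, not a result to report
    }

# Hypothetical inputs: a modest correlation between two fairly noisy measures.
print(report_both(0.42, 0.70, 0.60))

# Hypothetical inputs where low reliabilities push the correction past 1.
print(report_both(0.60, 0.50, 0.60))
```

The second call shows the problem raised in the lecture: the lower the reliabilities, the larger the boost, to the point where the corrected value is no longer a possible correlation.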
But that's something we should take away too. And then regression to the mean: too much to talk about here, I'll punt this to later. And then finally, this is the Spearman-Brown formula that would lead you to be suspicious of that table I showed you in Angela and Patrick's paper. Not suspicious in the sense that I think they did something wrong, but I have questions about it. And that is that as you increase the number of items on your test, you get greater reliability. So if you are ever in this position of doing massive scale development and have, like, 200 items, do not pat yourself on the back for having a high reliability. You have hundreds of items; of course you do. The average of that is going to be very, very stable with respect to measurement error. That's why, when I report reliabilities, I always also report the number of items, because you condition your interpretation of the reliability on the number of items that you've got. And so this is just an example: if the reliability is 0.1 and we double the test length, what is the predicted reliability? K would be 2 in this case. In the same way, given any test length and reliability, you could estimate the reliability of a single-item test by plugging in K as one over the number of items. So if you ever really want to take a gamble, and people do this, right, with single questions like "Would you recommend this to a friend?"
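For reference, the Spearman-Brown prediction for a test whose length is multiplied by K is K times the reliability, divided by one plus (K minus 1) times the reliability. The Python sketch below applies it to the doubling example just mentioned and to the single-item case. The 8-item and 4-item inputs are hypothetical, and the formula assumes that added or removed items behave like the existing ones.

```python
def spearman_brown(reliability, k):
    """Predicted reliability when test length is multiplied by k (k may be a fraction)."""
    return k * reliability / (1 + (k - 1) * reliability)

# The spoken example: reliability 0.1, test length doubled (k = 2).
print(spearman_brown(0.1, 2))

# Hypothetical 8-item scale with reliability 0.8, shrunk to a single item (k = 1/8).
print(spearman_brown(0.8, 1 / 8))

# Hypothetical 4-item scale with reliability 0.70, lengthened to 8 comparable items.
print(spearman_brown(0.70, 2))
```

Because the prediction depends so strongly on K, reporting a reliability without the number of items leaves out half of what a reader needs to interpret it.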
That's what's called the Net Promoter Score. The Net Promoter Score is supposed to be this one true item that tells you whether or not your product is going to do well in the business sense. It's a single-item scale, right? So anyway, if you ever want to figure out what the reliability of the one-item test would be, just plug in K as one over your number of items. These are all super handy formulas that I would expect you to have in your back pocket, the way you have a standard deviation, the way you have a correlation coefficient. These are the basics. So I'm going to skip Cronbach's alpha. OK. So what is reliability? Say I said the reliability is 0.8 and you're trying to explain to your uncle what 0.8 means. What do you say? You can talk generally about reliability as some sort of measure of precision, and that's good, but what is 0.8? What does that mean, actually? It's a hard question to. Me because the good news is. Good, good, good. So that's right: it's the coherence of the overall measure, and it's on a 0-to-1 scale. But if you want to get very specific and actually address the magnitude itself, what is 0.8 in that case? I did my usual motormouth routine and said it a couple of times, but very quickly and without pausing. Good, good. That's a good rule of thumb. Psychometricians cringe at rules of thumb, but that said, it's one that I don't mind cosigning for general purposes. All the more reason to know what 0.7 means, right? So you're talking about a signal-to-noise ratio, where you're talking about the true score variance over the error variance. You're close, and it's just a conversion, but you've got true score variance in the numerator, that's good. What's in the denominator?
It's observed score variance in the denominator, the total variance: how much of the variance that you see is accounted for by that signal? You can get to that from the signal-to-noise ratio. So when you see 0.8, you're saying 80 percent of the observed score variance is accounted for by true score variance. That's not the only way to think about the reliability question. You can also frame it in just the way we think of an intraclass correlation: as a correlation in itself. And it's a correlation, in this case, of 2 replications of the measurement procedure. It's a correlation of X and X prime; that's actually why you write it rho X X prime. It's a correlation between X and what we imagine a replication of X to be, which is equivalent to the proportion of the observed score variance accounted for by true score variance. So it's bilingual, in the same way that you can think of an intraclass correlation as both a correlation and a measure of between-group variance. In the same way, reliability is both the proportion of observed score variance accounted for by true score variance and the correlation between 2 replications of a measurement procedure. The motto that I like is: a person with one watch knows what time it is; a person with 2 watches is never quite sure. And that's kind of what psychometrics is all about. We're saying that we always want to know exactly how imprecise we are: to be precise about our imprecision.
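The two readings of reliability above can be checked against each other with a small simulation. The Python sketch below is an illustration under simple classical-test-theory assumptions (normally distributed true scores, independent errors with equal variance on both replications); every name and parameter value is hypothetical. It also shows the earlier point that widening the true score spread raises reliability, and computes the standard error of measurement from the observed standard deviation and the reliability.

```python
import random

def covariance(x, y):
    """Population covariance; covariance(x, x) is the variance of x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def simulate_reliability(sd_true, sd_error, n=20000, seed=0):
    """Return (correlation of X with X', share of observed variance that is true)."""
    rng = random.Random(seed)
    true = [rng.gauss(0, sd_true) for _ in range(n)]
    x = [t + rng.gauss(0, sd_error) for t in true]        # first replication
    x_prime = [t + rng.gauss(0, sd_error) for t in true]  # second replication
    corr = covariance(x, x_prime) / (covariance(x, x) * covariance(x_prime, x_prime)) ** 0.5
    share = covariance(true, true) / covariance(x, x)
    return corr, share

def standard_error_of_measurement(sd_observed, reliability):
    """SEM = observed SD times the square root of (1 - reliability)."""
    return sd_observed * (1 - reliability) ** 0.5

# Same error SD, narrow versus wide spread of true scores.
for sd_true in (1.0, 2.0):
    corr, share = simulate_reliability(sd_true, sd_error=1.0)
    print(sd_true, round(corr, 2), round(share, 2))
```

With an error standard deviation of 1, the theoretical reliabilities are 0.5 for a true score standard deviation of 1 and 0.8 for a true score standard deviation of 2, so both simulated numbers should land near those values, and the two definitions should agree with each other up to sampling noise.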
OK, so how do we estimate this in practice? Here are 3 types of reliability. The first is the gold standard: parallel forms reliability. We actually try to replicate the whole measurement procedure twice. We create 2 different equivalent forms. Imagine this urn, this magical urn full of items like marbles: take a scoop of the items and create one form, take another random scoop of the items and create another form. Then we give one to you now, and we give the other to you in some separate room on some separate day with some separate raters. We try to vary all the things that we care about varying and give that in a different scenario, and then we simply take the correlation of X and X prime. That's parallel forms reliability. Another way we approach it is test-retest reliability. What that does not capture is the variance due to the items, because if I test you and then retest you, I haven't drawn again from this urn of items. So you want to think about all these urns, of items, of raters, of occasions, of tasks, and think about all of those as contributing to your sources of variance. Third, and this is the weakest form, the one you usually get the highest reliability from, is internal consistency reliability, which treats all the stuff that's going on in this room right now as fixed and only considers the variance of items within the little test that you happen to have. It says: hey, instead of drawing again from this urn of items, I recognize that I've already drawn from the urn. I can split the test items randomly in half, take correlations of those halves, and think about how that is an estimate of reliability. That's how internal consistency reliability works. So again, I hope you're sort of bilingual. In the order of the light.
consistency reliability; 10 percent of the time it's some weird approach using IRT that I'll talk about shortly. And actually, Shawn and I, in our 2015 paper (I meant to have this here, I might have cut the slide), show the histogram of all reported state test reliability coefficients that you see in practice, just to give you a sense. All of them are over 0.7 in this case; they're centered on 0.9 with a slight negative skew. All. Together for. A purpose. And then averages of that. Here is what I skipped over. Cronbach's alpha is exactly that, and you can prove that it is the average of all possible split halves. You split in half every single possible way you can, and you take the correlation over and over and over again. Now, this is where you combine Cronbach's alpha and Spearman-Brown: when you split in half, you've taken half tests, so you have successfully described, on average, the reliability of a half test, and then you use Spearman-Brown to ramp that up to the full test. So it's a nice, neat little exercise. But you've got the intuition exactly. So this is the last thing I'll say, and then we'll take, I think, a 5-minute break. This is where your reliability is not the reliability, right? And this is the point that I think Brian was leading to: you should think of the reliability coefficients that you get in your technical manuals and all of your state tests as being an impoverished version of the reliability you might imagine, if it's trying to answer the question of how well this X correlates with this possible X
prime. Just varying items doesn't cut it, right? If you were to vary occasions, if you were to vary scenarios, if you were to vary raters, if you were to vary all these other things that we might actually be interested in generalizing over, then that reliability would almost assuredly be lower. So that's worth thinking about as you are adjusting for reliability: what exactly are the replications over which I'm interested in generalizing? That leads to yet another theory, called generalizability theory, which was developed by Lee Cronbach and many others decades ago. Bob Brennan has done the most work on this as of late. It's something you should know about that I'm not going to dig too far into right now, but I'll give you a couple of key references. There's Brennan in 2002, and there's a great primer by my former 2nd advisor Rich Shavelson and Noreen Webb in 1991. It's a nice little Sage primer, and it's kind of depressing that it hasn't really gone out of date since 1991, but this is basically just analysis of variance; it's pretty straightforward. And it answers a couple of questions. Tom Kane and I did a paper on this that I present in my class, about teacher observations: how many raters do you need, how many lessons do you need, how many items do you need to get sufficiently precise estimates of teacher observation scores? And are, for example, administrator raters different from peer raters? These are the kinds of questions generalizability theory is
really well primed to answer. This is my colleague Heather Hill at Harvard, who wrote a great article in Educational Researcher. The title, the part before the colon, was "When Rater Reliability Is Not Enough", which is to say that oftentimes we think: we've got a bunch of raters, let's just see how well they match with master coders. Not enough, and I totally agree with her. So I think, as you're developing, if you ever have a scale that depends on raters, you should definitely start with rater accuracy and then move quickly to generalizability theory if you can, and leverage the sources here. Generalizability studies are expensive, but they also are due diligence when it comes to real reliability, right? The reliability you have is not the reliability you seek.
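To give a feel for the "how many raters do you need" question, here is a minimal Python sketch of the simplest generalizability study: every person scored by every rater, variance components estimated from the two-way analysis-of-variance mean squares, and a relative generalizability coefficient projected for different numbers of raters. This is a one-facet toy under stated assumptions, not the multi-facet design of the teacher observation work mentioned above; the scores and function names are hypothetical.

```python
def variance_components(scores):
    """scores[p][r]: score of person p from rater r (fully crossed, no missing data)."""
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_r)
    person_means = [sum(row) / n_r for row in scores]
    rater_means = [sum(scores[p][r] for p in range(n_p)) / n_p for r in range(n_r)]

    ss_person = n_r * sum((m - grand) ** 2 for m in person_means)
    ss_rater = n_p * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_residual = ss_total - ss_person - ss_rater

    ms_person = ss_person / (n_p - 1)
    ms_rater = ss_rater / (n_r - 1)
    ms_residual = ss_residual / ((n_p - 1) * (n_r - 1))

    return {
        "person": max((ms_person - ms_residual) / n_r, 0.0),  # true differences among people
        "rater": max((ms_rater - ms_residual) / n_p, 0.0),    # rater leniency or severity
        "residual": ms_residual,                              # interaction plus error
    }

def g_coefficient(components, n_raters):
    """Relative coefficient for a score averaged over n_raters raters."""
    person = components["person"]
    return person / (person + components["residual"] / n_raters)

# Hypothetical ratings: 5 teachers (rows) each scored by the same 3 raters (columns).
ratings = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [1, 2, 2], [3, 4, 3]]
components = variance_components(ratings)
for n in (1, 2, 3, 5):
    print(n, round(g_coefficient(components, n), 3))
```

The projection step plays the same role for raters that Spearman-Brown plays for items: it shows how precision would change if the procedure averaged over more or fewer raters.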