Andrew Ho: 2016 Psychometrics mini course - Part 1
Transcript:
Well.
Many courses here we have
students from 2 classes.
We are very
pleased to have joining us
Andrew Ho, a professor at the Harvard Graduate
School of Education and a psychometrician,
who is going to be talking with us
today to give you a flavor for some of
the really important issues
in educational measurement.
First thought only measurement folks
in education research to pull.
Lists as well.
Goal here is not to have everything
that one would learn in a foursome.
Sequence in psychometrics but to give
you a flavor of what's out there and
help give you some ground.
To air into is.
Accomplished.
Academic. He's a member of the National
Assessment Governing Board.
Member.
He's a former.
Master since you just pick from.
It's a really very pleased to have.
Your Thank you thank you.
We had a game against Michigan.
I shouldn't have started with that;
the crowd turned against me. So thank you for
this opportunity. These
are weird things, right? Like, three hours of
measurement: what can I accomplish?
I'm actually not sure. I think
the presentation is kind of a mess, but
maybe deliberately so. What I wanted
to leave you with is a couple of
provocations, a couple of references, and
the sort of nagging idea that you need to
learn more and that there are places for
you to do so. And so I should start by
saying one of the places you can do so
is right here. You have a stealth
psychometrician in Brian, not to mention
a psychometrician in Matt, not to
mention people who know statistics and
measurement broadly likes you and Chris so
you do have a bunch of people who I would
love to have sitting
in the audience,
watching them teach this in their own way, and
I would learn a ton from it. So
we all would teach this in
different ways, and Brian and I actually
recently had an exchange about measurement:
he was talking to economists, and
I was reflecting on how he presented it,
and it was a ton of fun. And so I wish, you
know it's actually I was talking to Matt
about this earlier too I wish we could
talk more about how we teach this and this
is I think it's a great opportunity for
me to engage with you but also to get
a little bit from from people about
the different ways that they approach this
because we could learn a lot from them.
So here are my provocations. I kind
of want to do these like, you know, these
listicles, right? This is like how to
get people's attention: 7 things you need
to know about measurement. I'm caving;
I said that, and I'm going to do it anyway. So
here are those provocations.
I'm being deliberately kind of
extreme here, just kind of
to provoke a little bit. So
One: validate score uses. The question
I find myself asking my students most
often when they are starting to conceive
their measurement projects is "what for?"
And not just "what for" but "what":
what is a score? Ultimately, whether
it's a scale score or an average of scores
or a regression coefficient, ultimately
those scores are what's used, and
we don't validate tests, we validate
those ultimate uses of scores.
Two: content is king. Not models; content. So
if you start analyzing data ever
without knowing what items there were,
you should slap yourself on the wrist or
feel your advisor slapping you on
the wrist. What are you measuring?
You need to have an embodied
experience of what that
is. You need to put yourself in
the shoes of your participants and
feel what it's like to be a part
of that measurement procedure.
Three: we have a tendency in measurement to jump at
the neatest, hottest-sounding models
with the longest acronyms. Stop it. Start
with simple descriptive statistics.
In measurement, I think of classical
test theory as the descriptive statistics of
measurement. You should always start there,
in the same way that those of us
in Stata start off with what command?
Summarize. So
alpha is your summarize of measurement. OK.
Four: this is like "these are not the droids
you seek," or whatever it is. The
reliability that is most often calculated
is probably not the reliability
that you're interested in, and
I'd like you to leave here today being
able to answer your aunt or uncle when
they ask you, "What is reliability?" I want
you to be able to answer that question. I
remember when I was getting my master's in
statistics, my uncle asked me
what the standard deviation is. Can you
think of how to answer that question for
your uncle? Sure, I could give you an equation,
but how do you actually describe that in
a meaningful way? I think of
reliability like the standard deviation of
measurement be able to what is point
that mean and you should be able to answer
that I mean today if you can't read.
Five: item response theory is just a model.
We always hear, "We should really
have someone who knows
item response theory," "I wish I had someone who
knows item response theory," "I wish
someone could teach item response
theory." It is just a model, and
it is also a very, very useful model. So
I'm both going to demystify it and
focus you on what it does particularly
well. Six: your scale is pliable. Bend
it, don't break it. And this is something
that Brian wrote about recently as well.
The numbers: you shouldn't
think of them as like solid ground. The
distances that you sort of see,
it's kind of
like this bridge with springs between
the boards, maybe. There's a sense
that the scale is pliable,
shiftable, but not breakable, and
there are actually empirical ways
we can address this tendency, and
the judgments you make based
on your scale information
should be robust to that springiness.
And then seven, again, this is a reiteration
of other things: know the process that
generated your scores and
use them accordingly. Do not go beyond what
the data suggest; that's just a general
recommendation. So these are my
provocations. They sound just like
floating talking points now; they're deeply
embodied to me, and I'm going to try to
do my best to make them feel meaningful
to you over the next couple of hours.
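The provocation about starting with descriptive statistics calls alpha the "summarize" of measurement. As a rough illustration of what that one number is, here is a minimal sketch of coefficient alpha in plain Python. It assumes complete data with people in rows and items in columns, all items scored in the same direction; the function name and the toy responses are made up for illustration and are not from the talk.

```python
from statistics import variance

def cronbach_alpha(rows):
    """Coefficient alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])                               # number of items
    items = list(zip(*rows))                       # transpose: one tuple per item
    item_vars = [variance(col) for col in items]   # sample variance of each item
    total_var = variance([sum(r) for r in rows])   # variance of the sum score
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical responses: 5 people x 3 items on a 1-6 scale.
responses = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 4, 3],
    [5, 4, 5],
    [6, 6, 5],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 3))
```

The formula only looks at item variances and the variance of the sum score, which is why it knows nothing about content, polarity, or what a "1" means on each item.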
So, but I want to start by just stepping
back. And again, when you think measurement,
what kind of resources can
you use moving forward? And
in the e-mail that I think some of
you received from me, you've got two
to three citations, right? Two references.
The first are the Standards for
Educational and Psychological Testing.
What we're doing has decades,
centuries of history, and a field has built
up around it that has some guidance for
you, right? And three of the major bodies, the
American Educational Research Association,
the American Psychological Association, and
the National Council on Measurement in
Education, came together and
actually agreed on something, right? These
are the standards of the field, and this is
a very powerful tome, not just for you but
for the people who you're developing
tests for. You can say, "I did this, and
these are the standards of the field."
You should have this book, and
there's a discount for
members of any of these. But
if I were to recommend one thing,
these are the authoritative standards of
the field. They're not perfect; I've
got all kinds of quibbles with them. But
they're powerful, and
it actually reads as
a pretty reasonable intro text.
This one costs 160 bucks, and
I remember being a student, and
it's also very much a reference text. But
this is sort of the, quote, bible of the field.
It has all the sort of heavy hitters
in measurement who contributed chapters
to it. It is sort of the authority,
like what we would cite. When we
say, if you had a generic cite for
reliability, you would go to
Haertel 2006; a generic cite for
validation, you would go to Kane 2006, and
those are the first two chapters in this book.
Do not buy it unless you
are really into this stuff.
But have it for reference; it's there
for easy reference at your library. So
those would be my sort of two go-to books
for educational measurement,
probably, and you should
look to them, if something is provocative
to you, as places to go for citations.
So how do we learn measurement? I think
this is important to revisit too.
You know, this very much, I think,
reflects
the philosophy of teaching here too. But
I always want to sort of point to my
syllabi, and I'm eager to look
at other syllabi too from
other folks who teach measurement. But if you want
other references, they're up on my
website. There's sort of what
I think of as where to go for IRT,
what I think people
should read when it comes to differential
item functioning, what I think people
should read when it comes to standard
setting. So don't forget to look
at people's syllabi when you're reference
hunting and you want to say, "Someone
set a cut score; who should I cite for
the whole cut
score setting thing?"
Syllabi can be very useful in addition
to this tome,
where it's Hambleton and Pitoniak who
did standard setting. So
my syllabi and the syllabi of others
like Matt are good places to go for
references on these things.
And then again, you know, learn
it, use it, learn it again, and
use it again;
practice makes perfect. In
all of your classes, your methods classes, I
think you're using data, getting your hands
dirty, struggling with those Stata error
codes, looking at those manuals, right? So,
you know, help, help, help, then clicking on
the PDF: that's what you're going
to be doing. And I just want
you to recognize that measurement,
much like the other methods you're
learning, requires that struggle and
that patience. So this is something
that's a little trick I just want
to give a shout out to a few people
who contributed to my Google Docs
this is something I do to like incentivize
and encourage out of class reading and
out of class discussion so
there are a bunch of tools for this and D.
has and it has a new thing called to
result at some of my colleagues at Harvard
of develops or just open google docs but
it looks like this this is what I asked
some of you to contribute to I said
"Hi, students, this is the typical pre-class
discussion that we run." The idea is, I
ask questions, I have you respond to them
by 10 pm last night, and then I reply.
I am kind of a night owl, and so
in between 10 pm and dawn before a class
I sort of give people answers,
I have little conversations, and sometimes,
for the people who contribute late,
like around 9 pm, we get into these
sort of discussions online. Actually,
I know who it was,
but I was online with one of you.
And and just going a little bit back and
forth about how to write out
an equation so so thanks to Josh G.
Josh G.
There you go Josh so so
so we I had a little.
Cross talk with you I'm not sure if you
check back in solid I saw what I wrote but
this is a part of the read write
do this is just sort of like
how to sort of stay engaged.
And here's an actual Doc So
thanks to Josh G.
thanks to Fernando you can
see I'm replying italics here
thanks to Stephanie H.
Stephanie I missed your comment you must.
OK I'll reply later I promise Karin.
So there are a lot of really good.
Some derivations we did here Stacey.
So yeah and then some that there's always
I always sort of leave a space Cassandra
didn't get a chance to reply to you but
Stacy and
Josh here there's a general space for
general questions and discussion and
you asked and good general questions
that you can have time to engage with
towards the end of this class today
I should say to like interrupt.
But I think it's important sometimes
to just talk straight
pedagogy, like how to learn this stuff
and how to stay involved. So: read, write, do.
So again, there are seven principles
I'm going to start with, and
I'll start at the beginning,
straight from validation. So
we don't validate tests, we validate
score uses. I'll talk a little bit about
validity theory, and
this might seem a little bit detached, and
I'm going to get more technical later on.
I know I feel like I'm
talking to a trimodal
audience, potentially, so I think I might
interest some of you sometimes and
others of you other times, but
all of this is important and belongs to
the body of measurement. So I hope you'll
remember even the things that may
sound theoretical, even the things that may
sound too technical; they are all on
a continuum, and I think of it as what we do.
So validation. This is the more recent
depiction of what I think of as
the standard for validation in educational
measurement in particular. I contrast that
with Matt, who teaches more from the
psychological measurement paradigm, which I
think has a slightly different perspective
on validity. But Michael Kane and
the field of educational measurement
in particular are very utilitarian,
very instrumentalist: we
care about the ultimate use,
right? It's almost atheoretical, if
you take it to a certain extent;
we just don't even care about the numbers
as long as the interpretation or use
is correct. That's extreme, but that shows
you what they're emphasizing
here, what we're emphasizing here. To
validate an interpretation or
use of test scores is to evaluate
the plausibility of the claims based on
the scores. This is an argument-based approach:
you are building an argument with evidence
over time. There is never a point
where something is valid; it is part
of an ongoing evidence-building process.
And that is deeply unsatisfying,
right? Wouldn't it be great if there
were a correlation coefficient, and
once it exceeded point 7 you said, "Check"? And
this is a super frustrating
call to you to never do
that, right? And I think it's particularly
frustrating for introductory students,
even for
those who might not have careers in
measurement, to say they really have to
do all this. And I guess I'd say,
actually no, you don't really have to do
this, but at least you have to know that
these are the sort of standards of the
field, even if you selectively ignore them.
So and again this is just these are for
measurement there is my colleague
Derek Briggs who kind of disagrees with
this utilitarian instrumentalist
current status quo in measurement. But
there are actually debates about this in the
field, right? What is validity? You can
write a paper on that, right, and contribute
to the discussion of what it means for
the use and interpretation of test
scores to be valid and appropriate. And
we broached this this morning when I was
talking about the use of this new data
set that my colleagues and I have created
that allows you to compare
school districts across states. And
we don't just ask, is that valid or not?
We say, are the uses of that,
are the interpretations of that, valid or
not? And there was some really good
feedback from the faculty members and
students in the room about
which research designs and
which research inferences would be or
would not be appropriate. In those
situations it's very similar: what are you
trying to do with the scores, and
is that appropriate, is that supported
by the evidence? So I would say these
are different
definitions of validity, or
schools of thought about validity,
and I'm not reading all this text because
I'm sort of leaving
these slides as a reference. But
we are in a very instrumentalist,
even utilitarian, moment in
educational measurement, where we care
ultimately about how you're using those
scores; it's not about the test or
even the construct, it's about the score use.
So again, modern test validation theory
is dominated by instrumentalism,
concerned with test uses and interpretations,
and I'm acknowledging that this can
be frustrating because it kind of takes
the control away from your special little
instrument and its ultimate
scores and places it in this fuzzy domain
where people pick them up and use them, and
you might kind of be responsible for that.
So I think of validity, and
as I say, as I tweeted before,
I'm not ashamed to use mnemonics, and
so I think of five sources of validity
evidence, and I call them the five Cs. So
the first is content, right? So first,
take the test: what is it measuring?
There is a good overview of alignment
of four big testing enterprises
to the Common Core recently: Morgan Polikoff
and Nancy Doorey published this piece
at Fordham earlier this year, which is
basically a content study, right?
Do PARCC and Smarter Balanced, these big
testing consortia, as well as
MCAS and ACT
Aspire, the Massachusetts state test
and ACT's, do they align to the Common
Core State Standards? This is a content
study, and I think there are too few
people, frankly, delving into this
arena, which is currently
sort of dominated, I think, by
more model-based, statistically based
approaches. So I'm just sort of reiterating
that content is important, seriously
important. Cognition is another source of
evidence:
when you take that scale,
are you thinking what the designer
intended you to be thinking as I'm
thinking through this math test,
as I'm thinking about whether or
not I'm gritty? Think about the
studies of grit recently that have been
concerned about reference bias, right?
That is to say, do I feel gritty, and
can you compare it across courses, or
am I referencing my grit to the people who
happen to be in the school or in this
classroom, right? So how are people thinking
about it cognitively? The
evidence we can get
often comes from think-aloud
protocols as well as a parable analysis.
Coherence is where the field
seems sort of stuck with validity, and
a lot of what I'm going to
talk about subsequently is going to be in
this third C. So this is where
reliability analyses come up: EFA, CFA,
IRT. This is what Matt teaches as
well as me. This is, I think,
what people sort of assume measurement is
from a technical standpoint, and what I'm
highlighting here is it's only one C,
right? You've got to think about content,
you've got to think about cognition, and sure,
you can do your reliability analyses, but
that's only a piece of the puzzle. Another
piece of the puzzle is correlation. This
comes up a lot in structural equation
modeling, comes up a lot in economics too,
where you're trying to predict future
outcomes: does this predict college
attendance, graduation, or
college entry, or freshman GPA,
or future outcomes? Or, more concurrently,
does this correlate or
not correlate with things
that should be similar and
things that should be different? You
sometimes hear this as convergent or
discriminant validity. But this is
again only a piece of the puzzle. And
the fifth C is consequences, right? Evidence
based on the consequences of testing.
You could think about this even as
a counterfactual: had I not undertaken
this measurement enterprise at all,
what would have been the difference? So
this doesn't think about the scores as
much as the use of the scores, and
has the act of testing and
measuring itself had some consequence? And
so this is a fairly controversial,
relatively recent addition to
the sort of validity framework, but
these five sources of validity evidence are
clearly articulated in the Standards, and
are what you should think of when you're
designing a measure, when you're using
a measure, as the kinds of
evidence you can live.
So this is sort of in contrast
with what I think
people are thinking of as validation
commonly: I developed a scale with good
theory I fit a C.F.A. and got good can for
confirmatory fit index and
my reliability is greater than point my
scores predict desirable outcomes, so I
have a valid, reliable measure. That's like
the common sort of articulation of
a good baseline validity study. I'm setting
that up as, you know: that's content,
that's coherence, that's coherence too, this
is correlation. And that's incomplete:
you're missing cognition, you're
missing consequences, you're missing this
argument for use. What are your scores? How
are they used? What would have happened had
you not measured? And so
these are other questions you could ask
just to complete this sort of validity
framework. So it's more than just
a good fit index and
good item parameter estimates.
So again, seven key principles. We don't
validate tests, we validate score uses:
that's what I was covering. And I want
to emphasize content a little bit and
then dig into a little bit
of classical test theory, and
I think that'll probably take
us to the break or thereabouts.
And
then we'll get into reliability and
IRT shortly after. So,
yes, sure.
Yeah talking about consequences how it
should be in the context of what you
mean like if if a student had never
been tested then what other measure
to measure look underlying ability to or
think that we're getting
The score ultimately is, yeah, is used
for something, right? So once we
test, what's the sort of theory of that
making a difference in some way? And
it could be like publishing an article and
having that feed back into the system; it
can be very abstract in that way. It
could also be the teacher is going
to use it to give you feedback, and is
that feedback going to have a positive or
negative impact on you, right? Or it's going
to lead to a value-added estimate for
a teacher, and they're going to respond
and teach differently. So it's like, had
that not happened, the whole
process, not just
the score but the use
of the score in this theory of action,
had that not happened, what would be
the difference? So I think that's kind of
a pretty gold-standard level of,
I mean, we're talking about
a major evaluation at that point, which
is why this is sort of a controversial
source of validity evidence,
because, like, good luck, and
how long do you wait for
long-term outcomes? But this.
I mean from an economic You can be
because of the kind of catch all the time
you know you have people.
Like how can you think of them.
So again, in all the ways
that I think you're trained to think
as economists, right? And
I wasn't being glib;
I was sort of saying this is why
we're glad we have people like you:
because I think you are asking,
you know, what is the counterfactual?
You know, if we didn't have high-stakes
test-based accountability, like, we'd
have some sort of paper by some guy named
Brian Jacob and Condi or something and and
and sort of think about what would have happened had
there not been this rise in accountability
at this particular time. So these
are the kinds of evaluations, I think;
I'm not solely putting this
in economics like that. But
that said, I do think my
encouragement to you is to never just
think of a test as something
that's validated up in the air, but
as part of something that results in a score
that is used for a purpose. And
if that purpose is for you to publish and
get some correlation coefficient and
get in a journal, then that's great, and
that's part of your theory of action, and
that's pretty light. But
ultimately I sort of say, "But,
you know, why are you doing this?" And
that's why I'm sort of
pushing people to say that ultimately
your scores are used by people for
something; can you
describe that to me, please? And
that's what I find myself asking most
students; that's what's missing.
When they say, "I want to
create a measure of X,"
I'm like, "Why? Those scores,
what are you going to do with them?
What's going to happen?" And
that's often what I find
missing in their thought process.
Thank you we're here and I'm trying
to figure out correlation you said
evidence based on relation
to other variables and
so I'm wondering if by that you mean like
I would validate one standardized test by
its relationship to student scores
on a similar kind of test of similar
kind of content or reading of
things much broader than I'd like.
This chance to and how the critics
like high school graduation you're
going to college and so
how would I know those kinds of things
before like if I'm using these as
a foundation for measurement and
developing I haven't given it yet so
how do I have evidence on this is.
So this is why Cronbach and
all the sort of
people who have developed validity
theory over time have been
very clear that it is an ongoing
process. I mean, again,
this is where psychometricians struggle
with dealing with the outside world,
because the outside world is like, "Show me
your valid measure," and you're like, "But
this is this process that takes a..." "Look,
show me your valid measure." And so
it can be frustrating, but
this is how the field thinks about it. I
think you have to wear different hats, and
when you're talking to people who have
their definition of validity,
just say, "This checks all the technical
boxes." And you do want at least some
correlations with
concurrent variables in some way. But
look at the MCAS technical
report, look at the report here for
your, what is it now? And
you'll see that all of these are laid
out in there in varying degrees of
depth, and usually coherence is a massive
section with classical test theory, IRT,
differential item functioning, and the like;
correlation is small;
consequences is a paragraph;
cognition is like, "We did a lab"; and
content is very, very fleshed out, with
content frameworks and the like. So
this is why I explicitly walk
through a technical manual. You know,
when you finish my class, you should be
able to read a technical manual for
a state testing program whose data
you're going to use and figure out
what implications it has for your own
analysis. Yes, that's a good model to check.
That's.
What are some of.
The valid for the test but for.
What are some of the kind of.
Thing and I'm wondering when you
were talking about federalism focus
you seem the one to see complex necessary
if it was really going to meet that.
Goal or.
Cause.
Geared up on care I don't care what
I don't hear the reliability of
its core how well he learned
in college now that.
You're going to be
anything other than for.
This is a good question. So
this is where the economists come in. So
we probably should, over
dinner, over drinks, at
some other point, have a detailed
argument or debate about why these
things should matter. I think, I mean,
from a very utilitarian standpoint, in the
near term, before you get those long-term
outcomes, you know, if you're developing
your own measure, you need to stand on
something in the near term, before
you've got those long-term outcomes.
The here yeah and
it also I think it also I mean
I don't know, like, if you happen to find
some spurious correlation of something,
I mean, there's got to be some. And you
are interpreting, when you get an ACT
score, that there is some sort of college
readiness. And you know, when you say, like,
point 3, it's like socioeconomic
status correlates point 3, and
you don't say you're college ready
based on socioeconomic status, right? And
so the interpretations we use matter,
is the sort of psychometric argument. And
so be
specific about that interpretation and
what is the warrant for
that interpretation, and
if it's only based on socioeconomic
status, then the warrant seems
detached from the human. So I think this
is a deeper philosophical argument you're
raising that I don't think should be.
So I but I think it's a good one and
certainly some that might that my
students have advocated for and
it's certainly econ leaning.
But you know the you know what I
often fight with is like why do
we care about freshman G.P.A.
I mean look at that's a horrible measure
I kind of wanted to kind
of want freshman G.P.A.
to predict my on my high school test
because that's a better measure because
of the content the directionality I
mean so it's does arise I think from.
The items in the content is
the is psychometric percent.
But so, on to a little bit
of classical test theory and
the tools that we use to
evaluate in particular coherence
and content. So this is sort of
like my checklist for how to
get into a sort of secondary
analysis of test score data, right?
You get a Stata .dta
file, and it's got people in rows, and
there are all these
columns that correspond to items. And
I guess, you know, I'm
going to skip around; it's going to go 1, 2, 3, 7, 8 or
something like that, but this is
sort of part of a larger checklist, and
again, you know, this is from
John Willett's presentation as well.
Know your items, right? Read each
one, take the test, get a sense of what it's
trying to measure.
So this is an example from a
measure of, like, self-perception of
teaching success: you have high standards
of teacher performance; you're continually
learning on the job; you're successful in
educating your students; it's a waste of
time to do your best as a teacher (this is
negative polarity); you
look forward to working at your school;
how much of the time are you satisfied
with your job? Right, and so my
advice to you is never go into an analysis
without actually looking at the items and
sort of taking the test, scoring the test,
thinking of yourself as a subject. And
then you have all these sort of, like, your
scale items are 1 to 6; you see here
someone snuck in a 1-to-4 item.
This happens from time to time, so
do not get caught unawares. Do not type in
alpha without recognizing that some of
your variables have different item scales
than others, because it will give you
incorrect answers. So take control
of your scale and know it backwards and
forwards. And
again, in the interest of
time I'm going to jump through this.
Always know the scale of your items,
right? Know how to score your test:
how is it actually being
scored? Is it a sum score? Is it an
average? Are you reversing
the polarity of
some of your items? Are you
stretching the scales of some of them so
they all go from 0 to 100? How
are you actually scoring it?
So if you look here, right, again,
what I
recommend that you do when you're actually
going through this is reverse it yourself:
take control in Stata and reverse
code it so that they're all pointing
in the same direction, because
otherwise
I have found myself making mistakes.
This is some very practical advice for
you to not slip up
in the early stages of an analysis.
So, you know, again, look at your data, get a
sense of the missingness, label your items,
make absolutely sure your item scales
are oriented in the same direction or
you're using code that
recognizes when they're not.
Positive should mean something
similar; if not, fix it.
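The reverse-coding step can be sketched in a few lines. The talk does this in Stata; the version below is a plain Python illustration, and the function name, the toy responses, and the use of None for a missing response are all assumptions made for the example.

```python
def reverse_code(score, low, high):
    """Flip an item score on a scale that runs from low to high."""
    return low + high - score

# Hypothetical responses to a negatively worded 1-6 item,
# e.g. "It's a waste of time to do your best as a teacher."
raw = [1, 2, 6, 5, None]          # None marks a missing response

# Leave missing values missing; flip everything else.
flipped = [None if x is None else reverse_code(x, 1, 6) for x in raw]
print(flipped)                    # [6, 5, 1, 2, None]
```

Passing the scale endpoints explicitly is deliberate: a 1-to-4 item hiding among 1-to-6 items gets flipped correctly only if its own endpoints are used.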
Here's more exploring. I mandate
that people always give me discrete
histograms for item scales. I want to
know, like, how many ones there are, how
many twos, how many threes, fours, fives, and
sixes. I want to see, if you've got a 7-point
Likert scale, if no one is picking 6
or 7 ever; I expect you to know that from
the very beginning. And don't start running
IRT until you have a sense of what your
data actually look like.
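A discrete histogram of this kind is just a count per response category, including the categories nobody chose. A minimal sketch, with a hypothetical 7-point item in which no one picks 6 or 7 (the function name and data are illustrative assumptions):

```python
from collections import Counter

def discrete_histogram(scores, low, high):
    """Count responses in every category from low to high, keeping zeros."""
    counts = Counter(s for s in scores if s is not None)   # skip missing values
    return {cat: counts.get(cat, 0) for cat in range(low, high + 1)}

# Hypothetical responses to one 7-point Likert item.
responses = [1, 2, 2, 3, 3, 3, 4, 5, 5, None]
hist = discrete_histogram(responses, 1, 7)

for category, n in hist.items():
    print(category, "#" * n, n)

# Categories that no respondent ever selected.
unused = [cat for cat, n in hist.items() if n == 0]
print("unused categories:", unused)
```

Iterating over the full range rather than only the observed values is the point: an empty category shows up as a visible zero instead of silently disappearing.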
This is important as well: does
a one mean one at all times?
Is it always like "strongly disagree"
when you have a scale that goes
like 1 to 4? Right, so if I have
strongly disagree to strongly agree, and
then I have not successful to very
successful, and this is 1 to 6 and
this is 1 to 4, and I throw that
into alpha, if I throw that into like
a reliability analysis, what is
it going to do? It's going to assume
that very successful means slightly
agree. Does that make sense?
It could make sense you better think
about it and make a decision so if so
the idea here is that all of
these items scales are not
in a classical analysis are are they
think of ones as ones and
sixes sixes so you better take control of
that and make sure that that's right so
often what that entails is 2 things
one stretching this 124281 to 6
or actually just forcing this to be
one forcing this to be 6 forth and
forcing this to be what 2 and
like actually equally spacing that item
out so that you're saying not successful
is like strongly disagree very successful
as like strongly agree so one of the big
mistakes I see people making when they
get the scale is a secondary data analyst
that assuming that all items
are sort of interchangeable and
that the player he doesn't matter and
you sort of control over that.
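The stretching option can be sketched as a linear map that pins the endpoints and spaces the points in between equally. This is an illustration with a hypothetical function name, not the speaker's own code:

```python
def stretch(response, old_low, old_high, new_low, new_high):
    """Linearly rescale a response so old_low -> new_low and old_high -> new_high."""
    fraction = (response - old_low) / (old_high - old_low)
    return new_low + fraction * (new_high - new_low)


# Put a 1-to-4 "not successful ... very successful" item on a 1-to-6 metric.
stretched = [stretch(r, 1, 4, 1, 6) for r in (1, 2, 3, 4)]
print([round(value, 2) for value in stretched])
```

The endpoints map to 1 and 6, and the two middle categories land at roughly 2.67 and 4.33, so "very successful" is no longer silently treated as "slightly agree".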
Another way to approach it is to
standardize within each item. So
what you're doing is
you're just dividing by
the standard deviation
within each item, and in that case
you're saying that strongly disagree here
and strongly disagree there might not mean
the same thing, depending on the variance
of each of those item distributions.
And that's weird too. When your
Likert scale items are all
strongly disagree to strongly agree, do
not standardize, right, because strongly
disagree means the same thing across those
items, and if you standardize you lose that
information. Similarly, if you have an
educational test that has correct or
not correct, should you standardize?
Absolutely not. Correct is correct; it means
the same thing. So do not standardize
in those cases either. These
are the little things that seem
trivial, and I find that in my own
work and in my own students' analyses,
when we're not careful, we're coming up
with absolutely incorrect alpha values,
even in just the baseline
descriptive statistics, let alone getting
to IRT or structural equation modeling
or the like. So you've got to take control
of your data from the very beginning and
be very, very careful and intentional about
every single step. That's general
advice for statistics, period, right, but I'm
saying it still applies to measurement.
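To see why standardizing within item changes what a response means, here is a small sketch with two made-up items on the same 1-to-6 agreement scale. The data and the helper name are hypothetical:

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert raw item responses to z-scores within that item."""
    center = mean(values)
    spread = pstdev(values)
    return [(value - center) / spread for value in values]


# Two hypothetical items, both "strongly disagree" (1) to "strongly agree" (6).
item_a = [1, 1, 2, 2, 6, 6]
item_b = [1, 5, 5, 6, 6, 6]

z_a = standardize(item_a)
z_b = standardize(item_b)

# The first respondent answered "1" on both items, yet the z-scores differ.
print(round(z_a[0], 2), round(z_b[0], 2))
```

After standardizing, the same "strongly disagree" sits much further below the mean on the second item than on the first, which is exactly the information loss warned about when the response labels are already comparable.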
OK. So this is a baseline reliability
analysis. Check this out: alpha x1-x6,
asis. That should be your template, and
the item option gives you all these
item-level statistics. As for asis, I
have this sneaking suspicion that this
is leading to inflation of reliability
coefficients throughout Stata users,
and perhaps other programs as well. What
asis does is it says the direction of
the scale, the direction of the item
scale, positive is always positive:
if you coded it as positive, I'm
treating it as positive. If
you don't include asis,
there could be a really bad item in your
scale that correlates negatively with all
the other items, and
Stata will flip it for you
without telling you. It will show up here, but
you might not notice it. Without telling
you, it's going to flip it for you, which
is to say you've got such a bad item that
Stata says it can't possibly be
that bad, and reverses it for you.
That's crazy to me that they do that. And
so my thought is that
a lot of elementary analysts are dramatically
overinterpreting their simple
alpha, their simple reliability
value, because Stata decided
it knew best. But
anyway, so
this would be my default code to make sure
that you're controlling it appropriately.
Be intentional at every
step of your analysis, and
know what the direction is and
know what the scale points are. OK. So
this, I'm just going to
hand-wave through, but these are
various discrimination statistics.
They basically ask: does this item
correspond to the sum of
other items on the scale?
Does this item correlate with other
items? This is the coherence question,
this is an internal correlation: does this
item correlate with other items on a scale,
which is really kind of what is at
the heart of classical test theory, IRT,
structural equation modeling,
factor analysis and the like.
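Going back to the point about item direction and alpha: the sketch below computes Cronbach's alpha by hand on made-up data, first taking every item as coded and then with one badly behaved item reversed by hand. It is written in Python, the data are hypothetical, and the manual flip only mimics the kind of silent reversal described above; it is not Stata's actual algorithm.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha; items is a list of columns, one list of scores per item."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    item_variance_sum = sum(variance(column) for column in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical 1-to-6 responses from eight people to four items.
x1 = [1, 2, 3, 3, 4, 5, 6, 6]
x2 = [2, 2, 3, 4, 4, 5, 5, 6]
x3 = [1, 3, 2, 4, 5, 4, 6, 5]
x4 = [6, 5, 5, 3, 3, 2, 1, 2]  # runs against the other three items

alpha_asis = cronbach_alpha([x1, x2, x3, x4])

# Reverse x4 by hand (7 minus the response on a 1-to-6 scale) and recompute.
x4_flipped = [7 - value for value in x4]
alpha_flipped = cronbach_alpha([x1, x2, x3, x4_flipped])

print(round(alpha_asis, 2), round(alpha_flipped, 2))
```

With these made-up numbers the as-coded alpha is negative while the flipped version is above 0.9. If software makes that flip quietly, the reported reliability can look far better than the scale as actually scored.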
This is an example of a little bit
more, you know, pseudocode from
Stata for you.
How many people don't use Stata?
So, and you're using
Mplus?
This is why we include a whole
bunch of do-files. I've sent Brian
a couple, and I have more, and I'm happy
to give you sort of templates for this.
too we'll talk we'll talk more about that
the simplest of the good cos it will
test every kind of descriptive stats.
To the you know like OK you know.
Anyway, what's worth running... I
mean, they presume that you've sort of
done all that already. And so do all that
already; do that first. What I'm
recommending is: make sure you
have control over your scale.
So again, you know, coming in, it's sort of
"content is king" here, in the sense of
know your items, know your scale,
get a sense of what it's
trying to measure, and
don't just validate it based on whether or
not it predicts lifetime earnings next.
But if it were the debate.
What exactly were they.
Looking at like that.
In the sense not in the sense
of like I mean you want to
read a book on the question
because I want to get.
More.
With.
Like I mentioned.
Some of that question but maybe.
I can see Mollenhauer.
All.
Right, so this is a subscale
question. This comes up all the time.
Alpha is a property of a scale,
right, and if you want to create subscales,
get information about each of your
subscales; that's what alpha should be for.
If you throw an alpha across all
of the items across subscales, it's asking
how coherent is this across subscales. So
the question I always ask people who
are using subscales is: what's the question?
How are you using your scores? Right, so,
you know, if you take a cynical
approach, it's
always like: if you give policymakers
two numbers, they'll add them together.
So, you know,
this is like, you know,
the Grit Scale, the Angela Duckworth,
the Duckworth and Quinn 8-item Grit
Scale: there are 2 subscores. What do we
think people are doing with them?
Adding them together. So
my question is, and what
your question should be, is: what is the
property of the score that is being used?
This is the utilitarian,
sort of instrumentalist view. And
if you are creating a scale such
that people are using those subscales,
then evaluate each of them accordingly:
take alphas for
each of those subscales, report alphas for
each of the subscales. I'll show
you how Angela and Patrick
do this
shortly in their actual paper. So
yeah, which is just to
say: good to have subscales, but
then what I would
do is alpha, CTT
analyses on the subscales, and later we'll
talk confirmatory factor analysis and
all that jazz; or actually, well,
that's what his class is good at.
In particular.
So let's
go. So this is a paper
that I have everyone in my class dig
deeply into. This is Angela Duckworth
and Patrick Quinn's
Journal of Personality
Assessment paper in 2009.
I was talking with Matt about this: it is
a very common practice to develop
a scale that has, not
way too many items, but a lot of items, and
you might want to think
about how to administer them feasibly
in a flexible situation. And so you can
use classical test theory and item
response theory; they're both very, very
good at figuring out how to shorten that
scale, how to preserve information
while reducing the number of items.
You know,
I just gave you advice; I'm trying to
follow it. This is sort of a brief
description of the Grit Scale. I actually
have my students take this so we can
analyze their data. "New ideas and
projects sometimes distract me." "Setbacks
don't discourage me." "I've been obsessed
with a certain idea but..." "I am a hard
worker." "I often set a goal but later choose
to pursue a different one." I'm shortening
them a little
bit; this is to give you a sense of how
grit is operationalized. So this is their
item scale. In this paper they're sort of
saying: we had a 12-item scale, we're going
to 8, it will all be fine, don't worry about it.
So pardon my screenshots
here. "See Table 1 for
item-level correlations." After excluding 2
items from each subscale (they're talking
subscales here, right), the resulting 8-item
Grit Scale "displayed acceptable
internal consistency", that's code for alpha,
with alphas ranging from point 73 to point
8-something. Look at their Table 2, right;
again, we spend a lot of time digging
into these articles in class. So
this is, you know, West Point,
famously her National Spelling Bee sample,
Ivy League undergraduates, and these are
Cronbach's alpha values. These are the
values I was describing.
This one, that's the total scale, that's
the reliability coefficient
for the overall scale, and
then she breaks it down into perseverance
of effort and consistency of interest. And so
the question I would ask in this
case is, again, what's being used?
If you're treating these separately, you
can see what their alpha values are, and
if you're treating them as
a whole, that's there too. So
you can sort of cover your
use cases here and say, for
those purposes, here is your level of
internal consistency. Does that make sense?
Absolutely. And so this is why your
classical test theory statistics are your
descriptive statistics, your knee-jerk first
reaction, and after that we're going to get
to a more powerful framework that allows
you to answer questions like the ones
you're asking. This is what I
consider level one; this is like "summarize",
and I really do mean that this is the very
first step; after that you get to more
sophisticated questions. OK, so
by the way, I always have, as one
of my questions, my Google Doc questions,
this kind of annoying "guess
what I'm thinking" question, but
it's like: does anything look off to you
about this? And I'm just going to...
this is a tough question, so I'm just
going to pause, and just take
a look at this table, in particular these
alphas, these alphas compared to these
alphas. So this is, you know,
going 4 items, 4 items and 8 items, and
I just want to have you
take a look at that and just get your
gut reactions as to what
I find a little surprising.
There's a bit of it that.
I have a plan to in the audience.
Try and.
There can be a couple answers here so
don't be shy.
Yeah.
For example.
For example the man.
Who does point 73 or
an 8 item scale I have that's wacko.
Right, and so I'm not sure if she's
correcting for that and didn't mention it,
or if there's something weird going
on in the subscale relationships, but
that is not what you expect. What you
expect is more reliability when you have
many more items. In
fact, we're going to show you a prophecy
formula that predicts this: when you
have more items, in the same way that when
you average over more things you have
standard deviation over root n as your
precision, the more you average over, the
more precision you have. Now, it is a little
surprising, and noticing that is
a kind of disciplined perception that
you'll develop with measurement.
Cause that a lot but
Joining me to go from this.
Which is that much that.
If you were to be purely... so this is one
way, right; we're going to develop
even better ways with IRT. But this is
just sort of a ranking of how each item
correlates. This is the item-rest
correlation; it is literally the Pearson
correlation, a simple vanilla correlation,
between an item in one column and
the sum score of all the other
items in the other column. So
this is a very descriptive
statistic, again a knee-jerk, summarize-level
descriptive statistic, of how well this item
coheres with the rest of the scale.
We're going to see a better version
of this when we get to IRT,
but they very rarely tell
dramatically different stories. So
this is why, again, we sort of start with
our feet on the ground with a basic
analysis and then get advanced in
IRT.
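The item-rest correlation described here is simple enough to sketch directly. The data below are hypothetical 1-to-6 responses, and the function names are illustrative only:

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def item_rest_correlations(items):
    """Correlate each item with the sum score of all the other items."""
    results = []
    for index, column in enumerate(items):
        others = [c for j, c in enumerate(items) if j != index]
        rest_score = [sum(scores) for scores in zip(*others)]
        results.append(pearson(column, rest_score))
    return results


# Hypothetical responses from eight people to four items.
items = [
    [1, 2, 3, 3, 4, 5, 6, 6],
    [2, 2, 3, 4, 4, 5, 5, 6],
    [1, 3, 2, 4, 5, 4, 6, 5],
    [6, 5, 5, 3, 3, 2, 1, 2],  # deliberately runs against the others
]
item_rest = item_rest_correlations(items)
print([round(r, 2) for r in item_rest])
```

In this made-up example the last item has a strongly negative item-rest correlation, which is the cue to check its wording and coding direction before computing any reliability.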
So, to answer your question: if you
were to be purely cynical about it and
didn't care at all about content,
you'd drop maybe 1 and 3,
you know, rerun the item-rest
correlations, maybe drop a couple more if
you felt like it, and then calculate alpha
for whatever's remaining. I don't
recommend you do that, because content is
king. Be careful of throwing
away a subscale you care about.
Imagine an educational test where
suddenly you're not measuring math or
something, right? So you can imagine
that that would be dangerous. But
from a purely statistical
standpoint, that's what you could pull off.
So, good. I'm going
to just... do I have this? Yes. Skip two
slides on classical test theory;
that's sort of there for
you. There's a bunch of equations there;
I'm sort of putting that in as stuff for
future reference. What I want to
do is talk a little bit about
why classical test theory is a theory,
what it predicts, and
why it seems like it's useful. So what does
classical test theory actually predict, and
why do we think of it as a theory?
What can we infer from classical
test theory? First, variation
increases reliability. And so
this is akin to the logic you were using; I
might sort of flip it on its head. You
know, if you ask what the
reliability of
Grade 3, Grade 4,
Grade 5 scores is together, you're going
to get like a point 85, right? So
as you increase the variance, in the
same way that, as we know from
correlations, period (reliabilities are just
correlations; I forgot to ask
what's reliability, we'll get to that
again), just like any other correlation,
as you increase the variance
you stretch out the scatterplot,
right, you increase the sense of correlation.
And, you know, is there a whiteboard in here?
Later is that.
So I know that.
Because I don't work wow OK that the mind
OK I'll get the I'll get that shortly.
Go see Thank you.
So this is a derivation of
reliability. If you square
both sides, put error variance
over X variance, that's the proportion
of error variance, and
then one minus that is the proportion
of true score variance.
So that's reliability. And if you
do a little bit of algebra here you get
this expression, and so
you can derive the standard error of
measurement
in terms of the observed
standard deviation and reliability. And
as you increase that standard deviation
you're
going to increase your reliability,
in the same way that I'm going
to draw right now. So thank you.
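The algebra referred to here leads to the usual classical test theory expression for the standard error of measurement: the observed standard deviation times the square root of one minus the reliability. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
from math import sqrt


def standard_error_of_measurement(observed_sd, reliability):
    """SEM = observed SD * sqrt(1 - reliability), from classical test theory."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability should be between 0 and 1")
    return observed_sd * sqrt(1 - reliability)


# Hypothetical test: observed standard deviation of 10, reliability of 0.84.
sem = standard_error_of_measurement(10, 0.84)
print(round(sem, 2))
```

Holding the error fixed, a larger observed standard deviation means error is a smaller share of what you see, which is the sense in which more variation goes with more reliability.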
So this is...
So let's think about,
this is a Grade 3 X
and
a Grade 3 X
prime or something like that. So
let's imagine these are replications
of the procedure for Grade 3.
So in any case, if you have some
correlation, say that's Grade 4, but
then increase the scale and
have a Grade 5 here and Grade 6 here,
as you keep going up the scale, so
Grade 3 here.
So if you look at this sort of scatter
plot here, that correlates around
like point 6 or so. But as you
can see, as you keep sort of caterpillaring
this out (it's like a caterpillar, I
know it's not the greatest picture),
the idea is that now, hey, this
correlation looks more like point 8.
And so the greater
the variation you have,
the more reliability
you'll have. So I'll say to you, so
one of my students will is doing a pretty
neat project with Dana McCoy She's
using Google Street View to rate
schools and like the sort of perception
of like school quality from what you can
tell in Google Street View and she kind of
made a mistake, upon reflection, you
know, when thinking about this prediction,
of taking schools that
were too similar to each other.
It's like, let's take a bunch
of schools that are too similar in quality
and then look at inter-rater reliability and
item reliability across those schools.
Upon reflection, what she should
have done in
scale development is to make sure that the
variation very deliberately was reflective
of the variation in the population, so that
she can get a reliability that corresponds
to that. That said, classical test
theory does give you a tool
for correcting for
the variance in the sample you have
versus
the variance in the population you
ultimately care about. So
this is your general expression for
how the changes in variation will
increase the ultimate reliability, and
I'm just again putting
this here as a reference.
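The slide's expression is not reproduced in the transcript, so the following is only a sketch of one standard classical test theory version of this correction. It assumes the error variance stays the same when the group changes and only the true score variance differs; the function name and numbers are hypothetical.

```python
def reliability_in_new_group(reliability, sd_old, sd_new):
    """Project reliability to a group with a different observed SD.

    Assumption: the error variance is identical in both groups, so any
    change in observed variance is a change in true score variance.
    """
    error_variance = sd_old ** 2 * (1 - reliability)
    return 1 - error_variance / sd_new ** 2


# A single grade with reliability 0.6; pooling grades doubles the variance.
single_grade_sd = 1.0
pooled_sd = 2 ** 0.5
projected = reliability_in_new_group(0.6, single_grade_sd, pooled_sd)
print(round(projected, 2))
```

Under that assumption, doubling the observed variance moves a reliability of 0.6 to 0.8, which matches the direction of the scatterplot drawing: the same noise around a longer cloud of points.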
So that's, again, a classic
thing you should know about
correlation: as
the true variance increases in
the population, it will increase. Yeah?
We're going to think this is.
A distraction the root of this is kind
of a random subset of the population.
Were battle weary but it's been a gamble
and I think I've seen some educated.
By.
Looking at it said only that the size and
the program.
The college we're looking at the.
Trial Court but the.
College or inappropriate is.
The whole the.
Whole array of the giving
some of the work to lay.
There really but the.
Selection.
So had this been
a 12-week course in measurement, I would
have made sure to hammer that home
repeatedly. The classic example is, for
example, the correlation between
ACT scores and freshman GPA
at the University of Michigan, right? And
that tells you what it tells you, but
if you're interested in, had
everyone been admitted, what would
the correlation have been,
that would've been larger, but
you can't tell, for the reasons
that Brian suggested. I should
add here, here's my general advice
if you ever were to undertake this, because
if I were a reviewer I would
ding you if you didn't follow it, and that
is: report both, right? Report both the
initial correlation and the, you know,
disattenuated correlation, and
state your assumptions clearly. But
never just say, "and
here's my disattenuated correlation."
I actually reported disattenuated
correlations in my presentation today, but
in the paper we report both, so
I'm trying to follow my own advice.
So similarly, that advice is going
to hold here as well.
If we're ever going to talk about standard
deviations:
observed standard deviations are inflated
due to measurement error, right? So
you can think of, this is my miming
of the normal distribution again,
as you decrease your reliability,
your distribution
sort of blurs out until it just becomes
this, like, blob. And as you increase your
reliability, your standard deviation
gets tighter and tighter. So we know
that observed standard deviations are
inflated due to measurement error, because
reliability is, again, the proportion of
observed score variance accounted for
by true score variance. And so correlations
between 2 observed variables, X
and Y,
will be attenuated by measurement error in
both variables. That's just a side note. And
so there is a general formula for
the correction of correlations due
to measurement error. What we do is we
divide by the square root of the
reliabilities: if there's error in X
and error in Y, we divide by the square
root of the reliability of one and
the square root of the reliability of the
other, and this inflates the correlation.
I hate this correction and I use it all
the time, because what you're sort of
trying to say is: had
these variables been measured without
measurement error, then here would
have been their correlation. Right,
this is what structural equation models,
as Matt is doing, do behind the scenes for
you, right? They're
actually estimating the measurement error
in each of the variables and reporting
that disattenuated correlation for
you. And so this is a way of
sort of doing that mechanically
in the classical test theory
framework. My advice here holds too:
if you're going to do this, report the
initial correlation and then report the
disattenuated correlation, because what
you're kind of doing here, in a very not so
subtle way, is taking advantage of
measurement error: the more
imprecision I have, the greater I inflate
my correlation
coefficients. Sometimes you get
disattenuated correlations
that are greater than one.
That happens, and then you know
you've done something wrong. I mean, that
just reveals how silly the whole
process is, right? You're giving yourself
credit for a lot of imprecision,
for measurement error.
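The correction for attenuation described above divides the observed correlation by the square root of the product of the two reliabilities. Here is a small sketch, with hypothetical names and numbers, that returns both values in the spirit of "report both":

```python
from math import sqrt


def disattenuate(observed_r, reliability_x, reliability_y):
    """Correct an observed correlation for measurement error in X and Y."""
    return observed_r / sqrt(reliability_x * reliability_y)


def report_both(observed_r, reliability_x, reliability_y):
    """Return the observed and the disattenuated correlation side by side."""
    corrected = disattenuate(observed_r, reliability_x, reliability_y)
    return {"observed": observed_r, "disattenuated": corrected}


# Hypothetical example: observed r = 0.5, reliabilities 0.8 and 0.7.
print(report_both(0.5, 0.8, 0.7))

# With low enough reliabilities the "corrected" value can exceed 1,
# which is a sign that the inputs or assumptions deserve scrutiny.
suspicious = disattenuate(0.6, 0.5, 0.6)
print(round(suspicious, 2))
```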
But
that's something we should take away too.
And then finally, regression to the mean:
too much to talk about here,
I'll punt this to later. And then
finally, this is
the correction formula that would lead
you to be suspicious of that table that I
showed you in Angela and Patrick's paper,
right? Not suspicious in the sense that
they did something wrong, but I have
questions about it, right? And that is that
as you increase the number of items on
your test you get greater reliability. So
if you ever are in this position of
doing massive scale development and
have like 200 items do not
pat yourself on the back for
having a reliability of point 13
because you have hundreds of items
of course you do.
The average of that is going to be very,
very stable with respect to measurement
error. So that's why, whenever I report
reliabilities, I also report the number of
items, because you sort of condition your
interpretation of the reliability itself
on the number of items that you've got.
And so this is just an example:
if the reliability is point one and
we double the test length, what is
the predicted reliability? So K
would be 2 in this case. In
the same way,
given any test length and
reliability, you could estimate
the reliability of a single-item test by
plugging in K as one over the number
of items. If you ever really, really want
to take a gamble... people do this, right?
There are questions like
"Would you recommend this to a friend?"
That's what's called the Net Promoter
Score, and the Net Promoter Score is
supposed to be this one true item
that tells you whether or not your product
is going to do well in the business sense.
It's like a single-item scale, right? So
anyway, if you ever want to figure
out what the
reliability of the one-item test would be,
just plug in K as one over your number
of items. These are all super handy
formulas that I would expect you to have
just kind of in your back pocket,
the way you have a standard deviation,
the way you have a correlation
coefficient. These are the basics.
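The formula being described is the Spearman-Brown prophecy formula: the predicted reliability is K times the reliability, over one plus (K minus one) times the reliability. A sketch of both uses, doubling a test and shrinking it to one item (the function name is hypothetical; the 0.73 and 8 items echo the table discussed earlier and are used only as illustration):

```python
def spearman_brown(reliability, k):
    """Predicted reliability when test length is multiplied by k."""
    return k * reliability / (1 + (k - 1) * reliability)


# Doubling the length of a test whose reliability is 0.1 (K = 2).
doubled = spearman_brown(0.1, 2)
print(round(doubled, 3))

# Implied single-item reliability for an 8-item scale with alpha 0.73 (K = 1/8).
single_item = spearman_brown(0.73, 1 / 8)
print(round(single_item, 3))
```

Running the formula forward again from the single-item value with K = 8 recovers the original 0.73, which is a handy sanity check when using it in either direction.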
So
I'm going to skip Cronbach's alpha. OK.
So what is reliability?
So I said reliability
is point 8, and you're trying to explain to
your uncle what point 8 means. What do you say?
You can talk generally about
reliability as some sort of measure of
precision, and
that's good, but I also want to know: what
is point 8? What does that mean?
Actually it's a hard question.
Me because the good news is.
Good, good, good. So that's
right, that's, you know, the sort
of coherence of the overall measure, and
it's, you know, on the sort of 0 to one
scale, right. But if you want
to get very specific and actually address
the magnitude itself, what
is point 8 in that case?
I did my usual motormouth routine and
I said it a couple times, but
very quickly and without pausing.
Good, good.
Good, that's a good rule of
thumb. Psychometricians cringe at rules
of thumb, but that said, it's one
that I don't mind cosigning for
general purposes. But all the more
reason to know what point 7 means, right?
So you're talking about a signal to
noise ratio, where you're talking about
the true score variance over
the error variance. You're close, and
it's just a conversion away. But
you've got true score variance in
the numerator, that's good.
What's in the denominator?
It's observed score variance in
the denominator, total variance. So:
how much of the variance that you see is
accounted for by that signal? And you can
get to that from the signal to noise ratio.
So when you see
point 8, you're saying 80 percent of the
observed score variance is accounted for
by true score variance. That's not
the only way to think about the
reliability question; you can also
frame it in just the way we think of
an intraclass correlation,
as a correlation in itself.
And it's a correlation, in this case, of 2
replications of the measurement procedure.
It's a correlation of X
and X
prime. That's actually why you write
it rho X
X
prime.
It's a correlation between X
and what we imagine a replication of X
to be, which is equivalent to the
proportion of the observed score variance
accounted for by true score variance. So
it's bilingual, in the same way that you can
think of an intraclass correlation as
a correlation and as this measure
of between-group variance, right?
In the same way, reliability is both
the proportion of observed score
variance accounted for
by true score variance and
the correlation between 2
replications of a measurement procedure.
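The two readings of reliability can be checked against each other with a small simulation. This is a sketch under textbook classical test theory assumptions (observed score equals true score plus independent error); the sample size, seed, and variances are arbitrary choices for illustration.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def pop_variance(xs):
    """Population variance of a list of numbers."""
    center = sum(xs) / len(xs)
    return sum((x - center) ** 2 for x in xs) / len(xs)


random.seed(2016)
n_people = 20000

# True scores with variance 4; each replication adds fresh error with variance 1.
true_scores = [random.gauss(0, 2) for _ in range(n_people)]
x = [t + random.gauss(0, 1) for t in true_scores]
x_prime = [t + random.gauss(0, 1) for t in true_scores]

variance_ratio = pop_variance(true_scores) / pop_variance(x)
replication_corr = pearson(x, x_prime)
print(round(variance_ratio, 2), round(replication_corr, 2))
```

Both numbers come out close to 4 / (4 + 1) = 0.8: the share of observed variance due to true scores and the correlation between two replications are two descriptions of the same quantity.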
The mantra that I like is:
a person with one watch
knows what time it is;
a person with 2 watches is never
quite sure. And that's kind of
what psychometrics is all about. It's
sort of saying we always want to know
exactly how imprecise we are; we want to
be precise about our imprecision.
OK, so how do we estimate this in practice?
Here are 3 types of reliability. The first
is sort of the gold standard, sort of
parallel forms reliability. We actually try
to do that: we try to replicate the whole
measurement procedure twice. We
create 2 different equivalent
forms. Imagine this urn, the statistician's
urn, this magical urn, right, full of stuff
like marbles. We sort of take a scoop
of the items and create one form, take
another random scoop of the items and
create another form, and then we give it
all to you, like, now, and we give it all to
you in some separate room on some
separate day with some separate raters,
and we try to vary all the things that
we care about varying and
give that in a different scenario. And then
we simply take the correlation of X
and X
prime, and that's parallel
forms reliability. Another
way we approach it is to do test-retest
reliability. What that does not capture
is the variance due to the items, because if
I test you and then I retest you again,
I haven't drawn again from this urn of
items. So you want to think about all these
urns of, like, items, of raters, of
occasions, of tasks, and think about all of
those as contributing to your sources
of variance. And third, this is sort of
the weakest form, the one you usually get
the highest reliability from: internal
consistency reliability, which treats
all this stuff, like all the stuff that's
going on in this room right now, as fixed,
and only considers the variance of items
within the little test that you
happen to have. Right, it sort of says: hey,
instead of drawing from
this urn of items, I recognize that I've
already drawn from the urn of items. I can
split the test items sort of randomly in
half, do correlations of all those halves,
and think about how that is an estimate of
reliability. That's how internal
consistency reliability works.
So again I hope you're sort of bilingual.
In the order of the light.
consistency reliability; 10 percent of
the time it's some weird approach using
IRT that I'll talk about shortly.
And actually Sean and
I, in our 2015 paper (I meant to
have this here; I might have cut the slide),
we actually show you the histogram for
all reported state reliability coefficients
that you see in practice, just to give you
a sense.
All of them are over point 7 in this case;
they are centered on point 9
with a slight negative skew.
All.
Together for.
A purpose.
And then averages of that.
Here is what
I skipped over. So
Cronbach's alpha is exactly that, and you
actually can prove that it is
the average of all possible split halves,
right? You split
in half every single possible way you can,
you take the correlation over and over and
over again. Now, that correlation, this is
where you combine Cronbach's alpha and
Spearman-Brown, right: when you split in
half you've taken
half tests, so you have successfully
described, on average, the reliability of
a half test, and then you use Spearman-Brown
to ramp that up to the full test. So
it's a nice and neat little exercise.
But you've got
the intuition exactly.
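The "average of all possible split halves" claim can be checked numerically. One caveat, stated as an assumption of this sketch: the identity is exact when each split is scored with the Flanagan-Rulon form of the split-half coefficient, which coincides with the Spearman-Brown step-up of the half-test correlation when the two halves have equal variance. The data below are hypothetical.

```python
from itertools import combinations
from statistics import mean, variance


def cronbach_alpha(items):
    """Cronbach's alpha; items is a list of columns, one per item."""
    k = len(items)
    totals = [sum(scores) for scores in zip(*items)]
    return k / (k - 1) * (1 - sum(variance(c) for c in items) / variance(totals))


def split_half(items, half_a):
    """Flanagan-Rulon split-half coefficient for one particular split."""
    half_b = [i for i in range(len(items)) if i not in half_a]
    n_people = len(items[0])
    score_a = [sum(items[i][p] for i in half_a) for p in range(n_people)]
    score_b = [sum(items[i][p] for i in half_b) for p in range(n_people)]
    total = [a + b for a, b in zip(score_a, score_b)]
    return 2 * (1 - (variance(score_a) + variance(score_b)) / variance(total))


def mean_split_half(items):
    """Average the coefficient over every way of splitting the items in half."""
    k = len(items)
    # Keep item 0 in the first half so each split is counted exactly once.
    splits = [(0,) + rest for rest in combinations(range(1, k), k // 2 - 1)]
    return mean([split_half(items, split) for split in splits])


# Hypothetical 1-to-6 responses from eight people to four items.
items = [
    [1, 2, 3, 3, 4, 5, 6, 6],
    [2, 2, 3, 4, 4, 5, 5, 6],
    [1, 3, 2, 4, 5, 4, 6, 5],
    [1, 2, 2, 4, 4, 5, 6, 5],
]
alpha = cronbach_alpha(items)
average_of_splits = mean_split_half(items)
print(round(alpha, 4), round(average_of_splits, 4))
```

With four items there are only three distinct splits, and their average matches alpha to floating-point precision, which is the neat little exercise in miniature.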
So, you know, this is the last thing
I'll say on this, and then we'll take, I
think, a 5 minute break.
This is where the reliability you have
is not the reliability you seek, right? And
this is the point that I think Brian was
sort of leading to: you should
think of the reliability coefficients that
you get in your technical manuals and
all of your state tests as being
an impoverished version of the reliability
you might imagine, right? If it's trying to
answer the question how well does this X
correlate with this possible X
prime, then only varying items
doesn't cut it, right? If you
were to vary occasions, if you were to vary
scenarios, if you were to vary raters,
if you were to vary all these other things
that we might actually be interested in
generalizing over, then that reliability
would almost assuredly be lower, right?
And so that's worth thinking about as you
are adjusting for reliability:
what exactly are the replications over
which I'm interested in generalizing? And
that leads to yet
another theory called generalizability
theory, which was developed by Cronbach
and many others decades ago. Bob Brennan
has done the most work on this
as of late, and it is something you should
know about. I'm not going to dig too far
into it right now, but I'll give you a couple
key references. Brennan 2002 is one, and
there's a great primer by my former second
advisor Rich Shavelson and Noreen Webb in
1991; it's a nice little Sage primer, and
it's kind of depressing that it hasn't
really gone out of date since 1991. But
this is basically just analysis of
variance; it's pretty straightforward.
And it answers
a couple of questions. Tom Kane and
I did a paper on this, which I present in my
class, about teacher observations, right:
how many raters do you need, how many
lessons do you need, how many items do you
need to get sufficiently precise
estimates of teacher observation scores?
How many raters? And are, for example,
administrator raters different than peer
raters? These are the kinds of questions
generalizability theory is
really well primed to answer. This is my
colleague Heather Hill at Harvard, who
wrote a great article in Educational
Researcher; the title,
before the colon, was "When
Rater Reliability Is Not Enough", which is
to say, oftentimes we think: we've got
a bunch of raters, let's just see how well
they match with master coders. Not
enough, and I totally agree with her. So
I think, as you're
developing, if you ever have a scale that
depends on raters, you should definitely
start with rater accuracy and
then move quickly to generalizability
theory if you can, and
leverage the sources here.
Generalizability studies are expensive,
but they also are due
diligence when it comes to
real reliability, right? The reliability
you have is not the reliability you seek.