Andrew Ho: Psychometrics mini course - Part 1 | Gerald R. Ford School of Public Policy

Andrew Ho: 2016 Psychometrics mini course - Part 1

Transcript:

Well.

Many courses here we have
students from 2 classes.

Today we are very
pleased to have joining us

Andrew Ho, who is a professor at the Harvard Graduate
School of Education [inaudible].

Who is going to be talking with us
today give you a flavor for some of.

The really important issues
in education measurement.

First thought only measurement folks
in education research to pull.

Lists as well.

The goal here is not to cover everything
that one would learn in a four-course

sequence in psychometrics, but to give
you a flavor of what's out there and

help give you some grounding.

Andrew is an

accomplished

academic. He is a member of the National
Assessment Governing Board.

Member.

He's a former.

Master since you just pick from.

It's a really very pleased to have.

Your Thank you thank you.

We had a game against Michigan.

I shouldn't have started with that; the
crowd turned against me. So thank you for

this opportunity. These
are weird things, right? Like 3 hours of

measurement. What can I accomplish?
I'm actually not sure. I think

the presentation is kind of a mess but
maybe deliberately So what I wanted

to sort of leave with like a couple of
provocations a couple of references and

the sort of nagging idea that you need to
learn more and that there are places for

you to do so and so I should start by
saying one of the places you can do so

is right here. You have a stealth
psychometrician in Brian, not to mention

a psychometrician in Matt, not to
mention people who know statistics and

measurement broadly, like Sue and Chris. So
you do have a bunch of people who I would

love to be sitting
in the audience for,

watching them teach this in their own way, and
I would learn a ton from it. So

we all would teach this in
different ways, and Brian and I actually

recently had an exchange about measurement.
He was talking to economists, and

I was reflecting on how he presented it,
and it was a ton of fun. And so I wish, you

know it's actually I was talking to Matt
about this earlier too I wish we could

talk more about how we teach this and this
is I think it's a great opportunity for

me to engage with you but also to get
a little bit from from people about

the different ways that they approach this
because we could learn a lot from them.

So here are my provocations. I kind
of didn't want to do one of these, you know, these

listicles. This is like how to
get people's attention: 7 things you need

to know about measurement. But I'm caving;
I'm going to do it anyway. So

here are those provocations.
I'm being deliberately kind of

extreme here, just
to provoke a little bit. So,

one: validate score uses. The question

I find myself asking my students most
often when they are starting to conceive

their measurement projects is what for
and not just what for but what and

like what is a score ultimately whether
it's a scale score or an average of scores

or regression coefficient ultimately
those scores are what's used and

we don't validate tests, we validate
those ultimate uses of scores.

Two: content is king. Not models, content. So
if you ever start analyzing data

without knowing what the items were,
you should slap yourself on the wrist, or

feel your advisor slapping you on
the wrist. What are you measuring? And

you need to have an embodied
experience of what that

is you need to put yourself in
the shoes of your participants and

feel what it's like to be a part
of that measurement procedure.

Three: we have a tendency in measurement to jump at

the neatest, hottest-sounding models
with the longest acronyms. Stop it. Start

with simple descriptive statistics
in measurement, and I think of classical

test theory as the descriptive statistics of
measurement. You should always start there,

in the same way that those of us who use
Stata start off with what command?

Summarize. So
alpha is your summarize of measurement, OK.

Four: this is like, these are not the droids

you seek, or whatever it is. The
reliability that is most often calculated

is probably not the reliability
that you're interested in, and

I'd like you to leave here today being
able to answer your aunt or uncle when

they ask you what is reliability I want
you to be able to answer that question I

remember when I was getting my master's in
statistics, and my uncle asked me,

what's the standard deviation? Can you
think of how to answer that question for

your uncle? I could give you an equation, but
how do you actually describe that in

a meaningful way? I think of
reliability like the standard deviation of

measurement. Be able to say what point [inaudible]

means, and you should be able to answer
that by the end of today [inaudible].

Five: item response theory is just a model.

We always hear, we should really
have someone who knows item

response theory; I wish I had someone who
knows item response theory, which so

many could teach. Item response
theory is just a model, and

is also a very, very useful model. So
I'm both going to demystify it and

focus you on what it does particularly
well. Six: your scale is pliable. Bend

it, don't break it. And this is something
that Brian wrote about recently as well.

The numbers: you shouldn't
think of them as solid ground. Like

the distances that you see, they
can, this is kind of

like a bridge with springs between
the boards, maybe. There's a sense

that the scale is pliable, shiftable,
but not breakable, and

there are actually empirical ways
we can address this tendency, and

the judgments you make based
on your scale information

should be robust to that springiness.

And then 7 again this is a reiteration
of other things know the process that

generated your scores and

use them accordingly do not go beyond what
the data suggests that's just a general

recommendation. So these are my
provocations. They sound just like

floating talking points now, but they're deeply
embodied to me, and I'm going to try to

do my best to make them feel meaningful
to you over the next couple of hours.

So I want to start by just stepping
back and, again, when you think measurement,

what kind of resources can
you use moving forward? And

in the e-mail that I think some of
you received from me, you've got two

to three citations, right? Two references,
and the first is the Standards for

Educational and Psychological Testing.
What we're doing has decades,

centuries of history, and a field has built
up around it that has some guidance for

you, right? And three of the major bodies, the
American Educational Research Association,

the American Psychological Association, and
the National Council on Measurement in

Education, came together and
actually agreed on something, right? These

are the standards of the field and this is
a very powerful tome not just for you but

for the people who you're developing
tests for you can say I did this and

these are the standards of the field
you should have this book and

there's a discount for
members of this of any of these but

like if I were to recommend one thing
these are the authoritative standards of

the field they're not perfect I've
got a kind of quibbles with them but

they're powerful and

it actually reads as a
pretty reasonable intro text.

This one costs 160 bucks, and

I remember being a student, and
it's also very much a reference text, but

this is sort of the bible of the field. It
has all the heavy hitters

in measurement, who contributed chapters
to it. It is sort of the authority,

like what we would cite. If you
needed a generic cite for

reliability, you would go to
Haertel 2006; for a generic cite for

validation, you would go to Kane 2006, and
those are the first 2 chapters in this book.

Do not buy it unless you
are really into this stuff.

But have it on reference. It may be on
reference for easy access at your library,

or you can ask to have it put on reference. So
these would be my go-to books

for educational measurement,
probably, and you should

look to them, if something is provocative
to you, as places to go for citations.

So how do we learn measurement I think
this is important to visit visit too so

like you know this is very much I think in
the way that the way this is like reflects

the philosophy of teaching here too, but
I always want to point to my

syllabi, and I'm eager to look
at other syllabi too, from

other folks who teach measurement. But if you want
other references, they're up on my

website. There's what
I think of as where to go for IRT,

what I think people
should read when it comes to differential

item functioning, what I think people
should read when it comes to standard

setting. So don't forget to look
at people's syllabi when you're reference

hunting and you want to say, someone
set a cut score, who should I cite for

the whole cut
score setting thing? And

syllabi can be very useful in addition
to this tome,

where it's Hambleton and Pitoniak who
did standard setting. So

my syllabi and the syllabi of others
like Matt are good places to go for

references on these things.

And then again like you know learn
it use it learn it again and

use it again like in
practice makes perfect and

all of your classes, your methods classes, I
think you're using data, getting your hands

dirty, struggling with those Stata error
codes, looking at those manuals, right? So,

you know, help, help, help, then clicking on
the PDF. That's what you're going

to be doing [inaudible], and I just want
you to want to recognize that measurement

much like the other methods you're
learning requires that struggle and

that patience so this is something
that's a little trick I just want

to give a shout out to a few people
who contributed to my Google Docs

this is something I do to like incentivize
and encourage out of class reading and

out of class discussion so
there are a bunch of tools for this. [inaudible]

has, and there is a new thing called
Perusall that some of my colleagues at Harvard

have developed, or just open Google Docs. But
it looks like this this is what I asked

some of you to contribute to. I said,
hi students, this is the typical pre-class

discussion that we run. As you see, I
ask questions, I have you respond to them

by 10 pm last night, and then I reply.
I am kind of a night owl, and so

in between 10 pm and dawn before class
I like I sort of give people like answers

I have little conversations and sometimes
for the for the people who contribute late

like around 9 pm We get into these
sort of discussions online actually

how did someone who was I know who it was
but I was online with some one of you.

And and just going a little bit back and

forth about how to write out
an equation so so thanks to Josh G.

Josh G.

There you go Josh so so
so we I had a little.

Cross talk with you I'm not sure if you
check back in solid I saw what I wrote but

this is a part of the read write
do this is just sort of like

how to sort of stay engaged.

And here's an actual Doc So
thanks to Josh G.

thanks to Fernando you can
see I'm replying italics here

thanks to Stephanie H.

Stephanie I missed your comment you must.

OK I'll reply later I promise Karin.

So there are a lot of really good.

Some derivations we did here Stacey.

So yeah and then some that there's always
I always sort of leave a space Cassandra

didn't get a chance to reply to you but
Stacy and

Josh here there's a general space for
general questions and discussion and

you asked and good general questions
that you can have time to engage with

towards the end of this class today.
I should say, too, like, interrupt.

But I think it's important sometimes
to just talk straight

pedagogy, and how to learn this stuff
and how to stay involved. So: read, write, do.

So again, there are 7 principles
I'm going to start with, and

I'll start from the beginning and
start straight from validation. So

we don't validate tests, we validate
score uses. I'll talk a little bit about

the validity theory and
this might seem a little bit detached and

I'm going to get more technical later on.
I know, I feel like I'm talking

to a bimodal or trimodal
audience, potentially. So I think I might

interest some of you sometimes and
others of you other times, but

all of this is important and belongs to
the body of measurement. So I hope you'll

survive, and remember that even the things that may
sound theoretical, even the things that may

sound too technical, are all on
a continuum, and think of it as what we do.

So validation This is the more recent
depiction of what I think of as

the standard for validation in educational
measurement in particular I contrast that

with Matt who teaches more from the the
psychological measurement paradigm which I

think has a slightly different perspective
on validity. But Michael Kane and

the field of educational measurement
in particular are very utilitarian,

very instrumentalist. We
care about the ultimate use,

right? It's almost atheoretical. If
you take it to a certain extent,

we just don't even care about the numbers
as long as the interpretation or use

is correct. That's extreme, but that shows
you what they're emphasizing

here, what we're emphasizing here.
To validate an interpretation or

use of test scores is to evaluate
the plausibility of the claims based on

the scores. It's an argument-based approach:
you are building an argument with evidence

over time. There is never a point
where something is valid. It is part

of an ongoing evidence-building process,
and that is deeply unsatisfying,

right? Wouldn't it be great if there
were a correlation coefficient and

once it exceeded point 7 you said, check? And

this is a super frustrating
call to you to never do

that right and I think it's particularly
frustrating for introductory students for

whom, even for
those who might not have careers in

measurement, they say, do I really have to
do all this and I guess I'd say like

actually no you don't really have to do
this but at least you have to know that

these are the sort of standards of the
field even if you selectively ignore them.

So and again this is just these are for

measurement there is my colleague
Derek Briggs who kind of disagrees with

this utilitarian instrumentalist
current status quo in measurement but

there are actually debates about this in the
field, right? What is validity? You can

write a paper on that right and contribute
to the discussion of what it means for

the use and interpretation of test
scores to be valid and appropriate and

we broached this this morning when I was
talking about the use of this new data

set that my colleagues and I have created
that allows you to compare districts or

school districts across states, and
we don't just ask, is that valid or not?

We ask, are the uses of that,
are the interpretations of that, valid or

not and there's some really good
feedback from the faculty members and

students in the room about which
which research designs and

which research inferences would be or
would not be appropriate in those

situations. It's very similar: what are you
trying to do with the scores and

is that appropriate is that supported
by the evidence so I would say these

are, there are different
definitions of validity, or

schools of thought about validity,
and I'm not reading all this text because

I'm sort of leaving
these slides as a reference, but

we are in a very instrumentalist,
even utilitarian, moment in

educational measurement, where we care
ultimately about how you're using those

scores not about the test or
even the construct is about the score use.

So again, modern test validation theory
is dominated by instrumentalism,

concerned with test uses and interpretations,
and I'm acknowledging that this can

be frustrating because it kind of takes
the control away from your special little

instrument and it's in its ultimate
scores and places it in this fuzzy domain

where people pick them up and use them and
you might kind of be responsible for that.

So I think of validity, and

as I tweeted before,
I'm not ashamed to use mnemonics, and

so I think of 5 sources of validity
evidence, and I call them the 5 Cs. So

the first is content, right? So first,
take the test. What is it measuring?

There was a good overview of the alignment
of 4 big testing enterprises

to the Common Core recently. Morgan Polikoff
and Nancy Doorey published this piece

with Fordham earlier this year, which is
basically a content study, right?

Do PARCC and Smarter Balanced, these big
testing consortia, as well as

MCAS and

ACT
Aspire, the Massachusetts state test

and ACT, do they align to the Common
Core State Standards? This is a content

study, and I think there are too few
people, frankly, delving into this

arena, which is currently
sort of dominated, I think, by

more model-based, statistically based
approaches. So I'm just reiterating

that content is important, seriously
important. Cognition is another source,

another source of evidence.
It's like, when you take that scale,

Are you thinking what the designer
intended you to be thinking as I'm

thinking through this math test
as I'm thinking about whether or

not I'm gritty or not. Think about the
studies of grit recently that have been

concerned about reference bias, right?
That is to say, do I feel gritty, and

can you compare it across courses, or
am I referencing my grit to the people who

happen to be in this school or in this
classroom, right? So how are people thinking

about it cognitively? The
evidence we can get

often comes from think-aloud
protocols as well as [inaudible] analysis.

Coherence is where the field
seems sort of stuck with validity, and

a lot of what I'm going to
talk about subsequently is going to be in

this third C. So this is where
reliability analyses come up: EFA, CFA,

IRT. This is what Matt teaches, as
well as me. This is, I think,

what people sort of assume measurement is
from a technical standpoint, and what I'm

highlighting here is it's only one C,
right? You've got to think about content,

you've got to think about cognition, and sure,
you can do your reliability analyses, but

that's only a piece of the puzzle. Another
piece of the puzzle is correlation. This

comes up a lot in structural equation
modeling, comes up a lot in economics too,

where you're trying to predict future
outcomes. Does this predict college

attendance, graduation, or
college entry, or freshman GPA,

or future outcomes? Or, more concurrently,
does this correlate or

not correlate with things
that should be similar and

things that should be different? You
sometimes hear this as convergent or

discriminant validity. But this is
again only a piece of the puzzle, and

the fifth C is consequences, right? Evidence
based on the consequences of testing

you could think about this even as
a counterfactual like had I not undertaken

this measurement enterprise at all, what
would have been the difference? So

it doesn't think about the scores as
much as the use of the scores, and

like, has the act of testing and

measuring itself had some consequence? And
so this is a fairly controversial,

relatively recent addition to
the validity framework, but

these 5 sources of validity evidence are
clearly articulated in the Standards, and

they are what you should think of, when you're
designing a measure, when you're using

a measure, as the kinds of
evidence you can leverage.

So this is sort of in contrast
with what I think

people are thinking of as validation
commonly: I developed a scale with good

theory, I fit a CFA and got a good
fit index, and

my reliability is greater than point [inaudible], my
scores predict desirable outcomes, so I

have a valid, reliable measure. That's
the common articulation of

a good baseline study. I'm setting
that up as, you know: so that's content,

that's coherence, that's coherence too, this
is correlation. And that's incomplete,

sort of. We're missing cognition,
we're missing consequences, you're missing this

argument for use. What are your scores, how
do we use them, what would have happened had

you not measured? And so
these are other questions you could ask

to complete this sort of validity
framework, so it's more than just

A good fit index and
good item parameter estimates.

So again 7 key principles we don't
validate tests we validate score uses

That's what I was covering and I want
to emphasize content a little bit and

then dig into a little bit
of classical test theory. And

I think that'll probably take
us to the break or thereabouts.

So.

And
then we'll get into reliability and

IRT shortly after. So,
yes, sure.

Yeah talking about consequences how it
should be in the context of what you

mean like if if a student had never
been tested then what other measure

to measure look underlying ability to or
think that we're getting

the score ultimately is is yeah is used
for something right so once once we

test what's the sort of theory of that
making a difference in some way and

it could be like publishing an article and
having that feedback into the system it

can be very abstract in that way it
could also be the teacher is going

to use it to give you feedback and is
that feedback going to have a positive or

negative impact on you right or it's going
to lead to a value added estimate for

a teacher, and they're going to respond
differently to that. So it's like, had

that not happened, the whole
process, not just the,

you know, the score, but the use
of the score in this theory of action

had that not happened what would be
the difference? So I think that's kind of

a pretty gold-standard level of,
like, I mean, we're talking about

a major evaluation at that point, which
is why this is sort of a controversial

source of validity evidence,
because, like, good luck, and

how long do you wait for
long-term outcomes? But, but this.

I mean from an economic You can be
because of the kind of catch all the time

you know you have people.

Like how can you think of them.

So again so you know in all the ways
that I think you're trained to write

as economists right so I think you again
like and I didn't I wasn't being glib and

I was sort of saying this is like why
we're glad we have people like you is

because I think you are asking like what
you know what is the counterfactual for

you know if we didn't have high stakes
test based accountability like we'd

have some sort of paper by some guy named
Brian Jacob and Condi or something and and

and sort of think about what happened had
there not been this rise in accountability

at this particular time so these
are the kinds of evaluations I think that

I'm not solely putting this in the, in
the in like in economics like that but

that that said I do think that's my
encouragement to you is to never just

think of a test as something
that's validated up in the air but

as like part of the results in the score
that is used for a purpose and

if that purpose is for you to publish and
get some correlation coefficient and

get in a journal and that's great and
that's part of your theory of action and

that's pretty light but all but
ultimately I sort of say like but

you know why are you doing this and
that's why I'm sort of for

pushing people to go is that ultimately
your scores are used by people for

something can you can you
describe that to me please and

that's what I find myself asking most
students like that's what's missing and

when they say I want to
create a measure of X.

I'm like, why? You know, those scores,

what are we going to do with them?
What's going to happen? And

that's what that's often what I find
missing in their their thought process.

Thank you we're here and I'm trying
to figure out correlation you said

evidence based on relation
to other variables and

so I'm wondering if by that you mean like
I would validate one standardized test by

its relationship to student scores
on a similar kind of test of similar

kind of content or reading of
things much broader than I'd like.

This chance to and how the critics
like high school graduation you're

going to college and so
how would I know those kinds of things

before like if I'm using these as
a foundation for measurement and

developing I haven't given it yet so
how do I have evidence on this is.

So this is why Cronbach and
all the

people who have developed validity
theory over time have been

very clear that it is an ongoing
process. I mean, again,

this is where psychometricians struggle
with dealing with the outside world,

because the outside world is like, show me
your valid measure, and you're like, but

this is a process that takes time, and they're
like, show me your valid measure. And so

it can be frustrating, but

this is how the field thinks about it. I
think you have to wear different hats, and

when you're talking to people who have
that definition of validity,

just say, this checks all the technical
boxes. And you do want at least some

correlations with
concurrent variables in some way. But

look at the MCAS technical
report, look at the report here for

your [inaudible], what is it now, and

you'll see that all of these are laid
out in there in varying degrees of

depth. Usually coherence is a massive
section, with classical test theory,

IRT, differential item functioning, and the like;
correlation is small;

consequences is a paragraph;
cognition is like, we did a lab; and

content is very, very fleshed out with
content frameworks and the like. So

this is why I explicitly walk
through Technical Manual You know

when you finish my class you should be
able to read a technical manual for

a state testing program whose data
you're going to use and figure out

what implications it has for your own
analysis yes that's a good model to check.

That's.

What are some of.

The valid for the test but for.

What are some of the kind of.

Thing and I'm wondering when you
were talking about federalism focus

you seem the one to see complex necessary
if it was really going to meet that.

Goal or.

Cause.

Geared up on care I don't care what
I don't hear the reliability of

its core how well he learned
in college now that.

You're going to be
anything other than for.

This is a good question. So

this is where the economists, so
we probably should, like, over,

over dinner, over drinks, at
some other point, we will have a detailed

argument about or debate about why these
things should matter I think I mean so

from a very utilitarian standpoint in the
near term before you get those long term

outcomes you know if you're developing
your own measure you need to stand on

something in the near term before
you've got those long term outcomes.

The here yeah and
it also I think it also I mean

I don't know like if you happen to find
some spurious correlation of something

I mean there's got to be some and you
are interpreting, when you get an ACT

score, that there is some sort of college
readiness. And, you know, when you say, like,

point 3, it's like, socioeconomic
status correlates point 3, and

you don't say you are college ready
based on socioeconomic status, right? And

so the interpretations we use matter;
that is the sort of psychometric argument. And

so, you know, be
specific about that interpretation, and

what is the warrant for
that interpretation? And

if it's only based on socioeconomic
status, then the warrant seems

Detached from the human So I think this
is a deeper philosophical argument you're

raising that I don't think should be.

So I but I think it's a good one and
certainly some that might that my

students have advocated for and
it's certainly econ leaning.

But you know the you know what I
often fight with is like why do

we care about freshman G.P.A.
I mean look at that's a horrible measure

I kind of wanted to kind
of want freshman G.P.A.

to predict my on my high school test
because that's a better measure because

of the content the directionality I
mean so it's does arise I think from.

The items and the content: that is
the psychometric perspective.

But so, on to a little bit
of classical test theory and

the tools that we use to
evaluate, in particular, coherence

and content. So this is sort of
like my checklist for it like how to

get into a sort of secondary
analysis of test score data right

you get a bunch of, you get a Stata .dta
file, and it's got people in rows, and

there are all these
columns that correspond to items. And

so I'm going
to skip around, it's going to go 1, 2, 3, 7, 8 or

something like that, but this is
part of a larger checklist. And

again, this is from
John Willett's presentation as well.

Know your items, right? Read each
one, take the test, get a sense of what it's

trying to measure.

So so this is an example from a a.

Measure of like self perception of
teaching success you have high standards

of teacher performance you're continually
learning on the job you're successful in

educating your students it's a waste of
time to do your best as a teacher this is

negative negative negative polarity you
look forward to working at your school

how much of the time are you satisfied
with your job right and so this is like my

advice to you is never go into an analysis
without actually looking at the items and

sort of taking that like scoring the test
thinking of yourself as a subject and

then you have all these, like, your
scale items are 1 to 6. You see here

someone snuck in a 1-to-4 item.
This happens from time to time, so

do not get caught unawares. Do not type in
alpha without recognizing that some of

your variables have different item scales
than others, because it will give you

incorrect answers. So take control
of your scale and know it backwards and

forwards. And
again, in the interest of

time I'm going to jump through this.
Always know the scale of your items,

right? To score your test,
how is it actually being

scored? Is it a sum score? Is it an

average? Are you reversing
the polarity of

some of your items? Are you
stretching the scales of some of them so

they all go from 0 to 100? How
are you actually scoring it?

So if you look here, right, again,
what I

recommend that you do when you're actually
going through this is reverse it yourself.

Take control in Stata and reverse
code it so that they're all pointing

in the same direction,
because otherwise

I have found myself making mistakes.
This is some very practical advice for

you to not slip up
in the early stages of an analysis.

So, you know, again, look at your data, get a
sense of the missingness, label your items,

make absolutely sure your item scales
are oriented in the same direction, or

you're using code that
recognizes when they're not.

Positive should mean something
similar. If not, fix it.
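
The reverse-coding advice can be sketched in Python (the session itself works in Stata). Everything here is an illustrative assumption: the item names, the made-up 1-to-6 responses, and the helper functions. The alpha function uses the standard Cronbach's alpha formula, not code shown in the talk. The point is only that a negatively worded item left unreversed can wreck a reliability estimate.

```python
from statistics import variance


def reverse_code(value, low, high):
    """Flip a response on a low..high scale, so low becomes high and vice versa."""
    return low + high - value


def cronbach_alpha(items):
    """Cronbach's alpha for a list of item columns (one list of responses per item)."""
    k = len(items)
    # Total score per respondent: sum across items, row by row.
    totals = [sum(person) for person in zip(*items)]
    item_variance_sum = sum(variance(column) for column in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical responses from six teachers on a 1-to-6 agreement scale.
high_standards = [1, 2, 3, 4, 5, 6]
always_learning = [1, 2, 3, 4, 5, 6]
waste_of_time = [6, 5, 4, 3, 2, 1]  # negatively worded item

# Alpha with the negative item left as stored.
alpha_raw = cronbach_alpha([high_standards, always_learning, waste_of_time])

# Alpha after pointing every item in the same direction.
waste_reversed = [reverse_code(v, 1, 6) for v in waste_of_time]
alpha_fixed = cronbach_alpha([high_standards, always_learning, waste_reversed])

print("alpha, unreversed:", alpha_raw)
print("alpha, reversed:  ", alpha_fixed)
```

With these deliberately extreme made-up data, the unreversed alpha is negative and the reversed alpha is at its maximum, which is the kind of slip the early-stage checklist is meant to catch.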

Here's more exploring. I have a mandate
that people always give me discrete

histograms for item scales. I want to
know how many ones there are, how

many twos, how many threes, fours, fives, and
sixes. If you've got a 7-

point Likert scale, and no one is picking 6
or 7 ever, I expect you to know that from

the very beginning. And don't start running
IRT until you have a sense of what your

data actually look like.
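
A discrete histogram of this kind can be sketched in a few lines of Python. The responses, the 1-to-7 scale, and the function name are assumptions made up for illustration; missing responses are represented as None and skipped.

```python
from collections import Counter


def discrete_histogram(responses, low, high):
    """Count responses in every category from low to high, including empty ones."""
    counts = Counter(r for r in responses if r is not None)
    return {category: counts.get(category, 0) for category in range(low, high + 1)}


# Hypothetical responses to one 7-point Likert item (None marks a missing response).
item = [1, 2, 2, 3, 5, 4, 3, 3, None, 2]

hist = discrete_histogram(item, 1, 7)
for category, n in hist.items():
    print(f"{category}: {'#' * n} ({n})")

# Categories that nobody ever picked.
unused = [category for category, n in hist.items() if n == 0]
print("unused categories:", unused)
```

Listing every category, including the ones with zero responses, is the design choice that makes an unused 6 or 7 visible rather than silently absent.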

This is important as well: does
a one mean one at all times? Is it

always, like, strongly disagree?
When you have a scale that goes

like 1 to 4, right, so if I have
strongly disagree to strongly agree, and

then I have not successful to very
successful, and this is 1 to 6 and

this is 1 to 4, and I throw that
into alpha, if I throw that into, like,

a reliability analysis, what is
alpha going to do? It's going to assume

that very successful means slightly
agree. Does that make sense?

It could make sense. You'd better think
about it and make a decision. So

the idea here is that
these item scales are not adjusted

in a classical analysis; it will
think of ones as ones and

sixes as sixes, so you'd better take control of
that and make sure that that's right. So

often what that entails is one of 2 things:
one, stretching this 1-to-4 to a 1-to-6,

or actually just forcing this to be
1, forcing this to be 6, and

forcing this to be, what, 2 and...

like, actually equally spacing that item
out, so that you're saying not successful

is like strongly disagree, very successful
is like strongly agree. So one of the big

mistakes I see people making when they
get the scale as a secondary data analyst

is assuming that all items
are sort of interchangeable and

that the polarity doesn't matter. You
should take control over that.
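
The stretching described above can be sketched as a linear map that pins the endpoints and spaces the middle categories equally. This is an illustration under that equal-spacing assumption; the function name is hypothetical.

```python
def stretch(response, old_low, old_high, new_low, new_high):
    """Linearly map a response from one scale onto another, keeping equal spacing."""
    fraction = (response - old_low) / (old_high - old_low)
    return new_low + fraction * (new_high - new_low)


# A 1-to-4 item placed on the same footing as the 1-to-6 items.
stretched = [round(stretch(r, 1, 4, 1, 6), 3) for r in [1, 2, 3, 4]]
print(stretched)  # [1.0, 2.667, 4.333, 6.0]
```

Whether "not successful" really should line up with "strongly disagree" is a judgment call about content; the code only makes that decision explicit.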

Another way to approach it is to
standardize within each item, so

what you're doing is
you're just dividing by

the standard deviation
of each item, and in that case

you're saying that strongly disagree here
and strongly disagree there might not mean

the same thing, depending on the variance
of each of those item distributions.

And that's weird too. When your
Likert scale items are all

strongly disagree to strongly agree, do
not standardize, right, because strongly

disagree means the same thing across those
items, and if you standardize you lose that

information. Similarly, if you have an
educational test that has, like, correct or

not correct, should you standardize?
Absolutely not. Correct is correct and means

the same thing, so do not standardize
in those cases either. These

are the little things that seem
trivial, and I find, in my

own and in my own students' analyses,
people are running through and coming up

with absolutely incorrect alpha values,
or even just the baseline

descriptive statistics, let alone getting
to I.R.T. or structural equation modeling

or factor analysis. So you've got to take control
of your data from the very beginning and

be very, very careful and intentional about
every single step. That's general

advice for statistics, period, right, but I'm
saying it still applies to measurement.

OK, so this is a baseline reliability
analysis. Check this out: alpha

x1-x...,

asis. That should be your template, and

the item option gives you all these
item statistics. As for asis, I

have this sneaky suspicion that leaving it off
is leading to inflation of reliability

coefficients among Stata users,
and perhaps in other programs as well.

What asis does is it says the direction of
the item

scale is taken as given: positive is always positive.
If you coded it as positive, it is

treated as positive. If
you don't include asis,

there could be a really bad item in your
scale that correlates negatively with all

the other items, and
Stata will flip it for you

without telling you. It will show up here, but
you might not notice it. Without telling

you, it's going to flip it for you, which
is to say you've got such a bad item that

Stata says it can't possibly be
that bad and reverses it for you.

That's crazy to me that they do that, and
so my thought is that this leads to

a lot of elementary analysts dramatically
overinterpreting their simple

alpha, their simple reliability
value, because they're...

going to... you know best. But
anyway, so

this would be my default code, to make sure
that you're controlling it appropriately.

Be intentional at every
step of your analysis,

know what the direction is, and
know what the scale points are. OK.
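
To see why item direction matters so much, here is a small sketch that computes Cronbach's alpha from the standard formula, once with a negatively worded item left as coded and once after reversing it by hand. The six respondents and four items are made-up data, and this is not Stata's implementation.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for rows of item scores (one row per respondent)."""
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical 1-to-6 responses; the fourth item is negatively worded.
as_coded = [
    [1, 2, 1, 6],
    [2, 2, 3, 5],
    [3, 4, 3, 4],
    [4, 4, 5, 2],
    [5, 6, 5, 2],
    [6, 5, 6, 1],
]
# Reverse the fourth item yourself: 7 - x on a 1-to-6 scale.
reversed_by_hand = [row[:3] + [7 - row[3]] for row in as_coded]

alpha_as_is = cronbach_alpha(as_coded)
alpha_reversed = cronbach_alpha(reversed_by_hand)
print(round(alpha_as_is, 3), round(alpha_reversed, 3))
```

With these numbers the as-coded alpha is negative and the reversed alpha is above 0.9, so a silent flip by the software would change the story completely.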

So this is... I'm just going to
hand-wave through this, but these are

various discrimination statistics.
They basically ask: does this item

correspond to the sum of
other items on the scale?

Does this item correlate with other
items? This is the coherence question,

this is an internal correlation: does this
item correlate with other items on a scale?

That is really what is at
the heart of classical test theory, I.R.T.,

structural equation modeling,
factor analysis, and the like.

This is an example of a little bit
of, you know, more pseudo code from

Stata for you.

How many people don't use Stata?

So, and you're using M-

plus.

Because this is why we include a whole
bunch of do files, and I've sent Brian

a couple, and I have more, but I'm happy
to give you sort of templates for this.

too we'll talk we'll talk more about that

the simplest of the good cos it will
test every kind of descriptive stats.

To the you know like OK you know.

Anyway, whatever you're running, I
mean, they presume that you've sort of

done all that already, and so do all that
first, as I'm

recommending, and make sure you
have control over your scale.

So again, coming in, it's sort of
content is king, in the sense of, like,

know your items, know your scale, and

get a sense of what it's
trying to measure, and

don't just validate it based on whether or
not it predicts life earnings. Next.

But if it were the debate.

What exactly were they.

Looking at like that.

In the sense not in the sense
of like I mean you want to

read a book on the question
because I want to get.

More.

With.

Like I mentioned.

Some of that question but maybe.

I can see Mollenhauer.

All.

Right, so this is a subscale
question. This comes up all the time. So

alpha is a property of a scale,
right, and if you want to create subscales,

get information about each of your
subscales; that's what alpha should be for. And

what's more, if you throw an alpha across all
of the items across subscales, it's asking

how coherent is this across subscales. So
the question I always ask people who

are using subscales is: what's the question?
How are you using your scores? Right, so,

you know, if you take a cynical
approach, like, you know, Ed Haertel

always says, if you give policymakers
two numbers they'll add them together.

So, you know,
this is like the, you know,

the Grit Scale, the
Duckworth and Quinn 8-item Grit

Scale. There are 2 subscores. What do we
think people are doing with them?

Adding them together. So if you want
my answer, my question is what

your question should be: what is the
property of the score that is being used?

This is the utilitarian,
sort of instrumentalist view. And

if you are creating a scale such

that people are using those subscales,
then evaluate each of them accordingly, and

then take alphas for
each of those subscales, report alphas for

each of the subscales. I'll show
you how Angela and Patrick

do this
shortly in their actual paper. So,

yeah, which is just to
say, good to have subscales, but

then what I would
do is alpha and C.T.T.

analyses on the subscales, and later we'll
talk confirmatory factor analysis and

all that jazz, or actually, well,
that's what his class is good at,

in particular.

So let's.

Go. So this is a paper
that I have everyone in my class dig

deeply into. This is Angela Duckworth
and Patrick Quinn's

Journal of Personality
Assessment paper in 2009.

I was talking with Matt about this. It is
a very common practice to develop

a scale that has, not
way too many items, but a lot of items, and

you might want to think
about how to administer them feasibly

in a flexible situation. And so you can
use classical test theory and item

response theory; they're both very, very good at
figuring out how to shorten that scale,

how to preserve information
while reducing the number of items.

This is, you know, I
just gave you advice and I'm trying to

follow it. This is sort of a brief
description of the Grit Scale. I actually

have my students take this so we can
analyze their data. New ideas and

projects sometimes distract me. Setbacks
don't discourage me. I've been obsessed

with a certain idea, but... I am a hard worker.
I often set a goal but later choose to

pursue a different one. So I'm shortening them a little
bit, but this is to give you a sense of how

grit is operationalized. So this is their item
scale. In this paper they're sort of saying,

we had a 12-item scale, we're going to 8,
it will all be fine, don't worry about it.

So pardon my screenshots
here. "See Table 1 for

item-level correlations after excluding two
items from each subscale." I talked about subscales

here, right? "The resulting 8-item
Grit Scale displayed acceptable

internal consistency", that's code for alpha,
"with alphas ranging from point 73 to point..."

Look at their Table 2, right?

Again, we spend a lot of time digging
into these articles in class. So

this is, you know, West Point, the
famous National Spelling Bee sample,

Ivy League undergraduates, and these are
Cronbach's alpha values. These are the values I

was describing. Point...

the sum, that's the total scale,
that's the reliability coefficient

for the overall scale, and

then she breaks it down into perseverance of
effort and consistency of interest. And so

the question I would ask in this
case is, again, what's being used? And

if you're treating these separately, you
can see what their alpha values are, and

then if you're treating them as
a whole, that's the one. So

you can sort of cover your
use cases here and say, for

those purposes, here is your level of
internal consistency. Does that make sense?

Absolutely. And so this is why your
classical test theory statistics are your

descriptive statistics, your knee-jerk first
reaction, and after that we're going to get

to a more powerful framework that allows
you to answer questions like the ones

you're asking. And so this is what I
consider level one, this, like, summarize, and

I really do mean that, it's like the very...

after that you get to more
sophisticated questions. OK, so

by the way, one
of my questions, my Google Doc questions,

is kind of this annoying guess-
what-I'm-thinking question, but

it's like: does anything look off to you
about this? And I'm just going to, sort of,

this is a tough question, so I'm just
going to pause and just take

a look at this table, in particular these
alphas, these alphas compared to these

alphas. So this is, you know,
four items, four items, and 8 items, and

I just want to have you
take a look at that and just get your

gut reactions as to what
I find a little surprising.

There's a bit of it that.

I have a plan to in the audience.

Try and.

There can be a couple answers here so
don't be shy.

Yeah.

For example.

For example the man.

Who does point 73 or

an 8 item scale I have that's wacko.

Right, and so I'm not sure if she's
correcting for that and didn't mention it,

or if there's something weird going
on in the subscale relationships, but

that is not what you expect. What you
expect when you have many more items, in

fact we're going to show you a prophecy
formula that predicts this, is that when you

have more items, in the same way that when you
average over more things you have standard

deviation over root n as your precision,
the more you average over, the more

precision you have. Now, it is a little
surprising that it's accurate. That's

a kind of disciplined perception that
you'll develop with measurement.

Cause that a lot but
Joining me to go from this.

Which is that much that.

If you were to be purely... So this is one
way, right, and we're going to develop

even better ways with I.R.T. But this is
just sort of a ranking of how each item

correlates. This is the item-rest
correlation. It is literally the Pearson

correlation, a simple vanilla correlation,
between an item in one column and

the sum score of all the other
items in the other column. So

this is a very descriptive
statistic, again a knee-jerk, summarize-level

descriptive statistic, of how well this
coheres with the rest of the scale.

We're going to see a better version
of this when we get to I.R.T.,

but they very rarely tell
dramatically different stories. So

this is why, again, we sort of start with
our feet on the ground with a basic

analysis and then get advanced in
I.R.T.
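
The item-rest correlation described here can be sketched in a few lines: for each item, take the Pearson correlation between that item and the sum of all the other items. The data are made up, with a deliberately unreversed fourth item so that one correlation comes out negative.

```python
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_rest_correlations(rows):
    """For each item, correlate it with the sum score of the remaining items."""
    k = len(rows[0])
    result = []
    for i in range(k):
        item = [row[i] for row in rows]
        rest = [sum(row) - row[i] for row in rows]
        result.append(pearson(item, rest))
    return result


# Hypothetical 1-to-6 responses; the fourth item was never reverse coded.
rows = [
    [1, 2, 1, 6],
    [2, 2, 3, 5],
    [3, 4, 3, 4],
    [4, 4, 5, 2],
    [5, 6, 5, 2],
    [6, 5, 6, 1],
]
correlations = item_rest_correlations(rows)
print([round(r, 2) for r in correlations])
```

A negative item-rest correlation like the one for the fourth item is a prompt to check the item's wording and coding before thinking about dropping it.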

So, to answer your question, if you
were to be purely cynical about it and

didn't care at all about content,
you'd drop maybe one and 3,

you know, rerun the item-rest

correlations, maybe drop a couple more if
you felt like it, and then calculate alpha

for whatever's remaining. And I don't
recommend you do that, because content is

king. So be careful of throwing
away a subscale you care about.

Imagine doing that on an educational test, where
suddenly you're not measuring math or

something, right? So you can imagine
that that would be dangerous. But

from a purely statistical
standpoint that's what you could pull off.

So, good. I'm going
to jump. Do I have this? Yes. Skip the

slides on classical test theory.

That's sort of there for

you. There's a bunch of equations there.
I'm sort of putting that in as stuff for

future reference. What I want to
do is talk a little bit about

why classical test theory is a theory,
what it predicts, and

why it seems like it's useful. So what does
classical test theory actually predict, and

why do we think of it as theory?
What can we infer from classical

test theory? First, variation
increases reliability. And so

this is akin to the logic you were using;
I might sort of flip it on its head. You

know, if you

ask for the reliability of Grade 3, Grade 4,

Grade 5 scores together, you're going
to get like a point 85, right? So

as you increase the variance, right, in the
same way that, as we know from

correlations, period, reliabilities are just
correlations. I forgot to ask

what's reliability. We'll get to that
again. But just like any other correlation,

as you increase the variance
you increase the scatterplot

range, you increase the sense of correlation.
And, you know, is there a whiteboard in here?

Later is that.

So I know that.

Because I don't work wow OK that the mind
OK I'll get the I'll get that shortly.

Go see Thank you.

So this is a rederivation of

reliability. If you square
both sides and put X

under error, that's the proportion
of error variance, and

then one minus that is the proportion
of true score variance.

So that's reliability. And so if you
do a little bit of algebra here you get

this expression. And so
in terms of the observed S.D.,

you can derive the standard error of measurement
in terms of the observed

standard deviation and reliability. And
as you increase that standard deviation,

you're
going to increase your reliability,

in the same way that I'm going
to draw right now. So, thank you.
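
The slide itself is not in the transcript, so the following is a reconstruction of the standard classical test theory algebra being described, not a copy of the slide. The symbols are assumptions: observed score X, true score T, error E with T and E uncorrelated, and reliability written as rho.

```latex
\begin{align*}
X &= T + E, \qquad \sigma_X^2 = \sigma_T^2 + \sigma_E^2 \\
\rho_{XX'} &= \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2} \\
\sigma_E &= \sigma_X \sqrt{1 - \rho_{XX'}}
\end{align*}
```

The last line is the standard error of measurement in terms of the observed standard deviation and the reliability. Reading the middle line with the error variance held fixed, a larger observed variance means a larger reliability, which is the prediction about variation made here.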

So this is...

So, like, let's think about...

If you're
just, this is a Grade 3 X

and Grade

3 X

prime, or something like that. So

let's imagine these are replications
of a procedure for a grade.

So in any case, if you have some
correlation, that's like Grade 4, but

then increase the scale and
have a Grade 5 here and Grade 6 here,

as you keep going up the scale, so
Grade 3 here.

So if you look at this sort of scatter
plot here, that correlates around

like point 6 or so. But as you
can see, as you keep sort of

caterpillaring this out, it's like a caterpillar, I
know it's not the greatest picture, but

the idea is that now, hey, this
correlation looks more like point 8.

And so the greater
the variation you have,

the more reliability
you'll have. So I'll say to you,

one of my students is doing a pretty
neat project with Dana McCoy. She's

using Google Street View to rate
schools, like the sort of perception

of school quality from what you can
tell in Google Street View. And she kind of

made a mistake, upon reflection,
when thinking about this prediction,

of taking schools that
were too similar to each other.

It's like, let's take a bunch
of schools that are too similar in quality and

then look at inter-rater reliability and
item reliability across those schools.

Upon reflection, what she should
have done in

scale development is make sure that the
variation very deliberately was reflective

of the variation in the population, so that
she could get a reliability that corresponds

to that. That said, classical test theory
does give you a tool for

correcting for
the variance in the sample you have

versus the variance in the population you

ultimately care about. So
this is your general expression for

how changes in variation will
increase the ultimate reliability, and

I'm just, again, putting
this here as a reference.

So that's, again, a classic
thing you should know about

correlation: as

the true variance increases in
the population, it will increase. Yeah?

We're going to think this is.

A distraction the root of this is kind
of a random subset of the population.

Were battle weary but it's been a gamble
and I think I've seen some educated.

By.

Looking at it said only that the size and
the program.

The college we're looking at the.

Trial Court but the.

College or inappropriate is.

The whole the.

Whole array of the giving
some of the work to lay.

There really but the.

Selection.

So, had this been
a 12-week course in measurement, I would

have made sure to hammer that home
repeatedly. So the classic example is, for

example, the correlation between
S.A.T. scores and freshman G.P.A.

at the University of Michigan, right? And
that tells you what it tells you. But

if you're interested in, had
everyone been admitted, what would

the correlation have been,
you would have seen

that it would've been larger, but
you can't tell, for the reasons that

Brian suggested. I should
add here, here's my general advice

if you ever were to undertake this, because
if I were a reviewer I would

ding you if you didn't follow it, and that's:
report both. Report both the initial

correlation and the, you know,
corrected or

disattenuated correlation, and
state your assumptions clearly. But

never just say, "and
here's my disattenuated correlation." And

I actually reported disattenuated
correlations in my presentation today, but

in the paper we report both, so
I'm trying to follow my own advice.

So similarly, that advice
is going to hold here as well.

If we're ever going to talk about standard
deviations, observed

standard deviations are inflated
due to measurement error, right? So

you can think of this, this is my miming
of the normal distribution again.

As you decrease your reliability,
your distribution

sort of blurs out until it just becomes
this, like, blob, and as you increase your

reliability your standard deviation
gets tighter and tighter. So we know

that observed standard deviations are
inflated due to measurement error, because

reliability is, again, the proportion of
observed score variance accounted for

by true score variance. And so correlations
between 2 observed variables, X

and Y,
will be attenuated by measurement error in

both variables. That's just a side note. And
so there is a general formula for

the correction of correlations
for measurement error. What we do is we

divide by the square root of the reliabilities.
If there's error in X

and error in Y, we divide by the square
root of the reliability of one and

the square root of the reliability of the
other, and this inflates the correlation.

I hate this correction and I use it all
the time. What you're sort of

trying to say is, had
these variables been measured without

that measurement error, then here is what
their correlation would have been, right?

This is what structural equation models,
as Matt is doing, do behind the scenes for

you, right? They're
actually estimating the measurement error

in each of the variables and reporting
that disattenuated correlation for

you. And so this is a way of
sort of doing that mechanically

in the classical test theory
framework. My advice here holds too:

if you're going to do this, report the
initial correlation and then report the

disattenuated correlation, because what you're
kind of doing here, in a very not so

subtle way, is taking advantage of
measurement error. The more

imprecision I have, the greater I inflate
my test scores, I mean, the greater

I inflate my
correlation coefficients. Sometimes you get

correlation coefficients
that are greater than one.

This happens, and then you know
you've done something... I mean, that

just reveals how silly the whole
process is, right? You're giving yourself

a lot of credit for imprecision,
for measurement error.
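
The correction described above divides the observed correlation by the square roots of the two reliabilities. A minimal sketch, with made-up correlations and reliabilities, that prints both values as the advice recommends:

```python
from math import sqrt


def disattenuate(r_xy, rel_x, rel_y):
    """Observed correlation divided by the square roots of both reliabilities."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must be in (0, 1]")
    return r_xy / sqrt(rel_x * rel_y)


# Hypothetical values: observed r = 0.40, reliabilities 0.80 and 0.50.
observed = 0.40
corrected = disattenuate(observed, 0.80, 0.50)
print(f"observed r = {observed:.2f}, disattenuated r = {corrected:.3f}")

# Low reliabilities can push the "corrected" value past 1.
too_big = disattenuate(0.60, 0.50, 0.60)
print(f"disattenuated r = {too_big:.3f}")
```

The second call returns a value above 1, which illustrates the warning in the text: the lower the reliabilities you plug in, the more the correction inflates the result.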

But that's, I mean, that's
something we should take away too. And

then, finally, regression to the mean:
too much to talk about here,

I'll punt this to later. And then
finally, this is

the formula that would lead
you to be suspicious of that table that I

showed you in Angela and Patrick's paper,
right? Not suspicious in the sense that

they did something wrong, but I have
questions about it, right? And that is that

as you increase the number of items on
your test you get greater reliability. So

if you ever are in this position of
doing massive scale development and

have like 200 items, do not
pat yourself on the back for

having a reliability of point 13
because you have hundreds of items.

Of course you do.
The average of that is going to be very,

very stable with respect to measurement
error. So that's why, when I report

reliabilities, I always also report the number of
items, because you should condition your

interpretation of the reliability itself
on the number of items that you've got.

And so this is just an example.
If the reliability is point one and

we double the test length, what is
the predicted reliability? So K

would be 2 in this case. In
the same way,

given any test length and
reliability, you could estimate

the reliability of a single-item test by
plugging in K as one over the number

of items. So if you ever really, really want
to take a gamble, and people do this, right,

with questions like
"Would you recommend this to a friend?"

That's what's called the Net Promoter
Score, and so the Net Promoter Score is

supposed to be this one true item
that tells you whether or not your product

is going to do well, in the business sense.
It's like a single-item scale, right? So

anyway, if you ever want to figure
out what the one-item

reliability, the one-item test, would be,
just plug in K as one over your number

of items. So these are all super handy
formulas that I would expect you to have

just kind of in your back pocket,
the way you have a standard deviation,

the way you have a correlation
coefficient. These are the basics.
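
The formula being referred to is the Spearman-Brown prophecy formula, which predicts reliability when test length is multiplied by K: K times the reliability, divided by 1 plus (K - 1) times the reliability. A small sketch using the doubling example from the talk, plus a made-up 8-item scale with reliability 0.80 for the single-item case:

```python
def spearman_brown(reliability, k):
    """Predicted reliability when test length is multiplied by k (k may be below 1)."""
    return k * reliability / (1 + (k - 1) * reliability)


# Doubling a test whose reliability is 0.1 (k = 2).
doubled = spearman_brown(0.1, 2)
print(round(doubled, 3))  # 0.182

# Single-item reliability implied by a hypothetical 8-item scale with 0.80.
single_item = spearman_brown(0.80, 1 / 8)
print(round(single_item, 3))  # 0.333
```

The prediction assumes the added or removed items behave like the existing ones, so treat the output as a benchmark for what to expect and not as a guarantee.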

So
I'm going to skip Cronbach's alpha. OK.

So what is reliability? What is
reliability? So I said reliability

is point 8, and you're trying to explain to
your uncle what point 8 means. What do you say?

And you can talk generally about
reliability as some sort of measure of

precision, and

that's good, but I also want to know, what is
point 8? What does that mean?

Actually it's a hard question.

Me because the good news is.

Good, good, good. So that's
right, that's the, you know, the sort

of coherence of the overall measure, and
it's, you know, on the sort of 0 to 1

scale, right? But then if you want
to get very specific and actually address

the magnitude itself, what
is point 8 in that case?

I did my usual motormouth routine, and
I said it a couple of times, but

very quickly and without pausing.

Good, good.

Good, that's a good rule of
thumb. Psychometricians cringe at rules

of thumb, but that said, it's one
that I don't mind cosigning for

general purposes. But all the more
reason to know what point 7 means, right?

So you're talking about a
signal-to-noise ratio, where you're talking about

the true score variance over
the error variance. You're close, and

it's just a conversion. But in reliability
you've got true score variance in

the numerator. That's good.
What's in the denominator?

It's observed score variance in
the denominator, total variance. So

how much of the variance that you see is
accounted for by that signal? And you can

get to that from the signal-to-noise ratio.
But when you see

point 8, you're saying 80 percent of the
observed score variance is accounted for

by true score variance. That's not
the only way to think about the

reliability question. You can also
frame it in just the way we think of

an intraclass correlation,

as a correlation in itself.

And it's a correlation, in this case, of 2
replications of the measurement procedure.

It's a correlation of X

and X
prime. That's actually why you write

it rho X

X
prime.

It's a correlation between X

and what we imagine a replication of X

to be, which is equivalent to the
proportion of the observed score variance

accounted for by true score variance. So it's

bilingual, in the same way that you can
think of an intraclass correlation as

a correlation and as a measure
of between-group variance, right?

In the same way, reliability is both
the proportion of observed score

variance accounted for
by true score variance and

the correlation between 2
replications of a measurement procedure.

The maxim that I like is:

a person with one watch
knows what time it is;

a person with 2 watches is never
quite sure. And that's kind of

what psychometrics is all about. It's
sort of saying, we always want to know

exactly how imprecise we are; we want to
be precise about our imprecision.

OK, so how do we estimate this in practice?
Here are 3 types of reliability. The first

is sort of the gold standard,
parallel forms reliability. We actually try

to do that: we try to replicate the whole
measurement procedure twice.

We create 2 different equivalent
forms. Imagine this urn, the psychometrician's

urn, this magical urn, right, full of stuff
like marbles. We sort of take a scoop

of the items and create one form, take
another random scoop of the items and

create another form, and then we give it
all to you, like, now, and we give it all to

you in some separate room on some
separate day with some separate raters,

and we try to vary all the things that
we care about varying, and

give that in a different scenario, and then
we simply take the correlation of the X

and X

prime, and that's parallel
forms reliability. Another

way we approach it is to do test-retest
reliability. What that does not capture

is the variance due to the items, because if
I test you and then I retest you again

I haven't drawn again from this urn of
items. So you want to think about all these

urns, of items, of raters, of
occasions, of tasks, and think about all of

those as contributing to your sources
of variance. And third, this is sort of

the weakest form, which you usually get the
highest reliability from, is

internal consistency reliability, which treats
all this stuff, like all the stuff that's

going on in this room right now, as fixed
and only considers the variance of items

within the little test that you
happen to have, right? It sort of says, hey,

instead of drawing from
this urn of items, I recognize that I've

already drawn from the urn of items. I can
split the test items sort of randomly in

half, take correlations of all those halves,
and think about how that is an estimate of

reliability. That's how internal
consistency reliability works.

So again I hope you're sort of bilingual.

In the order of the light.

consistency reliability 10 percent of

the time it's some weird approach using
I.R.T. that I'll talk about shortly.

And actually, Sean and
I, from our 2015 paper, I thought I'd

have this here, I might have cut the slide,
we actually show you the histogram for

all reported state reliability coefficients
that you see in practice, just to give you

a sense and point to point and
all of them are over point 7. In this case

they're centered on point 9
with a slight negative skew.

All.

Together for.

A purpose.

And then averages of that.

Here is what I skipped over. So

Cronbach's alpha is exactly that, and you
can actually prove that it is

the average of all possible split halves,
right? You split

in half every single possible way you can,
you take the correlation over and over and

over again. Now, that correlation, this is
where you combine Cronbach's alpha and

Spearman-Brown, right? You've taken half
tests when you split in half,

so you have successfully
described on average the reliability of

a half test, and then you use Spearman-Brown to
ramp that up to the full test. So

it's a nice and neat little exercise.
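
The "average of all split halves" result can be checked numerically. One assumption to flag: the sketch below uses the Flanagan-Rulon (Guttman) split-half coefficient, 2 * (1 - (var(A) + var(B)) / var(total)), for which the identity with Cronbach's alpha is exact; stepping up half-test correlations with Spearman-Brown gives the same answer only when the two halves have equal variances. The data are made up.

```python
from itertools import combinations
from statistics import mean, variance


def cronbach_alpha(rows):
    """Cronbach's alpha for rows of item scores (one row per respondent)."""
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def split_half_coefficients(rows):
    """Flanagan-Rulon coefficient for every way of splitting the items in half."""
    k = len(rows[0])
    if k % 2:
        raise ValueError("this sketch needs an even number of items")
    total_var = variance([sum(row) for row in rows])
    coefficients = []
    for half in combinations(range(k), k // 2):
        if 0 not in half:
            continue  # each split appears twice; keep the half containing item 0
        a = [sum(row[i] for i in half) for row in rows]
        b = [sum(row) - score for row, score in zip(rows, a)]
        coefficients.append(2 * (1 - (variance(a) + variance(b)) / total_var))
    return coefficients


# Hypothetical 1-to-6 responses, all items pointing the same direction.
rows = [
    [1, 2, 1, 1],
    [2, 2, 3, 2],
    [3, 4, 3, 3],
    [4, 4, 5, 5],
    [5, 6, 5, 5],
    [6, 5, 6, 6],
]
halves = split_half_coefficients(rows)
print(round(mean(halves), 6), round(cronbach_alpha(rows), 6))
```

With four items there are three distinct splits, and their mean matches alpha to floating-point precision.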

But you've got you've got
the intuition Exactly.

So this is the last thing
I'll cover, and then we'll take, I think,

a 5 minute break.

This is where your reliability
is not the reliability, right? And

this is the point that I think Brian was
sort of leading to: you should

think of the reliability coefficients that
you get in your technical manuals and

all of your state tests as being
an impoverished version of the reliability

you might imagine, right? If it's trying to
answer the question, how well does this X

correlate with this possible X

prime, then only varying items
doesn't cut it, right? And if you

were to vary occasions, if you were to vary,
like, scenarios, if you were to vary raters,

if you were to vary all these other things
that we might actually be interested in

generalizing over, then reliability would
almost assuredly be lower, right?

And so that's worth thinking about as you
are adjusting for reliability:

what exactly are the replications over
which I'm interested in generalizing? And

that leads to an entire, yet
another theory called generalizability

theory, which was developed by Cronbach
and many others decades ago. Bob Brennan

has done the most work on this
as of late, and it is something you should

know about. I'm not going to dig too far
into it right now, but I'll give you a couple of

key references. Brennan 2002, and a great
primer by my former second advisor Rich

Shavelson and Noreen Webb in 1991.
It's a nice little Sage primer, and

it's kind of depressing that it hasn't
really gone out of date since 1991. But

this is basically just analysis of
variance; it's pretty straightforward.

And it answers
a couple of questions. Tom Kane and

I did a paper on this that I present in my
class, about teacher observations, right, and

how many raters do you need, how many
lessons do you need, how many items do you

need to get sufficiently precise
estimates of teacher observation scores,

how many raters, and are, for example,
administrator raters different than peer

raters? These are the kinds of questions
generalizability theory is

really well primed to answer. This is my
colleague Heather Hill at Harvard, who

wrote a great article in Educational
Researcher. The title,

before the colon, was "When
Rater Reliability Is Not Enough," which is

to say, oftentimes we think, we've got
a bunch of raters, let's just see how well

they match with master coders. Not
enough, and I totally agree with her. So

I think, as you're
developing, if you ever have a scale that

depends on raters, you should definitely
start with rater accuracy and

then move quickly to generalizability
theory if you can, and

leverage the sources here. So
generalizability studies are expensive,

but they also are due
diligence when it comes to

real reliability, right? The reliability
you have is not the reliability you seek.