Andrew Ho: 2016 Psychometrics mini course - Part 1

## Transcript:

Well.

Many courses here we have

students from 2 classes.

For tool we are very

pleased to have joining us

to is a professor at the Harvard Graduate

School Education cyclists in the real was.

Who is going to be talking with us

today give you a flavor for some of.

The really important issues

in education measurement.

First thought only measurement folks

in education research to pull.

Lists as well.

Goal here is not to have everything

that one would learn in a foursome.

Sequence in psychometrics but to give

you a flavor of what's out there and

help give you some ground.

To air into is.

An accomplished

academic, he's a member of the National

Assessment Governing Board.

Member.

He's a former.

Master since you just pick from.

It's a really very pleased to have.

Your Thank you thank you.

We had a game against Michigan.

I shouldn't have started with that; the

crowd turned against me. So thank you for

this opportunity these are these

are weird things right like 3 hours

measurement What can I accomplish

I'm like actually not sure I think

the presentation is kind of a mess but

maybe deliberately So what I wanted

to sort of leave with like a couple of

provocations a couple of references and

the sort of nagging idea that you need to

learn more and that there are places for

you to do so and so I should start by

saying one of the places you can do so

is right here: you have a stealth

psychometrician in Brian, not to mention

a psychometrician in Matt, not to

mention people who know statistics and

measurement broadly, like you and Chris, so

you do have a bunch of people who I would

love to have the sitting

in the audience and

watching teach this in their own way and

I would learn a ton from it so

we all would teach this in

different ways. Brian and I actually

recently had an exchange about measurement:

he was talking to economists, and

I was reflecting on how he presented it,

and it was a ton of fun. And so I wish, you

know it's actually I was talking to Matt

about this earlier too I wish we could

talk more about how we teach this and this

is I think it's a great opportunity for

me to engage with you but also to get

a little bit from from people about

the different ways that they approach this

because we could learn a lot from them.

So here are my provocations. I kind

of want to do these like, you know, these

listicles, right? This is like how to

get people's attention: '7 things you need

to know about measurement.' But I'm caving;

I said that, and I'm going to do it anyway. So

here are those provocations.

I'm being deliberately kind of

extreme here, just kind of

to provoke a little bit. So

1: validate score uses. The question

I find myself asking my students most

often when they are starting to conceive

their measurement projects is 'what for?',

and not just 'what for' but 'what?'

Like, what is a score? Ultimately, whether

it's a scale score or an average of scores

or a regression coefficient, ultimately

those scores are what's used, and

we don't validate tests, we validate

those ultimate uses of scores.

2: content is king. Not models, content. So

if you start analyzing data ever

without knowing what items there were,

you should slap yourself on the wrist or

feel your advisor slapping you on

the wrist. What are you measuring?

You need to have an embodied

experience of what that

is. You need to put yourself in

the shoes of your participants and

feel what it's like to be a part

of that measurement procedure.

3: we have a tendency in measurement to jump at

the neatest, hottest-sounding models

with the longest acronyms. Stop it. Start

with simple descriptive statistics

in measurement. I think of classical

test theory as the descriptive statistics of

measurement. You should always start there,

in the same way that those of us

in Stata start off with what command?

summarize. So

alpha is your summarize of measurement, OK?

4: this is like 'these are not the droids

you seek,' or whatever it is. The

reliability that is most often calculated

is probably not the reliability

that you're interested in, and

I'd like you to leave here today being

able to answer your aunt or uncle when

they ask you, 'What is reliability?' I want

you to be able to answer that question. I

remember when I was getting my master's in

statistics and my uncle asked me

what the standard deviation is. Can you

think of how to answer that question for

your uncle? Sure, I could give you an equation,

but how do you actually describe that in

a meaningful way? I think that's

reliability: it's like the standard deviation of

measurement be able to what is point

that mean and you should be able to answer

that I mean today if you can't read.

5: item response theory is just a model.

We always hear, 'We should really

have someone who knows item

response theory,' or 'I wish I had someone who

knows item response theory,' which so

many could teach. Item response

theory is just a model, and

it is also a very, very useful model, so

I'm both going to demystify it and

focus you on what it does particularly

well. 6: your scale is pliable; bend

it don't break it and this is something

that Brian wrote about recently as well.

The numbers: you shouldn't

think of them as solid ground.

The distances that you see,

there's kind of, this is kind of

like a bridge with springs between

the boards, maybe. There's a sense

that the scale is pliable, shiftable,

but not breakable, and

there are actually empirical ways

we can address this tendency, and

the judgments you make based

on your scale information

should be robust to that springiness.

And then 7 again this is a reiteration

of other things know the process that

generated your scores and

use them accordingly do not go beyond what

the data suggests that's just a general

recommendation. So these are my

provocations. They sound just like

floating talking points now they're deeply

embodied to me and I'm going to try to

do my best to make them feel meaningful

to you over the next couple of hours.

So but I want to start by just stepping

back, and again, when you think measurement,

what kind of resources can

you use moving forward?

In the e-mail that I think some of

you received from me, you've got 2

to 3 citations, right, 2 references,

and the first are these: the Standards for

Educational and Psychological Testing.

What we're doing has decades,

centuries of history, and a field has built

up around it that has some guidance for

you, right? The 3 major bodies, the

American Educational Research Association,

the American Psychological Association, and

the National Council on Measurement in

Education, came together and

actually agreed on something, right? These

are the standards of the field and this is

a very powerful tome not just for you but

for the people who you're developing

tests for you can say I did this and

these are the standards of the field

you should have this book and

there's a discount for

members of this of any of these but

like if I were to recommend one thing

these are the authoritative standards of

the field they're not perfect I've

got a kind of quibbles with them but

they're powerful and

it actually reads as

a pretty reasonable intro text.

This one costs 160 bucks, and

I remember being a student, and

it's also very much a reference text, but

this is sort of the, quote, bible of the field.

It has all the heavy hitters

in measurement, who contributed chapters

to it. It is sort of the authority,

like what we would cite when we,

say, if you needed a generic cite for

reliability, you would go to

Haertel 2006; for a generic cite for

validation you would go to Kane 2006, and

those are the 1st 2 chapters in this book.

Do not buy it unless you

are really into this stuff.

But it but have it on reference of it's on

reference for easy reference your library

so much you put it on reference frame so

they would be my sort of go-to books

for educational measurement,

probably, and you should

look to them, if something is provocative

to you, as places to go for citations.

So how do we learn measurement? I think

this is important to visit too. So,

like, you know, this is very much, I think,

the way this reflects

the philosophy of teaching here too, but

I always want to sort of point to my

syllabi, and I'm eager to look

at other syllabi too, from

other folks who teach measurement. But if you want

other references, they're up on my

website. There's what

I think of as where to go for IRT,

what I think people

should read when it comes to differential

item functioning, what I think people

should read when it comes to standard

setting. So don't forget to look

at people's syllabi when you're reference

hunting and you want to say, 'Someone

set a cut score; who should I cite for

the whole cut

score setting thing?'

Syllabi can be very useful, in addition

to this tome,

where it's Hambleton and Pitoniak who

did standard setting. So

my syllabi and the syllabi of others

like Matt are good places to go for

references on these things.

And then again, like, you know, learn

it, use it, learn it again, and

use it again.

Practice makes perfect, and

in all of your classes, your methods classes, I

think you're using data, getting your hands

dirty, struggling with those Stata error

codes, looking at those manuals, right? So

you know, help, help, help, then clicking on

the PDF: that's what you're going

to be doing, and I just

want you to recognize that measurement,

much like the other methods you're

learning, requires that struggle and

that patience. So this is something

that's a little trick I just want

to give a shout out to a few people

who contributed to my Google Docs

this is something I do to like incentivize

and encourage out of class reading and

out of class discussion so

there are a bunch of tools for this and D.

has and it has a new thing called to

result at some of my colleagues at Harvard

of develops or just open google docs but

it looks like this this is what I asked

some of you to contribute to I said

Hi I'm students is the typical pre class

discussion that we run it is to see I

ask questions I have you respond to them

by 10 pm last night and then I reply

like I am kind of a night owl and so

like in between 10 pm and Dawn for a class

I like I sort of give people like answers

I have little conversations and sometimes

for the for the people who contribute late

like around 9 pm We get into these

sort of discussions online actually

how did someone who was I know who it was

but I was online with some one of you.

And and just going a little bit back and

forth about how to write out

an equation so so thanks to Josh G.

Josh G.

There you go Josh so so

so we I had a little.

Cross talk with you I'm not sure if you

check back in solid I saw what I wrote but

this is a part of the read write

do this is just sort of like

how to sort of stay engaged.

And here's an actual Doc So

thanks to Josh G.

thanks to Fernando you can

see I'm replying italics here

thanks to Stephanie H.

Stephanie I missed your comment you must.

OK I'll reply later I promise Karin.

So there are a lot of really good.

Some derivations we did here Stacey.

So yeah and then some that there's always

I always sort of leave a space Cassandra

didn't get a chance to reply to you but

Stacy and

Josh here there's a general space for

general questions and discussion and

you asked and good general questions

that you can have time to engage with

towards the end of this class today

I should say to like interrupt.

But I think it's important sometimes

you just might talk straight street

pedagogy and like how to learn this stuff

and how to stay involved so read write do.

So again, there are 7 principles

I'm going to start with, and

I'm going to start at the beginning and

start straight from validation. So

we don't validate tests, we validate

score uses. To talk a little bit about

validity theory,

this might seem a little bit detached, and

I'm going to get more technical later on.

I know I'm talking

to a trimodal

audience, potentially, so I think I might

interest some of you sometimes and

others of you other times, but

all of this is important and belongs to

the body of measurement. So I hope you'll

remember that even the things that may

sound theoretical, even the things that may

sound too technical, are all on

a continuum, and think of it as what we do.

So, validation. This is the more recent

depiction of what I think of as

the standard for validation in educational

measurement in particular. I contrast that

with Matt, who teaches more from the

psychological measurement paradigm, which I

think has a slightly different perspective

on validity. But Michael Kane and

the field in educational measurement

in particular are very utilitarian,

very instrumentalist: we

care about the ultimate use,

right? It's almost atheoretical,

if you take it to a certain extent:

we just don't even care about the numbers

as long as the interpretation or use

is correct. That's extreme, but that shows

you what they're emphasizing

here, what we're emphasizing here: to

validate an interpretation or

use of test scores is to evaluate

the plausibility of the claims based on

the scores. It's an argument-based approach:

you are building an argument with evidence

over time. There is never a point

where something is valid; it is part

of an ongoing evidence-building process,

and that is deeply unsatisfying,

right? Wouldn't it be great if there

were a correlation coefficient and

once it exceeded point 7 you said 'check'?

This is a super frustrating

call to you to never do

that, right? And I think it's particularly

frustrating for introductory students,

even for

those who might not have careers in

measurement, who say, 'Do I really have to

do all this?' And I guess I'd say,

actually, no, you don't really have to do

this, but at least you have to know that

these are the standards of the

field, even if you selectively ignore them.

So again, this is just for

measurement. There is my colleague

Derek Briggs, who kind of disagrees with

this utilitarian, instrumentalist

current status quo in measurement.

There are actually debates about this in the

field, right? What is validity? You can

write a paper on that, right, and contribute

to the discussion of what it means for

the use and interpretation of test

scores to be valid and appropriate.

We broached this this morning when I was

talking about the use of this new data

set that my colleagues and I have created

that allows you to compare

school districts across states.

We don't just ask, 'Is that valid or not?'

We ask, 'Are the uses of that,

are the interpretations of that, valid or

not?' And there was some really good

feedback from the faculty members and

students in the room about

which research designs and

which research inferences would or

would not be appropriate in those

situations. It's very similar: what are you

trying to do with the scores, and

is that appropriate, is that supported

by the evidence? So I would say these

are different

definitions of validity, or

sort of schools of thought about validity.

I'm not reading all this text because

I'm sort of leaving

these slides as a reference, but

we are in a very instrumentalist,

even utilitarian moment in

educational measurement, where we care

ultimately about how you're using those

scores. It's not about the test or

even the construct; it's about the score use.

So again, modern test validation theory

is dominated by instrumentalism,

concerned with test uses and interpretations,

and I'm acknowledging that this can

be frustrating because it kind of takes

the control away from your special little

instrument and its ultimate

scores and places it in this fuzzy domain

where people pick them up and use them, and

you might kind of be responsible for that.

So I think of validity, and

as I tweeted before,

I'm not ashamed to use mnemonics, and

so I think of 5 sources of validity

evidence, and I call them the 5 Cs. So

the 1st is content, right? So the 1st thing is

to take the test: what is it measuring?

There is a good overview of alignment

of 4 big testing enterprises

to the Common Core. Recently Morgan Polikoff

and Nancy Doorey published this piece

at Fordham earlier this year, which

is basically a content study, right?

Do PARCC and Smarter Balanced, these big

testing consortia, as well as

MCAS and ACT

Aspire, the Massachusetts state test

and ACT, align to the Common

Core State Standards? This is a content

study, and I think there are too few

people, frankly, delving into this

arena, which is currently

sort of dominated, I think, by

more model-based, statistically based

approaches. So I'm just reiterating

that content is important, seriously

important. Cognition

is another source of evidence:

when you take that scale,

are you thinking what the designer

intended you to be thinking, as I'm

thinking through this math test,

as I'm thinking about whether or

not I'm gritty? Think about the

studies of grit recently that have been

concerned about reference bias, right?

That is to say, do I feel gritty, and

can you compare it across courses, or

am I referencing my grit to the people who

happen to be in this school or in this

classroom, right? So how are people thinking

about it cognitively?

The evidence we can get

often comes from think-aloud

protocols as well as a parable analysis.

Coherence is where the field

seems sort of stuck with validity, and

a lot of what I'm going to

talk about subsequently is going to be in

this 3rd C. So this is where

reliability analyses come up, EFA, CFA,

IRT. This is what Matt teaches, as

well as me. This is, I think,

what people sort of assume measurement is

from a technical standpoint, and what I'm

highlighting here is it's only one C,

right? You've got to think about content,

you've got to think about cognition, and sure,

you can do your reliability analyses, but

that's only a piece of the puzzle. Another

piece of the puzzle is correlation. This

comes up a lot in structural equation

modeling, comes up a lot in economics too,

where you're trying to predict future

outcomes: does this predict college

attendance, graduation, or

college entry, or freshman GPA,

or future outcomes? Or, more concurrently,

does this correlate or

not correlate with things

that should be similar and

things that should be different? You

sometimes hear this as convergent or

discriminant validity. But this is

again only a piece of the puzzle, and

the fifth C is consequences, right? Evidence

based on the consequences of testing:

you could think about this even as

a counterfactual. Had I not undertaken

this measurement enterprise at all, what

would have been the difference? So

this doesn't think about the scores as

much as the use of the scores, and

whether the act of testing and

measuring itself has had some consequence.

This is a fairly controversial,

relatively recent addition to

the validity framework, but

these 5 sources of validity evidence are

clearly articulated in the Standards and

are what you should think of when you're

designing a measure, when you're using

a measure, as the kinds of

evidence you can leverage.

So this is sort of in contrast

with what I think of when I

think people are thinking of validation

commonly: 'I developed a scale with good

theory, I fit a CFA and got a good

confirmatory fit index,

my reliability is greater than point, my

scores predict desirable outcomes, so I

have a valid, reliable measure.' That's

the common articulation of

a good baseline study, and I'm setting

that up as, you know, so that's content,

that's coherence, that's coherence too, this

is correlation, and that's incomplete: you're

sort of missing cognition, you're

missing consequence, you're missing this

argument for use. What are your scores, how

are they used, what would have happened had

you not measured? And so

these are other questions you could ask

to complete this validity

framework. So it's more than just

a good fit index and

good item parameter estimates.

So again 7 key principles we don't

validate tests we validate score uses

That's what I was covering and I want

to emphasize content a little bit and

then dig into a little bit

of classical test theory. And

I think that'll probably take

us to the break or thereabouts.

So.

Let's and

then we'll get into reliability and

IRT shortly after. So

this yes sure.

Yeah talking about consequences how it

should be in the context of what you

mean like if if a student had never

been tested then what other measure

to measure look underlying ability to or

think that we're getting

the score ultimately is is yeah is used

for something right so once once we

test what's the sort of theory of that

making a difference in some way and

it could be like publishing an article and

having that feedback into the system it

can be very abstract in that way it

could also be the teacher is going

to use it to give you feedback and is

that feedback going to have a positive or

negative impact on you, right? Or it's going

to lead to a value-added estimate for

a teacher, and they're going to respond

differently in how they teach. So it's like, had

that not happened, the whole

process, not just

the score but the use

of the score in this theory of action,

had that not happened, what would be

the difference? So I think that's kind of

a pretty gold-standard level of,

like, I mean, we're talking about

a major evaluation at that point, which

is why this is sort of a controversial

source of validity evidence,

because, like, good luck, and

how long do you wait for

long-term outcomes? But, but this.

I mean from an economic You can be

because of the kind of catch all the time

you know you have people.

Like how can you think of them.

So again so you know in all the ways

that I think you're trained to write

as economists right so I think you again

like and I didn't I wasn't being glib and

I was sort of saying this is like why

we're glad we have people like you is

because I think you are asking like what

you know what is the counterfactual for

you know if we didn't have high stakes

test based accountability like we'd

have some sort of paper by some guy named

Brian Jacob and Condi or something and and

and sort of think about what happened had

there not been this rise in accountability

at this particular time so these

are the kinds of evaluations, I think.

I'm not solely putting this

in economics like that, but

that said, I do think my

encouragement to you is to never just

think of a test as something

that's validated up in the air, but

as something that results in a score

that is used for a purpose. And

if that purpose is for you to publish and

get some correlation coefficient and

get in a journal, then that's great, and

that's part of your theory of action, and

that's pretty light. But

ultimately I sort of say,

'You know, why are you doing this?'

That's where I'm sort of

pushing people to go: ultimately

your scores are used by people for

something. Can you

describe that to me, please? And

that's what I find myself asking most

students; that's what's missing.

When they say, 'I want to

create a measure of X,'

I'm like, 'Why? Those scores,

what are we going to do with them?

What's going to happen?' And

that's often what I find

missing in their thought process.

Thank you we're here and I'm trying

to figure out correlation you said

evidence based on relation

to other variables and

so I'm wondering if by that you mean like

I would validate one standardized test by

its relationship to student scores

on a similar kind of test of similar

kind of content or reading of

things much broader than I'd like.

This chance to and how the critics

like high school graduation you're

going to college and so

how would I know those kinds of things

before like if I'm using these as

a foundation for measurement and

developing I haven't given it yet so

how do I have evidence on this is.

So this is why Cronbach and

all the

people who have developed validity

theory over time have been

very clear that it is an ongoing

process. I mean, again,

this is where psychometricians struggle

with dealing with the outside world,

because the outside world is like, 'Show me

your valid measure,' and you're like, 'But

this is this process that takes a while,'

and they say, 'Show me your valid measure.' And so

it can be frustrating, but

this is how the field thinks about it. I

think you have to wear different hats

when you're talking to people who have

that definition of validity, and

just say this checks all the technical

boxes. And you do want at least some

correlations with

concurrent variables in some way. But

look at the MCAS technical

report; look at the report here for

your tech your deep What is it now and and

you'll see that all of these are laid

out in there in varying degrees of

depth, and usually coherence is a massive

section with classical test theory, IRT,

differential item functioning, and the like;

correlation is small;

consequences is a paragraph;

cognition is like, 'We did a lab'; and

content is very, very fleshed out with

content frameworks and the like. So

this is why I explicitly walk

through technical manuals. You know,

when you finish my class you should be

able to read a technical manual for

a state testing program whose data

you're going to use and figure out

what implications it has for your own

analysis. Yes, that's a good model to check.

That's.

What are some of.

The valid for the test but for.

What are some of the kind of.

Thing and I'm wondering when you

were talking about federalism focus

you seem the one to see complex necessary

if it was really going to meet that.

Goal or.

Cause.

Geared up on care I don't care what

I don't hear the reliability of

its core how well he learned

in college now that.

You're going to be

anything other than for.

This is a good question. So

this is where the economists,

we probably should, like, over

dinner, over drinks, at

some other point, have a detailed

argument or debate about why these

things should matter. I mean,

from a very utilitarian standpoint, in the

near term, before you get those long-term

outcomes, you know, if you're developing

your own measure, you need to stand on

something in the near term, before

you've got those long-term outcomes.

The here yeah and

it also I think it also I mean

I don't know like if you happen to find

some spurious correlation of something

I mean, there's got to be some, and you

are interpreting, when you interpret an ACT

score, that there is some sort of college

readiness. And you know, when you say, like,

point 3, it's like socioeconomic

status correlates point 3, and

you don't say 'you are college ready'

based on socioeconomic status, right? And

so the interpretations we use matter:

that's the sort of psychometric argument.

So, you know, be

specific about that interpretation:

what is the warrant for

that interpretation? And

if it's only based on socioeconomic

status, then the warrant seems

Detached from the human So I think this

is a deeper philosophical argument you're

raising that I don't think should be.

So I but I think it's a good one and

certainly some that might that my

students have advocated for and

it's certainly econ leaning.

But you know the you know what I

often fight with is like why do

we care about freshman G.P.A.

I mean look at that's a horrible measure

I kind of wanted to kind

of want freshman G.P.A.

to predict my on my high school test

because that's a better measure because

of the content the directionality I

mean so it's does arise I think from.

The items in the content is

the is psychometric percent.

But so, on to a little bit

of classical test theory and

the tools that we use to

evaluate, in particular, coherence

and content. So this is sort of

my checklist for how to

get into a secondary

analysis of test score data, right?

You get a Stata .dta

file, and it's got people in rows, and

there are all these

columns that correspond to items.

I'm going

to skip around, I'm going to go 1, 2, 3, 7, 8 or

something like that, but this is

part of a larger checklist, and

again, you know, this is from

John Willett's presentation as well.

Know your items, right? Read each

one, take the test, get a sense of what it's

trying to measure.

So this is an example from a

measure of self-perception of

teaching success: 'You have high standards

of teacher performance.' 'You're continually

learning on the job.' 'You're successful in

educating your students.' 'It's a waste of

time to do your best as a teacher': this has

negative polarity. 'You

look forward to working at your school.'

'How much of the time are you satisfied

with your job?' Right, and so my

advice to you is never go into an analysis

without actually looking at the items and

taking the test, scoring the test,

thinking of yourself as a subject. And

then you have all these, like, your

scale for these items is 1 to 6; you see here

someone snuck in a 1 to 4 item.

This happens from time to time, so

do not get caught unawares. Do not type in

alpha without recognizing that some of

your variables have different item scales

than others, because it will give you

incorrect answers. So take control

of your scale and know it backwards and

forwards. And

again, in the interest of

time I'm going to jump through this.

Always know the scale of your items,

right? Know how to score your test:

how is it actually being

scored? Is it a sum score? Is it an

average? Are you reversing

the polarity of

some of your items? Are you

stretching the scales of some of them so

they all go from 0 to 100?

How are you actually scoring it?
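
As a minimal sketch of the scoring question above (sum score versus average), the Python below assumes a hypothetical four-item scale in which `None` marks a skipped item. The function names and the responses are illustrative assumptions, not something shown in the talk.

```python
def sum_score(responses):
    """Add up the answered items; None marks a skipped item."""
    return sum(r for r in responses if r is not None)


def mean_score(responses):
    """Average the answered items only, ignoring skipped items."""
    answered = [r for r in responses if r is not None]
    if not answered:
        raise ValueError("no answered items to score")
    return sum(answered) / len(answered)


# Hypothetical respondents on a 1-6 scale: one complete, one with a skipped item
complete = [4, 5, 4, 3]
partial = [4, 5, None, 3]

print(sum_score(complete), mean_score(complete))
print(sum_score(partial), mean_score(partial))
```

The two rules agree on how to rank complete respondents but diverge once items are skipped: the sum score drops with each missing item while the average does not, which is one reason to know which rule produced the scores you were handed.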

So if you look here, right, again,

what I

recommend that you do when you're actually

going through this is reverse it yourself:

take control in Stata and reverse

code it so that they're all pointing

in the same direction,

because otherwise

I have found myself making mistakes.

This is some very practical advice for

you to not slip up

in the early stages of an analysis.

So, you know, again, look at your data, get a

sense of the missingness, label your items,

make absolutely sure your item scales

are oriented in the same direction or

you're using code that

recognizes when they're not.

Positive should mean something

similar; if not, fix it.
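
The reverse-coding advice above can be sketched as follows. The talk does this in Stata; this Python version is only an illustration, and the variable name and responses are hypothetical.

```python
def reverse_code(value, low, high):
    """Flip a response on a low..high scale so the top category becomes the bottom."""
    if not low <= value <= high:
        raise ValueError(f"{value} is outside the {low}-{high} scale")
    return low + high - value


# Hypothetical responses (1-6 agreement scale) to a negatively worded item,
# e.g. "It's a waste of time to do your best as a teacher"
waste_of_time = [1, 2, 6, 1, 3]
waste_of_time_rev = [reverse_code(v, 1, 6) for v in waste_of_time]

print(waste_of_time_rev)
```

Keeping the reversed item under a new name, rather than overwriting the original, makes it easier to check later which direction each variable points.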

Here's more exploring I have a mandate

that people always like give me discrete

histograms for item scales I want to

know like how many ones there are how

many twos how many threes fours fives and

sixes I want to see if you've got a 7

point Likert scale if no one is picking 6

or 7 ever I expect you to know that from

the very beginning and don't start running

I.R.T. until you have a sense of what your

data actually look like.
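
One way to get the discrete histogram asked for here, sketched in Python under the assumption that an item's responses are a plain list of integers. `discrete_histogram` and the sample data are hypothetical names and values for illustration.

```python
from collections import Counter


def discrete_histogram(responses, low, high):
    """Count responses at every scale point, including points nobody picked."""
    counts = Counter(responses)
    return {point: counts.get(point, 0) for point in range(low, high + 1)}


# Hypothetical responses to one item on a 7-point Likert scale.
item = [1, 2, 2, 3, 5, 5, 5, 4, 3, 2]
hist = discrete_histogram(item, 1, 7)

for point, n in hist.items():
    print(point, "#" * n, n)

# Scale points that were never chosen, e.g. nobody picking 6 or 7.
unused = [point for point, n in hist.items() if n == 0]
print("never chosen:", unused)
```

Listing every scale point, not just the observed ones, is what makes an unused category visible.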

This is important as well does

a one mean one at all times it is

is it always like strongly disagree

when you have a scale that goes

like one to 4 right so if I have

strongly disagree to strongly agree and

then I have not successful to very

successful and this is one to 6 and

this is one to 4 and I throw that

into alpha if I throw that into like

a reliability analysis what is

alpha going to do it's going to assume

that very successful means slightly

agree does that make sense.

It could make sense you better think

about it and make a decision so if so

the idea here is that all of

these item scales are not distinguished

in a classical analysis it will

think of ones as ones and

sixes as sixes so you better take control of

that and make sure that that's right so

often what that entails is 2 things

one stretching this 1 to 4 to a 1 to 6

or actually just forcing this to be

one forcing this to be 6 forth and

forcing this to be what 2 and

like actually equally spacing that item

out so that you're saying not successful

is like strongly disagree very successful

as like strongly agree so one of the big

mistakes I see people making when they

get the scale as a secondary data analyst

is assuming that all items

are sort of interchangeable and

that the polarity doesn't matter and

you need to sort of take control over that.
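
The stretching option described above, sketched in Python as a plain linear rescaling. This assumes equal spacing between scale points, which is the choice the speaker describes. The helper name `rescale` is hypothetical.

```python
def rescale(value, old_low, old_high, new_low, new_high):
    """Linearly map a response from one scale range onto another."""
    fraction = (value - old_low) / (old_high - old_low)
    return new_low + fraction * (new_high - new_low)


# Stretch a 1-to-4 item onto a 1-to-6 scale: the endpoints land on 1 and 6,
# and the two middle points are spaced equally in between.
stretched = [rescale(v, 1, 4, 1, 6) for v in [1, 2, 3, 4]]
print([round(v, 2) for v in stretched])
```

Whether "not successful" should equal "strongly disagree" is a judgment call. The code only makes that decision explicit.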

Another way to approach it is to

standardize within each item so

what you're doing is you're

just dividing by

the standard deviation in each

item and in that case

you're saying that strongly disagree here

and strongly disagree there might not mean

the same thing depending on the variance

of each of those item distributions.

And that's weird too like when your

Likert scale items are all

strongly disagree to strongly agree do

not standardize right because strongly

disagree means the same thing across those

items and if you standardise you lose that

information Similarly if you have an

educational test that has like correct or

not correct should you standardize

absolutely not correct is correct and

the same thing so do not standardize

you know in those cases either as these

are the like the little things that seem

trivial and I feel like in my in my own

work and my own students' analyses

I keep running into them coming up

with absolutely incorrect alpha values

or even just like the baseline

descriptive statistics let alone getting

to I.R.T. or structural equation modeling

or et cetera so you've got to take control

of your data from the very beginning and

be very very careful and intentional about

every single step that's like general

advice for statistics period right but I'm

saying it still applies to measurement.

OK So this is a baseline reliability

analysis check this out alpha x1-x6,

asis

that should be your template

x1-x6 gives you all these

items 1 to 6 and asis I

have this sneaky suspicion that this

is leading to inflation of reliability

coefficients among Stata users

and perhaps other programs as well but

what asis does is it says the direction of

the scale like the direction of the item

scale positive is always positive

like if you coded it as positive I'm

treating it as positive if

you don't include asis

there could be a really bad item in your

scale that correlates negatively with all

the other items negatively and

Stata will flip it for you.

Without telling you will show up here but

you might not notice it without telling

you it's going to flip it for you which

is to say you've got such a bad item that

Stata says it can't possibly be

that bad and reverses it for you and

that's crazy to me that they do that and

so my thought is that there are

a lot of elementary analysts dramatically

over-interpreting their simple

alpha their simple reliability

value because Stata is

going I know best

but anyway so

this would be my default code to make sure

that you're controlling it appropriately

be intentional at every

step of your analysis and

know what the direction is and

know what the scale points are OK.
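
To see why the as-is direction matters, here is a minimal pure-Python sketch of Cronbach's alpha computed on the items exactly as coded. It is not Stata's implementation. The function names and the six-person data set are made up for illustration, with `x4_raw` standing in for a negatively worded item that has not been reversed.

```python
def variance(xs):
    """Sample variance with an n - 1 denominator."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def cronbach_alpha(items):
    """Alpha for a list of item columns, using the items exactly as coded."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    item_variance_sum = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Hypothetical 1-to-6 responses from six people.
x1 = [1, 2, 3, 4, 5, 6]
x2 = [2, 2, 3, 5, 5, 6]
x3 = [1, 3, 3, 4, 6, 6]
x4_raw = [6, 5, 4, 3, 2, 1]  # negatively worded, not yet reverse coded

alpha_as_is = cronbach_alpha([x1, x2, x3, x4_raw])
alpha_reversed = cronbach_alpha([x1, x2, x3, [7 - v for v in x4_raw]])

print("as coded:       ", round(alpha_as_is, 3))
print("after reversing:", round(alpha_reversed, 3))
```

With these made-up numbers the as-coded alpha comes out negative and the reversed one is high. A tool that silently flips the item reports only the second number, so you never see that the first one was a warning sign.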

this is I'm going to I'm just going to

sort of hand-wave through this but these are

various discrimination statistics

they basically are like does this item

correspond to the sum of

other items on the scale

does this item correlate with other

items and this is the coherence question

this is an internal correlation does this

item correlate with other items on a scale

which is really kind of what is at

the heart of classical test theory I.R.T.

structural equation modeling

factor analysis and the like.

This is an example of a little bit

of you know more pseudo code from

Stata for you.

How many people don't use Stata.

So and you're using

Mplus.

Because this is why we include a whole

bunch of do files and I've sent Bryan

a couple off and I'm more but I'm happy

to give you sort of templates for this.

too we'll talk we'll talk more about that

the simplest of the good cos it will

test every kind of descriptive stats.

To the you know like OK you know.

Anyway what we're worth running so like I

mean they they presume that you sort of

done all that already and so do all that

already like to do that 1st as a as I'm

recommending it as make sure you sort

of have control over your scale.

So again you know coming in as a sort of

content is king there in the sense of like

you know your items know your scale and

get a sense of what it's

trying to measure and

don't just validate it based on whether or

not it predicts life earnings next.

But if it were the debate.

What exactly were they.

Looking at like that.

In the sense not in the sense

of like I mean you want to

read a book on the question

because I want to get.

More.

With.

Like I mentioned.

Some of that question but maybe.

I can see Mollenhauer.

All.

Right so this is this is a subscale

question this comes up all the time so

Alpha is a property of of a of a scale

right and if you want to create subscales

get get information about each of your sub

scales that's what Alpha should be for and

what else if you throw an alpha across all

of the items across subscales it's asking

how coherent is this across subscales So

the question I always ask people who

are using subscales is what's the question

how are you using your scores right so

that you know if you take a cynical

approach from like you know at heart of us

always like if you give policymakers

two numbers they'll add them together.

So that you know so

this is like the you know so

the Grit Scale the Angela Duckworth

the Duckworth and Quinn 8 item Grit

Scale there are 2 subscores what do we

think people are doing with them.

Adding them together so if you want

your question my question is what

your question should be what is the

property of the score that is being used.

This is that this is the utilitarian

sort of instrumentalist view and

if you are creating a scale such

that people are using those subscales

then evaluate each of them accordingly and

then take alphas for

each of those subscales report alphas for

each of the subscales I'll show

you how Angela and Patrick.

Do this and

shortly in their actual paper so

yeah so so so which is just to

say good to have subscales but

then then what I would

do is alpha and C.T.T.

analyses on the subscales and later we'll

talk confirmatory factor analysis and

all that jazz or actually that well

that's what his class is good at.

In particular.

So let's.

Go So this is this is the this is a paper

that I have everyone in my class dig

deeply into this is Angela Duckworth and

and Patrick Quinn's.

Journal of Personality

Assessment paper in 2009 that

I was talking with Matt about this is

a very common practice to develop

a scale that has not

way too many items but a lot of items and

you might you might want to think

about how to administer them feasibly

in a flexible situation and so you can

use classical test theory and item

response theory they're both very very good at

figuring out how to shorten that scale

like how to how to preserve information

while while reducing the number of items.

This is a say you know I just gave myself

I just gave you advice I'm trying to

follow it this is sort of a brief

description of the Grit Scale I actually

have my students take this so we can

like analyze their data new ideas and

projects sometimes distract me setbacks

don't discourage me I've been obsessed

with a certain idea but I am a hard worker

I often set a goal but later choose to

pursue a different one so I'm shortening them a little

bit this is to give you a sense of how

grit is operationalized so this is their item

scale in this paper they're sort of saying

we had a 12 item scale we're going to 8

it will all be fine don't worry about it.

So part of my screenshots

here see table one for

item level correlations after excluding 2

items from each subscale I'm talking subscales

here right the resulting 8 item

Grit Scale displayed acceptable

internal consistency that's code for alpha

with alphas ranging from point 73 to point

a look at their Table 2 right

again we spent a lot of time digging

into these articles in class so

this is like you know West Point the

famously her National Spelling Bee sample

Ivy League undergraduates and these are

Cronbach's alpha values these are the values I

was describing point

the sum that's the total scale that's

the that's the reliability coefficient.

For the overall scale and

then she breaks it down into perseverance of

effort and consistency of interest and so

the question I would ask in this

case is again what's being used and

if you're treating these separately you

can see what their alpha values are and

then if you're treating them as

a whole that that's the that's so

you can sort of cover your

use cases here and say for

those purposes here is your level of

internal consistency that makes sense.

Absolutely and so this is why your

classical test theory statistics are your

descriptive statistics your knee jerk 1st

reaction and after that we're going to get

to a more powerful framework that allows

you to answer questions like the ones

who's asking and so this is what I

consider level one this like summarize and

I really do mean that is like the very

after that you get to more

sophisticated questions OK so

by the way the what I always have one

of my questions my google doc questions

is is kind of this annoying I guess

what I'm thinking questions but

it's like Does anything look off to you

about this and I'm just going to sort of

this is like a tough question so I'm just

going to pause and and the just take

a look at this table in particular these

alphas these alphas compared to these

alphas and I just so this is you know

4 items 4 items and 8 items and

I just want to sort of this is to have you

take a look at that and just get your

gut reactions as to what

I find a little surprising.

There's a bit of it that.

I have a plan to in the audience.

Try and.

There can be a couple answers here so

don't be shy.

Yeah.

For example.

For example the man.

Who does point 73 or

an 8 item scale I have that's wacko.

Right and so I'm not sure if he's

correcting for that and didn't mention but

or if there's something weird going

on in the sub scale relationships but

that is not what you expect what you

expect when you have many more items in

fact we're going to show you a prophecy

formula that predicts this when you

have more items in the same way that when you

average over more things you have standard

deviation over root n as your precision

the more you average over the more

precision you have so it is a little

surprising that's accurate that's

a kind of disciplined perception that

you'll develop with with with measurement.

Cause that a lot but

Joining me to go from this.

Which is that much that.

You were to be purely So this is one

way right so we're going to develop

even better ways with I.R.T. But this is

just sort of a ranking of how each item

correlates this is the item-rest

correlation it's literally the Pearson

correlation a simple vanilla correlation

between an item in one column and

the sum score of all the other

items in the other column so

this is a measure a very descriptive

statistic again kneejerk summarize level

descriptive statistic of how well this

coheres with the rest of the scale

we're going to see a better version

of this as we get to I.R.T.

but they very rarely tell

dramatically different stories so

this is why again we sort of start with

our feet on the ground with a basic

analysis and then get advanced in

I.R.T.
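
The item-rest correlation described above, sketched in pure Python: each item is correlated with the sum of all the other items. The helper names and the toy data are assumptions for illustration. The fourth item is deliberately noisy so it lands at the bottom of the ranking.

```python
def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def item_rest_correlations(items):
    """Correlate each item with the sum score of all the other items."""
    results = []
    for i, col in enumerate(items):
        rest = [sum(row) - row[i] for row in zip(*items)]
        results.append(pearson(col, rest))
    return results


# Hypothetical 1-to-6 responses from six people; the last item is mostly noise.
items = [
    [1, 2, 3, 4, 5, 6],
    [2, 2, 3, 5, 5, 6],
    [1, 3, 3, 4, 6, 6],
    [3, 6, 1, 4, 2, 5],
]

irc = item_rest_correlations(items)
ranking = sorted(range(len(items)), key=lambda i: irc[i])
print([round(r, 2) for r in irc])
print("weakest item index:", ranking[0])
```

Leaving the item out of its own comparison score is what separates item-rest from item-total correlation. Otherwise every item gets credit for correlating with itself.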

So to answer your question if you

were to be purely cynical about it and

didn't care at all about content

you'd drop maybe 1 and 3

you know rerun the item-rest

correlations maybe drop a couple more if

you felt like it and then calculate Alpha

for whatever's remaining and I don't

recommend you do that because content is

king that's the be careful of throwing

away a subscale you care about but

and imagine that an educational test where

suddenly you're not measuring math or

something right so you can imagine

that that would be dangerous but but

but that's from a purely statistical

standpoint that's what you could pull off.

So so good we're going to great I'm going

to judge do I have this Yes Skip to

slides on classical test theory

that's that's it sort of there for

you there's a bunch of equations there

sort of putting that in as like stuff for

future reference what I want to

do is talk a little bit about

why classical test theory is a theory and

what it predicts and

why it seems like it's useful so what is

classical test theory actually predict and

why do we think of it as theory the 1st

you know what can we infer from classical

test theory 1st variation

increases reliability and so

this is akin to the logic you are using

might sort of flip it on its head that you

know if you if you

ask for the reliability of Grade 3 grade 4

grade 5 scores together you're going

to get like a point 85 right so

as you increase the variance right in the

same way that we you know as we know from

correlations period reliabilities are just

correlations I forgot to forgot to ask

what's reliability We'll get to that

again but just like any other correlation

as you increase the variance

you increase the scatterplot

rate you increase the sense of correlation

and you know whiteboard in here.

Later is that.

So I know that.

Because I don't work wow OK that the mind

OK I'll get the I'll get that shortly.

Go see Thank you.

So this is a rederivation of

reliability if you square

both sides put error put X

under error that's the proportion

of error variance and

then one minus that is the proportion

of true score variance.

So that that's reliability and so if you

do a little bit of algebra here you get

this expression and so

in terms of the observed set you can

you can derive the standard error of measurement

in terms of the observed as the observed

standard deviation and reliability and

as you increase that standard deviation

you you get you get you get you're

going to increase your reliability

in the same way that I'm going

to draw right now so thank you.
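
The algebra sketched verbally above can be written out as follows. This is a reconstruction from the standard classical test theory definitions, not a copy of the slide. It assumes observed score X is true score T plus error E, with T and E uncorrelated.

```latex
\begin{align}
X &= T + E, \qquad \sigma_X^2 = \sigma_T^2 + \sigma_E^2 \\
\rho_{XX'} &= \frac{\sigma_T^2}{\sigma_X^2} = 1 - \frac{\sigma_E^2}{\sigma_X^2} \\
\sigma_E &= \sigma_X \sqrt{1 - \rho_{XX'}}
\end{align}
```

The last line is the standard error of measurement in terms of the observed standard deviation and the reliability. The middle line shows why, for a fixed amount of error variance, a larger observed standard deviation gives a larger reliability.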

So this is.

So like let's think about.

Which I think if you're

just this is a Grade 3 X

and

Grade 3 X

prime or something like that so

let's imagine these are replications

of procedures or grade.

So so in any case like if you have some

correlation that's like Grade 4 but

then increase the scale and

have a grade 5 here and Grade 6 here.

As you keep going up the scale so

Grade 3 here.

So if you look at this sort of scatter

plot here like that correlates around

like point 6 or so but as you can as you

can see as you keep sort of caterpillaring

this out it's like a caterpillar I

know it's not the greatest picture but

the idea is that now hey this

correlation looks more like point 8

And so the greater the greater

the variation you have

the more the more reliability

you'll have so I'll say to you so

one of my students will is doing a pretty

neat project with Dana McCoy She's

using Google Street View to rate

schools and like the sort of perception

of like school quality from what you can

tell in Google Street View and she kind of

she made a mistake upon reflection you

know when thinking about this prediction

of predicting of taking schools that

were too similar to each other and

they're like let's take a bunch

of schools that are too similar in quality and

then look at inter-rater reliability and

item reliability across those schools

upon reflection what she should

have done in order to like in

scale development is to make sure that the

variation very deliberately was reflective

of the variation in the population so that

she can get a reliability that corresponds

to that that said classical test

theory does give you a tool for

correcting for

the variance in the sample you have

versus variance of the population you have

versus the variance in the population you

ultimately care about so

this is like your general expression for

how like the changes in variation will

increase the ultimate reliability and

I'm just again putting

this here as a reference.
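
The slide's exact expression is not recoverable from the transcript, so what follows is a sketch of the usual classical test theory version of this correction. It rests on one labeled assumption: the error variance is the same in the narrow sample and the wider population, and only the true score variance changes. The function name is hypothetical.

```python
def reliability_with_new_sd(rho_old, sd_old, sd_new):
    """Project reliability to a group with a different observed SD.

    Assumes the error variance is identical in both groups.
    """
    error_variance = (1 - rho_old) * sd_old ** 2
    return 1 - error_variance / sd_new ** 2


# A reliability of 0.6 in a narrow sample, projected to a population whose
# observed variance is twice as large (SD larger by a factor of sqrt(2)).
narrow = 0.6
wide = reliability_with_new_sd(narrow, sd_old=1.0, sd_new=2 ** 0.5)
print(round(wide, 2))
```

This mirrors the whiteboard picture: the same instrument with the same error looks more reliable once the group it is given to spreads out.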

So that's a again a very a classic

thing you should know about

correlation is that as the variance of

the true variance increase in

the population it will increase Yeah.

We're going to think this is.

A distraction the root of this is kind

of a random subset of the population.

Were battle weary but it's been a gamble

and I think I've seen some educated.

By.

Looking at it said only that the size and

the program.

The college we're looking at the.

Trial Court but the.

College or inappropriate is.

The whole the.

Whole array of the giving

some of the work to lay.

There really but the.

Selection.

Of the so I should've had this been

a 12 week course in measurement I would

have made sure to hammer that home

repeated so the classic example is for

example the correlation between like

S.A.T. scores and freshman G.P.A.

at the University of Michigan right and

that tells you what it tells you but

if you're interested in had

everyone been admitted what would

the correlation have been

you would have seen

that it would've been larger but

you can't tell for the reasons that

that Brian Brian suggested I should

add here here's my general advice

if you ever were to undertake this because

if I were a reviewer I would I would

ding you if you didn't follow it and that's

report both right report both the initial

correlation and the you know as you

assume the disattenuated or

disattenuated correlation and

state your assumptions clearly but

never just say and

here's my disattenuated correlation and

I actually reported disattenuated

correlations in my presentation today but

in the paper we report both so

I'm trying to follow my own advice.

So similarly that advice is going

to is going to hold here as well

if we're ever going to talk about standard

deviations the standard deviations

observed standard deviations are inflated

due to measurement error right so

as you can think of this is my miming

of the normal distribution again.

As you decrease your reliability

your distribution

sort of blurs out until it just becomes

this like blob and so as you increase your

reliability your standard deviation

gets tighter and tighter so we know

that observed standard deviations are

inflated due to measurement error because

reliability is again the proportion of

observed score variance accounted for

by true score variance and so correlations

between 2 observed variables X.

and Y.

will be attenuated by measurement error in

both variables that's just a side note and

so there is a general formula for

the correction of correlations due to due

to measurement error what we do is we

divide by the square root of reliabilities

and if there's if there's error in X

and error in Y we divide by the square

root of reliability in one and

the square root of reliability in the

other and this inflates the correlation

I hate this correction and I use it all

the time so because what you're sort of

trying to say is like if had these had

these variables been measured without.

measurement error then here would

have been their correlation right

this is what structural equation models

as Matt is doing do behind the scenes for

you right they're actually

actually estimating the measurement error

in each of the variables and reporting

that disattenuated correlation for

you and and so this is a way of

sort of doing that mechanically and

in the classical test theory

framework My advice here holds to

if you're going to do this report the

initial correlation and then report the

disattenuated correlation because what you're

kind of doing here in a very not so

subtle way is taking advantage of

measurement error like the more

imprecision I have the greater I inflate

my I mean the greater

I inflate my

correlation coefficients sometimes you get

correlation coefficients

that are greater than one.

This happened and then you then you know

you've done something I mean that's just

that just reveals how silly the whole

process is right on you're giving yourself

a lot of imprecision and

credit for measurement error.
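
The correction described above divides the observed correlation by the square root of each variable's reliability. A minimal Python sketch follows, with made-up numbers chosen only to show how low reliabilities can push the corrected value past one. The function name is an assumption.

```python
def disattenuate(r_xy, rel_x, rel_y):
    """Correct an observed correlation for measurement error in both variables."""
    return r_xy / (rel_x * rel_y) ** 0.5


# Report both numbers: the observed correlation and the corrected one.
observed = 0.50
corrected = disattenuate(observed, rel_x=0.80, rel_y=0.70)
print("observed:", observed, "disattenuated:", round(corrected, 3))

# With lower reliabilities the same formula can exceed 1, which is a sign
# that too much credit is being given for imprecision.
over_one = disattenuate(0.60, rel_x=0.50, rel_y=0.60)
print("disattenuated:", round(over_one, 3))
```

The less reliable the measures, the larger the boost. That is why the advice is to show the initial correlation alongside the corrected one.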

But those that that's I mean that's

something we should take away too and

then finally regression to the mean so

too much to talk about here

I'll punt this later finally so and then

finally that's going to be the this is

the correction formula that would lead

you to be suspicious of that table that I

showed you and Angela and Patrick's paper

right not suspicious in the thing and

I did something wrong but I have

questions about it right and that is that

as you increase the number of items on

your test you get greater reliability so

if you ever are in this position of

doing massive scale development and

have like 200 items do not

pat yourself on the back for

having a reliability of point 13

because you have hundreds of items

of course you do that's going to be

the average of that is going to be very

very stable with respect to measurement

error so that's why I always when I report

reliability is I also report the number of

items because you sort of condition your

interpretation of the reliability itself

on the number of items that you've got.

And so this is just an example if you

know if the reliability is point one and

we double the test length what is

the predicted reliability so K.

would be 2 in this case in

the same way you can given any

given any test score length and

reliability you could estimate

the reliability of a single item test by

plugging in K as like one over the number

of items so if you ever really really want

to take a gamble that people do this right

of everything that's questions like

Would you recommend this to a friend.

That's like what's called the Net Promoter

Score and so the net promoter score is

supposed to be this like one true item

that tells you whether or not your product

is going to do well in the business sense

it's like a single item scale right so

anyway like if you ever want to figure

out what you know what the one item

reliability the one item test would be

just plug in K as one over your number

of items so these are all super handy

formulas that I would expect you to have

just kind of like in your back pocket

the way you have a standard deviation

the way you have a correlation

coefficient these are the basics.
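
The prophecy formula referred to here is the Spearman-Brown formula. A short Python sketch follows, where `k` is the factor by which test length changes. The 8-item reliability of 0.73 is used purely as an illustrative input, and the function name is an assumption.

```python
def spearman_brown(reliability, k):
    """Predicted reliability when test length is multiplied by k."""
    return k * reliability / (1 + (k - 1) * reliability)


eight_item = 0.73

# Doubling the test length: k = 2.
doubled = spearman_brown(eight_item, 2)

# Reliability of a single-item version: k = 1 / (number of items).
single_item = spearman_brown(eight_item, 1 / 8)

# Halving an 8-item scale into a 4-item subscale: k = 1 / 2.
four_item = spearman_brown(eight_item, 1 / 2)

print(round(doubled, 3), round(single_item, 3), round(four_item, 3))
```

Under these assumptions a 4-item half of an 8-item scale should be noticeably less reliable than the full scale. That is the expectation to bring when comparing subscale alphas with total-scale alphas.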

So so

I'm going to skip Cronbach's alpha OK.

So what what is reliability what is

reliability so I said reliability

is point 8 and you're trying to explain to

your your uncle what point 8 means what do you say.

And you can talk generally about

reliability is some sort of measure of

precision and

that's good but I also know what is

point 8 what is what does that mean

actually it's a hard question to.

Me because the good news is.

Good good good so that's that's

the right that's the you know the sort

of coherence of the overall measure and

it's you know on the sort of 0 to one

scale right but so but then if you want

to get very specific and actually address

the magnitude itself what would it what's

clear what his point is in that case

I did my usual motor mouth routine and

like I said it a couple times but

like very quickly and without pausing.

Good good.

Good that's a good that's a good rule of

thumb psychometricians cringe at rules

of thumb but that said it's one

that I don't mind cosigning for

general purposes but so all the more

reason to know what point 7 means right.

So that you're talking about a signal to

noise ratio where you're talking about

the true score variance over

the error variance you're close and

it's just a conversion but anyway

you've got true score variance in

the numerator that's good

what's in the denominator.

It's observed score variance in

the denominator total variance so

how much of the variance that you see is

accounted for by that signal and you can

get to that from the signal to noise ratio

but but but reliability so when you see

point 8 you're saying 80 percent of the

observed score variance is accounted for

by true score variance that's not

the only way to think about the.

reliability question you can also

frame it in just the way we think of

an intraclass correlation.

As as a correlation in itself.

And it's a correlation in this case of 2

replications of the measurement procedure.

It's a correlation of X.

and X.

prime that's actually why you write

it rho X.

X.

prime.

It's a correlation between X.

and what we imagine a replication of X.

to be which is equivalent to the

proportion of the observed score variance

accounted for by true score variance so it's

bilingual in the same way that you can

think of an intraclass correlation as

a correlation and this measure

of between group variance right.

In the same way reliability is both

the proportion of observed score

variance accounted for

by true score variance and

the correlation between 2

replications of a measurement procedure.

The mantra that I like is.

A person with one watch

knows what time it is

a person with 2 watches is never

quite sure and that's kind of kind of

what psychometrics is all about it's

sort of saying like we always want to know

exactly how imprecise we want to

be precise about our imprecision.

OK so how do we estimate this in practice

here are 3 types of reliability The 1st

is sort of the gold standard it's sort of

parallel forms reliability we actually try

to do that we try to replicate the whole

measurement procedure twice we sort of

we could we create 2 different equivalent

forms imagine this urn the statistician's

urn this magical urn right full of stuff

like marbles we sort of take a scoop

of the items and create one form take

another random scoop of the items and

create another form and then we give it

all to you like now and we give it all to

you in like some separate room on some

separate day with some separate raters

and we try to vary all the things that

we care about varying and give that to

give that in a different scenario and then

we simply take the correlation of the X.

and X.

prime and that's a parallel

forms reliability another

way we approach it is to do test retest

reliability what that does not capture

is the variance due to the items because if

I test you and then I retest you again

I haven't drawn again from this urn of

items so you want to think about all these

urns of like items of raters of

occasions of tasks and think about all of

those as contributing to your sources

of variance and 3rd this is sort of

the weakest form that you usually get the

highest reliability from is our internal

consistency reliability which which treats

all this stuff like all the stuff that's

going on in this room right now is fixed

and only considers the variance of items

within the within the little test that you

happen to have right it sort of says hey

instead of drawing an urn drawing from

this urn of items I recognize that I've

already drawn from the urn of items I can

split the test items sort of randomly in

half take correlations of all those halves

and think about how that is an estimate of

reliability that's how internal

consistency reliability works.

So again I hope you're sort of bilingual.

In the order of the light.

consistency reliability 10 percent of

the time it's some weird approach using

I.R.T. that I'll talk about shortly.

And actually show Shawn and

I from our from our 2015 paper to

have this here I might have cut the slide

we actually show you the histogram for

all reported state reliability coefficients

that you see in practice just to give you

a sense and point to point and

all of them are over point 7 in this case

there are centered on point 9

with a slight negative skew.

All.

Together for.

A purpose.

And then averages of that.

Here is what I hear is

what I skipped over so

Cronbach's alpha is exactly that and you

actually can show you can prove that it is

the average of all possible split halves

right you split in half you split

in half every single possible way you can

you take the correlation over and over and

over again now that correlation this is

where you combine Cronbach's alpha and

Spearman-Brown right you've taken half

tests when you split in half you've taken

half tests so you have successfully

described on average the reliability of

a half test and then used Spearman-Brown to

ramp that up to that to the full test so

it's a nice and neat little exercise.
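
The split-half claim can be checked numerically. The sketch below is pure Python with made-up data. It uses the Flanagan-Rulon (Guttman) form of the split-half coefficient, 4 cov(A, B) / var(X), which is already on the full-test scale. Stepping each half-test correlation up with Spearman-Brown gives the same number only when the two halves have equal variances, so treat the choice of coefficient here as an assumption about which version of the result is meant.

```python
from itertools import combinations


def covariance(xs, ys):
    """Sample covariance; covariance(x, x) is the sample variance."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def cronbach_alpha(items):
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    item_variance_sum = sum(covariance(col, col) for col in items)
    return k / (k - 1) * (1 - item_variance_sum / covariance(totals, totals))


def split_half_coefficients(items):
    """Flanagan-Rulon coefficient for every equal split (even item count)."""
    k = len(items)
    n = len(items[0])
    totals = [sum(row) for row in zip(*items)]
    coefficients = []
    for half in combinations(range(k), k // 2):
        if 0 not in half:  # count each split once, not once per side
            continue
        other = [i for i in range(k) if i not in half]
        a = [sum(items[i][p] for i in half) for p in range(n)]
        b = [sum(items[i][p] for i in other) for p in range(n)]
        coefficients.append(4 * covariance(a, b) / covariance(totals, totals))
    return coefficients


# Hypothetical responses: four items, six people.
items = [
    [1, 2, 3, 4, 5, 6],
    [2, 2, 3, 5, 5, 6],
    [1, 3, 3, 4, 6, 6],
    [3, 6, 1, 4, 2, 5],
]

halves = split_half_coefficients(items)
mean_split_half = sum(halves) / len(halves)
alpha = cronbach_alpha(items)
print(round(alpha, 6), round(mean_split_half, 6))
```

Individual splits can land well above or below alpha depending on which items end up together. Averaging over every split removes that arbitrariness.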

But you've got you've got

the intuition Exactly.

So so you know this is the last thing

also service and it will take I think

a 5 minute break that will end up being

this is this is where your reliability

is not a liability right and

this is the point that I think Brian was

sort of leading to is that you should

think of the reliability coefficients that

you get in your technical manuals and

all of your state tests as being

an impoverished version of the reliability

you might imagine right if it's trying to

answer the question how well does this X.

correlate with this possible X.

prime like just varying items

doesn't cut it right and if you

were to vary occasions if you were to vary

like spin areas if you were to vary raters

if you were to vary all these other things

that we might actually be interested in

generalizing over then that reliability would

probably almost assuredly be lower right

and so that's worth thinking about as you

as you are adjusting for reliability is

what exactly are the replications over

which I'm interested in generalizing and

that leads to an entire series another yet

another theory called generalizability

theory which was developed by Cronbach

and many others decades ago Bob Brennan

has done the biggest the most work on this

as of late and is something you should

know about I'm not going to dig too far

into it right now but I'll give you a couple

key references Brennan in 2002 there's a great

primer by my former 2nd advisor Rich

Shavelson and Noreen Webb in 1991

it's a nice little Sage primer and

it's kind of depressing that it hasn't

really gone out of date since 1991 but

this is basically just analysis of

variance is pretty straightforward.

And that's just it answers

a couple of questions Tom Kane and

I did a paper on this I present in my

class about teacher observations right and

how many raters do you need how many

lessons do you need how many items do you

need to get sufficiently precise

estimates of teacher observation scores

by many raters and are for example

administrator raters different than peer

raters these are the kinds of questions

generalizability theory is

Really well primed to answer this is my

colleague Heather Hill at Harvard who

wrote a great article in Educational

Researcher the title

before the colon was When

Rater Reliability Is Not Enough which is

to say like Often times we think we've got

a bunch of raters let's just see how well

they match with Master coders not

enough and I totally agree with her so

I think you should sort of think as you're

developing if you ever have a scale that

depends on raters you should definitely

start with rater accuracy and

then move quickly to generalizability

theory if you can and

leverage the sources here so

generalizability studies are expensive

but they also are due

diligence when it comes to

real reliability right the reliability

you have is not the reliability you seek.