
Andrew Ho: Psychometrics mini course - Part 2

December 7, 2016

Andrew Ho: 2016 psychometrics mini course - Part 2

Transcript:

So I've jumped ahead a little bit. Apparently something weird is going on with my slides: we have some sort of Star Wars scroll going on behind the scenes. I don't know what's going on there; it's kind of cool.

So we're going to focus on IRT. This is part of a larger presentation, obviously, about practical applications and then sort of a critical perspective on IRT. Again, this is my effort to demystify it, highlight its uses, and also highlight its limitations. I think it's useful to contrast IRT with classical test theory. The object modeled statistically in classical test theory is the actual item score.

That's like a 0 or 1 for an educational test, or a 1 to 5 on a Likert scale. In IRT, it's the probability of getting a particular item correct, or of reaching a particular threshold in a polytomous, Likert-type item, and that probability depends on a series of parameters that I'm going to explain shortly. The conception of what's measured in CTT, like what is the score, again, that's the question I keep asking: what is your score? In CTT it's an examinee true score: you can think of it as an average, you can think of it as a sum, equivalently defined as an expected value across replications of observed scores. In IRT, we create this theta, which is a particularly useful scale, as I'll describe, for comparing across different populations. So it's a very useful scale if the model fits the data.

So again, just to highlight CTT and why I spent the first half of this presentation on it: it is still by far the most widely used psychometric toolkit. It can do a ton; do not sell it short. But then, of course, it's recognized that sometimes you have to publish some papers and need a fancier acronym, and IRT will come in handy then. But CTT should be a knee-jerk first analysis, just as descriptive statistics are a first step for most empirical work.

So. We think IRT is useful because of how it conceives of items. CTT has an urn full of items, and it's just like: I don't care, on average they're kind of like this, and their variance is kind of like this. In IRT, you take each little marble out of that urn and you appreciate it: this marble is special, it has these properties. And then you can take all these marbles out, lay them on the table, and say, here's what I want to do with them. That tends to be much more useful as a way to design tests and a way to maintain scales over time; it's the standard approach for large-scale testing.

Which is to say that, for many of you, IRT is a little bit of a sledgehammer to a nail. If you're developing your own scale for your own purposes, and it's just going to stay static and just be used for a particular population, go ahead and do IRT, but realize it's kind of an indulgence; you're taking, again, a sledgehammer to a nail. For large-scale testing programs, where you are substituting out items, you are maintaining scales, and you're giving the test to different populations over time, IRT is incredibly powerful, and it is the standard for use in large-scale testing programs. Perhaps the only major exception to this is the Iowa Tests, currently, where I used to teach, at Iowa, and they're mostly holdouts for classical test theory. So you can see why I've been so classically focused here: because they pull off amazing things without IRT, and do it quite well and quite rigorously. You do not always need IRT; it is just simpler and more elegant to use it, which is why it's in such common practice today.

So IRT sort of asks: what if there were this alternative scale, this alternative theta scale, for which item characteristics would not depend on each other the way they do in classical test theory? In the Google Doc, I think Yen and Fitzpatrick, in their chapter on IRT in the handbook, describe IRT, if your assumptions hold, as person-free item measurement and item-free person measurement. Which is to say that these statistics you come up with, the difficulty of an item, the discrimination of an item, and the proficiency of you, the examinee, don't depend on the items you happen to have, and similarly the item features do not depend on the population you happen to have. And that, if the assumptions hold, is pretty darn powerful, because that marble you pick out of that urn has those properties and will always have those properties, and so when you use it to construct a test, the items will continue to have those properties, again, if the model holds.

So, just to define here (this is where we're going to get into logits), the simplest IRT model is known as the Rasch model, for Georg Rasch's 1960 monograph. It's also known as the one-parameter logistic, or 1PL, model, and I like to write it like this: the log of the odds of a correct response to the item. Right, P over Q: the probability of a correct response over one minus the probability of a correct response is the odds, and this is the natural log of the odds. The log of the odds is just a simple linear function: there is this common slope parameter a, this person intercept theta, and then the sign is important here. Usually in logistic regression you're used to seeing a plus here, and we're going to define it with a minus, so that this is difficulty instead of easiness, a difficulty for each item. And then this is a random effect, an error term, so theta for person p is distributed normal(0, 1). Note that we are not estimating the variance of the person distribution here; we're standardizing it to normal(0, 1).

Most IRT models aren't written out like this, and I think that has the effect of mystifying them; somehow the model gets obscured by the logistic function. I prefer just saying, hey, we're linear in the log odds here. This is not fancy. If you can do logistic regression and you understand what a random effect is, then this is just familiar modeling, and it really is OK. So again, the log of the odds is simply this common a. Note there's no subscript i on a: the discrimination does not depend on the item; it's just a common parameter estimated across all items here. That's going to change in the next model, but for now it's common across items. Then this is the difficulty parameter for each item, and then every person is going to get a theta.

OK.
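To keep that verbal definition concrete, here is a minimal LaTeX rendering of the 1PL in the log-odds form just described, with person p, item i, a common slope a, item difficulty b_i, and theta standardized to normal(0, 1):

```latex
\[
\log\!\left(\frac{P(X_{pi}=1)}{1 - P(X_{pi}=1)}\right) = a\,(\theta_p - b_i),
\qquad \theta_p \sim \mathcal{N}(0,\,1)
\]
```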

This is the more intimidating way of writing the same model: we model the probability itself, but this of course is equivalent. This is the scary way of writing logistic regression, and this is the maybe less intimidating way of looking at logistic regression, as long as you don't look over here. It's just the log of the odds, and logistic regression, when you're in the logit, is just a generalized linear model. Don't forget.
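And the "more intimidating" probability form is just the same model solved for P (same notation as above):

```latex
\[
P(X_{pi}=1 \mid \theta_p) = \frac{\exp\{a\,(\theta_p - b_i)\}}{1 + \exp\{a\,(\theta_p - b_i)\}}
\]
```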

So these are the curves that we estimate, the logistic curves, and they say, for a given theta, these are the probabilities of getting an item correct. So at a theta of 0, this item has a 50 percent chance of being answered correctly, or thereabouts; maybe 55 percent; another is up at 70 or 75 percent; and so on. So which items are easier? The ones you can think of as on top, or shifted to the left; the more difficult items are the ones shifted to the right.

Or shifted down, depending on how you think about it. In logistic regression we usually think of intercepts on y; what we've done here is flip that to think about position on x. So those would be higher y-intercepts and lower y-intercepts, but what we've done now is shift to thinking, again, of greater difficulty shifting that sort of S curve, Walk Like an Egyptian style, that way.

And just to give you a little bit of the punch line here, what I've done is show you a scatterplot of the classical test theory difficulty, which is to say the percent correct (classical test theory is super annoying; it calls percent correct the difficulty), against IRT difficulty. There's a negative relationship, but this is just to say: if you ask how percent correct corresponds with IRT difficulty, is IRT giving me something magical and mystical over and above percent correct, the answer is, not really. It's pretty much the same information. This is not surprising, but again, IRT, I'm going to show you, is going to be useful for some more advanced applications.

I just want to demystify this further; this is where Matt and I had a conversation earlier today. IRT is a latent variable measurement model. It is a factor analytic model. It is a structural equation model. Do not think of these as separate things. They are separate practices in the way that ANOVA and regression are separate practices but are the same under the hood. I think of the act of doing ANOVA as a way of thinking about a statistical analysis, even if it's the same thing; I could do the same thing with regression. Similarly, structural equation models and factor analysis I think of as a different practice, asking sort of different questions, using the same statistical machinery. I'm happy to elaborate on that, but I don't want to treat these as completely separate models when I think of them more as separate literatures and separate fields that use them for separate reasons, in the same way that ANOVA and regression are really the same under the hood. So what I'm setting up for you here is a way of doing IRT using gsem, the generalized structural equation modeling command in Stata. You can see here, all it is, one way I say it is: IRT is factor analysis with categorical variables.

Right, that's all it is. And that's not all it is, in the sense that what we do with it is different, but under the hood, that's all it is.

So the SEM formulation is that the probability here, dependent on theta and b, is the logistic of theta minus b. That's a slightly different parameterization than the one I showed you with the a term, because there the a was outside the parentheses, but it's the same general approach: the slope is constrained to be common across items, and you'd fit this in Stata with gsem, as logistic regression, the same thing. So actually, before Stata 14 came out (Stata 14 was just released last year), before they had an IRT package, guess what, I did it in gsem. Which is to say, why not teach the course using structural equation models, because they're the same thing. And so I had all this really convoluted code to get all the stuff I needed out of gsem, and then of course Stata, thankfully, made all that obsolete, and I had to re-record everything. But it just goes to show that it's the same thing under the hood. So this is the two-parameter logistic... yeah, sure, actually. Chris?

Absolutely. gsem can do the 2PL; it has a hard time with the 3PL. All you do for the two-parameter logistic model is free this right there: instead of forcing the slope to be the same across items, you let it vary. That gives you the two-parameter logistic model. The three-parameter logistic model I don't think you can do in gsem; you can do it in GLLAMM, Sophia's package, or in an R package. But again, it's the same under the hood. Good question.

So this is the two-parameter logistic model. It allows items to vary in their discrimination across items. Again, I like writing it in log-odds terms, and so all I've done is add a subscript i; all I've done is let the slope parameter vary across items. And then again we have difficulty: these are the more difficult items, these are the less difficult items. If I wanted to be less fancy about it, what would I do? I would plot this in logits, and then it would just look like a bunch of straight lines. So again, this is a sort of mystifying way of describing IRT; if I wanted to make it simpler, I'd just show you all the different straight lines in log-odds space. Yep, question?

[Audience question about how many items and parameters are shown.]

See the legend: there are 20 items, so obviously we have 20 parameters here. We're not estimating the thetas for each person; that's a random effect, and we're standardizing its distribution. Once we're not estimating those, we can get them in a Bayesian way after the fact, in the same way that we can get other random effects estimates after the fact. And then we have, in this case, 20 parameters for difficulty.
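As a minimal sketch of those curves in code (Python rather than the Stata used in the course, with made-up a and b values), here is a 2PL item characteristic curve evaluated over a grid of thetas:

```python
import numpy as np

def icc_2pl(theta, a, b):
    """2PL item characteristic curve: P(correct | theta) with discrimination a, difficulty b."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
print(np.round(icc_2pl(theta, a=2.0, b=0.0), 2))  # steep (discriminating) item centered at theta = 0
print(np.round(icc_2pl(theta, a=0.7, b=1.0), 2))  # flatter item centered at theta = 1
```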

So this here, I can actually show you in the output. Where was the output? I don't think I have it here yet, but that's the underlying count; we're talking degrees of freedom, for example.

What do the data look like? Absolutely. You know, what are we feeding here? You can do it long or wide; it doesn't really matter, Stata lets you do it wide just as easily. I should have shown this before: what do the data look like? It is a person-by-item matrix, where you have persons as rows, items as columns, and zeros and ones (you can also extend that to 0, 1, 2 for polytomous items) in each of the cells, and you're modeling the probability of a correct response to each item. So what do the data look like? I think I have this, if I can show it to you.
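Here is a rough illustration of that layout in Python, with a tiny made-up response matrix (not the data from the talk): persons as rows, items as columns, 0/1 scores in the cells.

```python
import numpy as np

rng = np.random.default_rng(0)
responses = rng.integers(0, 2, size=(5, 4))    # 5 persons by 4 items, scored 0/1
print(responses)
print("sum score per person:", responses.sum(axis=1))
print("percent correct per item:", responses.mean(axis=0))
```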

And, there you go, this is sort of what the data look like behind the scenes. So what I've done here: these are two separate item characteristic curves for two items, and what I've done is map the sum score associated with each theta onto the theta scale here, and put those weights, how many observations happen to be there, as dots. So you can see that what we're trying to do is fit the probability of a correct response given that sort of overall score. Does that help a little bit? OK.

That theta scale is really weird and annoying, because, I mean, where did it come from? It's this latent scale, and you can see that it's set in the same way that a random effect is sort of set: we just say it's got some mean of 0, and instead of estimating the variance, we put it back on the slope.

So let me show you further, to give you a little bit of a sense of what the curves look like. This is the item characteristic curve demo that I like to do; this is my visualizing-IRT site. This is a three-parameter logistic model. So what happens if I increase... there's a blue item hiding behind this, there's a blue ICC, a blue curve, hiding behind this red curve, and what I'm going to do is increase the discrimination of this blue item. What we're going to see is that we're going to increase this slope here in the probability space, and this blue item is now what we describe as more discriminating, in the sense that people just below that midpoint versus just above it are going to have a pretty massive swing in their probability of a correct response. So my trick question to you is: which item is more discriminating, blue or red? The knee-jerk reaction is to answer that blue is more discriminating, but if you think about it more carefully (and some of you did a good job of working through this on the Google Doc), where is the slope higher, which item has a higher slope? There isn't a general answer to that. And in fact, when might the red item be better?

Yeah, at the tails of the distribution. You can see that for people who are very high achieving on this scale, or very low on this scale (this goes back to Sue's question), who are we trying to discriminate among? We're going to get to information shortly, but the idea is that what IRT allows you to do is say: difficulty for whom, discrimination for whom. Even though you have a's and b's, you wouldn't want to call an item just more difficult or just less difficult, because it all depends on for whom. And so you can use this, again, to construct tests in very strategic ways, to provide information for high-achieving or low-achieving students if you're so inclined.

So similarly, what I'm going to do now is increase the difficulty of this blue item. What do you think is going to happen? Which way do you think that blue curve is going to go?

The blue curve here is going to shift to the right. It's going to take a little bit of a walk, and for more and more people across the theta scale, their probability of a correct response is going to be low. So now your blue item is more difficult, it seems, right? It's like a 1.0 b parameter estimate, and you're like, that is more difficult. Is it really more difficult? When is it easier?

If you look all the way up at the top, you actually see an instance where the blue item is easier than the red item. So when the discrimination parameters are not the same, this is like an interaction effect: you can't really say across the board which item is more difficult and which is easier; it depends on where you are in the scale. Now, if all the a parameters are the same, as they are in the one-parameter logistic model, then there's never any overlap: the difficult item is always more difficult and the easier item is always easier. But once you allow discrimination to change, that allows you to be very targeted about for whom an item is difficult and for whom it is easy.

[Audience question:] I guess my question is, do you have to actually find people across the range in order to estimate these discriminations? If you only gave it to people who were really high achieving, you wouldn't have any information down there.

That's right, you'd be forced to extrapolate, in the way that we do. It's the exact same thing as fitting a linear model (I mean, this is a linear model in the log odds) and then saying: what I'm going to assume is that, if I want to predict for people down there, what's going to happen for people down there is an extrapolation of that linear-in-the-log-odds assumption. And so when we say person-free item measurement and item-free person measurement, what we're really saying is, yeah, if my model holds, which is what we always say. This is just a regression assumption; it's nothing magical. But it is still nonetheless useful, and what we find in a lot of cases is that the linear-in-the-log-odds assumption is pretty reasonable.

So, yeah, just a quick note: the slope here is a over 4, and of course in the log-odds space it's just the slope itself. And again, be careful when a's vary, when discrimination varies: be careful about assuming discrimination is discrimination. Do not select items based on parameters; select items based on curves. In that sense, you should think in an item characteristic curve way: always visualize the items themselves if you can.
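For reference, the "a over 4" fact comes from differentiating the ICC: in the probability metric the slope of a 2PL curve is

```latex
\[
\frac{\partial P}{\partial \theta} = a\,P\,(1-P),
\]
```

which is steepest at P = 0.5 (that is, at theta = b), where it equals a/4; in the log-odds metric the slope is simply a.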

So I want to show you what happens here with the c parameter, which I haven't really talked about, given how fast I've been rushing through this. When I increase it here, I'll show you what happens: it sort of lifts the floor. See what's going on here? Some of you already might know the answer to this, but why would this be useful? Why would we want to say that, in certain cases in educational testing, people with extremely low proficiency still have a 25 percent chance of getting an item right?

[Audience response.]

That might not be quite where I was going with this, but I do like the sentiment. This is a data-fitting exercise, so you wouldn't really want to control it in that particular way, but I really do like that sentiment; you wouldn't quite pull that off there, but I think it's a cool idea.

Now, this is very tuned to educational testing, when you have multiple-choice tests, and the idea is that, when you have a very, very low-scoring examinee, forcing the lower asymptote to be 0 is kind of silly. So, I guess my general recommendation is to never use a three-parameter logistic model, and I'm going to show you why, by setting blue to point... and then 0.95... that didn't quite work out. 0.3 and 0.2, so maybe I got this a little bit off; I know what it is. Let me just fix the window.

So what I've done here is create a situation where we have dramatically, well, not that dramatic, fairly dramatically different parameter estimates, but the curves are overlapping through much of the upper end of the distribution. You see how those curves are sitting on top of each other over there? And the question would be: do you have enough information at the bottom end of that distribution to actually estimate those lower asymptotes? So c parameters are notoriously noisy, and Stata, in all its wisdom (I'm very grateful for this), has actually not given you the option to fit a true three-parameter logistic model. When you fit a three-parameter logistic model, Stata says all your c parameters have to be the same across items, and it estimates a common lower asymptote. That's a really wise thing, because otherwise there's no information down there, and you get a whole bunch of noise, and it throws all of your other parameter estimates off. So, just so you know, in general I don't recommend using the three-parameter logistic model. In practice it is used a lot, and I do not really understand why; I keep pushing back on states against using it, because it just adds a whole bunch of noise. Do not overfit your data, as a general rule. So luckily Stata has prevented you from doing that by giving you a common c parameter to estimate; that's just fine if you're so inclined.
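A minimal Python sketch of the curve being discussed (made-up parameter values): the lower asymptote c lifts the floor of the ICC, which is exactly the quantity that is hard to pin down without many low-proficiency examinees.

```python
import numpy as np

def icc_3pl(theta, a, b, c):
    """3PL item characteristic curve: lower asymptote c, discrimination a, difficulty b."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-4, 4, 9)
print(np.round(icc_3pl(theta, a=1.5, b=0.0, c=0.00), 2))  # floor at 0 (reduces to the 2PL)
print(np.round(icc_3pl(theta, a=1.5, b=0.0, c=0.25), 2))  # floor at .25, e.g., 4-option multiple choice
```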

So, this is a little bit of the output; here is actually some of the output. This is IRT in Stata, and again, now that I don't have to use gsem anymore, I have these ridiculously long do-files that are completely obsolete, because all you have to do is type irt 1pl and your items and you're all set. You can plot; it's got some good IRT plotting functions for you. And you get output that sort of looks like this.

[Audience question about whether this was fit in long format, as a mixed-effects logistic regression.]

Yes, I did it in the long format. You can have my slides; that's how I got it, I think. I actually deleted that slide here, but I have it in xtmelogit, which is exactly the same thing.

[Audience follow-up: so the difference, in your mind, is that theta is a random effect for people, and you recover it afterwards?]

And so I actually usually take a three-step approach. First, especially for economists, it's useful to show it that way, and for people who are multilevel modelers you start off by showing it as a random-effects logistic model. Then I show it to the people who have taken structural equation modeling or factor analysis before. And I just try to demystify it: under the hood it's all the same thing, don't freak out, but we psychometricians have developed kind of a mystical language for talking about it.

So now, just a quick note here: again, this is linear in the logits. People often say that IRT really is an equal-interval scale. Is it equal interval? It is setting up this linear assumption, but it treats as the target of interest the log of the odds of a correct response, and it assumes linearity between theta and all of those log-odds functions. So I guess I'll just say: remember that this is the assumption. It's a sort of simple model when you show it like this; maybe it's not as pretty, but that's really what's going on.

So this, again, is a three-parameter logistic model estimating a common c parameter, and I think that's a good thing. You can show that it fits better in some cases; I don't really like the likelihood ratio test for these purposes, because usually in practice you have these massive data sets and everything's always going to show up as fitting better when you give the model more parameters. It's not really that interesting; sometimes simpler is better.

[Audience question: if my theta puts me at 70 percent on this item, does that mean that on a hundred similar questions I'd really get 70 of them right?]

I mean, I guess, that's an interesting question. You'd think it would be deterministic in some way; that's a good question. I think: don't think about you, think about people like you who also sit at that theta. That's probably the easiest way to think about it: there are a hundred people at that theta, and 30 of them are getting it wrong, so it's nothing against you personally. It's just that there's something we haven't modeled in you that would let us be more discriminating; we don't have a specific model for you. So just think about all the people at that theta, rather than you having a 70 percent chance of getting it.

Does that help? I mean, it's the same sort of thing in any given scatterplot for a regression: you have an x, you have a y, and you're not talking about you, you're talking about, on average, people at that x. What's your best guess for y?

This is just a note on parameterization. You're talking about whether you estimate the variance of the random effects or you let the slopes vary, and I just want to note here that you can do both. For those of you who have taken factor analysis or structural equation modeling, you know you have to anchor the scale in one of two ways: you set the variance, or you set one of the loadings. I just want to show that there is an equivalence there. This is an aside; it's all here as a reference.
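One way to write the equivalence being referenced, a sketch in the 1PL notation from earlier: either standardize the person distribution and estimate a common slope, or fix the slope at one and estimate the person standard deviation.

```latex
\[
a\,(\theta_p - b_i),\ \ \theta_p \sim \mathcal{N}(0,1)
\quad\Longleftrightarrow\quad
\theta_p^{*} - b_i^{*},\ \ \theta_p^{*} \sim \mathcal{N}(0,\sigma^{2}),
\]
\[
\text{where } \theta_p^{*} = a\,\theta_p,\quad b_i^{*} = a\,b_i,\quad \sigma = a .
\]
```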

So, some practical guidance here for you when it comes to sample size for estimation. You get the same kind of guidance as for factor analysis, but just be careful: this is not a small-sample kind of endeavor. For the one-parameter logistic model you can get away with small samples; this is just a reminder that when you have small samples, just stick with Rasch. Rasch is a good way to get what you need. You get varying advice from different authors for the two-parameter logistic model. The three-parameter logistic model: don't use it, unless it's the way Stata does it; otherwise it's just an absolute mess, and lots of examinees are needed for the 3PL, so don't even bother. And this goes for polytomous items too. You may have heard of the graded response model, which is for polytomous items; this is why I was saying get your discrete histograms and see whether people are actually responding at, like, four or five score points, if you want to estimate those curves.

So I want to talk a little bit about the practical differences between item response theory and classical test theory. Here what I've shown is a sum score against the logit of the percent correct, adjusted a little bit to keep it away from 100 percent, and you can see that it's just a nonlinear transformation of the sum score, and it looks a lot like the one-parameter logistic estimates for theta. Which is just to say: don't think IRT is going to create dramatically different scores in your case. The thetas the one-parameter logistic model would give you are just a slight nonlinear transformation of the sum score. So that's the relationship between the one-parameter logistic and the sum score. Once you get to the two-parameter logistic model here, you start to get some information based on the items that discriminate more or less. And similarly, between the two-parameter and the three-parameter logistic models, you've basically got the same thing; that lower asymptote is not making that much of a difference. So if you want to talk about the practical impact of IRT on your scoring, that's not where you're going to see the difference. Again, I think the value of IRT is really for scale maintenance over time, for linkages, for fancy things where you're subbing in new items and estimating for new populations, not within any given static panel of item responses.

IRT, over and above classical test theory, is kind of like a sledgehammer to a nail. That doesn't mean it's not a cool thing to do, and it's useful for diagnosis, but really what you want to do with IRT is say: OK, now I'm going to pick these items up and use those particular marbles from this particular urn to target a measurement instrument for a particular purpose. It's for that kind of design that IRT becomes particularly handy.

So, let's see, what should I do. I want to talk a little bit (I talked with Matt about this) about one of the cool things about IRT, which is that, if you look at the equation for IRT, it puts theta, which is a person ability estimate, and b, which is an item feature, on the same scale. It subtracts them; it sort of says, here is your theta against the item's difficulty. What I like about IRT is that it gives you a way of mapping items to the scale in a way that imbues that scale with, you could almost argue, a qualitative kind of property. It says, OK, let's pick a response probability, meaning the probability with which I'm likely to get an item correct; think of it as, like, 70 percent, to use that as a cutoff. Then what we can do is say: OK, if that's the case, then if I have a theta of, like, 2.2, that's where I'm likely to get this kind of item correct, and if I have a theta of 1.2, I'm likely to get this other item correct, and different thetas will have different mappings.

Again, why is this useful? Because oftentimes you're going to get people asking: so I got a score of, like, 30, what does that mean? What does an ACT score of 30 mean? What does an SAT score of 600 mean? What does a score of X mean? By putting examinee proficiency and item difficulty on the same scale, it allows me to create what we call these item maps. And here's some of the work that we've done for NAEP (this is not very elegant, I have to say), but it says, OK, a student at this level can "explain properties of sums of odd numbers," for example. You can click on that and see what that means, what they can do, with a specified probability. I really like this, because educational scales can be extremely abstract: you're always wondering what a 10 or a 20 or a 30 is. And I've actually asked my students in many cases, whether it's a psychological scale (you get a grit score of 3, what is that?) or a theta scale, or an SAT-like scale of 600: this allows you these qualitative descriptions of what that score actually means. I think this is a very powerful, underused method. Increasingly, I think statistics is moving toward descriptions of magnitudes in addition to statistical tests. For example, how much is an effect size of 0.5? That's something we really struggle with, and I think being able to say what a 0.5 means, that you used to be able to do two-digit subtraction and now you can do three-digit subtraction, or something, whatever it is, being able to accurately describe what you could do then and what you can do now, can be really powerful.
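A small Python sketch of the item-mapping idea, with hypothetical item parameters and a hypothetical response-probability criterion of 0.67: invert the 2PL ICC to find the theta at which each item reaches that probability, then read the items off along the scale.

```python
import numpy as np

def theta_at_rp(a, b, rp=0.67):
    """Theta at which a 2PL item reaches response probability rp (inverse of the ICC)."""
    return b + np.log(rp / (1.0 - rp)) / a

# Hypothetical item descriptions and (a, b) parameters, for illustration only.
items = [("two-digit subtraction", 1.2, -0.8),
         ("three-digit subtraction", 1.0, 0.4),
         ("explain properties of sums of odd numbers", 0.9, 1.5)]

for label, a, b in sorted(items, key=lambda x: theta_at_rp(x[1], x[2])):
    print(f"theta = {theta_at_rp(a, b):+.2f}  {label}")
```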

[Inaudible audience question.]

So, that would be an example of the model not fitting the data, if that sort of thing happened a lot. Usually we have the ideal, where every time you move up the scale you only get more and more items correct. Obviously that doesn't happen item by item in practice, but it has to happen on average, and if it doesn't, the IRT model won't fit, and you'll get really bad alphas, because effectively your items, even in classical test theory, even at that stage, will show you that your scale is not cohering. So if you have a high alpha, if you do a scree plot for dimensionality, if your IRT model fits (which are all different ways of saying you have a roughly unidimensional scale), then what you're describing doesn't happen that often. And so, by picking a response probability, and with these curves being correct, you get this ordering of items on the scale in a successively ordered way. Sometimes it crosses, and you can see here that the two-parameter logistic model gets a little dicey as far as interpretation goes, because the item orderings aren't the same for different response probabilities, but on the whole this is, I think, a reasonable way to say, OK, here's what performance at this level means.

[Audience question: so previously, everyone got a sort of spiraled set of randomly equivalent questions?]

Yes. We're moving, in math, to multistage testing, which is to say adaptive, but in blocks; so, kind of like what was done in some of the National Center for Education Statistics tests, we've got this two-stage exam where, based on whether you performed high or low, they give you harder items or easier items. But still, even for those items, even if an examinee never saw some of them, you'd still, in a model-based way, be able to predict whether or not they'd respond correctly, if the model holds. That's the whole idea of IRT itself: even if you didn't observe that item, you could still predict your probability of a correct response to it. So you would hope that these item maps would hold, if the model fits, which is what we always condition on. But I really like this; it's one of my pet things I like about IRT, so I hope you remember it as something you can do when you're trying to explain to your aunt or uncle, you know, "my daughter got a 600 on the MCAS." Great, what's her percentile rank? Or you can say: this is what she can do. I suppose the percentile is OK, but this is a good way of anchoring the scale and talking about, and this is really what I think measurement is about, what does this score mean they can do?

So, and someone derived this; who was that? That was... good. So this is a slightly different, algebraically equivalent version of the same thing; this is just inverting the IRT equation, the ICC. OK, so I'm going to skip estimation, even though it would be really fun to talk about, Brian, but this is a little bit of an illustration of maximum likelihood and how things work. I'm going to talk a little bit about how tests... go ahead.

[Audience question about the goal of item maps and how they should be used.]

I think the general goal of item maps is to understand what a score means: what it implies about what a student knows and is able to do, in the case of educational testing, or what they happen to report or happen to have, in the case of psychological testing. So, for example, if you have a grit score of, like, 4, that means you went from neutral to affirmative on this particular item; that's a way of saying what a 4 means. And I think that generally increases the likelihood of appropriate interpretation of scores. I love that you're asking this; this is a question I usually ask other people, so I love that you're asking me. For example, there were big NAEP declines from 2013 to 2015. How big? Not that big, if you look at the differences in the kinds of skills that students were on average able to do this year versus on average able to do last year. It just helps to give people a sense of magnitude. And, you know, Mark Lipsey has this great piece translating the effects of interventions into interpretable forms; I think that's the job, and I think he does it in a bunch of really useful ways, talking about cost-benefit analysis, talking about numbers of months of learning. But I think this is a way, in a criterion-referenced way, to literally say: hey, this is what you're able to do now, and this is what you were able to do then. That will facilitate any number of interpretations downstream, because it's really about what we predict you're able to do. So whenever you're thinking about a score and helping people interpret scores, let item maps be one possible way you can describe them.

Let me be very specific in another way about how item maps are used: they are also used to set standards. I haven't put standard setting in here because...

I have opinions about it.

So, standard setting is a process by which we say: this much is good enough. NAEP has set standards, a proficiency cut score. It is a judgmental cut score. We just had this massive evaluation from the National Academy of Sciences about whether that process was justifiable, and the answer, for the most part, was that it was. But that's a judgment, and the process uses this mapping. If you are a reader coming in to set standards, you would get a book of all of these items in a row, and what you would do is flip through the book and put a bookmark where you think that "just proficient" designation should be. So that's another way in which this is actually used, in a very practical way, to help people set a judgmental cut point for what they think is good enough, based on what people can actually do at that level. Does that help? Now...

[Audience question: what were you going to tell us about the classical Rasch people?]

This is a great point, Chris. So there's a camp of very thoughtful, well-reasoned, but also sometimes cultish (not to offend anybody; am I on tape?) people, many of whom are very close friends of mine,

who are sort of in this Rasch camp, where they think the model is so useful that it's worthwhile sometimes to throw away data to get the model to fit. That sounds a little bit crazy to those of us who grew up in a more statistical camp, but the idea is: look, we're trying to design a good measure; this item is discriminating differently; it's going to lead to these weird ordering effects where now I can't have item maps that are all in the same order if I pick different response probabilities; I don't like that, so I'm not going to use that item. Which means you're defining, in a very strong, very statistical way, what you think the construct is, and it gets to be this subset of the things you might want to measure, because you're throwing away all the stuff that doesn't fit the model. What you end up getting is arguably this very clean scale where everything is ordered without conditions, there's no crossing of these lines, no interactions, and this item is always more difficult for everybody than that other item. What you might have lost in the process is content, and as I said, content is king. Content is king; you can see my bias here when I'm talking about this: you should fit the data, have a theory, and not throw out data to fit your model. But at the same time, I think they have a framework in place that makes them comfortable doing that for particular uses. They tend to be very diagnostic about these things; these tend to be targeted scales for particular purposes, and they don't tend to claim it's good for all purposes; I don't think they'd say to do that for a state assessment. But this camp exists, and they're good people, but they really like their model.

[Audience question about whether, when measuring a construct with many items, you might instead treat subsets separately and build separate subscales.]

That's sort of an exploratory factor analytic or confirmatory factor analytic approach, where you want to take a data-based way of saying whether this item loads more on this or loads more on that. That's something you can do as well, and I see the confirmatory factor analytic camp as not so different from the Rasch camp: they're trying to make the pictures fit, and I don't think that's bad; I think it serves particular purposes. But I tend to be more unidimensional, because I'm cynical about the ways people can use multiple scores; they were just going to add them together in the end anyway, so you might as well analyze it that way. But for theoretical reasons, I see why SEM and factor analysis are useful for that purpose.

So, just some useful facts for you. For the one- and two-parameter logistic models, there is a sufficient statistic for estimating theta. What is a sufficient statistic? It holds all the information you need to estimate theta. It is not theta, but it holds all the information that you need to estimate it. That sufficient statistic is the sum of the discrimination parameters for the items you got right. Does that make sense? At least operationally, if not necessarily intuitively. So basically, in a 1PL model, all the discriminations are the same, which is to say that the number correct, for the Rasch model, holds all the information you need to estimate your theta; which is to say that everyone who gets the same sum score will have the same theta.

OK. Now, when you have discriminations that differ, and some items hold effectively more information than others, you get credit for the discrimination parameters of the items you answer correctly. So if you get 20 correct and I get 20 correct, if you get an 80 percent and I get an 80 percent, we might not have the same theta. Why would it be different?

[Audience answer: because mine were the easy ones?]

This is good; I totally tricked you, I'm so sorry, but that is exactly what I said when my advisor asked me this, like, 12 years ago. So yeah, the answer was that yours were easier, so you got the 20 hard ones right and I got the 20 easy ones right. But don't forget that if you got the 20 hard ones right, then you must have gotten 20 of the easy ones wrong. I said, that's weird. So it's actually not the difficulty of the items that matters; it's the discrimination. The idea is that the 20 you got right were the ones that had the information, and the 20 that I got right were the ones that were coin flips.

But, you say, I said the same thing. I mean, it is close, but you have to invert it, right?

[Partially inaudible audience exchange about how often that actually happens in practice.]

But again, remember that for him to get 80 percent of the difficult items correct, he must have gotten 20 percent of the easy items wrong, which is basically a statement of misfit. That's weird, right? It doesn't happen that often, and if it happened a lot, the model wouldn't be unidimensional; it wouldn't be modeling the data right; it would say, I have no idea what you're doing, all these items aren't correlating with each other. So it doesn't happen very often, and for the most part the scale will be unidimensional, which is to say, if the 1PL fits, the higher you are, the more you're getting these items correct with greater and greater probability, and with even higher probability for all those easier items. So that's what the unidimensionality assumption and model fit kind of bake into this: the rarity of that happening. But that's absolutely right, that is the intuition I had too; you just have to remember to flip it and say, but don't forget, you got all the easy ones wrong, which is weird.

Good. So I think this is helpful intuition for you.
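A small Python sketch of that "useful fact," with made-up discriminations and response patterns: under the 2PL, the sufficient statistic for theta is the sum of the discriminations of the items answered correctly, so two examinees with the same number correct can carry different amounts of evidence.

```python
import numpy as np

a = np.array([0.5, 0.6, 1.8, 2.0])   # hypothetical item discriminations
you = np.array([0, 0, 1, 1])         # you got the two highly discriminating items right
me = np.array([1, 1, 0, 0])          # I got the two near-coin-flip items right

print("number correct:", you.sum(), "vs", me.sum())                            # same sum score
print("sum of a_i for correct items:", (a * you).sum(), "vs", (a * me).sum())  # 3.8 vs 1.1
```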

And so, just to note here: when you get your scores from state testing programs, where do they come from? You might think that, if they were an IRT-using state, they would estimate a theta for everybody and report all these different thetas. That is not what happens, and there's a reason it's not what happens, and it's purely to do with feasibility and transparency. The feasibility idea is that we can't run all these thetas; you can't run these giant models every single time. The transparency idea is: hey, that thing we just talked about? Try explaining that to someone in the public. You see that you got 20 correct and I got 20 correct, and you're telling me that we got different thetas? The fact is that we psychometricians can't explain that well. So we're giving up on the fact that that theta, if we truly have a two-parameter logistic model, is a better estimate; if we're answering more informative items correctly, we should use that information. We generally don't, for the sake of transparency. What a lot of states publish, and you'll see these in the tech reports, are raw-score-to-scale-score conversion tables, which is just to say: take the sum score, find your row, and there is a one-to-one mapping from raw scores to scale scores. We wouldn't be able to do that if we had this weird thing where, if you got a 20 with this particular pattern of items right, then you have this theta, and someone else with a different pattern has this other theta. That's what we call the difference between pattern scoring and number-correct scoring. So you might, in your own analyses, have thetas from a 2PL that have this continuous distribution, but what you get from a state is going to look much more discrete, even if they have a 2PL or a 3PL.
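A minimal Python sketch of what number-correct scoring amounts to, with a made-up conversion table: one scale score per raw score, so everyone with the same sum score gets the same reported score, no matter which items they answered.

```python
# Hypothetical raw-score-to-scale-score conversion table (values invented for illustration).
raw_to_scale = {0: 400, 1: 430, 2: 455, 3: 480, 4: 510, 5: 545}

def report_score(item_scores):
    """Number-correct scoring: the reported scale score depends only on the sum score."""
    return raw_to_scale[sum(item_scores)]

print(report_score([1, 1, 0, 0, 0]))  # 455
print(report_score([0, 0, 1, 1, 0]))  # 455 as well: same sum, same scale score
```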

[Audience question about whether that loses information for individual students or schools.]

For individual students or schools, as I showed you in those scatterplots before, the correlations are like .98 or .99, so it is not making too much of a difference. But yes, what we're basically conceding is that we're going to punt, for feasibility and transparency reasons. And don't forget that the value of IRT, which I actually haven't had sufficient time to demonstrate here, is scale maintenance: we can't use the same items this year that we used last year, because everyone saw them last year, and so now we have to use different items. But because we know what the features of those marbles in that urn are, we can build, in effect, the perfect test that measures the same thing across the same range that we could before.

So, to give you an example: this is the 1PL; this is the sum score; and this is the distribution of the theta scores. It's the same thing mapped to the same thing; all we did was a one-to-one transformation, because the sufficient statistic is the sum score. And all IRT does here, and this is what I described before, for practical purposes, for a static set of item responses, is squeeze the middle and stretch the ends, and that's it. You can see that just barely here: what we're doing is a nonlinear transformation.

So this is the 1PL versus the 2PL, and I'm showing you these plots here. This is the one-parameter logistic: everyone who got a 3 gets the same score. But you can see that, at any given score point, the people who scored really high on the 2PL are those who got the more discriminating items right, and the people who scored really low got the less discriminating items right.

So, how should I... let's see, I'm trying to think about how to close here. With five minutes left, let me just go back to basics and open it up for questions; I think that's what I'll do. There's a lot here: this is linking, this is showing you how you can get to the comparisons that I showed you today through common items. But anyway, let me close here and I'll open it up for questions.

What do I want you to believe? What do I want you to do? I think there's so much to be said for just diligent exploratory data analysis, and I hope you don't think that's too boring, because I swear it will save you so much time later, when you're trying to fit your IRT models and they're not converging; it is well worth it. Today I was selling IRT, and I showed you how it works, but there is a really powerful piece that I didn't get to animate sufficiently for you: how these marbles from these urns do have these properties. Each item has an information function associated with it, and you can pick an item up and say, I want to measure here, and this other one measures here, maybe, and you can build the sort of perfect test this way, to discriminate at particular points in the theta distribution. That's really powerful. So, for example, if you wanted to evaluate people right at a cut score, if you were designing a diagnostic test for pass-fail purposes, you could stack all of the items from your urn that have maximal information at that particular point and target a test for precisely that purpose. So IRT, through these strategic item estimates, allows you to have that information. And I can actually show you; you can see it if I zoom out a little bit here.

Right, so under here I have these item information functions. Here's what I'm going to do: I'm going to increase the discrimination on the blue item, make this like 2, and you see that, right there? Now I've described that item as having a lot of discrimination at exactly that point; and if its difficulty were negative one, it would have its information out at that point instead. So each of these items has an information function, and you can stack them up and figure out where you're going to minimize your standard errors. So these people are going to have low standard errors, and these people you can sort of sacrifice, because you're not making decisions about them.
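A minimal Python sketch of that stacking idea (hypothetical item parameters): for the 2PL, item information is a squared times P times (1 minus P), test information is the sum over items, and the standard error of theta is roughly one over the square root of test information, so piling up items with information near a cut score shrinks the error right there.

```python
import numpy as np

def p_2pl(theta, a, b):
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_info(theta, a, b):
    """2PL item information: a^2 * P * (1 - P), peaked at theta = b."""
    p = p_2pl(theta, a, b)
    return a ** 2 * p * (1.0 - p)

theta = np.linspace(-3, 3, 13)
items = [(2.0, 0.5), (1.8, 0.6), (1.5, 0.4)]   # hypothetical items targeted near a cut score of ~0.5
test_info = sum(item_info(theta, a, b) for a, b in items)
se = 1.0 / np.sqrt(test_info)                  # approximate standard error of theta
print(np.round(se, 2))                         # smallest near theta = 0.5, larger in the tails
```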

So, again: don't forget content, don't forget classical test theory. I've just begun to scratch the surface of the usefulness of IRT, and we've all got a lot more to learn in this field. So let me open it up for questions. One of my students just told me the other day: don't ask "do you have any questions," because the answer could be no; say "what questions do you have?"

[Inaudible audience question about when IRT is worth the extra effort.]

So the upshot is: there's always your go-to; like 70 or 80 percent of what you need to do can be done without IRT. Again, when is IRT helpful? It's when you're changing items and changing populations, when stuff is changing over time. If you just have this little form, the eight-item grit scale, don't worry about IRT. But if you want to sub out those items because people are starting to remember them, if we start using grit for high-stakes testing and people are like, hey, I remember that item, then you want to start switching items out, and that's when IRT starts to be super useful. So I'd always keep it in your back pocket for when you need to sub out items, or, let's say, you want to take the test somewhere else: we can talk about differential item functioning, but what if you want to pick this test up and take it to Japan, or something like that? Then IRT helps you figure out measurement invariance. So there are all these use cases where you should feel like you've got IRT as your sledgehammer in your basement, ready to come out and tackle a particularly thorny problem. But again, classical test theory is your basic IKEA toolkit; it gets you pretty far.

[Largely inaudible audience question about how to deal with situations where a test seems to be consistently biased against certain groups of students.]

So, very strategically, back in the day when biased tests were a concern, not that they're not a concern anymore, scholars at E.T.S. said, hey, let's call it something more neutral, because they were asking good questions about whether measures differ for different people, but bias is such a loaded term. So Paul Holland and others coined the term differential item functioning, to make the study of bias sound scientific, and it kind of does, I guess. But the basic idea is that you have two different item characteristic curves for the same item, corresponding to different groups, and that's bad, right, because then the trait doesn't contain all the information about how you're responding to a particular item. If you estimate a different item characteristic curve for a different population and it doesn't align, you've got evidence of differential item functioning for that group. There's a whole set of D.I.F.-detection commands you can pull up the help files for. You could also do a logistic regression of the item score on the total score, with an indicator for the group, and that in and of itself will give you a test of whether or not the item is functioning differently for one group or the other. So there are a bunch of different ways to detect it; it's a violation of the model, and it's a concern.
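To make the logistic-regression approach concrete, here is a minimal sketch on simulated data (my own illustration, not the speaker's code; all variable names are made up): regress the item score on the total score plus a group indicator, and a significant group coefficient flags uniform D.I.F. for that item.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(0)
    n = 2000
    group = rng.integers(0, 2, n)                   # 0/1 group indicator
    theta = rng.normal(0, 1, n)                     # latent trait
    total = theta + rng.normal(0, 0.5, n)           # stand-in for the observed total score
    # Build in DIF: the item is harder for group 1 at the same trait level.
    p_correct = 1 / (1 + np.exp(-(theta - 0.5 * group)))
    item = rng.binomial(1, p_correct)

    df = pd.DataFrame({"item": item, "total": total, "group": group})
    fit = smf.logit("item ~ total + group", data=df).fit(disp=False)
    print(fit.summary())
    # A significant 'group' coefficient suggests uniform DIF; adding a
    # total:group interaction would screen for non-uniform DIF as well.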

[Audience question, largely inaudible; from the reply, it asks whether group differences in average scores, such as achievement gaps, are themselves evidence of differential item functioning.]

So D.I.F. is conditional on theta, right? For two people with the same theta, are there different probabilities of a correct response? That still allows different groups to have different distributions of theta. So you can have these two groups with two distributions of theta, and that can be a true gap, right? But then if you estimate two different item characteristic curves from them and they don't align, that's problematic. So for people who score very low in both groups, are they going to get this...

[Audience follow-up, inaudible.]

So what we do, there are two things we do. First of all, we assume through the content development process, content is king, right, that you are measuring something that's good, that's right, and that's the part of the theory where we're not just asking yacht questions or country club questions, or color-blindness questions for people who are color blind. So it's got to go back to content in that regard. And then, once you have that, you're looking at relative D.I.F., right, because it's always going to sum to zero; it's circular in exactly the way you're describing. Or you have some sort of external referent that you assume is unbiased. So it's one or the other: if you do it the internal way it's going to be circular; if you do it the external way you have to question the bias in the external referent. Those are the two approaches to doing it. The way we get out of that circularity jam is coming all the way from the models back to content and some theory that what you're measuring is right. So what we usually do in the test development process is we flag items for D.I.F., they go to a content review team, they try to come up with a couple of hypotheses for why that could have happened, usually they can't, and so they leave the item in. Paul Holland wrote this famous paper in like 2003 or something; the title is something like "what's the DIF about DIF that don't make no diff."

Which is differential item functioning, right. Because that's really what happens in practice: tests are designed, through the content development process, and this is Diane Ravitch's "language police" kind of book from way back in the day, to already squeeze out everything interesting, everything that might possibly function differently across test takers. So you get something so sterile in the end that there's almost no basis on which you can really throw anything out. It's kind of a sad statement, but you know.

[Audience comment, largely inaudible; something to the effect that the answer always depends on the use.]

And this is very much the case. So first of all, I forgot: what time do we end? I thought we ended at 5, but I realize now it's 5:30. Wow, OK, well, we can talk about all sorts of stuff. We'll keep it to questions. I mean, I am kind of exhausted, but we've got half an hour, so let's go, let's talk scale pliability. I might really do that, so you guys had better ask questions, otherwise I'm going to get into it.

[Audience question, largely inaudible; from the reply, it concerns how much weight to put on model fit when using I.R.T.]

Let's get to that.

So, to address the fit question: there are different schools of thought. Psychological measurement, more than educational measurement, tends to be interested in model fit, and people in structural equation modeling and factor analysis generally are interested in a whole array of fit statistics that make me dizzy sometimes. Back in the day, like 20 years ago, you could get tenure for creating the next new fit statistic, and now there are 60 of them and I can't keep track. But I don't mean to be glib; you can tell by the way I'm talking about it that I'm skeptical of the idea. I think you can start off with something like an alpha statistic, and once it's at a sufficient level, you're just using I.R.T. to accomplish something: if it helps, go ahead and use it; if not, don't. So I think the dimensionality questions are often a little bit overwrought. That said, as a matter of operationalizing your measurement objectives, I do think alphas and scree plots and overall fit, the C.F.I. and R.M.S.E.A. and a whole suite of fit statistics, are helpful. The only problem is that you run the risk of people saying your fit statistic is 0.02 below the cutoff, and you're left asking where these cutoffs even came from and what they mean. So I'm a little cynical about fit statistics, but I do think they can be supportive.

I do want models that fit the data, I just don't dwell on the fit statistics. So how do I.R.T. and S.E.M. or factor analysis differ in practice? In the same way that regression and ANOVA differ in practice, right?

When we use I.R.T. we tend to be very interested in the marbles: we're trying to create a test, or maintain a test, so we care about the specific parameter estimates for those items and we use them very carefully. In S.E.M. and factor analysis, you're more interested in a global measure of whether the model fits, and if it fits, it helps to support your theory. Sometimes in structural equation modeling you're interested in particular structural parameters, in the same way that you're interested in regression coefficients, but in general you're interested in the global question of fit. So I guess that's the difference: the I.R.T. person says, I don't care so much about global fit; my standard error on this discrimination parameter is pretty decent, and it's more or less unidimensional, and that'll do. So what we usually see in practice are these scree plots and general fit statistics: someone describes fit and then moves on. If you look at Duckworth and Quinn, they do this sort of token confirmatory factor analysis, like, OK, hey, it fits, now let's go see if it predicts future outcomes, enough of that, let's go do something else. I think that's a good standard practice, and that article is a good one, right, where they do that internal consistency examination on the scale, confirm it works, and then people are often using it; that's a good model.
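Since the advice here is roughly "check an alpha first, then get on with it," here is a minimal sketch of that first step (my own illustration on simulated 0/1 item responses, not anything shown in the talk):

    import numpy as np

    def cronbach_alpha(items):
        # items: (n_persons, n_items) matrix of item scores
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)        # per-item variances
        total_var = items.sum(axis=1).var(ddof=1)    # variance of total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    rng = np.random.default_rng(1)
    theta = rng.normal(size=500)
    difficulties = np.linspace(-1.5, 1.5, 8)         # e.g., an 8-item short form
    probs = 1 / (1 + np.exp(-(theta[:, None] - difficulties)))
    scores = rng.binomial(1, probs)
    print(round(cronbach_alpha(scores), 2))          # if this is adequate, move on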

[Audience question, largely inaudible.]

Like, you mean if it's not normally distributed?

This isn't exactly what you're asking, but I think the idea is that you want to be able to say something about how much you believe your estimates, and that's a cool idea. In general I think this probably fits under more Bayesian ways to go about this.

There are a lot of people who do this Markov chain Monte Carlo approach to simultaneously estimate everything: they have priors on the b parameters, priors on the a parameters, they can have strong priors on the c parameters, and the data feed back into that information rather than going through a two-step approach. So I think that's probably where that sort of thing comes in, in a more fully Bayesian framework. I would look there; I haven't done that in a long time, so I'm not sure where the current state of the art is, but it's kind of a cool idea.
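As a rough sketch of what priors on the a, b, and c parameters can look like (an illustration of the general idea only; the specific prior choices are my own assumptions, and in practice an MCMC sampler or a tool like Stan would explore this posterior rather than evaluating it once):

    import numpy as np
    from scipy import stats

    def item_log_posterior(a, b, c, theta, y):
        # Log posterior for one 3PL item, given abilities theta and 0/1 responses y.
        if a <= 0 or not (0 < c < 1):
            return -np.inf
        p = c + (1 - c) / (1 + np.exp(-a * (theta - b)))        # 3PL response curve
        log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
        log_prior = (stats.lognorm.logpdf(a, s=0.5)              # mild prior on discrimination
                     + stats.norm.logpdf(b, 0, 1)                # difficulty centered at 0
                     + stats.beta.logpdf(c, 5, 17))              # strong prior: guessing near 0.2
        return log_lik + log_prior

    rng = np.random.default_rng(2)
    theta = rng.normal(size=1000)
    y = rng.binomial(1, 0.2 + 0.8 / (1 + np.exp(-1.2 * (theta - 0.3))))
    print(item_log_posterior(1.2, 0.3, 0.2, theta, y))           # higher is better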

So let me, let's talk about this a little. I mean, it's probably beer o'clock, but let's nonetheless do a little bit of scale pliability, or whatever we're about to get into.

So, is this an equal-interval scale? That's the big debate going on. I'm not sure it's a debate, it seems pretty obvious to me, but there are those in our field, less utilitarian and instrumentalist than I am, who are really struggling to give psychological and educational measurements the cachet of physical measurements. They want to say, this is my unbreakable scale, don't bend it, and I think that's sort of silly. So, interval scale: again, we're setting it up as linear in the log odds of correct responses to items, so there is a way in which it is already equal interval; you've always got to be equal interval with respect to something. There's a good literature right now, Bond and Lang, and Nielsen as well, that you cited in your paper, which I appreciate; it's good work. They're trying to tie achievement scores to these external references, and they're bending the scale in response to these other scales that the achievement test gets tied to, sometimes in very useful ways. So theta is equal interval with respect to the log odds of correct responses to items, but there's nothing magical about that; you can bend everything.

Right, and everything will still fit as long as it's a monotonic transformation; it's no longer linear in the log odds, but it's still going to fit the data, because it's going to chase the data in some arbitrary way. The literature sort of shows that it doesn't really matter, the data can't tell, as long as you monotonically transform both the item response function and the scale itself; the model is just going to chase the data. So what do you make of scale indeterminacy? The logistic item response function is mathematically convenient, and it has a loose rational basis under normality assumptions, but the data can't tell which of any plausible monotone transformations is desirable; there's no one correct or natural scale for measuring traits or abilities in education. So I come down very similarly to what Brian and Jesse articulated so well in their paper, which is that it's probably useful to think of a class of, again, I like to call them plausible monotone transformations, that you should subject your scales to: re-estimate after those transformations and make sure that whatever you're concluding is robust to them. Interpretations should be robust to plausible transformations of score scales. This is

what I described before, where we have these distances, one to two to three, and I think we need a way to talk about how pliable these scales are. Because who's to say, think about the item maps, who's to say that the distance between two-digit arithmetic and the distance between derivatives is the same? How are you going to objectively say what that difference is? So I would again say the scale is pliable; it's neither ordinal nor interval. I feel like ordinal versus interval is an antiquated dichotomy, and we should think of something between the ordinal and the interval; the equal-interval argument is weak but not baseless.

This is just to illustrate what happens if we were to operationalize a transformation of an underlying scale. Say I see normal distributions, but what I really care about are differences down there, like negative 3 to negative 1. That's where I want to prioritize growth, either from an incentive standpoint, or because that's where I, from a measurement standpoint, truly believe those distances are something like 10 times as large. So you can say these are actually the distributions I've got, and if you do a straight standardized mean difference, the transformation changes the actual effect size, the number of standard deviation units. You can look at differences in percentiles too, and the idea is that whatever judgment you're making should be robust to these transformations.

Similarly, what Sean and I did addressed a separate problem but still resulted in a neat technique, I think, to define this class of transformations that is mean- and variance-preserving. That's just to keep your head on straight: you're not trying to go to a completely different sort of scale; you're keeping the mean and variance approximately the same and just bending things in various directions, then looking at what that does to the distributions, which is kind of fun. So, subject to these constraints, this is a class of exponential transformations, and subject to those constraints we get this formulation, this transformation from x to x-star. What we're doing here is saying: this red transformation is accentuating these higher scores, and the blue transformation is accentuating these low scores.

You can also imagine kurtosis kinds of transformations, where you're stretching the tails but keeping everything symmetrical; these ones go in one direction or the other.
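For concreteness, here is a minimal sketch of a monotone, mean- and variance-preserving skewing transformation (the exponential form and the re-standardization are my own illustrative choices, not necessarily the exact family from the paper):

    import numpy as np

    def skew_transform(x, c):
        # Monotone exponential bending; c = 0 is the identity.
        z = x if abs(c) < 1e-8 else (np.exp(c * x) - 1) / c
        z = (z - z.mean()) / z.std()                  # re-standardize...
        return x.mean() + x.std() * z                 # ...so mean and SD are preserved

    rng = np.random.default_rng(3)
    x = rng.normal(size=10_000)
    for c in (-0.5, 0.0, 0.5):
        xs = skew_transform(x, c)
        skew = float(((xs - xs.mean()) ** 3).mean() / xs.std() ** 3)
        print(f"c={c:+.1f}  mean={xs.mean():+.2f}  sd={xs.std():.2f}  skew={skew:+.2f}")
    # Negative c pulls out a left tail (negative skew); positive c a right tail.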

So this is what would happen under these various c parameters as we've defined them, where you take a normal distribution: this is c of negative 0.5, negative skew, for the blue distribution, and this is c of positive 0.5, positive skew, over there. Yes?

[Audience comment, largely inaudible; the gist, from the reply, is that these transformations are not random perturbations: saying that growth down here counts more than growth up here is a value judgment, a weighting function one should try to make explicit.]

That's absolutely right. Any sensitivity study is not random; it's asking a different question, exactly, and I think that's exactly the right way to frame it. This is where I think item maps can help, because what an item map will do is go along with this function and say, hey, look, what you've now said is that derivatives are close to integrals and the distance from two-digit subtraction to division is huge. That's not random; that's a statement of a belief in these different magnitudes. So don't treat it as random error, but say: under this condition, these are the results you get; under that condition, those are the results you get. I think that's exactly right, and by the way, I think it's a general way to think about it, and a lot of people have said this: don't think of sensitivity studies as just a bunch of random things you do; each one is asking a different question. Exactly right.

The way we've set this up is that the bounds of c are set so that the slope of the transformation at the 5th percentile is between one-fifth and five times the slope at the 95th percentile. That's one way to think about it: the relative rate down here is up to five times the relative rate at the top of the distribution. There are various ways to think about how to stretch and squish the scale. So if you want a what-to-do kind of recipe: take the existing scores; apply a family of plausible transformations; taking Sue's feedback seriously here, also be very clear about what each transformation implies for a difference down here versus a difference up here, using some item mapping or some other way of describing it; calculate the metrics of interest from each transformed dataset; and assess the robustness of your interpretations of those metrics across these plausible transformations.
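Here is a minimal sketch of that recipe on simulated data (again my own illustration; the helper repeats the exponential form sketched above so this snippet runs on its own): apply a family of plausible monotone transformations, recompute the metrics of interest, and see which interpretations survive.

    import numpy as np

    def skew_transform(x, c):
        # Monotone exponential bending, re-standardized; c = 0 is the identity.
        z = x if abs(c) < 1e-8 else (np.exp(c * x) - 1) / c
        return x.mean() + x.std() * (z - z.mean()) / z.std()

    rng = np.random.default_rng(4)
    n = 5000
    theta_a = rng.normal(0.0, 1.0, n)                 # reference group
    theta_b = rng.normal(-0.4, 1.0, n)                # focal group, lower mean
    outcome = 0.5 * theta_a + rng.normal(0, 1, n)     # some later outcome for group A

    for c in (-0.5, -0.25, 0.0, 0.25, 0.5):
        pooled = skew_transform(np.concatenate([theta_a, theta_b]), c)  # one common rescaling
        a, b = pooled[:n], pooled[n:]
        gap = (a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        r = np.corrcoef(a, outcome)[0, 1]
        print(f"c={c:+.2f}  standardized gap={gap:.2f}  corr with outcome={r:.2f}")
    # If the gap and the correlation barely move across c, those particular
    # interpretations are robust to this family of plausible transformations.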

[Audience question, inaudible.] No, actually, this reference was related to gap measurement broadly, but what we were trying to do was make sure that our reliability estimates would not change too much, whether parametric or non-parametric. So we were really trying to solve a completely different problem; we were just saying, hey, there's a cool transformation that will work for this purpose.

I'm citing it because Sean and I hit on three fun things in that paper that had kind of nothing to do with the abstract. The first was, hey, what are reliabilities across state testing programs in the United States? We just threw that in as a figure. And another was this little thing here, just trying to solve the real problems associated with our V gap procedure. So it's really kind of ancillary, but that's where we started writing it up, and we should really write it up more formally, all the things we don't have time for. But yes: is this the right family, can we think of kurtosis kinds of transformations, are the bounds on c set appropriately? I'd love to incorporate that feedback, that we shouldn't treat this as random. So this is ours:

does the reliability coefficient change if we use a nonparametric, ordinal reliability coefficient instead? And this is sort of saying that actually our correlations are pretty stable across all of these different transformations, so we don't have to worry too much about reliability depending on scale transformations. So here's what I would say: we can create

a hierarchy of statistical procedures based on whether they are sensitive to scale transformations. Differences in means are going to be pretty darn robust; correlations, as we've shown here, are pretty darn robust; differences in differences, that gets problematic, right? Whenever you have these interaction effects, that's heavily dependent on scale, because all I have to do is squish this to make it parallel and stretch that, and I get a different kind of interaction effect. So there are different classes of procedures that I think we can lay out in a more-sensitive versus less-sensitive kind of framework, and I think that would be useful. Nielsen does some of that.

In the papers, it wasn't a shock to us that changes in gaps are sensitive; that's pretty straightforward. But generalizing that, to say that these kinds of methods, these kinds of questions, are in general sensitive to the scale, is really useful.

So this is just a little example of how value-added models are not robust, but we don't have too much time. I'll just draw this.

So if you look back, another good reference for this, and for the changes-in-gaps question, is Haertel and Ho, and Ho and Haertel in 2006, where we showed that for the most part gaps are stochastically ordered: there's nothing you can do to reverse the sign of a gap, because for the most part high-achieving groups and low-achieving groups are so far apart that there's no transformation that could possibly reverse them. But we created a sort of proof of what we call second-order stochastic ordering, which is kind of a mouthful, but the idea is that for changes in gaps, as long as certain conditions hold, it's very, very easy for a transformation to reverse the sign of the change in the gaps. Right, exactly, which is the same as an interaction effect, which I think is what you're saying.

Exactly, exactly right. Because here's what I mean: your response is exactly right, maybe not the "leave the room" part, though maybe that's the right thing to do, but your response is exactly right, which is to say: what are the intervals that this is assuming?

So that's where I think the item map and scale anchoring can be really helpful, because you're saying, look, if you want to disagree with me about the ordering, here is what I'm saying about the scale: this point is this, this point is this, this point is this. Have a content-based argument about it, go ahead. I think that's where you can set your stake in the ground. Because what I don't want to do is get to a sort of nihilistic place, taking Bond and Lang a little bit too far, and say, let's solve for the crazy possible transformations that could possibly reverse this gap; I think that's a little bit too extreme. So what I tried to do in this paper with Carol Yu is to ask: what are the distributions we see in practice, and how crazy would we have to stretch things for that to be plausible?

We should have a debate, in exactly the way that I think you and Jesse were describing, about what's plausible and in which situations that should be leveraged; a decision can be made based on a survey of the shapes of distributions that we see in practice. That's fun, thanks, I'm glad we had that extra time.
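As a closing illustration of the gaps-versus-changes-in-gaps point (a simulated sketch of my own, not the construction in the paper just mentioned): under a single family of monotone rescalings, the standardized gaps below stay positive throughout, but the sign of the year-to-year change in the gap is not stable.

    import numpy as np

    def monotone(x, c):
        # Monotone exponential bending of the score scale; c = 0 is the identity.
        return x if abs(c) < 1e-8 else (np.exp(c * x) - 1) / c

    def std_gap(x, y):
        # Standardized mean difference with a pooled SD.
        return (x.mean() - y.mean()) / np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)

    rng = np.random.default_rng(5)
    n = 50_000
    a1, b1 = rng.normal(0.0, 1.0, n), rng.normal(-0.6, 1.0, n)   # year 1, groups A and B
    a2, b2 = rng.normal(0.0, 1.0, n), rng.normal(-0.7, 1.5, n)   # year 2, groups A and B

    for c in (-0.5, 0.0, 0.5):
        t = monotone(np.concatenate([a1, b1, a2, b2]), c)        # one common rescaling
        ta1, tb1, ta2, tb2 = np.split(t, 4)
        g1, g2 = std_gap(ta1, tb1), std_gap(ta2, tb2)
        print(f"c={c:+.1f}  gap year 1={g1:.2f}  gap year 2={g2:.2f}  change={g2 - g1:+.3f}")
    # Each gap stays positive for every c here, but the change in the gap comes out
    # positive for c = -0.5 and negative for c = +0.5: the interaction-style
    # quantity is the scale-sensitive one.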