Andrew Ho: 2016 psychometrics mini course - Part 2
Transcript:
So I've jumped ahead a little bit — apparently something weird is going on with my slides; we have some sort of Star Wars scroll going on behind the scenes, and I don't know what's going on there — that's kind of cool. So we're going to focus on IRT. This is part of a larger presentation, obviously, about practical applications and then a sort of critical perspective on IRT. Again, this is my effort to demystify it, highlight its uses, and also highlight its limitations.

I think it's useful to contrast IRT with classical test theory. The object modeled statistically in classical test theory is the actual item score — a 0 or 1 for an educational test, or a 1 to 5 on a Likert scale. In IRT, it's the probability of getting a particular item correct, or of reaching a particular threshold on a polytomous, Likert-type item, dependent on a series of parameters I'm going to explain shortly. And the conception of what's measured: in CTT, what is the score? Again, that's the question I keep asking — what is your score? In CTT it's an examinee true score; you can think of it as an average, you can think of it as a sum, equivalently defined as the expected value of observed scores across replications. In IRT we create this rarefied, mystical theta, which is a particularly useful scale, as we'll describe, for comparing across different populations — a very useful scale if the model fits the data.

So again, just to highlight CTT, and why I spent the first half of this presentation on it: it is still by far the most widely used psychometric toolkit. It can do a ton — do not sell it short. But of course, as we've recognized, sometimes you have to publish papers and need a fancier acronym, and IRT will come in handy then. CTT should be a knee-jerk first analysis, just as descriptive statistics are a first step for most statistical work.
So. We think IRT is useful because of how it conceives of items. CTT treats the test like an urn full of items: I don't care about the individual marbles — on average they're like this, and their variance is like this. In IRT you take each little marble out of that urn and appreciate it: this marble is special, it has these properties. Then you can take all the marbles out, lay them on the table, and say, here's what I want to do with them. That tends to be much more useful as a way to design tests and to maintain scales over time, and it's the standard approach for large-scale testing. Which is to say that for many of you, IRT is a little bit of a sledgehammer to a nail. If you're developing your own scale for your own purposes, and it's just going to stay static and be used with a particular population, go ahead and do IRT, but realize it's kind of an indulgence — again, a sledgehammer to a nail. For large-scale testing programs, where you are substituting out items, maintaining scales, and giving the test to different populations over time, IRT is incredibly powerful, and it is the standard for use in large-scale testing programs. Perhaps the only major exception is the Iowa Tests — I used to teach at Iowa — and they are mostly holdouts for classical test theory. You can see why I've been so classically focused here: they pull off amazing things without IRT, and do it quite well and quite rigorously. You do not always need IRT; it's just simpler and more elegant for those purposes, which is why it's in such common practice today.
So IRT sort of asks: what if there were this alternative scale — this alternative theta scale — on which item characteristics would not depend on each other the way they do in classical test theory? In the Google Doc — I think it's Wendy Yen and Fitzpatrick, in their chapter on IRT in the handbook — they describe IRT, if your assumptions hold, as person-free item measurement and item-free person measurement. Which is to say that the statistics you come up with — the difficulty of an item, the discrimination of an item, and the proficiency of you, the examinee — don't depend on the items you happen to have, and similarly the item features don't depend on the population you happen to have. And if the assumptions hold, that is pretty darn powerful, because that marble you pick out of that urn has those properties and will always have those properties, and so when you use it to construct a test, it will continue to have those properties — again, if the model holds.
So just to define this — we're going to get into logits here. The simplest IRT model is known as the Rasch model, for Georg Rasch's 1960 monograph. It's also known as the one-parameter logistic, or 1PL, model. I like to write it like this: the log of the odds of a correct response to the item — P over Q, the probability of a correct response over one minus the probability of a correct response, is the odds, and this is the natural log of those odds — and the log of the odds is just a simple linear function: this common slope parameter a, and this person intercept, theta for person p. The sign is important here: in logistic regression you're used to seeing a plus, and we're going to define it with a minus, so that this is difficulty instead of easiness — a difficulty b for each item i. And then this is a random effect — an error term, if you like: theta_p is distributed normal(0, 1). Note that there is no variance of the person distribution being estimated here; we're standardizing it to normal(0, 1). Most IRT models aren't written out like this, and I think that has the effect of mystifying them — they get confused with the logistic function. I prefer just saying, hey, we're linear in the log odds here; this is not fancy. If you can do logistic regression and you understand what a random effect is, then this is just familiar modeling. OK, so again, the log of the odds is simply this common a — there's no subscript i on a; the discrimination does not depend on the item, it's just a common parameter estimated across all items. That's going to change in the next model, but for now it's common across items. Then there's the difficulty parameter b for each item, and every person gets a theta.
OK.
This is the more intimidating way of writing the same model — we model the probability itself — but it is, of course, equivalent. This is the scary way of writing logistic regression, and that was the maybe less intimidating way of looking at it. As long as you don't look over here, it's just the log of the odds, and logistic regression, when you work in the log odds, is just a generalized linear model. Don't forget that.
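To pin down the two equations being described on the slide (the algebra is from the talk; the person-and-item subscripts are my notation), the 1PL in log-odds form and the equivalent probability form are:

    \ln\frac{P(x_{pi}=1)}{1-P(x_{pi}=1)} = a\,(\theta_p - b_i),
    \qquad
    P(x_{pi}=1) = \frac{\exp\{a(\theta_p - b_i)\}}{1+\exp\{a(\theta_p - b_i)\}},
    \qquad
    \theta_p \sim N(0,1).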
So these are the curves that we estimate — the logistic curves, the item characteristic curves. They say, for a given theta, here are the probabilities of getting each item correct. So at a theta of 0, this item has about a 50 percent chance of being answered correctly, or thereabouts — maybe 55 percent — another is up at 70 or 75 percent, and so on. So which items are easier? The ones you can think of as on top, or as shifted to the left; the more difficult items are the ones shifted to the right — or shifted down, depending on how you think about it. In logistic regression we usually think about intercepts on y; what we've done here is flip that to think about position on x. So those are higher and lower y-intercepts, but what we've done is shift to thinking of greater difficulty as shifting that sort of S-curve — Walk Like an Egyptian, sort of — that way.
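If it helps to see those S-curves concretely, here is a minimal Python sketch — not the presenter's demo, and with made-up item parameters — that draws a few item characteristic curves:

    import numpy as np
    import matplotlib.pyplot as plt

    def icc(theta, a, b, c=0.0):
        """Item characteristic curve: P(correct). The 3PL form; c=0 gives the 1PL/2PL."""
        return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

    theta = np.linspace(-4, 4, 200)
    for a, b in [(1.0, -1.0), (1.0, 0.0), (1.0, 1.5)]:   # hypothetical easy, medium, hard items
        plt.plot(theta, icc(theta, a, b), label=f"a={a}, b={b}")
    plt.xlabel("theta"); plt.ylabel("P(correct)"); plt.legend(); plt.show()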
And just to give you a little bit of the punch line here: what I've done is show you a scatterplot of the classical test theory difficulty — which is to say the percent correct; classical test theory is super annoying and calls the percent correct the "difficulty" — against the IRT difficulty. There's a negative relationship. This is just to say: if you ask how percent correct corresponds with IRT difficulty — is IRT giving me something magical and mystical over and above percent correct? — the answer is, not really. It's pretty much the same information. That's not surprising, but again, IRT is going to be useful for some more advanced applications.
I just want to demystify this further — Matt and I had a conversation about this earlier today. IRT is a latent variable measurement model; it is a factor analytic model; it is a structural equation model. Do not think of these as separate things. They are separate practices, in the way that ANOVA and regression are separate practices but are the same under the hood. The act of doing ANOVA is a way of thinking about a statistical analysis, even if I could do the same thing with regression. Similarly, structural equation models and factor analysis I think of as different practices, asking somewhat different questions, using the same statistical machinery. I'm happy to elaborate on that, but I don't want to treat these as completely separate models when I think of them more as separate literatures and separate fields, used for separate reasons — in the same way that ANOVA and regression are really the same under the hood. So what I'm setting up for you here is a way of doing IRT using the gsem — generalized structural equation model — commands in Stata. You can see here, all it is, you know — IRT is factor analysis with categorical variables. Right? That's all it is. And that's not all it is — what we do with it is different — but under the hood, that's all it is.
So the SEM formulation is that the probability here, dependent on theta and b, is the logistic of theta minus b — a slightly different parameterization than the one I showed you with the a term, because there the a was outside the parentheses — but it's the same general approach. The slope is constrained to be common across items, and you'd fit this in Stata with gsem; it's a logistic regression, the same thing. Actually, before Stata 14 came out — Stata 14 was just released last year, and Stata 14 has an IRT package — before they had an IRT package, guess what? I did it in gsem. Which is to say: why are you teaching a course using structural equation models? Because they're the same thing. And so I had all this really convoluted code to get all the stuff I needed out of gsem, and then of course Stata — thankfully, at least IRT-wise — made all that obsolete, and I had to redo everything. But it just goes to show it's the same thing under the hood. So this is the two-parameter logistic — yeah, sure, actually, Chris?

[Audience question from Chris, inaudible — presumably about whether gsem can fit these models.]

Absolutely — gsem has a hard time with the 3PL, but it can absolutely do the 2PL. All you do for the two-parameter logistic model is free this right there: instead of forcing the slope to be the same across items, you let it vary, and that gives you the two-parameter logistic model. The three-parameter logistic model I don't think you can do in gsem — you can do it in GLLAMM, Sophia Rabe-Hesketh's package, or with the irt commands — but again, it's the same thing under the hood. Good question.
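To make the "it's just familiar modeling" point concrete, here is a minimal Python sketch — not the lecture's Stata/gsem code — that fits the 1PL by marginal maximum likelihood, treating theta as a standard normal random effect exactly as described above. The data and parameter values are simulated and purely hypothetical:

    import numpy as np
    from scipy.optimize import minimize

    # Simulate a hypothetical person-by-item 0/1 matrix from a 1PL.
    rng = np.random.default_rng(0)
    n_persons, n_items = 1000, 20
    theta_true = rng.normal(size=n_persons)
    b_true = np.linspace(-2, 2, n_items)
    a_true = 1.2
    X = rng.binomial(1, 1 / (1 + np.exp(-a_true * (theta_true[:, None] - b_true))))

    # Marginal ML: integrate the N(0,1) theta out with Gauss-Hermite quadrature.
    nodes, weights = np.polynomial.hermite.hermgauss(21)
    q_theta = np.sqrt(2.0) * nodes          # quadrature points on the N(0,1) scale
    q_w = weights / np.sqrt(np.pi)          # matching quadrature weights

    def neg_loglik(params):
        a, b = params[0], params[1:]
        p = 1.0 / (1.0 + np.exp(-a * (q_theta[:, None] - b)))     # (quad, items)
        p = np.clip(p, 1e-12, 1 - 1e-12)                          # guard the logs
        patt = X @ np.log(p).T + (1 - X) @ np.log(1 - p).T        # (persons, quad)
        return -np.sum(np.log(np.exp(patt) @ q_w))

    start = np.concatenate([[1.0], np.zeros(n_items)])
    est = minimize(neg_loglik, start, method="L-BFGS-B").x
    a_hat, b_hat = est[0], est[1:]          # common slope and item difficulties

Up to parameterization and estimation details, this is the same 1PL the talk fits with Stata's irt and gsem commands.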
So this is the two-parameter logistic model. It allows items to vary in their discrimination. Again, I like writing it in log-odds terms, and all I've done is add a subscript i — all I've done is let the slope parameter vary across items. And then again we have difficulty: these are the more difficult items, these are the less difficult items. If I wanted to be less fancy about it, what would I do? I would plot this in the log odds, and then it would just look like a bunch of straight lines. So again, this is a sort of mystifying way of describing IRT; if I wanted to make it simpler, I'd just show you all the different straight lines in log-odds space.
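In the same notation as the 1PL above, the only change for the two-parameter logistic model is the item subscript on the slope:

    \ln\frac{P(x_{pi}=1)}{1-P(x_{pi}=1)} = a_i\,(\theta_p - b_i), \qquad \theta_p \sim N(0,1).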
[Audience question: how many parameters are on that plot?]

In the legend there are 20 items. We're not estimating the theta_p's — those are a random effect; we're standardizing that distribution — and because we're not estimating them directly, we can get them in a Bayesian way after the fact, in the same way we get other random-effects estimates after the fact. And then we have, in this case, 20 parameters for difficulty.
So this here — I can actually show you in the output... where was the output? I don't think I have it here yet, but — [inaudible exchange about the output and degrees of freedom].
Absolutely — you know, what are we feeding in here? You can do it long or wide; it doesn't really matter, and Stata lets you do it wide just as easily. I should have shown this before: the data look like a person-by-item matrix, where you have persons as rows, items as columns, and zeros and ones in each of the cells — you can also extend that to 0, 1, 2 for polytomous items — and you're modeling the probability of a correct response to each item. So what do the data look like? I think I have this, if I can show it to you.
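As a small illustration of that layout (hypothetical data; the column names are mine), here is the wide person-by-item matrix and the equivalent long format in Python:

    import pandas as pd

    # Wide: one row per person, one 0/1 column per item.
    wide = pd.DataFrame({
        "person": [1, 2, 3],
        "item1": [1, 0, 1],
        "item2": [0, 0, 1],
        "item3": [1, 1, 1],
    })

    # Long: one row per (person, item) response, which some IRT routines expect.
    long = wide.melt(id_vars="person", var_name="item", value_name="correct")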
And there you go — this is sort of what the data look like behind the scenes. What I've done here: these are two separate item characteristic curves, for two items, and I've mapped the sum score associated with each theta onto the theta scale here, and plotted how many observations happen to be at each point as dots. So you can see that what we're trying to do is fit the probability of a correct response given that sort of overall score. Does that help a little bit? OK.

Theta is really weird and annoying because — I mean, where does it come from? It's this latent scale, and you can see it's set in the same way a random effect is set in Stata: we just say it's got some mean of 0, and instead of estimating the variance we put it back on the slope.
So let me give you a little more of a sense of what the curves look like. This is the item characteristic curve demo that I like to do — my visualizing-IRT page — and this is a three-parameter logistic model. There's a blue item hiding behind this — a blue ICC hiding behind this red curve — and what I'm going to do is increase the discrimination of this blue item. What we're going to see is that we increase this sort of slope here in the probability space, and the blue item is now what we describe as more discriminating, in the sense that people just below that midpoint versus just above it are going to have a pretty massive swing in their probability of a correct response. So my trick question to you is: which item is more discriminating, blue or red? The knee-jerk reaction is to answer that blue is more discriminating, but if you think about it more carefully — and some of you did a good job of working through this on the Google Doc — which item has a higher slope? Is there a general answer to that? And in fact, when might the red item be better?
Yeah — so at the tails of the distribution you can see that, for people who are very high achieving on this scale, or very low — and this goes back to Sue's question — for whom are we trying to discriminate? We're going to get to information shortly, but the idea is that what IRT allows you to do is ask: difficulty for whom? Discrimination for whom? Even though you have a's and b's, you wouldn't want to simply call one item more difficult and another less difficult, because it all depends on for whom. And so you can use this to construct tests in very strategic ways, to provide information for high-achieving or low-achieving students if you're so inclined.
So, similarly, what I'm going to do now is increase the difficulty of this blue item. What do you think is going to happen — which way do you think the blue curve is going to go? The blue curve here is going to shift to the right — it's going to take a little bit of a walk — and for more and more people across the theta scale, the probability of a correct response is going to be low. So now your blue item is more difficult, it seems: it has, say, a b parameter of 1.0, and you say, that is more difficult. But is it really more difficult, or is it easier?
If you look all the way up at the top, you actually see an instance where the blue item is easier than the red item. So when the discrimination parameters are not the same, this is like an interaction effect: you can't really say, across the board, which item is more difficult and which is easier — it depends on where you are on the scale. Now, if all the a parameters are the same, as they are in the one-parameter logistic model, then there's never any overlap: the more difficult item is always more difficult and the easier item is always easier. But once you allow discrimination to vary, that allows you to be very targeted about for whom an item is difficult and for whom it is easy.
[Audience question, partially inaudible: I guess my question is, if you only gave the item to people who were really high achieving, wouldn't you have no information at the low end?]
That's right — you'd be forced to extrapolate, and it's the exact same thing as fitting a linear model. I mean, this is a linear model in the log odds, and you're saying: what I'm going to assume is that whatever happens for people down there is whatever that linear-in-the-log-odds assumption extrapolates to. So when we say person-free item measurement and item-free person measurement, what we're really saying is: yes, if my model holds — which is what we always say. This is just a regression assumption; there's nothing magical. But it is still nonetheless useful, and what we find in a lot of cases is that the linear-in-the-log-odds assumption is pretty reasonable.

So, yeah — just a quick note: the slope here, at the middle of the probability curve, is a over 4, and of course in the log-odds space it's just the slope a itself. And again, be careful when a's vary — when discrimination varies, be careful about assuming discrimination is discrimination. Do not select items based on parameters; select items based on curves.
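Where that a/4 comes from (a quick derivation, not on the slide): differentiating the ICC with respect to theta and evaluating at theta = b, where P = 1/2, gives

    \frac{\partial}{\partial\theta}\left[\frac{1}{1+e^{-a(\theta-b)}}\right]
      = a\,P(\theta)\,\bigl[1-P(\theta)\bigr]
      \;\Big|_{\theta=b} = a\cdot\tfrac12\cdot\tfrac12 = \frac{a}{4}.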
So, in essence, you should think in an item-characteristic-curve way — always visualize the items themselves if you can.
So I want to show you what happens with the c parameter, which I haven't really talked about, given how fast I've been rushing through this. When I increase it here, I'll show you what happens — it sort of lifts the floor. See what's going on here? Some of you might already know the answer to this, but why would this be useful? Why would we want to say that, in certain cases in educational testing, people with extremely low proficiency still have a 25 percent chance of getting the item right?

[Audience answer, partially inaudible.]

That's not quite where I was going with this, but I do like the sentiment. This is a data-fitting exercise, so you wouldn't really want to control it in that particular way — it wouldn't quite pull that off — but I really do like the sentiment; I think it's a cool idea.
So this is very tuned to educational testing, where you have multiple-choice tests, and the idea is that when you have a very, very low-scoring examinee, forcing the lower asymptote to be 0 is kind of silly. So — my general recommendation is to never use a three-parameter logistic model, and I'm going to show you why, by setting blue to 0.3... and then... that didn't quite work out. 0.3 and 0.2 — maybe I got this a little off; I know what it is, let me just fix the window.
So what I've done here is create a situation where we have fairly dramatically different parameter estimates — not that dramatic, but fairly different — and yet the curves are overlapping through much of the upper end of the distribution. You see how those curves are sitting on top of each other over there? The question would be: do you have enough information at the bottom end of that distribution to actually estimate those lower asymptotes? c parameters are notoriously noisy, and Stata, in all its wisdom — I'm very grateful for this — has actually not given you the option to fit a "true" three-parameter logistic model: when you fit a three-parameter logistic model, Stata says all your c parameters have to be the same across items, and it estimates a common lower asymptote. That's a really wise thing, because otherwise there's no information down there, you get a whole bunch of noise, and it throws all of your other parameter estimates off. So, just so you know: in general I don't recommend using the three-parameter logistic model. In practice it is used a lot, and I do not really understand why; I keep pushing back on states against using it, because it just adds a whole bunch of noise. Do not overfit your data, as a general rule. Luckily, Stata has prevented you from doing that by giving you a common c parameter to estimate — and that's just fine, if you're so inclined.
So — this is a little bit of the actual output. This is irt in Stata, and again, now that I don't have to use gsem anymore, my ridiculously long do-files are completely obsolete, because all you have to do is type "irt 1pl" and your items and you're all set. And you can plot — it's got some good IRT plotting functions for you. And you get output that sort of looks like this.
Yeah — [audience question, partially inaudible, about fitting it in the long format]. I did it in the long format... [inaudible exchange] ...you can — it's in my slides, so that's how I got it. It is — I actually deleted that slide here, but I have an extremely long do-file that does exactly the same thing. And the difference, in your mind, is that it's grabbing a random effect for people and then, afterwards, getting estimates of it. So I actually usually take a three-step approach: first — especially for economists, it's useful to show it this way, and for people who are multilevel modelers — you start off by showing it as a random-effects logistic model; then I show it to the people who have taken structural equation modeling or factor analysis before; and I just try to demystify it: under the hood it's all the same thing, don't freak out, but we psychometricians have developed kind of a mystical language for talking about it.
So now, just a quick note here again: this is linear in the log odds. People often say that IRT really is an equal-interval scale. It is equal-interval in the sense that it sets up this linear assumption, but the target of interest is the log of the odds of a correct response, and it assumes linearity between theta and all of those log-odds functions. So I'll just say: remember that this is the assumption. It's a pretty simple model when you show it like this — maybe it's not as pretty, but that's really what's going on.
So this, again, is a three-parameter logistic model estimating a common c parameter — I think that's a good thing. You can show that it fits better in some cases; I don't really like the likelihood ratio test for these purposes, because usually in practice you have these massive data sets, and everything is always going to show up as fitting better when you give it more parameters — it's not really that interesting. Sometimes simpler is better.
[Audience question, partially inaudible: if my probability on an item is 70 percent, does that mean that if I took a hundred similar questions I'd really get 70 of them right?]
I mean, I guess — that's an interesting question; you'd think it would be deterministic in some way. That's a good question. I think the way to think about it is: don't think about you; think about people like you who also sit at that theta. That's probably the easiest way to think about it: there are 100 people at that theta, and 30 of them are getting it wrong — so it's nothing against you personally. There's just something we haven't modeled in you that would let us tell; we don't have a more specific model for you. So think about all the people at that theta, rather than you having a 70 percent chance of getting it. It's the same sort of thing in any scatterplot for a regression: you have an X, you have a Y — you're not talking about you, you're talking about, on average, people at that X: what's your best guess for Y?
This is just a note on parameterization. You're asking whether you estimate the variance of those random effects or whether you let the slopes vary, and I just want to note that you can do both. For those of you who have taken factor analysis or structural equation modeling, you know you have to anchor the scale in one of two ways: you set the variance, or you set one of the loadings. I just want to show that there is an equivalence there — this is sort of an aside, and it's all here as a reference.
So, some practical guidance for you when it comes to sample size. You get the same kind of guidance as for factor analysis, but just be careful: this is not a small-sample kind of endeavor. For the one-parameter logistic model you can get away with small samples — this is just a reminder that when you have small samples, stick with Rasch; Rasch is a good way to get what you need. You get various advice from different authors for the two-parameter logistic model. The three-parameter logistic model — don't use it, unless it's the way Stata does it; it's just an absolute mess, and lots more examinees are needed for the 3PL, so don't even bother. And this goes for polytomous items too: you may have heard of the graded response model, which is for polytomous items. This is why I was saying get your discrete histograms — see whether people are actually responding at, say, the 4 and 5 score points, because you need responses there to estimate those curves.
So I want to talk a little bit about the practical differences between item response theory and classical test theory. Here what I've plotted is the sum score against the logit of the percent correct — adjusted a little bit to keep it away from 0 and 100 percent — and you can see that it's just a nonlinear transformation of the sum score, and it looks a lot like the one-parameter logistic estimates of theta. Which is just to say: don't think IRT is going to create dramatically different scores in your case; the one-parameter logistic model will give you thetas that are just a slight nonlinear transformation of the sum score. So that's the relationship between the one-parameter logistic and the sum score.
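A tiny sketch of that transformation (the exact adjustment used on the slide isn't stated; adding 0.5 to the numerator and 1 to the denominator is one common choice and is purely an assumption here):

    import numpy as np

    k = 20                               # hypothetical number of items
    raw = np.arange(0, k + 1)            # possible sum scores
    p = (raw + 0.5) / (k + 1)            # keeps the proportion away from 0% and 100%
    logit_score = np.log(p / (1 - p))    # roughly a rescaling of the 1PL theta estimates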
Once you get to the two-parameter logistic model, you start to get some information based on the items that discriminate more or less. And similarly, between the two-parameter and the three-parameter logistic models, you've basically got the same thing — that lower asymptote is not making that much of a difference. So if you want to talk about the practical impact of IRT on your scoring, that's not where you're going to see the difference. Again, I think the value of IRT is really for scale maintenance over time, for linkages, for fancy things where you're subbing in new items and estimating for new populations. Within any given static item-response data matrix, IRT over and above classical test theory is kind of like a sledgehammer to a nail. That doesn't mean it's not a cool thing to do, and it's useful for diagnosis, but really what you want to do with IRT is say: OK, now I'm going to pick these items up and use those particular marbles from this particular urn to target a measurement instrument for a particular purpose. It's for that particular design that IRT becomes particularly handy.
So, let's see, what should I do. I want to talk a little bit — I talked about this with Matt — about one of the cool things about IRT: if you look at the equation for IRT, it puts theta, which is a person ability estimate, and b, which is an item feature, on the same scale — it subtracts one from the other, so your theta and the item's difficulty are directly comparable. What I like about IRT is that it gives you a way of mapping items to the scale, in a way that imbues the scale with what you could almost argue is a qualitative kind of property. It says: pick a response probability — the probability with which I'd probably get an item correct — think of it as 70 percent, used as a cutoff in that way. Then we can say, OK, if that's the case, and I have a theta of 2.2, that's where I'm likely to get that kind of item correct; and if I have a theta of 1.2, I'm likely to get this other item correct — different thetas will have different mappings. So why is this useful? Because oftentimes you're going to get people asking: I got a score of 30 — what does that mean? What does an ACT score of 30 mean? What does an SAT score of 600 mean? By putting examinee proficiency and item difficulty on the same scale, it allows me to create what we call item maps. And here's some of the work that we've done with NAEP — this is not very elegant, I have to say.
But it sort of says, OK: "explain properties of sums of odd numbers," for example. You can click on that and see what that means — what you can do, with a specified probability. I really like this, because educational scales can be extremely abstract: you're always wondering what a 10 or a 20 or a 30 is. I've actually asked my students in many cases — whether it's a psychological scale, like you get a grit score of 3, what is that? Or a theta scale, or an SAT scale of 600 — this allows you these qualitative descriptions of what that score actually means. I think this is a very powerful, underused method. Increasingly, I think statistics is moving toward descriptions of magnitudes in addition to statistical tests. For example, how much is an effect size of 0.5? That's something we really struggle with, and being able to say what a 0.5 means — you used to be able to do two-digit subtraction and now you can do three-digit subtraction, or whatever it is — being able to accurately describe what you could do then and what you can do now can be really powerful.
[Audience question, inaudible.]

So that would be an example of the model not fitting the data, right? Usually we have this ideal picture where every single time you move up the scale, you only get more and more items correct. Obviously it doesn't happen exactly like that in practice, but it has to happen on average, and if it doesn't, the IRT model won't fit and you'll get really bad alphas, because effectively even the classical test theory stage will recognize that your scale is not cohering. So if you have a high alpha, if your scree plot supports dimensionality, if your IRT model fits — which are all different ways of saying you have a kind of unidimensional scale — then what you're describing doesn't happen that often. And so, by picking a response probability, and with these curves being correct, you get this ordering of items along the theta scale in a successively ordered way. Sometimes it crosses — you can see here that the two-parameter logistic model gets a little dicey as far as interpretation, because the item orderings aren't the same for different response probabilities — but on the whole this is, I think, a reasonable way to say: here's what performance at this level means.
[Audience question: so previously, everyone got a spiraled set of randomly equivalent questions?]

Yes. And we're moving, in NAEP, to multistage testing, which is to say adaptive-ish — kind of like what was done in some of the National Center for Education Statistics tests, where you get a two-stage exam: based on whether you performed high or low, they give you harder items or easier items. But still, even for those items — even if a student never saw some of them — you'd still, in a model-based way, be able to predict whether or not they would respond correctly, if the model holds. That's the whole idea of IRT itself: even if you didn't observe that item, you could still predict your probability of a correct response to it. So you would hope that these item maps — if the model fits, which is what we always condition on — would hold. But I really like this; it's one of my pet things about IRT, so I hope you remember it as something you can do when you're trying to explain to your aunt or uncle — "my daughter got a 600 on the MCAS" — great, but what's her percentile rank? Or you can say: this is what she can do. I suppose a percentile is fine, but this is a good way of anchoring the scale and talking about — this is really what I think measurement is about — what does this score mean they can do?
So — and someone derived this; who was that? That was — good. This is a slightly different, algebraically equivalent version of the same thing; it's just inverting the IRT equation, the ICC.
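One way to write that inversion (my notation): for a chosen response probability RP, the theta at which item i reaches that probability under the 2PL is

    \theta_{RP}(i) = b_i + \frac{1}{a_i}\,\ln\frac{RP}{1-RP},

and with a lower asymptote c_i you replace RP by (RP - c_i)/(1 - c_i) inside the log.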
OK, so I'm going to skip estimation, even though it would be really fun to talk about — this is a little bit of an illustration of maximum likelihood and how that works — but I'm going to talk a little bit about how tests... go ahead.
[Audience question, largely inaudible — about what the goal of these item maps is and how you would actually use them.]
I think the general goal of item maps is to understand what a score implies about what a student knows and is able to do, in the case of educational testing, or what they are likely to report, in the case of psychological testing. So, for example, if you have a grit score of 4, that means you went from neutral to affirmative on this particular item — that's a way of saying what a 4 means. And — I love that you're asking this; it's a question I usually ask other people — I think it generally does increase the likelihood of appropriate interpretation of scores. There were big NAEP declines from 2013 to 2015 — how big? Not that big, if you look at the differences in the kinds of skills students were on average able to do this year versus last year. It just helps to give people a sense of magnitude. Mark Lipsey has this great piece on translating the effects of interventions into interpretable forms — I think that's the job — and he does it in a bunch of really useful ways: talking about cost-benefit analysis, talking about numbers of months of learning. But I think this is a way, in a criterion-referenced way, to literally say: hey, this is what you're able to do now, and this is what you were able to do then. That will facilitate any number of interpretations downstream, because it's really about what we predict you're able to do. So whenever you're thinking about a score and helping people interpret scores, let item maps be one possible way you can describe them.

Let me be very specific, in another way, about how they're used: item maps are also used to set standards. I haven't put standard setting in here because... I have opinions about it. Standard setting is a process by which we say: this much is good enough. NAEP has set standards — a proficient cut score. It is a judgmental cut score — we just had this massive evaluation from the National Academy of Sciences about whether that process was justifiable, and for the most part it was — but it's a judgment. The process uses this mapping system: if you are a reader coming in to set standards, you get a book of all of these items in a row, and what you do is flip through the book and put a bookmark where you think the "just proficient" designation should be. So that's another way this is actually used, in a very practical way, to help people set a judgmental cut point on what they think is good enough, based on what people can actually do at that level. Does that help?
[Audience question, partially inaudible: what would the classic Rasch people tell you about this?]
This is a great point, Chris. So there's a camp of very thoughtful, well-reasoned, but also sometimes — not to offend anybody; am I on tape? — cultish people, many of whom are very close friends of mine, who are in this Rasch camp, where they think the model is so useful that it's worthwhile, sometimes, to throw away data to get the model to fit. That sounds a little bit crazy to those of us who grew up in a more statistical camp, but the idea is: look, we're trying to design a good measure; this item is discriminating differently, it's going to lead to these weird ordering effects where now I can't have item maps that are all in the same order if I pick different response probabilities; I don't like that, so I'm not going to use that item. Which means you're defining, in a very strong, very statistical way, what you think the construct is, and it becomes this subset of the things you might want to measure, because you're throwing away all the stuff that doesn't fit the model. What you end up getting in the end is, arguably, this very clean scale where everything is ordered without conditions — there's no crossing of these lines, no interactions, and this item is always more difficult for everybody than that other item. What you might have lost in that process is content, and as I said, content is king — you can see my bias here when I'm talking about this: you should fit the data, have a theory, and not throw out data to fit your model. But at the same time, they have a framework in place that makes them comfortable doing that for particular uses; they tend to be very diagnostic about these things — these are targeted scales for particular purposes — and they don't tend to argue that it's good for all purposes; I don't think they'd say to do that for a state assessment. But this camp exists, and they're good people — they just really like their model.
[Audience question, partially inaudible — about what you do when you think you're measuring, say, 20 different things: do you treat each one separately and try to create separate scales?]
That's sort of an exploratory factor analytic, or confirmatory factor analytic, approach, where you want a data-based way of saying: does this item load more on this construct or more on that one? That's something you can do as well, and I see the confirmatory factor analytic camp as not so different from the Rasch camp: they're trying to make the picture fit. I don't think that's bad — it serves particular purposes — but I tend to be more unidimensional about it, because I'm cynical about the ways people can use multiple scores: they're just going to add them together in the end, so you might as well analyze it that way. But for theoretical reasons I see why SEM and factor analysis are useful for that purpose.
So, just some useful facts for you. For the one- and two-parameter logistic models, there is a sufficient statistic for estimating theta. What is a sufficient statistic? It holds all the information you need to estimate theta — it is not theta, but it holds all the information you need to estimate it. That sufficient statistic is the sum of the discrimination parameters of the items you got right. Make sense? At least operationally, if not necessarily intuitively. So, basically: in a 1PL model all the discriminations are the same, which is to say that the number correct, for the Rasch model, holds all the information you need to estimate your eventual theta — which is to say that everyone who gets the same sum score will get the same theta-hat. OK. Now, when you have discriminations that differ, and some items hold effectively more information than others, you get credit for the discrimination parameters of the items you answer correctly. So if you get 20 correct and I get 20 correct — if you get 80 percent and I get 80 percent — we might not have the same theta. Why would it be different?
[Audience answer: maybe the items I got right were easier.]

This is — I totally tricked you, I'm so sorry, but that is exactly what I said when my advisor asked me this, like, 12 years ago. So yeah — you'd say the 20 I got were easier, so you got the 20 hard ones right and I got the 20 easy ones right. But don't forget that if you got the 20 hard ones right, then you must have gotten all the other, easier ones wrong. And I said: that's weird.
So it's actually not the difficulty of the items that matters, it's the discrimination. The idea is that the 20 you got right were the ones that held the information, and the 20 that I got right were the ones that were coin flips.
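A toy numerical sketch of that sufficient statistic, with made-up discriminations:

    import numpy as np

    a = np.array([0.5, 0.6, 0.7, 1.4, 1.6, 1.8])   # hypothetical discriminations
    you = np.array([0, 0, 0, 1, 1, 1])             # you got the discriminating items right
    me = np.array([1, 1, 1, 0, 0, 0])              # I got the weakly discriminating items right

    print(you.sum(), me.sum())                     # same number correct: 3 and 3
    print((a * you).sum(), (a * me).sum())         # different sufficient statistics: 4.8 vs 1.8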
[Audience follow-up, largely inaudible.]

But I said the same thing — you have to sort of invert it. Again, remember that for him to get 80 percent of the difficult items correct, he must have gotten 20 percent of the easy items wrong, which is basically a statement of misfit — that's weird, right? It doesn't happen that often. And if it happened a lot, the model wouldn't be modeling it right — unidimensionality — it would say: I have no idea what you're doing; all these items aren't correlating with each other. So it doesn't happen very often, and for the most part the scale will be unidimensional, which is to say: if the 1PL fits, the higher you are, the higher your probability of getting these items correct, and an even higher probability for all those other, easier items. So the unidimensionality assumption, and model fit, kind of bake in the rarity of that happening. But that's absolutely right — that was my intuition too; you just have to remember to flip it and say: don't forget, you got all the easy ones wrong, which is weird.
Good — so I think this is helpful intuition for you.
And so, just to note here: when you get your scores from state testing programs, where do they come from? You might think that, if they were an IRT-using state, they would estimate theta for everybody and report all those different thetas. That is not what happens, and there's a reason — it's purely to do with feasibility and transparency. The feasibility idea is that we can't run these giant models every single time. The transparency idea is: hey, that thing we just talked about — try explaining that to someone in the public. You got 20 correct, and I got 20 correct, and you're telling me we got different scores? The fact that we psychometricians can't explain that well means we're giving up on the fact that that theta-hat — if we truly have a two-parameter logistic model — is a better estimate of theta: if you're answering more informative items correctly, we should use that information. We generally don't, for the sake of transparency. What a lot of states publish — and you'll see these in their technical reports — are raw-score-to-scale-score conversion tables, which is to say: take the sum score, find your row, and there's a one-to-one mapping from raw scores to scale scores. We wouldn't be able to do that if we had this weird thing where, if you got a 20 with this particular pattern of items right, you have this theta, and someone else with a different pattern has this other theta. That's what we call the difference between pattern scoring and number-correct scoring. So you might, in your own analyses, have thetas from a 2PL that have this continuous distribution, but what you get from a state is going to look much more discrete, even if they fit a 2PL or a 3PL.
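Here is a minimal sketch, under a hypothetical 1PL with known item parameters, of how such a raw-score-to-scale-score table can be built: for the Rasch/1PL, the maximum-likelihood theta for a raw score r solves sum over items of P_i(theta) = r.

    import numpy as np
    from scipy.optimize import brentq

    b = np.linspace(-2, 2, 20)           # hypothetical 1PL item difficulties
    a = 1.0                              # common discrimination

    def expected_score(theta):
        return np.sum(1 / (1 + np.exp(-a * (theta - b))))

    # ML theta for each interior raw score (0 and 20 have no finite estimate;
    # operational programs handle those endpoints by convention).
    table = {r: round(brentq(lambda t: expected_score(t) - r, -8, 8), 2)
             for r in range(1, 20)}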
[Audience question, largely inaudible — about whether going back to number-correct scoring works against individual examinees in some way.]
For individual scores — as I showed you in those scatterplots before, the correlations are like .98, .99 — it is not making too much of a difference. But yes, what we're basically conceding is that we're going to punt, for feasibility and transparency reasons. And don't forget the value of it — which I actually haven't had sufficient time to demonstrate here — which is scale maintenance: we can't use the same items this year that we used last year, because everyone saw them last year, so now we have to use different items. But because we know what the features of those marbles in that urn are, we can build, if not the perfect test, one that measures the same thing across the same range that we could before.
So this is to give you an example. This is the 1PL: this is the sum score, and this is the distribution of the theta scores — it's the same thing, mapped one to one, because the sufficient statistic is the sum score. And this is what I described before: what does IRT do, for practical purposes, for a static set of item responses? It squishes the middle and it stretches the ends — and that's it. You can just barely see it here; what it's doing is a nonlinear transformation.
So this is the 1PL versus the 2PL — I'm showing you these scatterplots here. This is the one-parameter logistic: everyone who got a 3 gets the same score. But you can see that, at any given score point, the people who scored really high on the 2PL are the ones who got the more discriminating items right, and the people who scored really low got the less discriminating items right.
So — how should I close here? With five minutes left, let me just go back to basics and open it up for questions; I think that's what I'll do. There's a lot here — I have slides on linking, showing you how you can get to the comparisons that I showed you today through common items — but anyway, let me close here and I'll open it up for questions.
What do I want you to believe? I think there's so much to be said for just diligent exploratory data analysis, and I hope you don't think that's too boring, because I swear it will save you so much time later, when you're trying to fit your IRT models and they're not converging — it is well worth it. Today I was selling IRT; I showed you how it works, but there's a really powerful part I didn't get to animate here sufficiently for you: these marbles from these urns do have these properties — each item has an information function associated with it — and you can pick an item up and say, I want to measure here, and this one measures here, and you can build the sort of perfect test this way, to discriminate at particular points in the theta distribution. That's really powerful. So, for example, if you wanted to evaluate people right at a cut score — if you were designing a diagnostic test for pass/fail purposes — you could stack all the items from your urn that have maximal information at that particular point and target a test for precisely that purpose. So IRT, through these strategic item estimates, allows you to have that information.
And I can actually show you — you can see it in this demo; let me zoom out a little bit. Under here I have these item information functions. So here's what I'm going to do: I'm going to increase the discrimination on the blue item — let's make it 2. You see that right there? Now I've described that item as having a lot of discrimination — a lot of information — at exactly that point, and if its difficulty were negative 1, it would have its information out at this point over here instead. So each of these items has this information function, and you can stack them up and figure out where you're going to minimize your standard errors: these people are going to have low standard errors, and these other people you can sacrifice, because you're not making decisions about them.
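In symbols-turned-code: a 2PL item's information is a^2 P(1 - P), it peaks at theta = b, test information is the sum over items, and the standard error of theta-hat is one over its square root. A minimal Python sketch with hypothetical items:

    import numpy as np

    def icc(theta, a, b):
        return 1 / (1 + np.exp(-a * (theta - b)))

    def item_information(theta, a, b):
        """2PL item information: a^2 * P * (1 - P), peaked at theta = b."""
        p = icc(theta, a, b)
        return a**2 * p * (1 - p)

    theta = np.linspace(-4, 4, 200)
    items = [(2.0, 0.0), (1.0, -1.0), (0.8, 1.5)]     # hypothetical (a, b) pairs
    test_info = sum(item_information(theta, a, b) for a, b in items)
    se_theta = 1 / np.sqrt(test_info)                  # where information is high, SE is low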
So, again: don't forget content, and don't forget classical test theory. I've just begun to scratch the surface of the usefulness of IRT, and we've all got a lot more to learn in this field. So let me open it up for questions. One of my students just told me the other day: don't ask "do you have any questions," because the answer could be no — say "what questions do you have?"
[Audience question, largely inaudible — about how far you can get without IRT.]
So the argument is always that 70, 80 percent of what you need to do can be done without it. Again, when is IRT helpful? It's when you're changing items and changing populations, when stuff is changing over time. If you just have this one little form — the eight-item grit scale — don't worry about IRT. But if you want to sub out those items because people are starting to memorize them — if we start using grit for high-stakes testing and people go, hey, I remember that item — then you want to start switching items out, and that's when IRT starts to be super useful. So I'd keep it in your back pocket for when you need to swap out items. Or, let's say — we can talk about differential item functioning — what if you want to pick this test up and go take it to Japan, or something like that? Then IRT can help you figure out measurement invariance. So there are all these use cases where you should feel like you've got IRT as the sledgehammer in your basement, ready to come out and tackle a particularly thorny problem. But again, classical test theory is your basic IKEA toolkit — it gets you pretty far.
[Audience question, largely inaudible — about how you deal with items that seem to consistently function differently for different groups of test takers, and where the idea of differential item functioning came from.]
So, very strategically, back in the day when biased tests were a concern — not that they're not a concern anymore — scholars at ETS said, hey, let's call it something more neutral, because they're asking good questions about whether measures differ for different people, but "bias" is such a loaded term. So Paul Holland and others coined the term "differential item functioning" to make the study of bias sound scientific, and so on — and it kind of does, I guess. The basic idea is that if you have two different item characteristic curves for the same item, corresponding to different groups, that's bad: the item parameters don't contain all the information about how you're responding to that particular item. If you estimate a different item characteristic curve for a different population and it doesn't align, then you've got evidence of differential item functioning for that group. There's a whole set of tools for this — the difmh command; you can type "help difmh". You could also run a logistic regression of the item score on the total score, with an indicator for the group, and that in and of itself will give you a test of whether the item is functioning differently for one group or the other. So there are a bunch of different ways to detect it; it's a violation of the model, and it is a concern.
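A minimal sketch of that logistic-regression DIF check in Python (simulated data; using the rest score as the matching variable is my choice, not necessarily the talk's):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical examinee data: studied item score, rest score, 0/1 group indicator.
    rng = np.random.default_rng(7)
    n = 2000
    group = rng.integers(0, 2, n)
    rest = rng.normal(10, 3, n)
    p = 1 / (1 + np.exp(-(0.4 * (rest - 10) - 0.5 * group)))   # built-in uniform DIF
    item = rng.binomial(1, p)
    df = pd.DataFrame({"item": item, "rest": rest, "group": group})

    # Does group (uniform DIF) or group-by-score (nonuniform DIF) predict the item
    # beyond the matching score?
    fit = smf.logit("item ~ rest + group + rest:group", data=df).fit(disp=0)
    print(fit.summary())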
[Audience follow-up, largely inaudible — about using these procedures in practice to avoid bias.]
[Audience question, partially inaudible — about whether this is circular when there are real achievement gaps, or generational gaps, between groups.]
So DIF is conditional on theta: for two people with the same theta, are there different probabilities of a correct response? That still allows different groups to have different distributions of theta. So you can have these two groups with two different distributions of theta — there can be a true gap — but if you estimate two different item characteristic curves from them and they don't align, that's problematic. So, for people who score very low in both groups, are they going to get this —
[Audience follow-up, inaudible, apparently about whether using the test's own scores to detect bias is circular.]
So what we do, there are two things we do. First of all, we assume, through the content development process, content is king, right, that you are measuring something that's right. That's the part of the theory where we're not just asking yacht questions, or country club questions, or color-based questions for people who are color blind. So it's got to go back to content in that regard. And then, once you have that, you're looking at relative DIF, because it's always going to sum to zero; it's circular in exactly the way you're describing. Or you have some sort of external referent that you assume is unbiased. So it's one or the other: if you do it the internal way, it's circular; if you do it the external way, you have to question the bias in the external referent. But those are the two approaches to doing it. The way we get out of that circularity jam is coming all the way back from models to content, and to some theory that what you're measuring is right. So what we usually do in the test development process is we flag items for DIF, they go to a content review team, the team tries to come up with a couple of hypotheses for why that could have happened, usually they can't, and so they leave the item in. Paul Holland wrote this famous paper, in like 2003 or something, it's like, what's the DIF about DIF? It don't make no diff.
Which is differential item functioning, right? Because that's really what happens in practice: tests are already designed, through the content development process, this is Diane Ravitch's Language Police kind of book, way back in the day, to squeeze out everything interesting and everything that could possibly function differently across the test. So you get something so sterile in the end that there's really no basis on which you can throw anything out. It's kind of a sad statement, but you know.
[Inaudible audience exchange, apparently about whether the answer depends on the use of the test.]
And this is very much... so first of all, I forgot, what time do we end? I thought we ended at 5, but I realize now it's 5:30. Wow, OK, well, we can talk about all sorts of stuff. Keep the questions coming. I mean, I am kind of exhausted, but we've got half an hour, so let's talk scale pliability. You guys had better ask questions, otherwise I'm just going to keep going.
[Audience question, largely inaudible, about how much weight to put on whether the IRT model actually fits the data.]
Let's get to that.
So, to address the fit question, there are different schools of thought. People trained more in psychological measurement than in educational measurement are more interested in model fit, and people in structural equation modeling and factor analysis generally are interested in a whole array of fit statistics that make me dizzy sometimes. Back in the day, like 20 years ago, you could get tenure based on creating the next new fit statistic, and now there are 60 of them and I can't keep track. But I don't mean to be glib; you can sort of tell by the way I'm talking about it that I'm skeptical of the whole idea. I think you can start off with something like an alpha statistic, and once it's at a sufficient level, you're just using IRT to accomplish something: if it helps, go ahead and use it, and if not, don't. So I think the dimensionality questions are often a little bit overwrought. That said, as a matter of operationalizing your measurement objectives, I do think stratified alphas and scree plots and overall fit, the CFI and RMSEA and a whole suite of fit statistics, are helpful. The only problem is that you run the risk of people saying your fit statistic is .02 below the cutoff, and you're left wondering where the hell these cutoffs came from and what they even mean. So I'm a little cynical about fit statistics, but I do think they're supportive.
There are ways to assess whether models fit the data; I just don't obsess over them. So how do IRT and SEM or factor analysis differ in practice? In the same way that regression and ANOVA differ in practice, right?
When we use IRT, we tend to be very interested in the marbles: we're trying to create a test, or maintain a test, so we care about the specific parameter estimates for those items and we use them very, very carefully. In SEM and factor analysis, you're more interested in a global measure of whether the model fits and, if it fits, whether it helps to explain your theory. Sometimes in structural equation modeling you're interested in particular structural parameters, in the same way that you're interested in regression coefficients, but in general you're interested in the global idea of fit. So I guess that's the difference: in IRT the attitude is more like, I don't care so much about global fit; my standard error on this discrimination parameter is pretty decent, the scale is more or less unidimensional, and that'll do.
I guess I would say what we usually see in practice are these scree plots and general fit statistics, someone says the model fits, describes the fit, and then you move on. If you look at Duckworth and Quinn, they do this sort of token confirmatory factor analysis: OK, hey, it fits, now let's go see if it predicts future outcomes, enough of that, let's go do something else. I think that's a good standard practice. And that other article is a good one too, where he does that internal consistency examination on his scale and confirms it works the way people are often using it; that's a good model.
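Since the knee-jerk first check the talk keeps coming back to is an alpha statistic, here is a minimal Cronbach's alpha sketch; the helper function and the simulated 8-item matrix are my own illustration, not code from the course.

```python
# Cronbach's alpha as a quick internal-consistency check on an item-score matrix
# (rows = respondents, columns = items). Simulated data, for illustration only.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """alpha = k/(k-1) * (1 - sum(item variances) / variance(total score))."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

rng = np.random.default_rng(0)
n, k = 500, 8                               # e.g., an 8-item grit-style scale
trait = rng.normal(0, 1, (n, 1))
items = trait + rng.normal(0, 1.2, (n, k))  # each item = common trait + item-specific noise
print(round(cronbach_alpha(items), 3))      # roughly .85 with these settings
```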
[Audience question, largely inaudible, apparently about assuming the thetas are distributed standard normal and about whether you can build prior beliefs about ability into the estimation.]
Like, you mean if it's not normally distributed?
[Audience follow-up, largely inaudible, about feeding beliefs about the estimates back into the model.]
That's a cool idea. So in general, I think this probably fits under more Bayesian ways of going about this.
There are a lot of people who do a Markov chain Monte Carlo approach to estimate everything simultaneously: they put priors on the b parameters, priors on the a parameters, they can have strong priors on the c parameters, and the data then feed back into that prior information, rather than the usual two-step approach. So I think that's probably where that sort of thing comes in, in a more fully Bayesian framework, and I guess I would look there. I haven't done that in a long time, so I'm not sure where the current state of the art is, but it's kind of a cool idea.
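A hedged sketch of the "put priors on the item parameters and estimate everything at once with MCMC" idea, here as a two-parameter logistic model in PyMC; this is my own illustration of the general approach (the talk also mentions priors on c parameters, which a 3PL would add), and the priors and settings are arbitrary.

```python
# Fully Bayesian 2PL IRT sketch: priors on discrimination (a) and difficulty (b)
# and on theta, all sampled jointly by MCMC. Assumes PyMC >= 4; responses are simulated.
import numpy as np
import pymc as pm

rng = np.random.default_rng(0)
n_people, n_items = 300, 10
true_theta = rng.normal(0, 1, n_people)
true_a = rng.lognormal(0, 0.3, n_items)
true_b = rng.normal(0, 1, n_items)
p = 1 / (1 + np.exp(-true_a * (true_theta[:, None] - true_b)))
y = rng.binomial(1, p)                                    # n_people x n_items 0/1 matrix

with pm.Model():
    theta = pm.Normal("theta", 0, 1, shape=n_people)      # prior on ability
    log_a = pm.Normal("log_a", 0, 0.5, shape=n_items)     # prior on log-discrimination (keeps a > 0)
    b = pm.Normal("b", 0, 1, shape=n_items)                # prior on difficulty
    a = pm.math.exp(log_a)
    eta = a * (theta[:, None] - b)                         # broadcast to n_people x n_items
    pm.Bernoulli("y", p=pm.math.invlogit(eta), observed=y)
    idata = pm.sample(1000, tune=1000, chains=2, target_accept=0.9)

print(idata.posterior["b"].mean(dim=("chain", "draw")))    # posterior mean difficulties
```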
So let's keep this going a little. I mean, it's probably beer o'clock, but let's nonetheless do a little bit of scale pliability before we wrap up. So: is this an equal-interval scale? This is the big debate going on. I'm not sure it's much of a debate, it seems pretty obvious to me, but there are those in our field who are less utilitarian and instrumentalist than I am, who are really struggling to give psychological and educational measurements the cachet of physical measurements. They want to say, this is my unbreakable scale, don't bend it, and I think that's sort of silly. So, interval scale: again, we're setting theta up to be linear in the log odds of correct responses to items, so there is a way in which it is already equal-interval. A scale is always equal-interval with respect to something.
So there's a good literature right now, Bond and Lang, and Nielsen as well, which you cited in your paper, and I appreciate that, they're doing good work on this. They're trying to tie achievement scores to these external referents, and they're sort of bending the scale in response to these other scales that the achievement test typically gets subjugated to, in sometimes very useful ways. So theta is equal-interval with respect to the log odds of correct responses to items, but there's nothing magical about that; you can bend everything.
And everything will still fit, as long as it's a monotonic transformation. It's no longer linear in the log odds, but it's still going to fit the data, because it's going to chase the data in some arbitrary way. Lord sort of shows that it doesn't really matter; the data can't tell, as long as you monotonically transform both the item response function and the scale itself, the model is just going to chase the data, whatever you do. So what do you make of scale indeterminacy? The logistic item response function is mathematically convenient and has a loose rational basis under normality assumptions, there you go, but the data can't tell which of any plausible monotone transformations is desirable. There's no one correct or natural scale for measuring traits or abilities in education. So I come down very similarly to what Brian and Jesse articulated so well in their JEP paper, which is that it's probably useful to think of
a class of, you know, I like to call them plausible monotone transformations, that you should subject your scales to: re-estimate after those transformations, and just make sure that whatever you're concluding is robust to them. Interpretations should be robust to plausible transformations of scales. This is what I described before, where we have these score points, one, two, three, and I think we need a way to talk about how pliable these scales are. Because, you know, think about the item maps: who's to say how the distance between, say, two-digit arithmetic and the next skill compares to the distance between that and derivatives? How are you going to objectively say what that difference is? So yes, I would again say the scale is pliable. And is it ordinal or interval? I feel like ordinal-versus-interval is an antiquated dichotomy, and we should think of something between ordinal and interval; the equal-interval argument is weak but not baseless.
This is just to illustrate what happens if we operationalize a transformation of an underlying scale. Say you've got normal distributions, but what I really care about are differences down here, from, say, negative 3 to negative 1. That's where I want to prioritize growth, either from an incentive standpoint, or because, from a measurement standpoint, I truly believe those distances are like 10 times as large. You can say that these transformed distributions are actually the distributions I've got, and if you then do a straight standardized mean difference, the transformation changes the actual effect size, the actual number of standard deviation units. You can look at differences in percentiles too. The idea is that whatever judgment you're making should be robust to these transformations.
Similarly, what Sean and I did addressed a separate problem, but it still resulted in a neat technique, I think, for defining a class of transformations that is mean- and variance-preserving. That's just to keep your head on straight: you're not trying to go to a completely different sort of scale, you're keeping the mean and the variance approximately the same and just stretching things in various directions, and then you can play with the transformed distributions, which is kind of fun.
So, subject to these constraints, this is a class of exponential transformations, and we get this formulation, this transformation from X to X star. What we're doing here is saying: this red transformation is accentuating the higher scores, and the blue transformation is accentuating the low scores. You can also imagine kurtosis kinds of transformations, where you're stretching the tails but keeping everything symmetrical; these are one direction and the other direction.
So this is what would happen under various c parameters as we've defined them, where you take a normal distribution: this is c of negative 0.5, negative skew, for the blue distribution, and this is c of positive 0.5, positive skew, over there.
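To make the c parameter concrete, here is a small sketch of one plausible exponential, mean- and variance-preserving warp; the exact family in the paper may differ, so treat the formula as my own stand-in, chosen only to show negative c pulling skew one way and positive c the other, as on the slide.

```python
# One illustrative mean- and variance-preserving monotone transformation family:
# warp the scores with (exp(c*x) - 1)/c, then re-standardize back to the original
# mean and SD. Negative c induces negative skew, positive c positive skew.
import numpy as np
from scipy import stats

def warp(x: np.ndarray, c: float) -> np.ndarray:
    if c == 0:
        return x.copy()
    y = (np.exp(c * x) - 1) / c            # monotone increasing for any c
    y = (y - y.mean()) / y.std()           # re-standardize...
    return y * x.std() + x.mean()          # ...to the original mean and SD

rng = np.random.default_rng(0)
x = rng.normal(0, 1, 100_000)              # start from a normal score distribution
for c in (-0.5, 0.0, 0.5):
    y = warp(x, c)
    print(f"c={c:+.1f}  mean={y.mean():+.3f}  sd={y.std():.3f}  skew={stats.skew(y):+.2f}")
```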
Yes?
[Audience question, largely inaudible, suggesting that under such a transformation a point counts more in one part of the scale than another, so each transformation is really asking a different question, and asking whether there is one weighting function we should just try to estimate.]
That's absolutely right. Any sensitivity study is not random; it's asking a different question, exactly, and I think that's exactly the right way to frame it. This is where I think item maps can help, because what an item map will do is go along with this function and say, hey, look, what you've now said is that derivatives are close to integrals, and that the distance from division to subtraction is huge. That's not random; that's a statement of belief about these different magnitudes. So don't treat it as random error; say, under this condition these are the results you get, and under that condition those are the results you get. I think that's exactly right, and by the way, I think it's a general way to think about sensitivity studies, and a lot of people have said this: don't think of them as just a bunch of random things you do; each one asks a different question.
So the way we've set this up, the values of c are set so that the slope of the transformation at the 5th percentile is one-fifth of, or five times, the slope at the 95th percentile. That's one way to think about it: the relative rate of stretching down here is five times the relative rate at the top of the distribution. There are various ways of thinking about how to stretch and squish the scale. So if you want a what-to-do kind of recipe: take the existing scores; apply a family of plausible transformations; taking Sue's feedback seriously here, also be very clear about what each transformation implies for, say, a difference down here versus a difference up here, using item mapping or some other way of describing it; calculate the metrics of interest from each transformed dataset; and assess the robustness of your interpretations of those metrics across the plausible transformations.
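Here is a sketch of that recipe: apply a family of plausible transformations, recompute the metrics of interest under each, and check that the interpretation survives. The groups, the covariate, and the warp family (the same illustrative one as above) are all simulated assumptions, not data from the talk.

```python
# Robustness check: apply a family of plausible monotone transformations to the
# scores, recompute metrics of interest (here a standardized mean difference and
# a correlation), and compare across transformations. Everything is simulated.
import numpy as np

def warp(x, c):
    """Illustrative exponential warp, re-standardized to preserve the mean and SD of x."""
    if c == 0:
        return x.copy()
    y = (np.exp(c * x) - 1) / c
    return (y - y.mean()) / y.std() * x.std() + x.mean()

def cohens_d(a, b):
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (b.mean() - a.mean()) / pooled_sd

rng = np.random.default_rng(1)
g0 = rng.normal(0.0, 1.0, 5000)              # reference-group scores
g1 = rng.normal(0.3, 1.0, 5000)              # focal-group scores: a 0.3 SD gap
covariate = g0 + rng.normal(0.0, 1.0, 5000)  # an external variable to correlate with

scores = np.concatenate([g0, g1])
for c in (-0.5, -0.25, 0.0, 0.25, 0.5):
    t = warp(scores, c)                      # one transformation applied to everyone
    t0, t1 = t[:len(g0)], t[len(g0):]
    print(f"c={c:+.2f}  gap={cohens_d(t0, t1):+.3f} SD  "
          f"corr={np.corrcoef(t0, covariate)[0, 1]:+.3f}")
```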
Actually, that reference was related to grit measurement broadly, but what we were trying to do was make sure that our reliability estimates would not change too much depending on whether the approach was parametric or nonparametric. So we were really trying to solve a completely different problem; we were just saying, hey, here's a cool transformation that will work for this purpose. I'm citing it because Sean and I kind of hit three fun things in that paper that had nothing to do with the abstract. The first was, hey, what are reliabilities across state testing programs in the United States; we just threw that in as a figure. Another was this little thing here, which was just trying to solve practical problems associated with our reliability procedure. So it's really kind of ancillary, but that's where we started writing it up, and we should really write it up more formally.
There are all sorts of things we don't have time for, but yes: is this the right family? Can we think of kurtosis kinds of transformations? How do we set c appropriately? And I'd love to use the earlier feedback, that these transformations are not random. So this is ours: does our reliability coefficient change if we use the nonparametric, ordinal reliability calculation? And this is sort of saying that our correlations are actually pretty stable across all of these different transformations, so we don't have to worry too much about reliability depending on the scale transformation.
So here's what I would say: we can create a hierarchy of statistical procedures based on whether they are sensitive to scale transformations. Differences in means are going to be pretty darn robust; correlations, as we've shown here, are pretty darn robust; differences in differences, that's where it gets problematic. Whenever you have these sort of interaction effects, the result is heavily dependent on the scale, because all I have to do is squish this part to make the lines parallel and stretch that part, and I get a different kind of interaction effect. So there are different classes of procedures that I think we can lay out in a more-sensitive-versus-less-sensitive kind of framework, and I think that would be useful. Nielsen does some of that in his papers. It wasn't a shock to us that changes in gaps are scale-dependent; that's pretty straightforward. But generalizing that, saying these kinds of methods, these kinds of questions, are in general sensitive to the scale, that's really useful.
This is just a little example of how, say, value-added models are not robust, but we don't have too much time, so I won't go through it.
If you look back, another good reference for this, and for the changes-in-gaps question, is Ho and Haertel in 2006, where we showed that, for the most part, gaps are stochastically ordered: there's nothing you can do to reverse the sign of a gap. High-achieving groups and low-achieving groups are, for the most part, far enough apart that there's no transformation that could possibly reverse them. But we also worked out a sort of proof of what we call second-order stochastic ordering, which is kind of a mouthful, but the idea is that for changes in gaps it's very, very easy, as long as certain conditions hold, for a transformation to reverse the sign of the change in the gaps. Right, exactly, which is the same as an interaction effect. Exactly, exactly right.
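To illustrate that second-order point, that a monotone transformation can flip the sign of a change in a gap even when it cannot flip the gap itself, here is a small simulated sketch; the four distributions and the warp family are mine, chosen only so that the reversal actually shows up.

```python
# A monotone, mean- and variance-preserving transformation can reverse the sign
# of a *change* in a gap (a difference in differences) while leaving each gap's
# sign alone. All distributions are simulated for illustration.
import numpy as np

def warp(x, c):
    """Illustrative exponential warp, re-standardized on the pooled scores."""
    if c == 0:
        return x.copy()
    y = (np.exp(c * x) - 1) / c
    return (y - y.mean()) / y.std() * x.std() + x.mean()

rng = np.random.default_rng(2)
n = 100_000
a1 = rng.normal(0.0, 1.0, n)   # group A, time 1
b1 = rng.normal(0.5, 1.0, n)   # group B, time 1 -> gap of about 0.5 on the original scale
a2 = rng.normal(1.0, 1.0, n)   # group A, time 2
b2 = rng.normal(1.4, 1.0, n)   # group B, time 2 -> gap of about 0.4: the gap appears to close

pooled = np.concatenate([a1, b1, a2, b2])
for c in (-0.5, 0.0, 0.5):
    w = warp(pooled, c)        # one transformation applied to every score
    wa1, wb1, wa2, wb2 = np.split(w, 4)
    gap1 = wb1.mean() - wa1.mean()
    gap2 = wb2.mean() - wa2.mean()
    print(f"c={c:+.1f}  gap1={gap1:.3f}  gap2={gap2:.3f}  change={gap2 - gap1:+.3f}")
```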
Here's what I mean: your response is exactly right, maybe not the leave-the-room part, though maybe that's the right thing to do, but your response is exactly right, which is to ask, what are the intervals this is assuming? That's where I think the idea of item maps and scale anchoring can be really helpful, because you're saying, look, if you want to disagree with me about the ordering, here is what I'm saying about the scale: this point means this, this point means this, that point means that. Have a content-based argument about it, go ahead. I think that's where you can set your stake in the ground. Because what I don't want to do is get to the sort of nihilistic end, taking Bond and Lang a little bit too far, and say, let's solve for the craziest possible transformations that could possibly reverse this gap. I think that's a little bit too extreme. So what I tried to do, also in this paper with Carol Yu, is ask, what are the distributions we see in practice, and how much should we be willing to stretch things for the result to remain plausible? So we should have a debate, in exactly the way that I think you and Jesse were describing, about what's plausible in which situations, and that decision can be made based on a survey of the shapes of distributions that we see in practice. That's fun. Thanks, I'm glad we had that extra time.