Andrew Ho: 2016 psychometrics mini course - Part 2

## Transcript:

So I've jumped ahead a little bit. Apparently something weird is going on with my slides — we have some sort of Star Wars scroll going on behind the scenes, and I don't know what's going on there. That's kind of cool. So we're going to focus on IRT. This is part of a larger presentation, obviously, about practical applications and then sort of a critical perspective on IRT. Again, this is my effort to demystify it, highlight its uses, and also highlight its limitations.

You know, I think it's useful to contrast IRT with classical test theory. The object modeled statistically in classical test theory is the actual item score. That's, like, a 0 or 1 for an educational test, or a 1 to 5 on a Likert scale. In IRT, it's the probability of getting a particular item correct, or achieving a particular threshold in a polytomous, Likert-type item, dependent on a series of parameters I'm going to explain shortly.

The conception of what's measured: in CTT — like, what is the score? Again, that's the question I keep asking: what is your score? — in CTT it's an examinee's true score. You can think of it as an average, you can think of it as a sum, equivalently defined as the expected value across replications of observed scores. In IRT, we create this apparently mystical theta, and that is a particularly useful scale, as I'll describe, for comparing across different populations. So it's a very useful scale — if the model fits the data.

So again, just to highlight CTT, and why I spent the first half of this presentation on it: it is still by far the most widely used psychometric toolkit. It can do a ton; do not sell it short. But then, of course, as we've recognized, sometimes you have to publish some papers and need a fancier acronym, and IRT will come in handy then. But CTT should be a knee-jerk first analysis, just as descriptive statistics are a first step for most empirical work.

So.

We think IRT is useful because of how it conceives of items. CTT has an urn full of items and just says: I don't care, on average they're kind of like this, and their variance is kind of like this. In IRT, you take each little marble out of that urn and you sort of appreciate it: this marble is special, and it has these properties. And then you can take all these marbles out, lay them on the table, and say, here's what I want to do with them. That tends to be much more useful as a way to design tests, a way to maintain scales over time; it's the standard approach for large-scale testing. Which is to say that for many of you, IRT is a little bit of a sledgehammer to a nail. If you're developing your own scale for your own purposes, and it is just going to stay static and serve a particular population, go ahead and do IRT, but realize it's kind of an indulgence — again, kind of taking a sledgehammer to a nail. For large-scale testing programs, where you are substituting out items, you are maintaining scales, and you're giving the test to different populations over time, IRT is incredibly powerful, and it is the standard for use in large-scale testing programs. Perhaps the only major exception to this is the Iowa Tests — the Iowa Tests currently, where I used to teach at Iowa, are mostly a holdout of sort of classical test theory. So you can sort of see why I've been sort of classically focused here: because they pull off amazing things without IRT, and do it quite well and quite rigorously. You do not always need IRT; CTT is just simpler and more elegant to use, which is why it's in such common practice today.

So IRT sort of asks: what if there were this alternative scale, this alternative theta scale, for which item characteristics would not depend on each other the way they do in classical test theory? In the Google Doc, I think, Yen and Fitzpatrick, in their chapter on IRT in the handbook, describe IRT — if your assumptions hold — as person-free item measurement and item-free person measurement. Which is to say that these statistics you come up with — the difficulty of an item, the discrimination of an item, and the proficiency of you, the examinee — don't depend on the items you happen to have, and similarly the item features do not depend on the population you happen to have. And if the assumptions hold, that is pretty darn powerful, because that marble you pick out of that urn has those properties and will always have those properties, and so when you use it to construct a test, the items will continue to have those properties — again, if the model holds.

So, just to define — this is where we're going to get into logits. The simplest IRT model is known as the Rasch model, for Georg Rasch's 1960 monograph. It's also known as the one-parameter logistic, or 1PL, model, and I like to write it like this, which is the log of the odds of a correct response to the item — right, P over Q: the probability of a correct response over one minus the probability of a correct response is the odds, and this is the log of the odds, the natural log of the odds. The log of the odds is just a simple linear function: there's this common slope parameter a and this person intercept theta_p. And then the sign is important here: in logistic regression you're used to seeing a plus here, and we're going to define it with a minus, so that we're defining this as difficulty instead of easiness — this is a difficulty b_i for each item. And then this is a random effect, this is an error term, right: theta_p is distributed normal(0, 1). No, there is no variance of the person distribution being estimated here; we're standardizing it to normal(0, 1). Most IRT models aren't written out like this, and I think that has the effect of mystifying it somehow — it gets confused with the logistic function — and I prefer just saying, hey, we're linear in the log odds here. This is not fancy: if you can do logistic regression and you understand what a random effect is, then this is just familiar modeling. OK, so again, the log of the odds is simply this common a — no, there's no subscript i on a: the discrimination does not depend on the item; it's just a common parameter estimated across all items. That's going to change in the next model, but for now it's common across items — the common discrimination. And then this is the difficulty parameter, right, for each item, and then every person is going to get a theta.
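The 1PL model as written here — logit P = a(theta_p − b_i), with theta_p standardized to normal(0, 1) — can be sketched in a few lines. This is a generic illustration, not code from the course; the slope, difficulties, and thetas below are made-up numbers.

```python
import math

def rasch_logit(theta, b, a=1.0):
    """Log-odds of a correct response under the Rasch/1PL model:
    logit P = a * (theta - b). One common slope a, one difficulty b per item."""
    return a * (theta - b)

def rasch_prob(theta, b, a=1.0):
    """Probability of a correct response: the inverse logit of the linear predictor."""
    z = rasch_logit(theta, b, a)
    return 1.0 / (1.0 + math.exp(-z))

# A person at theta = 0 facing an item of difficulty b = 0 sits at 50%.
print(round(rasch_prob(0.0, 0.0), 2))  # 0.5
# A harder item (b = 1) drops that person below 50%.
print(round(rasch_prob(0.0, 1.0), 2))  # 0.27
```

Note the minus sign: b is difficulty, not easiness, exactly as described above.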

OK.

This is the more intimidating way of writing the same model, where we actually model the probability itself. But this, of course, is equivalent: this is the scary way of writing logistic regression, and this is a maybe less intimidating way of looking at the logistic regression. As long as you don't look over here, it's just the log of the odds, and logistic regression, when you're in the logit, is just a generalized linear model — don't forget.

So these are the curves that we estimate — the logistic curves that we estimate. This sort of says: for a given theta, these are the probabilities of getting an item correct. Right, so at a theta of 0, this item has about a 50 percent chance of being answered correctly, or thereabouts — maybe 55 percent — and another is up at 70, 75 percent, and so on. So which items are easier? The ones you can think of as on top, or shifted to the left; the more difficult items are the ones shifted to the right.

Or shifted down, depending on how you think about it. In logistic regression we usually think of intercepts on y; what we've done here is flip this to think about position on x. So those are higher y-intercepts and lower y-intercepts, but what we've done is shift it to think, again, of greater difficulty shifting that sort of S-curve — Walk Like an Egyptian, sort of thing — shifting that S-curve that way.

So, just to give you a little bit of the punch line here: what I've done is show you a scatterplot of the classical test theory difficulty — which is to say the percent correct; classical test theory, super annoyingly, calls the percent correct "difficulty" — against IRT difficulty. There's a negative relationship. But this is just to ask: how does percent correct correspond with IRT difficulty? Is IRT giving me something magical and mystical over and above percent correct? The answer is: not really. It's pretty much the same information. This is not surprising, but again, IRT, as I'm going to show you, is going to be useful for some more advanced applications.

I just want to demystify this further — this is where Matt and I had a conversation earlier today. IRT is a latent variable measurement model. It is a factor-analytic model. It is a structural equation model. Do not think of these as separate things. They are separate practices, in the way that ANOVA and regression are separate practices but are the same under the hood. Right — I think of the act of doing an ANOVA as a way of thinking about a statistical analysis, even if it's the same thing and I could do the same thing with regression. Similarly, structural equation models and factor analysis I think of as different practices, asking sort of different questions, using the same statistical machinery. I'm happy to elaborate on that, but I don't want to treat these as completely separate models when I think of them more as completely separate literatures and separate fields, used for separate reasons, in the same way that ANOVA and regression are really the same under the hood. So what I'm sort of setting up for you here is a way of doing IRT using the gsem package in Stata — the generalized structural equation modeling package in Stata. You can see here all it is: you know, when I say IRT is factor analysis with categorical variables — right, that's all it is. And that's not all it is — what we do with it is different — but under the hood, that's all it is.

So the SEM formulation is that the probability here, dependent on theta and b, is the logistic of theta minus b — a slightly different parameterization than the one I showed you with the a term, because there the a was outside the parentheses, right — but the same general approach: the slope is constrained to be common across items, and you'd fit this in Stata with gsem. It's really just logistic regression, the same thing. So actually, before Stata 14 came out — Stata 14 was just released last year. Before Stata 14 was released, before they had an IRT package, guess what I did in Stata: gsem. Which is to say — it's like, why are you teaching an IRT course using structural equation models? Because they're the same thing. And so I had all this really convoluted code to get all the stuff I needed out of gsem, and then of course Stata, thankfully, released irt and made all that obsolete, and I had to recode everything. But it just goes to show that it's the same thing under the hood. So this is the two-parameter logistic — yeah, sure, actually — Chris?

[Audience question from Chris, partly inaudible.] Absolutely — gsem can absolutely do the two-parameter model; it has a hard time with the three-parameter model. For the two-parameter logistic model, all you do is free this right there: instead of forcing the slope to be the same across items, you let it vary, and that will give you the two-parameter logistic model. The three-parameter logistic model I don't think you can do in gsem; you can do it in gllamm, Sophia's package, or in an R package — but again, it's the same under the hood. Good question.

So this is the two-parameter logistic model. It allows items to vary in their discrimination across items. So again — I like writing it in log-odds terms — all I've done is added a subscript i: all I've done is let the slope parameter vary across items. OK, and then again we have difficulty: these are the more difficult items, these are the less difficult items. If I wanted to be less fancy about it, what would I do? I would plot this in the log odds, and then it would just look like a bunch of lines, right? So again, this is a sort of mystifying way of describing IRT; if I wanted to make it simpler, I'd just show you all the different straight lines that are there in log-odds space. Yep — question?
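The "bunch of straight lines" picture can be made concrete: in log-odds space, each 2PL item i is the line a_i·(theta − b_i), with its own slope. The item parameters below are made up for illustration.

```python
import math

def logit_2pl(theta, a_i, b_i):
    """2PL linear predictor: each item i gets its own slope (discrimination) a_i."""
    return a_i * (theta - b_i)

def prob_2pl(theta, a_i, b_i):
    """Probability metric: the inverse logit bends the lines into S-curves."""
    return 1.0 / (1.0 + math.exp(-logit_2pl(theta, a_i, b_i)))

items = {"steep": (2.0, 0.0), "shallow": (0.5, 0.0)}  # (a_i, b_i), hypothetical
for name, (a_i, b_i) in items.items():
    # In log-odds space these are straight lines through (b_i, 0) with slope a_i.
    print(name, [round(logit_2pl(t, a_i, b_i), 1) for t in (-1.0, 0.0, 1.0)])
```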

[Audience question about the number of estimated parameters, partly inaudible.]

In the legend there are 20 items, so obviously we have 20 parameters here. We're not estimating the theta_p's, right — this is a random effect, and we're actually setting its mean to 0. Since we're not estimating those, we can get them in a Bayesian way after the fact, in the same way that we can get other random-effects estimates after the fact. And then we have, in this case, 20 parameters for difficulty.

So this here — I can actually show you in the output. Where was the output? I don't think I have it here. But I think that is the underlying — [partly inaudible] — the log likelihood, degrees of freedom, for example.

[Audience question about what the data look like.] Absolutely — you know, what we're feeding here: you can do it long or wide, it doesn't really matter; Stata lets you do it wide just as easily. So what do the data look like? I should have done this before and shown you. It is a person-by-item matrix, right, where you have persons as rows, items as columns, and zeros and ones in each of the cells — you can also extend that to 0, 1, 2 for polytomous items — and you're modeling the probability of a correct response to each item. So what do the data look like? I think I have this, if I can show it to you.
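The person-by-item layout described here — rows are persons, columns are items, cells are 0/1 — might look like this tiny made-up example, with the classical percent correct per item as a first descriptive pass:

```python
# Tiny hypothetical person-by-item score matrix: 4 persons (rows) x 3 items (columns).
scores = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
]

n_persons = len(scores)
n_items = len(scores[0])
# Classical "difficulty" is just the column mean: the percent correct per item.
pct_correct = [sum(row[j] for row in scores) / n_persons for j in range(n_items)]
# Sum scores are the row totals — the CTT score for each person.
sum_scores = [sum(row) for row in scores]
print(pct_correct)  # [0.75, 0.75, 0.25]
print(sum_scores)   # [2, 1, 3, 1]
```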

And — there you go. So this is sort of what the data look like behind the scenes. What I've done here — these are two separate item characteristic curves, for two items, and I have mapped the sum score associated with each theta onto the theta scale here, and put those weights — like, how many observations happen to be there — as dots. And so you can sort of see that what we're trying to do is fit the probability of a correct response given, like, that sort of overall score. Does that help a little bit? OK.

Theta is really weird and annoying because — I mean, where does it come from? It's this latent scale, and so you can sort of see, it's set in the same way that a random effect is sort of set in Stata: we just say it's got some mean of 0, and instead of estimating the variance we're putting it back on the slope.

So let me show you further, to give you a little bit of a sense of what the curves look like. This is the item characteristic curve demo that I like to do — this is my visualizing-IRT site. This is a three-parameter logistic model. So what happens if I increase — there's a blue item hiding behind this, there's a blue ICC hiding behind this, a blue curve hiding behind this red curve — and what I'm going to do is increase the discrimination of this blue item. And what we're going to see, right, is that we're going to increase this sort of slope here in the probability space, and this blue item is now what we describe as more discriminating, in the sense that people just below that sort of midpoint, versus just above, are going to have a pretty massive swing in their probability of a correct response. So my question — my trick question to you — is: which item is more discriminating?

Blue or red? And the knee-jerk reaction is: the answer is blue, blue is more discriminating. But if you think about it more carefully — and some of you did a good job of working through this on the Google Doc — where is the slope higher? Which item has a higher slope? Is there a general answer to that? And, in fact, when might the red item be better?

Yeah — so at the tail ends of the distribution. You can see that for people who are very high achieving on this scale, or very low on this scale — this goes back to Sue's question, right — whom might we be trying to discriminate among? We're going to get to information shortly, but the idea is that what IRT allows you to do is say: difficulty for whom, discrimination for whom. And even though you have a's and b's, you wouldn't want to call an item just more difficult or just less difficult, because it all depends on for whom, right? And so you can use this, again, to construct tests in very strategic ways, to provide information for high-achieving or low-achieving students if you're so inclined.
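The trick question — which item is "more discriminating" — can be checked numerically by comparing ICC slopes at different thetas for a steep and a shallow item. The derivative of the 2PL curve in the probability metric is a·P·(1 − P); the parameters below are hypothetical.

```python
import math

def prob(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def icc_slope(theta, a, b):
    """Slope of the 2PL ICC in the probability metric: a * P * (1 - P)."""
    p = prob(theta, a, b)
    return a * p * (1.0 - p)

blue = (2.5, 0.0)  # high discrimination (hypothetical)
red = (0.8, 0.0)   # low discrimination (hypothetical)
for theta in (-3.0, 0.0, 3.0):
    winner = "blue" if icc_slope(theta, *blue) > icc_slope(theta, *red) else "red"
    print(theta, winner)  # blue wins near the middle; red wins in the tails
```

So "more discriminating" has no general answer: the steep item wins near its midpoint, the shallow item wins in the tails — discrimination for whom.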

So, similarly, what I'm going to do now is increase the difficulty of this blue item. What do you think — which way do you think that blue curve is going to go? The blue curve here is going to shift to the right. It's going to take a little bit of a walk, and for more and more people across the theta scale, the probability of a correct response is going to be low. So now your blue curve here — the blue item — is more difficult, it seems, right? It's got a higher b parameter estimate, so you're like, that is more difficult. But is it really more difficult? When might it be easier?

So if you look all the way up at the top, you actually see an instance, right, where the blue item is easier than the red item. When the discrimination parameters are not the same, this is like an interaction effect: you can't really say across the board which is more difficult and which is easier; it depends on where you are in the scale. Now, if all the a parameters are the same, as they are in the one-parameter logistic model, then there's never any overlap — the difficult item is always more difficult, and the easier item is always easier. But once you allow discrimination to change, that allows you to be very targeted about for whom it is difficult and for whom it is easy.

[Audience:] I guess my question is: do you have to actually find people [at those levels] in order to estimate this? If you'd only given it to people who were really high achieving, you wouldn't have any information [elsewhere].

That's right — you'd be forced to extrapolate, in the way that we do. It's the exact same thing as fitting a linear model — I mean, this is a linear model in the log odds — and you're just saying: what I'm going to assume is that if I want to predict for people down there, what happens to people down there is an extrapolation of that linear-in-the-log-odds assumption, right? And so when we say person-free item measurement and item-free person measurement, what we're really saying is: yeah, if my model holds — which is what we always say. This is just a regression assumption; it's nothing magical. But it is still nonetheless useful, and what we find in a lot of cases is that the linear-in-the-log-odds assumption is pretty reasonable. So yeah — just a quick note: the slope here, at the curve's midpoint in the probability metric, is a over 4, and of course in the log-odds space it's just the slope a itself. And again, be careful when a's vary, when discrimination varies — be careful about assuming discrimination is discrimination. Do not select items based on parameters; select items based on curves.
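The a/4 remark checks out numerically: at theta = b, the 2PL curve passes through probability 1/2, so its slope in the probability metric is a·(1/2)·(1/2) = a/4, while in log-odds space it is just a. A quick finite-difference check, with arbitrary illustrative parameters:

```python
import math

def prob(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def numeric_slope(theta, a, b, h=1e-6):
    """Central-difference slope of the ICC in the probability metric."""
    return (prob(theta + h, a, b) - prob(theta - h, a, b)) / (2 * h)

a, b = 1.7, 0.5  # arbitrary illustrative parameters
slope_at_midpoint = numeric_slope(b, a, b)
print(round(slope_at_midpoint, 4), round(a / 4, 4))  # both 0.425
```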

In a sense, right, you should think in a characteristic-curve way — always visualize, if you can, the items themselves.

So I want to show you what happens here with the c parameter, which I haven't really talked about, given how fast I've been rushing through. When I increase it here, I'll show you what happens — it sort of lifts the floor. See what's going on here? Some of you already might know the answer to this, but why would this be useful? Why would we want to say that, in certain cases in educational testing, people with extremely low proficiency still have a 25 percent chance of getting it right?
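The floor-lifting he demos is the c (lower-asymptote, or "guessing") parameter. A 3PL sketch with hypothetical numbers:

```python
import math

def prob_3pl(theta, a, b, c):
    """3PL: a guessing floor c plus (1 - c) times the usual logistic curve."""
    base = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return c + (1.0 - c) * base

# Even a very low-proficiency examinee keeps ~25% on a 4-option multiple-choice item.
print(round(prob_3pl(-6.0, 1.0, 0.0, 0.25), 2))  # 0.25
# High proficiency still approaches 1.
print(round(prob_3pl(6.0, 1.0, 0.0, 0.25), 2))   # 1.0
```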

[Audience answer, partly inaudible.] That's not quite where I was going, but I do like the sentiment. This is a data-fitting exercise, so you wouldn't really want to control it in that particular way — but I really do like that sentiment. It wouldn't quite pull that off there, but I think it's a cool idea.

Now, this is very tuned to educational testing, when you have multiple-choice tests, and the idea is that when you have a very, very low-scoring examinee, forcing the lower asymptote to be 0 is kind of silly. That said, my general recommendation is to never use a three-parameter logistic model, and I'm going to show you why — by setting blue to point... and then 0.95... that didn't quite work out. Point 3 and point 2 — so maybe I got this a little bit off; I know what it is. Let me just fix the window.

So what I've done here is create a situation where we have dramatically — not that dramatic, fairly dramatically — different parameter estimates, but the curves are overlapping through much of the upper end of the distribution. Right — you see how those curves are sitting on top of each other over there? And the question would be: do you have enough information at the bottom end of that distribution to actually estimate those lower asymptotes? C parameters are notoriously noisy, and so Stata, in all its wisdom — I'm very grateful for this — has actually not given you the option to fit a true three-parameter logistic model. When you fit a three-parameter logistic model, Stata says all your c parameters have to be the same across items, and estimates a common lower asymptote. And that's a really wise thing, because otherwise there's no information down there, you get a whole bunch of noise, and it throws all of your other parameter estimates off. So, just so you know: in general I don't recommend using the three-parameter logistic model. In practice it is used a lot, and I do not really understand why. I keep pushing back on states against using it, because it just adds a whole bunch of noise — do not overfit your data, as a general rule. So luckily Stata has prevented you from doing that by giving you a common c parameter to estimate; that's just fine, if you're so inclined.

So.

Here is actually some of the output — this is irt in Stata. And again, now that I don't have to use gsem anymore: I had ridiculously long do-files that are now completely obsolete, because all you have to do is type irt 1pl and your items and you're all set. You can plot, too — it's got some good IRT plotting functions for you. And you get output that sort of looks like this.

[Audience question, partly inaudible — about fitting this in long format as a random-effects logistic model.] Yes — I actually deleted that slide here, but I have an xtmelogit version, which is exactly the same thing. [Exchange partly inaudible — the difference is that theta enters as a random effect.]

And so I actually usually take a three-step approach. First — especially for economists it's useful to show it that way, right, and for people who are multilevel modelers — you start off by showing it as a random-effects logistic model. And then I show it to the people who have taken structural equation modeling or factor analysis before, and I just try to demystify it: under the hood it's all the same thing, don't freak out — but we psychometricians have developed kind of this mystical language for talking about it.

So.

So now, just a quick note here: again, this is linear in the log odds, right? People often ask whether IRT really is equal-interval. It is equal-interval in that it is setting up this linear assumption, but it treats as the target of interest the log of the odds of a correct response, and assumes linearity between theta and all of those log-odds functions. So I guess I'll just say: remember that this is the assumption. And it's a sort of simple model when you show it like this — maybe it's not as pretty, but that's really what's going on underneath.

So this, again, is a three-parameter logistic model estimating a common c parameter — I think that's a good thing. You can show that it's fitting better in some cases. I don't really like the likelihood ratio test for these purposes, because usually in practice you have these massive data sets, and everything's always going to show up as fitting better when you give it more parameters. It's not really that interesting; sometimes simpler is better.

[Audience question, partly inaudible: if an item gives me a 70 percent chance, does that mean that across hundreds of similar questions I'd get 70 percent of them right?]

I mean, I guess it's the same — that's an interesting question. You'd think it'd be deterministic in some way; that's a good question. I think: don't think about you; think about people like you, who also sit at that theta. That's probably the easiest way to think about it — there are 100 people at that theta, and 30 of them are getting it wrong. So it's nothing against you personally; it's just that there's something we haven't modeled in you to be able to tell more. We don't have a model specific to you, so just think about all the people at that theta, rather than you having a 70 percent chance of getting it right. It's the same sort of thing in any given scatterplot for a regression, right? You have an x, you have a y — and you're not talking about you; you're talking about, on average for people at x, what's your best guess for y?

This is just a note on parameterization — you're asking, like, do you estimate the variance of the random effects, or do you let slopes vary? I just want to note here that you can do both. For those of you who have taken factor analysis or structural equation modeling: you have to anchor the scale in one of two ways — you set the variance, or you set one of the loadings. I just want to show that there is sort of an equivalence there. This is a bit of an aside; all of this here is a reference.

So, some practical guidance here for you when it comes to sample size. You get the same kind of guidance for factor analysis, right, but just be careful: this is not a small-sample kind of endeavor. For the one-parameter logistic model you can get away with small samples — this is just a reminder that when you have small samples, just stick with Rasch; Rasch is a good way to get what you need. You get varying advice from different authors for the two-parameter logistic model. The three-parameter logistic model — don't use it; unless it's the way Stata uses it, it's just an absolute mess, and lots of examinees are needed for the 3PL, so don't even bother. And this goes for polytomous items too — you may have heard of the graded response model, which is for polytomous items. This is why I was saying get your discrete histograms: see if people are actually responding at, like, the 4 and 5 score points, so you can estimate those curves.

So I want to talk a little bit about the practical differences between item response theory and classical test theory. Here what I've shown is a sum score against the logit of the percent correct, adjusted a little bit to keep it away from 100 percent, and you can see that it's just a nonlinear transformation of the sum score — and in the logit it looks a lot like the one-parameter logistic estimates for theta. Which is just to say: don't think IRT is going to create dramatically different scores in your case. The thetas the one-parameter logistic model would give you are just a slight nonlinear transformation of the sum score. So that's the relationship between the one-parameter logistic and the sum score. Once you get to the two-parameter logistic model, you start to get some information based on the items that discriminate more or less. And similarly, between the two-parameter and the three-parameter logistic models you've basically got the same thing — that lower asymptote is not making that much of a difference. So if you want to talk about the practical impact of IRT on your scoring, that's not where you're going to see the difference. Again, I think the value of IRT is really for scale maintenance over time, for linkages, for fancy things where you're subbing in new items and estimating for new populations. Within any given static item-response data panel, IRT over and above classical test theory is kind of like a sledgehammer to a nail. That doesn't mean it's not a cool thing to do, and it's useful for diagnosis, but really what you want to do with IRT is say: OK, now I'm going to pick these items up and use those particular marbles from this particular urn to target a measurement instrument for a particular purpose. And it's for that particular design that IRT becomes particularly handy.
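The "slight nonlinear transformation" relating the sum score to 1PL-style scores can be sketched as the logit of an adjusted percent correct. The 0.5 and 1.0 adjustment constants below are a common continuity convention assumed for illustration, not taken from the talk.

```python
import math

def adjusted_logit_score(raw, n_items):
    """Logit of the percent correct, nudged away from 0% and 100% so the
    logit stays finite. The 0.5/1.0 continuity adjustment is an assumed convention."""
    p = (raw + 0.5) / (n_items + 1.0)
    return math.log(p / (1.0 - p))

n = 20
for raw in (1, 5, 10, 15, 19):
    # Monotone in the raw sum score, but stretched in the tails -
    # roughly the shape of 1PL theta estimates plotted against the sum score.
    print(raw, round(adjusted_logit_score(raw, n), 2))
```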

So, let's see — what should I do next? I want to talk a little bit — I talked with Matt about this too — about one of the cool things about IRT: if you look at the equation for IRT, it puts theta, which is a person ability estimate, and b, which is an item feature, on the same scale. It subtracts them; it sort of says, you know, your theta relative to the item's difficulty. And what I like about IRT is that it gives you a way of mapping items to the scale in a way that imbues that scale with, you could almost argue, a qualitative kind of property. It says: OK, let's pick a response probability at which I'm probably going to get an item correct — think of it as, like, 70 percent, to use a cutoff picked in that way. And then we can say: if that's the case, then if I have a theta of, like, 2.2, that's where I'm likely to get this kind of item correct, and if I have a theta of 1.2, that's where I'm likely to get that item correct — different thetas will have different mappings.

again why is this useful is because often

times you're going to get people asking so

I got a score of like 30 What does

that mean like what isn't a C.T.

score of 30 mean what is an S.C.T.

score of 600000000 what isn't the score of

The does and by putting examinee

proficiency and I have difficulty on

the same scale it allows me to create what

we call these item that's And here's some

of the work that we've done nape this is

not very elegant I have to have to say but

it sort of says OK what is is

explain properties of

sums of odd numbers very.

Apple you can click on that answer see

what that means you can do right with this

it with a with a specified probability

I really like this, because educational scales can be extremely abstract. You're always wondering what a 10 or a 20 or a 30 is. I've actually asked my students in many cases — whether it's a psychological scale, like, you get a grit score of 3, what is that? Or a theta scale, or a scale score of 600. This allows you these qualitative descriptions of what a score actually means. I think it's a very powerful, underused method, because increasingly, I think, statistics is moving toward descriptions of magnitudes in addition to statistical tests. For example, how much is an effect size of 0.5? That's something we really struggle with, and being able to say here's what 0.5 means — you used to be able to do two-digit subtraction and now you can do three-digit subtraction, or whatever it is — being able to accurately describe what you could do then and what you can do now can be really powerful.

[Audience question, inaudible.]

So that would be an example of the model not fitting the data, right, if that sort of thing happened a lot. Ideally, every single time you move up the scale, you only get more and more items correct. Obviously that doesn't happen exactly in practice, but it has to happen on average, and if it doesn't, the IRT model won't fit, and you'll get really bad alphas, because effectively your items aren't cohering — even classical test theory, at that stage, will recognize that your scale is not cohering. So if you have a high alpha, if you do a scree plot for dimensionality, if your IRT model fits — which are all different ways of saying you have a unidimensional scale — then what you're describing doesn't happen that often. And so, by picking a response probability, and with these curves being correct, you get this ordering of items and thetas in a successively ordered way. Sometimes the curves cross — you can see it here — so the two-parameter logistic model gets a little dicey as far as interpretation, because the item orderings aren't the same for different response probabilities. But on the whole, I think this is a reasonable way to say: OK, here's what performance at this level means.

So, previously they all got this sort of spiraled set of randomly equivalent forms — is that the question? Yes.

In math we're moving to multi-stage testing, which is to say adaptive, but in stages — kind of like what was done in some of the National Center for Education Statistics tests, where you get this two-stage exam: based on whether you performed high or low, they give you harder items or easier items. But even for items an examinee never saw, you'd still, in a model-based way, be able to predict whether they'd respond correctly, if the model holds. That's the whole idea of IRT itself: even if you didn't observe a response to an item, you can still predict the probability of a correct response to it. So you would hope that these item maps — if the model fits, which is what we always condition on — would hold. I really like this; it's one of my pet things about IRT. So I hope you remember item maps as something you can use when you're trying to explain a score to, you know, your aunt or uncle. It's like: my daughter got a 600 on the MCAS — great, but what's her percentile rank? Or you can say: this is what she can do, with, say, a specified percent probability. It's a good way of anchoring the scale, and that's really what I think measurement is partly about: what does this score mean they can do?

So — someone derived this once; who was that? That's right. This is a slightly different, algebraically equivalent version of the same thing: it's just inverting the IRT equation, the item characteristic curve.
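As a sketch of the inversion being described here — the parameter values are hypothetical — the 2PL item characteristic curve P(θ) = 1 / (1 + exp(−a(θ − b))) can be solved for the theta at which P equals a chosen response probability, which is how each item gets mapped to a point on the scale:

```python
import math

def icc(theta, a, b):
    """2PL item characteristic curve: P(correct | theta)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def map_theta(a, b, rp=0.70):
    """Invert the ICC: the theta at which P(correct) equals the
    chosen response probability (RP) criterion."""
    return b + math.log(rp / (1.0 - rp)) / a

# Hypothetical items: (discrimination a, difficulty b).
items = {"easy": (1.2, -1.0), "hard": (1.2, 1.5)}
for name, (a, b) in items.items():
    t = map_theta(a, b)
    print(f"{name}: mapped at theta = {t:.2f}, check P = {icc(t, a, b):.2f}")
```

With RP = 0.70 each item lands log(0.7/0.3)/a above its difficulty, which is why the choice of response probability shifts where every item sits on the map.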

OK, so I'm going to skip estimation, even though it would be really fun to talk about — this is a little bit of an illustration of maximum likelihood and how things work — and instead talk a little bit about how tests are built. Go ahead.

[Audience question about how item maps are used in practice, largely inaudible.]

So I think the general goal of item maps is to understand what a score implies about what a student knows and is able to do, in the case of educational testing — or what a respondent happens to report, in the case of psychological testing. So for example, if you have a grit score of, like, 4, that means you went from neutral to affirmative on this particular item; that's a way of saying what a four means. And I think that generally does increase the likelihood of appropriate interpretation of scores. I love that you're asking this — this is a question I usually ask other people — but take the big NAEP declines from 2013 to 2015: how big were they? Not that big, if you look at the differences in the kinds of skills students were on average able to do this year versus last year. It just helps to give people a sense of magnitude.

Mark Lipsey has this great piece on translating the effects of interventions into interpretable forms. I think that's the job, and he does it in a bunch of really useful ways: talking about cost-benefit analysis, talking about numbers of months of learning. But I think this is a way, in a criterion-referenced way, to literally say: hey, this is what you're able to do now, and this is what you were able to do then. And that will facilitate any number of interpretations downstream, because it's really about what we predict you're able to do. So whenever you're thinking about a score and helping people interpret scores, let item maps be one possible way you can describe them. Let me be very specific about another way they're used: item maps are also used to set standards. I haven't put standard setting in here because I have opinions about it.

So, standard setting is a process by which we say: this much is good enough. NAEP has set standards — a Proficient cut score. It is a judgmental cut score. We just had this massive evaluation from the National Academy of Sciences about whether that process was justifiable, and for the most part it was, but it's a judgment. The process uses this mapping system: if you are a rater coming in to set standards, you get a book of all of these items in a row, and what you do is flip through the book and put a bookmark where you think the just-proficient designation should be. So that's another way this is actually used, in a very practical way, to help people set a judgmental cut point on what they think is good enough, based on what people can actually do at that level. Does that help?

[Audience question, partly inaudible:] What can you tell us about the classic Rasch people?

This is a great point, Chris. So there's a camp of very thoughtful, well-reasoned, but also sometimes — for fear of offending anybody, am I on tape? — cultish people, many of whom are very close friends of mine, who are in this Rasch camp, where they think the model is so useful that it's worthwhile sometimes to throw away data to get the model to fit. That sounds a little bit crazy to those of us who grew up in a more statistical camp, but the idea is: look, we're trying to design a good measure; this item is discriminating differently; it's going to lead to these weird ordering effects where now I can't have item maps that stay in the same order if I pick different response probabilities; I don't like that, so I'm not going to use that item. Which means you're defining, in a very strong, very statistical way, what you think the construct is, and it becomes this subset of the things you might want to measure, because you're throwing away all the stuff that doesn't fit the model. What you end up getting, in the end, is arguably this very clean scale where everything is ordered without conditions — there's no crossing of these lines, no interactions; this item is always more difficult than that other item, for everybody. What you might have lost in the process is content, and as I said: content is king. Content is king. You can see my bias here when I say you should fit models to the data, have a theory, and not throw out data to fit your model. But at the same time, they have a framework in place that makes them comfortable doing that for particular uses. They tend to be very diagnostic about these things — these are targeted scales for particular purposes — and they don't tend to claim it's good for all purposes; I don't think they'd say to do that for a state assessment. But this camp exists, and they're good people, but they really like their model.

[Audience question about measuring multiple related constructs with separate scales, largely inaudible.]

That's sort of an exploratory factor analytic, or confirmatory factor analytic, approach, where you want a data-based way of saying whether this item loads more on this construct or loads more on that one. That's something you can do as well, and I see the confirmatory factor analytic camp as not so different from the Rasch camp: they're trying to make the pictures fit, and I don't think that's bad — it serves particular purposes. But I tend to be more unidimensional, because I'm cynical about the ways people can use multiple scores — you're just going to add them together in the end, so you might as well analyze it that way. But for theoretical reasons I see why SEM and factor analysis are useful for that purpose.

So, just some useful facts for you. For the one- and two-parameter logistic models, there is a sufficient statistic for estimating theta. What is a sufficient statistic? It holds all the information you need to estimate theta — it is not the data itself, but it holds all the information the data carry about theta. That sufficient statistic is the sum of the discrimination parameters for the items you got right. This makes sense at least operationally, if not necessarily intuitively. In a 1PL model, all the discriminations are the same, which is to say that the number correct, for the Rasch model, holds all the information you need to estimate theta — everyone who gets the same sum score will have the same estimated theta. Now, when discriminations differ, and some items hold effectively more information than others, you get credit for the discrimination parameters of the items you answer correctly. So if you get 20 correct and I get 20 correct — if you get 80 percent and I get 80 percent — we might not have the same theta. Why would it be different?

[Audience response.] This is good — I totally tricked you, I'm so sorry, but that is exactly what I said when my advisor asked me this, like, 12 years ago. That's what I said: the 20 you got were harder, so you got the 20 hard ones right and I got the 20 easy ones right. But don't forget that if you got the 20 hard ones right, then you must have gotten the easy ones wrong. I said: that's weird. So it's actually not the difficulty of the items that matters, it's the discrimination. The idea is that the 20 you got right were the ones that had the information, and the 20 that I got right were the ones that were coin flips. But I said the same thing — you have to sort of invert it, right.
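The sufficient-statistic claim can be checked numerically. In this sketch — with made-up item parameters — two different response patterns that share the same discrimination-weighted score sum(a_i · x_i) produce exactly the same maximum-likelihood theta under the 2PL, because the log-likelihoods differ only by a constant in theta:

```python
import math

def loglik(theta, resp, a, b):
    """2PL log-likelihood of a 0/1 response pattern at a given theta."""
    ll = 0.0
    for x, ai, bi in zip(resp, a, b):
        p = 1.0 / (1.0 + math.exp(-ai * (theta - bi)))
        ll += math.log(p) if x else math.log(1.0 - p)
    return ll

def mle_theta(resp, a, b):
    """Crude grid-search MLE for theta on [-4, 4]."""
    grid = [i / 100.0 for i in range(-400, 401)]
    return max(grid, key=lambda t: loglik(t, resp, a, b))

# Hypothetical items: two with a = 1, one with a = 2.
a = [1.0, 1.0, 2.0]
b = [-0.5, 0.5, 0.0]
# Both patterns have the same weighted score sum(a_i * x_i) = 2.0 ...
p1 = [1, 1, 0]   # ... two low-discrimination items right
p2 = [0, 0, 1]   # ... one high-discrimination item right
print(mle_theta(p1, a, b), mle_theta(p2, a, b))  # identical theta-hats
```

Under a 1PL, all a_i are equal, so the weighted score collapses to the raw number correct — which is why same sum score implies same theta there.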

[Audience comment, largely inaudible.]

But again, remember that for him to get 80 percent of the difficult items correct, he must have gotten 20 percent of the easy items wrong — which is basically a statement of misfit. That's weird, right? It doesn't happen that often, and if it happened a lot, the model wouldn't fit — it would say: I have no idea what you're doing; these items aren't correlating with each other. So it doesn't happen very often, and for the most part the scale will be unidimensional, which is to say, if the 1PL fits, the higher you are, the higher your probability of getting these items correct, and an even higher probability for all those other, easier items. So the unidimensionality assumption and model fit kind of bake in the rarity of that happening. But that's absolutely right — that's the intuition I had too. You just have to remember to flip it and say: but don't forget, you got all the easy ones wrong, which is weird. Good — I think this is helpful intuition for you.

And so, just to note here: when you get your scores from state testing programs, where do they come from? You might think that if the state is using IRT, they would estimate theta for everybody and report all these different thetas. That is not what happens, and there's a reason — it's purely to do with feasibility and transparency. The feasibility idea is that we can't run these giant models every single time. The transparency idea is: hey, that thing we just talked about — try explaining that to someone in the public. You see that you got 20 correct and I got 20 correct, and you're telling me we got different scores? It's the fact that we can't explain it — we psychometricians can't explain it well — so we give up on the fact that the theta-hat, if we truly have a two-parameter logistic model, is a better estimate of theta. If we're answering more informative items correctly, we should use that information; we generally don't, for the sake of transparency. What a lot of states publish — you'll see these in their tech reports — are these raw-score-to-scale-score conversion tables, which is to say: take the sum score, find your row, and there's a one-to-one mapping from raw scores to scale scores. We wouldn't be able to do that if we had this weird thing where, if you got a 20 with one response pattern, you have this theta, and someone else with another pattern has some other theta. That's what we call the difference between pattern scoring and number-correct scoring. So in your own analyses you might have thetas from a 2PL with this continuous distribution, but what you get from a state is going to look much more discrete, even if they fit a 2PL or 3PL.
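A raw-score-to-scale-score table of the kind described here can be sketched under a Rasch model, where the conversion is well defined: for each interior raw score, find the theta whose expected number correct equals that raw score (the item difficulties below are hypothetical):

```python
import math

def expected_score(theta, b):
    """Rasch test characteristic curve: expected number correct."""
    return sum(1.0 / (1.0 + math.exp(-(theta - bi))) for bi in b)

def theta_for_raw(raw, b, lo=-6.0, hi=6.0):
    """Bisection: the theta whose expected score equals the raw score.
    Defined only for interior raw scores 1..n-1."""
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if expected_score(mid, b) < raw:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical item difficulties for a 5-item form.
b = [-1.5, -0.5, 0.0, 0.5, 1.5]
table = {r: round(theta_for_raw(r, b), 2) for r in range(1, 5)}
print(table)  # the published raw-to-scale lookup, one row per raw score
```

Printing the table shows the stretching at the ends that the 1PL transformation produces: adjacent raw scores are farther apart in theta near the extremes than in the middle.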

[Audience question about what is lost by number-correct scoring, largely inaudible.]

For individual schools, as I showed — those scatterplots before had correlations like .98, .99 — it is not making too much of a difference. But yes, what we're basically conceding is that we're going to punt, for feasibility and transparency reasons. And don't forget the value of IRT, which I actually haven't had sufficient time to demonstrate here: scale maintenance. We can't use the same items this year that we used last year, because everyone saw them last year, so now we have to use different items. But because we know what the features of those marbles in our urn are, we can build, like, the perfect test that measures the same thing across years that we measured before.

So, to give you an example: this is a 1PL. This is the sum score, and this is the distribution of the theta scores — it's the same thing, the same ordering; all we did was a one-to-one transformation, because the sufficient statistic is the sum score. And as I've described before, what does IRT do, for practical purposes, for a static set of item responses? It squeezes the middle and stretches the ends — that's it. You can just barely see it here; all we're doing is a nonlinear transformation. And this is the 1PL versus the 2PL — I'm showing you these scatterplots here. This is the one-parameter logistic: everyone who got a 3 gets the same score. But you can see, at any given score point, the people who scored really high on the 2PL are those that got the discriminating items right, and the people who scored really low got the low-discriminating items right.

So, how should I close here? With five minutes left, let me just go back to basics and open it up for questions — I think that's what I'll do. There's a lot here I would like to have covered:

this is linking, showing you how you can get to the comparisons I showed you today through common items, and there's more. So anyway, let me close and open it up for questions. What do I want you to believe? I think there's so much to be said for just diligent exploratory data analysis, and I hope you don't think that's too boring, because I swear it will save you so much time later, when you're trying to fit your IRT models and they're not converging — it is well worth it. In terms of selling IRT: I showed you how it works, but there's a really powerful piece I didn't get to animate sufficiently for you — how these marbles from these urns have these properties. Very precisely, each item has an information function associated with it, and you can pick items up and say: I want to measure here, and maybe also here, and build, in this way, sort of the perfect test to discriminate at particular points in the theta distribution. That's really powerful. For example, if you wanted to evaluate people right at a cut score — if you were designing a diagnostic test for pass/fail purposes — you could stack all the items from your urn that have maximal information at that particular point and target a test for precisely that purpose. So IRT allows you, through these item parameter estimates, to have that information. I can actually show you — you can see it here; let me zoom out a little bit.

Right, so under here I have these item information functions. Here's what I'm going to do: I'm going to increase the discrimination on the blue item — let's make it, like, 2. You see that? Right there — now I've described an item with a lot of discrimination at exactly that point. If its difficulty were negative one, it would have its discrimination out at this other point. So each of these items has this information function, and you can stack them up and figure out where you're going to minimize your standard errors. These people over here are going to have low standard errors, and these other people you can sort of sacrifice, because you're not making decisions about them.
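The test-targeting idea can be sketched in a few lines. Under the 2PL, item information is I(θ) = a²·P·(1 − P), peaked at θ = b and scaling with a² (which is why bumping the discrimination to 2 in the demo makes the peak jump); test information is the sum over items, and the standard error at θ is 1/√(test information). The item pool and cut score below are hypothetical:

```python
import math

def info(theta, a, b):
    """2PL item information: a^2 * P * (1 - P), peaked at theta = b."""
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

def pick_items(pool, cut, n):
    """Greedy form assembly: the n pool items with maximal
    information at the cut score."""
    return sorted(pool, key=lambda ab: -info(cut, *ab))[:n]

# Hypothetical pool of (a, b) pairs, and a pass/fail cut at theta = 1.0.
pool = [(0.8, -2.0), (1.0, 0.0), (1.5, 1.0), (2.0, 1.1), (1.2, 3.0)]
cut = 1.0
form = pick_items(pool, cut, 3)
test_info = sum(info(cut, a, b) for a, b in form)
print(form, "SE at cut:", 1.0 / math.sqrt(test_info))
```

The selected items cluster near the cut in difficulty and favor high discrimination — precision is concentrated where the decision is made, at the cost of larger standard errors elsewhere.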

So, again: don't forget content. Don't forget classical test theory. I've just begun to scratch the surface of the usefulness of IRT, and we've all got a lot more to learn in this field. So, let me open it up for questions. One of my students just told me the other day: don't ask "do you have any questions," because the answer could be no — say "what questions do you have?"

Yeah. [Audience question about how often you really need IRT, largely inaudible.]

The bottom line is: 70 or 80 percent of what you need to do can be done without it. Again, when is IRT helpful? It's when you're changing items, changing populations — when stuff is changing over time. If you just have this one little form, an 8-item grit scale, don't worry about IRT. But if you want to sub out those items because people are starting to remember them — if we start using grit for high-stakes testing and people are like, hey, I remember that item — then you want to start switching items out, and that's when IRT starts to be super useful. So I'd keep it in your back pocket for when you need to swap out items; or, say, you want to take the test somewhere else — we could talk about differential item functioning — like, what if you want to pick this test up and take it to Japan or something? Then IRT can help you figure out measurement invariance. There are all these use cases where you should feel like you've got IRT as your sledgehammer in the basement, ready to come out and tackle a particularly thorny problem. But again, classical test theory is your basic IKEA toolkit — it gets you pretty far.

[Audience question about test bias and whether items function differently for different groups, largely inaudible.]

So, very strategically, back in the day, when biased tests were a concern — not that they're not a concern anymore — scholars at ETS said: hey, let's call it something more neutral, because they were asking good questions about whether measures differ for different people, but "bias" is such a loaded term. So Paul Holland and others coined the term "differential item functioning" to make the study of bias sound scientific — and it kind of does, I guess. The basic idea is that if you have two different item characteristic curves for the same item, corresponding to different groups, that's bad: theta doesn't contain all the information about how you're responding to a particular item. If you estimate, for a different population, an item characteristic curve that doesn't align, then you've got evidence of differential item functioning for that group. There's a whole set of methods under the DIF umbrella. You can also do a logistic regression of the item score on the total score, with an indicator for the group, and that in and of itself will give you a test of whether or not the item is functioning differently for one group or the other. So there's a bunch of different ways to detect it, and it's a violation of the model, and it is a concern.
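A minimal sketch of the logistic-regression DIF check just described — everything here (coefficients, cell construction) is made up for illustration. It regresses the item score on a rescaled total score plus a group indicator; a nonzero group coefficient flags uniform DIF. To keep it deterministic, the data are expected cell frequencies under a known model rather than random draws, so the fit should recover the planted group effect:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_weighted_logistic(rows, lr=0.1, iters=15000):
    """Gradient-ascent MLE for y ~ b0 + b1*x + b2*group over
    weighted (x, group, y, weight) rows."""
    b0 = b1 = b2 = 0.0
    for _ in range(iters):
        g0 = g1 = g2 = 0.0
        for x, grp, y, w in rows:
            err = w * (y - sigmoid(b0 + b1 * x + b2 * grp))
            g0 += err
            g1 += err * x
            g2 += err * grp
        b0 += lr * g0
        b1 += lr * g1
        b2 += lr * g2
    return b0, b1, b2

# Hypothetical cell weights following a model with uniform DIF:
# logit P(item correct) = -2 + 3*(total/10) + 0.8*group.
TRUE = (-2.0, 3.0, 0.8)
rows = []
for total in range(11):
    x = total / 10.0          # rescaled total score, for stable steps
    for grp in (0, 1):
        p = sigmoid(TRUE[0] + TRUE[1] * x + TRUE[2] * grp)
        rows.append((x, grp, 1, p))        # expected "correct" mass
        rows.append((x, grp, 0, 1.0 - p))  # expected "incorrect" mass

b0, b1, b2 = fit_weighted_logistic(rows)
print(f"group (DIF) coefficient: {b2:.2f}")  # recovers roughly 0.8
```

In practice you would compare the fit with and without the group term (and a group-by-score interaction for nonuniform DIF); here the point is only the shape of the regression.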

[Audience question about whether DIF is implicated in achievement gaps between groups, largely inaudible.]

So, DIF is conditional on theta: for two people with the same theta, are there different probabilities of a correct response? That still allows different groups to have different distributions of theta. You can have these two groups with two distributions of theta, and that can be a true gap. But if you estimate two different item characteristic curves from them and they don't align, that's problematic. So for people who score very low in both groups — are they equally likely to get this item right?

[Audience follow-up, inaudible.]

So what we do — there are two things we do. First of all, we assume, through the content development process — content is king, right — that you are measuring something that's right; that's part of your theory. We're not just asking yacht questions, or country-club questions, or color-dependent questions for people who are color blind — it's got to go back to content in that regard. And then, given that, you're looking at relative DIF, because it's always going to sum to zero — it's circular in exactly the way you're describing — or else you have some sort of external referent that you assume is unbiased. So it's one or the other: if you do it the internal way, it's circular; if you do it the external way, you have to question the bias in the external referent. Those are the two approaches, and the way we get out of that circularity jam is by coming all the way back from models to content, and to some theory that what you're measuring is right. So what we usually do in the test development process is flag items for DIF; they go to a content review team; the team tries to come up with a couple of hypotheses for why the DIF could have happened; usually they can't, and so they leave the item in. Paul Holland wrote this famous paper, in like 2003 or something, with a title along the lines of "what's the DIF" — about why flagged DIF often makes no difference — DIF being differential item functioning, right. Because that's really what happens in practice: tests are designed, through the content development process — this is Diane Ravitch's language-police kind of book, way back in the day — to squeeze out everything interesting, and everything possibly functioning differently across the test. So you get something so sterile in the end that there's no basis on which you can really throw anything out. It's kind of a sad statement, but you know.

[Audience question, largely inaudible.] The answer always has to be: it depends on the use.

And this is very much — so first of all, I forgot what time we end. I thought we ended at 5, but I realize now it's 5:30. Wow, OK, well, we can talk about all sorts of stuff. Keep it to questions. I mean, I am kind of exhausted, but we've got half an hour — let's talk scale reliability. I might really do that, so you guys had better ask questions, otherwise I'm going to get going.

[Audience question about model fit, largely inaudible.]

Let's get to that. To address the model-fit question: there are different schools of thought. Someone trained more in psychological measurement than educational measurement is more interested in model fit, and people in structural equation modeling generally, and factor analysis generally, are interested in a whole array of fit statistics that make me dizzy sometimes. Back in the day, like 20 years ago, you could get tenure based on creating the next new fit statistic, and now there are 60 of them and I can't keep track. I don't mean to be glib — you can tell by the way I'm talking about it that I'm skeptical — but I think you can start off with something like an alpha statistic, and once it's at a sufficient level, you're just using IRT to accomplish something: if it helps, go ahead and use it; if not, don't. So I think the dimensionality questions are often a little bit overwrought. That said, as a matter of operationalizing your measurement objectives, I do think alphas and scree plots and overall fit — the CFI and RMSEA and the whole suite of fit statistics — are helpful. The only problem is that you run the risk of people saying: your fit statistic is, like, 0.02 below the cutoff — and you're left asking, where did these cutoffs even come from? What does this even mean? So I'm a little cynical about fit statistics.

There are models to fit the data I

just don't so so how does I.R.T.

and C M factor analysis kind of

differ in the practice like in

the same way that regression and

a nova differ in the practice right.

When we use IRT, we tend to be very interested in the items themselves: we're trying to create a test, or maintain a test, and so we care about the specific parameter estimates for those items and we use them very, very carefully. In SEM and factor analysis, you're more interested in a global measure of whether the model fits, and if it fits, it helps to explain your theory. Sometimes in structural equation modeling you are interested in particular structural parameters, in the same way that you're interested in regression coefficients, but in general you're interested in the global idea of fit. So I guess that's the difference: the IRT psyche is, I don't care so much whether it fits overall; my standard error on this discrimination parameter is pretty decent, and it's sort of unidimensional, and that'll do, right?

So I guess I would say what we usually see in practice are these scree plots and lots of these general fit statistics: someone describes fit and then you sort of move on. If you look at Duckworth and Quinn, they do this sort of token confirmatory factor analysis, like, OK, hey, it fits, now let's go see if it predicts future outcomes, enough of that, let's go do something else. I think that's good standard practice, and that article is a good one, right, where he does that internal consistency examination on his scale and confirms it works, and people are often using it. That's a good model.

[Inaudible audience question.]

Like, you mean if it's not normally distributed?

[Response partially inaudible.] That's cool, it's a cool idea. In general I think this probably fits under, like, more Bayesian ways to go about this problem.

There are a lot of people who do this Markov chain Monte Carlo approach to simultaneously estimate everything: they have priors on the b parameters, priors on the a parameters, they can have strong priors on the c parameters, and the data kind of feed back into that information, rather than that sort of two-step approach. So I think that's probably where that sort of stuff comes in, in a more fully Bayesian framework. I guess I would look there. I haven't done that in a long time, so I'm not sure where the current state of the art is, but it's kind of a cool idea.

So let's scale this down a little. I mean, you know, it's probably beer o'clock, but let's nonetheless do a little bit on scale pliability; it's a useful setup for what we're about to get into.

Is this an equal-interval scale? This is the big debate going on. I'm not sure it's a debate, it seems pretty obvious to me, but there are those in our field who are less utilitarian and instrumentalist than I am, who are really struggling to give psychological and educational measurements the cachet of physical measurements, right? They want to say, this is my unbreakable scale, don't bend it. And I think that's sort of silly.

So, interval scale: again, we're setting it up as linear in the log odds of correct responses to items, so there is a way in which it is already equal-interval. You've always got to be equal-interval with respect to something.

So there's a good literature right now, Bond and Lang, and Nielsen as well, which you cited in your paper, and I appreciate that; there's good work on this. They're trying to tie achievement scores to these external references, and they're sort of bending the scale in response to these other scales that achievement tests typically get subjugated to, in sometimes very useful ways.

So theta is equal-interval with respect to the log odds of correct responses to items, but there's nothing magical about that; you can bend everything. And everything will still fit, as long as it's a monotonic transformation. It's no longer linear in the log odds, but it's still going to fit the data, because it's going to chase the data in some arbitrary way. So Lord sort of shows that it doesn't really matter, the data can't tell, as long as you're monotonically transforming both the item response function and the scale; it's just going to chase the data whatever you do.

So what do you make of scale indeterminacy? The logistic item response function is mathematically convenient, and there's a loose rational basis under normal assumptions, but the data can't tell which of any plausible monotone transformations is desirable. There's no one correct or natural scale for measuring traits or abilities in education. And so I come down very similarly to what Brian and Jesse articulated so well in their paper, which is that it's probably useful to think of a class of, you know, what I like to call plausible monotone transformations that you should subject your scales to: re-estimate according to the data after those transformations, and just make sure that whatever you're concluding is robust to them. Interpretations should be robust to plausible transformations of scales. So this is

what I described before, where we have these, you know, one to two to three, and I think we need a way to talk about how pliable these scales are. Because, you know, think about the item maps: who's to say that the distance between two-digit arithmetic and derivatives is what it is? How are you going to objectively say what that difference is? So yes, I would again say the scale is pliable, neither ordinal nor interval. Ordinal-versus-interval feels like an antiquated dichotomy, and we should think of something between the ordinal and the interval; the equal-interval argument is weak but not baseless.

This is just to illustrate what happens if we operationalize a transformation of an underlying scale. Say, you know, I see these normal distributions, but what I really care about are differences down there, like negative 3 to negative 1.

That's where I want to prioritize growth, either from an incentive standpoint, or because that's where, from a measurement standpoint, I truly believe those distances are like 10 times as large. You can say these are actually the distributions I've got, and if you do a straight standardized mean difference, this changes the actual effect size, right, the actual number of standard deviation units. You can look at differences in percentiles too, and the idea is that whatever judgment you're making should be robust to these transformations.

Similarly, what Sean and I did addressed sort of a separate problem, but it still resulted in a neat technique, I think, to define this class of transformations that is mean- and variance-preserving. That's just to keep your head on straight, so you're not trying to go to a completely different sort of scale: you're keeping your mean and variance approximately the same and just warping things in various directions; you can bend the distributions, and it's kind of fun.

So this is a class of exponential transformations, and subject to these constraints we get this formulation, the transformation from x to x-star. What we're doing here is saying: this red transformation, right, is accentuating these higher scores, and the blue transformation is accentuating these low scores.

You can also imagine kurtosis kinds of transformations, where you're stretching the tails but keeping everything symmetrical; these go in one direction and the other direction.
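The family described above can be sketched as follows. This is an assumed realization of a mean- and variance-preserving exponential transformation (the exact published formula may differ), with a parameter `c` controlling the direction of the induced skew.

```python
import numpy as np

def mv_preserving_exp(x, c):
    """Monotone exponential transformation, rescaled so the transformed
    scores keep the original mean and variance. A sketch of the family
    described in the talk; the published parameterization may differ."""
    if c == 0:
        return x.copy()
    y = (np.exp(c * x) - 1.0) / c    # strictly increasing for any c != 0
    y = (y - y.mean()) / y.std()     # re-center and re-scale...
    return y * x.std() + x.mean()    # ...back to the original mean and SD

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 5000)       # hypothetical scores on the original scale

results = {}
for c in (-0.5, 0.0, 0.5):
    y = mv_preserving_exp(x, c)
    skew = np.mean(((y - y.mean()) / y.std()) ** 3)
    results[c] = (y.mean(), y.std(), skew)
    print(f"c = {c:+.1f}: mean {y.mean():+.3f}, sd {y.std():.3f}, skew {skew:+.2f}")
```

The mean and standard deviation come out identical for every `c`, while negative `c` bends the distribution toward negative skew and positive `c` toward positive skew, matching the blue and red curves on the slide.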

So this is what would happen under these various c parameters as we've defined them, where you take a normal distribution: this is c of negative 0.5, the negative skew for the blue distribution, and this is c of positive 0.5, the positive skew over there. Yes?

[Audience comment, partially inaudible: growth counts more down here than up here; that's not random, it's a weighting function we could just try to estimate.]

Yeah, absolutely right. Like any sensitivity study, this is not random; it's asking a different question, exactly, and I think that's exactly the right way to frame it. This is where I think item maps can help, because what an item map will do is go along with this function and say, hey, look, what you've now said is that, you know, derivatives are close to integrals, and the distance from division to subtraction is huge. That's not random; that's a statement of a belief in these different magnitudes. So don't treat it as random error, but say: under this condition, these are the results you get; under that condition, those are the results you get. And I think that's exactly right. By the way, I think it's a general way to think about it. A lot of people have said this, but don't think of sensitivity studies as just a bunch of random things you do; each one is its own question. Exactly right.

So the way we've set this up is that the bounds on c are set so that the slope of the transformation at the 5th percentile is one-fifth to five times the slope at the 95th percentile. That's one way to think about it: the relative rate down here is, you know, five times the relative rate at the top of the distribution. There are various ways of thinking about how to stretch and squish the scale. So again, the what-to-do here is: take the existing scores and apply a family of plausible transformations; taking Sue's feedback seriously, be very clear about what each transformation implies for, say, a difference down here versus a difference up here, using some item mapping or some other way of describing it; calculate the metrics of interest from each dataset; and assess the robustness of your interpretations of those metrics across the plausible transformations.
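The recipe just described (take scores, apply a family of plausible transformations, recompute the metric, check robustness) can be sketched like this; the two-group data and the exponential family here are entirely hypothetical, for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical two-group data: treatment shifted up by 0.3 SD.
control = rng.normal(0.0, 1.0, 2000)
treat = rng.normal(0.3, 1.0, 2000)

def transform(x, c):
    """Monotone exponential stretch with parameter c (illustrative family)."""
    return x if c == 0 else (np.exp(c * x) - 1.0) / c

def cohens_d(a, b):
    """Standardized mean difference with a pooled standard deviation."""
    pooled = np.sqrt((a.var() + b.var()) / 2.0)
    return (a.mean() - b.mean()) / pooled

# Apply each plausible transformation, recompute the metric of interest,
# and inspect how stable the conclusion is across the family.
ds = {}
for c in (-0.5, -0.25, 0.0, 0.25, 0.5):
    ds[c] = cohens_d(transform(treat, c), transform(control, c))
    print(f"c = {c:+.2f}: standardized mean difference = {ds[c]:+.3f}")
```

Here the size of the effect moves around, but its sign and rough magnitude hold up across the family, which is the kind of robustness statement the procedure is after.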

You know, actually, that reference was related to gap measurement broadly, but what we were trying to do was make sure that our reliability estimates would not change too much, whether parametric or nonparametric. So we were really trying to solve a completely different problem; we were just saying, hey, there's a cool transformation that'll work for this purpose. I'm citing this because Sean and I kind of hit three fun things in that paper that had nothing to do with the abstract. The first was, hey, what are reliabilities across state testing programs in the United States? We just threw that in as a figure. And another was this little thing here, just trying to solve the real problems associated with our procedure. So it's really kind of ancillary, but that's where we started writing it up; we should really write it up more formally.

All the things we don't have time for. But yes: is this the right family? Can we think of kurtosis kinds of transformations? Are the c's bounded appropriately? And I love Sue's feedback that we shouldn't think of this as random.

So, does the reliability coefficient change? Or, what if we use the nonparametric ordinal reliability coefficient? And this is sort of saying that our correlations are actually pretty stable across all of these different transformations, and so we don't have to worry too much about reliability depending on different scale transformations. So here's where I would say we can create

a hierarchy of statistical procedures based on whether they are sensitive to scale transformations, right? Differences in means are going to be pretty darn robust; correlations, as we've shown here, are pretty darn robust; differences in differences, that gets problematic, right? Whenever you have these sort of interaction effects, that's heavily dependent on scale, because all I have to do is squish this to make it parallel, and stretch this, and I get a different kind of interaction effect. So there are different classes of procedures that I think we can lay out in a more-sensitive versus less-sensitive kind of framework, and I think that would be useful. Nielsen does that, you know.

In the papers, it wasn't a shock to us that changes in gaps are different; that's pretty straightforward. But generalizing that, saying these kinds of methods, these kinds of questions, are in general sensitive to the scale, is really useful.
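A small simulation, with made-up longitudinal data, illustrates the hierarchy: a change in gaps (an interaction effect) flips sign under a single monotone stretch of the scale, while the first-order gap does not.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4000
# Hypothetical longitudinal data: group A starts low and gains 0.5 SD;
# group B starts high and gains 0.3 SD.
a0, a1 = rng.normal(-1.0, 1, n), rng.normal(-0.5, 1, n)
b0, b1 = rng.normal(1.0, 1, n), rng.normal(1.3, 1, n)

def g(x):
    # One plausible monotone transformation that stretches the top of the scale.
    return (np.exp(0.5 * x) - 1.0) / 0.5

# Difference in gains: a change-in-gap, i.e. an interaction effect.
did_raw = (a1.mean() - a0.mean()) - (b1.mean() - b0.mean())
did_tr = (g(a1).mean() - g(a0).mean()) - (g(b1).mean() - g(b0).mean())
print(f"difference in gains, raw scale:       {did_raw:+.3f}")
print(f"difference in gains, stretched scale: {did_tr:+.3f}")

# The first-order gap, by contrast, keeps its sign under the stretch,
# because the two groups are (nearly) stochastically ordered.
gap0_raw = b0.mean() - a0.mean()
gap0_tr = g(b0).mean() - g(a0).mean()
print(f"wave-0 gap: raw {gap0_raw:+.2f}, stretched {gap0_tr:+.2f}")
```

On the raw scale A gains more than B; stretching the top of the scale amplifies B's gain enough to reverse the comparison, while the between-group gap at each wave stays positive throughout.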

So this is just a little example of how value-added models are not robust, but we don't have too much time, so I'll move past it.

So if you look back, another good reference for this, and for the changes-in-gaps question, is back to Ho and Haertel in 2006, where we showed that for the most part gaps are stochastically ordered, right? There's nothing you can do to reverse the sign of a gap. For the most part, high-achieving groups and low-achieving groups are so far apart that there's no transformation that could possibly reverse them. But we created a sort of proof of what we call second-order stochastic ordering, which is kind of a mouthful, but the idea is that for changes in gaps, as long as certain conditions hold, it's very, very easy for the most part for a transformation to reverse the sign of the change in the gaps. Right, exactly, which is the same as an interaction effect, as it sounds.

Exactly exactly right.
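The stochastic-ordering idea can be checked directly from data. This sketch, on simulated groups, compares empirical CDFs on a grid covering the bulk of both distributions and then spot-checks that an aggressive increasing transformation leaves the sign of the gap alone.

```python
import numpy as np

rng = np.random.default_rng(4)
lo = rng.normal(0.0, 1.0, 5000)   # hypothetical lower-scoring group
hi = rng.normal(0.8, 1.0, 5000)   # hypothetical higher-scoring group

# Stochastic ordering: hi's empirical CDF should sit at or below lo's
# at every point on a grid spanning the bulk of both groups.
grid = np.linspace(-2.5, 3.3, 100)
ecdf = lambda x: (x[:, None] <= grid).mean(axis=0)
ordered = bool(np.all(ecdf(hi) <= ecdf(lo)))
print("stochastically ordered:", ordered)

# When ordering holds, E[g(hi)] >= E[g(lo)] for every increasing g,
# so no monotone rescaling can flip the sign of the gap. Spot-check
# with an aggressive but increasing transformation:
g = lambda x: x ** 3
raw_gap = hi.mean() - lo.mean()
tr_gap = g(hi).mean() - g(lo).mean()
print(f"raw gap {raw_gap:+.2f}, cubed-scale gap {tr_gap:+.2f}")
```

The cubing distorts distances dramatically, yet the higher group stays higher, which is the first-order robustness the talk contrasts with the fragile change-in-gaps case.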

Because here's what I mean: your response is exactly right (well, maybe not the leave-the-room part, though maybe that's the right thing to do), which is to say, what are the intervals that this is assuming, right? That's where I think the item map and scale anchoring can be really helpful, because you're saying, look, if you want to disagree with me about the ordering, here is what I'm saying about the scale: this point is this, this point is this, this point is this; have a content-based argument about it, go ahead. I think that's where you can set your stake in the ground. Because what I don't want to do is get to this sort of nihilistic position, taking Bond and Lang a little bit too far, of saying let's solve for the crazy possible transformations that could possibly reverse this gap; I think that's a little bit too extreme. So what I tried to do in this paper with Carol Yu is to say, what are the distributions we see in practice, and how crazily would we have to stretch things for a reversal to be plausible? So we should have a debate, in exactly the way that I think you and Jesse were describing, about what's plausible in which situations, and that should be leveraged based on, you know, a

decision based on a survey of the shapes of distributions that we see in practice. That's fun. Thanks, I'm glad we had that extra time.