stats for psychology

latent class analysis

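If you are interested in developmental trajectories, chances are that you will use a Latent Class Analysis (LCA) at some point. Essentially, you can use an LCA to identify groups of individuals who follow distinct trajectories over time. (When it is applied to repeated measures with growth factors, as in this post, the model is often called a latent class growth analysis, or LCGA.)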

For example, a lot of my research focuses on delinquency among adolescents. So if I wanted to try and identify unique patterns of delinquent behavior over time, I could use latent class analysis.

This is an example of how to run an LCA using Mplus.

Step One – Preparing Your Data


In order to run an LCA, you have to have measures of your variable of interest at different time points.

In this example, because I am interested in trajectories of offending behavior, I have measured delinquency as a count variable. Specifically, participants self-reported whether or not they engaged in certain delinquent behaviors (e.g., have you broken into a building to steal something?) in the past 12 months.

Now, I want to model that behavior over time. So the resulting variables would be things like “delinquency at age 11,” “delinquency at age 12,” and so on.

[Screenshot: the data set organized for LCA]

The image above shows how the data should be organized. You can see that my variables are “del.11” and so on, which means “delinquency at [age]”. If you need help getting your data into this format, please see my post (in the SPSS section) about transposing data.

You may also notice that there are a bunch of “-999” values. This is the number I have designated for missing data. You absolutely have to code all missing data or else Mplus will get grumpy and refuse to work.

Also, as you surely know, you have to save your data in a different format in order to use it in Mplus. I use comma-delimited, but there are a few other formats that will work.

[Screenshot: the Save As dialog]

Notes: Select file type (.csv), be sure that you are using local encoding, and do not write variable names to spreadsheet – this box is usually checked by default, so be sure to uncheck it before saving.

 

Now that you’ve prepared your data, you can run the analysis.

Step Two – Syntax


Syntax is hands-down the most difficult part of using Mplus, so we are going to walk through the syntax for a latent class analysis step-by-step.

[Screenshot: the DATA command]

First, you have to designate where the data is coming from. This is done by using the Data command.

 

Note: In this example, I spelled out “file is.” However, you can substitute “=” for “is” and it will do the same thing. That’s one of the few things that Mplus is pretty flexible about.

Be sure that you include the entire file name, including the extension (e.g., “.csv”).
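As a minimal sketch, the DATA command could look like the following. The file name delinquency.csv is a made-up placeholder, not the name of the actual file used in this example.

```mplus
! Sketch only: "delinquency.csv" is a hypothetical file name.
DATA:
  FILE IS delinquency.csv;   ! "FILE = delinquency.csv;" does the same thing
```

If the data file is not saved in the same folder as the input file, give the full path to the file instead of just its name.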

The next step is specifying information about your variables.

First, you have to name all of the variables in the data set…even if you don’t intend to use them. Mplus relies on the number of variables in the data set in order to “read in” the file the correct way.

For example, this data set has 13 variables total: ID, sex, race, and ten delinquency variables. If I only name the delinquency variables (NAMES = del10-del19), Mplus will still read the first three variables…now they’re just named wrong (i.e., del10, del11, and del12, respectively), which will mess up the analysis.

[Screenshot: the NAMES statement]

You must name all variables in the data file, even if you are not using them.

Another important note: for variables whose names differ only by a numeric suffix, you can use a shortcut to save some work.

So instead of typing out “del10 del11 del12 del13 del14 del15 del16 del17 del18 del19,” I can just put the first and last names joined by a hyphen (“del10-del19”), and Mplus is smart enough to figure out the rest.
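A sketch of the NAMES statement for this data set, assuming the 13 columns appear in the file in the order ID, sex, race, then the ten delinquency variables (the exact spellings id, sex, and race are my assumption):

```mplus
VARIABLE:
  ! Name every column in the file, in file order, even the unused ones.
  ! Column order and the names id/sex/race are assumed for illustration.
  NAMES = id sex race del10-del19;
```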

[Screenshot: the MISSING, USEVARIABLES, IDVARIABLE, and COUNT statements]

 

The next few lines of syntax are just a way of telling Mplus what it is working with.

First of all, remember how I told you to designate a value for missing data? Well, now you have to tell Mplus what that value is. The statement MISSING = ALL (999); tells Mplus that, for all variables, 999 marks a missing value. The statement is sensitive to +/- signs: if you designated -999 as the missing value, as in the data shown earlier, you have to put -999 in this statement.
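For the data described in Step One, where missing values were coded as -999, the statement might look like this sketch:

```mplus
VARIABLE:
  ! -999 flags a missing value on every variable in the file.
  ! The sign matters: (999) would not match values stored as -999.
  MISSING = ALL (-999);
```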

Now that you’ve named all of the variables in the file, you use the USEVAR statement to tell Mplus which variables you are using in this analysis. In latent class analysis, you only have to use the variable that you are trying to model over time; in this example, that is just delinquency.

The underlying process of modeling data like this involves creating individual trajectories (i.e., for each participant) and then using those to make grouping decisions. Thus, Mplus needs to know the IDVARIABLE so that it can keep the data straight.

Finally, so that the program handles the data appropriately, we have to designate that our delinquency variable is a count variable. This is the COUNT = statement.

Also, in this example, there are a lot of zero values (because delinquency isn’t THAT common). You could choose to use a zero-inflated Poisson model, but instead we are using a negative binomial model because it estimates a dispersion parameter for each count variable, which allows the variance to exceed the mean. This is designated in our syntax by following the COUNT = designation with (nb).
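Put together, these three statements could look like the sketch below. The variable names follow the earlier NAMES example; the identifier name id is an assumption.

```mplus
VARIABLE:
  USEVARIABLES = del10-del19;   ! only the repeated delinquency measures
  IDVARIABLE = id;              ! participant identifier (name assumed)
  COUNT = del10-del19 (nb);     ! count outcomes, negative binomial model
  ! COUNT = del10-del19 (i);    ! alternative: zero-inflated Poisson
```

Note that id, sex, and race stay in NAMES but are left out of USEVARIABLES, because they are not part of the model.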

[Screenshot: the CLASSES statement]

This is where you tell the program how many classes to create. We are starting our analysis with two classes.

The CLASSES = C(N) statement tells the program how many classes you want it to create. This is where you need a theoretical understanding of LCA to really grasp the process.

Essentially, you will model your data several different ways during this process and, eventually, make a decision about which model fits the best.

In order to do this properly, you must start with the smallest possible number of “groups,” or classes, run the model, and then work your way up. For each model, Mplus will report Fit Indices (these will be discussed later on); you compare these for each model and see which model (or, how many classes) fits your data the best.
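A sketch of the statement for the two-class starting model. The name c for the latent class variable is just the usual convention; any valid name would work.

```mplus
VARIABLE:
  ! "c" names the latent class variable; the number in parentheses
  ! is how many classes to estimate in this run.
  CLASSES = c(2);
```

For each later run, only the number in parentheses changes (c(3), c(4), and so on).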

[Screenshot: the ANALYSIS command]

ANALYSIS tells Mplus what kind of analysis we are using.

For latent class analysis, we designate that analysis type using the command

TYPE = MIXTURE

Mixture modeling is what you use when the latent variable is categorical, which is obviously the case when we are trying to identify latent classes. Although the latent class variable itself must be categorical, the observed variables in a mixture model can be continuous, censored, binary, ordered categorical (ordinal), unordered categorical (nominal), counts, or combinations of these variable types.

The STARTS option is used to specify how many random sets of starting values should be generated in the initial model, and how many optimizations should be used in the final stage of modeling. The default number of starts for MIXTURE models in Mplus is 20, and the default number of optimizations is 4.

In this example, we specify 100 random sets of starting values and 20 optimizations. The reason that we designate more starts and optimizations is because the higher these values are, the more thorough the investigation of best-fitting model will be; there are typically multiple solutions in LCA, so we want to be sure that we are getting the best one.

Finally, the STITERATIONS option specifies how many iterations are allowed for each set of random starting values in the initial stage. By default, Mplus allows 10 iterations. However, we increase this number once again to allow for a more thorough analysis.
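A sketch of the ANALYSIS command described above. The STARTS values come from this example; the STITERATIONS value of 20 is only an illustrative assumption, since any value above the default of 10 fits the description.

```mplus
ANALYSIS:
  TYPE = MIXTURE;
  STARTS = 100 20;      ! 100 random starts, 20 final-stage optimizations
  STITERATIONS = 20;    ! assumed value; anything above the default of 10
```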

[Screenshot: the MODEL command]

The MODEL command is used to specify your model.

The MODEL Command is the crucial component of any analysis in Mplus, because you are specifying things that may be unique to your model.

In mixture modeling, there are many different “parts” that your model can have. For example, you can have the within-subjects component, the group component, the clustering component… The name “mixture” refers to the idea that your sample is a mix of unobserved subgroups (the latent classes), each of which can have its own model parameters.

Here, we specify the %OVERALL% model, or the part of the model that is going to be the same for all latent classes.

The i s q | statement is where we specify the growth model parameters: an intercept (i), a linear slope (s), and a quadratic slope (q). To the right of the | symbol, we list each of the time points for the model (each delinquency measurement is a time point), and we fix the time scores (the slope factor loadings) at equidistant values.

This is necessary because Mplus has no idea what our time points represent. However, by setting the factor loadings at incremental increases of 0.1, we are telling it that each measure is the same distance from the previous measure (in this case, one year).
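A sketch of what the MODEL command could look like for the ten delinquency measures. The 0.1 spacing comes from this example; starting the time scores at 0 (so the intercept reflects age 10) is my assumption.

```mplus
MODEL:
  %OVERALL%
  ! Intercept (i), linear slope (s), and quadratic slope (q).
  ! Time scores rise by 0.1 per year; starting at 0 is an assumption.
  i s q | del10@0 del11@0.1 del12@0.2 del13@0.3 del14@0.4
          del15@0.5 del16@0.6 del17@0.7 del18@0.8 del19@0.9;
```

The @ symbol fixes each loading at the value that follows it, which is how Mplus learns the spacing between measurements.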

[Screenshot: the PLOT command]

The plot statement is not necessary for statistical purposes, but it is exceptionally useful to visually represent the growth curve for each latent class.

 

The PLOT statement tells Mplus that we want graphs. Here, we designate PLOT3 because it will generate all possible graphs, whereas TYPE = PLOT1 and TYPE = PLOT2 do not generate all of the potential graphs.

The SERIES statement is simply telling Mplus what to graph: in this case, we want to graph delinquency over age, and conveniently, we have created our variables to represent exactly that.
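A sketch of a PLOT command along these lines. Using (*) for the x-axis values is an assumption on my part; the original syntax may have assigned the values differently.

```mplus
PLOT:
  TYPE = PLOT3;
  ! (*) places the ten measures at x-axis values 1, 2, 3, ... in order.
  SERIES = del10-del19 (*);
```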

 

[Screenshot: the OUTPUT command]

 

The output statement specifies what kind of output we want.

TECH11 requests the Lo-Mendell-Rubin likelihood ratio test of model fit (Lo, Mendell, & Rubin, 2001), which compares the estimated model with a model with one fewer class than the estimated model.

However, because the Lo-Mendell-Rubin approach has been criticized (Jeffries, 2003), we also request TECH14, which produces a parametric bootstrapped likelihood ratio test (McLachlan & Peel, 2000) that also compares the estimated model to a model with one fewer class than the estimated model.

Honestly, the more model-fit indices you can use, the better – it’s like replication.

*NOTE: TECH14 is a very time-consuming analysis for Mplus because of the bootstrap draws, so if you’re in a hurry, you may just have to skip this and count on the LMR.
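A sketch of the OUTPUT command requesting both tests:

```mplus
OUTPUT:
  ! TECH11: Lo-Mendell-Rubin likelihood ratio test (k-1 vs. k classes)
  ! TECH14: parametric bootstrapped likelihood ratio test (slow)
  TECH11 TECH14;
```

If run time is a problem, dropping TECH14 from this line leaves only the LMR test.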

Step Three – Output

Now is when you get to figure out if your model is a good fit.

First, you need to check whether your model actually converged. If it didn’t, you need to head back to your syntax.

[Screenshot: loglikelihood values from the random starts]

To figure this out, you check out your best loglikelihood, or your optimum start seed.

Quite simply…does it replicate? We see in this example that our best loglikelihood (-16352.365) repeats many, many times. So we are good to go!

[Screenshot: the model fit information section of the output]

Now we check our model fit information, which is conveniently labeled as such in the output.

1. Check your AIC and BIC.

[Screenshot: AIC and BIC values in the output]

The smaller the AIC/BIC, the better the model fit. Right now, in this example, this is our first model…so it doesn’t tell us much. But we need to write it down because after we run our next version, we need to compare those values to see which model has a better fit.

2. Now, check your Chi-Square Test for Model Fit

[Screenshot: the chi-square test for model fit in the output]

3. Check the TECH output

[Screenshot: TECH11 output, Lo-Mendell-Rubin test]

I’m only showing the TECH11 output here for example’s sake…

But it’s pretty straightforward. Like the output says, it’s a test for “1 versus 2 classes.” And because our p-value is significant, we can say that the two-class solution we are testing fits significantly better than a model with just one latent class.
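For reference, the AIC and BIC from item 1 are both built from the loglikelihood plus a penalty for model complexity. In standard notation (the symbols are my own labels, not names from the Mplus output), with $\log L$ the maximized loglikelihood, $p$ the number of free parameters, and $n$ the sample size:

```latex
\mathrm{AIC} = -2\log L + 2p
\qquad
\mathrm{BIC} = -2\log L + p\,\ln(n)
```

Every extra class adds free parameters, so the penalty term grows with each model. An added class only lowers the AIC or BIC if it improves the loglikelihood by more than the penalty it costs.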

Step Four – Repeat

Now, you have to repeat the entire process for the next number of latent classes. So in this example, you would now run the model with three classes; then you would compare the model fit indices, as noted before.
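To make the repeat step concrete, here is a sketch of a complete input file for the three-class run, assembled from the pieces described in Step Two. The file name, the title, the column order, the STITERATIONS value, the starting time score of 0, and the (*) plot option are all assumptions for illustration; only the number of classes differs from the two-class run.

```mplus
TITLE:
  ! Hypothetical title for the three-class run.
  LCA of delinquency, ages 10 to 19, three classes;

DATA:
  FILE IS delinquency.csv;        ! hypothetical file name

VARIABLE:
  NAMES = id sex race del10-del19;   ! column order assumed
  MISSING = ALL (-999);
  USEVARIABLES = del10-del19;
  IDVARIABLE = id;
  COUNT = del10-del19 (nb);
  CLASSES = c(3);                 ! the only change from the two-class run

ANALYSIS:
  TYPE = MIXTURE;
  STARTS = 100 20;
  STITERATIONS = 20;              ! assumed value above the default of 10

MODEL:
  %OVERALL%
  i s q | del10@0 del11@0.1 del12@0.2 del13@0.3 del14@0.4
          del15@0.5 del16@0.6 del17@0.7 del18@0.8 del19@0.9;

PLOT:
  TYPE = PLOT3;
  SERIES = del10-del19 (*);

OUTPUT:
  TECH11 TECH14;
```

With three classes, TECH11 and TECH14 compare the two-class and three-class solutions, so their p-values tell you whether the third class is worth keeping.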

Good luck! Please comment with questions or clarifications! Happy stat modeling!