Month: April 2016

calculating descriptive stats

Once you’ve been doing statistics for a while, you tend to take descriptive statistics for granted…mostly because we all use stats programs that just take our raw data and do it for us.

But for all of you who are just starting out, a thorough understanding of descriptive statistics is absolutely essential. So this is a quick post that will start from the ground up on descriptive stats.




Measures of Central Tendency

In statistics, we have this big collection of raw data from individuals, and we are always trying to describe the data set as a whole. Thus, we use “descriptive” statistics.

Measures of central tendency are a commonly used method to describe data sets. Essentially, we are taking this collection of data and trying to explain where the “middle” of the data is. The three main measures of central tendency are the mean, median, and mode.

Mean: Mean is just our fancy statistical way of saying average, and the calculation is quite simple: you add up all of the raw scores, and then divide by the number of scores that you added together. So let’s say that you bowl a lot, and you want to know your average score for the last five games you bowled. Your scores for each game were 150, 175, 250, 210, 195.

In order to calculate this, you would add the scores for each of those five games (150+175+250+210+195=980).

Now, you divide the sum of raw scores by the number of games (980/5 = 196). So your bowling average for the past 5 games was 196.
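If you want to double-check the arithmetic yourself, here is a minimal sketch in Python using the bowling scores from above. The variable names are just my own choices for illustration.

```python
from statistics import mean

# Bowling scores from the example above
scores = [150, 175, 250, 210, 195]

total = sum(scores)        # add up all of the raw scores
n_games = len(scores)      # how many scores were added together
average = total / n_games  # divide the sum by the number of scores

print(total, n_games, average)
print(mean(scores))        # the standard library does the same calculation
```

Doing it by hand once and then comparing against `mean()` is a nice way to convince yourself that the stats program isn't doing anything mysterious.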

Median: The median is simply the “middle” number in the data set. If you rank your scores in order from low to high, the one that falls right in the middle represents the median. So if we refer back to our bowling example, the median is 195 because that is the “middle” score on the number line.

150     175     195     210     250
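The "rank and pick the middle" idea can be sketched in Python like this. The sixth score of 200 at the end is hypothetical; I added it only to show what happens with an even number of scores, where the median is the average of the two middle values.

```python
from statistics import median

# Bowling scores from the example above
scores = [150, 175, 250, 210, 195]

ordered = sorted(scores)             # rank the scores from low to high
middle = ordered[len(ordered) // 2]  # odd number of scores: take the middle one

print(ordered)
print(middle, median(scores))

# Hypothetical sixth game (not in the example above): with an even number
# of scores, the median is the average of the two middle values.
six_games = scores + [200]
print(median(six_games))
```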

Mode: The mode is the most common score/value in the data set. In our bowling example, all of the scores have a frequency of 1 (i.e., they only occur one time), so there is no mode. But let’s say that we are talking about the age of high school seniors (see below). In this sample of ten students, the most common age is 17.5, which occurs three times. So mode = 17.5.

17       17.25       17.5       17.5        17.5       18       18       18.5       18.5       19
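Counting frequencies is easy to sketch in Python. One thing to keep in mind: when every value occurs only once (like the bowling scores), `multimode` hands back every value rather than saying "no mode," so you have to look at the frequencies to interpret the result.

```python
from collections import Counter
from statistics import multimode

# Ages of the ten high school seniors from the example above
ages = [17, 17.25, 17.5, 17.5, 17.5, 18, 18, 18.5, 18.5, 19]
# Bowling scores from the earlier example
scores = [150, 175, 250, 210, 195]

age_counts = Counter(ages)         # frequency of each age
print(age_counts.most_common(1))   # the most frequent age and its count
print(multimode(ages))             # the most common value(s)

# Every bowling score occurs once, so all five tie for "most common";
# in practice we would say this data set has no mode.
print(multimode(scores))
```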

latent class analysis

If you are interested in developmental trajectories, chances are that you will use a Latent Class Analysis (LCA) at some point. Essentially, you can use an LCA to identify groups of individuals who follow unique trajectories over time.

For example, a lot of my research focuses on delinquency among adolescents. So if I wanted to try and identify unique patterns of delinquent behavior over time, I could use latent class analysis.

This is an example of how to run an LCA using Mplus.

Step One – Preparing Your Data

In order to run an LCA, you have to have measures of your variable of interest at different time points.

In this example, because I am interested in trajectories of offending behavior, I have measured delinquency as a count variable. Specifically, participants self-reported whether or not they engaged in certain delinquent behaviors (e.g., have you broken into a building to steal something?) in the past 12 months.

Now, I want to model that behavior over time. So the resulting variables would be things like “delinquency at age 11,” “delinquency at age 12,” and so on.


The image above shows how the data should be organized. You can see that my variables are “del.11” and so on, which means “delinquency at [age]”. If you need help getting your data into this format, please see my post (in the SPSS section) about transposing data.

You may also notice that there are a bunch of “-999” values. This is the number I have designated for missing data. You absolutely have to code all missing data or else Mplus will get grumpy and refuse to work.

Also, as you surely know, you have to save your data in a different format in order to use it in Mplus. I use comma-delimited, but there are a few other formats that will work.

lca save as

Notes: Select file type (.csv), be sure that you are using local encoding, and do not write variable names to spreadsheet – this box is usually checked by default, so be sure to uncheck it before saving.
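If your data happen to live somewhere other than SPSS, the same rules apply: comma-delimited, no variable names in the first row, and every missing value coded. Here is a minimal sketch in Python. The rows, the file name, and the helper function `write_mplus_csv` are all hypothetical; only the -999 missing code comes from this post.

```python
import csv

MISSING = -999  # the missing-data code used in this post

# Hypothetical rows (id, sex, race, then a few delinquency counts);
# None marks a value that was not observed.
rows = [
    [1, 0, 1, 0, 2, None],
    [2, 1, 2, 1, None, 3],
]

def write_mplus_csv(path, rows, missing=MISSING):
    """Write rows as comma-delimited text with no header row."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        for row in rows:
            # Replace every missing value with the numeric missing code
            writer.writerow([missing if value is None else value for value in row])

write_mplus_csv("delinquency_example.csv", rows)
```

Note that nothing writes a header line, which is the equivalent of unchecking the "write variable names" box.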


Now that you’ve prepared your data, you can run the analysis.

Step Two – Syntax

Syntax is hands-down the most difficult part of using Mplus, so we are going to walk through the syntax for a latent class analysis step-by-step.


First, you have to designate where the data is coming from. This is done by using the Data command.


Note: In this example, I spelled out “file is.” However, you can substitute “=” for “is” and it will do the same thing. That’s one of the few things that Mplus is pretty flexible about.

Be sure that you include the entire file name, including the extension (i.e., “.csv”).

The next step is specifying information about your variables.

First, you have to name all of the variables in the data set…even if you don’t intend to use them. Mplus relies on the number of variables in the data set in order to “read in” the file the correct way.

For example, this data set has 13 variables total: ID, sex, race, and ten delinquency variables. If I only name the delinquency variables (NAMES = del10-del19), Mplus will still read the first three variables…now they’re just named wrong (i.e., del10, del11, and del12, respectively), which will mess up the analysis.


You must name all variables in the data file, even if you are not using them.
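To see why the naming matters, here is a small Python analogy (this is not how Mplus works internally, just an illustration of names being applied to columns from left to right). The row of values is hypothetical.

```python
# One hypothetical row with 13 values: ID, sex, race, then del10 through del19
row = [101, 1, 2, 0, 0, 1, 3, 2, 0, 0, 1, 0, 0]

# Naming every variable in the file
full_names = ["id", "sex", "race"] + [f"del{age}" for age in range(10, 20)]
# Naming only the delinquency variables
short_names = [f"del{age}" for age in range(10, 20)]

correct = dict(zip(full_names, row))
wrong = dict(zip(short_names, row))  # names are applied left to right

print(correct["del10"])  # the real delinquency count at age 10
print(wrong["del10"])    # actually the ID
print(wrong["del11"], wrong["del12"])  # actually sex and race
```

With the short list, the first three names land on ID, sex, and race, which is exactly the mislabeling described above.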

Another important note is that for variables that are repeated except for the number extension, you can do what I have done in this example to save some work.

So instead of typing out “del10 del11 del12 del13 del14 del15 del16 del17 del18 del19,” I can just put “first-last,” and Mplus is smart enough to figure out the rest.



The next few lines of syntax are just a way of telling Mplus what it is working with.

First of all, remember how I told you to designate a value for missing data? Well, now you have to tell Mplus what that value is. The statement MISSING = ALL (999); tells Mplus that for all variables, the value of missing data is 999. The statement is sensitive to +/- signs, so if you designated -999 as the missing value (as I did in the data file above), you have to put -999 in the statement: MISSING = ALL (-999);

Now that you’ve named all of the variables in the file, you use the USEVAR statement to tell Mplus which variables you are using in this analysis. In latent class analysis, you only have to use the variable that you are trying to model over time; in this example, that is just delinquency.

The underlying process of modeling data like this involves creating individual trajectories (i.e., for each participant) and then using those to make grouping decisions. Thus, Mplus needs to know the IDVARIABLE so that it can keep the data straight.

Finally, so that the program handles the data appropriately, we have to designate that our delinquency variable is a count variable. This is the COUNT = statement.

Also, in this example, there are a lot of zero values (because delinquency isn’t THAT common). You could choose to use a zero-inflated Poisson model, but instead we are using a negative binomial model because it estimates a dispersion parameter for each count outcome. This is designated in our syntax by following the COUNT = designation with (nb).


This is where you tell the program how many classes to create. We are starting our analysis with two classes.

The CLASSES = C(N) statement tells the program how many classes you want it to create. This is where you need a theoretical understanding of LCA to really grasp the process.

Essentially, you will model your data several different ways during this process and, eventually, make a decision about which model fits the best.

In order to do this properly, you must start with the smallest possible number of “groups,” or classes, run the model, and then work your way up. For each model, Mplus will report Fit Indices (these will be discussed later on); you compare these for each model and see which model (or, how many classes) fits your data the best.


ANALYSIS tells Mplus what kind of analysis we are using.

For latent class analysis, we designate that analysis type using the command TYPE = MIXTURE.


Mixture modeling is what you use when the latent variable is categorical, which is obviously the case when we are trying to identify latent classes. Although mixture modeling can only be used with categorical latent variables, if the model were to use an outcome variable (LCA does not have an outcome variable), the outcome variables can be continuous, censored, binary, ordered categorical (ordinal), unordered categorical (nominal), counts, or combinations of these variable types.

The STARTS option is used to specify how many random sets of starting values should be generated in the initial model, and how many optimizations should be used in the final stage of modeling. The default number of starts for MIXTURE models in Mplus is 20, and the default number of optimizations is 4.

In this example, we specify 100 random sets of starting values and 20 optimizations. The reason that we designate more starts and optimizations is because the higher these values are, the more thorough the investigation of best-fitting model will be; there are typically multiple solutions in LCA, so we want to be sure that we are getting the best one.

Finally, the STITERATIONS command specifies how many iterations of each start are allowed. By default, Mplus allows 10 iterations. However, we increase this number once again to allow for a more thorough analysis.


The MODEL command is used to specify your model.

The MODEL Command is the crucial component of any analysis in Mplus, because you are specifying things that may be unique to your model.

In mixture modeling, there are many different “parts” that your model can have. For example, you can have the within-subjects component, the group component, the clustering component…That’s why it is called mixture modeling. You can have a mix of different models going on at once.

Here, we specify the %OVERALL% model, or the part of the model that is going to be the same for all latent classes.

The i s q | command is where we specify the growth model parameters: an intercept (i), a linear slope (s), and a quadratic term (q). Here, we list each of the time points for the model (each delinquency measurement is a time point), and we fix the factor loadings at equidistant values.

This is necessary because Mplus has no idea what our time points represent. However, by setting the factor loadings at incremental increases of 0.1, we are telling it that each measure is the same distance from the previous measure (in this case, one year).
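If typing ten loadings by hand makes you nervous, you can generate them. This is a sketch under two assumptions of mine: that the first time point is fixed at 0, and that the variables are named del10 through del19. Because Mplus input lines cannot exceed 90 columns, the statement is split across two lines.

```python
# Ages 10 through 19, one delinquency measure per year
ages = list(range(10, 20))

# Equidistant time scores: 0, 0.1, 0.2, ... (starting at 0 is an assumption)
time_scores = [round(0.1 * k, 1) for k in range(len(ages))]

# Pair each variable with its fixed loading, e.g. del11@0.1
terms = [f"del{age}@{score:g}" for age, score in zip(ages, time_scores)]

# Split the statement over two lines to stay under 90 columns
lines = [
    "i s q | " + " ".join(terms[:5]),
    "        " + " ".join(terms[5:]) + ";",
]
print("\n".join(lines))
```

The important part is the spacing between loadings, not the starting value: every step is 0.1, so every measurement is treated as one year after the previous one.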


The plot statement is not necessary for statistical purposes, but it is exceptionally useful to visually represent the growth curve for each latent class.


The PLOT statement tells Mplus that we want graphs. Here, we designate TYPE = PLOT3 because it will generate all possible graphs, whereas TYPE = PLOT1 and TYPE = PLOT2 do not generate all of the potential graphs.

The SERIES statement simply tells Mplus what to graph: in this case, we want to graph delinquency over age, and conveniently, we have created our variables to represent exactly that.




The output statement specifies what kind of output we want.

TECH11 requests the Lo-Mendell-Rubin likelihood ratio test of model fit (Lo, Mendell, & Rubin, 2001), which compares the estimated model with a model with one less class than the estimated model.

However, because the Lo-Mendell-Rubin approach has been criticized (Jeffries, 2003), we also request TECH14, a parametric bootstrapped likelihood ratio test (McLachlan & Peel, 2000) that also compares the estimated model to a model with one less class than the estimated model.

Honestly, the more model-fit indices you can use, the better – it’s like replication.

*NOTE: TECH14 is a very time-consuming analysis for Mplus because of the bootstrap draws, so if you’re in a hurry…you may just have to skip this and count on the LMR.

Step Three – Output

Now is when you get to figure out if your model is a good fit.

First, you need to check whether your model actually converged. If it didn’t, you need to head back to your syntax.


To figure this out, you check out your best loglikelihood, or your optimum start seed.

Quite simply…does it replicate? We see in this example that our best loglikelihood (-16352.365) repeats many, many times. So we are good to go!
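The replication check is really just counting. Here is a sketch in Python, assuming you have copied the final-stage loglikelihood values out of your output by hand. Only -16352.365 comes from this example; the other values are hypothetical.

```python
from collections import Counter

# Final-stage loglikelihoods, one per optimization.
# Only -16352.365 is from this post; the rest are made up for illustration.
final_loglikelihoods = [
    -16352.365, -16352.365, -16352.365, -16352.365,
    -16360.112, -16371.904,
]

best = max(final_loglikelihoods)  # the best loglikelihood is the highest one
times_replicated = Counter(final_loglikelihoods)[best]

print(best, times_replicated)
if times_replicated < 2:
    print("Best loglikelihood was not replicated; try more random starts.")
```

If the best value shows up only once, that is your cue to increase the STARTS values and run the model again.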

model fit

Now we check our model fit information, which is conveniently labeled as such in the output.

  1. Check your AIC and BIC.

    model fit1

    The smaller AIC/BIC, the better the model fit. Right now, in this example, this is our first model…so it doesn’t tell us much. But we need to write it down because after we run our next version, we need to compare those values to see which model has a better fit.

2. Now, check your Chi-Square Test for Model Fit

model fit2

3. Check the TECH output


I’m only showing the TECH11 output here for example’s sake…

But it’s pretty straightforward. Like the output says, it’s a test for “1 versus 2 classes.” And because our p-value is significant, we can say that the two-class solution we are testing fits significantly better than just having one latent class.

Step Four – Repeat

Now, you have to repeat the entire process for the next number of latent classes. So in this example, you would now run the model with three classes; then you would compare the model fit indices, as noted before.
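To keep track of the comparison, it helps to know where AIC and BIC come from: AIC = -2LL + 2k and BIC = -2LL + k·ln(n), where LL is the loglikelihood, k is the number of free parameters, and n is the sample size. Here is a sketch in Python. The two-class loglikelihood is the one from this example; the parameter counts, the sample size, and the three-class loglikelihood are hypothetical placeholders, so plug in the values from your own output.

```python
import math

def aic(loglik, n_params):
    """Akaike information criterion: -2LL + 2k."""
    return -2 * loglik + 2 * n_params

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: -2LL + k * ln(n)."""
    return -2 * loglik + n_params * math.log(n_obs)

n_obs = 1000  # hypothetical sample size

# classes -> (loglikelihood, number of free parameters)
# Only the 2-class loglikelihood is from this post; the rest is hypothetical.
models = {
    2: (-16352.365, 9),
    3: (-16201.480, 13),
}

for k, (loglik, n_params) in models.items():
    print(k, round(aic(loglik, n_params), 2), round(bic(loglik, n_params, n_obs), 2))

# Smaller is better, so pick the class count with the lowest BIC
best_k = min(models, key=lambda k: bic(*models[k], n_obs))
print(best_k)
```

Mplus reports AIC and BIC for you, so this is only for bookkeeping across runs. Remember to weigh these alongside the TECH11 and TECH14 results rather than relying on any single index.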

Good luck! Please comment with questions or clarifications! Happy stat modeling!