What CLASS scores actually predict (and don't)
CLASS observation scores predict child outcomes — but only weakly, and the predictive signal has weakened in newer studies. Why we use them anyway, and how we weight them.
Data current as of April 2026.
The peer-reviewed literature on the Classroom Assessment Scoring System (CLASS) shows real but small predictive validity for child outcomes, with effect sizes typically below 0.1 SD and recent rigorous studies finding near-null associations. We use CLASS-style measures with that caveat baked in.
Bottom line
Effects are real but small, and the literature has drifted toward more skeptical estimates over time. The foundational NCEDL 11-state study (Mashburn et al., 2008) found that CLASS Instructional Support predicted gains in academic and language skills and Emotional Support predicted teacher-reported social skills, net of child, family, and program controls — but the effect sizes were modest, and subsequent work has repeatedly landed in the same small range or below.
- Meta-analysis (Perlman et al., 2016, PLOS ONE): "some, although small, associations between the CLASS and children's outcomes." No CLASS domain emerges as a reliably strong predictor of any category of outcome.
- Burchinal's 2018 review (Child Development Perspectives): effect sizes typically less than 0.1 and often less than 0.05; rigorous quasi-experimental designs frequently produce null results. The review attributes this partly to psychometric floor effects (Instructional Support means ~2.3 on a 7-point scale in Head Start) and partly to a mismatch between what CLASS measures and what drives child gains.
- Threshold structure: Burchinal et al. (2010) and Hatfield et al. (2016) find that CLASS predicts gains mostly above a moderate-quality cut-point. ~76–87% of observed classrooms fall below that threshold — i.e., most US preschool classrooms operate in the range where CLASS and outcomes barely move together.
- Instrumental-variables work (Auger et al., 2014): causal effects of 0.03–0.14 SD per 1-SD improvement in process quality.
- Most sobering recent result: Sabol & Pianta (2014) and Weiland et al. (2021) find that "widely used measures of classroom quality are largely unrelated to preschool skill development."
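To give the effect sizes above a more intuitive scale, here is a minimal sketch that converts a shift in SD units into a percentile change for a child who starts at the median. It assumes outcome scores are normally distributed, which is our simplification for illustration and not something the cited studies report in this form. The function name `percentile_after_shift` is ours.

```python
from statistics import NormalDist


def percentile_after_shift(effect_size_sd: float, start_percentile: float = 50.0) -> float:
    """Percentile reached after shifting a child's score by effect_size_sd.

    Assumes outcome scores follow a normal distribution (an illustrative
    assumption, not a claim from the CLASS literature).
    """
    std_normal = NormalDist()  # mean 0, SD 1
    z_start = std_normal.inv_cdf(start_percentile / 100.0)
    return 100.0 * std_normal.cdf(z_start + effect_size_sd)


# Effect sizes mentioned in the bullets above (in SD units).
for d in (0.03, 0.05, 0.10, 0.14):
    print(f"{d:.2f} SD -> {percentile_after_shift(d):.1f}th percentile")
```

Under that assumption, a 0.10 SD effect moves a median child to roughly the 54th percentile, and the upper end of the instrumental-variables range (0.14 SD) to roughly the 56th.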
Convergent validity
CLASS and ECERS-R/ECERS-3 correlate moderately where their constructs overlap (typical r ≈ 0.3–0.5), most strongly between ECERS-3 Language/Literacy and CLASS Instructional Support. Structural indicators (credentials, ratios) correlate only weakly with CLASS. CLASS therefore captures something distinct enough that it shouldn't be treated as redundant with the other major observation tools, but not distinct enough to stand alone as the quality metric.
Honest caveats
- Publication bias is likely substantial. Most positive-finding CLASS studies come from developer-affiliated teams (UVA CASTL, Teachstone).
- Measurement error is enormous. A single 3–4 hour visit has low generalizability; 4–6 cycles may be needed for reliable classroom-level scores.
- Outcome measures are narrow. Most studies rely on proximal direct assessments collected once in fall and once in spring.
- Range restriction. Samples are overwhelmingly Head Start and state pre-K; very high- and very low-quality settings are under-represented.
- What's missing: curriculum fidelity, teacher-specific effects, child-level engagement (inCLASS), and dosage. Curricular content explains more variance than CLASS in several newer models.
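The measurement-error caveat above can be made concrete with two classical test theory formulas: the Spearman-Brown formula for the reliability of an average of repeated observations, and the attenuation formula for how unreliability shrinks an observed correlation. This is a rough sketch, not a generalizability study; Spearman-Brown assumes the observations are interchangeable (parallel) measurements, and the reliability values in the example are hypothetical placeholders, not estimates from the cited papers.

```python
import math


def spearman_brown(single_obs_reliability: float, n_obs: int) -> float:
    """Reliability of the mean of n_obs parallel observations."""
    r = single_obs_reliability
    return n_obs * r / (1 + (n_obs - 1) * r)


def attenuated_correlation(true_r: float, rel_x: float, rel_y: float) -> float:
    """Observed correlation implied by a true correlation and two reliabilities."""
    return true_r * math.sqrt(rel_x * rel_y)


# Hypothetical single-observation reliability, chosen only for illustration.
single_obs = 0.4
for k in (1, 4, 6):
    print(f"{k} observation(s): reliability {spearman_brown(single_obs, k):.2f}")

# A hypothetical true correlation of 0.20, seen through a quality score with
# reliability 0.4 and a child outcome measure with reliability 0.9.
print(f"observed r: {attenuated_correlation(0.20, 0.4, 0.9):.2f}")
```

The point of the sketch is directional: noisy classroom scores pull observed associations toward zero, so some of the small effects in the literature may reflect measurement noise as well as a weak underlying relationship.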
How we use CLASS-style measures
When CLASS or related observation scores are part of a state's QRIS, we incorporate them with these adjustments:
- Combine measures. We don't rely on CLASS alone — we pair it with structural indicators and curriculum/dosage proxies.
- Respect the threshold structure. A linear composite assumes every CLASS point matters equally across the scale, which is at odds with evidence that scores predict gains mostly above a moderate-quality cut-point; we use a piecewise transformation around the threshold suggested by Hatfield et al. (2016).
- Weight Emotional Support and Classroom Organization more than Instructional Support for US samples — IS has floor effects and the weakest predictive signal.
- Be honest in documentation. The score on a daycare page is built on measures whose individual predictive validity is small (often less than 0.1 SD).
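The adjustments above can be sketched roughly as follows. This is an illustration of the general shape, not our production scoring code: the thresholds, slopes, and weights are hypothetical placeholders (they are not the published Hatfield et al. cut-points), and the names `piecewise`, `class_component`, and `blended_quality` are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassScores:
    """CLASS domain scores on the 1-7 scale."""
    emotional_support: float
    classroom_organization: float
    instructional_support: float


# Hypothetical placeholders, NOT published cut-points or our real weights.
THRESHOLDS = {
    "emotional_support": 5.0,
    "classroom_organization": 5.0,
    "instructional_support": 3.0,
}
# Emotional Support and Classroom Organization weighted above Instructional Support.
DOMAIN_WEIGHTS = {
    "emotional_support": 0.4,
    "classroom_organization": 0.4,
    "instructional_support": 0.2,
}


def piecewise(score: float, threshold: float,
              slope_below: float = 0.25, slope_above: float = 1.0) -> float:
    """Two-segment linear transform, continuous at the threshold.

    A flatter slope below the threshold reflects the finding that CLASS
    predicts gains mostly above a moderate-quality cut-point.
    """
    slope = slope_below if score <= threshold else slope_above
    return slope * (score - threshold)


def class_component(scores: ClassScores) -> float:
    """Weighted sum of piecewise-transformed domain scores."""
    return sum(
        weight * piecewise(getattr(scores, name), THRESHOLDS[name])
        for name, weight in DOMAIN_WEIGHTS.items()
    )


def blended_quality(class_part: float, structural_part: float,
                    curriculum_part: float,
                    weights: tuple = (0.4, 0.3, 0.3)) -> float:
    """Combine CLASS with structural and curriculum/dosage proxies.

    Inputs are assumed to be on comparable (e.g. standardized) scales.
    """
    w_class, w_struct, w_curr = weights
    return w_class * class_part + w_struct * structural_part + w_curr * curriculum_part


example = ClassScores(emotional_support=5.5,
                      classroom_organization=5.0,
                      instructional_support=2.0)
print(round(class_component(example), 3))
```

Keeping the thresholds and slopes as explicit parameters makes it easy to check how sensitive a final score is to the choice of cut-point, which matters given how uncertain the underlying estimates are.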
What this means for parents
CLASS scores tell you something real about classroom quality, but the signal is small. A daycare with great CLASS scores is more likely than a daycare with poor ones to support your child's learning — but the difference may be modest, and other factors (curriculum, your child's specific teacher, dosage, your home environment) matter at least as much.
We surface CLASS-style scores when the state QRIS uses them, with this caveat baked in. We do not treat them as the bottom line.