The ShaSh distribution is defined by a transformation of a standard normal variable. If \(Y\) follows a standard normal distribution, \(Y \sim N(0,1)\), then the ShaSh-distributed variable \(X\) is generated by:
\[X = \mu + \sigma \cdot \sinh\left(\frac{\text{arcsinh}(Y) - \epsilon}{\delta}\right)\]
This transformation allows the resulting variable \(X\) to have its location, scale, skewness, and tail weight controlled by the four parameters:
The probability density function involves:
\[f(x|\mu,\sigma,\epsilon,\delta) = \frac{\delta}{\sigma\sqrt{2\pi}} \cdot \frac{\cosh(\delta \cdot \text{arcsinh}(z) + \epsilon)}{\sqrt{1 + z^2}} \cdot \exp\left(-\frac{1}{2}[\sinh(\delta \cdot \text{arcsinh}(z) + \epsilon)]^2\right)\]
where \(z = \frac{x - \mu}{\sigma}\).
The cumulative distribution function (CDF) does not have a closed-form expression but can be computed numerically. For a given value \(x\), the CDF is:
\[F(x|\mu,\sigma,\epsilon,\delta) = P(X \leq x) = \Phi[\sinh(\delta \cdot \text{arcsinh}(z) + \epsilon)]\]
where \(\Phi\) is the standard normal CDF and \(z = (x - \mu)/\sigma\). The quantile function (inverse CDF) can be expressed as:
\[Q(p|\mu,\sigma,\epsilon,\delta) = \mu + \sigma \cdot \sinh\left(\frac{\text{arcsinh}(\Phi^{-1}(p)) - \epsilon}{\delta}\right)\]
In cNORM’s ShaSh implementation, we model the four parameters as polynomial functions of standardized age. The tail weight parameter δ is held constant across ages in the default setting. It can be adjusted to reflect population characteristics, e. g. by increasing it to values delta > 1 for heterogenous samples or delta < 1 for homogenuous samples. By setting the delta_degree parameter, δ is as well modeled as a polynomial across age. If used, it is advisable to keep the delta_degree parameter low, for example 1 or 2, to avoid overfitting. Age is standardized as: \(\text{age}_{std} = \frac{\text{age} - \overline{\text{age}}}{\text{SD}(\text{age})}\) for numerical stability during optimization.
Parameters are estimated using maximum likelihood estimation.
1. Distributional Flexibility: The ShaSh distribution can model symmetric and asymmetric distributions with varying tail weights, making it suitable for diverse psychometric applications.
2. Continuous Scores: Unlike discrete distributions (e.g., beta-binomial), ShaSh naturally handles continuous scores, decimal values, and unbounded measures.
3. Zero and Negative Values: ShaSh can accommodate scores of zero and negative values without transformation, unlike Box-Cox approaches that require strictly positive data.
4. Independent Parameter Control: Skewness (ε) and tail weight (δ) can be controlled independently, allowing precise modeling of distributional characteristics.
The ShaSh distribution is particularly well-suited for:
Systematic Development: Test scores should exhibit systematic (though not necessarily monotonic) development across the predictor variable. The ShaSh model can capture complex non-linear relationships through polynomial terms.
Sufficient Sample Size: We recommend a minimum of 100 cases per major age group, with larger samples needed for higher polynomial degrees or complex developmental patterns. While empirical validation of this rule is ongoing, these sample sizes have proven effective in previous norming studies (Lenhard et al., 2019).
Score Characteristics: While ShaSh handles continuous scores optimally, it can also model discrete scores effectively, especially when the number of possible values is large (> 20).
Representativeness: The sample should be representative of the target population, though post-stratification weighting can help address moderate deviations from representativeness.
Polynomial Degrees: Choose polynomial degrees based on:
Delta Parameter Selection: The tail weight parameter should reflect population characteristics:
If the delta_degree parameter is used, keep it low (1 or 2) to avoid overfitting. In that case, the default δ is not used.
Convergence Considerations: Complex models may require adjusted optimization parameters. The function includes automatic fallback strategies for difficult convergence cases.
We demonstrate ShaSh modeling using the PPVT-4 vocabulary development dataset, which showcases the distribution’s ability to handle complex developmental patterns and distributional characteristics typical of psychometric data. The code examples are provided for reference but are not executed in this vignette due to computational intensity.
library(cNORM)
# Examine the data structure and distribution
str(ppvt)
#> 'data.frame': 4542 obs. of 6 variables:
#> $ age : num 2.6 2.6 2.62 2.86 2.88 ...
#> $ sex : num 1 1 1 1 1 1 1 1 1 1 ...
#> $ migration: num 0 0 0 0 0 0 0 0 0 0 ...
#> $ region : chr "west" "west" "west" "south" ...
#> $ raw : num 120 67 23 50 44 55 69 38 61 72 ...
#> $ group : num 3.16 3.16 3.16 3.16 3.16 ...
plot(ppvt$age, ppvt$raw, main="PPVT Raw Scores by Age",
xlab="Age", ylab="Raw Score", pch=16, col=rgb(0,0,0,0.3))
# Examine distributional characteristics
hist(ppvt$raw, breaks=30, main="Distribution of Raw Scores",
xlab="Raw Score", probability=TRUE, col="lightblue")
The vocabulary development data exhibits characteristic curvilinear growth patterns with potential distributional changes across age groups. The raw score distribution may show skewness or varying tail weights, making it an ideal candidate for ShaSh modeling.
# Fit ShaSh model with default settings
# Default: mu_degree=3, sigma_degree=2, epsilon_degree=2, delta=1
model.shash <- cnorm.shash(age = ppvt$age, score = ppvt$raw)
# The function automatically displays percentile plots
print(model.shash)
#> $mu_est
#> [1] 187.400553 26.220032 -9.829765 1.575112
#>
#> $sigma_est
#> [1] 2.78306883 -0.26105719 -0.02437886
#>
#> $epsilon_est
#> [1] 0.6796160 0.2457028
#>
#> $delta_est
#> NULL
#>
#> $delta
#> [1] 1
#>
#> $se
#> [1] 0.55998247 0.68059010 0.24621647 0.19370018 0.01853546 0.01489224 0.01110374
#> [8] 0.02367007 0.02277379
#>
#> $mu_degree
#> [1] 3
#>
#> $sigma_degree
#> [1] 2
#>
#> $epsilon_degree
#> [1] 1
#>
#> $delta_degree
#> NULL
#>
#> $result
#> $result$par
#> [1] 187.40055296 26.22003216 -9.82976519 1.57511216 2.78306883
#> [6] -0.26105719 -0.02437886 0.67961600 0.24570280
#>
#> $result$value
#> [1] 19670.99
#>
#> $result$counts
#> function gradient
#> 141 141
#>
#> $result$convergence
#> [1] 0
#>
#> $result$message
#> [1] "CONVERGENCE: REL_REDUCTION_OF_F <= FACTR*EPSMCH"
#>
#> $result$hessian
#> [,1] [,2] [,3] [,4] [,5] [,6]
#> [1,] 23.817309 12.107102 24.737912 19.067015 22.109318 -2.668045
#> [2,] 12.107102 24.737914 19.067015 49.592069 -2.668051 15.603729
#> [3,] 24.737912 19.067015 49.592069 35.465985 15.603701 -4.973938
#> [4,] 19.067015 49.592069 35.465985 128.016394 -4.973927 26.777866
#> [5,] 22.109318 -2.668051 15.603701 -4.973927 7993.261708 -546.933858
#> [6,] -2.668045 15.603729 -4.973938 26.777866 -546.933858 8088.462731
#> [7,] 15.603834 -4.973982 26.778281 -36.524817 8088.481895 -3161.473943
#> [8,] -398.922400 -96.699116 -392.719761 -107.000362 2685.998637 991.668459
#> [9,] -96.699114 -392.720176 -107.000241 -806.292850 991.668634 2175.283992
#> [,7] [,8] [,9]
#> [1,] 15.603834 -398.92240 -96.69911
#> [2,] -4.973982 -96.69912 -392.72018
#> [3,] 26.778281 -392.71976 -107.00024
#> [4,] -36.524817 -107.00036 -806.29285
#> [5,] 8088.481895 2685.99864 991.66863
#> [6,] -3161.473943 991.66846 2175.28399
#> [7,] 18015.172735 2175.28463 1594.02179
#> [8,] 2175.284631 10516.64239 -84.08582
#> [9,] 1594.021794 -84.08582 10421.97109
#>
#> attr(,"age_mean")
#> [1] 10.47791
#> attr(,"age_sd")
#> [1] 3.255976
#> attr(,"ageMin")
#> [1] 2.5202
#> attr(,"ageMax")
#> [1] 16.9952
#> attr(,"score_mean")
#> [1] 162.8355
#> attr(,"score_sd")
#> [1] 38.18122
#> attr(,"max")
#> [1] 221
#> attr(,"min")
#> [1] 7
#> attr(,"N")
#> [1] 4542
#> attr(,"scaleMean")
#> [1] 50
#> attr(,"scaleSD")
#> [1] 10
#> attr(,"delta")
#> [1] 1
#>
#> attr(,"class")
#> [1] "cnormShaSh"
The model uses reasonable default polynomial degrees and provides immediate visual feedback through percentile plots. The output includes fitted parameters, convergence information, and basic model statistics.
# Obtain comprehensive diagnostics
summary(model.shash, age = ppvt$age, score = ppvt$raw)
#> Sinh-Arcsinh Continuous Norming Model
#> -------------------------------------
#> Polynomial degrees:
#> Location (mu): 3
#> Scale (sigma): 2
#> Skewness (epsilon): 1
#> Tail weight (delta): Fixed at 1
#> Number of observations: 4542
#> Number of parameters: 9
#>
#> Model Fit:
#> Log-likelihood: -19670.99
#> AIC: 39359.98
#> BIC: 39417.77
#> R-squared: 0.9775
#> RMSE: 1.5216
#> Bias: 0.3307
#>
#> Convergence:
#> Converged: TRUE
#> Function evaluations: 141
#> Max gradient: 23945.01
#> Message: Successful convergence
#>
#> Parameter Estimates:
#> Location (mu) parameters:
#> Estimate Std..Error z.value Pr...z..
#> mu_0 187.401 0.5600 334.654 0.000e+00
#> mu_1 26.220 0.6806 38.525 0.000e+00
#> mu_2 -9.830 0.2462 -39.923 0.000e+00
#> mu_3 1.575 0.1937 8.132 4.441e-16
#>
#> Scale (sigma) parameters:
#> Estimate Std..Error z.value Pr...z..
#> log(sigma)_0 2.78307 0.01854 150.148 0.00000
#> log(sigma)_1 -0.26106 0.01489 -17.530 0.00000
#> log(sigma)_2 -0.02438 0.01110 -2.196 0.02812
#>
#> Skewness (epsilon) parameters:
#> Estimate Std..Error z.value Pr...z..
#> epsilon_0 0.6796 0.02367 28.71 0
#> epsilon_1 0.2457 0.02277 10.79 0
The summary provides:
Key diagnostic indicators:
For datasets with complex patterns, adjust polynomial degrees and distributional parameters:
# Example with more complex parameterization (not executed due to computation time)
model.custom <- cnorm.shash(age = ppvt$age, score = ppvt$raw,
mu_degree = 3, # Curvelinear pattern
sigma_degree = 3, # Complex variability changes
epsilon_degree = 2, # Age-varying skewness
delta_degree = 2) # Changing tail weights across age
# Compare models
compare(model.shash, model.custom, age = ppvt$age, score = ppvt$raw,
title = "ShaSh Model Comparison")
#>
#> Model Comparison Summary:
#> ------------------------
#> Metric Model1 Model2 Difference
#> R2 0.9567 0.9603 0.0037
#> Bias 0.3129 0.0542 -0.2587
#> RMSE 2.0976 1.9885 -0.1091
#> MAD 1.5190 1.4112 -0.1078
#> AIC 39359.9772 39274.0828 -85.8944
#> BIC 39417.7673 39363.9785 -53.7888
#>
#> Note: Difference = Model2 - Model1
#> Fit indices are based on the manifest and fitted norm scores of both models.
#> Scale metrics are T scores (scaleSD = 10)
#> AIC and BIC should only be used when comparing models of the same type.
Note: Higher polynomial degrees increase model flexibility but may lead to overfitting with insufficient data. Always validate complex models through visual inspection and cross-validation.
# More conservative parameterization for demonstration
model.simple <- cnorm.shash(age = ppvt$age, score = ppvt$raw,
mu_degree = 2, # Quadratic location pattern
sigma_degree = 1, # Linear variability change
epsilon_degree = 1, # Linear skewness change
delta = 1.1) # Slightly heavy tails
ShaSh models support weighted estimation for representative sampling:
# Calculate post-stratification weights
margins <- data.frame(variables = c("sex", "sex", "migration", "migration"),
levels = c(1, 2, 0, 1),
share = c(.52, .48, .7, .3))
weights <- computeWeights(ppvt, margins)
#> Raking converged normally after 3 iterations.
# Fit weighted ShaSh model
model.weighted <- cnorm.shash(ppvt$age, ppvt$raw, weights = weights)
# Compare weighted vs. unweighted
compare(model.shash, model.weighted, age = ppvt$age, score = ppvt$raw,
title = "Unweighted vs. Weighted ShaSh Models")
#>
#> Model Comparison Summary:
#> ------------------------
#> Metric Model1 Model2 Difference
#> R2 0.9567 0.9566 -0.0001
#> Bias 0.3129 0.4622 0.1493
#> RMSE 2.0976 2.1262 0.0287
#> MAD 1.5190 1.5541 0.0351
#> AIC 39359.9772 41352.8375 1992.8603
#> BIC 39417.7673 41410.6276 1992.8603
#>
#> Note: Difference = Model2 - Model1
#> Fit indices are based on the manifest and fitted norm scores of both models.
#> Scale metrics are T scores (scaleSD = 10)
#> AIC and BIC should only be used when comparing models of the same type.
# Generate norm scores for specific age-score combinations
ages <- c(10.25, 10.75, 11.25, 11.75)
raw_scores <- c(180, 185, 190, 195)
norm_scores <- predict(model.shash, ages, raw_scores)
prediction_table <- data.frame(
Age = ages,
Raw_Score = raw_scores,
Norm_Score = round(norm_scores, 1)
)
print(prediction_table)
#> Age Raw_Score Norm_Score
#> 1 10.25 180 53.4
#> 2 10.75 185 54.3
#> 3 11.25 190 55.6
#> 4 11.75 195 57.5
# Generate detailed norm tables for multiple ages
tables <- normTable.shash(model.shash,
ages = c(10.25, 10.75),
start = 150,
end = 220,
step = 1,
CI = 0.95,
reliability = 0.94)
# Display sample from first table
head(tables[[1]], 10)
#> x Px Pcum Percentile z norm lowerCI upperCI
#> 1 150 0.01069892 0.1697324 16.97324 -0.9552233 40.44777 36.36624 45.67556
#> 2 151 0.01112359 0.1788393 17.88393 -0.9197974 40.80203 36.69925 46.00856
#> 3 152 0.01155717 0.1883045 18.83045 -0.8841617 41.15838 37.03422 46.34354
#> 4 153 0.01199968 0.1981352 19.81352 -0.8483008 41.51699 37.37132 46.68063
#> 5 154 0.01245117 0.2083391 20.83391 -0.8121977 41.87802 37.71068 47.02000
#> 6 155 0.01291170 0.2189236 21.89236 -0.7758338 42.24166 38.05251 47.36182
#> 7 156 0.01338142 0.2298963 22.98963 -0.7391884 42.60812 38.39697 47.70629
#> 8 157 0.01386049 0.2412650 24.12650 -0.7022393 42.97761 38.74429 48.05361
#> 9 158 0.01434915 0.2530375 25.30375 -0.6649617 43.35038 39.09470 48.40402
#> 10 159 0.01484770 0.2652220 26.52220 -0.6273283 43.72672 39.44846 48.75777
#> lowerCI_PR upperCI_PR
#> 1 8.638209 33.27093
#> 2 9.174675 34.48936
#> 3 9.738827 35.73145
#> 4 10.331823 36.99684
#> 5 10.954874 38.28515
#> 6 11.609248 39.59601
#> 7 12.296277 40.92901
#> 8 13.017365 42.28377
#> 9 13.773995 43.65987
#> 10 14.567739 45.05694
The norm tables provide:
# Compare ShaSh with other approaches
# Beta-Binomial models should work worse, since the test has stop rules,
# leading to non-binomial distributions. Distribution-free models (Taylor)
# should be able to model the data flexibly as well.
model.bb <- cnorm.betabinomial(ppvt$age, ppvt$raw, n = 228)
model.taylor <- cnorm(group = ppvt$group, raw = ppvt$raw)
#> Powers of location: k = 5
#> Powers of age: t = 3
#> Multiple R2 between raw score and explanatory variable: R2 = 0.6915
#>
#> Final solution: 22 terms (highest consistent model)
#> R-Square Adj. = 0.991584
#> Final regression model: raw ~ L1 + L2 + L3 + L4 + A1 + A2 + A3 + L1A1 + L1A2 + L1A3 + L2A1 + L2A2 + L2A3 + L3A1 + L3A2 + L3A3 + L4A1 + L4A2 + L4A3 + L5A1 + L5A2 + L5A3
#> Regression function: raw ~ 320.4238239 + (-43.52313536*L1) + (1.576062212*L2) + (-0.02330857938*L3) + (0.0001252947997*L4) + (-119.8654247*A1) + (7.700634153*A2) + (-0.1530384974*A3) + (14.50741478*L1A1) + (-0.9423146812*L1A2) + (0.01974974212*L1A3) + (-0.4835093183*L2A1) + (0.03105400924*L2A2) + (-0.0006204492825*L2A3) + (0.006793980057*L3A1) + (-0.000372063774*L3A2) + (5.021380047e-06*L3A3) + (-3.069999226e-05*L4A1) + (2.100334875e-07*L4A2) + (5.678401213e-08*L4A3) + (-4.907683608e-08*L5A1) + (1.565428866e-08*L5A2) + (-7.273632324e-10*L5A3)
#> Raw Score RMSE = 3.49392
#>
#> Use 'printSubset(model)' to get detailed information on the different solutions, 'plotPercentiles(model) to display percentile plot, plotSubset(model)' to inspect model fit.
# Visual comparisons
compare(model.shash, model.bb, age = ppvt$age, score = ppvt$raw,
title = "ShaSh vs. Beta-Binomial")
#>
#> Model Comparison Summary:
#> ------------------------
#> Metric Model1 Model2 Difference
#> R2 0.9567 0.9368 -0.0198
#> Bias 0.3129 0.1408 -0.1721
#> RMSE 2.0976 2.5099 0.4123
#> MAD 1.5190 1.8816 0.3626
#> AIC 39359.9772 39856.6454 496.6682
#> BIC 39417.7673 39908.0144 490.2471
#>
#> Note: Difference = Model2 - Model1
#> Fit indices are based on the manifest and fitted norm scores of both models.
#> Scale metrics are T scores (scaleSD = 10)
#> AIC and BIC should only be used when comparing models of the same type.
compare(model.taylor, model.shash, age = ppvt$age, score = ppvt$raw,
title = "Distribution-Free vs. ShaSh")
#> Retrieving norm scores, please stand by ...
#>
#> Model Comparison Summary:
#> ------------------------
#> Metric Model1 Model2 Difference
#> R2 0.9609 0.9567 -0.0043
#> Bias 0.0460 0.3129 0.2669
#> RMSE 1.9787 2.0976 0.1188
#> MAD 1.4109 1.5190 0.1081
#> AIC 24320.8609 39359.9772 15039.1163
#> BIC -21528.1994 39417.7673 60945.9667
#>
#> Note: Difference = Model2 - Model1
#> Fit indices are based on the manifest and fitted norm scores of both models.
#> Scale metrics are T scores (scaleSD = 10)
#> AIC and BIC should only be used when comparing models of the same type.
The different continuous models have advantages for specific use cases, with the distribution free approach (Taylor polynomials) being the most flexible, the beta-binomial approach being optimal for discrete item counts, and ShaSh being ideal for continuous scores with complex distributional shapes.
Use ShaSh models for
Use Beta-Binomial models for
Use distribution-free (Taylor Polynomial) models when
In selecting models, please compare models using:
Convergence Issues and Numerical Instability:
Overfitting Signs:
The Sinh-Arcsinh distribution provides a sophisticated yet practical framework for modeling psychometric norm data. Its ability to independently control location, scale, skewness, and tail weight makes it particularly valuable for applications where traditional parametric approaches are inadequate due to distributional complexity.
Best practices for ShaSh modeling:
Jones, M. C., & Pewsey, A. (2009). Sinh-arcsinh distributions. Biometrika, 96(4), 761-780.
Lenhard, A., Lenhard, W., Gary, S. (2019). Continuous norming of psychometric tests: A simulation study of parametric and semi-parametric approaches. PLoS ONE, 14(9), e0222279. https://doi.org/10.1371/journal.pone.0222279
Lenhard, W., Lenhard, A., & Gary, S. (2025). cNORM: Continuous norming. R package version 3.5.0. https://CRAN.R-project.org/package=cNORM