Asserting coefficients and parameter scaling
Matt Bhagat-Conway
In travel demand modeling, it is somewhat common to “assert” coefficients rather than estimate them from data. I think most modelers would agree that this practice should be avoided when possible, but it is sometimes necessary, for example to model mode choice for a mode that does not yet exist in a region. The following guidance from the Federal Transit Administration (FTA) appears in NCHRP 716:
- A typical range for the value of the in-vehicle time coefficient for home-based work trips is -0.03 to -0.02. If the coefficient falls outside the range, FTA says that “some further analysis (is) appropriate.”
- The in-vehicle time coefficient for nonhome-based trips should be approximately the same as the in-vehicle time coefficient for home-based work trips.
- A typical range for the in-vehicle time coefficient for home-based other nonwork trips is 0.1 to 0.5 times the in-vehicle time coefficient for home-based work trips.
- A typical range for the coefficient of out-of-vehicle time is 2 to 3 times the corresponding coefficient for in-vehicle time. FTA believes that “compelling evidence” is needed to justify ratios outside this range.
The first guideline in particular presents an issue with error scaling. In a multinomial logistic regression, the probability of decisionmaker $j$ choosing alternative $i$ is
$$ p_i = \frac{e^{\mu V_{ij}}}{\displaystyle\sum_{i'} e^{\mu V_{i'j}}} $$
where $V_{ij}$ is the systematic portion of the utility of alternative $i$ for decisionmaker $j$, and $\mu$ is the scale of the error term. The scale cannot be estimated separately from the coefficients, because multiplying $\mu$ by a constant and dividing every coefficient by the same constant yields identical probabilities, so it is generally just set to 1. The estimated coefficients then scale with the amount of error in the model: the more unexplained variation there is, the smaller the coefficients are in magnitude. As a result, coefficient values are not in general comparable between logistic regression models. Two models with different amounts of error, both with the scale fixed at 1, will produce coefficients of different magnitudes even if the underlying preferences are the same.
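To make the scaling issue concrete, here is a minimal sketch in Python with NumPy. The travel times, the coefficient values, and the function name `choice_probabilities` are all hypothetical, chosen only for illustration. It compares the choice probabilities from one set of coefficients with scale 1 against those from doubled coefficients with scale 0.5.

```python
import numpy as np


def choice_probabilities(V, mu=1.0):
    """Multinomial logit probabilities for a single decisionmaker.

    V  : systematic utilities, one entry per alternative
    mu : scale of the error term
    """
    z = mu * np.asarray(V, dtype=float)
    z = z - z.max()  # subtracting the max avoids overflow and leaves the probabilities unchanged
    expz = np.exp(z)
    return expz / expz.sum()


# Hypothetical attributes for three alternatives (minutes)
ivt = np.array([20.0, 35.0, 30.0])  # in-vehicle time
ovt = np.array([5.0, 10.0, 2.0])    # out-of-vehicle time

# Hypothetical coefficients, illustrative values only
beta_ivt, beta_ovt = -0.025, -0.0625

# Parameter set A: scale 1 with the coefficients above
p_a = choice_probabilities(beta_ivt * ivt + beta_ovt * ovt, mu=1.0)

# Parameter set B: half the scale, double the coefficients
p_b = choice_probabilities(2 * beta_ivt * ivt + 2 * beta_ovt * ovt, mu=0.5)

print(p_a)
print(p_b)
print(np.allclose(p_a, p_b))
```

Both parameter sets give the same probabilities, so no amount of choice data could tell them apart. Only the product of the scale and each coefficient matters, which is why the ratio between coefficients (2.5 in both sets here) is unaffected by the normalization while the individual magnitudes are not.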
Simulated situation
Here, I build three simulated variables, all normally distributed and independent of one another. Scatter plots of the variables are shown below: