How to do piecewise linear regression with multiple unknown knots?

14

Are there any packages for piecewise linear regression that can detect multiple knots automatically? Thanks. When I use the strucchange package, I cannot detect the change points, and I have no idea how it detects them. From the plots, I can see that there are several points I would like it to pick out for me. Could anyone give an example here?

regression change-point

— Honglang Wang

1

This appears to be the same question as stats.stackexchange.com/questions/5700/… . If it differs in a substantive way, please let us know by editing your question to reflect the differences; otherwise, we will close it as a duplicate.

— whuber

1

I have edited the question.

— Honglang Wang

1

I think you can treat this as a nonlinear optimization problem. Just write down the equation of the function to be fitted, with the coefficients and the knot locations as parameters.
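A minimal sketch of that idea in R, assuming a continuous fit with exactly two knots and synthetic zig-zag data. The names `rss_for_knots` and `start`, and the starting values, are illustrative assumptions rather than anything from this thread. For fixed knots the coefficients follow from ordinary least squares, so only the knot locations are handed to `optim`:

```r
# Synthetic data (an assumption for illustration): slope changes at x = 10 and x = 19
x <- 1:28
y <- c(1:10, 9:1, 2:10)

# Residual sum of squares of a continuous piecewise linear fit for given knots.
# The hinge terms pmax(x - k, 0) let the slope change at each knot.
rss_for_knots <- function(knots) {
  fit <- lm(y ~ x + pmax(x - knots[1], 0) + pmax(x - knots[2], 0))
  sum(resid(fit)^2)
}

start <- c(8, 20)                     # rough initial guesses for the two knots
opt <- optim(start, rss_for_knots)    # Nelder-Mead search over the knot locations

rss_line <- sum(resid(lm(y ~ x))^2)   # straight-line fit, for comparison
print(opt$par)                        # estimated knot locations
print(c(line = rss_line, piecewise = opt$value))
```

The objective is not smooth in the knots and can have local minima, so the result depends on the starting values; trying several starting points is a sensible precaution.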

— mark999

1

I think the segmented package is what you are looking for.

— AlefSin

1

I had an identical problem and solved it with the segmented package for R: stackoverflow.com/a/18715116/857416

— a different ben

8

Would MARS be applicable? R has the earth package, which implements it.

— Wayne

8

In general, it is a bit odd to want to fit something as piecewise linear. However, if you really wish to do so, then the MARS algorithm is the most direct. It builds up a function one knot at a time and then usually prunes back the number of knots to combat overfitting, much as decision trees are pruned. You can access the MARS algorithm in R via earth or mda. In general, it is fitted with GCV, which is not so far removed from the other information criteria (AIC, BIC, etc.).

MARS will not really give you an "optimal" fit, since the knots are grown one at a time. It would be rather difficult to fit a truly "optimal" set of knots, since the number of possible knot placements quickly explodes.
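The one-knot-at-a-time idea can be illustrated with a small base-R sketch. This is not the MARS algorithm as implemented in earth or mda (there is no backward pruning and no GCV here). The function name `forward_knots` and the synthetic data are assumptions made purely for illustration.

```r
# Greedy forward selection of hinge knots: at each step, add the single
# candidate knot that lowers the residual sum of squares the most.
forward_knots <- function(x, y, n_knots) {
  candidates <- sort(unique(x))
  candidates <- candidates[-c(1, length(candidates))]  # interior points only
  basis <- cbind(1, x)               # start from a straight line
  knots <- numeric(0)
  rss_path <- sum(lm.fit(basis, y)$residuals^2)
  for (step in seq_len(n_knots)) {
    rss <- sapply(candidates, function(k) {
      sum(lm.fit(cbind(basis, pmax(x - k, 0)), y)$residuals^2)
    })
    best <- which.min(rss)
    knots <- c(knots, candidates[best])
    basis <- cbind(basis, pmax(x - candidates[best], 0))
    rss_path <- c(rss_path, rss[best])
    candidates <- candidates[-best]  # earlier knots are never revisited
  }
  list(knots = knots, rss_path = rss_path)
}

x <- 1:28
y <- c(1:10, 9:1, 2:10)   # synthetic zig-zag with slope changes at 10 and 19
res <- forward_knots(x, y, n_knots = 3)
print(res$knots)      # knots in the order they were added
print(res$rss_path)   # RSS after 0, 1, 2 and 3 knots
```

Because earlier knots are never revisited, the knots found this way need not coincide with the best joint placement, which is the limitation described above.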

Generally, this is why people turn to smoothing splines. Most smoothing splines are cubic just so that the human eye is fooled into missing the discontinuities. It would be quite possible to do a linear smoothing spline, however. The big advantage of smoothing splines is that they have a single parameter to optimize. That lets you reach a truly "optimal" solution quickly, without having to search through a huge number of permutations. However, if you really want to look for inflection points and you have enough data to do so, then something like MARS is probably your best bet.

Here is some example code for penalized linear smoothing splines in R:

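library(mgcv)  # provides gam() and the penalized spline smooths
data(iris)
# P-spline ('ps') smooth with basis dimension k = 6; m = 0 is meant to make the basis linear
gam.test <- gam(Sepal.Length ~ s(Petal.Width, k = 6, bs = 'ps', m = 0), data = iris)
summary(gam.test)
plot(gam.test)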

However, the actual knots chosen will not necessarily correspond to any true inflection points.

— Shea Parkes

3

I programmed this from scratch once a few years ago, and I have a Matlab file for doing piecewise linear regression on my computer. About 1 to 4 break points are computationally feasible for roughly 20 measurement points. 5 to 7 break points start to be really too much.

The pure mathematical approach, as I see it, is to try all possible combinations, as suggested by user mbq in the question linked in the comment below your question.

Since the fitted lines are all consecutive and adjacent (no overlaps), the combinatorics follow Pascal's triangle. If there were overlaps between the data points used by the line segments, I believe the combinatorics would follow the Stirling numbers of the second kind instead.
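To get a feel for how fast the search grows, here is a small R sketch. It assumes that a break may fall in any of the n - 1 gaps between consecutive points and ignores any minimum segment length, so the counts are binomial coefficients (entries of Pascal's triangle). The variable names are illustrative.

```r
n_points <- 20
n_breaks <- 1:7
# Number of ways to choose k break positions among the n - 1 gaps between points
n_arrangements <- choose(n_points - 1, n_breaks)
print(data.frame(breaks = n_breaks, arrangements = n_arrangements))
```

Each arrangement needs one regression per segment. Under these assumptions the count rises from 19 arrangements for one break to 3876 for four and 50388 for seven, which is in line with the experience above that a handful of break points is manageable for 20 points while 5 to 7 become expensive.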

The best solution in my mind is to choose the combination of fitted lines that has the lowest standard deviation of the R^2 correlation values of the fitted lines. I will try to explain with an example. Keep in mind, though, that asking how many break points one should find in the data is similar to asking "How long is the coast of Britain?", as in one of the papers on fractals by the mathematician Benoit Mandelbrot. There is also a trade-off between the number of break points and regression depth.

Now to the example.

Suppose we have the following idealized data, with $y$ as a function of $x$:

$\begin{array}{|c|c|c|c|c|c|}
\hline
x & y & R^2 \text{ line 1} & R^2 \text{ line 2} & \text{sum of } R^2 \text{ values} & \text{standard deviation of } R^2 \\
\hline
1 & 1 & 1,000 & 0,0400 & 1,0400 & 0,6788 \\
2 & 2 & 1,000 & 0,0118 & 1,0118 & 0,6987 \\
3 & 3 & 1,000 & 0,0004 & 1,0004 & 0,7067 \\
4 & 4 & 1,000 & 0,0031 & 1,0031 & 0,7048 \\
5 & 5 & 1,000 & 0,0135 & 1,0135 & 0,6974 \\
6 & 6 & 1,000 & 0,0238 & 1,0238 & 0,6902 \\
7 & 7 & 1,000 & 0,0277 & 1,0277 & 0,6874 \\
8 & 8 & 1,000 & 0,0222 & 1,0222 & 0,6913 \\
9 & 9 & 1,000 & 0,0093 & 1,0093 & 0,7004 \\
10 & 10 & 1,000 & -1,978 & 1,000 & 0,7071 \\
11 & 9 & 0,9709 & 0,0271 & 0,9980 & 0,6673 \\
12 & 8 & 0,8951 & 0,1139 & 1,0090 & 0,5523 \\
13 & 7 & 0,7734 & 0,2558 & 1,0292 & 0,3659 \\
14 & 6 & 0,6134 & 0,4321 & 1,0455 & 0,1281 \\
15 & 5 & 0,4321 & 0,6134 & 1,0455 & 0,1282 \\
16 & 4 & 0,2558 & 0,7733 & 1,0291 & 0,3659 \\
17 & 3 & 0,1139 & 0,8951 & 1,0090 & 0,5523 \\
18 & 2 & 0,0272 & 0,9708 & 0,9980 & 0,6672 \\
19 & 1 & 0 & 1,000 & 1,000 & 0,7071 \\
20 & 2 & 0,0094 & 1,000 & 1,0094 & 0,7004 \\
21 & 3 & 0,0222 & 1,000 & 1,0222 & 0,6914 \\
22 & 4 & 0,0278 & 1,000 & 1,0278 & 0,6874 \\
23 & 5 & 0,0239 & 1,000 & 1,0239 & 0,6902 \\
24 & 6 & 0,0136 & 1,000 & 1,0136 & 0,6974 \\
25 & 7 & 0,0032 & 1,000 & 1,0032 & 0,7048 \\
26 & 8 & 0,0004 & 1,000 & 1,0004 & 0,7068 \\
27 & 9 & 0,0118 & 1,000 & 1,0118 & 0,6987 \\
28 & 10 & 0,04 & 1,000 & 1,04 & 0,6788 \\
\hline
\end{array}$

These y values have the graph:

[Figure: idealized data]

This graph clearly has two break points. For the sake of argument, we will calculate the R^2 correlation values with the following Excel cell formulas (European style, with semicolons as argument separators):

=INDEX(LINEST(B1:$B$1;A1:$A$1;TRUE;TRUE);3;1)
=INDEX(LINEST(B1:$B$28;A1:$A$28;TRUE;TRUE);3;1)

for all possible non-overlapping combinations of two fitted lines. All the possible pairs of R^2 values have the graph:

[Figure: R^2 values]

The question is which pair of R^2 values should we choose, and how do we generalize to multiple break points as asked in the title? One choice is to pick the combination for which the sum of the R-square correlation is the highest. Plotting this we get the upper blue curve below:

[Figure: sum of R squared and standard deviation of R squared]

The blue curve, the sum of the R-squared values, is the highest in the middle. This is more clearly visible from the table with the value $1,0455$ as the highest value. However it is my opinion that the minimum of the red curve is more accurate. That is, the minimum of the standard deviation of the R^2 values of the fitted regression lines should be the best choice.
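The table can also be reproduced outside Excel. The sketch below is an R translation under the assumption (read off the cell formulas) that for row i, line 1 uses points 1 to i and line 2 uses points i to 28. For a straight-line fit, R^2 equals the squared correlation. Single-point segments are left as NA here, whereas the table shows 1,000 for them. The names `r2`, `r2_line1` and so on are illustrative.

```r
x <- 1:28
y <- c(1:10, 9:1, 2:10)   # the idealized data from the table
n <- length(x)

# R^2 of a straight-line fit to the points in idx (squared correlation);
# left undefined for fewer than two points
r2 <- function(idx) {
  if (length(idx) < 2) return(NA_real_)
  cor(x[idx], y[idx])^2
}

r2_line1 <- sapply(1:n, function(i) r2(1:i))    # line 1: points 1..i
r2_line2 <- sapply(1:n, function(i) r2(i:n))    # line 2: points i..n
r2_sum   <- r2_line1 + r2_line2
r2_sd    <- abs(r2_line1 - r2_line2) / sqrt(2)  # sample sd of two values

tab <- data.frame(x, y, r2_line1, r2_line2, r2_sum, r2_sd)
print(round(tab, 4))
print(which.max(r2_sum))  # row with the highest sum of R^2
print(which.min(r2_sd))   # row with the lowest standard deviation of R^2
```

Since the data are symmetric, rows 14 and 15 tie on both criteria up to rounding, so `which.max` and `which.min` simply report the first of the two.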

Piece wise linear regression - Matlab - multiple break points

— Mats Granvik

1

There is a pretty nice algorithm described in Tomé and Miranda (2004).

The proposed methodology uses a least-squares approach to compute the best continuous set of straight lines that fit a given time series, subject to a number of constraints on the minimum distance between breakpoints and on the minimum trend change at each breakpoint.

The code and a GUI are available in both Fortran and IDL from their website: http://www.dfisica.ubi.pt/~artome/linearstep.html

— arkaia

0

First of all, you have to do it iteratively and under some information criterion such as AIC, AICc, BIC or Cp, because you can always get an "ideal" fit if the number of knots K equals the number of data points N.

First put K = 0: estimate K + 1 regressions (here a single one) and calculate AICc, for instance. Then assume a minimum number of data points per segment, say L = 3 or L = 4.

Put K = 1: start with the L-th data point as the knot and calculate SS or MLE; then, step by step, take the next data point as the knot and calculate SS or MLE, up to the last possible knot at data point N - L. Choose the arrangement with the best fit (SS or MLE) and calculate AICc.

Put K = 2: use all the previous regressions (that is, their SS or MLE), but step by step divide a single segment into all possible parts. Choose the arrangement with the best fit (SS or MLE) and calculate AICc.

If the latest AICc turns out to be greater than the previous one, stop the iterations. This is an optimal solution under the AICc criterion.
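A rough R sketch of this kind of search, under several assumptions that are not part of the description above:

- The data are synthetic and noisy.
- Each segment gets its own separate regression line, with no continuity constraint.
- The AICc parameter count is two coefficients per segment, plus K knot locations, plus one error variance.
- The best arrangement for each K is found by dynamic programming over all admissible splits, a substitute for the step-by-step splitting.

All variable names are illustrative.

```r
set.seed(1)
x <- 1:28
y <- c(1:10, 9:1, 2:10) + rnorm(28, sd = 0.3)  # synthetic noisy zig-zag
n <- length(x)
min_len <- 3   # minimum number of points per segment (the L = 3 choice)
max_k <- 5     # largest number of knots to try

# cost[i, j]: residual sum of squares of a line fitted to points i..j
cost <- matrix(Inf, n, n)
for (i in 1:n) for (j in i:n) {
  if (j - i + 1 >= min_len) {
    idx <- i:j
    cost[i, j] <- sum(resid(lm(y[idx] ~ x[idx]))^2)
  }
}

# best[s, j]: smallest total RSS when points 1..j are split into s segments
best <- matrix(Inf, max_k + 1, n)
best[1, ] <- cost[1, ]
for (s in 2:(max_k + 1)) for (j in 1:n) {
  if (j >= s * min_len) {
    starts <- 2:j   # possible first points of the last segment
    best[s, j] <- min(best[s - 1, starts - 1] + cost[starts, j])
  }
}

# AICc for each number of knots K, using the assumed parameter count
k_vals <- 0:max_k
rss <- best[k_vals + 1, n]
p <- 2 * (k_vals + 1) + k_vals + 1
aicc <- n * log(rss / n) + 2 * p + 2 * p * (p + 1) / (n - p - 1)

# Stop at the first K whose successor has a larger AICc
worse_next <- which(diff(aicc) > 0)
k_chosen <- if (length(worse_next) > 0) k_vals[worse_next[1]] else max_k
print(data.frame(K = k_vals, RSS = rss, AICc = aicc))
print(k_chosen)
```

Whether the knot locations should count as parameters in the penalty is itself a judgment call, so the chosen K should be read with that caveat in mind.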

— Maciek

AIC and BIC can't be used because they penalise for extra parameters, which is clearly not the case here.

— HelloWorld

0

I once came across a program called Joinpoint. On their website they say it fits a joinpoint model where "several different lines are connected together at the 'joinpoints'". And further: "The user supplies the minimum and maximum number of joinpoints. The program starts with the minimum number of joinpoint (e.g. 0 joinpoints, which is a straight line) and tests whether more joinpoints are statistically significant and must be added to the model (up to that maximum number)."

The NCI uses it for trend modelling of cancer rates, maybe it fits your needs as well.

— psj

0

In order to fit a piecewise function to data:

where $a_1, a_2, p_1, q_1, p_2, q_2, p_3, q_3$ are unknown parameters to be approximately computed, there is a very simple method (not iterative, no initial guess, easy to code in any mathematical programming language). The theory is given on page 29 of the paper https://fr.scribd.com/document/380941024/Regression-par-morceaux-Piecewise-Regression-pdf and, from page 30:

For example, with the exact data provided by Mats Granvik, the result is:

Without scattered data, this example is not very significant. Other examples with scattered data are shown in the referenced paper.

— JJacquelin

0

You can use the mcp package if you know the number of change points to infer. It gives you great modeling flexibility and a lot of information about the change points and regression parameters, but at the cost of speed.

The mcp website contains many applied examples, e.g.,

library(mcp)

# Define the model
model = list(
  response ~ 1,  # plateau (int_1)
  ~ 0 + time,    # joined slope (time_2) at cp_1
  ~ 1 + time     # disjoined slope (int_3, time_3) at cp_2
)

# Fit it. The `ex_demo` dataset is included in mcp
fit = mcp(model, data = ex_demo)

Then you can visualize:

plot(fit)

Or summarise:

summary(fit)

Family: gaussian(link = 'identity')
Iterations: 9000 from 3 chains.
Segments:
  1: response ~ 1
  2: response ~ 1 ~ 0 + time
  3: response ~ 1 ~ 1 + time

Population-level parameters:
    name match  sim  mean lower  upper Rhat n.eff
    cp_1    OK 30.0 30.27 23.19 38.760    1   384
    cp_2    OK 70.0 69.78 69.27 70.238    1  5792
   int_1    OK 10.0 10.26  8.82 11.768    1  1480
   int_3    OK  0.0  0.44 -2.49  3.428    1   810
 sigma_1    OK  4.0  4.01  3.43  4.591    1  3852
  time_2    OK  0.5  0.53  0.40  0.662    1   437
  time_3    OK -0.2 -0.22 -0.38 -0.035    1   834

Disclaimer: I am the developer of mcp.

— Jonas Lindeløv

The use of "detect" in the question indicates the number--and even the existence--of changepoints are not known beforehand.

— whuber