#Least squares: why take partial derivatives with respect to the slope (B0) and y-intercept (B1) to minimize?


void cove
#

Taking the partial derivatives of the sum of squared residuals with respect to the slope (B0) and y-intercept (B1), setting them equal to 0, and solving for B0 and B1 gives the values of B0 and B1 that minimize the sum of squared residuals. Why is this valid?

If we're only taking partial derivatives, couldn't we accidentally optimize a B0 (e.g.) that doesn't "play well" with B1? Like, for that B0, B1 is not optimal. I would expect that we'd have to consider both B1 and B0 at the same time in order to correctly optimize.
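
One way to see that both coefficients are handled together, rather than one at a time, is to write the two conditions out as a system. This is only a sketch, in notation that is not from the thread: $S$ is the sum of squared residuals, $(x_i, y_i)$ for $i = 1, \dots, n$ are the data points, and $\bar{x}$, $\bar{y}$ are the sample means. B0 is the slope and B1 the intercept, as in the question, and the $x_i$ are assumed not all equal.

$$
S(B_0, B_1) = \sum_{i=1}^{n} (y_i - B_0 x_i - B_1)^2
$$

$$
\begin{aligned}
\frac{\partial S}{\partial B_0} &= -2 \sum_{i=1}^{n} x_i \, (y_i - B_0 x_i - B_1) = 0 \\
\frac{\partial S}{\partial B_1} &= -2 \sum_{i=1}^{n} (y_i - B_0 x_i - B_1) = 0
\end{aligned}
$$

The second equation alone does not pin down $B_1$; it only says which intercept is best for a given slope, $B_1 = \bar{y} - B_0 \bar{x}$. Substituting that into the first equation gives

$$
B_0 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad B_1 = \bar{y} - B_0 \bar{x}
$$

Because both equations have to hold at the same point, the solution is a pair $(B_0, B_1)$ where neither coefficient can be improved, not a $B_0$ chosen in isolation.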

void cove
#

I expect the answer has something to do with convexity: a linear combination of variables, as in linear regression, is a convex function, so these kinds of interactions can't exist

#

Like, clearly there is only one solution. It's convex, a quadratic, and it's also just obvious. But I don't see how the method would actually ensure that we are getting this one solution

#

I guess there's no way to get an optimal B0 that doesn't also allow for an optimal B1

#

Imagine that a valley leads into a deeper valley. The function is still monotonic everywhere, but you'd have two solutions for a partial derivative: the flat bottom of the first valley, and the flat bottom of the valley that it leads into

#

There is one global minimum in that case but two solutions for the partial derivative

sleek fiber
#

If $f:U\subseteq\mathbb{R}^n\to \mathbb{R}$ and if $a\in U$ is a relative extremum, then $(\nabla f)(a)=0$

#

so if a is a max/min for f, then f's gradient at a is 0

#

you then check the Hessian to see the nature

sleek fiber
#

you set the gradient to be 0 to find the possible places for extrema, then check them with 2nd derivative test/the analog in multivar
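
For the least-squares case, the second-derivative check can be written out explicitly. The following is a sketch in assumed notation that is not from the thread: $S(B_0, B_1) = \sum_{i=1}^{n} (y_i - B_0 x_i - B_1)^2$ with B0 the slope and B1 the intercept, and $\bar{x}$ the mean of the $x_i$.

$$
H = \begin{pmatrix}
\dfrac{\partial^2 S}{\partial B_0^2} & \dfrac{\partial^2 S}{\partial B_0 \, \partial B_1} \\
\dfrac{\partial^2 S}{\partial B_1 \, \partial B_0} & \dfrac{\partial^2 S}{\partial B_1^2}
\end{pmatrix}
= 2 \begin{pmatrix}
\sum_i x_i^2 & \sum_i x_i \\
\sum_i x_i & n
\end{pmatrix}
$$

$$
\det H = 4 \Big( n \sum_i x_i^2 - \big( \textstyle\sum_i x_i \big)^2 \Big) = 4 n \sum_i (x_i - \bar{x})^2
$$

The Hessian does not depend on $B_0$ or $B_1$, its top-left entry is positive, and its determinant is positive unless all the $x_i$ are equal. Under that condition $H$ is positive definite everywhere, so $S$ is strictly convex and the single point where the gradient vanishes is the global minimum.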

sleek fiber
#

U is the domain of f

#

usually open but you can just take the interior if it's not open blobshrug

void cove
#

It should always be convex

sleek fiber
#

just seeing an equation doesnt mean anything about the structure of the graph

#

but you asked what the gradient being 0 had to do with it, so that's what it has to do with it

#

gives you the candidates for extrema

void cove
#

yeah but it's known that linear sums are convex like for linear regression. so assuming/given this, we wouldn't have to check the Hessian / do anything with gradients, right?

sleek fiber
#

there might be shortcuts, I've only ever done least squares via purely LinAlg notions w/ the normal equations
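
For a single predictor, the normal equations mentioned here are the same two equations you get by setting both partial derivatives to zero. A minimal sketch of solving them, assuming made-up toy data and hypothetical helper names (`fit_line`, `gradient`) that do not come from the thread:

```python
# Hypothetical toy data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

def fit_line(xs, ys):
    """Solve the 2x2 normal equations for (slope, intercept) by Cramer's rule."""
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Both partial derivatives set to zero at the same time:
    #   slope * sxx + intercept * sx = sxy
    #   slope * sx  + intercept * n  = sy
    det = sxx * n - sx * sx  # zero only when all x values are equal
    if det == 0:
        raise ValueError("all x values are equal; the slope is not determined")
    slope = (sxy * n - sx * sy) / det
    intercept = (sxx * sy - sx * sxy) / det
    return slope, intercept

def gradient(slope, intercept, xs, ys):
    """Partial derivatives of the sum of squared residuals at (slope, intercept)."""
    d_slope = sum(-2 * x * (y - slope * x - intercept) for x, y in zip(xs, ys))
    d_intercept = sum(-2 * (y - slope * x - intercept) for x, y in zip(xs, ys))
    return d_slope, d_intercept

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
print(gradient(slope, intercept, xs, ys))  # both entries vanish up to rounding
```

The two unknowns are solved for as one linear system, which is why the result is a pair that is optimal jointly rather than one coefficient at a time.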

void cove
#

Any function $f(\mathbf{x}) = \mathbf{a}^T \mathbf{x} + b$ is convex, where $\mathbf{a}, \mathbf{x} \in \mathbb{R}^n$, $b \in \mathbb{R}$

sleek fiber
#

but you're asking about optimization in multivar calc at the end of it, so that's how you do optimization in multivar

void cove
#

Proving the Convexity of Affine Functions $f(\mathbf{x}) = \mathbf{a}^T \mathbf{x} + b$

Definitions:

  • Affine Function: Essentially a multi-dimensional linear function plus a constant offset.
  • Convex Function: A function with a "bowl-like" shape, having no local minima other than global ones. In graphical terms, the function surface always lies on or below the line segment connecting any two points on it.

Approach:

To prove that $f(\mathbf{x}) = \mathbf{a}^T \mathbf{x} + b$ is convex, we use the following definition of convexity: for any two points $\mathbf{x}_1$ and $\mathbf{x}_2$, and any $\lambda$ between 0 and 1,

$$
f(\lambda \mathbf{x}_1 + (1-\lambda) \mathbf{x}_2) \leq \lambda f(\mathbf{x}_1) + (1-\lambda) f(\mathbf{x}_2)
$$

void cove
#

Argument:

  1. Consider two points $\mathbf{x}_1$ and $\mathbf{x}_2$, and let $\lambda$ be a weighting factor that varies between 0 and 1. The point $\lambda \mathbf{x}_1 + (1-\lambda) \mathbf{x}_2$ can be thought of as a weighted average and lies on the line segment connecting $\mathbf{x}_1$ and $\mathbf{x}_2$.

  2. Evaluate $f(\lambda \mathbf{x}_1 + (1-\lambda) \mathbf{x}_2)$, which is the value of the function at this weighted-average point.

  3. Also evaluate $\lambda f(\mathbf{x}_1) + (1-\lambda) f(\mathbf{x}_2)$, which is the value the function would take if it were a straight line between $\mathbf{x}_1$ and $\mathbf{x}_2$.

Special Cases:

  • Flat Function: If the function is a straight line (or plane) between the two points, both expressions are equal. This is the special case where the function is not only convex but also affine.

  • Bowl-shaped Function: If the function is convex, the value of $f(\lambda \mathbf{x}_1 + (1-\lambda) \mathbf{x}_2)$ will be less than or equal to $\lambda f(\mathbf{x}_1) + (1-\lambda) f(\mathbf{x}_2)$.

  • Function with Local Extrema: If the function has a local minimum that is not a global minimum, then it is not convex.

Algebraic Verification:

The function $f(\mathbf{x}) = \mathbf{a}^T \mathbf{x} + b$ can be verified to be convex algebraically by substituting into the convexity definition and using $b = \lambda b + (1-\lambda) b$:

$$
\begin{aligned}
f(\lambda \mathbf{x}_1 + (1-\lambda) \mathbf{x}_2) &= \mathbf{a}^T (\lambda \mathbf{x}_1 + (1-\lambda) \mathbf{x}_2) + b \\
&= \lambda \mathbf{a}^T \mathbf{x}_1 + (1-\lambda) \mathbf{a}^T \mathbf{x}_2 + b \\
&= \lambda (\mathbf{a}^T \mathbf{x}_1 + b) + (1-\lambda) (\mathbf{a}^T \mathbf{x}_2 + b) \\
&= \lambda f(\mathbf{x}_1) + (1-\lambda) f(\mathbf{x}_2)
\end{aligned}
$$

Thus, we find that $f(\lambda \mathbf{x}_1 + (1-\lambda) \mathbf{x}_2) = \lambda f(\mathbf{x}_1) + (1-\lambda) f(\mathbf{x}_2)$ for all $\lambda$ between 0 and 1, and for all possible $\mathbf{x}_1$ and $\mathbf{x}_2$, so the convexity inequality holds with equality.
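
The argument above covers the affine model itself. The quantity being minimized in least squares is the sum of squared residuals, where each residual is affine in the slope and intercept; squaring an affine function gives a convex function, and a sum of convex functions is convex. As a rough numerical sanity check of that claim (not a proof), the sketch below tests the convexity inequality at random points. The toy data and the `ssr` helper are assumptions made for illustration.

```python
import random

# Hypothetical toy data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

def ssr(slope, intercept):
    """Sum of squared residuals for the line y = slope * x + intercept."""
    return sum((y - slope * x - intercept) ** 2 for x, y in zip(xs, ys))

random.seed(0)  # fixed seed so the run is repeatable
violations = 0
for _ in range(10_000):
    # Two random parameter pairs (slope, intercept) and a random weight.
    p = (random.uniform(-10, 10), random.uniform(-10, 10))
    q = (random.uniform(-10, 10), random.uniform(-10, 10))
    lam = random.random()
    mix = (lam * p[0] + (1 - lam) * q[0], lam * p[1] + (1 - lam) * q[1])
    lhs = ssr(*mix)                                # value at the weighted average
    rhs = lam * ssr(*p) + (1 - lam) * ssr(*q)      # weighted average of the values
    # Small relative tolerance to absorb floating-point rounding.
    if lhs > rhs + 1e-9 * max(1.0, abs(rhs)):
        violations += 1

print("violations of the convexity inequality:", violations)
```

A count of zero is consistent with convexity but only over the sampled points; the general statement comes from the composition argument in the lead-in.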

#

lol

sleek fiber
#

ok

#

and?

void cove
#

so no need to check hessian or gradient

sleek fiber
#

then your question is answered ig

#

want me to close the post or shall you?

void cove
#

im just trying to learn lol

sleek fiber
#

well it looks like you've learned then

void cove
#

I know for a fact we can accomplish it with just taking partial derivatives. I don't know why this is the case

sleek fiber
#

cause that probably shows that for such functions, critical points are already minima

#

idk, I didnt read it

#

I dont read blunderbusses of copy paste blobshrug

#

hence finding critical points is equivalent to finding minima.

#

or if the epigraph of $f$ is a convex set (i.e. $f$ is convex), then $a\in U$ such that $(\nabla f)(a)=0$ means $a$ isn't a saddle point

sleek fiber
#

same difference

pliant ridge
#

Convex functions don't have saddle points

void cove
#

Can a convex function have a region where the height (z) temporarily stays constant for some distance along x1, with that flat stretch also being the minimum over x2 in that region? Assume the region is not the global minimum

void cove
#

And that can be true even if that area isn't the global minimum? Like, that valley can eventually drop into a larger pit/valley?

#

Because if that's true, then taking a partial derivative wrt one variable might get me two solutions if I'm interpreting a partial derivative correctly

pliant ridge
#

It has to be a local minimum, no?

#

and a global one too

#

Any point on a convex function where the gradient is 0 is a global minimum (and if the function is strictly convex, there is at most one such point)

#

There is no global minimum iff there is no such point

void cove
#

But is that true when considering f(x1) alone?

#

Because that's what it seems that the partial derivs are doing

pliant ridge
#

?

sleek fiber
#

f is a function of 2 variables, f(x_1) is nonsense

void cove
#

I'm illustrating what a partial deriv is doing

#

Taking a partial deriv considers change in the function wrt a single var, e.g. x1
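
To make that concrete for least squares: setting only one partial derivative to zero gives the best value of that coefficient for each fixed value of the other, which is a whole line of candidates rather than a single answer. Only the point where both conditions hold at once is the minimizer. The sketch below illustrates this with alternating one-variable updates (coordinate descent), which is a swapped-in illustration and not the method discussed in the thread; the toy data and helper names are assumptions.

```python
# Hypothetical toy data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

def best_intercept_for(slope):
    """Intercept that zeroes the intercept partial, with the slope held fixed."""
    return (sum(ys) - slope * sum(xs)) / len(xs)

def best_slope_for(intercept):
    """Slope that zeroes the slope partial, with the intercept held fixed."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return (sxy - intercept * sum(xs)) / sxx

# One partial derivative alone: every slope has its own "best" intercept.
for trial_slope in (0.0, 1.0, 3.0):
    print(trial_slope, best_intercept_for(trial_slope))

# Alternating the two one-variable updates settles on the single point
# where both partial derivatives are zero together.
slope, intercept = 0.0, 0.0
for _ in range(200):
    intercept = best_intercept_for(slope)
    slope = best_slope_for(intercept)

print(slope, intercept)
```

Each one-variable step is a partial-derivative condition on its own, and none of the intermediate pairs is the answer; the loop only stops moving at the pair that satisfies both conditions simultaneously, which matches the closed-form least-squares solution for this data.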