Taking the partial derivatives of the sum of squared residuals with respect to the slope (B0) and y-intercept (B1), setting them equal to 0, and solving for B0 and B1 gives us the values of B0 and B1 that minimize the sum of squared residuals. Why is this a valid approach? In ordinary least squares regression, we aim to find the values of $B_0$ and $B_1$ that minimize the sum of squared residuals. To do this, we typically take the partial derivatives of the sum of squared residuals with respect to $B_0$ and $B_1$, set these derivatives to zero, and solve the resulting equations to find the optimal $B_0$ and $B_1$. But why? If we're only taking partial derivatives, couldn't we accidentally optimize a B0 (e.g.) that doesn't "play well" with B1? Like, for that B0, B1 is not optimal. I would expect that we'd have to consider both B1 and B0 at the same time in order to optimize correctly.
#Least squares: why take partial derivatives with respect to the slope (B0) and y-intercept (B1) to minimize?
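To make the question concrete, here is a minimal sketch of the procedure being asked about. It follows the thread's labeling (B0 is the slope, B1 is the y-intercept); the data points and the function names `ssr`, `gradient`, and `fit` are made up for illustration. The point it tries to show is that the two partial derivatives are not solved one at a time: setting both to zero gives a 2x2 linear system that is solved for B0 and B1 together.

```python
def ssr(b0, b1, xs, ys):
    """Sum of squared residuals for the line y = b0*x + b1."""
    return sum((y - (b0 * x + b1)) ** 2 for x, y in zip(xs, ys))

def gradient(b0, b1, xs, ys):
    """Both partial derivatives of the SSR, evaluated at (b0, b1)."""
    residuals = [y - (b0 * x + b1) for x, y in zip(xs, ys)]
    d_b0 = -2 * sum(x * r for x, r in zip(xs, residuals))
    d_b1 = -2 * sum(residuals)
    return d_b0, d_b1

def fit(xs, ys):
    """Solve d_b0 = 0 and d_b1 = 0 simultaneously (a 2x2 linear system)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx  # zero only if all x values are identical
    b0 = (n * sxy - sx * sy) / det
    b1 = (sy - b0 * sx) / n
    return b0, b1

# Hypothetical data, chosen only for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

b0, b1 = fit(xs, ys)
g0, g1 = gradient(b0, b1, xs, ys)
print(b0, b1)   # fitted slope and intercept
print(g0, g1)   # both partials are (numerically) zero at the same point
```

Nudging either parameter away from the fitted pair, for example `ssr(b0 + 0.01, b1, xs, ys)`, should give a larger value than `ssr(b0, b1, xs, ys)`.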
Smonk
I expect the answer has something to do with the fact that a linear combination of variables, like in a linear regression, is a convex function, so these types of interactions will not exist
Like, clearly there is only one solution. It's convex, a quadratic, and it's also just obvious. But I don't see how the method would actually ensure that we are getting this one solution
I guess there's not a way to create an optimal B0 which allow for an optimal B1
Imagine that a valley leads into a deeper valley. The function is still monotonic everywhere, but you'd have two solutions for a partial derivative: the flat bottom of the first valley, and the flat bottom of the valley that it leads into
There is one global minimum in that case but two solutions for the partial derivative
If $f:U\subseteq\mathbb{R}^n\to \mathbb{R}$ and if $a\in U$ is a relative extremum, then $(\nabla f)(a)=0$
so if a is a max/min for f, then f's gradient at a is 0
you then check the Hessian to see the nature
Omegabet_
The converse need not be true, for example saddle points give 0 gradient, hence why you check the Hessian blahblahblah
you set the gradient to be 0 to find the possible places for extrema, then check them with 2nd derivative test/the analog in multivar
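For the least-squares case specifically, the second-derivative check can be done once and for all, because the Hessian of the sum of squared residuals does not depend on B0 or B1. The sketch below assumes the line y = B0*x + B1 (the thread's labeling) and uses made-up x values; `ssr_hessian` and `is_positive_definite_2x2` are illustrative names, not from any library.

```python
def ssr_hessian(xs):
    """Hessian of sum((y - (b0*x + b1))**2) with respect to (b0, b1).

    It depends only on the x values, not on b0, b1, or the y values.
    """
    n = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    return [[2 * sxx, 2 * sx],
            [2 * sx, 2 * n]]

def is_positive_definite_2x2(h):
    """Sylvester's criterion for a symmetric 2x2 matrix."""
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    return h[0][0] > 0 and det > 0

# Hypothetical x values with some spread: the Hessian is positive definite,
# so the single zero-gradient point is a strict global minimum.
print(is_positive_definite_2x2(ssr_hessian([0.0, 1.0, 2.0, 3.0, 4.0])))

# Degenerate case: every x is the same, so slope and intercept cannot be
# told apart. The determinant is zero and the minimizer is not unique.
print(is_positive_definite_2x2(ssr_hessian([2.0, 2.0, 2.0])))
```

The determinant works out to 4*(n*sum(x^2) - (sum(x))^2), which is never negative, so the Hessian is always at least positive semidefinite and no saddle point is possible.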
I don't know what U is. I do know what the gradient of f is but not well. lol.
U is the domain of f
usually open but you can just take the interior if it's not open 
right but for the case of a linear regression, we have a sum of terms y = x1a + x2b ... so I don't think you even need to check
It should always be convex
just seeing an equation doesn't tell you anything about the structure of the graph
but you asked what the gradient being 0 had to do with it, so that's what it has to do with it
gives you the candidates for extrema
yeah but it's known that linear sums are convex like for linear regression. so assuming/given this, we wouldn't have to check the Hessian / do anything with gradients, right?
there might be shortcuts, I've only ever done least squares via purely LinAlg notions w/ the normal equations
Any function $f(\mathbf{x})=\mathbf{a}^T\mathbf{x}+b$ is convex, where $\mathbf{a}, \mathbf{x}\in\mathbb{R}^n$, $b\in\mathbb{R}$
but you're asking about optimization in multivar calc at the end of it, so that's how you do optimization in multivar
Proving the Convexity of Affine Functions $f(\mathbf{x}) = \mathbf{a}^T \mathbf{x} + b$
Definitions:
- Affine Function: Essentially a multi-dimensional linear function.
- Convex Function: A function with a "bowl-like" shape, where every local minimum is also a global minimum. In graphical terms, the function surface will always lie on or below the line segment connecting any two points on it.
Approach:
To prove that $f(\mathbf{x}) = \mathbf{a}^T \mathbf{x} + b$ is convex, we use the following definition of convexity: for any two points $\mathbf{x}_1$ and $\mathbf{x}_2$, and any $\lambda$ between 0 and 1,
$$
f(\lambda \mathbf{x}_1 + (1-\lambda) \mathbf{x}_2) \leq \lambda f(\mathbf{x}_1) + (1-\lambda) f(\mathbf{x}_2)
$$
Argument:
- Consider two points $\mathbf{x}_1$ and $\mathbf{x}_2$, and let $\lambda$ be a weighting factor that varies between 0 and 1. The point $\lambda \mathbf{x}_1 + (1-\lambda) \mathbf{x}_2$ can be thought of as a weighted average and will lie on the line segment connecting $\mathbf{x}_1$ and $\mathbf{x}_2$.
- Evaluate $f(\lambda \mathbf{x}_1 + (1-\lambda) \mathbf{x}_2)$, which represents the value of the function at this weighted-average point.
- Also evaluate $\lambda f(\mathbf{x}_1) + (1-\lambda) f(\mathbf{x}_2)$, which is the value the function would take if it were linear between $\mathbf{x}_1$ and $\mathbf{x}_2$.
Special Cases:
- Flat Function: If the function is flat, then both expressions will be equal. This is a special case when the function is not only convex but also affine.
- Bowl-shaped Function: If the function is convex, the value of $f(\lambda \mathbf{x}_1 + (1-\lambda) \mathbf{x}_2)$ will be less than or equal to $\lambda f(\mathbf{x}_1) + (1-\lambda) f(\mathbf{x}_2)$.
- Function with Local Extrema: If the function has local minima that are not global minima, then it is not convex.
Algebraic Verification:
The function $f(\mathbf{x}) = \mathbf{a}^T \mathbf{x} + b$ can be verified to be convex algebraically by substituting into the convexity definition:
$$
\begin{aligned}
f(\lambda \mathbf{x}_1 + (1-\lambda) \mathbf{x}_2) &= \mathbf{a}^T (\lambda \mathbf{x}_1 + (1-\lambda) \mathbf{x}_2) + b \\
&= \lambda \mathbf{a}^T \mathbf{x}_1 + (1-\lambda) \mathbf{a}^T \mathbf{x}_2 + b \\
&= \lambda (\mathbf{a}^T \mathbf{x}_1 + b) + (1-\lambda) (\mathbf{a}^T \mathbf{x}_2 + b) \\
&= \lambda f(\mathbf{x}_1) + (1-\lambda) f(\mathbf{x}_2)
\end{aligned}
$$
Thus, we find that $f(\lambda \mathbf{x}_1 + (1-\lambda) \mathbf{x}_2) = \lambda f(\mathbf{x}_1) + (1-\lambda) f(\mathbf{x}_2)$ for all $\lambda$ between 0 and 1, and for all possible $\mathbf{x}_1$ and $\mathbf{x}_2$. Since equality satisfies the $\leq$ in the definition, $f$ is convex.
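A quick numerical sanity check of the above, as a sketch only (random sampling is not a proof). It also runs the same check on the sum of squared residuals, since that is the function least squares actually minimizes over (B0, B1); the affine result covers the model, not the loss. The coefficients, data, and helper names (`affine`, `ssr`, `convexity_gap`) are assumptions made up for this illustration, with B0 as the slope as in the thread.

```python
import random

def affine(x, a, b):
    """f(x) = a^T x + b for plain Python lists."""
    return sum(ai * xi for ai, xi in zip(a, x)) + b

def ssr(params, xs, ys):
    """Sum of squared residuals as a function of params = [b0, b1]."""
    b0, b1 = params
    return sum((y - (b0 * x + b1)) ** 2 for x, y in zip(xs, ys))

def convexity_gap(f, p, q, lam):
    """lam*f(p) + (1-lam)*f(q) - f(lam*p + (1-lam)*q); >= 0 if f is convex."""
    blend = [lam * pi + (1 - lam) * qi for pi, qi in zip(p, q)]
    return lam * f(p) + (1 - lam) * f(q) - f(blend)

rng = random.Random(0)          # fixed seed so the run is repeatable
a, b = [2.0, -3.0], 0.5         # hypothetical affine coefficients
xs = [0.0, 1.0, 2.0, 3.0, 4.0]  # hypothetical data
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

affine_gaps, ssr_gaps = [], []
for _ in range(1000):
    p = [rng.uniform(-5, 5), rng.uniform(-5, 5)]
    q = [rng.uniform(-5, 5), rng.uniform(-5, 5)]
    lam = rng.random()
    affine_gaps.append(convexity_gap(lambda v: affine(v, a, b), p, q, lam))
    ssr_gaps.append(convexity_gap(lambda v: ssr(v, xs, ys), p, q, lam))

print(max(abs(g) for g in affine_gaps))  # about zero: equality for affine f
print(min(ssr_gaps))                     # never meaningfully below zero
```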
lol
so no need to check hessian or gradient
im just trying to learn lol
well it looks like you've learned then
I know for a fact we can accomplish it with just taking partial derivatives. I don't know why this is the case
cause that probably shows for such functions, extrema are already minima
idk, I didnt read it
I dont read blunderbusses of copy paste 
hence finding extrema is equivalent to finding minima.
or if $\Gamma_f$ is a convex set, then $a\in U$ such that $(\nabla f)(a)=0$ means $a$ isnt a saddle point
Omegabet_
same difference
Convex functions don't have saddle points
exactly
Can a convex function have a region where the height (z) stays constant for some distance along x1, with that flat stretch being the minimum over x2 in that region? Assume the region is not the global minimum
I typed that out myself weeks ago without looking anything up or copy-pasting. You need to chill
idk what gamma is but yeah finding extrema is the same as finding minima
Yeah they can have that. They can be constant over a line segment or even a multidimensional region as long as that region is a convex set and everything around it is higher
And that can be true even if that area isn't the global minimum? Like, that valley can eventually drop into a larger pit/valley?
Because if that's true, then taking a partial derivative wrt one variable might get me two solutions if I'm interpreting a partial derivative correctly
It has to be a local minimum, no?
and a global one too
Any point on a convex function where the gradient is 0 is a global minimum
There is no global minimum iff there is no such point
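A small one-variable sketch of the flat-bottom case discussed above. The function f(x) = max(|x| - 1, 0)^2 is a hypothetical example chosen because it is convex, differentiable, and constant on the whole interval [-1, 1]. Every point where the derivative is zero sits at the same height, which is the global minimum, so a flat stretch never leads down into a deeper valley.

```python
def f(x):
    """Convex, with a flat bottom on [-1, 1]."""
    return max(abs(x) - 1.0, 0.0) ** 2

def df(x):
    """Derivative of f; zero everywhere on the flat bottom."""
    excess = max(abs(x) - 1.0, 0.0)
    sign = 1.0 if x > 0 else -1.0
    return 2.0 * excess * sign

# Grid from -3 to 3 in steps of 0.25 (exactly representable in floating point).
grid = [k * 0.25 for k in range(-12, 13)]

stationary = [x for x in grid if df(x) == 0.0]
global_min = min(f(x) for x in grid)

print(stationary)                                   # all grid points in [-1, 1]
print(all(f(x) == global_min for x in stationary))  # each one is a global minimum
```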
wrt the minimum of f(x1,x2), yes
But is that true when considering f(x1) alone?
Because that's what it seems that the partial derivs are doing
?
f is a function of 2 variables, f(x_1) is nonsense
ok bro
I'm illustrating what a partial deriv is doing
Taking a partial deriv considers change in the function wrt a single var, e.g. x1
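One way to picture this, sketched under the thread's labeling (B0 slope, B1 intercept) with made-up data and illustrative function names: setting a single partial derivative to zero only gives the best B0 for whatever B1 is currently held fixed, so on its own it describes a whole family of candidates. The least-squares answer is the one pair where both conditions hold at once. Alternating between the two single-variable conditions should converge to that same pair here, because the sum of squared residuals is a convex quadratic.

```python
def best_b0_given_b1(b1, xs, ys):
    """Solve dSSR/db0 = 0 for b0 while b1 is held fixed."""
    return sum(x * (y - b1) for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def best_b1_given_b0(b0, xs, ys):
    """Solve dSSR/db1 = 0 for b1 while b0 is held fixed."""
    return sum(y - b0 * x for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data, chosen only for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 7.1, 8.8]

# One partial alone: optimal for b1 = 0, but not the joint optimum.
b0_alone = best_b0_given_b1(0.0, xs, ys)
print(b0_alone)

# Alternate the two conditions until both hold together.
b0, b1 = 0.0, 0.0
for _ in range(200):
    b0 = best_b0_given_b1(b1, xs, ys)
    b1 = best_b1_given_b0(b0, xs, ys)
print(b0, b1)  # the pair that satisfies both partial-derivative equations
```

Solving the two equations as a simultaneous linear system reaches the same pair directly, without iterating.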