#numerical-analysis
1 messages · Page 21 of 1
there's an analytical solution? i'm looking at the wikipedia page, and i can't imagine it
I think the idea is to find the best zipfian distribution to model the data then see if it's good
Deciding yes/no won't be a quantitative thing though
Yeah, and i cannot imagine doing logarithmic derivative of this thing and setting to 0 is analytically solvable
this sounds like a fun problem to explore whether this is convex, or a region near the minimum can be found that is convex, etc. 🙂
You can make it into a convex optimization problem without a doubt
Regression can always be made into a convex problem, no?
i think there is a result like regression with exponential families is always convex or something
Well
but not in general
Specific types
Like if the norm is convex
Because then you're just minimizing the sum of convex functions right?
A zipfian distribution is probably convex
But I don't think it belongs to the exponential family
Wait you can just do gradient descent on the sum
This is definitely in scipy
Stochastic gradient descent

and if scipy does some weird ass way to optimize it then you could try coding it yourself 
I have the person formerly known as hitbox noted as "cannot code"
I wonder why
Rip
Ummm
Probably something from last year
Or something
I don't remember
I can try to dig it up if you want
i know hitbox as the guy who studies probability
This is true
hitbox does do probability
Oh
Maybe it was because of your sage difficulties
Oh in January
A Dembo
Math 136
Rip

I think there are some results like MLE is fucked if it happens on the boundary
Or maybe it was more like MLE is provably good only when it happens in the interior, something like that
But this is sort of more mathematical rather than statistical
The result I recall is about MLE, it probably exists in some mathematical text like shao's mathematical statistics 😜

hardwiring is always bad:(
or maybe i can just handwave it away
it's probably fine
Famous last words

Unless you’re doing ML, as ML researchers aren’t held to any standards of statistical rigor
Everyone knows that the best way to get state of the art results is by adding more layers to your neural network
Imagine the shitshow when it's revealed that all these ML advances were actually bullshit
anyone know of application of nonorthogonal projections

now is better, but sth still wrong xD

kekw
its either nonsmooth like here
or red area everywhere then sudden drop on the rhs to blue color...
thats what my prof told me
to apply neumann everywhere but rhs
rhs must be dirichlet for uniqueness of the solution
hey, I don't understand this mini-proof which is about the Polyak step size for convex optimisation. x* is the optimum, the function is beta-smooth. In particular, I don't get the last two steps. It's from this paper (last page): https://arxiv.org/pdf/1905.00313.pdf
eta is the polyak stepsize: $\frac{h_t}{norm(\nabla_t)_2^2} = \frac{f(x_t)-f(x^*)}{norm(\nabla_t)2^2}$ and the update step is simply: $x{t+1} = x_t - \eta_t \nabla_t$ . if I plug this in the third line, I get something similar than the second last line. And I'm clueless how they derived the last line
lyinch
Maybe I should also clarify what smooth means:
$f(y) \leq f(x) + \nabla f(x)^T(y-x) + \frac{L}{2} norm{x-y}_2^2$ . It's a quadratic bound on the growth of a function (and not the usual definition)
lyinch
Are there bounds on eta?
I wonder if the last line is eg optimizing in eta or something
not that I know of, but I haven't read the full paper. I'm only looking for one specific thing for my own proof... And for the polyak stepsize that we saw in class there are no bounds. It's just this fraction that I wrote
however, it depends on the gradient of f(x_t) . And this gradient is limited by the convexity and the smoothness, so maybe that's a way to derive some bounds
Ah I think if is
If eta equals that fraction then it is bigger than 1/beta, by ht >= blah at the top of the image you posted
So if eta is bounded above by something reasonable then the minimum is achieved at 1/beta
Which gives the last line
oh... I totally overlooked this condition! I need a second to write this down and try it out
thank you, I got it! This seems to work out nicely
now I hope that I can use that result in my own exercise 🙂
I don't get it how they derive the condition $h_t \geq \frac{1}{\beta} norm(\nabla f(x_t))_2^2$ which is the prerequisite for case 3 . That's supposedly a general result for a smooth convex function, and not specific to the Polyak step size
lyinch
id like to modify this function so it doesnt explicitly compute Q and saves more time in general, how do i do that? This is QR decomposition using householder
You want to do a QR decomposition without computing Q?
im guess i just want a more efficient way to do this, right now im applying the householder H to M = [A I] to get the result (Hn...H1)M = [R Q^T] but for instance if im solving a least square problem Ax = b then really i just need to do back substitution on Rx = Q^Tc so is there a way to not compute explicitly the Q or shorten the process somehow to solve that system
the goal is to use QR decomp in least square
so i got this block that uses the QR decomp
ignore the first 2 block of code really they are to identify errors
Wait so you have a least squares problem min norm(Ax-b)
And you do QR on A to get A=QR
So you know that Rx=Q^Tc
So you still explicitly need Q?
Am I misunderstanding
Oh
So instead of performing Q=H_1H_2...H_n
You just store what they are
So when you need to compute Qv
You just do all the matrix multiplies then
Well the idea is that if you don't care about Q you save some time during the QR decomp
But then you need to do those computations once you compute Qv
So it saves time if you never compute Qv
Which is very ?????
hm
so it do look like these all must be computed
and theres no large place where i can optimize it significantly
thx

very shocking
whzup
https://youtu.be/K8NCa93fGiw
@wide spear @warm otter
Solving Navier-Stokes equation until convergence
Left inlet u = 1; v = 0;
Right outlet du/dn = dv/dn = 0
Pressure boundaries Neumann everywhere but the right p = 0
Oh my
PERFECTOOO
the prressure seems to drop for some reason in the middle
but the solution is DAMN SMOOTH
Nice
❤️
I'm stuck on this problem currently.
What did you try
Got it with some help wasn’t sure which channel this problem went into
can someone give me a hand with proximals and the jacobi method?
i'm trying to figure out one line in a paper where they give a result without saying anything other than they used those two concepts
specifically those scalings by the squared column norms of the matrix A
nvm i got it
Anyone ever seen a paper that defines proximal operators on piecewise functions that are differentiable at the nodes?
I think this question belong here? it is about kaczmarz method for least square which is not common but wonder if someone can help. My question is quite simple
What is the question
I know it can be used to solve Ax = b if A is overdetermined where there is a solution x, but how can it be translated to least square directly? like if there is no solution x to Ax = b. Someone said to just do x* = arg min norm(Ax - Ax* + Ax* - b) and then solving Ax* = b using kaczmarz. How does that produce the least square?
for now I can only see this as order n^2 way of approximating linear system
We present a randomized iterative algorithm that exponentially converges in
expectation to the minimum Euclidean norm least squares solution of a given
linear system of equations. The expected...
This seems to be what you want
There's also the dumb idea
Where you just solve the normal equations using Kaczmarz
But that probably isn't what you want
oh this paper has what I am looking for
thank you!
it seems regular kaczmarz has no guarantee to converge to least square as expected
find the interpolating polynomial of this function through the 5 Chebyshev nodes.
[6:20 PM]would i plug in the cosodd#*pi/2n nodes into my original f(x) for the x?
if i had 1/(1+25x^2) would I plug in the cos(stuff) into the x
to get the y for the (x,y)
i think im overthinking it too hard
What
To do polynomial interpolation
You calculate the function at some points
In this case the Chebyshev nodes
Then you do polynomial interpolation with those points
So you should get 5 (x,y) pairs
And then you should get a degree 4 polynomial out of the interpolation
yea
so i plug in the cos(stuff) into the original f(x) we're given which is 1/(1+25x^2)
Ok so what function are you evaluating
f(x)=1/(1+25x^2)
Ok
What are the 5 chebyshev nodes
That's fine
but i plugged those nodes into f(x) & did the process so ig i did it right just big numbers r scary
Ok
Just wanted to add on that I found another optimizing method

Each householder can be applied to A to produce R while avoiding matrix-matrix multiplication, and instead you can iterate over columns c_i of A by c_i - 2(u^T*c_i)u --> c_i
Notice each time you do this it is ≤4*n flops
Because each hh is $H = I - uu^\top$, and so you can choose instead of naively do $HA$ but instead apply to each column by $Hc_i = c_i - (u^\top c_i)u$
eonian
Oh nice
Well
Yes
The householder matrix is low rank
And doesn't have a lot of information
I implemented one in python that uses both of this just now, I was interested and want to practice
The overall is O(n^2m) for solving the minimization
The latter, the former would be crazy
Yes
No , I think anticipation has it above
henlo, my professor left as an exercise to my class to estimate a value for pi using the monte carlo method (drawing a unit circle inside the unit square, distributing n points inside the square and counting those who fall inside the circle)
the catch is that we were also asked to provide a thought process for choosing n such that our approximation is accurate to 0.05%
- what does it even mean to be accurate by 0.05%?
- any tips on how to choose n? of course, without using the try and error approach
It means that your estimated value of pi is within 0.05% of the true value of pi
You should have some theorems giving you convergence bounds
well talking about precision in percentage is still a bit confusing but i'll get the idea eventually
i checked the reference textbooks for this class and couldnt find anything useful on convergence
do you know any references (links, books etc) i could use?
wow i think i see what u mean
thx! this is pretty good
Anyone know where to find a rigorous proof that l1 minimization enforces sparsity?
Maybe this is really just a proof of lagrange multipliers
"enforces" sparsity is sort of a too strong word, more like pushes towards sparsity. I think a formalization of what this even means will probably just come down to the fact that a ball in l1 is pointy
Hmmmm
This reminds me of the Pontryagin Maximum Principle
But this is a completely different setting
Or
Maybe not so much
Well
It's optimal control theory
@tall solar most proofs hinge on the strong equivalence of the l0 pseudo norm and the l1 norm
if the l0 version of the problem has a unique solution, the convex relaxation, which is the l1 norm (the convex hull of the l0 one), finds the same solution
so they usually work with the "spark", "girth", or "kruskal rank" of the matrix in the noiseless case, or with the null space property or restricted isometry property for stable recovery with "noise"
how can I prove that $\log(\sum_i \exp(a_ix+b_i))$ is convex? I guess I could use Hölders inequality somehow, but I don't know what to do with the affine function in the exponential.
lyinch
this might sound dumb... But I could just extend the dimension (homogeneous coordinates?) by one and use $x' = [x,1]$ and $a_i' = [a_i,b_i]$ and then I have a new identical function: $\log(\sum_i \exp(a_i'x'))$ in $d+1$ dimensions on which I can apply Hölders inequality. Would that work?
lyinch
i'm not sure about using holder's inequality, but i think taking a second derivatives is not that bad, and it reduces to cauchy's inequality
I don't think its quite cauchy because i tried and got (a_1^2x_1+...+a_n^2x_n)(x_1+...+x_n) - |<a,x>|^2
unless i did it wrong or didnt see something
where x_1 = e^(a_1x+b) or whatever
yeah this is cauchy with (a1 sqrt(x1), ..., an sqrt(xn)) and (sqrt(x1), ..., sqrt(xn))
ah i c ic
I think I got it via Hölder's inequality, but it's an ugly mess. I'll now try the second derivatives as you mentioned
that's the monstrosity I did
1: I just substitute the exp() to make it easier to read and in 2 I apply Hölders inequality: $\sum_i^n \norm*{x_iy_i}_1 \leq \left( \sum_i^n \abs{x_i}^p \right)^{\frac{1}{p}} \left( \sum_i^n \abs{x_i}^q \right)^{\frac{1}{q}}$ where $\frac{1}{p} + \frac{1}{q} = 1$ with $\frac{1}{p} = \lambda$ and $\frac{1}{q} = (1-\lambda)$, $0 \leq \lambda \leq 1$
lyinch
nice. oh, is x a (d dimensional)vector?
euhm yes, I probably should have given more context 🙄
ah, i guess 2nd derivative/hessian will be a mess then
f is from R^d -> R, and a_i^T are the rows of a matrix of mxd
that was apparently the easy part, now I have to prove smoothness and then that there's a minimizer 😄 Thanks for your help!
i think the path they expected you to use was that exp is convex, composition with affine mappings is convex, nonnegative weighted sums preserve convexity, and composition with a monotonically increasing scalar func (log) is convex
if you've already shown that, anyway
because showing those things independently is a lot easier than going HAM with the hessian here
😛
I think we've only shown that composition of convex function with an affine function preserves convexity... But I agree, maybe I could show the other properties
these are classic "cookbook" properties, you could probably google the proofs
this is the first thing i found, it has some short handwavy proofs
ah, i was gonna say that is smarter. but can we apply it actually? f(x)=h(g(x)), and g must be convex and h must be convex and non-decreasing
maybe it's one of those log-convex instead of using just composition
i don't recall which one exactly it is
hmm seems i was mistaken about the last step, you have to show it's log convex indeed
Does conjugate gradient have a chance to fail if (A^T)A is only pos semidefinite?
because maybe the line search doesnt yield a min point and instead gives inflection point, but perhaps that is really rare?
fail in the sense that it doesn't converge in dimV iteration
inflection points are indefinite
semidefinite means it's rank deficient and has infinitely many solutions
you'll find one of them, but i'm not sure what happens if the observed vector is not exactly in the image of A
hm are you sure that is the case?
if you are referring to (A^T)A having infinite solutions
then i dont think its always true, semi definite doesnt mean there is a 0 eigenvalue
positive definiteness is directly related to the eigenvalues
it pretty much means all the eigenvalues are > 0
semidefiniteness, >= 0
so it might be rank defficient
you're minimizing something like Ax - b, right? you'd need b to be in the image of A for it to have a unique solution
yea, so instead im solving the normal equation
if A is not full column rank, A^T A is rank defficient
yes
hm interesting
because there exist A^T A positive semidefinite but invertible
or i mean
not A^T A
but
a general matrix
So like a general symmetric positive semidefinite matrix can be invertible
depends what you mean by invertible
you can get an exact solution if b is in the image of A
if it isn't, you'll get some projection
yea i mean as a square matrix totally invertible
if it's invertible it's pos def
all eig vals > 0
(if we're still talking about A^T A)
ok thanks, ill think about it
right i was just confusing myself
since if something was just symmetric, positive semidefinite but not positive definite then it must have a 0 eigenvalue
it "may", not necessarily must
why not, if it is symmetric
but if it was all > 0 then it would be positive definite
positive definite mats are also positive semidefinite
k
This sounds silly and is maybe a #calculus question (if so LMK and I'll move it)
I'm having trouble wrapping my head around the 2n-1 degree exactness in gaussian quadrature
I can sort of get my head around that if we have n x's, we can easily get n degree accuracy
but why 2n-1
jan Niku
er, sorry
I've made a typo, that should say f(x) is a fifth degree polynomial
2n-1 degree exactness, and all
the proof is quite involved but smart at the same time, gaussian quadrature uses legendre polynomials
hmm okay
maybe ill try to suffer through a proof
maybe just intuition is enough at this point though 😄
it sort of makes sense that 6 unknowns should give you 5 degrees of freedom
ive got a "proof" but it doesn't prove how to find the orthogonal polynomial
but other than that it is a proof
that you can reference
If it's easy to link, sure
I may ask my teacher exactly what level of comprehension I should be aiming at
thank you 
start from pg5
yea, we are tasked with this in the homework, which is not so bad
the mechanical portions arent too painful, the exactness and the transformation are conceptually strange to me though
we just use newtons for systems in our class
im sure that means this stuff is curated so that we get nice answers using this method
but it works for now

would someone be willing to help me with a transform 
I'm looking at this problem
n=3 gaussian quadrature coefficients over -1 to 1
so if id like to integrate over 0 to 1 instead, each weight should become 1/2 of its original value?
I think that these weights are for 0 to 1
For -1 to 1 you have x_i at -sqrt(3/5), 0, and sqrt(3/5)

Ok
jan Niku
These aren't the weights
These are the points
The weight at 0 is 8/9 and the weight at the other two points is 5/9
okay
so as we move from -1 to 1
each weight should become 1/2 of its original value
as (b-a)/2 = 1/2
so 4/9 and then endpoints of 5/18
Yes when you move to the interval [0,1]
sorry im at work
but this doesnt seem complicated
i think i just lack confidence
thanks for the help

Here's my current proof that the line search directions of conjugate gradient are Q_A = A^T A conjugate
I believe a problem is that I assumed a_i ≠ 0
any way to get around it our justify if that's fine? btw a_i is this
it's the optimal scaling value for each line search in the Q_A conjugate direction
so the only thing I realize is if a_k = 0 then r^(k) and d^(k) are orthogonal and also means there's no improvement in that particular iteration
which I didn't say clearly in the proof, I just said "which means least squares is solved" which isn't the case

btw do u know a faster way to do $v^\top A v$, where A is an inner product, than just to do the standard (dumb) way $v^\top(Av)$
Anticipation
Implemented practically, calling matrix-vector multiplies will probably be the fastest
Because you'll be calling BLAS routines
Of course, if you know more about A, you might be able to do something
Doing two mat-vec mult is O(n^2)
I think 4n^2?
Doing anything like decomposing A will be more expensive
ic
yea O(n^2)
so its like it makes no difference if A induces an inner product or not
tfw u want to solve a linear system exactly under O(n^3)
looks like law of nature dont let u
Well if A is dense with no structure, there is no hope
But
This is why I'm giving a talk!
nice
But eg if you are doing this multiple times for many v, then maybe there is a trick like a decomposition. Or even if A is changing, if the decomposition plays nice with the updates, then maybe it is still good
(I guess A is an inner product means it is positive definite?)
Yes
I think A is constant for CG
You would need to do v^TAv at least n times for a decomposition to make sense
And the point of CG is to use less steps than that
Oh
CG, as a member of the Krylov Subspace family of methods, will give the exact solution in n steps
But this is equivalent to doing LU directly
However, it will converge must faster to a solution
So you can get convergence within machine epsilon well before this point
will you cover sparse regularization in your talk?
I'll cover some sparse stuff
Probably not regularization though
It's only an hour
And there's a lot of stuff to potentially cover
fair enough, yeah
Think I solved this, 2 things: the proof is really nice, its induction with 3 statements and you use all of them to prove the next step and second is alpha = 0 will never happen because that would mean r = 0 and we would be done in the previous step
if all you have to do is verify, you can multiply the two expressions and show you get an identity matrix
Just multiply B(k+1) and B(k+1)^(-1)?
Yeah
will the result be so complicated?
I try the multiplication but it isnt look like an identity matrix
Well what do you have?
a long expression, maybe i dont know how to simplify it
would you mind to show me how to do it?
Is there any tactics on doing the multiplication?
Maybe you can do it as two rank 1 updates, using the woodbury formula. But otherwise doing the multiplication carefully is the only "tactic" I can think of
one thing im consered with conj grad for least square is its slow in the end
concerned*
cuz u always have to compute A^T A
Well yeah
So you use something like GMRES for least squares
And not CG
You never want to solve the normal equations
ic our prof recommended conj grad, it seems like the course is not going in depth enough
but assuming we already solved A^T A then its good ig
cuz that multiplication takes order n^2 * m i think which is just bad
Another thing is that it doubles the condition number so iterative methods take twice as long to converge (roughly)
which thing doubles the condition num?
Turning A into A^TA
oh ok thx
what definition of condition number are you using?
Yeah it squares them
But iterative methods have runtime proportional to log condition number
Or something
ah yes
So twice as long
yes yes
Yes
yes
Yes
Yes
Yes
YES!!!
yes
Yes?


Good afternoon. I've been having a rough time with this question. Let $a,b,c$ be complex numbers, $a\neq b,c\neq 0$, and let
\begin{align*}
w=f(z):=\frac{1}{c}(e^{az}-e^{bz}).
\end{align*}
Show that the inverse function near $w=0$ is represented by
\begin{align*}
z=f^{[-1]}(w)=\frac{1}{d}\sum_{n=1}^\infty \frac{(-1)^{n-1}c^n}{n!}\left(\frac{nb}{d}+1\right)_{n-1}w^n,
\end{align*}
where $d:=a-b$. Derive as special cases the representations for the inverses of the functions
\begin{align*}
w=\sin z,\quad w=e^z-1,\quad w=e^{\alpha z}\sin \beta z,
\end{align*}
and, as the limiting case $a=b+c$, $c\to 0$, $w=ze^{bz}$. [To compute residues use formal integration by parts; see Problem $1.8.6$.]
TheRedLotus
My bad. Didn't see that channel.
Hello,
I would like to plot a damped sinusoidal frequency spectrum in matlab
here is the equation of this damped sinusoidal
\begin{align}
$e^{-2t} . sin(4\pi t)$
\end{align}
the fourier transform of such signal would be that if i'm not wrong
\begin{align}
$\frac{4\pi}{(2+i2\pi f)^2 + (4\pi)^2}$
\end{align}
but i don't know what to do with that
DawnUltra
here is my matlab code
function [] = damped_sin()
x = 0:1:5;
a = zeros(1,numel(x));
a = (4*pi)/(square(2+1i*2*pi*x) + (2*pi*2)^(2))
plot(x,a);
title("damped sinusoidal spectrum")
xlabel("frequency")
ylabel("amplitude")
ok
can somebody eli5 me Forrest Tomlin update from simplex method?
Can give you simplex method matlab code
Not sure abt Forrest Temlin
Use 2d plots for complex plane, make your function in ai+b form
then a=x b=y
Can't)
I did it 7 years ago
can't remember the formulas
no problem
hey, how do I prove that this function (log sum exp) has a minimum? I know that it is smooth and convex (not strongly convex). I found it here: https://link.springer.com/content/pdf/10.1007/s10208-013-9150-3.pdf p. 12 but they just take the existence of a minimum for granted
I think there is a common technique of proving that a function is convex and... "Forcing", maybe it was called? The property is that f(x)->infty as |x|->infty
Ah sorry, maybe this doesn't work? Yeah, I don't think this works, hm
I've already proven convexity, but this tells me nothing about the existence of a minimizer
What conditions do you have on the ai? (Nothing?)
what if m=1 ?
no conditions on A and b
Ah yeah, something is weird, huh
I think intuitively if the ai "point in many directions" this will have a minimum (in R this would just be pointing in the positive and negative direction)
I think you need m > number of dimensions
Ah, the property I was thinking of is "coercive"
maybe it doesn't always have an optimum...
As it is smooth, we expect the region around the optimum to be well approximated by a quadratic (assuming the optimum exists)
In mathematics, a coercive function is a function that "grows rapidly" at the extremes of the space on which it is defined. Depending on the context
different exact definitions of this idea are in use.
ah yes, I read about coercive functions but don't know too much about it
maybe it's a good time now to look more into this
x \in R^n
rho is just a scalar parameter
A \in R^{mxd} and b \in R^m
So yeah, I think ai pointing in many directions should imply this function is coercive => minimum exists. But maybe there are weaker conditions under which a minimum ecists
no that should be an equivalence here
if you find a direction where all the scalar products are negative well you can get as close to -infinity as you want
Yeah I think it sounds right, although I'm not sure eg how many ai are necessary to do this
But linear functions are convex, exponential is convex, etc
and they have a minimum
I guess eg in ML world probably all loss functions are coercive, so there is an implication convex implies minimum exists
But we are working over the whole domain R/Rn, not eg over a compact domain
you write R/Rn as a quotient or just to mean R^n
Just as in R or Rn (really n=1 case), no quotient going on
then yeah, convexity implies there is a minimum
I think there is just a miscommunication, linear function obviously goes down to -infinity
because the function is strictly non-decreasing
it won't be unique in general, but they will all be gathered in a single region
You are saying: f is a function from Rn to R, then f convex implies f has a minimum?
Yes, I totally agree with local implies global
aside from that, the minimum would be at a boundary if not within the domain
you would have to either find where the gradient is a 0 vector, because that is necessary and sufficient due to what we just said, or find the optimum along the boundary of the domain if the gradient does not become 0
no, for that you need strong convexity
I was gonna say "one of us is on drugs 🙂 " but I am already drinking
there is strong and there is strict convex. Let me dig up the definitions
Strong convexity implies it is lower bounded by a quadratic right?
i might've mixed them up
strong convexity bounds the function from below
but I don't know if this function is strong convex. Maybe that's a starting point 🙂
But we already know it is not strong convex without some conditions on ai
a quick google-fu says you can use strict convexity and the function being coercive to show the minimum exists
Yeah, and coercive should mean that ai point in "many" directions, ie every vector in Rn has a positive dot product with some ai
but I think that it's not strictly convex
I have to read a bit more into coercive functions, I think that is the correct way to show it
where is this from?
a script from my lecture
Hello! So I am trying to write a program that draws my teacher using epicycles as a way of saying thanks. I have noticed that a lot of programs use the discrete fourier transform for this, however, I am not that comfortable with this method so I am here to ask a few questions. So let's say that I somehow get a lot coordinates of a simple picture that I want to use. Then I should use the DFT to get the complex points using the formula above. But how do I proceed from there? How do I calculate the radius of each circle? How fast should they spin?
Okay so from this I guess that the frequency of each circle would be k/N and that the amplitude (radius) would be 1/N*sqrt(Re^2+Im^2), is this correct?
So this is my guess. Let's say that I have the first circle, then I should add another circle that spins around the first circles circumference. The middle point of that circle should be x = r cos(2pi k/Nt), y=rsin(2pi k/Nt) where t is the time. But how much should t increase by for each loop? Can you just pick that an arbitrary number, say 0.5, that the time should increase by? I.e t+=0.5 each loop?
But doesn't the fourier series only work for periodic functions?
why?
So the way they do the fancy animations is they make the image into a periodic function in polar coordinates
oh oops...
But a lot of programs that I've seen use the DFT on both the x-coordinates and the y-coordinates
So what is the difference between DFT and fourier series?
Okay, thanks for the info! But can't I just a get a simple outline of by teachers face and somehow get the x and y coordinates of all those points and perform the DFT on both the x and y coordinates separately? Wouldn't the complex numbers generated by the DFT provide all the necessary info for the circles etc.?
Yeah true... But I am not that confident with polar coordinates
Oh yeah, that's true. I will definitely give this method a try! Thank you so much for the information!
What is interpolate? Taking the mean value?
i guess theres a typo here?
heres original matrix, i thought to clear out the 0
you need to use b1 rather than b2
and would the subsequent givens rotation just be Gk = G(k,k+1,theta)
They are doing a Givens rotation for the upper left 2x2 block
That's why the b_1 in position A_12 becomes q_1
I think
Hmmmm
it says choose theta satisfying that thing
but i thought it should be b1
there
instead of b2
yea
i think i got it though theres 2 typos
like I know its to triangularize and here its O(n) time apparently
the one thing i dont quite sure
is when you get (Q^T)A = R after performing the sequence of givens right
then A = QR, but why is it that when you do RQ = B, then B is again tridiagonal, I guess just prove it inductively by looking at each givens rot and look at the things elementwise?
ok i figured out the implementation
i just needhelp to prove QRQ^T is also tridiagonal
Ok i also figured out theoretically why
its cuz B = RQ = Q^TAQ is symmetric and u can show that when u apply the sequence of givens rotation on R from the right, the element below subdiagonal below main is maintained 0 because A is tridiagonal
damn QR diagonalization is epic
Hello, I have a question about preconditioning in numerical linear algebra. On Wikipedia and other sites I found that solving a system of linear equations Ax=b where the condition number of A is low can lead to a fast rate of convergence for an iterative method of solving Ax=b. However, I can't seem to find or understand why a low condition number leads to a fast convergence. Does anybody have good source which explains this or does anyone mind to explain themselves? Thanks in advance!
Thanks for the link! unfortunately I'm not really able find the answer with the information. The explanation also seems quiet advanced and the question connected with the course is an introductory numerical maths course. So I think there could be an 'easier' explanation for this.
Mmm yeah sorry advanced isn't quite right to say. tbh I don't really understand the answer and I think that this is not the answer they would expect in my situation since the explanation is not very related to the content of the course. I was more thinking about finding a formula or so that includes the condition number of the matrix and a measurement for the speed or convergence.
Not really, we should be able to find the answer for this question with the content that we saw. (But I'm not really able to find it with the course material I have.)
you can interpret it as being related to the lipschitz constant of the gradient of the function $\frac{1}{2}\Vert Ax - B \Vert_2^2$
Edd
if all of the singular values of A are the same, the gradient is equally smooth in all directions. if not, then the condition number increases and some dimensions change more slowly than others, but you have to account for the worst case scenario to guarantee convergence
this usually results in iterative updates that change some dimensions very slowly
Thank you! I think I'm getting it. The formula is also very helpful 🙂
I'm trying to decide which courses to take next semester, and I can't decide whether two in particular would be helpful. The first course gives an introduction to measure theory with a slight focus on its connection to probability theory, and the second course introduces functional analysis. I'm interested in machine learning and its foundational fields, so I'm curious whether these courses would be useful there.
(I hope this is the right channel, otherwise please point me to the correct one)
I am specializing in robotics/mechatronics/control theory, so if either course is useful there, that'd be great too
Go ask in #math-discussion
Thanks!
Hey svd question
say you have X=USV^T
I'm reading somewhere that the "orthogonal projection onto the column space of X is UU^T
What does that even mean?? UU^T should be an identity ??
Yes, UU^T is the identity
is it reduced svd?
I don't think it being reduced has to do with anything.
It doesn't seem like projecting onto the column space changes anything wth
I recall there are multiple SVDs, and some of them are such that UU^T are identity and some are not, or something like that
I guess it sounds sort of right, if X is surjective then probably this means good shit happens, and UU^T is the identity, maybe, and so is orthogonal projection
U^T U is always identity
depends on the definition of the SVD method
See I wanna take a matrix X1 and another X2
Then find the closest matrix to X2 that has the same row and column space as X1
right i think U and V i was thinking are square
I remember being sad because I was working on an exercise that talks about SVD, but they must have meant a different SVD than the normal one because the shit they were saying did not make sense
lol
I saw another formula for the "projection onto the column space of X"
That was like $ X*(X^{T} *X)^{-1} X^{T}$
But if you use the svd X=USV^T
It all reduces to nothing no matter what
Weird stuff
Nvm it works out I'm insane. You guys were right it works
this was already a while ago, but for completeness, U U^T is the identity if U is square, since then U contains the basis of the column space and the left null space. normally though, you ignore the singular vectors corresponding to the 0 singular values since you want to, as people said above, project onto the column space. in this case, U U^T is not an identity. this latter one is called "economy-size SVD"
so it's usually a good idea to say U = [U_c U_l], showing explicitly which part is a basis for the column space and which part for the left null space
then U U^H is always an identity, and U_c U_c^H (economy or reduced svd) is an identity if the matrix is full rank
idk if this is the channel but
Imagine i have an image, and i have a mask for it
what is the operation i need to do to leave the resulting background as white?
Like this
But with the background white
the masked image is the third bird
ty
Yes! I'm a bit embarrassed I didn't realize this at first lol.
I was curious about projecting svd of one natural image onto another. Then I realized what you're saying now. In a natural image all the sv's will typically be nonzero (or error or whatever reason). I was upset I kept getting the same image but then I realized since they were full rank they have exactly the same column and row space.
for i in range(N):
for j in range(M):
if not mask[i*M+j]:
image[i*M+j] = 0x88DDDA
:tinkTonk:
Holy shit that was painful to type on phone lmao

Btw
I ended up truncating the singular values of each natural image then performing the projection. But when I did the projection between a dog and a cat photo nothing particularly cool happened.
I wanted to see a cat dog :(
Conclusion cat=dog
the basis was probably similar
Yeah if you keep like 5 singular values you might get lucky with certain photos and keep some color of the other
you'd have to do a very low rank approx to see anything neat, maybe
yea
if youd had several examples, you can do something more similar to that
like a basis for several dog images
this is more like the so-called eigenfaces alg, which is pca, which is svd in a trench coat
Yes I was thinking the same thing! I have found a dataset utkFace of a bunch of centered face images.
I wonder if I can turn a man into a lady
that sort of classifier should work something like doing a projection and then measuring the error
that projection should be most similar component of the original image to those in the basis you found
very roughly, anyway. in ML, the nonlinear activation functions let the result be more sophisticated
Nice yeah machine learning is cool lol.
Before ml I wanted to try some tensor factorization on color images. I read about this cool factorization called the t svd that I'm really vibing with.

what's cool about it is that it essentially works by applying svd in the Fourier domain.
There's a block matrix associated with a 3d tensor called the block circulant that has the same spectra than the Fourier modes
HO-SVD 
i'd have to see how the circulant is constructed, but yeah
all circulant mats are diagonalized by fourier matrices
Something something no screenshot
Lol got you
@brave crypt was there a specific section you had questions about
Ok so let's look at the first one first
Sure 🙂
This is the paper, if you don't want to switch the channels: https://arxiv.org/pdf/1905.12120.pdf 😀
Ok
So I'm reading this
And I'm not entirely sure if X and W are matrices or vectors
Which is clearly an oversight on the part of the authors
Some more experienced with ML may be able to deduce this
I have a book, let me see if I can find annotation for it.
I am sorry.
Yes, but still.
Anyways
W is Weight Matrix.
$\sum_{j=1}X[i+jr]W[j]$ is a convolution
Angetenar
x: Input Vector
Then you must be right.
This paper is so bad
They don't define anything
Anyways
Do you understand how convolutions work
If you can explain me briefly then it would be really helpful.
Ok
Thanks 🙂
Let's consider the 3 by 3 matrix $\begin{bmatrix}1&2&3\4&5&6\7&8&9\end{bmatrix}$ and the 2 by 2 kernel $\begin{bmatrix}-1&1\-2&2\end{bmatrix}$
Angetenar
Then, the convolution of this would be $\begin{bmatrix}-1\cross1+1\cross2-2\cross4+2\cross5&-1\cross2+1\cross3-2\cross5+2\cross6\-1\cross4+1\cross5-2\cross7+2\cross8&-1\cross5+1\cross6-2\cross8+2\cross9\end{bmatrix}$
Angetenar
Do you see what has happened
We take the kernel (also called a filter) and we slide it through the matrix
We start at the top left
And we take all the elementwise products
And then add them all together
And then we slide the filter over by 1
And then we reach the end of the row so we move it down 1 and start over from the left
And then we slide over by 1 again
And then we're done
So for a filter
We also have two strides, sigma_x and sigma_y, which determine how much you slide in the x and y directions
You also have a dilation term
Which determines how big the kernel is when mapped onto the matrix
In this example I worked out, the dilation would be 1
However, if the dilation were 2, for example, we would have the result of the convolution as $\begin{bmatrix}-1\cross1+1\cross3-2\cross7+2\cross9\end{bmatrix}$
Angetenar
Does this make sense
In essence, the kernel $\begin{bmatrix}-1&1\-2&2\end{bmatrix}$ with dilation 2 is equivalent to $\begin{bmatrix}-1&0&1\0&0&0\-2&0&2\end{bmatrix}$
Anyways
Ok
Angetenar
$\sum_{j=1}X[i+jr]W[j]$ computes a convolution for a single output $Y(i)$
Angetenar
Y is a matrix I think
Their notation sucks
Anyways
Once we compute this
We apply ReLU
Which is $Re\qty(\sum_jX[i+jr]W[j])=\max\left{0,\sum_jX[i+jr]W[j]\right}$
Angetenar
Yes, max(0, N)
And if something is negative, it becomes 0
Yep
Ok
After we apply ReLU, we batch normalize
Do you know what batch normalize means
Where I can take material to study about applied-computational-math?
Nope. I am sorry.
Ok
Batch normalize means that you shift and scale the data that so that it has mean 0 and standard deviation 1
What sort of applied math are you interested in?
Cool
And to do this, we need the two parameters beta and gamma
Does the first formula make sense now
Ok now for the second formula
Do you know what a loss function is
I apologize for leaving.
I think it's related to gradient descendant to achieve local minima.
I see.
And would you please tell me how it measures that?
P_{n,m} is what the model predicts for pixel n at scale m
Notice in the model architecture in figure 2, there are three down sampling layers so you have a total of 4 scales
Let me finish this explanation first
Sure. 😀
G_n is the true label for pixel n
In the loss function, we compute $\sum_{m=1}^4\qty(1-\sum_{n=1}^N\frac{2G_nP_{n,m}}{G_n+P_{n,m}+\eps})$
Angetenar
The outer sum is straight forwards, because we want the errors across all 4 scales right
Now for the thing inside
Yes
We sum the error across all the pixels
Sensible
At each pixel
We have this Sorensen-Dice coefficient
Which is used to gauge how similar two things are
Tbh the paper doesn't really make sense here
In my opinion
Because you don't specify what values Gn and Pnm could take on
Oh hitbox
You are here
Don't you know ML
Have you heard of Sorensen-Dice loss before
Rip
Have fun coding
This is how it starts:
class DiceLossVariants(losses.Loss):
def __init__(self, *args, **kwargs):
super_args = dict()
if 'reduction' in kwargs.keys():
super_args['reduction'] = kwargs['reduction']
if 'name' in kwargs.keys():
super_args['name'] = kwargs['name']
super().__init__(**super_args)
if len(args) > 0:
self.loss_name = args[0]
elif 'loss_name' in kwargs.keys():
self.loss_name = kwargs['loss_name']
There's this paper
And it has some formulas
But it doesn't really define anything formally enough for the formulas to make sense

Here is the paper
😀
We are looking at formula (2) right now
Unclear
Ranges are unspecified
I was thinking that because this is binary classification, they would be 0 or 1
Which is why there would be an epsilon in the denominator
The prediction for pixel n at scale m
May I know what you mean by predict as big as possible?
Sorry, I am very new in this.
Yes.
Yes, Gradient Descendant.
I am sorry.
Sure.
advanced_losses.py
No worries.
Thank you for trying 😀
No issues 😀
Though do you know why we generate accuracy matrices for test dataset?
From that repository, I got three matrices.
But I don't understand why we need matrices for test and training.
Ohh. We have used that for other datasets.
This is on my local machine.
I am sorry, I didn't know the right word.
I see.
One last question:
I got IoU: 0.42 for train, test and validation.
Is that bad?
😀
I see.
Thank you for your help 😀
@wide spear Thank you for explaining me things patiently.
😊


Does anyone know what a "dependence set" is? (in the context of a matrix).
I'm reading a paper and the define it like:
for the upper characteristic we define depsU(i)as the set of all the indexes j such that the coefficient mi,j of the matrix M (of the linearlayer) is non-zero.
For the matrix:
1 0 1 1
1 0 0 0
0 1 1 0
1 0 1 0
They obtain the sets:
It seems like the i coefficients correspond to the positions of the 1's in the matrix, but I don't understand why the j coefficients are offset
for example, in deps(0,j) , 0, 2, 3 correspond to 1s in the first row of the matrix
but where do (j+2)%4) or (j+1)%4) come from?
the paper doesn't really go into any more detail than what ive posted so i am lost on how they derived depsU
Is this in the context of finite difference methods for pdes
Warning: long message
Hello, I'm using gmres in MATLAB for preconditioning on 2 specific sparse matrices A1 and A2. Their sparsity pattern, generated with spy is shown in the screenshot below. For both matrices A1 and A2 the following code is executed (A1 and A2 are A):
x0 = gmres(A,b);
y = gmres(@(x) A*( solve_Ub( U, solve_Lb(L,x))),b);
x1 = solve_Ub( U, solve_Lb(L,y));
LU is the incomplete lu factorisation of A. solve_Ub and solve_Lb are specific functions which return y for which Uy=b and Ly=b respectively. x0 is the solution of Ax=b where no preconditioner is used and x1 is the solution of the same system but with the incomplete LU-factorisation as a preconditioner.
Executing the code generates the following output for A1:
gmres stopped at iteration 10 without converging to the desired tolerance 1e-06
because the maximum number of iterations was reached.
The iterate returned (number 10) has relative residual 0.29.
gmres converged at iteration 2 to a solution with relative residual 8.2e-13.
for A2:
gmres stopped at iteration 10 without converging to the desired tolerance 1e-06
because the maximum number of iterations was reached.
The iterate returned (number 10) has relative residual 0.77.
gmres stopped at iteration 10 without converging to the desired tolerance 1e-06
because the maximum number of iterations was reached.
The iterate returned (number 10) has relative residual 0.086.
So the preconditioner has a good effect on A1 and already converges at iteration 2, but with with A2 the preconditioning doesn't improve the convergence here. My question is: Does anybody know why this helps relatively good for A1 and not for A2? I thought that A2 might be bad conditioned, but by calculating the condition number for A1 and A2, I came to the conclusion that A1 even has a bigger condition number than A2. Sorry for this long question, but thanks for reading this. Any sugestion would be much appreciated!
No, cryptography
Tiboat
Are you looping with the LU? If so, are you using a sparse LU?
You might be getting a lot of fill in for the second one
in general, most preconditioning* methods destroy the sparsity :x
you might do better with something like a jacobi preconditioning
Ok it looks like the j is offset in backwards order
the LU is sparse
So it’s like (i, j+(4-i) mod 4)
I don’t think the preconditioning is being applied to the matrix
I don't really get what you mean by looping the LU and also with the fill in for the second one 😦 . (I'm a big noob in this, sorry)
It’s being used to find an initial guess for a solution?
What do you mean exactly?
.
Well, I'm not practically using this to solve a system of linear equations for some purpose. This is more like a thoeretical question i'm trying to solve.
Let me think about this more when I get out of bed
ok thanks 🙂 haha
i'm not sure it holds for this second case:
matrix M^-1:
0 1 0 0
0 1 1 1
0 1 0 1
1 0 0 0
sets:
Thanks for the information! Does the destroying of sparsity have a relation with why it could wokr with A1 and not with A2, if you would know?
not necessarily, but it makes stuff a lot slower to compute
i guess here it is just (i, j+i % 4) - the real issue is why they do that 😄
for both U and L
I dont really understand what they mean by "coefficient" in the def:
Ok thank you! Than I know already that that isn't the solution for the answer
we define depsU(i) as the set of all the indexes j such that the coefficient m i,j of the matrix M (of the linear layer) is non-zero
Ok
tiboat
You are using GMRES to determine an initial guess for your iteration?
I'm not entirely sure what your Matlab is doing
Mostly because I'm not very familiar with it
I have no clue what's going on either
😦
You might consider asking in the CS server linked in #old-network , they probably have people who know more crypto
with x0 = gmres(A,b); I'm trying to get an approximate solution of Ax=b
y = gmres(@(x) A*( solve_Ub( U, solve_Lb(L,x))),b); x1 = solve_Ub( U, solve_Lb(L,y)); also but with a right preconditioner LU
What does @(x) mean
And what do you do with x1
Do you do this entire thing again?
Or is x1 what you calculate the relative residual with
there you kind of make a function with variable x
Ok y is a function of x?
yes
But you don't pass in x when you call y in x1
um no... I'm not even able to properly understand this haha 😓 . Maybe to give a bit more context: this question with the code was given for an assignment. We actually didn't see anything about gmres in the course. for this question we also can interprete gmres largely as a black box. So tbh I don't precisely know what's going with the gmres. The answer for the question also is probably not very related to the specific gmres method.
Ok so we should be thinking about qualitative differences between the two matrices?
I think so, yes
then maybe it's a matter of looking at the error between the original matrix and the incomplete LU?
you won't have a good preconditioning if the error between the original mat and the incomplete LU is close to the original matrix
Aha, that's a very hopeful suggestion. I will see what the error is!🙂
particularly, the frobenius norm of the difference
or also the condition number of the preconditioned problem
see if it improves as much as the other one's does
(which should also be related to how good the incomplete LU was)
The forbenius norm of the difference A1 is 0.594262658318646 and for A2 is 4.088316191164535e+04. So that's nice! Unfortunately I didn't see the forbenius norm in the course. So if you want to take the condition number of the preconditioned problem, what do you take the condition number of exactly? Thanks for all the help already! and sorry for the late answer
Yes i've seen the 1-norm, 2-norm and infinity-norm.
Ok
oh wait is frobenius norm = 2-norm?
It is a fact that $\norm{A}_2\leq\norm{A}_f\leq\sqrt{r}\norm{A}_2$
Angetenar
ah nevermind
Where r is the rank of the matrix
well, the 2-norm of a matrix is an induced norm, and it corresponds to the largest singular value
So you can also compute the 2-norm of the difference
oh you had already started, oops
anywho, the frobenius norm is the square root of the sum of singular values squared, while the 2 norm is the largest singular value of the matrix, and they follow the property angetenar gave up there
A lot of matrix norms have these equivalency properties
Actually, all matrix norms are equivalent
so if you compute the frobenius norm of the original matrix and the frobenius norm of the difference, you can get an idea of how bad the estimate is
equivalence of norms moment coming up
For any two matrix norms you have the inequalities $c_1\norm{A}{\alpha}\leq\norm{A}{\beta}\leq c_2\norm{A}_{\alpha}$
Angetenar
For arbitrary matrix norms
The moral is that the specific matrix norm you choose doesn't matter so much
how concrete and bullet proof must your explanation of this problem be?
Thanks for all the information!! 😊
at a handwavy level, i would say looking at the frobenius norm of the difference and the condition number before and after preconditioning should give a good idea of what's going on
or maybe relative frobenius norm difference
something like that
(or any other norm you like, as ange said)
oh, and as it turns out, it IS a consequence of ruining the sparsity
i had just missed the part where you said incomplete LU
since you have some M = LU - R, and LU is the incomplete LU decomp
the usual LU would destroy the sparsity, so you subtract this U term
and impose sparsity on L and U
this R takes up all the error
which you put in by making the LU incomplete to try and preserve the sparsity
when you do the frobenius norm of the error M - LU, it's really the norm of U that you see
Thanks for all the help and information @prime kraken and @wide spear ! I think I'm getting it
https://www.reddit.com/r/CFD/comments/hbmzhf/fluid_simulation_using_a_discrete_lagrangian/
Can someone share the mathematical model of this?
133 votes and 9 comments so far on Reddit
thank you, I'll move it here
I do not know the answer to your question
Sorry if this is wrong place for it, but that's the one that was recommended to me, and I don't think that early university statistics apply (although I'm a computer science PhD, perhaps for mathematicians that is trivial)
Below formula comes from "Conjugate Bayesian analysis of the Gaussian distribution" available at https://www.cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf
In Normal-Inverse Wishart prior, what is the role of kappa_0?
In my case, I'm using it as part of Bayesian Rose Trees clustering. It is my understanding that NIW is used to model gaussian distribution in order to differentiate between clusters using the assumption that each cluster should follow gaussian distribution. Is it then correct that for given data and given Sigma (how to estimate Sigma is another problem), kappa_0 should be selected such that it would make Sigma/kappa_0 equal to the predicted variance of the distribution?
@prime kraken might know
If nobody knows here you might try asking in the AI/ML server linked in #old-network
(Edd is also asleep right now I think)
yeah, I'm there, so I'll do that if no one here will be able to answer. But I think I need more of a math understanding that ML-based one. Cause, to be honest, I could just optimize it as a hyperparameter. But well, then I still wont know what it means, really
and my intuition tells me that I can actually pick a proper value based on the context of the application, I just have to... calibrate myself
Ah, is this the problem of picking priors in bayesian stats? 🙂 I'm not very good at bayesian stats, but in general I think picking (the parameters to) a prior is not a simple thing. (Unless there is something specific about this problem where a certain way to choose is typically taken)
Indeed, I don't know 😞 but it could be fun to talk about this problem 🙂
Hmmm in my particular case, the samples are event times. Clusters correspond to trips, days, series of photos
(depending on the level of hierarchy)
so a single gaussian distribution should represent a distribution of timestamps
Therefore some reasonable heuristics could be used based on the fact that people, for example, sleep for like 8 hours a day
this kappa_0 parameter in case of clustering is called a "scale factor", so I guess that I could probably select a scale that corresponds to the variance of timestamps during a single day
If my understanding is correct, the eq. 246 talks about the relation to the normal distribution with given mean and given variance
But then again, same paper uses italic N to denote normal distribution. I assume that's just a negligence. If what I'm saying is correct, then I can pick correct value of kappa_0 if I manage to get the Sigma right, which is another problem. But so far I don't really know whether my intuition is even correct, this is the first time I even heard of Wishart and my knowledge about priors in general is superficial.
Did you read the caption for figure 4
i know little to nothing about bayesian estimation :( however, all throughout the manuscript, kappa_0 is the "belief that the prior mean is correct"
kappa_0 reduces the variance of the prior, meaning random realizations of the random process for the mean are closer to it, and increases the weight of the current (mu - mu_0)^T Sigma^-1 (mu - mu_0)
since one usually uses log likelihood for this, the argumentof the exp comes out and this works almost like a proximal mapping
The one in context of Normal Inverse Chi Squared prior rather than Normal Inverse Wishart prior? Yeah, I did.
As Edd says, it's described there as "how strongly we believe that mi_0 is the prior mean. And they use values like "1.0" or "5.0". So I don't think that the value corresponds to, say, probability. I lack the intuition to properly scale kappa_0.
So you're saying that I can assume that Sigma in eq. 246 stands for variance and therefore I can think of kappa_0 as a way to change the variance of the model. Since in eq. 245 Sigma is equivalent to Inverse Wishart distribution of Lambda^(-1). Wherein Lambda_n "plays role of" posterior sum of squares (eq. 257). So some variance times how strongly we believe in it. So I assume that in fact the "how strongly I believe" is actually measured by the relation between ni and kappa. Is that correct reasoning?
I see, that ni_0 is actually a number of degrees of freedom.
I do actually use that in log, as you've said.
I have to do some thinking to be able to actually put correct value in kappa_0, but you've given me a framework to work with, thank you! 🙂
for completeness, it's like a proximal based on the mahalanobis distance, since the difference between the prior mean and the current mean has that simga^-1 in the middle, so the coordinates are rescaled based on that inverse covariance
the interpretation becomes a bit muddy, though, because both this "proximal" term and the other term in the cost function depend on lambda
After some discussion elsewhere, I've become quite convinced that kappa_0 should respond to the number of samples in previous experiments that led to current parameters. So it's a measure of inertia of curret hyperparameters of my model. So I have to focus on different hyperparameters first, and then I will be able to say how far the distribution could change as a result of further data. Do you see any flaw in that?
yep, that's an alternative interpretation to it
if kappa_0 is large, the model willpreferentially stick to the first guess of the mean, i.e. mu_0
if you have lots of previous samples, one would expect mu_0 to be a good guess that need not be modified mich
Hey so i have been asked to firstly write if the LMM is explicit or implicit $y_{n+1} - y_{n} = h(1/2f_{n+1} + 1/2f_{n})$
B1GW0LF
I have been looking through the lecture notes for 2 hrs at this point and I am just getting more and more confused
,w LMM
whzup
In my case I don't have any previous examples, but that lets me pick a proper value for this parameter when I will establish other ones 🙂 so thanks!
@wide spear I hope you remember me.
We discussed about this paper 😀

