#numerical-analysis

1 messages · Page 21 of 1

wide spear
#

Yeah

#

Yes there is

#

torch.distributions or scipy.stats might have some things

brave crypt
#

there's an analytical solution? i'm looking at the wikipedia page, and i can't imagine it

wide spear
#

I think the idea is to find the best zipfian distribution to model the data then see if it's good

#

Deciding yes/no won't be a quantitative thing though

brave crypt
#

Yeah, and i cannot imagine doing logarithmic derivative of this thing and setting to 0 is analytically solvable

wide spear
#

Numerics!

#

What is a bearlain

brave crypt
#

this sounds like a fun problem to explore whether this is convex, or a region near the minimum can be found that is convex, etc. 🙂

wide spear
#

You can make it into a convex optimization problem without a doubt

#

Regression can always be made into a convex problem, no?

brave crypt
#

i think there is a result like regression with exponential families is always convex or something

wide spear
#

Well

brave crypt
#

but not in general

wide spear
#

Specific types

#

Like if the norm is convex

#

Because then you're just minimizing the sum of convex functions right?

#

A zipfian distribution is probably convex

#

But I don't think it belongs to the exponential family

#

Wait you can just do gradient descent on the sum

#

This is definitely in scipy

#

Stochastic gradient descent

brave crypt
#

and if scipy does some weird ass way to optimize it then you could try coding it yourself tinktonk

wide spear
#

I have the person formerly known as hitbox noted as "cannot code"

#

I wonder why

#

Rip

#

Ummm

#

Probably something from last year

#

Or something

#

I don't remember

#

I can try to dig it up if you want

brave crypt
#

i know hitbox as the guy who studies probability

wide spear
#

This is true

#

hitbox does do probability

#

Oh

#

Maybe it was because of your sage difficulties

#

Oh in January

#

A Dembo

#

Math 136

#

Rip

brave crypt
#

I think there are some results like MLE is fucked if it happens on the boundary

#

Or maybe it was more like MLE is provably good only when it happens in the interior, something like that

#

But this is sort of more mathematical rather than statistical

#

The result I recall is about MLE, it probably exists in some mathematical text like shao's mathematical statistics 😜

wide spear
brave crypt
#

hardwiring is always bad:(

glossy yoke
#

or maybe i can just handwave it away
it's probably fine
Famous last words

wide spear
glossy yoke
#

Unless you’re doing ML, as ML researchers aren’t held to any standards of statistical rigor

wide spear
#

Everyone knows that the best way to get state of the art results is by adding more layers to your neural network

glossy yoke
#

Adding layers is for schmucks

#

Real DL researchers use random seed tuning

brave crypt
#

@wide spear there is an issue with bc actually 😮

brave crypt
#

Imagine the shitshow when it's revealed that all these ML advances were actually bullshit

fleet sail
#

anyone know of application of nonorthogonal projectionsstare

wide spear
brave crypt
wide spear
brave crypt
#

kekw

#

its either nonsmooth like here

#

or red area everywhere then sudden drop on the rhs to blue color...

wide spear
#

Perhaps you have some right end boundary condition issues

brave crypt
#

thats what my prof told me

#

to apply neumann everywhere but rhs

#

rhs must be dirichlet for uniqueness of the solution

turbid jay
#

hey, I don't understand this mini-proof which is about the Polyak step size for convex optimisation. x* is the optimum, the function is beta-smooth. In particular, I don't get the last two steps. It's from this paper (last page): https://arxiv.org/pdf/1905.00313.pdf

eta is the polyak stepsize: $\frac{h_t}{norm(\nabla_t)_2^2} = \frac{f(x_t)-f(x^*)}{norm(\nabla_t)2^2}$ and the update step is simply: $x{t+1} = x_t - \eta_t \nabla_t$ . if I plug this in the third line, I get something similar than the second last line. And I'm clueless how they derived the last line

pine jettyBOT
#

lyinch

turbid jay
#

Maybe I should also clarify what smooth means:
$f(y) \leq f(x) + \nabla f(x)^T(y-x) + \frac{L}{2} norm{x-y}_2^2$ . It's a quadratic bound on the growth of a function (and not the usual definition)

pine jettyBOT
#

lyinch

brave crypt
#

Are there bounds on eta?

#

I wonder if the last line is eg optimizing in eta or something

turbid jay
#

not that I know of, but I haven't read the full paper. I'm only looking for one specific thing for my own proof... And for the polyak stepsize that we saw in class there are no bounds. It's just this fraction that I wrote

#

however, it depends on the gradient of f(x_t) . And this gradient is limited by the convexity and the smoothness, so maybe that's a way to derive some bounds

brave crypt
#

Ah I think if is

#

If eta equals that fraction then it is bigger than 1/beta, by ht >= blah at the top of the image you posted

#

So if eta is bounded above by something reasonable then the minimum is achieved at 1/beta

#

Which gives the last line

turbid jay
#

thank you, I got it! This seems to work out nicely

#

now I hope that I can use that result in my own exercise 🙂

turbid jay
#

I don't get it how they derive the condition $h_t \geq \frac{1}{\beta} norm(\nabla f(x_t))_2^2$ which is the prerequisite for case 3 . That's supposedly a general result for a smooth convex function, and not specific to the Polyak step size

pine jettyBOT
#

lyinch

fleet sail
#

id like to modify this function so it doesnt explicitly compute Q and saves more time in general, how do i do that? This is QR decomposition using householder

wide spear
#

You want to do a QR decomposition without computing Q?

fleet sail
#

im guess i just want a more efficient way to do this, right now im applying the householder H to M = [A I] to get the result (Hn...H1)M = [R Q^T] but for instance if im solving a least square problem Ax = b then really i just need to do back substitution on Rx = Q^Tc so is there a way to not compute explicitly the Q or shorten the process somehow to solve that system

wide spear
#

What is the goal

#

To compute QR decomp?

fleet sail
#

the goal is to use QR decomp in least square

#

so i got this block that uses the QR decomp

#

ignore the first 2 block of code really they are to identify errors

wide spear
#

Wait so you have a least squares problem min norm(Ax-b)

#

And you do QR on A to get A=QR

#

So you know that Rx=Q^Tc

#

So you still explicitly need Q?

#

Am I misunderstanding

fleet sail
#

yea that is the idea

#

but apparently this

wide spear
#

Oh

#

So instead of performing Q=H_1H_2...H_n

#

You just store what they are

#

So when you need to compute Qv

#

You just do all the matrix multiplies then

fleet sail
#

in what way is this faster though

#

or less expensive whatever that means

wide spear
#

Well the idea is that if you don't care about Q you save some time during the QR decomp

#

But then you need to do those computations once you compute Qv

#

So it saves time if you never compute Qv

#

Which is very ?????

fleet sail
#

hm

#

so it do look like these all must be computed

#

and theres no large place where i can optimize it significantly

wide spear
#

Yeah

#

Shockingly, when you compute a QR decomposition, you compute Q and R

fleet sail
#

sadcat thx

wide spear
fleet sail
#

very shocking

brave crypt
#

@wide spear

#

oooooooFFFFFF

pine jettyBOT
fleet sail
#

oh i see

#

ok thx!

wide spear
#

I see

#

That’s really something

brave crypt
wide spear
#

Oh my

brave crypt
#

PERFECTOOO

#

the prressure seems to drop for some reason in the middle

#

but the solution is DAMN SMOOTH

wide spear
#

Nice

brave crypt
#

❤️

brave crypt
#

I'm stuck on this problem currently.

wide spear
#

What did you try

brave crypt
#

Got it with some help wasn’t sure which channel this problem went into

prime kraken
#

can someone give me a hand with proximals and the jacobi method?

#

i'm trying to figure out one line in a paper where they give a result without saying anything other than they used those two concepts

#

specifically those scalings by the squared column norms of the matrix A

prime kraken
#

nvm i got it

tall solar
#

Anyone ever seen a paper that defines proximal operators on piecewise functions that are differentiable at the nodes?

azure shuttle
#

I think this question belong here? it is about kaczmarz method for least square which is not common but wonder if someone can help. My question is quite simple

wide spear
#

What is the question

azure shuttle
#

I know it can be used to solve Ax = b if A is overdetermined where there is a solution x, but how can it be translated to least square directly? like if there is no solution x to Ax = b. Someone said to just do x* = arg min norm(Ax - Ax* + Ax* - b) and then solving Ax* = b using kaczmarz. How does that produce the least square?

#

for now I can only see this as order n^2 way of approximating linear system

wide spear
#

This seems to be what you want

#

There's also the dumb idea

#

Where you just solve the normal equations using Kaczmarz

#

But that probably isn't what you want

azure shuttle
#

oh this paper has what I am looking for

#

thank you!

#

it seems regular kaczmarz has no guarantee to converge to least square as expected

willow nebula
#

find the interpolating polynomial of this function through the 5 Chebyshev nodes.
[6:20 PM]would i plug in the cosodd#*pi/2n nodes into my original f(x) for the x?

if i had 1/(1+25x^2) would I plug in the cos(stuff) into the x

to get the y for the (x,y)

#

i think im overthinking it too hard

wide spear
#

What

#

To do polynomial interpolation

#

You calculate the function at some points

#

In this case the Chebyshev nodes

#

Then you do polynomial interpolation with those points

#

So you should get 5 (x,y) pairs

#

And then you should get a degree 4 polynomial out of the interpolation

willow nebula
#

yea

#

so i plug in the cos(stuff) into the original f(x) we're given which is 1/(1+25x^2)

wide spear
#

Ok so what function are you evaluating

#

f(x)=1/(1+25x^2)

#

Ok

#

What are the 5 chebyshev nodes

willow nebula
#

cospi/10 cos3pi/10 cospi/2 cos7pi/10 cos9pi/10

#

this is all on [-1,1]

wide spear
#

Ok

#

There you go

#

Those are the 5 x values

willow nebula
#

ok

#

its just the coeffs i got for the int poly are rly ugly fractions ><

#

LMAO

wide spear
#

That's fine

willow nebula
#

but i plugged those nodes into f(x) & did the process so ig i did it right just big numbers r scary

wide spear
#

Ok

azure shuttle
wide spear
azure shuttle
#

Each householder can be applied to A to produce R while avoiding matrix-matrix multiplication, and instead you can iterate over columns c_i of A by c_i - 2(u^T*c_i)u --> c_i

#

Notice each time you do this it is ≤4*n flops

#

Because each hh is $H = I - uu^\top$, and so you can choose instead of naively do $HA$ but instead apply to each column by $Hc_i = c_i - (u^\top c_i)u$

pine jettyBOT
#

eonian

wide spear
#

Oh nice

#

Well

#

Yes

#

The householder matrix is low rank

#

And doesn't have a lot of information

azure shuttle
#

I implemented one in python that uses both of this just now, I was interested and want to practice

#

The overall is O(n^2m) for solving the minimization

wide spear
#

n^(2m) or (n^2)m

#

Parentheses are important

azure shuttle
#

The latter, the former would be crazy

wide spear
#

Yes

azure shuttle
#

No , I think anticipation has it above

craggy mauve
#

henlo, my professor left as an exercise to my class to estimate a value for pi using the monte carlo method (drawing a unit circle inside the unit square, distributing n points inside the square and counting those who fall inside the circle)

the catch is that we were also asked to provide a thought process for choosing n such that our approximation is accurate to 0.05%

  1. what does it even mean to be accurate by 0.05%?
  2. any tips on how to choose n? of course, without using the try and error approach
wide spear
#

It means that your estimated value of pi is within 0.05% of the true value of pi

#

You should have some theorems giving you convergence bounds

craggy mauve
#

well talking about precision in percentage is still a bit confusing but i'll get the idea eventually

#

i checked the reference textbooks for this class and couldnt find anything useful on convergence

#

do you know any references (links, books etc) i could use?

wide spear
#

Central Limit Theorem

fleet sail
#

wow i think i see what u mean

tall solar
#

Anyone know where to find a rigorous proof that l1 minimization enforces sparsity?

brave crypt
#

Maybe this is really just a proof of lagrange multipliers

#

"enforces" sparsity is sort of a too strong word, more like pushes towards sparsity. I think a formalization of what this even means will probably just come down to the fact that a ball in l1 is pointy

wide spear
#

Hmmmm

#

This reminds me of the Pontryagin Maximum Principle

#

But this is a completely different setting

#

Or

#

Maybe not so much

#

Well

#

It's optimal control theory

prime kraken
#

@tall solar most proofs hinge on the strong equivalence of the l0 pseudo norm and the l1 norm

#

if the l0 version of the problem has a unique solution, the convex relaxation, which is the l1 norm (the convex hull of the l0 one), finds the same solution

#

so they usually work with the "spark", "girth", or "kruskal rank" of the matrix in the noiseless case, or with the null space property or restricted isometry property for stable recovery with "noise"

turbid jay
#

how can I prove that $\log(\sum_i \exp(a_ix+b_i))$ is convex? I guess I could use Hölders inequality somehow, but I don't know what to do with the affine function in the exponential.

pine jettyBOT
#

lyinch

turbid jay
#

this might sound dumb... But I could just extend the dimension (homogeneous coordinates?) by one and use $x' = [x,1]$ and $a_i' = [a_i,b_i]$ and then I have a new identical function: $\log(\sum_i \exp(a_i'x'))$ in $d+1$ dimensions on which I can apply Hölders inequality. Would that work?

pine jettyBOT
#

lyinch

brave crypt
#

i'm not sure about using holder's inequality, but i think taking a second derivatives is not that bad, and it reduces to cauchy's inequality

fleet sail
#

unless i did it wrong or didnt see something

#

where x_1 = e^(a_1x+b) or whatever

brave crypt
#

yeah this is cauchy with (a1 sqrt(x1), ..., an sqrt(xn)) and (sqrt(x1), ..., sqrt(xn))

fleet sail
#

ah i c ic

turbid jay
#

I think I got it via Hölder's inequality, but it's an ugly mess. I'll now try the second derivatives as you mentioned

turbid jay
#

that's the monstrosity I did

#

1: I just substitute the exp() to make it easier to read and in 2 I apply Hölders inequality: $\sum_i^n \norm*{x_iy_i}_1 \leq \left( \sum_i^n \abs{x_i}^p \right)^{\frac{1}{p}} \left( \sum_i^n \abs{x_i}^q \right)^{\frac{1}{q}}$ where $\frac{1}{p} + \frac{1}{q} = 1$ with $\frac{1}{p} = \lambda$ and $\frac{1}{q} = (1-\lambda)$, $0 \leq \lambda \leq 1$

pine jettyBOT
#

lyinch

brave crypt
#

nice. oh, is x a (d dimensional)vector?

turbid jay
#

euhm yes, I probably should have given more context 🙄

brave crypt
#

ah, i guess 2nd derivative/hessian will be a mess then

turbid jay
#

f is from R^d -> R, and a_i^T are the rows of a matrix of mxd

#

that was apparently the easy part, now I have to prove smoothness and then that there's a minimizer 😄 Thanks for your help!

prime kraken
#

i think the path they expected you to use was that exp is convex, composition with affine mappings is convex, nonnegative weighted sums preserve convexity, and composition with a monotonically increasing scalar func (log) is convex

#

if you've already shown that, anyway

#

because showing those things independently is a lot easier than going HAM with the hessian here

#

😛

turbid jay
#

I think we've only shown that composition of convex function with an affine function preserves convexity... But I agree, maybe I could show the other properties

prime kraken
#

these are classic "cookbook" properties, you could probably google the proofs

#

this is the first thing i found, it has some short handwavy proofs

brave crypt
#

ah, i was gonna say that is smarter. but can we apply it actually? f(x)=h(g(x)), and g must be convex and h must be convex and non-decreasing

prime kraken
#

maybe it's one of those log-convex instead of using just composition

#

i don't recall which one exactly it is

#

hmm seems i was mistaken about the last step, you have to show it's log convex indeed

fleet sail
#

Does conjugate gradient have a chance to fail if (A^T)A is only pos semidefinite?

#

because maybe the line search doesnt yield a min point and instead gives inflection point, but perhaps that is really rare?

#

fail in the sense that it doesn't converge in dimV iteration

prime kraken
#

inflection points are indefinite

#

semidefinite means it's rank deficient and has infinitely many solutions

#

you'll find one of them, but i'm not sure what happens if the observed vector is not exactly in the image of A

fleet sail
#

if you are referring to (A^T)A having infinite solutions

#

then i dont think its always true, semi definite doesnt mean there is a 0 eigenvalue

prime kraken
#

positive definiteness is directly related to the eigenvalues

#

it pretty much means all the eigenvalues are > 0

#

semidefiniteness, >= 0

#

so it might be rank defficient

#

you're minimizing something like Ax - b, right? you'd need b to be in the image of A for it to have a unique solution

fleet sail
#

yea, so instead im solving the normal equation

prime kraken
#

if A is not full column rank, A^T A is rank defficient

fleet sail
#

yea, ok

#

but if A is full rank then A^T A must be PD?

prime kraken
#

yes

fleet sail
#

hm interesting

prime kraken
#

full column rank

#

A could be a tall matrix

fleet sail
#

because there exist A^T A positive semidefinite but invertible

#

or i mean

#

not A^T A

#

but

#

a general matrix

#

So like a general symmetric positive semidefinite matrix can be invertible

prime kraken
#

depends what you mean by invertible

#

you can get an exact solution if b is in the image of A

#

if it isn't, you'll get some projection

fleet sail
#

yea i mean as a square matrix totally invertible

prime kraken
#

if it's invertible it's pos def

#

all eig vals > 0

#

(if we're still talking about A^T A)

fleet sail
#

ok thanks, ill think about it

#

right i was just confusing myself

#

since if something was just symmetric, positive semidefinite but not positive definite then it must have a 0 eigenvalue

prime kraken
#

it "may", not necessarily must

fleet sail
#

why not, if it is symmetric

prime kraken
#

but yeah, it won't have inflection points

#

well, it's >= 0

#

0 is also >= 0

#

😛

fleet sail
#

but if it was all > 0 then it would be positive definite

prime kraken
#

positive definite mats are also positive semidefinite

fleet sail
#

ah i meant

#

positive semidefinite but not positive definite

prime kraken
#

ah

#

then yes

fleet sail
#

aight nice everything make sense then

#

thx

prime kraken
#

k

bright palm
#

This sounds silly and is maybe a #calculus question (if so LMK and I'll move it)

#

I'm having trouble wrapping my head around the 2n-1 degree exactness in gaussian quadrature

#

I can sort of get my head around that if we have n x's, we can easily get n degree accuracy

#

but why 2n-1

pine jettyBOT
#

jan Niku

bright palm
#

er, sorry

#

I've made a typo, that should say f(x) is a fifth degree polynomial

#

2n-1 degree exactness, and all

fleet sail
# bright palm er, sorry

the proof is quite involved but smart at the same time, gaussian quadrature uses legendre polynomials

bright palm
#

hmm okay

#

maybe ill try to suffer through a proof

#

maybe just intuition is enough at this point though 😄

#

it sort of makes sense that 6 unknowns should give you 5 degrees of freedom

fleet sail
#

ive got a "proof" but it doesn't prove how to find the orthogonal polynomial

#

but other than that it is a proof

#

that you can reference

bright palm
#

If it's easy to link, sure

#

I may ask my teacher exactly what level of comprehension I should be aiming at

fleet sail
bright palm
#

thank you catblush

fleet sail
#

start from pg5

bright palm
#

yea, we are tasked with this in the homework, which is not so bad

#

the mechanical portions arent too painful, the exactness and the transformation are conceptually strange to me though

fleet sail
#

only that the system is not linear wew

#

so it gets hard really quickly

#

wait i c hold up

bright palm
#

we just use newtons for systems in our class

#

im sure that means this stuff is curated so that we get nice answers using this method

#

but it works for now

wide spear
bright palm
#

would someone be willing to help me with a transform blobsweat

#

I'm looking at this problem

#

n=3 gaussian quadrature coefficients over -1 to 1

#

so if id like to integrate over 0 to 1 instead, each weight should become 1/2 of its original value?

wide spear
#

I think that these weights are for 0 to 1

#

For -1 to 1 you have x_i at -sqrt(3/5), 0, and sqrt(3/5)

bright palm
#

oh 🤔

#

maybe I can ask for a different problem then

wide spear
bright palm
#

well wait

#

so if we moved from -1 to 1

wide spear
#

Ok

bright palm
#

we begin with $\pm \sqrt{ \sfrac{3}{5}}, 0$

#

as weights

pine jettyBOT
#

jan Niku

wide spear
#

These aren't the weights

#

These are the points

#

The weight at 0 is 8/9 and the weight at the other two points is 5/9

bright palm
#

okay

#

so as we move from -1 to 1

#

each weight should become 1/2 of its original value

#

as (b-a)/2 = 1/2

#

so 4/9 and then endpoints of 5/18

wide spear
#

Yes when you move to the interval [0,1]

bright palm
#

sorry im at work

#

but this doesnt seem complicated

#

i think i just lack confidence

#

thanks for the help

wide spear
fleet sail
#

Here's my current proof that the line search directions of conjugate gradient are Q_A = A^T A conjugate

#

I believe a problem is that I assumed a_i ≠ 0

#

any way to get around it our justify if that's fine? btw a_i is this

#

it's the optimal scaling value for each line search in the Q_A conjugate direction

#

so the only thing I realize is if a_k = 0 then r^(k) and d^(k) are orthogonal and also means there's no improvement in that particular iteration

#

which I didn't say clearly in the proof, I just said "which means least squares is solved" which isn't the case

wide spear
fleet sail
pine jettyBOT
#

Anticipation

wide spear
#

Implemented practically, calling matrix-vector multiplies will probably be the fastest

#

Because you'll be calling BLAS routines

#

Of course, if you know more about A, you might be able to do something

#

Doing two mat-vec mult is O(n^2)

#

I think 4n^2?

#

Doing anything like decomposing A will be more expensive

fleet sail
#

ic

#

yea O(n^2)

#

so its like it makes no difference if A induces an inner product or notsadcat

#

tfw u want to solve a linear system exactly under O(n^3)realshit looks like law of nature dont let u

wide spear
#

Well if A is dense with no structure, there is no hope

#

But

#

This is why I'm giving a talk!

fleet sail
#

nice

brave crypt
#

But eg if you are doing this multiple times for many v, then maybe there is a trick like a decomposition. Or even if A is changing, if the decomposition plays nice with the updates, then maybe it is still good

#

(I guess A is an inner product means it is positive definite?)

wide spear
#

Yes

#

I think A is constant for CG

#

You would need to do v^TAv at least n times for a decomposition to make sense

#

And the point of CG is to use less steps than that

brave crypt
#

Oh

wide spear
#

CG, as a member of the Krylov Subspace family of methods, will give the exact solution in n steps

#

But this is equivalent to doing LU directly

#

However, it will converge must faster to a solution

#

So you can get convergence within machine epsilon well before this point

prime kraken
#

will you cover sparse regularization in your talk?

wide spear
#

I'll cover some sparse stuff

#

Probably not regularization though

#

It's only an hour

#

And there's a lot of stuff to potentially cover

prime kraken
#

fair enough, yeah

fleet sail
normal lava
#

can someone help me with this optimzation problem

prime kraken
#

if all you have to do is verify, you can multiply the two expressions and show you get an identity matrix

normal lava
#

Just multiply B(k+1) and B(k+1)^(-1)?

wide spear
#

Yeah

normal lava
#

will the result be so complicated?

wide spear
#

It will be reasonably complicated

#

But it should simplify to the identity matrix

normal lava
#

I try the multiplication but it isnt look like an identity matrix

wide spear
#

Well what do you have?

normal lava
#

a long expression, maybe i dont know how to simplify it

#

would you mind to show me how to do it?

#

Is there any tactics on doing the multiplication?

brave crypt
#

Maybe you can do it as two rank 1 updates, using the woodbury formula. But otherwise doing the multiplication carefully is the only "tactic" I can think of

fleet sail
#

one thing im consered with conj grad for least square is its slow in the end

#

concerned*

#

cuz u always have to compute A^T A

wide spear
#

Well yeah

#

So you use something like GMRES for least squares

#

And not CG

#

You never want to solve the normal equations

fleet sail
#

ic our prof recommended conj grad, it seems like the course is not going in depth enough

#

but assuming we already solved A^T A then its good ig

#

cuz that multiplication takes order n^2 * m i think which is just bad

wide spear
#

Another thing is that it doubles the condition number so iterative methods take twice as long to converge (roughly)

fleet sail
wide spear
#

Turning A into A^TA

fleet sail
#

oh ok thx

prime kraken
#

what definition of condition number are you using?

wide spear
#

Like

#

The condition number of a matrix

prime kraken
#

doesn't A^T A square the singular values?

#

not multiply by 2

wide spear
#

Yeah it squares them

#

But iterative methods have runtime proportional to log condition number

#

Or something

prime kraken
#

ah yes

wide spear
#

So twice as long

prime kraken
#

yes yes

wide spear
#

Yes

prime kraken
#

yes

wide spear
#

Yes

random hornet
#

Yes

fervent ermine
#

Yes

brave crypt
#

YES!!!

azure shuttle
#

yes

dark sinew
#

Yes?

wide spear
fleet sail
uneven agate
#

Good afternoon. I've been having a rough time with this question. Let $a,b,c$ be complex numbers, $a\neq b,c\neq 0$, and let
\begin{align*}
w=f(z):=\frac{1}{c}(e^{az}-e^{bz}).
\end{align*}
Show that the inverse function near $w=0$ is represented by
\begin{align*}
z=f^{[-1]}(w)=\frac{1}{d}\sum_{n=1}^\infty \frac{(-1)^{n-1}c^n}{n!}\left(\frac{nb}{d}+1\right)_{n-1}w^n,
\end{align*}
where $d:=a-b$. Derive as special cases the representations for the inverses of the functions
\begin{align*}
w=\sin z,\quad w=e^z-1,\quad w=e^{\alpha z}\sin \beta z,
\end{align*}
and, as the limiting case $a=b+c$, $c\to 0$, $w=ze^{bz}$. [To compute residues use formal integration by parts; see Problem $1.8.6$.]

pine jettyBOT
#

TheRedLotus

wide spear
uneven agate
#

My bad. Didn't see that channel.

wooden tendon
#

Hello,
I would like to plot a damped sinusoidal frequency spectrum in matlab
here is the equation of this damped sinusoidal
\begin{align}
$e^{-2t} . sin(4\pi t)$
\end{align}
the fourier transform of such signal would be that if i'm not wrong
\begin{align}
$\frac{4\pi}{(2+i2\pi f)^2 + (4\pi)^2}$
\end{align}
but i don't know what to do with that

pine jettyBOT
#

DawnUltra

wooden tendon
#

here is my matlab code

function [] = damped_sin()
x = 0:1:5;
a = zeros(1,numel(x));
a = (4*pi)/(square(2+1i*2*pi*x) + (2*pi*2)^(2))
plot(x,a);
title("damped sinusoidal spectrum")
xlabel("frequency")
ylabel("amplitude")
wide spear
#

Ask there

wooden tendon
#

ok

rotund terrace
#

can somebody eli5 me Forrest Tomlin update from simplex method?

brave crypt
brave crypt
wooden tendon
#

ok

#

dvanapasa do you think my fourier transform is correct by the way ?

brave crypt
wooden tendon
#

no problem

turbid jay
brave crypt
#

I think there is a common technique of proving that a function is convex and... "Forcing", maybe it was called? The property is that f(x)->infty as |x|->infty

#

Ah sorry, maybe this doesn't work? Yeah, I don't think this works, hm

turbid jay
#

I've already proven convexity, but this tells me nothing about the existence of a minimizer

brave crypt
#

What conditions do you have on the ai? (Nothing?)

turbid jay
#

none, they are the rows of a matrix

#

let me double check this

sage vapor
#

what if m=1 ?

turbid jay
#

no conditions on A and b

brave crypt
#

I think intuitively if the ai "point in many directions" this will have a minimum (in R this would just be pointing in the positive and negative direction)

sage vapor
#

I think you need m > number of dimensions

brave crypt
#

Ah, the property I was thinking of is "coercive"

turbid jay
#

maybe it doesn't always have an optimum...

As it is smooth, we expect the region around the optimum to be well approximated by a quadratic (assuming the optimum exists)

brave crypt
turbid jay
#

ah yes, I read about coercive functions but don't know too much about it

#

maybe it's a good time now to look more into this

#

x \in R^n

#

rho is just a scalar parameter

#

A \in R^{mxd} and b \in R^m

brave crypt
#

So yeah, I think ai pointing in many directions should imply this function is coercive => minimum exists. But maybe there are weaker conditions under which a minimum ecists

sage vapor
#

no that should be an equivalence here

#

if you find a direction where all the scalar products are negative well you can get as close to -infinity as you want

brave crypt
#

Yeah I think it sounds right, although I'm not sure eg how many ai are necessary to do this

prime kraken
#

doesn't convexity on its own imply the existence of a minimizer?

#

just not unique

brave crypt
#

But linear functions are convex, exponential is convex, etc

prime kraken
#

and they have a minimum

brave crypt
#

I guess eg in ML world probably all loss functions are coercive, so there is an implication convex implies minimum exists

brave crypt
prime kraken
#

you write R/Rn as a quotient or just to mean R^n

brave crypt
#

Just as in R or Rn (really n=1 case), no quotient going on

prime kraken
#

then yeah, convexity implies there is a minimum

brave crypt
#

I think there is just a miscommunication, linear function obviously goes down to -infinity

prime kraken
#

because the function is strictly non-decreasing

#

it won't be unique in general, but they will all be gathered in a single region

brave crypt
#

You are saying: f is a function from Rn to R, then f convex implies f has a minimum?

prime kraken
#

yeah

#

or rather, than any local minimum will also be global

brave crypt
#

Yes, I totally agree with local implies global

prime kraken
#

aside from that, the minimum would be at a boundary if not within the domain

#

you would have to either find where the gradient is a 0 vector, because that is necessary and sufficient due to what we just said, or find the optimum along the boundary of the domain if the gradient does not become 0

turbid jay
brave crypt
#

I was gonna say "one of us is on drugs 🙂 " but I am already drinking

prime kraken
#

checking again, strict convexity does not imply it either

#

😛

turbid jay
#

there is strong and there is strict convex. Let me dig up the definitions

brave crypt
#

Strong convexity implies it is lower bounded by a quadratic right?

prime kraken
#

i might've mixed them up

turbid jay
#

strong convexity bounds the function from below

#

but I don't know if this function is strong convex. Maybe that's a starting point 🙂

brave crypt
#

But we already know it is not strong convex without some conditions on ai

prime kraken
#

a quick google-fu says you can use strict convexity and the function being coercive to show the minimum exists

brave crypt
#

Yeah, and coercive should mean that ai point in "many" directions, ie every vector in Rn has a positive dot product with some ai

turbid jay
#

I have to read a bit more into coercive functions, I think that is the correct way to show it

sonic shuttle
turbid jay
cobalt lintel
#

Hello! So I am trying to write a program that draws my teacher using epicycles as a way of saying thanks. I have noticed that a lot of programs use the discrete fourier transform for this, however, I am not that comfortable with this method so I am here to ask a few questions. So let's say that I somehow get a lot coordinates of a simple picture that I want to use. Then I should use the DFT to get the complex points using the formula above. But how do I proceed from there? How do I calculate the radius of each circle? How fast should they spin?

cobalt lintel
#

Okay so from this I guess that the frequency of each circle would be k/N and that the amplitude (radius) would be 1/N*sqrt(Re^2+Im^2), is this correct?

cobalt lintel
#

So this is my guess. Let's say that I have the first circle, then I should add another circle that spins around the first circles circumference. The middle point of that circle should be x = r cos(2pi k/Nt), y=rsin(2pi k/Nt) where t is the time. But how much should t increase by for each loop? Can you just pick that an arbitrary number, say 0.5, that the time should increase by? I.e t+=0.5 each loop?

wide spear
#

The DFT is not appropriate for this

#

Instead you want to be taking a Fourier series

cobalt lintel
#

why?

wide spear
#

So the way they do the fancy animations is they make the image into a periodic function in polar coordinates

cobalt lintel
#

oh oops...

#

But a lot of programs that I've seen use the DFT on both the x-coordinates and the y-coordinates

#

So what is the difference between DFT and fourier series?

#

Okay, thanks for the info! But can't I just a get a simple outline of by teachers face and somehow get the x and y coordinates of all those points and perform the DFT on both the x and y coordinates separately? Wouldn't the complex numbers generated by the DFT provide all the necessary info for the circles etc.?

#

Yeah true... But I am not that confident with polar coordinates

#

Oh yeah, that's true. I will definitely give this method a try! Thank you so much for the information!

#

What is interpolate? Taking the mean value?

fleet sail
#

i guess theres a typo here?

#

heres original matrix, i thought to clear out the 0

#

you need to use b1 rather than b2

#

and would the subsequent givens rotation just be Gk = G(k,k+1,theta)

wide spear
#

They are doing a Givens rotation for the upper left 2x2 block

#

That's why the b_1 in position A_12 becomes q_1

#

I think

#

Hmmmm

fleet sail
#

it says choose theta satisfying that thing

#

but i thought it should be b1

#

there

#

instead of b2

wide spear
#

That's true

#

Hmmmmm

#

Well, do you understand the point of Givens rotations

fleet sail
#

yea

#

i think i got it though theres 2 typos

#

like I know its to triangularize and here its O(n) time apparently

#

the one thing i dont quite sure

#

is when you get (Q^T)A = R after performing the sequence of givens right

#

then A = QR, but why is it that when you do RQ = B, then B is again tridiagonal, I guess just prove it inductively by looking at each givens rot and look at the things elementwise?

fleet sail
#

ok i figured out the implementation

#

i just needhelp to prove QRQ^T is also tridiagonal

fleet sail
#

Ok i also figured out theoretically why

#

its cuz B = RQ = Q^TAQ is symmetric and u can show that when u apply the sequence of givens rotation on R from the right, the element below subdiagonal below main is maintained 0 because A is tridiagonal

#

damn QR diagonalization is epic

naive creek
#

Hello, I have a question about preconditioning in numerical linear algebra. On Wikipedia and other sites I found that solving a system of linear equations Ax=b where the condition number of A is low can lead to a fast rate of convergence for an iterative method of solving Ax=b. However, I can't seem to find or understand why a low condition number leads to a fast convergence. Does anybody have good source which explains this or does anyone mind to explain themselves? Thanks in advance!

naive creek
#

Thanks for the link! unfortunately I'm not really able find the answer with the information. The explanation also seems quiet advanced and the question connected with the course is an introductory numerical maths course. So I think there could be an 'easier' explanation for this.

naive creek
#

Mmm yeah sorry advanced isn't quite right to say. tbh I don't really understand the answer and I think that this is not the answer they would expect in my situation since the explanation is not very related to the content of the course. I was more thinking about finding a formula or so that includes the condition number of the matrix and a measurement for the speed or convergence.

#

Not really, we should be able to find the answer for this question with the content that we saw. (But I'm not really able to find it with the course material I have.)

prime kraken
#

you can interpret it as being related to the lipschitz constant of the gradient of the function $\frac{1}{2}\Vert Ax - B \Vert_2^2$

pine jettyBOT
prime kraken
#

if all of the singular values of A are the same, the gradient is equally smooth in all directions. if not, then the condition number increases and some dimensions change more slowly than others, but you have to account for the worst case scenario to guarantee convergence

#

this usually results in iterative updates that change some dimensions very slowly

naive creek
#

Thank you! I think I'm getting it. The formula is also very helpful 🙂

mental grail
#

I'm trying to decide which courses to take next semester, and I can't decide whether two in particular would be helpful. The first course gives an introduction to measure theory with a slight focus on its connection to probability theory, and the second course introduces functional analysis. I'm interested in machine learning and its foundational fields, so I'm curious whether these courses would be useful there.

#

(I hope this is the right channel, otherwise please point me to the correct one)

#

I am specializing in robotics/mechatronics/control theory, so if either course is useful there, that'd be great too

wide spear
mental grail
#

Thanks!

tall solar
#

Hey svd question
say you have X=USV^T

I'm reading somewhere that the "orthogonal projection onto the column space of X is UU^T

What does that even mean?? UU^T should be an identity ??

wide spear
#

Yes, UU^T is the identity

orchid sequoia
#

is it reduced svd?

tall solar
#

I don't think it being reduced has to do with anything.

It doesn't seem like projecting onto the column space changes anything wth

brave crypt
#

I recall there are multiple SVDs, and some of them are such that UU^T are identity and some are not, or something like that

orchid sequoia
#

yes, indeed

#

ig in this case UU^T could be not identity, as X can be singular

fleet sail
#

i think its always identity?

#

cuz AAT is diagonalizable

brave crypt
#

I guess it sounds sort of right, if X is surjective then probably this means good shit happens, and UU^T is the identity, maybe, and so is orthogonal projection

orchid sequoia
#

U^T U is always identity

orchid sequoia
tall solar
#

See I wanna take a matrix X1 and another X2

Then find the closest matrix to X2 that has the same row and column space as X1

fleet sail
#

right i think U and V i was thinking are square

brave crypt
#

I remember being sad because I was working on an exercise that talks about SVD, but they must have meant a different SVD than the normal one because the shit they were saying did not make sense

orchid sequoia
#

lol

tall solar
#

I saw another formula for the "projection onto the column space of X"
That was like $ X*(X^{T} *X)^{-1} X^{T}$
But if you use the svd X=USV^T
It all reduces to nothing no matter what

Weird stuff

tall solar
#

Nvm it works out I'm insane. You guys were right it works

prime kraken
#

this was already a while ago, but for completeness, U U^T is the identity if U is square, since then U contains the basis of the column space and the left null space. normally though, you ignore the singular vectors corresponding to the 0 singular values since you want to, as people said above, project onto the column space. in this case, U U^T is not an identity. this latter one is called "economy-size SVD"

#

so it's usually a good idea to say U = [U_c U_l], showing explicitly which part is a basis for the column space and which part for the left null space

#

then U U^H is always an identity, and U_c U_c^H (economy or reduced svd) is an identity if the matrix is full rank

brave crypt
#

idk if this is the channel but

#

Imagine i have an image, and i have a mask for it

#

what is the operation i need to do to leave the resulting background as white?

#

Like this

#

But with the background white

#

the masked image is the third bird

wide spear
brave crypt
#

ty

tall solar
# prime kraken then U U^H is always an identity, and U_c U_c^H (economy or reduced svd) is an i...

Yes! I'm a bit embarrassed I didn't realize this at first lol.

I was curious about projecting svd of one natural image onto another. Then I realized what you're saying now. In a natural image all the sv's will typically be nonzero (or error or whatever reason). I was upset I kept getting the same image but then I realized since they were full rank they have exactly the same column and row space.

brave crypt
#
for i in range(N):
  for j in range(M):
    if not mask[i*M+j]:
      image[i*M+j] = 0x88DDDA

:tinkTonk:
Holy shit that was painful to type on phone lmao

wide spear
tall solar
#

Btw
I ended up truncating the singular values of each natural image then performing the projection. But when I did the projection between a dog and a cat photo nothing particularly cool happened.

I wanted to see a cat dog :(

wide spear
#

Conclusion cat=dog

prime kraken
#

the basis was probably similar

tall solar
#

Yeah if you keep like 5 singular values you might get lucky with certain photos and keep some color of the other

prime kraken
#

you'd have to do a very low rank approx to see anything neat, maybe

#

yea

#

if youd had several examples, you can do something more similar to that

#

like a basis for several dog images

#

this is more like the so-called eigenfaces alg, which is pca, which is svd in a trench coat

tall solar
#

Yes I was thinking the same thing! I have found a dataset utkFace of a bunch of centered face images.

I wonder if I can turn a man into a lady

prime kraken
#

that sort of classifier should work something like doing a projection and then measuring the error

#

that projection should be most similar component of the original image to those in the basis you found

#

very roughly, anyway. in ML, the nonlinear activation functions let the result be more sophisticated

tall solar
#

Nice yeah machine learning is cool lol.

Before ml I wanted to try some tensor factorization on color images. I read about this cool factorization called the t svd that I'm really vibing with.

prime kraken
#

i only know HOSVD and PARAFAC

#

never heard of T SVD before

#

looks interesting

wide spear
tall solar
#

what's cool about it is that it essentially works by applying svd in the Fourier domain.

There's a block matrix associated with a 3d tensor called the block circulant that has the same spectra than the Fourier modes

brave crypt
#

HO-SVD tinkTonk

wide spear
#

Ah yes

#

Circulant matrices

prime kraken
#

i'd have to see how the circulant is constructed, but yeah

#

all circulant mats are diagonalized by fourier matrices

wide spear
#

Something something no screenshot

tall solar
#

Lol got you

wide spear
#

@brave crypt was there a specific section you had questions about

brave crypt
#

Hello

#

Please wait.

#

These two equations:

wide spear
#

Ok so let's look at the first one first

brave crypt
#

Sure 🙂

wide spear
#

Ok

#

So I'm reading this

#

And I'm not entirely sure if X and W are matrices or vectors

#

Which is clearly an oversight on the part of the authors

#

Some more experienced with ML may be able to deduce this

brave crypt
#

I have a book, let me see if I can find annotation for it.

wide spear
#

Honestly

#

They don't even share a github repo

brave crypt
#

I am sorry.

wide spear
#

I mean

#

It's not your fault

brave crypt
#

Yes, but still.

wide spear
#

Anyways

brave crypt
#

W is Weight Matrix.

wide spear
#

$\sum_{j=1}X[i+jr]W[j]$ is a convolution

pine jettyBOT
#

Angetenar

brave crypt
#

x: Input Vector

wide spear
#

Lower case x or capital X

#

I think that X here is also a matrix

brave crypt
#

Then you must be right.

wide spear
#

This paper is so bad

#

They don't define anything

#

Anyways

#

Do you understand how convolutions work

brave crypt
#

If you can explain me briefly then it would be really helpful.

wide spear
#

Ok

brave crypt
#

Thanks 🙂

wide spear
#

Let's consider the 3 by 3 matrix $\begin{bmatrix}1&2&3\4&5&6\7&8&9\end{bmatrix}$ and the 2 by 2 kernel $\begin{bmatrix}-1&1\-2&2\end{bmatrix}$

pine jettyBOT
#

Angetenar

wide spear
#

Then, the convolution of this would be $\begin{bmatrix}-1\cross1+1\cross2-2\cross4+2\cross5&-1\cross2+1\cross3-2\cross5+2\cross6\-1\cross4+1\cross5-2\cross7+2\cross8&-1\cross5+1\cross6-2\cross8+2\cross9\end{bmatrix}$

pine jettyBOT
#

Angetenar

wide spear
#

Do you see what has happened

#

We take the kernel (also called a filter) and we slide it through the matrix

#

We start at the top left

#

And we take all the elementwise products

#

And then add them all together

#

And then we slide the filter over by 1

#

And then we reach the end of the row so we move it down 1 and start over from the left

#

And then we slide over by 1 again

#

And then we're done

#

So for a filter

#

We also have two strides, sigma_x and sigma_y, which determine how much you slide in the x and y directions

#

You also have a dilation term

#

Which determines how big the kernel is when mapped onto the matrix

#

In this example I worked out, the dilation would be 1

#

However, if the dilation were 2, for example, we would have the result of the convolution as $\begin{bmatrix}-1\cross1+1\cross3-2\cross7+2\cross9\end{bmatrix}$

pine jettyBOT
#

Angetenar

wide spear
#

Does this make sense

brave crypt
#

Yes, we will skip it.

#

2 positions 🙂

wide spear
#

In essence, the kernel $\begin{bmatrix}-1&1\-2&2\end{bmatrix}$ with dilation 2 is equivalent to $\begin{bmatrix}-1&0&1\0&0&0\-2&0&2\end{bmatrix}$

#

Anyways

#

Ok

pine jettyBOT
#

Angetenar

wide spear
#

Ok

#

So now that we know what a convolution is

brave crypt
#

Yes 👍

#

Thank you for explaining that.

#

😀

wide spear
#

$\sum_{j=1}X[i+jr]W[j]$ computes a convolution for a single output $Y(i)$

pine jettyBOT
#

Angetenar

wide spear
#

Y is a matrix I think

#

Their notation sucks

#

Anyways

#

Once we compute this

#

We apply ReLU

#

Which is $Re\qty(\sum_jX[i+jr]W[j])=\max\left{0,\sum_jX[i+jr]W[j]\right}$

pine jettyBOT
#

Angetenar

wide spear
#

Ok

#

You know what relu is right

#

If something is positive, it stays the same

brave crypt
#

Yes, max(0, N)

wide spear
#

And if something is negative, it becomes 0

#

Yep

#

Ok

#

After we apply ReLU, we batch normalize

#

Do you know what batch normalize means

brave crypt
#

Where I can take material to study about applied-computational-math?

#

Nope. I am sorry.

wide spear
#

Ok

#

Batch normalize means that you shift and scale the data that so that it has mean 0 and standard deviation 1

#

What sort of applied math are you interested in?

brave crypt
#

Cool

wide spear
#

And to do this, we need the two parameters beta and gamma

#

Does the first formula make sense now

brave crypt
#

Got it.

#

Thank you very much 👍

wide spear
#

Ok now for the second formula

brave crypt
#

Yay.

#

Thank you for this 🙂

wide spear
#

Do you know what a loss function is

brave crypt
#

I apologize for leaving.

#

I think it's related to gradient descendant to achieve local minima.

wide spear
#

Yes

#

A loss function measures how bad your model is doing

brave crypt
#

I see.

wide spear
#

N is the total number of pixels

#

Sure, straightforward

brave crypt
#

And would you please tell me how it measures that?

wide spear
#

P_{n,m} is what the model predicts for pixel n at scale m

#

Notice in the model architecture in figure 2, there are three down sampling layers so you have a total of 4 scales

brave crypt
#

Yes, I am sorry I have two questions.

#

May I ask you?

wide spear
#

Let me finish this explanation first

brave crypt
#

Sure. 😀

wide spear
#

G_n is the true label for pixel n

#

In the loss function, we compute $\sum_{m=1}^4\qty(1-\sum_{n=1}^N\frac{2G_nP_{n,m}}{G_n+P_{n,m}+\eps})$

pine jettyBOT
#

Angetenar

wide spear
#

The outer sum is straight forwards, because we want the errors across all 4 scales right

#

Now for the thing inside

brave crypt
#

Yes

wide spear
#

We sum the error across all the pixels

#

Sensible

#

At each pixel

#

We have this Sorensen-Dice coefficient

#

Which is used to gauge how similar two things are

#

Tbh the paper doesn't really make sense here

#

In my opinion

#

Because you don't specify what values Gn and Pnm could take on

#

Oh hitbox

#

You are here

#

Don't you know ML

#

Have you heard of Sorensen-Dice loss before

#

Rip

#

Have fun coding

brave crypt
#

This is how it starts:

#
class DiceLossVariants(losses.Loss):
    
    def __init__(self, *args, **kwargs):
        super_args = dict()
        if 'reduction' in kwargs.keys():
            super_args['reduction'] = kwargs['reduction']
        if 'name' in kwargs.keys():
            super_args['name'] = kwargs['name']
        super().__init__(**super_args)
        if len(args) > 0:
            self.loss_name = args[0]
        elif 'loss_name' in kwargs.keys():
            self.loss_name = kwargs['loss_name']
wide spear
#

There's this paper

#

And it has some formulas

#

But it doesn't really define anything formally enough for the formulas to make sense

#

Here is the paper

brave crypt
#

😀

wide spear
#

We are looking at formula (2) right now

#

Unclear

#

Ranges are unspecified

#

I was thinking that because this is binary classification, they would be 0 or 1

#

Which is why there would be an epsilon in the denominator

brave crypt
#

Lol

#

Yes, there is dilation.

wide spear
#

The prediction for pixel n at scale m

brave crypt
#

May I know what you mean by predict as big as possible?

#

Sorry, I am very new in this.

#

Yes.

#

Yes, Gradient Descendant.

#

I am sorry.

#

Sure.

#

advanced_losses.py

#

No worries.

#

Thank you for trying 😀

#

No issues 😀

#

Though do you know why we generate accuracy matrices for test dataset?

#

From that repository, I got three matrices.

#

But I don't understand why we need matrices for test and training.

#

Ohh. We have used that for other datasets.

brave crypt
#

I am sorry, I didn't know the right word.

#

I see.

#

One last question:

#

I got IoU: 0.42 for train, test and validation.

#

Is that bad?

#

😀

#

I see.

#

Thank you for your help 😀

#

@wide spear Thank you for explaining me things patiently.

#

😊

wide spear
#

Lol rip hitbox

#

Oh nevermind

brave crypt
#

Haha

#

Nobody likes Tensorflow.

#

May I ask one more question?

wide spear
brave crypt
#

Thanks

#

Ohh, I figured it out.

#

Thanks @wide spear 😀

wide spear
slow niche
#

Does anyone know what a "dependence set" is? (in the context of a matrix).

#

I'm reading a paper and the define it like:

#
for the upper characteristic we define depsU(i)as the set of all the indexes j such that the coefficient mi,j of the matrix M (of the linearlayer) is non-zero.
#

For the matrix:

1 0 1 1
1 0 0 0
0 1 1 0
1 0 1 0

They obtain the sets:

#

It seems like the i coefficients correspond to the positions of the 1's in the matrix, but I don't understand why the j coefficients are offset

#

for example, in deps(0,j) , 0, 2, 3 correspond to 1s in the first row of the matrix

#

but where do (j+2)%4) or (j+1)%4) come from?

#

the paper doesn't really go into any more detail than what ive posted so i am lost on how they derived depsU

wide spear
#

Is this in the context of finite difference methods for pdes

naive creek
#

Warning: long message
Hello, I'm using gmres in MATLAB for preconditioning on 2 specific sparse matrices A1 and A2. Their sparsity pattern, generated with spy is shown in the screenshot below. For both matrices A1 and A2 the following code is executed (A1 and A2 are A):

x0 = gmres(A,b);

y = gmres(@(x) A*( solve_Ub( U, solve_Lb(L,x))),b);
x1 = solve_Ub( U, solve_Lb(L,y));

LU is the incomplete lu factorisation of A. solve_Ub and solve_Lb are specific functions which return y for which Uy=b and Ly=b respectively. x0 is the solution of Ax=b where no preconditioner is used and x1 is the solution of the same system but with the incomplete LU-factorisation as a preconditioner.

Executing the code generates the following output for A1:

gmres stopped at iteration 10 without converging to the desired tolerance 1e-06
because the maximum number of iterations was reached.
The iterate returned (number 10) has relative residual 0.29.

gmres converged at iteration 2 to a solution with relative residual 8.2e-13.

for A2:

gmres stopped at iteration 10 without converging to the desired tolerance 1e-06
because the maximum number of iterations was reached.
The iterate returned (number 10) has relative residual 0.77.

gmres stopped at iteration 10 without converging to the desired tolerance 1e-06
because the maximum number of iterations was reached.
The iterate returned (number 10) has relative residual 0.086.

So the preconditioner has a good effect on A1 and already converges at iteration 2, but with with A2 the preconditioning doesn't improve the convergence here. My question is: Does anybody know why this helps relatively good for A1 and not for A2? I thought that A2 might be bad conditioned, but by calculating the condition number for A1 and A2, I came to the conclusion that A1 even has a bigger condition number than A2. Sorry for this long question, but thanks for reading this. Any sugestion would be much appreciated!

slow niche
wide spear
#

Tiboat

#

Are you looping with the LU? If so, are you using a sparse LU?

#

You might be getting a lot of fill in for the second one

prime kraken
#

in general, most preconditioning* methods destroy the sparsity :x

#

you might do better with something like a jacobi preconditioning

wide spear
naive creek
wide spear
#

So it’s like (i, j+(4-i) mod 4)

#

I don’t think the preconditioning is being applied to the matrix

naive creek
#

I don't really get what you mean by looping the LU and also with the fill in for the second one 😦 . (I'm a big noob in this, sorry)

wide spear
#

It’s being used to find an initial guess for a solution?

slow niche
wide spear
naive creek
#

Well, I'm not practically using this to solve a system of linear equations for some purpose. This is more like a thoeretical question i'm trying to solve.

wide spear
#

Let me think about this more when I get out of bed

naive creek
#

ok thanks 🙂 haha

slow niche
naive creek
prime kraken
#

not necessarily, but it makes stuff a lot slower to compute

slow niche
# slow niche

i guess here it is just (i, j+i % 4) - the real issue is why they do that 😄

#

for both U and L

#

I dont really understand what they mean by "coefficient" in the def:

naive creek
slow niche
#
we define depsU(i) as the set of all the indexes j such that the coefficient m i,j of the matrix M (of the linear layer) is non-zero
wide spear
#

Ok

#

tiboat

#

You are using GMRES to determine an initial guess for your iteration?

#

I'm not entirely sure what your Matlab is doing

#

Mostly because I'm not very familiar with it

wide spear
slow niche
#

😦

wide spear
#

You might consider asking in the CS server linked in #old-network , they probably have people who know more crypto

naive creek
#

with x0 = gmres(A,b); I'm trying to get an approximate solution of Ax=b

wide spear
#

Yes

#

I understand that

naive creek
#

y = gmres(@(x) A*( solve_Ub( U, solve_Lb(L,x))),b); x1 = solve_Ub( U, solve_Lb(L,y)); also but with a right preconditioner LU

wide spear
#

What does @(x) mean

#

And what do you do with x1

#

Do you do this entire thing again?

#

Or is x1 what you calculate the relative residual with

naive creek
wide spear
#

Ok y is a function of x?

naive creek
#

yes

wide spear
#

But you don't pass in x when you call y in x1

naive creek
#

um no... I'm not even able to properly understand this haha 😓 . Maybe to give a bit more context: this question with the code was given for an assignment. We actually didn't see anything about gmres in the course. for this question we also can interprete gmres largely as a black box. So tbh I don't precisely know what's going with the gmres. The answer for the question also is probably not very related to the specific gmres method.

wide spear
#

Ok so we should be thinking about qualitative differences between the two matrices?

naive creek
#

I think so, yes

prime kraken
#

then maybe it's a matter of looking at the error between the original matrix and the incomplete LU?

#

you won't have a good preconditioning if the error between the original mat and the incomplete LU is close to the original matrix

naive creek
#

Aha, that's a very hopeful suggestion. I will see what the error is!🙂

prime kraken
#

particularly, the frobenius norm of the difference

#

or also the condition number of the preconditioned problem

#

see if it improves as much as the other one's does

#

(which should also be related to how good the incomplete LU was)

naive creek
#

The forbenius norm of the difference A1 is 0.594262658318646 and for A2 is 4.088316191164535e+04. So that's nice! Unfortunately I didn't see the forbenius norm in the course. So if you want to take the condition number of the preconditioned problem, what do you take the condition number of exactly? Thanks for all the help already! and sorry for the late answer

wide spear
#

Have you seen matrix norms before?

#

If so, which ones have you seen

naive creek
#

Yes i've seen the 1-norm, 2-norm and infinity-norm.

wide spear
#

Ok

naive creek
#

oh wait is frobenius norm = 2-norm?

wide spear
#

It is a fact that $\norm{A}_2\leq\norm{A}_f\leq\sqrt{r}\norm{A}_2$

pine jettyBOT
#

Angetenar

naive creek
#

ah nevermind

wide spear
#

Where r is the rank of the matrix

prime kraken
#

well, the 2-norm of a matrix is an induced norm, and it corresponds to the largest singular value

wide spear
#

So you can also compute the 2-norm of the difference

prime kraken
#

oh you had already started, oops

#

anywho, the frobenius norm is the square root of the sum of singular values squared, while the 2 norm is the largest singular value of the matrix, and they follow the property angetenar gave up there

wide spear
#

A lot of matrix norms have these equivalency properties

#

Actually, all matrix norms are equivalent

prime kraken
#

so if you compute the frobenius norm of the original matrix and the frobenius norm of the difference, you can get an idea of how bad the estimate is

#

equivalence of norms moment coming up

wide spear
#

For any two matrix norms you have the inequalities $c_1\norm{A}{\alpha}\leq\norm{A}{\beta}\leq c_2\norm{A}_{\alpha}$

pine jettyBOT
#

Angetenar

wide spear
#

For arbitrary matrix norms

#

The moral is that the specific matrix norm you choose doesn't matter so much

prime kraken
#

how concrete and bullet proof must your explanation of this problem be?

naive creek
#

Thanks for all the information!! 😊

prime kraken
#

at a handwavy level, i would say looking at the frobenius norm of the difference and the condition number before and after preconditioning should give a good idea of what's going on

#

or maybe relative frobenius norm difference

#

something like that

#

(or any other norm you like, as ange said)

#

oh, and as it turns out, it IS a consequence of ruining the sparsity

#

i had just missed the part where you said incomplete LU

#

since you have some M = LU - R, and LU is the incomplete LU decomp

#

the usual LU would destroy the sparsity, so you subtract this U term

#

and impose sparsity on L and U

#

this R takes up all the error

#

which you put in by making the LU incomplete to try and preserve the sparsity

#

when you do the frobenius norm of the error M - LU, it's really the norm of U that you see

naive creek
#

Thanks for all the help and information @prime kraken and @wide spear ! I think I'm getting it

brave crypt
wide spear
#

@echo ferry here fits better

#

Anyways

echo ferry
#

thank you, I'll move it here

wide spear
#

I do not know the answer to your question

echo ferry
#

Sorry if this is wrong place for it, but that's the one that was recommended to me, and I don't think that early university statistics apply (although I'm a computer science PhD, perhaps for mathematicians that is trivial)
Below formula comes from "Conjugate Bayesian analysis of the Gaussian distribution" available at https://www.cs.ubc.ca/~murphyk/Papers/bayesGauss.pdf
In Normal-Inverse Wishart prior, what is the role of kappa_0?
In my case, I'm using it as part of Bayesian Rose Trees clustering. It is my understanding that NIW is used to model gaussian distribution in order to differentiate between clusters using the assumption that each cluster should follow gaussian distribution. Is it then correct that for given data and given Sigma (how to estimate Sigma is another problem), kappa_0 should be selected such that it would make Sigma/kappa_0 equal to the predicted variance of the distribution?

wide spear
#

@prime kraken might know

#

If nobody knows here you might try asking in the AI/ML server linked in #old-network

#

(Edd is also asleep right now I think)

echo ferry
#

yeah, I'm there, so I'll do that if no one here will be able to answer. But I think I need more of a math understanding that ML-based one. Cause, to be honest, I could just optimize it as a hyperparameter. But well, then I still wont know what it means, really

#

and my intuition tells me that I can actually pick a proper value based on the context of the application, I just have to... calibrate myself

wide spear
#

Does 8da know

#

I guess not

brave crypt
#

Ah, is this the problem of picking priors in bayesian stats? 🙂 I'm not very good at bayesian stats, but in general I think picking (the parameters to) a prior is not a simple thing. (Unless there is something specific about this problem where a certain way to choose is typically taken)

#

Indeed, I don't know 😞 but it could be fun to talk about this problem 🙂

wide spear
#

I've asked my boyfie

#

Will update if he responds

echo ferry
#

Hmmm in my particular case, the samples are event times. Clusters correspond to trips, days, series of photos

#

(depending on the level of hierarchy)

#

so a single gaussian distribution should represent a distribution of timestamps

#

Therefore some reasonable heuristics could be used based on the fact that people, for example, sleep for like 8 hours a day

#

this kappa_0 parameter in case of clustering is called a "scale factor", so I guess that I could probably select a scale that corresponds to the variance of timestamps during a single day

#

If my understanding is correct, the eq. 246 talks about the relation to the normal distribution with given mean and given variance

#

But then again, same paper uses italic N to denote normal distribution. I assume that's just a negligence. If what I'm saying is correct, then I can pick correct value of kappa_0 if I manage to get the Sigma right, which is another problem. But so far I don't really know whether my intuition is even correct, this is the first time I even heard of Wishart and my knowledge about priors in general is superficial.

wide spear
#

Did you read the caption for figure 4

prime kraken
#

i know little to nothing about bayesian estimation :( however, all throughout the manuscript, kappa_0 is the "belief that the prior mean is correct"

#

kappa_0 reduces the variance of the prior, meaning random realizations of the random process for the mean are closer to it, and increases the weight of the current (mu - mu_0)^T Sigma^-1 (mu - mu_0)

#

since one usually uses log likelihood for this, the argumentof the exp comes out and this works almost like a proximal mapping

echo ferry
# wide spear Did you read the caption for figure 4

The one in context of Normal Inverse Chi Squared prior rather than Normal Inverse Wishart prior? Yeah, I did.
As Edd says, it's described there as "how strongly we believe that mi_0 is the prior mean. And they use values like "1.0" or "5.0". So I don't think that the value corresponds to, say, probability. I lack the intuition to properly scale kappa_0.
So you're saying that I can assume that Sigma in eq. 246 stands for variance and therefore I can think of kappa_0 as a way to change the variance of the model. Since in eq. 245 Sigma is equivalent to Inverse Wishart distribution of Lambda^(-1). Wherein Lambda_n "plays role of" posterior sum of squares (eq. 257). So some variance times how strongly we believe in it. So I assume that in fact the "how strongly I believe" is actually measured by the relation between ni and kappa. Is that correct reasoning?
I see, that ni_0 is actually a number of degrees of freedom.
I do actually use that in log, as you've said.
I have to do some thinking to be able to actually put correct value in kappa_0, but you've given me a framework to work with, thank you! 🙂

prime kraken
#

for completeness, it's like a proximal based on the mahalanobis distance, since the difference between the prior mean and the current mean has that simga^-1 in the middle, so the coordinates are rescaled based on that inverse covariance

#

the interpretation becomes a bit muddy, though, because both this "proximal" term and the other term in the cost function depend on lambda

echo ferry
#

After some discussion elsewhere, I've become quite convinced that kappa_0 should respond to the number of samples in previous experiments that led to current parameters. So it's a measure of inertia of curret hyperparameters of my model. So I have to focus on different hyperparameters first, and then I will be able to say how far the distribution could change as a result of further data. Do you see any flaw in that?

prime kraken
#

yep, that's an alternative interpretation to it

#

if kappa_0 is large, the model willpreferentially stick to the first guess of the mean, i.e. mu_0

#

if you have lots of previous samples, one would expect mu_0 to be a good guess that need not be modified mich

thin vapor
#

Hey so i have been asked to firstly write if the LMM is explicit or implicit $y_{n+1} - y_{n} = h(1/2f_{n+1} + 1/2f_{n})$

pine jettyBOT
#

B1GW0LF

thin vapor
#

I have been looking through the lecture notes for 2 hrs at this point and I am just getting more and more confused

brave crypt
#

,w LMM

thin vapor
#

yes

#

I think that it is Implicit

#

but i am not 100% confident on why

pine jettyBOT
echo ferry
brave crypt
#

@wide spear I hope you remember me.

brave crypt
wide spear
#

Yes

#

The bad paper

brave crypt
#

Haha

#

Yes 😅