#Need help with Learned Image Compression.
Base Paper
The concept which I need an explanation on:
The part right from "In our discussion, the “information”
is denoted by a real column vector x ∈R N or a sequence of such vectors..." to "assumed that communication between encoder and decoder is perfect"
- Can someone please help me understand this part, and how an image can actually be represented as a column vector? Most of the time we treat images as a stack of matrices, and the other common form is a flat 1-D vector, so where does this representation fit?
I have tried using AI for this, but its explanation of the representation part didn't help!
R^n notation: The set of all n-tuples of real numbers (vectors with n components) is denoted as R^n. For example, R^2 represents the set of all vectors with two real number components (like points in a 2D plane), and R^3 represents the set of all vectors with three real number components (like points in a 3D space).
It's just an array of length N
Every element of the array is a real number
I just read the section you're asking about, so I want to preface this by saying I don't have much context on the broader points being made, but these are pretty common design patterns and basic notation. I'll also use markdown to help keep things clear. Here are the basics:
x ∈ R^N has a couple different things going on:
- x is a bold lowercase letter. This represents a 1D vector, like [1, 2, 3, 4, 5].
- ∈ is a set operator, stating that the value on the left is a member of the set on the right.
- R^N makes two points: we're working over the set of all real numbers (R), and taking it N times. So any member of the set is a vector of length N whose entries are all real numbers. N isn't fixed to a particular value, since doing so would limit the generality of the statement, but if x ∈ R^N and x = [1, 2, 3, 4, 5], then N = 5.
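To connect this to images: a minimal sketch, assuming NumPy and a made-up 2×3 RGB image, of how a 3-D image array becomes a single column vector in R^N. The sizes `H`, `W`, `C` and the pixel values are hypothetical, picked only to keep the numbers small.

```python
import numpy as np

# Hypothetical tiny RGB image: height 2, width 3, 3 color channels.
H, W, C = 2, 3, 3
image = np.arange(H * W * C, dtype=np.float64).reshape(H, W, C)

# Flatten into a column vector x in R^N, where N = H * W * C.
x = image.reshape(-1, 1)
N = x.shape[0]

# Nothing is lost: the original layout is recoverable as long as
# H, W and C are known.
restored = x.reshape(H, W, C)

print(N)        # number of real values in the image
print(x.shape)  # (N, 1), i.e. a column vector
```

So the "matrix" view and the "column vector" view hold the same numbers; the vector form is just a fixed ordering of them, which is convenient for writing the math.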
From what I can tell, you haven't gotten into the part of the paper that clarifies what the internal representation between the encoder and decoder looks like. It's possible you won't get to that point without the author providing the code used to produce the results. However, it's not always necessary to know exactly what it looks like, but here's a helpful gist:
Encoder/decoder frameworks (aka autoencoders) are architectures designed such that the encoder translates a current representation into a latent representation (this is generally an abstract representation, sometimes the scope of the latent space is known but rarely is that clarified) and a decoder translates the latent representation back into the starting representation.
The most interesting use of encoder/decoder architecture is in medium translation networks (text-to-speech, text-to-image, speech-to-text, etc). You can configure multiple encoder/decoder structures to map to a mutual latent space, then simply choose which medium to decode to.
Let me know if I can help with anything else!
Also, the reason the encoder/decoder concept is even brought up is that this is what compression is. The whole idea of compression is that you want some function to reduce the size of data temporarily. With files or directories, one example is zipping to an archive. The encoder is the zipping step. The latent space is the space of all possible zipped files/directories, and the currently zipped (encoded) file/directory is a member of that latent space (so you can say the zipped file/directory z ∈ L^N, but in this case N is also arbitrary, because the length of z depends on the size of the file/directory). When you unzip the file, you're using an algorithm specifically designed to undo the zipping algorithm, restoring the original file/directory; through this lens, that is the decoder. Notably, encoders and decoders come in pairs, because the decoder must be able to undo what the encoder did.
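The zip analogy can be tried directly. This is a minimal sketch using Python's standard `zlib` module as the encoder/decoder pair; the byte string is made up for illustration.

```python
import zlib

# Hypothetical highly repetitive data, 900 bytes long.
original = b"abcabcabc" * 100

# "Encoder": compress to a shorter byte string (the encoded form).
encoded = zlib.compress(original)

# "Decoder": the matching algorithm that undoes the encoder.
decoded = zlib.decompress(encoded)

print(len(original), len(encoded))
print(decoded == original)
```

One difference worth keeping in mind: zlib is lossless, so the decoder returns the input exactly, whereas image compression is often lossy and the decoder only returns an approximation.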
This matters in image compression because you get to predetermine the size of the latent space: it could be a vector of 1000 real numbers, or it could be a scaling factor applied to the size of the input image, and the choice is yours. However, that choice affects the decoder: making the latent space too small reduces its ability to restore the image.
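To make the latent-size trade-off concrete, here is a minimal sketch, assuming a linear encoder/decoder built from an SVD (essentially PCA) as a stand-in for a learned model. The data is random and the helper names `encode`, `decode`, and `reconstruction_error` are hypothetical; a real learned codec would use nonlinear networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 flattened "images", each a vector in R^16.
X = rng.normal(size=(200, 16))
X = X - X.mean(axis=0)

# Rows of Vt are orthonormal directions, ordered from most to least
# variance captured.
_, _, Vt = np.linalg.svd(X, full_matrices=False)

def encode(x, k):
    """Map vectors in R^16 to a latent vector in R^k."""
    return x @ Vt[:k].T

def decode(z, k):
    """Map a latent vector in R^k back to R^16."""
    return z @ Vt[:k]

def reconstruction_error(k):
    """Mean squared error after an encode/decode round trip."""
    X_hat = decode(encode(X, k), k)
    return float(np.mean((X - X_hat) ** 2))

errors = {k: reconstruction_error(k) for k in (2, 8, 16)}
for k, err in errors.items():
    print(k, err)
```

The error shrinks as the latent size `k` grows and is essentially zero once `k` matches the input size, at which point nothing has been compressed.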
Yes, I get the point about the ordered real numbers: for simplicity and common ground, it's treated as a single column vector. I just saw that the notes in the paper were old and hadn't been updated, so the communication was bad on my side. Sorry about that.
The issue is that I'm not really able to imagine an image matrix in column vector form. I'm not used to it in my current scope of work, so just interpreting it adds complexity to the whole thing. A single vector is fine, but how should I handle a 3-D matrix? That's where my mind got stuck!
So, to add more context: I'm working on a Learned Image Compression model, using the base paper stated above. Their observation was roughly that the larger and more concise the latent space, the better the bit rate. So we are working on using image patches as nodes, with hyperConv alongside CNNs, to capture relational as well as spatial features. The goal is a better representation on which to build our latent space and reduce the redundancy in spatial data, so the focus is mostly on the Transform step itself.
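On the patch idea, here is a minimal sketch, assuming NumPy and non-overlapping square patches, of turning one image into a sequence of vectors (one per patch). The sizes `H`, `W`, `C`, `P` are hypothetical, and this only shows the patch extraction, not any graph or hypergraph convolution on top of it.

```python
import numpy as np

# Hypothetical 4x4 RGB image and 2x2 patches.
H, W, C = 4, 4, 3
P = 2
image = np.arange(H * W * C, dtype=np.float64).reshape(H, W, C)

# Split into non-overlapping P x P patches, then flatten each patch
# into one row vector of length P * P * C.
patches = (
    image.reshape(H // P, P, W // P, P, C)
    .transpose(0, 2, 1, 3, 4)  # (patch row, patch col, P, P, C)
    .reshape(-1, P * P * C)
)

print(patches.shape)  # (number of patches, values per patch)
```

Each row of `patches` is then a vector in R^(P·P·C), so the image becomes a sequence of such vectors, which could serve as node features.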
The paper with issues: (Updated)
I just wanted to figure out how to simulate the process with pen and paper to get my hands dirty, as I haven't really worked with image compression before. That way I could get a picture of where the whole VAE part used in the base paper's model fits in.
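For a pen-and-paper version, here is a minimal sketch of a transform, quantize, inverse-transform round trip on a 2×2 grayscale image. Everything here is an assumption for illustration: the pixel values, the fixed orthonormal matrix `A` (a normalized Hadamard matrix standing in for a learned transform), and the quantization step size `step`. It is not the base paper's model, and entropy coding is left out.

```python
import numpy as np

# Hypothetical 2x2 grayscale image, flattened to x in R^4.
x = np.array([10.0, 12.0, 9.0, 11.0])

# Fixed orthonormal 4x4 transform (normalized Hadamard matrix).
A = 0.5 * np.array([
    [1,  1,  1,  1],
    [1, -1,  1, -1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
], dtype=np.float64)

step = 3.0  # quantization step size (assumed)

y = A @ x                          # forward transform
y_hat = step * np.round(y / step)  # quantize to multiples of step
x_hat = A.T @ y_hat                # inverse transform

print(y)      # transform coefficients
print(y_hat)  # quantized coefficients
print(x_hat)  # reconstructed (approximate) image vector
```

The numbers are small enough to check by hand: most of the signal lands in the first coefficient, the small coefficients get rounded, and the reconstruction differs slightly from `x`. That difference is the distortion, and the quantized coefficients are what would be stored.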
And really thank you for your time, I do appreciate that a lot
The equation in the base paper which brought me to this paper:
I'm struggling to follow exactly what's confusing, and I don't want to write out an explanation of something that already makes sense to you. Would you be interested in a quick voice chat later today? I'm sure I can help, I've spent the last 3 years working with image models at work.
And I'll drop a better explanation to refer back to after our chat, in case anybody else reads through this thread and is curious.
Sure, just let me know when you're free sir.
Dm'd