#Condor: A Neural Connection Network for Enhanced Attention - Feedback Request
Is this equivalent to just adding learned weights for every window position to each head?
Superficially, yes. But there's an important difference in design motivation.
"Let's simply add positional weights" and "let's model local connection functions with MLPs based on KY Transform theory" start from different places.
In KY Transform, connections between two points are modeled not as straight lines but as arbitrary functions g(t;θ). In Neural KY-Attention, the MLP learns this connection function g, and the result manifests as positional weights.
The actual outcome is the same, but having theoretical foundations helps us understand why this structure works and enables systematic approaches when extending to other connection function classes (polynomial, exponential, etc.) or applying to other domains.
Essentially, your observation is correct. However, KY Transform provides the mathematical explanation for why those "positional weights" are effective.
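To make the equivalence discussed above concrete, here is a minimal sketch, not the paper's implementation. It assumes a tiny one-hidden-layer MLP standing in for g(t;θ); the names `init_mlp`, `connection_mlp`, and `window_weights`, the tanh activation, and the normalization of t to [0, 1] are all hypothetical choices. For a fixed window size, evaluating g at every offset yields a table of per-head, per-position weights, which is what the question describes.

```python
import math
import random

def init_mlp(hidden, n_heads, seed=0):
    """Random parameters for a 1 -> hidden -> n_heads MLP (illustrative only)."""
    rng = random.Random(seed)
    return {
        "w1": [rng.uniform(-1, 1) for _ in range(hidden)],
        "b1": [rng.uniform(-1, 1) for _ in range(hidden)],
        "w2": [[rng.uniform(-1, 1) for _ in range(hidden)] for _ in range(n_heads)],
        "b2": [rng.uniform(-1, 1) for _ in range(n_heads)],
    }

def connection_mlp(t, params):
    """g(t; theta): maps a scalar offset t to one weight per head."""
    hidden = [math.tanh(w * t + b) for w, b in zip(params["w1"], params["b1"])]
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(params["w2"], params["b2"])
    ]

def window_weights(window, params):
    """Evaluate g at every window offset, giving table[head][offset]."""
    n_heads = len(params["b2"])
    table = [[0.0] * window for _ in range(n_heads)]
    for k in range(window):
        t = k / max(window - 1, 1)  # assumed normalization of the offset to [0, 1]
        per_head = connection_mlp(t, params)
        for h in range(n_heads):
            table[h][k] = per_head[h]
    return table

params = init_mlp(hidden=8, n_heads=4)
table = window_weights(window=32, params=params)
print(len(table), len(table[0]))  # heads x window positions
```

Once the window size is fixed, only `table` is needed at inference, so the two formulations coincide there. The MLP may still act as a constraint on how the weights vary with offset during training, but that would need to be shown experimentally.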
in general, if you're looking to get published, your paper cannot make unsubstantiated claims - everything needs either a mathematical proof, experimental evidence, or citation of another paper
this, for example, is just pure conjecture
Does it make sense conceptually? Yes. Have you shown any proof at all? No.
Maybe you need to state that it's an example of a way in which heads could plausibly differentiate themselves.
Regarding the experimental evidence, I don't think this model is large enough, nor is your sequence length long enough, nor do you train it enough, for anyone to be able to consider there being evidence that it did well relative to llama.
These statements are completely unsupported:
You don't even say why you think this will be the case. And I can't infer any reason why it would be.
Is this because you're using a sliding window? If so, that's a completely unfair comparison to a model that does not use a sliding window.
You should be comparing to models with sliding windows.
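One way to make the comparison like-for-like is to give the baseline the same causal sliding-window mask as the proposed model, so that the learned connection weights are the only remaining difference. A minimal sketch follows; the helper names `sliding_window_mask` and `masked_softmax` are hypothetical, and the scores are uniform dummy values.

```python
import math

def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j:
    causal (j <= i) and within the most recent `window` positions."""
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]

def masked_softmax(scores, mask):
    """Row-wise softmax that assigns zero probability to masked-out keys."""
    probs = []
    for row, mask_row in zip(scores, mask):
        allowed = [s for s, m in zip(row, mask_row) if m]
        row_max = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - row_max) if m else 0.0 for s, m in zip(row, mask_row)]
        total = sum(exps)
        probs.append([e / total for e in exps])
    return probs

seq_len, window = 6, 3
mask = sliding_window_mask(seq_len, window)
scores = [[0.0] * seq_len for _ in range(seq_len)]  # dummy attention logits
probs = masked_softmax(scores, mask)
print(mask[5])
print(probs[5])
```

Running both the baseline and the proposed model through the same mask, at the same window size, separates the effect of the window from the effect of the learned weights.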
I'm uncertain that KY-Attention can include RoPE as a special case. RoPE acts on each pair of components separately, and afaict KY-Attention cannot replicate this.
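For reference, a minimal sketch of RoPE as it is commonly described, using the usual base of 10000 and made-up vectors. It illustrates the pairwise rotation mentioned above: each pair of components is rotated at its own frequency, so position interacts with the content of q and k pair by pair rather than entering as a single weight per offset. It does not settle whether KY-Attention can reproduce this.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each consecutive pair (x[2i], x[2i+1]) by angle pos * base**(-2i/d)."""
    d = len(x)
    out = [0.0] * d
    for i in range(d // 2):
        angle = pos * base ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s
        out[2 * i + 1] = a * s + b * c
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Made-up query and key vectors, for illustration only.
q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.1]

# Same relative offset (2) at different absolute positions.
score_a = dot(rope(q, 5), rope(k, 3))
score_b = dot(rope(q, 12), rope(k, 10))
print(score_a, score_b)
```

The two scores agree up to floating-point error because the rotated dot product depends only on the relative offset, yet the value at a given offset still depends on the contents of each component pair.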
ok, that's it for now! sorry if that seemed harsh, just trying to give you as much feedback as possible on what publication in a journal or conference will require
Hi everyone! Thank you so much for all the valuable feedback on my previous paper.
I've incorporated your suggestions by running additional experiments, significantly expanding the related work section, and addressing all the issues you pointed out. The revised version should be much stronger now, though I acknowledge the experiments are still relatively small-scale at this point.
If you have time, could you please take another look at the updated version?
Thanks again for all your help!
btw GLA is already an acronym for a well-known linear attention mechanism
Gated Linear Attention
related work was a good effort, but I think you might misunderstand the goal of such a section... it's supposed to relate to the change you specifically made, not just general ways in which people did other unrelated things to improve speed
Each head in GLA-Attention learns a different connection pattern to achieve a multilayered understanding of the sequence.
This is an example of a statement that still has zero proof associated with it.
A paper that will successfully pass peer review should basically be a series of proven statements, either via a) citation b) experimental evidence or c) direct proof supplied.
The fact that a local window improved your results is a serious red flag. This should not be the case.
The idea that windowed attention is "better" than full attention is completely wrong, and points to either a problem in your experimental setup or measurement of the wrong things
This is especially true for window size 32, which is like way way way too short to work well
Your sequence length needs to be 1024 at the very minimum
Right now unfortunately your experiments are in such a small regime that they don't correspond to anything that's useful in practice
I don't recommend using wikitext-2 as your dataset.
I suggest taking a look at other papers and using a dataset that is widely used
I also don't recommend doing multiple epochs
Before working on the paper I would focus on improving the experiments significantly
You may learn that your method does not work as well as you think, which could cause you to revise the underlying idea
Or that it's great! Either way it's important to do a good experimental comparison
You should also make sure that you use hyperparameters that have at least been shown to work well by others at whatever scale your experiment ends up being run at