#Condor: A Neural Connection Network for Enhanced Attention - Feedback Request

visual moon
#

Please add more datasets to the experiments (5 to 6 of them). The related work is lacking. Also add an ablation study and the citations that are missing.

#

A rigorous experimental comparison with standard models is also missing.

tawny copper
#

Is this equivalent to just adding learned weights for every window position to each head?

hazy torrent
# tawny copper Is this equivalent to just adding learned weights for every window position to e...

Superficially, yes. But there's an important difference in design motivation.
But "let's add positional weights" and "let's model local connection functions with MLPs based on KY Transform theory" start from different places.
In KY Transform, connections between two points are modeled not as straight lines but as arbitrary functions g(t;θ). In Neural KY-Attention, the MLP learns this connection function g, and the result manifests as positional weights.
The actual outcome is the same, but having theoretical foundations helps us understand why this structure works and enables systematic approaches when extending to other connection function classes (polynomial, exponential, etc.) or applying to other domains.
Essentially, your observation is correct. However, KY Transform provides the mathematical explanation for why those "positional weights" are effective.
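
To make the equivalence discussed above concrete, here is a minimal sketch, not the paper's implementation. It assumes g(t; θ) is a tiny per-head MLP over the normalized offset t inside a fixed window; `make_head_mlp`, `connection_weight`, `positional_table`, and the hidden width are all hypothetical names and choices. Since a window has only W discrete offsets, evaluating g once per offset gives an H x W table, which is what "learned weights for every window position in each head" would look like.

```python
import math
import random

def make_head_mlp(hidden, rng):
    """Random parameters theta for one head's tiny MLP g(t; theta): R -> R."""
    return {
        "w1": [rng.uniform(-1, 1) for _ in range(hidden)],
        "b1": [rng.uniform(-1, 1) for _ in range(hidden)],
        "w2": [rng.uniform(-1, 1) for _ in range(hidden)],
        "b2": rng.uniform(-1, 1),
    }

def connection_weight(theta, t):
    """Evaluate g(t; theta) using a single tanh hidden layer."""
    hidden = [math.tanh(w * t + b) for w, b in zip(theta["w1"], theta["b1"])]
    return sum(w * h for w, h in zip(theta["w2"], hidden)) + theta["b2"]

def positional_table(thetas, window):
    """Evaluate each head's g at every window offset, giving an H x W table."""
    # Assumption: offsets are normalized to [0, 1] before being fed to g.
    offsets = [d / max(window - 1, 1) for d in range(window)]
    return [[connection_weight(theta, t) for t in offsets] for theta in thetas]

rng = random.Random(0)
num_heads, window = 4, 8
thetas = [make_head_mlp(hidden=16, rng=rng) for _ in range(num_heads)]

# One weight per (head, window offset): the same shape as a directly
# learned per-head positional weight table.
table = positional_table(thetas, window)
```

Under these assumptions the two views differ only in parameterization: the MLP shares parameters across offsets, while a free table learns each entry independently.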

tawny copper
#

in general, if you're looking to get published, your paper cannot make unsubstantiated claims - everything needs either a mathematical proof, experimental evidence, or citation of another paper

#

this, for example, is just pure conjecture

#

Does it make sense conceptually? Yes. Have you shown any proof at all? No.

#

Maybe you need to state that it's an example of a way in which heads could plausibly differentiate themselves.

#

Regarding the experimental evidence, I don't think this model is large enough, nor is your sequence length long enough, nor do you train it enough, for anyone to be able to consider there being evidence that it did well relative to llama.

#

These statements are completely unsupported:

#

You don't even say why you think this will be the case. And I can't infer any reason why it would be.

#

Is this because you're using a sliding window? If so, that's a completely unfair comparison to a model that does not use a sliding window.

#

You should be comparing to models with sliding windows.
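
For reference, a minimal sketch of the two attention masks being contrasted here: full causal attention and causal attention restricted to a sliding window. The function names, and the convention that `window` counts the current token, are assumptions for illustration only.

```python
def causal_mask(seq_len):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]

def sliding_window_mask(seq_len, window):
    """Causal mask that keeps only the last `window` keys, current token included."""
    return [[q - window < k <= q for k in range(seq_len)] for q in range(seq_len)]

def visible_counts(mask):
    """Number of keys each query position can see."""
    return [sum(row) for row in mask]

full = causal_mask(6)
local = sliding_window_mask(6, window=3)

print(visible_counts(full))   # [1, 2, 3, 4, 5, 6]
print(visible_counts(local))  # [1, 2, 3, 3, 3, 3]
```

A fair baseline would apply the same mask to both models, so that any difference comes from the attention mechanism and not from the context each model can see.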

#

I'm uncertain that KY-Attention can include RoPE as a special case. RoPE acts on each pair of components separately, and afaict KY-Attention cannot replicate this.
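
A minimal sketch of standard RoPE, to show the pairwise action mentioned above: each consecutive pair of components (2i, 2i+1) is rotated by its own angle, pos * base^(-2i/d). The function names are placeholders, and base 10000 is the commonly used value. This only illustrates RoPE; it makes no claim about what KY-Attention can or cannot express.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each consecutive pair of components by a pair-specific angle."""
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of components"
    out = []
    for i in range(d // 2):
        angle = pos * base ** (-2.0 * i / d)  # each pair has its own frequency
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]
k = [1.1, 0.4, -0.6, 0.9]

# The score depends only on the relative offset (3 in both cases),
# even though the absolute positions differ by 100.
score_a = dot(rope(q, 5), rope(k, 2))
score_b = dot(rope(q, 105), rope(k, 102))
```

The rotation is applied to the query and key vectors before the dot product, and its angle differs per component pair.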

#

ok, that's it for now! sorry if that seemed harsh, just trying to give you as much feedback as possible on what publication in a journal or conference will require

hazy torrent
#

Hi everyone! Thank you so much for all the valuable feedback on my previous paper.
I've incorporated your suggestions by running additional experiments, significantly expanding the related work section, and addressing all the issues you pointed out. The revised version should be much stronger now, though I acknowledge the experiments are still relatively small-scale at this point.
If you have time, could you please take another look at the updated version?
Thanks again for all your help!

tawny copper
#

btw GLA is already an acronym for a well known linear attention mechanism

#

Gated Linear Attention

#

related work was a good effort, but I think you might misunderstand the goal of such a section... it's supposed to relate to the change you specifically made, not just general ways in which people did other unrelated things to improve speed

#

"Each head in GLA-Attention learns a different connection pattern to achieve a multilayered understanding of the sequence."
This is an example of a statement that still has zero proof associated with it.

#

A paper that will successfully pass peer review should basically be a series of proven statements, either via a) citation b) experimental evidence or c) direct proof supplied.

#

The fact that a local window improved your results is a serious red flag. This should not be the case.

#

The idea that windowed attention is "better" than full attention is completely wrong, and points to either a problem in your experimental setup or measurement of the wrong things

#

This is especially true for window size 32, which is like way way way too short to work well
Your sequence length needs to be 1024 at the very minimum

#

Right now unfortunately your experiments are in such a small regime that they don't correspond to anything that's useful in practice

#

I don't recommend using wikitext-2 as your dataset.

#

I suggest taking a look at other papers and using a dataset that is widely used

#

I also don't recommend doing multiple epochs

#

Before working on the paper I would focus on improving the experiments significantly

#

You may learn that your method does not work as well as you think, which could cause you to revise the underlying idea

#

Or that it's great! Either way it's important to do a good experimental comparison

#

You should also make sure that you use hyperparameters that have at least been shown to work well by others at whatever scale your experiment ends up being run at