#stanford-ribonanza-rna-folding


terse elbow
#

Hi

limpid oyster
#

Hi :)

dusky ember
#

hello all

#

lots of new competitions recently

#

this one looks pretty interesting, judging entirely by the name

limpid oyster
#

happy to connect with others who are thinking about this problem 🙂

autumn sierra
#

Hi everyone, happy to discuss how to tackle this and/or work together.

limpid oyster
hasty iron
#

I can't see any prediction target columns (reactivity_DMS_MaP, reactivity_2A3_MaP) in the train_data.csv file, are they located somewhere else?

sharp torrent
#

(And the error columns if you want to take that into account)

#

Each column represents the target value at the given base index, keep in mind different solutions are of varying lengths

#

Some indexes don't have data either because of the length variation or limitations in the experimental process

hasty iron
#

Oh I see, so one entry in sample_submission corresponds to one position in a sequence, not to the whole sequence, right?

sharp torrent
#

(Or, iirc, error that is too high)

sharp torrent
autumn sierra
hasty iron
#

Beware that reactivity is 1-indexed, not 0-indexed in train_data.csv. I got 0.304 lb instead of 0.177 because I missed my indices by 1 position.
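
A minimal sketch of the off-by-one point above, assuming column names of the form `reactivity_0001` (as described elsewhere in this thread). The helper name `reactivity_column_to_index` is hypothetical; it only shows how a 1-indexed column lines up with a 0-based string index.

```python
import re

def reactivity_column_to_index(column: str) -> int:
    """Map a 1-indexed column name like 'reactivity_0001' to a 0-based string index."""
    match = re.fullmatch(r"reactivity_(\d+)", column)
    if match is None:
        raise ValueError(f"not a reactivity column: {column!r}")
    return int(match.group(1)) - 1

# Prefix of the example sequence that appears in this thread.
sequence = "GGGAACGACUCGAGUAGAGUCGAAAAAU"
index = reactivity_column_to_index("reactivity_0004")
base = sequence[index]  # the base that reactivity_0004 describes
print(index, base)
```

Doing the conversion in one place makes it harder to shift every prediction by one position by accident.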

quasi dove
#

Yes they have not given ground truth values directly for reactivity dms_map and reactivity_2a3_map we have to figure it out, any domain experts here ?

naive marsh
#

The ground truth for test_sequences.csv will be given later during the competition. Right now, we have to work with the train_data.csv file and divide it into training and testing datasets.

#

anyone here is using R for this competition ? There's plenty of cool libraries for genomic data analysis, that could be useful.

sharp torrent
# quasi dove Yes they have not given ground truth values directly for reactivity dms_map and ...

I believe you’re confused about where the values are for the training data? It’s reactivity_0001 through reactivity_0170 (along with the reactivity error columns if you want to take experimental error into account). Sequence positions where data could not be collected will be null, and for sequences shorter than 170 they will be padded at the end with nulls. Each row provides data for either DMS or 2A3 depending on the experiment type column. Also in your submission it will be one row per sequence position with both DMS and 2A3, rather than all positions for just one.
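
A rough sketch of the layout described above, using only the standard library. The CSV text is a made-up four-position stand-in, and the column names `sequence_id` and `experiment_type` are assumptions; the helper `to_long` is hypothetical. It turns the wide, one-column-per-position rows into one record per measured position and skips nulls.

```python
import csv
import io

# Tiny stand-in for the training file; values and column names are illustrative only.
RAW = """sequence_id,sequence,experiment_type,reactivity_0001,reactivity_0002,reactivity_0003,reactivity_0004
abc,GGAU,DMS_MaP,,0.10,0.52,
abc,GGAU,2A3_MaP,,0.31,-0.04,
"""

def to_long(rows):
    """Yield (sequence_id, experiment_type, position, reactivity) for every non-null cell."""
    for row in rows:
        for name, value in row.items():
            # Keep reactivity columns, but not the matching error columns.
            if not name.startswith("reactivity_") or name.startswith("reactivity_error_"):
                continue
            if value == "":
                continue  # null: padding or a position that could not be measured
            position = int(name.rsplit("_", 1)[1])  # 1-indexed position in the sequence
            yield row["sequence_id"], row["experiment_type"], position, float(value)

long_rows = list(to_long(csv.DictReader(io.StringIO(RAW))))
for record in long_rows:
    print(record)
```

Each input row covers one experiment type, so the same sequence shows up twice in the long form, once for DMS and once for 2A3.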

naive marsh
sharp torrent
#

Oh sorry, it might be

#

I may be misremembering the columns

sharp torrent
#

Aaah yeah the columns are out of order in the list I was looking at 😅

dull owl
#

I did a general query to check what's null as most of the last columns seemed null in dataframe.

#

Are reactivity columns 1 to 26 and 166 to 206 all null? And their errors as well?

sharp torrent
#

The beginning and end of each sequence will be null - due to the way the experiments work, we can't get data for those regions

#

Be aware that not all sequences are the same length, so for some sequences the nulls will start earlier than others (since shorter sequences get padded with additional nulls on the end)
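
To make the null pattern concrete, here is a small sketch that finds where the measured region starts and ends for one row. The row values are made up and `measured_window` is a hypothetical helper; `None` stands for a null cell.

```python
from typing import Optional, Sequence, Tuple

def measured_window(values: Sequence[Optional[float]]) -> Optional[Tuple[int, int]]:
    """Return 0-based (first, last) indices of non-null values, or None if all are null."""
    present = [i for i, v in enumerate(values) if v is not None]
    if not present:
        return None
    return present[0], present[-1]

# Leading nulls, one interior null, then trailing nulls (including padding).
row = [None, None, 0.3, 0.1, None, 0.7, None, None, None]
window = measured_window(row)
print(window)
```

Shorter sequences would simply show an earlier `last` index, since the extra padding is all null.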

hasty iron
#

Has anyone figured out how to validate on longer sequences yet? It seems that the majority of training data (98%+) with positive SN_filter is all the same length. I'd like to see the experiment protocol to try searching for additional datasets.

quasi dove
sharp torrent
#

I believe those are computed based on existing (thermodynamic) algorithms

#

So I guess up to you whether you want to introduce that bias?

#

Structure may be computed based on being run against multiple available algorithms and choosing the best fit for the chemical mapping data, but I'm not sure - I'd need to see where specifically you're looking at

naive marsh
#

Has anyone figured out why we have negative reactivities? Chemical reactivities should only be positive, and the negative values don't seem to fall within the reactivity error

proven quiver
naive marsh
#

yeah, that makes more sense like that. Thanks !

sharp torrent
quasi dove
hasty iron
quasi dove
hasty iron
#

Well you could probably do that, but you have to predict reactivity of each position of the RNA sequence in the testing dataset, so it'd probably be quite challenging considering you only use mean values.

quasi dove
hasty iron
#

Submission csv has one row per RNA position. You can see id_min and id_max in test_sequences.csv; you have to put the corresponding predictions for this RNA between these indices.
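
A minimal sketch of the row layout just described, assuming `id_max` is inclusive and that each row carries both a DMS and a 2A3 prediction. The function name `submission_rows` and the numbers are illustrative only.

```python
from typing import List, Sequence, Tuple

def submission_rows(id_min: int, id_max: int,
                    dms: Sequence[float],
                    a3: Sequence[float]) -> List[Tuple[int, float, float]]:
    """Build (id, reactivity_DMS_MaP, reactivity_2A3_MaP) rows for one sequence."""
    length = id_max - id_min + 1  # assumption: id_max is inclusive
    if len(dms) != length or len(a3) != length:
        raise ValueError("need exactly one prediction per position for each experiment")
    return [(id_min + i, dms[i], a3[i]) for i in range(length)]

# A four-position sequence occupying ids 177..180 in the submission file.
rows = submission_rows(177, 180, [0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8])
for row in rows:
    print(row)
```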

quasi dove
#

Or how can I utilise that

hasty iron
quasi dove
slim juniper
hasty iron
#

I wonder if we can restore sparse BPPs to dense BPPs by some kind of imputation, and if we can consider BPPs as a "distance" matrix to employ some existing algorithms for missing data imputation.

quasi dove
hasty iron
quasi dove
naive marsh
#

I think the whole point of the competition is to be able to predict the reactivities with nothing more than the sequence so I probably wouldn't add any features in the test set

quasi dove
digital raptor
#

Is this really an ML competition 😄 Since we dont have a ground-truth data to train, but we're expected to derive the train data 😂

#

I am starting to lose my interest on it :/ How should we know our results are good or not then

sharp torrent
#

reactivity_xxxx columns in the training data

digital raptor
sharp torrent
#

Only the beginning and the end are null

#

The start and end of the sequences have a special role in the experiments which prevent measurements there

#

Also some sequences are shorter, which means there’s extra null padding at the end

#

IIRC there are some instances of nulls in middle, but those are special cases of experimental limitations

mossy arch
#

Hi,
I am new to Deep learning, and while this is a Kaggle competition, I have taken up this RNA folding challenge as my college semester project and am now going through it. While I understand I should have started with a simpler project, I feel I might learn something through this. I was hoping someone who is participating in this challenge would be kind enough to point me in the right direction for getting started.
Thanks in advance.

naive marsh
# mossy arch Hi, I am new to Deep learning, and while this is a Kaggle competition, I have ta...

Hi! While this problem is quite a difficult one, you'll certainly learn a lot by trying to solve it. The basic idea is that you have an RNA sequence in the train_data.csv file with corresponding reactivities (one column for each character of the sequence). The goal is simply to predict the reactivities with only a sequence given to you.

The first thing you should do is to understand the problem. Kaggle gave a few links that are useful to understand what the RNA sequence is all about and what the reactivities are, I'd probably start reading that before doing any code.

hasty iron
#

Is it true that DMS always only reacts with adenosine and cytidine? Can I always replace predictions for other nucleotides with 0s, or are there some cases when reactivity to other nucleotides is non zero?

naive marsh
#

The traditional DMS mapping only detected adenosine and cytidine. But the data we have here were produced with a four-base mapping, so there is no reason for the reactivities of the other bases (G and U) to be 0.

ashen forge
potent dew
#

Hi, Leonor here. Based out of Cambridge, MA with an emphasis on deep learning, network of networks, end to end optimization and the brain. Have been in the Harvard/MIT community since 2016 and Cornell/ Columbia prior. Drop me a line if you are interested in chatting about the data

manic furnace
#

Is anybody still working on the problem? Would love ideas on how y'all are tackling it!

uncut crescent
#

Can we use arnie for submission?

sharp torrent
uncut crescent
sharp torrent
#

Yeah as you noted, particularly depending on the model you use, structure prediction can be slow

#

I think it's also worth mentioning that the models provided by arnie are solving a task analogous to yours - predicting reactivity, predicting secondary structure, and predicting base pairing probabilities are all highly related (they are all trying to determine whether some bases actually wind up paired in nature). The reason we're raising this challenge to kagglers is because the existing models are subpar - if you rely on them you benefit from the work they've already done to solve the problem, but you're also biased towards things that have already been tried with mixed success

uncut crescent
#

We're excited about the possibilities that deep learning, especially transformers, brings to predicting reactivities. The existing models, built on strong scientific foundations, have been incredibly insightful. We're inspired by those models and are looking to blend their strengths with some of the newer data science techniques we're familiar with. It's all about building on top of the great work already done and seeing where we can take it next. Our intention is to collaborate and enhance, not replace. We genuinely feel that this combination could be the kind of innovation the competition is hoping to see.

sharp torrent
#

Makes sense!

#

There is definitely something to be said for starting with things that physical modeling says "makes sense" and refining further

noble seal
#

Hi all! I'm an undergrad in Data Science and AI. I have joined competitions before; the most recent one was WiDS 2023, where my team and I scored high enough to get the 5th place prize. It may have given me a false sense of confidence coming into this competition, as I have no background in this topic whatsoever. In fact, the last time I learned anything about biology was in high school. The domain of this problem seems too complex, and I would love to get a basic understanding that would allow me to compete and learn, but not so detailed that I get lost just trying to understand the topic. I would love any insights and help! Thank you for having me!

vital ridge
#

Was wondering if anyone had any insight into ground truths. Are we certain that these are the reactivity columns?

sharp torrent
#

(206 was it? Can’t remember, would need to check)

vital ridge
# noble seal Hi all! Im undergrad in Data Science and AI, I have joined competitions before a...

Hello Issac and welcome! My undergrad was in biology and my masters work was done at SMU. My post undergraduate work actually involved ligand binding of the muscarinic receptor (something found in the brain that appears to be the cause of signs of dementia such as Alzheimer's). I would definitely be willing to answer what questions you have in terms of the project. I am still learning and relearning myself. But would love to get some discussion going.

#

@sharp torrent 206 you are correct. Since I have you, let me ask. I read the papers regarding the DMS and 2A3 experiment types and I am troubled by the negative values. As I read it, they are probabilities of the form P(paired i) = P(paired i) / (P(paired i) + P(unpaired i)), meaning that 0.5 means we have no information and 1 means perfect information. So what does negative mean?

sharp torrent
#

I know that it’s an experimental artifact, let me see if I can dig up more details

#

I think it has to do with some form of normalization that gets applied

#

Subtracting out background noise and compensating for some “phenomenons like over-modification”, from the source I’m looking at

vital ridge
#

@sharp torrent i remember reading something like that as well.

sharp torrent
#

I was going to link to experimental data processing code I think is relevant, but I realized the repo is private 🫠

vital ridge
#

I think I may have insight about these reactivity values and reactivity error columns

#

According to the DMS report: https://academic.oup.com/nar/article/51/16/8744/7201944

The values in the study are processed by ShapeMapper2 https://github.com/Weeks-UNC/shapemapper2/blob/master/docs/analysis_steps.md#calculation-of-mutation-rates

According to this section, the values are consolidated from three groups: (Modified - Untreated) / Control. A negative value would imply that the untreated sample showed greater reactivity than the modified one. A fractional value means the control had greater reactivity than both combined. But this applies only if there are three groups of samples. With two, it is Modified - Untreated. With one, Modified alone. The reactivity_error column, it seems, is a computation of the same form for the corresponding reactivity column
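
A toy illustration of how background subtraction can produce a negative value, following the two-sample reading above (Modified - Untreated). This is not ShapeMapper2's code: the function names are hypothetical, the counts are made up, the control normalization is left out, and the Poisson-style standard error is an assumption.

```python
import math

def mutation_rate(mutations: int, depth: int) -> float:
    """Fraction of reads that show a mutation at one position."""
    return mutations / depth

def std_error(rate: float, depth: int) -> float:
    """Standard error of a rate, assuming Poisson-distributed mutation counts."""
    return math.sqrt(rate / depth)

# Made-up counts for a single position: the untreated sample is noisier here.
modified = mutation_rate(30, 10_000)
untreated = mutation_rate(50, 10_000)

value = modified - untreated  # background-subtracted signal
# Errors of a difference add in quadrature (assuming independent samples).
error = math.sqrt(std_error(modified, 10_000) ** 2 + std_error(untreated, 10_000) ** 2)
print(value, error)
```

With these numbers the background rate exceeds the modified rate, so the subtracted value comes out slightly below zero even though no physical reactivity is negative.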

plush shoal
#

hey, i see several positions in sequences at the start are null. it is said that they cant be probed due to technical difficulties. My question is, will reactivities for those positions be available in test set during evaluation?

#

If yes, will they be measured in the tests or imputed using some rule?

sick raven
#

hey has anyone submitted? I don't get it, my outputs are 1343823 x 2. Is it meant to be 2 predictions per character in the sequence?

quasi dove
#

As I submitted only median values of 2a3 and dms I got a score of 0.23 something. Now I'm more interested in this project, data cleaned and ready for training ☺️

sharp torrent
quasi dove
#

I'm getting mse 0.0025 for dms is that good ?

quasi dove
#

Got 0.00328 on 2a3

fervent moth
sharp torrent
quasi dove
#

I'm getting 276827538 rows
But the competition overview says 539539340

#

And already my submission file getting bigger than 10 gb

quasi dove
#

after training model it is generalising well , what you think guys ?

quasi dove
sharp torrent
quasi dove
#

I tried to solve it both ways like using sequence model and artificial Neural network but both models getting same mean absolute error. Did cross validation, did features engineering able to improve slightly but no luck can go past 0.18 mae , anybody knows what else I can do like domain knowledge perspective?

#

Help me

zealous fable
#

@quasi dove This is a difficult challenge. There's many professional ML researchers struggling to get into the top 15. Myself included. I suggest you look here: https://pytorch.org/tutorials/beginner/transformer_tutorial.html . Also nanoGPT could be a good starting point. Think about how you could modify it to output reactivity rather than word predictions.

quasi dove
sick raven
#

has anyone figured out what to do with the error values?

#

just double checking are we certain that error values = 1/sqrt(reads)?

quasi dove
sick raven
#

Ok, and whats the best way to fillna?

sick raven
vital ridge
#

@sick raven the DMS paper has insight. It's a standard error.

vital ridge
# sick raven Ok, and whats the best way to fillna?

Since not even negative values mean much, probably just throw them out. They represent gaps. RNA is a sequence, and we already know the first 26 are NA de facto. As far as I can tell only the first and last items are NA

quasi dove
vital ridge
#

What kind of models are you using? Key thing to understand is it's not a sequence that is being predicted it's a grouping. So if they are values with known reactivity they should inform the routine with unknown

quasi dove
# vital ridge What kind of models are you using? Key thing to understand is it's not a sequenc...

I'm using sequence model to understand RNA sequences and then on top of that added dense layer to predict reactivities
From 1 to 206
First i tried with
Neural network extracted features from RNA sequences and then predicted pretty fast training and prediction (mae =0.18 almost equal in test and validation set) for both 2a3 and dms
But then I tried with the sequence model
I got mae like 0.14 good improvement
But prediction on CPU takes like 12 hrs
6 hrs for dms and 6 2a3
On tpu it takes 3 hrs . Due to this limitation I'm not able to use other models like i want to go with transformers to improve more

#

Input is RNA sequences and its padded sequence length and the 206 reactivity are the target variables
Also I filtered out sn, filter, reads etc

zealous fable
quasi dove
#

I have to change outputs precision somehow

zealous fable
#

@quasi dove I recommend you use google colab, as well as Kaggle. This way you can use one for prediction, one for training. AWS also has a free tier for SageMaker and EC2, you could make use of that.

quasi dove
quasi dove
#

Even in sequence data I'm not getting good rank as i uploaded submission file 300+ rank

vital ridge
#

What sequence modeling package are you using

vital ridge
quasi dove
#

Next adding attention will be helpful

glacial tulip
#

Hi, guys. The question was asked earlier but it looks like it wasn't answered. What should we predict at these positions? Are they going to be included into evaluation?

The first 26 positions in the sequence are the leader on the 5' end that is necessary for the experiment. The first 26 positions in the sequence are identical across the set.
The last 39-51 positions (it varies) are the barcode hairpin and tail necessary for the experiment. The barcode hairpin is a unique identifier within each batch of experimental testing that was performed.

sharp torrent
# glacial tulip Hi, guys. The question was asked earlier but it looks like it wasn't answered. W...

See https://www.kaggle.com/competitions/stanford-ribonanza-rna-folding/discussion/441122 - the recommendation is to attempt predicting something by generalizing the interior model to the edges, but if the data is not available for those positions (which is entirely possible), those positions won’t affect the score

zealous fable
#

Are there any details on the training data update? I seem to have missed the information on that.

sick raven
#

Has anyone tried

reactivity + sqrt(error)*sqrt(s2n)?

sharp torrent
#

(I'm guessing the repo I was looking at was an older version, though I'm not certain)

sick raven
#

Can someone please explain how they calculated the error values?

vital ridge
sterile schooner
#

Hello everyone. I recently joined this competition and want to participate, and I have a question: what are our inputs (features) and what are we trying to predict?

vital ridge
#

@sterile schooner a sequence of RNA molecules and a list of 'reactivities' and reactivity errors. We are still trying to figure out

slim juniper
#

The input is the RNA sequence and reactivity values for each position in the sequence. You are predicting the reactivity value of each position for the RNA sequences in the test set. Go ahead and predict reactivity for the beginning and end sections even though no values for those sections are provided in the train set. The ultimate goal is to use the reactivity data to predict RNA base pairing and folding both in a 2D and 3D context.

sterile schooner
#

um so we will give the 'sequence' to the model (as x) and the model will return (a list) of reactivitys (as y) right ?

#

so this means that we will train the model by giving the sequence as x and reactivity s (all of them , or only those with non Null values ?) as y to train the model to predict the reactivity and then we apply the model to the test set and create reactivity columns and then submit that ?

#

and then the hosts will use reactivity values to create the structures ?

sterile schooner
#

some thing like this :
sequence(GGGAACGACUCGAGUAGAGUCGAAAAAU...) ===> Model ===> reactivity_0001 = value , reactivity_0002 = value , ... , reactivity_error_0205= value , reactivity_error_0206= value
?

#

And since the sequence length of train and test (public data) is different from the private data, we have to pad our sequences to the max length of the test data, right? So that the model can work on the private data
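
One way to picture the padding question, as a sketch only: map bases to integer tokens, pad to a chosen maximum length, and keep a mask so padded positions can be ignored later. The vocabulary, the pad id 0, and the helper `pad_batch` are all assumptions for illustration.

```python
from typing import List, Sequence, Tuple

VOCAB = {"A": 1, "C": 2, "G": 3, "U": 4}  # 0 is reserved for padding
PAD_ID = 0

def pad_batch(sequences: Sequence[str], max_len: int) -> Tuple[List[List[int]], List[List[bool]]]:
    """Tokenize sequences and right-pad them to max_len; the mask marks real positions."""
    tokens, mask = [], []
    for seq in sequences:
        if len(seq) > max_len:
            raise ValueError(f"sequence of length {len(seq)} exceeds max_len={max_len}")
        ids = [VOCAB[base] for base in seq]
        pad = max_len - len(ids)
        tokens.append(ids + [PAD_ID] * pad)
        mask.append([True] * len(ids) + [False] * pad)
    return tokens, mask

tokens, mask = pad_batch(["GGAU", "AC"], max_len=6)
print(tokens)
print(mask)
```

Padding per batch to the longest sequence in that batch, rather than to one global maximum, is a common alternative when lengths vary a lot.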

sharp torrent
sharp torrent
sterile schooner
#

so it should be like this ?
sequence(GGGAACGACUCGAGUAGAGUCGAAAAAU...) ===> Model ===> 2A3 :reactivity_0001 = value , reactivity_0002 = value , ... , reactivity_error_0205= value , reactivity_error_0206= value & DMS :
reactivity_0001 = value , reactivity_0002 = value , ... , reactivity_error_0205= value , reactivity_error_0206= value

sharp torrent
#

right

sterile schooner
#

Thanks a lot

sharp torrent
#

Assuming that sequence is 206 elements long

sterile schooner
#

yes

sharp torrent
sterile schooner
#

so you mean I have to create a model that trains on data that has those null values but I also have to predict values for them in the test set ?

#

and if so does it mean I have to find some way to fill the null values ?

sharp torrent
#

You'll need a way to get your model to predict them, though you presumably won't want to augment the training data yourself, as values you'd insert would not actually be ground truth values
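
A common way to train on rows with nulls without inventing values is to leave the nulls out of the loss. The sketch below computes a mean absolute error over measured positions only; `masked_mae` is a hypothetical helper and the numbers are made up.

```python
import math
from typing import Optional, Sequence

def masked_mae(predictions: Sequence[float], targets: Sequence[Optional[float]]) -> float:
    """Mean absolute error over positions whose target is present (not None or NaN)."""
    total, count = 0.0, 0
    for pred, target in zip(predictions, targets):
        if target is None or math.isnan(target):
            continue  # no ground truth here, so this position contributes nothing
        total += abs(pred - target)
        count += 1
    return total / count if count else 0.0

targets = [float("nan"), 0.2, 0.5, None]
predictions = [0.9, 0.1, 0.7, 0.3]
loss = masked_mae(predictions, targets)
print(loss)
```

The model still outputs a value at every position; only the positions with measurements influence training.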

#

Unfortunately I don't have more useful suggestions on how to handle this specifically - I am not an ML expert myself 🙂

sterile schooner
#

Thanks a lot I'll look in to it

sick raven
#

does anyone have a background in biology? I really wanna understand how those error values can be used
I'd be down to team up

sharp torrent
sharp torrent
sterile schooner
granite isle
#

Hello everyone, I just started and doing some eda, but a quick question, anyone who have tried the Graph Transformer in which the Nucleotide Identity as node ,open knot and rna seq join for Laplacian eigenvector as position encoding.

vital ridge
# sick raven does anyone have a background in biology? I really wanna understand how those er...

Hi Pranshu. I have been doing a lot of investigation. I think we discovered it has more to do with DasLab Ubr transformation. But it is in Matlab. I have transformed most of the code to python. I can get the data loaded but I have not completed the transformation. I have an issue somewhere in the process. But generally I can sketch out the meaning and get a better understanding of what is trying to be achieved. That being said I am split between whether it is a worthy investment to identify the exact nature of these positive and negative numbers except that they represent some nature of insertion, deletion or subtraction.

quasi dove
#

As i tried with sequence model , also with neural network using extracted features not able to achieve good score , so I need to utilise reactivity error to score up like weighing reactivities i guess
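
One possible way to weigh reactivities by their error, sketched under assumptions: each position gets a weight of 1 / (eps + error), so noisy measurements count less. This weighting rule, the `eps` value, and the helper `error_weighted_mae` are illustrative choices, not something the organizers prescribed.

```python
from typing import Sequence

def error_weighted_mae(predictions: Sequence[float],
                       targets: Sequence[float],
                       errors: Sequence[float],
                       eps: float = 0.1) -> float:
    """Weighted mean absolute error with weight 1 / (eps + error) per position."""
    weights = [1.0 / (eps + err) for err in errors]
    weighted = sum(w * abs(p - t) for w, p, t in zip(weights, predictions, targets))
    return weighted / sum(weights)

predictions = [0.5, 0.5]
targets = [0.4, 1.0]   # the second target is far off...
errors = [0.1, 0.9]    # ...but it is also the noisier measurement
weighted_loss = error_weighted_mae(predictions, targets, errors)
plain_loss = sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)
print(weighted_loss, plain_loss)
```

Here the weighted loss is smaller than the plain one because the large miss falls on the position with the large reported error.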

sick raven
sharp torrent
sharp torrent
vital ridge
zealous fable
#

Does anyone have the eternafold/vienna sequence predictions for the entire train/test set? It's going to take 2 days on 16 threads to process! 😮

sick raven
#

Do y'all think #reads_error =(1/reactivity_error)**2 ?

#

No worries thanks for the advice about error weightage tho

vital ridge
#

@sick raven not sure what reads_error is. It's not on the training data

sick raven
#

Hmm so reactivity_error is calculated as the Poisson distribution of the amount of reads that went wrong for the character
So number of reads that gave an error value = (1/reactivity_error)**2

That's how I interpret it but I could be wrong

Don't know how to use reactivity error otherwise

cobalt trout
#

Hi! Wondering whether it is too late to join this based on the time left and complexity of the problem. Very interested on this project

merry solstice
vital ridge
#

@sick raven looking through the code it gets a little confusing. But there is reactivity, reactivity error and normalized reactivity and error. For the former reactivity is just an accumulation of all the different kinds of mutation counts subtracting out no modification controls. And the error is simply the square root of each kind of mutation (transposition pairs, deletions, and 1/reads) squared and added.

zealous fable
#

Does anyone in the top 100 want to team up for a final push in these last 18 days?

vital ridge
#

I just posted my Ubr Walkthrough results. Wanted to know if anyone had any input

sick raven
#

Honestly I've given up on this lol nothing works. I might have to start getting models from huggingface

vital ridge
sick raven
#

I tried everything on the custom model side...Transformers, Retention, RNN, LSTM

#

I think fine-tuning a pretrained model from huggingface might be the next step

zealous fable
#

I found the best gains improving the embeddings, pre-transformer. Look for how you can concat other sequence information to your nucleotide embeddings.
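
A toy version of the idea of concatenating extra per-position information onto the nucleotide representation. One-hot vectors stand in for learned embeddings, and the extra feature (a made-up pairing probability per base) is purely an assumption; `build_features` is a hypothetical helper.

```python
from typing import List, Sequence

NUCLEOTIDES = "ACGU"

def one_hot(base: str) -> List[float]:
    """One-hot vector over A, C, G, U (a stand-in for a learned embedding)."""
    return [1.0 if base == n else 0.0 for n in NUCLEOTIDES]

def build_features(sequence: str, extra: Sequence[float]) -> List[List[float]]:
    """Concatenate one extra scalar feature onto each position's base vector."""
    if len(sequence) != len(extra):
        raise ValueError("need one extra feature value per base")
    return [one_hot(base) + [value] for base, value in zip(sequence, extra)]

# Hypothetical per-base pairing probabilities for a four-base sequence.
features = build_features("GGAU", [0.9, 0.8, 0.1, 0.2])
for row in features:
    print(row)
```

In a real model the same concatenation would happen on tensors before the first transformer layer, usually followed by a linear projection back to the model width.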

vital ridge
#

What kind of layers did you try for rnn. That is what I am working on right now

sick raven
cyan bear
sick raven
#

No i concat them

zealous fable
#

One thing I've noticed in this competition is that the heads of the sequence models need aggressive dropout, 0.5 for example. I've had luck increasing the dropout linearly during training, 0.1-0.4. Has anyone else seen this behavior? I don't understand the root cause.
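
The linear ramp mentioned above could look something like this. It is only a sketch: the function name `linear_dropout` is hypothetical, and how the rate is applied to a model (for example by updating the `p` attribute of a dropout layer each epoch) depends on the framework.

```python
def linear_dropout(step: int, total_steps: int, start: float = 0.1, end: float = 0.4) -> float:
    """Dropout rate that rises linearly from start (first step) to end (last step)."""
    if total_steps <= 1:
        return end
    fraction = min(max(step / (total_steps - 1), 0.0), 1.0)  # clamp to [0, 1]
    return start + (end - start) * fraction

# Rates for an 11-epoch run: 0.1 at epoch 0, rising to 0.4 at epoch 10.
schedule = [linear_dropout(epoch, 11) for epoch in range(11)]
print([round(rate, 2) for rate in schedule])
```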

sick raven
#

Yeah 0.5 dropout seems high to me too.

hollow cobalt
sharp torrent
# hollow cobalt I wanted to understand this comment from organizers. https://www.kaggle.com/comp...

RNA has sorta similar concerns that computer vision does. In computer vision, you should be able to identify a cat no matter if it is translated, rotated, flipped around, squashed, etc - you shouldn't need to rely on its specific position in the training data. Similarly in RNA, if you had something like GGGGAAAACCCCAAAAAAAAAAA vs AAAAAAAGGGGAAAACCCCAAAA, the Gs should be paired with the Cs. Of course it winds up getting more complex than that, but the point is that you don't want to be taking RNA bases at specific positions as features directly without any sort of processing; your model needs to be able to find patterns without relying on their specific positions in the sequence

#

Your model should also be able to work on RNA sequences of arbitrary length - same idea!

hollow cobalt
#

Thanks for the response and really appreciate the details. But in computer vision I do have ground truth saying it's a cat. Here we have none. Analogically if I'm doing semantic segmentation (per pixel classification) and if I don't have ground truth for some pixels it's just a best guess which can't be validated and different training would lead to different aggregation over the unknown area. What makes matters worse in my view is the final evaluation has some fixed values for these unknown regions (if I understand correctly)

rustic olive
#

In the case of transformers you can train with central labels and hope that positional encoding generalizes to the tails. I think it's about that, basically.

rustic olive
hollow cobalt
hollow cobalt
sharp torrent
#

You don’t know what the background is in that training data - and it could in fact contain a dog, but we have both positive and negative examples of dogs and cats at other positions, so you should be able to identify it at that position still

hollow cobalt
# sharp torrent Admittedly my expertise is not in ML, but here’s how I would approach the intuit...

The issue with this is that a lot of things can legitimately be around the cat. The question is how much of the gap in the image/semantics we are filling in. E.g. if there is a ground truth hole in the image of a table, it could be part of the table, a flower pot, a kitchen plate, a plate with food, etc. What your model chooses depends on how the model matches the loss. Also remember we do not find a global minimum with SGD.

sharp torrent
#

Sure, that's true

vital ridge
#

do we have until midnight?

sharp torrent
# vital ridge do we have until midnight?

From the overview page, “All deadlines are at 11:59 PM UTC on the corresponding day unless otherwise noted. The competition organizers reserve the right to update the contest timeline if they deem it necessary.”

#

Which should be 8 minutes from now!

sick raven
#

GG everyone!

sharp torrent
#

Congrats all, and thanks for joining us in our research!

#

I know our team is excited to dig into the winning submissions

vital ridge
#

c'est la vie