#stanford-ribonanza-rna-folding | Kaggle | Page 1

terse elbow Sep 8, 2023, 3:42 PM

#

Hi

limpid oyster Sep 8, 2023, 4:06 PM

#

Hi :)

dusky ember Sep 8, 2023, 4:51 PM

#

hello all

#

lots of new competitions recently

#

this one looks pretty interesting, judging entirely by the name

limpid oyster Sep 8, 2023, 7:56 PM

#

happy to connect with others who are thinking about this problem 🙂

autumn sierra Sep 9, 2023, 4:50 PM

#

Hi everyone, happy to discuss how to tackle this and/or work together.

limpid oyster Sep 9, 2023, 5:29 PM

#

autumn sierra Hi everyone, happy to discuss how to tackle this and/or work together.

Wanna do a call some time?

hasty iron Sep 9, 2023, 10:26 PM

#

I can't see any prediction target columns (reactivity_DMS_MaP, reactivity_2A3_MaP) in the train_data.csv file, are they located somewhere else?

sharp torrent Sep 9, 2023, 11:39 PM

#

hasty iron I can't see any prediction target columns (reactivity_DMS_MaP, reactivity_2A3_Ma...

You'll want to look at reactivity_0001 through reactivity_0170

#

(And the error columns if you want to take that into account)

#

Each column represents the target value at the given base index, keep in mind different solutions are of varying lengths

#

Some indexes don't have data either because of the length variation or limitations in the experimental process

hasty iron Sep 9, 2023, 11:41 PM

#

Oh I see, so one entry in sample_submission corresponds to one position in a sequence, not to the whole sequence, right?

sharp torrent Sep 9, 2023, 11:41 PM

#

(Or, iirc, error that is too high)

sharp torrent Sep 9, 2023, 11:41 PM

#

hasty iron Oh I see, so one entry in sample_submission corresponds to one position in a seq...

ID has the description of "Unique index for every sequence position in test sequences", so I assume so

autumn sierra Sep 11, 2023, 2:43 PM

#

limpid oyster Wanna do a call some time?

I could be down. I want to have a go at it end-to-end first.

hasty iron Sep 12, 2023, 1:03 AM

#

Beware that reactivity is 1-indexed, not 0-indexed in train_data.csv. I got 0.304 lb instead of 0.177 because I missed my indices by 1 position.
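To make the off-by-one concrete, here is a minimal sketch of the mapping between 0-based string positions and the 1-based column names. The helper names are made up for illustration; only the `reactivity_0001`-style column naming comes from the data.

```python
def reactivity_column(pos0: int) -> str:
    """Column name for a 0-based position in the sequence string."""
    return f"reactivity_{pos0 + 1:04d}"


def position_from_column(col: str) -> int:
    """0-based sequence position for a reactivity (or error) column name."""
    return int(col.rsplit("_", 1)[1]) - 1


sequence = "GGGAACGACU"
# sequence[0] pairs with reactivity_0001, sequence[1] with reactivity_0002, ...
pairs = [(base, reactivity_column(i)) for i, base in enumerate(sequence)]
```

Keeping the conversion in one helper avoids sprinkling `+ 1` / `- 1` through the rest of the code.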

quasi dove Sep 12, 2023, 10:20 AM

#

Yes they have not given ground truth values directly for reactivity dms_map and reactivity_2a3_map, so we have to figure it out. Any domain experts here?

naive marsh Sep 12, 2023, 2:24 PM

#

the ground truth in the test_sequence.csv will be given later during the competition. Right now, we have to work with the train_data.csv file and divide it into training and testing datasets.

#

Is anyone here using R for this competition? There are plenty of cool libraries for genomic data analysis that could be useful.

sharp torrent Sep 12, 2023, 3:06 PM

#

quasi dove Yes they have not given ground truth values directly for reactivity dms_map and ...

I believe you’re confused about where the values are for the training data? It’s reactivity_0001 through reactivity_0170 (along with the reactivity error columns if you want to take experimental error into account). Sequence positions where data could not be collected will be null, and for sequences shorter than 170 they will be padded at the end with nulls. Each row provides data for either DMS or 2A3 depending on the experiment type column. Also in your submission it will be one row per sequence position with both DMS and 2A3, rather than all positions for just one.
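A rough sketch of the "one row per experiment type" layout described above: grouping rows by sequence so both targets sit side by side. The column names `sequence_id` and `experiment_type` and the values `DMS_MaP` / `2A3_MaP` are assumptions here, and the rows are toy data, so check them against the actual CSV header.

```python
from collections import defaultdict


def pair_experiments(rows):
    """Group per-experiment rows into {sequence_id: {experiment_type: row}}."""
    paired = defaultdict(dict)
    for row in rows:
        paired[row["sequence_id"]][row["experiment_type"]] = row
    return dict(paired)


# Toy rows standing in for lines of the training CSV (values are invented).
rows = [
    {"sequence_id": "seq_a", "experiment_type": "DMS_MaP", "reactivity_0027": 0.41},
    {"sequence_id": "seq_a", "experiment_type": "2A3_MaP", "reactivity_0027": 0.12},
    {"sequence_id": "seq_b", "experiment_type": "DMS_MaP", "reactivity_0027": None},
]
paired = pair_experiments(rows)
# Sequences for which both experiment types are available
complete = [sid for sid, exps in paired.items()
            if {"DMS_MaP", "2A3_MaP"} <= set(exps)]
```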

naive marsh Sep 12, 2023, 3:08 PM

#

sharp torrent I believe you’re confused about where the values are for the training data? It’s...

isn't the reactivity going from 1 to 206 ?

sharp torrent Sep 12, 2023, 3:08 PM

#

Oh sorry, it might be

#

I may be misremembering the columns

sharp torrent Sep 12, 2023, 4:22 PM

#

Aaah yeah the columns are out of order in the list I was looking at 😅

dull owl Sep 12, 2023, 5:35 PM

#

I did a general query to check what's null as most of the last columns seemed null in dataframe.

#

Are reactivity columns 1 to 26 and 166 to 206 all null? And their error columns as well?

sharp torrent Sep 12, 2023, 5:56 PM

#

The beginning and end of each sequence will be null - due to the way the experiments work, we can't get data for those regions

#

Be aware that not all sequences are the same length, so for some sequences the nulls will start earlier than others (since shorter sequences get padded with additional nulls on the end)
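A small sketch of how one might find the measured region of a single row, given that the nulls start and end at different places per sequence. The function names are hypothetical, and the row is toy data rather than a real record.

```python
import math
from typing import Optional, Sequence, Tuple


def is_missing(value: Optional[float]) -> bool:
    """True for None or NaN, the two usual ways a null shows up after loading."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def measured_span(values: Sequence[Optional[float]]) -> Optional[Tuple[int, int]]:
    """0-based (first, last) indices of non-null reactivities, or None if all null."""
    present = [i for i, v in enumerate(values) if not is_missing(v)]
    return (present[0], present[-1]) if present else None


# Toy row: 3 null leading positions, 4 interior positions (one of them NaN),
# then 5 null trailing/padding positions.
row = [None] * 3 + [0.2, 0.0, float("nan"), 0.7] + [None] * 5
span = measured_span(row)
```

Note that a null inside the span (like the NaN above) still needs to be masked separately when computing a loss.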

hasty iron Sep 12, 2023, 9:10 PM

#

Has anyone figured out how to validate on longer sequences yet? It seems that the majority of training data (98%+) with positive SN_filter is all the same length. I'd like to see the experiment protocol to try searching for additional datasets.

quasi dove Sep 13, 2023, 2:11 AM

#

sharp torrent The beginning and end of each sequence will be null - due to the way the experim...

There is additional data provided, like structure and Ribonanza_bpp. Is there any use for that?

sharp torrent Sep 13, 2023, 2:23 AM

#

I believe those are computed based on existing (thermodynamic) algorithms

#

So I guess up to you whether you want to introduce that bias?

#

Structure may be computed based on being run against multiple available algorithms and choosing the best fit for the chemical mapping data, but I'm not sure - I'd need to see where specifically you're looking at

naive marsh Sep 13, 2023, 9:51 AM

#

Has anyone figured out why we have negative reactivities ? chemical reactivities should only be positive, and the negative values don't seem to fall within the reactivity error

proven quiver Sep 13, 2023, 2:43 PM

#

naive marsh Has anyone figured out why we have negative reactivities ? chemical reactivities...

I think that is instrument noise, which for practical purposes means that the reactivity is zero.
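If one takes that reading at face value, a minimal way to apply it to training targets is to clip at zero. This is only a sketch of that one idea (the function name is made up), not a statement about how the competition scores predictions.

```python
from typing import List, Optional


def clip_negative(values: List[Optional[float]]) -> List[Optional[float]]:
    """Replace negative reactivities with 0.0, leaving nulls untouched."""
    return [v if v is None else max(v, 0.0) for v in values]


targets = clip_negative([-0.12, 0.0, 0.35, None, 1.4])
```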

naive marsh Sep 13, 2023, 3:08 PM

#

yeah, that makes more sense like that. Thanks !

sharp torrent Sep 13, 2023, 3:10 PM

#

Yep, came up in the discussion here: https://www.kaggle.com/competitions/stanford-ribonanza-rna-folding/discussion/437850#2429772

Stanford Ribonanza RNA Folding

Create a model that predicts the structures of any RNA molecule

quasi dove Sep 13, 2023, 3:49 PM

#

sharp torrent Yep, came up in the discussion here: https://www.kaggle.com/competitions/stanfor...

Reactivity_0001...... reactivity_0206
In that, do 0001, 0002 have any meaning?

hasty iron Sep 13, 2023, 4:12 PM

#

quasi dove Reactivity_0001...... reactivity_0206 In that, do 0001, 0002 have any me...

0001 corresponds to the first nucleotide of the RNA sequence, 0002 corresponds to the second one, and so on.

quasi dove Sep 13, 2023, 4:26 PM

#

hasty iron 0001 corresponds to the first nucleotide of the RNA sequence, 0002 corresponds t...

Ohh i see, suppose i want to find ground truth values of reactivity_dms_map and 2a3_map using reactivity_000* so taking mean of all those reactivity columns will be worthless right ?

hasty iron Sep 13, 2023, 4:29 PM

#

Well you could probably do that, but you have to predict reactivity of each position of the RNA sequence in the testing dataset, so it'd probably be quite challenging considering you only use mean values.

quasi dove Sep 13, 2023, 4:30 PM

#

hasty iron Well you could probably do that, but you have to predict reactivity of each posi...

But in submission they have provided only dms and 2a3

hasty iron Sep 13, 2023, 4:33 PM

#

Submission csv has one row per RNA position. You can see id_min and id_max in test_sequences.csv, you have to put corresponding predictions for this RNA between these indices.
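As a sketch of what that looks like in code: one output row per position, with ids running from id_min to id_max. This assumes id_max is inclusive and that the id column is called `id`; the two reactivity column names are the ones mentioned earlier in this channel. The numbers are toy values.

```python
import csv
import io


def submission_rows(id_min, id_max, dms_preds, a3_preds):
    """Yield (id, DMS, 2A3) tuples for one test sequence, one row per position."""
    n = id_max - id_min + 1  # assumes id_max is inclusive
    if len(dms_preds) != n or len(a3_preds) != n:
        raise ValueError("need exactly one prediction per sequence position")
    for offset in range(n):
        yield id_min + offset, dms_preds[offset], a3_preds[offset]


# Write to an in-memory buffer here; a real run would open a file instead.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["id", "reactivity_DMS_MaP", "reactivity_2A3_MaP"])
for row in submission_rows(100, 103, [0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]):
    writer.writerow(row)
lines = buffer.getvalue().splitlines()
```

The length check is there because an off-by-one between predictions and ids silently shifts every later row.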

quasi dove Sep 13, 2023, 4:36 PM

#

hasty iron Submission csv has one row per RNA position. You can see id_min and id_max in te...

Thanks 😊, what about other data that have been provided like Ribonanza_bpp and the other folder which contains structures and other stuff? Adding those features will improve accuracy, but do you think that is relevant data?

#

Or how can I utilise that

hasty iron Sep 13, 2023, 4:40 PM

#

quasi dove Thanks 😊, what about other data that have been provided like Ribonanza_bpp and t...

I haven't tried using it yet, but I believe it can be helpful.

quasi dove Sep 13, 2023, 4:51 PM

#

hasty iron I haven't tried using it yet, but I believe it can be helpful.

Ohhk btw thanks for help

slim juniper Sep 14, 2023, 12:28 AM

#

quasi dove Thanks 😊, what about other data that have been provided like Ribonanza_bpp and t...

The structures predicted by existing algorithms (based on the sequence only) and the base pair probabilities (based on the sequence only) are provided in case anyone wants to explore correlations or to use for training a model. I've no idea if any of them will prove useful. Ultimately, the reactivities are used by RNA researchers to predict the RNA molecule's structure (how it folds).

hasty iron Sep 14, 2023, 5:11 PM

#

I wonder if we can restore sparse BPPs to dense BPPs by some kind of imputation, and if we can consider BPPs as a "distance" matrix to employ some existing algorithms for missing data imputation.
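For reference, a rough sketch of the plain sparse-to-dense step, before any imputation. It assumes each line of a BPP file is a whitespace-separated `i j p` triple with 1-based indices, which should be checked against the real files. Missing pairs are simply filled with zero here, which is not the imputation idea itself.

```python
def dense_bpp(lines, length):
    """Build a symmetric length x length matrix from sparse 'i j p' lines."""
    matrix = [[0.0] * length for _ in range(length)]
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        i, j, p = int(parts[0]) - 1, int(parts[1]) - 1, float(parts[2])
        matrix[i][j] = matrix[j][i] = p
    return matrix


# Toy input: two base pairs in a 6-nucleotide sequence (values are invented).
sparse = ["1 6 0.85", "2 5 0.40", ""]
bpp = dense_bpp(sparse, length=6)
# Probability that each base is unpaired: 1 minus its row sum
unpaired = [1.0 - sum(row) for row in bpp]
```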

quasi dove Sep 15, 2023, 4:28 AM

#

hasty iron I wonder if we can restore sparse BPPs to dense BPPs by some kind of imputation,...

Do we need to modify test data set like adding features to it ?

hasty iron Sep 15, 2023, 6:26 PM

#

quasi dove Do we need to modify test data set like adding features to it ?

If you add features for your training data you should most likely have them for the testing data as well because these features would most likely be required by your model to make predictions.

quasi dove Sep 16, 2023, 6:33 AM

#

hasty iron If you add features for your training data you should most likely have them for ...

Yes I know but in test.csv they have provided only sequence feature

naive marsh Sep 16, 2023, 12:23 PM

#

I think the whole point of the competition is to be able to predict the reactivities with nothing more than the sequence so I probably wouldn't add any features in the test set

quasi dove Sep 17, 2023, 5:09 AM

#

naive marsh I think the whole point of the competition is to be able to predict the reactivi...

Maybe we have to derive features from the sequence itself, as there are lots of features in the training data but not in the test data. A model that relies on them will fail when predicting on the test set because the features are absent.
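The simplest feature that can be derived from the sequence alone, and is therefore available for both train and test, is a one-hot encoding of the bases. A minimal sketch (the A, C, G, U ordering is an arbitrary choice):

```python
BASES = "ACGU"


def one_hot(sequence: str):
    """Encode each base as a 4-element 0/1 vector in A, C, G, U order."""
    index = {base: k for k, base in enumerate(BASES)}
    encoded = []
    for base in sequence.upper():
        vector = [0] * len(BASES)
        if base in index:  # unknown symbols stay all-zero
            vector[index[base]] = 1
        encoded.append(vector)
    return encoded


features = one_hot("GGGAAC")
```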

digital raptor Sep 17, 2023, 9:06 PM

#

Is this really an ML competition 😄 Since we dont have a ground-truth data to train, but we're expected to derive the train data 😂

#

I am starting to lose my interest on it :/ How should we know our results are good or not then

sharp torrent Sep 17, 2023, 9:22 PM

#

digital raptor Is this really an ML competition 😄 Since we dont have a ground-truth data to t...

You do have ground truth data to train

#

reactivity_xxxx columns in the training data

digital raptor Sep 17, 2023, 9:58 PM

#

sharp torrent reactivity_xxxx columns in the training data

Yes but they seem to be 100% null values

sharp torrent Sep 17, 2023, 9:58 PM

#

Only the beginning and the end are null

#

The start and end of the sequences have a special role in the experiments which prevent measurements there

#

Also some sequences are shorter, which means there’s extra null padding at the end

#

IIRC there are some instances of nulls in middle, but those are special cases of experimental limitations

mossy arch Sep 18, 2023, 2:30 PM

#

Hi,
I am new to Deep learning, and while this is a Kaggle competition, I have taken up this RNA folding challenge as my college semester project and am now going through it. While I understand I should have started with a simpler project, I feel I might learn something through this. I was hoping someone who is participating in this challenge would be kind enough to point me in the right direction regarding getting started.
Thanks in advance.

naive marsh Sep 18, 2023, 3:01 PM

#

mossy arch Hi, I am new to Deep learning, and while this is a Kaggle competition, I have ta...

Hi! While this problem is quite a difficult one, you'll certainly learn a lot by trying to solve it. The basic idea is that you have an RNA sequence in the train_data.csv file with corresponding reactivities (one column for each character of the sequence). The goal is simply to predict the reactivities with only a sequence given to you.

The first thing you should do is to understand the problem. Kaggle gave a few links that are useful to understand what the RNA sequence is all about and what the reactivities are, I'd probably start reading that before doing any code.

#

This article for instance gives a good description of what the DMS reactivities are : https://academic.oup.com/nar/article/51/16/8744/7201944
and the same one about 2A3 : https://academic.oup.com/nar/article/49/6/e34/6062772

OUP Academic

Mutation signature filtering enables high-fidelity RNA structure pr...

Abstract. Chemical probing experiments have transformed RNA structure analysis, enabling high-throughput measurement of base-pairing in living cells. Dimethyl s

hasty iron Sep 18, 2023, 10:18 PM

#

Is it true that DMS always only reacts with adenosine and cytidine? Can I always replace predictions for other nucleotides with 0s, or are there some cases when reactivity to other nucleotides is non zero?

naive marsh Sep 19, 2023, 2:35 PM

#

Traditional DMS mapping only detected adenosine and cytidine. But the data we have here was produced with four-base mapping, so there is no reason for the reactivities to be 0 at bases other than A and C.

ashen forge Sep 21, 2023, 12:02 AM

#

quasi dove Maybe we have to derive features from the sequence itself , as there are lots o...

These are great questions. The extra columns in the train_data.csv like signal_to_noise, reads, and reactivity_error_xxxx may be most useful in deciding whether to filter out or somehow down weight noisy data during training, not as features to be input to the model (since you won't have those values to make predictions for test sequences).
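One possible way to do the down-weighting, sketched as a weighted mean absolute error. The weight `1 / (1 + error)` is an arbitrary illustrative choice and not something the hosts prescribe. The sketch also assumes an error value is present wherever a target is present.

```python
def weighted_mae(preds, targets, errors):
    """Mean absolute error where noisier positions count less.

    Positions with a null target are skipped. The weight 1 / (1 + error)
    is a hypothetical choice made only for this example.
    """
    total, weight_sum = 0.0, 0.0
    for pred, target, error in zip(preds, targets, errors):
        if target is None:
            continue
        weight = 1.0 / (1.0 + error)
        total += weight * abs(pred - target)
        weight_sum += weight
    return total / weight_sum if weight_sum else 0.0


# Toy values: the second position has no target, the third is very noisy.
loss = weighted_mae([0.5, 0.2, 0.9], [0.4, None, 0.5], [0.0, 0.1, 1.0])
```

Hard filtering (dropping rows below a signal_to_noise threshold) is the other option mentioned; the two can be combined.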

potent dew Oct 5, 2023, 3:10 PM

#

Hi, Leonor here. Based out of Cambridge, MA with an emphasis on deep learning, network of networks, end to end optimization and the brain. Have been in the Harvard/MIT community since 2016 and Cornell/ Columbia prior. Drop me a line if you are interested in chatting about the data

manic furnace Oct 12, 2023, 10:49 PM

#

Is anybody still working on the problem? Would love ideas on how y'all are tackling it!

uncut crescent Oct 16, 2023, 10:11 PM

#

Can we use arnie for submission?

sharp torrent Oct 16, 2023, 10:21 PM

#

uncut crescent Can we use arnie for submission?

Most likely - how are you thinking of using it?

uncut crescent Oct 17, 2023, 5:51 PM

#

sharp torrent Most likely - how are you thinking of using it?

I just posted the ideas 🙂

sharp torrent Oct 17, 2023, 6:01 PM

#

Yeah as you noted, particularly depending on the model you use, structure prediction can be slow

#

I think it's also worth mentioning that the models provided by arnie are solving a task analogous to yours - predicting reactivity, predicting secondary structure, and predicting base pairing probabilities are all highly related (they are all trying to determine whether some bases actually wind up paired in nature). The reason we're raising this challenge to kagglers is because the existing models are subpar - if you rely on them you benefit from the work they've already done to solve the problem, but you're also biased towards things that have already been tried with mixed success

uncut crescent Oct 17, 2023, 8:15 PM

#

We're excited about the possibilities that deep learning, especially transformers, brings to predicting reactivities. The existing models, built on strong scientific foundations, have been incredibly insightful. We're inspired by those models and are looking to blend their strengths with some of the newer data science techniques we're familiar with. It's all about building on top of the great work already done and seeing where we can take it next. Our intention is to collaborate and enhance, not replace. We genuinely feel that this combination could be the kind of innovation the competition is hoping to see.

sharp torrent Oct 17, 2023, 9:00 PM

#

Makes sense!

#

There is definitely something to be said for starting with things that physical modeling says "makes sense" and refining further

noble seal Oct 18, 2023, 5:50 PM

#

Hi all! I'm an undergrad in Data Science and AI. I have joined competitions before; the most recent one was WiDS 2023, where my team and I scored high enough to get the 5th place prize. It may have given me a false sense of confidence coming into this competition, as I have no background in this topic whatsoever. In fact, the last time I learned anything about biology was in high school. The domain of this problem seems too complex, and I would love to get a basic understanding that would allow me to compete and learn, but not so detailed that I get lost just trying to understand the topic. I would love any insights and help! Thank you for having me!

vital ridge Oct 18, 2023, 9:59 PM

#

Was wondering if anyone had any insight into ground truths. Are we certain that these are the reactivity columns?

sharp torrent Oct 18, 2023, 10:03 PM

#

vital ridge Was wondering if anyone had any insight into ground truths. Are we certain that ...

Yes - the ~200 reactivity columns are the ground truth values which you are trying to predict

#

(206 was it? Can’t remember, would need to check)

vital ridge Oct 18, 2023, 10:06 PM

#

noble seal Hi all! I'm an undergrad in Data Science and AI. I have joined competitions befor...

Hello Issac and welcome! My undergrad was in biology and my masters work was done at SMU. My post undergraduate work actually involved ligand binding of the muscarinic receptor (something found in the brain that appears to be the cause of signs of dementia such as Alzheimer's). I would definitely be willing to answer what questions you have in terms of the project. I am still learning and relearning myself. But would love to get some discussion going.

#

@sharp torrent 206, you are correct. Since I have you, let me ask: I read the papers regarding the DMS and 2A3 experiment types and I am troubled by the negative values. As I read it, they are probabilities of the form P(paired i) / (P(paired i) + P(unpaired i)), meaning that 0.5 means we have no information and 1 means perfect information. So what does negative mean?

sharp torrent Oct 18, 2023, 10:11 PM

#

I know that it’s an experimental artifact, let me see if I can dig up more details

#

I think it has to do with some form of normalization that gets applied

#

Subtracting out background noise and compensating for some “phenomenons like over-modification”, from the source I’m looking at

vital ridge Oct 18, 2023, 10:32 PM

#

@sharp torrent i remember reading something like that as well.

sharp torrent Oct 18, 2023, 10:32 PM

#

I was going to link to experimental data processing code I think is relevant, but I realized the repo is private 🫠

vital ridge Oct 19, 2023, 4:01 AM

#

sharp torrent I was going to link to experimental data processing code I think is relevant, bu...

Let me know what you can find

vital ridge Oct 20, 2023, 9:33 PM

#

I think I may have insight about these reactivity values and reactivity error columns

#

According to the DMS paper: https://academic.oup.com/nar/article/51/16/8744/7201944

The values in the study are processed by ShapeMapper2 https://github.com/Weeks-UNC/shapemapper2/blob/master/docs/analysis_steps.md#calculation-of-mutation-rates

According to this section, the values are consolidated from three groups of samples: (modified - untreated) / control. A negative value would imply that the untreated sample showed a higher mutation rate than the modified one. A fractional value means the control was greater than the difference between the other two. But this only applies if there are three groups of samples. If two, then it's modified - untreated. If one, then modified alone. The reactivity_error column, it seems, is a computation of the same form for the corresponding reactivity column.
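A small sketch of the arithmetic as read from that section. It only illustrates the three cases described in this message; it is not the pipeline actually used to produce the competition data, and it does not guard against a zero control rate.

```python
from typing import Optional


def reactivity(modified: float,
               untreated: Optional[float] = None,
               denatured: Optional[float] = None) -> float:
    """Combine per-position mutation rates into a reactivity value."""
    if untreated is None:
        return modified                       # one sample: modified alone
    background_subtracted = modified - untreated
    if denatured is None:
        return background_subtracted          # two samples
    return background_subtracted / denatured  # three samples


# An untreated mutation rate above the modified one gives a negative value.
negative_example = reactivity(modified=0.02, untreated=0.03)
```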

GitHub

shapemapper2/docs/analysis_steps.md at master · Weeks-UNC/shapemapp...

Public repository for ShapeMapper 2 releases. Contribute to Weeks-UNC/shapemapper2 development by creating an account on GitHub.

plush shoal Oct 21, 2023, 7:44 AM

#

hey, i see several positions in sequences at the start are null. it is said that they cant be probed due to technical difficulties. My question is, will reactivities for those positions be available in test set during evaluation?

#

If yes, will they be measured in the tests or imputed using some rule?

sick raven Oct 21, 2023, 8:55 AM

#

hey has anyone submitted? I dont get it my outputs are 1343823 x 2 is it meant to be 2 predictions per character in the sequence?

vital ridge Oct 21, 2023, 4:20 PM

#

sick raven hey has anyone submitted? I dont get it my outputs are 1343823 x 2 is it meant...

One per experiment type

quasi dove Oct 21, 2023, 4:28 PM

#

As I submitted only median values of 2a3 and dms i got score of 0.23 something, now I'm more interested in this project data cleaned ready for training ☺️

sharp torrent Oct 21, 2023, 5:10 PM

#

Most likely not. There’s some additional detail here: https://www.kaggle.com/competitions/stanford-ribonanza-rna-folding/discussion/441122

quasi dove Oct 22, 2023, 6:25 AM

#

I'm getting mse 0.0025 for dms is that good ?

quasi dove Oct 22, 2023, 6:54 AM

#

Got 0.00328 on 2a3

fervent moth Oct 23, 2023, 2:15 AM

#

sharp torrent Most likely not. There’s some additional detail here: https://www.kaggle.com/com...

I see the first significant values being populated in column 'reactivity_0027'

sharp torrent Oct 23, 2023, 2:19 AM

#

fervent moth I see the first significant values being populated in column 'reactivity_0027'

That is currently correct, yes, as so far the way the experiments are done is not able to get data before there

quasi dove Oct 23, 2023, 5:13 AM

#

I'm getting 276827538 rows
But the competition overview says 539539340

#

And already my submission file getting bigger than 10 gb

quasi dove Oct 23, 2023, 6:14 AM

#

after training model it is generalising well , what you think guys ?

quasi dove Oct 23, 2023, 2:08 PM

#

sharp torrent That is currently correct, yes, as so far the way the experiments are done is no...

Did you use LSTM or seq2seq model for training?

sharp torrent Oct 23, 2023, 2:12 PM

#

quasi dove Did you use LSTM or seq2seq model for training?

I’m not actually writing any models myself - I’m on the team/lab running the competition, though not a host myself 🙂

quasi dove Oct 23, 2023, 2:13 PM

#

sharp torrent I’m not actually writing any models myself - I’m on the team/lab running the com...

Ohk

quasi dove Oct 24, 2023, 2:09 PM

#

I tried to solve it both ways, using a sequence model and a plain artificial neural network, but both models get the same mean absolute error. I did cross-validation and feature engineering and was able to improve slightly, but no luck getting past 0.18 MAE. Does anybody know what else I can do from a domain knowledge perspective?

#

Help me

zealous fable Oct 24, 2023, 6:17 PM

#

@quasi dove This is a difficult challenge. There's many professional ML researchers struggling to get into the top 15. Myself included. I suggest you look here: https://pytorch.org/tutorials/beginner/transformer_tutorial.html . Also nanoGPT could be a good starting point. Think about how you could modify it to output reactivity rather than word predictions.

quasi dove Oct 24, 2023, 6:21 PM

#

zealous fable @quasi dove This is a difficult challenge. There's many professional...

Thanks, I'm able to improve my model output with the sequence model , i will experiment with sequence models then will try to implement attention mechanism here I hope that will improve more

sick raven Oct 25, 2023, 8:26 AM

#

has anyone figured out what to do with the error values?

#

just double checking are we certain that error values = 1/sqrt(reads)?

quasi dove Oct 25, 2023, 9:53 AM

#

sick raven has anyone figured out what to do with the error values?

As per description don't use them

sick raven Oct 25, 2023, 1:30 PM

#

Ok, and whats the best way to fillna?

sick raven Oct 25, 2023, 1:37 PM

#

quasi dove As per description don't use them

i dont think the desc says that pretty sure you need to use them for top scores

vital ridge Oct 25, 2023, 9:19 PM

#

@sick raven the DMS paper has insight. It's a standard error.

vital ridge Oct 25, 2023, 9:22 PM

#

sick raven Ok, and whats the best way to fillna?

Since not even negative values mean much. Probably just throw them out. They represent gaps. RNA is a sequence, we already know the first 26 are NA de facto. As far as I can tell only the first and last items are NA

quasi dove Oct 26, 2023, 5:19 AM

#

vital ridge Since not even negative values mean much. Probably just throw them out. They rep...

But when I predicted, the reactivities from 0001 to 0026 also got predicted, with very small, negligible values.
Now I built another sequence model which has a better MAE but takes 6 hours to predict the full test data set

vital ridge Oct 27, 2023, 3:53 AM

#

What kind of models are you using? Key thing to understand is it's not a sequence that is being predicted it's a grouping. So if they are values with known reactivity they should inform the routine with unknown

quasi dove Oct 27, 2023, 3:23 PM

#

vital ridge What kind of models are you using? Key thing to understand is it's not a sequenc...

I'm using sequence model to understand RNA sequences and then on top of that added dense layer to predict reactivities
From 1 to 206
First I tried a neural network on features extracted from the RNA sequences. Training and prediction were pretty fast (MAE = 0.18, almost equal on the test and validation sets) for both 2A3 and DMS.
But then I tried the sequence model and got an MAE of about 0.14, a good improvement.
But prediction on CPU takes about 12 hrs: 6 hrs for DMS and 6 for 2A3.
On TPU it takes 3 hrs. Due to this limitation I'm not able to use other models; I want to go with transformers to improve more.

#

Input is RNA sequences and it's padded sequence length and the 206 reactivity are the target variables
Also I filtered out sn, filter, reads etc

zealous fable Oct 27, 2023, 7:02 PM

#

quasi dove I'm using sequence model to understand RNA sequences and then on top of that add...

Why are you using CPU for prediction/submission? What's the blocker for using GPU? Maybe we can help.

quasi dove Oct 27, 2023, 8:41 PM

#

zealous fable Why are you using CPU for prediction/submission? What's the blocker for using GP...

My gpu resources exhausted for that week , even CPU it took 8 hrs total for prediction

#

I have to change outputs precision somehow

zealous fable Oct 27, 2023, 9:59 PM

#

@quasi dove I recommend you use google colab, as well as Kaggle. This way you can use one for prediction, one for training. AWS also has a free tier for SageMaker and EC2, you could make use of that.

quasi dove Oct 28, 2023, 7:31 AM

#

zealous fable @quasi dove I recommend you use google colab, as well as Kaggle. Thi...

I can't predict that with free resources it consumes a lot of time ,on AWS I'm not able to login

quasi dove Oct 28, 2023, 5:16 PM

#

Even in sequence data I'm not getting good rank as i uploaded submission file 300+ rank

vital ridge Oct 29, 2023, 3:04 PM

#

What sequence modeling package are you using

vital ridge Oct 29, 2023, 6:20 PM

#

quasi dove I'm using sequence model to understand RNA sequences and then on top of that add...

I am working on a notebook that might shed light. But it's going to take me a couple days to get it finished

quasi dove Oct 29, 2023, 10:58 PM

#

vital ridge What sequence modeling package are you using

Bilstm and then dense layer

#

Next adding attention will be helpful

glacial tulip Oct 30, 2023, 12:40 PM

#

Hi, guys. The question was asked earlier but it looks like it wasn't answered. What should we predict at these positions? Are they going to be included into evaluation?

The first 26 positions in the sequence are the leader on the 5' end that is necessary for the experiment. The first 26 positions in the sequence are identical across the set.
The last 39-51 positions (it varies) are the barcode hairpin and tail necessary for the experiment. The barcode hairpin is a unique identifier within each batch of experimental testing that was performed.

sharp torrent Oct 30, 2023, 1:28 PM

#

glacial tulip Hi, guys. The question was asked earlier but it looks like it wasn't answered. W...

See https://www.kaggle.com/competitions/stanford-ribonanza-rna-folding/discussion/441122 - the recommendation is to attempt predicting something by generalizing the interior model to the edges, but if the data is not available for those positions (which is entirely possible), those positions won’t affect the score

zealous fable Oct 30, 2023, 7:01 PM

#

Are there any details on the training data update? I seem to have missed the information on that.

sick raven Oct 30, 2023, 7:38 PM

#

zealous fable Are there any details on the training data update? I seem to have missed the inf...

Yeah they fixed the reactivity errors apparently

https://www.kaggle.com/competitions/stanford-ribonanza-rna-folding/discussion/451158

Check this out

#

Has anyone tried

reactivity + sqrt(error)*sqrt(s2n)?

sharp torrent Oct 30, 2023, 8:04 PM

#

vital ridge Let me know what you can find

BTW when I was reviewing the post that was just linked to in the previous message, it looks like it suggests https://github.com/DasLab/ubr was used for the data generation

GitHub

GitHub - DasLab/ubr: Ultraplex-Bowtie2-RNAframework pipeline for an...

Ultraplex-Bowtie2-RNAframework pipeline for analysis of SHAPE-MaP/DMS-MaP big library runs - GitHub - DasLab/ubr: Ultraplex-Bowtie2-RNAframework pipeline for analysis of SHAPE-MaP/DMS-MaP big libra...

#

(I'm guessing the repo I was looking at was an older version, though I'm not certain)

sick raven Oct 30, 2023, 8:59 PM

#

Can someone please explain how they calculated the error values?

sharp torrent Oct 30, 2023, 9:22 PM

#

sick raven Can someone please explain how they calculated the error values?

If you're willing to read some code, I believe it comes from here: https://github.com/DasLab/ubr/blob/main/matlab/data/get_reactivity.m
and probably normalized https://github.com/DasLab/ubr/blob/main/matlab/data/normalize_reactivity.m

vital ridge Oct 30, 2023, 10:55 PM

#

sharp torrent BTW when I was reviewing the post that was just linked to in the previous messag...

I was wondering about that. That helps

sterile schooner Nov 3, 2023, 1:04 AM

#

Hello everyone . I recently joined this competition and I want to participate and I have a questions . what are our input(features) and what are we trying to predict ?

vital ridge Nov 3, 2023, 5:29 AM

#

@sterile schooner a sequence of RNA molecules and a list of 'reactivities' and reactivity errors. We are still trying to figure out

slim juniper Nov 3, 2023, 12:31 PM

#

The input is the RNA sequence and reactivity values for each position in the sequence. You are predicting the reactivity value of each position for the RNA sequences in the test set. Go ahead and predict reactivity for the beginning and end sections even though no values for those sections are provided in the train set. The ultimate goal is to use the reactivity data to predict RNA base pairing and folding both in a 2D and 3D context.

sterile schooner Nov 3, 2023, 5:46 PM

#

um so we will give the 'sequence' to the model (as x) and the model will return (a list) of reactivities (as y) right ?

#

so this means that we will train the model by giving the sequence as x and reactivity s (all of them , or only those with non Null values ?) as y to train the model to predict the reactivity and then we apply the model to the test set and create reactivity columns and then submit that ?

#

and then the hosts will use reactivity values to create the structurs ?

sterile schooner Nov 3, 2023, 6:03 PM

#

Something like this:
sequence(GGGAACGACUCGAGUAGAGUCGAAAAAU...) ===> Model ===> reactivity_0001 = value, reactivity_0002 = value, ..., reactivity_error_0205 = value, reactivity_error_0206 = value
?

#

And since the sequence lengths in train and test (public data) are different from the private data, we have to pad our sequences to the max length of the test data, right? So that the model can work on the private data.

sharp torrent Nov 3, 2023, 6:13 PM

#

sterile schooner some thing like this : sequence(GGGAACGACUCGAGUAGAGUCGAAAAAU...) ===> Model ===>...

Yes, though for each position you will actually have to predict two values - one for the reactivity under the 2A3 method, and one under the DMS method

sharp torrent Nov 3, 2023, 6:15 PM

#

sterile schooner And since the sequence length of train and test(public data) is different from t...

AFAIK the model should be able to generalize to any sequence length, so padding is probably not the way to go - and is probably not the right way to be thinking about the problem anyways (the reactivities arise due to the relationships between the different bases in the sequence, there's no real useful stable information in the position itself)
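
A minimal sketch of what a length-agnostic setup could look like. The function names and the placeholder "model" are hypothetical, not anything from the competition code: each base is featurized from a small local window, so nothing depends on absolute position, and the output has one (2A3, DMS) pair per base for any sequence length.

```python
# Hypothetical sketch: names and the placeholder "model" are illustrative only.
BASES = "ACGU"

def one_hot(base):
    """Encode one base as a 4-element one-hot list (all zeros if unknown)."""
    return [1.0 if base == b else 0.0 for b in BASES]

def window_features(sequence, i, radius=2):
    """Features for position i built from its neighbours only.

    Positions outside the sequence contribute zero vectors, so no padding
    to a global maximum length is needed.
    """
    feats = []
    for j in range(i - radius, i + radius + 1):
        if 0 <= j < len(sequence):
            feats.extend(one_hot(sequence[j]))
        else:
            feats.extend([0.0] * len(BASES))
    return feats

def predict_reactivities(sequence):
    """Return one (reactivity_2A3, reactivity_DMS) pair per base."""
    preds = []
    for i in range(len(sequence)):
        x = window_features(sequence, i)
        # Placeholder for a trained model: fraction of A in the window.
        score = sum(x[0::4]) / (len(x) / 4)
        preds.append((score, score))
    return preds

short = predict_reactivities("GGGAACGACUCGAGUAGAGUCGAAAAAU")
long_ = predict_reactivities("GGGAACGACUCGAGUAGAGUCGAAAAAU" * 3)
print(len(short), len(long_))  # one prediction pair per base in both cases
```

A real model would want long-range context (attention, convolutions, and so on), since pairing partners can be far apart; the small window here only illustrates the input/output shape.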

sterile schooner Nov 3, 2023, 6:16 PM

#

So it should be like this?
sequence(GGGAACGACUCGAGUAGAGUCGAAAAAU...) ===> Model ===>
2A3: reactivity_0001 = value, reactivity_0002 = value, ..., reactivity_error_0205 = value, reactivity_error_0206 = value
DMS: reactivity_0001 = value, reactivity_0002 = value, ..., reactivity_error_0205 = value, reactivity_error_0206 = value

sharp torrent Nov 3, 2023, 6:16 PM

#

right

sterile schooner Nov 3, 2023, 6:16 PM

#

Thanks a lot

sharp torrent Nov 3, 2023, 6:16 PM

#

Assuming that sequence is 206 elements long

sterile schooner Nov 3, 2023, 6:17 PM

#

yes

sharp torrent Nov 3, 2023, 6:18 PM

#

sterile schooner so this means that we will train the model by giving the sequence as x and react...

Also note that while you won't have training data for those positions that are null at the beginning/end of the sequence, we'd like to see you try to generalize your model to be able to predict values there too!

sterile schooner Nov 3, 2023, 6:20 PM

#

So you mean I have to create a model that trains on data that has those null values, but I also have to predict values for them in the test set?

#

And if so, does it mean I have to find some way to fill the null values?

sharp torrent Nov 3, 2023, 6:25 PM

#

You'll need a way to get your model to predict them, though you presumably won't want to augment the training data yourself, as values you'd insert would not actually be ground truth values
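
One common way to do this without filling anything in is to mask the loss: the model predicts every position, but positions with no label simply don't contribute. A minimal sketch, assuming an MAE-style loss and NaN as the missing-label marker (`masked_mae` is a hypothetical helper name):

```python
import math

def masked_mae(predictions, targets):
    """Mean absolute error over positions whose target is not NaN."""
    total, count = 0.0, 0
    for pred, target in zip(predictions, targets):
        if target is None or math.isnan(target):
            continue  # no ground truth here: contributes nothing to the loss
        total += abs(pred - target)
        count += 1
    return total / count if count else 0.0

nan = float("nan")
preds = [0.2, 0.5, 0.9, 0.1]    # the model still predicts every position
targets = [nan, 0.4, 1.0, nan]  # head/tail positions have no label
print(masked_mae(preds, targets))
```

The training data stays untouched this way; only the labelled positions drive the gradient.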

#

I believe https://www.kaggle.com/competitions/stanford-ribonanza-rna-folding/discussion/444653 has been referred to as a useful resource in terms of generalization

Stanford Ribonanza RNA Folding

Create a model that predicts the structures of any RNA molecule

#

Unfortunately I don't have more useful suggestions on how to handle this specifically - I am not an ML expert myself 🙂

sterile schooner Nov 3, 2023, 6:26 PM

#

Thanks a lot, I'll look into it

sick raven Nov 4, 2023, 10:39 AM

#

sterile schooner so it should be like this ? sequence(GGGAACGACUCGAGUAGAGUCGAAAAAU...) ===> Model...

the final output is just reactivity values and not the errors

#

does anyone have a background in biology? I really wanna understand how those error values can be used
I'd be down to team up

sharp torrent Nov 4, 2023, 4:13 PM

#

sick raven the final output is just reactivity values and not the errors

Oh yep, missed that mhdaw included those, good callout

sharp torrent Nov 4, 2023, 4:15 PM

#

sick raven does anyone have a background in biology? I really wanna understand how those er...

FWIW this is less of a biology thing and more of a statistics thing. If it has a higher error, it means the corresponding reactivity is less reliable/precise, so you may want to down-weight it
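
A rough sketch of what down-weighting could look like in a loss. The weighting formula `1 / (1 + error)` is purely an illustrative assumption, not something from the hosts; anything that shrinks toward zero as the error grows would serve the same purpose.

```python
import math

def error_weight(error):
    """Illustrative weight in (0, 1]: 1 for zero error, near 0 for huge error."""
    if error is None or math.isnan(error):
        return 0.0
    return 1.0 / (1.0 + max(error, 0.0))

def weighted_mae(predictions, targets, errors):
    """MAE where each position counts in proportion to its weight."""
    num, den = 0.0, 0.0
    for p, t, e in zip(predictions, targets, errors):
        if math.isnan(t):
            continue  # skip positions without a label
        w = error_weight(e)
        num += w * abs(p - t)
        den += w
    return num / den if den else 0.0

# The second position has a very large error, so it barely affects the loss.
print(weighted_mae([0.5, 0.5], [0.5, 1.0], [0.0, 1000.0]))
```

With a weight of this shape, very large error values end up with almost no influence rather than needing to be filtered out by hand.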

sterile schooner Nov 4, 2023, 5:18 PM

#

sick raven the final output is just reactivity values and not the errors

Oh, I missed that. Thanks for the reminder

granite isle Nov 5, 2023, 11:50 AM

#

Hello everyone, I just started and am doing some EDA. A quick question: has anyone tried a Graph Transformer in which the nucleotide identity is the node, and the open knot and RNA sequence are joined to build Laplacian eigenvectors as the positional encoding?

vital ridge Nov 5, 2023, 9:31 PM

#

sick raven does anyone have a background in biology? I really wanna understand how those er...

Hi Pranshu. I have been doing a lot of investigation. I think we discovered it has more to do with the DasLab UBR transformation, but that is in MATLAB. I have translated most of the code to Python. I can get the data loaded, but I have not completed the transformation; I have an issue somewhere in the process. But generally I can sketch out the meaning and get a better understanding of what it is trying to achieve. That being said, I am split on whether it is a worthy investment to identify the exact nature of these positive and negative numbers, beyond knowing that they represent some kind of insertion, deletion or subtraction.

quasi dove Nov 6, 2023, 12:15 AM

#

I tried a sequence model, and also a neural network on extracted features, but was not able to achieve a good score. So I guess I need to use the reactivity error to improve the score, e.g. by weighting the reactivities

sick raven Nov 7, 2023, 6:16 AM

#

vital ridge Hi Pranshu. I have been doing a lot of investigation. I think we discovered it h...

What is DasLab?

sick raven Nov 7, 2023, 6:17 AM

#

sharp torrent FWIW this is less of a biology thing and more of a statistics thing. If it has a...

Hey, but some reactivity error values are 1k+. Do you have any insights on down-weighting?

sharp torrent Nov 7, 2023, 1:24 PM

#

sick raven What is DasLab?

The Das Lab is the research lab at Stanford that is running the experiments and hosting the competition

sharp torrent Nov 7, 2023, 1:25 PM

#

sick raven Hey but some reactivity error values are 1k+ do you have any insights on down we...

I don’t have any specific advice, sorry. That’s out of my area of expertise

vital ridge Nov 7, 2023, 4:57 PM

#

sick raven Hey but some reactivity error values are 1k+ do you have any insights on down we...

What do you mean by 1k plus? They should be floating-point values less than 3. Can you give me the search?

zealous fable Nov 7, 2023, 9:14 PM

#

Does anyone have the EternaFold/Vienna sequence predictions for the entire train/test set? It's going to take 2 days on 16 threads to process! 😮

sick raven Nov 8, 2023, 8:17 AM

#

vital ridge What do you mean 1k plus? they should be floating points less than 3. Can you gi...

Try looking for max values

sick raven Nov 8, 2023, 9:56 AM

#

Do y'all think #reads_error = (1/reactivity_error)**2?

#

No worries, thanks for the advice about error weighting tho

vital ridge Nov 10, 2023, 11:35 PM

#

@sick raven not sure what reads_error is. It's not in the training data

sick raven Nov 11, 2023, 5:56 AM

#

Hmm, so reactivity_error is calculated from the Poisson distribution of the number of reads that went wrong for the character
So the number of reads that gave an error value = (1/reactivity_error)**2

That's how I interpret it but I could be wrong

Don't know how to use reactivity error otherwise
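
The arithmetic behind that guess, for reference. This assumes reactivity_error behaves like the relative error of a Poisson count, which is only an interpretation and not something confirmed by the hosts; the function names are made up for the example.

```python
import math

def poisson_relative_error(count):
    """Relative error of a Poisson count: std / mean = sqrt(N) / N = 1 / sqrt(N)."""
    return math.sqrt(count) / count

def count_from_relative_error(rel_error):
    """Invert the relation above: N = (1 / relative_error) ** 2."""
    return (1.0 / rel_error) ** 2

rel = poisson_relative_error(400)
print(rel, count_from_relative_error(rel))
```

If the column is an absolute error on a normalized reactivity rather than a relative error on a raw count, this inversion would not apply directly.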

cobalt trout Nov 11, 2023, 3:53 PM

#

Hi! Wondering whether it is too late to join this, based on the time left and the complexity of the problem. Very interested in this project

merry solstice Nov 11, 2023, 5:49 PM

#

cobalt trout Hi! Wondering whether it is too late to join this based on the time left and com...

Don't think it's too late at all. Go for it.

vital ridge Nov 12, 2023, 4:44 AM

#

@sick raven looking through the code, it gets a little confusing. But there is reactivity, reactivity error, and normalized reactivity and error. For the former, reactivity is just an accumulation of all the different kinds of mutation counts, subtracting out the no-modification controls. And the error is simply the square root of the sum of the squares of each kind of mutation term (transposition pairs, deletions, and 1/reads).
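
In other words, the terms are combined in quadrature. A tiny sketch of just that step, with made-up numbers and a hypothetical function name; this is not the UBR code, only the arithmetic described above.

```python
import math

def combine_in_quadrature(*components):
    """Square each error component, add them up, and take the square root."""
    return math.sqrt(sum(c * c for c in components))

# Made-up component values purely for illustration.
print(combine_in_quadrature(3.0, 4.0))
print(combine_in_quadrature(0.02, 0.01, 0.005))
```

One consequence is that the largest component dominates the total, so a single noisy term can account for most of the reported error.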

zealous fable Nov 12, 2023, 6:36 PM

#

Does anyone in the top 100 want to team up for a final push in these last 18 days?

vital ridge Nov 14, 2023, 6:13 PM

#

I just posted my Ubr Walkthrough results. Wanted to know if anyone had any input

#

https://www.kaggle.com/code/tuttlen/ubr-walkthrough

Ubr Walkthrough

Explore and run machine learning code with Kaggle Notebooks | Using data from [Private Datasource]

sick raven Nov 15, 2023, 4:55 AM

#

vital ridge @sick raven looking through the code it gets a little confusing. But t...

Oh gotcha thanks man

#

Honestly I've given up on this lol, nothing works. I might have to start getting models from huggingface

vital ridge Nov 16, 2023, 3:37 AM

#

sick raven Honestly I've given up on this lol nothing works. I might have to stay getting m...

What kind of models have you tried? I am working on an RNN right now. I have a bunch of ideas

sick raven Nov 17, 2023, 6:16 AM

#

I tried everything on the custom model side...Transformers, Retention, RNN, LSTM

#

I think fine-tuning a pretrained model from huggingface might be the next step

zealous fable Nov 17, 2023, 2:00 PM

#

I found the best gains improving the embeddings, pre-transformer. Look for how you can concat other sequence information to your nucleotide embeddings.
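
A minimal sketch of that kind of concatenation, in plain Python so the shapes are easy to see. The embedding table and the extra channel (a per-base pairing probability, e.g. from a folding tool) are assumptions made for the example, not the features actually used.

```python
# Hypothetical 2-dimensional embedding table; a real model would learn these.
EMBED = {"A": [1.0, 0.0], "C": [0.0, 1.0], "G": [1.0, 1.0], "U": [0.0, 0.0]}

def build_inputs(sequence, extra_features):
    """Concatenate each base embedding with that position's extra features."""
    if len(sequence) != len(extra_features):
        raise ValueError("need one extra-feature vector per base")
    return [EMBED[b] + list(extra) for b, extra in zip(sequence, extra_features)]

seq = "GGAC"
# Assumed extra channel: one pairing-probability value per base.
pair_prob = [[0.9], [0.8], [0.1], [0.7]]
inputs = build_inputs(seq, pair_prob)
print(len(inputs), len(inputs[0]))  # positions, features per position
```

The result is still one vector per base, just wider, so it can be fed into the same downstream layers after adjusting the input dimension.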

vital ridge Nov 18, 2023, 3:25 AM

#

What kind of layers did you try for the RNN? That is what I am working on right now

sick raven Nov 22, 2023, 8:20 AM

#

vital ridge What kind of layers did you try for rnn. That is what I am working on right now

SimpleRNN, Retention wbu?

sick raven Nov 27, 2023, 6:14 AM

#

hey here's my notebook:

https://www.kaggle.com/code/pranshubahadur/rmdb-rna-dataset-with-transformers

Using the RMDB dataset worked better...I think it's because sequence length is 433

RMDB & RNA Dataset with Transformers

Explore and run machine learning code with Kaggle Notebooks | Using data from multiple data sources

cyan bear Nov 27, 2023, 10:38 AM

#

sick raven hey here's my notebook: https://www.kaggle.com/code/pranshubahadur/rmdb-rna-dat...

Are you augmenting with RMDB or using only RMDB data?
From what I understand you aren't concatenating RNA and RMDB...
Also, does this help with generalization?

sick raven Nov 27, 2023, 10:50 AM

#

No, I concat them

zealous fable Nov 27, 2023, 8:42 PM

#

One thing I've noticed in this competition is that the heads of the sequence models need aggressive dropout, 0.5 for example. I've had luck increasing the dropout linearly during training, from 0.1 to 0.4. Has anyone else seen this behavior? I don't understand the root cause.
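
A sketch of such a linear ramp, assuming the rate is updated once per epoch; `linear_dropout` is a hypothetical helper, and wiring it into a model (e.g. by setting the dropout layer's rate each epoch) is left out.

```python
def linear_dropout(step, total_steps, start=0.1, end=0.4):
    """Dropout rate that ramps linearly from start to end over training."""
    if total_steps <= 1:
        return end
    # Clamp so steps outside [0, total_steps - 1] stay within [start, end].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return start + frac * (end - start)

for epoch in range(0, 11, 5):
    print(epoch, round(linear_dropout(epoch, 11), 3))
```

The start and end values mirror the 0.1 and 0.4 mentioned above; the schedule shape itself is the only thing this shows.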

sick raven Nov 28, 2023, 6:18 AM

#

Yeah 0.5 dropout seems high to me too.

hollow cobalt Nov 29, 2023, 4:25 PM

#

I wanted to understand this comment from organizers. https://www.kaggle.com/competitions/stanford-ribonanza-rna-folding/discussion/444653#2541974
Basically (with my perhaps weaker understanding of RNA), why would we expect predictions to be legit floats when the GT (ground truth) has NaN(s)?

sharp torrent Nov 29, 2023, 4:59 PM

#

hollow cobalt I wanted to understand this comment from organizers. https://www.kaggle.com/comp...

RNA has sorta similar concerns that computer vision does. In computer vision, you should be able to identify a cat no matter if it is translated, rotated, flipped around, squashed, etc - you shouldn't need to rely on its specific position in the training data. Similarly in RNA, if you had something like GGGGAAAACCCCAAAAAAAAAAA vs AAAAAAAGGGGAAAACCCCAAAA, the Gs should be paired with the Cs. Of course it winds up getting more complex than that, but the point is that you don't want to be taking RNA bases at specific positions as features directly without any sort of processing; your model needs to be able to find patterns without relying on their specific positions in the sequence

#

Your model should also be able to work on RNA sequences of arbitrary length - same idea!
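
To make the shifted-sequence example concrete, here is a tiny sketch (the helper name is made up) showing that the G run and C run sit at different absolute positions in the two sequences but at the same distance from each other, which is the kind of relative pattern a model should pick up on.

```python
def stem_offsets(sequence, left="GGGG", right="CCCC"):
    """Return the start index of the first G run and the first C run."""
    return sequence.find(left), sequence.find(right)

a = "GGGGAAAACCCCAAAAAAAAAAA"
b = "AAAAAAAGGGGAAAACCCCAAAA"
for seq in (a, b):
    i, j = stem_offsets(seq)
    print(seq, "G run at", i, "C run at", j, "separation", j - i)
```

Real pairing is of course not found by string search; this only illustrates why absolute position is a poor feature on its own.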

hollow cobalt Nov 29, 2023, 5:44 PM

#

Thanks for the response, I really appreciate the details. But in computer vision I do have ground truth saying it's a cat. Here we have none. By analogy, if I'm doing semantic segmentation (per-pixel classification) and I don't have ground truth for some pixels, the prediction there is just a best guess which can't be validated, and different training runs would lead to different aggregation over the unknown area. What makes matters worse, in my view, is that the final evaluation has some fixed values for these unknown regions (if I understand correctly)

rustic olive Nov 29, 2023, 6:07 PM

#

In the case of transformers you can train with the central labels and hope that the positional encoding generalizes to the tails. I think that's basically what it's about.

rustic olive Nov 29, 2023, 6:12 PM

#

hollow cobalt Thanks for the response and really appreciate the details. But in computer visio...

Well, actually they say that they didn't have the corresponding labels by the time the competition started. But they remind you that by the end of the competition it's possible they'll have them, not 100% sure.

hollow cobalt Nov 29, 2023, 6:15 PM

#

rustic olive In the case of transformers you can train with central labels and hope that posi...

It's a matter of choice in how we train. A cat can legitimately be next to sky, water, or many indoor objects. The final output would depend on the training and loss choices.

(I appreciate your response, thanks)

hollow cobalt Nov 29, 2023, 6:16 PM

#

rustic olive Well, actually they say that they don't had the corresponding labels by the time...

Yes. I'm wary of this new possibility of evaluating the unknown ones, although I am too far down the ranks to worry much. Nonetheless, it feels a bit weird to evaluate them, if they do.

sharp torrent Nov 29, 2023, 6:52 PM

#

hollow cobalt It's a matter of choice of how we train. A cat can be legitimately next to sky, ...

Admittedly my expertise is not in ML, but here’s how I would approach the intuition: We know some pixels correlate to a cat. We don’t know what the other pixels are. However, if we see a cat where the other pixels are, we should still be able to identify the cat as opposed to a dog, because the data does match that more closely

#

You don’t know what the background is in that training data - and it could in fact contain a dog, but we have both positive and negative examples of dogs and cats at other positions, so you should be able to identify it at that position still

hollow cobalt Nov 29, 2023, 7:09 PM

#

sharp torrent Admittedly my expertise is not in ML, but here’s how I would approach the intuit...

The issue with this is that there are a lot of things that can legitimately be around the cat. The question is how much of the gap in the image/semantics we are filling in. E.g. if there is a ground truth hole in the image of a table, it could be part of the table, a flower pot, a kitchen plate, a plate with food, etc. And what your model chooses depends on how the model matches with the loss. Also remember we do not find a global minimum with SGD.

sharp torrent Nov 29, 2023, 7:11 PM

#

Sure, that's true

vital ridge Dec 7, 2023, 10:59 PM

#

do we have until midnight?

sharp torrent Dec 7, 2023, 11:50 PM

#

vital ridge do we have until midnight?

From the overview page, “All deadlines are at 11:59 PM UTC on the corresponding day unless otherwise noted. The competition organizers reserve the right to update the contest timeline if they deem it necessary.”

#

Which should be 8 minutes from now!

sick raven Dec 8, 2023, 12:27 AM

#

GG everyone!

sharp torrent Dec 8, 2023, 12:43 AM

#

Congrats all, and thanks for joining us in our research!

#

I know our team is excited to dig into the winning submissions

vital ridge Dec 8, 2023, 12:47 AM

#

c'est la vie