#stanford-rna-3d-folding
Hi
ok... looks like someone slept on the keyboard
I'm a bit confused by this challenge. On one hand, training a model from scratch would be very expensive, and require lots of training data. On the other hand, one could simply run AlphaFold 3, which has been shown to work well with RNA structures as well, and get a good score. Not sure how we could do better than the current state of the art.
One idea I have is to use one of the new foundation models published recently, and start from there. For example, Evo2 is a foundation model based on DNA sequences, that was published a couple of weeks ago. One of the figures of the paper (4d) mentions that the model is able to learn sequences associated with specific 2D/3D structures, from DNA (even if the challenge is based on RNA sequences).
Does anyone want to create a team to work on this challenge?
If your health is on the line, is "works well" good enough?
Hi! I just think it would be difficult to improve the state of the art, because there are companies and academic groups that have been working for years on this, and training models on 3D structure is usually very expensive in terms of GPU. Still, some new idea may come up
@fickle nebula
utilizing grokfast (https://arxiv.org/abs/2405.20233) and bitnet (https://arxiv.org/pdf/2310.11453) for speed and effectiveness, and using Paperspace (https://www.paperspace.com/) for GPU options for training, has been my go-to.
This was my initial idea, but since I got distracted with a different idea, I'll give out this idea for RNA‑FoldNet.
included is a comprehensive outline of the idea, what sets it apart from AlphaFold, and a comprehensive iterative development plan
One puzzling artifact in machine learning dubbed grokking is where delayed generalization is achieved tenfolds of iterations after near perfect overfitting to the training data. Focusing on the long delay itself on behalf of machine learning practitioners, our goal is to accelerate generalization of a model under grokking phenomenon. By regardin...
@fickle nebula are you still looking for teammates?
what prompt/LLM did you use for that?
https://chatgpt.com/share/67c99743-6568-8004-bc80-7f0e0586d81c
here's the full chat for that one, mainly utilizing o3-mini-high I believe (with search at times).
the txt file has three sections, the first two were generated within the provided chat above,
here is my prompt for finalizing plans (requires a conversation first):
[ please take all of the provided papers, and create a comprehensive plan outlining in entire detail the concept, idea, and flowchart of this model. It's purpose is for RNA 3d Folding, taking the sequence of 4 values and predicting the full 3d structure of that sequence. we want to go into no specific coding or implementation details within the plan, but we REQUIRE A COMPREHENSIVE amount of required knowledge in order to create full fledged model. the input is supposed to be a string that is representing a 2d input of 4 possible values: A, C, G, and U. the output is supposed to be a list of tuples, representing the x y and z values of the item in the corresponding position of the input sequence. ] (papers are pasted below)
For programming I utilize these prompts (With O1 Pro): https://docs.google.com/document/d/1wlC7-k7VCJqJvTcFTwXeFbbaVuC7lcEHGjlzdcUXsHk/edit?usp=sharing
the first one is only used for initial generation, you just paste in the full generated plan.
the second and third one are the meat and potatoes, and are both required each time you want to make a modification.
it's oriented towards jupyter notebook usage (bc I use paperspace)
the only problem i have on a regular basis is that it needs more context as to what it means for a segment to be modified or not, but that's pretty minor. (it sometimes marks functionally unmodified segments as modified)
I almost exclusively use O1 Pro to program, While o3-Mini-high is really close; I'd rather wait the extra 5 minutes for o1 Pro than have to recursively reiterate upon a problem with o3-Mini-high.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= O1 Pro =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= STEP 0: (used within initialization only using total plan, it’s also worth noting that its often useful to initially design a data loader that trains on random numbers in the desired shapes and sizes of your actual data)(occ...
Thank you, that's quite cool. Are you still participating in the challenge?
I've got to the point of using arnie to generate the secondary structures, but still haven't had the time to do more than that.
I am participating still; Just working on a different approach.
for whatever reasons, i've been having trouble extracting a list of tuples from a string; which has been the entire past week for me.
I actually hadn't heard of arnie before, looks interesting.
although, the Multi-Scale High-Resolution Feature Extraction of RNA‑FoldNet should be capable of effectively learning the same information contained in the secondary structures, as well as other relevant context, organically.
also, whatever you end up doing, it may be worth also trying to use it on the new BYU - Locating Bacterial Flagellar Motors 2025 competition. The needs of the AI are surprisingly similar between the two, with the main exception being that the BYU input data is images instead of sequences.
Hi guys I don't really understand this challenge much but really love to do something here. Can anyone explain what we are trying to do here ?
Thanks a lot
Nvm I think I just understand a little more about this competition
Does GCAU = dna bases?
Adenine guanine cytosine thymine
If I remember correctly
RNA has no thymine but it has uracil
sequences of nucleotides, the bases are just a part of them
but you're almost right, the letters refer to the corresponding base in each nucleotide, but for RNA, not DNA
Looking for a team to join, thank you
Someone already developed a model that beats vfold_human_expert.csv
Yeah, he is good AI coder.
Hengck23 has good data analysis and knowledge of how to use it.
I'm curious about vfold_human_expert.csv: did they give a sequence to a human and ask them to model the 3D structure from it? Imagine manual modeling for a 4k sequence
I think the data was manually modeled by humans. Looking at the GitHub code, it seems there's a pattern of repeatedly analyzing and submitting results for a single RNA sequence at a time. Of course, I might be wrong.
thx
Hey all, when it comes to training your model how have you set up your data when it comes to batches? Sequences have variable length and I'm unsure of how to batch them
I've done one-hot encoding for each nucleotide, padding with 0s up to the max sequence length and then masking so the padding doesn't influence the training. (take with a grain of salt though, I'm very new to this so there's a good chance there's a better way to do it)
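A minimal sketch of the padding-plus-mask idea above, in NumPy. The channel order "ACGU" and the function name one_hot_pad are illustrative assumptions, not something taken from the competition code.

```python
import numpy as np

NUCLEOTIDES = "ACGU"  # assumed channel order; any fixed order works


def one_hot_pad(sequences, max_len=None):
    """Return a (batch, max_len, 4) one-hot array and a (batch, max_len) boolean mask.

    Padded positions are all-zero vectors and have mask == False.
    """
    if max_len is None:
        max_len = max(len(s) for s in sequences)
    index = {nt: i for i, nt in enumerate(NUCLEOTIDES)}
    x = np.zeros((len(sequences), max_len, len(NUCLEOTIDES)), dtype=np.float32)
    mask = np.zeros((len(sequences), max_len), dtype=bool)
    for b, seq in enumerate(sequences):
        for pos, nt in enumerate(seq):
            x[b, pos, index[nt]] = 1.0  # an unknown symbol raises KeyError
            mask[b, pos] = True
    return x, mask


x, mask = one_hot_pad(["GGAC", "AU"])
```

The mask is what keeps padded positions out of the loss later on; the zeros alone do not do that.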
One more question, when asked to submit 5 sets of coordinates does this mean we submit 5 different inference passes for the same nucleotide? Train data only has 1 set of 3D coords
Furthermore from what I'm seeing there is a way to set up the external data, for the competition's sake should all data training etc be done in a single notebook? In the case of using the additional external data
Notebook that tackles the matter: https://www.kaggle.com/code/tomooinubushi/convert-uw-synthetic-dataset
I'm still unsure about this, I can set up my submissions.csv to contain the single set of coords on inference, is it still just running inference on the same series 5 times and appending the results in a single CSV?
just submit same prediction 5 times, but run it just once
run it 5 times only if your model is not deterministic
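A small sketch of "run once, submit five times". It assumes the submission columns are named x_1, y_1, z_1 through x_5, y_5, z_5 (check the sample submission file); repeat_prediction is a hypothetical helper name.

```python
import numpy as np


def repeat_prediction(coords, n_copies=5):
    """Copy one (L, 3) prediction into n_copies coordinate sets.

    Returns a dict mapping column names (x_1, y_1, z_1, ..., z_5) to (L,) arrays.
    """
    coords = np.asarray(coords, dtype=float)
    columns = {}
    for k in range(1, n_copies + 1):
        for axis, name in enumerate("xyz"):
            columns[f"{name}_{k}"] = coords[:, axis].copy()
    return columns


pred = np.array([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0]])  # toy prediction, 2 residues
cols = repeat_prediction(pred)
```

With a non-deterministic model you would fill each of the five slots from a separate inference pass instead of copying.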
did any of you have any success pretraining the model on the 400k samples of synthetic data? I'm about to try some transformer-GNN, but I doubt it's any good compared to the already existing SOTA models for RNA folding
I've linked a notebook on how to get that data in the same format as the current comp training data, should help you get set up
and did you have any success? I did not
anyone need a team member? I am open to joining a team
Can anyone clarify how to account for rotations in your predictions? Couldn't I have infinitely many valid RNA structures that are just rotated differently than the one provided as label?
Im looking for experienced and active team members for this competition, highly interested ppl dm me thanks
USalign should find the best translated/rotated match between your predictions and the labels
I just recently got the submission set up (using a simple LSTM template as the model, obv it didn't go great), I'll now try to use that notebook to incorporate the additional data
This notebook in particular
What do you mean by rotate? As in invert the entire RNA sequence? Sounds like a good way to data augment if it still stays consistent with how RNA behaves (I'm new to RNA datasets so idk how viable that is as an option)
I mean the 3D positions of the C1' (C1 prime) atoms of the individual nucleotides, basically what you want to predict, and I mean you can just rotate the entire structure in 3D space (or so I would imagine) without changing its function. Do you get what I mean by that?
i‘ll look into it
ty
I haven't found public information about USalign, but we can trust that a simple Kabsch algorithm can find the best rotation assuming coincident centroids. So if USalign can do it better, perfect. You can always submit A, rot90xy(A), rot90xz(A), rot90yz(A)... and as many experiments as you want, and check if your score changes
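For reference, a sketch of the Kabsch algorithm mentioned above, in NumPy. It returns the RMSD after the best rigid superposition. This is not USalign and not the TM-score, just a local check that a rotated and translated copy of a structure is recognised as the same structure.

```python
import numpy as np


def kabsch_rmsd(P, Q):
    """RMSD between (N, 3) point sets after optimally rotating/translating P onto Q.

    Also returns the 3x3 rotation matrix R such that R @ p approximates q
    for centered points p, q.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    Pc = P - P.mean(axis=0)  # move both centroids to the origin
    Qc = Q - Q.mean(axis=0)
    H = Pc.T @ Qc  # 3x3 covariance matrix
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection)
    d = 1.0 if np.linalg.det(Vt.T @ U.T) >= 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_aligned = Pc @ R.T
    rmsd = np.sqrt(np.mean(np.sum((P_aligned - Qc) ** 2, axis=1)))
    return float(rmsd), R


rng = np.random.default_rng(0)
A = rng.normal(size=(10, 3))
rot90_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
B = A @ rot90_z.T + np.array([5.0, -2.0, 1.0])  # rotated and shifted copy of A
rmsd, R = kabsch_rmsd(A, B)
```

Here rmsd comes out numerically zero, which is the point: a rigid motion of the whole structure is removed by the alignment.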
I mean that might work for training but for submitting I have to rotate all my samples individually and check if that affects my score? That seems tedious + there aren‘t unlimited submissions every day
No, the scoring should automatically align your predictions, only experiment if you don't trust automatic alignment
ahh i get it, thanks a lot man 🙏
I think I get it, you just rotate the entire sequence on an axis to generate new data. If specific placement within Angstrom coords doesn't matter too much then it sounds like a great way to augment without touching the data much
yeah but you‘ll learn nothing new that way I would think, because the input sequence is the same, just the output changes and your network should inherently be able to be invariant to rotation
How would it be invariant to rotation if you do not train it for the task? Similar to vertical/horizontal flips in images I'm presuming
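If you do want to try rotation augmentation of the labels, a rough sketch could look like this. random_rotation and augment are hypothetical helper names; the rotation is applied about the centroid so the structure stays where it was.

```python
import numpy as np


def random_rotation(rng):
    """Random proper rotation matrix (det = +1) from the QR factorisation of a Gaussian matrix."""
    A = rng.normal(size=(3, 3))
    Q, R = np.linalg.qr(A)
    Q = Q * np.sign(np.diag(R))  # fix the sign ambiguity of the QR factorisation
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]  # turn a reflection into a rotation
    return Q


def augment(coords, rng):
    """Rotate an (L, 3) structure about its centroid; pairwise distances are unchanged."""
    coords = np.asarray(coords, dtype=float)
    center = coords.mean(axis=0)
    return (coords - center) @ random_rotation(rng).T + center


rng = np.random.default_rng(0)
coords = rng.normal(size=(6, 3))
rotated = augment(coords, rng)
```

Whether this helps depends on the model: a network that predicts raw XYZ from the sequence alone sees the same input with a different target each time, which is the objection raised above.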
In any case, I made a template in Torch if it helps anyone out, since TF seems to be the standard for a lot of these iterations https://www.kaggle.com/code/kostasanthoulis/stanford-rna-3d-folding-torch-template/notebook
For all interested in joining this competition I strongly recommend looking at the pinned discussion posts
Hey all quick question: is it ok to build my dataset in another notebook, so that building the dataset doesn't reduce my available training time? Given the resulting dataset is public ofc, I'm taking the synthetic data and converting it into a dict for faster training time, most likely saved as a .parquet
On that note, in a few hours I'll be uploading .parquet files of the sequences in the synthetic data (https://www.kaggle.com/datasets/andrewfavor/uw-synthetic-rna-final) formatted for use in the competition. My initial attempt was just reading straight from the dataframes but finding the corresponding labels given a sequence was done in O(N), with a dict it's now in O(1). Shaving time is important when notebook executions are timed so hopefully this will help anyone trying to extend their model training time
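Roughly what the dict approach looks like: one pass over the label rows, grouping coordinates per target so each later lookup is a single dict access. The row layout (target_id, resid, x, y, z) and the name build_label_index are assumptions for illustration.

```python
from collections import defaultdict


def build_label_index(label_rows):
    """Group label rows by target id.

    label_rows: iterable of (target_id, resid, x, y, z) tuples (assumed layout).
    Returns {target_id: [(x, y, z), ...]} ordered by residue number.
    """
    grouped = defaultdict(list)
    for target_id, resid, x, y, z in label_rows:
        grouped[target_id].append((resid, (x, y, z)))
    return {
        tid: [xyz for _, xyz in sorted(items, key=lambda item: item[0])]
        for tid, items in grouped.items()
    }


rows = [
    ("R1", 2, 1.0, 1.0, 1.0),
    ("R1", 1, 0.0, 0.0, 0.0),
    ("R2", 1, 5.0, 5.0, 5.0),
]
index = build_label_index(rows)
coords_r1 = index["R1"]  # constant-time lookup instead of scanning the dataframe
```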
Usually in computational chemistry we use symmetry functions as the inputs for our networks because (X,Y, Z) coordinates are horrible for anything with physical interactions.
As far as the output if they defined the scoring function well, it should account for that. I need to see what they're using.
But usually if we're comparing two structures you don't try to predict exact X,Y,Z you usually predict structural features.
Then figure out how to map the X,Y,Z to the structural feature if you need, but anything trying to predict the exact XYZ directly is usually complete trash for generalization.
Is this too computationally expensive for a loss function though?
And is AlphaFold just an exception to this then?
yeah that's my approach as well, predicting exact 3d structures is a pain in the ass because no way in hell will that generalize
Not at all. A lot of them are mild compute times. It amounts to a change of variables, we did these on CPUs before. The hardest part is if you don't have a clean inverse function. That's where it gets a bit troublesome.
Even Alpha Fold works off predicting a pair representation first. Which effectively is another symmetry function.
i'm currently predicting the EDM of the 3d structure
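A sketch of the distance-matrix idea, assuming "EDM" here means the matrix of pairwise Euclidean distances between C1' atoms. The second function uses classical multidimensional scaling to get coordinates back; that is one possible inverse, not necessarily what anyone in this thread uses.

```python
import numpy as np


def distance_matrix(coords):
    """(L, 3) coordinates -> (L, L) matrix of pairwise Euclidean distances."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def coords_from_distances(D, dim=3):
    """Classical MDS: recover coordinates up to rotation, translation and reflection."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    G = -0.5 * J @ (D ** 2) @ J  # Gram matrix of the centered coordinates
    w, v = np.linalg.eigh(G)
    idx = np.argsort(w)[::-1][:dim]  # keep the `dim` largest eigenvalues
    return v[:, idx] * np.sqrt(np.clip(w[idx], 0.0, None))


rng = np.random.default_rng(0)
coords = rng.normal(size=(8, 3))
D = distance_matrix(coords)
recovered = coords_from_distances(D)
```

The distance matrix is invariant to rotation and translation, which is why it is a convenient target. It is also invariant to mirroring, so the recovered structure may come out as the mirror image and the handedness has to be settled some other way.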
Given two sets of XYZs, what's the best way to see if they "structurally" compare? Is there a name I could search up
I'm about to read up on what's popular for DNA/RNA for this project, but in computational chemistry we would do things like the Behler symmetry functions for neural net inputs, or Steinhardt order parameters and others for loss definition. Those are based on local symmetry. What they look at, for example, is the rotational symmetry of the structure.
This was some previous work for atomic predictions. Some of the ideas will translate, but not all of them. https://chemistry-europe.onlinelibrary.wiley.com/doi/abs/10.1002/cctc.202000774
But it can give you an idea of how invariant coordinates can be defined which might give you some ideas where to go with your modeling
Update on this: still working on it because getting the dict from the original df can take 12(!) hours for 10k files (let alone 400k) so I'll see if I can parallelize it to make it run smoother or speed it up somehow
hey how long does submission scoring take?
Not that much. About 10 minutes after your notebook finishes running; that part is only the scoring phase.
Hi everyone, I’m Sam and I’m a Bioengineering student. There are explainers and tutorials for beginners in biology who are participating in this competition, but I can’t seem to find any resources for someone who understands the biology but not the programming part. Where do I start? Any help would be much appreciated! Thanks for your time 🙂
You'd need a decent understanding of how neural networks work imho. 3blue1brown has a great series on the subject, some maths is needed but mostly just linalg
I do have a working understanding of neural networks and have even implemented simple CNNs and U-Nets for image segmentation. I’m new to the protein/rna structure prediction side of NNs and I feel a bit overwhelmed. How can I get started in this area?
Since a lot of the approaches are multimodal (aka using multiple NN architectures at once), go over the rest of the basic NN architectures. What is an RNN? LSTM? Graph NN?
See how an RNN, a CNN and a graph NN can be used to tackle the problem
Once you have that down you can experiment w combining them for a multimodal solution
Or looking at what state of the art models are doing
Attention mechanisms would be something worth looking into as well
Imo read up on all those simpler iterations and then move to sota
If anyone is more experienced and has more to add please feel free to do so
Thanks for your guidance! 🙏
Almost done here! Wasn't processing the dataframes by chunks which really did slow down the entire process. I'll be uploading smaller 1k and 10k samples as a Kaggle dataset in a bit, along w the notebook I made to create the dicts and how to load the data (once I clean up the code a little)
Data is up https://www.kaggle.com/datasets/kostasanthoulis/uw-rna-synthetic-data-competition-format-10k/data
I've just noticed, while working on a custom adaptation, that the encoder already has a pad token, 4. If you apply a src_mask to the pad tokens it won't make a big difference. But by using 0 for padding the model can associate A with some kind of noise. Better to put 4 in them.
Or try it at least to see if makes a difference.
I'm talking about RibonanzaNet of course.
So I've been trying to tackle this competition task for a while and I'm starting to have a few questions, I have experience with ML but this is the first time I'm handling RNA and the data and task are completely new to me
First of all regarding the data, does sequence length matter that much when training? In the proposed synthetic data each sequence is about 5k characters long while in the competition itself the sequences are about 200 chars max, how does this affect training? Given that currently the training dataset isn't huge per se
Secondly how much does padding the sequences to a common length (with a value that doesn't affect the loss function) affect training itself? Is this one of the tasks where the bigger the batch the better the outcome or is training with batch size 1 preferred to avoid padding altogether?
Personally I've circumvented this issue by mapping the nucleotides to 1, 2, 3, 4 and keeping 0 for padding. That and a mask not to have 0 contribute to the loss function is how I'm handling it at least
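A minimal sketch of a loss that ignores padded positions, written in NumPy for clarity (the same pattern carries over to a framework loss). masked_mse is a hypothetical name. np.where is used instead of multiplying by the mask so that NaN labels at masked positions cannot leak into the result.

```python
import numpy as np


def masked_mse(pred, target, mask):
    """Mean squared distance over real residues only.

    pred, target: (B, L, 3) arrays; mask: (B, L) boolean, True for real residues.
    """
    sq = ((pred - target) ** 2).sum(axis=-1)  # (B, L) squared distances
    sq = np.where(mask, sq, 0.0)  # drop padded (or missing) positions
    return float(sq.sum() / max(mask.sum(), 1))


pred = np.zeros((2, 3, 3))
target = np.ones((2, 3, 3))
mask = np.array([[True, True, True], [True, False, False]])
target[1, 1:] = np.nan  # padded / missing labels
loss = masked_mse(pred, target, mask)
```

Dividing by the number of real residues rather than by B * L keeps the loss scale independent of how much padding a batch happens to contain.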
Yes. The sequence length and position both matter because these things interact with each other. Think of it like if you put magnets on a stiff rope. If the magnets are too close they may not be able to bend the rope to interact. If you change the location of the magnet you change the length of the the curled region between magnets. If you have multiple magnets you have multiple pairs that can interact.
Short ropes can't loop on itself while long ones can.
So training on 5k long sequences while the ones in the valid set are 200 chars long isn't a great idea
Can a parallel be drawn to image resolutions as an example I'm guessing? Training on 4k images and having 720p on inference
But if that's the case how useful is the proposed synthetic data for the task?
Since afaik you can't exactly crop RNA sequences since that massively changes their structure
Yup the whole chain matters because in 3D space these things can curl on themselves. You can also change one residue and radically change the outcome.
Chemical physics is a pain like that.
So if that's the case the synthetic data isn't of much use for the task
So from the original dataset when any series that has NaN is removed we only get about 600 sequences if I remember correctly
Yes more data will be added in about April but I'm really trying to make something work and haven't gotten past 0.11 on TM score
I would guess going pretrained is the only viable option rn and just tuning
You can use synthetic data, but the key is you need to know how to generate it. When we did this for other materials we used proxy models from molecular dynamics.
It's a much more involved problem.
But yes traditional data science approaches don't work as well because it's got a high cross correlation factor
It's a similar problem to LLMs where your synthetic data is synthetic prompts.
And the response prompts still need to make logical sense
So I'm guessing a way to have the best of both worlds is to create synthetic data from the current data in the dataset
ups
Has anyone tried using actual RNA sequence data from RCSB? I'm building a parser that follows the contest's format (extracting coords from C1' atoms from nucleic acid residues).
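A rough sketch of such a parser for legacy PDB-format text, using the fixed column layout of ATOM records. It only keeps the four standard residues A, C, G, U; it skips HETATM records (so modified residues are dropped) and does not handle mmCIF. parse_c1_prime and _atom_line are hypothetical names, and the toy records are made up.

```python
RNA_RESIDUES = {"A", "C", "G", "U"}


def parse_c1_prime(pdb_text, chain_id=None):
    """Return (sequence, coords) for residues that have a C1' atom."""
    sequence, coords, seen = [], [], set()
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM") or len(line) < 54:
            continue
        atom_name = line[12:16].strip()  # columns 13-16
        alt_loc = line[16]  # column 17
        res_name = line[17:20].strip()  # columns 18-20
        chain = line[21]  # column 22
        res_key = (chain, line[22:27])  # residue number + insertion code
        if atom_name != "C1'" or res_name not in RNA_RESIDUES:
            continue
        if chain_id is not None and chain != chain_id:
            continue
        if alt_loc not in (" ", "A") or res_key in seen:
            continue  # keep only the first alternate location
        seen.add(res_key)
        sequence.append(res_name)
        coords.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return "".join(sequence), coords


def _atom_line(serial, name, res, chain, resseq, x, y, z):
    """Build a toy ATOM record with the standard column positions (illustration only)."""
    return f"ATOM  {serial:5d} {name:<4s} {res:>3s} {chain}{resseq:4d}    {x:8.3f}{y:8.3f}{z:8.3f}"


pdb_text = "\n".join([
    _atom_line(1, " P", "G", "A", 1, 0.0, 0.0, 0.0),
    _atom_line(2, " C1'", "G", "A", 1, 1.0, 2.0, 3.0),
    _atom_line(3, " C1'", "C", "A", 2, 4.0, 5.0, 6.0),
    _atom_line(4, " CA", "ALA", "B", 1, 7.0, 8.0, 9.0),  # protein atom, ignored
])
seq, xyz = parse_c1_prime(pdb_text)
```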
The catch is you have to make sure the sequence length more or less matches the ones found in the current training data. I tried training with synthetic ~5k char sequences and the results went about as well as you'd expect considering the chars in the current data are around 200 max
Wdym 200 max? Some sequences in the training set reach 4k
i mean yeah, there are some longer sequences in the training data, but only like 6% are longer than 300 nucleotides and 3.5% longer than 1000 nucleotides. Not really what I would call balanced dataset.
Anyone want to collaborate?
Hi All, I am looking for Kaggle Grandmasters who have won competition who can mentor me. I am willing to pay for mentorship. Thank you!
Do y'all have some thoughts on a viable approach to normalizing the 3d structure labels?
I have a vague idea on how to approach it, basically involves using reference vectors that lie on the 3d unit sphere to calculate some dot products with respect to the output coordinates, but I still need to figure out that it plays nicely with their TM scoring method.
Not sure if it'll work though
Yeah, I had something similar in mind, that would keep the relative distances of the nucleotides but scales them down/up accordingly, but unfortunately US-Align doesn't account for scaling, only for rotation and translation, which makes training easy but submissions are a pain because of the TM-Score.
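To see why scale matters, here is a toy TM-style score: the mean of 1 / (1 + (d_i / d0)^2) over residues, where d_i is the distance between predicted and reference positions. This sketch skips the alignment search that the real scoring performs, and the d0 value below is an arbitrary placeholder, not the competition's formula.

```python
import numpy as np


def tm_style_score(pred, ref, d0):
    """Mean of 1 / (1 + (d_i / d0)^2) for already-superposed (L, 3) coordinates."""
    pred = np.asarray(pred, dtype=float)
    ref = np.asarray(ref, dtype=float)
    d = np.linalg.norm(pred - ref, axis=1)
    return float(np.mean(1.0 / (1.0 + (d / d0) ** 2)))


rng = np.random.default_rng(0)
ref = rng.normal(size=(50, 3)) * 10.0
ref -= ref.mean(axis=0)  # center, so scaling happens about the centroid

score_same = tm_style_score(ref, ref, d0=3.0)  # identical structure
score_scaled = tm_style_score(0.5 * ref, ref, d0=3.0)  # same shape, half the size
```

A prediction with the right shape but the wrong scale loses score, and no rotation or translation can recover it, so normalised outputs have to be mapped back to Angstrom units before submitting.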
Would anyone want to work on the Stanford RNA 3D competition together? If so, DM me, I’m down to work together and work on improving our accuracy and learning a lot from this competition and growing our AI and ML and Data science skills
How far u guys progressed on aimo lb btw if it's not over yet
I want to collaborate, can I dm to you ?
This is my first competition, so I don't quite understand how the submission notebooks work? It kinda confuses me
I encountered this too. here is what is going wrong:
When you create your dataset, you have to split up the the long sequence into single characters for which you have to find the x_1, y_1, z_1.
That is why the train is so small. You don't have all of the values.
Is everyone going for deep learning methods, or are template-based and ab initio approaches worth trying too?
I’m new to competitions and am simplifying, but basically your notebook does the following:
load the data,
train a model that predicts the targets,
feed the predictions into a submission file (formatted according to competition details).
Then they use your code to predict targets, and not only for the test cases whose sequences you can already see in the test file. They may add more (if I remember well, they will, in this competition).
So your code needs to be resilient, able to predict test sequences you haven’t seen.
Thanks for the feedback, but I could also load the model I trained locally right?
Ig yes ofc
There's no such restriction in the comp
perfect, thanks a lot guys
This is very exciting! However, the submission deadline is in May, and the competition ends in September. I was wondering when the winners will be announced.
If the competition is exciting it means it's exciting to take part too ig
I’m guessing not before September as evaluation goes on:
“Future Data Evaluation Timeline:
After the final submission deadline there will be periodic updates to the leaderboard to reflect up to 40 new RNA (sequences) generated after the competition has ended. New data updates that will be run against selected notebooks.”
There will be some early prizes in April though.
I'm using a large batch size, and to do so I'm attempting to pad my batches to common lengths. I had the same question as you, but thankfully there is something called masking, which basically tells the model to ignore the padding values
This is a valid approach, I opted to training with batch_size of one and accumulating the gradients before updating the model to get a more stable training.
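A toy illustration of gradient accumulation with batch size 1, using a linear model in NumPy so the arithmetic is easy to follow. The model, learning rate and helper names are all made up for the example; in a deep learning framework the same pattern is "backward on every sample, optimizer step every accum_steps samples".

```python
import numpy as np


def grad_mse(w, x, y):
    """Gradient of the per-residue mean squared error of the linear model x @ w."""
    residual = x @ w - y  # (L, 3)
    return 2.0 * x.T @ residual / len(x)


def train_accumulated(samples, w, lr=0.1, accum_steps=4):
    """Process one variable-length sample at a time; update every accum_steps samples."""
    grad_sum = np.zeros_like(w)
    for step, (x, y) in enumerate(samples, start=1):
        grad_sum += grad_mse(w, x, y) / accum_steps  # average over the virtual batch
        if step % accum_steps == 0:
            w = w - lr * grad_sum
            grad_sum = np.zeros_like(w)
    return w  # note: a trailing incomplete group is dropped in this sketch


rng = np.random.default_rng(0)


def make_sample(length):
    x = np.eye(4)[rng.integers(0, 4, size=length)]  # (L, 4) one-hot sequence
    y = rng.normal(size=(length, 3))  # (L, 3) toy coordinates
    return x, y


samples = [make_sample(5), make_sample(9)]  # different lengths, no padding needed
w0 = np.zeros((4, 3))
w1 = train_accumulated(samples, w0, lr=0.1, accum_steps=2)
```

No padding or masking is needed this way, at the cost of less parallelism per step.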
How long does a whole training session take
Does anyone know of any GNN-based approach?
There are a few but idk the names
just search on google
Yes, I am in.
As that was my thought.
Do u ve any plan/idea ?
Hello Everyone!
I am Shashank, with 3+ years of experience in the domain of data - I am very much interested in creating and deploying end-to-end machine learning models.
Going deeper, models related to Artificial Intelligence also fascinate me, both to work on and to build proper solutions for businesses.
Same interest students/professionals can connect me on my linkedin: https://www.linkedin.com/in/snkp0018
Happy Learning!
Best,
Shashank Pandey
That largely depends on the model and training data, and hardware you use. My initial model was a rather simple CNN, utilizing only the training data provided by the competition and training on my RTX 4090, each epoch took like 10-20 seconds.
Holy shit
That's crazy
I'm trying to use paperspace for cuda and stuff but there's a bottleneck I can't fix and it's slowing it down so much
have you tried using torch.utils.bottleneck? that might give you some further insights
i mean where exactly your pipeline spends the most time, like data loading etc.
Any interest in joining a team?
Any issue on notebook submissions?
CNN? Don't you mean GNN?
Most probably
nope, i mean a CNN, i was performing a kronecker product on the input embeddings to make them quadratic in shape
I was curious. Over what dimensions you apply convolutions?
anyone having trouble with memory on the long sequences when submitting with DL models?
Scaler. joblib file?
Hey, for real, thanks for this tip. I spent a while debugging with ChatGPT to no results, but this I think will actually help
Given the one-hot encoded input tensor of shape LxM (L = length of sequence; M = encoding dimensions [4 -> G, A, U, C]), I perform the Kronecker product to get to the shape (L, L, M^2), which is square in its first two dimensions, with M^2 channels which I can then process with a CNN. Does that clarify my approach?
import torch

def kronecker_product(self, one_hot_encoding):
    """
    Computes the pairwise Kronecker (outer) product of the one-hot encoded sequence.
    Args:
        one_hot_encoding (torch.Tensor): One-hot encoding of the sequence (L x 4).
    Returns:
        torch.Tensor: Pairwise Kronecker product (L, L, 16).
    """
    # Outer product of the encodings at positions i and j, for all pairs (i, j)
    L = one_hot_encoding.shape[0]
    kron_product = torch.einsum('ik,jm->ijkm', one_hot_encoding, one_hot_encoding)
    # Flatten the two encoding axes into 16 channels: (L, L, 4, 4) -> (L, L, 16)
    kron_product = kron_product.reshape(L, L, -1)
    return kron_product
In mathematics, the Kronecker product, sometimes denoted by ⊗, is an operation on two matrices of arbitrary size resulting in a block matrix. It is a specialization of the tensor product (which is denoted by the same symbol) from vectors to matrices and gives the matrix of the tensor product linear map with respect to a standard choice of basi...
don't worry man, hope that helps you 🙂
but doesn't the Kronecker product of orthonormal vectors produce unnecessarily large and sparse new vectors?
Yeah what optimizer do you use
Just adam or fused or 8bit or lion
yes it is sparse, you're right, as each entry along each dimension contains a 1 only one time, but I haven't found a better way to go about it, especially because I want to start off with CNNs and therefore I need the input tensor to be square as well. if you have a more intuitive approach, let me know though haha
I used AdamW, but tbh, I haven't experimented with different optimizers yet
My teammate told me about an RNA model using 2D images. What I first thought of was base pair probabilities, but personally I'm skeptical about any CNN for this task.
A cnn would work well with it it just needs to be in a transformer
That's how their ribonanzanet model works and its provided to you and works kinda well
Fr ribonanzanet based on ViTs ?
No
1d convolutional i think
it uses each possible nucleotide as a diff vector so 4 total and treats each sequence as an input
so it runs self attention on the sequence
What about ViT instead ?
I'm not too familiar with vits how would it be used here
my current work is based on some paper that only employ cnn's which works sufficiently well for them, for now I would just want to recreate their scores.
Ah really
Interesting
So no encoding whatsoever?
That's surprising with nucleotide sequences you'd think you'd need attention yk for long range interactions like folding
i mean it‘s one-hot encoded, but nothing more yeah
the model converges sufficiently well, but the resulting structures are somewhat off, so i‘ll try training with the diffusion data and making the model bigger and if that doesn‘t improve the results i‘ll probably start working with attention
Oh and if you're interested in using attention the ribonanzanet model they provide has a bunch of stuff setup
Try differential transformer
this one:
listed under „additional files“ for the competition
i‘ll look into it, thanks man 🙏
Yeah you'll have to peek into their premade network.py
What would be the advantage there for this
"Potentially Scam"
Hey guys im new to the competition, what are some good scores you've seen on the TM scale for the public leaderboard?
Or have found yourself
You can see the public leaderboard here: https://www.kaggle.com/competitions/stanford-rna-3d-folding/leaderboard the top scores are ranging from 0.35-0.5 right now
Thanks! Is there a place where I can test my code to get a score or do I have to program the TM scale inside my program to get a score?
https://www.kaggle.com/code/fernandosr85/rna-3d-fold-hybrid-template-nn-structure/notebook#RNA-3D-Structure-Prediction-Pipeline-🧬 This notebook looks so comprehensive and well built but it still has a score of 0.2 now im not entirely sure if its good or bad but 🤔
No need to program it, you just
- transfer your predictions to a csv file structured in the same way as the sample submission.
- Then you save your notebook. After it’s saved, you can
- submit it to the competition. You can do that from the notebook, or from the submission tab in the competition page. It will run again and a score will appear after it ran
Edits for clarity
Thats really helpful, thank you so much!
Yeah like what Diego said, as long as you format your submission like the sample submission csv file, you should be good 👍
Another question, in training_labels there's a lot of missing coordinate data. Should I just clean up all the empty ones (entire rows) or is there something else i can do?
cus i can see there's a significant amount of missing coordinates
I merged train_seq with train_lab, then dropped the rows with missing values for x_1, y_1, z_1
I couldn’t think of an alternative
Is the current top 1 legit or data leakage?
If a solution ends up being too tailored to the existing scoring dataset, there can be a big shake up in the leaderboard when a new scoring dataset is introduced. The introduction of a new scoring set will happen two times in this competition. So we’ll find out soon I guess. Don’t forget to check the discussions on Kaggle. You can search for terms like leaderboard and see what folks have been speculating about. There’s a lot more discussion going on there.
I checked it already there's just a lot going on there so it's hard to find the gist
Hey, is anyone else getting this issue?
When I try to submit to the Stanford RNA 3D Folding competition, it says:
Cannot submit — Submissions have been disabled for this competition.
But the competition deadline is still a month away. Just wanted to check if it’s a platform issue or something temporary. Let me know if you’re seeing the same thing.
Submission failed! Can anyone tell why this is happening?
It depends on your code. I think some of the most promising public notebooks have some error when executed on the new test set. You should check it carefully or debug step by step: submit with incremental pieces of the code to find what crashes it.
Fast answer, test samples have changed.
If your notebook is failing, in addition to checking if you have the right columns, you might want to check the order of rows.
The order of rows needs to be exactly as in the sample submission file. And the order is not always perfectly sequential in that file (at least it wasn’t before the leaderboard pause).
And the indexes and IDs also need to match those of the sample submission.
At least, that has been my experience. Changed everything: columns, data types. Only worked when indexes and IDs were aligned in the same way as in the sample submission.
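A small sketch of that alignment step: take the row IDs from the sample submission in their original order and look each prediction up by ID, instead of trusting the order your own code produced. The ID strings and the zero fallback are assumptions for illustration.

```python
def align_to_sample(sample_ids, predictions, fill=(0.0, 0.0, 0.0)):
    """Reorder predictions to match the sample submission.

    sample_ids: row IDs in the exact order of the sample submission file.
    predictions: dict mapping row ID -> (x, y, z).
    IDs missing from predictions get `fill`; unknown IDs raise ValueError.
    """
    extra = set(predictions) - set(sample_ids)
    if extra:
        raise ValueError(f"IDs not present in the sample submission: {sorted(extra)}")
    return [(row_id, predictions.get(row_id, fill)) for row_id in sample_ids]


sample_ids = ["R1_2", "R1_1", "R2_1"]  # deliberately not sequential
predictions = {"R1_1": (1.0, 1.0, 1.0), "R1_2": (2.0, 2.0, 2.0)}
aligned = align_to_sample(sample_ids, predictions)
```

Failing loudly on unknown IDs is a design choice here: a silent mismatch tends to show up only as a failed or badly scored submission.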
guys do these values in the validation labels have any meaning? Or should I just delete them
NaN
Hello everyone!
I'm a B.Tech undergraduate currently looking to join a team. If any team has an open spot and is looking for a dedicated member, I’d love to be a part of it!
Alternatively, if you’re also looking for a team, feel free to join mine— I’m open to collaborating with like-minded people. Let’s connect!
Hi guys; quick question: is the foundational ribonanzanet model trained on data that isn't publicly available/posted within the comp? Would I be losing out on a lot of data if I'm not using ribonanzanet? Thanks in advance
Hey guys, pardon if I ask dumb questions: the sample submission asks for 5 sets of coordinates. I've gotten one set of coordinates by using a model. How do I get the other 4? Should I use different models in the same code notebook to generate 4 other coordinate sets and then put all of these together?
you have 5 guesses: you can use five models, postprocess output from a single model 5 different ways, or just repeat a single prediction five times, it's up to you
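The simplest of those options (repeating one prediction five times) could look roughly like this. The x_k, y_k, z_k column naming follows the x_1/y_1/z_1 labels mentioned in this chat; the function name and the (L, 3) array layout are assumptions.

```python
import numpy as np
import pandas as pd

def repeat_prediction(coords: np.ndarray, n_guesses: int = 5) -> pd.DataFrame:
    """Copy one (L, 3) array of x, y, z coordinates into n_guesses column sets.

    Produces columns x_1, y_1, z_1, ..., x_n, y_n, z_n (assumed naming).
    """
    cols = {}
    for k in range(1, n_guesses + 1):
        for axis, name in enumerate("xyz"):
            cols[f"{name}_{k}"] = coords[:, axis]
    return pd.DataFrame(cols)
```

To use five different models or postprocessing variants instead, fill each x_k, y_k, z_k triple from a different (L, 3) array.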
Hey, I'm getting the submission file not found error when I submit. I've checked that the notebook runs through and generates the attached submission file format.
I'm wondering if it could be a dependency thing? I'm running !pip install for two modules at the top of my notebook with internet enabled, and I have them listed in my dependency file and turn internet off when I submit. The submission process seems to get past the dependency installation and moves on to actually running the notebook, but I'm not sure this necessarily means the dependencies were installed successfully.
Is there any way to get more information on what might be going wrong? How do people typically debug submission errors-- just add/remove components until it starts submitting properly?
does your code produce extra files on disk? Often that causes this error, so if it does you should clean everything from disk before to_csv().
Ah, and it's not the issue this time, but saving the csv with index=False will avoid future format errors.
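A rough sketch of that cleanup, assuming everything in the output directory other than submission.csv can be thrown away. The helper name is hypothetical, and the directory is passed in explicitly so nothing is deleted by accident; on Kaggle it would typically be the notebook's working directory.

```python
import shutil
from pathlib import Path

import pandas as pd

def save_clean_submission(df: pd.DataFrame, workdir) -> Path:
    """Delete everything in workdir, then write only submission.csv there."""
    workdir = Path(workdir)
    for item in workdir.iterdir():
        if item.is_dir() and not item.is_symlink():
            shutil.rmtree(item)   # downloaded models, caches, etc.
        else:
            item.unlink()         # stray files and symlinks
    out = workdir / "submission.csv"
    # index=False keeps pandas from writing an extra unnamed index column.
    df.to_csv(out, index=False)
    return out
```

Call it as the very last step of the notebook, since it removes any downloaded model or tokenizer files in that directory.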
I don't think it produces extra files, it does download a model and tokenizer though
so yes, those are files other than the csv; try leaving only submission.csv on disk
Thank you, realized also I was attempting to download said model/tokenizer with internet off 😂 😭
lol, I missed that too, but that shouldn't produce submission.csv not found, should it?
I figured if any cell errors out, it stops notebook execution, so the csv file saved in a later cell just won't get created?
Hey guys there's about 15 missing values in the validation labels. I need to predict them to make a submission file. How am I supposed to predict values for them if my test set doesn't have data for them?
The local test set is just an example and you should not use it to train. Anyway, if for any reason you have missing values in your training data, you can either ignore them in the loss calculation or impute values. In this case, I suggest just ignoring them.
I'm not sure I understand your question correctly.
these values are missing in the validation labels. But we need to make predictions of them
cus the submission file demands it
Yes, but that's the key. For a test, you don't need to know the labels, just the input (-CAU-). If you want to use them as training data you can, since it's only a toy test: just mask the unknown positions and don't count them in your loss calculation. Same for local testing: process the full chain, but when you compute the score, remove the unknown positions.
You won't know any labels for the actual hidden test.
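A small sketch of the masking idea, assuming missing coordinates show up as NaN in the labels (if the file uses a sentinel value instead, convert it to NaN first). The function name is hypothetical, and the mean squared distance here is only a stand-in for whatever loss or score you actually use; it is not the competition metric.

```python
import numpy as np

def masked_mean_sq_dist(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared distance over residues whose label coordinates are known.

    pred, target: (L, 3) arrays; rows of target containing NaN are skipped.
    """
    valid = ~np.isnan(target).any(axis=1)   # True where x, y and z are all known
    if not valid.any():
        raise ValueError("no labeled positions to score")
    diff = pred[valid] - target[valid]
    return float(np.mean(np.sum(diff ** 2, axis=1)))
```

The model still predicts coordinates for every residue in the chain; only the scoring skips the unlabeled ones.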
But for the hidden test, the format of the file will be the same. My code will pick up the y_1 and z_1 labels even for the hidden set and predict the x_1 labels, which is also what I'm doing here. And if the y_1 and z_1 are faulty in the test set here, I'm not sure how to get x_1
Your submission code has to read only the input sequences of a general test and produce five sets of xyz coordinates for every residue in each sequence. You don't need any labels for that. You only need them to train and score.
That should be done separately, with properly labeled or masked data.
I ran a different model and it got rid of the missing labels, it now gives proper labels. But the submission still shows an error, i cant seem to get a score. I don't understand why :((
If anyone would want to help with checking out my submission file, id be very grateful
can't seem to get a score
you can share the code in Kaggle and ask for help
in the forum?
btw are too many decimal places eg 7 a problem for my predictions in submission file?
not at all, but I think your code is not general and is producing a submission for the toy test example rather than for a general unknown test.
Thing is, I'm getting an error like this: submission scoring error. This only happens when the file format itself is faulty, as I read in the error documentation on Kaggle itself
hello guys, is anyone using esm2/3 (protein language model) to solve this competition?
This is more of a general Kaggle question than specific to this competition--but when I edit one of the provided starter notebooks, my new "forked" notebook seems to get saved in a different "folder" or "area" on Kaggle where it doesn't show up at all in "Code" under "Your work". I'm wondering why this is--I'm guessing that "competition notebooks" are different from general Kaggle notebooks and that only the former show up under "Your work".
Going off of that, I was wondering if there's an easy way to copy the dependencies from an existing notebook into your own fresh competition notebook (not an "edit" of that other notebook). The RibonanzaNet secondary structure inference works beautifully in the example notebook, but if I just copy the notebook code to my own new notebook, it doesn't run, at least in part because RibonanzaNet is not included in the inputs directory. But even if I download and re-upload the .pt files into my own notebook, I'm not sure if it will work because there are other files in the directory of that starter notebook than just the weights.
No idea what you are talking about
forks should be found in your work with all inputs from original attached, refresh if not
I made the fork a month ago, it should definitely be showing up by now. The starting notebook was this one: https://www.kaggle.com/code/shujun717/rnet2-alpha-2d-structure-inference
It's only by going to that page and clicking "Edit My Copy" in the upper right that I'm able to get to it--it shows nowhere else.
I'm wondering if it's possibly because it says "Draft session" at the top--maybe that's a kind of temporary notebook rather than a "full" one? It's still saved, because as I said I've been working on it for like a month, and I was able to rename it and everything but it still doesn't show in "Your Work".
if it is from another competition you will need to upload the notebook and set all inputs manually, I think
so try to search for that fork in your work but under that competition, if possible
ok, it's not from a competition, but it's about what I was thinking: to submit in this competition you'll need to upload it manually in this competition's code section and set all inputs there manually (or maybe that will be automatic, I'm not sure)
I am wondering. The competition shows as “closed” but the timeline says it has 4 months to go. Does this mean I have time to form a submission?
I don't think so, those 4 months are to get a final safe test of true new structures
Any way we can get confirmation on that?
Bummer
it's closed
Is this the case with every Notebook you fork or just this particular one?
This is the only one I've tried forking.
By the way, I am still looking for people who want to collaborate on RNA structure prediction in the long term. I presented some ideas to the Eterna game people, I don't know if anyone from here is on there too.