#leash-belka
This is gonna be a dumb question, but I am struggling to open the training set for this comp. There's just a red circle at the bottom left in Colab, and when I try to open it in Excel it's just grey. This isn't an issue with any other dataset I've tried to open. Does anyone know why, or how to fix it?
It's probably a lot bigger than Excel can handle. You might need to break it into smaller pieces if you want to open it in Excel. You can also use a tool like DuckDB or Polars to open it in Python.
Would that apply to Colab as well? When I tried opening it in Colab I just got a red circle. However, when I look at the disk/RAM statistics, it doesn't seem to be overloading Colab.
I hope that makes sense.
Yes, unfortunately, @dire token
The dataset is far too large
Try using DuckDB
Will try it out. Thanks
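For anyone who wants to see what "breaking it into smaller pieces" can look like, here is a minimal sketch that uses only the Python standard library. It assumes the training set is available as a CSV with a `binds` column; the `id` column, the helper names `iter_chunks` and `count_binders`, and the tiny in-memory sample are made up for illustration. DuckDB or Polars would likely be faster on the real file; this only shows the idea of never holding the whole file in memory.

```python
import csv
import io
from itertools import islice


def iter_chunks(file_obj, chunk_size):
    """Yield lists of at most chunk_size rows (as dicts) from a CSV file object."""
    reader = csv.DictReader(file_obj)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk


def count_binders(file_obj, chunk_size=100_000):
    """Count rows with binds == 1 while holding only one chunk in memory."""
    total = 0
    for chunk in iter_chunks(file_obj, chunk_size):
        total += sum(1 for row in chunk if row["binds"] == "1")
    return total


# Tiny in-memory stand-in for the real file (hypothetical rows).
sample = io.StringIO("id,protein_name,binds\n0,BRD4,0\n1,HSA,1\n2,sEH,1\n")
print(count_binders(sample, chunk_size=2))  # prints 2
```

On the real data you would pass an opened file instead of the `io.StringIO` sample, for example `open(path, newline="")` with your own path.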
All the molecule SMILES in train have a central structure of Nc1nc(N)nc(N)n1 (I wrote this SMILES from memory, so I'm not 100% sure) with three branches. That's the general structure, but in test this is not the case?
I think there is a structural inconsistency between test and train
Yes, that will be one of the central challenges of this competition
Generalizing from the triazine you described to other structures
Hey! Need to confirm one thing about the binds column in the train dataset: are the 0 values observed true non-binders, or are only the 1 values truly observed from experimentation, while 0 can mean either no binding or an unknown result?
+binds - The target column. A binary class label of whether the molecule binds to the protein. Not available for the test set.
@proper venture They are assumed to be non-binders. In DNA-encoded libraries, we start with the full set of molecules and after several rounds of screening, you can see which ones bind. Therefore you can assume that the remaining ones did not bind
Thanks for the clarification 🙂
Hello everyone, is anyone interested in being part of a team for this project?
Hello everyone, I am interested in working on the Leash Bio BELKA challenge and am looking for a team to join. Please let me know if anyone is looking for a team member! Thank you
I will DM you
Sure, thanks @red hamlet
Umm, we have a team already; we can merge if it's suitable
Hello everyone 👋, I am a Master's student in bioinformatics, and I am interested in working on the Leash Bio BELKA challenge. I am currently looking for a team to join. Please let me know if anyone is looking for a team member! Thank you.
Can you understand the dataset well? If so probably we can.
PhD AI student here
I haven't participated in that challenge yet because I don't know the dataset
No, I am still trying to understand it because my background is in business intelligence, and it's my first year in the Master's
Oh OK, sure, let me know if you can understand the data
Hey there, soon-to-be B.Sc. Bioinformatics graduate here, whose thesis used molecular docking. I know how to use AutoDock (with PyRx) and Hex. Do I have a chance? I am good with deep learning in PyTorch as well. I am assuming this competition is about predicting whether a ligand has high binding affinity with the 3 protein targets?
Hey everyone, I am a data engineer at a cancer research institute and I have my master's in data science. I have experience with RDKit and I'm looking for a team. Please let me know if you want to team up.
sure join up
dm me your kaggle id
is the dysprosium the placeholder for where the DNA itself starts, or is it the placeholder for the atom that is then attached to the DNA? it's the start of the DNA itself, right?
On the website it says under the data section "Note we use a [Dy] as the stand-in for the DNA linker." I believe the linker is a synthetically made molecule that connects to the protein, which would then go on and attach to the DNA, but bio isn't my strong suit so I'm not 100% sure that is entirely the correct explanation. I have a cousin in bio MechE and ChemE, so I can see if I can get a more detailed answer if needed.
I think it would be the 2nd option in this pic since that's what it sounds like, but I figured it's worth asking to be sure 👍
Hey guys, I'm seeking a mentor (teammate) for Kaggle competitions and Leash Bio in particular, reposting from here 😄 : #👥┊looking-for-a-team message
Yeah, it should be the second option, and it's just standing in for some molecule that's not really of concern
Both pictures are the same. In the second one you are generating a CH2 that either is part of the molecule or has been omitted from the final SMILES (which makes no sense).
Correct, I was asking whether we can remove the dysprosium outright or if we have to replace it with something like a carbon (which I agree would be really silly)
just wanted to be absolutely sure haha
I'm a chemist, though long out of practice. I would just leave it, or if you really need to replace it, I would use the default fill value in organic chemistry: H.
But that would be a dramatic change to the overall chemical interactions. I would definitely leave Dy, or use a token that means "link".
The thing is, as I understand DEL, the DNA identifies each sample and is meant not to interact with it. But it's still a huge molecule that has been there the whole process.
Bro, what are you saying? Where are you from?
Mumbai
I'm from Kharghar 😉
Cool
Hi everyone, I have a couple of questions about the data. Has anyone been able to find specific info about the experimental conditions used, e.g. pH and solvent composition?
And is anyone looking for a teammate? I finished my undergrad in molecular biology and bioinformatics several weeks ago and am very interested in modeling biological processes
cool, you can join in 😄 , We're Fallen Angels.
we got a paramedic with us
@dire token I am facing the same problem opening the provided 50 GB file in Excel.
Hello all, I am Rajiv from Mumbai. I am not able to open the input file in Excel as it is very large. Can you help? Connect on rajivkjs@gmail.com
Hi, from GREYSNOW on code
A more compact version of the same dataset
Hi everyone, I'm looking at the data here, and it seems that for one target there are no binders. Can anyone confirm that?
> train.groupby(by='protein_name',sort=False).value_counts(subset=['binds'])
protein_name  binds
HSA           0        98007200
BRD4          0        97958646
sEH           0        97691078
              1          724532
BRD4          1          456964
HSA           1          408410
Name: count, dtype: int64
There are no class-1 (+bind) samples for sEH in the training set. However, the test set is evenly distributed:
> test.value_counts(subset=['protein_name'])
protein_name
BRD4 558859
sEH 558142
HSA 557895
Name: count, dtype: int64
Can anyone confirm that? Can we just assume the data with no label (724532 entries) refers to sEH?
I think you need to check something in your code:
Actually, those are exactly the same 724532 samples that appear without a label in your groupby: pandas leaves a repeated index label blank, so that row is sEH with binds = 1.
"Can we just assume the data with no label (724532 entries) refers to sEH?" Yes, looks like that's the case.
Thanks!
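The blank label above comes from how pandas prints a multi-level index: a value that repeats the one on the previous row is not shown again. A minimal sketch of one way to avoid the confusion is to turn the index back into ordinary columns. The toy `train` frame below is made up for illustration; only the column names `protein_name` and `binds` come from the discussion.

```python
import pandas as pd

# Toy stand-in for the real training set (hypothetical rows).
train = pd.DataFrame(
    {
        "protein_name": ["HSA", "HSA", "BRD4", "sEH", "sEH", "sEH"],
        "binds": [0, 1, 0, 0, 1, 1],
    }
)

# Count rows per (protein, label) pair, then flatten the index so that
# every row of the result carries its protein name explicitly.
counts = (
    train.groupby(["protein_name", "binds"])
    .size()
    .reset_index(name="count")
)
print(counts)
```

With the index flattened, each count sits next to its own `protein_name`, so no row looks unlabeled.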
Hey, I'm getting an accuracy of 0.71 using a graph convolution method. Will LSTM-based models achieve higher accuracy?
Is anybody able to use the full training data?
Does anyone find that the data contains SMILES with unclosed parentheses?
RDKit didn't have any problem reading them in my case
It was due to Google Colab disconnecting
The issue was solved by mounting Drive
Something I observed from a notebook:
This one guy tokenized the SMILES and passed them into a 1D CNN
Does that work similarly to or as well as ECFPs?
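For context, here is a rough sketch of what tokenizing SMILES for a 1D CNN might involve. This is not the notebook author's code. The regular expression is a simplified assumption that keeps bracket atoms such as `[Dy]` as single tokens, and the helper names `tokenize`, `build_vocab`, and `encode` are hypothetical.

```python
import re

# Bracket atoms ([Dy], [nH], ...), two-letter halogens, two-digit ring
# closures (%10, ...), otherwise one character per token. Simplified on purpose.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return SMILES_TOKEN.findall(smiles)


def build_vocab(smiles_list):
    """Map each distinct token to an integer id; 0 is reserved for padding."""
    tokens = sorted({tok for s in smiles_list for tok in tokenize(s)})
    return {tok: i + 1 for i, tok in enumerate(tokens)}


def encode(smiles, vocab, max_len):
    """Convert a SMILES string to a fixed-length list of token ids."""
    ids = [vocab[tok] for tok in tokenize(smiles)][:max_len]
    return ids + [0] * (max_len - len(ids))


example = "ClCC(=O)N[Dy]"  # made-up molecule with the [Dy] stand-in
vocab = build_vocab([example])
print(tokenize(example))
print(encode(example, vocab, max_len=12))
```

The fixed-length id sequences would then go through an embedding layer followed by 1D convolutions, so the model learns its own features rather than starting from a precomputed fingerprint.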
What are the latest updates on this competition? How's the current situation? What models or methods are people using?
I have a hard time comprehending why CNNs work so much better than ECFPs when they kind of do the same thing
Maybe people haven't tried giving the ECFPs a large enough radius, or CNNs capture more information? Maybe CNNs are also better at capturing different structures being close to each other?
Does anyone have any insight on why CNNs are so much better than ECFPs?
I've just checked the principles behind ECFPs and don't have much idea about the particular implementation of those CNNs. But from what I've read, ECFPs are very large and sparse representations. What's more, if they are folded to a fixed length, some information is usually lost through bit collisions (two or more different substructural features can be represented by the same bit position).
If CNNs don't have those limitations, that could be the key.
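To make the bit-collision point concrete, here is a toy sketch in plain Python. It assumes folding is done by taking each feature identifier modulo the fingerprint length, which is a simplification of what real fingerprint libraries do. The identifiers and the helper names `fold` and `collisions` are made up for illustration.

```python
def fold(feature_ids, n_bits):
    """Group integer feature identifiers by the bit position they fold onto."""
    bits = {}
    for fid in sorted(set(feature_ids)):
        bits.setdefault(fid % n_bits, []).append(fid)
    return bits


def collisions(feature_ids, n_bits):
    """Return only the bit positions shared by two or more distinct features."""
    return {bit: ids for bit, ids in fold(feature_ids, n_bits).items() if len(ids) > 1}


features = [5, 17, 40, 2053]  # hypothetical substructure identifiers
print(collisions(features, n_bits=2048))  # {5: [5, 2053]}
print(collisions(features, n_bits=4096))  # {}
```

A longer fingerprint removes this particular collision, at the cost of an even larger and sparser vector.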
I'm pretty sure the research paper said there was some information loss, but not so much that it would render ECFPs much less useful
can someone help me understand why sparse is bad?
in this context
Also, I've noticed that the original author only used ECFPs with a radius of 2, which means they won't be able to capture much information, so maybe increasing that would yield better results
Personally, I've been working with graphs so far, with poor results. I think I'll give SMILES a chance next.
Are building blocks important pieces of data to utilize? From my limited domain knowledge, it seems that the full SMILES string contains all the information from the building blocks.
I've been turning the full SMILES into graphs and running it in my GCN. My team is considering doing docking simulations to look at where the binding happens. We think that is a more "intuitive" approach than utilizing building blocks. Can someone more knowledgeable than I explain the purpose of building blocks?
I understand that building blocks can help us find correlations since we have so many that are "similar" (2/3 building blocks are identical) but training it like this won't enable us to generalize for unseen molecule-protein interactions.
Am I wrong for thinking that is the case?
I have built my model, saved it, and uploaded it as a dataset so that I can perform inference in the inference notebook. In the inference notebook, I generate the submission.csv file, which is in the kaggle/working directory.
When I Quick Save to submit, I get a "submission file not found" error.
Where could I be going wrong?
Ooohh.. This one seems to be a little easy. I have submitted only the submission file and it got scored and I am ranked🤭
Hey everyone! This may be a silly question, but just to clarify, is the objective to build a binary classifier or a binding affinity prediction model that outputs the predicted binding affinities of a molecule towards a target protein? I am still kind of confused about this because the overview says to build an ML model to predict binding affinities, but one of the tags is binary classification.
Your submission should contain the binding probability. The competition metric is "average precision", if you understand the metric it will be clearer.
Oh ok, just wanted to make sure. Thank you for helping.
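As a rough illustration of why probabilities (or any ranking scores) are what matters for average precision, here is a minimal pure-Python sketch of the metric. It assumes no tied scores and may differ in details from the competition's exact scoring code. The function name `average_precision` and the toy inputs are made up for illustration.

```python
def average_precision(y_true, y_score):
    """Mean of precision@k over the ranks k at which a true binder appears.

    Assumes binary labels (0/1) and no tied scores.
    """
    # Rank samples from highest to lowest predicted score.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


labels = [1, 0, 1, 0]          # hypothetical ground truth
scores = [0.9, 0.8, 0.7, 0.1]  # hypothetical predicted probabilities
print(average_precision(labels, scores))
```

Only the ordering of the scores matters here, so a model that ranks true binders above non-binders scores well even if its probabilities are poorly calibrated.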
@rotund gazelle We participated in this competition
It seems that Kaggle selected the wrong top files
The bronze medal is supposedly at 0.250
Our file has a score of 0.251
@static lantern
Yet we didn't get the LB position
And we didn't select any files to submit, since it said it would consider the top 2 files