#leash-belka


quiet vessel
#

love drug contest

grim loom
#

lol

#

i just entered it

dire token
#

This is gonna be a dumb question, but I am struggling to open the training set for this comp. There's just a red circle at the bottom left on Colab, and when I try to open it in Excel it's just grey. This isn't an issue with any other dataset I've tried to open. Does anyone know why or how to fix it?

paper laurel
dire token
rapid yarrow
#

Yes unfortunately @dire token
The dataset is far too large to open that way
Try and use duckdb
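
DuckDB can query a file in place without loading it. If installing it is not an option, a dependency-free way to at least look at the data is to stream the file and keep a small random sample. This is only a minimal sketch: it assumes the training set is available as a CSV, and the file name `train.csv` and the column names in the toy data are assumptions.

```python
import csv
import io
import random


def reservoir_sample(lines, k, seed=0):
    """Return up to k rows chosen uniformly at random from a CSV stream.

    Only k rows are held in memory at any time, so the file size does not matter.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(csv.DictReader(lines)):
        if i < k:
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = row
    return sample


# Tiny in-memory stand-in for the real file (column names are assumptions).
demo = io.StringIO(
    "id,molecule_smiles,protein_name,binds\n"
    + "".join(f"{i},C,sEH,0\n" for i in range(100))
)
rows = reservoir_sample(demo, k=5)
print(len(rows), rows[0]["protein_name"])

# With the real file this would be something like:
# with open("train.csv", newline="") as f:
#     rows = reservoir_sample(f, k=1000)
```

The sample is small enough to open in a notebook or a spreadsheet, which is usually all that is needed for a first look.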

dire token
#

Will try it out. Thanks

sacred moss
#

In all the molecule SMILES in train there is a central structure of Nc1nc(N)nc(N)n1 (I wrote this SMILES from memory, so not 100% sure) with three branches. That's the general structure, but in test this is not the case?

#

I think there is a structural inconsistency between test and train

rose rivet
#

Yes, that will be one of the central challenges of this competition

#

Generalizing from the triazine you described to other structures

proper venture
#

Hey! Need to confirm one thing: for rows in the train dataset where the binds column is 0, are those observed true non-binders? Or are only the 1s truly observed from experiments, while a 0 can mean either no binding or an unknown result?

hard birch
#

+binds - The target column. A binary class label of whether the molecule binds to the protein. Not available for the test set.

rose rivet
#

@proper venture They are assumed to be non-binders. In DNA-encoded libraries, we start with the full set of molecules and after several rounds of screening, you can see which ones bind. Therefore you can assume that the remaining ones did not bind

proper venture
#

Thanks for the clarification 🙂

icy pendant
#

Hello everyone, is anyone interested in being part of a team for this project?

floral sonnet
#

Hello everyone, I am interested in working on the Leash-Bio BELKA challenge and am looking for a team to join. Please let me know if anyone is looking for a team member! Thank you

floral sonnet
#

Sure , thanks @red hamlet

grim loom
stuck salmon
#

Hello everyone 👋, I am a Master's student in bioinformatics, and I am interested in working on the Leash-Bio BELKA challenge. I am currently looking for a team to join. Please let me know if anyone is looking for a team member! Thank you.

flat oasis
#

I have yet to participate in that challenge. I haven't started because I don't know anything about the dataset

stuck salmon
flat oasis
#

Oh ok, sure. Let me know if you manage to understand the data

sacred moss
#

Hey there, soon-to-be B.Sc. Bioinformatics graduate here whose thesis used molecular docking. I know how to use AutoDock (with PyRx) and Hex. Do I have a chance? I am good with deep learning in PyTorch as well. I am assuming this competition is about predicting whether a ligand has high binding affinity with the 3 protein targets?

boreal lake
#

Hey Everyone, I am a data engineer at a cancer research institute and I have my master's in data science. I have experience in rdkit and i'm looking for a team. please let me know if you want to team up.

grim loom
#

dm me your kaggle id

sweet pilot
#

is the dysprosium the placeholder for where the DNA itself starts, or is it the placeholder for the atom that is then attached to the DNA? it's the start of the DNA itself, right?

elder stone
# sweet pilot is the dysprosium the placeholder for where the DNA itself starts, or is it the ...

On the website it says under the data section "Note we use a [Dy] as the stand-in for the DNA linker." I believe the linker is a synthetically made molecule that connects to the protein, which would then go on and attach to the DNA, but bio isn't my strong suit so I'm not 100% sure that is entirely the correct explanation. I have a cousin in bio meche and cheme so I can see if I can get a more detailed answer if needed.

sweet pilot
#

I think it would be the 2nd option in this pic since that's what it sounds like, but I figured it's worth asking to be sure 👍

wanton basalt
elder stone
#

yea it should be the second option, and it's just standing in for some molecule that's not really of concern

hard birch
sweet pilot
hard birch
#

I'm a chemist, though long out of practice. I would just leave it, or if you really need to replace it, I would use the usual fill value in organic chemistry, H.

#

But that would be a dramatic change to the overall chemical interactions. I would definitely leave Dy, or use a token that means "linker".

#

The thing is, as I understand DEL, the DNA identifies each sample and is meant not to interact with it. But it's still a huge molecule that has been there through the whole process.

eternal canopy
flat oasis
#

Mumbai

eternal canopy
flat oasis
#

Cool

cerulean kite
#

Hi everyone, I have a couple questions about the data. Has anyone been able to find specific info about experimental conditions used? eg. pH and solvent composition.

#

And is anyone looking for a teammate? I finished my undergrad in molecular biology and bioinformatics several weeks ago and am very interested in modeling biological processes

grim loom
#

we got a paramedic with us

waxen nimbus
#

@dire token I am facing the same problem of opening the 50 GB xl file given.

waxen nimbus
#

Hello all, I am Rajiv from Mumbai. I am not able to open the xl input as it is very large. Can you help? Connect on rajivkjs@gmail.com

hard birch
#

Hi, from GREYSNOW on code

#

A more compact version of the same dataset

trail hill
#

Hi everyone, I'm looking at the data here, and it seems that for one target there are no binders. Can anyone confirm that?

> train.groupby(by='protein_name',sort=False).value_counts(subset=['binds'])
protein_name  binds
HSA           0        98007200
BRD4          0        97958646
sEH           0        97691078
              1          724532
BRD4          1          456964
HSA           1          408410
Name: count, dtype: int64

There seem to be no class-1 (+bind) rows for sEH in the training set. However, the test set is evenly distributed:

> test.value_counts(subset=['protein_name'])
protein_name
BRD4            558859
sEH             558142
HSA             557895
Name: count, dtype: int64

Can anyone confirm that? Can we just assume the data with no label (724532 entries) refers to sEH?

hard birch
#

I think you need to check something in your code:

#

Actually, there are exactly the same 724532 samples that appear without label in your groupby
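
For anyone puzzled by the same output: pandas leaves a repeated outer index label blank when it prints a multi-level index, so the unlabeled `1  724532` row belongs to the label on the row above it, sEH. A minimal sketch of a count that names every row explicitly, using only the standard library and toy data (the real file is assumed to be a CSV with `protein_name` and `binds` columns):

```python
import csv
import io
from collections import Counter


def count_labels(lines):
    """Count rows per (protein_name, binds) pair from a CSV stream."""
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[(row["protein_name"], int(row["binds"]))] += 1
    return counts


# Toy data, not the real counts.
demo = io.StringIO(
    "protein_name,binds\n"
    "HSA,0\nHSA,1\nBRD4,0\nsEH,0\nsEH,1\nsEH,1\n"
)
counts = count_labels(demo)
for (protein, binds), n in sorted(counts.items()):
    print(protein, binds, n)
```

Because each key carries its own protein name, no row can end up looking unlabeled. In pandas, calling `.reset_index()` on the result of `value_counts` should have a similar effect on how it prints.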

hard birch
#

"Can we just assume the data with no label (724532 entries) refers to sEH?" Yes, looks like that's the case.

eternal canopy
#

Hey, I'm getting 0.71 accuracy using a graph convolution method. Will LSTM-based models achieve higher accuracy?

eternal canopy
#

Is anybody able to use the full training data?

glossy pumice
#

Does anyone find that the data contains SMILES with unclosed parentheses?

hard birch
#

RDKit didn't have any problems reading them in my case

glossy pumice
#

it was due to google colab disconnecting
the issue was solved by mounting drive
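
A truncated download or a dropped session can cut a SMILES string off mid-line, which shows up as unclosed brackets. A minimal sketch of a cheap check for that, assuming nothing beyond the SMILES string itself. It only tests bracket balance, so passing it does not mean the SMILES is chemically valid; parsing with RDKit is the real test.

```python
def has_balanced_brackets(smiles):
    """Return True if every ( and [ in the string is closed in the right order."""
    pairs = {")": "(", "]": "["}
    stack = []
    for ch in smiles:
        if ch in "([":
            stack.append(ch)
        elif ch in pairs:
            # A closer with no opener, or the wrong opener, means a broken string.
            if not stack or stack.pop() != pairs[ch]:
                return False
    # Anything left on the stack was opened but never closed.
    return not stack


print(has_balanced_brackets("Nc1nc(N)nc(N)n1"))
print(has_balanced_brackets("Nc1nc(N)nc(N"))
```

Running this over the file and counting failures is a quick way to tell a damaged copy of the data from a real problem in the dataset.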

glossy pumice
#

Something I observed from a notebook:
This one guy tokenized the smiles and passed them into a 1d CNN
Does that work similarly to or as well as ECFPs?
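
For context, "tokenizing the SMILES" for a 1D CNN usually means splitting the string into atom and bond symbols and mapping each to an integer id. A minimal sketch of that idea, not the code from the notebook mentioned above: the regular expression, the padding id 0, and the `max_len` value are all assumptions.

```python
import re

# Bracket atoms such as [Dy] or [C@@H] stay whole; Cl and Br are matched before
# single characters so they are not split into C + l or B + r.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize(smiles):
    """Split a SMILES string into a list of tokens."""
    return TOKEN_RE.findall(smiles)


def encode(smiles, vocab, max_len):
    """Map tokens to integer ids, truncating or right-padding with 0 to max_len.

    New tokens are added to vocab as they appear; id 0 is reserved for padding.
    """
    ids = [vocab.setdefault(tok, len(vocab) + 1) for tok in tokenize(smiles)]
    ids = ids[:max_len]
    return ids + [0] * (max_len - len(ids))


vocab = {}
print(tokenize("Clc1ccccc1[Dy]"))
print(encode("CCO", vocab, max_len=5))
```

The resulting fixed-length id sequences are what an embedding layer followed by 1D convolutions would consume. Note that `[Dy]` stays a single token, so the DNA linker position remains visible to the model.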

copper rivet
#

what are the latest updates on this competition, how's the current situation? what models or methods people use?

glossy pumice
#

i have a hard time comprehending why CNNs work so much better than ECFPs when they kind of do the same thing
Maybe people haven't tried giving the ECFPs enough radius, or CNNs capture more information? Maybe CNNs are also better at capturing different structures being close to each other?

does anyone have any insight on why CNNs are so much better than ECFPs?

hard birch
#

I've just checked the principles behind ECFPs and don't have much idea about the particular implementation of those CNNs. But from what I've read, ECFPs are very large and sparse representations. What's more, if they are folded to a fixed length, a certain amount of information is usually lost to bit collisions (two or more different substructural features can be represented by the same bit position).

#

If CNNs don't have those limitations, that could be the key.
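
The bit collision effect is easy to see in a toy example. This is a minimal sketch, not a real fingerprint: it assumes substructures have already been hashed to integer ids and that folding is done by taking the id modulo the vector length. The ids below are made up so that three of them collide at 1024 bits.

```python
def fold(feature_ids, n_bits):
    """Fold arbitrary integer feature ids into a fixed-length bit vector."""
    bits = [0] * n_bits
    for fid in feature_ids:
        bits[fid % n_bits] = 1
    return bits


def count_collisions(feature_ids, n_bits):
    """Number of distinct features lost because they share a bit with another one."""
    distinct = set(feature_ids)
    return len(distinct) - sum(fold(distinct, n_bits))


# Hypothetical substructure ids: 3, 1027 and 2051 all map to bit 3 at 1024 bits.
features = [3, 1027, 2051, 7]
print(count_collisions(features, 1024))
print(count_collisions(features, 4096))
```

A longer vector reduces collisions but makes the representation even sparser, which is the trade-off being discussed here.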

glossy pumice
#

I'm pretty sure the research paper said that information loss was there, but it was not so much that it would render ECFPs much less useful

#

can someone help me understand why sparse is bad?

glossy pumice
#

Also, I've noticed that the original author only used ECFPs with a radius of 2, which means they won't be able to capture much information, so maybe increasing that would yield better results

hard birch
#

Personally, I've been working on graphs so far with poor results. I think I'll give SMILES a chance next.

thorn tide
#

Are building blocks important pieces of data to utilize? From my limited domain knowledge, it seems that the full SMILES string contains all the information from the building blocks.

I've been turning the full SMILES into graphs and running it in my GCN. My team is considering doing docking simulations to look at where the binding happens. We think that is a more "intuitive" approach than utilizing building blocks. Can someone more knowledgeable than I explain the purpose of building blocks?

#

I understand that building blocks can help us find correlations since we have so many that are "similar" (2/3 building blocks are identical) but training it like this won't enable us to generalize for unseen molecule-protein interactions.

#

Am I wrong for thinking that is the case?
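
One way to test this worry empirically is to hold out some building blocks entirely, so that every validation molecule contains at least one block the model never saw in training. A minimal sketch of such a split. The column names `bb1`, `bb2`, `bb3` and the block ids are hypothetical placeholders, not the dataset's actual column names.

```python
def split_by_building_block(rows, held_out):
    """Split rows so that no training molecule uses a held-out building block."""
    train, valid = [], []
    for row in rows:
        blocks = {row["bb1"], row["bb2"], row["bb3"]}
        if blocks & held_out:
            valid.append(row)
        else:
            train.append(row)
    return train, valid


# Toy rows with made-up block ids.
rows = [
    {"bb1": "A1", "bb2": "B1", "bb3": "C1", "binds": 0},
    {"bb1": "A1", "bb2": "B2", "bb3": "C1", "binds": 1},
    {"bb1": "A2", "bb2": "B1", "bb3": "C2", "binds": 0},
]
train, valid = split_by_building_block(rows, held_out={"B2", "C2"})
print(len(train), len(valid))
```

If the score on such a split is much worse than on a random split, the model is probably leaning on memorized building blocks rather than learning something that transfers.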

rotund dagger
#

I have built my model, saved it, and uploaded it as a dataset so that I can perform inference in the inference notebook. In the inference notebook, I generate the submission.csv file, which is in the kaggle/working directory.
When I Quick Save to submit, I get a "submission file not found" error.
Where could I be going wrong?

#

Ooohh.. This one seems to be a little easy. I have submitted only the submission file and it got scored and I am ranked🤭

potent canopy
#

Hey everyone! This may be a silly question, but just to clarify, is the objective to build a binary classifier or a binding affinity prediction model that outputs the predicted binding affinities of a molecule towards a target protein? I am still kind of confused about this because the overview says to build an ML model to predict binding affinities, but one of the tags is binary classification.

toxic plume
potent canopy
#

Oh ok, just wanted to make sure. Thank you for helping.

fossil trail
#

@rotund gazelle We participated in this competition

#

It seems that Kaggle selected the wrong top files

#

Bronze medal is supposedly at 0.250

Our file has a score of 0.251

#

@static lantern

#

Yet we didn't get the LB position

#

And we didn't select any files to submit, as it said it would consider the top 2 files