#leash-belka | Kaggle | Page 1

quiet vessel Apr 5, 2024, 5:27 AM

#

love drug contest

grim loom Apr 5, 2024, 7:01 AM

#

lol

#

i just entered it

dire token Apr 5, 2024, 9:14 PM

#

this is gonna be a dumb question, but I am struggling to open the training set for this comp. There's just a red circle at the bottom left on Colab, and when I try to open it in Excel it's just grey. This isn't an issue with any other dataset I've tried to open. Does anyone know why, or how to fix it?

paper laurel Apr 5, 2024, 9:50 PM

#

dire token this is gonna be a dumb question, but I am struggling to open the training set fo...

It's probably a lot bigger than Excel can handle. You might need to figure out how to break it into smaller pieces if you want to open it in Excel. You can also use a tool like duckdb or polars to open it in Python.
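
A minimal sketch of the "break it into smaller pieces" idea, using only the Python standard library rather than duckdb or polars. The file name `train_sample.csv`, the `id` column, and the helper `iter_chunks` are hypothetical and only there so the example runs on its own; the same loop would be pointed at the real training file.

```python
import csv
import itertools
import tempfile
from pathlib import Path


def iter_chunks(path, chunk_size):
    """Yield lists of row dicts, at most chunk_size rows each, without loading the whole file."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        while True:
            chunk = list(itertools.islice(reader, chunk_size))
            if not chunk:
                return
            yield chunk


# Tiny stand-in file (hypothetical) so the sketch runs anywhere; the real train set is far larger.
sample_path = Path(tempfile.mkdtemp()) / "train_sample.csv"
proteins = ["BRD4", "HSA", "sEH"]
with open(sample_path, "w", newline="") as handle:
    writer = csv.writer(handle)
    writer.writerow(["id", "protein_name", "binds"])
    for i in range(10):
        writer.writerow([i, proteins[i % 3], int(i % 5 == 0)])

# Only one chunk is held in memory at a time.
chunk_sizes = [len(chunk) for chunk in iter_chunks(sample_path, chunk_size=4)]
print(chunk_sizes)
```

Each chunk could be written out to its own smaller file, or processed and discarded, so memory use stays bounded by `chunk_size`.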

dire token Apr 5, 2024, 10:23 PM

#

paper laurel It's probably a lot bigger than Excel can handle, you might need to figure out ho...

Would that apply to Colab as well? When I tried opening it on Colab I just got a red circle. However, when I look at the disk/RAM statistics it doesn't seem to be overloading Colab.
I hope that makes sense.

rapid yarrow Apr 11, 2024, 7:32 PM

#

Yes unfortunately @dire token
The dataset is far too huge
Try and use duckdb

dire token Apr 11, 2024, 9:01 PM

#

Will try it out. Thanks

sacred moss Apr 12, 2024, 4:33 PM

#

All the molecule SMILES in train have a central structure of Nc1nc(N)nc(N)n1 (wrote this SMILES from memory, not 100% sure) with three branches. That's the general structure, but in test this is not the case?

#

I think there is a structural inconsistency between test and train

rose rivet Apr 12, 2024, 7:49 PM

#

Yes, that will be one of the central challenges of this competition

#

Generalizing from the triazine you described to other structures

proper venture Apr 17, 2024, 2:35 AM

#

Hey! Need to confirm one thing: in the binds column of the train dataset, are the 0 values observed true non-binders? Or are only the 1s truly observed from experimentation, while a 0 can mean either no binding or an unknown result?

hard birch Apr 17, 2024, 10:08 AM

#

+binds - The target column. A binary class label of whether the molecule binds to the protein. Not available for the test set.

rose rivet Apr 17, 2024, 11:43 AM

#

@proper venture They are assumed to be non-binders. In DNA-encoded libraries, we start with the full set of molecules and after several rounds of screening, you can see which ones bind. Therefore you can assume that the remaining ones did not bind

proper venture Apr 17, 2024, 12:24 PM

#

Thanks for the clarification 🙂

icy pendant Apr 17, 2024, 10:17 PM

#

Hello everyone, is anyone interested in being part of a team for this project?

floral sonnet Apr 18, 2024, 12:57 AM

#

Hello everyone, I am interested in working on the Leash-Bio BELKA challenge and am looking for a team to join. Please do let me know if anyone is looking for a team member! Thank you

red hamlet Apr 18, 2024, 2:53 AM

#

floral sonnet Hello everyone , I am interested to work on Leash-Bio -BELKA challenge and looki...

I will DM you

floral sonnet Apr 18, 2024, 6:58 PM

#

Sure , thanks @red hamlet

grim loom Apr 19, 2024, 5:02 AM

#

icy pendant Hello everyone, anyone interested to be part of a team to this project?

umm we have a team already, we can merge if it's suitable

stuck salmon Apr 20, 2024, 1:36 PM

#

Hello everyone 👋, I am a Master's student in bioinformatics, and I am interested in working on the Leash-Bio BELKA challenge. I am currently looking for a team to join. Please let me know if anyone is looking for a team member! Thank you.

flat oasis Apr 20, 2024, 9:16 PM

#

stuck salmon Hello everyone👋,I am a Master's student in bioinformatics, and I am interested ...

Can you understand the dataset well? If so probably we can.
A PhD AI student this side

#

I have yet to participate in that challenge; I haven't joined so far because I don't know enough about the dataset

stuck salmon Apr 20, 2024, 9:56 PM

#

flat oasis Can you understand the dataset well? If so probably we can. A PhD AI student thi...

No, I am still trying to understand it because my background is in business intelligence, and it's my first year in Master's

flat oasis Apr 20, 2024, 10:45 PM

#

Oh OK, sure. Let me know once you understand the data

sacred moss Apr 21, 2024, 11:16 AM

#

Hey there, soon-to-be B.Sc Bioinformatics graduate here whose thesis was in using Molecular Docking. I know how to use AutoDock (with PyRx) and Hex. Do I have a chance? I am good with Deep Learning in PyTorch as well. I am assuming this competition is about predicting whether a ligand has high binding affinity with the 3 protein targets?

boreal lake Apr 22, 2024, 2:03 AM

#

Hey everyone, I am a data engineer at a cancer research institute and I have my master's in data science. I have experience with RDKit and I'm looking for a team. Please let me know if you want to team up.

grim loom Apr 22, 2024, 5:20 AM

#

boreal lake Hey Everyone, I am a data engineer at a cancer research institute and I have my ...

sure join up

#

dm me your kaggle id

sweet pilot Apr 22, 2024, 5:21 PM

#

is the dysprosium the placeholder for where the DNA itself starts, or is it the placeholder for the atom that is then attached to the DNA? it's the start of the DNA itself, right?

elder stone Apr 22, 2024, 7:16 PM

#

sweet pilot is the dysprosium the placeholder for where the DNA itself starts, or is it the ...

on the website it says under the data section "Note we use a [Dy] as the stand-in for the DNA linker." I believe the linker is a synthetically made molecule that connect to the protein which would then go on and attach to the DNA, but bio isn't my strong suit so I'm not 100% sure that is entirely the correct explanation. I have a cousin in bio meche and cheme so I can see if I can get a more detailed answer if needed.

sweet pilot Apr 22, 2024, 7:23 PM

#

I think it would be the 2nd option in this pic since that's what it sounds like, but I figured it's worth asking to be sure 👍

wanton basalt Apr 22, 2024, 7:25 PM

#

Hey guys, I'm seeking a mentor (teammate) for Kaggle competitions and Leash Bio in particular, reposting from here 😄 : #👥┊looking-for-a-team message

elder stone Apr 22, 2024, 7:27 PM

#

yea it should be the second option, and it's just standing in for some molecule that's not really of concern

hard birch Apr 23, 2024, 5:56 AM

#

sweet pilot I think it would be the 2nd option in this pic since that's what it sounds like,...

Both pictures are the same. In the second one you are generating a CH2 that either is part of the molecule or has been left out of the final SMILES (which makes no sense).

sweet pilot Apr 23, 2024, 5:59 AM

#

hard birch Both pictures are the same. In the second one you are generating a CH2 that is part ...

correct, I was asking whether we can remove the dysprosium outright or if we have to replace it with something like a carbon (which I agree would be really silly)

just wanted to be absolutely sure haha

hard birch Apr 23, 2024, 6:02 AM

#

I'm a chemist, though long out of practice. I would just leave it, or if you really need to replace it, I would use the default fill value of organic chemistry: H.

#

But that would be a dramatic change to the overall chemical interactions. I would definitely leave the Dy, or use a token that means "link".

#

The thing is, as I understand DEL, the DNA identifies each sample and is meant not to interact with it. But it's still a huge molecule that has been there throughout the whole process.

eternal canopy Apr 27, 2024, 11:28 AM

#

flat oasis Can you understand the dataset well? If so probably we can. A PhD AI student thi...

Bro, what are you saying? Where are you from?

flat oasis Apr 27, 2024, 11:38 AM

#

Mumbai

eternal canopy Apr 27, 2024, 12:03 PM

#

flat oasis Mumbai

I'm from Kharghar 😉

flat oasis Apr 27, 2024, 12:18 PM

#

Cool

cerulean kite Apr 29, 2024, 12:44 AM

#

Hi everyone, I have a couple of questions about the data. Has anyone been able to find specific info about the experimental conditions used, e.g. pH and solvent composition?

#

And is anyone looking for a teammate? I finished my undergrad in molecular biology and bioinformatics several weeks ago and am very interested in modeling biological processes

grim loom May 7, 2024, 7:01 AM

#

cerulean kite And is anyone looking for a teammate? I finished my undergrad in molecular biol...

cool, you can join in 😄 , We're Fallen Angels.

#

we got a paramedic with us

waxen nimbus May 10, 2024, 6:15 AM

#

@dire token I am facing the same problem of opening the 50 GB xl file given.

waxen nimbus May 14, 2024, 5:11 AM

#

Hello all, I am Rajiv from Mumbai. I am not able to open the xl input as it is very large. Can you help? Connect on rajivkjs@gmail.com

hard birch May 14, 2024, 8:25 AM

#

Hi, from GREYSNOW on code

#

https://www.kaggle.com/code/shlomoron/belka-shrunken-train-set-loading

BELKA: Shrunken train set loading

#

A more compact version of the same dataset

trail hill May 14, 2024, 3:52 PM

#

Hi everyone, I'm looking at the data here, and it seems that for one target there are no binders. Can anyone confirm that?

> train.groupby(by='protein_name',sort=False).value_counts(subset=['binds'])
protein_name  binds
HSA           0        98007200
BRD4          0        97958646
sEH           0        97691078
              1          724532
BRD4          1          456964
HSA           1          408410
Name: count, dtype: int64

There are no class-1 (+bind) rows for sEH in the training set. However, the test set is evenly distributed:

> test.value_counts(subset=['protein_name'])
protein_name
BRD4            558859
sEH             558142
HSA             557895
Name: count, dtype: int64

Can anyone confirm that? Can we just assume the data with no label (724532 entries) refers to sEH?

hard birch May 15, 2024, 3:10 PM

#

I think you need to check something in your code:

#

Actually, these are exactly the same 724532 samples that appear without a label in your groupby output
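
Some context on the blank label: when pandas prints a MultiIndex, it leaves repeated outer labels blank by default, so the `1  724532` row directly under `sEH  0` still belongs to sEH. One way to avoid the display ambiguity is to count explicit (protein_name, binds) pairs. A minimal sketch with the standard library, where the `rows` list is a hypothetical handful of records standing in for the training set:

```python
from collections import Counter

# Hypothetical (protein_name, binds) records standing in for the real train set.
rows = [
    ("HSA", 0), ("HSA", 1),
    ("BRD4", 0), ("BRD4", 1),
    ("sEH", 0), ("sEH", 1), ("sEH", 1),
]

# Every row carries its own protein name, so no label can be hidden by display formatting.
pair_counts = Counter(rows)
for (protein, binds), count in sorted(pair_counts.items()):
    print(protein, binds, count)

# Number of binders (binds == 1) per protein.
binders_per_protein = {
    protein: pair_counts[(protein, 1)] for protein in ("BRD4", "HSA", "sEH")
}
print(binders_per_protein)
```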

hard birch May 15, 2024, 4:17 PM

#

"Can we just assume the data with no label (724532 entries) refers to sEH?" Yes, looks like that's the case.

trail hill May 16, 2024, 2:56 PM

#

hard birch Actually, there are exactly the same 724532 samples that appear without label in...

Thanks!

eternal canopy May 19, 2024, 7:45 PM

#

Hey, I'm getting an accuracy of 0.71 using a graph convolution method. Will LSTM-based models achieve higher accuracy?

eternal canopy May 24, 2024, 6:55 AM

#

Is anybody able to use the full training data?

glossy pumice Jun 3, 2024, 7:29 AM

#

Has anyone found that the data contains SMILES with unclosed parentheses?

hard birch Jun 3, 2024, 12:06 PM

#

RDKit didn't have any problem reading them in my case

glossy pumice Jun 3, 2024, 4:18 PM

#

It was due to Google Colab disconnecting.
The issue was solved by mounting Drive.
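
A quick way to spot SMILES strings cut off by an interrupted download or a dropped session is a bracket-balance check. This is only a rough sketch: the function name `has_balanced_brackets` is made up for illustration, and a balanced string is not necessarily a valid SMILES (a parser such as RDKit is still the real test).

```python
def has_balanced_brackets(smiles):
    """Return True if every '(' and '[' in the string is closed in the right order."""
    closing_to_opening = {")": "(", "]": "["}
    stack = []
    for char in smiles:
        if char in "([":
            stack.append(char)
        elif char in closing_to_opening:
            # A closer with no matching opener, or the wrong opener, means the string is damaged.
            if not stack or stack.pop() != closing_to_opening[char]:
                return False
    # Anything left on the stack was opened but never closed (e.g. a truncated line).
    return not stack


print(has_balanced_brackets("Nc1nc(N)nc(N)n1"))  # complete string
print(has_balanced_brackets("Nc1nc(N)nc(N"))     # cut off mid-branch
```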

glossy pumice Jun 5, 2024, 8:07 AM

#

Something I observed from a notebook:
This one guy tokenized the SMILES and passed them into a 1D CNN.
Does that work similarly to, or as well as, ECFPs?
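
For readers unfamiliar with the idea, here is a rough sketch of what "tokenizing the SMILES" could look like before the integer sequence goes into an embedding layer and a 1D CNN. It is not the notebook author's code: the regex, the helper names, and the padding id 0 are all assumptions for illustration, and the pattern only covers common SMILES symbols.

```python
import re

# Bracket atoms such as [Dy] stay one token; Br and Cl are matched before single letters.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+()/\\.@:~*$]")


def tokenize(smiles):
    """Split a SMILES string into tokens, refusing strings with unrecognized characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


def build_vocab(smiles_list):
    """Map each distinct token to an integer id, reserving 0 for padding."""
    distinct = sorted({tok for smi in smiles_list for tok in tokenize(smi)})
    return {tok: idx for idx, tok in enumerate(distinct, start=1)}


def encode(smiles, vocab, max_len):
    """Turn a SMILES string into a fixed-length list of token ids, padded with 0."""
    ids = [vocab[tok] for tok in tokenize(smiles)][:max_len]
    return ids + [0] * (max_len - len(ids))


examples = ["Nc1nc(N)nc(N)n1", "C[Dy]", "CCl"]
vocab = build_vocab(examples)
print(tokenize("C[Dy]"))
print(encode("Nc1nc(N)nc(N)n1", vocab, max_len=20))
```

Unlike a folded fingerprint, this representation keeps token order, which a convolution over the sequence can exploit.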

copper rivet Jun 5, 2024, 2:27 PM

#

What are the latest updates on this competition, and how's the current situation? What models or methods are people using?

glossy pumice Jun 9, 2024, 11:37 AM

#

I have a hard time comprehending why CNNs work so much better than ECFPs when they kind of do the same thing.
Maybe people haven't tried giving the ECFPs a large enough radius, or CNNs capture more information? Maybe CNNs also better capture different structures being close to each other?

does anyone have any insight on why CNNs are so much better than ECFPs?

hard birch Jun 9, 2024, 12:25 PM

#

I've just checked the principles behind ECFPs and don't know much about the particular implementation of those CNNs. But from what I've read, ECFPs are very large and sparse representations. What's more, if they are folded to a fixed length, some information is usually lost to bit collisions (two or more different substructural features can be represented by the same bit position).

#

If CNNs don't have those limitations, that could be the key.
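
To make the bit-collision point concrete, here is a toy sketch of folding. Real ECFP implementations derive integer identifiers by hashing atom environments; the identifiers below are made-up numbers, and `fold_features` is a hypothetical helper that folds by taking each identifier modulo the fingerprint length.

```python
def fold_features(feature_ids, n_bits):
    """Fold integer feature identifiers into n_bits positions.

    Returns the set of on-bits and how many distinct features were lost to collisions.
    """
    unique_ids = set(feature_ids)
    on_bits = {feature_id % n_bits for feature_id in unique_ids}
    return on_bits, len(unique_ids) - len(on_bits)


# Hypothetical substructure identifiers for one molecule.
feature_ids = [1, 1025, 7, 2055]

for n_bits in (1024, 4096):
    on_bits, lost = fold_features(feature_ids, n_bits)
    print(n_bits, sorted(on_bits), "features lost:", lost)
```

With 1024 bits, identifiers 1 and 1025 share bit 1, and 7 and 2055 share bit 7, so the model can no longer tell those features apart; with 4096 bits all four stay distinct.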

glossy pumice Jun 9, 2024, 7:47 PM

#

I'm pretty sure the research paper said that there was some information loss, but not so much that it would render ECFPs much less useful

#

can someone help me understand why sparse is bad?

glossy pumice Jun 9, 2024, 7:55 PM

#

hard birch I've just checked ECFPs principles and not much idea on the particular implement...

in this context

#

Also, I've noticed that the original author only used ECFPs with a radius of 2, which means they won't be able to capture much information, so maybe increasing that would yield better results

hard birch Jun 10, 2024, 11:49 AM

#

Personally, I've been working on graphs so far, with poor results. I think I'll give SMILES a chance next.

thorn tide Jun 11, 2024, 7:03 AM

#

Are building blocks important pieces of data to utilize? From my limited domain knowledge, it seems that the full SMILES string contains all the information from the building blocks.

I've been turning the full SMILES into graphs and running it in my GCN. My team is considering doing docking simulations to look at where the binding happens. We think that is a more "intuitive" approach than utilizing building blocks. Can someone more knowledgeable than I explain the purpose of building blocks?

#

I understand that building blocks can help us find correlations since we have so many that are "similar" (2/3 building blocks are identical) but training it like this won't enable us to generalize for unseen molecule-protein interactions.

#

Am I wrong for thinking that is the case?
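
One way to probe this concern empirically is to hold out every molecule that contains certain building blocks, so validation only sees blocks the model never trained on. The sketch below is an assumption-heavy illustration: the record layout, the block ids such as `A1`, and the helper `split_by_building_block` are all hypothetical.

```python
def split_by_building_block(molecules, held_out_blocks):
    """Send any molecule that uses a held-out building block to the validation side."""
    train, held_out = [], []
    for molecule in molecules:
        if set(molecule["blocks"]) & held_out_blocks:
            held_out.append(molecule)
        else:
            train.append(molecule)
    return train, held_out


# Hypothetical molecules, each described by three building-block ids.
molecules = [
    {"id": 0, "blocks": ("A1", "B1", "C1")},
    {"id": 1, "blocks": ("A1", "B1", "C2")},  # shares 2/3 blocks with molecule 0
    {"id": 2, "blocks": ("A2", "B1", "C1")},
    {"id": 3, "blocks": ("A1", "B2", "C3")},
]

train, held_out = split_by_building_block(molecules, held_out_blocks={"C2", "A2"})
print([m["id"] for m in train], [m["id"] for m in held_out])
```

If a model scores well on a random split but poorly on a split like this, that suggests it is leaning on shared building blocks rather than generalizing.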

rotund dagger Jun 18, 2024, 2:25 PM

#

I have built my model, saved it, and uploaded it as a dataset so that I can perform inference in the inference notebook. In the inference notebook, I generate the submission.csv file, which is in the kaggle/working directory.
When I Quick Save to submit, I get a "submission file not found" error.
Where could I be going wrong?

#

Ooohh.. This one seems to be a little easy. I have submitted only the submission file and it got scored and I am ranked🤭

potent canopy Jun 28, 2024, 10:04 PM

#

Hey everyone! This may be a silly question, but just to clarify: is the objective to build a binary classifier, or a binding affinity prediction model that outputs the predicted binding affinities of a molecule towards a target protein? I am still kind of confused about this because the overview says to build an ML model to predict binding affinities, but one of the tags is binary classification.

toxic plume Jun 28, 2024, 10:12 PM

#

potent canopy Hey everyone! This may be a silly question, but just to clarify, is the objectiv...

Your submission should contain the binding probability. The competition metric is "average precision", if you understand the metric it will be clearer.
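
For anyone unfamiliar with the metric, here is a minimal sketch of average precision for a single set of predictions. It is a simplified illustration, not the competition's scoring code: ties in the scores are simply broken by input order, and how the competition averages across proteins is not covered here.

```python
def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each true binder."""
    # Rank predictions from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision among the top `rank` predictions
    return precision_sum / hits if hits else 0.0


labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]  # predicted binding probabilities
print(average_precision(labels, scores))
```

Only the ordering of the probabilities matters, which is why a well-ranked probability output is what gets rewarded rather than a hard 0/1 label.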

potent canopy Jun 28, 2024, 10:15 PM

#

Oh ok, just wanted to make sure. Thank you for helping.

fossil trail Jul 9, 2024, 2:22 AM

#

@rotund gazelle We participated in this competition

#

It seems that Kaggle selected the wrong top files

#

Bronze medal is supposedly at 0.250

Our file has a score of 0.251

#

@static lantern

#

Yet we didn't get the LB position

#

And we didn't select any files to submit, since it said it would consider the top 2 files